Gaussian Mixture Model Uncertainty Learning (GMMUL)
Version 1.0
User Guide
Alexey Ozerov¹, Mathieu Lagrange² and Emmanuel Vincent¹

¹ INRIA, Centre de Rennes - Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes cedex, France
² IRCAM, STMS, UPMC, CNRS - UMR 9912, 1 place Igor Stravinsky, 75004 Paris

{alexey.ozerov, emmanuel.vincent}@inria.fr, [email protected]

January 10, 2012
1 Introduction
This user guide describes a method for learning Gaussian Mixture Models (GMMs) while acknowledging the fact that the data can be uncertain. The level of uncertainty can be known or estimated. In order to fully grasp the theoretical aspects of the method, as well as the practical considerations that motivated it, the reader is strongly encouraged to read the papers [1, 2], which are referred to throughout the remainder of this document.
This guide is organized as follows. Section 2 describes the two core functions that allow the user to learn GMMs from uncertain data. Those functions are then presented within the two evaluation frameworks considered in [1] and [2], respectively. Section 3 describes the first one, which focuses on synthetic data, i.e. observed data and uncertainty generated by sampling from GMMs. Section 4 describes the second one, which considers noisy speech data where the uncertainty is either known a priori or estimated.
2 Processing Functions

2.1 GMM_EM_uncertainty_learning
The core function is GMM_EM_uncertainty_learning, which implements a new Expectation-Maximization (EM) algorithm for learning GMMs from uncertain data. It can be used both for training GMMs and for decoding with GMMs, in both cases with handling of uncertainty.
function [uXe, cXe, wXe, log_like_N, l] = ...
    GMM_EM_uncertainty_learning(y, cE, uX, cX, wX, nbIterations, print_log_flag);
%
% Expectation-maximization (EM) algorithm for training Gaussian mixture
% models (GMMs) from noisy observations with Gaussian uncertainty
%
% input
% -----
% y              : observations (N vectors of length M), (M x N) matrix
% cE             : (opt) Gaussian uncertainty, full or diagonal covariance
%                  matrices that can be:
%                    - (M x M x N) tensor in case of full covariances
%                    - (M x N) matrix in case of diagonal covariances
%                    - empty (= []) (equivalent to conventional training)
%                  (def = [])
% uX             : initial GMM Gaussian means, (M x K) matrix
% cX             : initial GMM Gaussian full or diagonal covariances
%                  that can be:
%                    - (M x M x K) tensor in case of full covariances
%                    - (M x K) matrix in case of diagonal covariances
% wX             : initial GMM Gaussian weights, K-length vector
% nbIterations   : (opt) number of EM algorithm iterations (def = 30)
% print_log_flag : (opt) printing log flag (def = 1)
%
% where
%   K : number of Gaussians in GMM
%   M : dimensionality of feature vector
%   N : number of observations
%
% output
% ------
% uXe        : estimated GMM Gaussian means, (M x K) matrix
% cXe        : estimated GMM Gaussian full or diagonal covariances
%              that can be:
%                - (M x M x K) tensor in case of full covariances
%                - (M x K) matrix in case of diagonal covariances
%              NOTE: cXe has the same dimensionality as its
%              initialization cX
% wXe        : estimated GMM Gaussian weights, K-length vector
% log_like_N : N-length array of log-likelihoods for each observation
% l          : nbIterations-length array of global log-likelihoods over
%              EM algorithm iterations
Figure 1: Header of the main processing function
For decoding with GMMs, one simply needs to provide the data, optionally with a Gaussian uncertainty, and the parameters of the model. In this case, the number of iterations can be set to 1, and the log-likelihood computed at the Expectation step is used.
For training GMMs, one needs to provide the data with a Gaussian uncertainty, a first estimate of the parameters of the model, and a number of iterations sufficient to reach convergence. In all the experiments reported in [1] and [2], those first estimates are obtained with the function VQ discussed below.
In both cases, if the uncertainty is not provided, the algorithm reduces to a standard EM algorithm for training GMMs [3]. A detailed description of the input and output parameters is given in Figure 1.
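To make the algorithm concrete, the following is a minimal Python/NumPy sketch of an EM recursion of this kind for the diagonal-covariance case. It is an illustration of the idea, not the toolbox code, and all function and variable names are ours. The E-step evaluates responsibilities under the noise-inflated model y | k ~ N(u_k, c_k + cE_n) together with the Wiener-like posterior of the clean features; the M-step re-estimates the GMM from these posteriors.

```python
import numpy as np

def gmm_em_uncertainty(y, cE, uX, cX, wX, nb_iterations=30):
    """Sketch of EM for a diagonal-covariance GMM from noisy observations.

    y  : (M, N) observations, modeled as y = x + e with e ~ N(0, diag(cE[:, n]))
    cE : (M, N) per-frame diagonal uncertainty, or None (conventional EM)
    uX, cX : (M, K) initial Gaussian means / diagonal variances
    wX : (K,) initial weights
    """
    M, N = y.shape
    u, c, w = uX.astype(float), cX.astype(float), wX.astype(float)
    if cE is None:
        cE = np.zeros((M, N))
    log_like = []
    for _ in range(nb_iterations):
        # E-step: responsibilities under y | k ~ N(u_k, c_k + cE_n)
        s = c[:, None, :] + cE[:, :, None]             # (M, N, K) total variances
        d = y[:, :, None] - u[:, None, :]              # (M, N, K) residuals
        log_p = (-0.5 * (np.log(2 * np.pi * s) + d ** 2 / s).sum(axis=0)
                 + np.log(w)[None, :])                 # (N, K)
        log_norm = np.logaddexp.reduce(log_p, axis=1)  # (N,) per-frame log-lik.
        gamma = np.exp(log_p - log_norm[:, None])      # (N, K) responsibilities
        log_like.append(log_norm.sum())
        # Posterior of the clean features given y and component k (Wiener gain)
        gain = c[:, None, :] / s                       # (M, N, K)
        m = u[:, None, :] + gain * d                   # posterior means of x
        R = (1.0 - gain) * c[:, None, :]               # posterior variances of x
        # M-step: re-estimate means, variances and weights from the posteriors
        g = gamma.sum(axis=0)                          # (K,) soft counts
        u = (gamma[None, :, :] * m).sum(axis=1) / g
        c = (gamma[None, :, :] * (R + (m - u[:, None, :]) ** 2)).sum(axis=1) / g
        w = g / N
    return u, c, w, np.array(log_like)
```

With cE = None the recursion reduces to conventional EM, and calling it with nb_iterations = 1 and reading the per-frame log-likelihoods corresponds to the decoding use described above.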
2.2 GMM initialization

The function VQ implements a hierarchical clustering algorithm that provides a first estimate of the GMM parameters. It does not consider uncertainty. Therefore, the only input parameters are the actual data
function [means, covars, index] = VQ(x, niveau_max, diag_covs_flag);
%
% A hierarchical clustering (vector quantization (VQ) or K-means) algorithm
%
% input
% -----
% x              : observations (T vectors of length n), (n x T) matrix
% niveau_max     : desired number of splits in VQ
% diag_covs_flag : (opt) estimate diagonal covariances flag (def = 1)
%
% output
% ------
% means  : (n x Q) matrix of cluster means
% covars : cluster full or diagonal covariances that can be:
%            - (n x n x Q) tensor in case of full covariances
%              (diag_covs_flag = 0)
%            - (n x Q) matrix in case of diagonal covariances
%              (diag_covs_flag = 1)
% index  : (2^niveau_max + 1)*1-length vector of splits
%
% where: Q (<= 2^niveau_max) is the resulting number of clusters
Figure 2: VQ function
and the number of binary splits and the type of covariance matrix to be estimated (full or diagonal). After a hierarchical split of the data operated along the axes of maximal variance, the function outputs a first estimate of the GMM parameters; see Figure 2 for further details.
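As an illustration of this splitting scheme (a hypothetical Python/NumPy re-implementation of the idea, not the toolbox's VQ function; all names are ours), each cluster can be repeatedly cut in two at the mean of its axis of maximal variance:

```python
import numpy as np

def vq_split(x, niveau_max, diag_covs_flag=True):
    """Hierarchical binary splitting of x, an (n, T) matrix of T observations.

    Returns cluster means (n, Q), covariances (diagonal (n, Q) or full
    (n, n, Q)), and per-observation cluster labels, with Q <= 2^niveau_max.
    """
    n, T = x.shape
    labels = np.zeros(T, dtype=int)
    for _ in range(niveau_max):                       # one round of splits
        new_labels = labels.copy()
        for cl in np.unique(labels):
            idx = np.where(labels == cl)[0]
            if idx.size < 2:
                continue                              # nothing to split
            sub = x[:, idx]
            axis = np.argmax(sub.var(axis=1))         # axis of maximal variance
            right = sub[axis] > sub[axis].mean()      # cut at the mean
            new_labels[idx[right]] = new_labels.max() + 1
        labels = new_labels
    clusters = np.unique(labels)
    means = np.stack([x[:, labels == cl].mean(axis=1) for cl in clusters], axis=1)
    if diag_covs_flag:
        covars = np.stack([x[:, labels == cl].var(axis=1) for cl in clusters], axis=1)
    else:
        covars = np.stack([np.cov(x[:, labels == cl]) for cl in clusters], axis=2)
    return means, covars, labels
```

The resulting means, covariances, and (uniform) weights can then serve as the first estimate passed to the EM training function.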
3 Experiment on Synthetic Data
The code discussed here was used to generate the results reported in [1], with the additional case of conventional noisy training and uncertainty decoding.
3.1 Data
The data consists of synthetically generated features obtained by GMM sampling. The uncertainty is here sampled from a single Gaussian and assumed to be known in both the training and testing phases; see [1] for further details.
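A hypothetical sketch of such a generation procedure (the actual data generation is performed inside the example function; all names below are ours) samples clean features from a diagonal GMM and adds zero-mean Gaussian noise with known per-frame variances:

```python
import numpy as np

def sample_noisy_gmm(u, c, w, N, noise_scale=1.0, rng=None):
    """Draw N frames from a diagonal GMM (u, c: (M, K); w: (K,)), then add
    zero-mean Gaussian noise whose per-frame diagonal covariance cE is itself
    drawn at random, mimicking known heteroscedastic uncertainty."""
    rng = np.random.default_rng() if rng is None else rng
    M, K = u.shape
    k = rng.choice(K, size=N, p=w)                    # component of each frame
    x = u[:, k] + np.sqrt(c[:, k]) * rng.standard_normal((M, N))  # clean data
    cE = noise_scale * rng.uniform(0.1, 1.0, (M, N))  # known per-frame variances
    y = x + np.sqrt(cE) * rng.standard_normal((M, N))  # noisy observations
    return y, cE, x, k
```

The pair (y, cE) then plays the role of the observations and known uncertainty fed to the training and decoding functions.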
3.2 Code
The function EXAMPLE_1__synthetic_2D_data demonstrates the use of uncertainty learning for the purpose of classifying synthetic data. As can be seen in Figure 3, it mainly consists of training and testing phases for each class, using the proposed approach (uncertainty training on noisy data) and conventional ones (conventional training on clean or noisy data).
Figure 4 gives a sense of the results for a 2-dimensional case.
function EXAMPLE_1__synthetic_2D_data ()

% comments and generation of synthetic data
...

for train_mode = train_modes
    % display info
    ...

    % TRAIN
    % ---------------------------
    for cl = 1:nbClasses
        fprintf('Train model %d of %d models\n', cl, nbClasses);
        x_noisy_train_cl = squeeze(x_train_CUR(cl, :, :));
        cE_train_cl = squeeze(cE_train_CUR(cl, :, :));
        [uXe, cXe] = VQ(x_noisy_train_cl, log2(nbGaussians), 0);  % 0: initialize full covariances
        wXe = ones(1, nbGaussians);
        wXe = wXe / sum(wXe);
        [est_gmms{train_mode, cl}.u_gmm, est_gmms{train_mode, cl}.c_gmm, est_gmms{train_mode, cl}.w_gmm] = ...
            GMM_EM_uncertainty_learning(x_noisy_train_cl, cE_train_cl, uXe, cXe, wXe, [], 0);  % 0: no log
    end;

    % TEST
    % ---------------------------
    log_likelihoods = zeros(nbClasses, nbSequences_test, nbClasses);
    for cl = 1:nbClasses
        fprintf('Test for class %d of %d classes\n', cl, nbClasses);
        for seq = 1:nbSequences_test
            for cl_model = 1:nbClasses
                [dummy1, dummy2, dummy3, dummy4, log_likelihoods(cl, seq, cl_model)] = ...
                    GMM_EM_uncertainty_learning(squeeze(x_noisy_test(cl, seq, :, :)), squeeze(cE_test(cl, seq, :, :)), ...
                        est_gmms{train_mode, cl_model}.u_gmm, est_gmms{train_mode, cl_model}.c_gmm, ...
                        est_gmms{train_mode, cl_model}.w_gmm, 1, 0);  % 0: no log
            end;
        end;
    end;

    % compute score
    ...
end;

% visualization
...
Figure 3: Code of the experiment on synthetic data.
4 Speaker Recognition Experiment on Speech Data

In order to experiment with a more realistic task, we considered in [2] a speaker recognition task on noisy speech data. In this case, the uncertainty is not known a priori but is estimated using a method based on Wiener filtering and Vector Taylor Series (VTS) expansion; see Section 4.1.5 of [2] for more details.
[Figure 4: five 2-D panels — the clean and noisy data for Classes 1-3 together with the GMMs of each class, and the GMMs obtained by "Conventional training, clean data", "Conventional training, noisy data", and "Uncertainty training, noisy data".]
Figure 4: GMMs estimated using different learning conditions. By considering the uncertainty in
the learning algorithm, the resulting GMMs are much closer to the original ones when considering
noisy data.
4.1 Data
The data is provided separately from this toolbox and can be freely obtained at http://www.irisa.fr/metiss/ozerov/Software/SP_REC_Uncrt_MFCC.zip. It consists of Mel Frequency Cepstral Coefficients (MFCCs) computed over 3 different inputs:

1. mix: raw addition of clean speech and noise

2. ssep: output of a state-of-the-art source separation algorithm fed with the raw addition of clean speech and noise

3. ssep_uncrt: same, but with an estimate of the uncertainty of the output obtained with the VTS method

Details about how, and on what kind of audio data, the MFCCs have been generated are provided in Section 4.1.2 of [2].
Data is provided as .MAT files that can be opened with Matlab® or any parser that can read HDF5 files. The naming convention is as follows: s<speakerId>_<utteranceId>_mfcc.mat. The speakerId corresponds to the numeric id of the speaker, from 1 to 34. For the 3 different inputs described above, the .MAT file contains the mfcc variable, which is a 2-dimensional floating-point matrix of size 20 × nf, where nf is the number of frames considered for computing the 20-dimensional MFCCs. For the last input, a second variable mfcc_covar is available. It encodes the uncertainty as a 3-dimensional tensor of size 20 × 20 × nf.
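Assuming underscore-separated file names as quoted above and the variable shapes just described, a small Python helper can sanity-check loaded data. Both functions are illustrative, not part of the toolbox:

```python
import re
import numpy as np

MFCC_DIM = 20  # dimensionality of the MFCC vectors described above

def parse_mfcc_filename(name):
    """Parse 's<speakerId>_<utteranceId>_mfcc.mat' into integer ids,
    checking that the speaker id lies in the documented range 1..34."""
    m = re.fullmatch(r"s(\d+)_(\d+)_mfcc\.mat", name)
    if m is None:
        raise ValueError("unexpected file name: %s" % name)
    speaker_id, utterance_id = map(int, m.groups())
    if not 1 <= speaker_id <= 34:
        raise ValueError("speaker id out of range: %d" % speaker_id)
    return speaker_id, utterance_id

def check_mfcc_variables(mfcc, mfcc_covar=None):
    """Validate the documented shapes: mfcc is 20 x nf, and the optional
    uncertainty tensor mfcc_covar (ssep_uncrt input) is 20 x 20 x nf.
    Returns the number of frames nf."""
    mfcc = np.asarray(mfcc)
    if mfcc.ndim != 2 or mfcc.shape[0] != MFCC_DIM:
        raise ValueError("mfcc must be a 20 x nf matrix")
    nf = mfcc.shape[1]
    if mfcc_covar is not None:
        mfcc_covar = np.asarray(mfcc_covar)
        if mfcc_covar.shape != (MFCC_DIM, MFCC_DIM, nf):
            raise ValueError("mfcc_covar must be a 20 x 20 x nf tensor")
    return nf
```

Note that HDF5 readers outside Matlab may expose the arrays in transposed (row-major) order, so the shapes should be checked after any needed transposition.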
The data is divided into two main directories (train and test) that contain the data used for training and testing, respectively. For each one, a clean condition (no noise addition) and several signal-to-noise ratios (SNRs) from -6 dB to 9 dB are considered. For each SNR, the 3 conditions discussed above (mix, ssep, ssep_uncrt) are available. For the latter, the full-covariance uncertainty estimated using the Wiener / VTS approach is provided. It shall be noted that the MFCCs of this condition are the same as in the ssep condition, as the VTS estimator does not change the actual MFCC values; see Equation C.2 in [2].
4.2 Code
The function EXAMPLE_2__real_19D_MFCC_data can be used to replicate the full results of the experiments reported in Table D.7 of [2]. One run computes the results for one training/testing condition for the following 4 cases:

1. Conventional training / Conventional decoding without signal enhancement

2. Conventional training / Conventional decoding with signal enhancement

3. Conventional training / Uncertainty decoding with signal enhancement

4. Uncertainty training / Uncertainty decoding with signal enhancement

Figure 5 shows the first lines of the function, which set the main processing parameters:

• data_dir_name: path to the data repository

• speaker_ids: selection vector of the speakers to consider; 1:34 will consider all the speakers available

• subdir_name_test: SNR condition for testing, from 'm6dB' (-6 dB SNR) to '9dB' (9 dB SNR), and 'clean' (∞ SNR)

• subdir_name_train: SNR condition for training
On a standard 2 GHz machine with one core, the example shown in Figure 5 with 3 speakers needs about an hour. The evaluation of one training/testing condition (for example, -6 dB SNR at training and 0 dB SNR at testing) is expected to take one day for the 34 speakers available. Replicating the results of the full table is therefore expected to take about 48 days.
References

[1] A. Ozerov, M. Lagrange, and E. Vincent, “GMM-based classification from noisy features,” in Proc. 1st Int. Workshop on Machine Listening in Multisource Environments (CHiME), Florence, Italy, September 2011, pp. 30–35.

[2] A. Ozerov, M. Lagrange, and E. Vincent, “Uncertainty-based learning of Gaussian mixture models from noisy data,” Computer Speech and Language, 2011, submitted.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, pp. 1–38, 1977.
function EXAMPLE_2__real_19D_MFCC_data ()

% comments
...

data_dir_name = '../SP_REC_Uncrt_MFCC/';

speaker_ids = [2, 4, 6];      % can be any from 1 to 34

subdir_name_test  = 'm6dB';   % can be '0dB', '3dB', '6dB', '9dB', 'm3dB', 'm6dB'
subdir_name_train = 'm6dB';   % can be '0dB', '3dB', '6dB', '9dB', 'm3dB', 'm6dB'
Figure 5: Main parameters of the speaker recognition task.