Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC

https://doi.org/10.1016/j.csl.2008.02.003

Abstract

Vocal tract length normalization (VTLN) for standard filterbank-based Mel frequency cepstral coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent of frequency warping (FW) would enable more efficient MLS estimation. We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is closely related to different LTs previously proposed for FW with cepstral features, and that these LTs are all numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike previous LT approaches to VTLN with MFCC features, requires no modification of the standard MFCC feature extraction scheme. In VTLN and speaker adaptive modeling (SAM) experiments with the DARPA resource management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not degrade performance. Performance comparable to front end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation under the maximum likelihood linear regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (one utterance) and were equally effective with unsupervised parameter estimation. We also performed speaker adaptive training (SAT) with the feature space LT, denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree and including an additive bias, we obtained results significantly better than VTLN as the amount of adaptation data increased.

Introduction

Vocal Tract Length Normalization (VTLN) is a speaker normalization technique widely used to improve the accuracy of speech recognition systems. In VTLN, spectral mismatch caused by variation in vocal tract lengths of speakers is reduced by performing spectral frequency warping (FW) or its equivalent, typically during feature extraction. VTLN has proven to be particularly effective when only limited adaptation data from a test speaker is available, even in an unsupervised mode. The estimation and implementation of frequency warping have received much attention in recent years.

The parameters controlling the FW are commonly estimated by optimizing a maximum likelihood (ML) criterion over the adaptation data. The ML criterion could be the ASR likelihood score of the recognizer over the adaptation data (Lee and Rose, 1998, Pitz et al., 2001, Pitz and Ney, 2003), the EM auxiliary function (Dempster et al., 1977, McDonough, 2000, Loof et al., 2006), or the likelihoods of Gaussian mixture models (GMMs) trained specifically for FW parameter estimation (Wegmann et al., 1996, Lee and Rose, 1998). FW parameters can also be estimated by aligning formants or formant-like spectral peaks between the test speaker and a reference speaker from the training set (Gouvea and Stern, 1997, Claes et al., 1998, Cui and Alwan, 2006).

Maximizing the likelihood score is commonly performed using grid search over a set of warping factors, when the FW is described by a single parameter that controls the scaling of the frequency axis (Lee and Rose, 1998). More recently, optimization methods based on the gradient and higher order derivatives of the objective function have been used to estimate the FW function. This allows efficient estimation of multiple parameter FW functions like the All-Pass Transform (APT) FWs, which can give better recognition performance than single parameter FWs (McDonough, 2000, Panchapagesan and Alwan, 2006).

Frequency warping of the spectrum has been shown to correspond to a linear transformation in the cepstral space (McDonough et al., 1998, Pitz et al., 2001). This relationship confers some important advantages for speech recognition systems that use cepstral features. First, the linear transform can be applied to previously computed unwarped features, so features need not be recomputed for each warping factor during VTLN estimation. This results in significant computational savings (Umesh et al., 2005), which is important in embedded and distributed speech recognition (DSR) applications, where resources are limited. A rough calculation shows that, given a fixed recognition alignment of an utterance obtained with baseline models without VTLN, parameter estimation for regular VTLN is about 2.5 times as expensive as for LT VTLN, when that alignment is used for VTLN estimation with the MLS criterion, single Gaussian mixture HMMs, and a grid search. The linear transform approach also has the advantage that no access to intermediate stages of the feature extraction is needed during VTLN estimation. This is a definite advantage in DSR, where feature extraction is performed at the client and recognition at the server. During VTLN estimation using a grid search over warping factors, it would be impractical for the client to recompute and transmit features for each warping factor, so warped features would have to be computed at the server. With a linear transform, only the cepstral transformation matrix for each warping factor need be applied to the unwarped features to choose the best warping factor; with VTLN by spectral warping, the linear frequency spectrum would need to be reconstructed and the warped features recomputed for each warping factor.

The linearity also enables one to take the expectation and thereby apply the linear transformation to the means of HMM distributions (Claes et al., 1998, McDonough and Byrne, 1999). Different transforms could then be estimated for different phonemes or classes of HMM distributions, unlike VTLN where the same global transformation is applied to all speech features (McDonough, 2000).

Mel frequency cepstral coefficients (MFCCs), computed using a filterbank and the DCT (Davis and Mermelstein, 1980), are a very popular choice of features for speech recognition. The equivalence of FW to linear transformation, though it holds for cepstral features based on perceptual linear prediction (PLP) or on Mel warping of the frequency axis (McDonough, 2000, Pitz and Ney, 2003), does not hold exactly for standard MFCC features. In fact, for standard MFCC features, because of the non-invertible filterbank with non-uniform filter widths, the warped MFCC features cannot, even under the assumption of quefrency limitedness, be expressed as a function (linear or non-linear) of the unwarped MFCC features. That is, for a given warping of the linear frequency signal spectrum, there is no single function (valid for all possible cepstra) that gives the warped cepstra from the unwarped cepstra. Hence, approximate linear transforms have been developed for FW with MFCC features (Claes et al., 1998, Cui and Alwan, 2006, Umesh et al., 2005).

Claes et al. (1998) were the first to derive an approximate linear transform, which was used to perform model adaptation with some success. Cui and Alwan (2005, 2006) derived a simpler linear transform that is essentially an “index mapping” on the Mel filterbank outputs, i.e., one filterbank output is mapped to another. In fact, it may be shown to be mathematically a special case of Claes et al.’s transform (see Section 2), but it was demonstrated to give better performance (Cui and Alwan, 2005). In both Claes et al. (1998) and Cui and Alwan (2006), the FW was estimated by alignment of formants or formant-like peaks in the linear frequency domain.

Umesh et al. (2005) showed that the formula for computing the linear transform for ordinary cepstra, derived in Pitz et al. (2001), could be considerably simplified under the assumption of quefrency limitedness of the cepstra, when the log spectrum can be obtained from samples by sinc interpolation. They also developed non-standard filterbank-based MFCC features, to which the linear transformation was extended. In their modified filterbank, the filter center frequencies were uniformly spaced in the linear frequency domain but filter bandwidths were uniform in the Mel domain. Their transformation formula (discussed further in Section 4) was, however, complicated by the use of two different DCT matrices, one for warping purposes and the other for computing the cepstra.

In Panchapagesan (2006), we introduced a novel linear transform for MFCCs that requires no modification of the standard MFCC feature extraction scheme. The main idea is to directly warp the continuous log filterbank output obtained by cosine interpolation with the IDCT. This approach can be viewed as using the spectral interpolation idea of Umesh et al. (2005) to perform a continuous warping of the log filterbank outputs, instead of the discrete mapping of Cui and Alwan (2006). However, a single warped IDCT matrix performs both the interpolation and the warping, resulting in a simpler mathematical formula for computing the transform than that of Umesh et al. (2005). Also, the warping in the IDCT matrix is parametrized, and the parameters can be estimated directly by optimizing an objective criterion, without using the intermediate linear frequency spectrum as in the peak alignment method of Cui and Alwan (2006). As mentioned above, this is advantageous in distributed speech recognition, where intermediate variables of the feature extraction would otherwise have to be reconstructed at the recognizer. Moreover, with a smooth parametrization of the FW, the FW parameters can be estimated by faster optimization techniques, as in McDonough (2000) and Panchapagesan and Alwan (2006), instead of the commonly used grid search, and several parameters can be optimized simultaneously.
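To make this construction concrete, the following is a minimal NumPy sketch of building such a cepstral linear transform from a single warped IDCT matrix. The piecewise-linear warping function, its knee position u0, the warping factor, and the direction of the warp relative to its inverse are all illustrative assumptions here, not the paper's prescriptions.

```python
import numpy as np

def dct_matrix(N, M):
    """N x M unitary type-II DCT matrix C (rows: cepstral index k; columns: filter m)."""
    k = np.arange(N)[:, None]           # cepstral index 0..N-1
    m = np.arange(1, M + 1)[None, :]    # filter index 1..M
    alpha = np.full(N, np.sqrt(2.0 / M)); alpha[0] = np.sqrt(1.0 / M)
    return alpha[:, None] * np.cos(np.pi * (2 * m - 1) * k / (2 * M))

def warped_idct_matrix(N, M, psi):
    """M x N warped IDCT: row m evaluates the cosine-interpolated log
    filterbank output at the warped center position psi(u_m)."""
    u = (2 * np.arange(1, M + 1) - 1) / (2.0 * M)   # normalized filter centers
    k = np.arange(N)[None, :]
    alpha = np.full(N, np.sqrt(2.0 / M)); alpha[0] = np.sqrt(1.0 / M)
    return alpha[None, :] * np.cos(np.pi * k * psi(u)[:, None])

def cepstral_lt(N, M, alpha_warp, u0=0.875):
    """N x N cepstral transform T_p = C @ Psi_p for a piecewise-linear warp
    with factor alpha_warp and knee u0 (both illustrative assumptions)."""
    def psi(u):
        upper = alpha_warp * u0 + (1.0 - alpha_warp * u0) / (1.0 - u0) * (u - u0)
        return np.where(u <= u0, alpha_warp * u, upper)
    return dct_matrix(N, M) @ warped_idct_matrix(N, M, psi)

# With the identity warp, T reduces to C @ C^T = I, as expected:
T_id = cepstral_lt(N=13, M=26, alpha_warp=1.0)
assert np.allclose(T_id, np.eye(13))
c_warped = cepstral_lt(13, 26, 1.05) @ np.random.randn(13)  # warping unwarped MFCCs
```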

In Panchapagesan (2006), we validated the technique on connected digit recognition of children’s speech and showed that, for that task, it performed favorably compared to regular VTLN by warping the filterbank. We also compared the method in the back end with the peak alignment method (Cui and Alwan, 2006) and obtained comparable or slightly better results.

In this paper, the mathematical derivation of our linear transform (LT) is presented in more detail, and the final formula for computing the LT for any given frequency warping function and parameter is expressed in a simple and compact form. It is also shown that our LT and other LTs proposed in the past for FW with cepstral features (McDonough, 2000, Pitz et al., 2001, Umesh et al., 2005) are closely related. We validate our LT further by demonstrating its effectiveness in continuous speech recognition on the DARPA resource management (RM1) database. Our experiments include front end VTLN, back end adaptation of HMM means, and speaker adaptive modeling and training using the LT (Welling et al., 2002, Anastasakos et al., 1996). We show that in all cases LT VTLN gives results comparable to those of regular VTLN by warping the Mel filterbank center frequencies. We also discuss optimization of the EM auxiliary function for LT estimation, and show that estimating multiple transforms with a regression tree yields better results than global VTLN. Finally, we present the results of unsupervised VTLN and adaptation with the LT.

The rest of this paper is organized as follows. In Section 2 we consider the problem of deriving a linear transformation for FW of MFCCs, review previous work, and motivate the development of our new linear transform. The matrix for the new linear transformation is derived in Section 3, and the proposed transform is compared with previous approaches in Section 4. We then consider the estimation of FWs using MLS and EM auxiliary function as objective criteria in Section 5, and also derive formulae for convex optimization of the EM auxiliary function for multiple FW parameters. Experimental results are presented in Section 6, and summary and conclusions in Section 7.

Section snippets

FW as linear transformation of standard MFCC – review of previous work

Standard MFCCs are computed as shown in Fig. 1, and the Mel filterbank is shown in Fig. 2. The filters are assumed to be triangular and half-overlapping, with center frequencies spaced equally apart on the Mel scale.

During feature extraction, the speech signal is pre-emphasized and divided into frames, and each frame is windowed using a Hamming window. The short-time power spectrum vector S is obtained from the squared magnitude of the FFT of the windowed frame.
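As a concrete illustration of this front end, here is a minimal single-frame sketch in Python/NumPy of the pipeline in Fig. 1. The sampling rate, frame length, FFT size, filter count, and the small flooring constant are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np
from scipy.fft import dct

def mel(f):
    """Hz to Mel."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    """Mel to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, n_fft, fs):
    """Triangular, half-overlapping filters; centers equally spaced on the Mel scale."""
    pts = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), M + 2))  # edges and centers
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((M, n_fft // 2 + 1))
    for i in range(M):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fb, N, n_fft):
    """One frame of standard MFCCs: Hamming window -> power spectrum -> Mel
    filterbank -> log -> truncated unitary type-II DCT."""
    S = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    log_fbank = np.log(fb @ S + 1e-10)   # small floor avoids log(0)
    return dct(log_fbank, type=2, norm='ortho')[:N]

# Example with illustrative parameters (not values from the paper):
fs, n_fft, M, N = 8000, 256, 26, 13
fb = mel_filterbank(M, n_fft, fs)
frame = np.random.randn(200)   # stands in for one pre-emphasized speech frame
c = mfcc_frame(frame, fb, N, n_fft)
```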

The log of the filterbank

Linearity of the cepstral transformation

Eq. (9) describes how the smoothed log filterbank output may be approximately recovered from the truncated cepstra using the IDCT. We use a unitary type-II DCT matrix, for which we have $C^{-1} = C^T$, with
$$C = \left[\alpha_k \cos\left(\frac{\pi(2m-1)k}{2M}\right)\right], \quad 0 \leq k \leq N-1, \quad 1 \leq m \leq M$$
where $M$ is the number of filters in the filterbank, $N$ is the number of cepstra used in the features, and
$$\alpha_k = \begin{cases} \sqrt{\dfrac{1}{M}}, & k = 0 \\[4pt] \sqrt{\dfrac{2}{M}}, & k = 1, 2, \ldots, N-1 \end{cases}$$
is a factor that ensures that the DCT is unitary. Similar expressions are valid for $C$ and $C^{-1}$ with a non-unitary type-II DCT matrix, but then $C^{-1}$
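As a quick numerical sanity check of this matrix, the sketch below builds $C$ and verifies its row orthonormality; the values of M and N are typical choices, not prescribed by the paper, and the random vector merely stands in for a log filterbank output.

```python
import numpy as np

def unitary_dct(N, M):
    """N x M unitary type-II DCT: C[k, m-1] = alpha_k cos(pi (2m-1) k / (2M))."""
    k = np.arange(N)[:, None]
    m = np.arange(1, M + 1)[None, :]
    alpha = np.full(N, np.sqrt(2.0 / M)); alpha[0] = np.sqrt(1.0 / M)
    return alpha[:, None] * np.cos(np.pi * (2 * m - 1) * k / (2 * M))

M, N = 26, 13                  # typical (not prescribed) filter and cepstra counts
C = unitary_dct(N, M)
assert np.allclose(C @ C.T, np.eye(N))   # orthonormal rows; C^{-1} = C^T when N = M

log_fbank = np.random.randn(M)   # stands in for a log Mel filterbank output vector
cep = C @ log_fbank              # truncated cepstra (N coefficients)
smoothed = C.T @ cep             # IDCT: smoothed recovery, approximate since N < M
```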

Comparison and relationships with previous transforms

As discussed in Section 1, several cepstral linear transforms have earlier been derived in the literature as equivalents of frequency warping for use in speaker normalization and adaptation. Some of them were derived for plain or PLP cepstra (McDonough et al., 1998, Pitz et al., 2001) and extended to non-standard MFCC features (Pitz and Ney, 2003, Umesh et al., 2005). Although our LT was derived for standard MFCCs by warping the log filterbank output, motivated by the work of Cui and Alwan, 2006

MLS objective criterion

In our work, for VTLN estimation, we used the commonly used maximum likelihood score (MLS) criterion (Lee and Rose, 1998, Pitz et al., 2001). For a feature space transform, the MLS criterion to estimate the optimal FW parameters $\hat{p}$ is:
$$\hat{p} = \underset{p}{\arg\max}\left[\log P(X^p, \Theta^p \mid W, \Lambda) + T \log |A_p|\right]$$
where $p$ is (are) the FW parameter(s), $x^p = A_p x$ is a normalized feature vector, $|A_p|$ is the determinant of $A_p$, $X^p = \{x_1^p, x_2^p, \ldots, x_T^p\}$ is the normalized adaptation data, $W$ is the word (or other unit) transcription, $\Lambda$ are the
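To make the criterion concrete, here is a hedged sketch of grid-search MLS estimation using precomputed LT matrices. The function `log_likelihood` is a hypothetical placeholder for the HMM score $\log P(X^p, \Theta^p \mid W, \Lambda)$ computed from a fixed alignment, and the grid values are illustrative.

```python
import numpy as np

def estimate_warp_mls(X, warp_grid, transforms, log_likelihood):
    """Grid-search MLS estimation of the FW parameter.

    X          : (T, N) array of unwarped MFCC vectors (adaptation data)
    warp_grid  : candidate warping factors, e.g. np.arange(0.88, 1.13, 0.02)
    transforms : dict mapping each factor p to its precomputed N x N matrix A_p
    log_likelihood : hypothetical scorer for the normalized data given a
                     fixed alignment of the transcription W
    """
    best_p, best_score = None, -np.inf
    T = len(X)
    for p in warp_grid:
        A = transforms[p]
        Xp = X @ A.T                              # apply the LT to precomputed features
        sign, logdet = np.linalg.slogdet(A)
        score = log_likelihood(Xp) + T * logdet   # MLS with Jacobian term T log|A_p|
        if score > best_score:
            best_p, best_score = p, score
    return best_p
```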

Experimental results

We validated the LT by testing it on connected digit and continuous speech recognition tasks and comparing the performance with that of regular VTLN by warping the filterbank center frequencies (hereafter referred to as regular VTLN in this paper). The main advantages of using the LT over regular VTLN, as discussed in Section 1, are computational savings and flexibility of implementation. The spectral information available during LT parameter estimation consists only of a smoothed Mel-warped

Summary and conclusions

We have developed a novel linear transform (LT) to implement frequency warping for standard filterbank-based MFCC features. There are a few important advantages of using a linear transform for frequency warping: VTLN estimation by optimizing an objective function is performed computationally more efficiently with a LT than with regular VTLN by warping the center frequencies of the filterbank; the transform can also be estimated and applied in the back end to HMM means; and one need not have

Acknowledgements

We wish to thank Xiaodong Cui for the use of his connected digit recognition and adaptation systems, Shizhen Wang for discussions on this work, and Markus Iseli for help with programming. We also thank the anonymous reviewers for their comments that helped improve the paper. This work was supported in part by the NSF.

References

  • Gouvea, E.B., Stern, R.M., 1997. Speaker normalization through formant-based warping of the frequency scale. In:...
  • Lee, L., Rose, R., 1998. A frequency warping approach to speaker normalization. IEEE Transactions on Speech and Audio Processing.
  • Loof, J., Ney, H., Umesh, S., 2006. VTLN warping factor estimation using accumulation of sufficient statistics. In:...