System fusion for high-performance voice conversion

Tian, Xiaohai; Wu, Zhizheng; Lee, Siu Wa; Nguyen, Quy Hy; Dong, Minghui; Chng, Eng Siong

doi:10.21437/Interspeech.2015-581

Recently, a number of voice conversion methods have been developed. These methods attempt to improve conversion performance by using diverse mapping techniques in various acoustic domains, e.g. high-resolution spectra and low-resolution Mel-cepstral coefficients. Each method has its own strengths and weaknesses. In this paper, we introduce a system fusion framework that combines the merits of these state-of-the-art methods and can also accommodate future conversion methods. For instance, methods delivering high speech quality are fused with methods capturing speaker characteristics, yielding a further performance gain. To examine the feasibility of the proposed framework, we select two state-of-the-art methods, a Gaussian mixture model based system and a frequency warping based system, as a case study. Experimental results show that the fused system outperforms each individual method in both objective and subjective evaluations, demonstrating the effectiveness of the proposed fusion framework.

Cite as: Tian, X., Wu, Z., Lee, S.W., Nguyen, Q.H., Dong, M., Chng, E.S. (2015) System fusion for high-performance voice conversion. Proc. Interspeech 2015, 2759-2763, doi: 10.21437/Interspeech.2015-581

@inproceedings{tian15b_interspeech,
  author={Xiaohai Tian and Zhizheng Wu and Siu Wa Lee and Quy Hy Nguyen and Minghui Dong and Eng Siong Chng},
  title={{System fusion for high-performance voice conversion}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={2759--2763},
  doi={10.21437/Interspeech.2015-581}
}