Performance Comparison and Duration Model Improvement of Speaker Adaptation Methods in HMM-based Korean Speech Synthesis

  • Received : 2012.07.25
  • Accepted : 2012.09.21
  • Published : 2012.09.30

Abstract

In this paper, we compare the performance of several speaker adaptation methods for an HMM-based Korean speech synthesis system when only a small amount of adaptation data is available. According to objective and subjective evaluations, a hybrid of constrained structural maximum a posteriori linear regression (CSMAPLR) and maximum a posteriori (MAP) adaptation outperforms the other methods when only five minutes of adaptation data are available for the target speaker. The objective evaluation also shows that the duration models are not adapted to the target speaker as well as the spectral envelope and pitch models. To alleviate this problem, we propose a duration rectification method and a duration interpolation method. Both the objective and subjective evaluations reveal that incorporating the two proposed methods into the conventional speaker adaptation method effectively improves duration model adaptation.
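The abstract does not detail how the proposed duration interpolation works, so the following is only a rough illustration, not the authors' implementation. The Python sketch below assumes that duration interpolation blends the per-state durations predicted by the average-voice duration model with those of the speaker-adapted duration model through a weighting factor; all function and variable names here are hypothetical, and the paper's actual formulation may differ.

```python
import numpy as np

def interpolate_durations(avg_durations, adapted_durations, alpha=0.5):
    """Blend state durations (in frames) from the average-voice model
    and the speaker-adapted model.

    alpha = 0.0 keeps the average-voice durations,
    alpha = 1.0 keeps the adapted durations.
    (Illustrative sketch only; the paper's interpolation rule may differ.)
    """
    avg = np.asarray(avg_durations, dtype=float)
    adapted = np.asarray(adapted_durations, dtype=float)
    blended = (1.0 - alpha) * avg + alpha * adapted
    # HMM/HSMM state durations are frame counts, so round to integers
    # and keep every state at least one frame long.
    return np.maximum(np.rint(blended).astype(int), 1)

# Example: per-state durations (in frames) for one phone model.
avg_model = [3, 7, 5]        # from the average-voice duration model
adapted_model = [4, 10, 6]   # after adaptation with limited target data
print(interpolate_durations(avg_model, adapted_model, alpha=0.7))
```

Under this reading, a weighting factor closer to 1 lets the synthesized rhythm follow the possibly under-trained adapted model, while smaller values fall back toward the more reliably estimated average-voice timing.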

References

  1. Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proc. of Eurospeech, 2347-2350.
  2. http://www.synsig.org/index.php/Blizzard_Challenge_2012_Workshop.
  3. Yamagishi, J. & Kobayashi, T. (2007). Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Trans. Inf. Syst., E90-D(2), 533-543. https://doi.org/10.1093/ietisy/e90-d.2.533
  4. Yamagishi, J., Kobayashi, T., Nakano, Y., Ogata, K. & Isogai, J. (2009). Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans. Audio, Speech, Language Process., 17(1), 66-83. https://doi.org/10.1109/TASL.2008.2006647
  5. Yamagishi, J., Ogata, K., Nakano, Y., Isogai, J. & Kobayashi, T. (2006). HSMM-based model adaptation algorithms for average-voice-based speech synthesis. Proc. of ICASSP, 77-80.
  6. Zen, H., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. (2004). Hidden semi-Markov model based speech synthesis. Proc. of ICSLP, 1397-1400.
  7. Shinoda, K. & Lee, C.-H. (2001). A structural Bayes approach to speaker adaptation. IEEE Trans. Speech Audio Process., 9(3), 276-287. https://doi.org/10.1109/89.906001
  8. Yamagishi, J. & Kobayashi, T. (2005). Adaptive training for hidden semi-Markov model. Proc. of ICASSP, 365-368.
  9. http://hts.sp.nitech.ac.jp/archives/2.2/HTS-demo_CMU-ARCTICSLT_STRAIGHT.tar.bz2.
  10. Lee, H. & Kim, H. S. (2012). Performance comparison of speaker adaptation methods for HMM-based Korean speech synthesis system. Proc. of Spring Conference of Korean Society of Speech Sciences, 241-242 (in Korean).