Voice transformation for HTS using correlation between fundamental frequency and vocal tract length

  • 유효근 (School of Electrical Engineering, KAIST) ;
  • 김영관 (KAIST) ;
  • 서영주 (School of Electrical Engineering, KAIST) ;
  • 김회린 (School of Electrical Engineering, KAIST)
  • Received : 2017.01.31
  • Accepted : 2017.03.21
  • Published : 2017.03.31

Abstract

The main advantage of statistical parametric speech synthesis is its flexibility in changing voice characteristics. A personalized text-to-speech (TTS) system can be implemented by combining a speech synthesis system with a voice transformation system, and such systems are widely used in many application areas. It is known that the fundamental frequency and the spectral envelope of a speech signal can be modified independently to convert voice characteristics; it is also important that the transformed speech remain natural. In this paper, a hidden Markov model based speech synthesis system (HMM-based speech synthesis, HTS) using the STRAIGHT vocoder is constructed, and voice transformation is conducted by modifying the fundamental frequency and the spectral envelope. The fundamental frequency is transformed by a scaling method, and the spectral envelope is transformed by a frequency-warping method that controls the speaker's apparent vocal tract length. In particular, this study proposes a voice transformation method that exploits the correlation between fundamental frequency and vocal tract length. Subjective evaluations were conducted to assess preference and mean opinion scores (MOS) for the naturalness of the synthetic speech. Experimental results showed that the proposed voice transformation method achieved higher preference than the baseline systems while maintaining the naturalness of the speech.
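The two operations described above can be illustrated with a minimal sketch: F0 is multiplied by a constant ratio, and the spectral envelope is resampled along a bilinear frequency-warping function, a common choice for simulating a change of vocal tract length. The coupling of the warp factor to the F0 ratio in `coupled_transform` (the constant `k` and the log mapping) is purely illustrative and is not the paper's actual formula; the paper's exact correlation model is not reproduced here.

```python
import numpy as np

def scale_f0(f0, ratio):
    """Scale voiced F0 values by a constant ratio (unvoiced frames, F0 = 0, stay 0)."""
    f0 = np.asarray(f0, dtype=float)
    return np.where(f0 > 0, f0 * ratio, 0.0)

def warp_spectral_envelope(env, alpha):
    """Warp a linear-frequency spectral envelope with the bilinear (all-pass)
    frequency-warping function, |alpha| < 1. alpha = 0 leaves the envelope
    unchanged; nonzero alpha shifts spectral detail along the frequency axis,
    mimicking a change of vocal tract length."""
    env = np.asarray(env, dtype=float)
    n = len(env)
    omega = np.linspace(0.0, np.pi, n)  # normalized frequency grid
    # Bilinear warping: omega' = omega + 2*arctan(alpha*sin(omega) / (1 - alpha*cos(omega)))
    warped = omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))
    # Resample the envelope on the warped frequency axis (endpoints are preserved).
    return np.interp(omega, warped, env)

def coupled_transform(f0, env, f0_ratio, k=0.2):
    """Hypothetical coupling: derive the warp factor from the F0 ratio so that
    pitch and apparent vocal tract length move together (illustrative mapping only)."""
    alpha = k * np.log(f0_ratio)
    return scale_f0(f0, f0_ratio), warp_spectral_envelope(env, alpha)
```

In a full system these per-frame operations would be applied to the F0 trajectory and spectral-envelope frames produced by the vocoder analysis before resynthesis.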
