Singing Voice Synthesis Using HMM-Based TTS and MusicXML

  • Received : 2015.04.09
  • Accepted : 2015.05.15
  • Published : 2015.05.30


Singing voice synthesis is the computer generation of a song from its lyrics and musical notes. Hidden Markov models (HMMs) have proven to be the models of choice for text-to-speech (TTS) synthesis. HMMs have also been used in singing voice synthesis research; however, training HMMs for singing voice synthesis requires a huge database. In addition, commercially available singing voice synthesis systems use piano roll notation, and they would need to adopt the easy-to-read standard music notation to be suitable for singing-learning applications. To overcome these problems, we train context-dependent HMMs on a speech database and use them for singing voice synthesis. Pitch and duration control methods are devised to modify the parameters of the speech-trained HMMs so that they can serve as the synthesis units for the singing voice. This work describes a singing voice synthesis system that uses a MusicXML-based music score editor as the front-end interface for entering the notes and lyrics to be synthesized, and an HMM-based TTS system as the back-end synthesizer. A perceptual test shows the feasibility of the proposed system.
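
The abstract does not spell out the pitch and duration control methods, so the following is only a minimal sketch of the kind of information a MusicXML front end could hand to the back-end synthesizer: for each note, the lyric syllable, a target fundamental frequency, and a target duration. The function name `note_targets`, the fixed `tempo_bpm` parameter, and the use of equal temperament with A4 = 440 Hz are assumptions made for illustration, not details taken from the paper.

```python
import xml.etree.ElementTree as ET

# Semitone offset of each note name within an octave (C = 0).
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_targets(musicxml_text, tempo_bpm=120.0):
    """Return a list of (syllable, f0_hz, seconds), one tuple per note.

    f0_hz is None for rests; syllable is None when a note has no lyric.
    Assumes one constant tempo and equal temperament (A4 = 440 Hz).
    """
    root = ET.fromstring(musicxml_text)
    divisions = 1  # MusicXML duration units per quarter note
    targets = []
    for measure in root.iter("measure"):
        for elem in measure:
            if elem.tag == "attributes":
                div = elem.findtext("divisions")
                if div is not None:
                    divisions = int(div)
            elif elem.tag == "note":
                duration_text = elem.findtext("duration")
                if duration_text is None:
                    continue  # e.g. grace notes carry no duration
                quarters = int(duration_text) / divisions
                seconds = quarters * 60.0 / tempo_bpm
                pitch = elem.find("pitch")
                if pitch is None:
                    f0_hz = None  # rest
                else:
                    midi = (12 * (int(pitch.findtext("octave")) + 1)
                            + SEMITONES[pitch.findtext("step")]
                            + float(pitch.findtext("alter", "0")))
                    f0_hz = 440.0 * 2 ** ((midi - 69) / 12)
                targets.append((elem.findtext("lyric/text"), f0_hz, seconds))
    return targets


# Hypothetical two-event score: a sung quarter note on A4, then an eighth rest.
SCORE = """<score-partwise><part id="P1"><measure number="1">
<attributes><divisions>2</divisions></attributes>
<note><pitch><step>A</step><octave>4</octave></pitch>
<duration>2</duration><lyric><text>la</text></lyric></note>
<note><rest/><duration>1</duration></note>
</measure></part></score-partwise>"""

for syllable, f0_hz, seconds in note_targets(SCORE):
    print(syllable, f0_hz, seconds)
```

In a system like the one described, per-note targets of this kind would presumably drive the modification of the F0 and state-duration parameters of the speech-trained HMMs; how that modification is carried out is not stated in the abstract.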


TTS; HMM; Singing Voice Synthesis; Score Editor


Supported by : University of Ulsan

