Performance of speech recognition unit considering morphological pronunciation variation

  • Received : 2018.08.29
  • Accepted : 2018.10.09
  • Published : 2018.12.31

Abstract

This paper proposes a method to improve speech recognition performance by extracting the various pronunciations of pseudo-morpheme units from an eojeol-unit corpus and generating new recognition units that take pronunciation variation into account. In the proposed method, we first align the pronunciations of the eojeol units with those of their constituent pseudo-morpheme units, and then expand the pronunciation dictionary by extracting new pseudo-morpheme pronunciations from the eojeol-level pronunciations. We then derive pronunciation-dependent recognition units by tagging each pseudo-morpheme with the phoneme symbols obtained for it. The proposed units and their expanded pronunciations are incorporated into the lexicon and language model of the speech recognizer. Performance is evaluated with a Korean speech recognizer using a trigram language model trained on a corpus of 100 million pseudo-morphemes and an acoustic model trained on 445 hours of multi-genre broadcast speech data. The proposed method reduces the word error rate by a relative 13.8% on the news-genre evaluation data and by a relative 4.5% on the full evaluation set.
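The pipeline the abstract describes (align an eojeol-level pronunciation against the canonical pseudo-morpheme pronunciations, harvest per-morpheme surface variants, and tag each unit with its pronunciation) can be illustrated with a short sketch. This is a minimal illustration only, not the authors' implementation: the phone alignment uses Python's difflib.SequenceMatcher as a stand-in for whatever aligner the authors used, and the function names, romanized phone symbols, and the toy nasalization example are all hypothetical.

```python
# Minimal sketch (not the authors' implementation): extract per-pseudo-morpheme
# pronunciation variants by aligning an eojeol-level pronunciation against the
# concatenated canonical pseudo-morpheme pronunciations, then tag each unit
# with the surface phones it received. Phone symbols are toy romanizations.
from difflib import SequenceMatcher


def split_eojeol_pron(eojeol_phones, morph_prons):
    """Distribute the phones of an eojeol pronunciation over its
    pseudo-morphemes, using an edit-distance alignment against the
    concatenation of the canonical per-morpheme pronunciations."""
    canon = [p for pron in morph_prons for p in pron]
    # owner[i] = index of the morpheme that canonical phone i belongs to.
    owner = [i for i, pron in enumerate(morph_prons) for _ in pron]
    variants = [[] for _ in morph_prons]
    ops = SequenceMatcher(a=canon, b=eojeol_phones, autojunk=False).get_opcodes()
    for tag, a0, a1, b0, b1 in ops:
        if tag in ('equal', 'replace'):
            # Assign each surface phone to the morpheme of the canonical
            # phone it aligns with (clamping when the spans differ in length).
            for k in range(b1 - b0):
                variants[owner[min(a0 + k, a1 - 1)]].append(eojeol_phones[b0 + k])
        elif tag == 'insert':
            # Epenthetic surface phones: attach to the morpheme on the left.
            variants[owner[max(a0 - 1, 0)]].extend(eojeol_phones[b0:b1])
        # 'delete': canonical phones with no surface realization are dropped.
    return variants


def tagged_units(morphs, variants):
    """Form pronunciation-dependent recognition units, e.g. 'ro/n o'."""
    return [f"{m}/{' '.join(v)}" for m, v in zip(morphs, variants)]


if __name__ == "__main__":
    # Toy example of Korean nasalization: jong + ro realized as [jong no].
    morphs = ["jong", "ro"]
    canonical = [["j", "o", "ng"], ["r", "o"]]
    surface = ["j", "o", "ng", "n", "o"]
    print(tagged_units(morphs, split_eojeol_pron(surface, canonical)))
    # -> ['jong/j o ng', 'ro/n o']
```

In this scheme, tagged units such as ro/n o and ro/r o would enter the lexicon as distinct pronunciation-dependent entries, so the language model can learn which variant each context prefers.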

Keywords
