A study on recognition improvement of velopharyngeal insufficiency patient's speech using various types of deep neural network

  • Kim, Min-seok (Department of Computer Science and Engineering, Incheon National University) ;
  • Jung, Jae-hee (Department of Computer Science and Engineering, Incheon National University) ;
  • Jung, Bo-kyung (Department of Computer Science and Engineering, Incheon National University) ;
  • Yoon, Ki-mu (Department of Computer Science and Engineering, Incheon National University) ;
  • Bae, Ara (Department of Computer Science and Engineering, Incheon National University) ;
  • Kim, Wooil (Department of Computer Science and Engineering, Incheon National University)
  • Received : 2019.10.08
  • Accepted : 2019.10.29
  • Published : 2019.11.30

Abstract

This paper proposes speech recognition systems that combine Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) structures with a Hidden Markov Model (HMM) to effectively recognize the speech of patients with VeloPharyngeal Insufficiency (VPI), and compares their recognition performance with that of conventional Gaussian Mixture Model (GMM-HMM) and fully-connected Deep Neural Network (DNN-HMM) based systems. The initial models are trained on normal speakers' utterances of the PBW452 word set, a prior model for speaker adaptation is then generated from normal speakers' simulated VPI speech, and the models are further adapted on the VPI patients' own speech. During VPI speaker adaptation, only selected layers are retrained in the CNN-HMM based model, while dropout regularization is applied in the LSTM-HMM based model, yielding a 3.68 % improvement in recognition accuracy over the conventional fully-connected DNN-HMM recognizer. These results demonstrate that the proposed LSTM-HMM based hybrid recognition approach is effective for building a speech recognition system for VPI patients' speech, for which large amounts of data are difficult to obtain, compared with the conventional GMM-HMM and fully-connected DNN-HMM systems.
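
The two adaptation strategies described above, retraining only selected layers of the CNN-HMM acoustic model and applying dropout regularization when adapting the LSTM-HMM model, can be illustrated with a minimal PyTorch sketch. The frozen/adapted layer split, layer sizes, number of tied HMM states, and dropout rate below are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch of the two adaptation strategies; architecture sizes,
# the frozen/adapted layer split, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class CNNAcousticModel(nn.Module):
    """Toy CNN acoustic model: conv front end + classifier over tied HMM states."""
    def __init__(self, n_states: int = 1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 512), nn.ReLU(),
            nn.Linear(512, n_states),   # logits over tied HMM states
        )

    def forward(self, x):               # x: (batch, 1, freq_bins, frames)
        return self.classifier(self.conv(x))

def adaptable_parameters(model: CNNAcousticModel):
    """Freeze the conv front end; adapt only the classifier on VPI data."""
    for p in model.conv.parameters():
        p.requires_grad = False
    return list(model.classifier.parameters())

class LSTMAcousticModel(nn.Module):
    """Toy LSTM acoustic model; dropout regularizes small-data adaptation."""
    def __init__(self, n_feats: int = 40, n_states: int = 1000, p_drop: float = 0.3):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, 512, num_layers=2,
                            batch_first=True, dropout=p_drop)
        self.out = nn.Linear(512, n_states)

    def forward(self, x):               # x: (batch, frames, n_feats)
        h, _ = self.lstm(x)
        return self.out(h)              # per-frame state logits

# Adaptation step: optimize only the unfrozen CNN layers.
cnn = CNNAcousticModel()
optimizer = torch.optim.SGD(adaptable_parameters(cnn), lr=1e-3)
```

Freezing the front end keeps the small VPI adaptation set from overwriting the feature detectors learned from the much larger normal-speech corpus, while dropout plays a similar regularizing role in the LSTM model.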
