Effective Recognition of Velopharyngeal Insufficiency (VPI) Patient's Speech Using DNN-HMM-based System

  • Yoon, Ki-mu (Department of Computer Science and Engineering, Incheon National University)
  • Kim, Wooil (Department of Computer Science and Engineering, Incheon National University)
  • Received : 2018.10.31
  • Accepted : 2018.12.03
  • Published : 2019.01.31

Abstract

This paper proposes an effective method for recognizing the speech of velopharyngeal insufficiency (VPI) patients using a DNN-HMM-based speech recognition system, and evaluates its recognition performance against a GMM-HMM-based system. The proposed method employs a speaker adaptation technique to improve VPI speech recognition. To make effective use of the small amount of VPI speech available for model adaptation, we propose to generate a prior model for speaker adaptation from simulated VPI speech and to selectively retrain the weight matrices of the DNN. We also apply a Linear Input Network (LIN)-based model adaptation technique to the DNN model. The proposed speaker adaptation method brings a 2.35% improvement in average word accuracy over the GMM-HMM-based ASR system. The experimental results demonstrate that, given only a small amount of speech data, the proposed DNN-HMM-based speech recognition system handles VPI speech more effectively than the conventional GMM-HMM system.
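The selective learning of DNN weight matrices described above (freezing the prior model and retraining only chosen layers on a patient's adaptation data) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the framework (PyTorch), the layer sizes, and the choice to adapt only the first hidden layer are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical feed-forward acoustic model: input features -> senone posteriors.
# Layer sizes are illustrative, not taken from the paper.
model = nn.Sequential(
    nn.Linear(39, 1024), nn.Sigmoid(),    # hidden layer 1
    nn.Linear(1024, 1024), nn.Sigmoid(),  # hidden layer 2
    nn.Linear(1024, 1024), nn.Sigmoid(),  # hidden layer 3
    nn.Linear(1024, 2000),                # output layer (senone scores)
)

# Assume `model` was pre-trained on normal speech and then on simulated VPI
# speech (the prior model). For speaker adaptation, freeze everything ...
for p in model.parameters():
    p.requires_grad = False

# ... then selectively unfreeze the weight matrices to be re-learned,
# here (as an assumption) only the first hidden layer.
for p in model[0].parameters():
    p.requires_grad = True

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One adaptation step on a toy batch of patient features and senone labels.
feats = torch.randn(32, 39)
labels = torch.randint(0, 2000, (32,))
optimizer.zero_grad()
loss = criterion(model(feats), labels)
loss.backward()
optimizer.step()
```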

In this paper, we build a DNN-HMM hybrid speech recognition system to recognize VPI patients' speech effectively and compare its performance with a conventional GMM-HMM-based system. An initial model is trained on a clean-speech database of normal speakers, and simulated VPI speech from normal speakers is used to generate a prior model for speaker adaptation to VPI patients' speech. During speaker adaptation for a VPI patient, the weight matrices of the individual DNN layers are selectively retrained; the observed performance exceeds that of the GMM-HMM recognizer. For further improvement, DNN model adaptation is applied, and the LIN-based DNN adaptation yields an average 2.35% improvement in recognition accuracy. Moreover, with only a small amount of data, the DNN-HMM-based recognition approach shows better VPI speech recognition performance than the GMM-HMM-based approach.
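Linear Input Network (LIN) adaptation keeps the pre-trained DNN fixed and learns only a linear transform of the input features, initialized to the identity so that training starts from the unadapted model. A minimal sketch, again with an assumed PyTorch model and illustrative dimensions:

```python
import torch
import torch.nn as nn

FEAT_DIM = 39  # assumed input feature dimension

# LIN: a square linear transform on the input features, initialized to the
# identity so adaptation starts from the unadapted model.
lin = nn.Linear(FEAT_DIM, FEAT_DIM)
nn.init.eye_(lin.weight)
nn.init.zeros_(lin.bias)

# `base_model` stands in for the frozen, pre-trained DNN acoustic model.
base_model = nn.Sequential(nn.Linear(FEAT_DIM, 1024), nn.Sigmoid(),
                           nn.Linear(1024, 2000))
for p in base_model.parameters():
    p.requires_grad = False

adapted = nn.Sequential(lin, base_model)

# Only the LIN parameters are trained on the patient's adaptation data.
optimizer = torch.optim.SGD(lin.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(32, FEAT_DIM)        # toy adaptation batch
labels = torch.randint(0, 2000, (32,))
optimizer.zero_grad()
loss = criterion(adapted(feats), labels)
loss.backward()
optimizer.step()
```

Because only a FEAT_DIM x FEAT_DIM transform (plus bias) is learned, LIN has very few free parameters, which is why it suits small adaptation sets such as VPI patient recordings.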

Table 1. GMM-HMM-based speech recognition results of speaker adaptation (word accuracy, %)

Table 2. DNN-HMM-based speech recognition results of speaker adaptation (word accuracy, %)

Table 3. Speech recognition results of LIN-based DNN adaptation (word accuracy, %)

Table 4. Speech recognition results of LON-based DNN adaptation (word accuracy, %)

Table 5. Speech recognition results of speaker adaptation for different amounts of data (word accuracy, %)
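All five tables report word accuracy. For reference, the conventional definition, with $N$ reference words and $S$, $D$, and $I$ counting substitution, deletion, and insertion errors, is

$$\mathrm{Word\ Accuracy} = \frac{N - S - D - I}{N} \times 100\,\%$$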
