Compromised feature normalization method for deep neural network based speech recognition

  • Kim, Min Sik (Department of Electronics Engineering, Pusan National University)
  • Kim, Hyung Soon (Department of Electronics Engineering, Pusan National University)
  • Received : 2020.07.31
  • Accepted : 2020.09.18
  • Published : 2020.09.30

Abstract

Feature normalization is a method to reduce the effect of environmental mismatch between training and test conditions by normalizing the statistical characteristics of acoustic feature parameters. It has demonstrated excellent performance improvements in traditional Gaussian mixture model-hidden Markov model (GMM-HMM)-based speech recognition systems. However, in a deep neural network (DNN)-based speech recognition system, minimizing the effect of environmental mismatch does not necessarily lead to the best performance improvement. In this paper, we attribute this phenomenon to information loss caused by excessive feature normalization, and we investigate whether there is a feature normalization method that maximizes speech recognition performance by appropriately reducing the impact of environmental mismatch while preserving information useful for training acoustic models. To this end, we introduce mean and exponentiated variance normalization (MEVN), a compromise between mean normalization (MN) and mean and variance normalization (MVN), and compare the performance of a DNN-based speech recognition system in noisy and reverberant environments according to the degree of variance normalization. Experimental results show that, depending on the degree of variance normalization, MEVN yields a slight performance improvement over both MN and MVN.

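The abstract does not spell out the MEVN formula, so the following Python sketch shows one plausible reading rather than the authors' implementation: each feature dimension is shifted by its utterance-level mean and divided by its standard deviation raised to an exponent alpha in [0, 1], so that alpha = 0 corresponds to MN and alpha = 1 to MVN. The function name `mevn` and the parameter `alpha` are placeholders introduced here for illustration.

```python
import numpy as np


def mevn(features, alpha=0.5, eps=1e-8):
    """Apply mean and exponentiated variance normalization (MEVN) to one utterance.

    `features` is a (num_frames, num_dims) array of acoustic features
    (e.g., MFCCs or log filterbank energies). The exponent `alpha` controls
    the degree of variance normalization: alpha = 0 reduces to mean
    normalization (MN) and alpha = 1 to mean and variance normalization (MVN).
    This exact form is an assumption; the paper's formulation is not given
    in the abstract.
    """
    mu = features.mean(axis=0, keepdims=True)           # per-dimension mean
    sigma = features.std(axis=0, keepdims=True) + eps   # per-dimension standard deviation
    return (features - mu) / (sigma ** alpha)           # partial (exponentiated) variance normalization


if __name__ == "__main__":
    # Toy example: 300 frames of 40-dimensional features with nonzero mean and scale.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=3.0, size=(300, 40))
    for a in (0.0, 0.5, 1.0):  # MN, an intermediate setting, MVN
        y = mevn(x, alpha=a)
        print(f"alpha={a}: mean={y.mean():.3f}, std={y.std():.3f}")
```

With alpha = 0 only the mean is removed and the original scale is kept; with alpha = 1 the output has unit variance; intermediate values trade off mismatch reduction against preserving the feature dynamics, which is the compromise the paper explores.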
