LSTM RNN-based Korean Speech Recognition System Using CTC

  • Lee, Donghyun (Department of Computer Science and Engineering, Sogang University) ;
  • Lim, Minkyu (Department of Computer Science and Engineering, Sogang University) ;
  • Park, Hosung (Department of Computer Science and Engineering, Sogang University) ;
  • Kim, Ji-Hwan (Department of Computer Science and Engineering, Sogang University)
  • Received : 2017.01.19
  • Accepted : 2017.02.25
  • Published : 2017.02.28

Abstract

A hybrid approach using a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) has shown great improvement in speech recognition accuracy. However, training an acoustic model under the hybrid approach requires a forced alignment of Hidden Markov Model (HMM) states produced by a Gaussian Mixture Model (GMM)-HMM, and training the GMM-HMM itself demands considerable computation time. To improve training speed, this paper proposes an end-to-end approach for LSTM RNN-based Korean speech recognition, implemented with the Connectionist Temporal Classification (CTC) algorithm. The proposed method achieved a recognition rate nearly equal to that of the conventional hybrid method, while training was 1.27 times faster.
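
To make the contrast in the abstract concrete, the following is a minimal sketch of CTC-based end-to-end training for a bidirectional LSTM acoustic model. It is illustrative only, not the authors' implementation: the choice of PyTorch, the network sizes, the 40-dimensional input features, and the 50-symbol label inventory are all assumptions. The point it demonstrates is that CTC needs only the utterance-level label sequence, whereas the hybrid approach would first need a frame-level HMM state alignment from a trained GMM-HMM.

    import torch
    import torch.nn as nn

    FEAT_DIM = 40        # assumed: 40-dim filterbank features per frame
    NUM_LABELS = 50      # assumed: size of the Korean grapheme/phone set
    BLANK = 0            # index reserved for the CTC blank symbol

    class CTCAcousticModel(nn.Module):
        """Bidirectional LSTM that emits per-frame label posteriors."""
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(FEAT_DIM, 256, num_layers=3,
                                bidirectional=True, batch_first=True)
            # Project to NUM_LABELS real labels + 1 blank
            self.proj = nn.Linear(2 * 256, NUM_LABELS + 1)

        def forward(self, x):                    # x: (batch, time, feat)
            h, _ = self.lstm(x)
            return self.proj(h).log_softmax(-1)  # (batch, time, labels+1)

    model = CTCAcousticModel()
    ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Dummy batch: 2 utterances, 100 frames each, 20-label transcripts.
    # No frame-level forced alignment is supplied: CTC sums over all
    # valid alignments between frames and the label sequence internally.
    x = torch.randn(2, 100, FEAT_DIM)
    targets = torch.randint(1, NUM_LABELS + 1, (2, 20))  # labels, no blank
    input_lengths = torch.full((2,), 100, dtype=torch.long)
    target_lengths = torch.full((2,), 20, dtype=torch.long)

    optimizer.zero_grad()
    log_probs = model(x).transpose(0, 1)  # CTCLoss wants (time, batch, C)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    optimizer.step()

In a hybrid setup, the same network would instead be trained with a per-frame cross-entropy loss against GMM-HMM-aligned HMM states; that alignment step is what the end-to-end approach removes, at the reported 1.27x training speedup with a comparable recognition rate.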

