Korean speech recognition using deep learning

Lee, Suji;Han, Seokjin;Park, Sewon;Lee, Kyeongwon;Lee, Jaeyong;

doi:10.5351/KJAS.2019.32.2.213

응용통계연구 (The Korean Journal of Applied Statistics)

제32권2호
/
Pages.213-227
/
2019
/
1225-066X(pISSN)
/
2383-5818(eISSN)

한국통계학회 (The Korean Statistical Society)

DOI QR Code

딥러닝 모형을 사용한 한국어 음성인식

Korean speech recognition using deep learning

이수지 (서울대학교 통계학과) ;
한석진 (서울대학교 통계학과) ;
박세원 (서울대학교 통계학과) ;
이경원 (서울대학교 통계학과) ;
이재용 (서울대학교 통계학과)

Lee, Suji (Department of Statistics, Seoul National University) ;
Han, Seokjin (Department of Statistics, Seoul National University) ;
Park, Sewon (Department of Statistics, Seoul National University) ;
Lee, Kyeongwon (Department of Statistics, Seoul National University) ;
Lee, Jaeyong (Department of Statistics, Seoul National University)

투고 : 2018.12.21
심사 : 2019.02.14
발행 : 2019.04.30

https://doi.org/10.5351/KJAS.2019.32.2.213 인용 PDF KSCI HTML

PDF 다운로드

⟨ 이전 논문 다음 논문 ⟩

초록

본 논문에서는 베이즈 신경망을 결합한 종단 간 딥러닝 모형을 한국어 음성인식에 적용하였다. 논문에서는 종단 간 학습 모형으로 연결성 시계열 분류기(connectionist temporal classification), 주의 기제, 그리고 주의 기제에 연결성 시계열 분류기를 결합한 모형을 사용하였으며. 각 모형은 순환신경망(recurrent neural network) 혹은 합성곱신경망(convolutional neural network)을 기반으로 하였다. 추가적으로 디코딩 과정에서 빔 탐색과 유한 상태 오토마타를 활용하여 자모음 순서를 조정한 최적의 문자열을 도출하였다. 또한 베이즈 신경망을 각 종단 간 모형에 적용하여 일반적인 점 추정치와 몬테카를로 추정치를 구하였으며 이를 기존 종단 간 모형의 결괏값과 비교하였다. 최종적으로 본 논문에 제안된 모형 중에 가장 성능이 우수한 모형을 선택하여 현재 상용되고 있는 Application Programming Interface (API)들과 성능을 비교하였다. 우리말샘 온라인 사전 훈련 데이터에 한하여 비교한 결과, 제안된 모형의 word error rate (WER)와 label error rate (LER)는 각각 26.4%와 4.58%로서 76%의 WER와 29.88%의 LER 값을 보인 Google API보다 월등히 개선된 성능을 보였다.

In this paper, we propose an end-to-end deep learning model combining Bayesian neural network with Korean speech recognition. In the past, Korean speech recognition was a complicated task due to the excessive parameters of many intermediate steps and needs for Korean expertise knowledge. Fortunately, Korean speech recognition becomes manageable with the aid of recent breakthroughs in "End-to-end" model. The end-to-end model decodes mel-frequency cepstral coefficients directly as text without any intermediate processes. Especially, Connectionist Temporal Classification loss and Attention based model are a kind of the end-to-end. In addition, we combine Bayesian neural network to implement the end-to-end model and obtain Monte Carlo estimates. Finally, we carry out our experiments on the "WorimalSam" online dictionary dataset. We obtain 4.58% Word Error Rate showing improved results compared to Google and Naver API.

키워드

GCGHDE_2019_v32n2_213_f0001.png 이미지

Figure 2.1. Sequence to sequence model.

GCGHDE_2019_v32n2_213_f0002.png 이미지

Figure 2.2. Attention model (Bahdanau et al., 2014).

GCGHDE_2019_v32n2_213_f0003.png 이미지

Figure 4.1. Mel-frequency cepstral coeﬃcients.

GCGHDE_2019_v32n2_213_f0004.png 이미지

Figure 4.2. The structure of the encoder.

GCGHDE_2019_v32n2_213_f0005.png 이미지

Figure 4.3. A ﬁnite automata that searches for correct Korean strings.

Table 5.1. Performance comparison between end-to-end deep learning models

GCGHDE_2019_v32n2_213_t0001.png 이미지

Table 5.2. Performance comparison when adding a ﬁnite automata language model

GCGHDE_2019_v32n2_213_t0002.png 이미지

Table 5.3. Performance comparison with commercial API

GCGHDE_2019_v32n2_213_t0003.png 이미지

참고문헌

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 5, 157-166. https://doi.org/10.1109/72.279181
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks, arXiv preprint, arXiv:1505.05424
Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 4960-4964. IEEE, 2016.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, arXiv preprint, arXiv:1406.1078
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint, arXiv:1412.3555
Gal, Y. and Ghahramani, Z. (2016a). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050-1059.
Gal, Y. and Ghahramani, Z. (2016b). A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, 1019-1027.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning (Vol. 1), MIT press, Cambridge.
Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, 369-376. ACM.
Gales, M., and Young, S. (2008). The application of hidden Markov models in speech recognition, Foundations and Trends ${\mu}l$ kpa in Signal Processing, 1, 195-304. https://doi.org/10.1561/2000000004
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice hall PTR, New Jersey.
Jelinek, F. (1997). Statistical Methods for Speech Recognition, MIT press, Cambridge.
Kim, S., Hori, T., andWatanabe, S. (2017). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 4835-4839. IEEE.
Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kwon, O. W. and Park, J. (2003). Korean large vocabulary continuous speech recognition with morphemebased recognition units, Speech Communication, 39, 287-300. https://doi.org/10.1016/S0167-6393(02)00031-6
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025

응용통계연구 (The Korean Journal of Applied Statistics)

딥러닝 모형을 사용한 한국어 음성인식

Korean speech recognition using deep learning

초록

키워드

참고문헌

자세히 찾기