An Automatic Data Construction Approach for Korean Speech Command Recognition

  • Lim, Yeonsoo (Dept. of Computer Engineering, Kumoh National Institute of Technology)
  • Seo, Deokjin (Dept. of Computer Engineering, Kumoh National Institute of Technology)
  • Park, Jeong-sik (Dept. of English Linguistics & Language Technology, Hankuk University of Foreign Studies)
  • Jung, Yuchul (Dept. of Computer Engineering, Kumoh National Institute of Technology)
  • Received: 2019.11.26
  • Accepted: 2019.12.20
  • Published: 2019.12.31

Abstract

The biggest problem in AI, a field that has become a hot topic in recent years, is how to deal with the lack of training data. Because manual data construction takes a great deal of time and effort, it is difficult for an individual to build the necessary data. Automatic data construction, on the other hand, must handle data quality issues. In this paper, we introduce a method that automatically extracts from the web the data required to develop a Korean speech command recognizer and automatically selects the data that can be used for training. In particular, we propose a modified ResNet model that performs well on the automatically constructed Korean speech command data. We conducted experiments on command sets from the health and daily life domains to show the applicability of the approach. In a series of experiments using only automatically constructed data, accuracy reached 89.5% with ResNet15 in the health domain and 82% with ResNet8 in the daily life domain.

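The biggest problem in AI, a field that has recently become a hot topic, is arguably the shortage of training data. Because manual data construction requires a great deal of time and effort, it is very difficult for an individual to build the necessary data. For automatic construction, by contrast, the key is maintaining high quality. This paper introduces a method that automatically extracts from the web the data needed to develop a Korean speech command recognizer and automatically selects the data that can be used for training. In particular, building on a modified ResNet-based model that performs well on automatically constructed Korean speech data, we conducted experiments on command sets from the health and daily life domains to show the applicability of the approach. In a series of experiments using only automatically constructed data, accuracy reached 89.5% with ResNet15 in the health domain and 82% with ResNet8 in the daily life domain, verifying that automatically collected data can be put to use.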
