Performance comparison of wake-up-word detection on mobile devices using various convolutional neural networks

  • Sanghong Kim (Department of Electronic Engineering, Inha University)
  • Bowon Lee (Department of Electronic Engineering, Inha University)
  • Received : 2020.05.21
  • Accepted : 2020.07.07
  • Published : 2020.09.30

Abstract

Artificial intelligence assistants that provide speech recognition operate through highly accurate cloud-based speech recognition. In cloud-based speech recognition, Wake-Up-Word (WUW) detection plays an important role in activating devices on standby. In this paper, we compare the performance of Convolutional Neural Network (CNN)-based low-complexity WUW detection models for mobile devices on Google's publicly available Speech Commands dataset, using spectrogram and mel-frequency cepstral coefficient (MFCC) features as inputs. The architectures compared are the multi-layer perceptron, a general convolutional neural network, VGG16, VGG19, ResNet50, ResNet101, ResNet152, and MobileNet. We also propose a network that reduces the model size to 1/25 of MobileNet while maintaining its performance.
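
As a rough illustration of the front end mentioned in the abstract, the sketch below converts a one-second, 16 kHz Speech Commands clip into a spectrogram and MFCC features using TensorFlow's tf.signal ops (TensorFlow is the framework cited in reference 11). The 25 ms window, 10 ms hop, 80 mel bins, and 40 retained coefficients are illustrative assumptions, not parameters reported in the paper.

```python
import tensorflow as tf

SAMPLE_RATE = 16000   # Speech Commands clips are 1-second, 16 kHz mono
NUM_MFCC = 40         # assumed number of retained cepstral coefficients

def waveform_to_features(waveform):
    """Return (spectrogram, mfcc) for a [16000] float32 waveform."""
    # 25 ms windows with a 10 ms hop are common choices; the paper's exact
    # framing parameters are not given in the abstract.
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160)
    spectrogram = tf.abs(stft)                        # [frames, freq_bins]

    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=80,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=SAMPLE_RATE,
        lower_edge_hertz=20.0,
        upper_edge_hertz=7600.0)
    log_mel = tf.math.log(tf.tensordot(spectrogram, mel_matrix, axes=1) + 1e-6)
    mfcc = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :NUM_MFCC]
    return spectrogram, mfcc

if __name__ == "__main__":
    clip = tf.random.normal([SAMPLE_RATE])            # stand-in for a real clip
    spec, mfcc = waveform_to_features(clip)
    print(spec.shape, mfcc.shape)                     # (98, 257) and (98, 40)
```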

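To make the model side concrete, here is a minimal sketch of a compact keyword-spotting classifier built from depthwise-separable convolutions in the spirit of MobileNet, taking the MFCC "image" from the previous sketch as input. The layer widths, block count, input shape, and 12-way class split (ten keywords plus "silence" and "unknown") are assumptions for illustration; they are not the exact configuration of the proposed 1/25-size network.

```python
import tensorflow as tf

NUM_CLASSES = 12   # assumed split: 10 keywords + "silence" + "unknown"

def separable_block(x, filters, strides=1):
    """Depthwise-separable convolution block in the MobileNet style."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=strides, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 1, padding="same")(x)  # pointwise 1x1
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def build_small_kws_model(input_shape=(98, 40, 1)):
    """Narrow MobileNet-like classifier over an MFCC feature map."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same")(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Far fewer and narrower blocks than standard MobileNet, so the
    # parameter count drops sharply.
    for filters, strides in [(32, 1), (64, 2), (64, 1), (128, 2)]:
        x = separable_block(x, filters, strides)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

if __name__ == "__main__":
    model = build_small_kws_model()
    model.summary()   # compare the parameter count against full MobileNet
```

Replacing each standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution, and narrowing the channels, is how MobileNet-style models cut parameters with little accuracy loss; shrinking such a network further is the same general direction as the 1/25-size model described in the abstract.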


References

  1. B. H. Juang and L. R. Rabiner, "Hidden Markov models for speech recognition," Technometrics, 33, 251-272 (1991). https://doi.org/10.1080/00401706.1991.10484833
  2. C. Cortes and V. Vapnik, "Support-vector networks," Machine learning, 20, 273-297 (1995). https://doi.org/10.1007/BF00994018
  3. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. the IEEE. 86, 2278-2324 (1998). https://doi.org/10.1109/5.726791
  4. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," Proc. the IEEE CVPR. 1-9 (2015).
  5. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861 (2017).
  6. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 4510-4520 (2018).
  7. B. Logan, "Mel frequency cepstral coefficients for music modeling," Proc. ISMIR, 1-11 (2000).
  8. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014).
  9. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. the IEEE Conf. CVPR. 770-778 (2016).
  10. P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209 (2018).
  11. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: A system for large-scale machine learning," Proc. the 12th USENIX symposium on OSDI. 265-283 (2016).
  12. F. Provost and R. Kohavi, "Guest editors' introduction: On applied research in machine learning," Machine learning, 30, 127-132 (1998). https://doi.org/10.1023/A:1007442505281