
Speech Recognition Model Based on CNN using Spectrogram


  • Won-Seog Jeong (AICube Co., Ltd.) ;
  • Haeng-Woo Lee (Dept. of Intelligent Information and Communication Engineering, Namseoul University)
  • Received : 2024.06.03
  • Accepted : 2024.08.12
  • Published : 2024.08.31

Abstract

In this paper, we propose a new convolutional neural network (CNN) model to improve the recognition performance of spoken command signals. The method first applies a short-time Fourier transform (STFT) to the input signal to obtain a spectrogram image, then performs supervised multi-class learning on these images with a CNN. Converting the time-domain speech signal to the frequency domain represents its characteristics well, so deep-learning training on the resulting spectrogram images classifies commands effectively. To verify the performance of the proposed speech recognition system, a simulation program was written with the TensorFlow and Keras libraries and a simulation experiment was performed. The experiment confirmed that the proposed deep-learning algorithm achieves an accuracy of 92.5%.
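The pipeline the abstract describes (STFT, spectrogram image, multi-class CNN) can be sketched with the TensorFlow and Keras libraries the paper names. The sketch below is illustrative only, not the paper's configuration: the 16 kHz / 1 s input, the STFT frame sizes, the layer layout, and NUM_CLASSES are all assumptions.

```python
# Minimal sketch of an STFT-spectrogram + CNN command classifier.
# All sizes below are assumptions for illustration, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10     # assumed number of command words
SAMPLES = 16000      # assumed 1 s of mono audio at 16 kHz

def to_spectrogram(waveform):
    """Short-time Fourier transform -> magnitude spectrogram 'image'."""
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)           # keep only the magnitude of each bin
    return spectrogram[..., tf.newaxis]  # add a channel axis for Conv2D

# With the sizes above, each clip becomes a (124, 129, 1) "image":
# 124 STFT frames by 129 frequency bins, one channel.
model = models.Sequential([
    layers.Input(shape=(124, 129, 1)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation='softmax'),  # one probability per command
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer command labels
              metrics=['accuracy'])
```

Because the spectrogram is handled exactly like a single-channel image, any image-classification CNN could stand in for the stack above; the softmax output assigns each utterance to one of the command classes.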


