A Study on a Non-Voice Section Detection Model among Speech Signals using CNN Algorithm

  • 이후영 (이르테크 Corporate Research Institute)
  • Received : 2021.03.29
  • Accepted : 2021.06.20
  • Published : 2021.06.28

Abstract

Speech recognition technology is advancing rapidly through its combination with deep learning. Voice recognition services now run on a wide range of devices, including AI speakers, in-vehicle systems, and smartphones, so the technology is used in many settings rather than in a few specific industries. Research aimed at meeting the resulting high expectations is correspondingly active. Within natural language processing (NLP), one area that still needs work is the removal of ambient noise and unnecessary voice signals, both of which strongly affect the recognition rate. Many companies in Korea and abroad already apply recent AI techniques to this problem, and approaches based on the convolutional neural network (CNN) are being studied especially actively. The purpose of this study is to identify non-voice sections within a user's utterances using a CNN. Voice files (wav) from five speakers were collected to generate training data, and a CNN was used to build a classification model that discriminates voice sections from non-voice sections. In an experiment that used the resulting model to detect non-voice sections, an accuracy of 94% was obtained.
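The pipeline summarized in the abstract (cut recordings into sections, pass each section through convolutional processing, label it as voice or non-voice, then measure accuracy) can be illustrated with a minimal sketch. Everything specific below is an assumption for illustration only: the abstract does not state a sample rate, section length, input features, or network architecture, and the kernel and threshold here are hand-picked rather than learned. A trained CNN would instead learn many kernels from labelled sections of the five speakers' wav files.

```python
import math

SAMPLE_RATE = 16000  # assumed; not stated in the abstract
FRAME_LEN = 400      # assumed section length (25 ms at 16 kHz)


def split_into_frames(samples, frame_len):
    """Cut a waveform into consecutive, non-overlapping frames (remainder dropped)."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, the core operation of a convolutional layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def global_average_pool(values):
    """Reduce a feature map to a single number per frame."""
    return sum(values) / len(values)


def frame_score(frame, kernel):
    """Convolution -> ReLU -> pooling: one feature-extraction stage."""
    return global_average_pool(relu(conv1d(frame, kernel)))


def classify_frames(samples, kernel, threshold, frame_len=FRAME_LEN):
    """Label each frame 1 (voice) or 0 (non-voice) by thresholding its score."""
    return [1 if frame_score(frame, kernel) > threshold else 0
            for frame in split_into_frames(samples, frame_len)]


def accuracy(predicted, expected):
    """Fraction of frames whose predicted label matches the reference label."""
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Synthetic stand-in for a recording: silence, a 200 Hz tone, silence.
silence = [0.0] * (2 * FRAME_LEN)
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(3 * FRAME_LEN)]
samples = silence + tone + silence
expected = [0, 0, 1, 1, 1, 0, 0]  # reference labels, one per frame

kernel = [-1.0, 0.0, 1.0]  # hand-picked difference kernel, NOT a learned weight
predicted = classify_frames(samples, kernel, threshold=0.01)
print(predicted, accuracy(predicted, expected))
```

The synthetic signal is deliberately easy to separate, so the score on it says nothing about the accuracy reported in the study. The sketch only shows how per-section predictions and an accuracy figure are produced.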
