Spontaneous Speech Emotion Recognition Based On Spectrogram With Convolutional Neural Network

  • Guiyoung Son (Department of Software, Sejong University)
  • Soonil Kwon (Department of Software, Sejong University)
  • Received : 2024.05.02
  • Accepted : 2024.05.20
  • Published : 2024.06.30

Abstract

Speech emotion recognition (SER) is a technique for determining a speaker's emotional state by analyzing voice patterns such as vibration, intensity, and tone. Interest in artificial intelligence (AI) techniques has grown rapidly, and SER is now widely applied in medicine, education, industry, and the military. However, most existing studies have attained their impressive results using acted speech, recorded by trained actors performing scripted scenarios in controlled environments. There is a mismatch between acted and spontaneous speech, since acted speech contains more explicit emotional expressions; for this reason, spontaneous speech emotion recognition remains a challenging task. This paper aims to perform emotion recognition on spontaneous speech data and to improve its performance. To this end, we implement deep learning-based speech emotion recognition with a VGG (Visual Geometry Group) network after converting the one-dimensional audio signal into a two-dimensional time-frequency spectrogram image. The experimental evaluations are performed on the Korean spontaneous emotional speech database from AI-Hub, which covers seven emotions: joy, love, anger, fear, sadness, surprise, and neutral. Using the two-dimensional spectrograms, we achieved average accuracies of 83.5% for adults and 73.0% for adolescents. In conclusion, our findings demonstrate that the suggested framework outperforms current state-of-the-art techniques for spontaneous speech and shows promising performance despite the difficulty of quantifying spontaneous emotional expression.
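As a concrete illustration of the conversion step, the sketch below turns a one-dimensional waveform into a two-dimensional log-magnitude spectrogram image suitable for an image-classification CNN. It is a minimal example, not the authors' code: the librosa/matplotlib toolchain, the 16 kHz sample rate, the STFT parameters, the output size, and the file paths are all assumptions for illustration.

```python
# Minimal sketch: 1-D audio signal -> 2-D spectrogram image.
# Sample rate, STFT parameters, and paths are illustrative assumptions.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def audio_to_spectrogram_image(wav_path, out_path, sr=16000):
    # Load the raw 1-D waveform at the assumed sample rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # Short-time Fourier transform -> complex time-frequency matrix.
    stft = librosa.stft(y, n_fft=1024, hop_length=256)
    # Magnitude on a log (dB) scale, as is typical for spectrograms.
    spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    # Render the 2-D spectrogram as an axis-free image so the saved file
    # can be fed to an image-classification CNN such as VGG.
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # roughly 224x224 px
    librosa.display.specshow(spec_db, sr=sr, hop_length=256, ax=ax)
    ax.set_axis_off()
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# Example with hypothetical paths:
# audio_to_spectrogram_image("utterance.wav", "utterance.png")
```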

Speech emotion recognition (SER) is a technology that determines a user's emotional state by analyzing voice patterns such as tremor, tone, and loudness. However, existing SER research has centered on acted speech, i.e., recordings of trained actors performing scripted scenarios in controlled environments, and achieves high performance on it; in contrast, spontaneous speech emotion recognition takes place in uncontrolled everyday environments and therefore shows markedly lower performance than recognition of acted speech. In this paper, we perform emotion recognition on everyday spontaneous speech and seek to improve its performance. For the performance evaluation, we used the Korean spontaneous conversational speech data provided by AI Hub, and for deep learning training we converted the one-dimensional speech signal into a two-dimensional time-frequency spectrogram image. The generated images were trained with VGG (Visual Geometry Group), a CNN-based transfer-learning network model, yielding emotion recognition accuracies of 83.5% for adults and 73.0% for adolescents across seven emotions (joy, love, anger, fear, sadness, neutral, surprise). Although this performance is lower than that of acted-speech-based emotion recognition, this study is significant in that it performed emotion recognition on conversations recorded in daily life, despite the difficulty of defining quantifiable acoustic features of spontaneous emotional expression.
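Similarly, the transfer-learning step can be sketched with an ImageNet-pretrained VGG16 backbone in TensorFlow/Keras. This is an assumed configuration for illustration only: the abstract does not specify which VGG variant, input size, or classification head was used, so the frozen base, 224x224 input, and dense-layer sizes below are placeholders.

```python
# Minimal sketch: VGG transfer learning for 7-class emotion classification.
# VGG16, the frozen base, and the head sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_EMOTIONS = 7  # joy, love, anger, fear, sadness, neutral, surprise

# Pretrained convolutional base with the ImageNet classification head removed.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # transfer learning: freeze the pretrained features

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# In a real run (hypothetical names), the spectrogram images produced in the
# previous step and their one-hot emotion labels would be passed to fit():
# model.fit(train_spectrogram_images, train_labels, epochs=..., batch_size=...)
```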

Acknowledgement

This work was supported by the Sejong University faculty research fund in 2021 (No. 20211105).
