Speech Emotion Recognition Using 2D-CNN with Mel-Frequency Cepstrum Coefficients

Eom, Youngsik; Bang, Junseong

doi:10.6109/jicce.2021.19.3.148

Journal of information and communication convergence engineering

Vol. 19, No. 3 / Pages 148-154 / 2021 / 2234-8255 (pISSN) / 2234-8883 (eISSN)

The Korea Institute of Information and Communication Engineering

Eom, Youngsik (Department of Electronic and Electrical Engineering, Sungkyunkwan University) ;
Bang, Junseong (Public Safety Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI))

Received: 2021.05.17
Reviewed: 2021.07.23
Published: 2021.09.30

Abstract

With the advent of context-aware computing, many attempts have been made to understand human emotions. Among them, Speech Emotion Recognition (SER) is a method of recognizing a speaker's emotions from speech information. Successful SER depends on selecting distinctive features and classifying them appropriately. In this paper, the performance of SER using neural network models (e.g., a fully connected network (FCN) and a convolutional neural network (CNN)) with Mel-Frequency Cepstral Coefficients (MFCC) is examined in terms of the accuracy and distribution of emotion recognition. On the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, after tuning model parameters, a two-dimensional Convolutional Neural Network (2D-CNN) model with MFCC showed the best performance, with an average accuracy of 88.54% for five emotions (anger, happiness, calm, fear, and sadness) across male and female speakers. In addition, based on the distribution of emotion recognition accuracies across the neural network models, the 2D-CNN with MFCC can be expected to achieve an overall accuracy of 75% or more.
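
The abstract does not state the feature-extraction settings, so the following is only a minimal sketch of how an MFCC matrix of the kind a 2D-CNN could take as input might be computed. It is written in plain Python with a naive DFT so that it is self-contained. The frame length, hop size, number of mel filters, number of coefficients, the synthetic test tone, and all function names are assumptions for illustration, not the authors' configuration. In practice a library routine such as `librosa.feature.mfcc` would normally replace this code.

```python
import math

def hz_to_mel(f):
    # Common (HTK-style) mel-scale formula.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame, naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge points, expressed as (fractional) FFT bin positions.
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            rising = (b - lo) / (mid - lo)
            falling = (hi - b) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def dct_ii(values, n_out):
    """Unnormalised DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_mfcc=13):
    """Return a list of frames, each holding n_mfcc cepstral coefficients."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len])
        # Small constant avoids log(0) for filters that capture no energy.
        log_mel = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                   for row in bank]
        features.append(dct_ii(log_mel, n_mfcc))
    return features

# Hypothetical input: a 440 Hz tone sampled at 8 kHz (not RAVDESS audio).
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(1024)]
feats = mfcc(tone, sr)
print(len(feats), len(feats[0]))  # number of frames, coefficients per frame
```

The result is a frames-by-coefficients matrix. Treating that matrix as a single-channel image is what lets a two-dimensional convolution slide over both time and cepstral index, whereas a fully connected network would see the same values only as a flat vector.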

Funding Information

This research was supported and funded by the Korean National Police Agency. [Pol-Bot Development for Conversational Police Knowledge Services / PR09-01-000-20]
