Speech Emotion Recognition Using 2D-CNN with Mel-Frequency Cepstrum Coefficients

  • Eom, Youngsik (Department of Electronic and Electrical Engineering, Sungkyunkwan University);
  • Bang, Junseong (Public Safety Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI))
  • Received : 2021.05.17
  • Accepted : 2021.07.23
  • Published : 2021.09.30


With the advent of context-aware computing, many attempts have been made to understand emotions. Among these, Speech Emotion Recognition (SER) is a method of recognizing a speaker's emotions from speech information. SER succeeds by selecting distinctive 'features' and 'classifying' them in an appropriate way. In this paper, the performance of SER using neural network models (e.g., a fully connected network (FCN) and a convolutional neural network (CNN)) with Mel-Frequency Cepstral Coefficients (MFCC) is examined in terms of the accuracy and distribution of emotion recognition. On the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, after tuning model parameters, a two-dimensional convolutional neural network (2D-CNN) with MFCC features achieved the best performance, with an average accuracy of 88.54% over five emotions (anger, happiness, calm, fear, and sadness) for both men and women. In addition, an examination of the distribution of emotion recognition accuracies across the neural network models shows that the 2D-CNN with MFCC can be expected to reach an overall accuracy of 75% or more.



This research was supported and funded by the Korean National Police Agency. [Pol-Bot Development for Conversational Police Knowledge Services / PR09-01-000-20]


  1. S. Byun and S. Lee, "Emotion recognition using tone and tempo based on voice for IoT," Trans. of the Korean Institute of Electrical Engineers, vol. 65, no. 1, pp. 116-121, 2016. DOI: 10.5370/kiee.2016.65.1.116.
  2. I. Hong, Y. Ko, Y. Kim, and H. Shin, "A study on the emotional feature composed of the mel-frequency cepstral coefficient and the speech speed," Journal of Computing Science and Engineering, vol. 13, no. 4, pp. 131-140, 2019. DOI: 10.5626/JCSE.2019.13.4.131.
  3. M. S. Likitha, S. R. R. Gupta, K. Hasitha, and A. U. Raju, "Speech based human emotion recognition using MFCC," in 2017 Int. Conf. on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 2257-2260, Mar. 2017. DOI: 10.1109/WiSPNET.2017.8300161.
  4. S. Park, D. Kim, S. Kwon, and N. Park, "Speech emotion recognition based on CNN using spectrogram," in Information and Control Symposium, pp. 240-241, Oct. 2018.
  5. J. Lee, H. Ryu, D. Chang, and M. Koo, "End-to-end Korean speech emotion recognition using deep neural networks," in Korea Computer Congress, pp. 1000-1002, Jun. 2018.
  6. G. Tangriberganov, T. A. Adesuyi, and B. Kim, "A hybrid approach for speech emotion recognition using 1D-CNN LSTM," in Korea Computer Congress, pp. 833-835, Jul. 2020.
  7. G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200-5204, Mar. 2016. DOI: 10.1109/ICASSP.2016.7472669.
  8. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv: 1409.1556, 2014.
  9. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in 2009 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, Jun. 2009. DOI: 10.1109/CVPR.2009.5206848.
  10. J. Lee, U. Yoon, and G. Jo, "CNN-based speech emotion recognition model applying transfer learning and attention mechanism," Journal of KIISE, vol. 47, no. 7, pp. 665-673, 2020. DOI: 10.5626/JOK.2020.47.7.665.
  11. S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE, vol. 13, no. 5, e0196391, May 2018. DOI: 10.1371/journal.pone.0196391.
  12. W. Tang, G. Long, L. Liu, T. Zhou, J. Jiang, and M. Blumenstein, "Rethinking 1D-CNN for time series classification: a stronger baseline," arXiv: 2002.10061, 2020.
  13. L. Huang, J. Dong, D. Zhou, and Q. Zhang, "Speech emotion recognition based on three-channel feature fusion of CNN and BiLSTM," in 2020 the 4th International Conference on Innovation in Artificial Intelligence (ICIAI), pp. 52-58, May 2020. DOI: 10.1145/3390557.3394317.
  14. P. Mishra and R. Sharma, "Gender differentiated convolutional neural networks for speech emotion recognition," in 12th Int. Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), pp. 142-148, Oct. 2020. DOI: 10.1109/ICUMT51630.2020.9222412.
  15. librosa [Internet]. Available: