RoutingConvNet: A Light-weight Speech Emotion Recognition Model Based on Bidirectional MFCC

  • 임현택 (Department of Artificial Intelligence Convergence, Chonnam National University)
  • 김수형 (Department of Artificial Intelligence Convergence, Chonnam National University)
  • 이귀상 (Department of Artificial Intelligence Convergence, Chonnam National University)
  • 양형정 (Department of Artificial Intelligence Convergence, Chonnam National University)
  • Received : 2023.02.17
  • Accepted : 2023.05.31
  • Published : 2023.06.30

Abstract

In this study, we propose RoutingConvNet, a new lightweight model with few parameters, to improve the applicability and practicality of speech emotion recognition. To reduce the number of learnable parameters, the proposed model concatenates bidirectional MFCCs channel-wise, learning long-term emotional dependencies and extracting contextual features. A lightweight deep CNN performs low-level feature extraction, and self-attention captures channel and spatial information in the speech signal. In addition, dynamic routing is applied to improve accuracy and make the model robust to feature variations. Across the speech emotion datasets EMO-DB, RAVDESS, and IEMOCAP, the proposed model reduces parameters while improving accuracy, achieving 87.86%, 83.44%, and 66.06% accuracy, respectively, with about 156,000 parameters. We also propose a metric that quantifies the trade-off between parameter count and accuracy, for evaluating performance relative to model size.
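The abstract only sketches the model's front end, so the following is a minimal illustration rather than the authors' code: it assumes librosa for feature extraction and interprets "bidirectional MFCC" as MFCC maps of the waveform and of its time-reversed copy stacked channel-wise; the function name `bidirectional_mfcc` and the parameter defaults are hypothetical.

```python
# Hypothetical sketch of a "bidirectional MFCC" input: MFCCs are computed
# on the waveform and on its time-reversed copy, then stacked along a
# channel axis. Sampling rate and coefficient count are illustrative only.
import numpy as np
import librosa

def bidirectional_mfcc(path, sr=16000, n_mfcc=40):
    """Return a (2, n_mfcc, frames) array: forward and backward MFCC maps."""
    y, sr = librosa.load(path, sr=sr)
    mfcc_fwd = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # forward in time
    mfcc_bwd = librosa.feature.mfcc(y=y[::-1], sr=sr, n_mfcc=n_mfcc)  # time-reversed copy
    return np.stack([mfcc_fwd, mfcc_bwd], axis=0)                     # channel-wise join
```

A two-channel input of this shape could then feed the CNN, self-attention, and dynamic-routing stages the abstract describes.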

Acknowledgement

This research was conducted as part of the Basic Research Program of the Ministry of Science and ICT and the National Research Foundation of Korea (RS-2023-00219107).
