Development of Emotion Recognition Model Using Audio-video Feature Extraction Multimodal Model


  • Jong-Gu Kim (Department of Electrical and Computer Engineering, Inha University)
  • Jang-Woo Kwon (Department of Computer Engineering, Inha University)
  • Received : 2023.10.10
  • Accepted : 2023.12.28
  • Published : 2023.12.31

Abstract

Physical and mental changes caused by emotions can affect various behaviors, such as driving or learning. Recognizing these emotions is therefore an important task, with applications across industries such as detecting and controlling dangerous emotional states while driving. In this paper, we address the emotion recognition task by implementing a multimodal model that uses both audio and video data, which come from different domains. Using the RAVDESS dataset, we first extract the audio track from each video; features of the audio data are then extracted by a model using a 2D-CNN, while features of the video data are extracted with a SlowFast feature extractor. The information contained in the audio and video data is combined into a single feature vector that carries the information from both domains, and emotion recognition is performed on this combined feature. Finally, we compare the proposed method, which unifies the domains through feature extraction and then classifies the combined features with a classifier, against the conventional approaches of combining the two models' output scores and of voting on the two models' results.
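To make the described pipeline concrete, below is a minimal PyTorch sketch of the feature-level fusion outlined in the abstract. The specific shapes and names are illustrative assumptions, not the paper's exact architecture: the 2D-CNN layout, the 256-dimensional audio feature, the 2304-dimensional video feature (the pooled output size of a SlowFast-R50 backbone), and the 8 emotion classes of RAVDESS. The score_fusion helper sketches the score-averaging baseline used in the comparison.

```python
# Hypothetical sketch of the feature-level fusion described in the abstract.
# Layer sizes, feature dimensions, and class count are assumptions.
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """2D-CNN that maps a spectrogram (B, 1, F, T) to an audio feature vector."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(spec).flatten(1))


class FusionClassifier(nn.Module):
    """Concatenates audio and video features into one vector, then classifies."""

    def __init__(self, audio_dim: int = 256, video_dim: int = 2304, n_classes: int = 8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_feat, video_feat], dim=1)  # feature-level fusion
        return self.head(fused)


def score_fusion(audio_probs: torch.Tensor, video_probs: torch.Tensor) -> torch.Tensor:
    """Late-fusion baseline: average the two unimodal models' class scores."""
    return (audio_probs + video_probs) / 2
```

In this sketch the SlowFast clip feature and the 2D-CNN spectrogram feature for the same utterance would be computed separately and passed to FusionClassifier, which is trained with a standard cross-entropy loss; the late-fusion and voting baselines instead keep two unimodal classifiers and only combine their outputs.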


Acknowledgement

This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) with funding from the Korean government (Ministry of Trade, Industry and Energy) in 2023 (20224B10100060, Development of an AI-based vibration monitoring system for rotating equipment).

References

  1. Bum-Jun Kim et al. (2018). Research on data augmentation methods for audio marking based on deep neural networks. The Journal of the Acoustical Society of Korea, 37(6), 475-482.
  2. Ju-Hyuk Han et al. (2022). MR-CL: Contrastive loss-based multimodal convergence learning method for emotion recognition. Proceedings of the Korean Institute of Information Scientists and Engineers (KIISE) Conference.
  3. Chung-bin Kim et al. (2022). Feature extraction from self-supervised learning models and transformer-based multimodal representation learning for combined voice-text emotion recognition. Proceedings of the Korean Institute of Information Scientists and Engineers (KIISE) Conference.
  4. Yong-hwa Jo et al. (2020). Implementation of a Classification System for Dog Behaviors using YOLO-based Object Detection and a Node.js Server. Journal of Convergence Signal Processing Society, 21(1), 29-37.
  5. Lee-sun Mun et al. (2023). Multimodal emotion recognition based on biometric signals and voice data. Proceedings of the Institute of Control, Robotics and Systems (ICROS) Domestic Conference.
  6. P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller and S. Zafeiriou, "End-to-End Multimodal Emotion Recognition Using Deep Neural Networks," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301-1309, Dec. 2017.
  7. H. Ranganathan, S. Chakraborty and S. Panchanathan, "Multimodal emotion recognition using deep learning architectures," 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 2016, pp. 1-9.
  8. Kahou, S.E., Bouthillier, X., Lamblin, P. et al. EmoNets: Multimodal deep learning approaches for emotion recognition in video. J Multimodal User Interfaces 10, 99-111 (2016).
  9. Xu, Haiyang, et al. "Learning alignment for multimodal emotion recognition from speech." arXiv preprint arXiv:1909.05645 (2019).
  10. Alu, D. A. S. C., Elteto Zoltan, and Ioan Cristian Stoica. "Voice based emotion recognition with convolutional neural networks for companion robots." Romanian Journal of Information Science and Technology 20.3 (2017): 222-240.
  11. Livingstone, Steven R., and Frank A. Russo. "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English." PloS one 13.5 (2018): e0196391.
  12. Feichtenhofer, Christoph, et al. "SlowFast networks for video recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
  13. Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
  14. Schlüter, Jan, and Thomas Grill. "Exploring data augmentation for improved singing voice detection with neural networks." ISMIR. 2015.
  15. Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
  16. Jae-Eun Lee et al. (2022). A neck health warning algorithm for identifying and preventing text neck posture. Journal of Convergence Signal Processing Society, 23(3), 115-122.
  17. Hatami, Nima, Yann Gavet, and Johan Debayle. "Classification of time-series images using deep convolutional neural networks." Tenth International Conference on Machine Vision (ICMV 2017). SPIE, 2018. pp. 242-249.