Efficient Emotion Classification Method Based on Multimodal Approach Using Limited Speech and Text Data

  • Received : 2023.12.26
  • Accepted : 2024.03.18
  • Published : 2024.04.30

Abstract

In this paper, we explore an emotion classification method based on multimodal learning with the wav2vec 2.0 and KcELECTRA models. Multimodal learning, which leverages both speech and text data, is known to improve emotion classification performance significantly compared with methods that rely on speech data alone. To choose the text-processing component, we conduct a comparative analysis of BERT and its derivative models, which are known for their strong performance in natural language processing, and select the model that extracts features from text data most effectively. The results confirm that the KcELECTRA model performs best on the emotion classification task. Furthermore, experiments on datasets released through AI-Hub show that incorporating text data achieves better performance with less data than using speech data alone, with the KcELECTRA-based multimodal model reaching the highest accuracy of 96.57%. These results indicate that multimodal learning can deliver meaningful performance improvements on complex natural language processing tasks such as emotion classification.

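The paper's code is not reproduced here; the following is a minimal sketch of the kind of late-fusion architecture the abstract describes, in which wav2vec 2.0 encodes the speech signal, KcELECTRA encodes the transcript, and the pooled embeddings are concatenated and passed to a small classification head. The Hugging Face checkpoint names (facebook/wav2vec2-base, beomi/KcELECTRA-base), the four-class emotion label set, and the mean-pooling and classifier choices are illustrative assumptions rather than details taken from the paper.

# Minimal sketch (not the authors' implementation) of a late-fusion multimodal
# emotion classifier: wav2vec 2.0 encodes speech, KcELECTRA encodes the transcript,
# and the pooled embeddings are concatenated before a small classification head.
import torch
import torch.nn as nn
from transformers import (
    AutoModel,
    AutoTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Model,
)

SPEECH_CKPT = "facebook/wav2vec2-base"  # assumption: any wav2vec 2.0 checkpoint works here
TEXT_CKPT = "beomi/KcELECTRA-base"      # assumption: a public KcELECTRA checkpoint
NUM_EMOTIONS = 4                        # assumption: e.g. angry / happy / neutral / sad


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.speech_encoder = Wav2Vec2Model.from_pretrained(SPEECH_CKPT)
        self.text_encoder = AutoModel.from_pretrained(TEXT_CKPT)
        fused_dim = (self.speech_encoder.config.hidden_size
                     + self.text_encoder.config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, NUM_EMOTIONS),
        )

    def forward(self, input_values, input_ids, attention_mask):
        # Mean-pool each encoder's frame/token representations into a single vector.
        speech_emb = self.speech_encoder(input_values).last_hidden_state.mean(dim=1)
        text_emb = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state.mean(dim=1)
        fused = torch.cat([speech_emb, text_emb], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)


if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(SPEECH_CKPT)
    tokenizer = AutoTokenizer.from_pretrained(TEXT_CKPT)
    model = MultimodalEmotionClassifier().eval()

    # Random noise stands in for one second of 16 kHz speech; the sentence is a sample transcript.
    waveform = torch.randn(16000)
    audio = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    text = tokenizer("오늘 정말 기분이 좋아요", return_tensors="pt")

    with torch.no_grad():
        logits = model(audio.input_values, text.input_ids, text.attention_mask)
    print(logits.softmax(dim=-1))  # predicted emotion probabilities, shape (1, NUM_EMOTIONS)

In practice, both encoders could be fine-tuned jointly or kept frozen depending on how much labeled data is available; the concatenation step is the simplest fusion choice and marks where the speech-only and multimodal settings compared in the abstract diverge.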

Acknowledgement

This research was supported by the Ministry of Science and ICT (MSIT) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Integrated Bachelor's-Master's ICT Core Talent Development Program (IITP-2024-RS-2023-00260175).
