Audio and Video Bimodal Emotion Recognition in Social Networks Based on Improved AlexNet Network and Attention Mechanism

  • Liu, Min (Software School, Hunan Vocational College of Science and Technology)
  • Tang, Jun (Software School, Hunan Vocational College of Science and Technology)
  • Received : 2020.07.27
  • Accepted : 2020.10.04
  • Published : 2021.08.31

Abstract

In continuous dimensional emotion recognition, the cues that best convey emotion differ across modalities, and each modality contributes differently to the estimated emotional state. This paper therefore studies the fusion of the two most informative modalities for emotion recognition, speech and facial expression, and proposes a bimodal emotion recognition method that combines an improved AlexNet network with an attention mechanism. After simple preprocessing of the audio and video signals, audio features are first extracted using prior acoustic knowledge. Facial expression features are then extracted with the improved AlexNet network. Finally, a multimodal attention mechanism fuses the facial expression features with the audio features, and an improved loss function mitigates the missing-modality problem, improving both the robustness of the model and its recognition performance. Experimental results show that the proposed model achieves concordance correlation coefficients (CCC) of 0.729 and 0.718 in the arousal and valence dimensions, respectively, outperforming several comparison algorithms.
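To make the reported metric concrete, the sketch below computes the concordance correlation coefficient (CCC) used for the arousal and valence results. The function and variable names and the use of NumPy are illustrative assumptions, not part of the paper; only the CCC formula itself is standard.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """CCC between gold-standard and predicted continuous annotations
    (e.g., arousal or valence traces).

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    covariance = ((y_true - mean_true) * (y_pred - mean_pred)).mean()

    return 2 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Example: perfect agreement yields CCC = 1.0
if __name__ == "__main__":
    gold = np.array([0.1, 0.4, -0.2, 0.3])
    print(concordance_correlation_coefficient(gold, gold))  # 1.0
```

In dimensional emotion recognition it is common to optimize 1 - CCC directly as a training loss; this may be related to the improved loss function mentioned above, although the paper's exact formulation is not reproduced here.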


Acknowledgement

This work was supported by the Research Foundation of the Education Bureau of Hunan Province (No. 18B564) and the Science and Technology Plan of the Development and Reform Commission of Hunan Province, China (No. 2013-1199).
