Acknowledgement
This study was supported by the following grants: an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT, MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 60%); a National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (2021R1A4A1030075, 20%); and a GIST-MIT Research Collaboration grant funded by GIST in 2023 (20%).
References
- H. McGurk and J. MacDonald, Hearing lips and seeing voices, Nature 264 (1976), no. 5588, 746-748.
- A. G. Chitu and L. J. M. Rothkrantz, Automatic visual speech recognition, In Speech enhancement, modeling and recognition: algorithms and applications, S. Ramakrishnan (ed.), IntechOpen, London, UK, 2012, 95-120.
- P. Agarwal and S. Kumar, Electroencephalography-based imagined speech recognition using deep long short-term memory network, ETRI J. 44 (2022), no. 4, 672-685.
- L. Sun, Q. Li, S. Fu, and P. Li, Speech emotion recognition based on genetic algorithm-decision tree fusion of deep and acoustic features, ETRI J. 44 (2022), no. 3, 462-475.
- S. Kumar, Real-time implementation and performance evaluation of speech classifiers in speech analysis-synthesis, ETRI J. 43 (2021), no. 1, 82-94.
- S. B. Alex and L. Mary, Variational autoencoder for prosody-based speaker recognition, ETRI J. 45 (2023), no. 4, 678-689.
- B. Shi, W. Hsu, and A. Mohamed, Robust self-supervised audiovisual speech recognition, (Interspeech, Incheon, Rep. of Korea), 2022, DOI 10.21437/Interspeech.2022-99.
- R. Shashidhar, S. Patilkulkarni, and S. B. Puneeth, Combining audio and visual speech recognition using LSTM and deep convolutional neural network, Int. J. Inform. Technol. 14 (2022), 3425-3436.
- D. Serdyuk, O. Braga, and O. Siohan, Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video, (Interspeech, Incheon, Rep. of Korea), 2022, DOI 10.21437/Interspeech.2022-10920.
- C. Deuerlein, M. Langer, J. Sessner, P. Hess, and J. Franke, Human-robot-interaction using cloud-based speech recognition systems, Proc. CIRP 97 (2021), 130-135.
- T. Kimura, T. Nose, S. Hirooka, Y. Chiba, and A. Ito, Comparison of speech recognition performance between Kaldi and Google Cloud Speech API, In Recent advances in intelligent information hiding and multimedia signal processing: proceedings of the fourteenth international conference on intelligent information hiding and multimedia signal processing, November 26-28, 2018, Sendai, Japan, Vol. 110, Springer International Publishing, 2019.
- Q. Song, B. Sun, and S. Li, Multimodal sparse transformer network for audio-visual speech recognition, IEEE Trans. Neural Netw. Learn. Syst. 34 (2022), no. 12, DOI 10.1109/TNNLS.2022.3163771.
- L. D. Terissi, G. D. Sad, and J. C. Gomez, Robust front-end for audio, visual and audio-visual speech classification, Int. J. Speech Technol. 21 (2018), 293-307.
- G. Calvert, C. Spence, and B. E. Stein, The handbook of multisensory processes, MIT Press, London, UK, 2004.
- J. Venezia, W. Matchin, and G. Hickok, Multisensory integration and audiovisual speech perception, Brain Mapp. Encycl. Ref. 2 (2015), 565-572.
- R. Campbell, The processing of audio-visual speech: empirical and neural bases, Philos. Trans. R. Soc. Lond. B Biol. Sci. 363 (2008), no. 1493, 1001-1010.
- Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, LipNet: end-to-end sentence-level lipreading, arXiv preprint, 2016, DOI 10.48550/arXiv.1611.01599.
- Google Assistant, Available at: https://developers.google.com/assistant?hl=ko/ [last accessed 27 April 2022].
- Microsoft Azure Cognitive Services, Available at: https://azure.microsoft.com/en-us/services/cognitive-services/ [last accessed 27 April 2022].
- IBM Watson Speech to Text, Available at: https://www.ibm.com/kr-ko/cloud/watson-speech-to-text [last accessed 27 April 2022].
- Amazon Transcribe, Available at: https://aws.amazon.com/transcribe/ [last accessed 27 April 2022].
- P. Sawers, Amazon Transcribe can now automatically redact personally identifiable data, VentureBeat, 27 February 2020, Available at: https://venturebeat.com/2020/02/27/amazon-transcribe-can-now-automatically-redact-personally-identifiable-data/ [last accessed 3 February 2021].
- ETRI AI Open API, Available at: https://aiopen.etri.re.kr/guide/Recognition [last accessed 27 April 2022].
- Naver Clova Speech, Available at: https://clova.ai/speech [last accessed 27 April 2022].
- Kakao Speech API, Available at: https://speech-api.kakao.com/ [last accessed 27 April 2022].
- A. Fernandez-Lopez and F. M. Sukno, Survey on automatic lip-reading in the era of deep learning, Image Vis. Comput. 78 (2018), 53-72.
- Y. R. Oh, K. Park, and J. G. Park, Fast offline transformer-based end-to-end automatic speech recognition for real-world applications, ETRI J. 44 (2022), no. 3, 476-490.
- C. Sheng, G. Kuang, L. Bai, C. Hou, Y. Guo, X. Xu, M. Pietikainen, and L. Liu, Deep learning for visual speech analysis: a survey, arXiv preprint, 2022, DOI 10.48550/arXiv.2205.10839.
- Y. R. Oh, K. Park, H. B. Jeon, and J. G. Park, Automatic proficiency assessment of Korean speech read aloud by non-natives using bidirectional LSTM-based speech recognition, ETRI J. 42 (2020), no. 5, 761-772.
- D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, Audiovisual speech recognition with missing or unreliable data, (Proc. International Conference on Auditory-Visual Speech Processing, Norwich, UK), 2009, pp. 117-122.
- V. Kepuska and G. Bohouta, Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx), Int. J. Eng. Res. Appl. 7 (2017), 20-24.
- H. J. Yoo, S. Seo, S. W. Im, and G. Y. Gim, The performance evaluation of continuous speech recognition based on Korean phonological rules of cloud-based speech recognition open API, Int. J. Netw. Distrib. Comput. 9 (2021), no. 1, 10-18.
- B. Alibegovic, N. Prljaca, M. Kimmel, and M. Schultalbers, Speech recognition system for a service robot: a performance evaluation, (Proc. 16th Int. Conf. Control Autom. Robot. Vis. (ICARCV), Shenzhen, China), 2020, pp. 1171-1176.
- S. Jeon and M. S. Kim, End-to-end lip-reading open cloud-based speech architecture, Sensors 22 (2022), no. 8, 2938.
- S. Jeon and M. S. Kim, Noise-robust multimodal audio-visual speech recognition system for speech-based interaction applications, Sensors 22 (2022), no. 20, 7738.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, (Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA), 2013, pp. 3111-3119.
- T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint, 2013, DOI 10.48550/arXiv.1301.3781.
- S. Ji, M. Yang, and K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013), no. 1, 221-231.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint, 2012, DOI 10.48550/arXiv.1207.0580.
- J. Tompson, R. Goroshin, A. Jain, Y. Lecun, and C. Bregler, Efficient object localization using convolutional networks, (Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA), 2015, pp. 648-656.
- S. Lee and C. Lee, Revisiting spatial dropout for regularizing convolutional neural networks, Multimed. Tools Appl. 79 (2020), 34195-34207.
- S. Jeon, A. Elsharkawy, and M. S. Kim, Lipreading architecture based on multiple convolutional neural networks for sentence-level visual speech recognition, Sensors 22 (2022), no. 1, DOI 10.3390/s22010072.
- S. Jeon and M. S. Kim, End-to-end sentence-level multi-view lip-reading architecture with spatial attention module integrated multiple CNNs and cascaded local self-attention-CTC, Sensors 22 (2022), no. 9, DOI 10.3390/s22093597.
- S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, CBAM: convolutional block attention module, (Proc. Eur. Conf. Comput. Vis. (ECCV), Munich, Germany), 2018, pp. 3-19.
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, (Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA), 2006, pp. 369-376.
- P. Warden, Speech commands: a dataset for limited-vocabulary speech recognition, arXiv preprint, 2018, DOI 10.48550/arXiv.1804.03209.
- C. H. H. Yang, J. Qi, S. Y. C. Chen, P. Y. Chen, S. M. Siniscalchi, X. Ma, and C. H. Lee, Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition, (Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Toronto, Canada), 2021, pp. 6523-6527.
- S. Seo, C. Kim, and J. H. Kim, Convolutional neural networks using log mel-spectrogram separation for audio event classification with unknown devices, J. Web Eng. 21 (2022), no. 2, 497-522.
- S. Suh, S. Park, Y. Jeong, and T. Lee, Designing acoustic scene classification models with CNN variants, Technical Report, Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge, 2020.
- C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, 300 faces in-the-wild challenge: the first facial landmark localization challenge, (Proc. IEEE Int. Conf. Comput. Vis. Workshops, Sydney, Australia), 2013, pp. 397-403.
- D. P. Kingma and J. Ba, Adam: a method for stochastic optimization, arXiv preprint, 2014, DOI 10.48550/arXiv.1412.6980.
- J. Thiemann, N. Ito, and E. Vincent, DEMAND: diverse environments multichannel acoustic noise database, Proc. Mtgs. Acoust. 19 (2013), DOI 10.1121/1.4799597.