Acknowledgement
This study was supported by the following grants: an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT, MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 60%); a National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (2021R1A4A1030075, 20%); and a GIST-MIT Research Collaboration grant funded by GIST in 2023 (20%).
References
- H. McGurk and J. MacDonald, Hearing lips and seeing voices, Nature 264 (1976), no. 5588, 746-748.
- A. G. Chitu and L. J. M. Rothkrantz, Automatic visual speech recognition, In Speech enhancement, modeling and recognition: algorithms and applications, S. Ramakrishnan (ed.), IntechOpen, London, UK, 2012, 95-120.
- P. Agarwal and S. Kumar, Electroencephalography-based imagined speech recognition using deep long short-term memory network, ETRI J. 44 (2022), no. 4, 672-685.
- L. Sun, Q. Li, S. Fu, and P. Li, Speech emotion recognition based on genetic algorithm-decision tree fusion of deep and acoustic features, ETRI J. 44 (2022), no. 3, 462-475.
- S. Kumar, Real-time implementation and performance evaluation of speech classifiers in speech analysis-synthesis, ETRI J. 43 (2021), no. 1, 82-94.
- S. B. Alex and L. Mary, Variational autoencoder for prosody-based speaker recognition, ETRI J. 45 (2023), no. 4, 678-689.
- B. Shi, W. Hsu, and A. Mohamed, Robust self-supervised audiovisual speech recognition, (Interspeech, Incheon, Rep. of Korea), 2022, DOI 10.21437/Interspeech.2022-99.
- R. Shashidhar, S. Patilkulkarni, and S. B. Puneeth, Combining audio and visual speech recognition using LSTM and deep convolutional neural network, Int. J. Inform. Technol. 14 (2022), 3425-3436.
- D. Serdyuk, O. Braga, and O. Siohan, Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video, (Interspeech, Incheon, Rep. of Korea), 2022, DOI 10.21437/Interspeech.2022-10920.
- C. Deuerlein, M. Langer, J. Sessner, P. Hess, and J. Franke, Human-robot-interaction using cloud-based speech recognition systems, Proc. CIRP 97 (2021), 130-135.
- T. Kimura, T. Nose, S. Hirooka, Y. Chiba, and A. Ito, Comparison of speech recognition performance between Kaldi and Google Cloud Speech API, In Recent advances in intelligent information hiding and multimedia signal processing: proceedings of the fourteenth international conference on intelligent information hiding and multimedia signal processing, November 26-28, 2018, Sendai, Japan, Vol. 110, Springer International Publishing, 2019.
- Q. Song, B. Sun, and S. Li, Multimodal sparse transformer network for audio-visual speech recognition, IEEE Trans. Neural Netw. Learn. Syst. 34 (2022), no. 12, DOI 10.1109/TNNLS.2022.3163771.
- L. D. Terissi, G. D. Sad, and J. C. Gomez, Robust front-end for audio, visual and audio-visual speech classification, Int. J. Speech Technol. 21 (2018), 293-307.
- G. Calvert, C. Spence, and B. E. Stein, The handbook of multisensory processes, MIT Press, London, UK, 2004.
- J. Venezia, W. Matchin, and G. Hickok, Multisensory integration and audiovisual speech perception, Brain Mapp. Encycl. Ref. 2 (2015), 565-572.
- R. Campbell, The processing of audio-visual speech: empirical and neural bases, Philos. Trans. R. Soc. Lond. B Biol. Sci. 363 (2008), no. 1493, 1001-1010.
- Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, LipNet: end-to-end sentence-level lipreading, arXiv preprint, 2016, DOI 10.48550/arXiv.1611.01599.
- Google Assistant, Available at: https://developers.google.com/assistant?hl=ko/ [last accessed 27 April 2022].
- Microsoft Azure Cognitive Services, Available at: https://azure.microsoft.com/en-us/services/cognitive-services/ [last accessed 27 April 2022].
- IBM Watson Speech to Text, Available at: https://www.ibm.com/kr-ko/cloud/watson-speech-to-text [last accessed 27 April 2022].
- Amazon Transcribe, Available at: https://aws.amazon.com/transcribe/ [last accessed 27 April 2022].
- P. Sawers, Amazon Transcribe can now automatically redact personally identifiable data, VentureBeat, 27 February 2020, Available at: https://venturebeat.com/2020/02/27/amazon-transcribe-can-now-automatically-redact-personally-identifiable-data/ [last accessed 3 February 2021].
- ETRI AI Open API, Available at: https://aiopen.etri.re.kr/guide/Recognition [last accessed 27 April 2022].
- Naver Clova Speech, Available at: https://clova.ai/speech [last accessed 27 April 2022].
- Kakao Speech API, Available at: https://speech-api.kakao.com/ [last accessed 27 April 2022].
- A. Fernandez-Lopez and F. M. Sukno, Survey on automatic lip-reading in the era of deep learning, Image Vis. Comput. 78 (2018), 53-72.
- Y. R. Oh, K. Park, and J. G. Park, Fast offline transformer-based end-to-end automatic speech recognition for real-world applications, ETRI J. 44 (2022), no. 3, 476-490.
- C. Sheng, G. Kuang, L. Bai, C. Hou, Y. Guo, X. Xu, M. Pietikainen, and L. Liu, Deep learning for visual speech analysis: a survey, arXiv preprint, 2022, DOI 10.48550/arXiv.2205.10839.
- Y. R. Oh, K. Park, H. B. Jeon, and J. G. Park, Automatic proficiency assessment of Korean speech read aloud by non-natives using bidirectional LSTM-based speech recognition, ETRI J. 42 (2020), no. 5, 761-772.
- D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, Audiovisual speech recognition with missing or unreliable data, (Proc. International Conference on Auditory-Visual Speech Processing, Norwich, UK), 2009, pp. 117-122.
- V. Kepuska and G. Bohouta, Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx), Int. J. Eng. Res. Appl. 7 (2017), 20-24.
- H. J. Yoo, S. Seo, S. W. Im, and G. Y. Gim, The performance evaluation of continuous speech recognition based on Korean phonological rules of cloud-based speech recognition open API, Int. J. Netw. Distrib. Comput. 9 (2021), no. 1, 10-18.
- B. Alibegovic, N. Prljaca, M. Kimmel, and M. Schultalbers, Speech recognition system for a service robot: a performance evaluation, (Proc. 16th Int. Conf. Control Autom. Robot. Vis. (ICARCV), Shenzhen, China), 2020, pp. 1171-1176.
- S. Jeon and M. S. Kim, End-to-end lip-reading open cloud-based speech architecture, Sensors 22 (2022), no. 8, 2938.
- S. Jeon and M. S. Kim, Noise-robust multimodal audio-visual speech recognition system for speech-based interaction applications, Sensors 22 (2022), no. 20, 7738.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, (Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA), 2013, pp. 3111-3119.
- T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint, 2013, DOI 10.48550/arXiv.1301.3781.
- S. Ji, M. Yang, and K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013), no. 1, 221-231.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint, 2012, DOI 10.48550/arXiv.1207.0580.
- J. Tompson, R. Goroshin, A. Jain, Y. Lecun, and C. Bregler, Efficient object localization using convolutional networks, (Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA), 2015, pp. 648-656.
- S. Lee and C. Lee, Revisiting spatial dropout for regularizing convolutional neural networks, Multimed. Tools Appl. 79 (2020), 34195-34207.
- S. Jeon, A. Elsharkawy, and M. S. Kim, Lipreading architecture based on multiple convolutional neural networks for sentence-level visual speech recognition, Sensors 22 (2022), no. 1, DOI 10.3390/s22010072.
- S. Jeon and M. S. Kim, End-to-end sentence-level multi-view lip-reading architecture with spatial attention module integrated multiple CNNs and cascaded local self-attention-CTC, Sensors 22 (2022), no. 9, DOI 10.3390/s22093597.
- S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, CBAM: convolutional block attention module, (Proc. Eur. Conf. Comput. Vis. (ECCV), Munich, Germany), 2018, pp. 3-19.
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, (Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA), 2006, pp. 369-376.
- P. Warden, Speech commands: a dataset for limited-vocabulary speech recognition, arXiv preprint, 2018, DOI 10.48550/arXiv.1804.03209.
- C. H. H. Yang, J. Qi, S. Y. C. Chen, P. Y. Chen, S. M. Siniscalchi, X. Ma, and C. H. Lee, Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition, (Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Toronto, Canada), 2021, pp. 6523-6527.
- S. Seo, C. Kim, and J. H. Kim, Convolutional neural networks using log mel-spectrogram separation for audio event classification with unknown devices, J. Web Eng. 21 (2022), no. 2, 497-522.
- S. Suh, S. Park, Y. Jeong, and T. Lee, Designing acoustic scene classification models with CNN variants, Technical Report, Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge, 2020.
- C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, 300 faces in-the-wild challenge: the first facial landmark localization challenge, (Proc. IEEE Int. Conf. Comput. Vis. Workshops, Sydney, Australia), 2013, pp. 397-403.
- D. P. Kingma and J. Ba, Adam: a method for stochastic optimization, arXiv preprint, 2014, DOI 10.48550/arXiv.1412.6980.
- J. Thiemann, N. Ito, and E. Vincent, DEMAND: diverse environments multichannel acoustic noise database, Proc. Mtgs. Acoust. 19 (2013), DOI 10.1121/1.4799597.