Acknowledgement
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2022-0-00989, Development of Artificial Intelligence Technology for Multi-speaker Dialog Modeling).