Acknowledgement
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2022-0-00989, Development of Artificial Intelligence Technology for Multi-speaker Dialog Modeling).