Acknowledgement
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2022-0-00989, Development of Artificial Intelligence Technology for Multi-speaker Dialog Modeling).
References
- AI Hub (2021). AI Hub broadcast content Korean speech recognition data. Retrieved from https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=463
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020, December). wav2vec 2.0: A framework for self-supervised learning of speech representations. Proceedings of the Advances in Neural Information Processing Systems (pp. 12449-12460). Online Conference.
- Bang, J. U., Yun, S., Kim, S. H., Choi, M. Y., Lee, M. K., Kim, Y. J., Kim, D. H., ... Kim, S. H. (2020). KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition. Applied Sciences, 10(19), 6936.
- Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). Shanghai, China.
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, July). A simple framework for contrastive learning of visual representations. Proceedings of the 37th International Conference on Machine Learning (pp. 1597-1607). Online Conference.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https://arxiv.org/abs/1810.04805
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). Pittsburgh, PA.
- Graves, A., & Jaitly, N. (2014, June). Towards end-to-end speech recognition with recurrent neural networks. Proceedings of the 31st International Conference on Machine Learning (pp. 1764-1772). Beijing, China.
- Hadsell, R., Chopra, S., & LeCun, Y. (2006, June). Dimensionality reduction by learning an invariant mapping. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) (pp. 1735-1742). New York, NY.
- Nam, K. (2019). A study on processing of speech recognition Korean words. The Journal of the Convergence on Culture Technology, 5(4), 407-412. https://doi.org/10.17703/JCCT.2019.5.4.407
- Oh, Y. R., Park, K., & Park, J. G. (2022). Fast offline transformer-based end-to-end automatic speech recognition for real-world applications. ETRI Journal, 44(3), 476-490. https://doi.org/10.4218/etrij.2021-0106
- OpenAI (2023). openai/whisper. Retrieved from https://github.com/openai/whisper
- Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015, April). Librispeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206-5210). South Brisbane, Australia.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. Proceedings of the 40th International Conference on Machine Learning (pp. 28492-28518). Honolulu, HI.
- Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019, September). wav2vec: Unsupervised pre-training for speech recognition. Proceedings of Interspeech 2019 (pp. 3465-3469). Graz, Austria.
- Tsai, Y. H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L. P., & Salakhutdinov, R. (2019, July). Multimodal transformer for unaligned multimodal language sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 6558-6569). Florence, Italy.
- van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. Retrieved from https://doi.org/10.48550/arXiv.1807.03748
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., ... Polosukhin, I. (2017, December). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems. Long Beach, CA.
- Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., ... Ochiai, T. (2018). ESPnet: End-to-end speech processing toolkit. Retrieved from https://doi.org/10.48550/arXiv.1804.00015
- Yadav, H., & Sitaram, S. (2022, June). A survey of multilingual models for automatic speech recognition. Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 5071-5079). Marseille, France.