Acknowledgement
This work was supported by the National Research Foundation of Korea (NRF) and the Commercialization Promotion Agency for R&D Outcomes (COMPA), funded by the Korean government (Ministry of Science and ICT) (RS-2023-00237117).
References
- Bain, M., Huh, J., Han, T., & Zisserman, A. (2023, August). WhisperX: Time-accurate speech transcription of long-form audio. Proceedings of Interspeech 2023 (pp. 4489-4493). Dublin, Ireland.
- Bang, J. U., Yun, S., Kim, S. H., Choi, M. Y., Lee, M. K., Kim, Y. J., Kim, D. H., ... Kim, S. H. (2020). KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition. Applied Sciences, 10(19), 6936.
- Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse Transformers. arXiv. https://doi.org/10.48550/arXiv.1904.10509
- Choi, H., Choi, M., Kim, S., Lim, Y., Lee, M., Yun, S., Kim, D., ... Kim, S. H. (2024). Spoken-to-written text conversion for enhancement of Korean-English readability and machine translation. ETRI Journal, 46(1), 127-136.
- Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., ... Bapna, A. (2023, January). FLEURS: Few-shot learning evaluation of universal representations of speech. Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT) (pp. 798-805). Doha, Qatar.
- Dong, L., Xu, S., & Xu, B. (2018, April). Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5884-5888). Calgary, AB.
- Gulati, A., Qin, J., Chiu, C. C., Parmar, N., Zhang, Y., Yu, J., Han, W., ... Pang, R. (2020, October). Conformer: Convolution-augmented Transformer for speech recognition. Proceedings of Interspeech 2020 (pp. 5036-5040). Shanghai, China.
- Kim, K., Wu, F., Peng, Y., Pan, J., Sridhar, P., Han, K. J., & Watanabe, S. (2023, January). E-Branchformer: Branchformer with enhanced merging for speech recognition. Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT) (pp. 84-91). Doha, Qatar.
- Oh, C., Kim, C., & Park, K. (2023). Building robust Korean speech recognition model by fine-tuning large pretrained model. Phonetics and Speech Sciences, 15(3), 75-82.
- Pan, J., Lei, T., Kim, K., Han, K. J., & Watanabe, S. (2022, May). SRU++: Pioneering fast recurrence with attention for speech recognition. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7872-7876). Singapore.
- Park, K., Oh, C., & Dong, S. (2024). KMSAV: Korean multi-speaker spontaneous audiovisual dataset. ETRI Journal, 46(1), 71-81.
- Peng, Y., Dalmia, S., Lane, I., & Watanabe, S. (2022, June). Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. Proceedings of the International Conference on Machine Learning (pp. 17627-17643). Baltimore, MD.
- Peng, Y., Kim, K., Wu, F., Yan, B., Arora, S., Chen, W., Tang, J., ... Watanabe, S. (2023, August). A comparative study on E-Branchformer vs Conformer in speech recognition, translation, and understanding tasks. Proceedings of Interspeech 2023 (pp. 2208-2212). Dublin, Ireland.
- Shaw, P., Uszkoreit, J., & Vaswani, A. (2018, June). Self-attention with relative position representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 464-468). New Orleans, LA.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., ... Polosukhin, I. (2017, December). Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach, CA.
- Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Enrique Yalta Soplin, N., ... Ochiai, T. (2018, September). ESPnet: End-to-end speech processing toolkit. Proceedings of Interspeech 2018 (pp. 2207-2211). Hyderabad, India.