Acknowledgement
This work was supported by an Electronics and Telecommunications Research Institute grant funded by the Korean government (22ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System).