English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Joon-Gyu Maeng (ICT-Computer Software, University of Science and Technology)
  • Jun Park (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Seung Yun (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Sang-Hun Kim (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Received : 2021.09.26
  • Accepted : 2022.05.02
  • Published : 2023.02.20

Abstract

We present an English-Korean speech translation corpus named EnKoST-C. End-to-end model training for speech translation often suffers from a lack of parallel data, that is, speech in the source language paired with equivalent text in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. We therefore created EnKoST-C, centered on TED Talks. In the process, we enhanced the sentence alignment approach by using subtitle time information and bilingual sentence embedding information. The result is a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach achieved an F-measure of 0.96. We also report the baseline performance of an English-Korean speech translation model trained on EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.
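To make the alignment idea concrete, the following is a minimal sketch of how subtitle timing and bilingual sentence embeddings could be combined to score candidate sentence pairs, together with the F-measure used to evaluate an alignment. It is not the authors' implementation: the linear weighting, the greedy matching, the threshold, the function names, and the two-dimensional toy embeddings are all assumptions made for illustration. In practice the embeddings would come from a bilingual sentence encoder.

```python
import math

def time_overlap_ratio(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pair_score(src, tgt, weight=0.5):
    """Hypothetical linear mix of timing evidence and embedding evidence."""
    timing = time_overlap_ratio(src["span"], tgt["span"])
    semantic = cosine(src["emb"], tgt["emb"])
    return weight * timing + (1.0 - weight) * semantic

def greedy_align(src_sents, tgt_sents, threshold=0.5):
    """For each source sentence, keep the best-scoring target if it passes the threshold."""
    pairs = []
    for i, s in enumerate(src_sents):
        scored = [(pair_score(s, t), j) for j, t in enumerate(tgt_sents)]
        best_score, best_j = max(scored)
        if best_score >= threshold:
            pairs.append((i, best_j))
    return pairs

def f_measure(predicted, reference):
    """Harmonic mean of precision and recall over sets of aligned index pairs."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy data (assumed): subtitle spans in seconds and 2-D stand-in embeddings.
src = [{"span": (0.0, 2.0), "emb": [1.0, 0.0]},
       {"span": (2.0, 5.0), "emb": [0.0, 1.0]}]
tgt = [{"span": (0.1, 2.1), "emb": [0.9, 0.1]},
       {"span": (2.2, 5.0), "emb": [0.1, 0.9]}]

alignment = greedy_align(src, tgt)
score = f_measure(alignment, [(0, 0), (1, 1)])
```

A real aligner would also need to handle one-to-many and many-to-one sentence mappings, which this greedy one-to-one sketch ignores.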

Acknowledgement

This work was supported by an Electronics and Telecommunications Research Institute grant funded by the Korean government (22ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System).
