A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus

  • Park, Chanjun (Department of Computer Science and Engineering, Korea University) ;
  • Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
  • Received : 2020.03.11
  • Accepted : 2020.06.20
  • Published : 2020.06.28

Abstract

Machine translation refers to software that automatically translates a source language into a target language. The field has progressed from rule-based and statistics-based approaches, and research on neural machine translation (NMT) is now being actively pursued. One of the key factors in NMT is a high-quality parallel corpus, yet high-quality parallel corpora for Korean language pairs have so far been difficult to obtain. Recently, the AI HUB of the National Information Society Agency (NIA) released a high-quality Korean-English parallel corpus of 1.6 million sentence pairs. This paper verifies the quality of that data through a performance comparison between the AI HUB corpus and OpenSubtitles, the most widely used Korean-English parallel corpus to date. To ensure objectivity, the official IWSLT test set for Korean-English machine translation was used as test data. The experimental results show better performance than existing Korean-English machine translation studies evaluated on the same test set, demonstrating the importance of high-quality data.
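The comparison methodology the abstract describes, training the same NMT system on each corpus and scoring its output against the IWSLT Korean-English references with corpus-level BLEU, can be sketched as follows. This is a minimal illustration rather than the authors' actual pipeline: the file names (`aihub.hyp.en`, `opensubtitles.hyp.en`, `iwslt.ref.en`) are hypothetical, and `sacrebleu` is assumed as the scoring library.

```python
# Minimal sketch of the corpus-quality comparison described above:
# score model outputs (one system per training corpus) against the
# IWSLT Korean-English references with corpus-level BLEU.
# File names are hypothetical; `pip install sacrebleu` is assumed.
import sacrebleu

def corpus_bleu_score(hyp_path: str, ref_path: str) -> float:
    """Read detokenized hypothesis/reference files and return BLEU."""
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    # sacrebleu takes a list of hypothesis strings and a list of
    # reference streams (here, a single reference per sentence).
    return sacrebleu.corpus_bleu(hyps, [refs]).score

if __name__ == "__main__":
    # One trained system per training corpus, same IWSLT test set.
    for system in ("aihub", "opensubtitles"):
        bleu = corpus_bleu_score(f"{system}.hyp.en", "iwslt.ref.en")
        print(f"{system}: BLEU = {bleu:.2f}")
```

Scoring both systems on the same fixed, held-out test set isolates the training corpus as the only varying factor, which is what supports the abstract's conclusion about data quality.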
