Enhancing Performance of Bilingual Lexicon Extraction through Refinement of Pivot-Context Vectors

중간언어 문맥벡터의 정제를 통한 이중언어 사전 구축의 성능개선

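  • 권홍석 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 서형원 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 김재훈 (Department of Computer Engineering, Korea Maritime and Ocean University)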
  • Received : 2014.01.21
  • Accepted : 2014.05.21
  • Published : 2014.07.15

Abstract

This paper improves automatic bilingual lexicon extraction by refining pivot-context vectors within the standard pivot-based approach, which is a very effective method for low-resource language pairs. We improve performance step by step through two refinements of the pivot-context vectors: the first filters out unhelpful elements and revises the remaining values using bidirectional translation probabilities estimated by Anymalign, and the second removes non-noun elements from the original vectors. Experiments were conducted in both directions on two language pairs, Korean-Spanish and Korean-French. The results show that, for high-frequency words, our method achieves an accuracy of at least 48.5% at top 1 and up to 88.5% at top 20; for low-frequency words, it achieves at least 43.3% at top 1 and up to 48.9% at top 20.

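This paper proposes methods for refining context vectors in pivot-based bilingual lexicon extraction. The pivot-based approach is very effective for language pairs that lack resources such as bilingual dictionaries or parallel corpora. We improve performance through two refinement methods: the first refines the context vectors using bidirectional translation probabilities, and the second refines them using part-of-speech information. We conducted experiments that extract bilingual lexicons in both directions for two language pairs, Korean-Spanish and Korean-French. For high-frequency words, translation accuracy was at least 48.5% at top 1 and up to 88.5% at top 20; for low-frequency words, it was at least 26.5% at top 1 and up to 66.5% at top 20.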
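The two refinements and the top-k evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two translation directions are combined, so multiplying them is an assumption, as are the threshold value, the function names (`refine`, `cosine`, `rank_candidates`, `accuracy_at_k`), the use of cosine similarity for ranking, and the toy probabilities, which merely stand in for values that a word aligner such as Anymalign would estimate.

```python
import math


def refine(forward, backward, threshold=0.05, nouns=None):
    """Refine one pivot-context vector (hypothetical helper).

    forward[p]  : assumed P(pivot word p | word)
    backward[p] : assumed P(word | pivot word p)
    """
    refined = {}
    for pivot, p_fwd in forward.items():
        # Assumption: combine the two directions by multiplying them.
        score = p_fwd * backward.get(pivot, 0.0)
        if score < threshold:
            continue  # refinement 1: filter out unhelpful elements
        if nouns is not None and pivot not in nouns:
            continue  # refinement 2: remove non-noun elements
        refined[pivot] = score
    return refined


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_vec, target_vecs, k=20):
    """Return the k target words most similar to the source word."""
    scored = sorted(target_vecs.items(),
                    key=lambda item: cosine(source_vec, item[1]),
                    reverse=True)
    return [word for word, _ in scored[:k]]


def accuracy_at_k(predictions, gold, k):
    """Fraction of source words with a correct translation in the top k."""
    hits = sum(1 for word, candidates in predictions.items()
               if gold[word] & set(candidates[:k]))
    return hits / len(predictions)


# Toy data: English is the pivot language; all numbers are invented.
NOUNS = {"school", "student", "book", "library"}

ko_school = refine(
    forward={"school": 0.6, "student": 0.3, "go": 0.2, "the": 0.1},
    backward={"school": 0.7, "student": 0.2, "go": 0.3, "the": 0.01},
    nouns=NOUNS,
)

es_vectors = {
    "escuela": refine({"school": 0.7, "student": 0.2, "the": 0.1},
                      {"school": 0.6, "student": 0.3, "the": 0.01},
                      nouns=NOUNS),
    "libro": refine({"book": 0.8, "library": 0.1},
                    {"book": 0.7, "library": 0.6},
                    nouns=NOUNS),
    "estudiante": refine({"student": 0.8, "school": 0.2},
                         {"student": 0.7, "school": 0.3},
                         nouns=NOUNS),
}

ranking = rank_candidates(ko_school, es_vectors, k=20)
top1 = accuracy_at_k({"학교": ranking}, {"학교": {"escuela"}}, k=1)
```

Because both the Korean and the Spanish vectors are expressed over the same pivot vocabulary, they can be compared directly without a Korean-Spanish seed dictionary, which is what makes the pivot-based approach attractive for low-resource pairs.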

Acknowledgement

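Grant : Development of core technology for knowledge-learning-based automatic interpretation and translation software for tourism and international events, easily extensible to multiple languages, with a 90% interpretation rate

Supported by : Korea Evaluation Institute of Industrial Technology (KEIT)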
