A Study on Utilization of Wikipedia Contents for Automatic Construction of Linguistic Resources

  • Yoo, Cheol-Jung (류철중; Dept. of Software Engineering, Chonbuk National University) ;
  • Kim, Yong (김용; Dept. of Library & Information Science, Chonbuk National University) ;
  • Yun, Bo-Hyun (윤보현; Dept. of Computer Science Education, Mokwon University)
  • Received : 2015.03.17
  • Accepted : 2015.05.20
  • Published : 2015.05.28

Abstract

Machines need a variety of linguistic knowledge resources to understand rapidly changing natural language. This paper devises a method for automatically constructing such resources from online content so that they can be expanded continuously. We focus on automatically building and extending a named-entity (NE) dictionary, because NEs are highly applicable in linguistic analysis. We selected Korean Wikipedia as the source for the NE dictionary and carried out various statistical analyses to characterize it. Based on this investigation, we propose an efficient method for building and extending an NE dictionary that uses the syntactic patterns of Wikipedia content together with structural features such as metadata.
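The general idea of combining structural metadata with syntactic patterns can be illustrated with a minimal sketch. It is not the authors' implementation: the rule tables, the definitional-sentence pattern, the function names, and the toy English records below are all assumptions made for illustration, and a real system would read Korean Wikipedia pages and use Korean patterns.

```python
import re

# Hypothetical keyword rules (not from the paper): category words -> NE type.
CATEGORY_RULES = {
    "PERSON": ("births", "people"),
    "LOCATION": ("cities", "countries"),
    "ORGANIZATION": ("companies", "universities"),
}

# Hypothetical keyword rules for nouns found in a definitional first sentence.
HEAD_NOUN_RULES = {
    "PERSON": ("writer", "politician", "singer"),
    "LOCATION": ("city", "river", "country"),
    "ORGANIZATION": ("company", "university"),
}

# Assumed syntactic pattern: "<title> is a/an/the <description>."
DEFINITION = re.compile(r"\bis (?:a|an|the) (?P<desc>[^.]+)\.")


def type_from_keywords(text, rules):
    """Return the first NE type whose keyword occurs as a whole word in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    for ne_type, keywords in rules.items():
        if any(keyword in tokens for keyword in keywords):
            return ne_type
    return None


def classify_article(article):
    """Try structural metadata (categories) first, then the syntactic pattern."""
    for category in article["categories"]:
        ne_type = type_from_keywords(category, CATEGORY_RULES)
        if ne_type is not None:
            return ne_type
    match = DEFINITION.search(article["first_sentence"])
    if match:
        return type_from_keywords(match.group("desc"), HEAD_NOUN_RULES)
    return None


def build_dictionary(articles):
    """Group article titles by NE type; unclassified titles are skipped."""
    dictionary = {}
    for article in articles:
        ne_type = classify_article(article)
        if ne_type is not None:
            dictionary.setdefault(ne_type, []).append(article["title"])
    return {ne_type: sorted(titles) for ne_type, titles in dictionary.items()}


# Toy records standing in for parsed Wikipedia pages (illustrative only).
ARTICLES = [
    {"title": "Seoul",
     "first_sentence": "Seoul is the capital city of South Korea.",
     "categories": ["Capitals in Asia"]},
    {"title": "Han Kang",
     "first_sentence": "Han Kang is a South Korean writer.",
     "categories": ["1970 births"]},
    {"title": "Chonbuk National University",
     "first_sentence": "Chonbuk National University is a national university in Jeonju.",
     "categories": ["Universities in South Korea"]},
    {"title": "Kimchi",
     "first_sentence": "Kimchi is a traditional side dish.",
     "categories": ["Korean cuisine"]},
]

if __name__ == "__main__":
    for ne_type, titles in build_dictionary(ARTICLES).items():
        print(ne_type, titles)
```

In this sketch the category metadata is consulted first and the first-sentence pattern serves only as a fallback, so an article with no matching category can still be typed from its definition, while an article matching neither rule is left out of the dictionary. That ordering is a design choice of the sketch, not something stated in the paper.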
