Construction of a Korean Knowledge Base Based on Machine Learning from Wikipedia

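  • 정석원 (Department of Computer and Communications Engineering, Kangwon National University) ;
  • 최맹식 (Department of Computer and Communications Engineering, Kangwon National University) ;
  • 김학수 (Department of Computer and Communications Engineering, Kangwon National University)
  • Received : 2015.04.08
  • Reviewed : 2015.05.26
  • Published : 2015.08.15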

Abstract

The performance of many natural language processing applications depends heavily on the knowledge base they use. In English, knowledge bases such as WordNet, YAGO, Cyc, and BabelNet are widely used. In this paper, we propose a method for automatically constructing a YAGO-style Korean knowledge base (hereafter, K-YAGO) from Wikipedia and YAGO. The proposed system first constructs an initial K-YAGO by simple matching between YAGO and Wikipedia infoboxes, and then expands the initial K-YAGO using machine learning. In experiments, the proposed system achieved a precision of 0.9642 in constructing the initial K-YAGO, and an accuracy of 0.9468 and a macro F1-measure of 0.7596 in expanding K-YAGO.
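The abstract does not spell out how YAGO is matched against Wikipedia infoboxes, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that (a) YAGO facts are available as (subject, relation, object) triples, (b) a hypothetical dictionary en_to_ko maps English entity names to Korean Wikipedia titles, and (c) each Korean infobox is a plain attribute-to-value dictionary. A triple is kept when both entities have Korean titles and the object title appears in some infobox value of the subject. The function name build_initial_kb and all sample data are made up for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_initial_kb(
    yago_triples: List[Triple],
    en_to_ko: Dict[str, str],
    ko_infoboxes: Dict[str, Dict[str, str]],
) -> List[Triple]:
    """Keep a YAGO triple when both entities map to Korean titles and the
    Korean object title occurs in an infobox value of the Korean subject.
    This matching rule is an assumption made for illustration."""
    kept: List[Triple] = []
    for subj, rel, obj in yago_triples:
        ko_subj = en_to_ko.get(subj)
        ko_obj = en_to_ko.get(obj)
        if ko_subj is None or ko_obj is None:
            continue  # no Korean counterpart for one of the entities
        infobox = ko_infoboxes.get(ko_subj, {})
        if any(ko_obj in value for value in infobox.values()):
            kept.append((ko_subj, rel, ko_obj))
    return kept


# Illustrative (made-up) inputs.
yago_triples = [
    ("Seoul", "isLocatedIn", "South_Korea"),
    ("Busan", "isLocatedIn", "South_Korea"),
    ("Jeju_City", "isLocatedIn", "Jeju_Province"),
]
en_to_ko = {
    "Seoul": "서울특별시",
    "Busan": "부산광역시",
    "South_Korea": "대한민국",
}
ko_infoboxes = {
    "서울특별시": {"국가": "대한민국", "행정 구역": "25구"},
    "부산광역시": {"행정 구역": "16구군"},
}

initial_kb = build_initial_kb(yago_triples, en_to_ko, ko_infoboxes)
print(initial_kb)
```

In this sketch only the Seoul triple survives: the Busan infobox has no value containing the object title, and the Jeju entities have no Korean mapping. Triples matched this way could then serve as labeled examples for the machine learning step that expands the knowledge base.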

Project Information

Research project lead agency : National Research Foundation of Korea
