Knowledge Extraction Methodology and Framework from Wikipedia Articles for Construction of Knowledge-Base

A Study on a Learning-Based Knowledge Extraction Methodology and Platform for Korean Wikipedia for Knowledge Base Construction

  • Received : 2018.12.18
  • Accepted : 2019.03.11
  • Published : 2019.03.31

Abstract

Artificial intelligence (AI) technology has been advancing rapidly alongside the Fourth Industrial Revolution, and AI research is being actively conducted in fields such as autonomous vehicles, natural language processing, and robotics. Since the 1950s, this research has focused on solving cognitive problems related to human intelligence, such as learning and problem solving. Driven by recent interest in the technology and by research on various algorithms, the field has achieved greater technological advances than ever before. The knowledge-based system is a subfield of AI that aims to enable AI agents to make decisions using machine-readable, processable knowledge constructed from complex and informal human knowledge and rules in various domains. A knowledge base is used to optimize the collection, organization, and retrieval of information, and it has recently been combined with statistical AI such as machine learning. More recently, knowledge bases have also served to express, publish, and share knowledge on the web by describing and connecting web resources such as pages and data. Such knowledge bases support intelligent processing in various areas of AI, such as the question answering systems of smart speakers. However, building a useful knowledge base is time-consuming and still requires considerable effort from experts. In recent years, much research and technology in knowledge-based AI has used DBpedia, one of the largest knowledge bases, which aims to extract structured content from Wikipedia. DBpedia contains various information extracted from Wikipedia, such as titles, categories, and links, but its most useful knowledge comes from Wikipedia infoboxes, which are user-created summaries presenting some unifying aspect of an article.
This knowledge is created by mapping rules between infobox structures and the DBpedia ontology schema, as defined in the DBpedia Extraction Framework. Because it generates knowledge from semi-structured infobox data created by users, DBpedia can expect high reliability in terms of accuracy. However, since only about 50% of all wiki pages in Korean Wikipedia contain an infobox, DBpedia is limited in terms of knowledge scalability. This paper proposes a method for extracting knowledge from text documents according to an ontology schema using machine learning. To demonstrate the appropriateness of this method, we describe a knowledge extraction model that follows the DBpedia ontology schema and is trained on Wikipedia infoboxes. The model consists of three steps: classifying documents into ontology classes, classifying the sentences suitable for triple extraction, and selecting values and transforming them into the RDF triple structure. The structure of Wikipedia infoboxes is defined by infobox templates, which provide standardized information across related articles, and the DBpedia ontology schema can be mapped to these templates. Based on these mappings, we classify an input document by infobox category, which corresponds to an ontology class. Once the document's class is determined, we classify the appropriate sentences according to the attributes belonging to that class. Finally, we extract knowledge from the sentences classified as appropriate and convert it into triples. To train the models, we generated a training data set from a Wikipedia dump by adding BIO tags to sentences, and trained on about 200 classes and about 2,500 relations. Furthermore, we conducted comparative experiments with CRF and Bi-LSTM-CRF for the knowledge extraction process.
Through the proposed process, structured knowledge can be utilized by extracting knowledge from text documents according to the ontology schema. In addition, this methodology can significantly reduce the effort experts need to construct instances according to the ontology schema.
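
The abstract states that training data were generated from a Wikipedia dump by adding BIO tags to sentences. The paper's exact procedure is not given here, so the following is only a minimal sketch of one common way to do this: a sentence is labeled by locating an infobox value inside it. The function name `bio_tag`, the whitespace tokenization, the exact-match rule, and the example sentence and attribute are all assumptions made for illustration.

```python
from typing import List


def bio_tag(tokens: List[str], value_tokens: List[str], attribute: str) -> List[str]:
    """Tag the first exact occurrence of value_tokens in tokens.

    The first matched token gets B-<attribute>, the following matched
    tokens get I-<attribute>, and every other token gets O.
    (Hypothetical helper; exact matching is an assumption.)
    """
    tags = ["O"] * len(tokens)
    n = len(value_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == value_tokens:
            tags[start] = f"B-{attribute}"
            for i in range(start + 1, start + n):
                tags[i] = f"I-{attribute}"
            break  # label only the first match
    return tags


# Made-up sentence and infobox value, used only to show the tag layout.
sentence = "Example University was founded in 1946 in Example City".split()
tags = bio_tag(sentence, ["Example", "City"], "city")

for token, tag in zip(sentence, tags):
    print(f"{token}\t{tag}")
```

Token and tag sequences of this form are the usual input for sequence labelers such as CRF or Bi-LSTM-CRF. A real pipeline for Korean text would likely need morphological analysis and fuzzier value matching than the exact match assumed here.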

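Research on artificial intelligence technology has recently been progressing actively alongside the Fourth Industrial Revolution, and the technology is advancing faster than ever before. In this environment, a high-quality knowledge base plays an important role as a foundational technology for improving AI technology and enhancing the user experience. In particular, knowledge bases have recently been used as the underlying knowledge for services such as question answering through AI speakers. However, building a knowledge base requires a great deal of human effort, which makes knowledge construction costly in both time and money. To address this problem, this study proposes a method that uses machine learning to learn according to the structure of a knowledge base and thereby extracts knowledge from natural language documents and turns it into structured knowledge. To show the appropriateness of this method, we build knowledge by training on the structure of the DBpedia ontology. That is, we propose a methodology that learns from the infoboxes in Wikipedia articles according to the DBpedia ontology structure and, on that basis, extracts knowledge from natural language text and converts it into ontology form. The learning-based knowledge extraction process consists of document classification, suitable sentence classification, and knowledge extraction and conversion into the knowledge base. Following this methodology, we built a platform for actual knowledge extraction and showed through experiments that the proposed methodology can be useful for expanding knowledge. We expect that knowledge constructed in this way can be used in the future for knowledge-base-driven artificial intelligence.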
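
The final step described above, extracting a value and converting it into knowledge-base form, can be illustrated with a small sketch. It is not the authors' implementation. It assumes the tagger has already produced BIO tags for a sentence, and the helper names `extract_value` and `to_ntriple`, the example entity, and the choice of the English DBpedia namespaces are assumptions made for illustration.

```python
from typing import List, Optional

# Namespaces assumed for illustration; a Korean DBpedia deployment may differ.
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
DBPEDIA_ONTOLOGY = "http://dbpedia.org/ontology/"


def extract_value(tokens: List[str], tags: List[str], attribute: str) -> Optional[str]:
    """Return the first B-/I- span for the attribute as one string, or None."""
    span: List[str] = []
    for token, tag in zip(tokens, tags):
        if not span:
            if tag == f"B-{attribute}":
                span = [token]  # span starts here
        elif tag == f"I-{attribute}":
            span.append(token)  # span continues
        else:
            break  # span ended
    return " ".join(span) if span else None


def to_ntriple(subject: str, attribute: str, value: str) -> str:
    """Serialize (subject, attribute, value) as one N-Triples line with a plain literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{DBPEDIA_RESOURCE}{subject}> <{DBPEDIA_ONTOLOGY}{attribute}> "{escaped}" .'


# Made-up tagged sentence; the tags stand in for model output.
tokens = "Example University was founded in 1946 in Example City".split()
tags = ["O"] * 7 + ["B-city", "I-city"]

value = extract_value(tokens, tags, "city")
triple = to_ntriple("Example_University", "city", value) if value is not None else None
print(triple)
```

Emitting the value as a plain string literal keeps the sketch short. The abstract's "value selection and transformation" step would presumably also map values to typed literals or resource URIs, depending on the ontology property.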

Figure 1. Knowledge Extraction Process

Figure 2. Statistics of Infobox Categories

Figure 3. Statistics of Infobox Attributes

Figure 4. Bi-LSTM-CRF for Knowledge Extraction

Figure 5. Knowledge Extraction Web Service

Figure 6. Establish Attribute of University Category

Table 1. Number of Data for Training and Testing

Table 2. Experimental Results of CRF and Bi-LSTM-CRF

Table 3. Experimental Results by Value Type
