Grammatical Structure Oriented Automated Approach for Surface Knowledge Extraction from Open Domain Unstructured Text

  • Received : 2021.06.20
  • Accepted : 2021.11.21
  • Published : 2022.06.30

Abstract

Web news produces ever-larger volumes of information as unstructured text. Because only humans can currently understand the meaning of such text, readers face information overload, which hinders effective use of the knowledge embedded in it. Automatic Knowledge Extraction (AKE) has therefore become an integral part of the Semantic Web and Natural Language Processing (NLP). Although recent literature shows that AKE has progressed, its results still fall short of expectations. This study proposes a method that automatically extracts surface knowledge from English news into a machine-interpretable semantic format (triples). The technique is based on the grammatical structure of the sentence, from which 11 original rules were derived. In the initial experiment, triples were extracted from a Sri Lankan news corpus, and 83.5% of them were meaningful. The experiment was then extended to the British Broadcasting Corporation (BBC) news dataset to demonstrate the method's generic nature, where the meaningful-triple extraction rate was higher, at 92.6%. These results were validated using the inter-rater agreement method, which indicated high reliability.
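
To make the idea of a grammar-driven triple concrete, the following is a minimal sketch only. It does not reproduce any of the 11 rules from the study, which the abstract does not list. It assumes tokens arrive with hand-assigned coarse part-of-speech tags (a real pipeline would obtain these from a tagger or parser), and the names `Triple`, `NOMINAL_TAGS`, and `extract_svo` are hypothetical, introduced here for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """A (subject, predicate, object) statement in machine-readable form."""
    subject: str
    predicate: str
    obj: str


# Assumption: coarse tags supplied by hand for this illustration.
TaggedToken = Tuple[str, str]

# Tags kept when building the subject and object phrases (determiners are dropped).
NOMINAL_TAGS = {"NOUN", "PROPN", "ADJ"}


def extract_svo(tagged: List[TaggedToken]) -> Optional[Triple]:
    """Toy rule: nominal words, then a run of verbs, then nominal words."""
    # Locate the first verb; without one, no triple can be formed.
    verb_start = next(
        (i for i, (_, tag) in enumerate(tagged) if tag == "VERB"), None
    )
    if verb_start is None:
        return None

    # Extend over consecutive verbs so that "has raised" stays together.
    verb_end = verb_start
    while verb_end < len(tagged) and tagged[verb_end][1] == "VERB":
        verb_end += 1

    subject = [w for w, tag in tagged[:verb_start] if tag in NOMINAL_TAGS]
    obj = [w for w, tag in tagged[verb_end:] if tag in NOMINAL_TAGS]
    if not subject or not obj:
        return None

    predicate = [w for w, _ in tagged[verb_start:verb_end]]
    return Triple(" ".join(subject), " ".join(predicate), " ".join(obj))


sentence = [
    ("The", "DET"), ("central", "ADJ"), ("bank", "NOUN"),
    ("raised", "VERB"),
    ("interest", "NOUN"), ("rates", "NOUN"),
]
print(extract_svo(sentence))
```

A single rule of this kind covers only simple declarative sentences; handling clauses, passives, and prepositional attachments is where a fuller rule set would be needed.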
