Machine Learning Based Keyphrase Extraction: Comparing Decision Trees, Naïve Bayes, and Artificial Neural Networks

  • Sarkar, Kamal (Dept. of Computer Science and Engineering, Jadavpur University)
  • Nasipuri, Mita (Dept. of Computer Science and Engineering, Jadavpur University)
  • Ghose, Suranjan (Dept. of Computer Science and Engineering, Jadavpur University)
  • Received: 2012.07.02
  • Accepted: 2012.10.05
  • Published: 2012.12.31

Abstract

This paper presents three machine learning based keyphrase extraction methods that use Decision Trees, Naïve Bayes, and Artificial Neural Networks, respectively. We treat keyphrases as phrases of one or more words that represent the important concepts in a text document. The three methods are compared with one another and with KEA, a publicly available keyphrase extraction system. The experimental results show that the Neural Network based method outperforms the Decision Tree and Naïve Bayes based methods, and that it also performs better than KEA.
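As a rough illustration of the supervised framing described in the abstract, the sketch below turns a document into candidate phrases of one to three words and attaches two simple features to each. The stopword list, the three-word limit, the two features (relative frequency and relative first position), and the names `tokenize` and `candidate_features` are all assumptions made for this example. The abstract does not specify the paper's actual candidate generation or feature set.

```python
import re

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "is", "are",
             "to", "that", "with", "on", "as", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_features(text, max_len=3):
    """Map each candidate phrase (1..max_len words) to (frequency, first_position).

    frequency      : occurrences of the phrase divided by the number of tokens
    first_position : index of the first occurrence divided by the number of tokens
    Both features are hypothetical choices for illustration only.
    """
    tokens = tokenize(text)
    n = len(tokens)
    counts, first = {}, {}
    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            gram = tokens[i:i + size]
            # Skip candidates that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            counts[phrase] = counts.get(phrase, 0) + 1
            first.setdefault(phrase, i)  # remember only the first occurrence
    return {p: (counts[p] / n, first[p] / n) for p in counts}


if __name__ == "__main__":
    doc = "neural networks extract keyphrases and neural networks learn"
    for phrase, feats in sorted(candidate_features(doc).items()):
        print(phrase, feats)
```

In a setup like this, each feature vector would be paired with a label (keyphrase or not, for instance taken from author-assigned keyphrases) and passed to a classifier such as a decision tree, Naïve Bayes, or a neural network. Which features and training data the paper actually uses cannot be determined from the abstract.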
