Method of Related Document Recommendation with Similarity and Weight of Keyword

  • Lim, Myung Jin (Dept. of Computer Engineering, Graduate School, Chosun University) ;
  • Kim, Jae Hyun (Dept. of Development Division, Bichgalam Information Co.) ;
  • Shin, Ju Hyun (Dept. of Advanced Industry Convergence, Chosun University)
  • Received : 2019.08.30
  • Accepted : 2019.11.25
  • Published : 2019.11.30

Abstract

With the spread of the Internet and smartphones, services designed for user convenience have multiplied, and users can now check news in real time, anytime and anywhere. However, online news is organized by media outlet and category and offers only a few related search terms, which makes it difficult to find news related to a given keyword. To solve this problem, we propose a method that recommends related documents more accurately by applying Doc2Vec similarity to the specific keywords of news articles and weighting the titles and contents of the articles. We collect news articles from the Naver politics category by web crawling in a Java environment, preprocess them, extract topics using LDA modeling, and compute similarities using Doc2Vec. To supplement Doc2Vec, we apply TF-IDF to obtain TC (Title Contents) weights for the title and contents of each article. We then combine the Doc2Vec similarity and the TC weight into a TC weight-similarity, and evaluate the similarity between words using PMI to confirm the keyword association.
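
The PMI step mentioned above can be illustrated with a minimal sketch. It assumes the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with probabilities estimated from document-level co-occurrence; the abstract does not state which estimation window or log base the authors used, so both are assumptions here. The function name `pmi_scores` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs):
    """Document-level PMI for every word pair that co-occurs at least once.

    docs: list of token lists (assumed to be preprocessed already).
    Returns {(w1, w2): pmi} with w1 < w2, using log base 2.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), joint in pair_df.items():
        p_xy = joint / n_docs
        p_x = word_df[w1] / n_docs
        p_y = word_df[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (hypothetical tokens, not data from the paper).
docs = [
    ["summit", "denuclearization", "talks"],
    ["summit", "denuclearization"],
    ["family", "reunion"],
    ["family", "reunion", "talks"],
]
scores = pmi_scores(docs)
```

A positive score means two words appear together more often than independence would predict, a score near zero means they co-occur about as often as chance, and pairs that never co-occur are simply absent from the result.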
