Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec

Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법

  • Lee, Donghun (Dept. of Industrial and Management Engineering, Incheon National University)
  • Kim, Kwanho (Dept. of Industrial and Management Engineering, Incheon National University)
  • Received : 2018.04.11
  • Accepted : 2018.05.29
  • Published : 2018.05.31

Abstract

Extracting keywords that represent a document is important: keywords convey a document's content quickly and support automated services such as document search, classification, and recommendation. However, methods that extract keywords from web site documents using word frequency, or graph algorithms over word co-occurrence, struggle to produce semantically meaningful keywords, for two reasons. First, the structure of a web page tends to include many words that are unrelated to its topic. Second, Korean morphological analyzers (tokenizers) have limited accuracy. In this paper, we propose a method that addresses both problems by building a set of candidate keywords consisting mainly of semantically meaningful words and selecting candidates based on semantic similarity. A filtering step then removes inconsistent keywords to produce the final semantic keywords. Experiments on real web pages of small and medium-sized businesses show that the proposed method improves performance by 34.52% over a keyword selection technique based on statistical similarity. These results confirm that considering semantic similarity between words and removing inconsistent keywords improves keyword extraction from documents.
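
The two stages described in the abstract, selecting candidates by semantic similarity and then filtering out inconsistent keywords, can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so everything here is an assumption rather than the authors' method: the toy 3-dimensional vectors stand in for trained Word2Vec embeddings, and the names `select_keywords`, `mean_similarity`, `top_k`, and `min_coherence` are hypothetical.

```python
import math

# Hypothetical toy embeddings; a real system would load vectors from a
# Word2Vec model trained on a suitable corpus.
VECTORS = {
    "espresso": [0.8, 0.2, 0.1],
    "latte": [0.85, 0.15, 0.05],
    "menu": [0.6, 0.4, 0.1],
    "login": [0.0, 0.1, 0.9],
    "coffee": [0.9, 0.1, 0.0],
    "bakery": [0.7, 0.3, 0.0],
    "copyright": [0.1, 0.0, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(word, others, vectors):
    """Average cosine similarity between `word` and every word in `others`."""
    sims = [cosine(vectors[word], vectors[o]) for o in others]
    return sum(sims) / len(sims) if sims else 0.0


def select_keywords(page_words, candidates, vectors, top_k=3, min_coherence=0.4):
    """Assumed two-stage selection: rank candidates, then filter for consistency."""
    # Ignore words that have no embedding.
    page_words = [w for w in page_words if w in vectors]
    candidates = [c for c in candidates if c in vectors]

    # Stage 1: rank candidates by mean similarity to the words on the page.
    ranked = sorted(
        candidates,
        key=lambda c: mean_similarity(c, page_words, vectors),
        reverse=True,
    )
    selected = ranked[:top_k]
    if len(selected) < 2:
        return selected

    # Stage 2: drop keywords that are not similar to the other selected keywords.
    return [
        c
        for c in selected
        if mean_similarity(c, [o for o in selected if o != c], vectors) >= min_coherence
    ]


page_words = ["espresso", "latte", "menu", "login"]
candidates = ["coffee", "bakery", "copyright"]
keywords = select_keywords(page_words, candidates, VECTORS)
print(keywords)
```

In this toy setup, "copyright" survives the ranking stage only because the page contains the structural word "login", and the consistency filter then removes it because it is dissimilar to the other selected keywords. This mirrors the abstract's point that web page structure introduces off-topic words.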
