A study on Korean language processing using TF-IDF

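  • Lee Jong-hwa (Department of e-Business, Dong-Eui University) ;
  • Lee Mun-bong (Department of Business Administration, Dong-Eui University) ;
  • Kim Jong-won (Department of Management Information Systems, Dong-Eui University)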
  • Received : 2019.06.17
  • Accepted : 2019.08.08
  • Published : 2019.09.30

Abstract

Purpose: One reason enterprises are expanding their information systems is the increased efficiency of data analysis. In particular, complex and unstructured data types such as video, voice, images, and conversations in and out of social networks are increasing rapidly. The purpose of this study is to analyze customer needs from customer voices, i.e., text data, in the web environment.

Design/methodology/approach: Previous studies indicate that extracting the words that best interpret a sentence is more effective than simple word-frequency analysis. In this study, we applied the TF-IDF method, which extracts the important keywords in real sentences, to Korean language research, in place of the TF method, a word extraction technique that represents sentences by simple frequency alone. We visualized the results of the two techniques using cluster analysis and describe the differences.

Findings: By applying both the TF and TF-IDF techniques to Korean natural language processing, the study showed the value of moving from frequency analysis toward semantic analysis, and it is expected to prompt Korean language processing researchers to change their technique.
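
The contrast between the two weightings can be sketched on a toy corpus. This is an illustration only, not the authors' implementation: the documents, the function names, and the IDF variant log(N / df) are assumptions, and the tokens are English placeholders standing in for the output of a Korean morphological analyzer such as the KoNLP package named in the keywords.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already-tokenized words.
docs = [
    ["delivery", "fast", "product", "good"],
    ["product", "price", "good"],
    ["product", "product", "product", "refund", "refund", "slow"],
]

def term_frequency(doc):
    """Raw count of each word in one document (the TF view)."""
    return Counter(doc)

def inverse_document_frequency(corpus):
    """idf(t) = log(N / df(t)); df(t) is the number of documents containing t.

    This is one common IDF variant, assumed here for illustration.
    """
    n_docs = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}

def tf_idf(doc, idf):
    """Weight each raw count by the word's IDF."""
    return {word: count * idf[word] for word, count in term_frequency(doc).items()}

idf = inverse_document_frequency(docs)
tf_last = term_frequency(docs[2])
scores_last = tf_idf(docs[2], idf)

# Most frequent word versus highest-weighted word in the last document.
top_tf = tf_last.most_common(1)[0][0]
top_tfidf = max(scores_last, key=scores_last.get)
print("top by TF:", top_tf)
print("top by TF-IDF:", top_tfidf)
```

In this toy corpus, "product" is the most frequent word in the last document, but because it appears in every document its IDF is zero, so TF-IDF ranks "refund" highest instead. That is the sense in which TF-IDF favors words that distinguish a document over words that are merely common.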

Keywords

Big Data; TF-IDF; Text Mining; Cluster Analysis; KoNLP
