Efficient Keyword Extraction from Social Big Data Based on Cohesion Scoring

  • Kim, Hyeon Gyu (Div. of Computer Science and Engineering, Sahmyook University)
  • Received : 2020.08.19
  • Accepted : 2020.10.12
  • Published : 2020.10.30

Abstract

Social reviews such as SNS feeds and blog articles are widely used to extract keywords that reflect users' opinions and complaints, and they often contain proper nouns or newly coined words that reflect recent trends. Because such words are generally absent from dictionaries, conventional morphological analyzers may fail to detect and extract them properly. In addition, the long processing time of these analyzers makes it difficult to provide analysis results in a timely manner. This paper presents a method for efficient keyword extraction from social reviews based on cohesion scoring. Cohesion scores are calculated from word frequencies, so keywords can be extracted without a dictionary. However, their accuracy can degrade when the input text is poorly spaced. To address this, we present an algorithm that improves the existing cohesion scoring mechanism using a word tree structure. Our experimental results show that the proposed method took only 0.008 seconds to extract keywords from 1,000 reviews, with an error rate of 15.5%, which is lower than that of existing morphological analyzers.

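Social reviews such as blogs and SNS feeds are widely used to extract keywords that reflect customers' opinions and complaints, and they often contain newly coined words or proper nouns that reflect recent trends. Because these words are not in a dictionary, existing morphological analyzers often fail to recognize them, and the considerable processing time these analyzers require makes it difficult to provide keyword analysis results in real time. This paper proposes a method for efficiently extracting keywords from social reviews based on the concept of cohesion scores. Cohesion scores are computed from word frequencies and therefore require no separate dictionary, but their accuracy can drop for input data without proper spacing. To address this, the paper presents an algorithm that improves the existing cohesion score computation using a word tree structure. Experiments confirm that the proposed method shows an error rate of 15.5% while taking about 0.008 seconds to process 1,000 reviews.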
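
To make the idea of frequency-based cohesion scoring concrete, the following is a minimal sketch of the baseline score as it is commonly defined (for example in the soynlp project): the cohesion of a string c_1...c_n is (freq(c_1...c_n) / freq(c_1))^(1/(n-1)), the geometric mean of the character-to-character transition probabilities. It is not the word-tree algorithm proposed in the paper, whose details are not given in the abstract. The function names, the `max_len` parameter, and the sample reviews are assumptions made for illustration only.

```python
from collections import Counter


def count_prefixes(tokens, max_len=8):
    """Count every left-anchored substring (prefix) of each space-delimited token."""
    counts = Counter()
    for token in tokens:
        for end in range(1, min(len(token), max_len) + 1):
            counts[token[:end]] += 1
    return counts


def cohesion(word, counts):
    """Baseline cohesion: (freq(word) / freq(first char)) ** (1 / (len(word) - 1))."""
    if len(word) < 2 or counts[word] == 0:
        return 0.0
    return (counts[word] / counts[word[0]]) ** (1.0 / (len(word) - 1))


def best_prefix(token, counts, max_len=8):
    """Return the prefix of `token` (length >= 2) with the highest cohesion score."""
    candidates = [token[:end] for end in range(2, min(len(token), max_len) + 1)]
    return max(candidates, key=lambda w: cohesion(w, counts), default=token)


# Hypothetical sample reviews: the noun "맛집" (popular restaurant) appears
# bare and with the particles 이, 은, and 을 attached.
reviews = ["맛집 추천 합니다", "맛집이 많아요", "맛집은 여기", "맛있는 맛집을 찾아요"]
tokens = [tok for review in reviews for tok in review.split()]
counts = count_prefixes(tokens)

# Candidate keyword for the token "맛집이", chosen without any dictionary.
print(best_prefix("맛집이", counts))
```

In this sample, "맛집" follows "맛" in four of the five tokens that start with "맛", so it scores 0.8, while "맛집이" scores lower because it occurs only once. The sketch relies on spaces to delimit tokens, which is why poorly spaced input degrades this baseline.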

