A Study on Research Paper Classification Using Keyword Clustering

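  • 이윤수 (Department of Computer and Information Communication Engineering, Daegu Catholic University) ;
  • ;
  • 이종혁 (Department of Big Data Engineering, Daegu Catholic University) ;
  • 길준민 (School of IT Engineering, Daegu Catholic University)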
  • Received : 2018.07.06
  • Accepted : 2018.08.09
  • Published : 2018.12.31

Abstract

Advances in computer and information technologies have led to the publication of a very large number of papers. As new research fields continue to emerge, users find it increasingly difficult to search for and categorize the papers that interest them. To alleviate this difficulty, this paper presents a method for grouping papers with similar content by clustering. The method uses TF-IDF to extract the primary keywords from the abstract of each paper, then applies the K-means clustering algorithm to the resulting TF-IDF values to group papers with similar content. To demonstrate the practicality of the proposed method, we use papers from the FGCS journal as real-world data. On these data, we determine the number of clusters with the Elbow method and evaluate clustering performance with the Silhouette method.

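The pipeline summarized in the abstract (TF-IDF weighting, K-means clustering, then the Elbow and Silhouette checks) can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation. The sample abstracts, the function names, the unsmoothed `log(N/df)` weighting, the L2 normalization of the vectors, and the deterministic farthest-point initialization are all assumptions made here. The sketch also uses the full vocabulary rather than only the top 10 to 30 keywords.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(len(docs) / df[w]) for w in vocab}  # assumed: unsmoothed IDF
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=50):
    """Lloyd's algorithm; returns (labels, sum of squared errors)."""
    # Assumed initialization: deterministic farthest-point seeding.
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(v, centers[l]) for v, l in zip(vectors, labels))
    return labels, sse

def silhouette(vectors, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, v in enumerate(vectors):
        dists = {}
        for j, w in enumerate(vectors):
            if i != j:
                dists.setdefault(labels[j], []).append(math.sqrt(sq_dist(v, w)))
        own = dists.pop(labels[i], None)
        if not own or not dists:  # singleton cluster, or only one cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                            # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical mini-corpus standing in for paper abstracts.
docs = [
    "cloud computing resource scheduling",
    "cloud resource allocation scheduling",
    "cloud computing virtual machine scheduling",
    "deep learning image classification",
    "deep neural network image recognition",
    "deep learning neural network training",
]
vocab, vectors = tfidf_vectors(docs)
for k in range(2, 5):
    labels, sse = kmeans(vectors, k)
    print(k, labels, round(sse, 3), round(silhouette(vectors, labels), 3))
```

The SSE printed for each k is the quantity an elbow plot would show, and the silhouette average lies in [-1, 1], with higher values indicating better separated clusters.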

Keywords

Fig. 1. System Flow

Fig. 2. Elbow Graph for Top 10 Keywords

Fig. 3. Elbow Graph for Top 20 Keywords

Fig. 4. Elbow Graph for Top 30 Keywords

Fig. 5. Silhouette Graph for Top 10 Keywords

Fig. 6. Silhouette Graph for Top 20 Keywords

Fig. 7. Silhouette Graph for Top 30 Keywords

Algorithm 1. Word Count Map-Reduce Algorithm

Algorithm 2. TF Map-Reduce Algorithm

Algorithm 3. DF Map-Reduce Algorithm

Algorithm 4. TF-IDF Map-Reduce Algorithm
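The listings for Algorithms 1 to 4 are not reproduced here. As a rough illustration of how a TF-IDF computation can be split into word count, TF, DF, and TF-IDF map-reduce jobs, the sketch below chains four in-memory rounds. It is one common decomposition and not necessarily the one the authors use. The `map_reduce` helper, all function names, the key layout, and the unsmoothed `log(N/df)` weighting are assumptions.

```python
import math
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one map, shuffle, and reduce round in memory (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: group values by key
    return [out for key, values in groups.items() for out in reducer(key, values)]

# Job 1: word count per (word, document).
def wc_map(record):
    doc_id, text = record
    for word in text.lower().split():
        yield (word, doc_id), 1

def wc_reduce(key, values):
    yield key, sum(values)

# Job 2: term frequency; group by document to get its total word count.
def tf_map(record):
    (word, doc_id), count = record
    yield doc_id, (word, count)

def tf_reduce(doc_id, values):
    total = sum(count for _, count in values)
    for word, count in values:
        yield (word, doc_id), count / total

# Job 3: document frequency; group by word to count documents containing it.
def df_map(record):
    (word, doc_id), tf = record
    yield word, (doc_id, tf)

def df_reduce(word, values):
    df = len(values)
    for doc_id, tf in values:
        yield (word, doc_id), (tf, df)

# Job 4: combine TF and DF into TF-IDF.
def tfidf_job(records, n_docs):
    def mapper(record):
        (word, doc_id), (tf, df) = record
        yield (word, doc_id), tf * math.log(n_docs / df)

    def reducer(key, values):
        yield key, values[0]                   # exactly one value per key

    return map_reduce(records, mapper, reducer)

def tfidf(docs):
    """docs maps a document id to its text; returns {(word, doc_id): weight}."""
    counts = map_reduce(list(docs.items()), wc_map, wc_reduce)
    tfs = map_reduce(counts, tf_map, tf_reduce)
    dfs = map_reduce(tfs, df_map, df_reduce)
    return dict(tfidf_job(dfs, len(docs)))

# Hypothetical two-document corpus.
weights = tfidf({"d1": "cloud cloud grid", "d2": "cloud deep"})
for key in sorted(weights):
    print(key, round(weights[key], 4))
```

In this toy corpus, a word that appears in every document (here `cloud`) receives weight 0 under the unsmoothed weighting, which is why only document-specific words would survive as top keywords.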

Table 1. Top 10-30 Keywords

Table 2. Clustering Results Using the Top 10 Keywords
