Fig. 1. System Flow
Fig. 2. Elbow Graph for Top 10 Keywords
Fig. 3. Elbow Graph for Top 20 Keywords
Fig. 4. Elbow Graph for Top 30 Keywords
Fig. 5. Silhouette Graph for Top 10 Keywords
Fig. 6. Silhouette Graph for Top 20 Keywords
Fig. 7. Silhouette Graph for Top 30 Keywords
Algorithm 1. Word Count Map-Reduce Algorithm
Algorithm 2. TF Map-Reduce Algorithm
Algorithm 3. DF Map-Reduce Algorithm
Algorithm 4. TF-IDF Map-Reduce Algorithm
Table 1. Top 10–30 Keywords
Table 2. Clustering Results Using the Top 10 Keywords