Combining Distributed Word Representation and Document Distance for Short Text Document Clustering

  • Received : 2018.04.18
  • Accepted : 2020.02.04
  • Published : 2020.04.30

Abstract

This paper presents a method for clustering short text documents, such as news headlines, social media statuses, or instant messages. Because these documents are short and sparse, an appropriate technique is required to discover hidden knowledge in them. The objective of this paper is to identify the combination of document representation, document distance, and document clustering that yields the best clustering quality. Document representations are expanded with external knowledge sources in the form of distributed representations. To cluster documents, a K-means partitioning-based clustering technique is applied, where the similarity of documents is measured by word mover's distance. To validate the effectiveness of the proposed method, experiments were conducted to compare its clustering quality against several leading methods. The proposed method produced document clusters with higher precision, recall, F1-score, and adjusted Rand index on both real-world and standard data sets. Furthermore, the clustering results were inspected manually to observe the efficacy of the proposed method. The members of each document cluster clearly reflect the topic of that cluster.
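As a rough illustration of the document distance named in the abstract, the sketch below computes a relaxed lower bound of word mover's distance, in which every word sends all of its weight to the nearest word of the other document. It is not the exact optimal-transport solution and not the authors' implementation. The 2-D word vectors, the example headlines, and the function names (`nbow`, `one_sided_cost`, `relaxed_wmd`) are assumptions invented for this example; a real setup would use trained distributed word representations.

```python
import math
from collections import Counter

# Toy 2-D word vectors, invented for illustration only (not trained embeddings).
TOY_VECTORS = {
    "president": (1.0, 0.0),
    "obama": (0.9, 0.1),
    "speaks": (0.0, 1.0),
    "greets": (0.1, 0.9),
    "media": (0.5, 0.5),
    "press": (0.55, 0.45),
    "football": (-1.0, 0.0),
    "match": (-0.9, -0.1),
    "tonight": (0.0, -1.0),
}


def nbow(tokens):
    """Normalized bag of words: word -> relative frequency (known words only)."""
    counts = Counter(t for t in tokens if t in TOY_VECTORS)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Cost when each source word moves all its weight to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[w], TOY_VECTORS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed word mover's distance: the larger of the two one-sided costs.

    This is a lower bound on the exact distance, used here to keep the
    sketch free of a linear-programming solver.
    """
    a, b = nbow(tokens_a), nbow(tokens_b)
    if not a or not b:
        raise ValueError("each document needs at least one known word")
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


# Hypothetical short documents, already tokenized and lowercased.
doc_politics_1 = ["obama", "speaks", "media"]
doc_politics_2 = ["president", "greets", "press"]
doc_sports = ["football", "match", "tonight"]

near = relaxed_wmd(doc_politics_1, doc_politics_2)
far = relaxed_wmd(doc_politics_1, doc_sports)
print(f"politics vs politics: {near:.3f}")
print(f"politics vs sports:   {far:.3f}")
```

The two politics headlines share no words, yet their distance stays small because their words lie close together in the vector space. This is the property that makes an embedding-based distance attractive for short, sparse texts.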
