Performance Evaluations of Text Ranking Algorithms

  • Received : 2019.07.17
  • Accepted : 2020.01.03
  • Published : 2020.02.28

Abstract

Text ranking algorithms are a representative approach to keyword extraction, and their importance is strongly emphasized. In this paper, we review recent research that applies TF-IDF, SMART, INQUERY, and CCA, four algorithms commonly used for text ranking, and compare them experimentally. After explaining each algorithm, we compare their performance on data collected from news and Twitter. The experimental results show that all four algorithms extract specific words from the news data equally well. On the Twitter data, however, CCA performs best at extracting specific words and INQUERY performs worst. We also analyze the accuracy of the algorithms using six comparison metrics. The results show that CCA achieves the best accuracy on the news data. On the Twitter data, TF-IDF and CCA perform similarly, and both achieve good accuracy.
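
As an illustration of what keyword extraction by text ranking looks like in practice, the following minimal Python sketch ranks the terms of each document by a TF-IDF score. It assumes the common textbook weighting (relative term frequency multiplied by the natural log of N over document frequency), which may differ from the exact variant evaluated in the paper. The function name `tf_idf_keywords`, the whitespace tokenization, and the toy documents are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def tf_idf_keywords(docs, top_k=3):
    """Rank the terms of each tokenized, non-empty document by TF-IDF.

    Assumed weighting (not taken from the paper):
        score(t, d) = (count of t in d / length of d) * ln(N / df(t))
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    ranked = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return ranked


# Toy corpus standing in for short news or Twitter texts.
docs = [
    "the election results were announced the election".split(),
    "the storm hit the coast".split(),
    "the team won the match".split(),
]
keywords = tf_idf_keywords(docs, top_k=1)
print(keywords)
```

A term such as "the", which occurs in every document, receives an inverse document frequency of zero under this weighting, so it never outranks a term that is specific to one document.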
