Parametric and Non Parametric Measures for Text Similarity

(Translated from the Korean title: Parametric and Non-Parametric Measures for Text Similarity)

  • Mlyahilu, John (Department of IT Convergence and Application Engineering, Pukyong National University)
  • Kim, Jong-Nam (Department of IT Convergence and Application Engineering, Pukyong National University)
  • Received : 2019.12.03
  • Accepted : 2019.12.25
  • Published : 2019.12.31

Abstract

The wide spread of both genuine and fake information on the internet has led to various studies on text analysis. Copying and pasting others' work without acknowledgement, and manipulating research results without proof, have been common for some time in the era of data science. Various tools have been developed to reduce, combat, and possibly eradicate plagiarism in various research fields. Text similarity can be measured manually using both parametric and non-parametric methods; this study implements cosine similarity and Pearson correlation as parametric measures and Spearman correlation as a non-parametric measure. Cosine similarity and Pearson correlation achieved the highest similarity coefficients, while Spearman correlation showed low similarity coefficients. We recommend non-parametric methods for measuring text similarity because they do not assume normality, whereas parametric methods rely on normality assumptions and are subject to bias.
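To make the three measures named in the abstract concrete, the following is a minimal sketch of how they might be computed for a pair of texts. It assumes a plain term-frequency representation over the combined vocabulary, which the abstract does not specify; the helper names (`term_vectors`, `average_ranks`) and the tokenization rule are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def term_vectors(text_a, text_b):
    """Build aligned term-frequency vectors over the combined vocabulary.

    Assumption: lowercase alphanumeric tokens; the paper's preprocessing
    is not described in the abstract.
    """
    counts_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # A zero vector has no direction; return 0.0 by convention here.
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    # Strictly undefined for a constant vector; cosine() returns 0.0 then.
    return cosine(centered_x, centered_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    vec_a, vec_b = term_vectors("the cat sat on the mat", "the cat lay on the rug")
    print("cosine  :", cosine(vec_a, vec_b))
    print("pearson :", pearson(vec_a, vec_b))
    print("spearman:", spearman(vec_a, vec_b))
```

In this sketch Pearson correlation is simply cosine similarity after mean-centering, and Spearman correlation is Pearson correlation on ranks. Because Spearman depends only on the ordering of the term counts, it makes no assumption about how those counts are distributed, which is the property the abstract's recommendation rests on.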
