Parametric and Non Parametric Measures for Text Similarity

텍스트 유사성을 위한 파라미터 및 비 파라미터 측정

  • Mlyahilu, John (Department of IT Convergence and Application Engineering, Pukyong National University)
  • Kim, Jong-Nam (Department of IT Convergence and Application Engineering, Pukyong National University)
  • Received : 2019.12.03
  • Accepted : 2019.12.25
  • Published : 2019.12.31

Abstract

The widespread circulation of genuine and fake information on the internet has led to various studies on text analysis. Copying and pasting others' work without acknowledgement and manipulating research results without proof have been trending for a while in the era of data science. Various tools have been developed to reduce, combat, and possibly eradicate plagiarism in various research fields. Text similarity can be measured manually with both parametric and non-parametric methods; this study implements cosine similarity and Pearson correlation as parametric measures and Spearman correlation as a non-parametric measure. Cosine similarity and Pearson correlation achieved the highest similarity coefficients, while Spearman correlation showed low similarity coefficients. We recommend non-parametric methods for measuring text similarity because they do not assume normality, whereas parametric methods rely on normality assumptions and are prone to bias.

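The flood of genuine and fake information on the internet has prompted numerous studies on text analysis. Copying others' work without citation and manipulating research results without grounds have drawn public attention for some time. Various tools have been developed to counter and reduce plagiarism in research. This study measures sentence similarity using parametric and non-parametric methods based on cosine similarity and the Pearson and Spearman correlations. Cosine similarity and Pearson correlation obtained the highest similarity coefficients, whereas Spearman correlation showed low similarity coefficients. This paper proposes using non-parametric methods to measure sentence similarity because they do not assume normality, in contrast to parametric methods, which depend on normality assumptions and bias.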
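
The three measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency vectors, the helper names (`term_vectors`, `ranks`), and the two sample sentences are all assumptions made for illustration. Pearson correlation is computed as the cosine of the mean-centered vectors, and Spearman correlation as the Pearson correlation of the ranks.

```python
import math
from collections import Counter


def term_vectors(text_a, text_b):
    """Build aligned term-frequency vectors over the joint vocabulary.

    Assumption: lowercase whitespace tokenization; the paper's own
    preprocessing is not described in the abstract.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation as the cosine of the mean-centered vectors.

    Returns 0.0 for a constant vector, where the coefficient is undefined.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical sample sentences, not taken from the paper's data.
x, y = term_vectors("the cat sat on the mat", "the cat lay on the mat")
print(cosine(x, y), pearson(x, y), spearman(x, y))
```

Because Spearman correlation depends only on the ordering of the term counts, it makes no assumption about their distribution, which is the property the abstract cites in favor of non-parametric measures.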
