Modern Methods of Text Analysis as an Effective Way to Combat Plagiarism

  • Myronenko, Serhii (National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute") ;
  • Myronenko, Yelyzaveta (National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute")
  • Received : 2022.08.05
  • Published : 2022.08.30

Abstract

The article analyzes modern methods for automatically comparing original and unoriginal text in order to detect textual plagiarism. The study covers two types of plagiarism: literal, in which the plagiarist copies text exactly without changing anything, and intelligent, in which more sophisticated text manipulation, such as replacing words and symbols, makes the copying harder to detect. Standard extrinsic detection techniques are string-based, vector-space-based, and semantic-based. The earliest, most common, and most successful models for detecting literal plagiarism, N-gram and Vector Space, are analyzed and their advantages and disadvantages evaluated. The most effective models for detecting intelligent plagiarism, in particular those that identify paraphrases by measuring the semantic similarity of short text components, are investigated. Neural network models based on natural language sentence matching are considered, including the Densely Interactive Inference Network (DIIN), Bilateral Multi-Perspective Matching (BiMPM), and Bidirectional Encoder Representations from Transformers (BERT) and its family of models. Progress in improving plagiarism detection systems, techniques, and related models is summarized. Pressing problems that remain unresolved in detecting intelligent plagiarism, namely the effective recognition of unoriginal ideas and of skillfully paraphrased text, are outlined.
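
As a rough illustration of the two literal-plagiarism models named in the abstract, the sketch below compares texts by word n-gram overlap and by cosine similarity of term-frequency vectors. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the whitespace tokenizer, the choice of word trigrams, the Jaccard measure, and the sample sentences are all illustrative choices made here.

```python
import math
from collections import Counter


def word_ngrams(text, n=3):
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap of the two n-gram sets: |A & B| / |A | B| (assumed measure)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between raw term-frequency vectors of the two texts."""
    tf_a, tf_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample texts: an exact copy and a word-substituted paraphrase.
source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
reworded = "a fast brown fox leaps over a sleepy dog"

print(ngram_jaccard(source, copied))        # exact copy: all trigrams shared
print(ngram_jaccard(source, reworded))      # paraphrase: no trigram survives
print(cosine_similarity(source, reworded))  # partial credit from shared words
```

With these sample sentences the trigram score falls from 1.0 for the exact copy to 0.0 for the paraphrase, while the vector space score keeps some value from the shared words. This gap on word-substituted text is the kind of weakness that motivates the semantic models discussed in the abstract.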
