Generating Phylogenetic Tree of Homogeneous Source Code in a Plagiarism Detection System

  • Ji, Jeong-Hoon (Graduate School of Computer Engineering, Pusan National University)
  • Park, Su-Hyun (Graduate School of Computer Engineering, Pusan National University)
  • Woo, Gyun (Graduate School of Computer Engineering, Pusan National University)
  • Cho, Hwan-Gue (Graduate School of Computer Engineering, Pusan National University)
  • Published : 2008.12.31

Abstract

Program plagiarism has become widespread with the availability of intelligent software and the global Internet. Consequently, detecting plagiarized source code and software is becoming important, especially in academia. Although numerous studies have addressed the detection of plagiarized pairs of programs, we could not find any in-depth work on the underlying mechanisms of plagiarism. In this paper, we study the evolutionary process of source code, on the premise that plagiarism can be regarded as a sequence of evolutionary steps applied to source code. The final goal of the paper is to reconstruct a tree depicting this evolutionary process. To this end, we extend local alignment, a well-known bioinformatics technique, with an adaptive scoring matrix to detect regions of similar code. The asymmetric code similarity measure based on local alignment is one of the main contributions of this paper. The phylogenetic tree, or evolution tree, of source code can be reconstructed using this asymmetric measure. To show the effectiveness and efficiency of the phylogeny construction algorithm, we conducted experiments with more than 100 real source codes obtained from the East-Asia ICPC (International Collegiate Programming Contest). The experiments showed that the proposed algorithm is quite successful in reconstructing the evolutionary direction, which enables us to identify plagiarized code more accurately and reliably. The phylogeny construction algorithm has also been implemented on top of the plagiarism detection system of an automatic program evaluation system.
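To make the pipeline in the abstract concrete, the following Python sketch illustrates the three ingredients it names: a local alignment score over token sequences, an asymmetric similarity derived from it, and a tree built from that similarity. It is an illustration under stated assumptions, not the authors' algorithm. The fixed match/mismatch/gap scores stand in for the paper's adaptive scoring matrix, the normalization by the self-alignment score is one plausible way to obtain an asymmetric measure, and the greedy attachment rule and the names local_alignment_score, asymmetric_similarity, and greedy_tree are hypothetical.

```python
from typing import Dict, List, Sequence


def local_alignment_score(a: Sequence[str], b: Sequence[str],
                          match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between two token sequences.

    Assumption: fixed scores are used here; the paper's adaptive scoring
    matrix is not reproduced.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def asymmetric_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of a's self-alignment score that is recovered by aligning a to b.

    Assumption: this normalization is illustrative. It makes the measure
    asymmetric because sim(a, b) and sim(b, a) use different denominators.
    """
    self_score = local_alignment_score(a, a)
    return local_alignment_score(a, b) / self_score if self_score else 0.0


def greedy_tree(programs: Dict[str, List[str]], root: str) -> Dict[str, str]:
    """Return a child -> parent map built greedily from a given root.

    Assumption: at each step, attach the unplaced program whose code is
    best explained by some program already in the tree.
    """
    parent: Dict[str, str] = {}
    in_tree = [root]
    remaining = [name for name in programs if name != root]
    while remaining:
        _, p, c = max(
            ((asymmetric_similarity(programs[c], programs[p]), p, c)
             for p in in_tree for c in remaining),
            key=lambda t: t[0],
        )
        parent[c] = p
        in_tree.append(c)
        remaining.remove(c)
    return parent


# Toy token streams: B extends A, and C extends B.
tokens_a = ["int", "x", "=", "0", ";"]
tokens_b = tokens_a + ["return", "x", ";"]
tokens_c = tokens_b + ["}"]

print(asymmetric_similarity(tokens_a, tokens_b))  # 1.0: A is fully contained in B
print(asymmetric_similarity(tokens_b, tokens_a))  # 0.625: only part of B appears in A
print(greedy_tree({"A": tokens_a, "B": tokens_b, "C": tokens_c}, root="A"))
```

In this toy example the difference between the two directions of the similarity is what suggests an evolutionary direction: the shorter program is entirely explained by the longer one, but not the reverse. A real system would tokenize source files first and would need a principled way to choose the root, which this sketch takes as an input.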
