Tree-Pattern-Based Clone Detection with High Precision and Recall

  • Lee, Hyo-Sub (Department of Computer Science, College of Engineering, Hanyang University)
  • Choi, Myung-Ryul (Division of Electronic Engineering, College of Engineering, Hanyang University ERICA)
  • Doh, Kyung-Goo (Division of Computer Science, College of Computing, Hanyang University ERICA)
  • Received : 2017.04.15
  • Accepted : 2017.12.10
  • Published : 2018.05.31

Abstract

This paper proposes a code-clone detection method that aims for the highest possible precision and recall, with little regard for efficiency and scalability. The goal is to automatically create a reliable reference corpus against which the precision and recall of clone detection tools can be evaluated. The algorithm takes an abstract-syntax-tree representation of the source code and exhaustively examines every possible pair of duplicate tree patterns in the tree, while avoiding unnecessary and redundant comparisons wherever possible. The largest possible duplicate patterns are then collected into a set of pattern clusters, which are used to identify code clones. The method is implemented and evaluated on a standard set of open-source Java applications. The experimental results show very high precision and recall; all false-negative clones missed by the method are non-contiguous clones. Finally, the paper proposes the concept of neighbor patterns, which can improve recall by detecting non-contiguous and intertwined clones.
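The pairwise comparison of tree patterns described above can be illustrated with a small sketch. The abstract does not specify the matching procedure, so the code below assumes anti-unification (computing the most specific pattern shared by two subtrees) as the comparison step. The tuple-based tree encoding, the `HOLE` marker, the `min_size` threshold, and all function names are hypothetical choices made for illustration; this is not the authors' implementation.

```python
from itertools import combinations

HOLE = "?"  # hypothetical hole marker; assumed never to occur as a real leaf


def anti_unify(a, b):
    """Return the most specific pattern that matches both trees.

    Trees are nested tuples (label, child, ...); leaves are plain strings.
    """
    if a == b:
        return a
    same_shape = (
        isinstance(a, tuple) and isinstance(b, tuple)
        and a[0] == b[0] and len(a) == len(b)
    )
    if same_shape:
        children = (anti_unify(x, y) for x, y in zip(a[1:], b[1:]))
        return (a[0], *children)
    return HOLE  # the two trees disagree here


def size(pattern):
    """Count the concrete (non-hole) nodes of a pattern."""
    if pattern == HOLE:
        return 0
    if isinstance(pattern, tuple):
        return 1 + sum(size(child) for child in pattern[1:])
    return 1


def subtrees(tree, path=()):
    """Yield (path, subtree) for every internal node, root first."""
    if isinstance(tree, tuple):
        yield path, tree
        for i, child in enumerate(tree[1:]):
            yield from subtrees(child, path + (i,))


def nested(p, q):
    """True if one path is a prefix of the other (ancestor and descendant)."""
    n = min(len(p), len(q))
    return p[:n] == q[:n]


def clone_clusters(tree, min_size=3):
    """Map each shared pattern to the paths of the subtrees it matches."""
    clusters = {}
    for (p, a), (q, b) in combinations(list(subtrees(tree)), 2):
        if nested(p, q):
            continue  # assumption: a subtree is not compared with its own descendant
        pattern = anti_unify(a, b)
        if size(pattern) >= min_size:
            clusters.setdefault(pattern, set()).update({p, q})
    return clusters


# Toy AST for:  x = a + 1;  y = b + 1;
tree = (
    "block",
    ("assign", "x", ("add", "a", "1")),
    ("assign", "y", ("add", "b", "1")),
)
clusters = clone_clusters(tree)
for pattern, paths in clusters.items():
    print(pattern, sorted(paths))
# prints: ('assign', '?', ('add', '?', '1')) [(0,), (1,)]
```

The sketch compares every pair of subtrees, so its cost grows quadratically with tree size, which is in keeping with the abstract's emphasis on accuracy over efficiency. It does not prune patterns that are contained in larger ones beyond the size threshold, and it does not model neighbor patterns.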
