• Title/Summary/Keyword: code clone


Domain Analysis of Device Drivers Using Code Clone Detection Method

  • Ma, Yu-Seung;Woo, Duk-Kyun
    • ETRI Journal / v.30 no.3 / pp.394-402 / 2008
  • Domain analysis is the process of analyzing related software systems in a domain to find their common and variable parts. Device drivers are highly suitable for domain analysis because drivers in the same domain are implemented similarly for each device and each system that they support. Building on this characteristic, this paper introduces a new approach to the domain analysis of device drivers. Our method uses a code clone detection technique to extract similarities among device drivers of the same domain. To examine the applicability of our method, we investigated all the device drivers in a Linux source tree. The results showed that the code clone detection method can identify many reusable, similar code fragments. We also investigated whether our method is applicable to other kernel sources. However, the results show that the code clone detection method is not useful for the domain analysis of all kernel sources. That is, the applicability of the code clone detection method to domain analysis is a peculiar feature of device drivers.


CCR : Tree-pattern based Code-clone Detector (CCR : 트리패턴 기반의 코드클론 탐지기)

  • Lee, Hyo-Sub;Do, Kyung-Goo
    • Journal of Software Assessment and Valuation / v.8 no.2 / pp.13-27 / 2012
  • This paper presents CCR (Code Clone Ransacker), a tree-pattern-based code-clone detector that finds all clustered duplicate patterns by comparing every pair of subtrees in the programs. A pattern contained in its entirety in another pattern is ignored, since only the largest duplicate patterns are of interest. In our evaluation, CCR shows high precision and recall. Previous tree-pattern-based code-clone detectors are known to have good precision and recall because they compare program structure. CCR maintains high precision while achieving up to 5 times higher recall than Asta and about 1.9 times higher recall than CloneDigger. The tool also finds the majority of the clones in Bellon's reference corpus.

Tree-Pattern-Based Clone Detection with High Precision and Recall

  • Lee, Hyo-Sub;Choi, Myung-Ryul;Doh, Kyung-Goo
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.5 / pp.1932-1950 / 2018
  • The paper proposes a code-clone detection method that gives the highest possible precision and recall, without giving much attention to efficiency and scalability. The goal is to automatically create a reliable reference corpus that can be used as a basis for evaluating the precision and recall of clone detection tools. The algorithm takes an abstract-syntax-tree representation of source code and thoroughly examines every possible pair of all duplicate tree patterns in the tree, while avoiding unnecessary and duplicated comparisons wherever possible. The largest possible duplicate patterns are then collected in the set of pattern clusters that are used to identify code clones. The method is implemented and evaluated for a standard set of open-source Java applications. The experimental result shows very high precision and recall. False-negative clones missed by our method are all non-contiguous clones. Finally, the concept of neighbor patterns, which can be used to improve recall by detecting non-contiguous clones and intertwined clones, is proposed.
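The idea of collecting only the largest duplicate subtrees can be illustrated with a minimal sketch. It is not the paper's algorithm: it targets Python rather than Java, it groups subtrees by an exact structural fingerprint instead of comparing tree patterns pairwise, and the names `largest_duplicate_subtrees` and `min_size` are hypothetical choices made for this illustration.

```python
import ast
from collections import defaultdict


def node_size(node):
    """Number of AST nodes in the subtree rooted at `node`."""
    return sum(1 for _ in ast.walk(node))


def largest_duplicate_subtrees(source, min_size=5):
    """Return line-number lists for duplicated subtrees not nested in larger duplicates.

    Assumption: two subtrees are duplicates only if their ast.dump() output
    is identical (same identifiers and constants), a much stricter notion
    than the tree patterns described in the abstract.
    """
    tree = ast.parse(source)
    groups = defaultdict(list)
    for node in ast.walk(tree):
        # Only statements and expressions carry line numbers.
        if isinstance(node, (ast.stmt, ast.expr)) and node_size(node) >= min_size:
            groups[ast.dump(node)].append(node)

    duplicates = [nodes for nodes in groups.values() if len(nodes) > 1]

    # Every node strictly inside some duplicated subtree is "covered".
    covered = set()
    for nodes in duplicates:
        for root in nodes:
            for child in ast.walk(root):
                if child is not root:
                    covered.add(id(child))

    # Keep a group only if at least one occurrence is not nested in a larger duplicate.
    return [
        sorted(n.lineno for n in nodes)
        for nodes in duplicates
        if not all(id(n) in covered for n in nodes)
    ]


example = (
    "def f(a, b):\n"
    "    total = a * b + 1\n"
    "    return total\n"
    "\n"
    "def g(a, b):\n"
    "    total = a * b + 1\n"
    "    return total\n"
)
print(largest_duplicate_subtrees(example))
```

In this example the repeated assignment is reported once, and the duplicated sub-expressions inside it are suppressed because they are contained in the larger duplicate.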

Improvement of BigCloneBench Using Tree-Based Convolutional Neural Network (트리 기반 컨볼루션 신경망을 이용한 BigCloneBench 개선)

  • Park, Gunwoo;Hong, Sung-Moon;Kim, Hyunha;Doh, Kyung-Goo
    • Journal of Software Assessment and Valuation / v.15 no.1 / pp.43-53 / 2019
  • BigCloneBench has recently been used to evaluate the performance of machine-learning-based code clone detection tools. However, since BigCloneBench is not a benchmark optimized for machine learning, incorrect training data can be created. In this paper, we show through machine-learning experiments that Type-4 clone methods can be found in addition to the set provided by BigCloneBench. Experimental results using a Tree-Based Convolutional Neural Network show that our proposed method is effective in improving BigCloneBench's dataset.

Cross-Language Clone Detection based on Common Token (공통 토큰에 기반한 서로 다른 언어의 유사성 검사)

  • Hong, Sung-Moon;Kim, Hyunha;Lee, Jaehyung;Park, Sungwoo;Mo, Ji-Hwan;Doh, Kyung-Goo
    • Journal of Software Assessment and Valuation / v.14 no.2 / pp.35-44 / 2018
  • Tools for detecting cross-language clones usually compare abstract-syntax-tree representations of source code, an approach that lacks scalability. To compare large bodies of source code at a practical scale, we need a similarity-checking technique that works at the token level. In this paper, we define common tokens that represent all tokens commonly used in programming languages of different paradigms. Source code in each language is then transformed into a list of common tokens, and these lists are compared. Experimental results using exEyes show that our proposed method using common tokens is effective in detecting cross-language clones.
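A small sketch can make the common-token idea concrete. The mapping table, the token names (`FUNC`, `ID`, `NUM`, and so on), and the use of `difflib.SequenceMatcher` for comparison are all assumptions made for illustration; the abstract does not specify the actual common-token set or how exEyes compares the lists.

```python
import difflib
import re

# Hypothetical mapping from language-specific lexemes to shared common tokens.
COMMON = {
    "def": "FUNC", "function": "FUNC",
    "and": "AND", "&&": "AND",
    "or": "OR", "||": "OR",
    "not": "NOT", "!": "NOT",
    "True": "BOOL", "true": "BOOL", "False": "BOOL", "false": "BOOL",
}
KEYWORDS = {"if", "else", "while", "for", "return"}
# Block delimiters and declaration keywords that differ between languages.
DROP = {"{", "}", ":", ";", "var", "let", "const"}

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|&&|\|\||[<>=!]=|[^\s\w]")


def to_common_tokens(code):
    """Translate source text into a list of language-neutral tokens."""
    tokens = []
    for lexeme in TOKEN_RE.findall(code):
        if lexeme in DROP:
            continue
        if lexeme in COMMON:
            tokens.append(COMMON[lexeme])
        elif lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")   # every identifier collapses to one token
        elif lexeme.isdigit():
            tokens.append("NUM")  # every integer literal collapses to one token
        else:
            tokens.append(lexeme)  # operators and punctuation pass through
    return tokens


def similarity(code_a, code_b):
    """Similarity ratio in [0, 1] between the two common-token lists."""
    matcher = difflib.SequenceMatcher(
        None, to_common_tokens(code_a), to_common_tokens(code_b)
    )
    return matcher.ratio()


python_code = "def add(a, b):\n    return a + b\n"
javascript_code = "function add(x, y) { return x + y; }"
print(to_common_tokens(python_code))
print(similarity(python_code, javascript_code))
```

Because comparison happens on flat token lists rather than trees, the cost per pair stays close to that of ordinary sequence matching, which is the scalability argument the abstract makes.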

Enhancing the performance of code-clone detection tools using code2vec (code2vec을 이용한 유사도 감정 도구의 성능 개선)

  • Um, Taeho;Hong, Sung Moon;Yang, Joon Hyuk;Jang, Hyo Seok;Doh, Kyung-Goo
    • Journal of Software Assessment and Valuation / v.17 no.1 / pp.31-40 / 2021
  • Plagiarism refers to the act of using original material as if it were one's own without revealing the source. The plagiarism of source code causes a variety of problems, including legal disputes. Plagiarism in software projects is usually determined by measuring similarity between every pair of source files in the two projects. However, blindly comparing every pair imposes a huge computational burden, which is a major reason why more accurate tools are not used. If we compare only the pairs that are likely to be clones and eliminate the pairs that cannot be clones, we can concentrate more on improving the accuracy of detection. In this paper, we propose a method of selecting highly probable candidate clone pairs by pre-classifying suspected source code using a machine-learning model called code2vec.
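The pre-filtering step might look like the following sketch. It assumes that each code unit has already been turned into a numeric vector (for example by code2vec) and that candidates are chosen by a cosine-similarity threshold; the abstract does not state the actual selection rule, and the vectors, unit names, and threshold value below are made-up placeholders.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def candidate_pairs(project_a, project_b, threshold=0.8):
    """Keep only cross-project pairs whose embeddings are close.

    project_a and project_b map a unit name to its embedding vector.
    Only the returned pairs would be passed to a slower, more accurate
    similarity tool.
    """
    return [
        (name_a, name_b)
        for (name_a, vec_a), (name_b, vec_b) in product(
            project_a.items(), project_b.items()
        )
        if cosine(vec_a, vec_b) >= threshold
    ]


# Toy two-dimensional vectors, not real code2vec output.
project_a = {"A.sort": [1.0, 0.0], "A.log": [0.0, 1.0]}
project_b = {"B.sort2": [0.9, 0.1], "B.parse": [-1.0, 0.2]}
print(candidate_pairs(project_a, project_b))
```

The filter still touches every pair, but a dot product is far cheaper than a full structural comparison, so the expensive tool runs only on the pairs that survive.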

Automatic Generation of Code-clone Reference Corpus (코드클론 표본 집합체 자동 생성기)

  • Lee, Hyo-Sub;Doh, Kyung-Goo
    • Journal of Software Assessment and Valuation / v.7 no.1 / pp.29-39 / 2011
  • To evaluate the quality of a clone detection tool, we need to know how many clones the tool misses. Hence we need a standard code-clone reference corpus for a carefully chosen set of sample source code. The reference corpus available so far has been built by manually collecting clones from the results of various existing tools. This paper presents a tree-pattern-based clone detection tool that can be used for the automatic generation of a reference corpus. Our tool is compared with CloneDR for precision and with Bellon's reference corpus for recall. Our tool finds no false positives and finds 2 to 3 times more clones than CloneDR. Compared to Bellon's reference corpus, our tool shows a recall rate of 93% to 100% and detects far more clones.

Web Page Similarity based on Size and Frequency of Tokens (토큰 크기 및 출현 빈도에 기반한 웹 페이지 유사도)

  • Lee, Eun-Joo;Jung, Woo-Sung
    • Journal of Information Technology Services / v.11 no.4 / pp.263-275 / 2012
  • Web applications are becoming hard to maintain because of the high complexity and duplication of web pages. However, most research on code clones focuses on code hunks and is limited to a specific target language. Thus, we propose GSIM, a language-independent statistical approach that detects similar pages based on the scarcity and frequency of customized tokens. The tokens, which are obtained by splitting pages on a given set of separators, are defined as the atomic elements for calculating the similarity between two pages. This paper gives the domain definition for web applications and the algorithms for collecting tokens, building matrices, and calculating similarity. We also conducted experiments on open-source code with our GSIM tool for evaluation. The results show the applicability of the proposed method and the effects of parameters such as threshold, toughness, and token length on quality and performance.
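A rough sketch of separator-based tokenization and a size-and-frequency weighted similarity is shown below. It is not GSIM's formula, which the abstract does not give: the separator set, the `min_len` parameter, and the weighting (token length multiplied by occurrence count) are assumptions chosen only to illustrate the general approach.

```python
import re
from collections import Counter


def tokenize(page, separators=" \t\n<>=\"/", min_len=2):
    """Split a page on any run of separator characters; drop very short tokens."""
    pattern = "[" + re.escape(separators) + "]+"
    return [tok for tok in re.split(pattern, page) if len(tok) >= min_len]


def weight(counter):
    """Total weight of a token multiset: token length times occurrence count."""
    return sum(len(tok) * count for tok, count in counter.items())


def page_similarity(page_a, page_b):
    """Weighted overlap of the two token multisets, in [0, 1]."""
    a, b = Counter(tokenize(page_a)), Counter(tokenize(page_b))
    total = weight(a | b)  # multiset union: maximum count per token
    return weight(a & b) / total if total else 0.0  # intersection: minimum count


page_1 = '<div class="menu">Home</div>'
page_2 = '<div class="menu">About</div>'
print(page_similarity(page_1, page_2))
```

Because the tokenizer only needs a separator set, the same code applies to HTML, templates, or scripts without a language-specific parser, which is the language independence the abstract emphasizes.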

A Code Clustering Technique for Unifying Method Full Path of Reusable Cloned Code Sets of a Product Family (제품군의 재사용 가능한 클론 코드의 메소드 경로 통일을 위한 코드 클러스터링 방법)

  • Kim, Taeyoung;Lee, Jihyun;Kim, Eunmi
    • KIPS Transactions on Software and Data Engineering / v.12 no.1 / pp.1-18 / 2023
  • Similar software is often developed with the Clone-And-Own (CAO) approach, which copies and modifies existing artifacts. The CAO approach is considered bad practice because it makes maintenance difficult as the number of cloned products increases. Software product line engineering is a methodology that can solve this problem by developing a product family through systematic reuse. Migrating product families that have been developed with the CAO approach to product line engineering begins with finding their clones, integrating them, and building them into reusable assets. However, cloning occurs at various levels, from directories to lines of code, and the structure of the clones can change. This makes it difficult to build a product line code base simply by finding clones. Successful migration thus requires unifying the source code's file paths, class names, and method signatures. This paper proposes a clustering method that identifies sets of similar code scattered across product variants whose full method paths differ and therefore need to be unified. To show the effectiveness of the proposed method, we conducted an experiment using the Apo Games product line, which has evolved with the CAO approach. As a result, clustering performed without preprocessing achieved an average precision of 0.91 and identified 0 common clusters, whereas our method achieved 0.98 and 15, respectively.

Which Code Changes Should You Review First?: A Code Review Tool to Summarize and Prioritize Important Software Changes

  • Song, Myoungkyu;Kwon, Young-Woo
    • Journal of Multimedia Information System / v.4 no.4 / pp.255-262 / 2017
  • In recent software development, repetitive code fragments (i.e., clones) are common due to copy-and-paste programming, framework-based development, and the reuse of the same design patterns. Such similar code fragments are likely to introduce more bugs but are easily overlooked by a code reviewer or a programmer. In this paper, we present a code review tool that helps code reviewers identify important code changes written by other programmers and recommends which changes to review first. Specifically, to identify important code changes, our approach detects code clones across revisions and investigates them. Then, to help a code reviewer, our approach ranks the identified changes according to several software quality metrics and statistics on those clones and changes. Furthermore, our approach allows code reviewers to express their preferences during code review. As a result, a code reviewer who has little knowledge of a code base can reduce his or her effort by reviewing the most significant changes that require immediate attention. To evaluate our approach, we integrated it into a modern IDE (e.g., Eclipse) as a plugin and then analyzed two third-party open-source projects. The experimental results indicate that our approach can improve code reviewers' productivity.