1. Introduction
Modern software achieves high quality through sophisticated development techniques and processes. However, even software developed through a sophisticated process still contains many defects. Defects are critical to software, and managing the defects reported to developers after development is always a major issue. Even if a developer understands the developed software, in large projects it is difficult to predict in a short time which components a defect is associated with and which components to modify when the defect is reported [1]. Tracing these faults consumes additional human and time resources. Therefore, it is necessary to trace the location of the defect-related resources as well as to determine how to correct the defects. To support this, in this paper we propose and verify a model-based technique for enhancing fault traceability that uses bug reports and the commit information of a software version control system (VCS). The contributions of this study are as follows.
- We extract keywords from VCS commit descriptions, source code changes, and bug reports, and use them as criteria for selecting commits associated with the reported bug report.
- After finding similar commit information, we use the behavior model to trace resources related to the origin of the fault.
The rest of this paper is organized as follows. Chapter 2 describes the background knowledge. Chapter 3 introduces related works, and Chapter 4 describes our proposed technique. Chapter 5 presents a case study, and Chapter 6 verifies our technique through experiments on an open source project. Chapter 7 discusses our research, and Chapter 8 concludes the paper.
2. Background Knowledge
2.1 Bug Report
A bug report describes a defect or functional enhancement that a user found during the test phase or while using the deployed software [2, 3]. It describes the details of the defect and provides a variety of information so that the developer can figure out the cause of the defect.
(Figure 1) A Bug Report
Figure 1 shows an example of a bug report from zxing, an open source project. A bug report on GitHub, a Git repository hosting service, contains information such as the title, the report content, related images, and the date.
2.2 Version Control System
A version control system (VCS) is a system for assigning versions to changes in software source code and software outputs, storing those versions, and controlling them.
When the source code or outputs change, the developer commits the changes to record them in the VCS. A single commit carries information such as a description of the commit and the source code changes. Generally, a commit description describes the source code changes or the modified features so that changes can be identified and commits can be traced and recovered throughout the project [4]. Typically, the unit of a commit is a group of changes with the same purpose, such as a bug fix or a functional enhancement. That is, the changes contained in a single commit are considered to be related resources.
3. Related Works
There are many existing studies that use a variety of methodologies for fault traceability, that is, for tracing related resources in the event of a software fault. Among previous studies, S. Baek et al. [5] used a behavior model, the commit information of a VCS, and a web-application-specific bug report form for web applications developed on the basis of the MVC pattern. To trace the source code from the bug report, [5] traced the entry point of the controller that handles the user's request through the URI information in the bug report, if the report contained any. Otherwise, they used the similarity between the bug report and the source code of the controller. In addition, [5] used only source code changes as commit information to improve fault traceability, and considered the source code changes in a single commit to be associated resources. C. Youm et al. [6] proposed a methodology that improves fault traceability by utilizing bug reports, structured source code change history, and stack trace information based on an information retrieval method. L. Moreno et al. [7] determined, through an information retrieval method, whether a bug report is related to source code information such as class names, method names, and arguments, and proposed a methodology for tracing the source code from the bug report.
Existing fault traceability studies mainly used information retrieval methods, and some studies utilized a behavior model [8-12]. However, since most of these studies trace at the source code file level, it is hard to provide the developer with detailed resources related to the defect. In [5], the trace level is the method level, but the technique cannot be applied to general projects because it is restricted to web applications using the MVC pattern. For a defect tracing technique to be useful, the trace level should be as detailed as the method level [13], and generality must be ensured so that the technique can be applied to general projects.
4. Fault Traceability Enhancement Technique
4.1 Overview
An overview of the fault traceability enhancement technique proposed in this paper is shown in Figure 2. The technique is largely a two-step process. The first step finds the commit most related to the reported bug report. The second step enhances fault traceability for that commit by using a sequence diagram as a behavior model. A detailed description of each step is given in Sections 4.2 and 4.3.
(Figure 2) An Overview of Enhancing Fault Traceability Technique
4.2 Tracing a Related Commit with the Bug Report
This step traces the commit whose commit information is most related to the bug report. It takes advantage of the nature of VCS commits. First of all, a commit description describes the changes made by the commit. Also, the changes contained in a single commit are resources that are related to each other. Therefore, we use the VSM (Vector Space Model) and TF-IDF techniques to calculate the similarity between the bug report and the commit information in order to find the commit related to the bug report [14]. The detailed steps are as follows.
- Extract keywords from the bug report and commit information
- Calculate similarity between bug report and commit information using VSM
- Extract information of the commit with the highest similarity score
Step 1 removes stop words from natural language text and program code and then extracts keywords. For commits, the targets of keyword extraction are the commit description and the source code changes. For source code changes, we extended the extraction range to the three lines above and below the added, changed, and deleted code. This is a heuristic: including some of the surrounding code of the modified method makes it possible to extract more accurate keywords. In this study, we used the stopword module of Node.js for Step 1. In addition, for the source code changes of commits, we improved the accuracy of Step 3 by removing reserved words such as public, void, string, and int as well as stop words.
Step 2 calculates the similarity between the bug report and the commit information using the VSM. The similarity is expressed as follows.
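As an illustration of Step 1, the following is a minimal Python sketch, not the implementation used in the study (which relied on the stopword module of Node.js). The stop word list is a small hypothetical stand-in, the reserved word list contains only the four words named above, and the function name `extract_keywords` and the sample inputs are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop word list; the study relies on the
# Node.js stopword module, whose full list is not reproduced here.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "when", "it"}

# Only the reserved words named in the text; a real list would be longer.
RESERVED_WORDS = {"public", "void", "string", "int"}


def extract_keywords(text, drop_reserved=False):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words.

    When drop_reserved is True (intended for source code changes), the
    reserved words are removed as well.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    excluded = (STOP_WORDS | RESERVED_WORDS) if drop_reserved else STOP_WORDS
    return [t for t in tokens if t not in excluded]


# Hypothetical inputs: a bug report sentence and one changed line of code.
bug_report = "The decoder fails when the barcode image is rotated"
code_change = "public int getHeight() { return height; }"

print(extract_keywords(bug_report))
print(extract_keywords(code_change, drop_reserved=True))
```

The same function serves both natural language and code; only the `drop_reserved` flag differs, which mirrors the extra reserved word filtering applied to source code changes.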
\(\text{similarity}\left(d_{i}, q\right)=\frac{\overrightarrow{V_{d_{i}}} \cdot \overrightarrow{V_{q}}}{\|\overrightarrow{V_{d_{i}}}\|\|\overrightarrow{V_{q}}\|}\)
where \(\overrightarrow{V_{d_{i}}}\) is the vector of term weights of document \(d_i\),
and \(\overrightarrow{V_q}\) is the vector of term weights of query \(q\).
In the similarity formula, \(q\) and \(d_i\) refer to the query and the i-th document in the document set \(D\), respectively. \(\overrightarrow{V_{d_{i}}}\) is the term weight vector of the i-th document, and \(\overrightarrow{V_q}\) is the term weight vector of the query. Each term weight vector is calculated using the term frequency (TF) and the inverse document frequency (IDF), which are defined as follows.
\(\operatorname{TF}(t, d)=\frac{f(t, d)}{\# \text { of terms }}\)
\(\operatorname{IDF}(t, D)=\log \left(\frac{\# \text { of } D}{n}\right)\)
where \(\vec V\) is the vector of \(w_t\),
and \(w_t\) is the weight of a term \(t\).
In the term frequency formula, \(t\) and \(d\) are a term and a document, respectively, \(f(t,d)\) is the number of occurrences of term \(t\) in document \(d\), and # of terms is the total number of terms in document \(d\). In the inverse document frequency formula, \(t\) and \(D\) are a term and the whole set of documents, respectively, # of \(D\) is the total number of documents, and \(n\) is the number of documents containing term \(t\). Finally, \(\vec V\) is the vector of term weights, where the weight \(w_t\) of term \(t\) is the product of \(\operatorname{TF}\) and \(\operatorname{IDF}\).
As a final step, Step 3 ranks the commits by the similarity score calculated using the VSM and extracts the information of the commit with the highest similarity score.
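The formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the token lists are hypothetical, the IDF is computed over the commit documents only, and assigning weight 0 to a query term that occurs in no document (to avoid a division by zero) is an assumption that the text does not specify.

```python
import math
from collections import Counter


def tf(term, doc):
    """TF(t, d): occurrences of term in doc divided by the number of terms in doc."""
    return Counter(doc)[term] / len(doc) if doc else 0.0


def idf(term, docs):
    """IDF(t, D): log(# of D / n), where n is the number of documents containing term."""
    n = sum(1 for d in docs if term in d)
    # Assumption: a term that appears in no document gets weight 0
    # instead of causing a division by zero.
    return math.log(len(docs) / n) if n else 0.0


def weight_vector(doc, docs, vocabulary):
    """Vector of w_t = TF(t, doc) * IDF(t, docs), one entry per vocabulary term."""
    return [tf(t, doc) * idf(t, docs) for t in vocabulary]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical keyword lists for three commits (documents) and one bug report (query).
commits = [
    ["decoder", "image", "rotated"],
    ["build", "script", "update"],
    ["image", "binarizer", "histogram"],
]
query = ["rotated", "image", "decoder", "fails"]

vocabulary = sorted({t for d in commits for t in d} | set(query))
v_q = weight_vector(query, commits, vocabulary)
similarities = [
    cosine_similarity(weight_vector(d, commits, vocabulary), v_q) for d in commits
]
for index, score in enumerate(similarities):
    print(index, round(score, 4))
```

In this toy data the first commit shares all of its terms with the query and therefore scores highest, while the second commit shares none and scores zero.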
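Step 3 amounts to sorting the commits by score and taking the first one. A minimal sketch follows; the commit identifiers and scores are hypothetical placeholders, and the function name `rank_commits` is an assumption made for this example.

```python
def rank_commits(scores, top_k=5):
    """Return the top_k (commit_id, score) pairs, highest similarity first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical similarity scores keyed by commit identifier.
scores = {"commit_a": 0.12, "commit_b": 0.47, "commit_c": 0.31}

ranking = rank_commits(scores)
best_commit, best_score = ranking[0]
print(best_commit, best_score)
```

Keeping the whole ranking, rather than only the best commit, makes it easy to report a top-five list of the kind used in the case study.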
4.3 Enhancing Fault Traceability using Behavior Model
In this study, a behavior model is used to improve fault traceability. A behavior model represents how each component behaves to carry out a function that meets a specific requirement. Generally, the behavior model is represented as a sequence diagram [15] and can be used to trace the resources associated with a component [16]. We enhance fault traceability by applying the behavior model to the information of the commit most similar to the bug report, found through the method described in Section 4.2. The detailed steps of this method are as follows.
- Extract source code changes from commit information
- Extract methods modified by source code changes
- Trace related resources using sequence diagrams
In the first step, only the source code changes are extracted from the information of the commit with the highest similarity score. Step 2 extracts the methods modified by the source code changes. The source code changes stored in the commit information record only which lines of source code were added, modified, or deleted, so it is necessary to determine which method and class the modified lines belong to. Therefore, we parse the modified source files to identify the methods that the source code changes belong to.
As a final step, fault tracing is performed through the component flow of the sequence diagram for the extracted methods. The reason for this procedure is that the methods that need to be modified to fix the bug may lie in other functionally related resources rather than in the source code changes of the commit most similar to the bug report. In this paper, we define the traceability set obtained through enhanced fault traceability as follows.
The traceability set TS is a set of methods. \(m_c\) is a method modified by the commit similar to the bug report, and it is added to TS. Each related method called by \(m_c\) is also added to TS.
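The mapping from modified lines to enclosing methods can be sketched as an interval overlap check. The sketch below assumes that a parser has already produced the line span of each method; the `MethodSpan` type, the class and method names, and the line numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodSpan:
    class_name: str
    method_name: str
    start_line: int  # first line of the method, inclusive
    end_line: int    # last line of the method, inclusive


def methods_touched(spans, changed_ranges):
    """Return the methods whose line span overlaps any changed line range."""
    touched = []
    for span in spans:
        for first, last in changed_ranges:
            # Two inclusive ranges overlap when each starts before the other ends.
            if span.start_line <= last and first <= span.end_line:
                touched.append(span)
                break
    return touched


# Hypothetical parser output for one source file.
spans = [
    MethodSpan("ExampleReader", "open", 10, 30),
    MethodSpan("ExampleReader", "readRow", 32, 60),
    MethodSpan("ExampleReader", "close", 62, 70),
]
# Hypothetical changed line ranges taken from a commit's diff.
changed = [(40, 46), (65, 65)]

for method in methods_touched(spans, changed):
    print(f"{method.class_name}.{method.method_name}")
```

A change that falls outside every method span, such as an edited import or field declaration, simply maps to no method under this scheme.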
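The construction of TS can be sketched as a one-level expansion over the call relationships read from the sequence diagram. The sketch assumes that those relationships are available as a dictionary from caller to callees and that only direct calls are followed; the method names are the ones that appear in the case study of Chapter 5, while the function name `build_traceability_set` is hypothetical.

```python
def build_traceability_set(modified_methods, calls):
    """Return TS: the modified methods plus the methods they call directly."""
    ts = set(modified_methods)
    for m_c in modified_methods:
        # Methods without an entry in the call mapping contribute no callees.
        ts.update(calls.get(m_c, ()))
    return ts


# Call relationships as they might be read from a sequence diagram
# (caller -> callees); only the calls mentioned in the case study are listed.
calls = {"assertCorrectImage2result": ["getHeight", "getBlackRow"]}
modified = ["assertCorrectImage2binary", "assertCorrectImage2result"]

ts = build_traceability_set(modified, calls)
print(sorted(ts))
```

Following only direct calls keeps TS small; expanding transitively would enlarge the set further, which relates to the coupling concern discussed in Section 7.2.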
5. Case Study
In this chapter, we apply the methodology presented above to an example from the zxing project.
5.1 Tracing a Related Commit with the Bug Report
We first extract keywords from the bug report of Figure 1 in Chapter 2 and remove stop words. The total number of terms extracted from the bug report through this procedure is 70. After that, keywords excluding stop words and reserved words are extracted from the descriptions and the source code changes of all commits of the project. We then treat the bug report as the query and the commit information as the documents, and apply TF-IDF and the VSM to them. Table 1 shows the results of calculating the similarity scores between the bug report and the commits.
(Table 1) TF-IDF & VSM Result(Top 5 Commits)
Comparing the similarity scores calculated by the VSM, we can see that the commit (84a3b27...) with a similarity score of 0.5017 is the commit most similar to the bug report. Therefore, we extract the information of this commit and use it in the next step.
5.2 Enhancing Fault Traceability by Using Behavior Model
The source code changes of the commit extracted in the previous step include lines 239-245 in RSSExpandedImage2binaryTestCase.java, lines 91-97 in RSSExpandedImage2resultTestCase.java, and so on. First, we analyze the method structure by parsing the modified Java files to figure out which method each modified line belongs to. Through the analyzed method structure, we can determine that lines 239-245 in RSSExpandedImage2binaryTestCase.java and lines 91-97 in RSSExpandedImage2resultTestCase.java belong to the assertCorrectImage2binary() method and the assertCorrectImage2result() method, respectively. We add these traced methods to the traceability set TS.
As a final step, the behavior model, that is, the sequence diagram, is used to improve fault traceability. Figure 3 shows part of the flow of the assertCorrectImage2result() method in the sequence diagram of RSSExpandedImage2resultTestCase.java. According to this sequence diagram, assertCorrectImage2result() calls getHeight() and getBlackRow(). Therefore, these methods are added to the traceability set TS, since they are also resources associated with the fault.
In the zxing project, the developer actually applied a commit (6cdc749...) to fix the bug described in the bug report of the case study. The source code changes of this commit are line 65 and lines 69-86. These changes lie in the getBlackRow() method in the GlobalHistogramBinarizer.java file, and this method exists in the traceability set TS created in the case study. That is, applying the proposed technique successfully traced the defective method.
(Figure 3) Sequence Diagram
6. Experiment and Evaluation
To verify the effectiveness of the enhanced fault traceability approach proposed in this paper on a general program, we apply our approach to an open source project. The subject of the experiment is the zxing project used in the case study.
6.1 Experimental Environment and Procedure
Before proceeding with the experiment, we collected bug reports from the issues listed in the zxing project on GitHub, excluding bug reports that are simple questions or lack information. In addition, if a bug report includes an image, only the natural language portion, excluding the image, was used as the bug report information. Also, in order to verify the accuracy of the proposed method, it is necessary to confirm that the traced resources are related to actual defects [17]. That is, the commit that fixes the defect described by the bug report must be known in order to verify whether the result is successful. Therefore, only the bug reports that are linked to commits, and for which the success of the experiment result can be confirmed, were selected as subjects of the experiment. Also, if the commit most similar to the bug report is the commit linked to that bug report, it is exactly the commit that fixes the defect, so it cannot be counted as a valid fault tracing result. Therefore, such commits were excluded from the fault traceability dataset.
The experiment is divided into two steps. In the first step, the VSM and TF-IDF were used to select the commit with high similarity to the bug report. Thereafter, the methods containing the source code changes of the selected commit were extracted and added to the traceability set TS. In the second step, the behavior model was used to enhance fault traceability. We analyzed the behavior model to trace the methods called by the methods extracted in the previous step and added those methods to TS. In this study, we confirmed the improvement in fault traceability through these two steps.
6.2 Experiment Result
(Table 2) Experiment Results
Table 2 shows the results of the experiment. We ran the experiment on a set B of 15 bug reports. The first step, which uses the VSM, evaluates whether we can directly find the file that contains the fault. In this case, the average traceability accuracy was about 41%. When the traceability set TS consisted only of the methods directly modified by the commit selected in step 1, the fault traceability result did not show a meaningful value. Therefore, in the first step, we did not trace down to the method level. To extend tracing to the method level, fault tracing was performed through the second step using the behavior model, and its tracing accuracy was about 54% at the method level. In addition, the second step made it possible to trace even faults whose files could not be traced in the first step, which confirms that tracing with the behavior model can improve fault traceability. High traceability was obtained when the number of files included in the commit was small, the commit information was reliable, and the specificity and purpose of the commit were easy to distinguish. In contrast, if the number of files was large or the commit information was uncertain, fault tracing failed or traced only some of the resources. That is, if the commit contains a large-scale feature or bug fix, or its source code changes are not closely related to each other, it becomes difficult to distinguish the specificity and purpose of the commit, which decreases the precision of fault traceability.
7. Discussion
The proposed approach, like the related works, traces the resources related to a reported bug report by using the existing project's commit history and a behavior model. Using this technique, it is possible to identify defective resources and fix them while minimizing the developer's time and effort. In this chapter, we compare the fault traceability enhancement method of this study, which was verified through the preceding case study and experimental results, with the existing studies and then discuss its limitations.
7.1 Comparison with Existing Study
Table 3 shows a qualitative comparison of the features of this study and previous studies. All of the studies, including this one, applied NLP (Natural Language Processing) techniques to bug reports, i.e., tokenization and stop word removal. Also, in all the studies, the VSM was used to convert the terms of the bug report into a vector for calculating similarity. However, in this paper, we used commit information, including source code changes and descriptions, to calculate the similarity with bug reports, while existing studies used only the source code of the software. Commit information has been used in some existing studies, but only to obtain changed file names and changed method names, which differs from the purpose of commit information in this study. The behavior model was used in some previous studies as well as in this study, and the fault traceability level of those studies was the method level. However, our proposed method can be applied to a general program, while the method of [5] is limited to web applications developed based on the MVC design pattern.
(Table 3) Qualitative Comparison of Our Study and Existing Studies
7.2 Limitation
1. Commit Information: This study begins by looking for commit information similar to the reported bug report. It therefore has the limitation that the existing commits must include a commit corresponding to the bug report. In addition, a certain number of commits is required for a valid similarity calculation using TF-IDF. In other words, the technique proposed in this study requires that the project have more than a certain number of commits and that the commit descriptions be well written.
2. Behavior model: The second step of our technique is fault traceability enhancement using a behavior model. The behavior model allows the methods associated with an object to be identified. However, if the project to which our technique is applied has high coupling, there would be too many associated components, and too many trace results would be added to the traceability set TS. This can be ineffective for, or even adversely affect, developers who want to use a software fault traceability method to save time and effort.
8. Conclusion
In this paper, we proposed a fault traceability enhancement technique for reported bug reports. A similar commit was traced by computing the similarity between the bug report and the commit information. Next, a behavior model was applied to the similar commit to improve fault traceability. Compared with previous studies, this technique can be applied to general software while also tracing faults to the method level. However, the behavior model may lead to a situation where too many trace results belong to the traceability set. Therefore, in the future, it would be more effective to rank the traceability results according to how strongly each result in the traceability set is associated with the fault. Also, in order to find a commit similar to a bug report more accurately, we will continue to study additional applicable elements, such as commit author information, beyond commit descriptions and source code changes.
References
- D. Baek, B. Lee, J. Lee, "Content-based Configuration Management System for Software Research and Development Document Artifacts," KSII Transactions on Internet and Information Systems, Vol. 10, No. 3, pp.1404-1415, 2016. http://dx.doi.org/10.3837/tiis.2016.03.027
- S. Kim, T. Zimmermann, E. Whitehead, A. Zeller, "Predicting Faults from Cached History," in Proc. of 29th International Conference on Software Engineering (ICSE), pp.489-498, 2007. http://dx.doi.org/10.1109/ICSE.2007.66
- H. Zhang, "An Investigation of the Relationships between Lines of Code and Defects," in Proc. of 2009 IEEE International Conference on Software Maintenance (ICSM), pp.274-283, 2009. http://dx.doi.org/10.1109/ICSM.2009.5306304
- S. Wang, D. Lo, "Version History, Similar Report, and Structure: Putting Them Together for Improved Bug Localization," in Proc. of the 22nd International Conference on Program Comprehension(ICPC), pp.53-63, 2014. http://dx.doi.org/10.1145/2597008.2597148
- S. Baek, J. Lee, B. Lee, "Improving fault traceability of web application by utilizing software revision information and behavior model," KSII Transactions on Internet and Information Systems, Vol. 12, No. 2, pp.817-828, 2018. http://doi.org/10.3837/tiis.2018.02.016
- C. Youm, J. Ahn, J. Kim, E. Lee, "Bug localization based on code change histories and bug reports," in Proc. of Asia-Pacific Software Engineering Conference (APSEC), pp.190-197, 2015. http://doi.org/10.1109/APSEC.2015.23
- L. Moreno, W. Bandara, S. Haiduc, A. Marcus, "On the Relationship between the Vocabulary of Bug Reports and Source Code," in Proc. of International Conference on Software Maintenance(ICSM), pp.452-455, 2013. http://dx.doi.org/10.1109/ICSM.2013.70
- R. Tsuchiya, H. Washizaki, Y. Fukazawa, K. Oshima, R. Mibe, "Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs," in Proc. of International Conference on Advanced Information Systems Engineering, pp.247-262, 2015. http://dx.doi.org/10.1007/978-3-319-19069-3_16
- R. Tsuchiya, H. Washizaki, Y. Fukazawa, T. Kato, M. Kawakami, K. Yoshimura, "Recovering Traceability Links between Requirements and Source Code Using the Configuration Management Log," IEICE Transactions on Information and Systems, Vol. 98, No. 4, pp.852-862, 2015. http://dx.doi.org/10.1587/transinf.2014EDP7199
- C. McMillan, D. Poshyvanyk, M. Revelle, "Combining textual and structural analysis of software artifacts for traceability link recovery," in Proc. of ICSE Workshop on Traceability in Emerging Forms of Software Engineering, pp.41-48, 2009. https://doi.org/10.1109/TEFSE.2009.5069582
- B. Van Rompaey, S. Demeyer, "Establishing Traceability Links between Unit Test Cases and Units under Test," in Proc. of 13th European Conference on Software Maintenance and Reengineering, pp.209-218, 2009. https://doi.org/10.1109/CSMR.2009.39
- X. Ye, R. Bunescu, C. Liu, "Mapping Bug Reports to Relevant Files: A Ranking Model, a Fine-Grained Benchmark, and Feature Evaluation," IEEE Transactions on Software Engineering, Vol. 42, No. 4, pp.379-402, 2016. https://doi.org/10.1109/TSE.2015.2479232
- H. Choi, J. Lee, B. Lee, "Supporting Systematic Software Test Process in R&D Project with Behavioral Models," Journal of Internet Computing and Services(JICS), Vol. 19, No. 2, pp.43-48, 2018. http://dx.doi.org/10.7472/jksii.2018.19.2.43
- R. Saha, M. Lease, S. Khurshid, D. Perry, "Improving Bug Localization using Structured Information Retrieval," in Proc. of 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp.345-355, 2013. http://dx.doi.org/10.1109/ASE.2013.6693093
- J. Rumbaugh, I. Jacobson, G. Booch, Unified Modeling Language Reference Manual, Pearson Higher Education, 2004.
- T. B. Le, R. J. Oentaryo, D. Lo, "Information Retrieval and Spectrum Based Bug Localization: Better Together," in Proc. of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp.579-590, 2015. http://dx.doi.org/10.1145/2786805.2786880
- K. Herzig, S. Just, A. Zeller, "It's Not a Bug, It's a Feature: How Misclassification Impacts Bug Prediction," in Proc. of the 2013 International Conference on Software Engineering (ICSE), pp.392-401, 2013. http://dx.doi.org/10.1109/ICSE.2013.6606585