Analyzing Machine Learning Techniques for Fault Prediction Using Web Applications

  • Malhotra, Ruchika (Dept. of Computer Science and Engineering, Delhi Technological University)
  • Sharma, Anjali (Dept. of Computer Science and Engineering, Delhi Technological University)
  • Received : 2016.04.01
  • Accepted : 2017.04.09
  • Published : 2018.06.30

Abstract

Web applications are indispensable in the software industry and continually evolve, either to meet newer criteria and/or to incorporate new functionalities. However, despite quality assurance through testing, the presence of defects hinders straightforward development. Several factors contribute to defects, and mitigating them is expensive in terms of man-hours. Detecting fault proneness in the early phases of software development is therefore important, and a fault prediction model that identifies fault-prone classes in a web application is highly desirable. In this work, we compare 14 machine learning techniques to analyse the relationship between object-oriented metrics and fault prediction in web applications. The study is carried out using various releases of the Apache Click and Apache Rave datasets. En route to the predictive analysis, the input feature set for each release is first optimized using the filter-based correlation feature selection (CFS) method. We find that the LCOM3, WMC, NPM, and DAM metrics are the most significant predictors. A statistical analysis of these metrics shows good agreement with the CFS evaluation and affirms their role in the defect prediction of web applications. The overall predictive ability of the fault prediction models is first ranked using the Friedman technique and then statistically compared using Nemenyi post-hoc analysis. The results not only uphold the predictive capability of machine learning models for identifying faulty classes in web applications, but also indicate that ensemble algorithms are the most appropriate for defect prediction in the Apache datasets. Finally, we derive a consensus between the metrics selected by the CFS technique and those identified by the statistical analysis of the datasets.
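The filter-based CFS method mentioned above scores a candidate feature subset by balancing feature-class correlation against feature-feature redundancy. As a rough illustration only (not the authors' code: Hall's original CFS uses symmetrical uncertainty on discretized attributes, whereas this simplified sketch substitutes Pearson correlation), the merit of a subset S of k metrics is Merit(S) = k * r_cf / sqrt(k + k(k-1) * r_ff), maximized here by a greedy forward search:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit for a candidate feature subset (Pearson variant)."""
    k = len(subset)
    # Mean absolute correlation between each selected feature and the class.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute pairwise correlation among the selected features.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(X, y):
    """Greedy forward search: add features while the CFS merit improves."""
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:  # stop when no remaining feature improves the merit
            break
        selected.append(j)
        remaining.remove(j)
        best = merit
    return selected, best
```

Applied to a release's metric matrix (columns such as WMC, LCOM3, NPM, and DAM), the search returns the subset with the highest merit, analogous to how the paper prunes each release's input set before model building.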
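Similarly, the Friedman ranking and Nemenyi post-hoc comparison described above can be reproduced with standard tools. The sketch below is a hypothetical example, not the paper's data: it uses five models over four releases for brevity (the paper compares 14 techniques), the AUC values are made up, and the q value is taken from Demsar's (2006) table for k = 5.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical AUC values: rows = dataset releases, columns = models.
auc = np.array([
    [0.78, 0.74, 0.81, 0.69, 0.80],
    [0.75, 0.71, 0.79, 0.66, 0.77],
    [0.80, 0.76, 0.83, 0.70, 0.82],
    [0.72, 0.70, 0.76, 0.65, 0.75],
])
models = ["LogReg", "NaiveBayes", "RandomForest", "DecisionTree", "Bagging"]

# Friedman omnibus test: do the models differ across releases at all?
stat, p = friedmanchisquare(*auc.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

# Average rank per model (rank 1 = best AUC within a release; ties averaged).
ranks = np.apply_along_axis(rankdata, 1, -auc)
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k + 1) / (6N)).
n_releases, k = auc.shape
q_alpha = 2.728  # q at alpha = 0.05 for k = 5 models; k = 14 needs a larger q
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n_releases))
print(f"Nemenyi critical difference = {cd:.3f}")
for name, r in sorted(zip(models, avg_ranks), key=lambda t: t[1]):
    print(f"{name:>13}: mean rank {r:.2f}")
```

Two models differ significantly at the chosen alpha whenever their mean ranks differ by more than the critical difference, which is how a top ranking for the ensemble methods would be confirmed statistically.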

Keywords
