A Missing Data Imputation by Combining K Nearest Neighbor with Maximum Likelihood Estimation for Numerical Software Project Data

A Missing Value Imputation Method for Numerical Software Project Data Combining K-NN and Maximum Likelihood Estimation

  • Published: 2009.04.15

Abstract

Missing data is a common problem when building analysis or prediction models from software project data. For small software project data sets, imputation methods are known to handle missing data more effectively than deletion methods. K nearest neighbor imputation is a suitable imputation method for software project data, but it cannot use the observed (non-missing) values of incomplete project instances. In this paper, we propose an approach to missing data imputation for numerical software project data that combines K nearest neighbor imputation with maximum likelihood estimation. We also extend the average absolute error measure with normalization so that imputation accuracy can be evaluated more accurately. Our approach overcomes this limitation of K nearest neighbor imputation and outperforms it on our real data sets.

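One of the problems encountered when building analysis and prediction models from software project data is the missing values the data contain, and an effective remedy is missing value imputation. K nearest neighbor imputation, a representative imputation method, has the drawback that it cannot use the observed information of instances that contain missing values during imputation. To overcome this drawback, this study proposes a new missing value imputation method for numerical software project data that combines K nearest neighbor imputation with maximum likelihood estimation. It also proposes a new measure for comparing the accuracy of missing value imputation methods.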
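
To make the limitation described in the abstract concrete, the following is a minimal sketch of baseline K nearest neighbor imputation in which only complete instances act as donors, so the observed values of other incomplete instances go unused. It is not the combined method the paper proposes, whose details are not given here. The function names `knn_impute` and `range_normalized_mae`, the toy data, and the choice to normalize each absolute error by the column's range are all assumptions made for illustration; the paper's own normalized measure may be defined differently.

```python
import math


def knn_impute(rows, k=2):
    """Baseline K-NN imputation: fill each None from the k nearest complete rows."""
    # Donors are restricted to complete rows. Observed values in other
    # incomplete rows are ignored, which is the limitation noted in the abstract.
    donors = [r for r in rows if None not in r]
    if not donors:
        raise ValueError("at least one complete row is required")
    imputed = []
    for row in rows:
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(donor):
            # Euclidean distance over the columns observed in this row.
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        nearest = sorted(donors, key=dist)[:k]
        imputed.append([
            v if v is not None else sum(d[j] for d in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed


def range_normalized_mae(truth, imputed, missing):
    """Mean absolute error over missing cells, each divided by its column range.

    Assumed normalization (hypothetical): it keeps columns with large
    magnitudes from dominating the average.
    """
    spans = []
    for j in range(len(truth[0])):
        col = [r[j] for r in truth]
        spans.append((max(col) - min(col)) or 1.0)  # guard against zero range
    errors = [abs(truth[i][j] - imputed[i][j]) / spans[j] for i, j in missing]
    return sum(errors) / len(errors)


# Toy data (invented): two numerical attributes, one missing cell.
rows = [
    [10.0, 100.0],
    [12.0, 120.0],
    [30.0, 300.0],
    [11.0, None],
]
truth = [r[:] for r in rows]
truth[3][1] = 115.0  # the value hidden from the imputer

filled = knn_impute(rows, k=2)
score = range_normalized_mae(truth, filled, missing=[(3, 1)])
print(filled[3][1], score)
```

With `k=2`, the two nearest complete rows to the incomplete row are the first two, so the missing cell is filled with the mean of their second attribute. The error is then scaled by that attribute's range in `truth`.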
