A Missing Data Imputation by Combining K Nearest Neighbor with Maximum Likelihood Estimation for Numerical Software Project Data

Lee, Dong-Ho; Yoon, Kyung-A; Bae, Doo-Hwan

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 36, Issue 4, pp. 273-282, 2009, pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

K-NN과 최대 우도 추정법을 결합한 소프트웨어 프로젝트 수치 데이터용 결측값 대치법

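Lee, Dong-Ho (Department of Computer Science, KAIST);
Yoon, Kyung-A (Department of Computer Science, KAIST);
Bae, Doo-Hwan (Department of Computer Science, KAIST)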

Published: 2009.04.15


Abstract

Missing data is a common problem when building analysis or prediction models from software project data. For small software project data sets, imputation methods are known to handle missing data more effectively than deletion methods. K nearest neighbor imputation is well suited to software project data, but it cannot use the observed (non-missing) values of incomplete project instances. In this paper, we propose a missing data imputation approach for numerical software project data that combines K nearest neighbor imputation with maximum likelihood estimation. We also extend the average absolute error measure with normalization for more accurate evaluation. Our approach overcomes this limitation of K nearest neighbor imputation and shows better performance on our real data sets.

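One of the problems encountered when building analysis and prediction models from software project data is the presence of missing values, and an effective remedy is missing value imputation. K nearest neighbor imputation, a representative imputation method, has the drawback that it cannot use the observed information of instances that contain missing values during imputation. To overcome this drawback, this study proposes a new missing value imputation method for numerical software project data that combines K nearest neighbor imputation with maximum likelihood estimation. We also propose a new measure for comparing the accuracy of missing value imputation methods.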
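To make the baseline concrete, the following is a minimal sketch of plain K nearest neighbor imputation together with a range-normalized average absolute error. It is not the proposed method: the maximum likelihood step is not shown because the abstract gives no details of it. The function names `knn_impute` and `normalized_aae`, the Euclidean distance, and the choice of dividing each error by the feature's range are all assumptions made for illustration; the paper's exact normalization may differ. Note how only complete rows serve as donors, which is the limitation described above.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows."""
    # Donor pool: only rows without missing values. Observed values in
    # other incomplete rows are ignored (the limitation of plain K-NN).
    donors = [r for r in rows if all(v is not None for v in r)]
    if not donors:
        raise ValueError("need at least one complete row")
    imputed = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(donor):
            # Euclidean distance over the features observed in this row.
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        nearest = sorted(donors, key=dist)[:k]
        filled = list(row)
        for j in missing:
            filled[j] = sum(d[j] for d in nearest) / len(nearest)
        imputed.append(filled)
    return imputed

def normalized_aae(true_rows, imputed_rows, cells):
    """Average absolute error, each error divided by its feature's range.

    Range scaling is an assumed normalization, used so that features
    measured on different scales contribute comparably.
    """
    n_cols = len(true_rows[0])
    ranges = []
    for j in range(n_cols):
        col = [r[j] for r in true_rows]
        ranges.append((max(col) - min(col)) or 1.0)  # avoid division by zero
    errors = [abs(true_rows[i][j] - imputed_rows[i][j]) / ranges[j]
              for i, j in cells]
    return sum(errors) / len(errors)

# Hypothetical toy data: columns could stand for team size and effort.
truth = [[10.0, 100.0], [12.0, 120.0], [20.0, 200.0], [11.0, 115.0]]
data = [[10.0, 100.0], [12.0, 120.0], [20.0, 200.0], [11.0, None]]

filled = knn_impute(data, k=2)
score = normalized_aae(truth, filled, cells=[(3, 1)])
print(filled[3], score)
```

In this toy data the last row is matched to the two rows closest in the first column, and the error is expressed as a fraction of the second column's range rather than in raw units.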

