Imputation of Missing Data Based on Hot Deck Method Using K-nn

K-nn을 이용한 Hot Deck 기반의 결측치 대체

  • Received : 2014.04.25
  • Accepted : 2014.12.14
  • Published : 2014.12.31

Abstract

Missing data are unavoidable in data collection, because some respondents, deliberately or not, leave questions unanswered in studies and experiments. Missing data not only inflate and distort standard deviations, but also complicate parameter estimation and reduce the reliability of research results. Although hot deck imputation is widely used, it has attracted little research interest because it handles missing data in ambiguous ways. Hot deck can be complemented with k-nearest neighbors (k-nn), a machine learning method that can form donor groups closest to the properties of the records with missing data. Motivated by this role of k-nn, this study imputes missing data with a hot deck method based on k-nn. The proposed method was compared with listwise deletion and with mean, mode, linear regression, and SVM imputation on nominal and ratio data types, and it yielded imputed values reasonably close to the original values. Simulations that varied the number of neighbors and the distance measure were carried out, and they improved the performance of k-nn. This study thus rediscovers hot deck imputation, which has failed to attract researchers' attention. The results should help researchers select non-parametric methods that are less likely to be affected by the structure and causes of missing data.
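The idea of drawing donors from the k nearest records can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes purely numeric data with `NaN` marking missing values, Euclidean distance, donors restricted to complete rows, and a random draw of one donor from the k nearest. The function name `knn_hot_deck_impute` and its parameters are hypothetical.

```python
import numpy as np

def knn_hot_deck_impute(data, k=3, seed=0):
    """Fill NaNs by copying values from a donor drawn among the k nearest complete rows.

    Illustrative sketch only; names and defaults are assumptions, not the paper's.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    imputed = data.copy()

    # Donor pool candidates: rows without any missing value.
    complete = ~np.isnan(data).any(axis=1)
    donors = data[complete]
    if len(donors) == 0:
        raise ValueError("at least one complete row is needed as a donor")

    for i in np.flatnonzero(~complete):
        missing = np.isnan(data[i])
        observed = ~missing
        # Euclidean distance over the columns the recipient actually observed.
        diff = donors[:, observed] - data[i, observed]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        # The k closest complete rows form the donor group.
        pool = np.argsort(dist)[: min(k, len(donors))]
        # Hot deck step: copy real observed values from one randomly drawn donor.
        donor = donors[rng.choice(pool)]
        imputed[i, missing] = donor[missing]
    return imputed

example = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [100.0, 1000.0],
    [2.1, np.nan],   # recipient with a missing second column
])
# With k=1 the only donor is the nearest complete row, [2.0, 20.0].
print(knn_hot_deck_impute(example, k=1)[4, 1])  # prints 20.0
```

Because the imputed value is copied from an observed record rather than computed, the result is always a value that actually occurs in the data, which is what distinguishes hot deck from mean or regression imputation. In practice the columns would usually be scaled before computing distances, and nominal variables would need a different distance measure.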
