Missing Value Imputation based on Locally Linear Reconstruction for Improving Classification Performance

  • Kang, Pilsung (Industrial and Information Systems Engineering, Seoul National University of Science and Technology (Seoultech))
  • Received : 2012.07.23
  • Reviewed : 2012.10.05
  • Published : 2012.12.01

Abstract

Classification algorithms generally assume that the data are complete. However, missing values are common in real data sets for various reasons. In this paper, we propose using locally linear reconstruction (LLR) for missing value imputation to improve classification performance when missing values exist. We first investigate how much missing values degrade classification performance at various missing ratios. We then compare the proposed LLR imputation with three well-known single imputation methods across three classifiers on eight data sets. The experimental results show that (1) every imputation method, even the very simple ones, helped improve classification accuracy; (2) among the imputation methods, the proposed LLR imputation was the most effective across all missing ratios; and (3) when the missing ratio was relatively high, LLR stood out, achieving classification accuracy as high as that obtained from the complete data set.
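
The abstract does not spell out the imputation procedure, so the following is only a minimal sketch of one plausible reading of LLR-based imputation, not the paper's actual implementation. It assumes that neighbors are drawn from rows with no missing values, that distances use only the observed features, and that reconstruction weights are constrained to sum to one (an LLE-style closed form). The function name `llr_impute` and the parameters `k` and `ridge` are hypothetical.

```python
import numpy as np


def llr_impute(X, k=5, ridge=1e-6):
    """Fill NaNs in X by reconstructing each incomplete row from its neighbors.

    Hypothetical sketch: the neighbor pool, the sum-to-one constraint, and
    the ridge term are assumptions, not details taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    donors = X[~incomplete]  # rows with no missing values
    if len(donors) == 0:
        raise ValueError("need at least one complete row")
    k = min(k, len(donors))

    for i in np.flatnonzero(incomplete):
        miss = np.isnan(X[i])
        obs = ~miss
        if not obs.any():
            # Nothing observed: fall back to the donor mean.
            filled[i] = donors.mean(axis=0)
            continue

        # k nearest donors, measured on the observed features only.
        dist = np.linalg.norm(donors[:, obs] - X[i, obs], axis=1)
        nbrs = donors[np.argsort(dist)[:k]]

        # Weights minimizing ||x_obs - sum_j w_j * nbr_j_obs||^2
        # subject to sum(w) = 1, via the local Gram matrix.
        diff = nbrs[:, obs] - X[i, obs]
        gram = diff @ diff.T
        gram += ridge * (np.trace(gram) + 1.0) * np.eye(k)  # keep it invertible
        w = np.linalg.solve(gram, np.ones(k))
        w /= w.sum()

        # Apply the same weights to the neighbors' values on the missing features.
        filled[i, miss] = w @ nbrs[:, miss]
    return filled
```

Calling `llr_impute(X, k=2)` on an array whose missing entries are marked with `np.nan` returns a copy with those entries filled and the observed entries left unchanged. The resulting array could then be passed to any classifier, which is the setting the abstract evaluates.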

Keywords

Project Information

Research project host institution : Seoul National University of Science and Technology
