A data extension technique to handle incomplete data

Lee, Jong Chan

doi:10.15207/JKCS.2021.12.2.007

Journal of the Korea Convergence Society (한국융합학회논문지)

Volume 12 Issue 2 / Pages 7-13 / 2021 / 2233-4890 (pISSN) / 2713-6353 (eISSN)

Korea Convergence Society (한국융합학회)



Lee, Jong Chan (Dept. of Computer Engineering, Chungwoon University)


Received : 2020.12.01
Accepted : 2021.02.20
Published : 2021.02.28

https://doi.org/10.15207/JKCS.2021.12.2.007


Abstract

This paper introduces an algorithm for incomplete training data, that is, data containing missing values: the data are first converted into a format that can represent probabilities, and the missing values are then compensated for. The previous method based on this data conversion handled incomplete data by assigning equal probability to every value that the missing variable can take. That method was applied to many problems with good results, but it has been criticized for losing information, because it ignores all information remaining about the missing variable and simply assigns a new value. In the proposed method, by contrast, only the complete information, which contains no missing values, is input to the well-known classification algorithm C4.5, which constructs a decision tree during learning. The probability of each missing value is then obtained from this decision tree and assigned as the estimate for the missing variable. In other words, part of the lost information is recovered using the large amount of information in the incomplete training data that has not been lost.

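The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how the probability is read from the tree, so the sketch assumes that a tree is grown on the complete records to predict the missing attribute from the remaining ones, and that the relative frequencies at the node reached by a record serve as its estimate. The tree is a simplified categorical tree using the C4.5 gain-ratio criterion (no pruning, no continuous attributes), and the names `uniform_expansion`, `tree_expansion`, the dictionary representation, and the toy records are all hypothetical.

```python
import math
from collections import Counter

MISSING = None  # assumed marker for a missing categorical value


def uniform_expansion(value, domain):
    """Earlier approach: a missing value gets equal probability for every possible value."""
    if value is MISSING:
        return {v: 1.0 / len(domain) for v in domain}
    return {v: 1.0 if v == value else 0.0 for v in domain}


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attr, target):
    """C4.5-style split criterion: information gain divided by split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    gain = entropy([r[target] for r in rows]) - remainder
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, attrs, target):
    """Grow a small categorical tree; every node stores the target-value counts."""
    node = {"counts": Counter(r[target] for r in rows), "attr": None, "children": {}}
    if len(node["counts"]) == 1 or not attrs:
        return node
    best = max(attrs, key=lambda a: gain_ratio(rows, a, target))
    if gain_ratio(rows, best, target) <= 0:
        return node
    node["attr"] = best
    rest = [a for a in attrs if a != best]
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = build_tree(subset, rest, target)
    return node


def tree_expansion(tree, row, domain):
    """Follow the known attributes as far as possible and return that node's frequencies."""
    node = tree
    while node["attr"] is not None and row.get(node["attr"]) in node["children"]:
        node = node["children"][row[node["attr"]]]
    total = sum(node["counts"].values())
    return {v: node["counts"].get(v, 0) / total for v in domain}


# Toy records (invented for illustration); the last one lacks "humidity".
records = [
    {"outlook": "sunny", "windy": "no", "humidity": "high"},
    {"outlook": "sunny", "windy": "yes", "humidity": "high"},
    {"outlook": "rainy", "windy": "no", "humidity": "normal"},
    {"outlook": "rainy", "windy": "yes", "humidity": "normal"},
    {"outlook": "sunny", "windy": "no", "humidity": "high"},
    {"outlook": "rainy", "windy": "no", "humidity": "high"},
    {"outlook": "sunny", "windy": "yes", "humidity": MISSING},
]
domain = ["high", "normal"]

# Only complete records are used to grow the tree, as the abstract describes.
complete = [r for r in records if MISSING not in r.values()]
tree = build_tree(complete, ["outlook", "windy"], "humidity")

baseline = uniform_expansion(records[-1]["humidity"], domain)
estimate = tree_expansion(tree, records[-1], domain)
print("uniform:", baseline)
print("tree-based:", estimate)
```

On these toy records the uniform scheme assigns 0.5 to each humidity value, while the tree-based estimate places all probability on "high", because every complete sunny record has high humidity. A record whose other attributes are also missing simply stops at a higher node and receives that node's broader distribution.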


