• Title/Abstract/Keywords: Data imputation


Comparison of EM with Jackknife Standard Errors and Multiple Imputation Standard Errors

  • Kang, Shin-Soo
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 16, No. 4
    • /
    • pp.1079-1086
    • /
    • 2005
  • Most discussions of single imputation methods and the EM algorithm concern point estimation of population quantities in the presence of missing values. A second concern is how to obtain standard errors for the point estimates computed from data filled in by single imputation methods or the EM algorithm. Here we focus on estimating standard errors while incorporating the additional uncertainty due to nonresponse. Two general approaches to accounting for this additional uncertainty are considered: the jackknife, a resampling method, and multiple imputation (MI). The two approaches are reviewed and compared through simulation studies.
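As a rough illustration of the jackknife idea mentioned in this abstract, the sketch below computes a delete-one jackknife standard error for a mean after mean imputation. It is not the paper's procedure: the function names (`mean_after_mean_imputation`, `jackknife_se`), the use of `None` for nonresponse, and the choice of mean imputation are all assumptions made for the example. The imputation is redone inside every replicate, which is how the extra uncertainty from nonresponse enters the estimate.

```python
from math import sqrt


def mean_after_mean_imputation(values):
    """Fill each None with the respondent mean, then average the filled-in data.

    Hypothetical estimator chosen for illustration; None marks a nonrespondent.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    fill = sum(observed) / len(observed)
    filled = [fill if v is None else v for v in values]
    return sum(filled) / len(filled)


def jackknife_se(values, estimator):
    """Delete-one jackknife standard error of `estimator` applied to `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two units are required")
    # Re-run the whole procedure (imputation + estimation) with unit i removed.
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(replicates) / n
    return sqrt((n - 1) / n * sum((r - center) ** 2 for r in replicates))


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0, None]
    print(jackknife_se(sample, mean_after_mean_imputation))
```

With complete data this reduces to the usual standard error of the mean. With a nonrespondent in the sample, it is larger than the value obtained by treating the imputed value as if it had been observed.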


Imputation Using Factor Score Regression

  • Lee, Sang-Eun;Hwang, Hee-Jin;Shin, Key-Il
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 16, No. 2
    • /
    • pp.317-323
    • /
    • 2009
  • Recently, not only government policies but also small-town decisions have come to rely on survey data, so most government agencies and organizations demand sample surveys in each field for more detailed information. However, nonresponse arises very often in sample surveys and becomes a major issue in judging survey accuracy. One solution is to use administrative data, but unfortunately most administrative data are not accessible to general users. The other solution is imputation, and several imputation methods have therefore been studied in various fields. In this study, instead of the commonly used simple regression imputation method, a factor score regression method is applied to incomplete survey data with unit and item missing values. The Consumer Expenditure Surveys in Korea are used for the simulation study.

Imputation Method Using Local Linear Regression Based on Bidirectional k-nearest-components

  • Yonggeol, Lee
    • Journal of information and communication convergence engineering
    • /
    • Vol. 21, No. 1
    • /
    • pp.62-67
    • /
    • 2023
  • This paper proposes an imputation method that uses local linear regression based on a bidirectional k-nearest-components search. The bidirectional k-nearest-components search selects components within a dynamic range around the missing points. Unlike existing methods, which use a fixed-size window, the proposed method can flexibly select adjacent components in an imputation problem. The weight values assigned to the components around the missing points are calculated using local linear regression. The local linear regression method is free from the rank problem in a matrix of dependent variables. In addition, it can calculate weight values that reflect the data flow in a specific environment, such as a blackout. The original missing values are estimated from a linear combination of the components and their weights, and the estimated values are then used to impute the missing values. In the experiments, the proposed method outperformed the existing methods when the error between the original data and the imputed data was measured using MAE and RMSE.

Jackknife Variance Estimation under Imputation for Nonrandom Nonresponse with Follow-ups

  • Park, Jinwoo
    • Journal of the Korean Statistical Society
    • /
    • Vol. 29, No. 4
    • /
    • pp.385-394
    • /
    • 2000
  • This paper provides jackknife variance estimation based on adjusted imputed values for the case where nonresponse is nonrandom and follow-up data are available for a subsample of nonrespondents. Both hot-deck and ratio imputation are considered as imputation methods. The performance of the proposed variance estimator under a nonrandom response mechanism is investigated through numerical simulation.


Missing Imputation Methods Using the Spatial Variable in Sample Survey

  • 이진희;김진;이기재
    • 응용통계연구 (The Korean Journal of Applied Statistics)
    • /
    • Vol. 19, No. 1
    • /
    • pp.57-67
    • /
    • 2006
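  • Nonresponse in sample surveys occurs for many reasons, and analyzing only the respondents' information can produce biased results, so many nonresponse imputation methods that use auxiliary variables have been studied. If the auxiliary variables available for imputation are insufficient and a regional correlation exists between respondents and nonrespondents, that correlation can be used for missing data imputation. Using the 2002 farm household economy data for the Gangwon region as an example, this paper examines a nonresponse imputation method based on spatial correlation and confirms that the spatial imputation method is efficient when spatial correlation exists.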

Missing Value Imputation Technique for Water Quality Dataset

  • Jin-Young Jun;Youn-A Min
    • 한국컴퓨터정보학회논문지 (Journal of the Korea Society of Computer and Information)
    • /
    • Vol. 29, No. 4
    • /
    • pp.39-46
    • /
    • 2024
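  • Many researchers are working to assess water quality using various models. Assessment models require datasets without missing values, but in practice observed datasets contain many missing values. Simply deleting missing values can, in some cases, distort the distribution of the underlying data and risks introducing bias into the model's predictive performance. To find a technique suited to handling missing values in water quality data, this study experimented with several imputation techniques based on the existing KNN and MICE imputation methods and on the generative neural network models Autoencoder and Denoising Autoencoder. In the experiments, Combined Imputation, which averages the results of KNN and MICE imputation, produced estimates closest to the measured values. When the observed dataset with missing values handled by this technique was evaluated with support vector machine and ensemble-based classification models, Accuracy, F1 score, ROC-AUC score, and MCC (Matthews Correlation Coefficient) all improved compared with deleting the missing values.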

Imputation of Missing Data Based on Hot Deck Method Using K-nn

  • 권순창
    • 한국IT서비스학회지 (Journal of Information Technology Services)
    • /
    • Vol. 13, No. 4
    • /
    • pp.359-375
    • /
    • 2014
  • Missing data are unavoidable in data collection, because some respondents, intentionally or not, leave questions unanswered in studies and experiments. Missing data not only inflate and distort standard deviations, but also make parameter estimation harder and reduce the reliability of research results. Although hot deck imputation is widely used, researchers have paid little attention to it because it handles missing data in ambiguous ways. Hot deck can be complemented with k-nn, a machine learning method that can form donor groups closest to the properties of the records with missing data. This study therefore imputes missing data with a hot deck method based on k-nn. The method was compared with listwise deletion and with mean, mode, linear regression, and SVM imputation on nominal and ratio data types, and it produced values closest to the originals. Simulations with different numbers of neighbors and different distance measures were carried out, and k-nn achieved better performance. This study thus rediscovers hot deck imputation, which has failed to attract researchers' attention. The results should help in selecting non-parametric methods that are less affected by the structure and causes of missing data.
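The general idea of a k-nn hot deck described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `knn_hot_deck`, the dictionary record layout, Euclidean distance on numeric and fully observed covariates, and random selection of one donor among the k nearest are all choices made for the example.

```python
import random
from math import sqrt


def knn_hot_deck(records, target, k=3, seed=0):
    """Impute missing `target` values from one of the k nearest donors.

    records: list of dicts with numeric, fully observed covariates and a
             `target` key whose value is None when missing (assumed layout).
    Returns a new list; the input records are not modified.
    """
    rng = random.Random(seed)
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no donor records with an observed target value")
    features = [key for key in records[0] if key != target]

    def distance(a, b):
        # Euclidean distance over the covariates only.
        return sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    completed = []
    for record in records:
        if record[target] is None:
            # Donor pool: the k observed records closest to the recipient.
            pool = sorted(donors, key=lambda d: distance(record, d))[:k]
            donor = rng.choice(pool)
            record = {**record, target: donor[target]}
        completed.append(record)
    return completed


if __name__ == "__main__":
    data = [
        {"age": 20, "income": 100},
        {"age": 21, "income": 110},
        {"age": 60, "income": 500},
        {"age": 22, "income": None},
    ]
    print(knn_hot_deck(data, "income", k=2))
```

Because the imputed value is copied from an actual donor, it is always a value that occurs in the data, which is the property that distinguishes hot deck from regression or mean imputation.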

Investigation of multiple imputation variance estimation

  • 김재광
    • Korean Statistical Society: Conference Proceedings
    • /
    • Proceedings of the Korean Statistical Society 2002 Spring Conference
    • /
    • pp.183-188
    • /
    • 2002
  • Multiple imputation, proposed by Rubin, is a procedure for handling missing data. One of the attractive parts of multiple imputation is the simplicity of the variance estimation formula. Because of the simplicity, it has been often abused and misused beyond its original prescription. This paper provides the bias of the multiple imputation variance estimator for a linear point estimator and discusses when the bias can be safely neglected.
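For reference, the variance formula whose simplicity the abstract mentions is usually stated as T = Ū + (1 + 1/m)B, where Ū is the average within-imputation variance and B is the between-imputation variance of the m point estimates. The sketch below applies that combining rule. The function name `rubin_combine` and its argument layout are hypothetical, and the code shows only the standard rule, not the bias analysis of the paper.

```python
from math import sqrt
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m completed-data estimates and their variances with Rubin's rules.

    estimates: point estimate from each of the m imputed data sets
    variances: estimated variance of each of those point estimates
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)       # pooled point estimate
    u_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance, divisor m - 1
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total, sqrt(total)


if __name__ == "__main__":
    print(rubin_combine([10.0, 12.0, 11.0], [4.0, 4.0, 4.0]))
```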


Application of SOLAS to the Multiple Imputation for Missing Data

  • Moon, Sung-Ho;Kim, Hyun-Jeong;Shin, Jae-Kyoung
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 14, No. 3
    • /
    • pp.579-590
    • /
    • 2003
  • When we analyze incomplete data, i.e., data with missing values, the missing values must be handled. A common approach is to delete the cases with missing values, but various other methods have been developed. Among them are the EM algorithm and regression-based methods, which estimate the missing values and impute them with the estimates. In this paper, we introduce SOLAS, multiple imputation software that generates multiple imputed data sets.


Multiple imputation for competing risks survival data via pseudo-observations

  • Han, Seungbong;Andrei, Adin-Cristian;Tsui, Kam-Wah
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 25, No. 4
    • /
    • pp.385-396
    • /
    • 2018
  • Competing risks are commonly encountered in biomedical research. Regression models for competing risks data can be developed from data routinely collected in hospitals or general practices. However, these data sets usually contain missing covariate values. To overcome this problem, multiple imputation is often used to fit regression models under a MAR assumption. Here, we introduce a multivariate imputation by chained equations algorithm to deal with competing risks survival data. Using pseudo-observations, we make use of the available outcome information by accommodating the competing risk structure. Lastly, we illustrate the practical advantages of our approach using simulations and two data examples, one from coronary artery disease data and one from hepatocellular carcinoma data.