• Title/Summary/Keyword: data imputation

On the Use of Weighted k-Nearest Neighbors for Missing Value Imputation (Weighted k-Nearest Neighbors를 이용한 결측치 대치)

  • Lim, Chanhui;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.28 no.1 / pp.23-31 / 2015
  • For the conventional missing value problem in statistical analysis, the k-Nearest Neighbor (KNN) method is used as a simple imputation method. When one of the k nearest neighbors is an extreme value or outlier, the KNN method can introduce bias. In this paper, we propose a Weighted k-Nearest Neighbors (WKNN) imputation method that compensates for this weakness of KNN. A Monte Carlo simulation study is also conducted to compare the WKNN and KNN methods using a real data set.
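
The contrast between plain and weighted KNN imputation can be sketched as follows. This is only an illustration: the abstract does not state the weighting scheme, so inverse-distance weights are an assumption here, and the function name `wknn_impute` and the toy data are hypothetical.

```python
import math

def wknn_impute(rows, target, col, k=3, eps=1e-9):
    """Return (weighted, unweighted) KNN imputations for target[col].

    Assumption: weights are inverse Euclidean distances; the paper's
    actual weights are not given in the abstract.
    """
    # Distances use every column except the one being imputed.
    feats = [j for j in range(len(target)) if j != col]
    dists = []
    for r in rows:
        d = math.sqrt(sum((r[j] - target[j]) ** 2 for j in feats))
        dists.append((d, r[col]))
    nearest = sorted(dists, key=lambda t: t[0])[:k]

    # Closer neighbors get larger weights; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    weighted = sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
    unweighted = sum(v for _, v in nearest) / len(nearest)
    return weighted, unweighted

# Toy data: the third complete row is an outlier in the imputed column.
rows = [[1.0, 10.0], [1.1, 11.0], [3.0, 50.0]]
target = [1.05, None]
weighted, unweighted = wknn_impute(rows, target, col=1, k=3)
print(weighted, unweighted)
```

In this toy case the distant outlier pulls the unweighted mean far from the two close neighbors, while the weighted value stays near them.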

A comparison of imputation methods for the consecutive missing temperature data (연속적 결측이 존재하는 기온 자료에 대한 결측복원 기법의 비교)

  • Kim, Hee-Kyung;Kang, In-Kyeong;Lee, Jae-Won;Lee, Yung-Seop
    • The Korean Journal of Applied Statistics / v.29 no.3 / pp.549-557 / 2016
  • Consecutive missing values are likely to occur in long climate data series because of system errors or defective equipment, and such values are difficult to impute. These problems can be overcome by imputing the missing values from reference time series, which must be similar to the time series that contains the missing values. We performed a simulation to compare three imputation methods (the adjusted normal ratio method, the regression method and the IDW method) for completing the missing values of a time series. A comparison of the three methods for daily mean temperatures at 14 climatological stations indicated that the IDW method was better than the others at south seaside stations. We also found that the regression method was better than the others at most stations (except south seaside stations).
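
Of the three methods named above, inverse distance weighting (IDW) is the simplest to sketch. The example below is a minimal illustration, not the authors' implementation: the power of 2, the function name `idw_impute`, and the station values are all assumptions.

```python
def idw_impute(ref_values, distances, power=2.0):
    """Estimate a missing value from reference stations by inverse distance weighting.

    ref_values: observations at the reference stations for the missing day.
    distances:  positive distances from the target station to each reference station.
    """
    weights = [1.0 / (d ** power) for d in distances]
    return sum(w * v for w, v in zip(weights, ref_values)) / sum(weights)

# Hypothetical daily mean temperatures at two reference stations,
# located 1 and 2 distance units from the target station.
estimate = idw_impute([10.0, 14.0], [1.0, 2.0])
print(estimate)  # close to 10.8, since the nearer station dominates
```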

Comparison of the estimated breeding value and accuracy by imputation reference Beadchip platform and scaling factor of the genomic relationship matrix in Hanwoo cattle

  • Soo Hyun, Lee;Chang Gwon, Dang;Mina, Park;Seung Soo, Lee;Young Chang, Lee;Jae Gu, Lee;Hyuk Kee, Chang;Ho Baek, Yoon;Chung-il, Cho;Sang Hong, Lee;Tae Jeong, Choi
    • Korean Journal of Agricultural Science / v.49 no.3 / pp.431-440 / 2022
  • Hanwoo cattle are a unique and historical breed in Korea that has been genetically improved and maintained through the national evaluation and selection system. The aim of this study was to provide information that can help improve the accuracy of estimated breeding values in Hanwoo cattle by showing the differences among imputation reference chip platforms for genomic data and among scaling factors of the genomic relationship matrix (GRM). Nine data sets were compared, consisting of 3 reference platforms, each with 3 different scaling factors (-0.5, 0 and 0.5). The evaluation was performed using MTG2.0 with nine different GRMs for the same number of genotyped animals and the same pedigree and phenotype data. A multi-trait model with five traits was used, which is the same model used in the national evaluation system. Our results show that the Hanwoo custom v1 platform is the best option for all traits, improving mean accuracy by 0.1 to 0.3%. As for the scaling factor, regardless of the imputation chip platform, a setting of -0.5 increased accuracy by 0.5 to 1.6% compared with the other scaling factors. In conclusion, this study revealed that using Hanwoo custom v1 as the imputation reference chip platform together with a scaling factor of -0.5 can improve the accuracy of estimated breeding values in the Hanwoo population. This information could help to improve the current evaluation system.

A Study on the Treatment of Missing Value using Grey Relational Grade and k-NN Approach

  • Chun, Young-Min;Chung, Sung-Suk
    • Proceedings of the Korean Data and Information Science Society Conference / 2006.04a / pp.55-62 / 2006
  • In 2004, Huang proposed a grey-based nearest neighbor approach to accurately predict missing attribute values. Our study proposes a way to decide the number of nearest neighbors using not only Dong's grey relational grade (GRG) but also Wen's grey relational grade. In addition, our study uses a weighted mean rather than an arithmetic (unweighted) mean, with the GRG serving as the weight when a missing value is imputed. This gives four different methods: DU, DW, WU and WW. The WW method (Wen's GRG with a weighted mean) performs best among the four. Huang had shown that his method was much better than the mean imputation method and the multiple imputation method; the method in our study performs far better than Huang's.

Estimating a Binomial Proportion with Bayes Estimated Imputed Conditional Means

  • Shin, Min-Woong;Lee, Sang-Eun
    • Communications for Statistical Applications and Methods / v.9 no.1 / pp.63-73 / 2002
  • An analytic imputation technique involving conditional means was described by Schafer and Schenker (2000). Their derivations are based on asymptotic expansions of point estimators and their associated variance estimators, and the result of imputation can be thought of as a first-order approximation to the estimators. In this paper, we present a method of estimating a binomial proportion using a Bayesian approach to imputed conditional means. That is, instead of the maximum likelihood (ML) estimator generally used to estimate a binomial proportion, we use Bayesian estimators and show the resulting estimates based on imputed conditional means.
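
The difference between the ML and Bayesian estimators of a binomial proportion can be illustrated with a short sketch. The abstract does not state the prior, so the Beta(a, b) prior and posterior-mean estimator below are assumptions, and the example leaves out the imputation step entirely; the function names are hypothetical.

```python
def ml_estimate(successes, n):
    """Maximum likelihood estimate of a binomial proportion."""
    return successes / n

def bayes_estimate(successes, n, a=1.0, b=1.0):
    """Posterior mean of the proportion under an assumed Beta(a, b) prior."""
    return (a + successes) / (a + b + n)

# With no successes in 10 trials, ML gives exactly 0, while the
# posterior mean under a uniform Beta(1, 1) prior stays above 0.
print(ml_estimate(0, 10), bayes_estimate(0, 10))
```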

Multiple imputation inference for stratified random sample with nonignorable nonresponse

  • Shin Minwoong;Lee Sangeun;Lee Sungchul;Lee Juyoung
    • Proceedings of the Korean Statistical Society Conference / 2001.11a / pp.191-194 / 2001
  • In general, imputation problems caused by survey nonresponse have been studied under the assumption that the nonresponse is ignorable. However, a model-based approach can be applied to surveys in which the nonresponse is suspected of being nonignorable. In this study, we convert nonignorable nonresponse into ignorable nonresponse within cells using the adjustment cell approach, and then apply the ignorable nonresponse method. The data sets for each nonresponse cell are simulated from a normal distribution.

Logistic Regression Method in Interval-Censored Data

  • Yun, Eun-Young;Kim, Jin-Mi;Ki, Choong-Rak
    • The Korean Journal of Applied Statistics / v.24 no.5 / pp.871-881 / 2011
  • In this paper we propose a logistic regression method to estimate the survival function and the median survival time in interval-censored data. The proposed method is motivated by the data augmentation technique with no sacrifice in augmenting data. In addition, we develop a cross validation criterion to determine the size of data augmentation. We compare the proposed estimator with other existing methods such as the parametric method, the single point imputation method, and the nonparametric maximum likelihood estimator through extensive numerical studies to show that the proposed estimator performs better than others in the sense of the mean squared error. An illustrative example based on a real data set is given.

Approximate moments of a variance estimate with imputed conditional means

  • Kang Woo Ram;Shin Min Woong;Lee Sang Eum
    • Proceedings of the Korean Statistical Society Conference / 2001.11a / pp.179-184 / 2001
  • Schafer and Schenker (2000) described an analytic imputation technique involving conditional means. We derive approximate moments of a variance estimate with imputed conditional means.

Comparing Accuracy of Imputation Methods for Categorical Incomplete Data (범주형 자료의 결측치 추정방법 성능 비교)

  • 신형원;손소영
    • The Korean Journal of Applied Statistics / v.15 no.1 / pp.33-43 / 2002
  • Various estimation methods have been developed for imputing categorical missing data, including the modal category method, logistic regression, and association rules. In this study, we propose two fusion algorithms, one based on a neural network and one on a voting scheme, that combine the results of the individual imputation methods. A Monte Carlo simulation is used to compare the performance of these methods. Five factors are used to simulate the missing data pattern: (1) input-output function, (2) data size, (3) noise of the input-output function, (4) proportion of missing data, and (5) pattern of missing data. The experimental results indicate the following. When the data size is small and the proportion of missing data is large, the modal category method, association rule, and neural network based fusion perform better than the other methods. However, when the data size is small and the correlation between input and missing output is strong, logistic regression and the neural network based fusion algorithm appear better than the others. When the data size is large with a low proportion of missing data, large noise, and strong correlation between input and missing output, the neural network based fusion algorithm turns out to be the best choice.
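
Two of the ideas in this abstract, modal category imputation and voting-based fusion, can be sketched briefly. This is a simplified illustration under stated assumptions: a plain majority vote stands in for the paper's voting scheme, whose details the abstract does not give, and all function names and data are hypothetical.

```python
from collections import Counter

def modal_category(values):
    """Most frequent observed category (missing entries are None)."""
    observed = [v for v in values if v is not None]
    return Counter(observed).most_common(1)[0][0]

def impute_modal(values):
    """Fill every missing entry with the modal category."""
    mode = modal_category(values)
    return [mode if v is None else v for v in values]

def vote(predictions):
    """Fuse the predictions of several imputation methods by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

print(impute_modal(["a", "b", None, "a"]))
# Three hypothetical methods propose a category for one missing cell.
print(vote(["x", "y", "y"]))
```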

Association measure of doubly interval censored data using a Kendall's 𝜏 estimator

  • Kang, Seo-Hyun;Kim, Yang-Jin
    • Communications for Statistical Applications and Methods / v.28 no.2 / pp.151-159 / 2021
  • In this article, our interest is in estimating the association between consecutive gap times that are subject to interval censoring. Such data are referred to as doubly interval censored data (Sun, 2006). In the context of serial events, induced dependent censoring frequently occurs, resulting in biased estimates. In this study, our goal is to propose a Kendall's 𝜏 based association measure for doubly interval censored data. To adjust for the impact of induced dependent censoring, the inverse probability censoring weighting (IPCW) technique is implemented. Furthermore, a multiple imputation technique is applied to recover unknown failure times owing to interval censoring. Simulation studies demonstrate that the suggested association estimator performs well with moderate sample sizes. The proposed method is applied to a dataset of children's dental records.
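
For orientation, the basic Kendall's 𝜏 on fully observed pairs can be computed as below. This sketch covers only the uncensored case (the tau-a form, which ignores ties); it does not implement the IPCW adjustment or the multiple imputation step described in the abstract, and the function name and data are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # Tied pairs (s == 0) count as neither.
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical first and second gap times for four subjects.
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```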