• Title/Summary/Keyword: Data imputation

Imputation method for missing data based on measure of property (특성도를 이용한 결측치 대체방법)

  • Kim, Hyungju;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.30 no.3 / pp.463-473 / 2017
  • Handling missing data is a central issue in clinical trials. Under the intention-to-treat principle, missing data are imputed on the assumption that they follow a particular missingness mechanism. Because this assumption is uncertain, choosing the right imputation method is very important. We propose a new imputation method for missing data based on the agreement and maintenance measures introduced by Kang and Kim (1997). We give an example and use a Monte Carlo simulation to compare the performance of the established method and the proposed method.

Sparse Data Cleaning using Multiple Imputations

  • Jun, Sung-Hae;Lee, Seung-Joo;Oh, Kyung-Whan
    • International Journal of Fuzzy Logic and Intelligent Systems / v.4 no.1 / pp.119-124 / 2004
  • Real-world data such as web log files tend to be incomplete, yet we must extract useful knowledge from them to make good decisions. Web log data contain much useful information, such as hyperlink information and the web usage of connected users. However, web data are too large for effective knowledge discovery and, to make matters worse, they are very sparse. We overcome this sparseness problem by using a Markov chain Monte Carlo (MCMC) method for multiple imputation. This missing-value imputation turns sparse web data into complete data. Our study may be a useful tool for discovering knowledge from sparse data sets. The sparser the data, the better MCMC imputation performed. We verified our work by experiments on data from the UCI Machine Learning Repository.

Verification of Improving a Clustering Algorithm for Microarray Data with Missing Values

  • Kim, Su-Young
    • The Korean Journal of Applied Statistics / v.24 no.2 / pp.315-321 / 2011
  • Gene expression microarray data often include multiple missing values. Most gene expression analyses (including gene clustering analysis), however, require a complete data matrix as input. In ordinary clustering methods, a single missing value forces one to discard all the data for a gene even if the rest of the data for that gene are intact. The quality of the analysis may decrease seriously as the missing rate increases. On the other hand, imputing missing values may introduce artifacts that reduce the reliability of the analysis. To clarify this trade-off in microarray clustering analysis, this paper compared the accuracy of clustering with and without imputation over several microarray data sets with different missing rates. This paper also tested the clustering efficiency of several imputation methods, including our proposed algorithm. The results showed that, for imperfect microarray data, it is worthwhile to check the clustering result obtained in this alternative way, without any imputed data.

Comparison of binary data imputation methods in clinical trials (임상시험에서 이분형 결측치 처리방법의 비교연구)

  • An, Koosung;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.29 no.3 / pp.539-547 / 2016
  • We discuss how to handle missing binary data in clinical trials. We describe the patterns in which missing data occur and introduce imputation methods for missing binary data, including a modified method. A simulation is performed for each method by modifying actual data. The simulation conditions are controlled by the response rate and the missing-value rate. We list the simulation results for each method and discuss them at the end of the paper.

A case study of competing risk analysis in the presence of missing data

  • Limei Zhou;Peter C. Austin;Husam Abdel-Qadir
    • Communications for Statistical Applications and Methods / v.30 no.1 / pp.1-19 / 2023
  • Observational data with missing or incomplete values are common in biomedical research. Multiple imputation is an effective approach to handling missing data, with the ability to decrease bias while increasing statistical power and efficiency. In recent years, propensity score (PS) matching has been increasingly used in observational studies to estimate treatment effects, as it can reduce confounding due to measured baseline covariates. In this paper, we describe in detail approaches to competing risk analysis in the setting of incomplete observational data when using PS matching. First, we used multiple imputation to impute several missing variables simultaneously, then conducted PS matching to match statin-exposed patients with unexposed patients. Afterwards, we assessed the effect of statin exposure on the risk of heart failure-related hospitalizations or emergency visits by estimating both relative and absolute effects. Collectively, we provide a general methodological framework for assessing treatment effects in incomplete observational data. In addition, we present a practical approach to producing an overall cumulative incidence function (CIF) based on estimates from multiply imputed and PS-matched samples.
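
Combining estimates from several imputed data sets is usually done by pooling. The abstract does not say which pooling rule the authors applied, so the following is only a minimal sketch of Rubin's rules, the standard way to combine a point estimate and its variance across m imputed data sets. The function name `pool_rubin` and the input values in the usage note are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error of each of those estimates
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total
```

For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` gives a pooled estimate of 2.0 and a total variance of 0.5 + (4/3) × 1.0. For a curve such as a CIF, the same pooling could be applied separately at each time point, which is an assumption here rather than the paper's stated procedure.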

Considering of the Rainfall Effect in Missing Traffic Volume Data Imputation Method (누락교통량자료 보정방법에서 강우의 영향 고려)

  • Kim, Min-Heon;Oh, Ju-Sam
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.14 no.2 / pp.1-13 / 2015
  • Traffic volume data are basic information used in a wide variety of fields. Existing imputation methods for missing traffic volume data do not take the effect of rainfall into account. This research examines how to consider the rainfall effect when imputing missing traffic volume data. To do so, we made the following assumption: when traffic volume data are missing on a rainy day, it is more accurate to impute them using only the traffic volume data from past rainy days. To test this assumption, we compared the accuracy of the imputed results for three imputation methods (unconditional mean, autoregression, and the expectation-maximization algorithm). The results showed that taking the rainfall effect into account produced lower errors.
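
The assumption in this abstract, that a gap on a rainy day is better filled from past rainy days only, can be illustrated with the simplest of the three methods, the mean. This is a minimal sketch, not the authors' implementation: the function name `impute_volume`, the `(volume, was_rainy)` record layout, and the sample numbers are all hypothetical.

```python
from statistics import mean


def impute_volume(history, rainy, use_rainfall=True):
    """Impute one missing traffic volume from past observations.

    history: list of (volume, was_rainy) pairs for comparable past days
    rainy: whether the day with the missing value was rainy
    use_rainfall: if True, average only days with the same rain status;
                  if False, use the unconditional mean of all days
    """
    if not history:
        raise ValueError("history must not be empty")
    if use_rainfall:
        pool = [v for v, was_rainy in history if was_rainy == rainy]
        if not pool:  # no day with matching weather: fall back to all days
            pool = [v for v, _ in history]
    else:
        pool = [v for v, _ in history]
    return mean(pool)
```

With `history = [(1000, False), (1100, False), (800, True), (900, True)]`, a missing rainy-day value is imputed as 850 when rainfall is considered and as 950 with the unconditional mean.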

Development and Application of Imputation Technique Based on NPR for Missing Traffic Data (NPR기반 누락 교통자료 추정기법 개발 및 적용)

  • Jang, Hyeon-Ho;Han, Dong-Hui;Lee, Tae-Gyeong;Lee, Yeong-In;Won, Je-Mu
    • Journal of Korean Society of Transportation / v.28 no.3 / pp.61-74 / 2010
  • Intelligent transportation systems (ITS) collect real-time traffic data and accumulate vast amounts of historical data, but this historical data has not been managed or used efficiently. With the introduction of data management systems such as ADMS (Archived Data Management System), the potential of this large body of historical data has risen dramatically. However, traffic data in any data management system inherently include missing values, and missing data have been one of the major obstacles to applying these data because they can render an entire dataset useless. For these reasons, imputation techniques play a key role in data management systems. To address these limitations, this paper presents a promising imputation technique that can be built into data management systems and robustly estimates the missing values in historical data. The developed model, based on a non-parametric regression (NPR) approach, employs various traffic data patterns in historical data and is designed for practical requirements such as a minimal number of parameters, computational speed, the imputation of various types of missing data, and multiple imputation. The model was tested under various missing data types. The results showed that the model outperforms previously reported approaches in terms of prediction accuracy and meets the computational speed required for use in traffic data management systems.
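
The abstract does not give the details of the NPR model. A common form of non-parametric regression for traffic data is k-nearest-neighbour matching against historical daily profiles, and the sketch below assumes that form. The function name `knn_impute`, the choice of Euclidean distance, and the sample profiles are hypothetical.

```python
import math


def knn_impute(profile, history, k=3):
    """Fill gaps in one daily profile from the k most similar past days.

    profile: list of volumes for one day, with None where data are missing
    history: list of complete past profiles of the same length
    k: number of nearest historical days to average
    """
    observed = [i for i, v in enumerate(profile) if v is not None]
    if not observed or not history:
        raise ValueError("need observed values and a non-empty history")

    def distance(day):
        # Euclidean distance over the time slots that were actually observed
        return math.sqrt(sum((day[i] - profile[i]) ** 2 for i in observed))

    neighbours = sorted(history, key=distance)[:k]
    return [
        v if v is not None
        else sum(day[i] for day in neighbours) / len(neighbours)
        for i, v in enumerate(profile)
    ]
```

For `profile = [11, None, 31]` and `history = [[10, 20, 30], [12, 22, 32], [100, 200, 300]]` with `k=2`, the two closest days are the first two, so the gap is filled with 21.0.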

Predicting Personal Credit Rating with Incomplete Data Sets Using Frequency Matrix technique (Frequency Matrix 기법을 이용한 결측치 자료로부터의 개인신용예측)

  • Bae, Jae-Kwon;Kim, Jin-Hwa;Hwang, Kook-Jae
    • Journal of Information Technology Applications and Management / v.13 no.4 / pp.273-290 / 2006
  • This study proposes a frequency matrix technique for predicting personal credit ratings more efficiently from incomplete data sets. First, the study tests multiple discriminant analysis (MDA) and logistic regression analysis for predicting personal credit ratings with incomplete data sets. Here, missing values are predicted with the mean imputation method and the regression imputation method. An artificial neural network and the frequency matrix technique are also tested on their performance in predicting personal credit ratings. A data set of personal credit information on 8,234 customers of Bank A in 2004 was collected for the test. The performance of the frequency matrix technique is compared with that of the other methods. The experimental results show that the frequency matrix technique outperforms all the other models: MDA-mean, Logit-mean, MDA-regression, Logit-regression, and artificial neural networks.
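
The two baseline imputation methods named in this abstract can be sketched as follows. This is a simplified illustration, not the study's code: it assumes a single numeric predictor for the regression variant, and the function names `mean_impute` and `regression_impute` and the sample values are hypothetical.

```python
from statistics import mean


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def regression_impute(x, y):
    """Replace each None in y with a least-squares prediction from x.

    The line is fitted only on pairs where y is observed;
    x is assumed to be fully observed.
    """
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    x_bar = mean([a for a, _ in pairs])
    y_bar = mean([b for _, b in pairs])
    sxx = sum((a - x_bar) ** 2 for a, _ in pairs)
    if sxx == 0:
        raise ValueError("x has no variation among complete cases")
    slope = sum((a - x_bar) * (b - y_bar) for a, b in pairs) / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]
```

With `x = [1, 2, 3, 4]` and `y = [2, 4, None, 8]`, mean imputation fills the gap with about 4.67, while regression imputation uses the fitted line y = 2x and fills it with about 6.0.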

A Multiple Imputation for Reducing Outlier Effect (이상점 영향력 축소를 통한 무응답 대체법)

  • Kim, Man-Gyeom;Shin, Key-Il
    • The Korean Journal of Applied Statistics / v.27 no.7 / pp.1229-1241 / 2014
  • Most sample surveys contain both outliers and missing values due to non-response. In that case, because of the effect of the outliers, the imputation results are not accurate enough to meet a given precision. To overcome this, outliers should be treated before imputation. In this paper, to reduce the effect of outliers, we study outlier imputation methods and outlier weight adjustment methods. For outlier detection, the method suggested by She and Owen (2011) is used. A small simulation study is conducted, and Monthly Labor Statistic and Briquette Consumption Survey Data are used for real data analysis.

Comparison of three boosting methods in parent-offspring trios for genotype imputation using simulation study

  • Mikhchi, Abbas;Honarvar, Mahmood;Kashan, Nasser Emam Jomeh;Zerehdaran, Saeed;Aminafshar, Mehdi
    • Journal of Animal Science and Technology / v.58 no.1 / pp.1.1-1.6 / 2016
  • Background: Genotype imputation is an important process for predicting unknown genotypes; it uses a reference population with dense genotypes to predict missing genotypes for both human and animal genetic variations at low cost. Machine learning methods, especially boosting methods, have been used in genetic studies to explore the underlying genetic profile of disease and to build models capable of predicting missing values of a marker. Methods: In this study, strategies and factors affecting the imputation accuracy of parent-offspring trios were compared when imputing from a lower-density SNP panel (5 K) to a high-density SNP panel (10 K), using three boosting methods: TotalBoost (TB), LogitBoost (LB) and AdaBoost (AB). The methods were applied to simulated data to impute the untyped SNPs in parent-offspring trios. Four datasets were simulated: G1 (100 trios with 5 K SNPs), G2 (100 trios with 10 K SNPs), G3 (500 trios with 5 K SNPs), and G4 (500 trios with 10 K SNPs). In all four datasets, all parents were genotyped completely and offspring were genotyped with a lower-density panel. Results: Comparison of the three methods showed that LB outperformed AB and TB in imputation accuracy. Computation time differed between methods, and AB was the fastest algorithm. Higher SNP densities increased imputation accuracy. A larger number of trios (i.e., 500) improved the performance of LB and TB. Conclusions: All three methods perform well in terms of imputation accuracy, and the dense chip is recommended for imputation of parent-offspring trios.