• Title/Summary/Keyword: Missing data estimation

Analysis of Missing Data Using an Empirical Bayesian Method (경험적 베이지안 방법을 이용한 결측자료 연구)

  • Yoon, Yong Hwa;Choi, Boseung
    • The Korean Journal of Applied Statistics / v.27 no.6 / pp.1003-1016 / 2014
  • Proper imputation of missing data is an important step toward reliable analysis of survey data. This paper deals with both a model-based imputation method and a model estimation method. We used a Bayesian method to resolve the boundary solution problem that arises when maximum likelihood estimation is applied. We also address the problem of selecting a missing-data mechanism model by comparing forecasting results and model accuracies. We used MWPE (modified within precinct error) (Bautista et al., 2007) to measure prediction accuracy. We applied the proposed ML and Bayesian methods to the 2012 Korean presidential election exit poll data. In this analysis, predictions under the missing at random mechanism were more accurate than those under the missing not at random mechanism.

Comparative Evaluation of the Pollutant Load Estimation Method in the Water Quality Data Missing Intervals (수질자료 결측구간의 오염부하 추정기법 비교평가)

  • Cho, Beom-Jun;Cho, Hong-Yeon;Kahng, Sung-Hyun
    • Journal of Korean Society of Coastal and Ocean Engineers / v.19 no.1 / pp.45-56 / 2007
  • The pollutant load (PL) cannot be estimated directly when the flow discharge (water quantity) or water quality (WQ) time series contains missing intervals, so these intervals must first be filled using an appropriate method. In this study, several methods for estimating water quality in the missing periods are proposed, and the resulting WQ and pollutant load patterns are compared and evaluated by how well they reproduce the patterns of the available data. The most appropriate method is then suggested, along with a contribution factor that determines the degree of influence and the PL characteristics of the river estuary. Based on the PL estimates from the several methods, the interpolation method that accounts for fluctuations in the available WQ data is shown to be the most efficient. The PL pattern of the Han River estuary is classified as the discharge-dominated type. Data filling is unavoidable, and WQ should be estimated with an efficient and effective method in order to obtain a reasonable PL estimate.
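The gap-filling idea in this abstract can be illustrated with a minimal sketch. It assumes equally spaced samples, observed values at both ends of every gap, and plain linear interpolation. The abstract does not specify its fluctuation-aware interpolation, so this is only a baseline, and the function names `fill_linear` and `pollutant_load` are hypothetical.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest observed values.

    Assumes equally spaced samples and that the first and last values
    of every gap are observed (gaps at the series ends are left as None).
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def pollutant_load(flow, concentration):
    """Load per time step as discharge times concentration.

    With flow in m^3/s and concentration in mg/L (= g/m^3), the load is in g/s.
    """
    return [q * c for q, c in zip(flow, concentration)]


# Illustrative (made-up) numbers, not data from the study.
concentration = fill_linear([2.0, None, None, 5.0])
load = pollutant_load([10.0, 10.0, 10.0, 10.0], concentration)
```

The load is only computed after the gaps are filled, which mirrors the order of operations the abstract describes.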

Development of a Model Combining Covariance Matrices Derived from Spatial and Temporal Data to Estimate Missing Rainfall Data (공간 데이터와 시계열 데이터로부터 유도된 공분산행렬을 결합한 강수량 결측값 추정 모형)

  • Sung, Chan Yong
    • Journal of Environmental Science International / v.22 no.3 / pp.303-308 / 2013
  • This paper proposes a new method for estimating missing values in time series rainfall data. The proposed method integrates the two most widely used estimation methods, the general linear model (GLM) and ordinary kriging (OK), by taking a weighted average of the covariance matrices derived from each. The proposed method was cross-validated using daily rainfall data from thirteen rain gauges in the Hyeong-san River basin. Its goodness-of-fit was higher than those of GLM and OK, which can be attributed to the weighting algorithm, designed to minimize errors caused by violations of the assumptions of the two existing methods. This result suggests that the proposed method estimates missing values in time series rainfall data more accurately, especially in regions where the assumptions of the existing methods are not met, i.e., where rainfall varies by season and topography is heterogeneous.

Comparing Accuracy of Imputation Methods for Categorical Incomplete Data (범주형 자료의 결측치 추정방법 성능 비교)

  • 신형원;손소영
    • The Korean Journal of Applied Statistics / v.15 no.1 / pp.33-43 / 2002
  • Various estimation methods have been developed for imputing categorical missing data. They include the modal category method, logistic regression, and association rules. In this study, we propose two fusion algorithms, one based on a neural network and one on a voting scheme, that combine the results of the individual imputation methods. A Monte Carlo simulation is used to compare the performance of these methods. Five factors are used to simulate the missing data pattern: (1) input-output function, (2) data size, (3) noise of the input-output function, (4) proportion of missing data, and (5) pattern of missing data. The experimental results indicate the following: when the data size is small and the proportion of missing data is large, the modal category method, association rules, and neural network based fusion perform better than the other methods. However, when the data size is small and the correlation between input and missing output is strong, logistic regression and the neural network based fusion algorithm appear better than the others. When the data size is large with a low proportion of missing data, large noise, and strong correlation between input and missing output, the neural network based fusion algorithm turns out to be the best choice.
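Two of the ideas named in this abstract, modal category imputation and voting-based fusion, can be sketched as follows. The abstract does not describe its voting scheme, so a simple majority vote is assumed here, and the function names are hypothetical.

```python
from collections import Counter


def modal_category(observed):
    """Return the most frequent category among the observed (non-None) values."""
    counts = Counter(v for v in observed if v is not None)
    return counts.most_common(1)[0][0]


def majority_vote(predictions):
    """Fuse the imputations proposed by several methods by simple majority.

    Ties go to the category that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]


# Illustrative (made-up) values: one column with a missing entry,
# and three methods each proposing a category for it.
column = ["a", "b", "a", None]
imputed_by_mode = modal_category(column)
fused = majority_vote([imputed_by_mode, "b", "a"])
```

A majority vote treats every method equally. A neural network based fusion, as in the abstract, would instead learn how much to trust each method.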

Large tests of independence in incomplete two-way contingency tables using fractional imputation

  • Kang, Shin-Soo;Larsen, Michael D.
    • Journal of the Korean Data and Information Science Society / v.26 no.4 / pp.971-984 / 2015
  • Imputation procedures fill in missing values, thereby enabling complete data analyses. Fully efficient fractional imputation (FEFI) and multiple imputation (MI) create multiple versions of the missing observations, thereby reflecting uncertainty about their true values. Methods have been described for hypothesis testing with multiple imputation. Fractional imputation assigns weights to the observed data to compensate for missing values. The focus of this article is the development of tests of independence using FEFI for partially classified two-way contingency tables. Wald and deviance tests of independence under FEFI are proposed. Simulations are used to compare type I error rates and power. The partially observed marginal information is useful for estimating the joint distribution of cell probabilities, but it is not useful for testing association. FEFI compares favorably to other methods in simulations.

Estimation using response probability when missing data happen on the second occasion

  • Park, Hyeonah;Na, Seongryong
    • Journal of the Korean Data and Information Science Society / v.25 no.1 / pp.263-269 / 2014
  • When samples are lost in repeated surveys, new samples can often replace the missing values. Estimators using response probability can be considered for repeated surveys on two occasions in which new samples are selected in place of the missing data on the second occasion. We propose a new estimator that uses both the respondents and the new samples on the second occasion. The simulation setting assumes that missing values occur on the second occasion and are replaced by new samples. The results show that the proposed estimator is more efficient than one that applies a weighting adjustment to the respondents on the second occasion.

Missing Hydrological Data Estimation using Neural Network and Real Time Data Reconciliation (신경망을 이용한 결측 수문자료 추정 및 실시간 자료 보정)

  • Oh, Jae-Woo;Park, Jin-Hyeog;Kim, Young-Kuk
    • Journal of Korea Water Resources Association / v.41 no.10 / pp.1059-1065 / 2008
  • Rainfall data are the most basic input for analyzing hydrological phenomena and can be missing for various reasons. In this research, a neural network based model that estimates approximate values for missing rainfall data was developed for 12 rainfall stations in the Soyang River basin to improve on existing methods. Neural networks have been shown to be useful in many applications that deal with complicated natural phenomena, and the model produced better results than popular offline estimation methods such as the RDS (Reciprocal Distance Squared) method and the AMM (Arithmetic Mean Method). Additionally, we propose an automated data reconciliation system, built around a neural network learning processor, that can reconcile data in real time and transmit reliable hydrological data online.
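The two baseline methods named in this abstract have simple closed forms. The sketch below assumes each neighboring station is given as a `(rainfall, distance)` pair with a positive distance to the target station. The function names and numbers are hypothetical and not taken from the paper.

```python
def arithmetic_mean_method(neighbors):
    """AMM: unweighted mean of the rainfall observed at neighboring stations."""
    values = [rain for rain, _ in neighbors]
    return sum(values) / len(values)


def reciprocal_distance_squared(neighbors):
    """RDS: mean of neighboring rainfall weighted by 1 / distance^2.

    Closer stations get larger weights; distances must be positive.
    """
    weights = [1.0 / dist ** 2 for _, dist in neighbors]
    weighted_sum = sum(w * rain for w, (rain, _) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Illustrative (made-up) stations: (rainfall in mm, distance in km).
stations = [(10.0, 1.0), (20.0, 2.0)]
amm_estimate = arithmetic_mean_method(stations)       # 15.0
rds_estimate = reciprocal_distance_squared(stations)  # 12.0
```

In this example the RDS estimate sits closer to the nearer station's value than the AMM estimate does, which is the intended effect of the distance weighting.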

Missing Data Estimation for Link Travel Time (차량 결측속도정보 추정에 관한 연구)

  • Yoon, Won-Sik;Jung, Hee-Cheol
    • Journal of Korean Society of Transportation / v.26 no.2 / pp.101-107 / 2008
  • Traffic speed data may be missing due to detector malfunctions or network problems. In this paper, we propose effective methods for estimating data that could not be collected through loop detectors. The proposed algorithm has three steps. The first step finds the most similar neighboring data record using the correlation coefficient. The second step generates candidate data records using five estimation methods. The third step compares these candidates with the historical data record of the observed link and selects the best method. The proposed method is useful for estimating travel time.
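The three steps in this abstract can be sketched roughly as follows. The abstract does not name its five estimation methods or its comparison criterion, so this sketch assumes two placeholder estimators and mean absolute error against the historical record. All names and numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def most_similar(target, neighbors):
    """Step 1: name of the neighbor record that correlates best with the target."""
    return max(neighbors, key=lambda name: pearson(target, neighbors[name]))


def best_method(candidates, history):
    """Step 3: name of the candidate record closest to the historical record (MAE)."""
    def mae(record):
        return sum(abs(a - b) for a, b in zip(record, history)) / len(history)
    return min(candidates, key=lambda name: mae(candidates[name]))


# Illustrative (made-up) speeds in km/h.
target = [60, 55, 50, 45]
neighbors = {"A": [58, 54, 49, 46], "B": [40, 45, 50, 55]}
nearest = most_similar(target, neighbors)

# Step 2: build candidate records (two placeholder estimators, not the paper's five).
record = neighbors[nearest]
candidates = {
    "copy_neighbor": record,
    "neighbor_mean": [sum(record) / len(record)] * len(record),
}
chosen = best_method(candidates, history=[59, 55, 50, 45])
```

Selecting the method per link, rather than fixing one estimator globally, lets the choice adapt to how each link's traffic behaves.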

Assessment of Improving SWAT Weather Input Data using Basic Spatial Interpolation Method

  • Felix, Micah Lourdes;Choi, Mikyoung;Zhang, Ning;Jung, Kwansue
    • Proceedings of the Korea Water Resources Association Conference / 2022.05a / pp.368-368 / 2022
  • The Soil and Water Assessment Tool (SWAT) has been widely used to simulate the long-term hydrological conditions of a catchment. Two output variables, outflow and sediment yield, have been widely investigated in the field of water resources management, especially in determining the conditions of ungauged subbasins. Missing data in the weather input can cause poor representation of the climate conditions in a catchment, especially for large or mountainous catchments. Therefore, in this study, a custom module was developed and evaluated to determine the efficiency of using basic spatial interpolation methods to estimate weather input data. The module is written in Python and can be considered a pre-processing module to run before the SWAT model. The results of this study suggest that the proposed pre-processing module can improve the simulation results for both outflow and sediment yield in a catchment, even in the presence of missing data.

Imputation of Medical Data Using Subspace Condition Order Degree Polynomials

  • Silachan, Klaokanlaya;Tantatsanawong, Panjai
    • Journal of Information Processing Systems / v.10 no.3 / pp.395-411 / 2014
  • Temporal medical data is often collected during patient treatments that require personal analysis. Each observation recorded in the temporal medical data is associated with measurements and time treatments. A major problem in the analysis of temporal medical data is missing values, caused, for example, by patients dropping out of a study before completion. Therefore, the imputation of missing data is an important pre-processing step and can provide useful information before the data is mined. For each patient and each variable, this imputation replaces the missing data with a value drawn from an estimated distribution of that variable. In this paper, we propose a new method, called Newton's finite divided difference polynomial interpolation with condition order degree, for dealing with missing values in temporal medical data related to obesity. We compared the new imputation method with three existing subspace estimation techniques: the k-nearest neighbor, local least squares, and natural cubic spline approaches. The performance of each approach was evaluated using the normalized root mean square error and statistical significance tests. The experimental results demonstrate that the proposed method provides the best fit with the smallest error and is more accurate than the other methods.
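The interpolation scheme underlying the proposed method can be sketched with the standard Newton divided-difference form. This is the textbook algorithm only. The abstract does not describe the paper's "condition order degree" rule for choosing the polynomial order, so it is not reproduced here, and the function names are hypothetical.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients f[x0], f[x0,x1], ..., f[x0..xn].

    Requires distinct xs. The table is computed in place, column by column.
    """
    coef = list(ys)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate the Newton-form polynomial at x using nested multiplication."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result


# Illustrative (made-up) observations lying on y = x^2 + x + 1;
# the value at the unobserved time 1.5 is imputed from the polynomial.
times = [0, 1, 2]
values = [1, 3, 7]
coef = divided_differences(times, values)
imputed = newton_eval(times, coef, 1.5)  # 4.75
```

One practical advantage of the Newton form is that adding another observation only appends a coefficient, so the polynomial order can be raised or lowered without recomputing the earlier terms.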