• Title/Summary/Keyword: Data imputation

A Generation and Accuracy Evaluation of Common Metadata Prediction Model Using Public Bicycle Data and Imputation Method

  • Kim, Jong-Chan;Jung, Se-Hoon
    • Journal of Korea Multimedia Society / v.25 no.2 / pp.287-296 / 2022
  • Today, air pollution is becoming a severe issue worldwide, and various policies are being implemented to address environmental pollution. In major cities, public bicycles are installed and operated to reduce pollution and solve transportation problems, and operational information is collected in real time. However, research using public bicycle operation data has not progressed. This study uses the daily weather data of the Korea Meteorological Agency and the real-time air pollution data of the Korea Environment Corporation to predict the number of daily bicycle rentals. Cross-validation, principal component analysis, and multiple regression analysis were used to determine the independent variables of the predictive model. The study then selected the variables that satisfied the significance level, constructed a model, predicted the number of daily bicycle rentals, and measured the accuracy.

Predictive Optimization Adjusted With Pseudo Data From A Missing Data Imputation Technique (결측 데이터 보정법에 의한 의사 데이터로 조정된 예측 최적화 방법)

  • Kim, Jeong-Woo
    • Journal of the Korea Academia-Industrial cooperation Society / v.20 no.2 / pp.200-209 / 2019
  • When forecasting future values, a model estimated by minimizing training errors can yield test errors higher than the training errors. This is the over-fitting problem, caused by an increase in model complexity when the model focuses only on a given dataset. Some regularization and resampling methods have been introduced to reduce test errors by alleviating this problem, but they have been designed for use with only a given dataset. In this paper, we propose a new optimization approach that reduces test errors by transforming a test error minimization problem into a training error minimization problem. To carry out this transformation, we needed additional data for the given dataset, termed pseudo data. To make proper use of pseudo data, we used three types of missing data imputation techniques. As an optimization tool, we chose the least squares method and combined it with an extra pseudo data instance. Furthermore, we present numerical results supporting our proposed approach, which yielded lower test errors than the ordinary least squares method.

On the Use of Sequential Adaptive Nearest Neighbors for Missing Value Imputation (순차 적응 최근접 이웃을 활용한 결측값 대치법)

  • Park, So-Hyun;Bang, Sung-Wan;Jhun, Myoung-Shic
    • The Korean Journal of Applied Statistics / v.24 no.6 / pp.1249-1257 / 2011
  • In this paper, we propose a Sequential Adaptive Nearest Neighbor (SANN) imputation method that combines the Adaptive Nearest Neighbor (ANN) method and the Sequential k-Nearest Neighbor (SKNN) method. When choosing the nearest neighbors of missing observations, the proposed SANN method takes the local features of the missing observations into account and reuses the imputed observations in a sequential manner. Using a Monte Carlo study and a real data example, we demonstrate the characteristics of the SANN method and its potential performance.
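
The sequential reuse of imputed observations mentioned in this abstract can be illustrated with a small sketch. This is not the authors' SANN method: it is a plain sequential k-nearest-neighbor imputer with a fixed k, Euclidean distance, and mean-of-neighbors filling, all of which are assumptions made for illustration. The function name `sequential_knn_impute` and the toy data are hypothetical, and at least one fully observed row is assumed.

```python
import math

def sequential_knn_impute(rows, k=2):
    """Fill None entries row by row, reusing already-imputed rows as donors.

    Illustrative sketch only: fixed k, Euclidean distance, neighbor mean.
    """
    rows = [list(r) for r in rows]  # work on a copy of the input
    # The donor pool starts with the fully observed rows (assumed non-empty).
    pool = [r for r in rows if None not in r]
    incomplete = [r for r in rows if None in r]
    # Impute the rows with the fewest missing entries first.
    incomplete.sort(key=lambda r: sum(v is None for v in r))
    for row in incomplete:
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(donor):
            # Distance uses only the coordinates observed in the current row.
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        neighbors = sorted(pool, key=dist)[:k]
        for j, v in enumerate(row):
            if v is None:
                row[j] = sum(d[j] for d in neighbors) / len(neighbors)
        pool.append(row)  # the imputed row becomes a donor for later rows
    return rows

data = [[0.0, 0.0], [10.0, 10.0], [4.0, None], [None, 4.5]]
filled = sequential_knn_impute(data, k=2)
print(filled)
```

In this toy data the third row is imputed first and then serves as one of the two nearest donors for the fourth row, which is the sequential reuse the abstract refers to.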

Comparison of EM and Multiple Imputation Methods with Traditional Methods in Monotone Missing Pattern

  • Kang, Shin-Soo
    • Journal of the Korean Data and Information Science Society / v.16 no.1 / pp.95-106 / 2005
  • Complete-case analysis is easy to carry out and may be adequate when the amount of missing data is small. However, this method is not recommended in general because the estimates are usually biased and inefficient. There are numerous alternatives to complete-case analysis. A natural alternative is available-case analysis, which uses all cases that contain the variables required for a specific task. The EM algorithm is a general approach for computing maximum likelihood estimates of parameters from incomplete data. These methods and multiple imputation (MI) are reviewed, and their performance is compared through simulation studies under a monotone missing pattern.
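
The difference between complete-case and available-case analysis described above can be shown with a minimal sketch that estimates variable means. The helper names and the toy data are hypothetical; missing values are represented as `None`, and the data follow a monotone pattern (the first variable is always observed, the second is missing for the last rows).

```python
def complete_case_means(rows):
    """Mean of each variable using only rows with no missing value."""
    complete = [r for r in rows if None not in r]
    n_vars = len(rows[0])
    return [sum(r[j] for r in complete) / len(complete) for j in range(n_vars)]

def available_case_means(rows):
    """Mean of each variable using every row in which that variable is observed."""
    n_vars = len(rows[0])
    means = []
    for j in range(n_vars):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means

# Monotone pattern: x is always observed, y is missing for the last two rows.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [6.0, None]]
cc = complete_case_means(data)   # uses 2 rows for both variables
ac = available_case_means(data)  # uses 4 rows for x, 2 rows for y
print(cc, ac)
```

Here the complete-case mean of x discards the two rows where only y is missing, while the available-case mean of x uses all four rows.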

A Study on Imputation using Adjusted Cohen Method

  • Chung, Sung-Suk;Chun, Young-Min;Lee, Sun-Kyung
    • Journal of the Korean Data and Information Science Society / v.17 no.3 / pp.871-888 / 2006
  • Many studies have been conducted to develop procedures for dealing with missing values. The most common method is to substitute other values for the missing data. The purpose of our study is to suggest adjusted Cohen methods and to compare their efficiency with that of other methods through a simulation study. The adjusted Cohen methods use an auxiliary variable to rank the variable with missing values. This leads to a reduced mean square error (MSE) compared with the Cohen method.

MLE for Incomplete Contingency Tables with Lagrangian Multiplier

  • Kang, Shin-Soo
    • Journal of the Korean Data and Information Science Society / v.17 no.3 / pp.919-925 / 2006
  • The maximum likelihood estimate (MLE) is obtained from the partial log-likelihood function for the cell probabilities of two-way incomplete contingency tables proposed by Chen and Fienberg (1974). The partial log-likelihood function is modified by adding a Lagrangian multiplier so that constraints can be incorporated. Variances of the ML estimators of population proportions are derived from the matrix of second derivatives of the log-likelihood with respect to the cell probabilities. Simulation results show that, when data are missing at random (MAR), complete-case (CC) analysis produces biased estimates of joint probabilities and is less efficient than either MLE or MI. MLE and MI provide consistent results under the MAR situation. MLE provides more efficient estimates of population proportions than either multiple imputation (MI) based on data augmentation or complete-case analysis. The standard errors of the MLE from the proposed method using the Lagrangian multiplier are valid and have less variation than the standard errors from MI and CC.

A study on multiple imputation modeling for Korean EAPS (경제활동인구조사 자료를 위한 다중대체 방식 연구)

  • Park, Min-Jeong;Bae, Yoonjong;Kim, Joungyoun
    • The Korean Journal of Applied Statistics / v.34 no.5 / pp.685-696 / 2021
  • The Korean Economically Active Population Survey (KEAPS) is a national survey that produces employment-related statistics. The main purpose of the survey is to determine the economic activity status (employed/unemployed/non-employed) of the population. KEAPS has unique characteristics arising from its survey method. In this study, we present an improved imputation model based on an understanding of structural non-response and the use of past data. The performance of the proposed model is compared with that of the existing model through simulation. The performance of the imputation models is evaluated based on matching/nonmatching rates. For this, we use the KEAPS data from November 2019. For randomly selected respondents among the total of 59,996, the six explanatory variables that are critical in determining the economic activity status are treated as non-response. The proposed model includes an industry variable and a job status variable in addition to the explanatory variables used in the previous research. This is based on the linkage and use of past data. The simulation results confirm that the proposed model with additional variables outperforms the existing model from the previous research. In addition, we consider various scenarios for the number of non-respondents by economic activity status.
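
The evaluation idea in this abstract (treat known values as non-response, impute them, and report the matching rate) can be sketched as follows. This is only a toy illustration, not the study's model or data: the mode imputer is a stand-in for an imputation model, and the names `matching_rate`, `mode_impute`, the category labels, and the random seed are all assumptions.

```python
import random
from collections import Counter

def matching_rate(truth, imputed, masked_idx):
    """Share of masked positions where the imputed value equals the true value."""
    hits = sum(truth[i] == imputed[i] for i in masked_idx)
    return hits / len(masked_idx)

def mode_impute(values):
    """Stand-in imputer: replace each None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Toy categorical variable with known true values.
status = ["employed"] * 6 + ["unemployed"] * 2 + ["non-employed"] * 2

# Treat a random subset of respondents as non-response.
rng = random.Random(0)
masked_idx = rng.sample(range(len(status)), 4)
masked = [None if i in masked_idx else s for i, s in enumerate(status)]

imputed = mode_impute(masked)
rate = matching_rate(status, imputed, masked_idx)
print(f"matching rate: {rate:.2f}, nonmatching rate: {1 - rate:.2f}")
```

Two imputation models can be compared in this way by running both on the same masked positions and comparing their matching rates.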

Pairwise fusion approach to cluster analysis with applications to movie data (영화 데이터를 위한 쌍별 규합 접근방식의 군집화 기법)

  • Kim, Hui Jin;Park, Seyoung
    • The Korean Journal of Applied Statistics / v.35 no.2 / pp.265-283 / 2022
  • MovieLens data consist of recorded movie evaluations and have often been used to measure evaluation scores in the recommendation system research field. In this paper, we provide additional information obtained by clustering user-specific genre preference information using movie evaluation data and movie genre data. Because the number of movie ratings per user is very low compared to the total number of movies, the missing rate in this data is very high. For this reason, there are limitations in applying existing clustering methods. In this paper, we propose a convex clustering-based method using the pairwise fused penalty, motivated by the analysis of MovieLens data. In particular, the proposed clustering method performs missing-value imputation and at the same time uses movie evaluations and genre weights for each movie to cluster the genre preference information of each individual. We solve the proposed optimization problem using the alternating direction method of multipliers algorithm. Simulation and an application to MovieLens data show that the proposed clustering method is less sensitive to noise and outliers than the existing method.

Missing values imputation for time course gene expression data using the pattern consistency index adaptive nearest neighbors (시간경로 유전자 발현자료에서 패턴일치지수와 적응 최근접 이웃을 활용한 결측값 대치법)

  • Shin, Heyseo;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.33 no.3 / pp.269-280 / 2020
  • Time course gene expression data is a large amount of data observed over time in microarray experiments. This data can also simultaneously identify the level of gene expression. However, the experimental process is complex, resulting in frequent missing values due to various causes. In this paper, we propose the pattern consistency index adaptive nearest neighbors (PANN) method for missing value imputation. This method combines the adaptive nearest neighbors (ANN) method, which reflects local characteristics, with the pattern consistency index, which considers the degree of consistency in gene expression between observations across time points. We conducted a Monte Carlo simulation study to evaluate the usefulness of the proposed PANN method on two yeast time course data sets.

On Adaptation to Sparse Design in Bivariate Local Linear Regression

  • Hall, Peter;Seifert, Burkhardt;Turlach, Berwin A.
    • Journal of the Korean Statistical Society / v.30 no.2 / pp.231-246 / 2001
  • Local linear smoothing enjoys several excellent theoretical and numerical properties, and in a range of applications it is the method most frequently chosen for fitting curves to noisy data. Nevertheless, it suffers numerical problems in places where the distribution of design points (often called predictors, or explanatory variables) is sparse. In the case of univariate design, several remedies have been proposed for overcoming this problem, one of which involves adding "pseudo" design points in places where the original design points are too widely separated. This approach is particularly well suited to treating sparse bivariate design problems, and in fact attractive, elegant geometric analogues of univariate imputation and interpolation rules are appropriate for that case. In the present paper we introduce and develop pseudo data rules for bivariate design, and apply them to real data.
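
The univariate remedy mentioned in this abstract, adding pseudo design points where the original points are too widely separated, can be sketched roughly as below. This is not the authors' rule: the gap threshold `max_gap`, the even spacing of the pseudo points, and the use of linear interpolation for their responses are assumptions chosen for illustration, and `add_pseudo_points` is a hypothetical name.

```python
import math

def add_pseudo_points(xs, ys, max_gap):
    """Insert linearly interpolated pseudo points wherever neighbouring
    design points are more than max_gap apart (illustrative rule only)."""
    pairs = sorted(zip(xs, ys))
    out = [pairs[0]]
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        gap = x1 - x0
        if gap > max_gap:
            # Number of evenly spaced points needed so no sub-gap exceeds max_gap.
            n_new = math.ceil(gap / max_gap) - 1
            for i in range(1, n_new + 1):
                t = i / (n_new + 1)
                out.append((x0 + t * gap, y0 + t * (y1 - y0)))
        out.append((x1, y1))
    return out

xs = [0.0, 1.0, 3.0]
ys = [0.0, 2.0, 6.0]
augmented = add_pseudo_points(xs, ys, max_gap=1.0)
print(augmented)
```

The augmented design can then be passed to a local linear smoother in place of the original points, so that every smoothing window contains enough data.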
