Search | Korea Science

A Study on Automatic Missing Value Imputation Replacement Method for Data Processing in Digital Data (디지털 데이터에서 데이터 전처리를 위한 자동화된 결측 구간 대치 방법에 관한 연구)

Kim, Jong-Chan;Sim, Chun-Bo;Jung, Se-Hoon
- Journal of Korea Multimedia Society, v.24 no.2, pp.245-254, 2021
We conducted research on an analysis and prediction model that identifies outliers or abnormalities in the data and then imputes missing values effectively and rapidly. This model is expected to analyze problems in the data efficiently, based on the calibrated raw data. As a result, a system that can adequately utilize the data was constructed using the introduced KNN + MLE algorithm. This algorithm effectively addresses problems found in some existing KNN-based missing data imputation algorithms, such as ignoring the missing values in some data sections or discarding normal observations. To check the effectiveness and efficiency of the proposed algorithm, a comparative evaluation was performed against existing imputation approaches such as K-means, KNN, MEI, and MI, under data missing mechanisms including MCAR, MAR, and NI, and its superiority in all aspects was confirmed.
https://doi.org/10.9717/kmms.2020.24.2.245
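
For orientation, the plain KNN baseline that the abstract above compares against can be sketched as follows. This is a minimal illustration, not the authors' KNN + MLE algorithm: the function name `knn_impute`, the distance definition (root mean squared difference over features observed in both rows), and the toy data are all assumptions made for the example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Root mean squared difference over features observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that actually observed column j.
            donors = [(distance(row, other), other[j])
                      for n, other in enumerate(rows)
                      if n != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no comparable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Hypothetical toy data: the second row is missing its third feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 10.0],
]
filled = knn_impute(data, k=2)
```

Here the two nearest donors are the first and third rows, so the gap is filled with the average of their third features; the distant fourth row is ignored.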

Identification of Differentially Expressed Genes Using Tests Based on Multiple Imputations

Kim, Sang Cheol;Yu, Donghyeon
- Quantitative Bio-Science, v.36 no.1, pp.23-31, 2017
Datasets from DNA microarray experiments, which take the form of large matrices of gene expression levels, often have missing values. However, existing statistical methods, including principal component analysis (PCA) and Hotelling's t-test, are not directly applicable to datasets with missing values because they generally assume that the observed dataset is complete. Many methods have been proposed in the literature to impute the missing values in the observed data. Troyanskaya et al. [1] study k-nearest neighbor (kNN) imputation, Kim et al. [2] propose the local least squares (LLS) method, and Rubin [3] proposes multiple imputation (MI) for missing values. To identify differentially expressed genes, we propose a new testing procedure for observed data that contain missing values. The proposed procedure uses Stouffer's z-scores and combines the test results of the individual imputed samples, which are dependent on each other. We show numerically that the proposed test procedure based on MI performs better than existing test procedures based on single imputation (SI) by comparing their ROC curves. We apply the proposed method to the analysis of public microarray data.
https://doi.org/10.22283/qbs.2017.36.1.23
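
The idea of combining per-imputation test results with Stouffer's z-score can be sketched as below. The abstract does not give the authors' exact procedure, so this is only an illustration under a stated assumption: every pair of z-scores shares one common correlation (the hypothetical `correlation` parameter), which inflates the variance of their sum. The z-scores themselves are made-up values.

```python
import math
from statistics import NormalDist

def stouffer_combined_z(z_scores, correlation=0.0):
    """Combine z-scores; `correlation` is an assumed common pairwise correlation."""
    m = len(z_scores)
    # Var(sum z_i) = m + m(m-1)*rho when every pair shares correlation rho.
    variance = m + m * (m - 1) * correlation
    return sum(z_scores) / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical z-scores for one gene, one per imputed dataset.
z_per_imputation = [2.1, 1.8, 2.4, 2.0, 1.9]

z_indep = stouffer_combined_z(z_per_imputation)                   # treats tests as independent
z_dep = stouffer_combined_z(z_per_imputation, correlation=0.8)    # accounts for dependence
```

Treating the imputed samples as independent overstates the evidence: with positive correlation the combined statistic shrinks and the p-value grows, which is why the dependence noted in the abstract matters.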

A Probabilistic Tensor Factorization approach for Missing Data Inference in Mobile Crowd-Sensing

Akter, Shathee;Yoon, Seokhoon
- International Journal of Internet, Broadcasting and Communication, v.13 no.3, pp.63-72, 2021
Mobile crowd-sensing (MCS) is a promising sensing paradigm that leverages mobile users with smart devices to perform large-scale sensing tasks in order to provide services to specific applications in various domains. However, MCS sensing tasks may not always be completed successfully or on time for various reasons, such as users accidentally leaving tasks incomplete, asynchronous transmission, or connection errors. This results in missing sensing data at specific locations and times, which can degrade the performance of the applications and lead to serious casualties. Therefore, in this paper, we propose a missing data inference approach, called missing data approximation with probabilistic tensor factorization (MDI-PTF), to approximate the missing values as closely as possible to the actual values while taking into account the asynchronous data transmission times and the different sensing locations of the mobile users. The proposed method first normalizes the data to limit the range of the possible values. Next, a probabilistic model of tensor factorization is formulated, and finally, the data are approximated using the gradient descent method. The performance of the proposed algorithm is verified by conducting simulations under various situations using different datasets.
https://doi.org/10.7236/IJIBC.2021.13.3.63
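
The three steps named in the abstract above (normalize, formulate a factorization model, fit by gradient descent) can be sketched roughly as follows. This is not MDI-PTF itself: it assumes a plain CP (rank-R) factorization with an L2 penalty, which corresponds to a MAP estimate under Gaussian priors, and all names (`min_max_normalize`, `fit_cp`), hyperparameters, and the synthetic tensor are hypothetical.

```python
import random

def min_max_normalize(observed):
    """Scale observed values into [0, 1]; return the scaled dict and the range."""
    lo, hi = min(observed.values()), max(observed.values())
    return {idx: (v - lo) / (hi - lo) for idx, v in observed.items()}, lo, hi

def fit_cp(observed, shape, rank=2, lr=0.05, reg=0.001, steps=2000, seed=0):
    """Fit a 3-way CP model to observed entries {(i, j, k): value} by gradient descent."""
    rng = random.Random(seed)
    factors = [[[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]
               for n in shape]

    def predict(i, j, k):
        a, b, c = factors
        return sum(a[i][r] * b[j][r] * c[k][r] for r in range(rank))

    def loss():
        # Squared error on observed entries plus L2 penalty on all factors.
        err = sum((predict(*idx) - v) ** 2 for idx, v in observed.items())
        pen = sum(x * x for f in factors for row in f for x in row)
        return 0.5 * err + 0.5 * reg * pen

    history = [loss()]
    for _ in range(steps):
        a, b, c = factors
        # Start each gradient with the penalty term, then add data terms.
        grads = [[[reg * x for x in row] for row in f] for f in factors]
        for (i, j, k), v in observed.items():
            e = predict(i, j, k) - v
            for r in range(rank):
                grads[0][i][r] += e * b[j][r] * c[k][r]
                grads[1][j][r] += e * a[i][r] * c[k][r]
                grads[2][k][r] += e * a[i][r] * b[j][r]
        for f, g in zip(factors, grads):
            for row, grow in zip(f, g):
                for r in range(rank):
                    row[r] -= lr * grow[r]
        history.append(loss())
    return predict, history

# Synthetic 3 x 2 x 2 tensor (e.g. user x location x time) with two hidden cells.
a, b, c = [0.2, 0.5, 0.9], [0.3, 0.8], [0.4, 1.0]
truth = {(i, j, k): 10.0 + 50.0 * a[i] * b[j] * c[k]
         for i in range(3) for j in range(2) for k in range(2)}
hidden = {(1, 1, 0), (2, 0, 1)}
observed = {idx: v for idx, v in truth.items() if idx not in hidden}

scaled, lo, hi = min_max_normalize(observed)
predict, history = fit_cp(scaled, shape=(3, 2, 2))
# Map predictions for the hidden cells back to the original scale.
estimate = {idx: lo + (hi - lo) * predict(*idx) for idx in hidden}
```

Only observed cells contribute to the gradient, so the hidden cells are filled purely from the structure the factors learn; `history` records the objective at each step so convergence can be inspected.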

Imputation method for missing data based on clustering and measure of property (군집화 및 특성도를 이용한 결측치 대체 방법)

Kim, Sunghyun;Kim, Dongjae
- The Korean Journal of Applied Statistics, v.31 no.1, pp.29-40, 2018
Missing values arise for various reasons when collecting data. Because missing values influence the analysis and its results, various methods of processing them have been studied. In repeated measurement data, values at later time points may be affected by the value at the initial time point. However, no existing method imputes missing values using this concept. Therefore, in this study we propose a new missing value imputation method that uses clustering at the initial time point of the repeated measurement data together with the measure of property proposed by Kim and Kim (The Korean Communications in Statistics, 30, 463-473, 2017). We also use Monte Carlo simulations to compare the performance of the existing methods and the proposed method on repeated measurement data.
https://doi.org/10.5351/KJAS.2018.31.1.029

Exploiting Patterns for Handling Incomplete Coevolving EEG Time Series

Thi, Ngoc Anh Nguyen;Yang, Hyung-Jeong;Kim, Sun-Hee
- International Journal of Contents, v.9 no.4, pp.1-10, 2013
The electroencephalogram (EEG) time series is a measure of electrical activity received from multiple electrodes placed on the scalp of a human brain. It provides a direct measurement for characterizing the dynamic aspects of brain activities. These EEG signals are formed from a series of spatial and temporal data with multiple dimensions. Missing data can occur due to faulty electrodes. These missing data can cause distortion and repudiation, and further reduce the effectiveness of analysis algorithms. Current methodologies for EEG analysis require a complete EEG data matrix as input. Therefore, an accurate and reliable imputation approach for missing values is necessary to avoid incomplete data sets for analyses and to further improve the usage of performance techniques. This research proposes a new method to automatically recover random consecutive missing data from real-world EEG data based on a Linear Dynamical System. The proposed method aims to capture the optimal patterns based on two main characteristics of the coevolving EEG time series: (i) dynamics, by discovering temporally evolving behaviors, and (ii) correlations, by identifying the relationships between multiple brain signals. From these, the proposed method successfully identifies a few hidden variables and discovers their dynamics to impute missing values. The proposed method offers a robust and scalable approach with computation time linear in the size of the sequences. A comparative study has been performed to assess the effectiveness of the proposed method against interpolation and missing values via Singular Value Decomposition (MSVD). The experimental simulations demonstrate that the proposed method improves reconstruction performance by up to 49% and 67% over the MSVD and interpolation approaches, respectively.
https://doi.org/10.5392/IJoC.2013.9.4.001
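
As a point of reference for the interpolation baseline mentioned in the abstract above, a minimal linear interpolation over runs of consecutive missing samples might look like this. The abstract does not say which interpolation variant was used, so the function `interpolate_gaps`, its handling of gaps at the edges (copying the nearest observed value), and the sample channel are assumptions.

```python
def interpolate_gaps(series):
    """Linearly fill None runs between observed points; edge gaps copy the nearest value."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return filled  # nothing to anchor on
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            # Position of i between its two observed neighbours, in [0, 1].
            t = (i - left) / (right - left)
            filled[i] = series[left] + t * (series[right] - series[left])
    return filled

# Hypothetical single-channel signal with a two-sample gap and a trailing gap.
channel = [1.0, None, None, 4.0, None]
recovered = interpolate_gaps(channel)
```

Because each channel is filled independently, this baseline ignores the cross-channel correlations and temporal dynamics that the proposed method is described as exploiting.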

Large tests of independence in incomplete two-way contingency tables using fractional imputation

Kang, Shin-Soo;Larsen, Michael D.
- Journal of the Korean Data and Information Science Society, v.26 no.4, pp.971-984, 2015
Imputation procedures fill in missing values, thereby enabling complete data analyses. Fully efficient fractional imputation (FEFI) and multiple imputation (MI) create multiple versions of the missing observations, thereby reflecting uncertainty about their true values. Methods have been described for hypothesis testing with multiple imputation. Fractional imputation assigns weights to the observed data to compensate for missing values. The focus of this article is the development of tests of independence using FEFI for partially classified two-way contingency tables. Wald and deviance tests of independence under FEFI are proposed. Simulations are used to compare type I error rates and power. The partially observed marginal information is useful for estimating the joint distribution of cell probabilities, but it is not useful for testing association. FEFI compares favorably to other methods in simulations.
https://doi.org/10.7465/jkdi.2015.26.4.971

Recovering Incomplete Data using Tucker Model for Tensor with Low-n-rank

Thieu, Thao Nguyen;Yang, Hyung-Jeong;Vu, Tien Duong;Kim, Sun-Hee
- International Journal of Contents, v.12 no.3, pp.22-28, 2016
Tensors with missing or incomplete values are a ubiquitous problem in various fields such as biomedical signal processing, image processing, and social network analysis. In this paper, we consider how to reconstruct a dataset with missing values in tensor form, a process called tensor completion. We apply Tucker factorization to solve tensor completion, which is formulated as an optimization problem. We formulate the optimization objective function using the components of the Tucker model after decomposition. The weighted least squares metric contains only the known values of the tensor, which has low rank in its modes. A first-order optimization method, namely Nonlinear Conjugate Gradient, is applied to solve the optimization problem. We demonstrate the effectiveness of the proposed method on EEG signals with about 70% missing entries in comparison with other algorithms. The relative error is used to compare the difference between the original tensor and the output of the process.
https://doi.org/10.5392/IJoC.2016.12.3.022
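
The abstract above does not state its formulas, but a weighted least-squares Tucker completion objective and a relative error measure are commonly written as follows. The symbols are assumptions for illustration: $\mathcal{X}$ is the data tensor, $\mathcal{W}$ is a binary mask equal to 1 on known entries and 0 elsewhere, $\mathcal{G}$ is the Tucker core, $A$, $B$, $C$ are the factor matrices, and $\ast$ is the elementwise product.

```latex
% Weighted least-squares objective over known entries only (assumed form)
f(\mathcal{G}, A, B, C) = \tfrac{1}{2}
  \left\lVert \mathcal{W} \ast \left( \mathcal{X}
  - \mathcal{G} \times_1 A \times_2 B \times_3 C \right) \right\rVert_F^2

% Relative error between the original tensor and the reconstruction (assumed form)
\hat{\mathcal{X}} = \mathcal{G} \times_1 A \times_2 B \times_3 C,
\qquad
\mathrm{RelErr} = \frac{\lVert \hat{\mathcal{X}} - \mathcal{X} \rVert_F}
                       {\lVert \mathcal{X} \rVert_F}
```

Because the mask zeroes out unknown entries, they contribute nothing to the objective or its gradient, which is what lets a first-order method fit the model from the known values alone.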

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

Jun, Sung-Hae;Park, Jung-Eun;Oh, Kyung-Whan
- Journal of the Korean Institute of Intelligent Systems, v.14 no.6, pp.789-792, 2004
Missing values of many kinds are found in fields such as web mining, bioinformatics, and statistical data analysis. These values make the training data sparse. Most commonly, missing values are replaced by values predicted from the mean or mode. More advanced missing value imputation methods, such as the conditional mean, tree methods, and the Markov Chain Monte Carlo algorithm, can also be used. However, the predictive accuracy of general imputation models decreases as the ratio of missing values in the training data increases. Moreover, the number of available imputations becomes limited as the missing ratio increases. To address this problem, we propose using statistical learning theory to preprocess missing values. The statistical learning method we use is the support vector regression of Vapnik. The proposed method can be applied to sparse training data. We verified the performance of our model using data sets from the UCI machine learning repository.
https://doi.org/10.5391/JKIIS.2004.14.6.789

Proposal to Supplement the Missing Values of Air Pollution Levels in Meteorological Dataset (기상 데이터에서 대기 오염도 요소의 결측치 보완 기법 제안)

Jo, Dong-Chol;Hahn, Hee-Il
- The Journal of the Institute of Internet, Broadcasting and Communication, v.21 no.1, pp.181-187, 2021
Recently, various air pollution factors have been measured and analyzed to reduce the damage caused by air pollution. In this process, many missing values occur for various reasons. Compensating for them generally requires a vast amount of training data. This paper proposes a statistical technique that effectively compensates for missing values generated in the process of measuring ozone, carbon dioxide, and ultra-fine dust, using only a small amount of training data. The proposed algorithm first extracts a group of meteorological data that is expected to help correct the missing values, using statistical information such as the correlation between the meteorological data and the air pollution level factors and the p-value. It then compensates for the missing values efficiently and effectively by analyzing the extracted data. To confirm the performance of the proposed algorithm, we analyze its characteristics through various experiments and compare its performance with that of well-known representative algorithms.
https://doi.org/10.7236/JIIBC.2021.21.1.181

Development of a Machine Learning Model for Imputing Time Series Data with Massive Missing Values (결측치 비율이 높은 시계열 데이터 분석 및 예측을 위한 머신러닝 모델 구축)

Bangwon Ko;Yong Hee Han
- The Journal of Korea Institute of Information, Electronics, and Communication Technology, v.17 no.3, pp.176-182, 2024
In this study, we compared and analyzed various methods of missing data handling to build a machine learning model that can effectively analyze and predict time series data with a high percentage of missing values. For this purpose, Predictive State Model Filtering (PSMF), MissForest, and Imputation By Feature Importance (IBFI) methods were applied, and their prediction performance was evaluated using LightGBM, XGBoost, and Explainable Boosting Machines (EBM) machine learning models. The results of the study showed that MissForest and IBFI performed the best among the methods for handling missing values, reflecting the nonlinear data patterns, and that XGBoost and EBM models performed better than LightGBM. This study emphasizes the importance of combining nonlinear imputation methods and machine learning models in the analysis and prediction of time series data with a high percentage of missing values, and provides a practical methodology.
https://doi.org/10.17661/jkiiect.2024.17.3.176

Search Result 441, Processing Time 0.025 seconds
