• Title/Summary/Keyword: 결측 (missing data)

Search Result 428, Processing Time 0.027 seconds

Missing Value Imputation Technique for Water Quality Dataset

  • Jin-Young Jun;Youn-A Min
    • Journal of the Korea Society of Computer and Information / v.29 no.4 / pp.39-46 / 2024
  • Many researchers evaluate water quality using various models. Such models require a dataset without missing values, but in the real world most datasets include missing values for various reasons. Simply deleting samples that have missing values can distort the distribution of the underlying data and poses a significant risk of biasing the model's inference when the missing-data mechanism is not MCAR (missing completely at random). In this study, to find the most appropriate technique for handling missing values in water quality data, several imputation techniques based on existing KNN and MICE imputation were tested, with and without the generative neural network models Autoencoder (AE) and Denoising Autoencoder (DAE). The results show that combined KNN and MICE imputation without generative networks provides estimates closest to the true values. When binary classification models based on support vector machine and ensemble algorithms are evaluated after applying the combined imputation technique to the observed water quality dataset with missing values, they perform better in terms of accuracy, F1 score, ROC-AUC score, and MCC than models evaluated after deleting samples with missing values.
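
As a rough illustration of the KNN side of the comparison above, the sketch below fills each missing entry with the mean of that feature over the k nearest rows, measuring distance only over features that both rows have observed. The function name `knn_impute`, the distance convention, the toy rows, and the choice k=2 are assumptions for illustration, not the authors' setup; MICE and the autoencoder variants are not shown.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest rows' values.

    Distance is the root mean squared difference over the features that
    both rows have observed (an assumption; other conventions exist).
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlapping features: treat as unusable
        return math.sqrt(sum(shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed feature j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave None if no usable donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical water-quality rows: [pH, dissolved oxygen, turbidity]
samples = [
    [7.0, 8.0, 1.0],
    [7.2, 8.2, 1.2],
    [6.0, 4.0, 9.0],
    [7.1, None, 1.1],
]
completed = knn_impute(samples, k=2)
```

Here the last row is closest to the first two rows, so its missing dissolved oxygen value is filled with their average rather than being pulled toward the dissimilar third row.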

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

  • 전성해;박정은;오경환
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2004.04a / pp.499-501 / 2004
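  • In many fields, such as web logs and bioinformatics, missing values of various kinds occur and make the training data sparse. During preprocessing, missing values are usually replaced with values estimated by basic imputation methods such as conditional means or tree models, and some are simply removed. In particular, when the proportion of missing values becomes very large, the accuracy of existing imputation methods drops. Moreover, as the proportion of missing values increases, the number of usable imputation methods becomes extremely limited. To address these problems, this paper proposes a Support Vector Regression method that adapts Vapnik's Support Vector Regression to the data preprocessing step. The proposed method makes it possible to preprocess sparse data with a considerably large proportion of missing values. Its performance was verified using data from the UCI machine learning repository.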

A Study of the Method for Estimating the Missing Data from Weather Measurement Instruments (인공신경망을 이용한 기상관측장비 결측 보완 기술에 관한 연구)

  • Min, Jae-Sik;Lee, Moo-Hun;Jee, Joon-Bum;Jang, Min
    • Journal of Digital Convergence / v.14 no.8 / pp.245-252 / 2016
  • The purpose of this study is to fill in missing weather information from ASOS and AWS using artificial neural networks. We collected temperature, relative humidity, and wind velocity data for August over five years (2011-2015) and designed a sample artificial neural network, assuming that data from the Seoul weather station were missing. A sensitivity study on the number of epochs shows that early stopping occurred at 2,000 epochs. The correlation between observation and prediction was higher than 0.6 overall, and higher than 0.9 and 0.8 for temperature and humidity, respectively. RMSE decreased gradually and training time increased exponentially as the number of epochs increased. At 40 epochs, the predictability already reached more than 80% of the improvement obtained by the time of early stopping. With a training time of under 2 seconds, this rapid gap filling is expected to make more detailed weather information usable.
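
The two evaluation measures named above, RMSE and the correlation between observation and prediction, can be sketched as follows. The helper names and the toy temperature values are assumptions for illustration and are not taken from the study's data.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Hypothetical hourly temperatures (observed) and network output (predicted).
obs = [24.0, 25.0, 26.0, 27.0]
pred = [24.5, 25.5, 26.5, 27.5]
error = rmse(obs, pred)
corr = pearson_r(obs, pred)
```

A constant offset, as in this toy case, leaves the correlation at its maximum while still producing a nonzero RMSE, which is why the two measures are usually reported together.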

A comparison of imputation methods using nonlinear models (비선형 모델을 이용한 결측 대체 방법 비교)

  • Kim, Hyein;Song, Juwon
    • The Korean Journal of Applied Statistics / v.32 no.4 / pp.543-559 / 2019
  • Data often include missing values for various reasons. If the missing data mechanism is not MCAR, analysis based on fully observed cases may cause estimation bias and decrease the precision of the estimates, since partially observed cases are excluded. Missing values cause more serious problems especially when data include many variables. Many imputation techniques have been suggested to overcome this difficulty. However, imputation methods using parametric models may not fit well with real data that do not satisfy the model assumptions. In this study, we review imputation methods using nonlinear models such as kernel, resampling, and spline methods, which are robust to model assumptions. In addition, we suggest utilizing imputation classes to improve imputation accuracy, or adding random errors to correctly estimate the variance of the estimates in nonlinear imputation models. The performance of imputation methods using nonlinear models is compared under various simulated data settings. Simulation results indicate that the performance of the imputation methods varies as data settings change. However, imputation based on kernel regression or the penalized spline performs better in most situations. Utilizing imputation classes or adding random errors improves the performance of imputation methods using nonlinear models.
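
To make the kernel-based approach concrete, here is a minimal sketch of Nadaraya-Watson kernel regression imputation with an optional random error added to each imputed value. The Gaussian kernel, the bandwidth value, the choice to draw the random error from observed-case residuals, the function names, and the toy data are all assumptions chosen for illustration; the paper's actual estimators and simulation settings are not reproduced here.

```python
import math
import random

def kernel_predict(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

def kernel_impute(x, y, bandwidth=1.0, add_noise=False, rng=None):
    """Impute None entries of y from x using the observed (x, y) pairs.

    With add_noise=True, a residual drawn at random from the observed
    cases is added to each imputed value (an illustrative assumption).
    """
    rng = rng or random.Random(0)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [p[0] for p in observed]
    ys = [p[1] for p in observed]
    residuals = [yi - kernel_predict(xi, xs, ys, bandwidth)
                 for xi, yi in observed]
    out = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = kernel_predict(xi, xs, ys, bandwidth)
            if add_noise:
                yi += rng.choice(residuals)
        out.append(yi)
    return out

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, None, 3.0, 4.0]
imputed = kernel_impute(x, y, bandwidth=1.0)
noisy = kernel_impute(x, y, bandwidth=1.0, add_noise=True)
```

Without the added error, every imputed value sits exactly on the fitted curve, so variance computed from the completed data tends to be too small; the noisy variant is one simple way to restore some of that variability.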

Comparison of Single Imputation Methods in 2×2 Cross-Over Design with Missing Observations (2×2 교차계획법에서 결측치가 있을 때의 결측치 처리 방법 비교에 관한 연구)

  • Jo, Bobae;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.28 no.3 / pp.529-540 / 2015
  • A cross-over design is frequently used in clinical trials (especially in bioequivalence tests with a parametric method) to compare two treatments. In cross-over designs, missing values frequently occur in the second period. Usually, subjects with missing values are removed before analysis. However, this can be unsuitable in clinical trials with a small sample size. In this paper, we compare single imputation methods in a 2×2 cross-over design when missing values exist in the second period. Additionally, parametric and nonparametric methods are compared after applying the single imputation methods. A Monte Carlo simulation study compares the type I error and power of the methods.

Missing Imputation Methods Using the Spatial Variable in Sample Survey (표본조사에서 공간 변수(SPATIAL VARIABLE)를 이용한 결측 대체(MISSING IMPUTATION)의 효율성 비교)

  • Lee Jin-Hee;Kim Jin;Lee Kee-Jae
    • The Korean Journal of Applied Statistics / v.19 no.1 / pp.57-67 / 2006
  • In sample surveys, nonresponse inevitably occurs. If we use information from respondents only, the estimates will be biased. To overcome this, various nonresponse imputation methods have been studied. If there are few auxiliary variables available for imputation, or if spatial autocorrelation exists between respondents and nonrespondents, spatial autocorrelation can be used for imputation. In this paper, we apply several nonresponse imputation methods, including spatial imputation, to farm household economy data from Gangwon-do in 2002 as an example. Numerical simulations show that spatial imputation is more efficient than the other methods.

A Comparative Study between Telemetering and Recording Stage Gage Data (TM 및 일반수위자료 비교분석연구)

  • Kim, Hwi-Rin;Cho, Hyo-Seob;Baek, Chang-Hyun;Jeong, Hyeon-Gyo
    • Proceedings of the Korea Water Resources Association Conference / 2008.05a / pp.1320-1323 / 2008
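  • The Han River Flood Control Office of the Ministry of Construction and Transportation currently operates 96 stage (water level) gauging stations (per the 2006 Annual Hydrological Report of Korea). Stage data collected in the field are transmitted in one of two ways: the TM (TeleMetering) method, in which data are sent in real time through relay stations, and the recording chart (Recording) method. High-quality stage data are used not only in water resources research but also in efficient river management and various national land development plans. TM data in particular are hydrological observations collected from the field in real time and serve as the most important input to the flood forecasting system. A review of the TM stage data and the general stage data maintained by the Han River Flood Control Office shows that data errors at gauging stations, classified by step from the instrument through transmission, can be grouped into erroneous or missing measurements caused by gauge failure (such as a stuck float), changes in the transmission route, and communication equipment failure. In the past, the only method in use for correcting erroneous or missing data was to correct anomalies using the stage data from 2 or 3 hours earlier. Last year, however, a research project titled "National Hydrological Data Quality Control System Construction (Phase 1)" was carried out for the Han River system; its pilot results have been in use since this year, and the system includes a variety of data correction methods. Separately, for stations where TM and general recording are both in place, using the continuous recording chart data has been suggested as an alternative for correcting historical data. However, supplementing erroneous or missing TM data with recording chart data has not yet been studied, and multifaceted reviews of this topic are lacking in Korea. This study therefore selects stations with dual recording under the jurisdiction of the Han River Flood Control Office and compares their TM and recording chart stage observations, examining in depth the effectiveness of supplementing erroneous or missing TM data with general recording chart data, with the aim of laying a foundation for improving the quality of stage data.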

A New Method for Imputation of Missing Genotype using Linkage Disequilibrium and Haplotype Information (결측치가 존재하는 유전형 자료에서의 연관불균형과 일배체형을 사용한 결측치 대치 방법)

  • Park Yun-Ju;Kim Young-Jin;Park Jung-Sun;Kim Kuchan;Koh Insong;Jung Ho-Youl
    • Journal of KIISE:Software and Applications / v.32 no.2 / pp.99-107 / 2005
  • In this paper, we propose new missing-value imputation methods for minimizing the loss of information: a linkage disequilibrium-based method and a haplotype-based method, which estimate missing values based on the specific characteristics of Single Nucleotide Polymorphism (SNP) genotype data. Imputation methods are needed to minimize the loss of information caused by experimental missing data. In general, imputation of missing biological data has relied on the major allele imputation method, but this approach is not optimal. It has high error rates in estimating missing values because it does not take into consideration the characteristics and specific structure of the genotype data. In this paper, we show the results of a comparative evaluation of our proposed methods and the major allele imputation method for the estimation of missing values.
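
For context, the major allele baseline that the abstract compares against can be sketched in a few lines: each missing genotype at a SNP is replaced with the homozygous genotype of the most frequent allele. The two-letter genotype coding, the "NN" missing marker, and the function name are assumptions for illustration. The linkage disequilibrium-based and haplotype-based methods proposed in the paper are not shown, since the abstract does not give their details.

```python
from collections import Counter

def major_allele_impute(genotypes, missing="NN"):
    """Replace missing genotypes at one SNP with the homozygous major allele.

    `genotypes` is a list of two-letter strings such as "AA" or "AG";
    this coding and the "NN" missing marker are illustrative assumptions.
    """
    allele_counts = Counter()
    for g in genotypes:
        if g != missing:
            allele_counts.update(g)  # counts each of the two alleles
    if not allele_counts:
        return list(genotypes)  # nothing observed, nothing to impute from
    major = allele_counts.most_common(1)[0][0]
    return [major * 2 if g == missing else g for g in genotypes]

# Hypothetical genotypes for one SNP across six individuals.
snp = ["AA", "AG", "NN", "AA", "GG", "NN"]
imputed_snp = major_allele_impute(snp)
```

Because this baseline looks at one SNP in isolation, it ignores any correlation with neighboring markers, which is the structure the abstract says the proposed methods exploit.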

Variational Mode Decomposition with Missing Data (결측치가 있는 자료에서의 변동모드분해법)

  • Choi, Guebin;Oh, Hee-Seok;Lee, Youngjo;Kim, Donghoh;Yu, Kyungsang
    • The Korean Journal of Applied Statistics / v.28 no.2 / pp.159-174 / 2015
  • Dragomiretskiy and Zosso (2014) developed a new decomposition method, termed variational mode decomposition (VMD), which is efficient for tone detection and signal separation. However, VMD may be inefficient in the presence of missing data since it is based on a fast Fourier transform (FFT) algorithm. To overcome this problem, we propose a new approach based on a novel combination of VMD and the hierarchical (or h)-likelihood method. The h-likelihood provides an effective imputation methodology for missing data when VMD decomposes the signal into several meaningful modes. A simulation study and a real data analysis demonstrate that the proposed method can produce substantially effective results.

Outlier Filtering and Missing Data Imputation Algorithm using TCS Data (TCS데이터를 이용한 이상치제거 및 결측보정 알고리즘 개발)

  • Do, Myung-Sik;Lee, Hyang-Mee;NamKoong, Seong
    • Journal of Korean Society of Transportation / v.26 no.4 / pp.241-250 / 2008
  • With the ever-growing amount of traffic, there is an increasing need for good quality travel time information. Various outlier filtering and missing data imputation algorithms using AVI data for interrupted and uninterrupted traffic flow have been proposed. This paper develops an outlier filtering and missing data imputation algorithm using Toll Collection System (TCS) data. TCS travel time data collected from August to September 2007 were employed. Travel time data from TCS are built from records of every passing vehicle and therefore have the potential to provide real-time travel time information. However, the authors found that as the distance between entry and exit tollgates increases, the variance of travel time also increases. Time gaps also appeared where the distance between tollgates was long. Finally, after analyzing existing methods, the authors propose a new method for computing representative values once abnormal and "noise" data have been removed. The proposed algorithm is effective.