• Title/Summary/Keyword: missing data (결측데이터)

Search Result 133, Processing Time 0.034 seconds

Long-gap Filling Method for the Coastal Monitoring Data (해양모니터링 자료의 장기결측 보충 기법)

  • Cho, Hong-Yeon;Lee, Gi-Seop;Lee, Uk-Jae
    • Journal of Korean Society of Coastal and Ocean Engineers / v.33 no.6 / pp.333-344 / 2021
  • A technique is developed for filling the long gaps that frequently occur in ocean monitoring data. The method estimates the unknown values in a long gap as the sum of an estimated trend and selected residual components for the missing interval. It was used to impute long-term missing intervals of about one month in the Ulleungdo ocean buoy data, such as temperature and water temperature. The imputed data differed depending on the monitoring parameter, but the variation pattern was found to be reproduced appropriately. Although the method introduces bias and variance errors through the estimation of the trend and residual components, the bias error in statistical measures caused by long-term missing data was found to be greatly reduced. The mean and the 90% confidence interval of the gap-filling model's RMS errors are 0.93 and 0.35~1.95, respectively.
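
The "trend plus residual" idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' estimator: the straight-line trend, the choice of a fully observed "donor" segment as the residual source, and the names `fill_long_gap` and `donor_start` are all assumptions made for illustration.

```python
def fill_long_gap(series, start, end, donor_start):
    """Fill series[start:end] (None values) with trend + donor residuals.

    Assumptions (hypothetical, not from the paper):
      * trend    = straight line joining the observations on either side
                   of the gap;
      * residual = deviations of a fully observed donor segment of the
                   same length from its own straight-line trend.
    Requires observed values at start-1, end, donor_start-1 and
    donor_start+n, where n is the gap length.
    """
    n = end - start
    left, right = series[start - 1], series[end]
    # Trend across the gap: linear interpolation between the gap edges.
    trend = [left + (right - left) * (i + 1) / (n + 1) for i in range(n)]

    # Residuals of the donor segment around its own edge-to-edge line.
    donor = series[donor_start:donor_start + n]
    d0, d1 = series[donor_start - 1], series[donor_start + n]
    donor_trend = [d0 + (d1 - d0) * (i + 1) / (n + 1) for i in range(n)]
    residual = [x - t for x, t in zip(donor, donor_trend)]

    # Imputed value = estimated trend + selected residual component.
    filled = list(series)
    filled[start:end] = [t + r for t, r in zip(trend, residual)]
    return filled


series = [0, 1, 3, 2, 4, None, None, None, 8, 9]
print(fill_long_gap(series, start=5, end=8, donor_start=1)[5:8])
```

In this toy series the gap trend rises from 4 to 8 and the donor residuals are taken from indices 1 to 3, so the filled values follow the trend while keeping a realistic fluctuation instead of a flat line.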

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

  • 전성해;박정은;오경환
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2004.04a / pp.499-501 / 2004
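  • Missing values of many kinds arise in fields such as web logs and bioinformatics, and they make the training data sparse. During preprocessing, missing values are usually either replaced by estimates from basic imputation methods, such as the conditional mean or tree models, or removed. When the proportion of missing values becomes very large, however, the accuracy of existing imputation methods falls. Moreover, as the missing ratio increases, the number of usable imputation methods becomes severely limited. To address these problems, this paper proposes a Support Vector Regression method adapted from Vapnik's Support Vector Regression to suit data preprocessing. The proposed method makes it possible to preprocess sparse data with a considerably large proportion of missing values. Its performance was verified using data obtained from the UCI machine learning repository.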

Development of a method for constructing hydrological time series input data for deep learning analysis (딥러닝 분석을 위한 수문시계열 입력자료 구성 기법 개발)

  • Yuk, Gi-moon;Cho, He-rin;Park, Chan-ho;Moon, Soo-jin;Moon, Yong-il
    • Proceedings of the Korea Water Resources Association Conference / 2021.06a / pp.349-349 / 2021
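  • Urban flood models generally use flood-level models based on hydraulic-hydrologic models, but the simulation time varies with the rainfall event and the physical conditions, and in some cases long simulation times are required. In this study, data-driven simulation using deep learning, which has drawn great interest since AlphaGo, was applied to water resources to predict water levels from observed data. The Jungnangcheon basin was selected as the study area, and 10-minute rainfall and water level data from 2015 to 2020 were used. Rainfall and water level data provided by local governments are insufficiently corrected for missing and abnormal values, which makes them difficult to use as analysis data for machine learning. Therefore, from data containing missing and abnormal values, artificially perturbed data and data with the missing intervals deleted were generated to remove the time-series property of the data, and the deep learning water level predictions were compared with the results obtained using normal data. The deep learning models used were LSTM and GRU, which perform well in time series prediction, and they were evaluated using RMSE and NSE. This study presents a way to construct deep learning input data from hydrological data containing missing and abnormal values by removing the time-series property of the data.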
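
The two evaluation measures named in this abstract can be written out in a few lines of Python. The sketch below assumes NSE denotes the Nash-Sutcliffe efficiency, as is usual in hydrology; the function names and sample values are illustrative only and do not come from the study.

```python
import math


def rmse(observed, simulated):
    """Root mean square error between observed and simulated values."""
    squared = [(o - s) ** 2 for o, s in zip(observed, simulated)]
    return math.sqrt(sum(squared) / len(squared))


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    skill of always predicting the observed mean.
    Undefined (division by zero) when the observations are constant."""
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - error / spread


# Made-up water levels (m), for illustration only.
observed = [1.0, 2.0, 3.0]
simulated = [2.0, 2.0, 2.0]   # a model that always predicts the mean
print(rmse(observed, simulated), nse(observed, simulated))
```

RMSE is in the units of the water level, while NSE is dimensionless, which is why the two are often reported together.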

Missing Value Imputation Technique for Water Quality Dataset

  • Jin-Young Jun;Youn-A Min
    • Journal of the Korea Society of Computer and Information / v.29 no.4 / pp.39-46 / 2024
  • Many researchers evaluate water quality using various models. Such models require a dataset without missing values, but in the real world most datasets include missing values for various reasons. Simply deleting samples that have missing values can distort the distribution of the underlying data and poses a significant risk of biasing the model's inference when the missing mechanism is not MCAR. In this study, to find the most appropriate technique for handling missing values in water quality data, several imputation techniques based on existing KNN and MICE imputation were tested, with and without the generative neural network models Autoencoder (AE) and Denoising Autoencoder (DAE). The results show that combined KNN and MICE imputation without generative networks provides the estimates closest to the true values. When binary classification models based on support vector machine and ensemble algorithms are evaluated after applying the combined imputation technique to the observed water quality dataset with missing values, they perform better in terms of accuracy, F1 score, ROC-AUC score, and MCC than models evaluated after deleting the samples with missing values.
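
As a rough illustration of the KNN half of this approach, the following Python sketch fills each missing entry with the mean of that column over the k most similar rows. It is a simplified stand-in, not the paper's combined KNN and MICE procedure; the function name `knn_impute`, the distance definition, and the sample rows are assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Replace None entries with the column mean over the k nearest rows."""

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare: treat as farthest away
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: every other row that actually observed column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Made-up two-column samples; the last row is missing its second value.
rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
print(knn_impute(rows, k=2)[3])
```

In practice the columns would be scaled before computing distances, since a column with a large numeric range would otherwise dominate the neighbor search.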

Outlier Filtering and Missing Data Imputation Algorithm using TCS Data (TCS데이터를 이용한 이상치제거 및 결측보정 알고리즘 개발)

  • Do, Myung-Sik;Lee, Hyang-Mee;NamKoong, Seong
    • Journal of Korean Society of Transportation / v.26 no.4 / pp.241-250 / 2008
  • With the ever-growing amount of traffic, there is an increasing need for good-quality travel time information. Various outlier filtering and missing data imputation algorithms using AVI data have been proposed for interrupted and uninterrupted traffic flow. This paper develops an outlier filtering and missing data imputation algorithm that uses Toll Collection System (TCS) data. TCS travel time data collected from August to September 2007 were employed. Travel time data from TCS are built from the records of every passing vehicle, so they have the potential to provide real-time travel time information. However, the authors found that as the distance between entry and exit tollgates increases, the variance of travel time also increases. Time gaps also appeared when the distance between tollgates was long. Finally, after analyzing existing methods, the authors propose a new method for producing representative values after removing abnormal and "noise" data. The proposed algorithm is effective.

A Study on the Imputation for Missing Data in Dual-loop Vehicle Detector System (차량 검지자료 결측 보정처리에 관한 연구 (이력자료 활용방안을 중심으로))

  • Kim, Jeong-Yeon;Lee, Yeong-In;Baek, Seung-Geol;Nam, Gung-Seong
    • Journal of Korean Society of Transportation / v.24 no.7 s.93 / pp.27-40 / 2006
  • Traffic information is provided on the basis of the traffic volume, speed, and occupancy collected by the currently operating Vehicle Detector System (VDS), and its use is gradually increasing across a widening range of fields and users. Missing data in vehicle detector data are data records transmitted to the controller without specific property values. Because such records carry no data property, they are excluded from all data processing; a rising ratio of missing data in VDS data therefore makes the representation of the actual traffic situation unreliable. This study presents an imputation process for missing data that applies two methodologies: reference to adjacent stations and use of historical data. The process was applied to data from the VDS currently operating on the SeoHaeAn and Kyongbu Expressways, after missing data were introduced at selected ratios. Imputation was performed per lane on 30-second periods, with the time range classified as morning, afternoon, or whole day, and the error of the imputed data was analyzed against the actual data. The results show that imputation using historical data produced relatively lower errors than the adjacent-station reference method.
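
The two imputation strategies compared in this abstract can be sketched in Python as follows. The abstract does not give the exact formulas, so simple averaging is assumed here, and the function names and sample speeds are hypothetical.

```python
def impute_from_history(today, history):
    """Fill None slots with the mean of the same time slot on past days.

    `today` and each entry of `history` are lists with one value per
    30-second slot (an assumed layout)."""
    filled = []
    for slot, value in enumerate(today):
        if value is not None:
            filled.append(value)
            continue
        past = [day[slot] for day in history if day[slot] is not None]
        # Leave the slot empty when no past day observed it.
        filled.append(sum(past) / len(past) if past else None)
    return filled


def impute_from_neighbors(station, upstream, downstream):
    """Fill None slots with the mean of the adjacent stations' values."""
    filled = []
    for value, up, down in zip(station, upstream, downstream):
        if value is not None:
            filled.append(value)
            continue
        neighbors = [v for v in (up, down) if v is not None]
        filled.append(sum(neighbors) / len(neighbors) if neighbors else None)
    return filled


# Made-up lane speeds (km/h) for three consecutive 30-second slots.
today = [30, None, 50]
history = [[28, 40, 52], [32, 44, None]]
print(impute_from_history(today, history))
print(impute_from_neighbors(today, [31, 46, 49], [29, 38, 51]))
```

Comparing either output against withheld actual values, slot by slot, gives the kind of error analysis the abstract describes.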

A Design of Behavior Recognition method through GAN-based skeleton data generation (GAN 기반 관절 데이터 생성을 통한 행동 인식 방법 설계)

  • Kim, Jinah;Moon, Nammee
    • Proceedings of the Korea Information Processing Society Conference / 2022.11a / pp.592-593 / 2022
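  • In behavior recognition based on multiple data sources, missing video data, whose collection range is relatively limited, need to be compensated for. This paper proposes a method that improves behavior recognition performance by generating the missing video data from 6-axis sensor data. Using behavior data collected from accelerometer and gyroscope sensors, the method aims to generate data on skeleton (joint) movement in video through a GAN (Generative Adversarial Network). To this end, joint coordinates are extracted through DeepLabCut-based model training, and video sequence data for the joint coordinates are generated from the preprocessed sensor sequence data by a GRU-based GAN model. When video data are missing, the generated video sequence data can be used in their place as input to the behavior recognition model, so improved performance can be expected.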

Data Cleansing Algorithm for reducing Outlier (데이터 오·결측 저감 정제 알고리즘)

  • Lee, Jongwon;Kim, Hosung;Hwang, Chulhyun;Kang, Inshik;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2018.10a / pp.342-344 / 2018
  • This paper shows that the proposed algorithm can substitute for statistical methods such as mean imputation, correlation coefficient analysis, and graph correlation analysis, and can replace a statistician in processing the various abnormal data measured in the water treatment process. In addition, to improve the reliability and validation of the algorithm, this study aims to model a data-filtering system based on a recent fractile pattern and a deep learning-based LSTM algorithm, using open-source libraries such as Keras, Theano, and TensorFlow.

A Study on the cleansing of water data using LSTM algorithm (LSTM 알고리즘을 이용한 수도데이터 정제기법)

  • Yoo, Gi Hyun;Kim, Jong Rib;Shin, Gang Wook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.10a / pp.501-503 / 2017
  • In the water sector, various data such as flow rate, pressure, water quality, and water level are collected throughout the processes of the water purification plant and the piping system. The collected data are stored in each water treatment plant's DB, combined in the regional DB, and finally stored in the database server at the head office of the Korea Water Resources Corporation. Various abnormal data can be generated when an instrument takes a measurement or when data are transmitted across these processes, and they can be classified into missing data and wrong data. The cause of each kind of abnormal data is different. The methods for detecting wrong data and missing data therefore differ, but the method for cleansing the data is the same. This study investigates a program that can automatically refine missing or wrong data by applying the deep learning LSTM (Long Short-Term Memory) algorithm.

Denoising Self-Attention Network for Mixed-type Data Imputation (혼합형 데이터 보간을 위한 디노이징 셀프 어텐션 네트워크)

  • Lee, Do-Hoon;Kim, Han-Joon;Chun, Joonghoon
    • The Journal of the Korea Contents Association / v.21 no.11 / pp.135-144 / 2021
  • Data-driven decision-making has recently become a key technology leading the data industry, and the machine learning behind it requires high-quality training datasets. However, real-world data contain missing values for various reasons, which degrades the performance of prediction models learned from the poor training data. To build high-performance models from real-world datasets, many studies on automatically imputing missing values in the initial training data have therefore been conducted. Many conventional machine learning-based imputation techniques involve very time-consuming and cumbersome work, because they apply only to numeric columns or create an individual predictive model for each column. This paper proposes a new data imputation technique called 'Denoising Self-Attention Network (DSAN)', which can be applied to mixed-type datasets containing both numerical and categorical columns. DSAN learns robust feature representation vectors by combining self-attention and denoising techniques, and can automatically impute multiple missing variables in parallel through multi-task learning. To verify the proposed technique, data imputation experiments were performed after arbitrarily generating missing values in several mixed-type training datasets. The validity of the technique is then shown by comparing the performance of binary classification models trained on the imputed data, together with the errors between the original and imputed values.
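
The evaluation protocol described at the end of this abstract (hide known values, impute them, then measure the error against the originals) can be sketched in Python. The sketch uses column-mean imputation as a stand-in baseline, not DSAN, and every name, the masking fraction, and the sample column are assumptions.

```python
import math
import random


def mask_values(column, fraction, seed=0):
    """Hide a random fraction of entries.

    Returns the masked copy and the sorted indices that were hidden."""
    rng = random.Random(seed)
    n_hidden = int(len(column) * fraction)
    hidden = sorted(rng.sample(range(len(column)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(column)]
    return masked, hidden


def mean_impute(column):
    """Baseline imputer: replace None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def masked_rmse(original, imputed, hidden):
    """RMSE computed only over the entries that were hidden."""
    squared = [(original[i] - imputed[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Made-up numeric column; 30% of it is hidden and then re-estimated.
column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
masked, hidden = mask_values(column, fraction=0.3, seed=1)
imputed = mean_impute(masked)
print(hidden, masked_rmse(column, imputed, hidden))
```

Because the hidden entries are known, any imputer can be swapped in for `mean_impute` and scored on the same mask, which makes competing methods directly comparable.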