• Title/Summary/Keyword: Outlier & Missing Value

Search Result 13, Processing Time 0.026 seconds

Adjustment System for Outlier and Missing Value using Data Storage (데이터 저장소를 이용한 이상치 및 결측치 보정 시스템)

  • Gwangho Kim;Neunghoe Kim
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.23 no.5
    • /
    • pp.47-53
    • /
    • 2023
  • With the advent of the 4th Industrial Revolution, diverse and a large amount of data has been accumulated now. The agricultural community has also collected environmental data that affects the growth of crops in smart farms or open fields with sensors. Environmental data has different features depending on where and when they are measured. Studies have been conducted using collected agricultural data to predict growth and yield with statistics and artificial intelligence. The results of these studies vary greatly depending on the data on which they are based. So, studies to enhance data quality have also been continuously conducted for performance improvement. A lot of data is required for high performance, but if there are outlier or missing values in the data, it can greatly affect the results even if the amount is sufficient. So, adjustment of outlier and missing values is essential in the data preprocessing. Therefore, this paper integrates data collected from actual farms and proposes a adjustment system for outlier and missing values based on it.

On the Use of Weighted k-Nearest Neighbors for Missing Value Imputation (Weighted k-Nearest Neighbors를 이용한 결측치 대치)

  • Lim, Chanhui;Kim, Dongjae
    • The Korean Journal of Applied Statistics
    • /
    • v.28 no.1
    • /
    • pp.23-31
    • /
    • 2015
  • A conventional missing value problem in the statistical analysis k-Nearest Neighbor(KNN) method are used for a simple imputation method. When one of the k-nearest neighbors is an extreme value or outlier, the KNN method can create a bias. In this paper, we propose a Weighted k-Nearest Neighbors(WKNN) imputation method that can supplement KNN's faults. A Monte-Carlo simulation study is also adapted to compare the WKNN method and KNN method using real data set.

Deduction of Data Quality Control Strategy for High Density Rain Gauge Network in Seoul Area (서울시 고밀도 지상강우자료 품질관리방안 도출)

  • Yoon, Seongsim;Lee, Byongju;Choi, Youngjean
    • Journal of Korea Water Resources Association
    • /
    • v.48 no.4
    • /
    • pp.245-255
    • /
    • 2015
  • This study used high density network of integrated meteorological sensor, which are operated by SK Planet, with KMA weather stations to estimate the quantitative precipitation field in Seoul area. We introduced SK Planet network and analyzed quality of the observed data for 3 months data from 1 July to 30 September 2013. As the quality analysis result, we checked most SK Planet stations observed similar with previous KMA stations. We developed the real-time quality check and adjustment method to reduce the error effect for hydrological application by missing and outlier value and we confirmed the developed method can be corrected the missing and outlier value. Through this method, we used the 190 stations(KMA 34 stations, SK Planet 156 stations) that missing ratio is less than 20% and the effect of the outlier was the smallest for quantitative precipitation estimation. Moreover, we evaluated reproducibility of rainfall field high density rain gauge network has $3km^2$/gauge. As the result, the spatial relative frequency of rainfall field using SK Planet and KMA stations is similar with radar rainfall field. And, it supplement the blank of KMA observation network. Especially, through this research we will take advantage of the density of the network to estimate rainfall field which can be considered as a very good approximation of the true value.

Method of Processing the Outliers and Missing Values of Field Data to Improve RAM Analysis Accuracy (RAM 분석 정확도 향상을 위한 야전운용 데이터의 이상값과 결측값 처리 방안)

  • Kim, In Seok;Jung, Won
    • Journal of Applied Reliability
    • /
    • v.17 no.3
    • /
    • pp.264-271
    • /
    • 2017
  • Purpose: Field operation data contains missing values or outliers due to various causes of the data collection process, so caution is required when utilizing RAM analysis results by field operation data. The purpose of this study is to present a method to minimize the RAM analysis error of the field data to improve the accuracy. Methods: Statistical methods are presented for processing of the outliers and the missing values of the field operating data, and after analyzing the RAM, the differences between before and after applying the technique are discussed. Results: The availability is estimated to be lower by 6.8 to 23.5% than that before processing, and it is judged that the processing of the missing values and outliers greatly affect the RAM analysis result. Conclusion: RAM analysis of OO weapon system was performed and suggestions for improvement of RAM analysis were presented through comparison with the new and current method. Data analysis results without appropriate treatment of error values may result in incorrect conclusions leading to inappropriate decisions and actions.

On the use of weighted adaptive nearest neighbors for missing value imputation (가중 적응 최근접 이웃을 이용한 결측치 대치)

  • Yum, Yunjin;Kim, Dongjae
    • The Korean Journal of Applied Statistics
    • /
    • v.31 no.4
    • /
    • pp.507-516
    • /
    • 2018
  • Widely used among the various single imputation methods is k-nearest neighbors (KNN) imputation due to its robustness even when a parametric model such as multivariate normality is not satisfied. We propose a weighted adaptive nearest neighbors imputation method that combines the adaptive nearest neighbors imputation method that accounts for the local features of the data in the KNN imputation method and weighted k-nearest neighbors method that are less sensitive to extreme value or outlier among k-nearest neighbors. We conducted a Monte Carlo simulation study to compare the performance of the proposed imputation method with previous imputation methods.

Study on Weather Data Interpolation of a Buoy Based on Machine Learning Techniques (기계 학습을 이용한 항로표지 기상 자료의 보간에 관한 연구)

  • Seong-Hun Jeong;Jun-Ik Ma;Seong-Hyun Jo;Gi-Ryun Lim;Jun-Woo Lee;Jun-Hee Han
    • Proceedings of the Korean Institute of Navigation and Port Research Conference
    • /
    • 2022.06a
    • /
    • pp.72-74
    • /
    • 2022
  • Several types of data are collected from buoy due to the development of hardware technology.. However, the collected data are difficult to use due to errors including missing values and outliers depending on mechanical faults and meteorological environment. Therefore, in this study, linear interpolation is performed by adding the missing time data to enable machine learning to the insufficient meteorological data. After the linear interpolation, XGBoost and KNN-regressor, are used to forecast error data and suggested model is evaluated by using real-world data of a buoy.

  • PDF

Robust multiple imputation method for missings with boundary and outliers (한계와 이상치가 있는 결측치의 로버스트 다중대체 방법)

  • Park, Yousung;Oh, Do Young;Kwon, Tae Yeon
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.6
    • /
    • pp.889-898
    • /
    • 2019
  • The problem of missing value imputation for variables in surveys that include item missing becomes complicated if outliers and logical boundary conditions between other survey items cannot be ignored. If there are outliers and boundaries in a variable including missing values, imputed values based on previous regression-based imputation methods are likely to be biased and not meet boundary conditions. In this paper, we approach these difficulties in imputation by combining various robust regression models and multiple imputation methods. Through a simulation study on various scenarios of outliers and boundaries, we find and discuss the optimal combination of robust regression and multiple imputation method.

A Maximum Power Demand Prediction Method by Average Filter Combination (평균필터 조합을 통한 최대수요전력 예측기법)

  • Yu, Chan-Jik;Kim, Jae-Sung;Roh, Kyung-Woo;Cho, Wan-Sup
    • The Journal of Bigdata
    • /
    • v.5 no.1
    • /
    • pp.227-239
    • /
    • 2020
  • This paper introduces a method for predicting the maximum power demand despite communication errors in industrial sites. Due to the recent policy of de-nuclearization in Korea, the price of electricity is inevitable, and the amount of electricity used and maximum load management for the management of power demand are becoming important issues. Accordingly, it is important to predict and manage peak power. However, problems such as loss and modulation of measured power data occur at industrial sites due to noise generated by various facilities and sensors. It is difficult to predict the exact value when measured effective power data are lost. The study presents a model for predicting and correcting anomalies and missing values when measured effective power data are lost. The models used in this study are expected to be useful in predicting peak power demand in the event of communication errors at industrial sites.

A Big Data-Driven Business Data Analysis System: Applications of Artificial Intelligence Techniques in Problem Solving

  • Donggeun Kim;Sangjin Kim;Juyong Ko;Jai Woo Lee
    • The Journal of Bigdata
    • /
    • v.8 no.1
    • /
    • pp.35-47
    • /
    • 2023
  • It is crucial to develop effective and efficient big data analytics methods for problem-solving in the field of business in order to improve the performance of data analytics and reduce costs and risks in the analysis of customer data. In this study, a big data-driven data analysis system using artificial intelligence techniques is designed to increase the accuracy of big data analytics along with the rapid growth of the field of data science. We present a key direction for big data analysis systems through missing value imputation, outlier detection, feature extraction, utilization of explainable artificial intelligence techniques, and exploratory data analysis. Our objective is not only to develop big data analysis techniques with complex structures of business data but also to bridge the gap between the theoretical ideas in artificial intelligence methods and the analysis of real-world data in the field of business.

Design of Heuristic Decision Tree (HDT) Using Human Knowledge (인간 지식을 이용한 경험적 의사결정트리의 설계)

  • Yoon, Tae-Tok;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.19 no.4
    • /
    • pp.525-531
    • /
    • 2009
  • Data mining is the process of extracting hidden patterns from collected data. At this time, for collected data which take important role as the basic information for prediction and recommendation, the process to discriminate incorrect data in order to enhance the performance of analysis result, is needed. The existing methods to discriminate unexpected data from collected data, mainly relies on methods which are based on statistics or simple distance between data. However, for these methods, the problematic point that even meaningful data could be excluded from analysis due that the environment and characteristic of the relevant data are not considered, exists. This study proposes a method to endow human heuristic knowledge with weight value through the comparison between collected data and human heuristic knowledge, and to use the value for creating a decision tree. The data discrimination by the method proposed is more credible as human knowledge is reflected in the created tree. The validity of the proposed method is verified through an experiment.