• Title/Summary/Keyword: missing data (결측데이터)

Search Result 133, Processing Time 0.036 seconds

Performance Evaluation of an Imputation Method based on Generative Adversarial Networks for Electric Medical Record (전자의무기록 데이터에서의 적대적 생성 알고리즘 기반 결측값 대치 알고리즘 성능분석)

  • Jo, Yong-Yeon;Jeong, Min-Yeong;Hwangbo, Yul
    • Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.879-881 / 2019
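  • Large-scale data collected in clinical settings, such as electronic medical records (EMR), have great latent value for clinical interpretation and a wide range of uses, but they are difficult to analyze because their many missing values make them highly sparse. In particular, the missing values that arise while EMR data are collected are random and arbitrary, and they are a major factor in lowering analysis accuracy and degrading the performance of predictive models, so missing-value imputation is indispensable. Recently, algorithms based on deep learning have been emerging in place of the statistics-based imputation algorithms that were conventionally used. In this paper, we apply Generative Adversarial Imputation Nets, a recent missing-value imputation algorithm based on the Generative Adversarial Network, and analyze its performance on EMR data.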

Development of data processing method and system for huge Highway Data (대용량 교통 데이터의 자료처리 과정과 시스템의 개발)

  • Cheong, Sujeong;Song, Sookyung;Lee, Minsoo;Namgung, Sung
    • Proceedings of the Korea Information Processing Society Conference / 2007.11a / pp.295-297 / 2007
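  • Traffic information data such as traffic volume, occupancy, and speed, collected by traffic detector systems, go through data processing that consists of quality evaluation, error detection, and missing-value correction; after this preprocessing, researchers use the data for various purposes. Fast and accurate data processing and a more convenient and effective web UI are therefore very important. In this paper, we develop data-processing algorithms for the three stages of quality evaluation, error detection, and missing-value correction, and implement a web UI system that presents the data-processing workflow to users.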
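
The three processing stages named in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual rules, so everything here is an assumption made for illustration: the plausibility range `SPEED_MIN`/`SPEED_MAX`, the quality score (share of plausible readings), and linear interpolation as the correction method.

```python
from typing import List, Optional, Tuple

# Hypothetical plausibility limits for detector speed (km/h); not from the paper.
SPEED_MIN, SPEED_MAX = 0.0, 200.0

def flag_errors(speeds: List[Optional[float]]) -> List[Optional[float]]:
    """Error detection: turn missing or implausible readings into None."""
    return [s if s is not None and SPEED_MIN <= s <= SPEED_MAX else None
            for s in speeds]

def quality_score(cleaned: List[Optional[float]]) -> float:
    """Quality evaluation: share of readings that survived error detection."""
    if not cleaned:
        return 0.0
    return sum(s is not None for s in cleaned) / len(cleaned)

def fill_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Missing-value correction: linear interpolation between the nearest
    valid neighbours; gaps at the edges copy the nearest valid value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # nothing to interpolate from
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled

def process(raw: List[Optional[float]]) -> Tuple[float, List[Optional[float]]]:
    """Run the three stages and return (quality score, corrected series)."""
    cleaned = flag_errors(raw)
    return quality_score(cleaned), fill_missing(cleaned)
```

For example, `process([60.0, None, 80.0, 999.0, 70.0])` treats 999.0 as an error, reports the share of usable readings, and fills both gaps from their neighbours.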

A study on the imputation solution for missing speed data on UTIS by using adaptive k-NN algorithm (적응형 k-NN 기법을 이용한 UTIS 속도정보 결측값 보정처리에 관한 연구)

  • Kim, Eun-Jeong;Bae, Gwang-Soo;Ahn, Gye-Hyeong;Ki, Yong-Kul;Ahn, Yong-Ju
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.13 no.3 / pp.66-77 / 2014
  • UTIS (Urban Traffic Information System) directly collects link travel times in urban areas by using probe vehicles. It can therefore estimate link travel speed more accurately than other traffic detection systems. However, UTIS data include some missing values caused by a lack of probe vehicles and RSEs on the road network, system failures, and other factors. In this study, we propose a new model, based on the k-NN algorithm, for imputing missing data in order to provide more accurate travel time information. The new imputation model is an adaptive k-NN that can flexibly adjust the number of nearest neighbors (NN) depending on the distribution of candidate objects. The evaluation results indicate that the new model successfully imputed missing speed data and significantly reduced the imputation error compared with other models (ARIMA, etc.). We plan to use the new imputation model to improve the traffic information service by applying it at the UTIS Central Traffic Information Center.
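
The idea of letting the number of neighbors depend on how the candidates are distributed can be sketched as follows. This is not the authors' model: the abstract does not state the adaptation rule, so the distance-ratio cutoff (`ratio`), the bounds `k_min` and `k_max`, and the inverse-distance weighting are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

def adaptive_knn_impute(
    query: Sequence[float],
    candidates: List[Tuple[Sequence[float], float]],
    k_min: int = 1,
    k_max: int = 5,
    ratio: float = 1.5,
) -> Tuple[float, int]:
    """Impute a missing speed from historical (features, speed) candidates.

    Assumed adaptation rule: keep every candidate whose distance is within
    `ratio` times the nearest distance, bounded by k_min and k_max.
    Returns (imputed speed, number of neighbours actually used).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    ranked = sorted((math.dist(query, feats), speed) for feats, speed in candidates)
    cutoff = ranked[0][0] * ratio
    chosen = [pair for pair in ranked[:k_max] if pair[0] <= cutoff]
    if len(chosen) < k_min:
        chosen = ranked[:k_min]
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in chosen]
    estimate = sum(w * s for w, (_, s) in zip(weights, chosen)) / sum(weights)
    return estimate, len(chosen)
```

When the candidates cluster tightly around the query, more of them pass the cutoff and the estimate averages over a larger neighborhood; when one candidate is much closer than the rest, the estimate relies on fewer neighbors.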

A Case Study of Data Editing for the Korean Housing Price Survey (주택가격동향조사를 위한 데이터편집 사례연구)

  • Park, Jin-Woo;Park, Hyun-Joo;Kim, Jin-Eok
    • Survey Research / v.6 no.1 / pp.83-98 / 2005
  • Large-scale survey databases may contain erroneous or missing data. Incomplete or erroneous data may be produced during data collection or data capture. Since erroneous data can cause bias and inconsistency, data editing, the procedure for detecting and adjusting individual errors in data records, is very important work in statistical surveys. In this paper, we introduce an editing process for the housing price survey to encourage discussion of the topic. We explain how to decide on appropriate edit rules and show some related data. Furthermore, we describe input editing procedures that are appropriate for online surveys and how to find and eliminate erroneous data through output editing.

Predictive Optimization Adjusted With Pseudo Data From A Missing Data Imputation Technique (결측 데이터 보정법에 의한 의사 데이터로 조정된 예측 최적화 방법)

  • Kim, Jeong-Woo
    • Journal of the Korea Academia-Industrial cooperation Society / v.20 no.2 / pp.200-209 / 2019
  • When forecasting future values, a model estimated by minimizing training errors can yield test errors higher than the training errors. This is the over-fitting problem, caused by an increase in model complexity when the model focuses only on a given dataset. Some regularization and resampling methods have been introduced to reduce test errors by alleviating this problem, but they are designed for use with only the given dataset. In this paper, we propose a new optimization approach that reduces test errors by transforming a test error minimization problem into a training error minimization problem. To carry out this transformation, we needed data in addition to the given dataset, termed pseudo data. To make proper use of pseudo data, we used three types of missing data imputation techniques. As an optimization tool, we chose the least squares method and combined it with an extra pseudo data instance. Furthermore, we present numerical results supporting our proposed approach, which yielded lower test errors than the ordinary least squares method.
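
As a rough illustration of combining least squares with one extra pseudo data instance, the sketch below refits a simple regression line after appending a pseudo point whose response is imputed. The abstract does not name its three imputation techniques or give the exact formulation, so mean imputation and the helper names `fit_line` and `fit_with_pseudo` are assumptions, not the paper's method.

```python
from typing import List, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_with_pseudo(xs: List[float], ys: List[float], x_new: float) -> Tuple[float, float]:
    """Refit after adding one pseudo instance at x_new.

    The unknown response at x_new is treated as a missing value and filled
    by mean imputation (an assumption made for this sketch).
    """
    y_pseudo = sum(ys) / len(ys)
    return fit_line(xs + [x_new], ys + [y_pseudo])
```

With mean imputation the pseudo point pulls the fitted slope toward zero, which acts like a mild form of shrinkage; other imputation choices would adjust the fit differently.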

Development of Machine Learning Based Precipitation Imputation Method (머신러닝 기반의 강우추정 방법 개발)

  • Heechan Han;Changju Kim;Donghyun Kim
    • Journal of Wetlands Research / v.25 no.3 / pp.167-175 / 2023
  • Precipitation data are one of the essential input datasets used in various fields such as wetland management, hydrological simulation, and water resource management. To manage water resources efficiently using precipitation data, it is essential to secure as much data as possible by minimizing the missing rate. In addition, more efficient hydrological simulation is possible if precipitation data for ungauged areas are secured. However, missing precipitation data have been estimated mainly by statistical equations. The purpose of this study is to propose a new method to restore missing precipitation data using machine learning algorithms that can predict new data based on correlations between data. Moreover, the applicability of machine learning techniques for restoring missing precipitation data is evaluated against existing statistical methods. Two representative machine learning algorithms, Artificial Neural Network (ANN) and Random Forest (RF), were applied. In classifying the occurrence of precipitation, the RF algorithm was more accurate than the ANN algorithm: the F1-score and accuracy, which are evaluation indicators of the classification model, were 0.80 and 0.77 for RF and 0.76 and 0.71 for ANN. RF was also more accurate than ANN in estimating precipitation amounts. The RMSE of the RF and ANN algorithms was 2.8 mm/day and 2.9 mm/day, and the values were calculated as 0.68 and 0.73.
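
The evaluation indicators reported in this abstract (accuracy and F1-score for rain/no-rain classification, RMSE for precipitation amounts) follow standard definitions, sketched below in plain Python. The function names and the convention that label 1 means "precipitation occurred" are assumptions for illustration; the study's own evaluation code is not shown in the source.

```python
import math
from typing import Sequence

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of days whose rain/no-rain label was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(observed: Sequence[float], estimated: Sequence[float]) -> float:
    """Root mean squared error, in the unit of the inputs (e.g. mm/day)."""
    squared = [(o - e) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))
```

A higher F1-score and accuracy indicate better detection of wet days, while a lower RMSE indicates closer agreement between estimated and observed amounts.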

An Estimation of Link Travel Time by Using BMS Data (BMS 데이터를 활용한 링크단위 여행시간 산출방안에 관한 연구)

  • Jeon, Ok-Hee;Ahn, Gye-Hyeong;Hyun, Cheol-Seung;Hong, Kyung-Sik;Kim, Hyun-Ju;Lee, Choul-Ki
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.13 no.3 / pp.78-88 / 2014
  • UTIS currently collects and provides traffic information through 1,150 RSE units and about 51,000 OBE-equipped vehicles. To stabilize the UTIS service, the sources of traffic information used to improve its quality must be expanded, but sections with missing data remain. To overcome this problem, this study develops a model that estimates the link travel time of general vehicles from BMS data collected in real time by the BIS (Bus Information System) installed and operating in the capital area, and uses the model to provide information for sections with missing data. We selected sections in Suwon and Anyang, grouped by whether or not they have an exclusive lane, and estimated and verified the model against BMS data and UTIS traffic information. Cases 2, 4, 6, and 8 showed high agreement between the UTIS communication data and the estimated values, but in Cases 3 and 5 the error was too large for the estimates to replace the communication data in UTIS missing-data sections. A high-credibility model formula therefore needs to be applied, adjusted to the road management conditions and the situation of the target section.

Development of Water Velocity Data Preprocessing Method for PAVOs (PAVOs 활용을 위한 유속데이터 전처리 기법 개발)

  • Soyeon Lim;Youngmoo Yu;Sinjae Lee;Yeongil Lee
    • Proceedings of the Korea Water Resources Association Conference / 2023.05a / pp.85-85 / 2023
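  • Manual methods such as wading and cross-section measurement are used to measure discharge, but constraints such as night and holiday measurement and staff shortages limit their use for measuring high-stage floods. To address this, the Doppler-type ultrasonic velocity meter (Acoustic Doppler Velocity Meter, ADVM) and the automatic velocity observation system (Portable Automatic Velocity Observation System; PAVOs) have been proposed. These methods measure velocity in real time through devices installed on bridges, so they are free of spatiotemporal constraints and can be useful for flood management. Velocity data measured in real time contain erroneous and missing values. For ADVM, preprocessing methods such as the use of the stage-discharge relationship are available, but research on preprocessing PAVOs data, which come from a microwave surface velocity meter, is lacking. This study therefore developed a pre-processing procedure for velocity data measured in real time by PAVOs. PAVOs data consist of 10 velocity values measured at once every 5 minutes and are non-stationary. As preprocessing, erroneous and missing values were handled and interpolation was applied, after which the actual velocity was determined from among the 10 values and denoising was performed. The procedure was applied to velocity data measured at Hongcheon Bridge on the Hongcheon River in Gangwon-do. As a result, a consistent trend could be identified in the rising and falling limbs of the data. When the discharge estimated from these data was compared with the discharge calculated from the relationship with the measurement-based mean velocity, the deviation rate was low. With the preprocessed real-time velocity data, flood discharge could be estimated when the peak stage occurs. In addition, stage-discharge rating curves, which change with rainfall or river works, could be developed in real time, which could play a major role in effective flood management.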
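
A minimal sketch of this kind of preprocessing chain is given below, assuming each 5-minute record holds up to 10 velocity readings. The abstract does not specify how errors are detected, how the actual velocity is chosen among the 10 values, or which denoising filter is used, so the plausibility range `V_MIN`/`V_MAX`, the median as the representative value, and the 3-point moving average are all assumptions. The sketch also picks the representative value before interpolating over time, a simplification of the order described above.

```python
from statistics import median
from typing import List, Optional

# Hypothetical plausible range for surface velocity (m/s); not from the study.
V_MIN, V_MAX = 0.0, 10.0

def representative(record: List[Optional[float]]) -> Optional[float]:
    """Drop missing or implausible readings, then take the median of the rest."""
    valid = [v for v in record if v is not None and V_MIN <= v <= V_MAX]
    return median(valid) if valid else None

def interpolate(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill gaps linearly over time; edge gaps copy the nearest valid value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out

def moving_average(series: List[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    return [sum(series[max(0, i - half): i + half + 1])
            / len(series[max(0, i - half): i + half + 1])
            for i in range(len(series))]

def preprocess(records: List[List[Optional[float]]]) -> List[Optional[float]]:
    """Records of raw readings -> one denoised velocity per time step."""
    series = interpolate([representative(r) for r in records])
    if any(v is None for v in series):
        return series  # no valid reading at all; nothing to smooth
    return moving_average(series)
```

Each element of the result corresponds to one 5-minute time step and could then feed a discharge calculation.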

Personalized Data Restoration Algorithm to Improve Wearable Device Service (웨어러블 디바이스 서비스 향상을 위한 개인 맞춤형 데이터 복원 알고리즘)

  • Kikun Park;Hye-Rim Bae
    • The Journal of Bigdata / v.6 no.2 / pp.51-60 / 2021
  • The market size of wearable devices is growing rapidly every year, and manufacturers around the world are introducing products that utilize their unique characteristics to keep up with the demand. Among them, smart watches are wearable devices with a very high share in sales, and they provide a variety of services to users by using information collected in real-time. The quality of service depends on the accuracy of the data collected by the smart watch, but data measurement may not be possible depending on the situation. This paper introduces a method to restore data that a smart watch could not collect. It deals with the similarity calculation method of trajectory information measured over time for data restoration and introduces a procedure for restoring missing sections according to the similarity. To prove the performance of the proposed methodology, a comparative experiment with a machine learning algorithm was conducted. Finally, the expected effects of this study and future research directions are discussed.
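
Restoring a missing section from the most similar trajectory can be sketched as below. The abstract does not describe the paper's similarity measure or restoration procedure, so the mean absolute difference over observed time steps and the direct copying of reference values are assumptions, and `restore` is a hypothetical helper name.

```python
from typing import List, Optional, Sequence

def mean_abs_diff(target: Sequence[Optional[float]], ref: Sequence[float]) -> float:
    """Dissimilarity measured only where the target was actually observed."""
    pairs = [(t, r) for t, r in zip(target, ref) if t is not None]
    if not pairs:
        return float("inf")
    return sum(abs(t - r) for t, r in pairs) / len(pairs)

def restore(target: Sequence[Optional[float]],
            references: List[Sequence[float]]) -> List[float]:
    """Fill None entries of `target` from the most similar reference series.

    All series are assumed to be aligned and of equal length, for example
    heart-rate samples taken at the same times on different days.
    """
    best = min(references, key=lambda ref: mean_abs_diff(target, ref))
    return [r if t is None else t for t, r in zip(target, best)]
```

For instance, a series with a two-sample gap is completed from whichever stored series agrees best with the samples that were recorded.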

An Empirical Comparison of Bagging, Boosting and Support Vector Machine Classifiers in Data Mining (데이터 마이닝에서 배깅, 부스팅, SVM 분류 알고리즘 비교 분석)

  • Lee Yung-Seop;Oh Hyun-Joung;Kim Mee-Kyung
    • The Korean Journal of Applied Statistics / v.18 no.2 / pp.343-354 / 2005
  • The goal of this paper is to compare classification performance and to find a better classifier based on the characteristics of the data. The compared methods are CART with two ensemble algorithms, bagging and boosting, and SVM. In an empirical study of twenty-eight data sets, we found that SVM has a smaller error rate than the other methods on most of the data sets. When bagging, boosting, and SVM are compared based on the characteristics of the data, the SVM algorithm is suitable for data with a small number of observations and no missing values. On the other hand, the boosting algorithm is suitable for data with a large number of observations, and the bagging algorithm is suitable for data with missing values.