• Title/Summary/Keyword: 결측데이터 (missing data)


A Study on Shape Variability in Canonical Correlation Biplot with Missing Values (결측값이 있는 정준상관 행렬도의 형상변동 연구)

  • Hong, Hyun-Uk;Choi, Yong-Seok;Shin, Sang-Min;Ka, Chang-Wan
    • The Korean Journal of Applied Statistics / v.23 no.5 / pp.955-966 / 2010
  • The canonical correlation biplot gives a graphical description of a data matrix that captures the association between two sets of variables. It is useful for detecting patterns and for displaying results found by more formal methods of analysis. However, when some values are missing from the data, most biplots are not directly applicable. To solve this problem, we estimate the missing data using median, mean, EM algorithm, and MCMC imputation at several missing rates. Even after the missing values are estimated, the resulting biplots differ in shape depending on the imputation method and the missing rate. We therefore use the RMS (root mean square) measure proposed by Shin et al. (2007) and the PS (Procrustes statistic) to measure and compare the shape variability between the original biplots and the estimated biplots.
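
The mean and median imputation steps named in this abstract, and a root-mean-square comparison against the original values, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the RMS here is a plain element-wise difference on one column rather than the biplot-coordinate measure of Shin et al. (2007), whose definition the abstract does not give, and the function names and sample values are hypothetical.

```python
import math
import statistics


def impute_column(values, method="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def rms_difference(original, estimated):
    """Element-wise root mean square difference between two equal-length lists."""
    squared = [(o - e) ** 2 for o, e in zip(original, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical column with one outlier; the second value is treated as missing.
original = [2.0, 4.0, 6.0, 8.0, 30.0]
with_missing = [2.0, None, 6.0, 8.0, 30.0]

mean_filled = impute_column(with_missing, "mean")
median_filled = impute_column(with_missing, "median")

rms_mean = rms_difference(original, mean_filled)
rms_median = rms_difference(original, median_filled)
```

In this toy column the outlier pulls the mean upward, so the median fill lands closer to the true value and yields the smaller RMS.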

A Study on the Index Estimation of Missing Real Estate Transaction Cases Using Machine Learning (머신러닝을 활용한 결측 부동산 매매 지수의 추정에 대한 연구)

  • Kim, Kyung-Min;Kim, Kyuseok;Nam, Daisik
    • Journal of the Economic Geographical Society of Korea / v.25 no.1 / pp.171-181 / 2022
  • The real estate price index plays a key role as quantitative data in real estate market analysis. International organizations including the OECD publish real estate price indexes by country, and the Korea Real Estate Board announces metropolitan-level and municipal-level indexes. However, when the index is computed for a spatial unit smaller than the metropolitan or municipal level, missing values become a problem. As the spatial scope narrows, some unit periods have few or no transactions, which makes index calculation difficult or even impossible. This study proposes a supervised machine learning model to compensate for missing values that arise when no transactions occur in a specific area and period. We verify the accuracy of the proposed models in predicting both existing values and missing values.

A study on the factors influencing the data collection performance of smart buoys (스마트 항로표지의 데이터 수집 성능에 영향을 미치는 요인에 관한 연구)

  • Ho-Joon Kim;Min-Kyu Kim;Nam-Yong Lee;Chul-Soo Kim;Sangmun Shin;Se-woong Oh;Jin-Hong Yang
    • Proceedings of the Korean Institute of Navigation and Port Research Conference / 2021.11a / pp.60-62 / 2021
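  • Aids to navigation are installed and operated to collect information on maritime conditions and to promote the safe navigation of ships. When data managed by individual regional offices are to be used as big data, the quality of the collected data must be evaluated. This paper focuses on missing information in the collected aids-to-navigation data and seeks the main causes of failures in data collection. Analysis of the collected data showed that the rate of missing data was high on days with bad weather and on days when the buoy voltage dropped. By comparing weather conditions, buoy voltage status, and the number of data records collected, we confirmed that bad weather may have affected data collection.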


Comparison of Data Reconstruction Methods for Missing Value Imputation (결측값 대체를 위한 데이터 재현 기법 비교)

  • Cheongho Kim;Kee-Hoon Kang
    • The Journal of the Convergence on Culture Technology / v.10 no.1 / pp.603-608 / 2024
  • Nonresponse and missing values are caused by sample dropout and by respondents declining to answer survey questions. In such cases, information may be lost and inference may be biased, so missing values need to be replaced with appropriate values. In this paper, we compare several imputation methods: mean, linear regression, random forest, K-nearest neighbor, and the deep-learning-based autoencoder and denoising autoencoder. We explain each imputation method and compare the methods using continuous simulation data and real data. The comparison confirms that, in most cases, random forest imputation and denoising autoencoder imputation perform better than the others.
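
Of the methods compared in this abstract, linear regression imputation is the simplest to show in a few lines. The sketch below fits an ordinary least-squares line on the complete pairs and predicts the missing response values. It assumes a single fully observed predictor; the function names and data are hypothetical and are not taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_impute(xs, ys):
    """Fill None entries of ys with predictions from a line fit on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


# Hypothetical data lying exactly on y = 2x + 1, with one response missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, None, 9.0, 11.0]

filled = regression_impute(xs, ys)
```

Because the fitted value is deterministic, this kind of imputation understates the variability of the imputed variable, which is one reason it is usually compared against methods such as random forest or K-nearest neighbor.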

Correction of Drifter Data Using Recurrent Neural Networks (순환신경망을 이용한 뜰개의 관측 데이터 보정)

  • Kim, Gyoung-Do;Kim, Yong-Hyuk
    • Journal of the Korea Convergence Society / v.9 no.3 / pp.15-21 / 2018
  • An ocean drifter is a device that observes ocean weather while floating on the sea surface. Data observed by drifters are used for ocean weather prediction and oil spill applications. The observed data may contain incorrect or missing values, which can lower accuracy when the data are used. In this paper, we propose a data correction model using recurrent neural networks. We corrected data collected from 7 drifters in 2015 and 8 drifters in 2016, and conducted drifter movement prediction experiments to reflect the correction results. Experimental results showed that the observed data were corrected by 13.9% and that the performance of the prediction model improved by 1.4%.

Proposal to Supplement the Missing Values of Air Pollution Levels in Meteorological Dataset (기상 데이터에서 대기 오염도 요소의 결측치 보완 기법 제안)

  • Jo, Dong-Chol;Hahn, Hee-Il
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.21 no.1 / pp.181-187 / 2021
  • Recently, various air pollution factors have been measured and analyzed to reduce the damage caused by air pollution. In this process, many missing values occur for various reasons. Compensating for them normally requires a vast amount of training data. This paper proposes a statistical technique that effectively compensates for missing values generated while measuring ozone, carbon dioxide, and ultra-fine dust, using only a small amount of training data. The proposed algorithm first extracts a group of meteorological data expected to contribute to the correction of missing values, based on statistical information such as the correlation between the meteorological data and the air pollution factors, p-values, and so on. It then compensates for the missing values by analyzing the extracted data. To confirm the performance of the proposed algorithm, we analyze its characteristics through various experiments and compare its performance with that of well-known representative algorithms.
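
The first step this abstract describes, picking meteorological variables that are correlated with a pollution factor, can be sketched as below. This is only an illustration under assumptions: it uses the Pearson correlation with a hypothetical cutoff of 0.5, it omits the p-value screening the abstract also mentions, and the variable names and readings are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def select_predictors(target, candidates, threshold=0.5):
    """Keep candidates whose |correlation| with the target reaches the threshold.

    Only time steps where both series are observed (not None) are used.
    """
    selected = {}
    for name, series in candidates.items():
        pairs = [(x, y) for x, y in zip(series, target)
                 if x is not None and y is not None]
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(r) >= threshold:
            selected[name] = r
    return selected


# Hypothetical readings; the third ozone value is missing.
ozone = [10.0, 20.0, None, 40.0, 50.0]
candidates = {
    "temperature": [1.0, 2.0, 3.0, 4.0, 5.0],
    "humidity": [5.0, 1.0, 4.0, 2.0, 3.0],
}

selected = select_predictors(ozone, candidates)
```

The selected variables could then feed any imputation model for the missing ozone value; how the paper itself uses them is not specified in the abstract.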

Comparison of Missing Imputation Methods in Fine Dust Data (미세먼지 자료에서의 결측치 대체 방법 비교)

  • Kim, YeonJin;Park, HeonJin
    • The Journal of Bigdata / v.4 no.2 / pp.105-114 / 2019
  • Missing value replacement is one of the major issues in data analysis. If the analysis proceeds while ignoring missing values, bias can occur and the estimates can be incorrect. In this paper, we look for an appropriate way to replace missing values in weather data. To this end, we ran and compared simulations for various situations using existing R-based methods such as MICE and MissForest, along with time series-based models. Comparing the results for each variable, we found that the Kalman filter with the auto ARIMA model from the imputeTS package and the MissForest model gave good results on the weather data.

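
To show what time series-based imputation means in the abstract above, here is a minimal sketch that fills gaps by linear interpolation between the nearest observed neighbors. This is a much simpler stand-in, not the Kalman filter or ARIMA approach the paper evaluates; the function name and the PM10 readings are hypothetical.

```python
def interpolate_gaps(series):
    """Fill None entries by linear interpolation between observed neighbors.

    Gaps at the start or end are filled with the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            step = (series[right] - series[left]) / (right - left)
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical hourly PM10 readings with gaps at the edges and in the middle.
pm10 = [None, 30.0, None, None, 60.0, None]
filled = interpolate_gaps(pm10)
```

Unlike mean imputation, this uses the time ordering of the observations, which is the property that motivates comparing time series models with MICE and MissForest.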

A Missing Data Imputation by Combining K Nearest Neighbor with Maximum Likelihood Estimation for Numerical Software Project Data (K-NN과 최대 우도 추정법을 결합한 소프트웨어 프로젝트 수치 데이터용 결측값 대치법)

  • Lee, Dong-Ho;Yoon, Kyung-A;Bae, Doo-Hwan
    • Journal of KIISE:Software and Applications / v.36 no.4 / pp.273-282 / 2009
  • Missing data is a common problem when building analysis or prediction models from software project data. For small software project data sets, imputation methods are known to handle missing data more effectively than deletion methods. K nearest neighbor imputation is a suitable imputation method for software project data, but it cannot use the non-missing information of incomplete project instances. In this paper, we propose an approach to missing data imputation for numerical software project data that combines K nearest neighbor with maximum likelihood estimation; we also extend the average absolute error measure with normalization for accurate evaluation. Our approach overcomes this limitation of K nearest neighbor imputation and performs better on our real data sets.
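
The limitation this abstract points out is easiest to see in code. The sketch below is plain K nearest neighbor imputation, not the proposed combination with maximum likelihood estimation: only complete rows can act as donors, so the observed values of other incomplete rows are never used. The function name and the project figures are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries using the mean of the k nearest complete rows.

    Distance is Euclidean over the columns observed in the incomplete row.
    Rows that are themselves incomplete are never used as donors.
    """
    donors = [r for r in rows if None not in r]
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(donor):
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        neighbors = sorted(donors, key=distance)[:k]
        result.append([
            sum(d[j] for d in neighbors) / len(neighbors) if v is None else v
            for j, v in enumerate(row)
        ])
    return result


# Hypothetical projects: [team size, effort]; the last effort value is missing.
projects = [
    [10.0, 100.0],
    [12.0, 120.0],
    [50.0, 500.0],
    [11.0, None],
]

filled = knn_impute(projects, k=2)
```

With few complete rows, as is common in small software project data sets, the donor pool shrinks quickly, which is the gap the combined approach in the paper is meant to address.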

Development of Performance to Predict the Prognosis of Korean Patients with Acute Myocardial Infarction by Data Transformation for Naïve Bayes Method (나이브 베이지안 방법을 위한 데이터 변환법으로 한국인 급성 심근경색증 환자의 예후를 예측하는 성능의 향상)

  • Cho, Sun Ho;Kim, Jeong-su;Kwon, Hyuk-Chul
    • Proceedings of the Korea Information Processing Society Conference / 2014.11a / pp.868-871 / 2014
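  • Mortality from acute myocardial infarction is currently high in Korea, and because it is a critical disease that requires rapid decision-making from onset to treatment, research on acute myocardial infarction suited to Koreans is very important. Using Korean acute myocardial infarction registry data and the naïve Bayes method, a type of machine learning method, this study aims to predict the prognosis of patients with acute myocardial infarction and proposes a data transformation method that reflects the characteristics of medical data. For the death value, which carries the more important meaning in the target class, each value is classified as a nominal value, a numeric value, or a missing value, and is transformed by computing a probability accordingly. The experiments showed that performance was best when missing values were replaced with the mean of the values present for each feature, improving precision by 5.4% and recall by 7.0% over the existing method. We therefore judge that the proposed method contributed to improving the prediction performance of the naïve Bayes method. In future work, we plan to evaluate the data transformation method with various machine learning methods and to test it on other target classes.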