• Title/Summary/Keyword: Missing Values

Search Result 440, Processing Time 0.025 seconds

Multiple imputation for competing risks survival data via pseudo-observations

  • Han, Seungbong;Andrei, Adin-Cristian;Tsui, Kam-Wah
    • Communications for Statistical Applications and Methods
    • /
    • v.25 no.4
    • /
    • pp.385-396
    • /
    • 2018
  • Competing risks are commonly encountered in biomedical research. Regression models for competing risks data can be developed based on data routinely collected in hospitals or general practices. However, these data sets usually contain the covariate missing values. To overcome this problem, multiple imputation is often used to fit regression models under a MAR assumption. Here, we introduce a multivariate imputation in a chained equations algorithm to deal with competing risks survival data. Using pseudo-observations, we make use of the available outcome information by accommodating the competing risk structure. Lastly, we illustrate the practical advantages of our approach using simulations and two data examples from a coronary artery disease data and hepatocellular carcinoma data.

Genetic Algorithm Based Attribute Value Taxonomy Generation for Learning Classifiers with Missing Data (유전자 알고리즘 기반의 불완전 데이터 학습을 위한 속성값계층구조의 생성)

  • Joo Jin-U;Yang Ji-Hoon
    • The KIPS Transactions:PartB
    • /
    • v.13B no.2 s.105
    • /
    • pp.133-138
    • /
    • 2006
  • Learning with Attribute Value Taxonomies (AVT) has shown that it is possible to construct accurate, compact and robust classifiers from a partially missing dataset (dataset that contains attribute values specified with different level of precision). Yet, in many cases AVTs are generated from experts or people with specialized knowledge in their domain. Unfortunately these user-provided AVTs can be time-consuming to construct and misguided during the AVT building process. Moreover experts are occasionally unavailable to provide an AVT for a particular domain. Against these backgrounds, this paper introduces an AVT generating method called GA-AVT-Learner, which finds a near optimal AVT with a given training dataset using a genetic algorithm. This paper conducted experiments generating AVTs through GA-AVT-Learner with a variety of real world datasets. We compared these AVTs with other types of AVTs such as HAC-AVTs and user-provided AVTs. Through the experiments we have proved that GA-AVT-Learner provides AVTs that yield more accurate and compact classifiers and improve performance in learning missing data.

Missing Hydrological Data Estimation using Neural Network and Real Time Data Reconciliation (신경망을 이용한 결측 수문자료 추정 및 실시간 자료 보정)

  • Oh, Jae-Woo;Park, Jin-Hyeog;Kim, Young-Kuk
    • Journal of Korea Water Resources Association
    • /
    • v.41 no.10
    • /
    • pp.1059-1065
    • /
    • 2008
  • Rainfall data is the most basic input data to analyze the hydrological phenomena and can be missing due to various reasons. In this research, a neural network based model to estimate missing rainfall data as approximate values was developed for 12 rainfall stations in the Soyang river basin to improve existing methods. This approach using neural network has shown to be useful in many applications to deal with complicated natural phenomena and displayed better results compared to the popular offline estimating methods, such as RDS(Reciprocal Distance Squared) method and AMM(Arithmetic Mean Method). Additionally, we proposed automated data reconciliation systems composed of a neural network learning processer to be capable of real-time reconciliation to transmit reliable hydrological data online.

Outlier Filtering and Missing Data Imputation Algorithm using TCS Data (TCS데이터를 이용한 이상치제거 및 결측보정 알고리즘 개발)

  • Do, Myung-Sik;Lee, Hyang-Mee;NamKoong, Seong
    • Journal of Korean Society of Transportation
    • /
    • v.26 no.4
    • /
    • pp.241-250
    • /
    • 2008
  • With the ever-growing amount of traffic, there is an increasing need for good quality travel time information. Various existing outlier filtering and missing data imputation algorithms using AVI data for interrupted and uninterrupted traffic flow have been proposed. This paper is devoted to development of an outlier filtering and missing data imputation algorithm by using Toll Collection System (TCS) data. TCS travel time data collected from August to September 2007 were employed. Travel time data from TCS are made out of records of every passing vehicle; these data have potential for providing real-time travel time information. However, the authors found that as the distance between entry tollgates and exit tollgates increases, the variance of travel time also increases. Also, time gaps appeared in the case of long distances between tollgates. Finally, the authors propose a new method for making representative values after removal of abnormal and "noise" data and after analyzing existing methods. The proposed algorithm is effective.

Default Voting using User Coefficient of Variance in Collaborative Filtering System (협력적 여과 시스템에서 사용자 변동 계수를 이용한 기본 평가간 예측)

  • Ko, Su-Jeong
    • Journal of KIISE:Software and Applications
    • /
    • v.32 no.11
    • /
    • pp.1111-1120
    • /
    • 2005
  • In collaborative filtering systems most users do not rate preferences; so User-Item matrix shows great sparsity because it has missing values for items not rated by users. Generally, the systems predict the preferences of an active user based on the preferences of a group of users. However, default voting methods predict all missing values for all users in User-Item matrix. One of the most common methods predicting default voting values tried two different approaches using the average rating for a user or using the average rating for an item. However, there is a problem that they did not consider the characteristics of items, users, and the distribution of data set. We replace the missing values in the User-Item matrix by the default noting method using user coefficient of variance. We select the threshold of user coefficient of variance by using equations automatically and determine when to shift between the user averages and item averages according to the threshold. However, there are not always regular relations between the averages and the thresholds of user coefficient of variances in datasets. It is caused that the distribution information of user coefficient of variances in datasets affects the threshold of user coefficient of variance as well as their average. We decide the threshold of user coefficient of valiance by combining them. We evaluate our method on MovieLens dataset of user ratings for movies and show that it outperforms previously default voting methods.

A comparison of imputation methods using nonlinear models (비선형 모델을 이용한 결측 대체 방법 비교)

  • Kim, Hyein;Song, Juwon
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.4
    • /
    • pp.543-559
    • /
    • 2019
  • Data often include missing values due to various reasons. If the missing data mechanism is not MCAR, analysis based on fully observed cases may an estimation cause bias and decrease the precision of the estimate since partially observed cases are excluded. Especially when data include many variables, missing values cause more serious problems. Many imputation techniques are suggested to overcome this difficulty. However, imputation methods using parametric models may not fit well with real data which do not satisfy model assumptions. In this study, we review imputation methods using nonlinear models such as kernel, resampling, and spline methods which are robust on model assumptions. In addition, we suggest utilizing imputation classes to improve imputation accuracy or adding random errors to correctly estimate the variance of the estimates in nonlinear imputation models. Performances of imputation methods using nonlinear models are compared under various simulated data settings. Simulation results indicate that the performances of imputation methods are different as data settings change. However, imputation based on the kernel regression or the penalized spline performs better in most situations. Utilizing imputation classes or adding random errors improves the performance of imputation methods using nonlinear models.

Development of Machine Learning Based Precipitation Imputation Method (머신러닝 기반의 강우추정 방법 개발)

  • Heechan Han;Changju Kim;Donghyun Kim
    • Journal of Wetlands Research
    • /
    • v.25 no.3
    • /
    • pp.167-175
    • /
    • 2023
  • Precipitation data is one of the essential input datasets used in various fields such as wetland management, hydrological simulation, and water resource management. In order to efficiently manage water resources using precipitation data, it is essential to secure as much data as possible by minimizing the missing rate of data. In addition, more efficient hydrological simulation is possible if precipitation data for ungauged areas are secured. However, missing precipitation data have been estimated mainly by statistical equations. The purpose of this study is to propose a new method to restore missing precipitation data using machine learning algorithms that can predict new data based on correlations between data. Moreover, compared to existing statistical methods, the applicability of machine learning techniques for restoring missing precipitation data is evaluated. Representative machine learning algorithms, Artificial Neural Network (ANN) and Random Forest (RF), were applied. For the performance of classifying the occurrence of precipitation, the RF algorithm has higher accuracy in classifying the occurrence of precipitation than the ANN algorithm. The F1-score and Accuracy values, which are evaluation indicators of the classification model, were calculated as 0.80 and 0.77, while the ANN was calculated as 0.76 and 0.71. In addition, the performance of estimating precipitation also showed higher accuracy in RF than in ANN algorithm. The RMSE of the RF and ANN algorithms was 2.8 mm/day and 2.9 mm/day, and the values were calculated as 0.68 and 0.73.

The WISE Quality Control System for Integrated Meteorological Sensor Data (WISE 복합기상센서 관측 자료 품질관리시스템)

  • Chae, Jung-Hoon;Park, Moon-Soo;Choi, Young-Jean
    • Atmosphere
    • /
    • v.24 no.3
    • /
    • pp.445-456
    • /
    • 2014
  • A real-time quality control system for meteorological data (air temperature, air pressure, relative humidity, wind speed, wind direction, and precipitation) measured by an integrated meteorological sensor has been developed based on comparison of quality control procedures for meteorological data that were developed by the World Meteorological Organization and the Korea Meteorological Administration (KMA), using time series and statistical analysis of a 12-year meteorological data set observed from 2000 to 2011 at the Incheon site in Korea. The quality control system includes missing value, physical limit, step, internal consistency, persistence, and climate range tests. Flags indicating good, doubtful, erroneous, not checked, or missing values were added to the raw data after the quality control procedure. The climate range test was applied to the monthly data for air temperature and pressure, and its threshold values were modified from ${\pm}2{\sigma}$ and ${\pm}3{\sigma}$ to ${\pm}3{\sigma}$ and ${\pm}6{\sigma}$, respectively, in order to consider extreme phenomena such as heat waves and typhoons. In addition, the threshold values of the step test for air temperature, air pressure, relative humidity, and wind speed were modified to $0.7^{\circ}C$, 0.4 hPa, 5.9%, and $4.6m\;s^{-1}$, respectively, through standard deviation analysis of step difference according to their averaging period. The modified quality control system was applied to the meteorological data observed by the Weather Information Service Engine in March 2014 and exhibited improved performance compared to the KMA procedures.

Nearest-Neighbor Collaborative Filtering Using Dimensionality Reduction by Non-negative Matrix Factorization (비부정 행렬 인수분해 차원 감소를 이용한 최근 인접 협력적 여과)

  • Ko, Su-Jeong
    • The KIPS Transactions:PartB
    • /
    • v.13B no.6 s.109
    • /
    • pp.625-632
    • /
    • 2006
  • Collaborative filtering is a technology that aims at teaming predictive models of user preferences. Collaborative filtering systems have succeeded in Ecommerce market but they have shortcomings of high dimensionality and sparsity. In this paper we propose the nearest neighbor collaborative filtering method using non-negative matrix factorization(NNMF). We replace the missing values in the user-item matrix by using the user variance coefficient method as preprocessing for matrix decomposition and apply non-negative factorization to the matrix. The positive decomposition method using the non-negative decomposition represents users as semantic vectors and classifies the users into groups based on semantic relations. We compute the similarity between users by using vector similarity and selects the nearest neighbors based on the similarity. We predict the missing values of items that didn't rate by a new user based on the values that the nearest neighbors rated items.

SMOOTH SINGULAR VALUE THRESHOLDING ALGORITHM FOR LOW-RANK MATRIX COMPLETION PROBLEM

  • Geunseop Lee
    • Journal of the Korean Mathematical Society
    • /
    • v.61 no.3
    • /
    • pp.427-444
    • /
    • 2024
  • The matrix completion problem is to predict missing entries of a data matrix using the low-rank approximation of the observed entries. Typical approaches to matrix completion problem often rely on thresholding the singular values of the data matrix. However, these approaches have some limitations. In particular, a discontinuity is present near the thresholding value, and the thresholding value must be manually selected. To overcome these difficulties, we propose a shrinkage and thresholding function that smoothly thresholds the singular values to obtain more accurate and robust estimation of the data matrix. Furthermore, the proposed function is differentiable so that the thresholding values can be adaptively calculated during the iterations using Stein unbiased risk estimate. The experimental results demonstrate that the proposed algorithm yields a more accurate estimation with a faster execution than other matrix completion algorithms in image inpainting problems.