• Title/Summary/Keyword: data imputation


A nonnormal Bayesian imputation

  • Shin Minwoong;Lee Jinhee;Lee Juyoung;Lee Sangeun
    • Proceedings of the Korean Statistical Society Conference / 2000.11a / pp.51-56 / 2000
  • When the standard inference is to be used with complete data and nonresponse is ignorable, then multiple imputations should be created as repetitions under a Bayesian normal model. Many Bayesian models besides the normal, however, approximately yield the standard inference with complete data and thus many such models can be used to create proper imputations. We consider the Bayesian bootstrap (BB) application.
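
The abstract names the Bayesian bootstrap (BB) but does not spell out the procedure. The sketch below follows the usual textbook description, assuming an ignorable nonresponse mechanism and a single variable: draw Dirichlet(1, ..., 1) weights over the observed values from the gaps between sorted uniforms, then sample the missing values from the observed ones with those weights. The function names and the choice of m = 5 imputations are illustrative assumptions, not taken from the paper.

```python
import random


def bayesian_bootstrap_impute(observed, n_missing, rng):
    """Draw one set of imputed values with the Bayesian bootstrap."""
    n = len(observed)
    # Gaps between n-1 sorted uniforms are a Dirichlet(1, ..., 1) draw.
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    weights = [edges[i + 1] - edges[i] for i in range(n)]
    # Sample the missing values from the observed ones using those weights.
    return rng.choices(observed, weights=weights, k=n_missing)


def multiple_imputations(observed, n_missing, m=5, seed=0):
    """Repeat the BB draw m times to get m completed-data imputations."""
    rng = random.Random(seed)
    return [bayesian_bootstrap_impute(observed, n_missing, rng) for _ in range(m)]
```

Each of the m lists would be combined with the observed data and analysed separately, with the results pooled afterwards.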


Policies for Improving the Survey of Research and Development in Science and Technology: The Case of Industrial Sector (과학기술연구개발활동조사의 개선방안 -기업부문을 중심으로-)

  • 유승훈;문혜선
    • Journal of Korea Technology Innovation Society / v.5 no.2 / pp.228-244 / 2002
  • The survey of research and development (R&D) in science and technology (S&T) reports the current status of R&D activities in S&T in Korea and provides a basis for S&T policy decisions. The survey needs continuous improvement if it is to supply reliable basic national statistics. The purpose of this study is therefore two-fold: to introduce a sample survey method for the industrial sector and to develop a statistical technique for handling non-response data from that sector. To these ends, case studies of the United States and Japan are first presented. A new sampling design for the R&D survey is then proposed, and a stratified random sampling scheme is suggested for implementing it. The statistical analysis of non-response data is also addressed: based on several screening criteria, we develop a new imputation method suited to the R&D survey and provide a more detailed implementation plan. Various solutions to the problem of item non-response are also presented. Finally, some implications of the results are discussed.
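
The abstract suggests stratified random sampling without giving the allocation rule. As a minimal sketch, assuming proportional allocation and hypothetical strata (for example, firms grouped by size), a sample could be drawn as follows. The function name and the rounding rule are assumptions made for illustration.

```python
import random


def stratified_sample(strata, total_n, seed=0):
    """Proportional-allocation stratified random sample.

    strata: dict mapping a stratum label to the list of units in it.
    """
    rng = random.Random(seed)
    population = sum(len(units) for units in strata.values())
    sample = {}
    for label, units in strata.items():
        # Allocate in proportion to stratum size, at least one unit per
        # stratum, and never more than the stratum holds.
        n_h = max(1, round(total_n * len(units) / population))
        n_h = min(n_h, len(units))
        sample[label] = rng.sample(units, n_h)
    return sample
```

Because of rounding, the realised sample size can differ slightly from `total_n`.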


Prediction of Dissolved Oxygen in Jindong Bay Using Time Series Analysis (시계열 분석을 이용한 진동만의 용존산소량 예측)

  • Han, Myeong-Soo;Park, Sung-Eun;Choi, Youngjin;Kim, Youngmin;Hwang, Jae-Dong
    • Journal of the Korean Society of Marine Environment & Safety / v.26 no.4 / pp.382-391 / 2020
  • In this study, we used artificial intelligence algorithms to predict dissolved oxygen in Jindong Bay. To fill in missing values in the observational data, we used the Bidirectional Recurrent Imputation for Time Series (BRITS) deep learning algorithm. To predict the dissolved oxygen, we used Auto-Regressive Integrated Moving Average (ARIMA), a widely used time series analysis method, and the Long Short-Term Memory (LSTM) deep learning method, and we compared the accuracy of the two. The missing values were determined with high accuracy by BRITS in the surface layer; however, the accuracy was low in the lower layers. The accuracy of BRITS was unstable due to the experimental conditions in the middle layer. In the middle and bottom layers, the LSTM model showed higher accuracy than the ARIMA model, whereas the ARIMA model showed superior performance in the surface layer.

Association of HLA Genotype and Fulminant Type 1 Diabetes in Koreans

  • Kwak, Soo Heon;Kim, Yoon Ji;Chae, Jeesoo;Lee, Cue Hyunkyu;Han, Buhm;Kim, Jong-Il;Jung, Hye Seung;Cho, Young Min;Park, Kyong Soo
    • Genomics & Informatics / v.13 no.4 / pp.126-131 / 2015
  • Fulminant type 1 diabetes (T1DM) is a distinct subtype of T1DM that is characterized by rapid onset hyperglycemia, ketoacidosis, absolute insulin deficiency, and near normal levels of glycated hemoglobin at initial presentation. Although it has been reported that class II human leukocyte antigen (HLA) genotype is associated with fulminant T1DM, the genetic predisposition is not fully understood. In this study we investigated the HLA genotype and haplotype in 11 Korean cases of fulminant T1DM using imputation of whole exome sequencing data and compared their frequencies with those of 413 participants of the Korean Reference Panel. The HLA-DRB1*04:05-HLA-DQB1*04:01 haplotype was significantly associated with increased risk of fulminant T1DM in Fisher's exact test (odds ratio [OR], 4.11; 95% confidence interval [CI], 1.56 to 10.86; p = 0.009). A histidine residue at HLA-DRβ1 position 13 was marginally associated with increased risk of fulminant T1DM (OR, 2.45; 95% CI, 1.01 to 5.94; p = 0.054). Although we had limited statistical power, we provide evidence that HLA haplotype and amino acid change can be a genetic risk factor of fulminant T1DM in Koreans. Further large-scale research is required to confirm these findings.
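
The abstract does not give the underlying 2x2 counts, so the test itself cannot be reproduced here. What can be checked is internal consistency: if the reported intervals are symmetric on the log-odds scale (an assumption, typical of Wald-type intervals), the odds ratio should sit at the geometric mean of the interval limits. The helper names below are illustrative.

```python
import math


def log_scale_midpoint(lower, upper):
    """Geometric mean of the CI limits, i.e. the midpoint on the log scale."""
    return math.sqrt(lower * upper)


def implied_se_log_or(lower, upper, z=1.96):
    """Standard error of log(OR) implied by a symmetric 95% interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Reported values from the abstract: (OR, CI lower, CI upper).
reported = [(4.11, 1.56, 10.86), (2.45, 1.01, 5.94)]
for odds_ratio, lo, hi in reported:
    print(odds_ratio, round(log_scale_midpoint(lo, hi), 2),
          round(implied_se_log_or(lo, hi), 3))
```

Both reported odds ratios agree with their log-scale midpoints to within rounding, which is consistent with that assumption.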

Improvement of A Preprocessing of Archived Traffic Data Collected by Expressway Vehicle Detection System (고속도로 차량검지기 이력자료 활용을 위한 전처리과정 개선)

  • Lee, Hwan-Pil;NamKoong, Seong;Kim, Soo-Hee;Kim, Jin
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.12 no.1 / pp.15-27 / 2013
  • The variety of information collected by vehicle detectors has mainly been used as real-time data. Recently, however, applications of archived traffic data have become increasingly important. Against this background, this research was conducted to improve the preprocessing of archived traffic data so that it can be put to use. The specific aim of the improved preprocessing was to make the traffic data reflect actual traffic phenomena. In the evaluation, the improved preprocessing produced values closer to the actual values than the existing preprocessing did.

The Comparison of Imputation Methods in Space Time Series Data with Missing Values (공간시계열모형의 결측치 추정방법 비교)

  • Lee, Sung-Duck;Kim, Duck-Ki
    • Communications for Statistical Applications and Methods / v.17 no.2 / pp.263-273 / 2010
  • Missing values in time series can be treated as unknown parameters and estimated by maximum likelihood, or as random variables and predicted by the conditional expectation of the unknown values given the data. The purpose of this study is to impute missing values in incomplete data under both views, as maximum likelihood estimates and as random variables, and to compare the two methods using ARMA and STAR models. For illustration, monthly mumps data reported from the national capital region over the years 2001 to 2009 are used, and the two methods are compared in terms of the precision of the estimated missing values and the precision of forecasts of future data.
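
To illustrate the conditional-expectation view, here is a deliberately simplified stand-in: a Gaussian AR(1) process with known coefficient `phi` and mean `mu`, not the ARMA or STAR models of the paper. For an isolated interior gap with observed neighbours, the conditional mean is mu + phi * ((x[t-1] - mu) + (x[t+1] - mu)) / (1 + phi^2). In this special case, maximising the two likelihood terms that involve the gap gives the same value. All names and parameters are assumptions.

```python
def impute_ar1_gaps(series, phi, mu=0.0):
    """Fill isolated interior gaps (None) in an AR(1) series.

    Assumes phi and mu are known; gaps at the ends or next to another
    gap are left as None.
    """
    filled = list(series)
    for t in range(1, len(series) - 1):
        prev, nxt = series[t - 1], series[t + 1]
        if series[t] is None and prev is not None and nxt is not None:
            # Conditional mean of x[t] given both neighbours.
            filled[t] = mu + phi * ((prev - mu) + (nxt - mu)) / (1 + phi ** 2)
    return filled
```

For example, `impute_ar1_gaps([1.0, None, 1.0], phi=0.5)` fills the gap with 0.5 * 2.0 / 1.25 = 0.8.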

The Comparison of Imputation Methods in Time Series Data with Missing Values (시계열자료에서 결측치 추정방법의 비교)

  • Lee, Sung-Duck;Choi, Jae-Hyuk;Kim, Duck-Ki
    • Communications for Statistical Applications and Methods / v.16 no.4 / pp.723-730 / 2009
  • Missing values in time series can be treated as unknown parameters and estimated by maximum likelihood, or as random variables and predicted by the expectation of the unknown values given the data. The purpose of this study is to impute missing values in incomplete data under both views, as maximum likelihood estimates and as random variables, and to compare the two methods using an ARMA model. For illustration, monthly mumps data reported from the national capital region over the years 2001 to 2006 are used, and the results of the two methods are compared using the SSF (sum of squares of forecasting errors).
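
The abstract only expands the SSF acronym. Assuming it means the sum of squared differences between held-out actual values and their forecasts, a minimal sketch of the comparison criterion is:

```python
def ssf(actual, forecast):
    """Sum of squared forecasting errors over paired values (assumed definition)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast))
```

Under this reading, the imputation method whose fitted model yields the smaller SSF on the forecast period would be preferred.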

Methods for screening time series data according to data quality and statistical status (품질 및 조건 기반 시계열 데이터 선별 활용 방법)

  • Moon, JaeWon;Yu, MiSeon;Oh, SeungTaek;Kum, SeungWoo;Hwang, JiSoo;Lee, JiHoon
    • Proceedings of the Korean Society of Computer Information Conference / 2022.01a / pp.399-402 / 2022
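  • This paper introduces a method for screening incomplete time series data before it is used. The quality of time series data depends on variable conditions such as changes over time in the collection network and the collection devices, so anomalous or missing data occur irregularly. If such data are discarded wholesale simply because they contain errors, or if intervals of missing data are restored unconditionally before use, undesired results can follow. The proposed method extracts statistical information about the missing data in each interval of the time series and, based on that information, judges data that do not meet the intended purpose and the usable quality criteria to be unusable. We propose and test a structure that automatically excludes such data ahead of analysis or other use. The proposed method allows quick decisions, adapted to the purpose and the situation, on whether data containing missing values can be used, and can yield better analysis results.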
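
The abstract does not say which missing-data statistics or thresholds are used. As a hedged sketch, assume two statistics per interval, the missing ratio and the longest run of consecutive missing values, and hypothetical limits `max_ratio` and `max_gap` that stand for the quality criteria of a given use case.

```python
def missing_stats(segment):
    """Missing ratio and longest consecutive gap for one interval (None = missing)."""
    missing = longest = run = 0
    for value in segment:
        if value is None:
            missing += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    ratio = missing / len(segment) if segment else 1.0
    return {"missing_ratio": ratio, "longest_gap": longest}


def is_usable(segment, max_ratio=0.2, max_gap=3):
    """Accept the interval only if both assumed quality criteria are met."""
    stats = missing_stats(segment)
    return stats["missing_ratio"] <= max_ratio and stats["longest_gap"] <= max_gap
```

Intervals for which `is_usable` returns False would be excluded automatically before analysis instead of being imputed.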


Data Cleansing Algorithm for reducing Outlier (데이터 오·결측 저감 정제 알고리즘)

  • Lee, Jongwon;Kim, Hosung;Hwang, Chulhyun;Kang, Inshik;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2018.10a / pp.342-344 / 2018
  • This paper shows that the proposed algorithm can substitute for statistical methods such as mean imputation, correlation coefficient analysis, and graph correlation analysis, and can take the place of a statistician in processing the various abnormal data measured in the water treatment process. In addition, this study aims to model a data-filtering system based on a recent fractile pattern and a deep learning-based LSTM algorithm in order to improve the reliability and validation of the algorithm, using open-source libraries such as Keras, Theano, and TensorFlow.
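
For reference, mean imputation, one of the statistical baselines the abstract mentions, can be sketched as below. This is the generic technique, not the paper's proposed algorithm; missing readings are assumed to be encoded as None.

```python
def mean_impute(values):
    """Replace each missing reading (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

Mean imputation keeps the sample mean unchanged but shrinks the variance, which is one reason model-based alternatives are studied.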


Household, personal, and financial determinants of surrender in Korean health insurance

  • Shim, Hyunoo;Min, Jung Yeun;Choi, Yang Ho
    • Communications for Statistical Applications and Methods / v.28 no.5 / pp.447-462 / 2021
  • In insurance, the surrender rate is an important variable that threatens the sustainability of insurers and determines the profitability of the contract. Unlike other actuarial assumptions that determine the cash flow of an insurance contract, however, it is characterized by endogenous variables such as people's economic, social, and subjective decisions. Therefore, a microscopic approach is required to identify and analyze the factors that determine the lapse rate. Specifically, micro-level characteristics including the individual, demographic, microeconomic, and household characteristics of policyholders are necessary for the analysis. In this study, we select panel survey data from the Korean Retirement Income Study (KReIS), which covers many diverse dimensions, to determine which variables have a decisive effect on lapse, and we apply the lasso regularized regression model to analyze it empirically. As the data contain many missing values, they are imputed using the random forest method. Among the household variables, we find that the non-existence of old dependents, the existence of young dependents, and employed family members increase the surrender rate. Among the individual variables, divorce, non-urban residential areas, apartment type of housing, non-ownership of homes, and a bad relationship with siblings increase the lapse rate. Finally, among the financial variables, low income, low expenditure, the existence of children who incur child care expenditure, not expecting a bequest from a spouse, not holding public health insurance, and expecting to benefit from a retirement pension increase the lapse rate. Some of these findings are consistent with those in the literature.