• Title/Summary/Keyword: Missing variables


Sensitivity analysis of missing mechanisms for the 19th Korean presidential election poll survey (19대 대선 여론조사에서 무응답 메카니즘의 민감도 분석)

  • Kim, Seongyong;Kwak, Dongho
    • The Korean Journal of Applied Statistics / v.32 no.1 / pp.29-40 / 2019
  • Categorical data with non-responses are frequently observed in election poll surveys and can be represented by incomplete contingency tables. To estimate candidates' support rates, the missing mechanism should be identified in advance, because the estimates for non-respondents change depending on the assumed missing mechanism. However, it has been shown that the missing mechanism cannot be identified from the observed data alone. To overcome this problem, sensitivity analysis has been suggested. The previously proposed sensitivity analysis is applicable only to two-way incomplete contingency tables with binary variables. It is therefore inappropriate for election poll surveys, which usually consider more than two factors, such as region, gender, and age. In this paper, a sensitivity analysis suitable for multi-dimensional incomplete contingency tables is devised and applied to the 19th Korean presidential election poll survey data. As a result, the intervals of estimates from the sensitivity analysis include the actual results as well as the estimates from various missing mechanisms. In addition, the properties of the missing mechanisms that produce estimates nearest to the actual election results are investigated.

Robust multiple imputation method for missings with boundary and outliers (한계와 이상치가 있는 결측치의 로버스트 다중대체 방법)

  • Park, Yousung;Oh, Do Young;Kwon, Tae Yeon
    • The Korean Journal of Applied Statistics / v.32 no.6 / pp.889-898 / 2019
  • Imputing missing values for survey variables with item nonresponse becomes complicated when outliers and logical boundary conditions involving other survey items cannot be ignored. If a variable with missing values has outliers and boundaries, values imputed by existing regression-based imputation methods are likely to be biased and to violate the boundary conditions. In this paper, we address these difficulties by combining various robust regression models with multiple imputation methods. Through a simulation study covering various scenarios of outliers and boundaries, we find and discuss the optimal combination of robust regression and multiple imputation method.

A longitudinal data analysis for child academic achievement with Korea welfare panel study data (경시적 자료를 이용한 아동 학업성취도 분석)

  • Lee, Naeun;Huh, Jib
    • Journal of the Korean Data and Information Science Society / v.28 no.1 / pp.1-10 / 2017
  • Longitudinal data on Korean child academic achievement have previously been used to find significant explanatory variables under the assumption that the repeated measurements are independent. Using the explanatory variables from previous research, we fit a linear mixed model incorporating fixed and random effects for child academic achievement to detect the significant explanatory variables. We use Korea Welfare Panel Study data, observed three times between 2006 and 2012 through an additional survey for children. Child academic achievement is evaluated as the sum of the academic achievements in Korean, English, and Mathematics. We also investigate the multicollinearity and the missing mechanism, and select some popular correlation matrices for the analysis of the linear mixed model.

Undecided inference using bivariate probit models (이변량 프로빗모형을 이용한 미결정자 추론)

  • Hong, Chong-Sun;Jung, Mi-Yang
    • Journal of the Korean Data and Information Science Society / v.22 no.6 / pp.1017-1028 / 2011
  • When it is not easy to assign a credit score to some loan applicants, the credit evaluation is postponed and a specialist is asked to further evaluate the undecided applicants. This undecided inference is a problem that arises in most statistical models, including those in biostatistics and sports statistics as well as in credit evaluation. In this work, the undecided inference is regarded as a missing data mechanism under the MNAR assumption, and the bivariate probit model, one of the sample selection models, is used. Two undecided inference methods are proposed: one uses characteristic variables to represent the state of decided applicants, and the other collects additional, more accurate information and applies these new variables. With an illustrative example, misclassification error rates for undecided and overall applicants are obtained and compared across various characteristic variables, undecided intervals, and thresholds. It is found that misclassification error rates can be reduced when the undecided interval is widened and more accurate information is put into the model, since the situation of decided applicants is then reflected more accurately in the bivariate probit model.

Analysis of medical panel binary data using marginalized models (주변화 모형을 이용한 의료 패널 이진 데이터 분석)

  • Chaeyoung Oh;Keunbaik Lee
    • The Korean Journal of Applied Statistics / v.37 no.4 / pp.467-484 / 2024
  • Longitudinal data are measured repeatedly over time on the same subject, so the repeated outcomes are correlated. Therefore, both serial correlation and between-subject variation must be considered in longitudinal data analysis. In this paper, among the models for analyzing longitudinal binary data, we focus on marginalized models, which estimate the population-averaged effect of covariates. Marginalized models for longitudinal binary data include marginalized random effects models, marginalized transition models, and marginalized transition random effects models. These models are first reviewed, and simulations are then conducted using complete data and missing data to compare their performance. When there are missing values in the data, performance differs depending on the model from which the data were generated. We analyze Korea Health Panel data using marginalized models. In this analysis, subjective unhealthy responses are treated as a binary response variable, models with several explanatory variables are compared, and the most suitable model is presented.

The Forecast analysis on Non-electrical Machinery and Equipment of Macroeconomic variables (기계산업 수출액에 대한 거시경제변수의 예측 실험 - 보건과학분야의 정밀기계 수출액 포함 -)

  • Kim Jong-Kwon
    • Proceedings of the Safety Management and Science Conference / 2006.04a / pp.471-484 / 2006
  • The analysis focuses on the long-term and short-term effects of macroeconomic variables on non-electrical machinery and equipment. This paper also considers the implications for the possibility of steady growth in non-electrical machinery and equipment. The variables cover the period from 1985 to April 2005. Unavailable data are treated as missing values. In scope, the data cover non-electrical machinery and equipment as defined by KSIC. The items consist of the MTI 1- and 3-digit codes of the Korea International Trade Association (KITA), based on the HSK and the classification of Korean machinery industries. According to the Granger causality test, the relationship between the corporate bond yield and the export amount of machinery is statistically significant at the 10% level. The export unit cost index and the export amount of machinery Granger-cause each other. The yen-dollar exchange rate unilaterally Granger-causes the export amount of machinery.


Denoising Self-Attention Network for Mixed-type Data Imputation (혼합형 데이터 보간을 위한 디노이징 셀프 어텐션 네트워크)

  • Lee, Do-Hoon;Kim, Han-Joon;Chun, Joonghoon
    • The Journal of the Korea Contents Association / v.21 no.11 / pp.135-144 / 2021
  • Recently, data-driven decision-making has become a key technology leading the data industry, and the machine learning behind it requires high-quality training datasets. However, real-world data contain missing values for various reasons, which degrades the performance of prediction models learned from poor training data. Therefore, in order to build high-performance models from real-world datasets, many studies on automatically imputing missing values in initial training data have been actively conducted. Many conventional machine learning-based imputation techniques involve very time-consuming and cumbersome work, because they apply only to numeric columns or create an individual predictive model for each column. Therefore, this paper proposes a new data imputation technique called 'Denoising Self-Attention Network (DSAN)', which can be applied to mixed-type datasets containing both numerical and categorical columns. DSAN can learn robust feature representation vectors by combining self-attention and denoising techniques, and can automatically impute multiple missing variables in parallel through multi-task learning. To verify the validity of the proposed technique, data imputation experiments have been performed after arbitrarily generating missing values in several mixed-type training datasets. We then show the validity of the proposed technique by comparing the performance of binary classification models trained on the imputed data, together with the errors between the original and imputed values.

The Current State and Determinants of Korean Baby-Boomers' Welfare Consciousness

  • Lee, Hyoung-Ha
    • Journal of the Korea Society of Computer and Information / v.21 no.5 / pp.193-200 / 2016
  • This study was conducted to assess the effect of variables influencing Korean baby-boomers' welfare consciousness. For this purpose, data from the 8th supplementary survey of the Korea Welfare Panel in 2013 were analyzed. The subjects of analysis were 2,035 people born between 1955 and 1965 whose welfare panel data had no missing values for the variables of the research model. According to the results of the analysis, first, in the descriptive statistics of the major variables, the sub-factors of the baby-boomers' welfare consciousness with relatively high mean scores were 'expansion of expenditure for public assistance' (mean 3.65, SD .557), 'expansion of expenditure for social insurance' (mean 3.53, SD .646), and 'expansion of expenditure for social services' (mean 3.26, SD .424). The mean score of the baby-boomers' overall welfare consciousness was relatively high at 3.45 (SD .428), indicating support for the expansion of welfare expenditure. Second, the independent variables influencing the baby-boomers' welfare consciousness were found to have an explanatory power of 12.9%. In the regression analysis, the variables found to have a significant effect were gender (B=.100, t=2.573, p<.01), personal responsibility for poverty (B=-.151, t=-3.635, p<.01), social responsibility for poverty (B=.149, t=3.437, p<.001), and recipient's laziness (B=.251, t=6.578, p<.001). Based on these results, major relevant policies were discussed.

REGRESSION FRACTIONAL HOT DECK IMPUTATION

  • Kim, Jae-Kwang
    • Journal of the Korean Statistical Society / v.36 no.3 / pp.423-434 / 2007
  • Imputation using a regression model is a method to preserve the correlation among variables and to provide imputed point estimators. We discuss the implementation of regression imputation using fractional imputation. With a suitable choice of fractional weights, the fractional regression imputation can take the form of hot deck fractional imputation, so that no artificial values are constructed after the imputation. A variance estimator, which extends the method of Kim and Fuller (2004), is also proposed. Results from a limited simulation study are presented.

Grouping the Ginseng Field Soil Based on the Development of Root Rot of Ginseng Seedlings (유묘 뿌리썩음병 진전에 따른 이산재배 토양의 유별)

  • 박규진;박은우;정후섭
    • Korean Journal Plant Pathology / v.13 no.1 / pp.37-45 / 1997
  • Disease incidence (DI), pre-emergence damping-off (PDO), days until the first symptom appeared (DUS), disease progress curve (DPC), and area under disease progress curve (AUDPC) were investigated in vivo after sowing ginseng seeds in each of 37 ginseng-cultivated soils sampled from 4 regions in Korea. The nonlinear fitting parameters A, B, K, and M were estimated from Richards' function, one of the disease progress models, using the DI on each day of the bioassay. Inter- and intra-relationships between the disease variables and the stand-missing rate (SMR) in fields were investigated using simple correlation analysis. The disease variables of the root rot were divided into two groups: variables related to disease incidence, e.g., DI, AUDPC, and the A parameter, and variables related to disease progress, e.g., the B, K, and M parameters. DI, AUDPC, and DUS had significant correlations with SMR in ginseng fields, which showed that the disease development in vivo corresponded with that in fields. Soil samples could be separated into 3 and 4 groups, respectively, on the basis of principal component 1 (PC1) and principal component 2 (PC2), which were derived from the principal component analysis (PCA) of Richards' parameters A, B, K, and M. PC1 accounted for the B, K, and M parameters, and PC2 accounted for the A parameter.
