• Title/Summary/Keyword: statistical variable (통계 변수)

Correlated variable importance for random forests (랜덤포레스트를 위한 상관예측변수 중요도)

  • Shin, Seung Beom;Cho, Hyung Jun
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.177-190 / 2021
  • Random forests are a popular ensemble method that reduces the instability and improves the accuracy of decision trees. The gain in accuracy comes at the cost of interpretability; to compensate, variable importance is provided. Variable importance indicates which variables play the more important roles in constructing the random forest. However, when a predictor is correlated with other predictors, the importance computed by the existing importance algorithm may be distorted. The downward bias of correlated predictors may reduce the importance of truly important predictors. We propose a new algorithm that remedies the downward bias of correlated predictors. The performance of the proposed algorithm is demonstrated on simulated data and illustrated on real data.
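
The abstract does not say which importance measure the proposed algorithm modifies. As a point of reference only, the following is a minimal sketch of permutation importance, one common importance measure for random forests. The names `permutation_importance` and `toy_model` are hypothetical, and `toy_model` is a hand-written stand-in for a fitted forest, not the authors' method.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, column, rng):
    """Increase in MSE after shuffling the values of one predictor column."""
    baseline = mse(predict, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in the chosen column.
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - baseline

def toy_model(row):
    """Assumed stand-in for a fitted forest; it ignores the third predictor."""
    return 3 * row[0] + row[1]

rng = random.Random(1)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
targets = [3 * r[0] + r[1] + rng.gauss(0, 0.1) for r in rows]

importances = [permutation_importance(toy_model, rows, targets, c, rng)
               for c in range(3)]
print(importances)
```

The predictors in this toy data set are independent, so the sketch shows only the mechanics of the measure; it does not reproduce the bias under correlation that the paper addresses.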

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.153-166 / 2021
  • A main goal of pharmacogenomics studies is to predict an individual's drug responsiveness based on high-dimensional genetic variables. Because the number of variables is large, feature selection is required to reduce it. The selected features are then used to construct a predictive model using machine learning algorithms. In the present study, we applied several hybrid feature selection methods, such as combinations of logistic regression, ReliefF, TuRF, random forest, and LASSO, to a next-generation sequencing data set of 400 epilepsy patients. We then applied the selected features to machine learning methods including random forest, gradient boosting, and support vector machine, as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performs better than the other combinations of approaches. Based on a 5-fold cross-validation partition, the mean test accuracy of the best model was 0.727 and its mean test AUC was 0.761. The stacking models also appeared to outperform single machine learning predictive models when using the same selected features.

Models for Polytomous Data with a Nested Structure (지분구조의 다가자료에 관한 모형)

  • 최재성
    • Communications for Statistical Applications and Methods / v.4 no.2 / pp.377-384 / 1997
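  • This paper presents data-analysis models for categorical data with a nested structure when the data are nominal polytomous data. The models take into account the types of nested variables that can be defined at each stage of the nested structure and the variables that affect the probabilities of interest for those nested variables.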

Application of a Statistical Disclosure Control Techniques Based on Multiplicative Noise (승법잡음모형을 이용한 통계적 노출조절기법의 적용)

  • Kim, Young-Won;Kim, Tae-Yeon;Ki, Kye-Nam
    • The Korean Journal of Applied Statistics / v.24 no.1 / pp.127-136 / 2011
  • The multiplicative noise model is one of the popular methods for masking continuous variables. In this paper, we propose a transformation of the variable that has been multiplied by random noise. An advantage of the masking method using the proposed transformation is that users of the masked data can obtain unbiased values of the mean and variance of the original (unmasked) data. We also consider data utility and the correlation structure of the variables when applying the proposed multiplicative noise scheme. To investigate the properties of masking based on multiplicative noise, a simulation study was conducted using the 2008 Household Income and Expenditure Survey data.
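
The abstract does not spell out the proposed transformation. The sketch below illustrates only the general idea behind multiplicative-noise masking, under assumptions that are not from the paper: noise that is independent of the data, with mean 1 and a known standard deviation. The function names `mask` and `recover_moments` are hypothetical.

```python
import random
import statistics

def mask(values, noise_sd, rng):
    """Multiply each value by independent noise with mean 1 and sd noise_sd."""
    return [v * rng.gauss(1.0, noise_sd) for v in values]

def recover_moments(masked, noise_sd):
    """Estimate the mean and variance of the original data from masked data.

    With y = x * e, E[e] = 1, Var(e) = s^2 and e independent of x:
      E[y] = E[x]  and  E[y^2] = E[x^2] * (1 + s^2).
    """
    n = len(masked)
    mean_y = sum(masked) / n
    second_moment_y = sum(y * y for y in masked) / n
    var_x = second_moment_y / (1.0 + noise_sd ** 2) - mean_y ** 2
    return mean_y, var_x

rng = random.Random(0)
# Simulated positive, skewed values standing in for income-like data.
original = [rng.lognormvariate(3.0, 0.5) for _ in range(200_000)]
masked = mask(original, noise_sd=0.2, rng=rng)

est_mean, est_var = recover_moments(masked, noise_sd=0.2)
true_mean = statistics.fmean(original)
true_var = statistics.pvariance(original)
print(est_mean, true_mean)
print(est_var, true_var)
```

A data user who knows only the masked values and the noise standard deviation can approximately recover the first two moments this way. The Gaussian noise here is chosen for brevity; it is not taken from the paper.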

Imputation for Binary or Ordered Categorical Traits Based on the Bayesian Threshold Model (베이지안 분계점 모형에 의한 순서 범주형 변수의 대체)

  • Lee Seung-Chun
    • The Korean Journal of Applied Statistics / v.18 no.3 / pp.597-606 / 2005
  • Nonresponse in sample surveys causes a problem when analyzing data sets in public-use files, where the user has only complete-data methods available and limited information about the reasons for nonresponse. Imputation has recently become a standard approach for handling nonresponse, and various imputation methods have been devised. However, most imputation methods are concerned with continuous traits, whereas many features of interest in sample surveys are measured on binary or ordered categorical scales. In this note, an imputation method for ignorable nonresponse in binary or ordered categorical traits is considered.

A unified measure of association for complex data obtained from independence tests (혼합자료에서 독립성검정에 의한 연관성 측정)

  • Lee, Seung-Chun;Huh, Moon Yul
    • The Korean Journal of Applied Statistics / v.34 no.4 / pp.523-536 / 2021
  • Although numerous measures of association exist, most lack generality in that they are not intended to measure the association between heterogeneous types of random variables. On the other hand, many statistical analyses dealing with complex data sets require a very sophisticated measure of association. In this note, the p-value of independence tests is utilized to obtain a measure of association. The proposed measure of association has some consistency in measuring association between various types of random variables.

An Analysis for the Adjustment Process of Market Variations by the Formulation of Time Lag Structure (시차구조의 설정에 따른 시장변동의 조정과정 분석)

  • 김태호;이청림
    • The Korean Journal of Applied Statistics / v.16 no.1 / pp.87-100 / 2003
  • Most statistical data are generated by a set of dynamic, stochastic, and simultaneous relations. An important question is how to specify statistical models so that they are consistent with the dynamic features of those data. A general hypothesis is that the effect of a change in an explanatory variable is not felt all at once at a single point in time, but is distributed over a number of future points in time. In other words, current control variables are determined by a function that can be reduced to a distributed lag function of past observations. If lagged series of the explanatory variables are incorporated in the model specification, it is possible to explain the relationship between variables at different points in time and to estimate the long-run impact of a change in one variable on another. In this study, a distributed lag structure is applied to a model of the domestic stock market to capture the dynamic response of the market to exogenous shocks. The domestic market is found to be more responsive to changes in foreign market factors in both the short and the long run.
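
For orientation, a generic finite distributed lag model can be written as follows. The symbols and the lag length $q$ are assumptions made for illustration; the abstract does not give the paper's actual specification.

```latex
y_t = \alpha + \sum_{i=0}^{q} \beta_i \, x_{t-i} + \varepsilon_t ,
\qquad
\text{short-run impact} = \beta_0 ,
\qquad
\text{long-run impact} = \sum_{i=0}^{q} \beta_i .
```

Here $\beta_0$ is the immediate response of $y_t$ to a unit change in $x_t$, and the sum of all lag coefficients is the cumulative response once a permanent unit change has fully worked through the lags.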

A Study on the Capital Market Sentiment Index and Financial Investors' Heuristics (자본시장심리지수와 금융투자자 휴리스틱에 관한 연구)

  • Kim, Seok-Hwan;Gang, Hyeong-Gu
    • 한국벤처창업학회:학술대회논문집 / 2020.11a / pp.179-184 / 2020
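  • This study uses the extended theory of reasoned action (ETRA) to examine the relationship between investors' heuristics and the factors that influence choice behavior toward an application based on a capital market sentiment index when investing in stocks. The researchers hypothesized that individual investors' heuristics affect choice behavior, measured the representativeness, availability, and affect heuristics, and analyzed them as mediating variables for choice behavior. The path-coefficient analysis of the research model gave the following results. First, the independent variable investment opportunity expansion and, among the mediating heuristics, the representativeness heuristic were found to affect behavioral intention. Second, behavioral intention affects the dependent variable, choice behavior, and the mediating variable availability heuristic also affects choice behavior. In the research model, the independent variables affecting the representativeness heuristic were innovative propensity, investment opportunity expansion, cost of use, and perceived benefit, whereas those affecting the availability heuristic were innovative propensity and investment opportunity expansion. The mediation tests showed that service diversity affects choice behavior through a direct effect only, with no mediating effect of the heuristics. In contrast, the direct effect of investment opportunity expansion on choice behavior was not statistically significant, while the indirect effect through the heuristic mediators was 0.217 and statistically significant, indicating a mediating effect. Examining the mediating effects of the heuristics individually showed, first, that the representativeness heuristic has no indirect effect through mediation. Second, the availability heuristic has a mediating effect of size 0.1360 with a statistically significant path coefficient, confirming an indirect effect through mediation. The study therefore confirms empirically that investment opportunity expansion does not affect choice behavior toward the market-sentiment-index-based application directly; it does so indirectly, mediated by investors' availability heuristic.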
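
As background for the direct and indirect effects reported above, a standard single-mediator path model decomposes effects as follows. The symbols are assumptions made for illustration; the abstract reports only the indirect-effect sizes, not the individual path coefficients.

```latex
M = a\,X + e_1 ,
\qquad
Y = c'\,X + b\,M + e_2 ,
\qquad
\text{indirect effect} = a\,b ,
\qquad
\text{total effect} = c' + a\,b .
```

In this notation, a pattern in which $c'$ is not significant while $a\,b$ is significant corresponds to the finding that the predictor acts on the outcome only through the mediator.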

Sample-spacing Approach for the Estimation of Mutual Information (SAMPLE-SPACING 방법에 의한 상호정보의 추정)

  • Huh, Moon-Yul;Cha, Woon-Ock
    • The Korean Journal of Applied Statistics / v.21 no.2 / pp.301-312 / 2008
  • Mutual information is a measure of the association between an explanatory variable and a target variable to be predicted. It is used for variable ranking and variable subset selection. This study concerns the sample-spacing approach, which can be used to estimate mutual information from data consisting of continuous explanatory variables and a categorical target variable without estimating a joint probability density function. The results of Monte Carlo simulation and experiments with real-world data show that m = 1 is preferable when using sample spacing.
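
One way to make the idea concrete is the sketch below. It estimates differential entropy with a Vasicek-style m-spacing estimator and combines it with the identity I(X;Y) = H(X) - sum over classes c of P(Y=c) H(X | Y=c). This is an assumed reading of "sample spacing", and the paper's exact estimator may differ. The function names are hypothetical, and the tiny floor on the spacing is an ad hoc guard against ties.

```python
import math
import random
from collections import defaultdict

def spacing_entropy(sample, m=1):
    """Vasicek-style m-spacing estimate of differential entropy (in nats)."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(n):
        lo = xs[max(i - m, 0)]        # order statistic i-m, clamped at the ends
        hi = xs[min(i + m, n - 1)]    # order statistic i+m, clamped at the ends
        # Floor the scaled spacing so tied values do not give log(0).
        total += math.log(max(n / (2 * m) * (hi - lo), 1e-12))
    return total / n

def mutual_information(x, y, m=1):
    """Estimate I(X;Y) for continuous x and categorical y via entropies."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    n = len(x)
    conditional = sum(len(g) / n * spacing_entropy(g, m) for g in groups.values())
    return spacing_entropy(x, m) - conditional

rng = random.Random(42)
labels = [rng.randrange(2) for _ in range(6000)]
# Informative predictor: the class shifts the mean by 4 standard deviations.
x_informative = [rng.gauss(4.0 * c, 1.0) for c in labels]
# Uninformative predictor: independent of the class.
x_noise = [rng.gauss(0.0, 1.0) for _ in labels]

mi_informative = mutual_information(x_informative, labels, m=1)
mi_noise = mutual_information(x_noise, labels, m=1)
print(mi_informative, mi_noise)
```

Because the estimate is a difference of entropy estimates, it can come out slightly negative for an uninformative predictor. Ranking variables by the estimate, as the abstract describes, does not require it to be non-negative.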

A comparative study of feature screening methods for ultrahigh dimensional multiclass classification (초고차원 다범주분류를 위한 변수선별 방법 비교 연구)

  • Lee, Kyungeun;Kim, Kyoung Hee;Shin, Seung Jun
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.793-808 / 2017
  • We compare various variable screening methods on multiclass classification problems when the data is ultrahigh-dimensional. Two different approaches were considered: (1) pairwise extension from binary classification via one versus one or one versus rest comparisons and (2) direct classification of multiclass responses. We conducted extensive simulation studies under different conditions: heavy tailed explanatory variables, correlated signal and noise variables, correlated joint distributions but uncorrelated marginals, and unbalanced response variables. We then analyzed real data to examine the performance of the methods. The results showed that model-free methods perform better for multiclass classification problems as well as binary ones.