• Title/Abstract/Keywords: Variable importance

807 search results (processing time: 0.022 s)

Correlated variable importance for random forests

  • 신승범;조형준
    • 응용통계연구 / Vol. 34, No. 2 / pp. 177-190 / 2021
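  • Random forests are often used because they combine multiple decision tree models to improve stability and predictive power. Because they gain predictive power at the cost of interpretability, they provide variable importance measures to compensate. Variable importance indicates how large a role each variable plays in building the random forest. However, when a predictor is correlated with other predictors, the variable importance produced by the existing algorithm can be distorted. The downward bias of correlated predictors causes their importance to be measured as lower than their actual importance. We propose a new algorithm that modifies the existing algorithm to recover from the downward bias of correlated predictors. The performance of the proposed algorithm is demonstrated with simulated data and illustrated with real data.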

Comparative Study of Prediction Performance and Variable Importance in SEM-ANN Two-stage Analysis

  • 권순동;조의;방화룡
    • Journal of Information Technology Applications and Management / Vol. 31, No. 1 / pp. 11-25 / 2024
  • The purpose of this study is to investigate the improvement in prediction performance and the changes in variable importance in SEM-ANN two-stage analysis. Data from 366 survey responses on cosmetics repurchase were analyzed. The results are summarized as follows. First, in the SEM-ANN two-stage analysis, the SEM and ANN models were each trained on the training data and evaluated on the test data, and R² was reported. Prediction performance roughly doubled, from an R² of 0.3364 for SEM to 0.6836 for ANN. Expressed as the effect size f² of Cohen (1988), this improvement in R² is 110%, which corresponds to a very large effect. Second, a comparison of normalized variable importance across the two stages showed that variables with high importance in SEM also had high importance in ANN, but variables with little or no importance in SEM became important in ANN. This study is meaningful in that it increased the validity of the comparison by using the same training and evaluation method for both models, and in that it compared the degree of improvement in prediction performance and the change in variable importance through SEM-ANN two-stage analysis.
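
The 110% figure is consistent with Cohen's f² for a gain in R², f² = (R²_ANN - R²_SEM) / (1 - R²_ANN). The abstract does not state which formula the authors applied, so the short Python check below assumes this one; the function name `cohens_f2_change` is illustrative only.

```python
def cohens_f2_change(r2_base: float, r2_full: float) -> float:
    """Cohen's f^2 for the gain in R^2 from a baseline model to a second model."""
    if not 0.0 <= r2_base <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_base <= r2_full < 1")
    return (r2_full - r2_base) / (1.0 - r2_full)


# R^2 values quoted in the abstract (SEM vs. ANN on the test data).
f2 = cohens_f2_change(0.3364, 0.6836)
print(f"f2 = {f2:.2f}")  # about 1.10, i.e. roughly 110%
```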

An importance sampling for a function of a multivariate random variable

  • Jae-Yeol Park;Hee-Geon Kang;Sunggon Kim
    • Communications for Statistical Applications and Methods / Vol. 31, No. 1 / pp. 65-85 / 2024
  • The tail probability of a function of a multivariate random variable is not easy to estimate by crude Monte Carlo simulation. When the function value rarely exceeds a threshold, accurate estimation of the corresponding probability requires a huge number of samples. When the explicit form of the cumulative distribution function of each component of the variable is known, the inverse transform likelihood ratio method is a directly applicable scheme for estimating the tail probability efficiently. The method is a type of importance sampling, and its efficiency depends on the choice of the importance sampling distribution. When the cumulative distribution of the multivariate random variable is represented by a copula and its marginal distributions, we develop an iterative algorithm to find the optimal importance sampling distribution and show the convergence of the algorithm. The performance of the proposed scheme is compared numerically with crude Monte Carlo simulation.
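
To illustrate why importance sampling helps with rare tail events, here is a minimal Python sketch for the simplest possible case: P(X > t) for a single standard normal X, with a mean-shifted normal as the sampling distribution. This is not the inverse transform likelihood ratio method or the copula-based algorithm from the paper; the threshold, sample size, and function names are assumptions chosen for illustration.

```python
import math
import random


def crude_mc(threshold: float, n: int, rng: random.Random) -> float:
    """Crude Monte Carlo estimate of P(X > threshold) for X ~ N(0, 1)."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_sampling(threshold: float, n: int, rng: random.Random) -> float:
    """Estimate P(X > threshold) by sampling from N(threshold, 1) and reweighting."""
    total = 0.0
    for _ in range(n):
        y = rng.gauss(threshold, 1.0)  # draw from the shifted sampling distribution
        if y > threshold:
            # Likelihood ratio phi(y) / phi(y - threshold) of target to proposal.
            total += math.exp(-threshold * y + 0.5 * threshold ** 2)
    return total / n


t, n = 4.0, 20_000
exact = 0.5 * math.erfc(t / math.sqrt(2.0))  # closed-form normal tail probability
print("exact      :", exact)
print("crude MC   :", crude_mc(t, n, random.Random(0)))
print("importance :", importance_sampling(t, n, random.Random(0)))
```

With the same number of samples, the crude estimator sees almost no exceedances at this threshold, while the reweighted estimator places about half of its draws in the tail region.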

ACCOUNTING FOR IMPORTANCE OF VARIABLES IN MULTI-SENSOR DATA FUSION USING RANDOM FORESTS

  • Park No-Wook;Chi Kwang-Hoon
    • 대한원격탐사학회:학술대회논문집 / 대한원격탐사학회 2005 Proceedings of ISRS 2005 / pp. 283-285 / 2005
  • To account for variable importance in multi-sensor data fusion, random forests are applied to supervised land-cover classification. The random forests approach is a non-parametric ensemble classifier based on CART-like trees. Its distinguishing feature is that the importance of a variable can be estimated by randomly permuting the variable of interest in all the out-of-bag samples for each classifier. Supervised classification with a multi-sensor remote sensing data set including optical and polarimetric SAR data was carried out to illustrate the applicability of random forests. In the experiment, the random forests approach extracted important variables or bands for land-cover discrimination and showed good performance compared with other non-parametric data fusion algorithms.
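
The permutation idea described in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: a fixed threshold rule stands in for a fitted tree, and a synthetic held-out sample stands in for the out-of-bag samples (a real random forest would repeat this per tree and average). All names and data below are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, col, rng):
    """Drop in accuracy after randomly permuting column `col` of X."""
    baseline = accuracy(predict, X, y)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    # Rebuild each row with only the chosen column replaced by a permuted value.
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return baseline - accuracy(predict, X_perm, y)


rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(500)]  # two synthetic "bands"
y = [int(row[0] > 0.5) for row in X]                    # label depends on band 0 only


def predict(row):
    """Stand-in for a fitted classifier: thresholds band 0."""
    return int(row[0] > 0.5)


importances = [permutation_importance(predict, X, y, j, rng) for j in range(2)]
print(importances)  # band 0 shows a large drop; band 1 shows none
```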

Input Variable Importance in Supervised Learning Models

  • Huh, Myung-Hoe;Lee, Yong Goo
    • Communications for Statistical Applications and Methods / Vol. 10, No. 1 / pp. 239-246 / 2003
  • Statisticians, or data miners, are often asked to assess the importance of input variables in a given supervised learning model. For this purpose, one may rely on separate ad hoc measures depending on the model type, such as linear regression, neural networks, or trees. Consequently, input variable importance measures lack conceptual consistency, so the measures cannot be used directly to compare different types of models, which is often done in data mining processes. In this short communication, we propose a unified approach to measuring the importance of input variables. Our method uses sensitivity analysis, which begins by perturbing the values of input variables and monitors the change in output. The scope of the research is limited to models for continuous output, although it is not difficult to extend the method to supervised learning models for categorical outcomes.
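
A perturbation-based sensitivity measure of this general kind might look like the following Python sketch. The abstract does not specify the perturbation scheme, so the additive shift `delta`, the mean-absolute-change summary, and the toy `model` are all assumptions made for illustration. Because the measure only calls the model as a black box, the same function applies to any model with continuous output.

```python
import statistics


def sensitivity(model, X, col, delta):
    """Mean absolute change in model output when input `col` is shifted by `delta`."""
    changes = []
    for row in X:
        bumped = list(row)
        bumped[col] += delta  # perturb one input, hold the others fixed
        changes.append(abs(model(bumped) - model(row)))
    return statistics.fmean(changes)


def model(x):
    """Hypothetical fitted model with continuous output."""
    return 3.0 * x[0] + 0.5 * x[1] ** 2


X = [[0.1 * i, 0.2 * i] for i in range(10)]
scores = [sensitivity(model, X, j, 0.1) for j in range(2)]
print(scores)  # input 0 changes the output more than input 1 on this data
```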

The Effect of Asset Specificity, Information Sharing, and a Collaborative Environment on Supply Chain Management (SCM): An Integrated SCM Performance Formation Model

  • 김태룡;송장근
    • 유통과학연구 / Vol. 11, No. 4 / pp. 51-60 / 2013
  • Purpose - The objective of this paper is to investigate the effect of asset specificity, the level of information sharing, the importance of information sharing, and an integrated collaborative environment on supply chain performance. Research design, data, and methodology - Questionnaires were distributed to 250 companies that have business ties with Halla Climate Control Corporation. The empirical test of our hypotheses was based on statistical analysis (using SPSS 18.0 and AMOS 18.0). We hypothesized that asset specificity has positive effects on the following variables: the level of information sharing, the importance of information sharing, and the integrated collaborative environment. Moreover, the level of information sharing and the importance of information sharing are strongly influenced by the integrated collaborative environment, and these, when combined, have an effect on the dependent variable, supply chain performance. We tested our hypothesized model using path analysis with latent variables. Results - According to the results of our analysis, hypothesis H1, which tests whether there is a relationship between asset specificity and the integrated collaborative environment, is supported at the 0.01 level. Hypotheses H2 and H3 were also confirmed, and asset specificity had positive effects (+) on the level of information sharing variable. The importance of the information sharing variable was statistically significant at the 0.01 level. Hypotheses H4 and H5 posited that the integrated collaborative environment variable would have a positive effect on the level of information sharing; the importance of information sharing variable was strongly supported statistically, with a significant p-value below. Moreover, the level of information sharing (H6) and the importance of information sharing (H7) variables also had a statistically significant influence on supply chain performance. As a result, the existence of a collaborative system between companies would influence supply chain performance by strengthening real-time information access and information sharing. Thus, it is important to construct a collaborative environment in which information sharing and cooperation among companies are possible. Conclusions - First, with rapid changes in the business environment, it becomes necessary for enterprises to acquire the right information in order to properly implement SCM. For successful SCM, firms should understand the importance of collaboration with supply chain partners and of an internally built collaboration system, which in turn will better promote a partnership commitment with suppliers as well as collaborative integration with buyers. A collaborative system, as we suggest in this paper, facilitates the maintenance of a long-term relationship of trust and can help reinforce information sharing. Second, it is necessary to increase information sharing over time via a collaborative system so that employees of the suppliers become aware of the system. The more proactive and positive the managerial group's attitudes toward such a collaborative system, the higher the level of information sharing will be among the users. Successful SCM performance is achieved by information sharing through a collaborative environment rather than by investing only in setting up an information system.

A Study on the Consumer Service of Retailing - focusing on the Apparel Product -

  • 이은숙
    • 한국의상디자인학회지 / Vol. 4, No. 2 / pp. 31-45 / 2002
  • This study was designed to investigate whether the self-monitoring variable, drawn from various individual trait theories, and demographic variables can explain differences in the importance placed on the consumer service level of retailing for garment products. The survey was conducted from Feb. 6 to 16, 2002. Data from 118 responses were analyzed with SPSS for Windows 9.0, using Cronbach's alpha, factor analysis, one-way ANOVA, the Duncan test, frequencies, means, and percentages. The results of this study were as follows: 1. Consumer service was classified into attitude/confidence/expert knowledge of the salesperson, product display, product information, product assortment, shopping environment, and lighting setup. 2. An analysis of the importance differences per consumer service dimension by self-monitoring level found no significant differences. 3. An analysis of the importance differences per consumer service dimension by demographic variables found no significant differences.

Application of Random Forests to Assessment of Importance of Variables in Multi-sensor Data Fusion for Land-cover Classification

  • Park No-Wook;Chi Kwang-Hoon
    • 대한원격탐사학회지 / Vol. 22, No. 3 / pp. 211-219 / 2006
  • A random forests classifier is applied to multi-sensor data fusion for supervised land-cover classification in order to account for variable importance. The random forests approach is a non-parametric ensemble classifier based on CART-like trees. Its distinguishing feature is that the importance of a variable can be estimated by randomly permuting the variable of interest in all the out-of-bag samples for each classifier. Two different multi-sensor data sets for supervised classification were used to illustrate the applicability of random forests: one with optical and polarimetric SAR data and the other with multi-temporal Radarsat-1 and ENVISAT ASAR data sets. In the experiments, the random forests approach extracted important variables or bands for land-cover discrimination and showed reasonably good performance in terms of classification accuracy.

Comparison of Variable Importance Measures in Tree-based Classification

  • 김나영;이은경
    • 응용통계연구 / Vol. 27, No. 5 / pp. 717-729 / 2014
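  • This study examines variable importance in tree-structured classification, an issue that has become more prominent as data sets grow larger, with a focus on the projection pursuit classification tree. The projection pursuit classification tree uses, at each node, a linear combination of variables found by projection pursuit that separates the groups well, and the projection coefficients used carry information about the classification at each node. Aggregating these coefficients gives the importance of each variable for classification. We first compute variable importance for variable selection in classification from the projection pursuit coefficients obtained during the classification process of the projection pursuit classification tree and examine its properties. We then compare it with the results of CART and random forests, which are tree-based methods of the same form, to characterize the projection pursuit classification tree. For most data sets, the projection pursuit classification tree performed much better, and in particular, when highly correlated variables were included, it classified well with a relatively small number of variables. The variable importance provided by random forests differs greatly from that of the projection pursuit classification tree when the variables are highly correlated, and the variable importance of the projection pursuit classification tree shows somewhat better performance.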

Prediction of Landslides and Determination of Its Variable Importance Using AutoML

  • 남경훈;김만일;권오일;왕파우;정교철
    • 지질공학 / Vol. 30, No. 3 / pp. 315-325 / 2020
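  • This study develops a prediction model and determines the importance of the factors that influence landslide occurrence, based on probabilistic prediction of landslides on road slopes. To develop the landslide prediction model, field survey data from 30,615 slopes surveyed across the Korean Peninsula from 2007 to 2020 were used. Of a total of 131 variable factors, 17 topographic factors, 114 geological factors (including 89 bedrock types), and distance from the road were used. AutoML, an automated machine learning approach, was applied to the factors influencing landslide occurrence, and XRT (extremely randomized trees), which showed excellent predictive performance, was selected. The variable importance analysis found that 10 topographic factors, 9 geological factors, and the item related to distance from the road (a social influence factor), in that order, had the greatest influence on steep-slope instability. Reliability verification of the developed model showed a prediction rate of AUC 83.977%. By deriving the ranking of variable importance using only field survey data based on landslide history, the model evaluated the likelihood of landslide occurrence probabilistically and quantitatively. It is expected to provide decision makers with a reliable basis for slope diagnosis and safety evaluation through field surveys.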