• Title/Summary/Keyword: variable importance

Search Result 807

Correlated variable importance for random forests (랜덤포레스트를 위한 상관예측변수 중요도)

  • Shin, Seung Beom;Cho, Hyung Jun
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.177-190 / 2021
  • Random forests are a popular ensemble method that reduces the instability and improves the accuracy of decision trees. This gain in accuracy comes at the cost of interpretability; to compensate, variable importance is provided. Variable importance indicates which variables play the more important roles in constructing the random forest. However, when a predictor is correlated with other predictors, the importance computed by the existing algorithm may be distorted: the downward bias of correlated predictors may reduce the importance of truly important predictors. We propose a new algorithm that remedies this downward bias. The performance of the proposed algorithm is demonstrated on simulated data and illustrated with real data.
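
The downward bias described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' proposed algorithm, which the abstract does not specify. It assumes a hypothetical fixed model (`model`) standing in for a fitted forest that splits its weight evenly across two nearly duplicate predictors, and it measures importance as the increase in mean squared error after permuting one column.

```python
import random

def mse(model, rows, y):
    """Mean squared error of model predictions against targets y."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_mse_increase(model, rows, y, j, rng, n_repeats=20):
    """Average increase in MSE when column j is randomly permuted."""
    base = mse(model, rows, y)
    total = 0.0
    for _ in range(n_repeats):
        col = [r[j] for r in rows]
        rng.shuffle(col)
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
        total += mse(model, permuted, y) - base
    return total / n_repeats

rng = random.Random(1)
rows, y = [], []
for _ in range(1000):
    z, x3 = rng.gauss(0, 1), rng.gauss(0, 1)
    # Columns 0 and 1 are noisy copies of the same latent signal z;
    # column 2 is an independent predictor of equal true relevance.
    rows.append([z + rng.gauss(0, 0.1), z + rng.gauss(0, 0.1), x3])
    y.append(z + x3)

def model(r):
    # Hypothetical stand-in for a fitted forest (an assumption, not a real fit).
    return 0.5 * (r[0] + r[1]) + r[2]

imp = [permutation_mse_increase(model, rows, y, j, rng) for j in range(3)]
print([round(v, 2) for v in imp])
```

Although the latent signal behind columns 0 and 1 contributes as much to the response as column 2, each correlated copy receives a much smaller score, because permuting one copy leaves the other intact.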

Comparative Study of Prediction Performance and Variable Importance in SEM-ANN Two-stage Analysis (SEM-ANN 2단계 분석에서 예측성능과 변수중요도의 비교연구)

  • Sun-Dong Kwon;Yi Zhao;Hua-Long Fang
    • Journal of Information Technology Applications and Management / v.31 no.1 / pp.11-25 / 2024
  • The purpose of this study is to investigate the improvement in prediction performance and the changes in variable importance in SEM-ANN two-stage analysis. A total of 366 survey responses related to cosmetics repurchase were analyzed. The results are summarized as follows. First, in the SEM-ANN two-stage analysis, the SEM and ANN models were each trained on the training data and evaluated on the test data, and R² was reported. Prediction performance roughly doubled, from an R² of 0.3364 for SEM to 0.6836 for ANN. Expressed as Cohen's (1988) effect size f², this improvement in R² corresponds to a very large effect of 110%. Second, a comparison of normalized variable importance across the two stages showed that variables with high importance in SEM also had high importance in ANN, whereas variables with little or no importance in SEM became important in ANN. This study is meaningful in that it increased the validity of the comparison by using the same training and evaluation method for both stages, and in that it compared the degree of improvement in prediction performance and the change in variable importance through SEM-ANN two-stage analysis.
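
The effect-size figure in this abstract can be checked with a short Python sketch. It assumes the 110% comes from the usual hierarchical form of Cohen's f², (R²_full − R²_base) / (1 − R²_full), which the abstract does not state explicitly; the function name `cohens_f2` is illustrative.

```python
def cohens_f2(r2_base: float, r2_full: float) -> float:
    """Cohen's f^2 for the gain in R^2 from a baseline to an improved model."""
    if not 0.0 <= r2_base <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_base <= r2_full < 1")
    return (r2_full - r2_base) / (1.0 - r2_full)

r2_sem, r2_ann = 0.3364, 0.6836  # values reported in the abstract
ratio = r2_ann / r2_sem          # how many times larger the ANN R^2 is
f2 = cohens_f2(r2_sem, r2_ann)
print(f"R2 ratio: {ratio:.2f}, f2: {f2:.3f}")
```

Under that assumption the ratio is about 2.03 and f² is about 1.10, which agrees with the "doubled" and "110%" statements.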

An importance sampling for a function of a multivariate random variable

  • Jae-Yeol Park;Hee-Geon Kang;Sunggon Kim
    • Communications for Statistical Applications and Methods / v.31 no.1 / pp.65-85 / 2024
  • The tail probability of a function of a multivariate random variable is not easy to estimate by crude Monte Carlo simulation. When it is rare for the function value to exceed a threshold, accurate estimation of the corresponding probability requires a huge number of samples. When the explicit form of the cumulative distribution function of each component of the variable is known, the inverse transform likelihood ratio method is a directly applicable scheme for estimating the tail probability efficiently. The method is a type of importance sampling, and its efficiency depends on the choice of the importance sampling distribution. For the case where the cumulative distribution of the multivariate random variable is represented by a copula and its marginal distributions, we develop an iterative algorithm to find the optimal importance sampling distribution and show that the algorithm converges. The performance of the proposed scheme is compared numerically with that of crude Monte Carlo simulation.
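
Why crude Monte Carlo struggles with rare events, and how importance sampling helps, can be sketched in Python on a deliberately simple case. This is not the inverse transform likelihood ratio method or the copula-based algorithm of the paper; it is plain importance sampling for a one-dimensional standard normal tail, with a mean-shifted normal chosen here as the sampling distribution.

```python
import math
import random

def tail_prob_crude(threshold: float, n: int, seed: int = 0) -> float:
    """Crude Monte Carlo estimate of P(Z > threshold) for Z ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n

def tail_prob_is(threshold: float, n: int, seed: int = 0) -> float:
    """Importance sampling estimate, drawing from N(threshold, 1) instead."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Likelihood ratio phi(x) / phi(x - threshold) of target to proposal.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n

exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form P(Z > 4)
est_mc = tail_prob_crude(4.0, 20000)
est_is = tail_prob_is(4.0, 20000)
print(f"exact={exact:.3e} crude={est_mc:.3e} importance={est_is:.3e}")
```

With 20,000 draws the crude estimator expects fewer than one exceedance, so its estimate is typically zero or far off, while the shifted sampler places about half its draws in the tail and reweights them.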

ACCOUNTING FOR IMPORTANCE OF VARIABLES IN MULTI-SENSOR DATA FUSION USING RANDOM FORESTS

  • Park No-Wook;Chi Kwang-Hoon
    • Proceedings of the KSRS Conference / 2005.10a / pp.283-285 / 2005
  • To account for the importance of variables in multi-sensor data fusion, random forests are applied to supervised land-cover classification. The random forests approach is a non-parametric ensemble classifier based on CART-like trees. Its distinguishing feature is that the importance of a variable can be estimated by randomly permuting that variable in all the out-of-bag samples for each classifier. Supervised classification with a multi-sensor remote sensing data set including optical and polarimetric SAR data was carried out to illustrate the applicability of random forests. In the experiment, the random forests approach extracted the important variables or bands for land-cover discrimination and showed good performance compared with other non-parametric data fusion algorithms.
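
The permutation step described in this abstract can be sketched in Python as follows. The sketch makes two simplifying assumptions: a hypothetical fixed rule (`predict`) stands in for a trained tree, and a single held-out sample stands in for the per-tree out-of-bag samples, so it shows only the permute-and-rescore idea rather than a full random forest.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after permuting each column in turn."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def predict(r):
    # Hypothetical classifier: band 0 dominates, band 1 helps a little,
    # band 2 is ignored entirely.
    return int(r[0] + 0.3 * r[1] > 0)

data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
labels = [predict(r) for r in rows]  # synthetic labels, for illustration only
importances = permutation_importance(predict, rows, labels)
print([round(v, 3) for v in importances])
```

Permuting a band the classifier relies on lowers accuracy sharply, while permuting the ignored band changes nothing, which is what lets the measure rank bands for land-cover discrimination.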

Input Variable Importance in Supervised Learning Models

  • Huh, Myung-Hoe;Lee, Yong Goo
    • Communications for Statistical Applications and Methods / v.10 no.1 / pp.239-246 / 2003
  • Statisticians and data miners are often asked to assess the importance of input variables in a given supervised learning model. For this purpose, one may rely on separate ad hoc measures that depend on the model type, such as linear regression, neural networks, or trees. Consequently, input variable importance measures lack conceptual consistency and cannot be used directly to compare different types of models, as is often done in data mining. In this short communication, we propose a unified approach to measuring the importance of input variables. Our method uses sensitivity analysis, which perturbs the values of the input variables and monitors the change in the output. The scope of this research is limited to models with continuous output, although it is not difficult to extend the method to supervised learning models for categorical outcomes.
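
The perturb-and-monitor idea in this abstract can be sketched in Python. The abstract does not give the authors' perturbation scheme, so this sketch assumes one: shift each input by one standard deviation, record the mean absolute change in the output, and normalize the scores to sum to one. The model `model` is a hypothetical black box with continuous output.

```python
import random
import statistics

def sensitivity_importance(model, rows, scale=1.0):
    """Normalized mean absolute output change when each input is shifted."""
    base = [model(r) for r in rows]
    scores = []
    for j in range(len(rows[0])):
        sd = statistics.pstdev([r[j] for r in rows])
        # Shift only column j by `scale` standard deviations.
        shifted = [r[:j] + [r[j] + scale * sd] + r[j + 1:] for r in rows]
        change = [abs(model(s) - b) for s, b in zip(shifted, base)]
        scores.append(sum(change) / len(rows))
    total = sum(scores) or 1.0  # avoid division by zero for a constant model
    return [s / total for s in scores]

def model(x):
    # Hypothetical black-box model; any callable with continuous output works.
    return 3.0 * x[0] + 0.5 * x[1] ** 2

rng = random.Random(7)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(300)]
scores = sensitivity_importance(model, rows)
print([round(s, 3) for s in scores])
```

Because the procedure only calls the model, the same function could be applied to a regression, a neural network, or a tree, which is the kind of cross-model comparability the abstract argues for.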

The Effect of Asset Specificity, Information Sharing, and a Collaborative Environment on Supply Chain Management (SCM): An Integrated SCM Performance Formation Model (자산전용성과 협업환경하에서의 정보공유가 공급사슬에 미치는 영향 : 통합적 SCM 성과형성 모델)

  • Kim, Tae-Ryong;Song, Jang-Gwen
    • Journal of Distribution Science / v.11 no.4 / pp.51-60 / 2013
  • Purpose - The objective of this paper is to investigate the effect of asset specificity, the level of information sharing, the importance of information sharing, and an integrated collaborative environment on supply chain performance. Research design, data, and methodology - Questionnaires were distributed to 250 companies that have business ties with Halla Climate Control Corporation. The empirical test of our hypotheses was based on statistical analysis using SPSS 18.0 and AMOS 18.0. We hypothesize that the asset specificity variable has positive effects on the level of information sharing, the importance of information sharing, and the integrated collaborative environment. Moreover, the level of information sharing and the importance of information sharing are strongly influenced by the integrated collaborative environment variable, and these, when combined, have an effect on the dependent variable, supply chain performance. We tested the hypothesized model using path analysis with latent variables. Results - According to the results of our analysis, hypothesis H1, which tests whether there is a relationship between asset specificity and the integrated collaborative environment, is supported at the 0.01 level. Hypotheses H2 and H3 were also confirmed, and asset specificity had positive effects (+) on the level of information sharing variable. The importance of the information sharing variable was statistically significant at the 0.01 level. Hypotheses H4 and H5 posited that the integrated collaborative environment variable would have a positive effect on the level of information sharing; the importance of information sharing variable was strongly supported statistically, with a significant p-value below. Moreover, the level of information sharing (H6) and the importance of information sharing (H7) variables also had a statistically significant influence on supply chain performance. As a result, the existence of a collaborative system between companies would influence supply chain performance by strengthening real-time information access and information sharing. Thus, it is important to construct a collaborative environment in which information sharing and cooperation among companies are possible. Conclusions - First, with rapid changes in the business environment, it becomes necessary for enterprises to acquire the right information in order to implement SCM properly. For successful SCM, firms should understand the importance of collaboration with supply chain partners and of an internally built collaboration system, which in turn will better promote a partnership commitment with suppliers as well as collaborative integration with buyers. A collaborative system, as we suggest in this paper, facilitates the maintenance of a long-term relationship of trust and can help reinforce information sharing. Second, it is necessary to increase information sharing over time via a collaborative system so that employees of the suppliers become aware of the system. The more proactive and positive the managerial group's attitude toward such a collaborative system, the higher the level of information sharing among users will be. Successful SCM performance is achieved by information sharing through a collaborative environment rather than by investing only in setting up an information system.

A Study on the Consumer Service of Retailing - focusing on the Apparel Product - (유통업체의 고객서비스에 관한 연구 -의류제품을 중심을-)

  • 이은숙
    • Journal of the Korea Fashion and Costume Design Association / v.4 no.2 / pp.31-45 / 2002
  • This study investigated whether self-monitoring, one of various individual trait variables, and demographic variables can explain differences in the importance that consumers attach to retail service levels for apparel products. The survey was conducted from February 6 to 16, 2002. The 118 responses were analyzed with SPSS for Windows 9.0 using Cronbach's alpha, factor analysis, one-way ANOVA, Duncan's test, frequencies, means, and percentages. The results were as follows. 1. Consumer service was classified into the attitude, confidence, and expert knowledge of the salesperson, product display, product information, product assortment, shopping environment, and lighting setup. 2. No significant differences in importance were found across consumer service dimensions by self-monitoring level. 3. No significant differences in importance were found across consumer service dimensions by demographic variables.

Application of Random Forests to Assessment of Importance of Variables in Multi-sensor Data Fusion for Land-cover Classification

  • Park No-Wook;Chi Kwang-Hoon
    • Korean Journal of Remote Sensing / v.22 no.3 / pp.211-219 / 2006
  • A random forests classifier is applied to multi-sensor data fusion for supervised land-cover classification in order to account for the importance of variables. The random forests approach is a non-parametric ensemble classifier based on CART-like trees. Its distinguishing feature is that the importance of a variable can be estimated by randomly permuting that variable in all the out-of-bag samples for each classifier. Two different multi-sensor data sets for supervised classification were used to illustrate the applicability of random forests: one with optical and polarimetric SAR data and the other with multi-temporal Radarsat-1 and ENVISAT ASAR data. In the experiments, the random forests approach extracted the important variables or bands for land-cover discrimination and showed reasonably good performance in terms of classification accuracy.

Comparison of Variable Importance Measures in Tree-based Classification (나무구조의 분류분석에서 변수 중요도에 대한 고찰)

  • Kim, Na-Young;Lee, Eun-Kyung
    • The Korean Journal of Applied Statistics / v.27 no.5 / pp.717-729 / 2014
  • The projection pursuit classification tree uses, at each node, a one-dimensional projection that best separates the classes. The projection coefficients contain information that distinguishes two groups of classes from each other and can be used to calculate an importance measure for each variable in the classification. This paper reviews variable importance measures, which are attracting increasing interest as data sizes grow. We compared the performance of the projection pursuit classification tree with that of the classification and regression tree (CART) and random forest. The projection pursuit classification tree was found to perform better in most cases, particularly with highly correlated variables. The importance measure of the projection pursuit classification tree performs slightly better than the importance measure of random forest.

Prediction of Landslides and Determination of Its Variable Importance Using AutoML (AutoML을 이용한 산사태 예측 및 변수 중요도 산정)

  • Nam, KoungHoon;Kim, Man-Il;Kwon, Oil;Wang, Fawu;Jeong, Gyo-Cheol
    • The Journal of Engineering Geology / v.30 no.3 / pp.315-325 / 2020
  • This study was performed to develop a model that predicts landslides and to determine the variable importance of landslide susceptibility factors, based on the probabilistic prediction of landslides occurring on slopes along roads. Field survey data from 30,615 slopes in Korea, collected from 2007 to 2020, were analyzed to develop the landslide prediction model. A total of 131 variables, comprising 17 topographic factors and 114 geological factors (including 89 bedrock types), were used to predict landslides. Automated machine learning (AutoML) was used to classify landslides and non-landslides. The verification results showed that the best model, an extremely randomized tree (XRT) with excellent predictive performance, achieved a prediction rate of 83.977% on the test data. The variable importance analysis of the landslide susceptibility factors yielded 10 topographic factors and 9 geological factors, with importance presented as a percentage for each factor. By deriving the ranking of variable importance using only on-site survey data, the model evaluates the likelihood of landslide occurrence probabilistically and quantitatively. This model is expected to provide decision-makers with a reliable basis for slope safety assessment through field surveys.