• Title/Summary/Keyword: Ensemble decision tree

Search Result 73, Processing Time 0.022 seconds

Pruning the Boosting Ensemble of Decision Trees

  • Yoon, Young-Joo;Song, Moon-Sup
    • Communications for Statistical Applications and Methods
    • /
    • v.13 no.2
    • /
    • pp.449-466
    • /
    • 2006
  • We propose to use variable selection methods based on penalized regression for pruning decision tree ensembles. Pruning methods based on LASSO and SCAD are compared with the cluster pruning method. Comparative studies are performed on some artificial datasets and real datasets. According to the results of comparative studies, the proposed methods based on penalized regression reduce the size of boosting ensembles without decreasing accuracy significantly and have better performance than the cluster pruning method. In terms of classification noise, the proposed pruning methods can mitigate the weakness of AdaBoost to some degree.

A New Incremental Learning Algorithm with Probabilistic Weights Using Extended Data Expression

  • Yang, Kwangmo;Kolesnikova, Anastasiya;Lee, Won Don
    • Journal of information and communication convergence engineering
    • /
    • v.11 no.4
    • /
    • pp.258-267
    • /
    • 2013
  • New incremental learning algorithm using extended data expression, based on probabilistic compounding, is presented in this paper. Incremental learning algorithm generates an ensemble of weak classifiers and compounds these classifiers to a strong classifier, using a weighted majority voting, to improve classification performance. We introduce new probabilistic weighted majority voting founded on extended data expression. In this case class distribution of the output is used to compound classifiers. UChoo, a decision tree classifier for extended data expression, is used as a base classifier, as it allows obtaining extended output expression that defines class distribution of the output. Extended data expression and UChoo classifier are powerful techniques in classification and rule refinement problem. In this paper extended data expression is applied to obtain probabilistic results with probabilistic majority voting. To show performance advantages, new algorithm is compared with Learn++, an incremental ensemble-based algorithm.

Diabetes prediction mechanism using machine learning model based on patient IQR outlier and correlation coefficient (환자 IQR 이상치와 상관계수 기반의 머신러닝 모델을 이용한 당뇨병 예측 메커니즘)

  • Jung, Juho;Lee, Naeun;Kim, Sumin;Seo, Gaeun;Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.10
    • /
    • pp.1296-1301
    • /
    • 2021
  • With the recent increase in diabetes incidence worldwide, research has been conducted to predict diabetes through various machine learning and deep learning technologies. In this work, we present a model for predicting diabetes using machine learning techniques with German Frankfurt Hospital data. We apply outlier handling using Interquartile Range (IQR) techniques and Pearson correlation and compare model-specific diabetes prediction performance with Decision Tree, Random Forest, Knn (k-nearest neighbor), SVM (support vector machine), Bayesian Network, ensemble techniques XGBoost, Voting, and Stacking. As a result of the study, the XGBoost technique showed the best performance with 97% accuracy on top of the various scenarios. Therefore, this study is meaningful in that the model can be used to accurately predict and prevent diabetes prevalent in modern society.

Learning to Prevent Inactive Student of Indonesia Open University

  • Tama, Bayu Adhi
    • Journal of Information Processing Systems
    • /
    • v.11 no.2
    • /
    • pp.165-172
    • /
    • 2015
  • The inactive student rate is becoming a major problem in most open universities worldwide. In Indonesia, roughly 36% of students were found to be inactive, in 2005. Data mining had been successfully employed to solve problems in many domains, such as for educational purposes. We are proposing a method for preventing inactive students by mining knowledge from student record systems with several state of the art ensemble methods, such as Bagging, AdaBoost, Random Subspace, Random Forest, and Rotation Forest. The most influential attributes, as well as demographic attributes (marital status and employment), were successfully obtained which were affecting student of being inactive. The complexity and accuracy of classification techniques were also compared and the experimental results show that Rotation Forest, with decision tree as the base-classifier, denotes the best performance compared to other classifiers.

Forecasting Energy Consumption of Steel Industry Using Regression Model (회귀 모델을 활용한 철강 기업의 에너지 소비 예측)

  • Sung-Ho KANG;Hyun-Ki KIM
    • Journal of Korea Artificial Intelligence Association
    • /
    • v.1 no.2
    • /
    • pp.21-25
    • /
    • 2023
  • The purpose of this study was to compare the performance using multiple regression models to predict the energy consumption of steel industry. Specific independent variables were selected in consideration of correlation among various attributes such as CO2 concentration, NSM, Week Status, Day of week, and Load Type, and preprocessing was performed to solve the multicollinearity problem. In data preprocessing, we evaluated linear and nonlinear relationships between each attribute through correlation analysis. In particular, we decided to select variables with high correlation and include appropriate variables in the final model to prevent multicollinearity problems. Among the many regression models learned, Boosted Decision Tree Regression showed the best predictive performance. Ensemble learning in this model was able to effectively learn complex patterns while preventing overfitting by combining multiple decision trees. Consequently, these predictive models are expected to provide important information for improving energy efficiency and management decision-making at steel industry. In the future, we plan to improve the performance of the model by collecting more data and extending variables, and the application of the model considering interactions with external factors will also be considered.

The Effect of Input Variables Clustering on the Characteristics of Ensemble Machine Learning Model for Water Quality Prediction (입력자료 군집화에 따른 앙상블 머신러닝 모형의 수질예측 특성 연구)

  • Park, Jungsu
    • Journal of Korean Society on Water Environment
    • /
    • v.37 no.5
    • /
    • pp.335-343
    • /
    • 2021
  • Water quality prediction is essential for the proper management of water supply systems. Increased suspended sediment concentration (SSC) has various effects on water supply systems such as increased treatment cost and consequently, there have been various efforts to develop a model for predicting SSC. However, SSC is affected by both the natural and anthropogenic environment, making it challenging to predict SSC. Recently, advanced machine learning models have increasingly been used for water quality prediction. This study developed an ensemble machine learning model to predict SSC using the XGBoost (XGB) algorithm. The observed discharge (Q) and SSC in two fields monitoring stations were used to develop the model. The input variables were clustered in two groups with low and high ranges of Q using the k-means clustering algorithm. Then each group of data was separately used to optimize XGB (Model 1). The model performance was compared with that of the XGB model using the entire data (Model 2). The models were evaluated by mean squared error-ob servation standard deviation ratio (RSR) and root mean squared error. The RSR were 0.51 and 0.57 in the two monitoring stations for Model 2, respectively, while the model performance improved to RSR 0.46 and 0.55, respectively, for Model 1.

Enhancing prediction accuracy of concrete compressive strength using stacking ensemble machine learning

  • Yunpeng Zhao;Dimitrios Goulias;Setare Saremi
    • Computers and Concrete
    • /
    • v.32 no.3
    • /
    • pp.233-246
    • /
    • 2023
  • Accurate prediction of concrete compressive strength can minimize the need for extensive, time-consuming, and costly mixture optimization testing and analysis. This study attempts to enhance the prediction accuracy of compressive strength using stacking ensemble machine learning (ML) with feature engineering techniques. Seven alternative ML models of increasing complexity were implemented and compared, including linear regression, SVM, decision tree, multiple layer perceptron, random forest, Xgboost and Adaboost. To further improve the prediction accuracy, a ML pipeline was proposed in which the feature engineering technique was implemented, and a two-layer stacked model was developed. The k-fold cross-validation approach was employed to optimize model parameters and train the stacked model. The stacked model showed superior performance in predicting concrete compressive strength with a correlation of determination (R2) of 0.985. Feature (i.e., variable) importance was determined to demonstrate how useful the synthetic features are in prediction and provide better interpretability of the data and the model. The methodology in this study promotes a more thorough assessment of alternative ML algorithms and rather than focusing on any single ML model type for concrete compressive strength prediction.

A Grey Wolf Optimized- Stacked Ensemble Approach for Nitrate Contamination Prediction in Cauvery Delta

  • Kalaivanan K;Vellingiri J
    • Economic and Environmental Geology
    • /
    • v.57 no.3
    • /
    • pp.329-342
    • /
    • 2024
  • The exponential increase in nitrate pollution of river water poses an immediate threat to public health and the environment. This contamination is primarily due to various human activities, which include the overuse of nitrogenous fertilizers in agriculture and the discharge of nitrate-rich industrial effluents into rivers. As a result, the accurate prediction and identification of contaminated areas has become a crucial and challenging task for researchers. To solve these problems, this work leads to the prediction of nitrate contamination using machine learning approaches. This paper presents a novel approach known as Grey Wolf Optimizer (GWO) based on the Stacked Ensemble approach for predicting nitrate pollution in the Cauvery Delta region of Tamilnadu, India. The proposed method is evaluated using a Cauvery River dataset from the Tamilnadu Pollution Control Board. The proposed method shows excellent performance, achieving an accuracy of 93.31%, a precision of 93%, a sensitivity of 97.53%, a specificity of 94.28%, an F1-score of 95.23%, and an ROC score of 95%. These impressive results underline the demonstration of the proposed method in accurately predicting nitrate pollution in river water and ultimately help to make informed decisions to tackle these critical environmental problems.

A Study on the Prediction Model for Sales of Women's Golfwear with Data Mining: Focus on Macroeconomic Factors and Consumer Sales Price (데이터마이닝을 적용한 여성 골프웨어 판매 예측 모델 연구: 거시경제요인과 소비자판매가격을 중심으로)

  • Han, Ki-Hyang
    • Journal of Digital Convergence
    • /
    • v.19 no.11
    • /
    • pp.445-456
    • /
    • 2021
  • The purpose of this study is to identify the importance of variables affecting women's golf wear sales with macroeconomic variables and consumer selling prices that affect consumers' purchasing behavior, and to propose a price strategy to increase sales of golf wear. Data of domestic women's golf wear brands were analyzed using decision tree algorithms and ensemble. Consumer selling price is the most significant factors in terms of sales volume for T-shirt, pants and knit, while categories were found to be the most important factors in addition to consumer sales prices for skirt and one piece dress. These findings suggest that items have different economic variables that affect consumers' purchasing behavior, suggesting that sales and profits can be maximized through appropriate price strategies.

The study of foreign exchange trading revenue model using decision tree and gradient boosting (외환거래에서 의사결정나무와 그래디언트 부스팅을 이용한 수익 모형 연구)

  • Jung, Ji Hyeon;Min, Dae Kee
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.1
    • /
    • pp.161-170
    • /
    • 2013
  • The FX (Foreign Exchange) is a form of exchange for the global decentralized trading of international currencies. The simple sense of Forex is simultaneous purchase and sale of the currency or the exchange of one country's currency for other countries'. We can find the consistent rules of trading by comparing the gradient boosting method and the decision trees methods. Methods such as time series analysis used for the prediction of financial markets have advantage of the long-term forecasting model. On the other hand, it is difficult to reflect the rapidly changing price fluctuations in the short term. Therefore, in this study, gradient boosting method and decision tree method are applied to analyze the short-term data in order to make the rules for the revenue structure of the FX market and evaluated the stability and the prediction of the model.