• Title/Summary/Keyword: Ensemble prediction

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.2
    • /
    • pp.153-166
    • /
    • 2021
  • A main goal of pharmacogenomics studies is to predict an individual's drug responsiveness from high-dimensional genetic variables. Because the number of variables is large, feature selection is required to reduce it. The selected features are then used to construct a predictive model with machine learning algorithms. In the present study, we applied several hybrid feature selection methods, combining logistic regression, ReliefF, TurF, random forest, and LASSO, to a next-generation sequencing data set of 400 epilepsy patients. We then applied the selected features to machine learning methods including random forest, gradient boosting, and support vector machine, as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performed better than the other combinations of approaches. Based on a 5-fold cross-validation partition, the mean test accuracy of the best model was 0.727 and its mean test AUC was 0.761. The stacking models also outperformed single machine learning predictive models when using the same selected features.
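
As a rough illustration of the evaluation protocol described in the abstract above, the following Python sketch partitions samples into 5 folds and computes an AUC from classifier scores. The function names `k_fold_indices` and `auc`, the shuffling seed, and the toy labels and scores are assumptions made for illustration; the paper's feature selection and stacking pipeline is not reproduced here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train, test) index pairs with disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 400 samples, as in the abstract; each test fold then holds 80 samples.
splits = k_fold_indices(400, k=5)
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # hypothetical scores
```

In a full experiment, a model would be trained on each `train` split, scored on the matching `test` split, and the per-fold accuracy and AUC averaged to obtain mean test values like those reported.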

Analysis and Prediction of (Ultra) Air Pollution based on Meteorological Data and Atmospheric Environment Data (기상 데이터와 대기 환경 데이터 기반 (초)미세먼지 분석과 예측)

  • Park, Hong-Jin
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.14 no.4
    • /
    • pp.328-337
    • /
    • 2021
  • Air pollution, a class 1 carcinogen like asbestos and benzene, is a cause of various diseases. The spread of ultra-air pollution is one of the important causes of the spread of the coronavirus. This paper analyzes and predicts fine dust and ultra-air pollution from 2015 to 2019 based on weather data for Seoul, such as average temperature, precipitation, and average wind speed, and atmospheric environment data such as SO2, NO2, and O3. The status of air pollution and ultra-air pollution is examined by season and month, and linear regression, SVM, and ensemble models are compared for predicting fine dust. In addition, important features (attributes) that affect the generation of fine dust and ultra-air pollution are identified. Ultra-air pollution was highest in March and lowest from August to September. Among the meteorological data alone, average temperature has the most influence on ultra-air pollution; when meteorological and atmospheric environment data are considered together, NO2 has the greatest effect on ultra-air pollution generation.

A Study on the Prediction Model for Sales of Women's Golfwear with Data Mining: Focus on Macroeconomic Factors and Consumer Sales Price (데이터마이닝을 적용한 여성 골프웨어 판매 예측 모델 연구: 거시경제요인과 소비자판매가격을 중심으로)

  • Han, Ki-Hyang
    • Journal of Digital Convergence
    • /
    • v.19 no.11
    • /
    • pp.445-456
    • /
    • 2021
  • The purpose of this study is to identify the importance of variables affecting women's golf wear sales, using macroeconomic variables and consumer selling prices that affect consumers' purchasing behavior, and to propose a price strategy to increase sales of golf wear. Data from domestic women's golf wear brands were analyzed using decision tree algorithms and ensemble methods. Consumer selling price is the most significant factor in sales volume for T-shirts, pants, and knits, while for skirts and one-piece dresses, categories were found to be the most important factors in addition to consumer selling price. These findings suggest that the economic variables affecting consumers' purchasing behavior differ by item, and that sales and profits can be maximized through appropriate price strategies.

Incremental Ensemble Learning for The Combination of Multiple Models of Locally Weighted Regression Using Genetic Algorithm (유전 알고리즘을 이용한 국소가중회귀의 다중모델 결합을 위한 점진적 앙상블 학습)

  • Kim, Sang Hun;Chung, Byung Hee;Lee, Gun Ho
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.9
    • /
    • pp.351-360
    • /
    • 2018
  • The LWR (Locally Weighted Regression) model, traditionally a lazy learning model, produces a prediction for a given input, the query point; it is a regression equation over a short interval, fitted by giving higher weights to data closer to the query point. We study an incremental ensemble learning approach for LWR, a form of lazy learning and memory-based learning. The proposed incremental ensemble learning method sequentially generates and integrates LWR models over time, using a genetic algorithm, to obtain a solution at a specific query point. A weakness of existing LWR approaches is that multiple LWR models can be generated depending on the indicator function and the data sample selected, and prediction quality varies with the model chosen. However, no research has addressed the selection or combination of multiple LWR models. In this study, after generating the initial LWR model according to the indicator function and the sample data set, we iterate an evolutionary learning process to obtain a proper indicator function and assess the LWR models on the other sample data sets to overcome data set bias. We adopt an eager learning method that generates and stores LWR models incrementally as data are generated for all sections. To obtain a prediction at a specific point in time, an LWR model is generated from newly generated data within a predetermined interval and then combined with the existing LWR models in that section using a genetic algorithm. The proposed method shows better results than selecting multiple LWR models and combining them by simple averaging.
The results of this study are compared with predictions from multiple regression analysis on real data such as hourly traffic volume in a specific area and hourly sales at a highway rest area.
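
To make the LWR idea in the abstract above concrete, here is a minimal Python sketch of a one-dimensional locally weighted linear regression, plus the simple-average combination that the abstract uses as its baseline. The Gaussian weighting, the `bandwidth` parameter, the function names, and the toy data are all assumptions for illustration. The abstract speaks of an indicator function and a genetic algorithm for combining models, and neither is reproduced here.

```python
import math


def lwr_predict(xs, ys, query, bandwidth):
    """Predict y at `query` with a weighted straight-line fit (1-D inputs)."""
    # Assumed Gaussian kernel: points closer to the query get higher weight.
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights underflowed; use a larger bandwidth")
    # Weighted means of x and y.
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    # Weighted least-squares slope; fall back to a flat line if x has no spread.
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return y_mean + slope * (query - x_mean)


def simple_average(predictions):
    """Baseline combination: unweighted mean of several model predictions."""
    return sum(predictions) / len(predictions)


# Toy data on the line y = 2x + 1 (hypothetical, not from the paper).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]
# Several LWR models that differ only in bandwidth, combined by simple averaging.
predictions = [lwr_predict(xs, ys, 2.5, b) for b in (0.5, 1.0, 2.0)]
combined = simple_average(predictions)
```

A weighted combination would replace `simple_average` with per-model weights. The abstract reports that searching for such a combination with a genetic algorithm gave better results than the simple average.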

Ensemble Learning for Solving Data Imbalance in Bankruptcy Prediction (기업부실 예측 데이터의 불균형 문제 해결을 위한 앙상블 학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems
    • /
    • v.15 no.3
    • /
    • pp.1-15
    • /
    • 2009
  • In a classification problem, data imbalance occurs when the number of instances in one class greatly exceeds the number in the other class. Such data sets often yield a default classifier with a skewed decision boundary, which reduces its classification accuracy. This paper proposes Geometric Mean-based Boosting (GM-Boost) to resolve the problem of data imbalance. Because GM-Boost introduces the notion of the geometric mean, its learning process considers both the majority and minority classes and reinforces learning on misclassified data. An empirical study of bankruptcy prediction for Korean companies shows that GM-Boost achieves higher classification accuracy than previous methods used for imbalanced data, including under-sampling, over-sampling, and AdaBoost, and that its learning performance is robust regardless of the degree of data imbalance.
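
The following Python sketch illustrates why a geometric mean is a natural yardstick under class imbalance. It computes plain accuracy and the geometric mean of sensitivity and specificity for a classifier that always predicts the majority class. This is the standard G-mean metric, shown under the assumption that it is the notion the abstract refers to. How GM-Boost uses it inside boosting is not stated in the abstract and is not implemented here. The function names and the 95/5 toy split are hypothetical.

```python
import math


def confusion_rates(labels, predictions):
    """Return (sensitivity, specificity, accuracy) for binary labels 0/1."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(labels)
    return sensitivity, specificity, accuracy


def geometric_mean(sensitivity, specificity):
    """G-mean: high only when both classes are classified well."""
    return math.sqrt(sensitivity * specificity)


# Hypothetical imbalanced set: 5 bankrupt firms (1) and 95 healthy firms (0).
labels = [1] * 5 + [0] * 95
always_healthy = [0] * 100  # a default classifier that ignores the minority class
sens, spec, acc = confusion_rates(labels, always_healthy)
g_mean = geometric_mean(sens, spec)
```

Here accuracy looks strong because the majority class dominates, while the G-mean collapses to zero because no minority instance is detected.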

Comparison of the Machine Learning Models Predicting Lithium-ion Battery Capacity for Remaining Useful Life Estimation (리튬이온 배터리 수명추정을 위한 용량예측 머신러닝 모델의 성능 비교)

  • Yoo, Sangwoo;Shin, Yongbeom;Shin, Dongil
    • Journal of the Korean Institute of Gas
    • /
    • v.24 no.6
    • /
    • pp.91-97
    • /
    • 2020
  • Lithium-ion batteries (LIBs) have a longer lifespan, higher energy density, and lower self-discharge rates than other batteries, so they are preferred for Energy Storage Systems (ESS). However, 28 ESS fire accidents occurred in Korea during 2017-2019, and accurate capacity estimation of LIBs is essential to ensure safety and reliability during operation. In this study, data-driven models that predict capacity changes over the charging cycles of an LIB were developed, and their performance was compared to select the optimal machine learning model among Decision Tree, Ensemble Learning Method, Support Vector Regression, and Gaussian Process Regression (GPR). Lithium battery test data provided by NASA were used for model training, and GPR showed the best prediction performance. Building on this study, we will develop an enhanced LIB capacity prediction and remaining useful life estimation model through additional data training, and improve anomaly detection and monitoring during operation, enabling safe and stable ESS operation.

Machine Learning Based Capacity Prediction Model of Terminal Maneuvering Area (기계학습 기반 접근관제구역 수용량 예측 모형)

  • Han, Sanghyok;Yun, Taegyeong;Kim, Sang Hyun
    • Journal of the Korean Society for Aeronautical & Space Sciences
    • /
    • v.50 no.3
    • /
    • pp.215-222
    • /
    • 2022
  • The purpose of air traffic flow management is to balance demand and capacity in the national airspace, and its performance relies on an accurate capacity prediction of the airport or airspace. This paper developed a regression model that predicts the number of aircraft actually departing and arriving in a terminal maneuvering area. The regression model is based on a boosting ensemble learning algorithm that learns past aircraft operational data such as time, weather, scheduled demand, and unfulfilled demand at a specific airport in the terminal maneuvering area. The developed model was tested using historical departure and arrival flight data at Incheon International Airport, and the coefficient of determination is greater than 0.95. Also, the capacity of the terminal maneuvering area of interest is implicitly predicted by using the model.
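
For reference, the coefficient of determination cited in the abstract above can be computed as in this short Python sketch. The function name `r_squared` and the hourly counts are hypothetical; the paper's boosting model and its data are not reproduced.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly counts of departures plus arrivals.
actual_counts = [30, 42, 55, 48, 36]
predicted_counts = [32, 40, 54, 49, 35]
score = r_squared(actual_counts, predicted_counts)
```

A value of 1.0 means the predictions match the observations exactly, and 0.0 means they do no better than always predicting the observed mean.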

A Study on Classification Models for Predicting Bankruptcy Based on XAI (XAI 기반 기업부도예측 분류모델 연구)

  • Jihong Kim;Nammee Moon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.8
    • /
    • pp.333-340
    • /
    • 2023
  • Efficient prediction of corporate bankruptcy is an important part of making appropriate lending decisions for financial institutions and reducing loan default rates. Many studies have used classification models based on artificial intelligence technology. In the financial industry, even if new predictive models perform excellently, they should be accompanied by an intuitive explanation of the basis on which the result was determined. Recently, the US, EU, and South Korea have all introduced a right to request explanations of algorithms, so transparency in the use of AI in the financial sector must be secured. In this paper, an artificial intelligence-based interpretable classification prediction model is proposed using publicly available corporate bankruptcy data. First, data preprocessing, 5-fold cross-validation, and related steps were performed, and classification performance was compared after optimizing 10 supervised learning classification models such as logistic regression, SVM, XGBoost, and LightGBM. As a result, LightGBM was confirmed as the best-performing model, and SHAP, an explainable artificial intelligence technique, was applied to provide a post-hoc explanation of the bankruptcy prediction process.

Predictive Models for the Tourism and Accommodation Industry in the Era of Smart Tourism: Focusing on the COVID-19 Pandemic (스마트관광 시대의 관광숙박업 영업 예측 모형: 코로나19 팬더믹을 중심으로)

  • Yu Jin Jo;Cha Mi Kim;Seung Yeon Son;Mi Jin Noh
    • Smart Media Journal
    • /
    • v.12 no.8
    • /
    • pp.18-25
    • /
    • 2023
  • The COVID-19 outbreak in 2020 caused continuous damage worldwide, and the smart tourism industry in particular was hit directly by the closure of air routes and restrictions on going out. With overseas and domestic travel significantly reduced, the number of tourist hotels closed because of continued deficits is increasing. Therefore, in this study, licensing data from the Ministry of Public Administration and Security were collected and visualized to understand the operating status of the tourism and lodging industry. A machine learning classification algorithm was applied to build a business status prediction model for tourist hotels, the model's performance was optimized using an ensemble algorithm, and the model was evaluated through 5-fold cross-validation. The survival rate of tourist hotels was predicted to decrease somewhat, but the actual survival rate was found to be no different from before COVID-19. The business status predictions for the hotel industry in this paper can serve as a basis for understanding the operability and development trends of the entire tourism and lodging industry.

Assessment of compressive strength of high-performance concrete using soft computing approaches

  • Chukwuemeka Daniel;Jitendra Khatti;Kamaldeep Singh Grover
    • Computers and Concrete
    • /
    • v.33 no.1
    • /
    • pp.55-75
    • /
    • 2024
  • The present study introduces an optimum performance soft computing model for predicting the compressive strength of high-performance concrete (HPC) by comparing, for the first time on a common database, models based on conventional (kernel-based, covariance function-based, and tree-based), advanced machine (least square support vector machine-LSSVM and minimax probability machine regressor-MPMR), and deep (artificial neural network-ANN) learning approaches. A compressive strength database with results for 1030 concrete samples has been compiled from the literature and preprocessed. For training, testing, and validation of the soft computing models, 803, 101, and 101 data points have been selected arbitrarily from the 1005 preprocessed data points. Thirteen performance metrics, including three new metrics (a20-index, index of agreement, and index of scatter), have been implemented for each model. The performance comparison reveals that the SVM (kernel-based), ET (tree-based), MPMR (advanced), and ANN (deep) models achieved higher performance in predicting the compressive strength of HPC. From the overall analysis of performance, accuracy, Taylor plot, accuracy metric, regression error characteristics curve, Anderson-Darling, Wilcoxon, uncertainty, and reliability, model CS4, based on the ensemble tree, is identified as the optimum performance model, with a correlation coefficient of 0.9352, root mean square error of 5.76 MPa, and mean absolute error of 4.1069 MPa. The present study also reveals that multicollinearity affects the prediction accuracy of the Gaussian process regression, decision tree, multilinear regression, and adaptive boosting regressor models, a novel finding in compressive strength prediction of HPC.
The cosine sensitivity analysis reveals that the prediction of compressive strength of HPC is highly affected by cement content, fine aggregate, coarse aggregate, and water content.
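
The metrics named in the abstract above can be sketched in Python as follows. The formulas are the commonly used definitions and are assumptions here, since the abstract does not spell them out: the a20-index is taken as the share of samples whose predicted-to-measured ratio lies in [0.8, 1.2], the index of agreement as Willmott's d, and the index of scatter as RMSE divided by the mean measured value. The strength values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def a20_index(actual, predicted):
    """Assumed definition: fraction of samples with 0.8 <= predicted/actual <= 1.2."""
    hits = sum(1 for a, p in zip(actual, predicted) if 0.8 <= p / a <= 1.2)
    return hits / len(actual)


def index_of_agreement(actual, predicted):
    """Assumed definition: Willmott's d, where 1.0 is perfect agreement."""
    mean_actual = sum(actual) / len(actual)
    num = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    den = sum((abs(p - mean_actual) + abs(a - mean_actual)) ** 2
              for a, p in zip(actual, predicted))
    return 1.0 - num / den


def index_of_scatter(actual, predicted):
    """Assumed definition: RMSE normalized by the mean measured value."""
    return rmse(actual, predicted) / (sum(actual) / len(actual))


# Hypothetical compressive strengths in MPa (measured vs. predicted).
measured = [50.0, 40.0, 30.0, 20.0]
estimated = [55.0, 30.0, 31.0, 20.0]
report = {
    "rmse": rmse(measured, estimated),
    "mae": mae(measured, estimated),
    "a20": a20_index(measured, estimated),
    "ioa": index_of_agreement(measured, estimated),
    "ios": index_of_scatter(measured, estimated),
}
```

RMSE and MAE are in the same units as the strengths, while the other three are dimensionless, which makes them easier to compare across databases.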