• Title/Summary/Keyword: Gradient Boosting Regression

Ensemble Learning-Based Prediction of Good Sellers in Overseas Sales of Domestic Books and Keyword Analysis of Reviews of the Good Sellers (앙상블 학습 기반 국내 도서의 해외 판매 굿셀러 예측 및 굿셀러 리뷰 키워드 분석)

  • Do Young Kim;Na Yeon Kim;Hyon Hee Kim
    • KIPS Transactions on Software and Data Engineering / v.12 no.4 / pp.173-178 / 2023
  • As Korean literature spreads around the world, its position in the overseas publishing market has become important. As demand in the overseas publishing market continues to grow, it is essential to predict future book sales and analyze the characteristics of books that have been highly favored by overseas readers in the past. In this study, we propose an ensemble-learning-based prediction model and analyze the characteristics of good sellers, defined as books published overseas over the past 5 years with cumulative sales of more than 5,000 copies. We applied five ensemble learning models (XGBoost, Gradient Boosting, AdaBoost, LightGBM, and Random Forest) and compared them with other machine learning algorithms (Support Vector Machine, Logistic Regression, and Deep Learning). Our experimental results showed that the ensemble algorithms outperform the other approaches in handling imbalanced data. In particular, the LightGBM model obtained the best prediction performance, with an AUC of 99.86%. Among the features used for prediction, the most important is the author's number of overseas publications, and the second most important is publication in countries with the largest publication market size. The number of evaluation participants is also an important feature. In addition, text mining was performed on reviews of the four best-selling books among the good sellers. Many reviews focused on stories, characters, and writers, and because the keyword "translation" appears frequently in low-rated reviews, support for translation appears to be needed.
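The abstract above reports AUC on imbalanced data. As a reminder of what that metric measures, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs the model ranks correctly. The function name `roc_auc` and the labels and scores are illustrative assumptions, not data or code from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up imbalanced example: 2 "good sellers" (1) among 8 books.
labels = [1, 0, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.1, 0.3, 0.8, 0.7, 0.05]
auc = roc_auc(labels, scores)

# Accuracy of a trivial "always predict 0" classifier on the same labels.
always_negative_accuracy = labels.count(0) / len(labels)
```

In this toy case a classifier that never predicts a good seller is still 75% accurate, which is why a ranking metric such as AUC is often preferred when classes are imbalanced.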

Comparison of Error Rate and Prediction of Compression Index of Clay to Machine Learning Models using Orange Mining (오렌지마이닝을 활용한 기계학습 모델별 점토 압축지수의 오차율 및 예측 비교)

  • Yoo-Jae Woong;Woo-Young Kim;Tae-Hyung Kim
    • Journal of the Korean Geosynthetics Society / v.23 no.3 / pp.15-22 / 2024
  • Predicting ground settlement is a crucial factor during the improvement of soft ground and the construction of a structure. Numerous studies have been conducted, and many prediction equations have been proposed to estimate settlement. Settlement can be calculated using the compression index of clay. In this study, data on water content, void ratio, liquid limit, plastic limit, and compression index from the Busan New Port area were collected to construct a dataset. Correlation analysis was conducted among the collected data. Machine learning algorithms, including Random Forest, Neural Network, Linear Regression, AdaBoost, and Gradient Boosting, were applied using the Orange mining program to propose compression index prediction models. The models' results were evaluated by comparing RMSE and MAPE values, which indicate error rates, and R² values, which indicate goodness of fit. Water content showed the highest correlation with the compression index, while the plastic limit showed a somewhat lower correlation than the other characteristics. Among the compared models, the AdaBoost model performed best, with the lowest error rate and a large coefficient of determination.
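The three evaluation metrics named in the abstract above can be sketched as follows. This is a minimal illustration using the standard definitions of RMSE, MAPE, and R²; the function names and the compression index values are made up and are not taken from the study or from Orange.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no true value is zero."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up compression index values, for illustration only.
cc_measured = [0.50, 0.80, 1.00, 1.20]
cc_predicted = [0.55, 0.76, 1.10, 1.14]

print(rmse(cc_measured, cc_predicted))
print(mape(cc_measured, cc_predicted))
print(r2(cc_measured, cc_predicted))
```

Lower RMSE and MAPE and an R² closer to 1 indicate a better fit, which is the sense in which the abstract ranks the models.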

Analysis of Important Indicators of TCB Using GBM (일반화가속모형을 이용한 기술신용평가 주요 지표 분석)

  • Jeon, Woo-Jeong(Michael);Seo, Young-Wook
    • The Journal of Society for e-Business Studies / v.22 no.4 / pp.159-173 / 2017
  • To provide technology-based financial support to small and medium-sized venture companies, the government implemented TCB evaluation, a kind of technology rating evaluation, carried out by Kibo and a qualified private TCB. In this paper, we briefly review the current state of TCB evaluation and the available technology evaluation indicators accumulated in the Korea Credit Information Services (TDB), and then use multiple regression techniques to explore the indicators that have a significant effect on the technology rating score. The relative importance and classification accuracy of the indicators were then calculated by applying the key indicators as independent features to the generalized boosting model (GBM), a representative machine learning classifier, and the influence of each indicator and the fit of each model were compared. The analysis showed that relative importance did not differ significantly between the two models. However, the GBM model placed more weight on the InnoBiz certification, R&D department, patent registration, and venture confirmation indicators than the regression model did.

Comparative Study of Data Preprocessing and ML&DL Model Combination for Daily Dam Inflow Prediction (댐 일유입량 예측을 위한 데이터 전처리와 머신러닝&딥러닝 모델 조합의 비교연구)

  • Youngsik Jo;Kwansue Jung
    • Proceedings of the Korea Water Resources Association Conference / 2023.05a / pp.358-358 / 2023
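  • In this study, we used representative machine learning and deep learning (ML&DL) models that have been applied to rainfall-runoff analysis in the water resources field, and compared daily inflow prediction performance across scenarios that combine data characteristics with ML&DL models. The comparison covers not only hyperparameter tuning but also combinations and preprocessing (lag time, moving average, etc.) of meteorological and hydrological data chosen to suit each model's characteristics. For the Soyanggang Dam basin, using meteorological and hydrological data accumulated from 1974 to 2021, we considered 1) rainfall, 2) inflow, and 3) meteorological data as the main influencing (independent) variables, and applied a) lag time, b) moving average, and c) inflow component separation conditions, giving a total of 36 scenario combinations used as ML&DL inputs. The ML&DL models comprised seven ML methods, 1) Linear Regression (LR), 2) Lasso, 3) Ridge, 4) SVR (Support Vector Regression), 5) Random Forest (RF), 6) LGBM (Light Gradient Boosting Model), and 7) XGBoost, and three DL methods, 8) LSTM (Long Short-Term Memory models), 9) TCN (Temporal Convolutional Network), and 10) LSTM-TCN. These ten models were compared to identify, together with a performance evaluation, the most suitable data combination characteristics and ML&DL model for daily inflow prediction. Comparing the inflow predictions of the trained models for the Soyanggang Dam basin, TCN performed best among the deep learning models (TCN>TCN-LSTM>LSTM), Random Forest and LGBM performed well among the tree-based machine learning models (RF, LGBM>XGB), and SVR also performed at the level of LGBM. The three regression models (LR, Lasso, and Ridge) performed relatively poorly. In addition, across the 36 combinations of rainfall, inflow, and meteorological series for Soyanggang Dam inflow prediction, combinations using the rainfall series with lag time applied gave an NSE (Nash-Sutcliffe Efficiency) of 0.8 or higher (maximum 0.867) for all models except the three regression models, and combining the lag-time-applied rainfall and inflow series gave better performance, with an NSE of 0.85 or higher (maximum 0.901).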
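The preprocessing steps and the NSE score mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming a plain daily series held in a Python list; the helper names `add_lag`, `moving_average`, and `nse` and all values are hypothetical and are not taken from the study.

```python
def add_lag(series, lag):
    """Shift a series by `lag` steps; the first `lag` entries have no value (None)."""
    return [None] * lag + series[:len(series) - lag]

def moving_average(series, window):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den

# Made-up daily rainfall and inflow values, for illustration only.
rain = [0.0, 12.0, 30.0, 6.0, 0.0]
rain_lag1 = add_lag(rain, 1)        # yesterday's rainfall as a feature
rain_ma3 = moving_average(rain, 3)  # 3-day trailing mean rainfall

inflow_obs = [10.0, 20.0, 30.0, 40.0]
inflow_sim = [12.0, 18.0, 33.0, 37.0]
score = nse(inflow_obs, inflow_sim)
```

An NSE of 1 means the simulation matches the observations exactly, and 0 means it does no better than predicting the observed mean, so the reported values of 0.8 to 0.9 sit near the upper end of that scale.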

Machine Learning Algorithm for Estimating Ink Usage (머신러닝을 통한 잉크 필요량 예측 알고리즘)

  • Se Wook Kwon;Young Joo Hyun;Hyun Chul Tae
    • Journal of Korean Society of Industrial and Systems Engineering / v.46 no.1 / pp.23-31 / 2023
  • Research and interest in sustainable printing are increasing in the packaging printing industry. Currently, the amount of ink required for each job is predicted from the experience and intuition of field workers. If more ink is produced than necessary, the remainder cannot be reused and is discarded, adversely affecting the company's productivity and the environment. Machine learning models can now be used to address this problem. This study compares machine learning models for predicting ink usage. A simple linear model such as multiple regression analysis cannot reflect the nonlinear relationships between the variables involved in packaging printing, so it is limited in how accurately it can predict the amount of ink needed. This study therefore built several prediction models based on CART (Classification and Regression Tree): Decision Tree, Random Forest, Gradient Boosting Machine, and XGBoost. The accuracy of the models is assessed by K-fold cross-validation, using root mean squared error, mean absolute error, and R-squared as evaluation metrics. Among these models, XGBoost has the highest prediction accuracy and can reduce wasted ink by 2,134 g per job. The study thus demonstrates the potential of machine learning to improve productivity and protect the environment.
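The K-fold cross-validation mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming contiguous, unshuffled folds and a placeholder model that always predicts the training mean; the study's tree-based models would take its place. The function names and the ink quantities are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_validated_mae(y, k):
    """Average mean absolute error over k folds for a predict-the-mean placeholder."""
    fold_maes = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        # "Training" the placeholder: remember the mean of the training targets.
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        errors = [abs(y[i] - prediction) for i in test_idx]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / len(fold_maes)

# Made-up ink usage per job in grams, for illustration only.
ink_used = [100.0, 120.0, 90.0, 110.0, 130.0, 95.0]
folds = list(k_fold_indices(len(ink_used), 3))
cv_mae = cross_validated_mae(ink_used, 3)
```

Each sample is held out exactly once, so the averaged error reflects performance on data the model did not see during fitting.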

A Study on Design of Real-time Big Data Collection and Analysis System based on OPC-UA for Smart Manufacturing of Machine Working

  • Kim, Jaepyo;Kim, Youngjoo;Kim, Seungcheon
    • International Journal of Internet, Broadcasting and Communication / v.13 no.4 / pp.121-128 / 2021
  • To design a real-time big data collection and analysis system for manufacturing data in a smart factory, it is important to establish an appropriate wired/wireless communication system and protocol. This paper introduces client/server functions based on the latest communication protocol, OPC-UA (Open Platform Communications Unified Architecture), and applies user interface technology to configure a network for real-time data collection through IoT integration. A database is then designed in the MES (Manufacturing Execution System) based on an analysis table that reflects the user's requirements among the data extracted from the new cutting process automation process, the bush inner diameter indentation measurement system, and the tool monitoring/inspection system. In summary, the big data analysis system introduced in this paper performs SPC (Statistical Process Control) analysis and visualization analysis through an interface to OPC-UA-based wired/wireless communication. Through AI learning modeling with the XGBoost (eXtreme Gradient Boosting) and LR (Linear Regression) algorithms, quality and visualization analysis is carried out, along with storage and connection to the cloud.

Predicting Reports of Theft in Businesses via Machine Learning

  • JungIn, Seo;JeongHyeon, Chang
    • International Journal of Advanced Culture Technology / v.10 no.4 / pp.499-510 / 2022
  • This study examines the reporting factors of crime against business in Korea and proposes a corresponding predictive model using machine learning. While many previous studies focused on the individual factors of theft victims, there is a lack of evidence on the reporting factors of crime against a business that serves the public good as opposed to those that protect private property. Therefore, we propose a crime prevention model based on the factors behind businesses' willingness to report theft. This study used data collected through the 2015 Commercial Crime Damage Survey conducted by the Korea Institute for Criminal Policy. It analyzed data from 834 businesses that had experienced theft during a 2016 crime investigation. The data showed a class imbalance problem. To address it, we jointly applied the Synthetic Minority Over-sampling Technique (SMOTE) and Tomek links to the training data. Two kinds of prediction models were implemented. One was a statistical model using logistic regression and elastic net. The other comprised a support vector machine model, tree-based machine learning models (e.g., random forest, extreme gradient boosting), and a stacking model. As a result, the features of theft price, invasion, and remedy, which are known to have significant effects on reporting theft offences, can be predicted as determinants of such offences in companies. Finally, we verified and compared the proposed predictive models using several popular metrics. Based on our evaluation of the importance of the features used in each model, we suggest a more accurate criterion for predicting var.

Predicting the compressive strength of SCC containing nano silica using surrogate machine learning algorithms

  • Neeraj Kumar Shukla;Aman Garg;Javed Bhutto;Mona Aggarwal;Mohamed Abbas;Hany S. Hussein;Rajesh Verma;T.M. Yunus Khan
    • Computers and Concrete / v.32 no.4 / pp.373-381 / 2023
  • Fly ash, granulated blast furnace slag, marble waste powder, and similar materials are among the by-products of other sectors that the construction industry is looking to incorporate into the many types of concrete it produces. This research uses surrogate machine learning methods to forecast the compressive strength of self-compacting concrete (SCC). The surrogate models were developed using Gradient Boosting Machine (GBM), Support Vector Machine (SVM), Random Forest (RF), and Gaussian Process Regression (GPR) techniques. Compressive strength is the output variable, with nano silica content, cement content, coarse aggregate content, fine aggregate content, superplasticizer, curing duration, and water-binder ratio as input variables. Of the four models, GBM had the highest accuracy in determining the compressive strength of SCC, and GPR predicted it least accurately. The compressive strength of SCC with nano silica is found to be most affected by curing time and least affected by fine aggregate.

Developing a Predictive Model of Young Job Seekers' Preference for Hidden Champions Using Machine Learning and Analyzing the Relative Importance of Preference Factors (머신러닝을 활용한 청년 구직자의 강소기업 선호 예측모형 개발 및 요인별 상대적 중요도 분석)

  • Cho, Yoon Ju;Kim, Jin Soo;Bae, Hwan seok;Yang, Sung-Byung;Yoon, Sang-Hyeak
    • The Journal of Information Systems / v.32 no.4 / pp.229-245 / 2023
  • Purpose: This study aims to understand the inclinations of young job seekers towards "hidden champions" - small but competitive companies that are emerging as potential solutions to the growing disparity between youth-targeted job vacancies and job seekers. We utilize machine learning techniques to discern the appeal of these hidden champions. Design/methodology/approach: We examined the characteristics of small and medium-sized enterprises using data sourced from the Ministry of Employment and Labor and Youth Worknet. By comparing the efficacy of five machine learning classification models (i.e., Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier, LGBM Classifier, and XGB Classifier), we discovered that the predictive model utilizing the LGBM Classifier yielded the most consistent performance. Findings: Our analysis of the relative significance of preference determinants revealed that industry type, geographical location, and employee count are pivotal factors influencing preference. Drawing from these insights, we propose targeted strategic interventions for policymakers, hidden champions, and young job seekers.

Analyzing Key Variables in Network Attack Classification on NSL-KDD Dataset using SHAP (SHAP 기반 NSL-KDD 네트워크 공격 분류의 주요 변수 분석)

  • Sang-duk Lee;Dae-gyu Kim;Chang Soo Kim
    • Journal of the Society of Disaster Information / v.19 no.4 / pp.924-935 / 2023
  • Purpose: The central aim of this study is to leverage machine learning techniques for the classification of Intrusion Detection System (IDS) data, with a specific focus on identifying the variables responsible for enhancing overall performance. Method: First, we classified 'R2L (Remote to Local)' and 'U2R (User to Root)' attacks in the NSL-KDD dataset, which are difficult to detect due to class imbalance, using seven machine learning models, including Logistic Regression (LR) and K-Nearest Neighbor (KNN). Next, we applied SHapley Additive exPlanations (SHAP) to the two classification models that showed high performance, Random Forest (RF) and Light Gradient-Boosting Machine (LGBM), to check the importance of the variables that affect classification in each model. Result: The 'service' variable was confirmed to be the most important for RF, and the 'dst_host_srv_count' variable for LGBM. These pivotal variables serve as key factors capable of enhancing classification performance for each respective model. Conclusion: This paper identifies RF and LGBM as the best-performing models for classifying 'R2L' and 'U2R' attacks and elucidates the crucial variables associated with each selected model.