• Title/Summary/Keyword: random forest model

The Prediction of Export Credit Guarantee Accident using Machine Learning (기계학습을 이용한 수출신용보증 사고예측)

  • Cho, Jaeyoung;Joo, Jihwan;Han, Ingoo
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.83-102 / 2021
  • The government recently announced various policies for developing the big-data and artificial-intelligence fields, providing a great opportunity to the public through the disclosure of high-quality data held by public institutions. KSURE (Korea Trade Insurance Corporation) is a major public institution for financial policy in Korea, and the company is strongly committed to backing export companies through various systems. Nevertheless, there are still few realized business models based on big-data analysis. In this situation, this paper aims to develop a new business model that can be applied to ex-ante prediction of the likelihood of credit guarantee insurance accidents. We utilize internal data from KSURE, which supports export companies in Korea, apply machine learning models, and compare the performance of the predictive models: Logistic Regression, Random Forest, XGBoost, LightGBM, and DNN (Deep Neural Network). For decades, researchers have tried to find better models for predicting bankruptcy, since ex-ante prediction is crucial for corporate managers, investors, creditors, and other stakeholders. Research on predicting financial distress or bankruptcy originated with Smith (1930), Fitzpatrick (1932), and Merwin (1942). One of the most famous models is Altman's Z-score model (Altman, 1968), which is based on multiple discriminant analysis and is still widely used in both research and practice; it uses five key financial ratios to predict the probability of bankruptcy within the next two years. Ohlson (1980) introduced a logit model to complement some limitations of the earlier models, and Elmer and Borowski (1988) developed and examined a rule-based, automated system for the financial analysis of savings and loans. Since the 1980s, researchers in Korea have also examined the prediction of financial distress or bankruptcy: Kim (1987) analyzed financial ratios and developed a prediction model; Han et al. (1995, 1996, 1997, 2003, 2005, 2006) constructed prediction models using various techniques, including artificial neural networks; Yang (1996) applied multiple discriminant analysis and a logit model; and Kim and Kim (2001) used artificial neural networks for ex-ante prediction of insolvent enterprises. Since then, many scholars have tried to predict financial distress or bankruptcy more precisely with diverse models such as Random Forest or SVM. One major distinction of our research from previous work is that we focus on the predicted probability of default for each sample case, not only on the overall classification accuracy of each model. Most predictive models in this paper achieve a classification accuracy of about 70% on the entire sample: LightGBM shows the highest accuracy of 71.1%, and the Logit model the lowest of 69%. These results, however, are open to multiple interpretations. From a business perspective, more emphasis must be placed on minimizing type 2 errors, which cause more harmful operating losses for the guaranty company. Thus, we also compare classification accuracy by splitting the predicted probability of default into ten equal intervals.
When we examine the classification accuracy for each interval, the Logit model has the highest accuracy of 100% for the 0~10% interval of predicted default probability but a relatively low accuracy of 61.5% for the 90~100% interval. On the other hand, Random Forest, XGBoost, LightGBM, and DNN show more desirable results: they achieve higher accuracy for both the 0~10% and 90~100% intervals but lower accuracy around the 50% interval. As for the distribution of samples across intervals, both LightGBM and XGBoost assign a relatively large number of samples to the 0~10% and 90~100% intervals. Although the Random Forest model has an advantage in classification accuracy for a small number of cases, LightGBM or XGBoost could be more desirable models because they classify a large number of cases into the two extreme intervals, even allowing for their somewhat lower classification accuracy there. Considering the importance of type 2 errors and total prediction accuracy, XGBoost and DNN show superior performance, followed by Random Forest and LightGBM, while logistic regression performs worst. Each predictive model nevertheless has a comparative advantage with respect to particular evaluation standards; for instance, the Random Forest model shows almost 100% accuracy for samples expected to have a high probability of default. Collectively, a more comprehensive ensemble that combines multiple machine learning classifiers and conducts majority voting could maximize overall performance.
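A rough, hedged sketch of the interval-wise comparison above: each model's predicted default probability is split into ten 10%-wide intervals and per-interval classification accuracy is reported. The KSURE data are not public, so a synthetic imbalanced data set stands in for them, and scikit-learn's GradientBoostingClassifier stands in for XGBoost/LightGBM; all model settings are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# synthetic stand-in for the guarantee data (about 30% "accident" cases)
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Logit": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=0),
    "Boosting": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost/LightGBM
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]             # predicted probability of default
    interval = np.clip((proba * 10).astype(int), 0, 9)  # ten equal 10%-wide intervals
    pred = (proba >= 0.5).astype(int)
    for k in range(10):
        mask = interval == k
        if mask.sum() == 0:
            continue
        rows.append({"model": name, "interval": f"{10*k}~{10*(k+1)}%",
                     "n": int(mask.sum()),
                     "accuracy": float((pred[mask] == y_te[mask]).mean())})

# per-interval accuracy table, analogous to the paper's decile comparison
print(pd.DataFrame(rows).pivot(index="interval", columns="model", values="accuracy"))
```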

The Development of Major Tree Species Classification Model using Different Satellite Images and Machine Learning in Gwangneung Area (이종센서 위성영상과 머신 러닝을 활용한 광릉지역 주요 수종 분류 모델 개발)

  • Lim, Joongbin;Kim, Kyoung-Min;Kim, Myung-Kil
    • Korean Journal of Remote Sensing / v.35 no.6_2 / pp.1037-1052 / 2019
  • In a preceding study, we developed a classification model for Korean pine and larch with an accuracy of 98 percent, using Hyperion and Sentinel-2 satellite images, texture information, and geometric information, as a first step toward tree species mapping in inaccessible North Korea. Considering the shares of the major tree species in North Korea, the classification model needed to be expanded, since oak (29.5%), pine (12.7%), and fir (8.2%) account for large shares in addition to larch (17.5%) and Korean pine (5.8%). To classify the five major tree species, the national forest type map of South Korea was used to build 11,039 training and 2,330 validation samples. Sentinel-2 data were used to derive spectral information, PlanetScope data were used to generate texture information, and geometric information was built from SRTM DEM data. Random forest was used as the machine learning algorithm. As a result, the overall classification accuracy was 80% with a kappa statistic of 0.80. Based on the training data and the classification model constructed in this study, we will extend the application to Mt. Baekdu and the North and South Goseong areas to confirm the applicability of tree species classification on the Korean Peninsula.
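A minimal, hedged sketch of the workflow summarized above: a random forest classifier on spectral, texture, and terrain features, scored with overall accuracy and kappa. The Sentinel-2/PlanetScope/SRTM feature tables are not public, so a synthetic five-class table of the same size stands in; the species label names and all settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, classification_report

# synthetic stand-in for the spectral + texture + terrain feature table (5 classes)
X, y = make_classification(n_samples=13369, n_features=15, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=2330, random_state=1)

rf = RandomForestClassifier(n_estimators=500, random_state=1)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)

species = ["Oak", "Pine", "Fir", "Larch", "Korean pine"]  # illustrative class names
print("Overall accuracy:", accuracy_score(y_te, pred))
print("Kappa:", cohen_kappa_score(y_te, pred))
print(classification_report(y_te, pred, target_names=species))
```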

City Gas Pipeline Pressure Prediction Model (도시가스 배관압력 예측모델)

  • Chung, Won Hee;Park, Giljoo;Gu, Yeong Hyeon;Kim, Sunghyun;Yoo, Seong Joon;Jo, Young-do
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.33-47 / 2018
  • City gas pipelines are buried underground, which makes them hard to manage and easy to damage. This research proposes a real-time prediction system that helps experts make decisions about pressure anomalies. Gas pipeline pressure data from Jungbu City Gas Company, one of the domestic city gas suppliers, are analyzed together with time variables and environment variables. Regression models that predict pipeline pressure minutes ahead are proposed, and random forest, support vector regression (SVR), and long short-term memory (LSTM) algorithms are used to build them. A comparison of the models' performance shows that the LSTM model is the best: the LSTM model for Asan-si has a root mean square error (RMSE) of 0.011 and a mean absolute percentage error (MAPE) of 0.494, and the LSTM model for Cheonan-si has an RMSE of 0.015 and a MAPE of 0.668.
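A minimal sketch of minute-ahead pressure regression in the spirit of the abstract above, assuming lagged pressure values and time of day as features. The Jungbu City Gas data are not public, so a synthetic daily-cycle series stands in; the LSTM variant is omitted because it needs a deep-learning framework, and the RMSE/MAPE printed here are not comparable to the paper's figures.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# synthetic minute-resolution pressure series with a daily cycle (stand-in for real data)
rng = np.random.default_rng(0)
t = np.arange(4000)
pressure = 2.0 + 0.3 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 0.02, t.size)

# features: the last five minutes of pressure plus time of day; target: the next minute
lags = np.column_stack([pressure[i:-(5 - i)] for i in range(5)])
minute_of_day = (t[5:] % 1440) / 1440.0
X = np.column_stack([lags, minute_of_day])
y = pressure[5:]

split = 3000
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

for name, model in [("RandomForest", RandomForestRegressor(n_estimators=200, random_state=0)),
                    ("SVR", SVR(C=10.0, epsilon=0.005))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    mape = mean_absolute_percentage_error(y_te, pred) * 100
    print(f"{name}: RMSE = {rmse:.3f}, MAPE = {mape:.3f}%")
```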

Data-driven Model Prediction of Harmful Cyanobacterial Blooms in the Nakdong River in Response to Increased Temperatures Under Climate Change Scenarios (기후변화 시나리오의 기온상승에 따른 낙동강 남세균 발생 예측을 위한 데이터 기반 모델 시뮬레이션)

  • Gayeon Jang;Minkyoung Jo;Jayun Kim;Sangjun Kim;Himchan Park;Joonhong Park
    • Journal of Korean Society on Water Environment / v.40 no.3 / pp.121-129 / 2024
  • Harmful cyanobacterial blooms (HCBs) are caused by the rapid proliferation of cyanobacteria and are believed to be exacerbated by climate change. However, the extent to which HCBs will be stimulated in the future due to increased temperature remains uncertain. This study aims to predict the future occurrence of cyanobacteria in the Nakdong River, which has the highest incidence of HCBs in South Korea, based on temperature rise scenarios. Representative Concentration Pathways (RCPs) were used as the basis for these scenarios. Data-driven model simulations were conducted, and out of the four machine learning techniques tested (multiple linear regression, support vector regressor, decision tree, and random forest), the random forest model was selected for its relatively high prediction accuracy. The random forest model was used to predict the occurrence of cyanobacteria. The results of boxplot and time-series analyses showed that under the worst-case scenario (RCP8.5 (2100)), where temperature increases significantly, cyanobacterial abundance across all study areas was greatly stimulated. The study also found that the frequencies of HCB occurrences exceeding certain thresholds (100,000 and 1,000,000 cells/mL) increased under both the best-case scenario (RCP2.6 (2050)) and worst-case scenario (RCP8.5 (2100)). These findings suggest that the frequency of HCB occurrences surpassing a certain threshold level can serve as a useful diagnostic indicator of vulnerability to temperature increases caused by climate change. Additionally, this study highlights that water bodies currently susceptible to HCBs are likely to become even more vulnerable with climate change compared to those that are currently less susceptible.
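A conceptual sketch, under stated assumptions, of the scenario analysis described above: a random forest regressor is trained on water-quality features, the temperature column is then shifted to mimic an RCP-style warming scenario, and the frequency of predicted exceedances of the paper's thresholds (100,000 and 1,000,000 cells/mL) is compared. The predictor set, the +3 °C shift, and all data here are illustrative stand-ins, not the study's actual inputs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 3000
df = pd.DataFrame({
    "temperature": rng.uniform(5, 32, n),          # degC
    "total_phosphorus": rng.uniform(0.02, 0.2, n), # mg/L (assumed predictor)
    "flow": rng.uniform(50, 500, n),               # m3/s (assumed predictor)
})
# synthetic log10 cell density that rises with temperature and nutrients
log_cells = 2 + 0.12 * df["temperature"] + 8 * df["total_phosphorus"] + rng.normal(0, 0.5, n)

rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(df, log_cells)

def exceedance_freq(features, threshold_cells_per_ml):
    """Fraction of predictions above a cell-count threshold."""
    pred_cells = 10 ** rf.predict(features)
    return (pred_cells > threshold_cells_per_ml).mean()

warmed = df.assign(temperature=df["temperature"] + 3.0)  # illustrative warming scenario
for thr in (1e5, 1e6):
    print(f">{thr:.0e} cells/mL: baseline {exceedance_freq(df, thr):.2%}, "
          f"warmed {exceedance_freq(warmed, thr):.2%}")
```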

Implementing of a Machine Learning-based College Dropout Prediction Model (머신러닝 기반 대학생 중도탈락 예측 모델 구현 방안)

  • Yoon-Jung Roh
    • Journal of the Institute of Convergence Signal Processing / v.25 no.2 / pp.119-126 / 2024
  • This study aims to evaluate the feasibility of an early warning system for college dropout by using machine learning to identify the main patterns that affect college student dropout, and to suggest ways to implement a system that can actively prevent it. For this purpose, a performance comparison experiment was conducted with five machine learning algorithms on data from the Korean Educational Longitudinal Study, 2005, conducted by the Korea Educational Development Institute. The experiment showed that the identification accuracy for students with the intention to drop out reached 94.0% with Random Forest, while the recall for students with the intention to drop out reached 77.0% with Logistic Regression. Lastly, based on the best-performing prediction model, counseling and management will be provided to students who are likely to drop out; in particular, the factors showing high importance for each characteristic will be applied to the counseling model. This study seeks to implement a model that uses IT technology to solve the career problems faced by college students, since dropout imposes great costs on both universities and individuals.
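A hedged sketch of the comparison reported above, emphasizing both accuracy and recall for the dropout-intention class and the use of feature importances to guide counseling. The KEDI longitudinal survey data are not public, so an imbalanced synthetic data set stands in for them; all settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

# synthetic stand-in: roughly 15% of students labeled with dropout intention
X, y = make_classification(n_samples=4000, n_features=25, weights=[0.85],
                           class_sep=0.8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

for name, model in [("RandomForest", RandomForestClassifier(n_estimators=300, random_state=7)),
                    ("LogisticRegression", LogisticRegression(max_iter=1000, class_weight="balanced"))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: accuracy = {accuracy_score(y_te, pred):.3f}, "
          f"recall(dropout) = {recall_score(y_te, pred):.3f}")

# feature importances from the tree model can point counselors to the most influential factors
rf = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_tr, y_tr)
print("most important feature indices:", np.argsort(rf.feature_importances_)[::-1][:5])
```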

Predicting Gross Box Office Revenue for Domestic Films

  • Song, Jongwoo;Han, Suji
    • Communications for Statistical Applications and Methods / v.20 no.4 / pp.301-309 / 2013
  • This paper predicts gross box office revenue for domestic films using Korean film data from 2008-2011. We use three regression methods, Linear Regression, Random Forest, and Gradient Boosting, to predict gross box office revenue. We consider only domestic films with revenue of at least KRW 500 million; relevant explanatory variables are chosen by data visualization and variable selection techniques. The key idea of the analysis is to construct meaningful explanatory variables from data sources available to the public. Some variables must be categorized for more effective analysis, and clustering methods are applied to achieve this. We choose the best model based on its performance on the test set, and important explanatory variables are discussed.
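A rough sketch of the kind of pipeline described above: publicly constructible features, clustering used to turn a skewed continuous predictor into a categorical one, and the three regressors compared on a held-out test set. All variable names and data here are synthetic illustrations, not the paper's features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
n = 800
df = pd.DataFrame({
    "n_screens": rng.integers(50, 1200, n),
    "first_week_audience": rng.lognormal(11, 1.0, n),
    "star_power": rng.uniform(0, 1, n),
})
# synthetic gross revenue in arbitrary units
revenue = (0.004 * df["first_week_audience"] + 0.5 * df["n_screens"]
           + 200 * df["star_power"] + rng.normal(0, 100, n))

# categorize a skewed predictor via clustering, as the paper does for some variables
df["audience_group"] = KMeans(n_clusters=4, n_init=10, random_state=3) \
    .fit_predict(np.log(df[["first_week_audience"]]))

X_tr, X_te, y_tr, y_te = train_test_split(df, revenue, test_size=0.25, random_state=3)
for name, model in [("Linear", LinearRegression()),
                    ("RandomForest", RandomForestRegressor(n_estimators=300, random_state=3)),
                    ("GradientBoosting", GradientBoostingRegressor(random_state=3))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: test MAE = {mean_absolute_error(y_te, pred):.1f}")
```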

Machine learning-based regression analysis for estimating Cerchar abrasivity index

  • Kwak, No-Sang;Ko, Tae Young
    • Geomechanics and Engineering / v.29 no.3 / pp.219-228 / 2022
  • The most widely used parameter to represent rock abrasiveness is the Cerchar abrasivity index (CAI). The CAI value can be applied to predict wear in TBM cutters. It has been extensively demonstrated that the CAI is significantly affected by cementation degree, strength, and the amount of abrasive minerals, i.e., the quartz content or equivalent quartz content in rocks. The relationship between rock properties and the CAI is investigated in this study. A database comprising 223 observations is constructed, including rock types, uniaxial compressive strengths, Brazilian tensile strengths, equivalent quartz contents, quartz contents, brittleness indices, and CAIs. A linear model is developed by selecting independent variables, with multicollinearity taken into account, after performing multiple regression analyses. In addition to multiple linear regression, machine learning-based regression methods are used, including support vector regression, regression tree, k-nearest neighbors regression, random forest regression, and artificial neural network regression. The results show that the random forest regression model yields the best prediction performance.
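A minimal sketch of the regression comparison described above, with the same family of models scored by cross-validated R². The 223-observation CAI database is not reproduced here; a synthetic table with UCS, BTS, equivalent quartz content, and brittleness columns stands in, and the column names and coefficient values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
n = 223
X = pd.DataFrame({
    "ucs_mpa": rng.uniform(20, 250, n),       # uniaxial compressive strength
    "bts_mpa": rng.uniform(2, 20, n),         # Brazilian tensile strength
    "eqc_percent": rng.uniform(5, 90, n),     # equivalent quartz content
    "brittleness": rng.uniform(2, 25, n),
})
cai = 0.01 * X["ucs_mpa"] + 0.03 * X["eqc_percent"] + rng.normal(0, 0.3, n)

models = {
    "Multiple linear regression": LinearRegression(),
    "SVR": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "Regression tree": DecisionTreeRegressor(random_state=5),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "Random forest": RandomForestRegressor(n_estimators=300, random_state=5),
    "ANN": make_pipeline(StandardScaler(), MLPRegressor(hidden_layer_sizes=(32,),
                                                        max_iter=5000, random_state=5)),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, cai, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```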

Comparison of tree-based ensemble models for regression

  • Park, Sangho;Kim, Chanmin
    • Communications for Statistical Applications and Methods / v.29 no.5 / pp.561-589 / 2022
  • When multiple classification and regression trees are combined, tree-based ensemble models such as random forest (RF) and Bayesian additive regression trees (BART) are produced. We compare the model structures and performances of these ensemble models for regression settings in this study. RF learns from bootstrapped samples and selects a splitting variable from a subset of predictors gathered at each node. The BART model is specified as a sum of trees and is fitted using the Bayesian backfitting algorithm. Through extensive simulation studies, the strengths and drawbacks of the two methods are investigated in the presence of missing data, high-dimensional data, or highly correlated data. In the presence of missing data, BART performs well in general, whereas RF provides adequate coverage. BART outperforms RF on high-dimensional, highly correlated data. However, in all of the scenarios considered, RF has a shorter computation time. The performance of the two methods is also compared using two real data sets that represent the aforementioned situations, and the same conclusion is reached.
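A sketch of one simulation setting like those compared above: highly correlated, moderately high-dimensional predictors with a sparse non-linear signal. Only the RF side is run here; a BART fit would come from an external package (for example R's BART/dbarts or pymc-bart in Python), which is not assumed to be installed, and would be scored on the same split. Dimensions and coefficients are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(11)
n, p, rho = 300, 100, 0.8
# AR(1)-style correlation: cov[i, j] = rho^|i - j|
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] ** 2 + rng.normal(0, 1, n)  # sparse signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=11)
rf = RandomForestRegressor(n_estimators=500, random_state=11).fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, rf.predict(X_te)))
print(f"Random forest test RMSE: {rmse:.3f}")
# A BART fit on the identical split would be compared on the same RMSE metric,
# along with interval coverage, as in the paper's simulation study.
```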

Selecting Machine Learning Model Based on Natural Language Processing for Shanghanlun Diagnostic System Classification (자연어 처리 기반 『상한론(傷寒論)』 변병진단체계(辨病診斷體系) 분류를 위한 기계학습 모델 선정)

  • Young-Nam Kim
    • 대한상한금궤의학회지 / v.14 no.1 / pp.41-50 / 2022
  • Objective : The purpose of this study is to explore the most suitable machine learning model algorithm for Shanghanlun diagnostic system classification using natural language processing (NLP). Methods : A total of 201 data items were collected from 『Shanghanlun』 and 『Clinical Shanghanlun』; 'Taeyangbyeong-gyeolhyung' and 'Eumyangyeokchahunobokbyeong' were excluded to prevent oversampling or undersampling. The data were preprocessed with the Twitter Korean tokenizer and trained with logistic regression, ridge regression, lasso regression, naive Bayes classifier, decision tree, and random forest algorithms, and the accuracies of the models were compared. Results : Ridge regression and the naive Bayes classifier showed an accuracy of 0.843, logistic regression and random forest showed an accuracy of 0.804, decision tree showed an accuracy of 0.745, and lasso regression showed an accuracy of 0.608. Conclusions : Ridge regression and the naive Bayes classifier are suitable NLP machine learning models for Shanghanlun diagnostic system classification.
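A hedged sketch of the model-selection workflow above: Korean passages are tokenized, vectorized with TF-IDF, and several classifiers are compared by cross-validated accuracy. The KoNLPy Okt class (the former "Twitter" tokenizer) is assumed to be installed along with its Java dependency; the passages and diagnostic labels below are placeholders rather than the study's 201 items, and lasso is approximated here with L1-penalized logistic regression.

```python
from konlpy.tag import Okt                      # assumed available (requires Java)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# placeholder passages and diagnostic classes (the real study uses 201 Shanghanlun items)
docs = ["태양병 두통 발열 예시 구절", "태양병 오한 맥부 예시 구절",
        "양명병 복만 조열 예시 구절", "양명병 대변난 예시 구절",
        "소양병 구고 인건 예시 구절", "소양병 현훈 흉협고만 예시 구절"]
labels = ["Taeyang", "Taeyang", "Yangmyeong", "Yangmyeong", "Soyang", "Soyang"]

okt = Okt()
X = TfidfVectorizer(tokenizer=okt.morphs, token_pattern=None).fit_transform(docs)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifier(),
    "lasso (L1 logistic)": LogisticRegression(penalty="l1", solver="liblinear"),
    "naive Bayes": MultinomialNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, labels, cv=2).mean()  # cv would be larger on real data
    print(f"{name}: accuracy = {acc:.3f}")
```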


Market Timing and Seasoned Equity Offering (마켓 타이밍과 유상증자)

  • Sung Won Seo
    • Asia-Pacific Journal of Business / v.15 no.1 / pp.145-157 / 2024
  • Purpose - In this study, we propose an empirical model for predicting seasoned equity offerings (SEOs hereafter) using machine learning methods. Design/methodology/approach - The models utilize the random forest method, based on decision trees that can capture non-linear relationships, as well as the gradient boosting tree model. SEOs incur significant direct and indirect costs, so CEOs decide on seasoned equity issuance only when the benefits outweigh the costs, which leads to non-linear relationships between SEOs and their determinants. In particular, a variable related to market timing exhibits such non-linear relations. Findings - To account for these non-linear relationships, we hypothesize that decision-tree-based random forest and gradient boosting tree models are more suitable than linear methodologies. The results of this study support this hypothesis. Research implications or Originality - We expect that our findings can provide meaningful information to investors and policy makers by identifying companies likely to undergo SEOs.
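An illustrative sketch of the setup described above: firm-year features including a market-timing proxy feed tree-based classifiers that can capture threshold-like, non-linear relations, with a logit model as the linear benchmark. All variable names, coefficients, and data are synthetic assumptions, not the study's firm-level inputs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(21)
n = 5000
X = pd.DataFrame({
    "market_to_book": rng.lognormal(0.2, 0.5, n),  # market-timing proxy (assumed)
    "leverage": rng.uniform(0, 0.9, n),
    "cash_ratio": rng.uniform(0, 0.5, n),
    "past_return": rng.normal(0.05, 0.3, n),
})
# threshold-like propensity: issuance becomes likely only when benefits outweigh costs
logit = -3 + 2.0 * (X["market_to_book"] > 1.5) + 1.5 * X["leverage"] - 2.0 * X["cash_ratio"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=21)
for name, model in [("Logit (linear benchmark)", LogisticRegression(max_iter=1000)),
                    ("Random forest", RandomForestClassifier(n_estimators=300, random_state=21)),
                    ("Gradient boosting", GradientBoostingClassifier(random_state=21))]:
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```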