• Title/Summary/Keyword: random forest model

Machine learning model for residual chlorine prediction in sediment basin to control pre-chlorination in water treatment plant (정수장 전염소 공정제어를 위한 침전지 잔류염소농도 예측 머신러닝 모형)

  • Kim, Juhwan;Lee, Kyunghyuk;Kim, Soojun;Kim, Kyunghun
    • Journal of Korea Water Resources Association / v.55 no.spc1 / pp.1283-1293 / 2022
  • The purpose of this study is to predict residual chlorine with artificial intelligence algorithms in order to maintain a stable residual chlorine concentration in the sedimentation basin of a water treatment process employing pre-chlorination. Available water quantity and quality data were collected and analyzed statistically, then applied to a multiple regression model and to artificial intelligence models including a multi-layer perceptron neural network, random forest, and long short-term memory (LSTM). Water temperature, turbidity, pH, conductivity, flow rate, alkalinity, and pre-chlorination dosage were used as input parameters to develop the prediction models. Among the four cases (LSTM, multi-layer perceptron, multiple regression, and random forest), the random forest algorithm gave the most appropriate prediction results. In particular, the multiple regression model could not represent residual chlorine from input parameters that vary independently with seasonal change and differ in numerical scale and dimension between quantity and quality variables. For this reason, the random forest model, a decision-tree-type algorithm, is more appropriate for predicting water quality than the other algorithms. Real-time prediction by artificial intelligence models is also expected to support stable control of residual chlorine in water treatment plants with a pre-chlorination process.

An Analysis on Determinants of the Capesize Freight Rate and Forecasting Models (케이프선 시장 운임의 결정요인 및 운임예측 모형 분석)

  • Lim, Sang-Seop;Yun, Hee-Sung
    • Journal of Navigation and Port Research / v.42 no.6 / pp.539-545 / 2018
  • In recent years, research on shipping market forecasting with the employment of non-linear AI models has attracted significant interest. In previous studies, input variables were selected with reference to past papers or by relying on the intuitions of the researchers. This paper attempts to address this issue by applying the stepwise regression model and the random forest model to the Cape-size bulk carrier market. The Cape market was selected due to the simplicity of its supply and demand structure. The preliminary selection of the determinants resulted in 16 variables. In the next stage, 8 features from the stepwise regression model and 10 features from the random forest model were screened as important determinants. The chosen variables were used to test both models. Based on the analysis of the models, it was observed that the random forest model outperforms the stepwise regression model. This research is significant because it provides a scientific basis which can be used to find the determinants in shipping market forecasting, and utilize a machine-learning model in the process. The results of this research can be used to enhance the decisions of chartering desks by offering a guideline for market analysis.

Study on Improvement of Frost Occurrence Prediction Accuracy (서리발생 예측 정확도 향상을 위한 방법 연구)

  • Kim, Yongseok;Choi, Wonjun;Shim, Kyo-moon;Hur, Jina;Kang, Mingu;Jo, Sera
    • Korean Journal of Agricultural and Forest Meteorology / v.23 no.4 / pp.295-305 / 2021
  • In this study, we constructed a frost occurrence prediction model using Random Forest (RF) by selecting meteorological factors related to the occurrence of frost. The analysis showed that, when constructing a classification model for frost occurrence, imbalance in the development data set degrades the predictive power of the model even when the data set is large. Building a single integrated model by grouping meteorological factors related to frost occurrence by region was found to be more efficient than building separate models that each reflect the high-importance meteorological factors. Based on our results, we expect that a high-accuracy frost occurrence prediction model can be constructed with further study of the meteorological factors used for frost prediction.

Prediction of spatio-temporal AQI data

  • KyeongEun Kim;MiRu Ma;KyeongWon Lee
    • Communications for Statistical Applications and Methods / v.30 no.2 / pp.119-133 / 2023
  • With the rapid growth of the economy and fossil fuel consumption, the concentration of air pollutants has increased significantly, and the air pollution problem is no longer limited to small areas. We conduct a statistical analysis of actual air quality data covering the whole of South Korea using R and Python. Factors such as SO2, CO, O3, NO2, PM10, precipitation, wind speed, wind direction, vapor pressure, local pressure, sea level pressure, temperature, and humidity are used as covariates. The main goal of this paper is to predict spatio-temporal air quality index (AQI) data. Observations in big spatio-temporal datasets such as AQI data are correlated both spatially and temporally, and computing predictions or forecasts under this dependence structure is often infeasible. The likelihood function based on a spatio-temporal model may therefore be complicated, and special modeling approaches are useful for statistically reliable predictions. In this paper, we propose several methods for these big spatio-temporal AQI data. First, a random effects model with spatio-temporal basis functions, a classical statistical approach, is proposed. Next, a neural network model, a deep learning method based on artificial neural networks, is applied. Finally, a random forest model, a machine learning method that is closer to computational science, is introduced. We then compare the forecasting performance of the methods in terms of predictive diagnostics. The analysis shows that all three methods predicted normal levels of PM2.5 well, but performance appears poor at extreme values.

Developing a Model for Predicting Success of Machine Learning based Health Consulting (머신러닝 기반 건강컨설팅 성공여부 예측모형 개발)

  • Lee, Sang Ho;Song, Tae-Min
    • Journal of Information Technology Services / v.17 no.1 / pp.91-103 / 2018
  • This study developed a prediction model using machine learning and predicted the success of health consulting using life-log data generated through the u-Health service. The Random Forest model had the highest model index. Analysis of the Random Forest model showed that blood pressure was the most influential factor in success or failure for metabolic syndrome among subjects of the u-Health service, followed by triglycerides, body weight, blood sugar, high cholesterol, and medication. Muscle mass, basal metabolic rate, and high-density lipoprotein cholesterol increased; waist circumference, blood sugar, and triglycerides decreased. Biometrics and health behavior also improved. After nine months of u-Health services, the number of subjects with four or more factors for metabolic syndrome decreased by 28.6%; 3.7% of regular drinkers stopped drinking; 23.2% of subjects who rarely exercised began to exercise twice a week or more; and 20.0% of smokers stopped smoking. If the predictive model developed in this study is linked with CBR, cases with a high predicted probability of success can serve as CBR case data, improving subject compliance and the quality of counseling for metabolic syndrome.

The Prediction of Export Credit Guarantee Accident using Machine Learning (기계학습을 이용한 수출신용보증 사고예측)

  • Cho, Jaeyoung;Joo, Jihwan;Han, Ingoo
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.83-102 / 2021
  • The government recently announced various policies for developing the big-data and artificial intelligence fields, giving the public a great opportunity through the disclosure of high-quality data held by public institutions. KSURE (Korea Trade Insurance Corporation) is a major public institution for financial policy in Korea and is strongly committed to backing export companies through various systems. Nevertheless, there are still few cases of business models realized from big-data analyses. In this situation, this paper aims to develop a new business model that can be applied to ex-ante prediction of the likelihood of a credit guarantee insurance accident. We use internal data from KSURE, which supports export companies in Korea, and apply machine learning models. We then compare the performance of predictive models including Logistic Regression, Random Forest, XGBoost, LightGBM, and DNN (Deep Neural Network). For decades, many researchers have tried to find better models for predicting bankruptcy, since ex-ante prediction is crucial for corporate managers, investors, creditors, and other stakeholders. The prediction of financial distress or bankruptcy originated with Smith (1930), Fitzpatrick (1932), and Merwin (1942). One of the most famous models is Altman's Z-score model (Altman, 1968), which is based on multiple discriminant analysis and is still widely used in both research and practice. It uses five key financial ratios to predict the probability of bankruptcy within the next two years. Ohlson (1980) introduces the logit model to address some limitations of previous models. Furthermore, Elmer and Borowski (1988) develop and examine a rule-based, automated system that conducts the financial analysis of savings and loans.
Since the 1980s, researchers in Korea have studied the prediction of financial distress or bankruptcy. Kim (1987) analyzes financial ratios and develops a prediction model. Han et al. (1995, 1996, 1997, 2003, 2005, 2006) construct prediction models using various techniques including artificial neural networks. Yang (1996) introduces multiple discriminant analysis and the logit model. Kim and Kim (2001) use artificial neural network techniques for ex-ante prediction of insolvent enterprises. Since then, many scholars have tried to predict financial distress or bankruptcy more precisely using diverse models such as Random Forest or SVM. One major distinction between our research and previous research is that we examine the predicted probability of default for each sample case, rather than only the classification accuracy of each model over the entire sample. Most predictive models in this paper achieve a classification accuracy of about 70% on the entire sample. Specifically, the LightGBM model shows the highest accuracy of 71.1% and the Logit model the lowest accuracy of 69%. However, we find that these results are open to multiple interpretations. In the business context, more emphasis must be placed on minimizing type 2 error, which causes more harmful operating losses for the guaranty company. Thus, we also compare classification accuracy after splitting the predicted probability of default into ten equal intervals. When we examine the classification accuracy for each interval, the Logit model has the highest accuracy of 100% for predicted default probabilities of 0-10%, but a relatively low accuracy of 61.5% for predicted default probabilities of 90-100%.
On the other hand, Random Forest, XGBoost, LightGBM, and DNN give more desirable results: they show higher accuracy for predicted default probabilities of both 0-10% and 90-100%, but lower accuracy for predicted default probabilities around 50%. As for the distribution of samples across predicted default probabilities, both the LightGBM and XGBoost models place a relatively large number of samples in the 0-10% and 90-100% intervals. Although the Random Forest model has an advantage in classification accuracy on a small number of cases, LightGBM or XGBoost could be the more desirable model because they classify a large number of cases into the two extreme intervals of predicted default probability, even allowing for their relatively low classification accuracy. Considering the importance of type 2 error and total prediction accuracy, XGBoost and DNN show superior performance. Random Forest and LightGBM come next with good results, while logistic regression shows the worst performance. However, each predictive model has a comparative advantage under some evaluation standard. For instance, the Random Forest model shows almost 100% accuracy for samples expected to have a high probability of default. Collectively, we can construct more comprehensive ensemble models that combine multiple machine learning classifiers and use majority voting to maximize overall performance.
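The two evaluation ideas in this abstract, accuracy within ten equal intervals of predicted default probability and majority voting across classifiers, can be sketched in a few lines. This is only an illustration under assumptions: the function names, the 0.5 decision threshold, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def accuracy_by_decile(probs, labels, threshold=0.5):
    """Return {bin_index: (n_samples, accuracy)} for ten equal-width bins.

    probs     : predicted default probabilities in [0, 1]
    labels    : true outcomes, 1 = default, 0 = no default
    threshold : assumed cut-off for turning a probability into a 0/1 prediction
    """
    hits = [0] * 10
    counts = [0] * 10
    for p, y in zip(probs, labels):
        b = min(int(p * 10), 9)  # a probability of exactly 1.0 goes in the last bin
        pred = 1 if p >= threshold else 0
        counts[b] += 1
        hits[b] += int(pred == y)
    # Skip empty bins so we never divide by zero.
    return {b: (counts[b], hits[b] / counts[b]) for b in range(10) if counts[b]}


def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several models by simple majority per sample.

    Use an odd number of models; ties are otherwise resolved arbitrarily.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


# Made-up toy data, for illustration only.
probs = [0.03, 0.07, 0.52, 0.48, 0.95, 0.97]
labels = [0, 0, 0, 0, 1, 1]
for b, (n, acc) in accuracy_by_decile(probs, labels).items():
    print(f"{b * 10}-{b * 10 + 10}%: n={n}, accuracy={acc:.2f}")

print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

Reporting the sample count next to each interval's accuracy matters here, because the abstract weighs accuracy in the extreme intervals against how many cases each model places there.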

City Gas Pipeline Pressure Prediction Model (도시가스 배관압력 예측모델)

  • Chung, Won Hee;Park, Giljoo;Gu, Yeong Hyeon;Kim, Sunghyun;Yoo, Seong Joon;Jo, Young-do
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.33-47 / 2018
  • City gas pipelines are buried underground, which makes them hard to manage and easy to damage. This research proposes a real-time prediction system that helps experts make decisions about pressure anomalies. Gas pipeline pressure data from Jungbu City Gas Company, one of the domestic city gas suppliers, are analysed together with time variables and environment variables. Regression models that predict pipeline pressure in minutes are proposed. Random forest, support vector regression (SVR), and long short-term memory (LSTM) algorithms are used to build the pressure prediction models. A comparison of the models' performance shows that the LSTM model was the best. The LSTM model for Asan-si has a root mean square error (RMSE) of 0.011 and a mean absolute percentage error (MAPE) of 0.494. The LSTM model for Cheonan-si has an RMSE of 0.015 and a MAPE of 0.668.
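For readers unfamiliar with the two error measures reported above, the following minimal sketch computes RMSE and MAPE for paired actual and predicted values. The function names and sample values are hypothetical. The abstract does not say whether its MAPE is a percentage or a fraction; this sketch assumes a percentage.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: sqrt(mean((actual - predicted)^2))."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (an assumption of this sketch).

    Every actual value must be nonzero, since each error is divided by it.
    """
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Made-up values, for illustration only (not pipeline data from the paper).
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
print(rmse(actual, predicted))  # 1.0
print(mape(actual, predicted))  # 37.5
```

RMSE is expressed in the units of the predicted quantity, while MAPE is scale-free, which is why the two are often reported together.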

Data-driven Model Prediction of Harmful Cyanobacterial Blooms in the Nakdong River in Response to Increased Temperatures Under Climate Change Scenarios (기후변화 시나리오의 기온상승에 따른 낙동강 남세균 발생 예측을 위한 데이터 기반 모델 시뮬레이션)

  • Gayeon Jang;Minkyoung Jo;Jayun Kim;Sangjun Kim;Himchan Park;Joonhong Park
    • Journal of Korean Society on Water Environment / v.40 no.3 / pp.121-129 / 2024
  • Harmful cyanobacterial blooms (HCBs) are caused by the rapid proliferation of cyanobacteria and are believed to be exacerbated by climate change. However, the extent to which HCBs will be stimulated in the future due to increased temperature remains uncertain. This study aims to predict the future occurrence of cyanobacteria in the Nakdong River, which has the highest incidence of HCBs in South Korea, based on temperature rise scenarios. Representative Concentration Pathways (RCPs) were used as the basis for these scenarios. Data-driven model simulations were conducted, and out of the four machine learning techniques tested (multiple linear regression, support vector regressor, decision tree, and random forest), the random forest model was selected for its relatively high prediction accuracy. The random forest model was used to predict the occurrence of cyanobacteria. The results of boxplot and time-series analyses showed that under the worst-case scenario (RCP8.5 (2100)), where temperature increases significantly, cyanobacterial abundance across all study areas was greatly stimulated. The study also found that the frequencies of HCB occurrences exceeding certain thresholds (100,000 and 1,000,000 cells/mL) increased under both the best-case scenario (RCP2.6 (2050)) and worst-case scenario (RCP8.5 (2100)). These findings suggest that the frequency of HCB occurrences surpassing a certain threshold level can serve as a useful diagnostic indicator of vulnerability to temperature increases caused by climate change. Additionally, this study highlights that water bodies currently susceptible to HCBs are likely to become even more vulnerable with climate change compared to those that are currently less susceptible.
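The threshold-exceedance indicator described above amounts to counting how often predicted cell densities rise above a fixed level. The sketch below is hypothetical: the function name and both series are invented for illustration, and only the two thresholds (100,000 and 1,000,000 cells/mL) come from the abstract.

```python
def exceedance_frequency(cell_counts, threshold):
    """Fraction of observations strictly above `threshold` (cells/mL)."""
    if not cell_counts:
        raise ValueError("cell_counts must not be empty")
    return sum(1 for c in cell_counts if c > threshold) / len(cell_counts)


# Invented predicted cell densities (cells/mL) for two hypothetical scenarios.
baseline = [5_000, 80_000, 120_000, 300_000]
warmer = [20_000, 150_000, 900_000, 1_500_000]

for threshold in (100_000, 1_000_000):
    print(
        threshold,
        exceedance_frequency(baseline, threshold),
        exceedance_frequency(warmer, threshold),
    )
```

Comparing the same threshold across scenarios shows the shift in frequency directly, which is the sense in which the abstract proposes exceedance frequency as a vulnerability indicator.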

The Development of Major Tree Species Classification Model using Different Satellite Images and Machine Learning in Gwangneung Area (이종센서 위성영상과 머신 러닝을 활용한 광릉지역 주요 수종 분류 모델 개발)

  • Lim, Joongbin;Kim, Kyoung-Min;Kim, Myung-Kil
    • Korean Journal of Remote Sensing / v.35 no.6_2 / pp.1037-1052 / 2019
  • In a preceding study, we developed a classification model for Korean pine and larch with an accuracy of 98 percent using Hyperion and Sentinel-2 satellite images, texture information, and geometric information, as the first step toward tree species mapping in inaccessible North Korea. Considering the shares of major tree species in North Korea, the classification model needs to be expanded, because oak (29.5%), pine (12.7%), and fir (8.2%) hold large shares alongside larch (17.5%) and Korean pine (5.8%). To classify the 5 major tree species, the national forest type map of South Korea was used to build 11,039 training and 2,330 validation samples. Sentinel-2 data were used to derive spectral information, and PlanetScope data were used to generate texture information. Geometric information was built from SRTM DEM data. Random forest was used as the machine learning algorithm. As a result, the overall classification accuracy was 80%, with a kappa statistic of 0.80. Based on the training data and the classification model constructed in this study, we will extend the application to Mt. Baekdu and the North and South Goseong areas to confirm the applicability of tree species classification on the Korean Peninsula.
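Overall accuracy and the kappa statistic are both derived from a confusion matrix. The sketch below shows the standard Cohen's kappa calculation on a made-up two-class matrix; the function name and the numbers are hypothetical and do not reproduce the study's five-class results.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up two-class matrix, for illustration only.
accuracy, kappa = overall_accuracy_and_kappa([[40, 10], [10, 40]])
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")
```

Kappa discounts the agreement that would be expected by chance, so it gives a more conservative picture than overall accuracy alone.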

Predicting Gross Box Office Revenue for Domestic Films

  • Song, Jongwoo;Han, Suji
    • Communications for Statistical Applications and Methods / v.20 no.4 / pp.301-309 / 2013
  • This paper predicts gross box office revenue for domestic films using Korean film data from 2008 to 2011. We use three regression methods (Linear Regression, Random Forest, and Gradient Boosting) to predict gross box office revenue. We consider only domestic films with revenue of at least KRW 500 million; relevant explanatory variables are chosen by data visualization and variable selection techniques. The key idea in analyzing these data is to construct meaningful explanatory variables from publicly available data sources. Some variables must be categorized for more effective analysis, and clustering methods are applied to achieve this. We choose the best model based on test-set performance and discuss the important explanatory variables.