• Title/Summary/Keyword: Randomforest

Forecasting daily PM10 concentrations in Seoul using various data mining techniques

  • Choi, Ji-Eun;Lee, Hyesun;Song, Jongwoo
    • Communications for Statistical Applications and Methods / v.25 no.2 / pp.199-215 / 2018
  • Interest in $PM_{10}$ concentrations has increased greatly in Korea due to recent increases in air pollution levels. Therefore, we consider a forecasting model for the next-day $PM_{10}$ concentration based on the principal elements of air pollution, weather information, and Beijing $PM_{2.5}$. If we can forecast the next-day $PM_{10}$ concentration level accurately, we believe that this forecast can be useful for policy makers and the public. This paper is intended to help forecast the daily mean $PM_{10}$, the daily max $PM_{10}$, and the four stages of $PM_{10}$ provided by the Ministry of Environment using various data mining techniques. We use seven models to forecast the daily $PM_{10}$: five regression models (linear regression, Randomforest, gradient boosting, support vector machine, neural network) and two time series models (ARIMA, ARFIMA). As a result, the linear regression model performs the best in the $PM_{10}$ concentration forecast, and the linear regression and Randomforest models perform the best in the $PM_{10}$ class forecast. The results also indicate that the $PM_{10}$ in Seoul is influenced by Beijing $PM_{2.5}$ and by air pollution from power stations on the west coast.

The long-term agricultural weather forecast methods using machine learning and GloSea5: on the cultivation zone of Chinese cabbage. (기계학습과 GloSea5를 이용한 장기 농업기상 예측 : 고랭지배추 재배 지역을 중심으로)

  • Kim, Junseok;Yang, Miyeon;Yoon, Sanghoo
    • Journal of Digital Convergence / v.18 no.4 / pp.243-250 / 2020
  • Systematic farming can be planned and managed if long-term agricultural weather information for the plantation is available, because the greatest risk factor for crop cultivation is the weather. In this study, a method for long-term prediction of agricultural weather using GloSea5 and machine learning is presented for the cultivation of Chinese cabbage. GloSea5 is a long-term weather forecast that is available up to 240 days ahead. Deep neural networks and a spatial randomforest were considered as the machine learning methods. The long-term prediction performance of the deep neural networks was slightly better than that of the spatial randomforest in terms of root mean squared error and mean absolute error. However, the spatial randomforest has the advantage of predicting temperatures with a global model, which reduces the computation time.

Naval Vessel Spare Parts Demand Forecasting Using Data Mining (데이터마이닝을 활용한 해군함정 수리부속 수요예측)

  • Yoon, Hyunmin;Kim, Suhwan
    • Journal of Korean Society of Industrial and Systems Engineering / v.40 no.4 / pp.253-259 / 2017
  • Recent developments in science and technology have modernized the weapon systems of the ROKN (Republic of Korea Navy). Although the cost of purchasing, operating, and maintaining cutting-edge weapon systems has increased significantly, national defense expenditure is under a tight budget constraint. In order to maintain the availability of ships at low cost, we need accurate demand forecasts for spare parts. We attempted to find consumption patterns using data mining techniques. First, we gathered a large amount of component consumption data through the DELIIS (Defense Logistics Integrated Information System). Through data collection, we obtained 42 variables such as annual consumption quantity, ASL selection quantity, and order-release ratio. The objective variable is the quantity of spare parts purchased in f-year, and MSE (mean squared error) is used as the measure of predictive power. To construct an optimal demand forecasting model, a regression tree model, a randomforest model, a neural network model, and a linear regression model were used as data mining techniques. The open-source software R was used for model construction. The results show that the randomforest model achieves the best MSE. The important variables utilized in all models are consumption quantity, ASL selection quantity, and order-release rate. Our approach shows improved performance in demand forecasting, with higher accuracy than previous work. Data mining can also be used to identify variables that are related to demand forecasting.

Korean Text Classification Using Randomforest and XGBoost Focusing on Seoul Metropolitan Civil Complaint Data (RandomForest와 XGBoost를 활용한 한국어 텍스트 분류: 서울특별시 응답소 민원 데이터를 중심으로)

  • Ha, Ji-Eun;Shin, Hyun-Chul;Lee, Zoon-Ky
    • The Journal of Bigdata / v.2 no.2 / pp.95-104 / 2017
  • In 2014, the Seoul Metropolitan Government launched a response service aimed at responding promptly to civil complaints. The complaints received are categorized based on their content and sent to the department in charge. If this step can be automated, time and labor costs will be reduced. In this study, we collected 17,700 complaints over 7 years, from June 1, 2010 to May 31, 2017. We compared XGBoost with RandomForest and confirmed their suitability for Korean text classification. As a result, the accuracy of XGBoost was generally higher than that of RandomForest. The accuracy of RandomForest was unstable after upsampling and downsampling using the same sample, while XGBoost showed stable overall accuracy.

Compositional Feature Selection and Its Effects on Bandgap Prediction by Machine Learning (기계학습을 이용한 밴드갭 예측과 소재의 조성기반 특성인자의 효과)

  • Chunghee Nam
    • Korean Journal of Materials Research / v.33 no.4 / pp.164-174 / 2023
  • The bandgap characteristics of semiconductor materials are an important factor when utilizing these materials for various applications. In this study, based on data provided by AFLOW (Automatic-FLOW for Materials Discovery), the bandgap of a semiconductor material was predicted using only the material's compositional features. The compositional features were generated using the Python modules 'Pymatgen' and 'Matminer'. Pearson's correlation coefficients (PCC) between the compositional features were calculated, and features with a correlation coefficient larger than 0.95 were removed in order to avoid overfitting. The bandgap prediction performance was compared using the $R^2$ score and root-mean-squared error. By predicting the bandgap with randomforest and xgboost as representatives of ensemble algorithms, it was found that xgboost gave better results after cross-validation and hyper-parameter tuning. To investigate the effect of compositional feature selection on the bandgap prediction of the machine learning model, the prediction performance was studied as a function of the number of features selected by feature importance methods. It was found that there were no significant changes in prediction performance beyond an appropriate number of features. Furthermore, artificial neural networks were employed to compare the prediction performance by adjusting the number of features guided by the PCC values, resulting in a best $R^2$ score of 0.811. By comparing and analyzing the bandgap distribution and prediction performance according to the material group containing specific elements (F, N, Yb, Eu, Zn, B, Si, Ge, Fe, Al), various information for material design was obtained.
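
The correlation-based filtering described in this abstract can be illustrated with a minimal sketch. It is not the author's code: the feature names and values are hypothetical, the use of the absolute correlation is an assumption, and so is the rule of keeping the first feature of each highly correlated pair (the abstract states only the 0.95 threshold).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: treat as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |PCC| with every already-kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical compositional features for four materials.
features = {
    "mean_atomic_mass": [1.0, 2.0, 3.0, 4.0],
    "mean_atomic_number": [2.0, 4.1, 5.9, 8.0],  # nearly collinear with the first
    "electronegativity_range": [0.9, 0.1, 0.7, 0.3],
}
selected = drop_correlated(features)
print(selected)
```

In this toy data the second feature is almost a linear function of the first, so it is dropped, while the weakly correlated third feature is kept.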

The Comparison of Peach Price and Trading Volume Prediction Model Using Machine Learning Technique (기계학습을 이용한 복숭아 경락가격 및 거래량 예측모형 비교)

  • Kim, Mihye;Hong, Sungmin;Yoon, Sanghoo
    • Journal of the Korean Data Analysis Society / v.20 no.6 / pp.2933-2940 / 2018
  • It is known that fruit is more affected by the weather than other crops. Therefore, in order to create high value for farmers, it is necessary to develop a wholesale price model that considers the weather. Peaches, which are produced under relatively limited conditions, were chosen as the subject of study. The data, provided by okdab 4.0, were collected from 2015 to 2017. The meteorological data used for the analysis were generated by weighting by cultivation area, and the variables with high correlation were selected from the weather data for the day before through 7 days before. Randomforest, gradient boosting machine, and XGBoost were used for the analysis. As a result, XGBoost showed the best performance in terms of RMSE and correlation; prices were predicted comparatively well, but the accuracy of the trading volume prediction was not good enough. The top three weather variables affecting peaches were minimum temperature, average maximum temperature, and precipitation.
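
The lagged weather inputs mentioned in this abstract (the day before through 7 days before) can be sketched as follows. This is an illustrative assumption about how such features might be built, not the authors' procedure; the function name `lagged_rows` and the temperature values are hypothetical.

```python
def lagged_rows(series, max_lag=7):
    """For each day t with a full history, return [x(t-1), x(t-2), ..., x(t-max_lag)]."""
    rows = []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in range(1, max_lag + 1)])
    return rows


# Hypothetical daily minimum temperatures for nine consecutive days.
min_temp = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
rows = lagged_rows(min_temp)
print(len(rows), rows[0])
```

Each row could then be paired with the price or trading volume of day t and screened by correlation before model fitting.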

Utilization Evaluation of Numerical forest Soil Map to Predict the Weather in Upland Crops (밭작물 농업기상을 위한 수치형 산림입지토양도 활용성 평가)

  • Kang, Dayoung;Hwang, Yeongeun;Yoon, Sanghoo
    • Korean Journal of Agricultural and Forest Meteorology / v.23 no.1 / pp.34-45 / 2021
  • Weather is one of the important factors in the agricultural industry, as it affects the price, production, and quality of crops. Upland crops are directly exposed to the natural environment because they are mainly grown in mountainous areas. Therefore, it is necessary to provide accurate weather information for upland crops. This study examined the effectiveness of 12 forest soil factors for interpolating the weather in mountainous areas. The daily temperature and precipitation data were collected by the Korea Meteorological Administration between January 2009 and December 2018. The Generalized Additive Model (GAM), Kriging, and Random Forest (RF) were considered as interpolation methods. To evaluate the interpolation performance, automatic weather stations were used as training data and automated synoptic observing systems were used as test data for cross-validation. Unfortunately, the forest soil factors did not contribute significantly to interpolating the weather in the mountainous areas. GAM with only geographic factors interpolated well in terms of root mean squared error and mean absolute error. The significance of the factors was tested at the 5% significance level in GAM, and the climate zone code (CLZN_CD) and soil water code B (SIBFLR_LAR) were identified as relatively important factors. The results show that CLZN_CD could help to interpolate the daily average and daily minimum temperatures for upland crops.

Bike Insurance Fraud Detection Model Using Balanced Randomforest Algorithm (균형 랜덤 포레스트를 이용한 이륜차 보험사기 적발 모형 개발)

  • Kim, Seunghoon;Lee, Soo Il;Kim, Tae ho
    • Journal of Digital Convergence / v.20 no.2 / pp.241-250 / 2022
  • Due to the COVID-19 pandemic, with increased 'untact' services and an unstable household economy, bike insurance fraud is expected to surge. Moreover, fraud methods are becoming more complicated. However, a fraud detection model for bike insurance is absent. We deal with the issue of skewed class distribution and reflect the criteria of fraud detection experts. We utilize a balanced random-forest algorithm to develop an efficient bike insurance fraud detection model. As a result, the predictive performance of the balanced random-forest model is superior to that of the non-balanced model, while there is no significant difference between the variables used by the experts and those of the confirmatory models. The important variables for detecting fraud turn out to be the age and gender of the driver, the correspondence between the insured and the driver, the amount of the self-repairing claim, and the amount of bodily injury liability.
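
The idea behind a balanced random forest can be sketched through its per-tree sampling step: each tree is trained on a bootstrap sample that draws equally from the minority and majority classes. The code below is a minimal illustration of that sampling only, not the authors' implementation; the abstract does not say which variant was used, and the 5% fraud rate and the convention that label 1 marks the minority (fraud) class are assumptions.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return indices for one tree: n minority draws plus n majority draws,
    both with replacement, where n is the minority class size.
    Assumes label 1 is the minority (fraud) class and label 0 the majority."""
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    n = len(minority_idx)
    sample = [rng.choice(minority_idx) for _ in range(n)]
    sample += [rng.choice(majority_idx) for _ in range(n)]
    return sample


# Hypothetical skewed data set: 5 fraud cases among 100 claims.
labels = [1] * 5 + [0] * 95
rng = random.Random(0)
sample = balanced_bootstrap(labels, rng)
fraud_share = sum(labels[i] for i in sample) / len(sample)
print(len(sample), fraud_share)
```

Every tree therefore sees the two classes in equal proportion, which is how this family of methods counters a skewed class distribution.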