Effect of input variable characteristics on the performance of an ensemble machine learning model for algal bloom prediction

  • Kang, Byeong-Koo (Department of Civil and Environmental Engineering, Hanbat National University)
  • Park, Jungsu (Department of Civil and Environmental Engineering, Hanbat National University)
  • Received : 2021.11.03
  • Accepted : 2021.11.23
  • Published : 2021.12.15

Abstract

Algal blooms are a persistent problem in managing freshwater systems used for drinking water supply, and chlorophyll-a concentration is commonly used as an indicator of bloom status. Predicting chlorophyll-a concentration is therefore essential for proper water quality management. However, chlorophyll-a concentration is affected by many water quality and environmental factors, which makes prediction difficult. In recent years, advanced machine learning algorithms have increasingly been used to develop surrogate models that predict chlorophyll-a concentration in freshwater systems such as rivers and reservoirs. This study used the light gradient boosting machine (LightGBM), a gradient boosting decision tree algorithm, to develop an ensemble machine learning model for predicting chlorophyll-a concentration. Field water quality data observed at Daecheong Lake, obtained from the real-time water information system in Korea, were used to develop the model. The data comprise temperature, pH, electrical conductivity, dissolved oxygen, total organic carbon, total nitrogen, total phosphorus, and chlorophyll-a. First, a LightGBM model was developed to predict chlorophyll-a concentration using the other seven items as independent input variables. Second, time-lagged values of all the input variables were added as additional input variables to examine the effect of input time lag on model performance. The time lag (i) ranged from 1 to 50 days. Model performance was evaluated using three indices: the root mean squared error-observation standard deviation ratio (RSR), the Nash-Sutcliffe coefficient of efficiency (NSE), and the mean absolute error (MAE). The model performed best when a dataset with a one-day time lag (i=1) was added, with RSR, NSE, and MAE of 0.359, 0.871, and 1.510, respectively. An improvement in model performance was observed when a dataset with a time lag of up to about 15 days (i=15) was added.
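
The two mechanical steps named in the abstract, adding time-lagged inputs and scoring predictions with RSR, NSE, and MAE, can be sketched in plain Python as below. This is a minimal illustration and not the authors' code. The function names (`add_lag`, `rsr`, `nse`, `mae`) and the toy records are hypothetical. The sketch assumes one gap-free record per day and adds only the single lag i, whereas the abstract does not say whether lags 1 through i were added cumulatively. The metric formulas follow the commonly used definitions, in which RSR is the RMSE divided by the standard deviation of the observations.

```python
import math


def add_lag(records, features, lag):
    """Return copies of `records` extended with each feature's value `lag` rows earlier.

    Assumes `records` is a list of dicts in date order, one row per day, no gaps.
    The first `lag` rows are dropped because they have no lagged counterpart.
    """
    lagged = []
    for t in range(lag, len(records)):
        row = dict(records[t])  # copy so the input list is not modified
        for name in features:
            row[f"{name}_lag{lag}"] = records[t - lag][name]
        lagged.append(row)
    return lagged


def _sums_of_squares(obs, pred):
    """Return (sum of squared errors, sum of squared deviations of obs from its mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return sse, sst


def rsr(obs, pred):
    """RMSE divided by the standard deviation of the observations (0 is a perfect fit)."""
    sse, sst = _sums_of_squares(obs, pred)
    return math.sqrt(sse) / math.sqrt(sst)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency (1 is a perfect fit)."""
    sse, sst = _sums_of_squares(obs, pred)
    return 1.0 - sse / sst


def mae(obs, pred):
    """Mean absolute error, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


# Hypothetical toy records; the values are illustrative only, not study data.
records = [
    {"temp": 20.0, "tp": 0.030, "chla": 5.0},
    {"temp": 21.0, "tp": 0.032, "chla": 6.0},
    {"temp": 22.0, "tp": 0.035, "chla": 8.0},
]
with_lag1 = add_lag(records, features=["temp", "tp"], lag=1)

# Hypothetical observed and predicted chlorophyll-a values.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
scores = {
    "RSR": rsr(observed, predicted),
    "NSE": nse(observed, predicted),
    "MAE": mae(observed, predicted),
}
```

Under these definitions NSE equals 1 - RSR², and the values reported in the abstract agree with that identity: 1 - 0.359² ≈ 0.871. The lagged rows produced by `add_lag` would then serve as the feature table passed to whichever regressor is used.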

Acknowledgement

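This research was supported by the Korea Environmental Industry & Technology Institute (KEITI) through the Aquatic Ecosystem Conservation Research Program, funded by the Korea Ministry of Environment (Project No. 2020003030006).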
