Performance Characteristics of an Ensemble Machine Learning Model for Turbidity Prediction With Improved Data Imbalance

  • HyunSeok Yang (양현석) (Department of Civil and Environmental Engineering, Hanbat National University) ;
  • Jungsu Park (박정수) (Department of Civil and Environmental Engineering, Hanbat National University)
  • Received : 2023.08.31
  • Accepted : 2023.10.24
  • Published : 2023.12.31

Abstract

High turbidity in source water can adversely affect water treatment plant operations and aquatic ecosystems, so turbidity must be managed, and research on predicting river turbidity continues. This study developed a multi-class classification model for turbidity prediction using LightGBM (Light Gradient Boosting Machine), a representative ensemble machine learning algorithm. The input data were divided into four classes, 1 to 4, in order of increasing turbidity. The number of data points differed widely among classes: 945, 763, 95, and 25 for classes 1 to 4, respectively. The model achieved precisions of 0.85, 0.71, 0.26, and 0.30 and recalls of 0.82, 0.76, 0.19, and 0.60 for classes 1 to 4, respectively; performance was lower for the minority classes, which had few data points. To address the data imbalance, the SMOTE (Synthetic Minority Over-sampling Technique) algorithm was applied. For classes 1 to 4, the resulting model had precisions of 0.88, 0.71, 0.26, and 0.25 and recalls of 0.79, 0.76, 0.38, and 0.60, respectively. Alleviating the data imbalance thus substantially improved recall, most clearly for class 3 (from 0.19 to 0.38). Furthermore, to analyze how the composition of the input data affects performance, input datasets were constructed with various class ratios and the resulting models were compared. The results indicate that choosing an appropriate class composition ratio for the input data improves the performance of the machine learning model.
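
The per-class precision and recall values quoted in the abstract follow the usual one-vs-rest definitions. The following is a minimal sketch of that calculation, not the study's code: the function name `per_class_precision_recall` and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def per_class_precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int], classes: Sequence[int]
) -> Dict[int, Dict[str, float]]:
    """Compute one-vs-rest precision and recall for each class label."""
    scores: Dict[int, Dict[str, float]] = {}
    pairs = list(zip(y_true, y_pred))
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted as c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly predicted as c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # true c that was missed
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[c] = {"precision": precision, "recall": recall}
    return scores


# Hypothetical toy labels for four turbidity classes (not data from the study).
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
y_pred = [1, 1, 1, 2, 2, 2, 1, 3, 2, 4]

scores = per_class_precision_recall(y_true, y_pred, classes=[1, 2, 3, 4])
for label, s in scores.items():
    print(f"class {label}: precision={s['precision']:.2f}, recall={s['recall']:.2f}")
```

Reporting the two metrics per class, rather than a single overall accuracy, is what makes weak performance on the small classes 3 and 4 visible.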
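
The core idea of the SMOTE oversampling step named in the abstract is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not the study's implementation and not a library API: the function `smote_like_oversample`, its parameters, and the two-feature points are hypothetical. The class counts are those reported in the abstract, and balancing every class up to the majority count is an assumption, since the abstract does not state the target ratio or the train/test split.

```python
import math
import random
from typing import Dict, List, Sequence

Point = List[float]


def smote_like_oversample(
    minority: Sequence[Point], n_new: int, k: int = 5, seed: int = 0
) -> List[Point]:
    """Create n_new synthetic points on segments between a minority sample
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[Point] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Class sizes reported in the abstract (classes 1 to 4).
counts: Dict[int, int] = {1: 945, 2: 763, 3: 95, 4: 25}
# Assumption: every class is oversampled up to the majority-class size.
target = max(counts.values())
needed = {label: target - n for label, n in counts.items()}
print(needed)

# Hypothetical two-feature minority samples (not data from the study).
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_points = smote_like_oversample(minority, n_new=6, k=2)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, the new data stay inside the region the minority class already occupies. In practice, oversampling is normally applied to the training split only, so that evaluation still reflects the original class distribution.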

Acknowledgement

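This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2022R1F1A1065518) (50%). This work was also supported by the Korea Environmental Industry & Technology Institute (KEITI) through its technology development project for responding to disasters at environmental facilities, funded by the Korea Ministry of Environment (2022002870001) (50%).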
