Comparison of the Performance of Machine Learning Models for TOC Prediction Based on Input Variable Composition


  • Sohyun Lee (Department of Civil and Environmental Engineering, Hanbat National University)
  • Jungsu Park (Department of Civil and Environmental Engineering, Hanbat National University)
  • Received : 2024.07.31
  • Accepted : 2024.08.22
  • Published : 2024.09.30

Abstract

Total organic carbon (TOC) is the total amount of organic carbon contained in water and, together with biochemical oxygen demand (BOD) and chemical oxygen demand (COD), is a key water quality parameter for quantifying organic matter in water. In this study, a model to predict TOC was developed using XGBoost (XGB), a representative ensemble machine learning algorithm. The independent variables were water temperature, pH, electrical conductivity, dissolved oxygen concentration, BOD, COD, suspended solids, total nitrogen, total phosphorus, and discharge. To quantify the influence of each water quality parameter, the feature importance of the input variables was calculated, and the least important variables were then excluded one at a time to observe the change in model performance. The resulting models had a root mean squared error-observation standard deviation ratio (RSR) of 0.53 to 0.55, and the model using all input variables performed best, with an RSR of 0.53. To improve field applicability, models using only parameters that are relatively easy to measure in the field were also built and evaluated. A model built using only water temperature, pH, electrical conductivity, dissolved oxygen concentration, and suspended solids had an RSR of 0.72, which suggests that stable performance may be achievable using only field water quality parameters that are relatively easy to measure.
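The evaluation procedure summarized in the abstract can be illustrated with a minimal sketch. It assumes that RSR follows the common definition of RMSE divided by the standard deviation of the observations, and that variables are ranked once by the feature importance of the full model and then dropped from least to most important. The abstract does not state whether importance was recomputed after each exclusion, so that part is an assumption. The function names `rsr` and `sequential_exclusion`, and the `evaluate` callback that would train a model (for example an XGBoost regressor) on a feature subset and return its RSR, are hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def rsr(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the standard deviation of the observations.

    Both terms share the same 1/n factor, so it cancels:
    RSR = sqrt(sum((o - p)^2)) / sqrt(sum((o - mean(o))^2)).
    Lower is better; always predicting the observed mean gives 1.0.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sq_dev = sum((o - mean_obs) ** 2 for o in observed)
    if sq_dev == 0:
        raise ValueError("observations have zero variance; RSR is undefined")
    return math.sqrt(sq_err / sq_dev)


def sequential_exclusion(
    importance: Dict[str, float],
    evaluate: Callable[[List[str]], float],
    min_features: int = 1,
) -> List[Tuple[List[str], float]]:
    """Score feature subsets while dropping the least important feature each round.

    importance: feature name -> importance taken from the full model
                (assumption: ranked once, not recomputed per round).
    evaluate:   hypothetical callback that trains a model on the given
                feature names and returns its RSR on held-out data.
    """
    # Most important first, so the last element is always the next to drop.
    ranked = sorted(importance, key=importance.get, reverse=True)
    results: List[Tuple[List[str], float]] = []
    while len(ranked) >= min_features and ranked:
        results.append((list(ranked), evaluate(list(ranked))))
        ranked = ranked[:-1]
    return results
```

Passing the model training step in as a callback keeps the exclusion loop independent of any particular library, so the same loop could be reused with a different regressor or a different performance metric.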


Acknowledgement

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2022R1F1A1065518).
