Comparing the Performance of a Deep Learning Model (TabPFN) for Predicting River Algal Blooms with Varying Data Composition

  • Hyunseok Yang (Department of Civil and Environmental Engineering, Hanbat National University)
  • Jungsu Park (Department of Civil and Environmental Engineering, Hanbat National University)
  • Received : 2024.04.26
  • Accepted : 2024.06.26
  • Published : 2024.08.31

Abstract

Algal blooms in rivers can adversely affect water source management and water treatment processes and therefore require continuous management. In this study, a multiclass classification model was developed to predict the concentration of chlorophyll-a (chl-a), a key indicator of algal blooms, using Tabular Prior-data Fitted Networks (TabPFN), a deep learning algorithm known to perform relatively well on small tabular datasets. The model was built from daily observations collected at the Buyeo water quality monitoring station from January 1, 2014, to December 31, 2022. To examine how the size of the input data affects model performance, the daily data were averaged to construct input datasets with measurement intervals of 1, 3, 6, and 12 days. A comparison of the four resulting models showed that performance remained stable even as the measurement interval lengthened and the number of observations decreased. For the 1-, 3-, 6-, and 12-day models, respectively, the macro-averaged precision was 0.77, 0.76, 0.83, and 0.84; recall was 0.63, 0.65, 0.66, and 0.74; and F1-score was 0.67, 0.69, 0.71, and 0.78. The weighted-average precision was 0.76, 0.77, 0.81, and 0.84; recall was 0.76, 0.78, 0.81, and 0.85; and F1-score was 0.74, 0.77, 0.80, and 0.84. These results indicate that a chl-a prediction model built with TabPFN performs stably even with small input datasets, supporting its applicability in settings where the data available for model construction are limited.
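
The abstract states that daily observations were averaged into 1-, 3-, 6-, and 12-day datasets but does not spell out the averaging scheme. The following is a minimal sketch that assumes consecutive, non-overlapping blocks and skips missing values; the function name `block_average` and the sample values are hypothetical and are not taken from the study.

```python
from datetime import date
from math import ceil
from statistics import mean


def block_average(values, window):
    """Average consecutive, non-overlapping blocks of `window` daily values.

    Assumption: missing observations are given as None and skipped.
    A block with no valid values yields None; a trailing partial block
    is averaged over whatever values it contains.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    averaged = []
    for start in range(0, len(values), window):
        block = [v for v in values[start:start + window] if v is not None]
        averaged.append(mean(block) if block else None)
    return averaged


# Hypothetical daily chl-a values (illustrative only, not study data).
daily_chla = [12.0, 14.0, 16.0, 30.0, None, 34.0,
              8.0, 10.0, 12.0, 20.0, 22.0, 24.0]

# One dataset per measurement interval named in the abstract.
datasets = {w: block_average(daily_chla, w) for w in (1, 3, 6, 12)}

# Upper bound on the number of records per interval for the stated
# observation period, assuming no missing days.
n_days = (date(2022, 12, 31) - date(2014, 1, 1)).days + 1
max_counts = {w: ceil(n_days / w) for w in (1, 3, 6, 12)}

for w in (1, 3, 6, 12):
    print(f"{w:>2}-day interval: at most {max_counts[w]} records")
```

Under these assumptions the observation period spans 3,287 days, so the 12-day dataset can hold at most 274 records, roughly one twelfth of the daily dataset. The counts actually used in the study may be lower because of missing measurements.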

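
The abstract reports both macro and weighted averages of precision, recall, and F1-score. The sketch below shows how the two averages differ for a multiclass problem: the macro average weights every class equally, while the weighted average weights each class by its number of true samples. The class labels and predictions are hypothetical, and treating an undefined ratio as 0.0 is an assumption made here for illustration.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return precision, recall, F1, and support for every class label."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: an undefined ratio (zero denominator) is scored 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support[label]}
    return scores


def average(scores, metric, weighted=False):
    """Macro average (equal class weights) or support-weighted average."""
    if weighted:
        total = sum(s["support"] for s in scores.values())
        return sum(s[metric] * s["support"] for s in scores.values()) / total
    return sum(s[metric] for s in scores.values()) / len(scores)


# Hypothetical, imbalanced chl-a classes (illustrative only).
y_true = ["low"] * 6 + ["mid"] * 3 + ["high"] * 1
y_pred = ["low"] * 5 + ["mid"] + ["mid", "mid", "low"] + ["mid"]

scores = per_class_scores(y_true, y_pred)
macro_recall = average(scores, "recall")
weighted_recall = average(scores, "recall", weighted=True)
print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

In this toy example the rare class is never predicted correctly, which pulls the macro recall well below the weighted recall. For single-label classification the support-weighted recall equals overall accuracy, so a gap between the two averages points to weaker performance on the less frequent classes.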

Acknowledgement

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2022R1F1A1065518).
