Development of benthic macroinvertebrate species distribution models using the Bayesian optimization

  • Go, ByeongGeon (Department of Environmental Engineering, University of Seoul)
  • Shin, Jihoon (Department of Environmental Engineering, University of Seoul)
  • Cha, Yoonkyung (Department of Environmental Engineering, University of Seoul)
  • Received : 2021.05.21
  • Accepted : 2021.07.06
  • Published : 2021.08.15

Abstract

This study explored the usefulness and implications of Bayesian hyperparameter optimization in developing species distribution models (SDMs). Five machine learning (ML) algorithms, namely support vector machine (SVM), random forest (RF), boosted regression tree (BRT), XGBoost (XGB), and multilayer perceptron (MLP), were used to predict the occurrence of four benthic macroinvertebrate species. Bayesian optimization successfully tuned the model hyperparameters, with all ML models achieving an area under the curve (AUC) > 0.7. In addition, the hyperparameter values searched generally clustered around the optimal values, suggesting that Bayesian optimization is efficient at finding optimal hyperparameter sets. Tree-based ensemble algorithms (BRT, RF, and XGB) tended to perform better than SVM and MLP. Important hyperparameters and their optimal values differed by species and ML model, indicating that hyperparameter tuning is necessary to improve individual model performance. The optimization results show that, for all macroinvertebrate species, SVM and RF required fewer trials to reach optimal hyperparameter sets, which reduced computational cost relative to the other ML algorithms. Overall, the results suggest that Bayesian optimization is an efficient method for hyperparameter optimization of ML algorithms.
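The search behaviour described in the abstract, where trials concentrate around the best-performing hyperparameter values, can be illustrated with a minimal sketch of a Tree-structured Parzen Estimator (TPE)-style loop, one common form of Bayesian optimization. The abstract does not state which surrogate model or library the authors used, so the algorithm choice here is an assumption. The objective `toy_loss` is a hypothetical stand-in for "1 - cross-validated AUC" as a function of a single hyperparameter (for example, log10 of an SVM penalty), and all function and parameter names are illustrative only.

```python
import math
import random


def kde(x, centers, bandwidth):
    """Gaussian kernel density estimate at x, built from observed points."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / (len(centers) * norm)


def tpe_minimize(objective, low, high, n_trials=40, n_startup=10,
                 gamma=0.25, n_candidates=24, seed=0):
    """Simplified one-dimensional TPE-style search (illustrative only)."""
    rng = random.Random(seed)
    n_startup = max(2, n_startup)      # need at least one good and one bad point
    bandwidth = (high - low) / 10.0    # fixed kernel width, an arbitrary choice
    history = []                       # (loss, x) pairs for every trial
    for trial in range(n_trials):
        if trial < n_startup:
            # Warm-up phase: plain random search.
            x = rng.uniform(low, high)
        else:
            # Split past trials into a "good" group (lowest losses) and the rest.
            ranked = sorted(history)
            n_good = max(1, math.ceil(gamma * len(ranked)))
            n_good = min(n_good, len(ranked) - 1)
            good = [xi for _, xi in ranked[:n_good]]
            bad = [xi for _, xi in ranked[n_good:]]
            # Draw candidates near good points, clipped to the search range.
            candidates = []
            for _ in range(n_candidates):
                c = rng.gauss(rng.choice(good), bandwidth)
                candidates.append(min(high, max(low, c)))
            # Keep the candidate with the highest good/bad density ratio.
            x = max(candidates,
                    key=lambda c: kde(c, good, bandwidth)
                    / (kde(c, bad, bandwidth) + 1e-12))
        history.append((objective(x), x))
    best_loss, best_x = min(history)
    return best_x, best_loss, history


def toy_loss(x):
    """Hypothetical stand-in for 1 - AUC; its minimum (0.15) is at x = 1."""
    return 0.15 + 0.05 * (x - 1.0) ** 2


if __name__ == "__main__":
    best_x, best_loss, history = tpe_minimize(toy_loss, -3.0, 3.0)
    print(f"best x = {best_x:.3f}, implied AUC = {1.0 - best_loss:.3f}")
```

In practice the objective would train a model with the proposed hyperparameters and return a validation score, and each extra trial costs one model fit, which is why the number of trials needed to reach the optimum translates directly into computational cost.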

Acknowledgement

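This research was supported by the Korea Environmental Industry & Technology Institute (KEITI) through the Aquatic Ecosystem Conservation Research Program, funded by the Korea Ministry of Environment (Project No. 2020003050003).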
