Automated Data Extraction from Unstructured Geotechnical Report based on AI and Text-mining Techniques

  • Park, Jimin (Dept. of Civil and Environmental Eng., Yonsei Univ.) ;
  • Seo, Wanhyuk (Dept. of Civil and Environmental Eng., Yonsei Univ.) ;
  • Seo, Dong-Hee (Geotechnical Eng. Team, RIF Tech., GS E&C) ;
  • Yun, Tae-Sup (Dept. of Civil and Environmental Eng., Yonsei Univ.)
  • Received : 2024.06.28
  • Accepted : 2024.07.12
  • Published : 2024.08.31

Abstract

Geotechnical parameters are obtained from various field and laboratory tests and documented in geotechnical investigation reports. Digitizing these parameters is essential for efficient design and construction, but current practice relies on manual data entry, which is time-consuming, labor-intensive, and error-prone. This study therefore proposes a method for automatically extracting data from geotechnical investigation reports using image-based deep learning models and text-mining techniques. A deep-learning-based page classification model, combined with a text-searching algorithm, classified the test-result pages in the report appendices with 100% accuracy. Computer vision algorithms were used to identify valid data regions within each page, and text analysis matched each target data item to its corresponding value for extraction. The proposed model was validated on a dataset of 205 geotechnical investigation reports and achieved an average data extraction accuracy of 93.0%. Finally, a program with a user interface was developed to make the extraction model practical to use: users can upload PDF files of geotechnical investigation reports, have them analyzed automatically, and then extract and edit the data. This approach is expected to improve the efficiency and accuracy of digitizing geotechnical investigation reports and building geotechnical databases.
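The text-searching and item-matching steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the study combines an image-based deep learning classifier with computer vision for region detection, neither of which is reproduced here. The sketch assumes the page text has already been extracted from the PDF, and the keyword table, field names, and regular expressions (`PAGE_KEYWORDS`, `FIELD_PATTERNS`) are hypothetical placeholders chosen only for illustration.

```python
import re

# Hypothetical keyword table: test type -> phrases expected on that page type.
PAGE_KEYWORDS = {
    "spt": ["standard penetration test", "n-value"],
    "triaxial": ["triaxial compression"],
    "consolidation": ["consolidation test"],
}

# Hypothetical target items: item name -> regex capturing a number that
# follows the item label on the same line.
NUMBER = r"(-?\d+(?:\.\d+)?)"
FIELD_PATTERNS = {
    "cohesion_kPa": re.compile(r"cohesion[^0-9\-\n]*" + NUMBER, re.I),
    "friction_angle_deg": re.compile(r"friction angle[^0-9\-\n]*" + NUMBER, re.I),
}


def classify_page(text):
    """Return the page type whose keywords occur most often, or None."""
    lowered = text.lower()
    scores = {
        label: sum(keyword in lowered for keyword in keywords)
        for label, keywords in PAGE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_fields(text):
    """Match each target item label to the numeric value that follows it."""
    values = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            values[name] = float(match.group(1))
    return values


def extraction_accuracy(extracted, expected):
    """Fraction of expected items whose extracted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return correct / len(expected)


page_text = (
    "Triaxial Compression Test Result\n"
    "Cohesion (kPa): 12.5\n"
    "Friction angle (deg) = 31\n"
)
page_type = classify_page(page_text)
fields = extract_fields(page_text)
```

For the sample page above, `classify_page` selects the `"triaxial"` label and `extract_fields` pairs each label with the number on its line. A per-report score from `extraction_accuracy` could then be averaged over a validation set, which is one plausible way to arrive at an average accuracy figure like the one the abstract reports.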

Acknowledgement

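This study was conducted using data provided by GS E&C and was supported by the National Research Foundation of Korea (NRF) (No. 2023R1A2C2003534, NRF-2021R1A5A1032433). The authors express their deep gratitude to these organizations.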
