AI Model-Based Automated Data Cleaning for Reliable Autonomous Driving Image Datasets

  • Kana Kim (Department of Electrical and Computer Engineering, Inha University)
  • Hakil Kim (Department of Electrical and Computer Engineering, Inha University)
  • Received : 2023.03.22
  • Accepted : 2023.04.13
  • Published : 2023.05.30

Abstract

This paper aims to develop a framework that fully automates quality management of the training data for large-scale Artificial Intelligence (AI) models that was built under the 'AI Hub Data Dam' project, in which the Ministry of Science and ICT (MSIT) has invested more than 1 trillion won since 2017. AI-based autonomous driving technology has achieved excellent performance through many studies, but training such models requires a large amount of high-quality data. It remains difficult for human inspectors to check processed data directly and prove that it is valid, and a model trained on erroneous data can cause fatal problems in real-world use. This paper presents a dataset reconstruction framework that removes abnormal data from a constructed dataset, and introduces strategies that improve AI model performance and training efficiency by rebuilding the data into a reliable dataset. The proposed method is designed around the indicators in the quality management guidelines for AI training data. The framework was validated experimentally on the autonomous driving dataset released through the AI Hub of the National Information Society Agency (NIA), confirming that the dataset can be rebuilt as a reliable one from which abnormal data has been removed.

Acknowledgement

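This research was supported by the Korea Institute for Advancement of Technology (KIAT) with funding from the Korean government (Ministry of Trade, Industry and Energy) in 2023 (P0017124, 2023 HRD Program for Industrial Innovation).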
