AI Model-Based Automated Data Cleaning for Reliable Autonomous Driving Image Datasets

Kana Kim; Hakil Kim

doi:10.5909/JBE.2023.28.3.302

Journal of Broadcast Engineering (방송공학회논문지)

Volume 28, Issue 3, Pages 302-313, 2023. pISSN 1226-7953, eISSN 2287-9137

The Korean Institute of Broadcast and Media Engineers (한국방송∙미디어공학회)

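Korean title: 자율주행 영상데이터의 신뢰도 향상을 위한 AI모델 기반 데이터 자동 정제 (AI model-based automated data cleaning to improve the reliability of autonomous driving image data)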

Kana Kim (김가나), Department of Electrical and Computer Engineering, Inha University
Hakil Kim (김학일), Department of Electrical and Computer Engineering, Inha University

Received : 2023.03.22
Accepted : 2023.04.13
Published : 2023.05.30

https://doi.org/10.5909/JBE.2023.28.3.302

Abstract

This paper aims to develop a framework that fully automates quality management of the training data for Artificial Intelligence (AI) models built under the 'AI Hub Data Dam' project of the Ministry of Science and ICT (MSIT), in which more than 1 trillion won has been invested since 2017. AI-based autonomous driving has achieved excellent performance in many studies, but training such models requires a large amount of high-quality data. It remains difficult for human inspectors to check processed data directly and prove that it is valid, and a model trained on erroneous data can cause fatal problems in real-world use. The paper presents a dataset reconstruction framework that removes abnormal data from an already constructed dataset, and introduces strategies that improve AI model performance and training efficiency by rebuilding the data into a reliable dataset. The framework was validated in an experiment on the autonomous driving dataset published through the AI Hub of the National Information Society Agency (NIA). The results confirm that the dataset can be rebuilt as a reliable dataset from which abnormal data has been removed.

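This study aims to develop a framework that can automate quality management of the AI model training data built under the 'AI Hub Dam' project, in which the Ministry of Science and ICT has invested more than 1 trillion won since 2017. Training the AI models used in autonomous driving development requires a large amount of high-quality data. It is still difficult for inspectors to check processed data for anomalies in the data itself and to prove that it is valid, and a model trained on erroneous data can cause serious problems in real situations. This paper introduces a strategy that improves a model's recognition performance through a reliable dataset cleaning framework that removes abnormal data. The proposed method is designed on the basis of the indicators in the quality management guidelines for AI training data. Experiments on the autonomous driving dataset released through the AI Hub of the National Information Society Agency demonstrated the validity of the framework and confirmed that the data can be rebuilt as a reliable dataset from which abnormal data has been removed.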
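
The abstract does not state how abnormal data is detected, so the following is only a minimal sketch of the general idea of model-based cleaning, not the authors' method. It assumes each image comes with human-annotated bounding boxes and with boxes predicted by some pre-trained detector, and it flags an image as abnormal when any annotated box has no predicted box overlapping it well enough. The names Sample, is_abnormal and clean, the rule itself, and the 0.5 IoU threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Sample:
    image_id: str
    labeled: List[Box]    # boxes from the human annotation
    predicted: List[Box]  # boxes from a pre-trained detector (assumed given)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_abnormal(sample: Sample, min_iou: float = 0.5) -> bool:
    """Hypothetical rule: flag the sample if any labeled box has no
    predicted box overlapping it with IoU >= min_iou."""
    for lab in sample.labeled:
        best = max((iou(lab, p) for p in sample.predicted), default=0.0)
        if best < min_iou:
            return True
    return False


def clean(dataset: List[Sample], min_iou: float = 0.5):
    """Split the dataset into (kept, removed) lists using is_abnormal."""
    kept, removed = [], []
    for s in dataset:
        (removed if is_abnormal(s, min_iou) else kept).append(s)
    return kept, removed


# Toy data: the label in a.jpg is supported by a prediction, b.jpg is not.
dataset = [
    Sample("a.jpg", [(0, 0, 10, 10)], [(1, 1, 10, 10)]),
    Sample("b.jpg", [(0, 0, 10, 10)], [(50, 50, 60, 60)]),
]
kept, removed = clean(dataset)
print([s.image_id for s in kept], [s.image_id for s in removed])
```

A rule like this treats the detector as a second opinion on each label. In practice the threshold would have to be tuned, and flagged images would probably be reviewed before removal, since a disagreement can also mean the detector is wrong.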

Acknowledgement

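This research was supported in 2023 by the Korea Institute for Advancement of Technology (KIAT) with funding from the Korean government (Ministry of Trade, Industry and Energy) (P0017124, 2023 Industrial Innovation Talent Growth Support Program).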

