A Study on Improvement of Image Classification Accuracy Using Image-Text Pairs

이미지-텍스트 쌍을 활용한 이미지 분류 정확도 향상에 관한 연구

  • Mi-Hui Kim (School of Computer Engineering & Applied Mathematics, Computer System Institute, Hankyong National University)
  • Ju-Hyeok Lee (School of Computer Engineering & Applied Mathematics, Computer System Institute, Hankyong National University)
  • Received : 2023.11.21
  • Accepted : 2023.12.26
  • Published : 2023.12.31


With the development of deep learning, it has become possible to tackle a wide range of computer vision problems such as image processing. However, most image processing methods rely only on the visual information in the image. Text data related to an image, such as descriptions and annotations, can provide additional context and visual information that is difficult to obtain from the image itself. In this paper, we propose a deep learning model that analyzes both images and text using image-text pairs, with the aim of improving image classification accuracy. The proposed model achieved approximately 11% higher classification accuracy than a deep learning model using only image information.

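Advances in deep learning have made a wide range of computer vision research possible. Within computer vision, deep learning has shown high accuracy and performance in image processing. However, most image processing methods process an image using only its visual information. When image-text pairs are used, text data such as descriptions and annotations related to the image can provide additional context and visual information that is difficult to obtain from the image itself. In this paper, we propose a deep learning model that analyzes images and text using image-text pairs. The proposed model showed approximately 11% higher classification accuracy than a deep learning model using only image information.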
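The abstract does not say how the image and text information are combined. As an illustration only, the sketch below shows one common option, late fusion by concatenation: an image feature vector and a text feature vector are joined and passed through a single linear layer followed by softmax. The feature values, dimensions, weights, and function names are all hypothetical and are not taken from the paper; in practice the features would come from trained image and text encoders.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply one fully connected layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse_and_classify(image_feat, text_feat, weights, bias):
    """Late fusion (assumed here): concatenate both feature vectors, then classify."""
    fused = image_feat + text_feat  # list concatenation
    return softmax(linear(weights, bias, fused))


# Hypothetical sizes and values, chosen only to make the example runnable.
IMAGE_DIM, TEXT_DIM, NUM_CLASSES = 4, 3, 2
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(IMAGE_DIM + TEXT_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

image_feat = [0.2, 0.1, 0.7, 0.0]  # stand-in for an image encoder output
text_feat = [0.5, 0.3, 0.9]        # stand-in for a text encoder output

probs = fuse_and_classify(image_feat, text_feat, weights, bias)
print(probs)
```

An image-only baseline in this sketch would use only the first `IMAGE_DIM` inputs, which is the kind of comparison the abstract reports, although the actual models compared in the paper may differ.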


