A general-purpose model capable of image captioning in Korean and English and a method to generate text suitable for the purpose

  • Cho, Su Hyun (Department of Artificial Intelligence Convergence, Sungkyunkwan University)
  • Oh, Hayoung (College of Computing and Informatics, Sungkyunkwan University)
  • Received : 2022.05.03
  • Accepted : 2022.08.01
  • Published : 2022.08.31

Abstract

Image captioning is the task of looking at an image and describing it in natural language. It is an important problem that can be solved by bringing together two fields, image processing and natural language processing. By automatically recognizing an image and describing it in text, an image can be converted into text and then into speech to help visually impaired people understand their surroundings, and the task applies to many other areas such as image search, art therapy, sports commentary, and real-time traffic information commentary. Image captioning research so far has focused solely on recognizing an image and converting it into text. For practical use, however, a model must account for the diverse environments found in the real world and must be able to produce image descriptions that fit the intended purpose. In this paper, we propose a general-purpose Korean and English image captioning model and a text generation technique suited to the purpose of the caption.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2022R1F1A1074696).
