References
- J. Donahue et al., Long-term recurrent convolutional networks for visual recognition and description, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Boston, MA, USA, June 2015, pp. 2625-2634.
- O. Vinyals et al., Show and tell: A neural image caption generator, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Boston, MA, USA, June 2015, pp. 3156-3164.
- Y. Dong et al., Improving interpretability of deep neural networks with semantic information, arXiv preprint arXiv:1703.04096, 2017.
- L.A. Hendricks et al., Generating visual explanations, in Eur. Conf. Comput. Vision, Amsterdam, The Netherlands, Oct. 2016, pp. 3-19.
- L.A. Hendricks et al., Deep compositional captioning: Describing novel object categories without paired training data, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Las Vegas, NV, USA, June 2016, pp. 1-10.
- Q. You et al., Image captioning with semantic attention, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Las Vegas, NV, USA, June 2016, pp. 4651-4659.
- S.J. Rennie et al., Self-critical sequence training for image captioning, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 1179-1195.
- Q. Wu et al., Image captioning and visual question answering based on attributes and external knowledge, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2018), no. 6, 1367-1381. https://doi.org/10.1109/TPAMI.2017.2708709
- Y. Yu et al., End-to-end concept word detection for video captioning, retrieval, and question answering, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 3261-3269.
- P. Anderson et al., Bottom-up and top-down attention for image captioning and VQA, arXiv preprint arXiv:1707.07998, 2017.
- J. Lu et al., Knowing when to look: Adaptive attention via a visual sentinel for image captioning, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 3242-3250.
- T. Yao et al., Boosting image captioning with attributes, in Proc. IEEE Int. Conf. Comput. Vision, Venice, Italy, Oct. 22-29, 2017.
- C. Wang, H. Yang, and C. Meinel, Image captioning with deep bidirectional LSTMs and multi-task learning, ACM Trans. Multimedia Comput. Commun. Appl. 14 (2018), no. 2s, 1-20.
- C. Szegedy et al., Going deeper with convolutions, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Boston, MA, USA, June 2015, pp. 1-9.
- S. Reed et al., Learning deep representations of fine-grained visual descriptions, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Las Vegas, NV, USA, June 2016, pp. 49-58.
- L. Zhang et al., Learning a deep embedding model for zero-shot learning, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 3010-3019.
- X. He and Y. Peng, Fine-grained image classification via combining vision and language, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 7332-7340.
- R. Kiros, R. Salakhutdinov, and R.S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, arXiv preprint arXiv:1411.2539, 2014.
- J. Mao et al., Learning like a child: Fast novel visual concept learning from sentence descriptions of images, in Proc. IEEE Int. Conf. Comput. Vision, Santiago, Chile, 2015, pp. 2533-2541.
- R. Vedantam et al., Context-aware captions from context-agnostic supervision, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 1070-1079.
- A.H. Abdulnabi et al., Multi-task CNN model for attribute prediction, IEEE Trans. Multimedia 17 (2015), no. 11, 1949-1959. https://doi.org/10.1109/TMM.2015.2477680
- T.-H. Chen et al., Show, adapt and tell: Adversarial training of cross-domain image captioner, in Proc. IEEE Int. Conf. Comput. Vision, Venice, Italy, Oct. 2017, pp. 521-530.
- R.R. Selvaraju et al., Grad-CAM: Visual explanations from deep networks via gradient-based localization, in Proc. IEEE Int. Conf. Comput. Vision, Venice, Italy, Oct. 2017, pp. 618-626.
- Y.-C. Yoon et al., Fine-grained mobile application clustering model using retrofitted document embedding, ETRI J. 39 (2017), no. 4, 443-454. https://doi.org/10.4218/etrij.17.0116.0936
- S. Kong and C. Fowlkes, Low-rank bilinear pooling for fine-grained classification, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 7025-7034.
- S. Yu et al., A model for fine-grained vehicle classification based on deep learning, Neurocomput. 257 (2017), 97-103. https://doi.org/10.1016/j.neucom.2016.09.116
- X.-S. Wei et al., Selective convolutional descriptor aggregation for fine-grained image retrieval, IEEE Trans. Image Process. 26 (2017), no. 6, 2868-2881. https://doi.org/10.1109/TIP.2017.2688133
- G.-S. Xie et al., LG-CNN: From local parts to global discrimination for fine-grained recognition, Pattern Recogn. 71 (2017), 118-131. https://doi.org/10.1016/j.patcog.2017.06.002
- S.H. Lee, HGO-CNN: Hybrid generic-organ convolutional neural network for multi-organ plant classification, in Proc. IEEE Int. Conf. Image Process., Beijing, China, Sept. 2017, pp. 4462-4466.
- A. Li et al., Zero-shot fine-grained classification by deep feature learning with semantics, arXiv preprint arXiv:1707.00785, 2017.
- Z. Akata et al., Evaluation of output embeddings for fine-grained image classification, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Boston, MA, USA, June 2015, pp. 2927-2936.
- R. Ranjan, V. M. Patel, and R. Chellappa, Hyperface: A deep multitask learning framework for face detection, landmark localization, pose estimation, and gender recognition, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2018), 121-135. https://doi.org/10.1109/TPAMI.2017.2781233
- K. Hashimoto et al., A joint many-task model: Growing a neural network for multiple NLP tasks, arXiv preprint arXiv:1611.01587, 2016.
- R. Caruana, Multitask learning: a knowledge-based source of inductive bias, in Proc. Int. Conf. Mach. Learn., Amherst, MA, USA, June 1993, pp. 41-48.
- L. Duong et al., Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser, in Proc. Annu. Meeting Association Computat. Linguistics Int. Joint Conf. Natural Language Process., Beijing, China, July 2015, pp. 845-850.
- M. Nilsback and A. Zisserman, Automated flower classification over a large number of classes, in Proc. Indian Conf. Comput. Vision, Graphics Image Process., Bhubaneswar, India, Dec. 2008, pp. 722-729.
- C. Wah et al., The Caltech-UCSD Birds-200-2011 Dataset, Tech. Report CNS-TR-2011-001, California Institute of Technology, 2011.
- K. Papineni et al., Bleu: A method for automatic evaluation of machine translation, in Proc. Annu. Meeting Association Computat. Linguistics, Philadelphia, PA, USA, July 2002, pp. 311-318.
- C.-Y. Lin, Rouge: a package for automatic evaluation of summaries, in Workshop Text Summarization Branches Out, Post-Conf. Workshop ACL, Barcelona, Spain, July 2004, pp. 74-81.
- S. Banerjee and A. Lavie, Meteor: an automatic metric for MT evaluation with improved correlation with human judgments, in Proc. ACL Workshop Intrinsic Extrinsic Evaluation Measures Mach. Translation Summarization, Ann Arbor, MI, USA, 2005, pp. 65-72.
- R. Vedantam, C.L. Zitnick, and D. Parikh, Cider: Consensus-based image description evaluation, arXiv preprint arXiv:1411.5726, 2014.
- C. Szegedy, S. Ioffe, and V. Vanhoucke, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in Proc. AAAI Conf. Artif. Intell., San Francisco, CA, USA, Feb. 2017, pp. 4278-4284.
- A. Paszke et al., Automatic differentiation in PyTorch, in Proc. NIPS, Long Beach, CA, USA, 2017.