Improving visual relationship detection using linguistic and spatial cues

  • Jung, Jaewon (Artificial Intelligence Laboratory, University of Science and Technology, ETRI SCHOOL)
  • Park, Jongyoul (Artificial Intelligence Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2019.02.27
  • Accepted : 2019.07.29
  • Published : 2020.06.08

Abstract

Detecting visual relationships in an image is an important image understanding task. It enables higher-level tasks such as predicting the next scene and understanding what occurs in an image. A visual relationship comprises a subject, a predicate, and an object, and is related to visual, linguistic, and spatial cues. The predicate describes the relationship between the subject and the object and can be categorized into classes such as prepositions and verbs. A large visual gap can exist even among visual relationships that share the same predicate. This study improves on a previous approach, which uses linguistic cues with two losses and a spatial cue containing only individual object information, by adding relative information about the subject and the object. An architectural limitation of the earlier model is demonstrated and overcome so that all zero-shot visual relationships can be detected. A new problem is also identified, with an explanation of how it degrades performance. Experiments on the VRD and VG datasets show a significant improvement over previous results.
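
As a rough illustration of the spatial cue described above, the following minimal sketch builds a feature vector that combines individual bounding-box information with relative subject-object information. The function name, feature layout, and normalization scheme are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def spatial_cue(subj_box, obj_box, img_w, img_h):
    """Combine individual and relative spatial cues for a (subject, object) pair.

    Boxes are (x1, y1, x2, y2) in pixels. This is an illustrative sketch,
    not the exact feature set used in the paper.
    """
    def normalize(box):
        x1, y1, x2, y2 = box
        # Individual cue: box coordinates normalized by the image size.
        return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])

    s, o = normalize(subj_box), normalize(obj_box)
    sw, sh = s[2] - s[0], s[3] - s[1]   # subject width and height
    ow, oh = o[2] - o[0], o[3] - o[1]   # object width and height
    # Relative cue: offset and scale of the object w.r.t. the subject.
    rel = np.array([
        (o[0] - s[0]) / max(sw, 1e-6),  # relative x offset
        (o[1] - s[1]) / max(sh, 1e-6),  # relative y offset
        ow / max(sw, 1e-6),             # width ratio
        oh / max(sh, 1e-6),             # height ratio
    ])
    # Concatenated 12-dim feature, e.g., as input to a predicate classifier.
    return np.concatenate([s, o, rel])
```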

References

  1. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, CoRR abs/1409.1556.
  2. K. He et al., Deep residual learning for image recognition, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Las Vegas, NV, USA, June 2016, pp. 770-778.
  3. R. Girshick et al., Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016), no. 1, 142-158. https://doi.org/10.1109/TPAMI.2015.2437384
  4. R. Girshick, Fast R-CNN, in Proc. IEEE Int. Conf. Comput. Vision, Santiago, Chile, Dec. 2015, pp. 1440-1448.
  5. S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, in Adv. Neural Inf. Process. Syst., Montreal, Canada, Dec. 2015, pp. 91-99.
  6. K. He et al., Mask R-CNN, in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy, Oct. 2017, pp. 2980-2988.
  7. Z. Ren et al., Deep reinforcement learning-based image captioning with embedding reward, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 1151-1159.
  8. S. Li et al., Person search with natural language description, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 5187-5196.
  9. J. Johnson et al., Image retrieval using scene graphs, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Boston, MA, USA, June 2015, pp. 3668-3678.
  10. Y. Li et al., Scene graph generation from objects, phrases and region captions, in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy, Oct. 2017, pp. 1261-1270.
  11. D. Xu et al., Scene graph generation by iterative message passing, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 3097-3106.
  12. Y. Goyal et al., Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 6325-6334.
  13. S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997), no. 8, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
  14. M. A. Sadeghi and A. Farhadi, Recognition using visual phrases, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Providence, RI, USA, June 2011, pp. 1745-1752.
  15. C. Lu et al., Visual relationship detection with language priors, in Proc. Eur. Conf. Comput. Vision (ECCV), Amsterdam, Netherlands, Oct. 2016, pp. 852-869.
  16. R. Krishna et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vision 123 (2017), no. 1, 32-73. https://doi.org/10.1007/s11263-016-0981-7
  17. T. Mikolov et al., Efficient estimation of word representations in vector space, in Proc. Int. Conf. Learn. Representations (ICLR) Workshop, Scottsdale, AZ, USA, 2013, pp. 1-12.
  18. R. Yu et al., Visual relationship detection with internal and external linguistic knowledge distillation, in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy, Oct. 2017, pp. 1068-1076.
  19. Y. Zhu, S. Jiang, and X. Li, Visual relationship detection with object spatial distribution, in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Hong Kong, China, July 2017, pp. 379-384.
  20. Y. W. Chao et al., Learning to detect human-object interactions, in Proc. IEEE Winter Conf. Applicat. Comput. Vision, Lake Tahoe, NV, USA, Mar. 2018, pp. 381-389.
  21. G. Gkioxari et al., Detecting and recognizing human-object interactions, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Salt Lake City, UT, USA, June 2018, pp. 8359-8367.
  22. T. Y. Lin et al., Microsoft COCO: Common objects in context, in Proc. Eur. Conf. Comput. Vision (ECCV), Zurich, Switzerland, Sept. 2014, pp. 740-755.
  23. B. Dai, Y. Zhang, and D. Lin, Detecting visual relationships with deep relational networks, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 3298-3308.
  24. Y. Li et al., ViP-CNN: Visual phrase guided convolutional neural network, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 7244-7253.
  25. H. Zhang et al., Visual translation embedding network for visual relation detection, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 3107-3115.
  26. B. A. Plummer et al., Phrase localization and visual relationship detection with comprehensive image-language cues, in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy, Oct. 2017, pp. 1928-1937.
  27. X. Liang, L. Lee, and E. P. Xing, Deep variation-structured reinforcement learning for visual relationship and attribute detection, 2017, CoRR abs/1703.03054.
  28. X. Shang et al., Video visual relation detection, in Proc. ACM Int. Conf. Multimedia, Mountain View, CA, USA, Oct. 2017, pp. 1300-1308.
  29. X. Liang, L. Lee, and E. P. Xing, Deep variation-structured reinforcement learning for visual relationship and attribute detection, in Proc. IEEE Conf. Comput. Vision Pattern Recogn. (CVPR), Honolulu, HI, USA, July 2017, pp. 4408-4417.

Cited by

  1. Automated optimization for memory-efficient high-performance deep neural network accelerators, ETRI Journal 42 (2020), no. 4, https://doi.org/10.4218/etrij.2020-0125