Multimodal Context Embedding for Scene Graph Generation

  • Jung, Gayoung (Dept. of Computer Science, Graduate School of Kyonggi University)
  • Kim, Incheol (Dept. of Computer Science, Kyonggi University)
  • Received: 2020.06.26
  • Reviewed: 2020.09.14
  • Published: 2020.12.31

Abstract

This study proposes a novel deep neural network model that detects the objects in an image and the relationships between them, and represents the result as a scene graph. To detect objects and relationships accurately, the model combines several multimodal features, including linguistic features and visual context features. The context features are embedded with graph neural networks so that each context feature vector captures the dependencies between two related objects. Comparative experiments on the Visual Genome benchmark dataset demonstrate the effectiveness of the proposed model.
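To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' model. It shows a scene graph stored as objects plus (subject, predicate, object) triples, and one message-passing step in which each object's feature vector is averaged with those of its related objects. The names `SceneGraph`, `neighbors`, and `embed_context` are hypothetical, the toy features are invented for illustration, and plain mean aggregation without learned weights is assumed here in place of a trained graph neural network layer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Objects are nodes; (subject index, predicate, object index) triples are edges."""
    objects: List[str]
    triples: List[Tuple[int, str, int]] = field(default_factory=list)

    def neighbors(self, i: int) -> List[int]:
        """Indices of objects related to object i in either direction, plus i itself."""
        linked = {i}
        for s, _, o in self.triples:
            if s == i:
                linked.add(o)
            elif o == i:
                linked.add(s)
        return sorted(linked)


def embed_context(graph: SceneGraph, features: List[List[float]]) -> List[List[float]]:
    """One mean-aggregation step: a simplified, weight-free stand-in for a GNN layer."""
    updated = []
    for i in range(len(graph.objects)):
        nbrs = graph.neighbors(i)
        dim = len(features[i])
        # Average each feature dimension over the node and its related objects.
        updated.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return updated


# Toy example (assumed data): "man riding horse", "man wearing hat".
graph = SceneGraph(
    objects=["man", "horse", "hat"],
    triples=[(0, "riding", 1), (0, "wearing", 2)],
)
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
context = embed_context(graph, features)
```

After this step the vector for "horse" mixes in the features of "man", the object it is related to, which is the sense in which a context vector can carry dependencies between two related objects. A real model would apply learned weights and a nonlinearity at each layer.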
