Funding
This paper was supported by the Chung-Ang University Research Scholarship Grants in 2019.