1. Introduction
Image captioning is the task of generating a sentence that describes an image. The sentence should be fluent and semantically consistent with the image. The task lies at the intersection of computer vision (CV) and natural language processing (NLP). Benefiting from the rapid development of machine translation [1] and object detection [2] in these two fields, image captioning has attracted increasing attention in recent years. To maintain semantic consistency between the visual content and the generated text, an image captioning model must not only recognize specific objects but also capture the relationships between them.
The encoder-decoder framework is currently the dominant architecture for image captioning. It is essentially a type of deep learning model. Because the captions generated by deep learning models [3-6] are closer to natural language than those produced by conventional template-based [7] and retrieval-based [8] methods, this framework has been widely adopted. The CNN+RNN model is a representative instance, which completes the task in two stages: encoding and decoding. In the first stage, a CNN [9] encodes the image content and extracts its semantic feature information. In the second stage, an RNN [10] decodes the image features extracted by the encoder into the corresponding caption. Another encoder-decoder instance is the Transformer, in which both encoding and decoding are performed by a stack of Transformer layers.
Before the Up-Down [11] model was proposed, captioning models usually used ResNet [9] or similar encoders to extract grid features and thus obtain the spatial semantic information of images. The Up-Down model was the first to use Faster R-CNN [2] as an encoder to extract salient region features for image captioning, and it achieved state-of-the-art results at that time. However, most captioning models use only the features extracted by Faster R-CNN and discard the grid features extracted by ResNet, which leads to underutilization of spatial feature information. Motivated by this, we propose a Synergy-Gated Attention (SGA) method, in which the encoder attends to both the salient and the spatial features of the image and establishes a gating mechanism from the global features of both, so as to better control the interaction between the two kinds of information.
Many current works in image captioning focus on the decoding side, exploring how RNNs can decode image features more efficiently. As a variant of the RNN, the LSTM [12] is widely used in the decoder of image captioning models. It plays a crucial role in processing sequential data, and by introducing the input gate, forget gate, and output gate, the LSTM effectively alleviates the vanishing-gradient problem of RNNs. However, the hidden vector output by the LSTM at one time step usually depends only on the output of the preceding LSTM step, ignoring the visual correlation with subsequent LSTM hidden vectors. To address this problem, we design the RF-LSTM to replace the traditional LSTM unit. The RF-LSTM predicts the sequence information of subsequent steps within one time step and uses this information to jointly guide the output of the current LSTM. In this way, our model achieves competitive performance.
Experimentally, we explore the effects of SGA and RF-LSTM separately and observe that both methods perform well in the captioning model. To compare fairly with other models, we integrate SGA and RF-LSTM into a single model, called SGA-RF-LSTM. The overall architecture is shown in Fig. 1. Quantitative and qualitative analyses demonstrate that the proposed method achieves competitive performance against state-of-the-art methods. Specifically, we obtain a 119.1 CIDEr score under cross-entropy (XE) loss and a 130.0 CIDEr-D score with reinforcement learning [13, 14] on the MS COCO "Karpathy" offline test split [10]. The main contributions of this study are as follows:
Fig. 1. An overview of the proposed framework.
1) We propose a Synergy-Gated Attention (SGA) method, which attends to salient region features and spatial features simultaneously. The global information of these two feature sources is also used to guide the interaction between the two attended features.
2) We design a Recurrent Fusion LSTM (RF-LSTM), which obtains future output information to improve linguistic coherence. This is essentially different from the standard recurrent LSTM, which focuses only on previous output information.
3) We propose an image captioning model that combines SGA and RF-LSTM, which achieves competitive performance compared with state-of-the-art models on the MS COCO dataset.
2. Related work
2.1 Encoder-Decoder based captioning
In recent years, with the significant progress of deep learning, image captioning models have developed rapidly and achieved breakthrough results. Inspired by sequence-to-sequence tasks [15] such as machine translation, encoder-decoder frameworks have been widely used in image captioning models. Vinyals et al. [3] proposed a captioning model in which the encoder was a CNN and the decoder was an LSTM; image features were utilized only at the beginning of the LSTM. Xu et al. [16] first introduced the attention mechanism into image captioning, where attention was applied at every step of the LSTM to focus on the salient positional information of the image. After that, a series of innovations based on the encoder-decoder framework were proposed to guide captioning models toward sentences that better match human language by adding semantic attributes [17, 18]. Moreover, to explore the relationships between visual regions and mine the available semantic information in images, methods that build scene graphs [19-21] have emerged, which enhance the representation of images and the quality of captions by constructing visual relationship graphs.
Innovations in the structure of the decoder and LSTM-based refinements play a significant role in image captioning models, and an effective decoder can help the captioning model generate more accurate descriptions. Ke et al. [22] proposed a reflective decoding network for image captioning, which enhances both the long-sequence dependency and the position perception of words in a caption decoder. Li et al. [23] used a CNN as a decoder to replace the conventional LSTM, which solved the problems of long-term memory loss and the lack of parallel processing in LSTMs.
2.2 Attention-based captioning
Currently, attention plays a significant role in the task of image captioning. The attention mechanism selectively focuses on parts of the image, allowing the model to quickly obtain valuable information, which is more in line with human cognitive behavior. Lu et al. [24] proposed an adaptive attention model combined with a visual sentinel, which can adaptively decide whether to focus on visual information or non-visual text, so that meaningful information can be extracted at every time step. Wang et al. [25] proposed using hierarchical features so that attention can be computed synchronously on the features of pyramid levels, and showed that multiple multi-modal integration strategies can significantly improve model performance. Huang et al. [26] proposed an attention module that enhances visual attention by further measuring the relevance between the attention result and the query. Based on the Transformer [27], Herdade et al. [28] demonstrated the importance of spatial awareness by incorporating spatial relation information between objects through geometric attention. The excellent results obtained by the above methods inspire us to use different attention methods and different image features from multiple perspectives.
3. Method
In this paper, a new image captioning model is proposed to explore the diversity of image feature information fusion. Fig. 2 shows the overall framework of this model. We first present the multimodal embedding in Section 3.1, then introduce the Synergy-Gated Attention (SGA) and the Recurrent Fusion LSTM (RF-LSTM) in Section 3.2 and Section 3.3, respectively, and finally describe the training objectives of the model in Section 3.4.
Fig. 2. The framework of the proposed SGA-RF-LSTM.
3.1 Multimodal embedding
We use a CNN to extract spatial semantic information and Faster R-CNN to extract salient region information. From these, we obtain the visual embedding \(\hat{V}=\left[\hat{v}_{1}, \hat{v}_{2}, \ldots, \hat{v}_{N}\right]\), \(\hat{v}_{i} \in \mathbb{R}^{d}\), and the spatial embedding \(\hat{E}=\left[\hat{e}_{1}, \hat{e}_{2}, \ldots, \hat{e}_{M}\right]\), \(\hat{e}_{i} \in \mathbb{R}^{d}\). Since both the grid features and the region features describe the visual content of the same image, it is necessary to model their interaction. Formally, given \(\hat{V}\) and \(\hat{E}\), we use a 3-layer Transformer module \(\psi(\cdot\,; \theta_{a})\) to obtain more informative features via a self-attention operation:
\(V, E=\psi\left([\hat{V}, \hat{E}]; \theta_{a}\right)\) (1)
where \(\hat{E} \in \mathbb{R}^{(w \times h) \times d}\) is the feature map output by the last convolutional layer of the ResNet, and M = w × h denotes the number of grids, each covering an image area of the same size. \(\hat{V} \in \mathbb{R}^{N \times d}\) is the output of the Faster R-CNN, composed of N d-dimensional image region features \(\hat{v}_{i}\). The refined outputs are \(E=\{e_{1}, e_{2}, \ldots, e_{M}\}\), \(e_{i} \in \mathbb{R}^{d}\), and \(V=\{v_{1}, v_{2}, \ldots, v_{N}\}\), \(v_{i} \in \mathbb{R}^{d}\).
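As a concrete illustration, the following is a minimal sketch of this multimodal embedding step in PyTorch; the use of nn.TransformerEncoder, the layer sizes, and the feature shapes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalEmbedding(nn.Module):
    """Sketch of Eq. (1): joint self-attention over region and grid features."""
    def __init__(self, d_model=1024, n_layers=3, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, v_hat, e_hat):
        # v_hat: (B, N, d) salient region features from Faster R-CNN
        # e_hat: (B, M, d) grid features from the CNN, M = w * h
        x = torch.cat([v_hat, e_hat], dim=1)      # concatenate [V_hat, E_hat]
        x = self.encoder(x)                       # self-attention across both feature sets
        V, E = x[:, :v_hat.size(1)], x[:, v_hat.size(1):]
        return V, E                               # refined region and grid features

# Hypothetical usage: 36 regions, a 7x7 grid, d = 1024
V, E = MultimodalEmbedding()(torch.randn(2, 36, 1024), torch.randn(2, 49, 1024))
```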
3.2 Synergy-Gated Attention
Our Synergy-Gated Attention method performs multimodal fusion based on the multimodal embeddings (V, E) generated in Section 3.1. The two kinds of information are fused so that the LSTM can simultaneously employ the regional and spatial information of the image when generating the caption at each time step.
The formula for calculating the attentive weights of E is defined as follows:
\(Z_{t}^{e}=W_{eh}^{\mathrm{T}} \tanh\left(W_{e} E+\left(W_{eh} h_{t}^{a}\right) a_{e}^{\mathrm{T}}\right)\) (2)
\(\alpha_{t}^{e}=\operatorname{softmax}\left(Z_{t}^{e}\right)\) (3)
\(c_{t}^{e}=\sum_{i=1}^{M} \alpha_{t, i}^{e} e_{i}\) (4)
where \(W_{eh}^{\mathrm{T}}\), \(W_{e}\), and \(W_{eh}\) are learnable matrices for the spatial attention weights. \(\alpha_{t}^{e}=\{\alpha_{t,1}^{e}, \alpha_{t,2}^{e}, \ldots, \alpha_{t,M}^{e}\}\) are the attention weights over E, which sum to 1. \(c_{t}^{e}\) is the weighted sum of E and indicates the most relevant grid regions of the image. The attention weights of V are calculated analogously:
\(Z_{t}^{v}=W_{vh}^{\mathrm{T}} \tanh\left(W_{v} V+\left(W_{vh} h_{t}^{a}\right) a_{v}^{\mathrm{T}}\right)\) (5)
\(\alpha_{t}^{v}=\operatorname{softmax}\left(Z_{t}^{v}\right)\) (6)
\(c_{t}^{v}=\sum_{i=1}^{N} \alpha_{t, i}^{v} v_{i}\) (7)
where \(W_{vh}^{\mathrm{T}}\), \(W_{v}\), and \(W_{vh}\) are learnable matrices for the regional attention weights. \(\alpha_{t}^{v}=\{\alpha_{t,1}^{v}, \alpha_{t,2}^{v}, \ldots, \alpha_{t,N}^{v}\}\) are the attention weights over V, which sum to 1. \(c_{t}^{v}\) is the weighted sum of V and indicates the most relevant salient regions of the image.
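The following is a minimal sketch of this additive soft attention (Eqs. (2)-(7)), assuming PyTorch; one module is instantiated twice, once for the grid features E and once for the region features V, and the attention hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Sketch of Eqs. (2)-(4) / (5)-(7): additive attention over one feature set."""
    def __init__(self, d_feat=1024, d_hidden=1024, d_attn=512):
        super().__init__()
        self.w_feat = nn.Linear(d_feat, d_attn, bias=False)   # W_e or W_v
        self.w_hid = nn.Linear(d_hidden, d_attn, bias=False)  # W_eh or W_vh
        self.w_out = nn.Linear(d_attn, 1, bias=False)         # outer projection

    def forward(self, feats, h):
        # feats: (B, K, d_feat) -- E (K = M) or V (K = N); h: (B, d_hidden) = h_t^a
        z = self.w_out(torch.tanh(self.w_feat(feats) + self.w_hid(h).unsqueeze(1)))
        alpha = F.softmax(z.squeeze(-1), dim=-1)               # weights that sum to 1
        c = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)    # weighted sum c_t^e / c_t^v
        return c, alpha
```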
The joint use of the two kinds of features may introduce semantic noise during the fusion process. To address this, we concatenate the pooled features \(I_e\) extracted by the CNN and the pooled features \(I_v\) extracted by Faster R-CNN, and feed the concatenated vector into a gate unit to obtain the gate value \(g_t\):
\(I_{e}=\frac{1}{M} \sum_{i=1}^{M} e_{i}\) (8)
\(I_{v}=\frac{1}{N} \sum_{i=1}^{N} v_{i}\) (9)
\(g_{t}=\sigma\left(W_{g}\left[I_{e}, I_{v}\right]\right)\) (10)
where \(I_{e} \in \mathbb{R}^{d \times 1}\) denotes the mean of the grid feature vectors in E, \(I_{v} \in \mathbb{R}^{d \times 1}\) denotes the mean of the salient region feature vectors in V, and \(\sigma(\cdot)\) denotes the sigmoid function.
Through the dual-gate control [29], the attentive information of the salient regions is weighted by the gate value \(g_{t}\), and the complementary value \((1-g_{t})\) weights the spatial semantic information to achieve the final attentive fusion:
\(s_{t}=g_{t} \odot c_{t}^{v}+\left(1-g_{t}\right) \odot c_{t}^{e}\) (11)
where \(\odot\) indicates the Hadamard product, and \(s_{t} \in \mathbb{R}^{d \times 1}\) represents the output of SGA.
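A minimal sketch of the gated fusion in Eqs. (8)-(11) is given below, assuming PyTorch; the dimension follows Section 4.2 (d = 1024), but the module structure is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SynergyGate(nn.Module):
    """Sketch of Eqs. (8)-(11): gate the two attended contexts with pooled global features."""
    def __init__(self, d=1024):
        super().__init__()
        self.w_g = nn.Linear(2 * d, d)                 # W_g applied to [I_e, I_v]

    def forward(self, E, V, c_e, c_v):
        I_e = E.mean(dim=1)                            # Eq. (8): pooled grid features
        I_v = V.mean(dim=1)                            # Eq. (9): pooled region features
        g = torch.sigmoid(self.w_g(torch.cat([I_e, I_v], dim=-1)))   # Eq. (10)
        return g * c_v + (1.0 - g) * c_e               # Eq. (11): SGA output s_t
```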
3.3 Recurrent Fusion LSTM
To improve the sequence-generation performance of the LSTM, we introduce the Recurrent Fusion LSTM (RF-LSTM). As shown in Fig. 2, the structure is an encoder-decoder framework based on a double-layer RF-LSTM. The first layer is the attention Recurrent Fusion LSTM, denoted \(RF\text{-}LSTM^{a}\), composed of attention LSTMs that generate the attention weights. The second layer is the language Recurrent Fusion LSTM, denoted \(RF\text{-}LSTM^{l}\), composed of language LSTMs that generate words.
Our RF-LSTM recurs multiple LSTMs within one time step, focusing them on modeling the same input and establishing relationships among the resulting outputs. The number of recurrences in a layer is P, which means that there are P different outputs to be fused in each layer.
On the first layer, the hidden state \(h_{t}^{a}\) of the attention RF-LSTM is calculated as follows:
\(h_{t}^{a}=RF\text{-}LSTM^{a}\left(x_{t}^{a}, h_{t-1}^{a}\right)=\frac{1}{P} \sum_{i=1}^{P} h_{t, i}^{a}\) (12)
\(h_{t, i}^{a}=LSTM^{a}\left(x_{t}^{a}, h_{t-1, i}^{a}\right), \quad i=1,2, \ldots, P\) (13)
where \(x_{t}^{a}\) is the input vector of the attention RF-LSTM, \(h_{t-1}^{a}\) is its hidden state at the previous time step, and \(h_{t,i}^{a}\) is the i-th output of the attention RF-LSTM at time step t.
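A minimal sketch of one RF-LSTM step is shown below, assuming PyTorch. It reads "recurrent fusion" as applying a shared-parameter LSTM cell P times within a single time step on the same input and averaging the P hidden states; this is one interpretation of Eqs. (12)-(13), not necessarily the authors' exact implementation, and the state carried forward is a design choice of the sketch.

```python
import torch
import torch.nn as nn

class RFLSTM(nn.Module):
    """Sketch of Eqs. (12)-(13): P recurrences of one LSTM cell within a time step."""
    def __init__(self, d_in, d_hidden, P=3):
        super().__init__()
        self.cell = nn.LSTMCell(d_in, d_hidden)    # identical parameters for all P recurrences
        self.P = P

    def forward(self, x_t, state):
        h, c = state                               # hidden/cell state carried from step t-1
        outs = []
        for _ in range(self.P):                    # recur P times on the same input x_t
            h, c = self.cell(x_t, (h, c))
            outs.append(h)
        h_t = torch.stack(outs, dim=0).mean(dim=0) # Eq. (12): average the P hidden outputs
        return h_t, (h_t, c)                       # carry the fused state forward (sketch choice)
```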
The input of the attention RF-LSTM consists of the word embedding at the current time step and the visual vector \(I_{v}+h_{t-1}^{l}\), where \(I_{v}\) is the pooled feature extracted by Faster R-CNN and \(h_{t-1}^{l}\) is the hidden state of the language RF-LSTM at the previous time step (\(h_{t-1}^{l}\) is initialized to 0 at the beginning):
\(x_{t}^{a}=\left[E w_{t-1}, I_{v}+h_{t-1}^{l}\right]\) (14)
where E is the word embedding matrix and \(w_{t-1}\) is the word generated by the language RF-LSTM at the previous time step. Following traditional image captioning baseline models, the embedding of each word token is learned from its context during training. Specifically, we create a learnable weight matrix of shape (x, y), where x is the size of the dictionary and y is the dimension of the embedding vector, initialized with random values in the range (0, 1). It is worth noting that this word embedding can also be replaced with a pre-trained language model to extract language features.
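For concreteness, the construction of the attention RF-LSTM input in Eq. (14) could look as follows in PyTorch; the vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d = 10000, 1024          # illustrative sizes
embed = nn.Embedding(vocab_size, d)  # learnable (x, y) weight: x = dictionary size, y = embedding dim

def attention_input(w_prev, I_v, h_lang_prev):
    # w_prev: (B,) indices of the previous words; I_v: (B, d) pooled region features;
    # h_lang_prev: (B, d) language RF-LSTM hidden state from the previous time step
    return torch.cat([embed(w_prev), I_v + h_lang_prev], dim=-1)   # x_t^a, Eq. (14)
```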
On the second layer, the hidden state of the language RF-LSTM is calculated as:
\(h_{t}^{l}=RF\text{-}LSTM^{l}\left(x_{t}^{l}, h_{t-1}^{l}\right)=\frac{1}{P} \sum_{i=1}^{P} h_{t, i}^{l}\) (15)
\(h_{t, i}^{l}=LSTM^{l}\left(x_{t}^{l}, h_{t-1}^{l}\right), \quad i=1,2, \ldots, P\) (16)
where \(x_{t}^{l}\) is the input of the language RF-LSTM, \(h_{t-1}^{l}\) is its hidden state at the previous time step, and \(h_{t,i}^{l}\) is the i-th output of the language RF-LSTM at time step t.
The input of the language RF-LSTM, \(x_{t}^{l}\), is defined as follows:
\(x_{t}^{l}=\left[s_{t}, h_{t}^{a}\right]\) (17)
where \(s_{t}\) denotes the output of SGA, and \(h_{t}^{a}\) is the hidden state of the attention RF-LSTM at the current time step.
The probability distribution over the output word of the SGA-RF-LSTM model at time step t is denoted \(p\left(y_{t} \mid y_{1: t-1}\right)\):
\(p\left(y_{t} \mid y_{1: t-1}\right)=\operatorname{softmax}\left(W_{p} h_{t}^{l}\right)\) (18)
where \(h_{t}^{l}\) denotes the hidden state of the language RF-LSTM at time step t, and \(W_{p}\) is a learnable parameter matrix that projects \(h_{t}^{l}\) onto the vocabulary.
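Putting the pieces together, one decoding step of the model (Eqs. (14)-(18)) could be wired as in the sketch below, reusing the modules sketched above; all names, signatures, and the exact wiring are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def decode_step(w_prev, V, E, I_v, embed, rf_attn, rf_lang,
                attn_e, attn_v, gate, W_p, states):
    # states: RF-LSTM states of both layers plus the previous language hidden state
    state_a, state_l, h_lang_prev = states
    x_a = torch.cat([embed(w_prev), I_v + h_lang_prev], dim=-1)  # Eq. (14)
    h_a, state_a = rf_attn(x_a, state_a)                         # attention RF-LSTM, Eq. (12)
    c_e, _ = attn_e(E, h_a)                                      # Eqs. (2)-(4)
    c_v, _ = attn_v(V, h_a)                                      # Eqs. (5)-(7)
    s_t = gate(E, V, c_e, c_v)                                   # Eq. (11): SGA output
    x_l = torch.cat([s_t, h_a], dim=-1)                          # Eq. (17)
    h_l, state_l = rf_lang(x_l, state_l)                         # language RF-LSTM, Eq. (15)
    log_p = F.log_softmax(W_p(h_l), dim=-1)                      # Eq. (18): word distribution
    return log_p, (state_a, state_l, h_l)
```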
3.4 Objectives
Given the target ground-truth sequence \(y_{1:T}^{*}\) and a captioning model with parameters θ, SGA-RF-LSTM is trained by minimizing the cross-entropy loss \(L_{XE}\):
\(L_{XE}(\theta)=-\sum_{t=1}^{T} \log \left(p_{\theta}\left(y_{t}^{*} \mid y_{1: t-1}^{*}\right)\right)\) (19)
Since reinforcement learning has proven effective for training captioning models, we follow this strategy to optimize non-differentiable metrics: starting from the model trained under cross-entropy, we minimize the negative expected score:
\(L_{R}(\theta)=-\mathbb{E}_{y_{1: T} \sim p_{\theta}}\left[r\left(y_{1: T}\right)\right]\) (20)
where r is the CIDEr-D score function. We directly optimize this non-differentiable metric with Self-Critical Sequence Training (SCST) [14], and the gradient can be approximated as:
\(\nabla_{\theta} L_{R}(\theta) \approx-\left(r\left(y_{1: T}^{s}\right)-r\left(\hat{y}_{1: T}\right)\right) \nabla_{\theta} \log p_{\theta}\left(y_{1: T}^{s}\right)\) (21)
where \(y_{1:T}^{s}\) denotes a caption sampled from the probability distribution, and \(r(\hat{y}_{1:T})\) is the baseline score obtained by greedy decoding.
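The two training objectives could be implemented roughly as follows, assuming PyTorch; the padding index and the external CIDEr-D scorer that supplies the rewards are assumptions not specified in the paper.

```python
import torch
import torch.nn.functional as F

def xe_loss(log_probs, targets, pad_idx=0):
    # Eq. (19): log_probs (B, T, vocab) from the decoder, targets (B, T) ground-truth indices
    return F.nll_loss(log_probs.transpose(1, 2), targets, ignore_index=pad_idx)

def scst_loss(sample_log_prob, sample_reward, greedy_reward):
    # Eq. (21): sample_log_prob (B,) is the summed log-probability of the sampled caption y^s;
    # the rewards are CIDEr-D scores of the sampled and the greedily decoded captions.
    advantage = sample_reward - greedy_reward                 # r(y^s) - r(y_hat)
    return -(advantage.detach() * sample_log_prob).mean()
```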
4. Experiments
4.1 Datasets
The proposed method is evaluated on the MS COCO 2014 [30] and Flickr30K [39] datasets. MS COCO is the largest offline dataset for the image captioning task, containing 123,287 images, each with five different annotations. For offline evaluation, we use the "Karpathy" data split [10], in which 113,287 images are used for training and 5,000 each for validation and testing. For Flickr30K, 29,014, 1,000, and 1,000 images are used for training, validation, and testing, respectively. To quantitatively evaluate the proposed method and compare it with other methods, we use standard automatic evaluation metrics, including BLEU [31], METEOR [32], ROUGE-L [33], CIDEr-D [34], and SPICE [35].
4.2 Implementation Details
We use a pre-trained ResNet-101 to extract the grid features of the images and Faster R-CNN [2] to extract the salient region features. In the training implementation, the dimension of the original encoding feature vectors is 2048, and we project them into a new space of dimension 1024. The dimensions of the pooling and attention layers of SGA are 1024. We follow the training strategy of the AoA model [26]. In the XE training stage, the batch size is set to 10 and the model is trained for 40 epochs. We initialize the learning rate to 2e-4 and anneal it by a factor of 0.8 every 3 epochs.
The scheduled sampling probability [36] is increased by 0.05 every 5 epochs. In the reinforcement learning stage, we initialize the learning rate to 2e-5 and train for 20 epochs. When the validation score does not improve on some metrics, we anneal the learning rate by a factor of 0.5. We employ the Adam optimizer in both stages, and the beam size is set to 2.
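As a rough guide, the XE-stage optimizer and learning-rate schedule described above could be set up as follows in PyTorch; `model` stands in for the SGA-RF-LSTM network, and the reinforcement-learning stage is omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)      # placeholder for the SGA-RF-LSTM network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)                       # XE stage: 40 epochs, batch size 10
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)  # anneal by 0.8 every 3 epochs
# RL stage (not shown): reset the learning rate to 2e-5, train 20 more epochs,
# and anneal by 0.5 when the validation score stops improving.
```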
4.3 Performance Comparison
As shown in Table 1, we report the performance of the proposed model on the offline COCO Karpathy test split and compare our approach with several recent image captioning models. The compared models include: SCST [14], which directly optimizes evaluation metrics with self-critical reinforcement learning; RFNet [4], which fuses the encoding features of multiple CNNs to form the representation for the decoder; Up-Down [11], which employs Faster R-CNN as a bottom-up mechanism to extract salient region features; GCN-LSTM [19], which uses Graph Convolutional Networks to explore pair-wise relations between image regions; HAN [25], which adopts hierarchical features so that attention can be calculated synchronously on features at pyramid levels; AoANet [26], which enhances traditional visual attention by further measuring the correlation between attention results and queries; SRT [6], which proposes a recall mechanism consisting of a recall unit, semantic guidance, and recall words; and MT [37], which constructs a fully connected architecture between each encoder layer and decoder layer. Our model achieves the highest scores on most of the metrics. It is worth noting that we repeated the experiment five times and that all five runs scored higher than the baseline model, with the individual scores fluctuating around their five-run average. Under the null hypothesis that each run is equally likely to score above or below the baseline, the probability of all five exceeding it is 0.5 to the fifth power, i.e., a p-value of 0.03125, which is below the significance level, so the improvement is statistically significant.
Table 1. Performance of our method on MS COCO Karpathy’s test split under XE loss and CIDEr reward optimization, where B-1 / B-4 / M / R / C / S means BLEU1/ BLEU4 / METEOR / ROUGE-L / CIDEr / SPICE scores
As shown in Table 2, we report experimental results on the Flickr30K dataset. Our approach significantly outperforms all the compared methods, indicating that the performance improvement produced by the proposed SGA-RF-LSTM model carries over to a different dataset.
Table 2. The image captioning results obtained on the Flickr30K Karpathy test split under XE loss
To better evaluate the generated results qualitatively, we visualize the evolution of the contribution of visual features to the model output for AoA [26] and SGA in Fig. 3. The contribution of a region to the output is a non-linear relationship, so we employ the Integrated Gradients approach [38], which approximates the integral of gradients with respect to the given input. In addition, our SGA overlays an attention heat map on the grid, which cooperates with the salient region information to focus on the corresponding grid spatial features, thereby helping the model use the image information effectively and generate accurate captions.
Fig. 3. The visualization and captions generation processes of the AoA model and the SGA model.
4.4 Qualitative Analysis
Table 3 shows several example images and the captions generated by the proposed SGA-RF-LSTM, the Up-Down [11] baseline, and two ground truths (GT), respectively. From these examples, we find that the captions generated by the baseline model are logically correct but not accurate enough, and some do not match the image content. SGA-RF-LSTM generates more descriptive and precise captions. For example, the baseline model generates "a bathroom with a toilet and a shower"; although correct, this caption does not clearly describe the positional relationship between the objects, whereas SGA-RF-LSTM accurately describes the positional information "next to a white toilet". Besides, the baseline model generates "a group of people are skiing on the snow", while SGA-RF-LSTM specifically describes "a man and little girl". SGA-RF-LSTM has these advantages because it combines the grid features and the salient region features of the image simultaneously and uses the recurrent fusion method in the decoding stage to enhance the output of the LSTM.
Table 3. Examples of image captioning results generated by our SGA-RF-LSTM
4.5 Ablation Study
To quantify the impact of the proposed SGA-RF-LSTM on the image captioning model in the attention stage and the sequence-generation stage, we perform ablation experiments comparing different variants of SGA-RF-LSTM. As shown in Table 4, the first two rows correspond to the baseline model using only grid features or only region features, rows 3 through 11 correspond to ablations of SGA and RF-LSTM, and the penultimate row reports the scores of the model combining SGA and RF-LSTM. It is worth noting that the last row reports the results of the proposed model with BERT-based [43] word embeddings, with which the model's performance is further improved.
Table 4. Ablation study about the SGA-RF-LSTM under XE loss
4.5.1 Effect of SGA
We set up different schemes to evaluate the effect of the SGA method in the attention stage. First, to explore the impact of attending to two salient region features simultaneously in the attention phase, we set "Region+Region+A", where "A" denotes soft attention. Second, to verify the influence of attending to the grid features and the salient region features simultaneously, we set "Grid×Region+A" and "Grid+Region+A", where "×" means the relationship between features is established through matrix multiplication and "+" means it is established through matrix summation. Finally, to verify the impact of the gating mechanism, which uses the pooled information of both parts while attending to the grid and salient region features, we set "Grid+Region+GA", where "GA" means adding the gating mechanism. From Table 4, we observe that attending to two salient region features simultaneously improves the model compared with single attention, but the improvement from cooperatively attending to grid and region features is more prominent, which demonstrates that the grid spatial information serves as a useful complement to the salient region information. Furthermore, we find that the gating mechanism between the two different features effectively alleviates semantic noise and guides the interaction of information between them.
4.5.2 Effect of RF-LSTM
As shown in Fig. 4, we design different LSTM structures and compare different variants for modeling the hidden-state vectors. From Table 4, we observe that three parallel, different LSTMs do not improve the model's performance. In contrast, with the proposed RF-LSTM method, the recurrent fusion of three different LSTMs in the same layer improves performance slightly, and sharing the same parameters among the LSTMs in a layer improves it further. Meanwhile, we also evaluate RF-LSTM in the first layer of the decoder and find that it performs better than the original LSTM, but not as well as applying it in the second layer. Eventually, we apply the RF-LSTM method to both LSTM layers and find that this model outperforms the other structures and achieves the highest performance.
Fig. 4. Different schemes for outputting the LSTM hidden state. (a) A base \(LSTM^{l}\) output \(h_{t}^{l}\). (b) Pooled and merged output \(h_{t}^{l}\) from three parallel \(LSTM^{l}\). (c) Fusion output \(h_{t}^{l}\) from three different recurrent \(LSTM^{l}\). (d) Fusion output \(h_{t}^{l}\) from three identical recurrent \(LSTM^{l}\). (e) Fusion output \(h_{t}^{a}\) from three identical recurrent \(LSTM^{a}\).
We combine the Synergy-Gated Attention and Recurrent Fusion LSTM methods to form \(SGA\text{-}RF\text{-}LSTM^{l+a}\). In Table 5, we vary the number of LSTM recurrences per time step to verify its impact on model performance. Generally, more recurrences in a single time step produce more distinct outputs, and fusing these output vectors can improve the model's performance. We observe that three recurrences achieve the best performance, which verifies the effectiveness of the recurrent fusion LSTM for modeling the input within one time step.
Table 5. Ablation on the times of LSTM's recurrence under XE loss
5. Conclusion
In this paper, we propose a Synergy-Gated Attention (SGA) method that attends to salient region features and grid features simultaneously, so that image information can be better exploited in the attention stage. We also propose a gating mechanism that uses the global information of the two feature sources, which effectively guides the interaction between them and alleviates the semantic noise generated during fusion. In addition, we replace the original LSTM with the RF-LSTM. The new architecture not only relies on the previous hidden vector information but also integrates information from future predictions to guide the current word generation, resulting in better performance than the original LSTM. Extensive experiments show the superiority of the proposed method compared with state-of-the-art methods on the benchmark datasets.
References
- M. T. Luong, H. Pham, C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412-1421, 2015.
- S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017. https://doi.org/10.1109/TPAMI.2016.2577031
- O. Vinyals, A. Toshev, S. Bengio and D. Erhan, "Show and tell: A neural image caption generator," in Proc. of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, 2015.
- W. Jiang, L. Ma, Y. Jiang, W. Liu and T. Zhang, "Recurrent Fusion Network for Image Captioning," in Proc. of European Conference on Computer Vision, vol. 11206, pp. 510-526, October 2018.
- X. Chen, L. Ma, W. Jiang, J. Yao and W. Liu, "Regularizing RNNs for Caption Generation by Reconstructing the Past with the Present," in Proc. of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7995-8003, 2018.
- L. Wang, Z. Bai, Y. Zhang, and H. Lu, "Show, Recall, and Tell: Image Captioning with Recall Mechanism," in Proc. of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 07, pp. 12176-12183, 2020.
- A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. of European conference on computer vision, pp. 15-29, 2010.
- R. Mason and E. Charniak, "Nonparametric method for data-driven image captioning," in Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 2, pp. 592-598, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
- A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664-676, 1 April 2017. https://doi.org/10.1109/TPAMI.2016.2598339
- P. Anderson, X. D. He, C Buehler, D Teney, M Johnson, S Gould and L Zhang, "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," in Proc. of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086, 2018.
- S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
- Z. Ren, X. Wang, N. Zhang, X. Lv and L. Li, "Deep Reinforcement Learning-Based Image Captioning with Embedding Reward," in Proc. of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1151-1159, 2017.
- S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross and V. Goel, "Self-Critical Sequence Training for Image Captioning," in Proc. of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1179-1195, 2017.
- I. Sutskever, O. Vinyals and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in neural information processing systems, pp. 3104-3112, 2014.
- K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R Salakhutdinov, R. S. Zemel and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. of International conference on machine learning (ICML), vol. 37, pp. 2048-2057, July 2015.
- T. Yao, Y. Pan, Y. Li, Z. Qiu and T. Mei, "Boosting Image Captioning with Attributes," in Proc. of 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4904-4912, 2017.
- N. Li and Z. Chen, "Image Cationing with Visual-Semantic LSTM," in Proc. of Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 793-799, 2018.
- T. Yao, Y. Pan, Y. Li and T. Mei, "Exploring visual relationship for image captioning," in Proc. of the European conference on computer vision (ECCV), pp. 711-727, 2018.
- X. Yang, H. Zhang and J. Cai, "Auto-encoding and Distilling Scene Graphs for Image Captioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2313-2327, 2022.
- Y. Zhong, L. Wang, J. Chen, D. Yu and Y. Li, "Comprehensive image captioning via scene graph decomposition," in Proc. of European Conference on Computer Vision, pp. 211-229, 2020.
- L. Ke, W. Pei, R. Li, X. Shen and Y. Tai, "Reflective Decoding Network for Image Captioning," in Proc. of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8887-8896, 2019.
- R. Li, H. Liang, Y. Shi and F. Feng, "Dual-CNN: a convolutional language decoder for paragraph image captioning," Neurocomputing, vol. 396, no. 12, pp. 92-101, 2020. https://doi.org/10.1016/j.neucom.2020.02.041
- J. Lu, C. Xiong, D. Parikh and R. Socher, "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning," in Proc. of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3242-3250, 2017.
- W. Wang, Z. Chen, H. Hu, "Hierarchical attention network for image captioning," in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 8957-8964, 2019.
- L. Huang, W. Wang, J. Chen and X. Wei, "Attention on Attention for Image Captioning," in Proc. of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4633-4642, 2019.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, "Attention is all you need," in Proc. of the 31st International Conference on Neural Information Processing Systems, pp. 6000-6010, 2017.
- S. Herdade, A. Kappeler, K. Boakye, J. Soares, "Image captioning: Transforming objects into words," Advances in Neural Information Processing Systems, pp. 11137-11147, 2019.
- G. Li, L. Zhu, P. Liu and Y. Yang, "Entangled Transformer for Image Captioning," in Proc. of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8927-8936, 2019.
- T. Y. Lin, M. Maire, S. Belongie, H. James, P. Pietro, R. Deva and D. Piotr, "Microsoft coco: Common objects in context," in Proc. of European conference on computer vision, pp. 740-755, 2014.
- K. Papineni, S. Roukos, T. Ward and W. J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proc. of the 40th annual meeting of the Association for Computational Linguistics, pp. 311-318, 2002.
- M. Denkowski and A. Lavie, "Meteor universal: Language specific translation evaluation for any target language," in Proc. of the ninth workshop on statistical machine translation, pp. 376-380, 2014.
- C. Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. of Association for Computational Linguistics Workshop, pp. 74-81, 2004.
- R. Vedantam, C. L. Zitnick and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566-4575, 2015.
- P. Anderson, B. Fernando, M. Johnson and S. Gould, "Spice: Semantic propositional image caption evaluation," in Proc. of European conference on computer vision, Springer, Cham, vol.9909, pp. 382-398, 2016.
- S. Bengio, O. Vinyals, N. Jaitly and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. of the 28th International Conference on Neural Information Processing Systems, Vol. 1, pp. 1171-1179, 2015.
- M. Cornia, M. Stefanini, L. Baraldi and R. Cucchiara, "Meshed-Memory Transformer for Image Captioning," in Proc. of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10575-10584, 2020.
- M. Sundararajan, A. Taly and Q. Yan, "Axiomatic attribution for deep networks," in Proc. of 2017 International Conference on Machine Learning (PMLR), vol. 70, pp. 3319-3328, 2017.
- P. Young, A. Lai, M. Hodosh and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," in Proc. of 2014 Transactions of the Association for Computational Linguistics (TACL), vol. 2, pp. 67-78, 2014.
- L. Li, S. Tang, Y. Zhang, L. Deng and Q. Tian, "GLA: Global-local Attention for Image Description," IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 726-737, 2018. https://doi.org/10.1109/tmm.2017.2751140
- X. Xiao, L. Wang, K. Ding, S. Xiang, C. Pan, "Deep Hierarchical Encoder-Decoder Network for Image Captioning," IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2942-2956, 2019. https://doi.org/10.1109/tmm.2019.2915033
- X. Xiao, L. Wang, K. Ding, S. Xiang, C. Pan, "Dense Semantic Embedding Network for Image Captioning," Pattern Recognition, pp. 285-296, vol. 90, 2019. https://doi.org/10.1016/j.patcog.2019.01.028
- J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2019.