1. Introduction
In the field of computer vision, progress has been made toward developing machines that can understand images, videos or other forms of content as well as humans do. By analyzing the relationships between pairs of objects, scene graph generation (SGG) constructs rich semantic information for tasks such as visual question answering (VQA)[1][2].
Inspired by the ladder of causation described in The Book of Why[3], unbiased scene graph generation (USGG) was first proposed by Tang et al.[4] through the Total Direct Effect (TDE) method, which aims to mitigate the long-tailed distribution problem in SGG datasets.
The severe imbalance in the distribution of the dataset has a significant impact on the model. If unconstrained during training, the model is prone to overfitting the head classes and underfitting the tail classes. Many methods effectively suppress the defects caused by long-tailed datasets. Yu et al.[5] proposed a novel Cognition Tree (CogTree) loss, which uses a tree structure to first differentiate coarse-grained relationships and then fine-grained ones. Wang et al.[6] proposed a novel memory-based model that enriches the features of low-frequency relations. Yan et al.[7] proposed Predicate-Correlation Perception Learning (PCPL), which adaptively determines appropriate loss weights according to correlations among the predicate classes. Li et al.[8] proposed a novel model-agnostic Label Semantic Knowledge Distillation (LS-KD) method that captures correlations between subject-object instances and different predicate categories. Most of these methods handle well the case in which the ground truth is a tail relation and the predicted value is a head relation. This is also the original intention of the USGG methods, i.e., to refine relations from coarse to fine. We found that the prediction accuracy for tail relations can be further improved by constraining the case in which the ground truth is a tail relation but the predicted value is a different, incorrect tail relation. However, the above methods focus on improving the predicted recall between tail predicates and other predicates and do not intentionally provide additional corrections for misjudged tail predicates. To effectively suppress the long-tail problem, we designed the Class Balanced Seesaw Loss (CBSLoss), which improves the correct prediction rate for tail samples.
Visual relations can be divided into four categories: geometric, semantic, possessive, and miscellaneous. We start from the semantic and geometric dimensions to enhance the model's feature capture. Specifically, at the semantic level, we apply additional deep processing to the word embeddings of the object labels to improve the model's classification and prediction accuracy. In real-world applications, adjacent objects may have numerous relationships. For instance, given "person uses computer" and "person lying on table," the position information of the person can be used to deduce the spatial relationship between the computer and the table as "computer on table." Clearly, the bounding-box positions of the neighboring object pairs of each object pair in the same image can assist in the training of that object pair. Thus, we designed a geometric module that enhances the acquisition of geometric information. By considering and processing these two dimensions together, we can better understand and utilize the model's features, thus improving its performance.
The contributions of this work can be summarized as follows:
(1) We designed a new geometric module that assists the training of each object pair by utilizing the bounding-box positions of its neighboring object pairs in the same image. This helps the model better understand the positional relationships between objects, thus improving the recall of scene graph generation.
(2) We devised a new semantic module that exploits the rich semantic relationships between the object labels.
(3) We designed a new loss function, CBSLoss, for tail relations, which improves tail relation recall by introducing a penalty factor when the predicted relation is a tail relation but the prediction is incorrect.
2. Related Work
2.1 Class Imbalance
In the real world, most samples are concentrated in a few classes, whereas the remaining classes contain only a small number of samples. This is known as class imbalance. This universal natural phenomenon poses significant obstacles for deep learning. Numerous improvements to one-stage[9][10] and two-stage[11][12] object detection have addressed the issue of foreground-background class imbalance. However, SGG is subject to foreground-foreground class imbalance.
USGG methods for resolving long-tailed data issues can be divided into three categories[13]. (1) Data augmentation or resampling for tail data. Knyazev et al.[14] proposed a data augmentation technique based on Generative Adversarial Networks (GANs). Yao et al.[15] proposed a visual distant supervision technique that requires no human-labeled data and devised a denoising framework to reduce noise. However, these methods do not perform well when the predicate labels are strongly correlated. This arises partly because re-balancing strategies simply utilize the frequency of classes while ignoring their semantic relatedness[7]. Because existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, Li et al.[16] proposed a novel Compositional Feature Augmentation (CFA) strategy, the first work to mitigate the bias problem in USGG by increasing the diversity of triplet features. (2) Elaborately designed training curricula or learning losses. Wei et al.[17] proposed the higher-order structure embedded network (HOSE-Net), which consists of a structure-aware embedding-to-classifier (SEC) module and a hierarchical semantic aggregation (HSA) module. This method incorporates structural information in the output space and reduces the number of subspaces. To reduce error propagation, Li et al.[18] devised the bipartite graph neural network (BGNN). Chiou et al.[19] proposed the dynamic label frequency estimation (DLFE) method to address the reporting bias problem. Suhail et al.[20] designed the energy-based model (EBM), which incorporates the scene graph structure into the learning framework. (3) Disentangling biased and unbiased representations. The TDE method employs counterfactual surgery to produce unbiased predictions. These methods are effective in alleviating the long-tail effect. However, the constructed causal models are always causally insufficient because dataset noise makes confounders unobservable. Sun et al.[21] proposed Two-stage Causal Modeling (TsCM), which takes long-tailed distributions and semantic confusion as confounders for structural causal modeling and then decouples the causal intervention into two stages.
2.2 Utilization of Semantic and Geometric Features
Several models make clever use of semantic information. For example, the rich and fair semantic extraction (RiFa) method[22] reveals the rich semantic features of relations. This method uses the semantic distinctions between the subject and object of the same entity to prevent biased relations. The PCPL method adaptively determines the loss weights by using the correlations between the predicate classes.
In terms of geometric information, the Motifs[23] model determines the number of relationships according to the object height. He et al.[24] incorporated relative position coding into relationship matrices. The EBM model is trained with a loss function that incorporates the structure of the output space. Clearly, semantic and geometric information are crucial for relationship prediction. Our proposed GSI method effectively utilizes both semantic and geometric information and mitigates the long-tail effect.
3. Methods
The base CogTree method is inspired by the prefrontal cortex and mimics the way human cognition first distinguishes clearly different relationships and then differentiates similar ones. The main idea is to first use a biased SGG model to generate subtrees based on the frequency ranking of misclassified relationships and then train the model with the subtree structure to mitigate long-tail effects. Our GSI method is shown in Fig. 1. The improvements are shown in color, and the subtree construction of CogTree is retained.
Fig. 1. The GSI method has three components: first, the Faster R-CNN object detection model is used to acquire object information; then, the double Transformer model (relation transformer and object transformer) and CBSLoss are used to predict each relationship probability; and finally, subtrees are used to generate the final scene graph.
The structure of the GSI method based on the Transformer model is shown in Fig. 1. In addition, the GSI method can be added to any of the SGG base models (Motifs, the visual context tree model (VCTree)[25], and the Transformer) and their improved variants. Since our method is model-agnostic, the Transformer model itself is not changed; we follow the dual Transformer model used in the CogTree method.
The Transformer model is selected as an example. The dataset is first passed through the object detection model and then through the semantic and geometric modules, which enhance the acquisition of image information, before entering the Transformer model. After the Transformer model produces its output, the relationships between object pairs are predicted using the CBSLoss loss function. By using the tree structure of the subtree and the result of CBSLoss, the Tree-based Class Balanced Loss (TCBLoss) of the current node is generated; finally, the two loss functions are combined to generate the relationships between the object pairs and the final scene graphs.
The object encoder of the Transformer model contains multiple object-to-object (O2O) blocks. Both the O2O and relation-to-object (R2O) blocks consist of an attention module, a residual connection, layer normalization, a feedforward module, another residual connection, and another layer normalization. The difference between the R2O and O2O blocks is the attention module: the O2O block uses self-attention, while the R2O block uses cross-attention. In both the object decoder and the relation decoder, a fully connected layer and a softmax layer are used.
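To make this block structure concrete, the following PyTorch sketch shows one possible O2O block under the description above; the embedding dimension, head count, and feedforward width are assumptions rather than the paper's exact settings.

```python
import torch.nn as nn

class O2OBlock(nn.Module):
    """Sketch of one object-to-object (O2O) block as described above:
    self-attention, residual + LayerNorm, feed-forward, residual + LayerNorm.
    Dimensions and head count are assumptions."""
    def __init__(self, dim=512, heads=8, ffn_dim=2048, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=p_drop, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, x):                  # x: (batch, num_objects, dim)
        a, _ = self.attn(x, x, x)          # self-attention (an R2O block would instead use
        x = self.norm1(x + a)              # cross-attention with relation queries)
        x = self.norm2(x + self.ffn(x))
        return x
```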
3.1 Geometric Feature Improvement
The recall of SGG models can be improved by efficiently capturing geometric information. This module therefore addresses the CogTree method's inability to fully exploit position information, since it relies on position embedding alone. Moreover, this module effectively suppresses noise interference in the model by introducing a threshold.
The two improvement modules for the geometric and semantic dimensions are shown in Fig. 2. First, a series of convolution operations is performed on the semantic features; then, the semantic features, the new geometric features obtained from the geometric module, and the initial union region features are concatenated; finally, a fully connected operation is applied to the concatenated features to obtain the new union features. If the geometric or semantic module is used separately in the ablation experiments, only the original union region features are fully connected with the corresponding part. S_b ∈ ℝ^{1×m×4} and O_b ∈ ℝ^{1×m×4} represent the position information of the object pairs <subject, object> in a batch of images, where m is the total number of relational pairs in the batch. The number of columns (in this case, 4) corresponds to the bottom-left coordinate values x_1 and y_1 and the top-right coordinate values x_2 and y_2. The equation for processing S_b and O_b is shown in (1):
\(X' = G_{drop}(\sigma(G_{FC}(G_{drop}(\sigma(G_{FC}(X)))))),\) (1)
Fig. 2. The improvements in the geometric and semantic features are shown schematically. Convolutions are performed on the semantic features in the first line; then, the semantic features, geometric union features, and initial geometric union features are concatenated in the second line.
where X represents the input S_b or O_b, and X' represents the corresponding S'_b ∈ ℝ^{1×m×4} or O'_b ∈ ℝ^{1×m×4} obtained after processing. σ is the rectified linear unit (ReLU) function, G_FC is a linear layer, and G_drop is a dropout layer.
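A minimal PyTorch sketch of Eq. (1) is given below; the dropout rate is an assumed value, and the layer widths simply follow the 4-dimensional box inputs described above.

```python
import torch.nn as nn

class BoxEncoder(nn.Module):
    """Hypothetical sketch of Eq. (1): two Linear + ReLU + Dropout stages
    applied to the subject/object box coordinates (shape 1 x m x 4)."""
    def __init__(self, dim=4, p_drop=0.1):     # p_drop is an assumed value
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.drop = nn.Dropout(p_drop)
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: S_b or O_b
        x = self.drop(self.relu(self.fc1(x)))  # inner G_drop(sigma(G_FC(x)))
        x = self.drop(self.relu(self.fc2(x)))  # outer G_drop(sigma(G_FC(x)))
        return x                               # x': S'_b or O'_b
```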
The initial union region feature vector F ∈ ℝ^{4096×1×1} corresponds to the object pair information in each batch of images. The first step of the improved geometric feature module is to determine the union region features U ∈ ℝ^{1×4×k} of the object pairs in each image, where k represents the number of object pairs in the image. In the second step, the pairwise position information between the union regions in U is computed to obtain the union region position matrix C_U ∈ ℝ^{1×k×k×4}. In the third step, the IoU ∈ ℝ^{1×k×k×1} between the union region locations is calculated. The inputs of this step are C_Uij and U_i, where i denotes the current object pair and j denotes another object pair in the same image.
The maximum value of each row of IoU is compared with the threshold s (s ∈ (0,1)); if this value is larger than s, the subscripts i and j are obtained, and the corresponding union region feature vector F_j is placed into M_i according to the subscript j, as shown in (2). Since the values of IoU_ij in the experiments mostly lie between 0.75 and 0.95, s is set to 0.80 in this paper.
\(\begin{aligned}M_{i}=\left\{\begin{array}{ll}F_{j}, & \max _{j}\left(\mathrm{IoU}_{ij}\right)>s \\ 0, & \text { otherwise }\end{array}\right.\end{aligned}\) (2)
Finally, F and M are concatenated and passed through fully connected layers to obtain the new union feature F' ∈ ℝ^{4096×1×1}, as shown in (3):
\(F' = G_{drop}(\sigma(G_{FC}(G_{drop}(\sigma(G_{FC}(F, M)))))),\) (3)
where F' is the improved union feature that is used as the input to the Transformer model to ensure that the model acquires rich geometric features.
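The following PyTorch sketch illustrates one possible implementation of Eqs. (2) and (3) under the description above; the dropout rate and fusion layer widths are assumptions, and torchvision's box_iou is used in place of whatever IoU routine the original implementation employs.

```python
import torch
import torch.nn as nn
from torchvision.ops import box_iou

def gather_neighbor_union_features(union_boxes, union_feats, s=0.80):
    """Hypothetical sketch of Eq. (2): union_boxes is k x 4 (one union box per
    object pair in an image), union_feats is k x 4096 (F). For pair i, if the
    largest IoU with any other pair's union box exceeds s, that neighbour's
    feature F_j is copied into M_i; otherwise M_i stays zero."""
    iou = box_iou(union_boxes, union_boxes)          # k x k IoU matrix
    iou.fill_diagonal_(0)                            # ignore a pair's IoU with itself
    max_iou, j = iou.max(dim=1)                      # best neighbour per pair
    m = torch.zeros_like(union_feats)
    keep = max_iou > s
    m[keep] = union_feats[j[keep]]
    return m

class GeometricFusion(nn.Module):
    """Sketch of Eq. (3): concatenate F and M, then two Linear + ReLU + Dropout
    stages back to the 4096-d union feature F'. Layer widths are assumptions."""
    def __init__(self, dim=4096, p_drop=0.1):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(2 * dim, dim), nn.Linear(dim, dim)
        self.relu, self.drop = nn.ReLU(), nn.Dropout(p_drop)

    def forward(self, f, m):                         # f, m: k x 4096
        x = torch.cat([f, m], dim=-1)
        x = self.drop(self.relu(self.fc1(x)))
        x = self.drop(self.relu(self.fc2(x)))
        return x                                     # F'
```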
3.2 Semantic Feature Improvement
The effective acquisition of semantic features is critical for increasing SGG model accuracy. In the base method, the word embeddings produced by the GloVe model[26] from the object names are used as inputs to the Transformer model.
Fig. 2 shows a schematic diagram of the enhanced semantic and geometric features. Assume that the subject and object labels of the pair <subject, object> are obtained; then, the word embeddings obtained by feeding these labels into the GloVe model are C_i and C_j, and the union word embedding X_ij ∈ ℝ^{1×200×200} can be formulated as (4):
\(X_{ij} = C_{i}^{T} \cdot C_{j},\) (4)
where · denotes matrix multiplication (the outer product of the two embeddings). To obtain richer information, this module performs a series of convolution operations on X_ij. First, a 5×5 convolution, the ReLU function, average pooling (avg pooling), and two 3×3 convolutions are applied. Then, the module performs adaptive average pooling. Finally, the module performs two 1×1 convolutions and applies the ReLU function to obtain X'_ij ∈ ℝ^{4096×1×1}, as shown in (5).
\(X'_{ij} = G_{1\times 1\,conv}(\sigma(G_{pool2}(G_{3\times 3\,conv}(\cdots(G_{pool1}(X_{ij}))\cdots)))),\) (5)
where G_pool1 represents avg pooling, G_pool2 denotes adaptive avg pooling, and G_{n×n conv} indicates an n×n convolution (n = 1, 3, or 5).
To better use the union word embedding X_ij, this feature is integrated into the union feature F ∈ ℝ^{4096×1×1}, yielding F' ∈ ℝ^{4096×1×1} as shown in (6):
\(F' = G_{drop}(\sigma(G_{FC}(G_{drop}(\sigma(G_{FC}(X'_{ij}, F)))))).\) (6)
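A hedged PyTorch sketch of Eqs. (4)-(6) follows; the convolution channel widths and the adaptive-pooling output size are assumptions chosen so that the flattened output matches the 4096-dimensional X'_ij described above.

```python
import torch
import torch.nn as nn

class SemanticModule(nn.Module):
    """Hypothetical sketch of Eqs. (4)-(6): the 200x200 outer product of the
    subject/object GloVe embeddings passes through the convolution stack of
    Eq. (5) and is fused with the 4096-d union feature F. Channel widths and
    the adaptive-pooling size are assumptions, not the paper's exact values."""
    def __init__(self, p_drop=0.1):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5), nn.ReLU(),
            nn.AvgPool2d(2),                                  # avg pooling (G_pool1)
            nn.Conv2d(64, 128, kernel_size=3), nn.Conv2d(128, 256, kernel_size=3),
            nn.AdaptiveAvgPool2d(4),                          # adaptive avg pooling (G_pool2)
            nn.Conv2d(256, 256, kernel_size=1), nn.Conv2d(256, 256, kernel_size=1),
            nn.ReLU(),
        )
        self.fc1 = nn.Linear(256 * 4 * 4 + 4096, 4096)        # 256*4*4 = 4096 -> X'_ij
        self.fc2 = nn.Linear(4096, 4096)
        self.relu, self.drop = nn.ReLU(), nn.Dropout(p_drop)

    def forward(self, c_i, c_j, f):              # c_i, c_j: (B, 200) embeddings; f: (B, 4096)
        x = torch.einsum('bi,bj->bij', c_i, c_j).unsqueeze(1)  # Eq. (4): outer product
        x = self.convs(x).flatten(1)                            # Eq. (5)
        x = torch.cat([x, f], dim=-1)
        x = self.drop(self.relu(self.fc1(x)))                   # Eq. (6)
        x = self.drop(self.relu(self.fc2(x)))
        return x                                                # F'
```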
3.3 Learning with Tree Loss
3.3.1 CBSLoss
Cui et al.[27] proposed Class Balanced Loss (CBLoss) to address the effects of long-tailed data distributions on image segmentation tasks.
However, CBLoss cannot effectively constrain the case in which the ground truth is a tail predicate and the predicted value is a different, incorrect tail predicate. This is because its weight depends only on the hyperparameter n_i, the number of ground-truth samples of category i, and when this number is small (a tail predicate), the formula cannot effectively "penalize" such cases. In response, we create a new loss function, CBSLoss, inspired by the seesaw loss[28]. When the ground-truth label T_i is a tail predicate and the predicted value T_j is also a tail predicate (T_i ≠ T_j), the weight W_i is defined as (7):
\(\begin{aligned}W_{i}=\left(\frac{1-\beta}{1-\beta^{n_{i}}}\right)\left(\frac{T_{j}}{T_{i}}\right)^{q}\end{aligned}\), (7)
where the hyperparameter β is set to 0.999 (β ∈ [0,1)), as in the literature[5], and the hyperparameter q is set to 3.0; the experimental results for different q values are shown in Table 1. Here, n_i represents the number of ground-truth samples of category i.
The novel CBSLoss function is defined in (8), where p_pred is the set of predicted scores and g_i is the ground-truth label corresponding to node i. This loss function consists of the softmax function and the corresponding class-balanced weight W_{g_i}.
\(\begin{aligned}L_{C B S}=-W_{g_{i}} \log \left(\frac{\exp \left(p_{g_{i}}\right)}{\sum_{p_{j} \in p_{p r e d}} \exp \left(p_{j}\right)}\right)\end{aligned}\) . (8)
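The following PyTorch sketch illustrates one possible reading of Eqs. (7) and (8); in particular, interpreting T_i and T_j as the softmax probabilities of the ground-truth and predicted predicates is an assumption, as is the choice to detach the weights from the gradient.

```python
import torch
import torch.nn.functional as F

def cbs_loss(logits, target, class_counts, tail_mask, beta=0.999, q=3.0):
    """Hypothetical sketch of CBSLoss, Eqs. (7)-(8). logits: N x C predicted scores,
    target: N ground-truth predicate indices, class_counts: C ground-truth counts n_i,
    tail_mask: C boolean flags marking tail predicates. The penalty factor
    (T_j / T_i)^q is applied only when both the true and the (wrong) predicted
    predicate are tail classes; reading T_i, T_j as the softmax probabilities of
    the true and predicted classes is an assumption."""
    probs = logits.softmax(dim=-1)
    pred = probs.argmax(dim=-1)

    n_i = class_counts[target].float()
    w = (1.0 - beta) / (1.0 - beta ** n_i)                  # class-balanced term

    wrong_tail = tail_mask[target] & tail_mask[pred] & (pred != target)
    t_i = probs.gather(1, target.unsqueeze(1)).squeeze(1)
    t_j = probs.gather(1, pred.unsqueeze(1)).squeeze(1)
    penalty = torch.where(wrong_tail, (t_j / t_i.clamp_min(1e-8)) ** q,
                          torch.ones_like(t_i))
    w = w * penalty                                         # Eq. (7)

    ce = F.cross_entropy(logits, target, reduction='none')  # softmax + log term of Eq. (8)
    return (w.detach() * ce).mean()
```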
3.3.2 TCBLoss
Within the subtree structure, the GSI method retains the tree-based class balanced loss (TCBLoss) used in the base method[5]. TCBLoss is a loss function constructed for leaf nodes that consists of a weight and a softmax function, as shown in (9). Fig. 3 shows an example of the calculation of TCBLoss.
\(\begin{aligned}L_{T C B}=\frac{1}{K} \sum_{k=1}^{K}-W_{S_{k}} \log \left(\frac{\exp \left(Z_{S_{k}}\right)}{\sum_{z_{j} \in Z_{S_{k-1}}} \exp \left(z_{j}\right)}\right),\end{aligned}\) (9)
Fig. 3. The TCBLoss of the parent node can be calculated from the CBSLoss of the leaf node.
where W_{S_k} is calculated as shown in (7) and K represents the number of tree levels excluding the leaf node. The input to the softmax function at each level is the parent-node probability Z_{S_{k-1}}, obtained by averaging the probabilities of all brother nodes of the predicate node, starting from the leaf-node probability Z_{S_k} produced by CBSLoss. The total loss function of the GSI method is the weighted sum of CBSLoss and TCBLoss, as shown in (10), where the hyperparameter λ is set to 0.7, consistent with the literature [5].
\(L = L_{CBS} + \lambda L_{TCB}.\) (10)
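As a minimal sketch, Eq. (10) simply combines the two loss terms with the fixed weight λ = 0.7:

```python
def total_loss(loss_cbs, loss_tcb, lam=0.7):
    """Eq. (10): weighted sum of CBSLoss (Eq. 8) and TCBLoss (Eq. 9) with λ = 0.7."""
    return loss_cbs + lam * loss_tcb
```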
4. Experiments
4.1 Dataset, Tasks and Metrics
We use the Visual Genome 150 (VG150)[29] dataset, which is based on the Visual Genome dataset (VGD)[30], for experiments. This dataset includes 50 relationships and the 150 most common object categories in the VGD.
In this paper, the mean recall@K (mR@K) metric is examined in three tasks: predicate classification (PredCls), scene graph classification (SGCls), and scene graph detection (SGDet). The evaluation metrics for K predictions (@20, @50, and @100) are compared.
The mR@K metric is the average of the per-predicate-category R@K values. This metric provides a good indication of how well the model handles the dataset's unbalanced distribution.
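As a small worked example, mR@K can be computed by averaging the per-class R@K values; skipping classes that never appear as ground truth is an assumption about how missing classes are handled.

```python
import numpy as np

def mean_recall_at_k(per_class_recall):
    """Minimal sketch of mR@K: average the per-predicate-class R@K values,
    ignoring classes with no ground-truth instances (represented as NaN here)."""
    r = np.asarray(per_class_recall, dtype=float)
    return np.nanmean(r)

# e.g. three predicate classes with R@100 of 0.9, 0.2 and 0.1 -> mR@100 = 0.4
print(mean_recall_at_k([0.9, 0.2, 0.1]))
```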
The GSI method uses ResNeXt101-FPN as its backbone and is trained on Ubuntu with an Intel Core i7-10700KF CPU, 32 GB of RAM, and an M40 GPU. The initial learning rate is set to 0.001, the weight decay to 0.0001, the optimizer is stochastic gradient descent (SGD), the total number of training iterations is 50000, and the batch size is 4. The other hyperparameters are consistent with those used in the literature [5].
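For reference, a hedged sketch of the optimizer setup described above is given below; the momentum value is an assumption not stated in the text.

```python
import torch

def build_optimizer(model_params):
    """Sketch of the training setup above: SGD with lr 0.001 and weight decay 0.0001;
    the momentum value is an assumed choice."""
    return torch.optim.SGD(model_params, lr=0.001, weight_decay=0.0001, momentum=0.9)
```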
The experimental results for various values of the hyperparameter q in CBSLoss are shown in Table 1. The experimental results show that when q is set to 3.0, the method achieves the best effect. A q value of 0.0 indicates that CBSLoss degenerates to CBLoss.
Table 1. Experimental results for different values of q
The threshold s used in the geometric enhancement module is selected because the maximum IoU between object pairs in the same image typically lies between 0.75 and 0.95. Thus, to better capture the features of the most adjacent object pairs, we set the threshold s to 0.80.
4.2 Comparison with state-of-the-art methods
The GSI method is compared with three baseline models, Motifs, VCTree, and the Transformer, as well as their corresponding improved methods: TDE, STL[31], PCPL, CogTree, NARE[32], CAME[33] and LS-KD. In addition, the mR@K values of the IMP[34], KERN[35], GPS-Net[36], BGNN and NLS[37] models are listed in Table 2.
Table 2. Comparison with state-of-the-art methods
Table 2 demonstrates the following findings: (1) Among the three baseline models, the Transformer performs better than Motifs and VCTree due to its advantages in discriminating objects and generating relationship representations. (2) The GSI method can be added to the baseline models, and the combined models outperform the baseline models alone on all three tasks; thus, the combined methods are stable. (3) In the PredCls task, the Transformer-based GSI method clearly outperforms the VCTree-based and Motifs-based GSI methods. In the SGCls task, the VCTree-based GSI method significantly outperforms the Motifs-based and Transformer-based GSI methods. Finally, in the SGDet task, the Motifs-based GSI method performs better than the VCTree-based and Transformer-based GSI methods. (4) Even when compared with state-of-the-art models such as BGNN, NLS, and the VCTree-based NARE and LS-KD methods, the GSI method still shows advantages on the three tasks. (5) The Motifs-based GSI method shows a significant improvement over the Motifs-based CogTree method on all three tasks; indeed, it yields a larger improvement over its baseline than the VCTree-based and Transformer-based GSI methods do. Thus, our method effectively mitigates the impact of long-tailed distributions in the dataset.
The R@100 performance of each class is shown in Fig. 4 (a), and the head, body, and tail distributions in the dataset and their respective averages are shown in Fig. 4 (d), (c), (b), and (e). We note the following findings: (1) The GSI method outperforms the CogTree method in terms of the R@100 metric, showing that the GSI method significantly improves both the mR@100 and R@100 metrics. (2) Fig. 4 (e) and (b) show that the GSI method performs better than the CogTree method even on the body of the data. (3) Fig. 4 (a) and (b) demonstrate that the recall distribution across classes is relatively even.
Fig. 4. The prediction recall distribution of each relation in the PredCls task.
Fig. 5 shows four pairs of visualization examples. In the first image, the GSI method detects the relationship between the skier and the person, and this relationship is more specific than the ground-truth label ("standing on" rather than "on"). In the second, third, and fourth images, the GSI method detects more object pairs and relationships than the other methods.
Fig. 5. Visualization of SGG by the ground truth (blue), Transformer + CogTree (orange) and Transformer-based GSI (red) methods. A blue line between two nodes indicates that all three models predict the same relationship; if only orange and red lines are visible, the ground truth and Transformer-based GSI model predict the same relationship; and if only a red line is visible, only the Transformer-based GSI model predicts the relationship between the object pairs.
4.3 Ablation Experiments
The Transformer-based GSI method was investigated through four ablation studies, and the experimental results are shown in Table 3. In the table, the following letters are used to represent the four improved parts: “S” represents the semantic enhancement module, “G” represents the geometric enhancement module, and LCBS and LCB represent the improved CBSLoss function and the original CBLoss function, respectively.
Table 3. Ablation Experiments
Table 3 shows the results of the ablation experiments indicating the effect of each module on the model. We obtain several conclusions. (1) In the PredCls task, the geometric module improves the model most significantly, followed by the loss function module and the semantic module. (2) In the SGDet task, the geometric module improves the model most significantly, followed by the loss function module. (3) The semantic and geometric modules have the largest enhancement effect in the PredCls task due to the use of the ground-truth bounding box and labels in this task. The word embedding vector generated by the semantic module according to the labels relies heavily on the correctness of the labels. Moreover, the geometric module relies on the correctness of the bounding box to determine the degree of overlap among adjacent objects.
5. Conclusions
In this paper, we present the GSI method, a novel method for improving the mR@K metric of base SGG models. The method is model-agnostic and can be applied to the basic SGG models (Motifs, VCTree, and Transformer) and their improved variants. First, a geometric enhancement module is designed based on the idea that adjacent objects are related through their positions. Second, a semantic enhancement module is designed to further enhance the semantic information. Finally, LCBS is designed to penalize incorrect tail relations, which effectively improves the accuracy of tail relation prediction. This work also has limitations; for example, it introduces little external information. In future work, we will further address the long-tailed distribution of the data and consider incorporating more prior, common-sense information into the model to improve its prediction performance.
References
- V. Damodaran, S. Chakravarthy, A. Kumar, et al, "Understanding the role of scene graphs in visual question answering," arXiv:2101.05479v2, pp. 1-12, Jan. 2021.
- M. Hildebrandt, H. Li, R. Koner, et al, "Scene graph reasoning for visual question answering," arXiv:2007.01072v1, pp. 1-5, Jul. 2020.
- J. Pearl, D. Mackenzie, "The book of why: the new science of cause and effect," in the ladder of causation, New York, USA: Basic books, 2018, pp. 27-58.
- K. Tang, Y. Niu, J. Huang, et al, "Unbiased Scene Graph Generation from Biased Training," in Proc. of CVPR 2020: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 3716-3725, June 14-19, 2020.
- J. Yu, Y. Chai, Y. Wang, et al, "Cogtree: Cognition tree loss for unbiased scene graph generation," arXiv:2009.07526v2, pp. 1-7, Sep. 2020.
- W. Wang, R. Liu, M. Wang, et al, "Memory-Based Network for Scene Graph with Unbalanced Relations," in Proc. of ACMM 2020: Proceedings of the ACM Multimedia Conference on Multimedia Conference, Seattle, USA, pp. 2400-2408, Oct 12-16. 2020.
- S. Yan, C. Shen, Z. Jin, et al, "PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation," in Proc. of ACMM 2020: Proceedings of the ACM Multimedia Conference on Multimedia Conference, Seattle, USA, pp. 265-273, Oct 12-16. 2020.
- L. Li, J. Xiao, H. Shi, et al, "Label semantic knowledge distillation for unbiased scene graph generation," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1-11, 2023.
- Z. Ge, S. Liu, F. Wang, et al, "Yolox: Exceeding yolo series in 2021," arXiv:2107.08430v2, pp.1-7, Jul. 2021.
- H. Law, J. Deng, "Cornernet: Detecting objects as paired keypoints," in Proc. of ECCV 2018: Proceedings of the European Conference on Computer Vision, Munich, Germany, pp. 734-750, Sep 8-14. 2018.
- S. Ren, K. He, R. Girshick, et al, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Advances in neural information processing systems, pp. 28-34, 2015.
- K. He, G. Gkioxari, P. Dollar, et al, "Mask R-CNN," in Proc. of ICCV 2017: Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2961-2969, Oct 22-29. 2017.
- Y. Zhang, B. Kang, B. Hooi, et al, "Deep long-tailed learning: A survey," arXiv:2110.04596v1, pp. 1-20, Oct. 2021.
- B. Knyazev, H. de Vries, C. Cangea, et al, "Generative compositional augmentations for scene graph prediction," in Proc. of ICCV 2021: Proceedings of the IEEE International Conference on Computer Vision, Montreal, Canada, pp. 15827-15837, Oct 11-17. 2021.
- Y. Yao, A. Zhang, X. Han, et al, "Visual distant supervision for scene graph generation," in Proc. of ICCV 2021: Proceedings of the IEEE International Conference on Computer Vision, Montreal, Canada, pp. 15816-15826, Oct 11-17. 2021.
- L. Li, G. Chen, J. Xiao, et al, "Compositional Feature Augmentation for Unbiased Scene Graph Generation," arXiv: 2308.06712v1, pp.1-11, Aug. 2023.
- M. Wei, C. Yuan, X. Yue, et al, "HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation," in Proc. of ACMM 2020: Proceedings of the ACM Multimedia Conference on Multimedia Conference, Seattle, USA, pp. 1846-1854, Oct 12-16. 2020.
- R. Li, S. Zhang, B. Wan, et al, "Bipartite Graph Network With Adaptive Message Passing for Unbiased Scene Graph Generation," in Proc. of CVPR 2021: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, Tennessee, USA, pp. 11109-11119, Jun 19-25. 2021.
- M. J. Chiou, H. Ding, H. Yan, et al, "Recovering the Unbiased Scene Graphs from the Biased Ones," in Proc. of ACMM 2021: Proceedings of the ACM Multimedia Conference on Multimedia Conference, Chengdu, China, pp. 1581-1590, Oct 20-24. 2021.
- M. Suhail, A. Mittal, B. Siddiquie, et al, "Energy-based learning for scene graph generation," in Proc. of CVPR 2021: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, Tennessee, USA, pp. 13936-13945, Jun 19-25. 2021.
- S. Sun, S. Zhi, Q. Liao, et al, "Unbiased Scene Graph Generation via Two-stage Causal Modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12562-12580, Oct. 2023. https://doi.org/10.1109/TPAMI.2023.3285009
- B. Wen, J. Luo, X. Liu, et al, "Unbiased scene graph generation via rich and fair semantic extraction," arXiv:2002.00176, pp.1-9, Feb. 2020.
- R. Zellers, M. Yatskar, S. Thomson, et al, "Neural Motifs: Scene Graph Parsing With Global Context," in Proc. of CVPR 2018: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 5831-5840, Jun 18-22. 2018.
- T. He, L. Gao, J. Song, et al, "Learning from the scene and borrowing from the rich: Tackling the long tail in scene graph generation," arXiv:2006.07585v1, pp.1-7, Jun. 2020.
- K. Tang, H. Zhang, B. Wu, et al, "Learning to compose dynamic tree structures for visual contexts," in Proc. of CVPR 2019: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 6619-6628, Jun 16-20. 2019.
- Y. Y. Lee, H. Ke, H. H. Huang, et al, "Less is more: Filtering abnormal dimensions in glove," in Proc. of WWW 2016: Proceedings of the 25th International Conference Companion on World Wide Web, Montreal, Canada, pp. 71-72, Apr 11-15. 2016.
- Y. Cui, M. Jia, T. Y. Lin, et al, "Class-balanced loss based on effective number of samples," in Proc. of CVPR 2019: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 9268-9277, Jun 16-20. 2019.
- J. Wang, W. Zhang, Y. Zang, et al, "Seesaw loss for long-tailed instance segmentation," in Proc. of CVPR 2021: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, Tennessee, USA, pp. 9695-9704, Jun 19-25. 2021.
- D. Xu, Y. Zhu, C. B. Choy, et al, "Scene graph generation by iterative message passing," in Proc. of CVPR 2017: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 5410-5419, Jul 21-26. 2017.
- R. Krishna, Y. Zhu, O. Groth, et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations," International journal of computer vision, vol.123, no.1, pp. 32-73, Feb. 2017. https://doi.org/10.1007/s11263-016-0981-7
- D. Chen, X. Liang, Y. Wang, et al, "Soft transfer learning via gradient diagnosis for visual relationship detection," in Proc. of WACV 2019: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Hawaii, USA, pp. 1118-1126, Jan 7-11. 2019.
- A. Goel, B. Fernando, F. Keller, et al, "Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation," in Proc. of CVPR 2022: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, Louisiana, pp. 15596-15606, Jun 19-24. 2022.
- L. Zhou, Y. Zhou, T. L. Lam, et al, "Context-aware Mixture-of-Experts for Unbiased Scene Graph Generation," arXiv:2208.07109v2, pp.1-11, Aug. 2022.
- D. Xu, Y. Zhu, C.B. Choy, et al, "Scene graph generation by iterative message passing," in Proc. of CVPR 2017: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA, pp. 5410-5419, July 21-26. 2017.
- T. Chen, W. Yu, R. Chen, et al, "Knowledge-embedded routing network for scene graph generation," in Proc. of CVPR 2019: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Los Angeles, CA, pp. 6163-6171, Jun 15-21. 2019.
- X. Lin, C. Ding, J. Zeng, et al, "GPS-Net: Graph Property Sensing Network for Scene Graph Generation," in Proc. of CVPR 2020: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 3746-3753, June 14-19. 2020.
- Y. Zhong, J. Shi, J. Yang, et al, "Learning to generate scene graph from natural language supervision," in Proc. of ICCV 2021: Proceedings of the IEEE International Conference on Computer Vision, Montreal, Canada, pp. 1823-1834, Oct 11-17. 2021.