1. Introduction
Image segmentation is a pixel-level classification task, i.e., the division of an image into regions, and mainly comprises semantic segmentation, instance segmentation, and panoptic segmentation. Instance segmentation combines object detection and semantic segmentation: it requires not only classifying the pixels in an image but also localizing the different foreground instances that share the same semantic category. Instance segmentation therefore provides richer and more accurate image feature information and is widely used in satellite remote sensing image processing [1,2,3], defect detection [4,5,6], intelligent vehicles [7,8], medical image processing [9,10,11], and other fields. Deep-learning-based instance segmentation methods are mainly divided into two-stage, one-stage, and multi-stage approaches.
According to the order of detection and segmentation, two-stage methods are classified as top-down or bottom-up. Top-down methods, such as PANet [12] and MS R-CNN [13], first identify candidate regions of interest with an object detector and then segment within each candidate region; when the detector is inaccurate, its errors propagate directly into the subsequent segmentation and lower the overall accuracy. Another classical two-stage method, Mask R-CNN [14], proposes a new mask representation to improve mask resolution, but its handling of fine details remains limited. The bottom-up methods PointRend [15] and CondInst [16] instead cluster neighboring pixels according to certain rules and combine the clustering results with semantic information to identify object instances. However, because of their long processing time and high computational cost, bottom-up methods are advantageous only in certain complex-scene detection tasks.
Under the unified framework of fully convolutional networks, one-stage instance segmentation methods fall into two categories: anchor-based and anchor-free. Anchor-based one-stage methods decompose instance segmentation into two parallel branches, one generating a set of prototype masks and the other predicting per-instance mask coefficients, and then produce instance masks by linearly combining the prototypes with the coefficients [17]; YOLACT [18] and RTSS [19] are representative examples. Anchor-based models infer quickly but lack the ability to describe holes in objects [20]. In contrast, anchor-free one-stage methods add a mask branch to a detector or integrate location information into a one-stage segmentation framework, as in CenterMask [21], BlendMask [22], PolarMask [23], SOLO [24], and LSNet [25]. SOLOv2 [26] exploits the spatial correspondence between the semantic and mask branches to achieve a good balance of speed and accuracy, but its predictions can become ambiguous when multiple objects fall into the same grid cell.
Multi-stage instance segmentation based on a cascade structure, such as Cascade R-CNN [27], learns better features and further improves segmentation performance. Query-based models such as SOTR [28] and QueryInst [29], as well as Transformer-embedding methods such as K-Net [30] and MaskFormer [31], are the latest representatives of multi-stage methods and currently achieve the highest segmentation accuracy. However, the multi-stage design incurs a high computational cost. Building on this line, Transformer-based methods such as Mask2Former [32], Mask Transfiner [33], and FastInst [34] bring further improvements. Mask2Former designs masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. Mask Transfiner proposes an incoherent-region detection mechanism that decomposes and represents image regions as a quadtree and then corrects errors autonomously with a Transformer. FastInst follows the meta-architecture of Mask2Former and incorporates instance-activation-guided queries, a dual-path update strategy, and ground-truth mask-guided learning, allowing lighter pixel decoders and fewer Transformer decoder layers. In addition, PatchDCT [35], a recently proposed multi-stage real-time instance segmentation method, partitions the mask derived from a DCT vector into multiple patches and refines each patch with a specially designed classifier and regressor. Although these designs improve inference speed and the segmentation of edge details, they still have limitations in detecting small objects and handling object occlusion.
In general, two-stage and multi-stage instance segmentation methods are widely used and achieve high segmentation accuracy, but they struggle to meet real-time requirements. One-stage instance segmentation methods, which perform detection and segmentation in a single pass, achieve a better balance between segmentation accuracy and inference speed.
As an important part of intelligent transportation, instance segmentation of street scenes has broad application prospects in real-time road condition detection, traffic planning optimization, and autonomous driving. Street-scene instance segmentation uses computer vision techniques to segment objects such as vehicles, road signs, and buildings at the pixel level. It demands highly timely system feedback, yet most current instance segmentation algorithms cannot satisfy accuracy and real-time requirements simultaneously. We propose a real-time instance segmentation network for street scenes. The overall segmentation accuracy is further improved by reducing the computation of the backbone network, and the problems of missed segmentation of small objects and object occlusion are addressed. Ablation experiments are conducted on the MS COCO and Cityscapes datasets; compared with other mainstream methods, our method achieves more competitive performance in both segmentation speed and segmentation accuracy.
In summary, our main contributions can be summarized as follows:
• We propose a cross-stage fusion backbone network based on location attention to fully extract image features and reduce computational effort.
• We reduce the loss of shallow location information by integrating a two-way feature pyramid network and design a cross-stage mask feature fusion structure to alleviate the missed segmentation of small objects.
• We propose an adaptive minimum loss matching method to mitigate the occlusion problem caused by multiple object centers falling into the same grid cell.
The remainder of the paper is organized as follows. Section 2 presents the real-time instance segmentation method based on SOLOv2. Extensive experiments are implemented and the performance of our method is evaluated compared with others in Section 3. Finally, the conclusion is presented in Section 4.
2. Instance Segmentation Methods
We propose a cross-stage fusion backbone network based on location attention, a feature information fusion structure, and an adaptive minimum loss matching method. The general framework is shown in Fig. 1 and contains feature extraction, feature fusion, a semantic branch, and a mask branch.
Fig. 1. The general framework of the proposed instance segmentation network.
In Fig. 1, CSPAM represents the cross-stage fusion backbone network based on location attention, TWFPN represents the two-way feature pyramid network, and CSMF represents cross-stage mask feature fusion.
2.1 Cross-Stage Fusion Backbone Network
First, a cross-stage fusion structure is used to improve ResNet, the original backbone network of the SOLOv2 model, reducing the number of parameters in the convolutional layers, which decreases memory consumption and improves training efficiency. Then, the ReLU activation function is replaced with the Mish activation function, which offers training stability, non-monotonicity, no upper bound, and smoothness. The structure of the improved network is shown in Fig. 2, where BN denotes Batch Normalization, CONV denotes Convolution, and Concat denotes concatenation along the channel dimension.
Fig. 2. Improved cross-stage feature extraction network.
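For concreteness, a minimal PyTorch-style sketch of one cross-stage block with the Mish activation is given below. The layer widths, the 50/50 channel split, and the number of inner blocks are illustrative assumptions rather than the exact configuration of our backbone.

```python
import torch
import torch.nn as nn

class ConvBNMish(nn.Module):
    """CONV -> BN -> Mish, the basic unit used in this sketch (Mish replaces ReLU)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CrossStageBlock(nn.Module):
    """Cross-stage fusion: split channels, convolve one path, Concat both paths."""
    def __init__(self, channels, n_blocks=2):
        super().__init__()
        half = channels // 2
        self.split_a = ConvBNMish(channels, half, k=1)   # shortcut path, kept shallow
        self.split_b = ConvBNMish(channels, half, k=1)   # dense path
        self.blocks = nn.Sequential(*[ConvBNMish(half, half) for _ in range(n_blocks)])
        self.fuse = ConvBNMish(channels, channels, k=1)  # 1x1 fusion after Concat

    def forward(self, x):
        a = self.split_a(x)                 # carries shallow features unchanged
        b = self.blocks(self.split_b(x))    # deeper path with fewer parameters
        return self.fuse(torch.cat([a, b], dim=1))
```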
To fully extract the position information of the low-level feature maps, a Position Attention Module is inserted between the residual blocks, as shown in Fig. 3. The input feature map of size C × H × W is first passed through a 1 × 1 convolution that reduces the channel dimension to C1 = C/8, yielding a C1 × H × W feature map and decreasing the computation. The H × W dimensions are then flattened by a reshape operation to obtain three C1 × N two-dimensional feature maps P, A, and M (N = H × W). P is transposed and multiplied with A, the result is normalized by softmax and then multiplied with M to obtain the feature map PAM. After a 1 × 1 convolution and a reshape operation, the C2 × H × W feature map T2 is obtained.
Fig. 3. Position attention mechanism.
Finally, the reduced-dimension feature map T1 is multiplied by an adaptive weight factor θ and concatenated with T2 along the channel dimension so that the output has the same dimension as the input. θ is initialized to 0 and its optimal value is learned during training. The final output T3 of the module concatenates high-level and low-level information, making full use of both location and semantic information and increasing the local feature accuracy of the network.
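A minimal PyTorch-style sketch of the module follows. C1 = C/8 and the zero-initialized θ follow the description above; setting C2 = C − C1 so that the final Concat restores the input width, and obtaining P, A, and M by directly reshaping T1 (rather than through three separate 1 × 1 projections), are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttention(nn.Module):
    """Sketch of the position attention module (Fig. 3)."""
    def __init__(self, c):
        super().__init__()
        c1 = max(c // 8, 1)
        self.reduce = nn.Conv2d(c, c1, 1)          # 1x1 reduction -> T1 (C1 x H x W)
        self.expand = nn.Conv2d(c1, c - c1, 1)     # 1x1 conv producing T2 (C2 x H x W), C2 = C - C1 assumed
        self.theta = nn.Parameter(torch.zeros(1))  # adaptive weight, initialized to 0

    def forward(self, x):
        b, _, h, w = x.shape
        n = h * w
        t1 = self.reduce(x)                        # C1 x H x W
        p = a = m = t1.view(b, -1, n)              # reshape to C1 x N maps P, A, M
        affinity = F.softmax(p.transpose(1, 2) @ a, dim=-1)   # N x N spatial affinity
        pam = (m @ affinity).view(b, -1, h, w)     # PAM, back to C1 x H x W
        t2 = self.expand(pam)                      # T2: C2 x H x W
        return torch.cat([self.theta * t1, t2], dim=1)        # T3: C x H x W
```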
2.2 Feature Fusion Structure
The FPN [36] does not make effective use of shallow location information, so location information is lost as feature maps are passed upward, which is especially evident in the segmentation of small objects. We introduce a bottom-up information flow and design the two-way feature pyramid network TWFPN to supplement the missing location information. Specifically, the lateral inputs of the FPN are the output features of stages 2, 3, 4, and 5 of the backbone network; after 50–100 convolutional layers, the shallow location information is severely attenuated, so bottom-up location information enhancement is an effective remedy.
In place of the FPN operation that enlarges the feature layers to the same size and then sums them, the cross-stage mask feature fusion structure is designed in two ways: 1) after 2× upsampling, the high-level semantic layer P5 is concatenated with the lower feature layer N4 along the channel dimension to reduce the computation; 2) P5 is upsampled 8× to form a global semantic guide for P2, which improves the detection and segmentation of small objects. The feature information fusion structure is shown in Fig. 4, where CoordConv denotes Coordinate Convolution.
Fig. 4. Feature information fusion structure.
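The two fusion paths can be sketched as follows, assuming the usual FPN/PAN naming (P2–P5 for top-down outputs, N2–N5 for bottom-up outputs) and equal channel widths; whether the 8× guided path uses addition or concatenation is an assumption here.

```python
import torch
import torch.nn.functional as F

def cross_stage_mask_fusion(p2, n4, p5):
    """Sketch of the two fusion paths described in the text.
    1) P5 upsampled 2x and concatenated with N4 along the channel dimension.
    2) P5 upsampled 8x as a global semantic guide for P2 (addition assumed here)."""
    p5_up2 = F.interpolate(p5, scale_factor=2, mode="bilinear", align_corners=False)
    fused_mid = torch.cat([p5_up2, n4], dim=1)   # channel-wise Concat with N4

    p5_up8 = F.interpolate(p5, scale_factor=8, mode="bilinear", align_corners=False)
    fused_low = p2 + p5_up8                      # semantic guidance to help small objects
    return fused_mid, fused_low
```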
2.3 Adaptive Minimum Loss Function Matching Method
SOLOv2 suffers from an occlusion problem when multiple object centers fall into the same grid cell. To address it, an influence factor ε is first introduced into the mask loss \(L_{\mathrm{Mask}}\), which adaptively adjusts the weight of the predicted mask and increases instance segmentation accuracy. The category and mask losses are then normalized by the sigmoid function and combined by summation. Finally, the candidate losses are compared and the minimum one is used for network training. \(L_{\mathrm{Mask}}\), ε, and the total Loss are defined in formulas (1), (2), and (3), respectively.
\(\begin{align}L_{\mathrm{Mask}}=\frac{1}{N_{\mathrm{pos}}} \sum_{i \in P}\left(1-\frac{2 \sum m_{i} g_{i}+1}{\sum m_{i}+\sum g_{i}+1}+\varepsilon\right)\end{align}\) (1)
\(\begin{align}\varepsilon=\frac{\left(C_{P X}-C_{G X}\right)^{2}+\left(C_{P Y}-C_{G Y}\right)^{2}}{\left|\sum m_{i} g_{i}\right|+1}\end{align}\) (2)
\(\begin{align}Loss=\operatorname{sigmoid}\left(L_{\mathrm{Cate}}\right)+\operatorname{sigmoid}\left(\lambda L_{\mathrm{Mask}}\right)\end{align}\) (3)
where \(L_{\mathrm{Cate}}\) is the category loss computed with focal loss, P is the set of positive samples, and \(N_{\mathrm{pos}}\) is the number of positive samples. \(m_i\) and \(g_i\) are the predicted mask and the ground-truth mask of the ith feature point. \(C_P\) and \(C_G\) are the centroid positions of the predicted mask and the ground-truth mask, with subscripts X and Y denoting the horizontal and vertical coordinates, respectively.
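A hedged sketch of formulas (1)–(3) is shown below, assuming soft (sigmoid) predicted masks and binary ground-truth masks; the centroid computation, the per-positive-sample reduction, and the default value of λ reflect our reading of the formulas rather than the exact training code.

```python
import torch

def mask_centroid(mask):
    """Centroid (x, y) of a soft or binary mask of shape (H, W)."""
    h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    total = mask.sum() + 1e-6
    return (mask * xs).sum() / total, (mask * ys).sum() / total

def adaptive_mask_loss(pred_masks, gt_masks):
    """Eqs. (1)-(2): Dice-style loss with the centroid-based influence factor eps."""
    losses = []
    for m, g in zip(pred_masks, gt_masks):          # positive samples only
        inter = (m * g).sum()
        dice = (2 * inter + 1) / (m.sum() + g.sum() + 1)
        cpx, cpy = mask_centroid(m)
        cgx, cgy = mask_centroid(g)
        eps = ((cpx - cgx) ** 2 + (cpy - cgy) ** 2) / (inter.abs() + 1)
        losses.append(1 - dice + eps)
    return torch.stack(losses).mean()               # 1/N_pos * sum over positives

def total_loss(l_cate, l_mask, lam=3.0):
    """Eq. (3): both losses normalized by sigmoid before summation (lam assumed)."""
    return torch.sigmoid(l_cate) + torch.sigmoid(lam * l_mask)

def select_min_loss(candidate_losses):
    """Adaptive minimum matching: when several objects compete for one grid cell,
    keep the candidate assignment whose combined loss (Eq. 3) is smallest."""
    return min(candidate_losses)
```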
3. Performance Evaluation
We perform ablation experiments and evaluate our method against other instance segmentation methods on the MS COCO and Cityscapes datasets. The Cityscapes dataset is converted into the COCO annotation format and its original categories are reduced to the five categories of objects common in street scenes: cars, pedestrians, trucks, buses, and riders. Segmentation accuracy is evaluated by AP, the segmentation of small objects by APS, and inference speed by FPS (frames per second).
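As an illustration of this dataset preparation step, the following sketch filters a COCO-format annotation file down to the five retained categories; the category names are assumptions about how they appear after conversion.

```python
import json

KEEP = {"car", "pedestrian", "truck", "bus", "rider"}   # five street-scene categories (names assumed)

def filter_coco_categories(src_json, dst_json):
    """Keep only the five categories and the annotations that reference them."""
    with open(src_json) as f:
        coco = json.load(f)
    cats = [c for c in coco["categories"] if c["name"] in KEEP]
    keep_ids = {c["id"] for c in cats}
    coco["categories"] = cats
    coco["annotations"] = [a for a in coco["annotations"] if a["category_id"] in keep_ids]
    with open(dst_json, "w") as f:
        json.dump(coco, f)
```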
3.1 Experimental Setting
Data augmentation (random flipping, size scaling, random cropping, etc.) and data normalization are used for preprocessing, in line with the original SOLOv2 algorithm. To reduce computation, the training and test images are scaled to 1200 × 680 pixels. After several experiments, the model parameters are set as follows: the batch size is 16 (8 GPUs × 2 images per GPU); the optimizer is SGD with an initial learning rate of 0.01 and momentum of 0.9; the weight decay coefficient is 0.0001; the number of epochs is 40; and the learning rate is decayed by a factor of 0.1 at the 10th epoch.
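These settings translate roughly into the optimizer and schedule sketched below; this is a paraphrase of the listed hyperparameters, not the exact training script.

```python
import torch

def build_optimizer_and_scheduler(model):
    """SGD with the hyperparameters listed above: lr 0.01, momentum 0.9,
    weight decay 1e-4, learning rate decayed by 0.1 at epoch 10 (of 40)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[10], gamma=0.1)
    return optimizer, scheduler
```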
3.2 Ablation Experiment
Table 1 shows the results of eight sets of ablation experiments; the three structural designs added to the SOLOv2 network each improve recognition and segmentation accuracy. Based on the original ResNet-50-FPN backbone of SOLOv2, CSPAM denotes the cross-stage fusion backbone based on location attention, TWFPN + CSMF denotes the feature information fusion, and AML denotes the adaptive minimum loss matching method. Our improved SOLOv2 network with CSPAM, TWFPN + CSMF, and AML combined achieves the best accuracy, with increases of 3% in AP and 2.6% in APS.
Table 1. Results of ablation experiments on the COCO dataset
Fig. 5 compares the results of the original SOLOv2 algorithm and our improved algorithm on seven groups of instance segmentation examples. The top-bottom comparisons of groups 1 and 2 show that the original algorithm has defects in segmenting boundary details, whereas our improved algorithm segments edge details better. The top-bottom comparison of group 3 and the left-right comparison of group 4 show that the original algorithm mislabels the categories of birds and trucks, while our improved algorithm assigns the correct category labels. The comparisons of groups 4, 5, 6, and 7 show that the original algorithm is prone to missed segmentation of smaller objects, while our improved algorithm markedly reduces the missed segmentation of small objects such as people and cars.
Fig. 5. Comparing experimental results on the COCO dataset.
3.3 Performance Comparison
Table 2 compares our improved SOLOv2 algorithm with other classical instance segmentation algorithms on the COCO dataset. To ensure a fair comparison, the backbone network of every method is unified as ResNet-101-FPN, and forward inference times are measured on a 2080Ti GPU. Due to differences in hardware and parameter settings, the reproduced results of the algorithms marked with * differ slightly from those in the original papers.
Table 2. Comparison with other methods on the COCO dataset
As shown in Table 2, the two-stage and multi-stage instance segmentation algorithms achieve better segmentation accuracy, but their FPS is mostly below 10, which does not meet the minimum requirement for real-time inference. The segmentation accuracy of one-stage algorithms is generally lower than that of two-stage and multi-stage methods, while their segmentation speed is faster. Among the one-stage algorithms, YOLACT and SOLOv2 reach the lower bound for real-time instance segmentation. Our improved method, with its parallel inference structure, achieves the highest segmentation accuracy among the classical one-stage instance segmentation methods as well as the fastest inference speed.
Table 3 compares our improved SOLOv2 with other instance segmentation methods on the street-scene dataset Cityscapes, where the backbone network is ResNet-50-FPN. Most mainstream instance segmentation algorithms cannot achieve real-time performance, and their backbone networks are relatively large. The algorithms commonly used in street-scene applications mainly include Mask R-CNN and YOLACT; compared with them, the instance segmentation method proposed in this paper greatly improves the segmentation accuracy of the different object categories.
Table 3. Comparison with other methods on the Cityscapes dataset
4. Conclusion
In this work, we propose a real-time instance segmentation network for street scenes. The cross-stage fusion backbone network based on location attention improves model accuracy and reduces computational effort. The feature information fusion structure enriches shallow location information and high-level semantic information to alleviate the missed segmentation of small objects. The adaptive minimum loss matching method reduces the loss of segmentation accuracy caused by object occlusion. The evaluation results show that our proposed network achieves competitive performance compared with other mainstream methods, and its inference speed satisfies real-time requirements.
Acknowledgment
This work is supported by the National Natural Science Foundation of China under Grant 12071025 and the Scientific and Technological Innovation Foundation of Foshan Municipal People's Government (BK20AE004).
References
- X. Xu, Z. Feng, C. Cao, M. Li, J. Wu, Z. Wu, Y. Shang, and S. Ye, "An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation," Remote Sens., vol.13, no.23, 2021.
- W. Zhao, J. Na, M. Li, and H. Ding, "Rotation-Aware Building Instance Segmentation From High-Resolution Remote Sensing Images," IEEE Geosci. Remote Sens. Lett., vol.19, pp.1-5, 2022.
- B. Chai, X. Nie, H. Gao, J. Jia, and Q. Qiao, "Remote Sensing Images Background Noise Processing Method for Ship Objects in Instance Segmentation," J. Indian Soc. Remote Sens., vol.51, pp.647-659, 2023.
- M. Ding, B. Wu, J. Xu, A. N. Kasule, and H. Zuo, "Visual inspection of aircraft skin: Automated pixel-level defect detection by instance segmentation," Chin. J. Aeronaut., vol.35, no.10, pp.254-264, 2022.
- E. Antwi-Bekoe, G. Liu, J.-P. Ainam, G. Sun, and X. Xie, "A deep learning approach for insulator instance segmentation and defect detection," Neural Comput. Applic., vol.34, pp.7253-7269, 2022.
- C. Zhang and X. Zhang, "Multi-target domain-based hierarchical dynamic instance segmentation method for steel defects detection," Neural Comput. & Applic., vol.35, pp.7389-7406, 2023.
- Y. Sun, J. Li, X. Xu, and Y. Shi, "Adaptive Multi-Lane Detection Based on Robust Instance Segmentation for Intelligent Vehicles," IEEE Trans. Intell. Veh., vol.8, no.1, pp.888-899, 2023.
- S. Du, Z. Chen, L. Li, H. Zhang, D. Cao, and L. Chen, "NFIS: A NMS-free FCOS Method for Instance Segmentation," in Proc. of 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pp.3150-3155, Indianapolis, IN, USA, Sep. 2021.
- X. Zhang, Y. Yang, Y. Shen, P. Li, Y. Zhong, J. Zhou, K. Zhang, C. Shen, Y. Li, M. Zhang, L. Pan, L. Ma, and H. Liu, "SeUneter: Channel attentive U-Net for instance segmentation of the cervical spine MRI medical image," Front. Physiol., vol.13, 2022.
- X. Pang, Z. Zhao, Y. Wang, F. Li, and F. Chang, "LGMSU-Net: Local Features, Global Features, and Multi-Scale Features Fused the U-Shaped Network for Brain Tumor Segmentation," Electronics, vol.11, no.12, 2022.
- D. N. H. Thanh, L. T. Thanh, U. Erkan, A. Khamparia, and V. B. Surya Prasath, "Dermoscopic image segmentation method based on convolutional neural networks," Int. J. Comput. Appl. Technol., vol.66, no.2, pp.89-99, 2021.
- S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," in Proc. of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.8759-8768, Salt Lake City, UT, USA, Jun. 2018.
- Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, "Mask Scoring R-CNN," in Proc. of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.6402-6411, Long Beach, CA, USA, Jun. 2019.
- K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. of 2017 IEEE International Conference on Computer Vision (ICCV), pp.2980-2988, Venice, Italy, Oct. 2017.
- A. Kirillov, Y. Wu, K. He, and R. Girshick, "PointRend: Image Segmentation As Rendering," in Proc. of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.9796-9805, Seattle, WA, USA, Jun. 2020.
- Z. Tian, B. Zhang, H. Chen, and C. Shen, "Instance and Panoptic Segmentation Using Conditional Convolutions," IEEE Trans. Pattern Anal. Mach. Intell., vol.45, no.1, pp.669-680, 2023.
- D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT++ Better Real-Time Instance Segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol.44, no.2, pp.1108-1121, 2022.
- S. Koles, S. Karakas, A. P. Ndigande, and S. Ozer, "Using Different Loss Functions with YOLACT++ for Real-Time Instance Segmentation," in Proc. of 2023 46th International Conference on Telecommunications and Signal Processing (TSP), pp.264-267, Prague, Czech Republic, Jul. 2023.
- J. Cai and Y. Li, "Realtime single-stage instance segmentation network based on anchors," Comput. Electr. Eng., vol.95, 2021.
- E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo, "PolarMask: Single Shot Instance Segmentation With Polar Representation," in Proc. of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.12190-12199, Seattle, WA, USA, Jun. 2020.
- Y. Lee and J. Park, "CenterMask: Real-Time Anchor-Free Instance Segmentation," in Proc. of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.13903-13912, Seattle, WA, USA, Jun. 2020.
- H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, "BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation," in Proc. of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.8570-8578, Seattle, WA, USA, Jun. 2020.
- E. Xie, W. Wang, M. Ding, R. Zhang, and P. Luo, "PolarMask++: Enhanced Polar Representation for Single-Shot Instance Segmentation and Beyond," IEEE Trans. Pattern Anal. Mach. Intell., vol.44, no.9, pp.5385-5400, 2022.
- X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, "SOLO: Segmenting Objects by Locations," in Proc. of 16th European Conference on Computer Vision - ECCV 2020, Lecture Notes in Computer Science, vol.12363, pp.649-665, Glasgow, UK, Dec. 2020.
- K. Duan, L. Xie, H. Qi, S. Bai, Q. Huang, and Q. Tian, "Location-Sensitive Visual Recognition with Cross-IOU Loss," arXiv preprint arXiv:2104.04899, Apr. 2021.
- X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," in Proc. of NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems, pp.17721-17732, Vancouver, BC, Canada, Dec. 2020.
- Z. Cai and N. Vasconcelos, "Cascade R-CNN: High Quality Object Detection and Instance Segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol.43, no.5, pp.1483-1498, 2021.
- R. Guo, D. Niu, L. Qu, and Z. Li, "SOTR: Segmenting Objects with Transformers," in Proc. of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp.7137-7146, Montreal, QC, Canada, Oct. 2021.
- Y. Fang, S. Yang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu, "Instances as Queries," in Proc. of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp.6890-6899, Montreal, QC, Canada, Oct. 2021.
- W. Zhang, J. Pang, K. Chen, and C. C. Loy, "K-Net: towards unified image segmentation," in Proc. of NIPS'21: Proceedings of the 35th International Conference on Neural Information Processing Systems, pp.10326-10338, Dec. 2021.
- B. Cheng, A. G. Schwing, and A. Kirillov, "Per-Pixel Classification is Not All You Need for Semantic Segmentation," arXiv preprint arXiv:2107.06278, Oct. 2021.
- B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention Mask Transformer for Universal Image Segmentation," in Proc. of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.1280-1289, New Orleans, LA, USA, Jun. 2022.
- L. Ke, M. Danelljan, X. Li, Y.-W. Tai, C.-K. Tang, and F. Yu, "Mask Transfiner for High-Quality Instance Segmentation," in Proc. of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.4402-4411, New Orleans, LA, USA, Jun. 2022.
- J. He, P. Li, Y. Geng, and X. Xie, "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation," in Proc. of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.23663-23672, Vancouver, BC, Canada, Jun. 2023.
- Q. Wen, J. Yang, X. Yang, and K. Liang, "PatchDCT: Patch Refinement for High Quality Instance Segmentation," in Proc. of Eleventh International Conference on Learning Representations (ICLR), pp.1-15, Kigali, Rwanda, Feb. 2023.
- M. Hu, Y. Li, L. Fang, and S. Wang, "A2-FPN: Attention Aggregation based Feature Pyramid Network for Instance Segmentation," in Proc. of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.15338-15347, Nashville, TN, USA, Jun. 2021.