High-Speed Transformer for Panoptic Segmentation

Baek, Jong-Hyeon; Kim, Dae-Hyun; Lee, Hee-Kyung; Choo, Hyon-Gon; Koh, Yeong Jun

doi:10.5909/JBE.2022.27.7.1011

Journal of Broadcast Engineering (방송공학회논문지)

Volume 27 Issue 7 / Pages 1011-1020 / 2022 / 1226-7953 (pISSN) / 2287-9137 (eISSN)

The Korean Institute of Broadcast and Media Engineers (한국방송∙미디어공학회)

Baek, Jong-Hyeon (Department of Computer Engineering, ChungNam National University);
Kim, Dae-Hyun (Department of Computer Engineering, ChungNam National University);
Lee, Hee-Kyung (Electronics and Telecommunications Research Institute);
Choo, Hyon-Gon (Electronics and Telecommunications Research Institute);
Koh, Yeong Jun (Department of Computer Engineering, ChungNam National University)

Received : 2022.10.17
Accepted : 2022.12.08
Published : 2022.12.20

https://doi.org/10.5909/JBE.2022.27.7.1011

Abstract

Recent high-performance panoptic segmentation models are based on transformer architectures. However, transformer-based panoptic segmentation methods are generally slower than convolution-based methods, since the attention mechanism in the transformer has quadratic complexity with respect to image resolution. In addition, the sine and cosine computation for positional embedding in the transformer is a bottleneck in computation time. To address these problems, we adopt three modules to speed up inference of transformer-based panoptic segmentation. First, we perform channel-level reduction using depth-wise separable convolution on the inputs of the transformer decoder. Second, we replace the sine- and cosine-based positional encoding with convolution operations, called conv-embedding. Third, we apply separable self-attention to the transformer encoder, lowering the complexity from quadratic to linear in the number of image pixels. As a result, with all three modules enabled, the proposed model achieves 44% higher frames per second (FPS) than the baseline on the ADE20K panoptic validation set.
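To make the linear-complexity idea concrete, the following is a minimal sketch of separable self-attention in the general form described by Mehta and Rastegari in "Separable Self-attention for Mobile Vision Transformers". It is not this paper's implementation: the weight names (`w_i`, `w_k`, `w_v`, `w_o`), the toy sizes, and the random weights are all assumptions made for illustration. Each token gets a single scalar score, so no k x k attention map is ever formed.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """x holds k tokens of dimension d; the work grows linearly with k."""
    # One scalar score per token, normalised across all k tokens.
    scores = softmax([row[0] for row in matmul(x, w_i)])
    keys = matmul(x, w_k)
    d = len(keys[0])
    # Context vector: score-weighted sum of the key projections (length d).
    context = [sum(s * key[j] for s, key in zip(scores, keys)) for j in range(d)]
    # ReLU-gated values, each scaled element-wise by the shared context vector.
    values = [[max(0.0, v) for v in row] for row in matmul(x, w_v)]
    mixed = [[v * c for v, c in zip(row, context)] for row in values]
    return matmul(mixed, w_o)

def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights stand in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
k, d = 6, 4  # toy sizes: 6 tokens (pixels), 4 channels
x = random_matrix(k, d, rng)
w_i = random_matrix(d, 1, rng)
w_k, w_v, w_o = (random_matrix(d, d, rng) for _ in range(3))
out = separable_self_attention(x, w_i, w_k, w_v, w_o)
print(len(out), len(out[0]))  # 6 4
```

In this sketch the only interaction between tokens is the length-d context vector, so doubling the number of pixels roughly doubles the work, whereas standard self-attention would compute a score for every pair of pixels.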

Acknowledgement

This work was supported in part by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00011, Video Coding for Machine) and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2022 R1I1A3069113).

