Dual Attention Based Image Pyramid Network for Object Detection

  • Dong, Xiang (Institute of Information Science, Beijing Jiaotong University) ;
  • Li, Feng (Institute of Information Science, Beijing Jiaotong University) ;
  • Bai, Huihui (Institute of Information Science, Beijing Jiaotong University) ;
  • Zhao, Yao (Institute of Information Science, Beijing Jiaotong University)
  • Received : 2021.02.19
  • Accepted : 2021.11.20
  • Published : 2021.12.31

Abstract

Compared with two-stage object detection algorithms, one-stage algorithms provide a better trade-off between real-time performance and accuracy. However, these methods treat intermediate features equally and therefore lack the flexibility to emphasize the information that is meaningful for classification and localization. In addition, they ignore the interaction of contextual information across scales, which is important for detecting medium and small objects. To tackle these problems, we propose an image pyramid network based on a dual attention mechanism (DAIPNet) for one-stage object detection, which builds an image pyramid to enrich spatial information while emphasizing informative multi-scale features through dual attention mechanisms. Our framework uses a pre-trained backbone as the standard detection network, with the designed image pyramid network (IPN) serving as an auxiliary network that provides complementary information. The dual attention mechanism is composed of the adaptive feature fusion module (AFFM) and the progressive attention fusion module (PAFM). AFFM is designed to automatically weight the feature maps from the backbone and the auxiliary network according to their importance, while PAFM adaptively learns channel-attentive information during the context transfer process. Furthermore, in the IPN, we build an image pyramid to extract scale-wise features from downsampled images of different scales; these features are further fused at different states to enrich scale-wise information and learn more comprehensive feature representations. Experimental results are reported on the MS COCO dataset. With a 300 × 300 input, our proposed detector achieves 32.6% mAP on MS COCO test-dev, which is superior to the compared state-of-the-art methods.
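The two ideas named in the abstract, an image pyramid built from downsampled inputs and channel-wise attention over feature maps, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the downsampling operator or the form of the attention, so the 2 × 2 average pooling and the parameter-free sigmoid gate used here are assumptions, and the names `downsample2x`, `build_image_pyramid`, and `channel_attention` are hypothetical.

```python
import math


def downsample2x(image):
    """Halve height and width by averaging each 2x2 block.

    Assumption: average pooling stands in for whatever downsampling
    the paper actually uses.
    """
    h, w = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]


def build_image_pyramid(image, levels):
    """Return a list of `levels` images: the input, then successive halvings."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid


def channel_attention(feature_maps):
    """Scale each channel by a sigmoid of its global average.

    Assumption: a parameter-free gate replaces the learned attention
    weights that a trained module such as PAFM would use.
    """
    weights = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        mean = total / (len(channel) * len(channel[0]))
        weights.append(1.0 / (1.0 + math.exp(-mean)))
    reweighted = [
        [[value * w for value in row] for row in channel]
        for channel, w in zip(feature_maps, weights)
    ]
    return reweighted, weights


if __name__ == "__main__":
    # Toy 8x8 single-channel "image" whose pixel value is its raster index.
    img = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
    pyramid = build_image_pyramid(img, levels=3)
    print([(len(p), len(p[0])) for p in pyramid])

    # Two toy 2x2 channels; the channel with the larger mean gets a larger weight.
    features = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
    _, weights = channel_attention(features)
    print(weights)
```

In a real detector, each pyramid level would pass through convolutional layers before being fused with backbone features, and the attention weights would be learned rather than computed directly from channel means.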
