Effective Multi-Modal Feature Fusion for 3D Semantic Segmentation with Multi-View Images

  • Received : 2023.08.16
  • Accepted : 2023.10.07
  • Published : 2023.12.31

Abstract

3D point cloud semantic segmentation is a computer vision task that partitions a point cloud into objects and regions by predicting a class label for each point. Existing 3D semantic segmentation models do not fuse multi-modal features in a way that sufficiently accounts for the distinct characteristics of the 2D visual features extracted from RGB images and the 3D geometric features extracted from the point cloud. In this paper, we therefore propose MMCA-Net, a novel 3D semantic segmentation model that uses 2D-3D multi-modal features. The proposed model effectively fuses the heterogeneous 2D visual and 3D geometric features by combining an intermediate fusion strategy with a fusion operation based on multi-modal cross attention. In addition, by adopting PTv2 as its 3D geometric encoder, the model extracts context-rich 3D geometric features from an input point cloud whose points are irregularly distributed. To analyze the performance of the proposed model, we conducted quantitative and qualitative experiments on the ScanNetv2 benchmark dataset. In terms of mIoU, the proposed model improved on the PTv2 model, which uses only 3D geometric features, by 9.2%, and on the MVPNet model, which uses 2D-3D multi-modal features, by 12.12%. These results demonstrate the effectiveness and usefulness of the proposed model.
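
The cross attention-based fusion mentioned above can be illustrated with a minimal sketch. It is not the MMCA-Net implementation: the abstract does not specify the layer design, so the function name `cross_attention_fuse`, the absence of learned query/key/value projections, and the concatenation of the two feature vectors are all assumptions made for illustration. The sketch treats the 3D geometric feature of one point as the query and the 2D visual features from the views in which that point is visible as keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_fuse(point_feat, view_feats):
    """Fuse one point's 3D feature with its multi-view 2D features.

    point_feat: list of d floats (hypothetical 3D geometric feature, the query).
    view_feats: list of K vectors of d floats (hypothetical 2D visual features,
                used as both keys and values; no learned projections here).
    Returns (fused, weights), where fused has length 2 * d.
    """
    if not view_feats:
        raise ValueError("at least one 2D view feature is required")
    d = len(point_feat)
    scale = 1.0 / math.sqrt(d)  # scaled dot-product attention
    scores = [dot(point_feat, v) * scale for v in view_feats]
    weights = softmax(scores)  # one attention weight per view
    # Attention-weighted sum of the 2D features, one value per channel.
    attended = [sum(w * v[i] for w, v in zip(weights, view_feats)) for i in range(d)]
    # Assumption: combine modalities by concatenation; a residual sum is another option.
    return point_feat + attended, weights


if __name__ == "__main__":
    fused, weights = cross_attention_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(fused, weights)
```

In this sketch, views whose 2D features align more closely with the point's 3D feature receive larger weights, so the fused vector keeps the original geometric feature and appends a view-weighted visual feature.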

Acknowledgement

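This research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) through the ICT and Broadcasting Technology Development Program (No. 2020-0-00096, Development of task planning technology for individual robots and robot groups connected to the cloud).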
