Research on Pairwise Attention Reinforcement Model Using Feature Matching

  • Joon-Shik Lim (Dept. of Computer Engineering, Gachon University) ;
  • Yeong-Seok Ju (Dept. of Computer Engineering, Gachon University)
  • Received : 2024.09.05
  • Accepted : 2024.09.24
  • Published : 2024.09.30

Abstract

Vision Transformer (ViT) learns relationships between patches, but when it overlooks important features such as color, texture, and boundaries, performance limitations can arise in domains such as medical imaging and facial recognition. To address this issue, this study proposes the Pairwise Attention Reinforcement (PAR) model. The PAR model feeds both a training image and a reference image into the encoder, computes the similarity between the two images, and matches the attention score maps of images with high similarity, reinforcing the matched regions of the training image. This process emphasizes the important features shared between images and allows even subtle differences to be distinguished. In experiments on clock-drawing test data, the PAR model achieved a Precision of 0.9516, a Recall of 0.8883, an F1-Score of 0.9166, and an Accuracy of 92.93%. The proposed model showed a 12% performance improvement over API-Net, which uses a pairwise attention approach, and a 2% improvement over the ViT model.
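
To illustrate the mechanism described in the abstract, the following is a minimal PyTorch-style sketch of the pairwise attention reinforcement idea: patch embeddings of a training image and a reference image are encoded, patch-wise similarity between the two is computed, and patches whose similarity is high are reinforced in proportion to the training image's attention score map. The module name, the single shared encoder layer, and the 0.5 similarity threshold are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of pairwise attention reinforcement, assuming ViT-style patch
# embeddings as input. All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseAttentionReinforcement(nn.Module):
    """Illustrative sketch, not the paper's actual implementation."""

    def __init__(self, dim: int = 768, num_heads: int = 12, sim_threshold: float = 0.5):
        super().__init__()
        # A single ViT-style encoder layer stands in for the full pretrained
        # ViT encoder assumed by the paper (assumption).
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.sim_threshold = sim_threshold

    def forward(self, train_patches: torch.Tensor, ref_patches: torch.Tensor) -> torch.Tensor:
        # train_patches, ref_patches: (batch, num_patches, dim) patch embeddings.
        z_train = self.encoder(train_patches)
        z_ref = self.encoder(ref_patches)

        # Patch-wise cosine similarity between training and reference image.
        sim = F.cosine_similarity(z_train, z_ref, dim=-1)  # (batch, num_patches)

        # Attention score map of the training image, approximated here by the
        # average self-attention weight each patch receives.
        attn = torch.softmax(
            z_train @ z_train.transpose(1, 2) / z_train.shape[-1] ** 0.5, dim=-1
        )
        score_map = attn.mean(dim=1)  # (batch, num_patches)

        # Reinforce patches whose similarity to the reference exceeds the
        # (hypothetical) threshold, scaled by their attention scores.
        gate = (sim > self.sim_threshold).float() * sim
        return z_train * (1.0 + (gate * score_map).unsqueeze(-1))


if __name__ == "__main__":
    # 196 patches of dimension 768, as for a 224x224 image with 16x16 patches.
    model = PairwiseAttentionReinforcement()
    train = torch.randn(2, 196, 768)
    ref = torch.randn(2, 196, 768)
    print(model(train, ref).shape)  # torch.Size([2, 196, 768])
```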

References

  1. A. Vaswani et al., "Attention is all you need," in Adv. Neural Inf. Process. Syst., 2017, vol.30, pp.5998-6008. DOI: 10.48550/arXiv.1706.03762
  2. A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020. DOI: 10.48550/arXiv.2010.11929
  3. Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp.10012-10022. DOI: 10.1109/ICCV48922.2021.00986
  4. C. F. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp.357-366. DOI: 10.48550/arXiv.2103.14899
  5. R. Raksasat, S. Teerapittayanon, S. Itthipuripat, K. Praditpornsilpa, A. Petchlorlian, T. Chotibut, and I. Chatnuntawech, "Attentive pairwise interaction network for AI-assisted clock drawing test assessment of early visuospatial deficits," Sci. Rep., vol.13, no.1, p.18113, 2023. DOI: 10.1038/s41598-023-44723-1
  6. S. Chen et al., "Automatic dementia screening and scoring by applying deep learning on clock-drawing tests," Sci. Rep., vol.10, no.1, p.20854, 2020. DOI: 10.1038/s41598-020-74710-9
  7. J. Yao et al., "Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework," IEEE Transactions on Geoscience and Remote Sensing, 2023. DOI: 10.1109/TGRS.2023.3284671
  8. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014. DOI: 10.48550/arXiv.1409.1556
  9. K. He et al., "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp.770-778. DOI: 10.1109/CVPR.2016.90
  10. H. Inoue, "Data augmentation by pairing samples for images classification," arXiv preprint arXiv:1801.02929, 2018. DOI: 10.48550/arXiv.1801.02929
  11. D. Yarats, I. Kostrikov, and R. Fergus, "Image augmentation is all you need: Regularizing deep reinforcement learning from pixels," in Int. Conf. Learn. Represent., 2021. DOI: 10.48550/arXiv.2004.13649
  12. H. Zhu, W. Ke, D. Li, J. Liu, L. Tian, and Y. Shan, "Dual cross-attention learning for fine-grained visual categorization and object re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp.4692-4702. DOI: 10.48550/arXiv.2205.02151
  13. X. Peng et al., "Optical Remote Sensing Image Change Detection Based on Attention Mechanism and Image Difference," IEEE Transactions on Geoscience and Remote Sensing, vol.59, no.9, pp.7426-7440, Sep. 2021. DOI: 10.1109/TGRS.2020.3033009
  14. S. Mehta and M. Rastegari, "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer," arXiv preprint arXiv:2110.02178, 2021. DOI: 10.48550/arXiv.2110.02178
  15. M. Dehghani et al., "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution," arXiv preprint arXiv:2307.06304, 2023. DOI: 10.48550/arXiv.2307.06304
  16. K. Xu, P. Deng, and H. Huang, "Vision Transformer: An Excellent Teacher for Guiding Small Networks in Remote Sensing Image Scene Classification," IEEE Transactions on Geoscience and Remote Sensing, vol.60, pp.1-15, 2022. DOI: 10.1109/TGRS.2022.3152566
  17. T. Stegmuller, B. Bozorgtabar, A. Spahr, and J. P. Thiran, "Scorenet: Learning non-uniform attention and augmentation for transformer-based histopathological image classification," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2023, pp.6170-6179.
  18. W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp.568-578. DOI: 10.1109/ICCV48922.2021.00061
  19. S. Amini et al., "An AI-assisted online tool for cognitive impairment detection using images from the clock drawing test," medRxiv, 2021. DOI: 10.1101/2021.03.06.21253047
  20. Q. Chen, J. Fan, and W. Chen, "An improved image enhancement framework based on multiple attention mechanism," Displays, vol.70, p.102091, 2021. DOI: 10.1016/j.displa.2021.102091