Acknowledgement
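This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network).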
References
- Bo He et al., "MA-LMM: Memory-augmented large multimodal model for long-term video understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Enxin Song et al., "MovieChat: From Dense Token to Sparse Memory for Long Video Understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Dejing Xu et al., "Video Question Answering via Gradually Refined Attention over Appearance and Motion," in Proceedings of the ACM International Conference on Multimedia, 2017.
- Muhammad Maaz et al., "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding," arXiv preprint arXiv:2406.09418, 2024.
- Daniel Bolya et al., "Token Merging: Your ViT But Faster," in Proceedings of the International Conference on Learning Representations, 2023.