Trends in Temporal Action Detection in Untrimmed Videos

  • Published: 2020.06.01

Abstract

Temporal action detection (TAD) in untrimmed videos is an important but challenging problem in computer vision that has gained increasing interest in recent years. Although most studies on actions in videos have addressed action recognition in trimmed videos, TAD methods are required to understand real-world untrimmed videos, which consist mostly of background along with a few meaningful action instances belonging to multiple action classes. TAD mainly comprises temporal action localization, which generates temporal action proposals, i.e., candidate segments each likely to contain a single action, and action recognition, which classifies these proposals into action classes. Generating temporal action proposals with accurate temporal boundaries remains the central challenge in TAD. In this paper, we review representative deep learning-based TAD methods that have achieved high performance. We further examine evaluation methodologies for TAD, including benchmark datasets and performance measures, and compare the performance of the discussed TAD models.
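Both the proposal-then-classify pipeline and the performance measures mentioned above rest on one quantity: the temporal overlap between a candidate segment and a ground-truth action instance. As a minimal illustration (a Python sketch, not code from any of the surveyed methods; all segment boundaries and the 0.5 threshold are hypothetical), the following shows how temporal IoU (tIoU) is computed and how confidence-ranked proposals are greedily matched to ground-truth instances, which is the basis of mAP evaluated at fixed tIoU thresholds:

```python
# Minimal sketch of temporal IoU and proposal matching for TAD evaluation.
# All segments and the threshold below are illustrative placeholders.

from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def match_proposals(preds: List[Segment], gts: List[Segment],
                    tiou_thresh: float = 0.5) -> int:
    """Greedily count predictions that hit a not-yet-matched ground-truth
    instance at the given tIoU threshold (the true positives behind
    AP@tIoU=0.5 and similar measures)."""
    matched = [False] * len(gts)
    hits = 0
    for p in preds:  # assumes preds are sorted by confidence, highest first
        for i, g in enumerate(gts):
            if not matched[i] and temporal_iou(p, g) >= tiou_thresh:
                matched[i] = True
                hits += 1
                break
    return hits

if __name__ == "__main__":
    gts = [(12.0, 20.0), (45.0, 52.5)]                   # ground-truth instances
    preds = [(11.5, 19.0), (44.0, 50.0), (70.0, 75.0)]   # proposals by confidence
    print(match_proposals(preds, gts))                    # -> 2 true positives
```

Benchmarks differ mainly in how they aggregate this matching: THUMOS'14 commonly reports mAP at individual tIoU thresholds (e.g., 0.5), while ActivityNet averages mAP over a range of tIoU thresholds from 0.5 to 0.95.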

Acknowledgement

This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korean government (MSIT) [No. B0101-15-0266, Development of a high-performance visual discovery platform for understanding and predicting real-time large-scale video data, and No. 2020-0-00004, Development of core technology for predictive visual intelligence based on long-term visual memory networks].
