Fig. 1. Example of human activity detection in a video.
Fig. 2. The process of activity detection in video.
Fig. 3. Feature extraction with two different convolutional neural networks.
Fig. 4. Bi-directional LSTM (BLSTM) model.
Fig. 5. Classification score for each activity (aj) per video segment (ti).
Fig. 6. Threshold-based activity localization.
Fig. 7. Evaluation of localization performance. (a)-(c) three examples of localization results with two different classification models, LSTM and BLSTM, and (d) one example of localization results with two different feature models, C3D only and C3D+I-ResNet.
Table 1. Comparison of feature models
Table 2. Comparison of classification models
Table 3. Comparison with previous models in terms of activity localization
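The threshold-based localization illustrated in Figs. 5 and 6 can be sketched as follows: per-segment classification scores for an activity are compared against a threshold, and consecutive above-threshold segments are merged into activity intervals. This is a minimal illustrative sketch; the score values and threshold below are invented for the example, not taken from the paper.

```python
def localize(scores, threshold):
    """Return (start, end) segment-index intervals where score >= threshold.

    Consecutive segments whose classification score meets the threshold
    are merged into a single temporal interval.
    """
    intervals = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # interval opens at the first above-threshold segment
        elif s < threshold and start is not None:
            intervals.append((start, i - 1))  # interval closes before this segment
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))  # close interval at video end
    return intervals

# Example: hypothetical scores for one activity class across 8 video segments.
scores = [0.1, 0.7, 0.8, 0.6, 0.2, 0.1, 0.9, 0.9]
print(localize(scores, 0.5))  # -> [(1, 3), (6, 7)]
```

In practice the interval boundaries in segment indices would be mapped back to timestamps using the segment duration, and a per-class threshold could be tuned on validation data.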
References
- H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of IEEE International Conference on Computer Vision (ICCV-13), Sydney, Australia, 2013, pp. 3551-3558.
- L. Wang, Y. Qiao, and X. Tang, "Video action detection with relational dynamic-poselets," in Proceedings of European Conference on Computer Vision (ECCV-14), Zurich, Switzerland, 2014, pp. 565-580.
- M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997. https://doi.org/10.1109/78.650093
- F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, "ActivityNet: a large-scale video benchmark for human activity understanding," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR-15), Boston, MA, 2015, pp. 961-970.
- S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013. https://doi.org/10.1109/TPAMI.2012.59
- K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the International Conference on Neural Information Processing Systems (NIPS-14), Montreal, Canada, 2014, pp. 568-576.
- J. Zheng, Z. Jiang, and R. Chellappa, "Cross-view action recognition via transferable dictionary learning," IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2542-2556, 2016.
- R. Wang and D. Tao, "UTS at ActivityNet 2016," in ActivityNet Large Scale Activity Recognition Challenge Workshop, Las Vegas, NV, 2016, pp. 1-6.
- Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. F. Chang, "CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 5734-5743.
- Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang, "A pursuit of temporal accuracy in general activity detection," 2017 [Online]. Available: https://arxiv.org/abs/1703.02716.
- L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: towards good practices for deep action recognition," in Proceedings of European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016, pp. 20-36.
- G. Singh and F. Cuzzolin, "Untrimmed video classification for activity detection: submission to ActivityNet challenge," 2016 [Online]. Available: https://arxiv.org/abs/1607.01979.
- S. Karaman, L. Seidenari, and A. D. Bimbo, "Fast saliency based pooling of fisher encoded dense trajectories," in Proceedings of European Conference on Computer Vision (ECCV) Workshop, Zurich, Switzerland, 2014, pp. 1-4.
- Z. Shou, D. Wang, and S. F. Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 1049-1058.
- L. Wang, Y. Qiao, and X. Tang, "Action recognition and detection by combining motion and appearance features," in Proceedings of European Conference on Computer Vision (ECCV) Workshop, Zurich, Switzerland, 2014, pp. 1-6.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4489-4497.
- V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, "DAPs: deep action proposals for action understanding," in Proceedings of European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016, pp. 768-784.
- A. Montes, A. Salvador, S. Pascual, and X. Giro-i-Nieto, "Temporal activity detection in untrimmed videos with recurrent neural networks," 2017 [Online]. Available: https://arxiv.org/abs/1608.08128.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.