Binary Hashing CNN Features for Action Recognition

  • Li, Weisheng (Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications)
  • Feng, Chen (Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications)
  • Xiao, Bin (College of Computer Science and Technology, Chongqing University of Posts and Telecommunications)
  • Chen, Yanquan (College of Computer Science and Technology, Chongqing University of Posts and Telecommunications)
  • Received : 2017.12.03
  • Accepted : 2018.04.23
  • Published : 2018.09.30

Abstract

This work addresses the problem of representing an entire video with Convolutional Neural Network (CNN) features for human action recognition. Because GPU memory is limited, it remains difficult to feed a whole video into a CNN for end-to-end learning. A typical workaround is to use sampled video frames as inputs and the corresponding video labels as supervision. One major issue with this popular approach is that the local samples may contain neither the information indicated by the global labels nor sufficient motion information. To address this issue, we propose a binary hashing method to enhance the local feature extractors. First, we extract the local features and aggregate them into global features using maximum/minimum pooling. Second, we use the binary hashing method to capture the motion features. Finally, we concatenate the hashing features with the global features, using different normalization methods, to train the classifier. Experimental results on the JHMDB and MPII-Cooking datasets show that, for these new local features, applying the binary hashing mapping to the sparsely sampled features leads to significant performance improvements.
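The three steps named in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the hashing function or the normalization, so the sign-of-temporal-difference hash, the L2 normalization, and all function names below are assumptions made for the example.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (assumed normalization, not from the paper)."""
    return v / (np.linalg.norm(v) + eps)


def max_min_pool(frame_feats):
    """Aggregate per-frame features of shape (T, D) into one global vector of shape (2D,)."""
    return np.concatenate([frame_feats.max(axis=0), frame_feats.min(axis=0)])


def binary_hash_motion(frame_feats):
    """Hypothetical binary hash: 1 where a feature increases between
    consecutive sampled frames, 0 otherwise, averaged over time -> shape (D,)."""
    diffs = np.diff(frame_feats, axis=0)      # (T-1, D) temporal differences
    bits = (diffs > 0).astype(np.float64)     # binarize the direction of change
    return bits.mean(axis=0)


def video_descriptor(frame_feats):
    """Concatenate the normalized global and hashing features -> shape (3D,)."""
    global_part = l2_normalize(max_min_pool(frame_feats))
    hash_part = l2_normalize(binary_hash_motion(frame_feats))
    return np.concatenate([global_part, hash_part])


# Stand-in for CNN features of 8 sparsely sampled frames, 16 dimensions each.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
desc = video_descriptor(feats)
```

The resulting `desc` vector would then be passed to a classifier of the reader's choice; normalizing the two parts separately keeps either one from dominating the concatenated descriptor.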
