Human Action Recognition Via Multi-modality Information

  • Gao, Zan ;
  • Song, Jian-Ming ;
  • Zhang, Hua ;
  • Liu, An-An ;
  • Xue, Yan-Bing ;
  • Xu, Guang-Ping
  • Received : 2013.08.23
  • Accepted : 2013.11.04
  • Published : 2014.03.01


In this paper, we propose pyramid appearance and global structure action descriptors computed on both RGB and depth motion history images (MHIs), together with a model-free method for human action recognition. In the proposed algorithm, we first construct motion history images for both the RGB and depth channels, using the depth information to filter the RGB information. Different action descriptors are then extracted from the depth and RGB MHIs to represent the actions. Finally, we propose a multi-modality information collaborative representation and recognition (MMCRR) model to classify human actions; in this model the multi-modality information enters the objective function naturally, so that information fusion and action recognition are performed together. To demonstrate the superiority of the proposed method, we evaluate it on the MSR Action3D and DHA datasets, two well-known datasets for human action recognition. Large-scale experiments show that our descriptors are robust, stable and efficient, and that they perform better than state-of-the-art algorithms; furthermore, the combined descriptors perform much better than any single descriptor. Moreover, the proposed model outperforms state-of-the-art methods on both the MSR Action3D and DHA datasets.
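The two generic building blocks named in the abstract, a motion history image and collaborative-representation classification, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`motion_history_image`, `crc_classify`), the frame-difference threshold `diff_threshold`, the decay length `tau`, the regularization weight `lam`, and the use of a boolean foreground mask to stand in for depth-based filtering are all assumptions made for illustration. The classifier is the plain single-feature regularized least-squares form of collaborative representation, not the multi-modality MMCRR objective.

```python
import numpy as np


def motion_history_image(frames, tau=None, diff_threshold=10.0, mask=None):
    """Temporal-template MHI: recent motion is bright, older motion fades linearly.

    frames: array of shape (T, H, W), grayscale RGB or depth frames.
    mask:   optional boolean array (T, H, W); motion outside it is ignored
            (assumed stand-in for filtering RGB motion with depth foreground).
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    tau = float(n - 1) if tau is None else float(tau)
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
    mhi = np.zeros(frames.shape[1:], dtype=np.float64)
    for t in range(1, n):
        # Pixels whose intensity changed noticeably since the previous frame.
        moving = np.abs(frames[t] - frames[t - 1]) > diff_threshold
        if mask is not None:
            moving = moving & mask[t]
        # Moving pixels are reset to tau; all others decay by one step.
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # scaled to [0, 1]


def crc_classify(train_feats, train_labels, query, lam=1e-3):
    """Collaborative representation with regularized least squares.

    The query is coded over all training samples at once, then assigned to
    the class whose own coefficients reconstruct it best.
    """
    D = np.asarray(train_feats, dtype=np.float64).T  # columns = training samples
    labels = np.asarray(train_labels)
    y = np.asarray(query, dtype=np.float64)
    # Ridge solution: alpha = (D^T D + lam I)^{-1} D^T y
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
    best_label, best_score = None, np.inf
    for c in np.unique(labels):
        idx = labels == c
        residual = np.linalg.norm(y - D[:, idx] @ alpha[idx])
        # Normalizing by the coefficient norm favors classes that carry the code.
        score = residual / (np.linalg.norm(alpha[idx]) + 1e-12)
        if score < best_score:
            best_label, best_score = c, score
    return best_label
```

In such a sketch, a descriptor would be computed from each MHI and passed to `crc_classify` as a feature vector; how the RGB and depth descriptors are fused inside the objective is specific to the paper's model and is not reproduced here.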


Action recognition;Multi-modality;Feature fusion;RGB;Depth;MMCRR;DMHI;RDMHI;RDMHI-AHB;RDMHI-Gist


