Video augmentation technique for human action recognition using genetic algorithm

  • Nida, Nudrat (Department of Computer Engineering, UET) ;
  • Yousaf, Muhammad Haroon (Department of Computer Engineering, UET) ;
  • Irtaza, Aun (Department of Computer Science, UET) ;
  • Velastin, Sergio A. (Applied Artificial Intelligence Research Group, Department of Computer Science and Engineering, University Carlos III de Madrid)
  • Received : 2019.11.08
  • Accepted : 2021.03.11
  • Published : 2022.04.10

Abstract

Classification models for human action recognition require robust features and large training sets for good generalization. When training sets are imbalanced, data augmentation is commonly employed to raise accuracy; however, samples produced by conventional augmentation merely replicate existing samples in the training set, so their feature representations lack diversity and lead to less precise classification. This paper presents new data augmentation and action representation approaches to grow training sets. The proposed approach rests on two fundamental concepts: virtual video generation for augmentation and representation of action videos through robust features. Virtual videos are generated from the motion history templates of action videos, which are passed through a convolutional neural network to extract deep features. Guided by the objective function of a genetic algorithm, the spatiotemporal features of different samples are then combined to generate representations of the virtual videos, which are classified with an extreme learning machine on the MuHAVi-Uncut, IXMAS, and IAVID-1 datasets.
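
The abstract outlines a three-stage pipeline: motion history templates summarize each action video, deep spatiotemporal features are recombined by a genetic algorithm into virtual-sample representations, and an extreme learning machine performs the final classification. The Python sketches below illustrate each stage; all function names, parameter values, and the fitness criterion are illustrative assumptions rather than the authors' implementation.

A minimal sketch of a motion history template built by frame differencing with linear decay (the decay window `tau` and the difference threshold are assumed values, not taken from the paper):

```python
import numpy as np

def motion_history_image(frames, tau=30, diff_threshold=25):
    """Build a motion history template from a grayscale frame sequence.

    Recently moving pixels stay bright while older motion decays linearly,
    so a single 2D image summarizes the temporal evolution of an action.
    """
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        curr = frame.astype(np.int16)
        moving = np.abs(curr - prev) > diff_threshold   # frame differencing
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        prev = curr
    return (mhi / tau * 255.0).astype(np.uint8)         # normalized template for a CNN
```

A sketch of how a genetic algorithm could recombine per-class deep feature vectors into virtual-sample representations; the fitness function (cosine similarity to the class centroid) and the crossover/mutation scheme are assumptions standing in for the paper's objective function:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_virtual_features(class_feats, n_virtual=50, n_generations=20, mutation_rate=0.05):
    """Evolve virtual feature vectors for one action class (rows of class_feats)."""
    centroid = class_feats.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-8

    def fitness(x):                      # assumed objective: stay close to the class centroid
        return float(x @ centroid / (np.linalg.norm(x) + 1e-8))

    population = class_feats.copy()
    for _ in range(n_generations):
        # crossover: a uniform mask mixes coordinates of two randomly chosen parents
        idx = rng.integers(0, len(population), size=(n_virtual, 2))
        mask = rng.random((n_virtual, population.shape[1])) < 0.5
        children = np.where(mask, population[idx[:, 0]], population[idx[:, 1]])
        # mutation: sparse Gaussian perturbation
        mutate = rng.random(children.shape) < mutation_rate
        children = children + mutate * rng.normal(0.0, 0.01, children.shape)
        # selection: keep the fittest individuals from parents plus offspring
        pool = np.vstack([population, children])
        scores = np.array([fitness(x) for x in pool])
        population = pool[np.argsort(scores)[-max(len(class_feats), n_virtual):]]
    return population[:n_virtual]
```

A minimal extreme learning machine classifier in its standard form: random hidden-layer weights and closed-form (pseudoinverse) output weights; the hidden-layer size is an illustrative choice:

```python
import numpy as np

class ExtremeLearningMachine:
    """Single-hidden-layer ELM: random input weights, least-squares output weights."""

    def __init__(self, n_hidden=1000, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):                                  # y: integer class labels
        n_classes = int(y.max()) + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                  # random hidden activations
        T = np.eye(n_classes)[y]                          # one-hot targets
        self.beta = np.linalg.pinv(H) @ T                 # closed-form output weights
        return self

    def predict(self, X):
        return np.argmax(np.tanh(X @ self.W + self.b) @ self.beta, axis=1)
```

In such a pipeline, real and virtual feature vectors would be pooled per class and passed to `ExtremeLearningMachine.fit` before evaluating on held-out videos.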

Keywords
