BoF based Action Recognition using Spatio-Temporal 2D Descriptor


  • KIM, JinOk (Department of Media Communication Design, Daegu Haany University)
  • Received : 2015.01.19
  • Accepted : 2015.05.18
  • Published : 2015.06.30

Abstract

Since spatio-temporal local features for video representation have become an important issue for model-free bottom-up approaches to action recognition, many papers have proposed various methods for feature extraction and description. In particular, BoF (bag of features) has shown promising and coherent recognition results. The most important question for BoF is how to represent the dynamic information of actions in videos. Most existing BoF methods treat the video as a spatio-temporal volume and describe the neighborhoods of 3D interest points as complex volumetric patches. To simplify these complex 3D methods, this paper proposes a novel method that builds a BoF representation by learning 2D interest points directly from video data. The basic idea of the proposed method is to gather feature points not only from the 2D xy planes of traditional frames, but also from 2D planes along the time axis, called spatio-temporal frames. Such spatio-temporal features capture the dynamic information of action videos and are well suited to recognizing human actions without any need for 3D extensions of the feature descriptors. The spatio-temporal BoF approach using SIFT and SURF feature descriptors obtains good recognition rates on a well-known action recognition dataset. Compared with the more sophisticated scheme of 3D-based HoG/HoF descriptors, the proposed method is easier to compute and simpler to understand.

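The core idea of the abstract, harvesting 2D interest points both from ordinary xy frames and from 2D planes cut along the time axis, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses a toy gradient-orientation histogram as a stand-in for SIFT/SURF descriptors and a crude subsampled codebook in place of a learned k-means vocabulary; all function names, shapes, and parameters are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical video volume: T frames of H x W grayscale pixels.
T, H, W = 16, 32, 32
video = rng.random((T, H, W))

def xy_slices(vol):
    """Traditional spatial frames: one H x W plane per time step."""
    return [vol[t] for t in range(vol.shape[0])]

def spatio_temporal_slices(vol, step=8):
    """2D planes cut along the time axis: t-x planes (fixed y) and
    t-y planes (fixed x), sampled every `step` rows/columns."""
    tx = [vol[:, y, :] for y in range(0, vol.shape[1], step)]
    ty = [vol[:, :, x] for x in range(0, vol.shape[2], step)]
    return tx + ty

def toy_descriptor(plane, bins=8):
    """Stand-in for SIFT/SURF: normalized gradient-orientation histogram."""
    gy, gx = np.gradient(plane)
    angles = np.arctan2(gy, gx)
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def bof_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest codeword and count occurrences."""
    d = np.stack(descriptors)                                  # (N, bins)
    dists = ((d[:, None, :] - codebook[None]) ** 2).sum(-1)    # (N, K)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))

# Gather descriptors from both spatial and spatio-temporal planes,
# then build a single BoF histogram for the whole clip.
planes = xy_slices(video) + spatio_temporal_slices(video)
descs = [toy_descriptor(p) for p in planes]
codebook = np.stack(descs[::4])   # crude codebook: every 4th descriptor
hist = bof_histogram(descs, codebook)
```

In a real pipeline the codebook would be learned with k-means over descriptors from training videos, and the per-clip histograms would feed an SVM classifier, as is standard for BoF action recognition.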



Cited by

  1. Hand-Mouse Interface Using Virtual Monitor Concept for Natural Interaction vol.5, 2017, https://doi.org/10.1109/ACCESS.2017.2768405
  2. NUI/NUX of Virtual Monitor Concept Using the User's Physical Features and an EEG Concentration Index vol.16, pp.6, 2015, https://doi.org/10.7472/jksii.2015.16.6.11