Egocentric Vision for Human Activity Recognition Using Deep Learning

  • Malika Douache (Automation, Vision and Intelligent Systems Control Laboratory, University of Sciences and Technology of Oran Mohamed-Boudiaf (USTOMB));
  • Badra Nawal Benmoussat (Automation, Vision and Intelligent Systems Control Laboratory, University of Sciences and Technology of Oran Mohamed-Boudiaf (USTOMB))
  • Received : 2022.05.27
  • Accepted : 2022.09.09
  • Published : 2023.12.31

Abstract

This paper addresses the recognition of human activities from egocentric vision, in particular video captured by body-worn cameras, which can support video surveillance, automatic search, and video indexing. It can also help assist elderly and frail persons and markedly improve their daily lives, for example when recognition is performed by an external device, such as a robot, acting as a personal assistant: the inferred information is used both online to assist the person and offline to support the assistant. The task of human activity recognition nevertheless remains difficult because of the large variations in the way actions are executed. The main purpose of this paper is therefore a simple and efficient recognition method that uses egocentric camera data only, is based on a convolutional neural network and deep learning, and is robust to these factors of variability in action execution. In terms of accuracy, simulation results outperform the current state of the art by a significant margin: 61% when using egocentric camera data only, more than 44% when using the egocentric camera together with several stationary cameras, and more than 12% when using both inertial measurement unit (IMU) and egocentric camera data.
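
To make the approach concrete, below is a minimal, hypothetical sketch of a frame-level convolutional classifier for egocentric video, written in Python with PyTorch. It assumes frames have already been extracted from the egocentric recordings and labeled with activity classes; the layer sizes, optimizer settings, input resolution, and number of classes are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch only: a small frame-level CNN activity classifier.
# Architecture, optimizer settings, and class count are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 5  # assumed number of activity classes

class FrameCNN(nn.Module):
    """Classifies a single egocentric RGB frame into an activity class."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        # Stacked convolution -> ReLU -> pooling blocks extract visual features.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of RGB frames with shape (N, 3, H, W)
        feats = self.features(x).flatten(1)   # -> (N, 64)
        return self.classifier(feats)         # -> (N, num_classes) class scores

model = FrameCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 frames.
frames = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(frames), labels)
loss.backward()
optimizer.step()
```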

Acknowledgement

The data used in this paper was obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the National Science Foundation (Grant No. EEEC-0540865).
