Fig. 1. The connection structure of the composite function in a dense block
Fig. 2. The overall architecture of DenseNet with four blocks
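To make the dense connectivity of Figs. 1 and 2 concrete, here is a minimal PyTorch sketch of a 3D dense block, assuming the standard DenseNet composite function (BN, ReLU, 3x3x3 convolution) with each layer's output concatenated to all preceding feature maps; the class names and the growth_rate value are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One composite function H_l: BN -> ReLU -> 3x3x3 conv (Fig. 1)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv(self.relu(self.bn(x)))
        # Dense connectivity: concatenate new features with all earlier ones.
        return torch.cat([x, out], dim=1)

class DenseBlock3D(nn.Module):
    """A block of densely connected layers, stacked four times in Fig. 2."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_layers):
            layers.append(DenseLayer3D(channels, growth_rate))
            channels += growth_rate
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```

For example, DenseBlock3D(num_layers=4, in_channels=64, growth_rate=32) maps an input of shape (N, 64, T, H, W) to (N, 192, T, H, W), since each layer appends 32 feature maps to the running concatenation.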
Fig. 3. The overall architecture of the attention-based 3D dense convolutional network (ADD-Net). We build on a 3D dense convolutional network and modify the original network by adding our attention model, which uses an efficient attention mechanism combining channel and spatial attention
Fig. 4. Channel attention module
Fig. 5. Spatial attention module
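Assuming the modules in Figs. 4 and 5 follow the common channel-plus-spatial attention recipe (channel attention in the spirit of the squeeze-and-excitation networks cited below, combined with a convolutional spatial map), a minimal 3D sketch could look like the following; the reduction ratio and kernel size are illustrative assumptions, not the paper's hyper-parameters.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Fig. 4: squeeze the spatio-temporal dimensions, then gate each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        # Shared bottleneck MLP, implemented with 1x1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        attn = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * attn  # broadcast over (T, H, W)

class SpatialAttention3D(nn.Module):
    """Fig. 5: pool across channels, then predict a per-location attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)    # channel-wise average
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # channel-wise max
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn  # broadcast over channels
```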
Fig. 6. Examples of attention (best viewed in color). Each column shows a frame from a UCF101 action video: the top row is the original image, and the bottom row shows the spatial attention as a heatmap (blue bounding boxes mark the ground truth; red ones are predictions from our learned spatial attention). (a) walking with dog; (b) biking; (c) long jump; (d) skateboarding; (e) rope climbing.
Table 1. Comparison of ADD-Net (ours) with other 3D ConvNets on the UCF101 dataset (split 1)
Table 2. Accuracy (%) of our method compared with other methods over all three splits of UCF101 and HMDB51
References
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre. HMDB51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering, 2013.
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012.
- H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
- C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
- J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
- S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. In ICLR Workshop, 2016.
- A. Diba, A. M. Pazandeh, and L. Van Gool. Efficient two-stream motion and appearance 3D CNNs for video classification. In ECCV Workshops, 2016.
- A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In CVPR, 2017.
- C. Feichtenhofer, A. Pinz, and R. Wildes. Spatio-temporal residual networks for video action recognition. In NIPS, 2016.
- L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
- J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
- D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. ConvNet architecture search for spatio-temporal feature learning. arXiv:1708.05038, 2017.
- L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
- N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- H. Yang, C. Yuan, B. Li, Y. Du, J. Xing, W. Hu, and S. J. Maybank. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2012.
- D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- M. Zinkevich, M. Weimer, L. Li, and A. Smola. Parallelized stochastic gradient descent. In NIPS, 2010.
- I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
- J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma. Structured attentions for visual question answering. In ICCV, 2017.
- W. Zou, D. Jiang, S. Zhao, and X. Li. A comparable study of modeling units for end-to-end Mandarin speech recognition. arXiv:1805.03832, 2018.
- S. Kim, T. Hori, and S. Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In ICASSP, 2017.
- A. M. Rush and J. Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
- R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. Text understanding with the attention sum reader network. arXiv:1603.01547, 2016.
- J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv:1709.01507, 2017.
- F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv:1704.06904, 2017.