Fig. 1. The connection structure of the composite function in a dense block
Fig. 2. The overall architecture of DenseNet with four blocks
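To make the dense connectivity of Figs. 1 and 2 concrete, here is a minimal PyTorch sketch of a 3D dense block, assuming the standard DenseNet composite function (BN, ReLU, 3x3x3 convolution) with each layer's output concatenated to all preceding feature maps; the class names and the growth_rate value are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One composite function H_l: BN -> ReLU -> 3x3x3 conv (Fig. 1)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv(self.relu(self.bn(x)))
        # Dense connectivity: concatenate new features with all earlier ones.
        return torch.cat([x, out], dim=1)

class DenseBlock3D(nn.Module):
    """A block of densely connected layers, stacked four times in Fig. 2."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_layers):
            layers.append(DenseLayer3D(channels, growth_rate))
            channels += growth_rate
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```

For example, DenseBlock3D(num_layers=4, in_channels=64, growth_rate=32) maps an input of shape (N, 64, T, H, W) to (N, 192, T, H, W), since each layer appends 32 feature maps to the running concatenation.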
Fig. 3. The overall architecture of the attention-based 3D dense convolutional network (ADD-Net). We build on a 3D dense convolutional network and modify the original network by adding our attention model, which uses an efficient attention mechanism combining channel and spatial attention
Fig. 4. Channel attention module
Fig. 5. Spatial attention module
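Assuming the modules in Figs. 4 and 5 follow the common channel-plus-spatial attention recipe (channel attention in the spirit of the squeeze-and-excitation networks cited below, combined with a convolutional spatial map), a minimal 3D sketch could look like the following; the reduction ratio and kernel size are illustrative assumptions, not the paper's hyper-parameters.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Fig. 4: squeeze the spatio-temporal dimensions, then gate each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        # Shared bottleneck MLP, implemented with 1x1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        attn = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * attn  # broadcast over (T, H, W)

class SpatialAttention3D(nn.Module):
    """Fig. 5: pool across channels, then predict a per-location attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)    # channel-wise average
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # channel-wise max
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn  # broadcast over channels
```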
Fig. 6. Examples of attention (best viewed in color). Each column shows a frame from a UCF101 action video: the top row is the original image, and the bottom row shows the spatial attention as a heatmap (blue bounding boxes mark the ground truth; red ones are predictions from our learned spatial attention). (a) walking with dog; (b) biking; (c) long jump; (d) skateboarding; (e) rope climbing.
Table 1. Comparison of ADD-Net (ours) with other 3D ConvNets on the UCF101 dataset (split 1)
Table 2. Accuracy (%) of our method compared with other methods over all three splits of UCF101 and HMDB51
References
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre. HMDB51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering, 2013.
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012.
- H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
- C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
- J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
- S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. In ICLR Workshop, 2016.
- A. Diba, A. M. Pazandeh, and L. Van Gool. Efficient two-stream motion and appearance 3D CNNs for video classification. In ECCV Workshops, 2016.
- A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In CVPR, 2017.
- C. Feichtenhofer, A. Pinz, and R. Wildes. Spatio-temporal residual networks for video action recognition. In NIPS, 2016.
- L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
- J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
- D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. ConvNet architecture search for spatio-temporal feature learning. arXiv:1708.05038, 2017.
- L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
- N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- H. Yang, C. Yuan, B. Li, Y. Du, J. Xing, W. Hu, and S. J. Maybank. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2012.
- D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- M. Zinkevich, M. Weimer, L. Li, and A. Smola. Parallelized stochastic gradient descent. In NIPS, 2010.
- I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
- J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma. Structured attentions for visual question answering. In ICCV, 2017.
- W. Zou, D. Jiang, S. Zhao, and X. Li. A comparable study of modeling units for end-to-end Mandarin speech recognition. arXiv:1805.03832, 2018.
- S. Kim, T. Hori, and S. Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In ICASSP, 2017.
- A. M. Rush and J. Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
- R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. Text understanding with the attention sum reader network. arXiv:1603.01547, 2016.
- J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv:1709.01507, 2017.
- F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv:1704.06904, 2017.