
Human Gait Recognition Based on Spatio-Temporal Deep Convolutional Neural Network for Identification

  • Zhang, Ning (Dept. of Information and Communication Engineering, Tongmyong University) ;
  • Park, Jin-ho (Dept. of Information and Communication Engineering, Tongmyong University) ;
  • Lee, Eung-Joo (Dept. of Information and Communication Engineering, Tongmyong University)
  • Received : 2020.07.08
  • Accepted : 2020.07.21
  • Published : 2020.08.31

Abstract

Gait recognition can identify a person at a long distance, which is very important for improving the intelligence of monitoring systems. Among human biometric features, gait has the advantages of being remotely collectable, robust, and secure. Traditional gait feature extraction, shaped by earlier work on behavior recognition, relies on hand-crafted features and cannot meet the needs of fine-grained gait recognition. The emergence of deep convolutional neural networks has freed researchers from complex feature-design engineering: discriminative features can be learned automatically from data, and such networks are now widely used. In this paper, we conduct feature metric learning in three-dimensional space by combining the three-dimensional convolutional features of the gait sequence with a Siamese structure. This method captures information in both the spatial and temporal dimensions of a continuous, periodic gait sequence, further improving the accuracy and practicality of gait recognition.


1. INTRODUCTION

In today's society, surveillance devices [1] based on Internet of Things technology are becoming increasingly popular, video surveillance applications are becoming more extensive, and surveillance video data is showing explosive growth. According to a 2014 report by the International Data Corporation (IDC) [2], the global volume of surveillance video data will reach 5.8 ZB by 2020, accounting for 44% of global data volume, which has been called "the largest big data". Numerous cameras and huge monitoring networks can generate massive video data in an instant. How to efficiently extract useful information from this massive data, and then interpret it according to video content and characteristics, has become an urgent problem for intelligent video surveillance technology.

Video surveillance must focus on the targets in the monitored scene, and pedestrians are the objects of greatest interest in surveillance video analysis [3-6]. As shown in Fig. 1, in most monitoring scenarios, pedestrians are the most common and central monitoring objects; being "people-centered" is one of the defining features of modern intelligent monitoring systems. Pedestrian identification [7], as the core of surveillance video analysis, is widely used in building security, smart prisons, management of public places, and other areas. Traditional target recognition technology often recognizes objects by their appearance, color, texture, and shape. For pedestrian recognition, however, the non-rigid nature of the human body and the drastic changes in appearance across environments make it difficult to achieve high recognition accuracy from appearance features alone, whereas gait, an emerging biometric feature, is very effective.


Fig. 1. Schematic diagram of video surveillance scenarios.

Gait [8], as a characteristic of human activity, describes the posture of human walking, including the regular movement trends and changes at the joints of the upper and lower limbs. Gait recognition based on video analysis records, observes, and analyzes the pedestrian's body movement in an image sequence, builds a gait model, and extracts stable parameter features; through computer-based recognition, the pedestrian's identity and attributes (gender, age, race, etc.) can finally be obtained. Compared with other biometrics (face, fingerprint, palm print, iris, voiceprint, etc.), gait is a biometric with great potential. In pedestrian recognition for intelligent video surveillance, the advantages of gait are mainly manifested in three aspects: 1) Remote availability. Gait data does not require special collection equipment: surveillance personnel only need ordinary surveillance cameras to obtain the gait of a specific target from a distance, in a non-contact and covert manner. Biometrics such as the iris, fingerprints, and palm prints require the user's active cooperation to collect, so remote availability is very important in intelligent video surveillance. 2) Robustness. Gait features retain good recognition performance on low-quality surveillance footage, which allows gait recognition to maintain high accuracy in both indoor and outdoor scenes; in contrast, accurate face recognition and voiceprint recognition require higher-quality data sources. 3) Security. Gait is difficult to disguise, imitate, or hide: a pedestrian who deliberately changes their gait in public will often appear more suspicious and attract attention. In summary, gait recognition has become an important research direction in computer vision and pattern recognition, with great research value and market demand.

This article selects pedestrian gait analysis as its research direction, conducts in-depth research on hot and still-unsolved problems, and proposes an efficient and accurate gait recognition algorithm. It aims to address the following challenges: how to select discriminative gait features through deep learning techniques; how to apply deep-learning classification models to the gait recognition problem; and how to use the periodic pattern of gait to characterize a gait sequence. As shown in Fig. 2, this article studies these key issues of gait recognition technology in depth, hoping to improve the accuracy of gait recognition methods, advance the theoretical research on gait recognition, and ultimately provide technical support for high-precision gait recognition applications.


Fig. 2. The research framework.

2. HUMAN GAIT RECOGNITION BASED ON SPATIO-TEMPORAL DEEP CNN

The gait recognition problem is essentially a process of feature extraction and pattern classification over a continuous pedestrian walking sequence. According to the literature, traditional gait recognition methods usually synthesize a periodic gait sequence into a single gait feature map and then perform attribute classification or identity recognition based on the visual characteristics of that map. However, computing the feature map is often accompanied by a loss of gait timing information, and inaccurate input information may lead to inaccurate output, so recognition accuracy cannot be effectively improved. Gait, as a movement pattern, has two dimensions, space and time; periodicity is the most significant difference between gait recognition and traditional face recognition or motion recognition, and how to make full use of this temporal structure has become a bottleneck for improving the accuracy of gait recognition. The three-dimensional convolutional neural network (3D-CNN) [9] was originally proposed for action recognition: it extracts features in both the spatial and temporal dimensions, and 3D convolution (C3D) [10] captures the motion information contained in multiple consecutive frames. Applying the C3D model to gait sequences, combined with a parallel neural network structure, can obtain more discriminative gait features in both time and space. Therefore, this chapter proposes to enhance the model by computing three-dimensional convolutional features of gait motion. The model can capture temporal and spatial information from continuous periodic sequences, further improving the accuracy and practicality of gait recognition. In addition, this paper proposes a 3D-Siamese network that combines a C3D network with a Siamese structure; this spatio-temporal joint deep neural network can perform feature metric learning in three-dimensional space.

Fig. 3 shows the framework of gait recognition based on the 3D-Siamese network. This paper first extracts the gait contour sequence of each pedestrian sample in the training videos of the gait database, then trains the entire network by constructing a library of positive and negative sample pairs as required by the Siamese network. Weights are shared between corresponding layers of the two branches, and the loss function is the contrastive loss. In the testing stage, each sample sequence to be identified is input to either branch of the Siamese network to extract the spatio-temporal depth features of the sequence. Finally, the K-nearest neighbor method is used to compute the similarity between the test sequence and each matching sequence one by one; the matching sequence with the highest similarity is the pedestrian sample with the same identity.


Fig. 3. Schematic diagram of model training based on 3D-Siamese network.

2.1 Structure of 3D CNN

2.1.1 Video Processing of 3D CNN

In recent years, CNNs have achieved good results on image tasks, and researchers have begun to apply them to video processing tasks. Karpathy et al. [11] experimented with different convolutional neural network designs for video classification, exploring how a CNN can exploit the temporal information in video. A simple way to apply a CNN to a video sequence is to classify each frame independently, but this ignores the information between consecutive frames. This problem is particularly prominent in gait recognition, because periodicity is the biggest feature distinguishing gait from general motion recognition. Simonyan and Zisserman [12] used two independent convolutional neural networks to process the spatial and temporal dimensions separately and then fused the results of the two models: the spatial network is an ordinary CNN, while the temporal network takes the stacked optical flow of several consecutive frames as input. Most importantly, this work introduced a multi-task learning mechanism for the first time to overcome insufficient training data: the last CNN layer is connected to multiple Softmax layers, each corresponding to a different data set, so multi-task learning can be performed across multiple data sets.

The three-dimensional convolutional neural network was originally used for human behavior analysis. Ji et al. [9] used a three-dimensional convolutional neural network for motion recognition; their method first requires human detection and tracking to segment the human objects, then performs motion modeling on the segmented regions. It can perform convolution in both the temporal and spatial dimensions, thereby capturing the motion information of the video stream. The C3D method proposed by Du Tran et al. [10] performs the three-dimensional convolution directly on the entire video without target segmentation and is therefore more universal; they trained a powerful deep model on multiple available video data sets under supervision, and it adapts to a variety of video analysis tasks. Compared with two-dimensional convolution, whether applied to a single image or to multi-frame video, C3D performs spatio-temporal operations in both the convolutional and pooling layers. As shown in Fig. 5, a two-dimensional convolution of a single image yields a single feature image, and a two-dimensional convolution over multiple frames still yields a single feature image; only a three-dimensional convolution can act directly on the frame cube and produce a spatio-temporal feature representation. To effectively use the gait motion information and further improve recognition accuracy, this chapter adopts the three-dimensional convolution approach: the video frame cube is fed directly into the convolutional and pooling layers of the CNN to capture discriminative features in both the temporal and spatial dimensions.


Fig. 5. Schematic diagram of two-dimensional convolution and three-dimensional convolution feature extraction process. (a) 2D-CNN; (b) 3D-CNN; (c) C3D.

2.1.2 Feature Extraction of 3D CNN

In a traditional two-dimensional convolutional neural network, each convolution operation applied at a pixel computes a weighted sum over that pixel's neighborhood in the feature map, adds a bias, and passes the result through an activation function before it is transmitted onward through the network. Specifically, the feature value v(x, y) at position (x, y) of a feature map in the two-dimensional network can be obtained by the following function:

\(v(x, y)=f\left(b+\sum_{w=0}^{W-1} \sum_{h=0}^{H-1} C_{w, h} \cdot v(x+w, y+h)\right)\)       (1)

where f(∙) is the activation function (the sigmoid and tanh functions are commonly used), W and H are the width and height of the two-dimensional convolution kernel, b is the bias term of this layer, and C denotes the convolution kernel weights of this layer, indexed by (w, h) in Eq. (1). In the subsequent pooling layer, the feature map is reduced by downsampling. The entire CNN framework extracts features by repeatedly stacking this "convolutional layer + pooling layer" structure.
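As a concrete illustration, the following minimal NumPy sketch computes one output feature map exactly as in Eq. (1); the helper name, the tanh choice, and the valid-convolution boundary handling are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def conv2d_feature(prev_map, kernel, bias):
    """One output feature map per Eq. (1): a weighted sum over the
    W x H neighborhood, plus a bias, passed through tanh."""
    W, H = kernel.shape
    out = np.empty((prev_map.shape[0] - W + 1, prev_map.shape[1] - H + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.tanh(bias + np.sum(kernel * prev_map[x:x+W, y:y+H]))
    return out
```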

Table 1. Model parameter table


Similarly, this paper uses a three-dimensional convolution kernel to operate on the features of the convolutional layer. In this structure, each feature map in the convolutional layer is connected to multiple adjacent consecutive frames of the previous layer, so motion information can be captured. Note that a single three-dimensional convolution kernel can extract only one type of feature from the frame cube, because the kernel weights C are shared across the entire cube; therefore, multiple convolution kernels are used to extract multiple features. A general design rule for convolutional neural networks is that the number of feature maps should increase in later layers (those closer to the output), so that more types of features can be generated from combinations of low-level feature maps. Formally, the feature value v(x, y, z) at position (x, y, z) of a feature map can be calculated by the following function, where W, H, and D are the width, height, and temporal depth of the convolution kernel:

\(v(x, y, z)=f\left(b+\sum_{d=0}^{D-1} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} C_{w, h, d} \cdot v(x+w, y+h, z+d)\right)\)       (2)
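The extension to Eq. (2) only adds the temporal depth loop; again a hypothetical helper in the same NumPy style as the sketch above:

```python
def conv3d_feature(prev_cube, kernel, bias):
    """Eq. (2): the kernel additionally spans D consecutive frames,
    so the weighted sum runs over a W x H x D cube."""
    W, H, D = kernel.shape
    out = np.empty((prev_cube.shape[0] - W + 1,
                    prev_cube.shape[1] - H + 1,
                    prev_cube.shape[2] - D + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                patch = prev_cube[x:x+W, y:y+H, z:z+D]
                out[x, y, z] = np.tanh(bias + np.sum(kernel * patch))
    return out
```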

As shown in Fig. 6, the configuration of the three-dimensional convolutional neural network contains two parts: 1) 8 convolutional layers (C1-C8) and 5 max pooling layers (P1-P5); 2) 2 fully connected layers (FC1, FC2) and a feature output layer (Feat). The specific parameters are listed in Table 1, where the size of a convolution kernel is d × k × k (d is the temporal depth of the kernel and k is the spatial kernel size), and the scan step refers to the number of pixels the convolution kernel moves across the image at each step.


Fig. 6. Three-dimensional convolutional network structure diagram.
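Since Table 1 itself is not reproduced here, the following PyTorch sketch shows one plausible realization of the 8-conv / 5-pool / 2-FC layout of Fig. 6. The channel widths, 3×3×3 kernels, and the 16-frame 112×112 input follow the original C3D paper [10] and are assumptions rather than the exact values of Table 1:

```python
import torch
import torch.nn as nn

class C3DFeatures(nn.Module):
    """C1..C8 with ReLU, P1..P5, then FC1 and FC2 as the Feat output.
    Assumed input: (N, 3, 16, 112, 112) frame cubes."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        def block(cin, cout, n_convs, pool):
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv3d(cin if i == 0 else cout, cout,
                                     kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
            return layers + [pool]
        self.conv = nn.Sequential(
            *block(3, 64, 1, nn.MaxPool3d((1, 2, 2))),               # C1, P1 (no temporal pooling yet)
            *block(64, 128, 1, nn.MaxPool3d(2)),                     # C2, P2
            *block(128, 256, 2, nn.MaxPool3d(2)),                    # C3-C4, P3
            *block(256, 512, 2, nn.MaxPool3d(2)),                    # C5-C6, P4
            *block(512, 512, 2, nn.MaxPool3d(2, padding=(0, 1, 1))), # C7-C8, P5
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, feat_dim), nn.ReLU(inplace=True),  # FC1
            nn.Linear(feat_dim, feat_dim),                                 # FC2 (Feat)
        )

    def forward(self, clip):
        return self.fc(self.conv(clip))
```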

In this chapter, this paper uses the deep model that Du Tran et al. trained on Sport1M [10] as a starting point and fine-tunes its parameters on the gait database OU-ISIR LP; the implementation again uses the Caffe framework [13]. For the pedestrian identification task, the categories of the output layer are replaced with the pedestrian identities in the data set. Before training, the model is initialized with the neural network parameters trained on Sport1M [10]. During training, the parameters of the lower convolutional layers are fixed and the parameters of the fully connected layers are adjusted, so that only the upper layers of the network are updated at first; the whole CNN model is then trained with stochastic gradient descent, back-propagating the loss to all other network layers and fine-tuning the entire model until it converges.
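The paper's implementation uses Caffe; purely as an illustration of the schedule described above (freeze the lower convolutional layers, adapt the upper layers, then fine-tune everything with SGD), a PyTorch-style sketch might look like this, where the checkpoint filename is hypothetical:

```python
import torch

model = C3DFeatures()  # sketch above
# hypothetical checkpoint converted from the Sport1M-pretrained C3D weights
model.load_state_dict(torch.load("c3d_sport1m.pth"), strict=False)

# Stage 1: freeze the lower convolutional layers, update only the FC layers.
for p in model.conv.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)

# Stage 2: unfreeze and fine-tune the whole network until convergence.
for p in model.conv.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```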

2.2 Structure of 3D Siamese CNN

In the gait database, this paper first extracts the motion contour of each pedestrian's gait video sequence using a motion image segmentation method, then normalizes it to a unified scale by scaling, which is convenient for the subsequent neural network training. Specifically, for the OU-ISIR LP gait video database of 3895 people, every gait video sequence has been annotated. The database contains gait video sequences of the same pedestrian from different viewing angles, namely 55, 65, 75, and 85 degrees, and all of these sequences are labeled with the pedestrian's identity. The Graph Cut image segmentation method is used to extract a foreground contour map for each video sequence; after normalization, all contour maps are 128 × 171 pixels.
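The contour-extraction step can be sketched as follows. Note that this stand-in uses OpenCV background subtraction rather than the Graph Cut segmentation actually used in the paper, and the target size follows the 128 × 171 normalization above:

```python
import cv2

def extract_contour_sequence(video_path, size_wh=(171, 128)):
    """Extract a normalized foreground silhouette per frame.
    Background subtraction is a simplified stand-in for Graph Cut."""
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    cap, contours = cv2.VideoCapture(video_path), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
        contours.append(cv2.resize(mask, size_wh))  # 171 wide x 128 high
    cap.release()
    return contours
```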

2.2.1 Feature Training of 3D Siamese CNN

After obtaining the three-dimensional convolutional features, this paper selects positive and negative samples from the gait database to construct the neural network training set, following the method in the previous chapter: two gait sequences of the same identity form a positive sample pair with label 1, and two gait sequences of different identities form a negative sample pair with label 0. Specifically, in the experiment, the training set is derived from the gait contour sequence maps of the Gallery set in OU-ISIR LP. Since the number of pedestrian identities is much larger than the number of gait video sequences per identity, the number of positive pairs is far smaller than the number of negative pairs. Therefore, to keep the samples balanced during training, this paper takes all positive pairs and randomly selects an equal number of negative pairs as the final training set.
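A minimal sketch of this balanced pair construction (the function and variable names are illustrative):

```python
import random
from itertools import combinations

def build_training_pairs(sequences, seed=0):
    """sequences: list of (person_id, contour_clip).
    Keep every positive pair (same id, label 1) and sample an equal
    number of negative pairs (different ids, label 0)."""
    pos, neg = [], []
    for (id_a, a), (id_b, b) in combinations(sequences, 2):
        (pos if id_a == id_b else neg).append((a, b, int(id_a == id_b)))
    random.seed(seed)
    return pos + random.sample(neg, len(pos))
```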

Then, the training data pairs are input simultaneously into the parallel deep neural network. After the data is clipped and stacked, the two feature streams enter the two branches of the network; the parameter weights of the two branches are shared, and each branch is composed of the same three fully connected layers.

Finally, the feature output layers of the two three-dimensional convolutional neural networks converge at the contrastive loss layer, where the contrastive loss value of the objective function is computed by the contrastive loss function. This paper uses the back-propagation algorithm to train the parameters of the overall model until the spatio-temporal joint deep neural network converges on all training data.
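The contrastive loss itself has the standard form below; a PyTorch sketch with an assumed margin of 1.0 (the paper does not state the margin value):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, label, margin=1.0):
    """Pull positive pairs (label 1) together; push negative pairs
    (label 0) apart until their distance exceeds the margin."""
    d = F.pairwise_distance(feat_a, feat_b)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()
```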

2.2.2 Feature Recognition of 3D Siamese CNN

In the test recognition process, the model trained in the previous section is first used to extract features from the test gait contour sequence map and from all gait contour sequence maps in the matching library. Then the K-nearest neighbor method is used to compute the feature similarity between the test sequence and each matching sequence one by one, and the matching sequence with the highest similarity is output, thereby identifying the pedestrian with the same identity.

First, the gait contour sequence diagram is extracted from the test gait video and normalized to the same size.

Then, the obtained gait contour sequence diagram is input to either branch of the parallel convolutional neural network, and its features are extracted by forward propagation. Specifically, it is input into the three-dimensional convolutional neural network, whose parameters are initialized with the model parameters trained above. In practice, this paper extracts the output of the FC2 layer as the feature of the gait contour sequence diagram; the feature dimension is 4096, so each gait video sequence can be represented by a 4096-dimensional feature vector.
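In code, extracting the 4096-dimensional FC2 feature from one branch amounts to a single forward pass; a sketch using the C3DFeatures class above, whose final layer is already FC2:

```python
import torch

@torch.no_grad()
def extract_sequence_feature(model, clip):
    """clip: one normalized contour cube, assumed shape (1, 3, 16, 112, 112).
    The model's output layer is FC2, so the result is the 4096-dim feature."""
    model.eval()
    return model(clip).squeeze(0)  # -> tensor of shape (4096,)
```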

Finally, the K-nearest neighbor method is used to calculate the similarity between the test sequence and the matched sequence, and the result with the highest similarity is output, thereby identifying pedestrians with the same identity.
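This nearest-neighbor matching step can be sketched as follows, ranking gallery sequences by Euclidean distance in the 4096-dimensional feature space (names are illustrative; torch features can be converted with .cpu().numpy()):

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats, gallery_ids):
    """Return gallery identities sorted from most to least similar;
    the first entry is the Rank-1 prediction."""
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    return [gallery_ids[i] for i in np.argsort(dists)]
```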

3. EXPERIMENT RESULTS

3.1 Experiment Results of Siamese Neural Network for HIR

This paper conducted an identification experiment on the OU-ISIR LP database. As mentioned above, only the Gallery set was used for training, so no Probe data was used in training. The training set contains each pedestrian sample's gait sequences at four observation angles (55°, 65°, 75°, 85°) plus the global observation angle All. This paper evaluated the effectiveness of the proposed algorithm under both same-view and cross-view settings, using the Rank-1 and Rank-5 correct recognition rates as evaluation indicators.
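For reference, the Rank-k correct recognition rate counts a probe as correct if its true identity appears among its top-k matches; a small sketch building on rank_gallery above:

```python
def rank_k_accuracy(ranked_lists, true_ids, k):
    """ranked_lists[i]: identities returned by rank_gallery for probe i."""
    hits = sum(true_id in ranked[:k]
               for ranked, true_id in zip(ranked_lists, true_ids))
    return hits / len(true_ids)

# rank1 = rank_k_accuracy(ranked_lists, true_ids, 1)
# rank5 = rank_k_accuracy(ranked_lists, true_ids, 5)
```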

3.1.1 Identification from the Same View

First, this paper compared the accuracy of the proposed gait recognition method under the same viewing angle. Tables 2 and 3 compare our method with the current best methods: the model-matching methods GEI [14] and FDF [14], HWLD [15], and a traditional CNN-based method. HWLD [15] reported accuracy only on the 85° and All sets. As the tables show, our proposed SiaNet.FC method achieves the highest recognition accuracy on almost all sets. Compared with hand-crafted-feature methods such as GEI, FDF, and HWLD, the deep learning method automatically extracts distinctive and richly expressive gait features from the GEI; even in semantically ambiguous local regions such as the head and hips, the convolutional features remain highly discriminative. Compared with the traditional CNN method, the Siamese-structure method proposed in this paper uses distance metric learning to bridge the domain gap between classification and recognition, so it achieves higher gait recognition accuracy.

Table 2. Rank-1 recognition accuracy rate on OU-ISIR LP database


Table 3. Rank-5 recognition accuracy rate on OU-ISIR LP database


Fig. 7 shows some of the recognition results, where the first column is the gait feature map to be recognized, followed by the Rank-5 recognition results; red solid boxes mark correct results. It can be observed that the appearance contours of different pedestrians are very similar and hard to distinguish with the naked eye, especially around the head, torso, and legs; such small inter-class differences make gait recognition more difficult. For example, in the 55-55 group, the first matching sample is more similar to the query gait map in the torso, but a slight change in head posture caused the true match to be ranked below the incorrect Top-1 result. In a real walking sequence, each person's pace, torso and hand swing, and step size are different, because gait has temporal as well as spatial features; periodicity is the biggest difference between gait recognition and general motion recognition. Therefore, matching methods that depend only on gait energy image features need to be improved.


Fig. 7. Examples of Rank-5 recognition results.

3.1.2 Identification from the Different Views

Then, this paper also evaluated the effectiveness of the proposed method on the cross-view gait recognition problem, comparing against the current best cross-view methods: AVTM_PdVS [16], AVTM [16], woVTM [16], and RankSVM [17]. The first three are designed specifically for cross-view gait recognition: they use an additional large-scale 3D gait database to train a view transformation model that overcomes the differences in gait characteristics across views. RankSVM is also a distance-metric learning method, aimed at gait recognition under variations such as viewing angle, carrying condition, and pace. It should be noted that these four methods were tested on only 1912 randomly selected pedestrians from the whole OU-ISIR Large Population set, whereas our method was tested on all 3835 pedestrians; our experimental setting is therefore more difficult. As shown in Fig. 8, in the four groups of cross-view experiments, our method achieves a steady improvement in Rank-1 recognition rate over the other methods. This shows that the Siamese-structure method is also very robust to cross-view gait recognition. The fundamental reason is that during training, gait map pairs from both the same view and different views are input simultaneously for distance metric training; this strategy makes our method well suited to cross-view gait recognition.


Fig. 8. Accuracy of gait recognition from different perspectives. The A~D group experiments respectively represent the (65°, 75°), (75°, 65°), (75°, 85°), and (85°, 75°) experiment settings.

3.2 Analysis of Spatiotemporal Joint Deep Neural Network

Tables 4 and 5 compare our method with the current best methods; the comparison methods are again GEI [14], FDF [14], HWLD [15], CNN.FC2, and SiaNet.FC [18]. From the tables, our proposed method based on the 3D-Siamese neural network significantly surpasses the other best methods by about 1%∼7%; notably, the 3D convolutional method achieves the best recognition on both the 75° and 85° sets. However, its accuracy is lower than that of the SiaNet.FC [18] method at the 55° and 65° viewing angles, mainly for three reasons: 1) The pre-trained model used in this article comes from the Sport1M data set, which falls well short of the ImageNet data set in data richness and volume, so from the perspective of feature expression, the features extracted by the pre-trained model are less discriminative; 2) the 55° and 65° views are harder to identify than the side view, because from a near-frontal view the pedestrian's torso, hand, and leg movements are not obvious, while the side view is usually the easiest to distinguish; 3) from the perspective of the contour map, the frontal view makes it harder to capture inter-frame motion information than the side view, so this task is very challenging for a 3D convolutional neural network. In summary, although the method proposed in this chapter does not achieve the best recognition rate at all viewing angles, it still achieves a significant recognition effect at the side viewing angles, where motion information is easier to capture, further improving gait recognition accuracy.

Table 4. Rank-1 recognition accuracy rate on OU-ISIR LP database


Table 5. Rank-5 recognition accuracy rate on OU-ISIR LP database


4. CONCLUSION

This paper proposes a gait recognition method based on a spatio-temporal joint deep neural network. Although a two-dimensional convolutional neural network can extract rich and discriminative features, it is limited to image input in two-dimensional space. As a movement pattern, gait has both spatial and temporal dimensions, and periodicity is the most significant difference between gait recognition and traditional face recognition or motion recognition; how to make full use of the original gait sequence diagrams has become a bottleneck for improving the accuracy of gait recognition. Therefore, this paper proposes to enhance the model by computing three-dimensional convolutional features of gait motion; this model can capture temporal and spatial information from continuous periodic sequences, further improving the accuracy and practicality of gait recognition. In addition, this paper proposes a 3D-Siamese network that combines a three-dimensional neural network with a Siamese structure; this spatio-temporal joint deep neural network can perform feature metric learning in three-dimensional space. Experiments show that the proposed algorithm makes full use of the advantages of both components and achieves the best gait recognition performance.

References

  1. M. Huadong, “Internet of Things: Objectives and Scientific Challenges,” Journal of Computer Science and Technology, Vol. 26, No. 6, pp. 919-924, 2011. https://doi.org/10.1007/s11390-011-1189-5
  2. H. Tiejun, Z. Jin, L. Bo, F. Huiyuan, M. Huadong, X. Xiangyang, et al., “Multimedia Technology Research: 2013-visual Perception and Processing for Intelligent Video Surveillance,” Journal of Chinese Image and Graphics, Vol. 19, No. 11, pp. 1539-1562, 2014.
  3. F. Huiyuan, Research on Intelligent Video Scene Understanding Technology for Crowd Supervision, Master's Thesis of Beijing University of Posts and Telecommunications, 2014.
  4. M. Huadong, Z. Chengbin, and X.L. Charles, “A Reliable People Counting System Via Multiple Cameras,” ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 2, pp. 1-22, 2012.
  5. F. Huiyuan, M. Huadong, "Real-time Crowd Detection Based on Gradient Magnitude Entropy Model," Proceeding of the ACM International Conference on Multimedia, pp. 885-888, 2014.
  6. N. Zhang and E.J. Lee, “Human Activity Recognition Based on an Improved Combined Feature Representation,” Journal of Korea Multimedia Society, Vol. 21, No. 12, pp. 1473-1480, 2019. https://doi.org/10.9717/kmms.2018.21.12.1473
  7. H. Kaiqi, C. Xiaotang, K. Yunfeng, and T. Tieniu, “Overview of Intelligent Video Surveillance Technology,” Journal of Computer, Vol. 37, No. 49, pp. 1093-1118, 2015.
  8. Gait (2020), https://en.wikipedia.org/wiki/Gait
  9. G.W. Taylor, R. Fergus, Y. Lecun, and C. Bregler, "Convolutional Learning of Spatiotemporal Features," Proceeding of European Conference on Computer Vision, pp. 140-153, 2010.
  10. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning Spatiotemporal Features with 3D Convolutional Networks," Proceeding of IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.
  11. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L.F. Fei, "Large-scale Video Classification with Convolutional Neural Networks," Proceeding of IEEE Conference Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
  12. K. Simonyan and A. Zisserman, "Two-stream Convolutional Networks for Action Recognition in Videos," Proceeding of Advances in Neural Information Processing Systems, pp. 568-576, 2014.
  13. J. Yangqing, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, et al., "Caffe: Convolutional Architecture for Fast Feature Embedding," Proceeding of ACM International Conference on Multimedia, pp. 675-678, 2014.
  14. H. Iwama, M. Okumura, Y. Makihara, and Y. Yagi, “The OU-ISIR Gait Database Comprising the Large Population Dataset and Performance Evaluation of Gait Recognition,” IEEE Transactions on Information Forensics and Security, Vol. 7, No. 5, pp. 1511-1521, 2012. https://doi.org/10.1109/TIFS.2012.2204253
  15. S. Sivapalan, D. Chen, S. Denman, S. Sridharan, and C. Fookes, "Histogram of Weighted Local Directions for Gait Recognition," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 125-130, 2013.
  16. D. Muramatsu, A. Shiraishi, Y. Makihara, M.Z. Uddin, and Y. Yagi, “Gait-based Person Recognition Using Arbitrary View Transformation Model,” IEEE Transactions on Image Processing, Vol. 24, No. 1, pp. 140-154, 2015. https://doi.org/10.1109/TIP.2014.2371335
  17. R. Martin-Felez and T. Xiang, "Gait Recognition by Ranking," Proceeding of European Conference on Computer Vision, pp. 328-341, 2012.
  18. Z. Cheng, L. Wu, M. Huadong, and F. Huiyuan, "Siamese Neural Network Based Gait Recognition for Human Identification," Proceeding of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2832-2836, 2016.