1. Introduction
An important task for many visual recognition systems is to analyze the activities performed by objects within the frame. Activity recognition systems have inspired novel user interfaces and new applications for smart environments, including surveillance, emergency response, and military missions. In addition, a challenging problem for machine learning research is the automatic recognition of object activity from data collected by imaging sensors [1].
The problem of learning patterns of human activity from sequences of images arises in many areas where computer science is applied, including intelligent environments, surveillance, and assistive technology for the disabled. In particular, video surveillance has become increasingly important due to the growing need for security and related applications [2]. Video surveillance of dynamic scenes has a potentially wide range of applications, such as assisting security guards in communities, understanding the behavior of crowds, providing traffic surveillance in cities and on expressways, and detecting military targets [3]. The video surveillance of dynamic scenes that contain crowds and objects is one of the most important research topics in computer vision [4][5]. A central topic of this area is the automatic analysis and recognition of crowd activity in video sequences [5][6]. The recognition of crowd activity can be described as a combination of two tasks: feature extraction and the modeling of activity classes.
Cermeno et al. [8] proposed a method that extracts global features from an image and trains a two-class classifier on the feature vectors for event recognition. Wang et al. [9] proposed a method to detect abnormal events based on histograms of the orientation of the optical flow descriptor and a one-class SVM classifier. Technically speaking, activity recognition can be divided into two tasks: (1) activity information extraction and (2) activity pattern modeling. The activity information represents attributes of movement (velocity, orientation, location) in the data, while the activity patterns are representations of events that occur frequently.
In this study, we propose an intelligent activity recognition model for image sequence data. This paper makes the following contributions to image sequence-based activity recognition. First, the model provides a representation method for crowd activity in an image sequence by using a histogram of optical flow. Second, a new machine learning method for activity recognition is presented. The proposed learning method transforms a histogram of the optical flow into a graph, specifically a Bayesian network, in which each node corresponds to a histogram bin. The model extracts common structural features from the graphs generated for each activity, and these structural features are reflected in the structure of the neural network [10] as groups of input nodes. In addition, numerical features that represent the statistical properties of each histogram bin are used as input values for the corresponding input nodes.
2. Activity Recognition
In this paper, we present a three-stage method to recognize object activities. This method consists of a description stage, a structure learning stage, and a classification stage. During the description stage, we compute the optical flow and construct a histogram of the orientation of the optical flow (HOOF) [9] in order to describe the movement of the crowd in an image sequence. The HOOF for the i-th frame is denoted as a vector θi = [o1, o2, …, o9], where ok denotes the value of the k-th histogram bin. These vectors are collected over a moving time window so that a set of vectors Θi = {θi−T+1, …, θi} can be formulated for the T consecutive frames in the window. In the structure learning stage, we generate a Bayesian network from Θi by using the K2-algorithm. Each node in the Bayesian network is a component of the HOOF vector, and the Bayesian network captures causal relationships between the nodes. We call this network the context network. For each activity class Aclass, a set of context networks CNclass is constructed by collecting the Bayesian networks generated from frames with non-overlapping time windows. For each class, the common paths are extracted from the context networks. Each path Pi is defined as a structural feature of the class, and the structural features represent the topological characteristics of the context networks. A structural feature Pi is implemented as a group of input nodes of a neural network; a two-layer neural network is designed and trained with a training image sequence for each class. During the training process, the structural features are extracted from the current context network, and the numerical features of the nodes included in the structural features are applied to the input nodes of the neural network. The numerical features of a node represent the statistical characteristics of the node during a given time window.
The input nodes of structural features that are absent from the current context network are given 0's in order to exclude them from training for the sample. After training the two-layer perceptron, the current activity is classified by deriving the context network from the current Θi. The paths of the context network are extracted and compared to the stored structural features; the input nodes of the matching structural features are given the numeric values of their nodes, while the remaining input nodes are given 0 values. The class is identified by the output node with the maximum output value. Fig. 1 illustrates the proposed activity recognition method.
Fig. 1. Block diagram of the proposed method for activity recognition
2.1 Frame Description
2.1.1 Optical Flow
The optical flow consists of the apparent velocities of the pixels in an image sequence. Since the direction and the amplitude of movement are a representation of the activity, the optical flow is used as the scene description. Horn and Schunck [11] compute the optical flow by using a global smoothness constraint, and their basic method (HS) is used to compute the optical flow in this paper. The HS method minimizes an error function that combines two constraints. The first is a brightness constancy constraint that assumes the grey level of a point is constant across the frames. The second is a smoothness constraint that assumes continuity in the velocities of adjacent pixels [12]. Equation (1) shows the error function that is used:

E = ∬ [(Ix·u + Iy·v + It)² + λ(|∇u|² + |∇v|²)] dx dy (1)

Ix, Iy, and It are the derivatives of the image intensity values along the x and y spatial axes and the time axis, u and v are the horizontal and vertical components of the optical flow, and λ is a regularization constant. The optical flow can be iteratively computed by using (2) and (3):

u^(k+1) = ū^k − Ix(Ix·ū^k + Iy·v̄^k + It) / (λ + Ix² + Iy²) (2)

v^(k+1) = v̄^k − Iy(Ix·ū^k + Iy·v̄^k + It) / (λ + Ix² + Iy²) (3)

where k denotes the iteration step, and ū^k and v̄^k are weighted averages of u and v in the neighborhood of the pixel (x, y).
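For reference, the iterative updates (2) and (3) can be sketched in a few lines of NumPy. This is a minimal single-scale sketch and not the paper's exact implementation; the derivative stencils, the λ value, and the iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(im1, im2, lam=0.1, n_iter=100):
    """Minimal Horn-Schunck sketch for two grayscale frames (floats)."""
    im1 = im1.astype(np.float64)
    im2 = im2.astype(np.float64)
    # Simple HS-style derivative stencils (an assumption for this sketch).
    kx = np.array([[-1, 1], [-1, 1]]) * 0.25
    ky = np.array([[-1, -1], [1, 1]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(im1, kx) + convolve(im2, kx)
    Iy = convolve(im1, ky) + convolve(im2, ky)
    It = convolve(im2, kt) - convolve(im1, kt)
    # Kernel for the neighborhood averages u-bar and v-bar.
    avg = np.array([[1/12, 1/6, 1/12], [1/6, 0, 1/6], [1/12, 1/6, 1/12]])
    u = np.zeros_like(im1)
    v = np.zeros_like(im1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / (lam + Ix**2 + Iy**2)
        u = u_bar - Ix * common   # update (2)
        v = v_bar - Iy * common   # update (3)
    return u, v
```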
2.1.2 Histogram of the Orientation of Optical Flow (HOOF)
Fig. 2 shows the partition of an image into non-overlapping blocks. Each block has the same size of n×n pixels, and each image frame is divided into m blocks, where m = (image height / n) × (image width / n). For each block, the average of the optical flows in the block is computed to describe the local moving direction of the crowd in the block. For block k, the average optical flow is represented using the polar coordinates (rk, θk). In Fig. 2, fBn denotes the histogram of block Bn, and Fk denotes the set of block histograms of frame k. The orientation bins evenly divide the range from 0° to 360° into 9 parts, as shown in Fig. 2, and block k votes into the orientation bin that includes θk. The HOOF for frame i is denoted as a vector θi = [o1, o2, …, o9] that describes the distribution of the directions in which the crowd is moving in the entire frame. For the T consecutive frames up to the current i-th frame, the HOOFs of all frames are collected in order to construct a HOOF sequence Θi = {θi−T+1, …, θi}. The HOOF sequence represents the change in the direction in which the crowd is moving during a given time window. The HOOF sequence is applied to the K2-algorithm in order to generate a context network.
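A minimal sketch of this block-wise HOOF construction, assuming a dense flow field such as the one returned by the Horn-Schunck sketch above; the block size, the zero-magnitude test, and all names are illustrative.

```python
import numpy as np

def hoof(u, v, block=8, n_bins=9):
    """Build a HOOF vector from a dense optical-flow field (u, v).

    The frame is split into non-overlapping block x block regions; each
    region votes its mean flow orientation into one of n_bins bins that
    evenly divide 0..360 degrees.
    """
    h, w = u.shape
    hist = np.zeros(n_bins)
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            mu = u[y:y+block, x:x+block].mean()
            mv = v[y:y+block, x:x+block].mean()
            r = np.hypot(mu, mv)                            # magnitude r_k
            theta = np.degrees(np.arctan2(mv, mu)) % 360.0  # orientation theta_k
            if r > 0:                                       # skip static blocks
                hist[int(theta // (360.0 / n_bins))] += 1
    return hist

# HOOF sequence over a time window of T frames: Theta_i = {theta_{i-T+1}, ..., theta_i}
# theta_seq = np.array([hoof(u, v) for u, v in flows])   # flows: list of (u, v) fields
```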
Fig. 2. 8×8 cells of the histogram of optical flow descriptor
2.2 Structure Learning Stage
2.2.1 K2-Algorithm
A Bayesian network (BN) is a graphical model that efficiently encodes the joint probability distribution for a set of variables [13]. The BN provides a powerful knowledge representation and reasoning tool for conditions with uncertainty. A BN is a directed acyclic graph (DAG) with a conditional probability distribution for each node; the nodes of the DAG represent the domain variables, while the arcs between the nodes represent probabilistic dependencies [13]. We use the K2-algorithm to extract the structural relationships between the histogram bin values. The K2-algorithm, proposed by Cooper and Herskovits [14], is the most well-known Bayesian structure learning algorithm. The algorithm generates a Bayesian graph G by scoring candidate structures with a Bayesian metric called the K2-metric, one of the most widely used Bayesian network evaluation functions. The K2-metric is expressed in (4):

P(G, D) = P(G) ∏_{i=1}^{n} ∏_{j=1}^{qi} [(ri − 1)! / (Nij + ri − 1)!] ∏_{k=1}^{ri} Nijk! (4)
Maximizing P(G, D) searches for the most probable Bayesian network structure G given a database D. P(G) is the prior probability of the structure, which is constant for each G. In (4), ri represents the number of possible values of the node oi, and qi is the number of possible instantiations of the parent set of oi. We let πi denote the set of parents of node oi.
Nijk is the number of cases in D in which the node oi is instantiated with its k-th value and the parents of oi in πi are instantiated with the j-th instantiation in qi. Nij is the number of instances in the database in which the parents of oi in πi are instantiated with the j-th instantiation in qi [14].
The K2-algorithm starts by assuming that a node has no parents, after which it incrementally adds the parent whose addition increases the probability of the resulting structure the most at each step. The K2-algorithm stops adding parents to a node when the addition of no single parent can increase the probability of the network for the given data [14][15]. The structure learning stage uses the K2-algorithm to obtain the graphs from the training data for each of the four classes. These learned graphs are called context-graphs and are used to extract the distinctive path patterns that serve as input features for the recognition of each class [16].
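The following is a compact sketch of the K2 search with the metric of (4) in log form, assuming the HOOF bin values have been discretized into r states. The fixed node ordering, the max_parents cap, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from math import lgamma
from collections import defaultdict

def k2_local_score(data, i, parents, r):
    """Log K2-metric contribution of node i for a candidate parent set.
    data: (n_samples, n_nodes) array of discretized values in {0, ..., r-1}."""
    if parents:
        keys = [tuple(row) for row in data[:, parents]]
    else:
        keys = [()] * len(data)
    groups = defaultdict(lambda: np.zeros(r, dtype=int))
    for key, val in zip(keys, data[:, i]):
        groups[key][val] += 1                   # accumulate N_ijk
    score = 0.0
    for counts in groups.values():              # counts.sum() = N_ij
        score += lgamma(r) - lgamma(counts.sum() + r)  # log (r-1)!/(N_ij+r-1)!
        score += sum(lgamma(c + 1) for c in counts)    # log prod_k N_ijk!
    return score

def k2(data, order, r, max_parents=3):
    """Greedy K2 structure search under a fixed node ordering (a sketch)."""
    parents = {i: [] for i in order}
    for pos, i in enumerate(order):
        best = k2_local_score(data, i, parents[i], r)
        improved = True
        while improved and len(parents[i]) < max_parents:
            improved = False
            # Only nodes earlier in the ordering may become parents (keeps a DAG).
            cand = [p for p in order[:pos] if p not in parents[i]]
            scored = [(k2_local_score(data, i, parents[i] + [p], r), p) for p in cand]
            if scored:
                s, p = max(scored)
                if s > best:
                    parents[i].append(p)
                    best, improved = s, True
    return parents    # parents[i] lists the parent nodes of o_{i+1}
```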
2.2.2 Pattern Extraction
The context-graph and the extracted path patterns are generated as shown in Fig. 3. The node oi in the context-graph represents the value of the i-th histogram bin. In the example graph, nodes o4 and o5 depend on the parent node o2. The context-graph is implemented as an adjacency matrix in which each element represents the existence of a connection between two nodes.
Fig. 3. Generated context-networks
An element of the adjacency matrix A[i, j] = 1 if there is an edge from the i-th node to the j-th node, and A[i, j] = 0 otherwise. A context-network CN = (V, E) is a directed graph, where V is a set of nodes and E is a set of edges. An edge e = (oi, oj) ∈ E denotes a directed dependency from node oi to node oj.
The context-graphs for each class c ∈ C = {Walk, Run, Evacuation, Merge} are learned by using the training data for each class. The set of context-graphs for class c ∈ C is named CNc, as shown in (6):

CNc = {CNc^1, CNc^2, …, CNc^Nc} (6)

where CNc^i ∈ CNc denotes the i-th generated context-network for class c and Nc is the number of generated context-networks for class c.
For each context network CNc^i, which is the i-th context-network for class c, the path patterns are extracted, where Kc,i denotes the number of path patterns extracted from CNc^i. Table 1 shows the set of path patterns for class c, where each column denotes the set of path patterns from one context network. The walk, run, merge, and evacuation classes each have their own path patterns obtained from their respective learned context-networks.
Table 1. Path patterns from context-networks
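The paper does not spell out the path-enumeration procedure; one plausible reading is to list every root-to-leaf directed path of the learned DAG, as in the sketch below. The adjacency-matrix input and all names are illustrative.

```python
def extract_paths(adj):
    """Enumerate root-to-leaf directed paths of a context-network (a sketch).

    adj: n x n adjacency matrix (list of lists); adj[i][j] == 1 means an
    edge from node o_{i+1} to node o_{j+1}.
    Returns paths as tuples of node indices, e.g. (0, 1, 7) for o1-o2-o8.
    """
    n = len(adj)
    children = {i: [j for j in range(n) if adj[i][j]] for i in range(n)}
    has_parent = {j for i in range(n) for j in children[i]}
    roots = [i for i in range(n) if i not in has_parent and children[i]]
    paths = []
    def dfs(node, path):
        if not children[node]:          # leaf: record the complete path
            paths.append(tuple(path))
            return
        for child in children[node]:
            dfs(child, path + [child])
    for root in roots:
        dfs(root, [root])
    return paths
```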
2.2.3 Classification Stage
During the classification stage, we design a two-layer neural network for pattern classification by using the extracted path patterns, as shown in Fig. 4. For each class c ∈ C, the conditional probability P(p|c) for each path p ∈ PS, the set of extracted path patterns, is computed during training. For each class, the path patterns with the highest conditional probabilities are selected and used as input features for classification [17]. These selected path patterns are defined as structural features. For each distinct selected structural feature Pi, a set of input nodes IPi of the neural network is assigned, and every node oj in Pi, j = 1, 2, …, m, is assigned an input node in IPi. Table 2 shows the structural features as the selected path patterns for each class. The input nodes of each selected path feature are grouped separately, as shown in Fig. 4. Note that a structural feature can appear in several classes, e.g., IP2 in the run class and the merge class; Null denotes that no such path pattern exists in the context-network of that class. By using this organization of the input layer, we can reflect not only the topology of the generated context-network but also the numerical properties of the variables. A two-layer perceptron with 70 input nodes and 4 output nodes is used, where the four output nodes correspond to the Walk, Run, Merge, and Evacuation classes. The activation function used for a node ni is
Fig. 4. Neural network architecture for activity recognition
Table 2. Implementation of structural features as clusters of input nodes
Xi = 1 / (1 + exp(−Σ_j Wij·Xj))

where Wij is the weight of the connection between ni and nj, the j-th node in the preceding layer, and Xj denotes the output value of node nj. This function allows for a smooth transition between the low and high outputs of the neuron [17]. The weights of the connections are learned by using a back-propagation algorithm. The learning algorithm minimizes the sum of squared errors E between the network outputs a and the target outputs t, where E is defined as:

E = (1/2) Σ_{i=1}^{N} (ti − ai)²
where ti ∈ {0,1} and ai ∈ [0,1] are the target value and the network output for the i-th output node, respectively, and N is the number of output nodes, i.e., 4. Backpropagation is the most widely used algorithm for supervised learning with multi-layered feed-forward networks. The basic idea of backpropagation learning [18] is the repeated application of the chain rule to compute the influence of each weight in the network on the error function E:

∂E/∂wij = (∂E/∂ai) · (∂ai/∂neti) · (∂neti/∂wij)
where wij is the weight of the connection from node nj to node ni, ai is the output, and neti is the weighted-sum input of node ni. Once the partial derivative for each of the weights is known, the error function can be minimized by performing a simple gradient descent:

wij(t+1) = wij(t) − η·(∂E/∂wij)

where η is the learning rate.
During the training phase, the input nodes are partitioned into two sets S1 and S2:

S1 ∪ S2 = I, S1 ∩ S2 = ∅
where I is the set of input nodes. Set S1 contains the input nodes that belong to the structural features of the current target class, and set S2 contains the remaining input nodes. The input nodes in S1 are given the preprocessed current numeric values, while the input nodes in S2 are given 0's. Thus, training selectively updates the weights connected to the input nodes of the structural features of the current target class. During the recognition phase after training, the paths are extracted from the generated context-network. The input nodes of the structural features found in the network are provided with the current numeric values, and the other input nodes are given 0's. The current class c* is then classified as

c* = argmax_{c∈C} v(Oc)
where Oc is the output node for class c, and v(·) denotes the value of an output node.
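Putting the classification stage together, the following is a minimal sketch of a two-layer perceptron with the masked, feature-grouped input layer described above. The dimensions (70 inputs, 25 hidden nodes, 4 outputs) follow the paper, but the initialization, learning rate, and group bookkeeping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MaskedMLP:
    """Two-layer perceptron whose input nodes are grouped by structural feature."""
    def __init__(self, n_in=70, n_hidden=25, n_out=4, lr=0.1):
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0, 0.1, (n_out, n_hidden))
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(self.W1 @ x)
        self.o = sigmoid(self.W2 @ self.h)
        return self.o

    def backward(self, x, t):
        # Gradient of E = 0.5 * sum((t - o)^2) via the chain rule.
        o = self.forward(x)
        delta_o = (o - t) * o * (1 - o)
        delta_h = (self.W2.T @ delta_o) * self.h * (1 - self.h)
        self.W2 -= self.lr * np.outer(delta_o, self.h)
        self.W1 -= self.lr * np.outer(delta_h, x)  # x is 0 for masked inputs,
                                                   # so their weights stay fixed

def masked_input(numeric_values, present_groups, groups, n_in=70):
    """Input vector: nodes of present structural features get their numeric
    feature values; all other input nodes get 0 (the sets S1 and S2)."""
    x = np.zeros(n_in)
    for g in present_groups:
        for node in groups[g]:          # groups: feature -> input-node indices
            x[node] = numeric_values[node]
    return x
```

Because the masked inputs are exactly 0, the gradient with respect to their incoming weights vanishes, which reproduces the selective training of S1 described above.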
3. Experiments and Results
We implemented the proposed method and measured its performance by using the PETS 2013 crowd activity dataset (S3) as experimental data.
3.1 Crowd Activity Dataset
The crowd activity dataset contains different crowd activities, and the objective is to provide a probabilistic estimate at different times for each of the following events: walking, running, evacuation, and merging. Furthermore, we are interested in systems that can identify the start and end of the events as well as the transitions between them [19]. The image sequences depict four crowd activities, and image frames for the four classes are shown in Fig. 5. Forty people act out the different crowd scenarios in the image sequences. In order to validate our approach, we tested the PETS Dataset S3, High Level, which contains the respective sequences with timestamps of 14:16 and 14:33. For each sequence, we use the videos recorded by camera1 (view001) and camera2 (view002). Moreover, we compared the performance of the proposed method against that of an MLP and a Bayesian network classifier (BNC). Table 3 shows the distribution of the class frames in the view001 and view002 sequences. The numbers of image frames in the walk class and the evacuation class are almost equal, while the image frames of the merge class comprise the largest portion of the dataset. We can observe that the pedestrians have relatively similar sizes in the view001 image sequence, whereas the pedestrians have different sizes in the view002 image sequence due to the characteristics of the perspective projection. Therefore, the optical flow vectors in view001 have uniform magnitudes while the magnitudes of the optical flow vectors in view002 are irregular.
Table 3. Distribution of classes in the dataset
Fig. 5. Four crowd activity classes in view001
3.2 Feature Extraction
The proposed method analyzes the activity of the crowd through the optical flow and accumulates the orientation of the representative optical flow of each image block. Fig. 6 shows the representative optical flow extracted by the feature extraction stage for the run class in view001.
Fig. 6. Representative optical flow extracted for the run class in view001
Fig. 7 presents the accumulated orientations of the optical flow vectors over the 9 orientation bins for each class, which show a different distribution depending on the class. For the evacuation class, the number of optical flow vectors is at a maximum in the bin from 80° to 120°. For the run class, the bin from 240° to 280° has the maximum number of flow vectors. The walk class has the maximum number of optical flow vectors in the bin from 160° to 200°.
Fig. 7. Distribution of HOOF for the four classes
3.3 Structure Learning
For the i-th image frame, a HOOF sequence Θi = {θi−T+1, …, θi} with T = 10, which is the collection of the HOOFs of the previous T frames, is constructed. The K2-algorithm is applied to this HOOF sequence in order to generate context networks for each class. From the context networks of each class, the path patterns with the highest conditional probabilities are extracted. The paths that appear most frequently in the patterns of the four crowd activity classes are shown in Table 4, which lists the path patterns according to their number of occurrences during the training phase. For example, the path o1-o2-o3-o7 for the walk class appears 12 times during training; this path also appears in the merge class with 38 occurrences. The path patterns represent the particular correlations among the variables that must be treated as significant during the recognition phase. These patterns thus become the structural features for the input image sequence that is to be classified.
Table 4. Most frequently appearing path patterns for the activity dataset
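For illustration, the description stage and the structure learning stage can be wired together roughly as follows, reusing the hoof and k2 sketches above. The discretization into three states and the variable flows are assumptions for the example, not details given in the paper.

```python
import numpy as np

# flows holds the (u, v) fields of the last T = 10 frames (hypothetical input).
theta_seq = np.array([hoof(u, v) for u, v in flows[-10:]])   # HOOF sequence Theta_i

# Discretize each bin value into 3 states before the K2 search (an assumption).
edges = np.quantile(theta_seq, [1/3, 2/3])
data = np.digitize(theta_seq, edges).astype(int)

# One context network: parents[i] lists the parents of histogram-bin node o_{i+1}.
parents = k2(data, order=list(range(9)), r=3)
```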
3.4 Classification
During the classification stage, the structural features are used to classify the current activity by detecting the existence of these features in the current context network. However, common structural features are present in different classes. For example, Table 4 shows that the o1-o2-o8 path pattern belongs to the structural feature sets of all classes. Since the existence of the structural features alone cannot distinguish the classes, we use a second type of feature, i.e., the numeric features, which are the values of the HOOF variables. For the path pattern o1-o2-o8, the conditional joint probability distribution of o1, o2, and o8 for each class is shown in Fig. 8.
Fig. 8. Distribution of the HOOF variable values for the four classes
Fig. 8 shows the magnitude of the optical flow over the 9 orientation bins for each class, namely the directional velocity. Each class exhibits a quite different distribution of the optical flow of the crowd behavior in the image sequence. In Fig. 8(a) and Fig. 8(c), the walk class and the merge class mostly have narrow distributions of their values, whereas in Fig. 8(b) and Fig. 8(d), the run class and the evacuation class mostly have wider distributions. In Fig. 8(a) and Fig. 8(b), the walk class and the run class have skewed value distributions due to movement in one dominant direction in the image sequence, while in Fig. 8(c) and Fig. 8(d), the merge class and the evacuation class have distributions spread over several bins owing to movement in multiple directions. Table 5 shows the mean of each variable, E[oi], as well as the correlation coefficients between the variables, ρoioj, in the o1-o2-o8 path pattern. As shown in Table 5, the distribution of a path pattern differs by class and is non-symmetric because the context-network is a directed network. The table indicates a large mean value E[o1] for the run and evacuation classes, and strong positive correlations exist between the variables.
Table 5. Value of each node in the o1-o2-o8 path pattern
In particular, the o1-o2 pair in the evacuation class has the strongest relationship, 0.91, since the pedestrians spread out in every direction.
Table 6 presents the variable values for the path pattern o1-o2-o4-o5-o6 in the context networks of the walk class. The values are almost identical across the context networks of the same class.
Table 6. Value of each node in the o1-o2-o4-o5-o6 path pattern of the walk class
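The statistical quantities reported in Tables 5 and 6, E[oi] and ρoioj, can be computed directly from the HOOF sequence. A minimal sketch, assuming the window of HOOF vectors is stacked as a (T, 9) array; the function name and path encoding are illustrative.

```python
import numpy as np

def numeric_features(theta_seq, path):
    """Numeric features of the nodes on a path pattern (a sketch).

    theta_seq: (T, 9) array, the HOOF sequence Theta_i over the time window
    path:      node indices of a structural feature, e.g. (0, 1, 7) for o1-o2-o8
    Returns the per-node means E[o_k] and the correlation of each edge pair.
    """
    cols = list(path)
    means = theta_seq[:, cols].mean(axis=0)
    corrs = [np.corrcoef(theta_seq[:, a], theta_seq[:, b])[0, 1]
             for a, b in zip(cols, cols[1:])]
    return means, corrs
```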
3.5 Evaluation
We compare the performance of the proposed method against that of a conventional MLP and a Bayesian classifier. Generally, a conventional MLP exhibits better results as the number of hidden nodes becomes larger; here, however, the conventional MLP showed its best performance with a hidden layer of 20 nodes. The input nodes of the MLP and Bayesian classifiers are the variables o1, …, o9, and the input value of each variable is its average over the time window, as shown in (15):

ōk = (1/T) Σ_{t=i−T+1}^{i} ok(t) (15)
The best performance of the proposed method is achieved when it uses 25 hidden nodes and 99.4% of the whole image sequence as the training sequence. The performance is measured in terms of the precision in (16) and is compared with the performance of the other methods in Table 7:

precision = TP / (TP + FP) (16)

where TP and FP denote the numbers of true positive and false positive frames, respectively.
Table 7. Accuracy evaluation of crowd activity recognition
The proposed method exhibits the best performance for the view001 sequence, while it exhibits the worst performance for the evacuation class in the view002 sequence. Since this class constitutes the smallest part of the training sequence, it is confused with the other classes in view002 due to insufficient training. The conventional Bayesian classifier showed the worst performance compared to the proposed method and the conventional MLP. We also compare against the method proposed by Wang [9], which uses the HOOF as features and an SVM as the classifier. They performed experiments with several classes (walking toward all directions, walking toward the same direction, crowd formation and evacuation, local dispersion) except the run class, so we compared their classes with our corresponding classes. We compute the overall performance as a weighted average of the precision of the four classes using (17):

weighted precision = Σ_{c∈C} (Nc / N) · precision_c (17)

where Nc is the number of frames of class c and N is the total number of frames.
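A one-line sketch of the weighted average in (17); the per-class precisions and frame counts in the usage comment are hypothetical values, not results from the paper.

```python
def weighted_precision(precisions, counts):
    """Weighted average precision over classes, as in (17).

    precisions: per-class precision values TP_c / (TP_c + FP_c)
    counts:     number of frames of each class in the test sequence
    """
    total = sum(counts)
    return sum(p * n / total for p, n in zip(precisions, counts))

# Illustrative usage with hypothetical per-class results:
# weighted_precision([0.98, 0.95, 0.97, 0.90], [600, 300, 900, 250])
```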
The average precision is plotted in Fig. 9 and is compared to the average precision of a conventional MLP. The proposed method shows better performance for both the view001 and view002 sequences.
Fig. 9. Accuracy evaluation of crowd activity recognition
4. Conclusion
In this paper, we have proposed a crowd behavior recognition method that can be applied to a variety of complex activity environments. The proposed method consists of a feature extraction stage, a structure learning stage, and a classification stage. We propose a systematic structure learning approach that automatically learns an appropriate context-network. The neural network with structural features (path patterns) designed and presented as part of this method achieves improved classification performance. Specifically, the structure learning stage is implemented in three steps: an input temporal vector formulation step, a context-network generation step with the K2-algorithm, and a path pattern extraction step from the context-network. Our automatically learned recognition model outperformed the multi-layer perceptron and the Bayesian classifier. These results demonstrate that the proposed approach is feasible and provides sufficient recognition accuracy. In future work, we will expand the path features in order to recognize more complicated activities, such as loitering, and investigate path features that are automatically generated from context-networks. In addition, the proposed method does not yet consider real-time execution; we are investigating an efficient implementation of a real-time crowd activity recognition system.
References
- Hu, Weiming, et al., "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, no. 3, pp. 334-352, 2004. https://doi.org/10.1109/TSMCC.2004.829274
- Saxena, Shobhit, et al., "Crowd behavior recognition for video surveillance," in Proc. of Advanced Concepts for Intelligent Vision Systems, Springer Berlin Heidelberg, 2008.
- Hu, Weiming, et al., "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, no. 3, pp. 334-352, 2004. https://doi.org/10.1109/TSMCC.2004.829274
- Ke, Shian-Ru, et al., "A review on video-based human activity recognition," Computers, vol. 2, no. 2, pp. 88-131, 2013. https://doi.org/10.3390/computers2020088
- Candamo, Joshua, et al., "Understanding transit scenes: A survey on human behavior-recognition algorithms," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 1, pp. 206-224, 2010. https://doi.org/10.1109/TITS.2009.2030963
- Viola, Paul, Michael J. Jones, and Daniel Snow, "Detecting pedestrians using patterns of motion and appearance," in Proc. of the Ninth IEEE International Conference on Computer Vision, 2003.
- An, Tae-Ki, and Moon-Hyun Kim, "Context-aware video surveillance system," Journal of Electrical Engineering & Technology, vol. 7, no. 1, pp. 115-123, 2012. https://doi.org/10.5370/JEET.2012.7.1.115
- Cermeno, Eduardo, Silvana Mallor, and J. A. Siguenza, "Learning crowd behavior for event recognition," in Proc. of the 2013 IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2013.
- Wang, Tian, and Hichem Snoussi, "Histograms of optical flow orientation for abnormal events detection," in Proc. of the 2013 IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2013.
- Yang, Jhun-Ying, Jeen-Shing Wang, and Yen-Ping Chen, "Using acceleration measurements for activity recognition: An effective learning algorithm for constructing neural classifiers," Pattern Recognition Letters, vol. 29, no. 16, pp. 2213-2220, 2008. https://doi.org/10.1016/j.patrec.2008.08.002
- Horn, Berthold K. P., and Brian G. Schunck, "Determining optical flow," in Proc. of the 1981 Technical Symposium East, International Society for Optics and Photonics, 1981.
- Fujisawa, Shizuka, et al., "Pedestrian counting in video sequences based on optical flow clustering," International Journal of Image Processing, vol. 7, no. 1, pp. 1-16, 2013. https://doi.org/10.1049/iet-ipr.2012.0104
- Heckerman, David, Dan Geiger, and David M. Chickering, "Learning Bayesian networks: The combination of knowledge and statistical data," Machine Learning, vol. 20, no. 3, pp. 197-243, 1995. https://doi.org/10.1023/A:1022623210503
- Cooper, Gregory F., and Edward Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Machine Learning, vol. 9, no. 4, pp. 309-347, 1992. https://doi.org/10.1023/A:1022649401552
- Lerner, Boaz, and Roy Malka, "Investigation of the K2 algorithm in learning Bayesian network classifiers," Applied Artificial Intelligence, vol. 25, no. 1, pp. 74-96, 2011. https://doi.org/10.1080/08839514.2011.529265
- Kim, Jin-Pyung, et al., "Multi-sensor signal based situation recognition with Bayesian networks," Journal of Electrical Engineering & Technology, vol. 9, no. 3, pp. 1051-1059, 2014. https://doi.org/10.5370/JEET.2014.9.3.1051
- Kim, Gyujin, Taeki An, and Moon-Hyun Kim, "Estimation of crowd density in public areas based on neural network," KSII Transactions on Internet and Information Systems (TIIS), vol. 6, no. 9, pp. 2170-2190, 2012. https://doi.org/10.3837/tiis.2012.09.011
- Jain, Anil K., Robert P. W. Duin, and Jianchang Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, 2000. https://doi.org/10.1109/34.824819
- Choudhury, Tanzeem, et al., "The mobile sensing platform: An embedded activity recognition system," IEEE Pervasive Computing, vol. 7, no. 2, pp. 32-41, 2008. https://doi.org/10.1109/MPRV.2008.39
- Eom, K. Y., J. Y. Jung, and Moon-Hyun Kim, "A heuristic search-based motion correspondence algorithm using fuzzy clustering," International Journal of Control, Automation and Systems, vol. 10, no. 3, pp. 594-602, 2012. https://doi.org/10.1007/s12555-012-0317-5