1. Introduction
Video cameras have recently become very widely used in many surveillance applications such as face recognition and facial expression recognition (FER) [1]-[11]. Hence, FER has attracted a lot of attention from the research community due to its applications in many areas of image processing, pattern recognition, and computer vision. In this regard, good face analysis remains a major concern owing to several causes, such as the failure to extract efficient features and inappropriate feature variance among the different expression classes.
1.1 Related Work
For feature extraction from facial expression images, most early FER research utilized Principal Component Analysis (PCA) [2,3]. PCA is commonly used for dimension reduction. In [2] and [3], the authors also employed PCA to solve FER with the Facial Action Coding System (FACS). In [4], the authors applied PCA to identify Facial Action Units (FAUs) and to recognize facial expressions. Most recent works have focused on emotion-specific feature extraction rather than FAUs [5-8]. In [5], Linear Discriminant Analysis (LDA) was applied over PCA features of the facial expression images. Recently, Independent Component Analysis (ICA) has been extensively utilized for FER tasks due to its ability to extract local face features [6]. In [6], ICA was used to extract IC features of facial expression images to recognize Action Units (AUs).
In recent years, local binary patterns (LBP), which were originally proposed for texture analysis to efficiently summarize the local structures of an image, have received increasing interest for facial expression representation. The key properties of LBP features are their tolerance to illumination changes and their computational simplicity. Hence, LBP has been successfully employed as a local feature extraction method in facial expression recognition [7,8]. A robust discriminant analysis approach called Generalized Discriminant Analysis (GDA) has recently been used in different applications, where it shows significant superiority over traditional feature extraction approaches such as PCA and LDA [11]. Thus, GDA can be a robust tool for obtaining better discrimination among face images of different expressions.
Depth information-based face representation has become very popular over RGB, since the pixel intensities in depth images are set according to the distance to the camera and therefore provide better expression information than typical RGB images [10]. Moreover, the original identity of a person cannot easily be obtained from depth videos, which helps resolve privacy issues that cannot easily be resolved when RGB videos are used. Although LBP-based features on RGB images can generate better results than conventional features on RGB faces, different people have different face colors, so RGB face pixel intensities would cause problems for LBP in building a robust person-independent expression recognition system. LBP-based features on RGB images are therefore better suited to face recognition than to facial expression recognition. This discussion indicates clearly that LBP-based depth face features can produce better results than LBP-based RGB face features, since a facial expression recognition system should not depend on face color. Thus, LBP-based features on depth faces should allow one to build a robust person-independent FER system.
In addition to FER, depth camera image analysis has received a great deal of attention from researchers in other fields of computer vision and pattern analysis [12-31]. In [12], the authors used depth map sequences to derive robust features for distinguishing human activities. In [14], the authors adopted depth information-based histograms of surface orientation for action recognition. In [17], the authors applied depth motion-based maps to capture motion energies in activity videos to represent different human activities. In [21], the authors considered moving object labeling utilizing RGB as well as depth videos. In [23], the authors modeled depth-based moving object actions with the help of a two-layer maximum entropy Markov model. In [27], the authors used particle swarm optimization to model two interacting hands from depth images for human pose representation. Besides human pose analysis, visual gestural languages such as American Sign Language are also a very active field in image processing and computer vision [28-31]. For instance, the SignSpeak project analyzed the translation of American Sign Language on a mobile platform [31].
The Hidden Markov Model (HMM), a robust tool for modeling time-sequential events, has been tested very commonly for FER [10]. An HMM is basically a stochastic model for representing sequential data, in which the stochastic process is determined by two mechanisms: a Markov chain consisting of a finite number of states and a set of observation probability distributions. At each discrete instant, the process is assumed to be in one state and, corresponding to that state, an observation is generated. Thus, HMMs can be adopted to model the time-sequential robust features of different facial expressions for a robust FER system.
1.2 Proposed Approach
In this work, a novel depth face-based FER approach is proposed utilizing LBP with GDA and HMMs. The local features are first extracted using LBP and then enhanced by the nonlinear discriminant analysis approach GDA, which increases the robustness of the features. The time-sequential feature sequences from each depth video are then applied to train one HMM per expression, which is later used for likelihood-based recognition.
2. Proposed FER Methodology
A strong feature space is first generated via LBP and GDA to be applied with HMMs later for training and testing. Fig. 1 shows the basic steps of training and testing of expressions through HMMs.
Fig. 1. Architecture of the proposed depth sensor-based FER system.
2.1 Depth Face Preprocessing
The images of different expressions are captured by a depth camera [10], which simultaneously generates distance (i.e., depth) information for the objects it captures. A depth video represents the range of every pixel in the scene as a gray-level intensity (i.e., more distant pixels have darker values and closer pixels brighter values, or vice versa). Fig. 2(a) and (b) show six sequential depth faces from a surprise and a disgust expression, respectively.
Fig. 2. Example sequential images of depth facial expressions.
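As a minimal illustration of this preprocessing step (a sketch only, assuming the raw depth frames are available as NumPy arrays; the function name and the near/far parameters are illustrative and not from the paper), the following maps a raw range image to an 8-bit gray face image in which closer pixels appear brighter:

```python
import numpy as np

def depth_to_gray(depth_map, near=None, far=None):
    """Map a raw depth/range image to an 8-bit gray image.

    Closer pixels receive brighter values; pixels with no depth
    reading (value 0) stay black. near/far default to the valid
    depth range of the frame.
    """
    depth = depth_map.astype(np.float32)
    valid = depth > 0
    if not valid.any():
        return np.zeros(depth.shape, dtype=np.uint8)
    near = depth[valid].min() if near is None else near
    far = depth[valid].max() if far is None else far
    # Normalize valid depths to [0, 1], then invert so near = bright.
    norm = np.zeros_like(depth)
    norm[valid] = (depth[valid] - near) / max(far - near, 1e-6)
    gray = np.where(valid, (1.0 - np.clip(norm, 0.0, 1.0)) * 255.0, 0.0)
    return gray.astype(np.uint8)
```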
2.2 LBP Feature Extraction
At a given pixel position (xc, yc) with gray value Kc, the local texture T is defined as T = t(Kc, K0, K1, K2, K3, K4, K5, K6, K7), where K0, …, K7 correspond to the gray values of the eight surrounding pixels. To compare the relative intensities between the center pixel and its neighbor pixels, T can be rewritten as
T ≈ t(s(K0 − Kc), s(K1 − Kc), ⋯, s(K7 − Kc))
where the thresholding function s(l) is defined as
s(l) = 1 if l ≥ 0, and s(l) = 0 otherwise.
Then, the LBP pattern at the given pixel (xp, yp) can be represented as the ordered set of binary comparisons weighted by powers of two:
LBP(xp, yp) = Σ(i = 0, …, 7) s(Ki − Kc) · 2^i
Fig. 3 shows the LBP operator used in this work. The texture of an image is then represented by the histogram of its LBP map L, whose i-th bin is defined as
Hi = Σ(x, y) δ(L(x, y), i),   i = 0, 1, ⋯, n − 1,
where δ(a, b) equals 1 when a = b and 0 otherwise, and n is the number of LBP histogram bins (usually n = 256). Finally, the whole LBP feature H is expressed as a concatenated sequence of sub-region histograms H = (H1, H2, ⋯, Hr), where r is the number of sub-regions of the image. Thus, from each depth face image, expression features (i.e., LBP descriptors) are extracted as described above. Fig. 4 shows a surprise depth expression image divided into 64 small regions, from which the LBP histograms are extracted and concatenated into the LBP descriptor.
Fig. 3. An LBP operator.
Fig. 4. A surprise expression image divided into small regions, from which LBP histograms are extracted and concatenated into the LBP descriptor.
After analyzing the LBP descriptors of all the depth face images, it was noticed that only some descriptor positions have values greater than zero across all face images. Thus, it is better to retain only those positions of the LBP descriptors of a face region, which fixes the standard dimension of the LBP descriptor for any face. The resulting LBP features of the depth faces are denoted by X.
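The LBP descriptor construction described above can be sketched as follows (an illustrative implementation, not the authors' code; the 8x8 grid and 256 histogram bins follow the description in this section):

```python
import numpy as np

def lbp_map(gray):
    """Compute the basic 8-neighbour LBP code for every interior pixel."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                       # centre pixels K_c
    # Eight neighbours, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += ((neighbour - c) >= 0).astype(np.int32) << bit   # s(K_i - K_c) * 2^i
    return codes

def lbp_descriptor(gray, grid=(8, 8), bins=256):
    """Divide the LBP map into grid cells and concatenate the cell histograms."""
    codes = lbp_map(gray)
    hists = []
    for row in np.array_split(codes, grid[0], axis=0):
        for cell in np.array_split(row, grid[1], axis=1):
            h, _ = np.histogram(cell, bins=bins, range=(0, bins))
            hists.append(h)
    return np.concatenate(hists)            # H = (H1, ..., Hr)
```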
2.3 GDA on LBP
Generalized Discriminant Analysis (GDA) is a discriminant analysis approach that produces an optimal discriminant function, which maps the input into a space in which the class membership of the samples is determined. The principal idea of GDA is to map the training depth faces into a high-dimensional feature space via a nonlinear kernel mapping Φ. The goal of GDA is to find the projection W that maximizes the criterion
J(W) = (W^T SB W) / (W^T ST W)
where SB and ST are the between-class and total scatter matrices, respectively, in the feature space. Thus, the depth face LBP features X of the different expressions are projected by GDA as
F = W^T Φ(X)
where F denotes the resulting discriminative expression features.
Fig. 5 shows an example 3-D plot of the GDA representation of the LBP features of all the facial expression depth images, showing good separation among the depth faces of the different classes.
Fig. 5. 3-D plot of LBP-GDA features of depth faces from four expressions.
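Exact kernel-based GDA is not available under that name in common libraries; the sketch below is a hedged stand-in that approximates the nonlinear kernel mapping with an explicit Nyström RBF feature map and then applies LDA in that space (scikit-learn; the gamma value and component counts are illustrative choices, not the paper's settings):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

def fit_gda_like(X_lbp, y, n_components=3, gamma=1e-4):
    """Approximate GDA: explicit RBF kernel map followed by LDA.

    X_lbp: (n_samples, d) LBP descriptors of the training depth faces.
    y:     (n_samples,) expression labels.
    """
    model = make_pipeline(
        Nystroem(kernel="rbf", gamma=gamma,
                 n_components=min(300, len(X_lbp))),
        LinearDiscriminantAnalysis(n_components=n_components),
    )
    model.fit(X_lbp, y)
    return model

# Projected discriminative features for training or test descriptors:
# F = model.transform(X_lbp)
```

With six expression classes, up to five discriminant components are available; three are kept here only to mirror the 3-D visualization in Fig. 5.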
2.4 HMMs for FER
To decode the depth information-based time-sequential facial expression features, discrete HMMs are employed. As discrete HMMs are usually trained and tested with discrete symbol sequences, the features need to be symbolized by comparing them with codebook vectors, where the codebook is built by the Linde-Buzo-Gray (LBG) clustering algorithm applied to the feature vectors of the training datasets [10]. Once the codebook is obtained, the index numbers of the codebook vectors are used as symbols for the discrete HMMs. As each facial expression image of a depth expression video clip is converted into one symbol, a clip of T frames results in T symbols. Fig. 6 shows the basic steps for codebook generation and symbol selection.
Fig. 6. Steps for codebook generation and symbol selection.
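The codebook generation and symbolization steps can be sketched as below; k-means is used here as a stand-in for the LBG procedure (both minimize the same quantization distortion), and the codebook size of 32 matches the experiments reported later. Names and parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_features, codebook_size=32, seed=0):
    """Cluster all training feature vectors; the cluster centres form the codebook."""
    km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
    km.fit(training_features)                 # (n_frames_total, feature_dim)
    return km

def symbolize(codebook, clip_features):
    """Replace each frame's feature vector by the index of its nearest codevector."""
    return codebook.predict(clip_features)    # length-T symbol sequence for a T-frame clip
```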
HMMs have been applied extensively to solve a large number of complex problems in various applications [10]. In this work, a separate HMM is trained for each expression. Fig. 7 depicts the HMM structure used in this work as well as the transition probabilities of the fear-expression HMM before and after training. To recognize a facial expression in a test video, the discrete observation symbol sequence S obtained from the corresponding time-sequential depth images is evaluated against all N trained expression HMMs, and the model with the highest likelihood Γ determines the recognized expression:
Γ = max(1 ≤ i ≤ N) P(S | Hi)
Fig. 7. HMM transition probabilities for the fear expression (a) before and (b) after training.
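A minimal sketch of the per-expression HMM training and the maximum-likelihood decision, assuming the hmmlearn package (its CategoricalHMM models discrete symbol sequences); the number of states and iterations are illustrative, not the paper's settings:

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM   # discrete-observation HMM in recent hmmlearn

def train_expression_hmms(train_symbols, n_states=4):
    """Train one discrete HMM per expression.

    train_symbols: dict mapping an expression name to a list of symbol
    sequences (one sequence of codebook indices per training clip).
    Assumes every codebook symbol appears in each expression's training
    data; otherwise the symbol alphabet size must be fixed explicitly.
    """
    models = {}
    for expression, sequences in train_symbols.items():
        X = np.concatenate(sequences).reshape(-1, 1)   # stack all clips
        lengths = [len(s) for s in sequences]          # clip boundaries
        hmm = CategoricalHMM(n_components=n_states, n_iter=100, random_state=0)
        hmm.fit(X, lengths)
        models[expression] = hmm
    return models

def recognize(models, symbols):
    """Return the expression whose HMM assigns the highest log-likelihood."""
    seq = np.asarray(symbols).reshape(-1, 1)
    return max(models, key=lambda name: models[name].score(seq))
```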
3. Experimental Results
To conduct the FER experiments, RGB and depth video databases were built for six different expressions, namely Surprise, Sad, Happy, Disgust, Anger, and Fear. Each expression video clip was of variable length, and each expression in each video starts and ends with a neutral expression. A total of 20 sequences from each expression were used to build the feature space onto which all the expression images were later projected. To train and test each facial expression model, 20 and 40 image sequences were used, respectively. For use with the HMMs, the features were symbolized using a codebook of size 32. To compare the proposed LBP-GDA features with the other conventional feature extraction methods, all methods were implemented with HMMs to recognize the aforementioned six facial expressions. RGB camera-based experiments were carried out first, followed by depth video experiments with the same experimental setup.
For the RGB video-based experiments, all the face images were first converted to gray scale. For the PCA features, the eigenvectors were computed from the whole dataset and the top 100 eigenvectors were selected to train the HMMs. The RGB video-based experimental results are reported in Table 1. The average recognition rate obtained using PCA is 58%, the lowest recognition rate in the experiments. LDA on PC features was then employed and obtained an improved average recognition rate of 61.50%. ICA highlights local features of the faces even when applied to holistic images, so it appears to be a better tool for obtaining features that are more relevant to the expressions. As presented in the table, the average recognition rate using the ICA representation of the facial expression images is 80.42%, which is higher than the PCA and PCA-LDA recognition rates. LBP was then applied to the database and achieved a recognition rate of 81.25%. To obtain more robust features, GDA was employed, as it finds the best linear as well as nonlinear discrimination among the datasets. Thus, LBP-GDA with HMMs achieved a total average recognition rate of 84.17%, showing its superiority over the other feature extraction methods by achieving the highest recognition rate.
Table 1. FER experimental results using different RGB face-based approaches (%)
For the depth camera-based experiments, the experimental setup was the same as for the RGB camera-based work. The depth video-based FER results are shown in Table 2. The average recognition rate using PCA on depth faces is 62%. Applying LDA on PC features, the improved average recognition rate is 65%. The mean recognition rate using the ICA representation of the depth facial expression images is 83.50%, which is higher than that of depth face-based FER with PCA as well as PCA-LDA features. LBP was then tried on the same database and achieved an average recognition rate of 89.17%. Finally, LBP-GDA was employed and achieved a superior average recognition rate of 95.83%. Thus, LBP-GDA features with HMMs for depth face-based FER showed superiority over the other feature extraction methods, including all RGB camera-based approaches, by achieving the highest recognition rate.
Table 2. FER experimental results using different depth face-based approaches (%)
Fig. 8 depicts the mean recognition performance of the different approaches applied to RGB and depth videos for the FER systems, where LBP-GDA features on depth faces show their superiority over all others.
Fig. 8. Mean recognition rates of different approaches based on RGB and depth input videos to FER systems.
4. Conclusion
FER has attracted many researchers in important research areas such as computer vision and image processing over the last decade. Human-computer interaction through facial expressions plays a very significant role in applications such as telemedicine, smart home healthcare, and social interaction. In such applications, an RGB camera may not be a good choice because of privacy concerns, which indicates the appropriateness of depth cameras for person-independent FER, where a person's identity can be hidden. A typical video-based FER system consists of three modules: preprocessing, feature extraction, and recognition. Facial expression features are very sensitive to noise and illumination, and hence the features of one expression can become merged with those of another expression; this leads to a weak FER system, since separating the features of different expressions becomes quite difficult. Therefore, we have focused on generating a very strong feature space that contributes significantly to separating the features of different expressions, which has led us to a robust FER system. Thus, in this work, a depth video-based robust FER system has been proposed using LBP-GDA features for facial expression feature extraction and HMMs for recognition. The proposed method has been compared with other traditional approaches and achieved remarkably improved performance over them. However, the proposed system can still be analyzed further for deployment in real-time environments with different complex parameters.
References
- Rahman M. T. and Kehtarnavaz N., “Real-Time Face-Priority Auto Focus for Digital and Cell-Phone Cameras,” IEEE Transactions on Consumer Electronics , Vol. 54 , No. 4 , pp.1506 -1513 , 2008. Article (CrossRef Link) https://doi.org/10.1109/TCE.2008.4711194
- Calder A. J., Burton A. M., Miller P., Young A. W., and Akamatsu S., “A principal component analysis of facial expressions,” Vision Research, Vol. 41, pp. 1179-1208, 2001. Article (CrossRef Link) https://doi.org/10.1016/S0042-6989(01)00002-5
- Yu P., Xu D., and Yu P., "Comparison of PCA, LDA and GDA for Palm print Verification," in Proc. of the International Conference on Information, Networking and Automation, pp.148-152, 2010. Article (CrossRef Link)
- Donato G., Bartlett M. S., Hagar J. C., Ekman P., and Sejnowski T. J., “Classifying Facial Actions,” IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 21, No. 10, pp. 974-989, 1999. Article (CrossRef Link) https://doi.org/10.1109/34.799905
- Dubuisson S., Davoine F., and Masson M., “A solution for facial expression representation and recognition,” Signal Processing: Image Communication, Vol. 17, pp. 657-673, 2002. Article (CrossRef Link) https://doi.org/10.1016/S0923-5965(02)00076-0
- Chao-Fa C. and Shin F. Y., “Recognizing Facial Action Units Using Independent Component Analysis and Support Vector Machine,” Pattern Recognition, Vol. 39, pp. 1795-1798, 2006. Article (CrossRef Link) https://doi.org/10.1016/j.patcog.2006.03.017
- Ojala T., Pietikäinen M., Mäenpää T., “Multiresolution gray scale and rotation invariant texture analysis with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell, Vol. 24, pp. 971-987, 2002. Article (CrossRef Link) https://doi.org/10.1109/TPAMI.2002.1017623
- Shan C., Gong S., and McOwan P., “Facial expression recognition based on local binary patterns: A comprehensive study,” Image Vis. Comput., Vol. 27, pp. 803-816, 2009. Article (CrossRef Link) https://doi.org/10.1016/j.imavis.2008.08.005
- Shen L., Bai L., and Fairhurst M., “Gabor wavelets and General Discriminant Analysis for face identification and verification,” Image and Vision Computing, Vol. 25, pp. 553-563, 2007. Article (CrossRef Link) https://doi.org/10.1016/j.imavis.2006.05.002
- Uddin M. Z. and Hassan M. M., "A Depth Video-Based Facial Expression Recognition System Using Radon Transform, Generalized Discriminant Analysis, and Hidden Markov Model," Multimedia Tools And Applications, Vol. 74, No. 11, pp. 3675-3690, 2015. Article (CrossRef Link) https://doi.org/10.1007/s11042-013-1793-1
- P. Yu, D. Xu, and P. Yu "Comparison of PCA, LDA and GDA for Palm print Verification," in Proc. of International Conference on Information, Networking and Automation, pp. 148-152, 2010. Article (CrossRef Link)
- W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3d points," in Proc. of Workshop on Human Activity Understanding from 3D Data, pp. 9-14, 2010. Article (CrossRef Link)
- W. Li, Z. Zhang, and Z. Liu, “Expandable data-driven graphical modeling of human actions based on salient postures,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 18, No. 11, pp. 1499-1510, 2008. Article (CrossRef Link) https://doi.org/10.1109/TCSVT.2008.2005597
- O. Oreifej and Z. Liu, "Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 716-723, 2013. Article (CrossRef Link)
- A. Vieira, E. Nascimento, G. Oliveira, Z. Liu, and M. Campos, "Stop: Space-time occupancy patterns for 3d action recognition from depth map sequences," in Proc. of Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 252-259, 2012. Article (CrossRef Link)
- J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, "Robust 3d action recognition with random occupancy patterns," in Proc. of European Conference on Computer Vision, pp. 872-885, 2012. Article (CrossRef Link)
- X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proc. of ACM International Conference on Multimedia, pp. 1057-1060, 2012. Article (CrossRef Link)
- J. Lei, X. Ren, and D. Fox, "Fine-grained kitchen activity recognition using rgb-d," in Proc. of ACM Conference on Ubiquitous Computing, pp. 208-211, 2012. Article (CrossRef Link)
- A. Jalal, M.Z. Uddin, J.T. Kim, and T.S. Kim, “Recognition of human home activities via depth silhouettes and R transformation for smart homes,” Indoor and Built Environment, vol. 21, no. 1, pp. 184-190, 2011. Article (CrossRef Link) https://doi.org/10.1177/1420326X11423163
- Y. Wang, K. Huang, and T. Tan, "Human activity recognition based on R transform," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007. Article (CrossRef Link)
- H.S. Koppula, R. Gupta, and A. Saxena, “Human activity learning using object affordances from rgb-d videos,” International Journal of Robotics Research, vol. 32, no. 8, pp. 951-970, 2013. Article (CrossRef Link) https://doi.org/10.1177/0278364913478446
- X. Yang and Y. Tian, "Eigenjoints-based action recognition using naive-bayes-nearest-neighbor," in Proc. of Workshop on Human Activity Understanding from 3D Data, pp. 14-19, 2012. Article (CrossRef Link)
- J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from rgbd images," in Proc. of IEEE International Conference on Robotics and Automation, pp. 842-849, 2012. Article (CrossRef Link)
- A. McCallum, D. Freitag, and F.C.N. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proc. of International Conference on Machine Learning, pp. 591-598, 2000. Article (CrossRef Link)
- H. Hamer, K. Schindler, E. Koller-Meier, and L. Van Gool, "Tracking a hand manipulating an object," in Proc. of IEEE International Conference on Computer Vision, pp. 1475-1482, 2009. Article (CrossRef Link)
- I. Oikonomidis, N. Kyriazis, and A.A. Argyros, "Tracking the articulated motion of two strongly interacting hands," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1862-1869, 2012. Article (CrossRef Link)
- H. Hamer, J. Gall, T. Weise, and L. Van Gool, "An object-dependent hand pose prior from sparse training data," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition. pp. 671-678, 2010. Article (CrossRef Link)
- S. Ong and S. Ranganath, “Automatic sign language analysis: A survey and the future beyond lexical meaning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 873-891, 2005. Article (CrossRef Link) https://doi.org/10.1109/TPAMI.2005.112
- T. Pei, T. Starner, H. Hamilton, I. Essa, and J. Rehg, "Learning the basic units in American Sign Language using discriminative segmental feature selection," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4757-4760, 2009. Article (CrossRef Link)
- H.D. Yang, S. Sclaroff, and S.W. Lee, “Sign language spotting with a threshold model based on conditional random fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no.7, pp.1264-1277, 2009. Article (CrossRef Link) https://doi.org/10.1109/TPAMI.2008.172
- J. Bukhari, M. Rehman, S. I. Malik, A. M. Kamboh, and A. Salman, “American Sign Language Translation through Sensory Glove; SignSpeak,” International Journal of u- and eService, Science and Technology, Vol.8, No.1, pp.131-142, 2015. Article (CrossRef Link) https://doi.org/10.14257/ijunesst.2015.8.1.12