1. Introduction
Metrology, the science of measuring real-world quantities, has been investigated extensively in computer vision for many applications. Measuring the geometric parameters of objects from video has attracted growing interest in the computer vision field in recent years [1-2]. With the increasing use of video surveillance systems [3], more and more crimes and incidents have been captured on video. When an incident has been captured, we need to gain an understanding of the events or identify a particular individual.
As height is an important attribute of a person, several methods have been presented for estimating height information from video [4-5]. They can be roughly divided into two categories: absolute measurement and relative measurement. Absolute measurement requires a fully calibrated camera, which entails a complicated process [6]. Relative measurement requires only minimal calibration. Guo and Chellappa [7] presented a video metrology approach using an uncalibrated single camera that is either stationary or in planar motion. Their method leverages object motion in videos to acquire calibration information for measurement, assumes no constant-velocity motion, and incorporates the measurements from individual video frames to improve the accuracy of the final measurement.
Several automatic mensuration algorithms have been developed to take advantage of tracking results from video sequences. Renno et al. [8] used the projected sizes of pedestrians to estimate the vanishing line of the ground plane. Bose and Grimson [9] proposed a method that uses constant-velocity trajectories of objects to derive vanishing lines for recovering the reference plane and planar rectification; the basic idea of their algorithm is to use the additional constraint brought by the constant-velocity assumption, which is not always satisfied in surveillance sequences. Shao et al. [10] proposed a minimally supervised algorithm based on monocular videos and uncalibrated stationary cameras. The authors recovered the minimal calibration of the scene by tracking moving objects, then applied the single view metrology algorithm to each frame, and finally fused the multi-frame measurements using LMedS as the cost function and RMSA as the optimization algorithm.
However, most of the existing approaches are direct extensions of image-based algorithms; they do not consider occlusions between objects and lack robustness. Reliable tracking of multiple objects in complex situations is a challenging visual surveillance problem, since a high density of objects results in occlusion. When occlusion between multiple objects is common, it is extremely difficult to measure the heights of the objects.
In this paper, we propose a new method for the height measurement of multiple objects based on robust tracking. First, the foreground likelihood map is obtained using the codebook background modeling algorithm. Second, multiple objects are tracked by a combined tracking algorithm. Then, the vanishing line of the ground plane and the vertical vanishing point are computed, and the head and feet feature points are extracted in each frame of the video sequences. Finally, we obtain the height measurements of multiple objects according to the projective geometric constraint, and the multi-frame measurements are fused using the RANSAC algorithm.
Compared with other popular methods, our proposed algorithm does not require calibrating the camera and can track multiple moving objects in crowded scenes. Therefore, it reduces complexity and improves accuracy simultaneously. The experimental results demonstrate that our method is effective and robust in the presence of occlusion.
The organization of this paper is as follows. In Section 2, we introduce the multi-target detecting and tracking algorithm. Section 3 addresses video-based height measurement of multiple moving objects. Section 4 presents experimental results. Section 5 concludes this paper.
2. Multi-Target Detecting and Tracking Algorithm
2.1 Multi-Target Detecting Algorithm
The capability of extracting moving objects from a video sequence captured by a static camera is a typical first step in visual surveillance. A common approach to discriminating moving objects from the background is detection by background subtraction [11-12]. The idea of background subtraction is to subtract or difference the current image from a reference background model; the subtraction identifies non-stationary or new objects. The generalized mixture of Gaussians (MOG) has been used to model complex, non-static backgrounds, but MOG has some disadvantages: backgrounds with fast variations are not easily modeled accurately with just a few Gaussians, and it may fail to provide sensitive detection.
In this paper, the codebook algorithm is used to model the background. The algorithm builds an adaptive and compact background model that can capture structural background motion over a long period of time under limited memory, which allows us to encode moving backgrounds or multiple changing backgrounds. At the same time, the algorithm can cope with local and global illumination changes.
A quantization/clustering technique is adopted to construct the background model in the codebook algorithm. The samples at each pixel are clustered into a set of codewords, and the background is encoded on a pixel-by-pixel basis.
Let $X = \{x_1, x_2, \ldots, x_N\}$ be a training sequence for a single pixel consisting of $N$ RGB vectors. Let $C = \{c_1, c_2, \ldots, c_L\}$ represent the codebook for the pixel, consisting of $L$ codewords. Each pixel has a different codebook size depending on its sample variation. Each codeword $c_i$ ($i = 1, \ldots, L$) consists of an RGB vector $v_i = (\bar{R}_i, \bar{G}_i, \bar{B}_i)$ and a 6-tuple $aux_i = \langle \check{I}_i, \hat{I}_i, f_i, \lambda_i, p_i, q_i \rangle$. The tuple $aux_i$ contains the minimum and maximum brightness values and temporal variables (frequency, maximum negative run-length, and first and last access times) as defined in [11].
In the training period, each value $x_t$ sampled at time $t$ is compared to the current codebook to determine which codeword $c_m$ (if any) it matches, where $m$ is the index of the matching codeword. We use the matched codeword as the sample's encoding approximation. To determine which codeword is the best match, we employ a color distortion measure and brightness bounds.
Given an input pixel $x_t = (R, G, B)$ and a codeword $c_i$ with $v_i = (\bar{R}_i, \bar{G}_i, \bar{B}_i)$, define

$$\|x_t\|^2 = R^2 + G^2 + B^2, \quad \|v_i\|^2 = \bar{R}_i^2 + \bar{G}_i^2 + \bar{B}_i^2, \quad \langle x_t, v_i \rangle^2 = (\bar{R}_i R + \bar{G}_i G + \bar{B}_i B)^2.$$

The color distortion can be calculated by

$$\mathrm{colordist}(x_t, v_i) = \sqrt{\|x_t\|^2 - p^2}, \qquad p^2 = \frac{\langle x_t, v_i \rangle^2}{\|v_i\|^2}. \tag{1}$$

The logical brightness function is defined as

$$\mathrm{brightness}\big(I, \langle \check{I}, \hat{I} \rangle\big) = \begin{cases} \text{true}, & \text{if } I_{\mathrm{low}} \le \|x_t\| \le I_{\mathrm{hi}} \\ \text{false}, & \text{otherwise,} \end{cases} \tag{2}$$

where $I_{\mathrm{low}} = \alpha \hat{I}$ and $I_{\mathrm{hi}} = \min\{\beta \hat{I}, \check{I}/\alpha\}$.
The detailed algorithm of constructing codebook is given in [11].
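As a concrete illustration, the following Python sketch implements the color distortion of Eq. (1) and the brightness test of Eq. (2); the parameter values $\alpha$ and $\beta$ shown here are illustrative assumptions, not the tuned settings of our experiments.

```python
import numpy as np

def colordist(x, v):
    """Color distortion, Eq. (1), between pixel x=(R,G,B) and codeword mean v."""
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_x2 = np.dot(x, x)                   # ||x_t||^2
    p2 = np.dot(x, v) ** 2 / np.dot(v, v)    # squared projection of x onto v
    return np.sqrt(max(norm_x2 - p2, 0.0))   # guard against tiny negative round-off

def brightness_ok(I, I_min, I_max, alpha=0.5, beta=1.2):
    """Logical brightness test, Eq. (2): is the brightness I within the band?"""
    I_low = alpha * I_max
    I_hi = min(beta * I_max, I_min / alpha)
    return I_low <= I <= I_hi
```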
We segment the foreground by subtracting the current image from the background model. Given a new input pixel $x = (R, G, B)$ and its codebook $M$, we set $match = 1$ if a codeword $c_m$ satisfying both the color distortion condition $\mathrm{colordist}(x, v_m) \le \varepsilon$ and the brightness condition is found, and $match = 0$ otherwise. The subtraction operation $BGS(x)$ for the pixel is then defined as

$$BGS(x) = \begin{cases} \text{background}, & \text{if } match = 1 \\ \text{foreground}, & \text{if } match = 0. \end{cases} \tag{3}$$
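Using the two tests above, a minimal sketch of the subtraction operation of Eq. (3) might look as follows. The per-pixel codebook layout (a list of `(v, I_min, I_max)` triples) and the distortion threshold `eps` are simplifying assumptions; the full codeword structure and update rules are given in [11].

```python
import numpy as np

def bgs(x, codebook, eps=10.0):
    """Eq. (3): label a pixel by searching its codebook for a matching codeword."""
    I = np.linalg.norm(np.asarray(x, dtype=float))  # pixel brightness ||x||
    for v, I_min, I_max in codebook:
        if colordist(x, v) <= eps and brightness_ok(I, I_min, I_max):
            return "background"   # a matching codeword was found (match = 1)
    return "foreground"           # no codeword matched (match = 0)
```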
Fig. 1 compares the foreground likelihood maps obtained using different methods on an indoor data set. Fig. 1(a) is an image extracted from an indoor video. Fig. 1(b) depicts the foreground likelihood map of the image using the mixture-of-Gaussians algorithm, and Fig. 1(c) the map obtained using the codebook-based method.
Fig. 1. Comparison of foreground likelihood maps obtained using different methods
2.2 Multi-Target Tracking Algorithm
Tracking multiple people accurately in cluttered and crowded scenes is a challenging task, primarily due to occlusion between people [13-14]. The particle filter works well when an object is occluded, but it has difficulty satisfying the requirements of real-time computation. Mean shift is computationally efficient, but it has poor robustness during mutual occlusion. To address these problems, this section proposes a robust multi-target tracking algorithm that combines the particle filter with the mean-shift method.
Particle filters provide an approximate Bayesian solution to the discrete-time recursive estimation problem by updating an approximate description of the posterior filtering density [15].
At time $k$, a measurement $z_k$ becomes available, giving the measurement history $z_{1:k} = \{z_1, z_2, \ldots, z_k\}$. Assume that the probability density function $p(x_{k-1}|z_{1:k-1})$ is available at time $k-1$. According to Bayes' rule, the posterior density of the state vector can be calculated using the following equations. The prediction step computes

$$p(x_k|z_{1:k-1}) = \int p(x_k|x_{k-1})\, p(x_{k-1}|z_{1:k-1})\, dx_{k-1}. \tag{4}$$

This is the prior of the state $x_k$ at time $k$ without the knowledge of the measurement $z_k$, i.e. the probability given only previous measurements. The update step combines the likelihood of the current measurement with the predicted state:

$$p(x_k|z_{1:k}) = \frac{p(z_k|x_k)\, p(x_k|z_{1:k-1})}{p(z_k|z_{1:k-1})}. \tag{5}$$

$p(z_k|z_{1:k-1})$ is a normalizing constant. It can be calculated by

$$p(z_k|z_{1:k-1}) = \int p(z_k|x_k)\, p(x_k|z_{1:k-1})\, dx_k. \tag{6}$$

Because $p(z_k|z_{1:k-1})$ is a constant, (5) can be written as

$$p(x_k|z_{1:k}) \propto p(z_k|x_k)\, p(x_k|z_{1:k-1}). \tag{7}$$
Suppose that at time step $k$ there is a set of particles $\{x_k^i, i = 1, \ldots, N\}$ with associated weights $\{\omega_k^i, i = 1, \ldots, N\}$ drawn randomly by importance sampling, where $N$ is the total number of particles. The weight of particle $i$ can be defined as

$$\omega_k^i \propto \omega_{k-1}^i\, \frac{p(z_k|x_k^i)\, p(x_k^i|x_{k-1}^i)}{q(x_k^i|x_{k-1}^i, z_{1:k})}. \tag{8}$$

We use the transition prior $p(x_k|x_{k-1})$ as the importance density function $q(x_k^i|x_{k-1}^i, z_{1:k})$. Then we can simplify (8) as

$$\omega_k^i \propto \omega_{k-1}^i\, p(z_k|x_k^i). \tag{9}$$
Furthermore, if we use Grenander's factored sampling algorithm, in which the particle set is resampled at every step so that the weights are reset to $1/N$, Eq. (9) can be modified as

$$\omega_k^i \propto p(z_k|x_k^i). \tag{10}$$
The particle weights can then be normalized using

$$\tilde{\omega}_k^i = \frac{\omega_k^i}{\sum_{j=1}^{N} \omega_k^j} \tag{11}$$

to give a weighted approximation of the posterior density in the following form:

$$p(x_k|z_{1:k}) \approx \sum_{i=1}^{N} \tilde{\omega}_k^i\, \delta(x_k - x_k^i), \tag{12}$$

where $\delta$ is the Dirac delta function.
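With the transition prior as importance density and per-step resampling, Eqs. (8)-(12) reduce to the familiar sequential importance resampling (SIR) loop. The sketch below is a generic illustration rather than our exact implementation; `propagate` and `likelihood` stand for the dynamic and observation models defined in the following paragraphs.

```python
import numpy as np

def sir_step(particles, z, propagate, likelihood, rng):
    """One SIR step: sample from p(x_k|x_{k-1}), weight by p(z_k|x_k), resample."""
    N = len(particles)
    particles = np.array([propagate(x, rng) for x in particles])  # prediction, Eq. (4)
    w = np.array([likelihood(z, x) for x in particles])           # weights, Eq. (10)
    w /= w.sum()                                                  # normalization, Eq. (11)
    idx = rng.choice(N, size=N, p=w)                              # resampling
    return particles[idx]                                         # unweighted set {x_k^i, 1/N}
```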
The mean-shift algorithm was first analyzed in [16] and developed in [17]. Mean shift is a non-parametric statistical approach that seeks the mode of a density distribution in an iterative procedure [18]. Let $X$ denote the current location; its new location $X'$ after one iteration is

$$X' = \frac{\sum_{i=1}^{N} a_i\, \omega(a_i)\, g\!\left( \left\| \frac{X - a_i}{h} \right\|^2 \right)}{\sum_{i=1}^{N} \omega(a_i)\, g\!\left( \left\| \frac{X - a_i}{h} \right\|^2 \right)}, \tag{13}$$

where $\{a_i, i = 1, \ldots, N\}$ are the pixel locations within the rectangular area specified by the current location $X$, $\omega(a_i)$ is the weight associated with each pixel $a_i$, $g(x)$ is a kernel profile function, and $h$ is the window radius used to normalize the coordinates $a_i$.
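A compact sketch of one iteration of Eq. (13) is given below. It assumes a uniform kernel profile ($g = 1$ inside the unit ball, 0 outside), so the update reduces to a weighted centroid of the pixels in the window; it also assumes the window contains at least one weighted pixel.

```python
import numpy as np

def meanshift_step(X, points, pixel_weights, h):
    """One mean-shift update, Eq. (13), under a uniform kernel profile."""
    offsets = (points - X) / h                  # normalized offsets (X - a_i)/h
    g = np.sum(offsets ** 2, axis=1) <= 1.0     # uniform profile g(||.||^2)
    w = pixel_weights * g                       # combined per-pixel weights
    return (points * w[:, None]).sum(axis=0) / w.sum()
```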
In our tracking algorithm, we assume that the state transition dynamics correspond to the following second-order autoregressive process:

$$x_k = A x_{k-1} + B x_{k-2} + C n_k, \tag{14}$$

where $A$, $B$, $C$ are the autoregression coefficients and $n_k$ is Gaussian noise.
We use an HSV color histogram to build the observation model. Given the current observation $z_k$, the candidate color histogram $Q(x_k)$ is calculated on $z_k$ in the region specified by $x_k$. The similarity between $Q(x_k)$ and the reference color histogram $Q^*$ is measured by the Bhattacharyya distance $d(\cdot,\cdot)$. The likelihood distribution is evaluated as

$$p(z_k|x_k) \propto \exp\!\big( -\lambda\, d^2\big(Q^*, Q(x_k)\big) \big), \tag{15}$$

where $\lambda$ is a constant.
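The following sketch shows how the observation model of Eq. (15) might be realized; the 8x8x8 HSV binning, the OpenCV helpers, and the scale $\lambda = 20$ are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np
import cv2  # OpenCV, assumed available

def hsv_histogram(frame_bgr, box, bins=(8, 8, 8)):
    """Normalized HSV color histogram Q(x_k) of the region box=(x, y, w, h)."""
    x, y, w, h = box
    hsv = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256]).ravel()
    return hist / hist.sum()

def color_likelihood(q_candidate, q_ref, lam=20.0):
    """Eq. (15): likelihood from the Bhattacharyya distance, d^2 = 1 - rho."""
    rho = np.sum(np.sqrt(q_candidate * q_ref))  # Bhattacharyya coefficient
    return np.exp(-lam * (1.0 - rho))
```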
In our method, the mean-shift algorithm is applied to every sample in the sample set, which greatly reduces the computational cost of particle filtering; mean shift alone, however, might fail to capture the true location of the objects during mutual occlusion, and the particle filter improves the robustness of the algorithm. We propagate the particles $\{x_{k-1}^i, i = 1, \ldots, N\}$ according to the state transition dynamics of Eq. (14) to obtain the predicted sample set. Each predicted sample is then shifted by mean shift according to Eq. (13). With the mean-shifted samples, we update their weights $\{\omega_k^i, i = 1, \ldots, N\}$ according to Eq. (10), where the likelihood distribution $p(z_k|x_k^i)$ is given by Eq. (15). Then we resample and generate the unweighted sample set $\{x_k^i, 1/N\}_{i=1,\ldots,N}$. In Fig. 2 the tracking results are demonstrated for outdoor video sequences in different frames; a sketch of one combined tracking step is given after Fig. 2.
Fig. 2. Tracking results for test video sequences
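Putting the pieces together, one step of the combined tracker could be sketched as below. The AR(2) coefficients (a constant-velocity choice $A = 2$, $B = -1$) and the helpers `run_meanshift` (a few iterations of the mean-shift update above) and `to_box` (state vector to image rectangle) are hypothetical stand-ins for our implementation details.

```python
import numpy as np

def combined_tracking_step(x_prev, x_prev2, frame, q_ref, rng):
    """Propagate by Eq. (14), refine by mean shift (Eq. (13)), reweight, resample."""
    N = len(x_prev)
    A, B, C = 2.0, -1.0, 5.0                                    # illustrative AR(2) coefficients
    x_pred = A * x_prev + B * x_prev2 + C * rng.standard_normal(x_prev.shape)
    x_ms = np.array([run_meanshift(frame, x) for x in x_pred])  # hypothetical helper
    w = np.array([color_likelihood(hsv_histogram(frame, to_box(x)), q_ref)
                  for x in x_ms])                               # Eq. (15) via Eq. (10)
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)                            # resample to equal weights
    return x_ms[idx]
```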
3. Video-based Height Measurements of Multiple Moving Objects
3.1 Projective Geometry
In this section, we introduce the main projective geometric ideas and notation required to understand our measurement algorithm. We use upper-case letters to indicate points in the world coordinate system and the corresponding lower-case letters for their images.
Fig. 3 shows the basic geometry of the scene. A line segment in space, orthogonal to the ground plane and identified by its top point $H_i$ and base point $F_i$, is denoted by $H_iF_i$, and its length is denoted by $d(H_i, F_i)$. $H_iF_i$ is projected onto the image plane as the line segment $h_if_i$. The line $l$ is the vanishing line of the ground plane, and $v$ is the vertical vanishing point. Given one reference height $d(H_1, F_1) = d_1$ in the scene, the height of any object on the ground plane (e.g. $d_2$) can be measured using the geometric method shown in Fig. 3(b).
Fig. 3. Basic geometry of the scene
The measurement is achieved in two steps. In the first step, we map the length of the line segment $h_1f_1$ onto the line through $v$ and $f_2$. The intersection of the line through the two base points $f_1$ and $f_2$ with the vanishing line $l$ determines the point $u$, and the intersection of the line through $h_1$ and $u$ with the line through $v$ and $f_2$ determines the point $i$. Because $u$ and $v$ are vanishing points, $h_1i$ is parallel to $f_1f_2$ in the scene and $h_1f_1$ is parallel to $if_2$; hence $h_1$, $i$, $f_2$ and $f_1$ are the images of the vertices of a parallelogram, and the scene length corresponding to $d(i, f_2)$ equals $d_1$. We now have four collinear image points $v$, $h_2$, $i$, $f_2$ on an imaged scene line, and thus a cross ratio is available. In the second step, we compute the ratio $d_2 : d_1$ from the cross ratio of these four points, which is invariant under the 1-D projective transformation between the scene line and its image [19].
The ratio between the heights corresponding to the two line segments $h_2f_2$ and $if_2$ is given by the cross ratio

$$r = \frac{d(h_2, f_2)\; d(v, i)}{d(i, f_2)\; d(v, h_2)} = \frac{d_2}{d_1}, \tag{16}$$

with $d_2 = r\, d_1$. The height of any object on the ground plane can be measured using this method.
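In code, Eq. (16) is only a few lines; the sketch below assumes the four collinear image points are already available as 2-D coordinates.

```python
import numpy as np

def height_from_cross_ratio(v, h2, i, f2, d1):
    """Eq. (16): recover d2 from the cross ratio of v, h2, i, f2 and reference d1."""
    d = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    r = (d(h2, f2) * d(v, i)) / (d(i, f2) * d(v, h2))
    return r * d1  # d2 = r * d1
```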
With the assumption of perfect projection, e.g. with a pinhole camera, a set of parallel lines in the scene is projected onto a set of image lines that meet in a common point. This point of intersection, which may lie at infinity, is called the vanishing point. Different approaches are adopted to detect vanishing points for the reference direction, depending on the environment of the video data sets.
In the pinhole camera model, the vanishing line of the ground plane can be determined as the line through two or more vanishing points of the plane. If we have $N$ vertical poles of the same height in the perspective view, the vertical vanishing point $V_Y$ can be computed simply by finding the intersection of two (or more) poles, and the vanishing line of the ground plane $V_L$ is the line formed by the intersection points of the lines connecting the tops and the bottoms of pairs of poles. Thus, we can fix the vanishing line from three (or more) non-coplanar poles. In this paper, we denote the poles by $\{(t_i, b_i)\}_{i=1,2,\ldots,N}$, where $t_i, b_i$ represent the image positions of the top and bottom of pole $i$, respectively, and $\{(\Sigma_{t_i}, \Sigma_{b_i})\}_{i=1,2,\ldots,N}$ are the associated covariance matrices. $V_Y$ can be fixed by finding the point $v$ that minimizes the sum of distances from $t_i$ and $b_i$ to the line linking $m_i$ and $v$:

$$V_Y = \arg\min_{v} \sum_{i=1}^{N} \big[ d(t_i, l_i(v)) + d(b_i, l_i(v)) \big], \tag{17}$$

where $m_i$ is the midpoint of $t_i$ and $b_i$, and $l_i(v)$ is the line determined by $m_i$ and $v$.
$V_L$ can then be computed by fitting a line to the intersection points $\{x_i\}$, weighted by their uncertainties:

$$V_L = \arg\min_{w_{V_L},\, b_{V_L}} \sum_i \frac{d^2(x_i, V_L)}{\operatorname{tr}(\Sigma_i)}, \tag{18}$$

where $w_{V_L}$ is the unit direction vector of $V_L$, $b_{V_L}$ is a point on the vanishing line, and $d(x_i, V_L)$ is the perpendicular distance from $x_i$ to the line.
The point $x_i$ is the intersection of the top line $t_jt_k$ and the bottom line $b_jb_k$ of a pair of poles. The covariance matrix $\Sigma_i$ of $x_i$ can be computed using the Jacobian as

$$\Sigma_i = J\, \operatorname{diag}(\Sigma_{t_j}, \Sigma_{b_j}, \Sigma_{t_k}, \Sigma_{b_k})\, J^{\top}, \tag{19}$$

where $J = \partial x_i / \partial(t_j, b_j, t_k, b_k)$ is the Jacobian of the intersection point with respect to the endpoints of the two lines.
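The constructions in this subsection are conveniently expressed in homogeneous coordinates, where the line through two points and the intersection of two lines are both cross products. The sketch below illustrates this for a pair of poles; the simple unweighted line fit mentioned at the end stands in for the covariance-weighted estimates of Eqs. (17)-(19).

```python
import numpy as np

def hom(p):
    """Lift an image point (x, y) to homogeneous coordinates."""
    return np.array([p[0], p[1], 1.0])

def intersect(a1, a2, b1, b2):
    """Intersection of line a1a2 with line b1b2 via homogeneous cross products."""
    x = np.cross(np.cross(hom(a1), hom(a2)), np.cross(hom(b1), hom(b2)))
    return x[:2] / x[2]

# Vertical vanishing point from two poles (t_j, b_j) and (t_k, b_k):
#   v_y = intersect(t_j, b_j, t_k, b_k)
# A vanishing-line point from the top and bottom lines of two equal-height poles:
#   x_i = intersect(t_j, t_k, b_j, b_k)
# Collecting the x_i over pole pairs, V_L can then be fit by least squares
# (e.g. np.polyfit) when the covariance weighting is ignored.
```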
3.2 Extracting Head and Feet Feature Points from Moving Objects
Humans are roughly vertical while they stand or walk. However, because human walking is an articulated motion, the shape and height of a human vary across the walking phases. As shown in Fig. 4, at the phase in which the two legs cross each other, the height measured from the video sequence is greatest, and it is also the most appropriate height to represent the human's static height.
Fig. 4. The height of a human varies periodically during the walking cycle
The phase at which the two feet cross each other (leg-crossing) is of particular interest, in that the feet position is relatively easy to locate and the shape is relatively insensitive to viewpoint. Thus, we aim to extract the head and feet locations at the leg-crossing phases. We first detect a walking human in the video sequence by change detection. Then, we extract the leg-crossing phases by temporal analysis of the object shape. Finally, we compute the principal axis of the human's body and locate the head and feet positions at those phases.
For every frame $t$, the head feature point $h_t^i$ of object $i$ ($i = 1, 2, \ldots, N$) can be obtained using the following steps.
Step 1. Construct the target likelihood matrix $L_t^i$ corresponding to the foreground blob $B_t^i(w_i, h_i)$, where $w_i$ and $h_i$ denote the width and height of the foreground blob $B_t^i$, respectively.
Step 2. Compute the covariance matrix $C_t^i$ of the target likelihood matrix $L_t^i$. Its entries can be computed as

$$C_t^i(m, n) = \frac{1}{h_i} \sum_{r=1}^{h_i} \big( L_t^i(r, m) - \bar{L}_t^i(m) \big) \big( L_t^i(r, n) - \bar{L}_t^i(n) \big), \tag{20}$$

where $L_t^i(m)$ and $L_t^i(n)$ denote the $m$-th and $n$-th columns of the foreground target matrix at frame $t$, and $\bar{L}_t^i(\cdot)$ is the column mean.
Step 3. Compute the first eigenvector of the covariance matrix $C_t^i$. The first eigenvector and the centroid of the blob give the principal axis $P_t^i$ of the target's body. The head feature point is assumed to be located on the principal axis.
Step 4. Project the target blob $B_t^i$ onto its corresponding principal axis $P_t^i$. Locate the head feature point $h_t^i$ by finding, from top to bottom along the principal axis, the first end point whose projection count is above a threshold.
Humans are roughly vertical at the different phases of a walking cycle. This means that the head feature point, the feet feature point and the vertical vanishing point are collinear. We obtain the feet feature point $f_i$ of the target by applying this collinearity constraint. In homogeneous coordinates, $f_i$ can be computed as

$$f_i = (h_i \times V_Y) \times l_b(i), \tag{21}$$

where $h_i$ denotes the head feature point of object $i$, $V_Y$ denotes the vertical vanishing point, and $l_b(i)$ denotes the bottom line of the blob.
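A simplified sketch of Steps 1-4 and Eq. (21) follows: PCA on the foreground pixel coordinates gives the principal axis, the head is taken as the extremal blob point along that axis (a simplification of the projection-count test in Step 4), and the feet follow from the collinearity constraint.

```python
import numpy as np

def head_and_feet(mask, v_y):
    """Head via the blob's principal axis; feet via f = (h x V_Y) x l_b, Eq. (21)."""
    ys, xs = np.nonzero(mask)                       # foreground pixel coordinates
    pts = np.stack([xs, ys], axis=1).astype(float)
    mu = pts.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov((pts - mu).T))
    axis = evecs[:, np.argmax(evals)]               # principal axis direction
    if axis[1] > 0:                                 # orient toward the top of the image
        axis = -axis
    head = pts[np.argmax((pts - mu) @ axis)]        # extremal point along the axis
    h = np.array([head[0], head[1], 1.0])           # homogeneous head point
    vy = np.array([v_y[0], v_y[1], 1.0])            # vertical vanishing point
    l_b = np.array([0.0, 1.0, -float(ys.max())])    # blob bottom line y = y_max
    f = np.cross(np.cross(h, vy), l_b)              # Eq. (21)
    return head, f[:2] / f[2]
```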
3.3 Multiple-Frame Fusion
The measurements from multiple frames include outlier observations due to tracking errors, articulated motion, and occlusions, which makes the mean of the multi-frame estimates unreliable. The RANSAC technique has the well-known property of being less sensitive to outliers. Thus, in this paper, we use the RANSAC algorithm to estimate the height from a set of data contaminated by outliers.
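For scalar heights, a minimal RANSAC sketch is shown below: each candidate model is a single sampled measurement, inliers are measurements within a tolerance, and the largest consensus set is averaged. The iteration count and tolerance are illustrative assumptions.

```python
import numpy as np

def ransac_height(measurements, iters=200, tol=0.05, seed=0):
    """Fuse multi-frame height estimates, ignoring outliers via consensus."""
    rng = np.random.default_rng(seed)
    m = np.asarray(measurements, dtype=float)
    best = np.empty(0)
    for _ in range(iters):
        candidate = rng.choice(m)                  # minimal sample: one measurement
        inliers = m[np.abs(m - candidate) < tol]   # consensus set
        if inliers.size > best.size:
            best = inliers
    return best.mean()
```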
4. Experimental Results and Analysis
In order to show the robustness of the height measurement algorithm discussed in this paper, we conducted several experiments with data that we collected from stationary cameras under different environments. The moving objects include vehicles and humans. Given the limited space, in this section we only present three of them to show the experimental results and the statistics of the data. The number of particles used in our method is 300 for all experiments.
The algorithm is implemented on the Windows 7 operating system using MATLAB as the software platform. The computer is configured with an AMD Athlon(TM) X2 Dual-Core QL-62 2.00 GHz processor and 1.74 GB of memory.
The results of height measurements for test video 1 are shown in Fig. 5.
Fig. 5. The results of height measurement for test video 1
Statistics of the measured value for test video 1 are shown in Table 1.
Table 1. Statistics of the measured value for test video 1
The results of height measurement for test video 2 are shown in Fig. 6. The tracked blobs are object 1, object 2, and object 3, from right to left, respectively. The heights of the objects are shown on top of the blobs.
Fig. 6. The results of height measurement for test video 2
Statistics of the measured value for test video 2 are shown in Table 2.
Table 2. Statistics of the measured value for test video 2
From the experimental results, we can see that our algorithm shows better accuracy and robustness than the algorithm proposed in [10].
The results of height measurement for test video 3 are shown in Fig. 7.
Fig. 7. The results of height measurement for test video 3
Statistics of the measured value for test video 3 are shown in Table 3.
Table 3. Statistics of the measured value for test video 3
From the experimental results, we can see that our proposed algorithm does not require calibrating the camera and can track multiple moving objects when occlusion occurs. Therefore, it reduces the complexity of calculation and improves the accuracy of measurement simultaneously.
5. Conclusion
We have presented a new algorithm for estimating the height of multiple moving objects. We first compute the vanishing line of the ground plane and the vertical vanishing point. Second, multiple moving objects are detected and tracked. Then, the head and feet feature points are extracted in each frame of the video sequences, and the height measurements of multiple objects are obtained according to the projective geometric constraint. Finally, the multi-frame measurements are fused using the RANSAC algorithm. The experimental results demonstrate that our method is effective and robust to occlusion. This is a preliminary study, and further work is required.
References
- J. Cai, R. Walker, "Height estimation from monocular image sequences using dynamic programming with explicit occlusions," IET Comput. Vis., vol. 4, no.4, pp. 149-161, 2010. https://doi.org/10.1049/iet-cvi.2009.0063
- A. Criminisi, "Accurate visual metrology from single and multiple uncalibrated images," Distinguished dissertation. New York: Springer-Verlag, Sep. 2001.
- W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Trans. on Systems, Man, and Cybernetics-Part C: Applications and Reviews, vol. 34, no. 3, pp. 334-352, 2004. https://doi.org/10.1109/TSMCC.2004.829274
- A. Criminisi, I. Reid, and A. Zisserman, "Single view metrology," Int. J. Comput. Vis., vol.40, no.2, pp. 123-148, 2000. https://doi.org/10.1023/A:1026598000963
- I. Reid and A. Zisserman, "Goal-directed video metrology," in Proc. of 4th European Conference on Computer Vision (ECCV), pp. 647-658, April 15-18, 1996.
- Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol.22, no.11, pp. 1330 - 1334, 2000. https://doi.org/10.1109/34.888718
- F. Guo and R. Chellappa, "Video metrology using a single camera," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no.7, pp. 1329-1335, 2010. https://doi.org/10.1109/TPAMI.2010.26
- J. Renno, J. Orwell, and G. Jones, "Learning surveillance tracking models for the self-calibrated ground plane," in Proc. of British Machine Vision Conf., pp. 607-616, Sep. 2002.
- B. Bose and E. Grimson, "Ground plane rectification by tracking moving objects," in Proc. of Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2003.
- J. Shao, S. K. Zhou, and R. Chellappa, "Robust height estimation of moving objects from uncalibrated videos," IEEE Trans. on Image Processing, vol. 19, no. 8, pp. 2221-2232, 2010. https://doi.org/10.1109/TIP.2010.2046368
- K. Kim, T.H. Chalidabhongse, D. Harwood, and L.S. Davis. "Real-Time foreground-background segmentation using codebook model," Real-Time Imaging, vol.11, no.5, pp.167-256, 2005. https://doi.org/10.1016/j.rti.2005.06.001
- B. K. Bao, G. Liu, C. Xu, and S. Yan, "Inductive robust principal component analysis," IEEE Trans. on Image Processing, vol. 21, no. 8, pp. 3794-3800, 2012. https://doi.org/10.1109/TIP.2012.2192742
- Z. H. Khan and I. Y.-H. Gu, "Nonlinear dynamic model for visual object tracking on Grassmann manifolds with partial occlusion handling," IEEE Trans. on Cybernetics, vol. 43, no. 6, pp. 2005-2019, 2013. https://doi.org/10.1109/TSMCB.2013.2237900
- S. H. Khatoonabadi and I. V. Bajic, "Video object tracking in the compressed domain using spatio-temporal Markov random fields," IEEE Trans. on Image Processing, vol. 22, no. 1, pp. 300-313, 2013. https://doi.org/10.1109/TIP.2012.2214049
- M. Du, X. M. Nan, and L. Guan, "Monocular human motion tracking by using DE-MC particle filter," IEEE Trans. on Image Processing, vol. 22, no. 10, pp. 3852-3865, 2013. https://doi.org/10.1109/TIP.2013.2263146
- D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol.24, no.5, pp. 603-619, 2002. https://doi.org/10.1109/34.1000236
- L. F. Wang, H. P. Yan, H. Y. Wu, and C. H. Pan, "Forward-backward mean-shift for visual tracking with local-background-weighted histogram," IEEE Trans. on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1480-1489, 2013. https://doi.org/10.1109/TITS.2013.2263281
- D. Comaniciu, V. Ramesh, and P. Meer, "Kernel based object tracking," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 25, no.5, pp.564-577, 2003. https://doi.org/10.1109/TPAMI.2003.1195991
- R. Hartley, A. Zisserman. Multiple view geometry in computer vision. 2nd Edition, Cambridge University Press, Cambridge, 2003.