Real-time Human Detection under Omni-directional Camera based on CNN with Unified Detection and AGMM for Visual Surveillance

  • Nguyen, Thanh Binh (Dept. of Information and Telecommunication Engineering, Soongsil University) ;
  • Nguyen, Van Tuan (Dept. of Information and Telecommunication Engineering, Soongsil University) ;
  • Chung, Sun-Tae (Dept. of Smart Systems Software, Soongsil University) ;
  • Cho, Seongwon (School of Electronic and Electrical Engineering, Hongik University)
  • Received : 2016.05.09
  • Accepted : 2016.08.09
  • Published : 2016.08.30


In this paper, we propose a new real-time human detection method under omni-directional cameras for visual surveillance, based on CNN with unified detection and AGMM. Compared to other CNN-based state-of-the-art object detection methods, the YOLO model-based object detection method boasts very fast object detection, but with less accuracy. The proposed method adapts the unified detection CNN of the YOLO model so that it is intensified by additional foreground contextual information obtained from a pre-stage AGMM. The increased computational time incurred by the additional AGMM processing is compensated by the speed-up gained from using 2-D input data consisting of grey-level image data and foreground context information instead of 3-D color input data. Through various experiments, it is shown that the proposed method is more accurate and more robust to environment changes than the YOLO model-based human detection method, with processing speed similar to that of the YOLO model-based one. Thus, it can be successfully employed for embedded surveillance applications.



Since omni-directional cameras can provide a very wide field of view (up to 360° horizontally, 180° vertically) of scenes, they are popularly deployed in application areas like visual surveillance, where human detection is in high demand. In addition to human intrusion detection in visual surveillance, human detection has a wide range of applications such as pedestrian detection for car driving safety, people counting for commercial marketing, human interaction in robotics, and so on. Thus, during the last two decades, intensive research efforts have been poured into image-based human detection.

As opposed to research on human detection under perspective cameras, where remarkable achievements [1-9] have been accomplished during the last decade, research under omni-directional (fisheye) cameras [10-11] is far from satisfactory. Exquisite human models like the deformable part model [2] and deformation dictionaries [7], as well as most of the effective hand-crafted feature vectors for human detection in perspective images like HoG [1,8,9] and integral channel features [3], cannot be directly applied to omni-images since omni-images are severely distorted relative to perspective images.

CNN (Convolutional Neural Network) is now well recognized as an excellent mechanism for image-based application tasks since it can be structured to automatically learn models as well as low-level and high-level features appropriate for the tasks. Thus, to handle the limited representation of hand-crafted features and the limited capability of previous human models to capture large variations and occlusions of human appearance, CNN-based approaches to object (including human) detection under perspective cameras have recently started to be proposed [12-20]. Among them, the YOLO model-based one presented in [20] is noticeable for its much faster processing speed compared to the other CNN-based methods in [12-19]. This is possible since the YOLO model trains the CNN in a unified way so that it can detect all objects in a frame simultaneously in testing. However, it is analyzed in [20] that YOLO model-based object (human) detection shows less background error (false positive) but more localization error (false negative) compared to state-of-the-art CNN-based object detection methods like Fast R-CNN [17]. Even though it is argued that the YOLO model makes fewer background errors than Fast R-CNN since it learns contextual information about objects [20], we observe through our experiments that the YOLO model is not robust enough to environment changes, including background scene changes, unless its neural network is trained for various environments with different background scenes.

The YOLO model's less accurate localization may come from the fact that its CNN network cannot learn a more expressive representation of objects from only the raw pixel information of the input training images and the reference information about object bounding boxes and their confidence scores.

In this paper, we concentrate on developing fast but accurate human detection under omni-directional cameras in the specific application domain of visual surveillance. One good thing about human detection for a specific application domain like visual surveillance is that one can utilize the application domain prior knowledge [9]. In visual surveillance for designated areas, the background scene is fixed as opposed to image-based driver assistance systems where background scene is changing. Humans are detected inside the foreground object areas which can be extracted as the minimum bounding boxes of foreground masks. Foreground masks under fixed backgrounds are usually extracted by background subtraction methods, among which AGMM (Adaptive Gaussian Mixture Model)-based background subtraction is well known to work successfully even under complex background scenes. The bounding boxes of foreground masks are highly plausible candidate regions for humans, and thus one can utilize the foreground bounding boxes as contextual information about foreground objects.

In this paper, we propose a new real-time human detection under omni-directional cameras for surveillance purpose based on CNN which supports a unified detection like YOLO model, but is intensified further about localization of object bounding boxes and confidence scores for the bounding boxes by foreground object context information extracted by AGMM.

The foreground contextual information provided by the pre-stage AGMM process helps train the CNN network to learn object (human) areas more confidently and more precisely so that, in testing, the trained CNN network detects humans more accurately. Also, the false foreground object area information extracted by AGMM is used to train the CNN network to learn about wrong foreground areas. Thus, in testing, the trained CNN network works toward filtering out wrong human detections (false positives). Moreover, the contextual information about foreground object areas turns out to make the trained CNN network more robust to environment changes, including background scene changes.

To learn contextual information for various backgrounds, the CNN with unified detection of the YOLO model needs to be extensively trained against various backgrounds. Otherwise, a YOLO model trained under one background does not perform well under other backgrounds, since a YOLO model trained on a limited image data set rather than an extensive one learns contextual information biased toward the backgrounds in that limited data set. When YOLO-based human detection for visual surveillance is considered for commercial deployment, collecting various surveillance backgrounds for training will be very costly. The proposed method compensates for the bias of the YOLO model toward the backgrounds in the training dataset by letting the CNN network put more weight on foreground object areas and less weight on background areas during training. Less weight on background areas makes the CNN network in the proposed method less affected by different backgrounds. The different weight assignments are guided by the foreground contextual information provided by AGMM.

The increased computational time incurred by the additional AGMM processing is compensated by the speed-up gained from using 2-D input data for the unified CNN network, which consists of grey-level image data and foreground/background context information data, instead of the original 3-D color input data. Further speed-up can be achieved because the CNN network intensified by foreground contextual information can produce similar accuracy even with fewer layers, since each convolutional layer can learn more efficiently.

The experimental results show that the proposed human detection method significantly improves accuracy and robustness to background changes without deterioration of processing speed compared to YOLO model-based human detection [20], which makes the proposed human detection method suitable for real-time human detection in embedded surveillance systems.

In passing, it is worthwhile to remark that, concerning contextual or semantic information learning, the approach proposed in this paper is different from those in [13-15]. As opposed to the approaches in [13-15], which construct new CNN network architectures with input image pixel data only, the CNN network in this paper takes contextual information as an additional input alongside the input image pixel data, without restructuring the CNN network framed in a unified regression way as in [19-20]. The previous research on CNN-based object (human) detection [12-20] is more concerned with how to train on input image data so as to make the CNN efficiently learn information about object location, object class and context for objects. The proposed method shows that what data to train the CNN on is as important as how to train the CNN on the data.

The rest of the paper is organized as follows. Section 2 introduces technical backgrounds and related work necessary for understanding the works of this paper. Section 3 describes our proposed human detection method. Experimental results are discussed in Section 4, and finally the conclusion is presented in Section 5.



2.1 AGMM (Adaptive Gaussian Mixture Model)-based Background Subtraction

The rationale in “background subtraction” for detecting foreground objects in videos from static cameras (the background scene of camera is fixed) is to detect the foreground objects from the difference between the current frame and a reference frame, often called “background model”. AGMM (Adaptive Gaussian Mixture Model) [21] is a statistical background modeling which is well-known to be effective for extracting foreground objects in complex background scenes. For the detailed AGMM-based foreground object extraction algorithm, readers are recommended to refer to [21].

In AGMM-based foreground object detection by background subtraction, each pixel in the scene is modeled by a mixture of K Gaussian distributions which reflects variations of scenes and illuminations. However, background subtraction methods are still sensitive to lighting variations and scene clutter, and have difficulty in handling the grouping and fragmentation problems. Moreover, foreground object candidate areas extracted by background subtraction methods are not guaranteed to contain humans. Therefore, convolutional neural networks are utilized to learn about all these variations and to verify whether the foreground object candidate areas extracted by AGMM contain humans or not.

Background subtraction methods based on AGMM contain two significant parameters: the learning rate constant α in the Gaussian statistics update, and the background threshold TB, a measure of the minimum portion of the data that should be accounted for by the background. Changing the values of these two parameters affects false negatives (missed detections) and false positives (false detections). This is discussed in detail in Section 4, Experimental Results.
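To make the roles of α and TB concrete, the following is a minimal single-pixel sketch in Python of a Gaussian-mixture background model. It is a simplified illustration and not necessarily the exact algorithm of [21]: the class name PixelGMM, the initial variance, the variance floor, the 2.5-sigma matching rule, and the use of α itself as the per-component update rate are all assumptions made for this example. The default values α = 0.004 and TB = 0.9 are the ones adopted in Section 4.2.

```python
import math


class PixelGMM:
    """Adaptive mixture of K Gaussians for one grey-level pixel (simplified sketch)."""

    def __init__(self, k=3, alpha=0.004, t_b=0.9,
                 init_var=225.0, match_sigmas=2.5, min_var=4.0):
        self.k = k                        # maximum number of Gaussians
        self.alpha = alpha                # learning rate
        self.t_b = t_b                    # background threshold T_B
        self.init_var = init_var          # assumed variance of a new component
        self.match_sigmas = match_sigmas  # assumed matching distance in sigmas
        self.min_var = min_var            # assumed floor that keeps variances positive
        self.components = []              # each entry is [weight, mean, variance]

    def _background_count(self):
        # Components are kept sorted by weight / sigma. The first ones whose
        # cumulative weight exceeds T_B are treated as the background model.
        total = 0.0
        for i, comp in enumerate(self.components):
            total += comp[0]
            if total > self.t_b:
                return i + 1
        return len(self.components)

    def update(self, value):
        """Classify one observation, then update the model. True means foreground."""
        matched = None
        for i, (_, mean, var) in enumerate(self.components):
            if abs(value - mean) <= self.match_sigmas * math.sqrt(var):
                matched = i
                break
        is_foreground = matched is None or matched >= self._background_count()

        if matched is None:
            # No component explains the value: add one, or replace the weakest.
            new = [self.alpha, float(value), self.init_var]
            if len(self.components) < self.k:
                self.components.append(new)
            else:
                self.components[-1] = new
        else:
            # Raise the weight of the matched component and decay the others.
            for i, comp in enumerate(self.components):
                target = 1.0 if i == matched else 0.0
                comp[0] += self.alpha * (target - comp[0])
            comp = self.components[matched]
            diff = value - comp[1]
            comp[1] += self.alpha * diff
            comp[2] = max(self.min_var, comp[2] + self.alpha * (diff * diff - comp[2]))

        # Renormalise the weights and restore the weight / sigma ordering.
        total = sum(comp[0] for comp in self.components)
        for comp in self.components:
            comp[0] /= total
        self.components.sort(key=lambda comp: comp[0] / math.sqrt(comp[2]), reverse=True)
        return is_foreground
```

In this sketch a larger α lets a new appearance be absorbed into the background faster, and a larger TB lets more components count as background, which is how the two parameters trade missed detections against false detections.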

2.2 CNNs (Convolutional Neural Networks)

Convolutional Neural Networks (also called CNNs or ConvNets) are a kind of deep learning network which can automatically learn models as well as low-level and high-level features appropriate for application tasks. Thus, they have been successfully utilized in many commercial applications, from image classification and object detection to scene labelling [22].

Although the basic concepts of CNN have been known since 1980 [23], they did not receive a lot of attention until the last few years [24-25]. Due to theoretical advances, the increased availability of computational resources, larger amounts of data, and excellent performance, CNN-based approaches are now considered state-of-the-art in many vision tasks including human detection.

As is well known, CNN networks have three essential kinds of layers: convolutional layers for learning features, pooling layers for lowering the computational burden by reducing the number of connections between convolutional layers, and fully connected layers for classification. Recently, the pooling layer of CNN has been restructured to speed up processing by utilizing spatial pyramid pooling as in [26], or to exploit the spatial relation of semantic features by spatially weighted max pooling [15].

2.3 CNN with Unified Detection [20]

The YOLO model [20] is a CNN model with unified detection which can detect multiple objects in an image simultaneously through a single neural network in a regression formulation. It divides the image into an S × S grid and simultaneously predicts bounding boxes of objects, confidence in those boxes, and class probabilities. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. The confidence score is defined to equal the intersection over union (IOU) between the predicted box and the ground truth. These confidence scores reflect how confident the model is that the box contains an object. Each bounding box consists of 5 predictions: x, y, w, h, and confidence score. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width w and height h are predicted relative to the whole image. Each grid cell also predicts C conditional class probabilities. The YOLO model predicts only one set of class probabilities per grid cell, regardless of the number of boxes B. Thus, these predictions are encoded as an S × S × (B × 5 + C) tensor, where S × S is the size of the selected grid, B is the number of bounding boxes for each cell, and C is the number of object classes.
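The encoding described above can be illustrated with a short Python sketch. The function names prediction_length and encode_box are hypothetical, and the pixel-based box format (center and size in pixels) is an assumption made for this example; only the grid assignment and the normalization rules follow the description in the text.

```python
def prediction_length(s, b, c):
    """Number of values in the S x S x (B * 5 + C) prediction tensor."""
    return s * s * (b * 5 + c)


def encode_box(cx, cy, w, h, img_w, img_h, s):
    """Map a ground-truth box (centre and size in pixels) to its grid cell.

    Returns (row, col, x, y, w_rel, h_rel), where x and y are the centre
    offsets inside the responsible cell and w_rel, h_rel are the box size
    relative to the whole image.
    """
    gx = cx / img_w * s          # centre position in grid units
    gy = cy / img_h * s
    col = min(int(gx), s - 1)    # cell that contains the centre
    row = min(int(gy), s - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h
```

For example, with S = 7, B = 2 and C = 1 the tensor holds 7 × 7 × 11 = 539 values, and a box centered at (320, 240) in a 640 × 480 image is assigned to the cell in row 3, column 3 with offsets (0.5, 0.5).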

For how the network of the YOLO model is constructed and how it is trained, readers are recommended to refer to [20].

2.4 Related Work

During the last two decades, intensive research efforts have been poured into human detection from image analysis since it has a wide range of applications such as intrusion detection in visual surveillance, pedestrian detection for car driving safety, people counting for commercial marketing, human interaction in robotics, and so on.

Many noticeable human detection methods proposed and tested during the last decades [1-9] are formulated as classification problems, where classifiers are trained using carefully hand-crafted feature vectors such as HoG [1] and channel features [3]. To account for view and pose variations of humans, combined features [3], exquisite human models like the deformable part model [2], and several other approaches [6-7] have been tried. However, the representation of hand-crafted features cannot be optimized for pedestrian detection, and no exquisite model so far can capture the large variations of human appearance and the occlusions of human bodies.

To overcome these problems, deep learning-based approaches have attracted attention. Deep learning, especially CNN, can learn features and models from raw pixels to improve the performance of human detection. In [12], the CNN is constructed so as to branch the outputs of lower levels into the top classifier to learn multi-stage features covering both global shapes and structures and local details, such as a global silhouette and face components in the case of human detection. The Joint Deep learning model [13] formulates pedestrian detection as a joint part-based model, designs the CNN to learn a pedestrian as a 20-part model, and estimates the pedestrian from the scores of the 20 parts. [12] and [13] formulate pedestrian detection as a single binary classification task (pedestrian or not pedestrian), and thus they may confuse positive samples (pedestrians) with hard negative samples (not pedestrians but pedestrian-like objects). In order to handle this ambiguity, TA-CNN [14] constructs the CNN so as to learn a discriminative representation including semantic contexts for pedestrian detection by jointly optimizing it with semantic attributes, including pedestrian attributes (e.g. ‘carrying backpack’) and scene attributes (e.g. ‘vehicle’, ‘tree’, and ‘horizontal’). Learning semantic contexts then helps to discriminate between pedestrians and pedestrian-like objects at run time and gives better accuracy than state-of-the-art methods. However, TA-CNN [14] has to learn 9 pedestrian attributes and 8 scene attributes in addition to the binary classification information, so it carries a computational burden which is disadvantageous for embedded implementation.

[15] proposes a new spatially weighted max pooling module in CNN to jointly model the spatial structure and appearance of high-level semantic features, and applies the new CNN model to pedestrian detection for evaluation. Experiments on several pedestrian detection databases show that it performs better than the traditional CNN.

All of [12-15] localize pedestrians by the sliding window approach, which requires scanning multi-scale image windows over the whole image frame. It is well known that the sliding window approach is computationally expensive [9,16]. To improve processing speed, selective search approaches like region-CNN [16] and Fast R-CNN [17] apply the CNN only over region proposals extracted by pre-stage processing, not over the whole image frame.

All these prior works [12-17] formulate object detection as applying classifiers to each image region, whether it is a sliding window or a selected region, so that they have to apply the computationally heavy CNN many times per frame. Thus, all these prior approaches are still too heavy for real-time processing on embedded systems. [18-20] reformulate object detection as a regression problem and train the CNN to learn an efficient regression between predicted bounding boxes and ground truth bounding boxes of the objects. In addition to the coordinates and box sizes of the matching bounding boxes, a confidence score represents how well the matching is achieved. Deep MultiBox [18] trains a convolutional neural network in a regression formulation to predict regions of interest. [19] is more concerned with finding a single graspable area for an object. It does not need to know anything about the object extent or its center. On the other hand, the YOLO model proposed in [20] involves joint optimization of the classification and localization errors, unlike [19]. The YOLO model [20] shows much faster processing compared to other CNN-based state-of-the-art object detection methods by framing the regression formulation of object detection in a unified way, in the sense that the YOLO model trains the CNN to predict a distribution over class labels as well as a bounding box of the object in each grid cell simultaneously over a 7 × 7 grid of the input image. The YOLO model certainly gains a significant computational advantage over region proposal methods [16-17] when detecting a few classes, as in human detection. Thus, the YOLO model-based method can be considered a very promising CNN candidate architecture for object detection in embedded systems without any powerful GPU or special vision DSP cores.

Due to their large field of view, omni-directional lens cameras are popularly used for various applications such as visual surveillance. For surveillance purposes, human detection under omni-directional cameras has been actively studied during the past decades. Mainly two approaches have been proposed in the literature: 1) application of successful techniques developed under the perspective setting, and 2) different techniques applied directly to omni-images. A well-known technique in perspective environments can be applied to human detection under fisheye cameras [27-28]. However, persons sitting, lying down, or standing around the center of the omni-image do not appear in postures that are well detected in perspective scenes, even after the omni-image is converted into a perspective or a panoramic image. Also, converting the omni-image into a perspective or a panoramic image costs additional computational time. Thus, many different direct approaches have been proposed, such as modification of the HoG feature so as to be appropriate for omni environments [10], or formulation of human detection as a Bayesian MAP framework utilizing a shape-based detector [11]. These methods only consider pedestrians (walking persons) appearing in off-center regions and do not show superior performance.

As in the perspective setting, CNN approaches have started to appear to handle the difficulty of extracting hand-crafted features and human models appropriate for the omni-image setting, but there are very few of them. [29] applies a deep CNN in a classifier formulation to human detection in omni-images and reports better accuracy than one conventional popular method, the combination of HoG and Adaboost, on their home-grown dataset. However, [29] does not explain how to handle multiple human detection in an image frame, and it does not provide any comparison to other methods with respect to processing speed.



3.1 Workflow of the Proposed Method

Fig. 1 shows the workflow of the proposed human detection method for visual surveillance based on the CNN with unified detection and foreground contextual information. How it works will be explained in later sections.

Fig. 1.The workflow of the proposed human detection method for visual surveillance.

3.2 Extraction of Foreground Information using AGMM-based Background Subtraction

In visual surveillance applications, detection of foreground objects, especially humans, is in great demand. Under a fixed surveillance camera, the information about foreground object regions can be estimated from background subtraction based on background modeling. The information about foreground object regions can serve as intensified information about location and confidence in detecting humans. Thus, in the proposed method, the information about foreground rectangular regions extracted by the adopted AGMM is additionally supplied to the CNN network with unified detection together with image pixel data, so that the CNN network is trained to learn the location and confidence of human objects more distinctively.

To extract foreground rectangular regions from an image frame, we adopt the AGMM-based background subtraction method [9,21]. Fig. 2 shows the steps of our adopted AGMM-based extraction of foreground rectangular regions and the associated images in each step.

Fig. 2.Steps of the adopted AGMM-based foreground mask extraction; (A) original color omni-Image, (B) gray-level resized image, (C) foreground masks, (D) corrected and bounding foreground rectangular region.
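The last step, turning a foreground mask into bounding rectangular regions, can be sketched in Python as follows. This is an illustrative assumption rather than the procedure of [9,21]: the function name foreground_boxes is hypothetical, blobs are grouped by 8-connectivity, and the min_area filter merely stands in for the correction step, whose details are not given here.

```python
from collections import deque


def foreground_boxes(mask, min_area=1):
    """Return (x_min, y_min, x_max, y_max) for each 8-connected blob in a 0/1 mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one blob, tracking its extent and area.
            seen[r][c] = True
            queue = deque([(r, c)])
            x_min = x_max = c
            y_min = y_max = r
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Assumed noise filter: ignore blobs smaller than min_area pixels.
            if area >= min_area:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

Each returned rectangle is a candidate region for a human, and the rectangles together form the foreground context supplied to the network.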

3.3 Network Design

Network of the CNN with unified detection in the proposed method is designed to learn human detection more distinctively by providing foreground rectangular region information in addition to gray-level image data into the network as inputs.

The CNN network in the proposed method is inspired by the YOLO Tiny model [30] and modified for the human detection task. While the YOLO Tiny model is constructed for detection of 20 object classes, human detection is a one-class object detection problem, so one can simplify the YOLO Tiny model further. The CNN network for human detection proposed in this paper consists of 5 convolutional layers and 4 max pooling layers, followed by 2 fully connected layers, a dropout layer and an output detection layer. The final output detection layer of the proposed network produces a 7 × 7 × 11 tensor of predictions (539 values). 7 × 7 is the number of grid cells, and 11 is calculated as (5*B+C) (B: number of bounding boxes, C: number of classes), where we choose B=2 to detect overlapped or closely located humans and C=1 since human detection is one-class object detection.

Fig. 3 shows the proposed CNN network architecture for human detection.

Fig. 3.The network architecture of the CNN with unified detection of the proposed human detection method.

Compared to the YOLO Tiny model (9 convolutional layers, 6 max pooling layers, 3 fully connected layers, 1 dropout layer, 1 output detection layer) [30], the proposed CNN obviously has fewer layers but, more importantly, far fewer filters in the convolutional layers, since human detection needs to detect only one class (human) as opposed to the YOLO model, which is constructed to classify 20 classes.

3.4 Training

Training of the proposed CNN with unified detection is the same as that of YOLO model [20] except that training input data consists of foreground context information whether pixels belong to foreground (1) or background (0) in addition to one channel image pixel gray-level value. Original YOLO model’s training input data consists of 3 color channel (R, G, B) values.
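The two-channel input described above can be sketched in Python as follows. The function name to_two_channel is hypothetical, and both the BT.601 luminance weights used for the grey-level conversion and the scaling of grey levels to [0, 1] are assumptions made for this example; the text only states that each pixel carries one grey-level value and one foreground (1) or background (0) flag.

```python
def to_two_channel(rgb, fg_mask):
    """Build a 2-channel input: (grey level in [0, 1], foreground flag) per pixel.

    rgb is a list of rows of (r, g, b) tuples with values in 0..255 and
    fg_mask is a list of rows of 0/1 flags with the same shape.
    """
    out = []
    for rgb_row, mask_row in zip(rgb, fg_mask):
        row = []
        for (r, g, b), flag in zip(rgb_row, mask_row):
            # Assumed grey-level conversion using BT.601 luminance weights.
            grey = (0.299 * r + 0.587 * g + 0.114 * b) / 255.0
            row.append((grey, 1.0 if flag else 0.0))
        out.append(row)
    return out
```

Compared with the three color values per pixel of the original YOLO input, this representation carries two values per pixel, which is where the reduction in input size comes from.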

In the same way as YOLO model, the proposed network updates the parameters of neural network during training by back propagation of the squared error between the network output and the ground truth information (x, y, w, h, confidence score).

The pre-stage AGMM can generate false positives (wrong foreground mask regions) and false negatives (missed foreground mask regions) in addition to true positives (true foreground mask regions). The wrong foreground context information and the true foreground information (true positives) are all employed to intensify the proposed network so that it learns to accept true object information and to filter out wrong object information. This additional foreground contextual information contributes to the improvement of detection accuracy. Fig. 4 illustrates an example of filtering out false positives after learning (training).

Fig. 4.The proposed network can learn to filter out regions wrongly extracted as foreground masks.



4.1 Experimental Environments

In order to evaluate our proposed method, we used 2 kinds of omni-directional image dataset: a well-known data set, Bomni-DB [31] and a home-grown dataset.

Bomni-DB consists of videos recorded in a room with two omni-directional cameras, where the bounding boxes and actions of people are annotated. Each image frame of Bomni-DB has a resolution of 640×480. The omni-directional cameras are located at the top and the side of the room. Two different scenarios are recorded, single-person and three-people, meaning that the number of clearly appearing persons is one or three. For the experiments of this paper, the single-person and three-people videos taken from the top omni-directional camera are adopted, and they are denoted as Bomni-1-DB and Bomni-3-DB, respectively.

Fig. 5 and Fig. 6 show some sample images from Bomni-1-DB and Bomni-3-DB, respectively.

Fig. 5.Bomni-1-DB.

Fig. 6.Bomni-3-DB.

The home-grown dataset consists of omni-directional video of 704 × 576 resolution taken from the top at a company in Seoul, Korea. This home-grown DB will be denoted as HG-DB in this paper. Fig. 7 shows some images from the home-grown DB.

Fig. 7.HG-DB (Home-grown Dataset).

PC used for experiments has the following specification: Intel Core i5 6600 3.3GHz, 16GB RAM, and Nvidia Titan II graphic card. The graphic card Titan II is used only for training, and not used for testing (evaluation). OS for the testing PC is Windows 10 64bit.

4.2 Evaluation Methodology

In object detection, performance is commonly evaluated by accuracy and processing speed. The processing speed is simply measured by counting how many frames are processed per second or by calculating the total time consumed in processing the whole dataset. For accuracy evaluation, many methods adopt different measures even though they are similar in spirit. In this paper, accuracy is evaluated by precision and recall. Calculation of precision and recall requires a clear definition of the measure for correct detection. Our employed measure for correct detection is the PASCAL measure [4], which counts a detection as correct if the area of overlap between the detected bounding box (BBdt) and the ground truth bounding box (BBgt) exceeds 50% of the area of their union, as described in (1).

area(BBdt ∩ BBgt) / area(BBdt ∪ BBgt) > 0.5   (1)
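The overlap test of the PASCAL measure can be sketched in Python as follows. The function names iou and is_correct_detection and the (x_min, y_min, x_max, y_max) box format are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(bb_dt, bb_gt, threshold=0.5):
    """PASCAL-style test: the detection is correct if the IoU exceeds the threshold."""
    return iou(bb_dt, bb_gt) > threshold
```

For example, two 10 × 10 boxes shifted by 2 pixels horizontally overlap with a ratio of 80/120 and count as a correct detection, while a shift of 5 pixels gives 50/150 and does not.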

Now, if we denote the number of correctly detected human objects (true positives), the number of falsely detected human objects (false positives), and the number of missed human objects (false negatives) as TP, FP, and FN, respectively, precision, recall and overall are defined as follows.

Note that Precision = 1 - False Alarm Rate, and Recall = 1 - Miss Rate. Overall acts as a single measurement for comparison of different methods.
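As a small worked example, precision and recall can be computed from the counts TP, FP and FN using their standard definitions, Precision = TP / (TP + FP) and Recall = TP / (TP + FN), which agree with the relations to false alarm rate and miss rate stated above. The function name precision_recall and the counts used below are hypothetical and are not results from this paper. Overall is not included in the sketch because its formula is not restated here.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts for illustration only.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
false_alarm_rate = 1.0 - precision   # equals FP / (TP + FP)
miss_rate = 1.0 - recall             # equals FN / (TP + FN)
```

With these counts, precision is 90/100 and recall is 90/120, so the false alarm rate is 10/100 and the miss rate is 30/120.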

Because AGMM provides foreground contextual information to the proposed CNN network, the accuracy of the proposed method is affected by the performance of AGMM. How precisely AGMM produces foreground information depends on the AGMM model parameters, the learning rate α and the background threshold TB. Instead of testing various values of the learning rate α and the background threshold TB, which would complicate the experiments, we use the values of the two parameters for which AGMM performs best over the training videos from Bomni-1-DB. In this way, we adopt the learning rate α = 0.004 and the background threshold TB = 0.9 for AGMM in the experiments of this paper.

4.3 Experimental Results

Human detection in embedded systems needs a light and fast algorithm since embedded systems have limited computing capability, including a less powerful GPU. Even though the YOLO model-based human detection method works much faster than other CNN-based state-of-the-art object detection methods, it has a drawback with respect to accuracy. One of the major purposes of this paper is to develop an accurate human detection method which can operate reliably in real time under normal embedded environments.

Through experiments in this paper, we evaluate the proposed methods against YOLO (Tiny) model-based human detection methods with respect to accuracy and processing speed.

The compared human detection methods are as follows: the YOLO Tiny Model-based human detection method, the YOLO Simplified Model-based human detection method, and the AGMM-based human detection method. The YOLO Tiny Model-based method uses the same network architecture and parameters as in [30]. The YOLO Simplified Model-based method uses the same network architecture as Fig. 3 but with only RGB pixel data as input. The AGMM-based method takes the minimum bounding boxes of foreground masks as human object regions; it just estimates foreground object regions and cannot precisely identify foreground objects. Two proposed human detection methods are evaluated, denoted as Proposed Method_1 and Proposed Method_2. Proposed Method_1 uses the network architecture of Fig. 3 and applies both grey-level image pixel data and foreground information from the AGMM process to the network as inputs. Proposed Method_2 is the same as Proposed Method_1 except that the last convolutional layer of Fig. 3 is eliminated. The idea of eliminating the last convolutional layer in Proposed Method_2 comes from visualization analysis of what each convolutional layer learns. After the convolutional layers are visualized, it is observed that the last two convolutional layers of Proposed Method_1 have learned almost the same features. Thus, one can cut off the last convolutional layer and still achieve similar accuracy (precision and recall) with a speed-up. Fig. 8 shows visualizations of the weights of the convolutional layers in Proposed Method_1's network.

Fig. 8. Visualization of the convolutional layers in Proposed Method_1's network.
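The AGMM-based baseline described above reduces a foreground mask to the minimum bounding box of each foreground region. The following is a minimal sketch of that step only, assuming the mask is already available as a binary 2-D grid and that regions are grouped by 4-connectivity; the function name `foreground_boxes` and the `min_area` noise filter are hypothetical and not taken from the paper.

```python
from collections import deque


def foreground_boxes(mask, min_area=1):
    """Return one (x_min, y_min, x_max, y_max) box per 4-connected foreground blob.

    mask is a list of rows holding 0 (background) or 1 (foreground).
    Box corners are inclusive pixel indices. Blobs smaller than min_area
    pixels are dropped (an assumed noise filter, not part of the paper).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one blob, tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            area = 0
            while queue:
                cx, cy = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height:
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if area >= min_area:
                boxes.append((x0, y0, x1, y1))
    return boxes


if __name__ == "__main__":
    demo_mask = [
        [0, 1, 1, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],
    ]
    print(foreground_boxes(demo_mask))
```

Such boxes only mark where something moved; nothing in this step decides whether the region is a human, which is the limitation of the AGMM-only baseline noted above.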

For accuracy evaluation, it is important to test the robustness of the methods in environments different from the training environment. To this end, we first train the proposed methods and the YOLO model-based human detection methods using single-person Bomni-DB (Bomni-1-DB), and then evaluate the precision and recall of both groups of methods on single-person Bomni-DB (Bomni-1-DB), three-people Bomni-DB (Bomni-3-DB), and the home-grown DB (HG-DB). We also evaluate the processing speed of the proposed methods and compare it to that of the YOLO model-based human detection methods.

To train the methods (the proposed ones and the YOLO model-based ones), we extracted over 2000 images from all videos in Bomni-1-DB (5 videos) and used these images for the training process.
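Precision and recall over a video can be computed by matching detected boxes to annotated boxes frame by frame. The sketch below is one common way to do this and rests on assumptions the text does not state: a detection counts as correct when its intersection-over-union (IoU) with a still-unmatched ground-truth box is at least 0.5, and matching is greedy. The names `iou` and `precision_recall` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes in continuous coordinates."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = min(ax1, bx1) - max(ax0, bx0)
    inter_h = min(ay1, by1) - max(ay0, by0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union


def precision_recall(frames, iou_threshold=0.5):
    """Accumulate TP/FP/FN over frames given as (detections, ground_truth) pairs.

    The 0.5 threshold and the greedy matching are assumptions for illustration.
    """
    tp = fp = fn = 0
    for detections, ground_truth in frames:
        unmatched = list(ground_truth)
        for det in detections:
            # Pick the best still-unmatched annotation for this detection.
            best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
            if best is not None and iou(det, best) >= iou_threshold:
                unmatched.remove(best)
                tp += 1
            else:
                fp += 1  # false alarm
        fn += len(unmatched)  # missed persons
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    demo_frames = [
        ([(0, 0, 10, 10)], [(0, 0, 10, 10)]),
        ([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]),
        ([], [(5, 5, 15, 15)]),
    ]
    print(precision_recall(demo_frames))
```

In this scheme a false alarm lowers precision and a missed person lowers recall, which is why both numbers are needed to compare the methods.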

4.3.1 Evaluation with respect to accuracy

1) Testing under the same environments as training dataset

We first evaluated accuracy over two videos (Bomni-1-DB_Video1, Bomni-1-DB_Video2) from single-person Bomni-DB (Bomni-1-DB), which have the same background as the training dataset. Bomni-1-DB_Video1 and Bomni-1-DB_Video2 consist of 1001 frames and 794 frames, and contain 906 persons and 690 persons, respectively.

Table 1 shows the experimental results of evaluating the methods over Bomni-1-DB_Video1 and Bomni-1-DB_Video2. The data in Table 1 show that both the YOLO Model-based methods and the proposed methods work well on images from the same environment as the training images.

Table 1. Comparison with respect to accuracy over Bomni-1-DB, from which the training data set is collected

The original YOLO Tiny Model-based human detection method, whose network has more layers and more filters, performs better than the YOLO Simplified Model-based one, which has fewer layers and fewer filters. However, it is computationally much more expensive than the YOLO Simplified Model-based human detection method, as shown in Table 4.

Fig. 9 shows some examples of the experiment of Table 1.

Fig. 9. The human detection results in Bomni-1-DB; (1) both the YOLO Model (Simplified) method and the proposed method (Method_1) detect successfully, (2) the proposed method (Method_1) works well but the YOLO Model (Simplified) method fails to detect all humans, (3) both fail.

From Fig. 9-(2), one can see that the YOLO Model (Simplified) method does not learn enough about the objects (humans) from the training data, so it may fail to detect a new object (human) whose appearance differs from the training data, whereas the proposed method can learn better thanks to the additional foreground context information. The image data of Fig. 9-(3) are very difficult in the sense that the objects (humans) cannot be clearly differentiated from parts of the background. In this case, AGMM cannot generate useful foreground information, and so the proposed method also fails to detect the humans in the scene.

The experimental data for the AGMM Only Model-based human detection in Table 1 show that it often misses humans or detects them falsely. Fig. 10 shows a case where AGMM generates false alarms and misses humans, while the proposed method filters out the false alarms and detects the missed humans, since it can learn to do both.

Fig. 10. Examples showing that the proposed method works well where the AGMM-only method fails.

2) Testing about robustness to different environments from training environments

An object detection method (algorithm) needs to work reliably under various environments. To evaluate this robustness, we ran two experiments: 1) one under the same background scene as that of the training set but with multiple people appearing, and 2) one with a background scene different from that of the training data set.

The experimental data of Table 2 were obtained by evaluating accuracy over two testing videos (Bomni-3-DB_Video1, Bomni-3-DB_Video2) from three-people Bomni-DB (Bomni-3-DB), which has the same background scene as the training data set but with three people appearing. Fig. 11 shows some examples of the experiment of Table 2.

Table 2. Comparison with respect to accuracy over Bomni-3-DB, which has the same background scene as the training dataset but with three people appearing

Fig. 11. The human detection results in Bomni-3-DB; (1) both the YOLO Model (Simplified) method and the proposed method (Method_1) detect successfully, (2) the proposed method (Method_1) works well but the YOLO Model (Simplified) method fails to detect all humans, (3) both fail.

Table 3 shows the experimental data with respect to accuracy on the home-grown DB (HG-DB) videos, whose environments differ from the training data set. The data in Table 3 show that the proposed methods are robust even to a change of background scene.

Table 3. Comparison with respect to accuracy over the home-grown DB, with a background scene different from the training dataset

Fig. 12 shows some examples of the experiment of Table 3.

Fig. 12. The results in HG-DB; (1) both the YOLO Model method and the proposed method detect successfully, (2) the proposed method works well but the YOLO Model method fails, (3) neither method detects the human.

The experimental data in Table 2 and Table 3, together with Fig. 11 and Fig. 12, imply that the proposed method is more robust to environment changes than the YOLO model is, since it can learn object regions better from the foreground context information.

4.3.2 Evaluation with respect to processing speed

We also measured processing speed on all testing videos. Table 4 shows the experimental results on processing speed.

Table 4. Comparison with respect to processing speed
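A processing-speed comparison of this kind can be reproduced with a simple timing loop. The sketch below is only an illustration: the helper name `measure_fps`, the warm-up pass, and the choice of frames per second as the unit are assumptions, and `detect` stands for whichever detector is under test.

```python
import time


def measure_fps(detect, frames, warmup=1):
    """Return average frames per second of detect() over the given frames.

    The first `warmup` frames are processed once beforehand and not timed,
    so that one-off setup cost does not distort the measurement.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        detect(frame)
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in detector that just does a little arithmetic per frame.
    def dummy_detect(frame):
        return sum(frame)

    print(measure_fps(dummy_detect, [list(range(1000))] * 50))
```

Timing every compared method with the same loop on the same videos keeps the comparison fair, since decoding and I/O are excluded in the same way for each method.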

The experimental data in Table 4 show that the proposed method is faster than the YOLO Model methods. This speed-up mainly results from reducing the input data dimension from 24 bits per pixel (the three R-G-B channels of the input color image) to 9 bits per pixel (8 bits of grey-level image data and 1 bit of foreground information).
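The input reduction can be made concrete with a small sketch that turns an RGB frame and a binary foreground mask into the two-channel input (grey level plus foreground bit). The grey-level conversion is not specified in the text; the ITU-R BT.601 luma weights used here are an assumption, as are the names `to_grey` and `build_input`.

```python
RGB_BITS = 3 * 8       # three 8-bit colour channels per pixel
PROPOSED_BITS = 8 + 1  # 8-bit grey level plus 1-bit foreground flag per pixel


def to_grey(rgb_pixel):
    """Convert an (R, G, B) pixel to an 8-bit grey level.

    Uses integer BT.601 luma weights with rounding (an assumed conversion).
    """
    r, g, b = rgb_pixel
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def build_input(rgb_frame, fg_mask):
    """Pair each pixel's grey level with its foreground bit from the mask."""
    return [
        [(to_grey(pixel), 1 if flag else 0) for pixel, flag in zip(row, mask_row)]
        for row, mask_row in zip(rgb_frame, fg_mask)
    ]


if __name__ == "__main__":
    frame = [[(255, 255, 255), (0, 0, 0)]]
    mask = [[1, 0]]
    print(build_input(frame, mask))
    print(PROPOSED_BITS / RGB_BITS)
```

Per pixel, the proposed input carries 9/24 = 37.5% of the bits of the RGB input, while still adding the foreground cue that the RGB-only network lacks.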



In this paper, we presented a real-time human detection method under an omni-directional camera for visual surveillance. The proposed method, intended for embedded surveillance systems, is based on a CNN with unified detection, augmented with foreground information extracted by AGMM-based background subtraction. Since the proposed method learns more about foreground context with a lower input data dimension, it achieves better and more robust accuracy than the YOLO model-based method [20], and it also shows faster processing speed than the YOLO model-based one, the fastest CNN-based object detection method.

Currently, we are porting the proposed method to a smart IP network camera, and the results will be reported in future work.


  1. N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 886-893, 2005.
  2. P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-based Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, Issue 9, pp. 1627-1645, 2009.
  3. P. Dollar, S. Belongie, and P. Perona, "The Fastest Pedestrian Detector in the West," Proceeding of The British Machine Vision Conference, pp. 68.1- 68.11, 2010.
  4. P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 4, pp. 743-761, 2011.
  5. R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, "Pedestrian Detection at 100 Frames per Second," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2903-2910, 2012.
  6. R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Ten Years of Pedestrian Detection, What Have We Learned?," Proceeding of European Conference on Computer Vision, pp. 613-627, 2014.
  7. B. Hariharan, C.L. Zitnick, and P. Dollár, "Detecting Objects using Deformation Dictionaries," Proceeding of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1995-2002, 2014.
  8. K.M. Bhuvanarjun and T.C. Mahalingesh, "Pedestrian Detection in a Video Sequence using HOG and Covariance Method," International Journal of Electrical and Electronics Engineers, Vol. 7, Issue 1, pp. 183-190, 2015.
  9. T.B. Nguyen, V.T. Nguyen, and S.T. Chung, "A Real-time Pedestrian Detection Based on AGMM and HOG for Embedded Surveillance," Journal of Korea Multimedia Society, Vol. 18, No. 11, pp. 1289-1301, 2015.
  10. I. Cinaroglu and Y. Bastanlar, "A Direct Approach for Object Detection with Catadioptric Omnidirectional Cameras," Journal of Signal, Image and Video Processing, Vol. 10, Issue 2, pp. 413-420, 2016.
  11. M. Saito, K. Kitaguchi, G. Kimura, and M. Hashimoto, "Human Detection from Fish-eye Image by Bayesian Combination of Probabilistic Appearance Models," Proceeding of 2010 IEEE International Conference on Systems Man and Cybernetics, pp. 243-248, 2010.
  12. P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian Detection with Unsupervised Multi-stage Feature Learning," Proceeding of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3626-3633, 2013.
  13. W. Ouyang and X. Wang, "Joint Deep Learning for Pedestrian Detection," IEEE International Conference on Computer Vision, pp. 2056-2063, 2013.
  14. Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian Detection Aided by Deep Learning Semantic Tasks," IEEE Conference on Computer Vision and Pattern Recognition, pp. 5079-5087, 2015.
  15. F. Liu, Y. Huang, W. Yang, and C. Sun, "High-level Spatial Modeling in Convolutional Neural Network with Application to Pedestrian Detection," Proceeding of IEEE 28th Canadian Conference on Electrical and Computer Engineering, pp. 778-783, 2015.
  16. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based Convolutional Networks for Accurate Object Detection and Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 38, Issue 1, pp. 142-158, 2015.
  17. R. Girshick, "Fast R-CNN," Proceeding of IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.
  18. D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable Object Detection Using Deep Neural Networks," Proceeding of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2155-2162, 2014.
  19. R-cnn minus R,, Aug. 31th, 2016.
  20. You Only Look Once: Unified, Real-Time Object Detection,, Aug. 31th, 2016.
  21. C. Stauffer and W.E.L. Grimson, "Adaptive Background Mixture Models for Real-Time Tracking," Proceeding of Conference on Computer Vision and Pattern Recognition, pp. 246-252, 1999.
  22. Recent Advances in Convolutional Neural Networks,, Aug. 31th, 2016.
  23. K. Fukushima, "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," Biological Cybernetics, Vol. 36, Issue 4, pp. 193-202, 1980.
  24. B. Boser, L. Cun, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel, "Handwritten Digit Recognition with a Backpropagation Network," Proceeding of Advances in Neural Information Processing Systems, pp. 396-404, 1990.
  25. A. Krizhevsky, I. Sutskever, and G.E. Hinton, "Imagenet Classification with Deep Convolutional Neural Networks," Proceeding of Advances in Neural Information Processing Systems, pp. 1106-1114, 2012.
  26. Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale Orderless Pooling of Deep Convolutional Activation Features," Proceeding of European Conference Computer Vision, pp. 1-17, 2014.
  27. M. So, D.K. Han, S.K. Kang, Y.U. Kim, and S.T. Jung, "Recognition of Fainting Motion from Fish-eye Lens Camera Images," Proceeding of The 23rd International Technical Conference on Circuits/Systems, Computers and Communication, pp. 1205-1208, 2008.
  28. S.W. Jeng and W.H. Tsai, "Using Pano-mapping Tables for Unwarping of Omni-images into Panoramic and Perspective-view Images," Proceeding of IET Image Processing, pp. 149-155, 2007.
  29. H. Asanuma, K. Okamoto, and K. Kawamoto, "Feature Learning Based Human Detection for Omnidirectional Images," Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol. 27, pp. 813-824, 2015.
  30. YOLO Project,, (accessed Aug., 10, 2016).
  31. Bomni-DB Homepage,, (accessed Aug., 10, 2016).