Real-time Multiple Pedestrians Tracking for Embedded Smart Visual Systems

  • Received : 2019.01.11
  • Accepted : 2019.02.11
  • Published : 2019.02.28

Abstract

Even though much progress has been achieved in Multiple Object Tracking (MOT), most reported MOT methods are still not satisfactory for commercial embedded products such as Pan-Tilt-Zoom (PTZ) cameras. In this paper, we propose a real-time multiple pedestrians tracking method for embedded environments. First, we design a new light weight convolutional neural network (CNN)-based pedestrian detector, which is constructed to detect even small size pedestrians. To further save processing time, the designed detector is applied only every other frame, and a Kalman filter is employed to predict pedestrians' positions in the frames where the CNN-based detector is not applied. Pose orientation information is incorporated to enhance object association for tracking pedestrians without additional computational cost. Through experiments on Nvidia's embedded computing board, Jetson TX2, it is verified that the designed pedestrian detector detects even small size pedestrians quickly and well compared to many state-of-the-art detectors, and that the proposed tracking method can track pedestrians in real-time with accuracy comparable to many state-of-the-art tracking methods, which do not target operation in embedded systems.

Keywords

1. INTRODUCTION

Object tracking, such as pedestrian tracking, is an important task in computer vision applications such as visual surveillance, traffic monitoring, human-computer interaction, robot navigation, autonomous vehicle driving, biology and so on [1]. Many well-performing object tracking methods are concerned mainly with tracking accuracy, and most of them are still computationally too expensive for embedded systems.

In this paper, we propose a real-time multiple pedestrians tracking method with good tracking accuracy for a PTZ camera under development that uses Jetson TX2 [2] as its main processor.

The proposed multiple pedestrians tracking method is based on Tracking-By-Detection (TBD) [3,4,5,6]. TBD-based object tracking works in two main stages: first, object detection, and next, association of the detected objects between two consecutive frames.

For the detection part of the proposed tracking method, we design a light weight CNN-based pedestrian detector with good detection performance even for small size pedestrians, which produces pedestrian locations as bounding boxes together with the pose orientations of the detected pedestrians. Since the lightly designed CNN-based pedestrian detector is still not computationally light enough for embedded processing, we apply the detector and a Kalman filter-based object predictor alternately. A possible prediction error (in the position and size of an object) in a frame does not accumulate, since it is readjusted in the next frame by the CNN-based pedestrian detector, which is more reliable than the Kalman filter-based prediction. For association, we boost the accuracy of the Hungarian algorithm [7] by incorporating pose orientation information together with Intersection Over Union (IOU) into the association metric, which does not slow down the association processing. The moving direction information adopted in LKDeep [6] is not used in our association metric, since moving direction under PTZ environments takes time to compute exactly.

Through experiments on Nvidia's embedded computing board, Jetson TX2, and comparisons with the performance results of state-of-the-art object detectors and trackers [3,4,5,6], it is shown that the designed pedestrian detector is fast and performs well even for small size pedestrian detection compared to many state-of-the-art object detectors, and that the proposed tracking method can operate in real-time on embedded systems such as a PTZ camera equipped with Jetson TX2 as a main processor, with performance comparable to many state-of-the-art tracking methods.

 

2. RELATED WORKS

The recent successful object tracking methods are based on TBD, owing to significant improvements in object detection. In TBD, an object detector is first applied to find the target object bounding boxes, and then an association rule is applied to associate the newly detected target objects in the current frame with the detected targets in the previous frame.

Recent object detection has two main development trends: two-stage detectors such as Faster R-CNN [8], and single shot detectors such as YOLOv2 [9] and SSD [10]. For real-time processing, single shot detectors outperform two-stage detectors; on the other hand, single shot detectors perform worse in detecting small objects. Many proposed object detectors [11,12,13], including YOLOv3 [11], the third version of YOLO [14], improve the detection performance for small objects by employing several techniques or adopting new features. For example, YOLOv3 improves accuracy on small objects by designing a new backbone, darknet53, and making detections at 3 scales. All of these improved object detectors pay more computational power for better detection performance; thus, they are not suitable for real-time embedded applications. Tiny YOLOv3 is a faster version of YOLOv3, but its detection performance for small objects is sacrificed for faster processing. In this paper, we design a new CNN-based pedestrian detector which improves the detection performance for small size pedestrians without further processing burden.

In TBD-based tracking, a conventional way to solve the association problem is to use the Hungarian algorithm. Most of the previous TBD-based methods using the Hungarian algorithm for association depend merely on distance or the overlap ratio. Many of these methods, such as SORT [3] and IOU tracker [5], are based only on bounding box position, but such association methods are easily confused when objects overlap, which leads to many identity switches in crowded scenes. To alleviate occlusion problems, [6] additionally employs moving direction information for Hungarian association. Along with the development of deep learning, the authors of Deep-SORT [4] introduce a deep association metric that uses not only the location but also appearance features. Similarly, [15] introduces a network to match a pair of object detections and uses the similarity score for data association. However, calculating appearance features and similarity scores between objects is expensive and therefore may not be appropriate for the embedded environment.

In this paper, we propose a new simple association algorithm employing pose orientation in addition to IOU for Hungarian association, which boosts the Hungarian association accuracy without sacrificing processing speed.

Pose orientation indicates where a person looks towards: front, back, right or left. By incorporating this useful pose orientation information, the proposed association significantly reduces the identity switch problem.

 

3. PROPOSED MULTIPLE PEDESTRIANS TRACKING METHOD

3.1 Overall Working Architecture of the proposed multiple pedestrians tracking method

Fig. 1 shows the overall work-flow of the proposed multiple pedestrians tracking method. The proposed tracking method alternates between a detection mode and a prediction mode. In the detection mode, pedestrians are first detected in the current frame and then associated with pedestrians in the previous frame. In the prediction mode, tracking is done by prediction alone. The proposed tracking method starts in the detection mode. In the detection mode, pedestrians and their pose orientations are extracted by our developed CNN-based pedestrian detector as tight bounding boxes and as one of 4 states (front, back, left, right). The pose orientation of a pedestrian represents the direction the pedestrian looks towards in a scene. For tracking, association between pedestrians of the previous frame and those of the current frame is accomplished by the Hungarian algorithm, where the association cost matrix is built from the IOU between the predicted bounding box and the detected bounding box of a pedestrian, together with pose orientation. The association algorithm of the proposed tracking method is explained in detail later. The predicted bounding box of a pedestrian is obtained by translating the previous bounding box to a new position by a motion vector, which is predicted by a Kalman filter. The Kalman filter contributes to improving object tracking performance, as demonstrated in [16].

 

MTMDCW_2019_v22n2_167_f0001.png

Fig. 1. The flowchart of the proposed multiple pedestrians tracking system.

 

In the prediction mode, the tracking system applies the Kalman filter to predict the motion vector of each bounding box towards the current frame. The motion vector indicates the moving direction. If the system cannot predict the current positions, it keeps the previous positions as the current positions. During the prediction mode, the sizes of the bounding boxes and the pose orientations are kept the same as before. In the prediction mode, the predicted pedestrians are considered as tracked pedestrians.
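As an illustration of the prediction step, the following is a minimal constant-velocity Kalman predictor for a bounding-box center; the state layout, time step and noise covariances are assumed values, not the paper's exact settings.

```python
import numpy as np

# State x = [cx, cy, vx, vy]: bounding-box center and its motion vector.
# dt, Q and R are assumed values, not the paper's tuned parameters.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # constant-velocity motion model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only the center is observed
Q = np.eye(4) * 1e-2                         # process noise covariance (assumed)
R = np.eye(2) * 1.0                          # measurement noise covariance (assumed)

def predict(x, P):
    """Prediction mode: advance the state one frame; x[2:4] is the motion vector."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z):
    """Detection mode: correct the prediction with a detected center z = [cx, cy]."""
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(4) - K @ H) @ P
```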

The proposed tracking system keeps each pedestrian's information for 10 consecutive frames. If the tracking system cannot find a corresponding updated location within 10 frames, it considers the pedestrian lost or disappeared from the scene.

A newly detected pedestrian is followed for the next 3 frames. If the tracking system finds a match for it within that period, it is considered a new object and tracking of it starts. This delay is needed because the detector occasionally makes errors, which would create spurious tracks if every newly detected person were instantly treated as a new track. A minimal sketch of this track lifecycle is given below.
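The sketch below is illustrative bookkeeping, not the paper's exact data structure; only the 10-frame and 3-frame thresholds come from the description above.

```python
MAX_LOST_FRAMES = 10   # a pedestrian is considered lost after 10 frames without a match
CONFIRM_WINDOW = 3     # a new detection must be re-matched within 3 frames to become a track

class Track:
    """Illustrative per-pedestrian state kept by the tracker."""
    def __init__(self, box, pose, track_id):
        self.box, self.pose, self.id = box, pose, track_id
        self.age = 0          # frames since first detection
        self.lost = 0         # consecutive frames without an updated location
        self.confirmed = False

    def mark_matched(self, box, pose):
        self.box, self.pose = box, pose
        self.age += 1
        self.lost = 0
        if not self.confirmed and self.age <= CONFIRM_WINDOW:
            self.confirmed = True   # matched within the first 3 frames -> real track

    def mark_missed(self):
        self.age += 1
        self.lost += 1

    def should_delete(self):
        # drop unconfirmed detections never re-matched in time,
        # and confirmed tracks lost for too long
        return (not self.confirmed and self.age > CONFIRM_WINDOW) or \
               self.lost > MAX_LOST_FRAMES
```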

 

MTMDCW_2019_v22n2_167_f0002.png

Fig. 2. PTZ camera setup and detection result.

 

3.2 Pedestrian detector

We design a CNN-based pedestrian detector which can detect small pedestrians as well as large pedestrians fast enough for real-time embedded applications. As the backbone of the proposed pedestrian detector network, we employ the first 23 layers of MobileNet [17]. MobileNet is a well-known light weight deep neural network for mobile and embedded vision applications, based on a streamlined architecture that uses depthwise separable convolutions. We stack 5 more convolution layers on top of the backbone to obtain more semantic information; the number of additional convolution layers (5) was determined through optimality testing in experiments. After stacking the 5 convolution layers, we split the rest of the network into 2 branches: branch A, responsible for detecting small size pedestrians, and branch B, responsible for detecting large size pedestrians. Moreover, inspired by Feature Pyramid Networks [18], which extract features from layers of different scales, we design the detection network to use features from different layers for accurate detection.

For small size pedestrian detection, we need to detect at a higher resolution. Therefore, we use a convolution layer together with an upsampling layer to increase the resolution of the high-level features, and then concatenate them with the low-level features from the 11th layer of the backbone. We stack 7 more convolution layers to reduce the aliasing effect of upsampling and concatenation; experiments show that 7 additional convolution layers are optimal for branch A. In this branch, the image features include low-level structures from the 11th layer and high-level structures after the upsampling layer. Even though low-level structures alone are not effective for accurate pedestrian detection, they preserve the features of small pedestrians. By combining both low-level and high-level structure features, the feature maps in this branch can describe small pedestrian features with rich semantic information. Therefore, branch A is effective for detecting small size pedestrians.
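The following PyTorch sketch illustrates this two-branch head under assumptions (MobileNet backbone features at strides 16 and 32, illustrative channel widths); it is not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k=3):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TwoBranchHead(nn.Module):
    """Sketch of the two detection branches. feat11 (stride 16, e.g. 40x40) and
    feat23 (stride 32, e.g. 20x20) are assumed MobileNet backbone features;
    channel widths are illustrative, not the paper's exact values."""
    def __init__(self, c11=256, c23=1024, mid=512, out_ch=18):
        super().__init__()
        # 5 extra convolutions on top of the backbone for richer semantics
        self.extra = nn.Sequential(*[conv_bn_relu(c23 if i == 0 else mid, mid)
                                     for i in range(5)])
        self.head_b = nn.Conv2d(mid, out_ch, 1)              # branch B: 20x20x18
        # branch A: convolution + upsampling, concatenation with layer-11 features,
        # 7 refining convolutions, then the 40x40x18 output
        self.lateral = conv_bn_relu(mid, 256, k=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.refine = nn.Sequential(*[conv_bn_relu(256 + c11 if i == 0 else 256, 256)
                                      for i in range(7)])
        self.head_a = nn.Conv2d(256, out_ch, 1)

    def forward(self, feat11, feat23):
        x = self.extra(feat23)
        out_b = self.head_b(x)                                # large pedestrians
        y = self.up(self.lateral(x))                          # back to stride 16
        y = self.refine(torch.cat([y, feat11], dim=1))        # fuse low/high level
        out_a = self.head_a(y)                                # small pedestrians
        return out_a, out_b
```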

Fig. 3 shows the detail of our network architecture. The output of branch B, for large size pedestrian detection, is a matrix of size 20×20×18, where 20×20 refers to the number of grid cells. Each cell contains 18 values, which are divided into 2 bounding boxes; thus, each bounding box has 9 values consisting of 4 bounding box coordinates, 4 pose orientations and 1 confidence score. Similarly, in branch A, the output size is 40×40×18. Branch A is used to detect small pedestrians, and therefore we use higher resolution feature maps and increase the number of cells. The final output is the combination of the 2 branches with 2000 candidate cells (40×40 + 20×20 = 2000), each responsible for predicting 18 values, so the final output size is 2000×18. Similarly to YOLOv3, we use anchors for predicting people. We run K-means clustering on our training set to find suitable anchor values for our dataset (a small clustering sketch is given after Fig. 3). The anchors used in our experiments are (5×24), (9×50), (17×80), (32×155), (58×251), and (114×402).

 

MTMDCW_2019_v22n2_167_f0003.png

Fig. 3. Pedestrian Detector Network architecture.
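As an illustration of the anchor selection step, the sketch below clusters ground-truth box sizes with a YOLO-style K-means that uses 1−IOU as the distance; the implementation details (median update, seeding) are assumptions, not the paper's exact procedure.

```python
import numpy as np

def kmeans_anchors(wh, k=6, iters=100, seed=0):
    """Cluster (width, height) pairs of training boxes into k anchors.
    wh: (N, 2) array of ground-truth box sizes in pixels."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)].copy()
    for _ in range(iters):
        # IOU between every box and every anchor, assuming co-centered boxes
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + \
                (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)        # nearest anchor by IOU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```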

 

Fig. 4 shows the detection results of the 2 branches on the MOT16-12 sequence video at frame 215. The result of branch A is on the right; it detects only the small pedestrians, while branch B focuses on detecting large pedestrians.

 

MTMDCW_2019_v22n2_167_f0004.png

Fig. 4. Detection result of 2 branches on MOT16-12 sequence video, at frame 215

 

The loss function of the designed pedestrian detector network is inherited from YOLO [14]. The loss function penalizes four loss criteria. The detector network produces 2000 candidate cells; each cell is responsible for predicting 18 values, and the detector network calculates a loss for each cell. We obtain the final loss by summing up the losses of every cell. For each box, the loss includes a bounding box loss, a pedestrian confidence loss, a background loss and a pose orientation loss.

The bounding box loss is the same as in YOLO [14], as shown in the 1st and 2nd terms of formula (2). The pedestrian confidence loss indicates the probability that a box contains a pedestrian (3rd term in formula (2)). The background loss penalizes the case where background is detected as a pedestrian (4th term in formula (2)). In formula (2), \(\left(x_{ij}, y_{ij}, w_{ij}, h_{ij}, c_{ij}\right)\) are the center location, width, height and confidence score of the \(j^{th}\) box in grid cell \(i\), and \((\widehat{x_{ij}}, \widehat{y_{ij}}, \widehat{w_{ij}}, \widehat{h_{ij}}, \widehat{c_{ij}})\) are the predicted center location, width, height and confidence score of the corresponding box. \(1_{ij}^{obj}\) indicates a pedestrian: it is equal to 1 if the \(j^{th}\) bounding box of cell \(i\) contains a pedestrian and 0 otherwise. \(1_{ij}^{nobj}\) indicates background: \(1_{ij}^{nobj}=1\) only if the \(j^{th}\) bounding box of cell \(i\) does not contain a pedestrian.

We add one more term to penalize human pose orientation. In this paper, we consider 4 main directions: facing left, right, front and back. The adopted direction loss is based on the cross entropy loss expressed in formula (1), where \(p_{ijk}\) is the probability that the pedestrian in the \(j^{th}\) bounding box of grid cell \(i\) faces the \(k^{th}\) direction, and \(\widehat{p_{ijk}}\) is the corresponding predicted probability.

 

\(\frac{1}{N} \sum\limits_{k=1}^{N} 1_{ij}^{\mathrm{obj}}\left[p_{ijk} \log (\widehat{p_{ijk}})+\left(1-p_{ijk}\right) \log (1-\widehat{p_{ijk}})\right]\)       (1)

 

The constants \(\alpha, \ \lambda, \ \gamma\) are weighting factors of the corresponding loss terms. For example, if we want more precision on the location, we set a higher value for λ.

 

\(\begin{aligned} \text {Loss} &=\lambda_{\text {coord}} \sum_{i=1}^{2000} \sum_{j=1}^{B} 1_{ij}^{\text {obj}}\left[\left(x_{ij}-\hat{x}_{ij}\right)^{2}+\left(y_{ij}-\hat{y}_{ij}\right)^{2}\right] \\ &+\lambda_{\text {coord}} \sum_{i=1}^{2000} \sum_{j=1}^{B} 1_{ij}^{\text {obj}}\left[\left(\sqrt{w_{ij}}-\sqrt{\widehat{w}_{ij}}\right)^{2}+\left(\sqrt{h_{ij}}-\sqrt{\hat{h}_{ij}}\right)^{2}\right] \\ &+\alpha \sum_{i=1}^{2000} \sum_{j=1}^{B} 1_{ij}^{\text {obj}}\left[c_{ij} \log \left(\widehat{c}_{ij}\right)+\left(1-c_{ij}\right) \log \left(1-\widehat{c}_{ij}\right)\right] \\ &+\gamma \sum_{i=1}^{2000} \sum_{j=1}^{B} \frac{1}{N} \sum_{k=1}^{N} 1_{ij}^{\text {nobj}}\left[-\left(p_{ijk} \log \left(\widehat{p_{ijk}}\right)+\left(1-p_{ijk}\right) \log \left(1-\widehat{p_{ijk}}\right)\right)\right] \end{aligned}\)       (2)

 

where \(1_{i}^{obj}\) denotes whether an object appears in cell \(i\), and \(1_{ij}^{obj}\) denotes that the \(j^{th}\) bounding box predictor in cell \(i\) is responsible for that prediction. We trained our network on our collected dataset for 70 epochs. The dataset contains roughly 6,000 images taken from part of the MOT16 [20] training set and from the internet. We relabeled the dataset because no existing dataset provides pose orientation information.
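As a small check of the pose orientation term, formula (1) can be written as the following helper; the tensor layout is an assumption for illustration, and the result is negated so that it acts as a loss, matching the sign used inside the last term of formula (2).

```python
import numpy as np

def pose_orientation_loss(p_true, p_pred, obj_mask, eps=1e-7):
    """Cross-entropy of formula (1): p_true / p_pred have shape
    (cells, boxes, 4) over the four directions; obj_mask is 1 where the
    box holds a pedestrian, 0 otherwise."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    ce = p_true * np.log(p_pred) + (1.0 - p_true) * np.log(1.0 - p_pred)
    per_box = ce.mean(axis=-1) * obj_mask   # 1/N average over the N = 4 directions
    return -per_box.sum()                   # negated so it acts as a loss
```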

 

3.3 Tracking Association

We propose an association algorithm that builds on the association cost matrices of SORT [3] and Deep-SORT [4]. We first integrate IOU and pose orientation together to form a cost matrix, and then apply the Hungarian association method to that cost matrix. Fig. 5 shows the integration.

 

MTMDCW_2019_v22n2_167_f0005.png

Fig. 5. Calculating the cost matrix.

 

where \(f d\left(t_{i}\right)\) and \(f d\left(d_{j}\right)\) are the pose orientations of the predicted tracked object \(t_{i}\) and the detected object \(d_{j}\), respectively.

The IOU between the predicted tracked object \(t_{i}\) and the detected object \(d_{j}\) is calculated as below:

 

\(\operatorname{IOU}\left(t_{i}, d_{j}\right)=\frac{\operatorname{Area}\left(t_{i}\right) \cap \operatorname{Area}\left(d_{j}\right)}{\operatorname{Area}\left(t_{i}\right) \cup \operatorname{Area}\left(d_{j}\right)}\)       (3)
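A minimal sketch of the whole association step is given below: it builds a cost matrix from 1−IOU, adds a fixed penalty when the pose orientations disagree, and solves the assignment with the Hungarian algorithm via SciPy. The pose penalty weight and the IOU threshold are assumed values, since the exact weighting of Fig. 5 is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

POSE_PENALTY = 0.3    # assumed extra cost when pose orientations differ
IOU_THRESHOLD = 0.3   # assumed minimum overlap for a valid match

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2), as in formula (3)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def associate(tracks, detections):
    """tracks / detections: lists of dicts with 'box' and 'pose' ('front', 'back',
    'left' or 'right'). Returns matched index pairs and the unmatched indices."""
    ious = np.zeros((len(tracks), len(detections)))
    cost = np.zeros_like(ious)
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            ious[i, j] = iou(t["box"], d["box"])
            cost[i, j] = 1.0 - ious[i, j]
            if t["pose"] != d["pose"]:
                cost[i, j] += POSE_PENALTY   # pose orientation disagreement
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if ious[r, c] >= IOU_THRESHOLD]
    unmatched_tracks = [i for i in range(len(tracks)) if i not in {m[0] for m in matches}]
    unmatched_dets = [j for j in range(len(detections)) if j not in {m[1] for m in matches}]
    return matches, unmatched_tracks, unmatched_dets
```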

 

4. EXPERIMENTS

4.1 Experimental environments

For the evaluation of the proposed pedestrian detector, we employ the well-known INRIA person dataset [19], which includes roughly 1000 images. Furthermore, in order to evaluate the proposed pedestrian detector's ability to detect small size pedestrians, we used the MOT16 [20] pedestrian detection test set, which includes 7 challenge videos with a total of 5919 image frames, 759 tracks and 182,326 target bounding boxes. The video frame rate varies from 14 to 30 frames per second, and the frame size is 640×480 (MOT16-06 sequence) or 1920×1080 (the remaining sequences). The pedestrian density varies from 8.1 to 45.3 pedestrians per frame.

For the evaluation of the proposed multiple pedestrians tracking method, we utilize the MOT16 benchmark [20] and compare the proposed tracking method with other state-of-the-art tracking methods, SORT [3], Deep-SORT [4] and IOU tracker [5], which have been reported in the MOT 2016 benchmark [20]. The MOT16 evaluation dataset includes 7 challenge videos (MOT16-01, MOT16-03, MOT16-06, MOT16-07, MOT16-08, MOT16-12, MOT16-14). We also compare the proposed tracking method with the other state-of-the-art tracking methods using the proposed detector on the 2D MOT 2015 benchmark [20]. This evaluation dataset includes 11 sequence videos (TUD-Stadtmitte, TUD-Campus, PETS09-S2L1, ETH-Sunnyday, ADL-Rundle-6, ADL-Rundle-8, ETH-Bahnhof, ETH-Pedcross2, KITTI-13, KITTI-17, Venice-2) with a total of 5500 image frames and 39,905 target bounding boxes. The video frame rate varies from 7 to 30 frames per second, and the frame size from 640×480 to 1920×1080. The pedestrian density varies from 2.2 to 11.9 pedestrians per frame.

We use a PC with an Intel Core i7-4770 CPU @ 3.40 GHz, 16 GB of RAM and a GeForce GTX Titan X graphics card to train the proposed pedestrian detector. The embedded system for the experiments is a Jetson TX2 board [2] with an HMP Dual Denver 2 (2 MB L2) + Quad ARM A57 (2 MB L2) CPU @ 2 GHz, 8 GB of RAM, and an NVIDIA Pascal GPU with 256 CUDA cores.

 

4.2 Experimental metrics

To evaluate pedestrian detection performance, we use Average Precision (AP), precision and recall. For details, readers may refer to [21].

For evaluating tracking performance, we use the performance metrics suggested by the MOT benchmark [20], which include MOTA, MOTP, MT, ML, FP, FN, ID SW and Frag. If a tracked object is an actual target object, the tracked object is called a true positive; otherwise, it is called a false positive. If an actual target is missed by the tracker, it is called a false negative. FP represents the total number of false positives, and FN the total number of false negatives. ID SW (Identity Switch) counts the number of mismatched objects in a frame; if a tracked object does not match its actual ground truth target, its ID is said to be switched. MT (Mostly Tracked targets) is the ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span. ML (Mostly Lost targets) is the ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span. Frag is the total number of times a trajectory is fragmented (i.e., interrupted during tracking), and Hz is the processing speed (in frames per second, excluding the detector) on the benchmark.

MOTA (Multiple Object Tracking Accuracy) measures three error sources: false positives, missed targets and identity switches and is defined as:

 

\(\text{MOTA}:=1-\frac{\sum\limits_{t}\left(FN_{t}+FP_{t}+IDSw_{t}\right)}{\sum\limits_{t} GT_{t}}\)       (4)

 

where t is the frame index and \(GT_{t}\) is the number of ground truth objects in frame t. MOTP (Multiple Object Tracking Precision) measures the misalignment between the annotated (ground truth) and the tracked bounding boxes and is defined as:

 

\(\text{MOTP}:=\frac{\sum\limits_{t, i} d_{t, i}}{\sum\limits_{t} C_{t}}\)       (5)

 

where  \(C_{t}\) denotes the number of matches in frame time t and \(d_{t,i}\) is the IOU (3) between tracked object i and its ground truth object at frame time t. MOTP thereby gives the average overlap between all correctly matched hypotheses and their respective objects and ranges between \(t_{d}\) = 50% and 100%.
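For illustration, formulas (4) and (5) reduce to the small helpers below, assuming the per-frame counts and matched overlaps have already been extracted from the evaluation output.

```python
def mota(fn, fp, idsw, gt):
    """Formula (4): fn, fp, idsw, gt are per-frame counts of misses,
    false positives, identity switches and ground-truth objects."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / float(sum(gt))

def motp(overlaps, matches_per_frame):
    """Formula (5): overlaps is the flat list of d_{t,i} IOU values of all
    matched pairs, matches_per_frame the per-frame match counts C_t."""
    return sum(overlaps) / float(sum(matches_per_frame))
```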

For a detailed description of each metric, readers may refer to [6, 20]. From the object tracking point of view, MOTA is considered to be more important than MOTP.

 

4.3 Experimental results

4.3.1 Performance of Pedestrian detector

Experimental results on the performance of the proposed pedestrian (object) detector on the INRIA and MOT16 datasets are summarized in Table 1.

 

Table 1. Performance comparison between state-of-the-art pedestrian detector and proposed pedestrian detector

MTMDCW_2019_v22n2_167_t0004.png

 

YOLOv3 JTA is YOLOv3 trained on the JTA dataset, YOLOv2 is version 2 of YOLO, and DPM is the Deformable Part Model [23]. The results for the proposed detector and Tiny YOLOv3 in Table 1 were obtained from experiments on the Jetson TX2 board, while the AP of YOLOv3 JTA, YOLOv2, Faster R-CNN and DPM on MOT16 is taken from the MOT website [20]; their speed, however, is measured on the Jetson TX2 board.

Table 1 shows that the proposed pedestrian detector performs significantly better than Tiny YOLOv3 on the MOT16 testing set, which contains many small size pedestrians. YOLOv3 is more reliable in pedestrian detection than the proposed detector; however, YOLOv3 is very slow on an embedded computing device such as the Jetson TX2 board, taking around 1 second to process a frame. For the proposed tracking method, we apply the detector every other frame, which allows the entire proposed tracking method to operate in real-time even on embedded systems.

 

4.3.2 Performance of The Proposed Multiple Pedestrian Tracker

Experimental results on the performance of the proposed tracker on the MOT16 benchmark are summarized in Table 2. We compare our results with some open source state-of-the-art tracking algorithms, SORT [3], Deep-SORT [4] and IOU tracker [5], and with our previous tracker, LKDeep [6]. SORT, Deep-SORT and IOU tracker are not the best among the state-of-the-art tracking algorithms, but were chosen because their source codes have been released.

 

Table 2. Comparing tracking results on MOT16 challenge Benchmarks

MTMDCW_2019_v22n2_167_t0005.png

 

The results of LKDeep in Table 2 are taken from [6]. The results of the other open source trackers in Table 2 are taken from [20]; the arrow ↑ indicates that a higher score is better, and the arrow ↓ indicates that a lower score is better. IOU tracker, SORT and Deep-SORT utilize the Faster R-CNN detection data provided by the MOT16 challenge, which produces more precise detections than a single shot detector like the proposed detector, but is slow. LKDeep, our previous tracker, utilizes the DPM detection data provided by MOT16. Proposed Tracker v2 is the same as the proposed tracker except that the pedestrian detector is applied for every frame. In Table 2, (*) indicates that the detector in the tracker is applied for every frame, and (**) indicates that the detector is applied for every other frame.

From experimental results in Table 2, one can notice the following facts.

First, the proposed tracker achieves overall performance comparable to our previous tracker, but with a significant association speed-up, which makes the proposed tracker more suitable for embedded applications. Second, the proposed tracker shows much better association speed than the other open source trackers, except the IOU tracker. Even though the IOU tracker shows very high association speed due to its very simple association algorithm, it cannot work in real-time since it utilizes the Faster R-CNN detector, which is slow, as shown in Table 1.

In order to have a fair comparison, we also evaluated the open source tracking methods, the SORT tracker, Deep-SORT tracker and IOU tracker, on the 2D MOT 2015 benchmark after replacing their detectors with our designed pedestrian detector. Table 3 summarizes the experimental results with the replaced detector, which were produced by the Multiple Object Tracking Challenge Development Kit [22].

As stated in Section 4.2, from the tracking point of view, MOTA (Multiple Object Tracking Accuracy) is the most important performance metric for evaluating a tracking method. Proposed Tracker v2, where the pedestrian detector is applied for every frame, produces higher MOTA and MOTP than the other trackers. Table 3 shows that by incorporating pose orientation into the association cost matrix, the proposed tracker has fewer identity switches (ID SW of 431, compared to 528 for SORT) while keeping almost the same processing time. Our association can also keep tracking targets for a longer period, with an MT of 129 compared to 111 for SORT. The IOU tracker uses a simple and less accurate association, which leads to a severe identity switch problem with 1117 identity switches. With high quality deep appearance features used for association, Deep-SORT is the best at keeping target identities, with an ID SW of 285; it also keeps tracking targets for a long period, with the highest MT of 165. However, extracting deep appearance features takes a lot of time, which makes Deep-SORT the slowest tracker in Table 3.

 

Table 3. Comparing tracking results on the same detector

MTMDCW_2019_v22n2_167_t0006.png

 

The proposed tracker reduces processing time by applying pedestrian detection every other frame. From Table 3, one can see that by applying detection every other frame we achieve a big improvement in processing speed without sacrificing too much tracking accuracy; the proposed tracker is the fastest among the trackers in Table 3. We also tested SORT v2, where our designed detector is applied every other frame; its accuracy deteriorates even though its speed improves. Overall, the proposed tracker can be considered more suitable than the other trackers listed in Table 3 for embedded applications such as multiple pedestrians tracking under a PTZ camera.

 

5. CONCLUSION

In this paper, we proposed a real-time multiple pedestrians tracking method for embedded applications such as embedded surveillance. Compared to the state-of-the-art multiple object tracking methods, which show excellent tracking accuracy, the proposed tracker trades off some accuracy for speed, but it operates in real-time and performs accurately enough for some embedded applications, as shown through comparison experiments on the Jetson TX2 embedded board.

References

  1. H. Yanga, L. Shaoa, F. Zhenga, L. Wangd, and Z. Songa, "Recent Advances and Trends in Visual Tracking: A Review," Neurocomputing, Vol. 74, No. 18, pp. 3823-3831, 2011.
  2. Jetson TX2, https://elinux.org/Jetson_TX2 (accessed Feb., 12, 2019).
  3. A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple Online and Realtime Tracking," Proceeding of IEEE International Conference on Image Processing, pp. 3464-3468, 2016.
  4. N. Wojke, A. Bewley, and D. Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric," Proceeding of IEEE International Conference on Image Processing, pp. 3645-3649, 2017.
  5. E. Bochinski, V. Eiselein, and T. Sikora, "High-Speed Tracking-by-Detection Without Using Image Information," Proceeding of International Workshop on Traffic and Street Surveillance for Safety and Security at IEEE Advanced Video and Signal-based Surveillance, pp. 1-8, 2017.
  6. Q.D. Vu, T.B. Nguyen, and S.T. Chung, “Simple Online Multiple Pedestrian Tracking Based on LK Feature Tracker and Detection for Embedded Surveillance,” Journal of Korea Multimedia Society, Vol. 20, No. 6, pp. 893-910, 2017. https://doi.org/10.9717/KMMS.2017.20.6.893
  7. Hungarian Algorithm, https://en.wikipedia.org/wiki/Hungarian_algorithm (accessed Feb., 12, 2019).
  8. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks," Proceedings of the 28th International Conference on Neural Information Processing Systems - Vol. 1, pp. 91-99, 2015.
  9. J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," Proceeding of 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 6517-6525, 2017.
  10. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, et al., "SSD: Single Shot MultiBox Detector," Proceeding of European Conference on Computer Vision, pp. 1-17, 2016.
  11. J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767, 2018.
  12. S. Bell, C.L. Zitnick, K. Bala, and R. Girshick. "Inside-outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2874-2883, 2016.
  13. J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, S. Yan, et al., "Perceptual Generative Adversarial Networks for Small Object Detection," Proceeding of 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1222-1230, 2017.
  14. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Proceeding of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, 2016.
  15. L. Leal-Taix, C.C. Ferrer, and K. Schindler, "Learning by Tracking: Siamese CNN for Robust Target Association," Proceeding of 2016 IEEE Computer Vision and Pattern Recognition Conference Workshops, pp. 418-425, 2016.
  16. D. Kim, J. Park, and C. Lee "Object-tracking System Using Combination of CAMshift and Kalman Filter Algorithm," Journal of Korea Multimedia Society, Vol. 16, No. 5, pp. 619-628, 2013. https://doi.org/10.9717/kmms.2013.16.5.619
  17. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861, 2017.
  18. T.Y Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, S. Belongie, et al., "Feature Pyramid Networks for Object Detection," Proceeding of 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 936-944, 2017.
  19. INRIA Person Dataset, http://pascal.inrialpes.fr/data/pedestrian/ (accessed Feb., 12, 2019).
  20. MOT Challenge, https://motchallenge.net/ (accessed Feb., 12, 2019).
  21. M. Everingham, L. Gool, C.K. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, Vol. 88, No. 2, pp. 303-338, 2010. https://doi.org/10.1007/s11263-009-0275-4
  22. Multiple Object Tracking Challenge Development Kit, https://bitbucket.org/amilan/motchallenge-devkit/ (accessed Feb., 12, 2019).
  23. P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object Detection with Discriminatively Trained Part Based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, pp. 1627-1645, 2010. https://doi.org/10.1109/TPAMI.2009.167