
Remote Distance Measurement from a Single Image by Automatic Detection and Perspective Correction

  • Layek, Md Abu (Department of Computer Science and Engineering, Kyung Hee University Global Campus) ;
  • Chung, TaeChoong (Department of Computer Science and Engineering, Kyung Hee University Global Campus) ;
  • Huh, Eui-Nam (Department of Computer Science and Engineering, Kyung Hee University Global Campus)
  • Received : 2018.10.09
  • Accepted : 2019.02.09
  • Published : 2019.08.31

Abstract

This paper proposes a novel method for locating objects in real space from a single remote image and measuring the actual distances between them through automatic detection and perspective transformation. The dimensions of the real space are known in advance. First, the corner points of the region of interest are detected from the image using deep learning. Then, based on the corner points, the region of interest (ROI) is extracted and made proportional to the real space by applying a warp-perspective transformation. Finally, the objects are detected and mapped to their real-world locations. Removing distortion from the image using camera calibration improves the accuracy in most cases. The deep learning framework Darknet is used for detection, and the necessary modifications are made to integrate perspective transformation, camera calibration, un-distortion, etc. Experiments are performed with two types of cameras, one with barrel and the other with pincushion distortion. The results show that the differences between the calculated distances and those measured in real space with measurement tapes are very small, approximately 1 cm on average. Furthermore, automatic corner detection allows the system to be used with any type of camera, whether in a fixed pose or in motion; using more points significantly enhances the accuracy of real-world mapping even without camera calibration. Perspective transformation also increases the object detection efficiency by unifying the sizes of all objects.

Keywords

1. Introduction

Over the last few years we have witnessed huge developments in the computer vision field, and, currently powered by deep learning, progress has been accelerating even more. The high-level goal of computer vision is to gain human-level understanding from digital images or videos. Because it involves understanding, computer vision makes heavy use of machine-learning technologies along with primitive image and video processing tasks. These days, deep learning has become the most dominant technology for computer vision [1]. Deep neural networks are also a good candidate for image processing jobs [2]. Darknet [3] provides a fast deep learning framework implemented in the C language. Natively, the popular real-time object detection system YOLO [4] uses Darknet. YOLO is very fast compared to previous object detection techniques because it treats detection as a single regression problem and uses a single network for the whole detection [5], [6]. The active development of YOLO results in continuous improvement [5], [7], and it is being extended to 3D [8]. Moreover, several other popular Convolutional Neural Network (CNN) architectures such as DenseNet [9] and ResNet [10] can also be used and combined with YOLO [6], [11], [12]. This article uses YOLOv2 [7].

In many situations, merely detecting an object within an image or video is not enough; rather, the location of the object in real space is needed. For example, it enables us to measure the distance between objects, which can further be utilized in alarms, robot vision and many other services. However, the pixel position representing an object in an image varies greatly depending on the camera position and angle as well as the quality of the camera, so we need to consider these factors to get the real locations of objects from an image. Camera calibration [13] can find the intrinsic and extrinsic camera parameters along with the distortion vectors, such as radial and tangential. Using this information we can remove some of the distortions from images taken by that camera. Calibration also helps us locate an object in the real world, but in that case we either need to keep both the camera and the region of interest fixed, or place a checkerboard-like object in the Field of View (FoV) [14], which is not always easy. Additionally, if our region of interest is surrounded by many unwanted things within the FoV, the chance of misdetection increases. To overcome this problem, we need to find our region of interest (ROI) before detection.

Optical distance measurement techniques are primarily of two kinds: contact and non-contact. Most of the early classical methods are based on laser or ultrasonic sensors, or even a physical ruler attached in special arrangements [13]-[15]. On the contrary, vision-based methods rely only on images captured by cameras, which is very convenient and flexible. Several works describe measuring real-world distance from images, and most of them use a stereo camera or dual cameras placed at a known distance [18], [19], as well as a single motionless camera [13], [20]. Also, several methods are based on special types of cameras such as CCD or depth cameras [15], [21]. Other approaches [18], [20]-[22] measure the distance between the camera and an object using one or more reference objects in the FoV; Chan et al.'s [25] car-to-car distance measuring system is actually a variation of this category. In contrast, our method deals with locating and measuring distances between objects within the ROI. Raza et al. [26] proposed an image-based framework for estimating the distance and dimensions of pedestrians, which exhibits very good results. Their method used a fixed arrangement with a motionless camera where the boundary marks are predefined. The mobile robot localization systems discussed in [27], [28] use simple background subtraction methods for object detection, and parallel distance measurement and morphological processing, respectively, for measuring distances. However, it is possible to get better results using YOLO object detection along with perspective transformation.

In this work, we use a single image frame from a single camera, and through automatic corner detection this approach can work well even if the camera position changes. Although a change of ROI orientation is not considered in this work, it can be handled by using corner points of varying colors; after the corner detection phase, the orientation can then be adjusted as desired. This method also does not depend on the type of camera used to capture the image. The proposed system can determine an object's location in real space, from which distance and motion measurement, tracking, etc. can be done. The region of interest (ROI) is determined with some predefined marks, and any existing objects within the FoV can be trained as marks using deep learning. The relative positions of the marks are known prior to their detection. After corner detection, the image is cropped to the ROI and a warp-perspective transformation is applied to create a new image with the same aspect ratio as the real space. Then, object detection is run on this normalized image and we can easily determine the real locations of the objects. However, the performance greatly depends on the number of marks defining the boundary (also referred to as corner points in this paper) as well as the accuracy of their detection. Removing the distortion before detection decreases the error, and if we take more boundary marks to define the ROI, the accuracy improves even without camera calibration. From the experiments, we see that perspective transformation also improves the detection performance apart from the real-world mapping.

Fig. 1. Example application scenario using the proposed approach

Fig. 1 shows a simple application scenario employing the proposed approach. After capturing the scene, the camera sends an image to the server; the server performs transformation, detection and other processing, then shares the measurement information with all of the connected devices. If the camera unit is equipped with enough computing capacity, the processing can instead be done on the device, which then sends the measurement information to the server.

We can summarize the contributions and importance of this work as follows:

i.) The proposed approach detects objects using a 'corner detection - perspective transformation - object detection' pipeline, which normalizes the size of the objects inside the ROI and improves detection performance. In our experiments, we found that objects which remain undetected at their original size can be detected after the transformation.

ii.) We propose a novel integrated distance measurement method from a single camera image, combining perspective transformation and object detection. Using the YOLO architecture makes the detection fast and suitable for real-time applications.

iii.) We modified the Darknet framework by integrating camera calibration, perspective transformation and un-distortion; other necessary adjustments are also made.

iv.) We demonstrate the effect of the number of boundary points on distance measurement accuracy. Experiments show that using 9 corner points with the original image gives results similar to those obtained using camera calibration and an undistorted image. Also, for any fixed camera arrangement, the corner points need to be detected only once; if the camera position changes, another detection is needed, and so on.

The rest of the paper is organized as follows. Section 2 describes some underlying theories and related techniques. Sections 3 and 4 discuss the experimental arrangements and scenarios, whereas the results are shown in Section 5. Finally, we conclude the paper in Section 6.

2. Background

In this section, we briefly review the underlying theories that this paper relies on: camera calibration, perspective transformation, and object detection with Darknet YOLO.

2.1 Camera Calibration

Camera calibration is the process of estimating the parameters of the lens and image sensor of an image or video camera. The intrinsic parameters are inherent to the specific camera hardware and include the focal length, the optical center, and the skew coefficient. They are usually expressed as a 3 × 3 camera matrix (1). The parameters of equation (1) are described in Table 1. The extrinsic parameters correspond to rotation and translation vectors.

There are two additional distortion parameters. Radial distortion causes straight lines to appear curved. It occurs when light rays bend more near the edges of the lens than at its optical center (Fig. 2). The smaller the lens, the greater the distortion. The distortion can be positive (barrel) or negative (pincushion); with two coefficients it is represented as in equation (2).

Tangential distortion occurs when the lens is not parallel to the image plane and is represented as in equation (3). The distortion coefficients of (2) and (3) are represented as a single vector (4).

$$A = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

Table 1. Intrinsic parameters of a camera
  f_x, f_y : focal lengths expressed in pixel units
  c_x, c_y : coordinates of the optical center (principal point)
  s        : skew coefficient between the x and y axes

$$x_{distorted} = x\,(1 + k_1 r^2 + k_2 r^4)$$
$$y_{distorted} = y\,(1 + k_1 r^2 + k_2 r^4) \qquad (2)$$

$$x_{distorted} = x + \big[\,2 p_1 x y + p_2 (r^2 + 2x^2)\,\big]$$
$$y_{distorted} = y + \big[\,p_1 (r^2 + 2y^2) + 2 p_2 x y\,\big] \qquad (3)$$

where $r^2 = x^2 + y^2$; $x$ and $y$ are the undistorted pixel locations in normalized image coordinates.

$$distCoeffs = [\,k_1,\ k_2,\ p_1,\ p_2\,] \qquad (4)$$

Fig. 2. Barrel and Pincushion distortions
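As a concrete reference for this step, a minimal OpenCV (Python) sketch of checkerboard-based calibration and un-distortion is given below; the checkerboard size and file names are illustrative assumptions, not values from our experiments.

```python
# Hedged sketch of the calibration/undistortion step with OpenCV's Python API;
# CHECKERBOARD and the image paths are illustrative assumptions.
import glob
import cv2
import numpy as np

CHECKERBOARD = (9, 6)  # inner-corner count of the printed pattern (assumed)
objp = np.zeros((CHECKERBOARD[0] * CHECKERBOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:CHECKERBOARD[0], 0:CHECKERBOARD[1]].T.reshape(-1, 2)

obj_points, img_points = [], []  # 3D pattern points and matching 2D image points
for path in glob.glob("checkerboard/*.jpg"):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, CHECKERBOARD, None)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate the camera matrix (eq. 1) and distortion vector (eq. 4)
ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Remove radial/tangential distortion from a scene image before further processing
scene = cv2.imread("arrangement.jpg")
undistorted = cv2.undistort(scene, camera_matrix, dist_coeffs)
```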

2.2 Perspective Transformation

For perspective transformation of the ROI, we use the warpPerspective function built into the OpenCV library, declared as in equation (5). The function transforms the source image using a specified 3 × 3 matrix as in equation (6); the parameters of the function are described in Table 2. The 3 × 3 transformation matrix can be estimated using another built-in function, getPerspectiveTransform, declared as in equation (7), where src and dst are four corresponding points in the source and destination images, respectively. We modified Darknet to integrate these functions.

$$\mathrm{warpPerspective}(src,\ dst,\ M,\ dsize,\ flags) \qquad (5)$$

$$dst(x, y) = src\!\left(\frac{M_{11}x + M_{12}y + M_{13}}{M_{31}x + M_{32}y + M_{33}},\ \frac{M_{21}x + M_{22}y + M_{23}}{M_{31}x + M_{32}y + M_{33}}\right) \qquad (6)$$

$$M = \mathrm{getPerspectiveTransform}(src,\ dst) \qquad (7)$$

Table 2. Parameters of the warpPerspective function
  src   : input (source) image
  dst   : output image, of size dsize and the same type as src
  M     : 3 × 3 perspective transformation matrix
  dsize : size of the output image
  flags : interpolation method
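The listing below is a minimal Python sketch of this transformation using OpenCV's getPerspectiveTransform and warpPerspective; the corner coordinates are made-up examples, while the 1024 × 846 output size matches the real-space proportion used later in the paper.

```python
# Minimal sketch of the ROI warp with OpenCV's Python bindings; corner
# coordinates below are illustrative values, not measured ones.
import cv2
import numpy as np

image = cv2.imread("arrangement.jpg")

# Four corner points in the source image (x, y), ordered
# top-left, top-right, bottom-right, bottom-left (assumed ordering).
src_pts = np.float32([[412, 208], [1541, 197], [1689, 903], [266, 917]])

# Corresponding points in the destination image whose aspect ratio
# matches the real board (2070 mm x 1710 mm -> 1024 x 846 pixels).
W, H = 1024, 846
dst_pts = np.float32([[0, 0], [W, 0], [W, H], [0, H]])

M = cv2.getPerspectiveTransform(src_pts, dst_pts)  # 3x3 matrix of eq. (6)/(7)
roi = cv2.warpPerspective(image, M, (W, H))        # eq. (5)
```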

2.3 Object detection with Darknet YOLO

As mentioned in the introduction, YOLO is a powerful tool for real-time object detection. For the training process of YOLO, the objects are marked as bounding boxes and stored in an XML file for each image, where every object is described by four spatial parameters {Xmin, Ymin, Xmax, Ymax} specifying the object's location inside the image. There are various marking tools, and in this work we used LabelImg [29]. However, Darknet takes a .txt file for each image with one line per ground truth object of the form {class, x_center, y_center, width, height}, with the coordinates normalized by the image dimensions. We used a Python script to generate these text files from the XML files.
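A conversion script along these lines could look like the following Python sketch; the class list and file paths are illustrative assumptions rather than the exact script used in this work.

```python
# Sketch of a VOC-XML-to-Darknet label conversion; class names are assumed.
# Each output line is: <class_id> <x_center> <y_center> <width> <height>,
# all normalized to [0, 1] by the image width and height.
import xml.etree.ElementTree as ET

CLASSES = ["Cart1", "Cart2", "Robot"]  # assumed label set

def voc_to_darknet(xml_path, txt_path):
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        x_c = (xmin + xmax) / 2.0 / img_w
        y_c = (ymin + ymax) / 2.0 / img_h
        w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        lines.append(f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```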

Detections are also produced as bounding boxes. Every object's position in the image is specified by a four-element vector [x_low, x_high, y_low, y_high], where x_low and x_high are X-values and y_low and y_high are Y-values. We modified Darknet to store these position vectors in a file, and we determine the middle point of an object by the tuple {(x_low + x_high)/2, (y_low + y_high)/2}, which is used as the single-point determiner of the object position; for corners, however, we calculate the point differently.

Table 3. X, Y values for boundary marks

Table 3 lists the corner points, the corresponding (X, Y) values for the 9 boundary marks, the real-space points and the transformed image points used in our experiments. The calculated (X, Y) values are used as the source points for the perspective transformation. Details of YOLO can be found on the Darknet website [3].

In Darknet, the training process takes a configuration file describing the neural network architecture, and we used a slightly modified version of the default YOLOv2 configuration as depicted in Fig. 3. There are 23 convolutional layers; the input image is resized to 512 × 512, and the final feature map size before classification is 16 × 16, produced by a combination of 3 × 3 and 1 × 1 filters, maxpool, reorg and route layers. Some related aspects of the training and the CNN architecture are discussed below:

Grid cell: YOLO divides an image into S × S grid cells and each cell predicts a fixed number of bounding boxes. In our case, we use 16 × 16 grid cells, where every grid cell predicts bounding boxes with different aspect ratios based on the number of anchors; in our network we use 10 anchors. We keep 20 object classes as in PASCAL-VOC, although only three labels (Cart1, Cart2, and Robot) are provided during object detection training and one label (Corner) during corner detection training. Each bounding box prediction has 5 components: (x, y, w, h, confidence). The (x, y) pair represents the center of the box relative to the grid cell location; these coordinates are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size. A bounding box is removed if it contains no object or its confidence score is less than a threshold. On the other hand, if a bounding box contains an object with sufficient confidence, redundant detections of the same object are removed using Non-Max Suppression (NMS) and Intersection over Union (IoU), as sketched below.
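The following Python sketch illustrates the IoU test and a greedy non-max suppression of the kind described above; it is a generic illustration, not Darknet's internal implementation, and the thresholds are example values.

```python
# Generic IoU / NMS sketch for filtering redundant detections.
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(detections, conf_thresh=0.25, iou_thresh=0.45):
    """detections: list of (box, confidence). Keeps the highest-confidence
    box and drops overlapping boxes that describe the same object."""
    dets = sorted([d for d in detections if d[1] >= conf_thresh],
                  key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf in dets:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, conf))
    return kept
```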

Batch Normalization: Batch normalization (BN) normalizes the value distribution before it goes into the next layer. From YOLOv2, batch normalization has been introduced, which improves convergence without other regularizations such as dropout.

Convolution with anchor boxes: The original YOLO predicts the bounding box coordinates with fully connected layers; however, inspired by Faster R-CNN [30], YOLOv2 introduced anchor boxes. This simplifies the learning process because only offset predictions are needed instead of predicting coordinates directly. YOLOv2 shrinks the input image from 448 × 448 to 416 × 416 to make the output feature size 13 × 13, so that there is a single center cell. The authors argued that larger objects tend to reside in the center, and a single-center feature grid makes detection faster. However, in our case the Robots, Carts and Corners are not too large with respect to the ROI, and we found that 512 × 512 input images with an output feature size of 16 × 16 give better predictions. Moreover, YOLOv2 has multi-scale training, which enables it to work on a range of dimensions from 320 to 608 in multiples of 32. As a result, our initial choice of 512 × 512 does not create any inconsistency.

Fine-Grained Features: Objects in our experiments are relatively small, and the high-resolution feature size of 16 × 16 helps in this regard. The feature map is further improved by adding a passthrough layer (route layer) to bring in features from an earlier layer at 32 × 32 resolution.

Loss Function: The loss function is a multipart function; for a grid cell i and bounding box j, it can be defined as in equations (8), (9). The first part indicates the bounding box parameter loss, the second is the class prediction loss, and the third indicates the confidence score loss.

(8)

The predicted values are b = (x, y, w, h) and the ground truth values are b̂ = (x̂, ŷ, ŵ, ĥ); λ_coord, λ_class, λ_obj, and λ_noobj are scalar weights.

𝟙_ij^obj and 𝟙_ij^noobj are 0/1 indicators such that: (9)
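For reference, the sketch below gives the standard YOLO-style multipart loss in conventional notation; it is an assumption about the structure behind equations (8) and (9) rather than a verbatim reproduction, with the weights λ corresponding to the scalar weights mentioned above.

```latex
\mathcal{L} =
  \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}
     \Big[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2
         + (\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2 \Big]
+ \lambda_{class} \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i}
     \sum_{c \in classes} \big(p_i(c)-\hat{p}_i(c)\big)^2
+ \lambda_{obj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}\,\big(C_i-\hat{C}_i\big)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij}\,\big(C_i-\hat{C}_i\big)^2

\text{where } \mathbb{1}^{obj}_{ij}=1 \text{ if the } j\text{-th box of cell } i
\text{ is responsible for a ground-truth object and } 0 \text{ otherwise, and }
\mathbb{1}^{noobj}_{ij}=1-\mathbb{1}^{obj}_{ij}.
```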

Dataset and training: Our training dataset has 1000 images captured from different directions while running the carts and robots; the corners and objects were then annotated using LabelImg. We converted the Darknet19 448 × 448 pre-trained weight file to use as the initial convolutional weights. Although not fully converged, after 70,000 iterations there was no noticeable improvement; as a result, the weight file after 80,000 iterations was used. Training for the boundary marks (Corners) and the objects (Robots and Carts) was performed separately, giving two separate weight files (Figs. 5 and 6).

Fig. 3. The deep learning pipeline used in this article

3. Experiment Setup

In our experiment, the ROI is a board and the objects inside it are ten Hamster robots and two AlphaBot2s, as shown in Fig. 4; the left image (4a) is taken with one camera and the right one (4b) with another. We covered the Hamsters with blue paper, one of the AlphaBots with white paper (Cart1) and the other with orange paper (Cart2), as shown in Fig. 5.

Fig. 4. Experimental Arrangements with two different cameras

Fig. 5. Objects and marks used in the experiments

Fig. 6. Flow diagram for the experimental system

Fig. 6 shows the steps of our proposed system. Images are captured with two kinds of cameras, a 2-megapixel webcam (model ELP-USBFHD05MT-DL36) and a Galaxy Note Edge mobile camera. As discussed in Section 2, as an optional step we can undistort the image taken by a specific camera if we know the camera matrix, preferably along with the distortion vectors. To obtain these parameters we take checkerboard images captured by the same camera (Fig. 7) and apply the calibration process discussed in Section 2.1, which yields the camera matrix along with the distortion coefficients. Figs. 4a and 4b are taken with two different cameras, so they have different internal parameters and distortions. The camera matrices of equations (10) and (12) correspond to equation (1), and the four-coefficient distortion vectors of equations (11) and (13) correspond to (4), following the discussion of Section 2.1. From the images, it can be observed that camera1 has barrel distortion and camera2 has slight pincushion distortion. After removing the distortion we get the undistorted versions shown in Fig. 8.

Fig. 7. Checkerboard images taken for calibration

(10)

(11) 

(12)

 (13)

 

Fig. 8. Undistorted Version of the Arrangements

Corner detection is then applied using the weight file trained with corner-labeled images, and the corner points are obtained using Table 3. Then, following the techniques discussed in Section 2.2, we apply the warpPerspective transformation to crop through the corner points and transform the ROI into a new image with the same aspect ratio as the real space. When 9 points are used, the whole image is divided into 4 blocks and the transformation is applied to each block separately, as sketched below; details are discussed in Section 4.
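A Python sketch of this nine-point, four-block transformation is given below; it assumes the nine marks form a 3 × 3 grid (corners, edge midpoints, and center) stored row by row, which is an illustrative layout rather than the exact code used in our modified Darknet.

```python
# Hedged sketch of the nine-point, four-block warp: each quarter of the ROI,
# bounded by four of the nine marks, is warped to the matching quarter of the
# output image and the four results are merged.
import cv2
import numpy as np

W, H = 1024, 846  # output size proportional to the real board

def quad(p_tl, p_tr, p_br, p_bl):
    # Four points in TL, TR, BR, BL order as a float32 array
    return np.float32([p_tl, p_tr, p_br, p_bl])

def warp_nine_points(image, pts):
    """pts[r][c]: image coordinates of the mark in grid row r, column c (assumed 3x3 layout)."""
    out = np.zeros((H, W, 3), dtype=image.dtype)
    half_w, half_h = W // 2, H // 2
    for r in range(2):          # two block rows
        for c in range(2):      # two block columns
            src = quad(pts[r][c], pts[r][c + 1], pts[r + 1][c + 1], pts[r + 1][c])
            dst = quad((0, 0), (half_w, 0), (half_w, half_h), (0, half_h))
            M = cv2.getPerspectiveTransform(src, dst)
            block = cv2.warpPerspective(image, M, (half_w, half_h))
            out[r * half_h:(r + 1) * half_h, c * half_w:(c + 1) * half_w] = block
    return out
```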

On this transformed image, another detection step is applied to detect the objects using the other weight file, trained with object-labeled images. Finally, we get the object positions in the same aspect ratio as the real space, so we can estimate the real distances easily.

Along with the exact sequence of steps shown in Fig. 6, some other combinations are used in the experiments to understand the effect of each process; Section 4 describes these scenarios.

4. Description of Experiment Scenarios

For each arrangement, we consider the following cases for comparison:

  • Manual object detection using manually selected corner points with original and undistorted images.
  • Object detection by the system for both original and undistorted images:
      • Using manually selected corner points
      • Using detected corner points

 

 

In arrangement2 (Camera2), we set nine boundary marks and use both four and nine corners; taking all of these together, we perform object detection on 18 different combinations over both arrangements.

4.1 Case1: Manual detections

Fig. 9. Perspective transformed image using four manually selected corner points

In this case, we want to evaluate the performance of the perspective transformation alone. The corner points are selected from the image manually, then the perspective transformations are made, and finally the object positions in the image are also selected manually so that there is the minimum possible error due to detection. Figs. 9 and 10 show the transformed outputs.

Fig. 10. Perspective transformed image using nine manually selected corner points

In Fig. 9, we present the transformed output using 4 corner points for both arrangements. When we use the original images as input, the transformed versions also contain the distortions; Figs. 9a and 9b also show the manual labels for the robots and carts. Fig. 4a was taken by camera1 and has barrel distortion; as a result, the transformed version in Fig. 9a clips out the surrounding distorted parts, so the relative positions of objects inside the image deviate from the real space. Similarly, Fig. 4b contains slight pincushion distortion, which causes a small bend at the top edge of the transformed image in Fig. 9c. However, using the images after removing the distortion results in nicely aligned transformed versions (9b, 9d). All full-size images are 1920 × 1080 pixels. The real board coordinates are calculated in millimeters; the board size is 2070 × 1710 mm and the transformed image size is 1024 × 846 pixels. The correspondence of corner points is shown in Table 3.

Using all 9 boundary points in arrangement2 as described in Table 3, the ROI is effectively divided into four blocks. As a result, warpPerspective is applied to each block individually and the results are then merged into one image, as shown in Fig. 10. For the original image, the distortion is minimized when we use 9 corners, but the improvement for the undistorted version is not clearly noticeable.

4.2 Case2: Object detection by the system

Fig. 11. Object detections on Arrangement1 (Camera1)

Here, object points are taken only from detection, not by manual selection. First, for arrangement1 we perform detection in the four scenarios shown in Fig. 11. For original and undistorted input images, objects are detected using both manually selected corner points and corner points detected by the system. It can again be clearly observed that using the original camera image with distortion loses, in fact clips off, some ROI parts around the boundary because camera1 has barrel distortion (Fig. 11). Also, if we use detected corner points, the detection bounding boxes are not always perfectly aligned with the corner edges, leaving some detection error in the transformed images.

As discussed earlier, arrangement2 has nine corner points and we can use either four or all nine corners (four blocks). The following two sub-cases consider both 9 and 4 corner points for the perspective transformation.

4.2.1 Case2.1

Fig. 12 shows the scenarios using manually selected corner points. We see that the object detection bounding boxes behave similarly, but as pointed out earlier, four corners with the original image retain the distortion, and we expect a larger performance gain from undistorted images with four corners (Figs. 12a, 12b) than with nine corners (Figs. 12c, 12d).

Fig. 12. Object Detections With Manually selected Corner Points on Arrangement2 (Camera2)


4.2.2 Case2.2- Using detected corner points

Here, both corners and objects are detected, without selecting anything manually. Fig. 13 shows corner and object detections for both original and undistorted input images, considering both four and nine corner points. For the original images, four corners keep the small pincushion distortion of camera2 but nine corners remove it (Figs. 13c, 13e).

Fig. 13. Object Detections With Detected Corner Points on Arrangement2 (Camera2)

5. Results and Analysis

As already mentioned, object detection yields the bounding boxes for every object inside the ROI. The middle points are determined, and the image points are converted into real-space points using equation (14), where the image and real dimensions are 1024 × 846 (pixels) and 2070 × 1710 (mm) respectively, as specified in Table 3. There are 12 objects: 10 robots (R1, R2, ..., R10) and 2 carts (Cart1, Cart2). We calculate the Euclidean distances between all 66 pairs (12 choose 2) of objects {(R1, R2), (R1, R3), ..., (R2, R3), ..., (R10, Cart1), (R10, Cart2), (Cart1, Cart2)}. The distances in real space were recorded earlier using measurement tapes. Finally, we find the errors between the real and calculated distances for every pair.

Tables 4 and 5 show the errors for a few distance IDs over all 18 scenarios, and Figs. 14-18 present the plots of all distance IDs with their corresponding errors.

$$(X_{real},\ Y_{real}) = \left(x \cdot \frac{2070}{1024},\ \ y \cdot \frac{1710}{846}\right) \qquad (14)$$
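The following Python sketch illustrates this mapping and the pairwise distance computation; the object names and pixel coordinates are assumed inputs taken from the detection output.

```python
# Sketch of the real-space mapping of eq. (14) and the pairwise distances;
# object centers are assumed given in transformed-image pixel coordinates.
from itertools import combinations
import math

IMG_W, IMG_H = 1024, 846          # transformed image size (pixels)
REAL_W, REAL_H = 2070.0, 1710.0   # real board size (mm)

def to_real(px, py):
    """Map a point in the transformed image to real-space millimetres."""
    return px * REAL_W / IMG_W, py * REAL_H / IMG_H

def pairwise_distances(centers):
    """centers: dict of object name -> (px, py). Returns mm distances for all pairs."""
    real = {name: to_real(*p) for name, p in centers.items()}
    dists = {}
    for (a, (xa, ya)), (b, (xb, yb)) in combinations(real.items(), 2):
        dists[(a, b)] = math.hypot(xa - xb, ya - yb)
    return dists  # 66 pairs for the 12 objects used in the experiments
```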

Table 4. Absolute errors, i.e., differences between the real distances and those calculated by the system, for arrangement 1 (Camera1). Camera1 has barrel distortion and only four corner points are used here. Orig = Original Image, Und = Undistorted Image

Table 5. Absolute errors, i.e., differences between the real distances and those calculated by the system, for arrangement 2 (Camera2). Camera2 has pincushion distortion. Orig = Original Image, Und = Undistorted Image

The bar charts on the right-hand sides of Figs. 14-18 present another interpretation of the corresponding plots on the left. We could have used the average error of each scenario to plot the bar charts, but that does not always provide a proper interpretation: if most of the detection bounding boxes are nicely aligned except a single one with a large displacement, the average can be skewed. Therefore, for every figure we calculate normalized scores and plot their averages in the bar charts.

5.1 Average Normalized Error Calculation

Suppose a figure compares four scenarios (S1, S2, S3, S4). For a specific distance ID d1, the corresponding errors are (E1,d1, E2,d1, E3,d1, E4,d1), and the normalized errors are assigned from a sequence of four equally spaced constants (C1, C2, C3, C4) with C1 > C2 > C3 > C4: the scenario with the largest error receives C1, the next largest C2, and so on. For example, if E1,d1 > E3,d1 > E2,d1 > E4,d1, then the normalized errors are (N1,d1 = C1, N2,d1 = C3, N3,d1 = C2, N4,d1 = C4). Similarly, for another distance ID d2, if E3,d2 > E1,d2 > E4,d2 > E2,d2, then (N1,d2 = C2, N2,d2 = C4, N3,d2 = C1, N4,d2 = C3), and so on. Finally, the averages of (N1,d1, N1,d2, ..., N1,dn), (N2,d1, N2,d2, ..., N2,dn), (N3,d1, N3,d2, ..., N3,dn) and (N4,d1, N4,d2, ..., N4,dn) are plotted as bar charts. In our experiments, we use C1 = 20, C2 = 15, C3 = 10, C4 = 5; accordingly, the average normalized errors (ANE) lie between 5 and 20.
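A small Python sketch of this ranking-based scoring is shown below; the input layout (one row of errors per distance ID, one column per scenario) is an assumption for illustration.

```python
# Sketch of the average normalized error (ANE) computation described above,
# using the constants C = (20, 15, 10, 5) from the paper.
import numpy as np

C = [20, 15, 10, 5]  # assigned to the largest ... smallest error per distance ID

def average_normalized_errors(errors):
    """errors: array of shape (num_distance_ids, num_scenarios) of absolute errors.
    For each distance ID, the scenario with the largest error receives C[0],
    the next largest C[1], and so on; the per-scenario means are the ANE values."""
    errors = np.asarray(errors, dtype=float)
    normalized = np.empty_like(errors)
    for row, norm_row in zip(errors, normalized):
        order = np.argsort(-row)           # indices from largest to smallest error
        norm_row[order] = C[:len(row)]
        # e.g. errors with E1 > E3 > E2 > E4 give normalized scores (20, 10, 15, 5)
    return normalized.mean(axis=0)         # one ANE per scenario, between 5 and 20
```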

Fig. 14. Manual Object Detections using Manually Selected Corner Points with Original and Undistorted Images of Camera1

Fig. 15. Manual Object Detections using four and nine Manually Selected Corner Points with Original and Undistorted Images of Camera2

Fig. 14 shows the results of manual detection for arrangement1. Because camera1 has large barrel distortion, the errors for the original input image are significantly bigger than for the undistorted version; the average normalized errors (ANE) are 19 and 11, respectively.

Fig. 15 shows the results for arrangement2, where there is only slight pincushion distortion. Here the undistorted version again shows smaller errors: average normalized errors of 10 and 13 for nine and four corners, respectively. However, the original image with nine corners gives a result similar to (the bar chart even shows it is better than) the undistorted image with four corners (ANE = 12 and 13).

Fig. 16. Detections using Detected and Manually Selected Corner Points with Original and Undistorted Images of Camera1

We can define the total error as in equation (15). All the error sources are independent of each other, but how one kind of error affects another is uncertain.

(15)

Now let us look at the results for detection by the system. Fig. 16 shows the results for arrangement1, where only four corners are used. We see that using the undistorted image with manual corners provides the best result (ANE = 8). Surprisingly, the original image with manual corners generates a worse result (ANE = 18) than with detected corners (ANE = 14). This is because camera1 has large barrel distortion and manual corners keep all of it, whereas the error from the detected corners actually decreases the total error in this case. For object detection using manual corners on arrangement2, where there is no corner detection error, we see the improvement of using nine corner points for the original image: average normalized errors of 13 and 15 with nine and four corners, respectively (Fig. 17). However, four corners again provide very good accuracy (ANE = 10) for the undistorted version. When both corners and objects are detected, using nine corners we find similar accuracy for both original and undistorted images (ANE = 12) (Fig. 18); this happens due to the uncertain combination of the different detection errors. With either manual or detected corners, four corners with the original and undistorted images give the highest and lowest errors, respectively (ANE = 15, 10 and ANE = 16, 10). Nine corners always improve the accuracy and seem more reliable with an original image.

Fig. 17. Detections using four and nine Manually Selected Corner Points with Original and Undistorted Images of Camera2

Fig. 18. Detections using four and nine Detected Corner Points with Original and Undistorted Images of Camera2

From all of the results discussed above, we can conclude that perspective transformation gives very good accuracy in locating objects from images. If we can undistort the image, then only four corner points already improve the accuracy significantly. Dividing the ROI into more blocks achieves a kind of auto-calibration, and un-distortion becomes less important. Although manual corner selection on the undistorted image performs even better, it is not practical. Both removing the distortion with calibration and using more boundary points to achieve auto-calibration gain accuracy at the cost of some added complexity. However, using more control points seems to have more advantages than camera calibration and un-distortion because it does not depend on a specific camera and can thus help us develop a general system using multiple cameras in motion.

Table 6. Comparison of errors between our method and Mobile Robot Self-Localization System [27]

The closest method with which we can compare our work is the Mobile Robot Self-Localization System [27], in which the authors experiment with three webcams. They measured the displacements of specific positions, whereas we measured the distances between objects. To compare with them, we calculated our results in a similar way, and the results are shown in Table 6. The reference method measures the displacements on calibrated images only, whereas we consider several cases with both calibrated and original images. The table shows that the average error of that method is 9.33 cm, whereas our method gives 1.30 cm and 1.62 cm for calibrated and original images, respectively. Moreover, our proposed method can be trained to detect any kind of object, which gives it a wider range of applicability.

6. Conclusion

In this paper, we proposed a system to detect objects from a single remote image frame and then locate them and measure the distances between them in real space. The detection is done by deep learning and is performed in two steps, boundary mark detection and object detection. Perspective transformation is used to crop the ROI from the image and make it proportional to the real space. Experiments were done using the original images and, after removing their distortions with the camera parameters, the undistorted images, with both four and nine corners. The results show that combining perspective transformation and object detection provides a simple solution for locating objects in real space, and using more corner points improves the accuracy significantly for original images. However, if we remove the distortion from the images prior to the transformation, then four corner points are enough to provide very good accuracy.

References

  1. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," nature, vol. 521, no. 7553, pp. 436-444, May, 2015. https://doi.org/10.1038/nature14539
  2. C. Szegedy and V. O. Vanhoucke, "Processing images using deep neural networks," United States patent US9904875B2, February, 2018.
  3. J. Redmon, "Darknet: Open Source Neural Networks in C,".
  4. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. of the IEEE conference on computer vision and pattern recognition, pp. 779-788, June 2016.
  5. J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, April, 2018.
  6. Z.-Q. Zhao, P. Zheng, S. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, pp. 1-21, January, 2019.
  7. J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proc. of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517-6525, July, 2017.
  8. M. Simon, S. Milz, K. Amende, and H.-M. Gross, "Complex-YOLO: Real-time 3D Object Detection on Point Clouds," arXiv preprint arXiv:1803.06199, March, 2018.
  9. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks.," in Proc. of CVPR, 2017, vol. 1, pp. 2261-2269, July, 2017.
  10. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of the 2016 IEEE conference on computer vision and pattern recognition, pp. 770-778, June, 2016.
  11. X. Hao, "A fast object detection method based on deep residual network," International Journal of Scientific Research, vol. 7, no. 2, February, 2018.
  12. C. N. Aishwarya, R. Mukherjee, and D. K. Mahato, "Multilayer vehicle classification integrated with single frame optimized object detection framework using CNN based deep learning architecture," in Proc. of 2018 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), pp. 1-6, March, 2018.
  13. Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 11, pp. 1330-1334, November, 2000. https://doi.org/10.1109/34.888718
  14. "Measuring Planar Objects with a Calibrated Camera - MATLAB & Simulink," 2018.
  15. M.-C. Lu, W.-Y. Wang, and C.-Y. Chu, "Image-based distance and area measuring systems," IEEE Sensors Journal, vol. 6, no. 2, pp. 495-503, April, 2006. https://doi.org/10.1109/JSEN.2005.858434
  16. Y. M. KlimKov, "A laser polarimetric sensor for measuring angular displacements of objects," in Proc. of Lasers and Electro-optics Europe, 1996. CLEO/Europe., Conference on, pp. 190-190, September, 1996.
  17. A. Carullo and M. Parvis, "An ultrasonic sensor for distance measurement in automotive applications," IEEE Sensors journal, vol. 1, no. 2, p. 143, August, 2001. https://doi.org/10.1109/JSEN.2001.936931
  18. Y. M. Mustafah, R. Noor, H. Hasbi, and A. W. Azma, "Stereo vision images processing for real-time object distance and size measurements," in Proc. of Computer and Communication Engineering (ICCCE), 2012 International Conference on, pp. 659-663, July, 2012.
  19. I.-H. Kim, D.-E. Kim, Y.-S. Cha, K. Lee, and T.-Y. Kuc, "An embodiment of stereo vision system for mobile robot for real-time measuring distance and object tracking," in Proc. of Control, Automation and Systems, 2007. ICCAS'07. International Conference on, pp. 1029-1033, October, 2007.
  20. K. Murawski, "Method of Measuring the Distance to an Object Based on One Shot Obtained from a Motionless Camera with a Fixed-Focus Lens.," Acta Physica Polonica, A., vol. 127, no. 6, pp. 1591-1595, June, 2015. https://doi.org/10.12693/APhysPolA.127.1591
  21. J. Han, L. Shao, D. Xu, and J. Shotton, "Enhanced computer vision with microsoft kinect sensor: A review," IEEE transactions on cybernetics, vol. 43, no. 5, pp. 1318-1334, October, 2013. https://doi.org/10.1109/TCYB.2013.2265378
  22. M. Jungel, H. Mellmann, and M. Spranger, "Improving vision-based distance measurements using reference objects," in Proc. of Robot Soccer World Cup, 2007, pp. 89-100, 2008.
  23. H. Zhang, L. Wang, R. Jia, and J. Li, "A distance measuring method using visual image processing," in Proc. of the 2009 2nd International Congress on Image and Signal Processing, 2009, vol. 1, pp. 2275-2279, October, 2009.
  24. A. Roberts, W. N. Browne, and C. Hollitt, "Accurate marker based distance measurement with single camera," in Proc. of Image and Vision Computing New Zealand (IVCNZ), 2015 International Conference on, pp. 1-6, November, 2015.
  25. K. Chan, A. Ordys, and O. Duran, "A system to measure gap distance between two vehicles using license plate character height," in Proc. of International Conference on Computer Vision and Graphics, pp. 249-256, 2010.
  26. M. Raza, Z. Chen, S. Ur Rehman, P. Wang, and J. Wang, "Framework for estimating distance and dimension attributes of pedestrians in real-time environments using monocular camera," Neurocomputing, vol. 275, pp. 533-545, January, 2018. https://doi.org/10.1016/j.neucom.2017.08.052
  27. I. Li, M.-C. Chen, W.-Y. Wang, S.-F. Su, and T.-W. Lai, "Mobile robot self-localization system using single webcam distance measurement technology in indoor environments," Sensors, vol. 14, no. 2, pp. 2089-2109, January, 2014. https://doi.org/10.3390/s140202089
  28. J. H. Shim and Y. I. Cho, "A mobile robot localization via indoor fixed remote surveillance cameras," Sensors, vol. 16, no. 2, p. 195, February, 2016. https://doi.org/10.3390/s16020195
  29. Tzutalin, "LabelImg," Git code (2015).
  30. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June, 2017. https://doi.org/10.1109/TPAMI.2016.2577031
