Dist-YOLO: Fast Object Detection with Distance Estimation

We present a scheme of how YOLO can be improved in order to predict the absolute distance of objects using only information from a monocular camera. It is fully integrated into the original architecture by extending the prediction vectors, sharing the backbone's weights with the bounding box regressor, and updating the original loss function by a part responsible for distance estimation. We designed two ways of handling the distance, class-agnostic and class-aware, showing that class-agnostic creates smaller prediction vectors than class-aware and achieves better results. We demonstrate that the subtasks of object detection and distance measurement are in synergy, resulting in an increase in the precision of the original bounding box functionality. We show that using the KITTI dataset, the proposed scheme yields a mean relative error of 11% considering all eight classes and the distance range within [0, 150] m, which makes the solution highly competitive with existing approaches. Finally, we show that the inference speed is identical to the unmodified YOLO, 45 frames per second.


Introduction
Distance estimation is an essential part of 3D scene recognition and orientation, allowing various autonomous devices to move in the natural environment. These devices can be equipped with passive capturing devices, such as RGB or IR cameras, and active capturing devices, including lidar or sonar. Active devices are more expensive but provide distance estimation directly as a cloud of points [1], while passive devices require complex computer vision algorithms on top to estimate the distance [2].
In this study, we designed a computer vision algorithm for passive capturing devices. The algorithm detects objects in a scene together with the information about their absolute distance from the camera. Such an algorithm can be applied in automatic dimming/brightening of cars' headlamps to prevent the dazzling of other cars or to highlight various obstacles on the road. For this task, a scene in front of the car is captured by an RGB camera, the objects of interest (cars/people/animals) are localized, and a control unit of a headlamp adjusts particular light-emitting segments in order not to dazzle the object, or to highlight it. The decision algorithm needs the location of and the distance to an object to make a correct decision. Furthermore, the processing of such data has to be able to run in real time on an embedded device. The usage of an RGB camera instead of an active capturing device makes the solution significantly cheaper and extends the visibility range because a camera can see further than a sonar or lidar [3].
Our idea is motivated by human perception, where we can recognize objects in a scene and estimate their distance just because we can create reasoning such as 'This is a bicycle, and it is small, so it is approximately 40 m away'. For such reasoning, a pair of eyes helps, but the ability is partially preserved even if we use only one eye. In image processing, we can estimate a distance for every pixel in a scene or per object. The 'per object' case estimation corresponds to the mentioned human perception. This study discusses how an object detection and recognition algorithm can be naturally extended for distance estimation reflecting native human perception, that is, for a system where only visual information from a monocular camera is available. For this case, one of the fastest one-stage object detection algorithms, YOLOv3 [4], has been selected, but the proposed principle can also be used in YOLOv4 [5] and YOLOv5 [6], as they have the same inner structure. The goal of the work is to propose an extension that will not lead to an increased computational cost compared to the original algorithm, while the distance estimation quality will be comparable to other state-of-the-art systems. This study aims to create a method for estimating the absolute distance (instead of relative distance/depth information) to set light beams correctly.

Related Work
The existing methods can be split into two groups, predicting relative and absolute distance. The methods for estimating relative distance can express a particular object's (given by a bounding box) distance on a relative scale [7] or realize dense prediction and produce a depth map for the entire image. The depth map can be entirely independent between distinct images [8] or preserve the consistency [9]. We can view the depth map as a type of disparity map, where it can be produced using a single image only [10] or via multiple images [11].
The methods predicting absolute distance (in meters, inches, etc.) can be further separated into those that use data from active devices, passive devices, or their combination.
Active devices are mostly represented by lidar, which has a better visibility range than sonar. VoxelNet [12] is trained in the end-to-end scheme over the lidar raw point cloud and can detect objects of car, cyclist, and pedestrian classes together with their precise 3D bounding boxes and position in a 3D scene. Study [13] proposed coupling two VoxelNets, one for region proposal and the second one for patch refinement, which led to a precision improvement on the KITTI 3D object detection benchmark [14,15].
Regarding the combination of both types of devices, we can find models that are trained using lidar and visual data but perform inference using visual information only. Saxena et al. [16] separated an RGB image into small patches and, by applying a probabilistic model, produced a depth map close to the output of a 3D scanner. Generally, such approaches search for an optimal mapping from the visual domain into a depth map, which is ideal for deep neural networks based on autoencoders [17]. Analysis [18] demonstrated that object-specific distance estimation using a monocular camera was not flawless, so they segmented Velodyne point clouds and fused them with features from a model to precisely predict the distance.
Regarding passive devices, Zhang et al. [3] used three cameras with a small field of view and using a disparity calculation produced a dense depth map with reasonable results even for an object with distance higher than 200 m. The crucial point is to deal with pseudo-rectification and ambiguity removal.
Hu et al. [19] used a monocular camera and performed object detection and tracking, where the distance of an object was identified as valuable information for the tracker. The idea was based on Faster R-CNN enhanced by distance and angle estimation. Study [20] aimed at 3D box detection and an angle estimation network, which provided attitude angle information, and using the camera projection principle over this information, the distance was precisely determined. Natanael et al. [21] utilized YOLOv3 to detect bounding boxes together with coordinates, and from them, computed distances analytically. DisNet [2,22] combined YOLOv3 that produced bounding box (BB) coordinates and a fully connected neural network that produced distances. The fully connected network was trained separately on predictions taken from YOLO. Chen et al. [23] used YOLOv3 and coupled it with Monodepth [24], i.e., a model trained on visual information from two cameras to produce a disparity map on a single camera during inference. The distance was estimated from Monodepth's output and injected into YOLO's predicted boxes. Strbac et al. [25] used two cameras and two YOLOv3 detectors, where the distance was measured with a stereoscopic principle over the detected boxes. The disadvantages were the doubled computation time due to the two detectors and the impossibility of determining the distance when only one detector found an object. Mauri et al. [26] integrated into YOLOv3 a 3D regression module that took the detected boxes together with feature maps and utilized additional convolution layers to produce the orientation, dimension, and distance of each of the boxes. Because the 3D regression module works on the produced bounding box coordinates, it turned the original one-stage YOLO principle into a two-stage one, which is different from our proposal.
To date, no solution fully integrates distance estimation into the YOLO architecture to measure distance using a monocular camera. Such a solution is the aim of this study.

The Motivation and Our Idea in Brief
Our long-term goal is to create a system capturing a scene in front of a car and providing information about the detected objects together with their absolute distances. The system has to be inexpensive; therefore, it does not rely on lidar, and the object detector runs on a low-powered embedded device, Jetson Nano. The proposed system is a part of a project controlling LED segments in a car's headlamps to highlight detected objects. To set up the light beam precisely, absolute distance information is necessary, so it is not feasible to use methods providing relative distance, e.g., dense depth map [7,9].
We use a monocular RGB camera. The motivation to omit two cameras and compute a disparity map for distance estimation is that according to [27] "A single-pixel error in disparity implies only a 0.1m error in depth at a depth of 5 m, but a 5.8 m error at a depth of 50 m" on a KITTI dataset [14], and the current object detectors have a more significant pixel error. Moreover, we are restricted to a monocular camera because the usage of two cameras increases the price of the solution.
Our intuition is that YOLO performs regression on BB coordinates, so it can be trained to solve the regression problem of absolute distance estimation as well. The assumption is that the network can use the same inner features for BB coordinates and distance estimation, leading to synergy. It motivates us to integrate the part responsible for distance estimation into the YOLO architecture and train the model in an end-to-end manner. The assumption will be confirmed when a model trained for distance estimation leads to higher BB precision than the baseline YOLO.
Our contributions to the problem are as follows:
1. We define a novel architecture, Dist-YOLO, where prediction vectors produced by heads are extended by information about distance and coupled with a proper distance loss function.
2. We show that Dist-YOLO detects bounding boxes more accurately than the original YOLO while having the same backbone's capacity.
3. We demonstrate that a monocular camera with Dist-YOLO can precisely estimate the distance of an object.

Dist-YOLO
YOLO is used to detect an object's bounding box and classify the object into one of the predefined classes. By default, it cannot estimate the distance of the object. Our primary goal is to add this ability to YOLO to create Dist-YOLO, while preserving the original properties. To achieve this capability, three basic steps must be followed: (a) Enrich the labels in the training dataset with information about distances. (b) Extend the prediction in each cell to produce the distance of an object. (c) Update the YOLOv3 loss function used for training to take into account the distance of an object.

Preliminaries-YOLOv3
YOLO (You Only Look Once) is a one-stage, multi-scale, anchor-based object detector utilizing a fully convolutional architecture, Darknet-53. It was initially introduced by Joseph Redmon et al. [28], who evolved it into the second and third versions [4,29]; the first version utilized single-scale detection and a lightweight architecture. Currently, the fourth [5] and fifth versions have been introduced by independent groups. The performance boost of the fourth and fifth versions is achieved mainly by new data augmentation and minor architecture changes; the main principle remains the same [5]. The newest version, YOLOX [30], is based on YOLOv3 and modifies it into an anchor-free algorithm, which allows the use of decoupled heads, leading to a slightly higher accuracy of detection. Furthermore, YOLO has been extended in several ways to improve its precision and capabilities. Namely, object segmentation based on object polygon detection, Poly-YOLO, was introduced in [31], where the original YOLO prediction vector was extended to predict a set of polygon points together with their confidence for every detected object. Furthermore, an approach motivated by the Mask R-CNN × Faster R-CNN relation, YOLACT, was introduced in [32] to provide the ability for mask instance segmentation with results comparable to standard two-stage segmentation approaches, but with lower computational demands. Later, this approach was extended into YOLACT++ [33] with an additional performance and precision boost.
In contrast to two-stage detectors [34], YOLO does not use a region proposal network but integrates the whole process into a single architecture. It is split into encoder and decoder parts. The encoder serves as a feature extractor. It is represented by the Darknet-53 backbone, but an arbitrary backbone can be used. The features are decoded on three succeeding scales into three output grids. Each grid consists of cells, where each cell is responsible for detecting objects whose center lies inside the cell. By detecting, we mean the regression on the bounding box coordinates together with class and confidence. In YOLOv3, each cell can detect three objects, where so-called anchors are used for suppressing the problem of detecting three identical objects. By anchor, we mean a prototypical bounding box extracted from training data during preprocessing using the k-means algorithm. During training and inference, a cell is assigned three anchors and detects an object in the output position where the intersection over union (IoU) of the box and the anchor is maximized. Compared to two-stage detectors, YOLO is much faster but yields lower precision of detection.
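The anchor selection described above can be illustrated with a minimal sketch. It assumes that the box and the anchors are compared by width and height only, as if they shared the same center; the function names and the anchor sizes are hypothetical and are not taken from the original implementation.

```python
def wh_iou(box_wh, anchor_wh):
    """IoU of two boxes that share the same center, given as (width, height)."""
    bw, bh = box_wh
    aw, ah = anchor_wh
    inter = min(bw, aw) * min(bh, ah)
    union = bw * bh + aw * ah - inter
    return inter / union if union > 0 else 0.0


def best_anchor(box_wh, anchors):
    """Return the index of the anchor with the highest IoU against the box."""
    return max(range(len(anchors)), key=lambda j: wh_iou(box_wh, anchors[j]))


# Hypothetical anchors (width, height) in pixels for one output scale.
anchors = [(20, 40), (60, 45), (150, 90)]
print(best_anchor((55, 50), anchors))  # index of the best-fitting anchor
```

In this illustration, the ground-truth box of 55 × 50 pixels is assigned to the second anchor, so the corresponding output position of the cell is the one trained to detect it.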
The success of the YOLO algorithm family can be illustrated on a wide range of tasks where it was applied, such as face mask wearing detection [35], identification of plant diseases [36], pedestrian detection [37], forest fire detection [38], or safety helmet detection [39]. Currently, YOLO based approaches are used for popular social distancing estimation [40,41], also combined with mask detection [42].

Updating the Predictions Vector
A prediction vector $p$ of a cell and an anchor at a certain scale is given as $p = (b, c, o)$, where $b = (x, y, w, h)$ are coordinates produced by a bounding box regressor, $c = (c_1, c_2, \ldots, c_n)$ addresses the confidence of being a certain class for up to $n$ classes, and $o$ expresses objectness, i.e., confidence that $p$ captures a real object. In total, each prediction vector consists of $5 + n$ values, and we have 3 × 3 × r prediction vectors, where $r$ is the number of all cells in the three output layers. Typically, r = 14,157.
Regarding distance, we extend the prediction vector into the form $p = (b, c, o, d)$, where we have two options of how to define $d$. Firstly, $d = (d)$ if the distance is class-agnostic and secondly, $d = (d_1, d_2, \ldots, d_n)$ if the distance is class-aware. The prediction vector is produced by 1 × 1 convolutional filters, where each of the filters produces a single value representing a certain variable, e.g., the x coordinate of the first object, the class of the second object, etc. Therefore, when we add additional information (distance), it is necessary to increase the total number of these filters. Each filter includes one trainable parameter, because architectures implementing batch normalization [43], including YOLO, do not use bias. If we consider a typical $r$ and, e.g., $n = 7$, class-agnostic distance increases the number of trainable parameters in the output layers by 8%, and class-aware distance increases them by 58%. The illustration scheme of the Dist-YOLO architecture is shown in Figure 1.
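The quoted percentages follow directly from the lengths of the prediction vectors. The short sketch below reproduces this arithmetic for $n = 7$; the function names are hypothetical and serve only as an illustration.

```python
def vector_length(n_classes, distance_mode=None):
    """Length of one prediction vector: 4 box values, n class values, 1 objectness."""
    base = 4 + n_classes + 1
    if distance_mode is None:
        return base
    if distance_mode == "agnostic":
        return base + 1            # a single distance value d
    if distance_mode == "aware":
        return base + n_classes    # one distance value per class
    raise ValueError(f"unknown distance mode: {distance_mode}")


def relative_increase_percent(n_classes, distance_mode):
    """Percentage growth of the prediction vector caused by the distance values."""
    base = vector_length(n_classes)
    return 100.0 * (vector_length(n_classes, distance_mode) - base) / base


for mode in ("agnostic", "aware"):
    print(mode, round(relative_increase_percent(7, mode)))
```

For $n = 7$, the vector grows from 12 to 13 values in the class-agnostic case (about 8%) and from 12 to 19 values in the class-aware case (about 58%).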

Updating the YOLOv3 Loss Function
Dist-YOLO extends the original YOLOv3 loss function of a certain scale, written in the form used in [31], by a part responsible for distance estimation. The loss is composed of five parts evaluated for the $i$-th cell and the $j$-th anchor: $\ell_1(i, j)$ is the loss of the bounding box center prediction, $\ell_2(i, j)$ is the box dimensions loss, $\ell_3(i, j)$ is the confidence loss, $\ell_4(i, j)$ is the class prediction loss, and $\ell_5(i, j)$ is the distance loss. Finally, $q_{i,j} \in \{0, 1\}$ is a constant indicating whether the $i$-th cell and the $j$-th anchor contain an object or not. The loss iterates over $G^w G^h$ grid cells and $n^a$ anchors. The parts $\ell_1, \ldots, \ell_4$ are taken from YOLOv3. Part $\ell_5$ is new and extends YOLOv3 with the functionality of distance estimation. We use $\hat{\cdot}$ to denote predictions of the network.
The parts of the loss function are defined according to [31]. In these definitions, $c^x_{i,j}$ and $c^y_{i,j}$ are the coordinates of the center of a box, $H(\cdot, \cdot)$ is the binary cross-entropy, and $z_{i,j} = 2 - w_{i,j} h_{i,j}$ serves for a relative weighting of the $(i, j)$-th box size according to its width $w_{i,j}$ and height $h_{i,j}$. Further, $a^w_j$ and $a^h_j$ are the width and height of the $j$-th anchor.
The distance loss $\ell_5$ is defined separately for the class-agnostic version and for the class-aware version. In both, $\omega$ is a weighting constant preventing the distance loss from dominating the other losses. In our experiment, we set $\omega = 1 \times 10^{-2}$.
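As an illustration of the class-agnostic distance part, the following sketch assumes a squared-error form weighted by $\omega$, which is consistent with the later remark that the squared distance is computed in the loss. The exact formula and the function names are assumptions made for this example and may differ from the original implementation.

```python
OMEGA = 1e-2  # weighting constant reported in the text


def distance_loss_agnostic(d_true, d_pred, omega=OMEGA):
    """Assumed class-agnostic distance loss for one cell and anchor: weighted squared error."""
    return omega * (d_true - d_pred) ** 2


def total_distance_loss(d_true, d_pred, q, omega=OMEGA):
    """Sum the distance loss over output positions; q is 1 where an object is present, else 0."""
    return sum(
        q_ij * distance_loss_agnostic(t, p, omega)
        for t, p, q_ij in zip(d_true, d_pred, q)
    )


# Normalized distances in [0, 1]; the second position contains no object.
print(total_distance_loss([0.5, 0.0], [0.3, 0.9], [1, 0]))
```

Positions with $q_{i,j} = 0$ contribute nothing, so the distance output is trained only where an object is present.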

Updating YOLOv3 Training Data
The distances enriching the data can be taken using lidar information in datasets such as KITTI [14], Waymo Open Dataset [44], Berkeley DeepDrive Dataset [45], or nuScenes [46], etc. Adding distance into the prediction vector means that it uses the same features as the bounding box regressor, and the features can also be trained to minimize the loss of distance estimation. That is the difference from other modifications of YOLO, which build distance estimation on predictions taken from the already trained model.

Benchmark and Results
We conducted the experiment on the KITTI dataset [14], namely the KITTI 3D Object Detection Evaluation 2017, which contains 7481 training and 7518 test images. Because ground-truth labels are available only for training data, we split the original training set into the training part consisting of 5241 images and the testing part consisting of 2240 images, converted to a fixed resolution of 1216 × 366 pixels. The categories were pedestrian, car, van, truck, sitting person, cyclist, and tram. Each label contained information about its location in the image, distance, and rotation with respect to the camera. We used only the first two pieces of information. The distance was expressed in meters. The lowest distances were in the negative range for objects passing the camera, and we clipped such distances to zero. The largest distances rarely exceeded 150 m; we clipped them to a maximum of 150 m. Only a few objects exceeded the distance of 90 m. The distances were normalized into [0, 1] for training. The distribution of the distances is shown in Figure 2. We measured the correlation between the object's height and the observed distance for the class car in the testing dataset. It was −0.74, which means the distance could not be precisely estimated only using the object's size. That is why approaches using size information only, such as DisNet [2], cannot reach state-of-the-art results.
In the benchmark, all models were trained using the following settings: The input of YOLO was a downscaled image with the resolution of 608 × 192 pixels (used for both training/testing scenarios), the optimizer was Adam with $\alpha = 1 \times 10^{-3}$ and 'reduce learning rate on a plateau' functionality with patience 10 and factor 0.5. The data were modified online by augmentation consisting of padding, scaling, flipping, and brightness/contrast/color adjustments to prevent overfitting. The batch size was set to 24.
The training ran for 100 epochs and was evaluated on a validation dataset after each epoch; the best model was saved. The evaluation dataset was 10% of training samples. Finally, the best model was evaluated on the test set.
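The distance-label preprocessing described in this section (clipping into [0, 150] m and normalizing into [0, 1]) can be sketched as follows. The function names are hypothetical; only the clipping bounds and the normalization range are taken from the text.

```python
MAX_DISTANCE_M = 150.0  # upper clipping bound used in the benchmark


def normalize_distance(distance_m, max_distance_m=MAX_DISTANCE_M):
    """Clip a raw distance label into [0, max] meters and scale it into [0, 1]."""
    clipped = min(max(distance_m, 0.0), max_distance_m)
    return clipped / max_distance_m


def denormalize_distance(value, max_distance_m=MAX_DISTANCE_M):
    """Convert a network output from [0, 1] back to meters."""
    return value * max_distance_m


# A negative label (object passing the camera), a regular label, and an outlier.
print([normalize_distance(d) for d in (-3.2, 75.0, 200.0)])
```

The inverse mapping is needed at inference time so that the predicted value can be reported in meters.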
The text below marks class-agnostic distance YOLO as 'Dist-YOLOv3 G' and class-aware distance YOLO as 'Dist-YOLOv3 W'.

Evaluation Criteria
For the evaluation of the box detection ability, the mAP (mean average precision) index following the COCO standard [47,48] was used.
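Regarding the quality of distance estimation, we defined the $\varepsilon_A$ and $\varepsilon_R$ metrics expressing the mean absolute and relative distance estimation error. The mean absolute distance error (MAE), $\varepsilon_A$, is the mean of the absolute differences $|d_i - \hat{d}_i|$, and the mean relative distance error, $\varepsilon_R$, is the mean of these differences taken relative to the ground-truth distance $d_i$, for $i = 1, \ldots, n$, where $n$ is the number of found bounding boxes. Further, $d$ and $\hat{d}$ represent the paired vectors of the ground truth and predictions, respectively.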
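A minimal sketch of both metrics is given below. It assumes that the relative error divides each absolute difference by the ground-truth distance; the small guard `eps` for ground-truth distances equal to zero is an assumption added for this example and is not taken from the original text.

```python
def mean_absolute_error(d_true, d_pred):
    """Mean absolute distance error over paired ground-truth and predicted distances."""
    return sum(abs(t - p) for t, p in zip(d_true, d_pred)) / len(d_true)


def mean_relative_error(d_true, d_pred, eps=1e-6):
    """Mean relative distance error; eps (an assumption) guards against division by zero."""
    return sum(abs(t - p) / max(t, eps) for t, p in zip(d_true, d_pred)) / len(d_true)


ground_truth = [10.0, 20.0]   # meters
predictions = [12.0, 19.0]    # meters
print(mean_absolute_error(ground_truth, predictions))
print(mean_relative_error(ground_truth, predictions))
```

In this toy case, the errors of 2 m and 1 m give a mean absolute error of 1.5 m, while the relative errors of 0.2 and 0.05 give a mean relative error of 0.125.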

Preserving YOLO Performance
The first experiment aimed to verify our motivation, namely, the assumption that the network could use the same inner features for BB coordinates and distance estimation, leading to synergy. To verify it, we trained three models: the original YOLOv3, Dist-YOLOv3 with class-agnostic distance, and Dist-YOLOv3 with class-aware distance. All three models were trained with the same setting and had the same backbone.
The numerical results are presented in Table 1 and support our assumptions. For both types of Dist-YOLOv3, the mAP value increased. Focusing on mAP$_{.5:.95}$, the class-agnostic Dist-YOLOv3 yielded 108% of the performance of the original YOLOv3, and the class-aware Dist-YOLOv3 yielded 118%. The justification lies in a technique called an auxiliary task. It has been proven [49,50] that the correct choice of an appropriate auxiliary task may lead to the performance improvement of the primary task.

Measure the Error of Distance Estimation
We have shown that Dist-YOLO outperforms the standard YOLO regarding the accuracy of box detection. The second experiment validated the ability of Dist-YOLO to estimate the objects' distance precisely. The distance was evaluated using the defined $\varepsilon_A$ and $\varepsilon_R$ metrics. The metrics could be evaluated only for detected boxes; it was pointless to evaluate the distance of an object that was not detected. To distinguish what was detected, we used the IoU threshold of 0.5; objects detected with IoU lower than the threshold were discarded.
The results for Dist-YOLOv3 are shown in Table 2. For completeness, we report the number of detected boxes for a certain class, the minimum and maximum distance error, and the mean distance error. The mean error was not computed from absolute values; therefore, it should be zero when the predictions contain no bias. Contrary to the previous experiment, class-agnostic Dist-YOLO yielded better results than class-aware, which can even be marked as not working adequately due to the high values of $\varepsilon_A$, $\varepsilon_R$ and a high positive bias.
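The selection of detected boxes with the IoU threshold of 0.5 can be sketched as follows. The pairing strategy (each detection is compared with its best-overlapping label) and the data layout are assumptions made for this illustration; the original evaluation code may pair boxes differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def distance_pairs(labels, detections, iou_threshold=0.5):
    """Collect (ground-truth distance, predicted distance) for accepted detections.

    labels and detections are lists of (box, distance_in_meters) tuples.
    A detection is kept when its best-overlapping label reaches the threshold.
    """
    pairs = []
    for det_box, det_distance in detections:
        best = max(labels, key=lambda lab: box_iou(lab[0], det_box), default=None)
        if best is not None and box_iou(best[0], det_box) >= iou_threshold:
            pairs.append((best[1], det_distance))
    return pairs


labels = [((0, 0, 10, 10), 20.0), ((100, 100, 110, 110), 60.0)]
detections = [((0, 0, 10, 9), 19.0), ((20, 20, 30, 30), 50.0)]
print(distance_pairs(labels, detections))
```

Only the first detection overlaps a label sufficiently, so a single distance pair enters the error statistics; the second detection is discarded.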

Comparison with the Other Methods
A direct comparison is not straightforward as there is no standard, and different authors report their results using various settings; therefore, the following numbers are mainly illustrative. Zhu et al. [18] achieved $\varepsilon_R = 0.25$ on their KITTI validation dataset. Ali et al. [51] tested only nonoverlapping cars with the maximum distance of 70 m and achieved $\varepsilon_R = 0.29$; they also reported that DORN [52] yielded $\varepsilon_R = 0.11$ using the same setting. Mauri et al. [26] evaluated KITTI for three classes: car, person, and cyclist. Considering the test scenario, the achieved $\varepsilon_R$ for these three classes was 0.16, 0.42, and 1.04, which shows the reasonableness of our scheme.

Ablation Study
Firstly, we investigated the dependency of the distance error on the object's real distance. Because class-aware Dist-YOLO showed a poor result in the previous experiment, we considered only class-agnostic Dist-YOLO. The absolute and relative distance errors are visualized in Figures 3 and 4, respectively. Concerning Figure 3, we can claim that the absolute error grows according to a polynomial curve until distances higher than 80, where the trend is broken. The reason is that few objects lie in this interval and, therefore, the measured value was not computed from a statistically significant number of samples. Similar behavior is also reflected in Figure 4. Here, we can see that the relative error was almost constant for all distances except one meter. That is caused by the fact that even a 30 cm distance error on such a distance yields an error of 0.3, which makes the task extremely difficult and is probably also affected by the inaccuracy of the distance capturing process itself.
The next investigated aspect was the correlation between variables. Firstly, we measured the correlation between the quality of bounding box detection and the distance error. It was −0.27 for IoU and $\varepsilon_A$ and 0.05 for IoU and $\varepsilon_R$. Secondly, we measured that the correlation between the height of an object and $\varepsilon_A$ was −0.55, i.e., the lower the box is, the larger the absolute error produced, which is consistent with the values in Figure 3. For height and $\varepsilon_R$, it was 0.11, which is again consistent with the values in Figure 4; the positive correlation is caused by the large error for close boxes. Note that all of these correlations are computed for the class 'car' and IoU > 0.5 between the predictions and labels.
The last aspect we evaluated was the computation speed. We measured the time on an RTX2060 SUPER graphics card, where the time represents the duration of executing the predict() method with a batch size equal to one and without further optimization such as conversion into TensorRT. The prediction time was equal for all three models, 22 ms per image, i.e., it ran at approximately 45 frames per second. The equal processing time confirmed that the increase in parameters in the output prediction vectors was negligible compared to the complexity of the rest of the model.
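The relation between the measured latency and the reported frame rate can be sketched as follows. The timing helper is a generic illustration with hypothetical names; it measures an arbitrary callable and is not the measurement code used for the reported numbers.

```python
import time


def mean_latency_ms(predict, sample, runs=50, warmup=5):
    """Average wall-clock time of predict(sample) in milliseconds, after a warm-up."""
    for _ in range(warmup):
        predict(sample)
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    return 1000.0 * (time.perf_counter() - start) / runs


def frames_per_second(latency_ms):
    """Convert a per-image latency in milliseconds into frames per second."""
    return 1000.0 / latency_ms


# The reported 22 ms per image corresponds to roughly 45 frames per second.
print(round(frames_per_second(22.0), 1))
```

The warm-up runs are excluded from the measurement because the first calls of a model typically include one-time initialization costs.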

Examples
Finally, examples of detections are provided in Figure 5. We present four scenes with objects representing various classes and distances. Every detected object is encapsulated in a bounding box. The color and the title of the box refer to the detected class. The two numbers in the label refer to the distance of the object: the first value represents the detected distance, and the one after the slash represents the ground-truth distance. Both values are in meters.

Discussion, Open Issues, and Future Work
Training with distance estimation limits the data augmentation techniques that can be used. Spatial transformations such as mosaicing [5] or strong resizing can distort the visual size of objects and confuse the network. On the other hand, reducing augmentation can increase overfitting. Therefore, a suitable way to train Dist-YOLO is to use curriculum learning [53]. Firstly, the model was trained with data augmentation, and we froze (set to zero) the part of the loss responsible for distance estimation to train bounding box detection only. Then, the complete loss was used, and only augmentations that do not distort the object size were used.
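The two-phase curriculum described above can be sketched as a simple schedule. The number of epochs in the first phase, the function names, and the grouping of augmentations into size-changing ones are assumptions made for this illustration.

```python
# Augmentations assumed to change the visual size of objects (hypothetical grouping).
SIZE_CHANGING = {"scaling", "mosaic"}


def distance_weight(epoch, pretrain_epochs, omega=1e-2):
    """Weight of the distance loss: zero in the first phase, omega afterwards."""
    return 0.0 if epoch < pretrain_epochs else omega


def active_augmentations(epoch, pretrain_epochs, augmentations):
    """All augmentations in the first phase; only size-preserving ones afterwards."""
    if epoch < pretrain_epochs:
        return list(augmentations)
    return [a for a in augmentations if a not in SIZE_CHANGING]


augmentations = ["padding", "scaling", "flipping", "brightness"]
for epoch in (0, 30):
    print(epoch, distance_weight(epoch, 20), active_augmentations(epoch, 20, augmentations))
```

In the first phase, the network learns bounding box detection under full augmentation; in the second phase, the distance part of the loss is switched on, and the augmentations that would change the apparent object size are dropped.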
From Figure 4, it is obvious that the relative error of the distance was below 10% in most cases. The outlier was the distance of one meter, where the relative error was 45%. That behavior is caused by the current form of the loss function, where the squared distance is computed. We experimented with a relative distance loss function, which would force the network to predict a distance with a similar relative error for all distances, including one meter. For the class-agnostic version, such a loss function includes a positive smoothing factor $\sigma \in [0, 1]$ to avoid instability for the cases when $d_{i,j} \approx 0$. However, we were unable to train the model with this loss function due to high loss, even for bounding boxes.
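One possible form of such a relative loss is sketched below. The formula, in which the squared difference is divided by the smoothed ground-truth distance, is an assumption made for this illustration and is not the authors' exact definition; the value of `sigma` is likewise hypothetical.

```python
def relative_distance_loss(d_true, d_pred, omega=1e-2, sigma=0.1):
    """Assumed relative distance loss: squared error relative to (d_true + sigma)."""
    return omega * ((d_true - d_pred) / (d_true + sigma)) ** 2


# The same absolute error of 0.01 is penalized more for a close object.
print(relative_distance_loss(0.01, 0.02))
print(relative_distance_loss(0.50, 0.51))
```

The smoothing factor keeps the denominator positive for ground-truth distances close to zero, which is the instability mentioned in the text.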
Because the new versions of YOLO, namely v4 [5] and v5, use the same principles as YOLOv3, the proposed scheme of monocular distance estimation can be integrated into these new variants. This integration remains future work.

Summary
We started with YOLOv3, a fast one-stage object detector. In the literature overview, we showed that there were approaches modifying YOLOv3 for monocular absolute distance estimation, but none fully integrated the distance estimation functionality into YOLO's architecture. To achieve that, we extended the output prediction vector, modified the loss function by the part responsible for absolute distance estimation, and described how to modify the training data. In the benchmark, we verified that distance estimation was complementary to bounding box estimation and, therefore, increases the bounding box detection precision compared to the standard YOLOv3. We evaluated two versions of the designed Dist-YOLO, namely class-agnostic and class-aware, with the finding that only class-agnostic yielded satisfactory results. Benchmarking on the KITTI dataset, the mean absolute distance error was 2.5 m, and the mean relative error was 11%, which supports the claim that the full integration of distance estimation into YOLO's functionality can reach a significantly better result compared to solutions that build distance estimation on top of the model's output. Finally, we reported that the computation speed was identical for both YOLOv3 and Dist-YOLOv3. The implementation is available at https://gitlab.com/EnginCZ/yolo-with-distance (accessed on 8 October 2021).
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d (accessed on 14 August 2021). The source codes implementing the presented approach are freely available at: https://gitlab.com/EnginCZ/yolo-with-distance (accessed on 14 August 2021).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: YOLO