Dataset: Roundabout Aerial Images for Vehicle Detection

This publication presents a dataset of aerial images of Spanish roundabouts taken from a UAV, along with annotations in PASCAL VOC XML files that indicate the position of vehicles within them. Additionally, a CSV file is attached containing information on the location and characteristics of the captured roundabouts. This work details the process followed to obtain them: image capture, processing, and labeling. The dataset consists of 985,260 total instances: 947,400 cars, 19,596 cycles, 2262 trucks, 7008 buses, and 2208 empty roundabouts in 61,896 JPG images of 1920 × 1080 px. These are divided into 15,474 images extracted from 8 roundabouts with different traffic flows and 46,422 images created using data augmentation techniques. The purpose of this dataset is to support computer vision research on road traffic, as such labeled images are not abundant. It can be used to train supervised learning models, such as convolutional neural networks, which are very popular in object detection.


Introduction
UAVs (unmanned aerial vehicles) are motorized vehicles capable of accessing hard-to-reach places and sending high-resolution images in real time at an affordable cost. They are complemented by processing centers that receive the images and extract information from them through object detection, which consists of recognizing an object and locating it in the image: the input is an entire image, and the output is a series of class names and locations. Deep learning models, particularly object detection CNNs (convolutional neural networks), have shown great performance in this task. These are machine learning algorithms that require previously labeled examples for training (supervised learning). They are divided into two groups: one-stage and two-stage. One-stage models treat detection as a regression problem, learning class probabilities and locations directly. Two-stage models first propose a set of regions of interest, then classify each region and refine its bounding-box coordinates. One-stage models are faster but less accurate than two-stage ones [1]. Some examples are: one-stage, YOLO (You Only Look Once, v1 [2], v2/9000 [3], v3 [4], v4 [5]), SSD (Single Shot Detector) [6], and RetinaNet [7]; two-stage, R-CNN [8], Fast R-CNN [9], and Faster R-CNN [10]. These models have proven useful in a variety of fields [11][12][13], including traffic and its infrastructures. Some examples are vehicle [14][15][16][17][18][19][20][21][22], road [23], or pedestrian detection [24,25].
According to the Spanish Traffic Department (DGT) [26], "roundabouts are a special type of intersection in which the converging roads are connected by a ring that establishes rotating traffic flow around a central island." They are a subject of study because they involve maneuvers that remain complex for autonomous vehicles [27,28]. Furthermore, this type of traffic infrastructure offers a lot of information that can be extracted from images, such as vehicle trajectories or positions. There are several datasets available to support these studies [29][30][31][32]. However, although they are very useful, they are not abundant and are not easily accessible.
This publication presents an open-access dataset containing images of eight roundabouts along with the location of the vehicles in them. It has been produced using a methodology that simplifies labeling, which is otherwise a mainly manual and time-consuming task.

Dataset Summary
The dataset consists of 985,260 instances in 61,896 color images (15,474 real images and 46,422 created using data augmentation techniques) in JPG format. Each image is complemented by an XML (Extensible Markup Language) file, following the PASCAL VOC (Visual Object Classes) format, with annotations of the location of the vehicles within it. The images have been taken from eight different roundabouts with different traffic flow conditions. Table 1 shows a breakdown of the vehicles obtained from each of them, and Figures 1-3 show some examples.
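Each annotation follows the standard PASCAL VOC layout: one <object> element per vehicle, with a <bndbox> holding pixel coordinates. As a minimal sketch of how such files can be consumed, the following Python snippet parses a VOC annotation with the standard library; the example XML is hypothetical and only illustrates the structure, it is not a file from the dataset:

```python
import xml.etree.ElementTree as ET

# Hypothetical PASCAL VOC annotation (structure only; not a real dataset file).
VOC_XML = """
<annotation>
  <filename>roundabout_0001.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>512</xmin><ymin>300</ymin><xmax>548</xmax><ymax>332</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples from a VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, *coords))
    return boxes
```

The same loop works unchanged on the dataset's files by reading each XML file's text before calling the parser.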

Folder Contents
The folder structure of the dataset is as follows:

Methodology
The annotation of images is a tedious task, which is why a methodology that saves part of the task has been chosen. Figure 4 summarizes the process. It consists of annotating the minimum number of images to train CNN models to auto-annotate as many cases as possible. Although these require revisions, this avoids a lot of manual annotations. In addition, to increase the number of instances without having to annotate any, data augmentation techniques are applied to create apparently new images.

Record road footage. The first task is to collect aerial videos of roundabouts and discard those with poor quality. These were taken during daylight, at different heights (indicated in the file roundabouts.csv), in sunny and cloudy conditions, using a DJI Mavic Mini 2 drone, whose specifications can be found in [33] and in Table 2. Heights between 100 and 120 m keep the roundabout in the center of the image so that it can be clearly seen together with its entrances and exits. For that range of heights, the camera obtained a resolution (ground sampling distance, GSD) of between 6.67 and 8 cm per image pixel, as also shown in Table 2. The footage was recorded in compliance with civilian regulations for the use of remotely piloted aircraft [34].
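The reported GSD scales linearly with flight height: 6.67 cm/px at 100 m and 8 cm/px at 120 m, i.e., roughly 0.0667 cm/px per meter of height. A small sketch of this relationship follows; the scale constant is derived only from the two figures quoted above, not from an official DJI specification:

```python
def gsd_cm_per_px(height_m, scale=6.67 / 100.0):
    """Approximate ground sampling distance (cm per pixel) at a given flight height.

    The scale constant is inferred from the values reported in the text
    (6.67 cm/px at 100 m, 8 cm/px at 120 m); it is not a DJI specification.
    """
    return height_m * scale
```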

Annotation. Once the first videos are recorded, several frames are extracted using a Python script and manually annotated using the software in [35] (experimentally, every 10 frames the roundabout image is different enough to be considered a new instance). This generates an XML file in PASCAL VOC format for each image.
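The frame-sampling step above (keeping roughly every 10th frame) can be sketched as follows. The helper computes which frame indices to keep; the OpenCV loop that would read them from a video sits under the main guard, with a placeholder file name, since the dataset's actual extraction script is not published here:

```python
def sampled_frame_indices(total_frames, stride=10):
    """Indices of the frames to keep when sampling every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, total_frames, stride))

if __name__ == "__main__":
    # Extraction loop sketch; "roundabout.mp4" is a placeholder path.
    import cv2  # imported here so the helper above stays dependency-free

    cap = cv2.VideoCapture("roundabout.mp4")
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(sampled_frame_indices(total, stride=10))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index in keep:
            cv2.imwrite(f"frames/frame_{index:06d}.jpg", frame)
        index += 1
    cap.release()
```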
Data augmentation. Once the images are annotated, a Python script using the OpenCV library [36] creates synthetic images by applying different flips (horizontal, vertical, and both at the same time). This technique is widely used to create seemingly new examples with minimal effort [37].
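When an image is flipped, its annotations must be flipped with it. A minimal sketch of the coordinate transform for the three flips mentioned, in pure Python; the box format (xmin, ymin, xmax, ymax) matches PASCAL VOC, and the mode labels are illustrative names, not taken from the dataset's script:

```python
def flip_box(box, width, height, mode):
    """Flip a PASCAL VOC box (xmin, ymin, xmax, ymax) inside a width x height image.

    mode: "h" (horizontal), "v" (vertical), or "hv" (both), matching the
    three flips applied during augmentation.
    """
    xmin, ymin, xmax, ymax = box
    if "h" in mode:
        xmin, xmax = width - xmax, width - xmin
    if "v" in mode:
        ymin, ymax = height - ymax, height - ymin
    return (xmin, ymin, xmax, ymax)
```

Applying the same transform to every box in an image's XML file keeps the synthetic images consistent with their annotations.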
Basic model. The next step is to create the basic model, which is trained using [38]. The selected model is a RetinaNet [7], a one-stage CNN that has already proven its effectiveness [39] for this task [40], with a ResNet-50 backbone pretrained on the COCO dataset. The mean average precision (mAP) has been established as the metric to be optimized. This metric is very suitable, as it considers the entire precision-recall curve, unlike others such as the F1-score. The mAP is the mean AP over all classes, where the AP of a class is the area under its precision-recall curve, with precision = TP / (TP + FP) and recall = TP / (TP + FN). To obtain TP (true positives), FP (false positives), and FN (false negatives), an IoU (intersection over union) threshold of 0.5 has been set: the minimum overlap between the ground-truth and predicted bounding boxes for a detection to count as positive. Tables 3 and 4 show the hardware used, the training parameters, and the results obtained.
More instances. Using the model, new images are annotated through another iterative process that involves even less manual work: (1) record new videos, extract frames, and remove those with poor quality; (2) predict the location of the vehicles using the model; (3) review and confirm the images and predictions; and (4) use the data augmentation script to increase the size of the dataset.
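The quantities used in the evaluation can be made concrete with a short sketch: IoU between two boxes, and precision/recall from TP/FP/FN counts. Boxes use the same (xmin, ymin, xmax, ymax) convention as the annotations; this illustrates the standard definitions, not the authors' evaluation code:

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision(tp, fp):
    """Fraction of predicted boxes that match a ground-truth box."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of ground-truth boxes that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0
```

A prediction counts as a TP when its IoU with an unmatched ground-truth box is at least 0.5, as stated above.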

Data Quality
For data quality assurance, the same RetinaNet model used to auto-label new instances has been retrained, this time on the entire created dataset. The dataset has been divided into a training set (70%) and a validation set (20%), plus an evaluation set (10%) used to test the model once trained. Table 5 shows the results obtained during training, and Table 6 shows the results on the evaluation split of the dataset.
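The 70/20/10 split described above can be sketched as a reproducible shuffle-and-slice; the ratios are the only figures taken from the text, and the fixed seed is an illustrative choice, not the authors' procedure:

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle items reproducibly and slice them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```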

Conclusions
As shown in Table 6, the dataset is good enough to train a model that generalizes correctly. Among all the classes, motorcycles have the lowest AP, which is explained by their size being much smaller than that of the other vehicles. Publications such as [15,41] show how increasing the image resolution improves generalization across all classes, so rescaling the images could be a solution.
As future work, it would be interesting to record footage in poor visibility conditions, such as at night or during heavy rain or snowfall. Even so, this dataset offers the necessary tools to train vehicle recognition models for roundabouts. In addition, these images could be used to generate other datasets with annotations of other objects.