Pano-RSOD: A Dataset and Benchmark for Panoramic Road Scene Object Detection

Panoramic images have a wide range of applications in many fields because they capture all-round information. Object detection based on panoramic images has advantages for environment perception owing to the characteristics of such images, e.g., a larger field of view. In recent years, deep learning methods have achieved remarkable results in image classification and object detection, but their performance depends on a large amount of training data. A good training dataset is therefore a prerequisite for these methods to achieve better recognition results. To this end, we construct a benchmark named Pano-RSOD for panoramic road scene object detection. Pano-RSOD contains vehicles, pedestrians, traffic signs and guiding arrows, all labelled with bounding boxes in the images. Unlike traditional object detection datasets, Pano-RSOD contains more objects per panoramic image, and its high-resolution images offer 360-degree environmental perception, more annotations, more small objects and diverse road scenes. State-of-the-art deep learning algorithms are trained on Pano-RSOD for object detection, which demonstrates that Pano-RSOD is a useful benchmark and that it provides a better panoramic image training dataset for object detection tasks, especially for small and deformed objects.


Introduction
Due to the wide availability of consumer-level panoramic video capturing and imaging devices, panoramic images are widely used in many fields [1][2][3][4][5][6]. For example, they are used in 360-degree object tracking [1,4], equirectangular super-resolution [3], privacy protection in Google Street View [5] and roadway inventory management of traffic signs [6]. Object detection based on panoramic images is one of the key technologies for making panoramic images widely applicable. In intelligent transportation systems, object detection in panoramic images (with a wide field of view) can help advanced driver assistance systems (ADAS) and autonomous navigation for unmanned aerial vehicles (UAV) detect the objects (e.g., vehicles, pedestrians) around the vehicle. From a map navigation perspective, panoramic maps such as those of Baidu and Google, which are constructed from panoramic images, can express richer information such as location and scene. However, the pedestrians and vehicles visible in a panoramic map involve personal privacy, and removing or blurring this private information manually is relatively slow.
Deep learning makes efficient, automatic object detection in panoramic images possible. Reference [5] proposed a probabilistic search algorithm to boost the efficiency of face detection in Google Street View so as to protect personal privacy. In smart city management and virtual reality, object information can be retrieved and located through panoramic object detection. Research on panoramic object detection therefore has important practical significance.
All research on panoramic object detection relies on a large-scale, high-quality training dataset. Although a variety of public datasets, e.g., PASCAL VOC [7], ImageNet [8] and COCO [9], are available for the identification and segmentation of multiple objects, they target generic object detection rather than panoramic object detection. Given the differences between traditional images and panoramic images, models pre-trained on the generic datasets are commonly unsatisfactory when applied directly to panoramic object detection. In addition, there are street-view object detection datasets (including KITTI [10], Caltech Pedestrian Dataset [11] and UA-DETRAC Dataset [12]), which are mainly used for vehicle or pedestrian detection, and street-view semantic and instance segmentation datasets (including Mapillary Vistas [13], Cityscapes [14] and ApolloScape [15]). The publication of these datasets has undoubtedly promoted the development of object detection. However, public datasets specific to panoramic road scene object detection are still unavailable.
Moreover, to the best of our knowledge, there is at present no object detection algorithm designed specifically for panoramic images. Previous methods mainly use traditional hand-crafted features, e.g., the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT), or existing object detection methods based on transfer learning (e.g., pre-training on ImageNet [8] or COCO [9]). For example, Reference [16] adopted HOG-based detectors to detect traffic signs in street-level panoramic images. Reference [17] used Faster Region-based CNN (RCNN) to detect objects in indoor panoramic images.
Although much exciting progress on object detection in road scenes has been reported in recent years, two major issues seriously limit the development of object detection in panoramic road scene images:

•
A lack of panoramic object detection datasets for deep learning. Because of its special imaging mechanism, a panoramic image usually contains more objects than an ordinary image, and some of them are distorted. An object detection task for panoramic images therefore needs a new panoramic dataset for training and testing in order to adapt to these differences.

•
Although the existing object detection methods for common images can be transferred to panoramic object detection, there is a lack of models, evaluation statistics and benchmarks specifically for panoramic object detection.
To address the above problems, we construct a panoramic road scene object detection dataset (Pano-RSOD) and carry out experiments with state-of-the-art object detection algorithms to construct a benchmark. Pano-RSOD contains 9402 images and four object categories, i.e., vehicles, pedestrians, traffic signs and guiding arrows. The constructed dataset is quantitatively and qualitatively compared with other datasets in several aspects, e.g., the number of object samples, the number of images, the number of categories, the image resolution and the image type. In addition, we train five state-of-the-art detectors: Faster RCNN (VGG-16) [18], Faster RCNN (ResNet-101) [19], region-based fully convolutional networks (R-FCN) [20], YOLOv3 [21], and single shot multibox detector (SSD) [22,23]. Transfer learning is adopted for the five detectors used in this paper, starting from models pre-trained on ImageNet [8] and COCO [9]. On this basis, the benchmark of Pano-RSOD is constructed.
In summary, the major contributions of this paper are as follows:

•
We present a novel and promising topic, panoramic road scene object detection, which has potential applications in ADAS, UAV and panoramic mapping. Compared with a normal view, a panoramic view covers a larger field of view and contains more objects in a single image. It can therefore cover complicated situations that are not covered in most existing datasets. Object detection on online panoramic maps is also challenging. Thus, a panoramic road scene dataset is both needed and important. To provide a broader research foundation for object detection in panoramic images, related object detection methods and datasets are comprehensively reviewed (Section 2).

•
We construct a high-resolution panoramic road scene object detection dataset (Pano-RSOD) with more annotations and diverse small objects. To the best of our knowledge, it is the first high-resolution panoramic road scene dataset, and its images have high intraclass diversity. The dataset provides better experimental data for object detection algorithms based on panoramic images. It can also be used to evaluate the strengths and weaknesses of object detection algorithms aimed at small and deformed objects (Section 3).

Related Works
This section discusses object detection datasets for road scenes and networks for object detection. Accordingly, we summarize the related work from these two aspects.

Existing Object Detection Dataset of Road Scene
Vehicles, pedestrians, traffic signs, etc. are the most common objects in road scenes. The detection of these objects has very broad applications. The most common datasets that include these objects are as follows:

•
Pascal VOC Dataset [7]: This dataset is used as a standardized dataset for image detection and classification. There are two versions, VOC2007 and VOC2012. VOC2007 has a total of 9963 images, and VOC2012 has a total of 17,125 images. They include 20 categories to be detected, e.g., cars and pedestrians. Image sizes vary: landscape images are about 500 × 375 pixels, and portrait images are about 375 × 500 pixels. Each image corresponds to a label file in xml format, which records the image size, the ground-truth object coordinates and other information. This dataset is widely used as an evaluation standard for various object detection algorithms [18,20-22,24,25].

•
KITTI Dataset [10]: A total of 80,256 objects are marked, covering the car, pedestrian and cyclist classes. The precision-recall (PR) curve is used for object detection evaluation. The dataset has a wide range of applications in vehicle detection and pedestrian detection due to its large number of car and pedestrian samples.

•
Pedestrian Detection Dataset: Pedestrian detection is one of the important tasks in the fields of video surveillance and automatic driving. Pedestrian detection datasets therefore also play an important role in evaluating various object detection algorithms. The INRIA person dataset [26] was created by Dalal; its training set contains 614 positive samples (including 2416 pedestrians) and 1218 negative samples, and its test set contains 288 positive samples (including 1126 pedestrians) and 453 negative samples. The dataset is currently widely used in static pedestrian detection. The NICTA Pedestrian Dataset [27] is a larger static pedestrian detection dataset. It has 25,551 images each containing a single pedestrian and 5207 high-resolution images containing no pedestrians. The training set and test set are already divided to facilitate the comparison of different classifiers. The Caltech pedestrian dataset [11] is currently a large pedestrian database, captured by a vehicle-mounted camera in an urban traffic environment. The dataset consists of about 10 h of video at a resolution of 640 × 480 and 30 frames per second. About 250,000 frames are annotated, with a total of 350,000 bounding boxes and 2300 unique pedestrians. In addition, there are other pedestrian detection datasets, such as ETH [28] and CVC [29].

•
Vehicle Detection Dataset: Vehicle detection is a key step in vehicle analysis, and it is the basis for subsequent vehicle identification and vehicle feature recognition. An early example is the CBCL Car Database [30] created by MIT, which contains 516 images in ppm format with a resolution of 128 × 128 and is mainly used for vehicle detection. The UA-DETRAC dataset [12] is a larger vehicle detection and tracking dataset. It contains 10 h of video taken at different locations in Beijing and Tianjin, China. The video resolution is 960 × 540, and the frame rate is 25 frames per second. The dataset labels 8250 vehicles and 1.21 million object bounding boxes. The BIT-Vehicle Dataset [31] covers 9850 images of buses, microbuses, minivans, sedans, SUVs and trucks, which can be used to evaluate the performance of multi-class vehicle detection algorithms.
However, public datasets specific to panoramic image detection remain unavailable. The need for a panoramic image detection dataset has therefore become more urgent.

Object Detection Methods
Deep learning has brought great breakthroughs in image classification, which have in turn driven great progress in object detection. In particular, the convolutional neural network (CNN) plays a very important role in feature extraction [32]. Girshick et al. proposed regions with CNN features (RCNN) [33] in 2014, which applied CNN to object detection. Its extensions, Fast RCNN [34] and Faster RCNN [18], further improved the detection speed. After that, R-FCN [20], PVANet [24], feature pyramid networks (FPN) [35], and Mask R-CNN [36] built on and optimized Faster RCNN, improving detection speed and accuracy. In addition, in order to meet the real-time requirements of some scenes, Redmon et al. proposed a regression-based one-stage method, YOLO [37], based on OverFeat [38] in 2015. They then proposed its extensions YOLOv2 [39] and the latest YOLOv3 [21]. Another branch of the one-stage methods predicts from multiple feature layers, such as SSD [22], Rainbow SSD (R-SSD) [40], and Deconvolutional Single Shot Detector (DSSD) [25].
At present, object detection methods using deep learning can be divided into two major categories: the two-stage detection framework and the one-stage detection framework [41]. The former first generates proposals in a proposal stage and then uses a CNN to classify these proposals. The latter has no proposal stage and directly converts the problem of object localization into a regression problem. The comparison results of typical object detection algorithms based on deep learning are shown in Table 1.

A. two-stage detection framework
RCNN is the basis of most current two-stage detection frameworks. It first uses the selective search [42] algorithm to generate candidate bounding boxes of interest. Each proposal is then sent to the CNN for feature extraction to generate feature vectors. Finally, a support vector machine (SVM) is used for classification. Its extension Fast RCNN optimizes the runtime of the algorithm.
Faster RCNN further reduces the running time of the algorithm by introducing the region proposal network (RPN) [18], which generates proposals directly without increasing the computation cost. In this way, end-to-end object detection can be achieved, and the computational cost is reduced. In addition, the loss of accuracy caused by excessive proposals can be avoided.
Many two-stage methods have been built on Faster RCNN. For example, PVANet [24] optimizes the feature extraction network and proposes a lightweight network. R-FCN [20] introduces position-sensitive score maps on the basis of Faster RCNN, so that features can be shared across the whole image and the detection speed is improved.

B. one-stage detection framework
One-stage detection frameworks such as YOLO [37], its upgraded version YOLOv3 [21], SSD [22] and DSSD [25] have no explicit proposal stage. YOLO directly performs feature extraction, candidate bounding box regression and classification in the same convolutional network. This method performs poorly on small objects and on multiple objects appearing in the same grid cell. Its extension YOLOv3 proposes Darknet53 [39] and implements a multi-scale prediction method following FPN [35], so as to obtain better predictions while increasing speed.
SSD sets discretized, multi-scale default boxes on feature maps with different resolutions (SSD512 uses 7 layers). Small convolution kernels are added to each feature map as the final prediction layer to complete classification and bounding box regression. DSSD changes the feature extraction network from VGG-16 [43] to ResNet-101 [19] to enhance the feature extraction capability of the network. At the same time, deconvolution is used to extract contextual semantic information so as to improve the detection accuracy for small objects. Currently, there are many public datasets for object detection, but most of them do not consist of panoramic images. In addition, there are relatively few datasets of large traffic scene images. In recent years, with the development of panoramic imaging technology, panoramic images have shown obvious advantages over traditional images in terms of overall scene perception. Panoramic images can therefore be more widely used in digital cities, intelligent transportation and automatic driving. For these reasons, we construct a panoramic road scene object detection dataset, namely, Pano-RSOD (Dataset link: https://pan.baidu.com/s/1H9RsXfXCCfBgpF2bY2LGeA).

Panoramic Road Scene Object Detection Dataset
Pano-RSOD is captured from the streetscape of downtown Zhongshan City, Guangdong Province, China. It contains a total of 9402 images. The size of each image is 2048 × 1024 pixels. The labels are produced in PASCAL VOC format and include vehicles (50,255 bounding boxes), pedestrians (11,227), traffic signs (8622) and guiding arrows (17,438). Each image contains about nine objects on average. For brevity, we use car, person, sign and line to denote vehicles, pedestrians, traffic signs and guiding arrows in the remainder of the paper.
In general, Pano-RSOD is a multi-scale panoramic object detection dataset for road scenes, with more objects in a single panoramic image. The panoramic images also contain a large number of small objects. It is important to point out that objects in panoramic images are often distorted. Pano-RSOD therefore provides data for training, testing and evaluating object detection algorithms aimed at panoramic images, distorted objects or small objects in road scenes. Some example images of Pano-RSOD are shown in Figure 1.
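The headline statistics above can be cross-checked with a few lines of arithmetic. The sketch below uses only the per-class counts and the image count quoted in this section; the helper name `summarize` is illustrative and not part of any released tooling.

```python
# Per-class bounding-box counts and image count as quoted in the text.
BOX_COUNTS = {"car": 50_255, "person": 11_227, "sign": 8_622, "line": 17_438}
NUM_IMAGES = 9_402


def summarize(box_counts, num_images):
    """Return total boxes, mean boxes per image, and per-class share."""
    total = sum(box_counts.values())
    shares = {name: count / total for name, count in box_counts.items()}
    return total, total / num_images, shares


total, per_image, shares = summarize(BOX_COUNTS, NUM_IMAGES)
print(f"total boxes: {total}, boxes per image: {per_image:.2f}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total agrees with the 87,542 bounding boxes reported in the labeling section, and the mean of roughly 9.3 boxes per image matches the "about nine objects" figure.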

Panoramic Image Acquisition
In order to construct a road scene panoramic image dataset, a panoramic image acquisition system is constructed. The system is composed of a multi-camera panoramic vision system (i.e., Ladybug5) and a vehicle. The images are collected through the system by driving the vehicle in different road scenes. The panoramic image acquisition vehicle is shown in Figure 2.
The multi-camera panoramic vision system uses multiple sub-cameras distributed in different orientations to acquire the image information that can be perceived from the current viewpoint. The panoramic image of the Ladybug5 satisfies the spherical camera theory, which can establish the projection relationship between each sub-image and the panoramic image. As shown in Figure 3, the right panoramic image can be acquired by the left multi-camera panoramic vision system. The multi-camera panoramic vision system provides high-resolution, dead-band panoramic images with the synchronization and fast speed of data acquisition.
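To make the spherical camera model concrete, the following sketch maps a panorama pixel to a unit viewing direction and back. It assumes the 2048 × 1024 panoramas are stored in the common equirectangular layout (longitude along the width, latitude along the height); the text does not state the storage projection, so this layout, the axis convention and the function names are assumptions for illustration only.

```python
import math

WIDTH, HEIGHT = 2048, 1024  # panorama size quoted in the dataset description


def pixel_to_ray(u, v, width=WIDTH, height=HEIGHT):
    """Map an equirectangular pixel (u, v) to a unit direction (x, y, z).

    Assumed convention: the image centre looks along +z, x points right,
    y points up.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi   # longitude in [-pi, pi)
    lat = (0.5 - v / height) * math.pi        # latitude in [-pi/2, pi/2]
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def ray_to_pixel(x, y, z, width=WIDTH, height=HEIGHT):
    """Inverse mapping: direction (x, y, z) to equirectangular pixel (u, v)."""
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y / norm)))
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


centre_ray = pixel_to_ray(WIDTH / 2, HEIGHT / 2)
round_trip = ray_to_pixel(*pixel_to_ray(300.0, 700.0))
print(centre_ray, round_trip)
```

Under this layout, equal pixel areas near the top and bottom of the image cover much smaller solid angles than at the horizon, which is one source of the object distortion discussed in this paper.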
During panoramic image acquisition, we record the location of image collection, i.e., the name of the road. The images of the training set, the validation set and the test set of Pano-RSOD all come from different roads of the city to avoid repetition. In addition, to avoid highly similar images within the same set, the image sequences are first sampled every 15 frames for dataset construction, and we then manually remove the remaining images that are highly similar.
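The sampling and splitting procedure described above can be sketched as follows. The sketch reads "every 15 frames" as keeping one frame out of each 15, and the road names and the road-to-split assignment are hypothetical placeholders; the manual removal of similar images is not modelled.

```python
def subsample(frames, step=15):
    """Keep one frame out of every `step` frames of a sequence."""
    return frames[::step]


def split_by_road(records, road_to_split):
    """Assign each (image_name, road_name) record to the split of its road.

    Because the assignment is made per road, no road contributes images to
    more than one of the train/val/test sets.
    """
    splits = {"train": [], "val": [], "test": []}
    for image_name, road in records:
        splits[road_to_split[road]].append(image_name)
    return splits


# Hypothetical example: three roads, 45 consecutive frames on each.
road_to_split = {"road_a": "train", "road_b": "val", "road_c": "test"}
records = []
for road in road_to_split:
    sequence = [f"{road}_{i:05d}.jpg" for i in range(45)]
    records.extend((name, road) for name in subsample(sequence))

splits = split_by_road(records, road_to_split)
print({name: len(images) for name, images in splits.items()})
```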

Dataset Labeling
Label creation is an important part of building datasets for image classification, object detection and segmentation. The quality of the labels is directly related to the final accuracy of the trained model. In this paper, we use an open source image annotation tool on GitHub, namely, LabelImg (https://github.com/tzutalin/labelImg). The output is an xml file in the same format as PASCAL VOC [7].
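For readers unfamiliar with the PASCAL VOC annotation layout, the sketch below parses one such xml file with the Python standard library. The sample annotation (file name and box coordinates) is invented for illustration and is not taken from Pano-RSOD; only the tag structure follows the PASCAL VOC convention, and the class names follow the short names used in this paper.

```python
import xml.etree.ElementTree as ET

# Invented sample in PASCAL VOC layout; coordinates are illustrative only.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>2048</width><height>1024</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>100</xmin><ymin>480</ymin><xmax>180</xmax><ymax>540</ymax></bndbox>
  </object>
  <object>
    <name>sign</name>
    <bndbox><xmin>900</xmin><ymin>300</ymin><xmax>930</xmax><ymax>330</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return ((width, height), [(class_name, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    image_size = (int(size.findtext("width")), int(size.findtext("height")))
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), *coords))
    return image_size, boxes


image_size, boxes = parse_voc(SAMPLE_XML)
print(image_size, boxes)
```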
In intelligent transportation and panorama mapping, the detection of vehicles, pedestrians, traffic signs and guiding arrows often plays an important role. This paper therefore selects these four most common objects in traffic road scenes for labeling. When labeling, we try to cover the object completely with the rectangular bounding box. The car class labels only vehicles with four wheels; the person class contains any people, e.g., walking, standing, sitting or riding; the sign class includes any traffic sign in the traffic scene; and the line class labels only the guiding arrows of all kinds on the roads.
To build a high-quality dataset, we set strict controls over the data labeling process. Ten researchers who study object detection were asked to process the data and were divided evenly into two groups. Five researchers (the first group) manually annotated the images, including the object category label and the coordinates of the rectangular box. After all the images had been labelled, the other five researchers (the second group) checked the labelled data. A vote then determines whether an image passes verification: if more than three persons vote to pass, the image passes; otherwise it is relabeled until it passes the check. In the end, we labeled 4 categories with a total of 87,542 object bounding boxes.
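The verification rule can be written down in a couple of lines. This is only a sketch of the stated rule ("more than three" of the five checkers must approve, i.e., at least four); the function name and the boolean vote representation are illustrative.

```python
def passes_review(votes, required_more_than=3):
    """Return True if strictly more than `required_more_than` checkers approve.

    `votes` is an iterable of booleans, one per checker.
    """
    return sum(1 for vote in votes if vote) > required_more_than


print(passes_review([True, True, True, True, False]))   # four approvals
print(passes_review([True, True, True, False, False]))  # three approvals
```

With five checkers, an image with exactly three approvals is sent back for relabeling under this reading of the rule.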

Dataset Statistics and Analysis
Our road scene panoramic image dataset contains a large number of labeled samples. Each class has sufficient samples: vehicles are the most abundant, and even the smallest class, traffic signs, has more than 8500 samples. Moreover, each type of sample is acquired from different road traffic scenarios, such as city intersections, suburban roads and urban roads, which provides rich foreground and background feature information for CNN feature extraction. In addition, the dataset contains a large number of small objects and objects with occlusion and overlap. This increases the difficulty of the object detection task and also helps us evaluate the advantages and disadvantages of object detection algorithms. Figure 4a shows the number of objects in each class of the dataset. Figure 4b shows the number of objects at different scales. Specifically, approximately 37% of objects are small (scale ≤ 32), 56% are medium (32 < scale ≤ 128), and 7% are large (scale > 128). We define scale as the square root of the object's area.
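The scale definition and the three size buckets translate directly into code. The sketch below assumes boxes are given as (xmin, ymin, xmax, ymax) in pixels; the function names and the example boxes are illustrative.

```python
import math
from collections import Counter


def object_scale(xmin, ymin, xmax, ymax):
    """Scale of a box, defined as the square root of its area in pixels."""
    width = max(0, xmax - xmin)
    height = max(0, ymax - ymin)
    return math.sqrt(width * height)


def scale_bucket(scale):
    """Bucket a scale using the thresholds quoted in the text."""
    if scale <= 32:
        return "small"
    if scale <= 128:
        return "medium"
    return "large"


# Illustrative boxes only, not taken from the dataset.
example_boxes = [(0, 0, 64, 16), (10, 10, 110, 110), (0, 0, 300, 200)]
histogram = Counter(scale_bucket(object_scale(*box)) for box in example_boxes)
print(dict(histogram))
```

Note that a 64 × 16 box has scale 32 and is counted as small, since the definition depends only on area and not on aspect ratio.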
The Pascal VOC dataset has a relatively small image size and few object types. It also contains relatively few traffic scenarios and traffic objects. Compared with the UA-DETRAC Dataset [12] and the BIT-Vehicle Dataset [31], Pano-RSOD includes richer background information, and it covers different traffic road scenes such as urban roads, crossroads, overpasses and suburban roads, which maximizes the diversity of the background. In addition, most of the pedestrian and vehicle datasets cover only a single type of object, and the number of objects in each scene is relatively small, which makes them unsuitable for object detection in complex traffic scenarios.
Compared with the Pascal VOC Dataset, the panoramic images in Pano-RSOD are high-resolution and contain about 9 objects per image on average, so that fewer images are needed to provide the same number of training objects. Compared with the 10,053 labeled vehicles in the BIT-Vehicle Dataset, Pano-RSOD contains 50,255 labeled vehicles covering a wider range of scales, which can be used for vehicle detection and other tasks. In contrast with the Cityscapes dataset [14] and Mapillary Vistas [13] (from which images with a strong wide-angle view and 360-degree images were removed), both of which are used for semantic street-level understanding, the images in Pano-RSOD have a 360-degree angle of view instead of a single view, so that they contain more objects at various scales and capture the whole road scene in a single image. Of course, the other datasets do not consist of panoramic images and are therefore essentially different from Pano-RSOD, so we give only a rough qualitative comparison.

•
Panorama: According to the information collected from the Internet, the current public object detection datasets basically do not consist of panoramic images, whereas our road scene panoramic dataset can provide a good reference for applying panoramic technology to object detection. In addition, objects in panoramic images are often distorted, which poses a challenge for the object detection task.

•
Large-scale and high resolution: According to the comparison results in Table 2, Pano-RSOD has more labeled samples, with vehicle samples being the most abundant. Pano-RSOD also has a higher image resolution.

•
Multi-scale and more small objects: As can be seen from Figure 4b, Pano-RSOD covers a wide range of scales. In particular, it has as many as 31,579 small-object samples with a scale less than 32.

Baseline Methods
Since the top-ranked object detection methods on the PASCAL VOC and KITTI datasets have recently adopted convolutional neural networks, we choose baseline methods based on CNN. In this section, we evaluate different object detection algorithms based on the one-stage and two-stage detection frameworks reviewed in Section 2.2. A simple flow diagram of the two kinds of methods used in this paper is shown in Figure 6.

Since the top-ranked object detection methods on the PASCAL VOC and KITTI benchmarks have recently adopted convolutional neural networks (CNNs), we choose CNN-based baseline methods. In this section, we evaluate detectors built on the one-stage and two-stage detection frameworks reviewed in Section 2.2. A simple flow diagram of the two kinds of methods used in this paper is shown in Figure 6.
Anchor box settings. In the training and prediction stages, the five baseline detectors in this paper rely on preset anchor boxes, which serve as references for the final predicted bounding boxes. Poorly chosen anchors inevitably lead to larger box regression errors, so it is especially important to set anchors with appropriate scales and aspect ratios. We therefore statistically analyze the data distribution of Pano-RSOD. The scale distribution is shown in Figure 4b. The heights and widths of the objects are clustered with the K-means algorithm, and the results are given in Table 3. Since the objects detected in this paper fall into four categories, the number of clusters is set to 4. Considering the object scale distribution in Figure 4b, the aspect ratios after clustering in Table 3, and the hardware conditions of the experiment, the anchor scales in Faster RCNN and R-FCN are set to {32, 64, 128, 256, 512} and the aspect ratios to {0.5, 1, 2, 3}. Figure 7 shows the distribution of anchors in the dataset. The anchors with these scales and aspect ratios cover the samples to a great extent. For SSD, an additional convolutional layer is added on top of the base feature extraction network InceptionV2 [23], which gives a total of six feature layers for prediction. The anchor scales and aspect ratios are calculated using the method of Ref. [22], and each prediction layer sets anchors with multiple scales and aspect ratios. YOLOv3 performs k-means clustering on the object sizes of the training set (using the IoU value as the distance indicator) [21] to set up 9 different anchors. Table 4 shows the detailed anchor parameter settings of the baseline methods in this paper.
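As an illustration of the anchor settings for Faster RCNN and R-FCN, the following minimal sketch enumerates the anchor shapes implied by the scales {32, 64, 128, 256, 512} and aspect ratios {0.5, 1, 2, 3}. It assumes that scale is the square root of the anchor area (the definition used for Figure 4) and that aspect ratio is W divided by H (the definition used for Table 3). The function name `make_anchors` is hypothetical and not taken from any detector implementation.

```python
import math

# Values taken from the text; the interpretation below is an assumption.
SCALES = [32, 64, 128, 256, 512]   # scale = sqrt(anchor area), in pixels
ASPECT_RATIOS = [0.5, 1, 2, 3]     # aspect ratio = W / H


def make_anchors(scales, ratios):
    """Return (width, height) for every scale/ratio combination.

    For scale s and ratio r = W / H, choosing W = s * sqrt(r) and
    H = s / sqrt(r) keeps the area at s**2 while giving W / H = r.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((w, h))
    return anchors


anchors = make_anchors(SCALES, ASPECT_RATIOS)
for w, h in anchors[:4]:
    print(f"W = {w:6.1f}, H = {h:6.1f}, W/H = {w / h:.2f}")
```

With five scales and four aspect ratios, this yields 20 anchor shapes per location, each with the same area as a square of the given scale.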
Training strategy and model parameters. We use SGD to optimize all five detectors and reduce the learning rate stepwise. Considering the depth of the network and other factors for each detector, appropriate iteration steps and initial learning rates are set to ensure that the network converges. For the relatively deep networks, we set a smaller initial learning rate to avoid gradient explosion. Faster RCNN (VGG-16 based and ResNet-101 based) and R-FCN are all trained for 100 k steps, with initial learning rates of $1 \times 10^{-2}$, $1 \times 10^{-3}$ and $1 \times 10^{-3}$, respectively. The learning rates are then reduced to one-tenth of the original at 80 k steps. SSD and YOLOv3 are both trained for 150 k steps, with initial learning rates of $4 \times 10^{-3}$ and $1 \times 10^{-3}$, respectively. Their learning rates also drop to one-tenth of the original at 80 k steps. The selected hyper-parameters for the five detectors are shown in Table 5.
Table 5 (excerpt): the momentum is 0.9 and the IoU threshold is 0.5 for all five detectors.
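The stepwise learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the training code used in the paper: the function name `step_lr` and the dictionary layout are hypothetical, and it is an assumption that the drop takes effect exactly at step 80 000 (rather than just after it).

```python
def step_lr(step, base_lr, drop_step=80_000, factor=0.1):
    """Learning rate at a given step: base_lr before drop_step, base_lr * factor afterwards."""
    return base_lr * factor if step >= drop_step else base_lr


# (initial learning rate, total training steps) as stated in the text.
SCHEDULES = {
    "Faster RCNN (VGG-16)": (1e-2, 100_000),
    "Faster RCNN (ResNet-101)": (1e-3, 100_000),
    "R-FCN": (1e-3, 100_000),
    "SSD": (4e-3, 150_000),
    "YOLOv3": (1e-3, 150_000),
}

for name, (base_lr, total_steps) in SCHEDULES.items():
    first = step_lr(0, base_lr)
    last = step_lr(total_steps - 1, base_lr)
    print(f"{name}: starts at {first:g}, ends at {last:g}")
```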

Experiments and Benchmark Statistics
To test the dataset and build a benchmark for Pano-RSOD, we train and test state-of-the-art algorithms (Faster RCNN, R-FCN, SSD and YOLOv3) on Pano-RSOD. Of the 9402 images in the dataset, 7000 are manually selected as the training set, 1000 as the test set, and 1402 as the validation set, to detect four classes of objects, i.e., car, person, sign and line. The images of the training, validation and test sets are collected from different roads of the city, and each set covers urban and suburban scenes. In the experiment, transfer learning is applied, and each network is fine-tuned from a pre-trained model [44]. Faster RCNN (VGG-16) and YOLOv3 use models pre-trained on ImageNet [8], and the other three detectors use models pre-trained on COCO [9]. In addition, the training and testing images are resized to a fixed size of 1024 × 512 pixels for all detectors.
All evaluations are run on an Intel Core i7-3930K (3.80 GHz) CPU with 24 GB of memory and a single Tesla P100 GPU with 16 GB of memory. YOLOv3 is run on the Darknet framework, while the other detectors are run on the TensorFlow framework.

Evaluation Metrics
Currently, average precision (AP) and mean average precision (mAP) are used to evaluate the performance of object detection algorithms [33][34][35][36][37][38][39][40]. To compare the performance of the state-of-the-art object detection algorithms on Pano-RSOD, we use AP and mAP to evaluate the detection results of each category and of all categories, respectively, for every learned model.
A detection is counted as a true positive (tp) if its intersection-over-union (IoU) with a ground-truth bounding box is larger than the given threshold. If multiple detections match the same ground truth, the one with the largest IoU is the tp and the others are false positives (fp). After all detections have been matched, every ground truth without a matched detection is a false negative (fn), and every detection without a matched ground truth is a false positive (fp). AP summarizes the precision-recall curve by averaging, over recall levels $r$, the interpolated precision
$$\max_{R(c):\,R(c) \ge r} P(R(c)),$$
where the recall is $R(c) = tp(c)/(tp(c) + fn(c))$ and the precision is $P(R(c)) = tp(c)/(tp(c) + fp(c))$, both for a given confidence threshold $c$ and with matches determined at the chosen IoU threshold. mAP is calculated from the AP of each category:
$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP}_i,$$
where $n$ is the number of object classes and $\mathrm{AP}_i$ is the AP of the $i$-th class. We use two metrics in the following evaluation, i.e., AP@0.5 (PASCAL VOC's metric [7]) and AP@0.5:0.95 (COCO's metric [9]). The former is computed at a single IoU threshold of 0.5, while the latter is averaged over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. In the remainder of the paper, the abbreviated forms AP and mAP refer to AP@0.5 and the mean of the per-class AP@0.5, respectively.
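The evaluation procedure above can be sketched in a few lines. The code below is a minimal single-image, single-class illustration under stated assumptions, not the evaluation code used for the benchmark: detections are matched greedily in descending confidence order (a common simplification of the matching rule), and AP is taken as the area under the interpolated precision-recall curve, which is one common way to average the interpolated precision over recall. The function names and the toy boxes are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr=0.5):
    """AP for one class; detections is a list of (confidence, box)."""
    if not gt_boxes:
        return 0.0
    matched, flags = set(), []
    # Assumption: greedy matching in descending confidence order.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j not in matched:
                overlap = iou(box, gt)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        is_tp = best_iou > iou_thr
        if is_tp:
            matched.add(best_j)
        flags.append(is_tp)
    # Precision and recall after each detection (i.e., per confidence threshold c).
    precisions, recalls, tp, fp = [], [], 0, 0
    for is_tp in flags:
        tp += is_tp
        fp += not is_tp
        precisions.append(tp / (tp + fp))
        recalls.append(tp / len(gt_boxes))
    # Interpolated precision: max precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ap_per_class):
    """mAP = (1/n) * sum of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Ten IoU thresholds 0.50, 0.55, ..., 0.95 for AP@0.5:0.95.
COCO_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

# Toy example: two ground truths, two exact hits and one false alarm.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
ap50 = average_precision(dets, gts, iou_thr=0.5)
ap_coco = sum(average_precision(dets, gts, t) for t in COCO_THRESHOLDS) / len(COCO_THRESHOLDS)
print(f"AP@0.5 = {ap50:.4f}, AP@0.5:0.95 = {ap_coco:.4f}")
```

In this toy case both metrics coincide because the true positives overlap their ground truths perfectly; with imperfect localization, AP@0.5:0.95 falls below AP@0.5.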

Qualitative Evaluation
To give a qualitative analysis of the performance of the different detectors, we show the detection results of the five baseline methods in four road scenes. As shown in Figure 8, the baseline methods do not perform well on small objects. In addition, there are some missed and false detections. For example, as shown in Figure 8a, some larger vehicles are not detected by Faster RCNN (VGG-16) and R-FCN. An advertising board is mistakenly detected as a traffic sign by Faster RCNN (VGG-16), as shown in Figure 8c. This reflects the diversity of background information in Pano-RSOD, which poses a severe challenge to object detection in large scenes. Thus, correctly distinguishing background from foreground remains the key task for improving detection performance on our dataset.

Quantitative Evaluation
To quantitatively analyze the performance of the various algorithms, we evaluate and compare the five detectors using the mAP metric, and analyze the detection difficulty of the different categories using the AP metric. In addition, we record the time each detector needs to process one test image to measure the speed of the algorithm. Table 6 shows the specific performance statistics for the different detectors. For a more direct comparison of detection performance, we plot the precision-recall curve per category for each detector at an IoU threshold of 0.5. The results are shown in Figure 9. As shown in Figure 9 and Table 6, YOLOv3 achieves the best overall mAP, which is mainly due to its adoption of the FPN [10] feature pyramid structure. That structure combines low-resolution, semantically strong features with high-resolution, semantically weak features. Its advantage is largest for the person category: the AP of person is 6.13 percentage points higher than that of the second-best detector, Faster RCNN (ResNet-101).
Looking at the per-category results, the car category gets the best performance and the person category the worst for every detector except YOLOv3. This is mainly because the car category has more training samples than the person category and therefore provides more feature information. It is also closely related to the fact that most person objects in the images are small (as shown in Figure 4b, their scale is mostly less than 64). In addition, the mAP of Faster RCNN (ResNet-101) is 4.27 points higher than that of Faster RCNN (VGG-16), which shows that a better feature extraction network is very helpful for object detection. On the whole, SSD is slightly weaker. We conjecture that this is because it lacks the higher-quality proposals of Faster RCNN and R-FCN and does not add the contextual semantic information used by YOLOv3. To sum up, these elements, i.e., a better feature extraction network, higher-quality proposals and richer semantic information, all contribute to improving detection performance on Pano-RSOD.
In terms of speed, YOLOv3 also achieves the top performance. On the one hand, YOLOv3 uses the Darknet deep learning framework, which is written in C and reduces the running time of the program. On the other hand, it benefits from its network structure and from the optimized matching mechanism between anchors and ground truth: Darknet53 makes heavy use of 1 × 1 convolutions and shortcut structures, and each ground truth is matched to only one prior box, which greatly reduces the complexity of the model. SSD, which is an end-to-end detection method, also achieves good speed. In addition, R-FCN replaces the RoI pooling layer and the fully connected layer of Faster RCNN with position-sensitive score maps built from fully convolutional layers, which reduces the computational complexity of the head and shortens the prediction time by 6.17 ms.
For a more intuitive analysis of the five detectors, we plot their speed versus accuracy, as shown in Figure 10. Faster RCNN and R-FCN are significantly better than SSD in terms of detection accuracy, while the opposite holds for speed. For example, the mAP of Faster RCNN (ResNet-101), the second-best result, is 8.77 percentage points higher than that of SSD, but Faster RCNN (ResNet-101) is slower than SSD. YOLOv3 offers a good trade-off between speed and accuracy and achieves the best performance overall.

Comparisons with a General Dataset
To demonstrate the differences between models trained on conventional non-panoramic datasets and the model trained on Pano-RSOD, we compare Pano-RSOD with other object detection datasets, i.e., COCO, KITTI and UA-DETRAC. Based on the experimental results of the five baseline detectors in Table 6, YOLOv3 offers a good trade-off between detection accuracy and speed. Therefore, we choose YOLOv3 (with pre-trained model) as the baseline method in the following comparative experiments. We set up five comparative experiments in total: COCO as training set and Pano-RSOD as test set; KITTI as training set and Pano-RSOD as test set; Pano-RSOD as training set and KITTI as test set; Pano-RSOD as training set and UA-DETRAC as test set; and Pano-RSOD as both training set and test set. Since the categories common to Pano-RSOD, COCO and KITTI are vehicle and pedestrian, we use these two categories in the experiments for a fair comparison.
Table 7 shows the experimental results for the different training sets in terms of AP, with the IoU threshold set to 0.5. The model trained on COCO clearly performs poorly on the panoramic dataset, i.e., Pano-RSOD. The reasons can be summarized in two aspects: (1) COCO contains relatively few traffic scenes; (2) the images in Pano-RSOD differ from those in COCO because the panoramic nature of Pano-RSOD introduces optical distortions. Although KITTI is a large-scale street-level object detection dataset whose scenes are the same as those of Pano-RSOD, the AP of the model trained on KITTI is still about 20% lower than that of the model trained on Pano-RSOD. This is good evidence that models trained on conventional non-panoramic imagery perform worse on panoramic images than models trained on panoramic images. In contrast, when tested on KITTI and UA-DETRAC, the models trained on Pano-RSOD achieve relatively good results owing to the diversity of object scales in Pano-RSOD.
To evaluate the robustness of the detectors against a varying IoU threshold, we evaluate the five detectors with AP@0.5:0.95 (COCO's metric). Table 8 shows that the accuracy of each algorithm is significantly reduced when the COCO evaluation metric is adopted, which indicates that the object detection algorithms are particularly sensitive to the choice of IoU threshold.
Table 8. AP@0.5:0.95 of five detectors.
We then increase the IoU threshold from 0.4 to 0.8 in increments of 0.1, calculate the AP at each IoU threshold for each detector, and plot the IoU versus AP curve. The results are shown in Figure 11. As can be seen from Figure 11, when the matching IoU value increases, the AP of the person category drops much more sharply than that of the car category for every detector, and falls to its worst value at IoU = 0.8. This implies that the detected bounding boxes do not overlap the ground truth closely and that the detection of small objects is more sensitive to the IoU value. Therefore, more effort should be put into developing detectors that can better handle small objects in Pano-RSOD.
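A small worked example helps explain why small objects are more sensitive to the IoU threshold. The sketch below considers a square ground-truth box of a given side length and a detection of the same size shifted horizontally by a few pixels; the IoU then equals (side - shift) / (side + shift). The function name `shifted_iou` and the chosen sizes (32 and 256 pixels, a 4-pixel shift) are illustrative assumptions, not measurements from the dataset.

```python
def shifted_iou(side, shift):
    """IoU between a side x side box and the same box shifted horizontally by `shift` pixels."""
    inter = side * max(0, side - shift)   # overlapping strip
    union = 2 * side * side - inter       # both areas minus the overlap
    return inter / union


for side in (32, 256):
    value = shifted_iou(side, shift=4)
    verdict = "passes" if value > 0.8 else "fails"
    print(f"side = {side:3d} px, 4 px shift: IoU = {value:.3f} ({verdict} IoU > 0.8)")
```

The same 4-pixel localization error leaves the large box well above an IoU of 0.8 (about 0.97) but pushes the small box below it (about 0.78), which is consistent with the sharper AP drop observed for the person category.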

Conclusions
Pano-RSOD is a panoramic road scene dataset for object detection. It has distinctive characteristics: high resolution, panoramic views, rich annotations with many small objects, and diverse scenes. Experiments have been conducted with different object detection algorithms based on deep neural networks. From the experimental results, we conclude that Pano-RSOD can be used as a benchmark for evaluating object detection performance. On this benchmark, YOLOv3 (Darknet53) achieves the best AP (car and person), mAP and speed, whereas Faster RCNN (ResNet-101) achieves the best AP (sign and line) and AP@0.5:0.95 (COCO's metric).
However, challenges remain, such as the detection of small and occluded objects and the panoramic view distortion. In future work, methods that handle the panoramic view distortion, or that convert the panoramic view directly to a normal view, can be incorporated into new object detection algorithms. Further, object detection with new structures that process the panoramic view directly, such as spherical CNNs, can be proposed. We also plan to extend Pano-RSOD and apply the dataset to other tasks such as semantic or instance segmentation in panoramic scenes.

Figure 3. The multi-camera panoramic vision system and the acquired panoramic image.

Figure 4. Dataset statistics. (a) Number of objects per category; (b) Number of objects per scale. We define scale as the square root of the object's area.

Figure 6. General pipeline of the two types of object detection baseline methods adopted in this paper. The difference between the two kinds of methods is described in Section 2.2; here we only give a simple and intuitive diagram.

Figure 7. The distribution of anchors in the dataset (separate points are the distribution values of the anchors, and the gradient of the line represents the aspect ratio).

Better feature extraction network. Designing a better feature extraction network can provide more information for the object detection task. Compared with VGG-16, ResNet-101 has lower complexity, a deeper network and higher classification accuracy. Thus, we also use ResNet-101 instead of VGG-16 as the feature extraction network of Faster RCNN and R-FCN to improve the feature extraction ability.

Figure 8. The detection results of five algorithms in different road scenes. (a) crossroad; (b) overpasses; (c) crowded urban road; (d) suburban road. The text in the upper left corner of each image indicates the algorithm used. The magenta, red, green and cyan boxes represent car, person, sign and line, respectively.

• Object Detection Evaluation 2012 [10]:
This is a dataset for 2D object detection and azimuth estimation in the KITTI database.It consists of 7481 training images and 7518 test images.

Table 1. Comparison of typical object detection algorithms based on deep learning. The symbol * represents the multi-feature layer fusion. Methods evaluated in this work are bold-faced.

Table 2 lists the comparison results of Pano-RSOD and the existing object detection datasets. Compared with other road scene object detection datasets, the dataset of this paper has the following characteristics:

Table 3. Different aspect ratios after clustering. The second and third columns, i.e., H and W, are the height and width of objects in Pano-RSOD after clustering. The last column, i.e., aspect ratio, is calculated by dividing W by H.

Table 5. Hyper-parameters used in the training process.

Table 6. Performance statistics of object detection using different algorithms. The best results are bold-faced.

Table 7. Detection results of different training sets.
