Virtual to Real Adaptation of Pedestrian Detectors

Pedestrian detection through Computer Vision is a building block for a multitude of applications. Recently, there has been an increasing interest in convolutional neural network-based architectures to execute such a task. One of these supervised networks’ critical goals is to generalize the knowledge learned during the training phase to new scenarios with different characteristics. A suitably labeled dataset is essential to achieve this purpose. The main problem is that manually annotating a dataset usually requires a lot of human effort, and it is costly. To this end, we introduce ViPeD (Virtual Pedestrian Dataset), a new synthetically generated set of images collected with the highly photo-realistic graphical engine of the video game GTA V (Grand Theft Auto V), where annotations are automatically acquired. However, when training solely on the synthetic dataset, the model experiences a Synthetic2Real domain shift leading to a performance drop when applied to real-world images. To mitigate this gap, we propose two different domain adaptation techniques suitable for the pedestrian detection task, but possibly applicable to general object detection. Experiments show that the network trained with ViPeD can generalize over unseen real-world scenarios better than the detector trained over real-world data, exploiting the variety of our synthetic dataset. Furthermore, we demonstrate that with our domain adaptation techniques, we can reduce the Synthetic2Real domain shift, making the two domains closer and obtaining a performance improvement when testing the network over the real-world images.


Introduction
When it comes to smart cities, it is impossible not to consider video surveillance, which is becoming a key technology for a myriad of applications ranging from security to the well-being of people. In this context, pedestrian detection, in particular, is a fundamental aspect of smart city applications. These smart city applications have two conflicting requirements: to respond in real-time and to reduce false alarms as much as possible.
Deep learning, and in particular Convolutional Neural Networks (CNNs), offer excellent solutions to these conflicting requirements provided that we feed the neural network with a sufficient amount of data during the training phase. However, the data must be quite diversified to ensure that the CNNs can generalize and adapt to different scenarios having different characteristics, like different perspectives, illuminations, and object scales. Current state-of-the-art object detectors are powered by CNNs, since they can automatically learn features characterizing the objects by themselves; these solutions outperformed approaches relying instead on hand-crafted features.
This aspect becomes even more crucial in smart city applications where the smart devices that are typically used should be easily installed and deployed, without the need for an early tuning phase. Therefore, a key point for the training of cutting-edge CNNs is the availability of large sets of labeled training data that cover as much as possible the differences between the various scenarios. Although there are some large annotated generic datasets, such as ImageNet [1] and MS COCO [2], annotating the images is a very time-consuming operation, since it requires great human effort, and it is error-prone. Furthermore, these hand-labeled datasets are sometimes neither large nor diverse enough to learn general models and, as a consequence, they lead to poor cross-dataset performance, which restricts real-world utility. Finally, it should also be noted that it is sometimes problematic to create a training/testing dataset with specific characteristics.
To this end, one of the contributions of this work is to provide a suitable dataset collecting images from virtual world environments that mimics as much as possible all the characteristics of our target real-world scenarios. In particular, we introduce a new dataset named ViPeD (Virtual Pedestrian Dataset), a huge collection of images taken from the highly photo-realistic video game GTA V (Grand Theft Auto V), developed by Rockstar North, which extends the JTA (Joint Track Auto) dataset presented in [3]. We demonstrate that by using ViPeD during the training phase we can improve performance and achieve competitive results compared to the state-of-the-art approaches in the pedestrian detection task.
In this work, we extend our previous contribution [4] presented at the International Conference of Image Analysis and Processing (ICIAP) 2019.
In particular, we experimented with another state-of-the-art object detector, Faster-RCNN [5], and we employed an extended set of real-world datasets for improving the baselines and for better validating our approach. Furthermore, other than using a simple fine-tuning methodology as in [4], we also perform domain adaptation using a mixed-batch supervised method.
In the end, as in [4], we adapted the detector on specific real-world scenarios, specifically on the MOT detection benchmarks (MOT17Det and MOT19Det) [6,7]. They are real-world datasets suited for pedestrian detection. This latest experiment is intended to measure the ability of our model to transfer knowledge from our virtual-world trained model to some crowded real-world scenarios.
To summarize, in this work we propose a CNN-based system able to detect pedestrians for surveillance smart cameras. We train the detector using a new dataset collected using images from a realistic video game, and we take advantage of the graphics engine for extracting the annotations without any human intervention. This paper is an extension of our previous work [4]. We extend it with the following:
• we experiment with another state-of-the-art object detector, Faster-RCNN [5];
• we add a new set of experiments for evaluating the generalization capabilities of our training procedure;
• we test a mixed-batch supervised domain-adaptation approach, which we compare with the basic fine-tuning methodology.
The full code for replicating our experiments, together with the scripts which generate the ViPeD dataset from the JTA (Joint Track Auto) dataset [3], is accessible through our project web page.

Related Work
In this section, we review the most relevant works in object and pedestrian detection. We also analyze previous studies on using synthetic datasets as training sets.
Pedestrian detection is highly related to object detection. It deals with recognizing the specific class of pedestrians, usually walking in urban environments. We can subdivide approaches for the pedestrian detection problem into two main research areas. The first class of detectors is based on handcrafted features, such as ICF (Integral Channel Features) [8,9,10,11,12]. Those methods can usually rely on higher computational efficiency, at the cost of lower accuracy. On the other hand, deep neural network approaches have been explored. [13,14,15,16] proposed some modifications around the standard CNN network [17] to detect pedestrians, even accounting for different scales.
Many datasets are available for pedestrian detection. Caltech [18], INRIA [19], MOT17Det [6], MOT19Det [7], and CityPersons [20] are among the most important ones. Since they were collected in different living scenarios, they are intrinsically very heterogeneous datasets. Some of them [19,18] were specifically collected for detecting pedestrians in self-driving contexts. Our interest, however, is mostly concentrated on video-surveillance tasks and, in this scenario, the recently introduced MOT17Det and MOT19Det datasets have proved to be sufficiently challenging due to the high variability of the video subsets.
With the need for huge amounts of labeled data, generated datasets have recently gained considerable interest. [21,22] have studied the possibility of learning features from synthetic data, validating them on real scenarios. Unlike our work, however, they did not explore deep learning approaches. [23,24] focused their attention on the possibility of performing domain adaptation to map virtual features onto real ones. Authors in [3] created a dataset taking images from the highly photo-realistic video game GTA V and demonstrated that it is possible to reach excellent results on tasks such as people tracking and pose estimation when validating on real data.
To the best of our knowledge, [25] and [26] are the works closest to our setup. In particular, [25] also used GTA V as the virtual world, but, unlike our method, they concentrated on vehicle detection.
Instead, [26] used a synthetically-generated dataset to train a simple CNN to detect objects belonging to various classes in a video. The convolutional network dealt only with the classification, while the detection of objects relied on a background subtraction algorithm based on Gaussian mixture models (GMMs). The real-world performance was evaluated on two standard pedestrian detection datasets, and one of these (MOTChallenge 2015 [27]) is an older version of the dataset we used to carry out our experimentation.

Datasets
In this section, we describe the datasets exploited in this work. First, we introduce ViPeD (Virtual Pedestrian Dataset), a new synthetically generated collection of images used for training the network. Then we outline four real-world datasets that we used for validating our approach: MOT17Det [6], MOT19Det [7], CityPersons [20] and COCOPersons.

ViPeD - Virtual Pedestrian Dataset
As mentioned above, CNNs need large annotated datasets during the training phase to learn models robust to different scenarios, and creating the annotations is a very time-consuming operation that requires a great human effort.
ViPeD is a huge collection of images taken from the highly photo-realistic video game GTA V developed by Rockstar North. This newly introduced dataset extends the JTA (Joint Track Auto) dataset presented in [3]. Since we are dealing with images collected from a virtual world, we can extract pedestrian bounding boxes for free and without manual effort, exploiting 2D pedestrian positions extracted from the video card. The dataset includes a total of about 500K images, extracted from 512 full-HD videos (256 for training, 128 for validating and 128 for testing) of different urban scenarios.
In the following, we report some details on the construction of the bounding boxes and on the data augmentation procedure that we used to extend the JTA dataset for the pedestrian detection task.

Sanitizing the Bounding Boxes
Since JTA is specifically designed for pedestrian pose estimation and tracking, the provided annotations are not directly suitable for the pedestrian detection task. In particular, the annotations included in JTA are related to the joints of the human skeletons present in the scene (Fig. 1a). At the same time, what we need for our task are the coordinates of the bounding boxes surrounding each pedestrian instance.
Bounding box estimation can be addressed using different approaches. The GTA graphic engine is not publicly available, so it is not easy to extract the detailed masks around each pedestrian instance; [25] overcame this issue by extracting semantic masks and separating the instances by exploiting depth information. Instead, our approach uses the skeleton annotations already derived by the JTA team to reconstruct the precise bounding boxes. This seems to be a more reliable solution than the depth separation approach, especially when instances are densely distributed, as in the case of crowded pedestrian scenarios.
The very basic setup consists of drawing the smallest bounding box that encloses all the skeleton joints. The main issue with this simple approach is that each bounding box entirely contains the skeleton, but not the pedestrian mesh. Indeed, we can notice that the mesh is always larger than the skeleton (Fig. 1b). We can solve this problem by estimating a pad for the skeleton bounding box, exploiting another piece of information produced by the GTA graphic engine and already present in JTA, i.e., the distance of all the pedestrians in the scene from the camera. In particular, the height of the $i$-th mesh, denoted as $h_m^i$, can be estimated from the height of the $i$-th skeleton $h_s^i$ by means of the formula:

$$h_m^i = h_s^i + \frac{\alpha}{z^i}$$

where $z^i$ is the distance of the $i$-th pedestrian center of mass from the camera, and $\alpha$ is a parameter that depends on the camera projection matrix. Since we do not have access to the camera parameters, $\alpha$ is unknown.
Given that $z^i$ is already available for every pedestrian, we estimate the $\alpha$ parameter by annotating 30 random pedestrians, thus obtaining for them the correct value for $h_m^i$. At this point we can perform linear regression on the parameter $\alpha$ to find the best fit.
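The fitting step can be illustrated with a short sketch. It assumes a padding model of the form $h_m = h_s + \alpha / z$ (a pad that shrinks with the distance from the camera) and fits $\alpha$ by least squares through the origin. The function names and the sample values in the usage note are hypothetical and are not taken from the released scripts.

```python
from typing import Sequence, Tuple


def skeleton_bbox(joints: Sequence[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) enclosing all 2D joints."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    return min(xs), min(ys), max(xs), max(ys)


def fit_alpha(h_skel: Sequence[float], h_mesh: Sequence[float], z: Sequence[float]) -> float:
    """Least-squares estimate of alpha in (h_mesh - h_skel) = alpha * (1 / z).

    Assumed model (no intercept): the pad is inversely proportional to the
    camera distance z. h_mesh holds the manually annotated heights.
    """
    x = [1.0 / zi for zi in z]                       # regressor: inverse distance
    y = [hm - hs for hm, hs in zip(h_mesh, h_skel)]  # target: observed pad
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```

For instance, calling `fit_alpha([200, 100, 50], [224, 112, 56], [5, 10, 20])` on three made-up pedestrians recovers the single value of $\alpha$ that explains all three pads.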
Then, we estimate the mesh's width $w_m^i$. Unlike the height, the width is strongly linked to the specific pedestrian pose, so it is difficult to estimate it from the camera distance information alone. For this reason, we simply estimate $w_m^i$ directly from $h_m^i$, assuming no changes in the aspect ratio for the original and adjusted bounding boxes:

$$w_m^i = r^i \, h_m^i$$

where $r^i$ is the aspect ratio of the $i$-th bounding box. Examples of final estimated bounding boxes are shown in Fig. 1b. Finally, we perform a global analysis of these new annotations. As we can see in Fig. 2, in the dataset, there are annotations of pedestrians farther than 30-40 meters from the camera. However, human annotators tend to avoid annotating objects farther than this distance. We estimate this limit by measuring the height of the smallest bounding boxes in the human-annotated MOT17Det dataset [6] and finding at what distance from the camera the bounding boxes in our dataset reach this human-limit size. Therefore, to obtain annotations comparable to real-world human-annotated ones, we prune all the pedestrian annotations farther than 40 meters from the camera.
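Putting the sanitization steps together, a minimal sketch might look as follows. It assumes the height pad $\alpha / z$, an aspect ratio defined as width over height of the skeleton box, and a box that grows symmetrically around its center; the last choice in particular is an assumption of this sketch. Only the 40-meter pruning threshold comes from the text.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

MAX_DISTANCE_M = 40.0  # pruning threshold stated in the text


def sanitize_box(box: Box, z: float, alpha: float) -> Optional[Box]:
    """Return the padded box, or None if the pedestrian must be pruned."""
    if z <= 0 or z > MAX_DISTANCE_M:
        return None  # too far away (or invalid distance): drop the annotation
    x_min, y_min, x_max, y_max = box
    w_s, h_s = x_max - x_min, y_max - y_min
    if w_s <= 0 or h_s <= 0:
        return None  # degenerate skeleton box
    h_m = h_s + alpha / z  # assumed height model
    r = w_s / h_s          # aspect ratio of the skeleton box
    w_m = r * h_m          # same aspect ratio for the adjusted box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Assumption: the pad is distributed evenly around the box center.
    return (cx - w_m / 2, cy - h_m / 2, cx + w_m / 2, cy + h_m / 2)
```

Keeping the pruning inside the same function means every annotation passes through a single filter; a real pipeline could equally prune first and pad afterwards.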
In Fig. 3, we report some examples of images of the ViPeD dataset together with the sanitized bounding boxes.

Data Augmentation
Synthetic datasets should contain scenarios as close as possible to real-world ones. Even though images grabbed from the GTA game are already very realistic, some details are missing. In particular, images grabbed from the game are very sharp, edges are very pronounced, and common lens effects are not present. In light of this, we also prepare a more realistic version of the original images, adding some effects such as radial blur, Gaussian blur, and bloom, and adjusting the exposure/contrast. Parameters for these filters are randomly sampled from a uniform distribution.
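As an illustration of this kind of augmentation, the sketch below applies a Gaussian blur and an exposure/contrast adjustment whose parameters are drawn from uniform distributions. The parameter ranges and function names are hypothetical, radial blur and bloom are omitted for brevity, and NumPy is assumed to be available.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an H x W x C float image with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    # Filter along the vertical axis, then along the horizontal axis.
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)


def adjust_exposure_contrast(img: np.ndarray, gain: float, contrast: float) -> np.ndarray:
    """Scale brightness by gain, then stretch contrast around mid-gray (0.5)."""
    return np.clip((img * gain - 0.5) * contrast + 0.5, 0.0, 1.0)


def augment(img_uint8: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Blur and re-expose an H x W x C uint8 image with uniformly sampled parameters."""
    img = img_uint8.astype(np.float64) / 255.0
    sigma = rng.uniform(0.5, 1.5)     # hypothetical range
    gain = rng.uniform(0.9, 1.1)      # hypothetical range
    contrast = rng.uniform(0.9, 1.1)  # hypothetical range
    img = adjust_exposure_contrast(gaussian_blur(img, sigma), gain, contrast)
    return (img * 255.0).round().astype(np.uint8)
```

A call such as `augment(frame, np.random.default_rng(0))` returns an image of the same shape and type, so the bounding-box annotations remain valid without changes.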

Real-world datasets
In the following, we report some details about the real-world datasets we employ for evaluating our approach. MOT17Det [6] and MOT19Det [7] are two real-world pedestrian-detection benchmarks for surveillance-based applications. CityPersons [20] is a real-world dataset for pedestrian detection more focused on self-driving applications. Finally, COCOPersons is a split of the MS-COCO dataset [2] comprising images collected in general contexts.

MOT Datasets
MOT17Det [6] and MOT19Det [7] datasets are recently introduced benchmarks for surveillance-based applications (the latter has been presented at the Computer Vision and Pattern Recognition Conference (CVPR) 2019). They comprise a collection of challenging images for pedestrian detection taken from multiple sequences with various crowded scenarios having different viewpoints, weather conditions, and camera motions. The annotations for all the sequences are generated by human annotators from scratch, following a specific protocol described in their papers. Training images of the MOT17Det collection are taken from sequences 2, 4, 5, 9, 10, 11 and 13 (for a total of 5,316 images), while MOT19Det training set comprises sequences 1, 2, 3 and 5 (for a total of 8,931 images). Test images for both datasets are taken from the remaining sequences (for a total of 5,919 images in MOT17Det and 4,479 images in MOT19Det). It should be noticed that the authors released only the ground-truth annotations belonging to the training subsets.
The performance metrics concerning the test subsets are instead available only by submitting results to the MOT Challenge website. The main peculiarity of MOT19Det compared to MOT17Det is the massive crowding of the collected scenarios.

CityPersons
CityPersons dataset [20] is a recent collection of images of interest for the pedestrian detection community. It consists of a large and diverse set of stereo video sequences recorded in streets from different cities in Germany and neighboring countries. In particular, authors provide 5,000 images from 27 cities labeled with bounding boxes and divided across train/validation/test subsets. This dataset is more focused on self-driving applications, and images are collected from a moving car.

COCOPersons
COCOPersons dataset is a split of the popular COCO dataset [2] comprising images collected in general contexts belonging to 80 categories. We filter these images, keeping only the ones that contain annotations of the person category. Hence, we obtain a new dataset of about 66,000 images containing at least one pedestrian instance.
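A filter of this kind can be sketched on a dictionary in the standard COCO annotation layout (`images`, `annotations`, `categories`). The function name is hypothetical, and loading the JSON file is left to the caller.

```python
from typing import Any, Dict


def filter_person_images(coco: Dict[str, Any], category_name: str = "person") -> Dict[str, Any]:
    """Keep only person annotations and the images that contain at least one."""
    # Look the category id up by name instead of hard-coding it.
    cat_ids = {c["id"] for c in coco["categories"] if c["name"] == category_name}
    annotations = [a for a in coco["annotations"] if a["category_id"] in cat_ids]
    image_ids = {a["image_id"] for a in annotations}
    images = [im for im in coco["images"] if im["id"] in image_ids]
    categories = [c for c in coco["categories"] if c["id"] in cat_ids]
    return {"images": images, "annotations": annotations, "categories": categories}
```

Images without any person annotation are dropped entirely, which matches the statement that every remaining image contains at least one pedestrian instance.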

Method
Unlike in [4], we use Faster-RCNN [5] as an object detector, exploiting the TorchVision v0.3 implementation provided with the PyTorch deep-learning library. Unlike YOLOv3 [28], Faster-RCNN is a two-stage detector that employs a Region Proposal Network (RPN) to suggest interesting regions in the image to attend to. Then, features extracted from each proposed region contribute to the final object classification. For the detector backbone, we prefer the ResNet-50 FPN over the ResNet-101 FPN, since it gives satisfactory performance when also taking into account the computational resources and the time required during the training phase.
As a starting point, we consider Faster-RCNN pre-trained on the COCO dataset [2], a large dataset composed of images describing complex everyday scenes of common objects in their natural context, categorized in 80 different categories. Since this network is a generic object detector, we substitute the last layers of the detector to recognize and localize object instances belonging only to a specific category (i.e., the pedestrian category in our case).
In [4], domain adaptation between virtual and real scenarios is simply carried out by fine-tuning the pre-trained Faster-RCNN architecture. In this work, we also experiment with another type of supervised domain adaptation approach, called Balanced Gradient Contribution (BGC) [29,30]. It consists of mixing into the same mini-batch images from both real and virtual domains. As explained in [30], we use the real-world data as a regularization term over the synthetic data training loss. For this reason, the mini-batch is filled mostly with synthetic images, while the real-world ones are used to slightly constrain the gradients to not overfit on synthetic data.
Our entire approach is developed under the guidance of two different use-cases:
• a general-purpose use-case, where we are interested in obtaining a good overall detector, able to generalize to different scenarios, using the available synthetic data;
• a more specific use-case, where we want to maximize the performance on a particular dataset by fine-tuning the model previously trained with the synthetic data.
We explain these two scenarios in detail in the following section.

Experiments
We evaluate the detection performance using standard mean Average Precision (mAP) metrics. In particular, we rely on the COCO mAP and the MOT AP metrics. In all the experiments, we fix the IoU threshold to 0.5, which is a widely used choice. Therefore, the mAP is computed varying only the detection confidence threshold. We feed into the evaluators all the detector proposals having detection confidence greater than 0.05.
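To make the role of the two thresholds concrete, the following simplified sketch computes the average precision for one class on one image, with greedy matching at a fixed IoU threshold. It is not the COCO or the MOT evaluator: the official tools pool detections over the whole dataset and use their own interpolation schemes, and the function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(dets: Sequence[Tuple[float, Box]], gts: Sequence[Box],
                      iou_thr: float = 0.5, min_score: float = 0.05) -> float:
    """Area under the interpolated precision-recall curve for (score, box) detections."""
    if not gts:
        return 0.0
    # Discard low-confidence detections, then rank the rest by score.
    ranked = sorted((d for d in dets if d[0] > min_score), key=lambda d: d[0], reverse=True)
    matched = [False] * len(gts)
    flags: List[bool] = []
    for _, box in ranked:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if not matched[j]:
                v = iou(box, gt)
                if v > best_iou:
                    best_iou, best_j = v, j
        hit = best_j >= 0 and best_iou >= iou_thr
        if hit:
            matched[best_j] = True  # each ground truth can be matched only once
        flags.append(hit)
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(flags, start=1):
        tp += int(hit)
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Make precision non-increasing from right to left before integrating.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Since the IoU threshold is fixed, the only quantity that varies along the precision-recall curve is the confidence cut-off, which is what the ranking by score implements.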
We evaluate our solution performing experiments separately for the two different use-cases introduced in Section 4 and detailed below.

Assessing Domain Generalization
The first use-case consists of a domain-agnostic training of the detector by exploiting the synthetic data so that the same model can handle diverse real-world scenarios while keeping good performance on all of them. To achieve this, we rely on the large amount of synthetic data available with ViPeD, captured in different urban scenarios, and under different lighting and weather conditions. We experiment with two different kinds of supervised domain-adaptation approaches. The first method, already used in [4], is a simple fine-tuning of the COCO pre-trained model on ViPeD. The second one, instead, is built upon the Balanced Gradient Contribution (BGC) framework by [29,30] and consists in mixing into the same mini-batch images from both real and virtual domains, as explained in Section 4.
First of all, we obtain a baseline for this scenario using the detector trained only on the real-world general-purpose COCO dataset. We evaluate this initial model testing it on all the remaining real-world datasets, i.e., MOT17Det, MOT19Det and CityPersons, considering only the detections belonging to the person category.
Other significant baselines are obtained by fine-tuning the COCO pre-trained model with the other real-world datasets.
Then, as in [4], we fine-tune this initial COCO pre-trained detector using the ViPeD dataset. In particular, we re-initialize the box predictor module with random weights and a different number of classes. The box predictor module is placed at the end of the detector pipeline, and it is responsible for outputting bounding box coordinates and class scores. Instead, all the other layers are initialized with the weights of the COCO pre-trained model.
All the weights are left unfrozen so that they can be adjusted by the backpropagation algorithm. With this technique, we are forcing the architecture to adjust the learned features to match those from the destination dataset. Again, we evaluate this new model testing it on all the remaining real-world datasets, i.e., MOT17Det, MOT19Det and CityPersons.
Finally, to not overfit our detector on synthetic images, we employ the mixed-batch domain-adaptation approach for injecting into the batch images coming from some real-world scenarios. As in the previous case, we initialize all the layers except the ones from the box predictor module using the weights of the COCO pre-trained model. However, this time we fine-tune the network using batches composed of 2/3 synthetic images and 1/3 real-world images. In this experiment we consider COCOPersons as the real-world dataset, since it depicts humans in highly heterogeneous scenarios, and it is not biased towards a specific application (e.g., autonomous driving).
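The batch composition can be sketched as an index generator. The 2/3 and 1/3 split comes from the text, while everything else is an assumption of this sketch: the batch size, the definition of an epoch as one pass over the synthetic data, and the sampling of real images with replacement. The function name is hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def mixed_batches(n_synth: int, n_real: int, batch_size: int = 6,
                  synth_fraction: float = 2 / 3, seed: int = 0
                  ) -> Iterator[List[Tuple[str, int]]]:
    """Yield batches of (domain, index) pairs mixing synthetic and real samples."""
    rng = random.Random(seed)
    n_s = round(batch_size * synth_fraction)  # synthetic images per batch
    n_r = batch_size - n_s                    # real images per batch
    synth_idx = list(range(n_synth))
    rng.shuffle(synth_idx)
    # Assumption: one epoch is one pass over the synthetic set.
    for start in range(0, n_synth - n_s + 1, n_s):
        batch = [("synthetic", i) for i in synth_idx[start:start + n_s]]
        # Assumption: real images are drawn uniformly with replacement.
        batch += [("real", rng.randrange(n_real)) for _ in range(n_r)]
        rng.shuffle(batch)
        yield batch
```

Each yielded list can then be turned into tensors by looking up the two datasets, so that a single backward pass receives gradients from both domains.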
As in the previous cases, we evaluate this model testing on all the real-world datasets.
Results are reported in Table 1. In the first section of the table, we report the baselines, while the second section is related to our approaches. Note that we omit results concerning a specific dataset if it has been employed during the training phase, for a fair evaluation of the overall generalization capabilities.

Discussion
Results in Table 1 show that our solution can generalize the knowledge learned from the virtual world to different real-world datasets. In most cases, our network is also able to perform better than the ones trained using only the real-world manually annotated datasets, taking advantage of the high variability and size of our ViPeD dataset. In particular, concerning the MOT17Det dataset, all our solutions trained with synthetic data perform better than the ones trained with real data. The best result is obtained using the mixed-batch approach. Considering the MOT19Det dataset, we achieve the best result fine-tuning the detector with our basic version of ViPeD. CityPersons is [...]

[...] Table 3 are related only to our previous approach in [4] since, at the time of writing, it was not possible to submit our results to the challenge concerning the MOT19Det dataset. As explained before, in this case, we do not have the ground truth for the test set. The authors will open this challenge again, and we will report the updated results as soon as they are available, before the final version of this article. However, given that the new detector shows higher performance than YOLOv3 on the MOT17Det dataset, we expect better results also on the MOT19Det dataset.

Discussion
Results in Table 3 demonstrate that our training procedure can reach competitive performance even when compared to specialized pedestrian detection approaches. In particular, concerning the MOT17Det dataset, we obtain a mAP of 0.89, the same value obtained by the top finishers.
Furthermore, even though the mixed-batch approach cannot reach the state-of-the-art results, a performance loss of only 0.02 mAP can be justified if we consider that we are using only 1/3 of the real-world target training set. We will discuss in more detail the results concerning the MOT19Det dataset as soon as the authors reopen the challenge.

Conclusions
In this work, we proposed a novel approach for training pedestrian detectors using synthetically generated data. The choice of training a network using synthetic data is motivated by the fact that a huge number of different examples is needed for the algorithm to generalize well. This huge amount of data is typically manually collected and annotated by humans, but this procedure usually takes a lot of time, and it is error-prone.
To this end, we introduced a synthetic dataset named ViPeD. This dataset contains a massive collection of images rendered from the highly photo-realistic video game GTA V developed by Rockstar North, and a full set of precise bounding-box annotations around all the visible pedestrians.
We fine-tuned Faster-RCNN, a state-of-the-art two-stage object detector, with ViPeD, and we validated this approach on different real-world publicly available pedestrian detection datasets. To address the problem of domain adaptation from the virtual world to the real one, we also exploited a mixed-batch supervised training approach.
Using Faster-RCNN, we demonstrated that the network trained with the help of synthetic data can generalize to a great extent to multiple real-world scenarios. Furthermore, our solution can be easily transferred to specific real-world environments, outperforming the same architecture trained instead only on real-world manually-labeled datasets.
Even though in this work we considered the specific task of pedestrian detection, we think that the presented procedure could be applied at a larger scale even on other related tasks, such as image classification or object segmentation.

Figure 1: (a) Pedestrians in the JTA dataset with their skeletons. (b) Examples of annotations in the ViPeD dataset; original bounding boxes are in yellow, while the sanitized ones are in light blue.

Figure 2: Histogram of distances between pedestrians and cameras.

Figure 3: Examples of images of the ViPeD dataset together with the sanitized bounding boxes.

Table 1: Evaluation of the generalization capabilities. The first section of the table reports results on the baselines, while the second is related to our approaches. ViPeD and ViPeD Aug. refer to our dataset without and with augmentation, respectively. ViPeD + Real and ViPeD Aug. + Real refer to the mixed-batch experiments with 2/3 ViPeD and 1/3 COCOPersons. Results are evaluated using the COCO mAP.

Table 2: Results on MOT17Det: comparison with the state-of-the-art. Faster R-CNN is our approach in which we fine-tune our pre-trained model with the MOT17Det dataset. Faster R-CNN MB is instead our model trained with the mixed-batch approach (2/3 of the images belonging to the ViPeD dataset and 1/3 belonging to the MOT17Det dataset). Results are evaluated using the MOT mAP.

Table 3: Results on MOT19Det: comparison with the state-of-the-art. Results are evaluated using the MOT mAP.