Co-Training for Deep Object Detection: Comparing Single-Modal and Multi-Modal Approaches

Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as we wish. This data-labeling bottleneck may be intensified due to domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with multi-modal. Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, by performing GAN-based domain translation both co-training modalities are on par, at least when using an off-the-shelf depth estimation model not specifically trained on the translated images.


Introduction
Supervised deep learning enables accurate computer vision models. Key to this success is access to raw sensor data (i.e., images) with ground truth (GT) for the visual task at hand (e.g., image classification [1], object detection [2] and recognition [3], pixel-wise instance/semantic segmentation [4,5], monocular depth estimation [6], 3D reconstruction [7], etc.). The supervised training of such computer vision models, which are based on convolutional neural networks (CNNs), is known to require very large amounts of images with GT [8]. While, until one decade ago, acquiring representative images was not easy for many computer vision applications (e.g., for onboard perception), nowadays the bottleneck has shifted to the acquisition of the GT. The reason is that this GT is mainly obtained through human labeling, whose difficulty depends on the visual task. In increasing order of labeling time, image classification requires image-level tags, object detection requires object bounding boxes (BBs), instance/semantic segmentation requires pixel-level instance/class silhouettes, and depth GT cannot be manually provided at all. Therefore, manually collecting such GT is time-consuming and does not scale as we wish. Moreover, this data-labeling bottleneck may be intensified by domain shifts among different image sensors, which could force per-sensor data labeling.
To address the curse of labeling, different meta-learning paradigms are being explored. In self-supervised learning (SfSL) the idea is to train the desired models with the help of auxiliary tasks related to the main task. For instance, solving automatically generated jigsaw puzzles helps to obtain more accurate image recognition models [9], while stereo and structure-from-motion (SfM) principles can provide self-supervision to train monocular depth estimation models [10]. In active learning (AL) [11,12], there is a human-model collaborative loop, where the model proposes data labels, known as pseudo-labels, and the human corrects them so that the model learns from the corrected labels too, thus aiming at a progressive improvement of the model accuracy. In contrast to AL, semi-supervised learning (SSL) [13,14] does not require human intervention. Instead, it assumes the availability of a small set of off-the-shelf labeled data and a large set of unlabeled data, and both datasets must be used to obtain a more accurate model than if only the labeled data were used. In SfSL, the model trained with the help of the auxiliary tasks is intended to be the final model of interest. In AL and SSL, it is possible to use any model with the sole purpose of self-labeling the data, i.e., producing the pseudo-labels, and then use labels and pseudo-labels for training the final model of interest.
In this paper we focus on co-training [15,16], a type of SSL algorithm. Co-training self-labels data through the mutual improvement of two models. These models analyze the unlabeled data according to their different views of these data. Our work focuses on onboard vision-based perception for driver assistance and autonomous driving. In this context, vehicle and pedestrian detection are key functionalities. Accordingly, we apply co-training to significantly reduce human intervention when labeling these objects (in computer vision terminology) for training the corresponding deep object detector. Therefore, the labels are BBs locating the mentioned traffic participants in the onboard images. More specifically, we consider two settings. On the one hand, as is usual in SSL, we assume the availability of a small set of human-labeled images (i.e., with BBs for the objects of interest), and a significantly larger set of unlabeled images. On the other hand, we do not assume human labeling at all, but we have a set of virtual-world images with automatically generated BBs.
This paper is the natural continuation of the work presented by Villalonga & López [17]. That work presents a co-training algorithm for deep object detection that also addresses the two above-mentioned settings. In [17], the two views of an image consist of the original RGB representation and its horizontal mirror; thus, it is a single-modal co-training based on appearance. However, a priori, the larger the difference between data views, the more accurate the pseudo-labels that can be expected from co-training. Therefore, as a major novelty of this paper, we explore the use of two image modalities in the role of co-training views. In particular, one view is the appearance (i.e., the original RGB), while the other view is the corresponding depth (D) as estimated by a state-of-the-art monocular depth estimation model [18]. Thus, we term this approach multi-modal co-training; however, it can still be considered single-sensor because it relies only on RGB images. Figure 1 illustrates these different views for images that we use in our experiments.
In this setting, we address two research questions: (Q1) Is multi-modal (RGB/D) co-training effective for providing pseudo-labeled object BBs? (Q2) How does multi-modal (RGB/D) co-training perform compared to single-modal (RGB)? After adapting the method presented in [17] to work with both the single-modal and the multi-modal data views, we ran a comprehensive set of experiments to answer these two questions. Regarding (Q1), we conclude that multi-modal co-training is indeed effective. Regarding (Q2), we conclude that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, when GAN-based virtual-to-real image translation is performed [19] (i.e., as image-level domain adaptation), both co-training modalities are on par, at least when using an off-the-shelf monocular depth estimation model not specifically trained on the translated images.
We organize the rest of the paper as follows. Section 2 reviews related works. Section 3 describes the co-training algorithm. Section 4 details our experimental setting and discusses the obtained results in terms of (Q1) and (Q2). Section 5 summarizes the presented work and suggests lines of continuation.
Figure 1. [...] [18] from the original patch. Left-middle columns are the views used for co-training in [17]. Right-middle columns are the views also used in this paper.

Related Work
As mentioned above, co-training falls in the realm of SSL. Thus, here we summarize previous related works applying SSL methods. The input to these methods consists of a labeled dataset, $\mathcal{X}^l$, and an unlabeled one, $\mathcal{X}^u$, with $\#\mathcal{X}^u \gg \#\mathcal{X}^l$ and $\mathcal{D}_{\mathcal{X}^u} = \mathcal{D}_{\mathcal{X}^l}$, where $\#\mathcal{X}$ is the cardinality of the set $\mathcal{X}$ and $\mathcal{D}_{\mathcal{X}}$ refers to the domain from which $\mathcal{X}$ has been drawn. Note that, when the latter requirement does not hold, we are under a domain shift setting. The goal of an SSL method is to use both $\mathcal{X}^l$ and $\mathcal{X}^u$ to allow the training of a predictive model, $\phi$, so that its accuracy is higher than if only $\mathcal{X}^l$ is used for its training. In other words, the goal is to leverage unlabeled data.
A classical SSL approach is the so-called self-training, introduced by Yarowsky [20] in the context of language processing. Self-training is an incremental process that starts by training $\phi$ on $\mathcal{X}^l$; then, $\phi$ runs on $\mathcal{X}^u$, and its predictions are used to form a pseudo-labeled set $\hat{\mathcal{X}}^l$, further used together with $\mathcal{X}^l$ to retrain $\phi$. This is repeated until convergence, and the accuracy of $\phi$, as well as the quality of $\hat{\mathcal{X}}^l$, are supposed to become higher as the cycles progress. Jeong et al. [21] used self-training for deep object detection (on PASCAL VOC and MS-COCO datasets). To collect $\hat{\mathcal{X}}^l$, a consistency loss is added while training $\phi$, which is a CNN for object detection in this case, together with a mechanism for removing predominant backgrounds. The consistency loss is based on the idea that $\phi(I^u) = \phi(I^{u\leftrightarrow})^{\leftrightarrow}$, where $I^u$ is an unlabeled image and "$\leftrightarrow$" refers to performing horizontal mirroring. Lokhande et al. [22] used self-training for deep image classification. In this case, the original activation functions of $\phi$, a CNN for image classification, must be changed to Hermite polynomials. Note that these two examples of self-training involve modifications either in the architecture of $\phi$ [22] or in its training framework [21]. However, we aim at using a given $\phi$ together with its training framework as a black box, so performing SSL only at the data level. In this way, we can always benefit from state-of-the-art models and training frameworks, i.e., we avoid changing the SSL approach if those change. We can also decouple the model used to produce pseudo-labels from the model that would be trained with them for deploying the application of interest. A major challenge when using self-training is to avoid drifting to erroneous pseudo-labels.
Note that, if $\hat{\mathcal{X}}^l$ is biased to some erroneous pseudo-labels, when using this set to retrain $\phi$ incrementally, a point can be reached where $\mathcal{X}^l$ cannot compensate for the errors in $\hat{\mathcal{X}}^l$, and $\phi$ may end up learning wrong data patterns and so producing more erroneous pseudo-labels. Thus, as an alternative to the self-training of Yarowsky [20], Blum and Mitchell proposed co-training [15]. Briefly, co-training is based on two models, $\phi_{v_1}$ and $\phi_{v_2}$, each one incrementally trained on different data features, termed views. In each training cycle, $\phi_{v_1}$ and $\phi_{v_2}$ collaborate to form $\hat{\mathcal{X}}^l = \hat{\mathcal{X}}^l_{v_1} \cup \hat{\mathcal{X}}^l_{v_2}$, where $\hat{\mathcal{X}}^l_{v_i}$ and $\mathcal{X}^l$ are used to retrain $\phi_{v_i}$, $i \in \{1, 2\}$. This is repeated until convergence. It is assumed that each view, $v_i$, is discriminant enough to train an accurate $\phi_{v_i}$. Different implementations of co-training may differ in the collaboration policy. Our approach follows the disagreement idea introduced by Guz et al. [16] in the context of sentence segmentation, later refined by Tur [23] to address domain shifts in the context of natural language processing. In short, only pseudo-labels of high confidence for $\phi_{v_i}$ but of low confidence for $\phi_{v_j}$, $i, j \in \{1, 2\}$, $i \neq j$, are considered as part of $\hat{\mathcal{X}}^l_{v_j}$ in each training cycle. Soon, disagreement-based SSL attracted much interest [24].
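The disagreement rule just described can be illustrated with a minimal sketch. It assumes that each model assigns a confidence in [0, 1] to the same candidate pseudo-label; the function name, the dictionary layout, and the two thresholds `hi` and `lo` are illustrative assumptions, not something taken from [16] or [23].

```python
def disagreement_select(conf_i, conf_j, hi=0.8, lo=0.5):
    """Return the ids that model i is confident about but model j is not.

    conf_i, conf_j: dicts mapping a sample id to the confidence that models
    i and j assign to the same pseudo-label (hypothetical layout).
    hi, lo: illustrative thresholds, not values from the paper.
    """
    selected = []
    for sample_id, ci in conf_i.items():
        cj = conf_j.get(sample_id, 0.0)  # a missing id means j did not detect it
        if ci >= hi and cj <= lo:
            selected.append(sample_id)
    return sorted(selected)


# Samples that model 1 would hand over to retrain model 2.
conf_1 = {"a": 0.95, "b": 0.90, "c": 0.40}
conf_2 = {"a": 0.92, "b": 0.30, "c": 0.35}
for_model_2 = disagreement_select(conf_1, conf_2)
```

Here only sample "b" is handed over: both models already agree on "a", and neither is confident about "c". Ranking-based variants of the same idea replace the fixed thresholds with top-k selections.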
In general, $\phi_{v_1}$ and $\phi_{v_2}$ can be based on different data views by either training on different data samples ($\hat{\mathcal{X}}^l_{v_1} \neq \hat{\mathcal{X}}^l_{v_2}$) or being different models (e.g., $\phi_{v_1}$ and $\phi_{v_2}$ can be based on two different CNN architectures). The disagreement-based co-training falls in the former case. In this line, Qiao et al. [25] used co-training for deep image classification, where the two different views are achieved by training on mutually adversarial samples. However, this implies linking the training of the $\phi_{v_i}$'s at the level of the loss function, while, as we have mentioned before, we want to use these models as black boxes.
The most similar work to this paper is the co-training framework that we introduced in [17], since we build on top of it. In [17], two single-modal views are considered. These consist of using $\phi_{v_1}$ to process the original images from $\mathcal{X}^u$ while using $\phi_{v_2}$ to process their horizontally mirrored counterparts, and analogously for $\mathcal{X}^l$. A disagreement-based collaboration is applied to form $\hat{\mathcal{X}}^l_{v_1}$ and $\hat{\mathcal{X}}^l_{v_2}$. Moreover, not only the setting where $\mathcal{X}^l$ is based on human labels is considered, but also the one where it is based on virtual-world data. In the latter case, a GAN-based virtual-to-real image translation [19] is used as pre-processing for the virtual-world images, i.e., before taking them for running the co-training procedure. Very recently, Díaz et al. [26] presented co-training for visual object recognition. In other words, that paper addresses a classification problem, while we address both localization and classification to perform object detection. While the different views proposed in [26] rely on self-supervision (e.g., forcing image rotations), here they rely on data multi-modality. In fact, in our previous work [17], we used mirroring to force different data views, which can be considered a kind of self-supervision too. Here, after adapting and improving the framework used in [17], we compare this previous setting with a new multi-modal single-sensor version (Algorithm 1 and Figure 2). We focus on the case where $\phi_{v_1}$ works with the original images while $\phi_{v_2}$ works with their estimated depth. Analyzing this setting is quite interesting because appearance and depth are different views of the same data.
To estimate depth, we need an off-the-shelf monocular depth estimation (MDE) model, so that we can keep the co-training single-sensor even though it is multi-modal. MDE can be based on LiDAR supervision, on stereo/SfM self-supervision, or on combinations of both; LiDAR and stereo data, as well as SfM computations, are only required at training time, not at testing time. We refer to [6] for a review of the MDE state of the art. In this paper, to isolate the multi-modal co-training performance assessment as much as possible from the MDE performance, we have chosen the top-performing supervised method proposed by Yin et al. [18].
Finally, we would like to mention that there are methods in the literature that may be confused with co-training, so it is worth introducing a clarification note. This is the case of the co-teaching proposed by Han et al. [27] and the co-teaching+ of Yu et al. [28]. These methods have been applied to deep image classification to handle noisy labels on $\mathcal{X}^l$. However, citing Han et al. [27], co-training is designed for SSL, and co-teaching is for learning with noisy (ground truth) labels (LNL); as LNL is not a special case of SSL, we cannot simply translate co-training from one problem setting to another problem setting.

Method
In this section, we explain our co-training procedure with the support of Figure 2 and Algorithm 1. To a large extent, we follow the same terminology as in [17].
Input: The specific sets of labeled ($\mathcal{X}^l_{v_1}$, $\mathcal{X}^l_{v_2}$) and unlabeled ($\mathcal{X}^u_{v_1}$, $\mathcal{X}^u_{v_2}$) input data in Algorithm 1 determine whether we are running in a single-modal or a multi-modal setting, and whether or not we are supported by virtual-world images or their virtual-to-real translated counterparts. Table 1 clarifies the different co-training settings depending on these datasets.
In Algorithm 1, view-paired sets means that each image of one set has a counterpart in the other, i.e., following Table 1, its horizontal mirror or its estimated depth. Since the co-training is agnostic to the specific object detector in use, we explicitly consider its corresponding CNN architecture, $\Phi$, and training hyper-parameters, $\mathcal{H}_{\Phi}$, as inputs. Finally, $\mathcal{H}_{ct}$ consists of the co-training hyper-parameters, which we will introduce while explaining the part of the algorithm in which each of them is required.
Output: It consists of a set of images ($\hat{\mathcal{X}}^l$) from $\mathcal{X}^u_{v_1}$, for which co-training provides pseudo-labels, i.e., object BBs in this paper. In our experiments, according to Table 1, $\mathcal{X}^u_{v_1}$ always corresponds to the unlabeled set of original real-world images. Since we consider as output a set of self-labeled images, which complement the input set of labeled images, they can later be used to train a model based on $\Phi$ or any other CNN architecture performing the same task (i.e., requiring the same type of BBs).
Initialize: First, the initial object detection models ($\phi_1$, $\phi_2$) are trained using the respective views of the labeled data ($\mathcal{X}^l_{v_1}$, $\mathcal{X}^l_{v_2}$). After their training, these models are applied to the respective views of the unlabeled data ($\mathcal{X}^u_{v_1}$, $\mathcal{X}^u_{v_2}$). Detections (i.e., object BBs) with a confidence over a threshold are considered pseudo-labels. Since we address a multi-class problem, per-class thresholds are contained in the set $\mathcal{T}$, a hyper-parameter in $\mathcal{H}_{ct}$. The temporary self-labeled sets generated by $\phi_1$ and $\phi_2$ are $\hat{\mathcal{X}}^l_{1,new}$ and $\hat{\mathcal{X}}^l_{2,new}$, respectively. At this point there is no collaboration between $\phi_1$ and $\phi_2$. In fact, while co-training loops (repeat body), the self-labeled sets resulting from the collaboration are $\hat{\mathcal{X}}^l_1$ and $\hat{\mathcal{X}}^l_2$, which are initialized as empty. In the training function, $\mathrm{Train}(\Phi, \mathcal{H}_{\Phi}, \mathcal{S}^l, \hat{\mathcal{S}}^l) : \phi$, we use BB labels (in $\mathcal{S}^l$) and BB pseudo-labels (in $\hat{\mathcal{S}}^l$) indistinctly. However, we only consider background samples from $\mathcal{S}^l$, since, as co-training progresses, $\hat{\mathcal{S}}^l$ may be instantiated with a set of self-labeled images containing false negatives (i.e., undetected objects), which could be erroneously taken as hard negatives (i.e., background quite similar to objects) when training $\phi$.
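The per-class thresholding used to turn detections into pseudo-labels can be sketched as follows. The tuple layout of a detection, the function name, and the threshold values are assumptions made for illustration; only the idea of one confidence threshold per class comes from the text.

```python
def select_pseudo_labels(detections, thresholds):
    """Keep detections whose confidence exceeds the threshold of their class.

    detections: list of (class_name, confidence, (x1, y1, x2, y2)) tuples,
                a hypothetical layout for the detector output.
    thresholds: dict mapping class_name to its confidence threshold,
                playing the role of the per-class threshold set.
    """
    return [d for d in detections if d[1] > thresholds[d[0]]]


# Illustrative threshold values only; they are not taken from the paper.
T = {"vehicle": 0.8, "pedestrian": 0.8}
dets = [
    ("vehicle", 0.93, (10, 20, 110, 90)),
    ("vehicle", 0.41, (300, 40, 350, 80)),
    ("pedestrian", 0.85, (500, 30, 530, 110)),
]
pseudo = select_pseudo_labels(dets, T)
```

In this toy example the low-confidence vehicle detection is discarded, so the image contributes two pseudo-labels.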
Collaboration: The two object detection models collaborate by exchanging pseudo-labeled images (Figure 2-right). This exchange is inspired by disagreement-based SSL [24]. Our specific approach is controlled by the co-training hyper-parameters $N$, $n$, $m$, and, when working with image sequences instead of sets of isolated images, also by $\mathcal{H}_{seq} = \{\Delta t_1, \Delta t_2\}$. This approach consists of the following three steps.
(First step) Each model selects the set of its top-$m$ most confident self-labeled images ($\hat{\mathcal{X}}^l_{1,\uparrow}$, $\hat{\mathcal{X}}^l_{2,\uparrow}$), where the confidence of an image is defined as the average over the confidences of the pseudo-labels of the image, i.e., in our case, over the object detections. Thus, $\hat{\mathcal{X}}^l_{i,\uparrow} \subseteq \hat{\mathcal{X}}^l_{i,new}$, $i \in \{1, 2\}$. However, for creating $\hat{\mathcal{X}}^l_{i,\uparrow}$, we do not consider all the self-labeled images in $\hat{\mathcal{X}}^l_{i,new}$. Instead, to minimize bias and favor speed, we only consider $N$ randomly selected images from $\hat{\mathcal{X}}^l_{i,new}$. In the case of working with image sequences, to favor variability in the pseudo-labels, the random choice is constrained to avoid using consecutive frames. This is controlled by thresholds $\Delta t_1$ and $\Delta t_2$, where $\Delta t_1$ controls the minimum frame distance between frames selected at the current co-training cycle ($k$), and $\Delta t_2$ the minimum distance between frames at the current cycle and frames selected in previous cycles ($< k$). We apply $\Delta t_1$ first, then $\Delta t_2$, and then the random selection among the frames passing these constraints.
(Second step) Model $\phi_i$ processes $\hat{\mathcal{X}}^l_{j,\uparrow}$, $i, j \in \{1, 2\}$, $i \neq j$, keeping the set of the $n$ least confident self-labeled images for it. Thus, we obtain the new sets $\hat{\mathcal{X}}^l_{1,\downarrow}$ and $\hat{\mathcal{X}}^l_{2,\downarrow}$. Considering the first and second steps together, we see that one model shares with the other those images that it has self-labeled with more confidence, and, of these, each model retains for retraining those that it self-labels with less confidence. Therefore, this step implements the actual collaboration between models $\phi_1$ and $\phi_2$.
(Third step) The self-labeled sets obtained in the previous step ($\hat{\mathcal{X}}^l_{1,\downarrow}$, $\hat{\mathcal{X}}^l_{2,\downarrow}$) are fused with those accumulated from previous co-training cycles ($\hat{\mathcal{X}}^l_1$, $\hat{\mathcal{X}}^l_2$). This is done by properly calling the function $\mathrm{Fuse}(\hat{\mathcal{S}}^l_{old}, \hat{\mathcal{S}}^l_{new}) : \hat{\mathcal{S}}^l$ for each view. The returned set of self-labeled images, $\hat{\mathcal{S}}^l$, contains $(\hat{\mathcal{S}}^l_{old} \cup \hat{\mathcal{S}}^l_{new}) \setminus (\hat{\mathcal{S}}^l_{old} \cap \hat{\mathcal{S}}^l_{new})$, and, from $\hat{\mathcal{S}}^l_{old} \cap \hat{\mathcal{S}}^l_{new}$, only those self-labeled images in $\hat{\mathcal{S}}^l_{new}$ are added to $\hat{\mathcal{S}}^l$.
Retrain and update: At this point we have new sets of self-labeled images ($\hat{\mathcal{X}}^l_1$, $\hat{\mathcal{X}}^l_2$), which, together with the corresponding input labeled sets ($\mathcal{X}^l_{v_1}$, $\mathcal{X}^l_{v_2}$), are used to retrain the models $\phi_1$ and $\phi_2$. Afterwards, these new models are used to obtain new temporary self-labeled sets ($\hat{\mathcal{X}}^l_{1,new}$, $\hat{\mathcal{X}}^l_{2,new}$) through their application to the corresponding unlabeled sets ($\mathcal{X}^u_{v_1}$, $\mathcal{X}^u_{v_2}$). Then, co-training can start a new cycle.
Stop: The function $\mathrm{Stop?}(\mathcal{H}_{stp}, \hat{\mathcal{S}}^l_{old}, \hat{\mathcal{S}}^l_{new}, k) : \mathrm{Boolean}$ determines if a new co-training cycle is executed. This is controlled by the co-training hyper-parameters $\mathcal{H}_{stp} = \{K_{min}, K_{max}, T_{\Delta mAP}, \Delta K\}$. Co-training will execute a minimum of $K_{min}$ cycles and a maximum of $K_{max}$, with $k$ being the current cycle number. The parameters $\hat{\mathcal{S}}^l_{old}$ and $\hat{\mathcal{S}}^l_{new}$ are supposed to be instantiated with the sets of self-labeled images in the previous and current co-training cycles, respectively. The similarity of these sets is monitored in each cycle, so that if it is stable for more than $\Delta K$ consecutive cycles, convergence is assumed and co-training is stopped. This constraint could already be satisfied at $k = K_{min}$ provided $K_{min} \geq \Delta K$. The metric used to compute the similarity between these self-labeled sets is mAP (mean average precision) [29], where $\hat{\mathcal{S}}^l_{old}$ plays the role of GT and $\hat{\mathcal{S}}^l_{new}$ the role of results under evaluation. Then, mAP is considered stable between two consecutive cycles if its magnitude variation is below the threshold $T_{\Delta mAP}$.
Table 1. The different configurations that we consider for Algorithm 1 in this paper, according to the input datasets. In the single-modal cases, we work only with RGB images (appearance), either from a real-world dataset ($\mathcal{R}_{RGB}$), or a virtual-world one ($\mathcal{V}_{RGB}$), or a virtual-to-real domain-adapted one ($\mathcal{V}_{G_{\mathcal{R}},RGB}$), i.e., using a GAN-based $\mathcal{V}_{RGB} \rightarrow \mathcal{R}_{RGB}$ image translation. One view of the data ($v_1$) corresponds to the original RGB images of each set, while the other view ($v_2$) corresponds to their horizontally mirrored counterparts, indicated with the symbol "$\leftrightarrow$". In the multi-modal cases, view $v_1$ is the same as for the single-modal case (RGB), while view $v_2$ corresponds to the depth (D) estimated from the RGB images by using an off-the-shelf monocular depth estimation model. Surviving column headers: Modality, Domain Shift? (table body not recovered).
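The fusion and stopping rules of the Method section can be sketched as follows. This is a simplified illustration, not the authors' code: self-labeled sets are modeled as dictionaries from image id to pseudo-labels, the mAP between consecutive self-labeled sets is assumed to be computed elsewhere and passed in as a history list, and "stable for more than $\Delta K$ cycles" is read here as at least `delta_k` consecutive stable variations. All names and example values are hypothetical.

```python
def fuse(old, new):
    """Union of two self-labeled sets; on overlap the new pseudo-labels win.

    old, new: dicts image_id -> pseudo-labels (hypothetical layout).
    """
    fused = dict(old)
    fused.update(new)
    return fused


def should_stop(map_history, k, k_min, k_max, t_delta_map, delta_k):
    """Decide whether co-training stops after cycle k (1-based).

    map_history: mAP (in [0..100]) of each cycle's self-labeled set evaluated
                 against the previous cycle's set as GT; computing mAP itself
                 is outside this sketch.
    """
    if k >= k_max:
        return True
    if k < k_min:
        return False
    # Count trailing consecutive cycles whose mAP variation is below threshold.
    stable = 0
    for prev, curr in zip(reversed(map_history[:-1]), reversed(map_history[1:])):
        if abs(curr - prev) < t_delta_map:
            stable += 1
        else:
            break
    return stable >= delta_k


merged = fuse({"a": 1, "b": 2}, {"b": 3, "c": 4})
done = should_stop([50, 70, 71, 71.5, 71.8], k=5, k_min=3, k_max=30,
                   t_delta_map=2.0, delta_k=3)
```

In the example, image "b" keeps its newest pseudo-labels after fusion, and the last three mAP variations (1.0, 0.5, 0.3) are all below the threshold, so the loop would stop at cycle 5.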

Datasets and Evaluation Protocol
We follow the experimental setup of [17]. Therefore, we use KITTI [29] and Waymo [30] as real-world datasets, here denoted as $\mathcal{K}$ and $\mathcal{W}$, respectively. We use a variant of the SYNTHIA dataset [31] as virtual-world data, here denoted as $\mathcal{V}$. For $\mathcal{K}$ we use the split of Xiang et al. [32], which reduces the correlation between training and testing data. While this implies that $\mathcal{K}$ is formed by isolated images, $\mathcal{W}$ is composed of image sequences. To align its acquisition conditions with $\mathcal{K}$, we consider daytime sequences without adverse weather. From them, as recommended in [30], we randomly select some sequences for training and the rest for testing. Furthermore, we adapt $\mathcal{W}$'s image size to match $\mathcal{K}$ (i.e., $1240 \times 375$ pixels) by first eliminating the top rows of each image, thus avoiding large sky areas, and then selecting a centered area of 1240 pixel width. The 2D BBs of $\mathcal{W}$ and $\mathcal{V}$ are obtained by projecting the available 3D BBs. On the other hand, $\mathcal{V}$ is generated by mimicking some acquisition conditions of $\mathcal{K}$, such as image resolution, non-adverse weather, daytime, and only considering isolated shots instead of image sequences. Besides, $\mathcal{V}$'s images include standard visual post-effects such as anti-aliasing, ambient occlusion, depth of field, eye adaptation, blooming, and chromatic aberration. In the following, we term as $\mathcal{K}^{tr}$ and $\mathcal{K}^{tt}$ the training and testing sets of $\mathcal{K}$, respectively. Analogously, $\mathcal{W}^{tr}$ and $\mathcal{W}^{tt}$ are the training and testing sets of $\mathcal{W}$. For each dataset, Table 2 summarizes the number of images and object BBs (vehicles and pedestrians) used for training and testing our object detectors. Note that $\mathcal{V}$ is only used for training purposes.
Table 2. Datasets ($\mathcal{X}$): train ($\mathcal{X}^{tr}$) and test ($\mathcal{X}^{tt}$) statistics, $\mathcal{X} = \mathcal{X}^{tr} \cup \mathcal{X}^{tt}$, $\mathcal{X}^{tr} \cap \mathcal{X}^{tt} = \emptyset$. Surviving column headers: Dataset ($\mathcal{X}$), $\mathcal{X}^{tr}$, $\mathcal{X}^{tt}$ (table body not recovered).
We apply the KITTI benchmark protocol for object detection [29]. Furthermore, following [17], we focus on the so-called moderate difficulty, which implies that the minimum BB height to detect objects is 25 pixels for $\mathcal{K}$ and 50 pixels for $\mathcal{W}$. Once co-training finishes, we use the labeled data ($\mathcal{X}^l$) and the data self-labeled by co-training ($\hat{\mathcal{X}}^l$) to train the final object detector, namely, $\phi_F$. Since this is the ultimate goal, we use the accuracy of such a detector as the metric to evaluate the effectiveness of the co-training procedure. If co-training performs well at self-labeling objects, the accuracy of $\phi_F$ should be close to the upper bound (i.e., when 100% of the real-world labeled data used to train $\phi_F$ is provided by humans); otherwise, the accuracy of $\phi_F$ is expected to be close to the lower bound (i.e., when using either a small percentage of human-labeled data or only virtual-world data to train $\phi_F$).
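As a small illustration of the minimum-height criterion mentioned for the moderate difficulty, the sketch below filters BBs by height. It models only the height thresholds stated in the text; other criteria of the benchmark protocol (such as occlusion and truncation) are not modeled, and the function names and box layout are assumptions.

```python
MIN_BB_HEIGHT = {"K": 25, "W": 50}  # pixels, as stated in the text


def tall_enough(box, dataset):
    """True if the BB (x1, y1, x2, y2) meets the dataset's minimum height."""
    _, y1, _, y2 = box
    return (y2 - y1) >= MIN_BB_HEIGHT[dataset]


def filter_boxes(boxes, dataset):
    """Keep only the BBs that satisfy the minimum-height criterion."""
    return [b for b in boxes if tall_enough(b, dataset)]


boxes = [(0, 0, 10, 30), (0, 0, 10, 20), (0, 0, 10, 60)]
kept_k = filter_boxes(boxes, "K")
kept_w = filter_boxes(boxes, "W")
```

With these toy boxes, the 30-pixel-high BB counts for K but not for W, which uses the stricter 50-pixel threshold.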

Implementation Details
When using virtual-world images we not only experiment with the originals but also with their GAN-based virtual-to-real translated counterparts, i.e., aiming at closing the domain shift between virtual and real worlds. Since the translated images are the same for both co-training modalities, we take them from [17], where a CycleGAN [19] was used to learn the translations $G_{\mathcal{K}} : \mathcal{V} \rightarrow \mathcal{K}$ and $G_{\mathcal{W}} : \mathcal{V} \rightarrow \mathcal{W}$. To obtain these images, CycleGAN training was done for 40 epochs using a weight of 1.0 for the identity mapping loss, and a patch-wise strategy with patches of $300 \times 300$ pixels, while keeping the rest of the parameters as recommended in [19]. We denote as $\mathcal{V}_{G_{\mathcal{K}}} = G_{\mathcal{K}}(\mathcal{V})$ and $\mathcal{V}_{G_{\mathcal{W}}} = G_{\mathcal{W}}(\mathcal{V})$ the sets of virtual-world images transformed by $G_{\mathcal{K}}$ and $G_{\mathcal{W}}$, respectively. The 2D BBs in $\mathcal{V}$ are used for $\mathcal{V}_{G_{\mathcal{K}}}$ and $\mathcal{V}_{G_{\mathcal{W}}}$. Furthermore, note that, analogously to $\mathcal{V}$, $\mathcal{V}_{G_{\mathcal{K}}}$ and $\mathcal{V}_{G_{\mathcal{W}}}$ are only used for training. For multi-modal co-training, depth estimation is applied indistinctly to the real-world datasets, the virtual-world one, and the GAN-based translated ones.
In the multi-modal setting, one of the co-training views is the appearance (RGB) and the other is the corresponding estimated depth (D). To keep co-training single-sensor, we use monocular depth estimation (MDE). In particular, we leverage a state-of-the-art MDE model publicly released by Yin et al. [18]. It has been trained on KITTI data, thus being ideal to work with $\mathcal{K}$. However, since our aim is not to obtain accurate depth estimation, but to generate an alternative data view useful to detect the objects of interest, we have used the same MDE model for all the considered datasets. Despite this, Figure 3 shows how the estimated depth properly captures the depth structure for the images of all datasets, i.e., not only for $\mathcal{K}$, but also for $\mathcal{W}$, $\mathcal{V}$, $\mathcal{V}_{G_{\mathcal{K}}}$ and $\mathcal{V}_{G_{\mathcal{W}}}$. However, we observe that the depth structure for the images of $\mathcal{V}_{G_{\mathcal{K}}}$ and $\mathcal{V}_{G_{\mathcal{W}}}$ is more blurred at far distances than for $\mathcal{V}$, especially for $\mathcal{V}_{G_{\mathcal{W}}}$.
Following [17], we use Faster R-CNN with a VGG16 feature extractor (backbone) as the CNN architecture for object detection, i.e., as $\Phi$ in Algorithm 1. In particular, we rely on the Detectron implementation [33]. For training, we always initialize VGG16 with ImageNet pre-trained weights, while the weights of the rest of the CNN (i.e., the candidates' generator and classifier stages) are randomly initialized. Faster R-CNN training is based on 40,000 iterations of the SGD optimizer. Note that these iterations refer to the function $\mathrm{Train}(\Phi, \mathcal{H}_{\Phi}, \mathcal{S}^l, \hat{\mathcal{S}}^l) : \phi$ in Algorithm 1, not to co-training cycles. Each iteration uses a mini-batch of two images randomly sampled from $\mathcal{S}^l \cup \hat{\mathcal{S}}^l$. Thus, looking at how $\mathrm{Train}(\Phi, \mathcal{H}_{\Phi}, \mathcal{S}^l, \hat{\mathcal{S}}^l) : \phi$ is called in Algorithm 1, we can see that, for each view, the parameter $\mathcal{S}^l$ receives the same input in all co-training cycles, while $\hat{\mathcal{S}}^l$ changes from cycle to cycle. The SGD learning rate starts at 0.001 and we set a decay of 0.1 at iterations 30,000 and 35,000. In the case of multi-modal co-training, we use horizontal mirroring as a data augmentation technique. However, we cannot do so in the case of single-modal co-training because both data views would be highly correlated. Note that, as was done in [17] and as we can see in Table 1, horizontal mirroring is the technique used to generate one of the data views in single-modal co-training. In terms of Algorithm 1, all these settings are part of $\mathcal{H}_{\Phi}$ and they are the same to train both $\phi_1$ and $\phi_2$. The values set for the co-training hyper-parameters are shown in Table 3.
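The learning-rate schedule described above (0.001 at the start, multiplied by 0.1 at iterations 30,000 and 35,000) can be written as a small step function. This is an illustrative sketch in plain Python, not the Detectron configuration itself; the function and parameter names are assumptions, as is the convention that the decay applies from the stated iteration onward.

```python
def learning_rate(iteration, base_lr=0.001, gamma=0.1, steps=(30000, 35000)):
    """Step schedule: multiply base_lr by gamma for each step already reached."""
    drops = sum(1 for s in steps if iteration >= s)
    return base_lr * gamma ** drops


# Learning rate at the start, after the first decay, and near the end.
lrs = [learning_rate(i) for i in (0, 30000, 39999)]
```

Under this schedule, the last 5,000 of the 40,000 iterations run at one hundredth of the initial learning rate.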
Finally, note that the final detection model used for evaluations, $\phi_F$, could be based on any CNN architecture for object detection, provided the GT it expects consists of 2D BBs. However, for the sake of simplicity, we also rely on Faster R-CNN to obtain $\phi_F$.
Table 3. Co-training hyper-parameters as defined in Algorithm 1. We use the same values for the $\mathcal{K}$ and $\mathcal{W}$ datasets, but $\mathcal{H}_{seq}$ only applies to $\mathcal{W}$. $N$, $n$, $m$, $\Delta t_1$, and $\Delta t_2$ are set in number-of-images units; $K_{min}$, $K_{max}$ and $\Delta K$ in number-of-cycles; $T_{\Delta mAP}$ ranges in $[0..100]$. The hyper-parameter $\mathcal{T}$ contains the confidence detection thresholds for vehicles and pedestrians, which range in $[0..1]$, and we have set the same value for both. The setting $m = \infty$ means that all the images self-labeled at the current co-training cycle are exchanged by the models $\phi_1$ and $\phi_2$ for collaboration, i.e., each model will then select the $n$ least confident for it.

Results
To include multi-modality we improved and adapted the code used in [17]. For this reason, we not only execute the multi-modal co-training experiments but also redo the single-modal and baseline ones. The conclusions in [17] remain, but by repeating these experiments, all the results presented in this paper are based on the same code.

Standard SSL Setting
We start the evaluation of co-training in a standard SSL setting, i.e., working only with either the $\mathcal{K}$ or the $\mathcal{W}$ dataset to avoid domain shift. Since, in this setting, the cardinality of the unlabeled dataset is supposed to be significantly higher than that of the labeled one, we divide the corresponding training sets accordingly. In particular, for $\mathcal{X}^{tr} \in \{\mathcal{K}^{tr}, \mathcal{W}^{tr}\}$, we use $p\%$ of $\mathcal{X}^{tr}$ as the labeled training set ($\mathcal{X}^l$) and the rest as the unlabeled training set ($\mathcal{X}^u$). We explore $p = 5$ and $p = 10$, where the corresponding $\mathcal{X}^{tr}$ is sampled randomly once and frozen for all the experiments. Table 4 shows the obtained results for both co-training modalities. We also report upper-bound (UB) and lower-bound (LB) results. The UB corresponds to the case $p = 100$, i.e., all the BBs are human-labeled. The LBs correspond to the $p = 5$ and $p = 10$ cases without using co-training, thus not leveraging the unlabeled data. Although in this paper we assume that $\phi_F$ will be based on RGB data alone, since we use depth estimation for multi-modal co-training, as a reference we also report the UB and LB results obtained by using the estimated depth alone to train the corresponding $\phi_F$.
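The labeled/unlabeled split described above (a random p% sampled once and then frozen) can be sketched with a seeded shuffle. The function name, the seed value, and the rounding rule are assumptions of this sketch rather than details reported in the paper.

```python
import random


def split_labeled_unlabeled(image_ids, p, seed=0):
    """Split a training set into p% labeled images and the rest unlabeled.

    A fixed seed keeps the split frozen across experiments; the seed value
    and the rounding rule are assumptions of this sketch.
    """
    ids = sorted(image_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_labeled = round(len(ids) * p / 100)
    return ids[:n_labeled], ids[n_labeled:]


labeled, unlabeled = split_labeled_unlabeled(range(1000), p=5)
```

Calling the function again with the same seed returns the same split, which is what allows every experiment to reuse identical labeled and unlabeled sets.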
Analyzing Table 4, we confirm that the UB and LBs based only on the estimated depth (D) show a reasonable accuracy, although not at the level of appearance (RGB) alone. This is required for co-training to have the chance to perform well. Aside from this, we see that both co-training modalities clearly outperform the LBs. In the $p = 5$ case, multi-modal co-training clearly outperforms single-modal in all classes (V and P) and datasets ($\mathcal{K}$ and $\mathcal{W}$). Moreover, the accuracy improvement over the LBs is significantly larger than the remaining distance to the UBs. In the $p = 10$ case, both co-training modalities perform similarly. On the other hand, for $\mathcal{K}$, the accuracy of multi-modal co-training with $p = 5$ is just about 2 points below the single-modal with $p = 10$, and less than 1 point for $\mathcal{W}$. Therefore, for 2D object detection, we recommend multi-modal co-training for a standard SSL setting with a low ratio of labeled vs. unlabeled images.
Table 4. SSL (co-training) results on vehicle (V) and pedestrian (P) detection, reporting mAP. From a training set $\mathcal{X}^{tr} \in \{\mathcal{K}^{tr}, \mathcal{W}^{tr}\}$, we preserve the labeling information for a randomly chosen $p\%$ of its images, while it is ignored for the rest. We report results for $p = 100$ (all labels are used), $p = 5$ and $p = 10$. If $\mathcal{X}^{tt} = \mathcal{K}^{tt}$, then $\mathcal{X}^{tr} = \mathcal{K}^{tr}$; analogously, when $\mathcal{X}^{tt} = \mathcal{W}^{tt}$, then $\mathcal{X}^{tr} = \mathcal{W}^{tr}$, i.e., there is no domain shift in these experiments. Co-T (RGB) and Co-T (RGB/D) stand for single-modal and multi-modal co-training, respectively. UB and LB stand for upper bound and lower bound, respectively. Bold results indicate the best performing within the block, where blocks are delimited by horizontal lines. The second best is underlined, but if the difference with the best is below 0.5 points, we use bold too. $\Delta\{\phi_{F_1}\ \mathrm{vs.}\ \phi_{F_2}\}$ stands for the mAP of $\phi_{F_1}$ minus the mAP of $\phi_{F_2}$.
Table 5 shows the LB results for a $\phi_F$ fully trained on virtual-world images (source domain); the results of training only on the real-world images (target domain), where these images are 100% human-labeled (i.e., 100% Labeled RGB in Table 4); and the combination of both, which turns out to be the UB. In the case of testing on $W^{tt}$ and having V involved in the training, we need to accommodate the different labeling styles (mainly the margin between BBs and objects) of $W^{tt}$ and V. This is only needed for a fair quantitative evaluation; thus, for performing such an evaluation, the detected BBs are resized by per-class constant factors. However, the qualitative results presented in the rest of the paper are shown directly as they come from applying the corresponding $\phi_F$, i.e., without applying any resizing. On the other hand, this resizing is not needed for $K^{tt}$ since its labeling style is similar enough to that of V.

Table 5. SSL (co-training) results on vehicle (V) and pedestrian (P) detection, under domain shift, reported as mAP. $X^{l}$ refers to the human-labeled target-domain training set; thus, if $X^{tt} = K^{tt}$, then $X^{l} = K^{tr}$, and if $X^{tt} = W^{tt}$, then $X^{l} = W^{tr}$. $\hat{X}^{l}$ consists of the same images as $X^{l}$, but self-labeled by co-training. Co-T (RGB), Co-T (RGB/D), UB, LB, $\Delta\{\phi_{F_1}$ vs. $\phi_{F_2}\}$, bold and underlined numbers are analogous to those in Table 4.
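The BB resizing used for the quantitative evaluation can be pictured with the sketch below. Only the idea of a constant factor per class comes from the text; scaling each box about its centre, the box format (x1, y1, x2, y2), and the factor values are assumptions made for illustration, and the placeholder factors are not the ones used in the experiments.

```python
def resize_box(box, factor):
    """Scale an (x1, y1, x2, y2) box about its centre by a constant factor."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) / 2.0 * factor
    half_h = (y2 - y1) / 2.0 * factor
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Placeholder per-class factors (hypothetical values, for illustration only).
CLASS_FACTORS = {"vehicle": 1.10, "pedestrian": 1.05}


def resize_detections(detections, class_factors=CLASS_FACTORS):
    """Return copies of the detections with per-class resized boxes."""
    return [
        {**d, "box": resize_box(d["box"], class_factors[d["label"]])}
        for d in detections
    ]


# Hypothetical detection: only the box changes, label and score are kept.
dets = [{"label": "vehicle", "box": (100.0, 100.0, 200.0, 150.0), "score": 0.9}]
resized = resize_detections(dets)
print(resized[0]["box"])
```

Because the resizing is applied to copies at evaluation time, the detector output itself is left untouched, which matches the statement that qualitative results are shown without any resizing.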

According to Table 5, both co-training modalities significantly outperform the LB. Again, multi-modal co-training outperforms single-modal, especially on vehicles. Comparing multi-modal co-training with the LB, we see improvements of $\sim$15 points for vehicles in K, and $\sim$25 in W. Considering the joint improvement for vehicles and pedestrians, we see $\sim$8 points for K and $\sim$15 for W, while the distances to the UB are $\sim$5 points for K and $\sim$2 for W. Therefore, for 2D object detection, we recommend multi-modal co-training for an SSL scenario where the labeled data comes from a virtual world, i.e., when no human labeling is required at all, but there is a virtual-to-real domain shift.

Table 6 is analogous to Table 5, just replacing the original virtual-world images (V) with their GAN-based virtual-to-real translated counterparts ($V_{G_K}$/$V_{G_W}$). In the case of testing on $W^{tt}$ and having $V_{G_W}$ involved in the training, we apply the BB resizing mentioned in Section 4.3.2 for the quantitative evaluation. Focusing on the V&P results, we see that both the UB and LB of Table 6 show higher accuracy than in Table 5, which is due to the reduction of the virtual-to-real domain shift achieved by using $V_{G_K}$/$V_{G_W}$. Still, co-training improves the accuracy of the LBs, almost reaching the accuracy of the UBs. For instance, in the combined V&P detection accuracy, single-modal co-training is 1.66 points behind the UB for K, and 3.59 for W. Multi-modal co-training is 2.63 points behind the UB for K, and 4.01 for W. Thus, in this case, single-modal co-training performs better than multi-modal. Therefore, for 2D object detection, we can recommend even single-modal co-training for an SSL scenario where the labeled data comes from a virtual world but a properly trained GAN can perform virtual-to-real domain adaptation. On the other hand, in the case of W, co-training from $V_{G_W}$ gives rise to worse results than using V.
We think this is due to a worse depth estimation (see Figure 3). In general, this suggests that, whenever possible, training a specific monocular depth estimator for the unlabeled real-world data may be beneficial for multi-modal co-training (recent advances on vision-based self-supervision for monocular depth estimation [10,34] can be a good starting point). For this particular case, training the virtual-to-real domain adaptation GAN simultaneously with the monocular depth estimation CNN could be an interesting idea to explore in the future (we can draw inspiration from [35,36]).

Table 6. SSL (co-training) results on vehicle (V) and pedestrian (P) detection, after GAN-based virtual-to-real image translation, reported as mAP. ASource (adapted source) refers to $V_G \in \{V_{G_K}, V_{G_W}\}$. $X^{l}$, $\hat{X}^{l}$, Source, Co-T (RGB), Co-T (RGB/D), UB, LB, $\Delta\{\phi_{F_1}$ vs. $\phi_{F_2}\}$, bold and underlined numbers are analogous to those in Table 5.

Figures 4 and 5 show the accuracy of co-training as a function of the stopping cycle in the standard SSL setting (Figure 4), as well as under domain shift (Source) and when this is reduced (ASource) by using $V_{G_K}$/$V_{G_W}$ (Figure 5). We take the self-labeled images at different co-training cycles (x-axis) as if these cycles were determined to be the stopping ones. The labeled images, together with those self-labeled by co-training up to the indicated cycle, are used to train the corresponding $\phi_F$. Then, we plot (y-axis) the accuracy (mAP) of each $\phi_F$ on the corresponding testing set, i.e., either $K^{tt}$ or $W^{tt}$. We can see that the co-training strategies improve over the LBs from early iterations and, although slightly oscillating, keep improving until stabilization is reached. No drifting to erroneous self-labeling is observed. At this point, the object samples that remain unlabeled but are required to reach the maximum accuracy are probably too different in some aspect from the labeled and self-labeled ones (e.g., they may be under too heavy an occlusion) and would never be self-labeled without additional information.
Then, combining co-training with active learning (AL) cycles could be an interesting alternative, since occasional human loops could help co-training to progress further. We also see that, when the starting point for co-training is at a lower accuracy, multi-modal co-training usually outperforms single-modal (e.g., in the 5% setting and under domain shift).

Figure 4. V&P detection accuracy of co-training approaches as a function of the stopping cycle. Co-T (RGB) and Co-T (RGB/D) refer to single-modal and multi-modal co-training, respectively. Target refers to the use of the 100% labeled training data, while Target p% L. indicates a lower percentage $p \in \{5, 10\}$ of labeled data available for training. Accordingly, p% L. + Co-T (view), view $\in$ {RGB, RGB/D}, are combinations of those. These plots complement the results shown in Table 4.

Figure 5. V&P detection accuracy of co-training approaches as a function of the stopping cycle. These plots are analogous to those in Figure 4 for the cases of using virtual-world data, i.e., both with domain shift (Source) and reducing it by the use of GANs (ASource). The Targets are the same as in Figure 4. These plots complement the results shown in Tables 5 and 6.
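The stopping-cycle analysis plotted in Figures 4 and 5 can be sketched as the loop below. The function and argument names are hypothetical, and the training and evaluation steps are passed in as callables because the actual detector training and mAP computation are outside the scope of this illustration; the toy stand-ins only make the sketch runnable.

```python
from typing import Callable, Dict, List, Sequence


def accuracy_per_stopping_cycle(
    labeled: Sequence,
    self_labeled_by_cycle: Sequence[Sequence],
    stop_cycles: Sequence[int],
    train_fn: Callable[[List], object],
    eval_fn: Callable[[object], float],
) -> Dict[int, float]:
    """Return {cycle: accuracy}, treating each listed cycle as the stopping one.

    self_labeled_by_cycle[k - 1] holds the samples self-labeled at cycle k.
    """
    results = {}
    for k in stop_cycles:
        # Labeled data plus everything self-labeled up to and including cycle k.
        pool = list(labeled)
        for cycle_samples in self_labeled_by_cycle[:k]:
            pool.extend(cycle_samples)
        model = train_fn(pool)       # train the final detector on the pool
        results[k] = eval_fn(model)  # accuracy (e.g., mAP) on the testing set
    return results


# Toy stand-ins: the "model" is the pool size and the "accuracy" echoes it.
def toy_train(pool):
    return len(pool)


def toy_eval(model):
    return float(model)


curve = accuracy_per_stopping_cycle(
    ["a", "b"], [["c"], ["d", "e"], []], [1, 2, 3], toy_train, toy_eval
)
print(curve)  # {1: 3.0, 2: 5.0, 3: 5.0}
```

A cycle that adds no new self-labeled samples leaves the pool unchanged, which is the kind of plateau expected once co-training has stabilized.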

Qualitative Results
Figures 6 and 7 present qualitative results for $\phi_F$ models trained after stopping co-training at cycles 1, 10, 20 and when it stops automatically (i.e., the stopping condition of the loop in Algorithm 1 becomes true). The shown examples correspond to the most accurate setting for each dataset; i.e., for K (Figure 6) this is co-training from $V_{G_K}$ regardless of the modality, while for W (Figure 7) this is co-training from V in the multi-modal case and from $V_{G_W}$ in the single-modal case. Note that Tables 4-6 suggest combining co-training with virtual-world data to obtain more accurate $\phi_F$ models.

Figure 6. Qualitative results of how $\phi_F$ would perform on $K^{tt}$ by stopping co-training at different cycles. We focus on co-training and object detection working from $V_{G_K}$ (ASource). There are three blocks of results vertically arranged. In each block, the top-left image shows the results when using the 100% human-labeled training data plus $V_{G_K}$ (Target + ASource), i.e., UB results. Detection results are shown as green BBs, and GT as red BBs. The top-right image of each block shows the results that we would obtain without leveraging the unlabeled data (ASource), i.e., LB results. The remaining rows of the block, from the second to the bottom one, correspond to stopping co-training at cycles 1, 10, 20, and automatically. In these rows, the images in the left column correspond to multi-modal co-training (i.e., Co-T (RGB/D)) and those in the right column to single-modal co-training (i.e., Co-T (RGB)).
In the left block of Figure 6, we show a case where both co-training modalities perform similarly on pedestrian detection, with final detections (green BBs) very close to the GT (red BBs), and clearly better than if we do not leverage the unlabeled data (top-right image of the block). We also see that the results are very similar to the case of using 100% of the human-labeled data (top-left image of the block). Moreover, even from the initial cycles of both co-training modalities the results are reasonably good, although the best is expected when co-training finishes automatically (bottom row of the block), i.e., after the minimum number of cycles is exceeded ($K_{min} = 20$ in Table 3). In the mid-block, we see that only multi-modal co-training helps to properly detect a very close and partially occluded vehicle.
In the right block, only multi-modal co-training helps to keep and improve the detection of a close pedestrian. Both co-training modalities help to keep an initially detected van, but multi-modal co-training induces a better BB adjustment. This is an interesting case. Since V only contains different types of cars but lacks a meaningful number of van samples, and K only has a very small percentage of those labeled, we have focused our study on the different types of cars. Therefore, vans are considered neither for training nor for testing, i.e., their detection or misdetection does not affect the mAP metric either positively or negatively. However, co-training is an automatic self-labeling procedure; thus, it may capture or keep these samples and then force training with them. Moreover, in this setting, the hard negatives are mined only from the virtual-world images (translated or not by a GAN) since they are fully labeled. Thus, if not enough vans are part of the virtual-world images, these objects cannot act as hard negatives, so they may be detected or misdetected depending on their resemblance to the targeted objects (here, types of cars). We think this is the case here. Thus, this is an interesting consideration for designing future co-training procedures supported by virtual-world data. Alternatively, by complementing co-training with occasional AL cycles, these special false positives could be reported by the human in the AL loop (provided we really want to treat them as false positives). On the other hand, in the same block of results, we also see a misdetection (isolated red BB), which does count in the quantitative evaluation. It corresponds to a rather occluded vehicle which is not detected even when relying on human labeling (top-left image of the block). Finally, note the large range of detection distances achieved for vehicles.
In the left block of Figure 7, we see an even larger detection range for the detected vehicles than in Figure 6. Faraway vehicles (small green BBs) are considered as false positives for the qualitative evaluation because these are not part of the $W^{tt}$ GT (since they do not have labeled 3D BBs from which the 2D BBs are obtained). Thanks to the use of virtual-world data, these vehicles are detected (second row of the block), and neither co-training modality damages their detection. Note how the UBs based on virtual-world data and human-labeled real-world data are not able to detect such vehicles (first row of the block) because human labeling did not consider these faraway vehicles, while co-training does consider them as such. Besides, multi-modal co-training enables the detection of the closer vehicle from cycle 10 on. In the next block to the right, multi-modal co-training enables the detection of a close kid from cycle 10 on, while single-modal does not detect it at the end. In addition, single-modal co-training also introduces a distant false positive. Similarly to the left block, in this block both co-training modalities keep an unlabeled vehicle that is detected thanks to the use of the virtual-world data (second row), but not detected (first row) when these data are complemented with human-labeled data (since, again, this vehicle is not even labeled). What is happening in these cases is that there is a lack of real-world human-labeled 3D BBs for distant vehicles, which is compensated by the use of virtual-world data and maintained by co-training. In the next block to the right, we see how a pedestrian is detected thanks to both co-training methods, since this was not possible by only using virtual-world data (second row). In the right block, both co-training modalities allow for vehicle and pedestrian detections similar to the UBs (first row).
Note that the vehicle partially hidden behind the pedestrian was not detected by only using virtual-world data (second row); likewise, the pedestrian was not detected when using V (second row, left) and was poorly detected when using $V_{G_W}$ (second row, right).
Finally, Figure 8 shows additional qualitative results on $K^{tt}$ and $W^{tt}$ when using multi-modal co-training, in the case of $K^{tt}$ based on $V_{G_K}$, and on V for $W^{tr}$, i.e., we show the results of the respective best models. Overall, in the case of $K^{tt}$, we see how multi-modal co-training (Co-T (RGB/D)) enables a better adjustment of the detection BBs and removes some false positives. In the case of $W^{tt}$, multi-modal co-training keeps even small vehicles that are not part of the GT but are initially detected thanks to the use of virtual-world data. It also helps to detect vehicles and pedestrians not detected by only using the virtual-world data, although further improvements are needed since some pedestrians are still difficult to detect even with co-training.

Figure 7. Qualitative results similar to those in Figure 6, but testing on $W^{tt}$, co-training from V in the multi-modal case (left column of each block), and from $V_{G_W}$ in the single-modal case (right column of each block). Since, in these examples, the two co-training modalities are based on different (labeled) data, the first row of each block shows the respective UB results, i.e., those based on training with $W^{tr}$ and either with V (left image: Target + Source) or $V_{G_W}$ (right image: Target + ASource). The second row of each block shows the respective results we would obtain without leveraging the unlabeled data, i.e., the LBs based on training with V (left image: Source) or $V_{G_W}$ (right image: ASource). As in Figure 6, the rest of the rows of each block correspond to stopping co-training at cycles 1, 10, 20, and automatically.

Answering (Q1) and (Q2)
After presenting our multi-modal co-training and the extensive set of experiments carried out, we can answer the research questions driving this study. In particular, we base our answers on the quantitative results presented in Tables 4-6, the plots shown in Figures 4 and 5, and the qualitative examples shown in Figures 6-8, together with the associated comments we have drawn from them.
(Q1) Is multi-modal (RGB/D) co-training effective on the task of providing pseudo-labeled object BBs? Indeed, multi-modal co-training is effective for self-labeling object BBs under different settings, namely, for standard SSL (no domain shift, a few human-labeled data) and when using virtual-world data (many virtual-world labeled data, but no human-labeled data), both under domain shift and after reducing it by GAN-based virtual-to-real image translation. The achieved improvement over the lower-bound configurations is significant, allowing it to be almost on par with the upper-bound configurations. In the standard SSL setting, by labeling only 5% of the training dataset, multi-modal co-training obtains accuracy values relatively close to the upper bounds. When using virtual-world data, i.e., without human labeling at all, the same observations hold. Moreover, multi-modal co-training and GAN-based virtual-to-real image translation have been shown to complement each other.

Figure 8. Qualitative results on $K^{tt}$ (top block of rows) and $W^{tt}$ (bottom block of rows). In each block, we show (top row) GT as red BBs, (mid row) detections, as green BBs, when training with $X^{l}$, (bottom row) detections when training with $X^{l} \cup \hat{X}^{l}$. In this case, $\hat{X}^{l}$ comes from applying Co-T (RGB/D) on either $K^{tr}$ or $W^{tr}$, and $X^{l}$ is $V_{G_K}$ for $K^{tr}$, while it is V for $W^{tr}$.
(Q2) How does multi-modal (RGB/D) co-training perform compared to single-modal (RGB)? We conclude that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, when GAN-based virtual-to-real image translation is performed, both co-training modalities are on par, at least when using an off-the-shelf monocular depth estimation model not specifically trained on the translated images.
To drive future research, we have performed additional experiments. These consist of correcting the pseudo-labels obtained by multi-modal co-training in three different ways, namely, removing false positives (FP), adjusting the BBs to those of the GT (BB) for correctly self-labeled objects (true positives), and a combination of both (FP + BB). After changing the pseudo-labels in that way, we train the corresponding $\phi_F$ models and evaluate them. Table 7 presents the quantitative results. Focusing on the standard SSL setting (5%, 10%), we see that the main problem for vehicles in K is BB adjustment, while for pedestrians it is the introduction of FPs. In the latter case, false negatives (FN; i.e., missing self-labeled objects) also seem to be an issue for reaching upper-bound accuracy. When we have the support of virtual-world data, FNs do not seem to be a problem, and addressing BB correction for vehicles and removing FPs for pedestrians would allow reaching the upper bounds. In the case of W, we come to the same conclusion for vehicles, i.e., the main problem is BB adjustment, while in the case of pedestrians the main problem is not that clear. In other words, there is more balance between FP and BB. On the other hand, regarding these additional experiments, we trust more the conclusions derived from K. The reason is that, as we have seen in Figures 7 and 8, co-training was correctly self-labeling objects that are not part of the GT, so in this study these are either considered FPs and so wrongly removed (FP, FP + BB settings), or would not have a GT BB to which to adjust them (BB, FP + BB settings).

Table 7. Digging into the results through three post-processing settings applied to co-training pseudo-labels: (FP) where we remove the false-positive pseudo-labels; (BB) where we replace the pseudo-labels by the corresponding GT (i.e., in terms of Figures 6-8, green BBs are replaced by red ones); (FP + BB) which combines both. This table follows the terminology of Tables 4-6. $\Delta_X$, $X \in \{$FP, BB, FP + BB$\}$, stands for the difference of setting X minus the respective original (i.e., using the co-training pseudo-labels). Moreover, for each block of results, we add the #FP/FP% row, where #FP refers to the total number of false positives that are used to train the final object detector, $\phi_F$, while FP% indicates what percentage they represent with respect to the whole set (labeled and self-labeled BBs) used to train $\phi_F$.
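The three post-processing settings of Table 7 and the #FP/FP% row can be sketched as below. This is an illustrative reading, not the code behind the table: the greedy one-to-one matching, the single IoU threshold of 0.5, and all function names are assumptions, and a real evaluation may use class-specific thresholds.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_to_gt(pseudo, gt, thr=0.5):
    """Greedy one-to-one matching; entry i is the GT index for pseudo[i] or None."""
    used, matches = set(), []
    for p in pseudo:
        best_j, best_iou = None, thr
        for j, g in enumerate(gt):
            overlap = iou(p, g)
            if j not in used and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            used.add(best_j)
        matches.append(best_j)
    return matches


def correct_pseudo_labels(pseudo, gt, mode, thr=0.5):
    """Apply the FP, BB, or FP+BB setting to one image's pseudo-labels."""
    if mode not in ("FP", "BB", "FP+BB"):
        raise ValueError("mode must be 'FP', 'BB' or 'FP+BB'")
    corrected = []
    for p, j in zip(pseudo, match_to_gt(pseudo, gt, thr)):
        if j is None:
            if mode == "BB":          # BB keeps false positives untouched
                corrected.append(p)
        else:                         # true positive: optionally snap to the GT
            corrected.append(gt[j] if mode in ("BB", "FP+BB") else p)
    return corrected


def fp_stats(pseudo, gt, n_labeled_bbs, thr=0.5):
    """Return (#FP, FP%) relative to labeled plus self-labeled BBs."""
    n_fp = sum(m is None for m in match_to_gt(pseudo, gt, thr))
    total = n_labeled_bbs + len(pseudo)
    pct = 100.0 * n_fp / total if total else 0.0
    return n_fp, pct


# Hypothetical boxes: the first pseudo-label overlaps a GT box, the second does not.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pseudo = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(correct_pseudo_labels(pseudo, gt, "FP"))     # [(1, 1, 11, 11)]
print(correct_pseudo_labels(pseudo, gt, "BB"))     # [(0, 0, 10, 10), (50, 50, 60, 60)]
print(correct_pseudo_labels(pseudo, gt, "FP+BB"))  # [(0, 0, 10, 10)]
print(fp_stats(pseudo, gt, n_labeled_bbs=8))       # (1, 10.0)
```

Note that a correctly self-labeled object missing from the GT is unmatched in this scheme and is therefore treated as a false positive, which is the limitation discussed above for W.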

After this analysis, we think we can explore two main future lines of research. First, to improve BB adjustment, we could complement multi-modal co-training with instance segmentation, where using Mask R-CNN [37] would be a natural choice. Note that virtual-world data can also have instance segmentation as part of their GT suite. Second, to remove FPs, we could add an AL loop where humans could remove even several FPs with a few clicks (note that this is much easier than delineating object BBs). On the other hand, additional CNN models could be explored to avoid FPs as a post-processing step to multi-modal co-training. Besides these ideas, we think that, whenever possible, the monocular depth estimation model should be trained on the target-domain data, rather than trying to use an off-the-shelf model. Since we think that not doing so was damaging the combination of multi-modal co-training and GAN-based virtual-to-real image translation, an interesting approach would be to perform both tasks simultaneously.

Conclusions
In this paper, we have addressed the curse of data labeling for onboard deep object detection. In particular, following the SSL paradigm, we have proposed multi-modal co-training for object detection. This co-training relies on a data view based on appearance (RGB) and another based on estimated depth (D), the latter obtained by applying monocular depth estimation, thus keeping co-training as a single-sensor method. We have performed an exhaustive set of experiments covering the standard SSL setting (no domain shift, a few human-labeled data) as well as the settings based on virtual-world data (many virtual-world labeled data, no human-labeled data), both with domain shift and without (using GAN-based virtual-to-real image translation). In these settings, we have compared multi-modal co-training and appearance-based single-modal co-training. We have shown that multi-modal co-training is effective in all settings. In the standard SSL setting, from 5% of human-labeled training data, co-training can already lead to a final object detection accuracy relatively close to the upper bounds (i.e., with 100% human labeling). The same observation holds when using virtual-world data, i.e., without human labeling at all. Multi-modal co-training outperforms single-modal in standard SSL and under domain shift, while both co-training modalities are on par when GAN-based virtual-to-real image translation is performed, at least when using an off-the-shelf depth estimation model not specifically trained on the translated images. Moreover, multi-modal co-training and GAN-based virtual-to-real image translation have proved to be complementary. For the future, we plan several lines of work, namely, improving the adjustment of object BBs by using instance segmentation upon detection and removing false-positive pseudo-labels by using a post-processing AL cycle.
Moreover, we believe that the monocular depth estimation model should be trained on target-domain data whenever possible. When GAN-based image translation is required, we could jointly train the monocular depth estimation model and the GAN on the target domain. Besides, we would like to extend the co-training experiments to other classes of interest for onboard perception (traffic signs, motorbikes, bikes, etc.), as well as to adapt the method to tackle other tasks such as pixel-wise semantic segmentation.