Unified Object Detector for Different Modalities based on Vision Transformers

Traditional systems typically require different models for processing different modalities, such as one model for RGB images and another for depth images. Recent research has demonstrated that a single model for one modality can be adapted for another using cross-modality transfer learning. In this paper, we extend this approach by combining cross/inter-modality transfer learning with a vision transformer to develop a unified detector that achieves superior performance across diverse modalities. Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors in varying lighting conditions. Importantly, the system requires no model architecture or weight updates to enable this smooth transition. Specifically, the system uses the depth sensor in low-light conditions (nighttime), and either both the RGB camera and depth sensor or the RGB camera only in well-lit environments. We evaluate our unified model on the SUN RGB-D dataset, and demonstrate that it achieves similar or better performance in terms of mAP50 compared to state-of-the-art methods in the SUNRGBD16 category, and comparable performance in point cloud only mode. We also introduce a novel inter-modality mixing method that enables our model to achieve significantly better results than previous methods. We provide our code, including training/inference logs and model checkpoints, to facilitate reproducibility and further research. \url{https://github.com/liketheflower/UODDM}


Introduction
Advances in computer vision and artificial intelligence have enabled the development of increasingly sophisticated robotic applications that enhance human lives. Autonomous vehicles, for instance, can transport individuals to their destination without the need for a human driver/operator, while autonomous mobile robots operating in warehouses can assist in order preparation. However, many robotic systems rely on multiple sensors, such as cameras and 3D sensors (LiDAR or depth), and not all sensors are equally effective in all scenarios. For instance, camera sensors may perform poorly in low-light conditions without supplementary lighting. Thus, the ability to operate in low-light conditions can significantly reduce electricity usage and promote environmentally friendly robot design.
[Figure caption: Model A exclusively processes RGB images, with the visualization generated solely from the RGB-trained model presented in this study. Model B operates on pseudo images converted from point clouds, and the visualization is derived from the simCrossTrans [1] approach, which trains on these images. Model C is capable of processing RGB images, pseudo images converted from point clouds, or a combination of both. The visualization is based on UODDM with a Swin-T [2] backbone network.]
The high accuracy achieved by camera-based vision systems in 2D detection owes much to the efficacy of ConvNets [3] and Transformer [4] based feature extractors. Concurrently, simCrossTrans [1] proposes that by converting 3D sensor data into pseudo images and applying cross-modality transfer learning, a 2D object detection system using identical networks as those used for RGB images can produce commendable results. This development prompts a natural question: can we further enhance performance by training a unified network with both RGB and 3D data, adopting an identical architecture and weights throughout? The proposed unified network accepts three types of sensor data, namely, 1) RGB images, 2) pseudo images converted from 3D sensors, and 3) both RGB images and pseudo images converted from 3D sensors. If a unified network can match or exceed the detection performance of separate networks, each optimized for a particular modality, it would make feasible the use of an eco-friendly system operating under natural lighting conditions during the day and without any extra lighting at night. This article examines the potential of such a system. Building on simCrossTrans [1], which demonstrates the superior performance of a Vision Transformer-based network over ConvNets-based networks, our study concentrates exclusively on the Vision Transformer network. In summary, this article aims to address the following research questions:
1. Can a unified model achieve comparable or superior performance in processing both RGB images and pseudo images converted from point clouds?
2. If a unified model that processes both RGB and pseudo images is feasible, can the RGB and pseudo images be further fused to enhance the model's ability to process both RGB and point cloud data?
We conducted experiments that resulted in insightful observations and achieved state-of-the-art performance in 2D object detection. Our proposed unified model, named the Unified Object Detector for Different Modalities (UODDM), is capable of processing various types of images, including RGB images, pseudo images converted from point clouds, and inter-modality mixing of RGB images and pseudo images converted from point clouds. Figure 1 illustrates the differences between our model and other works. Furthermore, the performance comparison of different methods can be found in Table 3. Visualizations of UODDM outputs are presented in Figure 2.
The key contributions of our work can be summarized as follows:
1. We propose two inter-modality mixing methods that combine data from different modalities into a single input for our unified model.
2. We propose a unified model that can process any of the following images: RGB images, pseudo images converted from point clouds, or inter-modality mixing of RGB images and pseudo images converted from point clouds. This unified model achieves performance similar to the RGB-only model and the point-cloud-only model. Moreover, by using the inter-modality mixing data as input, our model achieves a significantly better 2D detection performance.
3. We open-source our code, training/testing logs and model checkpoints.

Related Work
Projecting 3D sensor data to 2D Pseudo Images: There are different ways to project 3D data to 2D features. HHA was proposed in [5], where the depth image is encoded with three channels: Horizontal disparity, Height above ground, and the Angle of each pixel's local surface normal with the gravity direction. The signed angle feature described in [6] measures the elevation of the vector formed by two consecutive points and indicates the convexity or concavity of three consecutive points. Input features converted from depth images, namely normalized depth (D), normalized relative height (H), angle with up-axis (A), signed angle (S), and missing mask (M), were used in [7]. DHS images are used in [8,9].
Object Detection Based on RGB images or Pseudo images from point cloud by Vision Transformers: Object detection approaches can be summarized as two-stage frameworks (proposal and detection stages) and one-stage frameworks (proposal and detection in parallel). Generally speaking, two-stage methods such as R-CNN [10], Fast RCNN [11], Faster RCNN [12], FPN [13] and mask R-CNN [14] can achieve a better detection performance, while one-stage systems such as YOLO [15], YOLO9000 [16] and RetinaNet [17] are faster at the cost of reduced accuracy. For deep learning based systems, as the size of the network is increased, larger datasets are required. Labeled datasets such as the PASCAL VOC dataset [18] and COCO (Common Objects in Context) [19] have played important roles in the continuous improvement of 2D detection systems. Most systems introduced here are based on ConvNets. A review of 2D detection systems can be found in [20]. Replacing the ConvNet backbone with a Vision Transformer yields Vision Transformer based object detection systems. The most successful of these are Swin-transformer [2] and Swin-transformer v2 [21]. simCrossTrans [1] explored cross-modality transfer learning with both ConvNets and Vision Transformers on the SUN RGB-D dataset, using the mask R-CNN [14] approach.
Inter modality mixing: [22] learns a dynamic and local linear interpolation between the different regions of cross-modality images in a data-dependent fashion to mix up the RGB and infrared (IR) images. We explored both static and dynamic mixing methods and found that the static one performs better. [23] uses an interpolation between the RGB and thermal images at the pixel level. As we are training a unified model that supports both single-modality and multiple-modality images as input, we do not apply interpolation, in order to keep the original signal of each modality. We leverage the transformer architecture itself to automatically bridge the gap between different modalities.
Multimodal data fusion: Multimodal data fusion can be performed using three different approaches: early fusion, late fusion, and deep fusion. Early fusion combines various modalities of data in a lower-dimensional common space, and a feature extractor is then employed to extract relevant information. Early fusion has been applied to object detection and audio-visual processing, as demonstrated in [24] and [25], respectively. Late fusion, on the other hand, employs independent feature extractors for different data sources and merges the extracted features in the final stage. Classical works on late fusion for action recognition, gesture segmentation and recognition, and emotion recognition are demonstrated in [26], [27], and [28], respectively. Deep fusion is characterized by fusing data at various stages of model training, transforming the input data into a higher-level representation through multiple layers, and allowing for the fusion of diverse modalities into a single shared representation layer. Various works such as [29,30,31,32,33,34] have applied deep fusion to object detection. The study in [35] explores all three fusion methods for indoor semantic segmentation. In our research, we have chosen to adopt the early fusion approach for multimodal data processing.
Transfer learning with same modality or cross modality: Transfer learning is widely used in computer vision (CV), natural language processing (NLP) and biochemistry. Most transfer learning systems are based on the same modality (e.g. RGB images in CV and text in NLP). In CV, transfer learning is commonly supervised: works such as R-CNN [10], Fast RCNN [11], Faster RCNN [12], FPN [13], mask R-CNN [14], YOLO [15], YOLO9000 [16] and RetinaNet [17] use a backbone network pretrained on the ImageNet classification task, and the model is then further trained on downstream task datasets such as COCO to perform object detection and/or instance segmentation.
In NLP, transfer learning approaches such as BERT [36], GPT [37], GPT-2 [38] and GPT-3 [39] are mainly self-supervised and have achieved great success. Inspired by this success, the CV community is also exploring self-supervised transfer learning; one recent work similar to BERT in NLP is MAE [40]. MolGNN [41,42] in bioinformatics uses self-supervised pretraining based on a Graph Neural Network (GNN) and achieves good performance in a few-shot learning framework for the downstream subtasks. In this work, we explore cross-modality transfer learning from a pretrained model under the supervised learning approach. Recently, Frustum-Voxnet [8] used pretrained weights from RGB images to fine-tune a ConvNets [3] based model on pseudo images converted from point clouds. simCrossTrans [1] further explored cross-modality transfer learning using both ConvNets [3] and Vision Transformers [43,2] and showed significant improvement.

Methodology
In this section, we will describe our approach for converting structured point clouds to pseudo images, the methods we use for mixing various modalities, as well as our detection frameworks.
[Figure caption fragment: The corresponding RGB one can be found in Figure 1.]
In order to use pretrained models based on RGB images, we convert point clouds to pseudo 2D images with 3 channels. The point clouds can be converted to HHA or any three channels from DHASM introduced in [44].

Convert point clouds to pseudo 2D image
For this work, we follow the same approaches as Frustum VoxNet [8] and simCrossTrans [1] by using DHS to project 3D depth data to 2D images [44]. Here is a summary of the DHS encoding method. Similar to [5,44], we adopt Depth from the sensor and Height along the sensor-up (vertical) direction as two reliable measures. The signed angle was introduced in [45]. Let us denote by $X_{i,k} = [x_{i,k}, y_{i,k}, z_{i,k}]$ the vector of 3D coordinates of the $k$-th point in the $i$-th scanline. Knowledge of the vertical direction (axis $z$) is provided by many laser scanners, or can even be computed from the data in indoor or outdoor scenarios (based on line/plane detection or segmentation results from machine learning models), and is thus assumed known. Define $D_{i,k} = X_{i,k+1} - X_{i,k}$ (the difference of two successive measurements in a given scanline $i$), and $A_{i,k}$ as the angle of the vector $D_{i,k}$ with the pre-determined $z$ axis (0 to 180 degrees). The signed angle is $S_{i,k} = \mathrm{sgn}(D_{i,k} \cdot D_{i,k-1}) \, A_{i,k}$, that is, the sign of the dot product between the vectors $D_{i,k}$ and $D_{i,k-1}$, multiplied by $A_{i,k}$. This sign is positive when the two vectors have the same orientation and negative otherwise. Each channel of the resulting three-channel pseudo image is normalized to the range 0 to 1. Sample DHS images can be seen in Figures 3 and 4.
In order to expand the input options for our unified model, we introduce an inter-modality mixing approach that enables us to combine images from different modalities into a three-channel image for consumption by the model. This approach allows us to enhance the model's capabilities without modifying its architecture. By training a model using RGB images, DHS images, and the mixed RGB and DHS images, we can achieve a unified detector that is capable of processing different modalities as input.
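The signed-angle channel of the DHS encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the released repository: the function names, the (rows, cols, 3) array layout with one scanline per row, the treatment of a zero dot product as positive, and the zero fill for positions without a successor are all assumptions made for the example.

```python
import numpy as np

def signed_angle(points, up=(0.0, 0.0, 1.0)):
    """Signed-angle channel for a structured point cloud.

    points: (rows, cols, 3) array; each row is treated as one scanline i and
    each column as the point index k (an assumed layout).
    Returns a (rows, cols) array of angles in degrees, in [-180, 180].
    """
    up = np.asarray(up, dtype=float)
    # Difference of successive measurements: D[i, k] = X[i, k+1] - X[i, k]
    d = points[:, 1:, :] - points[:, :-1, :]
    norm = np.linalg.norm(d, axis=-1)
    safe = np.where(norm > 0, norm, 1.0)
    # Angle of D[i, k] with the vertical axis, in [0, 180] degrees
    cos = np.clip((d @ up) / safe, -1.0, 1.0)
    angle = np.where(norm > 0, np.degrees(np.arccos(cos)), 0.0)
    # Sign of D[i, k] . D[i, k-1]; the first difference of each scanline has
    # no predecessor and is treated as positive (assumption).
    dot = np.sum(d[:, 1:, :] * d[:, :-1, :], axis=-1)
    sign = np.ones_like(angle)
    sign[:, 1:] = np.where(dot >= 0, 1.0, -1.0)
    out = np.zeros(points.shape[:2])
    out[:, :-1] = sign * angle  # last column has no successor, left at 0
    return out

def normalize01(channel):
    """Rescale one channel to the range 0 to 1."""
    lo, hi = channel.min(), channel.max()
    return (channel - lo) / (hi - lo) if hi > lo else np.zeros_like(channel)
```

A DHS pseudo image would then stack the normalized depth, height and signed-angle channels, for example with `np.stack([normalize01(depth), normalize01(height), normalize01(signed_angle(points))], axis=-1)`, where `depth` and `height` are hypothetical per-pixel arrays computed elsewhere.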

Inter modality mixing
Various techniques can be employed to fuse images from different modalities, and we propose two approaches:
- Per Patch Mixing (PPM): divide the whole image into patches of equal size, and randomly or alternately select one image source for each patch.
- Stochastic Flood Fill Mixing (SFFM): mix the images from different modalities in a stochastic way.
The Per Patch Mixing approach is relatively simple to implement. Specifically, for each patch in the image, we alternately select a modality image to assign to that patch. Moreover, we opted to use square patches in our implementation. As a result, the mask for selecting the modality image for each patch resembles a chessboard pattern, leading us to refer to our implementation as Chessboard Per Patch Mixing (CPPM). Examples of CPPM are shown in the middle of Figure 4. The Stochastic Flood Fill Mixing technique is an adaptation of the flood fill algorithm [46]. The approach involves establishing connections between neighboring pixels with a probability p, with separate probabilities for the RGB and DHS modalities. The algorithm can be implemented using four or eight neighbors to build the graph, with the latter including additional diagonal offsets. In our experiments, we used the four-neighbor approach. The Python-style pseudocode for this algorithm is illustrated in Figure 5.
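To make the two mixing schemes concrete, the following is a hedged sketch of how the selection masks could be built. It is not the pseudocode of Figure 5. The function names are hypothetical, and the SFFM rule used here (a neighbor keeps the current pixel's modality with that modality's connection probability and switches otherwise, traversed breadth-first from the top-left pixel) is one possible reading of the description above.

```python
import random
from collections import deque

import numpy as np

def cppm_mask(height, width, patch=1):
    """Chessboard mask over square patches: True selects RGB, False selects DHS."""
    rows = np.arange(height)[:, None] // patch
    cols = np.arange(width)[None, :] // patch
    return (rows + cols) % 2 == 0

def sffm_mask(height, width, p_rgb, p_dhs, rng=None):
    """Stochastic flood fill over a 4-neighbor grid (assumed interpretation)."""
    rng = rng or random.Random()
    mask = np.zeros((height, width), dtype=bool)
    seen = np.zeros((height, width), dtype=bool)
    mask[0, 0] = rng.random() < 0.5  # first pixel: either modality, equal odds
    seen[0, 0] = True
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        keep = p_rgb if mask[r, c] else p_dhs
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and not seen[nr, nc]:
                # With probability `keep` the neighbor joins the current
                # pixel's modality; otherwise it takes the other one.
                same = rng.random() < keep
                mask[nr, nc] = mask[r, c] if same else not mask[r, c]
                seen[nr, nc] = True
                queue.append((nr, nc))
    return mask

def mix(rgb, dhs, mask):
    """Take each pixel from rgb where mask is True, otherwise from dhs."""
    return np.where(mask[..., None], rgb, dhs)
```

Because the mixed image keeps three channels and copies pixels unchanged from one source or the other, it can be fed to the same network as a plain RGB or DHS image. Under this reading, connection probabilities close to 1 produce large single-modality regions, while probabilities close to 0 approach the chessboard pattern.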

2D detection framework
For the purpose of 2D detection and instance segmentation, we adopt the conventional object detection framework, namely Mask R-CNN [14] which is implemented in MMDetection [47]. It follows a two-stage approach [20], namely region proposal and detection/segmentation, for accomplishing detection and segmentation tasks. During the fine-tuning of the model on SUN RGB-D dataset, we disable the training of the mask branch. However, even with the default weights from the pre-trained model, the mask prediction branch can still generate acceptable mask predictions, as demonstrated in Figure 1. This observation aligns with the findings of the simCrossTrans [1] research.

2D detection backbone networks
For the backbone network, we use Swin Transformers [2], specifically we

SUN RGB-D dataset used in this work
The SUN RGB-D [48] dataset is an indoor dataset that provides both point clouds and RGB images. In this work, since we are building a unified detector for both modalities, we use both the RGB images and the point clouds (converted to DHS pseudo images) for fine-tuning. The point clouds are collected with 4 types of sensors: Intel RealSense, Asus Xtion, Kinect v1 and Kinect v2. The first three sensors use an IR light pattern. The Kinect v2 is based on time-of-flight. The longest distance captured by the sensors is around 3.5 to 4.5 meters. The SUN RGB-D dataset splits the data into a training set, which contains 5285 images, and a testing set, which contains 5050 images. The training set is further split into a training-only set, which contains 2666 images, and a validation set, which contains 2619 images. Similar to [49,50,8,9], we fine-tune our model on the training-only set and evaluate our system on the validation set.

Pre-training
Both the Swin-T and Swin-S based networks are first pre-trained on ImageNet [51] and then pre-trained on the COCO dataset [19].
Data augmentation: When pre-training on the COCO dataset, image augmentations are applied during the training stage by: randomly flipping the image horizontally with a probability of 0.5; randomly resizing the image to a width of 1333 and a height chosen from several values between 480 and 800 (for details, see the configuration file in the GitHub repository); randomly cropping the original image to a size of 384 (height) by 600 (width) and resizing the cropped image to a width of 1333 and a height chosen from several values between 480 and 800.

Fine-tuning
Data augmentation: We follow the same augmentation as in the pre-training stage. The raw input images have a width of 730 and a height of 530. These raw images are randomly resized and cropped during training. During testing, the images are resized to a width of 1120 and a height of 800, both of which are divisible by 32.
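The test-time size can be reproduced with a short calculation. The sketch below assumes an aspect-preserving resize bounded by 1333 on the long side and 800 on the short side, followed by padding each side up to a multiple of 32. The text does not state that this is how the 1120 by 800 size arises, so the procedure and both helper functions are illustrative assumptions.

```python
import math

def rescale_keep_ratio(width, height, max_long=1333, max_short=800):
    """Largest aspect-preserving size that fits within the given bounds."""
    scale = min(max_long / max(width, height), max_short / min(width, height))
    return int(width * scale + 0.5), int(height * scale + 0.5)

def pad_to_multiple(size, divisor=32):
    """Round a side length up to the next multiple of `divisor`."""
    return math.ceil(size / divisor) * divisor

w, h = rescale_keep_ratio(730, 530)
padded = (pad_to_multiple(w), pad_to_multiple(h))
print((w, h), padded)  # (1102, 800) (1120, 800)
```

Under these assumptions the 730 by 530 raw image is scaled by 800/530 to 1102 by 800, and the width is then padded to 1120 = 35 x 32, while 800 = 25 x 32 needs no padding.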
Hardware: For fine-tuning, we use a standard single NVIDIA Titan-X GPU, which has 12 GB of memory. We fine-tune the network for 133K iterations (100 epochs). For the RGB-only model, this took about 29 hours for the Swin-T based network with a batch size of 2. For UODDM without inter-modality mixing, it took about 2 days to train the model. With inter-modality mixing, the training time depends on the number of inter-modality mixing images added to the training data.
Fine-tuning subtasks: We focus on 2D object detection performance, so we fine-tune the model on the 2D detection related labels. Similar to simCrossTrans [1], we keep the mask branch untrained to further verify whether reasonable mask predictions can be produced using the weights from the pre-training stage.
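The iteration count follows from the split size and batch size reported in this paper. A small arithmetic check, using only figures stated in the text (2666 training-only images, batch size 2, 100 epochs, about 29 hours for the RGB-only model); the variable names are chosen for this example:

```python
train_images = 2666  # training-only split of SUN RGB-D
batch_size = 2
epochs = 100
hours = 29           # reported wall-clock time for the RGB-only Swin-T model

iters_per_epoch = train_images // batch_size
total_iters = iters_per_epoch * epochs
seconds_per_iter = hours * 3600 / total_iters

print(iters_per_epoch, total_iters)   # 1333 133300
print(round(seconds_per_iter, 2))     # 0.78
```

This gives 1333 iterations per epoch and 133,300 iterations in total, matching the quoted 133K, at roughly 0.78 seconds per training iteration.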

Experiments
The primary focus of our experiments centers around the training of the model using diverse input data and the comparison of performance differences. Specifically, we first trained a unified model on both RGB and DHS images for the UODDM without inter-modality mixing. In contrast, for the UODDM with inter-modality mixing, we augmented the training data with inter-modality mixing images, in addition to the RGB and DHS images.

Evaluation metrics
Following the previous works mentioned in Table 3, we first use AP50 (Average Precision at IoU = 0.5) as the evaluation metric. We also use two COCO object detection metrics to evaluate the 2D detection performance: AP75 (Average Precision at IoU = 0.75) and the stricter AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (IoU = .50:.05:.95).
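As a reminder of what the IoU thresholds mean, the sketch below computes the IoU of two axis-aligned boxes and lists which of the COCO thresholds a detection with that overlap would satisfy. It illustrates only the matching criterion, not the full AP computation used by the evaluation toolkit; the function and variable names are chosen for this example.

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU = .50:.05:.95, the ten thresholds averaged by the COCO AP metric
COCO_THRESHOLDS = np.linspace(0.5, 0.95, 10)

prediction = (0.0, 0.0, 10.0, 10.0)
ground_truth = (2.0, 0.0, 12.0, 10.0)
iou = box_iou(prediction, ground_truth)
passed = [round(float(t), 2) for t in COCO_THRESHOLDS if iou >= t]
print(round(iou, 3), passed)  # 0.667 [0.5, 0.55, 0.6, 0.65]
```

In this example the detection counts as a true positive for AP50 but not for AP75, which is why AP75 and the averaged AP are stricter than AP50.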

Evaluation subgroups
We use the same subgroups as simCrossTrans [1] to evaluate the performance. The subgroups are SUNRGBD10, SUNRGBD16, SUNRGBD66 and SUNRGBD79, which have 10, 16, 66 and 79 categories, respectively. A detailed list of these subgroups can be found in simCrossTrans [1].

The performance of UODDM without the inter modality mixing
We first evaluate the performance of UODDM without inter-modality mixing. In this setting, the model is trained on both the RGB and DHS images. Our model architecture is the same as in simCrossTrans [1], which uses only DHS images to train the model. We also train an RGB-only model based on the same network to compare with the performance of UODDM. The performance of our proposed UODDM approach, measured in terms of mean average precision (mAP50) on the SUNRGBD79 subgroup, is presented in Figure 6. The results show that the UODDM model performs well on both RGB and DHS images. Additionally, the UODDM model significantly outperforms the DHS-only model on DHS images, which can be attributed to the inter-modality transfer learning from the RGB images. However, this improvement on DHS images comes at the cost of a slight performance reduction on RGB images. Nevertheless, the UODDM model's overall performance is promising, as it is a single model that can handle different modalities, making it more efficient than maintaining two separate architectures or a single architecture with two different sets of weights. This efficiency is particularly valuable for robotics and edge devices, where a seamless perception system can be built, even when transitioning from daytime to nighttime scenarios. Table 1 provides additional results for our UODDM and single-modality models, reinforcing the same conclusions.
Table 2. Result comparison based on mAP50 for different subgroups of UODDM and single modality only models.
In our study, we investigated two different methods for inter-modality mixing, namely SFFM and CPPM. For SFFM, we generated six mixing images for each RGB and DHS image pair, with connection probabilities for RGB and DHS pixels being randomly selected from the range of 0.1 to 0.9. The first pixel's RGB and DHS masks were randomly initialized with equal probability. In contrast, for CPPM, we used square patches of size 1 by 1, resulting in one CPPM image for each RGB and DHS image pair. The performance of both approaches was evaluated and is presented in Table 2. Notably, the results suggest that the UODDM with CPPM outperforms the UODDM with SFFM. We attribute this to the generation of an excessive number of random images by SFFM, which can negatively impact the performance of the unified network on RGB and DHS images. Conversely, CPPM provides comparable performance to the plain UODDM model. Furthermore, the use of the CPPM image generated from both RGB and DHS images led to the best 2D detection performance. Given the ability of UODDM with CPPM to support RGB, DHS, and CPPM images from RGB and DHS, we propose it as a more powerful unified model. Table 2 presents the results obtained by using Swin-T and Swin-S as the backbone networks. It is observed that Swin-S is a more powerful network; however, the performance gain achieved is limited. Therefore, we propose the usage of the lightweight Swin-T as the backbone network to achieve a faster inference speed, as depicted in Table 4.
Table 3. 2D detection results based on SUN RGB-D validation set. Evaluation metric is average precision with 2D IoU threshold of 0.5.
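The effect of the number of mixed images on the training set can be illustrated with a short calculation. It assumes that every RGB and DHS pair of the 2666-image training-only split contributes its RGB image, its DHS image, and the stated number of mixed images, and that all images are sampled uniformly; the helper name is hypothetical.

```python
pairs = 2666  # RGB/DHS pairs in the training-only split

def composition(mixed_per_pair):
    """Return (total training images, fraction that are mixed images)."""
    single = 2 * pairs               # one RGB and one DHS image per pair
    mixed = mixed_per_pair * pairs
    total = single + mixed
    return total, mixed / total

for name, k in (("plain UODDM", 0), ("UODDM + CPPM", 1), ("UODDM + SFFM", 6)):
    total, frac = composition(k)
    print(f"{name}: {total} images, {frac:.0%} mixed")
# plain UODDM: 5332 images, 0% mixed
# UODDM + CPPM: 7998 images, 33% mixed
# UODDM + SFFM: 21328 images, 75% mixed
```

Under these assumptions, three quarters of the SFFM training set consists of randomly mixed images, compared with one third for CPPM, which is consistent with the explanation above that too many random mixed images can hurt performance on the pure RGB and DHS inputs.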

Comparisons with other methods
In Table 3, we present a detailed comparison of per category results with previous works. Specifically, we evaluate the performance of our approach under three different input scenarios: RGB image, point cloud, and a combination of RGB and point cloud data using our proposed inter modality mixing method.
When considering RGB images as input, we observe that our best performing model, UODDM with CPPM or the RGB-only model, achieves slightly worse performance (54.2 mAP50 on SUNRGBD10) than the state-of-the-art Frustum PointNets [52]. On the other hand, when utilizing only point clouds as input, our plain UODDM model (without inter-modality mixing) demonstrates a slightly better performance (56.6 mAP50 on SUNRGBD10) compared to the previous state-of-the-art [1].
Remarkably, in the scenario where both RGB and point cloud data are available, our proposed UODDM with CPPM (58.1 mAP50 on SUNRGBD10) significantly outperforms the previous best results, obtained by RGB-D RCNN [5]. Notably, most prior works have focused on utilizing either RGB or point cloud data, with limited exploration of mixing methods for these modalities. Therefore, the proposed inter-modality mixing method constitutes a significant contribution to the field.
Moreover, our UODDM with CPPM method demonstrates a substantial performance gain compared to the strongest 2D detector from RGB images, i.e., Frustum PointNets [52]. Specifically, our approach achieves 58.1 mAP50 on SUNRGBD10, which is superior to the performance of Frustum PointNets (56.8 mAP50 on SUNRGBD10).

More results based on extra evaluation metrics
More results based on mAP/mAP75 can be found in the appendix.
Table 4. Number of parameters and inference time comparison. All speed tests are based on a standard single NVIDIA Titan-X GPU.
Table 4 presents the number of parameters and inference time for our proposed network architecture. The inference time reported for the Swin-T based network is the same as that reported in the simCrossTrans [1] paper, as we used the same network and hardware. However, since the Swin-S based network is larger, its inference is slower, as expected.

Conclusion
This paper presents a novel unified model capable of processing various types of data modalities, including RGB camera, DHS from depth sensor, and inter-modality mixing images from both RGB and DHS sources.
The proposed system is highly versatile and exhibits exceptional performance in different scenarios, where the availability of the sensors may vary. The use of RGB camera during the day and depth sensor at night results in a more eco-friendly and sustainable solution, which also benefits from the improved performance achieved with the inter-modality mixing technique. Our results demonstrate the superior performance of the proposed unified model compared to single-modality models, making it an efficient and powerful solution for various practical applications.

Acknowledgement
We would like to express our gratitude to Zhujun Li and Jaime Canizales for their valuable comments and advice during the development of this work. We would also like to thank Zhujun Li for suggesting the name "chessboard" to describe the method of alternately selecting a modality image based on square patches.
Table 5. More results comparison based on AP@IoU = .75, AP and AP of different scales.
Besides the AP50, which was mainly used in previous works, we also use AP75 and AP to compare the results based on different methods. Meanwhile, we also report AP Across Scales of small, medium and large by following the same standard of COCO dataset. Those results can be found in Table 5. From the results, we see that in general the UODDM with the CPPM can achieve the best performance on the CPPM image. This is mainly due to the fact that both the RGB and DHS images are used for the system. When only using the RGB image and only using the DHS image, the unified model UODDM with CPPM has similar performance as the single modality based model.