MMCAN: Multi-Modal Cross-Attention Network for Free-Space Detection with Uncalibrated Hyperspectral Sensors

Abstract: Free-space detection plays a pivotal role in autonomous vehicle applications, and its state-of-the-art algorithms are typically based on semantic segmentation of road areas. Recently, hyperspectral images have proven to be useful supplementary information in multi-modal segmentation, providing more texture details to the RGB representations and thus performing well in road segmentation tasks. Existing multi-modal segmentation methods assume that all the inputs are well-aligned, so that the problem reduces to fusing feature maps from different modalities. However, there exist cases where sensors cannot be well-calibrated. In this paper, we propose a novel network named multi-modal cross-attention network (MMCAN) for multi-modal free-space detection with uncalibrated hyperspectral sensors. We first introduce a cross-modality transformer using hyperspectral data to enhance RGB features, then aggregate these representations alternately via multiple stages. This transformer promotes the spread and fusion of information between modalities that cannot be aligned at the pixel level. Furthermore, we propose a triplet gate fusion strategy, which can increase the proportion of RGB in the multiple spectral fusion processes while maintaining the specificity of each modality. The experimental results on a multi-spectral dataset demonstrate that our MMCAN model has achieved state-of-the-art performance. The method can be directly used on the pictures taken in the field without complex preprocessing. Our future goal is to adapt the algorithm to multi-object segmentation and generalize it to other multi-modal combinations.


Introduction
As electric vehicles gradually replace traditional gasoline vehicles, the popularity of autonomous driving is also increasing year by year. People's perception of autonomous vehicles has shifted from science fiction to an everyday tool. Visual environment perception is the first link of autonomous driving, which helps autonomous vehicles to perceive and understand the surroundings [1]. Free-space detection, also known as collision-free space detection, is a fundamental component of visual environment perception. The approaches are generally semantic segmentation algorithms, which classify each pixel in an image into road or non-road classes. The segmentation results are then used by autonomous vehicles to navigate in complex environments and avoid obstacles.
In recent years, with the rapid development of computer technology, specifically the graphics processing unit (GPU), and the emergence of large-scale labeled data, the application of deep convolutional neural networks (DCNNs) has developed rapidly. It has become the mainstream method for free-space detection tasks. Thanks to the abundant data and accurate algorithms [2], it is convenient to train a segmentation DCNN. Even if the road is concealed by vehicles or under poor lighting conditions, these algorithms can provide a reliable result. Almost all standard road segmentation algorithms specifically support urban roads, which show either prominent boundary lines or clear texture demarcations in RGB images. However, segmentation methods for visible-light images have limitations because of complex surface features in the wild or insufficient illumination at night. Such problems may be overcome by introducing hyperspectral imaging (HSI) or near-infrared (NIR) images. A spectral image with a resolution in the range of $10^{-2}\lambda$ is called a hyperspectral image [3]. Hyperspectral imaging is a technology based on the continuous subdivision of narrow-band spectra to simultaneously image the target area. It has become a mature technology that can capture detailed information for each pixel. Such a large amount of reflectance information about the underlying material can be helpful in accurate HSI segmentation. Hyperspectral images can help distinguish different substances, which is difficult in RGB images. Hence, HSI is widely used in various areas, including precision agriculture, military, surveillance, etc. [3,4]. Near-infrared spectroscopy is a spectroscopic method that uses the near-infrared region of the electromagnetic spectrum and is based on overtones and combinations of bond vibrations in molecules.
In NIR spectroscopy, light is absorbed in varying amounts by the object at particular frequencies corresponding to the combinations and overtones of vibrational frequencies of some bonds of the molecules in the object. Therefore, the bands seen in the NIR are typically broad, leading to spectra that are more complex to interpret compared with RGB spectra. NIR light generally penetrates deeper into an object's surface and can reveal the underlying material characteristics [5]. Thus, changes in intensity in the NIR image are due to material and illumination changes but not to color variations within the same material [6]. In the NIR image, the impact of shadows on the road is effectively suppressed, and the road area remains distinguishable in the dark. In order to achieve the segmentation task based on multiple spectral data, we believe that multi-modal machine learning (MMML) is a practical approach.
A modality refers to how something happens or is experienced. In this article, we regard a modality as the data provided by one sensor. Multi-modal perception aims to process and understand information from multi-source modalities. Learning from heterogeneous data makes it possible to capture correspondences between modalities in depth. Examples are given in Figure 1 to show the advantage of multi-modal learning.
Figure 1. Example of real-world scenarios where current state-of-the-art single-modal approaches demonstrate misclassifications (panels: Input Image, Single-modal Learning, Multi-modal Learning). The first row shows an issue of misclassifications caused by puddles that do not reflect the sky. The second row shows inconspicuous classes where roads and curbs are constructed of the same material.
Most existing multi-modal semantic segmentation methods are based on pixel-level aligned sensors, such as RGB and depth cameras, or multi-modal magnetic resonance imaging (MRI). This approach provides a reasonable solution for unifying information from different modalities but is sensitive to the alignment of the input data. Unaligned multi-modal data will confuse the features learned by DCNNs, leading to false judgments, especially at low-dimensional layers. Today, public autonomous driving databases are dedicated to providing data for urban highway scenarios, and many free-space detection algorithms are customized based on such scenes. When these algorithms are transferred to particular scenes, such as rural or mountain roads, it is often difficult to achieve the same performance [7]. In order to achieve automatic driving tasks in these particular environments, we need to build an automatic driving collection platform and a database specific for rural or mountain roads. Unlike experiments performed on ready-made databases, such as KITTI [8] or Cityscapes [9], a self-built experimental platform for multi-modal perception can only collect uncalibrated data. In the data collection of autonomous driving, due to the different installation positions of multiple sensors, it is impossible to obtain completely aligned data from the source. The most common solution is calibrating the sensors and then performing the segmentation task using the aligned multi-modal information [10], but the mutual calibration of multiple sensors is complex work. In our experimental platform, three different spectral band sensors are included. Their fields of view are adjusted to overlap as much as possible. The distortion of different camera lenses and different imaging principles makes pixel-level alignment of these three multi-spectral sensors impossible.
Therefore, we explore a segmentation algorithm for uncalibrated multi-modal data to avoid extensive data calibration work.
To address this problem, we propose a cross-modal transformer in a U-shape multi-modal semantic segmentation architecture, which fuses heterogeneous information and supports dynamic weighted feature fusion. Instead of alignment, we draw inspiration from representation and mapping methods that use uncalibrated sensors. Cross-attention [11] can be used to combine two embedding sequences regardless of their heterogeneity. In the cross-attention module, the similarity of the resulting points will reflect the semantic proximity between their corresponding original inputs. The attention mechanism for mixing two different embedding sequences in the transformer architecture requires that the two sequences have the same dimensionality, but they can be of different modalities. One of the sequences serves as the query (Q) and defines the output length, and the other sequence generates the key (K) and value (V). In our model, RGB is always input as Q, while hyperspectral sequences are always input as K and V. Since RGB road segmentation achieves satisfactory results for most scenes, we want the RGB modality to lead the multi-modal perception. Therefore, the features after embedding are put into a gate fusion module [12]. After calculating the attention maps of the input features, a triplet gate is applied to obtain the adaptive RGB-guided fusion weights. Finally, the fused feature is sent into a segmentation decoder for the prediction result. Comprehensive experiments on the multi-spectral dataset HSI Road [13] show that our method provides excellent results in free-space detection tasks.
In this research, we directly exploit uncalibrated multi-modal data for the segmentation task. Our contributions in this paper are four-fold:

1. We propose a multi-modal free-space detection algorithm in an autonomous driving system with uncalibrated multi-spectral data.
2. We propose a cross-attention module that combines uncalibrated modalities. The attention mechanism extracts the relevant information of multi-modal data without pixel-wise alignment.
3. We design a multi-modal fusion architecture based on a triplet gate. In this structure, the participation of one primary modality is strengthened while the contributions of other modalities are maintained.
4. Experimental results on the HSI Road dataset demonstrate the effectiveness of the proposed multi-modal segmentation network compared with other existing approaches.
The rest of the paper is organized as follows: Section 2 summarizes the existing research on free-space detection and multi-modal feature fusion. Section 3 explains the proposed approach in detail. Section 4 provides details of the dataset and explains our experimental setup. Finally, Section 5 concludes the entire paper.

Related Work
We review some related work on free-space detection and multi-modal perception in the deployment of autonomous vehicle technology.

Free-Space Detection
Free-space detection is a binary pixel-level segmentation task. Popular single-modal semantic segmentation networks, such as FCN [14], SegNet [15], U-Net [16], PSPNet [17], DANet [18], etc., have achieved good performance for RGB free-space detection tasks. Today, state-of-the-art free-space detection networks usually use multi-modal data to assist RGB image segmentation and achieve excellent results, among which depth maps [19][20][21][22][23][24][25][26] or LiDAR point clouds [27][28][29] are the most commonly used modalities as they contain 3D information. SNE-RoadSeg+ [30] is the most representative one; it fuses RGB and dense disparity images and then obtains the segmentation result through a network with densely-connected skip connections, which achieves the state-of-the-art performance on the pioneering KITTI road [8] benchmark.
Although relatively rare, there are also some studies on multi-modal segmentation algorithms only using various 2D images. Shivakumar et al. [10] established an autonomous driving database containing RGB and thermal images, which is similar to the problem we face, but they have a different solution. They first performed calibration and then the segmentation process. Therefore, they also designed a two-stream segmentation architecture for the two modalities.
Due to its particular spectral range, NIR images often substitute RGB images for segmentation tasks under low-illumination conditions [31]. Before deep learning became popular, there were studies on combining NIR and RGB images for semantic segmentation [32,33]. In recent years, there have been studies on RGB+NIR for autonomous driving, using a dual-channel CNN model to perform semantic segmentation tasks for urban [34] and forest [35] scenes. Both of them used pixel-level aligned image data.
HSI images are mainly used for remote sensing tasks [36,37], but the algorithms for autonomous driving scenarios have not been well exploited. Huang et al. [38] applied HSI to semantic segmentation in cityscape scenes for the first time. They generated coarse labels with HSI images and utilized them to assist weakly supervised training with RGB images instead of fusing the two modalities.

Multi-Modal Feature Fusion
Multi-modal machine learning has been applied to various tasks, including speech synthesis [39,40], visual-audio recognition [41], sentiment analysis [42][43][44], image/video captioning [45][46][47], etc. As a part of multi-modal perception, most of the research on multi-modal segmentation [48][49][50] focuses on the feature fusion problem. Early works [19][20][21] on multi-modal learning concatenated calibrated images in different input channels to improve segmentation, which only required the training of a single model, making the training pipeline easy to construct. Other approaches [51,52] used single-modal decision values and combined them with a fusion mechanism.
Most commonly, multi-modal fusion is performed on latent features [22][23][24]. Dolz et al. [53] even proposed a densely connected network to connect and combine features from different layers of different modalities. This strategy of fusing pixels and features simultaneously allows the model to learn complex combined features between modalities freely. Chen et al. [54] introduced the method of feature gate fusion into multi-modal learning, which reduced the noise information in multi-modal data and allowed the incorporation of sufficiently complementary information to form discriminative representations for segmentation. However, these methods are all aimed at pixel-aligned feature maps.
Unfortunately, misalignment between multi-modal images is very common, but currently, no work can achieve multi-modal fusion from uncalibrated data for segmentation. In such conditions, Zhuang et al. [55] adopted a new label fusion algorithm for multi-modal images, which provided different levels of the structural information of images for multi-level local atlas ranking, utilized information-theoretic measures to compute the similarity between modalities, and performed the segmentation task after aligning the modalities. Chartsias et al. [56] corrected image misalignment with a Spatial Transformer Network and reconstructed the image to enable semi-supervised learning, thus bypassing the problem of modal alignment. Joyce et al. [57] achieved MR image synthesis by encouraging the network to learn a modality-invariant latent embedding during training to automatically correct misalignment in the input data. The study of modality embedding in this work inspired our approach to unaligned multi-modal data, but we believe that performing an image synthesis task is too complicated to guarantee real-time performance in autonomous driving scenarios.
In the above research, although people are interested in using multi-spectral images and RGB images together for road detection tasks, the step of multi-modal image alignment is generally ignored since the images are preprocessed in the public dataset. However, in the actual autonomous driving scene, the installation method and imaging method of the sensors determine that multi-spectral images are difficult to align at the pixel level. We explore a model that could directly use unaligned multi-modal images so that it could be used on autonomous vehicles.

Method
To address the uncalibrated multi-modal free-space detection problem, we propose a novel network structure named multi-modal cross-attention network (MMCAN). To augment uncalibrated multi-spectral images with RGB data, we build a cross-modal encoder to enhance the modalities alternately through multiple stages. The encoder utilizes a cross-attention module to project RGB features onto hyperspectral features, which facilitates information propagation between modalities that are not aligned at the pixel level. We also apply a triplet gate fusion strategy for multi-modal fusion to maintain the specificity of each modality.
In this section, we will first present the overall topology and training methods of the multi-modal free-space detection network. Secondly, we will introduce the proposed multi-modal cross-attention module. At last, we will describe the feature fusion details of the triplet gate.

Network Architecture
In our multi-modal free-space detection task, a group of three images from different modalities is input into the network: a 3-channel RGB image, a 16-channel HSI image, and a 25-channel NIR image. Each group of multi-modal data corresponds to the same scene, but only the ground truth of the RGB image is used. Therefore, our research focuses on extracting information from unaligned multi-modal images for the free-space detection task.
There are five research directions in multi-modal learning [58]: representation, translation, alignment, fusion, and co-learning. Multi-modal representation learning refers to summarizing the complementarity and eliminating the redundancy between multiple sensory modalities, and it includes two representation methods. Joint representation means that the information of multiple sensory modalities is mapped to a unified multi-modal vector space. Coordinated representation means that each modality is mapped to its respective representation space, but the mapped vectors satisfy certain relevance constraints. Translation, also called mapping, transforms the information of one modality into another. Alignment finds the correspondence between elements of different modalities from the same instance. The alignment can be reflected in time and space. In image semantic segmentation tasks, the spatial alignment is reflected in each pixel of the picture corresponding to a semantic label. Multi-modal fusion is the combination of the information of multiple sensory modalities to perform a prediction, which is one of the earliest and most widely researched directions of multi-modal machine learning. According to the fusion level, multi-modal fusion has three categories: pixel level, feature level, and decision level, corresponding to the fusion of original data, the fusion of abstract features, and the fusion of decision results. Our study focuses on feature-level fusion. It includes early, middle, and late fusion approaches, according to the stage of feature extraction at which the fusion occurs. Co-learning is the transfer of knowledge between different modalities. It can assist in the study of multi-modal mapping, fusion, and alignment problems.
Multi-modal fusion is the key point in our research, which integrates information from different modalities into a stable multi-modal representation. Multiple sensory modalities need to be integrated because different modalities behave differently in the same scene: they overlap, complement each other, and may interact in multiple different ways. With well-processed multi-modal information, more abundant features can be obtained than with a single modality, and the influence of redundant information will be reduced.
In this paper, we adopt the middle fusion strategy as the basis for our network design, which is to fuse the information at the feature level. Referring to the commonly used encoder-decoder structure, we design a separate encoder for each modality, which converts the input images into high-dimensional feature expressions, then integrates them before sending them into the segmentation decoder. For the encoder, we choose ResNet, which uses residual blocks as its feature extraction units. Its excellent feature extraction ability has been confirmed in numerous experiments. In order to fully preserve multi-scale features in segmentation, we design a U-shaped structure to connect the features to the decoder layer by layer. This helps the network identify segmentation edges.
It is generally believed that in deep neural networks, the low-level features such as edges, contours, and colors contain visual information with less semantics but accurate location information, while the high-level features have rich semantic information, but their location is coarse. Therefore, we place the feature fusion stage in the high-level layers of the encoder in order to prevent the network from learning pixel perturbations caused by misalignment in high-resolution images. Specifically, in the first two layers of the network, only RGB features are connected to the decoder through skip connections. In the last three layers, RGB features are first aggregated with the HSI and NIR features, respectively, and then gate-fused with the aggregated HSI and NIR features before being sent to the decoder. The first two layers are more sensitive to details due to their smaller receptive fields; therefore, learning only RGB features with ground truth is sufficient for the network to predict the segmentation edges. For the last several layers with larger receptive fields, the joint multi-modal features can effectively help the model learn high-dimensional semantic information and avoid misjudgments in road areas.
The overall structure of our MMCAN is depicted in Figure 2a. After the three types of spectral images are passed through the ResNet [59] encoders, the feature maps of the HSI and NIR spectra are each embedded in the RGB features for heterogeneous information aggregation. The aggregated HSI and NIR feature maps are then fused with the ResNet-encoded RGB feature in the last three layers through a triplet gate and sent to the U-shape decoder. The entire network is trained end-to-end, driven by cross-entropy loss defined on the segmentation benchmarks.

Multi-Modal Cross-Attention
Our multi-modal semantic segmentation needs to aggregate features from a group of uncalibrated multi-spectral images. Because the images in a group are not aligned, they correspond to different ground truths. Learning features from a mismatched label confuses the representation learning system, resulting in convergence failure or wrong results. However, although each image in the same group is different in detail and size, the corresponding road scenes are almost the same. In the road segmentation task, our purpose is to minimize the misclassification of areas of the road rather than distinguish the edge details. Therefore, an effective cross-modality aggregation scheme should be able to extract effective segmentation information from this group of multi-modal data. We put forward a multi-modal cross-attention (MMCA) fusion to solve the problem. The framework of the proposed approach is shown in Figure 3. The fusion involves the RGB feature of one branch and the HSI/NIR feature of the other branch. In order to fuse multi-modal features more efficiently and effectively, we utilize the RGB feature at each branch as an agent to exchange information with the multi-spectral feature from the other branch. The proposed operation can be precisely described in the Q-K-V language, namely matching a query from one modality with a set of key-value pairs from the other modality and thereby extracting the most critical cross-modality information. The MMCA operation consists of a set of queries $Q \in \mathbb{R}^{HW_1 \times d}$, a set of keys $K \in \mathbb{R}^{HW_2 \times d}$, and values $V \in \mathbb{R}^{HW_2 \times d}$, where $HW_1$ is the number of pixels of the query, $HW_2$ is the number of pixels of the key-value pairs, and $d$ is the common dimensionality of all the input features. We calculate the dot products of the query with all keys, divide each by $\sqrt{d}$, and apply a softmax function to obtain the attention weights on the values.
The MMCA operation can be mathematically expressed as:
$$Z = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$
where $Q \in \mathbb{R}^{HW_1 \times d}$ is the query, $K \in \mathbb{R}^{HW_2 \times d}$ is the key, $V \in \mathbb{R}^{HW_2 \times d}$ is the value, and $Z \in \mathbb{R}^{HW_1 \times d}$ corresponds to the attended features of the queries. In comparison with self-attention, which only attends within a single modality, our proposed cross-modal attention allows the model to attend to diverse information from different modalities. Suppose $X_1 \in \mathbb{R}^{HW_1 \times C}$ and $X_2 \in \mathbb{R}^{HW_2 \times C}$ are the feature maps from a specific stage of ResNet with dimension $C$. $Q$, $K$, and $V$ are given as follows:
$$Q = X_1 W_q, \quad K = X_2 W_k, \quad V = X_2 W_v,$$
where $W_q, W_k, W_v \in \mathbb{R}^{C \times d}$ are learnable parameters of $1 \times 1$ convolutions. To prevent the model from becoming too large, we set $d = C/n$, where $n$ is the reduction rate of the input dimension.
Implementation of multi-modal cross-attention. Figure 2b presents an example of the MMCA block with two cross-attention fusion streams. One stream is the aggregation of the HSI and RGB features, the other is for NIR and RGB. The two streams share the same structure but have independent training parameters.
Since the RGB image is the only modality whose annotation we use, Q comes from the RGB branch, and K and V come from the multi-spectral branches. This allows each position in the RGB branch to attend to all positions in the multi-spectral branch at a specific stage. As a result, it can selectively obtain more valuable information from possibly misaligned multi-spectral branches. The MMCA block can be added anywhere in CNNs because it accepts keys and values of any spatial shape and ensures that the output has the same spatial shape as Q. This flexibility allows us to fuse richer layered features between uncalibrated modalities. Thus, through the cross-attentional fusion operation, the latent features of the three modalities are aligned to $HW_1 \times C$.
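The Q-K-V computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `softmax` and `cross_attention`, the toy sizes, and the random weights are hypothetical, the $1 \times 1$ convolutions are written as per-pixel matrix products on flattened feature maps, and any projection from $d$ back to $C$ channels is omitted because the text does not describe it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x1, x2, w_q, w_k, w_v):
    """Cross-attention between two flattened feature maps.

    x1: (HW1, C) RGB features, used for the queries.
    x2: (HW2, C) multi-spectral (HSI or NIR) features, used for keys/values.
    w_q, w_k, w_v: (C, d) projections (a 1x1 convolution acts per pixel,
    so it is equivalent to a matrix product on flattened features).
    """
    q = x1 @ w_q                                      # (HW1, d)
    k = x2 @ w_k                                      # (HW2, d)
    v = x2 @ w_v                                      # (HW2, d)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)     # (HW1, HW2)
    return attn @ v, attn                             # (HW1, d), (HW1, HW2)

# Toy sizes (assumptions): the two feature maps have different resolutions.
rng = np.random.default_rng(0)
C, n = 8, 2
d = C // n                       # reduced dimension d = C / n
hw1, hw2 = 6 * 10, 4 * 7         # pixel counts of the RGB and HSI maps
x_rgb = rng.standard_normal((hw1, C))
x_hsi = rng.standard_normal((hw2, C))
w_q, w_k, w_v = (rng.standard_normal((C, d)) for _ in range(3))

z, attn = cross_attention(x_rgb, x_hsi, w_q, w_k, w_v)
print(z.shape, attn.shape)
```

The output length follows the query, so the attended features keep the RGB spatial size regardless of the resolution of the multi-spectral input, which is the property that removes the need for pixel-level alignment.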

Triplet Gate Fusion
The multi-spectral features are highly complementary, not only on the good side but also on the bad side. As the most widely used modality in free-space detection, RGB images provide rich and robust features for segmentation tasks. In fact, although multi-spectral images can provide more segmentation information than RGB images in some specific scenarios, segmentation using the HSI or NIR modality alone cannot achieve the performance of an RGB-only network on the entire dataset. General fusion strategies, such as concatenation or summation, fuse the feature maps together without considering the disambiguation among modalities. For multi-modal learning, multi-source features of the same instance are mixed with each other, which may cause cross-modality ambiguity. In order to make full use of the complementarity of multi-modal information and filter the ambiguous features, we selectively use them for fusion according to the representation capabilities of different modalities. To this end, we design a triplet gate structure to measure the effectiveness of each modality and to fuse these features accordingly. The triplet gate is designed based on a concatenation-based fusion with a controlled information flow, which is visualized in Figure 2. The general idea of a gate fusion is that each feature map $x_i \in \mathbb{R}^{C \times H \times W}$ is associated with a gate map $G_i \in [0, 1]^{H \times W}$. A concatenation-based gate fusion can be defined as:
$$X = \mathrm{Concat}\left(G_1 \odot x_1, \ldots, G_M \odot x_M\right),$$
where $M = 3$ is the number of feature maps and $\odot$ denotes element-wise multiplication broadcast over the channels. Specifically, we generate the triplet gate with the aggregated feature maps from the previous section, which are $RGB \in \mathbb{R}^{C \times H \times W}$ for the RGB input, $HSI \in \mathbb{R}^{C \times H \times W}$ for the HSI input, and $NIR \in \mathbb{R}^{C \times H \times W}$ for the NIR input. The first step is to concatenate these three feature maps so as to collect their features in a specific dimension.
The concatenated feature is then mapped to three different gate vectors with three convolutional layers $F_{rgb}$, $F_{hsi}$, and $F_{nir}$:
$$V_{rgb} = F_{rgb}\left(\mathrm{Concat}(RGB, HSI, NIR)\right), \quad V_{hsi} = F_{hsi}\left(\mathrm{Concat}(RGB, HSI, NIR)\right), \quad V_{nir} = F_{nir}\left(\mathrm{Concat}(RGB, HSI, NIR)\right),$$
where $V_{rgb}$, $V_{hsi}$, and $V_{nir}$ are the three gate vectors of the RGB, HSI, and NIR features, respectively. The three gate vectors are then concatenated to calculate the gate maps through a softmax function:
$$\left(G_{rgb}, G_{hsi}, G_{nir}\right) = \mathrm{softmax}\left(\mathrm{Concat}(V_{rgb}, V_{hsi}, V_{nir})\right),$$
where the purpose is to normalize the gate maps $G_{rgb}$, $G_{hsi}$, and $G_{nir}$ to meet the condition $G_{rgb} + G_{hsi} + G_{nir} = 1$, which represents the weights assigned to each position in the feature maps. The gate vectors are produced by a fully connected layer with a sigmoid function that adaptively controls the flow at the input. Therefore, the final fused feature $X$ can be formulated as:
$$X = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(G_{rgb} \odot RGB,\; G_{hsi} \odot HSI,\; G_{nir} \odot NIR\right)\right),$$
where we append a $1 \times 1$ convolutional layer to map the concatenated feature from $\mathbb{R}^{3C \times H \times W}$ to $\mathbb{R}^{C \times H \times W}$. Through this gate fusion module, the network has a robust feature retention mechanism to ensure that the decoders can learn complete information while eliminating the noise brought by the multi-modal data.
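The gate fusion steps described above (concatenate, compute one gate per modality, normalize with a softmax, weight, and reduce with a $1 \times 1$ convolution) can be illustrated with a short NumPy sketch. Everything named here is a hypothetical stand-in: `conv1x1`, `triplet_gate_fusion`, the toy shapes, and the random weights are assumptions, and modeling each gate layer as a single-channel $1 \times 1$ convolution followed by a sigmoid is one possible reading of the text rather than the confirmed design.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution without bias: x (C_in, H, W), w (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triplet_gate_fusion(rgb, hsi, nir, gate_ws, out_w):
    """Fuse three (C, H, W) feature maps with per-pixel gates.

    gate_ws: three (1, 3C) weight matrices, one per modality (assumed form).
    out_w:   (C, 3C) weights of the final 1x1 convolution.
    """
    cat = np.concatenate([rgb, hsi, nir], axis=0)            # (3C, H, W)
    # One single-channel gate vector per modality.
    v = np.concatenate([sigmoid(conv1x1(cat, w)) for w in gate_ws], axis=0)
    # Softmax over the modality axis so the three gates sum to 1 per pixel.
    e = np.exp(v - v.max(axis=0, keepdims=True))
    g = e / e.sum(axis=0, keepdims=True)                     # (3, H, W)
    gated = np.concatenate(
        [g[0:1] * rgb, g[1:2] * hsi, g[2:3] * nir], axis=0   # (3C, H, W)
    )
    return conv1x1(gated, out_w), g                          # (C, H, W), (3, H, W)

# Toy inputs (assumptions): all three maps already share the RGB resolution.
rng = np.random.default_rng(0)
C, H, W = 4, 5, 6
rgb, hsi, nir = (rng.standard_normal((C, H, W)) for _ in range(3))
gate_ws = [rng.standard_normal((1, 3 * C)) for _ in range(3)]
out_w = rng.standard_normal((C, 3 * C))

fused, gates = triplet_gate_fusion(rgb, hsi, nir, gate_ws, out_w)
print(fused.shape, gates.shape)
```

Because the gates are normalized per pixel, a modality that is unreliable in one image region can be down-weighted there without being suppressed everywhere.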

Experiments
Dataset. We evaluate our approach on the multi-spectral free-space detection dataset HSI Road [13]. It contains 3799 scenes with RGB, HSI, and NIR modalities, including 1811 rural scenes and 1988 urban scenes. All the modalities are annotated separately, but we only use the RGB labels as the ground truth. The RGB modality used in the experiments consists of 3-channel 704 × 1280 pixel images, the HSI modality of 16-channel 256 × 480 pixel images, and the NIR modality of 25-channel 192 × 384 pixel images. Figure 4 shows the imaging characteristics of these spectra. Experiments are conducted on three sets: rural-only, urban-only, and the entire dataset. Due to the small amount of data (fewer than 10,000 scenes), we set the ratio of the training set, test set, and validation set to 6:2:2. Therefore, for each experiment, we randomly use 60% of the data as the training set, 20% as the test set, and the remaining 20% as the validation set.
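As a minimal sketch of the random 6:2:2 split described above, the following uses only the Python standard library. The function name `split_indices`, the fixed seed, and the choice to round the training and test sizes down (leaving the remainder to validation) are assumptions; the paper does not state how fractional sizes were handled.

```python
import random

def split_indices(n_scenes, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split scene indices into train / test / validation lists."""
    idx = list(range(n_scenes))
    random.Random(seed).shuffle(idx)        # seeded for reproducibility
    n_train = int(ratios[0] * n_scenes)     # rounded down (assumption)
    n_test = int(ratios[1] * n_scenes)      # rounded down (assumption)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    val = idx[n_train + n_test:]            # remainder goes to validation
    return train, test, val

# 3799 is the total number of scenes reported for HSI Road.
train, test, val = split_indices(3799)
print(len(train), len(test), len(val))
```

The same function can be applied to the 1811 rural or 1988 urban scenes to build the rural-only and urban-only experiments.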

Figure 4. Example of multi-spectral images in the HSI Road dataset [13]. (a-c) show three different scenes, and each scene includes three images that are from RGB, HSI, and NIR, respectively (from top to bottom). The images on the bottom represent the ground truth, which is annotated according to the RGB spectrum.
Implementation Details. Our network is implemented in PyTorch and trained on an NVIDIA Tesla V100 (Nvidia, CA, USA) platform using CUDA 10.0. Our batch size is set to 6, the initial learning rate is set to $1 \times 10^{-4}$, and the Adam solver is used to optimize the network. We train the network over 100 epochs and decay the learning rate linearly at a rate of 0.99.
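The decay rule is stated only briefly, so the sketch below shows one possible reading: the learning rate is multiplied by 0.99 after every epoch, starting from $1 \times 10^{-4}$. This interpretation, the per-epoch granularity, and the function name `lr_at_epoch` are assumptions and may differ from the schedule actually used.

```python
def lr_at_epoch(epoch, base_lr=1e-4, decay=0.99):
    """Learning rate after `epoch` multiplicative decay steps (assumed rule)."""
    return base_lr * decay ** epoch

# Learning rate for each of the 100 training epochs under this assumption.
schedule = [lr_at_epoch(e) for e in range(100)]
print(f"first epoch: {schedule[0]:.3e}, last epoch: {schedule[-1]:.3e}")
```

Under this reading the learning rate decreases monotonically and stays within the same order of magnitude over the 100 epochs.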
Evaluation Metrics. Free-space detection is a two-class segmentation problem. Following recent methods, we employ two metrics to evaluate the performance of our networks, namely pixel accuracy and mIoU. The metrics are computed as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right),$$
where TP, TN, FP, and FN represent the number of true positive, true negative, false positive, and false negative pixels. The results from these formulas are dimensionless. The Accuracy shows the ratio of correctly predicted pixels, and the mIoU shows the ratio of the intersection to the union of the ground truth and the predicted results, averaged over the road and non-road classes.
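To make the two metrics concrete, the following sketch computes them from a predicted mask and a ground-truth mask. It assumes that mIoU is the mean of the per-class IoU over the road and non-road classes; the helper names `confusion_counts`, `accuracy`, and `mean_iou` and the tiny example masks are hypothetical, and the code does not guard against a class being absent from both masks.

```python
import numpy as np

def confusion_counts(pred, gt):
    """Count TP, TN, FP, FN for binary masks (1 = road, 0 = non-road)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))
    tn = int(np.sum(~pred & ~gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Ratio of correctly predicted pixels.
    return (tp + tn) / (tp + tn + fp + fn)

def mean_iou(tp, tn, fp, fn):
    # Assumed definition: average IoU of the road and non-road classes.
    iou_road = tp / (tp + fp + fn)
    iou_non_road = tn / (tn + fp + fn)
    return 0.5 * (iou_road + iou_non_road)

# Tiny 2 x 4 example: one false positive and one false negative.
gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]])
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]])

tp, tn, fp, fn = confusion_counts(pred, gt)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
print(mean_iou(tp, tn, fp, fn))  # 0.6
```

In this example two of the eight pixels are wrong, so the accuracy is 0.75, while each class has an intersection of 3 pixels and a union of 5 pixels, giving an mIoU of 0.6.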

Experimental Results
In our experiments, we compare our MMCAN with SOTA semantic segmentation approaches. We use the dataset to train ten DCNNs, including five single-modal networks and four multi-modal networks. The approaches are tested under three settings: (a) training with urban scenes, (b) training with rural scenes, and (c) training with mixture scenes. The single-modal experiments are conducted with RGB images only. The multi-modal experiments are conducted with two fusion strategies: early fusion and middle fusion.
For single-modal experiments, we implemented two baseline segmentation approaches, i.e., U-Net [16] and DeepLab-v3 [60], and deployed three SOTA methods, i.e., DANet [18], HRNet [61], and Self-Regulation [62]. The backbone of HRNet is set to HRNetV2-W48, and the others use ResNet-50. For multi-modal learning, the early fusion method is a U-Net that takes the concatenation of the images as input. The middle fusion methods include HAFB [50] and a multi-encoder U-Net baseline called MU-Net [63], which consists of three independent ResNet-50 encoders for the three modalities; the feature maps of each layer are concatenated and used as the skip connections of a U-Net decoder. To compare the performance of our proposed method with other SOTA DCNNs, we train our MMCAN with the same setup as for the multi-modal networks.
We evaluate the performance of our proposed MMCAN both qualitatively and quantitatively. The accuracy and mIoU scores on the validation set are compared in Table 1. The scores in rural scenes are lower than those in urban scenes, and the scores on the entire dataset lie between the two. The scores of the SOTA multi-modal methods are similar to those of the SOTA single-modal networks in the urban scenes and are 0.5-5% higher in the rural scenes, which indicates that multi-modal data can indeed make up for the deficiencies of the RGB modality. Our proposed MMCAN outperforms both the RGB-based single-modal methods and the multi-modal methods designed for aligned images in both urban and rural scenarios, with a score gain of 1.2-4.5%.
Examples of the experimental results on the HSI Road dataset are shown in Figure 5. Single-modal methods with RGB images as inputs usually generate fairly accurate segmentation results, but they suffer from occasional misclassification under poor shadow and lighting conditions. Early fusion and middle fusion strategies using aligned data can effectively improve performance, recovering rough road shapes but with inaccurate segmentation boundaries. Our approach addresses both issues, presenting more accurate free-space estimations while better preserving boundary details.
The experimental results show that our method has three advantages. First, in the urban environment, the method is as good as the SOTA RGB single-modal method, with slightly higher accuracy (by 0.63%). Second, in the rural environment, the method has a clear advantage over the RGB single-modal method, with a 1.78% higher score. This is because the rural environment is unstructured; many features cannot be perceived by RGB cameras, and the task can only be completed with supplementary multi-spectral information. Third, compared with other multi-modal methods, the method uses multi-modal cross-attention to solve the problem of data alignment and can directly process unaligned multi-spectral data. However, the method also has some disadvantages. The accuracy of its segmentation edges is insufficient, and it has difficulty recognizing small targets; both issues need further research and exploration in the future.

Ablation Study
To validate the effectiveness of every component in the proposed MMCAN, we performed ablation experiments on the HSI Road dataset.
First, we investigate the impact of concatenation fusion and our proposed triplet gate fusion by replacing the gate fusion blocks with concatenation operators. As shown in Table 2, the gate fusion strategy significantly outperforms the simple concatenation fusion strategy for multi-modal free-space detection. The performance increases by 0.61 points in urban scenes and by 1.32 points on the whole dataset, which can be attributed to the fact that the gate reduces noise in the modalities so that useful information is emphasized.
Then, we remove inputs from MMCAN to evaluate its performance with fewer modalities. We conduct five experiments: (a) training with RGB images, (b) training with HSI images, (c) training with NIR images, (d) training with RGB + HSI modalities, and (e) training with RGB + NIR modalities. From Table 3, we can observe that our choice of using all three modalities outperforms the reduced configurations for every combination of training data, showing that data fusion via a three-encoder architecture benefits free-space detection. It should be noted that, although our approach cannot provide competitive results in the single-modal condition, the network still achieves sufficiently reliable segmentation.
To further validate the effectiveness of our choice, we add the MMCA module to the low-dimensional layers of the network. Table 4 verifies the superiority of deploying MMCA modules in high-dimensional layers, which helps to alleviate feature confusion and to generate accurate free-space detection results.
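The contrast between concatenation and gated fusion examined in the first ablation can be illustrated with a generic sketch. It is not the triplet gate fusion block itself, whose formulation is not given in this section: it assumes one scalar sigmoid gate per modality and fixed gate logits, whereas a real network would predict the gates from the features. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(features: Sequence[List[float]]) -> List[float]:
    """Baseline: place the per-modality feature vectors end to end."""
    fused: List[float] = []
    for f in features:
        fused.extend(f)
    return fused


def gated_fusion(features: Sequence[List[float]],
                 gate_logits: Sequence[float]) -> List[float]:
    """Scale each modality by sigmoid(logit), then sum element-wise."""
    if len(features) != len(gate_logits):
        raise ValueError("one gate logit is required per modality")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all modalities must have the same feature length")
    gates = [sigmoid(z) for z in gate_logits]
    return [sum(g * f[i] for g, f in zip(gates, features)) for i in range(length)]


# Toy feature vectors for three modalities; the third one is deliberately noisy.
rgb_feat = [1.0, 2.0]
hsi_feat = [0.5, 0.5]
nir_feat = [10.0, -10.0]
features = [rgb_feat, hsi_feat, nir_feat]

concatenated = concat_fusion(features)           # noise is passed on unchanged
gated = gated_fusion(features, [2.0, 0.0, -4.0])  # low gate damps the noisy input
```

With concatenation, the decoder receives every modality at full strength, whereas a gate near zero shrinks the contribution of an unreliable modality before the features are combined.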

Conclusions
In this paper, we have presented a cross-modality embedding aggregation network for free-space detection on uncalibrated multi-spectral images captured by a combination of sensors deployed on autonomous vehicles. Unlike existing multi-modal segmentation methods, this network does not rely on pixel-wise aligned images; therefore, much of the preprocessing work, such as calibration and labeling, can be reduced. The network is able to correct the erroneous results of RGB single-modal segmentation in specific scenarios. Meanwhile, the joint triplet gate fusion can eliminate the ambiguous information in multi-modal data. The experimental results on a tri-modal dataset of HSI, NIR, and RGB images show that our model not only achieves a significant improvement in rural and mountain scenes but also achieves SOTA performance in multi-scene training. The model provides a solution for multi-modal perception in autonomous driving without data preprocessing, which greatly reduces the computational cost. There are still deficiencies in our work: the model predicts segmentation edges imprecisely and performs poorly in the detection of tiny objects. Our future work focuses on two points. The first is to extend the algorithm to other autonomous driving tasks, such as multi-target segmentation, prediction, and 3D segmentation. The second is to explore solutions to misaligned modalities in other multi-modal vision problems.

Data Availability Statement: The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
DCNNs    Deep Convolutional Neural Networks
MMML     Multi-Modal Machine Learning