A Multi-Sensor Fusion Framework Based on Coupled Residual Convolutional Neural Networks

Abstract: Multi-sensor remote sensing image classification has been considerably improved by deep learning feature extraction and classification networks. In this paper, we propose a novel multi-sensor fusion framework for fusing diverse remote sensing data sources. The novelty of this paper is grounded in three design innovations: (1) a unique adaptation of coupled residual networks to multi-sensor data classification; (2) a smart auxiliary training strategy that adjusts the loss function to address classification with limited samples; and (3) a unique design of the residual blocks that reduces computational complexity while preserving the discriminative characteristics of multi-sensor features. The proposed classification framework is evaluated on three different remote sensing datasets: the urban University of Houston datasets (Houston 2013 and the training portion of Houston 2018) and the rural Trento dataset. The proposed framework achieves high overall accuracies of 93.57%, 81.20%, and 98.81% on the Houston 2013, Houston 2018 (training portion), and Trento datasets, respectively. Additionally, the experimental results demonstrate considerable improvements in classification accuracy over existing state-of-the-art methods.


Introduction
Multi-sensor image analysis of remotely sensed data has become a growing area of research in recent years. Space and airborne remote sensing data streams provide increasingly abundant data suited for earth observation and environmental monitoring [1]. The spatial, temporal, and spectral capabilities of optical remote sensing systems are also increasing over time. Besides the evolution of multispectral imaging (MSI), hyperspectral imaging (HSI) [2][3][4] and light detection and ranging (LiDAR) observation platforms have also gained relevance [5][6][7]. An increasing diversity of HSI and LiDAR acquisition systems is available for terrestrial, space, and airborne data collection. While MSI and HSI rely on solar radiance as a passive illumination source, LiDAR devices emit their own active radiation for measurement. MSI and HSI systems produce pixels representing two-dimensional bands at their respective wavelengths, while LiDAR systems measure structure via point clouds organized in three-dimensional space at their respective wavelengths. Combining such data at the image or feature level yields both opportunities and challenges. For instance, fusing HSI and LiDAR data of the same scene offers a rich feature space allowing distinct separation of observed objects based on their spectral signatures and elevation characteristics [8,9]. Meanwhile, multi-sensor datasets can contain sophisticated heterogeneous data structures and differing data formats or characteristics (e.g., asymptotic properties, spatial and spectral resolutions, etc.). Given the increasing availability and complexity of multi-sensor data, fusion techniques are evolving toward meaningful exploitation of multi-source inputs. This paper addresses the exploitation of the large and growing volume of combined multi-sensor data by classification algorithms.
Depending on the study site and classification scheme, multi-sensor feature spaces can possess unique hybrid properties introducing new challenges for the production and deployment of appropriate training data. Sources of accurate training data are often scarce, and the production is expensive, particularly for novel hybrid multi-sensor feature spaces. Therefore, conventional classification systems and networks often become less efficient for such diverse and complicated datasets. Hence, the effective fusion of heterogeneous multi-sensor data for classification applications is essential to our remote sensing research.
A wide variety of multi-sensor data fusion methods have been developed to leverage heterogeneous data sources, most prominently for HSI and LiDAR data fusion [10][11][12][13][14][15][16][17]. In [10], morphological-level features, specifically attribute profiles (APs), were embedded with a subspace multinomial logistic regression model for the fusion of HSI and LiDAR data. The capability of APs in extracting discriminating spatial features was again confirmed in [11], where extended attribute profiles (EAPs) were used to extract features from HSI and LiDAR data, respectively. Moreover, morphological extinction profiles (EPs) have been proposed to overcome the threshold determination difficulties of APs and further boost the performance of feature extraction [12]. EPs have been successfully applied to fuse HSI and LiDAR data with a total variation subspace model in [13]. Regarding supervised fusion algorithms, many research works have been dedicated to the development of more robust models, for instance, a generalized graph-based fusion model in [14]; a sparse and low-rank component model in [15]; a multi-sensor composite kernel model in [16]; a decision-level fusion model based on a differential evolution method in [17]; semi-supervised graph-based fusion in [18]; and discriminant correlation analysis in [19]. One mutual objective of these fusion algorithms is to simultaneously determine the optimized classification decision boundary by considering heterogeneous feature spaces. Nevertheless, their success often requires a comprehensive understanding of sensor systems and individual domain expertise, and hand-crafted morphological features are naturally redundant and may still suffer from problems such as the curse of dimensionality, also termed the Hughes phenomenon [20].
More recently, the rapid development of deep learning techniques has led to an explosive growth in the field of remote sensing image processing, especially the classification of HSI [21]. Deep learning models, especially convolutional neural networks (CNNs), open up a new possibility for invariant feature learning of HSI data, from hand-crafted to end-to-end, from manual configurations to fully automatic, from shallow to deep [22].
At the same time, various research efforts have developed novel multi-sensor fusion approaches based on deep learning [23][24][25][26][27][28]. Among the first studies, a deep fusion model was designed in [23] for the fusion of HSI and LiDAR data, where CNNs performed as both feature extractor and classifier. In [24], the joint use of HSI and LiDAR data was further explored by combining morphological EPs and high-level deep features via a composite kernel (CK) technique. In [25], a dual-branch CNN was proposed to learn spectral-spatial and elevation features from HSI and LiDAR, respectively, after which all features were fused via a cascaded network. Beyond the fusion of HSI and LiDAR data, similar superior performance of deep learning models was confirmed in [26], where Landsat-8 and Sentinel-2 satellite images were fed into a two-branched residual convolutional neural network (ResNet) for local climate zone classification. However, training such deep learning fusion models can be challenging: deep fusion models mostly require sophisticated network designs with more parameters to simultaneously handle multi-sensor inputs, while network training becomes more difficult as the network grows deeper [29].
Fortunately, these issues can be mitigated using the residual learning technique, where low-level features are successively passed to deeper layers via identity mapping [30]. Based on this approach, we propose a novel multi-sensor fusion framework by designing multi-branched coupled residual convolutional neural networks, termed CResNet. Moreover, the proposed framework is designed as a generalized deep fusion framework whose inputs are not limited to specific sensor systems; it automatically fuses different types of multi-sensor datasets.
The proposed CResNet mainly consists of three individual ResNet branches along with coupled fully connected layers for data fusion. Unlike [24], which requires a separate training step for the CK classifiers, the proposed CResNet is trained in an end-to-end manner, which lowers the computational complexity of data fusion. To highlight the generalized fusion capability of CResNet, we test the proposed framework on three distinct multi-sensor datasets with inputs ranging from HSI and RGB to LiDAR feature spaces and with various land cover classes. The major contributions of this paper are threefold:

1. The proposed CResNet adopts novel residual blocks (RBs) with identity mapping to address the gradient vanishing phenomenon and promote discriminant feature learning from multi-sensor datasets.

2. The design of coupling individual ResNets with auxiliary losses enables CResNet to simultaneously learn representative features from each dataset via an adjusted loss function and to fuse them in a fully automatic, end-to-end manner.

3. Since CResNet is highly modularized and flexible, the proposed framework achieves competitive data fusion performance on three commonly used multi-sensor datasets, where state-of-the-art classification accuracies are obtained using limited training samples.
Section 2 describes the concept of residual feature learning and introduces the detailed architecture of the CResNet. The data descriptions and experimental setups are reported in Section 3. Then, Section 4 is devoted to the discussion of experiment results on three multi-sensor datasets. The main conclusions are summarized in Section 5.

Methodology
We present the structure of the proposed CResNet in Figure 1. The fusion framework can be divided into three main components: feature learning via residual blocks, multi-sensor data fusion via coupled ResNets, and auxiliary training via an adjusted loss function. Although there is no limit on the number of datasets that can be fused using the proposed method, we evaluate the framework by applying it to three co-registered datasets for multi-sensor data fusion and classification.

Feature Learning via Residual Blocks
Recently, ResNet has become a popular deep learning technique [29] and has achieved significant classification performance on heterogeneous remote sensing datasets [31,32], in which multi-sensor data sources (e.g., HSI, MSI, LiDAR) have been intensively investigated. Residual blocks (RBs), the characteristic architecture of ResNet, were proposed to alleviate the gradient vanishing and explosion issues of CNNs during training [29]. By addressing this optimization degradation issue, such blocks improve training accuracy, which is a prerequisite for good testing and validation accuracies. In this paper, ResNets with multiple RBs are selected as the base feature learning networks, which are later aggregated into a generalized multi-branched data fusion network. As shown in Figure 2, a residual block can be considered an extension of several convolutional layers, where gradients in the deeper layers can be propagated back to the lower layers via identity mapping. Note that identity mapping was proposed in [30] to further improve the training and regularization of the original ResNet design in [29].
Within each RB, we follow the design in [30] and use three successive convolutional layers with kernel sizes of 1 × 1 × m, 5 × 5 × m, and 1 × 1 × m, respectively, where m refers to the number of feature maps. Such successive layers form a so-called bottleneck design, consisting of a 1 × 1 × m layer for dimension reduction, a 5 × 5 × m convolutional layer, and a 1 × 1 × m layer for restoring the dimension, which optimizes the model complexity and thus yields a computationally more efficient model [29]. X_k and X_{k+1} refer to the input and output feature spaces of an RB, respectively, and their feature sizes are kept unchanged via the padding strategy. More importantly, by applying identity mapping with full pre-activation feature spaces into deeper layers [30], the functionality of RBs is formulated as follows:

X_{k+1} = X_k + F(X_k, W_k), (1)

where X_k refers to the feature maps of the k-th layer, and W_k denotes the weights and biases of the k-th layer in the RBs. The function F is the pre-activation residual function, which combines batch normalization (BN) [33] and the nonlinear ReLU activation [34] in order to improve the speed and stability of the proposed CResNet. Figure 2 shows how the full pre-activation shortcut connection forms a direct channel through which the gradient can propagate in both directions, forward and backward. Hence, the training process of such RBs is simplified, leading to improved generalization capabilities. A key property of the full pre-activation shortcut becomes apparent when multiple RBs are trained successively; the feature spaces can then be written recursively as:

X_{k+2} = X_{k+1} + F(X_{k+1}, W_{k+1}), (2)

where W_{k+1} denotes the weights and biases of the (k+1)-th layer in the RBs. Unrolling these recursive feature spaces, Equation (1) evolves into:

X_L = X_k + ∑_{l=k}^{L−1} F(X_l, W_l). (3)

Hence, the feature space of any deeper layer L can be formulated as the feature space of any shallower layer k plus a sum of residual functions ∑_{l=k}^{L−1} F.
Moreover, this characteristic ensures the backward propagation of model gradients into the lower layers as well, benefitting the overall feature learning with heterogeneous remote sensing datasets. For a more detailed description of full pre-activation identity mapping, please refer to [30].
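The recursive identity above can be checked numerically. The following is a minimal NumPy sketch in which the residual function F is a toy stand-in (ReLU followed by a linear map) rather than the paper's BN-ReLU convolutional bottleneck; it only illustrates that stacking pre-activation residual units yields a deep feature equal to the shallow feature plus a sum of residual terms, as in Equation (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def pre_act_residual_unit(x, w):
    """One pre-activation residual unit: x_{k+1} = x_k + F(x_k, w).
    Here F is a toy stand-in (ReLU then a linear map); the paper's F is a
    1x1 / 5x5 / 1x1 convolutional bottleneck with BN + ReLU pre-activation."""
    return x + np.maximum(x, 0.0) @ w

# Stack three units and track the accumulated residuals to verify
# the recursive identity X_L = X_k + sum_{l=k}^{L-1} F(X_l, W_l).
x0 = rng.standard_normal((4, 8))
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]

x = x0
residual_sum = np.zeros_like(x0)
for w in weights:
    f = np.maximum(x, 0.0) @ w   # residual term F(X_l, W_l)
    residual_sum += f
    x = x + f                    # identity shortcut plus residual

# Deep feature = shallow feature + sum of residual functions (Eq. 3)
assert np.allclose(x, x0 + residual_sum)
```

Because the shortcut is a pure identity, the gradient of the deep feature with respect to the shallow one contains a constant term, which is the mechanism by which gradients propagate directly to the lower layers.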
Here, the ResNet consisting of RBs with identity mapping is able to learn discriminative multi-sensor features from heterogeneous datasets owing to its simplified training process, which further leads to better generalization capabilities. In this work, the heterogeneous deep features are then fused with a coupled fully connected layer and a SoftMax layer (shown in Figure 3) for classification purposes. For comprehensive investigations of deep learning feature extraction techniques (e.g., for HSI), one can further refer to [22,35].

Multi-Sensor Data Fusion via Coupled ResNets
In this paper, multi-sensor datasets are fused via coupled three-branched ResNets, as shown in Figure 3. Consider a set of heterogeneous input datasets Y_a ∈ ℝ^{n×m×a}, Y_b ∈ ℝ^{n×m×b}, and Y_c ∈ ℝ^{n×m×c}, for which various combinations of HSI, RGB, (multispectral) LiDAR, and features generated by morphological methods (e.g., extinction profiles [12,36]) are considered in this paper in order to validate the performance of the proposed framework. In more detail, n and m refer to the spatial dimensions (image height and width), and a to c are the numbers of spectral bands of the input datasets. As illustrated in Figure 1, for each pixel of the inputs, a set of image patches y_a ∈ ℝ^{s×s×a}, y_b ∈ ℝ^{s×s×b}, and y_c ∈ ℝ^{s×s×c} centered at the chosen pixel is extracted from Y_a, Y_b, and Y_c, individually. Here, s refers to the neighboring window size, which we empirically set to 24 following [24,35]. The three multi-sensor image patches are then fed into separate ResNets for residual feature learning, where each ResNet consists of three RBs. Regarding HSI classification tasks, two major challenges arise when applying supervised deep learning methods: the high heterogeneity and nonlinearity of spectral signatures, and the small number of training samples relative to the high dimensionality of HSI [21]. In this context, the nonlinear spectral signatures of the corresponding ground surfaces can be better captured by coupling networks with multi-sensor inputs (e.g., LiDAR, HSI, and RGB) [1]. By connecting lower-level features through the networks to the deeper layers, the design of such RBs provides an efficient way to train deep learning classification networks even with limited training samples.
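The patch extraction step described above can be sketched as follows. The zero-padding convention at image borders and the exact centering for the even window size s = 24 are our assumptions, since the paper does not specify them; the band counts in the toy inputs are illustrative only.

```python
import numpy as np

def extract_patch(cube, row, col, s=24):
    """Extract an s x s x bands neighborhood centered at (row, col).
    The cube is zero-padded so border pixels also yield full-size patches;
    for even s, the window sits with the pixel just below-right of center,
    one common convention (assumption, not specified in the paper)."""
    half = s // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)))
    return padded[row:row + s, col:col + s, :]

# Toy multi-sensor inputs Y_a, Y_b, Y_c with different band counts
Y_a = np.zeros((100, 120, 144))   # e.g., an HSI cube
Y_b = np.zeros((100, 120, 1))     # e.g., a LiDAR-derived DSM
Y_c = np.zeros((100, 120, 3))     # e.g., an RGB image

# One co-located patch per modality for the pixel at (10, 10)
y_a, y_b, y_c = (extract_patch(Y, 10, 10) for Y in (Y_a, Y_b, Y_c))
assert y_a.shape == (24, 24, 144) and y_c.shape == (24, 24, 3)
```

Each branch of the network then receives its own patch, so the spatial window is shared across modalities while the channel depth differs per sensor.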
Between consecutive RBs of each ResNet, a 2D max-pooling layer with a kernel size and stride of 2 is attached in order to reduce the feature variance as well as the computational complexity, halving the spatial dimension of the deep features from the previous block. Since we empirically selected 24 as the neighboring window size, each individual ResNet consists of three RBs, which are trained successively to learn discriminative multi-sensor features. In addition, the number of feature maps is doubled after each block, so the numbers of feature maps for the three RBs are {32, 64, 128}. Next, a coupled fully connected layer with the SoftMax function is adopted to fuse the learned features according to the number of classification categories. We use element-wise maximization so that the number of features remains unchanged after data fusion.
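A minimal NumPy sketch of this fusion stage: element-wise maximization over the three branch features, followed by a fully connected layer with SoftMax. The branch features and layer weights below are random placeholders standing in for the trained network, and the class count of 15 matches the Houston 2013 scheme.

```python
import numpy as np

def softmax(z):
    """Numerically stable SoftMax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Toy per-branch feature vectors after the three ResNet branches
# (128 features each, matching the last RB's feature-map count).
f_a, f_b, f_c = (rng.standard_normal((2, 128)) for _ in range(3))

# Element-wise maximization keeps the fused feature dimension at 128.
fused = np.maximum(np.maximum(f_a, f_b), f_c)
assert fused.shape == f_a.shape

# Coupled fully connected layer + SoftMax over the class categories
# (15 classes as in Houston 2013; weights are random placeholders).
w, b = rng.standard_normal((128, 15)) * 0.05, np.zeros(15)
probs = softmax(fused @ w + b)
assert np.allclose(probs.sum(axis=-1), 1.0)
```

Element-wise maximization is a parameter-free fusion rule, so adding or removing a branch does not change the width of the coupled fully connected layer.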

Auxiliary Training via Adjusted Loss Function
Besides the coupled ResNets, an auxiliary training strategy is proposed to compensate the main loss function according to the training progress of each branch during the framework training stage. Auxiliary losses are a common technique in other deep learning architectures (e.g., the Inception network [37]). In our case, given a set of training samples {y_a^i, y_b^i, y_c^i} together with ground-truth labels t_i and predicted labels t̂_i, where i = 1, 2, . . . , N and N is the number of training samples, the main model loss L_MAIN is computed by the categorical cross-entropy loss function.
Besides the main categorical cross-entropy loss, individual auxiliary loss functions L_a, L_b, and L_c, specified for the input branches {y_a^i, y_b^i, y_c^i}, are computed in a similar manner and are designed to guide the training of each input branch, respectively. Our auxiliary training strategy then adjusts the main loss using these auxiliary losses as follows:

L_AUX = L_MAIN + ε_a L_a + ε_b L_b + ε_c L_c,

where {ε_a, ε_b, ε_c} are the weights of the auxiliary losses in the overall loss function. Two main considerations govern the choice of weights: first, the auxiliary losses should help pass information through the different branches without disturbing the overall training process; second, the main loss should remain the most important term, so the weights of the auxiliary losses should be smaller than that of the main loss.
The auxiliary loss function L_AUX can be considered an intelligent regularization that helps make the features from the individual branches more accurate. More importantly, L_AUX only provides complementary information during the training phase of our framework and does not affect the testing phase.
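The adjusted loss can be sketched directly from the definitions above. This is a NumPy illustration rather than the training code; the default weights of 10⁻⁴ follow the value the paper later selects in its sensitivity analysis.

```python
import numpy as np

def categorical_cross_entropy(t, p, eps=1e-12):
    """Mean categorical cross-entropy over N samples; t holds one-hot
    ground-truth labels, p the predicted class probabilities."""
    return -np.mean(np.sum(t * np.log(p + eps), axis=-1))

def adjusted_loss(t, p_main, p_a, p_b, p_c,
                  eps_a=1e-4, eps_b=1e-4, eps_c=1e-4):
    """Overall loss: main loss plus weighted per-branch auxiliary losses,
    L_AUX = L_MAIN + eps_a*L_a + eps_b*L_b + eps_c*L_c."""
    return (categorical_cross_entropy(t, p_main)
            + eps_a * categorical_cross_entropy(t, p_a)
            + eps_b * categorical_cross_entropy(t, p_b)
            + eps_c * categorical_cross_entropy(t, p_c))

# With small positive weights, the auxiliary terms add only a mild
# regularizing penalty on top of the main loss:
t = np.eye(3)                  # 3 one-hot samples, 3 classes
p = np.full((3, 3), 1 / 3)     # uniform branch/main predictions
assert adjusted_loss(t, p, p, p, p) > categorical_cross_entropy(t, p)
```

Setting all three weights equal reflects the paper's assumption that no prior knowledge about the relative reliability of the multi-sensor inputs is available.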

Houston 2013
The Houston 2013 dataset covers an urban area of Houston, USA, and was originally distributed for the 2013 GRSS Data Fusion Contest [38]. The image size of the HSI and the LiDAR-derived data is 349 × 1905 pixels with a spatial resolution of 2.5 m. The HSI data include 144 spectral bands ranging from 0.38 to 1.05 µm; here, the HSI data are cloud-shadow removed. The Houston 2013 classification scheme comprises 15 classes in total, ranging from different vegetation types to highway features. Figure 4 shows the false-color HSI and the LiDAR-derived DSM together with the corresponding training and testing samples. The detailed numbers of training and test samples are listed in Table 1.

Houston 2018
The Houston 2018 dataset (identified as the GRSS_DFC_2018 dataset), captured over the area of the University of Houston, contains HSI, multispectral LiDAR, and very high resolution (VHR) RGB images. This dataset was originally distributed for the 2018 GRSS Data Fusion Contest [39]. In this paper, we used the training portion of the dataset. The HSI data were captured using an ITRES CASI 1500 in 48 bands with a spectral range of 380-1050 nm at a 1 m ground sampling distance (GSD). The multispectral LiDAR data were acquired using an Optech Titan MW (14SEN/CON340), which includes point cloud data at 1550, 1064, and 532 nm, intensity rasters, and DSMs at a 50 cm GSD. The RGB imagery was acquired with a VHR RGB imager (DiMAC ULTRALIGHT) with a 70 mm focal length; the VHR color image includes Red, Green, and Blue bands at a 5 cm GSD. This co-registered dataset contains 601 × 2384 pixels. Twenty classes of interest were extracted for the Houston data, and the corresponding training and test samples are given in Figure 5, which also depicts the LiDAR-derived DSM and the (downsampled) VHR RGB image. The numbers of training and testing samples used in this study are given in Table 2.

Trento
The Trento dataset was captured over a rural area south of the city of Trento, Italy. LiDAR and HSI data were acquired by the Optech ALTM 3100EA and the AISA Eagle sensor, respectively, at a spatial resolution of 1 m. The data comprise 600 × 166 pixels in 63 bands ranging from 402.89 to 989.09 nm with a spectral resolution of 9.2 nm. Six classes of interest were extracted for this dataset: Buildings, Wood, Apple trees, Roads, Vineyard, and Ground. A false-color composite of the HSI data and the corresponding training and testing samples are shown in Figure 6. The numbers of training and testing samples for the different classes of interest are given in Table 3.

Experimental Setup
To evaluate the generalized performance of the proposed data fusion framework, the aforementioned three datasets, consisting of two or three co-registered multi-sensor inputs, are explored in different ways. For the Houston 2013 and Trento datasets, the morphological EPs features of HSI and LiDAR are generated to extract the corresponding spatial and elevation information [12], and a single-branch ResNet is then used to classify HSI, LiDAR, EPs-HSI, and EPs-LiDAR, respectively. For the Houston 2018 dataset, instead of using morphological features, HSI, LiDAR, and RGB are each directly classified with a single-branch ResNet. Next, the combinations of EPs features and HSI are fused with the proposed CResNet for the Houston 2013 and Trento datasets, while the distinct combination of RGB, LiDAR, and HSI is considered for the Houston 2018 dataset in order to validate the proposed framework's generalized capability in handling highly heterogeneous input datasets.
The implementation of CResNet is based on the TensorFlow framework together with the Keras functional API. The Nesterov Adam optimizer is selected as the optimization algorithm due to its faster convergence compared with the standard stochastic gradient descent algorithm [26]; the default parameters β_1 = 0.9 and β_2 = 0.999 are used. The learning rate, number of training epochs, and batch size are set to 0.001, 200, and 64, respectively.
We evaluated the classification accuracy of our proposed framework in terms of the overall accuracy (OA), the average accuracy (AA), the Kappa coefficient, and the individual class accuracies. Since the Houston 2013 dataset is intensively used in state-of-the-art data fusion research, we compared the performance of our proposed framework with previous analyses of this dataset. Tables 4 and 5 give the results of fusing morphological EPs and HSI using CResNet for the Houston 2013 and Trento datasets, respectively, where CResNet-AUX denotes CResNet trained with the adjusted auxiliary loss function. The results are compared with those obtained from EPs-LiDAR-ResNet, EPs-HSI-ResNet, LiDAR-ResNet, and HSI-ResNet.
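For reference, all three reported metrics can be computed from a confusion matrix, as in the following sketch (rows are reference classes, columns are predicted classes); the toy matrix is illustrative only and unrelated to the paper's results.

```python
import numpy as np

def classification_metrics(cm):
    """OA, AA, and Cohen's kappa from a confusion matrix whose rows are
    reference classes and whose columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean per-class accuracy
    # Expected chance agreement from the row/column marginals
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / total**2
    kappa = (oa - pe) / (1 - pe)                 # chance-corrected agreement
    return oa, aa, kappa

# Toy 2-class confusion matrix: 40+45 correct out of 100 samples
cm = np.array([[40, 5],
               [10, 45]])
oa, aa, kappa = classification_metrics(cm)
assert abs(oa - 0.85) < 1e-9
assert abs(kappa - 0.7) < 1e-9
```

OA weights every sample equally, whereas AA weights every class equally; the Kappa coefficient discounts the agreement expected by chance from the class marginals.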

• First, it is observed that HSI-ResNet considerably outperforms LiDAR-ResNet on both datasets, which supports that the redundant spectral-spatial information of HSI has higher discriminative capability than the elevation information of the LiDAR data. However, the discriminative capability of the morphological features (EPs-HSI and EPs-LiDAR) is relatively uniform: EPs-HSI outperforms by 1.24% on Houston 2013, while EPs-LiDAR outperforms by 2.88% on the Trento dataset. The reason could be that morphological features consist of low-level, hand-crafted features, which extract informative features but also bring high redundancy into the feature space; thus, the integration of low-level hand-crafted features and high-level deep features can further boost the classification performance [24].

It has been suggested that deep learning methods need to go deeper in order to learn discriminative features [21], while the training of such methods becomes even more challenging, especially with limited training samples. In this paper, we tackle this problem by constructing a novel arrangement of RBs with identity mapping that successively passes the low-level features through the entire network.

Fusion Performance of RGB, MS LiDAR, and HSI
In this scenario, we do not use EPs. Instead, we rely on the developed deep network to extract the spatial, spectral, and elevation features from RGB, HSI, and multispectral LiDAR directly. Table 6 demonstrates the performance of CResNet for the fusion of HSI, multispectral LiDAR, and RGB. The proposed CResNet fusion framework leads to substantial improvements with respect to HSI (OA: 12.79%), LiDAR (OA: 10.36%), and RGB (OA: 11.09%). Additionally, the results show that auxiliary training further improves the OA by 0.58%. Note that the degradation of the individual accuracy for the Water class can potentially be attributed to the highly imbalanced training sample numbers listed in Table 2.

To summarize, based on the results obtained on the Houston 2018 dataset, we can validate the generalized capability of the proposed multi-sensor fusion framework. Although we use a uniform network architecture, CResNet-AUX can automatically extract informative features via RBs and simultaneously regularize the data fusion via the auxiliary loss. This could be because our CResNet actually consists of much deeper CNN layers, as shown in Figure 3, which can be fitted to different datasets and trained through residual learning. In this context, we believe that the proposed CResNet presents a new possibility for developing flexible end-to-end fusion methods even with multiple inputs from different sensor systems.

Comparison to State-of-the-Art
The Houston 2013 dataset is one of the most widely used datasets, comprising a challenging mixture of urban structures. In this context, we compare the classification performance of our proposed framework with the following state-of-the-art methods listed in Table 7: The multiple subspace feature learning method (MLRsub) in [10], the total variation component-based method (OTVCA) in [13], the sparse and low-rank component-based method (SLRCA) in [15], the deep fusion method (DeepFusion) in [23], the extinction profiles fusion via CNNs and graph-based feature fusion method (EPs-CNN) in [8], and the composite kernel-based three-stream CNNs method (CK-CNN) in [24]. All these methods including the proposed method in this paper use the benchmark sets of training and testing samples published with the dataset for the classification purpose and therefore, the classification results are fully comparable. In general, these methods can be classified into two main categories: conventional shallow methods and deep learning-based methods. The highest OA, AA, and Kappa for each of those categories are 92.45%, 92.68%, and 0.9181 obtained by OTVCA and 92.57%, 92.48%, and 0.9193 obtained by CK-CNN, for which the CResNet-AUX improves both methods by around 1% in terms of OA. This performance improvement over the state-of-the-art methods further validates the effectiveness of the proposed multi-sensor framework. In addition, the superior performance compared to existing deep learning-based methods confirmed the effectiveness of the proposed CResNet in mitigating the gradient vanishing phenomenon and discriminant feature learning from heterogeneous datasets. More importantly, with the proposed multi-sensor fusion framework, the data fusion results can be achieved automatically in an end-to-end manner.

The Performance with Respect to the Number of Training Samples
To evaluate the performance of the proposed framework with respect to the number of training samples, we randomly selected 10, 25, 50, or 100 training samples per class and repeated each experiment 10 times on the Houston 2018 dataset. Figure 10 depicts the means and standard deviations of the OA for the different numbers of training samples using CResNet and CResNet+AUX, respectively. With 10 samples per class, the OAs are below 50%, which reveals the dependency of deep learning techniques on an adequate amount of training samples. However, the gain of almost 20% in OA for both techniques when moving to 25 samples per class demonstrates the efficacy of the proposed deep learning-based fusion framework in the case of a limited number of samples. Additionally, the steadier slope of the CResNet+AUX curve compared with the CResNet curve confirms that the auxiliary training loss function makes the performance of CResNet more robust with respect to the number of samples. Moreover, CResNet+AUX outperforms CResNet in all four cases, which supports the advantage of CResNet+AUX.

Sensitivity Analysis of OA with Respect to the Weights of Auxiliary Losses
As mentioned in Section 2.3, the general network training can benefit from considering auxiliary losses from the individual branches. Here, we analyzed the sensitivity of CResNet-AUX to ε_i in terms of OA. To test the effect of different {ε_i | i = a, b, c}, we compared the classification OA on the Houston 2018 dataset for ε_i in the range {10^−1, 10^−2, 10^−3, 10^−4, 10^−5}. The weights of the individual branches are set to be identical, since we assume that no prior knowledge of the multi-sensor inputs is available. Figure 11 shows that ε_i ≥ 10^−4 is a reliable region for the selection of ε_i; we therefore empirically used 10^−4 in this paper.

Computational Cost
In addition to the classification accuracy, Table 8 reports the computational cost of the proposed framework, where training and testing times are given in minutes and seconds, respectively. All experiments were run on a workstation with 2 GeForce RTX 2080Ti graphics processing units (GPUs), each with 12 GB of memory. As shown in Table 8, CResNet consumes up to three times more processing time than an individual branch, since the networks learn simultaneously from multiple inputs. Compared with the sum of the individual branches, however, the training of CResNet is more efficient, saving up to 35% of the training time. This computational efficiency may slightly decrease with the auxiliary training strategy, because the adjusted loss function incurs additional computation. As shown in Figures 10 and 12, by compromising the training time to some extent, the adjusted auxiliary loss function yields further accuracy improvements for all three datasets; the additional computational cost is therefore justified for our proposed framework. More importantly, although training may take up to several hours, the feeding forward of the testing samples is measured in seconds, so the additional cost at inference time is negligible. To summarize, the auxiliary training design improves the general multi-sensor fusion accuracy while keeping the training time within affordable ranges.

Conclusions
In this paper, we presented a novel multi-sensor data fusion framework capable of fusing heterogeneous data types either captured by different sensor systems (e.g., HSI, LiDAR, RGB) or generated by feature extraction algorithms (e.g., extinction profiles). The designed coupled residual neural networks with auxiliary training (i.e., CResNet-AUX) consist of highly modularized residual blocks with identity mapping and an intelligent regularization strategy with adjusted auxiliary loss functions. Extensive experiments were conducted on three multi-sensor datasets (i.e., Houston 2013, Trento, and Houston 2018), and based on the classification accuracies the following outcomes were achieved:

• The proposed CResNet fusion framework outperforms all single sensor-based scenarios in the experiments for all three datasets.

• Both CResNet and CResNet-AUX outperform the state-of-the-art methods on the Houston 2013 dataset.

• The auxiliary training function boosts the performance of CResNet for all datasets, even in the case of limited training samples.

• The proposed CResNet fusion framework performs effectively when the number of training samples is limited, which is of great importance when applying deep learning techniques to remote sensing datasets.

• The experiments on computational cost justify the efficiency of the proposed algorithm considering the achievements in classification accuracy.

More importantly, the proposed CResNet-AUX is designed to be a fully automatic, generalized multi-sensor fusion framework, where the network architecture is largely independent of the input data types and not limited to specific sensor systems. Our framework is applicable to a wide range of multi-sensor datasets in an end-to-end, wall-to-wall manner.
Future work on intelligent and robust multi-sensor fusion methods may benefit from the insights produced in this paper. In further research, we propose to test the performance of our framework in large-scale applications (continental and/or planetary) and to include additional types of remote sensing data.