Article

A Multi-Sensor Fusion Framework Based on Coupled Residual Convolutional Neural Networks

1
GIScience Chair, Institute of Geography, Heidelberg University, 69120 Heidelberg, Germany
2
Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Exploration, Chemnitzer Str. 40, D-09599 Freiberg, Germany
3
The School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
4
Here+There Mapping Solutions, 10115 Berlin, Germany
*
Author to whom correspondence should be addressed.
Remote Sens. 2020, 12(12), 2067; https://doi.org/10.3390/rs12122067
Received: 22 May 2020 / Revised: 18 June 2020 / Accepted: 23 June 2020 / Published: 26 June 2020
(This article belongs to the Special Issue Advanced Multisensor Image Analysis Techniques for Land-Cover Mapping)

Abstract

Multi-sensor remote sensing image classification has been considerably improved by deep learning feature extraction and classification networks. In this paper, we propose a novel multi-sensor fusion framework for the fusion of diverse remote sensing data sources. The novelty of this paper is grounded in three important design innovations: (1) a unique adaptation of the coupled residual networks to address multi-sensor data classification; (2) a smart auxiliary training via adjusting the loss function to address classifications with limited samples; and (3) a unique design of the residual blocks to reduce the computational complexity while preserving the discriminative characteristics of multi-sensor features. The proposed classification framework is evaluated using three different remote sensing datasets: the urban University of Houston datasets (Houston 2013 and the training portion of Houston 2018) and the rural Trento dataset. The proposed framework achieves high overall accuracies of 93.57%, 81.20%, and 98.81% on the Houston 2013, the training portion of Houston 2018, and the Trento datasets, respectively. Additionally, the experimental results demonstrate considerable improvements in classification accuracies compared with the existing state-of-the-art methods.
Keywords: deep learning; data fusion; hyperspectral image classification; residual learning; multi-sensor fusion; convolutional neural networks (CNNs); auxiliary loss function

1. Introduction

Multi-sensor image analysis of remotely sensed data has become a growing area of research in recent years. Spaceborne and airborne remote sensing data streams are providing increasingly abundant data suited for earth observation and environmental monitoring [1]. The spatial, temporal and spectral capabilities of optical remote sensing systems are also increasing over time. Besides the evolution of multispectral imaging (MSI), hyperspectral imaging (HSI) [2,3,4] and light detection and ranging (LiDAR) observation platforms have also gained relevance [5,6,7]. An increasing diversity of HSI and LiDAR acquisition systems is available for terrestrial, spaceborne and airborne data collection. While MSI and HSI rely on solar radiance as a passive illumination source, LiDAR devices actively emit their own radiation for measurement. MSI and HSI systems produce pixels organized in two-dimensional bands at their respective wavelengths, while LiDAR systems measure structure via point clouds organized in three-dimensional space at their respective wavelengths. Combining such data at image or feature level yields both opportunities and challenges. For instance, fusion of HSI and LiDAR data of the same scene offers a rich feature space allowing distinct separation of observed objects based on their spectral signature and elevation characteristics [8,9]. Meanwhile, multi-sensor datasets can contain sophisticated heterogeneous data structures and different data formats or characteristics (e.g., asymptotic properties, spatial and spectral resolutions, etc.). Given the increasing availability and complexity of multi-sensor data, fusion techniques are evolving to exploit multi-source inputs in a meaningful way. This paper addresses how classification algorithms can exploit the large and growing volume of combined multi-sensor data.
Depending on the study site and classification scheme, multi-sensor feature spaces can possess unique hybrid properties introducing new challenges for the production and deployment of appropriate training data. Sources of accurate training data are often scarce, and the production is expensive, particularly for novel hybrid multi-sensor feature spaces. Therefore, conventional classification systems and networks often become less efficient for such diverse and complicated datasets. Hence, the effective fusion of heterogeneous multi-sensor data for classification applications is essential to our remote sensing research.
A wide variety of multi-sensor data fusion methods have been developed to leverage heterogeneous data sources, most prominently for HSI and LiDAR data fusion [10,11,12,13,14,15,16,17]. In [10], morphological-level features, specifically attribute profiles (APs), were embedded with a subspace multinomial logistic regression model for the fusion of HSI and LiDAR data. The capability of APs in extracting discriminating spatial features was again confirmed in [11], where extended attribute profiles (EAPs) were used to extract features from HSI and LiDAR data, respectively. Moreover, morphological extinction profiles (EPs) have been proposed to overcome the threshold determination difficulties of APs and further boost the performance of feature extraction [12]. EPs have been successfully applied to fuse HSI and LiDAR data with a total variation subspace model in [13]. Regarding supervised fusion algorithms, a large number of research works have been dedicated to the development of more robust models, for instance, a generalized graph-based fusion model in [14]; a sparse and low-rank component model in [15]; a multi-sensor composite kernel model in [16]; a decision-level fusion model based on a differential evolution method in [17]; semi-supervised graph-based fusion in [18]; and discriminant correlation analysis in [19]. One mutual objective of these fusion algorithms is to simultaneously determine the optimized classification decision boundary by considering heterogeneous feature spaces. Nevertheless, their success often requires a comprehensive understanding of sensor systems and individual domain expertise, and hand-crafted morphological features are naturally redundant and may still suffer from problems such as the curse of dimensionality, also known as the Hughes phenomenon [20].
More recently, the rapid development of deep learning techniques has led to an explosive growth in the field of remote sensing image processing, especially the classification of HSI [21]. Deep learning models, especially convolutional neural networks (CNNs), open up a new possibility for invariant feature learning of HSI data, from hand-crafted to end-to-end, from manual configurations to fully automatic, from shallow to deep [22].
At the same time, various research efforts are developing novel multi-sensor fusion approaches based on deep learning [23,24,25,26,27,28]. Among the first studies, in [23], a deep fusion model was designed for the fusion of HSI and LiDAR data, where CNNs served as both feature extractor and classifier. In [24], the joint use of HSI and LiDAR data was further explored by combining morphological EPs and high-level deep features via a composite kernel (CK) technique. In [25], a dual-branch CNN was proposed to learn spectral-spatial and elevation features from HSI and LiDAR, respectively, and all features were then fused via a cascaded network. Besides the fusion of HSI and LiDAR data, a similarly superior performance of deep learning models was also confirmed in [26], where Landsat-8 and Sentinel-2 satellite images were fed into a two-branch residual convolutional neural network (ResNet) for local climate zone classification. However, the training of such deep learning fusion models can be challenging: deep fusion models mostly require sophisticated network designs with more parameters to handle multi-sensor inputs simultaneously, and network training becomes more difficult as the network becomes deeper [29].
Fortunately, these issues can be mitigated using the residual learning technique, where low-level features are successively passed to deeper layers via identity mapping [30]. Based on this approach, we propose a novel multi-sensor fusion framework via designing multi-branched coupled residual convolutional neural networks, namely CResNet. Moreover, the proposed framework is designed to be a generalized deep fusion framework, where the inputs are not limited to specific sensor systems. To this end, the proposed framework is designed to automatically fuse different types of multi-sensor datasets.
The proposed CResNet mainly consists of three individual ResNet branches along with coupled fully connected layers for data fusion. In contrast to [24], which requires a separate training step for the CK classifiers, the proposed CResNet is trained in an end-to-end manner, which lowers the computational complexity during data fusion. To highlight the generalized fusion capability of CResNet, we test the proposed framework on three distinct multi-sensor datasets with inputs ranging from HSI and RGB to LiDAR feature spaces, and with various land cover classes. The major contributions of this paper are threefold:
  • The proposed CResNet adopts novel residual blocks (RBs) with identity mapping to address the gradient vanishing phenomenon and promotes the discriminant feature learning from multi-sensor datasets.
  • The design of coupling individual ResNet with auxiliary loss enables the CResNet to simultaneously learn representative features from each dataset by considering an adjusted loss function, and fuse them in a fully automatic end-to-end manner.
  • Considering that CResNet is highly modularized and flexible, the proposed framework leads to competitive data fusion performance on three commonly used multi-sensor datasets, where state-of-the-art classification accuracy is achieved using limited training samples.
Section 2 describes the concept of residual feature learning and introduces the detailed architecture of the CResNet. The data descriptions and experimental setups are reported in Section 3. Then, Section 4 is devoted to the discussion of experiment results on three multi-sensor datasets. The main conclusions are summarized in Section 5.

2. Methodology

We present the structure of the proposed CResNet as shown in Figure 1. The fusion framework can be divided into three main components: feature learning via residual blocks, multi-sensor data fusion via coupled ResNet, and auxiliary training via an adjusted loss function. Although there is no limit in the number of datasets being fused using the proposed method, we evaluate the framework by applying it on three co-registered datasets for multi-sensor data fusion and classification.

2.1. Feature Learning via Residual Blocks

Recently, ResNet has become a popular deep learning technique [29], and has achieved significant classification performance on heterogeneous remote sensing datasets [31,32], where multi-sensor data sources (e.g., HSI, MSI, LiDAR) have been intensively investigated. Residual blocks (RBs), the characteristic architecture of ResNet, were proposed to alleviate the gradient vanishing and explosion issues of CNNs during training [29]. By solving the optimization degradation issue, such blocks are found to be helpful in terms of training accuracy, which is a prerequisite for testing and validation accuracies. In this paper, ResNets with multiple RBs are selected as the base feature learning networks, which are later aggregated into a generalized multi-branched data fusion network. As shown in Figure 2, a residual block can be considered to be an extension of several convolutional layers, where gradients in the deeper layers can be propagated back to the lower layers via identity mapping. Note that identity mapping was proposed in [30] to further improve the training and regularization of the original ResNet design in [29].
Within each RB, we follow the design in [30] and have three successive convolutional layers with kernel sizes of $1 \times 1 \times m$, $5 \times 5 \times m$, and $1 \times 1 \times m$, respectively, where $m$ refers to the number of feature maps. Such successive layers are also called a bottleneck design, consisting of a $1 \times 1 \times m$ layer for dimension reduction, a $5 \times 5 \times m$ convolution layer, and a $1 \times 1 \times m$ layer for restoring the dimension, with which we can reduce the model complexity and thus obtain a more computationally efficient model [29]. $X_k$ and $X_{k+1}$ refer to the input and output feature spaces of the RBs, respectively, and their feature sizes are kept unchanged via a valid padding strategy. More importantly, by applying the identity mapping with full pre-activation feature spaces into deeper layers [30], the functionality of RBs is further formulated as follows:
$$X_{k+1} = X_k + F(X_k, W_k) \tag{1}$$
where $X_k$ refers to the feature maps of the $k$-th layer, and $W_k$ are the weights and biases of the $k$-th layer in the RBs. The function $F$ is the pre-activation function, which combines the batch normalization function (BN) [33] and the nonlinear activation function (ReLUs) [34] in order to improve the speed and stability of the proposed CResNet.
Figure 2 shows how the full pre-activation shortcut connection is a direct channel for information to propagate in both directions, forward and backward. Hence, the training process of such RBs is simplified and leads to improved generalization capabilities. One of the key characteristics of the full pre-activation shortcut becomes more obvious when multiple RBs are trained successively, so that we can recursively formulate the feature spaces as follows:
$$X_{k+2} = X_{k+1} + F(X_{k+1}, W_{k+1}) = X_k + F(X_k, W_k) + F(X_{k+1}, W_{k+1}), \tag{2}$$
where $W_k$ are the weights and biases of the $k$-th layer in the RBs. Next, based on these recursive feature spaces, Equation (1) evolves as follows:
$$X_L = X_k + \sum_{l=k}^{L-1} F(X_l, W_l) \tag{3}$$
Hence, the feature space of any deeper layer $L$ can be formulated as the feature space of any lower layer $k$ plus a collection of convolutional functions $\sum_{l=k}^{L-1} F$. Moreover, this characteristic ensures the backward propagation of model gradients into lower layers as well, benefitting the overall feature learning with heterogeneous remote sensing datasets. For a more detailed description of full pre-activation identity mapping, please refer to [30].
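The relation between the block-by-block recursion and its unrolled sum can be checked numerically. The following is a minimal NumPy sketch, not the CResNet implementation: `residual_branch` is a hypothetical stand-in for $F$ (a ReLU followed by a linear map instead of BN, ReLU, and convolutions), and the array sizes are arbitrary.

```python
import numpy as np

def residual_branch(x, w):
    """Toy stand-in for F(X_l, W_l): ReLU followed by a linear map (an assumption)."""
    return np.maximum(x, 0.0) @ w

def forward_recursive(x_k, weights):
    """Apply X_{l+1} = X_l + F(X_l, W_l) block by block.

    Returns the final output and the input seen by every block.
    """
    inputs = []
    x = x_k
    for w in weights:
        inputs.append(x)
        x = x + residual_branch(x, w)
    return x, inputs

rng = np.random.default_rng(0)
x_k = rng.normal(size=(4, 8))  # 4 samples, 8 features (hypothetical sizes)
weights = [0.1 * rng.normal(size=(8, 8)) for _ in range(3)]  # three stacked blocks

x_L, block_inputs = forward_recursive(x_k, weights)

# Unrolled form: X_L = X_k + sum over l of F(X_l, W_l)
x_L_unrolled = x_k + sum(
    residual_branch(x, w) for x, w in zip(block_inputs, weights)
)
print(np.allclose(x_L, x_L_unrolled))
```

Because the shortcut adds $X_k$ unchanged, the input of the first block appears as a plain additive term in the output of the last block, which is the property the text relies on for gradient flow.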
Here, the ResNet consisting of RBs with identity mapping is able to learn discriminative multi-sensor features from heterogeneous datasets due to its simplified training process, which further leads to better generalization capabilities. In this work, heterogeneous deep features are then fused with a coupled fully connected layer and a SoftMax layer (shown in Figure 3) for classification purposes. For comprehensive investigations of deep learning feature extraction techniques (e.g., for HSI), one can further refer to [22,35].

2.2. Multi-Sensor Data Fusion via Coupled ResNets

In this paper, multi-sensor datasets are fused via coupled three-branched ResNets as shown in Figure 3. The heterogeneous input datasets are denoted $Y_a \in \mathbb{R}^{n \times m \times a}$, $Y_b \in \mathbb{R}^{n \times m \times b}$, and $Y_c \in \mathbb{R}^{n \times m \times c}$; various combinations of HSI, RGB, (multispectral) LiDAR, and features generated by morphological methods (e.g., extinction profiles [12,36]) are considered as inputs in this paper in order to validate the performance of the proposed framework. In more detail, $n$ and $m$ refer to the spatial dimensions of image height and width, and $a$, $b$, and $c$ are the numbers of spectral bands of the input datasets.
As illustrated in Figure 1, for each pixel of the inputs, a set of image patches $y_a \in \mathbb{R}^{s \times s \times a}$, $y_b \in \mathbb{R}^{s \times s \times b}$, and $y_c \in \mathbb{R}^{s \times s \times c}$ centered at the chosen pixel is extracted from $Y_a$, $Y_b$, and $Y_c$, individually. Here, $s$ refers to the neighboring window size, which we empirically set to 24 according to [24,35]. Then the three multi-sensor image patches are fed into separate ResNets for residual feature learning, where each ResNet consists of three RBs. Regarding the classification of HSI, two major challenges are identified when applying supervised deep learning classification methods: the high heterogeneity and nonlinearity of spectral signatures, and the small number of training samples relative to the high dimensionality of HSI [21]. In this context, the nonlinear spectral signature of the corresponding ground surfaces can be better captured by coupling networks with multi-sensor inputs (e.g., LiDAR, HSI, and RGB) [1]. By connecting the lower features through the networks to the deeper layers, the design of such RBs provides an efficient way to train the deep learning classification networks even with limited training samples.
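The patch extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `extract_patch`, the reflect padding at image borders, and the convention for centering an even-sized window are all hypothetical choices that the text does not specify.

```python
import numpy as np

def extract_patch(image, row, col, s=24):
    """Return an s x s x bands patch around (row, col) from an n x m x bands array.

    Assumptions: borders are handled by reflect padding, and for even s the
    patch spans rows row - s//2 ... row + s//2 - 1 (likewise for columns).
    """
    half = s // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so the slice below starts half a window before the chosen pixel.
    return padded[row:row + s, col:col + s, :]

rng = np.random.default_rng(1)
# Three co-registered toy inputs sharing the same height and width.
hsi = rng.random((30, 40, 5))
lidar = rng.random((30, 40, 1))
rgb = rng.random((30, 40, 3))

patches = [extract_patch(x, 0, 0) for x in (hsi, lidar, rgb)]
print([p.shape for p in patches])
```

Because the inputs are co-registered, the same `(row, col)` index selects the same ground location in every dataset, so the three patches can be fed to the three branches as one training sample.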
Between the RBs of each ResNet, a 2D max-pooling layer with a kernel size and a stride of 2 is attached in order to reduce the feature variance as well as the computational complexity, which halves the spatial dimensions of the deep features from the previous layer. In addition, since we empirically selected 24 as the neighboring window size, each individual ResNet consists of three RBs. With such a design, three RBs are trained successively to learn discriminative multi-sensor features. Moreover, the number of feature maps is doubled after each block towards the deeper blocks; the numbers of feature maps for the three RBs are $\{32, 64, 128\}$. Next, a coupled fully connected layer with the SoftMax function is adopted to fuse the learned features according to the total number of classification categories. We use element-wise maximization to keep the number of features unchanged after data fusion.
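Two operations from this paragraph can be illustrated in isolation: 2 × 2 max pooling with stride 2, and fusion by element-wise maximization. The sketch below uses NumPy with hypothetical function names and toy feature sizes; it only demonstrates the shape behavior described in the text and is not the network itself.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 on an (h, w, c) array with even h and w."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fuse_elementwise_max(features):
    """Fuse equally shaped branch features by taking the element-wise maximum."""
    return np.maximum.reduce(features)

rng = np.random.default_rng(2)

# Pooling halves each spatial dimension of a 24 x 24 feature map.
feature_map = rng.random((24, 24, 32))
pooled = max_pool_2x2(feature_map)
print(pooled.shape)

# Three branch feature vectors of length 128 (toy values).
branch_features = [rng.random(128) for _ in range(3)]
fused = fuse_elementwise_max(branch_features)
concatenated = np.concatenate(branch_features)
print(fused.shape, concatenated.shape)
```

The fused vector keeps the length of a single branch, whereas concatenation would triple it, which is the reason the text gives for choosing element-wise maximization.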

2.3. Auxiliary Training via Adjusted Loss Function

Besides the coupled ResNets, an auxiliary training strategy is proposed to compensate the main loss function according to the training progress of each branch during the framework training stage. The auxiliary loss is a common technique used in other deep learning architectures (e.g., the Inception network [37]). In our case, given a set of training samples $\{y_a^i, y_b^i, y_c^i\}$ together with ground-truth labels $t_i$ and predicted labels $\hat{t}_i$, where $i = 1, 2, \ldots, N$ and $N$ is the number of training samples, the main model loss can be computed by the categorical cross-entropy loss function:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ t_i \log \hat{t}_i + (1 - t_i) \log (1 - \hat{t}_i) \right] \tag{4}$$
Besides the main categorical cross-entropy loss, individual auxiliary loss functions specified for the different input branches $\{y_a^i, y_b^i, y_c^i\}$ are computed in a similar manner, where $L_a$, $L_b$, and $L_c$ are designed to guide the training process of each input dataset, respectively. Moreover, our auxiliary training strategy further adjusts the main loss using these auxiliary losses as follows:
$$L_{AUX} = L + \varepsilon_a \times L_a + \varepsilon_b \times L_b + \varepsilon_c \times L_c \tag{5}$$
where $\{\varepsilon_a, \varepsilon_b, \varepsilon_c\}$ are the weights of the auxiliary losses in the overall loss function. To set up the weights, there are two main considerations: first, the auxiliary losses should help in passing information through the different branches while being prevented from disturbing the overall training process; second, the main loss should be the most important, thus the weights of the auxiliary losses should be smaller than that of the main loss. In this paper, we empirically set $\varepsilon_i = 10^{-4}$ for $i \in \{a, b, c\}$.
The auxiliary loss function $L_{AUX}$ can be considered to be an intelligent regularization that helps to make features from individual branches more accurate. More importantly, $L_{AUX}$ only provides complementary information during the training phase of our framework, not affecting the testing phase.
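The adjusted loss can be written down in a few lines. The following NumPy sketch is illustrative only: the function names are hypothetical, the per-sample term is assumed to be summed over classes for one-hot labels, and the clipping constant `tiny` is added purely for numerical safety.

```python
import numpy as np

def cross_entropy(t, t_hat, tiny=1e-12):
    """Cross-entropy in the form given in the text, averaged over N samples.

    t: one-hot labels, shape (N, C); t_hat: predicted probabilities, shape (N, C).
    Assumption: the per-sample term is summed over the C classes.
    """
    t_hat = np.clip(t_hat, tiny, 1.0 - tiny)
    per_sample = np.sum(t * np.log(t_hat) + (1.0 - t) * np.log(1.0 - t_hat), axis=1)
    return -np.mean(per_sample)

def auxiliary_loss(t, main_pred, branch_preds, weights=(1e-4, 1e-4, 1e-4)):
    """L_AUX = L + sum over branches of eps_j * L_j."""
    main = cross_entropy(t, main_pred)
    aux = sum(w * cross_entropy(t, p) for w, p in zip(weights, branch_preds))
    return main + aux

def softmax(z):
    """Row-wise softmax used only to build toy predictions."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
labels = np.eye(3)[[0, 1, 2, 0]]  # four samples, three classes
fused_pred = softmax(rng.normal(size=(4, 3)))
branch_preds = [softmax(rng.normal(size=(4, 3))) for _ in range(3)]

main_loss = cross_entropy(labels, fused_pred)
total_loss = auxiliary_loss(labels, fused_pred, branch_preds)
print(main_loss, total_loss)
```

With weights of $10^{-4}$ the branch terms shift the total only slightly, which matches the stated intent that the main loss remains dominant while each branch still receives its own training signal.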

3. Experiment

3.1. Data Descriptions

3.1.1. Houston 2013

The Houston 2013 dataset covers an urban area of Houston, USA, and was originally distributed for the 2013 GRSS Data Fusion Contest [38]. The image size of the HSI and LiDAR-derived data is 349 × 1905 pixels with a spatial resolution of 2.5 m. The HSI data include 144 spectral bands, which range from 0.38 to 1.05 μm. Here, the cloud shadows in the HSI data have been removed. The Houston 2013 dataset has 15 classes in total, which range from different vegetation types to highway features. Figure 4 shows the false color HSI and the LiDAR-derived DSM together with the corresponding training and testing samples. The detailed numbers of training and test samples are listed in Table 1.

3.1.2. Houston 2018

The Houston 2018 dataset (identified as the GRSS_DFC_2018 dataset), captured over the area of the University of Houston, contains HSI, multispectral LiDAR, and very high resolution (VHR) RGB images. This dataset was originally distributed for the 2018 GRSS Data Fusion Contest [39]. In this paper, we used the training portion of the dataset. The HSI dataset was captured using an ITRES CASI 1500 in 48 bands with a spectral range of 380–1050 nm at a 1 m ground sampling distance (GSD). The multispectral LiDAR data were acquired using an Optech Titan MW (14SEN/CON340) and include point cloud data at 1550, 1064, and 532 nm, intensity rasters, and DSMs at a 50 cm GSD. The RGB data were acquired with a VHR RGB imager (DiMAC ULTRALIGHT) with a 70 mm focal length. The VHR color image includes Red, Green, and Blue bands at a 5 cm GSD. This co-registered dataset contains 601 × 2384 pixels. Twenty classes of interest were extracted for the Houston data, and the corresponding training and test samples are given in Figure 5. Figure 5 also depicts the LiDAR-derived DSM and the VHR RGB image (downsampled). The numbers of training and testing samples used in this study are given in Table 2.

3.1.3. Trento

The Trento dataset was captured over a rural area in the south of the city of Trento, Italy. LiDAR and HSI data were acquired by the Optech ALTM 3100EA and the AISA Eagle sensor, respectively. The data have a spatial resolution of 1 m. The data size is 600 × 166 pixels, and the HSI consists of 63 bands ranging from 402.89 to 989.09 nm with a spectral resolution of 9.2 nm. Six classes of interest were extracted for this dataset, including Buildings, Wood, Apple trees, Roads, Vineyard, and Ground. A false color composite of the HSI data and the corresponding training and testing samples are shown in Figure 6. The numbers of training and testing samples for the different classes of interest are given in Table 3.

3.2. Experimental Setup

To evaluate the generalized performance of the proposed data fusion framework, the aforementioned three datasets, each consisting of two or three co-registered multi-sensor inputs, are explored in different ways. In detail, for the Houston 2013 and Trento datasets, the morphological EP features of HSI and LiDAR are generated to extract the corresponding spatial and elevation information [12], then a single-branch ResNet is used to classify HSI, LiDAR, EPs-HSI, and EPs-LiDAR, respectively. For the Houston 2018 dataset, instead of using morphological features, HSI, LiDAR, and RGB are directly classified with a single-branch ResNet, respectively. Next, the combinations of EP features and HSI are fused with the proposed CResNet for the Houston 2013 and Trento datasets, while a distinct combination of RGB, LiDAR, and HSI is considered for the Houston 2018 dataset in order to validate the proposed framework's generalized capability in handling highly heterogeneous input datasets.
The implementation of CResNet is based on the TensorFlow framework together with the Keras functional API. The Nesterov Adam optimizer is selected as the optimization algorithm for our ResNet due to its faster convergence compared with the standard stochastic gradient descent algorithm [26], where the default parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are used. The learning rate, number of training epochs, and batch size are set to 0.001, 200, and 64, respectively.
We evaluated the classification accuracy of our proposed framework with respect to the overall accuracy (OA), the average accuracy (AA), the Kappa coefficient, and individual class accuracy. Since the Houston 2013 dataset is intensively used in the state-of-the-art data fusion research, we thus compared the performance of our proposed framework with previous analyses on this dataset.
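For reference, the three summary metrics can be computed from a confusion matrix as in the sketch below. The function name and the toy matrix are illustrative; rows are assumed to hold the reference classes and columns the predicted classes.

```python
import numpy as np

def classification_scores(confusion):
    """Return (OA, AA, kappa) from a C x C confusion matrix.

    Rows are reference classes and columns are predicted classes.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    oa = np.trace(confusion) / total
    # Accuracy of each reference class, then their unweighted mean.
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    aa = per_class.mean()
    # Chance agreement estimated from the row and column marginals.
    expected = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / total**2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy two-class example: 10 reference samples per class.
oa, aa, kappa = classification_scores([[8, 2], [1, 9]])
print(oa, aa, kappa)
```

OA weights every sample equally, while AA weights every class equally, so the two can diverge noticeably when class sizes are imbalanced.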

4. Discussion

4.1. Classification Results

4.1.1. Fusion Performance of Morphological EPs and HSI

Table 4 and Table 5 give the results of the fusion of morphological EPs and HSI using CResNet for the Houston 2013 and Trento datasets, respectively. CResNet-AUX denotes CResNet trained with the adjusted auxiliary loss function. The results are compared with the results obtained from EPs-LiDAR-ResNets, EPs-HSI-ResNets, LiDAR-ResNets, and HSI-ResNets.
  • First, it is observed that HSI-ResNet considerably outperforms LiDAR-ResNet for both datasets, which supports the view that the redundant spectral-spatial information of HSI has higher discriminative capability than the elevation information of LiDAR data. However, we notice that the discriminative capability of the morphological features (EPs-HSI and EPs-LiDAR) is more uniform: EPs-HSI outperforms EPs-LiDAR by 1.24% in the Houston 2013 dataset, whereas EPs-LiDAR outperforms EPs-HSI by 2.88% in the Trento dataset. The reason behind this could be that morphological features consist of low-level features based on hand-crafted feature engineering, which not only extracts informative features but also brings high redundancy into the feature space; thus, the integration of low-level hand-crafted features and high-level deep features can further boost the classification performance [24].
  • Second, the fusion of EPs and HSI with CResNet+AUX achieves the best OA (93.57% and 98.81%), AA (93.44% and 94.50%) in both datasets, again confirming the capability and effectiveness of the proposed framework in invariant feature learning from both low-level morphological features and high-level deep features.
  • Finally, we observe a consistent improvement of classification accuracy when training ResNet with the adjusted auxiliary loss function. In the Houston 2013 dataset, CResNet-AUX outperforms the original CResNet by producing the highest OA (93.57%) and AA (93.44%) as well as a kappa value of 0.9302. Similar findings are also observed in the Trento dataset. As explained in Section 2.3, the performance boost can be attributed to the design of our auxiliary training strategy, where the overall loss function is regularized with the complementary losses from each individual dataset.
Figure 7 and Figure 8 show classifications corresponding to the aforementioned methods for the Houston 2013 and Trento datasets, respectively. The Houston 2013 dataset is characterized as complex urban structures and mixed residential and commercial areas. From Figure 7a–d, it is shown that single input features are insufficient in distinguishing categories like Highway and Parking lot, for which the multi-sensor fusion methods (Figure 7e,f) are able to produce accurate classification results. In this context, the similar visualization patterns in a rural region of Trento can be obtained, where homogeneous Vineyard is successfully depicted.
It has been suggested that deep learning methods need to go deeper in order to learn discriminative features [21], while the training of such methods can become even more challenging, especially with limited training samples. In this paper, we tackle this problem by constructing a novel arrangement of RBs with identity mapping that successively passes the low-level features through the entire network.

4.1.2. Fusion Performance of RGB, MS LiDAR, and HSI

In this scenario, we do not use EPs. Instead, we rely on the developed deep network to extract the spatial, spectral, and elevation features from RGB, HSI, and multispectral LiDAR. Table 6 demonstrates the performance of CResNet for the fusion of HSI, multispectral LiDAR, and RGB. The proposed CResNet fusion framework leads to substantial improvements with respect to HSI (OA: 12.79%), LiDAR (OA: 10.36%), and RGB (OA: 11.09%). Additionally, the results show that the auxiliary training further improves the OA by 0.58%. Note that the degradation of the individual accuracy in the Water class can potentially be attributed to the high imbalance of training sample numbers as listed in Table 2.
Figure 9 shows the classifications obtained by different techniques for the Houston 2018 dataset. There are relatively well-mapped ground-truth samples extracted from the original GRSS_DFC_2018 training dataset as shown in Figure 9a. By comparing Figure 9e,f with Figure 9b–d, the improved classifications using CResNet can be observed compared to the other techniques, especially for categories like healthy grass and commercial, where large commercial blocks and grassland are well delineated.
To summarize, the results obtained on the Houston 2018 dataset validate the generalized capability of the proposed multi-sensor fusion framework. Although we use a uniform network architecture, CResNet-AUX can automatically extract informative features via RBs and simultaneously regularize the data fusion via the auxiliary loss function. The reason could be that our CResNet actually consists of much deeper CNN layers, as shown in Figure 3, which can be fitted to different datasets and trained through residual learning. In this context, we believe that the proposed CResNet presents a new possibility for developing flexible end-to-end fusion methods even with multiple inputs from different sensor systems.

4.2. Comparison to State-of-the-Art

The Houston 2013 dataset is one of the most widely used datasets, comprising a challenging mixture of urban structures. In this context, we compare the classification performance of our proposed framework with the following state-of-the-art methods listed in Table 7: The multiple subspace feature learning method (MLRsub) in [10], the total variation component-based method (OTVCA) in [13], the sparse and low-rank component-based method (SLRCA) in [15], the deep fusion method (DeepFusion) in [23], the extinction profiles fusion via CNNs and graph-based feature fusion method (EPs-CNN) in [8], and the composite kernel-based three-stream CNNs method (CK-CNN) in [24]. All these methods including the proposed method in this paper use the benchmark sets of training and testing samples published with the dataset for the classification purpose and therefore, the classification results are fully comparable.
In general, these methods can be classified into two main categories: conventional shallow methods and deep learning-based methods. The highest OA, AA, and Kappa in these two categories are 92.45%, 92.68%, and 0.9181 obtained by OTVCA, and 92.57%, 92.48%, and 0.9193 obtained by CK-CNN, respectively; CResNet-AUX improves on both methods by around 1% in terms of OA. This performance improvement over the state-of-the-art methods further validates the effectiveness of the proposed multi-sensor framework. In addition, the superior performance compared to existing deep learning-based methods confirms the effectiveness of the proposed CResNet in mitigating the gradient vanishing phenomenon and in discriminant feature learning from heterogeneous datasets. More importantly, with the proposed multi-sensor fusion framework, the data fusion results can be achieved automatically in an end-to-end manner.

4.3. The Performance with Respect to the Number of Training Samples

To evaluate the performance of the proposed framework with respect to the number of training samples, we randomly selected 10, 25, 50, or 100 training samples per class and repeated the experiment 10 times on the Houston 2018 dataset. Figure 10 depicts the means and standard deviations of OA for the different numbers of training samples using CResNet and CResNet-AUX, respectively. With 10 samples per class, the OAs are below 50%, which reveals the dependency of deep learning techniques on an adequate amount of training samples. However, the gain of almost 20% in OA for both techniques with 25 samples per class demonstrates the efficacy of the proposed deep learning-based fusion framework when the number of samples is limited. Additionally, the steadier increase of the CResNet-AUX curve compared with the CResNet curve indicates that the auxiliary training loss function makes the performance of CResNet more robust with respect to the number of samples. Moreover, CResNet-AUX outperforms CResNet in all four cases, which supports the advantage of CResNet-AUX.
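The sampling protocol described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `train_and_score` is a hypothetical callback that trains a model on the selected indices and returns its OA, and the sample standard deviation is used because the text does not say which estimator was reported.

```python
import random
import statistics
from collections import defaultdict


def sample_per_class(labels, n_per_class, rng):
    """Pick up to n_per_class random sample indices from every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        # Classes with fewer samples than requested contribute all of them.
        chosen.extend(rng.sample(pool, min(n_per_class, len(pool))))
    return chosen


def repeated_oa(labels, n_per_class, train_and_score, repeats=10, seed=0):
    """Repeat the random selection and return (mean OA, std of OA)."""
    rng = random.Random(seed)
    scores = [
        train_and_score(sample_per_class(labels, n_per_class, rng))
        for _ in range(repeats)
    ]
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling `repeated_oa` once for each of 10, 25, 50, and 100 samples per class would yield the kind of mean and spread that Figure 10 plots.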

4.4. Sensitivity Analysis of OA with Respect to the Weights of Auxiliary Losses

As mentioned in Section 2.3, the general network training can benefit from considering auxiliary losses from individual branches. Here, we analyzed the sensitivity of CResNet-AUX with respect to $\varepsilon_i$ in terms of OA. To test the effect of different $\{\varepsilon_i \mid i = a, b, c\}$, we compared the classification OA for the Houston 2018 dataset by selecting $\varepsilon_i$ from $\{10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}\}$. In addition, the weights of the individual branches are set to be identical, since we assume that no prior knowledge of the multi-sensor inputs is available. Figure 11 shows that $10^{4}$ lies in a confident region for the selection of $\varepsilon_i$. We therefore empirically used $10^{4}$ in this paper.
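To make the role of the weights concrete, the sketch below combines a main loss with identically weighted auxiliary losses from the three branches. It assumes that the adjusted loss is a weighted sum of the main and branch losses, which this section does not restate; the function names and the numbers in the usage example are purely illustrative and are not the paper's settings.

```python
def total_loss(main_loss, aux_losses, eps):
    """Main loss plus auxiliary branch losses sharing one weight `eps`.

    Assumed form: L = L_main + sum_i eps * L_aux_i, with i in {a, b, c}.
    """
    return main_loss + sum(eps * loss for loss in aux_losses.values())


def pick_weight(oa_by_eps):
    """Return the candidate weight with the highest validation OA."""
    return max(oa_by_eps, key=oa_by_eps.get)


if __name__ == "__main__":
    # Illustrative loss values for branches a, b, c (not taken from the paper).
    branch_losses = {"a": 0.4, "b": 0.2, "c": 0.6}
    print(total_loss(1.0, branch_losses, eps=0.5))
```

A sensitivity analysis such as the one in Figure 11 would train once per candidate weight, record the OA of each run, and pass the resulting mapping to `pick_weight`.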

4.5. Computational Cost

In addition to the classification accuracy, Table 8 reports the computational cost of the proposed framework, where training and testing times are given in minutes and seconds, respectively. All experiments were run on a workstation with 2 GeForce RTX 2080Ti graphics processing units (GPUs), each with 12 GB memory.
As shown in Table 8, CResNet consumes up to three times more processing time than the individual branches since the networks learn from multiple inputs simultaneously. Compared to the summed training time of the individual branches, however, the training of CResNet is more efficient, saving up to 35% of the training time. This computational efficiency may decrease slightly when the auxiliary training strategy is applied because the adjusted loss function adds computational cost. As shown in Figure 10 and Figure 12, in exchange for this somewhat longer training time, the adjusted auxiliary loss function leads to further accuracy improvement for all three datasets. Therefore, the additional computational cost is justified for our proposed framework. More importantly, although training may take up to several hours, the feed-forward pass over the testing samples is measured in seconds, so the additional cost at test time is negligible. To summarize, the auxiliary training design can improve general multi-sensor fusion accuracy while keeping the training time within an affordable range.
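The saving relative to training the branches separately can be checked with a few lines of arithmetic on the Trento rows of Table 8. This is a sketch under one assumption: the three fused branches are taken to be HSI, EPs-HSI, and EPs-LiDAR, inferred from the feature counts (63+225+71) listed in Table 5. The function name is hypothetical.

```python
# Training times in minutes, copied from the Trento rows of Table 8.
trento_train_min = {
    "HSI-ResNet": 5.69,
    "LiDAR-ResNet": 5.04,
    "EPs-HSI-ResNet": 6.88,
    "EPs-LiDAR-ResNet": 5.79,
    "CResNet": 11.73,
    "CResNet-AUX": 13.66,
}

# Assumption: the fused inputs are the 63 + 225 + 71 features of Table 5.
BRANCHES = ("HSI-ResNet", "EPs-HSI-ResNet", "EPs-LiDAR-ResNet")


def saving_vs_separate(times, fused, branches=BRANCHES):
    """Fraction of training time saved compared with the summed branch times."""
    separate = sum(times[name] for name in branches)
    return 1.0 - times[fused] / separate


if __name__ == "__main__":
    for model in ("CResNet", "CResNet-AUX"):
        print(model, round(saving_vs_separate(trento_train_min, model), 3))
```

Under this assumption the saving is about 0.36 for CResNet and about 0.26 for CResNet-AUX on Trento, which is in the neighborhood of the figure quoted above and shows how the auxiliary loss trades some of that saving for accuracy.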

5. Conclusions

In this paper, we presented a novel multi-sensor data fusion framework that is capable of fusing heterogeneous data types either captured by different sensor systems (e.g., HSI, LiDAR, RGB) or generated by feature extraction algorithms (e.g., extinction profiles). The designed coupled residual neural networks with auxiliary training (i.e., CResNet-AUX) consist of highly modularized residual blocks with identity mapping and a regularization strategy based on adjusted auxiliary loss functions. Extensive experiments were conducted on three multi-sensor datasets (i.e., Houston 2013, Trento, and Houston 2018), and based on the classification accuracies the following outcomes have been achieved:
  • The proposed CResNet fusion framework outperforms all the single sensor-based scenarios in the experiments for all three datasets.
  • CResNet-AUX outperforms the state-of-the-art methods for the Houston 2013 dataset, while CResNet without auxiliary training performs comparably to them (Table 7).
  • The auxiliary training function boosts the performance of CResNet for all the datasets even for the case of limited training samples.
  • The proposed CResNet fusion framework shows effective performance when the number of training samples is limited, which is of great importance in the case of applying deep learning techniques for remote sensing datasets.
  • The experiments regarding the computational cost justify the efficiency of the proposed algorithm considering the gains in classification accuracy.
More importantly, the proposed CResNet-AUX is designed to be a fully automatic, generalized multi-sensor fusion framework, where the network architecture is largely independent of the input data types and not limited to specific sensor systems. Our framework is applicable to a wide range of multi-sensor datasets in an end-to-end, wall-to-wall manner.
Future work on developing intelligent and robust multi-sensor fusion methods may benefit from the insights produced in this paper. In further research, we propose to test the performance of our framework on a large-scale application (continental and/or planetary) and to include additional types of remote sensing data.

Author Contributions

H.L. and P.G. conceived and designed the work. H.L. and Z.W. designed and performed the experiments. H.L. and B.R. analyzed and interpreted the results. H.L., B.R., and A.S. wrote the manuscript. P.G., M.S., and A.Z. supervised the work. All authors revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors would like to thank Lorenzo Bruzzone of the University of Trento for providing the Trento dataset, the National Center for Airborne Laser Mapping (NCALM) at the University of Houston for providing the Houston dataset, and the IEEE GRSS Image Analysis and Data Fusion Technical Committee for distributing the Houston 2013 and 2018 datasets. The work of Behnood Rasti was funded by the Alexander von Humboldt Foundation. This research was partly funded by the Klaus Tschira Stiftung, Heidelberg.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.; Höfle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.; Anders, K.; Gloaguen, R.; et al. Multisource and Multitemporal Data Fusion in Remote Sensing: A Comprehensive Review of the State of the Art. IEEE Geosci. Remote Sens. Mag. 2019, 7, 6–39.
  2. Goetz, A.F.H.; Vane, G.; Solomon, J.E.; Rock, B. Imaging Spectrometry for Earth Remote Sensing. Science 1985, 228, 1147–1153. Available online: https://science.sciencemag.org/content/228/4704/1147.full.pdf (accessed on 16 December 2019).
  3. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral Remote Sensing Data Analysis and Future Challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36.
  4. Ghamisi, P.; Yokoya, N.; Li, J.; Liao, W.; Liu, S.; Plaza, J.; Rasti, B.; Plaza, A. Advances in Hyperspectral Image and Signal Processing: A Comprehensive Overview of the State of the Art. IEEE Geosci. Remote Sens. Mag. 2017, 5, 37–78.
  5. Eitel, J.U.H.; Höfle, B.; Vierling, L.A.; Abellán, A.; Asner, G.P.; Deems, J.S.; Glennie, C.L.; Joerg, P.C.; LeWinter, A.L.; Magney, T.S.; et al. Beyond 3-D: The new spectrum of lidar applications for earth and ecological sciences. Remote Sens. Environ. 2016, 186, 372–392.
  6. Höfle, B.; Hollaus, M.; Hagenauer, J. Urban vegetation detection using radiometrically calibrated small-footprint full-waveform airborne LiDAR data. ISPRS J. Photogramm. Remote Sens. 2012, 67, 134–147.
  7. Anders, K.; Winiwarter, L.; Lindenbergh, R.; Williams, J.G.; Vos, S.E.; Höfle, B. 4D objects-by-change: Spatiotemporal segmentation of geomorphic surface change from LiDAR time series. ISPRS J. Photogramm. Remote Sens. 2020, 159, 352–363.
  8. Ghamisi, P.; Höfle, B.; Zhu, X.X. Hyperspectral and LiDAR Data Fusion Using Extinction Profiles and Deep Convolutional Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3011–3024.
  9. Hänsch, R.; Hellwich, O. Fusion of Multispectral LiDAR, Hyperspectral, and RGB Data for Urban Land Cover Classification. IEEE Geosci. Remote Sens. Lett. 2020, 1–5.
  10. Khodadadzadeh, M.; Li, J.; Prasad, S.; Plaza, A. Fusion of Hyperspectral and LiDAR Remote Sensing Data Using Multiple Feature Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2971–2983.
  11. Pedergnana, M.; Marpu, P.R.; Dalla Mura, M.; Benediktsson, J.A.; Bruzzone, L. Classification of Remote Sensing Optical and LiDAR Data Using Extended Attribute Profiles. IEEE J. Sel. Top. Signal Process. 2012, 6, 856–865.
  12. Ghamisi, P.; Souza, R.; Benediktsson, J.A.; Zhu, X.X.; Rittner, L.; Lotufo, R.A. Extinction Profiles for the Classification of Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2016, 54, 5631–5645.
  13. Rasti, B.; Ghamisi, P.; Gloaguen, R. Hyperspectral and LiDAR Fusion Using Extinction Profiles and Total Variation Component Analysis. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3997–4007.
  14. Liao, W.; Pizurica, A.; Bellens, R.; Gautama, S.; Philips, W. Generalized graph-based fusion of hyperspectral and LiDAR data using morphological features. IEEE Geosci. Remote Sens. Lett. 2015, 12, 552–556.
  15. Rasti, B.; Ghamisi, P.; Plaza, J.; Plaza, A. Fusion of Hyperspectral and LiDAR Data Using Sparse and Low-Rank Component Analysis. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6354–6365.
  16. Ghamisi, P.; Rasti, B.; Benediktsson, J.A. Multisensor Composite Kernels Based on Extreme Learning Machines. IEEE Geosci. Remote Sens. Lett. 2019, 16, 196–200.
  17. Zhong, Y.; Cao, Q.; Zhao, J.; Ma, A.; Zhao, B.; Zhang, L. Optimal Decision Fusion for Urban Land-Use/Land-Cover Classification Based on Adaptive Differential Evolution Using Hyperspectral and LiDAR Data. Remote Sens. 2017, 9, 868.
  18. Xia, J.; Liao, W.; Du, P. Hyperspectral and LiDAR Classification With Semisupervised Graph Fusion. IEEE Geosci. Remote Sens. Lett. 2019, 1–5.
  19. Jahan, F.; Zhou, J.; Awrangjeb, M.; Gao, Y. Fusion of Hyperspectral and LiDAR Data Using Discriminant Correlation Analysis for Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3905–3917.
  20. Hughes, G.F. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63.
  21. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709.
  22. Rasti, B.; Hong, D.; Hang, R.; Ghamisi, P.; Kang, X.; Chanussot, J.; Benediktsson, J.A. Feature Extraction for Hyperspectral Imagery: The Evolution from Shallow to Deep (Overview and Toolbox). IEEE Geosci. Remote Sens. Mag. 2020.
  23. Chen, Y.; Li, C.; Ghamisi, P.; Jia, X.; Gu, Y. Deep Fusion of Remote Sensing Data for Accurate Classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1253–1257.
  24. Li, H.; Ghamisi, P.; Soergel, U.; Zhu, X.X. Hyperspectral and LiDAR Fusion Using Deep Three-Stream Convolutional Neural Networks. Remote Sens. 2018, 10, 1649.
  25. Xu, X.; Li, W.; Ran, Q.; Du, Q.; Gao, L.; Zhang, B. Multisource Remote Sensing Data Classification Based on Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 937–949.
  26. Qiu, C.; Schmitt, M.; Mou, L.; Ghamisi, P.; Zhu, X.X. Feature Importance Analysis for Local Climate Zone Classification Using a Residual Convolutional Neural Network with Multi-Source Datasets. Remote Sens. 2018, 10, 1572.
  27. Zhang, M.; Li, W.; Du, Q.; Gao, L.; Zhang, B. Feature Extraction for Classification of Hyperspectral and LiDAR Data Using Patch-to-Patch CNN. IEEE Trans. Cybern. 2020, 50, 100–111.
  28. Xu, S.; Amira, O.; Liu, J.; Zhang, C.; Zhang, J.; Li, G. HAM-MFN: Hyperspectral and Multispectral Image Multiscale Fusion Network With RAP Loss. IEEE Trans. Geosci. Remote Sens. 2020, 1–11.
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Computer Vision – ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 630–645.
  31. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858.
  32. Qiu, C.; Mou, L.; Schmitt, M.; Zhu, X.X. Fusing Multiseasonal Sentinel-2 Imagery for Urban Land Cover Classification With Multibranch Residual Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2020, 1–5.
  33. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France, 2015; Volume 37, pp. 448–456.
  34. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010.
  35. Chen, Y.H.; Jiang, C.L.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6250.
  36. Ghamisi, P.; Souza, R.; Benediktsson, J.A.; Rittner, L.; Lotufo, R.; Zhu, X.X. Hyperspectral Data Classification Using Extended Extinction Profiles. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1641–1645.
  37. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  38. Debes, C.; Merentitis, A.; Heremans, R.; Hahn, J.; Frangiadakis, N.; van Kasteren, T.; Liao, W.; Bellens, R.; Pizurica, A.; Gautama, S.; et al. Hyperspectral and LiDAR Data Fusion: Outcome of the 2013 GRSS Data Fusion Contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2405–2418.
  39. Xu, Y.; Du, B.; Zhang, L.; Cerra, D.; Pato, M.; Carmona, E.; Prasad, S.; Yokoya, N.; Hänsch, R.; Le Saux, B. Advanced Multi-Sensor Optical Remote Sensing for Urban Land Use and Land Cover Classification: Outcome of the 2018 IEEE GRSS Data Fusion Contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1709–1724.
Figure 1. The illustration of the proposed framework in training and testing phases.
Figure 2. The network architecture of full pre-activation RBs.
Figure 3. Network design of the proposed coupled residual convolutional neural networks.
Figure 4. Houston 2013: From top to bottom, the LiDAR-derived DSM image, the false color HSI image, the training samples, and the testing samples.
Figure 5. Houston 2018: From top to bottom, the LiDAR-derived DSM image, the VHR RGB Image (downsampled), the training samples, and the testing samples.
Figure 6. Trento: From top to bottom, the LiDAR-derived DSM image, the false color HSI image, the training samples, and the testing samples.
Figure 7. The Houston 2013 dataset: Classifications generated from different features and models. (a) HSI-ResNet, (b) LiDAR-ResNet, (c) EPs-HSI-ResNet, (d) EPs-LiDAR-ResNet, (e) CResNet, and (f) CResNet-AUX.
Figure 8. The Trento dataset: Classifications generated from different features and models. (a) HSI-ResNet, (b) LiDAR-ResNet, (c) EPs-HSI-ResNet, (d) EPs-LiDAR-ResNet, (e) CResNet, and (f) CResNet-AUX.
Figure 9. The Houston 2018 dataset: (a) Ground-truth label map; (b–f) classification maps generated on different features and models. (b) HSI-ResNet, (c) LiDAR-ResNet, (d) RGB-ResNet, (e) CResNet, and (f) CResNet-AUX.
Figure 10. Analysis of the classification OA w.r.t. the number of training samples on the Houston 2018 dataset. We select 10, 25, 50, or 100 training samples per class.
Figure 11. Analysis of classification OA w.r.t the weights of auxiliary losses on Houston 2018 dataset.
Figure 12. Comparison of classification accuracy with and without auxiliary loss functions for three datasets.
Table 1. Houston University 2013: The number of training samples, testing samples, and the total number of samples per class.

| Class No. | Class Name | Training | Testing | Samples |
| --- | --- | --- | --- | --- |
| 1 | Healthy grass | 198 | 1053 | 1251 |
| 2 | Stressed grass | 190 | 1064 | 1254 |
| 3 | Synthetic grass | 192 | 505 | 697 |
| 4 | Tree | 188 | 1056 | 1244 |
| 5 | Soil | 186 | 1056 | 1242 |
| 6 | Water | 182 | 143 | 325 |
| 7 | Residential | 196 | 1072 | 1268 |
| 8 | Commercial | 191 | 1053 | 1244 |
| 9 | Road | 193 | 1059 | 1252 |
| 10 | Highway | 191 | 1036 | 1227 |
| 11 | Railway | 181 | 1054 | 1235 |
| 12 | Parking Lot 1 | 192 | 1041 | 1233 |
| 13 | Parking Lot 2 | 184 | 285 | 469 |
| 14 | Tennis court | 181 | 247 | 428 |
| 15 | Running track | 187 | 473 | 660 |
| | Total | 2832 | 12,197 | 15,029 |
Table 2. Houston University 2018: The number of training samples, testing samples, and the total number of samples per class.

| Class No. | Class Name | Training | Testing | Samples |
| --- | --- | --- | --- | --- |
| 1 | Healthy grass | 1458 | 8341 | 9799 |
| 2 | Stressed grass | 4316 | 28,186 | 32,502 |
| 3 | Synthetic grass | 331 | 353 | 684 |
| 4 | Evergreen Trees | 2005 | 11,583 | 13,588 |
| 5 | Deciduous Trees | 676 | 4372 | 5048 |
| 6 | Soil | 1757 | 2759 | 4516 |
| 7 | Water | 147 | 119 | 266 |
| 8 | Residential | 3809 | 35,953 | 39,762 |
| 9 | Commercial | 2789 | 220,895 | 223,684 |
| 10 | Road | 3188 | 42,622 | 45,810 |
| 11 | Sidewalk | 2699 | 31,303 | 34,002 |
| 12 | Crosswalk | 225 | 1291 | 1516 |
| 13 | Major Thoroughfares | 5193 | 41,165 | 46,358 |
| 14 | Highway | 700 | 9149 | 9849 |
| 15 | Railway | 1224 | 5713 | 6937 |
| 16 | Paved Parking Lot | 1179 | 10,296 | 11,475 |
| 17 | Gravel Parking Lot | 127 | 22 | 149 |
| 18 | Cars | 848 | 5730 | 6578 |
| 19 | Trains | 493 | 4872 | 5365 |
| 20 | Seats | 1313 | 5511 | 6824 |
| | Total | 34,477 | 470,235 | 504,712 |
Table 3. Trento: The number of training samples, testing samples, and the total number of samples per class.

| Class No. | Class Name | Training | Testing | Samples |
| --- | --- | --- | --- | --- |
| 1 | Apple trees | 129 | 3905 | 4034 |
| 2 | Buildings | 125 | 2778 | 2903 |
| 3 | Ground | 105 | 374 | 479 |
| 4 | Wood | 154 | 8969 | 9123 |
| 5 | Vineyard | 184 | 10,317 | 10,501 |
| 6 | Roads | 122 | 3052 | 3174 |
| | Total | 819 | 29,395 | 30,214 |
Table 4. Houston 2013: Per-class classification accuracies, OA, and AA (in %), and the kappa coefficient (unitless). The bold refers to the best OA, AA, and Kappa performance.

| # | Class | HSI-ResNet | LiDAR-ResNet | EPs-HSI-ResNet | EPs-LiDAR-ResNet | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | Number of features | (144) | (1) | (225) | (71) | (144+225+71) | (144+225+71) |
| 1 | Healthy grass | 77.68 | 51.76 | 74.83 | 54.13 | 83.00 | 86.51 |
| 2 | Stressed grass | 98.59 | 47.09 | 76.60 | 56.77 | 99.81 | 98.01 |
| 3 | Synthetic grass | 86.53 | 87.33 | 87.33 | 94.06 | 84.36 | 87.87 |
| 4 | Tree | 86.46 | 51.52 | 51.89 | 68.09 | 96.69 | 85.52 |
| 5 | Soil | 89.11 | 43.56 | 93.94 | 52.37 | 99.91 | 87.02 |
| 6 | Water | 81.12 | 78.32 | 91.61 | 79.02 | 95.80 | 99.81 |
| 7 | Residential | 93.75 | 67.07 | 74.07 | 75.93 | 90.11 | 100.00 |
| 8 | Commercial | 81.86 | 75.12 | 80.53 | 83.57 | 95.73 | 95.72 |
| 9 | Road | 88.67 | 58.55 | 55.71 | 59.87 | 90.65 | 96.68 |
| 10 | Highway | 74.52 | 73.84 | 54.05 | 72.78 | 70.46 | 100.00 |
| 11 | Railway | 95.64 | 90.32 | 68.98 | 98.29 | 94.68 | 85.54 |
| 12 | Parking Lot 1 | 85.78 | 68.20 | 73.20 | 78.10 | 97.50 | 95.80 |
| 13 | Parking Lot 2 | 82.81 | 75.44 | 68.07 | 72.28 | 79.30 | 94.05 |
| 14 | Tennis court | 100.00 | 90.28 | 93.12 | 88.66 | 100.00 | 95.10 |
| 15 | Running track | 68.92 | 39.32 | 41.23 | 15.43 | 89.85 | 93.87 |
| | OA (%) | 86.60 | 63.82 | 70.63 | 69.39 | 91.42 | 93.57 |
| | AA (%) | 86.10 | 66.51 | 72.34 | 69.96 | 91.19 | 93.44 |
| | Kappa | 0.8545 | 0.6074 | 0.6809 | 0.6676 | 0.9068 | 0.9302 |
Table 5. Trento: Per-class classification accuracies, OA, and AA (in %), and the kappa coefficient (unitless). The bold refers to the best OA, AA, and Kappa performance.

| # | Class | HSI-ResNet | LiDAR-ResNet | EPs-HSI-ResNet | EPs-LiDAR-ResNet | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | Number of features | (63) | (1) | (225) | (71) | (63+225+71) | (63+225+71) |
| 1 | Apple trees | 98.21 | 0.00 | 96.67 | 98.39 | 98.10 | 99.74 |
| 2 | Buildings | 93.12 | 15.77 | 87.83 | 97.52 | 97.77 | 99.60 |
| 3 | Ground | 77.54 | 39.84 | 77.01 | 64.71 | 77.01 | 75.40 |
| 4 | Wood | 98.99 | 98.27 | 99.74 | 100.00 | 99.90 | 100.00 |
| 5 | Vineyard | 99.96 | 97.00 | 94.92 | 97.77 | 100.00 | 100.00 |
| 6 | Roads | 60.52 | 2.62 | 75.75 | 83.65 | 92.46 | 92.27 |
| | OA (%) | 94.40 | 66.30 | 93.74 | 96.62 | 98.43 | 98.81 |
| | AA (%) | 88.06 | 42.25 | 88.65 | 90.34 | 94.21 | 94.50 |
| | Kappa | 0.9250 | 0.5178 | 0.9166 | 0.9548 | 0.9790 | 0.9841 |
Table 6. Houston 2018: Per-class classification accuracies, OA, and AA (in %), and the kappa coefficient (unitless). The bold refers to the best OA, AA, and Kappa performance.

| # | Class | HSI-ResNet | LiDAR-ResNet | RGB-ResNet | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- | --- |
| | Number of features | (48) | (7) | (3) | (48+7+3) | (48+7+3) |
| 1 | Healthy grass | 46.35 | 24.25 | 41.54 | 18.77 | 75.90 |
| 2 | Stressed grass | 79.64 | 74.80 | 79.37 | 90.43 | 67.79 |
| 3 | Synthetic grass | 82.72 | 100.00 | 100.00 | 100.00 | 100.00 |
| 4 | Evergreen Trees | 93.59 | 90.02 | 93.05 | 94.74 | 95.24 |
| 5 | Deciduous Trees | 46.27 | 43.62 | 44.26 | 59.54 | 59.47 |
| 6 | Soil | 36.17 | 31.39 | 86.48 | 43.82 | 36.82 |
| 7 | Water | 42.02 | 0.00 | 22.69 | 30.25 | 1.68 |
| 8 | Residential | 89.86 | 87.51 | 91.08 | 90.79 | 88.00 |
| 9 | Commercial | 71.24 | 70.89 | 66.35 | 92.71 | 92.75 |
| 10 | Road | 54.44 | 61.35 | 65.97 | 64.14 | 72.77 |
| 11 | Sidewalk | 63.14 | 73.80 | 75.18 | 62.26 | 71.27 |
| 12 | Crosswalk | 3.95 | 2.40 | 2.87 | 3.02 | 3.95 |
| 13 | Major Thoroughfares | 47.50 | 62.67 | 56.97 | 65.15 | 57.62 |
| 14 | Highway | 31.82 | 34.97 | 37.22 | 42.34 | 44.82 |
| 15 | Railway | 77.58 | 84.75 | 84.74 | 63.77 | 63.96 |
| 16 | Paved parking Lot | 85.60 | 97.31 | 94.80 | 83.64 | 89.48 |
| 17 | Gravel parking Lot | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 18 | Cars | 32.24 | 37.24 | 50.89 | 29.91 | 34.57 |
| 19 | Trains | 93.49 | 99.36 | 98.75 | 92.44 | 97.74 |
| 20 | Seats | 63.49 | 99.84 | 98.42 | 61.13 | 73.42 |
| | OA (%) | 67.83 | 70.26 | 69.53 | 80.62 | 81.20 |
| | AA (%) | 62.16 | 63.81 | 69.53 | 64.47 | 66.36 |
| | Kappa | 0.5944 | 0.6287 | 0.6253 | 0.7416 | 0.7506 |
Table 7. Houston 2013: Performance comparison with the state-of-the-art models. The bold refers to the best OA, AA, and Kappa performance.

| Methods | MLRsub [10] | OTVCA [13] | SLRCA [15] | DeepFusion [23] | EPs-CNN [8] | CK-CNN [24] | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OA (%) | 92.05 | 92.45 | 91.30 | 91.32 | 91.02 | 92.57 | 91.42 | 93.57 |
| AA (%) | 92.85 | 92.68 | 91.95 | 91.96 | 91.82 | 92.48 | 91.19 | 93.44 |
| Kappa | 0.9137 | 0.9181 | 0.9056 | 0.9057 | 0.9033 | 0.9193 | 0.9068 | 0.9302 |
Table 8. Computational time for three multi-sensor datasets. The bold refers to the best OA, AA, and Kappa performance.

| Houston 2013 | HSI-ResNet | LiDAR-ResNet | EPs-HSI-ResNet | EPs-LiDAR-ResNet | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- | --- |
| Train (min) | 8.84 | 5.83 | 78.61 | 5.67 | 16.11 | 16.61 |
| Test (s) | 4.38 | 3.04 | 5.53 | 3.61 | 8.15 | 16.25 |

| Trento | HSI-ResNet | LiDAR-ResNet | EPs-HSI-ResNet | EPs-LiDAR-ResNet | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- | --- |
| Train (min) | 5.69 | 5.04 | 6.88 | 5.79 | 11.73 | 13.66 |
| Test (s) | 6.28 | 5.62 | 9.15 | 7.13 | 13.57 | 14.06 |

| Houston 2018 | HSI-ResNet | LiDAR-ResNet | RGB-ResNet | CResNet | CResNet-AUX |
| --- | --- | --- | --- | --- | --- |
| Train (min) | 82.50 | 63.11 | 58.13 | 159.9 | 168.33 |
| Test (s) | 53.64 | 35.84 | 38.38 | 102.91 | 107.79 |