Multilevel Structure Extraction-Based Multi-Sensor Data Fusion

Multi-sensor data on the same area provide complementary information, which is helpful for improving the discrimination capability of classifiers. In this work, a novel multilevel structure extraction method is proposed to fuse multi-sensor data. This method is comprised of three steps: First, multilevel structure extraction is constructed by cascading morphological profiles and structure features, and is utilized to extract spatial information from multiple original images. Then, a low-rank model is adopted to integrate the extracted spatial information. Finally, a spectral classifier is employed to calculate class probabilities, and a maximum posteriori estimation model is used to decide the final labels. Experiments tested on three datasets including rural and urban scenes validate that the proposed approach can produce promising performance with regard to both subjective and objective qualities.


Introduction
With the advance of imaging techniques, the amount of remote sensing data collected by various remote sensors is growing, which allows us to combine multiple types of data for earth observation [1][2][3]. Such data record different reflectance characteristics, e.g., rich spectral information, high spatial resolution, and height information. The availability of such datasets makes it possible to merge multi-sensor data to further boost the identification accuracy of different materials. For example, a hyperspectral image (HSI) can provide abundant spectral information which is helpful for distinguishing different types of land covers [4][5][6]. Nevertheless, when the land covers are composed of the same material, i.e., roads and roofs, the spectral curves of such objects are very similar. In this situation, it is hard to distinguish these objects with the same spectral curves using HSI data. Different from the HSI, a light detection and ranging (LiDAR) image can characterize the height and structure information of various objects. Therefore, by integrating the advantages of the two types of data, the fusion of HSI and LiDAR has exhibited better identification performance over single sensor [7][8][9][10].
In the past decades, a diversity of multi-sensor fusion techniques has been investigated to boost the spatial resolution of HSI so as to obtain higher classification performance, including RGB and HSI, multispectral (MS) and HSI, HSI and panchromatic (PAN). Integration of RGB/PAN and HSI data aims to improve the spatial resolution of HSI with the help of RGB/PAN data. For example, in [11], a directional total variation model was used to fuse the HSI and RGB images. In [12], a component decomposition-based method was proposed to enhance resolution of HSIs for the first time. Promising results were obtained to identify different kinds of minerals. In [13], a structure tensor-based image enhancement method was used to improve the spatial resolution of HSI. Integration of MS and HSI involves combining the spatial details of MS and spectral resolution of HSI to obtain high spatial and spectral resolution data. For instance, in [14], a spectral embedding technique was used to fuse MS and HSI, in which the manifold and low-rank structures were exploited. In [15], a novel image decomposition scheme was used to integrate the spatial information of MS and spectral information of HSI, in which the original data was decomposed by coupled sparse tensor factorization technique.
Lately, diverse deep learning techniques have been also applied to enhance the resolution of the HSI [16][17][18]. In [16], a novel deep cross-modal network was proposed to increase the classification accuracies. This is the first time to introduce the cross-modality learning into networks with the application to the classification of remote sensing data, showing significance of milestone. In [17], a pretrained convolutional neural network (CNN) was used to regularize the fusion issue. In [18], a spatial-spectral reconstruction network was applied to merge MS and HSI, where the spatial information and spectral information is injected into the HSI, respectively. Different from resolution enhancement techniques, fusion of LiDAR and HSI aims to combine the height information of LiDAR and spectral information of HSI. In such a way, the classification performance can be greatly improved. Many different kinds of schemes have been developed to achieve the integration of LiDAR and HSI, which can be roughly divided into two categories: fusion rule [19][20][21] and feature extraction [22][23][24]. For instance, a generalized graph-based fusion rule was constructed to achieve fusion of HSI and LiDAR [19]. Rasti et al. applied a dimension reduction technique, e.g., total variation component analysis, to fuse multiple features [20]. Ghamisi et al. used a composite kernel method to merge spatial information obtained from original images [21]. Besides, many advanced feature extraction techniques have been utilized to characterize the spatial and contextual information from source images. For example, in [22], an extinction profile was exploited to acquire the spatial features from source images, and the sparse and low-rank model was used to fuse the extracted spatial features. In [23], different kinds of features, including spectral and spatial information, were considered to model the spatial information of original images. These techniques have demonstrated that the characterization of spatial features for multi-sensor data is an efficient way to enhance the classification accuracy. Nevertheless, how to improve the discrimination of different objects is still an open problem.
Furthermore, in recent years, deep learning-based schemes have also been applied for fusion of HSI and LiDAR [7,25,26]. For example, a coupled residual CNN was proposed for fusion LiDAR and HSI [25]. In [26], a dual-tunnel CNN is exploited to extract spectral-spatial features, in which a pixelwise affinity architecture was applied to build the correlation of different objects with different elevation information from LiDAR. In [7], deep CNN was employed to classify the fused features obtained by extinction profiles. However, the classification performance of these deep learning-based methods relies heavily on the amount of the labeled training samples.
In this work, a novel fusion approach based on multilevel structure extraction (MSE) is developed, which is comprised of three steps. First, the multilevel structure extraction method is constructed to capture the discriminative spatial features from multiple source images. Then, the low rank representation is adopted to fuse the high-dimensional features into low-dimensional subspace. Finally, the fused features are fed into the MLR classifier to obtain class probabilities, and the maximum posteriori estimation model is used to optimize the class probabilities to yield the final map. Experiments performed on three different scenes validate that the proposed method shows superiority compared to several state-of-the-art fusion approaches. The main contributions of this work is summarized as follows.

•
A multilevel structure feature extractor is constructed and exploited to model the spatial and contextual information from input images, which can better characterize the discrimination between different land covers.
• A general multi-sensor fusion framework is proposed based on feature extraction and probability optimization, which can effectively fuse multi-sensor remote sensing data, such as HSI, LiDAR, and synthetic aperture radar (SAR).

•
Classification quality of the proposed method is examined on three datasets, which indicates that our method obtains outstanding performance over other state-of-the-art multi-sensor fusion techniques with regard to both classification accuracies and maps. We will also make the codes freely available on author's Github repository: https://github.com/PuhongDuan.
The remaining section of this article is given as follows. The methodology is introduced in Section 2. Section 3 mainly describes the experimental results and analyses. Finally, Section 5 summarizes several concluding remarks. Figure 1 displays the flow-process diagram of our method, which is comprised of three key steps: First, the multilevel structure extraction is built and is employed to extract the spatial and semantic information from original data. Then, the extracted features are merged with a low rank model. Finally, the MLR classifier is tested on the fused features to obtain class probabilities followed by maximum a posteriori estimation model.

Multilevel Structure Extraction
Structure extraction aims at decomposing the input data into two components, i.e., a structure component that reflects the salient spatial structures and a texture component that contains noise and details. However, a single structure feature cannot effectively characterize the spatial characteristics of different kinds of land covers in a scene as different objects usually have different shapes and sizes. To provide a more complete characterization of different objects, a multilevel structure feature extraction approach is developed to characterize the spatial features from multi-sensor data. Specifically, morphological attribute filtering is first used to extract the multilevel architecture of the input because the morphological attribute filtering has been proven to be a multilevel shape-size descriptor [27,28]. Suppose the input I, the spectral dimension of I, is first decreased using principal component analysis (PCA) so as to preserve the first principal componentÎ. Then, the multilevel attribute profiles of the dimension reduced imageÎ can be computed as follows, where φ and λ stand for attribute thinning and thickening operators, respectively. Λ = {λ i |i = 1, 2, ..., L} denotes a series of threshold values for the predefined predicate T. Next, the structure extraction is performed on the multilevel attribute profiles A(Î) to obtain the multilevel structure features. Specifically, a relative total variation-based structure extraction approach [29] is adopted to remove the texture component from A(Î).
where T denotes the amount of image pixels. S indicates the desired result. α represents a weight which controls the smoothness. ε is used to avoid dividing by zero. The solution of the energy function in Equation (2) can be found in [29].
Here, D x and D y denote the total variations (TV) in the horizontal and vertical axes, respectively. This TV mainly measures the spatial differences within the local region R(i).
where ∂ x S and ∂ y S mean the partial derivatives in the horizontal and vertical axes, respectively. g i,j indicates a weight as follows, where σ represents the window size. Finally, in order to decrease the computational burden and further improve the discrimination among different land covers, the kernel PCA technique is used to reduce the dimension of the multilevel structure features so as to obtain dimension reduced features, where 80% principal components are preserved for the following fusion in this work.

Feature Fusion
Assume S i , i = 1, 2 to be the spatial features extracted from source images, a low rank representation method [30] is adopted to merge the extracted spatial information from multi-source images to reduce the redundant information between the extracted features. Accordingly, the low rank model is expressed asS whereS = [S 1 , S 2 ] denotes the stacked spatial features. F is the low-dimensional features, and P represents the subspace basis. N stands for the error term. Our goal is to estimate the low-dimensional features F and subspace basis P with onlyS. To solve this problem, a total variation prior is adopted to retain the spatial structure of the input. The corresponding energy function is as follows.
arg min To solve the nonconvex problem Equation (6) mentioned above, a cyclic descent algorithm [31] is adopted, which is as follows.
(2) P step: By fixing variable F, Equation (6) becomes an orthogonal Procrustes issue, which can be solved by low-rank Procrustes rotation.
where Q and R are calculated by singular value decomposition S T F = Q ∑ R T .

Probability Optimization
When the fused features F are obtained, a spectral classifier, i.e., multinomial logistic regression (MLR), is performed on F to obtain the class probabilities P(Y |F ). Then, maximum a posteriori estimation is exploited to optimize the class probabilities P(Y |F ), which can be calculated as where P(y i ) is assumed to be equally distributed. By using the logarithm operation, Equation (9) can be written asŶ whereŶ is the final map. P(y i |F i ) is the class probabilities obtained by MLR classifier. P(Y) is prior probability which is usually estimated with an isotropic pairwise multilevel logistic model [33].

Experiments
In the experiment section, three datasets located in rural and urban regions are used to examine the fusion effect of the proposed approach. To illustrate the advantage of the proposed MSE, several state-of-the-art multi-sensor fusion schemes are adopted for comparison, including the MLR classifier [34], orthogonal total variation component analysis (OTVCA) [22], sparse and smooth low-rank analysis (SSLRA) [35], and subspace-based fusion method (SubFus) [36]. For the competitive approaches, we have used exactly similar hyperparameters suggested in their original papers.
(1) MLR [34]: The MLR classifier is a spectral classification method based on logistic regression.
(2) OTVCA [22]: The OTVCA method is a low-rank-based dimension reduction model. First, morphology filters are utilized to extract spatial information from original images. Then, the lowrank-based dimension reduction model is used to fuse the spatial information. Finally, the fused features are fed into a random forest classifier.
(3) SSLRA [35]: The SSLRA method is a feature fusion technique. First, the spatial features of original data are extracted with morphology filters. Then, a sparse and smooth low-rank model is used to fuse the spatial features. Finally, random forest is used as a spectral classifier.
(4) SubFus [36]: The SubFus method is a recently proposed fusion method. Morphology filters are used to extract the spatial information from source images. Then, the spatial features are merged into a low-dimensional subspace. Finally, the fused features are fed into the random forest classifier to obtain the final map.
Furthermore, in order to quantitatively appraise the classification quality of all considered schemes, three widely used quality indexes [37][38][39], i.e., overall accuracy (OA), average accuracy (AA), and Kappa coefficient, are employed, which are shown as follows.
(1) OA: OA calculates the percentages of correctly identified samples.
where M is the confusion matrix. M ii is the amount of the ith class which is identified into the ith class. C and N denotes the total amount of classes and test set, respectively.
(2) AA: AA measures the average of the percentage of correctly identified samples for each land cover.
where M ij stands for the amount of the ith object which is identified into the jth object.
(3) Kappa coefficient (Kappa): Kappa indicates the percentage of correctly identified samples corrected by the number of agreements.

Datasets
(1) Trento dataset: The Trento dataset was taken from a rural area in Trento, a southern city in Italy. It includes an HSI and a LiDAR-derived DSM [22]. The HSI was gained by the AISA Eagle sensor, and the LiDAR DSM was produced using first and last point cloud pulses obtained by the Optech ALTM 3100EA sensor. Both sensors are equipped with a spatial resolution of 1 m and a spatial size of 600 × 166 pixels. The HSI has 63 spectral channels varying from 402.89 to 989.09 nm and its spectral resolution is 9.2 nm. This scene contains six different land covers: Apple Tree, Building, Ground, Wood, Vineyard, and Road. Figure 2 shows the original images, training label, testing label, and class name.
(2) Berlin dataset: The Berlin dataset was acquired over an urban scene in Berlin, which is composed of an HSI and a SAR. The EnMAP benchmark HSI can be downloaded from the website: http://doi.org/10. 5880/enmap.2016.002, which has 244 spectral channels varying from 400 to 2500 nm with spatial resolution of 30 m. Its spatial size is of 797 × 220 pixels. Based on the geographic coordinates, the corresponding region of the SAR image is downloaded from the Sentinel-1 satellite. The SAR image has four polarimetric bands with spatial resolution of 13 m, and its spatial size is of 1723 × 476 pixels. Before performing the fusion operation, the HSI is upsampled to the same size as the SAR image using the linear interpolation operation. Figure 3 presents the original images, training samples, testing samples, and class names.
(3) Houston 2013 dataset: The Houston 2013 dataset was acquired over an urban region of Houston, USA on 23 June 2012, which is comprised of an HSI image and a LiDAR-derived image. This image was released by the 2013 GRSS Data Fusion Contest [40]. Both the datasets are of 349 × 1905 pixels. The HSI was collected by the Compact Airborne Spectrographic Imager, which records 144 spectral bands varying from 380 to 1050 nm with spatial resolution of 2.5 m. The LiDAR was captured by NSF-funded Center for Airborne Laser Mapping. This scene contains 15 classes of interest. Figure 4 shows the original images, groundtruth, and class names. The amount of training and test set is listed in Table 1.     Figure 2 exhibits the visual maps obtained by all considered approaches on Trento dataset. The MLR method produces an obvious noisy phenomenon in the classification map because the MLR fails to take the spatial information into consideration (see Figure 2a,b). In addition, the MLR method on the LiDAR image can better identify the Building and Water classes compared to HSI ( Figure 5). This is due to the fact that the height and structure information in the LiDAR image make a great contribution in classifying these land covers. The OTVCA method greatly improves the visual appearance in removing the misclassified noisy labels. Nevertheless, there are still some mislabels in the Vineyard class. The SSLRA method yields an over-smoothed phenomenon for the Road class. The reason is that the SSLRA method performs wavelet transformation before projecting a low-dimensional space. The SubFus method fails to distinguish classes 3 and 5 well, in which many pixels in class 3 are misclassified into class 5. In general, the proposed approach obtains a better visual map in improving homogeneous regions. The key reason is that the constructed feature extractor can better extract the spatial information from HSI and LiDAR data compared to other techniques.  Furthermore, the objective quality of different approaches is given in Table 2. It can be observed that the proposed MSE produces the highest objective indexes concerning OA, AA, and Kappa, which confirms the effectiveness of the proposed MSE from the quantitative perspective. In addition, our method also yields the highest CA in three land covers: Ground, Wood, and Road. Figure 6 presents the visual maps of different techniques on the Berlin dataset. As shown in this figure, the MLR method on the original HSI image still yields "noisy" results (see Figure 6a). The MLR method cannot identify different land covers using the SAR image well (see Figure 6b). This is because the SAR image does not contain rich spectral information. By comparing Figure 6a,f (or Figure 6b,g), it is found that our method on HSI or SAR can effectively improve the classification quality. The reason is that the proposed feature extractor can effectively improve the discrimination between different objects. By fusing the spatial features of HSI and SAR, the proposed approach can produce better classification performance over the single sensor image, e.g., HSI or SAR.

Berlin Dataset
Besides, the objective qualities of different approaches are listed in Table 3. The OTVCA method fails to effectively classify the Water class. The SSLRA and SubFus methods obtain low accuracy on the Allotment class. It is easy to observe that the proposed MSE obtains outstanding classification performance with respect to other techniques. Consequently, the proposed MSE can also be applied to fuse HSI and SAR data.

Houston 2013 Dataset
The third experiment is conducted on a challenging dataset, i.e., Houston 2013. This image consists of various urban land covers, and it is corrupted by shadow. Figure 7 displays the classification maps obtained by different approaches. By observing these classification results, several obvious phenomena can be concluded. First, the spectral classifier, i.e., MLR on HSI or LiDAR, yields different levels of "noise" in the classification maps, while other spatial-spectral feature extraction approaches can greatly improve this problem. Second, the SubFus method cannot identify the shadow region well. In addition, the Grass class is misclassified into the Road class. Third, the proposed method provides better visual map among these considered approaches, in which the edges of different types of objects are well aligned with the labeled land covers. The main reason is that the multilevel structure extraction method can effectively increase the discrimination between different land covers. Furthermore, the classification accuracies listed in Table 4 also confirm the superiority of the proposed MSE.

The Influence of Different Parameters
In this subsection, the influence of different parameters to the performance of the proposed MSE is discussed. In our method, there are three free parameters that need to be determined, including the smoothing parameter α, the window size σ, and the number of the fused features K. An experiment is tested on the Trento dataset with standard training and test samples. When α is analyzed, σ and K are set to be 2 and 30, respectively. Figure 8a gives the influence of the proposed method with different α to the classification accuracy. We can observe that the objective quality first increases, and then tends to decrease. This is mainly because the smoothing parameter α easily removes some important structural features when α is relatively large. Thus, α = 0.003 is selected as the default parameter. When σ is analyzed, α and K are fixed, i.e., α = 0.003 and K = 30. Figure 8b presents the change trend. It is easy to see that the proposed MSE is able to yield promising accuracy when σ = 2. Similarly, when we discuss the effect of the proposed MSE with different K, σ and α are fixed. Figure 8c exhibits the influence of the proposed method with different K. When K is 30, the proposed MSE yields higher classification accuracy. Therefore, α, σ, and K are set to be 0.003, 2, and 30, respectively, which are regarded as the default parameters for all used datasets in this work.

The Influence of Different Feature Extractors
To prove the superiority of the proposed multilevel structure extraction, several popular feature extraction approaches in remote sensing community are adopted for comparison, including KPCA, structure extraction (SE), intrinsic image decomposition (IID), image fusion and recursive filtering (IFRF), and extended morphological attribute profiles (EMAP). An experiment is tested on the Trento dataset. The classification accuracy of the proposed MSE with different feature extractors is presented in Table 5. It is found that the proposed multilevel structure extractor yields the highest classification accuracies regarding OA, AA, and Kappa coefficient among all considered approaches, which confirms the effectiveness of the proposed feature extractor.

Computing Time
The running time of all considered approaches on three datasets is given in Figure 9. In this work, all experiments are carried out a laptop with 8 GB RAM and 2.6GHz with Matlab 2014a. It is found from Figure 9 that spectral classifier is computationally efficient as it does not have feature extraction and fusion steps. The SSLRA method is relatively time-consuming. This is because the feature fusion step needs to solve a nonconvex objective function. The running time of the proposed approach is moderate. Taking the Trento dataset as an example, the running time is about 316s.

Conclusions
In this study, we developed multilevel structure extraction method for fusion of multi-sensor images. The fusion scheme is composed of three key steps: First, the multilevel structure extraction approach is designed to characterize the spatial and elevation features from the original images. Then, the low rank representation model is exploited to merge the extracted information. Finally, the spectral classifier is conducted on the fused features to produce the class probabilities followed by a maximum posteriori estimation model. Experiments were performed on the rural (Trento) and urban (Berlin and Houston 2013) scenes, which have revealed several conclusions as follows.
(1) The multilevel structure extraction method exhibits outstanding performance in extracting spatial and contextual features from multiple types of data (e.g., HSI, LiDAR, and SAR) compared to other feature extractors.
(2) Experimental results prove that the proposed MSE can considerably boost the classification performance over several popular approaches with regard to both visual maps and objective indexes.
(3) The proposed MSE can yield better classification accuracy compared to single sensor image.
In the future, we will focus on designing novel fusion technique to effectively integrate large-scale multi-sensor data.
Author Contributions: P.D. conducted the experiments and wrote the manuscript. X.K. gave some suggestions and revised this work. P.G. carefully improved the presentation of this manuscript. S.L. revised the manuscript. Y.L. modified this manuscript. All authors have read and agreed to the published version of the manuscript.