Relative Total Variation Structure Analysis-Based Fusion Method for Hyperspectral and LiDAR Data Classification

Abstract: The fusion of hyperspectral images (HSI) and light detection and ranging (LiDAR) data has a wide range of applications. This paper proposes a novel feature fusion method for urban area classification, namely relative total variation structure analysis (RTVSA), to combine various features derived from HSI and LiDAR data. In the feature extraction stage, a variety of high-performance methods, including the extended multi-attribute profile, the Gabor filter, and the local binary pattern, are used to extract the features of the input data. The relative total variation is then applied to remove useless texture information from the processed data. Finally, nonparametric weighted feature extraction is adopted to reduce the dimensionality. Random forest and convolutional neural networks are utilized to evaluate the fusion images. Experiments conducted on two urban Houston University datasets (including Houston 2012 and the training portion of Houston 2017) demonstrate that the proposed method can extract the structural correlation from heterogeneous data, withstand noise well, and improve land cover classification accuracy.


Introduction
Recently, advances in remote sensing technology have improved the availability of multi-sensor data over the same region and deepened the understanding of research areas [1,2]. Specifically, the hyperspectral image (HSI) has hundreds of spectral bands for each pixel, and the rich spectral details preserved in HSI give a detailed overview of the spectral features of the ground cover [3]. However, HSI may not be able to reliably distinguish objects with the same spectral properties [4]. On the other hand, light detection and ranging (LiDAR) data can provide height information that is complementary to the spectral details [1,5], yet objects of the same elevation but made from different materials cannot be separated using LiDAR elevation information alone [6]. Therefore, fusing the high spectral resolution of HSI with the structural information given by LiDAR provides more complete and enhanced surface properties for a broader range of applications, such as forest monitoring [7-9], biomass estimation [10], and geological analysis [11].
In order to allow the best use of the information given by different sensors, several feature extraction techniques are recommended. Typical feature extraction methods include unsupervised techniques such as principal component analysis (PCA) [12], independent component analysis [13], and the maximum noise fraction [14] and supervised methods such as linear discriminative analysis [15]. However, these methods process each pixel separately and are only optimized for extracting spectral features, without taking into account spatial context information [16][17][18].
Moreover, it has been proven that image fusion benefits from combining spectral features and spatial information, because the spatial regularity of surface materials in remote sensing images is typically very uniform within a limited region [19,20]. Different feature extraction techniques are commonly adopted to make full use of the spectral and spatial information given by both HSI and LiDAR data. For example, the attribute profile (AP) [21], which focuses on high-level spatial features, has attracted a great deal of interest: it can retain the geometric features of the input image while removing unimportant details [22,23]. The extended multi-attribute profile (EMAP) [24], subsequently proposed to create spectral-spatial features through a set of attribute profiles, has been successfully applied to the fusion of HSI and LiDAR data [25,26]. Furthermore, Gabor features can express the spatial structure at various scales and orientations in the image [27,28]; Lin et al. enhanced the discriminative low-rank Gabor filter to fully exploit its capacity for extracting useful information from the spectral and spatial domains [29]. The local binary pattern (LBP) [30], a rotation-invariant, gray-scale texture operator, was proposed for texture extraction. Unlike the feature extraction methods described above, the LBP concentrates on texture details, including global contrast information and texture depth [31]. In addition, other advanced tools such as composite kernel methods [32], low-rank representation [20], and edge-preserving filtering [16] have been utilized to exploit spectral and spatial information. However, the automatic feature fusion of multiple types of data is not straightforward [6].
To combine the heterogeneous features, several feature fusion methods have been developed. These methods can be roughly divided into five categories: feature-based stack structure, multiple kernels learning, sparse representation, graph-based method, and deep learning.
• Feature-based stack structure: The method of stacking features to create a spectral-spatial cube is relatively simple [33,34]. However, the stacked feature vector assigned to each pixel has a high dimensionality, which contributes to the curse of dimensionality given the limited number of available training samples [35].
• Multiple kernel learning: Integrating multi-source data based on multiple kernels is effective [5]. For example, Camps-Valls et al. suggested a general kernel method-based architecture that allows multi-sensor images to be combined with contextual information [36]. However, it is a challenging job to build an acceptable kernel and to pick its parameters [37].
• Sparse representation: Some fusion strategies combine heterogeneous features through dictionary creation and sparse coefficient solutions based on sparse representation [38,39]. These are non-parametric approaches that do not require any data distribution or mathematical estimation assumptions. Dian et al. formulated the fusion problem as the calculation of the spectral basis and coefficients by taking advantage of non-local spatial self-similarities, prior knowledge of spectral unmixing, and a sparse prior [40]. Nevertheless, solving the optimization problem in sparse representation is a tough task [35].
• Graph-based method: The discriminant graph-based method merges heterogeneous features by mining their manifold structure [3,41]. Liao et al. proposed a general graph-based fusion approach to couple dimensionality reduction with the fusion of spectral details and morphological profiles (MPs), and applied it to HSI and LiDAR fusion [1].
• Deep learning: Deep learning approaches such as convolutional neural networks (CNNs) can derive joint spectral-spatial features layer by layer [12,42-44]. These approaches show great promise in extracting non-linear and hidden features. However, deep learning networks have many hyper-parameters and are thus vulnerable to the overfitting problem [35].
Total variation (TV) [45] is an efficient regularization technique for image processing and has been widely used in remote sensing applications such as pansharpening [46], image denoising [47], and feature extraction [48]. In image fusion, TV-based feature extraction methods achieve high performance. For example, Kumar et al. used a TV-based semi-norm method to fuse the pixels of the input images [49]. Ma et al. formulated the fusion as an optimization problem and merged gradient transfer with TV minimization for infrared and visible image fusion [50]. However, there are relatively few studies on TV-based image feature fusion, especially for HSI and LiDAR data fusion.
In this paper, a TV-based feature fusion method is proposed for merging HSI and LiDAR city datasets. In the feature extraction stage, a variety of high-performance methods, including the EMAP, Gabor filter, and LBP, are used to extract the features of the input data. Then, relative TV structure analysis (RTVSA) is adopted to combine the various features derived from the HSI and LiDAR data. The proposed method is evaluated on two publicly accessible urban Houston University datasets (including Houston 2012 and the training portion of Houston 2017). Two classification methods, random forest and the convolutional neural network, are applied to the fused data for land cover classification. The main contributions of this paper are as follows: (1) This paper proposes a novel algorithm for the fusion of HSI and LiDAR data based on the RTV and nonparametric weighted feature extraction. The proposed RTVSA method can effectively improve the classification accuracy of the fused HSI and LiDAR data, extract the structural association from heterogeneous data, and adapt well to noise. (2) The spatial features used in HSI and LiDAR fusion are studied in this paper, and it is shown that the LBP and EMAP features can achieve high classification accuracy.
The remainder of this paper is structured as follows. Section 2 describes the related work. Section 3 introduces in depth the feature fusion approach proposed in this paper and some feature extraction methods. Section 4 presents the datasets used in this paper. Section 5 shows the experimental results. The discussion is presented in Section 6. Finally, the concluding remarks are given in Section 7.

Related Works
To evaluate the quality of the fusion images, two classification methods, including random forest (RF) and the convolutional neural network (CNN), are used to classify the image before and after the fusion.

Random Forest
Random forest (RF) is a standard ensemble learning method with outstanding classification performance and high processing speed, and it can effectively prevent over-fitting [51-57]. RF is a combination of tree predictors, where each tree depends on the value of a random vector sampled independently and with the same distribution for all trees in the forest [58]. The generalization error of the forest of tree classifiers depends on the strength of the individual trees and the correlation between them. After a large number of trees is created, they vote for the most popular class.
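The voting mechanism described above can be illustrated with a deliberately simplified sketch: depth-one decision stumps stand in for full trees, each fit on a bootstrap sample, and the final label is decided by majority vote. This is only an illustration of the bagging-and-voting idea, not the RF configuration used in the experiments (which relies on full decision trees and random feature subsets).

```python
import numpy as np

def best_stump(X, y):
    """Exhaustively pick the (feature, threshold, polarity) decision stump
    that minimizes misclassification on this (bootstrap) sample."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            for flipped in (False, True):
                p = 1 - pred if flipped else pred
                err = np.mean(p != y)
                if best is None or err < best[0]:
                    best = (err, f, t, flipped)
    return best[1], best[2], best[3]

def fit_forest(X, y, n_trees=25, rng=None):
    """Bagging: each stump is trained on an independent bootstrap sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))   # sample with replacement
        stumps.append(best_stump(X[idx], y[idx]))
    return stumps

def predict_forest(stumps, X):
    """Majority vote over all stumps (binary labels 0/1)."""
    votes = np.zeros(len(X))
    for f, t, flipped in stumps:
        p = (X[:, f] > t).astype(int)
        votes += 1 - p if flipped else p
    return (votes * 2 > len(stumps)).astype(int)
```

Each stump alone is a weak classifier; the vote across bootstrap replicates is what stabilizes the prediction.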

Convolutional Neural Network
The convolutional neural network (CNN) is one of the most commonly used visual data processing methods based on deep learning. Recent studies have shown that the 3D-CNN achieves satisfactory results for HSI classification. In this paper, the hybrid spectral CNN (HybridSN) [59] is used for classification.
Let the input data cube be represented as I ∈ R^(M×N×D). The input data cube is divided into small overlapping 3D neighboring patches P ∈ R^(S×S×D); the true label of a patch is determined by the label of its central pixel (α, β), and each patch covers an S × S spatial window and all D spectral bands. 3D convolution is accomplished by convolving a 3D kernel with the 3D data: the kernel creates the feature maps of the convolution layer over several contiguous bands of the input layer, so that spectral details are captured. In 3D convolution, the activation value at spatial location (x, y, z) in the j-th feature map of the l-th layer is

v_(l,j)^(x,y,z) = ϕ( b_(l,j) + Σ_(τ=1)^(d_(l−1)) Σ_σ Σ_ρ Σ_λ ω_(l,j,τ)^(σ,ρ,λ) · v_(l−1,τ)^(x+σ, y+ρ, z+λ) ),

where ϕ is the activation function, b_(l,j) is the bias parameter of the j-th feature map of the l-th layer, d_(l−1) is the number of feature maps in the (l−1)-th layer, and ω_(l,j,τ)^(σ,ρ,λ) are the weight parameters of the j-th feature map of the l-th layer connected to the τ-th feature map of the previous layer, with σ, ρ, and λ running over the spatial and spectral extent of the kernel. The parameters of the CNN are learned in a supervised manner with gradient descent optimization.
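The 3D convolution activation can be transcribed directly into code: a naive valid-mode 3D convolution over all feature maps of the previous layer, followed by bias and activation. This is a didactic sketch only (a ReLU is assumed for ϕ), not the HybridSN implementation.

```python
import numpy as np

def conv3d_activation(maps_prev, weights, bias, phi=lambda a: np.maximum(a, 0.0)):
    """Valid-mode 3D convolution summed over all feature maps of the
    previous layer, plus bias and activation, for ONE output feature map.

    maps_prev : (d_prev, X, Y, Z) feature maps of layer l-1
    weights   : (d_prev, kx, ky, kz) one 3D kernel slice per input map
    """
    d_prev, X, Y, Z = maps_prev.shape
    _, kx, ky, kz = weights.shape
    out = np.zeros((X - kx + 1, Y - ky + 1, Z - kz + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # inner sums over tau, sigma, rho, lambda of the formula
                patch = maps_prev[:, x:x + kx, y:y + ky, z:z + kz]
                out[x, y, z] = np.sum(patch * weights)
    return phi(out + bias)
```

Deep learning frameworks implement the same computation with optimized kernels; the triple loop here just mirrors the formula term by term.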

Methods
In this section, the TV-based method is proposed to fuse the HSI and LiDAR data. The approach consists of two parts, feature extraction and feature fusion. Three high-performance methods are used in the feature extraction: the LBP, EMAP, and Gabor filter. The relative total variation structure analysis method is proposed for the feature fusion. The processed data are classified with random forest and convolutional neural networks, respectively. The flowchart is shown in Figure 1.

Local Binary Pattern

The local binary pattern (LBP) [30] is a simple yet effective texture descriptor that analyzes each pixel and its region to summarize the local structure in the image. It has the characteristics of gray-scale invariance and rotation invariance. For each band of the input image, the texture U in a local neighborhood is defined as the joint gray-level distribution of the image pixels. Here, m_c is the gray value of the middle pixel of the local neighborhood, and m_p (p = 0, 1, · · · , P − 1) are the gray values of P equidistant pixels on a circle of radius Q (Q > 0) forming a circularly symmetric neighbor set. If the coordinates of m_c are (0, 0), the coordinates of m_p can be expressed as (−Q sin(2πp/P), Q cos(2πp/P)). The LBP is calculated by thresholding the neighbors m_p against m_c to create a P-bit binary code:

LBP_(P,Q) = Σ_(p=0)^(P−1) s(m_p − m_c) 2^p,  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise.
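The thresholding step can be sketched for the common P = 8, Q = 1 case, where the eight neighbors of a 3 × 3 patch are compared against the center pixel. The neighbor ordering chosen here is one arbitrary convention and is an assumption of this sketch; any fixed ordering yields a valid (if differently numbered) code.

```python
import numpy as np

def lbp_code(patch):
    """P=8, Q=1 LBP code of the centre pixel of a 3x3 patch.

    Each neighbour whose gray value is >= the centre contributes 2**p
    to the 8-bit binary code.
    """
    mc = patch[1, 1]
    # 8 neighbours on the unit circle, here ordered clockwise from the right
    offsets = [(1, 2), (2, 2), (2, 1), (2, 0), (1, 0), (0, 0), (0, 1), (0, 2)]
    code = 0
    for p, (r, c) in enumerate(offsets):
        code += int(patch[r, c] >= mc) << p   # thresholding s(m_p - m_c)
    return code
```

Sliding this over every pixel of a band and histogramming the codes per region gives the LBP feature map used in the experiments.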

Extended Multi-Attribute Profile
The extended multi-attribute profile (EMAP) relies on the application of the attribute profile (AP) to the data. The AP is a multi-level decomposition of the image by attribute filters [24]. Depending on the chosen attribute, APs extract different information from the image, such as the area, range, length of the diagonal of the bounding box, and entropy. The size of the structuring element (SE) also affects the degree of processing of the input image. The AP has a dual behavior: the opening by reconstruction removes bright details from the image, while the closing by reconstruction removes dark ones. The opening and closing by reconstruction of the image I are respectively given by

γ_R^i(I) = S_I^δ(ξ_i(I)),  φ_R^i(I) = S_I^ξ(δ_i(I)),

where ξ_i and δ_i are the erosion and dilation with the SE of size i, respectively, and S_I^δ and S_I^ξ achieve reconstruction through dilation and erosion, respectively.
Different from the morphological profile (MP), the AP simplifies the image progressively as the filter value increases, because of its cumulative behavior. To guarantee that the added criteria of the attribute filter are ordered, the family of criteria T_i must be considered. The AP can therefore be regarded as the concatenation of a thickening AP, Π_T^φ, and a thinning AP, Π_T^γ:

AP(I) = { Π_(T_n)^φ(I), · · · , Π_(T_1)^φ(I), I, Π_(T_1)^γ(I), · · · , Π_(T_n)^γ(I) },

where T = {T_1, T_2, · · · , T_n} is a set of ordered criteria. The process of the EMAP is as follows. First, principal component analysis is applied to the input image, and an AP is calculated on each of the n extracted principal components (PCs). The extended attribute profile (EAP), composed of the n different APs, can be formulated as

EAP = { AP(PC_1), AP(PC_2), · · · , AP(PC_n) }.

The EMAP is the combination of EAPs built with different attributes.
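To make the attribute-filtering idea concrete, the following toy sketch applies an area criterion to a binary image: connected components smaller than the threshold are removed, larger ones are kept untouched. Real APs operate on gray-level images via max-tree/min-tree filtering, so this binary version is only an illustration of how an attribute criterion prunes structures.

```python
import numpy as np
from collections import deque

def area_opening_binary(img, min_area):
    """Remove 4-connected foreground components smaller than min_area
    (toy, binary stand-in for the area attribute filter of the AP)."""
    img = np.asarray(img).astype(bool)
    out = np.zeros_like(img)
    seen = np.zeros_like(img)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            if img[i, j] and not seen[i, j]:
                # breadth-first search collects one connected component
                comp, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    r, c = q.popleft()
                    comp.append((r, c))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < H and 0 <= cc < W and img[rr, cc] and not seen[rr, cc]:
                            seen[rr, cc] = True
                            q.append((rr, cc))
                # keep the component only if it meets the area criterion
                if len(comp) >= min_area:
                    for r, c in comp:
                        out[r, c] = True
    return out
```

Varying `min_area` over a set of thresholds and stacking the results is the binary analogue of building one level of an attribute profile per criterion.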

Gabor Filter
Gabor filters have been commonly used in hyperspectral image classification, among which 3D Gabor filters are efficient and superior in extracting spectral-spatial characteristics [60]. In this paper, the discriminative low-rank Gabor filtering proposed in [29] is adopted to extract the features of the input images. This method decomposes the standard 3D Gabor filter into several sub-filters corresponding to various combinations of single-rank low-pass and band-pass filters. Compared with the conventional Gabor filter, it better preserves the image features that are useful for discrimination.
For an input image I of size {x, y, z}, the Gaussian rank-one envelope is calculated first:

env(x, y, z) = g_(σ_x)(x) g_(σ_y)(y) g_(σ_z)(z),  with g_(σ_i)(t) = exp(−t²/(2σ_i²)),

where σ_i is the standard deviation along the i-th dimension. Then, the rank-one cosine harmonic har_cos and sine harmonic har_sin are found for each dimension:

har_cos^(i)(t) = cos(ω_i t),  har_sin^(i)(t) = sin(ω_i t),

where ω_i is the central frequency along the i-th dimension. The required sub-filters consist of a 1D Gaussian function and the corresponding harmonic function, i.e., g_(σ_i)(t) cos(ω_i t) and g_(σ_i)(t) sin(ω_i t). Finally, since cos(ω_x x + ω_y y + ω_z z) expands into four products of the 1D harmonics, the 3D Gabor response of image I can be obtained by summing the responses of the four separable (rank-one) sub-filters.
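The rank-one decomposition can be illustrated numerically: because cos(ω_x x + ω_y y + ω_z z) expands into four products of 1D harmonics, the real 3D Gabor response can be computed as four sums of separable 1D convolutions. The parameter values below are placeholders for illustration, not those used in [29].

```python
import numpy as np

def conv1d_axis(vol, k, axis):
    # 'same' 1-D convolution applied along one axis of a 3-D array
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, vol)

def gabor3d_separable(vol, sigma=1.5, omega=(0.5, 0.5, 0.5), radius=3):
    """Real part of a 3D Gabor response via four rank-one sub-filters."""
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))          # 1D Gaussian envelope
    C = [g * np.cos(w * t) for w in omega]          # Gaussian-windowed cosines
    S = [g * np.sin(w * t) for w in omega]          # Gaussian-windowed sines

    def sep(k0, k1, k2):
        # separable 3D convolution = three successive 1D convolutions
        return conv1d_axis(conv1d_axis(conv1d_axis(vol, k0, 0), k1, 1), k2, 2)

    # cos(a + b + c) = ccc - ssc - scs - css, hence four separable terms
    return (sep(C[0], C[1], C[2]) - sep(S[0], S[1], C[2])
            - sep(S[0], C[1], S[2]) - sep(C[0], S[1], S[2]))
```

The separable form trades one full 3D convolution for twelve cheap 1D convolutions, which is the source of the efficiency gain of the low-rank filters.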

Relative Total Variation Structure Analysis
The proposed feature fusion method, relative total variation structure analysis (RTVSA), includes two parts: RTV-based fusion and feature dimension reduction. The relative total variation (RTV) [61] model can capture meaningful structural information and remove redundant texture information from the image. Furthermore, the approach is general, i.e., it can decompose even anisotropic texture.
The RTV model does not presuppose the form of the texture to be excluded, but instead relies on a novel windowed inherent variation measure. For the combined HSI and LiDAR image s after feature extraction, the RTV model can be expressed as

min_s Σ_i { (s_i − I_i)² + λ ( D_x(i)/(L_x(i) + ε) + D_y(i)/(L_y(i) + ε) ) },  (14)

with the windowed total variations D_x(i) = Σ_(j∈R(i)) g_(i,j) |(∂_x s)_j| and D_y(i) = Σ_(j∈R(i)) g_(i,j) |(∂_y s)_j|, and the windowed inherent variations L_x(i) = |Σ_(j∈R(i)) g_(i,j) (∂_x s)_j| and L_y(i) = |Σ_(j∈R(i)) g_(i,j) (∂_y s)_j|, where j is the index of all pixels in a square region R(i) centered on point i, g_(i,j) is a weighting function specified according to spatial affinity, λ balances data fidelity against structure extraction, and ε avoids division by zero.
In [61], the authors transformed the original nonlinear problem into a sequence of subproblems that are easier to solve: since the nonlinear objective can be translated into solving a sequence of linear equations, the RTV penalty is decomposed into a nonlinear term and a quadratic term. To solve Equation (14), the penalty in the x direction is expanded as

Σ_i D_x(i)/(L_x(i) + ε) = Σ_j u_(x_j) w_(x_j) (∂_x s)²_j,

and the y-direction term is handled similarly. The quadratic term (∂_x s)²_j and the nonlinear component u_(x_j) w_(x_j) are updated together at each iteration; the latter can be expressed as

u_(x_j) = ( G_σ ∗ 1/(|G_σ ∗ ∂_x s| + ε) )_j,  w_(x_j) = 1/(|(∂_x s)_j| + ε),

where G_σ is the Gaussian kernel function with standard deviation σ and ∗ is the convolution symbol.
Then, Equation (14) can be transformed into the following matrix form:

(v_s − v_I)^T (v_s − v_I) + λ ( v_s^T C_x^T U_x W_x C_x v_s + v_s^T C_y^T U_y W_y C_y v_s ),

where v_s and v_I represent the column vectors of s and I, respectively, and C_x and C_y are the Toeplitz matrices of the forward difference gradient operators. U_x, U_y, W_x, and W_y are all diagonal matrices whose diagonal values are u_(x_j), u_(y_j), w_(x_j), and w_(y_j), respectively. Setting the derivative of the above expression with respect to v_s to zero yields the following linear equation:

(1 + λ L_t) · v_s^(t+1) = v_I,
where 1 is the identity matrix and L_t = C_x^T U_x W_x C_x + C_y^T U_y W_y C_y is the weight matrix computed from the current estimate v_s^t.

The data processed by the RTV still retain a large number of features and may be redundant. Many studies have shown that nonparametric weighted feature extraction (NWFE) is a powerful tool for extracting features of remote sensing images. The main idea of NWFE is to assign different weights to each sample and to define new nonparametric between-class and within-class scatter matrices. In NWFE, the distance between each sample x_l^(i) (the l-th sample in class i) and its weighted mean M_j(x_l^(i)) with respect to class j is computed first, where the weighted mean is

M_j(x_l^(i)) = Σ_(k=1)^(n_j) w_(lk)^(i,j) x_k^(j),

with weights w_(lk)^(i,j) inversely proportional to the distance dist(x_l^(i), x_k^(j)) and normalized to sum to one. The scatter matrix weight of each sample is then

µ_l^(i,j) = dist(x_l^(i), M_j(x_l^(i)))^(−1) / Σ_(t=1)^(n_i) dist(x_t^(i), M_j(x_t^(i)))^(−1).

NWFE extracts L features by using these weights between and within classes: the extracted L features are the L eigenvectors with the largest eigenvalues of (S_w^NW)^(−1) S_b^NW, where the within-class scatter matrix is defined as

S_w^NW = Σ_(i=1)^c P_i Σ_(l=1)^(n_i) (µ_l^(i,i)/n_i) (x_l^(i) − M_i(x_l^(i))) (x_l^(i) − M_i(x_l^(i)))^T,

and the between-class scatter matrix for c classes is defined as

S_b^NW = Σ_(i=1)^c P_i Σ_(j=1, j≠i)^c Σ_(l=1)^(n_i) (µ_l^(i,j)/n_i) (x_l^(i) − M_j(x_l^(i))) (x_l^(i) − M_j(x_l^(i)))^T,

where P_i is the prior probability of class i and n_i is the number of samples in class i.
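The fixed-point iteration above can be sketched with dense matrices for a tiny single-band image; real implementations use sparse solvers, and the small Gaussian truncation radius and the single shared ε here are simplifications of this sketch.

```python
import numpy as np

def rtv_smooth(img, lam=0.04, sigma=2.0, eps=1e-3, iters=4):
    """Dense-matrix sketch of the iterative RTV solver
    (1 + lam * L_t) v = v_I for a tiny single-band image."""
    H, W = img.shape
    n = H * W
    vI = img.astype(float).ravel()

    def fwd_diff(axis):
        # forward-difference operator as an n x n matrix (zero at far border)
        D = np.zeros((n, n))
        for i in range(H):
            for j in range(W):
                p = i * W + j
                if axis == 0 and i < H - 1:
                    D[p, p], D[p, (i + 1) * W + j] = -1.0, 1.0
                if axis == 1 and j < W - 1:
                    D[p, p], D[p, i * W + j + 1] = -1.0, 1.0
        return D

    def gauss_blur(v):
        # separable Gaussian blur with a small truncation radius (toy choice)
        r = max(1, int(sigma))
        t = np.arange(-r, r + 1, dtype=float)
        k = np.exp(-t ** 2 / (2 * sigma ** 2))
        k /= k.sum()
        a = v.reshape(H, W)
        a = np.apply_along_axis(lambda u: np.convolve(u, k, mode="same"), 0, a)
        a = np.apply_along_axis(lambda u: np.convolve(u, k, mode="same"), 1, a)
        return a.ravel()

    Cx, Cy = fwd_diff(1), fwd_diff(0)
    vs = vI.copy()
    for _ in range(iters):
        Lt = np.zeros((n, n))
        for C in (Cx, Cy):
            d = C @ vs                                  # current gradients
            u = gauss_blur(1.0 / (np.abs(gauss_blur(d)) + eps))
            w = 1.0 / (np.abs(d) + eps)
            Lt += C.T @ np.diag(u * w) @ C              # C^T U W C term
        vs = np.linalg.solve(np.eye(n) + lam * Lt, vI)  # (1 + lam L_t) v = v_I
    return vs.reshape(H, W)
```

Because each update re-solves against the original v_I with weights from the previous estimate, λ = 0 returns the input unchanged and larger λ progressively flattens texture while the structure-aware weights protect strong edges.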

Materials
Two Houston datasets, including hyperspectral images and LiDAR data, were adopted to evaluate the effectiveness of the proposed RTVSA method. Table 1 lists the detailed information about the training and test sets.

2012 Houston Dataset
The 2012 Houston dataset (http://www.classic.grss-ieee.org/community/technical-committees/data-fusion/2013-ieee-grss-data-fusion-paper-contest-results/, accessed on 23 December 2020), originally distributed for the 2013 IEEE Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest (DFC), contains an HSI and a LiDAR-derived digital surface model (DSM). This dataset was gathered in June 2012 and covers the Houston University campus and the surrounding urban areas. The HSI consists of 144 spectral bands in the 380 nm to 1050 nm region and has been calibrated to at-sensor spectral radiance units. The HSI was captured by the ITRES CASI-1500 (Compact Airborne Spectrographic Imager) hyperspectral imager, and the LiDAR data were acquired using an Optech Gemini sensor. The size of the HSI and LiDAR-derived data is 349 × 1905 with a spatial resolution of 2.5 m. The ground truth map contains 15 classes and is provided in a 2.5 m ground sampling distance (GSD) raster. Figure 2 shows the HSI, the LiDAR-derived DSM, and the ground truth map. In Table 1, the division of training data and test data on the 2012 Houston dataset follows the 2013 GRSS DFC exactly.

2017 Houston Dataset
The 2017 Houston dataset (http://www.classic.grss-ieee.org/community/technical-committees/data-fusion/2018-ieee-grss-data-fusion-contest/, accessed on 23 December 2020), originally distributed for the 2018 IEEE GRSS Data Fusion Contest, is composed of HSI and multispectral LiDAR data. In this paper, the training portion of the dataset was used. The HSI, captured by an ITRES CASI-1500, contains 48 bands with a spectral range of 0.38-1.05 µm at a 1 m GSD. The multispectral LiDAR data, acquired by an Optech Titan MW (14SEN/CON340), include point cloud data at 1550 nm, 1064 nm, and 532 nm, first-return intensity grids for each channel, and digital surface model (DSM) data with a GSD of 0.5 m. The ground truth map contains 20 urban land cover/land use classes and is provided in a 0.5 m GSD raster with an image size of 1202 × 4768. Figure 3 presents the HSI, the LiDAR-derived DSM, and the ground truth map. In Table 1, for each class, one percent of the samples were randomly selected as training samples and the rest as test samples.

Experiment Settings
Three different methods including the LBP, EMAP, and Gabor were adopted in the feature extraction part. In the experiment, their parameters were designed as follows.
• EMAP: According to References [21,24,25], three different attributes are considered when constructing the EMAP: the area, the standard deviation, and the length of the diagonal of the bounding box. The EMAP with the area attribute describes the proportion of the structure in the scene; to create the profile with the area attribute, the sizes of the SE were set to 10, 15, and 20. The standard deviation attribute performs a multi-layer decomposition of objects in the scene that is not related to the geometric shape of the area, but models the gray-level uniformity of the pixels in the area [21]; for the standard deviation, the size of the SE was 150. The length attribute gives the diagonal length of the smallest rectangle surrounding the connected components, and the sizes of the SEs were set to 50, 100, and 500.
• Gabor: Before extracting Gabor features, the first seven principal components of the HSI were extracted using principal component analysis. For each Gabor filter, the frequency length |ω| was set to 1/2π, 1/4π, 1/8π, and 1/16π, and the angle between the frequency and the spectral dimension was set to 0, 1/4π, 1/2π, and 3/4π.
• LBP: The input data were first normalized. The radius Q of the circle was set to one, and the number of data points P on the circularly symmetric neighbor set was eight.
The parameters of the RTVSA method were set as follows: λ = 0.04, σ = 2, and L = 30. In image classification, according to the results in Reference [62] and our previous work [51], the number of decision trees in the RF was set to 100, and the number of prediction variables was set approximately to the square root of the number of input bands. The settings of the CNN were based on [59]. In the 3D-CNN framework, the sizes of the 3D convolution kernels were 8 × 3 × 3 × 7 × 1, 16 × 3 × 3 × 5 × 8, and 32 × 3 × 3 × 3 × 16. The mini-batch size was 256, and the network was trained for 100 epochs without batch normalization or data augmentation. No pre-trained models were used, and all models were trained from scratch. According to the classification results, an optimal learning rate of 0.001 was selected. The training and test datasets were randomly selected from the input data according to Table 1.

Evaluation Indexes
To evaluate the performance of fusion image classification, four commonly used quantitative metrics, namely class accuracy (CA), overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, were used to assess the classification results. Specifically, CA measures the percentage of correctly classified pixels in each class. OA represents the proportion of correctly classified samples among all samples. AA is the average of the per-class percentages of correctly classified pixels. The Kappa coefficient measures the agreement between the classification result and the reference data while correcting for agreement that occurs by chance.
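All four metrics can be computed from a confusion matrix; the sketch below takes rows as reference classes and columns as predictions, which is one common convention.

```python
import numpy as np

def accuracy_metrics(conf):
    """OA, AA, per-class CA, and the Kappa coefficient from a confusion
    matrix (rows = reference classes, columns = predicted classes)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                      # overall accuracy
    ca = np.diag(conf) / conf.sum(axis=1)            # per-class accuracy
    aa = ca.mean()                                   # average accuracy
    # chance agreement from the row and column marginals
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, ca, kappa
```

A Kappa of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance given the class proportions.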

The 2012 Houston Dataset
For the 2012 Houston dataset, the RTVSA fusion method was applied based on the LBP, EMAP, and Gabor feature extraction, respectively. Random forest and CNN classifiers were used to classify the results of different fusion methods. The RF and CNN classification performances are shown in Tables 2 and 3, respectively. The best results are shown in bold. r represents the number of bands in each image.
It can be seen from Table 2 that, compared to directly classifying the original data, the data obtained through feature extraction can effectively improve the performance of the random forest. In particular, the spatial information extracted by the EMAP can significantly improve the classification accuracy; for example, applying the EMAP improves the AA of the HSI by nearly 7.72%. Among all the methods considered in this paper, the RTVSA fusion method captured the redundant information in the HSI and LiDAR data and eliminated unnecessary texture information, and all OAs using the RTVSA method exceeded 90%. Moreover, since the proposed method takes both spectral and spatial information into account, it showed excellent accuracy for specific classes, such as synthetic grass, soil, tennis court, and running track.
Among the CNN classification results, the EMAP extracted multi-scale spatial features, and the RTVSA fusion based on EMAP feature extraction had the best performance. For example, in terms of OA, the classification accuracy of fused images based on the EMAP was 11.93% higher than that of the LBP and 2.93% higher than that of Gabor. Nevertheless, the HSI data after feature extraction performed well in certain classes; for example, the classification accuracy of synthetic grass, soil, tennis court, and running track in the HSI data based on Gabor feature extraction reached 100%. HSI data themselves have rich spectral information, and the data after spatial feature extraction were more conducive to image classification. When LiDAR data were integrated, complementary information was added, which further helped image classification.
The classification maps using RF and CNN classifiers obtained by the different fusion approaches on the 2012 Houston data are shown in Figures 4 and 5, respectively. The sample labels are consistent with those in Figure 2. Both figures demonstrate that the proposed fusion method gives a classification map with homogeneous regions while maintaining the structure. For this urban dataset, the use of the RTV not only boosts the smoothness, but also preserves the existing structure. It is worth noting that the CNN classification map for the unprocessed LiDAR data is skewed because the number of LiDAR bands is too limited.

The 2017 Houston Dataset

For the 2017 Houston dataset, the RF and CNN classification performances are shown in Tables 4 and 5, respectively. Different from the 2012 Houston dataset, the LBP-based RTVSA fusion method achieved the best classification accuracy on the 2017 Houston dataset, which is related to the spectral and spatial information contained in the original data. Based on LBP feature extraction, the classification results after RTVSA fusion were excellent: as far as OA is concerned, the LBP-based fusion achieved an OA of 91.83%, which was 15.44% higher than that of the HSI and 14.33% higher than that of the LiDAR data. The CNN classification results confirmed the above. The results indicate that the proposed fusion method is stable enough to accurately identify complex classes with few features. However, since the 2017 Houston dataset covers only a part of the 2012 scene and its 1 m GSD provides much more detail, the classification performance on the 2017 Houston dataset was slightly inferior to that on the 2012 Houston dataset.
The classification maps using RF and CNN classifiers obtained by the different fusion approaches on the 2017 Houston data are shown in Figures 6 and 7, respectively. The sample labels are consistent with those in Figure 3. In the RF classification maps (Figure 6), the fusion data based on the LBP and RTVSA showed large differences between classes, which could therefore be distinguished well. There was a wide difference between the Gabor-based HSI and LiDAR classification maps, primarily because the spatial information contained in the LiDAR data differs considerably from the spectral information contained in the HSI data. In Figure 7, it can be seen that using the EMAP makes the classification map more uniform by extracting the structure of the various objects, compared to the case where only spectral information is considered. This was further enhanced by the proposed method, which fuses spectral, spatial, and elevation information.

Parametric Analysis
The effects of three important parameters, namely λ, σ, and L, on the performance of the proposed RTVSA method are discussed in this section. The experiments were conducted on the 2012 Houston dataset, using the RF and CNN classifiers, respectively. The other parameters were fixed when varying λ, i.e., σ = 2 and L = 30. Similarly, the other parameters were fixed when determining the effect of σ, i.e., λ = 0.04 and L = 30. When measuring the effect of L, λ and σ were set to 0.04 and 2, respectively. The effects of these parameter pairs on classification performance are shown in Figures 8 and 9.
It can be seen from Figure 8 that, as the value of λ increases, the RF classification accuracy also shows an upward trend. The behavior of σ is similar to that of λ, and when the value of σ exceeds three, the classification accuracy essentially no longer improves. The classification accuracy shows an upward trend as L increases; when L exceeds 40, the classification accuracy no longer increases. This phenomenon is not surprising: more features can introduce more useful information, which benefits classification performance, but if the number of features is too large, information redundancy arises, which results in a decrease in classification accuracy. The parameter trends under the CNN classifier are similar to those of the RF; however, due to the characteristics of the CNN, it is more sensitive to changes in the parameters.

Discussion
Due to the limitations of imaging equipment, hyperspectral data, despite their rich spectral information, may not reliably identify objects with the same spectral characteristics. LiDAR data can provide complementary information alongside the HSI. To meet social and economic applications, more flexible land cover data products are needed. This paper aims to extract structural correlations from heterogeneous data in order to fuse HSI and LiDAR data. The proposed feature-level fusion method offers an effective way to extract and fuse features for improving the accuracy of land cover classification.
In this study, a TV-based feature fusion method is proposed for fusing HSI and LiDAR urban datasets. The proposed method is evaluated on two publicly available Houston datasets. The classification results of the random forest and CNN classifiers demonstrate the effectiveness of the proposed method.
(1) In this study, three feature extraction methods, the LBP, EMAP, and Gabor, are used to extract the spatial features of the HSI and LiDAR data. The results in Tables 2-5 show that the use of these three features can improve the accuracy of land cover classification. On the 2012 Houston dataset, the EMAP feature provides the highest OA. Compared with the other methods, the EMAP feature obtains the highest classification accuracy in the classes of healthy grass, stressed grass, synthetic grass, trees, soil, residential, commercial, and roads, while it is slightly inferior to the LBP feature in the other classes. The EMAP extracts multi-scale structural information, whereas the LBP feature captures the texture of a local area. On the 2017 Houston dataset, different from the 2012 Houston dataset, the LBP obtains the highest classification accuracy. The reason is that the 2017 Houston dataset, with a 1 m GSD, contains more detailed information, and the portion of the 2017 Houston dataset used in this paper covers only a part of the 2012 Houston dataset. When a sample lies in a homogeneous region, the LBP feature plays an important role; when a sample contains more boundary regions, the EMAP feature performs better.
(2) Although the two datasets used in this paper partly cover the same area, the classification accuracy on the 2012 Houston dataset is slightly higher than that on the 2017 Houston dataset. There are several reasons for this. The two datasets have different spatial resolutions: the 2012 Houston dataset has a 2.5 m GSD, while the HSI of the 2017 Houston dataset has a 1 m GSD. The dimensions of the HSI and LiDAR data differ between the two datasets, and the two datasets use different data types: the 2017 Houston dataset contains multispectral LiDAR, while the 2012 Houston dataset contains a LiDAR-derived DSM. In addition, both datasets involve the area of Houston University, but their geo-referencing is beyond the scope of this paper.

(3) The feature-level fusion method proposed in this paper only uses a single spatial feature at a time. Some researchers have found that the fusion of multiple features can improve the accuracy of land cover classification [63,64]. Therefore, the fusion of multiple features to improve the usability of remote sensing images will be considered in future work.

Conclusions
In this paper, a feature-level fusion method based on total variation is proposed to combine the heterogeneous features of HSI and LiDAR data. The proposed method consists of two parts, feature extraction and feature fusion. In feature extraction, three effective methods (the LBP, EMAP, and Gabor) are adopted to extract spatial features, among which the LBP and EMAP contribute most to improving land cover classification. In feature fusion, the proposed RTVSA method removes useless texture and noise information and enhances class separability, effectively combining the complementary information of the various data sources. Two urban Houston University datasets, the 2012 Houston dataset and the training portion of the 2017 Houston dataset, are used to evaluate the proposed method. Experimental results show that the RTVSA fusion method can capture feature redundancy while improving classification accuracy on the study datasets. Moreover, the RTVSA retains the structure and provides a uniform classification map. In the future, the fusion of multiple features to improve the usability of remote sensing images will be considered.