A Two-Staged Feature Extraction Method Based on Total Variation for Hyperspectral Images

Effective feature extraction (FE) has always been a focus of hyperspectral image (HSI) processing. For aerial remote-sensing HSI processing and land cover classification, this article proposes an efficient two-staged hyperspectral FE method based on total variation (TV). In the first stage, the average fusion method is used to reduce the spectral dimension. Then, the anisotropic TV model with different regularization parameters is utilized to obtain featured blocks of different smoothness, each containing multi-scale structure information, and these are stacked as the next stage's input. In the second stage, after singular value decomposition reduces the dimension again, an isotropic TV model based on the split Bregman algorithm performs further detail smoothing. Finally, the feature-extracted block is fed to a support vector machine for classification experiments. The results on three hyperspectral datasets demonstrate that the proposed method competitively outperforms state-of-the-art methods in classification accuracy and computing time. Comprehensive parameter analysis also shows that the proposed method is robust and stable.


Introduction
Hyperspectral imaging technology develops from multi-spectral imaging: in the spectral range from ultraviolet to near-infrared, an imaging spectrometer continuously scans scenes within tens or hundreds of spectral bands. Therefore, hyperspectral images (HSIs) not only capture spatial features but also obtain rich spectral information from each pixel, which enables classification and recognition of target objects more efficiently than traditional images. Nowadays, many HSIs passively acquired from satellite or airborne platforms cover broad ranges of land cover; they are widely used in many fields such as urban mapping [1], agriculture [2], forestry [3], and environmental monitoring [4]. In addition, HSIs can also be obtained by active remote sensing technology [5], which usually utilizes wide-spectrum light sources [6] in place of the sun to illuminate the scenes and which plays a significant role in object detection [7] and recognition [8]. HSI classification has always been a hot application topic among these fields, as it can provide high-level intuitive judgment and interpretation, especially for land use and analysis.
However, some characteristics of HSIs bring difficulties to their application. Firstly, a few pixels of an HSI may represent land cover of tens of square meters, so the availability of some specific samples is limited, which leads to low accuracy when directly using the spatial information of HSIs for scene detection. In addition, as the number of bands rises, the amount of data expands sharply; adjacent bands are highly correlated, and noise and redundant information increase correspondingly, especially for active technology because of the stability of the light source, which also affects classification accuracy. More importantly, these characteristics bring considerable time and storage overhead in the computation of related algorithms. Therefore, effective and discriminative feature extraction (FE) has become the key to hyperspectral technology [9], and many methods have been applied in the hyperspectral community. Regarding HSIs as three-dimensional cube data, data projection and transformation are commonly used FE methods, such as principal component analysis (PCA) [10], independent component analysis (ICA) [11], and maximum noise fraction (MNF) [12], which are dedicated to linearly transforming the data into a low-dimensional feature space and reducing the band dimension of HSIs. In addition, supervised linear transformation methods are used, such as linear discriminant analysis (LDA) [13] and some extended versions, including low-rank discriminant analysis (LRDA) [14], locally weighted discriminant analysis (LWDA) [15], and flexible Gabor-based superpixel-level unsupervised linear discriminant analysis [16]. Considering the nonlinear characteristics of HSIs, some techniques using kernel methods have been proposed, such as kernel PCA (KPCA) [17], which can obtain a linearly separable training set by nonlinearly mapping data to a high-dimensional space.
What's more, manifold learning methods [18] have been continuously developed; advanced examples include a GPU parallel implementation of isometric mapping [19], which can greatly accelerate the speed of data transformation. Spatial-spectral multiple manifold discriminant analysis (SSMMDA) [20] can explore spatial-spectral combined information and reveal the intrinsic multi-manifold structure in HSIs. In [21], a novel local constrained collaborative representation model was designed; it can characterize the collaborative relationship between sample pairs and explore the intrinsic geometric structure in HSI. Projection and transformation techniques usually look for the separability of the data in the transformed feature space. Generally, only the internal pattern of the data itself is considered, and specific physical characteristics are usually lost, especially the continuous correlation of the spectral information in HSI. In addition, data mapping methods are often used as one step of FE, such as completing dimension reduction.
When focusing on the continuity and high correlation of bands, some methods based on band selection and clustering [22] have been proposed to mitigate the "curse of dimensionality" problem. The combination of image fusion and recursive filtering was applied in [23]. In [24], Wang et al. designed a band selection method based on optimal neighborhood reconstruction, which exploited a recurrence relation for the optimal solution. Tang et al. [25] proposed a hyperspectral band selection method based on spatial-spectral weighted region-wise multiple-graph-fusion-based spectral clustering. In [26], an unsupervised band selection approach based on an improved genetic algorithm was proposed, which utilized modified genetic operations to reduce the redundancy of the selected bands. Band selection methods fully excavate the spectral characteristics of HSI, but in practical applications they are usually combined with spatial information for the FE process.
The method of using energy functional optimization, a common technique in traditional image processing, has been widely developed for hyperspectral FE. Its aim is to minimize a constrained loss function, which usually contains fidelity and regularization terms, and it plays an important role in restoration and noise reduction for HSI [27]. In [28], an orthogonal total variation component analysis (OTVCA) was proposed, which used a cyclic descent algorithm for solving the minimization problem. Duan et al. [29] showed a fusion and optimization framework of dual spatial information for more feature exploration. A novel multidirectional low-rank modeling and spatial-spectral total variation (MLR-SSTV) model was proposed in [30] for removing HSI mixed noise, with an efficient algorithm for solving the derived optimization based on the alternating direction method of multipliers. By adding different regularization term constraints, an l0-l1 hybrid total variation (l0-l1 HTV) was presented in [31], which yielded sharper edge preservation and superior performance in HSI restoration. However, these methods usually have more computational overhead because numerous iterations are needed to solve the optimization problem.
In addition, with the development of deep learning, many deep FE methods have been proposed for image processing. In [32], Zhu et al. addressed the definition of foggy remote sensing images. As for HSIs, rich spectral information needs to be considered, and convolutional neural networks are widely used to extract the spectral and spatial information of HSIs [33,34]. Li et al. [35] proposed a double-branch dual-attention mechanism network (DBDA) for HSI classification, where the two branches refine and optimize the extracted feature maps. Additionally, some remarkable models, such as ResNet [36] and DenseNet [37], have also been continuously developed. A method combining the scattering transform with an attention-based ResNet was given in [38], which provided higher unmixing accuracy when using limited training data. In [39], Xie et al. proposed the MS-DenseNet framework, which introduced several dense blocks to fuse the multiscale information among different layers for the final HSI classification. To achieve effective deep fusion features, a pixel frequency spectrum feature based on the fast Fourier transform was presented in [40] and introduced into a 3D CNN, which showed that adding the presented frequency spectrum feature into CNNs can achieve better recognition results. However, deep learning techniques often need sufficient training sets to achieve their excellent performance [9], but samples are limited in HSI. Similarly, they often require a lot of computing resources and expense.
For aerial remote sensing HSIs processing and their land cover classification, there are several core problems in the FE method to be considered and addressed. According to the different sensors and their scanning positions, the spatial resolution of different HSIs is diverse, so developing an FE method suitable for multi-scale HSIs is key. At the same time, FE needs to fully excavate the spatial characteristics of HSI so that the feature information of the various classes can be fully retained, especially considering the limited samples of individual classes. On the other hand, computing overhead has increasingly become the bottleneck of such tasks, and it is necessary to design lightweight methods for real-time applications. Stemming from a motivation to solve these key issues, in this article, an efficient two-staged hyperspectral FE method based on total variation (TV) [41] is proposed. In the first stage, in order to extract multi-scale structural information, we used the anisotropic total variation model (ATVM) [42] under numerical solution to process the average-fused data. The model, with different regularization parameters, outputs featured blocks of different smoothness, each containing multi-scale structure information; we then stacked them as the next stage's input. In the second stage, in order to prevent weakening of the feature representation of small-sample classes and to fully strengthen the information of all classes, we used an isotropic total variation model (ITVM) based on the split Bregman algorithm [43] to process the data after SVD of the stacked block, where complete dimension reduction and global smoothing were performed. Finally, the result was fed to the spectral classifier for the classification task. The main contributions of our study are summarized as follows:

1. This work innovatively proposes an efficient two-staged FE method based on total variation for HSI. Based on different solutions of the anisotropic and isotropic models, it successively completes the extraction of multi-scale structure information and the detail-smoothing enhancement of HSI, improving the discriminative ability for different land covers.

2. Our design has no complex framework or redundant loops, which greatly reduces computational overhead. Compared with many state-of-the-art algorithms, our method significantly outperforms in classification performance and computing time, achieving better results in most classes.

3. We give a sufficiently detailed parameter analysis, with reasonable values and explanations of each parameter's effect. There is no need to reset parameters for diverse datasets, and the results show that our method has strong robustness and stability, strengthening its advantages in practical hyperspectral applications.
The remainder of this article is structured as follows. In Section 2, the proposed method is described. Section 3 shows experimental results and comparisons using three real datasets. Section 4 gives parameter analysis and discussion in detail. The conclusions and future work are presented in Section 5.

Proposed Method
This section is divided into three parts. First, the ATVM under the numerical solution of the first stage is described. The second part gives the ITVM based on the split Bregman algorithm, and the third part shows the overall design of the proposed method.

ATVM
The energy-function optimization model often adds different regularization terms, such as TV terms, to maintain the smoothness and suppress the noise of the image. The ATVM under numerical solution, which is derived from relative TV [44], can effectively decompose the structure and texture in an image and strengthen the retention of structural information. The general form aims to solve the following optimization problem:

min_F ∑_p { (F_p − R_p)² + λ · TV(p) },   (1)

where R denotes a single-band image of the input raw HSI, F is the output featured image of the corresponding band, p denotes the index of the two-dimensional image pixels, and λ is the regularization parameter. The TV term can be written in the following anisotropic form:

TV(p) = |(∂_x F)_p| + |(∂_y F)_p|.   (2)

To better extract the structure, the improved model is as follows:

min_F ∑_p { (F_p − R_p)² + λ · ( D_x(p)/(L_x(p) + ε) + D_y(p)/(L_y(p) + ε) ) },   (3)

where ε is a small positive number to prevent the divisor from being 0, and D_x(p) and L_x(p) are, respectively, called the windowed total variation and the windowed inherent variation in the x direction for pixel p, expressed as

D_x(p) = ∑_{q∈R(p)} g_{p,q} · |(∂_x F)_q|,   L_x(p) = | ∑_{q∈R(p)} g_{p,q} · (∂_x F)_q |,   (4)

where D_y(p) and L_y(p) represent the y direction; their definition and calculation are the same as in the x direction. In Formula (4), q is the index of all pixels in a square window R(p) centered on p, and g is the weighting function:
g_{p,q} ∝ exp( −((x_p − x_q)² + (y_p − y_q)²) / (2σ²) ),   (5)

where σ specifies the scale size of the filtering elements of the window. Then the regularization term in Formula (3) can be approximately calculated as follows:

∑_p D_x(p)/(L_x(p) + ε) ≈ ∑_p u_x(p) · w_x(p) · (∂_x F)_p²,
with u_x = G_σ * ( 1 / (|G_σ * ∂_x F| + ε) ),   w_x(p) = 1 / (|(∂_x F)_p| + ε_2),   (6)

where G_σ is the Gaussian filter with standard deviation σ, * is the convolution symbol, and ε_2 is a small value to prevent division by zero. The calculation in the y direction is done the same way.
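As a rough sketch, the weight products u_x·w_x and u_y·w_y in Formula (6) can be computed with Gaussian filtering. The function below is illustrative only: the names, padding behavior, and default constants are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def atv_weights(F, sigma=3.0, eps=1e-3, eps2=1e-9):
    """Approximate the per-pixel weight products u_x*w_x and u_y*w_y of the
    ATVM regularizer (a sketch following the relative-TV approximation)."""
    # forward differences; the last row/column difference is set to zero
    dx = np.diff(F, axis=1, append=F[:, -1:])
    dy = np.diff(F, axis=0, append=F[-1:, :])
    # u: Gaussian-smoothed inverse of the windowed inherent variation
    ux = gaussian_filter(1.0 / (np.abs(gaussian_filter(dx, sigma)) + eps), sigma)
    uy = gaussian_filter(1.0 / (np.abs(gaussian_filter(dy, sigma)) + eps), sigma)
    # w: inverse of the pointwise gradient magnitude
    wx = 1.0 / (np.abs(dx) + eps2)
    wy = 1.0 / (np.abs(dy) + eps2)
    return ux * wx, uy * wy
```

The two returned maps correspond to the diagonal entries of the matrices U·W used in the matrix form of the model.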
Then, with these operators, Formula (3) can be written in the following matrix form:

min_{v_F} (v_F − v_R)^T (v_F − v_R) + λ ( v_F^T C_x^T U_x W_x C_x v_F + v_F^T C_y^T U_y W_y C_y v_F ).   (7)
Taking the derivative of Equation (7), one can obtain the following linear equation:

(1 + λ M^t) · v_F^{t+1} = v_R,   M^t = C_x^T U_x^t W_x^t C_x + C_y^T U_y^t W_y^t C_y,   (8)

where v_F and v_R represent the column vectors of F and R, respectively; C_x and C_y are forward difference operators; and U and W are diagonal matrices whose values come from u and w in their respective directions. The numerical solution of ATVM is briefly given by analytical Formula (8), which mainly includes two steps: calculation of the coefficient matrix M and solution of the linear equation. Of the two key parameters, λ controls the smoothness and σ specifies the scale size. Here, we creatively used the scale-size parameter σ as the loop condition, entering the fixed σ as the maximum and extracting structural information at multiple scales with a small number of loops. In other words, we use few loops to extract the features of different scales in one output, without setting the number of iterations in advance, reducing redundant iterations. The calculation process of the whole model is briefly summarized in Algorithm 1. In lines 5-7, we use the same coefficients to solve the linear equation for each dimension of the image.
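One solve of the linear system in Formula (8) can be sketched with sparse matrices, assuming the per-pixel weight products u·w are given. The operator construction and boundary handling below are illustrative assumptions, not the authors' code.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def atv_step(R, weights_x, weights_y, lam=0.01):
    """Solve (I + lam * M) vF = vR with
    M = Cx^T diag(wx) Cx + Cy^T diag(wy) Cy (illustrative sketch)."""
    h, w = R.shape
    n = h * w

    def fwd_diff(m):
        # forward difference with a zero last row, so constants lie in the
        # null space of the operator
        main = np.concatenate([-np.ones(m - 1), [0.0]])
        return sp.diags([main, np.ones(m - 1)], [0, 1], shape=(m, m))

    Cx = sp.kron(fwd_diff(w), sp.identity(h))   # differences along x (columns)
    Cy = sp.kron(sp.identity(w), fwd_diff(h))   # differences along y (rows)
    Ux = sp.diags(weights_x.flatten(order="F"))
    Uy = sp.diags(weights_y.flatten(order="F"))
    M = Cx.T @ Ux @ Cx + Cy.T @ Uy @ Cy
    vF = spsolve((sp.identity(n) + lam * M).tocsc(), R.flatten(order="F"))
    return vF.reshape(h, w, order="F")
```

Because the coefficient matrix depends only on the current weights, the same factorization can be reused for every band of the image, which matches the per-dimension solve described for Algorithm 1.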

ITVM
Using the isotropic expression of TV term, ITVM pays more attention to the global unified detail smoothing of the image rather than structure extraction, which is widely used in image deblurring and restoration [37]. The split Bregman algorithm is an extremely efficient algorithm to solve convex optimization problems, especially ITVM, and is briefly given as follows.
ITVM aims to solve

min_F ∑_p |∇F_p| + (µ/2) · ‖F − R‖₂²,   (9)

where µ is the fidelity parameter and |∇F_p| = √((∇_x F)_p² + (∇_y F)_p²). Setting d_x ≈ ∇_x F and d_y ≈ ∇_y F, the equality-constrained optimization problem can be transformed into an unconstrained one as follows:

min_{F, d_x, d_y} ∑_p √(d_{x,p}² + d_{y,p}²) + (µ/2)·‖F − R‖₂² + (λ_i/2)·‖d_x − ∇_x F − b_x‖₂² + (λ_i/2)·‖d_y − ∇_y F − b_y‖₂²,   (10)

where ‖·‖₂ denotes the L2 norm, λ_i denotes the regularization parameter in ITVM, and b_x and b_y are the Bregman auxiliary variables. For fast iteration, we chose the Gauss-Seidel method, obtaining

F_{i,j}^{k+1} = (λ_i/(µ + 4λ_i)) · ( F_{i+1,j}^k + F_{i−1,j}^k + F_{i,j+1}^k + F_{i,j−1}^k + d_{x,i−1,j}^k − d_{x,i,j}^k + d_{y,i,j−1}^k − d_{y,i,j}^k − b_{x,i−1,j}^k + b_{x,i,j}^k − b_{y,i,j−1}^k + b_{y,i,j}^k ) + (µ/(µ + 4λ_i)) · R_{i,j},   (11)

where i, j represent the coordinates of the pixel p, and k denotes the number of iterations. The ITVM based on the split Bregman algorithm is given in Algorithm 2, and the comprehensive mathematical process is detailed in [43]. For simplicity, Algorithm 2 shows the process of single-band image processing.
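The split Bregman loop can be sketched as follows. This is an illustrative implementation under our own assumptions: a vectorized Jacobi sweep replaces the pixel-wise Gauss-Seidel update of Formula (11), and the boundary handling is ours, so it is a sketch of the technique rather than the authors' code.

```python
import numpy as np

def itvm_split_bregman(R, mu=100.0, lam=None, n_iter=20):
    """Isotropic-TV smoothing via split Bregman (sketch after Goldstein
    and Osher); lam defaults to 2*mu, as suggested in the text."""
    lam = 2.0 * mu if lam is None else lam
    F = R.copy()
    dx = np.zeros_like(R); dy = np.zeros_like(R)
    bx = np.zeros_like(R); by = np.zeros_like(R)

    def grad_x(A):  # forward difference, zero at the right border
        return np.diff(A, axis=1, append=A[:, -1:])

    def grad_y(A):
        return np.diff(A, axis=0, append=A[-1:, :])

    def div(px, py):  # backward-difference divergence (adjoint of grad)
        gx = np.diff(px, axis=1, prepend=np.zeros((px.shape[0], 1)))
        gy = np.diff(py, axis=0, prepend=np.zeros((1, py.shape[1])))
        return gx + gy

    for _ in range(n_iter):
        # F-subproblem: one Jacobi sweep of (mu - lam*Laplacian) F = rhs
        rhs = mu * R - lam * div(dx - bx, dy - by)
        P = np.pad(F, 1, mode="edge")   # replicate borders for the 4-neighbor sum
        nbr = P[:-2, 1:-1] + P[2:, 1:-1] + P[1:-1, :-2] + P[1:-1, 2:]
        F = (rhs + lam * nbr) / (mu + 4.0 * lam)
        # d-subproblem: isotropic shrinkage with threshold 1/lam
        gx, gy = grad_x(F) + bx, grad_y(F) + by
        s = np.sqrt(gx ** 2 + gy ** 2)
        shrink = np.maximum(s - 1.0 / lam, 0.0) / np.maximum(s, 1e-12)
        dx, dy = shrink * gx, shrink * gy
        # Bregman update: b <- b + grad(F) - d
        bx, by = gx - dx, gy - dy
    return F
```

A smaller µ gives stronger smoothing, which is consistent with µ being the fidelity weight in Formula (9).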

Overall Design
The overall flowchart of our proposed method is given in Figure 1, and its overall calculation process is described in Algorithm 3. The original HSI has a large band dimension. In order to reduce the calculation time of the subsequent steps and the influence of image noise, we first use the average fusion method for preliminary dimension reduction in the first stage. Average fusion is an effective band selecting and merging method, and, more importantly, its calculation is very efficient [23]. The simple rule of the average fusion method is as follows. We denote the hyperspectral data of M bands as H_i (0 < i ≤ M) and split it into N groups. Then the information of the jth band after dimension reduction can be obtained by the following fusion method (0 < j ≤ N):

H̄_j = mean( H_{(j−1)·B+1}, …, H_{j·B} ),   (12)

where B = ⌊M/N⌋, the largest integer not greater than M/N, and mean represents taking the average of each group along the band dimension.
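The fusion rule above can be sketched in a few lines; the handling of leftover bands when M is not divisible by N (dropping them) is our assumption.

```python
import numpy as np

def average_fusion(H, n_groups):
    """Reduce the band dimension of an (h, w, M) hyperspectral cube by
    splitting the M bands into n_groups contiguous groups of B = M // n_groups
    bands and averaging each group; trailing bands beyond B*n_groups are
    dropped (an assumption for the non-divisible case)."""
    h, w, M = H.shape
    B = M // n_groups                       # bands per group, floor(M/N)
    grouped = H[:, :, :B * n_groups].reshape(h, w, n_groups, B)
    return grouped.mean(axis=3)
```

For example, fusing a 200-band cube into 15 groups averages every 13 consecutive bands into one fused band.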
Then, in the first stage, we processed the fused HSIs based on ATVM. Inspired by [45], we used three different λ values to perform three sets of smoothing of different degrees, each of which uses a fixed σ to specify the initial maximum scale size. Through this ATVM-based processing, the multi-scale structure information of the image was fully extracted. Next, we stacked these three featured blocks and used SVD to perform secondary dimension reduction at the beginning of the second stage; we then performed ITVM-based processing on each dimension using a fixed λ_i, where ITVM carries out global smoothing to fully strengthen the information of all classes. The output result was classified by SVM.
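At the level of data flow, the two stages can be sketched as below on an already fused cube. Gaussian smoothing merely stands in for the ATVM and ITVM solvers here, so only the stack/SVD/smooth structure reflects the text; the sigma mapping and all defaults are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.decomposition import TruncatedSVD

def two_stage_features(H, lams=(0.004, 0.01, 0.02), k=20):
    """Shape-level sketch of the two-stage FE pipeline on a fused (h, w, n)
    cube: three smoothing strengths -> stack -> SVD to k dims -> per-band
    detail smoothing. Gaussian filters are stand-ins for ATVM/ITVM."""
    h, w, n = H.shape
    # stage 1: three blocks of different smoothness, stacked along bands
    blocks = [np.stack([gaussian_filter(H[:, :, b], 1.0 + 100 * lam)
                        for b in range(n)], axis=2) for lam in lams]
    stacked = np.concatenate(blocks, axis=2)              # (h, w, 3n)
    # stage 2: SVD-based secondary dimension reduction ...
    flat = stacked.reshape(h * w, -1)
    k = min(k, flat.shape[1] - 1)
    feats = TruncatedSVD(n_components=k).fit_transform(flat).reshape(h, w, k)
    # ... followed by per-dimension global smoothing (ITVM stand-in)
    for i in range(k):
        feats[:, :, i] = gaussian_filter(feats[:, :, i], 0.5)
    return feats
```

Classifying each pixel's k-dimensional feature vector with an SVM would complete the pipeline described in the text.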

Datasets
Our experiments used three real hyperspectral datasets: Indian Pines, Salinas, and Houston University 2018. Their detailed parameters are given in Table 1. In addition, the three datasets have different characteristics. The spatial resolution and size of Indian Pines are small, but the range of its scenes is wide; Salinas covers farmland with a larger image size and a very uniform distribution of land objects; Houston University 2018 is a typical large-size image, released by the IEEE GRSS 2018 data fusion contest [46,47], with high spatial resolution, various classes, and scattered distribution. The full-band datasets were tested; these three different datasets could fully test the effectiveness of FE methods. The compared methods are as follows:
• Local Covariance Matrix Representation (LCMR) [49]. This is a FE method using local covariance matrices to characterize the spatial-spectral information.
• Random Patches Network (RPNet) [50]. This is an efficient deep learning-based method that directly regards the random patches taken from the image as the convolution kernels, combining both shallow and deep convolutional features.
• Multi-Scale Total Variation (MSTV) [51]. This is a noise-robust method which extracts multiscale information.
• Generalized Tensor Regression (GTR) [52]. This is a strengthened tensorial version of ridge regression for multivariate labels which exploits the discrimination information of different modes.
• Double-Branch Dual-Attention mechanism network (DBDA) [35]. This is a deep learning network that contains two branches, applying a channel attention block and a spatial attention block to capture considerable information on spectral and spatial features.
• Fusion of Dual Spatial Information (FDSI) [29]. This is a framework using the fusion of dual spatial information, which includes pre-processing FE and post-processing spatial optimization.
• l0-l1 HTV [31]. This is a hybrid model that takes full consideration of the local spatial-spectral structure and yields sharper edge preservation.
• SpectralFormer [53]. This is a backbone network based on the transformer, which learns spectrally local sequence information from neighboring bands, yielding group-wise spectral embeddings.
Among them, SVM was implemented with the LIBSVM library [44] using a Gaussian kernel with five-fold cross-validation, where the penalty parameter c ranged from 10^-2 to 10^4 and the kernel parameter γ ranged from 10^-3 to 10^4. In LCMR, the number of MNF principal components L was 20, the number of local neighboring pixels K was 220, and the window size T was 25. In RPNet, the number of patches k was 20 and the size of patches w was 21. In MSTV, the number of principal components used, N, was 30, and the band number after dimension reduction was 20. In GTR, rank-1 decomposition was used. In DBDA, the batch size was set to 16, and the optimizer was Adam with a 0.0005 learning rate. In FDSI, the smoothing parameter λ and the fusion weight µ were set to 1.2 and 0.5, respectively. In l0-l1 HTV, the sampling ratio µ was 0.1, and the parameter λ_tv was 0.08. In SpectralFormer, the mini-batch size of the Adam optimizer was 64, the CAF module was selected, and the learning rate was initialized to 5 × 10^-4 and decayed by a factor of 0.9 after each one-tenth of the total epochs. The main parameters mentioned above were the default values of the related algorithms that tend to achieve the highest performance. For more detailed settings and explanations, please refer to the respective references.
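The cross-validated grid over c and γ can be reproduced with scikit-learn's RBF kernel, which corresponds to LIBSVM's Gaussian kernel; the logarithmic step of the grid and the toy data below are our assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# five-fold cross-validated grid over the penalty C and RBF width gamma,
# matching the ranges quoted in the text (one point per decade assumed)
param_grid = {"C": np.logspace(-2, 4, 7), "gamma": np.logspace(-3, 4, 8)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# toy two-class data standing in for feature-extracted pixels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
search.fit(X, y)
```

After fitting, `search.best_params_` holds the selected (C, gamma) pair and `search.best_estimator_` the refitted classifier.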

Experimental Parameters
In our method, according to the description in Section 2, we mainly needed to set the following parameters. In the first stage, the average fusion method was used to reduce the raw data dimension to n; n tends to affect the calculation time of subsequent processing because the larger n is, the more dimensions are retained. The regularization parameter λ controls the degree of smoothness of the image. By analyzing the smoothness of the single-band image after filtering, the value ranges of the three groups of different smoothness can be deduced. The scale parameter σ specifies the initial maximum scale size and was set as the loop condition. In the second stage, the number of principal components k after SVD takes a value between 20 and 30, empirically [28]. The fidelity parameter µ for smoothing can tolerate a large value, and the regularization parameter in ITVM was set to λ_i = 2µ, which usually results in good convergence for image processing, according to [43]. The others, such as ε, ε_2, and tol, were all set to the default initialization values as stated in Algorithms 1 and 2. The final values of the parameters depended on performance: n was set to 15; λ_1, λ_2, λ_3 were set to 0.004, 0.01, 0.02, respectively; µ was set to 100; and k was set to 20. These values yielded the best performance considering the accuracy and computing time. The analysis and discussion of these main parameters are given in detail in Section 4.

Metrics and Device
Three widely used quantitative metrics were used to evaluate the performance of all the experimental methods. Average accuracy (AA) is the average of the accuracies over all classes. Overall accuracy (OA) is the proportion of correctly classified samples in the whole test set, i.e., each class's accuracy weighted by that class's share of the test samples. The Kappa coefficient (Kappa) [54] measures the agreement between the classification result and the reference, normalized for chance agreement.
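The three metrics can be computed from a confusion matrix as follows; this is a straightforward sketch and does not reflect any particular toolbox.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and Cohen's kappa from integer class labels."""
    C = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1                       # rows: reference, cols: prediction
    total = C.sum()
    oa = np.trace(C) / total               # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))   # mean per-class accuracy
    # chance agreement from row/column marginals
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

OA weights classes by their test-sample counts, while AA treats every class equally, which is why the two can rank methods differently on imbalanced scenes such as Houston University 2018.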
Experiments of all methods were performed on a personal computer equipped with an Intel i7-9750 and an NVIDIA GeForce GTX 1650. The main processor base frequency of 2.60 GHz and main memory of 64 GB ensured that all experiments were carried out easily. The software used was based on Matlab R2018b and Jupyter Notebook with PyTorch.

Indian Pines
To demonstrate the performance of our proposed method, only a few samples were selected as the training set. For Indian Pines, we randomly selected only ten pixels from each class as the training samples and used the remaining labelled pixels as the test samples. The experiments of each method were repeated ten times, and the corresponding classification results were averaged as the final value. The results are listed in Table 2, including the training set, test set, AA, OA, Kappa, and each class's accuracy. In addition, Figure 2 shows the false-color composite image, the ground-truth image, and the classification maps of each method in a single experiment.

As can be observed, the SVM method has the worst accuracy and Kappa, and its classification map is quite noisy across the whole scene, which fully shows the importance of FE for HSI. By extracting spatial-spectral feature information, the LCMR method significantly improved the results when compared to SVM. However, LCMR uses fixed-window clustering to obtain the covariance matrix representation, so obvious point-like noise appears in various junction areas. Similarly, in the classification map of GTR, there is obvious regional noise among the various classes, and the boundary regions were seriously misclassified. RPNet directly takes the random patches obtained from the image as convolution kernels without any training, but when samples are few, the random-patch pattern seems to introduce more redundant information, and the results show obvious mixed noise in the classification map. Similarly, transformers need sufficiently large-scale training to obtain excellent performance [55]. Although SpectralFormer is an advanced version of transformers for HSI, it still obtained a poor classification effect when the amount of training data was limited, and there are many misclassifications in its classification map.
In the classification map of l0-l1 HTV, all kinds of edges are well divided, but there are obvious "corrosion"-like points inside individual classes, which may be due to the strong constraint of the l0-l1 hybrid term, resulting in some internal pixel information being identified as noise, thus introducing misclassification.
MSTV and FDSI fully extract the spatial features of the image, focusing on multi-scale structure information and dual fusion and optimization, respectively, which significantly reduced the misclassification between the class-map regions. However, the accuracy was still very low in some classes, especially those with few pixels, such as Oats and Grass-pasture-mowed, because these two methods enhance structural features but weaken the feature representation of small-sample classes, which often have far fewer edges. DBDA contains two branches to capture plenty of spectral and spatial features and yielded great performance when training samples were few, but it still classified the few-pixel class Grass-pasture-mowed poorly. In addition, it led to an obviously oversmoothed classification map. On the contrary, the proposed method performed better in the classification maps and had the highest OA, AA, and Kappa. Of the 16 classes, 14 exceeded 86% accuracy and 11 exceeded 90%, both the highest among all methods. Especially for small-sample classes, our method achieved 100% accuracy on Oats and Grass-pasture-mowed. This is because the two-staged FE not only fully excavates the spatial structure information but also smoothly strengthens the detailed features, especially for the small-sample classes; the two stages complement each other to complete the comprehensive and discriminative FE of HSI.
In addition, the time cost of each method is listed in Table 3. It is easy to see that the proposed method has the least time overhead.

Salinas
For Salinas, we randomly selected only five pixels from each class as the training samples, and the remaining samples were used for testing. All experiments were repeated ten times, and each class's average accuracy, AA, OA, and Kappa are reported in Table 4. The classification maps of the different methods are shown in Figure 3. As can be seen, the scenes in Salinas are evenly distributed, with good separability, and the metrics of each method are higher than those for Indian Pines. Despite this, the proposed method still had the highest AA, OA, and Kappa. A total of 15 of the 16 classes had an accuracy higher than 85%, the best of all methods. This fully shows that our design enhances the features of global land cover, including not only edge information but also other detailed sample features. At the same time, the noise in the classification map is the smallest, and the classification is smooth in the various junction areas, indicating that the texture information is well preserved while the edges are fully extracted.
What's more, the computing time of the eight considered methods on the Salinas image is reported in Table 5. Although SVM has the smallest time overhead on this dataset, SVM is not an FE method but a basic baseline; among the remaining methods, the proposed one is the fastest.

Houston University 2018
The third experiment was performed on Houston University 2018. Due to the large size of this dataset, 100 pixels were selected for training, and the remaining were used as evaluation samples. Similarly, each group of experiments was repeated ten times and averaged. The detailed numbers of training and test samples and all the classification results are given in Table 6, and the corresponding classification maps are shown in Figure 4. It can be seen that the proposed method is also the highest in AA and Kappa. It is worth mentioning that, in the Houston dataset, the samples of the community class account for about 44% of the total. DBDA performed well in the community class, so its OA was 1.33% higher than our method's, but its AA was 4.8% lower than ours. Both LCMR and ours had 11 classes with more than 90% accuracy, but the OA of LCMR was far weaker because LCMR performed much more poorly in the community class. Overall, our algorithm performed better in more classes, which shows that it improves the discrimination of different ground objects. However, our method has a poor classification effect on Road and Sidewalk, and the performance of the other approaches in these two classes was also lower than their average level. These two classes show the characteristics of "slender and long" land cover, possibly because such features are easily weakened in the process of FE. How to solve this problem is a topic worth discussing. Similarly, the computing time is given in Table 7. When the amount of data increases to a scale such as that of the Houston University 2018 image, the running times of the methods differ significantly. In LCMR, calculating the covariance matrix feature of each pixel consumes a lot of time, and the manifold-based distance metric takes up plenty of storage space. In MSTV, the structural features require more loops, and the kernel method is also time-consuming. GTR also requires more ridge-regression calculations.
The dual fusion framework in FDSI obviously requires more computing time because of the pre- and post-processing. In l0-l1 HTV, the optimization formula of the l0-l1 regularization terms needs to be divided into five subproblems, each a constrained minimization problem that requires many iterative computations. As for the deep learning methods, RPNet has a simpler architecture and was less time-consuming than the other networks, but it still spends considerable time calculating the convolution of random patches in each layer, especially for large images. In DBDA, selecting small pieces from the original cube data costs a considerable amount of time. SpectralFormer needs hundreds of epochs to reach a good performance. The time cost of our method mainly comes from the processing of ATVM and ITVM. When inputting an HSI R^(r×c×b), r and c are the spatial dimensions and b is the spectral dimension.
The time complexity of ATVM is O(rc), and the time complexity of ITVM is O(rc/ln rc), both of which are inexpensive. For comparison, the time complexity of l0-l1 HTV is O(rcb + z + rc log rc), and its calculation time was far higher than ours. Intuitively, in ATVM, the scale size parameter acting as the loop condition keeps the number of iterations to a few. In ITVM, the large value of the fidelity parameter greatly reduces the number of loops. The average fusion method only involves averaging sums, and SVD only decomposes one matrix; therefore, both consume very little time. Overall, our method had the shortest computing time, which improves the real-time applicability of HSI processing. Moreover, our classification performance is the best.
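The dimension bookkeeping of the two stages can be made concrete with a small sketch. This is only an illustration of the flow described above, assuming the default settings discussed in the parameter analysis (n = 15 fused bands, three ATVM regularization levels, k = 20 SVD components); `feature_dims` is a hypothetical helper, not part of the paper's code:

```python
def feature_dims(b: int, n: int = 15, n_lambdas: int = 3, k: int = 20):
    """Dimension flow of the two-staged pipeline for a b-band HSI:
    average fusion keeps n bands, stacking the ATVM blocks triples them,
    and the second-stage SVD keeps k principal components."""
    assert 7 <= n <= b // 2        # n ranges from 7 to half the bands (3 * 7 > 20)
    stacked = n_lambdas * n        # three ATVM feature blocks, stacked
    reduced = min(k, stacked)      # SVD truncation in the second stage
    return n, stacked, reduced

# e.g. a 200-band cube with the defaults
print(feature_dims(200))  # (15, 45, 20)
```

The `3 × 7 > 20` constraint quoted in the parameter analysis appears here as the lower bound on n: at least seven fused bands are needed so that the 20-component SVD is well defined.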

First Stage
In the first stage, the parameter n was first investigated. In light of the characteristics of average fusion, n ranged from seven to M/2 (i.e., half the spectral dimension M of the image), where seven ensures that the stacked features can complete the SVD in the second stage, since 3 × 7 = 21 > 20. The other parameters were fixed at the default values stated in Section 3.2. The same controlled-variable method was used in the following experiments. All three datasets were tested, and their results are shown in Figure 5.
As can be seen, with the increase of n, the classification results show a downward trend on Indian Pines, with a slight increase after n exceeds 90. The decline is much smaller on Salinas and basically absent on Houston University 2018, especially for AA. Although there are certain fluctuations in the trends, two points can be determined: first, the highest accuracies of the three datasets were all basically achieved with a small n; second, the time cost increases steadily with n, especially on Houston University 2018. Therefore, n was set to 15 in our design. Qualitatively, when n is small, more band information is fused, making the final performance better. The results on the Houston University 2018 dataset change little because its spectral dimension is low, only 48, which weakens the role of average fusion. The smaller n is, the smaller the retained spectral dimension, naturally reducing the subsequent processing time. Moreover, the average fusion itself has high computational efficiency. In conclusion, the average fusion method in our design is the beginning of efficient processing.
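The average-fusion step can be written in a few lines of NumPy. The sketch below is one plausible reading of the paper's description (the b original bands are split into n contiguous groups and each group is averaged), not the authors' exact implementation:

```python
import numpy as np

def average_fusion(hsi: np.ndarray, n: int) -> np.ndarray:
    """Reduce the spectral dimension of an r x c x b cube to n bands by
    averaging n contiguous groups of adjacent bands."""
    r, c, b = hsi.shape
    edges = np.linspace(0, b, n + 1).astype(int)   # group boundaries along the spectrum
    groups = [hsi[:, :, edges[i]:edges[i + 1]].mean(axis=2) for i in range(n)]
    return np.stack(groups, axis=2)

cube = np.random.rand(145, 145, 200).astype(np.float32)  # Indian Pines-sized cube
fused = average_fusion(cube, n=15)
print(fused.shape)  # (145, 145, 15)
```

Because the operation is a single pass of sums over the cube, its cost is linear in the data size, which is consistent with the low overhead reported for this step.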

In the first stage, the second type of parameters, regularization parameter λ and scale size parameter σ, were then investigated. In the three feature blocks to be stacked, we used the same σ value, and then we set σ from one to six to observe the changes in the classification results shown in Figure 6.
It can be seen that, according to our design, when σ increases there are additional loops, which expands the time overhead, and when σ is large, the classification effect decreases. When σ = 2, our design yielded the best classification results on all three datasets. This means that with σ = 2, ATVM in our design can efficiently extract multi-scale information, especially considering that the three datasets have completely different scales and spatial resolutions. This shows that this parameter adapts to a wide range of data types and is robust. Moreover, when σ is small, the loop count is naturally reduced, saving calculation time. However, increasing σ from one to two did increase the computing time, so setting σ to two was a trade-off in favor of accuracy.
Regularization parameters control the degree of smoothness, which is an important part of the optimization model. Our design uses three different regularization parameters, i.e., λ1 = 0.004, λ2 = 0.01, and λ3 = 0.02, which are the only values that differ across the first stage. We conducted the experiments with Indian Pines. Figure 7 gives the feature blocks output by ATVM under the different parameters. For comparison, the smoothed block after ITVM in the second stage under each λi is also shown.
As can be observed, when λ1 = 0.004, the information of the various land covers is clear, while when λ3 = 0.02, only the obvious edges are retained. With the increase of λ, the single-band images become oversmoothed and more blurred. Therefore, we chose three typical λ values to represent three different degrees of smoothness. Besides, it can be seen from (d) that, in the single-band image after ITVM in the second stage, the structure features are strengthened and the detailed information is better preserved. Figure 7 provides this intuitive judgment: in ATVM, multi-scale structure information with different smoothness is extracted, but the information of some classes with few pixels seems to be weakened; through ITVM in the second stage, however, the representation of each feature is strengthened, which intuitively shows the effectiveness of our two-staged design.
We selected five representative values at each of the three smoothing degrees, with an interval of 0.001, and the classification results are shown in Figure 8. The values we chose yielded the best results in their corresponding intervals, and they were set as the default parameters. Because the range of smoothing parameters is wide, we chose representative values of three different smoothness degrees. Additionally, this setting yields satisfactory classification accuracies for different datasets, as shown in Section 3.
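The first-stage ATVM with the three default λ values can be sketched as follows. This is a minimal stand-in, assuming smoothed gradient descent on the anisotropic TV energy applied to a single fused band; the paper's actual solver and stopping rule differ, and `aniso_tv_smooth` is a hypothetical helper:

```python
import numpy as np

def aniso_tv_smooth(f, lam, iters=100, tau=0.1, eps=1e-2):
    """Gradient descent on 0.5*||u - f||^2 + lam * sum(|ux| + |uy|),
    with |t| smoothed to sqrt(t^2 + eps) so the energy is differentiable.
    Larger lam gives stronger, more structural smoothing."""
    u = f.astype(float)
    for _ in range(iters):
        ux = np.diff(u, axis=1, append=u[:, -1:])   # forward differences
        uy = np.diff(u, axis=0, append=u[-1:, :])
        px = ux / np.sqrt(ux**2 + eps)              # smoothed sign of the gradients
        py = uy / np.sqrt(uy**2 + eps)
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        u = u - tau * ((u - f) - lam * div)         # descent step
    return u

band = np.random.rand(64, 64)                       # one fused band in [0, 1]
blocks = [aniso_tv_smooth(band, lam) for lam in (0.004, 0.01, 0.02)]
stacked = np.stack(blocks, axis=2)                  # multi-scale feature block
```

Consistent with Figure 7, the λ = 0.004 output retains fine land-cover detail while the λ = 0.02 output keeps mainly the salient edges; in the full method each λ is applied to the whole fused cube and the three resulting blocks are stacked.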

Second Stage
There are two main parameters in the second stage: the number k of principal components after SVD and the fidelity parameter µ. Similarly, k ranged from 10 to 40 with an interval of two, given that the dimension of the stacked feature blocks from the first stage is 45. As shown in Figure 9, with the increase of k, the classification performance improved stably on the three datasets. When k was larger than 20, the metrics decreased on Indian Pines, remained stable on Salinas, and improved slightly on Houston University 2018. We set 20 as the default k value. Qualitatively, when the image is large, more principal components enhance the feature information, so the performance improves. However, when the image is small, such as Indian Pines, more principal components bring redundant information and reduce accuracy. In addition, the time overhead is mainly concentrated in the SVD calculation, and it grows steadily with k. In summary, SVD effectively extracts the principal component information and reduces the subsequent calculation time, playing a key role in our design. The parameter setting is stable across the three datasets, especially for large-scale images.
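The SVD truncation can be sketched as below: flatten the pixels of the stacked 45-feature block and keep the k leading singular directions. The centering step is an assumption for illustration; the paper may apply SVD to the raw stacked block directly:

```python
import numpy as np

def svd_reduce(stacked: np.ndarray, k: int = 20) -> np.ndarray:
    """Project an r x c x d stacked feature block onto its k leading
    singular directions, returning an r x c x k cube."""
    r, c, d = stacked.shape
    X = stacked.reshape(r * c, d)
    X = X - X.mean(axis=0)                    # center features (an assumption)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * S[:k]).reshape(r, c, k)

stacked = np.random.rand(64, 64, 45)          # 3 x 15 stacked first-stage features
reduced = svd_reduce(stacked, k=20)
print(reduced.shape)  # (64, 64, 20)
```

Only one matrix of size (rc) x 45 is decomposed, which is why this step contributes little to the overall running time.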
The fidelity parameter µ is very tolerant of large values according to [35]. We first performed experiments to verify that this conclusion also applies to different HSIs. In our experiments, µ ranged from 2 to 10 with step two and grew from 10 to 150 with step 10. The classification results of the three datasets are shown in Figure 10. When µ increased to 10, the results improved significantly. When µ was greater than 20, the improvement on the three datasets became slow and entered a stable zone. In addition, when µ was large, the performance did not fluctuate greatly and showed no downward trend. This shows that µ tolerates large values on HSIs with different characteristics. From the formula, the larger the value, the stricter the constraint of the fidelity term, the more delicate the smoothing of each feature, and no oversmoothing occurs. We set µ = 100 as the default value of the proposed method. µ is an important parameter in the second stage, and once again, the performance on the three datasets fully proves that our design is robust and stable.
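The role of µ can be illustrated numerically. The sketch below uses plain gradient descent on the isotropic TV energy as a stand-in for the paper's split Bregman solver (scikit-image's `denoise_tv_bregman` provides an actual split Bregman implementation); `iso_tv_smooth` is a hypothetical helper, and only the qualitative behavior of µ matters here:

```python
import numpy as np

def iso_tv_smooth(f, mu, iters=80, eps=1e-2):
    """Gradient descent on (mu/2)*||u - f||^2 + sum(sqrt(ux^2 + uy^2)),
    i.e. isotropic TV with fidelity weight mu: the larger mu, the closer
    the output stays to f and the milder the smoothing."""
    u = f.astype(float)
    tau = 1.0 / (mu + 8.0 / np.sqrt(eps))        # conservative stable step size
    for _ in range(iters):
        ux = np.diff(u, axis=1, append=u[:, -1:])
        uy = np.diff(u, axis=0, append=u[-1:, :])
        mag = np.sqrt(ux**2 + uy**2 + eps)       # isotropic gradient magnitude
        px, py = ux / mag, uy / mag
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        u = u - tau * (mu * (u - f) - div)
    return u

band = np.random.rand(64, 64)
gentle = iso_tv_smooth(band, mu=100)   # default mu: delicate detail smoothing
strong = iso_tv_smooth(band, mu=20)    # smaller mu smooths more aggressively
```

This matches the trend in Figure 10: once µ is large, the fidelity term dominates, the result changes little with further increases, and no oversmoothing appears.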
Finally, the results of sending the output feature blocks of the first and second stages directly to the classifier are shown in Figure 11. As a comparison, the results of feeding the raw data and the average-fused data to the classifier are also listed in this figure. This experiment demonstrated that both stages' designs contribute significantly to classification performance on the three datasets. The results here more fully prove that our two-staged design is effective: each stage plays an important role and greatly improves the effect.
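The final per-pixel classification step can be sketched with scikit-learn's SVM. The feature cube, the toy two-class label map, and the training-set size below are synthetic placeholders standing in for the stage-2 output and the small-sample setting, not the paper's data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
r, c, k = 16, 16, 20
features = rng.normal(size=(r, c, k))            # stand-in for the stage-2 feature cube
labels = (np.arange(r)[:, None] < r // 2).astype(int).repeat(c, axis=1)
features[labels == 1] += 2.0                     # make the two toy classes separable

X = features.reshape(-1, k)                      # one feature vector per pixel
y = labels.ravel()
train = rng.choice(X.shape[0], size=40, replace=False)  # only a few labeled pixels
clf = SVC(kernel="rbf", gamma="scale").fit(X[train], y[train])
accuracy = (clf.predict(X) == y).mean()          # overall accuracy (OA)
```

In the paper's experiments the same pattern is followed: a handful of labeled pixels per class train the SVM, and the predictions over all pixels produce the classification maps and the OA/AA/Kappa metrics.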

Conclusions
To improve classification performance and reduce the time cost, this article innovatively proposes an efficient two-staged FE method based on TV for HSI. Based on different solutions of the anisotropic and isotropic models, it successively completes the extraction of multi-scale structure information and the detail-smoothing enhancement of HSI, yielding distinguished classification performance. In addition, the design has neither a complex framework nor redundant loops, which greatly reduces the computational overhead. Compared with many state-of-the-art algorithms, our method significantly outperforms the others in classification performance and computing time, achieving better results in most classes. More importantly, we give a sufficiently detailed parameter analysis and explain the reasonable value and the variation of each parameter from the perspectives of algorithm design and performance. The results show that our method has strong robustness and stability on various datasets, which further strengthens its advantages in practical hyperspectral applications.
In the future, we will try to improve the classification performance on "slender and long" land cover distributions. Although our default parameters are applicable to the three datasets, adaptive selection of parameters for different datasets is still an important direction for improvement. In addition, some trade-offs between time and performance remain, so a parallel implementation is a primary goal, especially considering the future development of HSI towards larger sizes and higher resolutions. Finally, the practical application of HSIs is always a topic of discussion and will face more environmental factors and noise in the future. We can draw on the excellent solutions and ideas for dealing with complex noise in traditional images [56,57] and transplant them to hyperspectral applications.
Author Contributions: C.L., developed the design and methodology, implemented the algorithm, executed the experiments, led the writing of the manuscript; X.T., assisted algorithm migration and verification; L.S. assisted several experiments; Y.T., revised the manuscript. Y.T. and Y.P., administrated the project and supervised the research. All authors have read and agreed to the published version of the manuscript.