Hyperspectral Image Classification Using Feature Relations Map Learning

Abstract: Recently, deep learning has been reported to be an effective method for improving hyperspectral image classification, and convolutional neural networks (CNNs) in particular are gaining increasing attention in this field. CNNs provide automatic approaches that can learn more abstract features of hyperspectral images from the spectral, spatial, or spectral-spatial domains. However, CNN applications focus on learning features directly from image data, while the intrinsic relations between original features, which may provide more information for classification, are not fully considered. In order to make full use of the relations between hyperspectral features and to explore more objective features for improving classification accuracy, we propose feature relations map learning (FRML) in this paper. FRML can automatically enhance the separability of different objects in an image using a segmented feature relations map (SFRM), which reflects the relations between spectral features through a normalized difference index (NDI), and it can then learn new features from the SFRM using a CNN-based feature extractor. Finally, based on these features, a classifier is designed for the classification. Experimental results on four popular hyperspectral datasets indicate that the proposed method can obtain more representative and objective features to improve classification accuracy, outperforming the comparative methods.


Introduction
As the spectral resolution of remote sensing (RS) sensors has improved, hyperspectral technology has exhibited great potential for obtaining high-quality land use information. Hyperspectral RS images capture the spectrum of every pixel within an observed scene in hundreds of continuous, narrow bands. In comparison with multispectral images, which have wide wavebands, hyperspectral images can provide features hidden in narrow wavelength intervals in order to distinguish objects that are difficult to detect [1,2]. Because of these capabilities, hyperspectral RS images have been widely used in many fields, such as mining, precision agriculture, and water pollution treatment [3-5].
Hyperspectral RS image classification is an important process for transforming hyperspectral information from the ground's surface into attribute information. It is an extension of conventional multispectral RS image classification, which aims at assigning each pixel to a unique class [6,7]. Hyperspectral images differ significantly from multispectral images because they have high-dimensional features and the correlation between adjacent bands is often high. Along with other issues, such as noise and mixed pixels, hyperspectral image classification also suffers from data redundancy, the curse of dimensionality, and uncertainty, making this type of classification more complex and challenging [8,9].
Various methods have been proposed for hyperspectral RS image classification. Classification methods for traditional multiple spectral images, such as support vector machine (SVM), k-nearest Thus, from a global perspective, utilizing all original features to establish relations between them and then learning the features from their relations would be a novel approach to discovering more intrinsic and representative features.
While existing CNN methods learn features from the original hyperspectral images, it is relatively rare to convert spectral or spatial information into pictures with regular textures for feature learning with CNNs. In fact, extensive literature reports that CNNs are more capable of learning features from pictures with regular textures [26,41-44]. Therefore, it could be assumed that converting the spectral information of each pixel in a hyperspectral image into a picture with regular textures and using CNNs for feature learning would improve classification accuracy.
For the reasons given above, we propose a new approach, called feature relations map learning (FRML), for improving hyperspectral image classification. Here, feature relations mean the correlations between features under a certain function or mapping. First, for each pixel in the image, the relations between every pair of spectral features are calculated by a specific function and recorded in a 2-D matrix. Then, the matrix is converted into a picture, which describes the relations between different spectral features with regular textures. Finally, new features are learned from the picture using a CNN architecture, and the current pixel is classified and assigned a predicted class label. After FRML is performed on the entire image, a more accurate land use map is produced. The remainder of this paper is organized as follows. Section 2 introduces FRML and related work. Section 3 describes the data sets and experimental designs. In Section 4, we analyze the experimental results and present the discussion. Finally, conclusions and suggestions are provided in Section 5.
Figure 1 illustrates a hyperspectral image classification framework that uses FRML. The framework includes an input layer, a feature relations map (FRM) establishing layer, a feature learning layer, and a classifier layer. In the first layer, each pixel in an image is recorded as a high-dimensional vector whose entries correspond to the spectral features in each band. In the second layer, a normalized difference index (NDI) is calculated for every pair of entries to build a 2-D matrix, called the feature relations matrix, and the matrix is then transformed into a picture to build FRMs that contain regular textures. In the third layer, the convolution layers of a CNN are used as a feature extractor (FE) to extract new features from the picture. In the fourth layer, the new features are used as the input of the classifier, and the final result is predicted and assigned a class label. The classifier in FRML can be established by any supervised classification algorithm. In this paper, we used classification and regression trees (CART) [45], random forests (RF) [46], and a deep belief network (DBN) [35], respectively, to construct the classifier of the FRML framework in order to compare them and identify regularities in FRML.

Feature Relations Map Establishment
With the relations of different features, classification rules can be established. However, it is relatively complex to decide how features should be selected for a hyperspectral image classification target [6,47]. To avoid this complexity, in our experiments, we neglected feature selection and had all spectral features build relations between one another.
Normalized indices, such as the normalized difference vegetation index (NDVI), the normalized difference water index (NDWI), and the normalized difference built-up index (NDBI), can generate new features by taking spectral features as input to enhance the recognizability of objects [48]. To some extent, these new features reflect the interrelations between different spectral features. Thus, in this paper, we used the NDI to establish relations between different spectral features. The NDI is shown in (1), where i = 1, 2, ..., n; j = 1, 2, ..., n; n is the band count of a hyperspectral image; and a and b are two mutually unequal constants. For each pixel, a 2-D matrix of size n × n is calculated with the index, and a and b are used to make the matrix asymmetric. We defined a + b = 2 and tested a = 0.25, 0.5, and 0.75 to optimize the parameters; FRML had the best performance when a = 0.75 and b = 1.25. Therefore, the default values of a and b are set as 0.75 and 1.25, respectively. With the NDI, pixels of different classes correspond to different feature relation matrices. By converting these matrices into pictures, the FRMs of different classes are formed. In theory, pixels belonging to the same class have similar textures in their FRMs. Thus, constructing an FRM can be considered a process that converts a classification using complex spectral features into a classification using pictures with regular textures, which offers more features for the classification target.
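The construction of a feature relation matrix for one pixel can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the NDI takes the weighted normalized-difference form NDI(i, j) = (a·x_i - b·x_j) / (a·x_i + b·x_j), which is consistent with the description above (two unequal constants that make the matrix asymmetric) but is an assumption. The function name `feature_relation_matrix` and the `eps` guard are hypothetical.

```python
def feature_relation_matrix(spectrum, a=0.75, b=1.25, eps=1e-12):
    """Return the n x n matrix of pairwise NDI values for one pixel.

    Assumed form (for illustration only):
        NDI[i][j] = (a * x_i - b * x_j) / (a * x_i + b * x_j)
    """
    matrix = []
    for x_i in spectrum:
        row = []
        for x_j in spectrum:
            denom = a * x_i + b * x_j
            # Guard against a zero denominator (e.g. two zero-valued bands).
            row.append((a * x_i - b * x_j) / denom if abs(denom) > eps else 0.0)
        matrix.append(row)
    return matrix


# Toy pixel with three bands (reflectance values are made up).
pixel = [0.10, 0.20, 0.40]
frm = feature_relation_matrix(pixel)
```

Because a and b differ, `frm[i][j]` is not simply the negative of `frm[j][i]`, so both triangles of the matrix carry information; with a = b the matrix would be antisymmetric and half of it redundant.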
The number of hyperspectral bands determines the size of an FRM: a high dimension enlarges the data size, which affects data processing efficiency. To reduce the potentially large size of data, a segmentation strategy was designed, as shown in Figure 2. For each pixel, the spectral dimension was divided into m segments and each segment contained s features (bands). If the band count was not an integer multiple of m, then the values of the leading bands were copied and appended to the spectrum vector so that all segments had the same number of bands. The feature relation matrix of each segment was calculated to obtain m different FRMs, which were then combined into one multi-channel picture called the segmented FRM (SFRM). In comparison with the unsegmented FRM (UFRM), which uses the full set of spectral features to build one feature relation matrix, the SFRM has a smaller picture size, and its colorful textures in the RGB color space can provide more discriminative information for distinguishing the current pixel from pixels of other classes.
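The segmentation strategy can be sketched in the same spirit. The code below is a hedged illustration under stated assumptions: the NDI form is the assumed weighted normalized difference (a·x_i - b·x_j) / (a·x_i + b·x_j), band values are assumed to be strictly positive so that no denominator is zero, and the helper names `ndi_matrix` and `segmented_frm` are hypothetical.

```python
def ndi_matrix(segment, a=0.75, b=1.25):
    # Assumed NDI form, for illustration only; expects strictly positive values.
    return [[(a * x_i - b * x_j) / (a * x_i + b * x_j) for x_j in segment]
            for x_i in segment]


def segmented_frm(spectrum, m, a=0.75, b=1.25):
    """Split a spectrum into m equal segments and build one FRM per segment.

    If the band count is not a multiple of m, the leading bands are copied
    and appended so that every segment holds s bands.
    """
    n = len(spectrum)
    s = -(-n // m)  # bands per segment, rounded up
    padded = [spectrum[k % n] for k in range(s * m)]
    # Result: m channels, each an s x s feature relation matrix.
    return [ndi_matrix(padded[c * s:(c + 1) * s], a, b) for c in range(m)]


# Toy spectrum with 10 bands split into 4 segments of 3 bands each.
spectrum = [0.1 * (k + 1) for k in range(10)]
sfrm = segmented_frm(spectrum, m=4)
```

For n bands, the UFRM holds n × n values, whereas the SFRM holds m × s × s values with s ≈ n / m, which is roughly n × n / m. This is the size reduction that the segmentation is meant to deliver.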
Remote Sens. 2020, 12, x FOR PEER REVIEW 5 of 26

Feature Learning Method
A CNN provides a powerful feature extractor, which consists of alternating convolution and pooling layers, to generalize features towards deep and abstract representations [37]. As it is characterized by autonomous feature learning and provides the necessary basis for high-precision classification, the feature extractor was used in this paper to learn features from the FRM. The convolution and pooling layers are described as follows.
Convolution layers: In FRML, a convolution layer consists of kernels that each have a two-dimensional array of weights and a bias, and that scan across an image to capture different feature representations at local and global scales. The kernels share weights across different feature maps so that the features can be learned with a reduced number of parameters, and an activation function adds nonlinearity. Mathematically, assume that X is the input cube with a size of h × w × c, where h × w is the spatial size of X and c is the number of channels, and that x_i is the ith feature map of X. Supposing that the current convolution layer has n kernels, the jth kernel is characterized by a weight w_j and a bias b_j. The jth feature extracted by the current layer can be expressed as (2), where * is the convolution operator and f(·) is an activation function that is used to strengthen the nonlinear expression. ReLU is considered to be an effective activation function, with the advantages of fast convergence and robustness to vanishing gradients [7]. Thus, in this paper, we used ReLU as the activation function of the convolution layers.
Pooling layers: The pooling layers are periodically inserted after several convolution layers for down-sampling, while retaining the invariance of features to scale, offset, and shape. With the pooling operation, the number of parameters and the feature map size are reduced, which lowers the computational cost, and the representation of the extracted features becomes more abstract. Common pooling functions include max-pooling, average-pooling, L2-norm pooling, and weighted pooling [7,39]. In this paper, we used the most popular function, max-pooling, to establish the pooling layers.
With the convolution and pooling layers, we designed a three-layer convolutional network as the FRML feature extractor (the structure of the feature extractor is further described in Section 3) and expected it to discover FRM patterns that would improve classification accuracy. To ensure that our experiments ran efficiently, we used the TensorFlow application programming interface (API) for programming. TensorFlow is a well-known open-source software library that was developed by the Google Brain Team for machine learning applications [48]. It provides sophisticated DL approaches, including the components needed for FRML, and is compatible with our graphics processing unit (GPU) for speeding up the computation.
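The convolution, activation, and pooling steps described above can be illustrated with a small pure-Python sketch. It is not the TensorFlow implementation used in the experiments. It assumes that one output feature is f(Σ_i x_i * w_i + b), with one 2-D kernel per input channel, "valid" borders, and the sliding-window (cross-correlation) form that deep learning libraries commonly call convolution. All function names and the toy values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide a 2-D kernel over a 2-D image without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + dr][c + dc] * kernel[dr][dc]
                 for dr in range(kh) for dc in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_layer(channels, kernels, bias):
    """One output feature map: ReLU(sum over channels of x_i * w_i + bias)."""
    maps = [conv2d_valid(x, w) for x, w in zip(channels, kernels)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[max(0.0, sum(m[r][c] for m in maps) + bias) for c in range(cols)]
            for r in range(rows)]


def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling with a size x size window."""
    return [[max(feature_map[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Toy single-channel 5 x 5 input whose value at (r, c) is 5 * r + c.
image = [[5 * r + c for c in range(5)] for r in range(5)]
kernel = [[0, 0], [0, 1]]  # picks the lower-right value of each 2 x 2 window
feature = conv_layer([image], [kernel], bias=-10)
pooled = max_pool(feature)
```

In this toy case the 5 × 5 input becomes a 4 × 4 feature map after the convolution and a 2 × 2 map after pooling, which shows how each stage shrinks the spatial size while keeping the strongest responses.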

Classification Method
Based on the features extracted from FRMs, CART, RF, and DBN, which represent single-classifier, ensemble learning (EL), and deep learning (DL) methods, respectively, were used separately to train the FRML classifiers. CART is a decision tree-based classification algorithm that is built by dividing the sample set layer by layer, where the split property is the one with the highest information gain ratio on the sample set and the optimal threshold under the split property is obtained by an information entropy calculation [45,49,50]. RF combines the bagging technique with the random subspace method and ensembles a set of CART classifiers to improve classification accuracy. Since it reduces the classification bias and mitigates the overfitting of single decision trees, RF has high accuracy and is reported to be an excellent EL method for RS image classification [12,14,51]. DBN is a popular deep learning architecture in the field of classification, consisting of several layers of restricted Boltzmann machines (RBMs) and one backpropagation neural network layer [35,52]. In a DBN, an RBM is an unsupervised network that consists of a visible layer and a hidden layer. The hidden layer serves as the visible layer for the next RBM, and each pair of units drawn from the two layers has a symmetric connection between them. With RBMs, probability distributions over their sets of inputs can be learned and used to train the backpropagation neural network [35]. In this paper, these classifiers were implemented using the scikit-learn library in Python.

Measurement of the Feature Relations Map Difference
FRMs differ between classes, which can be considered an important basis for distinguishing between different objects. The structural similarity index measure (SSIM) is a full-reference metric that is used for measuring the similarity of two images [53]. It comprehensively measures their differences in terms of image brightness, contrast, and structure, and it has advantages in discriminating image differences. Hence, it can be used to judge the difference between FRMs. The smaller the SSIM of two FRMs, the greater the difference between them and the more easily the corresponding objects can be distinguished. Supposing that x is the target image and y is the reference image, the SSIM of x and y can be defined as (3):
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{3}$$
where $\mu_x$ (or $\mu_y$) represents the empirical mean of x (or y), $\sigma_x$ (or $\sigma_y$) is the empirical standard deviation of x (or y), $\sigma_{xy}$ is the empirical covariance between x and y, and $C_1$ and $C_2$ are given constants.
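As a rough illustration of how such a comparison could be computed, the sketch below evaluates the standard SSIM expression once over two whole images given as flat lists of pixel values. The constants follow the commonly used defaults C1 = (0.01·L)² and C2 = (0.03·L)², where L is the dynamic range; these defaults and the function name `ssim_global` are assumptions, since the constants used in the experiments are not stated here. Practical SSIM implementations usually average the index over local windows rather than using a single global window.

```python
def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM of two equally sized images given as flat lists."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((v - mu_x) ** 2 for v in x) / (n - 1)
    var_y = sum((v - mu_y) ** 2 for v in y) / (n - 1)
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / (n - 1)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


# Two toy "FRMs" flattened to lists; the second reverses the first.
frm_a = [0.1, 0.4, 0.7, 0.9]
frm_b = [0.9, 0.7, 0.4, 0.1]
same = ssim_global(frm_a, frm_a)
different = ssim_global(frm_a, frm_b)
```

An image compared with itself gives an SSIM of 1, while structurally dissimilar images give lower values, which matches the reading that a smaller SSIM between two FRMs means more distinguishable classes.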

Measurement of the Separability of Different Class Samples
For classification problems, the quality of features determines the separability of different classes. The higher the separability between different classes, the simpler the classification is for ML. The Jeffries-Matusita distance (JMD) is a widely used statistical separability criterion that involves the covariance matrices of the classes in the separability measurement [54]. Since it measures the separability between classes pairwise, the JMD provides an effective assessment of the quality of different class samples in the available feature space. To evaluate the separability of samples with features learned from FRMs, the JMD between classes $c_i$ and $c_j$, which are members of a set of n classes, is defined as follows (i = 1, 2, ..., n; j = 1, 2, ..., n):
$$\mathrm{JMD}_{ij} = 2\left(1 - e^{-b_{ij}}\right)$$
$$b_{ij} = \frac{1}{8}(M_i - M_j)^{T}\left(\frac{N_i + N_j}{2}\right)^{-1}(M_i - M_j) + \frac{1}{2}\ln\left(\frac{\left|\frac{N_i + N_j}{2}\right|}{\sqrt{|N_i|\,|N_j|}}\right)$$
where $b_{ij}$ is the Bhattacharyya distance between $c_i$ and $c_j$; $M_i$ and $M_j$ represent the mean vectors; and $N_i$ and $N_j$ denote the covariance matrices of classes $c_i$ and $c_j$. The JMD is a transformation of the Bhattacharyya distance from the [0, ∞) range to the fixed [0, 2] range: the closer the JMD is to 2, the higher the separability of samples belonging to the two classes.
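To make the behaviour of the JMD concrete, the following sketch computes it for the simplest case of a single feature, where each class is summarized by a mean and a variance. Restricting the Bhattacharyya distance to one dimension is a simplification made here for illustration (the general definition uses mean vectors and covariance matrices), and the function names are hypothetical.

```python
import math


def bhattacharyya_1d(mean_i, var_i, mean_j, var_j):
    """Bhattacharyya distance between two univariate Gaussian classes."""
    mean_term = 0.25 * (mean_i - mean_j) ** 2 / (var_i + var_j)
    var_term = 0.5 * math.log((var_i + var_j) / (2.0 * math.sqrt(var_i * var_j)))
    return mean_term + var_term


def jeffries_matusita(b_ij):
    """Map a Bhattacharyya distance in [0, inf) to the fixed range [0, 2)."""
    return 2.0 * (1.0 - math.exp(-b_ij))


# Identical class statistics give no separability.
jmd_same = jeffries_matusita(bhattacharyya_1d(0.0, 1.0, 0.0, 1.0))
# Well-separated means with equal variances approach the upper bound of 2.
jmd_far = jeffries_matusita(bhattacharyya_1d(0.0, 1.0, 10.0, 1.0))
```

The exponential saturates quickly, so once two classes are clearly separated the JMD sits close to 2 and further increases in the distance between means change it very little.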

Accuracy Verification
Cross-validation is a primary method for estimating the skill of an ML model on a limited sample set. In this paper, to verify the classification performance of FRML, a five-fold cross-validation method was designed. First, the sample set was divided into five folds. Then, each fold in turn was used for training, and the other four folds were combined and used for testing. By training and verifying the FRML models in this way, five evaluations were obtained. Finally, the highest of the five evaluation values was taken as the final verification. In our experiments, commonly used quantitative indices, including the overall accuracy (OA), the kappa coefficient, and the per-class accuracy, were used for the verification of the hyperspectral image classification.
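The evaluation indices named above can be computed from true and predicted labels as in the following sketch. The function names are hypothetical, the per-class accuracy is taken here to be the fraction of each true class that is labeled correctly (an assumption about which per-class measure is meant), and the fold splitter simply deals sample indices into k interleaved groups without shuffling or stratification.

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Deal sample indices into k interleaved folds (no shuffling)."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def kappa_coefficient(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    observed = overall_accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)


def per_class_accuracy(y_true, y_pred):
    """For each true class, the fraction of its samples labeled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Toy labels for four samples and two classes.
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
oa = overall_accuracy(y_true, y_pred)
kappa = kappa_coefficient(y_true, y_pred)
per_class = per_class_accuracy(y_true, y_pred)
folds = k_fold_indices(10, k=5)
```

Kappa is lower than OA whenever part of the agreement could arise by chance, which is why both are usually reported together for imbalanced land cover classes. Note that `kappa_coefficient` divides by zero in the degenerate case where all true and predicted labels belong to a single class.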
The IP dataset was gathered using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Northwest Indiana. After removing the bands that cover the water absorption region, the dataset consisted of a total of 220 bands with an image size of 145 × 145 pixels, a spatial resolution of 20 m, and spectral coverage from 400 to 2500 nm. The SA dataset was collected by the 224-band AVIRIS sensor over the Salinas Valley in California. This dataset is characterized by a high spatial resolution of 3.7 m. After 20 water absorption bands were discarded, it included a total of 204 bands. The HL dataset is an image in the HyRANK dataset obtained using the Hyperion sensor on Earth Observing-1. It has a spatial resolution of 30 m and spectral coverage from 400 to 2500 nm. Following a pre-processing step, the image provided 176 surface reflectance bands with an image size of 249 × 945 pixels. The PU dataset was acquired by the Reflective Optics Systems Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia in northern Italy. It is an image of 610 × 340 pixels, with 103 bands, spectral coverage from 430 to 860 nm, and a spatial resolution of 1.3 m.
These four datasets provide high-quality benchmark information obtained through expert visual interpretation and field investigation. The sample information of the datasets is listed in Table 1, while their false color composite pictures and corresponding ground truth maps are shown in Figure 3.
Our experiments were conducted in three parts. The first part analyzed the character of the FRMs, and the second part analyzed the separability of samples using the features learned from SFRMs. The third part reported the classification performance of FRML and the comparative methods. As the purpose of the experiments was to evaluate the feature learning results of FRML, we conducted the experiments under the same conditions. First, we set the same configuration for the feature extractor in FRML, as shown in Table 2. Second, we used the CART, RF, and DBN classifiers with the same parameters, as listed in Table 3. All experiments were run on an i9-9900K 3.6 GHz processor with 32 GB of RAM and an NVIDIA GeForce RTX 2080 Ti graphics card.

Feature Relations Map Results
In order to explore the nature of FRMs, we first established UFRMs for all samples and averaged them pixel-wise within each class to obtain an averaged UFRM per class. Due to space limitations, only the UFRMs of the IP and PU datasets are shown in Figures 4 and 5. Visually, these UFRMs reflect different numerical distributions. The UFRMs of the PU dataset show clear differences between classes and are highly recognizable. In the IP dataset, some similar classes, such as SN, SM, WH, and WO, exhibit some similarities in their UFRMs; however, if the contours are observed carefully, many differences in detail can be found. These observations indicate that, for a hyperspectral image, different classes have different FRMs, which can be used to distinguish them from one another. In comparison with the spectral feature, which has only one dimension, the FRMs provide more intuitive two-dimensional graphic information for distinguishing different patterns.
In order to reduce data size and processing time, the SFRM was used as the final FRM of FRML in our experiments. Because the band counts of the four images are different, the segment counts of the SFRMs for the IP, SA, HL, and PU datasets were set as 4, 4, 3, and 4, respectively. The SFRM of each class in the four datasets is shown in Figure 6, with the first three channels displayed in RGB color. The SFRMs of different classes exhibit different texture patterns (this is very obvious in the IP and PU datasets). Although some SFRMs have visually similar textures, they still differ in brightness and color. This means that, with the segmentation strategy, the SFRMs have the same capability as the UFRMs to distinguish between objects of different classes.

Feature Relations Map Differences
The differences between FRMs make samples of different classes separable. To evaluate these differences, the SSIM was calculated in the experiment. For a convenient description of the SSIMs between different classes, each pair of classes was numbered to obtain an index, as shown in Figure 7. Then, according to the indices, the SSIMs of the SFRMs (or UFRMs) for each pair of classes were plotted as the scatter plots shown in Figure 8. For some pairs of classes, there is a great difference between their UFRMs, with low SSIMs; for other pairs, however, the SSIMs are higher than 0.9, exhibiting great similarities that make it more difficult to distinguish between objects using the UFRMs. This may be because, when the NDI was used to establish feature relations, some highly related spectral features interacted with one another, producing very similar results that could not reflect differences well. SFRMs establish relations between features within an interval of the hyperspectral spectrum. In comparison with the UFRMs, some channels of the SFRMs exhibit lower SSIMs, reflecting more significant differences between different classes. These channels contribute greatly to the SFRM, and with the SFRM, the objects of different classes are more separable. When the segmentation strategy is used, the interaction of highly related features is reduced, so the discrimination of SFRMs between different classes is considerably improved. This shows that the SFRM can provide a favorable basis for classification using the relations between different spectral features.

Sample Separability with Features Learned from SFRMs
In our experiment, SFRM features were learned using the feature extractor, and the separability of samples for each pair of classes was evaluated using the JMD. As shown in Figure 9, for the four datasets, the JMDs under the features learned from SFRMs (JMD_SFRM) are higher than the JMDs under the spectral features (JMD_SF), reflecting strong separability of each pair of classes. For example, for the IP dataset, the JMD_SF of each pair of classes lies between 1 and 1.2, while the JMD_SFRMs are mostly greater than 1.38, indicating that the features learned from the SFRMs are more discriminative than the original spectral features. All four datasets indicate that the SFRM provides two-dimensional information, such as textures and graphics, which can be used to learn more discriminative features with the deep convolutional network. Through feature learning, the SFRM differences are transformed into high-quality features, which are more convenient for the classification of different objects because of their higher separability.

Classification Results and Analysis
In this section, the features learned by FRML and those learned by the comparative methods are evaluated. As we focus on the capability of feature learning, only the feature extractors of the comparative methods were used in our experiments. These methods include the recently proposed long short-term memory (LSTM), multiscale CNN (MCNN), spectral-spatial unified networks (SSUN), random patches network (RPNet), three-dimensional scattering wavelet transform (3DSWT), and extended random walkers (ERW). For the LSTM, MCNN, and SSUN, the parameters were set to the default values given in [55]. For the RPNet, 3DSWT, and ERW, the parameters were set to the default values given in [57,58]. With the features learned by FRML and the comparative methods, the classification stage was conducted using the CART, RF, and DBN classifiers, respectively, and the final classification accuracies are shown in Table 4.

Classification Performance at the Overall Level
As reported in many research studies, CART is a weak classifier in comparison with strong classifiers such as SVM and AdaBoost. In our experiments, CART was first used to evaluate classification performance with features extracted using different methods. With the CART classifier, some feature extraction methods exhibited low accuracy. For instance, the CART-based LSTM had an OA of only 73.85% (kappa = 0.703) on the IP dataset, while the CART-based SSUN achieved an OA of 75.33% (kappa = 0.780) on the HL dataset. This result indicates that, when features are extracted by some of the comparative methods, CART does not seem to be an optimal classifier. There are two possible reasons: either CART itself performs weakly, or the features extracted by the comparative methods are not discriminative enough for classification using CART. By contrast, the CART-based FRML achieved very satisfying classification results, with an OA higher than 96%. For the IP, HL, and PU datasets, the CART-based FRML achieved the highest accuracy among the CART-based methods. For the SA dataset, the CART-based FRML had an OA of 98.19% (kappa = 0.980), which was 1.3% lower than that of the CART-based ERW, the most accurate method in that group. Because all methods were run under the same parameter conditions, the higher accuracy of the CART-based FRML suggests that FRML had learned more discriminative features than the comparative methods, resulting in pixels that are more separable in the classification.
In comparison with CART, RF had a higher accuracy in our evaluation. On the HL and PU datasets, the RF-based FRML achieved the highest OAs of 98.48% and 98.97%, respectively. On the IP dataset, the RF-based FRML had a higher accuracy than most of the comparative methods in its group, the exception being the RF-based MCNN. On the SA dataset, the accuracy of the RF-based FRML was lower than that of the RF-based MCNN and the RF-based ERW, although the differences were very small (only 0.45% and 0.93%, respectively). These results indicate that, with a strong classifier like RF, FRML could achieve more accurate classification. Except for the comparative methods whose accuracy also clearly improved with RF, the accuracy improvements of the other comparative methods were smaller than those of FRML. This may be because these methods are not well suited to classification with the RF classifier; another possible explanation is that the features they extract are less discriminative than those of FRML.
In the group in which DBN was used as the evaluating classifier, FRML had the highest classification accuracy on all datasets except the IP dataset, where the DBN-based FRML had an OA of 96.29%, lower than that of the DBN-based ERW and the DBN-based RPNet and very close to that of the CART-based FRML. For the other feature extraction methods, using DBN as the classifier clearly improved classification accuracy over using CART; however, the DBN-based FRML on the IP dataset did not show this improvement. This may be because the fixed DBN parameters set in our experiments were not optimal for the IP dataset. Another reason may be that the training set was imbalanced, with some classes, such as GPM and OP, underrepresented due to too few samples.
Taken together, FRML outperforms the comparative methods on the four datasets in most cases. To evaluate the robustness of each method, the average accuracy over the four datasets was calculated, as shown in Table 4. The results show that the CART-based and RF-based FRMLs had the highest average accuracy in their groups. The DBN-based FRML exhibited almost the same average accuracy as the DBN-based ERW, which was higher than that of the other methods. These results suggest that FRML is more robust than the comparative methods.
FRMLs exhibit stable performance on different datasets under the same parameter conditions, with strong generalization and easier operation. This may be because FRMLs obtain optimal features that improve the separability of the pixels in classification. Last but not least, FRMLs learn features from the relations between spectral features and, in comparison with other feature extraction methods, make full use of the different bands of hyperspectral images, providing more abundant, higher-quality information for improving classification accuracy.
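For reference, the OA and kappa values quoted in this subsection are both derived from a confusion matrix. The sketch below shows the standard definitions (overall accuracy and Cohen's kappa); the function name and the toy matrix in the usage note are illustrative assumptions and are not taken from Table 4.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (OA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)

    # Observed agreement: fraction of samples on the diagonal (this is the OA).
    observed = sum(confusion[i][i] for i in range(k)) / n

    # Chance agreement: product of matching row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa
```

For a made-up two-class matrix `[[45, 5], [10, 40]]`, the observed agreement is 0.85 and the chance agreement is 0.5, so kappa is 0.7. Kappa is lower than OA here because it discounts the agreement expected by chance.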

Classification Performance at the Per-Class Level
To explore the hyperspectral image classification ability of FRML, the classification accuracy at the per-class level is reported in Figure 10. For the IP dataset, most classes achieved a higher accuracy with FRML than with the comparative methods. For OT in particular, the accuracy was poor when the comparative methods were used, while it clearly improved with FRML. For the SA dataset, VU and GU were not classified satisfactorily using the features extracted by the comparative methods, except for ERW, whereas the accuracy reached a much higher level with FRML. In the HL dataset, FRML maintains a higher classification accuracy than most other methods for most individual classes, except for EMS. For the PU dataset, FRML also exhibits excellent performance for most individual classes compared to the comparative methods. These results suggest a good improvement at the per-class level using FRML. Only a few individual classes were not improved to the highest level, perhaps because the FRMs of these classes were too similar to those of other classes, so that their features, as extracted by the CNN-based feature extractor designed in our experiments, were not yet discriminative enough. However, FRML did exhibit a better balance of accuracy improvement across all classes than the comparative methods, and this is especially obvious in the IP and HL datasets. Clearly, FRML improves hyperspectral image classification not only at the overall level but also at the per-class level. As SFRMs provide good separability between different classes in FRML (as shown in Figures 5 and 8), they greatly reduce potential misclassifications for each class.

Impact of the Training Sample Size on FRML
The impact of sample size on RS image classification has been reported in many research studies [7,32,39]. Understanding the influence of sample size on the accuracy of a classifier can effectively guide RS image interpretation. To evaluate how the sample size impacts FRML classification accuracy, 200 samples for each class were randomly selected from the training set to produce a new sample set. Then, classifiers were trained with a number of samples that was gradually increased in steps of 10% of the sample set size, and the classification accuracy was evaluated on a test set containing the samples that were not selected. The OA changes for FRML and the comparative methods are shown in Figure 11. As the training sample set size increased, the CART- and RF-based FRMLs on the IP and HL datasets exhibited excellent accuracy improvement, beating all comparative methods. This result suggests that FRML has the potential to achieve higher accuracy with a limited sample size. However, it is difficult to ensure that FRML obtains higher accuracy than all comparative methods, because the characteristics of the datasets differ. For the PU and SA datasets, the ERW method had a higher accuracy than FRML, and FRML seemed to require a larger number of training samples to reach an accuracy equal or similar to that of ERW. On these two datasets, this reflects a stronger dependence of FRML on the number of training samples than that of other methods. It is also worth noting that, as the number of training samples increases, the accuracy gains of FRML are larger than those of most of the comparative methods. The clearly increasing FRML trends make it reasonable to conclude that further increases in the number of training samples would yield further improvement in FRML accuracy.
Besides the differences between the datasets, the classifier used with FRML is another important factor affecting how accuracy improves as the number of training samples increases. As Figure 11 illustrates, with the same number of training samples, the RF-based FRMLs show better accuracy improvement than the CART-based FRMLs. In the PU dataset especially, the CART-based FRML seems to need more samples than the RF- and DBN-based FRMLs to reach a higher level of accuracy. Thus, if more accurate classification with a smaller sample size is desired, the classifier used with FRML should be chosen carefully.
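The sample-size protocol described in this section can be sketched as follows. This is only an illustration under stated assumptions: the growing subsets are drawn per class (the text does not say whether the increments were stratified), and a simple nearest-centroid classifier stands in for CART, RF, or DBN so that the example stays self-contained. All function names are hypothetical.

```python
import numpy as np


def fit_nearest_centroid(x, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(model, x):
    classes, centroids = model
    # Squared Euclidean distance from every sample to every centroid.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def sample_size_curve(x, y, per_class=200, step=0.1, seed=0):
    """Return [(n_train, OA), ...] for training sets grown in `step` fractions."""
    rng = np.random.default_rng(seed)

    # Randomly select up to `per_class` samples of each class as the sample set.
    pools = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        pools.append(rng.choice(idx, size=min(per_class, len(idx)), replace=False))

    # Every sample that was not selected forms the test set.
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[np.concatenate(pools)] = False

    curve = []
    for k in range(1, int(round(1.0 / step)) + 1):
        frac = k * step
        # Take the first `frac` share of each class pool (already in random order).
        train_idx = np.concatenate(
            [p[: max(1, int(round(frac * len(p))))] for p in pools]
        )
        model = fit_nearest_centroid(x[train_idx], y[train_idx])
        pred = predict_nearest_centroid(model, x[test_mask])
        curve.append((len(train_idx), float((pred == y[test_mask]).mean())))
    return curve
```

Plotting the returned pairs for each feature extraction method gives an OA-versus-sample-size curve of the kind shown in Figure 11. Replacing the stand-in classifier with the one under study only requires swapping the two `fit` and `predict` calls.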

Land Use Mapping with FRML
To assess the performance of FRML in land use mapping, the four datasets were classified using different methods. Due to limited space, we show only the maps that were classified by RF (Figures 12-15). In the maps for the IP dataset (Figure 12), the comparative methods show obvious misclassifications between CO and WO in the left region of interest (ROI); however, with the RF-based FRML (Figure 12g), the two classes are correctly classified. The RF-based SSUN, RPNet, 3DSWT, and ERW (Figure 12c-f, respectively) exhibit fuzzy boundaries between HW and WO in the right ROI; however, with the RF-based FRML, the edge between these classes is clearer and closer to the ground truth. For the SA dataset (Figure 13), the comparative methods exhibit unsatisfactory performance when distinguishing between the SVD and FA in the bottom ROI; however, with the RF-based FRML, the two land use types were mostly classified correctly. The RF-based FRML also performs well in land use mapping on the HL dataset. Compared with the RF-based MCNN, 3DSWT, and ERW (Figure 14b,e,f, respectively), the RF-based FRML seems to be better at describing details. For the PU dataset (Figure 15), SBB near the PMS in the ROI were wrongly classified as BS by the comparative methods, while the SBB were correctly classified using the RF-based FRML.
In the evaluation of classification accuracy, the RF-based MCNN on the IP dataset, as well as the RF-based RPNet and ERW on the SA and PU datasets, had a very high accuracy of more than 98%, even higher than that of the RF-based FRML. However, in the mapping experiments, the maps produced by these methods exhibit some fuzzy and distorted boundaries between different objects (as shown in Figure 12b, Figure 14d,f, and Figure 15d,f), and some details were also erased. This may be caused by the spatial filters that these methods use, which smooth the features of boundaries and details. With FRML, the features of each pixel contain no spatial information obtained by filters; thus, it can preserve more land use details on the map. However, as adjacent pixels are treated independently and without spatial information, the maps produced by FRML can be affected by salt-and-pepper noise in the images.
Table 5 lists the time consumption of feature extraction with different methods. For FRML, the time was mainly consumed by SFRM establishing and feature learning. SFRM establishing (SFRM-E) is the primary time consumer in FRML, which may be due to the multiple loops used for feature relation value calculation in the procedure code. Feature learning from SFRM (FL-SFRM) consumed far less time than SFRM-E: for the four datasets, FL-SFRM took only 2.41-6.21% of the time used for SFRM-E. In comparison with the other methods, FRML took more time to extract features in most cases in our experiments. The time cost of the feature extraction methods could be affected by the coding and the computing platform, but it can still be concluded that FRML is a time-consuming algorithm because of SFRM-E; thus, accuracy should be treated as the most important factor when evaluating its performance.
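To illustrate why nested loops over band pairs are costly and how they might be avoided, the sketch below computes a pairwise relations map for one pixel in two ways. It assumes the common normalized difference form (b_i - b_j) / (b_i + b_j) for the NDI and omits the segment strategy used for SFRM, so it is not the procedure code timed in Table 5; both function names are hypothetical.

```python
import numpy as np


def ndi_map_loops(spectrum):
    """Pairwise NDI map computed with explicit Python loops (slow for many bands)."""
    n = len(spectrum)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            total = spectrum[i] + spectrum[j]
            # Leave the entry at zero where the denominator vanishes (assumption).
            if total != 0:
                out[i, j] = (spectrum[i] - spectrum[j]) / total
    return out


def ndi_map_vectorized(spectrum):
    """Same map computed with NumPy broadcasting instead of Python loops."""
    b = np.asarray(spectrum, dtype=float)
    diff = b[:, None] - b[None, :]   # b_i - b_j for every band pair
    total = b[:, None] + b[None, :]  # b_i + b_j for every band pair
    out = np.zeros_like(diff)
    np.divide(diff, total, out=out, where=total != 0)
    return out
```

Both functions return the same antisymmetric matrix with a zero diagonal. The broadcast version performs the pairwise arithmetic inside NumPy rather than in the Python interpreter, which is the kind of optimization that could reduce the SFRM-E share of the runtime.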

Conclusions
By establishing relations between different spectral features using NDIs, it was found that each class has its own FRM that could be used to distinguish it from other classes. FRMs not only generate new features but also provide two-dimensional graphic information, such as textures and polygons, turning pattern recognition on one-dimensional spectral features into recognition on regular texture images. Based on these findings, we proposed FRML with SFRM (an FRM designed using a segment strategy) to classify hyperspectral images. The benefits of SFRM are that FRML can automatically enhance the separability of different objects and can learn more discriminative features using a feature extractor that consists of a deep convolutional network. Owing to its powerful feature learning ability, FRML achieved higher accuracy than the comparative methods in most cases. Unlike other feature learning methods, FRML learns features from the relationships between features rather than directly from the spectral or spatial features themselves. This allows FRML to make full use of the original features, without any dimensionality reduction, to obtain more objective features for improved classification accuracy and a stronger capability of preserving details in mapping.
FRML exhibits excellent performance in hyperspectral image classification. However, some of its aspects still need to be improved. For example, only three classifiers were used with FRML in our experiments; to fully explore FRML and obtain the optimal framework, more classifiers should be tested. In addition, the establishment of the FRM plays an important role in FRML, but only the NDI was used for this purpose in this paper. To better describe the relations between different features, more advanced methods should be explored. Last but not least, FRML is time consuming; thus, the procedure code needs further optimization to accelerate operation in future studies.
To summarize, FRML successfully uses feature relations to improve hyperspectral image classification. The framework is flexible and advanced, and it is expected to be suitable for more complex RS image classifications.