For HSIs, the spectral curve of each pixel captures the essential characteristics of the imaged material, which makes the data far richer than that of conventional images. However, the complex correlations among the large number of pixels also make the processing of HSIs difficult. In HSIs, different targets have different spatial characteristics, and different materials have different spectral features; a certain correlation exists between the spatial information and the spectral information. Deep learning provides a strong ability to learn such complex nonlinear relations. Among deep learning models, the CNN has a structure of local connection and weight sharing, which provides good generalization ability for image-processing tasks. The LSTM extends the traditional RNN with gate controllers that regulate the intrinsic connections of a sequence signal according to its characteristic intensity, thereby controlling the degree to which the signal is forgotten. Inspired by the above, a dual-branch spatial-spectral feature extraction and classification method is proposed to improve the classification accuracy of HSIs.
The proposed flow chart of the dual-branch spatial-spectral feature extraction and classification method is shown in
Figure 1. As shown in the figure, the dual-branch spatial-spectral feature extraction and classification method mainly includes three parts: sample enhancement combined with local and global constraints, multiscale spatial-spectral feature fusion, and a dual-branch Softmax classifier. In the sample enhancement phase, local sample enhancement takes the context information of the spatial neighborhood into account, while global sample enhancement exploits the spectral similarity between samples of the same class in different regions of the image. In the multiscale spatial-spectral feature fusion stage, a dual-branch structure is designed to extract and fuse the spatial and spectral features. The first branch adopts a Bi-LSTM network to extract the spectral and spatial features, whose input is the sequence vector formed by superimposing the EMAP values of the hyperspectral image on the corresponding spectral values. The second branch uses a 3D-CNN with 3 × 3 × 1 and 5 × 5 × 3 convolution kernels to extract spatial-spectral correlation features at different scales. The features output by the two networks are then fed into a fully connected layer to complete the fusion of the spatial-spectral features. In the dual-branch Softmax classifier stage, the fused features are sent to the Softmax classifiers, where a loss function combining the cross-entropy with a center-loss term is defined to complete the final classification.
2.1. Sample Enhancement with Local and Global Constraint
A schematic diagram of the sample enhancement with local and global constraints is shown in
Figure 2. In HSIs, the training samples are represented as $X = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of training samples. The labels of the training samples are expressed as $Y = \{y_1, y_2, \ldots, y_m\}$, and the unlabeled samples are represented by $U = \{u_1, u_2, \ldots, u_n\}$, where $n$ is the number of unlabeled samples.
For an unlabeled sample $u_i$, let $N_k(u_i)$ denote the $k \times k$ neighborhood of samples around $u_i$, and let $X_k(u_i)$ denote the training samples inside this neighborhood, i.e., $X_k(u_i) = N_k(u_i) \cap X$. If $X_k(u_i)$ is not empty, a class label for $u_i$ is derived from the statistical distribution of $X_k(u_i)$: the class with the most training samples is denoted $c^{*}$. If the number of training samples of class $c^{*}$ is greater than $(k-1)/2$, the local constraint premarks the unlabeled sample as $c^{*}$. Conversely, if the above condition is not met, the unlabeled sample is premarked as 0.
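The local constraint above amounts to a majority vote over the labeled samples in the neighborhood. A minimal sketch in pure Python (function name and flat-list input are illustrative, not from the paper):

```python
from collections import Counter

def local_prelabel(neighbor_labels, k):
    """Pre-label an unlabeled pixel from the training labels found in its
    k x k spatial neighborhood: take the majority class, and accept it only
    if its count exceeds (k - 1) / 2; otherwise return 0 (no premark)."""
    if not neighbor_labels:
        return 0
    cls, count = Counter(neighbor_labels).most_common(1)[0]
    return cls if count > (k - 1) / 2 else 0
```

For example, with k = 5 the majority class must appear more than twice in the neighborhood to be accepted.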
For global constraints, $G(u_i)$ is the collection of training samples of all categories that lie farther away from the unlabeled sample $u_i$. The spectral angular distance (SAD) between each sample $x_j$ in $G(u_i)$ and the unlabeled sample $u_i$ is calculated by Equation (1):

$$\mathrm{SAD}(u_i, x_j) = \arccos\left(\frac{u_i^{\mathsf{T}} x_j}{\|u_i\|\,\|x_j\|}\right) \tag{1}$$

The class of the sample with the smallest SAD is selected as the premark of the unlabeled sample $u_i$. When all SAD values are greater than 1, the premark is defined as 0.
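The SAD of Equation (1) and the resulting global premarking rule can be sketched in pure Python as follows (function names and the `(class, spectrum)` pair format are illustrative assumptions):

```python
import math

def spectral_angle(u, v):
    """Spectral angular distance (radians) between two spectra, Eq. (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # clamp for numerical safety before arccos
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def global_prelabel(u, reference_samples):
    """Premark u with the class of the reference spectrum having the
    smallest SAD; return 0 when every SAD exceeds 1 (no close match).
    reference_samples: list of (class_label, spectrum) pairs."""
    best_cls, best_sad = 0, float("inf")
    for cls, spec in reference_samples:
        sad = spectral_angle(u, spec)
        if sad < best_sad:
            best_cls, best_sad = cls, sad
    return best_cls if best_sad <= 1.0 else 0
```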
Only the unlabeled samples that receive the same nonzero label from the local spatial constraint and the global spectral constraint are selected. The labels obtained from the two constraints are recorded as $y_L(u_i)$ and $y_G(u_i)$, respectively, and the prelabels are determined by Equation (2):

$$\hat{y}(u_i) = \begin{cases} c, & \text{if } y_L(u_i) = y_G(u_i) = c \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
Notably, because the prelabeling procedure is deterministic, the training set and test set obtained after sample enhancement are fixed whenever the same labeled samples are used.
When the sample enhancement is completed, the number of samples is recalculated. The number of different class samples is balanced, which makes the training set more uniform and helps the network model to extract features.
2.2. Multiscale Spatial-Spectral Feature Fusion
HSIs are regarded as three-dimensional data cubes with both spatial and spectral information. In HSIs, the spectral characteristics of samples in the same class may vary due to the differences in imaging conditions, such as changes in illumination, environment, and atmosphere, as well as the change of time. In addition, limited by manufacturing techniques, the spatial resolution of hyperspectral remote sensing images is generally not high, so that the spatial features of ground targets are not fully extracted. Therefore, extracting complete and accurate joint spatial-spectral features is the key to improving the accuracy of HSIs classification.
The dual-branch spatial-spectral feature extraction and classification method is shown in
Figure 3. In this method, a dual-branch structure is constructed. The first branch uses LSTM units to extract the characteristics of the sequence signal. The specific structure of the LSTM unit is shown in
Figure 4. For full extraction of the spectral features, the data of all spectral bands, rather than dimensionality-reduced data, are taken as part of the input of this branch. To extract the spatial information, the EMAP is applied to the HSI data after PCA dimensionality reduction. The EMAP results are combined with the spectral data of the corresponding location to form the input vector. Since the combined spectral-EMAP data has no semantic order, the joint spatial-spectral features are extracted using a Bi-LSTM network with 128 hidden nodes to ensure better feature extraction.
The LSTM is an improvement on the standard recurrent neural network (RNN); its main addition is three gate controllers: the input gate, the output gate, and the forget gate. The three gate controllers share the same structure, which is mainly composed of a sigmoid function (σ) and a dot-product operation (×). Since the value range of the sigmoid function is 0–1, each gate controller determines, through the sigmoid value, the proportion of information allowed to pass. In this way, weight control over the memory at different times is added, and a cross-layer connection is introduced to reduce the influence of the vanishing-gradient problem. The LSTM is used to explore the dependence of the current state of the sequence on previous states. However, when non-temporal sequence vectors are involved, dependencies exist in both the forward and backward directions. To address this, the bidirectional LSTM (Bi-LSTM) [49] was proposed to exploit the forward and backward relationships of sequential data. Therefore, the Bi-LSTM is also adopted in the method proposed in this paper.
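The gate mechanism described above can be sketched for a single-unit LSTM cell in pure Python (a didactic toy, not the 128-unit network of the paper; the weight layout `gate -> (w_x, w_h, b)` is an assumption for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One forward step of a single-unit LSTM cell.
    W maps a gate name to its weights (w_x, w_h, b). The three sigmoid
    gates scale how much is written to, kept in, and read from the cell."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])   # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])   # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2]) # candidate
    c = f * c_prev + i * g          # cell state: weighted memory update
    h = o * math.tanh(c)            # hidden state read out through the gate
    return h, c

def bilstm(seq, W):
    """Bi-LSTM: run the sequence forward and backward and return both
    final hidden states, capturing dependencies in both directions."""
    def run(xs):
        h = c = 0.0
        for x in xs:
            h, c = lstm_step(x, h, c, W)
        return h
    return run(seq), run(list(reversed(seq)))
```

Because the two directions process the order-free spectral-EMAP sequence differently, the two returned states generally differ, which is exactly the extra information the bidirectional pass contributes.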
The EMAP is a set of multilevel features obtained by performing a series of coarsening (closing) and refinement (opening) filtering operations on the image, and it can effectively describe the spatial information of the image.
The coarsening (closing) and refinement (opening) operations are two important transformations for extracting the shape profile of the image. The connected opening at a pixel $x$ is defined as Equation (3):

$$\Gamma_x(I) = \begin{cases} X, & x \in I \\ \varnothing, & \text{otherwise} \end{cases} \tag{3}$$

where $I$ is a binary image, $x$ is a pixel of $I$, and $X$ is the connected component of $I$ containing $x$.
By adding an attribute constraint to the opening operation, the attribute opening $\Gamma^{T}(X)$ of the connected component $X$ is obtained by Equation (4):

$$\Gamma^{T}(X) = \begin{cases} X, & T(X) \ge \lambda \\ \varnothing, & \text{otherwise} \end{cases} \tag{4}$$

where $T(X)$ is an attribute of the connected component $X$ and $\lambda$ is the attribute threshold. The attribute opening transform $\gamma^{T}(I)$ of the entire binary image is defined by Equation (5):

$$\gamma^{T}(I) = \bigcup_{x \in I} \Gamma^{T}\big(\Gamma_x(I)\big) \tag{5}$$
In order to generalize the above transformation from binary images to grayscale images, each gray level of the grayscale image is used in turn as a threshold. Thresholding the grayscale image yields a series of binary images, recorded as Equation (6):

$$I_k = \{x \mid I(x) \ge k\}, \quad k = 0, 1, \ldots, K \tag{6}$$

The opening operation is then applied to each binary image $I_k$, and the maximum gray level at which the constraint is still satisfied is taken as the output. The attribute opening $\gamma^{T}(I)$ of the grayscale image $I$ is obtained by Equation (7):

$$\gamma^{T}(I)(x) = \max\big\{k : x \in \gamma^{T}(I_k)\big\} \tag{7}$$
Similarly, the attribute closing $\phi^{T}(I)$ of the grayscale image $I$ is defined as Equation (8):

$$\phi^{T}(I)(x) = \max\big\{k : x \in \Phi^{T}(I_k)\big\} \tag{8}$$

where $\Phi^{T}$ is the attribute closing operation of the binary image.
According to a sequence of attribute thresholds $\{\lambda_1, \lambda_2, \ldots, \lambda_s\}$ for the attribute opening and closing operations, the attribute opening profile and the attribute closing profile at each pixel $x$ of the image $I$ are obtained by Equation (9):

$$\Pi(x) = \big\{\phi^{T}_{\lambda_s}(I)(x), \ldots, \phi^{T}_{\lambda_1}(I)(x),\ I(x),\ \gamma^{T}_{\lambda_1}(I)(x), \ldots, \gamma^{T}_{\lambda_s}(I)(x)\big\} \tag{9}$$
In this paper, four kinds of attributes are selected: the area, the diagonal length of the circumscribed rectangle, the first-order invariant moment, and the standard deviation of the pixel values in the region.
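As an illustration of Equations (4)-(7), the sketch below implements a grayscale area opening by threshold decomposition in pure Python, using the area attribute $T(X) = |X|$ and 4-connectivity. This is a minimal, unoptimized version for small images; practical EMAP implementations use efficient max-tree algorithms instead.

```python
def binary_area_opening(img, lam):
    """Attribute opening of a binary image (Eqs. (4)-(5)) with T(X) = area:
    keep only connected foreground components whose area is >= lam."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                stack, comp = [(r, c)], []
                seen[r][c] = True
                while stack:                       # flood-fill one component
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(comp) >= lam:               # attribute criterion T(X) >= lambda
                    for y, x in comp:
                        out[y][x] = 1
    return out

def gray_area_opening(img, lam):
    """Grayscale attribute opening via threshold decomposition (Eqs. (6)-(7)):
    the output at x is the largest level k whose opened slice still contains x."""
    levels = sorted({v for row in img for v in row})
    out = [[0] * len(img[0]) for _ in img]
    for k in levels:
        slice_k = [[1 if v >= k else 0 for v in row] for row in img]
        opened = binary_area_opening(slice_k, lam)
        for r, row in enumerate(opened):
            for c, v in enumerate(row):
                if v:
                    out[r][c] = max(out[r][c], k)
    return out
```

On a small image, an isolated bright pixel (area 1) is flattened while a two-pixel plateau survives an opening with λ = 2, which is the coarsening behaviour the profiles exploit.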
The second branch is a 3D-CNN structure that extracts different spatial features with two scales of convolution kernels. The input of the dual-scale architecture is a patch of the data after PCA dimensionality reduction. In the dual-scale architecture, the two scales consist of convolutional layers with 3 × 3 × 1 and 5 × 5 × 3 kernels, respectively, together with the corresponding normalization layers and rectified linear unit (ReLU) activation layers, which extract the spatial features, the spectral features, and the correlation between space and spectrum in HSIs. A padding layer is then superimposed. Finally, the extracted features are flattened into a one-dimensional vector that serves as the input of the subsequent fully connected layer. The dual-scale convolution kernels in the DBECM show excellent performance in local feature extraction.
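The core operation of this branch is a 3D convolution over the spatial-spectral cube. A minimal valid-mode sketch in pure Python (as in most CNN frameworks, this is actually cross-correlation; the `(depth, height, width)` axis ordering is an assumption, whereas the paper writes kernel sizes as spatial × spatial × spectral):

```python
def conv3d_valid(vol, kernel):
    """'Valid' 3D cross-correlation of a volume of shape (D, H, W)
    with a kernel of shape (kd, kh, kw), no padding, stride 1."""
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(D - kd + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                s = 0.0
                for dz in range(kd):        # slide the kernel over the cube
                    for dy in range(kh):
                        for dx in range(kw):
                            s += vol[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out
```

A kernel spanning a single spectral slice (the 3 × 3 × 1 case) responds mainly to spatial structure, while one spanning three slices (5 × 5 × 3) also mixes adjacent bands, which is how the two scales capture spatial-spectral correlation.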
As shown in
Figure 5, the output of every convolutional layer in each convolution unit adopts a normalization strategy to improve the stability of the network. A subsequent ReLU acts as the nonlinear activation function on the convolution output, which a padding layer then expands back to its original size for the next convolution. To reduce overfitting, a dropout layer is adopted before the output is passed to the next convolution unit. No pooling layer is used in the convolution unit: although pooling improves the rotation invariance of features, in the hyperspectral remote sensing image classification task it greatly changes the spatial correlation of the target, so that the spatial information is no longer accurate.
After feature extraction in both branches, the extracted features are flattened into one-dimensional vectors. The features of the LSTM branch are concatenated with the features of each of the two-scale CNN branches and fed into fully connected layers to form the multiscale spatial-spectral joint features.
2.3. Dual-Branch Softmax Classifier
These features at different scales are fed into two Softmax layers, respectively. The output of each Softmax layer represents the probability distribution over the classes derived from the corresponding scale of features. Considering the two outputs of the two Softmax layers, a new loss function is defined. In the test phase, the outputs of the two Softmax layers are combined by a multiple-decision method to predict the class label of each sample.
In the DBECM, Softmax is used for multiclass classification. The output of the Softmax function represents the probability distribution over all classes, where the probability of each class is in the range of 0–1 and all probabilities sum to 1. In addition, considering that the convolution kernels used in this model, 3 × 3 × 1 and 5 × 5 × 3, both have centered structures, a new loss function is proposed that incorporates the center loss into the training process.

The loss function is determined by the cross-entropy between the real class and the output class probabilities, together with the center loss. Since the network uses Softmax as the output activation, the center loss is added to the loss function on top of the prediction probabilities to improve the discriminability of the features. The cross-entropy of the real class and the output class probabilities is calculated as Equation (10):

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\big(\log p_{i,y_i} + \log q_{i,y_i}\big) \tag{10}$$
where $p_{i,c}$ and $q_{i,c}$ are the probabilities that the $i$-th sample is assigned to the $c$-th class according to the two different scales of features.
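The Softmax normalization and the cross-entropy over the two branch outputs of Equation (10) can be sketched in pure Python (function names are illustrative; the summed-branch form is one plausible reading of the paper's loss):

```python
import math

def softmax(logits):
    """Softmax: probabilities in (0, 1) that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]   # subtract max for stability
    s = sum(exps)
    return [e / s for e in exps]

def dual_cross_entropy(p_probs, q_probs, labels):
    """Mean cross-entropy accumulated over both Softmax heads.
    p_probs, q_probs: per-sample class-probability lists from the two
    branches; labels: true class indices."""
    n = len(labels)
    total = 0.0
    for p, q, y in zip(p_probs, q_probs, labels):
        total -= math.log(p[y]) + math.log(q[y])
    return total / n
```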
The center loss is defined as Equation (11):

$$L_2 = \frac{1}{2}\sum_{i=1}^{N}\big\|f_i - c_{y_i}\big\|_2^2 \tag{11}$$

where the learned feature $f_i$ corresponds to the $i$-th input spectrum in the batch, and $c_{y_i}$ denotes the center of the $y_i$-th class, defined by averaging over the features of that class.
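A minimal pure-Python sketch of the center-loss penalty (the 1/(2N) scaling follows the standard center-loss formulation and is an assumption about the paper's exact normalization):

```python
def center_loss(features, labels, centers):
    """Mean squared distance between each feature vector and its class
    center, halved: pulls same-class features toward a common point.
    centers: dict mapping class index -> center vector."""
    n = len(features)
    total = 0.0
    for f, y in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(f, centers[y]))
    return total / (2 * n)
```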
The loss function is defined as Equation (12):

$$L = L_1 + \eta L_2 \tag{12}$$

where $\eta$ is a weight balancing the two terms.
In the testing phase, the classification results are determined by the class probability distributions of the different scale features. The labels $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_t\}$ of the test samples $\{x_1, x_2, \ldots, x_t\}$ are predicted by Equation (13):

$$\hat{y}_i = \arg\max_{c}\big(p_{i,c} + q_{i,c}\big) \tag{13}$$
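One plausible reading of the multiple-decision rule, averaging the two Softmax distributions and taking the arg-max, can be sketched as (function name is illustrative):

```python
def fuse_predict(p, q):
    """Fuse the two Softmax outputs for one test sample: average the two
    class distributions and return the index of the most probable class."""
    fused = [(a + b) / 2 for a, b in zip(p, q)]
    return max(range(len(fused)), key=fused.__getitem__)
```

Averaging lets a confident branch override an uncertain one, so the two scales of features jointly decide the label.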