Contrastive Learning Based on Transformer for Hyperspectral Image Classification

Abstract: Recently, deep learning has achieved breakthroughs in hyperspectral image (HSI) classification. Deep-learning-based classifiers require a large number of labeled samples for training to achieve excellent performance. However, the availability of labeled data is limited because labeling hyperspectral data demands significant human resources and time. Unsupervised learning for hyperspectral image classification has therefore received increasing attention. In this paper, we propose a novel unsupervised framework for hyperspectral image classification based on a contrastive learning method and a transformer model. The experimental results show that our model can efficiently extract hyperspectral image features in unsupervised settings.


Introduction
Compared with general images, hyperspectral images provide more abundant pixel-level spectral information, since they contain hundreds of spectral bands. This enables pixel-level classification of hyperspectral images, which has made hyperspectral image classification a hot topic in remote sensing. Nowadays, it is widely used in many fields, such as crop estimation [1], soil salinity estimation [2], and mineral mapping [3].
Similar to computer vision, most state-of-the-art deep-learning-based methods are built on CNN architectures, since CNNs have achieved the best performance in both fields. However, unlike in computer vision, the state-of-the-art CNNs for hyperspectral image classification use 3D rather than 2D convolutional architectures. 3D CNNs require far more computational resources, and researchers have spent considerable time designing efficient 3D CNN models for hyperspectral image classification. Additionally, CNNs fail to capture the sequence attributes of spectral signatures well, particularly middle- and long-term dependencies.
The redundant spectral information of hyperspectral images leads to high costs of hyperspectral data acquisition. Labeling large amounts of hyperspectral data is unrealistic because it requires extensive human resources. Unsupervised learning methods work without labels and can therefore sidestep this challenge. Thus, research on unsupervised hyperspectral image classification is urgently required.
Unsupervised learning aims to extract information from data without labels. According to the mainstream view, unsupervised methods can be divided into representative learning and discriminative learning. Most representative learning methods are based on two models: the autoencoder [12] and the generative adversarial network (GAN) [13]. Both aim to map the training data to a certain distribution; from that distribution, we can obtain fake samples similar to real data, or useful feature extractors. For hyperspectral image classification, the stacked sparse autoencoder (SSAE) extracts sparse spectral features and multiscale spatial features using the autoencoder [14]. A GAN [15] has also been used for hyperspectral image classification. Furthermore, the 3D convolutional autoencoder (3DCAE) [16], the GAN-assisted CapsNet (TripleGAN) [17], and many other representative-learning-based models have been proposed for hyperspectral image classification.
Besides representative learning, discriminative learning is another way to collect information from unlabeled data, and contrastive learning is a popular discriminative learning method. Unlike representative learning, contrastive learning aims to discriminate between different data samples rather than model the data distribution, and it requires far fewer computational resources than representative learning methods. After being trained by comparing different samples, the discriminator can be applied to downstream tasks such as image recognition [18] and object detection [19].
Contrastive learning has achieved great success in the computer vision field. However, some challenges remain when applying contrastive learning to hyperspectral image classification. Firstly, the data augmentation methods commonly used in the computer vision field are not applicable to the hyperspectral image classification field. For example, color distortion is a typical data augmentation method used for general images. However, when used for hyperspectral image classification, color distortion disrupts the spectral information in hyperspectral images. Therefore, this method is unsuitable for hyperspectral image augmentation. Secondly, the models used in contrastive learning, mostly 2D CNNs, for computer vision tasks are not applicable in hyperspectral image processing. CNNs are not able to mine the sequence attributes of spectral signatures well.
To introduce contrastive learning to hyperspectral image classification, we handled these two problems in this study. Firstly, we devised a useful data augmentation method for hyperspectral image classification. Secondly, since 3D CNNs require much more computational resources than 2D CNNs, we used a transformer model instead of convolutional neural networks as the feature extractor. The transformer architecture is well-designed for effectively processing and analyzing sequential data. We only adjusted the model parameters, which had little impact on the model computation.
In this article, we introduce bootstrap your own latent (BYOL), a state-of-the-art contrastive learning framework, to hyperspectral image classification using a transformer architecture. After training under the unsupervised condition, the transformer model can extract features from the input hyperspectral image. An SVM model then achieves excellent classification performance using only a small percentage of the features and their corresponding labels for training. The contributions of our work are as follows:

1.
We introduce contrastive learning to extract hyperspectral image features in the unsupervised setting, which removes manual labeling costs. As many contrastive methods rely on large numbers of negative samples to work well, which greatly reduces training efficiency, we use BYOL as the contrastive learning framework; BYOL avoids the need for negative examples. Moreover, we adjust the data augmentation methods to make them suitable for hyperspectral image classification.

2.
We introduce a vision transformer into unsupervised hyperspectral image classification. The transformer model does not contain any convolutional or recurrent units. The 3D CNNs used for hyperspectral image classification need far more computational resources than the 2D CNNs used in computer vision, and they fail to process sequential data well.
We adjust the transformer model parameters to ensure the transformer is suitable for hyperspectral image processing. Additionally, transformers used for common computer vision tasks contain at least 12 layers, whereas we apply a two-layer transformer model in this paper, which reduces the model size for better computational efficiency. The 12-layer transformer for computer vision has 86 million parameters, while our model has far fewer.

3.
We combine a contrastive learning method and a transformer model as the framework of our unsupervised model. Our proposed model performs better than traditional representative methods in hyperspectral image classification.
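The computational advantage of the shallow design in contribution 2 can be sanity-checked by counting parameters. The configurations below are illustrative only (a narrow two-layer encoder versus a ViT-Base-like 12-layer, 768-dimensional encoder), not the paper's exact hyperparameters:

```python
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    # Total number of learnable parameters.
    return sum(p.numel() for p in model.parameters())

# Illustrative configurations, not the paper's exact hyperparameters.
small = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128),
    num_layers=2)
vit_base_like = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072),
    num_layers=12)

# The 12-layer, 768-dimensional encoder has tens of millions of parameters,
# while the two-layer model is orders of magnitude smaller.
print(param_count(small), param_count(vit_base_like))
```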
We organize the remainder of the paper as follows: A brief overview of the contrastive learning and transformer model is presented in Section 2. Our proposed model is introduced in Section 3. In Section 4, we provide experimental descriptions and result analysis. Finally, the conclusions are described in Section 5.

Contrastive Learning
Contrastive learning has achieved unprecedented performance in computer vision tasks. This type of unsupervised approach differs from representative learning in that it operates on augmented views of each sample, thereby avoiding the computational cost of generating fake samples. Several contrastive learning methods have been proposed, such as a simple framework for contrastive learning of visual representations (SimCLR) [20], momentum contrast for unsupervised visual representation learning (MoCo) [21], and bootstrap your own latent (BYOL) [22].
Among these three methods, SimCLR and MoCo depend not only on positive pairs but also on negative pairs, and obtaining the negative pairs is often time-consuming. BYOL, in contrast, needs no negative pairs and has thus achieved the best training efficiency to date.
The BYOL architecture is shown in Figure 1. BYOL aims to minimize the similarity loss between the online prediction q(z) and the stop-gradient target projection sg(z′). It uses two neural networks: the online network, composed of an encoder, a projector, and a predictor, and the target network, which shares the encoder-projector structure but uses different weights. After training, only the online encoder f is retained, and it can be used for downstream tasks. A detailed introduction to BYOL is presented in Section 3.

Transformer
The transformer model was originally proposed for natural language processing (NLP) tasks in 2017 [23] and has recently been applied to image classification as the "vision transformer" [24]. The transformer model we used in the experiments is shown in Figure 2; we did not change the architecture itself. The model consists of several identical layers, each composed of two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer has a residual connection followed by layer normalization, so the output of each sub-layer is LayerNorm(x + SubLayer(x)), where SubLayer(x) is the function implemented by the sub-layer. The multi-head self-attention is defined as MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V). The attention here is formulated as Attention(Q, K, V) = softmax(QK^T / √d_k)V, where the learnable projection matrices W_i^Q and W_i^K map to dimension d_k and W_i^V maps to dimension d_v.
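As a concrete illustration, the scaled dot-product attention can be written directly, and PyTorch's built-in encoder layer implements the LayerNorm(x + SubLayer(x)) pattern described above. This is a minimal sketch, not the paper's code:

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

# One encoder layer: multi-head self-attention plus a feed-forward network,
# each sub-layer wrapped as LayerNorm(x + SubLayer(x)).
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128,
                                   batch_first=True)
x = torch.randn(2, 10, 64)   # (batch, sequence length, embedding dim)
out = layer(x)               # same shape as the input
```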

Proposed Method
Our proposed framework, as shown in Figure 3, consists of two important parts: a contrastive learning method and a transformer model. Both have recently achieved satisfactory results in general image classification. However, these methods cannot be used directly on hyperspectral images because the spectral information in hyperspectral images must not be disrupted. Based on the characteristics of hyperspectral images, we modified the augmentation methods in contrastive learning to make them applicable to hyperspectral images.
Here, we chose BYOL as the contrastive learning method. BYOL uses two neural networks, the online and target networks, which learn by interacting with each other. The online network consists of an encoder f, a projector g, and a predictor q; the target network shares the encoder-projector structure but uses different weights, and sg denotes the stop-gradient operation. First, the online network outputs a representation y, a projection z, and a prediction q(z) from the weakly augmented view of the hyperspectral image, and the target network outputs a representation y′ and a target projection z′ from the strongly augmented view. Second, the loss between the L2-normalized prediction q(z) and target projection z′ is calculated as L_θ,ξ = ‖q(z)/‖q(z)‖₂ − z′/‖z′‖₂‖₂² = 2 − 2 · ⟨q(z), z′⟩ / (‖q(z)‖₂ ‖z′‖₂), where ⟨·, ·⟩ is the inner product. To symmetrize the loss, a second term L̃_θ,ξ is computed by feeding the strongly augmented view into the online network and the weakly augmented view into the target network. The final loss is L^BYOL_θ,ξ = L_θ,ξ + L̃_θ,ξ. At each training step, BYOL minimizes this loss with respect to the online parameters θ only, while the target parameters ξ are a slowly moving exponential average of θ: ξ ← τξ + (1 − τ)θ, where τ is the target decay rate. It has been shown empirically that combining a predictor in the online network with a moving average of the online parameters as the target network encourages encoding more and more information within the online projection and avoids collapsed solutions, such as constant representations. In this study, we reduced the model size for computational efficiency and applied a new image augmentation method for hyperspectral images.
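The loss and the target update above can be sketched in PyTorch as follows. This is a minimal illustration, assuming the two networks expose their parameters in matching order; it is not the paper's code:

```python
import torch
import torch.nn.functional as F

def byol_loss(prediction, target_projection):
    # 2 - 2 * cosine similarity between the L2-normalized prediction q(z)
    # and the (stop-gradient) target projection z'.
    p = F.normalize(prediction, dim=-1)
    z = F.normalize(target_projection.detach(), dim=-1)  # detach = stop-gradient
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.99):
    # xi <- tau * xi + (1 - tau) * theta
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_((1 - tau) * p_o)
```

The symmetrized loss is obtained by calling `byol_loss` a second time with the two augmented views swapped between the networks and summing the two terms.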
The two views of the hyperspectral image used in BYOL are differently augmented images. In our model, we take a horizontal or vertical flip as the preliminary augmentation, followed by different random erasures as the differing augmentation. Random erasure comes in two types, random rectangular area erasure and random point erasure; neither erases the center point, as shown in Figure 4. The procedure for selecting and erasing a rectangular area is presented in Algorithm 1, and the procedure for selecting and erasing points is shown in Algorithm 2. Furthermore, we employed the transformer model as the encoder instead of a CNN. In a transformer model, the input image is split into fixed-size patches that are then reshaped into 1D sequences. Next, a fully connected layer produces the embedding of each sequence. After the position embeddings and an extra learnable sequence are added, the sequences are fed to a two-layer transformer encoder, and the first output sequence is used for further processing. Unlike the vision transformer, we remove the MLP head and apply a small patch size. Additionally, our proposed model contains many fewer layers than the vision transformer, which consists of at least 12 layers; as shown in Figure 2, we use only a two-layer model to reduce computational resource consumption. The 12-layer transformer for computer vision has 86 million parameters, whereas our model has many fewer, as shown in Table 1.
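The two erasure types can be sketched as follows. This is our reading of the procedures in Algorithms 1 and 2, with several choices of our own (patch layout (H, W, bands), erased values set to zero, a retry loop so the rectangle misses the center), not a verbatim reproduction:

```python
import numpy as np

def random_rect_erase(patch, max_frac=0.5, rng=None):
    """Zero out a random rectangle in an (H, W, B) patch, never covering the center pixel."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = patch.shape
    ch, cw = h // 2, w // 2                      # center pixel to preserve
    out = patch.copy()
    for _ in range(10):                          # retry until the rectangle misses the center
        rh = rng.integers(1, max(2, int(h * max_frac)))
        rw = rng.integers(1, max(2, int(w * max_frac)))
        top = rng.integers(0, h - rh + 1)
        left = rng.integers(0, w - rw + 1)
        if not (top <= ch < top + rh and left <= cw < left + rw):
            out[top:top + rh, left:left + rw, :] = 0
            return out
    return out

def random_point_erase(patch, n_points=8, rng=None):
    """Zero out the spectra of randomly chosen pixels, excluding the center pixel."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = patch.shape
    out = patch.copy()
    coords = [(i, j) for i in range(h) for j in range(w) if (i, j) != (h // 2, w // 2)]
    idx = rng.choice(len(coords), size=min(n_points, len(coords)), replace=False)
    for k in idx:
        i, j = coords[k]
        out[i, j, :] = 0
    return out
```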
After training, we can use the two-layer transformer model to obtain the features of an HSI image under the unsupervised condition. Then, based on the features, with a small proportion of the label information, a simple SVM can achieve high accuracy.
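This downstream step can be sketched with scikit-learn. The random features below merely stand in for the frozen transformer's outputs (hypothetical data, not the paper's experiment):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for features the frozen transformer encoder would produce.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))
labels = rng.integers(0, 9, size=500)

# A small labeled fraction trains the SVM; the rest is used for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, train_size=0.1, stratify=labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
preds = clf.predict(X_te)
```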

Experimental Descriptions and Result Analysis
In this section, the experimental descriptions and result analysis are presented.

Datasets' Description
The experiments were conducted on three publicly available datasets, Indian Pines (IP), University of Pavia (UP), and Salinas Scene (SV), as indicated in Figure 5.
The IP dataset was collected by the AVIRIS sensor in Northwestern Indiana. The scene size is 145 × 145 pixels. After the bands absorbed by water are removed, 200 bands remain. The 10,249 labeled pixels are divided into 16 classes.
The UP scene was gathered by the ROSIS sensor over Pavia, Northern Italy. UP has 610 × 340 pixels, each with 103 spectral bands. The 42,776 labeled pixels are divided into nine categories.
The SV image was acquired by the AVIRIS sensor over Salinas Valley, California. The image size is 512 × 217 with 204 available spectral bands. The 54,129 labeled pixels are partitioned into 16 categories.

Experimental Parameters
The parameters of our model are shown in Tables 1 and 2. All experiments were performed on a Titan RTX GPU. The model was implemented in Python using the PyTorch framework. We adopted traditional principal component analysis (PCA) to remove the spectral redundancy. The input size is 27 × 27 × 30 for IP and 27 × 27 × 15 for UP and SV. We set the patch size to 3 and the batch size to 256. The target decay rate was 0.99, and the learning rate was set to 0.003. We trained the transformer model for 20 epochs and chose the model with the lowest loss for testing.
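The PCA preprocessing and patch extraction described here can be sketched as follows. This is a minimal illustration using scikit-learn; the padding mode and helper names are our own choices, not taken from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube, n_components=30):
    """Reduce the spectral dimension of an (H, W, B) hyperspectral cube with PCA."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)                       # one row per pixel spectrum
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

def extract_patch(cube, row, col, size=27):
    """Take a size x size spatial patch centered on (row, col), with reflect padding at edges."""
    r = size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + size, col:col + size, :]
```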
For IP and UP, the proportion of samples for training was set to 10%. For SV, we selected 5% of the samples for training.

Result Analysis
In this study, we used the overall accuracy (OA) and average accuracy (AA) as performance evaluation metrics. OA is the proportion of correctly classified samples among all samples; AA is the average of the per-class classification accuracies. We adopted six other methods as baselines: three supervised methods, namely linear discriminant analysis (LDA) [25], a deep convolutional neural network (1D-CNN) [26], and supervised deep feature extraction (S-CNN) [27]; and three unsupervised methods, namely the 3D convolutional autoencoder (3DCAE) [16], the adversarial autoencoder (AAE) [28], and the variational autoencoder (VAE) [28].
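OA and AA can be computed from a confusion matrix as follows, a straightforward sketch of the two metrics as defined above:

```python
import numpy as np

def oa_aa(y_true, y_pred, n_classes):
    # Confusion matrix: rows = true class, columns = predicted class.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()                # correctly classified / all samples
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()  # mean of per-class accuracies
    return oa, aa
```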
The classification results on IP are shown in Table 3 and Figure 6 and demonstrate that our model achieves the best performance on IP. As IP is the most difficult of the three public datasets to classify, we conclude that our model is superior to the others. The classification results on SV are presented in Table 4 and Figure 7. Similar to IP, our proposed method performed best in 10 classes and approached the best performance in the other six. However, according to the results on UP shown in Table 5 and Figure 8, the AA of our model is second only to that of the AAE among all seven models; we presume this is due to the small number of training samples in the last category. Considering the overall performance, our findings demonstrate the feasibility of deep-learning-based hyperspectral image classification without convolution, and the transformer is a promising model for hyperspectral image classification that warrants further investigation. Even though our model works without label information, it still outperformed the supervised methods. Additionally, comparison with the LDA results indicates that the deep-learning-based models are superior to this machine-learning-based method.
According to the above experimental results and analysis, we conclude that our model can effectively extract features under unsupervised conditions and is thus suitable for addressing the challenge posed by the lack of labels. The model contains no convolutional operations, and our findings show that convolutional operations are not necessary for hyperspectral image classification. Additionally, 2D convolutional networks designed for traditional computer vision are not well suited to hyperspectral image classification, and a 3D convolutional network provides better hyperspectral classification results than a 2D one. The transformer model we used in the experiments has the same structure as the transformer used for traditional computer vision tasks; we changed only the model size, which demonstrates that computer vision models have huge potential for hyperspectral image classification. The other important part of our method, contrastive learning, relies on data augmentation of the input image. Because some data augmentation methods from traditional computer vision are unsuitable for hyperspectral images, we used image flips and the erasure of the spectral information of selected points to augment the data. Even with this small set of augmentation methods, the contrastive learning method performs much better than the representative learning method; with more augmentation methods, it may become even more accurate for hyperspectral image classification.

Conclusions
In this paper, we proposed an unsupervised framework based on a transformer and contrastive learning, both of which reduce computational resource consumption. Experiments on three publicly available datasets demonstrated that the proposed method is more accurate than the compared methods. As transformers for visual tasks and contrastive learning methods are widely used in computer vision, we believe our proposed method has great potential for hyperspectral image processing.