Article

A Lightweight 1-D Convolution Augmented Transformer with Metric Learning for Hyperspectral Image Classification

The State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha 410000, China
*
Author to whom correspondence should be addressed.
Sensors 2021, 21(5), 1751; https://doi.org/10.3390/s21051751
Submission received: 20 December 2020 / Revised: 21 February 2021 / Accepted: 25 February 2021 / Published: 3 March 2021
(This article belongs to the Special Issue Hyperspectral Remote Sensing of the Earth)

Abstract

Hyperspectral image (HSI) classification is the subject of intense research in remote sensing. The tremendous success of deep learning in computer vision has recently sparked interest in applying deep learning to hyperspectral image classification. However, most deep learning methods for hyperspectral image classification are based on convolutional neural networks (CNNs), which require heavy GPU memory and long running times. Recently, another deep learning model, the transformer, has been applied to image recognition, and the results demonstrate the great potential of the transformer network for computer vision tasks. In this paper, we propose a model for hyperspectral image classification based on the transformer, which is widely used in natural language processing. In addition, to the best of our knowledge, this is the first work to combine metric learning with the transformer model for hyperspectral image classification. Moreover, to improve classification performance when the available training samples are limited, we use a 1-D convolution layer and the Mish activation function. The experimental results on three widely used hyperspectral image data sets demonstrate the proposed model's advantages in accuracy, GPU memory cost, and running time.

1. Introduction

Hyperspectral image (HSI) classification is a focal point of remote sensing research because of its many applications, such as change detection [1], land-use classification [2,3], and environmental protection [4]. However, because of redundant spectral band information, large data volumes, and the limited number of training samples, pixel-wise classification of hyperspectral images remains a formidable challenge.
Deep learning (DL) has become extremely popular because of its ability to extract features from raw data. It has been applied to computer vision tasks such as image classification [5,6,7,8], object detection [9], semantic segmentation [10], and facial recognition [11]. As a classical visual classification task, HSI classification has also been influenced by DL. For example, Chen et al. [12] proposed a stacked autoencoder for feature extraction. Ma et al. [13] introduced an updated deep autoencoder to extract spectral–spatial features. Zhang et al. [14] adopted a recursive autoencoder (RAE) as a high-level feature extractor to produce feature maps from the neighborhoods of target pixels. Chen et al. [15] combined a deep belief network (DBN) with a restricted Boltzmann machine (RBM) for hyperspectral image classification.
However, these methods extract features at the cost of destroying the original spatial structure. Because convolutional neural networks (CNNs) can extract spatial features without destroying that structure, CNN-based methods have since been introduced. For example, Chen et al. [16] designed a novel 3-D-CNN model with regularization for HSI classification. Roy et al. [17] proposed a hybrid 3-D and 2-D model for HSI classification.
In addition to modifying the CNN structure, deep metric learning (DML) has been applied to improve CNN classification performance. A metric learning loss term is added to the CNN objective function to enhance the model's discriminative power. Cheng et al. [18] designed a DML method based on existing CNN models for remote sensing image classification. Guo et al. [19] proposed a DML framework for HSI spectral–spatial feature extraction and classification.
Recently, another deep learning model, the transformer, has been applied to computer vision tasks. The transformer was proposed for machine translation in Reference [20] and has become the state-of-the-art model for many natural language processing (NLP) tasks. In Reference [21], the Vision Transformer explored the direct application of the transformer network to image recognition.
In this paper, inspired by the Vision Transformer, we propose a lightweight network based on the transformer for hyperspectral image classification. The main contributions of the paper are described below.
(1) First, the key component of our proposed model is the transformer encoder. The transformer encoder uses no convolution operations and therefore requires much less GPU memory and far fewer trainable parameters than a convolutional neural network. The 1-D convolution layer in our model serves only as the projection layer that produces the embedding of each sequence.
(2) Second, to obtain better classification performance, we replace the linear projection layers of the traditional vision transformer with a 1-D convolution layer and adopt a new activation function, Mish [22].
(3) Third, we introduce a metric learning mechanism, which makes the transformer model more discriminative. To the best of our knowledge, the present study is the first to combine metric learning with the transformer model for hyperspectral image classification.
The rest of this article is organized as follows. Section 2 introduces the proposed framework. The experimental results and analysis of different methods are provided in Section 3 and Section 4. Finally, Section 5 presents the conclusions.

2. Methods

The overall architecture of our proposed model is shown in Figure 1. First, we split the input image into fixed-size patches and reshape them into 1-D sequences. Next, a 1-D convolution layer produces the embedding of each sequence; the embedding of the central sequence is supervised by the center loss. After the position embedding is added, the sequences are fed into a standard two-layer transformer encoder. The encoder output is then processed by the fully connected layers, which consist of a layernorm layer, several fully connected layers, and the Mish activation function. The output is the classification result.
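To make the data flow concrete, the following is a minimal PyTorch sketch of the pipeline just described, sized for the Indian Pines configuration (25 × 25 × 30 input, 25 patches, hidden size 120, 15 heads, MLP size 32; see Table 4 and Table 5). The module names, the use of torch.nn.TransformerEncoder, the shared convolution kernel length, and the choice of the central token for classification are our own assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class LiteHSITransformer(nn.Module):
    """Sketch: 25 flattened patches -> shared 1-D conv embedding -> position
    embedding -> 2-layer transformer encoder -> layernorm + MLP head."""

    def __init__(self, n_patches=25, patch_len=750, d_model=120,
                 n_heads=15, mlp_size=32, n_classes=16):
        super().__init__()
        # One Conv1d reused for every patch (parameter sharing); the kernel length
        # is chosen so a patch of length 750 maps to a 120-dimensional embedding.
        self.proj = nn.Conv1d(1, 1, kernel_size=patch_len - d_model + 1)
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=mlp_size, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, mlp_size),
                                  nn.Mish(),
                                  nn.Linear(mlp_size, n_classes))

    def forward(self, patches):
        # patches: (batch, 25, 750); the central patch is assumed to sit at index 12
        b, n, length = patches.shape
        emb = self.proj(patches.reshape(b * n, 1, length)).reshape(b, n, -1)
        center_feat = emb[:, n // 2]          # embedding supervised by the center loss
        x = self.encoder(emb + self.pos_emb)  # standard two-layer transformer encoder
        logits = self.head(x[:, n // 2])      # classify from the central token (assumption)
        return logits, center_feat

logits, feat = LiteHSITransformer()(torch.randn(4, 25, 750))
print(logits.shape, feat.shape)               # torch.Size([4, 16]) torch.Size([4, 120])
```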

2.1. Projection Layer with Metric Learning

We use a 1-D convolution layer as the projection layer. The 1-D convolution is computed by convolving a 1-D kernel with the 1-D data. The computational complexity of a 1-D convolution layer is drastically lower than that of 2-D and 3-D convolution layers, which gives 1-D convolution layers a significant running-time advantage. The computational process is presented in Figure 2. In 1-D convolution, the activation value at spatial position x in the jth feature map of the ith layer, denoted as $v_{i,j}^{x}$, is generated using Equation (1) [23].
$$v_{i,j}^{x} = b_{i,j} + \sum_{m} \sum_{r=0}^{R_i - 1} W_{i,j,m}^{r} \, v_{i-1,m}^{x+r},$$ (1)
where $b_{i,j}$ is the bias, $m$ indexes the feature cubes in the $(i-1)$th layer connected to the current feature cube, $W_{i,j,m}^{r}$ is the $r$th value of the kernel connected to the $m$th feature cube in the previous layer, and $R_i$ denotes the length of the convolution kernel.
In our model, the input image is split into 25 patches, so we apply 25 projection layers. To make the model perform better, we adopt two techniques: parameter sharing and metric learning. In our proposed model, the parameter sharing strategy accelerates convergence and improves classification accuracy.
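The snippet below contrasts 25 independent projection layers with a single shared one, using sizes inferred from Table 5 (flattened patch length 750, embedding length 120, 632 parameters per conv1d, which is consistent with a kernel of length 631 plus one bias); the exact kernel length is our inference rather than something stated in the text.

```python
import torch
import torch.nn as nn

patch_len, d_model, n_patches = 750, 120, 25
kernel = patch_len - d_model + 1                      # 631, so the output length is 120

unshared = nn.ModuleList([nn.Conv1d(1, 1, kernel) for _ in range(n_patches)])
shared = nn.Conv1d(1, 1, kernel)

print(sum(p.numel() for p in unshared.parameters()))  # 25 * 632 = 15,800 trainable values
print(sum(p.numel() for p in shared.parameters()))    # 632: one kernel reused for all patches

patches = torch.randn(n_patches, 1, patch_len)        # the 25 flattened patches of one sample
emb_shared = shared(patches)                          # (25, 1, 120), same weights for every patch
emb_unshared = torch.cat([conv(patches[i:i + 1]) for i, conv in enumerate(unshared)])
```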
Metric learning enhances the discriminative power of the model by decreasing intraclass distances and increasing interclass distances. The metric learning loss term used in our experiments is the center loss [24], formulated as:
$$L_C = \frac{1}{2M} \sum_{i=1}^{M} \left\| x_i^{*} - c_{y_i} \right\|_2^2,$$ (2)
where $x_i^{*}$ denotes the learned embedding of the $i$th central patch in the batch, for $i = 1, \dots, M$, and $c_k$ is the center of the $k$th class computed from the embeddings of that class, for $k = 1, \dots, K$.
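A minimal sketch of Equation (2) is shown below; treating the class centers as learnable parameters optimized jointly with the network is a common simplification (Wen et al. [24] update the centers with a dedicated rule), so the details are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Squared distance between each central-patch embedding and its class center,
    averaged over the batch, as in Equation (2)."""

    def __init__(self, n_classes=16, feat_dim=120):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, feats, labels):
        # feats: (M, feat_dim) central-patch embeddings; labels: (M,) class indices
        diff = feats - self.centers[labels]
        return 0.5 * diff.pow(2).sum(dim=1).mean()

# In training, the total loss is cross-entropy plus this term scaled by the small
# metric learning loss weight given in Section 3.1.
```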

2.2. Transformer Encoder

The transformer network was proposed by Vaswani et al. [20]. It is composed of several identical layers, each made up of two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network, as shown in Figure 3. A residual connection followed by layer normalization is employed around each sub-layer, so the output of each sub-layer can be written as LayerNorm(x + SubLayer(x)), where SubLayer(x) denotes the function implemented by the sub-layer itself. The multi-head self-attention is defined as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W^{O},$$ (3)
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$, and $W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_q}$, $W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$ are parameter matrices. The attention is formulated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$ (4)
where the queries $Q$ and keys $K$ have dimension $d_k$, and the values $V$ have dimension $d_v$.
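The following sketch spells out Equation (4) and the residual-and-layernorm wiring of one sub-layer; it uses torch.nn.MultiheadAttention for the multi-head part and is an illustration under our own assumptions, not the authors' code.

```python
import math
import torch
import torch.nn as nn

def attention(q, k, v):
    """Equation (4): softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

# One head on a toy batch: 25 tokens of width d_k = d_v = 8 (i.e., 120 / 15 heads)
q = k = v = torch.randn(2, 25, 8)
print(attention(q, k, v).shape)        # torch.Size([2, 25, 8])

# Multi-head self-attention sub-layer with the LayerNorm(x + SubLayer(x)) wiring
mha = nn.MultiheadAttention(embed_dim=120, num_heads=15, batch_first=True)
norm = nn.LayerNorm(120)
x = torch.randn(2, 25, 120)
attn_out, _ = mha(x, x, x)             # Q = K = V = x for self-attention
x = norm(x + attn_out)                 # residual connection followed by layer normalization
```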

2.3. Fully Connected Layers

The fully connected layers consist of a layernorm layer, two fully connected layers, and the Mish activation function. The Mish function is defined as
$$\mathrm{Mish}(x) = x \cdot \tanh(\mathrm{softplus}(x)) = x \cdot \tanh\left(\ln(1 + e^{x})\right),$$ (5)
where x is the input of the function. The difference between Mish and ReLU is shown in Figure 4. The benefit of Mish is demonstrated in Section 4.3.
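Equation (5) translates directly into code; the snippet below is a small sketch and also checks the result against PyTorch's built-in torch.nn.Mish.

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x))), Equation (5)."""
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4.0, 4.0, steps=9)
print(mish(x))                                      # smooth, non-monotonic, slightly negative for x < 0
print(torch.allclose(mish(x), torch.nn.Mish()(x)))  # True: matches the built-in implementation
```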

3. Experiment

3.1. Data Set Description and Training Details

We evaluate the proposed model on three publicly available hyperspectral image data sets, namely Indian Pines, University of Pavia, and Salinas, as illustrated in Figure 5, Figure 6 and Figure 7. The spectral radiance of these three data sets and the corresponding categories are shown in Figure 8, Figure 9, Figure 10 and Figure 11.
The Indian Pines data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over Northwestern Indiana. The scene contains 145 × 145 pixels with a spatial resolution of 20 m per pixel and 224 spectral channels in the wavelength range from 0.4 to 2.5 μm. After 24 bands corrupted by water absorption effects were removed, 200 bands remained for analysis and experiments. The 10,249 labeled pixels are divided into 16 classes.
The University of Pavia data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over Pavia, northern Italy. The image size is 610 × 340 pixels with a spatial resolution of 1.3 m per pixel and 103 spectral bands in the wavelength range from 0.43 to 0.86 μm. The 42,776 labeled pixels are divided into 9 categories.
The Salinas data set was gathered by the AVIRIS sensor over Salinas Valley, California. The image contains 512 × 217 pixels with a spatial resolution of 3.7 m per pixel and 224 spectral channels in the wavelength range from 0.4 to 2.5 μm. After 20 water-absorption bands were discarded, 204 bands remained for classification. The 54,129 labeled pixels are partitioned into 16 categories.
For Indian Pines, the proportion of samples used for training and for validation was set to 3% each. For the University of Pavia, we set the ratios of training and validation samples to 0.5%. For Salinas, we selected 0.4% of the samples for training and 0.4% for validation. Table 1, Table 2 and Table 3 list the numbers of training, validation, and testing samples for the three data sets.
All experiments were conducted on the same device with an RTX Titan GPU and 16 GB RAM. The learning rate was set to 0.0005, and the number of epochs was 200. The model with the lowest cross-entropy loss on the validation samples was selected for testing. We used mini-batches of size 256 for all experiments. The weight of the metric learning loss term was set to $10^{-6}$. We apply traditional principal component analysis (PCA) to the original HSI data to remove spectral redundancy. To compare the experimental models fairly, we use the same input size for each data set, namely 25 × 25 × 30 for Indian Pines and 25 × 25 × 15 for the University of Pavia and Salinas. To ensure the accuracy and stability of the experimental results, we ran each experiment 10 times. The parameter summary of the proposed transformer model architecture over the three data sets is shown in Table 4, Table 5, Table 6 and Table 7.
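The sketch below illustrates the preprocessing described above (PCA over the spectral axis, then a 25 × 25 neighborhood patch around each pixel); the padding mode and the sampling details are our own assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_patches(cube, n_components=30, window=25):
    """Reduce the spectral axis with PCA, then cut a window x window patch
    around every pixel (reflect padding at the borders is an assumption)."""
    h, w, bands = cube.shape
    reduced = PCA(n_components=n_components).fit_transform(
        cube.reshape(-1, bands)).reshape(h, w, n_components)
    m = window // 2
    padded = np.pad(reduced, ((m, m), (m, m), (0, 0)), mode="reflect")
    # In practice only the labeled training/validation/testing pixels are sampled;
    # building every patch at once, as here, is just for clarity.
    return np.stack([padded[i:i + window, j:j + window]
                     for i in range(h) for j in range(w)])

# e.g., an Indian Pines cube of 145 x 145 x 200 yields 145*145 patches of 25 x 25 x 30
```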

3.2. Classification Results

To evaluate the performance of the proposed method, we used three metrics: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa). OA is the percentage of correctly classified samples over the total number of test samples, AA is the mean of the per-class accuracies, and Kappa measures the consistency between the predicted results and the ground truth. The results of the proposed transformer model are compared with those of traditional deep-learning methods: 2-D-CNN [25], 3-D-CNN [26], Multi-3-D-CNN [27], and HybridSN [17].
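For reference, the three metrics can be computed from a confusion matrix as in this short sketch (scikit-learn's cohen_kappa_score is used for Kappa; this is our own convenience code, not the authors').

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def oa_aa_kappa(y_true, y_pred):
    """Overall accuracy, average (per-class) accuracy, and the Kappa coefficient."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                  # correctly classified / total test samples
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))    # mean of the per-class accuracies
    kappa = cohen_kappa_score(y_true, y_pred)     # agreement corrected for chance
    return oa, aa, kappa
```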
Table 8, Table 9 and Table 10 show the classification results of the different methods; the numbers after the plus-minus signs are the variances of the corresponding metrics. Figure 12, Figure 13 and Figure 14 show the classification maps produced by each method. A preliminary analysis of the results reveals that our proposed model provides more accurate classification results than the other models on every data set. Among the contrast models, the OA, AA, and Kappa of HybridSN were higher than those of the other contrast models, which indicates that 3-D-2-D-CNN models are more suitable for hyperspectral image classification with limited training samples than models that use 2-D or 3-D convolution alone. Second, the performance of 2-D-CNN was better than that of 3-D-CNN or Multi-3-D-CNN, probably because large parameter sizes easily lead to overfitting when training samples are scarce.
Considering the spectral information, we can conclude that it greatly influences classification accuracy. For example, in Indian Pines, the spectral signature of Grass-pasture-mowed is significantly different from those of the other classes over the first 75 spectral bands, and the pixels of Grass-pasture-mowed have similar spectral signatures. Consequently, although we used only one Grass-pasture-mowed sample for training, all the models reached an accuracy of at least 85% on this class.
Table 11 summarizes the parameter sizes of the five models, listing for each model the number of parameters and the memory space occupied by those parameters. It is apparent from this table that the transformer model has the smallest parameter size, indicating its efficiency.
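The parameter and memory figures of Table 11 can be reproduced for any PyTorch model with a helper of this kind, assuming 4 bytes per float32 parameter; this utility is our own, not part of the paper.

```python
import torch.nn as nn

def parameter_report(model: nn.Module) -> str:
    """Trainable parameter count and its float32 memory footprint."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{n:,} parameters, {n * 4 / 2**20:.2f} MB"

# e.g., the roughly 150 K parameters reported for the Indian Pines configuration
# correspond to about 0.58 MB at 4 bytes per parameter.
```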
Table 12 shows the floating-point operations (FLOPs) of the five models, and Table 13, Table 14 and Table 15 compare the training and testing times of the five models on each data set. Our proposed model required the fewest FLOPs. Because of GPU acceleration of convolutions, however, the 2-D-CNN ran faster than our proposed model. Considering that the accuracy of the 2-D-CNN was lower than that of our proposed model, we believe our method offers a better balance between accuracy and efficiency.

4. Discussion

In this section, we provide further analysis of our proposed model. First, the results show that metric learning significantly improves classification performance, especially when training samples are extremely scarce. Second, controlled experiments reflect the benefits of 1-D convolution layers with parameter sharing. Third, experiments with different activation functions confirm the superiority of Mish.

4.1. Effectiveness of the Metric Learning

To prove the effectiveness of metric learning, we remove the metric learning mechanism and compare the performance of the resulting model with that of the full model.
Table 16 and Figure 15, Figure 16 and Figure 17 reveal the benefits of the metric learning mechanism; the numbers after the plus-minus signs are the variances of the corresponding metrics. The model with metric learning reaches higher accuracy, so we conclude that metric learning improves the classification results.

4.2. Effectiveness of the 1-D Convolution and Parameter Sharing

In Section 2.1, we claimed that the parameter sharing strategy improves the classification results. Here, we compare four variants: the transformer with 1-D convolution and parameter sharing, the transformer with 1-D convolution only, the transformer with linear projection layers and parameter sharing, and the transformer with linear projection layers only.
From Figure 18, we can conclude that both 1-D convolution layers and parameter sharing boost the model performance.

4.3. Effectiveness of the Activation Function

The Mish activation function improves model performance slightly. Figure 19 shows the classification OA of the transformer models with different activation functions.

5. Conclusions

In this article, we introduce the transformer architecture for hyperspectral image classification. By replacing the linear projection layer with a 1-D convolution layer, the image patches are embedded into sequences that carry more information, which increases classification accuracy. In addition, the Mish activation function is adopted instead of ReLU, further boosting model performance.
In the experiments, the influence of the three changes made to the classical vision transformer, namely metric learning, the 1-D convolution layer, and the Mish activation function, is verified. Moreover, several state-of-the-art methods based on convolutional neural networks, including 2-D CNN, 3-D CNN, multi-scale 3-D CNN, and a hybrid CNN, are compared. The results demonstrate the advantage of the proposed model, especially when training samples are scarce.

Author Contributions

X.H. and H.W. implemented the algorithms, designed the experiments, and wrote the paper; Y.L. performed the experiments; Y.P. and W.Y. guided the research. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Key Research and Development Program of China (Nos. 2017YFB1301104 and 2017YFB1001900), the National Natural Science Foundation of China (Nos. 91648204 and 61803375), and the National Science and Technology Major Project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data sets used in this paper are publicly available.

Acknowledgments

The authors acknowledge the State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, China.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HSI    Hyperspectral Image
CNN    Convolutional Neural Network
GPU    Graphics Processing Unit
DL     Deep Learning
RAE    Recursive Autoencoder
DBN    Deep Belief Network
RBM    Restricted Boltzmann Machine
DML    Deep Metric Learning
NLP    Natural Language Processing

References

1. Marinelli, D.; Bovolo, F.; Bruzzone, L. A novel change detection method for multitemporal hyperspectral images based on a discrete representation of the change information. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 161–164.
2. Liang, H.; Li, Q. Hyperspectral imagery classification using sparse representations of convolutional neural network features. Remote Sens. 2016, 8, 99.
3. Sun, W.; Yang, G.; Du, B.; Zhang, L.; Zhang, L. A sparse and low-rank near-isometric linear embedding method for feature extraction in hyperspectral imagery classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4032–4046.
4. Awad, M. Sea water chlorophyll-a estimation using hyperspectral images and supervised artificial neural network. Ecol. Inform. 2014, 24, 60–68.
5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
6. Durand, T.; Mehrasa, N.; Mori, G. Learning a deep ConvNet for multi-label classification with partial labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 647–657.
7. Liu, Y.; Dou, Y.; Jin, R.; Li, R.; Qiao, P. Hierarchical learning with backtracking algorithm based on the visual confusion label tree for large-scale image classification. Vis. Comput. 2021, 1–21.
8. Liu, Y.; Dou, Y.; Jin, R.; Qiao, P. Visual tree convolutional neural network in image classification. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 758–763.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
10. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
11. Nagpal, C.; Dubey, S.R. A performance evaluation of convolutional neural networks for face anti spoofing. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
12. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107.
13. Ma, X.; Wang, H.; Geng, J. Spectral–spatial classification of hyperspectral image based on deep auto-encoder. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 4073–4085.
14. Zhang, X.; Liang, Y.; Li, C.; Huyan, N.; Jiao, L.; Zhou, H. Recursive autoencoders-based unsupervised feature learning for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1928–1932.
15. Chen, Y.; Zhao, X.; Jia, X. Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392.
16. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251.
17. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281.
18. Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821.
19. Guo, A.J.; Zhu, F. Spectral-spatial feature extraction and classification by ANN supervised with center loss in hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1755–1767.
20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
22. Misra, D. Mish: A self regularized non-monotonic neural activation function. arXiv 2019, arXiv:1908.08681.
23. Kiranyaz, S.; Avci, O.; Abdeljaber, O.; Ince, T.; Gabbouj, M.; Inman, D.J. 1D convolutional neural networks and applications: A survey. arXiv 2019, arXiv:1905.03554.
24. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 499–515.
25. Liu, B.; Yu, X.; Zhang, P.; Tan, X.; Yu, A.; Xue, Z. A semi-supervised convolutional neural network for hyperspectral image classification. Remote Sens. Lett. 2017, 8, 839–848.
26. Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434.
27. He, M.; Li, B.; Chen, H. Multi-scale 3D deep convolutional neural network for hyperspectral image classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3904–3908.
Figure 1. The overall architecture of our proposed model.
Figure 2. The computational process of 1-D convolution.
Figure 3. The architecture of the transformer encoder.
Figure 4. The difference between Mish and Relu.
Figure 5. (a) False-color Indian Pines image. (b) Ground-truth map of the Indian Pines data set.
Figure 6. (a) False-color Salinas image. (b) Ground-truth map of the Salinas data set.
Figure 7. (a) False-color University of Pavia image. (b) Ground-truth map of the University of Pavia data set.
Figure 8. The overall spectral radiance and the corresponding categories in different data sets. (a) Indian Pines. (b) Salinas. (c) University of Pavia.
Figure 9. The spectral radiance of different pixels and the corresponding categories in Indian Pines. (a) Alfalfa. (b) Corn-notill. (c) Corn-mintill. (d) Corn. (e) Grass-pasture. (f) Grass-trees. (g) Grass-pasture-mowed. (h) Hay-windrowed. (i) Oats. (j) Soybean-notill. (k) Soybean-mintill. (l) Soybean-clean. (m) Wheat. (n) Woods. (o) Buildings-Grass-Trees-Drives. (p) Stone-Steel-Towers.
Figure 10. The spectral radiance of different pixels and the corresponding categories in Salinas. (a) Brocoli_green_weeds_1. (b) Brocoli_green_weeds_2. (c) Fallow. (d) Fallow_rough_plow. (e) Fallow_smooth. (f) Stubble. (g) Celery. (h) Grapes_untrained. (i) Soil_vinyard_develop. (j) Corn_senesced_green_weeds. (k) Lettuce_romaine_4wk. (l) Lettuce_romaine_5wk. (m) Lettuce_romaine_6wk. (n) Lettuce_romaine_7wk. (o) Vinyard_untrained. (p) Vinyard_vertical_trellis.
Figure 11. The spectral radiance of different pixels and the corresponding categories in the University of Pavia. (a) Asphalt. (b) Meadows. (c) Gravel. (d) Trees. (e) Painted metal sheets. (f) Bare Soil. (g) Bitumen. (h) Self-Blocking Bricks. (i) Shadows.
Figure 12. The classification maps of Indian Pines. (a) Ground-truth map. (b)–(f) Classification results of 2-D-convolutional neural network (CNN), 3-D-CNN, Multi-3-D-CNN, HybridSN, and Transformer.
Figure 13. The classification maps of Salinas. (a) Ground-truth map. (b)–(f) Classification results of 2-D-CNN, 3-D-CNN, Multi-3-D-CNN, HybridSN, and Transformer.
Figure 14. The classification maps of the University of Pavia. (a) Ground-truth map. (b)–(f) Classification results of 2-D-CNN, 3-D-CNN, Multi-3-D-CNN, HybridSN, and Transformer.
Figure 15. The classification maps of Indian Pines. (a) Ground-truth map. (b) Classification results of the Transformer without metric learning. (c) Classification results of the Transformer with metric learning.
Figure 16. The classification maps of Salinas. (a) Ground-truth map. (b) Classification results of the Transformer without metric learning. (c) Classification results of the Transformer with metric learning.
Figure 17. The classification maps of University of Pavia. (a) Ground-truth map. (b) Classification results of the Transformer without metric learning. (c) Classification results of the Transformer with metric learning.
Figure 18. Effectiveness of the 1-D convolution and parameter sharing.
Figure 19. Effectiveness of the activation function.
Table 1. Training, validation, and testing sample numbers in Indian Pines.
Number | Name | Training | Validation | Testing | Total
1 | Alfalfa | 1 | 1 | 44 | 46
2 | Corn-notill | 42 | 42 | 1344 | 1428
3 | Corn-mintill | 24 | 24 | 782 | 830
4 | Corn | 7 | 7 | 223 | 237
5 | Grass-pasture | 14 | 14 | 455 | 483
6 | Grass-trees | 21 | 21 | 688 | 730
7 | Grass-pasture-mowed | 1 | 1 | 26 | 28
8 | Hay-windrowed | 14 | 14 | 450 | 478
9 | Oats | 1 | 1 | 18 | 20
10 | Soybean-notill | 29 | 29 | 914 | 972
11 | Soybean-mintill | 73 | 72 | 2310 | 2455
12 | Soybean-clean | 17 | 17 | 559 | 593
13 | Wheat | 6 | 6 | 193 | 205
14 | Woods | 37 | 37 | 1191 | 1265
15 | Buildings-Grass-Trees-Drives | 11 | 11 | 364 | 386
16 | Stone-Steel-Towers | 2 | 3 | 88 | 93
Total | | 300 | 300 | 9649 | 10,249
Table 2. Training, validation, and testing sample numbers in Salinas.
Number | Name | Training | Validation | Testing | Total
1 | Brocoli-green-weeds-1 | 8 | 8 | 1993 | 2009
2 | Brocoli-green-weeds-2 | 14 | 14 | 3698 | 3726
3 | Fallow | 7 | 8 | 1961 | 1976
4 | Fallow-rough-plow | 5 | 5 | 1384 | 1394
5 | Fallow-smooth | 10 | 10 | 2658 | 2678
6 | Stubble | 15 | 15 | 3929 | 3959
7 | Celery | 14 | 14 | 3551 | 3579
8 | Grapes-untrained | 45 | 44 | 11,182 | 11,271
9 | Soil-vinyard-develop | 24 | 24 | 6155 | 6203
10 | Corn-senesced-green-weeds | 13 | 13 | 3252 | 3278
11 | Lettuce-romaine-4wk | 4 | 4 | 1060 | 1068
12 | Lettuce-romaine-5wk | 7 | 7 | 1913 | 1927
13 | Lettuce-romaine-6wk | 3 | 4 | 909 | 916
14 | Lettuce-romaine-7wk | 4 | 4 | 1062 | 1070
15 | Vinyard-untrained | 29 | 28 | 7211 | 7268
16 | Vinyard-vertical-trellis | 7 | 7 | 1793 | 1807
Total | | 209 | 209 | 53,711 | 54,129
Table 3. Training, validation, and testing sample numbers in the University of Pavia.
Number | Name | Training | Validation | Testing | Total
1 | Asphalt | 33 | 33 | 6565 | 6631
2 | Meadows | 93 | 91 | 18,465 | 18,649
3 | Gravel | 10 | 10 | 2079 | 2099
4 | Trees | 15 | 15 | 3034 | 3064
5 | Painted metal sheets | 6 | 7 | 1332 | 1345
6 | Bare Soil | 25 | 25 | 4979 | 5029
7 | Bitumen | 6 | 6 | 1318 | 1330
8 | Self-Blocking Bricks | 18 | 18 | 3646 | 3682
9 | Shadows | 4 | 5 | 938 | 947
Total | | 210 | 210 | 42,356 | 42,776
Table 4. Configuration of our proposed model variants.
Data Set | Layers | Hidden Size | MLP Size | Heads
Indian Pines | 2 | 120 | 32 | 15
Salinas | 2 | 75 | 32 | 15
University of Pavia | 2 | 75 | 32 | 15
Table 5. Parameter summary of the proposed transformer model architecture over the Indian Pines data set.
Layer (Type) | Output Shape | Parameters
inputLayer | (30, 25, 25) | 0
conv1d × 25 | (1, 120) × 25 | 632 × 25
2-layer transformer encoder | (1, 120) | 132,064
layernorm | (120) | 240
linear | (32) | 3872
Mish | (32) | 0
linear | (16) | 528
Total trainable parameters: 152,504
Table 6. Parameter summary of the proposed transformer model architecture over the Salinas data set.
Layer (Type) | Output Shape | Parameters
inputLayer | (15, 25, 25) | 0
conv1d × 25 | (1, 75) × 25 | 302 × 25
2-layer transformer encoder | (1, 75) | 55,564
layernorm | (75) | 150
linear | (32) | 2432
Mish | (32) | 0
linear | (16) | 528
Total trainable parameters: 66,224
Table 7. Parameter summary of the proposed transformer model architecture over the University of Pavia data set.
Layer (Type) | Output Shape | Parameters
inputLayer | (15, 25, 25) | 0
conv1d × 25 | (1, 75) × 25 | 302 × 25
2-layer transformer encoder | (1, 75) | 55,564
layernorm | (75) | 150
linear | (32) | 2432
Mish | (32) | 0
linear | (9) | 297
Total trainable parameters: 65,993
Table 8. Classification results of different models in Indian Pines.
No. | Training Samples | 2-D-CNN | 3-D-CNN | multi-3-D-CNN | HybridSN | Transformer
1 | 1 | 95.21 | 91.16 | 100.00 | 93.20 | 90.92
2 | 42 | 66.10 | 69.60 | 61.43 | 83.53 | 86.70
3 | 24 | 86.24 | 83.31 | 81.52 | 85.33 | 85.09
4 | 7 | 90.38 | 93.39 | 99.40 | 83.77 | 88.03
5 | 14 | 96.09 | 91.34 | 96.82 | 87.87 | 94.55
6 | 21 | 94.44 | 95.77 | 97.46 | 93.12 | 95.67
7 | 1 | 100.00 | 100.00 | 99.41 | 86.43 | 91.76
8 | 14 | 99.29 | 98.42 | 99.43 | 92.48 | 96.42
9 | 1 | 98.75 | 95.66 | 99.23 | 85.84 | 88.33
10 | 29 | 93.18 | 86.59 | 84.59 | 85.34 | 91.12
11 | 73 | 83.94 | 84.19 | 74.61 | 89.53 | 88.85
12 | 17 | 83.52 | 77.94 | 79.98 | 79.45 | 81.18
13 | 6 | 98.55 | 98.35 | 99.81 | 92.59 | 96.23
14 | 37 | 94.89 | 92.65 | 88.89 | 94.18 | 94.55
15 | 11 | 87.33 | 88.18 | 86.17 | 85.99 | 86.54
16 | 2 | 98.91 | 95.75 | 90.00 | 84.16 | 79.63
KAPPA | | 0.828 ± 0.013 | 0.822 ± 0.017 | 0.761 ± 0.021 | 0.859 ± 0.016 | 0.882 ± 0.010
OA (%) | | 85.13 ± 1.17 | 84.59 ± 1.52 | 79.43 ± 1.77 | 87.69 ± 1.48 | 89.71 ± 0.88
AA (%) | | 91.68 ± 1.01 | 90.14 ± 1.54 | 89.92 ± 2.04 | 87.68 ± 1.92 | 89.72 ± 3.01
Table 9. Classification results of different models in Salinas.
No. | Training Samples | 2-D-CNN | 3-D-CNN | multi-3-D-CNN | HybridSN | Transformer
1 | 8 | 97.73 | 99.85 | 98.17 | 96.95 | 98.47
2 | 14 | 99.55 | 98.99 | 98.90 | 97.24 | 98.54
3 | 7 | 95.64 | 94.40 | 93.01 | 98.82 | 98.02
4 | 5 | 95.82 | 95.96 | 90.97 | 96.57 | 95.59
5 | 10 | 95.36 | 96.48 | 95.46 | 96.25 | 96.04
6 | 15 | 99.69 | 99.06 | 98.32 | 97.24 | 97.98
7 | 14 | 99.43 | 98.09 | 99.12 | 99.21 | 99.03
8 | 45 | 88.46 | 87.04 | 90.37 | 95.05 | 94.27
9 | 24 | 99.71 | 99.29 | 99.00 | 98.78 | 98.97
10 | 13 | 98.93 | 96.13 | 95.44 | 95.69 | 95.90
11 | 4 | 98.52 | 88.73 | 94.46 | 97.62 | 98.30
12 | 7 | 93.75 | 92.55 | 93.44 | 97.96 | 94.58
13 | 3 | 91.00 | 86.05 | 87.08 | 90.45 | 89.99
14 | 4 | 93.01 | 93.60 | 90.15 | 94.14 | 98.20
15 | 29 | 85.40 | 86.27 | 84.28 | 87.09 | 92.75
16 | 7 | 99.49 | 96.38 | 94.34 | 96.27 | 99.89
KAPPA | | 0.934 ± 0.013 | 0.924 ± 0.017 | 0.925 ± 0.022 | 0.947 ± 0.013 | 0.957 ± 0.009
OA (%) | | 94.08 ± 1.25 | 93.20 ± 1.54 | 93.30 ± 2.03 | 95.26 ± 1.22 | 96.15 ± 0.86
AA (%) | | 95.72 ± 1.23 | 94.30 ± 1.65 | 93.91 ± 1.32 | 95.96 ± 0.69 | 96.66 ± 0.79
Table 10. Classification results of different models in the University of Pavia.
No. | Training Samples | 2-D-CNN | 3-D-CNN | multi-3-D-CNN | HybridSN | Transformer
1 | 33 | 72.50 | 70.35 | 71.81 | 83.17 | 89.98
2 | 93 | 94.77 | 95.83 | 96.62 | 96.64 | 96.89
3 | 10 | 85.90 | 62.36 | 73.75 | 70.79 | 88.56
4 | 15 | 95.62 | 77.50 | 84.48 | 84.67 | 94.82
5 | 6 | 97.54 | 98.49 | 96.05 | 94.76 | 92.43
6 | 25 | 97.06 | 96.47 | 94.88 | 94.94 | 98.06
7 | 6 | 97.78 | 80.24 | 83.94 | 80.61 | 88.01
8 | 18 | 77.08 | 64.31 | 69.62 | 71.55 | 84.98
9 | 4 | 87.23 | 69.38 | 71.26 | 85.06 | 93.89
KAPPA | | 0.848 ± 0.010 | 0.795 ± 0.029 | 0.825 ± 0.028 | 0.851 ± 0.049 | 0.916 ± 0.014
OA (%) | | 88.75 ± 0.77 | 84.72 ± 2.21 | 86.92 ± 2.10 | 88.90 ± 3.63 | 93.77 ± 1.06
AA (%) | | 89.50 ± 1.75 | 79.44 ± 3.61 | 82.49 ± 3.44 | 84.69 ± 3.09 | 91.96 ± 1.86
Table 11. Parameter size of the five methods on the three hyperspectral data sets.
Network | Indian Pines | Salinas | PaviaU
2-D-CNN | 176,736 (0.67 MB) | 165,936 (0.63 MB) | 98,169 (0.72 MB)
3-D-CNN | 1,018,476 (3.89 MB) | 771,516 (2.94 MB) | 447,374 (1.71 MB)
multi-3-D-CNN | 634,592 (2.42 MB) | 138,976 (0.53 MB) | 84,761 (0.32 MB)
HybridSN | 5,122,176 (19.54 MB) | 4,845,696 (18.48 MB) | 4,844,793 (18.48 MB)
Transformer | 152,504 (0.58 MB) | 66,224 (0.25 MB) | 65,993 (0.25 MB)
Table 12. FLOPs of the five methods on the three hyperspectral data sets.
Network | Indian Pines | Salinas | PaviaU
2-D-CNN | 11,708,240 | 5,995,040 | 5,927,280
3-D-CNN | 162,511,650 | 83,938,540 | 83,614,405
multi-3-D-CNN | 52,409,984 | 20,611,712 | 20,557,504
HybridSN | 248,152,512 | 50,948,592 | 50,947,696
Transformer | 5,294,912 | 1,988,762 | 1,988,538
Table 13. Running time of the five methods on the Indian Pines data set.
Data Set | Algorithm | Training Time (s) | Testing Time (s)
Indian Pines | 2-D-CNN | 11.0 | 0.5
 | 3-D-CNN | 54.1 | 4.26
 | multi-3-D-CNN | 56.23 | 5.10
 | HybridSN | 43.9 | 3.45
 | Transformer | 32.24 | 1.31
Table 14. Running time of the five methods on the Salinas data set.
Data Set | Algorithm | Training Time (s) | Testing Time (s)
Salinas | 2-D-CNN | 6.0 | 1.9
 | 3-D-CNN | 26.1 | 16.1
 | multi-3-D-CNN | 26.2 | 18.2
 | HybridSN | 13.9 | 7.5
 | Transformer | 13.8 | 4.6
Table 15. Running time of the five methods on the University of Pavia data set.
Data Set | Algorithm | Training Time (s) | Testing Time (s)
University of Pavia | 2-D-CNN | 5.8 | 1.5
 | 3-D-CNN | 26.2 | 12.7
 | multi-3-D-CNN | 26.2 | 14.2
 | HybridSN | 14.06 | 5.78
 | Transformer | 13.09 | 3.33
Table 16. Classification results of the transformer without metric learning and the transformer with metric learning.
No. | Indian Pines (without Metric Learning) | Indian Pines (with Metric Learning) | Salinas (without Metric Learning) | Salinas (with Metric Learning)
1 | 99.16 | 90.92 | 98.47 | 98.47
2 | 85.41 | 86.70 | 98.69 | 98.54
3 | 84.68 | 85.09 | 96.58 | 98.02
4 | 85.99 | 88.03 | 93.18 | 95.59
5 | 92.39 | 94.55 | 96.34 | 96.04
6 | 94.99 | 95.67 | 98.25 | 97.98
7 | 72.42 | 91.76 | 99.10 | 99.03
8 | 95.71 | 96.42 | 93.83 | 94.27
9 | 79.72 | 88.33 | 99.39 | 98.97
10 | 89.07 | 91.12 | 96.13 | 95.90
11 | 89.17 | 88.85 | 98.46 | 98.30
12 | 79.96 | 81.18 | 94.71 | 94.58
13 | 96.65 | 96.23 | 91.32 | 89.99
14 | 95.47 | 94.55 | 97.16 | 98.21
15 | 89.88 | 86.54 | 92.36 | 92.75
16 | 89.52 | 79.63 | 99.93 | 99.89
KAPPA | 0.876 ± 0.016 | 0.882 ± 0.010 | 0.955 ± 0.009 | 0.957 ± 0.009
OA (%) | 89.26 ± 1.39 | 89.71 ± 0.88 | 96.02 ± 0.88 | 96.15 ± 0.86
AA (%) | 88.76 ± 3.33 | 89.72 ± 3.01 | 96.49 ± 0.96 | 96.66 ± 0.79

No. | University of Pavia (without Metric Learning) | University of Pavia (with Metric Learning)
1 | 87.92 | 89.98
2 | 96.80 | 96.89
3 | 88.09 | 88.56
4 | 92.32 | 94.82
5 | 89.57 | 92.43
6 | 96.60 | 98.06
7 | 91.47 | 88.01
8 | 85.53 | 84.98
9 | 83.79 | 93.89
KAPPA | 0.905 ± 0.019 | 0.916 ± 0.014
OA (%) | 92.92 ± 1.47 | 93.77 ± 1.06
AA (%) | 90.23 ± 2.16 | 91.96 ± 1.86
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
