A Dual-Attention Deep Discriminative Domain Generalization Model for Hyperspectral Image Classification

Abstract: Recently, hyperspectral image classification has made great progress with the development of convolutional neural networks. However, due to the challenges of distribution shifts and data redundancies, the classification accuracy is low. Some existing domain adaptation methods try to mitigate the distribution shifts by training on source samples and some labeled target samples. However, in practice, labeled target domain samples are difficult or even impossible to obtain. To solve these challenges, we propose a novel dual-attention deep discriminative domain generalization framework (DAD3GM) for cross-scene hyperspectral image classification that does not train on labeled target samples. In DAD3GM, we mainly design two blocks: dual-attention feature learning (DAFL) and deep discriminative feature learning (DDFL). DAFL is designed to extract spatial features by multi-scale self-attention and spectral features by multi-head external attention. DDFL is further designed to extract deep discriminative features by contrastive regularization and class discrimination regularization. The combination of DAFL and DDFL can effectively reduce the computational time and improve the generalization performance of DAD3GM. The proposed model achieves 84.25%, 83.53%, and 80.63% overall accuracy on the public Houston, Pavia, and GID benchmarks, respectively. Compared with some classical and state-of-the-art methods, the proposed model achieves optimal results, which reveals its effectiveness and feasibility.


Introduction
With the continuous development of deep learning technology, remote sensing technology has made significant progress. Hyperspectral image (HSI) classification, as a branch of remote sensing, has long been a major research topic. It is now widely used in environmental monitoring, the food industry, precision agriculture, land cover classification, and mineral exploration [1][2][3]. The purpose of HSI classification is to classify the pixels in HSIs into different categories based on their features. Conventional image classification models are not directly applicable to hyperspectral images. Despite the great success of deep-learning-based approaches, some important challenges still exist.
On the one hand, the high data redundancies in hyperspectral images are caused by abundant bands and strong correlations among adjacent pixels. Therefore, effective feature extraction and representation learning are crucial to improve the classification accuracy. In the past, scholars have conducted research on remote sensing image segmentation, including traditional k-nearest neighbor [4], support vector machine [5,6], and convolutional neural network (CNN)-based algorithms [1]. Of these, the mainstream approaches are based on CNNs, which have outstanding feature extraction capabilities. However, CNN-based models are data-driven, and their training usually demands sufficient labeled samples with precise annotations; acquiring such a large number of manually annotated samples consumes a tremendous amount of time and labor. Some works introduce the attention mechanism [7][8][9] to capture global and local features, but the computational time of these models is not taken into account. In practical applications, the computational time of some tasks is strictly limited, such as defect detection in industrial products.
On the other hand, distribution shifts are another challenge that needs to be dealt with. As pointed out in [1,10,11], CNN-based methods are data-driven and trained in an entirely supervised manner. They extract features primarily by capturing the statistical relationships among the samples, so model generalization easily deteriorates under different distributions. As we know, the data distributions of source and target domain samples are generally inconsistent because of differences in factors such as sensor nonlinearities, weather conditions, geographic locations, and imaging conditions. As a consequence, the model performance suffers a catastrophic decrease. Domain adaptation is an effective and favored technique to remedy this limitation; it aims to complete the target-domain task by transferring knowledge learned from label-sufficient source domains to a label-scarce target domain. Recently, different domain adaptation methods have been proposed. From the perspective of whether labeled target domain samples participate in training, they can be divided into three types: supervised [12][13][14], semi-supervised [15,16], and unsupervised [17][18][19]. However, in practical applications, it is very difficult to obtain the labels of target domain samples, so the above domain adaptation models may be unsuitable for some practical problems. Compared with domain adaptation, domain generalization (DG) is more practical but also more challenging, since it does not require labeled target domain samples to participate in training. In 2023, a domain generalization method was first suggested for cross-scene hyperspectral image classification [10]. Results on domain generalization for HSI classification remain limited, possibly because the data are large and complex, annotation is difficult, and fewer standard datasets are available.
Considering that HSIs usually have large intra-class variance and high inter-class similarity and include heterogeneous spatial and spectral information, in this paper, we propose a novel dual-attention deep discriminative domain generalization model (DAD3GM) for hyperspectral image classification. In contrast to [10], we first design a dual-attention mechanism to extract important and multi-granularity features, instead of generating an extended domain as in [10]. Then, we enhance the model discriminability from the perspective of both the feature extractor and the classifier, rather than only learning discriminative features in the feature extractor. Specifically, we design a dual-attention feature learning (DAFL) module and a deep discriminative feature learning (DDFL) module. DAFL is designed to reduce the redundancies of HSIs and extract important domain-invariant spatial and spectral features; it includes multi-scale self-attention to acquire coarse-grained and fine-grained spatial texture features and multi-head external attention to acquire robust spectral features with linear complexity. DDFL further enhances the learning of discriminative feature representations by contrastive regularization and class discrimination regularization. Our DAD3GM model has excellent generalization ability for hyperspectral image classification because it improves model generalization by learning robust domain-invariant knowledge and enhancing model discrimination. Our main contributions are summarized as follows.

1. We propose a novel dual-attention deep discriminative domain generalization model for HSI classification, which accounts for the large intra-class variance, high inter-class similarity, and heterogeneous spatial and spectral information present in HSIs.

2. We develop a novel dual-attention module to effectively extract coarse-grained and fine-grained spatial texture features and important spectral features.

3. We further design a deep discriminative feature learning module, incorporating feature discrimination learning and class discrimination learning. First, to learn discriminative features, contrastive regularization is implemented to separate inter-class samples and compact intra-class samples as much as possible. Then, two independent classifiers are designed to further learn class-discriminative features.

4. Extensive experiments are carried out on three standard hyperspectral image classification datasets: Houston, Pavia, and GID. Compared with state-of-the-art methods, our model achieves optimal results.

Domain Generalization
Domain generalization aims to reduce distribution differences by transferring domain-invariant knowledge from source domains to an unseen target domain. Compared to domain adaptation (DA), it is more practical but also more challenging since the target images do not participate in training. The mainstream approaches can generally be divided into the following categories: domain-invariant representation learning [20][21][22], data augmentation [23][24][25], meta-learning [26][27][28], and ensemble learning [29][30][31]. For domain-invariant representation learning, adversarial learning [32] and maximum mean discrepancy (MMD) minimization [33] are the more popular strategies for extracting invariant information. Data augmentation is a common practice to improve algorithm generalization. Xu et al. [25] performed data augmentation by linearly mixing the amplitude information of different source domain samples to extract domain-agnostic knowledge and transferred it to target domain samples. Meta-learning tries to study episodes sampled from related tasks to benefit future learning. Chen et al. [26] took advantage of meta-knowledge to analyze the distribution shifts between source and target domains during testing. Most of these investigations are mainly designed for natural images and are not suitable for HSI work, since HSIs usually have large intra-class variance and high inter-class similarity and include heterogeneous spatial and spectral information.

Attention Mechanism
Attention mechanisms aim to imitate the human visual system, which can capture the salient regions in complex scenes [34]. An attention mechanism can be treated as a dynamic weight adjustment process based on the features of the inputs. Because of its outstanding merits, the attention mechanism has been widely applied to remote sensing tasks. Qin et al. [35] first adopted the discrete cosine transform from the perspective of the frequency domain to make up for the insufficient feature information caused by global average pooling in most existing channel attention methods. Ding et al. [36] proposed a multi-scale receptive graph attention to extract local-global adjacent node features and edge features. Zhao et al. [37] designed channel and spatial residual attention modules to integrate multi-scale features by introducing a residual connection and attention mechanism. Paoletti et al. [38] employed channel attention to automatically design a network to extract spatial and spectral features. Shi et al. [39] proposed the 3D-OCONV method and abstracted the spectral features by combining 3D-OCONV and spectral attention. Pan et al. [40] applied an attention mechanism to extract global features and designed a multi-scale fusion module to extract local features, which effectively captured spatial features at different scales.

Contrastive Learning
Contrastive learning aims to extend the distance between inter-class samples and reduce the distance between intra-class samples, which can effectively improve the discriminative ability of networks. It has attracted widespread attention and has been employed for various tasks [41]. Recently, contrastive learning has also been introduced into the remote sensing field. Wang et al. [42] proposed the Contrastive-ACE network to capture causal invariance through the causal effect from features to labels and introduced the Contrastive-ACE loss to ensure cross-domain stable predictions. Ou et al. [43] proposed an HSI change detection framework, which built a loss on the foundation of self-supervised contrastive learning to extract better features. Guan et al. [3] developed a cross-domain contrastive learning framework to extract invariant features via a cross-domain discrimination task. Zhang et al. [44] implemented feature extraction with a 3D CNN and applied contrastive learning to learn more discriminative features so as to address the high inter-class similarity and large intra-class variance of HSIs. Jia et al. [45] designed a collaborative contrastive learning strategy to learn coordinated representations and achieve matching between LiDAR and HSI without labeled examples; this strategy adequately fused the LiDAR and hyperspectral data and effectively enhanced the classification accuracy of HSI and LiDAR data. Hang et al. [46] proposed a cross-modality contrastive learning algorithm for HSI classification, which extracted feature representations of hyperspectral and LiDAR images, respectively, and designed a cross-modality contrastive loss to force features in the same patches to remain close.

Dual-Attention Feature Learning
Features play a critical role in enhancing model generalization. Domain-independent knowledge is robust and important in learning the true mapping between the inputs and their corresponding labels, while domain-related knowledge is unstable and may lead to a false mapping between the inputs and labels. Hence, it is important to extract robust domain-independent spatial and spectral features. Data augmentation is a simple but effective technique that improves the model generalization ability by transforming the data to generate more diverse samples. Among data augmentation methods, adaptive instance normalization (AdaIN) [47] has a strong style transfer effect and is usually employed in conjunction with CNNs. Furthermore, morphological manipulations maintain the primary characteristics of the structure and shape information of an image by nonlinearly transforming the features [1]; dilation and erosion are the most basic and popular morphological operations. Therefore, we first perform channel-wise AdaIN augmentation to generate diverse-style spatial and spectral features, respectively, and then perform morphological augmentation through dilation and erosion transformations to extract domain-independent structure feature representations. The AdaIN and morphological transformations can be expressed as in Equations (1) and (2), respectively,
in which µ(·) and σ(·) represent the mean and standard deviation, respectively; G_f^spa denotes the spatial feature; α denotes the adaptive learning parameter; and G̃_f^spa denotes a randomly selected spatial feature that supplies the style statistics. The augmented spectral images can be obtained in the same way. In Equation (2), S_d and S_e denote the structuring elements of dilation and erosion, respectively, and q and r denote the central pixel position of the current local patch.
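As a rough illustration of the two augmentations described above, the sketch below implements channel-wise AdaIN (restyling content features with the statistics of a randomly selected sample) and grayscale dilation/erosion via max pooling. The kernel size and the blending parameter `alpha` are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def adain(content, style, alpha=1.0, eps=1e-5):
    # Channel-wise adaptive instance normalization: normalize `content`
    # per channel, then re-scale/shift with the style sample's statistics.
    # Shapes: (B, C, H, W). `alpha` blends stylized and original features.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    stylized = s_std * (content - c_mean) / c_std + s_mean
    return alpha * stylized + (1 - alpha) * content

def morph_dilate(x, k=3):
    # Grayscale dilation: neighborhood maximum (same spatial size).
    return F.max_pool2d(x, k, stride=1, padding=k // 2)

def morph_erode(x, k=3):
    # Grayscale erosion: neighborhood minimum, via dilation of -x.
    return -F.max_pool2d(-x, k, stride=1, padding=k // 2)
```

By construction, dilation never decreases a pixel value and erosion never increases one, which is why the pair preserves coarse structure while perturbing fine detail.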
After performing data augmentation, to effectively eliminate the data redundancies and noise of the augmented source domain samples, a dual-attention feature learning (DAFL) module is designed to extract remarkable spatial and spectral features, respectively. DAFL incorporates multi-scale self-attention (MSSA) and spectral multi-head external attention (MHEA). Spatial components contain important texture and geometric features of a hyperspectral image and provide the most intuitive and basic information. For this reason, the MSSA module is devised to excavate the salient local and global spatial feature representations. The self-attention function is computed on a query matrix Q, a key matrix K, and a value matrix V, where Q, K, and V are acquired from linear projections of the input X with corresponding parameter matrices. To obtain MSSA, different dilation rates γ are set in the convolutions to obtain multi-scale Q_m and K_m. Finally, MSSA can be calculated by Equation (3), and its structure is shown in Figure 2.
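A minimal sketch of the MSSA idea, under the assumption (consistent with the parameter analysis later in the paper) that Q and K come from 3×3 dilated convolutions at two rates and the per-scale attention maps are summed; the exact projection layers and fusion in the paper may differ.

```python
import torch
import torch.nn as nn

class MultiScaleSelfAttention(nn.Module):
    # Q_m and K_m come from dilated convolutions at several rates
    # (gamma = 2 and 4 in the reported setup); V is a 1x1 projection.
    def __init__(self, channels, rates=(2, 4)):
        super().__init__()
        self.qs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.ks = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        v = self.v(x).flatten(2)                           # (B, C, HW)
        out = 0
        for q_conv, k_conv in zip(self.qs, self.ks):
            q = q_conv(x).flatten(2).transpose(1, 2)       # (B, HW, C)
            k = k_conv(x).flatten(2)                       # (B, C, HW)
            attn = torch.softmax(q @ k / c ** 0.5, dim=-1) # (B, HW, HW)
            out = out + v @ attn.transpose(1, 2)           # (B, C, HW)
        return out.reshape(b, c, h, w)
```

Because the dilated convolutions preserve spatial size, each scale contributes an attention map over the same set of pixel positions, and the scales can simply be accumulated.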
where D denotes the hidden dimension of the attention layer. The multi-scale spatial attention feature maps G_f^spa can be expressed as in Equation (4).

Hyperspectral images have abundant spectral bands, which possess rich information and can detect subtle changes and small differences in the spectra of objects. However, the spectral bands are usually redundant and consume a lot of computing resources. To extract important and robust spectral feature representations while reducing the computing resources, the MHEA component is introduced, which is shown in Figure 3. MHEA can not only capture the correlations among all the samples but also has linear complexity. Firstly, a single-head external attention EA_att(·) is implemented by two normalization layers, and two different training data memory units M_p and M_q are implemented by two linear layers without bias, which can be expressed as in Equation (5). EA_att(·) not only captures the latent relationships among all source domain samples, but also provides powerful regularization and improves the generalization ability of the attention mechanism. Secondly, EA_att(·) is extended in a similar way to the multi-head attention in Transformer to obtain MHEA, which can be expressed as in Equation (6). With multiple sub-attention components, MHEA can learn features with different weights, extract important spectral features more efficiently, and further reduce feature redundancies.
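The external-attention construction can be sketched as follows: two bias-free linear "memory" layers M_p and M_q replace the key/value projections, and a double normalization (softmax over tokens, then an L1 norm over memory slots) replaces the usual attention softmax. Head count and memory size here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadExternalAttention(nn.Module):
    # Memory units M_p and M_q are shared across all samples, so the cost
    # is linear in the number of tokens N rather than quadratic.
    def __init__(self, dim, heads=4, mem=64):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.m_p = nn.Linear(self.dk, mem, bias=False)  # memory unit M_p
        self.m_q = nn.Linear(mem, self.dk, bias=False)  # memory unit M_q
        self.w_o = nn.Linear(dim, dim)                  # output projection W_O

    def forward(self, x):                               # x: (B, N, dim)
        b, n, _ = x.shape
        x = x.reshape(b, n, self.heads, self.dk).transpose(1, 2)  # (B,H,N,dk)
        attn = torch.softmax(self.m_p(x), dim=-2)       # softmax over tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # L1 over memory
        out = self.m_q(attn)                            # (B, H, N, dk)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.w_o(out)                            # back to (B, N, dim)
```

Since M_p and M_q are learned from the whole training set, they act as a shared dictionary across samples, which is the source of the regularization effect mentioned above.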
where Head_i represents the i-th head, H represents the number of heads, and W_O denotes a linear transformation matrix that makes the dimensions of the input and output consistent. The augmented, important spatial and spectral features are fused by concatenation. Then, to ensure reliability, the augmented and original samples are fed into a two-layer convolution network with shared weights for feature extraction.

Deep Discriminative Feature Learning
HSIs have complex characteristics such as high inter-class similarity and large intra-class variance, which may affect model discrimination and the classification accuracy. Moreover, discrimination plays a crucial role in model robustness. In view of this, we excavate deep discriminative features from two aspects: discriminative feature representation learning and discriminative classifier representation learning. Firstly, supervised contrastive learning is implemented on the spatial and spectral features in order to obtain discriminative and robust domain-invariant knowledge, where G_f^i denotes each embedding feature in the batch; the features from the same class are positive samples, denoted by P(i), and |P(i)| denotes the size of the positive sample set; the features outside the class are negative samples, denoted by A(i); G_f^+ and G_f^- denote the positive and negative samples, respectively; and T is a scalar temperature parameter. Secondly, inspired by the maximum classifier discrepancy (MCD) [48], two independent classifiers, G_C1 and G_C2, are introduced to further enhance the classifier's discrimination. Under such circumstances, a sample needs to be retrained if the predictions of the two classifiers are inconsistent. The difference loss function L_cd of the two classifiers can be expressed as
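A compact sketch of the supervised contrastive term just described: same-class embeddings are pulled together and all others pushed apart, with temperature T. This follows the standard supervised contrastive formulation; the exact masking conventions of the paper's loss may differ slightly.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    # features: (B, D) embeddings; labels: (B,) class indices.
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t() / temperature                  # pairwise similarity
    b = labels.shape[0]
    self_mask = torch.eye(b, dtype=torch.bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # log-softmax over all other samples (the set A(i))
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)               # |P(i)|
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count
    return loss[pos_mask.any(1)].mean()                    # skip anchors without positives
```

Lower values mean the per-class embedding clusters are tighter and farther apart, which is exactly the geometry T-SNE is used to visualize later in the paper.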

Training Process
In this paper, the classification loss adopts the standard cross-entropy loss function, where y_i^c is the label of the i-th sample x_i^c and c denotes the index of the class. The total loss of DAD3GM consists of the classification loss L_c, the supervised contrastive loss L_sup, and the classifier difference loss L_cd:
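The loss combination can be sketched as below. The classifier-difference term is written here in the form used by MCD [48] (mean absolute difference of the two classifiers' softmax outputs); this is an assumption about the exact form of L_cd, and λ1 = λ2 = 1.0 matches the reported hyperparameters.

```python
import torch

def classifier_discrepancy(logits1, logits2):
    # L_cd sketch: mean absolute difference between the softmax outputs
    # of the two independent classifiers G_C1 and G_C2 (MCD-style).
    p1 = torch.softmax(logits1, dim=1)
    p2 = torch.softmax(logits2, dim=1)
    return (p1 - p2).abs().mean()

def total_loss(l_c, l_sup, l_cd, lam1=1.0, lam2=1.0):
    # L_total = L_c + lambda1 * L_sup + lambda2 * L_cd
    # (lambda1 = lambda2 = 1.0 in the reported experiments).
    return l_c + lam1 * l_sup + lam2 * l_cd
```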

Datasets and Evaluation Metrics
(1) Houston Dataset: The Houston benchmark [56] comes from the University of Houston and the neighboring areas. It contains Houston13 and Houston18. The Houston13 dataset includes 144 spectral bands, and the pixels are 349 × 1905. Its wavelength ranges from 380 to 1050 nm. The image spatial resolution is 2.5 m. In comparison, Houston18 has 48 spectral bands and a 1 m spatial resolution. To align them with Houston18, 48 spectral bands from Houston13 are selected, and the overlapping area is 209 × 955. Here, Houston13 and Houston18 are adopted as the source and target domains, respectively. The Houston benchmark has 7 categories: residential buildings, grass healthy, trees, non-residential buildings, water, road, and grass stressed. The total numbers of samples in Houston13 and Houston18 are 2530 and 53,200, respectively. The details of the Houston data are shown in Table 1. The ground-truth images of the Houston benchmark are illustrated in Figure 4.

(2) Pavia Dataset: The Pavia benchmark [10] was captured by a reflective optics system imaging spectrometer sensor over Pavia and includes the University of Pavia (UP) and Pavia Center (PC) scenes. The UP image includes 610 × 340 pixels, with a 1.3 m spatial resolution and 103 spectral bands. The PC image includes 1096 × 715 pixels and 102 bands. Note that, to ensure the same bands between the UP and PC data, the 103rd band of the UP data was deleted. The Pavia benchmark contains 7 classes: tree, bitumen, asphalt, meadow, brick, bare soil, and shadow. The UP data are employed as the source domain, and the PC data as the target domain. The total numbers of UP and PC samples are 39,332 and 39,355, respectively. The details of the Pavia data are shown in Table 2. The ground-truth images of the Pavia benchmark are shown in Figure 5.
(3) GID Dataset: The GID benchmark [57] was captured by GF-2, the second satellite of the high-definition Earth observation system launched by the China National Space Administration, and was built by Wuhan University. It includes multi-spectral images taken at different times in many regions of China. The GID-nc dataset was collected on 3 January 2015 in Nanchang City, Jiangxi Province. GID-nc includes 990 × 4400 pixels. It contains four bands: blue (0.45-0.52 µm), green (0.52-0.59 µm), red (0.63-0.69 µm), and near-infrared (0.77-0.89 µm). The spatial resolution of GID-nc is 4 m. GID-wh was taken on 11 April 2016 in Wuhan City, Hubei Province. GID-wh contains 1600 × 1900 pixels and has the same spectral and spatial resolution as GID-nc. Both datasets have five categories: rural residential, irrigated land, garden land, river, and lake. The total numbers of GID-nc and GID-wh samples are 23,339 and 30,812, respectively. We choose GID-nc as the source domain and GID-wh as the target domain. The details of the GID data are shown in Table 3. The ground-truth images of the GID benchmark are shown in Figure 6.

(4) Evaluation Metrics: With regard to the evaluation metrics, the standard and public metrics kappa coefficient (KC) [58], overall accuracy (OA) [59], class-specific accuracy (CA) [10], and T-SNE [60] are employed to assess the model performance. KC denotes the proportion of error reduction between the classification and a completely random classification; in practice, the value range of KC is (0,1). OA is used to represent the classification accuracy. T-SNE is applied to visualize the high-dimensional data and intuitively reflect the distribution of features before and after DA and DG, and it can be used to analyze the performance of DA and DG. The OA and CA metrics can be expressed as

OA = (TP + TN)/(TP + FP + TN), CA = TP/(TP + FP), (13)

where G_k represents the number of true labels of the k-th class, P_k denotes the
number of prediction labels of the k-th class, K denotes the total number of classes, and N represents the number of samples. TP represents the true positives, TN the true negatives, and FP the false positives.
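The two headline metrics can be computed directly from predictions, as sketched below. OA is written here in its usual form (fraction of correctly classified samples), and KC uses the standard chance-agreement formula over the confusion-matrix marginals.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    # OA: fraction of samples whose predicted class matches the true class.
    return float(np.mean(y_true == y_pred))

def kappa_coefficient(y_true, y_pred):
    # KC = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    # and p_e is the chance agreement from the class marginals.
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return float((p_o - p_e) / (1 - p_e))
```

A perfect prediction gives OA = 1 and KC = 1; a classifier that matches the class frequencies purely by chance gives KC near 0, which is why KC is reported alongside OA for imbalanced scenes.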

Implementation Details
The experiments are implemented in the PyTorch [61] framework and performed on an Nvidia 1080Ti GPU with 11 GB of memory. The network backbone is composed of two stacked Conv2d-ReLU-MaxPool2d components, similar to the work in [10]. The training parameter settings for the Houston, Pavia, and GID datasets are the same. The samples are divided into patches with a size of 13 × 13. The learning rate is 0.001, the batch size is 256, and the maximum number of epochs is 400; the embedding feature dimension d_se is set to 64, both λ1 and λ2 are 1.0, and the convolution dilation rates γ in Q_m and K_m are set to 2 and 4, respectively. The optimization strategy is adaptive moment estimation (Adam), with a momentum of 0.9.
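A minimal sketch of the reported training setup, assuming 3×3 convolutions with padding 1 inside the two Conv2d-ReLU-MaxPool2d blocks (kernel size and padding are not stated in the text and are assumptions here):

```python
import torch
import torch.nn as nn

def make_backbone(in_bands, d_se=64):
    # Two stacked Conv2d-ReLU-MaxPool2d blocks, as described for the backbone.
    return nn.Sequential(
        nn.Conv2d(in_bands, d_se, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(d_se, d_se, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    )

backbone = make_backbone(in_bands=48)          # e.g. the 48 aligned Houston bands
optimizer = torch.optim.Adam(backbone.parameters(), lr=0.001, betas=(0.9, 0.999))
patches = torch.randn(8, 48, 13, 13)           # a small batch of 13x13 patches
features = backbone(patches)                   # 13 -> 6 -> 3 spatially
```

With 13 × 13 input patches, the two pooling stages reduce the spatial size to 3 × 3, leaving a compact feature map for the attention and classifier heads.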

Comparison with the State-of-the-Art Methods
The results on the target domain Houston18 data are reported in Table 4. Compared with the existing methods, the proposed DAD3GM model achieves state-of-the-art results. It gains an OA of 84.25%, 13.8% higher than DANN and 10.02% higher than LDSDG, and shows an improvement of 4.29% over the sub-optimal SDEnet. In terms of KC, the proposed model is again optimal: it reaches 68.07%, a 14.21% gain over DANN, which is also higher than the 60.18% of HTCNN. The results on the target scene Pavia Center data are reported in Table 5. Our model also has an obvious advantage over the other methods. In terms of OA, our model achieves increases of 1.59% and 12.65% over SDEnet and LDSDG, respectively. On the KC metric, our model shows increases of 1.96% and 20.53% over SDEnet and DANN, respectively. These results prove that our model is feasible and stable. The results on the target domain GID-wh data are reported in Table 6. Compared with the existing methods, the proposed DAD3GM model again achieves state-of-the-art results. It gains an OA of 80.63%, 11.82% higher than DANN and 4.52% higher than LDSDG, with an improvement of 2.68% over the sub-optimal SDEnet. In terms of KC, the proposed model is again optimal: it reaches 75.73%, a 17.69% gain over DANN and a 33.0% gain over HTCNN.

Ablation Experiments
(1) The Influence of Different Components: In this section, we analyze the influence of the different modules on the target domain Pavia Center data. First of all, we start from the baseline without any strategies. The DAFL module is added to the baseline to form Model A; on the basis of Model A, DDFL is added to form the proposed DAD3GM. The results on the target domain PaviaC data are reported in Table 7, where a dash indicates that DAFL or DDFL is not introduced and a check mark indicates that it is. Regarding the OA metric, Model A shows an increase of 2.55% over the baseline, and DAD3GM improves by a further 0.63% compared to Model A. Regarding the KC metric, Model A shows an increase of 3.09% over the baseline, and DAD3GM improves by 0.82% compared to Model A. All of these modules contribute to improving the performance of our model, which fully demonstrates its effectiveness and validity.

(2) Parameter Analysis: In this part, we analyze the influence of some hyperparameters, including the embedding dimension d_se, the dilation rate γ, and λ1 and λ2. The influence of d_se is reported in Table 8. At first, with increasing d_se, the OA and KC constantly improve, whereas both decrease when d_se > 64. Furthermore, with d_se = 64 the model not only achieves the best OA and KC, but also has outstanding performance on certain categories, such as 3, 4, 5, and 6. Therefore, d_se is set to 64 in our work.
The influence of the dilation rate γ is reported in Figure 7. When γ = (2, 4), both OA and KC achieve the optimal performance, increasing by 3.18% and 3.91%, respectively. Then, with the increase in γ, the OA and KC start to decrease. Compared with γ = (2, 4), when γ = (4, 8) and γ = (8, 12), the OA is reduced by 1.41% and 2.72%, respectively, and the KC by 1.88% and 3.20%, respectively. We believe that features extracted at a large scale may be poor because the size of each patch is only 13 × 13. Thus, γ is set to (2, 4). The influence of λ1 and λ2 is reported in Figure 8. Firstly, we fix λ2 = 1 and analyze the influence of λ1. When both λ1 and λ2 are equal to 1, the OA is 83.53% and the model achieves optimal performance. When λ1 is set to 0.3, 10, and 20, the OA drops by between 2.58% and 3.32% compared with the best result of 83.53%; when λ1 is set to 0.1, 0.2, and 0.5, the OA drops by between 0.2% and 1.94%. Then, we fix λ1 = 1 and analyze the influence of λ2. In the beginning, the OA improves with increasing λ2. When λ1 = 1 and λ2 = 1, the OA achieves its highest value. When the value of λ2 exceeds 1, the OA keeps decreasing with increasing λ2; when λ2 = 20, the OA is the lowest, decreasing by 2.98% compared with the best result of 83.53%. All of these results indicate that the discriminative features have a certain influence on the model's generalization performance.

(3) Feature Visualization: To gain a deeper and more intuitive understanding of the proposed method, we use T-SNE to visualize the features on the Houston data, as shown in Figure 9. In addition, we compare our method with the existing SDEnet model, which contains a source domain, extended domain, and target domain. The features in Figure 9a-f are extracted by the SDEnet model, and the features in Figure 9g,h are extracted by our model, DAD3GM. From Figure 9a,c,e,g, we can see that the features of all
categories are intertwined. From Figure 9b,d,f,h, we can see that the features of each class form their own clusters, separated from the features of the other categories. However, in Figure 9b, some features of classes 1, 2, and 3 are mixed together, the features of classes 5 and 6 are mixed together, and the features of class 6 form two clusters. In Figure 9d, the features of classes 1, 2, 3, 5, and 6 each form two clusters, and some features of classes 1 and 3 are mixed together. In Figure 9f, some features of classes 1, 2, and 3 are mixed together; some features of classes 5, 6, and 7 are mixed together; and the features of class 6 form two clusters. Figure 9b,d,f reveal that the discriminability of the SDEnet method is relatively poor, which may lead to the misclassification of some samples. In Figure 9h, the clusters formed by the features of each class are more compact and maintain a certain distance from the feature clusters of the other classes. These results reveal that our model has better discriminability.

Discussion
Hyperspectral images have many bands, and the adjacent bands have strong correlations, which leads to data redundancies; processing these redundant data consumes a great deal of computing resources. Furthermore, hyperspectral images are taken at different times and places, and, affected by the weather or equipment, their data distributions are inconsistent, which leads to a sharp decline in model performance. In order to address these issues, we develop DAD3GM, which introduces a dual-attention mechanism to reduce the data redundancies and extract excellent global and local features, and we design a deep discriminative feature learning module based on contrastive learning to improve the discrimination of the proposed model. Experiments show that the combination of the dual-attention module and the deep discriminative feature learning module has advantages in hyperspectral image classification. In previous hyperspectral image classification works, some researchers extracted features manually, which is complex and time-consuming, while others extracted features with CNNs, which improve the classification accuracy but suffer from the challenge of distribution shifts. Different from previous works, we first select the important global and local features augmented by AdaIN through a dual-attention module comprising multi-scale self-attention and spectral multi-head external attention. From Table 7, we can see that the OA improves by 2.55% and the KC by 3.09% on the Pavia Center data. Furthermore, we believe that model discrimination also plays an important role in improving model generalization. Thus, we introduce a deep discriminative feature learning module by designing two classifiers. From Table 7, we can see that the OA improves by 0.63% and the KC by 0.82% after adding the discriminative module on the Pavia Center data. We also visualize the feature distributions and
compare our model with SDENet.From Figure 9, we can see that the features of each class form their own clusters and are separable from other classes, which reveals that our model has stronger discriminability.From Figure 10, we can see that the execution time of our model reduces by 76.78 s compared to PDEN, but the OA improves by 8.85%, which demonstrates that our model is more simple and effective.In conclusion, our work has investigated the issues of data redundancies and distribution shifts and the new modules has effectively improved the model's generalization.The proposed framework is plug-and-play and can be extended to other tasks, such as object detection in hyperspectral images.
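The AdaIN-based feature augmentation mentioned above can be illustrated with a short sketch. This is a minimal NumPy version of the standard AdaIN formulation (re-standardizing feature maps with another sample's channel statistics); the function name, tensor shapes, and the shuffle-within-batch style source are illustrative assumptions, not the exact DAD3GM implementation.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: give `content` feature maps the
    per-channel spatial mean/std of `style`. Arrays are shaped (B, C, H, W)."""
    c_mu = content.mean(axis=(2, 3), keepdims=True)
    c_std = content.std(axis=(2, 3), keepdims=True) + eps
    s_mu = style.mean(axis=(2, 3), keepdims=True)
    s_std = style.std(axis=(2, 3), keepdims=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu

# Augment a batch by borrowing channel statistics from a shuffled copy of
# itself, simulating feature "styles" from unseen domains.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 7, 7))        # (batch, channels, H, W)
x_aug = adain(x, x[rng.permutation(4)])
```

Because the content term is zero-mean per channel after standardization, each augmented feature map exactly inherits the style sample's channel means while keeping the content's spatial structure.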

Conclusions
This paper proposes a simple but effective dual-attention deep discriminative domain generalization framework for hyperspectral image classification. The model adopts a dual-attention mechanism and a deep discriminative feature representation learning module to learn discriminative domain-invariant knowledge. The dual-attention module is designed with multi-scale self-attention and multi-head external attention to extract important spatial and spectral features, respectively; it is simple but effective and serves to reduce the data redundancies and the model complexity. A deep discriminative feature learning module is then introduced to improve the generalization ability of the model through a contrastive regularization component and two independent classifiers. Compared with some existing state-of-the-art methods, our model achieves excellent performance. For example, the overall accuracy and kappa coefficient of our method on the Pavia dataset are 85.53% and 80.29%, i.e., 1.59% and 1.96% higher than those of SDEnet, respectively, while taking a similar amount of time. Although the proposed method achieves state-of-the-art performance compared with the existing methods, it still has limitations, such as catastrophic forgetting and the lack of causal feature representation learning. In the future, we will perform knowledge distillation on the model parameters to mitigate catastrophic forgetting and introduce causal inference to build causal relationships between the inputs and the labels, so as to further improve the model's generalization on cross-domain hyperspectral images.
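As a concrete illustration of the contrastive regularization component, the sketch below implements a generic supervised contrastive loss over a batch of embeddings: each anchor's similarity to same-class embeddings is raised relative to all others. The function name, temperature, and positive-pair averaging are illustrative assumptions; the paper's regularizer may differ in detail.

```python
import numpy as np

def supervised_contrastive_loss(feats, labels, tau=0.1):
    """Generic supervised contrastive regularizer.
    feats: (N, D) feature array; labels: (N,) integer class labels."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize
    sim = f @ f.T / tau                        # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)             # exclude self-pairs
    # log-softmax over each anchor's row of similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]   # same-class mask
    np.fill_diagonal(pos, False)
    # average log-probability over each anchor's positive pairs
    per_anchor = np.where(pos, log_prob, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return -per_anchor.mean()

# Example: 12 embeddings from 3 classes.
rng = np.random.default_rng(1)
feats = rng.normal(size=(12, 16))
labels = np.repeat(np.arange(3), 4)
loss = supervised_contrastive_loss(feats, labels)
```

Minimizing this loss pulls same-class features into tight clusters, which is the clustering behavior visible in the t-SNE plots of Figure 9.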
The source domain S = {(x_i, y_i)} (i = 1, ..., N_s) is labeled, where d represents the dimension of the samples, N_s denotes the number of source samples, y_i ∈ {1, 2, ..., C}, and C denotes the total number of categories. The target domain T = {x_j} (j = 1, ..., N_t) is unlabeled. The purpose of domain generalization-based HSI classification is to learn a domain-agnostic model G on the source domain samples that performs well on unlabeled target domain samples. The model G(·; θ) contains a feature extractor G_f and a classifier G_C. In the following, we elaborate the proposed cross-scene HSI classification domain generalization method. The whole architecture is shown in Figure 1.
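The composition G = G_C ∘ G_f can be made concrete with a toy sketch. The linear maps below are stand-ins for the real DAFL/DDFL feature extractor and classifier (assumed shapes and ReLU activation are illustrative only); the point is that a model trained on source samples alone is applied unchanged to target samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 16, 5                        # sample dimension d, number of classes C

# Toy stand-ins for the feature extractor G_f and classifier G_C; the real
# model uses the DAFL/DDFL blocks instead of random linear maps.
W_f = rng.normal(size=(d, 8))
W_c = rng.normal(size=(8, C))

def G(x):
    feats = np.maximum(x @ W_f, 0.0)      # G_f: samples -> features
    return (feats @ W_c).argmax(axis=1)   # G_C: features -> class in {0..C-1}

# G never sees target labels; it is simply evaluated on target-domain pixels.
x_target = rng.normal(size=(10, d))
preds = G(x_target)
```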

Figure 1. The framework of the DAD3GM model. The model mainly contains two parts: DAFL and DDFL. DAFL contains MSSA and MHEA to extract important and robust domain-independent spatial and spectral features, respectively. DDFL contains a contrastive loss and a class discrimination loss and is mainly used to excavate deep discriminative features.

Figure 2. The structure of multi-scale self-attention (MSSA). Q and K are set to different dilation rates to extract multi-scale attention feature maps.

Figure 3.

Figure 4. The benchmark of Houston. (a,b) denote the image and ground truth of Houston13, respectively. (c,d) denote the image and ground truth of Houston18, respectively.

Figure 5. The benchmark of Pavia. (a,b) denote the image and ground truth of the University of Pavia (UP), respectively. (c,d) denote the image and ground truth of the Pavia Center (PC), respectively.

Figure 6.

Figure 7. The influence of different dilation rates γ. (a,b) denote the results for different γ in terms of OA and KC, respectively.

Figure 8. The compared results for different values of λ1 and λ2.

Figure 9. T-SNE visualization of the Houston data. (a,b) denote the ED features of SDEnet; (c,d) denote the SD features of SDEnet; (e,f) denote the TD features of SDEnet; (g,h) denote the features of the proposed DAD3GM model.

Figure 10. The computational time of one epoch of training for the proposed model compared with other methods on the Houston and Pavia data.

Table 1. Classes and numbers of Houston data.

Table 2. Classes and numbers of Pavia data.

Table 3. Classes and numbers of GID data.

Table 4. Performance comparisons for the target domain Houston18 data.

Table 5. Performance comparisons for the target domain Pavia Center data.

Table 6. Performance comparisons on the target domain GID-wh data.

Table 7. Performance comparisons for the target domain PC data.

Table 8. Performance comparisons on the target domain Pavia data.