Article

Cross-Layer Feature Fusion and Attention-Based Class Feature Alignment Network for Unsupervised Cross-Domain Remote Sensing Scene Classification

1 School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou 221116, China
2 School of Geographical Sciences, University of Bristol, Bristol BS8 1SS, UK
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2026, 18(6), 859; https://doi.org/10.3390/rs18060859
Submission received: 10 February 2026 / Revised: 6 March 2026 / Accepted: 9 March 2026 / Published: 11 March 2026

Highlights

What are the main findings?
  • Global distribution alignment alone is insufficient for cross-domain remote sensing scene classification; class-level feature misalignment is a critical yet overlooked factor limiting unsupervised domain adaptation performance.
  • The proposed cross-layer feature fusion and attention-based architecture significantly enhances scene representation learning and enables effective class-aware feature alignment across domains.
What are the implications of the main findings?
  • Cross-domain adaptation in remote sensing should move beyond global feature distribution alignment and explicitly model class-level structures, as neglecting class-aware alignment can fundamentally limit generalization performance.
  • Effective cross-domain scene classification requires joint optimization of multi-layer semantic representation and class-aware alignment, suggesting that future unsupervised domain adaptation architectures should integrate cross-layer feature fusion and adaptive attention mechanisms rather than relying on shallow feature matching.

Abstract

Remote sensing scene classification is one of the crucial techniques for high-resolution remote sensing image interpretation and has received widespread attention in recent years. However, acquiring high-quality labeled data is both costly and time-consuming, making unsupervised domain adaptation (UDA) an important research focus in scene classification. Existing UDA methods focus primarily on aligning the overall feature distributions across domains but neglect class feature alignment, resulting in the loss of critical class information. To address this issue, a cross-layer feature fusion and attention-based class feature alignment network (CFACA-NET) is proposed for unsupervised cross-domain remote sensing scene classification. Specifically, a multi-layer feature extraction module (MFEM) consisting of a cross-layer feature fusion module (CFFM), a multi-scale dynamic attention module (MSDAM), and a fused feature optimization module (FFOM) is designed to enhance the representation ability of scene features. A high-confidence sample selection module is further introduced, which utilizes evidence theory and information entropy to obtain reliable pseudo-labels. Finally, a class feature alignment module is proposed, incorporating a two-stage training strategy to achieve effective class feature alignment. Experimental results on three remote sensing scene classification datasets demonstrate that CFACA-NET outperforms existing state-of-the-art methods in cross-domain classification performance, effectively enhancing cross-domain adaptation capability.

1. Introduction

With the rapid development of remote sensing (RS) satellite and drone technologies, remote sensing scene classification has become increasingly important [1,2,3,4]. As an intelligent interpretation technique, its application fields have expanded from traditional environmental monitoring to areas such as smart city construction and precision agriculture management [5,6,7,8]. In recent years, with the emergence and rapid growth of artificial intelligence and deep learning technologies, deep semantic features of remote sensing images can be extracted automatically, significantly enhancing the intelligence level of remote sensing scene classification [9,10,11,12]. However, the data distributions across different domains vary significantly due to differences in spatial resolution, shooting angle, geographical location, and weather conditions, which restricts the domain generalization capability of a model. Therefore, developing domain adaptation methods that meet the requirements of remote sensing scene classification is of great significance.
As a crucial branch of transfer learning (TL), domain adaptation aims to solve decision-making challenges for similar tasks with different distributions between the source and target domains. By transferring knowledge from the source domain to the target domain, it can enhance the performance of tasks in the target domain [13,14,15,16]. However, traditional models typically assume that the source and target domains share the same distribution. In reality, there are distribution differences between them, making such an assumption untenable and resulting in poor performance of models trained on the source domain when applied to the target domain. To address this problem, the learning paradigm of domain adaptation has emerged and evolved [17,18].
In domain adaptation, the source and target domains share the same feature space, but their data distributions are typically different. Based on whether labeled data are available in the target domain, domain adaptation can be divided into three types: fully supervised, semi-supervised, and unsupervised domain adaptation. In fully supervised domain adaptation, the target domain data come with complete label information, and the model fully utilizes target domain labels for training [19]. However, acquiring high-quality labeled data is labor-intensive, time-consuming, and costly. If the model relies entirely on labels from the target domain, its adaptability to new target domain distributions becomes limited, which constrains its performance in practical applications. In semi-supervised domain adaptation, the target domain has sufficient unlabeled data and some labeled data [20,21], but this approach relies on the quality of pseudo-labels, which might contain noise generated from the unlabeled data and interfere with the performance of the model. In unsupervised domain adaptation, the target domain data do not require label information. The model does not rely on labels of the target domain during training but focuses on exploring the shared features between the source and target domains [22,23]. This feature learning approach not only enhances the model's generalization capability, making it perform better when facing new target domains with significant distribution shifts, but also better fits practical scenarios where target domain labels are difficult to obtain. Therefore, unsupervised domain adaptation has become an indispensable strategy for solving many practical problems [24,25].
In cross-domain scene classification tasks, unsupervised domain adaptation has become a key method for addressing cross-domain distribution discrepancies without relying on target domain labels. Currently, most unsupervised domain adaptation (UDA) methods learn domain-invariant features through distribution discrepancy minimization or adversarial training to reduce the difference between the source and target domains. For example, Sun et al. [26] proposed a correlation alignment method, which aligns the second-order statistics between the source and target domains through linear transformation, achieving domain discrepancy minimization. Shen et al. [27] used a domain discriminator to estimate the empirical Wasserstein distance between the source and target domains and optimized the network through adversarial training to minimize this distance, thereby achieving effective learning of domain-invariant features. Although existing methods have achieved some success in cross-domain tasks, most approaches rely on distribution discrepancy or adversarial training to align the overall feature distributions, neglecting the alignment of the same class features between different domains. As shown in Figure 1, the above methods can align the overall feature distributions of the source and target domains but do not ensure the alignment of class feature distributions, thereby limiting the generalization ability of the source domain model on the target domain.
To address the above problems, we propose a cross-layer feature fusion and attention-based class feature alignment network (CFACA-NET). Firstly, a multi-layer feature extraction module (MFEM) is designed, which consists of a cross-layer feature fusion module (CFFM), a multi-scale dynamic attention module (MSDAM), and a fused feature optimization module (FFOM). This module is intended to enhance the representation capability of scene features, thereby obtaining more cross-domain feature information. Secondly, a high-confidence sample selection module is introduced; this module ensures the reliability of pseudo-labels for high-confidence samples in the target domain by utilizing evidence theory and information entropy for sample selection. Finally, a class feature alignment module based on a two-stage training strategy is proposed to achieve effective alignment of the same class features between the source and target domains, thereby improving cross-domain performance.
The main contributions of this article are as follows:
  • MFEM is designed to consist of a CFFM, an MSDAM, and an FFOM. Among them, CFFM is used to explore the contextual correlation information among different shallow features, MSDAM aims to enhance the key information in each layer of features, and FFOM optimizes the final aggregated features to eliminate the redundant information caused by various semantic differences.
  • A high-confidence sample selection module is introduced, which selects samples by integrating evidence theory and information entropy to ensure the reliability of pseudo-labels for high-confidence samples in the target domain.
  • A class feature alignment module based on a two-stage training strategy is proposed, which achieves effective alignment of the same class features between the source and target domains through the corresponding memory bank mechanism in each stage, thereby improving cross-domain classification performance.
  • Extensive cross-domain classification performance comparison experiments conducted on three datasets have demonstrated the effectiveness of CFACA-NET.

2. Related Works

2.1. Unsupervised Domain Adaptation

UDA is a key technology for addressing the domain shift problem. Its goal is to build a robust learning system that enables models trained on labeled data to efficiently transfer to new scenarios with unlabeled data. With the continuous development of deep learning, UDA has become increasingly important in visual tasks such as image classification [28,29] and semantic segmentation [30,31]. Currently, mainstream UDA methods achieve distribution alignment by learning domain-invariant features, which can be broadly categorized into three types: discrepancy metric-based methods [32], adversarial learning-based methods [33], and reconstruction-based methods [34]. Discrepancy metric-based methods typically use statistical measures such as maximum mean discrepancy (MMD) and Kullback–Leibler (KL) divergence to achieve distribution alignment between the source and the target domains [35,36]. Othman et al. [37] use MMD to guide the alignment of mini-batch samples from the source and the target domains and dynamically adjust the sample size to enhance the confidence of target label predictions, thereby reducing distribution discrepancies. Xu et al. [38] incorporate feature norms into MMD and adaptively adjust the norms to further decrease the distribution discrepancies between the source and the target domains. Adversarial learning-based methods introduce a domain discriminator to differentiate between the source and the target domain features in the feature space, using adversarial learning to make the generated features indistinguishable to the discriminator, thereby achieving distribution alignment implicitly [39]. Ganin et al. [40] were the first to introduce adversarial learning into domain adaptation methods, proposing domain-adversarial neural networks (DANN). Liu et al. [41] proposed an adversarial domain adaptation framework that incorporates KL divergence. By introducing KL divergence, the framework enhances the discriminator’s ability to distinguish and simultaneously strengthens the generator’s performance, enabling more effective extraction of domain-invariant features. Yang et al. [42] designed a dual-module network architecture where the domain feature discriminator module encourages the domain-invariant feature module to learn more domain-invariant features through adversarial methods. Reconstruction-based methods align the distribution differences between the source and the target domains by minimizing reconstruction error and utilizing autoencoders to extract features. Deep reconstruction-classification networks (DRCN) [43] use a multi-task learning framework, enabling the source domain classification task and the target domain reconstruction task to share the same encoding representation, thereby promoting cross-domain feature alignment. Wei et al. [44] proposed a deep transfer feature encoding framework, which incorporates MMD into the autoencoder to reduce distribution discrepancies. Although the above three types of methods have made significant progress in aligning overall feature distributions, they generally overlook the importance of aligning class feature distributions. Therefore, determining how to align class feature distributions based on overall feature alignment is crucial for further improving the performance of UDA.

2.2. Attention Mechanism in Domain Adaptation

In recent years, attention mechanisms have become an essential technology in the field of deep learning, widely applied to tasks such as computer vision [45,46]. In the vision domain, they are used to improve feature extraction and semantic understanding. The squeeze-and-excitation network (SENet) [47] employs a feature recalibration strategy, modeling at the channel level to enhance the network’s focus on critical features. The convolutional block attention module (CBAM) [48] introduces two types of attention mechanisms to improve the model’s perceptual capabilities. With continuous advancements in research, attention mechanisms have demonstrated significant potential in domain adaptation tasks, effectively enhancing knowledge transfer by extracting domain-invariant features. Chen et al. [49] integrate a global attention module into adversarial networks, enabling the discriminator to distinguish transferable features between the source and the target domains. Wen et al. [50] exploit multigranularity feature representations to capture both global semantic information and fine-grained local details, thereby improving the discriminative capability of remote sensing scene classification. To further enhance cross-domain classification performance, the proposed MSDAM constructs attention mechanisms along both the channel and spatial dimensions, thereby effectively improving the perception and discrimination of key regions within the input features.

3. Materials and Methods

In UDA, the labeled source domain data and the unlabeled target domain data are represented as $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and $D_t = \{x_i^t\}_{i=1}^{N_t}$, where $N_s$ and $N_t$ denote the numbers of samples in the source and target domains. The source and target domains share the same label space but exhibit domain shift in their feature distributions. Our goal is therefore to leverage CFACA-NET to learn domain-invariant features between the source and target domains on the basis of aligning the overall feature distributions, and to further align the feature distributions of the same classes across the two domains, thereby enhancing the model's generalization capability on the target domain. The overall framework of our proposed CFACA-NET is shown in Figure 2.

3.1. Multi-Layer Feature Extraction Module

As shown in Figure 3, we select ResNet50 [51] as the backbone network and design an MFEM based on it. The MFEM consists of a CFFM, an MSDAM, and an FFOM. For convenience, the four layers of the ResNet50 backbone are denoted as L1, L2, L3 and L4, and the output features from these four layers are denoted as $F_{L1}$, $F_{L2}$, $F_{L3}$ and $F_{L4}$, where $F_{L1}$, $F_{L2}$ and $F_{L3}$ correspond to shallow local features and $F_{L4}$ represents deep semantic features.
CFFM: Considering that remote sensing images generally exhibit significant intra-class variability and inter-class similarity, relying solely on deep semantic features makes it difficult to fully capture multi-scale scene information, thereby affecting the model's adaptation performance across different domains. Meanwhile, shallow features contain rich local texture and background information, which play an important role in scene understanding and feature discrimination. To this end, we propose a cross-layer feature fusion module (CFFM). This module performs multi-scale cross-layer fusion of shallow features, preserving the integrity of deep semantic information while fully exploiting the contextual correlation information within shallow features. This enables the model to comprehensively perceive complex scenes and effectively learn domain-invariant feature representations, thereby mitigating feature distribution differences between the source and target domains and enhancing the model's adaptability in cross-domain scenarios. Specifically, for the first-layer feature $F_{L1}$, depthwise convolution kernels with dilation rates of 3, 5, and 7 are employed to expand the receptive field at multiple scales, strengthening the local feature representation capability and generating features $F_{13}$, $F_{15}$ and $F_{17}$. For the second-layer feature $F_{L2}$, depthwise convolutions with dilation rates of 3 and 5 are applied to conduct more refined local feature extraction, yielding features $F_{23}$ and $F_{25}$. To facilitate contextual information interaction between these two layers, $F_{23}$ and $F_{25}$ are fused, and the upsampled feature $F_{up1}$ is generated through an upsampling operation. Next, an element-wise multiplication is performed between $F_{up1}$ and $F_{13}$, and the result is concatenated with $F_{17}$ along the channel dimension, thereby constructing feature $M_1$ with shallow feature correlations.
$$M_1 = \mathrm{Cat}\big(F_{13} \odot F_{up1},\ F_{17}\big)$$
where $\mathrm{Cat}(\cdot)$ represents the concatenation operation, and $\odot$ denotes element-wise multiplication.
Secondly, to explore the feature correlations between the second-layer feature $F_{L2}$ and its adjacent upper and lower layers, the features $F_{13}$, $F_{15}$ and $F_{17}$ from the first layer are first merged and processed to generate the downsampled feature $F_{dp1}$. Subsequently, for the third-layer feature $F_{L3}$, a depthwise convolution kernel with a dilation rate of 3 is employed to perform localized feature extraction. Next, the extracted localized features are transformed into the upsampled feature $F_{up2}$. Finally, element-wise multiplication is performed on features $F_{L2}$, $F_{dp1}$, and $F_{up2}$, thereby obtaining the feature $M_2$ that integrates multi-level information.
$$M_2 = F_{L2} \odot F_{dp1} \odot F_{up2}$$
For the third-layer feature $F_{L3}$, $F_{L2}$ is first multiplied element-wise with the downsampled feature $F_{dp1}$, and the resulting feature is then converted into the downsampled feature $F_{dp2}$. Finally, $F_{L3}$ is multiplied element-wise by $F_{dp2}$ to generate the multi-scale correlated feature $M_3$.
$$M_3 = F_{L3} \odot F_{dp2}$$
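To make the fusion procedure concrete, the following PyTorch sketch traces the computation of $M_1$, $M_2$ and $M_3$ from the three shallow ResNet50 features. The $1\times1$ projections used to match channel widths before element-wise multiplication, the bilinear resampling, and the additive merging of branch features are our assumptions; the paper does not specify these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dw_conv(channels: int, dilation: int) -> nn.Module:
    """Depthwise 3x3 convolution with the given dilation rate (padding preserves size)."""
    return nn.Conv2d(channels, channels, kernel_size=3, padding=dilation,
                     dilation=dilation, groups=channels)


class CFFM(nn.Module):
    """Cross-layer feature fusion over the three shallow ResNet50 features (a sketch)."""

    def __init__(self, c1: int = 256, c2: int = 512, c3: int = 1024):
        super().__init__()
        # multi-scale depthwise branches (dilation 3/5/7 on F_L1, 3/5 on F_L2, 3 on F_L3)
        self.b13, self.b15, self.b17 = dw_conv(c1, 3), dw_conv(c1, 5), dw_conv(c1, 7)
        self.b23, self.b25 = dw_conv(c2, 3), dw_conv(c2, 5)
        self.b33 = dw_conv(c3, 3)
        # 1x1 projections so features from different layers can be multiplied (assumed)
        self.proj_up1 = nn.Conv2d(c2, c1, 1)   # F_L2 level -> F_L1 width
        self.proj_dp1 = nn.Conv2d(c1, c2, 1)   # merged F_L1 level -> F_L2 width
        self.proj_up2 = nn.Conv2d(c3, c2, 1)   # F_L3 level -> F_L2 width
        self.proj_dp2 = nn.Conv2d(c2, c3, 1)   # F_L2 level -> F_L3 width

    def forward(self, f1, f2, f3):
        s1, s2, s3 = f1.shape[-2:], f2.shape[-2:], f3.shape[-2:]
        f13, f15, f17 = self.b13(f1), self.b15(f1), self.b17(f1)
        f23, f25 = self.b23(f2), self.b25(f2)
        # F_up1: fuse F_23/F_25 and upsample to the F_L1 resolution
        f_up1 = F.interpolate(self.proj_up1(f23 + f25), size=s1,
                              mode='bilinear', align_corners=False)
        m1 = torch.cat([f13 * f_up1, f17], dim=1)          # M_1 = Cat(F_13 (*) F_up1, F_17)
        # F_dp1: merge the first-layer branches and downsample to the F_L2 resolution
        f_dp1 = F.interpolate(self.proj_dp1(f13 + f15 + f17), size=s2,
                              mode='bilinear', align_corners=False)
        f_up2 = F.interpolate(self.proj_up2(self.b33(f3)), size=s2,
                              mode='bilinear', align_corners=False)
        m2 = f2 * f_dp1 * f_up2                             # M_2 = F_L2 (*) F_dp1 (*) F_up2
        # F_dp2: downsample (F_L2 (*) F_dp1) to the F_L3 resolution
        f_dp2 = F.interpolate(self.proj_dp2(f2 * f_dp1), size=s3,
                              mode='bilinear', align_corners=False)
        m3 = f3 * f_dp2                                     # M_3 = F_L3 (*) F_dp2
        return m1, m2, m3
```

Here `f1`, `f2`, and `f3` would be the outputs of ResNet50 layers L1, L2 and L3; the default channel widths follow the standard ResNet50 configuration.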
MSDAM: To further enhance the cross-domain adaptation performance of CFACA-NET, we propose a multi-scale dynamic attention module (MSDAM). This module operates on both shallow and deep features and computes attention weights along both the channel and spatial dimensions. The resulting attention features enable the model to adaptively focus on the key regions within the scene category features, thereby effectively improving the model's perception of discriminative information and enhancing its adaptation capability in cross-domain scenarios. As shown in Figure 4a, in the channel dimension, the input feature $F$ is first processed through two parallel branches of max pooling and average pooling to obtain two pooled features $F_{\max}^{c}$ and $F_{avg}^{c}$ with distinct characteristics. Subsequently, these two pooled features are separately passed through a $1\times1$ convolution operator and softmax operations to construct inter-channel global dependency representations $V_{\max}^{c}$ and $V_{avg}^{c}$, which then interact with their corresponding pooled features. The output features from the two branches are fused, and a sigmoid function is further applied to generate the channel attention weight $W_c$. Finally, the weight is multiplied element-wise by the original input feature. This process enables the model to adaptively adjust the contribution of each channel in the input features and dynamically emphasize the more critical channel information, thereby enhancing the discriminative capability of the features.
$$W_c = \varphi\big(\mathrm{Conv}_{1\times1}\big(\mathrm{Cat}(F_{\max}^{c} \times V_{\max}^{c},\ F_{avg}^{c} \times V_{avg}^{c})\big)\big)$$
where $\varphi(\cdot)$ denotes the sigmoid function, $\mathrm{Conv}_{1\times1}(\cdot)$ denotes the $1\times1$ convolution, and $\mathrm{Cat}(\cdot)$ is the concatenation operation.
As shown in Figure 4b, in the spatial dimension, the input features are likewise processed through max pooling and average pooling to obtain two pooled features $F_{\max}^{s}$ and $F_{avg}^{s}$. The difference is that we introduce a multi-scale depthwise convolution operator in the branches to capture richer spatial features. Subsequently, the two branch features are passed through softmax to construct spatial global dependency representations $V_{\max}^{s}$ and $V_{avg}^{s}$, which then interact with the initial pooled features. The output features from the two branches are fused, and a sigmoid function is applied to generate the spatial attention weight $W_s$. Finally, the weight is multiplied by the original input features. This process enables the model to dynamically focus on important regions within the spatial features based on the input, thereby enhancing the model's ability to discriminate key positions within the spatial features.
$$W_s = \varphi\big(\mathrm{Conv}_{1\times1}\big(\mathrm{Cat}(F_{\max}^{s} \times V_{\max}^{s},\ F_{avg}^{s} \times V_{avg}^{s})\big)\big)$$
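A hedged PyTorch sketch of the two attention paths follows. The exact pooling layout, the axes over which softmax is applied, the kernel sizes of the multi-scale branch in the spatial path, and the serial channel-then-spatial composition are assumptions made for illustration rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn


class MSDAM(nn.Module):
    """Channel attention followed by spatial attention, both softmax-weighted (a sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.ch_proj = nn.Conv2d(channels, channels, 1)         # builds V_max^c / V_avg^c
        self.ch_fuse = nn.Conv2d(2 * channels, channels, 1)
        # multi-scale convolutions on the pooled spatial maps (kernel sizes assumed)
        self.sp_convs = nn.ModuleList(
            [nn.Conv2d(1, 1, k, padding=k // 2) for k in (3, 5, 7)])
        self.sp_fuse = nn.Conv2d(2, 1, 1)

    def channel_attention(self, x):
        f_max = torch.amax(x, dim=(2, 3), keepdim=True)          # F_max^c
        f_avg = torch.mean(x, dim=(2, 3), keepdim=True)          # F_avg^c
        v_max = torch.softmax(self.ch_proj(f_max), dim=1)        # V_max^c
        v_avg = torch.softmax(self.ch_proj(f_avg), dim=1)        # V_avg^c
        w_c = torch.sigmoid(self.ch_fuse(
            torch.cat([f_max * v_max, f_avg * v_avg], dim=1)))   # channel weight W_c
        return x * w_c

    def spatial_attention(self, x):
        def branch(pooled):
            # multi-scale convolutions, then a spatial softmax to build V^s
            g = sum(conv(pooled) for conv in self.sp_convs)
            v = torch.softmax(g.flatten(2), dim=-1).view_as(g)
            return pooled * v

        f_max = torch.amax(x, dim=1, keepdim=True)               # F_max^s
        f_avg = torch.mean(x, dim=1, keepdim=True)               # F_avg^s
        w_s = torch.sigmoid(self.sp_fuse(
            torch.cat([branch(f_max), branch(f_avg)], dim=1)))   # spatial weight W_s
        return x * w_s

    def forward(self, x):
        return self.spatial_attention(self.channel_attention(x))
```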
FFOM: After being processed by the MSDAM, the four features can adaptively identify the key information within each layer. Although these features possess strong representational capabilities, directly fusing them may lead to information redundancy due to differences in scale, semantics, and detail representation across the layers. Therefore, a fused feature optimization module (FFOM) is designed, as shown in Figure 5. This module aims to optimize the fused features and further enhance their effectiveness. The optimization process first fuses $P_1$, $P_2$, $P_3$ and $P_4$ to obtain the feature $R$. Then, $R$ is processed through a $1\times1$ convolution, batch normalization, and the SiLU activation function to obtain the output feature $R_1$. Next, the convolution kernel is replaced with a $3\times3$ kernel while keeping the other operations unchanged to obtain the output feature $R_2$. Finally, $R_2$ passes through another $1\times1$ convolution and batch normalization operation to obtain the final optimized feature $R_3$.
$$R_3 = \mathrm{BN}\big(\mathrm{Conv}_{1\times1}(R_2)\big)$$
where $\mathrm{Conv}_{1\times1}(\cdot)$ represents a $1\times1$ convolution, and $\mathrm{BN}(\cdot)$ denotes batch normalization.
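The optimization chain from $R$ to $R_3$ maps directly onto a small convolutional block. The sketch below assumes the four attention-refined features $P_1$ to $P_4$ have already been projected and resized to a common shape and are fused by summation, which the paper does not state explicitly.

```python
import torch.nn as nn


class FFOM(nn.Module):
    """Optimizes the fused feature R through a 1x1 -> 3x3 -> 1x1 convolutional chain."""

    def __init__(self, channels: int):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                    nn.BatchNorm2d(channels), nn.SiLU())     # R  -> R_1
        self.block2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.BatchNorm2d(channels), nn.SiLU())     # R_1 -> R_2
        self.block3 = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                    nn.BatchNorm2d(channels))                # R_2 -> R_3

    def forward(self, p1, p2, p3, p4):
        r = p1 + p2 + p3 + p4      # fused feature R (fusion by summation is assumed)
        return self.block3(self.block2(self.block1(r)))
```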

3.2. High-Confidence Sample Selection Module

To further enhance the model's cross-domain adaptation capability, selecting high-confidence target domain samples is crucial. Therefore, a discriminative mechanism based on evidence theory and information entropy is integrated to select high-confidence target domain samples. The output of a target domain sample through the classifier is denoted as $O_t$, the ReLU function is denoted as $\theta(\cdot)$, and the evidence vector is denoted as $e_i = \theta(O_t)$. First, the uncertainty of the target domain data is computed based on evidence theory:
$$\mathrm{Unc}_i = \frac{K}{\sum_{j=1}^{K}\left(e_{ij} + 1\right)}$$
where $\mathrm{Unc}_i$ represents the uncertainty of the $i$-th target domain sample, $K$ denotes the number of classes, $j$ indexes the classes, and $e_{ij}$ denotes the $j$-th component of the evidence vector $e_i$.
To select high-confidence samples from the perspective of evidence theory, a threshold $\omega_{Unc}$ is introduced as the selection criterion, thereby obtaining the high-confidence sample set $H_{Unc}$ based on evidence theory. The formula is expressed as follows:
$$H_{Unc} = \{\, i \mid \mathrm{Unc}_i < \omega_{Unc} \,\}, \quad i \in [1, n]$$
Subsequently, the uncertainty of the target domain samples is calculated from the perspective of information entropy. The formula is expressed as follows:
$$\mathrm{Ent}_i = -\sum_{j=1}^{K} \delta(O_t)_j \log \delta(O_t)_j$$
where $K$ denotes the number of classes, $O_t$ represents the output of the classifier, $\delta(\cdot)$ denotes the softmax operation, and $\delta(O_t)_j$ is its $j$-th component.
Next, high-confidence samples are selected by introducing an information entropy threshold $\omega_{Ent}$, thereby constructing the high-confidence sample set $H_{Ent}$ based on information entropy. The formula is expressed as follows:
$$H_{Ent} = \{\, i \mid \mathrm{Ent}_i < \omega_{Ent} \,\}, \quad i \in [1, n]$$
Finally, the intersection of the sets $H_{Unc}$ and $H_{Ent}$ is taken to obtain the high-confidence target domain sample set that simultaneously satisfies both the evidence theory and information entropy selection criteria.
$$H = H_{Unc} \cap H_{Ent}$$
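The two criteria combine into a simple filter over the classifier outputs. The sketch below uses the thresholds reported in Section 4.2 ($\omega_{Unc} = \omega_{Ent} = 0.5$) and returns both the selection mask and the corresponding pseudo-labels; the argmax pseudo-labeling rule is our assumption.

```python
import torch
import torch.nn.functional as F


def select_high_confidence(logits: torch.Tensor,
                           omega_unc: float = 0.5,
                           omega_ent: float = 0.5):
    """logits: (n, K) classifier outputs O_t for n target samples.
    Returns a boolean selection mask (the set H) and argmax pseudo-labels."""
    n, num_classes = logits.shape
    evidence = F.relu(logits)                              # e_i = ReLU(O_t)
    # evidential uncertainty: Unc_i = K / sum_j (e_ij + 1)
    unc = num_classes / (evidence + 1.0).sum(dim=1)
    prob = F.softmax(logits, dim=1)
    # information entropy: Ent_i = -sum_j softmax(O_t)_j * log softmax(O_t)_j
    ent = -(prob * torch.log(prob + 1e-8)).sum(dim=1)
    mask = (unc < omega_unc) & (ent < omega_ent)           # H = H_Unc intersected with H_Ent
    pseudo_labels = prob.argmax(dim=1)
    return mask, pseudo_labels
```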

3.3. Class Feature Alignment Module

Existing cross-domain scene classification methods mainly rely on distribution alignment strategies based on MMD or adversarial learning. Specifically, MMD-based methods achieve overall alignment by minimizing the statistical differences between the feature distributions of the source and target domains, whereas adversarial learning methods guide distribution matching through adversarial training between a discriminator and a generator. Although these methods have achieved certain effectiveness in cross-domain feature alignment, they typically focus on the overall feature distribution alignment while neglecting intra-class compactness and inter-class separability. To this end, we propose a class feature alignment module, which aims to align the feature representations of the same class across the source and target domains, further achieving class feature distribution alignment on top of the overall feature distribution alignment, thereby enhancing the model's ability to discriminate fine-grained class features. This module achieves effective class feature distribution alignment through a two-stage training strategy. Specifically, the entire training process consists of 200 epochs, with epochs 1 to 100 being the first stage and epochs 101 to 200 being the second stage. As shown in Figure 6a, in the first stage, we construct separate memory banks for the source domain, $M_s = \{B_s^n\}_{n=1}^{N}$, and the target domain, $M_t = \{B_t^n\}_{n=1}^{N}$. Each memory bank is essentially a class-indexed feature dictionary, where the keys correspond to classes and the values store the features of each class, with a fixed feature capacity of $Z$ per class. Meanwhile, the high-quality target samples selected by the high-confidence sample selection module are stored in the target domain memory bank according to their corresponding pseudo-labels. Finally, the source domain class centroid $C_s$ and target domain class centroid $C_t$ are obtained by computing the mean of the features stored for each class in the respective memory banks. To enable dynamic updates of the source and target domain class centroids, the memory banks adopt a queue mechanism. Leveraging the first-in-first-out property of the queue, when the number of features stored for a given class in the memory bank reaches the fixed limit during each mini-batch, the oldest features are automatically discarded and replaced with new ones. This ensures that the class centroids for both the source and target domains can be dynamically updated.
$$C_s = \frac{1}{Z}\sum_{i=1}^{Z} B_s^{n,i}, \qquad C_t = \frac{1}{Z}\sum_{i=1}^{Z} B_t^{n,i}$$
where $Z$ denotes the fixed number of features stored for each class in the memory bank, $B_s^{n,i}$ denotes the $i$-th feature stored for class $n$ in the source domain memory bank, and $B_t^{n,i}$ denotes the $i$-th feature stored for class $n$ in the target domain memory bank.
As shown in Figure 6b, in the second stage, the source and target domain memory banks are merged into a unified memory bank $M_u$, where each class stores both source and target features. The unified class centroid $C_u$ is obtained by computing the mean of the features for each class in the memory bank. The unified memory bank also adopts the same queue update mechanism as in the first stage, ensuring that the unified class centroids can be dynamically updated.
$$C_u = \frac{1}{Z}\sum_{i=1}^{Z} B_u^{n,i}$$
where $Z$ denotes the fixed number of features stored for each class in the memory bank, and $B_u^{n,i}$ denotes the $i$-th feature stored for class $n$ in the unified memory bank.
To achieve effective class feature alignment, the class feature alignment loss is designed based on a two-stage training strategy. In the first stage, source domain features are aligned with the target domain class centroid, while target domain features are aligned with the source domain class centroid. In the second stage, both source and target domain features are aligned with the unified class centroid. It can be formulated as follows:
$$L_{Align} = \begin{cases} \dfrac{1}{n_s}\sum\limits_{i=1}^{n_s}\left(F_i^s - C_t\right)^2 + \dfrac{1}{n_t}\sum\limits_{i=1}^{n_t}\left(F_i^t - C_s\right)^2, & 1 \le t \le 100 \\[2ex] \dfrac{1}{n_s}\sum\limits_{i=1}^{n_s}\left(F_i^s - C_u\right)^2 + \dfrac{1}{n_t}\sum\limits_{i=1}^{n_t}\left(F_i^t - C_u\right)^2, & 101 \le t \le 200 \end{cases}$$
where $t$ denotes the current training epoch, $n_s$ denotes the number of source domain samples, $n_t$ represents the number of target domain samples, $C_s$ denotes the class centroid of the source domain, $C_t$ represents the class centroid of the target domain, $C_u$ denotes the unified class centroid, $F_i^s$ represents the source domain features, and $F_i^t$ represents the target domain features.
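A minimal sketch of the per-class FIFO memory bank and the two-stage alignment loss is given below. The feature dimensionality, the detaching of stored features from the computation graph, and the handling of classes whose banks are still empty are assumptions not specified in the text.

```python
from collections import deque
import torch


class ClassMemoryBank:
    """One FIFO queue per class; the oldest feature is discarded once capacity Z is reached."""

    def __init__(self, num_classes: int, capacity: int = 32):
        self.banks = [deque(maxlen=capacity) for _ in range(num_classes)]

    def update(self, features: torch.Tensor, labels: torch.Tensor) -> None:
        for f, y in zip(features.detach(), labels):
            self.banks[int(y)].append(f)

    def centroids(self, feat_dim: int, device) -> torch.Tensor:
        # per-class mean of the stored features; empty classes stay at zero (assumed)
        out = torch.zeros(len(self.banks), feat_dim, device=device)
        for c, bank in enumerate(self.banks):
            if bank:
                out[c] = torch.stack(list(bank)).mean(dim=0)
        return out


def class_alignment_loss(f_s, y_s, f_t, y_t_pseudo, c_s, c_t, c_u, epoch: int):
    """Two-stage alignment: epochs 1-100 align each domain to the other domain's
    centroids; epochs 101-200 align both domains to the unified centroids."""
    if epoch <= 100:
        loss_s = ((f_s - c_t[y_s]) ** 2).sum(dim=1).mean()
        loss_t = ((f_t - c_s[y_t_pseudo]) ** 2).sum(dim=1).mean()
    else:
        loss_s = ((f_s - c_u[y_s]) ** 2).sum(dim=1).mean()
        loss_t = ((f_t - c_u[y_t_pseudo]) ** 2).sum(dim=1).mean()
    return loss_s + loss_t
```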

3.4. Overall Objective Function

To achieve better cross-domain adaptation of the model, a total loss function is constructed, which consists of source domain supervised classification loss, overall feature alignment loss, target domain pseudo-label mask loss, and class feature alignment loss. The source domain supervised classification loss uses cross-entropy loss to supervise the learning of labeled source samples. It can be formulated as follows:
$$L_{SRC} = -\frac{1}{n_s}\sum_{x_i^s \in D_s}\sum_{c=1}^{C}\theta_{[c = y_i^s]}\log G\big(F(x_i^s)\big)$$
where $F(\cdot)$ denotes the feature extractor, $G(\cdot)$ denotes the classifier, and $C$ is the number of classes. $\theta_{[c = y_i^s]}$ is an indicator function such that $\theta = 1$ if $c = y_i^s$ and $\theta = 0$ otherwise.
The overall feature alignment loss uses MMD to align the feature distributions. It can be formulated as follows:
$$L_{MMD} = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{i=1}^{n_t}\phi(x_i^t) \right\|^2$$
where $x_i^s$ denotes the source domain samples, $x_i^t$ denotes the target domain samples, $n_s$ and $n_t$ represent the numbers of samples in the source and target domains, $\phi(\cdot)$ denotes the feature mapping, and $\|\cdot\|^2$ represents the squared Euclidean distance.
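In its simplest (linear) form, this overall alignment term reduces to the squared distance between the mean source and target features, as in the sketch below; the paper does not specify a kernel, so a kernelized MMD may equally be intended.

```python
import torch


def linear_mmd(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """feat_s: (n_s, d) source features; feat_t: (n_t, d) target features.
    Squared distance between the two domain means in feature space."""
    return ((feat_s.mean(dim=0) - feat_t.mean(dim=0)) ** 2).sum()
```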
The target domain pseudo-label mask loss focuses on high-confidence target samples to ensure that the model can correctly learn the pseudo-labels of these reliable samples. It can be formulated as follows:
$$\theta_{Mask} = \begin{cases} 1, & i \in H \\ 0, & \text{otherwise} \end{cases}$$
$$L_{Mask} = \frac{1}{N}\sum_{i=1}^{N} \theta_{Mask} \cdot \mathrm{CE}\big(P_i^t, \hat{y}_i^t\big)$$
where $\theta_{Mask}$ is an indicator function, which takes the value 1 if the target domain sample belongs to the high-confidence sample set $H$ and 0 otherwise; $\mathrm{CE}(\cdot)$ denotes the cross-entropy function; $P_i^t$ represents the predicted probability of the model for the target domain sample; and $\hat{y}_i^t$ denotes the pseudo-label assigned to the target domain sample.
Based on the above losses, we construct an overall loss function, which is expressed as follows.
$$L_{Total} = L_{SRC} + L_{Mask} + \lambda_1 L_{MMD} + \lambda_2 L_{Align}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the weights of the corresponding terms.
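The following sketch shows how the four terms could be combined in a single training step, using the weights reported in Section 4.2 ($\lambda_1 = 0.3$, $\lambda_2 = 0.7$); the variable names and the batch-level form of the masked pseudo-label term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def total_loss(logits_s, labels_s, logits_t, pseudo_t, mask_t,
               loss_mmd, loss_align, lam1: float = 0.3, lam2: float = 0.7):
    """Combines the four terms of the overall objective for one mini-batch."""
    loss_src = F.cross_entropy(logits_s, labels_s)                     # L_SRC
    per_sample = F.cross_entropy(logits_t, pseudo_t, reduction='none')
    loss_mask = (mask_t.float() * per_sample).mean()                   # L_Mask (masked CE)
    return loss_src + loss_mask + lam1 * loss_mmd + lam2 * loss_align  # L_Total
```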

4. Results

4.1. Cross-Domain Scene Classification Dataset

UCM Dataset [52]: The UCM dataset consists of 21 land-use classes, with 100 images per class, totaling 2100 images. All images are standardized to a size of 256 × 256 pixels and sourced from the National Map Urban Area Imagery provided by the United States Geological Survey (USGS), featuring a spatial resolution of up to 0.3 m. Additionally, the dataset features diverse classes, covering a wide range of scenes from natural landscapes to urban structures, posing challenges for remote sensing scene classification tasks.
AID Dataset [53]: This dataset consists of high-resolution aerial images sourced from Google Earth, encompassing 30 remote sensing scene classes. Each class contains 200 to 400 images, totaling 10,000 images, with each image having a resolution of 600 × 600 pixels. Compared to the UCM dataset, the AID dataset is larger in scale and includes more classes. It exhibits significant intra-class variability and smaller inter-class differences, making the AID dataset more challenging for scene classification tasks.
NWPU Dataset [54]: The NWPU dataset comprises 31,500 images spanning 45 scene classes, with 700 images per class; each image has a resolution of 256 × 256 pixels. Widely recognized as a benchmark for remote sensing scene classification, this dataset stands out due to its extensive scale and diversity. The scenes are captured from various geographic regions under different conditions. Compared to other remote sensing scene classification datasets, the NWPU dataset offers a significantly larger number of images, more varied classes, and broader coverage. It also exhibits remarkable intra-class diversity and inter-class similarity, making it one of the most challenging benchmarks for scene classification tasks.
To validate the effectiveness of the proposed CFACA-NET, eight common classes from three remote sensing datasets were selected to build a cross-domain scene classification dataset for remote sensing. As shown in Figure 7, the eight classes are baseball diamond, beach, dense residential, forest, harbor, parking lot, river, and storage tanks, respectively.

4.2. Experimental Setup

For convenience, we denote the UCM dataset as domain U, the AID dataset as domain A, and the NWPU dataset as domain N. Six cross-domain scene classification tasks are constructed based on these three domains, denoted as U→A, U→N, A→U, A→N, N→U, and N→A, where the arrows indicate the direction of cross-domain adaptation from the source domain to the target domain. We adopt overall accuracy (OA) and average accuracy (AA) as the evaluation metrics for cross-domain performance, where AA is the mean of the OA across the six cross-domain tasks. Both the proposed CFACA-NET and all comparative methods are implemented with the PyTorch framework. In the experiments, we select stochastic gradient descent (SGD) as the optimizer, with the momentum set to 0.9 and the initial learning rate set to 0.005. To stabilize the training process, we adopt a learning rate annealing strategy and dynamically adjust the learning rate using the formula $l = 0.005 / (1 + 10p)^{0.75}$, where $p = \text{epoch} / \text{epochs}$ changes linearly from 0 to 1. The proposed high-confidence sample selection module uses thresholds $\omega_{Unc}$ and $\omega_{Ent}$ to jointly screen out high-quality target domain samples along with corresponding pseudo-labels. After multiple hyperparameter sensitivity experiments, the optimal values of the two thresholds are both set to $\omega_{Unc} = \omega_{Ent} = 0.5$. In addition, the feature capacity $Z$ of each class in the memory bank is set to 32 to store representative features. A total of four loss functions are used in this experiment. To balance the contributions of each loss to the overall training objective, we set two weight parameters $\lambda_1$ and $\lambda_2$ for the two key losses. After multiple sets of hyperparameter sensitivity experiments and tuning of different parameter combinations, the optimal values of the two weights are finally determined to be $\lambda_1 = 0.3$ and $\lambda_2 = 0.7$. The size of the input images to the network is fixed at 224 × 224. We use a GeForce RTX 4080 Super GPU (NVIDIA Corporation, Santa Clara, CA, USA) for training. The number of training epochs is set to 200, and the batch size is 32. Following the standard domain adaptation procedure, all labeled source domain data and unlabeled target domain data are involved in the training process.
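For reference, the annealing rule quoted above corresponds to the following small helper; the per-epoch granularity of the update is our assumption.

```python
def annealed_lr(epoch: int, epochs: int = 200, base_lr: float = 0.005) -> float:
    """Learning rate schedule l = 0.005 / (1 + 10p)^0.75 with p = epoch / epochs."""
    p = epoch / epochs
    return base_lr / (1 + 10 * p) ** 0.75
```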

4.3. Comparison Experiments and Result Analysis

We compare the proposed CFACA-NET with fifteen state-of-the-art methods: DDC [55], JAN [56], DeepCORAL [57], BNM [58], MRAN [59], CDAN [60], AMRAN [61], DSAN [62], DeepMEDA [63], DATSNET [64], ADA-DDA [65], FCAN [66], SAMRA [67], SFMDA [68], and EUDA [69]. Table 1 presents the performance comparisons across the six cross-domain tasks, with the best results highlighted in bold.
To comprehensively evaluate the effectiveness of CFACA-NET, we conducted experiments on six cross-domain scene classification tasks. As shown in Table 1, the experimental results show that CFACA-NET consistently outperforms existing cross-domain scene classification methods in cross-domain classification performance. In terms of OA across the six cross-domain tasks, compared to the second-best EUDA, CFACA-NET achieves performance improvements of 1.17%, 0.78%, 0.62%, 0.52%, 0.87%, and 0.32%, respectively. Compared to SFMDA, CFACA-NET achieves performance improvements of 3.29%, 2.72%, 0.73%, 1.00%, 2.62%, and 0.69%, respectively. Furthermore, in terms of AA, CFACA-NET outperforms the state-of-the-art methods. When compared with fifteen advanced methods, including DDC, JAN, DeepCORAL, BNM, MRAN, CDAN, AMRAN, DSAN, DeepMEDA, DATSNET, ADA-DDA, FCAN, SAMRA, SFMDA, and EUDA, CFACA-NET achieves AA improvements of 18.65%, 17.92%, 16.94%, 16.75%, 16.47%, 16.3%, 16.28%, 15.43%, 14.98%, 11.20%, 9.62%, 6.27%, 4.50%, 1.84%, and 0.71%, respectively.
As shown in Figure 8, we present the confusion matrices for the A→N, N→U, and N→A cross-domain tasks. They show that CFACA-NET, after domain adaptation, can accurately classify most classes in the target domain. A small number of classes are still misclassified; for example, a certain degree of confusion occurs between the river and forest categories. This is mainly because the two scene types exhibit some visual similarity in remote sensing images. In local regions, both categories may present relatively continuous texture patterns. Moreover, in high-resolution remote sensing images, rivers are often surrounded by abundant vegetation, which leads to a certain overlap in color distribution and texture features between water bodies and forest areas. Despite this, CFACA-NET still achieves the best cross-domain classification performance. This indicates that during the domain adaptation process, CFACA-NET can learn more domain-invariant features, effectively reducing the distribution discrepancy between the source and target domains and thereby enhancing cross-domain adaptation capability.
To validate the effectiveness of MSDAM, we incorporated SE or CBAM attention modules into MFEM for comparative classification performance evaluation. As shown in Table 2, when MSDAM was integrated into MFEM, it achieved the highest OA and AA across all six cross-domain tasks, with AA improved by 0.57% and 0.98%, respectively. This demonstrates that, compared to SE and CBAM, introducing MSDAM into MFEM can more effectively identify key regions within features and learn richer domain-invariant feature information, thereby further enhancing cross-domain classification performance.
To more intuitively demonstrate the effectiveness of CFACA-NET, we use t-distributed stochastic neighbor embedding (t-SNE) [70] to visualize the feature distributions of three methods under the U→N and N→U cross-domain tasks. As shown in Figure 9, the first row presents the alignment results for the U→N task, while the second row shows the results for the N→U task. For the U→N task, by comparing Figure 9a–c, it can be observed that, although DATSNET and FCAN can achieve overall alignment of the feature distributions between the source and target domains, the intra-class feature aggregation is relatively weak, a certain degree of feature overlap remains between different classes, and the class boundaries are relatively blurred, thereby limiting the class separability. In contrast, CFACA-NET effectively aligns the source and target features of the same class to the unified class centroid, making each class more compact internally and the boundaries between classes clearer, while also reducing the distribution gap between the source and target domains. In the N→U task, as shown in Figure 9d–f, although the alignment performance of DATSNET and FCAN has improved, some class features still exhibit shifts and overlaps, resulting in unclear classification boundaries. In contrast, CFACA-NET effectively aligns source domain features to the class centroid of the target domain and aligns target domain features to the class centroid of the source domain, thereby aggregating source and target features of the same class. This enhances intra-class compactness and makes class boundaries more distinct. These results demonstrate that our proposed class feature alignment method can exhibit effective adaptability across different cross-domain tasks.

5. Discussion

5.1. Ablation Study

To explore the impact of each module and different loss functions on the performance of CFACA-NET, we conducted a series of ablation experiments. The experimental results are shown in Table 3 and Table 4, where we compared OA (%) and AA (%) across six cross-domain tasks under different settings. In addition, in the module ablation experiments, we also evaluated the model complexity and computational cost under different module combinations.
The proposed CFACA-NET consists of four components: the backbone, CFFM, MSDAM, and FFOM. To better analyze the importance of each module, we constructed the following network variants.
(1) Net-0: Backbone.
(2) Net-1: Backbone + CFFM.
(3) Net-2: Backbone + CFFM + MSDAM.
(4) Net-3: Backbone + CFFM + MSDAM + FFOM.
Effect of Net-0: As shown in Table 3, when CFACA-NET relies solely on the ResNet50 backbone for cross-domain adaptation, its AA is only 93.25%. This indicates that if the model makes classification decisions based solely on single deep features, its performance will be constrained in cross-domain scenarios.
Effect of Net-1: To validate the effectiveness of the CFFM, we incorporated this module into the backbone to construct Net-1. The experimental results demonstrate that Net-1 achieves a 0.59% improvement in AA compared to Net-0. This performance improvement indicates that enabling contextual interaction among different shallow-layer features can effectively enhance feature representation capability, thereby capturing more information from cross-domain scenarios.
Effect of Net-3: To analyze the effectiveness of the FFOM, we incorporated it into Net-2 to construct Net-3. Net-3 achieves the highest OA across all six cross-domain tasks, with its AA improving by 0.54% compared to Net-2. This indicates that after fusing the four-layer features to obtain the final feature representation, the FFOM can effectively eliminate redundant information caused by semantic discrepancies among different layers, thereby enhancing feature expression capability.
To further analyze the model complexity and computational cost introduced by different modules, we present the number of parameters (Params) and floating-point operations (FLOPs) for Net-0 to Net-3 in Table 3. Compared with Net-0, the introduction of CFFM in Net-1 increases the Params by 3.7 M and FLOPs by 2.01 G, indicating that cross-layer feature fusion brings additional computational cost while also improving classification performance. Comparing Net-1 and Net-2, the Params in Net-2 continue to increase, while FLOPs only show a slight rise. This indicates that the introduction of MSDAM effectively enhances feature discriminability with relatively low computational cost and further improves classification performance. Finally, Net-3 builds on Net-2 by incorporating FFOM, which increases the Params by 2.7 M and FLOPs by 0.13 G while achieving the best performance. Although the Params and FLOPs show a gradual increase from Net-0 to Net-3, the moderate computational cost can bring effective performance improvements, demonstrating that CFACA-NET achieves a good balance between cross-domain classification performance and computational efficiency.
Compared with the baseline network, the proposed method introduces CFFM, MSDAM, and FFOM, which inevitably increase the computational cost to some extent. However, these modules are mainly composed of lightweight convolution operations and attention structures, which can be efficiently implemented on modern GPUs. Meanwhile, the proposed method does not alter the overall architecture of the backbone network, allowing it to maintain relatively stable computational efficiency during the inference stage and making it easy to integrate into existing deep learning frameworks, which demonstrates its potential for practical application and deployment.
To analyze the impact of different loss functions on the cross-domain performance of CFACA-NET, we conducted a loss function ablation study, with the specific combinations shown as follows.
(1) Loss-1: $L_{SRC}$.
(2) Loss-2: $L_{SRC} + L_{MMD}$.
(3) Loss-3: $L_{SRC} + L_{MMD} + L_{Align}$.
(4) Loss-4: $L_{SRC} + L_{MMD} + L_{Mask}$.
(5) Loss-5: $L_{SRC} + L_{MMD} + L_{Align} + L_{Mask}$.
Effects of Loss-1: When CFACA-NET employs only Loss-1, its AA is limited to 87.17%, indicating that, when relying solely on $L_{SRC}$, the model cannot effectively mitigate the feature distribution discrepancy between the two domains, thereby limiting the cross-domain classification performance.
Effects of Loss-2: After further incorporating $L_{MMD}$ to construct Loss-2, the AA improves by 3.03% compared to Loss-1. This indicates that aligning the overall feature distributions between the source and target domains can effectively mitigate domain shift, thereby enhancing cross-domain classification performance.
Effects of Loss-3: As shown in Table 4, when $L_{Align}$ is further incorporated into Loss-2 to construct Loss-3, the AA of CFACA-NET reaches 92.87%, representing a 2.7% improvement over the AA achieved with Loss-2. This indicates that after achieving overall feature distribution alignment, further aligning features of the same class between the source and target domains can effectively enhance the model's cross-domain adaptation capability.
Effects of Loss-4: When $L_{Mask}$ is further incorporated into Loss-2 to construct Loss-4, the AA of CFACA-NET improves by 3.89% compared to that with Loss-2. This indicates that high-confidence pseudo-labels in the target domain play a significant role in cross-domain adaptation, and their high quality and reliability effectively enhance cross-domain performance.
Effects of Loss-5: When CFACA-NET jointly adopts all four loss functions, the OA achieves optimal performance across all six cross-domain classification tasks. Its AA shows significant improvements of 7.87%, 4.84%, 2.14%, and 0.95% compared to using Loss-1, Loss-2, Loss-3, and Loss-4, respectively. These results demonstrate that each of the four loss functions plays an irreplaceable role in the cross-domain adaptation process, and only their combined application can achieve the best cross-domain classification performance.

5.2. Hyperparameter Sensitivity Analysis

We conducted two sets of hyperparameter sensitivity experiments on the threshold hyperparameters in CFACA-NET as well as the hyperparameters in the overall loss function to explore the impact of different values on model performance and ultimately determine the optimal settings for each hyperparameter.
In the designed high-confidence sample selection module of CFACA-NET, the thresholds $\omega_{Unc}$ and $\omega_{Ent}$ are used together to screen out high-quality target domain samples and generate reliable pseudo-labels, which improves cross-domain performance. The setting of the thresholds directly affects the selection quality of high-confidence pseudo-labeled samples. When the thresholds are set too low, although the selected pseudo-labels maintain high confidence, the number of target domain samples available for training is significantly reduced, which limits the model's ability to fully learn target domain features. Conversely, when the thresholds are set too high, more target domain samples can be introduced into the training process, but more unreliable pseudo-labels may also be included, thereby introducing noise and affecting the stability of model training. Therefore, in this study, the search range of the thresholds is set to {0.2, 0.3, 0.4, 0.5, 0.6} in order to achieve a reasonable balance between the reliability of pseudo-labels and the utilization of target domain samples. The experimental results are shown in Table 5. When $\omega_{Unc} = \omega_{Ent} = 0.5$, CFACA-NET achieves the highest AA of 95.01% across the six tasks and the highest OA in four cross-domain tasks; when $\omega_{Unc}$ and $\omega_{Ent}$ are set to 0.4 or 0.6, the N→A and U→N tasks reach their highest OA. These results indicate that appropriately adjusting $\omega_{Unc}$ and $\omega_{Ent}$ can further improve performance for specific tasks, but from an overall performance perspective, setting both thresholds to 0.5 allows CFACA-NET to more effectively select high-confidence samples and improve pseudo-label quality, thereby achieving the best AA across the six cross-domain tasks.
To explore the impact of the hyperparameters $\lambda_1$ and $\lambda_2$ in the overall loss function on model performance, we conducted experiments to analyze the performance under different hyperparameter combinations. The results are shown in Table 6; when $\lambda_1 = 0.3$ and $\lambda_2 = 0.7$, CFACA-NET achieves the highest OA and AA across all six cross-domain tasks. This indicates that under this parameter combination, CFACA-NET can better balance the contributions of overall feature alignment and class feature alignment to achieve the best cross-domain classification performance. In contrast, the cross-domain performance of other hyperparameter combinations declines, making $\lambda_1 = 0.3$ and $\lambda_2 = 0.7$ the optimal parameter values.
We also analyzed the sensitivity of the model to the feature capacity $Z$ of each class in the memory bank. The experimental results are shown in Table 7. It can be observed that different values of $Z$ have a certain impact on the model performance. When $Z$ increases from 8 to 32, both the OA and AA of each cross-domain task show a gradual improvement trend. When $Z$ is set to 32, the model achieves the best classification results across all six cross-domain tasks, indicating that under this setting the model can obtain more stable and representative class feature representations. Although the number of model parameters increases as $Z$ becomes larger, the setting of $Z = 32$ achieves the best classification performance, providing a good balance between computational overhead and model performance.

6. Conclusions

This article proposes a CFACA-NET for unsupervised cross-domain scene classification. To address the limitation of existing methods that primarily focus on aligning overall feature distributions while neglecting class feature information during the cross-domain process, CFACA-NET is designed from three perspectives to achieve effective class feature alignment. Specifically, MFEM integrates CFFM, MSDAM, and FFOM to enhance the representation capability of scene features and better learn domain-invariant features. A high-confidence sample selection module based on evidence theory and information entropy is introduced to obtain reliable pseudo-labels. Finally, efficient class feature alignment is achieved through a two-stage training strategy. Experimental results on three remote sensing scene classification datasets show that CFACA-NET consistently outperforms existing state-of-the-art methods in cross-domain classification, thereby validating its effectiveness. In future work, our goal is to further enhance the model’s efficiency and explore the transferability of vision transformer models in cross-domain scenarios.

Author Contributions

Conceptualization, J.W., E.L. and C.Z.; Methodology, J.W., E.L. and C.Z.; Software, J.W. and C.Z.; Validation, J.W. and C.Z.; Formal analysis, J.W., E.L. and C.Z.; Investigation, E.L. and C.Z.; Resources, E.L.; Data curation, J.W.; Writing—original draft, J.W.; Writing—review & editing, J.W., E.L. and C.Z.; Visualization, C.Z.; Supervision, E.L. and C.Z.; Project administration, C.Z.; Funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly funded by the National Natural Science Foundation of China (grant No. 42371465) and the Basic Research Program of Jiangsu (grant No. BK20231353).

Data Availability Statement

The datasets analyzed in this study are publicly available benchmark datasets. The UCM Land Use Dataset is available from the United States Geological Survey (USGS); the AID Dataset is derived from publicly accessible Google Earth imagery, and the NWPU-RESISC45 Dataset is publicly available for research purposes. All dataset sources are cited in the manuscript. No new datasets were generated during the current study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, W.; Chen, Y.; Ghamisi, P. Transferring CNN with adaptive learning for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  2. Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention consistent network for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2030–2045. [Google Scholar] [CrossRef]
  3. Hu, Y.; Huang, X.; Luo, X.; Han, J.; Cao, X.; Zhang, J. Variational self-distillation for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  4. Peng, C.; Li, Y.; Jiao, L.; Shang, R. Efficient convolutional neural architecture search for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6092–6105. [Google Scholar] [CrossRef]
  5. Miao, W.; Geng, J.; Jiang, W. Multigranularity decoupling network with pseudolabel selection for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  6. Wang, X.; Mao, Z.; Shi, A.; Zhang, Z.; Zhou, H. Dropout-based adversarial training networks for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  7. Lin, H.; Hao, M.; Luo, W.; Yu, H.; Zheng, N. BEARNet: A novel buildings edge-aware refined network for building extraction from high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  8. Chen, S.; Shi, W.; Zhou, M.; Zhang, M.; Xuan, Z. CGSANet: A contour-guided and local structure-aware encoder–decoder network for accurate building extraction from very high-resolution remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 15, 1526–1542. [Google Scholar] [CrossRef]
  9. Li, F.; Feng, R.; Han, W.; Wang, L. An augmentation attention mechanism for high-spatial-resolution remote sensing image scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3862–3878. [Google Scholar] [CrossRef]
  10. Li, Z.; Wu, Q.; Cheng, B.; Cao, L.; Yang, H. Remote sensing image scene classification based on object relationship reasoning CNN. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
  11. Xu, C.; Zhu, G.; Shu, J. A lightweight and robust lie group-convolutional neural networks joint representation for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  12. Chen, J.; Huang, H.; Peng, J.; Zhu, J.; Chen, L.; Tao, C.; Li, H. Contextual information-preserved architecture learning for remote-sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  13. Zhang, J.; Liu, J.; Pan, B.; Shi, Z. Domain adaptation based on correlation subspace dynamic distribution alignment for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7920–7930. [Google Scholar] [CrossRef]
  14. Zhang, L.; Lan, M.; Zhang, J.; Tao, D. Stagewise unsupervised domain adaptation with adversarial self-training for road segmentation of remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  15. Song, S.; Yu, H.; Miao, Z.; Zhang, Q.; Lin, Y.; Wang, S. Domain adaptation for convolutional neural networks-based remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1324–1328. [Google Scholar] [CrossRef]
  16. Zheng, J.; Zhao, Y.; Wu, W.; Chen, M.; Li, W.; Fu, H. Partial domain adaptation for scene classification from remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–17. [Google Scholar] [CrossRef]
  17. Liu, Z.-G.; Ning, L.-B.; Zhang, Z.-W. A new progressive multisource domain adaptation network with weighted decision fusion. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1062–1072. [Google Scholar] [CrossRef]
  18. Xu, Q.; Shi, Y.; Yuan, X.; Zhu, X.X. Universal domain adaptation for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  19. Aryal, J.; Neupane, B. Multi-scale feature map aggregation and supervised domain adaptation of fully convolutional networks for urban building footprint extraction. Remote Sens. 2023, 15, 488. [Google Scholar] [CrossRef]
  20. Lasloum, T.; Alhichri, H.; Bazi, Y.; Alajlan, N. SSDAN: Multi-source semi-supervised domain adaptation network for remote sensing scene classification. Remote Sens. 2021, 13, 3861. [Google Scholar] [CrossRef]
  21. Chen, S.; Harandi, M.; Jin, X.; Yang, X. Semi-supervised domain adaptation via asymmetric joint distribution matching. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 5708–5722. [Google Scholar]
  22. Yu, C.; Liu, C.; Song, M.; Chang, C.-I. Unsupervised domain adaptation with content-wise alignment for hyperspectral imagery classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar]
  23. Zhang, Z.; Doi, K.; Iwasaki, A.; Xu, G. Unsupervised domain adaptation of high-resolution aerial images via correlation alignment and self-training. IEEE Geosci. Remote Sens. Lett. 2020, 18, 746–750. [Google Scholar]
  24. Tang, X.; Li, C.; Peng, Y. Unsupervised joint adversarial domain adaptation for cross-scene hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  25. Guo, J.; Yang, J.; Yue, H.; Li, K. Unsupervised domain adaptation for cloud detection based on grouped features alignment and entropy minimization. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  26. Sun, B.; Feng, J.; Saenko, K. Return of frustratingly easy domain adaptation. Proc. AAAI Conf. Artif. Intell. 2016, 30, 2058–2065. [Google Scholar] [CrossRef]
  27. Shen, J.; Qu, Y.; Zhang, W.; Yu, Y. Wasserstein distance guided representation learning for domain adaptation. Proc. AAAI Conf. Artif. Intell. 2018, 32, 4058–4065. [Google Scholar]
  28. Zhao, C.; Qin, B.; Feng, S.; Zhu, W.; Zhang, L.; Ren, J. An unsupervised domain adaptation method towards multi-level features and decision boundaries for cross-scene hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  29. Hou, D.; Wang, S.; Tian, X.; Xing, H. PCLUDA: A pseudo-label consistency learning-based unsupervised domain adaptation method for cross-domain optical remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–14. [Google Scholar] [CrossRef]
  30. Chen, X.; Pan, S.; Chong, Y. Unsupervised domain adaptation for remote sensing image semantic segmentation using region and category adaptive domain discriminator. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar]
  31. Zhu, J.; Guo, Y.; Sun, G.; Yang, L.; Deng, M.; Chen, J. Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level prototype memory. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–18. [Google Scholar] [CrossRef]
  32. Li, Z.; Tang, X.; Li, W.; Wang, C.; Liu, C.; He, J. A two-stage deep domain adaptation method for hyperspectral image classification. Remote Sens. 2020, 12, 1054. [Google Scholar] [CrossRef]
  33. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176. [Google Scholar]
  34. Deng, W.; Su, Z.; Qiu, Q.; Zhao, L.; Kuang, G.; Pietikäinen, M.; Xiao, H.; Liu, L. Deep ladder reconstruction-classification network for unsupervised domain adaptation. Pattern Recognit. Lett. 2021, 152, 398–405. [Google Scholar] [CrossRef]
  35. Baktashmotlagh, M.; Harandi, M.T.; Lovell, B.C.; Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, NSW, Australia, 1–8 December 2013; pp. 769–776. [Google Scholar]
  36. Ma, L.; Luo, C.; Peng, J.; Du, Q. Unsupervised manifold alignment for cross-domain classification of remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1650–1654. [Google Scholar] [CrossRef]
  37. Othman, E.; Bazi, Y.; Melgani, F.; Alhichri, H.; Alajlan, N.; Zuair, M. Domain adaptation network for cross-scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4441–4456. [Google Scholar] [CrossRef]
  38. Xu, R.; Li, G.; Yang, J.; Lin, L. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1426–1435. [Google Scholar]
  39. Wang, L.; Xiao, P.; Zhang, X.; Chen, X. A fine-grained unsupervised domain adaptation framework for semantic segmentation of remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 4109–4121. [Google Scholar] [CrossRef]
  40. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  41. Liu, M.; Zhang, P.; Shi, Q.; Liu, M. An adversarial domain adaptation framework with KL-constraint for remote sensing land cover classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  42. Yang, Y.; Zhang, T.; Li, G.; Kim, T.; Wang, G. An unsupervised domain adaptation model based on dual-module adversarial training. Neurocomputing 2022, 475, 102–111. [Google Scholar] [CrossRef]
  43. Ghifary, M.; Kleijn, W.B.; Zhang, M.; Balduzzi, D.; Li, W. Deep reconstruction-classification networks for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 597–613. [Google Scholar]
  44. Wei, P.; Ke, Y.; Goh, C.K. Feature analysis of marginalized stacked denoising autoencoder for unsupervised domain adaptation. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 1321–1334. [Google Scholar] [CrossRef]
  45. Cai, W.; Wei, Z. Remote sensing image classification based on a cross-attention mechanism and graph convolution. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
  46. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A deep semantic alignment network for cross-modal image-text retrieval in remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
  47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  48. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  49. Chen, W.; Hu, H. Generative attention adversarial classification network for unsupervised domain adaptation. Pattern Recognit. 2020, 107, 107440. [Google Scholar] [CrossRef]
  50. Weng, Q.; Huang, Z.; Lin, J.; Jian, C.; Mao, Z. Remote sensing scene classification via multigranularity alternating feature mining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 318–330. [Google Scholar] [CrossRef]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  52. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 3–5 November 2010; pp. 270–279. [Google Scholar]
  53. Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  54. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  55. Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv 2014, arXiv:1412.3474. [Google Scholar] [CrossRef]
  56. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2208–2217. [Google Scholar]
  57. Sun, B.; Saenko, K. Deep CORAL: Correlation alignment for deep domain adaptation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 443–450. [Google Scholar]
  58. Cui, S.; Wang, S.; Zhuo, J.; Li, L.; Huang, Q.; Tian, Q. Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 3940–3949. [Google Scholar]
  59. Zhu, Y.; Zhuang, F.; Wang, J.; Chen, J.; Shi, Z.; Wu, W.; He, Q. Multi-representation adaptation network for cross-domain image classification. Neural Netw. 2019, 119, 214–221. [Google Scholar] [CrossRef]
  60. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. Adv. Neural Inf. Process. Syst. 2018, 31, 1647–1657. [Google Scholar]
  61. Zhu, S.; Du, B.; Zhang, L.; Li, X. Attention-based multiscale residual adaptation network for cross-scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  62. Zhu, Y.; Zhuang, F.; Wang, J.; Ke, G.; Chen, J.; Bian, J.; Xiong, H.; He, Q. Deep subdomain adaptation network for image classification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1713–1722. [Google Scholar] [CrossRef]
  63. Wang, J.; Chen, Y.; Feng, W.; Yu, H.; Huang, M.; Yang, Q. Transfer learning with dynamic distribution adaptation. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–25. [Google Scholar] [CrossRef]
  64. Zheng, Z.; Zhong, Y.; Su, Y.; Ma, A. Domain adaptation via a task-specific classifier framework for remote sensing cross-scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  65. Yang, C.; Dong, Y.; Du, B.; Zhang, L. Attention-based dynamic alignment and dynamic distribution adaptation for remote sensing cross-domain scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  66. Zhu, P.; Zhang, X.; Han, X.; Cheng, X.; Gu, J.; Chen, P.; Jiao, L. Cross-domain classification based on frequency component adaptation for remote sensing images. Remote Sens. 2024, 16, 2134. [Google Scholar] [CrossRef]
  67. Wang, X.; Xu, H.; Shi, F.; Yuan, L.; Wen, X. Multiscale attention-based subdomain dynamic adaptation for cross-domain scene classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  68. Hou, D.; Yang, Y.; Wang, S.; Zhou, X.; Wang, W. Spatial–frequency multiple feature alignment for cross-domain remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  69. Abedi, A.; Wu, Q.M.J.; Zhang, N.; Pourpanah, F. EUDA: An efficient unsupervised domain adaptation via self-supervised vision transformer. arXiv 2024, arXiv:2407.21311. [Google Scholar] [CrossRef]
  70. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Differences between the two domain adaptation alignment strategies. (a) Overall feature alignment, which loses class information. (b) Class feature alignment, which better aligns features of the same class across domains. Purple represents source domain samples and blue represents target domain samples.
Figure 2. The overall framework of CFACA-NET.
Figure 3. The specific architecture of MFEM. It consists of a CFFM, an MSDAM, and an FFOM.
Figure 4. The overall architecture of MSDAM. (a) Channel-dimension attention branch. (b) Spatial-dimension attention branch.
Figure 5. The overall architecture of FFOM.
Figure 6. The structure of the class feature alignment module. (a) The alignment method in the first stage. (b) The alignment method in the second stage.
Figure 7. Sample images from the cross-domain scene classification dataset constructed from UCM, AID, and NWPU for the experiments. Each column corresponds to samples of the same class from the three datasets.
Figure 8. The confusion matrices for three cross-domain tasks: (a) A→N. (b) N→U. (c) N→A.
Figure 9. The t-SNE visualization of feature distributions under two cross-domain tasks using different methods. The symbols “●” and “▲” denote the source domain features and target domain features, respectively. (a) DATSNET (U→N). (b) FCAN (U→N). (c) CFACA-NET (U→N). (d) DATSNET (N→U). (e) FCAN (N→U). (f) CFACA-NET (N→U).
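For readers reproducing a plot such as Figure 9, the sketch below shows one common way to generate it with scikit-learn's t-SNE [70]; it is illustrative only and is not the authors' code. The arrays `source_feats` and `target_feats` are assumed to be (N, D) feature matrices extracted from the backbone for source and target samples.

```python
# Minimal t-SNE visualization sketch (assumed inputs, not the authors' implementation).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(source_feats: np.ndarray, target_feats: np.ndarray,
              perplexity: float = 30.0, seed: int = 0) -> None:
    # Stack both domains so they share one 2-D embedding.
    feats = np.concatenate([source_feats, target_feats], axis=0)
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=seed).fit_transform(feats)
    n_src = len(source_feats)
    # Circles for source features, triangles for target features (as in Figure 9).
    plt.scatter(emb[:n_src, 0], emb[:n_src, 1], marker="o", s=8, label="source")
    plt.scatter(emb[n_src:, 0], emb[n_src:, 1], marker="^", s=8, label="target")
    plt.legend()
    plt.title("t-SNE of source vs. target features")
    plt.show()
```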
Table 1. Comparison of OA (%) and AA (%) across different cross-domain scene classification methods. Bold indicates the best performance for each task.
Method | U→A | U→N | A→U | A→N | N→U | N→A | AA
DDC | 72.10 | 67.24 | 74.36 | 82.78 | 76.54 | 85.12 | 76.36
JAN | 73.53 | 66.74 | 75.78 | 83.42 | 77.49 | 85.61 | 77.09
DeepCORAL | 74.61 | 66.50 | 76.50 | 84.50 | 79.38 | 86.91 | 78.07
BNM | 71.13 | 69.13 | 79.50 | 88.63 | 71.75 | 89.43 | 78.26
MRAN | 73.26 | 68.48 | 76.53 | 86.71 | 77.64 | 88.62 | 78.54
CDAN | 67.80 | 66.59 | 77.63 | 90.32 | 76.75 | 93.16 | 78.71
AMRAN | 74.08 | 68.09 | 75.50 | 86.80 | 78.50 | 89.43 | 78.73
DSAN | 74.65 | 74.86 | 73.88 | 87.05 | 78.25 | 88.83 | 79.58
DeepMEDA | 75.18 | 75.84 | 73.75 | 89.70 | 76.63 | 89.08 | 80.03
DATSNET | 76.26 | 73.89 | 82.57 | 87.76 | 88.13 | 94.23 | 83.81
ADA-DDA | 77.78 | 74.76 | 87.50 | 90.70 | 89.63 | 91.98 | 85.39
FCAN | 85.68 | 79.83 | 89.45 | 92.31 | 91.68 | 93.47 | 88.74
SAMRA | 87.34 | 84.36 | 91.36 | 93.02 | 92.84 | 94.12 | 90.51
SFMDA | 89.65 | 88.92 | 93.89 | 94.75 | 96.13 | 95.66 | 93.17
EUDA | 91.77 | 90.86 | 94.00 | 95.23 | 97.88 | 96.03 | 94.30
CFACA-NET | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
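To make the reported metrics explicit: OA is the fraction of correctly classified target samples for a given task, and the AA column in Tables 1–7 is the mean of the six per-task OA values (e.g., the CFACA-NET row averages to 95.01). The snippet below is a minimal illustration, not the authors' evaluation code; `y_true` and `y_pred` are assumed to be integer label arrays.

```python
import numpy as np

def overall_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """OA (%): share of target samples whose predicted label matches the ground truth."""
    return 100.0 * float(np.mean(y_true == y_pred))

# AA as used in the tables: the mean OA over the six cross-domain tasks.
cfaca_task_oa = [92.94, 91.64, 94.62, 95.75, 98.75, 96.35]
aa = sum(cfaca_task_oa) / len(cfaca_task_oa)
print(round(aa, 2))  # 95.01
```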
Table 2. Comparison of OA (%) and AA (%) for different attention modules in MFEM. Bold indicates the best performance for each task.
Architecture | U→A | U→N | A→U | A→N | N→U | N→A | AA
MFEM with SE | 92.41 | 91.61 | 93.25 | 95.50 | 98.00 | 95.89 | 94.44
MFEM with CBAM | 91.74 | 90.41 | 93.25 | 94.55 | 98.25 | 96.03 | 94.03
MFEM with MSDAM | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
Table 3. Ablation study results of OA (%) and AA (%) for each module in CFACA-NET. Bold indicates the best performance for each task.
Method | Params (M) | GFLOPs | U→A | U→N | A→U | A→N | N→U | N→A | AA
Net-0 | 25.6 | 3.95 | 89.75 | 90.38 | 92.75 | 94.79 | 96.88 | 94.93 | 93.25
Net-1 | 29.3 | 5.96 | 90.53 | 90.66 | 93.62 | 95.32 | 97.62 | 95.32 | 93.84
Net-2 | 41.5 | 5.97 | 91.77 | 91.32 | 94.00 | 95.73 | 98.12 | 95.89 | 94.47
Net-3 | 44.2 | 6.10 | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
Table 4. Ablation study results of OA (%) and AA (%) for different loss functions in CFACA-NET. Bold indicates the best performance for each task.
Method | U→A | U→N | A→U | A→N | N→U | N→A | AA
Loss-1 | 78.37 | 75.36 | 86.25 | 94.36 | 93.12 | 95.43 | 87.14
Loss-2 | 86.45 | 80.45 | 89.88 | 95.07 | 94.62 | 94.57 | 90.17
Loss-3 | 91.06 | 88.25 | 90.12 | 94.84 | 97.38 | 95.57 | 92.87
Loss-4 | 89.89 | 89.52 | 91.50 | 95.68 | 98.00 | 95.89 | 94.06
Loss-5 (ours) | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
Table 5. Comparison of OA (%) and AA (%) with different threshold settings. Bold indicates the best performance for each task.
ω_Unc / ω_Ent | U→A | U→N | A→U | A→N | N→U | N→A | AA
0.2 | 89.93 | 89.48 | 91.50 | 95.45 | 96.00 | 95.32 | 92.94
0.3 | 91.95 | 90.00 | 91.88 | 95.54 | 97.88 | 94.65 | 93.65
0.4 | 92.30 | 90.68 | 92.38 | 95.14 | 97.25 | 96.38 | 94.02
0.5 | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
0.6 | 92.87 | 92.29 | 93.38 | 95.59 | 98.38 | 96.10 | 94.77
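As context for the thresholds varied in Table 5, the sketch below illustrates a generic entropy-based filter for retaining high-confidence target pseudo-labels. It is a simplified stand-in: the paper's selection module additionally uses an evidence-theory uncertainty (ω_Unc), whose exact formulation is not reproduced here. `logits` is assumed to be a (batch, classes) tensor from the classifier head.

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits: torch.Tensor, ent_threshold: float = 0.5):
    """Keep samples whose normalized predictive entropy is below the threshold.

    Returns a boolean mask over the batch and the argmax pseudo-labels.
    """
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    # Normalize by log(C) so the threshold lies in [0, 1] regardless of class count.
    entropy = entropy / torch.log(torch.tensor(float(logits.shape[1])))
    mask = entropy < ent_threshold
    return mask, probs.argmax(dim=1)
```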
Table 6. Comparison of OA (%) and AA (%) for different hyperparameter combinations. Bold indicates the best performance for each task.
λ1 | λ2 | U→A | U→N | A→U | A→N | N→U | N→A | AA
0.1 | 0.9 | 91.24 | 87.45 | 94.38 | 95.66 | 97.88 | 96.13 | 93.79
0.2 | 0.8 | 91.74 | 89.98 | 93.12 | 95.57 | 98.25 | 96.21 | 94.15
0.3 | 0.7 | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
0.4 | 0.6 | 91.63 | 89.96 | 92.62 | 95.05 | 97.25 | 95.96 | 93.75
0.5 | 0.5 | 91.56 | 90.48 | 92.75 | 94.91 | 96.75 | 95.00 | 93.58
Table 7. Comparison of OA (%) and AA (%) with different values of Z. Bold indicates the best performance for each task.
Z | Params (M) | U→A | U→N | A→U | A→N | N→U | N→A | AA
8 | 37.6 | 92.16 | 90.95 | 94.03 | 95.12 | 97.82 | 95.61 | 94.28
16 | 40.7 | 92.53 | 91.27 | 94.21 | 95.36 | 98.56 | 95.82 | 94.63
32 | 44.2 | 92.94 | 91.64 | 94.62 | 95.75 | 98.75 | 96.35 | 95.01
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

