1. Introduction
Hyperspectral image (HSI) classification exploits the rich spectral-spatial characteristics across hundreds of contiguous bands, enabling fine-grained material recognition and showing great potential in precision agriculture [1], environmental monitoring [2], and geological surveying [3]. With the development of deep learning, convolutional neural networks (CNNs) [4,5,6] and Transformers [7,8,9] have significantly improved single-scene HSI classification performance by learning nonlinear spectral-spatial representations. In addition, HSI-specific techniques such as band selection [10] and hyperspectral unmixing [11] help reduce spectral redundancy and noise, further improving robustness and efficiency for downstream classification. However, most existing methods assume that the training (source) and testing (target) data are independent and identically distributed (i.i.d.), an assumption that often fails in real-world cross-scene applications. Variations in sensors, atmospheric conditions, illumination, and seasonal dynamics [12,13] lead to substantial domain shifts between scenes, severely degrading model performance when deployed on unseen domains.
To mitigate this domain shift, Domain Generalization (DG) methods learn domain-invariant features from the source domain(s) only, thereby enabling zero-shot transfer to unseen target domains. DG has made significant progress in areas such as computer vision and natural language processing. Existing DG approaches can be broadly categorized into three groups: data manipulation, representation learning, and learning strategy [14,15]. Data manipulation methods, such as data augmentation [16] and data generation [17,18], enhance the diversity and quantity of input data to improve generalization. Representation learning, the mainstream of DG research, mainly follows two directions. One is domain-invariant representation learning, which seeks representations that are invariant across domains [19,20]. The other is feature disentanglement, which attempts to decompose features into domain-invariant and domain-specific components [21,22]. Learning strategies adopt general training paradigms such as meta-learning [23,24] and self-supervised learning [25,26] to improve robustness and reduce domain dependency.
However, DG for HSI classification remains relatively under-explored. Current research on DG in HSI mainly focuses on the single-source domain generalization setting, because labeled data from multiple domains are difficult and costly to obtain in HSI, which makes single-source generalization both necessary and meaningful. With only a single labeled domain available, existing methods commonly adopt data manipulation techniques to generate pseudo-domains [27,28,29], thereby simulating domain shifts between different environments. Based on this, two major strategies have emerged to enhance generalization: one leverages contrastive learning and adversarial training to obtain domain-invariant representations [30,31], while the other focuses on feature disentanglement to separate domain-shared and domain-specific features [32,33].
While these efforts provide valuable insights, further advancements are needed to fully address the challenges posed by complex and diverse domain shifts in HSI data. In this work, we introduce a causal perspective to understand and address the generalization problem under single-source DG settings. Specifically, we regard different sensing scenes as interventions on a shared physical system, and focus on two core challenges that hinder robust generalization across such interventional shifts.
Challenge 1: Learning stable representations from limited interventional diversity. Under the single-source DG setting, models are trained on data collected from a single sensing scene, which corresponds to a specific intervention on the underlying physical system. This lack of interventional diversity makes it difficult for the model to capture causal mechanisms that are invariant across scenes. Consequently, the learned representations may overfit to scene-specific patterns and fail to generalize to unseen domains.
Challenge 2: Eliminating spurious correlations that confound semantic prediction. In source domains, certain non-semantic factors—such as background textures, atmospheric conditions, or sensor-specific artifacts—may spuriously correlate with class labels. These confounders can mislead the model into learning shortcuts that do not hold in unseen domains, leading to degraded generalization performance. The core difficulty lies in identifying and separating truly causal semantic features from such entangled spurious ones.
To formally characterize domain shifts in HSI, we adopt a Structural Causal Model (SCM) [34] to model the data-generating process, as illustrated in Figure 1a. The latent physical properties Z determine the semantic label Y, while the observed image X is influenced by both the latent properties Z and the sensing scene S. Scene variations introduce domain-specific artifacts in X, but do not affect the intrinsic semantics Y.
Under a single-source setting (S fixed to a single sensing condition), the model only observes limited interventional diversity, leading to Challenge 1. Moreover, since X encodes both semantic (Z-related) and spurious (S-related) components, their entanglement leads to Challenge 2. To illustrate this, Figure 1b shows how X can be decomposed into causal features (derived from Z) and non-causal features (originating from S), both of which may correlate with Y in the training domain. Such spurious correlations may bias predictions when the model is deployed under unseen scenes. In light of these issues, our goal is two-fold: to simulate diverse interventions on S to encourage stable feature learning, and to disentangle X into causal and non-causal features to isolate invariant semantic cues from scene-dependent noise.
To solve the aforementioned challenges and achieve our goal of robust generalization under single-source domain settings, we propose CauseHSI, a causality-inspired framework composed of two key components: the Counterfactual Generation Module (CGM) and the Causal Disentanglement Module (CDM).
To tackle Challenge 1, CGM adopts a causal perspective to simulate interventional diversity under the single-source setting. Based on the SCM in Figure 1a, where domain shifts stem from variations in the sensing condition S, we approximate interventions do(S) (where the do-operator is the mathematical operator for intervention) [35] through controlled perturbations of the observed image X in the frequency domain. This approximation does not aim to reproduce exact physical sensing processes, but instead provides an operational means to emulate plausible sensing-induced variations while preserving semantic consistency. Motivated by findings that domain-specific artifacts tend to concentrate in extreme frequency bands [28,36,37,38], we decompose the frequency representation of X and apply structured Gaussian noise to the low and high frequencies as a practical approximation of scene-sensitive variations. The mid-frequency components, which are relatively more robust to sensing changes, are preserved and thus act as scene-robust carriers of semantic information. To further ensure semantic fidelity, CGM incorporates two complementary mechanisms: injecting the central spectral signature to retain class-discriminative cues, and applying mild spatial randomization to enrich local diversity without disrupting spatial structure. In addition, a style-controlled discrepancy loss explicitly constrains the magnitude of perturbations, preventing excessive deviations that could compromise label consistency. Through these constrained interventions in the frequency, spectral, and spatial domains, CGM generates counterfactual samples [39], namely synthetic data instances derived by perturbing domain-specific factors while preserving class-discriminative semantics. These counterfactual samples effectively expand the range of sensing conditions beyond the single-source domain, thereby enriching interventional diversity and alleviating the generalization limitations described in Challenge 1.
To address the second challenge, we propose the CDM, guided by three principles: causal independence, cross-domain consistency, and semantic completeness. The detailed theoretical foundation is elaborated in Section 3.3. CDM adopts a dual-branch structure to isolate causal and non-causal features based on marginal independence assumptions. To enhance semantic completeness and ensure consistency across domains, we design a Causal Reassembly Module (CRM), which reconfigures features in the frequency domain by decomposing non-causal features into high- and low-frequency components and recombining them with causal representations. This reconstruction enforces complementary constraints that encourage the disentangled causal branch to capture more authentic, invariant semantics.
In summary, CauseHSI enables robust generalization under single-source settings by simulating interventional shifts through controlled frequency perturbations and disentangling domain-invariant causal semantics via principled architectural constraints, all from a causality-inspired perspective. The major contributions of the proposed method are summarized as follows.
1. We present a novel perspective for cross-scene HSI classification under single-source DG by framing the problem within an SCM, which explicitly accounts for sensing-induced interventions and their effects on feature entanglement.
2. To simulate unseen domain shifts, we introduce CGM, which perturbs extreme frequency components of HSI data in a controlled manner, generating semantically consistent counterfactual samples. This module exposes the model to diverse sensing conditions, enhancing robustness.
3. We propose CDM to explicitly disentangle causal and non-causal representations using a dual-branch architecture and a causal reassembly mechanism. This enables the model to isolate invariant semantic cues from domain-specific artifacts.
The remainder of this paper is organized as follows. Section 2 introduces related works pertinent to this study. Section 3 elaborates on the proposed methodology in detail. Section 4 reports experimental results and comparative analyses. Finally, Section 5 concludes the paper with a summary of the proposed approach and future directions.
3. Methods
The proposed CauseHSI framework, depicted in Figure 2, is composed of two key modules: CGM and CDM. Given source domain hyperspectral samples, CGM generates counterfactual samples by perturbing domain-specific components while preserving semantic content. These counterfactuals simulate plausible distribution shifts, enabling the construction of a source–counterfactual domain pair that reflects potential domain variations. Both original and counterfactual samples are then fed into CDM, which adopts a dual-branch architecture to disentangle features into causal and non-causal components under marginal independence assumptions. A Causal Reassembly Module further refines these representations by enforcing semantic completeness and consistency. Reconstruction and classification losses guide the network to extract domain-invariant causal features. Ultimately, classification is performed based on the causal features, which promote robust generalization across unseen domains.
3.1. Causal Formulation of Domain Shifts
To provide a rigorous foundation for our framework, we formalize the SCM underlying HSI classification. As illustrated in Figure 1a, four key variables are considered: latent physical properties Z, sensing scene S, observed image X, and semantic label Y. Their relationships are characterized as:
$$Z \sim P(Z), \qquad Y = f_Y(Z), \qquad X = f_X(Z, S).$$
Here, $P(Z)$ captures the intrinsic distribution of physical properties, $f_Y$ denotes the stable semantic generation mechanism that maps physical attributes to class labels, and $f_X$ models the imaging process in which the sensing scene S introduces domain-specific variations. This formulation explicitly reflects that Y is causally determined by Z and is invariant to S, while S acts as an interventional variable that perturbs the distribution of X without affecting the semantic mechanism.
It is worth noting that although Y is causally generated from Z, the observed image X may exhibit spurious statistical correlations with Y in the training domain. Such correlations arise from the fixed sensing condition S and are illustrated as a dashed arrow from X to Y in Figure 1a. This dashed edge does not represent a causal influence, but rather an observational dependence induced by domain-specific biases. Under a single-source setting, all training samples are collected under a single fixed sensing condition. As a result, the model is exposed to only a limited range of sensing variations, which restricts the diversity of interventional patterns and leads to Challenge 1.
From a causal perspective, the observed hyperspectral image X is influenced by both intrinsic semantic factors and extrinsic sensing conditions. Rather than modeling the full physical imaging process, we adopt a functional abstraction to characterize their distinct causal roles. Specifically, we conceptually decompose X into a causal component $X_c$, which encodes semantic information determined by latent physical properties, and a non-causal component $X_n$, which captures domain-specific variations introduced by the sensing scene:
$$X = X_c + X_n.$$
This formulation does not imply that the real imaging process is strictly linear or additive. Instead, it serves as a causal abstraction that highlights the separation between invariant semantic factors and scene-dependent perturbations. In practice, complex nonlinear interactions between Z and S may exist and are implicitly absorbed into the functional representations of $X_c$ and $X_n$. The importance of this abstraction lies in enabling a clear causal interpretation of domain shifts, where variations in S can be viewed as interventions that alter $X_n$ while leaving the semantic mechanism invariant. Since S remains fixed in the source domain, the non-causal component $X_n$ may become spuriously correlated with the label Y. Models trained on such data tend to exploit these domain-specific cues, resulting in unstable predictions when encountering unseen sensing conditions. This phenomenon gives rise to Challenge 2.
3.2. Counterfactual Generation Module (CGM)
Prior studies [36,49] have shown that the extreme frequency components of images often contain domain-private patterns, which are sensitive to sensing conditions rather than reflecting intrinsic semantics. Building on this observation, we interpret sensing variations as interventions on the path S → X in our SCM (Figure 1a), where the sensing scene S introduces domain-specific artifacts into the observed image X without altering the underlying semantics Y. By perturbing frequency-sensitive components, CGM mimics such scene-induced variations while preserving semantic consistency inherited from the latent properties Z. This design allows the generated counterfactual samples to reflect plausible sensing changes, thereby enhancing the robustness of the model under distributional shifts.
CGM does not aim to exactly reproduce physical sensing processes or explicitly parameterized sensing variables. Instead, it provides a controlled and operational approximation of plausible sensing variations by selectively perturbing empirically scene-sensitive components while enforcing spectral, spatial, and semantic consistency. This design allows CGM to generate label-consistent counterfactual samples that reflect realistic sensing changes, rather than arbitrary noise injections. The overall architecture of CGM is illustrated in Figure 3 and consists of three coordinated branches.
3.2.1. Frequency-Based Intervention
Given an input hyperspectral patch X, we perform a discrete cosine transform (DCT) on each spectral channel to obtain its frequency representation. As shown in Figure 1a, X is generated from Z and S under the structural model, so domain shifts are primarily induced by variations in the sensing condition S, while the latent physical properties Z remain invariant across scenes. To simulate interventions on the nuisance factor S under a fixed source distribution, we perform frequency-domain perturbations as an operational approximation of the structural intervention do(S).
Rather than adopting a hard frequency cutoff, we employ a soft frequency weighting strategy in the DCT domain. Frequency components are ordered according to their radial distance from the DC component, and a fixed, smoothly varying weighting profile is applied to modulate different frequency regions. This weighting profile assigns higher perturbation strength to extreme low- and high-frequency components, which are empirically more sensitive to sensing conditions and scene-dependent distortions, while mid-band frequencies receive minimal perturbation and are thus treated as relatively scene-robust semantic carriers.
Formally, the frequency representation is decomposed via soft weighting into two complementary components, a scene-sensitive component and a scene-robust component, which together sum to the original representation. The intervention is approximated by applying stochastic multiplicative perturbations to the scene-sensitive component while keeping the scene-robust component unchanged. This multiplicative formulation perturbs the magnitude of scene-sensitive frequency components without altering their spatial-frequency structure, thereby avoiding unrealistic artifacts. The intervened frequency representation is then reconstructed by recombining the perturbed scene-sensitive component with the unchanged scene-robust component, and is transformed back into the spatial domain via inverse DCT to produce the perturbed image.
Finally, the perturbed image is encoded by a frequency encoder (FreqEncoder) composed of convolutional blocks to yield a compact frequency-level representation. This frequency-level decomposition provides a controllable and effective proxy for simulating sensing-related variations under limited source diversity.
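As a rough illustration of the frequency-based intervention described above, the following NumPy sketch applies a per-channel 2D DCT, splits the spectrum with a soft weighting profile, and multiplicatively perturbs the scene-sensitive part. The Gaussian-shaped profile, the `1 + sigma * noise` form of the perturbation, and all names (`soft_extreme_weights`, `center`, `width`, `sigma`) are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x is the DCT of x, D.T @ y inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

def soft_extreme_weights(h, w, center=0.5, width=0.2):
    """Assumed profile: close to 1 at extreme low/high radial frequencies, 0 at mid-band."""
    u = np.arange(h)[:, None] / max(h - 1, 1)
    v = np.arange(w)[None, :] / max(w - 1, 1)
    r = np.sqrt(u**2 + v**2) / np.sqrt(2.0)  # radial distance from DC, scaled to [0, 1]
    return 1.0 - np.exp(-((r - center) ** 2) / (2 * width**2))

def frequency_intervention(patch, sigma=0.1, rng=None):
    """patch: (H, W, C) array. Returns a perturbed patch of the same shape."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = patch.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    weights = soft_extreme_weights(h, w)[:, :, None]

    # 2D DCT of every spectral channel: F = D_h X D_w^T
    freq = np.einsum('ij,jkc,lk->ilc', dh, patch, dw)

    # Soft split into scene-sensitive and scene-robust parts (they sum to freq).
    sensitive = weights * freq
    robust = (1.0 - weights) * freq

    # Multiplicative Gaussian perturbation of the scene-sensitive part only.
    noise = 1.0 + sigma * rng.standard_normal(freq.shape)
    perturbed = sensitive * noise + robust

    # Inverse 2D DCT back to the spatial domain: X = D_h^T F D_w
    return np.einsum('ij,ilc,lk->jkc', dh, perturbed, dw)
```

With `sigma=0` the function returns the input patch unchanged (up to floating-point error), which is a convenient sanity check that the split and the inverse transform are consistent.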
3.2.2. Spectral Consistency Preservation
To retain class-discriminative information, we enhance the perturbed sample with the spectral signature of its central pixel. This spectral vector is passed through a SpeEncoder composed of fully connected layers with ReLU activations and a residual connection, producing the spectral embedding. The concatenation of the frequency-level representation and the spectral embedding is further processed by a spectral-level randomization module [30], producing a fused feature that preserves spectral integrity while introducing controlled variability.
3.2.3. Spatial Style Perturbation
In parallel, we extract spatial features z from the original patch using SpaEncoder, a shallow CNN. To increase spatial diversity while maintaining structural coherence, we apply Adaptive Instance Normalization (AdaIN):
$$\mathrm{AdaIN}(z) = \sigma' \cdot \frac{z - \mu(z)}{\sigma(z)} + \mu',$$
where $\mu(z)$ and $\sigma(z)$ are the per-channel statistics of z, and $\mu'$ and $\sigma'$ are randomly sampled from other spatial features within the batch. This transformation perturbs spatial styles without disrupting semantic layout.
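A minimal NumPy sketch of this style perturbation is given below, assuming feature maps of shape `(N, C, H, W)` and assuming that the "other" statistics are obtained by randomly permuting the batch. The function name `adain_batch_shuffle` and the `eps` stabilizer are illustrative choices rather than details from the paper.

```python
import numpy as np

def adain_batch_shuffle(z, rng=None, eps=1e-5):
    """Re-style each sample with the per-channel mean/std of another batch sample.

    z: (N, C, H, W) spatial features.
    Note: a random permutation may map a sample to itself, in which case
    that sample is left (almost) unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = z.mean(axis=(2, 3), keepdims=True)           # per-sample, per-channel mean
    sigma = z.std(axis=(2, 3), keepdims=True) + eps   # per-sample, per-channel std
    perm = rng.permutation(z.shape[0])                # which sample lends its style
    normalized = (z - mu) / sigma                     # remove the original style
    return sigma[perm] * normalized + mu[perm]        # apply the borrowed style
```

After the transform, each channel carries the mean and (approximately) the standard deviation of the sample it borrowed from, while the normalized spatial pattern of the original sample is kept.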
The outputs from the spectral and spatial branches are concatenated and passed through a Decoder to produce the final counterfactual sample. To ensure semantic alignment while promoting stylistic diversity, we introduce a style-controlled discrepancy loss (Equation (4)), defined over the style feature extracted from X and the style feature extracted from the counterfactual sample. The first term of this loss ensures a sufficient style gap via a Gram matrix distance, and the second term constrains the perturbation strength within a controlled range.
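The Gram matrix distance used as the style gap can be sketched as follows. This only illustrates the distance itself; how the two terms of the loss are combined is not reproduced here, and the normalization by the number of spatial positions and the mean-squared form are assumptions.

```python
import numpy as np

def gram_matrix(feat):
    """feat: (C, H, W) style feature map -> (C, C) channel-correlation matrix."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1)
    return flat @ flat.T / flat.shape[1]  # normalize by number of positions (assumed)

def style_gap(feat_a, feat_b):
    """Mean squared difference between the two Gram matrices (assumed distance)."""
    diff = gram_matrix(feat_a) - gram_matrix(feat_b)
    return float(np.mean(diff ** 2))
```

A larger `style_gap` between the source and counterfactual style features indicates a stronger stylistic change; identical features give a gap of exactly zero.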
Through this controlled intervention in the frequency, spectral, and spatial spaces, CGM generates label-consistent counterfactuals that serve as effective augmentations for improving the model’s generalization across unseen domains.
3.3. Causality Feature Disentanglement: Theoretical Foundation
Inspired by causal inference theory, we propose a disentanglement approach that distinguishes causal features from non-causal features. This separation is crucial for learning representations that generalize across unseen domains by emphasizing invariant causal mechanisms. However, identifying causal and non-causal features solely from observational data is inherently challenging, especially in the absence of additional assumptions or constraints. It is important to clarify that we do not claim theoretical identifiability of causal and non-causal factors from single-source observational data alone. Indeed, in classical causal inference, such disentanglement is generally unidentifiable without interventional data, multiple environments, or strong prior assumptions. In our framework, this limitation is explicitly acknowledged and addressed by augmenting the observational setting with counterfactual samples generated by the proposed CGM. These counterfactual samples introduce controlled variations that simulate interventions on non-causal factors, thereby providing additional supervisory signals for learning a practically useful disentanglement. Following the insights from prior works [50,51], introducing appropriate constraints can help approximate the underlying causal structure. In the context of cross-domain generalization, we posit that the disentangled causal and non-causal features should satisfy the following three principles:
1. Causal Independence: Causal and non-causal features should be statistically independent [46], i.e., the mutual information between them should be zero.
2. Semantic Completeness: The combination of causal and non-causal features should preserve sufficient information for accurate reconstruction or prediction, as measured by Shannon entropy.
3. Cross-Domain Consistency: Causal features corresponding to the same semantic content should remain invariant across domains, i.e., the distributions of causal features in the source and target domains should be close under a distributional distance (e.g., MMD).
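For reference, one way to write the three principles symbolically is sketched below. The notation is assumed for illustration: $F_c$ and $F_n$ stand for the causal and non-causal features, $I(\cdot;\cdot)$ for mutual information, $H(\cdot \mid \cdot)$ for conditional Shannon entropy, $P_s$ and $P_t$ for the source and target distributions of causal features, and $D(\cdot,\cdot)$ for a distributional distance such as MMD. The equalities are idealized targets that training can only approximate, and the conditional-entropy form of the completeness principle is one possible reading of it.

```latex
\begin{align}
  I(F_c;\, F_n) &= 0
    && \text{(causal independence)} \\
  H(X \mid F_c, F_n) &= 0
    && \text{(semantic completeness)} \\
  D\big(P_s(F_c),\, P_t(F_c)\big) &= 0
    && \text{(cross-domain consistency)}
\end{align}
```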
These three principles jointly form the theoretical foundation for our causal-inspired disentanglement framework, which we incorporate into the model through architectural design and loss constraints to enable robust generalization under domain shifts.
3.4. Independence Constraint: Marginally Independent Representation Decomposition
To facilitate the disentanglement of causal and non-causal features, we adopt the marginal independence assumption, which posits that the two factors should be statistically independent. This assumption is widely used in causal representation learning and domain generalization [46,50,52], helping to prevent spurious correlations between invariant semantics and domain-specific variations. To implement this assumption, we design a dual-branch network inspired by the early-branching strategy [46], as shown in Figure 4. The network begins with a Spectral-Spatial Fusion Module (SSFM), which extracts low-level spectral-spatial representations from the hyperspectral input. The SSFM integrates channel-wise spectral cues and local spatial structures through parallel convolutions, and fuses them to obtain the shallow feature map F.
Notably, the independence constraint is not imposed solely on source domain observations. Instead, it is jointly applied to source samples and their corresponding counterfactual variants generated by CGM. By exposing the model to paired samples that share semantic content but differ in non-causal factors, the disentanglement process is guided by contrastive supervisory signals that approximate interventional variation. This design alleviates the inherent identifiability ambiguity in single-source observational data and enables the model to separate invariant semantic features from domain-specific variations in a pragmatic manner.
The feature map F is fed into two separate encoders: a semantic encoder that extracts causal features and a domain encoder that extracts non-causal features. Inspired by CVSSN [53], each encoder consists of a pointwise convolution group, a depthwise convolution group, and a module tailored for specific semantics. The semantic encoder emphasizes mid-frequency information via its convolutions, capturing fine-grained structural and texture patterns that are generally domain-invariant. The domain encoder uses convolutions to aggregate broader context and low-frequency trends, while dilated convolutions are added to retain detail and detect high-frequency, domain-specific variations.
To ensure that the causal and non-causal features are statistically independent, we introduce a regularization term based on the Hilbert-Schmidt Independence Criterion (HSIC) [54], a kernel-based measure of statistical dependence. Given two variables P and Q (in our case, the causal and non-causal features), the empirical HSIC is computed as:
$$\mathrm{HSIC}(P, Q) = \frac{1}{(n-1)^2}\,\mathrm{tr}\left(K_P H K_Q H\right),$$
where $K_P$ and $K_Q$ are Gram matrices computed using RBF kernels over samples of P and Q, respectively; $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix; and n is the batch size.
We define the independence loss as the empirical HSIC between the causal and non-causal feature branches, which is minimized during training to reduce the dependency between the two branches and promote better disentanglement.
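The empirical HSIC can be sketched in a few lines of NumPy. This uses the standard biased estimator with a `1/(n-1)^2` normalization; the default RBF bandwidth `gamma = 1/d` and the function names are assumptions for illustration, since the kernel bandwidth is not specified in the text.

```python
import numpy as np

def rbf_gram(x, gamma=None):
    """x: (n, d) samples -> (n, n) RBF Gram matrix exp(-gamma * ||xi - xj||^2)."""
    gamma = 1.0 / x.shape[1] if gamma is None else gamma  # assumed default bandwidth
    sq = np.sum(x ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-gamma * dist2)

def hsic(p, q):
    """Biased empirical HSIC between paired samples p: (n, dp) and q: (n, dq)."""
    n = p.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    k_p, k_q = rbf_gram(p), rbf_gram(q)
    return float(np.trace(k_p @ h @ k_q @ h) / (n - 1) ** 2)
```

In a training loop, `p` and `q` would be the flattened causal and non-causal features of one batch, and the returned value would be added to the loss as the independence term. Values near zero indicate weak dependence, and strongly dependent inputs give clearly larger values.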
3.5. Completeness and Consistency Constraint: Frequency-Aware Reconstruction Strategy
While marginal independence ensures that causal and non-causal representations do not share statistical dependencies, this alone does not guarantee that they together encode complete and semantically meaningful information. To further enhance disentanglement, we propose a reconstruction process (illustrated in Figure 5) that imposes two complementary constraints, causal completeness and causal consistency, corresponding to Principle 2 and Principle 3.
At the core of this process lies our novel Causal Reassembly Module (CRM), which performs a frequency-aware fusion of causal and non-causal features. Specifically, given the causal feature and the non-causal feature, we first project both into the DCT space. We then partition the non-causal spectrum into a low-frequency component and a high-frequency component. Treating the causal spectrum as the middle-frequency component, we concatenate the three in frequency order to obtain a reassembled spectrum, which is then transformed back to the spatial domain via inverse DCT. This frequency-aware reassembly allows the model to maintain semantic consistency across domains while preserving both coarse and fine structural details.
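One possible reading of this reassembly is sketched below in NumPy: both features are transformed with a 2D DCT, the low and high radial bands are taken from the non-causal spectrum, the middle band is taken from the causal spectrum, and the result is inverted. Implementing "concatenation in frequency order" as band masking, as well as the band edges `low` and `high`, are assumptions made for this illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x is the DCT of x, D.T @ y inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

def radial_band_masks(h, w, low, high):
    """Boolean masks for low / mid / high bands by normalized radial frequency."""
    u = np.arange(h)[:, None] / max(h - 1, 1)
    v = np.arange(w)[None, :] / max(w - 1, 1)
    r = np.sqrt(u**2 + v**2) / np.sqrt(2.0)
    low_mask = r < low
    high_mask = r >= high
    mid_mask = ~(low_mask | high_mask)
    return low_mask, mid_mask, high_mask

def causal_reassembly(f_causal, f_noncausal, low=0.25, high=0.75):
    """f_causal, f_noncausal: (C, H, W) features. Returns a (C, H, W) reassembled feature."""
    _, h, w = f_causal.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    # Forward 2D DCT per channel: F = D_h X D_w^T
    spec_c = np.einsum('ij,cjk,lk->cil', dh, f_causal, dw)
    spec_n = np.einsum('ij,cjk,lk->cil', dh, f_noncausal, dw)
    low_mask, mid_mask, high_mask = radial_band_masks(h, w, low, high)
    # Low and high bands from the non-causal branch, mid band from the causal branch.
    spec = spec_n * (low_mask | high_mask) + spec_c * mid_mask
    # Inverse 2D DCT per channel: X = D_h^T F D_w
    return np.einsum('ij,cil,lk->cjk', dh, spec, dw)
```

Because the three masks partition the spectrum, passing the same feature as both arguments returns that feature unchanged. Swapping in the non-causal feature of a counterfactual sample gives the cross-domain reassembly used for the consistency constraint.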
Finally, the reassembled feature is passed through a lightweight decoder to reconstruct the image. To ensure faithful reconstruction, we define a reconstruction loss between the original image X and the reconstructed image.
To impose the aforementioned constraints, we apply the reconstruction process with different feature combinations as input, enabling the model to learn both the completeness and consistency of disentangled features across domains.
3.5.1. Causal Completeness Constraint (Principle 2)
To ensure that the causal and non-causal representations together capture the full semantic content of the input, we reconstruct both the source image and its counterfactual sample from the reassembled features. The completeness constraint is enforced via the reconstruction losses from both domains, namely the reconstruction loss of the original source image X and that of its counterfactual sample.
3.5.2. Causal Consistency Constraint (Principle 3)
To promote domain-invariant semantics in the causal features, we conduct cross-domain reassembly by combining source domain causal features with counterfactual-domain non-causal features for reconstruction, and define a consistency reconstruction loss on the result. In addition, we enforce consistency in the causal feature space directly by minimizing the distance between the causal features of the source and counterfactual domains. The total causal consistency loss combines these two terms.
By systematically satisfying the three core principles, our proposed method achieves a principled disentanglement of causal and non-causal factors in HSI. These principles guide the learning process to isolate scene-invariant, label-relevant representations while suppressing scene-specific variations, thereby improving robustness under domain shifts.
Once the causal representation is extracted by the disentanglement module, it is passed through a classification head to predict class labels. To ensure the discriminative power of causal features extracted from both source and counterfactual images, we supervise both predictions with a cross-entropy loss, using the source domain ground-truth label, which is shared by the counterfactual. The final classification loss is the sum of the two cross-entropy terms.
3.6. Training Phase
In our method, training alternates between optimizing the CDM and the CGM.
We first optimize the CDM. The objective is to ensure that the extracted features satisfy the three proposed causal principles, while also being discriminative for the final classification task. The total CDM loss (Equation (17)) combines the classification loss with the causal losses, where a balancing hyperparameter controls the contribution of the causal loss.
After updating the CDM, we optimize the CGM with the style-controlled discrepancy loss from Equation (4) and the classification loss on counterfactual samples from Equation (15). The total CGM loss combines these two terms through a balancing weight. For simplicity, the balancing hyperparameter in Equation (17) and the corresponding parameter in Equation (4) are set to the same value.
3.7. Causal Scope and Limitations
While the proposed framework is inspired by causal principles, it does not aim to identify the true underlying causal graph of hyperspectral image formation. Instead, CauseHSI adopts a causally motivated structural abstraction in which sensing scenes are treated as interventions that induce distribution shifts on the observed data. Within this formulation, the goal is not causal discovery, but to learn feature representations that are operationally consistent with causal assumptions under scene variations.
Specifically, the proposed causal disentanglement is guided by a set of practical criteria. These criteria serve as inductive biases that discourage the exploitation of scene-specific spurious correlations, rather than as formal guarantees of causal identifiability. Similarly, the counterfactual generation module provides a controlled and operational approximation of plausible sensing variations, instead of an exact simulation of physical sensing processes. As a result, the causal properties enforced by the framework should be understood as consistency-oriented constraints that improve robustness under domain shifts, rather than as theoretically proven causal correctness. This design choice is aligned with common practice in causality-inspired representation learning for domain generalization.
4. Experiments
4.1. Datasets
To evaluate the generalization performance of our proposed method, we conduct extensive experiments on three widely-used HSI datasets: Pavia, Houston, and HyRANK.
4.1.1. Pavia Dataset
The Pavia dataset consists of two urban scenes: Pavia University and Pavia Center, both captured by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The sensor originally records 115 spectral bands within the range of 430–860 nm. After preprocessing to remove noisy bands, Pavia University retains 103 bands, and Pavia Center retains 102 bands. In this study, we use the 102 spectral bands shared by the two scenes. The spatial resolution of both datasets is 1.3 m, with image sizes of 610 × 340 (University) and 1096 × 715 (Center), respectively. We focus on the seven classes that are shared between the two scenes for consistent cross-scene evaluation. The classes and the number of samples are listed in Table 1, and the pseudo-color image with ground truth map is shown in Figure 6.
4.1.2. Houston Dataset
The Houston dataset comprises two subsets: Houston2013 and Houston2018, both acquired over urban areas in Houston, Texas. Houston2013 was captured by the ITRES CASI-1500 sensor (Calgary, AB, Canada) and was provided as part of the IEEE GRSS Data Fusion Contest in 2013. It contains 144 spectral bands spanning 364–1046 nm, with a spatial resolution of 2.5 m and an image size of 349 × 1905. Houston2018, released during the 2018 Data Fusion Contest, features ultrahigh-resolution imagery (0.05 m) with 48 spectral bands over a 2384 × 601 spatial grid. The two subsets are annotated with 15 and 20 land cover categories, respectively. In this study, we adopt the 48 spectral bands and select the seven categories that are common to both subsets to ensure consistency in cross-domain evaluation. The number of samples per class is given in Table 2. The pseudo-color image and ground truth maps are shown in Figure 7.
4.1.3. HyRANK Dataset
The HyRANK dataset is derived from Hyperion satellite imagery and includes two scenes: Dioni and Loukia, both located in Greece. Each image has a spatial resolution of 30 m and sizes of 250 × 1376 (Dioni) and 249 × 945 (Loukia), respectively. The Hyperion sensor provides 242 spectral bands in the range of 400–2500 nm, from which 176 bands are retained after removing noisy and water absorption bands. The dataset is labeled with 14 land cover categories, among which 12 consistent classes are selected for training and evaluation in our experiments, as shown in
Table 3. The pseudo-color image and ground truth maps are shown in
Figure 8.
4.2. Implementation Details
All experiments are conducted using PyTorch 1.8 on a workstation running Ubuntu 20.04.5 LTS (Linux kernel 4.15.0). The hardware configuration includes an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz and a single NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory.
We evaluate the performance of our method using three widely adopted metrics in HSI classification: class-specific accuracy, overall accuracy (OA), and the Kappa coefficient (KC).
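For reference, the three metrics can all be derived from a single confusion matrix. The following is a minimal sketch using the standard definitions (Kappa as Cohen's kappa); the function name and the row/column convention are illustrative choices and are not taken from the authors' code.

```python
from typing import List, Tuple


def classification_metrics(conf: List[List[int]]) -> Tuple[List[float], float, float]:
    """Return (class-specific accuracies, OA, Kappa) from a confusion matrix.

    Assumed convention: conf[i][j] counts samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(conf)
    total = sum(sum(row) for row in conf)
    row_sums = [sum(row) for row in conf]  # samples per true class
    col_sums = [sum(conf[i][j] for i in range(n_classes)) for j in range(n_classes)]

    # Class-specific accuracy: correctly classified fraction of each true class.
    per_class = [conf[i][i] / row_sums[i] if row_sums[i] else 0.0
                 for i in range(n_classes)]

    # Overall accuracy: fraction of all samples on the diagonal.
    oa = sum(conf[i][i] for i in range(n_classes)) / total

    # Expected agreement by chance, from the marginal distributions.
    # (If pe equals 1, Kappa is undefined; this sketch does not guard that case.)
    pe = sum(row_sums[i] * col_sums[i] for i in range(n_classes)) / (total * total)
    kappa = (oa - pe) / (1.0 - pe)
    return per_class, oa, kappa
```

Because Kappa subtracts the chance agreement implied by the class marginals, it is less easily inflated by a few dominant classes than OA alone.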
The training process is configured with a batch size of 256, input patch size
and 400 training epochs. We apply L2 regularization with a weight decay of
. To assess CauseHSI’s sensitivity to key hyperparameters, we analyze three parameters: base learning rate,
, and embedding dimension (
in
Figure 5), chosen from
,
, and
, respectively. As shown in
Figure 9, the learning rate exhibits consistent behavior across all datasets, with
yielding the best or near-best performance, indicating stable optimization dynamics. For
, performance varies smoothly over a wide range of values. While the optimal
differs slightly across datasets,
remains competitive on all benchmarks, and larger values can further benefit more complex scenes, suggesting that the model does not rely on precise tuning of this parameter. Similarly, the embedding dimension demonstrates a broad plateau of strong performance. Dimensions of 256 and 512 achieve comparable results across datasets, indicating that CauseHSI is not overly sensitive to the specific embedding capacity. Overall, these results confirm that CauseHSI maintains stable performance under reasonable hyperparameter variations, and the dataset-specific settings reflect minor adaptations to data characteristics rather than strict configuration dependencies. Based on the above analysis, we fix the base learning rate to
for all datasets.
is selected from {1, 10}, where
yields strong performance on Pavia and Houston, while a larger value (
) is adopted for HyRANK to better accommodate its higher scene complexity. The embedding dimension is chosen as 512 for Pavia and HyRANK, and 256 for Houston.
4.3. Results and Analysis
To comprehensively evaluate the effectiveness of our proposed method in cross-scene HSI classification, we compare it against a wide range of representative and state-of-the-art approaches. Specifically, we include several recent DG methods tailored for HSI tasks, including SDEnet [
30], FDGNet [
28], S2ECNet [
29], D3Net [
40], and ISDGS [
41], which are designed to learn scene-invariant representations under unseen target domains. To provide additional reference points, we also include two representative Domain Adaptation (DA) methods, DSAN [
55] and TSTnet [
56], for supplementary comparison. Additionally, we include two competitive methods for single-scene HSI classification, SSFTT [
57] and DSNet [
58], which are trained and evaluated on the same domain without explicit generalization mechanisms. All methods are evaluated using official implementations or publicly available codebases, and we carefully follow the original training protocols and hyperparameter settings to ensure fair and reliable comparisons.
For a fair comparison, all methods are trained and evaluated under exactly the same data partitioning and augmentation strategies. Specifically, considering the imbalance in sample quantities across datasets, we adopt dataset-specific data splitting: for the Pavia dataset (Pavia University as the source domain and Pavia Center as the target), 50% of the source domain data is used for training and the remaining 50% for validation; for the Houston dataset (Houston2013 as the source and Houston2018 as the target) and the HyRANK dataset (Dioni as the source and Loukia as the target), 80% of the source domain is used for training and 20% for validation. In all cases, the entire target domain is used as the test set. Furthermore, for the Houston2013 dataset, we apply data augmentation (random flip and random radiation noise) by a factor of four, which is consistently applied to all compared methods.
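The splitting and augmentation protocol described above can be sketched as follows. This is an illustrative stdlib-only sketch, not the authors' implementation: the function names, the noise parameters (`alpha_range`, `beta`), the non-stratified shuffle, and the reading of "a factor of four" as the original samples plus three augmented copies are all assumptions.

```python
import random
from typing import List, Tuple

Patch = List[List[List[float]]]  # indexed as [row][col][band]


def split_source(n_samples: int, train_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle source-domain sample indices and split them into train / validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_fraction * n_samples))
    return idx[:n_train], idx[n_train:]


def random_flip(patch: Patch, rng: random.Random) -> Patch:
    """Flip the patch vertically and/or horizontally, each with probability 0.5."""
    out = [row[:] for row in patch]
    if rng.random() < 0.5:
        out = out[::-1]                       # vertical flip (reverse rows)
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]      # horizontal flip (reverse columns)
    return out


def radiation_noise(patch: Patch, rng: random.Random,
                    alpha_range: Tuple[float, float] = (0.9, 1.1),
                    beta: float = 0.04) -> Patch:
    """Scale the patch by a random gain and add Gaussian noise (assumed parameters)."""
    alpha = rng.uniform(*alpha_range)
    return [[[alpha * v + beta * rng.gauss(0.0, 1.0) for v in pixel]
             for pixel in row] for row in patch]


def augment(patches: List[Patch], factor: int, seed: int) -> List[Patch]:
    """Return the originals followed by (factor - 1) augmented copies of each patch."""
    rng = random.Random(seed)
    out = list(patches)
    for _ in range(factor - 1):
        for p in patches:
            out.append(radiation_noise(random_flip(p, rng), rng))
    return out
```

For example, `split_source(n, 0.5, seed)` would correspond to the Pavia setting and `split_source(n, 0.8, seed)` to Houston and HyRANK; labels would need to be repeated alongside the augmented patches in the same order.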
To account for the randomness introduced by certain modules in the compared methods, we report the classification performance as the mean ± standard deviation over multiple runs. Specifically, we fix five random seeds and average the results of the five corresponding independent runs. This setup ensures a more accurate and stable evaluation of all methods under consistent experimental conditions.
For DG methods, only labeled source domain data are used during training, and models are directly tested on the target domain. The same training protocol is applied to single-scene methods to assess their generalization ability under domain shift. In contrast, DA methods are trained using labeled source domain data along with an equal amount of unlabeled target domain data. Among them, DSAN requires a batch size of 32 due to its loss function design [
55]. For other hyperparameters not explicitly mentioned, we follow the original settings reported in the respective papers.
Table 4,
Table 5 and
Table 6 summarize the class-specific accuracy, OA and KC for all compared methods across the Pavia, Houston and HyRANK datasets, respectively. The visual classification results of different methods on these three datasets are illustrated in
Figure 10,
Figure 11 and
Figure 12.
Single-scene methods (SSFTT, DSNet) lack any mechanism to handle domain shift. As expected, they perform poorly when directly applied to unseen target domains. This observation further emphasizes the necessity of developing methods that explicitly address the challenges of cross-scene hyperspectral image classification.
Among DA methods, DSAN and TSTnet benefit from access to unlabeled target-domain data during training, which allows them to partially adapt to the target distribution. However, such assumptions are not applicable in the single-source domain generalization setting considered in this work.
DG methods demonstrate varying strengths across different datasets, reflecting their distinct design principles. SDEnet shows relatively stable performance across all scenes, while FDGNet performs competitively on Houston, and D3Net achieves strong results on HyRANK. These variations suggest that different design principles—such as contrastive learning or semantic alignment—impact performance under varying scene conditions. S2ECNet further explores causality-inspired design by incorporating spectral–spatial enhancement and causal contribution constraints. By introducing causal alignment through contrastive constraints on causal contribution vectors, S2ECNet exhibits strong robustness to cross-scene variations. In comparison, CauseHSI places greater emphasis on disentangling causal and non-causal features to explicitly separate invariant semantic factors from domain-specific variations. This formulation enables the model to capture stable semantic representations while flexibly accommodating domain shifts, leading to more consistent generalization across diverse scenes.
Our proposed method consistently achieves the highest OA and KC across all three datasets. Specifically, it outperforms the strongest DG baselines by clear margins on Pavia, Houston, and HyRANK, indicating superior robustness under diverse sensing conditions. The consistent improvement in Kappa further demonstrates that the performance gains are not dominated by majority classes, but reflect a more reliable agreement between predictions and ground truth.
A closer inspection of class-wise accuracies reveals that the proposed method does not uniformly improve all land-cover categories, and that performance variations across classes are clearly observable in all three datasets. In particular, several categories—such as C5 in Pavia, C1 and C4 in Houston, and C2, C6, and C10 in HyRANK—exhibit noticeable performance degradation compared with certain competing methods.
From a causal perspective, this behavior is expected rather than anomalous. The proposed framework explicitly suppresses scene-dependent and non-causal cues through causal disentanglement and counterfactual augmentation. Consequently, land-cover categories that rely heavily on background context, illumination patterns, or other scene-specific correlations—rather than intrinsic spectral–semantic properties—may experience reduced classification accuracy when such non-causal signals are attenuated.
Moreover, several degraded categories are characterized by strong spectral ambiguity or limited inter-class separability, as observed in complex datasets such as HyRANK. Under domain generalization settings, where spurious correlations cannot be exploited, learning stable causal representations for such categories remains inherently challenging for all methods. In addition, counterfactual augmentation may introduce increased variance for extremely small or noisy classes, further amplifying class-wise fluctuations.
Importantly, these localized degradations do not offset the overall gains: the proposed method still achieves the highest OA and KC on all three datasets, which indicates a more globally consistent agreement between predictions and ground truth under cross-scene shifts rather than an improvement confined to a few dominant classes.
To further assess the reliability of the reported performance, we additionally report the 95% confidence intervals of OA and Kappa, computed around the mean over five runs using the Student’s t-distribution. As shown in
Table 7, CauseHSI consistently achieves higher mean performance with relatively narrow confidence intervals across all three datasets, indicating stable and reliable performance gains.
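As an illustration of how such intervals are typically obtained, the sketch below computes the mean, the sample standard deviation, and the half-width of a two-sided 95% interval for a small number of runs. It assumes the standard form mean ± t(0.975, n−1)·s/√n with the sample (n−1) standard deviation; the paper's exact estimator may differ, and the function name and lookup table are illustrative.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_std_ci95(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample std, 95% CI half-width) for a small set of runs."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)            # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, half_width
```

With five runs there are four degrees of freedom, so the critical value is 2.776 rather than the 1.96 of the normal approximation; the interval is correspondingly wider.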
Figure 10,
Figure 11 and
Figure 12 provide qualitative comparisons of classification maps. Subfigure (a) presents the ground-truth map, (b)–(i) show the compared methods, and (j) corresponds to our proposed approach. Pixels without ground-truth labels are treated as background, and all pixels are predicted for visual comparison. Notably, our method yields smoother and less noisy classification maps, as illustrated in the red-boxed regions. We attribute this visual advantage to the model’s focus on global semantic consistency, which reduces local misclassifications and noise.
Overall, the proposed method achieves consistent superiority in overall accuracy and domain robustness. While it does not outperform all baselines in class-level accuracy, its domain-invariant representation learning yields reliable generalization across complex and diverse scenes. This further supports the effectiveness of our causality-inspired disentanglement strategy in cross-scene hyperspectral image classification.
To quantitatively assess computational efficiency,
Table 8 reports training time, inference time, FLOPs, and parameter counts on three benchmark datasets. As shown in the table, CauseHSI incurs roughly 3–5× higher FLOPs than most lightweight DG baselines. However, it has the lowest number of parameters among all compared DG methods, resulting in a compact memory footprint. In terms of runtime, the training time of CauseHSI is moderately higher than that of the lightest baselines, while remaining significantly lower than that of heavyweight methods. Importantly, its inference time is comparable to or only marginally higher than that of other DG approaches across all datasets, indicating that the additional computational cost is mainly introduced during training rather than deployment. Overall, these results suggest that CauseHSI achieves a favorable balance between training-time complexity and inference-time efficiency, making it practical for real-world hyperspectral applications where robustness to domain shifts is required.
4.4. Ablation Study
To evaluate the effectiveness of key components in the proposed CauseHSI, we conduct ablation studies by systematically removing each component and observing performance degradation across all three datasets. The quantitative results are summarized in
Table 9.
Specifically, we investigate the following six variants: (1) “no DCT”: removes the frequency-based intervention from CGM. (2) “no 2D”: removes the spectral consistency preservation branch from CGM. (3) “no Control”: disables the style-controlled discrepancy loss in CGM. (4) “no Consist&Complete”: removes both the causal consistency and completeness constraints from CDM. (5) “no Consist”: only removes the causal consistency constraint from CDM. (6) “no Complete”: only removes the causal completeness constraint from CDM.
The variants “no DCT” and “no 2D” respectively disable two complementary intervention branches in the CGM, both of which are designed to approximate counterfactual domain shifts. Removing either branch results in noticeable and consistent performance degradation across all three datasets, highlighting their synergistic roles in counterfactual generation. Specifically, the “no DCT” variant exhibits systematic drops in OA and KC, indicating that frequency-based perturbations substantially enrich the diversity of interventional samples. This empirically supports our design choice of performing counterfactual interventions in the frequency domain, as frequency components effectively capture global sensing variations such as illumination, atmospheric conditions, and sensor response, which are difficult to model explicitly through physical parameters. Similarly, the “no 2D” variant leads to clear degradation, demonstrating that preserving the central spectral structure during intervention is crucial for maintaining class-discriminative information. This confirms that counterfactual perturbations must be constrained to avoid semantic distortion, and validates the necessity of performing interventions in a controlled feature space rather than through unconstrained transformations. Moreover, the removal of the style-controlled discrepancy loss (“no Control”) causes pronounced performance drops, with HyRANK suffering the largest KC decrease (−3.30%). This observation indicates that style regulation plays a critical role in balancing semantic alignment and stylistic diversity, particularly for fine-grained land-cover categories. Without this constraint, feature-space interventions tend to over-amplify non-causal variations, reducing the utility of generated counterfactual samples. Taken together, these results provide strong empirical evidence that frequency-based intervention, spectral preservation, and style regularization play complementary roles in CGM. 
Frequency perturbations broaden interventional diversity, spectral preservation safeguards semantic consistency, and style control prevents over-perturbation. Their joint contribution directly supports the validity of implementing counterfactual interventions in frequency and feature spaces.
The “no Consist&Complete” variant leads to the most significant decline, as it relies solely on the causal independence constraint and fails to enforce semantic alignment or feature sufficiency. Retaining either the causal completeness constraint (the “no Consist” variant) or the causal consistency constraint (the “no Complete” variant) yields clear improvements over this stripped-down version. Specifically, while “no Consist&Complete” shows a 3–4% drop in OA, “no Complete” reduces this drop to about 1%. These results demonstrate the importance of multi-level causal constraints in approximating true causal features.
The proposed frequency-based intervention relies on a soft frequency weighting strategy to distinguish scene-sensitive and scene-robust components in the DCT domain. Although this weighting profile is fixed across all experiments, we further investigate its sensitivity to ensure that the performance gains do not depend on a specific frequency configuration. Specifically, we vary the radial range of mid-band frequencies that receive minimal perturbation, while keeping all other components of the framework unchanged. Three configurations are considered: Narrow, Default, and Wide, corresponding to progressively smaller or larger mid-frequency preservation ranges. Notably, this variation does not introduce any explicit hard cutoff between low, mid, and high frequencies, but instead adjusts the extent of the smoothly weighted frequency regions. We evaluate these configurations on representative cross-scene classification benchmarks under the single-source domain generalization setting. As reported in
Table 10, the proposed method exhibits stable performance across different frequency weighting ranges. While minor fluctuations are observed, the overall accuracy remains consistently high, indicating that the effectiveness of the proposed frequency-based intervention is not sensitive to the specific choice of frequency weighting parameters. This robustness supports our design choice of using a fixed and coarse-grained frequency weighting profile as a practical approximation of sensing-induced variations.
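To make the notion of a smoothly weighted mid-band concrete, the sketch below shows one possible soft weighting profile over normalized radial DCT frequency. It is purely illustrative: the Gaussian-notch form, the band center, and the three width values are hypothetical and are not the profile or the Narrow/Default/Wide settings used in the paper.

```python
import math

# Hypothetical mid-band widths (in normalized radial frequency) for three configurations.
MID_BAND_WIDTH = {"narrow": 0.10, "default": 0.20, "wide": 0.30}


def radial_frequency(u: int, v: int, size: int) -> float:
    """Normalized radial frequency in [0, 1] of DCT coefficient (u, v) in a size x size block."""
    return math.hypot(u, v) / math.hypot(size - 1, size - 1)


def perturbation_weight(r: float, center: float = 0.5, width: float = 0.2) -> float:
    """Smooth perturbation strength in [0, 1) for normalized radial frequency r.

    Coefficients near `center` receive almost no perturbation (weight near 0);
    the weight rises smoothly toward 1 away from the mid-band, so there is no
    hard cutoff between low, mid, and high frequencies.
    """
    return 1.0 - math.exp(-((r - center) / width) ** 2)
```

Under this parameterization, widening the mid-band lowers the weight at every frequency other than the center, which mirrors the idea of enlarging the preserved range without introducing a hard boundary.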
To further validate the effectiveness of CauseHSI in enhancing generalization, we visualize the feature distributions on three datasets using t-SNE.
Figure 13 illustrates the distributions of target-domain samples before and after applying CauseHSI. In the original feature space, samples from different classes exhibit substantial overlap, whereas CauseHSI yields clearer inter-class separation across all datasets.
4.5. Physical Plausibility of Generated Counterfactual Samples
Since the proposed framework relies on generated counterfactual hyperspectral samples for DG, it is important to ensure that these samples remain physically plausible rather than arbitrary perturbations. Perceptual metrics commonly used for natural images, such as FID or LPIPS, are not directly applicable to hyperspectral data due to the high dimensionality of spectral signals and the lack of suitable pretrained feature extractors. Therefore, we assess the physical plausibility of the generated counterfactual samples using physically grounded spectral metrics.
Specifically, we compute the spectral angle mapper (SAM) between the center-pixel spectra of the original and generated samples, where the center pixel corresponds to the semantic label in each
spatial-spectral patch. In addition, we evaluate spectral smoothness along the spectral dimension to examine whether the generated spectra preserve the inherent band-wise continuity of hyperspectral signals. As reported in
Table 11, the generated counterfactual samples exhibit moderate spectral angles, typically ranging from 2.8° to 4.5°, indicating meaningful yet physically reasonable domain perturbations rather than trivial reconstructions. Moreover, the spectral smoothness of the generated samples remains close to that of real hyperspectral data across all datasets, suggesting that the proposed generation process does not introduce severe high-frequency spectral artifacts.
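The two spectral metrics can be sketched as follows. The spectral angle uses its standard definition (the angle between two spectra treated as vectors). The smoothness measure is not defined in this section, so the mean absolute second-order band difference used here is only an assumed stand-in, and both function names are illustrative.

```python
import math
from typing import Sequence


def spectral_angle_deg(a: Sequence[float], b: Sequence[float]) -> float:
    """Spectral angle (in degrees) between two spectra of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def spectral_roughness(s: Sequence[float]) -> float:
    """Mean absolute second-order difference along the band axis (lower is smoother).

    Assumed smoothness proxy; the paper's exact definition may differ.
    """
    if len(s) < 3:
        raise ValueError("at least three bands are required")
    return sum(abs(s[i - 1] - 2.0 * s[i] + s[i + 1])
               for i in range(1, len(s) - 1)) / (len(s) - 2)
```

Since the spectral angle is invariant to a positive rescaling of either spectrum, a pure gain change yields an angle of zero; nonzero angles therefore reflect changes in spectral shape rather than in overall brightness.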
4.6. Scalability and Applicability to Large-Scale Scenes
Although the proposed framework is evaluated on commonly used benchmark datasets, it is not inherently restricted to small- or medium-sized hyperspectral scenes. The overall design of CauseHSI is patch-based and scene-agnostic, and does not rely on global scene-level modeling or full-image statistics. As a result, large-scale hyperspectral scenes can be processed in a tiled or sliding-window manner without any modification to the network architecture or training strategy.
Importantly, both the causal disentanglement module and the counterfactual generation mechanism operate locally on image patches or intermediate feature representations. Their computational and memory costs scale linearly with the number of patches, rather than with the spatial extent of the entire scene. This property ensures that the framework remains computationally feasible for large-area hyperspectral imagery.
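A tiled traversal of a large scene can be sketched as below. The function name, the tile size, and the stride are hypothetical; the sketch only shows that the number of tiles, and hence the patch-level workload, is determined by the scene size divided by the stride in each direction.

```python
from typing import Iterator, List, Tuple


def tile_origins(height: int, width: int, tile: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield top-left corners of tile x tile windows covering a height x width scene.

    The last window along each axis is shifted back so that it ends exactly at
    the scene border. Scenes smaller than one tile yield a single origin (0, 0)
    and would require padding by the caller.
    """
    def starts(size: int) -> List[int]:
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile + 1, stride))
        if positions[-1] != size - tile:
            positions.append(size - tile)     # cover the trailing border
        return positions

    for r in starts(height):
        for c in starts(width):
            yield r, c
```

Each tile can then be processed independently by the unchanged network, and the per-tile predictions stitched back into a full-scene map.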
Moreover, patch-wise training and inference are standard practice in hyperspectral image analysis, particularly for domain generalization settings where full-scene annotations are rarely available. The benchmark datasets used in this work are themselves extracted from large-scale airborne and spaceborne scenes, and therefore provide a reasonable proxy for real-world large-scene deployment. These considerations suggest that the proposed framework can be readily applied to very large-scale hyperspectral scenes in practical remote sensing applications.