Article

CauseHSI: Counterfactual-Augmented Domain Generalization for Hyperspectral Image Classification via Causal Disentanglement

College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
* Author to whom correspondence should be addressed.
J. Imaging 2026, 12(2), 57; https://doi.org/10.3390/jimaging12020057
Submission received: 3 January 2026 / Revised: 21 January 2026 / Accepted: 23 January 2026 / Published: 26 January 2026
(This article belongs to the Special Issue Multispectral and Hyperspectral Imaging: Progress and Challenges)

Abstract

Cross-scene hyperspectral image (HSI) classification under single-source domain generalization (DG) is a crucial yet challenging task in remote sensing. The core difficulty lies in generalizing from a limited source domain to unseen target scenes. We formalize this problem using causal theory, viewing different sensing scenes as distinct interventions on a shared physical system. This perspective reveals two fundamental obstacles: interventional distribution shifts arising from varying acquisition conditions, and confounding biases induced by spurious correlations driven by domain-specific factors. To address both, we propose CauseHSI, a causality-inspired framework for cross-scene HSI classification. CauseHSI consists of two key components. A Counterfactual Generation Module (CGM) perturbs domain-specific factors to generate diverse counterfactual variants, simulating cross-domain interventions while preserving semantic consistency. A Causal Disentanglement Module (CDM) separates invariant causal semantics from spurious correlations through structured constraints under a structural causal model, guiding the model toward domain-invariant, generalizable representations. By aligning model learning with causal principles, CauseHSI enhances robustness against domain shifts. Extensive experiments on the Pavia, Houston, and HyRANK datasets demonstrate that CauseHSI outperforms existing DG methods.

1. Introduction

Hyperspectral image (HSI) classification exploits the rich spectral-spatial characteristics across hundreds of contiguous bands, enabling fine-grained material recognition and showing great potential in precision agriculture [1], environmental monitoring [2], and geological surveying [3]. With the development of deep learning, convolutional neural networks (CNNs) [4,5,6] and Transformers [7,8,9] have significantly improved single-scene HSI classification performance by learning nonlinear spectral-spatial representations. In addition, HSI-specific techniques such as band selection [10] and hyperspectral unmixing [11] help reduce spectral redundancy and noise, further improving the robustness and efficiency of downstream classification. However, most existing methods assume that training (source) and testing (target) data are independent and identically distributed (i.i.d.), an assumption that often fails in real-world cross-scene applications. Differences in sensors, atmospheric conditions, illumination, and seasonal dynamics [12,13] cause substantial domain shifts between scenes, severely degrading model performance when deployed on unseen domains.
To mitigate such domain shifts, Domain Generalization (DG) methods learn domain-invariant features from the source domain(s) only, thereby enabling zero-shot transfer to unseen target domains. DG has made significant progress in areas such as computer vision and natural language processing. Existing DG approaches can be broadly categorized into three groups: data manipulation, representation learning, and learning strategies [14,15]. Data manipulation methods, such as data augmentation [16] and data generation [17,18], enhance the diversity and quantity of the input data to improve generalization. Representation learning, the mainstream of DG research, mainly follows two directions. One is domain-invariant representation learning, which seeks representations that are invariant across domains [19,20]. The other is feature disentanglement, which decomposes features into domain-invariant and domain-specific components [21,22]. Learning strategies adopt general training paradigms such as meta-learning [23,24] and self-supervised learning [25,26] to improve robustness and reduce domain dependency.
However, DG in the field of HSI classification remains relatively under-explored. Currently, research on DG in HSI mainly focuses on the single-source domain generalization setting. This focus arises from the practical challenge that it is difficult and costly to obtain labeled data from multiple domains in HSI, making single-source generalization both necessary and meaningful. Due to the limitation of having only a single labeled domain, existing methods commonly adopt data manipulation techniques to generate pseudo-domains [27,28,29], thereby simulating domain shifts between different environments. Based on this, two major strategies have emerged to enhance generalization: one leverages contrastive learning and adversarial training to obtain domain-invariant representations [30,31], while the other focuses on feature disentanglement to separate domain-shared and domain-specific features [32,33].
While these efforts provide valuable insights, further advancements are needed to fully address the challenges posed by complex and diverse domain shifts in HSI data. In this work, we introduce a causal perspective to understand and address the generalization problem under single-source DG settings. Specifically, we regard different sensing scenes as interventions on a shared physical system, and focus on two core challenges that hinder robust generalization across such interventional shifts.
Challenge 1: Learning stable representations from limited interventional diversity. Under the single-source DG setting, models are trained on data collected from a single sensing scene, which corresponds to a specific intervention on the underlying physical system. This lack of interventional diversity makes it difficult for the model to capture causal mechanisms that are invariant across scenes. Consequently, the learned representations may overfit to scene-specific patterns and fail to generalize to unseen domains.
Challenge 2: Eliminating spurious correlations that confound semantic prediction. In source domains, certain non-semantic factors—such as background textures, atmospheric conditions, or sensor-specific artifacts—may spuriously correlate with class labels. These confounders can mislead the model into learning shortcuts that do not hold in unseen domains, leading to degraded generalization performance. The core difficulty lies in identifying and separating truly causal semantic features from such entangled spurious ones.
To formally characterize domain shifts in HSI, we adopt a Structural Causal Model (SCM) [34] to model the data-generating process, as illustrated in Figure 1a. The latent physical properties $Z$ determine the semantic label $Y$, while the observed image $X$ is influenced by both the latent properties $Z$ and the sensing scene $S$. Scene variations introduce domain-specific artifacts in $X$ but do not affect the intrinsic semantics $Y$.
Under a single-source setting ($S = s_0$), the model observes only limited interventional diversity, leading to Challenge 1. Moreover, since $X$ encodes both semantic ($Z$-related) and spurious ($S$-related) components, their entanglement leads to Challenge 2. To illustrate this, Figure 1b shows how $X$ can be decomposed into causal features $X_c$ (derived from $Z$) and non-causal features $X_n$ (originating from $S$), both of which may correlate with $Y$ in the training domain. Such spurious correlations may bias predictions when the model is deployed in unseen scenes. In light of these issues, our goal is two-fold: to simulate diverse interventions on $S$ to encourage stable feature learning, and to disentangle $X$ into $X_c$ and $X_n$ to isolate invariant semantic cues from scene-dependent noise.
To solve the aforementioned challenges and achieve our goal of robust generalization under single-source domain settings, we propose CauseHSI, a causality-inspired framework composed of two key components: the Counterfactual Generation Module (CGM) and the Causal Disentanglement Module (CDM).
To tackle Challenge 1, CGM adopts a causal perspective to simulate interventional diversity under the single-source setting. Based on the SCM in Figure 1a, where domain shifts stem from variations in the sensing condition $S$, we approximate interventions $\mathrm{do}(S = s)$, where the do-operator denotes an intervention that externally sets $S$ to the value $s$ [35], through controlled perturbations of the observed image $X$ in the frequency domain. This approximation does not aim to reproduce exact physical sensing processes; instead, it provides an operational means to emulate plausible sensing-induced variations while preserving semantic consistency. Motivated by findings that domain-specific artifacts tend to concentrate in extreme frequency bands [28,36,37,38], we decompose the frequency representation of $X$ and apply structured Gaussian noise to the low and high frequencies, which serve as a practical approximation of scene-sensitive variations. Mid-frequency components, which are relatively robust to sensing changes, are preserved and act as scene-robust carriers of semantic information. To further ensure semantic fidelity, CGM incorporates two complementary mechanisms: injecting the central spectral signature to retain class-discriminative cues, and applying mild spatial randomization to enrich local diversity without disrupting spatial structure. In addition, a style-controlled discrepancy loss explicitly constrains the magnitude of perturbations, preventing excessive deviations that could compromise label consistency. Through these constrained interventions in the frequency, spectral, and spatial domains, CGM generates counterfactual samples [39], namely synthetic data instances derived by perturbing domain-specific factors while preserving class-discriminative semantics.
These counterfactual samples effectively expand the range of sensing conditions beyond the single-source domain, thereby enriching interventional diversity and alleviating the generalization limitations described in Challenge 1.
To address the second challenge, we propose the CDM, guided by three principles: causal independence, cross-domain consistency, and semantic completeness. The detailed theoretical foundation is elaborated in Section 3.3. CDM adopts a dual-branch structure to isolate causal and non-causal features based on marginal independence assumptions. To enhance semantic completeness and ensure consistency across domains, we design a Causal Reassembly Module (CRM), which reconfigures features in the frequency domain by decomposing non-causal features into high- and low-frequency components and recombining them with causal representations. This reconstruction enforces complementary constraints that encourage the disentangled causal branch to capture more authentic, invariant semantics.
In summary, CauseHSI enables robust generalization under single-source settings by simulating interventional shifts through controlled frequency perturbations and disentangling domain-invariant causal semantics via principled architectural constraints, all from a causality-inspired perspective. The major contributions of the proposed method are summarized as follows.
1. We present a novel perspective on cross-scene HSI classification under single-source DG by framing the problem within an SCM, which explicitly accounts for sensing-induced interventions and their effects on feature entanglement.
2. To simulate unseen domain shifts, we introduce CGM, which perturbs extreme frequency components of HSI data in a controlled manner, generating semantically consistent counterfactual samples. This module exposes the model to diverse sensing conditions, enhancing robustness.
3. We propose CDM to explicitly disentangle causal and non-causal representations using a dual-branch architecture and a causal reassembly mechanism. This enables the model to isolate invariant semantic cues from domain-specific artifacts.
The remainder of this paper is organized as follows. Section 2 introduces related works pertinent to this study. Section 3 elaborates on the proposed methodology in detail. Section 4 reports experimental results and comparative analyses. Finally, Section 5 concludes the paper with a summary of the proposed approach and future directions.

2. Related Work

2.1. Single-Source Domain Generalization

DG aims to learn a model from one or multiple source domains that generalizes well to unseen target domains with distinct distributions. Among various DG settings, single-source DG presents a particularly challenging scenario, as models must achieve generalization without exposure to target domain data or multiple source distributions. This setting is especially relevant in hyperspectral remote sensing, where acquiring data from multiple domains is often impractical due to high labeling costs and acquisition constraints.
In single-source DG, the core challenge lies in mitigating the overfitting of models to source-specific characteristics while extracting domain-invariant features that can generalize across diverse, unseen environments. The lack of cross-domain supervision exacerbates this problem, making it difficult for models to disentangle semantic content from domain-specific noise.
To overcome this, recent works have explored domain diversification within the source domain itself, particularly through generative models that synthesize pseudo-domains to mimic unseen distributions. SDEnet [30] introduces an encoder-randomization-decoder framework that incorporates spatial and spectral perturbations to generate diverse domain variants, further enhanced by supervised contrastive learning and adversarial objectives. LLURNet [31] adopts a locally linear unbiased randomization approach using symmetric autoencoders to embed style variations while preserving semantic consistency, with contrastive regularization to stabilize training. Beyond random perturbations, D3Net [40] utilizes a domain-adversarial generator and dual-branch discriminators to extract domain-agnostic features, offering a more structured alternative to handcrafted augmentations. ISDGS [41] extends this line of work by introducing semantic-style covariance generation and spatial shuffling mechanisms, jointly optimized under a dual-sampling adversarial contrastive framework to enhance semantic preservation and domain robustness.
These methods demonstrate the effectiveness of expanding the source domain through synthetic variations to simulate domain shifts and enhance generalization. However, balancing domain perturbation with semantic and spectral integrity remains an open challenge, motivating further innovations in structured domain expansion techniques.
In parallel, a number of recent studies have focused on improving HSI classification through more powerful spectral-spatial representation learning rather than explicit domain generalization. For instance, hierarchical transformer architectures enhanced with deformable convolutions and frequency-aware attention mechanisms have been proposed to better capture local homogeneity and long-range dependencies in complex HSI [42]. Other works incorporate context-aware masking and diffusion-guided feature refinement to improve robustness against spectral-spatial perturbations and label scarcity [43]. Additionally, spectral-spatial perception networks tailored for specific applications, such as mineral hyperspectral classification, leverage frequency-domain spatial modeling and fine-grained spectral attention to enhance discrimination among visually similar classes [44]. While these approaches achieve impressive performance under in-domain or task-specific settings, they generally rely on correlation-driven representations and implicitly assume similar data distributions between training and testing, limiting their robustness under cross-scene domain shifts.

2.2. Feature Disentanglement

A core challenge in domain generalization is to separate semantic representations from domain-specific factors within visual features. Feature disentanglement addresses this issue by structuring the learned representation space into independent components that reflect distinct underlying attributes—typically, domain-invariant semantics and domain-related variations. This becomes especially important in single-source domain generalization, where models must rely solely on source data or generated variants to learn representations that are robust to unseen domain shifts. By explicitly decoupling domain-specific and domain-invariant factors, feature disentanglement enables more stable semantic learning and improves generalization across diverse target domains.
Recent studies have explored various strategies to integrate feature disentanglement into domain generalization frameworks. In hyperspectral scenarios, FDFSL [32] introduces an orthogonal low-rank decomposition to suppress source-induced bias and fuses heterogeneous spectral information using a multi-order spectral interaction block with positional encoding, enabling more robust few-shot generalization across domains. To address cross-scene domain shifts, DSDGnet [33] progressively separates domain-invariant and domain-specific features through Transformer-based style transfer and disentanglement modules, supported by a domain combination mechanism that reinforces disentanglement accuracy. Similarly, S4DL [45] employs a gradient-guided spectral-spatial decomposition to disentangle domain-specific and domain-invariant representations in hyperspectral data, further incorporating a shift-sensitive adaptive monitor to adjust disentangling intensity according to the magnitude of domain shift. From a causal perspective, an early-branching framework [46] introduces diverging causal and non-causal feature branches after a shared encoder, effectively reducing entanglement and improving semantic robustness through random domain sampling. In contrast to approaches focused solely on domain invariance, the mDSDI framework [47] jointly learns both invariant and specific features using a meta-optimization strategy, leveraging task-relevant domain-specific knowledge to enhance generalization. Additionally, the CSD method [48] decouples classifier weights into common and specific components via linear low-rank decomposition, providing theoretical identifiability of invariant features and improving efficiency by removing domain-specific components during inference.
These studies demonstrate the effectiveness of feature disentanglement in enhancing domain generalization by promoting semantic stability and reducing domain interference. However, they often lack rigorous constraints ensuring that the decomposed features remain semantically complete and consistent across domains, which can lead to information loss or degraded generalization under distribution shifts.

3. Methods

The proposed CauseHSI framework, depicted in Figure 2, is composed of two key modules: CGM and CDM. Given source domain hyperspectral samples, CGM generates counterfactual samples by perturbing domain-specific components while preserving semantic content. These counterfactuals simulate plausible distribution shifts, enabling the construction of a source–counterfactual domain pair that reflects potential domain variations. Both original and counterfactual samples are then fed into CDM, which adopts a dual-branch architecture to disentangle features into causal and non-causal components under marginal independence assumptions. A Causal Reassembly Module further refines these representations by enforcing semantic completeness and consistency. Reconstruction and classification losses guide the network to extract domain-invariant causal features. Ultimately, classification is performed based on the causal features, which promote robust generalization across unseen domains.

3.1. Causal Formulation of Domain Shifts

To provide a rigorous foundation for our framework, we formalize the SCM underlying HSI classification. As illustrated in Figure 1a, four key variables are considered: latent physical properties Z, sensing scene S, observed image X, and semantic label Y. Their relationships are characterized as:
$$Z \sim P(Z), \qquad Y \leftarrow h(Z), \qquad X \sim P(X \mid Z, S).$$
Here, $P(Z)$ captures the intrinsic distribution of physical properties, $h(\cdot)$ denotes the stable semantic mechanism that maps physical attributes to class labels, and $P(X \mid Z, S)$ models the imaging process in which the sensing scene $S$ introduces domain-specific variations. This formulation explicitly reflects that $Y$ is causally determined by $Z$ and is invariant to $S$, while $S$ acts as an interventional variable that perturbs the distribution of $X$ without affecting the semantic mechanism.
It is worth noting that although $Y$ is causally generated from $Z$, the observed image $X$ may exhibit spurious statistical correlations with $Y$ in the training domain. Such correlations arise from the fixed sensing condition $S$ and are illustrated as a dashed arrow from $X$ to $Y$ in Figure 1a. This dashed edge does not represent a causal influence, but rather an observational dependence induced by domain-specific biases. Under a single-source setting, all training samples are collected under a fixed sensing condition $S = s_0$. As a result, the model is exposed to only a limited range of sensing variations, which restricts the diversity of interventional patterns and leads to Challenge 1.
From a causal perspective, the observed hyperspectral image $X$ is influenced by both intrinsic semantic factors and extrinsic sensing conditions. Rather than modeling the full physical imaging process, we adopt a functional abstraction to characterize their distinct causal roles. Specifically, we conceptually decompose $X$ into a causal component $X_c = f(Z)$, which encodes semantic information determined by latent physical properties, and a non-causal component $X_n = g(S)$, which captures domain-specific variations introduced by the sensing scene:
$$X = f(Z) + g(S).$$
This formulation does not imply that the real imaging process is strictly linear or additive. Instead, it serves as a causal abstraction that highlights the separation between invariant semantic factors and scene-dependent perturbations. In practice, complex nonlinear interactions between $Z$ and $S$ may exist and are implicitly absorbed into the functional representations $f(\cdot)$ and $g(\cdot)$. The importance of this abstraction lies in enabling a clear causal interpretation of domain shifts, where variations in $S$ can be viewed as interventions that alter $X_n$ while leaving the semantic mechanism $Z \rightarrow Y$ invariant. Since $S$ remains fixed in the source domain, the non-causal component $X_n$ may become spuriously correlated with the label $Y$. Models trained on such data tend to exploit these domain-specific cues, resulting in unstable predictions when encountering unseen sensing conditions. This phenomenon gives rise to Challenge 2.
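To make the additive abstraction concrete, the NumPy sketch below builds a toy instance of $X = f(Z) + g(S)$. The functions `h`, `f`, and `g`, the 8-band spectra, and the scene values `s0` and `s1` are hypothetical choices for illustration only and are not part of CauseHSI. A threshold rule calibrated under $S = s_0$ implicitly absorbs the fixed scene component $g(s_0)$; under $\mathrm{do}(S = s_1)$ the labels $Y = h(Z)$ are unchanged, yet the same rule degrades, a simple analogue of the unstable predictions described above.

```python
import numpy as np

N_BANDS = 8                                  # hypothetical number of spectral bands
BANDS = np.linspace(0.0, 1.0, N_BANDS)       # normalized band positions

def h(z):
    """Toy semantic mechanism Z -> Y: class 1 iff the latent property exceeds 0.5."""
    return (z > 0.5).astype(int)

def f(z):
    """Toy causal component X_c = f(Z): a material-dependent spectral profile."""
    return z[:, None] * np.sin(np.pi * BANDS)[None, :]

def g(s, n):
    """Toy non-causal component X_n = g(S): a scene-dependent spectral offset."""
    return np.tile(s * (0.2 + BANDS), (n, 1))

def observe(z, s):
    """Observed spectra under sensing condition S = s: X = f(Z) + g(S)."""
    return f(z) + g(s, len(z))

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=1000)         # Z ~ P(Z)
y = h(z)                                     # Y <- h(Z); S plays no role

s0, s1 = 0.3, 0.8                            # source scene and an intervened scene
x_src = observe(z, s0)                       # single-source data, S = s0
x_int = observe(z, s1)                       # same Z under do(S = s1)

# A rule calibrated on the source: threshold the band-averaged value.
# The threshold silently absorbs the fixed scene offset g(s0).
m = np.sin(np.pi * BANDS).mean()
t = 0.5 * m + (s0 * (0.2 + BANDS)).mean()

def predict(x):
    return (x.mean(axis=1) > t).astype(int)

acc_src = (predict(x_src) == y).mean()
acc_int = (predict(x_int) == y).mean()
print(f"source accuracy: {acc_src:.3f}, accuracy under do(S = s1): {acc_int:.3f}")
```

Only the non-causal part of the input moves under the intervention (`x_int - x_src` equals `g(s1) - g(s0)`), while `f(z)` and `y` stay fixed.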

3.2. Counterfactual Generation Module (CGM)

Prior studies [36,49] have shown that the extreme frequency components of images often contain domain-private patterns, which are sensitive to sensing conditions rather than reflecting intrinsic semantics. Building on this observation, we interpret sensing variations as interventions on the path $S \rightarrow X$ in our SCM (Figure 1a), where the sensing scene $S$ introduces domain-specific artifacts into the observed image $X$ without altering the underlying semantics $Y$. By perturbing frequency-sensitive components, CGM mimics such scene-induced variations while preserving the semantic consistency inherited from the latent properties $Z$. This design allows the generated counterfactual samples to reflect plausible sensing changes, thereby enhancing the robustness of the model under distributional shifts.
CGM does not aim to reproduce physical sensing processes exactly or to model explicitly parameterized sensing variables. Instead, it provides a controlled and operational approximation of plausible sensing variations by selectively perturbing empirically scene-sensitive components while enforcing spectral, spatial, and semantic consistency. This design allows CGM to generate label-consistent counterfactual samples that reflect realistic sensing changes, rather than arbitrary noise injections. The overall architecture of CGM is illustrated in Figure 3 and consists of three coordinated branches.

3.2.1. Frequency-Based Intervention

Given an input hyperspectral patch $X \in \mathbb{R}^{p \times p \times C}$, we perform a discrete cosine transform (DCT) on each spectral channel to obtain its frequency representation $F_R \in \mathbb{R}^{p \times p \times C}$. As shown in Figure 1a, under the structural model $X \leftarrow (Z, S)$, domain shifts are primarily induced by variations in the sensing condition $S$, while the latent physical properties $Z$ remain invariant across scenes. To simulate interventions on the nuisance factor $S$ under a fixed source distribution $S = s_0$, we perform frequency-domain perturbations as an operational approximation of the structural intervention $\mathrm{do}(S = s)$.
Rather than adopting a hard frequency cutoff, we employ a soft frequency weighting strategy in the DCT domain. Frequency components are ordered according to their radial distance from the DC component, and a fixed, smoothly varying weighting profile is applied to modulate different frequency regions. This weighting profile assigns higher perturbation strength to extreme low- and high-frequency components, which are empirically more sensitive to sensing conditions and scene-dependent distortions, while mid-band frequencies receive minimal perturbation and are thus treated as relatively scene-robust semantic carriers.
Formally, the frequency representation is decomposed via soft weighting into two complementary components, a scene-sensitive component $F_R^d$ and a scene-robust component $F_R^c$, satisfying $F_R = F_R^d + F_R^c$. The intervention $\mathrm{do}(S = s)$ is approximated by applying stochastic multiplicative perturbations to $F_R^d$, i.e., $\tilde{F}_R^d = F_R^d \odot (1 + \epsilon)$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, while keeping $F_R^c$ unchanged. This multiplicative formulation perturbs the magnitude of scene-sensitive frequency components without altering their spatial-frequency structure, thereby avoiding unrealistic artifacts. The intervened frequency representation is then reconstructed as $\tilde{F}_R = \tilde{F}_R^d + F_R^c$ and transformed back into the spatial domain via the inverse DCT to produce the perturbed image $X_{\mathrm{fre}}$.
Finally, $X_{\mathrm{fre}}$ is encoded by a frequency encoder (FreqEncoder) composed of convolutional blocks to yield a compact frequency-level representation $z_{\mathrm{fre}} \in \mathbb{R}^{1 \times 1 \times 32}$. This frequency-level decomposition provides a controllable and effective proxy for simulating sensing-related variations under limited source diversity.
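The frequency-based intervention can be sketched in NumPy as follows, assuming an orthonormal 2D DCT-II applied to each spectral channel. The text specifies only that the weighting profile is fixed, smooth, large at extreme low and high radii, and small in the mid band; the Gaussian-notch profile in the hypothetical `soft_weight` helper (with assumed `r_mid` and `width`), the radius normalization, and drawing an independent $\epsilon$ for every coefficient are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

def dct2(x, d):
    """Per-channel 2D DCT of a (p, p, C) patch: F[:, :, c] = D @ x[:, :, c] @ D.T."""
    return np.einsum("ij,jkc,lk->ilc", d, x, d)

def idct2(f, d):
    """Inverse of dct2; D is orthonormal, so F[:, :, c] -> D.T @ F[:, :, c] @ D."""
    return np.einsum("ij,ilc,lk->jkc", d, f, d)

def soft_weight(p, r_mid=0.5, width=0.15):
    """Hypothetical smooth profile: close to 1 at extreme low/high radii, 0 at r_mid."""
    u, v = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    r = np.sqrt(u**2 + v**2) / (np.sqrt(2.0) * (p - 1))   # 0 at DC, 1 at far corner
    return 1.0 - np.exp(-((r - r_mid) ** 2) / (2.0 * width**2))

def frequency_intervention(x, sigma=0.2, rng=None):
    """Approximate do(S = s) by multiplicative noise on scene-sensitive DCT terms.

    x: (p, p, C) patch with p >= 2. Returns the perturbed patch X_fre.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = x.shape[0]
    d = dct_matrix(p)
    f_r = dct2(x, d)                                  # F_R
    w = soft_weight(p)[:, :, None]                    # shared across channels
    f_d, f_c = w * f_r, (1.0 - w) * f_r               # F_R = F_R^d + F_R^c
    eps = rng.normal(0.0, sigma, size=f_d.shape)      # eps ~ N(0, sigma^2)
    f_tilde = f_d * (1.0 + eps) + f_c                 # perturb F_R^d only
    return idct2(f_tilde, d)

rng = np.random.default_rng(0)
patch = rng.random((9, 9, 48))                        # hypothetical 9x9 patch, 48 bands
x_fre = frequency_intervention(patch, sigma=0.2, rng=rng)
```

Because $F_R^c$ passes through untouched and the DCT is orthonormal, setting `sigma=0` returns the input patch up to floating-point error, which is a quick sanity check that the decomposition $F_R = F_R^d + F_R^c$ is exact.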

3.2.2. Spectral Consistency Preservation

To retain class-discriminative information, we enhance the perturbed sample with its central spectral signature (the spectral vector of the center pixel). This spectral vector is passed through a SpeEncoder composed of fully connected layers with ReLU activations and a residual connection, producing the spectral embedding $z_{\mathrm{spe}} \in \mathbb{R}^{1 \times 1 \times 32}$. The concatenation $[z_{\mathrm{fre}}; z_{\mathrm{spe}}]$ is further processed by a spectral-level randomization module [30], producing a fused feature $z_g \in \mathbb{R}^{p \times p \times 64}$ that preserves spectral integrity while introducing controlled variability.
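A minimal sketch of the spectral branch follows. The center-pixel extraction mirrors the text, but `SpeEncoderSketch`, its random weights, its layer arrangement (FC with ReLU, then a residual FC block), and the placeholder `z_fre` are hypothetical. The spectral-level randomization module from [30] is not reproduced, so the sketch stops at the 64-channel concatenation $[z_{\mathrm{fre}}; z_{\mathrm{spe}}]$.

```python
import numpy as np

def center_spectrum(patch):
    """Spectral vector of the center pixel of a (p, p, C) patch (p odd)."""
    c = patch.shape[0] // 2
    return patch[c, c, :]

class SpeEncoderSketch:
    """Hypothetical stand-in for SpeEncoder: FC + ReLU, then a residual FC + ReLU block."""

    def __init__(self, in_dim, out_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, out_dim))
        self.w_res = rng.normal(0.0, 1.0 / np.sqrt(out_dim), (out_dim, out_dim))

    def __call__(self, v):
        h = np.maximum(v @ self.w_in, 0.0)             # project C bands to out_dim
        return h + np.maximum(h @ self.w_res, 0.0)     # residual connection

rng = np.random.default_rng(1)
patch = rng.random((9, 9, 48))                          # hypothetical 9x9 patch, 48 bands
z_spe = SpeEncoderSketch(48)(center_spectrum(patch)).reshape(1, 1, 32)
z_fre = rng.random((1, 1, 32))                          # placeholder for the FreqEncoder output
z_cat = np.concatenate([z_fre, z_spe], axis=-1)         # [z_fre; z_spe], shape (1, 1, 64)
```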

3.2.3. Spatial Style Perturbation

In parallel, we extract spatial features $z_{\mathrm{spa}} \in \mathbb{R}^{p \times p \times 3}$ from the original patch using SpaEncoder, a shallow CNN. To increase spatial diversity while maintaining structural coherence, we apply Adaptive Instance Normalization (AdaIN):
$$\mathrm{AdaIN}(z, \mu', \sigma') = \sigma' \, \frac{z - \mu}{\sigma} + \mu',$$
where $(\mu, \sigma)$ are the per-channel statistics of $z$, and $(\mu', \sigma')$ are randomly sampled from other spatial features within the batch. This transformation perturbs spatial styles without disrupting the semantic layout.
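The AdaIN step with batch-borrowed statistics can be sketched as below, assuming channels-last arrays of shape (B, H, W, C). Rolling the batch index by a random nonzero offset is one hypothetical way to guarantee that each sample borrows $(\mu', \sigma')$ from a different sample; the small `eps` guards against division by zero.

```python
import numpy as np

def adain(z, mu_t, sigma_t, eps=1e-5):
    """AdaIN: normalize z per sample and channel, then apply target statistics.

    z: (B, H, W, C); mu_t, sigma_t: (B, 1, 1, C) target mean and std.
    """
    mu = z.mean(axis=(1, 2), keepdims=True)
    sigma = z.std(axis=(1, 2), keepdims=True) + eps
    return sigma_t * (z - mu) / sigma + mu_t

def random_style_swap(z, rng):
    """Borrow (mu', sigma') from a different sample of the batch (B >= 2)."""
    b = z.shape[0]
    mu = z.mean(axis=(1, 2), keepdims=True)
    sigma = z.std(axis=(1, 2), keepdims=True)
    shift = rng.integers(1, b)                  # nonzero offset, so never the same sample
    idx = (np.arange(b) + shift) % b
    return adain(z, mu[idx], sigma[idx]), idx

rng = np.random.default_rng(0)
z_spa = rng.random((4, 9, 9, 3))                # hypothetical batch of z_spa maps
z_styled, idx = random_style_swap(z_spa, rng)
```

After the swap, each output map carries the per-channel mean and standard deviation of its donor sample, while its normalized spatial pattern (and hence layout) is its own.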
The outputs of the spectral and spatial branches are concatenated and passed through a Decoder to produce the final counterfactual sample $X_{\mathrm{cf}}$. To ensure semantic alignment while promoting stylistic diversity, we introduce a style-controlled discrepancy loss:
$$\mathcal{L}_{\mathrm{control}} = \alpha \cdot \left\lVert \mathrm{Gram}(z_f) - \mathrm{Gram}(\tilde{z}_f) \right\rVert_2^2 + \beta \cdot \Big[ \max\big(0, \tau_{\min} - \lVert z_f - \tilde{z}_f \rVert_2\big) + \max\big(0, \lVert z_f - \tilde{z}_f \rVert_2 - \tau_{\max}\big) \Big],$$
where $z_f$ denotes the style feature extracted from $X$, and $\tilde{z}_f$ represents the style feature extracted from the counterfactual sample $X_{\mathrm{cf}}$. The first term ensures a sufficient style gap via the Gram matrix distance, and the second term constrains the perturbation strength within a controlled range $[\tau_{\min}, \tau_{\max}]$.
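The loss can be transcribed into NumPy as follows. This sketch follows the equation as printed, with the Gram term weighted by $\alpha$ and both hinge terms weighted by $\beta$; the Gram normalization by $H \cdot W$, the use of the squared Frobenius norm for the Gram distance, and the default values of `alpha`, `beta`, `tau_min`, and `tau_max` are assumptions, not values reported in the paper.

```python
import numpy as np

def gram(z):
    """Channel Gram matrix of an (H, W, C) feature map, normalized by H * W."""
    flat = z.reshape(-1, z.shape[-1])               # (H * W, C)
    return flat.T @ flat / flat.shape[0]            # (C, C)

def control_loss(z_f, z_f_cf, alpha=1.0, beta=1.0, tau_min=0.5, tau_max=2.0):
    """Style-controlled discrepancy loss, transcribed from the printed equation.

    z_f: style feature of X; z_f_cf: style feature of the counterfactual X_cf.
    """
    style_term = np.sum((gram(z_f) - gram(z_f_cf)) ** 2)    # squared Frobenius norm
    dist = np.linalg.norm(z_f - z_f_cf)                      # ||z_f - z~_f||_2
    band_term = max(0.0, tau_min - dist) + max(0.0, dist - tau_max)
    return float(alpha * style_term + beta * band_term)

z = np.ones((2, 2, 3))                              # hypothetical style features
print(control_loss(z, z))                           # → 0.5 (only the tau_min hinge fires)
print(control_loss(z, z + 0.25, alpha=0.0))         # → 0.0 (distance inside the band)
```

When $\lVert z_f - \tilde{z}_f \rVert_2$ lies inside $[\tau_{\min}, \tau_{\max}]$, both hinge terms vanish, so only the Gram term contributes.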
Through this controlled intervention in the frequency, spectral, and spatial spaces, CGM generates label-consistent counterfactuals that serve as effective augmentations for improving the model’s generalization across unseen domains.

3.3. Causality Feature Disentanglement: Theoretical Foundation

Inspired by causal inference theory, we propose a disentanglement approach that distinguishes causal features from non-causal features. This separation is crucial for learning representations that generalize across unseen domains by emphasizing invariant causal mechanisms. However, identifying causal and non-causal features solely from observational data is inherently challenging, especially in the absence of additional assumptions or constraints. It is important to clarify that we do not claim theoretical identifiability of causal and non-causal factors from single-source observational data alone. Indeed, in classical causal inference, such disentanglement is generally unidentifiable without interventional data, multiple environments, or strong prior assumptions. In our framework, this limitation is explicitly acknowledged and addressed by augmenting the observational setting with counterfactual samples generated by the proposed CGM. These counterfactual samples introduce controlled variations that simulate interventions on non-causal factors, thereby providing additional supervisory signals for learning a practically useful disentanglement. Following the insights from prior works [50,51], introducing appropriate constraints can help approximate the underlying causal structure. In the context of cross-domain generalization, we posit that the disentangled causal and non-causal features should satisfy the following three principles:
1. Causal Independence: Causal and non-causal features should be statistically independent [46]:
$$I(X_c; X_n) \approx 0,$$
where $I(\cdot\,;\cdot)$ denotes mutual information.
2. Semantic Completeness: The combination of causal and non-causal features should preserve sufficient information for accurate reconstruction or prediction:
$$H(Y \mid X_c, X_n) \approx H(Y \mid X),$$
where $H(\cdot \mid \cdot)$ denotes conditional Shannon entropy.
3. Cross-Domain Consistency: Causal features corresponding to the same semantic content should remain invariant across domains:
$$D\big(P(X_c^{(s)}), P(X_c^{(t)})\big) \approx 0,$$
where $P(X_c^{(s)})$ and $P(X_c^{(t)})$ denote the distributions of causal features in the source and target domains, and $D(\cdot, \cdot)$ is a distributional distance (e.g., MMD).
These three principles jointly form the theoretical foundation of our causality-inspired disentanglement framework; we incorporate them into the model through architectural design and loss constraints to enable robust generalization under domain shifts.
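Principle 3 leaves the distance $D(\cdot,\cdot)$ open and names MMD as one example. As an illustration only, the following plain-Python sketch computes the standard biased empirical estimate of the squared MMD with a Gaussian (RBF) kernel between two small sets of feature vectors. The function names and the fixed bandwidth `sigma` are our assumptions; the text does not state that CauseHSI evaluates this quantity during training, which would not be possible anyway because target data are unavailable in the single-source setting.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd2_biased(xs, ys, sigma=1.0):
    """Biased empirical estimate of the squared MMD between samples xs and ys."""
    k_xx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


# Toy "causal features": identical sets give zero, a shifted set gives a large value.
src = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
shifted = [[3.0, 3.1], [3.2, 3.0], [3.1, 3.1]]
```

A small MMD between source and target causal features is what Principle 3 asks for; the toy sets simply show the estimator separating matched from shifted distributions.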

3.4. Independence Constraint: Marginally Independent Representation Decomposition

To facilitate the disentanglement of causal and non-causal features, we adopt the marginal independence assumption, which posits that the two factors should be statistically independent. This assumption is widely used in causal representation learning and domain generalization [46,50,52], and it helps prevent spurious correlations between invariant semantics and domain-specific variations. To implement this assumption, we design a dual-branch network inspired by the early-branching strategy [46], as shown in Figure 4. The network begins with a Spectral-Spatial Fusion Module (SSFM), which extracts low-level spectral-spatial representations from the hyperspectral input. The SSFM integrates channel-wise spectral cues and local spatial structures through parallel convolutions and fuses them to obtain the shallow feature map $F \in \mathbb{R}^{H \times W \times C}$.
Notably, the independence constraint is not imposed solely on source domain observations. Instead, it is jointly applied to source samples and their corresponding counterfactual variants generated by CGM. By exposing the model to paired samples that share semantic content but differ in non-causal factors, the disentanglement process is guided by contrastive supervisory signals that approximate interventional variation. This design alleviates the inherent identifiability ambiguity in single-source observational data and enables the model to separate invariant semantic features from domain-specific variations in a pragmatic manner.
The feature map $F$ is fed into two separate encoders: a semantic encoder $E_c(\cdot)$ that extracts causal features $F_c$ and a domain encoder $E_{nc}(\cdot)$ that extracts non-causal features $F_{nc}$. Inspired by CVSSN [53], each encoder consists of a pointwise convolution group, a depthwise convolution group, and a module tailored to its specific semantics: $E_c$ emphasizes mid-frequency information via 3 × 3 convolutions, capturing fine-grained structural and texture patterns that are generally domain-invariant, whereas $E_{nc}$ uses 7 × 7 convolutions to aggregate broader context and low-frequency trends, with additional dilated convolutions that retain detail and detect high-frequency, domain-specific variations.
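To make the receptive-field contrast between the two encoders concrete, the sketch below applies the standard formula for the effective extent of a dilated kernel, $k + (k-1)(d-1)$, together with the receptive field of stacked stride-1 convolutions. The dilation rate is not reported in the text, so the value $d = 2$ and the two-layer stack used for $E_{nc}$ are hypothetical.

```python
def effective_kernel(k, dilation=1):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions given (kernel, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf


# E_c-style 3x3 kernel vs. an E_nc-style 7x7 kernel followed by a dilated 3x3
# (the dilation rate 2 is a hypothetical choice).
causal_rf = stacked_receptive_field([(3, 1)])              # 3
noncausal_rf = stacked_receptive_field([(7, 1), (3, 2)])   # 11
```

Dilation enlarges the covered extent without adding weights, which is why it pairs naturally with the wider-context role assigned to $E_{nc}$.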
To ensure that $F_c$ and $F_{nc}$ are statistically independent, we introduce a regularization term based on the Hilbert-Schmidt Independence Criterion (HSIC) [54], a kernel-based measure of statistical dependence. Given two variables $P$ and $Q$ (in our case, $F_c$ and $F_{nc}$), the empirical HSIC is computed as:
$$\mathrm{HSIC}(P, Q) = \frac{1}{(n-1)^2}\,\mathrm{Tr}\left(K_P H K_Q H\right),$$
where $K_P$ and $K_Q$ are Gram matrices computed using RBF kernels over the samples of $P$ and $Q$, respectively; $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix; and $n$ is the batch size.
We define the independence loss as:
$$\mathcal{L}_{\mathrm{indep}} = \mathrm{HSIC}(F_c, F_{nc}),$$
which is minimized during training to reduce the dependency between the two feature branches, thereby promoting better disentanglement.
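For readers who want to check the estimator numerically, the following minimal sketch evaluates the biased empirical HSIC defined above in plain Python, with RBF Gram matrices and the centering matrix $H$. It is an illustration under assumptions rather than the authors' code: the fixed bandwidth `sigma` is hypothetical (the kernel bandwidth is not reported), and a training implementation would use a differentiable tensor library instead of nested lists.

```python
import math


def rbf_gram(samples, sigma=1.0):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))
             for y in samples] for x in samples]


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def hsic(P, Q, sigma=1.0):
    """Biased empirical HSIC: Tr(K_P H K_Q H) / (n - 1)^2, with H = I - 11^T / n."""
    n = len(P)
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    M = matmul(matmul(rbf_gram(P, sigma), H), matmul(rbf_gram(Q, sigma), H))
    return sum(M[i][i] for i in range(n)) / (n - 1) ** 2


# Four samples of 1D features: a perfectly dependent pair and a constant feature.
f_c = [[0.0], [1.0], [2.0], [3.0]]
f_const = [[5.0], [5.0], [5.0], [5.0]]
```

Because the rows and columns of $H$ sum to zero, a constant (uninformative) feature yields an HSIC of zero, whereas perfectly dependent features yield a positive value.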

3.5. Completeness and Consistency Constraint: Frequency-Aware Reconstruction Strategy

While marginal independence ensures that causal and non-causal representations do not share statistical dependencies, this alone does not guarantee that they together encode complete and semantically meaningful information. To further enhance disentanglement, we propose a reconstruction process (illustrated in Figure 5) that imposes two complementary constraints—causal completeness and causal consistency—corresponding to Principle 2 and Principle 3.
At the core of this process lies our novel Causal Reassembly Module (CRM), which performs a frequency-aware fusion of causal and non-causal features. Specifically, given the causal feature $F_c$ and the non-causal feature $F_{nc}$, we first project them into the DCT domain: $S_c = \mathrm{DCT}(F_c)$ and $S_{nc} = \mathrm{DCT}(F_{nc})$. We then partition $S_{nc}$ into a low-frequency component $S_{nc}^{\mathrm{low}}$ and a high-frequency component $S_{nc}^{\mathrm{high}}$. Treating $S_c$ as the middle-frequency component, we concatenate the three in frequency order to obtain a reassembled spectrum $S_{re}$, which is then transformed back to the spatial domain via the inverse DCT: $F_{re} = \mathrm{IDCT}(S_{re})$. This frequency-aware reassembly allows the model to maintain semantic consistency across domains while preserving both coarse and fine structural details.
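The reassembly step can be sketched as follows for a single 1D feature sequence, using an orthonormal DCT-II and its inverse. This is a simplified illustration under stated assumptions: the actual CRM presumably operates on 2D feature maps, and the band boundaries `lo_end` and `hi_start` are hypothetical parameters, since the text does not specify where the low, middle, and high frequencies are split. Here, "concatenation in frequency order" is read as taking the low and high bands of $S_{nc}$ and the middle band of $S_c$, so that $S_{re}$ has the same length as the inputs.

```python
import math


def _scale(k, n):
    # Orthonormal DCT scaling factor for coefficient k of an n-point transform.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    return [_scale(k, n) * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def idct(X):
    """Inverse of the orthonormal DCT-II (the orthonormal DCT-III)."""
    n = len(X)
    return [sum(_scale(k, n) * X[k] * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
            for t in range(n)]


def reassemble(f_c, f_nc, lo_end, hi_start):
    """Low band [0, lo_end) and high band [hi_start, n) from S_nc, middle band from S_c."""
    s_c, s_nc = dct(f_c), dct(f_nc)
    s_re = s_nc[:lo_end] + s_c[lo_end:hi_start] + s_nc[hi_start:]
    return idct(s_re)


f_c = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, 1.5]
f_nc = [0.2, 0.1, -0.3, 0.4, 0.0, 0.5, -0.2, 0.1]
f_re = reassemble(f_c, f_nc, lo_end=2, hi_start=6)   # hypothetical band boundaries
```

Because the transform is orthonormal, the reassembled feature carries exactly the chosen bands: its mid-band DCT coefficients equal those of $F_c$, and its outer bands equal those of $F_{nc}$.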
Finally, the reassembled feature $F_{re}$ is passed through a lightweight decoder $D_{rec}$ to reconstruct the image: $X_{re} = D_{rec}(F_{re})$. To ensure faithful reconstruction, we define an $\ell_1$ loss between the original image $X$ and the reconstructed image $X_{re}$:
$$\mathcal{L}_{\mathrm{rec}} = \lVert X_{re} - X \rVert_1.$$
To impose the aforementioned constraints, we apply the reconstruction process with different feature combinations as input, enabling the model to learn both the completeness and consistency of disentangled features across domains.

3.5.1. Causal Completeness Constraint (Principle 2)

To ensure that the combined representations $F_c$ and $F_{nc}$ capture the full semantic content of the input, we reconstruct both the source image and its counterfactual sample from the reassembled features. The completeness constraint is enforced via the reconstruction losses from both domains:
$$\mathcal{L}_{\mathrm{comp}} = \mathcal{L}_{\mathrm{rec}}^{\mathrm{src}} + \mathcal{L}_{\mathrm{rec}}^{\mathrm{cf}},$$
where $\mathcal{L}_{\mathrm{rec}}^{\mathrm{src}}$ denotes the reconstruction loss of the original source image $X$, and $\mathcal{L}_{\mathrm{rec}}^{\mathrm{cf}}$ denotes that of the counterfactual sample $X^{\mathrm{cf}}$.

3.5.2. Causal Consistency Constraint (Principle 3)

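To promote domain-invariant semantics in $F_c$, we conduct cross-domain reassembly by combining source-domain causal features $F_c^{\mathrm{src}}$ with counterfactual-domain non-causal features $F_{nc}^{\mathrm{cf}}$ for reconstruction. The corresponding consistency reconstruction loss is defined as:
$$\mathcal{L}_{\mathrm{cons}}^{\mathrm{rec}} = \big\lVert D_{rec}\big(\mathrm{CRM}(F_c^{\mathrm{src}}, F_{nc}^{\mathrm{cf}})\big) - X^{\mathrm{cf}} \big\rVert_1.$$
In addition, we enforce consistency directly in the causal feature space by minimizing the distance between the causal features of the source and counterfactual domains:
$$\mathcal{L}_{\mathrm{cons}}^{\mathrm{feat}} = \big\lVert F_c^{\mathrm{src}} - F_c^{\mathrm{cf}} \big\rVert_2.$$
The total causal consistency loss is:
$$\mathcal{L}_{\mathrm{cons}} = \mathcal{L}_{\mathrm{cons}}^{\mathrm{rec}} + \mathcal{L}_{\mathrm{cons}}^{\mathrm{feat}}.$$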
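A minimal numerical sketch of the completeness and consistency terms is given below, operating on flattened vectors. The reconstructions (including $D_{rec}(\mathrm{CRM}(F_c^{\mathrm{src}}, F_{nc}^{\mathrm{cf}}))$) are assumed to be computed elsewhere and passed in; the helper names are hypothetical, and the norms follow the notation above literally (sums rather than the per-element means that implementations often use).

```python
import math


def l1(a, b):
    """||a - b||_1 (sum of absolute differences)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    """||a - b||_2 (Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def completeness_loss(x_src, rec_src, x_cf, rec_cf):
    """L_comp = L_rec^src + L_rec^cf."""
    return l1(rec_src, x_src) + l1(rec_cf, x_cf)


def consistency_loss(x_cf, rec_cross, fc_src, fc_cf):
    """L_cons = ||rec_cross - X^cf||_1 + ||F_c^src - F_c^cf||_2.

    rec_cross stands for D_rec(CRM(F_c^src, F_nc^cf)), computed elsewhere.
    """
    return l1(rec_cross, x_cf) + l2(fc_src, fc_cf)
```

The cross term is what ties Principle 3 to reconstruction: it can only be small if the source causal features, combined with counterfactual style, suffice to rebuild the counterfactual image.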
By enforcing constraints derived from the three core principles, our method encourages a principled disentanglement of causal and non-causal factors in HSI. These principles guide the learning process to isolate scene-invariant, label-relevant representations while suppressing scene-specific variations, thereby improving robustness under domain shifts.
Once the causal representation $F_c$ is extracted by the disentanglement module, it is passed through a classification head $C(\cdot)$ to predict class labels. To ensure the discriminative power of the causal features extracted from both source and counterfactual images, we supervise both predictions with the cross-entropy loss:
$$\mathcal{L}_{\mathrm{cls}}^{\mathrm{src}} = \mathrm{CE}\big(C(F_c^{\mathrm{src}}), y^{\mathrm{src}}\big), \qquad \mathcal{L}_{\mathrm{cls}}^{\mathrm{cf}} = \mathrm{CE}\big(C(F_c^{\mathrm{cf}}), y^{\mathrm{src}}\big),$$
where $\mathrm{CE}(\cdot,\cdot)$ denotes the cross-entropy loss, and $y^{\mathrm{src}}$ is the source-domain ground-truth label, which is shared by its counterfactual. The final classification loss is the sum of the two:
$$\mathcal{L}_{\mathrm{cls}} = \mathcal{L}_{\mathrm{cls}}^{\mathrm{src}} + \mathcal{L}_{\mathrm{cls}}^{\mathrm{cf}}.$$
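The two classification terms can be illustrated with a numerically stable cross-entropy on raw logits, as in the sketch below. Batching and averaging are omitted and the function names are hypothetical; the only property taken from the text is that the counterfactual sample is supervised with its source label $y^{\mathrm{src}}$.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable -log softmax(logits)[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def classification_loss(logits_src, logits_cf, y_src):
    """L_cls = CE(C(F_c^src), y_src) + CE(C(F_c^cf), y_src): the counterfactual reuses y_src."""
    return cross_entropy(logits_src, y_src) + cross_entropy(logits_cf, y_src)
```

Subtracting the maximum logit before exponentiating leaves the result unchanged mathematically but avoids overflow for large logits.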

3.6. Training Phase

In our method, the training procedure involves joint but alternating optimization of the CGM and the CDM.
We first optimize the CDM. The objective is to ensure that the extracted features satisfy the three proposed causal principles while remaining discriminative for the final classification task. The total loss $\mathcal{L}_{\mathrm{CDM}}$ for optimizing this module is formulated as:
$$\mathcal{L}_{\mathrm{CDM}} = \lambda_1\left(\mathcal{L}_{\mathrm{indep}} + \mathcal{L}_{\mathrm{comp}} + \mathcal{L}_{\mathrm{cons}}\right) + \mathcal{L}_{\mathrm{cls}},$$
where $\lambda_1$ is a hyperparameter that balances the contribution of the causal losses.
After updating the CDM, we optimize the CGM with the style-controlled discrepancy loss $\mathcal{L}_{\mathrm{control}}$ from Equation (4) and the counterfactual classification loss $\mathcal{L}_{\mathrm{cls}}^{\mathrm{cf}}$ from Equation (15). The total CGM loss is defined as:
$$\mathcal{L}_{\mathrm{CGM}} = \lambda_2\,\mathcal{L}_{\mathrm{control}} + \mathcal{L}_{\mathrm{cls}}^{\mathrm{cf}},$$
where $\lambda_2$ is a balancing weight. For simplicity, $\lambda_1$ in Equation (17) and $\lambda_2$ in Equation (18) are set to the same value ($\lambda = \lambda_1 = \lambda_2$).
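The alternating schedule can be outlined as below. This is a sketch under assumptions: the text specifies the order of updates (CDM first, then CGM) but not their granularity, so alternating once per mini-batch is our assumption, and `update_cdm` / `update_cgm` are hypothetical callables standing in for one optimizer step on each module.

```python
def cdm_objective(l_indep, l_comp, l_cons, l_cls, lam=1.0):
    """L_CDM = lambda * (L_indep + L_comp + L_cons) + L_cls."""
    return lam * (l_indep + l_comp + l_cons) + l_cls


def cgm_objective(l_control, l_cls_cf, lam=1.0):
    """L_CGM = lambda * L_control + L_cls^cf, with lambda_1 = lambda_2 = lambda."""
    return lam * l_control + l_cls_cf


def alternate(num_iters, update_cdm, update_cgm):
    """Run one CDM update followed by one CGM update per iteration.

    update_cdm / update_cgm are hypothetical callables that perform a single
    optimizer step on their module and return the resulting loss value.
    """
    history = []
    for it in range(num_iters):
        history.append(("CDM", update_cdm(it)))  # CDM is updated first
        history.append(("CGM", update_cgm(it)))  # then the CGM
    return history


# Toy usage with stand-in update functions that just evaluate the objectives.
log = alternate(
    2,
    lambda it: cdm_objective(0.1, 0.2, 0.3, 1.0, lam=1.0),
    lambda it: cgm_objective(0.5, 1.0, lam=1.0),
)
```

Sharing a single $\lambda$ between the two objectives keeps the schedule to one tunable weight, which is the setting examined in the sensitivity analysis.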

3.7. Causal Scope and Limitations

While the proposed framework is inspired by causal principles, it does not aim to identify the true underlying causal graph of hyperspectral image formation. Instead, CauseHSI adopts a causally motivated structural abstraction in which sensing scenes are treated as interventions that induce distribution shifts on the observed data. Within this formulation, the goal is not causal discovery, but to learn feature representations that are operationally consistent with causal assumptions under scene variations.
Specifically, the proposed causal disentanglement is guided by a set of practical criteria. These criteria serve as inductive biases that discourage the exploitation of scene-specific spurious correlations, rather than as formal guarantees of causal identifiability. Similarly, the counterfactual generation module provides a controlled and operational approximation of plausible sensing variations, instead of an exact simulation of physical sensing processes. As a result, the causal properties enforced by the framework should be understood as consistency-oriented constraints that improve robustness under domain shifts, rather than as theoretically proven causal correctness. This design choice is aligned with common practice in causality-inspired representation learning for domain generalization.

4. Experiments

4.1. Datasets

To evaluate the generalization performance of our proposed method, we conduct extensive experiments on three widely used HSI datasets: Pavia, Houston, and HyRANK.

4.1.1. Pavia Dataset

The Pavia dataset consists of two urban scenes: Pavia University and Pavia Center, both captured by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The sensor originally records 115 spectral bands within the range of 430–860 nm. After preprocessing to remove noisy bands, Pavia University retains 103 bands, and Pavia Center retains 102 bands. In this study, we use the 102 shared spectral bands between the two scenes. The spatial resolution of both datasets is 1.3 m, with image sizes of 610 × 340 (University) and 1096 × 715 (Center), respectively. We focus on the seven classes that are shared between the two scenes for consistent cross-scene evaluation. The classes and the number of samples are listed in Table 1, and the pseudo-color image with ground truth map is shown in Figure 6.

4.1.2. Houston Dataset

The Houston dataset comprises two subsets: Houston2013 and Houston2018, both acquired over urban areas in Houston, Texas. Houston2013 was captured by the ITRES CASI-1500 sensor (Calgary, AB, Canada) and was provided as part of the IEEE GRSS Data Fusion Contest in 2013. It contains 144 spectral bands spanning 364–1046 nm, with a spatial resolution of 2.5 m and an image size of 349 × 1905. Houston2018, released during the 2018 Data Fusion Contest, features ultrahigh-resolution imagery (0.05 m) with 48 spectral bands over a 2384 × 601 spatial grid. The two subsets are annotated with 15 and 20 land cover categories, respectively. In this study, we adopt the 48 spectral bands and select the seven categories that are common to both subsets to ensure consistency in cross-domain evaluation. The specific number of samples is shown in Table 2. The pseudo-color image and ground truth maps are shown in Figure 7.

4.1.3. HyRANK Dataset

The HyRANK dataset is derived from Hyperion satellite imagery and includes two scenes, Dioni and Loukia, both located in Greece. Both images have a spatial resolution of 30 m, with sizes of 250 × 1376 (Dioni) and 249 × 945 (Loukia). The Hyperion sensor provides 242 spectral bands in the range of 400–2500 nm, from which 176 bands are retained after removing noisy and water absorption bands. The dataset is labeled with 14 land cover categories, among which 12 consistent classes are selected for training and evaluation in our experiments, as shown in Table 3. The pseudo-color image and ground truth maps are shown in Figure 8.

4.2. Implementation Details

All experiments are conducted using the PyTorch (version 1.8) deep learning framework on a workstation running Ubuntu 20.04.5 LTS with Linux kernel version 4.15.0. The hardware configuration includes an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz and a single NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory.
We evaluate the performance of our method using three widely adopted metrics in HSI classification: class-specific accuracy, overall accuracy (OA), and the Kappa coefficient (KC).
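For reference, these metrics can be computed from a confusion matrix as in the following sketch, where class-specific accuracy is taken as the per-class recall (correct predictions divided by the number of reference samples of that class) and the Kappa coefficient compares OA with the agreement expected by chance. The function name and input layout are illustrative choices, not part of the original evaluation code.

```python
def hsi_metrics(cm):
    """OA, class-specific accuracy, and Kappa from a confusion matrix.

    cm[i][j] counts samples whose reference class is i and predicted class is j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    oa = sum(cm[i][i] for i in range(k)) / n
    class_acc = [cm[i][i] / row_tot[i] if row_tot[i] else 0.0 for i in range(k)]
    # Chance agreement from the row (reference) and column (prediction) marginals.
    pe = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1.0 - pe)
    return oa, class_acc, kappa


oa, class_acc, kappa = hsi_metrics([[45, 5], [10, 40]])
```

In this toy example, OA is 0.85 but chance agreement is 0.5, so Kappa drops to 0.7; this gap is why Kappa is a useful complement to OA on imbalanced land-cover classes.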
The training process is configured with a batch size of 256, an input patch size of 13 × 13, and 400 training epochs. We apply L2 regularization with a weight decay of $1 \times 10^{-4}$. To assess the sensitivity of CauseHSI to key hyperparameters, we analyze three parameters: the base learning rate, $\lambda$, and the embedding dimension ($d_{se}$ in Figure 5), chosen from $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$, $\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}$, and $\{128, 256, 512, 1024\}$, respectively. As shown in Figure 9, the learning rate exhibits consistent behavior across all datasets, with $1 \times 10^{-3}$ yielding the best or near-best performance, indicating stable optimization dynamics. For $\lambda$, performance varies smoothly over a wide range of values. Although the optimal $\lambda$ differs slightly across datasets, $\lambda = 1$ remains competitive on all benchmarks, and larger values can further benefit more complex scenes, suggesting that the model does not rely on precise tuning of this parameter. Similarly, the embedding dimension exhibits a broad plateau of strong performance: dimensions of 256 and 512 achieve comparable results across datasets, indicating that CauseHSI is not overly sensitive to the specific embedding capacity. Overall, these results confirm that CauseHSI maintains stable performance under reasonable hyperparameter variations, and that the dataset-specific settings reflect minor adaptations to data characteristics rather than strict configuration dependencies.
Based on the above analysis, we fix the base learning rate to $1 \times 10^{-3}$ for all datasets. The weight $\lambda$ is selected from $\{1, 10\}$: $\lambda = 1$ yields strong performance on Pavia and Houston, while a larger value ($\lambda = 10$) is adopted for HyRANK to better accommodate its higher scene complexity. The embedding dimension is set to 512 for Pavia and HyRANK, and to 256 for Houston.
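The reported settings can be collected into a configuration sketch such as the one below. The key names and the `grid_points` helper are hypothetical, while the values are those stated in the text. The helper enumerates the full Cartesian product of the three grids; the text does not say whether the sensitivity study varied the parameters jointly or one at a time.

```python
from itertools import product

# Sensitivity-analysis grids (values as reported; key names are hypothetical).
SEARCH_GRID = {
    "base_lr": [1e-4, 1e-3, 1e-2, 1e-1],
    "lam": [1e-2, 1e-1, 1e0, 1e1],
    "embed_dim": [128, 256, 512, 1024],
}

# Settings shared by all datasets.
COMMON = {"batch_size": 256, "patch_size": 13, "epochs": 400, "weight_decay": 1e-4}

# Final per-dataset choices.
FINAL = {
    "Pavia": {"base_lr": 1e-3, "lam": 1.0, "embed_dim": 512},
    "Houston": {"base_lr": 1e-3, "lam": 1.0, "embed_dim": 256},
    "HyRANK": {"base_lr": 1e-3, "lam": 10.0, "embed_dim": 512},
}


def grid_points(grid):
    """Full Cartesian product of the grid (a hypothetical helper)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]
```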

4.3. Results and Analysis

To comprehensively evaluate the effectiveness of our proposed method in cross-scene HSI classification, we compare it against a wide range of representative and state-of-the-art approaches. Specifically, we include several recent DG methods tailored for HSI tasks, including SDEnet [30], FDGNet [28], S2ECNet [29], D3Net [40], and ISDGS [41], which are designed to learn scene-invariant representations under unseen target domains. To provide additional reference points, we also include two representative Domain Adaptation (DA) methods, DSAN [55] and TSTnet [56], for supplementary comparison. Additionally, we include two competitive methods for single-scene HSI classification, SSFTT [57] and DSNet [58], which are trained and evaluated on the same domain without explicit generalization mechanisms. All methods are evaluated using official implementations or publicly available codebases, and we carefully follow the original training protocols and hyperparameter settings to ensure fair and reliable comparisons.
For a fair comparison, all methods are trained and evaluated under exactly the same data partitioning and augmentation strategies. Specifically, considering the imbalance in sample quantities across datasets, we adopt dataset-specific data splitting: for the Pavia dataset (Pavia University as the source domain and Pavia Center as the target), 50% of the source domain data is used for training and the remaining 50% for validation; for the Houston dataset (Houston2013 as the source and Houston2018 as the target) and the HyRANK dataset (Dioni as the source and Loukia as the target), 80% of the source domain is used for training and 20% for validation. In all cases, the entire target domain is used as the test set. Furthermore, for the Houston2013 dataset, we apply data augmentation (random flip and random radiation noise) by a factor of four, which is consistently applied to all compared methods.
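A minimal sketch of the source-domain splitting and the fourfold augmentation is shown below. Several details are assumptions rather than reported facts: the split is a plain random split (the text does not say whether it is stratified by class), "a factor of four" is read as enlarging the training set to four times its size, and the radiation-noise model (a random gain in [0.9, 1.1] plus additive Gaussian noise scaled by 1/25) is an assumed form, not a confirmed detail of this work.

```python
import random


def split_source(indices, train_frac, seed=0):
    """Random train/validation split of source-domain sample indices."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_train = int(round(train_frac * len(idx)))
    return idx[:n_train], idx[n_train:]


def augment_patch(patch, rng, alpha_range=(0.9, 1.1), beta=1.0 / 25):
    """Random flips plus an assumed radiation-noise model.

    patch: H x W nested list of per-pixel spectra.
    """
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1]  # vertical flip
    alpha = rng.uniform(*alpha_range)  # random global gain
    return [[[alpha * v + beta * rng.gauss(0.0, 1.0) for v in px] for px in row]
            for row in patch]


def expand(patches, labels, factor=4, seed=0):
    """Keep the originals and add (factor - 1) augmented copies of each patch."""
    rng = random.Random(seed)
    out_x, out_y = list(patches), list(labels)
    for _ in range(factor - 1):
        for p, y in zip(patches, labels):
            out_x.append(augment_patch(p, rng))
            out_y.append(y)
    return out_x, out_y


# Pavia: 50% / 50%; Houston and HyRANK: 80% / 20%.
train_p, val_p = split_source(range(10), 0.5)
train_h, val_h = split_source(range(10), 0.8)
```

Applying the same seeded augmentation to every compared method, as the text describes, keeps the enlarged training set identical across methods.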
To account for the randomness introduced by certain modules in the compared methods, we report the classification performance using the mean ± standard deviation over multiple runs. Specifically, we fix five random seeds and compute the final results by averaging the outcomes from five independent runs using these predefined seeds. This setup ensures a more accurate and stable evaluation of all methods under consistent experimental conditions.
For DG methods, only labeled source domain data are used during training, and models are directly tested on the target domain. The same training protocol is applied to single-scene methods to assess their generalization ability under domain shift. In contrast, DA methods are trained using labeled source domain data along with an equal amount of unlabeled target domain data. Among them, DSAN requires a batch size of 32 due to its loss function design [55]. For other hyperparameters not explicitly mentioned, we follow the original settings reported in the respective papers. Table 4, Table 5 and Table 6 summarize the class-specific accuracy, OA and KC for all compared methods across the Pavia, Houston and HyRANK datasets, respectively. The visual classification results of different methods on these three datasets are illustrated in Figure 10, Figure 11 and Figure 12.
Single-scene methods (SSFTT, DSNet) lack any mechanism for handling domain shift. As expected, they perform poorly when directly applied to unseen target domains. This observation further emphasizes the necessity of developing methods that explicitly address the challenges of cross-scene hyperspectral image classification.
Among DA methods, DSAN and TSTnet benefit from access to unlabeled target-domain data during training, which allows them to partially adapt to the target distribution. However, such assumptions are not applicable in the single-source domain generalization setting considered in this work.
DG methods demonstrate varying strengths across different datasets, reflecting their distinct design principles. SDEnet shows relatively stable performance across all scenes, while FDGNet performs competitively on Houston, and D3Net achieves strong results on HyRANK. These variations suggest that different design principles—such as contrastive learning or semantic alignment—impact performance under varying scene conditions. S2ECNet further explores causality-inspired design by incorporating spectral–spatial enhancement and causal contribution constraints. By introducing causal alignment through contrastive constraints on causal contribution vectors, S2ECNet exhibits strong robustness to cross-scene variations. In comparison, CauseHSI places greater emphasis on disentangling causal and non-causal features to explicitly separate invariant semantic factors from domain-specific variations. This formulation enables the model to capture stable semantic representations while flexibly accommodating domain shifts, leading to more consistent generalization across diverse scenes.
Our proposed method consistently achieves the highest OA and KC across all three datasets. Specifically, it outperforms the strongest DG baselines by clear margins on Pavia, Houston, and HyRANK, indicating superior robustness under diverse sensing conditions. The consistent improvement in Kappa further demonstrates that the performance gains are not dominated by majority classes, but reflect a more reliable agreement between predictions and ground truth.
A closer inspection of class-wise accuracies reveals that the proposed method does not uniformly improve all land-cover categories, and that performance variations across classes are clearly observable in all three datasets. In particular, several categories—such as C5 in Pavia, C1 and C4 in Houston, and C2, C6, and C10 in HyRANK—exhibit noticeable performance degradation compared with certain competing methods.
From a causal perspective, this behavior is expected rather than anomalous. The proposed framework explicitly suppresses scene-dependent and non-causal cues through causal disentanglement and counterfactual augmentation. Consequently, land-cover categories that rely heavily on background context, illumination patterns, or other scene-specific correlations—rather than intrinsic spectral–semantic properties—may experience reduced classification accuracy when such non-causal signals are attenuated.
Moreover, several degraded categories are characterized by strong spectral ambiguity or limited inter-class separability, as observed in complex datasets such as HyRANK. Under domain generalization settings, where spurious correlations cannot be exploited, learning stable causal representations for such categories remains inherently challenging for all methods. In addition, counterfactual augmentation may introduce increased variance for extremely small or noisy classes, further amplifying class-wise fluctuations.
Importantly, despite these localized degradations, the proposed method consistently achieves the highest overall accuracy and Kappa coefficient across all datasets. The improvement in Kappa indicates that the gains are not driven by a small subset of dominant classes, but instead reflect a more reliable and globally consistent alignment between predictions and ground truth under cross-scene shifts.
To further assess the reliability of the reported performance, we additionally report the 95% confidence intervals of OA and Kappa, estimated as $\text{mean} \pm t_{0.975,4} \cdot \sigma / \sqrt{5}$ using Student's t-distribution over five runs. As shown in Table 7, CauseHSI consistently achieves higher mean performance with relatively narrow confidence intervals across all three datasets, indicating stable and reliable performance gains.
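The interval computation can be reproduced with the short sketch below, assuming that $\sigma$ is the sample standard deviation (with $n-1$ in the denominator) over the five runs. The critical value $t_{0.975,4} \approx 2.776$ is hard-coded for $n = 5$; the function name and the example numbers are illustrative and are not results from Table 7.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_0975_DF4 = 2.7764451051977987


def summarize_runs(runs):
    """Return mean, sample std, and the 95% CI (mean +/- t * sd / sqrt(n))."""
    if len(runs) != 5:
        raise ValueError("the hard-coded critical value assumes exactly five runs")
    mean = statistics.mean(runs)
    sd = statistics.stdev(runs)  # sample standard deviation (n - 1 denominator)
    half_width = T_0975_DF4 * sd / math.sqrt(len(runs))
    return mean, sd, (mean - half_width, mean + half_width)


# Illustrative OA values (%), not taken from the paper.
mean, sd, (lo, hi) = summarize_runs([80.0, 82.0, 81.0, 79.0, 83.0])
```

With only five runs, the t critical value is noticeably larger than the normal-approximation value of 1.96, so the resulting intervals are appropriately wider.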
Figure 10, Figure 11 and Figure 12 provide qualitative comparisons of classification maps. Subfigure (a) presents the ground-truth map, (b)–(i) are contrast methods, and (j) corresponds to our proposed approach. Pixels without ground-truth labels are treated as background, and all pixels are predicted for visual comparison. Notably, our method yields smoother and less noisy classification maps, as illustrated in the red-boxed regions. This visual advantage stems from the model’s focus on global semantic consistency, which reduces local misclassifications and noise.
Overall, the proposed method achieves consistent superiority in overall accuracy and domain robustness. While it may not outperform all baselines in class-level accuracy, its domain-invariant representation learning supports reliable generalization across complex and diverse scenes. This further validates the effectiveness of our causality-inspired disentanglement strategy for cross-scene hyperspectral image classification.
To quantitatively assess computational efficiency, Table 8 reports training time, inference time, FLOPs, and parameter counts on the three benchmark datasets. As shown in the table, CauseHSI requires approximately 3–5× more FLOPs than most lightweight DG baselines. However, it has the fewest parameters among all compared DG methods, resulting in a compact memory footprint. In terms of runtime, the training time of CauseHSI is moderately higher than that of the lightest baselines, while remaining significantly lower than that of heavyweight methods. Importantly, its inference time is comparable to or only marginally higher than that of other DG approaches across all datasets, indicating that the additional computational cost is mainly incurred during training rather than deployment. Overall, these results suggest that CauseHSI achieves a favorable balance between training-time complexity and inference-time efficiency, making it practical for real-world hyperspectral applications where robustness to domain shifts is required.

4.4. Ablation Study

To evaluate the effectiveness of key components in the proposed CauseHSI, we conduct ablation studies by systematically removing each component and observing performance degradation across all three datasets. The quantitative results are summarized in Table 9.
Specifically, we investigate the following six variants: (1) “no DCT”: removes the frequency-based intervention from CGM. (2) “no 2D”: removes the spectral consistency preservation branch from CGM. (3) “no Control”: disables the style-controlled discrepancy loss in CGM. (4) “no Consist&Complete”: removes both the causal consistency and completeness constraints from CDM. (5) “no Consist”: only removes the causal consistency constraint from CDM. (6) “no Complete”: only removes the causal completeness constraint from CDM.
The variants “no DCT” and “no 2D” respectively disable two complementary intervention branches in the CGM, both of which are designed to approximate counterfactual domain shifts. Removing either branch results in noticeable and consistent performance degradation across all three datasets, highlighting their synergistic roles in counterfactual generation. Specifically, the “no DCT” variant exhibits systematic drops in OA and KC, indicating that frequency-based perturbations substantially enrich the diversity of interventional samples. This empirically supports our design choice of performing counterfactual interventions in the frequency domain, as frequency components effectively capture global sensing variations such as illumination, atmospheric conditions, and sensor response, which are difficult to model explicitly through physical parameters. Similarly, the “no 2D” variant leads to clear degradation, demonstrating that preserving the central spectral structure during intervention is crucial for maintaining class-discriminative information. This confirms that counterfactual perturbations must be constrained to avoid semantic distortion, and validates the necessity of performing interventions in a controlled feature space rather than through unconstrained transformations. Moreover, the removal of the style-controlled discrepancy loss (“no Control”) causes pronounced performance drops, with HyRANK suffering the largest KC decrease (−3.30%). This observation indicates that style regulation plays a critical role in balancing semantic alignment and stylistic diversity, particularly for fine-grained land-cover categories. Without this constraint, feature-space interventions tend to over-amplify non-causal variations, reducing the utility of generated counterfactual samples. Taken together, these results provide strong empirical evidence that frequency-based intervention, spectral preservation, and style regularization play complementary roles in CGM. 
Frequency perturbations broaden interventional diversity, spectral preservation safeguards semantic consistency, and style control prevents over-perturbation. Their joint contribution directly supports the validity of implementing counterfactual interventions in frequency and feature spaces.
The variant “no Consist&Complete” leads to the most significant decline, as it relies solely on the causal independence constraint and thus enforces neither semantic alignment nor feature sufficiency. Adding back either the causal completeness constraint (“no Consist”) or the consistency constraint (“no Complete”) yields clear improvements over this basic version. Specifically, while “no Consist&Complete” shows a 3–4% drop in OA, “no Complete” reduces this drop to about 1%. These results demonstrate the importance of multi-level causal constraints in approximating causal features.
The proposed frequency-based intervention relies on a soft frequency weighting strategy to distinguish scene-sensitive and scene-robust components in the DCT domain. Although this weighting profile is fixed across all experiments, we further investigate its sensitivity to ensure that the performance gains do not depend on a specific frequency configuration. Specifically, we vary the radial range of mid-band frequencies that receive minimal perturbation, while keeping all other components of the framework unchanged. Three configurations are considered: Narrow, Default, and Wide, corresponding to progressively smaller or larger mid-frequency preservation ranges. Notably, this variation does not introduce any explicit hard cutoff between low, mid, and high frequencies, but instead adjusts the extent of the smoothly weighted frequency regions. We evaluate these configurations on representative cross-scene classification benchmarks under the single-source domain generalization setting. As reported in Table 10, the proposed method exhibits stable performance across different frequency weighting ranges. While minor fluctuations are observed, the overall accuracy remains consistently high, indicating that the effectiveness of the proposed frequency-based intervention is not sensitive to the specific choice of frequency weighting parameters. This robustness supports our design choice of using a fixed and coarse-grained frequency weighting profile as a practical approximation of sensing-induced variations.
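To illustrate what a smoothly weighted profile without hard cutoffs can look like, the sketch below builds a perturbation-weight map over DCT indices that is close to zero around a mid-band radius and rises toward one away from it. The Gaussian-notch form, the mid-band radius `r_mid`, and the three widths assigned to Narrow, Default, and Wide are all hypothetical; the text does not report the actual weighting function or its parameters. The 13 × 13 grid is used only as an example size.

```python
import math

# Hypothetical widths of the smooth mid-band notch for the three configurations.
WIDTHS = {"Narrow": 0.05, "Default": 0.10, "Wide": 0.20}


def normalized_radius(u, v, h, w):
    """Radial frequency of DCT index (u, v) in an h x w grid, scaled to [0, 1] (h, w >= 2)."""
    return math.sqrt((u / (h - 1)) ** 2 + (v / (w - 1)) ** 2) / math.sqrt(2.0)


def perturbation_weight(r, r_mid=0.5, width=0.10):
    """Smooth weight: 0 at the mid-band radius (preserved), approaching 1 away from it."""
    return 1.0 - math.exp(-((r - r_mid) ** 2) / (2.0 * width ** 2))


def weight_map(h, w, config="Default", r_mid=0.5):
    """Perturbation-weight map over all DCT indices for one configuration."""
    width = WIDTHS[config]
    return [[perturbation_weight(normalized_radius(u, v, h, w), r_mid, width)
             for v in range(w)] for u in range(h)]


maps = {name: weight_map(13, 13, name) for name in WIDTHS}
```

A wider notch lowers the weight at any given off-center radius, so more of the spectrum around the mid band is left nearly untouched; at no point does the weight jump discontinuously.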
To further validate the effectiveness of CauseHSI in enhancing generalization, we visualize the feature distributions on three datasets using t-SNE. Figure 13 illustrates the distributions of target-domain samples before and after applying CauseHSI. In the original feature space, samples from different classes exhibit substantial overlap, whereas CauseHSI yields clearer inter-class separation across all datasets.

4.5. Physical Plausibility of Generated Counterfactual Samples

Since the proposed framework relies on generated counterfactual hyperspectral samples for DG, it is important to ensure that these samples remain physically plausible rather than arbitrary perturbations. Unlike for natural images, commonly used perceptual metrics such as FID or LPIPS are not directly applicable to hyperspectral data, owing to the high dimensionality of spectral signals and the lack of suitable pretrained feature extractors. Therefore, we assess the physical plausibility of the generated counterfactual samples using physically grounded spectral metrics.
Specifically, we compute the spectral angle mapper (SAM) between the center-pixel spectra of the original and generated samples, where the center pixel corresponds to the semantic label in each 13 × 13 spatial-spectral patch. In addition, we evaluate spectral smoothness along the spectral dimension to examine whether the generated spectra preserve the inherent band-wise continuity of hyperspectral signals. As reported in Table 11, the generated counterfactual samples exhibit moderate spectral angles, typically ranging from 2.8° to 4.5°, indicating meaningful yet physically reasonable domain perturbations rather than trivial reconstructions. Moreover, the spectral smoothness of the generated samples remains close to that of real hyperspectral data across all datasets, suggesting that the proposed generation process does not introduce severe high-frequency spectral artifacts.
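Both metrics are simple to compute, as the sketch below shows. SAM is implemented in its standard form, the angle between two spectra viewed as vectors, which makes it invariant to a global brightness scaling and therefore sensitive mainly to changes in spectral shape. The text does not define its smoothness measure, so the mean absolute second-order band difference used here (lower values meaning smoother spectra) is an assumed proxy.

```python
import math


def sam_degrees(a, b):
    """Spectral angle (in degrees) between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clip rounding error
    return math.degrees(math.acos(cos))


def spectral_roughness(s):
    """Assumed smoothness proxy: mean |second-order difference| along bands."""
    d2 = [s[i + 1] - 2.0 * s[i] + s[i - 1] for i in range(1, len(s) - 1)]
    return sum(abs(v) for v in d2) / len(d2)
```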

4.6. Scalability and Applicability to Large-Scale Scenes

Although the proposed framework is evaluated on commonly used benchmark datasets, it is not inherently restricted to small- or medium-sized hyperspectral scenes. The overall design of CauseHSI is patch-based and scene-agnostic, and does not rely on global scene-level modeling or full-image statistics. As a result, large-scale hyperspectral scenes can be processed in a tiled or sliding-window manner without any modification to the network architecture or training strategy.
Importantly, both the causal disentanglement module and the counterfactual generation mechanism operate locally on image patches or intermediate feature representations. Their computational and memory costs scale linearly with the number of patches, rather than with the spatial extent of the entire scene. This property ensures that the framework remains computationally feasible for large-area hyperspectral imagery.
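The tiled processing described above can be sketched by enumerating overlapping tile origins over a scene. The tile size and overlap in the example are hypothetical values, and the 1096 × 715 scene size is that of Pavia Center. The point of the sketch is that the number of tiles, and hence the cost, grows roughly in proportion to the scene area.

```python
def tile_origins(height, width, tile, overlap):
    """Top-left corners of overlapping tile x tile windows covering the scene."""
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")

    def starts(length):
        if length <= tile:
            return [0]
        s = list(range(0, length - tile + 1, stride))
        if s[-1] != length - tile:  # add a final window flush with the border
            s.append(length - tile)
        return s

    return [(r, c) for r in starts(height) for c in starts(width)]


# Hypothetical 256 x 256 tiles with a 32-pixel overlap on a 1096 x 715 scene.
origins = tile_origins(1096, 715, tile=256, overlap=32)
```

For this configuration, doubling the scene height doubles the tile count (40 versus 20), consistent with the linear scaling argued above.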
Moreover, patch-wise training and inference are standard practice in hyperspectral image analysis, particularly in domain generalization settings, where full-scene annotations are rarely available. The benchmark datasets used in this work are themselves extracted from large-scale airborne scenes and therefore serve as a reasonable proxy for real-world large-scene deployment. These considerations suggest that the proposed framework can be readily applied to very large hyperspectral scenes in practical remote sensing applications.

5. Conclusions

In this work, we address the core challenge of generalizing HSI classification models to unseen domains in the single-source setting, where interventional diversity is severely limited. Adopting a causality-inspired perspective, we formulate domain shifts as interventions within a structural causal model and propose CauseHSI, a framework composed of two synergistic modules. The Counterfactual Generation Module simulates diverse sensing conditions through structured frequency perturbations, generating counterfactual samples that preserve semantic integrity. The Causal Disentanglement Module separates causal representations from spurious domain-specific factors through a dual-branch architecture and frequency-domain reassembly. Together, these modules address both the lack of interventional diversity and the confounding effects of spurious correlations. Extensive experiments on the Pavia, Houston, and HyRANK public datasets consistently demonstrate the effectiveness of our method.
In future work, we plan to further improve the robustness and scalability of our framework in more diverse real-world HSI scenarios. We also plan a broader evaluation that includes additional domain adaptation (DA) methods, to clarify the trade-offs between generalization without target data and adaptation that uses target-domain data. Such a systematic comparison will help delineate the complementary strengths of DG and DA approaches and guide the selection of appropriate strategies in practical applications.

Author Contributions

Conceptualization, X.L. and Z.Y.; methodology, Z.Y.; software, Z.Y.; validation, X.L., Z.Y. and W.L.; writing—original draft preparation, Z.Y.; writing—review and editing, X.L.; visualization, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shandong Province of China, grant number ZR2024MF048.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets are publicly available in Pavia at https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 20 January 2026), Houston at https://github.com/anbydemara/CauseHSI (accessed on 20 January 2026) and HyRANK at https://huggingface.co/datasets/danaroth/hyrank (accessed on 20 January 2026).

Acknowledgments

The authors would like to thank the anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HSI: Hyperspectral Image
DG: Domain Generalization
CGM: Counterfactual Generation Module
CDM: Causal Disentanglement Module
CNN: Convolutional Neural Network
SCM: Structural Causal Model
CRM: Causal Reassembly Module
DCT: Discrete Cosine Transform
SSFM: Spectral-Spatial Fusion Module
HSIC: Hilbert-Schmidt Independence Criterion
OA: Overall Accuracy
KC: Kappa Coefficient
DA: Domain Adaptation

References

1. Khan, A.; Vibhute, A.D.; Mali, S.; Patil, C.H. A systematic review on hyperspectral imaging technology with a machine and deep learning methodology for agricultural applications. Ecol. Inform. 2022, 69, 101678.
2. Zhang, B.; Wu, D.; Zhang, L.; Jiao, Q.; Li, Q. Application of hyperspectral remote sensing for environment monitoring in mining areas. Environ. Earth Sci. 2012, 65, 649–658.
3. Hajaj, S.; El Harti, A.; Pour, A.B.; Jellouli, A.; Adiri, Z.; Hashim, M. A review on hyperspectral imagery application for lithological mapping and mineral prospecting: Machine learning techniques and future prospects. Remote Sens. Appl. 2024, 35, 101218.
4. Yu, C.; Han, R.; Song, M.; Liu, C.; Chang, C.I. Feedback Attention-Based Dense CNN for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501916.
5. Paoletti, M.E.; Haut, J.M.; Tao, X.; Plaza, J.; Plaza, A. FLOP-Reduction Through Memory Allocations Within CNN for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5938–5952.
6. Xu, F.; Mei, S.; Zhang, G.; Wang, N.; Du, Q. Bridging CNN and Transformer with Cross-Attention Fusion Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5522214.
7. Yu, H.; Xu, Z.; Zheng, K.; Hong, D.; Yang, H.; Song, M. MSTNet: A Multilevel Spectral–Spatial Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5532513.
8. Zou, J.; He, W.; Zhang, H. LESSFormer: Local-Enhanced Spectral-Spatial Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5535416.
9. Peng, Y.; Zhang, Y.; Tu, B.; Li, Q.; Li, W. Spatial–Spectral Transformer with Cross-Attention for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5537415.
10. Feng, J.; Bai, G.; Li, D.; Zhang, X.; Shang, R.; Jiao, L. MR-Selection: A Meta-Reinforcement Learning Approach for Zero-Shot Hyperspectral Band Selection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5500320.
11. Hong, D.; Gao, L.; Yao, J.; Yokoya, N.; Chanussot, J.; Heiden, U.; Zhang, B. Endmember-Guided Unmixing Network (EGU-Net): A General Deep Learning Framework for Self-Supervised Hyperspectral Unmixing. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6518–6531.
12. Wang, Z.; Chen, B.; Lu, R.; Zhang, H.; Liu, H.; Varshney, P.K. FusionNet: An unsupervised convolutional variational network for hyperspectral and multispectral image fusion. IEEE Trans. Image Process. 2020, 29, 7565–7577.
13. Windrim, L.; Ramakrishnan, R.; Melkumyan, A.; Murphy, R.J. A physics-based deep learning approach to shadow invariant representations of hyperspectral images. IEEE Trans. Image Process. 2017, 27, 665–677.
14. Wang, J.; Lan, C.; Liu, C.; Ouyang, Y.; Qin, T.; Lu, W.; Chen, Y.; Zeng, W.; Yu, P.S. Generalizing to unseen domains: A survey on domain generalization. IEEE Trans. Knowl. Data Eng. 2022, 35, 8052–8072.
15. Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; Loy, C.C. Domain Generalization: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4396–4415.
16. Zhou, K.; Loy, C.C.; Liu, Z. Semi-supervised domain generalization with stochastic stylematch. Int. J. Comput. Vis. 2023, 131, 2377–2387.
17. Zhou, K.; Yang, Y.; Hospedales, T.; Xiang, T. Learning to generate novel domains for domain generalization. In 16th European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 561–578.
18. Li, L.; Gao, K.; Cao, J.; Huang, Z.; Weng, Y.; Mi, X.; Yu, Z.; Li, X.; Xia, B. Progressive domain expansion network for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 224–233.
19. Gong, R.; Li, W.; Chen, Y.; Gool, L.V. Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 2477–2486.
20. Motiian, S.; Piccirilli, M.; Adjeroh, D.A.; Doretto, G. Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2017; pp. 5715–5725.
21. Peng, X.; Huang, Z.; Sun, X.; Saenko, K. Domain agnostic learning with disentangled representations. In International Conference on Machine Learning (ICML); JMLR: Cambridge, MA, USA, 2019; pp. 5102–5112.
22. He, Y.; Shen, Z.; Cui, P. Towards non-iid image classification: A dataset and baselines. Pattern Recognit. 2021, 110, 107383.
23. Li, D.; Yang, Y.; Song, Y.Z.; Hospedales, T. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2018; Volume 32.
24. Chen, J.; Gao, Z.; Wu, X.; Luo, J. Meta-Causal Learning for Single Domain Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 7683–7692.
25. Jing, L.; Tian, Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4037–4058.
26. Kim, D.; Yoo, Y.; Park, S.; Kim, J.; Lee, J. Selfreg: Self-supervised contrastive regularization for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 9619–9628.
27. Wang, X.; Liu, J.; Ni, Y.; Chi, W.; Fu, Y. Two-Stage Domain Alignment Single-Source Domain Generalization Network for Cross-Scene Hyperspectral Images Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5527314.
28. Qin, B.; Feng, S.; Zhao, C.; Xi, B.; Li, W.; Tao, R. FDGNet: Frequency Disentanglement and Data Geometry for Domain Generalization in Cross-Scene Hyperspectral Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 10297–10310.
29. Dong, L.; Geng, J.; Jiang, W. Spectral–Spatial Enhancement and Causal Constraint for Hyperspectral Image Cross-Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507013.
30. Zhang, Y.; Li, W.; Sun, W.; Tao, R.; Du, Q. Single-Source Domain Expansion Network for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Image Process. 2023, 32, 1498–1512.
31. Zhao, H.; Zhang, J.; Lin, L.; Wang, J.; Gao, S.; Zhang, Z. Locally Linear Unbiased Randomization Network for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5526512.
32. Qin, B.; Feng, S.; Zhao, C.; Li, W.; Tao, R.; Xiang, W. Cross-Domain Few-Shot Learning Based on Feature Disentanglement for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5514215.
33. Peng, D.; Wu, J.; Han, T.; Li, Y.; Wen, Y.; Yang, G.; Qu, L. Disentanglement-inspired single-source domain-generalization network for cross-scene hyperspectral image classification. Knowl. Based Syst. 2024, 303, 112413.
34. Mahajan, D.; Tople, S.; Sharma, A. Domain generalization using causal matching. In International Conference On Machine Learning (ICML); JMLR: Cambridge, MA, USA, 2021; pp. 7313–7324.
35. Wu, C.; Wang, X.; Lian, D.; Xie, X.; Chen, E. A causality inspired framework for model interpretation. In KDD ’23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM, Inc.: New York, NY, USA, 2023; pp. 2731–2741.
36. Huang, J.; Guan, D.; Xiao, A.; Lu, S. FSDR: Frequency Space Domain Randomization for Domain Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021; pp. 6887–6898.
37. Xu, Q.; Zhang, R.; Zhang, Y.; Wang, Y.; Tian, Q. A Fourier-based Framework for Domain Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021; pp. 14378–14387.
38. Lv, F.; Liang, J.; Li, S.; Zang, B.; Liu, C.H.; Wang, Z.; Liu, D. Causality Inspired Representation Learning for Domain Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 8036–8046.
39. Zhang, S.; Yao, D.; Zhao, Z.; Chua, T.S.; Wu, F. Causerec: Counterfactual user sequence synthesis for sequential recommendation. In SIGIR ’21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM, Inc.: New York, NY, USA, 2021; pp. 367–377.
40. Chu, M.; Yu, X.; Dong, H.; Zang, S. Domain-Adversarial Generative and Dual-Feature Representation Discriminative Network for Hyperspectral Image Domain Generalization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5533213.
41. Gao, J.; Ji, X.; Ye, F.; Chen, G. Invariant semantic domain generalization shuffle network for cross-scene hyperspectral image classification. Expert Syst. Appl. 2025, 273, 126818.
42. Fang, Y.; Sun, L.; Zheng, Y.; Wu, Z. Deformable Convolution-Enhanced Hierarchical Transformer with Spectral-Spatial Cluster Attention for Hyperspectral Image Classification. IEEE Trans. Image Process. 2025, 34, 701–716.
43. Chatterjee, A.; Ghosh, S.; Ghosh, A. Context-aware masking and learnable diffusion-guided patch refinement in transformers via sparse supervision for hyperspectral image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2025; pp. 2906–2915.
44. Zhang, B.; Chen, Y.; Yao, R.; Xiong, S.; Xiong, S.; Lu, X. SSPNet: Spatial–Spectral Perception Network for Mineral Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5531416.
45. Feng, J.; Zhang, T.; Zhang, J.; Shang, R.; Dong, W.; Shi, G.; Jiao, L. S4DL: Shift-Sensitive Spatial–Spectral Disentangling Learning for Hyperspectral Image Unsupervised Domain Adaptation. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 16894–16908.
46. Chen, L.; Zhang, Y.; Song, Y.; Zhang, Z.; Liu, L. A causal inspired early-branching structure for domain generalization. Int. J. Comput. Vis. 2024, 132, 4052–4072.
47. Bui, M.H.; Tran, T.; Tran, A.; Phung, D. Exploiting Domain-Specific Features to Enhance Domain Generalization. Adv. Neural Inf. Process. Syst. 2021, 34, 21189–21201.
48. Piratla, V.; Netrapalli, P.; Sarawagi, S. Efficient domain generalization via common-specific low-rank decomposition. In Proceedings of the 37th International Conference on Machine Learning (ICML); JMLR: Cambridge, MA, USA, 2020; pp. 7728–7738.
49. Xu, M.; Qin, L.; Chen, W.; Pu, S.; Zhang, L. Multi-view adversarial discriminator: Mine the non-causal factors for object detection in unseen domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 8103–8112.
50. Atzmon, Y.; Kreuk, F.; Shalit, U.; Chechik, G. A causal view of compositional zero-shot recognition. Adv. Neural Inf. Process. Syst. 2020, 33, 1462–1473.
51. Huang, Z.; Wang, H.; Xing, E.P.; Huang, D. Self-challenging improves cross-domain generalization. In Computer Vision – ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 124–140.
52. Chen, Y.; Wang, Y.; Pan, Y.; Yao, T.; Tian, X.; Mei, T. A style and semantic memory mechanism for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 9164–9173.
53. Li, M.; Liu, Y.; Xue, G.; Huang, Y.; Yang, G. Exploring the Relationship Between Center and Neighborhoods: Central Vector Oriented Self-Similarity Network for Hyperspectral Image Classification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1979–1993.
54. Gretton, A.; Fukumizu, K.; Teo, C.; Song, L.; Schölkopf, B.; Smola, A. A kernel statistical test of independence. Adv. Neural Inf. Process. Syst. 2007, 20.
55. Zhu, Y.; Zhuang, F.; Wang, J.; Ke, G.; Chen, J.; Bian, J.; Xiong, H.; He, Q. Deep Subdomain Adaptation Network for Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1713–1722.
56. Zhang, Y.; Li, W.; Zhang, M.; Qu, Y.; Tao, R.; Qi, H. Topological Structure and Semantic Information Transfer Network for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2817–2830.
57. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214.
58. Han, Z.; Yang, J.; Gao, L.; Zeng, Z.; Zhang, B.; Chanussot, J. Dual-Branch Subpixel-Guided Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5521813.
Figure 1. (a) SCM of the image generation process. Z, S, X, and Y denote the latent physical properties, sensing scene, observed image, and semantic label, respectively. (b) Illustration of representation disentanglement. $X_c$ and $X_n$ denote the causal and non-causal features, respectively.
Figure 2. Overall pipeline of the proposed CauseHSI. “[·, ·]” denotes feature concatenation.
Figure 3. Data processing flow of the CGM. “SpeR” and “SpaR” represent spectral and spatial randomization, respectively.
Figure 4. Structure of the disentanglement module.
Figure 5. Data processing flow of the reconstruction module. $d_e$ denotes the embedding dimension of the features.
Figure 6. Pseudo-color image and ground truth map of Pavia dataset. (a) Pseudo-color image of Pavia University. (b) Ground truth map of Pavia University. (c) Pseudo-color image of Pavia Center. (d) Ground truth map of Pavia Center.
Figure 7. Pseudo-color image and ground truth map of Houston dataset. (a) Pseudo-color image of Houston2013. (b) Pseudo-color image of Houston2018. (c) Ground truth map of Houston2013. (d) Ground truth map of Houston2018.
Figure 8. Pseudo-color image and ground truth map of HyRANK dataset. (a) Pseudo-color image of Dioni. (b) Pseudo-color image of Loukia. (c) Ground truth map of Dioni. (d) Ground truth map of Loukia.
Figure 9. Impact of different parameters on OA across the three datasets. (a) Learning rate. (b) λ. (c) Embedding dimension.
Figure 10. Ground truth and classification maps of different methods on the Pavia Center dataset. (a) Ground truth. (b) SSFTT. (c) DSNet. (d) DSAN. (e) TSTnet. (f) SDEnet. (g) FDGNet. (h) D3Net. (i) ISDGS. (j) Ours.
Figure 11. Ground truth and classification maps of different methods on the Houston2018 dataset. (a) Ground truth. (b) SSFTT. (c) DSNet. (d) DSAN. (e) TSTnet. (f) SDEnet. (g) FDGNet. (h) D3Net. (i) ISDGS. (j) Ours.
Figure 12. Ground truth and classification maps of different methods on the Loukia dataset. (a) Ground truth. (b) SSFTT. (c) DSNet. (d) DSAN. (e) TSTnet. (f) SDEnet. (g) FDGNet. (h) D3Net. (i) ISDGS. (j) Ours.
Figure 13. t-SNE visualization of feature distributions on three benchmark datasets. (a–c) Distributions of target-domain samples in the original feature space for Pavia, Houston, and HyRANK, respectively. (d–f) Corresponding feature distributions learned by CauseHSI on the same datasets.
Table 1. Number of Source and Target Samples for the Pavia Dataset.
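| No. | Class | Pavia University (source) | Pavia Center (target) |
|---|---|---|---|
| C1 | Trees | 3064 | 7598 |
| C2 | Asphalt | 6631 | 9248 |
| C3 | Brick | 3682 | 2685 |
| C4 | Bitumen | 1330 | 7287 |
| C5 | Shadow | 947 | 2863 |
| C6 | Meadow | 18,649 | 3090 |
| C7 | Bare soil | 5029 | 6584 |
| Total | | 39,332 | 39,355 |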
Table 2. Number of Source and Target Samples for the Houston Dataset.
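| No. | Class | Houston2013 (source) | Houston2018 (target) |
|---|---|---|---|
| C1 | Grass healthy | 345 | 1353 |
| C2 | Grass stressed | 365 | 4888 |
| C3 | Trees | 365 | 2766 |
| C4 | Water | 285 | 22 |
| C5 | Residential buildings | 319 | 5347 |
| C6 | Non-residential buildings | 408 | 32,459 |
| C7 | Road | 443 | 6365 |
| Total | | 2530 | 53,200 |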
Table 3. Number of Source and Target Samples for the HyRANK Dataset.
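| No. | Class | Dioni (source) | Loukia (target) |
|---|---|---|---|
| C1 | Dense Urban Fabric | 1262 | 206 |
| C2 | Mineral Extraction Sites | 204 | 54 |
| C3 | Non-Irrigated Arable Land | 614 | 426 |
| C4 | Fruit Trees | 150 | 79 |
| C5 | Olive Groves | 1768 | 1107 |
| C6 | Coniferous Forest | 361 | 422 |
| C7 | Dense Sclerophyllous Vegetation | 5035 | 2996 |
| C8 | Sparse Sclerophyllous Vegetation | 6374 | 2361 |
| C9 | Sparsely Vegetated Areas | 1754 | 399 |
| C10 | Rocks and Sand | 492 | 453 |
| C11 | Water | 1612 | 1393 |
| C12 | Coastal Water | 398 | 421 |
| Total | | 20,024 | 10,317 |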
Table 4. Classification Results of Different Methods for the Target Scene Pavia Center Data. The 1st and 2nd best results are in bold and underlined, respectively.
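| No. | SSFTT | DSNet | DSAN | TSTnet | SDEnet | FDGNet | S2ECNet | D3Net | ISDGS | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 98.76 | 96.59 | 81.24 | 86.36 | 90.16 | 91.79 | 94.18 | 90.77 | 91.87 | 96.33 |
| C2 | 91.46 | 87.26 | 74.45 | 74.78 | 87.66 | 86.14 | 83.50 | 91.10 | 88.97 | 91.96 |
| C3 | 20.25 | 57.33 | 79.77 | 61.05 | 72.87 | 75.94 | 70.58 | 75.98 | 70.86 | 75.86 |
| C4 | 8.55 | 33.02 | 76.28 | 65.21 | 82.68 | 82.50 | 82.87 | 81.45 | 82.99 | 84.52 |
| C5 | 99.78 | 84.21 | 92.73 | 88.85 | 84.18 | 87.91 | 90.85 | 84.06 | 87.49 | 77.56 |
| C6 | 82.18 | 66.31 | 85.72 | 82.50 | 70.38 | 72.84 | 67.15 | 58.66 | 71.20 | 71.62 |
| C7 | 88.95 | 75.30 | 77.47 | 66.77 | 82.47 | 83.74 | 85.39 | 85.80 | 86.94 | 84.71 |
| OA (%) | 72.12 ± 1.25 | 73.11 ± 4.32 | 79.18 ± 2.03 | 74.60 ± 0.63 | 83.73 ± 1.89 | 84.54 ± 0.91 | 84.13 ± 1.12 | 84.27 ± 1.39 | <u>85.34</u> ± 1.48 | **86.47** ± 1.28 |
| Kappa × 100 | 65.93 ± 1.52 | 67.95 ± 4.78 | 75.25 ± 2.43 | 69.89 ± 0.76 | 80.39 ± 2.21 | 81.44 ± 1.05 | 80.92 ± 1.38 | 81.03 ± 1.66 | <u>82.32</u> ± 1.73 | **83.66** ± 1.51 |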
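OA and Kappa in Tables 4–6 follow their standard definitions: OA is the fraction of correctly classified target pixels, and Cohen's kappa discounts the agreement expected by chance. A minimal NumPy sketch computing both from a confusion matrix, together with per-class accuracies, is given below; reading the C1–C12 rows as per-class recall (producer's accuracy) is an assumption, and the helper name is hypothetical.

```python
import numpy as np


def classification_metrics(y_true, y_pred, num_classes):
    """Return (OA, kappa, per-class accuracy), each scaled by 100 as in the tables.

    Hypothetical helper; y_true and y_pred hold integer labels in [0, num_classes).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # rows: reference labels, columns: predictions
    n = cm.sum()
    p_o = np.trace(cm) / n  # observed agreement, i.e. overall accuracy
    # Agreement expected if predictions were independent of the reference labels.
    p_e = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2
    kappa = (p_o - p_e) / (1.0 - p_e)
    # Per-class accuracy (recall); classes absent from y_true get 0 instead of NaN.
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    return 100 * p_o, 100 * kappa, 100 * per_class


oa, kappa, per_class = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(oa, kappa, per_class)  # 75.0 50.0 [ 50. 100.]
```

Because kappa subtracts chance agreement, it drops more than OA when a classifier favors the dominant class, which is one reason both are reported on imbalanced targets such as Houston2018.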
Table 5. Classification Results of Different Methods for the Target Scene Houston2018 Data. The 1st and 2nd best results are in bold and underlined, respectively.
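| No. | SSFTT | DSNet | DSAN | TSTnet | SDEnet | FDGNet | S2ECNet | D3Net | ISDGS | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 14.93 | 29.98 | 79.49 | 76.92 | 59.81 | 45.59 | 70.58 | 31.89 | 40.83 | 37.24 |
| C2 | 28.71 | 42.89 | 70.69 | 74.12 | 76.44 | 76.28 | 74.57 | 80.18 | 75.06 | 82.56 |
| C3 | 37.09 | 42.70 | 75.22 | 58.05 | 58.06 | 64.55 | 44.39 | 49.88 | 54.87 | 58.67 |
| C4 | 84.54 | 81.82 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 96.36 | 100.00 | 81.82 |
| C5 | 73.65 | 57.06 | 63.20 | 67.17 | 66.03 | 72.22 | 73.69 | 70.57 | 73.68 | 78.67 |
| C6 | 78.69 | 91.97 | 77.85 | 83.98 | 88.38 | 89.34 | 87.15 | 89.31 | 88.20 | 91.96 |
| C7 | 44.49 | 11.72 | 44.48 | 62.99 | 55.44 | 54.14 | 61.34 | 49.67 | 60.70 | 51.38 |
| OA (%) | 65.72 ± 1.69 | 70.21 ± 2.45 | 71.64 ± 1.39 | 77.35 ± 3.63 | 78.80 ± 1.53 | <u>79.81</u> ± 1.00 | 78.92 ± 1.41 | 78.33 ± 1.23 | 79.31 ± 0.95 | **81.78** ± 0.36 |
| Kappa × 100 | 40.60 ± 5.67 | 42.68 ± 9.32 | 56.36 ± 1.56 | 63.93 ± 4.22 | 63.90 ± 2.44 | <u>65.20</u> ± 1.55 | 64.46 ± 1.52 | 62.56 ± 2.37 | 64.96 ± 1.09 | **69.15** ± 1.30 |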
Table 6. Classification Results of Different Methods for the Target Scene Loukia Data. The 1st and 2nd best results are in bold and underlined, respectively.
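| No. | SSFTT | DSNet | DSAN | TSTnet | SDEnet | FDGNet | S2ECNet | D3Net | ISDGS | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 8.42 | 37.09 | 19.25 | 35.34 | 20.51 | 28.74 | 33.50 | 30.39 | 37.38 | 38.93 |
| C2 | 43.21 | 0.00 | 78.40 | 0.00 | 45.37 | 55.19 | 48.15 | 15.18 | 16.30 | 14.82 |
| C3 | 36.93 | 74.32 | 35.44 | 37.37 | 31.69 | 60.09 | 43.19 | 41.78 | 38.64 | 65.21 |
| C4 | 50.21 | 34.68 | 36.71 | 3.80 | 8.23 | 21.77 | 16.46 | 18.99 | 14.94 | 18.73 |
| C5 | 0.00 | 0.40 | 7.71 | 47.51 | 29.02 | 12.36 | 27.37 | 21.82 | 11.62 | 49.99 |
| C6 | 22.59 | 13.65 | 18.80 | 0.47 | 23.58 | 17.11 | 33.18 | 28.39 | 34.64 | 19.67 |
| C7 | 75.09 | 83.46 | 74.24 | 72.34 | 76.58 | 79.16 | 81.28 | 78.49 | 75.60 | 64.27 |
| C8 | 19.24 | 36.75 | 66.09 | 58.69 | 63.44 | 66.39 | 60.78 | 64.24 | 75.41 | 78.85 |
| C9 | 49.96 | 57.34 | 40.69 | 86.82 | 72.87 | 71.78 | 45.86 | 72.78 | 42.10 | 30.13 |
| C10 | 62.62 | 15.58 | 3.97 | 9.45 | 17.94 | 6.27 | 2.21 | 17.17 | 16.20 | 3.36 |
| C11 | 98.64 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| C12 | 99.13 | 98.29 | 96.20 | 100.00 | 100.00 | 100.00 | 100.00 | 97.81 | 100.00 | 100.00 |
| OA (%) | 51.48 ± 1.08 | 57.74 ± 1.30 | 60.0 ± 1.09 | 63.19 ± 1.50 | 64.04 ± 0.53 | 64.35 ± 0.59 | 64.09 ± 1.01 | <u>64.61</u> ± 2.42 | 64.34 ± 0.68 | **65.46** ± 1.12 |
| Kappa × 100 | 42.45 ± 1.78 | 48.50 ± 1.57 | 50.39 ± 0.80 | 55.25 ± 1.89 | 55.95 ± 1.09 | 56.00 ± 0.93 | 55.76 ± 1.22 | <u>56.54</u> ± 3.10 | 55.64 ± 0.98 | **57.79** ± 1.09 |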
Table 7. 95% Confidence Intervals (CI) of OA and Kappa ( × 100 ) over Five Runs. For each dataset, the strongest competing method is reported for comparison.
| Dataset | Method | OA (95% CI) | Kappa (95% CI) |
|---|---|---|---|
| Pavia | ISDGS | [83.50, 87.18] | [80.17, 84.47] |
| | Ours | [84.88, 88.06] | [81.78, 85.54] |
| Houston | FDGNet | [78.57, 81.05] | [63.27, 67.13] |
| | Ours | [81.33, 82.23] | [67.54, 70.76] |
| HyRANK | D3Net | [61.61, 67.61] | [52.69, 60.39] |
| | Ours | [64.07, 66.85] | [56.44, 59.14] |
Table 8. Efficiency Comparison of Different Methods on Three Datasets.
| Dataset | Metric | DSAN | TSTnet | SDEnet | FDGNet | D3Net | ISDGS | Ours |
|---|---|---|---|---|---|---|---|---|
| Pavia | Training time | 53.84 | 40.31 | 15.75 | 13.85 | 17.40 | 9.21 | 22.96 |
| | Testing time | 13.76 | 6.96 | 6.74 | 7.24 | 8.46 | 6.82 | 8.06 |
| | FLOPs | 1.22 × 10¹⁰ | 3.53 × 10⁹ | 2.84 × 10⁹ | 2.43 × 10⁹ | 4.53 × 10⁹ | 2.07 × 10⁹ | 1.09 × 10¹⁰ |
| | Parameters | 2.43 × 10⁷ | 7.70 × 10⁶ | 4.05 × 10⁵ | 4.90 × 10⁵ | 5.22 × 10⁵ | 4.65 × 10⁵ | 3.21 × 10⁵ |
| Houston | Training time | 33.27 | 19.19 | 8.81 | 7.97 | 9.38 | 7.91 | 14.54 |
| | Testing time | 16.85 | 3.45 | 11.06 | 10.62 | 11.87 | 11.04 | 12.30 |
| | FLOPs | 1.01 × 10¹⁰ | 3.04 × 10⁹ | 1.52 × 10⁹ | 1.30 × 10⁹ | 3.42 × 10⁹ | 1.11 × 10⁹ | 9.53 × 10⁹ |
| | Parameters | 2.42 × 10⁷ | 7.69 × 10⁶ | 4.08 × 10⁵ | 4.52 × 10⁵ | 4.88 × 10⁵ | 4.34 × 10⁵ | 2.55 × 10⁵ |
| HyRANK | Training time | 46.64 | 31.37 | 13.13 | 17.49 | 14.76 | 8.07 | 20.92 |
| | Testing time | 5.01 | 3.28 | 2.82 | 3.56 | 3.20 | 3.14 | 3.34 |
| | FLOPs | 1.51 × 10¹⁰ | 4.19 × 10⁹ | 4.64 × 10⁹ | 3.98 × 10⁹ | 6.06 × 10⁹ | 3.40 × 10⁹ | 1.28 × 10¹⁰ |
| | Parameters | 2.46 × 10⁷ | 7.73 × 10⁶ | 4.80 × 10⁵ | 5.36 × 10⁵ | 5.72 × 10⁵ | 5.01 × 10⁵ | 3.66 × 10⁵ |
Table 9. Ablation Experiment of Our Method on Three Datasets.
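| Dataset | Metric | No DCT | No 2D | No Control | No Consist & Complete | No Consist | No Complete | Ours |
|---|---|---|---|---|---|---|---|---|
| Pavia | OA | 85.26 | 84.44 | 84.98 | 82.66 | 84.18 | 85.05 | 86.47 |
| | KC | 82.17 | 81.28 | 81.92 | 79.12 | 81.02 | 81.98 | 83.66 |
| Houston | OA | 80.99 | 80.40 | 81.01 | 79.12 | 79.94 | 80.03 | 81.78 |
| | KC | 67.12 | 66.49 | 67.28 | 64.10 | 64.79 | 65.09 | 69.15 |
| HyRANK | OA | 64.81 | 64.96 | 63.40 | 61.87 | 62.96 | 63.91 | 65.42 |
| | KC | 56.31 | 57.37 | 54.50 | 52.18 | 53.49 | 55.15 | 57.80 |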
Table 10. Sensitivity analysis of the frequency weighting strategy under different mid-band preservation ranges.
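| Setting | Metric | Pavia | Houston | HyRANK |
|---|---|---|---|---|
| Narrow | OA | 85.78 | 80.51 | 65.03 |
| | KC | 82.92 | 68.24 | 58.11 |
| Default | OA | 86.47 | 81.78 | 65.42 |
| | KC | 83.66 | 69.15 | 57.80 |
| Wide | OA | 85.76 | 80.36 | 64.38 |
| | KC | 82.83 | 67.62 | 56.34 |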
Table 11. Physical plausibility evaluation of generated counterfactual samples.
| Dataset | Center-Pixel SAM (°) | Spectral Smoothness (Source) | Spectral Smoothness (Counterfactual) |
|---|---|---|---|
| Pavia | 2.89 | 0.0224 | 0.0246 |
| Houston | 3.24 | 0.0327 | 0.0376 |
| HyRANK | 4.45 | 0.0544 | 0.0653 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, X.; Yang, Z.; Li, W. CauseHSI: Counterfactual-Augmented Domain Generalization for Hyperspectral Image Classification via Causal Disentanglement. J. Imaging 2026, 12, 57. https://doi.org/10.3390/jimaging12020057

AMA Style

Li X, Yang Z, Li W. CauseHSI: Counterfactual-Augmented Domain Generalization for Hyperspectral Image Classification via Causal Disentanglement. Journal of Imaging. 2026; 12(2):57. https://doi.org/10.3390/jimaging12020057

Chicago/Turabian Style

Li, Xin, Zongchi Yang, and Wenlong Li. 2026. "CauseHSI: Counterfactual-Augmented Domain Generalization for Hyperspectral Image Classification via Causal Disentanglement" Journal of Imaging 12, no. 2: 57. https://doi.org/10.3390/jimaging12020057

APA Style

Li, X., Yang, Z., & Li, W. (2026). CauseHSI: Counterfactual-Augmented Domain Generalization for Hyperspectral Image Classification via Causal Disentanglement. Journal of Imaging, 12(2), 57. https://doi.org/10.3390/jimaging12020057

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
