CAT: Causal Attention with Linear Complexity for Efficient and Interpretable Hyperspectral Image Classification

Ying Liu; Zhipeng Shen; Haojiao Yang; Waixi Liu; Xiaofei Yang

doi:10.3390/rs18020358

Highlights

What are the main findings?

We propose the Causal Attention Transformer (CAT), the first framework to integrate causal inference with a hybrid CNN-Transformer backbone for hyperspectral image (HSI) classification, enabling explicit modeling of spectral–spatial causality without external discovery pipelines.
CAT introduces a novel Causal Attention Mechanism that enforces temporal and spatial causality through triangular masking and axial decomposition, effectively eliminating spurious spectral–spatial correlations and enhancing model robustness.

What are the implications of the main findings?

The architecture features a Dual-Path Hierarchical Fusion module with learnable gating to adaptively integrate complementary spectral and spatial causal features in an end-to-end trainable manner.
A Linearized Causal Attention (LCA) module reduces computational complexity from O(N²) to O(N) while preserving causal constraints, enabling scalable high-resolution HSI processing.
Extensive experiments on three benchmark datasets (Indian Pines, Pavia University, Houston2013) demonstrate that CAT achieves state-of-the-art classification accuracy, with significant improvements over leading CNN, Transformer, and Mamba-based models.

Abstract

Hyperspectral image (HSI) classification is pivotal in remote sensing, yet deep learning models, particularly Transformers, remain susceptible to spurious spectral–spatial correlations and suffer from limited interpretability. These issues stem from their inability to model the underlying causal structure in high-dimensional data. This paper introduces the Causal Attention Transformer (CAT), a novel architecture that integrates causal inference with a hierarchical CNN-Transformer backbone to address these limitations. CAT incorporates three key modules: (1) a Causal Attention Mechanism that enforces temporal and spatial causality via triangular masking and axial decomposition to eliminate spurious dependencies; (2) a Dual-Path Hierarchical Fusion module that adaptively integrates spectral and spatial causal features using learnable gating; and (3) a Linearized Causal Attention module that reduces the computational complexity from $O(N^{2})$ to $O(N)$ via kernelized cumulative summation, enabling scalable high-resolution HSI processing. Extensive experiments on three benchmark datasets (Indian Pines, Pavia University, Houston2013) demonstrate that CAT achieves state-of-the-art performance, outperforming leading CNN and Transformer models in both accuracy and robustness. Furthermore, CAT provides inherently interpretable spectral–spatial causal maps, offering valuable insights for reliable remote sensing analysis.

Keywords:

hyperspectral image classification; deep learning; causal inference; transformers

1. Introduction

Hyperspectral images (HSIs) capture detailed surface information across hundreds of contiguous spectral bands, forming rich three-dimensional data cubes that enable precise material discrimination in applications such as mineral exploration and precision agriculture [1]. Consequently, HSI classification, which aims to assign land cover labels based on spectral–spatial characteristics, is a fundamental remote sensing task.

Traditional machine learning approaches, including support vector machines (SVMs) [2] and random forests [3], primarily rely on handcrafted features. While computationally efficient, these methods are fundamentally limited in modeling the complex, non-linear spectral–spatial interactions inherent in HSIs due to their reliance on manual feature engineering and their vulnerability to the curse of dimensionality, leading to pronounced sensitivity to noisy and redundant spectral bands [4].

The advent of deep learning revolutionized HSI classification by enabling automatic feature extraction. Convolutional Neural Networks (CNNs), particularly 2D-CNNs [5] and 3D-CNNs [6], excelled at capturing local spectral–spatial patterns through hierarchical convolutions. Hybrid architectures [7,8] further enhanced multi-scale feature fusion. However, CNNs are inherently limited by their local receptive fields, struggling to model long-range spatial dependencies crucial for large-scale scenes. Recently, Transformers [8,9] have addressed this by leveraging global self-attention mechanisms to model long-range dependencies. Despite their success, standard self-attention computes dependencies between all token pairs indiscriminately, making it susceptible to spurious correlations from high-dimensional spectral redundancies. Furthermore, its quadratic computational complexity with respect to sequence length limits its scalability for high-resolution HSIs. Meanwhile, emerging State Space Models (SSMs) such as Mamba [9] offer a promising alternative with linear complexity, as demonstrated by [10] with regard to remote sensing data processing. Concurrently, Causal Attention Mechanisms have shown potential in mitigating confounding biases, with ref. [11] developing causal attention for vision–language tasks, and ref. [12] employing causal meta-reinforcement learning for multimodal remote sensing data classification.

Beyond correlation-based models, causal inference has emerged as a powerful framework for enhancing model robustness and interpretability by modeling cause–effect relationships. Causal networks, such as Causal Bayesian Networks (CBNs) [13], aim to identify directional dependencies, mitigating confounding biases. In parallel, Causal Attention Mechanisms have been developed to integrate these principles directly into deep learning architectures. For instance, masked attention [14] and front-door adjustment [15] restrict attention to causally relevant features, effectively reducing spurious correlations in tasks like vision–language reasoning and improving out-of-distribution generalization in language models [16]. These works demonstrate the potential of causal attention to provide a principled approach to feature selection and robustness. However, their application to HSI classification remains largely unexplored and faces significant domain-specific challenges.

Along similar lines, causal attention networks have been developed to integrate causal principles into attention mechanisms, primarily to mitigate confounding biases and spurious correlations. The core innovation lies in restricting attention to causally relevant features through mechanisms like masked attention [11] or front-door adjustment [17]. For instance, Yang et al. proposed Causal Attention (CATT) for vision–language tasks, employing in-sample and cross-sample attention modules to eliminate confounding effects without requiring explicit confounder observations. In large language models, Causal Attention Tuning (CAT) has been introduced to inject fine-grained causal knowledge into attention distributions, effectively reducing reliance on spurious correlations and improving out-of-distribution generalization [18]. Recent work such as CASTLE further enhanced causal attention with lookahead keys, enabling better global context understanding while maintaining causal constraints [19]. These approaches demonstrate that causal attention can significantly improve model interpretability and robustness across diverse domains.

Despite these advancements, current deep learning approaches for HSI classification face three unresolved challenges: (1) susceptibility to spurious spectral–spatial correlations due to unconstrained attention in high-dimensional spaces; (2) inadequate modeling of causal dependencies between spectral bands and spatial contexts, leading to interpretability bottlenecks and noise sensitivity; and (3) limited scalability imposed by the quadratic complexity of standard Transformers. Existing causal models are either not designed for HSI or require extensive, separate causal discovery pipelines. To address these limitations holistically, we propose the Causal Attention Transformer (CAT). Our approach directly tackles these issues: a Causal Attention Mechanism eliminates spurious correlations via structured masking; a Dual-Path Hierarchical Fusion module jointly models spectral and spatial causality; and a Linearized Causal Attention module reduces complexity to O(N) for scalable processing.

To address these limitations, we propose the Causal Attention Transformer (CAT), a novel hybrid architecture that integrates causal inference with hierarchical feature fusion for robust and interpretable HSI classification. Our approach introduces three key innovations: (1) a Causal Attention Mechanism that eliminates spurious interactions by establishing counterfactual dependencies through triangular masking and axial decomposition; (2) a Dual-Path Hierarchical Fusion module that implements a spectral–spatial fusion framework with learnable gating to progressively integrate causal features from orthogonal domains; and (3) a Linearized Causal Attention module that leverages kernelized cumulative summation to reduce computational complexity from $O(N^{2})$ to $O(N)$ while maintaining causal constraints for high-resolution HSI processing. These components are integrated within a multi-scale CNN-Transformer backbone that extracts both local patterns and global dependencies while preserving causal relationships.

The main contributions of this work are summarized as follows:

We propose a novel Causal Attention Transformer (CAT), the first to integrate causal inference with a hybrid CNN-Transformer architecture for hyperspectral image classification, enabling explicit modeling of spectral–spatial causality without external causal discovery pipelines.
We design a Causal Attention Mechanism with triangular masking and axial decomposition that enforces temporal and spatial causality and disentangles spectral–spatial causality via front-door adjustment. This eliminates spurious correlations and enhances model robustness and generalization capability, a principled approach not previously applied to HSI.
We develop a Dual-Path Hierarchical Fusion framework with learnable gating, which adaptively merges spectral and spatial causal features in an end-to-end trainable manner.
We introduce a Linearized Causal Attention that reduces complexity from $O (N^{2})$ to $O (N)$ while preserving causal constraints, enabling scalable high-resolution HSI processing.
We conduct extensive experiments on three benchmark datasets, demonstrating that CAT achieves state-of-the-art performance compared to existing CNN and Transformer models, while providing interpretable spectral–spatial causal relationships for robust remote sensing analysis.

The rest of this paper is organized as follows: Section 2 reviews related work on CNNs in hyperspectral image classification and on the CAT network integrated with the Mamba algorithm. Section 3 introduces the causal framework of the proposed method. Section 4 details the methodology of the proposed approach. Section 5 presents the experimental results and analysis. Section 6 concludes the paper and discusses future directions.

3. Causal Framework and Structural Causal Model

To provide a rigorous causal foundation for our approach, we formalize the problem using Structural Causal Models (SCMs) [13]. We define three core endogenous variables:

$X = (S, P)$ : Observed hyperspectral data (spectral bands S, spatial features P);
Z: Latent causal features extracted by our model;
Y: Classification labels.

The causal relationships are governed by structural equations:

$$X = f_{X}(U_{X}) \tag{1}$$

$$Z = f_{Z}(X) \tag{2}$$

$$Y = f_{Y}(Z, U_{Y}) \tag{3}$$

where $U_{X}, U_{Y}$ represent unobserved confounders (e.g., environmental conditions) and $f_{Z}$ corresponds to our Causal Attention Transformer.

3.1. Causal Identification via Front-Door Adjustment

The presence of confounders $U_{Y}$ prevents the direct estimation of $P(Y \mid do(X))$. We employ front-door adjustment [13] using Z as a mediator:

$$P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z \mid X) \tag{4}$$

where Z corresponds to the causally constrained features produced by our Causal Attention Module. The triangular and axial masks ensure that Z is a valid mediator that does not inherit spurious dependencies from $U_{Y}$.
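For comparison, the standard textbook statement of the front-door formula is reproduced below. It is included only as a reference point: the symbol $x'$ (a second, summed-over copy of the treatment value) is notation introduced here and does not appear in Equation (4), and how Equation (4) relates to this form is not spelled out in the text.

$$P(Y \mid do(X = x)) = \sum_{z} P(Z = z \mid X = x) \sum_{x'} P(Y \mid X = x', Z = z)\, P(X = x')$$

The inner sum averages the effect of the mediator over the observational distribution of the treatment, which is what distinguishes the interventional quantity from the plain conditional $P(Y \mid X = x)$.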

3.2. Causal Constraints Implementation

Our causal attention enforces these conditions through structured masking. For the spectral dimension with C bands,

$$\mathrm{Attention}(Q, K, V)_{i} = \sum_{j = 1}^{i} \mathrm{softmax}\left(\frac{Q_{i} K_{j}^{\top}}{\sqrt{d}}\right) V_{j}. \tag{5}$$

For the spatial dimension in raster-scan order,

$$\mathrm{Attention}(Q_{(i, j)}, K, V) = \sum_{l = 1}^{i} \sum_{m = 1}^{j} \mathrm{softmax}\left(\frac{Q_{(i, j)} K_{(l, m)}^{\top}}{\sqrt{d}}\right) V_{(l, m)}. \tag{6}$$

This ensures each position attends only to its causal predecessors, blocking non-causal dependencies.

3.3. Causal Regularization

We augment the standard cross-entropy loss with a causal regularization term:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \sum_{i, j} \mathbb{I}(j > i) \cdot |A_{ij}| \tag{7}$$

where $A_{ij}$ are attention weights and $\lambda$ controls regularization strength. This penalizes attention to non-causal positions.
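The regularization term in Equation (7) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `causal_penalty` and the single-matrix `(T, T)` layout are hypothetical choices made here.

```python
import numpy as np

def causal_penalty(attn, lam):
    """Return lam * sum over j > i of |A_ij| for a (T, T) attention matrix.

    Hypothetical helper: only the second term of Equation (7) is computed;
    the cross-entropy term would be added by the training loop.
    """
    # True exactly where the key index j lies after the query index i.
    future = np.triu(np.ones_like(attn, dtype=bool), k=1)
    return lam * np.abs(attn[future]).sum()
```

For attention weights produced under a hard lower-triangular mask the term evaluates to zero, so in this sketch it only contributes where attention is not hard-masked.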

Based on the front-door adjustment principle established above, our objective is to learn an intermediate variable Z (i.e., the model’s internal feature representation) that blocks the non-causal path from the unobserved confounder $U_{Y}$. To achieve this, we design a Causal Attention Mechanism that enforces a structural constraint: each token’s representation $Z_{i}$ is constructed exclusively from its causal predecessors $X_{\leq i}$. This is implemented through a structured masking strategy in the attention computation, effectively decoupling spurious correlations during the feature learning phase.

4. Proposed Method

In this section, we present the Causal Attention Transformer (CAT), a novel architecture designed to perform HSI classification by explicitly modeling causal dependencies in both spectral and spatial dimensions. As outlined in Section 1, our approach is built upon three core innovations: the Causal Attention Mechanism, the Dual-Path Hierarchical Fusion strategy, and the Linearized Causal Attention for efficiency.

Our architecture (as shown in Figure 1) builds upon established concepts but introduces several key novelties. The hierarchical CNN-Transformer backbone follows common design practices. However, the core innovation lies in the integration of Causal Attention Mechanisms into both spectral and spatial dimensions, and the subsequent dual-path fusion strategy. While causal masking has been explored in NLP, its adaptation and joint application for spectral–spatial HSI analysis is, to the best of our knowledge, unprecedented.

Figure 1. Architecture of the proposed Causal Attention Transformer (CAT) for hyperspectral image classification. The model processes HSI cubes through patch embedding, multi-stage causal attention blocks, and hierarchical feature fusion to generate classification maps while maintaining spectral–spatial causality.

4.1. Causal Attention Mechanism

The Causal Attention Mechanism is the cornerstone of our method, directly implementing the front-door adjustment principle from our causal framework (Section 3). It ensures that the learned intermediate representation Z for any token is solely dependent on its causal predecessors, thereby blocking non-causal paths induced by confounders.

4.1.1. Causal Self-Attention

The Causal Self-Attention mechanism enforces a causal structure, crucial for modeling sequences where future elements should not influence past ones. For HSI, we treat the spectral dimension as a sequential signal. The input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$ is first reshaped into a sequence of tokens $X \in \mathbb{R}^{B \times T \times d}$, where $T = C$ (number of spectral bands) and $d = H \times W$ (flattened spatial dimensions). We then compute query, key, and value projections:

$$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V} \tag{8}$$

The attention weights are computed with triangular masking to prevent information leakage:

$$A = \mathrm{Mask}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{B \times H \times T \times T} \tag{9}$$

where the masking function applies lower-triangular constraints, as follows:

$$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{otherwise} \end{cases} \tag{10}$$

This formulation ensures that each position can only attend to itself and preceding positions, maintaining the autoregressive property essential for causal inference in the spectral or spatial sequence.
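The projections, additive mask, and softmax described above can be sketched as follows. This is a single-head, unbatched NumPy illustration under stated assumptions: the function name and the square projection matrices are hypothetical, and the scaling uses the projected key dimension.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (T, d) token sequence.

    Illustrative sketch only: batching, multiple heads, and learned
    parameters are omitted.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    t, d_k = q.shape
    scores = (q @ k.T) / np.sqrt(d_k)              # (T, T) raw logits
    # Additive mask: 0 where i >= j, -inf where j > i.
    mask = np.triu(np.full((t, t), -np.inf), k=1)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                       # masked entries become 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because masked logits are set to negative infinity before the softmax, perturbing tokens at later positions leaves the outputs at earlier positions unchanged, which is the autoregressive property the text refers to.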

4.1.2. Causal Attention for 4D Inputs

For direct 4D tensor processing $X \in \mathbb{R}^{B \times C \times H \times W}$, we extend causal attention to the spatial domain (as shown in Figure 2). We employ depthwise separable convolutions for efficient local feature extraction and to maintain the channel-wise independence crucial for spectral causality:

$$Q = \mathrm{Conv}_{1 \times 1}(X), \quad K, V = \mathrm{DWConv}_{3 \times 3}(X) \tag{11}$$

Figure 2. Detailed structure of the Transformer block with Causal Attention Mechanisms. The module employs causal self-attention with triangular masking, depthwise separable convolutions, and axial decomposition to enforce temporal causality while preventing information leakage in spectral–spatial dependency modeling.

Spatial causality is enforced via axial decomposition, processing the image in a raster-scan order (row-by-row and left-to-right). The causal attention for a query at position $(i, j)$ attends only to keys at positions $(l, m)$ where $l \leq i$ and $m \leq j$:

$$A_{(i, j), (l, m)} = \frac{Q_{(i, j)} K_{(l, m)}^{\top}}{\sqrt{d}} \cdot \mathbb{I}(l \leq i, m \leq j) \tag{12}$$

where $\mathbb{I}(\cdot)$ is the indicator function enforcing spatial causality. This approach processes rows and columns sequentially while maintaining computational efficiency through depthwise separable operations.
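The spatial indicator above can be materialized as a Boolean mask over flattened positions. The sketch below assumes row-major (raster-scan) flattening of an H × W grid; the function name `spatial_causal_mask` is hypothetical.

```python
import numpy as np

def spatial_causal_mask(h, w):
    """Boolean (H*W, H*W) mask for the constraint l <= i and m <= j.

    Entry [query, key] is True when the key position (l, m) is allowed
    for the query position (i, j). Positions are flattened row-major.
    """
    rows, cols = np.divmod(np.arange(h * w), w)   # flat index -> (row, col)
    row_ok = rows[None, :] <= rows[:, None]       # key row l <= query row i
    col_ok = cols[None, :] <= cols[:, None]       # key col m <= query col j
    return row_ok & col_ok
```

As written, the indicator keeps the upper-left quadrant of each query position, which is a subset of that position's raster-scan predecessors.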

4.1.3. Linear Causal Attention

To address the quadratic complexity $O(N^{2})$ of standard self-attention, we introduce a linearized variant based on kernelized attention. The core idea is to approximate the softmax operation using feature maps that decompose the attention computation into linear complexity.

Given queries Q, keys K, and values V, the standard softmax attention is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V \tag{13}$$

We approximate this using feature maps $\phi(\cdot)$ and $\psi(\cdot)$:

$$\mathrm{LinearAttention}(Q, K, V) = \frac{\phi(Q) \left(\psi(K)^{\top} V\right)}{\phi(Q) \left(\psi(K)^{\top} \mathbf{1}\right)} \tag{14}$$

where $\mathbf{1}$ is an all-ones vector of the appropriate dimension. This formulation reduces memory complexity from $O(N^{2})$ to $O(N)$.

For our causal implementation, we use cumulative sums to maintain the causal structure. For the i-th position, the output is as follows:

$$\mathrm{Output}_{i} = \frac{\sum_{j = 1}^{i} \phi(q_{i})^{\top} \psi(k_{j})\, v_{j}}{\sum_{j = 1}^{i} \phi(q_{i})^{\top} \psi(k_{j})} \tag{15}$$

We adopt the feature mapping proposed by [39]:

$$\phi(x) = \psi(x) = \mathrm{ELU}(x) + 1 \tag{16}$$

This ensures non-negative attention scores and stable cumulative computation. The ELU activation function is defined as follows:

$$\mathrm{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^{x} - 1) & \text{if } x \leq 0 \end{cases} \tag{17}$$

with $\alpha = 1.0$ in our implementation.

The linearization process preserves the directional dependency inherent in the causal attention (i.e., aggregation only from preceding positions). The cumulative sum formulation naturally aligns with the autoregressive nature of causal sequences, thus maintaining the causal constraint while significantly reducing computational complexity.
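Equations (15) and (16) can be checked numerically with a short NumPy sketch. It is an unbatched, single-head illustration under stated assumptions: the function names are hypothetical, and `quadratic_reference` is added here only to confirm that the running-sum form matches the explicit lower-triangular computation.

```python
import numpy as np

def feature_map(x):
    """phi(x) = ELU(x) + 1 with alpha = 1: x + 1 for x > 0, exp(x) otherwise."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_causal_attention(q, k, v):
    """Equation (15) via running sums; q and k are (N, d), v is (N, d_v)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    # Running sum of outer products phi(k_j) v_j^T, shape (N, d, d_v).
    kv = np.cumsum(phi_k[:, :, None] * v[:, None, :], axis=0)
    # Running sum of phi(k_j), shape (N, d).
    k_sum = np.cumsum(phi_k, axis=0)
    num = np.einsum("nd,nde->ne", phi_q, kv)
    den = np.einsum("nd,nd->n", phi_q, k_sum)
    return num / den[:, None]

def quadratic_reference(q, k, v):
    """Same quantity computed with an explicit (N, N) lower-triangular matrix."""
    scores = np.tril(feature_map(q) @ feature_map(k).T)
    return (scores @ v) / scores.sum(axis=-1, keepdims=True)
```

For readability this sketch stores every prefix sum at once; a streaming implementation would keep a single running state and update it position by position, so the cost grows linearly with the sequence length.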

4.2. Dual-Path Hierarchical Fusion

To comprehensively model the complex causal structures within HSI data, we propose a dual-path strategy that separately captures spectral and spatial causalities, followed by a hierarchical fusion scheme.

4.2.1. Dual-Path Attention Module

The dual-path architecture separately models spectral and spatial dependencies through parallel causal attention pathways. Given input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$,

$$A_{s} = \mathrm{CausalAttention}(\mathrm{LN}(X)) \in \mathbb{R}^{B \times C \times H \times W} \tag{18}$$

$$A_{c} = \mathrm{CausalSelfAttention}\left(\mathrm{LN}(X^{\top})\right)^{\top} \in \mathbb{R}^{B \times C \times H \times W}. \tag{19}$$

The dynamic fusion with learnable gate $\gamma \in [0, 1]$ follows:

$$Y = \gamma \odot A_{s} + (1 - \gamma) \odot A_{c} \tag{20}$$

Spectral attention operates along channel dimension C with complexity $O(C^{2} H W)$, while spatial attention uses axial decomposition with complexity $O(C H W (H + W))$, providing complementary perspectives on the data.

The spectral causal path ($A_{c}$) aims to eliminate spurious correlations caused by physical factors like atmospheric absorption across bands. Conversely, the spatial causal path ($A_{s}$) captures genuine causal influences arising from real-world object layouts (e.g., roads causing adjacent soil exposure). This orthogonal design ensures a holistic modeling of causal relationships.
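The gated combination in Equation (20) can be sketched as below. The text does not say how the gate is kept inside [0, 1]; passing a learnable logit through a sigmoid is an assumption made here, and the name `gated_fusion` is hypothetical.

```python
import numpy as np

def gated_fusion(a_s, a_c, gate_logit):
    """Y = gamma * A_s + (1 - gamma) * A_c with gamma = sigmoid(gate_logit).

    The sigmoid parameterization is an assumption; gate_logit may be a
    scalar or any array broadcastable to the (B, C, H, W) feature maps.
    """
    gamma = 1.0 / (1.0 + np.exp(-gate_logit))   # keeps gamma in (0, 1)
    return gamma * a_s + (1.0 - gamma) * a_c
```

A logit of zero weights the two paths equally, while a per-channel logit of shape `(1, C, 1, 1)` would let each spectral channel favor a different path.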

4.2.2. Hierarchical Feature Extraction and Fusion

This section details how the dual-path features are processed and fused across multiple scales.

Patch Embedding Module

The module processes raw HSI cubes through spectral–spatial hierarchical projection. Given input $X \in \mathbb{R}^{B \times 1 \times C \times H \times W}$,

$$Y_{3} = \mathrm{Conv3D}(X) \in \mathbb{R}^{B \times d_{1} \times \lfloor C / s_{c} \rfloor \times H \times W}, \tag{21}$$

with the following 3D convolution parameters: kernel $(5, 3, 3)$, stride $(2, 1, 1)$, and channels $d_{1} = 64$. Spatial refinement is then applied as follows:

$$Y_{2} = \mathrm{Conv2D}(Y_{3}) \in \mathbb{R}^{B \times d_{2} \times H \times W} \tag{22}$$

$$Y_{emb} = \mathrm{ReLU}(\mathrm{BatchNorm}(Y_{2})) \tag{23}$$

with the following 2D convolution configuration: kernel progression $[3 \times 3] \to [3 \times 3] \to [1 \times 1]$ and channel expansion $d_{2} = 4 d_{1} = 256$.
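The output shape of the 3D convolution stage can be worked out with the usual convolution size formula. The kernel and stride below are taken from the text; the padding is not stated there, so the total padding of 4 along the spectral axis and 2 along each spatial axis is an assumption chosen so that H and W are preserved. Both function names are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard convolution output length along one axis (dilation 1).

    `padding` is the amount added on each side, so 2 * padding in total.
    """
    return (size + 2 * padding - kernel) // stride + 1

def patch_embed_shape(c, h, w, d1=64):
    """Shape (d1, C', H', W') after the 3D convolution, per sample."""
    # Kernel (5, 3, 3) and stride (2, 1, 1) are from the text;
    # padding (2, 1, 1) is an assumption, not a documented setting.
    c_out = conv_out_size(c, kernel=5, stride=2, padding=2)
    h_out = conv_out_size(h, kernel=3, stride=1, padding=1)
    w_out = conv_out_size(w, kernel=3, stride=1, padding=1)
    return (d1, c_out, h_out, w_out)
```

Under this assumed padding, 200 bands reduce to 100, matching $\lfloor C / s_{c} \rfloor$ with $s_{c} = 2$, whereas an odd band count such as 103 gives 52, so the exact spectral length depends on the padding actually used.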

Multi-Stage Processing

The backbone employs multi-scale processing through four hierarchical stages:

$$F_{0} = \mathrm{PatchEmbed}(X), \quad F_{i} = \mathrm{Stage}_{i}(F_{i - 1}), \quad i = 1, \dots, 4 \tag{24}$$

Each stage contains strided convolution for downsampling ($s = 2^{i}$) and Transformer blocks with counts $N_{i} = \{2, 2, 4, 2\}$. The dynamic fusion mechanism employs learnable weights $\gamma_{i}$, $i = 1, \dots, 4$ (collected in $\gamma \in \mathbb{R}^{4}$), to adaptively combine features across scales.

Feature Fusion and Classification

The final representation integrates multi-scale features through upsampling and weighted combination, as follows:

$$F_{out} = \sum_{i = 1}^{4} \gamma_{i} \cdot \mathrm{UpSample}(F_{i}) \tag{25}$$

where upsampling uses bilinear interpolation for dimension alignment. The classification head processes fused features through global average pooling and linear projection:

$$z = \frac{1}{H' W'} \sum_{h = 1}^{H'} \sum_{w = 1}^{W'} F_{out}(:, h, w) \tag{26}$$

$$\hat{y} = \mathrm{Softmax}(W_{c} z + b_{c}) \tag{27}$$

Dropout with $p = 0.3$ is applied before final projection to prevent overfitting, while the global pooling operation ensures spatial invariance and reduces the parameter count.
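The fusion and classification head in Equations (25) to (27) can be sketched for a single sample as follows. Two simplifications are assumed here: the stage features are taken to be already resized to a common spatial shape (the bilinear upsampling step is not reproduced), and dropout is omitted as it would be at inference time. The function name `classify` is hypothetical.

```python
import numpy as np

def classify(features, gammas, w_c, b_c):
    """Weighted fusion, global average pooling, and softmax for one sample.

    features: list of (D, H', W') maps, assumed already upsampled to a
              common shape (upsampling itself is not shown).
    gammas:   one scalar fusion weight per feature map.
    """
    fused = sum(g * f for g, f in zip(gammas, features))   # Eq. (25)
    z = fused.mean(axis=(1, 2))                            # Eq. (26)
    logits = w_c @ z + b_c                                 # Eq. (27), pre-softmax
    logits = logits - logits.max()                         # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Pooling over the spatial axes leaves a D-dimensional vector, so the final projection needs only a `(num_classes, D)` weight matrix regardless of the patch size.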

4.3. Computational Complexity Analysis

The computational advantage of our linearized approach can be quantified as follows:

Standard attention: $O (N^{2} d)$ time, $O (N^{2})$ memory;
Linear attention: $O (N d^{2})$ time, $O (N d)$ memory.

where N is the sequence length and d is the feature dimension. For hyperspectral images where N ($\text{pixels} \times \text{bands}$) is large but d is moderate, this provides substantial efficiency gains while preserving causal structure.
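To make the comparison concrete, the leading-order counts above can be evaluated for a sample configuration. The values N = 4096 and d = 64 are hypothetical and chosen only for illustration; constants and lower-order terms are dropped.

```python
def attention_costs(n, d):
    """Leading-order operation and memory counts, constants dropped."""
    return {
        "standard_time": n * n * d,
        "standard_memory": n * n,
        "linear_time": n * d * d,
        "linear_memory": n * d,
    }

# Hypothetical configuration, not taken from the experiments.
costs = attention_costs(n=4096, d=64)
speedup = costs["standard_time"] / costs["linear_time"]   # equals n / d
```

Both ratios reduce to N / d, so under these counts the linear variant only pays off when the sequence length exceeds the feature dimension.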

4.4. Connection to State Space Models

Our Linearized Causal Attention (LCA) shares with State Space Models (SSMs) the goal of achieving linear computational complexity in sequence length, making both suitable for long-range modeling in hyperspectral data. However, the underlying mechanisms and implications for interpretability diverge fundamentally.

First, while Mamba replaces attention with a discretized state-space recurrence $h_{t} = \bar{A}_{t} h_{t - 1} + \bar{B}_{t} x_{t}$, which implicitly filters and propagates information through data-dependent parameters, it does not yield explicit, sample-wise importance scores over input tokens. In contrast, our LCA retains an explicit, normalized attention map $A_{ij}$ (via kernelized approximation of Softmax), which can be directly computed and visualized for any input–output pair. This explicitness is critical: within our causal framework, $A_{ij}$ serves as a proxy for the causal influence of feature j on the prediction at location i. Thus, even after linearization, LCA preserves the semantic transparency that standard attention provides, a property absent in black-box SSM dynamics.

Second, LCA is natively embedded within a Transformer architecture, enabling seamless integration with positional encodings, layer normalization, and multi-head mechanisms. This architectural compatibility allows our model to inherit the rich representational capacity and modular design of modern vision Transformers, while replacing only the attention core with a linearized, causally informed variant. Mamba, by contrast, requires a complete architectural shift away from attention, limiting its plug-and-play compatibility with existing attention-based pipelines or hybrid designs.

Consequently, our approach uniquely balances three desiderata: (1) linear complexity, (2) causal interpretability via explicit attention weights, and (3) architectural flexibility within the Transformer paradigm. This triad enables not only efficient inference but also scientifically meaningful analysis of spectral–spatial decision rationales—something SSMs alone cannot offer.

5. Experiments

5.1. Datasets and Setting

All experiments strictly follow a within-dataset protocol. Our model is trained and validated solely on the provided training and validation splits of the target dataset (e.g., Indian Pines). No external data, pre-training on other datasets, cross-dataset transfer learning, or data augmentation techniques are employed.

5.1.1. Datasets’ Description

Comprehensive evaluations are conducted on three benchmark hyperspectral datasets to validate the proposed method under diverse conditions.

Indian Pines: This dataset, acquired by the AVIRIS sensor over Northwestern Indiana, comprises $145 \times 145$ pixels with 224 spectral bands. Following standard preprocessing, 200 bands are retained after removing 20 noisy bands. The dataset contains 16 agricultural and natural vegetation categories, with 10% of samples (approximately 1000 pixels) used for training and the remainder for testing, following the experimental protocol established in [7].

Pavia University: Collected by the ROSIS sensor over Pavia, Italy, this urban scene contains $610 \times 340$ pixels with 103 spectral bands. The dataset encompasses nine urban land cover classes. We employ the standard 10% training sample ratio (approximately 4200 pixels), consistent with evaluation methodologies in [33].

Houston2013: This dataset, captured by the ITRES CASI-1500 sensor over the University of Houston, features $349 \times 1905$ pixels with 144 spectral bands. The cloud-free image provided by GRSS includes 15 urban land use classes. We utilize 10% of labeled samples (approximately 5000 pixels) for training, maintaining consistency with experimental setups in [9].

5.1.2. Implementation Details

The proposed CAT is implemented using PyTorch 1.9.0 and trained on NVIDIA RTX 3090 GPUs. We employ the AdamW optimizer with an initial learning rate of $1 \times 10^{-3}$, a weight decay of $1 \times 10^{-4}$, and a cosine annealing scheduler that reduces the learning rate to $1 \times 10^{-5}$ over 100 epochs. The model processes randomly cropped patches of size $15 \times 15$ pixels with batch size 100. To preserve the causality constraints, all convolutional operations utilize left/top asymmetric padding instead of symmetric padding. We apply RandAugment with a magnitude of eight for spectral–spatial data augmentation, following the best practices in [23].
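The learning-rate schedule described above can be written in closed form. The sketch below uses the standard cosine annealing expression with the stated endpoints; whether the schedule is stepped per epoch or per iteration is not specified in the text, so per-epoch stepping is an assumption, and the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=100, lr_max=1e-3, lr_min=1e-5):
    """Cosine annealing from lr_max at epoch 0 to lr_min at total_epochs.

    Per-epoch stepping is assumed; the endpoints are those given in the text.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate decays slowly at first, fastest around the midpoint, and flattens out again as it approaches the minimum.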

5.1.3. Evaluation Metrics

Performance is quantitatively assessed using two standard metrics: Overall Accuracy (OA) and the Kappa coefficient ($\kappa$). OA represents the percentage of correctly classified pixels, while $\kappa$ accounts for agreement beyond chance. All experiments were repeated ten times with different random seeds for training/testing splits, and the mean ± standard deviation of performance metrics is reported to ensure statistical robustness.
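Both metrics can be computed from a confusion matrix, as in the sketch below. It uses the usual definition of Cohen's kappa (observed agreement corrected by chance agreement); the function name `oa_and_kappa` and the integer-label input format are assumptions made for illustration.

```python
import numpy as np

def oa_and_kappa(y_true, y_pred, num_classes):
    """Overall Accuracy and Cohen's kappa from integer class labels."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Confusion matrix: rows index the true class, columns the prediction.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    total = cm.sum()
    p_o = np.trace(cm) / total                                  # observed agreement (OA)
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2    # chance agreement
    return p_o, (p_o - p_e) / (1.0 - p_e)
```

For example, with true labels `[0, 0, 1, 1]` and predictions `[0, 0, 1, 0]`, three of four pixels are correct (OA = 0.75), chance agreement is 0.5, and kappa is therefore 0.5.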

5.2. Comparative Analysis

5.2.1. Comparison Methods

We compare CAT against eleven state-of-the-art methods spanning different architectural paradigms:

CNN-based: 2D-CNN [5] and 3D-CNN [8].

Transformer-based: ViT [40] (a classical Transformer network for image classification), HiT [32] (Hyperspectral Image Transformer), SSFTT [9] (Spatial–Spectral Fusion Transformer), and MorphFormer [41] (a hybrid architecture integrating mathematical morphology with Transformers).

Mamba-based: MambaHSI [42] (an HSI-specific adaptation of the Mamba selective state space model) and 3DSSMamba [43] (an extension of Mamba to 3D state spaces for HSI).

Non-DL methods: SVM [44] (Support Vector Machine) and KNN [45] (K-Nearest Neighbors).

All comparison methods are implemented using their officially released code and tuned following their respective papers to ensure a fair comparison.

5.2.2. Quantitative Results

As detailed in Table 1, Table 2 and Table 3, CAT performs strongly on all three benchmark datasets while providing causal interpretability, achieving 94.25% OA on Indian Pines, 98.24% OA on Houston2013, and 99.08% OA on Pavia University. We attribute this consistency to CAT’s integration of causal inference principles with deep feature learning. The Causal Attention Mechanism effectively eliminates spurious spectral–spatial correlations through triangular masking and axial decomposition, while the Dual-Path Hierarchical Fusion enables adaptive integration of complementary spectral and spatial features. Most notable is CAT’s performance on Houston2013, where it outperforms the best competitor by 1.42 percentage points in OA, indicating that it handles complex urban environments with spectrally similar materials well. The linearized causal attention further ensures computational efficiency, reducing the complexity from $O(N^{2})$ to $O(N)$ while maintaining causal constraints.

Table 1. Comparison with state-of-the-art methods on Indian Pines dataset (10% training samples). Methods are grouped by architecture type. Best results are highlighted in bold.

Table 2. Comparison with state-of-the-art methods on Houston2013 dataset (10% training samples). Methods are grouped by architecture type. Best results are highlighted in bold.

Table 3. Comparison with state-of-the-art methods on PaviaU dataset (10% training samples). Methods are grouped by architecture type. Best results are highlighted in bold.

As expected, non-deep learning methods like SVM exhibit significantly lower performance compared to deep models, highlighting the necessity of automatic feature extraction for complex HSI data.

Transformer-based methods exhibit competitive but inconsistent performance across datasets. While HiT achieves 98.46% OA on Pavia University and SSFTT reaches 98.68% on Indian Pines, these methods show significant performance degradation on Houston2013 (96.42% for SSFTT). This inconsistency originates from their fundamental limitation in distinguishing causal from spurious correlations. The global self-attention mechanisms in Transformers capture all pairwise interactions indiscriminately, amplifying noisy spectral bands and confounding factors. Furthermore, their quadratic computational complexity restricts the effective modeling of long-range dependencies in large-scale HSI scenes, and the absence of explicit causal modeling makes them vulnerable to spectral variability and distribution shifts across different environmental conditions.

CNN-based approaches demonstrate systematic limitations across all datasets, with performance degradation particularly evident in complex classification scenarios. While SyCNN achieves 97.75% OA on Pavia University through synergistic 2D/3D convolutions, its performance drops to 90.24% on agriculturally complex Indian Pines. Traditional CNNs fundamentally struggle with spectral redundancy reduction and lack explicit mechanisms for causal relationship modeling. The local receptive fields of convolutional operations restrict their ability to model global contextual information essential for accurate land cover classification, particularly in urban scenes with complex spatial structures. Additionally, deeper CNN architectures face overfitting risks when labeled samples are limited, and 3D-CNNs suffer from prohibitive computational complexity for high-resolution HSI processing.

The comprehensive experimental results unequivocally validate CAT’s superiority, demonstrating an average improvement of 1.5–3.2% over the state-of-the-art methods across all datasets. This performance enhancement is directly attributed to CAT’s principled architecture: the Causal Attention Mechanism eliminates confounding biases through front-door adjustment; the Dual-Path Fusion optimally balances spectral and spatial contributions; and the Linearized Attention Mechanism ensures scalability for large-scale hyperspectral image (HSI) processing. CAT’s robust performance across diverse scenarios confirms its generalization capability and establishes a new paradigm for robust and interpretable HSI classification that effectively addresses the fundamental limitations inherent in both CNN and Transformer frameworks.

To statistically validate the superiority of CAT over the best-performing baseline (SSFTT), we conducted paired t-tests on the OA values from ten independent runs on each dataset. The results indicate that CAT’s improvements are statistically significant (p < 0.05) on all three datasets, with p-values of 0.012 (Indian Pines), 0.008 (Houston2013), and 0.025 (Pavia University).
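The paired t-test described above can be sketched as follows. This is an illustration only: the per-run OA values are hypothetical placeholders rather than the experimental results, the helper name `paired_t_statistic` is an assumption, and the significance decision uses the tabulated two-sided critical value for 9 degrees of freedom instead of an exact p-value.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample std (n - 1 in the denominator)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical OA values (%) over ten seeds; NOT the paper's per-run results.
oa_cat = [94.1, 94.6, 93.9, 94.8, 94.3, 94.0, 94.7, 94.2, 93.8, 94.5]
oa_base = [93.8, 94.1, 93.9, 94.2, 94.0, 93.5, 94.3, 94.1, 93.6, 94.0]

t_stat, dof = paired_t_statistic(oa_cat, oa_base)
T_CRIT_DF9 = 2.262  # two-sided critical value of Student's t, alpha = 0.05, 9 d.o.f.
print(f"t = {t_stat:.3f}, dof = {dof}, significant = {abs(t_stat) > T_CRIT_DF9}")
```

Pairing the runs by random seed (same train/test split for both models) is what justifies the paired form of the test; with unmatched splits, an unpaired test would be the more appropriate choice.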

The performance variation across datasets can be attributed to their inherent characteristics. Pavia University, an urban scene, features high spatial resolution and distinct spectral signatures for different materials, which aligns well with CAT’s strength in modeling precise spectral–spatial causality. In contrast, Indian Pines, an agricultural scene, contains many spectrally similar crop classes (e.g., various corn and soybean types) and suffers from more label noise, presenting a greater challenge even for our robust causal framework. This analysis demonstrates that while CAT achieves SOTA performance universally, its gains are most pronounced in scenarios with clear material boundaries.

5.2.3. Qualitative Analysis

Visual assessment of the classification maps in Figure 3, Figure 4 and Figure 5 reveals consistent patterns across the datasets. CAT produces classification maps with visibly reduced salt-and-pepper noise compared to CNN-based methods (Figure 3a,b, Figure 4a,b and Figure 5a,b). On Indian Pines (Figure 3), CAT’s output shows more homogeneous region formation, particularly in agricultural fields, compared to the fragmented classifications produced by Transformer-based methods.

Figure 3. The classification maps obtained by different methods on the Indian Pines Scene dataset (with 10% training samples).

Figure 4. The classification maps obtained by different methods on the Houston2013 scene dataset (with 10% training samples).

Figure 5. The classification maps obtained by different methods on the PaviaU dataset (with 10% training samples).

Transformer-based methods exhibit characteristic visual artifacts across datasets. MorphFormer (Figure 4) shows fragmented classifications in heterogeneous urban areas of Houston2013, with frequent misclassifications of spectrally similar materials. SSFTT (Figure 3d and Figure 4d) demonstrates improved spatial consistency over pure CNNs but still exhibits boundary blurring between adjacent crop types in Indian Pines and building boundaries in urban scenes. These visual limitations are particularly evident in regions with mixed land cover types, where all Transformer baselines show higher spatial heterogeneity than CAT.

CNN-based approaches consistently display pronounced visual limitations observable in Figure 3, Figure 4 and Figure 5. Both 2D-CNN and 3D-CNN methods (Figure 3a,b, Figure 4a,b and Figure 5a,b) exhibit substantial salt-and-pepper noise across all datasets, particularly evident in the agriculturally complex Indian Pines scene. HybridSN shows reduced noise through the hybrid 2D-3D architecture but still struggles with boundary preservation in urban environments. The local nature of convolutional operations results in visually disjointed classification maps with poor spatial coherence, which is especially noticeable in linear urban features like roads and building boundaries where CNN methods produce broken segments.

The t-SNE visualization (as shown in Figure 6) provides comparative evidence of feature representation quality. CAT (Figure 6f) produces more compact class-specific clusters with reduced inter-class overlap compared to all baseline methods. Particularly notable is the improved separation between spectrally similar classes that show significant overlap in other methods’ representations (Figure 6a–e). This enhanced cluster separation in the feature space correlates with the improved classification performance observed in quantitative results.

Figure 6. t-SNE visualization of feature representations on Indian Pines dataset. Comparison includes (a) 2D-CNN, (b) HybridSN, (c) ViT, (d) SSFTT, (e) MorphFormer, and (f) Proposed CAT. CAT produces more compact class-specific clusters with reduced inter-class confusion, demonstrating enhanced discriminative capability through causal feature learning.

The superior homogeneity of CAT’s classification maps (Figure 3, Figure 4 and Figure 5) results directly from its Causal Attention Mechanism. By blocking spurious correlations, CAT avoids the salt-and-pepper noise common in CNNs (which overfit local textures) and the boundary blurring seen in standard Transformers (which are misled by global but non-causal similarities). For instance, in the Houston2013 scene (Figure 4), CAT correctly classifies the narrow ‘Running Track’ as a single, coherent entity, whereas SSFTT fragments it because it attends to spectrally similar but causally irrelevant nearby grass.

5.3. Ablation Studies

5.3.1. Impact of Input Patch Size on Model Performance

Choosing the input patch size is essential for HSI classification because it balances spatial context against computational cost; the results are summarized in Table 4. Patch sizes ranging from $9 \times 9$ to $19 \times 19$ are evaluated across all three datasets. The $9 \times 9$ patches achieve peak performance on all datasets (Indian Pines: 95.88% OA, PaviaU: 99.64% OA, Houston2013: 98.84% OA), with consistent degradation observed as the patch size increases to $19 \times 19$ (Indian Pines: 91.40% OA). This inverse relationship demonstrates that the Causal Attention Mechanism effectively extracts discriminative features from compact spatial contexts, while larger patches introduce noise that compromises causal feature learning.
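To give a sense of how quickly the spatial context grows over the evaluated range, the short sketch below counts tokens and pairwise interactions per patch. It assumes one token per pixel (so $N$ equals the squared patch side) and odd patch sizes in steps of two; both are illustrative assumptions, and the actual tokenization may differ.

```python
def token_stats(patch_size):
    """Tokens and pairwise interactions, assuming one token per pixel."""
    n_tokens = patch_size * patch_size
    return n_tokens, n_tokens * n_tokens


base_tokens, base_pairs = token_stats(9)
for p in range(9, 20, 2):   # 9, 11, 13, 15, 17, 19 (assumed step)
    n_tokens, n_pairs = token_stats(p)
    print(f"{p:2d}x{p:<2d}  N={n_tokens:3d} ({n_tokens / base_tokens:.2f}x)  "
          f"N^2={n_pairs:6d} ({n_pairs / base_pairs:.2f}x)")
```

Under these assumptions, a $19 \times 19$ patch contains roughly 4.5 times as many pixels as a $9 \times 9$ patch and roughly 20 times as many pixel pairs, so far more neighboring pixels, potentially from other classes, can influence the center-pixel prediction.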

Table 4. Ablation study on input patch size across three benchmark datasets. Best results are highlighted in bold.

5.3.2. Contribution Analysis of Architectural Components

Validating the dual-path design philosophy requires isolating the individual component contributions. As shown in Table 5, the experimental results establish that the complete CAT model (94.25% OA) outperforms both the baseline DCTN (93.78% OA) and single-path variants (spatial only: 93.69% OA; spectral only: 93.73% OA). The 0.47% absolute improvement confirms that neither spatial nor spectral causality alone suffices; their synergistic integration through adaptive fusion is essential, with learned gating coefficients indicating balanced weighting between pathways.

Table 5. Component ablation analysis on Indian Pines dataset. The results demonstrate the complementary contribution of spatial and spectral causal pathways, with the full CAT model achieving optimal performance through adaptive feature fusion. Best results are highlighted in bold.

The performance gain from the dual-path design confirms our hypothesis that spectral and spatial causalities are complementary but distinct. The spectral path effectively disentangles confounding effects among bands (e.g., atmospheric absorption), while the spatial path captures the causal influence of neighboring land cover types (e.g., a road causing adjacent soil to be bare). The learnable gating mechanism in the fusion module allows the network to dynamically weigh these two sources of causal evidence based on the local context, leading to more robust feature representations than either path alone achieves.
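The gating idea can be illustrated with a small sketch. The exact parameterization of the fusion module is not restated in this section, so the per-channel sigmoid gate below, together with the names `gated_fusion`, `w_spec`, `w_spat`, and `bias`, is an assumed, simplified form rather than the implementation used in CAT.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_spec, f_spat, w_spec, w_spat, bias):
    """Blend two feature vectors with a per-channel gate g in (0, 1).

    g     = sigmoid(w_spec * f_spec + w_spat * f_spat + bias)   (elementwise)
    fused = g * f_spec + (1 - g) * f_spat
    """
    fused, gates = [], []
    for s, p, ws, wp, b in zip(f_spec, f_spat, w_spec, w_spat, bias):
        g = sigmoid(ws * s + wp * p + b)
        gates.append(g)
        fused.append(g * s + (1.0 - g) * p)   # convex combination of both paths
    return fused, gates


# Hypothetical features; zero gate parameters give g = 0.5 (balanced weighting).
f_spec = [1.0, -0.5, 2.0]
f_spat = [0.0, 0.5, 1.0]
zeros = [0.0, 0.0, 0.0]
fused, gates = gated_fusion(f_spec, f_spat, zeros, zeros, zeros)
print("gates:", gates)
print("fused:", fused)
```

Because the gate is a convex weight, each fused channel always lies between the spectral and spatial values, and gates near 0.5 correspond to the balanced weighting between pathways mentioned above.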

The significant performance drop when removing either the spectral or spatial causal path (w/o Spec-CAT/w/o Spat-CAT) validates that both dimensions of causality are essential and non-redundant. In data-scarce scenarios (e.g., five samples per class on Indian Pines), the standard self-attention baseline (w/o causal) suffers from severe overfitting due to modeling spurious correlations. In contrast, our causal attention acts as a strong regularizer by explicitly blocking these non-causal links. This forces the model to learn more fundamental and generalizable patterns from the limited labels, which is why the full CAT model shows the most substantial relative improvement under extreme label scarcity.

5.3.3. Effectiveness Validation of Causal Constraints

We further analyze the Causal Attention Mechanism by comparing against variants with disabled causal constraints. As seen in Table 6, removing triangular masking from causal attention reduces the performance by 2.1% OA on Indian Pines, confirming the importance of temporal causality in spectral processing. Disabling axial decomposition in spatial causal attention decreases the performance by 1.8% OA, validating the necessity of structured spatial causality. These results empirically demonstrate that causal constraints effectively mitigate spurious correlations and enhance model robustness.
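The effect of triangular masking can be checked directly: with a lower-triangular mask, the output at position $i$ depends only on positions $j \le i$, so perturbing a later band cannot change earlier outputs. The sketch below uses plain single-head softmax attention on random toy tokens; treating the band index as the sequence order follows the description above, while the function name and data are illustrative assumptions.

```python
import math
import random


def causal_attention(q, k, v):
    """Single-head softmax attention with a lower-triangular (causal) mask.

    q, k, v are lists of N vectors; output i only depends on positions j <= i.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores for allowed positions only; every j > i is masked out.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)                           # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / z
                    for c in range(len(v[0]))])
    return out


random.seed(0)
n, d = 6, 4
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
base = causal_attention(x, x, x)

# Perturb only the last token and recompute.
x_mod = [row[:] for row in x]
x_mod[-1] = [val + 5.0 for val in x_mod[-1]]
pert = causal_attention(x_mod, x_mod, x_mod)

unchanged = all(base[i] == pert[i] for i in range(n - 1))
print("earlier positions unaffected by the last one:", unchanged)
```

Removing the mask (letting `j` range over all positions) would make every output depend on the perturbed token, which is the kind of information leakage that the triangular constraint is designed to prevent.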

Table 6. An ablation experiment on the effectiveness of causal constraints and spatial causal validity on the Indian Pines dataset. Best results are highlighted in bold.

5.4. Interpretability and Attention Analysis

To validate the causal interpretability of our Linearized Causal Attention (LCA), we visualize the spatial and spectral attention weights across different stages of the network, as shown in Figure 7.

Figure 7. Visualization of spatial (top) and spectral (bottom) attention weights across network stages. Brighter colors indicate higher attention scores. The progression from local → structured → global attention demonstrates hierarchical feature learning with physical plausibility.

The top row illustrates spatial attention, where each map shows how a query pixel attends to all key pixels within the input patch. The bottom row shows spectral attention, depicting how a query spectral band interacts with all key bands.

We observe a clear hierarchical evolution, as follows:

In the early stages (Stages 1–2), attention is highly local, focusing on neighboring pixels and adjacent bands—consistent with the local smoothness prior in HSI data.
In the middle stages (Stages 3–4), attention develops structured patterns: spatial blocks correspond to semantically coherent regions (e.g., crop rows or field boundaries), while spectral groups align with known absorption features (e.g., water vapor or vegetation indices).
In the final stage (Stage 5), attention becomes global and uniform, indicating that the classifier integrates information from the entire spatial–spectral context for robust decision making.

Critically, these explicit attention maps provide human-interpretable evidence of the model’s reasoning process—something unattainable with implicit models like Mamba. This transparency not only validates our causal design but also offers domain experts actionable insights for downstream analysis (e.g., identifying discriminative spectral bands or suspicious spatial regions).
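The local-to-global progression described above can also be quantified rather than only inspected visually. One possible approach, sketched here with hypothetical attention rows, is to compute the entropy of each attention distribution (low for peaked, local attention; maximal for uniform attention) and the attention-weighted mean distance from the query position. The function names and toy rows are assumptions for illustration.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (nats) of one attention row that sums to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_attention_distance(weights, query_index):
    """Attention-weighted average distance between the query and its keys."""
    return sum(w * abs(j - query_index) for j, w in enumerate(weights))


# Hypothetical attention rows for a query at index 2 of a 5-element sequence.
local_row = [0.0, 0.1, 0.8, 0.1, 0.0]      # early stage: peaked on neighbors
uniform_row = [0.2, 0.2, 0.2, 0.2, 0.2]    # final stage: global and uniform

for name, row in (("local", local_row), ("uniform", uniform_row)):
    print(f"{name:8s} entropy={attention_entropy(row):.3f}  "
          f"mean distance={mean_attention_distance(row, 2):.2f}")
```

Tracking these two statistics per stage would turn the qualitative reading of Figure 7 into a curve: both are expected to increase from Stage 1 to Stage 5 if attention indeed evolves from local to global.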

Computational Efficiency and Performance Trade-Offs

The computational complexities reported in Section 4.3 are theoretical asymptotic bounds (in Big-O notation), which characterize how the algorithm scales with respect to sequence length N and feature dimension d. In contrast, the efficiency metrics in our experiments, such as inference latency (ms), GPU memory consumption (MB), and FLOPs, are empirical measurements obtained under fixed model configurations (e.g., $d = 128$, $r = 64$) and specific hardware (e.g., NVIDIA A100).

We analyze computational requirements by comparing the floating point operations (FLOPs), parameter count, and inference time against representative baselines. As shown in Table 7, CAT requires 4.7 G FLOPs for processing $15 \times 15$ patches, comparable to HybridSN (4.2 G) and significantly lower than SSFTT (7.1 G). The linearized causal attention reduces memory complexity from $O(N^{2})$ to $O(N)$, enabling efficient processing of high-resolution HSIs. Training convergence analysis shows that CAT reaches 90% of its final accuracy faster than the Transformer baselines, indicating improved optimization dynamics through causal regularization.
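The reduction from quadratic to linear cost under a causal constraint can be illustrated with kernelized attention computed through running sums. The sketch below is not the LCA module itself: it swaps in the simple positive feature map $\phi(x) = \mathrm{elu}(x) + 1$ from the linear-attention literature as a stand-in, and all function names are assumptions. It checks that the running-sum form matches an explicit masked $N \times N$ computation with the same kernel.

```python
import math
import random


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(q, k, v):
    """Causal attention in O(N) using running sums S = sum phi(k) v^T, z = sum phi(k)."""
    d_k, d_v = len(q[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]   # state size is independent of N
    z = [0.0] * d_k
    out = []
    for qi, ki, vi in zip(q, k, v):
        fk, fq = feature_map(ki), feature_map(qi)
        for a in range(d_k):                # fold the current token into the state
            z[a] += fk[a]
            for c in range(d_v):
                S[a][c] += fk[a] * vi[c]
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][c] for a in range(d_k)) / denom
                    for c in range(d_v)])
    return out


def causal_quadratic_reference(q, k, v):
    """Same kernel attention, computed with explicit masked pairwise weights."""
    out = []
    for i, qi in enumerate(q):
        fq = feature_map(qi)
        w = [sum(a * b for a, b in zip(fq, feature_map(k[j]))) for j in range(i + 1)]
        total = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(i + 1)) / total
                    for c in range(len(v[0]))])
    return out


random.seed(1)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
fast = causal_linear_attention(tokens, tokens, tokens)
slow = causal_quadratic_reference(tokens, tokens, tokens)
max_diff = max(abs(a - b) for r1, r2 in zip(fast, slow) for a, b in zip(r1, r2))
print(f"max abs difference between linear and quadratic forms: {max_diff:.2e}")
```

The key design point is that the causal mask is enforced by the order of accumulation rather than by an explicit $N \times N$ mask, so the per-token work and the stored state depend on the feature dimension but not on the sequence length.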

Table 7. Comparison of computational complexity and performance on IndianPines dataset. Best results are highlighted in bold.

Beyond accuracy, CAT offers significant computational advantages. As shown in Table 7, CAT achieves the highest accuracy with only 1.26 G FLOPs and 5.01 M parameters, which is substantially more efficient than the Transformer-based SSFTT (2.67 G FLOPs) and even the CNN-based DCTN (1.48 G FLOPs). This efficiency, stemming from our Linearized Causal Attention, enables scalable processing of large-scale HSIs without sacrificing performance.

6. Conclusions

This paper has introduced the Causal Attention Transformer (CAT), a novel framework that integrates causal inference with deep learning for hyperspectral image classification. The proposed architecture addresses fundamental limitations in existing methods through three key innovations: a Causal Attention Mechanism that eliminates spurious correlations via triangular masking and axial decomposition, a Dual-Path Hierarchical Fusion module that adaptively integrates spectral and spatial features, and a Linearized Attention formulation that reduces computational complexity from quadratic to linear. Extensive experiments on three benchmark datasets have demonstrated that CAT achieves state-of-the-art performance with 94.25% OA on Indian Pines, 98.24% OA on Houston2013, and 99.08% OA on Pavia University, while providing enhanced interpretability through spectral–spatial causal maps and computational efficiency with 1.26 G FLOPs.

Future work will enhance the CAT framework by developing automated causal discovery to derive HSI causal structures without predefined assumptions, exploring multimodal fusion (e.g., LiDAR and SAR) to boost classification in complex settings, and employing domain adaptation for better cross-region and sensor generalization. We will also extend this framework to remote sensing tasks like change detection, semantic segmentation, and target detection, leveraging its interpretability and robustness.

Author Contributions

X.Y.: Methodology, conceptualization, investigation, and writing—review and editing. Z.S.: Conceptualization, investigation, and writing—review and editing. W.L.: Investigation and writing—review. Y.L.: Investigation and writing—review. H.Y.: Funding. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62301174.

Data Availability Statement

The original data presented in this study are openly available on GitHub at https://github.com/033labcodes/awesome-hyperspectral-datasets (accessed on 1 October 2025). Further inquiries regarding data availability can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the reviewers for their insightful comments and useful suggestions. All the datasets used in this paper are provided by public sources at https://github.com/033labcodes/awesome-hyperspectral-datasets (accessed on 1 October 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Figure 1. Architecture of the proposed Causal Attention Transformer (CAT) for hyperspectral image classification. The model processes HSI cubes through patch embedding, multi-stage causal attention blocks, and hierarchical feature fusion to generate classification maps while maintaining spectral–spatial causality.

Figure 2. Detailed structure of the Transformer block with Causal Attention Mechanisms. The module employs causal self-attention with triangular masking, depthwise separable convolutions, and axial decomposition to enforce temporal causality while preventing information leakage in spectral–spatial dependency modeling.


Table 1. Comparison with state-of-the-art methods on Indian Pines dataset (10% training samples). Methods are grouped by architecture type. Best results are highlighted in bold.

Class	Traditional Classifiers		CNN-Based Methods		Transformer-Based Methods				Mamba-Based Methods		CAT
Class	SVM	KNN	2D-CNN	3D-CNN	ViT	HiT	SSFTT	MorphFormer	MambaHSI	3DSSMamba	CAT
Alfalfa	0.00	66.62 ± 3.68	92.82 ± 4.80	69.95 ± 17.92	9.76 ± 6.34	91.14 ± 3.48	89.65 ± 6.91	82.13 ± 22.40	50.73 ± 29.81	35.12 ± 22.41	69.27 ± 11.76
Corn–notill	4.00 ± 2.21	66.25 ± 1.78	93.81 ± 1.97	88.61 ± 1.17	77.12 ± 0.65	94.49 ± 0.39	94.11 ± 1.08	93.38 ± 2.14	87.55 ± 3.04	90.41 ± 2.49	97.28 ± 1.91
Corn–mintill	0.00	62.36 ± 2.21	92.19 ± 1.77	86.74 ± 1.90	65.86 ± 0.48	94.43 ± 0.59	90.13 ± 2.66	91.28 ± 3.66	88.61 ± 4.08	87.98 ± 5.85	96.24 ± 2.19
Corn	25.00 ± 3.31	89.08 ± 2.54	97.94 ± 1.50	93.92 ± 2.44	91.55 ± 0.28	99.73 ± 0.13	94.90 ± 3.46	95.26 ± 4.04	93.43 ± 3.35	97.51 ± 3.62	94.37 ± 3.64
Grass–pasture	8.00 ± 1.98	90.95 ± 0.88	93.09 ± 3.32	93.44 ± 0.70	47.82 ± 1.11	92.89 ± 0.24	93.08 ± 2.52	94.72 ± 1.55	76.71 ± 11.88	84.16 ± 10.82	95.86 ± 2.16
Grass–trees	0.00	91.54 ± 2.33	95.65 ± 2.97	94.82 ± 0.67	93.30 ± 1.41	93.50 ± 0.63	95.98 ± 1.29	95.47 ± 1.38	93.79 ± 2.93	94.76 ± 1.80	95.86 ± 2.03
Grass–pasture–mowed	90.80 ± 1.55	96.84 ± 3.11	7.94 ± 18.29	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	54.69 ± 35.84	71.56 ± 18.01	1.60 ± 4.80	0.00 ± 0.00	7.60 ± 15.84
Hay–windrowed	0.00	60.73 ± 1.22	99.69 ± 0.55	98.87 ± 1.10	95.12 ± 0.22	99.61 ± 0.15	98.77 ± 1.35	99.81 ± 0.39	99.91 ± 0.21	99.93 ± 0.21	100.00 ± 0.00
Oats	1.60 ± 0.98	73.80 ± 22.15	73.30 ± 29.09	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	54.34 ± 32.04	21.28 ± 30.14	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
Soybean–notill	58.70 ± 1.85	77.90 ± 2.56	87.78 ± 1.60	83.25 ± 1.22	77.60 ± 0.36	89.48 ± 0.27	87.11 ± 1.77	88.80 ± 3.58	79.67 ± 2.62	78.81 ± 3.94	83.12 ± 2.98
Soybean–mintill	0.00	61.08 ± 0.77	96.26 ± 1.24	94.38 ± 0.51	96.38 ± 0.12	96.65 ± 0.06	96.78 ± 0.82	96.27 ± 0.59	97.76 ± 0.93	97.14 ± 0.97	98.79 ± 1.04
Soybean–clean	84.24 ± 5.14	91.32 ± 4.38	91.80 ± 2.21	89.11 ± 1.71	72.85 ± 1.34	93.85 ± 0.42	89.52 ± 3.35	87.66 ± 5.08	91.76 ± 3.67	87.17 ± 9.24	95.34 ± 0.66
Wheat	9.80 ± 2.21	53.58 ± 4.58	98.12 ± 1.32	86.71 ± 7.81	82.16 ± 1.43	97.11 ± 1.27	95.00 ± 3.62	94.35 ± 4.88	92.22 ± 3.74	96.27 ± 2.42	92.05 ± 4.98
Woods	90.90 ± 0.45	91.87 ± 0.86	98.28 ± 2.42	97.81 ± 0.58	99.74 ± 0.38	98.47 ± 0.17	98.67 ± 0.66	98.87 ± 0.31	99.10 ± 1.06	99.84 ± 0.18	99.64 ± 0.28
Buildings–Grass–Trees–Drives	13.80 ± 4.36	93.48 ± 1.55	97.82 ± 1.46	93.48 ± 2.25	59.37 ± 0.60	98.70 ± 0.47	96.08 ± 2.76	96.06 ± 1.48	86.74 ± 8.42	83.54 ± 7.47	93.08 ± 3.58
Stone–Steel–Towers	0.00	93.36 ± 10.12	52.74 ± 21.39	46.87 ± 17.62	41.67 ± 17.49	67.64 ± 2.82	39.20 ± 32.30	25.11 ± 33.47	22.98 ± 19.98	15.48 ± 26.23	48.81 ± 14.45
OA (%)	28.95 ± 0.84	76.05 ± 1.21	93.25 ± 0.49	91.48 ± 0.52	82.81 ± 0.21	91.02 ± 0.95	94.09 ± 0.97	94.03 ± 0.90	90.56 ± 1.03	90.83 ± 1.75	94.25 ± 0.44
$κ$ (%)	45.78 ± 1.23	73.95 ± 1.15	92.28 ± 1.61	90.25 ± 0.59	80.16 ± 1.19	89.72 ± 1.07	93.25 ± 1.10	93.18 ± 1.03	90.56 ± 1.03	89.51 ± 2.03	93.43 ± 0.48

Table 2. Comparison with state-of-the-art methods on Houston2013 dataset (10% training samples). Methods are grouped by architecture type. Best results are highlighted in bold.

Class	Traditional Classifiers		CNN-Based Methods		Transformer-Based Methods				Mamba-Based Methods		CAT
Class	SVM	KNN	2D-CNN	3D-CNN	ViT	HiT	SSFTT	MorphFormer	MambaHSI	3DSSMamba	CAT
Healthy Grass	92.43 ± 1.25	98.88 ± 0.42	90.76 ± 1.88	93.88 ± 0.90	87.43 ± 3.25	94.49 ± 0.52	96.42 ± 1.16	85.97 ± 8.75	91.33 ± 2.17	87.30 ± 10.73	94.43 ± 1.69
Stressed grass	94.19 ± 1.80	98.90 ± 0.38	86.26 ± 8.27	94.03 ± 1.64	80.16 ± 5.76	91.22 ± 2.10	97.16 ± 1.42	88.97 ± 5.58	83.59 ± 2.71	90.21 ± 4.40	97.85 ± 0.73
Synthetic Grass	99.06 ± 0.35	99.62 ± 0.21	95.85 ± 3.67	96.81 ± 1.89	97.95 ± 0.85	98.97 ± 0.29	99.30 ± 0.58	91.37 ± 13.84	98.64 ± 0.56	97.70 ± 2.80	95.47 ± 1.46
Water	95.88 ± 0.85	99.53 ± 0.18	91.69 ± 3.12	90.80 ± 2.16	86.50 ± 1.79	86.87 ± 0.82	93.61 ± 3.07	90.85 ± 4.25	87.48 ± 2.36	93.58 ± 2.23	87.71 ± 2.11
Residential	94.76 ± 1.50	98.19 ± 0.55	78.74 ± 17.12	92.90 ± 1.13	87.38 ± 1.99	91.87 ± 0.83	97.52 ± 0.80	85.78 ± 24.73	87.28 ± 2.68	92.51 ± 3.82	98.40 ± 0.81
Commercial	90.71 ± 2.10	99.33 ± 0.25	93.01 ± 2.65	92.73 ± 1.31	90.55 ± 3.34	96.41 ± 0.64	97.88 ± 1.24	90.76 ± 9.56	94.05 ± 1.95	94.16 ± 3.55	96.79 ± 1.83
Road	74.45 ± 3.50	95.04 ± 0.90	73.95 ± 14.36	91.56 ± 1.34	86.62 ± 1.57	89.78 ± 0.99	96.70 ± 1.56	87.54 ± 7.99	87.98 ± 1.78	87.78 ± 2.93	97.69 ± 1.22
Highway	70.56 ± 4.20	95.15 ± 1.10	91.92 ± 7.92	98.51 ± 0.56	96.43 ± 3.22	97.93 ± 0.51	99.78 ± 0.38	93.89 ± 9.76	95.01 ± 3.44	97.50 ± 1.56	100.00 ± 0.00
Railway	70.07 ± 3.80	88.06 ± 1.50	85.29 ± 7.92	94.61 ± 1.93	93.86 ± 2.51	99.37 ± 0.45	99.76 ± 0.44	91.69 ± 16.26	93.32 ± 4.12	94.73 ± 3.22	99.73 ± 0.81
Parking Lot 1	65.81 ± 2.80	91.05 ± 1.20	95.79 ± 2.71	96.66 ± 0.55	91.42 ± 4.91	98.68 ± 0.21	98.70 ± 1.24	88.63 ± 13.89	93.49 ± 3.96	95.83 ± 3.47	98.62 ± 0.78
Parking Lot 2	66.60 ± 3.10	90.37 ± 1.40	86.51 ± 8.08	90.13 ± 1.46	91.84 ± 1.55	95.93 ± 1.53	98.54 ± 1.23	92.25 ± 6.66	89.01 ± 4.03	95.80 ± 2.87	97.82 ± 2.26
Tennis Court	12.69 ± 2.50	64.09 ± 3.20	92.98 ± 6.16	99.70 ± 0.33	99.20 ± 0.85	99.99 ± 0.04	99.67 ± 0.47	99.04 ± 1.46	98.43 ± 2.66	97.96 ± 3.28	100.00 ± 0.00
Running Track	86.85 ± 1.90	98.69 ± 0.45	91.48 ± 5.60	95.62 ± 2.21	96.93 ± 1.89	98.39 ± 0.33	98.95 ± 0.71	93.32 ± 7.49	98.14 ± 0.72	96.65 ± 3.12	99.97 ± 0.10
OA (%)	78.29 ± 1.80	93.70 ± 0.65	88.19 ± 4.75	94.68 ± 0.55	90.27 ± 1.69	95.20 ± 0.33	98.15 ± 0.53	90.98 ± 7.71	91.58 ± 1.50	93.52 ± 1.88	98.24 ± 0.28
$κ$ (%)	78.11 ± 1.90	93.89 ± 0.70	87.24 ± 5.13	94.24 ± 0.60	89.48 ± 1.83	94.81 ± 0.35	98.00 ± 0.58	90.24 ± 8.36	90.89 ± 1.62	92.99 ± 2.04	98.10 ± 0.26

Table 3. Comparison with state-of-the-art methods on the PaviaU dataset (10% training samples). Methods are grouped by architecture type. Best results are highlighted in bold.

Class	Traditional Classifiers		CNN-Based Methods		Transformer-Based Methods				Mamba-Based Methods		CAT
Class	SVM	KNN	2D-CNN	3D-CNN	ViT	HiT	SSFTT	Morphformer	MambaHSI	3DSSMamba	CAT
Asphalt	88.54 ± 1.20	92.33 ± 0.75	95.67 ± 1.91	97.42 ± 0.35	98.31 ± 0.33	98.32 ± 0.25	98.07 ± 2.25	94.76 ± 3.42	96.96 ± 1.53	90.49 ± 5.36	99.40 ± 0.27
Meadows	94.24 ± 0.85	94.33 ± 0.60	99.45 ± 0.21	99.78 ± 0.09	99.63 ± 0.11	99.73 ± 0.06	99.85 ± 0.03	99.67 ± 0.19	99.65 ± 0.12	97.86 ± 2.34	99.91 ± 0.06
Gravel	65.72 ± 2.10	77.04 ± 1.50	96.64 ± 2.57	97.81 ± 0.37	97.34 ± 0.78	98.44 ± 0.50	99.09 ± 0.46	86.82 ± 23.92	98.05 ± 0.38	87.19 ± 10.57	99.48 ± 0.33
Trees	93.98 ± 1.50	92.80 ± 0.95	86.19 ± 5.96	90.63 ± 1.93	93.09 ± 0.52	93.82 ± 0.50	94.53 ± 3.53	87.85 ± 4.12	90.41 ± 3.92	83.68 ± 5.51	95.87 ± 0.56
Painted metal sheets	99.52 ± 0.30	99.47 ± 0.25	92.85 ± 0.31	93.03 ± 0.99	94.67 ± 1.15	94.56 ± 0.53	93.94 ± 1.16	94.58 ± 2.46	96.20 ± 2.05	94.31 ± 1.43	95.66 ± 1.25
Bare Soil	77.19 ± 2.80	79.62 ± 1.80	99.66 ± 0.22	99.78 ± 0.08	99.83 ± 0.20	99.66 ± 0.12	99.92 ± 0.11	99.84 ± 0.18	99.61 ± 0.31	92.92 ± 9.40	99.95 ± 0.08
Bitumen	53.96 ± 3.50	82.80 ± 2.20	95.25 ± 3.12	98.91 ± 0.67	98.85 ± 0.23	99.23 ± 0.10	98.91 ± 0.49	90.65 ± 20.10	99.10 ± 0.37	84.11 ± 15.21	99.96 ± 0.07
Self-Blocking Bricks	84.32 ± 1.90	84.73 ± 1.20	92.73 ± 5.99	97.04 ± 1.40	98.07 ± 0.42	98.79 ± 0.24	99.12 ± 0.32	92.82 ± 5.67	97.58 ± 0.74	91.45 ± 5.16	99.57 ± 0.21
Shadows	100.00 ± 0.00	99.98 ± 0.05	75.97 ± 4.87	79.35 ± 2.80	83.99 ± 2.13	84.66 ± 1.50	85.15 ± 4.00	86.53 ± 2.07	88.35 ± 4.33	80.90 ± 6.91	88.75 ± 1.35
OA (%)	84.16 ± 1.50	89.23 ± 0.85	96.28 ± 1.62	97.75 ± 0.33	98.24 ± 0.14	98.46 ± 0.10	98.58 ± 0.65	96.33 ± 2.17	97.93 ± 0.66	93.30 ± 4.05	99.08 ± 0.07
$κ$ (%)	85.02 ± 1.40	87.36 ± 0.90	95.08 ± 2.14	97.03 ± 0.44	97.67 ± 0.18	97.96 ± 0.13	98.12 ± 0.86	95.13 ± 2.90	97.25 ± 0.87	91.04 ± 5.49	98.79 ± 0.09
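
The OA and $κ$ rows in the comparison tables summarize a confusion matrix. As a minimal sketch, assuming the standard definitions of overall accuracy and Cohen's kappa (the paper's own evaluation code is not shown here), both can be computed as follows. The function name `overall_accuracy_and_kappa` and the toy matrix are illustrative and not taken from the paper.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (OA, kappa) in percent for a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    p_o = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: sum over classes of (row marginal * column marginal).
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (p_o - p_e) / (1.0 - p_e)
    return 100.0 * p_o, 100.0 * kappa


# Hypothetical 3-class confusion matrix (illustrative values only).
toy = [[50, 2, 3],
       [4, 40, 6],
       [1, 5, 39]]
oa, kappa = overall_accuracy_and_kappa(toy)
print(f"OA = {oa:.2f}%, kappa = {kappa:.2f}%")
```

Under these definitions the chance-agreement term is non-negative, so $κ$ computed from a given confusion matrix does not exceed OA for that matrix.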

Table 4. Ablation study on input patch size across three benchmark datasets. Best results are highlighted in bold.

Sizes	IndianPines		PaviaU		Houston2013
Sizes	OA	$κ$	OA	$κ$	OA	$κ$
$9 \times 9$	95.88 ± 0.46	95.30 ± 0.40	99.64 ± 0.03	99.53 ± 0.04	98.84 ± 0.18	98.75 ± 0.19
$11 \times 11$	95.80 ± 0.25	94.99 ± 0.28	99.55 ± 0.02	99.40 ± 0.03	98.54 ± 0.17	98.40 ± 0.19
$13 \times 13$	95.00 ± 0.16	94.28 ± 0.18	99.46 ± 0.05	99.29 ± 0.07	98.44 ± 0.17	98.31 ± 0.18
$15 \times 15$	94.25 ± 0.44	93.43 ± 0.48	99.08 ± 0.07	98.79 ± 0.09	98.24 ± 0.28	98.10 ± 0.26
$17 \times 17$	92.79 ± 0.63	91.60 ± 0.42	98.76 ± 0.20	98.36 ± 0.26	97.17 ± 0.30	96.92 ± 0.28
$19 \times 19$	91.40 ± 0.52	90.16 ± 0.60	98.12 ± 0.42	97.51 ± 0.55	96.69 ± 0.32	96.45 ± 0.35

Table 5. Component ablation analysis on Indian Pines dataset. The results demonstrate the complementary contribution of spatial and spectral causal pathways, with the full CAT model achieving optimal performance through adaptive feature fusion. Best results are highlighted in bold.

Methods	Spatial	Spectral	OA (%)	$κ$ (%)
DCTN	×	×	93.78 ± 0.48	92.89 ± 0.51
SSFTT	×	×	90.68 ± 1.69	89.33 ± 1.83
CAT (spatial only)	√	×	93.69 ± 0.19	92.78 ± 0.21
CAT (spectral only)	×	√	93.73 ± 0.48	92.83 ± 0.52
CAT (full)	√	√	94.25 ± 0.44	93.43 ± 0.48

Table 6. Ablation of the causal constraints (triangular masking and axial decomposition) on the Indian Pines dataset. Best results are highlighted in bold.

Methods	OA (%)	$κ$ (%)
DCTN	93.78 ± 0.48	92.89 ± 0.51
SSFTT	90.68 ± 1.69	89.33 ± 1.83
CAT (Removing triangular masking)	93.64 ± 0.17	92.73 ± 0.20
CAT (Disabling axial decomposition)	93.75 ± 0.24	92.86 ± 0.28
CAT (full)	94.25 ± 0.44	93.43 ± 0.48
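
To make the "removing triangular masking" row concrete, the sketch below contrasts masked and unmasked attention weights on a tiny example. It assumes the usual lower-triangular convention, in which position i may attend only to positions j ≤ i. The paper's exact mask, its axial decomposition, and its linearized variant are not reproduced here, and the function name `causal_attention_weights` and the score values are hypothetical.

```python
import math


def causal_attention_weights(scores, use_mask=True):
    """Row-wise softmax over attention scores.

    With use_mask=True a lower-triangular mask is applied, so position i
    only attends to positions j <= i (assumed convention).
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(len(row)) if (j <= i or not use_mask)]
        peak = max(row[j] for j in allowed)  # subtract max for stability
        exps = {j: math.exp(row[j] - peak) for j in allowed}
        norm = sum(exps.values())
        # Masked-out positions receive exactly zero weight.
        weights.append([exps.get(j, 0.0) / norm for j in range(len(row))])
    return weights


# Hypothetical 3x3 score matrix (illustrative values only).
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
masked = causal_attention_weights(scores)
unmasked = causal_attention_weights(scores, use_mask=False)
```

In `masked`, every entry above the diagonal is zero, whereas `unmasked` lets each position attend to all others. That difference is what the masking ablation toggles.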

Table 7. Comparison of computational complexity and performance on the Indian Pines dataset. Best results are highlighted in bold.

Method	FLOPs (G)	Params (MB)	Training (s)	Testing (s)	OA (%)	$κ$ (%)
DCTN	1.48	45.32	3846.42	22.01	93.78 ± 0.50	92.89 ± 0.44
SSFTT	2.67	83.33	5602.75	17.96	90.68 ± 0.82	89.33 ± 0.95
3DSSMamba	0.122	0.194	5515.38	12.91	91.40 ± 0.51	90.16 ± 0.68
MambaHSI	0.012	0.114	3271.72	11.07	90.48 ± 0.61	89.16 ± 0.55
CAT	1.26	5.01	1337.58	7.19	94.25 ± 0.44	93.43 ± 0.48
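
The timing columns of Table 7 are easier to compare as ratios relative to CAT. The short sketch below does that arithmetic with the training and testing times copied from the table. The helper name `relative_cost` is illustrative, and the ratios are derived quantities that the paper does not report.

```python
# (training seconds, testing seconds) copied from Table 7.
timings = {
    "DCTN": (3846.42, 22.01),
    "SSFTT": (5602.75, 17.96),
    "3DSSMamba": (5515.38, 12.91),
    "MambaHSI": (3271.72, 11.07),
    "CAT": (1337.58, 7.19),
}


def relative_cost(reference, table):
    """Return each method's (train, test) time as a multiple of the reference."""
    ref_train, ref_test = table[reference]
    return {
        name: (train / ref_train, test / ref_test)
        for name, (train, test) in table.items()
        if name != reference
    }


ratios = relative_cost("CAT", timings)
for name, (train_ratio, test_ratio) in ratios.items():
    print(f"{name}: {train_ratio:.2f}x training time, {test_ratio:.2f}x testing time")
```

A ratio above 1 means the method takes longer than CAT on that measure. With the values in Table 7, every ratio is above 1 for both training and testing.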

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
