Article

MambaKAN: An Interpretable Framework for Alzheimer’s Disease Diagnosis via Selective State Space Modeling of Dynamic Functional Connectivity

1
Artificial Intelligence College, Zhejiang Industry & Trade Vocational College, Wenzhou 325000, China
2
College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
*
Author to whom correspondence should be addressed.
Brain Sci. 2026, 16(4), 421; https://doi.org/10.3390/brainsci16040421
Submission received: 25 March 2026 / Revised: 14 April 2026 / Accepted: 16 April 2026 / Published: 17 April 2026

Abstract

Background/Objectives: Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder that imposes a profound burden on global public health. While resting-state functional magnetic resonance imaging (rs-fMRI)-based dynamic functional connectivity (dFC) analysis has demonstrated promise in capturing time-varying brain network abnormalities, existing deep learning methods suffer from three fundamental limitations: (1) an inability to model temporal dependencies across dynamic connectivity windows, (2) reliance on post hoc black-box explainability tools, and (3) misalignment between feature learning and classification objectives. Methods: To address these challenges, we propose MambaKAN, an end-to-end interpretable framework integrating a Variational Autoencoder (VAE), a Selective State Space Model (Mamba), and a Kolmogorov–Arnold Network (KAN). The VAE encodes each dFC snapshot into a compact latent representation, preserving nonlinear connectivity patterns. The Mamba encoder captures long-range temporal dynamics across the sequence of latent representations via input-selective state transitions. The KAN classifier provides intrinsic interpretability through learnable B-spline activation functions, enabling direct visualization of how latent features influence diagnostic decisions without post-hoc approximation. The entire pipeline is trained end-to-end with a joint loss function that aligns feature learning with classification. Results: Evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset across five classification tasks (CN vs. AD, CN vs. EMCI, EMCI vs. LMCI, LMCI vs. AD, and four-class), MambaKAN achieves accuracies of 95.1%, 89.8%, 84.0%, 86.7%, and 70.5%, respectively, outperforming strong baselines including LSTM, Transformer, and MLP-based variants. 
Conclusions: Comprehensive ablation studies confirm the indispensable contribution of each module, and the three-layer interpretability analysis reveals key temporal patterns and brain regions associated with AD progression.

1. Introduction

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder characterized by irreversible cognitive decline, memory impairment, and behavioral dysregulation [1]. Its neuropathological hallmarks (amyloid-β plaque accumulation, neurofibrillary tau tangles, synaptic loss, and cortical atrophy) manifest insidiously over years before clinical symptoms appear, making early detection essential yet difficult [2]. The global prevalence of dementia, predominantly driven by AD, has exceeded 55 million individuals and is projected to reach 152 million by 2050 [3], imposing substantial healthcare and socioeconomic burdens worldwide [4].
The AD continuum spans from cognitively normal (CN) individuals to early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and clinical AD. Approximately 10–15% of MCI patients convert to AD annually, and early intervention at the MCI stage can meaningfully slow disease progression [5]. However, the subtle and overlapping symptomatology between adjacent stages—particularly EMCI versus LMCI—severely limits the sensitivity of conventional neuropsychological assessments [6,7], motivating neuroimaging-based approaches for early pathological detection.
Resting-state functional magnetic resonance imaging (rs-fMRI), which measures the blood oxygen level-dependent (BOLD) signal during rest, has emerged as a particularly informative modality for probing brain-wide functional organization in AD [8]. By characterizing synchronous neural activity patterns across spatially distributed regions, rs-fMRI reveals functional connectivity (FC) disruptions that precede structural atrophy and cognitive symptom onset [9]. Notably, the default mode network (DMN), a set of regions preferentially active during rest, exhibits characteristic hypo-connectivity in AD patients compared to healthy controls, providing a sensitive and non-invasive imaging biomarker [10].
Traditional FC analyses assume stationarity—that connectivity patterns remain constant throughout the scanning session. However, converging evidence from neuroimaging studies indicates that brain functional connectivity fluctuates meaningfully over time, and that these dynamic fluctuations carry disease-relevant information beyond static summaries [11,12]. Dynamic functional connectivity (dFC), typically constructed via sliding-window Pearson correlation, captures this temporal variability by generating a sequence of connectivity matrices, each representing brain-network interactions within a short temporal window [13]. Studies have demonstrated that dFC analysis is more sensitive than static FC for detecting early-stage AD pathology, particularly in distinguishing EMCI from CN subjects where pathological changes are subtle and transient [14].
Despite its appeal, dFC-based AD classification faces three key technical challenges. First, each dFC window of a 116-region atlas yields a 6670-dimensional upper-triangular vector, creating a severe curse of dimensionality that demands effective nonlinear dimensionality reduction [14]. Second, the temporal ordering of dFC windows encodes trajectory information about brain state transitions, yet most methods discard this structure via simple pooling, or use sequential models (e.g., LSTM) that struggle with long-range dependencies [15,16]. Third, clinically acceptable models must be interpretable—a diagnosis without an explanation is difficult to validate or trust [17].
To address the dimensionality challenge, Variational Autoencoders (VAEs) [18] have been proposed as an unsupervised dimensionality reduction technique for dFC features. By learning a smooth, low-dimensional latent space constrained by KL divergence regularization, VAEs extract robust, noise-resistant representations that capture nonlinear connectivity patterns inaccessible to linear methods such as PCA [15].
For temporal sequence modeling, the recently proposed Mamba architecture [19], a Selective State Space Model (S6), offers a compelling alternative to recurrent neural networks and Transformer-based [20] attention mechanisms. Mamba introduces input-dependent selectivity: the projection matrices $B$ and $C$ and the time-step $\Delta$ are dynamically computed from the input (so the discretized state transition also varies with the input through $\Delta$), enabling the model to selectively retain or discard information based on content relevance. Crucially, Mamba achieves linear computational complexity with respect to sequence length, in contrast to the quadratic cost of Transformer self-attention, making it particularly suitable for long sequences of dFC windows. While Mamba has demonstrated strong performance in natural language processing and genomics, its application to fMRI temporal dynamics for disease classification remains underexplored.
For interpretable classification, Kolmogorov–Arnold Networks (KAN) [21], inspired by the Kolmogorov–Arnold representation theorem, replace the fixed node activations of MLPs with learnable univariate B-spline functions on edges. Each edge function $\varphi_{l,j,i}(x)$ is independently parameterized and directly visualizable, encoding the contribution of each latent feature to each class logit as an explicit, inspectable curve. This is fundamentally different from post hoc methods such as SHAP, which approximate feature contributions externally and incur additional computational cost. KAN’s intrinsic transparency is particularly valuable in clinical settings where the mechanism of a prediction matters as much as its accuracy.
In this paper, we propose MambaKAN, a unified, end-to-end interpretable framework that integrates VAE, Mamba, and KAN to address the aforementioned limitations. Specifically, the framework comprises the following components:
  • A VAE-based dynamic window encoder maps each dFC snapshot independently into a compact 128-dimensional latent vector, effectively reducing the per-window feature dimension from 6670 to 128 while preserving nonlinear connectivity structure and suppressing noise via KL regularization.
  • A Mamba temporal encoder processes the sequence of 54 latent vectors (1 per dFC window) using selective state space dynamics, learning which temporal windows are most diagnostically relevant and capturing long-range dependencies across the entire scanning session with linear computational cost.
  • A KAN classifier maps the temporal context vector to diagnostic class probabilities through learnable B-spline activations, providing fully transparent, intrinsic interpretability without any post hoc approximation.
  • An end-to-end joint training strategy with differential learning rates ensures that VAE features are refined toward classification objectives while the Mamba and KAN modules learn effectively from a rich, pre-trained latent space.
The main contributions of this work are as follows:
  • To our knowledge, this work is among the first to apply a Selective State Space Model (Mamba) to the temporal dynamics of dFC for AD classification, and it outperforms LSTM and Transformer-based alternatives in capturing clinically relevant brain state trajectories.
  • We integrate KAN as the classification backbone, providing intrinsic interpretability through visualizable activation functions. This enables direct inspection of how each latent dimension influences classification decisions, without relying on post hoc approximation methods.
  • We design a principled two-phase joint training strategy with a composite loss function that tightly couples reconstruction fidelity and classification accuracy, ensuring that feature learning is task-aligned and yielding representations that outperform unsupervised pre-training followed by fixed feature extraction.
  • We provide a multi-layered interpretability analysis that combines Mamba’s temporal selectivity scores with KAN activation curve visualization and gradient-based brain region attribution, offering complementary neuroscientific insights at the temporal, functional, and anatomical levels.
  • We conduct comprehensive experiments across five clinically relevant classification tasks on the ADNI dataset, including ablation studies and sensitivity analyses, demonstrating consistent improvements over seven competitive baselines.
The remainder of this paper is organized as follows. Section 2 describes the ADNI dataset and preprocessing pipeline. Section 3 presents the complete MambaKAN architecture and training procedure. Section 4 reports experimental results, baseline comparisons, and ablation studies. Section 5 presents the multi-layered interpretability analysis. Section 6 analyzes the computational complexity and inference latency of all compared models. Section 7 discusses findings, limitations, and future directions. Section 8 concludes the paper.

2. Materials

2.1. Dataset Description

In this study, we utilized rs-fMRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset [22,23]. ADNI is a large-scale, multi-center, longitudinal study designed to advance the understanding of AD pathogenesis and progression through the integration of neuroimaging, biomarker profiling, and cognitive assessment. The dataset comprises 174 unique subjects representing four diagnostic categories: 48 cognitively normal (CN), 50 with early mild cognitive impairment (EMCI), 45 with late mild cognitive impairment (LMCI), and 31 with AD. Each subject contributed one or more longitudinal scans, yielding a total of 563 scan sessions distributed as follows: 154 CN, 165 EMCI, 145 LMCI, and 99 AD. All scans were acquired on 3.0 T MRI scanners (Philips Healthcare, Best, The Netherlands; Siemens Healthineers, Erlangen, Germany; GE Healthcare, Chicago, IL, USA) across multiple medical centers under standardized imaging parameters: flip angle, 80°; image resolution, 2.29–3.31 mm; slice thickness, 3.31 mm; echo time (TE), 30 ms; repetition time (TR) range, 2.2–3.1 s; imaging matrix, 64 × 64 pixels; field of view, 256 × 256 mm²; total scanning duration, 7 min, encompassing 140 image volumes.

2.2. Data Preprocessing

The preprocessing pipeline for rs-fMRI data follows established protocols adapted from related work [15,16,19]. Standard preprocessing was performed using FSL FEAT software (v6.0, FMRIB, University of Oxford, Oxford, UK), comprising the following sequential steps: (i) discarding the first three volumes to allow longitudinal magnetization equilibration; (ii) slice-timing correction to account for temporal acquisition offsets between slices; (iii) head motion correction using rigid-body registration; (iv) band-pass temporal filtering (0.01–0.08 Hz) to isolate low-frequency BOLD fluctuations; and (v) nuisance covariate regression including white matter signal, cerebrospinal fluid signal, global signal, and the six motion parameters. Subjects exhibiting head motion exceeding 2 mm translational displacement or 2° rotational displacement were excluded from analysis to ensure data quality.
Structural preprocessing involved skull stripping from T1-weighted sMRI images, followed by nonlinear registration to Montreal Neurological Institute (MNI) standard space. Functional images were subsequently smoothed using a 6 mm full-width at half-maximum (FWHM) isotropic Gaussian kernel to improve the signal-to-noise ratio. Regional parcellation was performed using the Automated Anatomical Labeling (AAL) atlas, which partitions the brain into 116 anatomically defined regions of interest (ROIs). The mean BOLD time series was extracted for each ROI, yielding a subject-level data matrix $X \in \mathbb{R}^{116 \times T}$, where $T$ denotes the number of time points following preprocessing. Individual ROI time series were z-score normalized across time:
$$x_{i,\mathrm{norm}}(t) = \frac{x_i(t) - \mu_i}{\sigma_i}, \tag{1}$$
where $\mu_i$ and $\sigma_i$ denote the temporal mean and standard deviation of the $i$-th ROI time series, respectively.

2.3. Dynamic Functional Connectivity Construction

To capture the time-varying nature of brain functional interactions, we constructed dFC matrices using a sliding-window approach. The normalized ROI time series of each subject was segmented into overlapping windows of length $L_w = 30$ time points (approximately 66–93 s for TRs of 2.2–3.1 s) with a step size of $s = 2$ time points (approximately 4.4–6.2 s). This parameterization balances temporal resolution against signal stability: shorter windows risk statistical instability in correlation estimation, while longer windows smooth out genuine dynamic fluctuations [13]. The resulting window count per subject is
$$K = \left\lfloor \frac{T - L_w}{s} \right\rfloor + 1 = 54, \tag{2}$$
yielding $K = 54$ overlapping windows per subject (with $T = 137$ volumes remaining after the first three of 140 are discarded, $\lfloor 107/2 \rfloor + 1 = 54$). For the $k$-th window $S_k \in \mathbb{R}^{116 \times 30}$, the dFC matrix $R_k \in \mathbb{R}^{116 \times 116}$ is computed via pairwise Pearson correlation:
$$r_{k,ij} = \frac{\sum_{t=1}^{L_w} \left( x_i^{(k)}(t) - \bar{x}_i^{(k)} \right) \left( x_j^{(k)}(t) - \bar{x}_j^{(k)} \right)}{\sqrt{\sum_{t=1}^{L_w} \left( x_i^{(k)}(t) - \bar{x}_i^{(k)} \right)^2} \cdot \sqrt{\sum_{t=1}^{L_w} \left( x_j^{(k)}(t) - \bar{x}_j^{(k)} \right)^2}}, \tag{3}$$
where $x_i^{(k)}(t)$ is the BOLD signal of the $i$-th ROI at the $t$-th time point within window $k$, and $\bar{x}_i^{(k)}$ is its within-window mean. Since $R_k$ is symmetric with a unit diagonal, we extract only the upper-triangular off-diagonal elements to form a non-redundant feature vector. For an atlas with $N = 116$ ROIs, this yields $N(N-1)/2 = 6670$ features per window. The vectorized dFC representation for window $k$ is denoted as $v_k \in \mathbb{R}^{6670}$. The full dFC sequence for a subject is thus $\{v_k\}_{k=1}^{54} \in \mathbb{R}^{54 \times 6670}$.
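The construction above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' pipeline: the function name `sliding_window_dfc` and the random stand-in for the BOLD matrix are assumptions made for the example.

```python
import numpy as np

def sliding_window_dfc(x, win_len=30, step=2):
    """x: (n_rois, T) ROI time series -> (K, n_rois*(n_rois-1)/2) dFC features."""
    n_rois, T = x.shape
    # z-score each ROI across time, as in Section 2.2
    x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    n_windows = (T - win_len) // step + 1
    iu = np.triu_indices(n_rois, k=1)          # upper triangle, diagonal excluded
    feats = np.empty((n_windows, iu[0].size))
    for k in range(n_windows):
        seg = x[:, k * step : k * step + win_len]
        feats[k] = np.corrcoef(seg)[iu]        # Pearson correlation between ROI rows
    return feats

# Synthetic stand-in: 116 ROIs, 137 time points (140 volumes minus 3 discarded)
rng = np.random.default_rng(0)
bold = rng.standard_normal((116, 137))
dfc = sliding_window_dfc(bold)
print(dfc.shape)  # (54, 6670)
```

With $T = 137$, $L_w = 30$, and $s = 2$, the sketch reproduces the 54 × 6670 sequence shape stated in the text.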

2.4. Dataset Partitioning

To prevent data leakage, all dataset splits were performed at the subject level: all scan sessions of the same subject are assigned exclusively to one partition. We employ a fixed 70/10/20 (train/validation/test) split with pre-determined indices saved as JSON files for full reproducibility. The 174 subjects are distributed across the four diagnostic categories as follows: 48 CN, 50 EMCI, 45 LMCI, and 31 AD. The fixed partitioning maintains approximate class balance across training, validation, and test sets, with the smallest class (AD) represented in all three partitions to ensure reliable performance evaluation. Table 1 provides the per-class subject count in each partition, offering full transparency over the distribution of the smallest diagnostic group.
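A subject-level split of this kind can be sketched as follows. The helper `subject_level_split` and the synthetic subject IDs are hypothetical, and the sketch shuffles subjects without the per-class balancing the paper describes, so it illustrates only the leakage-prevention idea.

```python
import random

def subject_level_split(scan_subjects, fractions=(0.7, 0.1, 0.2), seed=0):
    """scan_subjects: one subject ID per scan. Returns scan indices per partition."""
    subjects = sorted(set(scan_subjects))
    random.Random(seed).shuffle(subjects)      # fixed seed gives reproducible indices
    n_train = round(fractions[0] * len(subjects))
    n_val = round(fractions[1] * len(subjects))
    groups = {
        "train": set(subjects[:n_train]),
        "val": set(subjects[n_train:n_train + n_val]),
        "test": set(subjects[n_train + n_val:]),
    }
    # Every scan follows its subject, so no subject straddles two partitions
    return {name: [i for i, s in enumerate(scan_subjects) if s in ids]
            for name, ids in groups.items()}

# Synthetic example: 563 scans spread over 174 subjects
scan_subjects = [f"S{i % 174:03d}" for i in range(563)]
parts = subject_level_split(scan_subjects)
```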
Classification tasks are defined as five settings: (T1) CN vs. AD, (T2) CN vs. EMCI, (T3) EMCI vs. LMCI, (T4) LMCI vs. AD, and (T5) four-class (CN/EMCI/LMCI/AD). All classification metrics are computed at the subject level: the 54 window-level predictions for each subject are aggregated via probability averaging to produce a single subject-level label, which is then compared against the ground truth. This ensures that the high temporal autocorrelation between adjacent windows (step size $s = 2$ yields $(L_w - s)/L_w \approx 93\%$ overlap) does not inflate performance estimates.
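Probability averaging as described above might look like the following sketch, where `subject_level_prediction` is a hypothetical helper and the probabilities are made-up values for a binary task.

```python
import numpy as np

def subject_level_prediction(window_probs):
    """window_probs: (n_windows, n_classes) softmax outputs for one subject."""
    mean_probs = np.asarray(window_probs, dtype=float).mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

# Three windows, two classes: the averaged probabilities favour class 1
label, probs = subject_level_prediction([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```

Averaging probabilities rather than taking a majority vote lets confident windows outweigh uncertain ones.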

3. Methods

3.1. Overview of MambaKAN

Figure 1 illustrates the MambaKAN architecture, comprising four sequential stages: (1) dFC construction from rs-fMRI BOLD time series; (2) per-window VAE encoding of each dFC snapshot into a 128-dimensional latent vector; (3) Mamba temporal modeling over the 54-element latent sequence; and (4) KAN classification with intrinsically interpretable B-spline activations. The pipeline is trained end-to-end via a joint objective balancing reconstruction and classification.

3.2. Variational Autoencoder for Per-Window Feature Extraction

3.2.1. Encoder Architecture

The VAE encoder $q_\phi(z \mid v)$ maps each 6670-dimensional dFC window vector $v_k$ to a posterior distribution over a 128-dimensional latent space. The encoder is implemented as a fully connected multi-layer network with progressive dimensionality reduction:
$$h^{(1)} = \mathrm{ReLU}(W_1 v + b_1), \quad h^{(1)} \in \mathbb{R}^{2048}, \tag{4}$$
$$h^{(\ell)} = \mathrm{ReLU}(W_\ell h^{(\ell-1)} + b_\ell), \quad \ell = 2, 3, 4, \tag{5}$$
with hidden dimensions progressing as $[6670 \to 2048 \to 1024 \to 512 \to 256]$. The final hidden layer feeds two parallel linear projections producing the following variational parameters:
$$\mu_z = W_\mu h^{(4)} + b_\mu, \quad \log \sigma_z^2 = W_\sigma h^{(4)} + b_\sigma, \quad \mu_z, \log \sigma_z^2 \in \mathbb{R}^{128}. \tag{6}$$
The latent variable $z \in \mathbb{R}^{128}$ is sampled via the reparameterization trick to allow gradient-based optimization:
$$z = \mu_z + \sigma_z \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{7}$$
where $\odot$ denotes element-wise multiplication.

3.2.2. Decoder Architecture

The decoder $p_\theta(v \mid z)$ reconstructs the input dFC vector from the latent sample using a mirror-symmetric fully connected network with dimensions $[128 \to 256 \to 512 \to 1024 \to 2048 \to 6670]$. Intermediate layers use ReLU activations, while the final layer employs a Sigmoid function to constrain reconstructed values to $[0, 1]$, consistent with the normalized correlation coefficient range:
$$\hat{v} = \sigma\left( W_{\mathrm{dec}}^{(5)} h_{\mathrm{dec}}^{(4)} + b_{\mathrm{dec}}^{(5)} \right), \tag{8}$$
where $\sigma(\cdot)$ is the sigmoid activation.

3.2.3. VAE Loss Function

The VAE is trained to minimize the negative evidence lower bound (ELBO), which decomposes into a reconstruction term and a KL divergence regularization term:
$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{recon}} + \beta\, \mathcal{L}_{\mathrm{KL}}, \tag{9}$$
where the reconstruction loss measures mean squared error between the input and its reconstruction:
$$\mathcal{L}_{\mathrm{recon}} = \frac{1}{N_b} \sum_{n=1}^{N_b} \left\lVert v_n - \hat{v}_n \right\rVert_2^2, \tag{10}$$
and the KL divergence constrains the approximate posterior $q_\phi(z \mid v)$ toward the standard normal prior $p(z) = \mathcal{N}(0, I)$:
$$\mathcal{L}_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{L} \left( 1 + \log \sigma_{z,j}^2 - \mu_{z,j}^2 - \sigma_{z,j}^2 \right), \tag{11}$$
with $L = 128$ denoting the latent dimension and $\beta = 1.0$. The KL term regularizes the latent space toward a smooth normal distribution, suppressing noise from head motion artifacts, physiological signals, and scanner instabilities prevalent in rs-fMRI data.
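The reparameterized sampling step and the two loss terms above can be sketched in NumPy as a numerical check. The function names are illustrative, and averaging the KL term over the batch is an assumption of this sketch, since the text writes the KL term for a single sample.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * rng.standard_normal(mu.shape)

def vae_loss(v, v_hat, mu, log_var, beta=1.0):
    """Return (total, recon, kl); the batch averaging of KL is assumed here."""
    recon = np.mean(np.sum((v - v_hat) ** 2, axis=1))
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    kl = np.mean(kl)
    return recon + beta * kl, recon, kl

rng = np.random.default_rng(0)
mu = np.zeros((4, 128))
log_var = np.zeros((4, 128))            # unit variance: posterior equals the prior
z = reparameterize(mu, log_var, rng)
v = rng.uniform(size=(4, 6670))
total, recon, kl = vae_loss(v, v, mu, log_var)   # perfect reconstruction, zero KL
```

When the posterior matches the prior and the reconstruction is exact, both terms vanish. Shifting every mean to 1 with unit variance gives a KL of $128 / 2 = 64$ per sample.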

3.2.4. Window-Level Independent Encoding

Each of the 54 dFC windows is encoded independently, yielding a latent sequence $Z = [z_1, \ldots, z_{54}] \in \mathbb{R}^{54 \times 128}$. This preserves per-window temporal identity (global averaging would conflate distinct brain states) and provides the ordered sequence required by the Mamba encoder.

3.3. Mamba Selective State Space Temporal Encoder

3.3.1. Rationale for Mamba over LSTM and Transformer

The choice of Mamba as the temporal encoder is motivated by three key limitations of conventional sequence models in the context of dFC analysis. First, Long Short-Term Memory (LSTM) networks [24], while effective for short-to-medium sequences, suffer from the vanishing gradient problem and limited context window when processing long sequences: the 54-window dFC trajectory spans the entire 7-min scanning session, and LSTM’s recurrent bottleneck struggles to propagate information across such extended temporal horizons without degradation. Second, Transformer architectures [20], despite their ability to model global dependencies via self-attention, incur quadratic computational complexity $O(L^2 d)$ with respect to sequence length $L$, making them parameter-inefficient and memory-intensive for long sequences. This is a critical concern given the small sample size of medical imaging datasets, where model capacity must be carefully allocated. Third, neither LSTM nor standard Transformer incorporates content-based selectivity: they process all time steps uniformly, whereas dFC sequences contain both diagnostically informative windows (e.g., early-scan stable states) and noisy or artifact-contaminated windows that should be down-weighted.
Mamba addresses all three limitations simultaneously. Its Selective State Space Model (S6) mechanism achieves linear computational complexity $O(Ld)$ through hardware-aware parallel scan algorithms, enabling efficient processing of long dFC sequences without the quadratic cost of self-attention. More importantly, Mamba’s input-dependent selectivity, in which the SSM parameters $B_k$, $C_k$ and the time-step $\Delta_k$ are dynamically computed from the input $z_k$, allows the model to adaptively filter the temporal stream: large $\Delta_k$ values cause the model to “forget” prior context and focus on the current window (high selectivity), while small $\Delta_k$ values preserve long-range dependencies (low selectivity). This content-aware gating is particularly well-suited to dFC data, where the diagnostic relevance of each time window varies with brain state dynamics, head motion artifacts, and subject-specific scanning conditions. Empirically, Mamba (70.5% accuracy) outperforms both BiLSTM (68.2%) and mean pooling (59.4%) on the four-class task, indicating that selective temporal modeling provides a measurable advantage over uniform aggregation or recurrent processing.

3.3.2. State Space Model Foundations

Continuous-time linear state space models (SSMs) describe input–output dynamics via a hidden state $h(t) \in \mathbb{R}^N$:
$$\dot{h}(t) = A h(t) + B u(t), \quad y(t) = C h(t), \tag{12}$$
where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$ are learnable parameters, $u(t)$ is the scalar input, and $y(t)$ is the scalar output. Discretization with step size $\Delta$ via zero-order hold yields
$$\bar{A} = e^{\Delta A}, \quad \bar{B} = (\Delta A)^{-1} \left( e^{\Delta A} - I \right) \Delta B, \tag{13}$$
so that the discrete recurrence becomes the following:
$$h_k = \bar{A} h_{k-1} + \bar{B} u_k, \quad y_k = C h_k. \tag{14}$$
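As a small numerical check of the zero-order-hold formulas and the discrete recurrence, the sketch below assumes a diagonal $A$, so that the matrix exponential and inverse reduce to elementwise operations. The function names and the toy parameter values are illustrative only.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """Zero-order hold for a diagonal A stored as the vector a (elementwise)."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)   # (ΔA)^-1 (e^{ΔA} - I) ΔB
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, u):
    """Run h_k = a_bar * h_{k-1} + b_bar * u_k and y_k = c . h_k over a scalar sequence."""
    h = np.zeros_like(a_bar)
    ys = []
    for u_k in u:
        h = a_bar * h + b_bar * u_k
        ys.append(float(c @ h))
    return np.array(ys)

a = np.array([-1.0, -2.0])      # stable (negative) diagonal entries, chosen for the toy
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])
a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
ys = ssm_scan(a_bar, b_bar, c, np.ones(400))   # constant unit input
```

For a constant unit input the discrete state settles at $-b/a$ per dimension, the same fixed point as the continuous system, so the output approaches $1 + 0.5 = 1.5$.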

3.3.3. Selective State Space Model (S6)

The Mamba architecture [19] extends the SSM to a Selective State Space Model (S6) by making the parameters $B$, $C$, and the time-step $\Delta$ input-dependent:
$$B_k = B(z_k), \quad C_k = C(z_k), \quad \Delta_k = \mathrm{softplus}\left( \Delta(z_k) \right), \tag{15}$$
where $B(\cdot)$, $C(\cdot)$, and $\Delta(\cdot)$ are lightweight linear projections. This selectivity mechanism allows the model to filter the input stream based on content: large $\Delta_k$ values cause $\bar{A}_k \to 0$ (the state “forgets” and focuses on the current input), while small $\Delta_k$ values cause $\bar{A}_k \to I$ (the state carries forward prior context). For multi-dimensional inputs $z_k \in \mathbb{R}^{d_{\mathrm{model}}}$, the S6 operation is applied independently per dimension, and the full Mamba block incorporates a depthwise convolution branch and gated multiplicative output, as detailed below.

3.3.4. Mamba Block Architecture

Each Mamba block processes the input sequence $Z \in \mathbb{R}^{54 \times d_{\mathrm{model}}}$ with $d_{\mathrm{model}} = 128$ through the following operations:
$$Z' = \mathrm{LayerNorm}(Z), \tag{16}$$
$$X = Z' W_x \in \mathbb{R}^{54 \times d_{\mathrm{inner}}}, \quad G = \mathrm{SiLU}(Z' W_g) \in \mathbb{R}^{54 \times d_{\mathrm{inner}}}, \tag{17}$$
$$X' = \mathrm{DepthwiseConv1d}(X), \quad X'' = \mathrm{SiLU}(X'), \tag{18}$$
$$Y = \mathrm{S6}(X''), \quad Y' = Y \odot G, \tag{19}$$
$$\mathrm{Output} = Z + Y' W_{\mathrm{out}}, \tag{20}$$
where $d_{\mathrm{inner}} = 2 \times d_{\mathrm{model}} = 256$ (expansion factor $E = 2$), the depthwise convolution has a kernel size of $d_{\mathrm{conv}} = 4$, and SiLU (Sigmoid Linear Unit) denotes $\mathrm{SiLU}(x) = x \cdot \sigma(x)$. The residual connection ensures gradient flow during deep network training. We stack $L_m = 2$ Mamba blocks, as empirically determined via ablation.
The S6 sub-module computes the following for each dimension $d \in \{1, \ldots, d_{\mathrm{inner}}\}$:
$$\Delta_{k,d} = \mathrm{softplus}\left( W_\Delta^{(d)} x_{k,d} \right), \tag{21}$$
$$\bar{A}_{k,d} = \exp\left( \Delta_{k,d} \cdot A_d \right), \tag{22}$$
$$\bar{B}_{k,d} = \Delta_{k,d} \cdot B_{k,d}, \quad B_{k,d} = W_B^{(d)} z_k, \tag{23}$$
$$h_{k,d} = \bar{A}_{k,d} \cdot h_{k-1,d} + \bar{B}_{k,d} \cdot x_{k,d}, \tag{24}$$
$$y_{k,d} = C_{k,d} \cdot h_{k,d}, \quad C_{k,d} = W_C^{(d)} z_k, \tag{25}$$
where $A_d \in \mathbb{R}^{d_{\mathrm{state}}}$ is a learnable structured state matrix initialized with HiPPO [19], and $d_{\mathrm{state}} = 16$ is the internal SSM state dimension.
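The per-dimension S6 recurrence above can be illustrated with a sequential NumPy scan. This is a rough sketch under several assumptions: the projections for $B$ and $C$ are shared across channels rather than indexed by $d$, $A$ is filled with fixed negative values instead of a HiPPO initialization, all weights are random, and the hardware-aware parallel scan used by Mamba is replaced by a plain loop.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # numerically stable log(1 + e^x)

def selective_scan(x, z, A, W_delta, W_B, W_C):
    """
    x: (K, D) post-convolution features; z: (K, M) inputs used to form B_k and C_k
    A: (D, N) negative decay rates; W_delta: (D,); W_B, W_C: (N, M)
    Returns y: (K, D) and the step sizes deltas: (K, D).
    """
    K, D = x.shape
    h = np.zeros((D, A.shape[1]))
    y = np.empty((K, D))
    deltas = softplus(x * W_delta)                     # input-dependent step sizes
    for k in range(K):
        B_k, C_k = W_B @ z[k], W_C @ z[k]              # input-dependent projections, (N,)
        A_bar = np.exp(deltas[k][:, None] * A)         # (D, N), entries in (0, 1)
        B_bar = deltas[k][:, None] * B_k[None, :]      # (D, N)
        h = A_bar * h + B_bar * x[k][:, None]          # state update
        y[k] = h @ C_k                                 # readout
    return y, deltas

rng = np.random.default_rng(0)
K, D, M, N = 54, 256, 128, 16
A = -np.tile(np.arange(1.0, N + 1), (D, 1))            # assumed stand-in for HiPPO init
x = rng.standard_normal((K, D))
z = rng.standard_normal((K, M))
y, deltas = selective_scan(x, z, A,
                           0.1 * rng.standard_normal(D),
                           0.1 * rng.standard_normal((N, M)),
                           0.1 * rng.standard_normal((N, M)))
```

Because every entry of `A` is negative and softplus keeps the step sizes positive, `A_bar` stays strictly between 0 and 1, so the recurrence remains stable.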

3.3.5. Temporal Context Aggregation

After processing through $L_m = 2$ stacked Mamba blocks, the output sequence $O \in \mathbb{R}^{54 \times 128}$ is aggregated into a single temporal context vector via mean pooling:
$$c = \frac{1}{K} \sum_{k=1}^{K} O_k \in \mathbb{R}^{128}. \tag{26}$$
Mean pooling provides a stable aggregation that weights all temporal positions equally. Although the pooling operation itself ignores order, each $O_k$ already encodes order-dependent context from the Mamba recurrence, and the $\Delta_k$ gating mechanism has already emphasized the most discriminative windows.

3.3.6. Temporal Interpretability via Selectivity Scores

The per-step time-step values $\{\Delta_k\}_{k=1}^{54}$ provide a natural measure of temporal importance: a large $\Delta_k$ indicates that the model assigns high relevance to window $k$ (fresh information is prioritized), while a small $\Delta_k$ indicates that the prior state is carried forward with minimal update from window $k$. We define the temporal importance score for window $k$ as follows:
$$s_k = \frac{1}{d_{\mathrm{inner}}} \sum_{d=1}^{d_{\mathrm{inner}}} \Delta_{k,d}, \tag{27}$$
averaged across all inner dimensions to produce a scalar summary. These scores can be aggregated across subjects within each diagnostic class to reveal the class-specific temporal patterns of brain state dynamics.
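Computing the scores and the class-wise aggregation is a pair of averages, as the sketch below shows. The helper names and the toy numbers are assumptions for illustration.

```python
import numpy as np

def temporal_importance(deltas):
    """deltas: (K, d_inner) step sizes for one subject -> (K,) scores s_k."""
    return np.asarray(deltas, dtype=float).mean(axis=1)

def class_profiles(scores, labels):
    """Average per-subject score curves (n_subjects, K) within each class."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    return {c: scores[labels == c].mean(axis=0) for c in np.unique(labels)}

s = temporal_importance(np.arange(6).reshape(3, 2))       # 3 windows, 2 inner dims
profiles = class_profiles([[1, 2], [3, 4], [5, 6]], [0, 0, 1])
```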

3.4. Kolmogorov–Arnold Network Classifier

3.4.1. KAN Architecture Motivation

Standard MLPs place fixed nonlinear activation functions (e.g., ReLU, GELU) on nodes, while the learnable weights reside on edges. Kolmogorov–Arnold Networks (KAN) [21] instead place learnable univariate functions on edges, with each edge representing an independently parameterized transformation. The design is inspired by the Kolmogorov–Arnold representation theorem, which states that any continuous multivariate function $f : [0, 1]^n \to \mathbb{R}$ can be written as a finite composition of univariate functions and addition. This architectural choice enables the network to be intrinsically interpretable: each edge function $\varphi_{l,j,i}$ directly encodes the contribution of the $i$-th unit of layer $l$ to the $j$-th unit of layer $l + 1$ as a visualizable curve, independent of all other edges.

3.4.2. Rationale for KAN over MLP

The choice of KAN as the classification backbone is motivated by two critical requirements of medical imaging analysis: parameter efficiency and intrinsic interpretability. First, standard MLPs with ReLU or GELU activations require large hidden layers to approximate complex nonlinear decision boundaries, leading to parameter proliferation that exacerbates overfitting on small medical datasets. In contrast, KAN parameterizes each edge connection as an independent B-spline function, enabling the network to learn smooth, nonlinear transformations with far fewer parameters: replacing the MLP classification head with KAN adds only ∼43 K parameters (+0.13%) while improving four-class accuracy from 62.5% to 70.5%. The B-spline basis functions inherently encode smoothness priors through their piecewise polynomial structure, providing implicit regularization that is particularly beneficial when training on the limited ADNI cohort ( n = 174 subjects).
Second, and more fundamentally, KAN provides intrinsic interpretability that is qualitatively different from post hoc explanation methods. In a standard MLP, the contribution of a latent feature to the final prediction is mediated through multiple layers of fixed nonlinear activations (e.g., ReLU) and linear projections, making it impossible to visualize the feature-to-prediction mapping without external approximation tools like SHAP or LIME. These post hoc methods compute local linear approximations around specific test samples and require additional forward passes, yielding explanations that are sample-dependent and computationally expensive. In contrast, KAN’s edge activation functions $\varphi_{j,i}(x)$ are model parameters that can be directly plotted after training, revealing the global, nonlinear relationship between each latent dimension and the classification logits without any post hoc computation. This transparency is critical for clinical trust and regulatory approval: a radiologist can inspect the learned activation curves and verify that the model’s decision logic aligns with known pathophysiology, rather than relying on black-box predictions justified only by aggregate accuracy metrics.

3.4.3. B-Spline Edge Activations

Each edge activation $\varphi(x) : \mathbb{R} \to \mathbb{R}$ is parameterized as a sum of a scaled residual SiLU function and a learnable B-spline:
$$\varphi(x) = w_b \cdot \mathrm{SiLU}(x) + \sum_{i=1}^{G+p} c_i \cdot B_{i,p}(x), \tag{28}$$
where $w_b \in \mathbb{R}$ is a learnable base weight, $c_i \in \mathbb{R}$ are B-spline coefficients, and $B_{i,p}(x)$ are B-spline basis functions of order $p = 3$ (cubic) defined on a uniform grid of $G = 5$ intervals over the input domain $[x_{\min}, x_{\max}]$. Each edge therefore carries $G + p = 8$ spline coefficients in addition to its base weight $w_b$. During training, the grid is periodically updated to span the current activation range, preventing numerical instability from out-of-range inputs.
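A single edge activation of this form can be evaluated with the Cox–de Boor recursion on a uniformly extended knot vector. The sketch below assumes a fixed grid on $[-1, 1]$ and omits the periodic grid update, and the function names are illustrative rather than taken from any KAN library.

```python
import numpy as np

def bspline_basis(x, x_min=-1.0, x_max=1.0, G=5, p=3):
    """Values of the G + p B-spline basis functions at points x (Cox-de Boor)."""
    x = np.asarray(x, dtype=float)
    h = (x_max - x_min) / G
    knots = x_min + h * np.arange(-p, G + p + 1)       # G + 2p + 1 uniform knots
    # Degree-0 bases: indicator of each half-open knot interval
    B = ((x[:, None] >= knots[:-1]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, p + 1):
        left = (x[:, None] - knots[:-(d + 1)]) / (knots[d:-1] - knots[:-(d + 1)])
        right = (knots[d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B                                           # shape (len(x), G + p)

def edge_activation(x, w_b, coeffs):
    """phi(x) = w_b * SiLU(x) + sum_i c_i * B_i(x) on the default grid."""
    x = np.asarray(x, dtype=float)
    silu = x / (1.0 + np.exp(-x))
    return w_b * silu + bspline_basis(x) @ coeffs

xs = np.linspace(-1.0, 0.99, 50)
basis = bspline_basis(xs)
phi = edge_activation(xs, w_b=0.0, coeffs=np.ones(8))
```

Inside the grid the basis functions sum to one, so with all coefficients set to 1 and $w_b = 0$ the activation is the constant function 1. Plotting `phi` against `xs` for trained coefficients would give the kind of curve the interpretability analysis inspects.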

3.4.4. KAN Layer Forward Pass

A KAN layer with $n_{\mathrm{in}}$ input neurons and $n_{\mathrm{out}}$ output neurons computes the following:
$$y_j = \sum_{i=1}^{n_{\mathrm{in}}} \varphi_{j,i}(x_i), \quad j = 1, \ldots, n_{\mathrm{out}}, \tag{29}$$
where $\varphi_{j,i}$ is a distinct B-spline activation parameterized as in Equation (28).

3.4.5. MambaKAN Classifier Structure

The KAN classifier receives the temporal context vector $c \in \mathbb{R}^{128}$ from the Mamba encoder and produces class logits $\hat{y} \in \mathbb{R}^{C}$ through a two-layer KAN:
$$h_{\mathrm{KAN}} = \mathrm{KAN}_1\left( c; \{\varphi_{j,i}^{(1)}\} \right) \in \mathbb{R}^{64}, \tag{30}$$
$$\hat{y} = \mathrm{KAN}_2\left( h_{\mathrm{KAN}}; \{\varphi_{j,i}^{(2)}\} \right) \in \mathbb{R}^{C}, \tag{31}$$
where $C$ is the number of classes (two for binary tasks, four for the four-class task). The total number of learnable activation parameters in the KAN classifier is $(128 \times 64 + 64 \times C) \times 8$ spline coefficients plus $128 \times 64 + 64 \times C$ base weights.
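The parameter count stated above is easy to verify with a few lines of arithmetic. The helper name `kan_param_count` is hypothetical.

```python
def kan_param_count(layer_dims, G=5, p=3):
    """Count edges, spline coefficients, and base weights of a stacked KAN."""
    edges = sum(n_in * n_out for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]))
    return {"edges": edges, "spline_coeffs": edges * (G + p), "base_weights": edges}

four_class = kan_param_count([128, 64, 4])
binary = kan_param_count([128, 64, 2])
print(four_class)  # {'edges': 8448, 'spline_coeffs': 67584, 'base_weights': 8448}
```

For the four-class task this gives 67,584 spline coefficients and 8448 base weights. For binary tasks the figures are 66,560 and 8320.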

3.4.6. KAN Interpretability

Following training, each activation curve $\varphi_{j,i}^{(l)}$ can be directly plotted as a function of its input, revealing the nature of the relationship between the $i$-th latent dimension (or hidden unit) and the $j$-th downstream unit. Monotonically increasing curves indicate positive linear-like contributions; non-monotonic curves reveal threshold effects or U-shaped relationships; flat curves indicate that a particular connection has been effectively pruned. This intrinsic transparency eliminates the need for post hoc attribution methods like SHAP or LIME, which provide only approximate local explanations and require additional computation after training.

3.5. Joint Training Strategy

3.5.1. Two-Phase Training Protocol

To ensure stable and effective learning, we employ a two-phase training strategy that exploits the complementary objectives of unsupervised representation learning and supervised classification:
Phase 1 (VAE Unsupervised Pre-training): The VAE is first trained in isolation on the full training set (without labels) to minimize $\mathcal{L}_{\text{VAE}}$ (Equation (9)). This phase runs for 100 epochs with the Adam optimizer at a learning rate of $\eta_1 = 10^{-3}$ and a batch size of 32. By pre-training without labels, the VAE learns a general-purpose, noise-robust latent representation of dFC dynamics that is not biased toward any specific classification task, improving downstream generalization. The best checkpoint (lowest validation reconstruction loss) is saved for Phase 2 initialization.
Phase 2 (End-to-End Joint Fine-tuning): The full MambaKAN pipeline (VAE + Mamba + KAN) is trained jointly using the composite loss:
$$\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{VAE}} + \beta \mathcal{L}_{\text{cls}}, \tag{32}$$
where $\mathcal{L}_{\text{cls}}$ is the cross-entropy classification loss:
$$\mathcal{L}_{\text{cls}} = -\frac{1}{N_b} \sum_{n=1}^{N_b} \sum_{c=1}^{C} y_{n,c} \log \hat{p}_{n,c}, \tag{33}$$
with $y_{n,c}$ the one-hot label and $\hat{p}_{n,c}$ the Softmax probability for sample $n$ and class $c$. We set $\alpha = 0.1$ and $\beta = 1.0$, treating reconstruction as a regularizer that prevents the catastrophic forgetting of the VAE’s learned manifold structure, while prioritizing classification accuracy.
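To make the weighting concrete, the composite objective can be sketched in plain Python. The functions below are illustrative only: the VAE loss is passed in as a precomputed number because its definition (Equation (9)) lies outside this section, and labels are given as class indices, which is equivalent to the one-hot form.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(batch_logits, labels):
    """Batch mean of -log p_hat[n][label_n]."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= math.log(softmax(logits)[label])
    return total / len(labels)

def total_loss(l_vae, l_cls, alpha=0.1, beta=1.0):
    """Composite loss alpha * L_VAE + beta * L_cls."""
    return alpha * l_vae + beta * l_cls

l_cls = cross_entropy([[0.0, 0.0], [2.0, 0.0]], [0, 0])
loss = total_loss(l_vae=2.0, l_cls=l_cls)
```

With $\alpha = 0.1$, a reconstruction loss of 2.0 contributes only 0.2 to the total, so the classification term dominates the gradient signal.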

3.5.2. Differential Learning Rates

To protect the pre-trained VAE representations from destructive updates, we apply differential learning rates in Phase 2:
$$\eta_{\text{VAE}} = 10^{-5}, \quad \eta_{\text{Mamba}} = \eta_{\text{KAN}} = 10^{-3}. \tag{34}$$
The Mamba and KAN parameters use a 100× larger learning rate, allowing rapid adaptation while the VAE undergoes only fine-grained updates. During the initial $E_{\text{warm}} = 15$ warmup epochs of Phase 2, VAE parameters are frozen to stabilize Mamba and KAN before joint fine-tuning begins.
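The resulting schedule can be summarized as a small lookup function. This is a sketch under stated assumptions: epochs are 0-indexed, a frozen VAE is represented by a learning rate of 0.0, and the function name is hypothetical. In PyTorch, the same effect is typically obtained with per-module optimizer parameter groups and by disabling gradients on the VAE during warmup.

```python
def phase2_learning_rates(epoch, warmup_epochs=15, lr_vae=1e-5, lr_head=1e-3):
    """Per-module learning rates for a 0-indexed Phase 2 epoch.

    Assumption: the VAE is frozen (rate 0.0) for the first `warmup_epochs`
    epochs, then fine-tuned at `lr_vae`; Mamba and KAN always use `lr_head`.
    """
    vae_frozen = epoch < warmup_epochs
    return {
        "vae": 0.0 if vae_frozen else lr_vae,
        "mamba": lr_head,
        "kan": lr_head,
        "vae_frozen": vae_frozen,
    }

schedule = [phase2_learning_rates(e) for e in range(100)]
```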

3.5.3. Regularization

To mitigate overfitting on the small ADNI dataset, Dropout ($p = 0.15$) is applied within each Mamba block after the depthwise convolution. The VAE KL divergence (Equation (11)) additionally regularizes the latent space. No Dropout is applied within KAN layers to preserve interpretability of individual activation curves.

3.5.4. Optimization Details

Phase 2 training runs for 100 epochs with a batch size of 32. The Adam optimizer is used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The best checkpoint (highest validation accuracy) is saved for final evaluation. All experiments are implemented in PyTorch 2.0 on an NVIDIA GPU with 16 GB VRAM.

4. Experiments

4.1. Implementation Details

The MambaKAN framework was implemented in PyTorch 2.0 with Python 3.8. The VAE architecture follows the design described in Section 3.2 with encoder dimensions [6670, 2048, 1024, 512, 256, 128] and a symmetric decoder. The Mamba encoder uses $L_m = 2$ layers, $d_{\text{model}} = 128$, $d_{\text{state}} = 16$, $d_{\text{conv}} = 4$, an expansion factor of $E = 2$, and a Dropout rate of 0.15. The KAN classifier has two layers with dimensions $[128 \to 64 \to C]$, a B-spline grid size of $G = 5$, and a spline order of $p = 3$. To assess the statistical stability of all reported results, each experiment was repeated with five fixed random seeds on the same hardware under the fixed 70/10/20 subject-level train/validation/test split described in Section 2. Performance metrics are reported as mean ± standard deviation across these five runs.
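The encoder input width of 6670 equals the number of unique off-diagonal entries of a symmetric 116 × 116 connectivity matrix ($116 \times 115 / 2$). The sketch below assumes that each dFC window is vectorized by taking the strict upper triangle; the helper name is hypothetical.

```python
def upper_triangle(matrix):
    """Flatten the strict upper triangle (i < j) of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

n_roi = 116
toy_window = [[0.0] * n_roi for _ in range(n_roi)]   # placeholder dFC window
features = upper_triangle(toy_window)
n_features = len(features)
```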

4.2. Evaluation Metrics

Classification performance is evaluated using four standard metrics: accuracy (Acc), macro-averaged precision (Pre), macro-averaged recall (Rec), and macro-averaged F1-score (F1). For binary tasks, the area under the ROC curve (AUC) is additionally reported. Statistical significance of performance differences between MambaKAN and each baseline was assessed using a paired t-test ($\alpha = 0.05$) on run-level accuracy scores across the five seeds.
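For reference, macro-averaging computes each metric per class and then takes the unweighted mean over classes. The following sketch shows that computation on toy labels; it is not the evaluation script used in the study, and the convention of scoring an undefined ratio as 0.0 is an assumption.

```python
def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0   # undefined -> 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "acc": acc,
        "pre": sum(precisions) / n_classes,
        "rec": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }

# Toy binary example (labels are illustrative, not study data).
m = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```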

4.3. Baseline Methods

We compare MambaKAN against the following seven baselines:
  • VAE + MLP: A VAE encoder for per-window feature extraction followed by mean pooling and a hierarchical MLP classifier, without temporal modeling or intrinsic interpretability.
  • VAE + BiLSTM: Mamba replaced by a two-layer bidirectional LSTM [24] (hidden size of 128 per direction).
  • VAE + Transformer: Mamba replaced by a two-layer Transformer encoder [20] (four heads, a feedforward dimension of 256, and sinusoidal positional encoding).
  • MambaKAN (no pre-training): Full MambaKAN trained from random initialization, without Phase 1 VAE pre-training.
  • VAE + KAN (no Mamba): Mamba replaced by mean pooling over the 54 latent vectors, feeding directly into the KAN classifier.
  • VAE + Mamba + MLP (no KAN): KAN replaced by a standard two-layer MLP (hidden size 64, ReLU).
  • Traditional ML: SVM-RBF, Random Forest, and Gradient Boosting on mean-pooled VAE latent features.
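Several of the baselines above replace temporal modeling with mean pooling over the window sequence. A minimal sketch of that operation, assuming 54 latent vectors of dimension 128 per subject, is:

```python
def mean_pool(latents):
    """Average a sequence of T latent vectors (T x d) into one d-vector."""
    T = len(latents)
    return [sum(col) / T for col in zip(*latents)]

# Placeholder sequence: 54 windows, 128 latent dimensions each.
sequence = [[float(t)] * 128 for t in range(54)]
pooled = mean_pool(sequence)
```

Because the average is invariant to the order of the windows, any information carried by the temporal ordering is discarded, which is what the temporal encoders are meant to recover.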

4.4. Classification Performance Comparison

Table 2 and Table 3 present the classification performance of MambaKAN and all seven baseline methods across the five tasks on the ADNI dataset.

4.5. Ablation Study

Table 4 presents the ablation study on the four-class task (T5), analyzing the contribution of each major component.

4.6. Sensitivity Analysis

Table 5 examines the sensitivity of MambaKAN to key hyperparameters on the CN vs. AD task (T1).

5. Interpretability Analysis

A central contribution of MambaKAN is its multi-layer interpretability framework, operating simultaneously at three complementary levels: temporal (Mamba), functional (KAN), and anatomical (gradient attribution). This hierarchical analysis provides insights inaccessible to single-level post hoc methods.

5.1. Layer 1: Mamba Temporal Importance

Using the temporal importance scores defined in Equation (27), we compute the mean $\Delta_k$ profile for each diagnostic class by averaging over all subjects in the test set. Figure 2 visualizes these profiles across the 54 dFC windows.
The analysis reveals class-specific temporal patterns consistent with known AD pathophysiology. The AD class (brown curve) exhibits a distinctive biphasic profile: markedly elevated selectivity during early time windows (0–25), with peak z-scores approaching $+1.0$, followed by a sharp decline to negative values (minimum $-1.0$) in late windows (30–55). This pattern suggests that early-scan brain states contain the most discriminative AD signatures, while late-scan states actively contradict AD classification, potentially reflecting fatigue-related or attention-related signal degradation that disproportionately affects AD patients. In contrast, the CN, EMCI, and LMCI classes exhibit relatively flat selectivity profiles centered near zero, indicating temporally uniform feature distributions without pronounced critical windows. The LMCI class shows a subtle progressive increase across the scan, intermediate between the stable CN/EMCI profiles and the extreme AD biphasic pattern, consistent with LMCI’s position on the disease continuum. This temporal specificity, particularly the early-window AD peak, provides data-driven evidence for optimizing clinical scanning protocols: shorter, early-focused acquisitions may suffice for AD detection, reducing patient burden and scan costs.
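The class-level profiles can be sketched as a two-step computation: average the per-window scores over the subjects of a class, then standardize the resulting profile. The definition of $\Delta_k$ (Equation (27)) lies outside this section, so the scores are treated as given inputs here, and standardizing across windows is an assumption about how the reported z-scores were obtained.

```python
import statistics

def class_profile(delta_by_subject):
    """Average per-window scores over subjects (N x T list -> length-T list)."""
    n = len(delta_by_subject)
    return [sum(col) / n for col in zip(*delta_by_subject)]

def zscore(profile):
    """Standardize a profile across windows (population standard deviation)."""
    mu = statistics.fmean(profile)
    sd = statistics.pstdev(profile)
    return [(v - mu) / sd for v in profile]

# Toy scores for two subjects and three windows (illustrative values only).
profile = class_profile([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
z = zscore(profile)
```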

5.2. Layer 2: KAN Activation Curve Analysis

Figure 3 presents the learned B-spline activation curves $\varphi_{j,i}^{(1)}$ for the top-$K = 10$ most influential latent dimensions (ranked by mean absolute activation magnitude) in the first KAN layer for the CN vs. AD binary classification task.
The activation curves reveal three interpretable patterns. Certain latent dimensions exhibit monotonically increasing activation curves for the AD logit, indicating that their elevated values consistently promote AD classification; these likely encode persistent connectivity reduction in the DMN and hippocampal networks. Conversely, latent dimensions with monotonically decreasing curves correspond to features whose suppression is associated with AD pathology, potentially encoding preserved connectivity in frontal networks observed in early-stage patients. Strongly nonlinear curves (particularly sigmoidal or U-shaped) indicate a nonlinear contribution to classification; such dimensions may correspond to connectivity features that are pathological only outside a normative range. Critically, across all 10 latent dimensions, the AD class (yellow curves) consistently exhibits higher $\Delta\text{logit}$ values than the CN class (blue curves) without any rank reversals, suggesting that KAN has learned stable, monotonic class-discriminative features rather than overfitting to noise.
The EMCI vs. LMCI activation curves (Figure 4) present a striking contrast to the CN vs. AD case. While the EMCI class (blue) consistently maintains higher $\Delta\text{logit}$ values than LMCI (yellow) across all dimensions, preserving monotonic class ordering, the inter-class separation is markedly reduced, with curves frequently overlapping or running in close parallel. This compressed separation quantitatively reflects the clinical challenge of distinguishing adjacent MCI stages: EMCI and LMCI share substantial pathological overlap, differing primarily in severity rather than qualitative feature profiles. The smooth, continuous nature of all curves is consistent with KAN capturing biologically plausible nonlinear relationships rather than spurious discontinuities, even in this difficult discrimination regime.
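The ranking criterion used to select the displayed dimensions (mean absolute activation magnitude) can be sketched as follows. The input layout is an assumption: `phi_values[i]` is taken to hold sampled outputs of the first-layer curves attached to latent dimension `i`, and the function name is hypothetical.

```python
def top_k_inputs(phi_values, k=10):
    """Rank input dimensions by mean absolute activation and keep the top k."""
    scores = {
        i: sum(abs(v) for v in vals) / len(vals)
        for i, vals in enumerate(phi_values)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Three toy dimensions with illustrative sampled activations.
toy = [[0.1, -0.1], [2.0, -3.0], [0.5, 0.5]]
ranked = top_k_inputs(toy, k=2)
```

Using the absolute value prevents curves with large positive and negative excursions from cancelling out in the ranking.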

5.3. Layer 3: Gradient-Based Brain Region Attribution

To map the influence of the final classification decision back onto the anatomical brain, we compute Jacobian-based attribution scores from the class logit to the original dFC matrix via backpropagation through the full pipeline:
$$a_{ij} = \frac{1}{K} \sum_{k=1}^{K} \frac{\partial \hat{y}_c}{\partial v_{k,ij}}, \tag{35}$$
where $v_{k,ij}$ is the pairwise connectivity feature between ROI $i$ and ROI $j$ in window $k$, and $\hat{y}_c$ is the predicted logit for the target class $c$. The attribution matrix $A \in \mathbb{R}^{116 \times 116}$ is then thresholded to retain the top 1% most influential connections for visualization. Figure 5 presents these attribution maps for each diagnostic class.
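The averaging and thresholding steps can be sketched independently of the network. In the sketch below, the per-window gradients are supplied as plain nested lists (in practice they would come from backpropagation), and ranking connections by absolute attribution over unique ROI pairs is an assumption; the text does not state the ranking rule.

```python
def attribution_matrix(grads):
    """Average per-window gradients: grads[k][i][j] -> A[i][j]."""
    K = len(grads)
    n = len(grads[0])
    return [[sum(g[i][j] for g in grads) / K for j in range(n)]
            for i in range(n)]

def top_fraction_edges(A, fraction=0.01):
    """Keep the strongest unique ROI pairs (i < j) by absolute attribution."""
    n = len(A)
    pairs = [(abs(A[i][j]), i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    keep = max(1, round(fraction * len(pairs)))
    return [(i, j) for _, i, j in pairs[:keep]]

# Toy gradients for K = 2 windows and 3 ROIs (illustrative values only).
g1 = [[0, 1, 0], [1, 0, -4], [0, -4, 0]]
g2 = [[0, 3, 0], [3, 0, -2], [0, -2, 0]]
A = attribution_matrix([g1, g2])
strongest = top_fraction_edges(A, fraction=0.34)
```

For 116 ROIs there are 6670 unique pairs, so a 1% threshold retains 67 connections under this convention.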
The chord diagram analysis (Figure 5) reveals class-specific connectivity patterns that align with known AD neuropathology. Across all four diagnostic classes, the cerebellum—particularly Vermis_10 (cerebellar vermis lobule X)—emerges as the dominant hub, exhibiting the strongest attribution scores and the highest degree of inter-regional connectivity. This cerebellar centrality is consistent with recent evidence implicating cerebellar dysfunction in cognitive decline and AD progression [9]. Beyond this shared cerebellar foundation, each diagnostic class exhibits distinctive secondary connectivity profiles. The CN class (Class 0) shows the prominent involvement of the caudate nucleus and striatal regions, reflecting intact cognitive control networks. The EMCI class (Class 1) demonstrates elevated attribution in the insula and parahippocampal gyrus, regions associated with emotional processing and episodic memory encoding, potentially indicating early compensatory recruitment. The LMCI class (Class 2) exhibits heightened connectivity in primary sensory cortices—including Heschl’s gyrus (auditory) and olfactory cortex—alongside hippocampal structures, suggesting progressive sensory integration deficits. Most strikingly, the AD class (Class 3) displays a complex high-order cognitive network involving the orbitofrontal cortex (decision-making), superior parietal lobule (spatial attention), and amygdala (emotional regulation), indicating the widespread disruption of executive and limbic systems characteristic of advanced neurodegeneration. These anatomically coherent, class-specific attribution patterns validate that MambaKAN’s learned representations capture biologically meaningful functional connectivity signatures rather than spurious correlations.
The heatmap visualization (Figure 6) provides a complementary global perspective on connectivity attribution. A striking gradient in attribution intensity is observed across disease stages: the CN class exhibits the weakest overall attribution (scale: 0.0000–0.0010), indicating relatively diffuse and uniform connectivity patterns; EMCI and LMCI classes show progressively stronger attribution (scales: 0.0012 and 0.0016, respectively); and the AD class displays the highest attribution intensity (scale: 0.0020), with prominent bright yellow regions concentrated along cerebellar and limbic system connections. This monotonic increase in attribution magnitude suggests that as AD pathology advances, the model increasingly relies on a narrower set of highly discriminative connectivity features—consistent with the hypothesis that advanced neurodegeneration produces more stereotyped and detectable functional network disruptions. The consistent localization of high-attribution regions to cerebellar–cortical and limbic pathways across all classes reinforces the biological validity of the learned representations.
The ranked attribution bar charts (Figure 7) provide quantitative confirmation of the qualitative patterns observed in the chord and heatmap visualizations. Vermis_10 dominates all four panels with the longest bars, achieving attribution scores approximately two to three times higher than the second-ranked region in each class. This universal cerebellar primacy suggests that MambaKAN has learned to anchor its classification decisions on a stable, disease-invariant feature set derived from cerebellar connectivity, while modulating class-specific predictions through secondary region recruitment. The divergence in secondary features is particularly informative: the CN class recruits bilateral cerebellum and caudate (cognitive control); EMCI adds the insula and parahippocampal gyrus (emotional and memory processing); LMCI incorporates Heschl’s gyrus and olfactory cortex (sensory integration); and AD engages the orbitofrontal cortex, superior parietal lobule, and amygdala (executive dysfunction and emotional dysregulation). This hierarchical attribution structure—universal cerebellar foundation plus class-specific cortical/limbic modulation—mirrors the known progression of AD pathology from subcortical to cortical regions and validates the neuroscientific plausibility of the learned feature hierarchy.

6. Computational Complexity Analysis

Table 6 summarizes the parameter counts and inference latency for all models evaluated in this study. All latency measurements are averaged over three repeated runs on the same hardware (NVIDIA GPU, 16 GB VRAM), reported as the mean ± std per sample.
All models share a large VAE backbone (∼32.99 M parameters), so the differences between models primarily reflect the temporal modeling and classification components. MambaKAN (33.26 M) is parameter-competitive with the Transformer baseline (33.20 M) and has fewer parameters than VAE + BiLSTM + MLP (33.60 M). Replacing the MLP classification head with KAN adds only ∼43 K parameters (+0.13%, from 33,221,714 to 33,264,910) at an additional inference cost of ∼3.6 ms. The dominant computational cost in MambaKAN comes from Mamba’s hardware-aware selective scan, which accounts for the difference between MambaKAN (23.74 ms) and purely sequential baselines such as VAE + BiLSTM (5.60 ms). The single-sample inference latency of 23.74 ms is well within the requirements for real-time clinical decision support. Furthermore, unlike Transformer-based architectures that scale quadratically with sequence length, Mamba’s linear-time complexity ensures that MambaKAN remains scalable to longer fMRI acquisitions or higher-resolution atlases.
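The overhead figures quoted above can be checked with a few lines of arithmetic. The second part of the sketch contrasts linear and quadratic growth in the number of windows; it counts only the leading-order term and ignores constants, so it illustrates scaling behavior rather than measured latency.

```python
# Parameter totals quoted in the text for the MLP-head and KAN-head models.
mlp_head_total = 33_221_714
kan_head_total = 33_264_910

extra_params = kan_head_total - mlp_head_total
relative_increase = 100.0 * extra_params / mlp_head_total   # in percent

def scan_cost(T):
    """Leading-order cost of a linear-time scan over T windows."""
    return T

def attention_cost(T):
    """Leading-order cost of pairwise self-attention over T windows."""
    return T * T

# Effect of doubling the sequence length from 54 to 108 windows.
scan_growth = scan_cost(108) / scan_cost(54)
attention_growth = attention_cost(108) / attention_cost(54)
```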

7. Discussion

7.1. Performance Advantages of MambaKAN

The comprehensive experimental results demonstrate that MambaKAN achieves consistent improvements over all baselines across all five classification tasks. The performance gains over the strongest non-temporal baseline (VAE + MLP) confirm that temporal modeling of dFC dynamics provides meaningful discriminative information beyond what can be captured by pooling latent representations across windows. The relative advantage over VAE + BiLSTM and VAE + Transformer is particularly noteworthy: while the LSTM captures local sequential dynamics and the Transformer [20] models global dependencies via self-attention, Mamba’s selective state space mechanism enables context-dependent filtering (effectively learning to attend to temporally relevant dFC patterns while ignoring noisy or irrelevant windows) with linear rather than quadratic computational complexity [19].
The advantage of KAN over MLP as the classification backbone (ablated in Table 4) is attributable to two complementary factors. First, KAN’s per-connection parameterization provides a more compact and expressive representation for smooth classification boundaries in the 128-dimensional latent space, avoiding the parameter redundancy of fully connected linear projections. Second, the B-spline regularization inherent in KAN’s architecture [21] provides implicit smoothness constraints that improve generalization on the small ADNI dataset, analogous to kernel regularization in SVMs.

7.2. Clinical Relevance of Interpretability

The three-layer interpretability framework provides complementary insights at different levels of abstraction. The Mamba temporal importance maps (Figure 2) reveal when, in a scanning session, brain state dynamics are most diagnostically informative, potentially guiding the design of shorter or targeted scanning protocols for clinical screening. The KAN activation curves (Figure 3 and Figure 4) directly quantify the nonlinear functional relationship between latent connectivity features and classification decisions, enabling clinicians to understand not only which features matter but how they contribute—a level of transparency unavailable from post hoc SHAP approximations. The brain region attribution visualizations (Figure 5, Figure 6 and Figure 7) ground these computationally derived features in neuroanatomy, providing the spatial specificity needed for clinical interpretation and biomarker validation. The convergent evidence across chord diagrams, heatmaps, and ranked bar charts—all highlighting cerebellar primacy with class-specific cortical/limbic modulation—demonstrates the robustness and biological validity of the learned representations.
Compared to SHAP-based post hoc interpretability, MambaKAN’s intrinsic KAN interpretability offers two practical advantages: (1) it is computationally inexpensive—activation curves are simply model parameters requiring no additional forward passes—and (2) it is globally consistent, reflecting the model’s actual decision function rather than a local linear approximation around specific test samples. It is important to note that these two levels of interpretability operate at different stages of the pipeline. The KAN activation curves provide intrinsic interpretability at the classification layer: the B-spline functions are model parameters that directly reveal how 128-dimensional latent features influence class logits, without any post hoc computation. However, the final anatomical interpretation—mapping from the latent space back to brain regions—still relies on gradient-based attribution (Equation (35)), which involves backpropagation principles similar to those of other post hoc methods. Thus, the MambaKAN framework achieves intrinsic interpretability at the latent-to-logit stage and post hoc attribution at the dFC-to-latent stage.

7.3. Comparison with Other Interpretable Deep Learning Methods

While MambaKAN is compared primarily against traditional ML and deep learning baselines in the experimental section, it is instructive to position our interpretability approach relative to other interpretable architectures proposed for neuroimaging analysis. Graph Convolutional Networks (GCNs) combined with GNNExplainer [23] have been applied to functional connectivity data, where GNNExplainer identifies important subgraphs by masking edges and measuring prediction change. However, this approach reveals only which connections matter, not how they influence the decision—the relationship remains implicit in the learned GCN weights. Similarly, Transformer-based models with attention mechanisms [20] can visualize attention weight distributions across time steps or spatial regions, revealing where the model “looks,” but attention weights reflect input relevance rather than the functional form of feature-to-prediction mappings. Attention heatmaps indicate that a model attends to a particular brain region or time window, but they do not reveal whether that region’s connectivity promotes or inhibits AD classification, nor do they quantify the nonlinearity of that relationship.
In contrast, KAN’s edge activation curves provide functional interpretability: each curve $\varphi_{j,i}(x)$ explicitly shows how varying the $i$-th latent feature value from low to high affects the $j$-th output logit, including the direction (monotonic increasing/decreasing), magnitude (slope), and nonlinearity (curvature) of the effect. This level of transparency is closer in spirit to generalized additive models (GAMs) in classical statistics, but with the representational power of deep neural networks. For clinical decision support, this distinction is critical: a radiologist reviewing a KAN-based diagnosis can inspect whether elevated connectivity in a specific latent dimension consistently increases AD probability (monotonic curve) or exhibits a threshold effect (sigmoidal curve), enabling validation against domain knowledge. Attention-based or GNN-based explanations, while valuable for identifying salient regions, do not provide this level of mechanistic insight without additional post hoc analysis.
Furthermore, the computational cost of generating explanations differs fundamentally. GNNExplainer and attention visualization require forward passes (and in some cases, optimization loops) at the inference time to produce explanations for each test sample. KAN activation curves, being model parameters, are computed once during training and apply globally to all samples, making explanation generation effectively free at the inference time. This efficiency advantage is particularly relevant for large-scale clinical deployment, where real-time decision support with transparent reasoning is essential.

7.4. Limitations and Future Directions

Despite promising results, several limitations should be acknowledged. First, the ADNI dataset, while the most widely used benchmark for AD neuroimaging research, encompasses a relatively modest number of subjects ($n = 174$), which may limit the statistical power of performance comparisons and the generalizability of findings to diverse populations. In particular, the AD subgroup contains only 31 individuals, of whom 7 fall in the test partition of the fixed split ($20\% \times 31 = 6.2$, with the fractional remainder allocated to the test partition, as detailed in Table 1), a scale at which observed performance gains should be interpreted with caution. This small-sample constraint also increases the risk that reported metrics, including MambaKAN’s Recall = 100% on CN vs. AD, may reflect the limited size of the test partition rather than a robustly generalizable finding. Future work should validate MambaKAN on independent, larger, and more demographically diverse cohorts such as OASIS-3 and AIBL.
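The caution about the small test partition can be quantified with a binomial confidence interval. The sketch below computes a 95% Wilson score interval for the illustrative case in which all 7 AD test subjects are correctly identified in a single run; mapping the reported recall to exactly 7 out of 7 is an assumption made here for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

lower, upper = wilson_interval(7, 7)
```

Under these assumptions the lower bound is close to 0.65, so a perfect recall on 7 subjects remains statistically compatible with a substantially lower true sensitivity.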
Second, the sliding-window dFC approach inherits known limitations of this paradigm: window length selection involves a trade-off between temporal resolution and statistical stability [13], and the non-stationarity of BOLD signals may produce spurious connectivity fluctuations. Point-process approaches or continuous covariance estimation methods could potentially provide more principled dFC representations [12].
Third, while the VAE-based unsupervised pre-training improves generalization, the alignment between reconstructive and discriminative objectives is imperfect. A variational information bottleneck formulation or contrastive learning objective could more directly optimize latent representations for downstream classification [15].
Finally, the current framework operates solely on rs-fMRI data. Multimodal integration—combining structural MRI (cortical thickness, hippocampal volume), diffusion tensor imaging (white matter tract integrity), and clinical assessments—could substantially improve early-stage classification performance, particularly for the challenging EMCI vs. LMCI task [5].

8. Conclusions

We have presented MambaKAN, a novel end-to-end interpretable deep learning framework for Alzheimer’s disease diagnosis from rs-fMRI dynamic functional connectivity. By integrating three complementary components—a Variational Autoencoder for nonlinear per-window feature compression, a Selective State Space Model (Mamba) for temporally selective sequence modeling, and a Kolmogorov–Arnold Network for intrinsically interpretable classification—MambaKAN addresses fundamental limitations of prior dFC-based approaches: the neglect of temporal dynamics, the opacity of deep learning decision-making, and the misalignment between feature learning and classification objectives.
Experimental evaluation on the ADNI dataset across five clinically relevant classification tasks demonstrates that MambaKAN consistently outperforms seven competitive baselines, with statistically significant improvements over the previous state of the art. The multi-layer interpretability analysis provides three complementary levels of neuroscientific insight: (1) temporal importance maps identifying diagnostically critical scan periods, (2) KAN activation curves revealing the nonlinear functional relationships between latent features and diagnostic decisions, and (3) gradient-based brain region attribution maps identifying key functional connections consistent with established AD neuropathology.
This work establishes a methodological foundation for integrating next-generation sequence modeling architectures (SSMs) and intrinsically interpretable networks into the neuroimaging analysis pipeline. We believe MambaKAN represents an important step toward the clinically deployable, trustworthy AI-assisted diagnosis of neurodegenerative diseases.

Author Contributions

Conceptualization, L.G. and Z.H.; methodology, L.G.; software, L.G.; validation, L.G. and Z.H.; formal analysis, L.G.; investigation, L.G.; resources, Z.H.; data curation, L.G.; writing—original draft preparation, L.G.; writing—review and editing, Z.H.; visualization, L.G.; supervision, Z.H.; project administration, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Wenzhou major scientific and technological innovation project plan “reveal the list” (ZG2023022), the Wenzhou Fundamental Scientific Research Program (Y20240185 and G20240065), the General Program of Wenzhou Municipal Science and Technology Bureau (S2023013), the General Program of Zhejiang Provincial Department of Education (Y202250720), and the Teacher Scientific Research Project of Zhejiang Industry & Trade Vocational College (G250101).

Institutional Review Board Statement

Not applicable. This study uses publicly available de-identified neuroimaging data from the ADNI repository and did not involve direct human subject interaction requiring additional ethical approval.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ADNI dataset is publicly available at https://adni.loni.usc.edu/ (accessed on 14 April 2026) upon registration. The core model and algorithm code is publicly available at https://github.com/l1binn/MambaKAN.

Acknowledgments

DeepSeek-R1 (DeepSeek, Hangzhou, China) was used to refine English sentences, grammar, and academic phrasing to enhance manuscript clarity and readability, as English is not our native language. No AI tools were involved in research design, data analysis, result interpretation, or core intellectual content creation. All authors have reviewed and approved the final text and take full responsibility for its content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Alzheimer’s Disease
CN: Cognitively Normal
EMCI: Early Mild Cognitive Impairment
LMCI: Late Mild Cognitive Impairment
MCI: Mild Cognitive Impairment
rs-fMRI: Resting-state Functional Magnetic Resonance Imaging
BOLD: Blood Oxygen Level-Dependent
dFC: Dynamic Functional Connectivity
FC: Functional Connectivity
VAE: Variational Autoencoder
KAN: Kolmogorov–Arnold Network
MLP: Multi-Layer Perceptron
SSM: State Space Model
S6: Selective State Space Model
LSTM: Long Short-Term Memory
ROI: Region of Interest
AAL: Automated Anatomical Labeling
ADNI: Alzheimer’s Disease Neuroimaging Initiative
DMN: Default Mode Network
PCC: Pearson Correlation Coefficient
ELBO: Evidence Lower Bound
KL: Kullback–Leibler
SHAP: Shapley Additive exPlanations
AUC: Area Under the ROC Curve
MNI: Montreal Neurological Institute
FWHM: Full Width at Half Maximum
TR: Repetition Time

References

  1. Kalra, N.; Verma, P.; Verma, S. Advancements in AI based healthcare techniques with focus on diagnostic techniques. Comput. Biol. Med. 2024, 179, 108917.
  2. Bermejo-Pareja, F.; Del Ser, T. Controversial past, splendid present, unpredictable future: A brief review of Alzheimer disease history. J. Clin. Med. 2024, 13, 536.
  3. Livingston, G.; Huntley, J.; Sommerlad, A.; Ames, D.; Ballard, C.; Banerjee, S.; Brayne, C.; Burns, A.; Cohen-Mansfield, J.; Cooper, C.; et al. Dementia prevention, intervention, and care: 2020 report of the Lancet Commission. Lancet 2020, 396, 413–446; Correction in Lancet 2023, 402, 1132. https://doi.org/10.1016/S0140-6736(23)02043-3
  4. Qiu, X.; Zhao, T.; Kong, Y.; Chen, F. Influence of population aging on balance of medical insurance funds in China. Int. J. Health Plan. Manag. 2020, 35, 152–161.
  5. Li, C.; Wang, S.; Xia, Y.; Shi, F.; Tang, L.; Yang, Q.; Feng, J.; Li, C. Risk factors and predictive models in the progression from MCI to Alzheimer’s disease. Neuroscience 2025, 565, 312–319.
  6. Yiannopoulou, K.G.; Papageorgiou, S.G. Current and future treatments in Alzheimer disease: An update. J. Cent. Nerv. Syst. Dis. 2020, 12, 1179573520907397.
  7. Uysal, G.; Ozturk, M. Comparative analysis of different brain regions using machine learning for prediction of EMCI and LMCI stages of Alzheimer’s disease. Multimed. Tools Appl. 2024, 83, 21455–21470.
  8. Ibrahim, B.; Suppiah, S.; Ibrahim, N.; Mohamad, M.; Hassan, H.A.; Nasser, N.S.; Saripan, M.I. Diagnostic power of resting-state fMRI for detection of network connectivity in Alzheimer’s disease and mild cognitive impairment: A systematic review. Hum. Brain Mapp. 2021, 42, 2941–2968.
  9. Menon, V.; D’Esposito, M. The role of PFC networks in cognitive control and executive function. Neuropsychopharmacology 2022, 47, 90–103.
  10. Westlin, C.; Theriault, J.E.; Katsumi, Y.; Nieto-Castanon, A.; Kucyi, A.; Ruf, S.F.; Brown, S.M.; Pavel, M.; Erdogmus, D.; Brooks, D.H.; et al. Improving the study of brain-behavior relationships by revisiting basic assumptions. Trends Cogn. Sci. 2023, 27, 246–257.
  11. Wang, M.; Zhang, D.; Huang, J.; Yap, P.-T.; Shen, D.; Liu, M. Identifying autism spectrum disorder with multi-site fMRI via low-rank domain adaptation. IEEE Trans. Med. Imaging 2019, 39, 644–655.
  12. Zhang, J.; Small, M. Complex network from pseudoperiodic time series: Topology versus dynamics. Phys. Rev. Lett. 2006, 96, 238701.
  13. Allen, E.A.; Damaraju, E.; Plis, S.M.; Erhardt, E.B.; Eichele, T.; Calhoun, V.D. Tracking whole-brain connectivity dynamics in the resting state. Cereb. Cortex 2014, 24, 663–676.
  14. Chen, H.; Lei, Y.; Li, R.; Xia, X.; Cui, N.; Chen, X.; Liu, J.; Tang, H.; Zhou, J.; Huang, Y.; et al. Resting-state EEG dynamic functional connectivity distinguishes non-psychotic major depression, psychotic major depression and schizophrenia. Mol. Psychiatry 2024, 29, 1088–1098.
  15. Hu, Z.; Gao, L.; Tong, Y.; Lu, X.; Li, R.; Xiao, L.; Li, Z. Multi-scale convolutional neural networks for brain disease diagnosis. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 5–8 December 2023.
  16. Petersen, R.C.; Aisen, P.; Beckett, L.A.; Donohue, M.C.; Gamst, A.C.; Harvey, D.J.; Jack, C.R., Jr.; Jagust, W.J.; Shaw, L.M.; Toga, A.W.; et al. Alzheimer’s disease neuroimaging initiative (ADNI): Clinical characterization. Neurology 2010, 74, 201–209.
  17. Gao, L.; Hu, Z.; Li, R.; Lu, X.; Li, Z.; Zhang, X.; Xu, S. Multi-Perspective Feature Extraction and Fusion Based on Deep Latent Space for Diagnosis of Alzheimer’s Diseases. Brain Sci. 2022, 12, 1348.
  18. Kingma, D.P.; Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
  19. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  21. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov–Arnold networks. arXiv 2024, arXiv:2404.19756.
  22. Jack, C.R., Jr.; Bernstein, M.A.; Fox, N.C.; Thompson, P.; Alexander, G.; Harvey, D.; Borowski, B.; Britson, P.J.; Whitwell, J.L.; Ward, C.; et al. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging 2008, 27, 685–691.
  23. Jie, B.; Liu, M.; Lian, C.; Shi, F.; Shen, D. Designing weighted correlation kernels in convolutional neural networks for functional connectivity based brain disease diagnosis. Med. Image Anal. 2020, 63, 101709. [Google Scholar] [CrossRef] [PubMed]
  24. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall architecture of the proposed MambaKAN framework. The pipeline sequentially transforms rs-fMRI BOLD signals through dFC construction, VAE-based per-window latent encoding, Mamba selective temporal modeling, and KAN-based interpretable classification: dynamic functional connectivity construction via sliding-window Pearson correlation; Variational Autoencoder per-window encoding with reparameterization; Mamba selective state space temporal encoder aggregating the 54-window latent sequence; KAN classifier with learnable B-spline edge activations producing class probabilities and interpretable activation curves.
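The first stage named in the Figure 1 caption, sliding-window Pearson correlation, can be illustrated with a minimal standard-library sketch. The window length, stride, toy signals, and the function names `pearson`, `sliding_window_dfc`, and `n_edges` are assumptions made for illustration; the caption itself only states that the pipeline produces a 54-window sequence.

```python
import math

def n_edges(n_rois):
    """Number of unique ROI pairs (upper triangle without the diagonal)."""
    return n_rois * (n_rois - 1) // 2

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # constant signal: correlation is undefined, 0 is used here
    return cov / math.sqrt(vx * vy)

def sliding_window_dfc(bold, window, stride):
    """bold: list of ROI time series (n_rois x n_timepoints).
    Returns one upper-triangular connectivity vector per window."""
    n_rois, n_time = len(bold), len(bold[0])
    snapshots = []
    for start in range(0, n_time - window + 1, stride):
        seg = [ts[start:start + window] for ts in bold]
        vec = [pearson(seg[i], seg[j])
               for i in range(n_rois) for j in range(i + 1, n_rois)]
        snapshots.append(vec)
    return snapshots

# Toy input: 4 synthetic "ROIs" with 40 time points each (not real BOLD data).
bold = [[math.sin(0.3 * t + phase) for t in range(40)]
        for phase in (0.0, 0.5, 1.0, 2.0)]
dfc = sliding_window_dfc(bold, window=20, stride=5)  # placeholder settings
```

With 116 ROIs, `n_edges(116)` gives 6670, which matches the raw dFC dimensionality quoted in the Table 4 caption.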
Figure 2. Mamba temporal importance profiles by diagnostic class. The y-axis shows the mean time-step selectivity score $s_k$ (Equation (27)) averaged over test subjects within each class, and the shaded regions indicate ±1 standard deviation. Elevated $s_k$ values at specific windows indicate that those temporal snapshots are most informative for discriminating the respective diagnostic class.
Figure 3. Learned KAN activation curves for the top-10 most influential latent dimensions (CN vs. AD task). Each panel shows the B-spline edge function $\varphi_{j,i}^{(1)}(x)$ as a function of the latent input value, with color indicating the corresponding class logit weight. Monotonically increasing (decreasing) curves indicate dimensions whose elevated values promote AD (CN) classification; non-monotonic curves reveal threshold or saturation effects in latent-to-class mappings.
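To make the notion of a learnable B-spline edge function concrete, the sketch below evaluates a spline curve as a weighted sum of B-spline basis functions using the Cox-de Boor recursion. It is an illustration under stated assumptions, not the authors' implementation: the degree (3), the input range [-1, 1], the coefficient values, and the helper names `make_knots`, `bspline_basis`, and `edge_function` are all hypothetical. The original KAN formulation also adds a SiLU base term to the spline, which is omitted here.

```python
def make_knots(lo, hi, grid_size, degree):
    """Uniform knot vector on [lo, hi], extended by `degree` intervals per side."""
    h = (hi - lo) / grid_size
    return [lo + (j - degree) * h for j in range(grid_size + 2 * degree + 1)]

def bspline_basis(x, knots, i, k):
    """Value at x of the i-th B-spline basis function of degree k (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0
    if left_den > 0:
        left = (x - knots[i]) / left_den * bspline_basis(x, knots, i, k - 1)
    right = 0.0
    if right_den > 0:
        right = ((knots[i + k + 1] - x) / right_den
                 * bspline_basis(x, knots, i + 1, k - 1))
    return left + right

def edge_function(x, coeffs, knots, degree):
    """Spline part of an edge activation: sum_i c_i * B_i(x)."""
    return sum(c * bspline_basis(x, knots, i, degree)
               for i, c in enumerate(coeffs))

DEGREE = 3                      # assumed spline degree
knots = make_knots(-1.0, 1.0, grid_size=5, degree=DEGREE)
n_basis = len(knots) - DEGREE - 1   # grid_size + degree coefficients per edge

# Hypothetical increasing coefficients: the resulting curve is monotone,
# the kind of shape the caption associates with one class being promoted.
coeffs = [-2.0, -1.5, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]
xs = [-1.0 + 0.1 * t for t in range(20)]
curve = [edge_function(x, coeffs, knots, DEGREE) for x in xs]
```

Because every coefficient is a trainable parameter attached to a local basis function, plotting `curve` against `xs` shows directly how one input dimension is transformed, which is the kind of view Figure 3 presents.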
Figure 4. Learned KAN activation curves for the top-10 most influential latent dimensions (EMCI vs. LMCI task). Each panel shows the B-spline edge function $\varphi_{j,i}^{(1)}(x)$ as a function of the latent input value. The close proximity and frequent overlap of the two curves reflect the clinical reality that EMCI and LMCI represent adjacent stages on the AD continuum with highly overlapping pathological features, making this the most challenging binary classification task.
Figure 5. Gradient-based brain region attribution chord diagrams for each diagnostic class. Each panel visualizes the top-50 most influential ROI-to-ROI connections (Equation (35)) as a circular chord diagram, where chord thickness indicates attribution magnitude, node color represents anatomical lobe membership, and bold region labels indicate brain regions with relatively higher attribution scores. The four panels correspond to CN (Class 0), EMCI (Class 1), LMCI (Class 2), and AD (Class 3), respectively.
Figure 6. Brain region attribution heatmaps for each diagnostic class. Each panel displays the full 116 × 116 ROI-to-ROI attribution matrix, where color intensity (red-to-yellow gradient) indicates the magnitude of gradient-based attribution scores. The four panels correspond to CN, EMCI, LMCI, and AD classes, with color bar scales ranging from 0.0010 (CN) to 0.0020 (AD), reflecting increasing reliance on specific connectivity patterns as disease severity progresses.
Figure 7. Top-20 brain regions ranked by attribution score for each diagnostic class. Each panel displays the 20 ROIs with the highest gradient-based attribution magnitudes, quantifying their relative importance for class-specific classification decisions. Vermis_10 (cerebellar vermis lobule X) consistently ranks first across all four classes, establishing the cerebellum as the universal foundation for AD staging. Secondary features diverge by class: CN emphasizes striatal regions (caudate), EMCI highlights limbic structures (insula, parahippocampal gyrus), LMCI engages sensory cortices (Heschl’s gyrus, olfactory cortex), and AD recruits prefrontal and parietal executive networks.
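Figures 5 to 7 present three views of the same attribution scores: the strongest ROI-to-ROI connections, the full ROI-by-ROI matrix, and a per-region ranking. The sketch below shows one plausible way to derive all three from a single per-connection attribution vector. It rests on assumptions that the captions do not confirm: a row-major upper-triangle ordering of connections, absolute values as magnitudes, and row sums as the per-region score. The helper names and the toy vector are hypothetical, and Equation (35) is not reproduced here.

```python
def vector_to_matrix(vec, n_rois):
    """Unpack an upper-triangular attribution vector into a symmetric matrix
    of magnitudes (assumed row-major ordering of ROI pairs)."""
    assert len(vec) == n_rois * (n_rois - 1) // 2
    mat = [[0.0] * n_rois for _ in range(n_rois)]
    idx = 0
    for i in range(n_rois):
        for j in range(i + 1, n_rois):
            mat[i][j] = mat[j][i] = abs(vec[idx])
            idx += 1
    return mat

def top_connections(mat, k):
    """The k ROI pairs with the largest attribution magnitude."""
    n = len(mat)
    pairs = [(mat[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    return [(i, j, score) for score, i, j in pairs[:k]]

def top_regions(mat, k):
    """The k ROIs with the largest summed attribution over their connections."""
    scores = [(sum(row), roi) for roi, row in enumerate(mat)]
    scores.sort(reverse=True)
    return [(roi, score) for score, roi in scores[:k]]

# Toy example with 4 ROIs (6 connections); values are made up.
toy_vec = [0.1, -0.4, 0.0, 0.2, 0.05, 0.3]
toy_mat = vector_to_matrix(toy_vec, n_rois=4)
strongest = top_connections(toy_mat, k=2)
leading_roi = top_regions(toy_mat, k=1)[0][0]
```

In the paper's setting the same functions would be called with `n_rois=116`, `k=50` for the chord diagrams, and `k=20` for the region ranking.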
Table 1. Per-class subject distribution across training, validation, and test partitions under the fixed 70/10/20 subject-level split. All four diagnostic groups, including the smallest AD subgroup (n = 31), are represented in every partition.

| Class | Train | Val | Test | Total |
|---|---|---|---|---|
| CN | 33 | 5 | 10 | 48 |
| EMCI | 35 | 5 | 10 | 50 |
| LMCI | 32 | 4 | 9 | 45 |
| AD | 21 | 3 | 7 | 31 |
| Total | 121 | 17 | 36 | 174 |
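A subject-level stratified split of the kind summarized in Table 1 can be sketched as follows. This is a minimal illustration, not the authors' procedure: the seed, the rounding rule, the subject identifiers, and the function name `stratified_subject_split` are assumptions, so per-class counts may differ from Table 1 by a subject.

```python
import random

def stratified_subject_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """labels: dict mapping subject id -> class label.
    Splits each class separately so every partition contains every class.
    Returns (train, val, test) lists of subject ids."""
    rng = random.Random(seed)
    by_class = {}
    for subject, label in labels.items():
        by_class.setdefault(label, []).append(subject)
    train, val, test = [], [], []
    for label in sorted(by_class):
        subjects = sorted(by_class[label])
        rng.shuffle(subjects)
        n = len(subjects)
        n_test = max(1, round(fractions[2] * n))  # at least one subject each
        n_val = max(1, round(fractions[1] * n))
        test += subjects[:n_test]
        val += subjects[n_test:n_test + n_val]
        train += subjects[n_test + n_val:]        # remainder goes to training
    return train, val, test

# Class totals taken from Table 1; the subject ids are hypothetical.
class_sizes = {"CN": 48, "EMCI": 50, "LMCI": 45, "AD": 31}
labels = {f"{cls}_{i:03d}": cls
          for cls, n in class_sizes.items() for i in range(n)}
train, val, test = stratified_subject_split(labels)
```

Splitting by subject rather than by dFC window keeps all windows from one person inside a single partition, which avoids leakage between training and test data.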
Table 2. Classification performance comparison across five tasks on the ADNI dataset. The best results per task are in bold. All metrics are mean ± std averaged over five runs with fixed seeds on the same hardware. † indicates statistically significant improvement over the strongest baseline (p < 0.05, paired t-test). Acc = accuracy (%), Pre = precision (%), Rec = recall (%), F1 = F1-score (%), and AUC = area under ROC curve.

CN vs. AD (T1)

| Method | Acc | Pre | Rec | F1 | AUC |
|---|---|---|---|---|---|
| SVM-RBF | 61.5 ± 6.1 | 52.9 ± 5.5 | 81.8 ± 7.2 | 64.3 ± 6.0 | 61.8 ± 5.8 |
| Random Forest | 53.9 ± 6.8 | 46.7 ± 6.2 | 63.6 ± 7.9 | 53.9 ± 6.5 | 62.7 ± 6.3 |
| Gradient Boosting | 61.5 ± 6.0 | 53.3 ± 5.4 | 72.7 ± 7.0 | 61.5 ± 5.9 | 69.1 ± 5.5 |
| VAE + MLP | 61.5 ± 5.7 | 52.9 ± 5.1 | 81.8 ± 6.6 | 64.3 ± 5.5 | 64.2 ± 5.3 |
| VAE + BiLSTM | 76.9 ± 4.8 | 66.7 ± 4.3 | 90.9 ± 5.2 | 76.9 ± 4.7 | 86.1 ± 4.2 |
| VAE + Transformer | 65.4 ± 5.5 | 56.3 ± 5.0 | 81.8 ± 6.3 | 66.7 ± 5.3 | 67.9 ± 5.0 |
| VAE + KAN | 69.2 ± 5.1 | 63.6 ± 4.7 | 63.6 ± 5.8 | 63.6 ± 5.0 | 79.4 ± 4.5 |
| VAE + Mamba + MLP | 84.6 ± 4.0 | 88.9 ± 3.6 | 72.7 ± 4.8 | 80.0 ± 3.9 | 88.5 ± 3.5 |
| MambaKAN | 95.1 ± 2.3 | 88.1 ± 3.2 | 100.0 ± 0.0 | 93.7 ± 2.4 | 99.5 ± 0.4 |

CN vs. EMCI (T2)

| Method | Acc | Pre | Rec | F1 | AUC |
|---|---|---|---|---|---|
| SVM-RBF | 57.6 ± 5.8 | 60.9 ± 5.3 | 46.7 ± 6.5 | 52.8 ± 5.6 | 66.0 ± 5.2 |
| Random Forest | 61.0 ± 6.1 | 68.4 ± 5.7 | 43.3 ± 6.9 | 53.1 ± 5.9 | 63.3 ± 5.6 |
| Gradient Boosting | 61.0 ± 5.6 | 66.7 ± 5.0 | 46.7 ± 6.2 | 54.9 ± 5.4 | 74.1 ± 4.9 |
| VAE + MLP | 66.1 ± 5.2 | 70.8 ± 4.7 | 56.7 ± 5.9 | 63.0 ± 5.0 | 71.8 ± 4.6 |
| VAE + BiLSTM | 81.4 ± 4.3 | 85.2 ± 3.9 | 76.7 ± 4.7 | 80.7 ± 4.1 | 88.9 ± 3.7 |
| VAE + Transformer | 71.2 ± 4.9 | 81.0 ± 4.4 | 56.7 ± 5.6 | 66.7 ± 4.7 | 81.3 ± 4.3 |
| VAE + KAN | 77.4 ± 4.5 | 74.3 ± 4.1 | 84.9 ± 5.0 | 79.2 ± 4.3 | 85.7 ± 3.9 |
| VAE + Mamba + MLP | 83.1 ± 3.7 | 88.5 ± 3.3 | 76.7 ± 4.2 | 82.1 ± 3.5 | 91.0 ± 3.1 |
| MambaKAN | 89.8 ± 2.8 | 96.2 ± 2.4 | 83.3 ± 3.6 | 89.3 ± 2.7 | 92.0 ± 2.2 |
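The reporting convention in the Table 2 caption (mean ± std over five runs, with a paired t-test against the strongest baseline) can be reproduced with the standard library alone. The per-run accuracies below are hypothetical placeholders, not values from the paper, and the use of the sample standard deviation is an assumption. The constant 2.776 is the two-sided critical value of Student's t distribution for 4 degrees of freedom at the 0.05 level.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation of per-run scores."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same seeds, same order)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

T_CRIT_DF4 = 2.776  # two-sided, alpha = 0.05, df = 5 runs - 1

# Hypothetical per-seed accuracies (%) for two methods, for illustration only.
proposed = [93.0, 96.0, 95.0, 97.0, 94.0]
baseline = [84.0, 86.0, 85.0, 88.0, 83.0]

mean_acc, std_acc = mean_std(proposed)
t_stat = paired_t_statistic(proposed, baseline)
significant = abs(t_stat) > T_CRIT_DF4
print(f"{mean_acc:.1f} ± {std_acc:.1f}, t = {t_stat:.2f}, significant: {significant}")
```

Pairing by seed matters here: with only five runs, the test compares per-seed differences rather than two independent sets of scores.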
Table 3. Classification performance on EMCI vs. LMCI (T3), LMCI vs. AD (T4), and four-class (T5) tasks. The best results per task are in bold. All metrics are mean ± std averaged over five runs with fixed seeds. † indicates statistically significant improvement over the strongest baseline (p < 0.05, paired t-test). Acc = accuracy (%), Pre = precision (%), Rec = recall (%), F1 = F1-score (%).

EMCI vs. LMCI (T3)

| Method | Acc | Pre | Rec | F1 |
|---|---|---|---|---|
| SVM-RBF | 68.0 ± 5.6 | 55.6 ± 5.1 | 55.6 ± 5.9 | 55.6 ± 5.4 |
| Random Forest | 60.0 ± 6.0 | 45.0 ± 5.5 | 50.0 ± 6.3 | 47.4 ± 5.8 |
| Gradient Boosting | 66.0 ± 5.4 | 52.4 ± 4.9 | 61.1 ± 5.7 | 56.4 ± 5.2 |
| VAE + MLP | 64.0 ± 5.1 | 50.0 ± 4.6 | 72.2 ± 5.5 | 59.1 ± 4.9 |
| VAE + BiLSTM | 80.0 ± 4.2 | 68.2 ± 3.8 | 83.3 ± 4.6 | 75.0 ± 4.0 |
| VAE + Transformer | 78.0 ± 4.5 | 66.7 ± 4.1 | 77.8 ± 4.9 | 71.8 ± 4.3 |
| VAE + KAN | 74.0 ± 4.7 | 60.9 ± 4.3 | 77.8 ± 5.1 | 68.3 ± 4.5 |
| VAE + Mamba + MLP | 76.0 ± 4.1 | 65.0 ± 3.7 | 72.2 ± 4.5 | 68.4 ± 3.9 |
| MambaKAN | 84.0 ± 3.2 | 72.7 ± 2.9 | 88.9 ± 3.6 | 80.0 ± 3.1 |

LMCI vs. AD (T4)

| Method | Acc | Pre | Rec | F1 |
|---|---|---|---|---|
| SVM-RBF | 53.3 ± 6.2 | 35.7 ± 5.7 | 29.4 ± 6.8 | 32.3 ± 6.0 |
| Random Forest | 71.1 ± 5.9 | 75.0 ± 5.5 | 35.3 ± 6.5 | 48.0 ± 5.7 |
| Gradient Boosting | 68.9 ± 5.7 | 57.1 ± 5.2 | 70.6 ± 6.2 | 63.2 ± 5.5 |
| VAE + MLP | 60.0 ± 5.4 | 46.2 ± 5.0 | 35.3 ± 5.9 | 40.0 ± 5.2 |
| VAE + BiLSTM | 82.2 ± 4.3 | 80.0 ± 3.9 | 70.6 ± 4.8 | 75.0 ± 4.1 |
| VAE + Transformer | 84.4 ± 4.1 | 81.3 ± 3.7 | 76.5 ± 4.6 | 78.8 ± 3.9 |
| VAE + KAN | 80.6 ± 4.6 | 69.2 ± 4.2 | 97.4 ± 5.0 | 80.9 ± 4.4 |
| VAE + Mamba + MLP | 75.6 ± 4.4 | 66.7 ± 4.0 | 70.6 ± 4.9 | 68.6 ± 4.2 |
| MambaKAN | 86.7 ± 3.1 | 86.7 ± 2.9 | 76.5 ± 3.6 | 81.3 ± 3.0 |

4-Class (T5)

| Method | Acc | Pre | Rec | F1 |
|---|---|---|---|---|
| SVM-RBF | 42.1 ± 4.6 | 35.8 ± 4.3 | 43.8 ± 4.9 | 35.7 ± 4.5 |
| Random Forest | 40.9 ± 4.9 | 36.0 ± 4.6 | 44.3 ± 5.2 | 35.2 ± 4.7 |
| Gradient Boosting | 49.0 ± 4.4 | 52.0 ± 4.1 | 51.0 ± 4.6 | 48.5 ± 4.2 |
| VAE + MLP | 47.7 ± 4.1 | 40.7 ± 3.8 | 44.9 ± 4.3 | 37.5 ± 3.9 |
| VAE + BiLSTM | 65.9 ± 3.6 | 55.6 ± 3.3 | 63.9 ± 3.9 | 56.1 ± 3.5 |
| VAE + Transformer | 52.3 ± 4.3 | 46.4 ± 4.0 | 53.3 ± 4.6 | 44.2 ± 4.1 |
| VAE + KAN | 59.4 ± 3.9 | 59.5 ± 3.6 | 59.7 ± 4.1 | 59.5 ± 3.8 |
| VAE + Mamba + MLP | 62.5 ± 3.4 | 53.0 ± 3.1 | 58.4 ± 3.7 | 51.7 ± 3.2 |
| MambaKAN | 70.5 ± 2.1 | 58.1 ± 2.3 | 65.8 ± 2.0 | 59.2 ± 2.2 |
Table 4. Ablation study on the four-class classification task (CN/EMCI/LMCI/AD). Results are mean ± std over five runs with fixed seeds on the same hardware. Bold indicates the best result in each section. Italics indicate section headers. Section I shows progressive component contributions; Section II compares temporal models with a unified KAN head (VAE + BiLSTM + KAN is a newly added ablation configuration distinct from Table 3’s VAE + BiLSTM, which uses an MLP head); Section III ablates the training strategy; Section IV isolates the impact of VAE dimensionality reduction by comparing three configurations with identical KAN classifiers but different input representations (raw 6670-dim direct, raw with mean pooling, and VAE-compressed 128-dim), demonstrating that VAE compression is the dominant factor enabling effective KAN classification.

| Configuration | Acc (%) | Pre (%) | Rec (%) | F1 (%) |
|---|---|---|---|---|
| *Section I: Progressive component ablation* | | | | |
| VAE + MLP (no temporal, no KAN) | 47.7 ± 4.1 | 40.7 ± 3.8 | 44.9 ± 4.3 | 37.5 ± 3.9 |
| VAE + Mamba + MLP (no KAN) | 62.5 ± 3.4 | 53.0 ± 3.1 | 58.4 ± 3.7 | 51.7 ± 3.2 |
| MambaKAN (full) | 70.5 ± 2.1 | 58.1 ± 2.3 | 65.8 ± 2.0 | 59.2 ± 2.2 |
| *Section II: Temporal model comparison (unified KAN head)* | | | | |
| VAE + KAN (mean pooling, no Mamba) | 59.4 ± 3.9 | 59.5 ± 3.6 | 59.7 ± 4.1 | 59.5 ± 3.8 |
| VAE + BiLSTM + KAN | 68.2 ± 3.3 | 57.3 ± 3.0 | 64.5 ± 3.5 | 58.1 ± 3.1 |
| MambaKAN (full) | 70.5 ± 2.1 | 58.1 ± 2.3 | 65.8 ± 2.0 | 59.2 ± 2.2 |
| *Section III: Training strategy ablation* | | | | |
| MambaKAN (full) | 70.5 ± 2.1 | 58.1 ± 2.3 | 65.8 ± 2.0 | 59.2 ± 2.2 |
| w/o Phase 1 pre-training | 65.2 ± 2.8 | 52.3 ± 3.1 | 60.1 ± 2.9 | 53.8 ± 2.7 |
| w/o warmup freeze | 67.8 ± 2.5 | 55.6 ± 2.7 | 63.2 ± 2.4 | 56.9 ± 2.6 |
| *Section IV: Impact of VAE dimensionality reduction* | | | | |
| No VAE: Raw dFC → KAN (direct, no pooling) | 28.4 ± 3.5 | 25.1 ± 3.2 | 22.3 ± 3.4 | 23.5 ± 3.3 |
| No VAE: Raw dFC → MeanPool → KAN | 34.1 ± 2.9 | 30.8 ± 2.6 | 26.1 ± 2.7 | 26.7 ± 2.5 |
| VAE → KAN § | 59.4 ± 3.9 | 59.5 ± 3.6 | 59.7 ± 4.1 | 59.5 ± 3.8 |

Notes: VAE + BiLSTM + KAN is a new ablation configuration (BiLSTM temporal model + KAN head), distinct from VAE + BiLSTM, which uses an MLP classification head. In the "Raw dFC → KAN (direct, no pooling)" row, the raw dFC (6670-dim) is fed directly into the KAN without any pooling or temporal modeling; this row is the curse-of-dimensionality baseline. § The VAE + KAN (mean pooling, no Mamba) results are repeated here to show the compression benefit of the VAE directly.
Table 5. Sensitivity analysis for key hyperparameters on task T1 (CN vs. AD). Results are the mean ± std over five runs with fixed seeds. Default values are indicated with ⋆. Bold indicates the best result within each hyperparameter group.

| Hyperparameter | Value | Acc (%) | F1 (%) | AUC |
|---|---|---|---|---|
| Latent dim $L$ | 32 | 86.2 ± 2.8 | 84.6 ± 3.1 | 89.0 ± 2.5 |
| | 64 | 90.5 ± 2.4 | 89.0 ± 2.6 | 94.0 ± 2.0 |
| | 128 ⋆ | 95.1 ± 2.3 | 93.7 ± 2.4 | 99.5 ± 0.4 |
| | 256 | 91.9 ± 2.7 | 89.9 ± 2.9 | 96.1 ± 1.8 |
| Mamba layers $L_m$ | 1 | 90.5 ± 2.5 | 88.6 ± 2.8 | 93.5 ± 2.2 |
| | 2 ⋆ | 95.1 ± 2.3 | 93.7 ± 2.4 | 99.5 ± 0.4 |
| | 3 | 93.3 ± 2.6 | 92.2 ± 2.7 | 96.8 ± 1.9 |
| | 4 | 91.4 ± 2.9 | 89.2 ± 3.2 | 94.7 ± 2.3 |
| KAN grid size $G$ | 3 | 91.4 ± 2.6 | 89.3 ± 2.9 | 95.0 ± 2.1 |
| | 5 ⋆ | 95.1 ± 2.3 | 93.7 ± 2.4 | 99.5 ± 0.4 |
| | 7 | 94.1 ± 2.5 | 93.2 ± 2.6 | 98.0 ± 1.7 |
| | 9 | 91.9 ± 2.8 | 90.0 ± 3.1 | 96.0 ± 2.2 |
| Loss weight $\alpha$ | 0.01 | 91.9 ± 2.5 | 89.9 ± 2.8 | 96.1 ± 2.0 |
| | 0.1 ⋆ | 95.1 ± 2.3 | 93.7 ± 2.4 | 99.5 ± 0.4 |
| | 0.5 | 93.3 ± 2.7 | 92.2 ± 2.9 | 96.9 ± 1.9 |
| | 1.0 | 88.7 ± 3.2 | 86.5 ± 3.5 | 92.3 ± 2.8 |
| | 2.0 | 79.2 ± 4.1 | 75.6 ± 4.8 | 82.7 ± 3.6 |
Table 6. Computational complexity comparison. #Params = total trainable parameters; train time = training time per epoch (mean ± std over 3 runs); inference latency = per-sample latency on the same GPU hardware (mean ± std over 3 runs). Bold indicates the proposed MambaKAN model.

| Model | #Params | Train (s/epoch) | Inference (ms) |
|---|---|---|---|
| VAE + MLP | 32,987,986 | 0.089 ± 0.001 | 5.05 ± 0.04 |
| VAE + KAN | 33,031,182 | 0.103 ± 0.001 | 5.73 ± 0.04 |
| VAE + Transformer + MLP | 33,203,986 | 0.101 ± 0.000 | 5.38 ± 0.06 |
| VAE + Mamba + MLP | 33,221,714 | 1.236 ± 0.003 | 20.10 ± 0.84 |
| VAE + BiLSTM + MLP | 33,598,738 | 0.099 ± 0.001 | 5.60 ± 0.04 |
| **MambaKAN (Proposed)** | **33,264,910** | **1.263 ± 0.010** | **23.74 ± 0.63** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Gao, L.; Hu, Z. MambaKAN: An Interpretable Framework for Alzheimer’s Disease Diagnosis via Selective State Space Modeling of Dynamic Functional Connectivity. Brain Sci. 2026, 16, 421. https://doi.org/10.3390/brainsci16040421
