1. Introduction
Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by gradual cognitive impairments, notably affecting speech and language abilities [1,2]. It is the most common cause of dementia, accounting for roughly 60–80% of cases [3]. Given its prevalence and the lack of a curative therapy, there is a pressing need for earlier, scalable diagnosis that can improve quality of life for people living with dementia and their caregivers [4].
Clinical diagnosis of AD still relies primarily on comprehensive assessment, such as neuropsychological testing, clinician interviews, and imaging or fluid biomarkers; these procedures can be costly, time-intensive, and not universally accessible [5,6]. In contrast, a growing body of work shows that spontaneous speech carries a clinically relevant signal for AD: lexical–semantic changes, altered discourse organization, and acoustic alterations emerge early and can be captured non-invasively and at low cost [7,8,9]. Historically, speech pipelines have relied on hand-crafted acoustic and linguistic features, which require domain expertise and can struggle to transfer across disease stages [10,11]. Our recent work in LLM-enabled speech and language analysis has improved robustness and reduced the need for manual feature engineering, enabling practical screening workflows [7,12].
Existing studies mostly use data from a single modality, either linguistic features from transcripts or acoustic–prosodic cues from audio, to screen for AD [8,10,11,13]. Recent efforts combine two modalities (e.g., voice + transcripts) and generally outperform unimodal baselines by aligning linguistic features with acoustic features [9,14,15,16]. In picture-description tasks that use a fixed scene prompt (e.g., Cookie Theft), however, the visual stimulus itself has not been used for prediction. To our knowledge, no AD study has evaluated a trimodal design that jointly integrates the fixed image prompt together with audio and text.
Multimodal learning integrates complementary signals from language, vision, and audio to produce outputs that are more accurate and robust than those of unimodal systems [17]. With pretrained large language model (LLM) encoders now widely available, such as Bidirectional Encoder Representations from Transformers (BERT) for text [18], the Contrastive Language-Image Pre-Training (CLIP) model [19] for language–vision alignment, and Whisper [20] or wav2vec 2.0 [21] for speech, attention has shifted from unimodal representation to the problem of multimodal fusion: combining embeddings from different modalities in a way that is simple, stable, and efficient. Picture-description tasks, however, introduce a unique challenge of modality asymmetry: the image functions as a global prior, whereas speech and text vary by subject. Fusion through simple concatenation or averaging can ignore differences in reliability and granularity, while token-level coupling can be brittle when inputs are already high-level representations [22,23,24]. These considerations motivate learned fusion mechanisms that adaptively weight information across modalities.
Transformer-based cross-attention [25] has become a key mechanism for aligning information across modalities by letting one representation selectively “query” another, rather than relying on fixed concatenation or averaging. In language–vision models, co-attentional and cross-modal transformers align text with image regions and learn joint representations that improve downstream accuracy [26,27]. For audio–text integration, directional cross-modal attention has been shown to couple linguistic and acoustic sequences effectively [24]. Within Alzheimer’s research, however, most multimodal systems use late fusion rather than learned cross-attention [9,10,11]. Prior multimodal approaches for dementia detection have largely focused on attention mechanisms that combine the two modalities of speech and text (e.g., [16,28]). These methods are therefore limited to two modalities and do not explicitly model how speech and text relate to the image, i.e., the visual stimulus that elicits picture-description discourse. In contrast, our approach performs trimodal cross-attention over projected representations of text, audio, and the image, learning to weight and integrate information across all three sources when forming the fused representation. The Cookie Theft picture enables us to exploit the relationship between image and text, particularly the transcript’s relevance to the picture and to specific regions of the scene, thereby providing a shared semantic anchor that can stabilize fusion.
In this work, we assess whether a multimodal cross-attention framework that jointly attends to text, audio, and image can improve AD detection from a picture-description task. Leveraging a pretrained encoder for each modality, we project the individual embeddings into a shared space and then fuse them via a Transformer-based cross-attention mechanism, yielding a fused vector ready for downstream classification (Figure 1). We benchmark uni-, bi-, and trimodal variants and evaluate them with three downstream classifiers: logistic regression (LR), support vector classifier (SVC), and random forests (RF). Using the ADReSSo 2021 dataset, we show that our fusion consistently matches or exceeds uni- and bimodal baselines, indicating that multimodal cross-attention fusion offers a general approach to accurate and reliable AD prediction without task-specific feature engineering.
In this context, the present study proposes a multimodal cross-attention fusion approach to AD screening, offering the following key contributions:
- We introduce an embedding-level cross-attention mechanism that attends jointly over text, audio, and image, producing a classifier-ready fused vector; to the best of our knowledge, this is the first embedding-level trimodal cross-attention fusion architecture for AD screening.
- We show that, on the ADReSSo picture-description task, trimodal fusion consistently improves over unimodal baselines and outperforms bimodal fusion.
- We perform an ablation analysis of modality contributions to quantify the impact of each modality, reporting 95% bootstrap confidence intervals to convey uncertainty and enhance interpretability.
2. Materials and Methods
2.1. Data Overview
We conduct our study on the ADReSSo 2021 Challenge corpus [9], a curated collection of 237 audio recordings in which clinically diagnosed AD participants and cognitively healthy controls describe the Cookie Theft scene from the Boston Diagnostic Aphasia Examination (BDAE) [29,30]. The Cookie Theft picture is a standardized elicitation stimulus that reliably evokes connected speech about everyday events; performance on this task engages naming, scene description, and lexical–semantic access, which are frequently affected in AD.
Following the official ADReSSo split, recordings are partitioned 70/30 into 166 training and 71 test samples, with demographics balanced across partitions. The training set comprises 87 AD and 79 control recordings. The ADReSSo dataset was constructed to minimize demographic bias via propensity score matching for age and sex, implemented using the MatchIt package. Propensity scores were estimated using a probit regression of diagnostic group on age and sex, and balance was assessed using standardized mean differences (SMDs). Reported SMDs for age and sex covariates were <0.001 (with higher-order and interaction terms well below 0.1), indicating adequate demographic matching. Further details and the exact implementation can be found in Luz et al. [9].
2.2. Preprocessing and Feature Construction
All recordings were first automatically transcribed with Whisper to ensure consistent text across variable recording conditions. Specifically, we used the whisper-large-v3 model, which at the time of writing had the lowest transcription word error rate. We supplied a prompt (“Umm, Uhh, let me think like, hmm… Okay, here’s what I’m, like, thinking.”) to the Whisper model to preserve clinically informative phenomena such as fillers, repetitions, and false starts. Each transcript was then converted to a fixed 768-dimensional text embedding using a pretrained ModernBERT model, producing one vector per recording; specifically, we use mean-token pooling to aggregate the ModernBERT outputs into a single vector per transcript. ModernBERT is a modernized version of BERT, pretrained on large general-domain corpora via masked-language modeling to learn rich lexical, syntactic, and discourse representations whose token outputs can be pooled into sentence/paragraph vectors suitable for classification [18,31].
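For concreteness, the following is a minimal sketch of this text branch, assuming the openai-whisper package and the Hugging Face checkpoint answerdotai/ModernBERT-base; the file path and the helper name text_embedding are illustrative, not part of the released pipeline.

```python
# Sketch of the text branch: Whisper transcription with a disfluency-priming
# prompt, then a mean-token-pooled 768-D ModernBERT embedding per transcript.
import torch
import whisper  # openai-whisper package
from transformers import AutoModel, AutoTokenizer

PROMPT = "Umm, Uhh, let me think like, hmm… Okay, here’s what I’m, like, thinking."

asr = whisper.load_model("large-v3")
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
encoder = AutoModel.from_pretrained("answerdotai/ModernBERT-base").eval()

def text_embedding(wav_path: str) -> torch.Tensor:
    # initial_prompt nudges decoding toward keeping fillers and false starts
    transcript = asr.transcribe(wav_path, initial_prompt=PROMPT)["text"]
    batch = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (1, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # (1, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # mean-token pooling -> (1, 768)
```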
For the audio stream, raw waveforms were read at a consistent sampling rate of 16 kHz, compatible with the pretrained model. We avoided aggressive trimming so as not to remove hesitation or silence patterns that may carry relevant information in AD speech. Each file was passed through a pretrained speech recognition model (wav2vec2-base-960h), and the hidden states were pooled over time (mean pooling) to yield a 768-dimensional audio embedding per recording. Wav2vec 2.0 is trained self-supervised on large unlabeled speech collections (e.g., LibriSpeech/Libri-Light), learning contextual representations directly from raw waveforms before light supervised tuning for downstream tasks [21]. By pooling encoder states over time into a single vector, we capture segmental and suprasegmental characteristics relevant to AD while avoiding heavy task-specific feature engineering.
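A corresponding sketch for the audio branch, assuming the facebook/wav2vec2-base-960h checkpoint and torchaudio for file I/O; the helper name is again illustrative.

```python
# Sketch of the audio branch: mean-pooled wav2vec 2.0 hidden states at 16 kHz.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def audio_embedding(wav_path: str) -> torch.Tensor:
    wave, sr = torchaudio.load(wav_path)
    if sr != 16_000:  # resample so the input matches the pretrained model's rate
        wave = torchaudio.functional.resample(wave, sr, 16_000)
    inputs = extractor(wave.mean(0).numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = w2v(**inputs).last_hidden_state      # (1, T, 768)
    return hidden.mean(dim=1)                         # mean pooling over time -> (1, 768)
```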
The Cookie Theft picture serving as the visual prompt was encoded with a vision–language model (CLIP ViT-L/14), yielding a 768-dimensional image descriptor that captures the shared visual context. When the image was combined with corresponding text or audio, this descriptor was broadcast to match the batch dimension. CLIP is trained contrastively on hundreds of millions of web-scale image–text pairs to align visual and linguistic semantics, enabling strong zero-shot transfer [19]. We do not use participant-specific images; instead, we encode the single Cookie Theft stimulus picture, shared across all recordings, to obtain the 768-D image embedding. This embedding is constant for all subjects and is used as contextual grounding for the picture-description task during multimodal fusion rather than as a subject-specific visual feature.
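Because the stimulus is one shared picture, the descriptor is computed once and reused; a sketch assuming the openai/clip-vit-large-patch14 checkpoint, with an illustrative file name and batch size:

```python
# Sketch of the image branch: one CLIP ViT-L/14 embedding for the shared
# Cookie Theft stimulus, broadcast across the batch at fusion time.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

with torch.no_grad():
    pixels = processor(images=Image.open("cookie_theft.png"), return_tensors="pt")
    img_vec = clip.get_image_features(**pixels)       # (1, 768) for ViT-L/14

batch_img = img_vec.expand(16, -1)                    # broadcast to, e.g., a batch of 16
```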
After these preprocessing steps, each split comprises three aligned matrices: text embeddings of size N × 768, audio embeddings of size N × 768, and the image embedding of size 1 × 768, where N is the number of recordings in the split (237 across both splits). These representations are fed directly into the embedding-level cross-attention fusion described below. After fusion, each participant is represented by a single fixed-length 768-D vector, which is used as input to a downstream classifier. We evaluate SVC, LR, and RF as separate models trained independently on the same fused embeddings for the binary AD vs. non-AD classification outcome. For comparability, we use the same downstream classifiers in the unimodal and bimodal settings, replacing the fused embedding with the corresponding modality-specific (or bimodal) representation.
2.3. Multimodal Cross-Attention Fusion
Cross-attention is a mechanism in Transformers whereby one sequence learns to attend to another, as detailed in the landmark paper [25]. For this work, it is the natural way to fuse different modalities, e.g., acoustic and linguistic features, enabling the model to capture subtle relationships between how someone speaks and what they say. Essentially, the cross-modal attention is represented as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where Q, K, and V are, respectively, the queries, keys, and values, defined using modality-specific projection matrices (e.g., $Q = W_Q X$ for the initial embedding X, with $W_Q$ learnable), and $d_k$ is the key dimension. For self-attention, Q, K, and V all come from the same sequence, whereas for cross-attention, Q comes from one sequence (e.g., text) while K and V come from another (e.g., audio). The cross-attention scores therefore capture interdependencies between text and audio.
In our implementation, we fuse the three embeddings (text, audio, and the image descriptor) using a Transformer-based cross-attention mechanism implemented in PyTorch 2.10.0. Each embedding is first linearly projected to a common dimension of 768 and then combined into a fused vector that summarizes the relative contributions of the three modalities. The attention output is passed through a residual connection with LayerNorm and a feed-forward network, followed by another LayerNorm, yielding a stable, classifier-ready representation.
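To make this concrete, below is a minimal PyTorch sketch of an embedding-level trimodal cross-attention block consistent with the description above; the single learned query (matching the randomly initialized query used in our standard pipeline, Section 2.8), the head count of 8, and the 4x feed-forward width are illustrative assumptions rather than the exact published configuration.

```python
# Minimal sketch of an embedding-level trimodal cross-attention fusion block.
import torch
import torch.nn as nn

class TrimodalFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("text", "audio", "image")})
        self.query = nn.Parameter(torch.randn(1, 1, dim))      # learned, modality-agnostic query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, audio, image):                     # each input: (B, 768)
        tokens = torch.stack([self.proj["text"](text),         # 3-token sequence: (B, 3, 768)
                              self.proj["audio"](audio),
                              self.proj["image"](image)], dim=1)
        q = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)                # query attends over the modalities
        fused = self.norm1(fused + q)                          # residual + LayerNorm
        fused = self.norm2(fused + self.ffn(fused))            # feed-forward + LayerNorm
        return fused.squeeze(1)                                # (B, 768) classifier-ready vector
```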
2.4. Training and Validation Protocol
All encoders were kept frozen; only the projection layers, the cross-attention block, and the small feed-forward head were trained. We optimized with AdamW and used a fixed random seed to ensure reproducibility. Training was performed on the ADReSSo training partition, reserving 20% of the training set as a validation fold for early stopping and learning-rate scheduling. Because the picture prompt is constant, the CLIP image descriptor is computed only once and broadcast across the batch during training; text and audio embeddings are derived per sample. At test time, we restore the best validation checkpoint, discard the linear classification head, and extract fused vectors for downstream prediction with standard classifiers. We applied vector normalization prior to the projection layers to stabilize optimization when combining different modalities [32]. The 20% validation split is used only to select the fusion-module checkpoint (early stopping/scheduling) and is not used to tune or ensemble the downstream classifiers. After selecting the best fusion checkpoint, we compute fused embeddings for the full training partition, fit SVC/LR/RF independently on those training embeddings, and report performance on the held-out ADReSSo test set.
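A compact training skeleton consistent with this protocol is sketched below. X_text and X_audio (per-sample (N, 768) embeddings), img_vec (the shared CLIP descriptor), and y (integer labels) are placeholders; the learning rate, batch size, and epoch budget are assumptions, and TrimodalFusion refers to the sketch above.

```python
# Training skeleton: frozen encoders upstream, AdamW, fixed seed,
# 20% validation split, and early stopping on validation loss.
import copy
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)                                           # fixed seed for reproducibility
ds = TensorDataset(X_text, X_audio, y)                         # placeholder tensors
n_val = int(0.2 * len(ds))
train_ds, val_ds = random_split(ds, [len(ds) - n_val, n_val])

fusion, head = TrimodalFusion(), nn.Linear(768, 2)             # temporary head, discarded later
opt = torch.optim.AdamW([*fusion.parameters(), *head.parameters()], lr=1e-4)

def batch_loss(t, a, yb):
    t, a = F.normalize(t, dim=1), F.normalize(a, dim=1)        # normalize before projection
    img = F.normalize(img_vec, dim=1).expand(len(t), -1)       # broadcast shared image vector
    return F.cross_entropy(head(fusion(t, a, img)), yb)

best_val, best_state = float("inf"), None
for epoch in range(50):
    for t, a, yb in DataLoader(train_ds, batch_size=16, shuffle=True):
        loss = batch_loss(t, a, yb)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        val = sum(batch_loss(t, a, yb).item() * len(yb)
                  for t, a, yb in DataLoader(val_ds, batch_size=64)) / n_val
    if val < best_val:                                         # keep the best checkpoint
        best_val, best_state = val, copy.deepcopy(fusion.state_dict())

fusion.load_state_dict(best_state)                             # restore best; head is discarded
```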
2.5. Benchmark with Baselines and Fusion Variants
To make fair comparisons, we held the model architecture constant and varied only which modalities were included. Unimodal baselines used either text embeddings (ModernBERT) or audio embeddings (wav2vec 2.0). Bimodal settings fed any two of the three embeddings (text + audio, text + image, audio + image) into the same embedding-level cross-attention block, which naturally reduces to attending over two modalities. The primary model combined all three modalities (text, audio, and image), producing a single vector for classification. This setup allows us to attribute performance differences to the information in each modality rather than to differences in model architecture. All conditions use the same projection layers and an identical fusion block; the only variable is the set of input modalities provided to the model [24,25].
To contextualize performance, we report results from published ADReSSo 2021 challenge systems (e.g., Luz et al., Balagopalan et al., Pan et al.) using the metrics reported in their papers and compare them with our results on the ADReSSo test split. Because prior systems may differ in preprocessing, feature extraction, and tuning protocols, these comparisons provide approximate benchmark context rather than a strictly controlled re-implementation of each method.
2.6. Downstream Classifiers
To assess the utility of the obtained embeddings for AD prediction, we evaluated the representations with three widely used classifiers: Support Vector Classifier (SVC), Logistic Regression (LR), and Random Forests (RF) [33,34,35,36]. For all classifiers, the features are standardized within a scikit-learn pipeline. We keep hyperparameters modest and conventional: for SVC, we consider the kernel type and the regularization constant; for LR, we use L2-penalized logistic regression and vary the solver and inverse regularization strength; and for RF, we primarily sweep the number of trees with a light check on maximum depth. All classifier implementations are from scikit-learn [37].
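An illustrative scikit-learn setup matching this description is shown below; the specific grid values and the Z_train/y_train names are assumptions, not the exact search spaces used.

```python
# Standardization + modest grid search for SVC, LR, and RF on fused embeddings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

models = {
    "svc": (make_pipeline(StandardScaler(), SVC()),
            {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}),
    "lr":  (make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=2000)),
            {"logisticregression__solver": ["lbfgs", "liblinear"],
             "logisticregression__C": [0.1, 1, 10]}),
    "rf":  (make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0)),
            {"randomforestclassifier__n_estimators": [100, 300, 500],
             "randomforestclassifier__max_depth": [None, 10]}),
}
# Z_train: fused (or modality-specific) embeddings; y_train: binary labels.
fits = {name: GridSearchCV(pipe, grid, cv=5, scoring="f1").fit(Z_train, y_train)
        for name, (pipe, grid) in models.items()}
```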
2.7. Evaluation Metrics
Model selection for the downstream classifiers is performed on the training split only via grid search, after which the chosen configuration is refit on the full training set and evaluated once on the held-out test set. No metadata or demographic covariates are used for training or selection. We report Accuracy, Precision, Recall, and F1 on the test set using standard definitions, with the AD class as the positive label. In addition to the held-out test evaluation, we perform 5-fold stratified cross-validation on the training partition and report each metric as mean ± standard deviation across folds (AD as the positive label) to quantify performance stability. Results are presented for unimodal, bimodal, and trimodal settings under the same protocol to ensure comparability.
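For example (a sketch assuming AD is encoded as 1, and best_model, Z_*, and y_* are placeholders for a selected pipeline and the embeddings/labels):

```python
# Held-out test metrics plus 5-fold stratified CV on the training partition.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_validate

y_pred = best_model.predict(Z_test)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, pos_label=1, average="binary")
print(f"test: acc={accuracy_score(y_test, y_pred):.3f} P={prec:.3f} R={rec:.3f} F1={f1:.3f}")

cv = cross_validate(best_model, Z_train, y_train,
                    scoring=("accuracy", "precision", "recall", "f1"),
                    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
for m in ("accuracy", "precision", "recall", "f1"):
    print(f"{m}: {cv[f'test_{m}'].mean():.3f} ± {cv[f'test_{m}'].std():.3f}")
```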
2.8. Interpretability Through Ablation Analysis
To quantify how each modality contributes to the final decision, we conducted a sensitivity analysis of modality contributions on the held-out test set. For every sample, we recorded the model’s probability for the true class with all three inputs present and then recomputed that probability after removing one modality at a time, implemented by zeroing that modality’s projected input while keeping the others unchanged. The contribution of a modality is defined as the change in true-class probability when that modality is removed (baseline minus ablated). We report 95% confidence intervals for each modality’s contribution using a nonparametric bootstrap [38]: we collect the per-sample contribution scores on the test set and resample the test examples with replacement 5000 times; for each resample we compute the mean contribution, and the 2.5th and 97.5th percentiles of the resulting distribution define the lower and upper bounds of the interval.
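A minimal sketch of this percentile bootstrap, where contrib is a hypothetical (N,) array of per-sample baseline-minus-ablated scores for one modality:

```python
# Nonparametric percentile bootstrap (5000 resamples with replacement) over
# per-sample modality contributions, as described above.
import numpy as np

def bootstrap_ci(contrib: np.ndarray, n_boot: int = 5000, seed: int = 0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(contrib), size=(n_boot, len(contrib)))  # resampled indices
    means = contrib[idx].mean(axis=1)                 # mean contribution per resample
    return contrib.mean(), np.percentile(means, 2.5), np.percentile(means, 97.5)
```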
We additionally perform an ablation to investigate whether fixing the query to a specific modality yields a benefit over our standard pipeline, which uses a randomly initialized, learned query. Specifically, we set the query to the text, image, or audio embedding, train the trimodal pipeline, and run inference on the unseen test set.
4. Discussion
This study shows that cross-attention is a powerful mechanism for fusing text, audio, and image into a representation that consistently improves AD prediction over both unimodal and bimodal fusion on the ADReSSo 2021 dataset. In our analysis, the text modality remains the dominant source of signal, consistent with extensive evidence that lexical–semantic and discourse changes emerge early in AD speech [8,10]. Including the Cookie Theft image as a CLIP descriptor reliably boosts precision and, consequently, overall F1 when paired with text. When fused with audio, the same image prior improves the accuracy of the audio-only system, reinforcing an otherwise weaker modality. The small image effect is expected because the Cookie Theft picture is a fixed stimulus shared across participants and therefore contains no subject-specific diagnostic information. In our framework, the CLIP embedding acts mainly as contextual grounding that can stabilize cross-modal alignment (often improving precision) rather than as an independent biomarker. The trimodal configuration delivers the strongest overall performance, indicating that how pretrained embeddings are combined matters.
The classifier behaviors match expectations: with text included, SVC generally yields the highest F1. LR, by contrast, often achieves the best recall, while RF remains relatively strong when audio is included, likely owing to its robustness to noisy features [41]. Together with the precision gains from adding the image, these patterns suggest that different operational goals (e.g., screening vs. ruling in) can be met by choosing an appropriate downstream classifier rather than changing the fusion module.
To assess how each modality shapes decisions, we performed an ablation study, measuring the change in true-class probability when each modality is removed and reporting 95% bootstrap confidence intervals [38]. Three results emerge. First, removing text causes the largest average drop, indicating that language features carry the strongest discriminative signal for this task, consistent with prior evidence of robust lexical–semantic markers in AD speech [8,11]. Second, audio contributes a smaller but meaningful additional gain, aligning with findings that prosodic and temporal disruptions are characteristic of AD [11,42]. Third, the image prior yields only a modest average benefit, which is reasonable given the fixed visual stimulus across subjects; it serves more as stabilizing context than as a subject-specific cue. Clinically, this pattern is consistent with AD-related language changes such as reduced information content and lexical–semantic impairment, which are especially salient in the Cookie Theft picture-description task, where content is constrained by a shared scene. The audio modality likely adds complementary cues related to the timing and rhythm disruptions reported in AD; however, pooled wav2vec 2.0 embeddings are not a comprehensive prosody model and may under-represent clinically important suprasegmental features, which may partly explain the smaller audio contribution [43].
The query-modality ablations show that a randomly initialized, learned query yields the most consistent performance. This likely reflects its modality-agnostic nature, which avoids the representational biases of any single modality and allows attention to adapt to the most informative cross-modal cues. In contrast, constraining the query to a specific modality biases attention allocation and can shift the precision–recall trade-off, as evidenced by reduced recall when using audio as the query. Overall, the query functions as a flexible integrator of textual, auditory, and visual information aligned with task demands.
When compared with representative ADReSSo 2021 challenge systems, our unimodal and multimodal variants achieve competitive performance, with the trimodal fusion producing the strongest overall F1 and accuracy among the listed approaches. These gains are notable given that our pipeline relies on frozen pretrained encoders and a lightweight fusion module rather than task-specific end-to-end training. At the same time, direct ranking against challenge leaderboards should be interpreted cautiously because published systems may differ in preprocessing, feature construction, and tuning protocols. Nevertheless, the consistent improvements from unimodal to bimodal to trimodal settings suggest that cross-attentional fusion of pretrained embeddings is an effective and practical strategy for this benchmark.
Several limitations merit discussion. First, the ADReSSo 2021 dataset is small and represents only English speakers, so findings may not generalize to other tasks, languages, or recording conditions [9]. Second, the image modality is a singleton: appropriate for the study design but unable to capture fine-grained, per-sample visual variation. Third, ASR errors and transcription inconsistency introduce noise, even with Whisper’s robustness, potentially weakening text signals [20]. Since fillers, repetitions, and hesitations are diagnostically salient in Alzheimer’s disease, ASR transcription uncertainty represents a critical confound, warranting future investigation into robustness against ASR errors.
These limitations suggest several avenues for future work: external validation on additional cohorts and multilingual extensions to test generalizability; varied elicitation prompts and multiple images to probe whether richer visual context outperforms a single picture prior; longitudinal studies to assess sensitivity to change and clinical progression; and modeling work on robustness to missing modalities and domain shift. To improve robustness to transcription errors, future work can apply noisy-text augmentation that mimics ASR-like corruptions such as deletions and insertions while preserving clinically meaningful disfluencies such as fillers, repetitions, and hesitations. In addition, error-aware preprocessing can retain uncertainty markers and avoid aggressive normalization that removes diagnostically relevant phenomena. When available, ASR confidence scores could be used to down-weight unreliable segments or to combine text and acoustic evidence in a confidence-aware manner.
As summarized in Table S2, the proposed fusion module is lightweight relative to the underlying pretrained encoders because all encoders are kept frozen and used only for offline embedding extraction; training is restricted to the projection and cross-attention fusion layers. Building on this design, additional optimizations could further improve practical applicability. For example, lightweight adaptation techniques such as modality-specific adapters or LoRA could be applied to the fusion projections and attention layers to reduce trainable parameters and speed up training without full fine-tuning. In addition, knowledge distillation could be used to train a smaller student model to approximate either the fused representation or the final classifier outputs, enabling lower-latency inference. Finally, compression strategies such as reducing projection dimensionality, quantizing the fusion module, or replacing heavier downstream classifiers with a compact linear head may further reduce runtime and memory while preserving performance.
In summary, our results suggest a promising route for multimodal AD screening that uses pretrained encoders and cross-attentional fusion of text, audio, and image, a strategy that outperforms both unimodal baselines and bimodal fusion.
5. Conclusions
This work shows that cross-attention fusion of text, audio, and an image improves Alzheimer’s detection on ADReSSo, outperforming unimodal and bimodal baselines. By using pretrained encoders (ModernBERT, wav2vec 2.0, CLIP) with cross-attention fusion, we obtain a fused representation that integrates seamlessly with standard classifiers (SVC/LR/RF). An ablation analysis of modality contributions provides practical interpretability, pinpointing where gains originate. More broadly, the results highlight that fusion design materially influences AD detection, even with fixed upstream encoders. For picture-description screening, multimodal cross-attention offers strong performance and produces embeddings suitable for downstream validation and prospective studies.
Overall, we present a fusion approach that unifies pretrained text, audio, and image representations into a single vector, enabling screening that exploits complementary lexical–semantic, acoustic–prosodic, and scene-context cues.