Article

Dementia Detection from Spontaneous Speech Using Cross-Attention Fusion

1 School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA 19104, USA
2 Division of Artificial Intelligence and the Humanities, The Hong Kong Polytechnic University, Hong Kong SAR, China
3 Department of Language Science and Technology, The Hong Kong Polytechnic University, Hong Kong SAR, China
4 Departments of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China
* Author to whom correspondence should be addressed.
J. Dement. Alzheimer's Dis. 2026, 3(1), 12; https://doi.org/10.3390/jdad3010012
Submission received: 23 October 2025 / Revised: 14 January 2026 / Accepted: 5 February 2026 / Published: 2 March 2026

Abstract

Background/Objectives: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that affects the daily lives of older adults, impacting their cognitive abilities as well as speech and language communication. Early detection is crucial, as it enables timely intervention and helps improve the quality of life for those affected. While large language models (LLMs) have shown promise for detecting dementia from spontaneous speech, most studies are unimodal and miss complementary signals across modalities. Methods: We present an LLM-powered multimodal cross-attention framework that integrates lexical (text), acoustic (speech), and visual (image) information for dementia detection using the ADReSSo 2021 picture-description dataset. Within this framework, text data are encoded using ModernBERT, audio features are extracted using the wav2vec 2.0 base-960h model, and the Cookie Theft image is represented through CLIP ViT-L/14. These embeddings are linearly projected to a shared space and then combined via Transformer-based cross-attention, yielding a fused vector for AD detection. Results: Our results show that the trimodal model achieved the best overall performance when paired with an SVC classifier, reaching an accuracy of 0.8732 and an F1 score of 0.8571, surpassing both the top-performing unimodal and bimodal configurations. For interpretability, a sensitivity analysis of modality contributions reveals that text plays the primary role, audio provides complementary improvements, and image offers modest yet stabilizing contextual support. Conclusions: These results highlight that the method of multimodal embedding fusion significantly influences performance: a cross-attention block achieves an effective balance between accuracy and simplicity, producing integrated representations that align well with interpretable downstream classifiers.

1. Introduction

Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by gradual cognitive impairments, notably affecting speech and language abilities [1,2]. It is the most common cause of dementia, accounting for roughly 60–80% of cases [3]. Given its prevalence and the lack of a curative therapy, there is a pressing need for earlier, scalable diagnosis that can improve quality of life for people living with dementia and their caregivers [4].
Clinical diagnosis of AD still relies primarily on comprehensive assessment, such as neuropsychological testing, clinician interviews, and imaging or fluid biomarkers; these procedures can be costly, time-intensive, and not universally accessible [5,6]. In contrast, a growing body of work shows that spontaneous speech carries a clinically relevant signal for AD: lexical–semantic changes, discourse organization, and acoustic alterations emerge early and can be captured non-invasively and at low cost [7,8,9]. Historically, speech pipelines relied on hand-crafted acoustic and linguistic features, which require domain expertise and can struggle to transfer across disease stages [10,11]. Our recent work in LLM-enabled speech and language analysis has improved robustness and reduced the need for manual feature engineering, enabling practical screening workflows [7,12].
Existing studies mostly use data from a single modality, either linguistic features from transcripts or acoustic–prosodic cues from audio, to screen for AD [8,10,11,13]. Recent efforts combine two modalities (e.g., voice + transcripts), generally outperforming unimodal baselines by aligning linguistic features with acoustic features [9,14,15,16]. In picture-description tasks that use a fixed scene prompt (e.g., Cookie Theft), however, the visual stimulus itself has not been used as an input for prediction. To our knowledge, no AD study has evaluated a trimodal design that jointly integrates the fixed image prompt together with audio and text.
Multimodal learning integrates complementary signals from language, vision, and audio to produce outputs that are more accurate and robust than unimodal systems [17]. With pretrained large language model (LLM) encoders now widely available, such as the Bidirectional Encoder Representations from Transformers (BERT) for text [18], the Contrastive Language-Image Pre-Training (CLIP) model [19] for language–vision alignment and Whisper [20] or wav2vec 2.0 [21] for speech, attention has shifted from unimodal representation to the problem of multimodal fusion: combining embeddings from different modalities in a way that is simple, stable, and efficient. Picture-description tasks, however, introduce a unique challenge of modality asymmetry: the image functions as a global prior, whereas speech and text vary by subject. Fusion through simple concatenation or averaging can ignore differences in reliability and granularity, while token-level coupling can be brittle when inputs are already high-level representations [22,23,24]. These considerations motivate learned fusion mechanisms that adaptively weight information across different modalities.
Transformer-based cross-attention [25] has become a key mechanism for aligning information across modalities by letting one representation selectively “query” another, rather than relying on fixed concatenation or averaging. In language–vision models, co-attentional and cross-modal transformers align text with image regions and learn joint representations that improve downstream accuracy [26,27]. For audio–text integration, directional cross-modal attention has been shown to couple linguistic and acoustic sequences effectively [24]. Within Alzheimer’s research, however, most multimodal systems commonly use late fusion instead of learned cross-attention [9,10,11]. Prior multimodal approaches for dementia detection have largely focused on using attention mechanisms to combine two modalities of speech and text (e.g., [16,28]). Hence, these methods are limited to two modalities and do not explicitly model how speech and text relate to the image, i.e., the visual stimulus that elicits picture-description discourse. In contrast, our approach performs trimodal cross-attention over projected representations of text, audio, and the image, learning to weight and integrate information across all three sources when forming the fused representation. The use of the Cookie Theft picture enables us to exploit the relationship between images and texts, particularly the text’s relevance to the picture and the focused area of the picture, thereby providing a shared semantic anchor that can stabilize fusion.
In this work, we specifically assess whether a multimodal cross-attention framework that jointly attends to text, audio, and image can improve AD detection from a picture-description task. Leveraging pretrained LLM encoders for each modality, we project the individual embeddings to a shared space and then fuse them via Transformer-based cross-attention, yielding a fused vector ready for downstream classification (Figure 1). We benchmark uni-, bi-, and trimodal variants and evaluate with downstream classifiers such as logistic regression (LR), support vector classifier (SVC), and random forests (RF). Using the ADReSSo 2021 dataset, we show that our fusion consistently matches or exceeds uni- and bimodal baselines, demonstrating that our multimodal cross-attention fusion method offers a general approach to accurate and reliable AD prediction without task-specific feature engineering.
In this context, the present study proposes a multimodal cross-attention fusion approach to AD screening, offering the following key contributions:
  • We introduce an embedding-level cross-attention mechanism that attends jointly over text, audio, and image, producing a classifier-ready fused vector; to the best of our knowledge, this is the first application of embedding-level trimodal cross-attention fusion architecture for AD screening.
  • We show that on the ADReSSo picture-description tasks, the multimodal fusion consistently improves over unimodal baselines and is better than bimodal fusion.
  • We perform an ablation analysis of modality contributions to quantify the impact of each modality and provide uncertainty with 95% bootstrap confidence intervals to enhance the interpretability.

2. Materials and Methods

2.1. Data Overview

We conduct our study on the ADReSSo 2021 Challenge corpus [9], a curated collection of 237 audio recordings in which clinically diagnosed AD participants and cognitively healthy controls describe the Cookie Theft scene from the Boston Diagnostic Aphasia Examination (BDAE) [29,30]. The Cookie Theft picture is a standardized elicitation stimulus that reliably evokes connected speech about everyday events; performance on this task engages naming, scene description, and lexical–semantic access, which are frequently affected in AD.
Following the official ADReSSo split, recordings are partitioned 70/30 into 166 training and 71 test samples, with demographics balanced across partitions. The training set comprises 87 AD and 79 control recordings. The ADReSSo dataset was constructed to minimize demographic bias via propensity score matching for age and sex, implemented using the MatchIt package. Propensity scores were estimated using a probit regression of diagnostic group on age and sex, and balance was assessed using standardized mean differences (SMDs). Reported SMDs for age and sex covariates were <0.001 (with higher-order and interaction terms well below 0.1), indicating adequate demographic matching. Further details and exact implementation can be found in Luz et al. [9].
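For illustration, the standardized mean difference used to assess balance can be computed as follows; the ages here are synthetic stand-ins, not the actual cohort data, and the pooled-SD formulation shown is one common convention:

```python
import numpy as np

def smd(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference between two groups' covariate values,
    using the pooled standard deviation of the two groups."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Toy age values for two well-matched groups (87 AD, 79 control)
rng = np.random.default_rng(0)
age_ad = rng.normal(70, 6, size=87)
age_ctrl = rng.normal(70, 6, size=79)
print(round(abs(smd(age_ad, age_ctrl)), 3))
```

Values near zero (as reported, <0.001 in the actual corpus) indicate that the covariate distributions are well matched across groups.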

2.2. Preprocessing and Feature Construction

All recordings were first automatically transcribed with Whisper to ensure consistent text across variable recording conditions. Specifically, we used the whisper-large-v3 model to perform the transcription, as at the time of writing it had the lowest transcription word error rate. We provided a prompt (“Umm, Uhh, let me think like, hmm… Okay, here’s what I’m, like, thinking.”) to the Whisper model so as to preserve clinically informative phenomena such as fillers, repetitions, and false starts. Each transcript was then converted to a fixed 768-dimensional text embedding using a pretrained ModernBERT model, producing one vector per recording. Specifically, we use mean-token pooling to aggregate ModernBERT outputs into a single vector per transcript. ModernBERT is a modernized version of BERT, pretrained on large general-domain corpora (e.g., BooksCorpus and Wikipedia) via masked-language modeling to learn rich lexical, syntactic, and discourse representations, which are then pooled into sentence/paragraph vectors suitable for classification [18,31].
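The mean-token pooling step can be sketched as follows; the tensor shapes are illustrative, and random hidden states stand in for the actual ModernBERT forward pass:

```python
import numpy as np

def mean_pool(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token-level hidden states into one vector per transcript,
    counting only non-padding tokens indicated by the attention mask."""
    mask = attention_mask[..., None].astype(token_states.dtype)  # (B, T, 1)
    summed = (token_states * mask).sum(axis=1)                   # (B, D)
    counts = mask.sum(axis=1).clip(min=1.0)                      # (B, 1)
    return summed / counts

# Toy example: batch of 2 transcripts, 4 token positions, 768-dim states
hidden = np.random.randn(2, 4, 768)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # 1 = real token, 0 = padding
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # (2, 768)
```

Masked pooling ensures padding positions do not dilute the transcript representation.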
For the audio stream, raw waveforms were read at a consistent sampling rate of 16 kHz, compatible with the pretrained model. We avoided aggressive trimming so as not to remove hesitation or silence patterns that may carry relevant signal in AD speech. Each file was passed through a pretrained speech recognition model (wav2vec2-base-960h), and the hidden states were pooled over time (mean pooling) to yield a 768-dimensional audio embedding per recording. Wav2vec 2.0 is trained in a self-supervised manner on large unlabeled speech collections (e.g., LibriSpeech/Libri-Light), learning contextual representations directly from raw waveforms before light supervised tuning for downstream tasks [21]. By pooling encoder states over time into a single vector, we capture segmental and suprasegmental characteristics relevant to AD while avoiding heavy task-specific feature engineering.
The Cookie Theft picture, used as the visual prompt, was encoded with a vision–language model (CLIP ViT-L/14), yielding a 768-dimensional image descriptor that captures the shared visual context. When the image was combined with corresponding text or audio, this descriptor was broadcast to match the batch dimension. The CLIP model is trained contrastively on hundreds of millions of web-scale image–text pairs to align visual and linguistic semantics, enabling strong zero-shot transfer. We do not use participant-specific images. Instead, we encode the single Cookie Theft stimulus picture, shared across all recordings, using CLIP ViT-L/14 to obtain a 768-D image embedding. This embedding is constant for all subjects and is used as contextual grounding for the picture-description task during multimodal fusion rather than as a subject-specific visual feature [19].
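Since the stimulus is shared, broadcasting amounts to repeating a single 768-D vector across the batch; a minimal sketch with a random placeholder in place of the actual CLIP descriptor:

```python
import numpy as np

# Placeholder for the single CLIP ViT-L/14 descriptor of the Cookie Theft image
image_vec = np.random.randn(768)

# Broadcast the constant descriptor to match a batch of N recordings
N = 237
image_batch = np.broadcast_to(image_vec, (N, 768))
print(image_batch.shape)  # (237, 768)
```

Every row is the same vector, so the image supplies identical contextual grounding to each sample.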
After the above-mentioned preprocessing steps, each split comprises three aligned matrices: a text embedding of size N × 768, an audio embedding of size N × 768, and the image embedding of size 1 × 768, where N denotes the number of recordings in the split (166 for training and 71 for test). These representations are fed directly into the embedding-level cross-attention fusion described below. After fusion, each participant is represented by a single fixed-length 768-D vector, which is used as input to a downstream classifier. We evaluate SVC, LR, and RF as separate models trained independently on the same fused embeddings for the binary AD vs. non-AD classification outcome. For comparability, we use the same downstream classifiers for unimodal and bimodal settings by replacing the fused embedding with the corresponding modality-specific (or bimodal) representation.

2.3. Multimodal Cross-Attention Fusion

Cross-attention is a mechanism in Transformers where one sequence learns to attend to another sequence, as detailed in the landmark paper [25]. For this work, it is the natural way to fuse different modalities, e.g., acoustic and linguistic features, enabling the model to capture subtle relationships between how someone speaks and what they say. Formally, the attention operation is Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, where Q, K, and V are respectively the queries, keys, and values, each defined using a modality-specific projection matrix (e.g., Q = W_Q X for the initial embedding X, where W_Q is a learnable parameter matrix), and d_k is the key dimension. For self-attention, Q, K, and V all come from the same sequence, whereas for cross-attention, Q comes from one sequence (e.g., text), while K and V come from another (e.g., audio). The resulting attention scores therefore capture interdependencies between text and audio.
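A minimal numerical sketch of this formula, with illustrative random matrices rather than the trained projection weights:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) attention logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Text-side queries attending over audio-side keys/values (toy dimensions)
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))   # e.g., projected text representations
K = rng.normal(size=(9, 64))   # e.g., projected audio representations
V = rng.normal(size=(9, 64))
out, w = cross_attention(Q, K, V)
print(out.shape, w.shape)  # (5, 64) (5, 9)
```

Each output row is a weighted combination of the audio values, with weights reflecting query–key similarity.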
In our implementation, we fuse the three embeddings—text, audio, and the image descriptor—using a Transformer-based cross-attention mechanism, as implemented in PyTorch 2.10.0. Each embedding is first linearly projected to a common dimension of 768, then combined into a fused vector that summarizes the relative contribution of the three modalities. The attention output is passed through a residual LayerNorm + feed-forward network and followed by another LayerNorm, yielding a stable classifier-ready representation.
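The fusion block described above might be sketched in PyTorch roughly as follows; the layer sizes, number of attention heads, and feed-forward expansion factor are illustrative assumptions rather than the exact configuration used:

```python
import torch
import torch.nn as nn

class TrimodalFusion(nn.Module):
    """Sketch of the described fusion block (hypothetical hyperparameters):
    project each modality to a shared 768-D space, attend with a learned
    query over the three modality tokens, then apply residual LayerNorm
    and a feed-forward network followed by another LayerNorm."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.proj_t = nn.Linear(768, d_model)   # text projection
        self.proj_a = nn.Linear(768, d_model)   # audio projection
        self.proj_i = nn.Linear(768, d_model)   # image projection
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learned fusion query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text, audio, image):
        # Stack the three projected modality embeddings as a 3-token sequence
        tokens = torch.stack([self.proj_t(text), self.proj_a(audio),
                              self.proj_i(image)], dim=1)   # (B, 3, d_model)
        q = self.query.expand(text.size(0), -1, -1)         # (B, 1, d_model)
        fused, _ = self.attn(q, tokens, tokens)             # cross-attention
        fused = self.norm1(fused + q)                       # residual + LayerNorm
        fused = self.norm2(fused + self.ffn(fused))         # FFN + LayerNorm
        return fused.squeeze(1)                             # (B, d_model)

# Toy forward pass with random embeddings for a batch of 4 recordings
model = TrimodalFusion()
t, a, i = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
out = model(t, a, i)
print(out.shape)  # torch.Size([4, 768])
```

The single learned query lets attention weights allocate influence across the three modality tokens before the classifier-ready vector is produced.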

2.4. Training and Validation Protocol

All encoders were kept frozen, and only the projection layers, the cross-attention block, and the small feed-forward head were trained. We optimized with AdamW and used a fixed random seed to ensure reproducibility. Training was performed on the ADReSSo training partition, reserving a 20% validation fold, split from the training set, for early stopping and learning-rate scheduling. Because the picture prompt is constant, the CLIP image descriptor is computed only once and broadcast across the batch during training; text and audio embeddings are derived on a per-sample basis. At test time, we restore the best validation checkpoint, discard the linear classification head, and extract fused vectors for downstream prediction with standard classifiers. We applied vector normalization prior to the projection layers to stabilize optimization when combining different modalities [32]. The 20% validation split is used only to select the fusion-module checkpoint (early stopping/scheduler) and is not used to tune or ensemble the downstream classifiers. After selecting the best fusion checkpoint, we compute fused embeddings for the full training partition, fit SVC/LR/RF independently on those training embeddings, and report performance on the held-out ADReSSo test set.
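The checkpoint-selection and early-stopping logic can be sketched as follows; the model, data, learning rate, and patience here are toy stand-ins, not the actual fusion module or hyperparameters:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the trainable fusion head and the train/validation folds
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
X_tr, y_tr = torch.randn(132, 768), torch.randint(0, 2, (132,))
X_va, y_va = torch.randn(34, 768), torch.randint(0, 2, (34,))

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
best_val, best_state, patience, bad = float("inf"), None, 5, 0

for epoch in range(100):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()
    opt.step()
    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_va), y_va).item()
    if val < best_val:                       # new best validation loss
        best_val, bad = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad += 1
        if bad >= patience:                  # early stopping
            break

model.load_state_dict(best_state)            # restore best validation checkpoint
```

At test time, the restored module would be used only to extract fused vectors, with classification delegated to the downstream models.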

2.5. Benchmark with Baselines and Fusion Variants

To make fair comparisons, we held the model architecture constant and varied only which modalities were included. Unimodal baselines used either text embeddings (ModernBERT) or audio embeddings (wav2vec 2.0). Bimodal settings feed any two of the three embeddings (text + audio, text + image, audio + image) into the same embedding-level cross-attention block, which naturally reduces to attending over two modalities. The primary model combined all three modalities (text, audio, and image), producing a single vector for classification. This setup allows us to attribute performance differences to the information in each modality rather than to differences in model architecture. All conditions use the same projection layers and identical fusion block; the only variable is the set of input modalities provided to the model [24,25].
To contextualize performance, we report results from published ADReSSo 2021 challenge systems (e.g., Luz et al., Balagopalan et al., Pan et al.) using the metrics as reported in their papers and compare them with our results on the ADReSSo test split. Because prior systems may differ in preprocessing, feature extraction, and tuning protocols, these comparisons are intended to provide approximate benchmark context rather than a strictly controlled re-implementation of each method.

2.6. Downstream Classifiers

To assess the utility of the obtained embeddings for AD prediction, we evaluated representations with three widely used classifiers: Support Vector Classifier (SVC), Logistic Regression (LR), and Random Forests (RF) [33,34,35,36]. For all the classifiers, the features are standardized within a scikit-learn pipeline. We keep hyperparameters modest and conventional: for SVC, we consider the kernel type and the regularization constant; for LR, we use L2-penalized logistic regression and vary the solver and inverse regularization strength; and for RF, we primarily sweep the number of trees with a light check on maximum depth. All classifier implementations are from scikit-learn [37].
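An illustrative version of such a pipeline on synthetic stand-in embeddings; the SVC grid shown is an example, not the exact set of values swept:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for fused 768-D embeddings and AD/non-AD labels
rng = np.random.default_rng(0)
X = rng.normal(size=(166, 768))
y = rng.integers(0, 2, size=166)

# Standardize features, then classify; grid-search kernel and C on the train set
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = {"clf__kernel": ["linear", "rbf"], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

The LR and RF variants would follow the same pattern, swapping the final estimator and its grid.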

2.7. Evaluation Metrics

All downstream classification model selection is performed on the training split only via a grid search, after which the chosen configuration is refit on the full training set and evaluated once on the held-out test set. No metadata or demographic covariates are used for training or selection. We report Accuracy, Precision, Recall, and F1 on the test set using standard definitions with the AD class as the positive label. In addition to the held-out test evaluation, we perform 5-fold stratified cross-validation on the training partition and report Accuracy, Precision, Recall, and F1 as mean with standard deviation across folds (AD as the positive label) to quantify performance stability. Our results are presented for unimodal, bimodal, and trimodal settings under the same protocol to ensure comparability.
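With the AD class as the positive label, the reported metrics follow the standard scikit-learn definitions; a toy example with hypothetical predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels and predictions; label 1 denotes the AD (positive) class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, pos_label=1)
rec = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 0.75.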

2.8. Interpretability Through Ablation Analysis

To quantify how each modality contributes to the final decision, we conducted a sensitivity analysis of modality contributions on the held-out test set. For every sample, we recorded the model’s probability for the true class with all three inputs present and then recomputed that probability after removing one modality at a time, which is implemented by zeroing that modality’s projected input while keeping the others unchanged. The contribution of a modality is defined as the change in true-class probability when that modality is removed (baseline minus ablated). We report 95% confidence intervals for the contributions of each modality using a nonparametric bootstrap [38]. For each modality, we collect the per-sample contribution scores on the test set and repeatedly resample the test examples with replacement 5000 times. For each resample, we compute the mean contribution, and the 2.5th and 97.5th percentiles of the resulting distribution define the lower and upper bounds of the interval.
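The bootstrap procedure can be sketched as follows, with synthetic contribution scores standing in for the actual per-sample values:

```python
import numpy as np

def bootstrap_ci(scores: np.ndarray, n_boot: int = 5000,
                 alpha: float = 0.05, seed: int = 0):
    """Nonparametric bootstrap CI for the mean per-sample contribution:
    resample with replacement, record each resample's mean, and take the
    2.5th and 97.5th percentiles as the interval bounds."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([rng.choice(scores, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

# Toy per-sample contributions (baseline minus ablated true-class probability)
scores = np.random.default_rng(1).normal(loc=0.1, scale=0.05, size=71)
mean, lo, hi = bootstrap_ci(scores)
print(round(mean, 3), round(lo, 3), round(hi, 3))
```

An interval excluding zero indicates the modality's contribution is consistently positive across resamples of the test set.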
We additionally perform an ablation analysis to investigate whether using any of the modalities as the query yields a benefit compared to our standard pipeline, which uses a randomly initialized query for training. Specifically, we set the query to either text, image, or audio and train the trimodal pipeline, followed by inference on the unseen test set.

3. Results

3.1. Unimodal Baselines

In Table 1, we compare the performance of text and audio embeddings using the different classifiers. From the table, we can see that text provides a stronger prediction for AD than audio. With ModernBERT embeddings on the unseen test set, SVC attains an F1 of 0.8358, outperforming LR, which has an F1 of 0.7945, and RF, which has an F1 of 0.7741. In general, audio (wav2vec2) is weaker across all the classifiers, as the best audio-only result is achieved by RF with an F1 of 0.6769 and an accuracy of 0.7042. Similar observations can be made for the 5-fold cross-validation results as well. This performance gap between text and audio underscores that picture-description transcripts capture more discriminative information than audio embeddings.

3.2. Bimodal Fusion (Two-Way)

Augmenting text with a visual image further improves performance, as shown in Table 2. Relative to text alone, on the unseen test set, SVC gains primarily through a precision increase while holding recall roughly constant, yielding an accuracy of 0.8591 and an F1 of 0.8484. This pattern is consistent with 5-fold cross-validation results and the notion that the image acts as a stabilizer that reduces false positives without suppressing true positives. Both LR and RF show smaller but consistent improvements, with LR having an F1 of 0.7878 and RF with an F1 of 0.7812, suggesting that even the image descriptor can provide complementary context to already informative transcripts.
When combining text with audio, the performance (Table 3) is comparable to text alone for SVC, with a modest precision–recall trade-off and an overall F1 of 0.8307 on the unseen test set. LR and RF match their text + image scores, with F1 values of approximately 0.79 and 0.78, respectively, indicating that wav2vec2 embedding features add limited incremental value when rich lexical–semantic information from transcripts is already available. A similar observation can be made for the 5-fold CV results. Overall, these results suggest that, in picture-description tasks, text remains the dominant driver, while audio provides modest refinement.
When text is unavailable, adding the image to audio noticeably strengthens performance across models (Table 4). RF improves from an audio-only F1 of 0.6769 to 0.7462, and SVC/LR rise to an F1 of around 0.74 on the unseen test set. The gains reflect a reduction in variance and more consistent decisions across classifiers, suggesting that the image as a visual prompt provides anchoring context for prosodic evidence. When it comes to the 5-fold CV performance, we observe a consistent mirror to the unseen test set, suggesting training stability. While absolute performance remains below text-involving settings, the combination narrows the gap and highlights the utility of the image when linguistic data is absent.

3.3. Multimodal Fusion (Three-Way)

In Table 5, we report the performance of the trimodal fusion model. For both 5-fold CV and the unseen test set, the full text + audio + image model delivers the best overall performance compared to the unimodal and bimodal counterparts. On the unseen test set, SVC achieves an accuracy of 0.8732 and an F1 of 0.8571, improving on both the strongest unimodal model (text-only SVC, F1 of 0.8358) and the best pair (text + image SVC, F1 of 0.8484). The precision of 0.9642, the highest observed across all settings, suggests the fused representation is especially conservative on positives while retaining competitive recall. RF also benefits, reaching an F1 of 0.8125, and LR achieves its top F1 (0.8055) with a recall of 0.8285, tied for the highest recall observed across all settings (equal to text-only LR).

3.4. Ablation Analysis for Interpretability

Figure 2 summarizes overall contributions of each modality to the final three-way fusion model, reported as the change in true-class probability when that modality is removed, with 95% bootstrap confidence intervals. We can see that text is the dominant driver, while audio provides a smaller but positive gain, and the image contributes only marginally on average. These patterns indicate that language features account for most of the discriminative signal on this task, while prosodic/voice cues add complementary information, and the image has limited incremental value overall. Quantitative results from the figure, including percentage contributions, are reported in Table S1.
Our default trimodal model uses a learned fusion query (randomly initialized and optimized during training) to attend over the projected text, audio, and image representations. To test whether fusion behavior depends on the source of the query, we replaced the learned query with modality-derived queries and evaluated on the held-out test set (Table 6). Relative to the learned-query baseline (our best-performing configuration), modality-derived queries produce comparable but slightly different operating points. The image-derived query yields the strongest point performance among the modality-query variants for SVC, with an accuracy of 0.8451 and an F1 of 0.8358. The audio-derived query yields the lowest F1 (0.7541), suggesting that initiating attention from the audio representation may under-emphasize complementary linguistic evidence in this dataset. Overall, these results support our design choice of a learned fusion query, which provides the most consistent performance and serves as the default configuration.

3.5. Benchmarking Against ADReSSo Challenge Baselines

To contextualize performance against the ADReSSo 2021 benchmark, we compared our results with several published challenge systems on the unseen test set (Table 7). Our trimodal fusion achieves the strongest overall F1 (0.8571) and accuracy (0.8732), exceeding the ADReSSo baseline reported by Luz et al. (F1 = 0.7888) and other representative challenge submissions that also utilized both fusion-based and unimodal approaches. Notably, we observe a consistent progression from unimodal to bimodal to trimodal settings, with fusion improving F1 from 0.8358 (unimodal) to 0.8484 (bimodal) and 0.8571 (trimodal). These results suggest that cross-attentional fusion of pretrained embeddings provides measurable gains over both single-modality models and prior ADReSSo challenge baselines under this evaluation setting.

4. Discussion

This study shows that cross-attention is a powerful mechanism for fusing text, audio, and image into a representation that consistently improves AD prediction over both the unimodal and bimodal fusions on the ADReSSo 2021 dataset. From our analysis, the text modality remains the dominant source of signal, consistent with extensive evidence that lexical–semantic and discourse changes emerge early in AD speech [8,10]. Including the Cookie Theft image as a CLIP descriptor reliably boosts precision and, consequently, overall F1 when paired with text. When the image modality is fused with audio, the same image prior improves the accuracy of the audio-only system, reinforcing an otherwise weaker modality. The small image effect is expected because the Cookie Theft picture is a fixed stimulus shared across participants and therefore contains no subject-specific diagnostic information. In our framework, the CLIP embedding acts mainly as contextual grounding that can stabilize cross-modal alignment (often improving precision) rather than as an independent biomarker. The trimodal configuration delivers the strongest overall performance, indicating that the method of combining pretrained embeddings is important.
The classifier behaviors match expectations: with text included, SVC generally yields the highest F1. LR, by contrast, often achieves the best recall, while RF remains relatively strong when audio is included—likely due to its robustness to noisy features [41]. Together with the precision gains from adding the image, these patterns suggest that different operational goals (e.g., screening vs. ruling-in) can be met by choosing appropriate downstream classifiers rather than changing the fusion module.
To assess how each modality shapes decisions, we performed an ablation study, measuring the change in true-class probability when each modality is removed and reporting 95% bootstrap confidence intervals [38]. Three results emerge. First, removing text causes the largest average drop, indicating that language features carry the strongest discriminative signal for this task, consistent with prior evidence of robust lexical–semantic markers in AD speech [8,11]. Second, audio contributes a smaller but meaningful additional gain, aligning with findings that prosodic and temporal disruptions are characteristic of AD [11,42]. Third, the image prior yields only a modest average benefit, which is reasonable given the fixed visual stimulus across subjects; it serves more as stabilizing context than a subject-specific cue. Clinically, this pattern is consistent with AD-related language changes such as reduced information content and lexical–semantic impairment, which are especially salient in the Cookie Theft picture-description task, where content is constrained by a shared scene. The audio modality likely adds complementary cues related to timing and rhythm reported in AD; however, pooled wav2vec2 embeddings are not a comprehensive prosody model and may under-represent clinically important suprasegmental features, which may partly explain the smaller audio contribution [43].
The query-modality ablations show that a randomly initialized, learned query yields the most consistent performance. This likely reflects its modality-agnostic nature, which avoids the representational biases of any single modality and allows attention to adapt to the most informative cross-modal cues. In contrast, constraining the query to a specific modality biases attention allocation and can shift the precision–recall trade-off, as evidenced by reduced recall when using audio as the query. Overall, the query functions as a flexible integrator of textual, auditory, and visual information aligned with task demands.
When compared with representative ADReSSo 2021 challenge systems, our unimodal and multimodal variants achieve competitive performance, with the trimodal fusion producing the strongest overall F1 and accuracy among the listed approaches. These gains are notable given that our pipeline relies on frozen pretrained encoders and a lightweight fusion module rather than task-specific end-to-end training. At the same time, direct ranking against challenge leaderboards should be interpreted cautiously because published systems may differ in preprocessing, feature construction, and tuning protocols. Nevertheless, the consistent improvements from unimodal to bimodal to trimodal settings suggest that cross-attentional fusion of pretrained embeddings is an effective and practical strategy for this benchmark.
Several limitations merit discussion. First, the ADReSSo 2021 dataset is small and only represents data from English speakers, so findings may not generalize to other tasks, languages, or recording conditions [9]. Second, the image modality is a singleton: appropriate for the study design but unable to capture fine-grained, per-sample visual variation. Third, ASR errors and transcription inconsistency introduce noise—even with Whisper’s robustness—potentially weakening text signals [20]. Since fillers, repetitions, and hesitations are diagnostically salient in Alzheimer’s disease, ASR transcription uncertainty represents a critical confound, warranting future investigation into robustness against ASR errors.
These limitations suggest several avenues for future work: external validation on additional cohorts and multilingual extensions to test generalizability; varied elicitation prompts and multiple images to probe whether richer visual context outperforms a single picture prior; longitudinal studies to assess sensitivity to change and clinical progression; and modeling work on robustness to missing modalities and domain shift. To improve robustness to transcription errors, future work can apply noisy-text augmentation that mimics ASR-like corruptions such as deletions and insertions while preserving clinically meaningful disfluencies such as fillers, repetitions, and hesitations. In addition, error-aware preprocessing can retain uncertainty markers and avoid aggressive normalization that removes diagnostically relevant phenomena. When available, ASR confidence scores could be used to down-weight unreliable segments or to combine text and acoustic evidence in a confidence-aware manner.
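The proposed noisy-text augmentation could be sketched as below. The corruption probabilities and the filler list are illustrative assumptions, not tuned values from this study.

```python
import random

FILLERS = {"uh", "um", "er", "mhm"}  # disfluencies preserved verbatim

def asr_corrupt(tokens, p_del=0.05, p_ins=0.03,
                vocab=("the", "a", "is", "on"), seed=0):
    """Noisy-text augmentation mimicking ASR errors (random deletions and
    insertions) while never deleting clinically salient fillers.
    Probabilities and the filler list are illustrative choices."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok.lower() not in FILLERS and rng.random() < p_del:
            continue                       # simulate an ASR deletion
        out.append(tok)
        if rng.random() < p_ins:
            out.append(rng.choice(vocab))  # simulate a spurious insertion
    return out

sentence = "uh the boy is um taking the cookie".split()
noisy = asr_corrupt(sentence)  # fillers "uh" and "um" always survive
```

Training the text encoder on such corrupted transcripts, while exempting disfluencies from corruption, targets exactly the confound noted above: robustness to ASR noise without erasing diagnostically relevant phenomena.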
As summarized in Table S2, the proposed fusion module is lightweight relative to the underlying pretrained encoders because all encoders are kept frozen and used only for offline embedding extraction; training is restricted to the projection and cross-attention fusion layers. Building on this design, additional optimizations could further improve practical applicability. For example, lightweight adaptation techniques such as modality-specific adapters or LoRA can be applied to the fusion projections and attention layers to reduce trainable parameters and speed up training without full fine-tuning. In addition, knowledge distillation could be used to train a smaller student model to approximate either the fused representation or the final classifier outputs, enabling lower-latency inference. Finally, compression strategies such as reducing projection dimensionality, quantizing the fusion module, or replacing heavier downstream classifiers with a compact linear head may further reduce runtime and memory while preserving performance.
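The LoRA idea applied to a fusion projection can be sketched as follows; the rank, dimensions, and initialization are illustrative assumptions rather than values from this work.

```python
import numpy as np

class LoRAProjection:
    """Low-rank adaptation of a frozen fusion projection:
    y = x @ (W + A @ B), with W frozen and only A, B trainable.
    Dimensions and rank are illustrative, not values from the paper."""
    def __init__(self, W, r=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                            # frozen projection
        self.A = rng.standard_normal((W.shape[0], r)) * 0.01  # small random init
        self.B = np.zeros((r, W.shape[1]))                    # zero init: no change at start

    def __call__(self, x):
        return x @ self.W + (x @ self.A) @ self.B  # frozen path + low-rank update

    def trainable_params(self):
        return self.A.size + self.B.size

W = np.random.default_rng(1).standard_normal((768, 768))
proj = LoRAProjection(W, r=8)
x = np.ones((1, 768))
y = proj(x)  # equals x @ W at initialization because B starts at zero
# Trainable parameters: 2 * 768 * 8 = 12,288 vs. 589,824 in the full projection
```

At rank 8 the adapter trains roughly 2% of the parameters of the full 768-by-768 projection, which is the kind of reduction that makes fine-tuning the fusion layers cheap relative to the frozen encoders.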
In summary, our results suggest a promising route for multimodal AD screening that uses pretrained encoders and cross-attentional fusion of text, audio, and image, a strategy that outperforms both unimodal and bimodal baselines.

5. Conclusions

This work shows that cross-attention fusion of text, audio, and an image improves Alzheimer’s detection on ADReSSo, outperforming unimodal and bimodal baselines. By using pretrained encoders (ModernBERT, wav2vec 2.0, CLIP) with cross-attention fusion, we obtain a fused representation that integrates seamlessly with standard classifiers (SVC/LR/RF). An ablation analysis of modality contributions provides practical interpretability, pinpointing where gains originate. More broadly, the results highlight that fusion design materially influences AD detection, even with fixed upstream encoders. For picture-description screening, multimodal cross-attention offers strong performance and produces embeddings suitable for downstream validation and prospective studies.
Overall, we present a fusion approach that unifies pretrained text, audio, and image representations into a single vector, enabling screening that exploits complementary lexical–semantic, acoustic–prosodic, and scene-context cues.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jdad3010012/s1, Table S1. Drop-one modality ablation quantified; Table S2. Computational resources for the proposed fusion model.

Author Contributions

Conceptualization: F.A. and H.L.; Methodology: F.A. and H.L.; Formal analysis and investigation: F.A.; Writing—Original Draft: F.A. and H.L.; Writing—Review and Editing: F.A. and H.L.; Supervision: H.L. All authors have read and agreed to the published version of the manuscript.

Funding

Research reported in this publication was supported by the National Institute on Aging of the National Institutes of Health under Award Number P30AG073105. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the data are available at https://dementia.talkbank.org (accessed on 20 January 2023).

Acknowledgments

We thank DementiaBank for making the ADReSSo Challenge data available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. McKhann, G.M.; Knopman, D.S.; Chertkow, H.; Hyman, B.T.; Jack, C.R.; Kawas, C.H.; Klunk, W.E.; Koroshetz, W.J.; Manly, J.J.; Mayeux, R.; et al. The Diagnosis of Dementia Due to Alzheimer’s Disease: Recommendations from the National Institute on Aging-Alzheimer’s Association Workgroups on Diagnostic Guidelines for Alzheimer’s Disease. Alzheimer’s Dement. 2011, 7, 263–269. [Google Scholar] [CrossRef]
  2. Taler, V.; Phillips, N.A. Language Performance in Alzheimer’s Disease and Mild Cognitive Impairment: A Comparative Review. J. Clin. Exp. Neuropsychol. 2008, 30, 501–556. [Google Scholar] [CrossRef] [PubMed]
  3. Alzheimer’s Association. 2023 Alzheimer’s Disease Facts and Figures. Alzheimer’s Dement. 2023, 19, 1598–1695. [Google Scholar]
  4. Livingston, G.; Huntley, J.; Sommerlad, A.; Ames, D.; Ballard, C.; Banerjee, S.; Brayne, C.; Burns, A.; Cohen-Mansfield, J.; Cooper, C.; et al. Dementia Prevention, Intervention, and Care: 2020 Report of the Lancet Commission. Lancet 2020, 396, 413–446. [Google Scholar] [CrossRef] [PubMed]
  5. Folstein, M.F.; Folstein, S.E.; McHugh, P.R. “Mini-Mental State”: A Practical Method for Grading the Cognitive State of Patients for the Clinician. J. Psychiatr. Res. 1975, 12, 189–198. [Google Scholar] [CrossRef]
  6. Jack, C.R., Jr.; Andrews, J.S.; Beach, T.G.; Buracchio, T.; Dunn, B.; Graf, A.; Hansson, O.; Ho, C.; Jagust, W.; McDade, E.; et al. Revised Criteria for Diagnosis and Staging of Alzheimer’s Disease: Alzheimer’s Association Workgroup. Alzheimer’s Dement. 2024, 20, 5143–5169. [Google Scholar] [CrossRef]
  7. Agbavor, F.; Liang, H. Predicting Dementia from Spontaneous Speech Using Large Language Models. PLoS Digit. Health 2022, 1, e0000168. [Google Scholar] [CrossRef]
  8. Eyigoz, E.; Mathur, S.; Santamaria, M.; Cecchi, G.; Naylor, M. Linguistic Markers Predict Onset of Alzheimer’s Disease. EClinicalMedicine 2020, 28, 100583. [Google Scholar]
  9. Luz, S.; Haider, F.; de la Fuente, S.; Fromm, D.; MacWhinney, B. Detecting Cognitive Decline Using Speech Only: The ADReSSo Challenge. arXiv 2021, arXiv:2104.09356. [Google Scholar] [CrossRef]
  10. de la Fuente Garcia, S.; Ritchie, C.W.; Luz, S. Artificial Intelligence, Speech, and Language Processing Approaches to Monitoring Alzheimer’s Disease: A Systematic Review. J. Alzheimer’s Dis. 2020, 78, 1547–1574. [Google Scholar] [CrossRef]
  11. Fraser, K.C.; Meltzer, J.A.; Rudzicz, F. Linguistic Features Identify Alzheimer’s Disease in Narrative Speech. J. Alzheimer’s Dis. 2016, 49, 407–422. [Google Scholar] [CrossRef] [PubMed]
  12. Agbavor, F.; Liang, H. Multilingual Prediction of Cognitive Impairment with Large Language Models and Speech Analysis. Brain Sci. 2024, 14, 1292. [Google Scholar] [CrossRef] [PubMed]
  13. Balagopalan, A.; Eyre, B.; Rudzicz, F.; Novikova, J. To BERT or Not To BERT: Comparing Speech and Language-Based Approaches for Alzheimer’s Disease Detection. arXiv 2020, arXiv:2008.01551. [Google Scholar]
  14. Ilias, L.; Askounis, D. Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts. Front. Aging Neurosci. 2022, 14, 830943. [Google Scholar] [CrossRef]
  15. Ksibi, A.; Walha, A.; Zakariah, M.; Ayadi, M.; Alshalali, T.; Almujally, N.A. Multimodal Siamese Networks for Dementia Detection from Speech in Women. Sci. Rep. 2025, 15, 30938. [Google Scholar] [CrossRef]
  16. Wang, N.; Cao, Y.; Hao, S.; Shao, Z.; Subbalakshmi, K.P. Modular Multi-Modal Attention Network for Alzheimer’s Disease Detection Using Patient Audio and Language Data. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; ISCA: Brno, Czechia, 2021; pp. 3835–3839. [Google Scholar]
  17. Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  18. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  19. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning; ML Research Press: Cambridge, MA, USA, 2021. [Google Scholar]
  20. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning; ML Research Press: Cambridge, MA, USA, 2022. [Google Scholar]
  21. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 12449–12460. [Google Scholar]
  22. Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated Multimodal Units for Information Fusion. arXiv 2017, arXiv:1702.01992. [Google Scholar] [CrossRef]
  23. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal Fusion for Multimedia Analysis: A Survey. Multimed. Syst. 2010, 16, 345–379. [Google Scholar] [CrossRef]
  24. Tsai, Y.-H.H.; Bai, S.; Yamada, M.; Morency, L.-P.; Salakhutdinov, R. Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel. arXiv 2019, arXiv:1908.11775. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  27. Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 5100–5111. [Google Scholar]
  28. Pappagari, R.; Cho, J.; Moro-Velázquez, L.; Dehak, N. Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer’s Disease and Assess Its Severity. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; ISCA: Valencia, Spain, 2020; pp. 2177–2181. [Google Scholar]
  29. Becker, J.T.; Boller, F.; Lopez, O.L.; Saxton, J.; McGonigle, K.L. The Natural History of Alzheimer’s Disease: Description of Study Cohort and Accuracy of Diagnosis. Arch. Neurol. 1994, 51, 585–594. [Google Scholar] [CrossRef] [PubMed]
  30. Goodglass, H.; Kaplan, E.; Weintraub, S. BDAE: The Boston Diagnostic Aphasia Examination, 3rd ed.; Lippincott Williams & Wilkins: Philadelphia, PA, USA, 2001. [Google Scholar]
  31. Warner, B.; Chaffin, A.; Clavié, B.; Weller, O.; Hallström, O.; Taghadouini, S.; Gallagher, A.; Biswas, R.; Ladhak, F.; Aarsen, T.; et al. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025. [Google Scholar]
  32. Chowdhury, N.; Wang, F.; Shenoy, S.; Kiela, D.; Schwettmann, S.; Thrush, T. Nearest Neighbor Normalization Improves Multimodal Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y., Bansal, M., Chen, Y.-N., Eds.; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 22571–22582. [Google Scholar]
  33. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  34. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  35. Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Society. Ser. B (Methodol.) 1958, 20, 215–242. [Google Scholar] [CrossRef]
  36. Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; The MIT Press: Cambridge, MA, USA, 2001. [Google Scholar]
  37. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  38. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar]
  39. Balagopalan, A.; Novikova, J. Comparing Acoustic-Based Approaches for Alzheimer’s Disease Detection. In Proceedings of the Interspeech 2021; ISCA: Valencia, Spain, 2021; pp. 3800–3804. [Google Scholar]
  40. Pan, Y.; Mirheidari, B.; Harris, J.; Thompson, J.; Jones, M.; Snowden, J.; Blackburn, D.; Christensen, H. Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer’s Dementia Detection Through Spontaneous Speech. In Proceedings of the INTERSPEECH 2021, Brno, Czechia, 30 August–3 September 2021; ISCA: Valencia, Spain, 2021; p. 3814. [Google Scholar]
  41. Biau, G. Analysis of a Random Forests Model. J. Mach. Learn. Res. 2012, 13, 1063–1095. [Google Scholar]
  42. Agbavor, F.; Liang, H. Artificial Intelligence-Enabled End-To-End Detection and Assessment of Alzheimer’s Disease Using Voice. Brain Sci. 2022, 13, 28. [Google Scholar] [CrossRef]
  43. Nylén, F. An Acoustic Model of Speech Dysprosody in Patients with Parkinson’s Disease. Front. Hum. Neurosci. 2025, 19, 1566274. [Google Scholar] [CrossRef]
Figure 1. Multimodal cross-attention aligns 768-D embeddings from text (ModernBERT), audio (wav2vec 2.0), and the Cookie Theft image (CLIP ViT-L/14). After linear projection to a shared space, a query (Q) attends to the modality set, Key (K) = Value (V) = [Text, Audio, Image], to yield a fused embedding, which is evaluated with downstream classifiers.
Figure 2. Overall sensitivity analysis of modality contributions with 95% CIs. Mean change in true-class probability when each modality is removed at inference (baseline minus drop-one) for Text, Image, and Audio in the trimodal fusion model. Error bars show 95% bootstrap confidence intervals (5000 resamples).
Table 1. Unimodal baselines on the ADReSSo 2021 unseen test set and 5-fold cross-validation results (shaded). Performance of text-only (ModernBERT embeddings) and audio-only (wav2vec 2.0 embeddings) representations with three standard classifiers (SVC, LR, RF).
| Split | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Test | SVC (text) | 0.8450 | 0.8750 | 0.8000 | 0.8358 |
| Test | SVC (audio) | 0.6619 | 0.6774 | 0.6000 | 0.6363 |
| Test | LR (text) | 0.7887 | 0.7631 | 0.8285 | 0.7945 |
| Test | LR (audio) | 0.6478 | 0.6923 | 0.5142 | 0.5901 |
| Test | RF (text) | 0.8028 | 0.8888 | 0.6857 | 0.7741 |
| Test | RF (audio) | 0.7042 | 0.7333 | 0.6285 | 0.6769 |
| 5-fold CV | SVC (text) | 0.8000 (0.0991) | 0.8296 (0.1033) | 0.7916 (0.1556) | 0.8012 (0.1019) |
| 5-fold CV | SVC (audio) | 0.7290 (0.0628) | 0.7336 (0.0768) | 0.7705 (0.0324) | 0.7504 (0.0520) |
| 5-fold CV | LR (text) | 0.7761 (0.1016) | 0.7940 (0.0797) | 0.7805 (0.1522) | 0.7805 (0.1010) |
| 5-fold CV | LR (audio) | 0.6383 (0.0223) | 0.6721 (0.0318) | 0.6091 (0.0215) | 0.6384 (0.0163) |
| 5-fold CV | RF (text) | 0.7698 (0.0948) | 0.8064 (0.0857) | 0.7472 (0.1686) | 0.7656 (0.1050) |
| 5-fold CV | RF (audio) | 0.7103 (0.0579) | 0.7596 (0.1070) | 0.6895 (0.0596) | 0.7154 (0.0397) |
Table 2. Text + image fusion on the ADReSSo 2021 unseen test set and 5-fold cross-validation results (shaded). Accuracy, Precision, Recall, and F1 are reported for SVC, LR, and RF.
| Split | Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Test | SVC | 0.8591 | 0.9032 | 0.8000 | 0.8484 |
| Test | LR | 0.8028 | 0.8387 | 0.7428 | 0.7878 |
| Test | RF | 0.8028 | 0.8620 | 0.7142 | 0.7812 |
| 5-fold CV | SVC | 0.8314 (0.0266) | 0.8625 (0.0632) | 0.8183 (0.0971) | 0.8342 (0.0318) |
| 5-fold CV | LR | 0.8253 (0.0330) | 0.8546 (0.0758) | 0.8183 (0.0971) | 0.8296 (0.0335) |
| 5-fold CV | RF | 0.8012 (0.0456) | 0.8230 (0.0521) | 0.7941 (0.0611) | 0.8070 (0.0445) |
Table 3. Text + audio fusion on the ADReSSo 2021 unseen test set and 5-fold cross-validation results (shaded). Accuracy, Precision, Recall, and F1 are reported for SVC, LR, and RF.
| Split | Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Test | SVC | 0.8450 | 0.9000 | 0.7714 | 0.8307 |
| Test | LR | 0.8028 | 0.8387 | 0.7428 | 0.7878 |
| Test | RF | 0.8028 | 0.8620 | 0.7142 | 0.7812 |
| 5-fold CV | SVC | 0.8494 (0.0372) | 0.8484 (0.0637) | 0.8745 (0.0725) | 0.8582 (0.0368) |
| 5-fold CV | LR | 0.8554 (0.0252) | 0.8576 (0.0535) | 0.8745 (0.0725) | 0.8629 (0.0279) |
| 5-fold CV | RF | 0.8135 (0.0606) | 0.8414 (0.0552) | 0.7935 (0.0761) | 0.8161 (0.0633) |
Table 4. Audio + image fusion on the ADReSSo 2021 unseen test set and 5-fold cross-validation results (shaded). Accuracy, Precision, Recall, and F1 are reported for SVC, LR, and RF.
| Split | Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Test | SVC | 0.7464 | 0.7428 | 0.7428 | 0.7428 |
| Test | LR | 0.7464 | 0.7575 | 0.7142 | 0.7352 |
| Test | RF | 0.7605 | 0.7812 | 0.7142 | 0.7462 |
| 5-fold CV | SVC | 0.7533 (0.0365) | 0.7523 (0.0778) | 0.8157 (0.0880) | 0.7763 (0.0190) |
| 5-fold CV | LR | 0.7654 (0.0358) | 0.7867 (0.0823) | 0.7817 (0.0961) | 0.7770 (0.0259) |
| 5-fold CV | RF | 0.7529 (0.0659) | 0.7581 (0.0671) | 0.7810 (0.1140) | 0.7659 (0.0752) |
Table 5. Trimodal fusion (text + audio + image) on the ADReSSo 2021 unseen test set and 5-fold cross-validation results (shaded). Accuracy, Precision, Recall, and F1 for SVC, LR, and RF. The SVC trimodal model attains the best overall performance, exceeding unimodal and bimodal baselines.
| Split | Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Test | SVC | 0.8732 | 0.9642 | 0.7714 | 0.8571 |
| Test | LR | 0.8028 | 0.7837 | 0.8285 | 0.8055 |
| Test | RF | 0.8309 | 0.8965 | 0.7428 | 0.8125 |
| 5-fold CV | SVC | 0.8788 (0.0270) | 0.9375 (0.1017) | 0.8333 (0.0691) | 0.8816 (0.0159) |
| 5-fold CV | LR | 0.8529 (0.0179) | 0.8824 (0.0386) | 0.8333 (0.0268) | 0.8536 (0.0130) |
| 5-fold CV | RF | 0.8134 (0.0577) | 0.8495 (0.0896) | 0.7948 (0.0704) | 0.8174 (0.0529) |
Table 6. Comparison of trimodal cross-attention variants in which the query is derived from a specific modality (Text/Image/Audio) against our default setting, which uses a learned fusion query (randomly initialized and optimized during training). Reported metrics are for downstream classifiers (SVC, LR, RF) trained on the resulting fused embeddings on the unseen test set.
| Query | Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Text as query | SVC | 0.8169 | 0.8438 | 0.7714 | 0.8060 |
| Text as query | LR | 0.8310 | 0.8485 | 0.8000 | 0.8235 |
| Text as query | RF | 0.8169 | 0.8667 | 0.7429 | 0.8000 |
| Image as query | SVC | 0.8451 | 0.8750 | 0.8000 | 0.8358 |
| Image as query | LR | 0.8310 | 0.8286 | 0.8286 | 0.8286 |
| Image as query | RF | 0.8028 | 0.8182 | 0.7714 | 0.7941 |
| Audio as query | SVC | 0.7887 | 0.8846 | 0.6571 | 0.7541 |
| Audio as query | LR | 0.8028 | 0.8387 | 0.7429 | 0.7879 |
| Audio as query | RF | 0.8169 | 0.8929 | 0.7143 | 0.7937 |
Table 7. Comparison to other ADReSSo 2021 challenge systems on the unseen test set. Reported metrics for selected published challenge baselines are shown alongside our unimodal and multimodal variants to contextualize performance relative to prior state-of-the-art submissions on the ADReSSo benchmark.
| System | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Luz et al. 2021 (Baseline) [9] | 0.7890 | 0.7780 | **0.8000** | 0.7888 |
| Balagopalan et al. 2021 [39] | 0.6761 | 0.6364 | **0.8000** | 0.7089 |
| Pan et al. 2021 [40] | 0.8028 | 0.8621 | 0.7143 | 0.7813 |
| Unimodal (Ours) | 0.8450 | 0.8750 | **0.8000** | 0.8358 |
| Bimodal Fusion (Ours) | 0.8591 | 0.9032 | **0.8000** | 0.8484 |
| Trimodal Fusion (Ours) | **0.8732** | **0.9642** | 0.7714 | **0.8571** |

Bold indicates best performance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
