A Dual-Stage Multimodal Alignment Approach for Robust Breast Cancer Diagnosis via Visual–Textual Computing

Dogan, Ramazan Ozgur

doi:10.3390/app16125934

by

Ramazan Ozgur Dogan

Department of Artificial Intelligence Engineering, Faculty of Computer and Information Sciences, Trabzon University, Trabzon 61080, Türkiye

Appl. Sci. 2026, 16(12), 5934; https://doi.org/10.3390/app16125934

Submission received: 19 May 2026 / Revised: 9 June 2026 / Accepted: 10 June 2026 / Published: 11 June 2026

Abstract

Manual classification of breast cancer is resource-intensive, slow, and subject to inter-observer variability, motivating automated deep learning solutions. Most current methods rely on unimodal imaging data and struggle with domain generalization (DG) across varied clinical environments. We propose a Dual-Stage Multimodal Alignment approach that integrates breast ultrasound (US) imagery with clinical text reports to improve diagnostic stability. The method proceeds in two stages: (1) Local Correlation Alignment (LCA), which aligns fine-grained visual features with textual embeddings to capture localized lesion attributes, and (2) Global Attention Alignment (GAA), which applies multi-head self-attention to the joint visual–textual sequence to encourage domain-invariant representations. We evaluate the approach on a harmonized, leakage-free repository of 6880 images aggregated from six public US datasets (BUS-CoT, BrEaST, BUS-BRA, BUS-UCLM, BLUI, BUSI) under three protocols: independent benchmarking on BUS-CoT, pooled cross-dataset evaluation, and zero-shot domain generalization on unseen unimodal target domains. On the BUS-CoT benchmark, the 198M-parameter model reaches 0.8177 accuracy and 0.8852 AUC, on par with the 7-billion-parameter Qwen2.5-VL-7B with chain-of-thought reasoning (0.8064 accuracy, 0.8354 AUC) while using roughly 1/35 the parameter count. In the pooled setting, it is competitive with single-domain state-of-the-art methods on individual subsets (e.g., 0.9576 AUC on BUSI, 0.8741 accuracy on BUS-BRA). Under zero-shot transfer without clinical text, per-domain AUC ranges from 0.7360 to 0.8060 across four unseen targets, providing a lower bound under cross-scanner shift. These results indicate that task-specific multimodal alignment can rival large vision-language models in breast US diagnosis at a fraction of the parameter count.

Keywords:

breast cancer; deep learning; domain generalization; dual-stage alignment; multimodal learning; ultrasound imaging

1. Introduction

Breast cancer stands as the primary cause of mortality related to malignant neoplasms in women globally, according to the Global Cancer Observatory (GLOBOCAN) [1,2]. Timely and precise identification of breast malignancies is paramount for optimizing treatment outcomes, enhancing survival probabilities, and diminishing mortality rates. In routine clinical workflows, practitioners utilize a diverse array of imaging modalities, including digital mammography (DM), ultrasound (US), and magnetic resonance imaging (MRI), in conjunction with biopsy for histopathological verification to identify and categorize breast lesions.

DM remains the standard modality for preliminary screening, proficient in detecting microcalcifications and diminutive non-palpable masses. Nevertheless, the sensitivity of mammography is often compromised in dense breast tissue, where tumors may be obscured [3]. US, generating imagery via high-frequency sound wave echoes, functions as a critical adjunctive tool for characterizing lesions within dense breasts and clarifying inconclusive mammographic findings [4]. Apart from its cost-effectiveness, accessibility, and lack of ionizing radiation, US facilitates the differentiation between benign and malignant masses through sonographic features and is notably effective in identifying cystic lesions. Conversely, MRI, utilizing radiofrequency pulses and potent magnetic fields, offers exceptional soft-tissue contrast and high-resolution visualization of intricate breast structures [5]. While it is advocated for high-risk patients and problem-solving scenarios (e.g., resolving ambiguities from other modalities), its routine application is constrained by higher costs and limited availability. Unlike imaging techniques, biopsy remains the gold standard for malignancy confirmation, typically involving percutaneous core needle sampling of suspicious regions for microscopic analysis. Recently, US has solidified its role as a potent complement to DM in both diagnostic and screening protocols, particularly for women with dense breast tissue, due to its safety profile and affordability [6].

The manual classification of breast cancer is a labor-intensive process, susceptible to significant inter- and intra-observer variations. This challenge is exacerbated in dense breasts, where early-stage indicators are subtle and easily missed. To alleviate these challenges, Computer-Aided Diagnosis (CAD) systems have been developed to bolster and partially automate the diagnostic pipeline [7,8], providing more consistent and rapid decision support. Recently, deep learning (DL) methodologies have seen a surge in application for breast cancer classification, with Convolutional Neural Networks (CNNs) and attention mechanisms showing strong capabilities in feature extraction [9,10,11,12,13,14,15,16]. These architectures discern minute morphological nuances in US images, such as differentiating benign from malignant lesions, and have improved diagnostic precision. However, the deployment of DL in breast cancer classification continues to face several impediments. Domain Generalization (DG) remains a formidable obstacle: heterogeneous US imaging hardware, operator-dependent acquisition protocols, and varied patient demographics produce inconsistent image characteristics, and DL models trained on isolated datasets or specific devices frequently fail to generalize to unseen data, limiting their real-world clinical utility. Relative to DM and MRI, DL investigations focusing on US imagery are also less mature [17], partly because US presents lower spatial resolution, reduced signal-to-noise ratio (SNR), and substantial operator variability, and partly because large-scale public benchmarks have been comparatively scarce. 
Finally, the majority of DL-based classification pipelines rely exclusively on visual data and ignore clinical context such as patient history and laboratory findings [18]; although multimodal approaches integrating imaging with clinical reports and demographic data have recently gained traction [19,20], their adoption in US-based diagnosis remains limited despite the potential to combine visual and clinical evidence within a single pipeline.

The decision to focus on the fusion of US imagery with clinical text, rather than on alternative multimodal combinations such as US with DM or US with MRI, is motivated by practical clinical and data-related considerations rather than by an assumption of inherent superiority. US is radiation-free, real-time, and considerably less costly than DM or MRI, which establishes it as a first-line modality for characterizing breast lesions [4,6], particularly in dense breasts [3] and in settings where MRI is rarely available [5]. In addition, the structured morphological descriptors recorded by radiologists during a routine US examination, following the BI-RADS US lexicon (e.g., lesion shape, margin, orientation, and echogenicity) [21,22], already form part of standard reporting; coupling these descriptors with the image introduces a complementary modality at essentially no additional acquisition cost, in contrast to acquiring a second imaging modality. Finally, the public US data ecosystem has expanded considerably in recent years, as reflected by the six public datasets harmonized in this work, whereas datasets that pair US with DM or MRI for the same cohort remain scarce and are seldom released publicly, which constrains reproducible benchmarking of such alternatives. The present work therefore treats the fusion of US and clinical text not as a categorically superior paradigm but as the combination that is currently the most clinically deployable and the most amenable to reproducible evaluation on public data.

This study proposes a Dual-Stage Multimodal Alignment approach for breast cancer classification under cross-domain conditions, addressing limitations of unimodal pipelines and concatenation-based multimodal baselines. The method couples US images with clinical text in two stages: Stage I, Local Correlation Alignment (LCA), refines the cross-modal correlation matrix with a small convolutional module to align fine-grained visual features (e.g., lesion margins) with textual embeddings; Stage II, Global Attention Alignment (GAA), applies multi-head self-attention to the joint visual–textual sequence to model long-range dependencies across modalities. Although attention and correlation mechanisms are individually well established, their direct application to breast US is often complicated by speckle noise and operator-dependent artifacts; cascading local correlation refinement with global semantic modeling is intended to mitigate this gap. To evaluate the approach under varied clinical conditions, we harmonize six public US datasets: BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28]. Experiments on this unified benchmark show that the approach is competitive with established unimodal and multimodal baselines and remains stable under zero-shot cross-domain transfer.

The remainder of this manuscript is structured as follows. Section 2 provides a review of related literature on breast cancer classification, multimodal learning, and DG. Section 3.1 outlines the public US datasets utilized. Section 3.2 elaborates on the proposed Dual-Stage Multimodal Alignment approach, detailing the local and global alignment mechanisms, fusion techniques, and the loss function. Section 4 presents the experimental configuration, evaluation metrics, and a comparative performance analysis against baseline models. Finally, Section 6 summarizes key insights, addresses limitations, and suggests directions for future research.

2. Related Work

Breast cancer classification has been exhaustively researched using DM [29] and MRI [30] modalities. In contrast, machine learning (ML) exploration of US imaging remains comparatively sparse, despite the modality’s widespread clinical adoption due to its cost-efficiency and availability. This lag is largely driven by a paucity of high-quality, annotated datasets necessary for training resilient models. While recent years have witnessed the release of public US datasets such as BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28], these resources frequently exhibit limitations regarding size, variability, and scope, often lacking detailed clinical or demographic annotations. The literature is also heavily saturated with studies focusing on the single BUSI dataset [28]. While this allows for controlled benchmarking, it severely constrains the assessment of model generalization and real-world utility.

Addressing classification performance, various unimodal strategies have been explored. Dep et al. [31] proposed a fuzzy-rank ensemble utilizing multiple CNN architectures for consensus-based decision-making, while Islam [32] introduced an ensemble augmented with a vision transformer. Dealing with uncertainty, Chegini et al. [33] utilized Monte Carlo dropout to create an uncertainty-aware DL model. Multitask learning has also shown promise; Chowdary [34] showed that simultaneous tumor segmentation and classification leverages shared representations to enhance accuracy, an approach furthered by He et al. [35] and Aumente et al. [36]. Investigating feature extraction, Nastase et al. [37] analyzed tissue characteristics using pre-trained networks, and Foleis et al. [38] demonstrated that late-fusion ensembles combining CNN-based and handcrafted features yield robust binary classification. Beyond traditional transfer learning and feature engineering, recent efforts have moved towards automated architecture optimization and specific clinical targeting. For instance, AlZoubi et al. [39] utilized Bayesian optimization to design BONet, tailoring the model for sonographic characteristics and achieving competitive performance on both internal and multi-center external datasets. Beyond these, DL has been applied to specific molecular subtypes, such as predicting triple-negative breast cancer (TNBC) from US images with VGG-based architectures [40].

However, while these unimodal approaches excel at extracting visual features and optimizing architectural efficiency for specific tasks, they inherently lack broader clinical context. The integration of clinical semantics remains an underexplored avenue. Therefore, there is a critical need for frameworks that can harmoniously integrate multi-institutional visual data with clinical text to achieve true, zero-shot diagnostic generalization.

DG is pivotal for guaranteeing consistent model efficacy across heterogeneous, unseen datasets that differ in patient demographics, acquisition protocols, and imaging hardware [41]. Such technical and demographic variability significantly impacts classification reliability, rendering generalization a prerequisite for clinical deployment. However, existing research on DG in breast cancer is limited and predominantly skewed towards DM. Samala et al. [42] assessed the generalization error of transfer learning CNNs in differentiating mammographic masses. Garrucho et al. [43] performed a comprehensive evaluation of DL generalization for mass detection in DM across multi-center datasets. Li et al. [44] proposed a contrastive learning strategy to improve cross-vendor generalization using limited labeled data.

Lately, Large Language Models (LLMs), especially BERT [45] and its biomedical adaptations like ClinicalBERT [46] and BioBERT [47], have become integral to medical data analysis. While BERT is adept at general clinical text, variants trained on biomedical corpora enhance the comprehension of complex terminology and context. Fusing these models with visual networks creates multimodal architectures that synthesize clinical and imaging insights for precise diagnosis [48]. While these studies demonstrate the power of deep architectures for breast cancer classification, our approach diverges by implementing a dynamic Dual-Stage Alignment (LCA and GAA) that moves beyond static feature combination to achieve finer semantic synchronization between localized visual cues and clinical terms. Beyond US, multimodal synthesis has been explored across diverse oncological domains. Yan et al. [49] designed a network merging pathology images with electronic health records to classify breast cancer. Arya et al. [50] introduced a stacked ensemble DL framework integrating clinical data, gene expression, and copy number alterations for prognosis. Similarly, Jadoon et al. [51] utilized a heterogeneous ensemble of genomic and clinical data for effective prediction.

Contemporary multimodal alignment techniques generally rely on simple concatenation [52] or attention-based mechanisms [53,54]. Self-attention, a cornerstone of Transformers [53], is favored for modeling long-range token dependencies. In fusion contexts, attention-based alignment usually calculates cross-modal affinity via dot-products to generate modality-aware embeddings. However, while efficient for global alignment, these methods frequently lack the capacity to refine fine-grained local feature correspondences, which are important for identifying subtle morphological cues in US images. Reliable diagnosis therefore benefits from hybrid alignment strategies that capture both local visual–textual correlations and global semantic context.

3. Materials and Methods

3.1. Datasets

To evaluate the approach under cross-domain conditions, we built a unified repository by merging six public breast US datasets: BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28]. This heterogeneity allows assessment across different scanner manufacturers, imaging protocols, and patient populations. Per-dataset statistics are summarized in Table 1.

BUS-CoT [23]: A recently curated multimodal enhancement of the BUSI benchmark. The original authors provided a high-quality subset of 5163 US images (2694 benign and 2469 malignant), which we use for direct baseline comparisons against state-of-the-art Vision-Language Models (VLMs). Because BUS-CoT is an aggregated meta-dataset, we also created a deduplicated Leakage-Free (LF) subset comprising 3610 images (1645 benign and 1965 malignant) by removing the 1553 BUS-CoT entries whose accompanying BUS-Expert annotations identify their source as BUS-BRA or BUSI; those two datasets are evaluated independently in our pooled experiments, so retaining the BUS-CoT copies would constitute cross-protocol leakage. We verified this metadata-based filter with an image-level perceptual hash (pHash) scan between BUS-CoT and each of the other five datasets (Hamming distance threshold $\leq 2$ ), which flagged only 8 residual near-duplicate pairs out of $5163 \times 3826$ comparisons. The LF subset is used for our harmonized evaluations to avoid cross-domain data leakage. BUS-CoT also provides expert-derived textual clinical descriptors. To prevent label leakage, we kept only six pre-diagnostic morphological features in the text input: Lesion Edge, Lesion Boundary, Calcification Features, Echo Characteristics, Blood Flow Features, and Elastography Features.
BrEaST [24]: Comprising 256 US scans gathered across various medical facilities in Poland from 2019 to 2022, this dataset encompasses 154 benign lesions, 98 malignancies, and 4 normal instances (the 252 lesion images entering our binary protocol are reported in Table 1 after exclusion of the 4 normal cases). Along with BUS-CoT, BrEaST serves as a primary source for our multimodal strategy by offering clinical metadata paired with imagery. To maintain a strictly non-conclusive textual input and prevent label leakage, we utilized eight specific clinical descriptors: Age, Tissue Composition, Symptoms, Lesion Shape, Echogenicity, Calcification Status, Skin Thickening, and Physical Signs. Image acquisition was conducted across a heterogeneous set of diagnostic platforms, including the Hitachi ARIETTA 70 (Hitachi Ltd., Tokyo, Japan), Esaote 6150 (Esaote S.p.A., Genoa, Italy), Samsung RS85 (Samsung Medison Co., Ltd., Hongcheon-Gun, Korea), and Philips Affiniti 70G (Royal Philips, Amsterdam, The Netherlands), thereby introducing significant hardware variability to the dataset.
BUS-BRA [25]: Serving as the most extensive component of our repository, BUS-BRA features 1875 images derived from 1064 patients at the National Institute of Cancer in Brazil. At the patient level, the cohort comprises 722 individuals with benign masses and 342 with malignant masses, all biopsy-verified, which translates to 1268 benign and 607 malignant images at the image level. While this dataset originally includes structured BI-RADS evaluations, we strictly excluded them from the textual modality to prevent any potential label leakage, thereby treating BUS-BRA as an image-only (unimodal) dataset in our evaluations. This setup allows us to rigorously test the model’s “Robustness to Missing Modalities,” evaluating whether multimodal alignment learned from other descriptive sources can enhance visual feature extraction when clinical text is absent during inference. Data acquisition involved four distinct US devices, guaranteeing substantial intra-dataset variation.
BUS-UCLM [26]: Gathered from 2022 to 2023, this collection consists of 683 images obtained from 38 subjects (174 benign, 90 malignant, 419 normal) using a Siemens ACUSON S2000 machine (Siemens Healthineers, Forchheim, Germany). After excluding the 419 normal cases, 264 lesion images are used for our binary protocol (Table 1). The single-vendor acquisition makes BUS-UCLM useful for testing model behaviour on a controlled, high-end scanner.
BLUI [27]: Offering 232 lesions validated through histopathology (split into 123 malignant and 109 benign), this dataset was imaged via an AirPlorer Ultimate scanner equipped with a 5–18 MHz linear transducer. It is especially valuable for gauging the model’s flexibility regarding high-frequency imaging probes.
BUSI [28]: Recognized as a standard benchmark in the field, BUSI comprises 780 images (categorized into 437 benign, 210 malignant, and 133 normal) acquired at Baheya Hospital in 2018. After exclusion of the 133 normal cases, the 647 lesion images reported in Table 1 are used for our binary protocol. It acts as a foundational dataset frequently cited in research, facilitating direct performance comparisons against leading state-of-the-art techniques.
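
The near-duplicate scan described for BUS-CoT can be sketched as a pairwise Hamming-distance check over 64-bit perceptual hashes. This is a minimal illustration rather than the authors' pipeline: the hash values, identifiers, and the function names `hamming_distance` and `find_near_duplicates` are hypothetical, and the hashes are assumed to have been computed beforehand by some pHash implementation. Only the threshold ($\leq 2$) and the comparison count come from the text.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")


def find_near_duplicates(source_hashes, other_hashes, threshold=2):
    """Return (source_id, other_id) pairs whose hashes differ by <= threshold bits."""
    pairs = []
    for src_id, src_hash in source_hashes.items():
        for oth_id, oth_hash in other_hashes.items():
            if hamming_distance(src_hash, oth_hash) <= threshold:
                pairs.append((src_id, oth_id))
    return pairs


# Toy 64-bit hashes (hypothetical values, not taken from any dataset).
bus_cot = {"cot_001": 0xF0F0F0F0F0F0F0F0, "cot_002": 0x123456789ABCDEF0}
busi = {"busi_017": 0xF0F0F0F0F0F0F0F1, "busi_042": 0x0FEDCBA987654321}

flagged = find_near_duplicates(bus_cot, busi)

# Size of the full scan reported in the text: BUS-CoT images times the
# images of the other five datasets (256 + 1875 + 683 + 232 + 780 = 3826).
comparisons = 5163 * (256 + 1875 + 683 + 232 + 780)
```

In this toy case only the first BUS-CoT entry is flagged, because its hash differs from `busi_017` by a single bit; every other pair differs by far more than two bits.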

Not all public US datasets provide clinical text, and our design treats this heterogeneity as an explicit part of the methodology rather than as an obstacle. The two text-bearing datasets (BUS-CoT and BrEaST) supply the paired image and clinical-text data used to train the Dual-Stage Alignment, whereas the four datasets without accompanying text (BUS-BRA, BUS-UCLM, BLUI, and BUSI) are processed through the same model with the textual modality replaced by a fixed placeholder and serve as unseen zero-shot targets under Protocol III. The absence of clinical text in part of the public ecosystem is thus accommodated by the missing-modality design itself, without requiring additional data collection.

Descriptor provenance and example inputs. The clinical text used in this study consists of the non-conclusive sonographic descriptors recorded by the radiologist at the time of the US examination, before any biopsy or pathological confirmation. Final BI-RADS assessment categories and any conclusive or pathology-derived statements were excluded, so the retained descriptors represent pre-diagnostic, observation-time information rather than annotations made after image interpretation or pathology. For BUS-CoT, the descriptors are dominated by standard B-mode morphological features that are populated for essentially all cases (lesion edge, lesion boundary, calcification features, and echo characteristics, each present in approximately 100% of records), whereas the supplementary BloodFlowFeatures and ElastographyFeatures fields are populated for only about 6% and below 1% of cases, respectively, because they are reported only when a dedicated Doppler or elastography acquisition was performed. These two supplementary fields are therefore not a systematic model input and cannot drive predictions across the dataset. Representative anonymized inputs are:

BUS-CoT (benign): “LesionEdge: Regular, LesionBoundary: BoundaryClear, LesionCalcificationFeatures: NoCalcification, EchoCharacteristics: SlightlyLowEcho.”
BUS-CoT (malignant): “LesionEdge: PartiallyRegular, LesionBoundary: BoundaryFairlyClear, LesionCalcificationFeatures: CoarseCalcification, EchoCharacteristics: LowEcho.”
BrEaST: “Age: 50s, Composition: heterogeneous (predominantly fat), Symptoms: family history of breast/ovarian cancer, Shape: irregular, Echogenicity: heterogeneous, Calcifications: no, Skin thickening: yes, Physical signs: breast scar.”
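
One way such inputs could be assembled is to whitelist the pre-diagnostic fields and serialize them as `Field: Value` pairs, falling back to a fixed placeholder when no descriptor is available. The sketch below is an assumption-laden illustration: the placeholder string, the `BIRADS` key, and the function name `build_text_input` are hypothetical, and the source does not state how the text strings were actually constructed.

```python
# Pre-diagnostic BUS-CoT fields, using the field names that appear in the examples above.
BUS_COT_FIELDS = [
    "LesionEdge",
    "LesionBoundary",
    "LesionCalcificationFeatures",
    "EchoCharacteristics",
    "BloodFlowFeatures",
    "ElastographyFeatures",
]

# Hypothetical placeholder for image-only datasets; the actual string is not given in the source.
PLACEHOLDER = "No clinical text available."


def build_text_input(record, allowed_fields):
    """Serialize whitelisted, non-empty descriptors; ignore everything else."""
    parts = [
        f"{field}: {record[field]}"
        for field in allowed_fields
        if record.get(field)  # skips missing or empty fields
    ]
    if not parts:
        return PLACEHOLDER
    return ", ".join(parts) + "."


record = {
    "LesionEdge": "Regular",
    "LesionBoundary": "BoundaryClear",
    "LesionCalcificationFeatures": "NoCalcification",
    "EchoCharacteristics": "SlightlyLowEcho",
    "BloodFlowFeatures": "",   # not acquired for this case
    "BIRADS": "3",             # hypothetical conclusive field, never whitelisted
}

text_with_descriptors = build_text_input(record, BUS_COT_FIELDS)
text_image_only = build_text_input({}, BUS_COT_FIELDS)
```

Because the whitelist is applied before serialization, a conclusive field such as a BI-RADS category cannot reach the text encoder even if it is present in the raw record.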

We acknowledge that such descriptors inherently encode radiologist interpretation, and that some morphological terms (for example, an irregular margin or coarse calcification) are correlated with malignancy. This interpretation, however, is routinely available in real clinical workflows before biopsy, so the model is trained on realistic observation-time information rather than on the post-pathology label. Consistent with this view, the text-only ClinicalBERT baseline already reaches 0.8304 AUC, confirming that the descriptors carry useful diagnostic prior information; the full Dual-Stage model improves this to 0.8852 AUC, indicating genuine visual–textual synergy beyond the text prior rather than mere reliance on label-correlated text.

To evaluate the proposed approach under a rigorous DG protocol and ensure mathematical stability across diverse evaluation metrics (e.g., AUC), we standardized the annotations from all harmonized datasets into a unified binary taxonomy: Benign versus Malignant. Ground truth labels provided by the respective datasets (e.g., histopathologically confirmed ‘Pathology’ classifications) were utilized directly.

A minority of datasets (BrEaST, BUS-UCLM, BUSI) contain instances of healthy breast tissue (‘Normal’ class, totaling 556 cases), whereas BUS-CoT, BUS-BRA, and BLUI consist only of confirmed lesions. Including a ‘Normal’ class in cross-domain evaluations where the target domain lacks such samples makes multi-class ROC-AUC and Precision-Recall undefined. In clinical CAD workflows, the diagnostic task is also typically lesion characterization rather than screening of healthy tissue. We therefore excluded ‘Normal’ cases from training and testing. The resulting binary task provides direct comparability with existing unimodal baselines.
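
The effect of excluding the 'Normal' class can be checked with a short tally. The per-dataset counts below are the ones quoted in Section 3.1 (with the leakage-free BUS-CoT subset); the label strings and the helper `to_binary_label` are illustrative assumptions, since each dataset uses its own annotation format.

```python
# Image counts per dataset as quoted in Section 3.1.
COUNTS = {
    "BUS-CoT (LF)": {"benign": 1645, "malignant": 1965, "normal": 0},
    "BrEaST":       {"benign": 154,  "malignant": 98,   "normal": 4},
    "BUS-BRA":      {"benign": 1268, "malignant": 607,  "normal": 0},
    "BUS-UCLM":     {"benign": 174,  "malignant": 90,   "normal": 419},
    "BLUI":         {"benign": 109,  "malignant": 123,  "normal": 0},
    "BUSI":         {"benign": 437,  "malignant": 210,  "normal": 133},
}


def to_binary_label(label):
    """Map a dataset label to 0 (benign) or 1 (malignant); None means 'excluded'."""
    mapping = {"benign": 0, "malignant": 1}
    return mapping.get(label.strip().lower())  # 'normal' and unknown labels -> None


total_benign = sum(c["benign"] for c in COUNTS.values())
total_malignant = sum(c["malignant"] for c in COUNTS.values())
total_normal = sum(c["normal"] for c in COUNTS.values())
total_lesions = total_benign + total_malignant
```

The tally reproduces the 556 excluded 'Normal' cases and the 6880 lesion images of the harmonized repository.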

To prevent information leakage during dataset harmonization, a stratified, image-level five-fold partition was used across all subsequent experiments, with the class distribution preserved across folds. Explicit patient- or subject-level identifiers are released for BUS-BRA (1064 patients) and BUS-UCLM (38 subjects), and BrEaST provides case-level filenames with a single image per case; BUS-CoT, BLUI, and BUSI do not release an explicit patient or case identifier. Because the partition operates at the image level, for the two datasets in which a single patient contributes several images, namely BUS-BRA and BUS-UCLM, correlated images from the same patient may be split across training and test folds. The pooled (Protocol II) results should therefore be interpreted as image-level performance that may modestly overestimate strict patient-level clinical generalization; BrEaST contributes a single image per case and is unaffected. The dedicated leakage-free filter described earlier in this section and the disjoint institutional origins of the six datasets remove the more consequential cross-dataset patient overlap, and we verified that no patient is shared across datasets. A full patient-level re-partition with retraining, for the datasets whose identifiers permit it, is left for future work.
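
A stratified, image-level five-fold partition of the kind described above can be sketched in plain Python. The shuffling seed, the round-robin assignment, and the function names are assumptions made for illustration. The second helper shows how one might audit which patients end up in more than one fold when several images share a patient identifier, which is the caveat raised for BUS-BRA and BUS-UCLM.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Assign image indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Round-robin assignment keeps per-class counts within one of each other.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def patients_split_across_folds(folds, patient_ids):
    """Return patients whose images fall into more than one fold."""
    folds_of_patient = defaultdict(set)
    for fold_no, fold in enumerate(folds):
        for idx in fold:
            folds_of_patient[patient_ids[idx]].add(fold_no)
    return sorted(p for p, f in folds_of_patient.items() if len(f) > 1)


# Toy data: 20 benign (0) and 10 malignant (1) images, two images per patient.
labels = [0] * 20 + [1] * 10
patient_ids = [i // 2 for i in range(len(labels))]

folds = stratified_kfold(labels)
leaky_patients = patients_split_across_folds(folds, patient_ids)
```

Each fold here holds four benign and two malignant images. Any patient listed in `leaky_patients` contributes correlated images to both training and test data in at least one split, which is why image-level results may overestimate patient-level generalization.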

This study exclusively utilizes six public, de-identified datasets (BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28]). All datasets were acquired and shared by their respective original institutions under open-access licenses (e.g., CC BY 4.0) for research purposes. As the data is completely anonymized and no direct patient interaction occurred, this study qualifies for exemption from additional Institutional Review Board (IRB) approval, adhering to the ethical principles of the Declaration of Helsinki.

3.2. Proposed Approach

As illustrated in Figure 1, the proposed model effectively integrates US images and clinical text to enhance diagnostic robustness. The input consists of a US image and its corresponding medical record, denoted as $I$ and $C$. The image is divided into $N_{v}$ non-overlapping patches to extract localized visual features, while the medical record is tokenized into $N_{t}$ discrete units for textual representation. Both modalities are processed through dedicated encoders: a Vision Transformer (ViT) [54] extracts visual embeddings $V \in \mathbb{R}^{B \times N_{v} \times d}$, and ClinicalBERT [46] generates contextualized text embeddings $T \in \mathbb{R}^{B \times N_{t} \times d}$, where $B$ is the batch size and $d$ is the feature dimension.

To address the limitations of simple fusion, we adopt a Dual-Stage Alignment strategy combining LCA for local feature refinement and GAA for global context modeling. The design follows two observations specific to US imaging. Global attention applied in isolation tends to under-weight subtle morphological cues such as irregular lesion boundaries, whereas local alignment alone struggles to build domain-invariant representations across scanners. Cascading the two stages is therefore intended to capture both fine-grained sonographic features and higher-level clinical semantics within a single pipeline.

3.2.1. Stage I: Local Alignment via LCA

The LCA is designed to capture fine-grained interdependencies (e.g., lesion boundaries vs. shape descriptors) by refining the raw correlation between modalities. First, a correlation matrix $A \in \mathbb{R}^{N_{v} \times N_{t}}$ is computed via the dot product of the embeddings:

$$A = V \cdot T^{\top} \tag{1}$$

A direct correlation matrix often contains noise and lacks spatial coherence. To mitigate this, we apply a convolutional layer over $A$, which acts as a smoothing filter to enforce regional consistency within the interaction space:

$$A_{conv} = \mathrm{Conv2D}(A) \tag{2}$$

Specifically, the Conv2D operation utilizes a $3 \times 3$ convolutional kernel with a single filter, a stride of 1, and a padding of 1. This configuration is strategically chosen to smooth the raw correlation scores while maintaining the original spatial dimensions of the interaction space, ensuring that every local visual–textual correspondence is refined by its immediate neighbors.

The refined matrix $A_{conv}$ is then utilized to generate modality-specific attention maps, $\alpha \in \mathbb{R}^{N_{v}}$ and $\beta \in \mathbb{R}^{N_{t}}$, via row-wise and column-wise aggregation followed by sigmoid normalization:

$$\alpha = \sigma(\mathrm{mean}(A_{conv}, \dim = 1)) \tag{3}$$

$$\beta = \sigma(\mathrm{mean}(A_{conv}, \dim = 0)) \tag{4}$$

where $\sigma(\cdot)$ denotes the sigmoid function. These maps modulate the original embeddings to highlight locally correlated features:

$$V_{lca} = \alpha \odot V, \quad T_{lca} = \beta \odot T \tag{5}$$

where ⊙ represents element-wise multiplication with broadcasting.
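
Equations (1) to (5) can be traced for a single sample with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the batch dimension is omitted, the embeddings are random stand-ins for the ViT and ClinicalBERT outputs, and the $3 \times 3$ kernel is fixed to a plain averaging filter, whereas in a trained model its weights would be learned.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_3x3_same(a, kernel):
    """Single-filter 3x3 convolution with stride 1 and zero padding 1.

    Implemented as cross-correlation, the convention used by deep learning
    frameworks, so the output keeps the (N_v, N_t) shape of the input.
    """
    rows, cols = a.shape
    padded = np.pad(a, 1)  # zero padding by default
    out = np.zeros((rows, cols), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def local_correlation_alignment(v, t, kernel):
    a = v @ t.T                            # Eq. (1): correlation matrix, (N_v, N_t)
    a_conv = conv2d_3x3_same(a, kernel)    # Eq. (2): local smoothing
    alpha = sigmoid(a_conv.mean(axis=1))   # Eq. (3): one weight per visual token
    beta = sigmoid(a_conv.mean(axis=0))    # Eq. (4): one weight per text token
    v_lca = alpha[:, None] * v             # Eq. (5), broadcast over features
    t_lca = beta[:, None] * t
    return v_lca, t_lca, alpha, beta


rng = np.random.default_rng(0)
n_v, n_t, d = 6, 4, 8                      # toy sizes, not the paper's
v = rng.standard_normal((n_v, d))
t = rng.standard_normal((n_t, d))
kernel = np.full((3, 3), 1.0 / 9.0)        # assumed averaging kernel

v_lca, t_lca, alpha, beta = local_correlation_alignment(v, t, kernel)
```

Because the attention maps pass through a sigmoid, every token is rescaled by a factor strictly between 0 and 1; the stage reweights tokens rather than discarding them.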

3.2.2. Stage II: Global Alignment via GAA

While LCA handles local alignment, long-range semantic dependencies are captured via GAA. First, the original embeddings are concatenated to form a unified sequence $C = [V; T] \in \mathbb{R}^{B \times (N_{v} + N_{t}) \times d}$. This sequence is processed using scaled dot-product attention to obtain the global context matrix $H$:

$$H = \mathrm{softmax}\left(\frac{C_{Q} C_{K}^{\top}}{\sqrt{d}}\right) C_{V} \tag{6}$$

where $C_{Q}$, $C_{K}$, and $C_{V}$ are linear projections of $C$ (the query, key, and value matrices), and $d$ is the feature dimension. In practice the operation is implemented as multi-head self-attention; we use $d$ in the scaling factor for notational simplicity, with the standard per-head scaling $\sqrt{d / \mathrm{num\_heads}}$ used in the implementation.

Since the output $H \in \mathbb{R}^{B \times (N_{v} + N_{t}) \times d}$ contains mixed tokens, we split it back into modality-specific components to retrieve the globally refined embeddings:

$$V_{gaa} = H_{:, 1 : N_{v}, :}, \quad T_{gaa} = H_{:, (N_{v} + 1) : (N_{v} + N_{t}), :} \tag{7}$$

Here, indices are 1-based: $V_{gaa}$ corresponds to the first $N_{v}$ tokens (image features), and $T_{gaa}$ corresponds to the remaining $N_{t}$ tokens (text features). This step ensures that we extract the context-aware representations for each modality before the final alignment.
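
The global stage of Equations (6) and (7) can be illustrated with a single attention head on one sample. The sketch below is simplified on purpose: it uses one head with the $\sqrt{d}$ scaling of Equation (6) rather than the multi-head variant, the projection matrices are random stand-ins for learned weights, and the function and variable names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def global_attention_alignment(v, t, w_q, w_k, w_v):
    n_v = v.shape[0]
    c = np.concatenate([v, t], axis=0)        # joint sequence, (N_v + N_t, d)
    q, k, val = c @ w_q, c @ w_k, c @ w_v     # linear projections of C
    d = c.shape[1]
    weights = softmax(q @ k.T / np.sqrt(d))   # Eq. (6), single head
    h = weights @ val
    # Eq. (7) with 0-based slicing: first N_v rows are image tokens.
    return h[:n_v], h[n_v:], weights


rng = np.random.default_rng(0)
n_v, n_t, d = 6, 4, 8                          # toy sizes, not the paper's
v = rng.standard_normal((n_v, d))
t = rng.standard_normal((n_t, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

v_gaa, t_gaa, weights = global_attention_alignment(v, t, w_q, w_k, w_v)
```

Each row of `weights` spans both image and text positions, so every refined visual token can draw on textual tokens and vice versa before the sequence is split back by modality.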

3.2.3. Feature Concatenation, Fusion and Loss Function

Feature Concatenation: The locally aligned (LCA) and globally refined (GAA) features are then concatenated to form the final aligned representations:

$$V_{aligned} = [V_{lca}; V_{gaa}], \quad T_{aligned} = [T_{lca}; T_{gaa}] \tag{8}$$

Variance-Aware Semantic Embedding: To extract a compact global representation for each modality, we first compute global pooled embeddings by averaging the aligned feature sequences ($V_{aligned} \in \mathbb{R}^{B \times M_{v} \times d}$ and $T_{aligned} \in \mathbb{R}^{B \times M_{t} \times d}$) along their token dimension, where $M_{v}$ and $M_{t}$ denote the resulting sequence lengths (in the default dual-stage configuration $M_{v} = 2 N_{v}$ and $M_{t} = 2 N_{t}$, while a single-stage variant yields $M_{v} = N_{v}$ and $M_{t} = N_{t}$):

$$V_{pool} = \frac{1}{M_{v}} \sum_{i = 1}^{M_{v}} V_{aligned}[:, i, :] \tag{9}$$

$$T_{pool} = \frac{1}{M_{t}} \sum_{j = 1}^{M_{t}} T_{aligned}[:, j, :] \tag{10}$$

As an additional design element within the semantic-embedding pipeline (rather than a standalone contribution), we incorporate a variance-aware modulation that amplifies dimensions with high variability in the pooled embeddings, since these often correspond to discriminative clinical features:

V_{sem} = V_{pool} + V_{pool} ⊙ N (Var (V_{pool}))

(11)

T_{sem} = T_{pool} + T_{pool} ⊙ N (Var (T_{pool}))

(12)

where N(\cdot) denotes L2-normalization of the variance vector, which ensures scale-invariance, and Var(\cdot) computes a per-feature variance across the samples in the batch, yielding a d-dimensional vector that is then broadcast in the element-wise product.

Two properties of this modulation merit clarification. First, because the variance vector is L2-normalized, its contribution is bounded: every component lies in [0, 1] (variances are non-negative) and, for the d = 768 feature space used here, the typical component magnitude is on the order of 1 / \sqrt{d} \approx 0.036. The modulation therefore acts as a small, bounded refinement of the pooled embedding rather than a dominant transformation. Second, in the experiments reported in this work, Var(\cdot) is evaluated over the mini-batch using a fixed batch size, so all reported metrics are well defined and reproducible. Because a batch statistic is unavailable when cases are evaluated individually, for single-case clinical deployment the variance term can be replaced by the statistic estimated once over the training set, which is a fixed d-dimensional vector. This makes each prediction depend only on its own input, removes any dependence on batch composition or batch size, and avoids the degenerate single-sample case; given the bounded magnitude above, the fixed-statistic variant remains the same minor refinement. An on/off ablation isolating this modulation was not conducted, which we acknowledge among the study limitations; consistent with its role as an auxiliary refinement within the LCA–GAA pipeline rather than a standalone mechanism, its effect is expected to be minor by construction.
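The modulation of Equations (11) and (12), together with the fixed-statistic variant for single-case inference, can be sketched as follows. This is an illustration under stated assumptions, not the released code: plain lists stand in for tensors, the function names and the `fixed_variance` argument are hypothetical, and the population variance is used (the constant factor separating it from the unbiased estimate cancels under L2-normalization whenever the batch has more than one sample).

```python
import math


def column_variance(batch):
    """Per-feature population variance across the samples of a B x d batch."""
    n = len(batch)
    d = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    return [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against an all-zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def variance_modulate(pooled, fixed_variance=None):
    """Return pooled + pooled * N(Var(pooled)) for a B x d batch.

    If fixed_variance (a d-dimensional training-set statistic) is given,
    it replaces the batch variance, so each row depends only on itself.
    """
    var = fixed_variance if fixed_variance is not None else column_variance(pooled)
    weights = l2_normalize(var)                # components lie in [0, 1]
    return [[x + x * w for x, w in zip(row, weights)] for row in pooled]
```

With a single-sample batch and no fixed statistic, the batch variance is zero in every dimension and the sketch returns the pooled embedding unchanged, which is the degenerate case the fixed-statistic variant avoids.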

Fusion and Classification: The semantic embeddings (V_{sem}, T_{sem}) and pooled embeddings (V_{pool}, T_{pool}) are fused via element-wise multiplication to produce the semantic fusion F_{sem} and pooled fusion F_{pool}:

F_{sem} = V_{sem} ⊙ T_{sem}, F_{pool} = V_{pool} ⊙ T_{pool}

(13)

These are then concatenated to form the final descriptor F_{final} = [F_{sem}; F_{pool}], which is fed into the classification head. The final fusion step (element-wise product followed by concatenation) is intentionally simple: because the cross-modal interactions are handled by the preceding Dual-Stage Alignment (LCA and GAA), a heavier late-fusion mechanism is not required and, as shown in our ablation, does not yield additional gains.
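The pooling of Equations (9) and (10) and the product-then-concatenate fusion of Equation (13) reduce to a few lines for a single sample. The sketch below is only illustrative: the function names are hypothetical and lists stand in for tensors.

```python
def mean_pool(tokens):
    """Average an M x d token sequence over its token axis (one sample)."""
    m = len(tokens)
    return [sum(col) / m for col in zip(*tokens)]


def fuse(v_sem, t_sem, v_pool, t_pool):
    """Element-wise products followed by concatenation: F_final = [F_sem; F_pool]."""
    f_sem = [a * b for a, b in zip(v_sem, t_sem)]
    f_pool = [a * b for a, b in zip(v_pool, t_pool)]
    return f_sem + f_pool                      # length 2d descriptor
```

For d-dimensional inputs the descriptor has length 2d, so the classification head sees both the modulated and the unmodulated interaction terms.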

Loss Function: To train the model, we employ a multi-objective loss function that balances classification accuracy, modality-specific learning, and cross-modal alignment:

L_{total} = L_{cls} + λ_{1} L_{image} + λ_{2} L_{text} + λ_{3} L_{align}

(14)

Here, λ_{1}, λ_{2}, λ_{3} are hyperparameters controlling the weight of each term. We use Focal Loss for the classification terms (L_{cls}, L_{image}, L_{text}) to mitigate class imbalance:

L_{focal} = - \sum_{i} α {(1 - p_{t})}^{γ} log (p_{t})

(15)

where p_{t} is the predicted probability of the true class of sample i, γ is the focusing parameter, and α is a class-balancing factor.

The cosine alignment loss term, L_{align}, is formulated to pull visual and textual representations of the same case closer in the shared latent space. Note that this objective acts on positive pairs only and is therefore an alignment (similarity) loss rather than a contrastive loss with explicit negative samples. It is defined as:

L_{align} = 1 - \frac{V_{pool} \cdot T_{pool}}{∥ V_{pool} ∥ ∥ T_{pool} ∥}

(16)

where V_{pool} and T_{pool} are the pooled visual and textual embeddings obtained from the aligned feature sequences for a single sample. In practice this scalar is averaged across the samples in the batch during training. Minimizing this term enforces semantic consistency, aligning the visual features of a lesion with their corresponding clinical descriptors.
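As a worked illustration of Equations (14) to (16), the sketch below computes the focal term, the cosine alignment term, and their weighted sum on plain Python numbers. It is an assumption-laden sketch, not the training code: the function names are hypothetical, the focal term is summed as written in Equation (15), and the default weights and focal parameters are the values reported in Section 4.2.4.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Equation (15): p_true holds the predicted probability of the true class."""
    return sum(-alpha * (1.0 - p) ** gamma * math.log(p) for p in p_true)


def cosine_alignment_loss(v_pool, t_pool):
    """Equation (16): one minus the cosine similarity of one image/text pair."""
    dot = sum(a * b for a, b in zip(v_pool, t_pool))
    norm_v = math.sqrt(sum(a * a for a in v_pool))
    norm_t = math.sqrt(sum(b * b for b in t_pool))
    return 1.0 - dot / (norm_v * norm_t)


def total_loss(l_cls, l_image, l_text, l_align, lambdas=(0.7, 0.3, 0.1)):
    """Equation (14) with (lambda_1, lambda_2, lambda_3) as a tuple."""
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_image + lam2 * l_text + lam3 * l_align
```

Setting `gamma=0.0` recovers ordinary cross-entropy, which makes the down-weighting of confident predictions easy to inspect: at p_{t} = 0.5 and γ = 2 the focal term is one quarter of the cross-entropy value.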

4. Results

4.1. Software and Hardware Configuration

All experiments were implemented using the PyTorch deep learning framework (Version 2.9.0+cu130) within a Python 3.12 environment on Ubuntu 24.04 LTS. Model training and inference were accelerated using an NVIDIA RTX A6000 GPU (48 GB VRAM) hosted on a high-performance workstation equipped with an Intel Core i9-14900KF processor and 128 GB of RAM. The CUDA 13.0 toolkit was utilized to ensure GPU-optimized tensor operations.

4.2. Implementation Details and Protocols

To assess both the learning capacity and the robustness of the proposed approach, we utilized both an independent 5163-image subset and a harmonized, leakage-free repository of 6880 cases derived from six public datasets (BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28]). As detailed in Section 3.1, this repository includes both multimodal (image + text) and unimodal (image-only) samples.

The BUS-CoT and BrEaST datasets provide paired clinical text. The remaining four (BUS-BRA, BUS-UCLM, BLUI, BUSI) consist only of US images. To handle these unimodal samples without changing the architecture, we use a token imputation strategy: when no clinical report is available, the text input is set to the fixed placeholder sequence “no data”. The textual encoder maps this sequence to a deterministic, low-information embedding that the model can treat as a missing-modality indicator, so a single pipeline can process both multimodal and unimodal data and shift its diagnostic reliance to the visual modality when the text is uninformative. The same network weights and forward path are used at both training and inference; for a text-absent sample, the only change is that its clinical report is replaced by the placeholder prior to encoding, after which the visual encoder, both alignment stages, and the classifier are applied without modification.
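The token imputation step amounts to a substitution applied before the text encoder. A minimal sketch, in which the function name is hypothetical and treating empty or whitespace-only reports as missing is our assumption:

```python
PLACEHOLDER = "no data"  # fixed placeholder sequence described in the text


def resolve_text(report):
    """Return the clinical report, or the placeholder when none is available."""
    if report is None or not report.strip():
        return PLACEHOLDER
    return report
```

Everything downstream of this substitution (tokenization, encoding, alignment, classification) is unchanged, which is what allows one pipeline to serve both multimodal and image-only samples.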

We designed three distinct experimental protocols to evaluate the approach under different clinical and comparative scenarios:

4.2.1. Protocol I: Independent BUS-CoT Benchmarking

To benchmark our approach against the state-of-the-art VLMs (e.g., Qwen2.5-VL) presented in the original BUS-CoT study [23], we followed their official dataset partitioning. The 5163-image dataset was divided into their predefined trainval and test sets. We conducted a stratified image-level 5-fold cross-validation within the trainval cohort to optimize model parameters. The best-performing model weights from each of the five folds were then evaluated on the fixed test set, and final performance metrics are reported as mean and standard deviation across these five independent test evaluations. This protocol provides a like-for-like comparison with the cited VLM baselines and a transparent statistical basis for the reported variance.

4.2.2. Protocol II: Pooled Intra-Domain Evaluation (Leakage-Free)

For Protocol II we use the harmonized 6880-image, leakage-free repository spanning all six datasets. Because BUS-CoT is itself an aggregated meta-dataset that re-uses images from other public sources, we performed the deduplication described in Section 3.1: BUS-CoT entries whose BUS-Expert annotations mark their source as BUS-BRA or BUSI were removed (1553 images), and the result was cross-checked with a pHash near-duplicate scan against the other five datasets (Hamming distance \leq 2). After this two-stage filter, the BUS-CoT subset contained 3610 images. It was merged with the five other datasets (BrEaST, BUS-BRA, BUS-UCLM, BLUI, BUSI; combined 3270 images) to form the 6880-image pool.
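The near-duplicate criterion and the image counts above can be checked with a short sketch. It assumes perceptual hashes are available as integers; the hash length and the helper names are assumptions, since the text specifies only the Hamming threshold of 2.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a: int, hash_b: int, max_distance: int = 2) -> bool:
    """Flag a pair whose hashes differ in at most max_distance bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# Image counts reported in the text.
BUS_COT_ORIGINAL = 5163
REMOVED_BY_SOURCE_TAG = 1553
OTHER_FIVE_DATASETS = 3270

bus_cot_filtered = BUS_COT_ORIGINAL - REMOVED_BY_SOURCE_TAG   # 3610
pool_size = bus_cot_filtered + OTHER_FIVE_DATASETS            # 6880
```

The reported counts are consistent with the source-tag filter accounting for all 1553 removals, with the pHash scan serving as a cross-check.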

On this harmonized repository, we employed a stratified, image-level five-fold cross-validation strategy. The data was randomly partitioned into five folds, ensuring a uniform class distribution (Benign, Malignant) across folds. In each iteration, four folds were used for training and one for testing, with the final performance reported as the average across all iterations. This protocol assesses the model’s capacity to learn from a heterogeneous distribution while providing strong assurance against cross-dataset overlap (the dedicated leakage-free filter described in Section 3.1); strict patient-level grouping is not separately enforced in this image-level partition, and the resulting interpretation of the pooled results, together with the datasets for which patient or case identifiers are in fact available, is discussed in Section 3.1.
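A stratified, image-level five-fold partition of the kind described above can be sketched with the standard library alone. The function name and seed are hypothetical, and the sketch groups by class label only, mirroring the image-level (not patient-level) partition discussed in the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds with near-equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                   # randomize within each class
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)         # deal round-robin across folds
    return folds
```

In each iteration one fold serves as the test set and the remaining four are merged for training; reported metrics are then averaged over the five iterations.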

4.2.3. Protocol III: Zero-Shot Cross-Domain Evaluation (Missing Modality)

The rigorous assessment of DG under realistic, challenging clinical conditions stands as a defining contribution of this study. Under Protocol III, we established a strict zero-shot cross-domain transfer scenario, in which the model evaluates entirely unseen target datasets without prior exposure, to assess the robustness of the learned representations against severe domain shifts. The proposed approach was trained exclusively on the harmonized multimodal source pool (comprising the leakage-free BUS-CoT and BrEaST datasets). A dedicated validation subset from this source pool was strictly utilized for model selection and early stopping.

The trained model is then evaluated directly on four unseen unimodal target domains (BUSI, BLUI, BUS-UCLM, BUS-BRA) without any domain-specific fine-tuning. The protocol stresses two distinct challenges at once: the target domains use different scanner hardware (e.g., Siemens, LOGIQ, AirPlorer) and originate from different patient populations (Egypt, Spain, Brazil); and the model receives no clinical text at inference time, relying entirely on the “no data” placeholder. The setup therefore measures how much of the multimodal training signal is retained in the visual backbone when the textual modality is unavailable at test time. Protocol III is thus designed as a missing-modality, cross-scanner transfer evaluation rather than as a controlled comparison against dedicated domain-generalization (DG) algorithms. We make no claim that the proposed alignment is superior to explicit DG methods such as IRM, GroupDRO, MixStyle, or CORAL; the per-domain results are reported as a lower bound on purely visual inference under cross-scanner and cross-population shift, and a controlled comparison against such DG methods under the identical source–target protocol is a complementary direction left for future work.

Table 2 summarizes the role of each of the six datasets across the three protocols defined above; per-dataset image counts, class balance, clinical-text availability, and scanners are listed in Table 1.

4.2.4. Training Configuration and Hyperparameters

To ensure consistency and fair evaluation, a unified training configuration was strictly maintained across all experiments. All input US images were resized to 224 \times 224 pixels to match the input dimensionality requirements of the DeiT backbone, accompanied by standard data augmentation techniques (e.g., random horizontal flipping and minor rotations) to enhance model robustness against spatial variations. All models were trained with a batch size of 16 for a maximum duration of 30 epochs. To prevent overfitting, an early stopping mechanism was implemented with a patience of 5 epochs, monitoring the validation loss. For the pooled Protocol II experiments we additionally applied a domain-balanced batch sampler (inverse-frequency weighting over the source dataset of each image) so that the substantially larger sources (e.g., BUS-CoT, BUS-BRA) do not dominate gradient updates relative to the smaller ones (e.g., BLUI, BrEaST); this affects sampling order only, not the underlying dataset partition. A fixed random seed was used to initialize all stochastic components (Python, NumPy, and PyTorch RNGs), and the exact seed and remaining configuration details are provided in the released code repository to support full reproducibility.
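Inverse-frequency weighting over source datasets can be sketched as follows. The helper names are hypothetical and sampling with replacement is our assumption; in PyTorch, per-sample weights of this kind are typically passed to `torch.utils.data.WeightedRandomSampler`, though whether the released code does so is not stated.

```python
import random
from collections import Counter


def domain_balanced_weights(domains):
    """Per-sample weight 1 / (number of images from the same source dataset)."""
    counts = Counter(domains)
    return [1.0 / counts[d] for d in domains]


def sample_batch(domains, batch_size=16, seed=0):
    """Draw sample indices (with replacement) so each source is equally likely."""
    rng = random.Random(seed)
    weights = domain_balanced_weights(domains)
    return rng.choices(range(len(domains)), weights=weights, k=batch_size)
```

Because the weights within each source sum to one, every dataset contributes the same expected share of a batch regardless of its size.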

Optimization was performed using the Adam optimizer, which effectively balances fast convergence with gradient stability. We employed a differential learning rate strategy: the pre-trained unimodal encoders (DeiT and ClinicalBERT) were fine-tuned with a learning rate of 1 \times 10^{- 5} to preserve their foundational feature representations, while the newly initialized multimodal alignment modules (LCA and GAA) were trained with a slightly higher learning rate of 3 \times 10^{- 5}. These values were empirically selected based on established best practices for fine-tuning transformer architectures [45,54]. For the multi-objective loss function defined in Equation (14), the weighting coefficients were empirically set to λ_{1} = 0.7, λ_{2} = 0.3, and λ_{3} = 0.1. This weighting prioritizes the visual modality (λ_{1}) as the primary diagnostic source, while the textual modality (λ_{2}) and cross-modal alignment (λ_{3}) act as supportive regularizers, so the model benefits from cross-modal semantic refinement without letting the auxiliary text features overwhelm the primary image-based classification objective. Finally, to mitigate class imbalance, the Focal Loss parameters (Equation (15)) were set to a focusing parameter of γ = 2.0, which down-weights easily classified instances and focuses training on challenging diagnostic cases, and a balancing factor of α = 1.0.

4.3. Quality Metrics

We use five metrics: Accuracy (ACC), Precision (PRE), Recall (REC), F1-Score, and Area Under the ROC Curve (AUC). In a cancer diagnosis context, recall and AUC are particularly informative because they reflect the model’s ability to limit false negatives while maintaining specificity.

We defined these metrics using the standard confusion matrix components (True Positives [TP], True Negatives [TN], False Positives [FP], and False Negatives [FN]) as follows:

ACC = \frac{TP + TN}{TP + TN + FP + FN}

(17)

PRE = \frac{TP}{TP + FP}

(18)

REC = \frac{TP}{TP + FN}

(19)

F1 = \frac{2 \cdot PRE \cdot REC}{PRE + REC} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}

(20)

For threshold-free evaluation, we utilize the AUC. This metric relies on the True Positive Rate (TPR), also known as sensitivity, and the False Positive Rate (FPR), defined as:

TPR = \frac{TP}{TP + FN}

(21)

FPR = \frac{FP}{FP + TN}

(22)

The AUC is then computed by integrating TPR against FPR across varying decision thresholds:

AUC = \int_{0}^{1} TPR (FPR) d (FPR)

(23)

This provides a threshold-independent measure of classification separability.
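Equations (17) to (23) translate directly into code. The sketch below is illustrative and uses hypothetical function names; the AUC is computed through the rank formulation (the probability that a randomly chosen positive outscores a randomly chosen negative, ties counting one half), which equals the area under the empirical ROC curve.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and FPR from confusion-matrix counts."""
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "pre": tp / (tp + fp),
        "rec": tp / (tp + fn),               # identical to TPR / sensitivity
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fpr": fp / (fp + tn),
    }


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for test sets of a few thousand images; a sort-based variant would be preferable at larger scale.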

4.4. Quantitative Analysis

This section evaluates the proposed approach along four directions. First, we benchmark it against baseline models on the BUS-CoT dataset (Protocol I). Second, we analyze its performance on the harmonized 6880-image pool and compare per-domain results against state-of-the-art techniques (Protocol II). Third, we measure its zero-shot domain generalization on unseen unimodal targets (Protocol III). Finally, we report ablation studies on the alignment, fusion, and loss components.

4.4.1. Comparative Analysis with Baseline Models (Protocol I)

Using the standardized five-fold cross-validation on the 5163-image BUS-CoT benchmark (Protocol I), we benchmarked the proposed approach against unimodal (DeiT, ViT, BERT variants), standard multimodal architectures, and a state-of-the-art VLM. Table 3 summarizes the performance metrics.

Analysis of Transfer Learning and Backbones: In the visual domain, the Data-efficient Image Transformer (DeiT) slightly outperforms the standard ViT, confirming its suitability for medical datasets where data scale is limited compared to natural images. Regarding textual representation, the observed ordering is ClinicalBERT (0.767) > BERT (0.766) > BioBERT (0.763); the differences are small and within the cross-fold standard deviation, but ClinicalBERT consistently leads. We interpret this as evidence that pretraining on clinical text (MIMIC-III, for ClinicalBERT) is more aligned with the structured sonographic descriptors used here than pretraining on the broader biomedical literature (for BioBERT), which appears to provide no measurable advantage over the generic BERT baseline on this task. Consequently, we adopted DeiT and ClinicalBERT as the backbone encoders for our multimodal architecture.

Multimodal Efficacy of the Proposed Alignment: The Dual-Stage approach with DeiT and ClinicalBERT backbones reaches an accuracy of 0.8177 and an AUC of 0.8852 (Table 3), improving over the strongest simple-fusion baselines (DeiT + BERT, 0.8116 ACC/0.8844 AUC) by ~0.6 ACC points at a comparable AUC. Within the same table, the single-stage variants (LCA-only and GAA-only) trail the dual configuration by 1.0–1.2 ACC points, suggesting that the two modules contribute complementary signal rather than redundant capacity.

Comparison against Large VLMs: To contextualize our approach within recent advances in generative AI, we benchmarked it against the VLM reported in the original BUS-CoT study: the 7-billion parameter Qwen2.5-VL-7B augmented with chain-of-thought (CoT) reasoning. As presented in Table 3, the Qwen model attains a slightly higher Recall (0.8414) but a markedly lower Precision (0.7546) and a lower AUC (0.8354 vs. 0.8852), indicating a tendency to over-predict malignancy and produce more false positives. Our task-specific Dual-Stage Alignment achieves higher Accuracy (0.8177 vs. 0.8064) and higher Precision (0.8169 vs. 0.7546), with a better overall F1-Score (0.8169 vs. 0.7956). We interpret this as evidence that, for narrow but clinically critical tasks such as benign/malignant lesion characterization, lightweight discriminative alignment via LCA and GAA can offer a more balanced accuracy–efficiency trade-off than prompting a large general-purpose VLM. The proposed model contains 198.25 million trainable parameters in total, roughly 35 \times fewer than Qwen’s 7 billion, of which the pre-trained backbones dominate (DeiT: 86.39 M, 43.6%; ClinicalBERT: 108.31 M, 54.6%) while the Dual-Stage Fusion module (LCA and GAA) adds only 3.55 M parameters (1.8%). These figures should be read as a favorable accuracy/parameter trade-off rather than a head-to-head comparison of architectures, since Qwen2.5-VL was used in a zero-shot prompted setting while our model is fine-tuned for the task.
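The parameter budget quoted above can be verified with a few lines of arithmetic. The component counts are those reported in the text; the Qwen figure uses the nominal 7 billion (7000 M), which is an assumption about the exact count.

```python
# Trainable parameters in millions, as reported in the text.
params_m = {"DeiT": 86.39, "ClinicalBERT": 108.31, "LCA+GAA fusion": 3.55}

total_m = sum(params_m.values())                       # about 198.25 M
shares = {name: round(100 * count / total_m, 1) for name, count in params_m.items()}
ratio_vs_qwen = 7000.0 / total_m                       # nominal 7 B vs. this model

print(f"total = {total_m:.2f} M, shares = {shares}, ratio = {ratio_vs_qwen:.1f}x")
```

The three components sum to the reported total, the shares round to 43.6%, 54.6%, and 1.8%, and the ratio comes out slightly above 35.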

4.4.2. Comparison with State-of-the-Art Methods (Protocol II)

Table 4 details the performance comparison across individual datasets, extracted from the aggregated cross-validation predictions. We note that while the cited baseline SOTA methods (e.g., He et al. [35], Nastase et al. [37]) were trained and evaluated exclusively on a single isolated dataset, our proposed approach generated these predictions using a single unified model trained across the heterogeneous 6880-image pooled repository. Under this harmonized setting, the model remains competitive across most datasets. On BUS-BRA, our approach matches or exceeds Gomez et al. [25] on Accuracy (0.8741 vs. 0.8650), although Gomez’s Recall (0.8590) remains slightly higher than ours (0.8550). On BUSI, the model reaches AUC 0.9576 and F1 0.9057; here, specialized single-domain models such as He et al. [35] retain a higher Accuracy (0.9440 vs. 0.9196). Overall, the approach’s main strength is consistent behavior across both multimodal (BUS-CoT, BrEaST) and purely unimodal (BUS-UCLM, BLUI) domains without dataset-specific fine-tuning, rather than topping every individual leaderboard.

The consistent performance across both multimodal and unimodal target datasets suggests that the Dual-Stage Alignment strategy lets the visual encoder learn more discriminative morphological representations than purely visual optimization methods. By bridging sonographic features with clinical semantics during training, the approach retains useful diagnostic capacity under domain shifts even when clinical text is unavailable at inference.

To provide a more complete clinical picture, Table 5 reports per-class sensitivity and specificity together with PPV and NPV for the Protocol II out-of-fold evaluation, treating malignant as the positive class. Specificity remains high across datasets (0.78–0.97), and PPV and NPV both stay above 0.76, indicating that the reported performance is not an artefact of class imbalance and that the model maintains a clinically reasonable balance between detecting malignancies and limiting false alarms.

4.4.3. Evaluation of Zero-Shot Cross-Modal Domain Generalization (Protocol III)

Following Protocol III, we trained the proposed approach on the multimodal source pool (BUS-CoT and BrEaST) and tested it on four entirely unseen unimodal datasets (BUSI, BLUI, BUS-UCLM, and BUS-BRA). In this zero-shot transfer scenario, the model received no clinical text during inference (utilizing the “no data” placeholder) and was blind to the target domains’ scanner characteristics and patient demographics.

As detailed in Table 6, the approach retains useful diagnostic capability under cross-country, cross-scanner, and text-free conditions, but the drop relative to Protocol II is substantial and should be acknowledged. The AUC ranges from 0.7360 on BUS-UCLM to 0.8060 on BUSI (BLUI: 0.8020, BUS-BRA: 0.7930), with a mean of approximately 0.784 across the four target domains, compared with in-pool AUCs typically above 0.87 in Protocol II. We read this as evidence that the Dual-Stage Alignment encodes partially domain-invariant morphological features (enough to give a reasonable lower bound for purely visual inference) but not as a claim that the cross-domain gap is closed. Training with paired clinical text appears to make the visual backbone more resilient to speckle noise and vendor-specific artifacts that often degrade unimodal networks in cross-scanner deployment, while the residual gap highlights that prospective evaluation with domain-specific calibration is still likely to be required.

4.4.4. Ablation Studies

To justify the architectural design choices, we conducted detailed ablation studies focusing on fusion strategies and loss function configurations.

Impact of Fusion Strategies: We compared element-wise summation (+), product (⊙), and concatenation (||). In Table 7, rows index the 1st Stage strategy (Intermediate: F_{sem}, F_{pool}) and columns index the 2nd Stage strategy (Final: F_{final}). The hybrid choice, Product (⊙) for Stage I and Concatenation (||) for Stage II, reaches the highest accuracy (0.8177): the product step acts as a soft gate that refines the semantic features, while the concatenation step preserves the full feature space for the classification head. We also tested an Adaptive Gated Fusion variant in place of the final concatenation; it did not improve over the simple concatenation, suggesting that the Dual-Stage Alignment already produces sufficiently balanced LCA and GAA representations so that an additional learnable gate is unnecessary.

Impact of Loss Configurations: Table 8 reports the contribution of each loss component to the Dual-Stage Alignment. With only the primary classification loss (L_{cls}), performance is ACC 0.7865/AUC 0.8552; in that setting the network tends to let one modality dominate while under-using the other. Adding modality-specific auxiliary losses (L_{cls} + L_{image} + L_{text}) encourages both the DeiT and ClinicalBERT encoders to learn discriminative features before fusion, improving accuracy to 0.7964.

Adding the cosine alignment loss (L_{align}) raises performance to ACC 0.8055 and AUC 0.8735, indicating that enforcing semantic alignment between image and text embeddings in the latent space contributes a measurable signal beyond the classification heads. Without L_{align}, the GAA lacks the pairwise constraint that pulls each lesion’s visual representation towards its matched clinical descriptor. The full configuration (L_{cls} + L_{image} + L_{text} + L_{align}) yields the highest results (ACC 0.8177, AUC 0.8852), combining modality-specific learning with the cross-modal consistency term.

4.5. Qualitative Analysis

Beyond quantitative metrics, we examine the regions that drive the model’s predictions to assess whether decisions rely on medically relevant features rather than confounding artifacts. We use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps over the input image. Figure 2 shows six malignant cases from the test set, chosen to span a range of tumor sizes, morphologies, and image qualities.

As illustrated in Figure 2, the Dual-Stage Alignment approach consistently directs its attention to the pathological Region of Interest (ROI), with the high-activation zones (indicated in red) tightly corresponding to the tumor area across a range of lesion scales, morphologies, and acquisition challenges. On the small, subtle lesions typical of early-stage carcinoma (Figure 2a), the attention map localizes the lesion despite its limited spatial extent, while for large, distinctive masses (Figure 2b) it covers the entire tumor volume rather than fixating on the center. Across morphological variants, the heatmaps trace the complex, spiculated boundaries of irregular tumors (Figure 2c), suggesting that the LCA module ties fine-grained visual cues to the corresponding clinical descriptors, and standard oval or round masses (Figure 2d) are localized with their geometric integrity preserved, a property relevant to BI-RADS shape assessment. Two harder cases probe resilience to acoustic confounders: a lesion partially obscured by posterior acoustic shadowing (Figure 2e), where the model maintains focus on the tumor despite blurred borders, and a tumor embedded in complex tissue (Figure 2f), where the GAA suppresses heterogeneous background and isolates the malignancy.

Taken together, these visualizations indicate that combining LCA and GAA helps the model focus on clinically relevant regions while suppressing background tissue, which supports the use of the approach as a transparent decision-support tool in clinical workflows.

The maps in Figure 2 visualize where the model attends within the image. The local visual–textual alignment itself is performed inside the LCA module, which forms a correlation matrix between the visual patch embeddings and the clinical-descriptor token embeddings and derives per-token attention weights that couple individual descriptors to image regions. The Grad-CAM evidence is consistent with this mechanism: for irregular, spiculated lesions (Figure 2c), the activation traces the lesion contour that the morphological descriptors (e.g., margin and shape) describe, while demographic tokens contribute little to the spatial response. A dedicated visualization that overlays the most strongly aligned descriptor tokens onto their corresponding image regions for individual cases would make this descriptor-to-region linkage even more explicit; producing it requires instrumenting the inference pass to export the per-case correlation and attention tensors, which we identify as a focused direction for future work.

5. Discussion

The experimental results presented in Section 4 collectively support the central premise of this study: a carefully designed, application-specific dual-stage cross-modal alignment can perform competitively with both parameter-heavy VLMs and standard fusion baselines in breast US diagnosis, at a small fraction of the parameter count. In Protocol I, the proposed approach matched or exceeded the 7-billion parameter Qwen2.5-VL-7B with CoT reasoning under the BUS-CoT protocol of the original VLM study, while utilizing roughly 35-fold fewer trainable parameters; because that VLM was evaluated in a zero-shot prompted setting whereas our model is fine-tuned for the task, this comparison is indicative of the accuracy/parameter trade-off rather than a controlled head-to-head result. This indicates that, for narrow but clinically critical tasks such as benign/malignant lesion characterization, dedicated lightweight discriminative architectures may offer a more favorable accuracy–efficiency trade-off than prompting massive generative foundation models. In Protocol II, the unified model achieved competitive metrics on individual subsets despite being trained on a heterogeneous, harmonized pool of 6880 images, suggesting that the Dual-Stage Alignment helps the encoders internalize discriminative morphological representations that remain stable across scanner manufacturers and patient populations. Finally, in Protocol III, the model preserved promising AUC scores (greater than 0.80 on BUSI and BLUI) under a fully zero-shot, text-free transfer scenario, providing empirical evidence that the cross-modal training signal continues to shape the visual backbone even when clinical text is unavailable at test time.

From a methodological perspective, the ablation studies suggest that the local (LCA) and global (GAA) modules are complementary rather than redundant. The LCA imposes spatial regularity on the raw cross-modal affinity map via its convolutional refinement, which appears to be particularly important for US images, where speckle noise and operator-dependent artifacts may otherwise generate spurious local correspondences. The GAA, in turn, integrates long-range dependencies across visual patches and textual tokens, allowing the model to relate global lesion context to high-level clinical descriptors. The fusion ablation further indicates that an element-wise product followed by concatenation provides a balanced combination of feature gating and information preservation, while heavier late-fusion mechanisms (e.g., adaptive gating) do not yield additional gains. Taken together with the loss-function ablation, where the combination of focal classification losses with a cosine alignment term produced the strongest results, these findings suggest that semantic alignment in a shared latent space is a key driver of the observed performance. We therefore present the cascade not as a categorically superior fusion paradigm but as the configuration empirically best suited to the descriptor-rich, low-text-density nature of breast US data; consistent with this, the per-module ablations in Table 3, where neither the LCA-only nor the GAA-only variant matches the dual configuration, indicate that the two stages contribute complementary rather than redundant signal.

From a clinical interpretation standpoint, the results indicate that the Dual-Stage approach is most useful precisely where unimodal pipelines tend to struggle: cross-vendor acquisition with heterogeneous protocols and small per-domain cohorts. The Protocol III scores, although lower than the in-distribution Protocol II results, suggest that the visual backbone retains a meaningful portion of the discriminative signal learned from paired text even when the textual modality is absent at inference. We interpret this as evidence that paired training acts as an implicit regularizer, encouraging the visual encoder to lock onto clinically relevant morphological cues rather than dataset-specific artefacts. The qualitative analysis (Section 4.5) is consistent with this reading: the attention maps consistently localize the lesion region rather than diffuse background tissue, even on scans where speckle noise or shadowing partially occludes the boundary.

The accuracy/parameter trade-off observed against Qwen2.5-VL-7B is also worth noting from a deployment perspective. In settings where on-site GPU resources are limited, a 198M-parameter task-specific model that can be fine-tuned on a single workstation is a more tractable option than orchestrating a 7B-parameter generative model. While the latter offers broader generality, the present results suggest that, for narrow but clinically demanding binary characterization tasks, the strategic alignment of complementary modalities can match or exceed prompted large-VLM performance at a fraction of the parameter count.
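
The parameter gap can be put into rough numbers. The sketch below is a back-of-the-envelope calculation, not a measurement: it uses the nominal counts quoted in the text (198M and 7B) and assumes 2 bytes per parameter (half precision) for the weights alone, ignoring activations, optimizer state, and any KV cache.

```python
PROPOSED_PARAMS = 198e6  # parameter count quoted in the text
VLM_PARAMS = 7e9         # nominal "7B" figure quoted in the text


def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GB; 2 bytes/param assumes fp16 or bf16."""
    return n_params * bytes_per_param / 1e9


# How many times larger the VLM is than the proposed model (about 35x).
param_ratio = VLM_PARAMS / PROPOSED_PARAMS

proposed_gb = weight_memory_gb(PROPOSED_PARAMS)  # about 0.4 GB of weights
vlm_gb = weight_memory_gb(VLM_PARAMS)            # about 14 GB of weights
```

Under these assumptions the weights of the smaller model fit comfortably on commodity GPUs, which is consistent with the single-workstation argument made above.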

6. Conclusions

In this study we presented a Dual-Stage Multimodal Alignment approach for breast cancer diagnosis. The method couples US images with clinical text via two complementary stages: an LCA over a refined visual–textual correlation matrix, and a GAA on the joint visual–textual token sequence. Together they let the model use fine-grained morphological cues alongside higher-level clinical semantics within a single pipeline. While the LCA and GAA stages build on well-established correlation and attention primitives, the contribution of this work lies in their application-specific cascaded composition for breast ultrasound, the variance-aware semantic modulation, and the empirical validation across six heterogeneous public sources under three evaluation protocols, rather than in a fundamentally new multimodal learning mechanism.
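
As an illustration of the kind of visual–textual correlation matrix that the local stage starts from, the sketch below computes cosine similarities between a handful of patch features and token embeddings. This is a simplified stand-in under stated assumptions: the use of cosine similarity, the function names, and the toy vectors are illustrative, and the learned convolutional refinement applied to the matrix in the actual LCA stage is not reproduced here.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def correlation_matrix(patches: List[List[float]],
                       tokens: List[List[float]]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity of patch i and token j."""
    p = [l2_normalize(x) for x in patches]
    t = [l2_normalize(x) for x in tokens]
    return [[sum(a * b for a, b in zip(pi, tj)) for tj in t] for pi in p]


# Two hypothetical patch features and two hypothetical token embeddings.
corr = correlation_matrix([[1.0, 0.0], [0.0, 2.0]],
                          [[3.0, 0.0], [1.0, 1.0]])
```

Each row of `corr` describes how strongly one image patch responds to each text token; a refinement step would then smooth or sharpen this map before it is used for alignment.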

Across the three protocols evaluated on a harmonized repository of 6880 images from six heterogeneous datasets (BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28]), several findings emerge. On the BUS-CoT benchmark (Protocol I), the proposed model reached 0.8177 accuracy and 0.8169 F1, performing comparably to the 7-billion-parameter Qwen2.5-VL-7B with chain-of-thought reasoning (0.8064 accuracy, 0.7956 F1) at roughly 1/35 of the parameter count and with markedly higher precision (0.8169 vs. 0.7546), although Qwen2.5-VL-7B retained the higher recall (0.8414 vs. 0.8184; Table 3). Under the pooled cross-dataset setting (Protocol II), a single unified model trained on the 6880-image repository remained competitive with specialized single-domain methods on the individual subsets, including 0.9576 AUC on BUSI and 0.8741 accuracy on BUS-BRA, without dataset-specific fine-tuning. Under zero-shot transfer with no clinical text at inference (Protocol III), per-domain AUC ranged from 0.7360 on BUS-UCLM to 0.8060 on BUSI, with BLUI at 0.8020 and BUS-BRA at 0.7930, yielding an unweighted mean of approximately 0.784 across the four unseen targets; although lower than the in-distribution Protocol II results, these values establish a meaningful lower bound for purely visual inference under cross-scanner and cross-population shift.
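
The Protocol III summary figure can be reproduced directly from the per-domain AUC values in Table 6. The short check below uses only those four reported means; the variable names are illustrative, and the average is unweighted (each target domain counts equally regardless of its size).

```python
from statistics import mean

# Mean AUC per unseen target domain, as reported in Table 6.
protocol3_auc = {
    "BUSI": 0.8060,
    "BLUI": 0.8020,
    "BUS-UCLM": 0.7360,
    "BUS-BRA": 0.7930,
}

# Unweighted mean over the four target domains.
mean_auc = mean(protocol3_auc.values())

# Spread between the best and worst target domain.
auc_range = max(protocol3_auc.values()) - min(protocol3_auc.values())
```

The mean rounds to 0.784, and the gap of 0.07 between BUSI and BUS-UCLM gives a sense of how much the zero-shot behavior varies across target scanners.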

From a deployment perspective, the “no data” placeholder strategy allows a single pipeline to handle both multimodal and image-only datasets, which is relevant for clinical sites with variable data completeness. The LCA and GAA stages account for only 3.55 M of the model’s 198 M trainable parameters, and the full configuration trains and runs on a single RTX A6000 (48 GB) workstation rather than requiring cluster-scale resources.
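
A placeholder strategy of this kind can be sketched in a few lines. The snippet below is a hypothetical illustration, not the released code: the helper names, the exact placeholder string, and the decision to treat empty or whitespace-only reports as missing are all assumptions.

```python
from typing import List, Optional, Tuple

# Placeholder text fed to the text encoder when no report exists (assumed form).
NO_DATA_TOKEN = "no data"


def resolve_report(report: Optional[str]) -> str:
    """Return the clinical text, or the placeholder if it is missing or empty."""
    if report is None or not report.strip():
        return NO_DATA_TOKEN
    return report.strip()


def build_batch(samples: List[Tuple[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Map (image_id, report_or_None) pairs to (image_id, text) pairs."""
    return [(image_id, resolve_report(report)) for image_id, report in samples]


batch = build_batch([
    ("case_a", "Irregular hypoechoic mass"),  # multimodal sample
    ("case_b", None),                         # image-only sample
    ("case_c", "   "),                        # empty report treated as missing
])
```

Because every sample ends up with some text input, the same forward pass can serve paired and image-only datasets without a separate code path.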

Despite these results, several limitations remain. First, because the “Normal” (healthy) class was explicitly excluded to ensure metric stability across datasets lacking such samples, our approach is strictly limited to lesion characterization (benign vs. malignant) and cannot perform initial lesion detection or screening. Second, while the “no data” token strategy allows handling missing modalities, it does not fully replace the diagnostic value of real clinical descriptors. Failure modes observed in our qualitative analysis often involved images with extreme acoustic shadowing where visual features were almost entirely obliterated, suggesting that even multimodal alignment has a lower bound of effectiveness in poor-quality scans. Third, Protocol III is reported as a zero-shot cross-domain evaluation against the in-distribution Protocol II baseline and does not include dedicated DG methods such as IRM [56], GroupDRO [57], MixStyle [58] or CORAL [59]; the reported per-domain AUCs should therefore be read as a lower bound on what is achievable under explicit DG objectives rather than as a head-to-head DG benchmark. Fourth, the pooled experiments use an image-level five-fold partition; for the sources that provide patient or case identifiers and contain multiple images per patient (BUS-BRA, BUS-UCLM, and the multi-view BUS-CoT cases), this does not strictly enforce patient-level separation and may overestimate patient-level generalization, so the pooled results are presented as image-level performance and a patient-level re-evaluation with retraining is left for future work. Finally, the source cohorts are individually small (five of the six contain between 232 and 1875 lesion images; Table 1) and originate from a limited number of institutions, so generalization to broader populations, additional vendors, and prospective clinical workflows still requires external validation.
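
The patient-level re-evaluation mentioned in the fourth limitation would require folds in which all images of a patient stay together. The following is a minimal sketch of one way to do this, assuming a patient identifier is available for every image; the function name and the greedy balancing heuristic are illustrative choices, and class stratification is deliberately left out.

```python
from collections import defaultdict
from typing import Hashable, List


def patient_level_folds(patient_ids: List[Hashable], n_folds: int = 5) -> List[int]:
    """Return one fold index per image so that a patient never spans two folds."""
    groups = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        groups[pid].append(idx)

    fold_sizes = [0] * n_folds
    assignment = [0] * len(patient_ids)
    # Greedy balancing: place the largest patients first, each into the
    # currently smallest fold (ties broken by patient id for determinism).
    for pid in sorted(groups, key=lambda p: (-len(groups[p]), str(p))):
        target = fold_sizes.index(min(fold_sizes))
        for idx in groups[pid]:
            assignment[idx] = target
        fold_sizes[target] += len(groups[pid])
    return assignment


# Seven images from four hypothetical patients, split into two folds.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
folds = patient_level_folds(ids, n_folds=2)
```

With such an assignment, images of the same lesion acquired from different views can no longer appear on both sides of a train/test split, which is the leakage path the limitation describes.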
We also note that all metrics are reported as the mean and standard deviation across 5-fold cross-validation runs; formal significance testing (e.g., paired t-tests or bootstrap confidence intervals) between the proposed approach and individual baselines was not performed, and reported numerical differences should be interpreted with this caveat in mind. Likewise, the variance-aware modulation step (Equations (11) and (12)) is reported as a fixed component of the semantic-embedding pipeline; an independent on/off ablation isolating this term was not conducted, and its contribution should therefore be interpreted as part of the overall LCA + GAA design rather than as an individually validated mechanism.
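
One way the missing significance analysis could be approached is a paired bootstrap over test cases. The sketch below is a generic illustration and was not run on the study's data: it assumes per-case correctness indicators (1 for a correct prediction, 0 otherwise) for two models on the same cases, and the function name, resample count, and seed are arbitrary choices.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(correct_a: List[int],
                        correct_b: List[int],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile confidence interval for accuracy(A) minus accuracy(B)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample case indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical correctness vectors for two models on six shared test cases.
ci_low, ci_high = paired_bootstrap_ci([1, 1, 0, 1, 1, 0], [1, 0, 0, 1, 0, 0])
```

If the resulting interval excludes zero, the accuracy difference is unlikely to be explained by test-set sampling alone; an interval that straddles zero would argue for the cautious reading recommended above.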

Future work will focus on integrating additional imaging modalities, such as elastography, and exploring lightweight Large Language Models (LLMs) for real-time interpretability. Additionally, we aim to investigate generative AI techniques (e.g., GANs, Diffusion Models) to synthesize missing modalities, moving beyond static token imputation to further enhance model robustness in data-scarce scenarios.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it exclusively utilizes six publicly available, fully de-identified breast ultrasound datasets (BUS-CoT, BrEaST, BUS-BRA, BUS-UCLM, BLUI, and BUSI) released by their respective original institutions under open-access licenses. No new human or animal data were collected and no direct patient interaction occurred. The study adheres to the ethical principles of the Declaration of Helsinki.

Informed Consent Statement

Patient consent was waived because this retrospective study uses only publicly available, fully anonymized breast ultrasound datasets for which informed consent had already been obtained by the original data providers.

Data Availability Statement

This study uses six publicly available breast ultrasound datasets, which can be accessed from their original sources as cited in the manuscript: BUS-CoT [23], BrEaST [24], BUS-BRA [25], BUS-UCLM [26], BLUI [27], and BUSI [28]. The companion repository for this manuscript is hosted at https://github.com/doganr/DSMA-Breast (accessed on 9 June 2026); the complete source code, training scripts, and harmonized metadata supporting the reported results will be made publicly available at this address upon acceptance of the manuscript.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Siegel, R.L.; Giaquinto, A.N.; Jemal, A. Cancer statistics, 2024. CA Cancer J. Clin. 2024, 74, 12–49.
2. Zhang, Y.; Ji, Y.; Liu, S.; Li, J.; Wu, J.; Jin, Q.; Liu, X.; Duan, H.; Feng, Z.; Liu, Y.; et al. Global burden of female breast cancer: New estimates in 2022, temporal trend and future projections up to 2050 based on the latest release from GLOBOCAN. J. Natl. Cancer Cent. 2025, 5, 287.
3. Lobig, F.; Caleyachetty, A.; Forrester, L.; Morris, E.; Newstead, G.; Harris, J.; Blankenburg, M. Performance of supplemental imaging modalities for breast cancer in women with dense breasts: Findings from an umbrella review and primary studies analysis. Clin. Breast Cancer 2023, 23, 478–490.
4. Hussain, H.K.; Rajeev, R.; DeStigter, K.K.; Gulani, V. Global Cancer Imaging Access: Addressing Barriers and Harnessing Innovations. Radiol. Imaging Cancer 2025, 7, e250019.
5. Lee, C.H.; Dershaw, D.D.; Kopans, D.; Evans, P.; Monsees, B.; Monticciolo, D.; Brenner, R.J.; Bassett, L.; Berg, W.; Feig, S.; et al. Breast cancer screening with imaging: Recommendations from the Society of Breast Imaging and the ACR on the use of mammography, breast MRI, breast ultrasound, and other technologies for the detection of clinically occult breast cancer. J. Am. Coll. Radiol. 2010, 7, 18–27.
6. Fiorica, J.V. Breast cancer screening, mammography, and other modalities. Clin. Obstet. Gynecol. 2016, 59, 688–709.
7. Masoud, R.M.; Bakir, R.M.A.; Saraya, M.S.; Ayyad, S.M. BREAST-CAD: A Computer-Aided Diagnosis System for Breast Cancer Detection Using Machine Learning. Technologies 2025, 13, 268.
8. Szumiejko, A.; Ptak, M.; Dołęga-Kozierowski, B. Computational Methods in Breast Cancer Diagnostics and Surgery Planning: A Review. Arch. Comput. Methods Eng. 2025, 33, 3391–3423.
9. Raza, A.; Ullah, N.; Khan, J.A.; Assam, M.; Guzzo, A.; Aljuaid, H. DeepBreastCancerNet: A novel deep learning model for breast cancer detection using ultrasound images. Appl. Sci. 2023, 13, 2082.
10. Gupta, S.; Agrawal, S.; Singh, S.K.; Kumar, S. A novel transfer learning-based model for ultrasound breast cancer image classification. In Proceedings of the Computational Vision and Bio-Inspired Computing (ICCVBIC 2022); Springer: Berlin/Heidelberg, Germany, 2023; pp. 511–523.
11. Balasubramaniam, S.; Velmurugan, Y.; Jaganathan, D.; Dhanasekaran, S. A modified LeNet CNN for breast cancer diagnosis in ultrasound images. Diagnostics 2023, 13, 2746.
12. Jabeen, K.; Khan, M.A.; Hamza, A.; Albarakati, H.M.; Alsenan, S.; Tariq, U.; Ofori, I. An EfficientNet integrated ResNet deep network and explainable AI for breast lesion classification from ultrasound images. CAAI Trans. Intell. Technol. 2024, 10, 842–857.
13. Nehary, E.A.; Rajan, S. Ultrasound Breast Image Classification through Domain Knowledge Integration into Deep Neural Networks. IEEE Access 2024, 12, 112966–112983.
14. Matondo-Mvula, N.; Elleithy, K. Breast cancer detection with quanvolutional neural networks. Entropy 2024, 26, 630.
15. Wang, H.; Zhang, G.; Zhao, Y.; Lai, F.; Cui, W.; Xue, J.; Wang, Q.; Zhang, H.; Lin, Y. Rpf-eld: Regional prior fusion using early and late distillation for breast cancer recognition in ultrasound images. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: New York, NY, USA, 2024; pp. 2605–2612.
16. Saber, A.; Emara, T.; Elbedwehy, S.; Hassan, E. A novel approach for breast cancer detection using a Nesterov accelerated adam optimizer with an attention mechanism. Sci. Rep. 2025, 15, 27065.
17. Afrin, H.; Larson, N.B.; Fatemi, M.; Alizad, A. Deep learning in different ultrasound methods for breast cancer, from diagnosis to prognosis: Current trends, challenges, and an analysis. Cancers 2023, 15, 3139.
18. Han, J.; Hua, H.; Fei, J.; Liu, J.; Guo, Y.; Ma, W.; Chen, J. Prediction of Disease-Free Survival in Breast Cancer using Deep Learning with Ultrasound and Mammography: A Multicenter Study. Clin. Breast Cancer 2024, 24, 215–226.
19. Sangeetha, S.; Mathivanan, S.K.; Karthikeyan, P.; Rajadurai, H.; Shivahare, B.D.; Mallik, S.; Qin, H. An enhanced multimodal fusion deep learning neural network for lung cancer classification. Syst. Soft Comput. 2024, 6, 200068.
20. Duan, J.; Xiong, J.; Li, Y.; Ding, W. Deep learning based multimodal biomedical data fusion: An overview and comparative review. Inf. Fusion 2024, 112, 102536.
21. Mendelson, E.B.; Berg, W.A.; Merritt, C.R.B. Toward a standardized breast ultrasound lexicon, BI-RADS: Ultrasound. Semin. Roentgenol. 2001, 36, 217–225.
22. Spak, D.A.; Plaxco, J.S.; Santiago, L.; Dryden, M.J.; Dogan, B.E. BI-RADS fifth edition: A summary of changes. Diagn. Interv. Imaging 2017, 98, 179–190.
23. Yu, H.; Li, Y.; Niu, Z.; Zhang, N.; Gong, X.; Li, H.; Zou, Z.; Qi, H.; Cao, Z.; Lan, Z.; et al. A chain-of-thought reasoning breast ultrasound dataset covering all histopathology categories. Sci. Data 2026, 13, 370.
24. Pawłowska, A.; Ćwierz-Pieńkowska, A.; Domalik, A.; Jaguś, D.; Kasprzak, P.; Matkowski, R.; Fura; Nowicki, A.; Żołek, N. Curated benchmark dataset for ultrasound based breast lesion analysis. Sci. Data 2024, 11, 148.
25. Gómez-Flores, W.; Gregorio-Calas, M.J.; Coelho de Albuquerque Pereira, W. BUS-BRA: A breast ultrasound dataset for assessing computer-aided diagnosis systems. Med. Phys. 2024, 51, 3110–3123.
26. Vallez, N.; Bueno, G.; Déniz, Ó.; Rienda, M.Á.; Pastor, C. BUS-UCLM: Breast Ultrasound Lesion Segmentation Dataset. Sci. Data 2025, 12, 242.
27. Ardakani, A.A.; Mohammadi, A.; Mirza-Aghazadeh-Attari, M.; Acharya, U.R. An open-access breast lesion ultrasound image database: Applicable in artificial intelligence studies. Comput. Biol. Med. 2023, 152, 106438.
28. Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Fahmy, A. Dataset of breast ultrasound images. Data Brief. 2020, 28, 104863.
29. Hazarika, M.; Sarmah, S.; Das, P.; Mahanta, L.B. Advancements in Computer-Aided Diagnosis Systems for Mammographic Mass Detection: A Comprehensive Review. In Revolutionizing Healthcare: Impact of Artificial Intelligence on Diagnosis, Treatment, and Patient Care; Springer: Cham, Switzerland, 2025; Volume 1182, pp. 119–144.
30. Murphy, P.; McEntee, M.; Maher, M.; Ryan, M.; Harman, C.; England, A.; Moore, N. Assessment of breast composition in MRI using artificial intelligence—A systematic review. Radiography 2025, 31, 102900.
31. Deb, S.D.; Jha, R.K. Breast UltraSound Image classification using fuzzy-rank-based ensemble network. Biomed. Signal Process. Control 2023, 85, 104871.
32. Islam, M.R.; Rahman, M.M.; Ali, M.S.; Nafi, A.A.N.; Alam, M.S.; Godder, T.K.; Miah, M.S.; Islam, M.K. Enhancing breast cancer segmentation and classification: An Ensemble Deep Convolutional Neural Network and U-net approach on ultrasound images. Mach. Learn. Appl. 2024, 16, 100555.
33. Chegini, M.; Mahlooji Far, A. Uncertainty-aware deep learning-based CAD system for breast cancer classification using ultrasound and mammography images. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2024, 12, 2297983.
34. Chowdary, J.; Yogarajah, P.; Chaurasia, P.; Guruviah, V. A multi-task learning framework for automated segmentation and classification of breast tumors from ultrasound images. Ultrason. Imaging 2022, 44, 3–12.
35. He, Q.; Yang, Q.; Su, H.; Wang, Y. Multi-task learning for segmentation and classification of breast tumors from ultrasound images. Comput. Biol. Med. 2024, 173, 108319.
36. Aumente-Maestro, C.; Díez, J.; Remeseiro, B. A multi-task framework for breast cancer segmentation and classification in ultrasound imaging. Comput. Methods Programs Biomed. 2025, 260, 108540.
37. Nastase, I.N.A.; Moldovanu, S.; Biswas, K.C.; Moraru, L. Role of inter-and extra-lesion tissue, transfer learning, and fine-tuning in the robust classification of breast lesions. Sci. Rep. 2024, 14, 22754.
38. Foleis, V.K.; Andrade, B.A.; Shigihara, H.B.; Lemes, D.A.M.; Picolo, J.G.; Junqueira, B.F.; Sales, G.R.; Corso, V.; Bezerra, C.S. Transfer Learning and Handcrafted Features Ensembles for Ultrasound Breast Cancer Image Classification. Revista de Informática Teórica e Aplicada 2025, 32, 11–17.
39. AlZoubi, A.; Lu, F.; Zhu, Y.; Ying, T.; Ahmed, M.; Du, H. Classification of breast lesions in ultrasound images using deep convolutional neural networks: Transfer learning versus automatic architecture design. Med. Biol. Eng. Comput. 2024, 62, 135–149.
40. Boulenger, A.; Luo, Y.; Zhang, C.; Zhao, C.; Gao, Y.; Xiao, M.; Zhu, Q.; Tang, J. Deep learning-based system for automatic prediction of triple-negative breast cancer from ultrasound images. Med. Biol. Eng. Comput. 2023, 61, 567–578.
41. Yoon, J.S.; Oh, K.; Shin, Y.; Mazurowski, M.A.; Suk, H.I. Domain Generalization for Medical Image Analysis: A Review. Proc. IEEE 2024, 112, 1583–1609.
42. Samala, R.K.; Chan, H.P.; Hadjiiski, L.M.; Helvie, M.A.; Richter, C.D. Generalization error analysis for deep convolutional neural network with transfer learning in breast cancer diagnosis. Phys. Med. Biol. 2020, 65, 105002.
43. Garrucho, L.; Kushibar, K.; Jouide, S.; Diaz, O.; Igual, L.; Lekadir, K. Domain generalization in deep learning based mass detection in mammography: A large-scale multi-center study. Artif. Intell. Med. 2022, 132, 102386.
44. Li, Z.; Cui, Z.; Wang, S.; Qi, Y.; Ouyang, X.; Chen, Q.; Yang, Y.; Xue, Z.; Shen, D.; Cheng, J.Z. Domain generalization for mammography detection via multi-style and multi-view contrastive learning. In Proceedings of the Medical Image Computing and Computer Assisted Intervention (MICCAI); Springer: Berlin/Heidelberg, Germany, 2021; pp. 98–108.
45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
46. Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 72–78.
47. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
48. Yao, Z.; Lin, F.; Chai, S.; He, W.; Dai, L.; Fei, X. Integrating medical imaging and clinical reports using multimodal deep learning for advanced disease analysis. In Proceedings of the 2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE); IEEE: New York, NY, USA, 2024; pp. 1217–1223.
49. Yan, R.; Ren, F.; Rao, X.; Shi, B.; Xiang, T.; Zhang, L.; Liu, Y.; Liang, J.; Zheng, C.; Zhang, F. Integration of multimodal data for breast cancer classification using a hybrid deep learning method. In Proceedings of the 15th International Conference on Intelligent Computing (ICIC); Springer: Berlin/Heidelberg, Germany, 2019; pp. 460–469.
50. Arya, N.; Saha, S. Multi-modal classification for human breast cancer prognosis prediction: Proposal of deep-learning based stacked ensemble model. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020, 19, 1032–1041.
51. Jadoon, E.K.; Khan, F.G.; Shah, S.; Khan, A.; Elaffendi, M. Deep learning-based multi-modal ensemble classification approach for human breast cancer prognosis. IEEE Access 2023, 11, 85760–85769.
52. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557.
53. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 18 May 2026).
54. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
55. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10347–10357. Available online: https://proceedings.mlr.press/v139/touvron21a.html (accessed on 18 May 2026).
56. Arjovsky, M.; Bottou, L.; Gulrajani, I.; Lopez-Paz, D. Invariant Risk Minimization. arXiv 2019, arXiv:1907.02893.
57. Sagawa, S.; Koh, P.W.; Hashimoto, T.B.; Liang, P. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020; Available online: https://openreview.net/forum?id=ryxGuJrFvS (accessed on 18 May 2026).
58. Zhou, K.; Yang, Y.; Qiao, Y.; Xiang, T. Domain Generalization with MixStyle. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021; Available online: https://openreview.net/forum?id=6xHJ37MVxxp (accessed on 18 May 2026).
59. Sun, B.; Saenko, K. Deep CORAL: Correlation Alignment for Deep Domain Adaptation. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Amsterdam, The Netherlands, 8–10, 15–16 October 2016; pp. 443–450.

Figure 1. Schematic diagram of the proposed Dual-Stage Multimodal Alignment approach.

Figure 2. Qualitative visualization of the model’s decision-making process across six diverse malignant cases using Grad-CAM. The panels span different lesion characteristics: (a) a small, subtle lesion; (b) a large, clear mass; (c) an irregular tumor boundary; (d) a standard oval mass; (e) a lesion obscured by acoustic shadowing; and (f) a tumor within complex tissue texture. In all cases, the model accurately focuses on the region of interest (red areas).

Table 1. Summary of the harmonized breast US datasets used in this study. To focus on the binary task of lesion characterization (Benign vs. Malignant) and to keep AUC and PRC computable across folds, ‘Normal’ cases were excluded. The BUS-CoT dataset was further deduplicated to form the LF subset that removes overlap with the other five sources.

Dataset	Total Images	Benign	Malignant	Text	Scanner
BUS-CoT (LF) [23]	3610	1645	1965	✓	Multi-Vendor
BrEaST [24]	252	154	98	✓	Multi-Vendor
BUS-BRA [25]	1875	1268	607	×	4 Scanners
BUS-UCLM [26]	264	174	90	×	Siemens
BLUI [27]	232	109	123	×	AirPlorer
BUSI [28]	647	437	210	×	LOGIQ E9
Total (Harmonized)	6880	3787	3093	-	-

✓ indicates the dataset provides paired clinical text (multimodal input); × indicates an image-only dataset (no clinical text). Bold values in the last row denote the harmonized totals across all datasets.

Table 2. Role of each dataset across the three evaluation protocols.

Dataset	Protocol I	Protocol II	Protocol III
BUS-CoT	Benchmark	Source (pooled)	Training source
BrEaST	–	Source (pooled)	Training source
BUS-BRA	–	Source (pooled)	Unseen target
BUS-UCLM	–	Source (pooled)	Unseen target
BLUI	–	Source (pooled)	Unseen target
BUSI	–	Source (pooled)	Unseen target

Table 3. Performance comparison of baseline techniques and the proposed approach (DeiT + ClinicalBERT) on the independent BUS-CoT benchmark (Protocol I). All performance metrics are reported as the mean ± standard deviation across 5-fold cross-validation.

Modality	Technique	ACC	PRE	REC	F1-Score	AUC
Image	ViT [54]	0.7981 ± 0.0071	0.7969 ± 0.0070	0.7972 ± 0.0076	0.7967 ± 0.0073	0.8721 ± 0.0089
Image	DeiT [55]	0.8019 ± 0.0104	0.8013 ± 0.0103	0.8031 ± 0.0101	0.8013 ± 0.0103	0.8763 ± 0.0044
Text	BERT [45]	0.7661 ± 0.0060	0.7672 ± 0.0057	0.7645 ± 0.0020	0.7638 ± 0.0043	0.8319 ± 0.0006
Text	BioBERT [47]	0.7634 ± 0.0093	0.7646 ± 0.0101	0.7602 ± 0.0053	0.7604 ± 0.0073	0.8288 ± 0.0050
Text	ClinicalBERT [46]	0.7673 ± 0.0080	0.7680 ± 0.0088	0.7643 ± 0.0060	0.7645 ± 0.0070	0.8304 ± 0.0033
Image + Text	ViT + BERT (Baseline)	0.8017 ± 0.0072	0.8011 ± 0.0075	0.8024 ± 0.0083	0.8009 ± 0.0076	0.8746 ± 0.0058
Image + Text	ViT + BioBERT (Baseline)	0.8044 ± 0.0092	0.8033 ± 0.0092	0.8039 ± 0.0096	0.8032 ± 0.0093	0.8724 ± 0.0069
Image + Text	ViT + ClinicalBERT (Baseline)	0.8034 ± 0.0074	0.8031 ± 0.0074	0.8033 ± 0.0078	0.8022 ± 0.0075	0.8724 ± 0.0086
Image + Text	DeiT + BERT (Baseline)	0.8116 ± 0.0097	0.8134 ± 0.0086	0.8147 ± 0.0085	0.8113 ± 0.0095	0.8844 ± 0.0037
Image + Text	DeiT + BioBERT (Baseline)	0.8090 ± 0.0102	0.8083 ± 0.0103	0.8093 ± 0.0101	0.8080 ± 0.0102	0.8831 ± 0.0047
Image + Text	Proposed (Stage I: LCA Only)	0.8068 ± 0.0101	0.8061 ± 0.0102	0.8054 ± 0.0117	0.8052 ± 0.0107	0.8838 ± 0.0048
Image + Text	Proposed (Stage II: GAA Only)	0.8053 ± 0.0145	0.8045 ± 0.0147	0.8063 ± 0.0151	0.8046 ± 0.0147	0.8783 ± 0.0097
Image + Text	Proposed (Dual-Stage)	0.8177 ± 0.0127	0.8169 ± 0.0127	0.8184 ± 0.0130	0.8169 ± 0.0128	0.8852 ± 0.0078
VLM Baseline	Qwen2.5-VL-7B + CoT [23]	0.8064 ± 0.0128	0.7546 ± 0.0179	0.8414 ± 0.0204	0.7956 ± 0.0153	0.8354 ± 0.0128

Bold values indicate the best result in each column; the shaded (gray) row denotes the proposed method.

Table 4. Performance Comparison with State-of-the-Art Techniques across Individual Datasets (Protocol II). Performance metrics for individual datasets are computed over the aggregated out-of-fold predictions across the 5-fold cross-validation, ensuring every image is evaluated exactly once in the test set.

Dataset	Modality	Technique	ACC	PRE	REC	F1	AUC
BUS-CoT	Image + Text	Proposed Approach	0.8021	0.8002	0.8006	0.8004	0.8786
BrEaST	Image + Text	Proposed Approach	0.8052	0.7932	0.7902	0.7916	0.8753
BUS-BRA	Image	Nastase et al. [37]	0.8540	-	-	-	0.9540
BUS-BRA	Image	Gómez-Flores et al. [25]	0.8650	-	0.8590	-	0.9310
BUS-BRA	Image	Proposed Approach	0.8741	0.8568	0.8550	0.8559	0.9246
BUS-UCLM	Image	Proposed Approach	0.8977	0.8980	0.8715	0.8825	0.9305
BLUI	Image	Proposed Approach	0.8124	0.8158	0.8155	0.8124	0.8934
BUSI	Image	Deb et al. [31]	0.8530	-	-	-	-
BUSI	Image	Islam et al. [32]	0.8780	-	-	-	0.9100
BUSI	Image	Chegini et al. [33]	0.9130	0.9010	-	-	-
BUSI	Image	He et al. [35]	0.9440	0.9460	-	-	-
BUSI	Image	Proposed Approach	0.9196	0.9198	0.8947	0.9057	0.9576

Bold values indicate the best result in each column (per dataset); rows labelled “Proposed Approach” correspond to our method.

Table 5. Extended clinical evaluation on the Protocol II out-of-fold predictions. Per-class sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) are reported per dataset, with malignant treated as the positive class. These values provide the per-class decomposition of the macro-averaged Recall and Precision reported in Table 3 and Table 4.

Dataset	Sensitivity	Specificity	PPV	NPV
BUS-CoT	0.8175	0.7837	0.8187	0.7824
BrEaST	0.7227	0.8577	0.7637	0.8294
BUS-BRA	0.8008	0.9092	0.8085	0.9051
BUS-UCLM	0.7892	0.9538	0.8984	0.8974
BLUI	0.7641	0.8669	0.8663	0.7651
BUSI	0.8237	0.9657	0.9202	0.9194
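
The relationship between these per-class values and the macro-averaged recall can be checked with a few lines of code. The sketch below uses the standard confusion-matrix definitions with malignant as the positive class; the function names and the example counts are illustrative, and the final two checks use only values already reported in Table 5 and Table 4.

```python
from typing import Dict


def clinical_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary metrics; assumes both classes and both predictions occur."""
    return {
        "sensitivity": tp / (tp + fn),  # recall of the malignant class
        "specificity": tn / (tn + fp),  # recall of the benign class
        "ppv": tp / (tp + fp),          # precision of the malignant class
        "npv": tn / (tn + fn),          # precision of the benign class
    }


def macro_recall(sensitivity: float, specificity: float) -> float:
    """In the binary case, macro recall is the mean of the two class recalls."""
    return (sensitivity + specificity) / 2


# Hypothetical confusion-matrix counts, for illustration only.
example = clinical_metrics(tp=80, fp=10, tn=90, fn=20)

# Table 5 values for BUS-CoT and BUSI reproduce the REC column of Table 4.
bus_cot_rec = macro_recall(0.8175, 0.7837)  # Table 4 reports 0.8006
busi_rec = macro_recall(0.8237, 0.9657)     # Table 4 reports 0.8947
```

Reading the two tables side by side in this way makes clear that a high macro recall can hide an imbalance between sensitivity and specificity, as in BUSI, where specificity is considerably higher than sensitivity.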

Table 6. Domain Generalization Performance (Protocol III). The proposed approach was trained exclusively on the multimodal source pool (BUS-CoT + BrEaST) and evaluated zero-shot on completely unseen, unimodal target domains. All performance metrics are reported as the mean ± standard deviation across independent runs.

Source Domain	Target Domain	ACC	PRE	REC	F1	AUC
Multimodal Pool (BUS-CoT + BrEaST)	BUSI	0.7530 ± 0.0180	0.7270 ± 0.0200	0.7250 ± 0.0220	0.7240 ± 0.0200	0.8060 ± 0.0070
Multimodal Pool (BUS-CoT + BrEaST)	BLUI	0.7300 ± 0.0160	0.7450 ± 0.0190	0.7340 ± 0.0150	0.7280 ± 0.0170	0.8020 ± 0.0160
Multimodal Pool (BUS-CoT + BrEaST)	BUS-UCLM	0.7170 ± 0.0150	0.6930 ± 0.0260	0.6460 ± 0.0300	0.6490 ± 0.0300	0.7360 ± 0.0360
Multimodal Pool (BUS-CoT + BrEaST)	BUS-BRA	0.6940 ± 0.0560	0.7010 ± 0.0230	0.7200 ± 0.0310	0.6830 ± 0.0500	0.7930 ± 0.0120

Table 7. Ablation on Fusion Strategies. Rows indicate the Stage I (Intermediate) strategy, Columns indicate the Stage II (Final) strategy. All performance metrics are reported as the mean ± standard deviation across 5-fold cross-validation.

Metric	Stage I/II ( $F_{sem}, F_{pool}$ )	||	⊙	+
ACC	||	0.8112 ± 0.0131	0.8095 ± 0.0133	0.8071 ± 0.0135
ACC	⊙	0.8177 ± 0.0127	0.8154 ± 0.0128	0.8132 ± 0.0129
ACC	+	0.8041 ± 0.0136	0.8022 ± 0.0138	0.7985 ± 0.0141
PRE	||	0.8084 ± 0.0132	0.8062 ± 0.0134	0.8041 ± 0.0135
PRE	⊙	0.8169 ± 0.0127	0.8141 ± 0.0129	0.8125 ± 0.0131
PRE	+	0.8015 ± 0.0137	0.7993 ± 0.0139	0.7951 ± 0.0142
REC	||	0.8075 ± 0.0132	0.8053 ± 0.0135	0.8032 ± 0.0137
REC	⊙	0.8184 ± 0.0130	0.8152 ± 0.0132	0.8131 ± 0.0134
REC	+	0.7995 ± 0.0138	0.7971 ± 0.0140	0.7925 ± 0.0144
F1-Score	||	0.8080 ± 0.0131	0.8058 ± 0.0134	0.8035 ± 0.0136
F1-Score	⊙	0.8169 ± 0.0128	0.8146 ± 0.0130	0.8128 ± 0.0132
F1-Score	+	0.8005 ± 0.0137	0.7982 ± 0.0139	0.7941 ± 0.0143
AUC	||	0.8795 ± 0.0091	0.8773 ± 0.0094	0.8752 ± 0.0096
AUC	⊙	0.8852 ± 0.0078	0.8845 ± 0.0081	0.8821 ± 0.0085
AUC	+	0.8731 ± 0.0099	0.8710 ± 0.0102	0.8665 ± 0.0108

Bold values indicate the best result, corresponding to the dual-stage configuration.

Table 8. Ablation on Loss Function Configurations. All performance metrics are reported as the mean ± standard deviation across 5-fold cross-validation.

Loss Configuration	ACC	PRE	REC	F1	AUC
$L_{cls}$ Only	0.7865 ± 0.0143	0.7844 ± 0.0145	0.7885 ± 0.0142	0.7869 ± 0.0144	0.8552 ± 0.0104
$L_{cls} + L_{image} + L_{text}$	0.7964 ± 0.0138	0.7942 ± 0.0139	0.7975 ± 0.0137	0.7958 ± 0.0139	0.8651 ± 0.0098
$L_{cls} + L_{align}$	0.8055 ± 0.0135	0.8031 ± 0.0136	0.8062 ± 0.0134	0.8048 ± 0.0135	0.8735 ± 0.0092
$L_{cls} + L_{image} + L_{text} + L_{align}$	0.8177 ± 0.0127	0.8169 ± 0.0127	0.8184 ± 0.0130	0.8169 ± 0.0128	0.8852 ± 0.0078

Bold values indicate the best result in each column; the shaded (gray) row denotes the full loss configuration adopted in the final model.
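
The full configuration in the last row of Table 8 combines classification terms with an alignment term. A minimal sketch of such a composite objective is given below for a single sample. It is an illustration under explicit assumptions rather than the author's loss: the focal form with γ = 2, the use of focal loss for the image and text heads, the unit weighting of all terms, and every function and parameter name are assumptions.

```python
import math
from typing import List


def focal_loss(p_malignant: float, label: int,
               gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Binary focal loss for one sample; gamma=2 is an assumed default."""
    p_t = p_malignant if label == 1 else 1.0 - p_malignant
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cosine_alignment_loss(v: List[float], t: List[float]) -> float:
    """1 - cosine similarity between a visual and a textual embedding."""
    dot = sum(a * b for a, b in zip(v, t))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    if norm_v == 0.0 or norm_t == 0.0:
        return 1.0  # treat a zero vector as completely unaligned
    return 1.0 - dot / (norm_v * norm_t)


def total_loss(p_fused: float, p_image: float, p_text: float, label: int,
               v: List[float], t: List[float], w_align: float = 1.0) -> float:
    """L_cls + L_image + L_text + w_align * L_align (unit weights assumed)."""
    return (focal_loss(p_fused, label)
            + focal_loss(p_image, label)
            + focal_loss(p_text, label)
            + w_align * cosine_alignment_loss(v, t))


# One hypothetical malignant sample with partially aligned embeddings.
loss = total_loss(0.9, 0.8, 0.7, 1, [1.0, 0.0], [1.0, 1.0])
```

In this form the alignment term is zero only when the two embeddings point in the same direction, so it keeps pulling the modalities together even when all three classification heads are already confident.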
