An Efficient Cross-Modal Interaction and Dynamic Fusion Network for Multimodal Breast Ultrasound Diagnosis

Wu, Xiangqiong; Lan, Yin; Han, Lina; Wang, Peng

doi:10.3390/tomography12070093

¹ School of Computer Science, Hunan First Normal University, Changsha 410205, China

² School of Electronic Information, Hunan First Normal University, Changsha 410205, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Tomography 2026, 12(7), 93; https://doi.org/10.3390/tomography12070093 (registering DOI)

Submission received: 23 April 2026 / Revised: 17 June 2026 / Accepted: 18 June 2026 / Published: 25 June 2026

(This article belongs to the Special Issue Imaging in Cancer Diagnosis)

Simple Summary

Multimodal breast ultrasound provides complementary information for lesion characterization, while effective integration remains challenging due to heterogeneous feature distributions, limited cross-modal interaction, and sensitivity to noise and missing data. This study presents an efficient framework that enhances modality-specific features, enables early-stage cross-modal interaction, and adaptively fuses multimodal information. The proposed method reduces computational cost while achieving high area under the curve, and shows reduced performance variation under noisy test conditions. These results provide preliminary insights that may support further research in multimodal medical imaging.

Abstract

Background: Multimodal breast ultrasound, including B-mode imaging, color Doppler flow imaging, and elastography, provides complementary information for lesion characterization. However, effectively integrating heterogeneous modalities remains challenging due to inconsistent feature distributions, limited cross-modal interaction, the computational cost of existing methods, and sensitivity to noise and missing data. Methods: We present an efficient Cross-Modal Interaction and Dynamic Fusion Network (CIDFNet) for multimodal breast ultrasound analysis. The framework integrates a multi-scale feature enhancement module to improve modality-specific representations, a cross-modal interaction module to enable early-stage feature exchange across modalities, and a dynamic fusion strategy to adaptively combine modality information based on feature reliability estimation. In addition, an invertible neural network is incorporated to reconstruct missing modality features during training. Results: Experiments on an internal dataset of 248 patients with 1532 images show that CIDFNet obtains an AUC of 85.69%, accuracy of 75.51%, recall of 50.00%, F1-score of 62.50%, and precision of 83.33%, while requiring 49.51 M parameters and 79.79 G FLOPs. Under a simplified Gaussian noise perturbation setting, performance degradation is observed. Conclusions: CIDFNet presents a framework for multimodal breast ultrasound analysis that reflects a trade-off between performance and computational efficiency.

Keywords: breast ultrasound; cross-modal attention; feature enhancement; dynamic fusion

1. Introduction

Breast cancer remains one of the most common malignancies among women globally. According to 2022 global cancer statistics [1], the disease accounted for approximately 2.3 million new cases and 680,000 deaths. Across all cancer types, its incidence now ranks second worldwide, surpassed only by lung cancer. Early diagnosis and timely intervention are essential for improving patient prognosis and reducing breast cancer mortality [2]. Among current diagnostic modalities, ultrasound imaging is widely used in practice due to its real-time acquisition, non-invasiveness, lack of radiation exposure, and low cost [3]. Compared with imaging methods such as mammography that involve ionizing radiation, ultrasound examination has higher repeatability and is more acceptable to patients. It is particularly suitable for the screening of dense breast tissue. Moreover, ultrasound equipment is portable and can be operated bedside, which has unique advantages in primary medical care and areas with limited resources. With the development of portable ultrasound and artificial intelligence-assisted diagnostic technologies, the application value of ultrasound in the early screening of breast cancer has further increased. In particular, multimodal ultrasound imaging, including B-mode ultrasound, color Doppler flow imaging (CDFI) and ultrasound elastography (UE), provides complementary information on lesion morphology, vascularity, and tissue stiffness [4,5], as shown in Figure 1. Despite these advantages, effectively integrating heterogeneous information across modalities remains a challenging task. The differences in imaging mechanisms from different modalities often lead to inconsistent feature distributions [6], while existing methods tend to rely on simple weighted summation or concatenation to achieve cross-modal feature fusion [7]. Furthermore, medical imaging data is often affected by noise and incomplete modal acquisition, which further increases the difficulty [8]. 
These limitations highlight the necessity of designing a multimodal learning framework to fully utilize the complementary information among various modalities and adapt to imaging data with significant sample differences.

With the continuous development of deep learning, breast ultrasound diagnosis methods have attracted wide attention [9]. However, these existing methods mostly rely on a single strategy for handling missing modalities, which adapts poorly to varied missing-modality scenarios and offers limited robustness [10]. To improve the accuracy of diagnosis, multi-modal medical image analysis has shown progress [11,12,13]. For instance, Wang et al. [14] proposed FESCA, which realizes effective fusion of pathological images and genomic data through feature enhancement and semantic alignment strategies. Misra et al. [15] constructed a weighted multi-modal U-Net and achieved synergistic optimization in the joint segmentation and classification of BUS and elastography ultrasound. Yan et al. [16] proposed TDFNet, which introduces contrastive clustering loss and invertible neural networks, showing promising performance in dynamic feature fusion and missing modality recovery. Although the existing methods have made significant progress, obvious shortcomings remain. Most existing methods adopt a separate architecture with single-modal feature extraction followed by late fusion, where information interaction between modalities occurs only at the decision level, making it difficult to achieve mutual guidance and complementarity in the early stage of feature learning [17]. Meanwhile, existing methods lack targeted enhancement of key diagnostic regions in ultrasound images, such as lesion edges, blood-rich areas, and abnormally stiff areas, leading to insufficient distinguishability and stability of the model in feature representation [18]. In the field of multimodal breast ultrasound analysis, some current studies employ the Vision Transformer (ViT) as the backbone network for feature extraction and fusion [19,20,21]. However, ViT suffers from a large number of parameters and high inference latency, making it challenging to deploy in resource-constrained environments [22].

To address the above limitations, we propose an efficient Cross-Modal Interaction and Dynamic Fusion Network (CIDFNet). The model uses InceptionV3, which is lighter than ViT, as the shared backbone to extract basic multi-modal features, and then enhances key single-modal regions via the multi-scale feature enhancement module. The cross-attention interaction module is then designed to achieve early and deep information interaction among modalities. Finally, the dynamic fusion module and invertible neural network collaboratively complete uncertainty-weighted fusion and missing modality recovery. These modules work together to balance model accuracy and computational cost. Experiments show that the proposed method obtains a reasonable balance between performance and computational cost, while exhibiting only limited robustness under simulated challenging conditions. The main contributions of this paper are summarized as follows:

1. We propose CIDFNet, a unified framework that integrates multi-scale feature enhancement and cross-modal interaction for multimodal breast ultrasound diagnosis. The proposed model obtains competitive performance while substantially reducing computational cost, reflecting a trade-off between accuracy and efficiency.
2. We introduce a cross-attention interaction module and combine it with a trusted dynamic fusion module and an invertible neural network for feature-adaptive fusion and reconstruction, enabling cross-modal integration.
3. Extensive experiments suggest that CIDFNet shows a trade-off between performance and computational cost, though its robustness is limited under simulated Gaussian noise conditions of varying strength.

2. Related Work

In recent years, multimodal breast ultrasound diagnosis based on deep learning has attracted widespread attention due to its ability to effectively integrate complementary information from different imaging modalities [23]. This section systematically reviews the existing research from three aspects: multimodal fusion strategies, cross-modal interaction mechanisms, and network design with reduced computational cost. Firstly, it reviews the technological evolution of decision-level, feature-level, and adaptive fusion paradigms. Secondly, it analyzes the application status of the attention mechanism in cross-modal feature interaction [24]. Finally, it summarizes computationally efficient model architectures suitable for medical imaging scenarios [25]. Through this review, it can be observed that when adapting to the joint diagnosis of BUS, CDFI, and UE modalities, the existing methods adopt a paradigm of extracting independently and then fusing. As a result, the cross-modal complementary information fails to interact fully. Additionally, the imaging characteristics of the three modalities are significantly different, which further increases the difficulty of feature alignment and fusion, and the overall model’s computational efficiency requires further improvement.

2.1. Multi-Modal Learning

In the field of medical image analysis, deep learning has made significant progress and demonstrated superior performance in various tasks, including nodule recognition [26], tumor segmentation [27], lesion grading [28], and mammography image classification [29]. However, most existing studies are based on single-modal data, which makes it difficult to describe the multi-dimensional features of complex lesions, thus limiting further improvement in analysis performance. In contrast, multi-modal imaging provides complementary information from different imaging mechanisms, which has advantages in improving lesion representation. Therefore, intelligent cancer analysis methods based on multi-modal data have attracted increasing attention due to their significant potential in assisting decision-making [30,31,32]. Building upon the advantages of multi-modal imaging, a critical challenge lies in how to effectively integrate heterogeneous information from different modalities. To this end, multi-modal fusion strategies have evolved into three representative technical paradigms.

The first type is decision-level fusion [33], which independently infers for each modality and combines outputs through statistical rules or voting mechanisms. Although this method is easy to implement, it cannot establish deep cross-modal feature correlations, resulting in limited fusion capabilities. The second paradigm is feature-level fusion [6], where multi-modal features are connected or weighted in the middle layers of the network. Compared to decision-level fusion, it provides stronger representation capabilities. However, it often struggles to handle modal heterogeneity, resulting in inconsistent features, redundancy, and potential conflicts. The third paradigm is adaptive fusion [16], which dynamically learns modality-specific weights and relationships between modalities to achieve more flexible and robust feature integration. By adjusting the contribution of each modality according to different lesion characteristics and imaging patterns, this method has become the main strategy for multimodal breast ultrasound diagnosis, and has shown good adaptability under various image conditions. For example, Huang et al. [34] proposed the AW3M framework, which integrates three-modal features through a self-weighting strategy. Fang et al. [35] introduced Swin Transformer into multi-modal ultrasound diagnosis, using shifted window attention to capture cross-modal interactions. Mo et al. [36] developed HoVer-Trans, which combines anatomical prior knowledge for diagnosis without ROI annotation. TDFNet [16] further proposed a dynamic feature fusion module, in which modal weights are adjusted based on uncertainty estimated via a Dirichlet distribution to improve fusion robustness. Despite these advances, most existing methods still follow the paradigm of independent feature extraction and late-stage fusion. Therefore, cross-modal interactions are often delayed, and supplementary information in the early stages is not fully utilized. 
This limitation weakens the model's ability to handle image noise and variation in lesion size, and it makes classification results prone to fluctuation on test samples that differ substantially from one another.

2.2. Cross-Modal Interaction and Fusion

Although the aforementioned multi-modal fusion strategies have shown promising performances, a fundamental challenge remains in effectively modeling cross-modal interactions. In particular, the discriminability of single-modal representations largely determines the upper bound of fusion performance. Therefore, the attention mechanism is an important approach for enhancing significant features and facilitating cross-modal information interaction.

In the area of single-modal feature enhancement, Hu et al. [37] introduced SENet, which adopts a multi-crop fusion strategy and recalibrates channel-level features through squeeze-and-excitation mechanisms. However, this work used ResNet as the backbone and focused only on single-modal optimization, without involving cross-modal interaction. Roy et al. [38] further proposed the scSE module, which incorporates concurrent spatial and channel recalibration into the segmentation network via a block fusion strategy. This approach uses SD-Net as the backbone and has no dedicated mechanism for handling missing modalities. Cai et al. [39] developed CGDMNet, which uses a Transformer fusion strategy and achieves cross-modal interaction at the intermediate stage, with MAL as the backbone network. Liu et al. [40] proposed LHNet, which employs information fusion and multi-scale sliding window attention to balance long-range dependencies and computational efficiency. However, this method relies on a ViT backbone and performs fusion at the late stage. None of these methods considers the scenario of missing modalities, and ViT-like architectures incur high computational overhead.

To address the issue of cross-modal interaction, several studies [24,41,42] have explored fusion methods based on the attention mechanism. Ghantasala et al. [24] proposed HXM-Net, which adopts a Transformer fusion strategy and uses ViT as the backbone to complete cross-modal interaction at the intermediate stage. It integrates multimodal breast ultrasound images, utilizes transfer learning to enhance the generalization ability across datasets, and combines interpretable artificial intelligence to improve the interpretability of clinical decisions. Dong et al. [41] developed DCFAN, which adopts a channel fusion strategy and uses CNNs as the backbone to complete cross-modal interaction at the intermediate stage, integrating spatial features of B-mode ultrasound and stiffness features from shear wave elastography. Xu et al. [42] introduced MSFT-Net, which adopts an SCAM fusion strategy and uses TimeSformer as the backbone to achieve feature aggregation at the intermediate stage, improving the fusion efficiency by selectively retaining meaningful query-key interactions. Mondol et al. [43] designed the KGCML network, which adopts an MM-CAF fusion strategy and uses ResNet50 as the backbone to achieve hybrid dual-stage fusion, combining H-scan, Nakagami parameter images, and synthetic breast ultrasound images. Qian et al. [44] developed BMU-Net, which adopts a Transformer fusion strategy and uses ResNet18 as the backbone to achieve hybrid fusion as well, using random masking training to handle incomplete modalities. Yan et al. [16] proposed TDF-Net, which adopts a TDFM dynamic weighted fusion strategy and uses ResNet50 as the backbone to complete fusion at the intermediate stage, and for the first time introduced an invertible neural network (INN) to achieve feature-level reconstruction of missing modalities. When transferred directly to the BUS-CDFI-UE trimodal task under a constrained parameter budget, these methods generally suffer from insufficient adaptability and poor computational efficiency.

Table 1 summarizes the differences between existing methods and the proposed CIDFNet from five aspects: fusion strategy, interaction timing, backbone network, computational cost, and missing modality handling. The early methods SENet and scSE only focus on single-modal optimization without cross-modal interaction. CGDMNet and LHNet adopt intermediate and late fusion, respectively, but their interaction stages are too late to establish effective feature correlations in shallow networks. HXM-Net, DCFAN, and MSFT-Net achieve cross-modal interaction at the intermediate level and outperform single-modal approaches. However, HXM-Net and LHNet employ ViT-based backbones, resulting in higher computational costs. KGCML adopts a hybrid fusion strategy that combines early and intermediate fusion, which enhances interaction capability but does not handle missing modalities. BMU-Net uses random masking training to handle missing modalities, but it does not perform feature-level reconstruction. TDF-Net is the first to introduce invertible neural networks for missing modality reconstruction. However, its backbone is ResNet50 with FLOPs as high as 364.882 G, leading to high deployment costs. In contrast, CIDFNet adopts a hybrid fusion strategy, achieving early-intermediate cross-stage feature interaction. It adopts InceptionV3 as the backbone and also employs invertible neural networks to restore missing modalities. With only 49.51 M parameters and 79.789 G FLOPs, CIDFNet strikes a trade-off among AUC, accuracy, and computational cost.

2.3. Lightweight Network

Optimizing the overall performance of multimodal breast ultrasound models requires not only cross-modal interaction but also computational efficiency. Consequently, various lightweight network architectures have been developed. The MobileNet series [45,46] reduces computational cost via depthwise separable convolutions while maintaining feature extraction capability. Xie et al. [47] proposed a multi-scale feature fusion shuffle network, utilizing dilated convolution to capture diverse features, with its core components built upon ShuffleNetV2 units. Choi et al. [48] enhanced the U-Net encoder with an Inception architecture to address feature extraction deficiencies in breast ultrasound segmentation.

For breast ultrasound applications, several task-specific lightweight designs have also been explored. Chen et al. proposed GDUNet [49], which adapts to limited computational resources by simplifying network structures and freezing shallow layers. Wu et al. introduced UltraLight VM-UNet [50], which balances diagnostic accuracy and efficiency through parallel lightweight branches. HAAU-Net [27] combined multi-scale adaptive attention with spatial-channel attention to enable efficient, real-time segmentation with low computational cost, which can be deployed on standard clinical equipment. These studies provide valuable insights for lightweight backbone design in medical imaging. However, most existing methods are mainly designed for bimodal inputs, making them difficult to directly adapt to the joint feature modeling across BUS, CDFI, and UE modalities. Moreover, the general lightweight backbone lacks optimization for the low contrast and multiple lesion features of breast ultrasound images, often resulting in insufficient feature extraction and compromised multi-modal fusion.

3. Materials and Methods

This section systematically introduces the dataset composition, experimental setup, evaluation metrics, and network architecture of CIDFNet.

3.1. Datasets

The multimodal breast ultrasound dataset used in this study was collected from the Second Affiliated Hospital of Harbin Medical University. It was originally established by Yan et al. [16] and has been widely applied to breast cancer diagnosis research. The dataset contains 1532 breast ultrasound images from 248 patients, covering three modalities: BUS, CDFI, and UE. The age of the patients ranges from 14 to 85 years, with an average age of 46 years. Among them, 145 cases are benign, and 103 are malignant, with all diagnoses confirmed by pathological examination. Fibroadenoma constitutes the majority of benign lesions, whereas invasive ductal carcinoma is the most prevalent malignant type. All images were acquired by experienced radiologists using high-end ultrasound devices from manufacturers such as Philips and Siemens. In this work, the dataset is randomly divided at the patient level into training, validation, and test sets with a ratio of 60%, 20%, and 20%, respectively, to avoid data leakage. All images are resized to a unified resolution of $224 \times 224$ pixels.
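A patient-level split of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `patient_id` record field, and the use of rounding to turn the 60/20/20 ratios into counts are assumptions, and the actual subset sizes are not stated in the text.

```python
import random


def split_patients(patient_ids, ratios=(0.6, 0.2, 0.2), seed=42):
    """Split unique patient IDs into train/val/test so no patient spans two sets."""
    ids = sorted(set(patient_ids))   # deterministic order before shuffling
    rng = random.Random(seed)        # local RNG; 42 matches the seed quoted in Section 3.2
    rng.shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def images_for(records, patients):
    """Keep every image record whose patient belongs to the given subset."""
    keep = set(patients)
    return [r for r in records if r["patient_id"] in keep]


# Hypothetical IDs standing in for the 248 patients of the dataset.
all_ids = [f"P{i:03d}" for i in range(248)]
train_ids, val_ids, test_ids = split_patients(all_ids)
```

Splitting on patient IDs first and selecting images afterwards is what prevents leakage: all BUS, CDFI, and UE images of one patient land in the same subset.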

3.2. Implementation Details

All experiments were implemented using the PyTorch 2.4.1 framework on a single NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The backbone is initialized with InceptionV3 weights pre-trained on the ImageNet dataset, which also helps stabilize training. To facilitate convergence and adapt to the characteristics of breast ultrasound images, the parameters of the first eight layers are frozen during training. All hyperparameters are tuned on the fixed validation set. The model is trained for 150 epochs with a batch size of 4, an initial learning rate of $2 \times 10^{-5}$, and a weight decay coefficient of $1 \times 10^{-5}$. The Adam optimizer is employed, and the learning rate is kept constant throughout training to ensure stable optimization. For reproducibility, the random seed is fixed at 42.

To mitigate overfitting, several data augmentation strategies are applied during training, including random horizontal and vertical flips, random brightness and contrast perturbations, and random Gaussian blur. During validation and testing, only resizing and center cropping are performed to ensure consistent evaluation. In addition, all modality images are standardized to have zero mean and unit variance, reducing distribution discrepancies across modalities. During the testing phase, inference is performed with a batch size of 1, and a uniform decision threshold of 0.5 is applied. All ablation studies, comparative experiments, and noise perturbations utilize the fixed test set and are evaluated by loading the best model weights saved during the training process of each model. The entire experimental configuration remains consistent to ensure fair and comparable results.

3.3. Evaluation Metrics

To comprehensively evaluate the proposed method, five widely used metrics are adopted, including the Area Under Curve (AUC), accuracy, recall, precision, and F1-score. Their definitions are given as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$

$$\text{recall} = \frac{TP}{TP + FN}, \tag{2}$$

$$\text{precision} = \frac{TP}{TP + FP}, \tag{3}$$

$$\text{F1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \tag{4}$$

where $TP$, $FP$, $FN$, and $TN$ denote the numbers of true positives, false positives, false negatives, and true negatives, respectively. The AUC is computed based on the receiver operating characteristic (ROC) curve, which reflects the trade-off between true positive rate and false positive rate across different decision thresholds.
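The four threshold-based metrics can be computed directly from confusion-matrix counts, as in the sketch below. The counts in the example are hypothetical: the paper does not report its confusion matrix, and these values were chosen only because they round to the percentages quoted in the abstract.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Eqs. (1)-(4) from confusion-matrix counts, guarding against zero division."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts (not taken from the paper) for a 49-case test set.
metrics = classification_metrics(tp=10, fp=2, fn=10, tn=27)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these illustrative counts the model rarely raises a false alarm (high precision) but misses half of the positive cases (recall of 50%), which is the pattern that the reported precision, recall, and F1-score jointly describe.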

3.4. Overall Framework

To utilize the complementary information across modalities, we propose CIDFNet, an efficient multimodal learning framework for breast ultrasound diagnosis. As shown in Figure 2, CIDFNet adopts a progressive pipeline that integrates feature enhancement, cross-modal interaction, and adaptive fusion. It consists of six modules: a shared backbone network, a multi-scale feature enhancement module (MSFE), a cross-attention interaction module (CAIM), a global-local dual-branch feature extractor (GLDE), a trusted dynamic fusion module (TDFM), and an invertible neural network (INN) for missing modality restoration. Specifically, the GLDE, TDFM, and INN adopt the established architecture from TDFNet [16], which has proven effective in dynamic feature fusion and missing modality recovery. Building upon this, the main innovations of the proposed model are threefold. First, the proposed MSFE and CAIM enable fine-grained single-modal feature enhancement and early cross-modal interaction, respectively. Second, the backbone network is optimized by replacing the original ResNet50 with InceptionV3 to reduce computational overhead. Third, taking into account the specific characteristics of multimodal breast ultrasound, these modules are jointly optimized to maintain the original advantages while achieving a good balance between performance and computational cost.

Given multimodal ultrasound inputs, a shared backbone is first employed to extract modality-specific feature representations with consistent dimensions. These features are then enhanced by the MSFE module, which is designed to improve discriminative regions. The enhanced features are subsequently fed into the CAIM module, where early-stage cross-modal interaction is performed to capture complementary information across modalities. To further improve representation capability, the GLDE module jointly models global structural patterns and local fine-grained details. Finally, the TDFM module adaptively integrates modality-specific features by estimating their relative reliability, producing a robust fused representation for classification. During training, the INN-based reconstruction mechanism proposed in TDFNet [16] is introduced. Random modality masking is applied as a regularization strategy, forcing the INN to learn accurate cross-modal mappings and enhancing feature alignment between modalities.

3.5. Shared Backbone

To balance feature representation capability and computational efficiency, we adopted InceptionV3 pre-trained on ImageNet as the shared backbone. Benefiting from its multi-branch design and factorized convolutions, InceptionV3 can capture multi-scale features while maintaining a relatively low computational cost.

To adapt the backbone for medical imaging tasks, its auxiliary classifier branch is removed, and the parameters of the first eight layers are frozen to preserve general low-level features and stabilize training. All modalities share the same backbone weights to ensure consistent feature extraction. The backbone outputs feature maps with 768 channels, which are then reduced to 512 channels via a $1 \times 1$ convolution. This operation not only reduces computational overhead but also enables cross-channel feature interaction. Subsequently, an adaptive average pooling layer is applied to yield feature maps of size $28 \times 28$, ensuring consistent spatial resolution across inputs. The resulting features are denoted as $f_{m}^{(0)} \in \mathbb{R}^{B \times 512 \times 28 \times 28}$.

3.6. Multi-Scale Feature Enhancement Module

The quality of single-modality features often affects subsequent fusion. However, breast ultrasound images have significant heterogeneity and multi-scale lesion features. To address this, we proposed the MSFE module to enhance feature representation.

As shown in Figure 3, given the input features $X \in \mathbb{R}^{B \times N \times C}$, where $B$ is the batch size, $N$ is the feature sequence length, and $C$ is the number of channels, we first perform adaptive feature modulation:

$$X_{mod} = \mathrm{LN}(X) \cdot S_{1} + X \cdot S_{2}, \tag{5}$$

where $\mathrm{LN}$ denotes layer normalization, which is applied to the feature sequence dimension to adjust the feature distribution of each channel to a standard distribution, alleviating feature distribution shift. $S_{1}$ and $S_{2}$ are trainable scaling parameters, which perform channel-wise modulation on normalized features and original features, respectively. This design balances normalized and original features, improving feature stability and cross-modal distribution alignment.

The modulated features are then linearly projected and reshaped into spatial feature maps, and this process can be formulated as follows:

$$X_{sp} = \mathrm{Reshape}(\mathrm{Linear}(X_{mod}), (B, C', H, W)), \tag{6}$$

where $C'$ is the number of output channels of the linear layer and $H = W = \sqrt{N}$, requiring the original feature sequence length $N$ to be a perfect square, which can be satisfied by adaptive pooling after the backbone output.
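The reshaping step and its perfect-square requirement can be illustrated for a single sample with plain nested lists. This is only a sketch: the function name is hypothetical, the linear projection is omitted, and row-major ordering of the tokens is an assumption that the text does not specify.

```python
import math


def sequence_to_map(tokens):
    """Reshape N tokens (each a list of C' channel values) into a C' x H x W nested list.

    Assumes row-major token order, i.e. token index = row * W + column.
    """
    n = len(tokens)
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError(f"sequence length {n} is not a perfect square")
    channels = len(tokens[0])
    return [
        [[tokens[r * side + c][ch] for c in range(side)] for r in range(side)]
        for ch in range(channels)
    ]


# A 28 x 28 pooled map flattens to N = 784 tokens, so H = W = 28.
assert math.isqrt(28 * 28) == 28

# Toy sample: 4 tokens with 2 channels each become two 2 x 2 maps.
toy = [[i, 10 * i] for i in range(4)]
maps = sequence_to_map(toy)
```

A sequence length such as 783 would raise the `ValueError`, which is why the pooled spatial size after the backbone has to be fixed in advance.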

To capture multi-scale spatial features, three parallel depthwise convolution branches with kernel sizes $3 \times 3$, $5 \times 5$, and $7 \times 7$ are employed to model fine-grained textures, structural patterns, and global morphology, respectively. Their outputs are averaged and then added to the input through a residual connection. Because the average assigns identical weights to all scales, it introduces no additional learnable parameters and so mitigates the associated risk of overfitting. The residual connection preserves the original spatial details and alleviates information loss during multi-scale feature aggregation. The process is summarized as follows:

$$F_{agg} = \frac{\mathrm{DWConv}_{3 \times 3}(X_{sp}) + \mathrm{DWConv}_{5 \times 5}(X_{sp}) + \mathrm{DWConv}_{7 \times 7}(X_{sp})}{3} + X_{sp}. \tag{7}$$

Finally, the aggregated features are projected back with a linear transformation and added to the module input through a residual connection:

$$X_{out} = X + \mathrm{Linear}(F_{agg}). \tag{8}$$

This design enhances discriminative features while preserving original semantic information, providing inputs for subsequent cross-modal interaction.
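The multi-scale aggregation of Eq. (7) can be sketched for a single channel as follows. This is a simplified illustration under stated assumptions: zero padding and stride 1 are assumed so that every branch keeps the input size, the kernels are hypothetical placeholders rather than trained weights, and the helper names are not from the paper. In a depthwise convolution each channel is filtered by its own kernel, so handling one channel is enough to show the idea.

```python
def depthwise_conv_same(fmap, kernel):
    """Filter one channel (H x W nested list) with a k x k kernel, zero padding, stride 1.

    Like deep learning frameworks, this computes a cross-correlation (no kernel flip).
    """
    h, w, k = len(fmap), len(fmap[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:   # positions outside the map count as zero
                        acc += kernel[di][dj] * fmap[y][x]
            out[i][j] = acc
    return out


def multi_scale_aggregate(fmap, kernels):
    """Eq. (7) for one channel: unweighted mean of the branch outputs plus the residual input."""
    branches = [depthwise_conv_same(fmap, k) for k in kernels]
    h, w = len(fmap), len(fmap[0])
    return [
        [sum(b[i][j] for b in branches) / len(branches) + fmap[i][j] for j in range(w)]
        for i in range(h)
    ]


def identity_kernel(k):
    """Hypothetical k x k kernel with a single 1 at the centre, used only for the demo."""
    return [[1.0 if (r == c == k // 2) else 0.0 for c in range(k)] for r in range(k)]


x_sp = [[1.0, 2.0], [3.0, 4.0]]
f_agg = multi_scale_aggregate(x_sp, [identity_kernel(3), identity_kernel(5), identity_kernel(7)])
# With identity kernels every branch returns x_sp, so f_agg equals 2 * x_sp.
```

The mean over branches adds no parameters of its own; only the three kernels per channel would be learned.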

3.7. Cross-Attention Interaction Module

In multimodal breast ultrasound diagnosis, single-modal features have limited information coverage and can hardly describe the multidimensional characteristics of lesions comprehensively. To enable information interaction across modalities, we introduce a cross-attention interaction module (CAIM). Unlike late fusion, this module performs early-stage interaction through cross-attention, allowing each modality to incorporate complementary information from the others. The multi-head attention mechanism makes this interaction and the resulting feature enhancement adaptive.

As illustrated in Figure 4, for a given modality, its feature is treated as the query Q, while the features from other modalities are concatenated to form the key K and value V. The module first performs layer normalization on the query features Q to stabilize the training process. Subsequently, the features Q, K, and V are linearly projected into the attention calculation space and input into the multi-head attention mechanism, enabling the model to learn cross-modal correlation and diversity from different modality features, further enhancing the learning ability and expressiveness of the model. The correlation between Q and K is calculated via scaled dot-product, and a scaling factor is introduced to mitigate numerical instability caused by high-dimensional feature representations, ensuring stable attention estimation. Then, the correlation matrix is normalized via softmax to produce attention weights, which are used to aggregate the value features V, allowing the model to selectively incorporate informative cues from other modalities while suppressing noise, and the core formula is shown as follows:

Attn (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}}) V,

(9)

where d_{k} refers to the feature dimension of a single attention head.

The aggregated features are first projected back to the original feature dimension via a linear transformation. A residual connection is then introduced by adding the projected features to the query feature Q, yielding the final enhanced representation. The core formulation is given as follows:

F_{out} = Q + Linear (Attn (Q, K, V)) .

(10)

Through this design, each modality retains its intrinsic characteristics while integrating complementary features from other modalities, resulting in more informative feature representations.
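The attention and residual steps in Eqs. (9) and (10) can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the authors' implementation: it uses a single head, omits the layer normalization and the learned Q/K/V projections described above, and the token values and the identity output projection `w_out` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """Eq. (9): softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def caim_block(q, k, v, w_out):
    """Eq. (10): F_out = Q + Linear(Attn(Q, K, V)); the bias is omitted here."""
    attended, weights = cross_attention(q, k, v)
    projected = matmul(attended, w_out)
    out = [[a + b for a, b in zip(q_row, p_row)] for q_row, p_row in zip(q, projected)]
    return out, weights

# Hypothetical toy tokens: 2 query tokens from one modality and 3 key/value
# tokens standing in for the concatenated features of the other modalities.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[0.5, 0.1], [0.2, 0.9], [0.4, 0.4]]
w_out = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for illustration only

f_out, attn = caim_block(q, k, v, w_out)
```

Each row of `attn` is a probability distribution over the key tokens, so the update added to a query token is a convex combination of the value rows, and the residual term keeps the original query feature intact.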

3.8. Multi-Task Joint Loss

To collaboratively optimize classification performance, cross-modal consistency, and feature alignment, we introduced a multi-task joint loss function consisting of contrastive clustering loss, modality transformation loss, and uncertainty-aware fusion loss, and the total loss is defined as follows:

L_{total} = ω_{clu} L_{clu} + ω_{mod} L_{mod} + ω_{uaf} L_{uaf},

(11)

where ω_{clu}, ω_{mod}, and ω_{uaf} are weight coefficients that balance the impact of the different task objectives on model training. These coefficients are empirically determined via grid search on the validation set from the fixed data split, and are set to 0.5, 2.5, and 1.5, respectively.
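As a small illustration of Eq. (11), the weighted sum can be written as follows. The default weights are the values reported above; the three loss values passed in are hypothetical placeholders and are not results from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    """Weights of Eq. (11), in the order reported in the text."""
    clu: float = 0.5
    mod: float = 2.5
    uaf: float = 1.5

def total_loss(l_clu, l_mod, l_uaf, w=LossWeights()):
    """Weighted sum of the clustering, modality transformation, and fusion losses."""
    return w.clu * l_clu + w.mod * l_mod + w.uaf * l_uaf

# Hypothetical per-batch loss values, for illustration only.
example = total_loss(l_clu=0.8, l_mod=0.1, l_uaf=0.6)
```

With these weights, the modality transformation term is scaled five times more strongly than the clustering term, so even a small reconstruction mismatch contributes noticeably to the total.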

The contrastive clustering loss L_{clu} is designed to enforce semantic consistency across modalities while preserving class-level discriminability. Given that different modalities of the same patient describe a shared pathological entity, their representations are expected to be aligned in the latent space despite appearance heterogeneity. This loss consists of a supervised clustering term and an unsupervised alignment term:

L_{clu} = \frac{α}{2} L_{s} + α L_{u},

(12)

where α is a balancing coefficient set to 0.5. The supervised clustering loss L_{s} uses learnable cluster centers C to pull the features of each modality closer to the cluster center of their corresponding category, and is formulated as:

L_{s} = - \frac{1}{M N} \sum_{i = 1}^{M} \sum_{j = 1}^{N} [y_{j} log (p_{i j}) + (1 - y_{j}) log (1 - p_{i j})],

(13)

where M and N denote the number of modalities and samples, respectively, y_{j} is the true label of the j-th sample, and p_{i j} = Softmax (C f_{i j}) represents the predicted probability of the j-th sample in the i-th modality. To further enforce cross-modal consistency, an unsupervised clustering loss based on Sinkhorn–Knopp normalization [51] was introduced to swap clustering predictions between modalities, so that the soft assignment of one modality supervises the prediction of another. The formula is defined as follows:

L_{u} = - \frac{1}{M N} \sum_{j = 1}^{N} [q_{1 j} log (p_{2 j}) + q_{2 j} log (p_{3 j}) + q_{3 j} log (p_{1 j})],

(14)

where q_{i j} denotes the soft cluster assignment of the j-th sample in the i-th modality generated by Sinkhorn normalization.
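Eqs. (12) to (14) can be illustrated with a dependency-free sketch. Several choices here are assumptions rather than details given in the text: the temperature `eps`, the number of Sinkhorn iterations, treating column 1 of the softmax output as the positive class, summing q log p over the cluster index, and all toy logits and labels.

```python
import math

def normalize(row):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(row)
    return [v / total for v in row]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    return normalize([math.exp(v - peak) for v in row])

def sinkhorn(scores, eps=0.5, n_iters=100):
    """Sinkhorn-Knopp: soft assignments (N x K) with equally used clusters."""
    n, k = len(scores), len(scores[0])
    peak = max(max(row) for row in scores)
    q = [[math.exp((s - peak) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives the same total mass N / K.
        col_sums = [sum(row[c] for row in q) for c in range(k)]
        q = [[row[c] * (n / k) / col_sums[c] for c in range(k)] for row in q]
        # Row step: each sample's assignment becomes a probability vector.
        q = [normalize(row) for row in q]
    return q

def supervised_loss(p, labels):
    """Eq. (13): binary cross-entropy averaged over modalities and samples."""
    total = 0.0
    for probs in p:
        for row, y in zip(probs, labels):
            pos = row[1]  # assumed: column 1 is the positive class
            total += y * math.log(pos) + (1 - y) * math.log(1.0 - pos)
    return -total / (len(p) * len(labels))

def swapped_loss(q, p):
    """Eq. (14): the assignment of modality i supervises modality i+1 (cyclic)."""
    m, n = len(q), len(q[0])
    total = 0.0
    for i in range(m):
        for q_row, p_row in zip(q[i], p[(i + 1) % m]):
            total += sum(t * math.log(v) for t, v in zip(q_row, p_row))
    return -total / (m * n)

# Hypothetical logits C f_ij for M = 3 modalities, N = 4 samples, K = 2 clusters.
logits = [
    [[2.0, 0.1], [1.5, 0.3], [0.2, 1.8], [0.4, 1.1]],
    [[1.2, 0.4], [0.9, 0.8], [0.1, 2.2], [0.3, 0.9]],
    [[1.8, 0.2], [1.1, 0.6], [0.5, 1.4], [0.2, 1.6]],
]
labels = [0, 0, 1, 1]
alpha = 0.5

p = [[softmax(row) for row in mod] for mod in logits]
q = [sinkhorn(mod) for mod in logits]
l_s = supervised_loss(p, labels)
l_u = swapped_loss(q, p)
l_clu = alpha / 2 * l_s + alpha * l_u  # Eq. (12)
```

The balancing step is what separates the Sinkhorn targets from a plain softmax: each sample still receives a probability vector, but the clusters are forced to be used equally across the batch, which discourages all samples from collapsing into one cluster.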

To facilitate cross-modal feature alignment, we introduced a modality transformation loss L_{mod}, which employs the Kullback–Leibler (KL) divergence to measure the difference between the feature distribution generated by the INN and the ground-truth feature distribution. During training, random modality masking is applied as a regularization strategy, forcing the INN to learn robust cross-modal mappings and enhancing feature alignment between modalities. The loss is defined as follows:

L_{mod} = \sum_{m = 1}^{M} \sum_{A \subset M ∖ {m}} KL (f_{m}^{gt} ‖ {\hat{f}}_{m}^{A}),

(15)

where M = {BUS, CDFI, UE} denotes the set of modalities, A is the subset of available modalities, f_{m}^{gt} represents the ground-truth feature of the missing modality m, {\hat{f}}_{m}^{A} is the feature recovered by the INN from the available modalities, and the KL divergence is computed as follows:

KL (P ‖ Q) = \sum_{k = 1}^{K} P (k) log \frac{P (k)}{Q (k)} .

(16)
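The double sum in Eq. (15) and the divergence in Eq. (16) can be sketched as follows. This is an illustration under stated assumptions: features are taken to be already normalized into discrete distributions, only non-empty subsets A are enumerated, and the dictionary `recovered` is a hypothetical stand-in for the INN outputs.

```python
import math
from itertools import combinations

MODALITIES = ("BUS", "CDFI", "UE")

def kl_divergence(p, q, eps=1e-12):
    """Eq. (16): sum_k P(k) log(P(k) / Q(k)); terms with P(k) = 0 contribute 0."""
    return sum(pk * math.log(pk / max(qk, eps)) for pk, qk in zip(p, q) if pk > 0)

def available_subsets(missing):
    """Non-empty subsets A of the modalities other than `missing` (assumed)."""
    rest = [m for m in MODALITIES if m != missing]
    return [a for r in range(1, len(rest) + 1) for a in combinations(rest, r)]

def modality_loss(gt, recovered):
    """Eq. (15): sum KL(f_m^gt || f_hat_m^A) over every missing m and subset A."""
    return sum(kl_divergence(gt[m], recovered[(m, a)])
               for m in MODALITIES for a in available_subsets(m))

# Hypothetical distributions: every reconstruction is perfect except UE from BUS.
gt = {m: [0.7, 0.2, 0.1] for m in MODALITIES}
recovered = {(m, a): list(gt[m]) for m in MODALITIES for a in available_subsets(m)}
recovered[("UE", ("BUS",))] = [0.5, 0.3, 0.2]

n_terms = sum(len(available_subsets(m)) for m in MODALITIES)
loss = modality_loss(gt, recovered)
```

With three modalities there are three non-empty subsets per missing modality, so the sum has nine terms. Only the imperfect reconstruction contributes here, since the KL divergence of a distribution from itself is zero.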

The uncertainty-aware fusion loss L_{uaf} is designed to jointly optimize classification performance and predictive reliability. Specifically, this loss combines the classification cross-entropy with an evidence regularization term:

L_{uaf} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} y_{i k} log p_{i k} + λ \sum_{i = 1}^{N} \frac{K}{S_{i}},

(17)

where K = 2 denotes the number of classes, y_{i k} is the one-hot label of the i-th sample, and p_{i k} is the predicted class probability output by the TDFM. In the evidence regularization term, S_{i} = \sum_{k = 1}^{K} α_{i k} represents the Dirichlet strength, and α_{i k} is the evidence value of the i-th sample for the k-th class. Additionally, \frac{K}{S_{i}} reflects the uncertainty of the prediction. The regularization term penalizes low-evidence predictions, thereby discouraging overconfident outputs and improving uncertainty calibration. The regularization coefficient λ is empirically set to 0.1 to balance classification accuracy against uncertainty estimation.
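A minimal sketch of Eq. (17) is given below. The text states that p_{i k} is produced by the TDFM; the sketch instead assumes p_{i k} = α_{i k} / S_{i}, the usual expected probability of a Dirichlet distribution, so that the example is self-contained. The evidence values and labels are hypothetical.

```python
import math

def uncertainty(alpha):
    """K / S_i: close to 1 with little evidence, close to 0 with strong evidence."""
    return len(alpha) / sum(alpha)

def uaf_loss(alphas, labels, lam=0.1):
    """Eq. (17): mean cross-entropy plus lam times the summed uncertainty."""
    ce = 0.0
    reg = 0.0
    for alpha, y in zip(alphas, labels):
        strength = sum(alpha)              # Dirichlet strength S_i
        probs = [a / strength for a in alpha]  # assumed form of p_ik
        ce -= math.log(probs[y])
        reg += uncertainty(alpha)
    return ce / len(alphas) + lam * reg

# Hypothetical evidence for two samples and K = 2 classes.
alphas = [[9.0, 1.0], [1.5, 1.5]]
labels = [0, 1]

u_confident = uncertainty(alphas[0])
u_unsure = uncertainty(alphas[1])
loss = uaf_loss(alphas, labels)
```

The first sample has strong evidence for its true class and a low uncertainty of 0.2, while the second has weak, evenly split evidence and an uncertainty of about 0.67, so it dominates the regularization term.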

4. Results

This section reports the experimental results of CIDFNet, including classification performance, computational efficiency, and behavior under Gaussian noise perturbations. The evaluation is conducted through ablation studies, comparative experiments, and robustness analysis.

4.1. Ablation Study

To investigate the contribution of each core component to both performance and computational efficiency, we conducted a comprehensive ablation study based on the TDFNet baseline. The evaluation was performed from three aspects, including performance, parameter count, and computational complexity, and the results are summarized in Table 2.

As shown in the first row, the baseline model reported an AUC of 86.72% and a precision of 80.00%, but it required 55.70 M parameters and 364.882 G FLOPs, a computational overhead that may restrict deployment on low-power hardware. When the MSFE module was introduced independently (second row), accuracy improved from 77.55% to 83.67%, and recall and F1-score increased to 70.00% and 77.78%, respectively, while the AUC decreased from 86.72% to 80.00%. This result suggests that MSFE helps capture discriminative lesion features at an early stage, although the gain in threshold-dependent metrics is accompanied by a lower AUC.

To reduce the computational complexity of the model, we replaced the original backbone with InceptionV3 while keeping the other components unchanged. Although most classification metrics decreased compared with the baseline, the model still obtained an AUC of 81.20%. More importantly, the number of parameters fell from 55.70 M to 48.23 M, and FLOPs decreased from 364.882 G to 75.904 G, a substantial improvement in computational efficiency with an acceptable performance trade-off. When the CAIM module was introduced independently, recall improved to 70.00%, highlighting the benefit of cross-modal interaction in capturing complementary information; however, this improvement was accompanied by a higher parameter count and computational cost. Combining the InceptionV3 backbone with either MSFE or CAIM substantially reduced both parameters and computational cost, but the accompanying decline in classification performance suggests that the reduced capacity of the lightweight backbone is insufficient to sustain either module on its own. Similarly, integrating MSFE and CAIM without the lightweight backbone failed to yield improvements and incurred additional computational overhead.

In contrast, the complete integration of MSFE, InceptionV3, and CAIM reported an AUC of 85.69% with 49.51 M parameters and 79.789 G FLOPs. Compared with the baseline, CIDFNet reduces FLOPs by 78.13% while improving precision by 3.33 percentage points (from 80.00% to 83.33%). Although the AUC is about one percentage point lower than the baseline (85.69% versus 86.72%), the overall results suggest that CIDFNet trades a small loss in AUC for a large reduction in computational cost. These findings suggest the potential usefulness of the proposed components and the necessity of their collaborative design.
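The reported reductions can be reproduced from the figures quoted in this section. The short check below uses only the baseline and CIDFNet values stated above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

# Values quoted in the ablation study (baseline versus full CIDFNet).
flops_cut = relative_reduction(364.882, 79.789)   # G FLOPs
params_cut = relative_reduction(55.70, 49.51)     # M parameters
precision_gain = 83.33 - 80.00                    # percentage points
auc_gap = 86.72 - 85.69                           # percentage points

summary = (round(flops_cut, 2), round(params_cut, 2),
           round(precision_gain, 2), round(auc_gap, 2))
```

Note that the FLOPs and parameter figures are relative reductions, whereas the precision and AUC differences are absolute differences in percentage points.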

Furthermore, to evaluate the trade-off offered by the proposed InceptionV3-based framework, we compared it with five mainstream lightweight alternatives: MobileNetV3-Small [52], EfficientNet-B0 [53], ShuffleNetV2 [54], GhostNet [55], and ConvNeXt-Tiny [56].

As shown in Table 3, the proposed method reports a higher AUC than the evaluated lightweight backbones. Specifically, while ConvNeXt-Tiny exhibits competitive performance, its parameter count and computational cost are higher than those of InceptionV3. Conversely, although MobileNetV3-Small, EfficientNet-B0, ShuffleNetV2, and GhostNet reduce parameters, they suffer from degradation in both AUC and precision. These comparisons suggest that InceptionV3 provides a reasonable trade-off between accuracy and computational efficiency for the multimodal breast ultrasound task.

4.2. Comparison Results

To evaluate the performance of CIDFNet, we compared it with four representative multimodal learning methods, including TDFNet [16], BINDS [57], FusionM4Net [58], and MSMFN [59]. To ensure fairness, all methods strictly followed the network architectures reported in their original papers.

As shown in Table 4, CIDFNet shows competitive performance on the multimodal breast ultrasound dataset, with an AUC of 85.69%, an accuracy of 75.51%, a recall of 50.00%, an F1-score of 62.50%, and a precision of 83.33%. Compared with the baseline TDFNet, CIDFNet improves precision by 3.33 percentage points while reducing parameters by 11.11% and FLOPs by 78.13%. BINDS achieves a comparable AUC of 85.00%, but its lower precision and larger model size may affect its suitability for resource-constrained settings. In contrast, while FusionM4Net and MSMFN reduce computational overhead, they compromise performance. Notably, FusionM4Net yields an AUC of only 48.79% and a precision of 50%, and MSMFN achieves an AUC of 76.55% with the same precision. This suggests that aggressive model simplification may lead to performance degradation in multimodal breast ultrasound tasks.

Overall, the comparison experiments indicate that CIDFNet balances precision against computational efficiency in multimodal breast ultrasound analysis. It yields competitive AUC and precision relative to representative multimodal methods, while requiring fewer parameters and lower computational overhead than TDFNet and BINDS. Although FusionM4Net and MSMFN are lighter, their accuracy degradation offsets their computational advantages.
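The reported precision, recall, F1-score, and accuracy are linked through the confusion matrix, which the following sketch makes explicit. The counts used here are hypothetical: they are not reported in this section and were chosen only because they reproduce the percentages quoted above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts, as percentages."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {name: round(100 * value, 2) for name, value in
            [("precision", precision), ("recall", recall),
             ("f1", f1), ("accuracy", accuracy)]}

# Hypothetical counts (an assumption, not data from the paper).
metrics = classification_metrics(tp=10, fp=2, fn=10, tn=27)
```

The example shows how a high precision and a moderate recall can coexist: few false positives are produced, but half of the positive cases are missed, and the F1-score sits between the two as their harmonic mean.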

4.3. Robustness Analysis

Multimodal breast ultrasound images are inevitably affected by random noise due to device conditions, operator variability, and tissue scattering. Such noise obscures discriminative features of lesion regions, posing a challenge to analysis stability. To evaluate the robustness of the proposed CIDFNet, we injected Gaussian noise into the raw images under two settings: single-modality perturbation and full-modality perturbation. It should be noted that the Gaussian noise serves only as a simplified perturbation to observe the model’s performance variations, and it cannot substitute for the authentic noise and complex artifacts inherent in ultrasound imaging.
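The two perturbation settings can be sketched as follows. The text does not define what a noise percentage means, so this illustration assumes that intensities are normalized to [0, 1] and that a level of 10% corresponds to zero-mean Gaussian noise with a standard deviation of 0.10; the function names, the seed, and the constant toy images are likewise hypothetical.

```python
import random

def add_gaussian_noise(image, level, rng):
    """Add zero-mean Gaussian noise (std = level) to a [0, 1] image, then clip."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, level))) for px in row]
            for row in image]

def perturb(sample, level, targets, seed=0):
    """Corrupt only the modalities named in `targets`; leave the others untouched."""
    rng = random.Random(seed)
    return {name: add_gaussian_noise(img, level, rng) if name in targets else img
            for name, img in sample.items()}

# Toy 4 x 4 images with constant intensity, one per modality.
sample = {name: [[0.5] * 4 for _ in range(4)] for name in ("BUS", "CDFI", "UE")}

# Single-modality perturbation: only BUS is corrupted.
single = perturb(sample, level=0.10, targets={"BUS"})
# Full-modality perturbation: all three modalities are corrupted.
full = perturb(sample, level=0.20, targets={"BUS", "CDFI", "UE"})
```

In the single-modality case the two clean inputs remain available to the fusion stage, whereas the full-modality case leaves no uncorrupted reference.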

As summarized in Table 5, in the single-modality perturbation setting (the first three rows), applying 10% Gaussian noise independently to a single modality (BUS, CDFI, or UE) yields minimal performance degradation. For instance, with noise applied solely to the BUS modality, the model retains an AUC of 85.69% and an accuracy of 75.51%, matching the noise-free setting. When noise is applied only to the CDFI modality, the accuracy remains at 75.51%, indicating a degree of tolerance to noise perturbations. With noise in the UE modality, the F1-score remains at 62.50%. These results indicate that CIDFNet shows limited performance variation when a single modality is degraded. This behavior may be related to the dynamic fusion mechanism: by incorporating uncertainty modeling, the module can adaptively down-weight the corrupted modality and emphasize information from clean modalities, thereby helping to maintain overall performance.

In the full-modality perturbation setting (rows four through seven), Gaussian noise was simultaneously injected into all modalities at intensities ranging from 10% to 20%. Unsurprisingly, degrading all modalities simultaneously poses a substantially greater challenge. While performance remains moderately stable at 10% noise, increasing the intensity to 20% causes the accuracy to drop to 67.35% and the AUC to 80.17%. Notably, recall declines more sharply than precision, falling from 50% to 30%. This discrepancy may be attributed to the distinct characteristics of malignant lesions: their inherently low-contrast and blurred boundaries become further obscured by background speckle under high-intensity noise, leading to more missed detections. Conversely, the clearer morphological features of benign lesions are less affected, resulting in a smaller drop in precision.

Figure 5 further shows the AUC decay curve of CIDFNet under full-modality Gaussian noise perturbation. CIDFNet degrades gradually as noise increases from 0% to 20%, with the AUC declining from 85.69% to 80.17%, a drop of 5.52 percentage points. Although high-intensity noise reduces performance, CIDFNet keeps an AUC above 80% even at the 20% noise level, suggesting that the model retains a certain level of discriminative performance under controlled noise conditions.

5. Discussion

The proposed CIDFNet shows a trade-off between prediction performance and computational efficiency for multimodal breast ultrasound analysis. Compared with existing approaches, the model obtains a competitive classification performance while requiring fewer computational resources. This suggests that appropriately designed feature enhancement and cross-modal interaction mechanisms may support representation learning under constrained model capacity.

From the perspective of component analysis, the results of the ablation study indicate that each module contributes differently to the overall performance. The MSFE module enhances modality-specific feature representation, and its effect is mainly reflected in sensitivity-related metrics. The CAIM is associated with enhanced global feature alignment by enabling early-stage information exchange across modalities. In contrast, the use of an InceptionV3 backbone reduces computational cost, though it leads to some variation in performance compared with heavier architectures. When combined, these components exhibit complementary behavior, resulting in a trade-off between performance and efficiency.

Compared with other multimodal methods, CIDFNet presents competitive AUC and precision while maintaining a relatively lower computational cost. However, methods with lower computational complexity may exhibit reduced classification performance, indicating that excessive simplification of the model structure can affect feature representation capacity in multimodal settings.

The robustness analysis evaluates the model under synthetic Gaussian noise perturbations. Under single-modality perturbations, only minor performance variations are observed, suggesting that the dynamic fusion strategy may help reduce the impact of degraded inputs by leveraging complementary information from other modalities. However, when noise is applied to all modalities simultaneously, a more noticeable performance decline is observed, particularly in sensitivity-related metrics. It should be noted that the Gaussian noise used in this study represents a simplified simulation and does not fully reflect the complexity of real ultrasound imaging artifacts. Therefore, these results should be interpreted as an analysis of model behavior under controlled perturbations rather than as evidence of robustness in clinical or real-world scenarios.

In addition, the relatively moderate recall observed in the experiments reflects the influence of the decision threshold and the inherent trade-off between sensitivity and precision under the adopted optimization setting. Since a fixed threshold is used for all evaluations without additional calibration, the model tends to produce conservative outputs, which reduces false positives but also causes some positive cases to be missed. This behavior should be interpreted in relation to the experimental configuration and task settings rather than as an isolated performance limitation.
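The effect of a fixed decision threshold on precision and recall can be illustrated with a toy example. The scores and labels below are hypothetical and are not model outputs from this study; they only show the direction of the trade-off.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical malignancy scores and ground-truth labels (1 = malignant).
scores = [0.92, 0.81, 0.66, 0.47, 0.44, 0.41, 0.30, 0.12]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

strict = precision_recall_at(scores, labels, threshold=0.5)
relaxed = precision_recall_at(scores, labels, threshold=0.4)
```

In this toy case the stricter threshold gives perfect precision but misses one positive case, while the relaxed threshold recovers it at the cost of two false positives. Threshold calibration on a validation set is one way such a balance could be adjusted.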

Despite these findings, several limitations should be acknowledged. The study relies on a single-institution dataset without external validation, limiting generalizability. The Gaussian noise analysis serves only as a preliminary simulation and does not capture the full complexity of real-world ultrasound imaging. Additionally, the framework focuses primarily on the image data and does not incorporate clinical metadata. Future work will explore multi-center validation and integration of additional data sources to better assess real-world utility.

6. Conclusions

This study proposes CIDFNet, an efficient multimodal breast ultrasound diagnosis framework. It integrates an InceptionV3-based shared backbone, whose computational cost is lower than that of ResNet50, together with a multi-scale feature enhancement module and a cross-modal interaction module under a unified architecture. It is designed to address the challenges of heterogeneous information fusion, limited utilization of lesion features, and computational cost. Experimental results indicate that CIDFNet offers a trade-off between precision and computational cost in this setting. Future work will focus on integrating multi-center data and additional modalities to support practical application.

Author Contributions

Conceptualization, X.W. and P.W.; methodology, X.W., Y.L., and L.H.; software, Y.L. and L.H.; validation, Y.L. and L.H.; formal analysis, X.W., Y.L., and L.H.; investigation, Y.L. and L.H.; resources, X.W. and P.W.; data curation, X.W.; writing—original draft preparation, X.W., Y.L., and L.H.; writing—review and editing, X.W., Y.L., L.H., and P.W.; visualization, Y.L. and L.H.; supervision, P.W.; project administration, X.W.; funding acquisition, X.W. and P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 62402175 and 62573187) and the Natural Science Foundation of Hunan Province (grant number 2024JJ6176).

Institutional Review Board Statement

Not applicable. Ethical review or prior approval by a Research Ethics Committee was not required for this study, as it relied solely on aggregated, publicly available data and official documentation. No individually identifiable patient records or other forms of sensitive health information were collected, processed, or analyzed at any stage of the study. This article does not involve studies with human participants or animals conducted by any of the authors. All analyses were performed on a publicly available dataset that had already received ethical approval from the relevant committees in its original studies. Therefore, no additional ethical approval was sought for this work.

Informed Consent Statement

Not applicable.

Data Availability Statement

A publicly available dataset was used in this article: https://www.kaggle.com/datasets/timesxy/multimodal-breast-ultrasound-dataset-us3m (accessed on 1 November 2025).

Acknowledgments

During the preparation of this manuscript, the authors used Doubao 2.0 and DeepSeek V4 for the purposes of language polishing and grammar checking. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2024, 74, 229–263.
Sun, Y.; Zhao, Z.; Yang, Z.; Xu, F.; Lu, H.; Zhu, Z.; Shi, W.; Jiang, J.; Yao, P.; Zhu, H. Risk factors and preventions of breast cancer. Int. J. Biol. Sci. 2017, 13, 1387–1397.
Daniaux, M.; Gruber, L.; De Zordo, T.; Geiger-Gritsch, S.; Amort, B.; Santner, W.; Egle, D.; Baltzer, P.A. Preoperative staging by multimodal imaging in newly diagnosed breast cancer: Diagnostic performance of contrast-enhanced spectral mammography compared to conventional mammography, ultrasound, and MRI. Eur. J. Radiol. 2023, 163, 110838.
Bo, X.; Kaiwen, W.; Ying, W.; Jie, H.; Chaoyi, C. Dynamic adversarial domain adaptation based on multikernel maximum mean discrepancy for breast ultrasound image classification. Expert Syst. Appl. 2022, 207, 117978.
Wojcinski, S.; Brandhorst, K.; Sadigh, G.; Hillemanns, P.; Degenhardt, F. Acoustic Radiation Force Impulse Imaging with Virtual Touch Tissue Quantification: Measurements of Normal Breast Tissue and Dependence on the Degree of Pre-compression. Ultrasound Med. Biol. 2013, 39, 2226–2232.
Cho, Y.; Misra, S.; Managuli, R.; Barr, R.G.; Lee, J.; Kim, C. Attention-based Fusion Network for Breast Cancer Segmentation and Classification Using Multi-modal Ultrasound Images. Ultrasound Med. Biol. 2025, 51, 568–577.
Abdullakutty, F.; Akbari, Y.; Al-Maadeed, S.; Bouridane, A.; Hamoudi, R.R. Enhancing the Prediction of Breast Cancer Progression Through Multi-modal Data Transformation. Cogn. Comput. 2025, 17, 114.
Abdullakutty, F.; Akbari, Y.; Al-Maadeed, S.; Bouridane, A.; Talaat, I.M.; Hamoudi, R. Towards improved breast cancer detection via multi-modal fusion and dimensionality adjustment. Comput. Struct. Biotechnol. Rep. 2024, 1, 100019.
Ogut, Z.; Karaduman, M.; Yildirim, M. Clinically Focused Computer-Aided Diagnosis for Breast Cancer Using SE and CBAM with Multi-Head Attention. Tomography 2025, 11, 138.
Yao, Z.; Luo, T.; Dong, Y.; Jia, X.; Deng, Y.; Wu, G.; Zhu, Y.; Zhang, J.; Liu, J.; Yang, L.; et al. Virtual elastography ultrasound via generative adversarial network for breast cancer diagnosis. Nat. Commun. 2023, 14, 788.
Iniyan, S.; Raja, M.S.; Poonguzhali, R. Enhanced breast cancer diagnosis through integration of computer vision with fusion based joint transfer learning using multi modality medical images. Sci. Rep. 2024, 14, 28376.
Al-Tam, R.M.; Al-Hejri, A.M.; Alshamrani, S.S. Multimodal breast cancer hybrid explainable computer-aided diagnosis using medical mammograms and ultrasound images. Biocybern. Biomed. Eng. 2024, 44, 731–758.
Bobowicz, M.; Rygusik, M.; Buler, J.; Buler, R.; Ferlin, M.; Kwasigroch, A.; Szurowska, E.; Grochowski, M. Attention-Based Deep Learning System for Classification of Breast Lesions—Multimodal, Weakly Supervised Approach. Cancers 2023, 15, 2704.
Wang, H.; Wei, L.; Li, J.; Liu, B.; Fang, J.; Mooney, C. A multimodal data-based model for breast cancer diagnosis. Comput. Methods Program. Biomed. 2026, 279, 109288.
Misra, S.; Yoon, C.; Kim, K.J.; Managuli, R.; Barr, R.G.; Baek, J.; Kim, C. Deep learning-based multimodal fusion network for segmentation and classification of breast cancers using B-mode and elastography ultrasound images. Bioeng. Transl. Med. 2023, 8, e10480.
Yan, P.; Gong, W.; Li, M.; Zhang, J.; Li, X.; Jiang, Y.; Luo, H.; Zhou, H. TDF-Net: Trusted Dynamic Feature Fusion Network for breast cancer diagnosis using incomplete multimodal ultrasound. Inf. Fusion 2024, 112, 102592.
Yang, X.; Xi, X.; Yang, L.; Xu, C.; Song, Z.; Nie, X.; Qiao, L.; Li, C.; Shi, Q.; Yin, Y. Multi-modality relation attention network for breast tumor classification. Comput. Biol. Med. 2022, 150, 106210.
Muduli, D.; Dash, R.; Majhi, B. Automated diagnosis of breast cancer using multi-modal datasets: A deep convolution neural network based approach. Biomed. Signal Process. Control 2022, 71, 102825.
Li, M.; Fang, Y.; Shao, J.; Jiang, Y.; Xu, G.; Cui, X.W.; Wu, X. Vision transformer-based multimodal fusion network for classification of tumor malignancy on breast ultrasound: A retrospective multicenter study. Int. J. Med. Inform. 2025, 196, 105793.
Lu, W.; Tang, Y.; Luo, J.; Zhang, C.; Zhu, L. Multimodal transformer-based fusion for breast ultrasound: Segmentation and BI-RADS classification. J. Radiat. Res. Appl. Sci. 2026, 19, 102260.
Mallina, R.; Shareef, B. XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision–Language Learning. Diagnostics 2025, 15, 2849.
Yan, R.; Zhang, F.; Rao, X.; Lv, Z.; Li, J.; Zhang, L.; Liang, S.; Li, Y.; Ren, F.; Zheng, C.; et al. Richer fusion network for breast cancer classification based on multimodal data. BMC Med. Inf. Decis. Mak. 2021, 21, 134.
Zhou, S.K.; Greenspan, H.; Davatzikos, C.; Duncan, J.S.; Van Ginneken, B.; Madabhushi, A.; Prince, J.L.; Rueckert, D.; Summers, R.M. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE 2021, 109, 820–838.
Ghantasala, G.S.P.; Akhil, M.; Vidyullatha, P.; Guruguntla, V.; Rao, T.S.S.B.; Yuvaraju, B.A.G. Multimodal fusion of ultrasound images using HXM net for breast cancer diagnosis. Sci. Rep. 2025, 15, 40689.
Chen, J.; Pan, T.; Zhu, Z.; Liu, L.; Zhao, N.; Feng, X.; Zhang, W.; Wu, Y.; Cai, C.; Luo, X.; et al. A deep learning-based multimodal medical imaging model for breast cancer screening. Sci. Rep. 2025, 15, 14696.
Nasrullah, N.; Sang, J.; Alam, M.S.; Mateen, M.; Cai, B.; Hu, H. Automated Lung Nodule Detection and Classification Using Deep Learning Combined with Multiple Strategies. Sensors 2019, 19, 3722.
Asghar, M.A.; Shoaib, S.; Zahid, M. HAAU-Net: Hybrid Adaptive Attention U-Net Integrated with Context-Aware Morphologically Stable Features for Real-Time MRI Brain Tumor Detection and Segmentation. Tomography 2026, 12, 44.
Wu, X.; Tan, G.; Luo, H.; Chen, Z.; Pu, B.; Li, S.; Li, K. A knowledge-interpretable multi-task learning framework for automated thyroid nodule diagnosis in ultrasound videos. Med. Image Anal. 2023, 91, 103039.
McKinney, S.M.; Sieniek, M.; Godbole, V.; Godwin, J.; Antropova, N.; Ashrafian, H.; Back, T.; Chesus, M.; Corrado, G.S.; Darzi, A.; et al. International evaluation of an AI system for breast cancer screening. Nature 2020, 577, 89–94.
Qian, X.; Pei, J.; Zheng, H.; Xie, X.; Yan, L.; Zhang, H.; Han, C.; Gao, X.; Zhang, H.; Zheng, W.; et al. Prospective assessment of breast cancer risk from multimodal multiview ultrasound images via clinically applicable deep learning. Nat. Biomed. Eng. 2021, 5, 522–532.
Khoshdel, V.; Ashraf, A.; LoVetri, J. Enhancement of Multimodal Microwave-Ultrasound Breast Imaging Using a Deep-Learning Technique. Sensors 2019, 19, 4050.
Gu, J.; Zhong, X.; Fang, C.; Lou, W.; Fu, P.; Woodruff, H.C.; Wang, B.; Jiang, T.; Lambin, P. Deep learning of multimodal ultrasound: Stratifying the response to neoadjuvant chemotherapy in breast cancer before treatment. Oncologist 2024, 29, e187–e197.
Chen, H.; Li, Y.; Zhang, J.; Yang, L.; Sun, Y.; Chen, Y.; Zhou, S.; Li, Z.; Qian, X.; Xu, Q.; et al. An Alignment and Imputation Network (AINet) for Breast Cancer Diagnosis With Multimodal Multi-View Ultrasound Images. IEEE Trans. Med. Imaging 2026, 45, 1383–1394.
Huang, R.; Lin, Z.; Dou, H.; Wang, J.; Miao, J.; Zhou, G.; Jia, X.; Xu, W.; Mei, Z.; Dong, Y.; et al. AW3M: An auto-weighting and recovery framework for breast cancer diagnosis using multi-modal ultrasound. Med. Image Anal. 2021, 72, 102137.
Fang, M.; Xu, B. Transformer-based multi-modal learning for breast cancer screening: Merging imaging and genetic data. J. Radiat. Res. Appl. Sci. 2025, 18, 101586.
Mo, Y.; Han, C.; Liu, Y.; Liu, M.; Shi, Z.; Lin, J.; Zhao, B.; Huang, C.; Qiu, B.; Cui, Y.; et al. HoVer-Trans: Anatomy-Aware HoVer-Transformer for ROI-Free Breast Cancer Diagnosis in Ultrasound Images. IEEE Trans. Med. Imaging 2023, 42, 1696–1706.
Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks With Spatial and Channel “Squeeze and Excitation” Blocks. IEEE Trans. Med. Imaging 2019, 38, 540–549.
Cai, J.; Li, H.; Tan, M.; He, B.; Lv, W.; Li, H. Cross-modal generalizable medical image segmentation with dual-domain deformable transformer and multitask adaptation. Expert Syst. Appl. 2025, 277, 127249.
Liu, Z.; Liu, B.; Tao, Z.; Zhou, Y.; Li, C. LHNet: Lightweight hybrid network with multi-scale sliding window attention for real-time semantic segmentation. Neurocomputing 2026, 662, 131857.
Dong, L.; Cai, X.; Ge, H.; Sun, L.; Pan, X.; Sun, F.; Meng, Q. Breast Cancer Diagnosis Using a Dual-Modality Complementary Deep Learning Network With Integrated Attention Mechanism Fusion of B-Mode Ultrasound and Shear Wave Elastography. Ultrasound Med. Biol. 2025, 51, 2135–2143.
Xu, J.; Zhuang, S.; He, Y.; Wang, H.; Zhuang, Z.; Zeng, H. Multimodal sparse fusion transformer network with spatio-temporal decoupling for breast tumor classification. Med. Image Anal. 2026, 110, 103966.
Mondol, S.S.; Hasan, M.K. Enhancing B-mode-based breast cancer diagnosis via cross-attention fusion of H-scan and Nakagami imaging with multi-CAM-QUS-Driven XAI. Phys. Med. Biol. 2025, 70, 175011.
Qian, X.; Pei, J.; Han, C.; Liang, Z.; Zhang, G.; Chen, N.; Zheng, W.; Meng, F.; Yu, D.; Chen, Y.; et al. A multimodal machine learning model for the stratification of breast cancer risk. Nat. Biomed. Eng. 2025, 9, 356–370.
Karmakar, R.; Nooshabadi, S. Mobile-PolypNet: Lightweight Colon Polyp Segmentation Network for Low-Resource Settings. J. Imaging 2022, 8, 169.
Dwika Hefni Al-Fahsi, R.; Naghim Fauzaini Prawirosoenoto, A.; Adi Nugroho, H.; Ardiyanto, I. GIVTED-Net: GhostNet-Mobile Involution ViT Encoder-Decoder Network for Lightweight Medical Image Segmentation. IEEE Access 2024, 12, 81281–81292.
Xie, M.; Wu, J.; Sun, J.; Xiao, L.; Liu, Z.; Yuan, R.; Duan, S.; Wang, L. MFFSNet: A Lightweight Multi-Scale Shuffle CNN Network for Wheat Disease Identification in Complex Contexts. Agronomy 2025, 15, 910.
Choi, Y.; Kim, M.N.; Na, S. Inception U-Net for Enhanced Breast Ultrasound Image Segmentation Using Transfer Learning. Bioengineering 2026, 13, 181.
Chen, J.; Shen, X.; Zhao, Y.; Qian, W.; Ma, H.; Sang, L. Attention gate and dilation U-shaped network (GDUNet): An efficient breast ultrasound image segmentation network with multiscale information extraction. Quant. Imaging Med. Surg. 2024, 14, 2034–2048.
Wu, R.; Liu, Y.; Ning, G.; Liang, P.; Chang, Q. UltraLight VM-UNet: Parallel Vision Mamba significantly reduces parameters for skin lesion segmentation. Patterns 2025, 6, 101298.
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924.
Wang, S.; Ren, J.; Guo, X. A high-accuracy lightweight network model for X-ray image diagnosis: A case study of COVID detection. PLoS ONE 2024, 19, e0303049.
Dudeja, T.; Dubey, S.K.; Bhatt, A.K. Ensembled EfficientNetB3 architecture for multi-class classification of tumours in MRI images. Intell. Decis. Technol. 2023, 17, 395–414.
Han, J.; Yang, Y. L-Net: Lightweight and fast object detector-based ShuffleNetV2. J. Real-Time Image Process. 2021, 18, 2527–2538.
Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Wu, E.; Tian, Q. GhostNets on heterogeneous devices via cheap operations. Int. J. Comput. Vis. 2022, 130, 1050–1069.
Tagnamas, J.; Ramadan, H.; Yahyaouy, A.; Tairi, H. CS-Net: Combined ConvNeXt-Swin-Unet for accurate medical image segmentation. J. Supercomput. 2026, 82, 276.
Li, Y.; Zhang, J.; Chen, H.; Yang, L.; Xie, Y.; Xu, Q.; Li, P.; Li, J.; Li, Z.; Dai, L.; et al. A deep learning system for non-invasive breast cancer diagnosis with multimodal data. Nat. Biomed. Eng. 2026, 1–18.
Tang, P.; Yan, X.; Nan, Y.; Xiang, S.; Krammer, S.; Lasser, T. FusionM4Net: A multi-stage multi-modal learning algorithm for multi-label skin lesion classification. Med. Image Anal. 2022, 76, 102307. [Google Scholar] [CrossRef] [PubMed]
Meng, Z.; Zhu, Y.; Pang, W.; Tian, J.; Nie, F.; Wang, K. MSMFN: An Ultrasound Based Multi-Step Modality Fusion Network for Identifying the Histologic Subtypes of Metastatic Cervical Lymphadenopathy. IEEE Trans. Med. Imaging 2023, 42, 996–1008. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Multimodal ultrasound examples of benign (top row) and malignant (bottom row) breast lesions. (a,d) BUS; (b,e) CDFI; (c,f) UE.

Figure 2. Overall framework of the proposed CIDFNet.

Figure 3. Architecture of the Multi-Scale Feature Enhancement (MSFE) module.

Figure 4. Architecture of the Cross-Attention Interaction Module (CAIM).

Figure 5. AUC variation curve of CIDFNet under gradually increased full-modal Gaussian noise.

Table 1. Comparison of representative modal fusion methods.

Methods	Fusion Strategy	Interaction	Backbone	Parameters (M)	FLOPs (G)	Missing Modality Handling
SENet [37]	Multi-crop	None	ResNet	None	None	None
scSE [38]	scSE block	None	SD-Net	None	None	None
CGDMNet [39]	Transformer	Intermediate	MAL	25.06	6.37	None
LHNet [40]	Information	Late	ViT	2.95	33.15	None
HXM-Net [24]	Transformer	Intermediate	ViT	None	None	None
DCFAN [41]	Channel	Intermediate	CNNs	None	None	None
MSFT-Net [42]	SCAM	Intermediate	TimeSformer	None	None	None
KGCML [43]	MM-CAF	Hybrid	ResNet50	39.41	23.99	None
BMU-Net [44]	Transformer	Hybrid	ResNet18	None	None	Random Masking
TDF-Net [16]	TDFM	Intermediate	ResNet50	55.7	364.882	INN
Ours	CAIM+TDFM	Hybrid	InceptionV3	49.51	79.789	INN

Table 2. Ablation study of the main components of CIDFNet. Best results are in bold.

MSFE	InceptionV3	CAIM	AUC (%)	Accuracy (%)	Recall (%)	F1-Score (%)	Precision (%)	Parameters (M)	FLOPs (G)
×	×	×	86.72	77.55	60.00	68.57	80.00	55.70	364.882
✔	×	×	80.00	83.67	70.00	77.78	87.50	55.93	365.063
×	✔	×	81.20	71.43	55.00	61.11	68.75	48.23	75.904
×	×	✔	85.17	77.55	70.00	71.79	73.68	57.80	371.057
✔	✔	×	83.97	75.51	55.00	64.71	78.57	48.46	76.085
×	✔	✔	82.93	73.47	45.00	58.06	81.82	50.33	82.079
✔	×	✔	80.00	75.51	65.00	68.42	72.22	56.98	368.767
✔	✔	✔	85.69	75.51	50.00	62.50	83.33	49.51	79.789

In this table, ✔ indicates that the corresponding component (MSFE, InceptionV3 backbone, or CAIM) is enabled, and × indicates that it is disabled.
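
As a quick consistency check on Table 2, the F1-score column can be recomputed from the precision and recall columns using the standard harmonic-mean definition. The sketch below is only an illustration: the function name `f1_from_pr` and the `rows` list are introduced here (they are not from the paper), and small gaps are expected because the table reports rounded percentages.

```python
def f1_from_pr(precision_pct: float, recall_pct: float) -> float:
    """Return F1 in percent from precision and recall given in percent."""
    p, r = precision_pct / 100.0, recall_pct / 100.0
    if p + r == 0:
        return 0.0  # convention when both precision and recall are zero
    return 100.0 * 2 * p * r / (p + r)

# (precision %, recall %, reported F1 %) copied row by row from Table 2.
rows = [
    (80.00, 60.00, 68.57),
    (87.50, 70.00, 77.78),
    (68.75, 55.00, 61.11),
    (73.68, 70.00, 71.79),
    (78.57, 55.00, 64.71),
    (81.82, 45.00, 58.06),
    (72.22, 65.00, 68.42),
    (83.33, 50.00, 62.50),
]

# Largest absolute difference between recomputed and reported F1 values.
max_gap = max(abs(f1_from_pr(p, r) - f1) for p, r, f1 in rows)
print(f"largest gap between recomputed and reported F1: {max_gap:.3f} points")
```

The recomputed values agree with the reported column to within rounding, which suggests the three columns are internally consistent.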

Table 3. Comparative results of different lightweight backbones on the multimodal breast ultrasound dataset. Best results are in bold.

Methods	AUC (%)	Accuracy (%)	Recall (%)	F1-Score (%)	Precision (%)	Parameters (M)	FLOPs (G)
MobileNetV3-Small	72.59	71.43	60.00	63.16	66.67	37.61	42.459
EfficientNet-B0	71.38	69.39	70.00	65.12	60.87	44.14	48.865
ShuffleNetV2	67.24	65.31	60.00	58.54	57.14	38.50	44.128
GhostNet	65.52	55.10	15.00	21.43	37.50	38.06	41.768
ConvNeXt-Tiny	80.86	73.47	60.00	64.86	70.59	91.48	121.718
InceptionV3 (Ours)	85.69	75.51	50.00	62.50	83.33	49.51	79.789

Table 4. Comparative results of different methods on the multimodal breast ultrasound dataset. Best results are in bold.

Methods	AUC (%)	Accuracy (%)	Recall (%)	F1-Score (%)	Precision (%)	Parameters (M)	FLOPs (G)
TDF-Net	86.72	77.55	60.00	68.57	80.00	55.70	364.882
BINDS	85.00	73.47	50.00	60.61	76.92	64.42	27.451
FusionM4Net	48.79	59.18	30.00	37.50	50.00	95.15	49.581
MSMFN	76.55	69.39	50.00	57.14	66.67	88.64	41.192
Ours	85.69	75.51	50.00	62.50	83.33	49.51	79.789
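
The efficiency trade-off between the proposed model and the TDFNet baseline in Table 4 can be made concrete with a short calculation. This is a minimal sketch using only the values printed in the table; the helper name `relative_reduction` is an assumption introduced for illustration.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is smaller than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

# Values copied from the TDFNet and Ours rows of Table 4.
params_cut = relative_reduction(55.70, 49.51)     # parameters, in millions
flops_cut = relative_reduction(364.882, 79.789)   # FLOPs, in G
auc_gap = 86.72 - 85.69                           # AUC difference, in points

print(f"parameters reduced by {params_cut:.1f}%")
print(f"FLOPs reduced by {flops_cut:.1f}%")
print(f"AUC lower by {auc_gap:.2f} percentage points")
```

On these numbers, the proposed model uses roughly 11% fewer parameters and roughly 78% fewer FLOPs than TDFNet, at a cost of about one percentage point of AUC.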

Table 5. Robustness experimental results of the proposed method on the multimodal breast ultrasound dataset with different noise levels.

Noise	BUS	CDFI	UE	AUC (%)	Accuracy (%)	Recall (%)	F1-Score (%)	Precision (%)
10%	✔	×	×	85.69	75.51	50.00	62.50	83.33
10%	×	✔	×	85.86	75.51	45.00	60.00	90.00
10%	×	×	✔	83.10	75.51	50.00	62.50	83.33
0%	✔	✔	✔	85.69	75.51	50.00	62.50	83.33
10%	✔	✔	✔	82.59	75.51	50.00	62.50	83.33
15%	✔	✔	✔	83.79	69.39	35.00	48.28	77.78
20%	✔	✔	✔	80.17	67.35	30.00	42.86	75.00

In this table, ✔ indicates that the corresponding modality is used, and × indicates that it is not used.
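
To read the full-modal rows of Table 5 at a glance, the AUC at each noise level can be compared against the clean (0%) setting. The sketch below uses only the AUC values listed in the table for rows where all three modalities are marked ✔; the variable names are illustrative and not taken from the paper.

```python
# AUC (%) for rows with BUS, CDFI and UE all marked, copied from Table 5.
clean_auc = 85.69                                # 0% noise
noisy_auc = {10: 82.59, 15: 83.79, 20: 80.17}    # noise level (%) -> AUC (%)

# Absolute AUC drop, in percentage points, relative to the clean setting.
auc_drop = {level: round(clean_auc - auc, 2) for level, auc in noisy_auc.items()}

for level, drop in sorted(auc_drop.items()):
    print(f"{level}% noise: AUC drop of {drop:.2f} points")
```

Note that the drop is not monotonic in this table: the AUC reported at 15% noise is higher than at 10%, while the largest drop occurs at 20%.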


