Figure 1.
The patch embedding layer with learnable 2D positional encoding. Strided convolutions reduce the spatial dimensions, while learnable positional features inject spatial priors into the encoding.
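The downsampling and positional addition described above can be sketched in plain Python; the 4×4 kernel / stride-4 patch configuration and the elementwise grid addition are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the patch-embedding arithmetic: a strided convolution shrinks the
# spatial grid, then a learnable 2D positional grid is added elementwise.
# The 4x4 / stride-4 values below are assumptions for illustration.

def conv_out_size(n, kernel, stride, padding=0):
    """Standard output-size formula for a strided convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def add_positional(features, pos):
    """Elementwise sum of a 2D feature map and a (learnable) positional grid."""
    return [[f + p for f, p in zip(frow, prow)]
            for frow, prow in zip(features, pos)]

side = conv_out_size(224, kernel=4, stride=4)   # 224 -> 56 tokens per axis
fused = add_positional([[1, 2], [3, 4]], [[10, 20], [30, 40]])
```

In the model the positional grid would be a trained parameter; here it is a constant standing in for one.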
Figure 2.
The Stage 1 architecture implements position-aware convolutional encoding with learnable 2D embeddings. This stage refines the initial spatial representations before hybrid modeling begins.
Figure 3.
The Stage 2 architecture: hybrid local–global encoding via convolutional residual blocks and a MobileViT-based token mixer with spatial gating.
Figure 4.
UltraScanUnit: a block that combines state-space operations, convolutional processing, and low-rank residual enhancements for input-adaptive modeling.
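At its core, the state-space component of a unit like this reduces to a linear recurrence over the token sequence. A toy scalar (diagonal) scan, with fixed coefficients standing in for the learned, input-dependent parameters of the actual block:

```python
def ssm_scan(x, a=0.9, b=1.0, c=0.5):
    """Scalar state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    In the real block, a/b/c are learned (and input-dependent in selective SSMs)."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt      # state update
        ys.append(c * h)        # readout
    return ys

ys = ssm_scan([1.0, 0.0, 0.0])  # impulse response decays geometrically
```

The recurrence is what lets such blocks aggregate context with linear cost in sequence length, unlike quadratic self-attention.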
Figure 5.
Overview of the proposed UltraScanNet architecture. The model processes an input ultrasound image through a patch embedding layer, followed by four main stages: positional convolutional encoding (Stage 1), hybrid local–global representation (Stage 2), and two progressive context modeling stages (Stages 3 and 4) combining UltraScanUnits, ConvAttnMixers, and attention blocks. The final prediction is produced via global pooling and a linear classification head.
Figure 6.
Grouped performance metrics on BUSI. Bars represent top-1 accuracy, precision, recall, and F1-score for each model. Horizontal dashed lines mark the maximum value of each metric across models.
Figure 7.
Comparison of mean accuracy (with standard deviation) between UltraScanNet and MambaVision on BUSI. UltraScanNet achieves slightly higher average accuracy, while both models remain within overlapping variability ranges.
Figure 8.
Per-class F1-scores on BUSI. UltraScanNet attains the highest F1-score for normal (C2), showing a balanced trade-off between precision and recall. Competing models show more class-specific strengths but less consistent balance. The stars indicate the maximum values of per-class metrics.
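The per-class precision, recall, and F1 values plotted here derive directly from a confusion matrix (rows = true labels, columns = predictions). A minimal sketch with a hypothetical 3-class matrix, not the paper's actual counts:

```python
def per_class_metrics(cm):
    """Per-class precision/recall/F1 from a confusion matrix (rows=true, cols=pred)."""
    n = len(cm)
    precision, recall, f1 = [], [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, actually other
        fn = sum(cm[c]) - tp                        # true c, predicted other
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return precision, recall, f1

# Hypothetical counts for (benign, malignant, normal):
prec, rec, f1 = per_class_metrics([[82, 3, 2], [5, 34, 3], [0, 0, 27]])
```

Macro-averaging these per-class F1 values gives the F1-macro statistic reported later in the tables.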
Figure 9.
Radar comparison of per-class metrics (C0: benign, C1: malignant, C2: normal) for multiple models. Each panel highlights differences in precision, recall, and F1-score.
Figure 10.
Class-wise ROC curves on BUSI. Colors denote C0 (benign), C1 (malignant), and C2 (normal); the dashed diagonal marks random-chance performance. Panels (a,b) show UltraScanNet and MambaVision, while (c–e) compare DeiT-T/16, ViT-S/16, and MaxViT-Tiny.
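A class-wise ROC–AUC such as those plotted here equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which the rank-sum (Mann–Whitney) identity makes directly computable. A one-vs-rest sketch on toy scores, not BUSI model outputs:

```python
def roc_auc(scores, labels):
    """One-vs-rest ROC-AUC via the Mann-Whitney pairwise-comparison identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.4, 0.8, 0.3], [1, 1, 0, 0])  # toy example
```

For the three-class setting, each class is scored against the rest, giving the per-class AUC columns in the later tables.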
Figure 11.
Precision–recall (PR) curves on BUSI. Each color denotes a class: C0 (benign), C1 (malignant), and C2 (normal). The top row shows UltraScanNet and its baseline, while the bottom row compares DeiT-T/16, ViT-S/16, and MaxViT-Tiny.
Figure 12.
Precision–recall trade-off per class for UltraScanNet. The curves are smoother across all classes, and C1 (malignant) maintains recall at higher thresholds, indicating more robust detection of malignant cases.
Figure 13.
Precision–recall trade-off per class for the MambaVision baseline model. Precision and recall are relatively stable for C0 (benign) and C2 (normal), but C1 (malignant) shows a sharper recall decline as the threshold increases.
Figure 14.
F1-macro mean with 95% confidence intervals across models on BUSI. UltraScanNet ranks highest, followed closely by ViT-S/16 and MaxViT-Tiny, while lighter convolutional models such as MobileNetV2 and EfficientNet-B0 achieve lower scores.
Figure 15.
Per-class recall with 95% confidence intervals on BUSI. Each color corresponds to a class: C0 (benign), C1 (malignant), C2 (normal). UltraScanNet achieves strong recall across all classes and a more balanced profile, while other competing models show sharper variations. The stars indicate the maximum values of per-class metrics.
Figure 16.
Per-class ROC–AUC with 95% confidence intervals on BUSI. Each color corresponds to a class: C0 (benign), C1 (malignant), C2 (normal). Most models achieve high AUC values above 90%, with UltraScanNet and the Transformer-based methods showing strong separability across all classes. The stars indicate the maximum values of per-class metrics.
Figure 17.
Confusion matrices on the BUSI validation split. Rows represent true labels and columns represent predicted labels for the three classes: benign, malignant, and normal.
Figure 18.
Model performance comparison on the BUS-UCLM dataset. Bars show top-1 accuracy, precision, recall, and F1-score for each model. UltraScanNet achieves the best recall, F1-score, and overall balance, while other models show stronger performance in individual metrics.
Figure 19.
Grad-CAM visualizations for representative BUSI samples. Red/yellow denote the regions that are the most influential for the model’s decisions.
Table 1.
Model performance on the BUSI dataset. Best value in each column is in bold.
| Model | Loss | Top-1 Accuracy (%) | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| MaxViT-Tiny [50] | 0.3465 | **91.67** | 0.9187 | 0.8915 | 0.9040 |
| DeiT-Tiny [31] | 0.4057 | 89.10 | **0.9263** | 0.8563 | 0.8764 |
| Swin-Tiny [32] | 0.3528 | 90.38 | 0.9074 | 0.8715 | 0.8864 |
| ConvNeXt-Tiny [28] | 0.3406 | 89.74 | 0.9000 | 0.8885 | 0.8941 |
| EfficientNet-B0 [51] | 0.4703 | 85.26 | 0.8439 | 0.8242 | 0.8314 |
| ViT-Small [23] | **0.3203** | **91.67** | 0.9120 | 0.9044 | 0.9073 |
| DenseNet-121 [27] | 0.3882 | 89.74 | 0.8906 | 0.8803 | 0.8850 |
| MobileNetV2 [52] | 0.4674 | 85.90 | 0.8460 | 0.8576 | 0.8514 |
| ResNet-50 [21] | 0.5047 | 85.90 | 0.8680 | 0.8286 | 0.8400 |
| MambaVision [25] | 0.3505 | 91.02 | 0.9241 | 0.8832 | 0.9003 |
| UltraScanNet (ours) | 0.3367 | **91.67** | 0.9072 | **0.9174** | **0.9096** |
Table 2.
Mean accuracy and standard deviation comparison between UltraScanNet and MambaVision.
| Model | Mean Accuracy (%) | Std. Dev. (%) |
|---|---|---|
| UltraScanNet (ours) | 91.03 | ±0.80 |
| MambaVision | 90.64 | ±0.62 |
Table 3.
Per-class performance on BUSI. Classes: C0 = benign, C1 = malignant, C2 = normal. Best value in each column is in bold.
| Model | Prec. C0 (%) | Prec. C1 (%) | Prec. C2 (%) | Rec. C0 (%) | Rec. C1 (%) | Rec. C2 (%) | F1 C0 (%) | F1 C1 (%) | F1 C2 (%) |
|---|---|---|---|---|---|---|---|---|---|
| UltraScanNet (ours) | **93.18** | 91.89 | 87.10 | 94.25 | 80.95 | **100.00** | 93.71 | 86.08 | **93.10** |
| MambaVision T2 (baseline) | 91.21 | 86.05 | **100.00** | 95.40 | **88.10** | 81.48 | 93.26 | 87.06 | 89.80 |
| ResNet-50 | 84.69 | 90.00 | 85.71 | 95.40 | 64.29 | 88.89 | 89.73 | 75.00 | 87.27 |
| MobileNetV2-1.0 | 89.53 | 78.05 | 86.21 | 88.51 | 76.19 | 92.59 | 89.02 | 77.11 | 89.29 |
| DenseNet-121 | 91.11 | 87.18 | 88.89 | 94.25 | 80.95 | 88.89 | 92.66 | 83.95 | 88.89 |
| ViT-S/16 | 92.22 | 92.11 | 89.29 | 95.40 | 83.33 | 92.59 | 93.79 | 87.50 | 90.91 |
| EfficientNet-B0 | 86.02 | 88.57 | 78.57 | 91.95 | 73.81 | 81.48 | 88.89 | 80.52 | 80.00 |
| ConvNeXt-Tiny | 89.89 | 87.80 | 92.31 | 91.95 | 85.71 | 88.89 | 90.91 | 86.75 | 90.57 |
| Swin-T (patch4, window7) | 89.47 | 94.29 | 88.46 | 97.70 | 78.57 | 85.19 | 93.41 | 85.71 | 86.79 |
| DeiT-T/16 | 85.29 | **100.00** | 92.59 | **100.00** | 64.29 | 92.59 | 92.06 | 78.26 | 92.59 |
| MaxViT-Tiny | 91.30 | 92.31 | 92.00 | 96.55 | 85.71 | 85.19 | **93.85** | **88.89** | 88.46 |
Table 4.
Per-class ROC–AUC and PR–AUC on BUSI (%). Classes: C0 = benign, C1 = malignant, C2 = normal. Best value in each column is in bold.
| Model | ROC–AUC C0 | ROC–AUC C1 | ROC–AUC C2 | PR–AUC C0 | PR–AUC C1 | PR–AUC C2 |
|---|---|---|---|---|---|---|
| UltraScanNet (ours) | 96.19 | 95.07 | **99.31** | 96.08 | 91.55 | **96.60** |
| MambaVision | 96.33 | 95.23 | 99.08 | 93.93 | **93.07** | 96.24 |
| ResNet-50 | 94.02 | 94.70 | 96.63 | 95.68 | 88.39 | 83.39 |
| MobileNetV2-1.0 | 93.49 | 92.44 | 98.19 | 94.47 | 86.67 | 93.78 |
| DenseNet-121 | **96.92** | **96.35** | 98.32 | **97.54** | 91.75 | 95.10 |
| ViT-S/16 | 96.44 | 95.97 | **99.31** | 96.10 | 92.63 | 96.13 |
| EfficientNet-B0 | 91.21 | 91.83 | 97.16 | 92.29 | 83.08 | 87.18 |
| ConvNeXt-Tiny | 94.44 | 93.98 | 98.88 | 91.68 | 92.89 | 92.70 |
| Swin-T (patch4, window7) | 93.29 | 89.81 | 98.22 | 90.97 | 89.33 | 94.58 |
| DeiT-T/16 | 94.79 | 93.90 | 98.91 | 93.92 | 89.97 | 94.22 |
| MaxViT-Tiny | 93.70 | 93.40 | 99.11 | 90.06 | 84.80 | **96.60** |
Table 5.
Sensitivity at 90% specificity for each class on BUSI (%). Classes: C0 = benign, C1 = malignant, C2 = normal. Best value in each column is in bold.
| Model | C0 | C1 | C2 |
|---|---|---|---|
| UltraScanNet (ours) | 81.61 | 80.95 | 92.59 |
| MambaVision T2 (baseline) | 89.66 | 88.10 | **96.30** |
| ResNet-50 | 80.46 | 80.95 | 88.89 |
| MobileNetV2-1.0 | 72.41 | 73.81 | 92.59 |
| DenseNet-121 | 86.21 | 80.95 | 92.59 |
| ViT-S/16 | **93.10** | 88.10 | 81.48 |
| EfficientNet-B0 | 73.56 | 73.81 | 85.19 |
| ConvNeXt-Tiny | 89.66 | **90.48** | **96.30** |
| Swin-T (patch4, window7) | 82.76 | 80.95 | 92.59 |
| DeiT-T/16 | 72.41 | 78.57 | **96.30** |
| MaxViT-Tiny | **93.10** | 85.71 | 85.19 |
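Sensitivity at a fixed specificity, as in the table above, comes from sweeping the decision threshold on one-vs-rest scores and keeping the best sensitivity among thresholds that still meet the specificity floor. A minimal sketch on toy data, not BUSI model outputs:

```python
def sensitivity_at_specificity(scores, labels, target_spec=0.90):
    """Best sensitivity achievable while specificity >= target.
    Predict positive when score >= threshold; sweep thresholds over the scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        specificity = sum(s < t for s in neg) / len(neg)
        if specificity >= target_spec:
            sensitivity = sum(s >= t for s in pos) / len(pos)
            best = max(best, sensitivity)
    return best

# Toy scores: 3 positives, 10 negatives.
scores = [0.9, 0.8, 0.6] + [0.7, 0.65, 0.3, 0.2, 0.1, 0.08, 0.05, 0.03, 0.02, 0.01]
labels = [1] * 3 + [0] * 10
s_at_90 = sensitivity_at_specificity(scores, labels)
```

With 10 negatives, specificity ≥ 0.90 means at most one false positive is tolerated at the chosen threshold.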
Table 6.
F1-macro mean and 95% confidence intervals on BUSI. Classes: C0 = benign, C1 = malignant, C2 = normal. Best value in each column is in bold.
| Model | F1-Macro Mean | 95% CI Low | 95% CI High |
|---|---|---|---|
| UltraScanNet (ours) | **0.909** | **0.860** | 0.953 |
| MambaVision T2 (baseline) | 0.899 | 0.841 | 0.949 |
| ResNet-50 | 0.837 | 0.773 | 0.899 |
| MobileNetV2-1.0 | 0.851 | 0.787 | 0.909 |
| DenseNet-121 | 0.882 | 0.823 | 0.939 |
| ViT-S/16 | 0.905 | 0.850 | **0.955** |
| EfficientNet-B0 | 0.830 | 0.762 | 0.890 |
| ConvNeXt-Tiny | 0.893 | 0.838 | 0.941 |
| Swin-T (patch4, window7) | 0.885 | 0.822 | 0.939 |
| DeiT-T/16 | 0.874 | 0.812 | 0.928 |
| MaxViT-Tiny | 0.901 | 0.846 | 0.951 |
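Confidence intervals like these are commonly obtained with a percentile bootstrap over the validation cases; the paper's exact resampling protocol is not restated here, so the following is a generic sketch under that assumption:

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for metric(y_true, y_pred), resampling cases."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]     # sample with replacement
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

def accuracy(yt, yp):
    return sum(a == b for a, b in zip(yt, yp)) / len(yt)

# Toy labels: 90% correct predictions on 100 cases.
lo, hi = bootstrap_ci([1] * 100, [1] * 90 + [0] * 10, accuracy)
```

The same routine applies to F1-macro, per-class recall, or AUC by swapping the `metric` callable.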
Table 7.
Per-class recall mean and 95% confidence intervals on BUSI. Classes: C0 = benign, C1 = malignant, C2 = normal. Best value in each column is in bold.
| Model | Mean C0 | Mean C1 | Mean C2 | CI Low C0 | CI Low C1 | CI Low C2 | CI High C0 | CI High C1 | CI High C2 |
|---|---|---|---|---|---|---|---|---|---|
| UltraScanNet (ours) | 94.25 | 81.16 | **100.00** | 88.76 | 68.42 | **100.00** | 98.78 | 92.68 | **100.00** |
| MambaVision T2 (baseline) | 95.41 | **88.30** | 81.22 | 90.53 | **77.78** | 65.51 | 98.91 | **97.37** | 95.24 |
| ResNet-50 | 95.39 | 64.20 | 88.79 | 90.47 | 48.89 | 75.00 | 98.94 | 78.38 | **100.00** |
| MobileNetV2-1.0 | 88.55 | 76.46 | 92.57 | 81.52 | 63.15 | 80.77 | 94.63 | 89.19 | **100.00** |
| DenseNet-121 | 94.25 | 81.07 | 88.48 | 88.46 | 68.08 | 75.00 | 98.80 | 92.68 | **100.00** |
| ViT-S/16 | 95.33 | 83.54 | 92.23 | 90.36 | 71.42 | 80.77 | 98.97 | 94.44 | **100.00** |
| EfficientNet-B0 | 91.97 | 73.90 | 81.43 | 85.71 | 60.45 | 65.52 | 96.77 | 86.85 | 95.84 |
| ConvNeXt-Tiny | 92.01 | 85.88 | 88.78 | 85.18 | 73.91 | 75.00 | 97.50 | 95.45 | **100.00** |
| Swin-T (patch4, window7) | 97.70 | 78.72 | 84.99 | 94.18 | 64.86 | 70.37 | **100.00** | 90.48 | 96.30 |
| DeiT-T/16 | **100.00** | 64.24 | 92.45 | **100.00** | 48.83 | 81.24 | **100.00** | 79.07 | **100.00** |
| MaxViT-Tiny | 96.55 | 85.85 | 84.68 | 92.13 | 74.42 | 70.00 | **100.00** | 95.35 | 96.43 |
Table 8.
Per-class ROC–AUC mean and 95% confidence intervals on BUSI. Classes: C0 = benign, C1 = malignant, C2 = normal. Best value in each column is in bold.
| Model | Mean C0 | Mean C1 | Mean C2 | CI Low C0 | CI Low C1 | CI Low C2 | CI High C0 | CI High C1 | CI High C2 |
|---|---|---|---|---|---|---|---|---|---|
| UltraScanNet (ours) | 96.24 | 95.10 | 99.29 | 92.81 | 90.69 | **98.21** | 98.70 | 98.32 | 99.94 |
| MambaVision T2 (baseline) | 96.34 | 95.25 | 99.05 | 92.62 | 89.83 | 97.68 | 98.94 | 98.95 | 99.83 |
| ResNet-50 | 94.02 | 94.63 | 96.60 | 90.14 | 90.76 | 93.76 | 97.27 | 97.76 | 98.89 |
| MobileNetV2-1.0 | 93.47 | 92.43 | 98.16 | 89.26 | 87.18 | 95.90 | 96.79 | 96.48 | 99.59 |
| DenseNet-121 | **96.91** | **96.32** | 98.22 | **94.51** | **93.28** | 95.39 | 98.86 | 98.55 | 99.87 |
| ViT-S/16 | 96.43 | 95.99 | **99.30** | 93.03 | 91.84 | 98.13 | **99.06** | 98.89 | **100.00** |
| EfficientNet-B0 | 91.24 | 91.85 | 97.15 | 86.00 | 86.08 | 94.45 | 95.35 | 96.30 | 98.99 |
| ConvNeXt-Tiny | 94.53 | 94.12 | 98.88 | 90.06 | 87.20 | 97.23 | 98.17 | **99.07** | **100.00** |
| Swin-T (patch4, window7) | 93.35 | 89.91 | 98.23 | 88.21 | 81.62 | 95.71 | 97.56 | 96.71 | 99.82 |
| DeiT-T/16 | 94.84 | 93.91 | 98.91 | 90.82 | 89.28 | 97.21 | 98.03 | 97.76 | **100.00** |
| MaxViT-Tiny | 93.71 | 93.40 | 99.10 | 88.26 | 86.66 | 97.63 | 98.27 | 98.62 | **100.00** |
Table 9.
BUS-UCLM (balanced) validation results. Best values in bold.
| Model | Top-1 (%) | Precision | Recall | F1 |
|---|---|---|---|---|
| UltraScanNet (ours) | 66.77 | 0.618 | **0.638** | **0.620** |
| MambaVision | 66.45 | 0.597 | 0.569 | 0.575 |
| ResNet-50 | 59.35 | 0.532 | 0.415 | 0.403 |
| MobileNetV2-1.0 | 54.19 | 0.438 | 0.412 | 0.413 |
| DenseNet-121 | 54.84 | 0.600 | 0.562 | 0.499 |
| ViT-S/16 (224) | 65.81 | 0.605 | 0.541 | 0.557 |
| EfficientNet-B0 | 56.13 | 0.465 | 0.442 | 0.446 |
| ConvNeXt-Tiny | **68.39** | **0.636** | 0.608 | 0.619 |
| Swin-T (224) | 64.52 | 0.605 | 0.523 | 0.534 |
| DeiT-Tiny/16 (224) | 63.87 | 0.624 | 0.504 | 0.515 |
| MaxViT-Tiny (224) | 65.81 | 0.615 | 0.566 | 0.575 |
Table 10.
The 95% confidence intervals (CI) for the macro F1-score and per-class recall on BUS-UCLM (out-of-the-box). Values are mean [low, high]. Best per column in bold.
| Model | F1-Macro | Recall C0 | Recall C1 | Recall C2 |
|---|---|---|---|---|
| UltraScanNet (ours) | **0.620 [0.558, 0.680]** | 0.759 [0.697, 0.824] | 0.459 [0.347, 0.567] | 0.703 [0.574, 0.827] |
| MambaVision | 0.575 [0.508, 0.638] | 0.800 [0.739, 0.859] | **0.626 [0.519, 0.724]** | 0.286 [0.171, 0.406] |
| ResNet-50 | 0.402 [0.344, 0.465] | **0.931 [0.891, 0.965]** | 0.181 [0.092, 0.268] | 0.134 [0.046, 0.245] |
| MobileNetV2-1.0 | 0.413 [0.355, 0.471] | 0.776 [0.713, 0.837] | 0.289 [0.190, 0.386] | 0.172 [0.075, 0.283] |
| DenseNet-121 | 0.498 [0.435, 0.557] | 0.605 [0.531, 0.677] | 0.253 [0.161, 0.344] | **0.831 [0.727, 0.925]** |
| ViT-S/16 (224) | 0.558 [0.496, 0.618] | 0.863 [0.809, 0.914] | 0.458 [0.354, 0.566] | 0.307 [0.186, 0.432] |
| EfficientNet-B0 | 0.446 [0.388, 0.505] | 0.766 [0.703, 0.828] | 0.376 [0.268, 0.477] | 0.192 [0.094, 0.309] |
| ConvNeXt-Tiny | 0.619 [0.554, 0.677] | 0.812 [0.751, 0.866] | 0.578 [0.464, 0.690] | 0.435 [0.300, 0.575] |
| Swin-T (224) | 0.533 [0.471, 0.596] | 0.891 [0.847, 0.937] | 0.301 [0.203, 0.402] | 0.381 [0.255, 0.513] |
| DeiT-Tiny/16 (224) | 0.516 [0.455, 0.577] | 0.909 [0.867, 0.949] | 0.267 [0.179, 0.365] | 0.345 [0.226, 0.471] |
| MaxViT-Tiny (224) | 0.575 [0.511, 0.638] | 0.789 [0.728, 0.845] | 0.625 [0.517, 0.733] | 0.286 [0.167, 0.408] |
Table 11.
Comparison of model complexity and inference time across CNN-, Transformer-, and Mamba-based architectures. FLOPs are computed for a 224 × 224 input.
| Model | Params (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|
| UltraScanNet (ours) | 36.48 | 5.59 | 5.84 |
| MambaVision | 34.44 | 5.12 | 5.42 |
| ResNet-50 | 23.51 | 4.13 | 2.16 |
| MobileNetV2-1.0 | 2.19 | 0.30 | 2.26 |
| DenseNet-121 | 6.87 | 2.83 | 5.80 |
| ViT-Small-Patch16-224 | 21.59 | 4.25 | 2.22 |
| EfficientNet-B0 | 3.97 | 0.38 | 2.82 |
| ConvNeXt-Tiny | 27.80 | 4.45 | 2.83 |
| Swin-Tiny | 27.50 | 4.37 | 4.45 |
| DeiT-Tiny-Patch16-224 | 5.49 | 1.08 | 2.21 |
| MaxViT-Tiny-RW-224 | 28.45 | 4.89 | 7.60 |
Table 12.
Ablation study: positional encoding.
| Variant | Top-1 Accuracy (%) |
|---|---|
| Patch Embedding + Learned Pos | 91.67 |
| Mamba + Attn | 91.03 |
| Simple Patch Embedding | 90.38 |
| Hybrid ConvNeXt | 83.33 |
| Hybrid | 82.69 |
| Posemb Patch1Stage | 82.69 |
| ConvNeXt + Attn | 82.05 |
| Inverted | 78.85 |
| Shallow Attn | 78.21 |
| Hybrid Dropout | 71.79 |
| Learned Pos + Attn | 66.03 |
Table 13.
Ablation study: Stage 1 configurations. Best value in bold.
| Variant | Top-1 Accuracy (%) |
|---|---|
| **ConvBlock** | |
| Patch Embed + Learned Pos + ConvBlock + PosEnc | **91.67** |
| Patch Embed + Mamba + Attn + ConvBlock + PosEnc | 89.10 |
| Patch Embed + ConvBlock + PosEnc | 89.10 |
| **SE Conv** | |
| Patch Embed + SE Conv | 88.46 |
| Patch Embed + Learned Pos + SE Conv | 88.46 |
| Patch Embed + Mamba + Attn + SE Conv | 88.46 |
| **ConvNeXt** | |
| Patch Embed + Mamba + Attn + ConvBlock + ConvNeXt | 85.26 |
| Patch Embed + ConvBlock + ConvNeXt | 82.69 |
| Patch Embed + Learned Pos + ConvBlock + ConvNeXt | 82.05 |
| **ConvBlock + LayerNorm** | |
| Patch Embed + Learned Pos + ConvBlock + LN + PosEnc | 84.62 |
| Patch Embed + ConvBlock + LN + PosEnc | 83.33 |
| Patch Embed + Mamba + Attn + ConvBlock + LN + PosEnc | 82.05 |
| **Mamba Simple** | |
| Patch Embed + Mamba Simple | 83.97 |
| Patch Embed + Learned Pos + Mamba Simple | 81.41 |
| Patch Embed + Mamba + Attn + Mamba Simple | 81.41 |
| **ConvMixer** | |
| Patch Embed + Mamba + Attn + ConvMixer | 82.05 |
| Patch Embed + Learned Pos + ConvMixer | 82.05 |
| Patch Embed + ConvMixer | 81.41 |
| **CoordConv** | |
| Patch Embed + CoordConv | 80.77 |
| **Mamba Hybrid** | |
| Patch Embed + Learned Pos + Mamba Hybrid | 80.77 |
Table 14.
Ablation study: Stage 2 block type comparison. Best value in bold.
| Stage 2 Block Type | Top-1 Accuracy (%) |
|---|---|
| Hybrid | **91.67** |
| ResMamba | 83.97 |
| ConvNeXt | 78.21 |
Table 15.
Mixer scheduling ablation on BUSI (sorted by top-1). Best value in bold.
| Arrangement | Top-1 (%) |
|---|---|
| Depth-aware hybrid (USM→CAM→MHSA) | **91.67** |
| USM → CAM×3 (center) → MHSA | 91.03 |
| USM/CAM alternating + MHSA last | 90.38 |
| All-USM (scaled late) | 90.38 |
| USM → CAM×2 (center) → MHSA | 90.38 |
| USM + every 4th block MHSA | 89.74 |
| USM → CAM×2 (center) → USM → MHSA×2 (tail) | 89.74 |
| MHSA first → USM → CAM (tail) | 89.74 |
| All-MHSA (head scaling late) | 89.74 |
| All-USM (constant) | 89.74 |
| All-MHSA | 89.10 |
| MHSA at edges, CAM in the middle | 89.10 |
| USM body → MHSA×2 (tail) | 88.46 |
| Reversed (MHSA→CAM→USM) | 87.82 |
| All-CAM | 86.54 |