This section presents a comprehensive evaluation of the proposed framework from two perspectives: (i) an ablation study assessing the contribution of each architectural component and the impact of varying proportions of weak and strong annotations, and (ii) a comparative analysis against state-of-the-art segmentation methods. Experiments are conducted on a crack detection dataset (RCFD) and a skin lesion segmentation dataset (ISIC 2018), enabling performance assessment under different annotation budgets and across structurally distinct domains.
5.1. Ablation Study
We perform an ablation study to quantify the contribution of the proposed bounding box encoder and to evaluate the effect of varying proportions of weak (bounding box) and strong (pixel-level) annotations. The baseline model consists of a ConvNeXt-Base encoder and a U-Net-style decoder trained exclusively with strong annotations. Our full model augments this baseline with the bounding box encoder, which modulates encoder features using spatial attention maps derived from weak masks.
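Since the architecture is described only in prose here, the following is a minimal, illustrative PyTorch sketch of how a bounding box encoder of this kind could modulate backbone features with a spatial attention map derived from a rasterized box mask. The module name `BoxAttentionGate`, the layer sizes, and the residual gating are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxAttentionGate(nn.Module):
    """Illustrative sketch: modulate encoder features with a spatial attention
    map derived from a bounding-box (weak) mask. Layer sizes are hypothetical."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        # Small CNN that maps the 1-channel box mask to a 1-channel attention map.
        self.box_encoder = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
        # feats:    (B, C, H, W) features from an encoder stage (e.g., ConvNeXt-Base)
        # box_mask: (B, 1, H0, W0) binary mask rasterized from bounding boxes
        box_mask = F.interpolate(box_mask, size=feats.shape[-2:], mode="nearest")
        attn = torch.sigmoid(self.box_encoder(box_mask))   # (B, 1, H, W) in (0, 1)
        return feats + feats * attn                        # residual modulation
```

A gate of this kind could be applied at one or more encoder stages before the features enter the U-Net-style decoder; images without a weak mask can simply bypass the gate.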
Table 2 and Table 3 report the results on the RCFD and ISIC 2018 datasets, respectively. For both datasets, incorporating the bounding box encoder improves performance across all metrics, even when the proportion of strong labels is small. We further examine annotation coverage by varying the strong-label ratio and the weak-label (bounding-box) ratio, and report the resulting performance to quantify the trade-off between annotation cost and accuracy.
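As an illustration of how such annotation budgets could be constructed, the sketch below assigns a strong (pixel-level) label to a random fraction of images while keeping bounding-box masks for all of them. The function name, fixed seed, and sampling scheme are hypothetical; the paper's actual split procedure may differ.

```python
import random

def make_supervision_split(image_ids, strong_ratio=0.1, seed=0):
    """Mark each image as 'strong' (pixel-level mask) or 'weak' (box mask only).
    Illustrative only; every image keeps its bounding-box annotation."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_strong = int(round(strong_ratio * len(ids)))
    strong_ids = set(ids[:n_strong])
    return {img_id: ("strong" if img_id in strong_ids else "weak") for img_id in ids}

# Example: a 10% strong / 100% weak budget, as in the low-cost setting discussed below.
split = make_supervision_split(range(1000), strong_ratio=0.10)
```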
On the RCFD dataset, using only 10% strong annotations and weak masks for all images achieves an IoU of 61.97%, which is 1.48% lower than the fully supervised baseline (63.45%), and an F1-score of 76.52%, which is just 1.12% lower than the baseline (77.64%). The mIoU difference is minimal at −0.82% (80.15% vs. 80.97%). With 20% strong annotations, performance improves significantly to an IoU of 69.29% (+5.84%), F1-score of 81.86% (+4.22%), and mIoU of 83.99% (+3.02%) compared to the baseline. Further increases in strong annotations lead to consistent improvements, reaching 87.81% mIoU with full supervision.
A similar trend is observed on ISIC 2018. With only 10% strong annotations and full weak masks, our method achieves an IoU of 87.41% (+8.06%), F1-score of 93.28% (+4.79%), and mIoU of 90.92% (+5.61%) compared to the fully supervised baseline. Using 20% strong annotations, the IoU rises to 87.95% (+8.60%), F1-score to 93.59% (+5.10%), and mIoU to 91.29% (+5.98%). The best performance of 92.38% mIoU is obtained with 100% strong and weak annotations.
Overall, these results confirm that: (i) the bounding box encoder consistently boosts segmentation accuracy across IoU, Precision, Recall, F1-score, and mIoU; and (ii) the proposed framework achieves performance close to, or better than, full supervision with as little as 10–20% strong annotations, reducing annotation cost without sacrificing quality.
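For reference, the reported metrics can be computed from binary prediction and ground-truth masks as in the sketch below, where mIoU averages the foreground and background IoU. This is a generic formulation under our assumptions and may differ in detail from the protocol of Section 4.4.

```python
import numpy as np

def binary_seg_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> dict:
    """IoU, Precision, Recall, F1, and mIoU for binary masks with values in {0, 1}."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    iou = tp / (tp + fp + fn + eps)            # foreground IoU
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou_bg = tn / (tn + fp + fn + eps)         # background IoU
    return {"IoU": iou, "Precision": precision, "Recall": recall,
            "F1": f1, "mIoU": (iou + iou_bg) / 2}
```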
5.2. Comparative Experiments with the State of the Art
We compare the proposed framework with fully supervised segmentation methods and recent semi-/weakly supervised approaches that utilize both weak (bounding box) and strong (pixel-level) annotations.
Table 4 and Table 5 summarize the results for the RCFD and ISIC 2018 datasets. For methods whose original papers did not report results on RCFD or ISIC 2018 (e.g., EfficientCrackNet [36], Self-Correcting [29], DCR [37], Strong-Weak [38], Xiong et al. [39]), we re-trained the authors' implementations (when available) or faithful re-implementations on our data splits, using the unified configuration described in Section 4.3 and the evaluation protocol in Section 4.4.
RCFD dataset: Among fully supervised methods, CrackRefineNet [32] achieves the highest score with 65.41% IoU, 79.09% F1-score, and 81.97% mIoU. Our framework, with only 10% strong labels and full weak annotations, achieves 61.97% IoU, 76.52% F1-score, and 80.15% mIoU, just 3.44% IoU and 2.57% F1 below CrackRefineNet despite using only one-tenth of the strong annotations, while outperforming the best mixed-supervision competitor (Self-Correcting [29]) by +13.83% IoU, +11.53% F1, and +7.23% mIoU.
Table 4. Comparison with state-of-the-art methods on the RCFD dataset. Weak = bounding box annotations, Strong = pixel-level annotations. Bold values denote the highest results.
| Model | Supervision Type | Weak | Strong | IoU (%) | Precision (%) | Recall (%) | F1-Score (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| U-Net [1] | Full | – | 100% | 48.82 | 58.42 | 74.82 | 65.61 | 73.03 |
| MobileNetv3 [40] | Full | – | 100% | 51.69 | 71.57 | 65.04 | 68.15 | 74.78 |
| SwinT [41] | Full | – | 100% | 53.94 | 73.84 | 66.68 | 70.08 | 75.97 |
| EfficientCrackNet [36] | Full | – | 100% | 35.47 | 39.42 | 77.93 | 52.36 | 65.24 |
| CrackMaster [42] | Full | – | 100% | 63.53 | 79.37 | 76.09 | 77.70 | 81.00 |
| CrackRefineNet [32] | Full | – | 100% | 65.41 | 79.20 | 78.98 | 79.09 | 81.97 |
| Self-Correcting [29] | Mixed (Box + Full) | 100% | 10% | 48.14 | 69.65 | 60.92 | 64.99 | 72.92 |
| Macro-Micro [43] | Mixed (Box + Full) | 100% | 10% | 40.35 | 63.16 | 52.78 | 57.50 | 68.82 |
| DCR [37] | Mixed (Box + Full) | 100% | 10% | 40.62 | 64.88 | 52.07 | 57.77 | 68.98 |
| Strong-Weak [38] | Mixed (Box + Full) | 100% | 10% | 24.39 | 26.25 | 77.47 | 39.22 | 57.97 |
| Xiong et al. [39] | Mixed (Box + Full) | 100% | 10% | 16.99 | 18.35 | 69.66 | 29.05 | 52.53 |
| Ours | Mixed (Box + Full) | 100% | 10% | 61.97 | 75.75 | 77.30 | 76.52 | 80.15 |
| Ours | Mixed (Box + Full) | 100% | 20% | **69.29** | **80.24** | **83.54** | **81.86** | **83.99** |
When increasing to 20% strong labels, the performance reaches 69.29% IoU, 81.86% F1-score, and 83.99% mIoU, surpassing CrackRefineNet by +3.88% IoU, +2.77% F1, and +2.02% mIoU, and significantly outperforming all other baselines.
Table 5. Comparison with state-of-the-art methods on the ISIC 2018 dataset. Weak = bounding box annotations, Strong = pixel-level annotations. Bold values denote the highest results.
| Model | Supervision Type | Weak | Strong | IoU (%) | Precision (%) | Recall (%) | F1-Score (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| U-Net [1] | Full | – | 100% | 58.39 | 77.41 | 70.38 | 73.73 | 69.81 |
| MobileNetv3 [40] | Full | – | 100% | 78.41 | 89.68 | 86.18 | 87.90 | 84.49 |
| SwinT [41] | Full | – | 100% | 78.31 | 90.90 | 84.97 | 87.84 | 84.50 |
| EfficientCrackNet [36] | Full | – | 100% | 71.40 | 80.42 | 86.43 | 83.31 | 78.81 |
| CrackMaster [42] | Full | – | 100% | 81.81 | 92.15 | 87.94 | 89.99 | 86.99 |
| CrackRefineNet [32] | Full | – | 100% | 83.11 | 89.97 | 91.60 | 90.78 | 87.77 |
| Self-Correcting [29] | Mixed (Box + Full) | 100% | 10% | 74.81 | 86.80 | 84.41 | 85.59 | 81.79 |
| Macro-Micro [43] | Mixed (Box + Full) | 100% | 10% | 71.98 | 85.15 | 82.31 | 83.70 | 79.71 |
| DCR [37] | Mixed (Box + Full) | 100% | 10% | 73.56 | 86.81 | 82.82 | 84.77 | 80.94 |
| Strong-Weak [38] | Mixed (Box + Full) | 100% | 10% | 66.28 | 79.47 | 79.97 | 79.72 | 75.25 |
| UCMT (U-Net) [44] | Mixed (Box + Full) | 100% | 10% | 69.81 | 84.75 | 81.60 | 83.33 | 80.67 |
| DSBD [45] | Mixed (Box + Full) | 100% | 10% | 71.42 | 84.71 | 87.54 | 86.31 | 78.05 |
| Xiong et al. [39] | Mixed (Box + Full) | 100% | 10% | 65.00 | 82.56 | 75.34 | 78.79 | 74.77 |
| EGE-UDSMT [46] | Mixed (Box + Full) | 100% | 10% | 74.62 | 89.65 | 86.36 | 88.65 | 81.63 |
| Ours | Mixed (Box + Full) | 100% | 10% | 87.41 | **92.35** | 94.23 | 93.28 | 90.92 |
| Ours | Mixed (Box + Full) | 100% | 20% | **87.95** | 91.99 | **95.24** | **93.59** | **91.29** |
ISIC 2018 dataset: In the fully supervised setting, CrackRefineNet [32] achieves the highest score with 83.11% IoU, 90.78% F1-score, and 87.77% mIoU. Among mixed-supervision methods using 10% strong labels, Self-Correcting [29] achieves 74.81% IoU, 85.59% F1-score, and 81.79% mIoU, while EGE-UDSMT [46] achieves 74.62% IoU, 88.65% F1-score, and 81.63% mIoU.
Our framework, with 10% strong labels and full weak annotations, achieves 87.41% IoU, 93.28% F1-score, and 90.92% mIoU—outperforming Self-Correcting by +12.60% IoU, +7.69% F1, and +9.13% mIoU, and surpassing EGE-UDSMT by +12.79% IoU, +4.63% F1, and +9.29% mIoU. Compared to the best fully supervised model (CrackRefineNet), our method improves by +4.30% IoU, +2.50% F1, and +3.15% mIoU, despite using only one-tenth of the strong annotations.
When increasing to 20% strong labels, our method further improves to 87.95% IoU, 93.59% F1-score, and 91.29% mIoU, achieving the best IoU, Recall, F1-score, and mIoU among all compared methods. The IoU is +4.84% higher than CrackRefineNet and the mIoU is +3.52% higher, demonstrating that the proposed framework scales effectively with additional strong labels while maintaining a substantial lead over both mixed- and fully supervised baselines.
These results show that the proposed framework not only bridges the performance gap between weakly and fully supervised segmentation but also outperforms state-of-the-art fully supervised models in multiple settings while using as little as 10–20% of strong annotations. This confirms the effectiveness of the bounding box encoder in leveraging weak supervision to achieve competitive and even superior performance across different domains and metrics.
5.3. Qualitative Analysis of Results
To further illustrate the effectiveness of the proposed framework, Figure 5 presents qualitative comparisons on both the RCFD and ISIC 2018 datasets. Each column shows the input image, ground truth mask, and segmentation outputs from representative state-of-the-art methods alongside our approach with 10% and 20% strong annotations. Blue regions correspond to true positives, red regions indicate false positives, and green regions mark false negatives, with the best IoU scores highlighted in yellow.
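The color coding of Figure 5 can be reproduced with a simple overlay routine such as the hedged sketch below (blue = true positive, red = false positive, green = false negative); the exact colors and blending factor used for the figure are assumptions on our part.

```python
import numpy as np

def error_overlay(image: np.ndarray, pred: np.ndarray, gt: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Blend TP (blue), FP (red), and FN (green) regions onto an RGB image.
    image: (H, W, 3) uint8; pred, gt: (H, W) binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    out = image.astype(np.float32).copy()
    regions = [
        (np.logical_and(pred, gt),  np.array([0.0, 0.0, 255.0])),   # TP -> blue
        (np.logical_and(pred, ~gt), np.array([255.0, 0.0, 0.0])),   # FP -> red
        (np.logical_and(~pred, gt), np.array([0.0, 255.0, 0.0])),   # FN -> green
    ]
    for mask, color in regions:
        out[mask] = (1.0 - alpha) * out[mask] + alpha * color
    return out.astype(np.uint8)
```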
On the RCFD dataset, our method produces more precise crack delineations with fewer spurious predictions compared to baselines such as Strong-Weak [38] and Xiong et al. [39], which tend to generate noisy or fragmented masks. Even under limited supervision (10% strong), our framework captures fine crack structures and suppresses background noise more effectively than fully supervised models such as U-Net [1] and MobileNetv3 [40]. Increasing the proportion of strong annotations to 20% further refines the predictions, yielding continuous and complete crack regions that closely match the ground truth.
On ISIC 2018, lesion boundaries obtained by our method align more faithfully with the ground truth masks than those of competing approaches. For example, Macro-Micro [43] and DCR [37] often under-segment lesion areas, while Self-Correcting [29] and SwinT [41] occasionally introduce false positives along the lesion border. In contrast, our framework with only 10% strong labels already achieves compact and accurate delineations, and the 20% setting produces the sharpest lesion boundaries with minimal false positives and negatives.
These visual results reinforce the quantitative findings: the bounding box encoder effectively leverages weak supervision to guide feature learning, enabling the model to outperform both fully and weakly supervised baselines while reducing annotation requirements.
5.4. Efficiency Analysis
To complement the quantitative and qualitative results, we report the model size and inference efficiency of the compared segmentation models in terms of trainable parameters, per-image inference time, and floating-point operations (FLOPs), measured under the implementation setup described in Section 4.3. The results are summarized in Table 6.
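As an indication of how the parameter counts and per-image inference times in Table 6 can be obtained, the sketch below uses standard PyTorch utilities; the input resolution, warm-up scheme, and run count are our assumptions rather than the exact procedure of Section 4.3, and FLOPs would additionally require an external profiler (e.g., fvcore or thop).

```python
import time
import torch

@torch.no_grad()
def measure_efficiency(model: torch.nn.Module, input_size=(1, 3, 512, 512),
                       warmup: int = 10, runs: int = 50, device: str = "cuda"):
    """Return (trainable parameter count, average per-image inference time in seconds).
    FLOPs are not computed here; an external profiler would be needed."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):          # warm-up iterations to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_params, (time.perf_counter() - start) / runs
```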
Lightweight baselines such as MobileNetv3 [40] and EfficientCrackNet [36] have small model sizes (3.28 M and 0.35 M parameters, respectively) and very low FLOP counts. However, this efficiency comes at the cost of accuracy, as confirmed in the preceding evaluations. In contrast, fully supervised architectures such as SwinT [41] and CrackRefineNet [32] achieve higher accuracy but at a higher compute and model-size cost, requiring up to 99.66 M parameters and 0.206 T FLOPs, with longer inference times.
Among semi-supervised frameworks, the Self-Correcting network [29] achieves a relatively low compute footprint (48.77 M parameters and 0.061 T FLOPs) owing to its streamlined design, but this reduced budget comes with limitations in segmentation accuracy. Our proposed model, which integrates a ConvNeXt-Base backbone with a lightweight U-Net-style CNN decoder and a dedicated bounding box encoder, has a higher parameter count (92.67 M) but keeps computation tractable at 0.092 T FLOPs. Importantly, its inference time (0.0201 s per image) remains close to that of U-Net (0.0192 s) and is substantially shorter than that of CrackRefineNet (0.0387 s), despite the stronger backbone.
This efficiency analysis highlights that the proposed framework strikes an effective balance between accuracy and efficiency in the semi-supervised setting. By leveraging bounding box guidance through the encoder, our model achieves superior segmentation performance while remaining efficient at inference, demonstrating its suitability for both research and real-world deployment.
5.5. Analysis of Failure Cases
Although the proposed framework achieves strong overall performance, certain limitations remain in challenging scenarios. Figure 6 highlights representative failure cases from the RCFD and ISIC 2018 datasets. The main sources of error can be grouped into three categories.
First, boundary ambiguity arises in regions where cracks or lesion edges are poorly defined. In such cases, the model may either under-segment the structure or include irrelevant background, leading to false positives or false negatives. Second, low-contrast or fine-scale details are sometimes missed, particularly when only a small fraction of strong annotations is available. This results in incomplete predictions, with thin or subtle structures not being fully captured. Finally, textured or noisy regions occasionally cause the model to over-segment, producing spurious detections that do not correspond to true target regions.
Despite these challenges, the overall alignment between predictions and ground truth remains high, even with as little as 10–20% strong supervision. These observations suggest that while the bounding box encoder effectively leverages weak annotations, future work could focus on improving boundary refinement and robustness to noise, for example through boundary-aware losses or post-processing modules.