Our framework extends the mainstream Mean Teacher (MT) architecture by integrating hierarchical contrastive learning for the dual-student framework and an independence-aware model update strategy. These components are complemented by uncertainty-guided dynamic data augmentation and fractal-dimension-weighted consistency regularization. Collectively, they aim to comprehensively enhance the robustness, generalization capability, and feature discriminative power of semi-supervised learning across three dimensions: model updates, data perturbations, and feature learning.
3.1. Independence-Aware Exponential Moving Average
The traditional Mean Teacher (MT) framework relies solely on the prediction consistency between the student network and its teacher network updated via exponential moving average (EMA). This approach is susceptible to model confirmation bias when handling unlabeled data, with the dependency between networks increasing as training iterations progress, leading to cumulative prediction errors. To address this challenge, we introduce an auxiliary student model to provide multi-perspective segmentation predictions. Furthermore, we propose an independence-aware exponential moving average (I-EMA) update strategy based on a linear-independence metric between networks for aggregating and updating teacher model parameters within the dual-student framework.
Intuitively, if one of the two student models exhibits a higher degree of independence from the teacher model in the feature space, it indicates that this student captures complementary representations that are absent in the teacher. Consequently, the teacher should preferentially absorb knowledge from this student to enhance the overall representational diversity and reduce the risk of model collapse. The proposed I-EMA mechanism incorporates such independence information into the exponential moving average update, where a linear-independence-based proxy guides the teacher’s parameter aggregation in a more informed and directional manner.
In the semi-supervised learning (SSL) paradigm for medical image segmentation, the training dataset is inherently imbalanced. For the current task, we construct the dataset to include $N$ labeled samples and a substantially larger set of $M$ unlabeled samples, satisfying $N \ll M$. The training data fed into the network are formally defined as the labeled data $\mathcal{D}_{l}=\{(x_{i}^{l}, y_{i}^{l})\}_{i=1}^{N}$ and the unlabeled data $\mathcal{D}_{u}=\{x_{i}^{u}\}_{i=1}^{M}$. Each input medical volume $x$ is represented as a three-dimensional volume with spatial dimensions $H \times W \times D$, and the corresponding ground-truth annotation $y$ is a binary segmentation mask, denoted as $y \in \{0,1\}^{H \times W \times D}$. Our method employs two distinct student segmentation networks, parameterized by $\theta_{s_1}$ and $\theta_{s_2}$, which are denoted as $f_{s_1}(\cdot;\theta_{s_1})$ and $f_{s_2}(\cdot;\theta_{s_2})$, respectively. To enhance robustness and enable contrastive representation learning, we apply different data augmentation strategies to the input samples. Specifically, for the student network $f_{s_1}$, we generate a strongly augmented view using the transformation $\mathcal{A}_{s}(\cdot)$; for the student network $f_{s_2}$, we construct a weakly augmented view using $\mathcal{A}_{w}(\cdot)$. The segmentation outputs of the two student networks are denoted as $P_{l}$ and $P_{u}$, where the subscripts $l$ and $u$ indicate the predictions corresponding to labeled and unlabeled inputs, respectively. In particular, the predictions on the unlabeled data are defined as follows, where the image fed to the strong branch is first subjected to weak augmentations such as flipping and cropping, followed by additional strong augmentation perturbations:
$$P_{u}^{s_1} = f_{s_1}\!\big(\mathcal{A}_{s}(\mathcal{A}_{w}(x^{u}));\theta_{s_1}\big), \qquad P_{u}^{s_2} = f_{s_2}\!\big(\mathcal{A}_{w}(x^{u});\theta_{s_2}\big),$$
where the functions $f_{s_1}$ and $f_{s_2}$ denote the two student models subjected to strong and weak perturbations, respectively. Together, they collaboratively capture the essential latent features of the volumetric input data.
In teacher–student frameworks, the exponential moving average (EMA) is widely used to improve the stability and generalization ability of the teacher model. Its core idea is to update the teacher parameters by exponentially smoothing the parameters of the student model, which can be expressed as
$$\theta_{t}^{T} = \alpha\,\theta_{t-1}^{T} + (1-\alpha)\,\theta_{t}^{S},$$
where $\alpha$ is the smoothing coefficient, and $\theta_{t}^{T}$ and $\theta_{t}^{S}$ denote the parameters of the teacher and student models at iteration $t$, respectively. This strategy effectively smooths the high-frequency noise during student training, allowing the teacher model to provide more stable pseudo-labels in semi-supervised or self-supervised learning.
However, the conventional EMA only performs temporal smoothing without considering the structural independence between networks in the representation space. Previous studies have shown that the main limitation of self-training and Mean Teacher frameworks lies in the strong dependency between the auxiliary and primary networks [42]. When the representations of two networks become overly similar, the teacher updates tend to be redundant, reducing the diversity of feature representations and constraining the model’s generalization ability.
To alleviate this problem, we introduce the network independence metric proposed in CauSSL [25] into our dual-student framework, which serves as a computable proxy for the algorithmic independence metric based on Kolmogorov complexity [43]. This metric follows the principle of independent causal mechanisms (ICM) [44] and is implemented according to the minimum description length principle [45]. The detailed implementation is presented in the remainder of this section.
We first map the learned weights of the neural network onto quantifiable mathematical objects. While the 3D convolution operation involves spatial sliding and filtering, its essence, from an algebraic perspective, is a high-dimensional inner product between the input feature patches and the convolutional kernel weights. We construct a pattern basis vector for this mapping; for a 3D convolutional layer, a kernel’s weight dimension is $C_{in} \times k_{h} \times k_{w} \times k_{d}$. We flatten this multi-dimensional weight block into a $d = C_{in} \cdot k_{h} \cdot k_{w} \cdot k_{d}$ dimensional vector $v$. In linear algebra, $v$ serves as a basis vector, representing the specific feature template learned by the network for that output channel. The entire convolutional layer comprises $C_{out}$ independent kernels. The detailed procedure is shown in Figure 3. By stacking these $C_{out}$ basis vectors along the row dimension, we obtain a convolutional layer weight matrix $W \in \mathbb{R}^{C_{out} \times d}$. Each row $w_{j}$ of this matrix is a $d$-dimensional basis vector for feature detection.
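This flattening step can be sketched directly in PyTorch. The snippet below is a minimal illustration only: the function name conv_layer_to_basis_matrix and the example layer sizes are ours, not part of the method itself.

```python
import torch
import torch.nn as nn

def conv_layer_to_basis_matrix(conv: nn.Conv3d) -> torch.Tensor:
    """Flatten each 3D kernel of a convolutional layer into a row vector.

    The weight tensor has shape (C_out, C_in, k_d, k_h, k_w); each output
    channel's kernel becomes a d-dimensional basis vector, and the C_out
    vectors are stacked as rows of the layer weight matrix W.
    """
    w = conv.weight.detach()          # (C_out, C_in, k_d, k_h, k_w)
    return w.reshape(w.shape[0], -1)  # (C_out, d), d = C_in * k_d * k_h * k_w

# Example: a hypothetical 3D convolutional layer from a segmentation backbone.
layer = nn.Conv3d(in_channels=16, out_channels=32, kernel_size=3)
W = conv_layer_to_basis_matrix(layer)
print(W.shape)  # torch.Size([32, 432]) -> 32 basis vectors of dimension 16*3*3*3
```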
By converting the network weights into the matrix $W$, we transform the problem of network representational redundancy and independence into an analysis of matrix properties. According to the minimum description length (MDL) principle, if two networks $A$ and $B$ are algorithmically independent, the total length required to describe them equals the sum of their separate description lengths. In linear algebra, this is expressed by the matrix rank additivity property:
$$\operatorname{rank}\!\left(\begin{bmatrix} W_{A} \\ W_{B} \end{bmatrix}\right) = \operatorname{rank}(W_{A}) + \operatorname{rank}(W_{B}).$$
Specifically, we define the single-layer dependence $D^{(k)}$ using the following formula:
$$D^{(k)} = \frac{1}{C_{out}} \sum_{j=1}^{C_{out}} \cos^{2}\!\big(w_{j}^{s}, \hat{w}_{j}\big),$$
where $w_{j}^{s}$ is the $j$-th basis vector in the student network weight matrix $W_{s}^{(k)}$, and $\hat{w}_{j}$ is the optimal linear combination approximation vector of $w_{j}^{s}$ in the teacher network’s feature space, calculated as $\hat{w}_{j} = \sum_{i} \Gamma_{ji}\, w_{i}^{T}$.
To find the most suitable set of linear coefficient matrices $\Gamma$, we first fix the parameters of the student network and the teacher network in the current iteration $t$. Our objective is to find the $\Gamma$ that maximizes the dependence between the two networks, which is equivalent to finding the $\Gamma$ that minimizes the linear approximation error. Once this optimal $\Gamma$ is obtained, we fix it and then adjust the network parameters so as to maximize independence.
Based on the single-layer dependence, we define the independence proxy $I(\theta_{s}, \theta_{T})$ of the student network $f_{s}$ relative to the teacher network $f_{T}$ as
$$I(\theta_{s}, \theta_{T}) = 1 - \frac{1}{K} \sum_{k=1}^{K} D^{(k)}\!\big(W_{s}^{(k)}, W_{T}^{(k)}\big),$$
where $W_{s}^{(k)}$ and $W_{T}^{(k)}$ are the weight parameters of the $k$-th layer in matrix form, and $K$ is the number of convolutional layers; only convolutional layers are considered here.
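The proxy can be sketched as follows. This is a minimal illustration under two assumptions: the student and teacher share the same architecture (so their Conv3d layers can be paired), and the coefficient matrix Γ is obtained here by a closed-form least-squares fit rather than by gradient-based optimization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def layer_dependence(W_s: torch.Tensor, W_t: torch.Tensor) -> torch.Tensor:
    """Single-layer dependence: mean squared cosine similarity between each
    student basis vector and its best linear approximation in the row space
    of the teacher matrix."""
    # Least-squares fit: find Gamma with Gamma @ W_t ≈ W_s (each student row
    # approximated by a linear combination of the teacher's rows).
    gamma = torch.linalg.lstsq(W_t.T, W_s.T).solution.T
    w_hat = gamma @ W_t
    cos = F.cosine_similarity(W_s, w_hat, dim=1)
    return (cos ** 2).mean()

def independence_proxy(student: nn.Module, teacher: nn.Module) -> torch.Tensor:
    """I = 1 - average single-layer dependence over matching Conv3d layers
    (assumes the two modules have identical layer layouts)."""
    deps = []
    for m_s, m_t in zip(student.modules(), teacher.modules()):
        if isinstance(m_s, nn.Conv3d) and isinstance(m_t, nn.Conv3d):
            W_s = m_s.weight.detach().flatten(1)
            W_t = m_t.weight.detach().flatten(1)
            deps.append(layer_dependence(W_s, W_t))
    return 1.0 - torch.stack(deps).mean()
```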
The I-EMA strategy guides the parameter update of the teacher network $f_{T}$ in every iteration $t$ by first calculating the independence weights $\beta_{1}$ and $\beta_{2}$. We first compute the independence proxy values for the two students $f_{s_1}$ and $f_{s_2}$ relative to the current teacher $f_{T}$:
$$I_{1} = I(\theta_{s_1}, \theta_{T}), \qquad I_{2} = I(\theta_{s_2}, \theta_{T}).$$
Subsequently, $I_{1}$ and $I_{2}$ are normalized using the Softmax function to generate the dynamic weights $\beta_{1}$ and $\beta_{2}$ that guide the aggregation. The Softmax operation ensures that the network with higher independence receives a larger weight:
$$\beta_{i} = \frac{\exp(I_{i}/\tau)}{\exp(I_{1}/\tau) + \exp(I_{2}/\tau)}, \quad i \in \{1, 2\},$$
where $\tau$ is a temperature parameter used to adjust the strength of the independence difference’s impact on the final weights; we set it to 1.1. We directly integrate the independence-weighted aggregation result into the teacher model’s traditional EMA update equation. This parameter update strategy dynamically weights and aggregates knowledge from the students, ensuring that the teacher model prioritizes absorbing knowledge with complementary representations. The final unified I-EMA parameter update equation is
$$\theta_{t+1}^{T} = \alpha\,\theta_{t}^{T} + (1-\alpha)\big(\beta_{1}\,\theta_{t}^{s_1} + \beta_{2}\,\theta_{t}^{s_2}\big),$$
where $\theta_{t+1}^{T}$ is the teacher’s parameters at the next time step, $\theta_{t}^{T}$ is the current teacher’s parameters, and $\alpha$ is the traditional EMA smoothing factor. This equation merges the current independence guidance $\beta_{1}\theta_{t}^{s_1} + \beta_{2}\theta_{t}^{s_2}$ with the historical knowledge $\theta_{t}^{T}$.
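The complete update can be summarized in a few lines of PyTorch. In this sketch, i1 and i2 are assumed to be the independence proxies computed above, and the default values of alpha and tau are placeholders.

```python
import torch

@torch.no_grad()
def i_ema_update(teacher, student1, student2, i1: float, i2: float,
                 alpha: float = 0.99, tau: float = 1.1) -> None:
    """Independence-aware EMA: a softmax over the two independence proxies
    gives the aggregation weights beta1/beta2; the teacher then blends the
    weighted student parameters into its own with smoothing factor alpha."""
    betas = torch.softmax(torch.tensor([i1, i2]) / tau, dim=0)
    beta1, beta2 = betas[0].item(), betas[1].item()
    for p_t, p_s1, p_s2 in zip(teacher.parameters(),
                               student1.parameters(),
                               student2.parameters()):
        aggregated = beta1 * p_s1.data + beta2 * p_s2.data
        # theta_T <- alpha * theta_T + (1 - alpha) * aggregated
        p_t.data.mul_(alpha).add_(aggregated, alpha=1.0 - alpha)
```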
3.2. Hierarchical Contrastive Learning for Dual-Student Framework
In our dual-student framework, the two student networks are fed with differently augmented inputs, representing weak and strong perturbations, respectively. Based on the smoothness assumption, the predictions from networks under different levels of perturbation should remain consistent both at the data level and in the feature space. Therefore, we further introduce contrastive learning at the feature level to enhance the representational consistency and discriminability of the two student networks across different feature subspaces. Specifically, we constrain the feature representations from three complementary perspectives: global consistency, high-confidence alignment, and low-confidence separation. For each student network $f_{s_1}$ and $f_{s_2}$, we add a projection head $h_{1}$ and $h_{2}$ after the last layer of their respective decoders. Let $P_{1}$ and $P_{2}$ be the final output predictions of the two networks. We first enforce global prediction consistency by minimizing the distance between the overall output predictions of the two student networks using the cosine distance constraint:
$$\mathcal{L}_{\cos} = 1 - \frac{P_{1} \cdot P_{2}}{\lVert P_{1} \rVert \, \lVert P_{2} \rVert}.$$
This term constrains the two networks to maintain similar prediction directions in the global feature space, effectively preventing model divergence.
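A minimal sketch of this constraint is given below; it assumes the two prediction maps are softmax outputs of shape (B, C, D, H, W) that are flattened per sample before the cosine distance is taken.

```python
import torch
import torch.nn.functional as F

def global_cosine_consistency(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Cosine-distance consistency between the two students' predictions.
    p1, p2: (B, C, D, H, W) softmax outputs; each sample is flattened and the
    mean cosine distance (1 - cosine similarity) over the batch is returned."""
    v1 = p1.flatten(start_dim=1)
    v2 = p2.flatten(start_dim=1)
    return (1.0 - F.cosine_similarity(v1, v2, dim=1)).mean()
```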
Subsequently, we implement a refined, targeted contrastive learning strategy based on voxel confidence. Specifically, we classify all voxels into high-confidence and low-confidence sets according to the uncertainty of the model predictions.
Figure 1 illustrates the implementation of the hierarchical contrastive learning component. The gray dashed box indicates the high-confidence region, while the star icons represent inter-class prototypes. For the high-confidence set, we introduce a reliability metric based on inter-class separability. This metric guides the partitioning of the data into reliable positive–negative sample pairs, which then undergo typical contrastive push–pull training. This mechanism ensures the network learns tighter intra-class cohesion and sharper inter-class boundaries within regions of high certainty. Conversely, voxels in the low-confidence set often originate from challenging regions such as boundaries and ambiguous areas. Despite their high uncertainty, the structural and semantic information they contain is crucial for precise segmentation and cannot be ignored. To address this, we enhance the separation of class prototypes through cross-class prototype distance metrics, focusing on prototype-level learning within this subset. This approach ultimately improves the model’s feature discrimination capability in ambiguous and highly uncertain regions.
To achieve this, we first construct the high-confidence data pool and calculate the necessary uncertainty metrics. We utilize uncertainty metrics from the two student networks to filter reliable voxels and build the high-confidence data pool. We compute the uncertainty $U_{1}$ of the prediction result $P_{1}$ by performing $T$ Monte Carlo (MC) dropout forward passes on the student network $f_{s_1}$. Similarly, we compute the uncertainty $U_{2}$ for the student network $f_{s_2}$. We then define the high-confidence set $\mathcal{V}_{h}$:
$$\mathcal{V}_{h} = \big\{ v \;\big|\; U_{1}(v) < \mathcal{Q}_{30\%}(U_{1}) \ \text{and} \ U_{2}(v) < \mathcal{Q}_{30\%}(U_{2}) \big\},$$
where $\mathcal{Q}_{30\%}(\cdot)$ computes the top 30% quantile value. To address class imbalance, we introduce a voxel reliability measure based on the proximity to the class prototype, reflecting the desired convergence towards class centers. For each voxel $v$ within the high-confidence region, the class prototype $z_{c}$ is computed by an averaging method:
$$z_{c} = \frac{\sum_{v} M_{c}^{s_1}(v)\, z(v)}{\sum_{v} M_{c}^{s_1}(v)},$$
where $M_{c}^{s_1}$ is the prediction mask of the student network $f_{s_1}$ for voxels predicted as class $c$, and $z$ represents the feature vectors obtained through the non-linear projection head. We compute the cosine similarity between the voxel feature $z(v)$ and the prototype $z_{c}$ of its assigned class $c$:
$$\operatorname{sim}(v) = \cos\!\big(z(v), z_{c}\big) = \frac{z(v) \cdot z_{c}}{\lVert z(v) \rVert \, \lVert z_{c} \rVert}.$$
If the feature of voxel $v$ highly matches its class prototype, $\operatorname{sim}(v)$ approaches $1$, indicating higher reliability. The reliability $r(v)$ of voxel $v$ is now directly expressed as the similarity between the voxel feature and its assigned class prototype, multiplied by the pseudo-label $\hat{y}_{c}(v)$ to ensure focus on the predicted class:
$$r(v) = \operatorname{sim}(v) \cdot \hat{y}_{c}(v).$$
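The prototype and reliability computations can be sketched as follows, assuming the voxel features and hard pseudo-labels have already been extracted and flattened; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats: torch.Tensor, pseudo: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Average the projected features of all voxels assigned to each class.
    feats: (N_vox, F) projected voxel features; pseudo: (N_vox,) long labels."""
    protos = torch.zeros(num_classes, feats.shape[1], device=feats.device)
    for c in range(num_classes):
        sel = pseudo == c
        if sel.any():
            protos[c] = feats[sel].mean(dim=0)
    return protos

def voxel_reliability(feats: torch.Tensor, pseudo: torch.Tensor,
                      protos: torch.Tensor) -> torch.Tensor:
    """r(v): cosine similarity between each voxel feature and the prototype of
    its assigned class (the pseudo-label restricts the computation to the
    predicted class, matching the indicator in the text)."""
    return F.cosine_similarity(feats, protos[pseudo], dim=1)
```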
In the high-confidence voxel region ($v \in \mathcal{V}_{h}$), our primary goal is to enforce consistency between the feature predictions from the two student networks. We utilize the InfoNCE loss function to calculate the contrastive loss $\mathcal{L}_{hc}^{s_1}$, weighted by the voxel-wise reliability $r(v)$. The high-confidence contrastive loss is defined as
$$\mathcal{L}_{hc}^{s_1} = \frac{1}{|\mathcal{V}_{h}|} \sum_{v \in \mathcal{V}_{h}} r(v)\; \ell_{\mathrm{NCE}}\!\big(z_{1}(v), z^{+}(v)\big),$$
where $z_{1}(v)$ is the anchor feature, $r(v)$ is the voxel reliability weight, $\tau_{c}$ is the temperature parameter, and $z^{+}(v)$ is the dynamically selected positive feature. The InfoNCE loss $\ell_{\mathrm{NCE}}$ for the anchor $z_{1}(v)$ is given by
$$\ell_{\mathrm{NCE}}\!\big(z_{1}(v), z^{+}(v)\big) = -\log \frac{\exp\!\big(\cos(z_{1}(v), z^{+}(v))/\tau_{c}\big)}{\exp\!\big(\cos(z_{1}(v), z^{+}(v))/\tau_{c}\big) + \sum_{z^{-} \in \mathcal{N}(v)} \exp\!\big(\cos(z_{1}(v), z^{-})/\tau_{c}\big)},$$
where $\mathcal{N}(v)$ denotes the negative sample set. We dynamically select the positive sample $z^{+}(v)$, whose choice depends entirely on the prediction consistency between the two student networks. When the predictions are consistent, the positive sample for the anchor feature $z_{1}(v)$ is chosen as the average of, or a random sample from, the same-class feature set of the peer network $f_{s_2}$, aiming to achieve mutual knowledge alignment and feature attraction between the student networks. Conversely, when the predictions are inconsistent, the peer prediction is considered less reliable in this scenario, so we switch to choosing the average of, or a random sample from, the same-class feature set of the anchor network itself as the positive sample. This strategy enforces the alignment of features with reliable prototypes only, thereby effectively preventing the introduction of unreliable noise. Finally, the contrastive loss in the high-confidence layer, denoted as $\mathcal{L}_{hc}$, is defined as the sum of the high-confidence loss terms for the two student networks:
$$\mathcal{L}_{hc} = \mathcal{L}_{hc}^{s_1} + \mathcal{L}_{hc}^{s_2}.$$
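A compact sketch of the reliability-weighted InfoNCE term is shown below. Packing the positives and negatives into dense tensors and the default temperature value are implementation assumptions, not prescriptions.

```python
import torch
import torch.nn.functional as F

def reliability_weighted_infonce(anchor: torch.Tensor, positive: torch.Tensor,
                                 negatives: torch.Tensor, reliability: torch.Tensor,
                                 tau: float = 0.1) -> torch.Tensor:
    """Reliability-weighted InfoNCE over high-confidence voxels.
    anchor, positive: (N, F); negatives: (N, K, F); reliability: (N,)."""
    pos = F.cosine_similarity(anchor, positive, dim=1) / tau                 # (N,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=2) / tau   # (N, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                       # (N, 1+K)
    # InfoNCE = -log( exp(pos) / sum over positive + negatives ); the positive
    # sits at index 0, so a cross-entropy with target 0 computes it directly.
    target = torch.zeros(anchor.shape[0], dtype=torch.long, device=anchor.device)
    loss = F.cross_entropy(logits, target, reduction="none")
    return (reliability * loss).mean()
```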
Low-confidence voxels often reflect boundary information or ambiguous regions, yet they hold valuable learning potential. For these regions, we employ a strategy of pushing apart the different class prototypes to enhance the model’s discriminative power at class boundaries. We define the prototype separation loss $\mathcal{L}_{sep}$, which directly uses the cosine similarity between distinct class prototypes $z_{c}$ and $z_{c'}$. For any two different classes $c$ and $c'$ ($c \neq c'$), the prototype repulsion term $\mathcal{L}_{sep}$ is defined as
$$\mathcal{L}_{sep} = \frac{2}{C(C-1)} \sum_{c=1}^{C-1} \sum_{c'=c+1}^{C} \cos\!\big(z_{c}, z_{c'}\big),$$
where $z_{c}$ and $z_{c'}$ are the class prototypes for class $c$ and class $c'$. By minimizing this loss, we enforce maximum separation between different class prototypes in the feature space, thereby strengthening the model’s discriminative capability. Finally, our total contrastive learning loss, $\mathcal{L}_{ctr}$, is defined as a weighted sum of three loss functions: the global consistency loss $\mathcal{L}_{\cos}$, the high-confidence contrastive loss $\mathcal{L}_{hc}$, and the low-confidence prototype separation loss $\mathcal{L}_{sep}$:
$$\mathcal{L}_{ctr} = \lambda_{1}\,\mathcal{L}_{\cos} + \lambda_{2}\,\mathcal{L}_{hc} + \lambda_{3}\,\mathcal{L}_{sep},$$
where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are hyperparameters that control the weight of each loss term. The pseudocode describing the overall computational procedure is provided in Algorithm 1.
| Algorithm 1 Hierarchical Contrastive Learning on Dual-Student Framework. |
Input: Unlabeled data $\mathcal{D}_{u}$; weak and strong augmentations $\mathcal{A}_{w}$, $\mathcal{A}_{s}$; student networks $f_{s_1}$, $f_{s_2}$; projection heads $h_{1}$, $h_{2}$; total iterations $E$.
1: for $e = 1$ to $E$ do
2:  Generate a minibatch of unlabeled data $\mathcal{B}_{u} \subset \mathcal{D}_{u}$.
3:  for each $x^{u} \in \mathcal{B}_{u}$ do
4:   Obtain augmented inputs $\mathcal{A}_{s}(\mathcal{A}_{w}(x^{u}))$ and $\mathcal{A}_{w}(x^{u})$.
5:   Compute predictions $P_{1}$ and $P_{2}$.
6:   Extract projected features $z_{1}$ and $z_{2}$.
7:   Compute global prediction consistency loss $\mathcal{L}_{\cos}$ via Equation (11).
8:   Estimate voxel-wise uncertainty via Monte Carlo Dropout.
9:   Select high-confidence voxels $\mathcal{V}_{h}$ using Equation (12).
10:   Compute class prototypes $z_{c}^{s_1}$, $z_{c}^{s_2}$ via Equation (13).
11:   Compute voxel reliability $r(v)$ via Equation (15).
12:   for each voxel $v \in \mathcal{V}_{h}$ do
13:    Determine the positive feature $z^{+}(v)$ based on prediction consistency.
14:    Compute InfoNCE contrastive loss via Equation (17).
15:    Obtain high-confidence contrastive losses $\mathcal{L}_{hc}^{s_1}$ and $\mathcal{L}_{hc}^{s_2}$ via Equation (16).
16:    Compute prototype separation loss $\mathcal{L}_{sep}$ via Equation (19).
17:   end for
18:   Combine feature-level contrastive losses using Equation (20).
19:  end for
20: end for
|
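To complement Algorithm 1, the prototype separation term and the final weighted combination can be sketched as follows; the unit default weights are placeholders.

```python
import torch
import torch.nn.functional as F

def prototype_separation_loss(protos: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between distinct class prototypes;
    minimizing it pushes the prototypes apart. protos: (C, F)."""
    p = F.normalize(protos, dim=1)
    sim = p @ p.T                           # (C, C) cosine similarities
    c = p.shape[0]
    off_diag = sim[~torch.eye(c, dtype=torch.bool, device=p.device)]
    return off_diag.mean()

def total_contrastive_loss(l_cos, l_hc, l_sep,
                           lam1: float = 1.0, lam2: float = 1.0, lam3: float = 1.0):
    """Weighted sum of the three feature-level terms (weights are placeholders)."""
    return lam1 * l_cos + lam2 * l_hc + lam3 * l_sep
```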
To minimize the prediction disparity between the two student networks, we introduce cross-pseudo-supervision at the data level. This strategy facilitates the accurate localization of class prototypes in the feature space, which, in turn, indirectly reduces the cosine distance between the predictions derived from the two perspectives [46]:
$$\mathcal{L}_{cps} = \ell_{ce}\!\big(P_{1}, \hat{y}_{2}\big) + \ell_{ce}\!\big(P_{2}, \hat{y}_{1}\big),$$
where $\ell_{ce}$ is the cross-entropy loss, and $\hat{y}_{1}$ and $\hat{y}_{2}$ are the predicted pseudo-labels from $f_{s_1}$ and $f_{s_2}$, respectively.
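A minimal sketch of this term, assuming raw logits from the two students and hard pseudo-labels obtained by argmax:

```python
import torch
import torch.nn.functional as F

def cross_pseudo_supervision(logits1: torch.Tensor, logits2: torch.Tensor) -> torch.Tensor:
    """Each student is supervised by the hard pseudo-labels of the other.
    logits1, logits2: (B, C, D, H, W) raw outputs of the two students."""
    pseudo1 = logits1.argmax(dim=1).detach()   # pseudo-labels from student 1
    pseudo2 = logits2.argmax(dim=1).detach()   # pseudo-labels from student 2
    return F.cross_entropy(logits1, pseudo2) + F.cross_entropy(logits2, pseudo1)
```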
3.3. Uncertainty-Aware Bidirectional Copy–Paste Data Augmentation
To effectively bridge the potential structural and semantic distribution gap between the labeled data $\mathcal{D}_{l}$ and the unlabeled data $\mathcal{D}_{u}$, and to mitigate noise interference caused by pseudo-label uncertainty in semi-supervised learning, we propose an uncertainty-guided bidirectional copy–paste (UB-CP) augmentation strategy. The core of this method lies in utilizing the uncertainty estimated via Monte Carlo Dropout during the warm-up phase to dynamically and selectively guide the copy–paste operations.
During the warm-up phase of training, we employ Monte Carlo Dropout on the teacher network to estimate voxel-level uncertainty by computing the mean prediction entropy across multiple stochastic forward passes. Based on this uncertainty map, we precisely paste the regions with the highest uncertainty from the unlabeled data into the labeled data. This uncertainty-guided mechanism ensures that the regularization loss focuses on the “hard” regions where the model requires further learning. For the labeled data, we still follow the principle of BCP and utilize a zero-centered mask $\mathcal{M}$ for the copy–paste operation. During training, we sample one instance from each set to perform the bidirectional copy–paste operation. We copy–paste the labeled data $x^{l}$ onto the unlabeled data $x^{u}$. To introduce accurate boundary information from the labeled data and encourage the model to learn boundary features on the unlabeled domain, we adopt the BCP mechanism using a zero-centered mask. To conduct the copy–paste operation between the image pair, we first generate a zero-centered mask $\mathcal{M} \in \{0,1\}^{H \times W \times D}$:
$$\mathcal{M}(v) = \begin{cases} 0, & v \in \Omega_{0}, \\ 1, & \text{otherwise}, \end{cases}$$
where the voxel value of $\mathcal{M}$ indicates whether the voxel originates from the foreground image (0) or the background image (1), and the zero-value region $\Omega_{0}$ is a central block of size $\beta H \times \beta W \times \beta D$ with $\beta \in (0,1)$. The parameter $\beta$ controls the size ratio of the central block, and the resulting mixed volume $x_{1}^{mix}$ is generated by the formula
$$x_{1}^{mix} = x^{l} \odot (1 - \mathcal{M}) + x^{u} \odot \mathcal{M}.$$
This operation pastes the central block of $x^{l}$ onto the central region of $x^{u}$.
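The first mixing direction can be sketched as follows; the default block ratio beta is a placeholder rather than the value used in our experiments.

```python
import torch

def zero_centered_mask(shape, beta: float = 2 / 3, device="cpu") -> torch.Tensor:
    """Zero-centered mask M: 0 inside the central block of size
    (beta*H, beta*W, beta*D), 1 elsewhere. beta is a placeholder ratio."""
    H, W, D = shape
    mask = torch.ones(H, W, D, device=device)
    h, w, d = int(beta * H), int(beta * W), int(beta * D)
    h0, w0, d0 = (H - h) // 2, (W - w) // 2, (D - d) // 2
    mask[h0:h0 + h, w0:w0 + w, d0:d0 + d] = 0
    return mask

def paste_labeled_onto_unlabeled(x_l: torch.Tensor, x_u: torch.Tensor,
                                 mask: torch.Tensor) -> torch.Tensor:
    """x_mix = x_l * (1 - M) + x_u * M: the central block (M = 0) comes from the
    labeled volume, the surrounding context from the unlabeled volume.
    x_l and x_u must be broadcastable with the mask."""
    return x_l * (1 - mask) + x_u * mask
```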
Subsequently, we copy–paste the unlabeled data $x^{u}$ onto the labeled data $x^{l}$. This direction aims to transfer the most uncertain unlabeled regions onto the labeled data for proper regularization. We first employ the teacher network $f_{T}$ combined with Monte Carlo Dropout to estimate the predictive uncertainty. The average prediction probability $\bar{p}_{c}(v)$ is obtained through $N$ forward passes:
$$\bar{p}_{c}(v) = \frac{1}{N} \sum_{n=1}^{N} p_{c}^{(n)}(v),$$
where $N$ is the number of forward passes, and $p_{c}^{(n)}(v)$ is the prediction probability at voxel $v$ during the $n$-th pass of the teacher network.
The voxel-level uncertainty is quantified using the average prediction entropy $U(v)$:
$$U(v) = -\sum_{c=1}^{C} \bar{p}_{c}(v)\, \log \bar{p}_{c}(v),$$
where $C$ is the number of segmentation classes, and $\bar{p}_{c}(v)$ is the average probability for class $c$ at voxel $v$.
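A minimal sketch of this estimate is given below; it assumes a teacher model whose dropout layers remain stochastic in train mode, and the number of passes is a placeholder.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(teacher, x: torch.Tensor, n_passes: int = 8) -> torch.Tensor:
    """Voxel-wise uncertainty as the entropy of the mean softmax prediction
    over several stochastic forward passes (dropout kept active)."""
    teacher.train()                        # keep dropout layers stochastic
    probs = torch.stack([torch.softmax(teacher(x), dim=1) for _ in range(n_passes)])
    p_mean = probs.mean(dim=0)             # (B, C, D, H, W) average probability
    entropy = -(p_mean * torch.log(p_mean + 1e-8)).sum(dim=1)   # (B, D, H, W)
    teacher.eval()
    return entropy
```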
We define the target volume of each copied patch as half of the zero-centered region; that is, the volume of each patch is $\tfrac{1}{2}\,\beta H \cdot \beta W \cdot \beta D$. By sliding a window over the uncertainty map $U$ and selecting the two non-overlapping blocks with the highest average uncertainty, $B_{1}$ and $B_{2}$, we define the indicator function for the selected regions as
$$\mathcal{M}'(v) = \mathbb{1}\big[v \in B_{1} \cup B_{2}\big].$$
By combining $B_{1}$ and $B_{2}$, we obtain the mask $\mathcal{M}'$, whose values are set to 1 within the selected regions and 0 elsewhere, thus pasting the most uncertain voxel blocks from the unlabeled data onto $x^{l}$ and forming the mixed volume $x_{2}^{mix}$:
$$x_{2}^{mix} = x^{u} \odot \mathcal{M}' + x^{l} \odot (1 - \mathcal{M}').$$
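The block selection can be sketched with a sliding-window average as shown below. Choosing exactly two non-overlapping blocks follows the text, while the greedy invalidation of overlapping windows is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def top_uncertainty_mask(U: torch.Tensor, block: tuple) -> torch.Tensor:
    """Mark the two non-overlapping blocks with the highest mean uncertainty.
    U: (D, H, W) uncertainty map of one volume; block: (d, h, w) block size."""
    d, h, w = block
    # Mean uncertainty of every sliding block (stride 1).
    scores = F.avg_pool3d(U[None, None], kernel_size=block, stride=1)[0, 0]
    mask = torch.zeros_like(U)
    for _ in range(2):                                    # select two blocks
        idx = int(torch.argmax(scores))
        sd, sh, sw = scores.shape
        zd, zh, zw = idx // (sh * sw), (idx // sw) % sh, idx % sw
        mask[zd:zd + d, zh:zh + h, zw:zw + w] = 1
        # Invalidate every window that would overlap the chosen block.
        scores[max(0, zd - d + 1):zd + d,
               max(0, zh - h + 1):zh + h,
               max(0, zw - w + 1):zw + w] = float("-inf")
    return mask
```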
For the bidirectionally mixed data $x_{1}^{mix}$ and $x_{2}^{mix}$, we feed them into the student network to obtain the corresponding predictions $P_{1}^{mix}$ and $P_{2}^{mix}$. To ensure data alignment, the supervision labels for consistency learning on the other student network are also constructed by mixing labeled and unlabeled data. Accordingly, the pseudo-labels for the two branches are defined as
$$y_{1}^{mix} = y^{l} \odot (1 - \mathcal{M}) + \hat{y}^{u} \odot \mathcal{M}, \qquad y_{2}^{mix} = \hat{y}^{u} \odot \mathcal{M}' + y^{l} \odot (1 - \mathcal{M}'),$$
where $\mathcal{M}$ and $\mathcal{M}'$ are the binary masks used for region mixing, $\hat{y}^{u}$ denotes the pseudo-labels generated from the unlabeled data, and $y^{l}$ represents the ground-truth labels from the labeled data.
3.4. Fractal-Dimension-Weighted Consistency Regularization
Fractal dimension (FD) is a scalar metric that quantifies structural complexity and irregularity, and it is particularly effective for analyzing the complex biological structures common in medical images. Regions with higher FD values indicate more intricate structures and irregular textures, and they often contain richer diagnostic information (e.g., organ boundaries). Standard consistency regularization often overfits simple regions and underperforms in structurally complex areas. We therefore introduce a fractal-dimension-based weight map that reflects local structural complexity, ensuring stronger supervision signals in challenging regions.
Figure 4 illustrates the process of applying fractal dimensions to consistency regularization.
For an input 3D image volume $x \in \mathbb{R}^{H \times W \times D}$, we compute the local fractal dimension $FD(v)$ for each voxel position $v$ using the 3D box-counting method. The $FD(v)$ is approximated by
$$FD(v) \approx \frac{\log N_{r}(v)}{\log (1/r)},$$
where $r$ denotes the side length of the box, and $N_{r}(v)$ represents the minimum number of boxes with scale $r$ required to cover the non-zero voxels within the local block centered at voxel $v$; in practice, the dimension is estimated as the slope of the least-squares fit of $\log N_{r}(v)$ against $\log(1/r)$ over the chosen scales. The computation of $N_{r}(v)$ follows the differential box-counting strategy described in [40]. Specifically, we restrict the box scales to a small finite range of side lengths $r$. The result is a fractal-dimension map that precisely matches the spatial dimensions of the input volume.
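A simplified NumPy sketch of box counting on a single local block is shown below; it uses plain occupancy counting rather than the differential variant, and the scale set is illustrative.

```python
import numpy as np

def box_count_fd(block: np.ndarray, scales=(2, 4, 8)) -> float:
    """Estimate the fractal dimension of a binary 3D block by box counting:
    count occupied boxes N_r at each scale r, then fit the slope of
    log N_r versus log(1/r)."""
    counts = []
    for r in scales:
        # Trim so the block is divisible by r, then count boxes containing
        # at least one non-zero voxel.
        d, h, w = (s - s % r for s in block.shape)
        sub = block[:d, :h, :w].reshape(d // r, r, h // r, r, w // r, r)
        occupied = sub.any(axis=(1, 3, 5))
        counts.append(max(occupied.sum(), 1))
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return float(slope)

# Example: a dense random binary block has an FD close to 3.
rng = np.random.default_rng(0)
print(box_count_fd(rng.random((32, 32, 32)) > 0.5))
```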
Biological structures such as organs, blood vessels, or tumors, although complex, are not ideal mathematical fractals. They exhibit statistical self-similarity only within a finite range of scales, resulting in small differences between the calculated FD values. To increase the model’s discriminative power between different regions, we apply Min–Max normalization to the FD values. The processed weights, ranging from 0 to 1, provide a clear gradient signal for consistency learning between the models:
$$w_{FD}(v) = \frac{FD(v) - FD_{\min}}{FD_{\max} - FD_{\min} + \epsilon},$$
where $FD_{\max}$ is the global maximum FD, $FD_{\min}$ is the global minimum FD, and $\epsilon$ is a minimal constant.
Assigning higher weights to complex regions forces the student model to focus on aligning its predictions with the teacher model’s distribution in these challenging areas, thereby enhancing consistency along boundaries and complex structural regions. Areas with high FD values are typically critical decision regions in semantic segmentation, yet also the most error-prone. Building upon this weighted learning mechanism, we further incorporate uncertainty estimation by performing alignment only on voxels whose uncertainty falls below a predefined threshold. This approach assigns higher weights to complex regions while preserving numerical smoothness and avoiding excessive bias. The weighted consistency regularization term $\mathcal{L}_{fdc}$ is defined as
$$\mathcal{L}_{fdc} = \frac{1}{N_{v}} \sum_{v} \mathbb{1}\big[U(v) < \kappa\big]\; w_{FD}(v)\; D_{\mathrm{KL}}\!\big(p_{T}(v)\,\|\,p_{S}(v)\big),$$
where $\mathbb{1}[\cdot]$ is the indicator function, and the constraint $U(v) < \kappa$ ensures that the loss is only calculated for voxels $v$ where the uncertainty $U(v)$ is below the threshold $\kappa$. $N_{v}$ is the total number of voxels, and $p_{T}(v)$ and $p_{S}(v)$ are the prediction probability distributions of the teacher and student networks, respectively.
To enable the model to learn more complex knowledge without being overly affected by noise, we empirically set the uncertainty threshold $\kappa$ as a fixed top-quantile value of the uncertainty map. The Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}$ is given by
$$D_{\mathrm{KL}}\!\big(p_{T}(v)\,\|\,p_{S}(v)\big) = \sum_{c=1}^{C} p_{T}^{c}(v)\, \log \frac{p_{T}^{c}(v) + \epsilon}{p_{S}^{c}(v) + \epsilon},$$
where $\epsilon$ is a minimal constant introduced to maintain numerical stability.
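Putting the pieces together, the regularization term can be sketched as follows; the tensor shapes and the epsilon default are assumptions.

```python
import torch

def fd_weighted_consistency(p_teacher: torch.Tensor, p_student: torch.Tensor,
                            fd_weight: torch.Tensor, uncertainty: torch.Tensor,
                            kappa: float, eps: float = 1e-8) -> torch.Tensor:
    """Voxel-wise KL(teacher || student), scaled by the normalized FD weight and
    restricted to voxels whose uncertainty is below the threshold kappa.
    p_teacher, p_student: (B, C, D, H, W) probabilities;
    fd_weight, uncertainty: (B, D, H, W); eps is a placeholder constant."""
    kl = (p_teacher * torch.log((p_teacher + eps) / (p_student + eps))).sum(dim=1)
    gate = (uncertainty < kappa).float()
    return (gate * fd_weight * kl).sum() / kl.numel()   # average over all voxels
```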