Semi-Supervised Nuclei Instance Segmentation with Category-Adaptive Sampling and Region-Adaptive Attention

Li, Xunci; Luo, Die; Wei, Zimei; Long, Junan; Ye, Zhiwei

doi:10.3390/app15095107

by Xunci Li ^1,2, Die Luo ^1,2,*, Zimei Wei ^1,2, Junan Long ^3 and Zhiwei Ye ^1,2

¹ School of Computer Science, Hubei University of Technology, Wuhan 430068, China

² Hubei Provincial Key Laboratory of Green Intelligent Computing Power Network, Hubei University of Technology, Wuhan 430068, China

³ Faculty of Engineering Sciences, University College London, London WC1E 6BT, UK

^* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 5107; https://doi.org/10.3390/app15095107

Submission received: 2 April 2025 / Revised: 23 April 2025 / Accepted: 28 April 2025 / Published: 4 May 2025

Abstract

Cell nuclei instance segmentation plays a critical role in pathological image analysis. In recent years, fully supervised methods for cell nuclei instance segmentation have achieved significant results. However, in practical medical image processing, annotating dense cell nuclei at the instance level is often costly and time-consuming, making it challenging to acquire large-scale labeled datasets. This challenge has motivated researchers to explore ways to further enhance segmentation performance under limited labeling conditions. To address this issue, this paper proposes a network based on category-adaptive sampling and attention mechanisms for semi-supervised nuclei instance segmentation. Specifically, we design a category-adaptive sampling method that forces the model to focus on rare categories and dynamically adapt to different data distributions. By dynamically adjusting the sampling strategy, the balance of samples across different cell types is improved. Additionally, we propose a strong–weak contrast consistency method that significantly expands the perturbation space. Strong perturbations enhance the model’s ability to discriminate key nuclei features, while weak perturbations improve its robustness against noise and interference. Furthermore, we introduce a region-adaptive attention mechanism that dynamically assigns higher weights to key regions, guiding the model to prioritize learning discriminative features in challenging areas such as blurred or ambiguous cell boundaries. This improves the morphological accuracy of the segmentation masks. Our method effectively leverages the potential information in unlabeled data, thereby reducing reliance on large-scale, high-quality labeled datasets. Experimental results on public datasets demonstrate the effectiveness of our approach.

Keywords: semi-supervised; instance segmentation; category-adaptive sampling; region-adaptive attention

1. Introduction

Pathologists can analyze morphological features such as cell size, texture, and shape to assist in diagnosing malignant tumors, determining treatment methods, or conducting in-depth research [1]. Specifically, through instance segmentation of cells, researchers can accurately quantify cellular characteristics, including the cell count and size within tumor tissues, thereby supporting the quantitative analysis, grading, and staging of malignancies [2]. The prerequisite for these procedures involves large-scale cellular-level analysis of digital tissue samples, which renders manual cell analysis practically infeasible due to its high labor intensity and significant inter-observer variability. This necessitates the urgent development of automated cell analysis methods [3].

Most existing fully supervised automated cell analysis algorithms are based on convolutional neural networks (CNNs), which have achieved remarkable progress in the medical field and digital pathology [4,5,6]. Among them, UNet [7], a widely used CNN architecture for image segmentation, has been applied to cell instance segmentation [8]. Researchers have also proposed numerous UNet-based variants, such as UNet++ [9] and UNet3+ [10]. In these methods, the semantic gap between feature maps of encoder and decoder subnetworks is reduced by redesigning skip connections, thereby enhancing segmentation performance. However, despite merging coarse-grained deep features and fine-grained shallow features at the same scale via skip connections, UNet-based encoder–decoder networks exhibit inefficient non-local contextual modeling across arbitrary positions, limiting their performance in segmenting complex histopathological images [11]. To address these challenges, researchers have developed novel approaches. For example, Hover-Net [12] uses a fully convolutional neural network to extract image features and identifies candidate cell regions through post-processing. These candidate regions are further processed and filtered to generate final segmentation masks, achieving instance segmentation of individual cells. CIA-Net [13] incorporates two decoders, with each responsible for segmenting nuclei or contours. By bidirectionally aggregating specific features, it leverages spatial and textural dependencies between nuclei and contours to enhance performance on both tasks. Nevertheless, such methods rely on complex pixel grouping post-processing to extract object instances, and their performance is highly dependent on segmentation results and grouping strategies [14].

Fully supervised deep learning methods have significantly enhanced the automation level of cell segmentation through end-to-end feature learning. However, these methods rely heavily on large quantities of high-quality annotated data for training, which enables models to achieve high segmentation accuracy even in complex backgrounds. This reliance on extensive annotations poses challenges, including high costs, time-consuming processes, and inconsistent annotation quality due to the need for pathologist involvement in medical image labeling. To address these limitations, semi-supervised learning methods have gradually gained attention. These methods integrate a small amount of labeled data with a large volume of unlabeled data, reducing data acquisition costs while enhancing model generalization and robustness. For example, Mean Teacher [15] leverages consistency regularization to utilize unlabeled data for performance improvement, FixMatch [16] combines pseudo-labeling with strong–weak data augmentation strategies to significantly boost segmentation accuracy, and Cross Pseudo Supervision [17] mitigates pseudo-label noise via mutual supervision between dual networks. Additionally, generative adversarial network (GAN)-based methods such as SegAN [18] exploit distributional information in unlabeled data through adversarial training mechanisms, thereby further improving segmentation model robustness. Despite these advances, efficiently leveraging unlabeled data remains a challenge. The insufficient exploration of unlabeled data or failure to find optimal correlations between unlabeled and labeled data can lead to suboptimal model performance. Designing self-supervised constraints, consistency regularization, or GAN-based mechanisms can uncover implicit distributional patterns and contextual information in unlabeled data. 
Nevertheless, current semi-supervised methods for medical images still face challenges such as pseudo-label noise propagation and feature disentanglement, necessitating the urgent development of novel learning paradigms adapted to cellular characteristics [19]. Furthermore, designing a robust cell segmentation model framework remains crucial.

In this paper, we propose a semi-supervised method for nuclei instance segmentation. A category-adaptive sampling strategy is employed to guide the model to focus on rare classes and dynamically adapt to data distributions, thereby addressing class imbalance. Additionally, a region-adaptive attention mechanism is introduced to improve the morphological accuracy of segmentation masks by dynamically assigning higher weights to critical regions. We present comparative results on two independent multi-tissue histology image datasets and demonstrate state-of-the-art performance compared to other recently proposed methods. The contributions of this work can be summarized as follows:

A category-adaptive sampling method is designed to guide the model to prioritize rare classes and dynamically adapt to data distributions. This prevents fixed sampling strategies from causing model reliance on artificially set balancing ratios and effectively addresses long-tailed data distribution issues.

A strong–weak contrast consistency method is developed, generating two strong perturbations and one weak perturbation to substantially expand the perturbation space, learn perturbation-invariant feature representations, and enhance the model’s discriminative ability for critical cellular features.

A region-adaptive attention mechanism is designed to strengthen key regions through dynamic weighting, guiding the model to focus on learning discriminative features of challenging regions such as blurred boundaries or overlapping cells, thereby improving the morphological accuracy of segmentation masks.

2. Related Work

Traditional semi-supervised medical image segmentation methods typically rely on handcrafted shallow features with limited representational capacity, which renders them incapable of delivering satisfactory segmentation results for medical images characterized by low contrast and severe noise interference. In contrast, deep learning-based semi-supervised approaches achieve superior segmentation performance owing to their strong feature representation and modeling capabilities. Current state-of-the-art semi-supervised medical image segmentation methods predominantly utilize conventional encoder–decoder networks as their backbone architecture. To more effectively leverage unlabeled data, recent research has focused on advancing learning strategies. Prior studies can be broadly classified into three primary directions:

2.1. Pseudo-Label-Based Methods

The core idea of pseudo-label-based methods is to train a base model using a small amount of labeled data, predict labels for unlabeled data with the trained model, select high-confidence predictions as pseudo-labels, and finally retrain the model by combining pseudo-labeled data with labeled data [20]. The pseudo-label estimates are updated after several training epochs, with the expectation that their quality improves through iterative training. However, directly incorporating initial pseudo-labels into the segmentation loss function may degrade performance if the estimates are incorrect. To address this issue, Feng et al. [21] selected high-confidence pseudo-labels for training, reducing noise interference by dynamically updating these pseudo-labels during training to iteratively correct errors. Yu et al. [22] proposed a Mean Teacher framework that quantifies prediction uncertainty using Monte Carlo dropout, selecting low-uncertainty pseudo-labels for subsequent training to minimize misinformation. Another approach, Noisy Student [23], trains a teacher model on labeled data to estimate pseudo-labels and then trains a student model on both real and pseudo-labels while injecting noise via data augmentation and dropout to improve robustness. In this paradigm, the student model typically outperforms the teacher and can be iteratively updated as a new teacher to estimate refined pseudo-labels. Su et al. [24] introduced a mutual learning framework leveraging the synergy between multiple segmentation networks to improve semi-supervised performance by exchanging “reliable” pseudo-labels. FixMatch [16] and its variant UniMatch [25] enhance performance using high-confidence pseudo-labels and consistency regularization.

2.2. Consistency Regularization-Based Methods

Consistency regularization is one of the core paradigms in semi-supervised learning. Its central idea is to force the model to produce consistent predictions for different perturbations (e.g., noise, data augmentation, etc.) of the same input image data. By leveraging the intrinsic structural information of unlabeled data, this approach significantly enhances model generalization, particularly in tasks with scarce labeled data such as medical image segmentation.

The Π-Model [26] is a classic consistency regularization method, which first proposed leveraging unlabeled data by perturbing the input data and enforcing consistency in model predictions. The Temporal Ensembling method, introduced in the same work, employs exponential moving averages (EMAs) of historical model predictions as consistency targets to reduce random noise. Tarvainen et al. [15] developed the Mean Teacher method, replacing the EMAs of predictions in Temporal Ensembling with the EMAs of model weights and enabling the teacher model to provide more stable consistency targets. Shu Jianhua et al. [27] proposed a self-consistency regularization mechanism that enforces prediction consistency across different data augmentations or network layers. To prevent overfitting, Li et al. [28] constrained the consistency between teacher and student models on differently augmented versions of the same input, optimizing a weighted combination of supervised loss for labeled data and regularization loss for both labeled and unlabeled data, thereby improving segmentation accuracy and robustness in medical imaging. To minimize uncertain predictions, Wu et al. [29] introduced a novel mutual consistency network (MC-Net) composed of an encoder and two slightly different decoders. A cyclic pseudo-labeling scheme was designed to convert prediction discrepancies between the two decoders into unsupervised losses, promoting mutual consistency. This framework was later extended to MC-Net+ [30], which incorporates pseudo-label filtering to reduce noise impact and improve semi-supervised segmentation stability. Liu et al. [31] proposed a new extension of the Mean Teacher model to address prediction accuracy in consistency learning. This included a new auxiliary teacher and replaced the Mean Teacher’s mean squared error (MSE) with a stricter confidence-weighted cross-entropy (Conf-CE) loss, enhancing unlabeled data utilization through perturbations and strict filtering strategies.

2.3. GAN-Based Methods

In medical image segmentation tasks, generative adversarial networks (GANs) typically train two networks: a generator that synthesizes realistic segmentation masks from random noise to mimic ground-truth annotations and a discriminator that distinguishes generated masks from real ones, thereby guiding the generator’s learning process.

In recent studies, GASNet [32] adopts the idea of generative adversarial learning and combines uncertainty discriminators and feature mapping loss, enabling the model to fully utilize unlabeled data for efficient medical image segmentation in the case of limited labeled data. Duo-SegNet [33] uses two distinct perspectives (two segmentation branches) to learn image features separately and designs a Critic to discriminate between predictions from the two perspectives. Through min-max optimization, it encourages the model to produce more consistent segmentations across multi-perspectives. Zhao et al. [34] developed a boundary-guided segmentation network (BGSegNet) with a lightweight pixel-level discriminator to distinguish predictions from labels. This discriminator also provides pseudo-labels for unlabeled data to train the segmentation network. Chen et al. [35] further enhanced pseudo-label smoothness and accuracy by integrating virtual adversarial training into cross-domain patch-wise contrastive learning.

3. Methods

In this study, we adopted DeepLabv3 [36] as the backbone architecture for the segmentation network. Based on this foundation, we proposed a novel category-adaptive sampling (CAS) strategy to enhance dataset sampling efficiency. To facilitate semi-supervised learning, we incorporated a strong–weak consistency (SWC) mechanism that enforces contrastive consistency between strongly and weakly augmented views. Furthermore, we introduced a region-adaptive attention (RAA) module to emphasize discriminative regional features from the dilated convolutional outputs. By integrating these components, we developed a semi-supervised segmentation framework, termed CSRA-Net. The overall network architecture is illustrated in Figure 1.

3.1. Category-Adaptive Sampling Method

In medical image analysis, datasets often suffer from significant class imbalance, where certain tissues or cell types are underrepresented. This imbalance can lead to suboptimal model performance, particularly in recognizing minority classes during training. Conventional techniques such as random oversampling and SMOTE [37] adopt fixed sampling strategies, which may not fully exploit class-specific information embedded in the training data.

To address this challenge, we proposed a category-adaptive sampling (CAS) strategy that dynamically adjusts the sampling weights based on the imbalance severity of each class. The CAS framework emphasizes the representation of underrepresented classes by incorporating class-specific oversampling factors and adaptive adjustment mechanisms. This dynamic strategy enhances the ability of the model to learn from rare classes and mitigates the negative impact of class imbalance.

Moreover, to improve the stability and robustness of the model while minimizing the risk of overfitting associated with oversampling, we introduced several refinements to the CAS method. These include precise formula derivations, systematic methodological enhancements, and comprehensive implementation details, all of which contribute to improved classification performance on imbalanced datasets.

3.1.1. Calculate the Category Imbalance Ratio

First, the imbalance ratio for each category in the training set is computed to quantify its scarcity relative to the majority class. Let c_{T} denote the tissue category, c_{cell} the cellular category, N_{train} the total number of training samples, N_{class c} the number of samples in category c, and N_{majority class} the number of samples in the majority class. The category imbalance ratio is defined in Equation (1):

imbalance(c_{T}) = \frac{N_{class c_{T}}}{N_{majority class}}, \quad imbalance(c_{cell}) = \frac{N_{class c_{cell}}}{N_{majority class}}

(1)

These imbalance ratios are subsequently used to compute the oversampling factor γ_{s}^{c} for each category c. To ensure that rarer classes receive proportionally more sampling attention, an inverse proportionality strategy is adopted, as defined in Equation (2). Here, α is a constant hyperparameter that controls the overall oversampling intensity, thereby modulating the amplification level across different categories.

γ_{s}^{c} = \frac{α}{imbalance(c)}

(2)
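As a minimal sketch of Equations (1) and (2), the snippet below computes the imbalance ratio and the oversampling factor for every category from a list of per-sample labels. The function name `oversampling_factors`, the toy tissue labels, and the value `alpha=0.5` are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def oversampling_factors(labels, alpha=0.5):
    """Map each category to (imbalance ratio, oversampling factor).

    Hypothetical helper: `labels` holds one category label per training sample.
    """
    counts = Counter(labels)
    n_majority = max(counts.values())  # size of the majority class
    factors = {}
    for category, n_class in counts.items():
        imbalance = n_class / n_majority   # Equation (1)
        gamma = alpha / imbalance          # Equation (2): inverse proportionality
        factors[category] = (imbalance, gamma)
    return factors


# Toy example: 4 "breast", 2 "colon", and 1 "liver" samples (made-up labels).
tissue_labels = ["breast"] * 4 + ["colon"] * 2 + ["liver"]
factors = oversampling_factors(tissue_labels, alpha=0.5)
```

In this toy setting the rarest category receives the largest factor, which is the behavior the inverse proportionality strategy is meant to produce.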

3.1.2. Calculate the Sampling Weights of the Sample

For each training sample i, the overall sampling weight is composed of two components: one corresponding to the tissue category and the other to the cell category. The sampling weight w_{Tissue}(i, γ_{s}) for the tissue category is determined by the tissue label c_{T,i} associated with the sample. The calculation is defined in Equation (3):

w_{Tissue}(i, γ_{s}) = N_{Train} \cdot γ_{s}^{c_{T}} \cdot \sum_{j=1}^{N_{Train}} F_{c_{T,j} = c_{T,i}} + (1 - γ_{s}^{c_{T}}) \cdot N_{Train}

(3)

Here, F_{c_{T,j} = c_{T,i}} is an indicator function that equals 1 if sample j belongs to the tissue category c_{T,i} of sample i and 0 otherwise.

The sampling weight w_{Cell}(i, γ_{s}) for the cell categories is computed from the cell class composition of each training sample i. Since a single sample may contain multiple cell categories, we define a binary vector c_{i} whose element c_{i}^{j} indicates whether sample i contains cell category j (1 if present, 0 otherwise), while N_{Cell} denotes the total number of cells of all categories. The calculation is defined in Equation (4), with the sum running over the cell categories j = 1, …, C:

w_{Cell}(i, γ_{s}) = (1 - γ_{s}^{c_{cell}}) + γ_{s}^{c_{cell}} \cdot \frac{\sum_{j=1}^{C} c_{i}^{j}}{N_{Cell}}

(4)

3.1.3. Calculate the Total Sampling Weight

Ultimately, the total sampling weight p_{i}(γ_{s}) for each sample i is computed as the sum of the tissue-level and cell-level sampling weights, each normalized by its maximum over all samples. This formulation allows for joint consideration of both tissue and cellular category imbalances. The calculation is expressed in Equation (5):

p_{i}(γ_{s}) = \frac{w_{Tissue}(i, γ_{s})}{\max_{j} w_{Tissue}(j, γ_{s})} + \frac{w_{Cell}(i, γ_{s})}{\max_{j} w_{Cell}(j, γ_{s})}

(5)

By incorporating this composite weight into the sampling procedure, the sampling probability of each training sample is dynamically adjusted, enabling the model to more effectively address class imbalance across both tissue and cell categories.
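The combination step in Equation (5) can be sketched as follows, taking the per-sample tissue and cell weights as given inputs. The helper names, the toy weight values, and the final step of rescaling p_{i} to a probability vector for drawing indices are assumptions made for illustration; the text does not specify how the composite weight is passed to the sampler.

```python
import numpy as np


def total_sampling_weight(w_tissue, w_cell):
    """Equation (5): sum of the max-normalized tissue and cell weights."""
    w_tissue = np.asarray(w_tissue, dtype=float)
    w_cell = np.asarray(w_cell, dtype=float)
    return w_tissue / w_tissue.max() + w_cell / w_cell.max()


def sample_indices(p, num_samples, seed=0):
    """Draw sample indices with probability proportional to p (assumed usage)."""
    rng = np.random.default_rng(seed)
    probs = p / p.sum()  # rescale the composite weights so they sum to 1
    return rng.choice(len(p), size=num_samples, replace=True, p=probs)


# Toy per-sample weights (made-up values for three training samples).
w_tissue = [1.0, 2.0, 4.0]
w_cell = [3.0, 3.0, 1.5]
p = total_sampling_weight(w_tissue, w_cell)
batch = sample_indices(p, num_samples=8)
```

Because each term is divided by its own maximum, neither the tissue-level nor the cell-level weight can dominate the sum purely because of its scale.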

3.1.4. Dynamically Adjust the Oversampling Factor

To effectively address category imbalance while mitigating the risk of overfitting, a periodic oversampling strategy is proposed. Rather than maintaining a fixed oversampling rate throughout training, this strategy periodically updates the oversampling factor to reduce computational overhead per epoch and enhance model generalization. Specifically, the oversampling factor γ_{s} is updated at a predefined interval (e.g., every 10 epochs), and its magnitude is gradually decreased as training progresses. This dynamic adjustment allows the model to focus more on underrepresented categories in early training stages while reducing the influence of oversampling in later stages. The update rule is defined in Equation (6):

γ_{s} = γ_{\max} \cdot \left(1 - \frac{epoch \bmod interval}{N_{epochs}}\right)

(6)

Here, γ_{\max} is the initial oversampling factor; epoch mod interval is the remainder of the current epoch divided by the periodic update interval, which determines the adjustment ratio within the current period; N_{epochs} is the total number of training epochs; and interval is the period between updates (for example, one update every 10 epochs).

This gradually decreasing oversampling intensity helps prevent overfitting to minority classes in later stages of training and improves the generalization capability of the model.
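The update rule in Equation (6) can be evaluated directly. The sketch below is a literal transcription of the formula; the function name and the example settings (γ_max = 1.0, interval = 10, N_epochs = 100) are illustrative assumptions.

```python
def oversampling_factor(epoch, gamma_max, interval, n_epochs):
    """Equation (6): gamma_s = gamma_max * (1 - (epoch mod interval) / N_epochs)."""
    return gamma_max * (1.0 - (epoch % interval) / n_epochs)


# Evaluate the rule at a few epochs under the assumed settings.
schedule = [
    oversampling_factor(epoch, gamma_max=1.0, interval=10, n_epochs=100)
    for epoch in range(0, 30, 5)
]
```

Evaluated literally, the expression returns to γ_max whenever the epoch is a multiple of the interval, and the reduction within one period is bounded by interval / N_epochs. How the factor is carried over between periods is not specified in the text.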

3.2. Strong–Weak Contrastive Consistency

To enhance the robustness and feature discrimination capability of models in cell segmentation tasks, we proposed a strong–weak contrast consistency (SWC) strategy, which integrates dual-stream perturbation contrast learning, strong–weak perturbation consistency supervision, labeled perturbation supervision mechanisms, and staining normalization enhancement techniques. Below, we provide a detailed explanation of each module along with its mathematical formulation. Additionally, Figure 2 presents a schematic illustration comparing the SWC strategy with FixMatch [16] and UniMatch [25].

As illustrated in Figure 2, the proposed SWC strategy differs substantially from existing semi-supervised learning methods such as FixMatch and UniMatch. X, X^{W}, X^{S}, X_{u}^{W}, and X_{u}^{S} respectively represent the labeled data, the weakly perturbed features of labeled data, the strongly perturbed features of labeled data, the weakly perturbed features of unlabeled data, and the strongly perturbed features of unlabeled data. y_{x}, p^{x}, p_{x_{S}}, p_{u}^{W}, and p_{u}^{S} respectively represent the labels of the labeled data, the prediction results for labeled data, the prediction results based on strongly perturbed features of labeled data, the prediction results for weakly perturbed unlabeled data, and the prediction results based on strongly perturbed features of unlabeled data. Unlike FixMatch, which depends exclusively on pseudo-label confidence and hard thresholding, SWC incorporates a two-stream perturbation comparison mechanism that aligns representations under both strong and weak perturbations. This mitigates confirmation bias and enhances feature-level consistency. In contrast to UniMatch, our method learns perturbation-invariant features by contrasting two strongly perturbed features. Moreover, our approach diverges from both FixMatch and UniMatch in that it applies strong perturbation feature generation not only to unlabeled data but also to labeled data. By constructing perturbation space consistency, our method guides the learning of unlabeled data, thereby further improving the robustness and learning efficiency of the model.

3.2.1. Dual-Stream Perturbation and Contrastive Learning

UniMatch [25] employs the DusPerb method to expand the perturbation space from two distinct perspectives, thereby fully capturing the diversity of original perturbations. Inspired by this approach, we adopted a similar perturbation strategy in this work. Specifically, for each unlabeled sample x_{u}, two strongly augmented feature representations, denoted as q_{u,s1}^{strong} and q_{u,s2}^{strong}, are generated. These perturbations aim to enhance the discriminative representation of cell regions, thereby improving the ability of the model to distinguish cellular structures from complex backgrounds. In addition, we introduced the CutMix [38] data augmentation technique, which utilizes randomly sampled templates to synthesize images with diverse staining styles. This strategy not only increases data diversity but also overcomes the limitations of using fixed staining templates.

To maximize consistency between the strongly perturbed features, a contrastive learning loss is defined, as shown in Equation (7):

L_{contrast}^{strong} = - \frac{1}{N} \sum_{j \in \{s1, s2\}} \log \frac{\exp(q_{u,s1}^{strong} \cdot q_{u,s2}^{strong})}{\sum_{i=0}^{C} \exp(q_{u,j}^{strong} \cdot k_{i})}

(7)

Here, N represents the number of samples, q_{u,s1}^{strong} and q_{u,s2}^{strong} represent the strongly perturbed features, k_{i} is the classifier weight, and C represents the number of classifier weights. By maximizing the similarity between features, the model can learn robust feature representations in complex backgrounds.
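One possible reading of Equation (7) is sketched below in NumPy. It assumes that the two strongly perturbed features are stored as row-aligned arrays of shape (N, D), that the classifier weights k_{i} are the rows of a matrix, and that the outer sum is also taken over the N samples; these shapes and the function names are assumptions for illustration and are not given in the paper.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def strong_contrastive_loss(q_s1, q_s2, k):
    """Equation (7) under the assumed shapes.

    q_s1, q_s2: (N, D) strongly perturbed features of the same N samples.
    k: (K, D) classifier weights, one row per k_i.
    """
    positive = np.sum(q_s1 * q_s2, axis=1)  # q_{u,s1} . q_{u,s2} for each sample
    total = 0.0
    for q_j in (q_s1, q_s2):  # j in {s1, s2}
        log_denominator = logsumexp(q_j @ k.T, axis=1)  # log sum_i exp(q_j . k_i)
        total += np.sum(positive - log_denominator)
    return -total / q_s1.shape[0]


# Toy check with all-zero features and four classifier weights.
loss_zero = strong_contrastive_loss(np.zeros((3, 5)), np.zeros((3, 5)), np.zeros((4, 5)))
```

Working in log space keeps the computation stable when the dot products are large; the value itself is unchanged.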

3.2.2. Consistent Supervision of Strong and Weak Perturbation

To further enhance the quality of pseudo-labels, a weakly perturbed feature q_{u}^{weak} is introduced for the unlabeled data. The pseudo-label \hat{y}_{u} is then generated from it, as in Equation (8), so that the semantic consistency of the original data is preserved under weak perturbation, thereby improving the reliability of the model’s predictions.

\hat{y}_{u} = \arg\max_{y} P(y \mid q_{u}^{weak})

(8)

The method proposed in UDA [39] employs RandAugment and back-translation to generate two distinct strongly augmented variants of unlabeled data and computes the consistency loss through pseudo-labeling. This approach ensures that the model remains robust to augmented data. Inspired by this, we incorporated pseudo-labels combined with dual-stream strong perturbation for consistency supervision, and we define the loss function as presented in Equation (9):

L_{consist}^{weak-strong} = - \frac{1}{2N} \sum_{i=1}^{N} \left( \log P(\hat{y}_{u} \mid q_{u,s1}^{strong}) + \log P(\hat{y}_{u} \mid q_{u,s2}^{strong}) \right)

(9)

where q_{u}^{weak} denotes a weakly perturbed feature, and \hat{y}_{u} represents the pseudo-label generated from the prediction for the weakly perturbed input. Weak perturbations cause minimal distortion to the original image structure, thereby ensuring that the generated pseudo-labels are closer to the true labels. q_{u,s1}^{strong} and q_{u,s2}^{strong} correspond to the strongly perturbed features. The consistency loss is computed between each of the two strongly perturbed features and the pseudo-label to ensure model consistency across various perturbation conditions. This approach mitigates bias potentially introduced by a single perturbation and reduces the impact of potential errors in the pseudo-labels. This process further enhances the robustness of feature representations for unlabeled data and improves the stability of pseudo-label generation. Moreover, each unlabeled sample is reused in two strongly perturbed versions, effectively simulating the effects of data augmentation, which is particularly advantageous in scenarios where medical image annotations are limited.
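Equations (8) and (9) can be illustrated with class-probability arrays. The sketch assumes that each row is one prediction unit (for segmentation, for example, one pixel after flattening) and that the model outputs are already normalized probabilities; the function names, the `eps` guard, and the toy values are hypothetical.

```python
import numpy as np


def pseudo_labels(p_weak):
    """Equation (8): most probable class under the weakly perturbed view."""
    return np.argmax(p_weak, axis=1)


def weak_strong_consistency_loss(p_weak, p_s1, p_s2, eps=1e-12):
    """Equation (9): cross-entropy of both strong views against the pseudo-label."""
    y_hat = pseudo_labels(p_weak)
    rows = np.arange(len(y_hat))
    log_p1 = np.log(p_s1[rows, y_hat] + eps)  # log P(y_hat | q_{u,s1})
    log_p2 = np.log(p_s2[rows, y_hat] + eps)  # log P(y_hat | q_{u,s2})
    return -np.sum(log_p1 + log_p2) / (2 * len(y_hat))


# Toy probabilities for two prediction units and two classes.
p_weak = np.array([[0.9, 0.1], [0.2, 0.8]])
p_strong = np.array([[0.5, 0.5], [0.5, 0.5]])
y_hat = pseudo_labels(p_weak)
loss = weak_strong_consistency_loss(p_weak, p_strong, p_strong)
```

The pseudo-label is treated as a fixed target here, so only the strongly perturbed predictions would receive gradients in a training framework.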

3.2.3. Supervised Mechanism with Label Perturbation

Although consistent supervision using both strong and weak perturbations can effectively enhance the learning performance of unlabeled data, the absence of direct guidance from ground-truth labels may result in the accumulation of pseudo-label errors. Miyato et al. [40] computed the prediction results for both original samples and perturbed samples, applied consistency loss to ensure stable predictions under perturbations, and suggested that this approach could be extended to labeled data. Therefore, this section introduces consistency supervision for labeled data to provide indirect guidance for the learning process of unlabeled data, thereby assisting the model in correcting cognitive biases and improving its segmentation capability on real-world data. For labeled data x_{s}, strong perturbation features q_{s}^{strong} are generated, and the supervised loss is calculated in conjunction with the ground-truth labels y_{s}, as shown in Equation (10):

L_{consist}^{label} = - \frac{1}{N} \sum_{i=1}^{N} \left( \log P(y_{s} \mid q_{s}) + \log P(y_{s} \mid q_{s}^{strong}) \right)

(10)

This mechanism offers indirect supervision for pseudo-label generation, thereby addressing the model’s cognitive bias toward real data and further improving segmentation performance. Finally, by integrating the contrastive learning loss of unlabeled data, the consistency regularization loss from strong and weak perturbations, and the supervised loss of labeled data, the overall loss function is formulated as shown in Equation (11):

L_{total} = L_{contrast}^{strong} + λ_{1} L_{consist}^{weak-strong} + λ_{2} L_{consist}^{label}

(11)

Here, λ_{1} and λ_{2} are hyperparameters that adjust the weights of the loss terms. This comprehensive loss function enables collaborative optimization in the learning process of both labeled and unlabeled data, thereby enhancing the model’s robustness and accuracy in scenarios with imbalanced data distribution and complex backgrounds.
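The labeled-data term of Equation (10) and the weighted sum of Equation (11) can be sketched in the same style. The array layout (one row per prediction unit), the function names, the `eps` guard, and the default values of λ_{1} and λ_{2} are assumptions; the paper does not report them in this section.

```python
import numpy as np


def labeled_consistency_loss(p_clean, p_strong, y, eps=1e-12):
    """Equation (10): supervised loss on the original and strongly perturbed views."""
    rows = np.arange(len(y))
    log_clean = np.log(p_clean[rows, y] + eps)    # log P(y_s | q_s)
    log_strong = np.log(p_strong[rows, y] + eps)  # log P(y_s | q_s^strong)
    return -np.mean(log_clean + log_strong)


def total_loss(l_contrast, l_weak_strong, l_label, lambda_1=1.0, lambda_2=1.0):
    """Equation (11): weighted sum of the three loss terms."""
    return l_contrast + lambda_1 * l_weak_strong + lambda_2 * l_label


# Toy probabilities for two labeled prediction units and two classes.
p = np.array([[0.5, 0.5], [0.5, 0.5]])
y = np.array([0, 1])
l_label = labeled_consistency_loss(p, p, y)
l_total = total_loss(1.0, 2.0, 3.0, lambda_1=0.5, lambda_2=2.0)
```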

3.3. Region-Adaptive Attention Mechanism

In cellular segmentation tasks, the inherent challenges posed by highly heterogeneous morphological and textural characteristics of cells, particularly in regions exhibiting fine structures, low contrast, or ambiguous boundaries, significantly compromise the effectiveness of conventional convolutional neural networks (CNNs) and standard attention mechanisms. These limitations primarily manifest in two aspects: (1) Traditional attention frameworks employ static weighting strategies that fail to adaptively recalibrate feature importance according to regional complexity levels, resulting in insufficient attention allocation to diagnostically critical areas. (2) The inherent confirmation bias in pseudo-labeling processes tends to reinforce erroneous high-confidence predictions during iterative training, thereby exacerbating error propagation and ultimately degrading segmentation accuracy.

To address these issues, we propose a region-adaptive attention (RAA) mechanism based on gradient changes, as illustrated in Figure 3. By dynamically adjusting its focus on different regions, this mechanism mitigates the model’s reinforcement of incorrectly predicted areas, thereby improving segmentation performance in complex regions.

Specifically, each scale

T_{i}

(the four different scale outputs from the encoder) is processed to compute the difficulty score

g (x_{i})

for each region. This score is derived based on the gradient variation within the region, reflecting the complexity of feature changes in that region. For each region

x_{i}

, its gradient magnitude

∣ \nabla I (x_{i}) ∣

is computed, which represents the degree of pixel intensity variation in the region. The variance

g (x_{i})

in the gradient magnitude for the region is then calculated using Equation (12) to assess the difficulty level of the region.

g (x_{i}) = Var (∣ \nabla I (x_{i}) ∣)

(12)

Regions with large gradient variance typically exhibit high feature variation, such as blurred boundaries or complex textures, indicating that these regions are more challenging. Consequently, they will receive greater attention in subsequent steps. Based on the gradient variance

g (x_{i})

, dynamic weights

α_{i}

are assigned to each region to reflect the level of attention required. During computation, the Sigmoid function is applied to map the difficulty score, thereby obtaining the dynamic weights as shown in Equation (13):

α_{i} = \frac{1}{1 + exp (- g (x_{i}))}

(13)

This weight ensures that regions with larger gradient variations (i.e., higher difficulty regions) receive increased attention, enabling the model to concentrate its resources on processing these complex areas more effectively. Through this dynamic adjustment mechanism, the model adaptively prioritizes challenging regions during training while mitigating overfitting to easily distinguishable regions.
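
A minimal NumPy sketch of Equations (12) and (13) is given below. It assumes that a region is a non-overlapping square patch of a 2-D map and that the gradient is approximated by finite differences (`np.gradient`); the patch size and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def difficulty_score(region):
    """Equation (12): variance of the gradient magnitude inside one region."""
    gy, gx = np.gradient(region.astype(np.float64))  # finite differences (assumption)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return float(grad_mag.var())

def dynamic_weight(score):
    """Equation (13): sigmoid of the difficulty score."""
    return 1.0 / (1.0 + np.exp(-score))

def region_weights(feature_map, patch=8):
    """Split a 2-D map into non-overlapping patch x patch regions
    (hypothetical partitioning) and compute one weight per region."""
    h, w = feature_map.shape
    rows, cols = h // patch, w // patch
    alpha = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            region = feature_map[r * patch:(r + 1) * patch,
                                 c * patch:(c + 1) * patch]
            alpha[r, c] = dynamic_weight(difficulty_score(region))
    return alpha
```

Because a variance is never negative, the sigmoid in Equation (13) maps every region to a weight in [0.5, 1): a perfectly flat region receives 0.5, and regions with more irregular gradients move toward 1.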

Then, a multi-head self-attention mechanism [41] was adopted to combine features of different scales, and a dynamic weight

α_{i}

was introduced in the attention calculation to further enhance the focus on regions with higher difficulty. Specifically, for each scale

T_{i} (i = 1, 2, 3, 4)

, the corresponding Query, Key, and Value were calculated through Equation (14):

Q = {\tilde{T}}_{i} W_{Q}, K = \sum \tilde{T} W_{K}, V = \sum \tilde{T} W_{V}

(14)

Here,

W_{Q}

,

W_{K}

, and

W_{V}

are the corresponding weight matrices,

{\tilde{T}}_{i}

represents the input region tokens, and

\sum \tilde{T}

is the result of concatenating all scale outputs. The similarity matrix

M_{i}

is calculated through these inputs, and the weighted output

{\tilde{O}}_{i}

is obtained through Equation (15) to acquire the feature representation of each region.

{\tilde{O}}_{i} = M_{i} V^{T} = δ [softmax (\frac{Q^{T} K}{\sqrt{\sum D}})] V^{T}

(15)

Here,

δ

denotes the instance normalization operation, which serves to normalize the feature matrix of each region. To further enhance the feature representation capability, the DMLP module within the DAT-Net [42] network structure is applied to the output of each scale. Through deep semantic enhancement, the feature representation is enriched, thereby improving both the expressiveness of features and the model’s ability to understand complex regions. The final feature representation

O_{i}

is obtained via Equation (16).

O_{i} = {\tilde{T}}_{i} + α_{i} \cdot {\tilde{O}}_{i} + DMLP ({\tilde{T}}_{i} + {\tilde{O}}_{i})

(16)

Finally, through the convolutional layer, the enhanced feature representation is mapped back to the original input space, yielding the final output

O_{1}, O_{2}, O_{3}, O_{4}

. These outputs not only capture high-level semantic information but also effectively emphasize complex regions requiring correction.
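
The flow of Equations (14)–(16) for a single scale can be sketched in NumPy as follows. This is a single-head simplification under several assumptions that the text does not confirm: tokens are stored as rows, the concatenation of scales is along the channel axis, the scaling term is the square root of the total channel count, the result of Equation (15) is transposed back to the token layout before the residual sum, and the DMLP module is replaced by an arbitrary callable `mlp`. All function and argument names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_norm(m, eps=1e-5):
    """Stand-in for the instance normalization delta:
    zero mean and unit variance over one matrix."""
    return (m - m.mean()) / np.sqrt(m.var() + eps)

def raa_scale_output(t_i, t_all, w_q, w_k, w_v, alpha_i, mlp):
    """Single-head sketch of Equations (14)-(16) for one scale.

    t_i:     (n, C_i)   tokens of scale i
    t_all:   (n, C_sum) tokens of all scales concatenated along channels
    alpha_i: scalar or (n, 1) array of dynamic weights from Equation (13)
    mlp:     callable standing in for the DMLP module (assumption)
    """
    q = t_i @ w_q                      # Equation (14): (n, C_i)
    k = t_all @ w_k                    # (n, C_sum)
    v = t_all @ w_v                    # (n, C_sum)
    c_sum = t_all.shape[1]
    # Equation (15): channel-wise similarity, softmax, then normalization
    m_i = instance_norm(softmax(q.T @ k / np.sqrt(c_sum), axis=-1))  # (C_i, C_sum)
    o_tilde = (m_i @ v.T).T            # transposed back to (n, C_i)
    # Equation (16): residual sum with the dynamically weighted attention output
    return t_i + alpha_i * o_tilde + mlp(t_i + o_tilde)
```

In this form `alpha_i` scales only the attention branch, so a larger weight from Equation (13) increases the contribution of the attention output to the final features.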

4. Experiment

4.1. Datasets and Evaluation Metrics

This paper employed two datasets that are widely used in medical image nuclei segmentation tasks: the PanNuke [43] dataset and the Conic [44] dataset. The PanNuke dataset served as the primary training and evaluation dataset, comprising 7901 images of size 256 × 256 pixels. This dataset encompasses 19 distinct tissue types and five cell categories, namely, tumor cells, non-tumor epithelial cells, inflammatory cells, connective tissue cells, and dead cells. Notably, the dataset exhibits significant class imbalance, particularly a severe underrepresentation of the dead cell category, which is reflected in the statistics of both cell and tissue categories. The Conic dataset consists of H&E-stained colon tissue section images, containing 4981 images of size 256 × 256 pixels. In this dataset, cells are categorized into six classes: epithelial cells, lymphocytes, plasma cells, eosinophils, neutrophils, and connective tissue cells. To evaluate the network’s segmentation accuracy under semi-supervised conditions, the PanNuke and Conic datasets were partitioned as presented in Table 1. A labeled training dataset was created through random sampling from the original training dataset, while the remaining portion formed the unlabeled dataset.

We adopted Precision, Recall, F1-score, and Mean Intersection over Union (MIOU) as the evaluation metrics for the segmentation performance of the semi-supervised model.

Recall = \frac{TP}{TP + FN}

(17)

Precision = \frac{TP}{TP + FP}

(18)

F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(19)

where True Positive (TP) represents the count of correctly segmented target pixels, False Positive (FP) denotes the count of pixels from other classes incorrectly segmented as target pixels, and False Negative (FN) indicates the count of target pixels mistakenly segmented as non-target pixels. MIOU is a key metric for assessing the accuracy of semantic segmentation models. For each class, the ratio of the intersection to the union of the predicted and ground-truth regions is computed, and these ratios are then averaged over all classes, as expressed in Equation (20):

MIOU = \frac{1}{k} \sum_{i = 1}^{k} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}

(20)

where k represents the number of pixel categories, excluding the background class.
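
Equations (17)–(20) translate directly into a few lines of Python. The sketch below takes pixel counts as input; the function names are hypothetical, and two conventions are assumptions of this example rather than statements from the paper: a zero denominator yields 0.0, and classes with no pixels in either the prediction or the ground truth are skipped when averaging.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (17)-(19) from pixel counts of one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def mean_iou(per_class_counts):
    """Equation (20): average of TP / (TP + FP + FN) over foreground classes.

    per_class_counts: iterable of (tp, fp, fn) tuples, background excluded.
    Classes with an empty union are skipped (assumed convention).
    """
    ious = [tp / (tp + fp + fn)
            for tp, fp, fn in per_class_counts
            if tp + fp + fn]
    return sum(ious) / len(ious) if ious else 0.0
```

For example, a class with 8 true positives, 2 false positives, and 2 false negatives has a precision, recall, and F1-score of 0.8 and an intersection over union of 8/12.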

4.2. Implementation and Training Details

We conducted all experiments in PyTorch 1.13.1 on an NVIDIA GeForce RTX 4090 GPU. We used mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of

5 \times 10^{- 5}

. The batch size was set to 16, and the model was trained for 400 epochs. The initial learning rate was 0.005, and it was decayed by cosine annealing with a period of 400.
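
For reference, the learning-rate schedule described above can be written in closed form. The sketch below assumes that the period is counted in epochs and that the rate decays to zero; neither the unit of the period nor the minimum learning rate is stated in the text, and the function name is hypothetical.

```python
import math

def cosine_annealed_lr(epoch, base_lr=0.005, period=400, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 down to min_lr at epoch == period.
    min_lr = 0.0 is an assumption; the text does not state a minimum rate."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / period))
```

Under these assumptions the rate starts at 0.005, reaches half of that value at epoch 200, and approaches zero at epoch 400.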

Upon data loading, the original images and their corresponding labels underwent standardization and augmentation to enhance the model’s robustness and generalization capabilities. First, input images and labels were randomly scaled with a scaling ratio ranging from 0.5 to 2.0 to improve the model’s scale invariance. Additionally, fixed-size cropping was applied to enhance the model’s adaptability to spatial transformations while ensuring that the input data met computational requirements. Meanwhile, distinct invalid region markers were assigned for labeled and unlabeled data to prevent interference from unlabeled data during training. Furthermore, images were randomly horizontally flipped with a 50% probability to improve the model’s adaptability to data from various orientations. For unlabeled data, this study introduced a multi-view augmentation strategy to strengthen the model’s generalization ability and self-supervised learning performance. Specifically, color jittering was applied with an 80% probability to adjust brightness, contrast, saturation, and hue, enabling the model to handle diverse lighting conditions. Random grayscale conversion was applied with a 20% probability to enhance recognition of black-and-white or low-saturation images. Gaussian blur was applied with a 50% probability to simulate real-world blurring effects, thereby improving the model’s ability to learn boundary information.
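
The probabilities listed above can be illustrated with a simplified NumPy sketch. It is not the authors' pipeline: random scaling and cropping are omitted, color jittering is reduced to brightness and contrast with an assumed magnitude (`jitter=0.4`), and a 3 × 3 binomial kernel stands in for the Gaussian blur. All function names are hypothetical, and images are assumed to be float arrays of shape (H, W, 3) with values in [0, 1].

```python
import numpy as np

def weak_augment(image, label, rng):
    """Horizontal flip with 50% probability, applied jointly to image and label."""
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    return image, label

def blur3(img):
    """Separable 3x3 binomial blur ([1, 2, 1] / 4), a stand-in for Gaussian blur."""
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    h = (p[:, :-2] + 2 * p[:, 1:-1] + p[:, 2:]) / 4.0   # horizontal pass
    return (h[:-2] + 2 * h[1:-1] + h[2:]) / 4.0         # vertical pass

def strong_augment(image, rng, jitter=0.4):
    """Color jitter (p = 0.8), grayscale (p = 0.2), and blur (p = 0.5)."""
    out = image.astype(np.float64)
    if rng.random() < 0.8:
        out = out * (1.0 + rng.uniform(-jitter, jitter))            # brightness
        mean = out.mean()
        out = (out - mean) * (1.0 + rng.uniform(-jitter, jitter)) + mean  # contrast
    if rng.random() < 0.2:
        gray = out @ np.array([0.299, 0.587, 0.114])                # luminance
        out = np.repeat(gray[..., None], 3, axis=-1)
    if rng.random() < 0.5:
        out = blur3(out)
    return np.clip(out, 0.0, 1.0)
```

A generator created with `np.random.default_rng(seed)` can be passed as `rng` to make the sampled augmentations reproducible.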

4.3. Performance Comparison

We compared our method with state-of-the-art methods on the PanNuke and Conic datasets. Both datasets encompass a variety of cell types and exhibit a significant imbalance across these types. As such, they are considered among the most challenging instance segmentation datasets in terms of segmentation complexity. In this experiment, we evaluated the performance of CSRA-Net and compared it with that of the fully supervised network UNet as well as the state-of-the-art semi-supervised methods FixMatch and UniMatch. All methods were tested under identical conditions, including the same dataset and training settings, to ensure a fair comparison. Furthermore, detailed comparisons were conducted for scenarios where the proportion of labeled data was set to 1/4, 1/8, 1/16, and 1/32, respectively.

Table 2 summarizes the experimental results on the PanNuke dataset. CSRA-Net achieved the best performance at all annotation ratios (1/4, 1/8, 1/16, and 1/32), outperforming both FixMatch and UniMatch. Although there was a noticeable time gap between the introduction of these two state-of-the-art semi-supervised segmentation networks, their segmentation accuracies were comparable, differing by less than 0.5%. This suggests that multi-type instance segmentation against complex backgrounds is exceptionally difficult, and that even highly sophisticated network designs yield only marginal accuracy gains. Compared to UNet, CSRA-Net demonstrated significant improvements at all annotation ratios, particularly with extremely limited annotations (1/32), where the F1-score and MIOU increased by 12.75% and 10.70%, respectively. This highlights its effectiveness in learning from a small number of annotated samples. Compared to FixMatch, CSRA-Net exhibited a relatively modest improvement (approximately 1%) at the higher annotation ratios (1/4 and 1/8). At the lower annotation ratios (1/16 and 1/32), however, the F1-score increased by 3.29% and 1.77%, while the MIOU improved by 2.67% and 1.60%, respectively. This indicates that CSRA-Net outperforms FixMatch when labeled data are scarce. Compared to UniMatch, CSRA-Net maintained a consistent lead at all data ratios, albeit with slightly smaller gains than those observed against FixMatch. Specifically, at the 1/16 and 1/32 ratios, the improvements in F1-score and MIOU ranged between 1.0% and 1.3%, suggesting that the two methods performed similarly, yet CSRA-Net retained a stable advantage. In contrast to the marginal improvement observed between FixMatch and UniMatch, CSRA-Net achieved an enhancement of approximately 2% across all scenarios, providing further evidence of its effectiveness.

To provide a more intuitive illustration of the MIOU scores for various methods under different proportions of labeled data in the PanNuke dataset, Figure 4 offers a visual representation. The results shown in the figure demonstrate that CSRA-Net achieved superior segmentation performance across all four scenarios.

Table 3 illustrates the experimental results of CSRA-Net compared to fully supervised methods and the other state-of-the-art semi-supervised approaches on the Conic dataset. Comparable to the findings on the PanNuke dataset, both FixMatch and UniMatch demonstrated weaker performance relative to CSRA-Net. The segmentation accuracy results of these two networks under various conditions were similar, with an average improvement margin between them being less than 1.5%. Specifically, CSRA-Net exhibited an approximate 5% higher average accuracy across different conditions compared to FixMatch and approximately a 2% improvement compared to UniMatch. Compared to UNet, CSRA-Net demonstrated substantial improvements across all data partitions, particularly in the 1/8 and 1/16 scenarios, where the F1-score increased by 8.26% and 9.21%, and the MIOU improved by 5.82% and 6.05%, respectively. When compared to FixMatch, CSRA-Net exhibited consistent enhancements across all data proportions, with notable performance gains in the 1/8 and 1/32 scenarios, where the F1-score rose by 2% and %, and the MIOU increased by 0.7% and 2.7%, respectively. This highlights the superior effectiveness of the semi-supervised strategy SWC proposed in this work over FixMatch. In comparison to UniMatch, CSRA-Net achieved a moderate yet stable improvement across all data partitions, with the F1-score increasing by 1.6–1.9% and the MIOU improving by 0.7–1.3%. These results indicate that CSRA-Net offers distinct advantages in semi-supervised learning scenarios.

To provide a more intuitive illustration of the MIOU scores for various methods under different proportions of labeled data in the Conic dataset, Figure 5 offers a visual representation. The results shown in the figure demonstrate that CSRA-Net achieved superior segmentation performance across all four scenarios.

In summary, the experiments conducted on the PanNuke and Conic datasets show that CSRA-Net is capable of effectively leveraging limited labeled data to improve segmentation performance. Notably, in scenarios with extremely low labeling ratios (1/8, 1/16, and 1/32), it consistently outperformed other semi-supervised methods (FixMatch and UniMatch) by maintaining stable performance gains. These results validate the effectiveness and strong generalization capability of CSRA-Net in the task of semi-supervised cell instance segmentation.

4.4. Ablation Experiments

To rigorously assess the effectiveness of each component within this framework, a series of ablation experiments were performed on the PanNuke dataset to systematically evaluate the contribution of each module to the overall performance. The PanNuke dataset comprises 19 distinct tissue types and five diverse cell categories, with a significant imbalance across these classes, thereby posing substantial challenges to the segmentation task.

4.4.1. The Function of Individual Modules

Category-adaptive sampling (CAS) enhances the balance of samples across different cell types by dynamically adjusting the sampling strategy for labeled data. This indirectly optimizes the training process of unlabeled data and mitigates the prediction errors caused by class imbalance. The balanced labeled data provide a more uniform supervision signal, enabling the model to comprehensively learn the features of various cell types and generate higher-quality pseudo-labels. This reduces error propagation due to insufficient minority-class samples. During this process, the features learned from balanced batches avoid over-reliance on discriminative patterns of dominant cell types, thereby reducing the likelihood of misclassifying ambiguous cells in unlabeled data as majority classes and improving segmentation and classification accuracy. The baseline network selected for this study was DeepLabv3. As a prominent model in the segmentation domain, DeepLabv3 demonstrated exceptionally high segmentation accuracy. Even when applied to the highly challenging PanNuke dataset, DeepLabv3 achieved segmentation accuracy comparable to that of state-of-the-art semi-supervised segmentation networks. In recent years, the average improvement achieved by mainstream semi-supervised segmentation networks on the PanNuke dataset has been approximately 0.5%, with no substantial performance gains observed. As shown in Table 4, after applying the category-adaptive sampling strategy to the baseline network, the F1-score increased by 1.07%, 0.9%, 1.22%, and 1.25%, and the MIOU score increased by 0.33%, 0.67%, 0.16%, and 0.49%, when the proportion of labeled data was 1/4, 1/8, 1/16, and 1/32, respectively. Compared to the accuracy gains achieved by existing semi-supervised instance segmentation networks, the improvement brought by the category-adaptive sampling strategy is notably more significant, thereby demonstrating its effectiveness.

To verify the strong–weak contrast consistency (SWC) strategy, two strongly perturbed features were generated for an unlabeled cell image sample. Their feature consistency was constrained via contrastive learning to ensure robustness in data augmentation and feature extraction. This strategy compels the model to disregard noise and irrelevant background information in the image, focusing instead on learning the structural and morphological features of the cells, thereby extracting more robust and essential feature representations. In this study, strong perturbations were applied to labeled data to encourage the model to learn perturbation-invariant feature representations, enhancing its adaptability to variations in cell morphology. By applying strong perturbations to both labeled and unlabeled data, the proposed method promotes consistency in the model’s feature space, alleviating optimization bias caused by differences in perturbations between supervised and unsupervised data and achieving a robust integration of supervised and unsupervised signals. As shown in Table 4, after introducing the perturbation strategy, the model’s performance improved by 1.31%, 1.24%, 1.77%, and 1.65% in the F1-score and by 0.54%, 0.83%, 0.23%, and 1.05% in the MIOU score compared to the baseline network when the proportion of labeled data was 1/4, 1/8, 1/16, and 1/32, respectively. These results demonstrate the effectiveness of this strategy in enhancing the model’s robustness. Compared to the category-adaptive sampling strategy, the strong–weak contrast consistency strategy demonstrated a more pronounced improvement in accuracy, thereby validating its effectiveness in enhancing model robustness.

The region-adaptive attention (RAA) mechanism dynamically adjusts attention according to the complexity of regions, enabling the model to prioritize complex areas. This enhances the accuracy and robustness of cell segmentation while mitigating the adverse effects of erroneous learning during model training. By incorporating gradient changes, adaptive weight computation, and multi-head self-attention, RAA offers an effective enhancement strategy for medical image segmentation tasks, addressing challenges such as variable cell morphology and ambiguous boundaries. Experimental results show that after integrating the RAA method into the network, when the proportion of labeled data was 1/4, 1/8, 1/16, and 1/32, the F1-score improved by 1.65%, 1.59%, 2.42%, and 2.06%, respectively, and the MIOU scores increased by 0.69%, 0.94%, 3.08%, and 1.24%, respectively. As shown in Table 4, the region-adaptive attention mechanism demonstrated the highest effectiveness among the three strategies, with its improvement nearly equivalent to the combined effects of the other two. This suggests that in instance segmentation tasks involving complex backgrounds, the region-adaptive attention mechanism enables the model to dynamically adjust attention based on regional complexity, directing greater focus toward intricate areas and consequently improving the accuracy of cell segmentation.

4.4.2. The Function of the Combination Module

In the preceding ablation experiments, the designed methods were individually integrated into the baseline network to validate their standalone contributions. The results demonstrate the effectiveness of the proposed methods in addressing the complex task of cell instance segmentation. In the subsequent ablation experiments, a detailed analysis was conducted on the performance improvements achieved by different module combinations under varying proportions of labeled data. As shown in Table 5, the Base+SWC+RAA scheme consistently outperformed the Base+CAS+SWC scheme across all data partitions, achieving higher F1-score and MIOU improvements. This indicates that the RAA component plays a more critical role in enhancing cell feature extraction and fine-grained information learning. Specifically, under the 1/4, 1/8, 1/16, and 1/32 data partitions, the F1-score of Base+SWC+RAA improved by 1.30%, 0.90%, 1.99%, and 1.92%, respectively, compared to the baseline network. Similarly, the MIOU index improved by 1.18%, 1.20%, 4.25%, and 2.55%, respectively.

Furthermore, CSRA-Net (Base + CAS + SWC + RAA) demonstrated the best performance across all experiments. Compared to the baseline model, this network achieved an average accuracy improvement of 1.5%, outperforming the enhancement observed among mainstream networks and thereby validating the effectiveness of the combined module. Specifically, in the 1/16 data split, its F1-score and MIOU improved by 5.64% and 5.75%, respectively, compared to the baseline. This highlights the significant enhancement in the network’s learning capability achieved through the synergistic interaction of multiple modules in scenarios with limited labeled data. However, in the 1/32 data split, despite achieving the highest improvement among all methods (the F1-score increased by 2.95%, and the MIOU increased by 3.43%), the performance gain was relatively smaller compared to the 1/16 split. This indicates that under conditions of extremely scarce labeled data, the model’s overall performance is constrained, leading to a saturation effect in the contributions of additional modules.

5. Discussion and Conclusions

This paper presents CSRA-Net, a segmentation network based on category-adaptive sampling and attention mechanisms designed for semi-supervised instance segmentation of cell nuclei. First, to address the prediction bias arising from class imbalance in semi-supervised instance segmentation, we proposed a category-adaptive sampling method. By dynamically adjusting the sampling strategy for labeled data, this method enhances the balance of samples across different cell types, allowing the model to effectively learn the characteristics of various cell types and consequently produce more accurate segmentation results. Second, to tackle the segmentation robustness challenge encountered by semi-supervised models, we proposed a strong–weak contrast consistency strategy. This strategy compels the model to disregard noise and irrelevant background information in images while emphasizing the learning of structural and morphological features of cells. Consequently, it enables the extraction of more robust and intrinsic feature representations, thereby improving the model’s robustness. Finally, we proposed a region-adaptive attention mechanism that adaptively adjusts the attention weights based on regional complexity. This enables the model to focus more effectively on complex regions, thereby enhancing the accuracy of segmentation. The proposed network, acting as a universal semi-supervised instance segmentation method, not only provides valuable insights for both semi-supervised and fully supervised segmentation approaches but also effectively tackles the challenges of sample category imbalance and limited segmentation robustness in segmentation tasks.

Despite the satisfactory performance of CSRA-Net, several limitations should be noted. First, the method heavily relies on the quality of pseudo-labels generated during the semi-supervised learning process. Although our strong–weak contrast consistency strategy helps mitigate the impact of noise in pseudo-labels, the potential accumulation of errors in the pseudo-labels remains a challenge. These errors could lead to suboptimal model performance, particularly in complex or ambiguous regions of the images. Additionally, while the dynamic category-adaptive sampling method effectively addresses the issue of class imbalance, it requires careful hyperparameter tuning to ensure optimal performance across different datasets. For instance, the inverse proportional strategy used in our method to calculate the oversampling factor for each class may not achieve ideal balance in cases of extreme class imbalance between cell types.

In future work, we will explore strategies to further improve pseudo-label quality, such as incorporating self-training techniques or refining the adaptive sampling method to better handle extreme class imbalances. Furthermore, reducing the computational overhead of the region-adaptive attention mechanism could enhance the applicability of CSRA-Net in more resource-constrained environments.

Author Contributions

Methodology, X.L. and D.L.; Software, Z.W. and J.L.; Validation, Z.Y.; Formal analysis, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China under Grant 62376089.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in this article. All codes and trained models are available at https://github.com/ldrunning/Semi-supervised-segmentation (accessed on 27 April 2025). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhou, Y.; Dou, Q.; Chen, H.; Qin, J.; Li, Y.; Wang, D.; Zang, Y.; Wang, X.; Heng, P.A. SFCN-OPI: Detection and fine-grained classification of nuclei using sibling FCN with objectness prior interaction. Proc. AAAI Conf. Artif. Intell. 2018, 32, 2652–2659.
Sirinukunwattana, K.; Raza, S.E.A.; Tsang, Y.W.; Snead, D.R.J.; Cree, I.A.; Rajpoot, N.M. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans. Med. Imaging 2016, 35, 1196–1206.
Chen, X.; Zhou, X.; Wong, S.T.C. Automated segmentation, classification, and tracking of cancer cell nuclei in time-lapse microscopy. IEEE Trans. Biomed. Eng. 2006, 53, 762–766.
BenTaieb, A.; Hamarneh, G. Deep learning models for digital pathology. arXiv 2019, arXiv:1910.12329.
Chen, X.X. Research on Medical Image Segmentation Method Based on CNN and Transformer. Master’s Thesis, Anhui University, Hefei, China, 2024.
Liu, L.; Wu, F.X.; Wang, Y.P.; Wang, J. Multi-receptive-field CNN for semantic segmentation of medical images. IEEE J. Biomed. Health Inform. 2020, 24, 3215–3225.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
Zhou, D.; Xu, L.; Wang, T.; Wei, S.; Liu, Y.; Zhang, H.; Li, M.; Chen, J.; Zhao, Q.; Wang, X. M-DDC: MRI based demyelinative diseases classification with U-Net segmentation and convolutional network. Neural Netw. 2024, 169, 108–119.
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis; Springer: Cham, Switzerland, 2018; pp. 3–11.
Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Wang, Y.; Chen, Y.; Wu, J. UNet 3+: A full-scale connected UNet for medical image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059.
Ji, Y.; Zhang, R.; Wang, H.; Li, Z.; Wu, L.; Zhang, S.; Luo, P. Multi-compound transformer for accurate biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Strasbourg, France, 27 September–1 October 2021; pp. 326–336.
Graham, S.; Vu, Q.D.; Raza, S.E.A.; Azam, A.; Tsang, Y.W.; Kwak, J.T.; Rajpoot, N. Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image Anal. 2019, 58, 101563.
Zhou, Y.; Onder, O.F.; Dou, Q.; Tsougenis, E.; Chen, H.; Heng, P.-A. CIA-Net: Robust nuclei instance segmentation with contour-aware information aggregation. In Proceedings of the Information Processing in Medical Imaging (IPMI), Hong Kong, China, 2–7 June 2019; pp. 682–693.
Yao, K.; Huang, K.; Sun, J.; Hussain, A.; Jude, C. PointNU-Net: Simultaneous multi-tissue histology nuclei segmentation and classification in the clinical wild. arXiv 2021, arXiv:2111.01557.
Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1196.
Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.-L. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
Chen, X.; Yuan, Y.; Zeng, G.; Wang, J. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 2613–2622.
Xue, Y.; Xu, T.; Zhang, H.; Long, L.R.; Huang, X. SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation. Neuroinformatics 2018, 16, 383–392.
Kumar, N.; Verma, R.; Sharma, S.; Bhargava, S.; Vahadane, A.; Sethi, A. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans. Med. Imaging 2017, 36, 1550–1560.
Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the ICML Workshop on Representation Learning, Atlanta, GA, USA, 16–21 June 2013.
Feng, Z.; Zhou, Q.; Cheng, G.; Xin, T.; Shi, J.; Ma, L. Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum. arXiv 2020, arXiv:2003.06074.
Yu, L.; Wang, S.; Li, X.; Fu, C.-W.; Heng, P.-A. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Shenzhen, China, 13–17 October 2019; pp. 605–613.
Xie, Q.; Luong, M.-T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 10687–10698.
Su, J.; Luo, Z.; Lian, S.; Lin, D.; Li, S. Mutual learning with reliable pseudo label for semi-supervised medical image segmentation. Med. Image Anal. 2024, 94, 103111.
Yang, L.; Qi, L.; Feng, L.; Zhang, W.; Shi, Y. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7236–7246.
Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242.
Shu, J.H.; Nian, F.D.; Lv, G. Semi-supervised cell segmentation algorithm based on self-consistent regularization constraint. Pattern Recognit. Artif. Intell. 2020, 33, 643–652.
Li, X.; Yu, L.; Chen, H.; Fu, C.-W.; Xing, L.; Heng, P.-A. Transformation-consistent self-ensembling model for semi-supervised medical image segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 523–534.
Wu, Y.; Xu, M.; Ge, Z.; Cai, J.; Zhang, L. Semi-supervised left atrium segmentation with mutual consistency training. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Strasbourg, France, 27 September–1 October 2021; pp. 297–306.
Wu, Y.; Ge, Z.; Zhang, D.; Xu, M.; Zhang, L.; Xia, Y.; Cai, J. Mutual consistency learning for semi-supervised medical image segmentation. Med. Image Anal. 2022, 81, 102530.
Liu, Y.; Tian, Y.; Chen, Y.; Li, W.; Wang, H.; Zhang, S.; Yang, Z.; Huang, J.; Zhao, R.; Lin, X. Perturbed and strict mean teachers for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4258–4267.
Li, C.; Liu, H. Generative adversarial semi-supervised network for medical image segmentation. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 303–306.
Peiris, H.; Chen, Z.; Egan, G.; Harandi, M. Duo-SegNet: Adversarial dual-views for semi-supervised medical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Strasbourg, France, 27 September–1 October 2021; pp. 428–438.
Zhao, F.; Chen, Y.; Huang, K.; He, X.; Chen, X.; Hou, Y. Semi-bgSegNet: A semi-supervised boundary-guided breast tumor segmentation network. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Cartagena, Colombia, 18–21 April 2023; pp. 1–5.
Chen, Z.M.; Xuan, S.B. Cross-domain block contrastive semi-supervised nuclear segmentation with virtual adversarial training. Comput. Technol. Dev. 2024, 34, 37–44.
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
Almarshdi, R.; Nassef, L.; Fadel, E.; Alowidi, N. Hybrid deep learning based attack detection for imbalanced data classification. Intell. Autom. Soft Comput. 2023, 35, 297–320.
Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. arXiv 2019, arXiv:1905.04899.
Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.T.; Le, Q.V. Unsupervised data augmentation for consistency training. Adv. Neural Inf. Process. Syst. 2020, 33, 6256–6268. [Google Scholar]
Miyato, T.; Maeda, S.; Koyama, M.; Nakae, K.; Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1979–1993. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
Mei, M.; Wei, Z.; Hu, B.; Zhang, Y.; Li, X.; Wang, J.; Chen, L.; Liu, Q.; Yang, H.; Zhou, Y. DAT-Net: Deep aggregation transformer network for automatic nuclear segmentation. Biomed. Signal Process. Control 2024, 98, 106764. [Google Scholar] [CrossRef]
Gamper, J.; Koohbanani, N.A.; Benes, K.; Graham, S.; Jahanifar, M.; Khurram, S.A.; Azam, A.; Hewitt, K.; Rajpoot, N. Pannuke dataset extension, insights and baselines. arXiv 2020, arXiv:2003.10778. [Google Scholar]
Graham, S.; Jahanifar, M.; Vu, Q.D.; Hadjigeorghiou, G.; Leech, T.; Snead, D.; Raza, S.E.A.; Minhas, F.; Rajpoot, N. Conic: Colon nuclei identification and counting challenge 2022. arXiv 2021, arXiv:2111.14485. [Google Scholar]

Figure 1. Overall architecture diagram of the CSRA-Net network.

Figure 2. A comparison chart illustrating the SWC strategy versus other methods.

Figure 3. Structure diagram of the region-adaptive attention.

Figure 4. MIOU scores of different methods on the PanNuke dataset.

Figure 5. MIOU scores of different methods on the Conic dataset.

Table 1. Division of the PanNuke and Conic datasets.

Dataset	Labeled Proportion	Labeled Training Set	Unlabeled Training Set	Validation Set
PanNuke	1/4	1578	4734	1589
	1/8	789	5523	1589
	1/16	394	5918	1589
	1/32	197	6115	1589
Conic	1/4	996	2988	997
	1/8	498	3486	997
	1/16	249	3735	997
	1/32	124	3860	997
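
The split sizes in Table 1 can be reproduced with simple arithmetic. The sketch below assumes that the labeled subset is the floor of the training-set size divided by the split denominator, which matches every row of the table; the training-set totals (6312 for PanNuke, 3984 for Conic) are the sums of the labeled and unlabeled columns. The function name `split_sizes` is hypothetical and not taken from the paper.

```python
# Training-set totals, obtained by summing the labeled and unlabeled
# columns of Table 1 (e.g. 1578 + 4734 = 6312 for PanNuke).
TRAIN_TOTALS = {"PanNuke": 6312, "Conic": 3984}


def split_sizes(total_train: int, denominator: int) -> tuple:
    """Return (labeled, unlabeled) counts for a 1/denominator labeled split.

    Assumption: the labeled count is rounded down (floor division),
    which is inferred from the table values, not stated in the paper.
    """
    labeled = total_train // denominator
    unlabeled = total_train - labeled
    return labeled, unlabeled


if __name__ == "__main__":
    for name, total in TRAIN_TOTALS.items():
        for denom in (4, 8, 16, 32):
            labeled, unlabeled = split_sizes(total, denom)
            print(f"{name} 1/{denom}: labeled={labeled}, unlabeled={unlabeled}")
```

Under this assumption, the 1/16 and 1/32 rows are the only ones where rounding matters, since 6312 and 3984 are not divisible by 16 and 32 respectively.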

Table 2. Comparison experiment results on the PanNuke dataset.

Split	Method	F1-Score	Precision (P)	Recall (R)	MIOU
1/4	UNet	0.6464	0.6818	0.6202	0.5096
	FixMatch	0.6717	0.6915	0.6586	0.5375
	UniMatch	0.6708	0.6754	0.6696	0.5346
	CSRA-Net	0.6854	0.7002	0.6758	0.5465
1/8	UNet	0.6180	0.6475	0.5973	0.4803
	FixMatch	0.6551	0.6871	0.6304	0.5198
	UniMatch	0.6518	0.6910	0.6224	0.5181
	CSRA-Net	0.6639	0.7028	0.6376	0.5283
1/16	UNet	0.5446	0.6376	0.5176	0.4253
	FixMatch	0.5829	0.6596	0.5541	0.4645
	UniMatch	0.6039	0.6544	0.5850	0.4779
	CSRA-Net	0.6158	0.6653	0.5971	0.4912
1/32	UNet	0.4908	0.5121	0.4749	0.3758
	FixMatch	0.6006	0.6379	0.5909	0.4668
	UniMatch	0.6082	0.6468	0.5859	0.4733
	CSRA-Net	0.6183	0.6574	0.5956	0.4828
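
To read Table 2 at a glance, the absolute MIOU gain of CSRA-Net over FixMatch can be computed per split. The sketch below only restates values copied from the table; the dictionary layout and the helper name `miou_gain` are illustrative assumptions, not part of the paper.

```python
# MIOU values copied from Table 2 (PanNuke), FixMatch vs. CSRA-Net.
MIOU = {
    "1/4": {"FixMatch": 0.5375, "CSRA-Net": 0.5465},
    "1/8": {"FixMatch": 0.5198, "CSRA-Net": 0.5283},
    "1/16": {"FixMatch": 0.4645, "CSRA-Net": 0.4912},
    "1/32": {"FixMatch": 0.4668, "CSRA-Net": 0.4828},
}


def miou_gain(split: str) -> float:
    """Absolute MIOU difference (CSRA-Net minus FixMatch), rounded to 4 places."""
    scores = MIOU[split]
    return round(scores["CSRA-Net"] - scores["FixMatch"], 4)


if __name__ == "__main__":
    for split in MIOU:
        print(f"{split}: +{miou_gain(split):.4f} MIOU")
```

On these numbers the gain is largest at the 1/16 split and smallest at the 1/8 split.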

Table 3. Comparison experiment results on the Conic dataset.

Split	Method	F1-Score	Precision (P)	Recall (R)	MIOU
1/4	UNet	0.5430	0.5893	0.5120	0.4175
	FixMatch	0.5377	0.6714	0.4820	0.4150
	UniMatch	0.5553	0.6298	0.5128	0.4275
	CSRA-Net	0.5712	0.6445	0.5287	0.4392
1/8	UNet	0.4563	0.5827	0.4317	0.3584
	FixMatch	0.4982	0.6541	0.4543	0.3895
	UniMatch	0.5207	0.6254	0.4799	0.4039
	CSRA-Net	0.5389	0.6397	0.4978	0.4166
1/16	UNet	0.4377	0.5776	0.4220	0.3434
	FixMatch	0.5091	0.6646	0.4689	0.3966
	UniMatch	0.5110	0.6578	0.4676	0.3972
	CSRA-Net	0.5298	0.6495	0.4859	0.4039
1/32	UNet	0.3731	0.4404	0.3634	0.2953
	FixMatch	0.4210	0.6812	0.3877	0.3298
	UniMatch	0.4400	0.6060	0.4036	0.3410
	CSRA-Net	0.4562	0.6204	0.4195	0.3544

Table 4. Ablation experiment results of individual modules on the PanNuke dataset.

Split	Method	F1-Score	Precision (P)	Recall (R)	MIOU
1/4	Base	0.6717	0.6915	0.6586	0.5375
	Base+CAS	0.6789	0.6998	0.6649	0.5393
	Base+SWA	0.6805	0.7014	0.6670	0.5404
	Base+RAA	0.6828	0.7038	0.6693	0.5412
1/8	Base	0.6551	0.6871	0.6304	0.5198
	Base+CAS	0.6610	0.6935	0.6369	0.5233
	Base+SWA	0.6632	0.6960	0.6395	0.5241
	Base+RAA	0.6655	0.6984	0.6419	0.5247
1/16	Base	0.5829	0.6596	0.5541	0.4645
	Base+CAS	0.5900	0.6675	0.5610	0.4719
	Base+SWA	0.5932	0.6705	0.5642	0.4752
	Base+RAA	0.5970	0.6740	0.5680	0.4788
1/32	Base	0.6006	0.6379	0.5909	0.4668
	Base+CAS	0.6081	0.6458	0.5982	0.4691
	Base+SWA	0.6105	0.6484	0.6007	0.4713
	Base+RAA	0.6130	0.6511	0.6033	0.4726

Table 5. Results of the ablation experiments of the combined modules on the PanNuke dataset.

Split	Method	F1-Score	Precision (P)	Recall (R)	MIOU
1/4	Base	0.6717	0.6915	0.6586	0.5375
	Base+CAS+SWA	0.6768	0.6957	0.6642	0.5412
	Base+SWA+RAA	0.6805	0.6998	0.6678	0.5438
	Base+CAS+SWA+RAA (CSRA-Net)	0.6854	0.7002	0.6758	0.5465
1/8	Base	0.6551	0.6871	0.6304	0.5198
	Base+CAS+SWA	0.6593	0.6908	0.6350	0.5254
	Base+SWA+RAA	0.6610	0.6925	0.6371	0.5261
	Base+CAS+SWA+RAA (CSRA-Net)	0.6639	0.7028	0.6376	0.5283
1/16	Base	0.5829	0.6596	0.5541	0.4645
	Base+CAS+SWA	0.5902	0.6679	0.5619	0.4797
	Base+SWA+RAA	0.5945	0.6720	0.5661	0.4842
	Base+CAS+SWA+RAA (CSRA-Net)	0.6158	0.6653	0.5971	0.4912
1/32	Base	0.6006	0.6379	0.5909	0.4668
	Base+CAS+SWA	0.6084	0.6460	0.5982	0.4754
	Base+SWA+RAA	0.6121	0.6502	0.6018	0.4787
	Base+CAS+SWA+RAA (CSRA-Net)	0.6183	0.6574	0.5956	0.4828

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
