Review

A Review of Non-Fully Supervised Deep Learning for Medical Image Segmentation

1 College of Artificial Intelligence, Taiyuan University of Technology, Jinzhong 036000, China
2 School of Software, Taiyuan University of Technology, Jinzhong 036000, China
3 Department of Computer Science, KU Leuven, 3001 Leuven, Belgium
4 Department of Mechanical Engineering, KU Leuven, 3001 Leuven, Belgium
* Authors to whom correspondence should be addressed.
Information 2025, 16(6), 433; https://doi.org/10.3390/info16060433
Submission received: 6 April 2025 / Revised: 14 May 2025 / Accepted: 22 May 2025 / Published: 24 May 2025
(This article belongs to the Section Biomedical Information and Health)

Abstract

Medical image segmentation, a critical task in medical image analysis, aims to precisely delineate regions of interest (ROIs) such as organs, lesions, and cells, and is crucial for applications including computer-aided diagnosis, surgical planning, radiation therapy, and pathological analysis. While fully supervised deep learning methods have demonstrated remarkable performance in this domain, their reliance on large-scale, pixel-level annotated datasets—a significant label scarcity challenge—severely hinders their widespread deployment in clinical settings. Addressing this limitation, this review focuses on non-fully supervised learning paradigms, systematically investigating the application of semi-supervised, weakly supervised, and unsupervised learning techniques for medical image segmentation. We delve into the theoretical foundations, core advantages, typical application scenarios, and representative algorithmic implementations associated with each paradigm. Furthermore, this paper compiles and critically reviews commonly utilized benchmark datasets within the field. Finally, we discuss future research directions and challenges, offering insights for advancing the field and reducing dependence on extensive annotation.

1. Introduction

Medical image segmentation, regarded as a critical component within the domain of medical image analysis, is characterized by the application of specific algorithms and techniques to facilitate the accurate partitioning of target organs, tissues, or lesions from the background in medical images such as computed tomography (CT) and magnetic resonance imaging (MRI); semantic labels are assigned to each segmented region. Initial investigations were primarily based on traditional methods that utilized low-level image features and prior knowledge, encompassing pixel intensity thresholding, region growing, edge detection, and active contour models. Although effectiveness was observed in simple images, challenges were encountered regarding real-world clinical complexities such as ambiguous boundaries, noise, and anatomical variations.
In recent years, significant advancements in medical image segmentation have been attributed to deep learning. Convolutional neural networks (CNNs) [1], with capabilities for local feature extraction through convolutional operations, have been shown to automatically learn highly discriminative feature representations from large-scale medical imaging datasets, thereby enhancing segmentation performance. The introduction of fully convolutional networks (FCNs) [2] established a foundational paradigm for semantic segmentation by substituting fully connected layers in traditional CNNs with convolutional layers, which enabled end-to-end pixel-level prediction. The U-Net architecture [3] further propelled the domain by employing an encoder–decoder structure with skip connections that effectively fused multi-scale features. The integration of Transformers [4], utilizing potent global context modeling capabilities to address the limitations of CNNs in capturing long-range dependencies, has emerged as a prominent research trajectory. Recently, the Segment Anything Model (SAM) [5] has garnered considerable attention within the broader image segmentation community, with its robust zero-shot transfer capabilities presenting novel opportunities for medical image segmentation under the fully supervised paradigm.
However, despite the demonstrated efficacy of fully supervised deep learning methods, limitations arise due to their significant dependence on extensive, pixel-wise, accurately annotated datasets, which constrains clinical applicability. The annotation process of medical images is characterized by high costs, time-intensive requirements, and a propensity for errors, which are further complicated by inter-observer variability, data scarcity, and patient privacy issues. To address these challenges, the exploration of non-fully supervised medical image segmentation approaches has been undertaken by researchers. Non-fully supervised learning, in contrast to fully supervised learning, which mandates that every sample in the training set be accompanied by precise and complete labels as supervisory signals, is a machine learning methodology that trains models by employing strategies such as utilizing partial labels, weak-form labels (e.g., inexact, incomplete, or indirect supervisory signals), or by operating entirely independently of direct task-relevant labels. This approach aims to significantly alleviate or eliminate the reliance on large-scale, high-quality, and exhaustively annotated data. Specifically within the domain of medical image segmentation, these non-fully supervised methods are designed to leverage limited, incomplete, or coarse-grained annotation information to train models, thereby reducing the demand for pixel-level ground truth, enhancing annotation efficiency, and facilitating the deployment of medical image segmentation in practical clinical settings. Currently, weakly supervised learning, semi-supervised learning, and unsupervised learning are identified as the principal research directions within this domain, with each approach aiming to develop effective segmentation models from various types of incomplete annotation data.
Early approaches to non-fully supervised medical image segmentation were predominantly based on traditional image processing techniques, which were augmented with limited annotation [6,7]. With the advent of deep learning, various non-fully supervised learning methodologies have progressively achieved prominence. Among these, weakly supervised learning methods were initially recognized for leveraging more readily obtainable annotation forms, such as image-level labels [8], bounding boxes [9], scribbles [10], and point annotations [11]. These methods typically incorporated strategies like Class Activation Maps (CAMs) [8], iterative mining, and adversarial learning [12] to optimize segmentation accuracy while minimizing the annotation burden. To further exploit the abundance of unlabeled data, semi-supervised learning methods were introduced into the medical image segmentation domain, resulting in significant enhancements in model performance by utilizing a combination of a limited number of labeled samples and a substantial volume of unlabeled samples. In scenarios marked by a scarcity of annotation, unsupervised learning methods have also been developed, particularly in the realms of unsupervised anomaly segmentation and unsupervised domain adaptation. Additionally, techniques such as transfer learning [13], self-supervised learning [14], multi-modal fusion, and the application of prior knowledge [15] are often integrated with these non-fully supervised approaches to further improve segmentation performance and robustness. To facilitate a clearer and more in-depth analysis of the distinctions and characteristics among the aforementioned methodologies, Table 1 systematically compares and summarizes medical image segmentation approaches based on semi-supervised, weakly supervised, and unsupervised learning, leveraging five key dimensions: supervision source, annotation cost, core mechanism, performance potential, and application scenarios.
This review aims to provide a comprehensive overview and discussion of the recent advancements in deep learning-based medical image segmentation under the non-fully supervised paradigm (as illustrated in Figure 1). Specifically, Section 1 introduces the background of the medical image segmentation field. Section 2 compiles commonly utilized datasets for medical image segmentation. Section 3, Section 4 and Section 5 delve into non-fully supervised medical image segmentation methods, systematically presenting semi-supervised (Section 3), weakly supervised (Section 4), and unsupervised (Section 5) approaches, categorized based on the type and quantity of annotation information employed during model training. Section 6 provides an in-depth comparative evaluation of the aforementioned non-fully supervised methods, scrutinizing not only their potential and value in clinical applications but also dissecting their respective inherent limitations, practical challenges encountered, and the requisite trade-off considerations for method selection. Section 7 discusses the non-fully supervised medical image segmentation methodologies reviewed herein, addressing their clinical applications and outlining future research directions. Finally, Section 8 concludes the paper.

2. Datasets

This section presents a systematic overview of benchmark datasets that are widely utilized in the field of medical image segmentation. These datasets provide essential data support for the training and performance evaluation of segmentation algorithms while serving as critical resources that drive technological advancements and methodological innovation within this domain. Data dimensionality has been used for categorization into two major types: 2D (pixel-based) images and 3D (voxel-based) volumes, with specific characteristics and representative examples detailed in Section 2.1 and Section 2.2, respectively. To enhance readability and provide a structured overview, Table 2 and Table 3 summarize the core datasets discussed in this study from 2D and 3D perspectives, respectively. These tables not only delineate the task-specific characteristics of each dataset—including imaging modalities, primary anatomical regions covered (e.g., colon, breast, skin), and typical application scenarios—but also elucidate their alignment with distinct supervision categories, thereby revealing the interplay between dataset properties and training methodologies in medical AI research.

2.1. Two-Dimensional Image Datasets

ACDC dataset [16]: The Automated Cardiac Diagnosis Challenge (ACDC) dataset, released as part of the MICCAI 2017 challenge and provided by Pierre-Marc Jodoin, Alain Lalande, and Olivier Bernard, comprises multi-slice 2D cardiac cine-magnetic resonance imaging (cine-MRI) samples from 100 patients. It can be utilized for the segmentation of the left ventricle (LV), right ventricle (RV), and myocardium (MYO) at the end-diastole (ED) and end-systole (ES) phases. For semi-supervised learning applications, it is partitioned into a training set (70 scans), a validation set (10 scans), and a test set (20 scans). The ACDC dataset serves as a foundation for the development of accurate cardiac image segmentation methods, which are necessary for the assessment of cardiac function and provide critical information for the diagnosis and treatment of cardiac diseases.
Colorectal Adenocarcinoma Gland (CRAG) dataset [17]: The CRAG dataset is centered on the task of gland segmentation in colorectal adenocarcinoma histopathology images, with the objective of facilitating the development of pertinent medical image segmentation algorithms. Comprising 213 H&E-stained colorectal adenocarcinoma tissue section images from various centers and equipment, 173 images are allocated for training while 40 images are reserved for testing. Complete instance-level gland annotations are provided by the dataset, including precise segmentation masks. CRAG represents a significant benchmark resource for the medical image processing field, particularly for investigations pertaining to colorectal cancer gland segmentation.
IU Chest X-ray dataset [18]: A collaborative effort between Indiana University and the Open-i laboratory has resulted in the release of this dataset, which is widely utilized as a benchmark in medical image analysis, particularly for tasks associated with lung disease detection and classification. The dataset consists of 3700 high-resolution chest X-ray images accompanied by annotation information that encompasses 14 frequently encountered pulmonary conditions as well as labels representing normal status. Binary labels for disease presence have been provided by radiologists through annotations, with certain samples including information regarding lesion locations. A standard split has been established, comprising a training set of 2590 images, a validation set of 370 images, and a test set of 740 images.
MIMIC-CXR dataset [19]: MIMIC-CXR, a large-scale open-source medical dataset released jointly by MIT and Beth Israel Deaconess Medical Center, is focused on the intersection of chest X-ray image analysis and natural language processing research. A total of 473,057 chest X-ray images are encompassed, with each image associated with structured labels representing 14 common diseases and corresponding free-text radiology reports. The standard partition is comprised of a training set containing 378,447 images, a validation set containing 47,305 images, and a test set containing 47,305 images.
COV-CTR dataset [20]: COV-CTR is characterized as a high-quality open-source dataset designed for COVID-19 lung CT image analysis. This initiative is administered by the Institute of Automation, Chinese Academy of Sciences, alongside Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, with collaboration from over 20 institutions, including Tsinghua University and Fudan University. The dataset comprises CT image data sourced from 1110 confirmed COVID-19 patients, encompassing both non-contrast and contrast-enhanced CT types. Annotations indicating five typical lesion types have been cross-validated by a minimum of three radiology experts.
MS-CXR-T dataset [21]: MS-CXR-T is a multi-center chest X-ray dataset that focuses on tuberculosis (TB) and is designed to address the challenges related to heterogeneity in TB imaging and the generalizability of diagnostic models. Over 5000 chest X-ray images are included, featuring a case distribution of approximately 3500 TB-positive and 1500 TB-negative cases. The dataset is employable for training TB screening models that are adaptable to multi-center data, while also facilitating the study of differences in the radiological presentation of TB lesions across various regions.
NIH-AAPM-Mayo Clinical LDCT dataset [22]: A collaboration between the National Institutes of Health (NIH), the American Association of Physicists in Medicine (AAPM), and the Mayo Clinic has facilitated the provision of this dataset, which is intended primarily for the development of lung cancer screening algorithms and research in low-dose CT (LDCT) image reconstruction. LDCT scans from over 1000 high-risk individuals have been included. Lung nodules ≥3 mm were annotated by three radiologists. Potential applications for the development of deep learning-based nodule detection systems and investigations of high-resolution reconstruction algorithms under low-dose conditions have been identified.
LoDoPaB dataset [23]: The Low-Dose Parallel Beam (LoDoPaB) dataset is recognized internationally as a benchmark for low-dose computed tomography (CT) projection data, specifically created for research pertaining to low-dose CT image reconstruction algorithms. While being primarily utilized for reconstruction-focused studies, reference value may also be provided for segmentation research. The dataset comprises low-dose CT projection data from over 35,000 clinical patients, encompassing multiple anatomical regions, including the chest, abdomen, and head. Applications for training deep learning models for direct image reconstruction from projection data and for investigating personalized radiation dose allocation strategies are enabled.
LDCT datasets [24]: Low-dose CT medical imaging datasets have been focused on the optimization of image quality while minimizing radiation exposure, with their core value lying in the equilibrium of diagnostic efficacy and patient safety. Particular application value has been noted in the field of medical image segmentation: By providing standardized imaging data under low-dose conditions, support for traditional image reconstruction research has been achieved (including the determination of minimum effective dose, development of reconstruction algorithms for low signal-to-noise ratio images, and quantum noise suppression), along with the provision of training and validation benchmarks for precise segmentation of key anatomical structures (such as pulmonary nodules and liver lesions) in low-radiation environments. Typical datasets generally include scan data from essential anatomical regions such as the chest, abdomen, and heart, enabling the dual functionality of supporting image quality optimization research and contributing to the development and evaluation of downstream intelligent analysis tasks.

2.2. Three-Dimensional Image Datasets

LA dataset [25]: The Left Atrium (LA) benchmark dataset has been established from late gadolinium-enhanced MRI (LGE-MRI) scans derived from patients diagnosed with atrial fibrillation, aimed at facilitating precise segmentation of the left atrium and its associated scar tissue. Data have been obtained from institutions such as the University of Utah, Beth Israel Deaconess Medical Center, and King’s College London, where various scanning equipment and resolutions have been employed to enhance model generalizability across diverse clinical contexts. The dataset consists of 100 3D LGE-MRI scans, frequently utilized for training purposes, with image resolutions standardized to 0.625 × 0.625 × 0.625 mm³. The LA dataset is recognized as a prominent benchmark within the field of semi-supervised medical image segmentation.
Pancreas-CT dataset [26]: Provided by the National Institutes of Health, this dataset facilitates research on pancreas segmentation in CT scans. It contains 80 (originally 82, with 2 removed due to duplication) abdominal contrast-enhanced 3D CT scans from 53 male and 27 female subjects (age range 18–76 years, mean age 46.8 ± 16.7 years). The CT scans have a resolution of 512 × 512 pixels, with variations in pixel size and slice thickness; the slice thickness ranges between 1.5 and 2.5 mm. The data were acquired using Philips and Siemens MDCT scanners (120 kVp tube voltage). Slice-by-slice segmentation of the pancreas was performed manually by a medical student and subsequently verified or revised by senior radiologists. This dataset is extensively utilized for research and algorithm development in pancreas segmentation tasks.
BraTS dataset [27]: The Multimodal Brain Tumor Segmentation (BraTS) dataset is recognized as a public resource specifically designed for brain tumor segmentation, utilizing multi-modal medical images. The data have been sourced from multiple hospitals and encompass four MRI modalities: T1-weighted (T1), gadolinium-enhanced T1-weighted (T1Gd), T2-weighted (T2), and T2 fluid-attenuated inversion recovery (T2-FLAIR). All scans have undergone rigorous review by neuroradiologists and are accompanied by expert annotations. For semi-supervised learning scenarios, it is commonly partitioned into a training set (250 scans), a validation set (25 scans), and a test set (60 scans). The dataset aims to foster the development and evaluation of automated brain tumor segmentation algorithms.
ATLAS dataset [28]: The Anatomical Tracings of Lesions After Stroke (ATLAS) dataset has been established to aggregate MRI brain scans from multiple centers across the globe, allowing for the evaluation of automated stroke lesion segmentation methods. This dataset primarily serves the domain of stroke rehabilitation research. A total of 1271 images has been compiled, with 955 being publicly accessible. This number includes 655 training images and 300 test images, which contain hidden annotations, while an additional 316 images constitute an independent generalization test set. The majority of images were acquired using 3T MRI scanners from Siemens Healthineers (Erlangen, Germany), Philips Healthcare (Eindhoven, Netherlands), and GE Healthcare (Chicago, IL, USA), achieving a resolution of 1 mm. A limited number of cases were scanned at 1.5 T, with resolutions ranging from 1 to 2 mm. The ATLAS dataset is recognized as a significant resource for the advancement of automated segmentation techniques pertaining to subacute and chronic stroke lesions.
ISLES dataset [29,30,31]: The Ischemic Stroke Lesion Segmentation (ISLES) dataset is focused on the segmentation of ischemic stroke lesions, with the objective of enabling automatic delineation of acute to subacute ischemic stroke lesions through multi-modal imaging. The ISLES dataset can be utilized for the training and validation of relevant segmentation algorithms. Taking ISLES22 as a representative example, the dataset comprises 400 MRI cases acquired from multiple centers and devices, categorized into a public training set of 250 cases and a test set of 150 cases, with non-public annotations for online evaluation. This dataset is instrumental in the development of more accurate and reliable algorithms, which are expected to enhance diagnosis and treatment planning for patients who have experienced a stroke.
AISD dataset [32]: The Acute Ischemic Stroke Dataset (AISD) has been established as a comprehensive resource that integrates clinical information, imaging data, and follow-up data from patients diagnosed with acute ischemic stroke, with the intent to furnish high-quality data suitable for scientific research. A total of 397 non-contrast CT scans, obtained within 24 hours post-stroke onset, have been included in the dataset, of which 345 have been designated for model training and validation, while 52 scans have been reserved for testing purposes. The primary objective of the AISD is the promotion of research, development, and clinical translation of CT-based acute stroke segmentation techniques.
Cardiac (M&Ms) dataset [33]: This dataset—the Multi-Center, Multi-Vendor & Multi-Disease Cardiac Segmentation dataset—has been designed to segment the left atrium, along with other designated cardiac structures, from single-modality MR images, with the application context clearly delineated in the original reference. All imaging data are normalized to the [0, 1] range, encompassing 30 clinical cases that have been officially categorized into a training set of 20 cases and a test set of 10 cases. Segmentation results for the test set may be submitted for evaluation through the official platform.
KiTS19 dataset [34]: The Kidney Tumor Segmentation Challenge 2019 (KiTS19) dataset is centered on the segmentation of kidneys and kidney tumors, with the objective of advancing medical image segmentation algorithms. CT images and associated semantic segmentation labels are included for 300 cases originating from multiple centers and devices, wherein 210 cases are designated for training purposes and 90 for testing. Fine-grained segmentation labels for both the kidney and tumor are provided, alongside clinical attributes for select cases, reflecting significant diversity and complexity, and establishing an essential research resource in the domain of kidney tumor segmentation.
UKB dataset [35]: The UK Biobank (UKB) has been established as a large-scale, multidimensional biomedical database and research platform, focusing on the investigation of complex relationships between genetic factors, lifestyle, and health status, thereby facilitating the advancement of understanding regarding various chronic diseases, including cancer, heart disease, diabetes, and mental disorders. This database encompasses genetic information, biological samples, imaging data such as brain and cardiac MRI, and detailed health records, which provide a crucial foundation for large-scale, interdisciplinary health research. Additionally, the collection of semi-structured diagnostic reports, when paired with corresponding imaging data, has been recognized as a unique resource that supports the development of weakly supervised segmentation methods. These capabilities are manifested in the construction of universal medical image feature extractors, enabling the quantitative analysis of anatomical structure–disease risk correlations and supporting knowledge distillation training based on clinical reports.
LiTS dataset [36]: The Liver Tumor Segmentation Challenge (LiTS) dataset concentrates on the segmentation of the liver and liver tumors in CT images, aiming to advance relevant segmentation algorithms and promote research into automated diagnostic systems. It collects CT scan data from seven different medical centers, containing 131 training datasets and 70 test datasets. LiTS finds wide application value in medical image segmentation, computer-aided diagnosis, development of medical image analysis tools, and academic research and education, supporting researchers in developing and validating new segmentation algorithms.
CHAOS dataset [37]: The Combined (CT-MR) Healthy Abdominal Organ Segmentation (CHAOS) dataset focuses on abdominal organ segmentation, providing paired multi-modal (CT and MR) data with annotations to foster the development of abdominal organ segmentation algorithms. CHAOS includes 40 paired CT and MR images, divided into 20 training sets and 20 test sets. The training set provides annotations for four abdominal organs (liver, left kidney, right kidney, spleen), where CT images are annotated only for the liver, while MR images have annotations for all four organs. Test set annotations are not publicly available. This dataset supports the development and validation of multi-modal learning and segmentation algorithms.

3. Semi-Supervised Medical Image Segmentation Methods

Given the abundance of unlabeled data in clinical practice, semi-supervised learning (SSL) has been identified as a prominent and highly promising research direction. In SSL, labeled data are effectively integrated with large amounts of unlabeled data for model training, thereby reducing the dependency on large-scale annotated datasets.
A substantial body of research has emerged within the domain of semi-supervised segmentation, with the majority of methods concentrating on two predominant paradigms: pseudo-labeling [38,39,40] and consistency regularization [41]. In pseudo-labeling approaches, the model’s predictions on unlabeled data are leveraged, and pseudo-labels are generated through the application of thresholding or other selection mechanisms. Subsequently, these pseudo-labeled samples are combined with manually annotated data for further model training and refinement. Conversely, the fundamental principle of consistency regularization posits that for an unlabeled sample the model’s predictions should remain consistent across its different perturbed versions. The objective is to minimize the discrepancy between the model’s predictions on various perturbed versions of unlabeled data. High-quality consistency targets established during training are deemed crucial for the achievement of optimal performance.
In recent years, a variety of strategies have been explored for training semi-supervised medical image segmentation models. Among these, the Mean Teacher (MT) model [42] is recognized as a classic method for implementing consistency regularization. Unlabeled data are effectively utilized through the application of weak and strong augmentations, with consistency enforced between the model’s predictions on these different augmented versions. As illustrated in Figure 2, a dual-model architecture is employed by the MT model, comprising a student model (represented by orange weights) and a teacher model (represented by blue weights), which typically share an identical network structure (e.g., U-Net or V-Net). During the processing of a labeled training sample, inference is performed on the input by both models, incorporating stochastic perturbations (denoted as η and η′, respectively). Optimization of the student model’s weights (θ) is achieved via backpropagation using a composite loss function. This function includes a classification loss (classification cost), which quantifies the discrepancy between the student’s prediction (orange probability distribution) and the ground truth label, and a consistency loss (consistency cost), which measures the divergence between the predictions of the student and teacher models (blue probability distribution). Following the gradient descent update of the student weights, the teacher model’s weights (θ′) are updated not through gradient computation but rather as an exponential moving average (EMA) of the student weights (θ′ ← EMA(θ)), as depicted. Stability is imparted to the teacher model through this EMA update mechanism, allowing for the progressive aggregation of knowledge acquired by the student throughout training. For unlabeled data, optimization relies solely on the consistency loss, thereby encouraging the alignment of the student model’s outputs with the more stable and reliable predictions generated by the teacher model. As a result, “pseudo-label” supervision is implicitly furnished by the teacher model, guiding the student model to learn latent structural information and characteristics of data distributions from abundant unlabeled data, ultimately enhancing generalization capability and segmentation accuracy.
Recently, consistency regularization methods supervised by pseudo-labels have achieved significant success in semi-supervised segmentation [43,44]. Concurrently, approaches combining contrastive learning strategies with consistency regularization methods are continually emerging [45,46]. Based on these evolving trends, this paper categorizes and reviews semi-supervised medical image segmentation methods into the following three classes: consistency regularization methods, consistency regularization methods incorporating pseudo-labeling, and methods combining contrastive learning with consistency regularization.

3.1. Consistency Regularization-Based Segmentation Methods

This subsection is dedicated to semi-supervised segmentation methods that apply the principle of consistency regularization without reliance on explicit pseudo-labels, with the aim of enhancing model performance through the effective utilization of unlabeled data.
The Ambiguity Consensus Mean Teacher (AC-MT) model [47] represents an enhancement of the fundamental Mean Teacher (MT) model. The architecture of student–teacher and the exponential moving average (EMA) weight update mechanism are inherited from the MT framework; however, AC-MT incorporates an ambiguity identification module. This module assesses the prediction “ambiguity” (i.e., uncertainty) for each pixel in the unlabeled data using strategies such as calculating entropy, model uncertainty, or employing prototype/class conditioning for noisy label identification. Unlike MT, which computes consistency loss across all pixels, AC-MT calculates and imposes consistency loss only on pixels identified as having high ambiguity. This forces the student model to achieve consensus with the teacher model, specifically in these challenging yet informative regions. This selective consistency learning strategy enables AC-MT to extract critical information from unlabeled data more precisely, thereby further improving segmentation performance.
During training, the AC-MT model utilizes labeled data for supervised learning of the student model (calculating standard segmentation loss). Unlabeled data first pass through the ambiguity identification module to filter high-ambiguity pixels. These data are then fed into both the student and teacher models, with consistency loss calculated and used to update the student model only on these ambiguous pixels. Finally, the teacher model is updated via the EMA of the student model’s weights. This process iterates, achieving more efficient semi-supervised learning by selectively focusing on ambiguous regions within the unlabeled data. Compared to the baseline Mean Teacher model and other state-of-the-art semi-supervised learning methods, AC-MT demonstrates more effective utilization of unlabeled data, particularly with limited labeled data (e.g., 10% or 20%). It achieves significant improvements in key segmentation accuracy metrics such as the Dice similarity coefficient (DSC) and Jaccard index, and maintains robust performance even in scenarios with extremely scarce labeled data (e.g., 2.5%).
AAU-Net [48] similarly represents an enhancement to the standard MT model, yet its distinguishing characteristic lies in the utilization of anatomical prior knowledge to address the challenge of unreliable predictions on unlabeled data within the MT framework. Whereas AC-MT concentrates on the “ambiguity” associated with predictions, AAU-Net places greater emphasis on quantifying the deviation of predictions from expected anatomical structures.
AAU-Net introduces a pre-trained Denoising Autoencoder (DAE) to capture anatomical prior knowledge. This DAE maps any predicted segmentation mask $P_t$ to a more anatomically plausible segmentation $\hat{P}_t$, denoted as
$$\hat{P}_t = \mathrm{DAE}(P_t)$$
By utilizing the DAE module, AAU-Net estimates uncertainty based on the discrepancy between the teacher model’s prediction $P_t$ and its “anatomically corrected” version $\hat{P}_t$, rather than directly using the difference between the raw predictions of the student and teacher models. This anatomy-aware uncertainty estimation mechanism constitutes the core component of AAU-Net, extending the representation established above. The anatomical prior knowledge provided by the DAE is fully exploited to evaluate the reliability of the teacher model’s predictions $P_t$. An uncertainty map $U$ is constructed through the calculation of the pixel-wise difference between $P_t$ and $\hat{P}_t$:
$$U = \left\| \hat{P}_t - P_t \right\|^2$$
The map U has been shown to reflect the inconsistency between the predictions generated by the model and the anatomical priors. A larger difference has been associated with a greater deviation of the prediction at that pixel from the anatomical structure, thereby implying an increase in uncertainty. This uncertainty estimation method has been effectively integrated into the semi-supervised learning process, allowing for the more accurate identification of potentially erroneous regions within the predictions. Consequently, unlabeled data can be utilized more effectively for training purposes.
In the task of abdominal CT multi-organ segmentation, improvements in segmentation performance were observed when AAU-Net was applied, in comparison to existing state-of-the-art baseline methods such as Uncertainty-Aware Mean Teacher (UAMT) [49] and Uncertainty Rectified Pyramid Consistency (URPC) [50]. With a 20% labeling ratio, the average DSC improved by 1.95% and the HD metric by 1 mm. These results demonstrate the method’s capability to achieve accurate segmentation with limited labeled data, making it suitable for medical image analysis scenarios involving complex anatomical structures or challenging annotation tasks.
Whereas the former two approaches address the unreliable predictions for unlabeled data by focusing on ambiguity and incorporating anatomical priors, respectively, CMMT-Net [51] instead enhances intrinsic model diversity. This is achieved through the construction of a cross-head mutual mean-teaching architecture, aimed at improving the MT model for more robust utilization of unlabeled data. As illustrated in Figure 3, CMMT-Net is fundamentally characterized by a shared encoder and a dual-decoder cross-head design, integrated with a teacher model updated via exponential moving average. Specifically, both the student network (upper part) and the teacher network (lower part) comprise a shared encoder and two distinct decoders. This dual-decoder configuration introduces feature-level diversity. The weights of the teacher network are an EMA of the corresponding student network weights, ensuring the stability of teacher predictions. During the training procedure, labeled data undergo strong augmentation (CutMix, weak augmentation + adversarial noise) before being fed into the student network. The outputs $p_{s1}$ and $p_{s2}$ from the two decoders are used to compute the supervised loss $l_{sup}$ against the ground truth label $y_l$. Unlabeled data are concurrently utilized for cross-head self-training (loss $l_{ss}$) and mutual mean-teaching (loss $l_{stf}$). Specifically, unlabeled data, subjected to weak augmentation, are input to the teacher network to generate reliable predictions. Simultaneously, the same unlabeled data, subjected to strong augmentation, are input to the student network. Subsequently, the outputs from each teacher decoder are employed to supervise the outputs of both student decoders (e.g., via losses $l_{st1}^{12}$, $l_{st2}^{12}$, $l_{st1}^{21}$, $l_{st2}^{21}$), enforcing prediction consistency under different perturbations and across different decoding paths, thereby effectively leveraging unlabeled data to enhance segmentation performance.
Furthermore, CMMT-Net incorporates a multi-level perturbation strategy. At the data level, Mutual Virtual Adversarial Training (MVAT) is employed to introduce pixel-level adversarial noise, alongside the Cross-set CutMix technique, which generates novel training samples by blending regions between disparate images. These robust augmentation approaches contrast with the weak augmentation applied to the inputs of the teacher network, thereby enhancing the diversity and difficulty presented within the training data. At the network level, the teacher–student structure is inherently constituted as a form of perturbation, with the EMA-updated teacher network furnishing stable supervisory signals. Enabled by these meticulously designed mechanisms, CMMT-Net effectively leverages both limited labeled data and substantial volumes of unlabeled data, achieving superior performance in medical image segmentation tasks. Experimental results indicate that substantial performance gains in semi-supervised segmentation were yielded by the proposed CMMT-Net method on the public LA, Pancreas-CT, and ACDC datasets. Specifically, compared to previous state-of-the-art (SOTA) methods, MC-Net+ [52] and BCP [53], improvements in the Dice score of 1.79%, 12.73%, and 1.83% were observed on these respective datasets.

3.2. Consistency Regularization Segmentation Methods Supervised by Pseudo-Labels

Complementary to the consistency regularization approaches previously reviewed for semi-supervised medical image segmentation, pseudo-labeling has been classified as another pivotal strategy, attracting significant research attention. Hybrid methodologies integrating pseudo-labeling with consistency regularization have emerged as a particularly active research direction. This convergence is aimed at mitigating the principal limitations inherent in pseudo-labeling, specifically the potential unreliability of the initially generated labels alongside the critical challenge of selecting and utilizing high-confidence pseudo-labels effectively during the training process.
To address the issue of pseudo-label unreliability, a novel Mutual Learning with Reliable Pseudo-Label (MLRPL) framework was proposed by Su et al. [54]. The innovation of this method lies in its construction and co-training of two sub-networks with slightly different architectures. Through meticulously designed reliability assessment strategies, it filters and utilizes high-quality pseudo-labels for model optimization.
Specifically, the framework initially employs two sub-networks (sharing an encoder but with independent decoders) to make independent predictions on the same input image, generating respective preliminary pseudo-labels. Subsequently, a dual reliability assessment mechanism is introduced: first, a “mutual comparison” strategy is adopted, comparing the prediction confidences of the two sub-networks pixel by pixel and selecting the one with higher confidence as a more reliable pseudo-label candidate; second, an “intra-class consistency” metric is proposed, further evaluating pseudo-label reliability by calculating the similarity between a pixel’s feature and its predicted class prototype, quantifying this reliability as a weighting coefficient. The loss function design ingeniously integrates both assessment results: the mutual comparison outcome determines whether knowledge transfer occurs between the sub-networks (i.e., using one sub-network’s pseudo-label to guide the other only when its prediction is significantly superior), while the intra-class consistency metric serves as a weight to finely adjust the cross-entropy loss, assigning greater influence to highly reliable pseudo-labels. This dual-guarantee mechanism effectively suppresses noise introduced by unreliable pseudo-labels, significantly enhancing the performance and robustness of semi-supervised medical image segmentation and offering a highly promising solution to address the annotation bottleneck in medical image analysis.
On the Pancreas-CT dataset, using 10% labeled data, MLRPL achieved a Dice coefficient improvement of up to 21.99% compared to the baseline model (V-Net) [55]. Furthermore, MLRPL demonstrated significant advantages over existing semi-supervised methods; for instance, when trained with 10% labeled data on the Pancreas-CT dataset, the method achieved a 2.40% improvement in Dice coefficient compared to URPC [50]. Under certain experimental settings, MLRPL’s performance was even comparable to fully supervised models trained with the complete labeled dataset.
Building upon prior work demonstrating the potential of reliability assessment mechanisms for enhancing pseudo-label quality, such as that by Su et al. [54], the Cooperative Rectification Learning Network (CRLN) proposed by Wang et al. (2025) [56] further investigates the generation of more accurate pseudo-labels through prototype learning and explicit pseudo-label rectification, specifically targeting semi-supervised 3D medical image segmentation tasks.
The CRLN method operates on inputs comprising a small set of 3D medical images with voxel-level annotation (labeled data) and a large volume of unlabeled 3D medical images. To enhance model generalization and leverage consistency regularization, unlabeled data undergo two distinct augmentation processes: weak augmentation (e.g., random cropping, flipping) and strong augmentation (e.g., applying Gaussian noise, CutMix in addition to weak augmentations). Within the MT framework, CRLN feeds strongly augmented unlabeled data to the student model for prediction, while weakly augmented unlabeled data are input to the teacher model to generate pseudo-labels. Labeled data are utilized for supervised learning and subsequent prototype learning. Both student and teacher models share an identical backbone network architecture, typically an encoder–decoder structure like V-Net or 3D-UNet.
The core innovation of CRLN lies in its proposed prototype learning and pseudo-label rectification mechanism, designed to leverage prior knowledge learned from labeled data to improve pseudo-label quality. This process consists of two main stages: a learning stage and a rectification stage. Specifically, during the learning stage, the model learns multiple prototypes for each class to capture intra-class variations. Through a Dynamic Interaction Module (DIM), these prototypes interact with feature maps extracted from intermediate layers of the student model’s decoder, specifically using labeled data features. The DIM employs a pair-wise cross-attention mechanism to compute similarities between prototypes and feature maps, subsequently updating the prototype representations. Following this interaction, an aggregation operation incorporating spatial awareness and cross-class reasoning (implemented via shared-parameter convolutional layers) generates a holistic relationship map, $M(x)$, which encodes the association degree between each voxel and all class prototypes. The learning of prototypes and the student’s DIM component is accomplished implicitly by minimizing the segmentation loss on the labeled data, potentially including a term that optimizes predictions on labeled data using $M(x)$.
In the pseudo-label rectification stage, the model leverages the class prototypes learned from labeled data during the learning stage and the EMA-updated teacher DIM to handle unlabeled data. First, the teacher model generates original pseudo-labels $\bar{y}$ from the weakly augmented unlabeled data. Simultaneously, these unlabeled data (or their weakly augmented version) are input to the teacher DIM to generate the corresponding relationship map $M(x)$. Then, voxel-wise refinement is performed on the original pseudo-labels using the rectification formula
$$\hat{y}_r = \bar{y} + (1 - \mu) \times M(x)$$
where $\hat{y}_r$ denotes the rectified pseudo-labels, and $\mu$ is a learnable parameter that adaptively controls the intensity of the rectification guided by $M(x)$. The rectified pseudo-labels $\hat{y}_r$ are regarded as more reliable supervisory signals to supervise the student model’s training on strongly augmented unlabeled data.
Experimental results demonstrate that the CRLN method yielded substantial performance improvements on the LA, Pancreas-CT, and BraTS19 datasets. On the Pancreas-CT dataset specifically, compared to the MC-Net+ [52] baseline, CRLN improved the Dice score by 11.8% and 4.57% when utilizing 10% and 20% labeled data, respectively. These results highlight the enhanced accuracy and robustness of CRLN for semi-supervised medical image segmentation.

3.3. Segmentation Methods Combining Contrastive Learning and Consistency Regularization

To further enhance the utilization efficiency of unlabeled data, a pivotal research direction involves the synergistic integration of contrastive learning (CL) and consistency regularization (CR). Contrastive learning improves feature discrimination by comparing similar/dissimilar regions, while consistency regularization ensures stable predictions under perturbations. Their integration enables the imposition of constraints concomitantly within both the feature embedding space and the model prediction space. This dual-constraint paradigm aims to improve model generalization and performance.
The CRCFP proposed by Bashir et al. (2024) [45] is built upon the DeepLab-v3 architecture and incorporates multiple techniques to enhance performance.
DeepLab-v3 is a classic semantic segmentation network, employing an encoder–decoder structure with atrous convolutions. The encoder, typically a pre-trained ResNet network, extracts image features; the Atrous Spatial Pyramid Pooling (ASPP) module captures multi-scale contextual information using atrous convolutions with different dilation rates; the decoder progressively recovers spatial resolution and fuses multi-level features to generate pixel-level class predictions.
As illustrated in Figure 4, the core of the CRCFP model consists of a shared encoder (h) and decoder (g). For labeled data, the model utilizes a supervised branch (blue path), where the input is processed through the shared backbone network and subsequently fed into the main pixel-wise classifier ($C_f$) to obtain predictions. The cross-entropy loss $L_{sup}$ is then computed. For unlabeled data $x_u$, the model incorporates two unsupervised pathways:
  • Context-aware consistency path (green path): Two overlapping patches, $x_{u1}$ and $x_{u2}$, cropped from the unlabeled image are passed through the shared backbone network. Their resulting features are mapped through a projection head (Φ) to obtain embeddings $\phi_{u1}$ and $\phi_{u2}$. A contrastive loss, $L_{cont}$, is employed to enforce feature consistency under differing contextual views.
  • Cross-consistency training path (brown path): Features extracted from the complete unlabeled image $x_u$ are fed into the main classifier $C_f$ to yield prediction $\hat{y}_u$. Concurrently, these features, subjected to perturbation (P), are input to multiple auxiliary classifiers, producing predictions $\hat{y}_u^k$. A cross-consistency loss, $L_{cross}$, enforces consistency between the outputs of the main and auxiliary classifiers.
Furthermore, an entropy minimization loss, $L_{ent}$, is applied to the main classifier’s predictions $\hat{y}_u$ for unlabeled data to enhance prediction confidence. Finally, all the constituent loss terms are weighted and combined for end-to-end training.
Experimental results demonstrate that the CRCFP framework exhibits superior performance in semi-supervised semantic segmentation tasks on two public histology datasets: BCSS and MoNuSeg. The advantages of this framework are particularly pronounced in low-data regimes, and its performance using only a fraction of the labeled data approaches that achieved by fully supervised models.
CRCFP, through context-aware consistency and cross-consistency training, effectively leverages contextual information and feature perturbations from unlabeled data, enhancing model robustness. However, CRCFP primarily focuses on consistency at the global feature level, potentially resulting in lower accuracy for crucial edge details significant in medical imaging. Addressing this, Yang et al. (2025) [46] further explored the application of contrastive learning to specifically enhance segmentation accuracy in edge regions, integrating it with feature perturbation consistency within a novel network architecture for semi-supervised medical image segmentation.
The methodology proposed by Yang et al. (2025) [46] similarly employs an architecture based on a shared encoder and multiple decoders, diverging from the strategy of Bashir et al., which utilized lightweight auxiliary classifiers; Yang et al. designed multiple complete decoder branches. Its core innovation resides in the introduction of a structured weak-to-strong feature perturbation mechanism. Operating not at the image level, but rather at the feature level, it leverages the statistical information (mean, standard deviation) of feature maps to perform controllable linear transformations, applying perturbations of incrementally increasing intensity across the different decoder branches. This strategy is designed to explore the feature space more systematically and comprehensively, facilitating the learning of representations robust to perturbations.
To effectively leverage this structured perturbation for learning from unlabeled data, the method incorporates a feature perturbation consistency loss, compelling the model to yield consistent predictions for the same input under varying perturbation strengths. Crucially, for generating reliable supervisory targets to compute this consistency loss, the model does not simply average the predictions from the various branches but employs an uncertainty-weighted aggregation strategy. This strategy fuses the prediction results based on the confidence level (derived from uncertainty estimation) associated with each perturbed branch, thereby producing more dependable aggregated pseudo-labels.
Furthermore, specifically addressing the critical and often challenging edge regions in medical image segmentation, Yang et al. [46] designed an Edge-Aware Contrastive Learning (ECL) branch. The novelty of this branch lies in its intelligent sample selection mechanism. It utilizes the prediction results generated by the main segmentation branch along with corresponding uncertainty maps to identify and prioritize the selection of pixels located in edge regions. By constructing positive and negative pairs from these carefully chosen edge pixel features and applying a contrastive loss, the model is explicitly guided to learn more discriminative feature representations pertinent to edge areas, consequently enhancing edge segmentation accuracy.
On the public BraTS2020, LA, and ACDC datasets, the method demonstrably outperformed contemporary baseline models including SFPC [43], PLCT [57], CAML [58], and MC-Net+ [52] in semi-supervised segmentation tasks, with its advantages being particularly pronounced under low-labeled-data regimes (e.g., 5%). Moreover, the method demonstrated its capability to significantly enhance segmentation accuracy within challenging edge regions, addressing a prevalent limitation of existing techniques and exhibiting superior precision at object boundaries.
This section reviews semi-supervised medical image segmentation approaches, with a focus on representative methods employing consistency regularization strategies and their integration with pseudo-labeling and contrastive learning. To enable a clear and comprehensive performance comparison of deep learning segmentation models within the semi-supervised paradigm, Table 4 and Table 5, respectively, summarize the DSC, Jaccard, 95HD, and ASD scores achieved by representative methods on the 2D ACDC2017 dataset and the 3D BraTS2020 dataset, utilizing labeled data proportions of only 5% and 10%.

4. Weakly Supervised Medical Image Segmentation Methods

To mitigate the reliance of fully supervised learning on large-scale, high-quality pixel-level annotations, significant attention has been given to research into weakly supervised learning (WSL). Efforts in WSL are directed towards the replacement of precise pixel-level segmentation masks with easily obtainable coarse-grained annotations, including image-level labels, bounding boxes, or scribbles. The efficiency of medical image annotation is significantly enhanced through this approach, labeling costs are reduced, and more feasible solutions for clinical applications are offered.
The field of weakly supervised medical image segmentation has been marked by rapid development in recent years, characterized by the emergence of diverse methods that utilize various types of weak labels. Early research was primarily directed towards the utilization of image-level labels, where Class Activation Maps (CAMs) [8] were generated to localize target regions. However, CAMs typically highlight only the most discriminative parts of the target, resulting in incomplete segmentation outcomes. To address this limitation, subsequent studies have explored various strategies, which include the incorporation of saliency information [64], iterative region mining [12], and the employment of adversarial learning [65]. Stronger forms of weak labels, including bounding boxes [66], scribbles [67], and point annotations, have also been utilized in certain methods, thereby providing more precise localization information.
Recently, approaches combining multiple forms of weak labels, leveraging prior knowledge (e.g., target size, shape), and introducing self-supervised learning [68] have attracted growing attention. The performance of weakly supervised medical image segmentation has been continuously enhanced by these methods, leading to advancements in precision and robustness.
To clearly delineate the developmental trajectory of weakly supervised medical image segmentation, an in-depth exploration of the advantages and disadvantages of different approaches is provided. Existing methods are classified and summarized based on the type of weak annotation information utilized and the learning paradigm employed. Initially, weakly supervised approaches that depend solely on image-level labels are concentrated on (Section 4.1); these minimize annotation costs, although less accurate segmentation is often observed. Subsequently, weakly semi-supervised methods are explored (Section 4.2), which employ sparse annotations (e.g., scribbles, point annotations) in conjunction with extensive unlabeled data, creating a more effective compromise between annotation cost and segmentation accuracy.

4.1. Image-Level Label-Based Weakly Supervised Medical Image Segmentation

In the domain of weakly supervised medical image segmentation utilizing solely image-level labels, a variety of effective methods have been developed. Among these, Class Activation Mapping [8] and Multiple-Instance Learning (MIL) [69] are regarded as two of the most representative and widely employed techniques. Weak supervision signals for segmentation are provided by CAMs through the visualization of internal activations generated by convolutional neural networks, facilitating the identification of regions within the image relevant to specific classes. In contrast, MIL is designed to conceptualize an image as a “bag”, with pixels functioning as “instances”, thereby deriving pixel-level segmentation outcomes through the analysis of bag-level labels. A comprehensive analysis of recent research corresponding to these two methodological frameworks is presented.

4.1.1. CAM: A Powerful Tool for Weakly Supervised Medical Image Segmentation

The network architecture for Class Activation Mapping is illustrated in Figure 5. An input image is first processed through a series of convolutional layers for feature extraction. The key modification is located in the final segment of the network: following the last convolutional layer, which yields a set of feature maps, the conventional fully connected layers are omitted, and a global average pooling (GAP) layer is applied directly. The GAP layer computes the spatial average over each feature map from the last convolutional layer, thereby compressing it into a feature vector whose number of dimensions corresponds to the number of channels ($n$) in that layer. This feature vector, output by GAP, is subsequently fed into the final fully connected output layer (e.g., a softmax layer for classification). For any given class (e.g., ‘Australian terrier’), the final score is derived as a weighted sum of the elements in the GAP output vector, where the weights $(W_1, W_2, \ldots, W_n)$ signify the connection strength between the average activation of each feature map and the output node designated for that class. This architecture enables the output layer weights to be projected back onto the feature maps of the last convolutional layer, thereby generating a CAM, which localizes the image regions that substantially contribute to the classification of a specific class.
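To make this projection step concrete, the following PyTorch-style sketch computes a CAM by weighting the last convolutional feature maps with the classifier weights of a chosen class; the function and variable names are illustrative and assume a backbone that ends in global average pooling followed by a single fully connected layer.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Project the output-layer weights of one class back onto the last
    convolutional feature maps to obtain a class activation map (CAM).

    feature_maps: (n, H, W) activations of the last conv layer for one image
    fc_weights:   (num_classes, n) weights of the fully connected layer
                  that follows global average pooling
    class_idx:    index of the class of interest
    """
    w = fc_weights[class_idx]                                   # (n,)
    cam = torch.einsum('c,chw->hw', w, feature_maps)            # weighted sum of maps
    cam = F.relu(cam)                                           # keep positive contributions
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]
    return cam

# Usage (illustrative): upsample the CAM to the input resolution and threshold it
# to obtain a coarse localization mask for the chosen class.
# cam = class_activation_map(features, model.fc.weight.detach(), class_idx=1)
# mask = F.interpolate(cam[None, None], size=(H, W), mode='bilinear')[0, 0] > 0.4
```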
Leveraging the principle of CAMs, which utilize image-level supervision to generate localization maps, Chikontwe et al. (2022) further developed a weakly supervised learning framework specifically addressing the challenges of segmenting whole-slide histopathology images (WSIs) in digital pathology [70]. To manage the substantial memory requirements of WSIs while retaining global contextual information, the method initially employs a neural compression technique. An encoder, trained on image patches via unsupervised contrastive learning, is subsequently utilized to compress the entire WSI into a fixed-size feature map of significantly reduced dimensionality that preserves critical spatial information, thus circumventing the limitations inherent in conventional patch-based processing.
The core of the framework is structured around a single-stage weakly supervised segmentation process, which is enhanced by an innovative self-supervised Class Activation Map (CAM) refinement mechanism. A segmentation network is utilized to generate a preliminary CAM based on the compressed features. The initial CAM is not directly employed; instead, refinement occurs through two key modules. Firstly, the “ τ -MaskOut” technique is implemented to identify and mask input features corresponding to low-confidence regions in the preliminary CAM, which acts as a form of spatial regularization. Secondly, a Pixel Correlation Module (PCM) is utilized to compute the correlation between the masked features and the preliminary CAM using self-attention, promoting activation expansion to yield a more comprehensive refined CAM.
To facilitate end-to-end training and to effectively leverage weak labels alongside self-supervised signals, a specifically designed composite loss function is employed by the framework. This loss function incorporates a standard classification loss, which associates the global average-pooled outputs of both the initial and refined CAMs with image-level labels; an equivariant regularization loss, which enforces consistency between the initial and refined CAMs; and a conditional entropy minimization loss, which aims at mitigating prediction uncertainty. Minimization of this composite objective, which integrates classification, consistency, and uncertainty constraints, allows for the learning of high-quality segmentation masks that are conditioned solely on image-level labels.
Experimental results have demonstrated that the proposed method, utilizing solely image-level labels, achieves segmentation accuracy remarkably close to that of a fully supervised U-Net model trained on the same compressed data. Specifically, the reported Dice similarity coefficient gaps were approximately 1.6% on the Set-I dataset and 8.5% on the Set-II dataset. Furthermore, the majority of these performance improvements were verified as statistically significant (p < 0.05), robustly validating the efficacy of the proposed self-supervised CAM refinement framework.
However, CAM-based approaches typically encounter a limitation: the generated activation maps often highlight only the most discriminative regions of the target object, potentially overlooking other less salient but equally relevant portions. To address this issue, G. Patel et al. (2022) introduced a novel multi-modal learning strategy [71] that leverages both intra-modal and cross-modal equivariant constraints to enhance CAMs. This approach is based on the observation that while different modalities emphasize distinct tissue characteristics, they should yield consistent segmentations over the same underlying anatomical structures. Building upon this insight, G. Patel et al. [71] devised a composite loss function incorporating terms for intra-modal equivariance, cross-modal equivariance, and KL divergence, integrated with the standard image-level classification objective. This formulation aims to produce CAMs that are both more complete and more accurate.
Central to the training process (detailed in Algorithm 1) is the self-supervised refinement of Class Activation Maps achieved through the leveraging of multi-modal data and spatial transformations. Specifically, training is conducted using K neural networks, one for each modality, parameterized by $\Theta_{m_k}$. In each training iteration, a minibatch of data is processed by the algorithm. Initially, the same random spatial transformation $\pi$ is applied uniformly to the images of all modalities within the minibatch. Subsequently, forward propagation is executed on both the original and transformed images to obtain their respective CAMs (denoted as $M$ and $M^{\pi}$), along with softmax probability outputs ($P$ and $P^{\pi}$). The crucial step involves the computation of a composite loss $L(\Theta_{m_k})$ for each modality k, characterized as a weighted sum formed by the standard classification loss $L_C$, utilizing the image-level label y; the within-modality equivariance loss $L_{ER}$, enforcing consistency between $M$ and $M^{\pi}$ under the transformation $\pi$; the cross-modal knowledge distillation loss $L_{KD}$, employing KL divergence to encourage alignment between $P$ and $P^{\pi}$ across different modalities; and the central cross-modal equivariance loss $L_{CMER}$, enforcing consistency between $M$ and $M^{\pi}$ of different modalities when subjected to the same transformation $\pi$. Finally, gradients are computed based on this composite loss, permitting the parameters $\Theta_{m_k}$ of each network to be updated accordingly. Experiments conducted on the BraTS brain tumor segmentation and prostate segmentation datasets demonstrate that the proposed method significantly outperforms standard CAM and GradCAM++ [72], as well as state-of-the-art weakly supervised segmentation methods such as SEAM [68].
Algorithm 1 Training algorithm
Require: Training dataset $D$
1: $K$ = number of image modalities
2: $\Pi$ = set of transformations
3: $T$ = total number of epochs
4: for $k$ in $[1, K]$ do
5:  Initialize $\Theta_{m_k}$
6: end for
7: for $t$ in $[1, T]$ do
8:  for every minibatch $B$ in $D$ do
9:   $\pi \leftarrow$ random transformation sampled from $\Pi$
10:   $M \leftarrow \{\{M_i^{m_k}\}_{k=1}^{K}\}_{i \in B}$
11:   $P \leftarrow \{\{p_i^{m_k}\}_{k=1}^{K}\}_{i \in B}$
12:   $D^{\pi} \leftarrow \{\{\pi(X_i^{m_k})\}_{k=1}^{K}, y_i\}_{i \in B}$
13:   $M^{\pi} \leftarrow \{\{M_i^{m_k}\}_{k=1}^{K}\}_{i \in B}$ computed on $D^{\pi}$
14:   $P^{\pi} \leftarrow \{\{p_i^{m_k}\}_{k=1}^{K}\}_{i \in B}$ computed on $D^{\pi}$
15:   for $k$ in $[1, K]$ do
16:    Compute $L_C(\Theta_{m_k})$, $L_{ER}(\Theta_{m_k})$, $L_{KD}(\Theta_{m_k}, m_{l \neq k})$, $L_{CMER}(\Theta_{m_k}, m_{l \neq k})$
17:    Compute $L(\Theta_{m_k})$
18:    loss $\leftarrow \frac{1}{|B|} \sum_{i \in B} L(\Theta_{m_k})$
19:    Compute gradients of loss w.r.t. $\Theta_{m_k}$
20:    Update $\Theta_{m_k}$ using the optimizer
21:   end for
22:  end for
23: end for
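The per-modality composite loss of Algorithm 1 can be sketched as follows in PyTorch. The interfaces (each network returning a CAM and classification logits, a transformation applicable to both images and CAMs) and the loss weights are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def composite_losses(models, images, labels, transform,
                     lam_er=1.0, lam_kd=0.1, lam_cmer=1.0):
    """Per-modality composite loss of Algorithm 1 (sketch).

    models:    list of K per-modality networks; each returns (cam, logits)
    images:    list of K spatially aligned tensors, one per modality
    labels:    multi-hot image-level labels (float), shared across modalities
    transform: spatial transformation pi, applied identically to images and CAMs
    """
    cams, logits, cams_t, logits_t = [], [], [], []
    for net, x in zip(models, images):
        m, z = net(x)                      # CAM and classification logits
        m_t, z_t = net(transform(x))       # forward pass on the transformed image
        cams.append(m); logits.append(z)
        cams_t.append(m_t); logits_t.append(z_t)

    losses = []
    for k in range(len(models)):
        l_c = F.binary_cross_entropy_with_logits(logits[k], labels)   # L_C
        l_er = F.mse_loss(transform(cams[k]), cams_t[k])               # L_ER (within modality)
        l_kd, l_cmer = 0.0, 0.0
        for l in range(len(models)):
            if l == k:
                continue
            l_kd += F.kl_div(logits_t[k].log_softmax(dim=-1),          # L_KD (cross-modal)
                             logits_t[l].softmax(dim=-1),
                             reduction='batchmean')
            l_cmer += F.mse_loss(cams_t[k], cams_t[l])                 # L_CMER (cross-modal)
        losses.append(l_c + lam_er * l_er + lam_kd * l_kd + lam_cmer * l_cmer)
    return losses                                                       # one loss per modality
```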
Prior methods have often been restricted to single-class segmentation scenarios; however, multiple lesions exhibiting diverse morphologies are commonly present in medical images. To address this challenge, an Anomaly-Guided Mechanism (AGM) [73] for multi-class lesion segmentation in optical coherence tomography (OCT) images was introduced by Yang et al. [46].
The AGM initially employs a GANomaly network, trained on normal images, to generate a pseudo-healthy counterpart for each input OCT image. An anomaly-discriminative representation, highlighting abnormal regions, is then produced by computing the difference between the original image and its pseudo-healthy counterpart. AGM utilizes a dual-branch architecture: a backbone branch processes concatenated information from the original and pseudo-healthy images, while an Anomaly Self-Attention Module (ASAM) branch processes the anomaly-discriminative representation. Global contextual information and pixel dependencies within abnormal patterns, particularly regarding small lesions, are captured by the ASAM branch through the application of self-attention. Feature maps from the two branches are fused (e.g., via element-wise multiplication) and subsequently processed through global max pooling (GMP) and fully connected (FC) layers for multi-label classification and the generation of the initial Class Activation Map (specifically GradCAM).
An iterative refinement learning stage is identified as a key component. CAMs and classification predictions from the preceding iteration are utilized to create a weighted region-of-interest (ROI) mask. This mask serves to enhance the input to the backbone branch in the following training iteration, facilitating a more precise focus of the model on potential lesion areas. Ultimately, the refined CAMs are subjected to post-processing, which includes retina mask extraction, thresholding, and class selection, to generate high-quality pseudo pixel-level labels. These labels are subsequently employed for training a standard segmentation network.
By integrating anomaly detection and self-attention within the weakly supervised semantic segmentation (WSSS) framework, along with iterative refinement, localization accuracy is enhanced, particularly for the small, low-contrast, and co-existing lesions frequently encountered in medical imaging. State-of-the-art performance has been achieved across multiple datasets. For instance, on the public RESC and Duke SD-OCT datasets, as well as a private retinal OCT dataset used in the study, AGM achieved significantly better pseudo-label quality, as assessed by the mean intersection over union (mIoU), and better final segmentation results than baseline methods such as SEAM [74] and ReCAM [75].

4.1.2. MIL: An Effective Strategy for Weakly Supervised Medical Image Segmentation

Multiple-Instance Learning (MIL) has been identified as an effective weakly supervised strategy for medical image segmentation, utilizing solely image-level labels. Within the MIL framework, each image is designated as a “bag”, with individual pixels or regions categorized as “instances”. The model’s objective is the prediction of instance-level labels, specifically determining whether each pixel or region is associated with a lesion, based on the overarching bag-level label indicating if the image contains a lesion.
To overcome a common limitation of traditional MIL methods—overlooking long-range dependencies among pixels in histopathology image segmentation—Li et al. (2023) proposed SA-MIL [76], a weakly supervised segmentation method based on self-attention. Instead of directly feeding pixel features into a classifier for prediction, as is common in conventional MIL approaches, SA-MIL integrates self-attention modules at multiple stages of feature extraction. As illustrated in Figure 6, SA-MIL employs the first three convolutional stages of VGGNet as its backbone network (represented by the sequence of purple 3 × 3 Conv modules). A self-attention module (SAM, orange module) is inserted between the convolutional and pooling layers within each stage. The core component of the SAM is the Criss-Cross Attention (CCA) module, which aggregates contextual information for each pixel along its horizontal and vertical directions through two recurrent operations, thereby establishing long-range dependencies among pixels. This mechanism enhances the capacity of feature representation, allowing for improved differentiation between foreground (cancerous regions) and background.
A deep supervision strategy is adopted by SA-MIL to fully leverage the limited image-level annotation information. A decoder branch (light blue module) is connected following each SAM module. A pixel-level prediction map is generated by this decoder, corresponding to that stage. These pixel-level prediction maps are subsequently aggregated via softmax activation and generalized mean (GeM) pooling operations to produce an image-level prediction for that stage. The corresponding Multiple-Instance Learning loss (MIL loss) is then computed using the image-level ground truth label Y. Finally, the output features from the last SAM module, along with the pixel-level prediction maps generated by all intermediate decoders, are jointly fed into a fusion module (red module). A final fusion loss, L f u s e , is utilized to supervise this ultimate segmentation result.
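The bag-level supervision step can be illustrated with a short PyTorch sketch in which a pixel-level probability map is reduced to an image-level prediction via generalized mean (GeM) pooling and compared against the image label; the pooling exponent and the function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gem_pool(prob_map, p=4.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of a per-pixel foreground probability map.

    prob_map: (B, H, W) probabilities in [0, 1]; p -> 1 approaches average
    pooling, while large p approaches max pooling.
    """
    return prob_map.clamp(min=eps).pow(p).mean(dim=(1, 2)).pow(1.0 / p)

def mil_loss(pixel_logits, image_labels, p=4.0):
    """Image-level (bag-level) supervision of a pixel-level prediction map."""
    prob_map = torch.sigmoid(pixel_logits)          # instance (pixel) probabilities
    bag_prob = gem_pool(prob_map, p)                # bag (image) probability
    return F.binary_cross_entropy(bag_prob, image_labels.float())
```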
SA-MIL was extensively validated through experiments on two histopathology image datasets (colon cancer tissue and cervical cancer cells) and compared against various weakly supervised and fully supervised methods. The experimental results demonstrate that SA-MIL, utilizing only image-level labels, significantly outperforms weakly supervised methods, such as PSPS [77] and Implicit PointRend [78], across multiple metrics, including F1 score, Hausdorff distance, mIoU, and mAP. Furthermore, its performance approaches, and in some cases matches, that of fully supervised methods like U-Net.
Seeböck et al. (2024) introduced a novel strategy termed Anomaly-Guided Segmentation (ANGUS) [79], which utilizes the output of a pre-trained anomaly detection model as supplementary semantic context to enhance lesion segmentation in retinal optical coherence tomography images. This approach leverages weak spatial information derived from anomaly detection, differing from methods relying solely on specific target lesion annotations.
Specifically, the method is implemented in three stages: First, an anomaly detection model (e.g., WeakAnD), pre-trained on a dataset of healthy OCT images, is applied to the segmentation training set to generate pixel-wise weak anomaly maps. These maps encode regions deviating from normal patterns. Second, the manually annotated ground truth masks for the target lesions are merged with these weak anomaly maps to construct an expanded annotation scheme. This scheme adds an ’anomaly’ class to the original lesion classes, representing areas identified as abnormal by the detector but not explicitly labeled as a target lesion, potentially encompassing other pathologies or variations. Finally, a segmentation network (e.g., U-Net) is trained using this expanded annotation system, typically employing a weighted cross-entropy loss function. This training strategy compels the network not only to learn features of the annotated lesions but also to discriminate between normal tissue, known lesions, and unannotated anomalies, thereby improving robustness to complex lesions, inter-class variability, and real-world data variations without requiring additional manual annotation effort.
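A minimal sketch of the expanded annotation scheme is given below, assuming a per-pixel anomaly score map and an integer lesion mask; the threshold and names are illustrative rather than taken from the original work.

```python
import numpy as np

def expand_annotations(lesion_mask, anomaly_map, num_lesion_classes, thresh=0.5):
    """Merge target-lesion ground truth with a weak anomaly map (sketch).

    lesion_mask: (H, W) integer map, 0 = background, 1..C = annotated lesion classes
    anomaly_map: (H, W) anomaly score in [0, 1] from a detector trained on healthy scans
    Returns an (H, W) map where unannotated-but-anomalous pixels get class C+1.
    """
    expanded = lesion_mask.copy()
    anomalous = (anomaly_map > thresh) & (lesion_mask == 0)
    expanded[anomalous] = num_lesion_classes + 1    # extra 'anomaly' class
    return expanded
```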
Experimental results demonstrated that across two in-house and two public datasets, targeting various lesion types (including IRC, SRF, PED, HRF, SHRM, etc.), the ANGUS method consistently, and often statistically significantly, outperformed standard U-Net baselines trained solely on target lesion annotations. This was evidenced by improvements in metrics such as the Dice coefficient, precision, and recall, enabling accurate segmentation of lesions in retinal OCT images.

4.2. Weakly Semi-Supervised Medical Image Segmentation Methods

Weakly semi-supervised learning aims to enhance medical image segmentation performance by leveraging a small amount of weak annotations (e.g., scribbles, points) and a large volume of unlabeled data, effectively mitigating the annotation bottleneck. This section focuses on representative methods, SOUSA and Point SEGTR, whose core idea involves integrating supervision from sparse annotations with consistency constraints derived from unlabeled data, albeit with differing implementation emphases.
Proposed by Gao et al. (2022), Segmentation Only Uses Sparse Annotations (SOUSA) [80] is a framework for medical image segmentation that integrates weakly supervised and semi-supervised consistency learning (WSCL). SOUSA leverages sparse semantic information from scribble annotations alongside consistency priors inherent in unlabeled data. The framework is based on the Mean Teacher (MT) architecture, comprising a student network and a teacher network. The student network features two output heads: a primary segmentation head, predicting pixel-level segmentation masks; and an auxiliary regression head, tasked with predicting geodesic distance maps, which are pre-computed offline based on the input image and its corresponding scribble annotations.
For images with scribble annotations, the student network is supervised using two loss components: the segmentation head employs a partial cross-entropy (PCE) loss, calculated only on scribble pixels; while the regression head utilizes a regression loss, comparing its predicted distance map with the pre-computed geodesic distance map. The incorporation of the geodesic distance map aims to more fully exploit the sparse scribble information by providing spatial context that guides the model’s focus toward the target regions.
For unlabeled images, a consistency regularization strategy is applied. The same unlabeled image, subjected to different perturbations (e.g., random noise, linear transformations), is fed into the student and teacher networks, respectively. Consistency between their segmentation predictions is enforced using two loss functions: the standard mean squared error (MSE) loss and a novel Multi-angle Projection Reconstruction (MPR) loss. The MPR loss first randomly rotates the segmentation output maps from the student and teacher networks by the same angle, then projects the rotated maps onto the horizontal and vertical axes, and finally computes the consistency between these projection vectors. Compared to MSE, this projection mechanism is more sensitive to prediction discrepancies at boundaries and in small, discrete regions, addressing a limitation of MSE by penalizing such errors effectively.
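The projection mechanism can be sketched as follows; the use of torchvision's rotation utility and the number of sampled angles are illustrative assumptions, not the exact SOUSA implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import rotate

def mpr_loss(student_pred, teacher_pred, num_angles=4):
    """Multi-angle projection consistency between student and teacher maps (sketch).

    student_pred, teacher_pred: (B, C, H, W) soft segmentation maps.
    """
    loss = 0.0
    for _ in range(num_angles):
        angle = float(torch.empty(1).uniform_(0, 360))   # same random angle for both maps
        s = rotate(student_pred, angle)
        t = rotate(teacher_pred, angle)
        # project the rotated maps onto the horizontal and vertical axes
        loss += F.mse_loss(s.sum(dim=2), t.sum(dim=2))   # projection onto the x-axis
        loss += F.mse_loss(s.sum(dim=3), t.sum(dim=3))   # projection onto the y-axis
    return loss / num_angles
```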
Throughout the training process, the total loss function combines the supervised loss (for labeled data) and the consistency loss (for unlabeled data, comprising MSE and MPR components). The weight of the unlabeled data loss is modulated by a function that changes over the course of training (e.g., a Gaussian ramp-up). The student network’s parameters are optimized via backpropagation, while the teacher network’s parameters are updated using an exponential moving average of the student network’s parameters (momentum update).
Experimental results on the ACDC cardiac dataset and an in-house colon tumor dataset validated the efficacy of SOUSA. Compared to weakly supervised methods using only scribble annotations (e.g., Scribble2Label, MAAG), SOUSA significantly improved segmentation accuracy (e.g., achieving an approximately 5% higher Dice score than PCE + CRF on the ACDC dataset with 10% labeled data). Furthermore, when compared to standard semi-supervised methods adapted for the WSCL setting (e.g., ICT, Uncertainty-aware MT), SOUSA not only yielded higher Dice scores (e.g., 3.55% higher than ICT) but also demonstrated significantly lower Hausdorff distance (HD) and average symmetric surface distance (ASSD) metrics (e.g., reductions of 26.20 mm and 6.88 mm compared to ICT, respectively), indicating more accurate boundary delineation and fewer false-positive regions in its segmentation outputs.
While SOUSA demonstrates the potential of weakly semi-supervised learning in medical image segmentation, Point SEGTR, proposed by Shi et al. (2023) [81], adopts an alternative strategy by integrating a small amount of pixel-level annotations with a large volume of more readily available point-level annotations. Point SEGTR employs a teacher–student learning framework, wherein the teacher model is based on the Point DETR architecture, incorporating a point encoder for encoding point annotations, an image encoder (CNN + Transformer) for extracting image features, a Transformer decoder for fusing information via attention mechanisms, and a segmentation head for outputting the segmentation results. During training, Point SEGTR first utilizes pixel-level annotated data to initialize the teacher model, enhancing its robustness to variations in point annotation location by introducing a Multi-Point Consistency (MPC) loss, which enforces consistent segmentation predictions for different points sampled within the same target object. Subsequently, the teacher model is fine-tuned using a large amount of point-annotated data, concurrently applying a Symmetric Consistency (SC) loss. This SC loss encourages consistent predictions for the same input subjected to transformations or perturbations, thereby improving generalization and better leveraging the weak annotations. Finally, the optimized teacher model is utilized to generate high-quality pseudo-segmentation labels for all point-annotated data. These pseudo-labels, combined with the original pixel-level annotations, are used to jointly train a student network (e.g., Mask R-CNN), which serves as the final model for inference. Through this weakly semi-supervised process incorporating MPC and SC regularization, Point SEGTR achieves competitive segmentation performance while significantly reducing the dependency on pixel-level annotations.
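The Multi-Point Consistency idea can be sketched as a pairwise agreement term over predictions obtained from different point prompts placed inside the same object; the point-prompted model interface shown here is a hypothetical placeholder.

```python
import torch
import torch.nn.functional as F

def multi_point_consistency(model, image, points):
    """Encourage identical predictions for different point prompts on one object (sketch).

    model:  point-prompted segmenter (assumed), model(image, point) -> (B, 1, H, W) logits
    points: list of point prompts sampled inside the same target object
    """
    probs = [torch.sigmoid(model(image, pt)) for pt in points]
    loss, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            loss += F.mse_loss(probs[i], probs[j])   # predictions should agree pairwise
            pairs += 1
    return loss / max(pairs, 1)
```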
Experiments conducted on three endoscopic datasets (CVC, ETIS, and NASOP) demonstrated that the Point SEGTR framework, when augmented with MPC and SC regularization, enabled the teacher model to achieve performance comparable to or even exceeding that of baseline models trained with 100% pixel-level annotations, even when using only a limited fraction (e.g., 50%) of such annotations. Furthermore, student models trained using pseudo-labels generated by this enhanced teacher model exhibited significantly improved segmentation accuracy compared to baselines. These results validate the effectiveness of the proposed regularization strategies in reducing annotation requirements and enhancing weakly semi-supervised segmentation performance.
SOUSA and Point SEGTR represent two distinct weakly semi-supervised strategies. SOUSA emphasizes consistency between scribble annotations and unlabeled data, whereas Point SEGTR focuses on fusing pixel-level and point-level annotations. Both approaches underscore the efficacy of combining sparse annotations with consistency constraints. Future research could explore the integration of additional types of weak annotations with consistency learning, as well as investigate more effective model architectures and training strategies to further advance the performance of weakly semi-supervised medical image segmentation.
This section systematically reviews weakly supervised medical image segmentation methods, focusing on key advancements that leverage image-level labels (e.g., CAM- and MIL-based approaches) as well as weakly semi-supervised learning strategies that combine sparse annotations (such as scribbles or points) with consistency regularization (e.g., SOUSA, Point SEGTR). To facilitate quantitative performance evaluation of existing weakly supervised semantic segmentation methods, Table 6 systematically presents the DSC and mIoU scores achieved by representative approaches on the RESC and Duke datasets across different lesion regions.

5. Unsupervised Medical Image Segmentation Methods

Unsupervised segmentation methods have garnered significant attention due to their independence from pixel-level annotations, particularly in scenarios where labeled data are scarce or prohibitively expensive to acquire. The objective of these methods is to automatically identify and delineate regions of interest, such as lesions or specific organs, within images algorithmically, without reliance on manually annotated training data. Initially, traditional image processing techniques, including clustering, region growing, and thresholding, were primarily explored by researchers to achieve unsupervised segmentation. These approaches typically operate based on low-level image features (e.g., pixel intensity, texture, edges) and do not require training data. However, constraints imposed by hand-crafted features and predefined rules are often encountered, limiting the effectiveness of these methods in handling the complex anatomical structures and pathological variations inherent in medical imaging. The advent of deep learning has spurred recent advances in unsupervised segmentation methods based on Autoencoders (AEs) and their variants, which exhibit powerful feature representation and learning capabilities [3,6,7]. These techniques often learn the latent distribution of normal images, thereby enabling the identification and segmentation of anomalous regions that deviate from this learned normality model, providing valuable support for clinical diagnosis. Building upon this developmental trajectory, contemporary unsupervised medical image segmentation approaches are predominantly categorized into two main classes: unsupervised anomaly segmentation (UAS) and unsupervised domain adaptation (UDA) for segmentation. Distinct challenges are addressed by these two classes, which have demonstrated considerable success in practical applications.

5.1. Unsupervised Anomaly Segmentation Methods

Unsupervised anomaly detection and segmentation in medical image analysis are directed towards the identification and delineation of pathological manifestations that deviate from normal anatomical structures. The heterogeneity of pathologies contributes to challenges in capturing all possible variations with labeled examples, while the acquisition of large-scale annotated datasets is characterized by its complexity and associated costs. These constraints have limited the feasibility of fully supervised approaches, thus motivating the utilization of unsupervised methods, which eliminate the necessity for explicit anomaly labels and have received substantial attention. However, the effective modeling of the complex distribution of normal anatomy, in order to accurately differentiate between diverse and potentially subtle deviations resulting from the aforementioned heterogeneity, continues to present a significant challenge for these unsupervised techniques. Recent advancements in deep learning have facilitated progress in this domain, with methods based on Autoencoders and their variants being recognized as a prominent research focus due to their robust feature representation capabilities.
Silva-Rodríguez et al. (2022) introduced a novel framework [87] based on constrained optimization applied to attention mechanisms derived from a Variational Autoencoder (VAE). Instead of relying solely on reconstruction error, attention maps extracted from the VAE encoder’s intermediate layers were leveraged to identify anomalies. While the investigation of Gradient-Weighted Class Activation Mapping (Grad-CAM) was initially undertaken, non-gradient-weighted Activation Maps (AMs) were found to be preferable. A key innovation was the formulation of a constraint loss designed to ensure comprehensive attention coverage over the entire context in normal images. Crucially, implementation occurs not as a pixel-wise equality constraint enforcing maximum activation everywhere, but as a global inequality constraint on the overall activation level of the attention map. This formulation grants greater flexibility to the model in learning the distribution of normal patterns.
In addressing the optimization challenge presented by the inequality constraint, the extended log-barrier method is explored. This methodology integrates the constraint into the objective function through a smooth, differentiable barrier term. An alternative regularization strategy is proposed, which entails maximizing the Shannon entropy of the (softmax-normalized) attention map. This approach is associated with a diffuse attention distribution across normal images and simultaneously reduces the number of hyperparameters. During inference, the trained Variational Autoencoder (VAE) generates the (softmax-normalized) activation map, which is utilized directly as the anomaly saliency map. Thresholding of this saliency map results in the final anomaly segmentation mask. Experimental evaluation has highlighted the superior performance of the AMCons strategy, employing Shannon entropy maximization, which significantly enhances the separation between the activation distributions corresponding to normal and anomalous patterns, resulting in a reduction of their histogram overlap to 10.6%.
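The entropy-based regularizer can be sketched as follows, assuming the attention map is flattened and softmax-normalized per image; minimizing the returned value maximizes the Shannon entropy on normal training images, pushing attention toward a diffuse distribution.

```python
import torch

def attention_entropy_loss(activation_map, eps=1e-8):
    """Negative Shannon entropy of a softmax-normalized attention map (sketch).

    activation_map: (B, H, W) raw activations from the VAE encoder.
    Minimizing this loss maximizes entropy, i.e. spreads attention over the
    whole normal image; at inference, anomalous regions stand out as
    concentrated, high-activation peaks in the same map.
    """
    b = activation_map.size(0)
    flat = activation_map.view(b, -1)
    p = torch.softmax(flat, dim=1)
    entropy = -(p * (p + eps).log()).sum(dim=1)
    return -entropy.mean()      # minimize negative entropy = maximize entropy
```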
In contrast to the approach practiced by Silva-Rodríguez et al., which focused on attention mechanisms and global constraints, an alternative strategy was investigated by Pinaya et al. This investigation leveraged the synergy between Vector Quantized Variational Autoencoders (VQ-VAEs) and Transformers for unsupervised 3D neuroimaging anomaly detection and segmentation [88], as illustrated schematically in their Figure 7. In the initial stage, a VQ-VAE architecture (Figure 7, left side) is used to acquire a discrete latent representation $Z_q$ of the input brain image $X$. An encoder network $E$ is utilized to map the input to a continuous latent space $Z$, followed by vector quantization against a learnable codebook $e_k$, thereby yielding the discrete index grid $Z_q$. A decoder network $D$ is subsequently engaged to reconstruct the image $\hat{X}$ from the quantized representation. During this phase, which is trained exclusively on healthy-brain images, the objective is to derive a high-fidelity, compact discrete encoding.
Subsequently, the learned discrete latent representation $Z_q$ is serialized into a one-dimensional sequence $S$ (Figure 7, right side). This sequence serves as the input for the second stage, which is conducted using an autoregressive Transformer model. Sequential dependencies inherent in latent sequences derived from normal brain data are learned by this Transformer through the modeling of the conditional probability distribution $p(s_i \mid s_{<i})$. Training is executed by maximizing the log-likelihood across sequences from the healthy training set, which facilitates the internalization of the statistical patterns characteristic of normal brain structure in the discrete latent space.
At the inference stage, indices $s_i$ in the latent sequence $S$ whose probabilities $p(s_i \mid s_{<i})$ under the trained Transformer fall below a predefined threshold are identified, thereby forming a resampling mask $m$. This mask is employed to guide the generation of a "healed" latent sequence $\hat{s}$, obtained by resampling the anomalous indices, and of its corresponding decoded image $\hat{x}_{healed}$. The upsampled and smoothed resampling mask $m$ is utilized to filter the pixel-/voxel-wise residual map $|x - \hat{x}_{healed}|$, with the Transformer's probabilistic anomaly judgment leveraged to suppress reconstruction artifacts. Thresholding of this filtered residual map results in the final anomaly segmentation.
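A simplified sketch of this inference procedure is given below. The model interfaces (a Transformer whose logits at position i are conditioned only on preceding tokens, a decode_indices method on the VQ-VAE) and the helper upsample_token_mask are hypothetical placeholders, and resampling is approximated here by taking the most likely token.

```python
import torch

@torch.no_grad()
def heal_and_segment(x, latent_seq, transformer, vq_vae, thresh=0.01):
    """Anomaly segmentation by latent resampling (sketch).

    x:           input image; latent_seq: its discretized latent token sequence (L,)
    transformer: assumed to return, at position i, logits conditioned on tokens < i
    vq_vae:      assumed to provide decode_indices(sequence) -> reconstructed image
    """
    logits = transformer(latent_seq.unsqueeze(0))[0]            # (L, codebook_size)
    probs = logits.softmax(-1)
    token_prob = probs.gather(1, latent_seq.unsqueeze(1)).squeeze(1)
    mask = token_prob < thresh                                   # low-likelihood tokens

    healed = latent_seq.clone()
    healed[mask] = probs[mask].argmax(-1)                        # "resample" anomalous tokens
    x_healed = vq_vae.decode_indices(healed)

    residual = (x - x_healed).abs()
    spatial_mask = upsample_token_mask(mask, residual.shape)     # assumed helper: token grid -> image grid
    return residual * spatial_mask                               # filtered anomaly map
```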
On synthetic data as well as four real-world 2D/3D brain lesion datasets (UKB, MSLUB, BRATS, WMH), the proposed framework integrating VQ-VAE, Transformer, the residual mask filtering technique, and a multi-view ensemble strategy demonstrated superior performance in unsupervised anomaly segmentation. Established state-of-the-art unsupervised methods, including Autoencoder (AE) variants [89], Variational Autoencoders [90], and f-AnoGAN [91], were consistently and significantly outperformed, with the highest reported Dice and AUPRC scores achieved across the evaluated benchmarks.

5.2. Unsupervised Domain Adaptation Segmentation Methods

Unsupervised domain adaptation is recognized as a significant research paradigm for the mitigation of domain-shift challenges in medical image segmentation. Performance degradation of deep learning models, when exposed to data distributions that differ from the training source—exemplified by variations arising from multi-center, multi-modal, or multi-protocol contexts—represents a primary barrier to successful clinical translation. Given the prohibitive costs and practical constraints, including privacy concerns related to the acquisition of fine-grained annotations for the target domain, the primary objective of UDA is to adapt models trained on a labeled source domain to an unlabeled target domain, thus facilitating robust generalization performance. The effectiveness of UDA methods that can manage complex distribution discrepancies while learning feature representations that are resilient to domain shifts and discriminative for segmentation tasks is vital, particularly in the context of minimizing the risk of negative transfer. Consequently, successful UDA is deemed essential for the enhancement of generalization, robustness, and deployment feasibility of segmentation models in clinical environments. In response to the primary challenges posed by UDA, numerous methodological advancements have been introduced in recent years. This subsection offers a systematic review of these developments, which have been categorized along three main lines of methodological progress.

5.2.1. Advancements in Source-Data-Free Unsupervised Domain Adaptation

Conventional unsupervised domain adaptation (UDA) methods are characterized by the necessity for concurrent access to both source- and target-domain data. However, in real-world applications, this assumption has often been demonstrated to be impractical due to constraints imposed by data privacy regulations and restrictive sharing protocols. To mitigate the challenge of limited access to source-domain data during the adaptation process, distinct UDA strategies have been independently proposed by Stan and Rostami (2024) [92] as well as Liu et al. (2023) [93].
A strategy centered on latent feature distribution alignment was introduced by Stan and Rostami (2024) [92]. During the adaptation phase, information regarding the source domain’s latent feature distribution is leveraged, in conjunction with optimal transport (OT) theory, to achieve alignment between the target-domain features and the source feature distribution.
The methodology is initiated by training a semantic segmentation network on labeled source-domain data. Following this, latent features are extracted from the source data using the trained network, and a Gaussian Mixture Model (GMM) is learned to approximate the distribution of these source latent features. Prior knowledge from the source domain is encapsulated by this GMM and serves as a surrogate during the subsequent adaptation stage.
In the adaptation phase, a distribution alignment strategy grounded in optimal transport (OT) theory is proposed, wherein the model lacks access to source data and relies solely on unlabeled target-domain data. The sliced Wasserstein distance (SWD) is used as the metric to quantify the discrepancy between the target-domain feature distribution and the learned Gaussian Mixture Model (GMM) distribution. SWD approximates the Wasserstein distance by projecting high-dimensional distributions onto multiple one-dimensional directions and averaging the corresponding one-dimensional Wasserstein distances. This approach allows for end-to-end optimization through backpropagation, thereby circumventing the computational challenges associated with the direct calculation of high-dimensional Wasserstein distances. By minimizing the SWD loss, the method effectively pulls the target domain’s latent feature distribution towards the pre-learned GMM distribution, thereby achieving domain alignment. To further enhance performance on the target domain, a regularization term is introduced for fine-tuning the classifier.
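The SWD computation can be sketched as follows, assuming equal numbers of target features and GMM samples; the number of projection directions is illustrative.

```python
import torch

def sliced_wasserstein_distance(target_feats, gmm_samples, num_projections=64):
    """Sliced Wasserstein distance between two feature sets (sketch).

    target_feats: (N, d) latent features of the unlabeled target batch
    gmm_samples:  (N, d) samples drawn from the GMM fitted to source features
                  (equal sample counts assumed for simplicity)
    """
    d = target_feats.size(1)
    # random unit directions defining the one-dimensional projections
    dirs = torch.randn(num_projections, d, device=target_feats.device)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)

    proj_t = target_feats @ dirs.t()      # (N, num_projections)
    proj_s = gmm_samples @ dirs.t()       # (N, num_projections)

    # the 1-D Wasserstein distance reduces to comparing sorted projections
    proj_t, _ = proj_t.sort(dim=0)
    proj_s, _ = proj_s.sort(dim=0)
    return (proj_t - proj_s).abs().mean()
```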
The efficacy of this method was validated on two publicly available cardiac image datasets and one abdominal image dataset. Experimental results demonstrate that, even in the absence of source-domain data, the proposed approach achieves performance comparable to, or exceeding, existing unsupervised domain adaptation methods for medical image segmentation that utilize source data. The potential for effective domain adaptation while ensuring data privacy is underscored.
Alternatively, the OSUDA framework proposed by Liu et al. (2023) presents a “plug-and-play” UDA strategy [93]. This approach similarly obviates the need for source-domain data access, relying solely on a segmentation model pre-trained on the source domain. The central concept of OSUDA is to leverage the statistical information contained within batch normalization (BN) layers, which are widely utilized in pre-trained models. BN layers encapsulate two categories of statistics: low-order statistics (mean and variance) and high-order statistics (scaling and shifting factors). Research indicates that low-order statistics exhibit domain specificity, whereas high-order statistics tend to be domain-invariant. Consequently, the OSUDA framework adapts to the target domain by progressively adjusting the low-order statistics while maintaining the high-order statistics.
A progressive adaptation strategy for the low-order statistics, based on exponential momentum decay (EMD), is used. During the adaptation process, the target-domain batch normalization (BN) statistics are gradually shifted toward the statistics of the current batch through an exponentially decaying momentum, thereby facilitating a smooth transition. The high-order statistics are explicitly constrained during adaptation through the introduction of a High-order BN Statistics Consistency Loss (LHBS), which prevents them from being altered.
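A minimal sketch of this statistics-only adaptation, assuming a model with standard BatchNorm2d layers and an unlabeled target-domain loader of image batches, is shown below; the decay schedule constants are illustrative.

```python
import torch

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, total_steps, eta0=0.1):
    """Progressively shift BN running statistics toward the target domain (sketch).

    Low-order statistics (running_mean, running_var) are refreshed with an
    exponentially decaying momentum; high-order statistics (the affine weight
    and bias of each BN layer) are left untouched, reflecting the assumption
    that they are domain-invariant.
    """
    bn_layers = [m for m in model.modules() if isinstance(m, torch.nn.BatchNorm2d)]
    model.train()                          # BN layers update running stats in train mode
    for p in model.parameters():
        p.requires_grad_(False)            # no gradient-based parameter updates

    for step, x in enumerate(target_loader):   # loader assumed to yield image batches
        if step >= total_steps:
            break
        momentum = eta0 * torch.exp(torch.tensor(-step / total_steps)).item()
        for bn in bn_layers:
            bn.momentum = momentum         # exponential momentum decay
        model(x)                           # forward pass refreshes running statistics
```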
In order to enhance performance within the target domain, an adaptive channel weighting mechanism is incorporated by OSUDA. The transferability of each channel is evaluated by calculating the difference in low-order statistics between domains, while scaling factors derived from high-order statistics are taken into account. Higher weights are assigned to channels considered to be more transferable within the LHBS loss calculation, consequently directing the model’s focus towards these informative channels.
OSUDA is characterized by the integration of unsupervised self-entropy (SE) minimization and a novel Memory-Consistent Self-Training (MCSF) strategy utilizing a queue. High-confidence predictions on the target domain are promoted through SE minimization. A dynamic queue is utilized by MCSF to store historical prediction information, facilitating the selection of reliable pseudo-labels for self-training. Consistency between current and past predictions is enforced, which contributes to the enhancement of performance and stability on the target domain.
OSUDA was validated across multiple medical image segmentation tasks, including cross-modality and cross-subtype brain tumor segmentation, as well as cardiac MR-to-CT segmentation. Experimental results indicate that OSUDA outperforms existing source-relaxed UDA methods and achieves performance comparable to UDA methods that utilize source data. This highlights OSUDA’s capability to effectively transfer knowledge while preserving source-domain data privacy, demonstrating significant practical applicability.

5.2.2. Advancements in UDA via Adversarial Training

Distinct from the source-data-free methodologies detailed in the preceding section, another significant category of UDA enhancements is centered on the learning of domain-invariant features through adversarial training. The primary objective of these approaches is to mitigate the discrepancies between domains, thereby enabling the model to attain robust performance on the target domain.
The ODADA framework proposed by Sun et al. (2022) represents a characteristic adversarial learning-based UDA methodology [94]. However, diverging from conventional adversarial training approaches, ODADA separates features into domain-invariant and domain-specific components, using adversarial training to enhance generalizability. ODADA decomposes features into a domain-invariant representation (DIR) and a domain-specific representation (DSR), employing a novel orthogonality loss function to encourage independence between the DIR and DSR components. As illustrated in Figure 8, the ODADA architecture utilizes a shared feature extractor to process input images from both source and target domains, generating an initial mixed feature representation, denoted as $F_{both}$. The core of this framework lies in the explicit decomposition of $F_{both}$; a dedicated, learnable module termed the DIR Extractor is responsible for extracting the domain-invariant features, $F_{di}$, from $F_{both}$. Concurrently, the domain-specific features, $F_{ds}$, are computed via a non-parametric difference layer (diff layer), where
$F_{ds} = F_{both} - F_{di}$
This design has been established to ensure that the decomposition process yielding $F_{ds}$ is lossless and does not necessitate the learning of independent parameters. The extracted $F_{di}$ is employed in two subsequent tasks: initially, it is introduced into a Segmentor to predict segmentation results, supervised exclusively on the source domain via a segmentation loss; subsequently, it is forwarded through a Gradient Reversal Layer (GRL) before being input to a first domain classifier, where an adversarial loss compels the alignment of $F_{di}$, thereby reinforcing its domain invariance. Conversely, the computed $F_{ds}$ is directly input into a second domain classifier, which is also trained utilizing an adversarial loss. Critically, however, no GRL is utilized in this context; the objective is to maximize the capability of this classifier to differentiate domains based on $F_{ds}$. This mechanism inversely incentivizes the DIR Extractor to maximally isolate the domain-specific information within $F_{ds}$, thereby clarifying $F_{di}$. Ultimately, an orthogonal loss is enforced between $F_{di}$ and $F_{ds}$ to explicitly encourage the independence of these two representation components. Through the integration of orthogonal decomposition, targeted adversarial training, and independence constraints, the entire architecture has been designed to learn purified and more effective domain-invariant features suitable for cross-domain segmentation.
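The decomposition and orthogonality constraint can be sketched as follows; the concrete form of the orthogonality loss (squared cosine similarity) is an assumption made for illustration, and the two domain classifiers and the gradient reversal layer are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def decompose_and_orthogonality(f_both, dir_extractor):
    """Feature decomposition and orthogonality constraint (sketch).

    f_both:        (B, C, H, W) mixed features from the shared extractor
    dir_extractor: learnable module producing the domain-invariant part F_di
    """
    f_di = dir_extractor(f_both)
    f_ds = f_both - f_di                      # non-parametric difference layer

    # encourage independence: drive the cosine similarity of the two parts toward zero
    a = f_di.flatten(1)
    b = f_ds.flatten(1)
    cos = F.cosine_similarity(a, b, dim=1)
    ortho_loss = cos.pow(2).mean()
    return f_di, f_ds, ortho_loss
```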
The ODADA model demonstrated superior experimental outcomes across several challenging medical image segmentation tasks. It significantly outperformed conventional adversarial learning-based UDA methods (e.g., DANN, ADDA) as well as image translation-based approaches (e.g., CycleGAN) on three public datasets: cross-center prostate MRI segmentation, cross-center COVID-19 CT lesion segmentation, and cross-modality cardiac (MRI/CT) segmentation. For instance, on the prostate dataset, a 7.73% improvement in Dice score was achieved by ODADA compared to DANN. ODADA's plug-and-play capability was also validated experimentally: integrating it as an adversarial module into other state-of-the-art UDA frameworks resulted in further performance enhancements, underscoring the method's effectiveness and versatility.
The SMEDL method [95] proposed by Cai et al. (2025) builds upon adversarial training by further incorporating concepts of style mixup and dual-domain invariance learning. The central tenet of SMEDL is to enhance model generalization by implicitly generating mixed domains with diverse styles and to improve robustness through dual-domain invariance learning. SMEDL employs a disentangled style mixup (DSM) strategy, utilizing a shared feature extractor alongside independent content and style extractors to decompose images into content and style features. Subsequently, convex combinations of style features from different domains are performed to generate mixed-style features, thereby implicitly creating multiple style-mixed domains.
Moreover, SMEDL introduces a dual-domain invariance learning mechanism, comprising: Intra-domain contrastive learning (ICL), which performs contrastive learning within the source and target domains separately, thereby encouraging the model to learn invariance between features possessing the same semantic content but perturbed by different styles; and inter-domain adversarial learning (IAL), which conducts adversarial learning between two style-mixed domains. A discriminator is trained to distinguish between the mixed domains while simultaneous training of the feature extractor occurs to generate domain-invariant features, capturing invariance between features exhibiting the same mixed style but varying semantic content. This approach enables SMEDL to leverage both intra-domain and inter-domain variations for learning robust domain-invariant representations. SMEDL achieves domain adaptation through style mixup and dual-domain invariance learning without requiring image translation or diversity regularization, offering a more concise and efficient solution.
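The style mixup step can be sketched as a Beta-weighted convex combination of per-image style codes from the two domains; the Beta parameter and the shape of the style features are illustrative assumptions.

```python
import torch

def style_mixup(style_src, style_tgt, alpha=0.4):
    """Convex combination of style features from two domains (sketch).

    style_src, style_tgt: (B, C) style codes produced by the style extractor.
    A Beta-distributed coefficient yields a continuum of implicitly generated
    style-mixed domains.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample((style_src.size(0), 1))
    lam = lam.to(style_src.device)
    return lam * style_src + (1.0 - lam) * style_tgt
```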
SMEDL was validated comprehensively on two public cardiac datasets and one brain dataset. Significant performance improvements were achieved by SMEDL when compared to state-of-the-art unsupervised domain adaptation (UDA) methods for medical image segmentation. The effectiveness of SMEDL in addressing cross-modality medical image segmentation tasks was thereby demonstrated.

5.2.3. UDA Improvements Based on Semantic Preservation

Divergence from source-free and adversarial training-based approaches is observed in the Dual Domain Distribution Disruption with Semantics Preservation (DDSP) method proposed by Zheng et al. [96]. A distribution disruption module is introduced by DDSP to actively and broadly alter the image distributions of both source and target domains, while strong constraints based on semantic information are employed. This design compels the model to move beyond reliance on specific domain distribution details and focus on intrinsic, domain-invariant anatomical structural information, thereby achieving domain-agnostic capabilities.
In its implementation, DDSP has been characterized by the utilization of a dual-domain distribution disruption strategy, which simultaneously perturbs source- and target-domain images through non-learning transformation functions such as Shuffle Remap. A key aspect of this approach is the asymmetric application of perturbation magnitude: greater perturbation is assigned to the source domain, which contains label information, thereby leveraging strong semantic supervision (i.e., segmentation loss) to facilitate the learning of distribution robustness by the model. Conversely, a lesser degree of perturbation is applied to the unlabeled target domain, accompanied by a semantic consistency loss (Lsec) that constrains prediction consistency prior to and following perturbation. This methodology guides the model to adapt to the characteristics of the target domain while mitigating the risk of excessive noise introduction. Additionally, a Cross-Channel Similarity Feature Alignment (IFA) mechanism has been innovatively introduced within DDSP. This mechanism relies on the established concept that cross-domain anatomical structures maintain consistency in terms of semantic information and relative volume ratios. By aligning the channel-wise similarity matrices of feature maps from both the source and target domains, IFA encourages the channel features of the target domain to reflect a structural emphasis that is consistent with the source domain, thus leading to a significant enhancement in the accuracy of the shared classifier during the processing of target-domain features.
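The IFA idea can be sketched as an alignment of channel-by-channel similarity matrices computed from source- and target-domain feature maps; the per-channel normalization and the choice of MSE as the alignment penalty are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def channel_similarity_alignment(feat_src, feat_tgt):
    """Align the channel-wise similarity structure across domains (sketch).

    feat_src, feat_tgt: (B, C, H, W) feature maps of source and target batches.
    """
    def channel_sim(feat):
        b, c, h, w = feat.shape
        f = feat.view(b, c, h * w)
        f = F.normalize(f, dim=2)                 # unit-norm per channel
        return torch.bmm(f, f.transpose(1, 2))    # (B, C, C) channel similarity matrices

    return F.mse_loss(channel_sim(feat_src), channel_sim(feat_tgt))
```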
In bidirectional cross-modality segmentation experiments conducted on the MMWHS17 cardiac dataset, exceptional performance was demonstrated by DDSP. When compared to the preceding state-of-the-art method, ODADA [94], an average Dice score improvement of 7.9% to 9.6% was achieved, significantly narrowing the gap to the fully supervised baseline. On the PRO12 prostate cross-center segmentation task, superior results were also obtained by DDSP, yielding Dice scores of 76.6% and 83.0%. These quantitative results, spanning multiple tasks and datasets, consistently validate the effectiveness of the DDSP framework. It is indicated that SOTA performance is achieved by DDSP in overcoming the limitations of GANs and enhancing cross-domain medical image segmentation, while the potential to approach fully supervised performance levels is also highlighted.
This section reviewed unsupervised domain adaptation methods for medical image segmentation, focusing on three primary strategies: source-free approaches, adversarial training-based methods, and techniques based on distribution disruption and semantic preservation. Table 7 reports the Dice coefficient and average symmetric surface distance achieved by representative UDA techniques on the MM-WHS challenge dataset, enabling a quantitative comparison of their performance.

6. Comparison

In the preceding sections, we systematically reviewed commonly used datasets in the field of medical image segmentation and, based on the differences in supervisory information, provided a detailed classification and exposition of various non-fully supervised learning methods. Although these limited supervision methods can, to varying degrees, effectively alleviate the challenges posed by the over-reliance of fully supervised learning on extensive, high-quality pixel-level annotated data, representative techniques within each category still exhibit significant differences in terms of performance improvement potential, optimal application scenarios, and inherent limitations. To foster a profound understanding of these distinctions, this section aims to conduct a more in-depth comparison and evaluation of these non-fully supervised approaches.
Specifically, we first conduct a detailed comparative analysis of representative methods in semi-supervised learning, weakly supervised learning, and unsupervised learning from three key dimensions: core mechanisms, performance advantages, and typical application scenarios. The core findings of this analysis are summarized in Table 8 and Table 9. Furthermore, we systematically summarize the performance improvements achievable by each class of methods and emphasize the practical application value of these improvements in real-world clinical settings. To enhance the critical perspective of this review, we also delve into the inherent limitations of each methodological paradigm, the primary technical challenges currently faced, and the trade-off considerations necessary when selecting methods in practice.

6.1. Semi-Supervised Medical Image Segmentation Methods

In the domain of semi-supervised learning for medical image segmentation, research outcomes and performance improvements have been substantial, primarily categorized into the following three complementary technical directions: Firstly, consistency regularization strategies have been continually refined. Researchers have enabled models to more effectively learn stable and reliable feature representations from unlabeled data by introducing approaches such as leveraging fuzziness assessment (e.g., AC-MT [47]) to more intelligently manage regions of predictive uncertainty, integrating anatomical prior knowledge (e.g., AAU-Net [48]) to ensure the clinical plausibility of segmentation results, and employing multi-head mutual learning with multi-level perturbation mechanisms (e.g., CMMT-Net [51]) to enhance model robustness across diverse clinical data sources. These innovations significantly lower the barrier for large-scale, fine-grained annotation in clinical practice. This not only accelerates the deployment and application of artificial intelligence technologies in medical institutions with limited resources or high data heterogeneity but also enhances the capability to identify and delineate complex lesion morphologies, thereby providing robust support for precise diagnosis.
Secondly, the optimization of pseudo-label generation and utilization mechanisms has also yielded significant achievements. For instance, through dual-network mutual learning supplemented by reliability assessment (e.g., MLRPL [54]) to screen for high-quality pseudo-labels approaching the fidelity of true annotations, or by adopting prototype learning combined with dynamic correction for three-dimensional data (e.g., CRLN [56]) to refine 3D pseudo-labels, models can achieve high-precision segmentation even with a scarcity of labeled data. These technological advancements directly support quantitative analysis tasks in clinical settings that require precise volumetric measurements, such as tumor burden assessment and therapeutic efficacy tracking, and provide critical information for refined surgical and radiotherapy planning. Furthermore, this has facilitated the direct and efficient processing of mainstream 3D clinical images (e.g., CT, MRI) by artificial intelligence, expanding its potential in advanced applications like 3D reconstruction and multi-modal fusion.
Thirdly, the integration of contrastive learning has further augmented the model’s discriminative power for subtle feature differences. Research efforts have enabled models to more accurately capture subtle information crucial for diagnosis by utilizing context-aware consistency and cross-consistency training (e.g., CRCFP [45]) to improve the differentiation of histological features, or by employing structured feature perturbation combined with edge-aware contrastive learning strategies (e.g., the method proposed by Yang et al. (2025) [46]) to sharpen the perception of lesion boundaries. In clinical practice, this translates into improved accuracy of automated pathological image analysis and significantly enhanced capability for precise delineation of critical details such as tumor infiltration margins and minute early-stage lesions, thereby establishing a solid foundation for early diagnosis, personalized treatment planning, and reliable prognostic assessment.
In terms of technological maturity, semi-supervised learning methods predicated on consistency regularization and pseudo-label generation strategies, particularly in scenarios characterized by the availability of a limited yet high-quality labeled dataset and minimal domain shift, have demonstrated a considerable degree of maturity and clinical application potential. Indeed, some refined models are approaching or have reached a level suitable for deployment in specific research or auxiliary diagnostic settings. Conversely, SSL approaches that integrate sophisticated mechanisms (e.g., advanced contrastive learning, multi-head mutual learning) or impose stringent assumptions on the distribution of unlabeled data, despite reporting excellent performance, often on benchmark datasets, find that their robustness, generalization capability, and sensitivity to hyperparameters across diverse clinical datasets necessitate further extensive large-scale clinical validation. Consequently, these advanced approaches are predominantly in an active phase of experimental exploration and optimization.
However, despite the encouraging progress in this field, we must soberly acknowledge the inherent limitations of semi-supervised learning. These methods intrinsically rely on the quality and representativeness of the initial small set of labeled data and typically assume that unlabeled data share a related distribution with labeled data, which restricts their performance when significant domain shifts or systematic biases in labeled data are present. Concurrently, mechanisms reliant on pseudo-labels carry the risk of error accumulation, while consistency regularization may reinforce the model’s “confirmation bias”, hindering its ability to proactively discover entirely new patterns that deviate significantly from initial knowledge. Furthermore, many advanced SSL methods exhibit high model complexity and sensitivity to hyperparameter tuning, which also increases the difficulty of understanding, implementation, and optimization.
Transitioning these methods into practical clinical applications also presents numerous challenges. Unavoidable annotation errors within the limited labeled data can be amplified. The inherent high heterogeneity of medical images means that the generalization capabilities of models across different centers or devices often remain insufficient, with domain adaptation issues remaining prominent. Simultaneously, some SSL methods impose high demands on computational resources, universally effective medical image perturbation strategies are difficult to establish, and the evaluation and debugging of models are rendered more complex because they learn from unlabeled data. Ultimately, establishing clinician trust in these “self-learning” models and enhancing their interpretability are key obstacles to their clinical adoption.
Therefore, when designing and applying semi-supervised learning strategies, meticulous trade-offs must be made. The primary consideration is to strike a balance between acceptable annotation costs and the desired clinical performance. Specifically, in methodological design, a trade-off is required between the confidence threshold for pseudo-labels (which governs label quality) and the utilization rate of unlabeled data (which governs data efficiency). Additionally, the strength of the consistency loss must be judiciously set to balance learning from unlabeled data against maintaining training stability. Further, trade-offs between model complexity (potentially yielding higher performance) and its usability and generalization capability are essential. It is also crucial to acknowledge the model’s dependence on both the “quality” and “quantity” of unlabeled data—it is not merely a case of “the more, the better”—requiring comprehensive consideration within specific clinical contexts.

6.2. Weakly Supervised Medical Image Segmentation Methods

Turning to the domain of weakly supervised learning, research has concentrated on training segmentation models using weak labels that are more readily obtainable than pixel-level annotations, such as image-level labels, bounding boxes, points, or scribble annotations. Significant progress has also been achieved in this endeavor, primarily classifiable into the following key technical directions.
A principal research focus has been the continuous optimization of strategies for generating pixel-level localization information from image-level labels. Researchers have extensively utilized Class Activation Maps (CAMs) and their variants to initially localize target regions. For instance, multi-modal learning strategies leverage complementary information between modalities and cross-modal consistency constraints to enhance the completeness and accuracy of CAMs (e.g., the work by G. Patel et al. (2022) [71]), while self-supervised CAM optimization (e.g., the work by Chikontwe et al. (2022) [70]) progressively transforms coarse activation maps into more refined segmentation masks through iterative refinement and spatial regularization. These innovations make it possible to generate segmentation results with a reasonable degree of localization accuracy from image-level labels alone, significantly reducing annotation costs. This is particularly applicable to clinical scenarios such as large-scale screening and preliminary lesion detection, facilitating the rapid identification of potentially abnormal regions and assisting physicians in initial diagnosis.
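For reference, the sketch below follows the canonical CAM computation of Zhou et al. [8], in which the classifier weights of a target class re-weight the last convolutional feature maps; the toy network and function names are our own illustrative assumptions, and the multi-modal and self-supervised refinements discussed above build on this basic recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    """Toy CNN whose last conv features feed a GAP + linear head,
    the structure assumed by the original CAM formulation [8]."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32, num_classes)  # weights reused to build the CAM

    def forward(self, x):
        feats = self.features(x)              # (B, 32, H, W)
        pooled = feats.mean(dim=(2, 3))       # global average pooling
        return self.fc(pooled), feats

def class_activation_map(model, image, target_class):
    """Weight the last conv feature maps by the classifier weights of one class."""
    _, feats = model(image)
    w = model.fc.weight[target_class]                      # (32,)
    cam = torch.einsum("c,bchw->bhw", w, feats)            # weighted sum over channels
    cam = F.relu(cam)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # normalize to [0, 1]
    return F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)

model = TinyClassifier()
cam = class_activation_map(model, torch.randn(1, 1, 128, 128), target_class=1)
print(cam.shape)  # torch.Size([1, 128, 128])
```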
In parallel, the Multiple-Instance Learning (MIL) paradigm has been leveraged to manage the inherent uncertainty between weak labels and pixel-level predictions, also with substantial results. For example, SA-MIL [76] effectively captures long-range dependencies between pixels by integrating self-attention modules at different stages of the network, combined with deep supervision strategies, thereby enhancing the ability to learn fine-grained segmentation from image-level labels. Furthermore, fusing anomaly detection concepts (such as GANomaly in ANGUS [79]) with MIL, which learns normal patterns from healthy images and then guides segmentation by identifying deviations from those patterns, enables models to handle complex lesion morphologies and inter-class variations more robustly. These advances allow models to learn relatively accurate pixel-level segmentations from very coarse image-level information, clinically providing effective low-cost solutions for tasks such as pathological image analysis (e.g., cancer cell region identification) and the simultaneous segmentation of multiple lesion types in OCT images.
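The core MIL assumption, that a positively labeled image contains at least one positive pixel, can be expressed with a simple pooling operator over pixel scores, as in the minimal sketch below; attention modules and anomaly guidance, as used in SA-MIL and ANGUS, are built on top of this idea and are not shown.

```python
import torch
import torch.nn.functional as F

def mil_image_loss(pixel_logits, image_labels):
    """Multiple-instance loss sketch: supervise pixel scores with image labels.

    pixel_logits: (B, 1, H, W) per-pixel lesion scores from the segmentation head
    image_labels: (B,) binary image-level labels (1 = lesion present)
    The bag score is the maximum pixel score, reflecting the MIL assumption
    that a positive image contains at least one positive pixel.
    """
    bag_logits = pixel_logits.flatten(start_dim=1).max(dim=1).values  # (B,)
    return F.binary_cross_entropy_with_logits(bag_logits, image_labels.float())

print(mil_image_loss(torch.randn(4, 1, 64, 64), torch.tensor([1, 0, 1, 0])))
```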
Moreover, the emergence of weakly semi-supervised learning strategies has further enhanced the performance and generalization of weakly supervised segmentation. Such methods combine a small number of sparse yet informative weak labels (e.g., scribbles or point annotations) with a large volume of unlabeled data or data carrying only image-level labels. For instance, the SOUSA framework [80] integrates scribble annotations with consistency learning, using geodesic distance maps and a multi-angle projection reconstruction loss to fully exploit the spatial information in sparse annotations and the consistency priors of unlabeled data. PointSEGTR [81] explores a teacher–student framework that fuses a small number of pixel-level annotations with a large number of point-level annotations, improving the robust use of point annotations through multi-point consistency and symmetric consistency losses. In clinical practice, this means that even when complete pixel-level annotations are unavailable, segmentation tools approaching, or even surpassing, models trained only on the stronger weak labels can be obtained by combining a small number of easily provided sparse interactive annotations with a large volume of readily available weakly labeled or unlabeled data, dramatically improving annotation efficiency and the applicability of models to real-world clinical data.
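A common building block of such sparse-annotation pipelines is a partial cross-entropy evaluated only on scribbled or clicked pixels; the minimal sketch below illustrates this loss under our own naming conventions, while the geodesic, projection, and consistency terms of SOUSA and PointSEGTR are omitted.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = 255  # conventional value for pixels without any scribble/point label

def partial_cross_entropy(logits, sparse_labels):
    """Cross-entropy computed only on annotated (scribbled) pixels.

    logits:        (B, C, H, W) segmentation outputs
    sparse_labels: (B, H, W) with class indices on scribbles and IGNORE_INDEX elsewhere
    """
    return F.cross_entropy(logits, sparse_labels, ignore_index=IGNORE_INDEX)

# Toy example: only a handful of pixels carry labels.
logits = torch.randn(1, 3, 32, 32, requires_grad=True)
labels = torch.full((1, 32, 32), IGNORE_INDEX, dtype=torch.long)
labels[0, 10, 10] = 1   # a foreground scribble pixel
labels[0, 5, 20] = 0    # a background scribble pixel
print(partial_cross_entropy(logits, labels))
```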
In the domain of weakly supervised learning, methods leveraging image-level labels to generate Class Activation Maps for initial localization are relatively mature and have found initial application in scenarios such as large-scale screening. However, their maturity for direct application in precise segmentation is limited, often serving as a pre-processing or auxiliary step. WSL approaches based on stronger sparse annotations, such as points, lines, or scribbles, along with certain well-designed Multiple-Instance Learning frameworks, have demonstrated promising application prospects and a higher degree of maturity for specific tasks (e.g., tumor region identification, organ contour delineation). Some of these approaches have been validated in specific clinical studies and are progressively transitioning towards clinical auxiliary tools. Conversely, WSL techniques involving complex label transformations or relying on elaborate post-processing steps still require improvements in terms of stability and usability and are largely in the research and development phase. Weakly semi-supervised learning, an emerging direction that combines the advantages of both paradigms, exhibits substantial potential; however, most methods remain in an active research and validation phase and are not yet ready for widespread clinical deployment.
However, despite the substantial potential demonstrated by weakly supervised learning, we must remain cognizant of its inherent limitations. Deducing pixel-level segmentation from the weakest image-level labels is in itself an ill-posed problem; models tend to focus on the most discriminative yet potentially incomplete regions and struggle with multi-instance or low-contrast scenarios. Even with more informative bounding-box, point, or scribble annotations, the unavoidable introduction of background noise, annotation sparseness, and potential inaccuracies can lead to blurred segmentation boundaries, internal holes, or an insufficient ability to capture complex topological structures and subtle texture differences. Furthermore, weakly supervised models are typically more sensitive to network architecture, loss function design, and post-processing steps, requiring meticulous tuning to achieve satisfactory results. Evaluating their true performance is also more difficult, owing to the lack of pixel-level gold standards and a greater susceptibility to dataset bias.
In the process of translating these promising methods into practical clinical applications, numerous real-world challenges will inevitably be encountered. The information content and noise levels vary dramatically across different types of weak labels, and effectively fusing multiple weak label types is itself a complex problem. The size, shape, and number of target objects in medical images exhibit considerable variability, and weak labels often fail to comprehensively reflect these variations, potentially limiting model generalization. Particularly for complex scenes with multiple disconnected instances or low contrast between the target and background, weakly supervised methods find it especially difficult to achieve precise segmentation results. Concurrently, evaluating the true performance of weakly supervised segmentation is more challenging, as pixel-level “gold standards” for direct comparison are often unavailable.
Consequently, when designing and applying weakly supervised learning strategies, prudent trade-off considerations are imperative. The core consideration lies in identifying the optimal balance between acceptable annotation cost/complexity and the desired segmentation accuracy/completeness. For instance, image-level labels offer the lowest cost but may also lead to the lowest achievable precision, whereas scribble annotations, while slightly more costly, can provide stronger localization guidance. At the model-design level, a trade-off must be struck between the strength of guidance from weak labels and the avoidance of overfitting to their inherent biases. Simultaneously, judicious choices must be made between model complexity (e.g., incorporating sophisticated attention mechanisms or iterative optimization procedures) and training efficiency, as well as final interpretability, to ensure that models are not only performant but also efficiently deployable in clinical environments and readily understood and trusted by clinicians.

6.3. Unsupervised Medical Image Segmentation Methods

Finally, we direct our attention to the field of unsupervised medical image segmentation, which aims to achieve automatic identification and delineation of regions of interest in images entirely without pixel-level annotations. In recent years, with the advancement of deep learning techniques, unsupervised segmentation methods have also made notable progress; they are primarily classifiable into the following key technical directions.
A prominent research direction involves leveraging Autoencoders and their variants for anomaly detection and segmentation. Such methods typically train models on a large corpus of healthy or “normal” images to learn the intrinsic data distribution and reconstruction capabilities. For instance, the framework proposed by Silva-Rodríguez et al. (2022) [87] learns attention patterns from unlabeled normal images by constraining the attention maps of an intermediate layer in a Variational Autoencoder and combining this with an augmented Lagrangian method or Shannon entropy maximization strategy. Consequently, during inference, anomalous manifestations in the attention maps are used to localize and segment pathological regions. Pinaya et al. (2022) [88] ingeniously combined a Vector Quantized Variational Autoencoder (VQ-VAE) with a Transformer. Initially, the VQ-VAE compresses healthy brain images into discrete latent representations. Subsequently, a Transformer is trained to learn the statistical patterns of these normal latent sequences. Ultimately, fine-grained anomaly segmentation is achieved by identifying “unexpected” patterns (low-probability sequences) in the latent sequences of test images, coupled with residual map filtering. These innovations enable models to distinguish regions deviating from normal patterns from the background without any pathological annotations. Clinically, this holds significant value for the identification of rare diseases, preliminary exploratory discovery of novel lesions, and unbiased segmentation in scenarios lacking any prior knowledge, being particularly instrumental in uncovering unexpected abnormalities.
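The shared principle of these approaches, learning to reconstruct normal anatomy and segmenting what cannot be reconstructed, is illustrated by the minimal residual-map sketch below; it is a deliberately simplified stand-in and does not implement the attention-constrained VAE [87] or the VQ-VAE plus Transformer pipeline [88].

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Minimal convolutional autoencoder trained on healthy images only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_mask(model, image, threshold=0.2):
    """Segment pixels whose reconstruction error exceeds a threshold."""
    model.eval()
    with torch.no_grad():
        recon = model(image)
        residual = (image - recon).abs()   # per-pixel reconstruction error
    return (residual > threshold).float()  # 1 = candidate anomalous pixel

model = TinyAE()
mask = anomaly_mask(model, torch.rand(1, 1, 64, 64))
print(mask.shape, mask.mean())             # fraction of pixels flagged
```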
Concurrently, unsupervised domain adaptation techniques have achieved remarkable success in addressing cross-domain segmentation problems, developing a diverse array of strategies. One critical class of strategies is source-data-free UDA, which focuses on enabling model transfer without access to source data (often due to privacy considerations). This is achieved, for example, by learning a proxy for the source-domain feature distribution (e.g., the work by Stan and Rostami (2024) [92]) or by cleverly utilizing the internal statistics of pre-trained models (e.g., the OSUDA framework [93]). These methods are of substantial clinical importance for protecting data privacy and rapidly adapting to new environments, significantly reducing re-annotation costs. Another category of UDA strategies places greater emphasis on enhancing cross-domain consistency at the feature or semantic level. Examples include disentangling and aligning domain-invariant features through adversarial learning (e.g., the ODADA framework [94]), introducing style mixing and dual-domain invariance learning (e.g., the SMEDL method [95]) to enhance generalization, or enforcing the learning of intrinsic anatomical structural information through distribution perturbation and semantic feature alignment (e.g., the DDSP method [96]). Clinically, all these methods contribute to overcoming performance bottlenecks caused by multi-source data discrepancies, improving model robustness and consistency. Semantic preservation strategies, in particular, exhibit unique advantages when handling cross-domain tasks with substantial modal or contrast differences.
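As a schematic illustration of adversarial feature alignment, the sketch below shows one update of a domain discriminator followed by one adversarial update of the feature encoder; the architectures, optimizers, and batch sizes are illustrative assumptions, the supervised segmentation loss on source data is omitted, and the snippet does not reproduce any specific framework such as ODADA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative components; in practice the encoder is the segmentation backbone.
encoder = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())      # (B, 8) features
discriminator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(encoder.parameters(), lr=1e-3)

source = torch.rand(4, 1, 32, 32)   # labeled source-domain batch
target = torch.rand(4, 1, 32, 32)   # unlabeled target-domain batch

# Step 1: the discriminator learns to tell source from target features.
d_logits = discriminator(torch.cat([encoder(source), encoder(target)]).detach())
d_labels = torch.cat([torch.ones(4, 1), torch.zeros(4, 1)])
d_loss = F.binary_cross_entropy_with_logits(d_logits, d_labels)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Step 2: the encoder tries to make target features look like source features,
# which (together with the supervised segmentation loss on source data, omitted
# here) encourages domain-invariant representations.
adv_loss = F.binary_cross_entropy_with_logits(discriminator(encoder(target)),
                                              torch.ones(4, 1))
opt_e.zero_grad(); adv_loss.backward(); opt_e.step()
print(float(d_loss), float(adv_loss))
```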
The overall technological maturity of unsupervised medical image segmentation methods remains lower than that of semi-supervised and some more established weakly supervised approaches. Autoencoder-based anomaly detection methods are conceptually attractive, particularly for identifying unknown or rare pathologies. However, they still face challenges in defining a universally applicable “normal” pattern within clinical contexts and ensuring the clinical significance of the segmentation outcomes. Consequently, these methods are predominantly in the experimental exploration and proof-of-concept stages, better suited for exploratory research than for precise diagnostic applications. Unsupervised domain adaptation techniques, especially those aimed at aligning discrepancies between specific modalities, have made significant strides at the research level. Certain adversarial learning and feature alignment strategies demonstrate potential in mitigating data heterogeneity. Nevertheless, considerable effort is still required to develop robust, “plug-and-play” UDA solutions suitable for routine clinical deployment. Methods that depend on intricate models or impose strong assumptions on data distributions, in particular, exhibit relatively lower maturity and are primarily confined to research prototype validation.
However, despite the unique advantages and potential demonstrated by unsupervised learning, we must soberly acknowledge its inherent limitations. Unsupervised methods typically struggle to achieve the optimal performance of supervised methods on specific tasks due to the lack of explicit pixel-level guidance. Anomaly detection methods can be highly sensitive to the definition of “normal” patterns and may find it challenging to distinguish true pathological abnormalities from benign individual variations or artifacts. While domain adaptation methods can transfer knowledge, their effectiveness can significantly degrade, or even result in negative transfer, when the discrepancy between source and target domains is excessively large. Concurrently, many unsupervised methods rely on complex network architectures and sophisticated loss function designs, and their interpretability and stability are sometimes difficult to guarantee.
In the process of translating these promising methods into practical clinical applications, numerous real-world challenges will inevitably be encountered. Defining a universally applicable and robust “normal” or “healthy” pattern is crucial for anomaly detection, yet this is inherently a complex issue in clinical practice. For tasks requiring the segmentation of multiple specific classes or anatomical structures, simple anomaly detection or domain adaptation may not suffice, necessitating more refined semantic understanding. The evaluation of unsupervised methods is also inherently more challenging due to the absence of direct pixel-level ground truth for quantitative comparison, often relying on indirect metrics or manual assessment. Furthermore, ensuring that unsupervised methods do not learn biases present in the data or introduce new, unpredictable behaviors is critical for their safe application.
Consequently, when designing and applying unsupervised learning strategies, prudent trade-off considerations are imperative. The core consideration lies in striking a balance between the convenience of eliminating the need for annotation entirely and achieving acceptable segmentation accuracy and specificity. For example, anomaly detection might rapidly identify unknown abnormalities, but precise boundary and class definitions may be lacking. In domain adaptation, a trade-off must be made between the effective transfer of source-domain knowledge and the avoidance of interference from target-domain noise or irrelevant information. Simultaneously, choices must be made between model innovation and complexity (which might lead to a deeper understanding of the data) and the stability of training and reproducibility of results, to ensure that models are not only valuable at the research level but can also operate robustly in clinical environments and provide meaningful auxiliary information.

7. Discussion

Significant advancements in deep learning-based medical image segmentation approaches operating under limited supervision have been observed in recent years. These methods effectively reduce the reliance on the large-scale, pixel-precise annotations typically required by fully supervised techniques and have consequently seen widespread application in diverse medical image analysis tasks. Nonetheless, critical challenges remain, including constrained segmentation accuracy, a frequent restriction to single-class object segmentation, and inadequate modeling of long-range dependencies among pixels. In light of these issues, this section provides an overview of representative application scenarios (Section 7.1), followed by an in-depth exploration of future research directions and development trends within this domain (Section 7.2).

7.1. Applications

Deep learning has emerged as a significant technology in medical image segmentation, and paradigms that operate with limited supervision form a crucial foundation for its extensive deployment. This section highlights typical application scenarios of these non-fully supervised methods in medical image segmentation.
Auxiliary diagnosis: In auxiliary diagnosis, approaches under limited supervision significantly mitigate the dependency on large volumes of pixel-level ground truth annotations. WSL enables the training of segmentation models with more readily available weak labels, such as image-level tags, bounding boxes, or point annotations. For instance, Class Activation Maps or Multiple-Instance Learning frameworks can localize and preliminarily segment lesion areas based solely on image-level labels indicating the presence or absence of pathology, providing valuable indicative information for radiologists and being particularly suited to large-scale screening or the detection of atypical lesions [102,103]. SSL, in turn, markedly improves segmentation performance by integrating a small amount of precisely annotated data with substantial amounts of unlabeled data. Strategies such as consistency regularization and pseudo-labeling have demonstrated efficacy in tasks including lung nodule, skin lesion, and retinal vessel segmentation, achieving competitive accuracy at a considerably lower annotation cost than fully supervised methods and thereby facilitating earlier and more accurate identification of disease indicators [49].
Surgical planning: Surgical planning requires precise segmentation of patient-specific anatomical structures (e.g., tumors, organs, vessels). Methods utilizing limited supervision can significantly accelerate this process and enhance adaptability to diverse data sources. SSL leverages existing, albeit limited, high-quality segmentation data (potentially from disparate sources or prior cases) together with the current patient’s unlabeled pre-operative images to rapidly generate the personalized 3D anatomical models needed to accommodate individual anatomical variation [104]. WSL, particularly when combined with interactive methodologies (e.g., clinician-provided sparse clicks or bounding boxes), can yield segmentation results adequate for planning within minutes, considerably faster than exhaustive manual delineation, while ensuring the accuracy of critical structures [105]. Unsupervised domain adaptation is crucial when models trained on standard datasets or specific imaging devices are deployed to distinct surgical cases (potentially involving different scanning parameters or equipment); the model can be adjusted without target-case labels, curtailing performance degradation attributable to domain shift and ensuring planning reliability [106].
Treatment response assessment and longitudinal monitoring: Tracking disease progression or evaluating therapeutic efficacy requires consistent and reproducible segmentation across serial imaging time points, and manual processing of extensive longitudinal datasets is practically infeasible. SSL is particularly effective in this setting: by using annotation information from selected time points (e.g., baseline) and incorporating temporal consistency constraints (e.g., assuming minimal or smooth structural change over short intervals), accurate segmentation of images from the remaining unlabeled time points can be achieved, enabling reliable quantitative tracking of metrics such as lesion volume and morphology [107]. WSL can exploit coarse information about change (e.g., clinician assessments of “increase/stable/decrease”) or global measurements to guide the segmentation model. Unsupervised change detection methods directly compare images from different time points, highlighting regions with significant structural or intensity alterations without requiring prior definition or segmentation of specific structures, thus aiding the rapid identification of abnormal changes or the assessment of treatment-induced tissue alterations [108].
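The temporal consistency assumption mentioned above can be encoded, in its simplest form, as a penalty on disagreement between predictions for registered scans from adjacent time points; the sketch below is an illustrative simplification rather than a specific published formulation, and the function name and weighting are our own assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(probs_t0, probs_t1, change_weight=1.0):
    """Penalize disagreement between predictions at consecutive time points.

    probs_t0, probs_t1: (B, C, H, W) softmax outputs for spatially registered
    scans at baseline and follow-up. The assumption that structures change
    slowly between nearby time points is only approximate, so the weight is
    typically kept small or relaxed around known lesion sites.
    """
    return change_weight * F.mse_loss(probs_t0, probs_t1)

p0 = torch.softmax(torch.randn(1, 3, 64, 64), dim=1)
p1 = torch.softmax(torch.randn(1, 3, 64, 64), dim=1)
print(temporal_consistency_loss(p0, p1))
```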
Image data standardization and quality control: The inherent heterogeneity of medical image data—stemming from variations in imaging devices, centers, and protocols—constitutes a primary obstacle to the generalization capability and widespread clinical adoption of DL models. UDA represents a key technology to address this challenge. By aligning feature distributions or image styles between the source domain (labeled) and the target domain (unlabeled), model performance on unseen target-domain data can be significantly boosted [109,110]. This is essential for the development of tools amenable to reliable utilization in multi-center studies or stable operation across diverse clinical environments, ensuring the consistency and comparability of analysis results. For example, UDA methods predicated on adversarial learning or feature moment matching have been extensively applied in cross-device segmentation tasks involving brain, cardiac, and abdominal organs, effectively enhancing model robustness and the degree of standardization [111].

7.2. Future Works

This subsection systematically analyzes and envisions several key future research directions and challenges in medical image segmentation, integrating current technological advancements with clinical requirements.

7.2.1. Data-Efficient Segmentation Methods

The acquisition of large-scale, high-quality pixel-level annotations for medical images is identified as a critical bottleneck that constrains the advancement of deep learning models. Consequently, investigating methods to achieve precise segmentation under conditions of limited or even absent annotations—that is, developing data-efficient learning paradigms—constitutes a significant future research direction. This encompasses, but is not limited to: exploring more effective semi-supervised learning strategies to fully leverage abundant unlabeled data; advancing weakly supervised learning research to utilize readily obtainable weak information such as image-level labels, bounding boxes, point annotations, or scribbles for pixel-level prediction; developing self-supervised learning approaches to mine supervisory signals from the data themselves for pre-training or direct application in segmentation tasks; and investigating few-shot and even zero-shot segmentation techniques to enable models to rapidly adapt to novel segmentation tasks where annotations are scarce. Such research endeavors are expected to substantially reduce data annotation costs and accelerate the application of models across diverse clinical scenarios.

7.2.2. Generalization, Robustness, and Federated Learning

Medical imaging data inherently suffer from the domain shift problem, where models trained on one dataset or center may experience a sharp decline in performance when applied to another dataset or center due to discrepancies in imaging devices, protocols, or patient populations. Enhancing model generalization capability on unseen data and robustness against various interferences (e.g., noise, artifacts) represents a core challenge for achieving widespread clinical deployment. Future research must prioritize unsupervised domain adaptation, domain generalization (DG) techniques, and strategies capable of handling multi-center, heterogeneous data. Concurrently, with increasingly stringent data privacy and security regulations, federated learning (FL), as a distributed, privacy-preserving training framework, exhibits immense potential within medical image segmentation. Key research questions will involve how to effectively conduct model training and aggregation under the federated setting, and how to address data heterogeneity (non-IID data).
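For orientation, the sketch below shows the server-side aggregation step of federated averaging (FedAvg), the canonical FL baseline: client parameters are averaged with weights proportional to local dataset sizes. Communication, privacy mechanisms, and corrections for non-IID data are omitted, and the model and dataset sizes are toy placeholders.

```python
import copy
import torch
import torch.nn as nn

def federated_average(client_models, client_sizes):
    """FedAvg: weight each client's parameters by its local dataset size."""
    total = float(sum(client_sizes))
    global_model = copy.deepcopy(client_models[0])
    global_state = global_model.state_dict()
    for key in global_state:
        global_state[key] = sum(
            m.state_dict()[key].float() * (n / total)
            for m, n in zip(client_models, client_sizes)
        )
    global_model.load_state_dict(global_state)
    return global_model

# Toy example: three hospitals train the same small segmentation head locally.
clients = [nn.Conv2d(1, 2, 3, padding=1) for _ in range(3)]
sizes = [120, 45, 300]                      # local training-set sizes per center
global_model = federated_average(clients, sizes)
print(sum(p.numel() for p in global_model.parameters()))
```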

7.2.3. Interpretability, Uncertainty Quantification, and Clinical Trustworthiness

The “black-box” nature of deep learning models restricts their application in high-reliability medical decision-making contexts. Enhancing the interpretability or explainability (XAI) of segmentation models is therefore crucial for enabling clinicians to comprehend the rationale behind specific segmentation decisions, thereby fostering trust. This requires techniques capable of generating visual explanations (e.g., saliency maps, Class Activation Maps) or providing rule-based or concept-based interpretations. Concurrently, quantifying the uncertainty (UQ) associated with model predictions is equally critical: it informs users about regions where the model lacks confidence in its segmentation results, prompting manual review and mitigating potential errors. Research into reliable uncertainty estimation methods and their effective integration into clinical workflows therefore represents a significant direction for enhancing model safety and practical utility.
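One widely used and easily implemented uncertainty estimator is Monte Carlo dropout, sketched below: dropout is kept active at inference time and per-pixel predictive entropy is computed over several stochastic forward passes. The toy model and sample count are illustrative assumptions, not a recommendation for any particular clinical pipeline.

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model, image, n_samples=10):
    """Per-pixel predictive entropy from Monte Carlo dropout.

    Dropout is kept active at inference; high-entropy pixels indicate
    regions where the segmentation should be flagged for manual review.
    """
    model.train()  # keep dropout layers stochastic during inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(image), dim=1)
                             for _ in range(n_samples)]).mean(dim=0)  # (B, C, H, W)
    entropy = -(probs * torch.log(probs.clamp(min=1e-8))).sum(dim=1)  # (B, H, W)
    return probs, entropy

# Toy segmentation head with dropout, standing in for a full U-Net.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Dropout2d(0.5),
                      nn.Conv2d(16, 2, 3, padding=1))
probs, entropy = mc_dropout_uncertainty(model, torch.rand(1, 1, 64, 64))
print(entropy.shape, float(entropy.mean()))
```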

7.2.4. Multi-Modal and Longitudinal Data Fusion for Segmentation

In clinical practice, physicians often integrate information from multiple imaging modalities (e.g., CT, MRI, PET) and refer to patient historical images (longitudinal data) for diagnosis and assessment. Current segmentation models predominantly focus on single-modality, single-timepoint analysis. Future research should concentrate on developing effective multi-modal fusion strategies to fully exploit the complementary information provided by different modalities, thereby achieving more precise and comprehensive segmentation. Simultaneously, for longitudinal data, it is imperative to devise segmentation models capable of capturing spatio-temporal dynamics, such as modeling lesion evolution or organ changes using recurrent neural networks (RNNs), Transformers, or graph neural networks (GNNs). This holds significant importance for accurate treatment response assessment and disease progression monitoring. Effectively fusing multi-source, heterogeneous spatio-temporal information remains a core challenge in this direction.
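The simplest fusion strategy, early fusion of registered modalities as additional input channels, is sketched below as a baseline; feature-level fusion, attention mechanisms, or the recurrent/graph models mentioned above are typically needed when modalities are misaligned, missing, or acquired at different time points. The class and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionSegHead(nn.Module):
    """Early fusion: registered CT and MRI slices stacked as input channels.

    This is the simplest fusion baseline; feature-level or attention-based
    fusion is usually required when modalities are misaligned or missing.
    """
    def __init__(self, num_modalities=2, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_modalities, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, ct_slice, mri_slice):
        x = torch.cat([ct_slice, mri_slice], dim=1)  # (B, 2, H, W)
        return self.net(x)

model = EarlyFusionSegHead()
logits = model(torch.rand(1, 1, 96, 96), torch.rand(1, 1, 96, 96))
print(logits.shape)  # torch.Size([1, 3, 96, 96])
```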

8. Conclusions

This review is focused on the investigation of deep learning-based techniques for medical image segmentation, with emphasis placed on non-fully supervised learning paradigms designed to address the dependence of fully supervised methods on extensive pixel-level annotations. By systematically examining the core principles, representative algorithms, and application contexts of semi-supervised learning (which exploits strategies such as consistency regularization and pseudo-labeling with limited labeled and abundant unlabeled data), weakly supervised learning (which employs coarse-grained annotations such as image-level labels, bounding boxes, or scribbles), and unsupervised learning (including anomaly segmentation and domain adaptation to handle data heterogeneity and privacy concerns), the substantial value of these methods in significantly reducing annotation costs and promoting the clinical translation of advanced segmentation technologies is underscored. Although significant progress has been achieved, challenges remain for these non-fully supervised approaches, particularly concerning the performance gap with fully supervised methods, model robustness and generalization, interpretability enhancement, and effective fusion of multi-modal/longitudinal data. Continuous innovation in non-fully supervised learning is regarded as crucial for accelerating the application of AI in medical image analysis and improving diagnosis and treatment.

Author Contributions

Conceptualization, X.Z. and J.W. (Jianfeng Wang); methodology, X.Z. and M.W.; validation, X.Z., J.W. (Jinqiao Wei) and X.Y.; investigation, X.Y.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., J.W. (Jinqiao Wei) and M.W.; visualization, X.Y.; supervision, J.W. (Jianfeng Wang) and M.W.; project administration, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

We are grateful to the reviewers and colleagues for their insightful comments and suggestions, which greatly improved the manuscript. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
  2. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1411.4038. [Google Scholar]
  3. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Lect. Notes Comput. Sci. 2015, 9351, 234–241. [Google Scholar] [CrossRef]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; pp. 5999–6009. [Google Scholar]
  5. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  6. Baur, C.; Wiestler, B.; Albarqouni, S.; Navab, N. Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. Lect. Notes Comput. Sci. 2019, 11383, 161–169. [Google Scholar]
  7. Yarkony, J.; Wang, S. Accelerating Message Passing for MAP with Benders Decomposition. arXiv 2018, arXiv:1805.04958. [Google Scholar]
  8. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. arXiv 2016, arXiv:1512.04150. [Google Scholar]
  9. Dai, J.; He, K.; Sun, J. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1503.01640. [Google Scholar]
  10. Ballan, L.; Castaldo, F.; Alahi, A.; Palmieri, F.; Savarese, S. Knowledge transfer for scene-specific motion prediction. Lect. Notes Comput. Sci. 2016, 9905, 697–713. [Google Scholar]
  11. Tanaka, K. Minimal networks for sensor counting problem using discrete Euler calculus. Jpn. J. Ind. Appl. Math. 2017, 34, 229–242. [Google Scholar] [CrossRef]
  12. Wei, Y.; Feng, J.; Liang, X.; Cheng, M.M.; Zhao, Y.; Yan, S. Object Region Mining With Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach. arXiv 2017, arXiv:1703.08448. [Google Scholar]
  13. Hu, F.; Wang, Y.; Ma, B.; Wang, Y. Emergency supplies research on crossing points of transport network based on genetic algorithm. In Proceedings of the 2015 International Conference on Intelligent Transportation, Big Data and Smart City, ICITBS 2015, Halong Bay, Vietnam, 19–20 December 2015; pp. 370–375. [Google Scholar] [CrossRef]
  14. Gannon, S.; Kulosman, H. The condition for a cyclic code over Z4 of odd length to have a complementary dual. arXiv 2019, arXiv:1905.12309. [Google Scholar]
  15. Abraham, N.; Khan, N.M. A novel focal tversky loss function with improved attention u-net for lesion segmentation. In Proceedings of the International Symposium on Biomedical Imaging, Venice, Italy, 8–11 April 2019; pp. 683–687. [Google Scholar] [CrossRef]
  16. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef]
  17. Graham, S.; Chen, H.; Gamper, J.; Dou, Q.; Heng, P.A.; Snead, D.; Tsang, Y.W.; Rajpoot, N. MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images. Med. Image Anal. 2019, 52, 199–211. [Google Scholar] [CrossRef] [PubMed]
  18. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. JAMIA 2016, 23, 304–310. [Google Scholar] [CrossRef]
  19. Johnson, A.E.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
  20. Li, Z.; Li, D.; Xu, C.; Wang, W.; Hong, Q.; Li, Q.; Tian, J. TFCNs: A CNN-Transformer Hybrid Network for Medical Image Segmentation. Lect. Notes Comput. Sci. 2022, 13532, 781–792. [Google Scholar] [CrossRef]
  21. Bannur, S.; Hyland, S.; Liu, Q.; Pérez-García, F.; Ilse, M.; Castro, D.C.; Boecking, B.; Sharma, H.; Bouzid, K.; Thieme, A.; et al. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 15016–15027. [Google Scholar] [CrossRef]
  22. McCollough, C.H.; Bartley, A.C.; Carter, R.E.; Chen, B.; Drees, T.A.; Edwards, P.; Holmes, D.R.; Huang, A.E.; Khan, F.; Leng, S.; et al. Low-dose CT for the detection and classification of metastatic liver lesions: Results of the 2016 Low Dose CT Grand Challenge. Med. Phys. 2017, 44, e339–e352. [Google Scholar] [CrossRef]
  23. Leuschner, J.; Schmidt, M.; Baguer, D.O.; Maass, P. LoDoPaB-CT, a benchmark dataset for low-dose computed tomography reconstruction. Sci. Data 2021, 8, 109. [Google Scholar] [CrossRef]
  24. Moen, T.R.; Chen, B.; Holmes, D.R.; Duan, X.; Yu, Z.; Yu, L.; Leng, S.; Fletcher, J.G.; McCollough, C.H. Low-dose CT image and projection dataset. Med. Phys. 2021, 48, 902–911. [Google Scholar] [CrossRef]
  25. Xiong, Z.; Xia, Q.; Hu, Z.; Huang, N.; Bian, C.; Zheng, Y.; Vesal, S.; Ravikumar, N.; Maier, A.; Yang, X.; et al. A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging. Med. Image Anal. 2021, 67, 101832. [Google Scholar] [CrossRef] [PubMed]
  26. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The cancer imaging archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar] [CrossRef]
  27. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging 2015, 34, 1993–2024. [Google Scholar] [CrossRef]
  28. Liew, S.Q.; Ngoh, G.C.; Yusoff, R.; Teoh, W.H. Acid and Deep Eutectic Solvent (DES) extraction of pectin from pomelo (Citrus grandis (L.) Osbeck) peels. Biocatal. Agric. Biotechnol. 2018, 13, 1–11. [Google Scholar] [CrossRef]
  29. Petzsche, M.R.H.; de la Rosa, E.; Hanning, U.; Wiest, R.; Valenzuela, W.; Reyes, M.; Meyer, M.; Liew, S.L.; Kofler, F.; Ezhov, I.; et al. ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset. Sci. Data 2022, 9, 762. [Google Scholar] [CrossRef] [PubMed]
  30. Maier, O.; Menze, B.H.; von der Gablentz, J.; Häni, L.; Heinrich, M.P.; Liebrand, M.; Winzeck, S.; Basit, A.; Bentley, P.; Chen, L.; et al. ISLES 2015—A public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Med. Image Anal. 2017, 35, 250–269. [Google Scholar] [CrossRef]
  31. Hakim, A.; Christensen, S.; Winzeck, S.; Lansberg, M.G.; Parsons, M.W.; Lucas, C.; Robben, D.; Wiest, R.; Reyes, M.; Zaharchuk, G. Predicting Infarct Core From Computed Tomography Perfusion in Acute Ischemia With Machine Learning: Lessons From the ISLES Challenge. Stroke 2021, 52, 2328–2337. [Google Scholar] [CrossRef] [PubMed]
  32. Liang, K.; Han, K.; Li, X.; Cheng, X.; Li, Y.; Wang, Y.; Yu, Y. Symmetry-Enhanced Attention Network for Acute Ischemic Infarct Segmentation with Non-contrast CT Images. Lect. Notes Comput. Sci. 2021, 12907, 432–441. [Google Scholar] [CrossRef]
  33. Campello, V.M.; Gkontra, P.; Izquierdo, C.; Martin-Isla, C.; Sojoudi, A.; Full, P.M.; Maier-Hein, K.; Zhang, Y.; He, Z.; Ma, J.; et al. Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The MMs Challenge. IEEE Trans. Med. Imaging 2021, 40, 3543–3554. [Google Scholar] [CrossRef]
  34. Heller, N.; Isensee, F.; Maier-Hein, K.H.; Hou, X.; Xie, C.; Li, F.; Nan, Y.; Mu, G.; Lin, Z.; Han, M.; et al. The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Med. Image Anal. 2021, 67, 101821. [Google Scholar] [CrossRef]
  35. Littlejohns, T.J.; Holliday, J.; Gibson, L.M.; Garratt, S.; Oesingmann, N.; Alfaro-Almagro, F.; Bell, J.D.; Boultwood, C.; Collins, R.; Conroy, M.C.; et al. The UK Biobank imaging enhancement of 100,000 participants: Rationale, data collection, management and future directions. Nat. Commun. 2020, 11, 2624. [Google Scholar] [CrossRef] [PubMed]
  36. Bilic, P.; Christ, P.; Li, H.B.; Vorontsov, E.; Ben-Cohen, A.; Kaissis, G.; Szeskin, A.; Jacobs, C.; Mamani, G.E.H.; Chartrand, G.; et al. The Liver Tumor Segmentation Benchmark (LiTS). Med. Image Anal. 2023, 84, 102680. [Google Scholar] [CrossRef] [PubMed]
  37. Kavur, A.E.; Gezer, N.S.; Barış, M.; Aslan, S.; Conze, P.H.; Groza, V.; Pham, D.D.; Chatterjee, S.; Ernst, P.; Özkan, S.; et al. CHAOS Challenge-combined (CT-MR) healthy abdominal organ segmentation. Med. Image Anal. 2021, 69, 101950. [Google Scholar] [CrossRef]
  38. Zhang, P.; Zhang, B.; Chen, D.; Yuan, L.; Wen, F. Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 5142–5152. [Google Scholar] [CrossRef]
  39. Sohn, K.; Berthelot, D.; Li, C.L.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2020. [Google Scholar]
  40. Chen, H.; Tao, R.; Fan, Y.; Wang, Y.; Wang, J.; Schiele, B.; Xie, X.; Raj, B.; Savvides, M. SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning. In Proceedings of the 11th International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  41. Wang, X.; Tang, F.; Chen, H.; Cheung, C.Y.; Heng, P.A. Deep semi-supervised multiple instance learning with self-correction for DME classification from OCT images. Med. Image Anal. 2023, 83, 102673. [Google Scholar] [CrossRef]
  42. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 2017, 1196–1205. [Google Scholar]
  43. Yang, L.; Qi, L.; Feng, L.; Zhang, W.; Shi, Y. Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation. arXiv 2023, arXiv:2208.09910. [Google Scholar]
  44. Lyu, F.; Ye, M.; Carlsen, J.F.; Erleben, K.; Darkner, S.; Yuen, P.C. Pseudo-Label Guided Image Synthesis for Semi-Supervised COVID-19 Pneumonia Infection Segmentation. IEEE Trans. Med. Imaging 2023, 42, 797–809. [Google Scholar] [CrossRef] [PubMed]
  45. Bashir, R.M.S.; Qaiser, T.; Raza, S.E.; Rajpoot, N.M. Consistency regularisation in varying contexts and feature perturbations for semi-supervised semantic segmentation of histology images. Med. Image Anal. 2024, 91, 102997. [Google Scholar] [CrossRef]
  46. Yang, Y.; Sun, G.; Zhang, T.; Wang, R.; Su, J. Semi-supervised medical image segmentation via weak-to-strong perturbation consistency and edge-aware contrastive representation. Med. Image Anal. 2025, 101, 103450. [Google Scholar] [CrossRef]
  47. Xu, X.; Chen, Y.; Wu, J.; Lu, J.; Ye, Y.; Huang, Y.; Dou, X.; Li, K.; Wang, G.; Zhang, S.; et al. A novel one-to-multiple unsupervised domain adaptation framework for abdominal organ segmentation. Med. Image Anal. 2023, 88, 102873. [Google Scholar] [CrossRef]
  48. V., S.A.; Dolz, J.; Lombaert, H. Anatomically-aware uncertainty for semi-supervised image segmentation. Med. Image Anal. 2024, 91, 103011. [Google Scholar] [CrossRef]
  49. Yu, L.; Wang, S.; Li, X.; Fu, C.W.; Heng, P.A. Uncertainty-Aware Self-ensembling Model for Semi-supervised 3D Left Atrium Segmentation. Lect. Notes Comput. Sci. 2019, 11765, 605–613. [Google Scholar]
  50. Luo, X.; Wang, G.; Liao, W.; Chen, J.; Song, T.; Chen, Y.; Zhang, S.; Metaxas, D.N.; Zhang, S. Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency. Med. Image Anal. 2022, 80, 102517. [Google Scholar] [CrossRef]
  51. Li, W.; Bian, R.; Zhao, W.; Xu, W.; Yang, H. Diversity matters: Cross-head mutual mean-teaching for semi-supervised medical image segmentation. Med. Image Anal. 2024, 97, 103302. [Google Scholar] [CrossRef]
  52. Wu, Y.; Ge, Z.; Zhang, D.; Xu, M.; Zhang, L.; Xia, Y.; Cai, J. Mutual consistency learning for semi-supervised medical image segmentation. Med. Image Anal. 2022, 81, 102530. [Google Scholar] [CrossRef]
  53. Bai, Y.; Chen, D.; Li, Q.; Shen, W.; Wang, Y. Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation. arXiv 2023, arXiv:2305.00673. [Google Scholar]
  54. Su, J.; Luo, Z.; Lian, S.; Lin, D.; Li, S. Mutual learning with reliable pseudo label for semi-supervised medical image segmentation. Med. Image Anal. 2024, 94, 103111. [Google Scholar] [CrossRef]
  55. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 4th International Conference on 3D Vision, 3DV, Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef]
  56. Wang, Y.; Song, K.; Liu, Y.; Ma, S.; Yan, Y.; Carneiro, G. Leveraging labelled data knowledge: A cooperative rectification learning network for semi-supervised 3D medical image segmentation. Med. Image Anal. 2025, 101, 103461. [Google Scholar] [CrossRef]
  57. Chaitanya, K.; Erdil, E.; Karani, N.; Konukoglu, E. Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation. Med. Image Anal. 2023, 87, 102792. [Google Scholar] [CrossRef]
  58. Gao, S.; Zhang, Z.; Ma, J.; Li, Z.; Zhang, S. Correlation-Aware Mutual Learning for Semi-supervised Medical Image Segmentation. Lect. Notes Comput. Sci. 2023, 14220, 98–108. [Google Scholar] [CrossRef]
  59. Li, S.; Zhang, C.; He, X. Shape-Aware Semi-supervised 3D Semantic Segmentation for Medical Images. Lect. Notes Comput. Sci. 2020, 12261, 552–561. [Google Scholar] [CrossRef]
  60. Wang, R.; Chen, S.; Ji, C.; Fan, J.; Li, Y. Boundary-aware context neural network for medical image segmentation. Med. Image Anal. 2022, 78, 102395. [Google Scholar] [CrossRef] [PubMed]
  61. Luo, X.; Chen, J.; Song, T.; Wang, G. Semi-supervised Medical Image Segmentation through Dual-task Consistency. Proc. AAAI Conf. Artif. Intell. 2021, 35, 8801–8809. [Google Scholar] [CrossRef]
  62. Shi, J.; Gao, W. Transverse ultimate capacity of U-type stiffened panels for hatch covers used in ship cargo holds. Ships Offshore Struct. 2021, 16, 608–619. [Google Scholar] [CrossRef]
  63. Peng, J.; Wang, P.; Desrosiers, C.; Pedersoli, M. Self-Paced Contrastive Learning for Semi-supervised Medical Image Segmentation with Meta-labels. Adv. Neural Inf. Process. Syst. 2021, 20, 16686–16699. [Google Scholar]
  64. Oh, S.J.; Benenson, R.; Khoreva, A.; Akata, Z.; Fritz, M.; Schiele, B. Exploiting saliency for object segmentation from image level labels. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2016; pp. 5038–5047. [Google Scholar] [CrossRef]
  65. Durieux, G.; Irles, A.; Miralles, V.; Peñuelas, A.; Perelló, M.; Pöschl, R.; Vos, M. The electro-weak couplings of the top and bottom quarks—Global fit and future prospects. J. High Energy Phys. 2019, 2019, 12. [Google Scholar] [CrossRef]
  66. Kervadec, H.; Dolz, J.D.; Montral, D.; Wang, S.; Granger, E.G.; Montral, G.; Ben, I.; Ayed, A.; Montral, A. Bounding boxes for weakly supervised segmentation: Global constraints get close to full supervision. Proc. Mach. Learn. Res. 2020, 121, 365–380. [Google Scholar]
  67. Lin, D.; Dai, J.; Jia, J.; He, K.; Sun, J. ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 3159–3167. [Google Scholar] [CrossRef]
  68. Wang, Y.; Zhang, J.; Kan, M.; Shan, S.; Chen, X. Self-Supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 12272–12281. [Google Scholar] [CrossRef]
  69. Dietterich, T.G.; Lathrop, R.H.; Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 1997, 89, 31–71. [Google Scholar] [CrossRef]
  70. Chikontwe, P.; Sung, H.J.; Jeong, J.; Kim, M.; Go, H.; Nam, S.J.; Park, S.H. Weakly supervised segmentation on neural compressed histopathology with self-equivariant regularization. Med. Image Anal. 2022, 80, 102482. [Google Scholar] [CrossRef]
  71. Patel, G.; Dolz, J. Weakly supervised segmentation with cross-modality equivariant constraints. Med. Image Anal. 2022, 77, 102374. [Google Scholar] [CrossRef]
  72. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, WACV, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [Google Scholar] [CrossRef]
  73. Yang, J.; Mehta, N.; Demirci, G.; Hu, X.; Ramakrishnan, M.S.; Naguib, M.; Chen, C.; Tsai, C.L. Anomaly-guided weakly supervised lesion segmentation on retinal OCT images. Med. Image Anal. 2024, 94, 103139. [Google Scholar] [CrossRef] [PubMed]
  74. Chen, Z.; Wang, T.; Wu, X.; Hua, X.S.; Zhang, H.; Sun, Q. Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation. arXiv 2022, arXiv:2203.00962. [Google Scholar]
  75. Zhang, W.; Zhu, L.; Hallinan, J.; Zhang, S.; Makmur, A.; Cai, Q.; Ooi, B.C. BoostMIS: Boosting Medical Image Semi-Supervised Learning With Adaptive Pseudo Labeling and Informative Active Annotation. arXiv 2022, arXiv:2203.02533. [Google Scholar]
  76. Li, K.; Qian, Z.; Han, Y.; Chang, E.I.; Wei, B.; Lai, M.; Liao, J.; Fan, Y.; Xu, Y. Weakly supervised histopathology image segmentation with self-attention. Med. Image Anal. 2023, 86, 102791. [Google Scholar] [CrossRef] [PubMed]
  77. Yao, T.; Pan, Y.; Li, Y.; Ngo, C.W.; Mei, T. Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning. Lect. Notes Comput. Sci. 2022, 13685, 328–345. [Google Scholar] [CrossRef]
  78. Cheng, B.; Parkhi, O.; Kirillov, A. Pointly-Supervised Instance Segmentation. arXiv 2022, arXiv:2104.06404. [Google Scholar]
  79. Seeböck, P.; Orlando, J.I.; Michl, M.; Mai, J.; Schmidt-Erfurth, U.; Bogunović, H. Anomaly guided segmentation: Introducing semantic context for lesion segmentation in retinal OCT using weak context supervision from anomaly detection. Med. Image Anal. 2024, 93, 103104. [Google Scholar] [CrossRef]
  80. Gao, F.; Hu, M.; Zhong, M.E.; Feng, S.; Tian, X.; Meng, X.; Ni-jia ti, M.; Huang, Z.; Lv, M.; Song, T.; et al. Segmentation only uses sparse annotations: Unified weakly and semi-supervised learning in medical images. Med. Image Anal. 2022, 80, 102515. [Google Scholar] [CrossRef] [PubMed]
  81. Shi, Y.; Wang, H.; Ji, H.; Liu, H.; Li, Y.; He, N.; Wei, D.; Huang, Y.; Dai, Q.; Wu, J.; et al. A deep weakly semi-supervised framework for endoscopic lesion segmentation. Med. Image Anal. 2023, 90, 102973. [Google Scholar] [CrossRef]
  82. Ahn, J.; Shin, S.Y.; Shim, J.; Kim, Y.H.; Han, S.J.; Choi, E.K.; Oh, S.; Shin, J.Y.; Choe, J.C.; Park, J.S.; et al. Association between epicardial adipose tissue and embolic stroke after catheter ablation of atrial fibrillation. J. Cardiovasc. Electrophysiol. 2019, 30, 2209–2216. [Google Scholar] [CrossRef]
  83. Viniavskyi, O.; Dobko, M.; Dobosevych, O. Weakly-Supervised Segmentation for Disease Localization in Chest X-Ray Images. Lect. Notes Comput. Sci. 2020, 12299, 249–259. [Google Scholar] [CrossRef]
  84. Ma, X.; Ji, Z.; Niu, S.; Leng, T.; Rubin, D.L.; Chen, Q. MS-CAM: Multi-Scale Class Activation Maps for Weakly-Supervised Segmentation of Geographic Atrophy Lesions in SD-OCT Images. IEEE J. Biomed. Health Inform. 2020, 24, 3443–3455. [Google Scholar] [CrossRef] [PubMed]
  85. Zhang, S.; Zhang, J.; Xia, Y. TransWS: Transformer-Based Weakly Supervised Histology Image Segmentation. Lect. Notes Comput. Sci. 2022, 13583, 367–376. [Google Scholar] [CrossRef]
  86. Wang, T.; Niu, S.; Dong, J.; Chen, Y. Weakly Supervised Retinal Detachment Segmentation Using Deep Feature Propagation Learning in SD-OCT Images. Lect. Notes Comput. Sci. 2020, 12069, 146–154. [Google Scholar] [CrossRef]
  87. Silva-Rodríguez, J.; Naranjo, V.; Dolz, J. Constrained unsupervised anomaly segmentation. Med. Image Anal. 2022, 80, 102526. [Google Scholar] [CrossRef]
  88. Pinaya, W.H.; Tudosiu, P.D.; Gray, R.; Rees, G.; Nachev, P.; Ourselin, S.; Cardoso, M.J. Unsupervised brain imaging 3D anomaly detection and segmentation with transformers. Med. Image Anal. 2022, 79, 102475. [Google Scholar] [CrossRef]
  89. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
  90. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014—Conference Track Proceedings, Banff, AB, Canada, 14–16 April 2014. [Google Scholar] [CrossRef]
  91. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Langs, G.; Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 2019, 54, 30–44. [Google Scholar] [CrossRef]
  92. Stan, S.; Rostami, M. Unsupervised model adaptation for source-free segmentation of medical images. Med. Image Anal. 2024, 95, 103179. [Google Scholar] [CrossRef]
  93. Liu, X.; Xing, F.; Fakhri, G.E.; Woo, J. Memory consistent unsupervised off-the-shelf model adaptation for source-relaxed medical image segmentation. Med. Image Anal. 2023, 83, 102641. [Google Scholar] [CrossRef]
  94. Sun, Y.; Dai, D.; Xu, S. Rethinking adversarial domain adaptation: Orthogonal decomposition for unsupervised domain adaptation in medical image segmentation. Med. Image Anal. 2022, 82, 102623. [Google Scholar] [CrossRef] [PubMed]
  95. Cai, Z.; Xin, J.; You, C.; Shi, P.; Dong, S.; Dvornek, N.C.; Zheng, N.; Duncan, J.S. Style mixup enhanced disentanglement learning for unsupervised domain adaptation in medical image segmentation. Med. Image Anal. 2025, 101, 103440. [Google Scholar] [CrossRef] [PubMed]
  96. Zheng, B.; Zhang, R.; Diao, S.; Zhu, J.; Yuan, Y.; Cai, J.; Shao, L.; Li, S.; Qin, W. Dual domain distribution disruption with semantics preservation: Unsupervised domain adaptation for medical image segmentation. Med. Image Anal. 2024, 97, 103275. [Google Scholar] [CrossRef] [PubMed]
  97. Dou, Q.; Ouyang, C.; Chen, C.; Chen, H.; Glocker, B.; Zhuang, X.; Heng, P.A. PnP-AdaNet: Plug-and-Play Adversarial Domain Adaptation Network with a Benchmark at Cross-modality Cardiac Segmentation. arXiv 2018, arXiv:1812.07907. [Google Scholar] [CrossRef]
  98. Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Perez, P. ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation. arXiv 2019, arXiv:1811.12833. [Google Scholar]
  99. Chen, C.; Dou, Q.; Chen, H.; Qin, J.; Heng, P.A. Unsupervised Bidirectional Cross-Modality Adaptation via Deeply Synergistic Image and Feature Alignment for Medical Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 2494–2505. [Google Scholar] [CrossRef] [PubMed]
  100. Wu, F.; Zhuang, X. CF Distance: A New Domain Discrepancy Metric and Application to Explicit Domain Adaptation for Cross-Modality Cardiac Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 4274–4285. [Google Scholar] [CrossRef] [PubMed]
  101. Liu, Z.; Zhu, Z.; Zheng, S.; Liu, Y.; Zhou, J.; Zhao, Y. Margin Preserving Self-Paced Contrastive Learning Towards Domain Adaptation for Medical Image Segmentation. IEEE J. Biomed. Health Inform. 2022, 26, 638–647. [Google Scholar] [CrossRef]
  102. Chen, J.; Huang, W.; Zhang, J.; Debattista, K.; Han, J. Addressing inconsistent labeling with cross image matching for scribble-based medical image segmentation. IEEE Trans. Image Process. 2025, 34, 842–853. [Google Scholar] [CrossRef]
  103. Gao, W.; Wan, F.; Pan, X.; Peng, Z.; Tian, Q.; Han, Z.; Zhou, B.; Ye, Q. TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization. arXiv 2021, arXiv:2103.14862. [Google Scholar]
  104. Mahapatra, D. Generative Adversarial Networks and Domain Adaptation for Training Data Independent Image Registration. arXiv 2019, arXiv:1910.08593. [Google Scholar] [CrossRef]
  105. Wang, G.; Li, W.; Zuluaga, M.A.; Pratt, R.; Patel, P.A.; Aertsen, M.; Doel, T.; David, A.L.; Deprest, J.; Ourselin, S.; et al. Interactive Medical Image Segmentation Using Deep Learning with Image-Specific Fine Tuning. IEEE Trans. Med. Imaging 2018, 37, 1562–1573. [Google Scholar] [CrossRef] [PubMed]
  106. Chen, C.; Dou, Q.; Chen, H.; Qin, J.; Heng, P.A. Synergistic Image and Feature Adaptation: Towards Cross-Modality Domain Adaptation for Medical Image Segmentation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 865–872. [Google Scholar] [CrossRef]
  107. Lei, T.; Zhang, D.; Du, X.; Wang, X.; Wan, Y.; Nandi, A.K. Semi-Supervised Medical Image Segmentation Using Adversarial Consistency Learning and Dynamic Convolution Network. IEEE Trans. Med. Imaging 2023, 42, 1265–1277. [Google Scholar] [CrossRef]
  108. Kalinicheva, E.; Ienco, D.; Sublime, J.; Trocan, M. Unsupervised Change Detection Analysis in Satellite Image Time Series Using Deep Learning Combined with Graph-Based Approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1450–1466. [Google Scholar] [CrossRef]
  109. Kamnitsas, K.; Baumgartner, C.; Ledig, C.; Newcombe, V.; Simpson, J.; Kane, A.; Menon, D.; Nori, A.; Criminisi, A.; Rueckert, D.; et al. Unsupervised Domain Adaptation in Brain Lesion Segmentation with Adversarial Networks. Lect. Notes Comput. Sci. 2017, 10265, 597–609. [Google Scholar] [CrossRef]
  110. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. CyCADA: Cycle-Consistent Adversarial Domain Adaptation. arXiv 2018, arXiv:1711.03213. [Google Scholar]
  111. Zhang, Y.; Miao, S.; Mansi, T.; Liao, R. Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation. Lect. Notes Comput. Sci. 2018, 11071, 599–607. [Google Scholar]
Figure 1. Structure and coverage of this review.
Figure 2. Framework diagram of Mean Teacher [42].
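As a minimal illustrative sketch of the framework shown in Figure 2 (not the authors' released implementation), the teacher network in Mean Teacher is typically maintained as an exponential moving average (EMA) of the student's weights, while a consistency loss aligns the two models' predictions on unlabeled data. The function and parameter names below are hypothetical.

```python
import torch

@torch.no_grad()
def update_teacher_ema(student: torch.nn.Module, teacher: torch.nn.Module, alpha: float = 0.99) -> None:
    """Update teacher weights as an exponential moving average of the student's weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # teacher = alpha * teacher + (1 - alpha) * student
        t_param.data.mul_(alpha).add_(s_param.data, alpha=1.0 - alpha)
```

In a typical training loop this update would be called once per optimizer step, after the student has been updated on the combined supervised and consistency losses.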
Figure 3. An overview of the CMMT-Net architecture [51].
Figure 4. Framework diagram of CRCFP [45].
Figure 5. Architecture diagram of CAM network [8].
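For orientation, a class activation map in the original CAM formulation is the classifier-weight-weighted sum of the final convolutional feature maps. The sketch below assumes a hypothetical backbone that exposes those feature maps and a global-average-pooling linear classifier; it is not the exact pipeline of the network in Figure 5.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feat: torch.Tensor, fc_weight: torch.Tensor, cls_idx: int) -> torch.Tensor:
    """Compute a CAM for one class.

    feat:      (C, H, W) feature maps from the last convolutional layer (assumed given).
    fc_weight: (num_classes, C) weights of the global-average-pooling classifier.
    """
    cam = torch.einsum("c,chw->hw", fc_weight[cls_idx], feat)  # weighted sum over channels
    cam = F.relu(cam)                                           # keep positive evidence only
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]
    return cam
```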
Figure 6. The framework of SA-MIL [76].
Figure 7. Framework for brain data processing based on VQ-VAE and Transformer [88].
Figure 8. An overview of the ODADA architecture [94].
Table 1. Comparison of non-fully supervised learning paradigms for medical image segmentation.

Dimension | Semi-Supervised Learning (SSL) | Weakly Supervised Learning (WSL) | Unsupervised Learning (UL)
Supervision Source | Small amount of precise labels + large amount of unlabeled data | Coarse-grained/indirect labels (e.g., image-level tags, bounding boxes, points/scribbles) | No direct segmentation labels; relies on inherent data structure or priors
Annotation Cost | Moderate | Low | Very low/none (for the target task)
Core Mechanism | Leverages unlabeled data to boost performance (e.g., consistency regularization, pseudo-labeling) | Infers strong segmentation from weak signals (e.g., CAMs, MIL) | Discovers inherent data patterns/anomalies (e.g., clustering, reconstruction, UDA)
Performance Potential | High; can approach fully supervised | Moderate to high; depends on weak label quality | Generally lower than supervised; valuable for specific tasks (e.g., anomaly detection)
Application Scenarios | Few precise labels but abundant unlabeled data; enhancing robustness | Precise labeling difficult but coarse information available; large-scale screening | No labels available; exploratory analysis; domain adaptation
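To make the SSL "Core Mechanism" row of Table 1 concrete, the following is a minimal sketch of confidence-thresholded pseudo-labeling combined with consistency regularization, assuming a generic segmentation network and two hypothetical views (weakly and strongly perturbed) of the same unlabeled image. It is a simplified illustration, not any specific method reviewed above.

```python
import torch
import torch.nn.functional as F

def ssl_unlabeled_loss(model, weak_img, strong_img, conf_thresh: float = 0.9):
    """Unsupervised loss on an unlabeled batch: pseudo-labels from the weakly
    perturbed view supervise the strongly perturbed view, restricted to
    high-confidence pixels."""
    with torch.no_grad():
        probs = torch.softmax(model(weak_img), dim=1)   # (B, K, H, W)
        conf, pseudo = probs.max(dim=1)                 # per-pixel confidence and label
        mask = (conf >= conf_thresh).float()            # keep confident pixels only

    logits_strong = model(strong_img)
    per_pixel_ce = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (per_pixel_ce * mask).sum() / mask.sum().clamp(min=1.0)
```

This term would typically be added, with a weighting factor, to the supervised loss computed on the small labeled subset.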
Table 2. Summary of 2D medical imaging datasets.

Dataset | Modality | Anatomical Area | Application Scenarios | Supervision Type
ACDC [16] | MRI | Heart (left and right ventricles) | Cardiac function analysis, ventricular segmentation | Fully supervised
Colorectal adenocarcinoma glands [17] | Pathology sections (H&E staining) | Colorectal tissue | Segmentation of the glandular structure | Fully supervised
IU Chest X-ray [18] | X-ray (chest X-ray) | Chest (cardiopulmonary area) | Classification of lung diseases | Weakly supervised
MIMIC-CXR [19] | X-ray (chest X-ray) + clinical report | Chest | Automatic diagnosis of multiple diseases | Weakly supervised
COV-CTR [20] | CT (chest) | Lung | COVID-19 severity rating | Weakly supervised
MS-CXR-T [21] | X-ray (chest X-ray) | Chest | Multilingual report generation | Weakly supervised
NIH-AAPM-Mayo Clinical LDCT [22] | Low-dose CT (chest) | Lung | Lung nodule detection | Fully supervised
LoDoPaB [23] | Low-dose CT (simulation) | Body | CT reconstruction algorithm development | Fully supervised
LDCT [24] | Low-dose CT | Chest/abdomen | Radiation dose reduction studies | Fully supervised
Table 3. Summary of 3D medical imaging datasets.

Dataset | Modality | Anatomical Area | Application Scenarios | Supervision Type
LA [25] | MRI | Heart (left atrium) | Surgical planning for atrial fibrillation | Fully supervised
Pancreas-CT [26] | CT (abdomen) | Pancreas | Pancreatic tumor segmentation | Fully supervised
BraTS [27] | Multiparametric MRI | Brain (glioma) | Brain tumor segmentation | Fully supervised
ATLAS [28] | MRI (T1) | Brain (stroke lesions) | Stroke analysis | Fully supervised
ISLES [29,30,31] | MRI (multiple sequences) | Brain | Ischemic stroke segmentation | Fully supervised
AISD [32] | Ultrasound | Abdominal organs | Organ boundary segmentation | Fully supervised
Cardiac [33] | MRI | Heart | Ventricular segmentation | Fully supervised
KiTS19 [34] | CT (abdomen) | Kidney | Segmentation of kidney tumors | Fully supervised
UKB [35] | MRI/CT/X-ray | Body | Multi-organ phenotypic analysis | Weakly supervised
LiTS [36] | CT (abdomen) | Liver | Segmentation of liver tumors | Fully supervised
CHAOS [37] | CT/MRI (abdomen) | Multi-organ | Cross-modal organ segmentation | Fully supervised
Table 4. Comparison of classic semi-supervised methods on the 2D dataset ACDC2017 [46].

Method | % Labeled Scans | DSC (%) | Jaccard (%) | 95HD (mm) | ASD (mm)
Using 5% labeled scans
UAMT [49] | 5 | 51.23 (1.96) | 41.82 (1.62) | 17.13 (2.82) | 7.76 (2.01)
SASSNet [59] | 5 | 58.47 (1.74) | 47.04 (2.02) | 18.04 (3.63) | 7.31 (1.53)
Tri-U-MT [60] | 5 | 59.15 (2.01) | 47.37 (1.82) | 17.37 (2.77) | 7.34 (1.31)
DTC [61] | 5 | 57.09 (1.57) | 45.61 (1.23) | 20.63 (2.61) | 7.05 (1.94)
CoraNet [62] | 5 | 59.91 (2.08) | 48.37 (1.75) | 15.53 (2.23) | 5.96 (1.42)
SPCL [63] | 5 | 81.82 (1.24) | 70.62 (1.04) | 5.96 (1.62) | 2.21 (0.29)
MC-Net+ [52] | 5 | 63.47 (1.75) | 53.13 (1.41) | 7.38 (1.68) | 2.37 (0.32)
URPC [50] | 5 | 62.57 (1.18) | 52.75 (1.36) | 7.79 (1.85) | 2.64 (0.36)
PLCT [57] | 5 | 78.42 (1.45) | 67.43 (1.25) | 6.54 (1.62) | 2.48 (0.24)
DGCL [41] | 5 | 80.57 (1.12) | 68.74 (0.96) | 6.04 (1.73) | 2.17 (0.30)
CAML [58] | 5 | 79.04 (0.83) | 68.45 (0.97) | 6.28 (1.79) | 2.24 (0.26)
DCNet [40] | 5 | 71.57 (1.58) | 61.12 (1.19) | 8.37 (1.92) | 4.08 (0.84)
SFPC [43] | 5 | 80.52 (1.03) | 68.73 (0.88) | 6.08 (1.47) | 2.14 (0.22)
Using 10% labeled scans
UAMT [49] | 10 | 81.86 (1.25) | 71.07 (1.43) | 12.92 (1.68) | 3.49 (0.64)
SASSNet [59] | 10 | 84.61 (1.97) | 74.53 (1.78) | 6.02 (1.54) | 1.71 (0.35)
Tri-U-MT [60] | 10 | 84.06 (1.69) | 74.32 (1.77) | 7.41 (1.63) | 2.59 (0.51)
DTC [61] | 10 | 82.91 (1.65) | 71.61 (1.81) | 8.69 (1.84) | 3.04 (0.59)
CoraNet [62] | 10 | 84.56 (1.53) | 74.41 (1.49) | 6.11 (1.15) | 2.35 (0.44)
SPCL [63] | 10 | 87.57 (1.15) | 78.63 (0.89) | 4.87 (0.79) | 1.31 (0.27)
MC-Net+ [52] | 10 | 86.78 (1.41) | 77.31 (1.27) | 6.92 (0.95) | 2.04 (0.37)
URPC [50] | 10 | 85.18 (0.98) | 74.65 (0.83) | 5.01 (0.79) | 1.52 (0.26)
PLCT [57] | 10 | 86.83 (1.17) | 77.04 (0.83) | 6.62 (0.86) | 2.27 (0.42)
DGCL [41] | 10 | 87.74 (1.06) | 78.82 (1.22) | 4.74 (0.73) | 1.56 (0.24)
CAML [58] | 10 | 87.67 (0.83) | 78.70 (0.91) | 4.97 (0.62) | 1.35 (0.17)
DCNet [40] | 10 | 87.81 (0.88) | 78.96 (0.94) | 4.84 (0.81) | 1.23 (0.21)
SFPC [43] | 10 | 87.76 (0.92) | 78.94 (0.83) | 4.90 (0.74) | 1.28 (0.23)
Table 5. Comparison of classic semi-supervised methods on the 3D dataset BraTS2020 [46].

Method | % Labeled Scans | DSC (%) | Jaccard (%) | 95HD (mm) | ASD (mm)
Using 5% labeled scans
UAMT [49] | 5 | 49.46 (2.51) | 38.46 (1.86) | 19.57 (3.28) | 6.54 (0.86)
SASSNet [59] | 5 | 51.82 (1.74) | 43.93 (1.42) | 23.47 (2.83) | 7.47 (1.09)
Tri-U-MT [60] | 5 | 53.95 (1.97) | 44.33 (2.18) | 19.68 (3.06) | 7.29 (0.84)
DTC [61] | 5 | 56.72 (2.04) | 45.78 (1.67) | 17.38 (4.31) | 6.28 (1.22)
CoraNet [62] | 5 | 57.97 (1.83) | 46.40 (1.64) | 19.52 (2.80) | 5.83 (0.85)
SPCL [63] | 5 | 78.73 (1.54) | 67.90 (1.29) | 16.26 (1.68) | 4.47 (1.08)
MC-Net+ [52] | 5 | 58.91 (1.47) | 47.24 (1.36) | 20.82 (3.35) | 7.14 (1.12)
URPC [50] | 5 | 60.48 (2.01) | 50.69 (1.99) | 18.21 (3.27) | 7.12 (0.95)
PLCT [57] | 5 | 65.74 (2.17) | 55.40 (1.85) | 16.61 (3.04) | 6.85 (1.39)
DGCL [41] | 5 | 80.21 (0.75) | 68.86 (0.63) | 14.91 (1.53) | 4.63 (1.16)
CAML [58] | 5 | 77.86 (0.96) | 66.42 (1.37) | 15.21 (1.74) | 5.10 (1.12)
DCNet [40] | 5 | 78.52 (1.21) | 67.81 (1.07) | 17.37 (1.48) | 4.32 (0.96)
SFPC [43] | 5 | 80.76 (0.74) | 69.18 (0.83) | 14.87 (1.92) | 4.02 (0.75)
Using 10% labeled scans
UAMT [49] | 10 | 81.04 (1.46) | 68.88 (1.57) | 17.27 (3.35) | 6.25 (1.63)
SASSNet [59] | 10 | 82.36 (2.08) | 71.03 (2.35) | 14.80 (3.72) | 4.11 (1.54)
Tri-U-MT [60] | 10 | 82.83 (1.35) | 71.52 (1.21) | 15.19 (2.86) | 3.57 (1.30)
DTC [61] | 10 | 81.98 (2.41) | 70.41 (2.73) | 16.27 (3.62) | 3.62 (1.71)
CoraNet [62] | 10 | 81.38 (1.68) | 70.01 (1.83) | 13.94 (2.72) | 3.95 (1.26)
SPCL [63] | 10 | 84.65 (1.16) | 73.91 (1.19) | 12.24 (1.47) | 3.28 (0.42)
MC-Net+ [52] | 10 | 83.93 (1.73) | 72.34 (1.69) | 13.52 (2.74) | 3.37 (1.13)
URPC [50] | 10 | 84.23 (1.41) | 72.37 (1.26) | 11.52 (1.79) | 3.26 (1.14)
PLCT [57] | 10 | 83.66 (1.82) | 71.99 (1.67) | 13.68 (1.29) | 3.59 (1.02)
DGCL [41] | 10 | 84.02 (1.24) | 72.16 (1.07) | 12.98 (1.28) | 3.02 (0.96)
CAML [58] | 10 | 84.34 (1.03) | 73.84 (0.92) | 12.02 (1.84) | 3.31 (0.58)
DCNet [40] | 10 | 83.39 (0.97) | 71.94 (0.88) | 11.93 (1.24) | 3.50 (0.33)
SFPC [43] | 10 | 85.01 (0.89) | 74.67 (1.14) | 10.73 (1.36) | 3.03 (0.31)
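For reference, the overlap metrics reported in Tables 4 and 5 (DSC and Jaccard; the mIoU in Table 6 is the multi-class average of the Jaccard index) can be computed from binary masks as in the generic sketch below. This is not the exact evaluation code used by the cited benchmarks, and the distance metrics (95HD, ASD) additionally require surface-distance computations that are omitted here.

```python
import numpy as np

def dice_and_jaccard(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Dice similarity coefficient and Jaccard index (IoU) for binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    jaccard = inter / (np.logical_or(pred, gt).sum() + eps)
    return 100.0 * dice, 100.0 * jaccard  # percentages, as reported in the tables
```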
Table 6. Performance comparison of weakly supervised medical image segmentation methods [73].

Method | RESC BG (DSC / mIoU) | RESC SRF (DSC / mIoU) | RESC PED (DSC / mIoU) | Duke BG (DSC / mIoU) | Duke Fluid (DSC / mIoU)
IRNet [82] | 98.88% / 97.78% | 49.18% / 33.75% | 22.98% / 14.66% | 99.02% / 98.10% | 17.79% / 20.45%
SEAM [68] | 98.69% / 97.43% | 46.44% / 34.13% | 28.09% / 10.71% | 98.48% / 97.03% | 25.48% / 17.87%
ReCAM [74] | 98.81% / 97.66% | 31.19% / 14.23% | 31.99% / 19.11% | 98.16% / 96.41% | 18.91% / 11.67%
WSMIS [83] | 96.90% / 95.64% | 45.91% / 24.64% | 10.34% / 2.96% | 98.16% / 96.41% | 0.42% / 0.42%
MSCAM [84] | 98.59% / 97.25% | 18.52% / 10.14% | 17.03% / 11.97% | 98.98% / 98.00% | 29.93% / 17.98%
TransWS [85] | 99.07% / 98.18% | 52.44% / 34.88% | 30.28% / 17.22% | 99.06% / 98.15% | 37.58% / 27.01%
DFP [86] | 98.83% / 97.72% | 20.39% / 6.40% | 31.39% / 15.64% | 99.10% / 98.24% | 27.53% / 15.14%
AGM [73] | 99.15% / 98.34% | 57.84% / 43.94% | 34.03% / 22.33% | 99.13% / 98.29% | 40.17% / 30.06%
Table 7. The quantitative comparison results of unsupervised domain adaptation methods for medical image segmentation on the MM-WHS challenge dataset [95].

Method | Cardiac MRI → Cardiac CT: Dice (%) | Cardiac MRI → Cardiac CT: ASSD (mm) | Cardiac CT → Cardiac MRI: Dice (%) | Cardiac CT → Cardiac MRI: ASSD (mm)
Supervised training (upper bound) | 92.0 ± 7.2 | 1.5 ± 0.8 | 80.12 ± 4.0 | 4.2 ± 1.9
Without adaptation (lower bound) | 0.1 ± 0.1 | 51.0 ± 9.1 | 18.1 ± 13.7 | 32.9 ± 4.7
One-shot Finetune | 46.2 ± 9.2 | 10.7 ± 2.1 | 39.9 ± 11.2 | 8.2 ± 1.5
Five-shot Finetune | 73.1 ± 3.4 | 8.6 ± 1.7 | 39.5 ± 10.3 | 8.5 ± 1.2
PnP-AdaNet [97] | 74.0 ± 21.1 | 24.9 ± 6.7 | 43.7 ± 6.2 | 3.1 ± 2.2
AdvEnt [98] | 84.2 ± 3.0 | 9.1 ± 4.1 | 53.0 ± 5.9 | 6.9 ± 1.7
SIFA [99] | 81.3 ± 5.7 | 7.9 ± 2.7 | 65.3 ± 10.9 | 7.3 ± 5.0
VarDA [100] | 81.9 ± 9.1 | 8.1 ± 5.0 | 54.6 ± 9.3 | 15.5 ± 4.5
BMCAN [101] | 83.0 ± 6.8 | 5.8 ± 4.1 | 72.2 ± 4.3 | 3.7 ± 2.6
DAAM [74] | 87.0 ± 2.1 | 5.4 ± 3.0 | 76.0 ± 7.3 | 6.8 ± 3.2
ADR [94] | 87.9 ± 3.6 | 5.9 ± 4.4 | 69.7 ± 4.2 | 5.1 ± 2.1
MPSCL [101] | 86.8 ± 2.6 | 7.7 ± 3.9 | 64.6 ± 4.7 | 4.5 ± 2.3
SMEDL [95] | 88.3 ± 3.5 | 4.3 ± 2.3 | 80.12 ± 4.0 | 4.2 ± 1.9
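Several of the UDA methods compared in Table 7 (e.g., PnP-AdaNet and AdvEnt) rely on adversarial feature alignment: a discriminator learns to distinguish source-domain features from target-domain features while the segmentation network learns to fool it. The sketch below is a generic, simplified illustration of that idea with hypothetical module names; it is not a reimplementation of any listed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adversarial_alignment_step(features_src, features_tgt, discriminator: nn.Module,
                               d_optim: torch.optim.Optimizer, lambda_adv: float = 0.01):
    """One step of adversarial feature alignment.

    features_src / features_tgt: (B, C, H, W) encoder features from the two domains.
    Returns the adversarial loss term to be added to the segmenter's objective.
    """
    # 1) Train the discriminator: source features -> label 1, target features -> label 0.
    d_optim.zero_grad()
    d_src = discriminator(features_src.detach())
    d_tgt = discriminator(features_tgt.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src)) + \
             F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt))
    d_loss.backward()
    d_optim.step()

    # 2) Adversarial term for the segmenter: make target features look like source features.
    d_tgt_for_gen = discriminator(features_tgt)
    adv_loss = F.binary_cross_entropy_with_logits(d_tgt_for_gen, torch.ones_like(d_tgt_for_gen))
    return lambda_adv * adv_loss
```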
Table 8. Summary of medical image segmentation methods.

Method | Authors (Year) | Key Feature | Application Domain(s) | Strengths
AC-MT [47] | Xu et al. (2023) | Ambiguity recognition module selectively calculates consistency loss | Medical image segmentation | Entropy-based screening of high-ambiguity pixels with selective consistency learning improves segmentation metrics
AAU-Net [48] | Adiga V. et al. (2024) | Uncertainty estimation with an anatomical prior (denoising autoencoder, DAE) | Abdominal CT multi-organ segmentation | The DAE improves the anatomical plausibility of predictions, improving DSC/HD
CMMT-Net [51] | Li et al. (2024) | Cross-head mutual-aid mean teaching and multi-level perturbations | Medical image segmentation on LA, Pancreas-CT, ACDC | Multi-head decoder enhances prediction diversity and improves Dice
MLRPL [54] | Su et al. (2024) | Collaborative learning framework with dual reliability evaluation | Medical image segmentation (e.g., Pancreas-CT) | Dual decoders with a mutual comparison strategy; achieves near-fully supervised performance
CRLN [56] | Wang et al. (2025) | Prototype learning and dynamic interaction correction for pseudo-labeling | 3D medical image segmentation (LA, Pancreas-CT, BraTS19) | Multi-prototype learning captures intra-class diversity to enhance generalization
CRCFP [45] | Bashir et al. (2024) | Context-aware contrast and cross-consistency training | Histopathology image segmentation (BCSS, MoNuSeg) | Dual-path unsupervised learning with a lightweight classifier; achieves near-fully supervised performance
AGM [73] | Yang et al. (2024) | Iterative refinement learning stage | Small, low-contrast, and multiple co-existing lesions in medical images | Enhances lesion localization accuracy
SA-MIL [76] | Li et al. (2023) | Criss-cross attention | Differentiating foreground (e.g., cancerous regions) from background | Enhances feature representation capability
Table 9. Summary of medical image segmentation methods (continued).

Method | Authors (Year) | Key Feature | Application Domain(s) | Strengths
SOUSA [80] | Gao et al. (2022) | Multi-angle projection reconstruction loss | More accurate segmentation boundaries, fewer false-positive regions | Significantly improves segmentation accuracy
Point SEGTR [81] | Shi et al. (2023) | Fuses limited pixel-level annotations with abundant point-level annotations | Endoscopic image analysis | Significantly reduces dependency on pixel-level annotations
VAE [87] | Silva-Rodríguez et al. (2022) | Attention mechanism (Grad-CAM) + extended log-barrier method | Unsupervised anomaly detection and segmentation; lesion detection and localization | Effectively separates activation distributions of normal and abnormal patterns
OSUDA [93] | Liu et al. (2023) | Exponential momentum decay; high-order BN statistics consistency loss | Source-free unsupervised domain adaptation (SFUDA); privacy-preserving knowledge transfer | Improves performance and stability in the target domain
ODADA [94] | Sun et al. (2022) | Decomposition into domain-invariant and domain-specific representations | Scenarios with significant domain shift; unsupervised domain adaptation | Learns purer and more effective domain-invariant features
SMEDL [95] | Cai et al. (2025) | Disentangled Style Mixup (DSM) strategy | Cross-modal medical image segmentation tasks | Leverages both intra-domain and inter-domain variations to learn robust representations
DDSP [96] | Zheng et al. (2024) | Dual domain distribution disruption strategy; Inter-channel Feature Alignment (IFA) mechanism | Scenarios with complex domain shift; unsupervised domain adaptation tasks | Significantly improves shared classifier accuracy for target domains