1. Introduction
Medical image segmentation represents one of the most fundamental and challenging tasks in medical image analysis, serving as a critical component for accurate diagnosis, treatment planning, and disease monitoring [1,2,3]. The precise delineation of anatomical structures and pathological regions from medical images such as CT scans, MRI, ultrasound, and histopathology slides directly impacts clinical decision-making and patient outcomes [4,5]. A critical limitation of conventional segmentation methods lies in their inability to adequately model uncertainty and inter-observer variability, which are intrinsic to medical image interpretation. Differences in clinical expertise, annotation protocols, and subjective judgment often result in multiple plausible interpretations of the same image [6,7]. Traditional deterministic models typically produce a single segmentation output, failing to capture this uncertainty and potentially leading to overconfident predictions in clinically ambiguous cases [8]. This limitation motivates the exploration of probabilistic and generative modeling paradigms that can better reflect the realities of clinical practice.
In this context, denoising diffusion probabilistic models (DDPMs) have recently emerged as a powerful class of generative models and have attracted growing interest in medical image analysis. Diffusion models learn data distributions through a gradual denoising process, in which input data are progressively perturbed by Gaussian noise and subsequently reconstructed through a learned reverse process [9,10,11]. This formulation enables stable training and flexible modeling of complex, high-dimensional data distributions.
Forward Diffusion Process (Noising): The forward process gradually adds Gaussian noise to a clean segmentation mask ($x_0$) over $T$ steps according to a variance schedule $\{\beta_t\}_{t=1}^{T}$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I}\big)$$

Using the property of Gaussian distributions, we can sample $x_t$ directly from $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
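As an illustration of the closed-form forward process above, the following minimal Python sketch (plain lists standing in for mask tensors; `make_schedule` and `q_sample` are hypothetical names, not from any cited framework) builds a linear variance schedule and noises a mask directly to step $t$:

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_t and cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bars = []
    prod = 1.0
    for b in betas:
        prod *= (1.0 - b)  # alpha_t = 1 - beta_t
        alpha_bars.append(prod)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps=None):
    """Sample x_t ~ q(x_t | x_0) in closed form for a mask given as a flat list."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t \to T$, the mask's signal fades and the sample approaches pure Gaussian noise.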
Reverse Diffusion Process (Denoising): The goal is to learn a model $p_\theta$ that reverses this process, starting from Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ to reconstruct the segmentation mask $x_0$. In segmentation tasks, this is conditional, meaning the model also takes the original medical image ($I$) as input:

$$p_\theta(x_{t-1} \mid x_t, I) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t, I),\, \Sigma_\theta(x_t, t, I)\big)$$

The mean $\mu_\theta$ is typically parameterized to predict the noise $\epsilon_\theta$:

$$\mu_\theta(x_t, t, I) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, I)\right)$$
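The noise-prediction parameterization above translates into a single reverse update as sketched below (a toy Python version on flat lists; `p_sample_step` is a hypothetical name, and the network output $\epsilon_\theta$ is passed in rather than computed):

```python
import math
import random

def p_sample_step(x_t, t, eps_pred, betas, alpha_bars):
    """One reverse step x_t -> x_{t-1} using the noise-prediction mean mu_theta.
    eps_pred stands in for the network output epsilon_theta(x_t, t, I)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t = alpha_bars[t]
    # mu_theta = (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
    mean = [(x - beta_t / math.sqrt(1.0 - abar_t) * e) / math.sqrt(alpha_t)
            for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean  # no noise is added at the final step
    sigma = math.sqrt(beta_t)  # a common simple choice for Sigma_theta
    return [m + sigma * random.gauss(0.0, 1.0) for m in mean]
```

Iterating this step from $t = T-1$ down to $0$ yields a sampled segmentation mask; re-running the loop with fresh noise yields a different plausible mask.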
Compared with earlier generative approaches such as Generative Adversarial Networks (GANs) [12] and Variational Autoencoders (VAEs), diffusion models exhibit superior mode coverage and sample fidelity, reducing common issues such as mode collapse and training instability [13,14]. These properties make diffusion-based frameworks particularly well suited for medical imaging applications, where robustness, reliability, and faithful representation of anatomical variability are essential. As a result, diffusion models have begun to play an increasingly important role in addressing long-standing challenges in medical image segmentation, including uncertainty modeling, data scarcity, and generalization across heterogeneous imaging modalities.
1.1. Motivation and Clinical Significance
The application of diffusion models to medical image segmentation addresses several critical limitations of existing approaches. Collective insights from expert groups have consistently outperformed individual diagnostic capabilities in clinical tasks [15,16]. However, the majority of AI-based segmentation methods are designed to replicate a single “optimal” expert annotation, implicitly assuming the existence of a unique ground truth. This assumption is often violated in medical imaging, where inter-observer variability and ambiguous anatomical boundaries are common [17].
Medical image segmentation is inherently uncertain, as different expert annotators may provide distinct yet equally valid segmentation masks for the same anatomical structure. Such ambiguity arises from factors including pathological variability, unclear tissue boundaries, and subjective anatomical definitions, particularly in complex or low-contrast imaging scenarios [18].
Diffusion-based segmentation frameworks offer a principled alternative by modelling the full distribution of plausible segmentation outcomes rather than producing a single deterministic prediction. This enables the generation of multiple segmentation hypotheses from the same input image, directly capturing inter-observer variability. The resulting uncertainty estimates inform diagnostic confidence and support treatment planning decisions in ambiguous cases, aligning these models more naturally with the realities of clinical practice than conventional deterministic approaches.
1.2. Advantages of Diffusion Models in Medical Segmentation
Diffusion models offer several compelling advantages for medical image segmentation tasks. The inherent stochastic sampling process of diffusion models can be leveraged to generate multiple plausible segmentation masks from a single input image, naturally modeling the uncertainty and variability present in medical annotations [18]. This capability eliminates the need for separate prior distributions during inference, which is critical for conditional VAE-based segmentation models [13].
Another important advantage of diffusion models is their hierarchical denoising structure, which allows the progressive refinement of segmentation masks across multiple time steps. This multi-stage generation process enables fine-grained control over the diversity and realism of the predicted segmentations, resulting in anatomically coherent yet heterogeneous outputs. Such control is particularly valuable in medical imaging, where capturing subtle anatomical variations and pathological heterogeneity is essential for robust clinical assessment.
In addition, diffusion models exhibit superior mode coverage compared to adversarial generative frameworks such as GANs. By avoiding adversarial training dynamics, diffusion-based approaches reduce the risk of mode collapse and ensure that less frequent but clinically relevant anatomical patterns are adequately represented. This property is critical in medical segmentation tasks, where rare anatomical variants or pathological structures may carry significant diagnostic importance and should not be systematically overlooked.
1.3. Computational Challenges and Recent Advances
Despite their conceptual advantages, the deployment of diffusion models for medical image segmentation is constrained by substantial computational demands. Classical denoising diffusion probabilistic models rely on iterative reverse sampling processes that may require hundreds to thousands of sampling steps to achieve high-quality outputs. Such computational overhead poses a significant challenge for clinical applications, where timely inference and resource efficiency are critical for real-world usability [9,11].
Recent methodological advances have substantially mitigated these limitations. Latent diffusion models reduce computational complexity by performing the diffusion process in a compressed latent space rather than directly operating on high-resolution medical images. This strategy enables efficient segmentation while preserving anatomical fidelity and has proven particularly effective for large-scale and high-dimensional medical imaging data [10,13].
In parallel, transformer-enhanced diffusion architectures have emerged as a promising direction for improving both efficiency and representational capacity. By integrating vision transformer mechanisms into diffusion-based segmentation frameworks, these models can better capture long-range spatial dependencies and global contextual information, which are essential for accurate anatomical delineation in complex medical images [14]. Collectively, these advances have brought diffusion-based segmentation closer to practical clinical deployment, narrowing the gap between theoretical promise and real-world feasibility.
1.4. Scope and Objectives
This review provides a comprehensive and systematic analysis of diffusion models for medical image segmentation, with the goal of synthesizing recent methodological advances and assessing their clinical relevance. We cover the theoretical foundations of diffusion-based generative modeling as applied to segmentation tasks, alongside key architectural developments, including latent diffusion frameworks, transformer-enhanced diffusion models, and approaches designed to handle ambiguous and uncertain annotations.
In addition to methodological analysis, this review examines strategies for improving computational efficiency and scalability, which are critical for clinical deployment. Applications across multiple medical imaging modalities and anatomical regions are discussed to illustrate the practical impact and generalizability of diffusion-based segmentation approaches. Finally, we identify open challenges and emerging research directions, highlighting opportunities for future development and clinical translation.
By consolidating current knowledge and critically evaluating both strengths and limitations, this review aims to serve as a valuable reference for researchers developing diffusion-based segmentation methods and for clinicians seeking to understand their potential role in real-world medical imaging workflows. To provide a high-level overview of the diffusion-based medical image segmentation landscape, Figure 1 summarizes the major model paradigms discussed in this review.
2. Methodology
This review employs a systematic literature review methodology to synthesize the rapidly evolving landscape of diffusion models for medical image segmentation. The primary objective is to provide a holistic analysis of theoretical advancements, architectural innovations, and clinical applicability. The review process was guided by the following five specific research questions (RQs):
1. RQ1: What are the theoretical foundations and evolutionary trajectory of diffusion models in the context of medical image segmentation?
2. RQ2: How can existing diffusion-based segmentation methodologies be taxonomically classified based on their architectural designs and conditioning strategies?
3. RQ3: How effectively do diffusion models address inherent clinical challenges such as inter-observer variability and ambiguous boundary delineation?
4. RQ4: What computational bottlenecks hinder the clinical deployment of diffusion models, and what acceleration strategies have been proposed?
5. RQ5: Across which medical imaging modalities and specific anatomical tasks have diffusion models demonstrated superior performance compared to traditional segmentation paradigms?
2.1. Search Strategy and Data Sources
To address these research questions, a comprehensive and reproducible search was conducted across five digital libraries: IEEE Xplore, PubMed, ScienceDirect, SpringerLink, and arXiv. All databases were searched from January 2019 to January 2025, capturing the inception of DDPMs and their subsequent adaptation to medical imaging. Searches were executed on 25 December 2025. Search fields: For IEEE Xplore, PubMed, and ScienceDirect, searches were restricted to Title, Abstract, and Keywords fields. For arXiv, full-text search was applied due to platform constraints. PubMed searches additionally incorporated MeSH terms (e.g., “Image Segmentation”[MeSH] AND “Diffusion”[tw]).
The search query utilized a combination of Boolean operators and keywords:
(“Diffusion Models” OR “Denoising Diffusion Probabilistic Models” OR “DDPM” OR “Latent Diffusion”)
AND
(“Medical Image Segmentation” OR “Medical Image Analysis” OR “Organ Segmentation” OR “Tumor Segmentation”)
2.2. Inclusion and Exclusion Criteria
Studies were selected based on a two-stage screening process (title/abstract screening followed by full-text review). The inclusion criteria were as follows:
Studies focusing on the application of diffusion models to medical image segmentation tasks.
Research utilizing standard medical imaging modalities (MRI, CT, Ultrasound, X-ray).
Papers proposing novel methodological frameworks, architectural variations, or significant efficiency improvements.
Studies providing quantitative performance metrics (e.g., Dice Similarity Coefficient, IoU) or qualitative analyses of uncertainty modeling.
Exclusion criteria included the following:
Studies applying diffusion models solely for image generation or reconstruction without a segmentation component.
Non-medical applications of diffusion models.
Non-peer-reviewed articles were excluded, except for arXiv preprints that met at least one of the following criteria: (a) later accepted in a peer-reviewed venue by January 2025; (b) received ≥50 citations on Google Scholar; or (c) served as a widely referenced foundational baseline. Among the included studies, eight were arXiv preprints that satisfied these criteria.
2.3. Data Extraction and Synthesis
A structured extraction form was applied to each eligible study, recording the following: (1) diffusion paradigm type; (2) target anatomy and imaging modality; (3) quantitative metrics; (4) inference time and hardware; and (5) uncertainty quantification strategy. The extracted data was synthesized qualitatively to identify trends, technological gaps, and future directions, forming the basis of the discussion presented in the subsequent sections.
This review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. The study selection process is illustrated using a PRISMA flow diagram in Figure 2.
3. Literature Review
The application of diffusion models (DMs) to medical image segmentation has rapidly evolved into a prominent research domain, marked by significant contributions across theoretical foundations, methodological innovations, and clinical applications [11]. This section presents a structured review of the extant literature concerning diffusion-based medical image segmentation. We synthesize prior works by categorizing them based on modeling strategies, architectural designs, and application domains, while simultaneously evaluating their comparative strengths and limitations. Particular emphasis is devoted to recent developments addressing ambiguity-aware segmentation, computational efficiency, and scalability, factors that are critical for translating diffusion models from research prototypes into viable clinical practice.
3.1. Foundational Theoretical Frameworks
Diffusion-based medical image segmentation methods rely on probabilistic generative modeling, learning complex data distributions through iterative refinement. These frameworks model uncertainty, generate diverse segmentation hypotheses, and ensure stable training. Denoising Diffusion Probabilistic Models (DDPMs) serve as the core paradigm for most segmentation-oriented diffusion frameworks, forming the basis for architectural innovations and application-specific adaptations.
3.1.1. Denoising Diffusion Probabilistic Models (DDPMs)
The theoretical foundation of diffusion models in medical image segmentation is rooted in the seminal work on Denoising Diffusion Probabilistic Models. These models define a forward diffusion process where medical image data is gradually perturbed over multiple steps by adding Gaussian noise, followed by learning to reverse this diffusion process to retrieve desired noise-free segmentation masks from noisy segmentation masks. The diffusion framework is formulated using two Markov chains: a forward diffusion process and a learned reverse denoising process. The forward process, denoted as $q(x_t \mid x_{t-1})$, gradually corrupts the initial segmentation mask $x_0$ by incrementally adding Gaussian noise over $T$ time steps according to a predefined variance schedule [19]. Conversely, the reverse process, parameterized by $\theta$ and denoted as $p_\theta(x_{t-1} \mid x_t)$, learns to invert this noising procedure by iteratively removing noise, thereby reconstructing the clean segmentation mask from a noisy observation, as shown in Figure 3.
Recent comprehensive surveys have established that diffusion models are widely appreciated for their strong mode coverage and quality of generated samples despite their known computational burdens [20,21]. The field of medical imaging has observed growing interest in diffusion models, capitalizing on advances in computer vision while addressing domain-specific challenges such as data scarcity, annotation variability, and the need for uncertainty quantification. Score-based generative models extend DDPMs using stochastic differential equations (SDEs) [22], where the forward process is a continuous-time diffusion given by $\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$ and the reverse process samples by estimating the score $\nabla_x \log p_t(x)$. In medical segmentation, SDE formulations enable flexible noise schedules and signed distance functions for boundary refinement [18,23], improving stability over discrete-step DDPMs.
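To make the reverse-time SDE concrete, here is a toy Euler–Maruyama step in Python (a sketch under simplifying assumptions: a variance-exploding process with $f(x,t)=0$ and constant $g(t)=\sigma$; `reverse_sde_step` and the `score` callback are hypothetical names, not from the cited works):

```python
import math
import random

def reverse_sde_step(x, t, dt, score, sigma=1.0, z=None):
    """One Euler-Maruyama step of the reverse-time SDE, integrating backward
    from t to t - dt. With f = 0, the reverse drift is -g(t)^2 * score(x, t);
    stepping backward in time flips its sign, giving the update below."""
    if z is None:
        z = random.gauss(0.0, 1.0)  # Brownian increment
    return x + sigma ** 2 * score(x, t) * dt + sigma * math.sqrt(dt) * z
```

In a trained model the `score` callback would be the learned score network; here any function of `(x, t)` can stand in for it.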
3.1.2. Latent Diffusion Models (LDMs)
The high computational cost associated with applying diffusion models directly to high-resolution medical images has motivated the development of Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than at the pixel level [24]. By operating on lower-dimensional representations, LDMs substantially reduce memory consumption and inference time while preserving the expressive power of diffusion-based generative modeling, thereby improving the feasibility of clinical deployment.
The diffusion process is then applied within this latent space, allowing the efficient modeling of complex distributions without the prohibitive computational burden associated with high-resolution pixel-space diffusion. During inference, the denoised latent representations are decoded back into the image or segmentation space, yielding high-quality outputs that retain fine anatomical details [25,26]. Recent studies have demonstrated that latent diffusion models are particularly well suited for medical image segmentation tasks involving large volumetric data, such as MRI and CT imaging, where pixel-space diffusion becomes computationally impractical [27]. Beyond efficiency gains, latent diffusion also facilitates the integration of advanced architectural components, including transformer-based encoders and task-specific conditioning mechanisms, further enhancing segmentation performance in complex clinical scenarios [28]. As a result, LDMs have become a foundational component in many state-of-the-art diffusion-based medical segmentation frameworks and serve as a critical bridge between theoretical diffusion modeling and real-world clinical applicability.
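The encode–diffuse–decode pipeline described above can be sketched as follows (a minimal Python outline; `encoder`, `denoiser`, and `decoder` are callable stand-ins for trained networks, and all names are hypothetical):

```python
import random

def latent_diffusion_segment(encoder, denoiser, decoder, image, T=50):
    """LDM segmentation sketch: encode the image, run the conditional reverse
    process entirely in the compressed latent space, then decode the result
    back to a full-resolution segmentation mask."""
    z_img = encoder(image)                        # compressed image representation
    z = [random.gauss(0.0, 1.0) for _ in z_img]   # start from latent noise
    for t in reversed(range(T)):
        z = denoiser(z, t, z_img)                 # denoising conditioned on z_img
    return decoder(z)                             # back to pixel-space mask
```

The key efficiency gain is that every denoising step operates on the small latent vectors rather than full-resolution volumes; only the single encode and decode touch pixel space.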
3.2. Ambiguous Medical Image Segmentation
Medical image segmentation is inherently ambiguous, as multiple expert annotators may produce distinct yet clinically valid segmentation masks for the same image. This ambiguity arises from factors such as unclear anatomical boundaries, pathological variability, imaging artifacts, and subjective interpretation criteria. Recent research has increasingly recognized ambiguity-aware segmentation as a critical requirement for reliable medical image analysis [4]. This paradigm shift has led to the development of collective and uncertainty-aware segmentation frameworks that better align with clinical practice.
3.2.1. Collective Intelligence Approach
Rahman et al. proposed a framework for ambiguity-aware medical image segmentation through Collectively Intelligent Medical Diffusion (CIMD), which explicitly models the collective intelligence of expert annotators rather than imitating a single best expert annotation [4]. The core insight of CIMD is that aggregated interpretations from multiple experts consistently outperform individual diagnostic judgments in clinical tasks, particularly in complex or ambiguous imaging scenarios. The framework leverages the inherent stochastic sampling process of diffusion models to produce a distribution of segmentation masks, eliminating the need for separate prior distributions during inference, a critical limitation of conditional VAE-based segmentation models [13]. The hierarchical structure enables control over diversity at each time step, producing realistic and heterogeneous segmentation masks that reflect the naturally occurring variation in medical image interpretation.
3.2.2. Ambiguity Modeling Network (AMN)
The technical innovation of CIMD lies in its Ambiguity Modeling Network, which explicitly models the distribution of ground-truth segmentation masks conditioned on an input image. Given an image $I$, the network represents segmentation ambiguity by embedding expert annotation variability into a latent space. Specifically, the network parameterizes a Gaussian distribution over latent variables $z$ with learned mean and variance functions:

$$q_\phi(z \mid I) = \mathcal{N}\big(z;\, \mu_\phi(I),\, \sigma_\phi^2(I)\big)$$

where $\mu_\phi$ and $\sigma_\phi$ denote the mean and standard deviation predicted by the network, $\phi$ represents the learnable parameters, and the latent space $z$ captures the natural variation present in expert-provided segmentation annotations [29].
The latent variable z is integrated into the diffusion framework as a conditional prior that modulates the denoising process. Unlike standard conditional diffusion models that rely solely on the image I for conditioning, the CIMD denoiser is conditioned on both the image I and the latent code z. This formulation allows the model to map a single input image to a distribution of plausible segmentation masks, effectively disentangling the image content from the interpretation style.
The training objective minimizes a variational bound, combining the standard denoising score matching loss with a KL-divergence regularization term for the latent space. The loss function $\mathcal{L}$ is defined as

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t, I, z)\big\|^2\Big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid I)\,\big\|\,\mathcal{N}(0, \mathbf{I})\big)$$

where $x_t$ is the noisy segmentation mask at timestep $t$, $\epsilon$ is the ground-truth noise, and $\beta$ is a weighting factor. The KL term ensures the latent distribution remains close to a standard Gaussian prior, enabling random sampling during inference.
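The two latent-space ingredients of this objective, reparameterized sampling and the closed-form KL term for a diagonal Gaussian, can be sketched in a few lines of Python (hypothetical helper names; not the authors' implementation):

```python
import math
import random

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, keeping the sampling
    step differentiable with respect to mu and log_var."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularizer
    that keeps the latent distribution close to the standard Gaussian prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

At `mu = 0`, `log_var = 0` the KL term vanishes, which is exactly the state the regularizer pulls the latent distribution toward.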
The operational workflow of CIMD can be summarized in two phases:
1. Training Phase: Given an image $I$ and a ground-truth mask $x_0$, the AMN encodes the annotation style into a latent code $z$. The diffusion model is then trained to predict the noise $\epsilon$ added to $x_0$, conditioned on both $I$ and $z$, using the objective in Equation (6).
2. Inference Phase: To generate diverse segmentation hypotheses, a latent code $z$ is sampled from the prior distribution $\mathcal{N}(0, \mathbf{I})$. The diffusion model performs the reverse denoising process starting from pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$, conditioned on $I$ and the sampled $z$, to produce a plausible segmentation mask $\hat{x}_0$. Re-sampling $z$ yields different valid interpretations.
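The inference phase can be outlined as a short sampling loop (a toy Python sketch; `cimd_sample` and the `denoise_fn` callback are hypothetical stand-ins for the trained reverse process, not the authors' code):

```python
import random

def cimd_sample(denoise_fn, image, T=50, n_hypotheses=4, dim=4):
    """Generate multiple plausible masks by re-sampling the latent code z.
    denoise_fn(x_t, t, image, z) -> x_{t-1} stands in for the trained
    reverse diffusion step conditioned on both the image and z."""
    hypotheses = []
    for _ in range(n_hypotheses):
        z = [random.gauss(0.0, 1.0) for _ in range(dim)]   # z ~ N(0, I)
        x = [random.gauss(0.0, 1.0) for _ in range(dim)]   # x_T ~ N(0, I)
        for t in reversed(range(T)):
            x = denoise_fn(x, t, image, z)
        hypotheses.append(x)
    return hypotheses
```

Each loop iteration draws a fresh latent code, so the returned list directly realizes the "one image, many valid interpretations" behavior described above.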
This formulation addresses a fundamental limitation in medical image segmentation, namely the inherent ambiguity arising from inter-observer variability, where different diagnosticians may provide differing interpretations regarding the type, shape, and extent of anatomical abnormalities [7]. Experimental results demonstrate that the proposed framework achieves superior segmentation performance across multiple medical imaging modalities, including computed tomography (CT), ultrasound, and magnetic resonance imaging (MRI), while effectively preserving the naturally occurring variations observed in expert annotations.
3.3. Transformer-Enhanced Diffusion Models
While early diffusion-based segmentation frameworks primarily relied on convolutional architectures such as U-Net, their limited capacity to model long-range dependencies has motivated the integration of transformer mechanisms. Vision Transformers (ViTs) excel in medical image segmentation due to their global self-attention, handling complex structures and large contexts [30]. Transformer-enhanced diffusion models merge transformers’ contextual reasoning with diffusion processes’ generative power for robust segmentation.
MedSegDiff-V2 Framework
Wu et al. developed MedSegDiff-V2, a Transformer-based diffusion framework that addresses the limited performance of naively combining U-Net-based diffusion models with standard transformer blocks. The framework integrates vision transformer mechanisms with a diffusion backbone to improve segmentation across multiple modalities [31]. Its key contribution is a jointly designed architecture, the SS-Former, that enables long-range dependency modeling while preserving the generative properties of the diffusion process.
To clarify the architectural roles, the transformer in MedSegDiff-V2 serves as both the denoiser backbone and a context encoder:
Denoiser Backbone Role: It forms a hybrid U-Net in which Transformer blocks replace convolutional bottlenecks, creating the primary feature extraction pathway.
Context Encoder Role: Self-attention within these blocks captures long-range spatial dependencies from both the input image and the noisy segmentation mask.
Concise Architectural Breakdown:
Input Conditioning: The input medical image is concatenated with the noisy segmentation mask at timestep t, forming the conditioned input to the denoiser.
Hybrid Backbone (SS-Former): A U-Net-like encoder–decoder with skip connections, but convolutional bottlenecks are replaced by Transformer blocks (multi-head self-attention + feed-forward layers) to process features at multiple scales.
Denoising Output: The network predicts the noise $\epsilon_\theta$, conditioned on global context from the transformers, iteratively refining the mask over diffusion steps.
Training/Inference: Trained with standard DDPM loss; inference uses classifier-free guidance for improved sample quality.
Compared to standard U-Net denoisers (which rely on local convolutional inductive biases and struggle with global variations in medical structures like irregular tumors), the Transformer-enhanced design in MedSegDiff-V2 enables better modeling of long-range dependencies, reduces sensitivity to local noise patterns, and improves generalization across modalities (e.g., +2–5% DSC gains in multi-organ CT/US tasks). This results in more robust and semantically aware segmentation, particularly for ambiguous or variable anatomical regions. Extensive testing across 20 medical image segmentation tasks spanning various imaging modalities highlights its clear superiority over previous state-of-the-art approaches.
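The long-range dependency modeling attributed to the transformer blocks comes from self-attention, which can be illustrated in miniature (a toy single-head, scalar-token version in plain Python with `q = k = v = x`; not MedSegDiff-V2's actual SS-Former implementation):

```python
import math

def self_attention_1d(x):
    """Single-head self-attention over a flat token sequence with q = k = v = x.
    Every output position is a softmax-weighted mix of ALL positions, which is
    the mechanism that gives transformer blocks global (long-range) context."""
    n = len(x)
    out = []
    for i in range(n):
        logits = [x[i] * x[j] for j in range(n)]         # q_i . k_j
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]      # numerically stable softmax
        s = sum(weights)
        out.append(sum(w / s * x[j] for j, w in enumerate(weights)))
    return out
```

A convolution would mix only a local neighborhood; here position `i` draws on every `j`, which is why such blocks help with spatially distant but correlated structures like irregular tumor boundaries.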
3.4. Computational Efficiency Innovations
Despite their strong generative and uncertainty modeling capabilities, diffusion models have traditionally been limited by high computational cost, primarily due to the large number of iterative sampling steps required during inference. Conventional Denoising Diffusion Probabilistic Models (DDPMs) require numerous sampling steps, hindering their practicality for time-sensitive medical image segmentation [11,32].
3.4.1. Accelerated Sampling Strategies
Recent research has increasingly prioritized mitigating the substantial computational burden inherent to traditional diffusion models, which typically necessitate up to 1000 sampling steps to synthesize high-fidelity outputs. Consequently, a variety of strategies have been devised to drastically reduce inference latency without compromising segmentation accuracy [33]. Notably, accelerated sampling protocols have substantially reduced inference cost: LDSeg achieves DSC scores comparable to the full 1000-step DDPM baseline using only 2 DDIM sampling steps, representing a 100–500× reduction in network function evaluations. These improvements are pivotal for clinical translation, where real-time or near-real-time processing capabilities are indispensable for effective patient care.
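The DDIM update that makes such few-step sampling possible can be sketched as follows (deterministic $\eta = 0$ variant in toy Python; `ddim_step` is a hypothetical name and the predicted noise is passed in, not computed by a network):

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0): first recover the implied clean
    mask x0_pred from x_t and the predicted noise, then jump directly to any
    earlier timestep. Because the jump size is free, 1000 DDPM steps can be
    replaced by a handful of DDIM steps."""
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]
```

Setting `abar_prev = 1.0` jumps straight to the clean estimate, which is effectively what a 2-step sampler exploits.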
3.4.2. Memory-Efficient Architectures
The intensive computational demands of processing large 3D medical volumes have catalyzed the development of innovative memory-efficient architectures. Patch-based strategies facilitate the handling of volumetric data by partitioning extensive volumes into manageable sub-volumes, thereby substantially reducing memory overhead while preserving segmentation coherence across the full dataset. Concurrently, wavelet-based (or frequency-domain) approaches have demonstrated the capacity to train on resource-constrained hardware, such as a single 40 GB GPU, while achieving state-of-the-art performance in high-resolution medical image generation and segmentation [34]. These advancements represent a critical leap forward in democratizing diffusion models, making them feasible for clinical applications operating on standard hardware configurations.
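The patch-partitioning idea reduces, per axis, to computing overlapping sliding-window start indices, as in this small sketch (a generic scheme with hypothetical names, not tied to any specific framework cited above):

```python
def patch_starts(volume_len, patch_len, overlap):
    """1-D start indices for an overlapping sliding window; applying the same
    scheme along each axis tiles a 3-D volume into sub-volumes. The final
    window is shifted back if needed so the volume end is always covered."""
    stride = patch_len - overlap
    starts = list(range(0, max(volume_len - patch_len, 0) + 1, stride))
    if starts[-1] + patch_len < volume_len:
        starts.append(volume_len - patch_len)  # cover the tail of the volume
    return starts
```

The overlap region is where neighboring patch predictions are typically blended (e.g., averaged) to keep segmentation coherent across patch borders.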
3.5. Conditioned Diffusion for Medical Image Segmentation
Conditional diffusion frameworks enable the denoising process to be guided by auxiliary information, such as input images, anatomical priors, or textual descriptions. Among these variants, text-conditioned diffusion models incorporate semantic guidance derived from natural language to improve segmentation controllability and generalization in limited-data scenarios. Language-guided segmentation frameworks leverage vision–language alignment to incorporate high-level semantic priors, enabling more flexible and target-aware segmentation behavior [35].
DiffBoost Framework
Zhang et al. proposed DiffBoost, a text-guided diffusion framework designed to alleviate data scarcity in medical image segmentation through anatomically constrained synthetic image generation [5]. The framework follows a three-stage pipeline comprising large-scale pretraining on RadImageNet, task-specific fine-tuning, and joint optimization with downstream segmentation objectives.
A key innovation of DiffBoost lies in its use of edge-aware anatomical constraints to guide the diffusion synthesis process, ensuring that generated samples preserve medically relevant structural information. By incorporating textual descriptions as semantic guidance, DiffBoost enables the generation of diverse yet clinically plausible synthetic images, thereby enhancing segmentation performance in data-limited scenarios.
3.6. Comparative Analysis with Traditional Uncertainty Modeling
Diffusion models compare favorably with traditional uncertainty modeling methods. Bayesian segmentation approximates posteriors via variational inference or MC Dropout [36], capturing epistemic uncertainty but requiring 5–10 forward passes at ∼0.01–0.1 s per inference [37]. Diffusion models generate diverse samples natively in 10–50 sampling steps (0.3–2 s), outperforming MC Dropout in AUSE and AURG metrics for reconstruction tasks [38]. The Probabilistic U-Net [39] models aleatoric uncertainty via latent variables but is susceptible to mode collapse; diffusion models avoid this through strong mode coverage [40]. Ensemble methods (e.g., nnU-Net 5-fold) improve DSC by 1–5% at the cost of five times the training time, whereas diffusion-based sampling achieves comparable gains within a single trained model. For ambiguous cases, diffusion models capture the full posterior distribution $p(x_0 \mid I)$, sampling plausible segmentations from high-density regions and naturally capturing inter-observer variability [41].
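A common way to turn a set of sampled masks into a usable uncertainty estimate is a per-pixel variance map, sketched below (generic illustration with a hypothetical function name; masks are flat lists of soft values):

```python
def uncertainty_map(samples):
    """Per-pixel variance across sampled segmentation masks. High values flag
    regions where the hypotheses disagree, i.e., inter-observer-style
    ambiguity that a single deterministic prediction would hide."""
    n = len(samples)
    means = [sum(col) / n for col in zip(*samples)]
    return [sum((s[i] - means[i]) ** 2 for s in samples) / n
            for i in range(len(means))]
```

In practice such maps are overlaid on the image so clinicians can see exactly which boundary regions the model considers ambiguous.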
4. Types of Diffusion Models for Segmentation
The application of diffusion models to medical image segmentation has diversified into several distinct methodological paradigms, each tailored to address specific challenges inherent to medical imaging [42]. Rather than constituting a single unified approach, diffusion-based segmentation methods differ substantially in terms of architectural design, conditioning strategy, uncertainty modeling, and computational efficiency. To provide a systematic overview, Table 1 presents a structured taxonomy of the main categories of diffusion models developed for medical image segmentation, highlighting their core design principles, strengths, limitations, and intended use cases.
4.1. Denoising Diffusion Probabilistic Models (DDPMs)
Standard Denoising Diffusion Probabilistic Models represent the earliest and most straightforward application of diffusion processes to medical image segmentation [9]. These approaches directly adapt vanilla diffusion formulations to segmentation tasks, serving as a foundational baseline for subsequent architectural and methodological innovations.
Vanilla DDPM for Segmentation
The foundational approach applies standard Denoising Diffusion Probabilistic Models directly to medical image segmentation tasks. These models define a forward diffusion process where segmentation masks are gradually perturbed over multiple steps by adding Gaussian noise, followed by learning to reverse this diffusion process to retrieve desired noise-free segmentation masks from noisy samples [
47]. Standard DDPMs demonstrate strong mode coverage and quality of generated samples, making them particularly suitable for medical applications where precision and reliability are paramount. However, they suffer from significant computational burdens due to the iterative nature of the denoising process, typically requiring up to 1000 sampling steps for high-quality generation.
4.2. Conditional Diffusion Models
Conditional diffusion models extend vanilla diffusion formulations by incorporating auxiliary information to guide the denoising process. In medical image segmentation, conditioning is essential for ensuring that generated segmentation masks remain anatomically consistent with the corresponding input images.
Image-Conditioned Diffusion
Image-conditioned diffusion models represent a crucial advancement for medical image segmentation, where the denoising process is guided by input medical images, as shown in
Figure 4. These models learn to generate segmentation masks conditioned on the corresponding medical images, enabling controlled generation of anatomically consistent segmentation outputs [
43]. The conditioning mechanism typically involves concatenating image features with noisy segmentation masks at each denoising step, allowing the model to leverage anatomical information for accurate mask generation. This approach has demonstrated superior performance across multiple medical imaging modalities including CT, ultrasound, and MRI [
31].
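The concatenation-based conditioning mechanism described above can be illustrated with a minimal stand-in denoiser: at each step the noisy mask and the conditioning image are stacked along the channel axis and mapped jointly to a noise estimate. The `toy_denoiser` below is an untrained 1x1 linear map, a hypothetical placeholder rather than any cited network.

```python
import numpy as np

# Image conditioning via channel concatenation: the denoiser always sees
# the (noisy mask, image) pair, so its noise prediction can exploit
# anatomical structure in the image.
rng = np.random.default_rng(0)
H = W = 16
image = rng.random((1, H, W))            # conditioning image, 1 channel
x_t = rng.standard_normal((1, H, W))     # noisy mask at some step t

W_out = rng.standard_normal((1, 2)) * 0.1  # hypothetical 1x1-conv weights

def toy_denoiser(noisy_mask, cond_image):
    """Predict noise from the concatenated (mask, image) input."""
    inp = np.concatenate([noisy_mask, cond_image], axis=0)  # (2, H, W)
    # 1x1 "convolution": mix the two channels at every pixel.
    return np.tensordot(W_out, inp, axes=([1], [0]))        # (1, H, W)

eps_hat = toy_denoiser(x_t, image)
```

In practice the linear map is replaced by a full U-Net, but the interface is the same: every reverse step receives the image alongside the current noisy mask.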
4.3. Latent Diffusion Models (LDMs)
Latent Diffusion Models are designed to improve the computational efficiency of diffusion-based medical image segmentation by shifting the diffusion process from pixel space to a compressed latent representation [
44]. By operating on lower-dimensional feature spaces, LDMs substantially reduce memory consumption and inference time while preserving segmentation fidelity, making them well suited for high-resolution medical imaging applications.
Compressed Latent Space Processing
In latent diffusion-based segmentation frameworks, pre-trained encoders are used to project medical images and corresponding segmentation masks into compact latent representations. The diffusion process is then performed entirely within this latent space, where a denoising model learns to generate clean latent segmentation representations from noisy latent representations [
48].
Following diffusion, a decoder reconstructs the final segmentation masks in pixel space. This three-stage design, comprising an encoder, a latent diffusion model, and a decoder, enables efficient processing of high-resolution scans such as MRI and CT images while maintaining competitive segmentation accuracy. As a result, latent diffusion models provide a practical balance between computational efficiency and representational expressiveness in medical image segmentation tasks [
49].
Throughout this review, the term ‘noisy segmentation mask’ refers to pixel-space diffusion, whereas ‘noisy latent representation’ denotes the equivalent in compressed latent space (LDMs).
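The encoder, latent diffusion, decoder pipeline can be sketched end to end. To keep the sketch runnable, the "encoder" and "decoder" below are simple average-pool and nearest-upsample stand-ins, and the reverse process is a placeholder shrinkage loop; all three are hypothetical substitutes for the trained components.

```python
import numpy as np

# Three-stage latent diffusion: encode to a compact latent, denoise
# there, decode back to pixel space. Compression factor f = 4 reduces a
# 64x64 mask to a 16x16 latent (16x fewer values per diffusion step).
rng = np.random.default_rng(0)
f = 4

def encode(mask):                    # (H, W) -> (H/f, W/f), average pooling
    H, W = mask.shape
    return mask.reshape(H // f, f, W // f, f).mean(axis=(1, 3))

def decode(z):                       # nearest-neighbor upsampling
    return np.repeat(np.repeat(z, f, axis=0), f, axis=1)

def denoise_in_latent(z_noisy, steps=10):
    z = z_noisy
    for _ in range(steps):
        z = 0.9 * z                  # placeholder for p_theta(z_{t-1} | z_t)
    return z

mask = np.zeros((64, 64)); mask[16:48, 16:48] = 1.0
z0 = encode(mask)
z_noisy = z0 + 0.5 * rng.standard_normal(z0.shape)
recon = decode(denoise_in_latent(z_noisy))
```

The efficiency argument is visible in the shapes: every diffusion step touches a 16x16 latent instead of the full 64x64 mask, and the saving grows quadratically (cubically for 3D volumes) with the compression factor.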
4.4. Specialized Noise Models
While most diffusion-based segmentation frameworks rely on Gaussian noise formulations, medical image segmentation, particularly binary segmentation, often benefits from noise models that better reflect the discrete nature of segmentation masks [
50]. Specialized noise models adapt the diffusion process to non-continuous data distributions, enabling more effective modeling of binary and categorical segmentation outputs.
Binary Bernoulli Diffusion Model (BBDM)
The Binary Bernoulli Diffusion Model represents a significant innovation specifically designed for binary segmentation tasks. Unlike traditional diffusion models that apply Gaussian noise to continuous data, BBDM uses Bernoulli noise as the diffusion kernel to enhance the capacity of the diffusion model for binary segmentation tasks.
BerDiff, a prominent implementation of BBDM, introduces randomness through sampling of initial noise and latent variables, producing diverse segmentation masks that effectively highlight regions of interest [
45]. This approach is particularly valuable for medical image segmentation where binary masks represent the presence or absence of anatomical structures or pathological regions.
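A Bernoulli forward process can be sketched as follows: instead of adding Gaussian noise, each pixel is resampled from a Bernoulli distribution whose mean interpolates between the clean binary mask and an uninformative 0.5. The schedule values are illustrative choices, not the ones used in BerDiff.

```python
import numpy as np

# Bernoulli forward process for a binary mask: at step t, pixel i is 1
# with probability x0_i * alpha_bar_t + 0.5 * (1 - alpha_bar_t), so the
# sample stays binary at every step and converges to fair coin flips.
rng = np.random.default_rng(0)
T = 100
alpha_bars = np.linspace(0.999, 0.001, T)   # illustrative signal schedule

def bernoulli_q_sample(x0, t, rng):
    p = x0 * alpha_bars[t] + 0.5 * (1.0 - alpha_bars[t])
    return (rng.random(x0.shape) < p).astype(np.float64)

x0 = np.zeros((32, 32)); x0[8:24, 8:24] = 1.0
x_early = bernoulli_q_sample(x0, 5, rng)     # still close to x0
x_late = bernoulli_q_sample(x0, T - 1, rng)  # near pure coin flips

# Early samples agree with the clean mask far more often than late ones.
assert (x_early == x0).mean() > (x_late == x0).mean()
```

Because every intermediate state is itself a valid binary mask, the reverse model never has to round a continuous estimate, which is the motivation for this kernel in binary segmentation.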
4.5. Transformer-Enhanced Diffusion Models
Transformer-enhanced diffusion models combine the probabilistic generative strength of diffusion processes with the global contextual modeling capability of transformers. This class of methods is designed to overcome the limited receptive field of convolutional architectures by enabling long-range dependency modeling, which is particularly important for medical image segmentation involving complex anatomical structures and large spatial contexts.
MedSegDiff-V2 Framework
MedSegDiff-V2 represents a novel Transformer-based Diffusion framework that addresses the limitation of simply combining UNet-based diffusion models with transformers. The framework integrates vision transformer mechanisms with diffusion models to enhance medical image segmentation across multiple modalities [
31]. The key innovation lies in the SS-Former architecture that learns the interaction between noise and semantic features. This approach leverages the long-range dependency modeling capabilities of transformers while maintaining the generative advantages of diffusion models.
4.6. Hybrid Diffusion Models
Hybrid diffusion models combine discriminative segmentation networks with generative diffusion processes to leverage the complementary strengths of both paradigms. While discriminative models provide efficient and accurate initial predictions, diffusion-based refiners enable probabilistic correction and structural refinement by modeling the underlying data distribution. This hybrid design aims to improve segmentation robustness without incurring the full computational cost of purely generative diffusion pipelines.
HiDiff: Hybrid Discriminative–Generative Framework
HiDiff represents a novel hybrid approach that synergizes discriminative and generative modeling paradigms for medical image segmentation [
46]. The framework comprises two fundamental components: a discriminative segmentor and a diffusion refiner. The discriminative segmentor utilizes conventional trained segmentation models to provide segmentation mask priors, while the diffusion refiner employs a Binary Bernoulli Diffusion Model to effectively refine segmentation masks.
The key innovation lies in the alternate-collaborative training strategy, where the segmentor and BBDM [
51] are trained to mutually enhance each other’s performance. This approach addresses the limitation of purely discriminative methods that neglect underlying data distribution and intrinsic class characteristics.
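The hybrid pattern above can be illustrated conceptually: a discriminative segmentor supplies a prior mask, and a refiner perturbs and cleans it rather than starting from pure noise. Both components below are crude placeholders (a threshold segmentor and a flip-then-majority-vote loop), chosen only to make the control flow runnable; they are not the trained networks of HiDiff.

```python
import numpy as np

# Hybrid discriminative-generative sketch: segmentor prior -> iterative
# Bernoulli-style perturbation and "denoising" of the prior mask.
rng = np.random.default_rng(0)

def segmentor_prior(image):
    # Stand-in discriminative model: threshold the image intensities.
    return (image > 0.7).astype(np.float64)

def diffusion_refine(prior, steps=5, flip_p=0.05):
    mask = prior.copy()
    for _ in range(steps):
        flips = rng.random(mask.shape) < flip_p         # Bernoulli noise
        mask = np.where(flips, 1.0 - mask, mask)
        # "Denoise": 3x3 majority vote (placeholder for the trained BBDM).
        padded = np.pad(mask, 1, mode="edge")
        votes = sum(padded[i:i + 32, j:j + 32]
                    for i in range(3) for j in range(3))
        mask = (votes >= 5).astype(np.float64)
    return mask

image = rng.random((32, 32)); image[10:22, 10:22] += 0.4
prior = segmentor_prior(image)
refined = diffusion_refine(prior)
```

The point of the sketch is the division of labor: the cheap discriminative pass does most of the work, and the iterative refiner only has to correct structural errors around the prior.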
8. Challenges and Limitations
8.1. Computational Complexity and Resource Requirements
Despite their strong representational capacity, diffusion-based segmentation models face notable challenges related to computational cost and hardware constraints, which continue to hinder large-scale and real-time clinical deployment.
8.1.1. Iterative Denoising Overhead
A core limitation of diffusion models arises from their reliance on an iterative denoising process, in which segmentation masks are progressively refined over multiple inference steps. Conventional Denoising Diffusion Probabilistic Models (DDPMs) typically require hundreds to thousands of sampling iterations to achieve high-quality outputs, resulting in substantial computational overhead [
11,
19].
This inference latency is particularly restrictive in clinical environments where real-time or near-real-time performance is essential, such as ultrasound-guided interventions, emergency diagnostics, and intraoperative imaging. Although accelerated sampling methods have alleviated this burden to some extent, the trade-off between sampling speed and segmentation fidelity remains a critical challenge [
21].
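One widely used acceleration strategy is to visit only a strided subsequence of the timestep grid, in the spirit of DDIM. The sketch below shows the schedule bookkeeping and the deterministic (eta = 0) update; the denoiser is a hypothetical zero-predictor so the loop stays runnable without a trained network.

```python
import numpy as np

# Strided sampling: 50 reverse steps instead of 1000, reusing the same
# noise schedule. Fewer steps means lower latency at some cost in fidelity.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_timesteps(num_steps):
    """Evenly strided subsequence of [0, T), visited in reverse."""
    return np.linspace(0, T - 1, num_steps, dtype=int)[::-1]

def ddim_step(x_t, t, t_prev, eps_hat):
    """Deterministic DDIM-style update (eta = 0)."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
steps = ddim_timesteps(50)
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps_hat = np.zeros_like(x)      # stand-in for the trained denoiser
    x = ddim_step(x, t, t_prev, eps_hat)
```

The 20x reduction in network evaluations (50 vs. 1000) translates almost directly into a 20x latency reduction, which is why strided samplers dominate clinically oriented deployments.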
8.1.2. Memory Limitations for 3D Processing
Diffusion models also encounter significant memory constraints when applied to high-resolution 3D medical imaging data, including volumetric CT and MRI scans. The need to store intermediate feature maps across multiple diffusion steps often leads to excessive GPU memory consumption, limiting batch size and spatial resolution during both training and inference [
74].
These memory limitations pose a major obstacle for comprehensive volumetric analysis, which is essential for accurate anatomical delineation and clinical decision-making. While latent diffusion and patch-based processing strategies offer partial mitigation, achieving efficient full-volume 3D diffusion-based segmentation remains an open research problem [
70].
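The patch-based mitigation mentioned above is typically implemented as sliding-window inference with overlap averaging. The sketch below shows the mechanics on a toy volume; `predict_patch` is a trivial thresholding placeholder standing in for a per-patch diffusion model.

```python
import numpy as np

# Sliding-window 3D inference: process overlapping sub-volumes
# independently and average predictions where patches overlap, so only
# one patch ever needs to be resident in accelerator memory at a time.
def predict_patch(patch):
    return (patch > 0.5).astype(np.float64)   # placeholder model

def sliding_window(volume, patch=32, stride=16):
    out = np.zeros_like(volume, dtype=np.float64)
    count = np.zeros_like(volume, dtype=np.float64)
    D, H, W = volume.shape
    for z in range(0, D - patch + 1, stride):
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                sl = (slice(z, z + patch), slice(y, y + patch),
                      slice(x, x + patch))
                out[sl] += predict_patch(volume[sl])
                count[sl] += 1.0
    return out / np.maximum(count, 1.0)       # overlap averaging

rng = np.random.default_rng(0)
vol = rng.random((64, 64, 64))                # toy stand-in for a CT volume
seg = sliding_window(vol)
```

Overlapping strides smooth seams between patches but multiply the number of (already expensive) diffusion passes, which is the trade-off noted above.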
8.2. Training and Data Challenges
While diffusion models provide powerful mechanisms for modeling uncertainty and ambiguity, their effectiveness in medical image segmentation is strongly constrained by data availability, annotation quality, and evaluation limitations.
8.2.1. Annotation Requirements and Cost
Training diffusion-based models for ambiguous medical image segmentation often necessitates annotations from multiple expert radiologists to accurately capture inter-observer variability. Acquiring such multi-rater annotations is both time-consuming and financially expensive, significantly increasing the data collection burden compared to conventional single-label segmentation pipelines [
75,
76].
This challenge is particularly pronounced in specialized clinical domains such as oncology, cardiology, and neuroimaging, where expert availability is limited and annotation protocols are complex. As a result, the scalability of diffusion-based ambiguous segmentation methods remains constrained, motivating research into weakly supervised learning, annotation-efficient training strategies, and synthetic label generation [
77].
8.2.2. Ground Truth Distribution Limitations
In real-world clinical settings, ground truth for ambiguous segmentation is typically represented by a small and finite set of expert annotations, rather than a well-defined continuous distribution. This sparse sampling of the annotation space poses challenges for both robust model training and reliable performance evaluation [
78].
Evaluation metrics such as the Generalized Energy Distance (GED) are commonly used to assess the alignment between predicted and ground-truth segmentation distributions. However, GED and related distribution-based metrics can be unstable or biased when ground truth distributions are approximated using limited samples, potentially leading to unreliable assessments of model performance [
79].
These limitations highlight the need for improved evaluation protocols and uncertainty-aware metrics that remain robust under sparse annotation regimes, as well as training objectives that can better exploit limited distributional supervision.
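To make the GED instability discussed above concrete, the metric itself is straightforward to compute from two finite sample sets. The sketch below uses the common choice d(a, b) = 1 - IoU(a, b); the masks are random toy data, not annotations from any dataset.

```python
import numpy as np

# Generalized Energy Distance between model samples S and expert
# annotations Y:  GED^2 = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')],
# estimated here from small finite sets, which is exactly the sparse
# regime where the estimate becomes unstable.
def iou_dist(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - (inter / union if union > 0 else 1.0)

def ged_squared(samples, annotations):
    cross = np.mean([iou_dist(s, y) for s in samples for y in annotations])
    within_s = np.mean([iou_dist(s, s2) for s in samples for s2 in samples])
    within_y = np.mean([iou_dist(y, y2) for y in annotations
                        for y2 in annotations])
    return 2 * cross - within_s - within_y

rng = np.random.default_rng(0)
base = rng.random((32, 32)) > 0.5
# Toy "samples" and "annotations": the base mask with sparse random flips.
samples = [np.logical_xor(base, rng.random((32, 32)) > 0.95)
           for _ in range(4)]
annotations = [np.logical_xor(base, rng.random((32, 32)) > 0.95)
               for _ in range(3)]
g = ged_squared(samples, annotations)
```

With only three annotations, the E[d(Y, Y')] term is averaged over nine pairs (three of them trivially zero), which illustrates why the estimate is biased and high-variance under sparse multi-rater supervision.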
8.3. Clinical Translation Barriers
Despite strong methodological advances, the translation of diffusion-based medical image segmentation models from research prototypes to routine clinical use remains limited. Key barriers include insufficient clinical validation and challenges in integrating these models into real-world healthcare workflows.
8.3.1. Limited Clinical Validation
Although diffusion models demonstrate strong segmentation performance in controlled experimental settings, systematic clinical validation remains insufficient. Most existing studies evaluate performance using technical metrics on retrospective, single-institution datasets, a standard that is insufficient to demonstrate clinical readiness. Specific gaps and concrete recommendations include the following: (a) External validation: Validation on at least two independent cohorts from different institutions, scanner vendors, and imaging protocols is necessary to demonstrate robustness beyond the training distribution [
80]. (b) Prospective reader studies: Clinical impact should be quantified through controlled reader studies comparing AI-assisted versus unassisted radiologist performance, measuring inter-reader agreement improvement, time-to-diagnosis reduction, and diagnostic accuracy changes. (c) Clinically meaningful endpoints: Beyond Dice/IoU, evaluations should report treatment decision concordance (e.g., radiotherapy target volume agreement within clinical margins), false positive/negative rates at actionable thresholds, and time savings per workflow step. (d) Population diversity: Studies must validate across heterogeneous patient populations (varied age, comorbidities, disease stage, ethnicity) and include pathological edge cases, not only typical presentations present in public benchmarks. Addressing these gaps is a prerequisite for regulatory approval and broad clinical adoption of diffusion-based segmentation systems [
81].
8.3.2. Integration with Clinical Workflows
Integrating diffusion-based segmentation models into routine clinical practice involves practical and technical challenges across three areas. First, AI systems must interoperate with established medical imaging infrastructure, including Picture Archiving and Communication Systems (PACS) and Electronic Health Records (EHR) [
82], which manage image storage and patient data. Second, inference latency is a critical operational constraint: for time-sensitive applications such as emergency diagnostics, intraoperative imaging, and point-of-care ultrasound, segmentation must complete within seconds. Third, user interaction design must allow clinicians to efficiently review, adjust, and approve automatically generated segmentations. Because diffusion models produce probabilistic outputs, communicating uncertainty through interpretable visualizations or confidence maps is essential for clinical trust. Evaluation should therefore extend beyond algorithmic accuracy to include workflow-level metrics: reductions in reporting time, improvements in inter-observer consistency, and compatibility with downstream tasks such as radiotherapy planning and surgical guidance [
83,
84].
To clarify how current limitations motivate future research efforts,
Table 6 summarizes the key challenges identified in diffusion-based medical image segmentation and the corresponding future directions discussed in this review.
8.4. Explainability Challenges
A key limitation of diffusion-based segmentation models is their lack of explainability, which poses a significant barrier to regulatory approval and clinical adoption of generative AI in medicine [FDA, 2024; EU AI Act, 2024]. Unlike deterministic methods such as U-Net, which allow direct inspection of feature maps, diffusion models operate as black boxes: the final segmentation mask emerges from many iterative denoising steps driven by learned score functions, making it difficult for clinicians to understand why a particular boundary was chosen or why uncertainty is elevated in a given region. This gap is especially evident when compared to established reconstruction methods. In CT imaging, filtered back-projection (FBP) remains a clinical standard because its mathematical formulation is transparent: streak artifacts from metal implants or beam hardening are well understood and teachable. In contrast, diffusion-based methods such as MedSegDiff-V2 and HiDiff offer only post hoc approximations (e.g., attention maps, uncertainty heatmaps) that are not guaranteed to reflect the true decision process. Current attempts to mitigate this include the following:
attention visualization in transformer-enhanced models (e.g., SS-Former in MedSegDiff-V2) [
31]; uncertainty maps from stochastic sampling [
4]; and counterfactual generation (what-if mask perturbations).
These post hoc methods remain insufficient for regulatory purposes: to date, no diffusion segmentation framework has been validated under explainability requirements for FDA Class II/III devices or CE marking under the EU AI Act. Alongside computational cost and data scarcity, this explainability gap is a primary reason diffusion models remain largely confined to research settings.
10. Discussion
This comprehensive review has examined the transformative impact of diffusion models on medical image segmentation, revealing a field that has rapidly evolved from theoretical foundations to practical clinical applications. The analysis demonstrates that diffusion models represent a paradigm shift in medical imaging, offering unique capabilities that address fundamental challenges while introducing novel possibilities for improved patient care and diagnostic accuracy.
10.1. Key Achievements and Contributions
The field has witnessed remarkable progress across multiple dimensions. Computational efficiency improvements have substantially reduced inference latency: LDSeg reduced per-frame processing time from 91.23 s (pixel-space DDPM baseline) to 0.34 s using 2-step DDIM sampling on the Echo dataset, while maintaining a DSC of 0.92, competitive with the baseline. The development of latent diffusion models, accelerated sampling protocols, and memory-efficient architectures has made real-time clinical deployment feasible. Methodological innovations have established diffusion models as uniquely suited for medical imaging challenges. The introduction of ambiguous segmentation modeling through frameworks like CIMD addresses the fundamental reality that medical image interpretation inherently involves uncertainty and inter-observer variability. This capability to generate multiple plausible segmentation outputs provides clinicians with valuable uncertainty quantification that traditional deterministic models cannot offer. The versatility across imaging modalities has been conclusively demonstrated, with successful applications spanning MRI, CT, ultrasound, and X-ray imaging. Notable achievements include DSC scores of 0.96 for knee cartilage segmentation, a +13.87% improvement in breast ultrasound segmentation, and superior performance across 20 different medical segmentation tasks.
10.2. Addressing Critical Medical Imaging Challenges
Diffusion models have successfully addressed several critical limitations of traditional approaches. Data scarcity, a persistent challenge in medical imaging due to privacy concerns and annotation costs, has been mitigated through sophisticated synthetic data generation capabilities. Studies demonstrate that CNNs trained on diffusion-generated synthetic data achieve comparable performance to those trained on original datasets. The uncertainty quantification capabilities inherent in diffusion models address a fundamental gap in traditional segmentation approaches. The ability to provide pixel-wise uncertainty maps and multiple plausible outputs enables more informed clinical decision-making, particularly in ambiguous cases where segmentation accuracy directly impacts treatment planning.
10.3. Current Limitations and Ongoing Challenges
Despite significant progress, several challenges remain. Computational complexity continues to pose barriers for widespread deployment, particularly in resource-constrained clinical environments. While substantial improvements have been achieved, further optimization is needed for universal clinical adoption. Limited clinical validation represents a significant gap between research achievements and clinical translation. More comprehensive multi-institutional studies focusing on clinical outcomes rather than technical metrics are essential for demonstrating real-world utility.
10.4. Future Outlook and Transformative Potential
The future of diffusion models in medical image segmentation appears exceptionally promising. Emerging research directions including edge computing deployment, multi-modal fusion frameworks, and personalized medicine applications suggest continued rapid advancement. The development of specialized medical architectures designed specifically for healthcare applications will likely yield substantial improvements. Regulatory framework development and standardized evaluation protocols will facilitate broader clinical adoption and ensure patient safety. The establishment of open-source ecosystems and international collaboration networks will accelerate progress and enable global impact.
10.5. Broader Implications for Medical Imaging
This review reveals that diffusion models represent more than incremental improvements in segmentation accuracy; they fundamentally change how we approach medical image analysis. The ability to model uncertainty, generate synthetic training data, and provide multiple plausible interpretations aligns closely with the inherent nature of medical diagnosis, where ambiguity and uncertainty are natural components of clinical decision-making. The democratization of advanced medical imaging through synthetic data generation and reduced annotation requirements has the potential to improve healthcare access globally, particularly in resource-limited settings where expert annotations are scarce.
10.6. Final Perspectives
The comprehensive analysis presented in this review demonstrates that diffusion models have successfully established themselves as transformative technologies in medical image segmentation. The combination of superior technical performance, unique uncertainty quantification capabilities, computational efficiency improvements, and clinical validation results positions these models as essential tools for the future of medical imaging. The field has progressed from initial theoretical explorations to practical clinical implementations, with a clear trajectory toward widespread adoption and integration into routine medical practice. The consistent year-on-year increase in relevant publications, from fewer than 10 diffusion-segmentation papers in 2020 to more than 60 in 2025 according to our systematic search, reflects sustained and growing community investment in this technology. As we look toward the future, diffusion models are poised to play an increasingly central role in medical imaging, contributing to improved diagnostic accuracy, enhanced clinical decision-making, and ultimately better patient outcomes. The continued collaboration between computer scientists, medical professionals, and healthcare institutions will be essential for realizing the full potential of these promising technologies.
11. Conclusions
In conclusion, this review establishes Denoising Diffusion Probabilistic Models as a transformative advancement in medical image segmentation. By effectively tackling critical challenges such as data scarcity, inter-observer variability, and uncertainty quantification, diffusion models have demonstrated superior performance across diverse modalities like MRI, CT, and ultrasound. The field has seen remarkable progress in terms of both accuracy and efficiency, evidenced by Dice scores reaching up to 0.96 and drastic reductions in processing time from 91.23 s to 0.34 s. While challenges remain regarding computational costs, the continued development of latent diffusion and transformer-based architectures offers a promising path for clinical translation. This work serves as a comprehensive resource and roadmap for future research, paving the way for more precise and reliable diagnostic tools.