1. Introduction
Medical image segmentation represents one of the most fundamental and challenging tasks in medical image analysis, serving as a critical component for accurate diagnosis, treatment planning, and disease monitoring [1,2,3]. The precise delineation of anatomical structures and pathological regions from medical images such as CT scans, MRI, ultrasound, and histopathology slides directly impacts clinical decision-making and patient outcomes [4,5]. A critical limitation of conventional segmentation methods lies in their inability to adequately model uncertainty and inter-observer variability, which are intrinsic to medical image interpretation. Differences in clinical expertise, annotation protocols, and subjective judgment often result in multiple plausible interpretations of the same image [6,7]. Traditional deterministic models typically produce a single segmentation output, failing to capture this uncertainty and potentially leading to overconfident predictions in clinically ambiguous cases [8]. This limitation motivates the exploration of probabilistic and generative modeling paradigms that can better reflect the realities of clinical practice.
In this context, denoising diffusion probabilistic models (DDPMs) have recently emerged as a powerful class of generative models and have attracted growing interest in medical image analysis. Diffusion models learn data distributions through a gradual denoising process, in which input data are progressively perturbed by Gaussian noise and subsequently reconstructed through a learned reverse process [9,10,11]. This formulation enables stable training and flexible modeling of complex, high-dimensional data distributions.
Forward Diffusion Process (Noising): The forward process gradually adds Gaussian noise to a clean segmentation mask ($x_0$) over $T$ steps according to a variance schedule $\{\beta_t\}_{t=1}^{T}$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I}\big)$$

Using the property of Gaussian distributions, we can sample $x_t$ directly from $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
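As an illustration of the closed-form forward process above, the following minimal Python sketch (plain lists standing in for mask tensors; `make_schedule` and `q_sample` are hypothetical names, not from any cited framework) builds a linear variance schedule and noises a mask directly to step $t$:

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_t and cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bars = []
    prod = 1.0
    for b in betas:
        prod *= (1.0 - b)  # alpha_t = 1 - beta_t
        alpha_bars.append(prod)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps=None):
    """Sample x_t ~ q(x_t | x_0) in closed form for a mask given as a flat list."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t \to T$, the mask's signal fades and the sample approaches pure Gaussian noise.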
Reverse Diffusion Process (Denoising): The goal is to learn a model $p_\theta$ that reverses this process, starting from Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ to reconstruct the segmentation mask $x_0$. In segmentation tasks, this is conditional, meaning the model also takes the original medical image ($I$) as input:

$$p_\theta(x_{t-1} \mid x_t, I) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t, I),\, \Sigma_\theta(x_t, t, I)\big)$$

The mean $\mu_\theta$ is typically parameterized to predict the noise $\epsilon_\theta$:

$$\mu_\theta(x_t, t, I) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, I)\right)$$
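The noise-prediction parameterization above translates into a single reverse update as sketched below (a toy Python version on flat lists; `p_sample_step` is a hypothetical name, and the network output $\epsilon_\theta$ is passed in rather than computed):

```python
import math
import random

def p_sample_step(x_t, t, eps_pred, betas, alpha_bars):
    """One reverse step x_t -> x_{t-1} using the noise-prediction mean mu_theta.
    eps_pred stands in for the network output epsilon_theta(x_t, t, I)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t = alpha_bars[t]
    # mu_theta = (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
    mean = [(x - beta_t / math.sqrt(1.0 - abar_t) * e) / math.sqrt(alpha_t)
            for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean  # no noise is added at the final step
    sigma = math.sqrt(beta_t)  # a common simple choice for Sigma_theta
    return [m + sigma * random.gauss(0.0, 1.0) for m in mean]
```

Iterating this step from $t = T-1$ down to $0$ yields a sampled segmentation mask; re-running the loop with fresh noise yields a different plausible mask.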
Compared with earlier generative approaches such as Generative Adversarial Networks (GANs) [12] and Variational Autoencoders (VAEs), diffusion models exhibit superior mode coverage and sample fidelity, reducing common issues such as mode collapse and training instability [13,14]. These properties make diffusion-based frameworks particularly well suited for medical imaging applications, where robustness, reliability, and faithful representation of anatomical variability are essential. As a result, diffusion models have begun to play an increasingly important role in addressing long-standing challenges in medical image segmentation, including uncertainty modeling, data scarcity, and generalization across heterogeneous imaging modalities.
1.1. Motivation and Clinical Significance
The application of diffusion models to medical image segmentation addresses several critical limitations of existing approaches. Collective insights from expert groups have consistently outperformed individual diagnostic capabilities in clinical tasks [15,16]. However, the majority of AI-based segmentation methods are designed to replicate a single “optimal” expert annotation, implicitly assuming the existence of a unique ground truth. This assumption is often violated in medical imaging, where inter-observer variability and ambiguous anatomical boundaries are common [17].
Medical image segmentation is inherently uncertain, as different expert annotators may provide distinct yet equally valid segmentation masks for the same anatomical structure. Such ambiguity arises from factors including pathological variability, unclear tissue boundaries, and subjective anatomical definitions, particularly in complex or low-contrast imaging scenarios [18].
Diffusion-based segmentation frameworks offer a principled alternative by modelling the full distribution of plausible segmentation outcomes rather than producing a single deterministic prediction. This enables the generation of multiple segmentation hypotheses from the same input image, directly capturing inter-observer variability. The resulting uncertainty estimates inform diagnostic confidence and support treatment planning decisions in ambiguous cases, aligning these models more naturally with the realities of clinical practice than conventional deterministic approaches.
1.2. Advantages of Diffusion Models in Medical Segmentation
Diffusion models offer several compelling advantages for medical image segmentation tasks. The inherent stochastic sampling process of diffusion models can be leveraged to generate multiple plausible segmentation masks from a single input image, naturally modeling the uncertainty and variability present in medical annotations [18]. This capability eliminates the need for separate prior distributions during inference, which is critical for conditional VAE-based segmentation models [13].
Another important advantage of diffusion models is their hierarchical denoising structure, which allows the progressive refinement of segmentation masks across multiple time steps. This multi-stage generation process enables fine-grained control over the diversity and realism of the predicted segmentations, resulting in anatomically coherent yet heterogeneous outputs. Such control is particularly valuable in medical imaging, where capturing subtle anatomical variations and pathological heterogeneity is essential for robust clinical assessment.
In addition, diffusion models exhibit superior mode coverage compared to adversarial generative frameworks such as GANs. By avoiding adversarial training dynamics, diffusion-based approaches reduce the risk of mode collapse and ensure that less frequent but clinically relevant anatomical patterns are adequately represented. This property is critical in medical segmentation tasks, where rare anatomical variants or pathological structures may carry significant diagnostic importance and should not be systematically overlooked.
1.3. Computational Challenges and Recent Advances
Despite their conceptual advantages, the deployment of diffusion models for medical image segmentation is constrained by substantial computational demands. Classical denoising diffusion probabilistic models rely on iterative reverse sampling processes that may require hundreds to thousands of sampling steps to achieve high-quality outputs. Such computational overhead poses a significant challenge for clinical applications, where timely inference and resource efficiency are critical for real-world usability [9,11].
Recent methodological advances have substantially mitigated these limitations. Latent diffusion models reduce computational complexity by performing the diffusion process in a compressed latent space rather than directly operating on high-resolution medical images. This strategy enables efficient segmentation while preserving anatomical fidelity and has proven particularly effective for large-scale and high-dimensional medical imaging data [10,13].
In parallel, transformer-enhanced diffusion architectures have emerged as a promising direction for improving both efficiency and representational capacity. By integrating vision transformer mechanisms into diffusion-based segmentation frameworks, these models can better capture long-range spatial dependencies and global contextual information, which are essential for accurate anatomical delineation in complex medical images [14]. Collectively, these advances have brought diffusion-based segmentation closer to practical clinical deployment, narrowing the gap between theoretical promise and real-world feasibility.
1.4. Scope and Objectives
This review provides a comprehensive and systematic analysis of diffusion models for medical image segmentation, with the goal of synthesizing recent methodological advances and assessing their clinical relevance. We cover the theoretical foundations of diffusion-based generative modeling as applied to segmentation tasks, alongside key architectural developments, including latent diffusion frameworks, transformer-enhanced diffusion models, and approaches designed to handle ambiguous and uncertain annotations.
In addition to methodological analysis, this review examines strategies for improving computational efficiency and scalability, which are critical for clinical deployment. Applications across multiple medical imaging modalities and anatomical regions are discussed to illustrate the practical impact and generalizability of diffusion-based segmentation approaches. Finally, we identify open challenges and emerging research directions, highlighting opportunities for future development and clinical translation.
By consolidating current knowledge and critically evaluating both strengths and limitations, this review aims to serve as a valuable reference for researchers developing diffusion-based segmentation methods and for clinicians seeking to understand their potential role in real-world medical imaging workflows. To provide a high-level overview of the diffusion-based medical image segmentation landscape, Figure 1 summarizes the major model paradigms discussed in this review.
2. Methodology
This review employs a systematic literature review methodology to synthesize the rapidly evolving landscape of diffusion models for medical image segmentation. The primary objective is to provide a holistic analysis of theoretical advancements, architectural innovations, and clinical applicability. The review process was guided by the following five specific research questions (RQs):
1. RQ1: What are the theoretical foundations and evolutionary trajectory of diffusion models in the context of medical image segmentation?
2. RQ2: How can existing diffusion-based segmentation methodologies be taxonomically classified based on their architectural designs and conditioning strategies?
3. RQ3: How effectively do diffusion models address inherent clinical challenges such as inter-observer variability and ambiguous boundary delineation?
4. RQ4: What computational bottlenecks hinder the clinical deployment of diffusion models, and what acceleration strategies have been proposed?
5. RQ5: Across which medical imaging modalities and specific anatomical tasks have diffusion models demonstrated superior performance compared to traditional segmentation paradigms?
2.1. Search Strategy and Data Sources
To address these research questions, a comprehensive and reproducible search was conducted across five digital libraries: IEEE Xplore, PubMed, ScienceDirect, SpringerLink, and arXiv. All databases were searched from January 2019 to January 2025, capturing the inception of DDPMs and their subsequent adaptation to medical imaging. Searches were executed on 25 December 2025. Search fields: For IEEE Xplore, PubMed, and ScienceDirect, searches were restricted to Title, Abstract, and Keywords fields. For arXiv, full-text search was applied due to platform constraints. PubMed searches additionally incorporated MeSH terms (e.g., “Image Segmentation”[MeSH] AND “Diffusion”[tw]).
The search query utilized a combination of Boolean operators and keywords:
(“Diffusion Models” OR “Denoising Diffusion Probabilistic Models” OR “DDPM” OR “Latent Diffusion”)
AND
(“Medical Image Segmentation” OR “Medical Image Analysis” OR “Organ Segmentation” OR “Tumor Segmentation”)
2.2. Inclusion and Exclusion Criteria
Studies were selected based on a two-stage screening process (title/abstract screening followed by full-text review). The inclusion criteria were as follows:
Studies focusing on the application of diffusion models to medical image segmentation tasks.
Research utilizing standard medical imaging modalities (MRI, CT, Ultrasound, X-ray).
Papers proposing novel methodological frameworks, architectural variations, or significant efficiency improvements.
Studies providing quantitative performance metrics (e.g., Dice Similarity Coefficient, IoU) or qualitative analyses of uncertainty modeling.
Exclusion criteria included the following:
Studies applying diffusion models solely for image generation or reconstruction without a segmentation component.
Non-medical applications of diffusion models.
Non-peer-reviewed articles were excluded, except for arXiv preprints that met at least one of the following criteria: (a) later accepted in a peer-reviewed venue by January 2025; (b) received ≥50 citations on Google Scholar; or (c) served as a widely referenced foundational baseline. Among the included studies, eight were arXiv preprints that satisfied these criteria.
2.3. Data Extraction and Synthesis
A structured extraction form was applied to each eligible study, recording the following: (1) diffusion paradigm type; (2) target anatomy and imaging modality; (3) quantitative metrics; (4) inference time and hardware; and (5) uncertainty quantification strategy. The extracted data was synthesized qualitatively to identify trends, technological gaps, and future directions, forming the basis of the discussion presented in the subsequent sections.
This review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. The study selection process is illustrated using a PRISMA flow diagram in Figure 2.
3. Literature Review
The application of diffusion models (DMs) to medical image segmentation has rapidly evolved into a prominent research domain, marked by significant contributions across theoretical foundations, methodological innovations, and clinical applications [11]. This section presents a structured review of the extant literature concerning diffusion-based medical image segmentation. We synthesize prior works by categorizing them based on modeling strategies, architectural designs, and application domains, while simultaneously evaluating their comparative strengths and limitations. Particular emphasis is devoted to recent developments addressing ambiguity-aware segmentation, computational efficiency, and scalability, factors that are critical for translating diffusion models from research prototypes into viable clinical practice.
3.1. Foundational Theoretical Frameworks
Diffusion-based medical image segmentation methods rely on probabilistic generative modeling, learning complex data distributions through iterative refinement. These frameworks model uncertainty, generate diverse segmentation hypotheses, and ensure stable training. Denoising Diffusion Probabilistic Models (DDPMs) serve as the core paradigm for most segmentation-oriented diffusion frameworks, forming the basis for architectural innovations and application-specific adaptations.
3.1.1. Denoising Diffusion Probabilistic Models (DDPMs)
The theoretical foundation of diffusion models in medical image segmentation is rooted in the seminal work on Denoising Diffusion Probabilistic Models. These models define a forward diffusion process where medical image data is gradually perturbed over multiple steps by adding Gaussian noise, followed by learning to reverse this diffusion process to retrieve desired noise-free segmentation masks from noisy segmentation masks. The diffusion framework is formulated using two Markov chains: a forward diffusion process and a learned reverse denoising process. The forward process, denoted as $q(x_t \mid x_{t-1})$, gradually corrupts the initial segmentation mask $x_0$ by incrementally adding Gaussian noise over $T$ time steps according to a predefined variance schedule [19]. Conversely, the reverse process, parameterized by $\theta$ and denoted as $p_\theta(x_{t-1} \mid x_t)$, learns to invert this noising procedure by iteratively removing noise, thereby reconstructing the clean segmentation mask from a noisy observation, as shown in Figure 3.
Recent comprehensive surveys have established that diffusion models are widely appreciated for their strong mode coverage and quality of generated samples despite their known computational burdens [20,21]. The field of medical imaging has observed growing interest in diffusion models, capitalizing on advances in computer vision while addressing domain-specific challenges such as data scarcity, annotation variability, and the need for uncertainty quantification. Score-based generative models extend DDPMs using stochastic differential equations (SDEs) [22], where the forward process is a continuous-time diffusion given by $\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$ and the reverse process samples by estimating the score $\nabla_x \log p_t(x)$. In medical segmentation, SDE formulations enable flexible noise schedules and signed distance functions for boundary refinement [18,23], improving stability over discrete-step DDPMs.
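To make the reverse-time SDE concrete, here is a toy Euler–Maruyama step in Python (a sketch under simplifying assumptions: a variance-exploding process with $f(x,t)=0$ and constant $g(t)=\sigma$; `reverse_sde_step` and the `score` callback are hypothetical names, not from the cited works):

```python
import math
import random

def reverse_sde_step(x, t, dt, score, sigma=1.0, z=None):
    """One Euler-Maruyama step of the reverse-time SDE, integrating backward
    from t to t - dt. With f = 0, the reverse drift is -g(t)^2 * score(x, t);
    stepping backward in time flips its sign, giving the update below."""
    if z is None:
        z = random.gauss(0.0, 1.0)  # Brownian increment
    return x + sigma ** 2 * score(x, t) * dt + sigma * math.sqrt(dt) * z
```

In a trained model the `score` callback would be the learned score network; here any function of `(x, t)` can stand in for it.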
3.1.2. Latent Diffusion Models (LDMs)
The high computational cost associated with applying diffusion models directly to high-resolution medical images has motivated the development of Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than at the pixel level [24]. By operating on lower-dimensional representations, LDMs substantially reduce memory consumption and inference time while preserving the expressive power of diffusion-based generative modeling, thereby improving the feasibility of clinical deployment.
The diffusion process is then applied within this latent space, allowing the efficient modeling of complex distributions without the prohibitive computational burden associated with high-resolution pixel-space diffusion. During inference, the denoised latent representations are decoded back into the image or segmentation space, yielding high-quality outputs that retain fine anatomical details [25,26]. Recent studies have demonstrated that latent diffusion models are particularly well suited for medical image segmentation tasks involving large volumetric data, such as MRI and CT imaging, where pixel-space diffusion becomes computationally impractical [27]. Beyond efficiency gains, latent diffusion also facilitates the integration of advanced architectural components, including transformer-based encoders and task-specific conditioning mechanisms, further enhancing segmentation performance in complex clinical scenarios [28]. As a result, LDMs have become a foundational component in many state-of-the-art diffusion-based medical segmentation frameworks and serve as a critical bridge between theoretical diffusion modeling and real-world clinical applicability.
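The encode–diffuse–decode pipeline described above can be sketched as follows (a minimal Python outline; `encoder`, `denoiser`, and `decoder` are callable stand-ins for trained networks, and all names are hypothetical):

```python
import random

def latent_diffusion_segment(encoder, denoiser, decoder, image, T=50):
    """LDM segmentation sketch: encode the image, run the conditional reverse
    process entirely in the compressed latent space, then decode the result
    back to a full-resolution segmentation mask."""
    z_img = encoder(image)                        # compressed image representation
    z = [random.gauss(0.0, 1.0) for _ in z_img]   # start from latent noise
    for t in reversed(range(T)):
        z = denoiser(z, t, z_img)                 # denoising conditioned on z_img
    return decoder(z)                             # back to pixel-space mask
```

The key efficiency gain is that every denoising step operates on the small latent vectors rather than full-resolution volumes; only the single encode and decode touch pixel space.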
3.2. Ambiguous Medical Image Segmentation
Medical image segmentation is inherently ambiguous, as multiple expert annotators may produce distinct yet clinically valid segmentation masks for the same image. This ambiguity arises from factors such as unclear anatomical boundaries, pathological variability, imaging artifacts, and subjective interpretation criteria. Recent research has increasingly recognized ambiguity-aware segmentation as a critical requirement for reliable medical image analysis [4]. This paradigm shift has led to the development of collective and uncertainty-aware segmentation frameworks that better align with clinical practice.
3.2.1. Collective Intelligence Approach
Rahman et al. proposed a framework for ambiguity-aware medical image segmentation through Collectively Intelligent Medical Diffusion (CIMD), which explicitly models the collective intelligence of expert annotators rather than imitating a single best expert annotation [4]. The core insight of CIMD is that aggregated interpretations from multiple experts consistently outperform individual diagnostic judgments in clinical tasks, particularly in complex or ambiguous imaging scenarios. The framework leverages the inherent stochastic sampling process of diffusion models to produce a distribution of segmentation masks, eliminating the need for separate prior distributions during inference, a critical limitation of conditional VAE-based segmentation models [13]. The hierarchical structure enables control over diversity at each time step, producing realistic and heterogeneous segmentation masks that reflect the naturally occurring variation in medical image interpretation.
3.2.2. Ambiguity Modeling Network (AMN)
The technical innovation of CIMD lies in its Ambiguity Modeling Network, which explicitly models the distribution of ground-truth segmentation masks conditioned on an input image. Given an image $I$, the network represents segmentation ambiguity by embedding expert annotation variability into a latent space. Specifically, the network parameterizes a Gaussian distribution over latent variables $z$ with learned mean and variance functions:

$$q_\phi(z \mid I) = \mathcal{N}\big(z;\, \mu_\phi(I),\, \sigma_\phi^2(I)\big)$$

where $\mu_\phi$ and $\sigma_\phi$ denote the mean and standard deviation predicted by the network, $\phi$ represents the learnable parameters, and the latent space $z$ captures the natural variation present in expert-provided segmentation annotations [29].
The latent variable z is integrated into the diffusion framework as a conditional prior that modulates the denoising process. Unlike standard conditional diffusion models that rely solely on the image I for conditioning, the CIMD denoiser is conditioned on both the image I and the latent code z. This formulation allows the model to map a single input image to a distribution of plausible segmentation masks, effectively disentangling the image content from the interpretation style.
The training objective minimizes a variational bound, combining the standard denoising score matching loss with a KL-divergence regularization term for the latent space. The loss function $\mathcal{L}$ is defined as

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t, I, z)\big\|^2\Big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid I)\,\big\|\,\mathcal{N}(0, \mathbf{I})\big)$$

where $x_t$ is the noisy segmentation mask at timestep $t$, $\epsilon$ is the ground-truth noise, and $\beta$ is a weighting factor. The KL term ensures the latent distribution remains close to a standard Gaussian prior, enabling random sampling during inference.
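The two latent-space ingredients of this objective, reparameterized sampling and the closed-form KL term for a diagonal Gaussian, can be sketched in a few lines of Python (hypothetical helper names; not the authors' implementation):

```python
import math
import random

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, keeping the sampling
    step differentiable with respect to mu and log_var."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularizer
    that keeps the latent distribution close to the standard Gaussian prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

At `mu = 0`, `log_var = 0` the KL term vanishes, which is exactly the state the regularizer pulls the latent distribution toward.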
The operational workflow of CIMD can be summarized in two phases:
1. Training Phase: Given an image $I$ and a ground-truth mask $x_0$, the AMN encodes the annotation style into a latent code $z$. The diffusion model is then trained to predict the noise $\epsilon$ added to $x_0$, conditioned on both $I$ and $z$, using the objective in Equation (6).
2. Inference Phase: To generate diverse segmentation hypotheses, a latent code $z$ is sampled from the prior distribution $\mathcal{N}(0, \mathbf{I})$. The diffusion model performs the reverse denoising process starting from pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$, conditioned on $I$ and the sampled $z$, to produce a plausible segmentation mask $\hat{x}_0$. Re-sampling $z$ yields different valid interpretations.
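The inference phase can be outlined as a short sampling loop (a toy Python sketch; `cimd_sample` and the `denoise_fn` callback are hypothetical stand-ins for the trained reverse process, not the authors' code):

```python
import random

def cimd_sample(denoise_fn, image, T=50, n_hypotheses=4, dim=4):
    """Generate multiple plausible masks by re-sampling the latent code z.
    denoise_fn(x_t, t, image, z) -> x_{t-1} stands in for the trained
    reverse diffusion step conditioned on both the image and z."""
    hypotheses = []
    for _ in range(n_hypotheses):
        z = [random.gauss(0.0, 1.0) for _ in range(dim)]   # z ~ N(0, I)
        x = [random.gauss(0.0, 1.0) for _ in range(dim)]   # x_T ~ N(0, I)
        for t in reversed(range(T)):
            x = denoise_fn(x, t, image, z)
        hypotheses.append(x)
    return hypotheses
```

Each loop iteration draws a fresh latent code, so the returned list directly realizes the "one image, many valid interpretations" behavior described above.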
This formulation addresses a fundamental limitation in medical image segmentation, namely the inherent ambiguity arising from inter-observer variability, where different diagnosticians may provide differing interpretations regarding the type, shape, and extent of anatomical abnormalities [7]. Experimental results demonstrate that the proposed framework achieves superior segmentation performance across multiple medical imaging modalities, including computed tomography (CT), ultrasound, and magnetic resonance imaging (MRI), while effectively preserving the naturally occurring variations observed in expert annotations.
3.3. Transformer-Enhanced Diffusion Models
While early diffusion-based segmentation frameworks primarily relied on convolutional architectures such as U-Net, their limited capacity to model long-range dependencies has motivated the integration of transformer mechanisms. Vision Transformers (ViTs) excel in medical image segmentation due to their global self-attention, handling complex structures and large contexts [30]. Transformer-enhanced diffusion models merge transformers’ contextual reasoning with diffusion processes’ generative power for robust segmentation.
MedSegDiff-V2 Framework
Wu et al. developed MedSegDiff-V2, a Transformer-based diffusion framework that addresses the limited performance of naively combining U-Net-based diffusion models with standard transformer blocks. The framework integrates vision transformer mechanisms with a diffusion backbone to improve segmentation across multiple modalities [31]. Its key contribution is a jointly designed architecture, the SS-Former, that enables long-range dependency modeling while preserving the generative properties of the diffusion process.
To clarify the architectural roles, the transformer in MedSegDiff-V2 serves as both the denoiser backbone and a context encoder:
Denoiser Backbone Role: It forms a hybrid U-Net in which Transformer blocks replace convolutional bottlenecks, creating the primary feature extraction pathway.
Context Encoder Role: Self-attention within these blocks captures long-range spatial dependencies from both the input image and the noisy segmentation mask.
Concise Architectural Breakdown:
Input Conditioning: The input medical image is concatenated with the noisy segmentation mask at timestep t, forming the conditioned input to the denoiser.
Hybrid Backbone (SS-Former): A U-Net-like encoder–decoder with skip connections, but convolutional bottlenecks are replaced by Transformer blocks (multi-head self-attention + feed-forward layers) to process features at multiple scales.
Denoising Output: The network predicts the noise $\epsilon_\theta$, conditioned on global context from the transformers, iteratively refining the mask over diffusion steps.
Training/Inference: Trained with standard DDPM loss; inference uses classifier-free guidance for improved sample quality.
Compared to standard U-Net denoisers (which rely on local convolutional inductive biases and struggle with global variations in medical structures like irregular tumors), the Transformer-enhanced design in MedSegDiff-V2 enables better modeling of long-range dependencies, reduces sensitivity to local noise patterns, and improves generalization across modalities (e.g., +2–5% DSC gains in multi-organ CT/US tasks). This results in more robust and semantically aware segmentation, particularly for ambiguous or variable anatomical regions. Extensive testing across 20 medical image segmentation tasks spanning various imaging modalities highlights its clear superiority over previous state-of-the-art approaches.
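The long-range dependency modeling attributed to the transformer blocks comes from self-attention, which can be illustrated in miniature (a toy single-head, scalar-token version in plain Python with `q = k = v = x`; not MedSegDiff-V2's actual SS-Former implementation):

```python
import math

def self_attention_1d(x):
    """Single-head self-attention over a flat token sequence with q = k = v = x.
    Every output position is a softmax-weighted mix of ALL positions, which is
    the mechanism that gives transformer blocks global (long-range) context."""
    n = len(x)
    out = []
    for i in range(n):
        logits = [x[i] * x[j] for j in range(n)]         # q_i . k_j
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]      # numerically stable softmax
        s = sum(weights)
        out.append(sum(w / s * x[j] for j, w in enumerate(weights)))
    return out
```

A convolution would mix only a local neighborhood; here position `i` draws on every `j`, which is why such blocks help with spatially distant but correlated structures like irregular tumor boundaries.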
3.4. Computational Efficiency Innovations
Despite their strong generative and uncertainty modeling capabilities, diffusion models have traditionally been limited by high computational cost, primarily due to the large number of iterative sampling steps required during inference. Conventional Denoising Diffusion Probabilistic Models (DDPMs) require numerous sampling steps, hindering their practicality for time-sensitive medical image segmentation [11,32].
3.4.1. Accelerated Sampling Strategies
Recent research has increasingly prioritized mitigating the substantial computational burden inherent to traditional diffusion models, which typically necessitate up to 1000 sampling steps to synthesize high-fidelity outputs. Consequently, a variety of strategies have been devised to drastically reduce inference latency without compromising segmentation accuracy [33]. Notably, accelerated sampling protocols have substantially reduced inference cost: LDSeg achieves DSC scores comparable to the full 1000-step DDPM baseline using only 2 DDIM sampling steps, representing a 100–500× reduction in network function evaluations. These improvements are pivotal for clinical translation, where real-time or near-real-time processing capabilities are indispensable for effective patient care.
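The DDIM update that makes such few-step sampling possible can be sketched as follows (deterministic $\eta = 0$ variant in toy Python; `ddim_step` is a hypothetical name and the predicted noise is passed in, not computed by a network):

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0): first recover the implied clean
    mask x0_pred from x_t and the predicted noise, then jump directly to any
    earlier timestep. Because the jump size is free, 1000 DDPM steps can be
    replaced by a handful of DDIM steps."""
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]
```

Setting `abar_prev = 1.0` jumps straight to the clean estimate, which is effectively what a 2-step sampler exploits.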
3.4.2. Memory-Efficient Architectures
The intensive computational demands of processing large 3D medical volumes have catalyzed the development of innovative memory-efficient architectures. Patch-based strategies facilitate the handling of volumetric data by partitioning extensive volumes into manageable sub-volumes, thereby substantially reducing memory overhead while preserving segmentation coherence across the full dataset. Concurrently, wavelet-based (or frequency-domain) approaches have demonstrated the capacity to train on resource-constrained hardware, such as a single 40 GB GPU, while achieving state-of-the-art performance in high-resolution medical image generation and segmentation [34]. These advancements represent a critical leap forward in democratizing diffusion models, making them feasible for clinical applications operating on standard hardware configurations.
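The patch-partitioning idea reduces, per axis, to computing overlapping sliding-window start indices, as in this small sketch (a generic scheme with hypothetical names, not tied to any specific framework cited above):

```python
def patch_starts(volume_len, patch_len, overlap):
    """1-D start indices for an overlapping sliding window; applying the same
    scheme along each axis tiles a 3-D volume into sub-volumes. The final
    window is shifted back if needed so the volume end is always covered."""
    stride = patch_len - overlap
    starts = list(range(0, max(volume_len - patch_len, 0) + 1, stride))
    if starts[-1] + patch_len < volume_len:
        starts.append(volume_len - patch_len)  # cover the tail of the volume
    return starts
```

The overlap region is where neighboring patch predictions are typically blended (e.g., averaged) to keep segmentation coherent across patch borders.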
3.5. Conditioned Diffusion for Medical Image Segmentation
Conditional diffusion frameworks enable the denoising process to be guided by auxiliary information, such as input images, anatomical priors, or textual descriptions. Among these variants, text-conditioned diffusion models incorporate semantic guidance derived from natural language to improve segmentation controllability and generalization in limited-data scenarios. Language-guided segmentation frameworks leverage vision–language alignment to incorporate high-level semantic priors, enabling more flexible and target-aware segmentation behavior [35].
DiffBoost Framework
Zhang et al. proposed DiffBoost, a text-guided diffusion framework designed to alleviate data scarcity in medical image segmentation through anatomically constrained synthetic image generation [5]. The framework follows a three-stage pipeline comprising large-scale pretraining on RadImageNet, task-specific fine-tuning, and joint optimization with downstream segmentation objectives.
A key innovation of DiffBoost lies in its use of edge-aware anatomical constraints to guide the diffusion synthesis process, ensuring that generated samples preserve medically relevant structural information. By incorporating textual descriptions as semantic guidance, DiffBoost enables the generation of diverse yet clinically plausible synthetic images, thereby enhancing segmentation performance in data-limited scenarios.
3.6. Comparative Analysis with Traditional Uncertainty Modeling
Diffusion models compare favorably with traditional uncertainty modeling methods. Bayesian segmentation approximates posteriors via variational inference or MC Dropout [36], capturing epistemic uncertainty but requiring 5–10 forward passes at ∼0.01–0.1 s per inference [37]. Diffusion models generate diverse samples natively in 10–50 sampling steps (0.3–2 s), outperforming MC Dropout in AUSE and AURG metrics for reconstruction tasks [38]. The Probabilistic U-Net [39] models aleatoric uncertainty via latent variables but is susceptible to mode collapse; diffusion models avoid this through strong mode coverage [40]. Ensemble methods (e.g., nnU-Net 5-fold) improve DSC by 1–5% at the cost of five times the training time, whereas diffusion-based sampling achieves comparable gains within a single trained model. For ambiguous cases, diffusion models capture the full posterior distribution $p(x_0 \mid I)$, sampling plausible segmentations from high-density regions and naturally capturing inter-observer variability [41].
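A common way to turn a set of sampled masks into a usable uncertainty estimate is a per-pixel variance map, sketched below (generic illustration with a hypothetical function name; masks are flat lists of soft values):

```python
def uncertainty_map(samples):
    """Per-pixel variance across sampled segmentation masks. High values flag
    regions where the hypotheses disagree, i.e., inter-observer-style
    ambiguity that a single deterministic prediction would hide."""
    n = len(samples)
    means = [sum(col) / n for col in zip(*samples)]
    return [sum((s[i] - means[i]) ** 2 for s in samples) / n
            for i in range(len(means))]
```

In practice such maps are overlaid on the image so clinicians can see exactly which boundary regions the model considers ambiguous.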
4. Types of Diffusion Models for Segmentation
The application of diffusion models to medical image segmentation has diversified into several distinct methodological paradigms, each tailored to address specific challenges inherent to medical imaging [42]. Rather than constituting a single unified approach, diffusion-based segmentation methods differ substantially in terms of architectural design, conditioning strategy, uncertainty modeling, and computational efficiency. To provide a systematic overview, Table 1 presents a structured taxonomy of the main categories of diffusion models developed for medical image segmentation, highlighting their core design principles, strengths, limitations, and intended use cases.
4.1. Denoising Diffusion Probabilistic Models (DDPMs)
Standard Denoising Diffusion Probabilistic Models represent the earliest and most straightforward application of diffusion processes to medical image segmentation [9]. These approaches directly adapt vanilla diffusion formulations to segmentation tasks, serving as a foundational baseline for subsequent architectural and methodological innovations.
Vanilla DDPM for Segmentation
The foundational approach applies standard Denoising Diffusion Probabilistic Models directly to medical image segmentation tasks. These models define a forward diffusion process where segmentation masks are gradually perturbed over multiple steps by adding Gaussian noise, followed by learning to reverse this diffusion process to retrieve desired noise-free segmentation masks from noisy samples [
47]. Standard DDPMs demonstrate strong mode coverage and quality of generated samples, making them particularly suitable for medical applications where precision and reliability are paramount. However, they suffer from significant computational burdens due to the iterative nature of the denoising process, typically requiring up to 1000 sampling steps for high-quality generation.
4.2. Conditional Diffusion Models
Conditional diffusion models extend vanilla diffusion formulations by incorporating auxiliary information to guide the denoising process. In medical image segmentation, conditioning is essential for ensuring that generated segmentation masks remain anatomically consistent with the corresponding input images.
Image-Conditioned Diffusion
Image-conditioned diffusion models represent a crucial advancement for medical image segmentation, where the denoising process is guided by input medical images, as shown in
Figure 4. These models learn to generate segmentation masks conditioned on the corresponding medical images, enabling controlled generation of anatomically consistent segmentation outputs [
43]. The conditioning mechanism typically involves concatenating image features with noisy segmentation masks at each denoising step, allowing the model to leverage anatomical information for accurate mask generation. This approach has demonstrated superior performance across multiple medical imaging modalities including CT, ultrasound, and MRI [
31].
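The concatenation-based conditioning mechanism described above can be illustrated with a minimal stand-in denoiser: at each step the noisy mask and the conditioning image are stacked along the channel axis and mapped jointly to a noise estimate. The `toy_denoiser` below is an untrained 1x1 linear map, a hypothetical placeholder rather than any cited network.

```python
import numpy as np

# Image conditioning via channel concatenation: the denoiser always sees
# the (noisy mask, image) pair, so its noise prediction can exploit
# anatomical structure in the image.
rng = np.random.default_rng(0)
H = W = 16
image = rng.random((1, H, W))            # conditioning image, 1 channel
x_t = rng.standard_normal((1, H, W))     # noisy mask at some step t

W_out = rng.standard_normal((1, 2)) * 0.1  # hypothetical 1x1-conv weights

def toy_denoiser(noisy_mask, cond_image):
    """Predict noise from the concatenated (mask, image) input."""
    inp = np.concatenate([noisy_mask, cond_image], axis=0)  # (2, H, W)
    # 1x1 "convolution": mix the two channels at every pixel.
    return np.tensordot(W_out, inp, axes=([1], [0]))        # (1, H, W)

eps_hat = toy_denoiser(x_t, image)
```

In practice the linear map is replaced by a full U-Net, but the interface is the same: every reverse step receives the image alongside the current noisy mask.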
4.3. Latent Diffusion Models (LDMs)
Latent Diffusion Models are designed to improve the computational efficiency of diffusion-based medical image segmentation by shifting the diffusion process from pixel space to a compressed latent representation [
44]. By operating on lower-dimensional feature spaces, LDMs substantially reduce memory consumption and inference time while preserving segmentation fidelity, making them well suited for high-resolution medical imaging applications.
Compressed Latent Space Processing
In latent diffusion-based segmentation frameworks, pre-trained encoders are used to project medical images and corresponding segmentation masks into compact latent representations. The diffusion process is then performed entirely within this latent space, where a denoising model learns to generate clean latent segmentation representations from noisy latent representations [
48].
Following diffusion, a decoder reconstructs the final segmentation masks in pixel space. This three-stage design, comprising an encoder, a latent diffusion model, and a decoder, enables efficient processing of high-resolution scans such as MRI and CT images while maintaining competitive segmentation accuracy. As a result, latent diffusion models provide a practical balance between computational efficiency and representational expressiveness in medical image segmentation tasks [
49].
Throughout this review, the term ‘noisy segmentation mask’ refers to pixel-space diffusion, whereas ‘noisy latent representation’ denotes the equivalent in compressed latent space (LDMs).
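The encoder, latent diffusion, decoder pipeline can be sketched end to end. To keep the sketch runnable, the "encoder" and "decoder" below are simple average-pool and nearest-upsample stand-ins, and the reverse process is a placeholder shrinkage loop; all three are hypothetical substitutes for the trained components.

```python
import numpy as np

# Three-stage latent diffusion: encode to a compact latent, denoise
# there, decode back to pixel space. Compression factor f = 4 reduces a
# 64x64 mask to a 16x16 latent (16x fewer values per diffusion step).
rng = np.random.default_rng(0)
f = 4

def encode(mask):                    # (H, W) -> (H/f, W/f), average pooling
    H, W = mask.shape
    return mask.reshape(H // f, f, W // f, f).mean(axis=(1, 3))

def decode(z):                       # nearest-neighbor upsampling
    return np.repeat(np.repeat(z, f, axis=0), f, axis=1)

def denoise_in_latent(z_noisy, steps=10):
    z = z_noisy
    for _ in range(steps):
        z = 0.9 * z                  # placeholder for p_theta(z_{t-1} | z_t)
    return z

mask = np.zeros((64, 64)); mask[16:48, 16:48] = 1.0
z0 = encode(mask)
z_noisy = z0 + 0.5 * rng.standard_normal(z0.shape)
recon = decode(denoise_in_latent(z_noisy))
```

The efficiency argument is visible in the shapes: every diffusion step touches a 16x16 latent instead of the full 64x64 mask, and the saving grows quadratically (cubically for 3D volumes) with the compression factor.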
4.4. Specialized Noise Models
While most diffusion-based segmentation frameworks rely on Gaussian noise formulations, medical image segmentation, particularly binary segmentation, often benefits from noise models that better reflect the discrete nature of segmentation masks [
50]. Specialized noise models adapt the diffusion process to non-continuous data distributions, enabling more effective modeling of binary and categorical segmentation outputs.
Binary Bernoulli Diffusion Model (BBDM)
The Binary Bernoulli Diffusion Model represents a significant innovation specifically designed for binary segmentation tasks. Unlike traditional diffusion models that apply Gaussian noise to continuous data, BBDM uses Bernoulli noise as the diffusion kernel to enhance the capacity of the diffusion model for binary segmentation tasks.
BerDiff, a prominent implementation of BBDM, introduces randomness through sampling of initial noise and latent variables, producing diverse segmentation masks that effectively highlight regions of interest [
45]. This approach is particularly valuable for medical image segmentation where binary masks represent the presence or absence of anatomical structures or pathological regions.
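A Bernoulli forward process can be sketched as follows: instead of adding Gaussian noise, each pixel is resampled from a Bernoulli distribution whose mean interpolates between the clean binary mask and an uninformative 0.5. The schedule values are illustrative choices, not the ones used in BerDiff.

```python
import numpy as np

# Bernoulli forward process for a binary mask: at step t, pixel i is 1
# with probability x0_i * alpha_bar_t + 0.5 * (1 - alpha_bar_t), so the
# sample stays binary at every step and converges to fair coin flips.
rng = np.random.default_rng(0)
T = 100
alpha_bars = np.linspace(0.999, 0.001, T)   # illustrative signal schedule

def bernoulli_q_sample(x0, t, rng):
    p = x0 * alpha_bars[t] + 0.5 * (1.0 - alpha_bars[t])
    return (rng.random(x0.shape) < p).astype(np.float64)

x0 = np.zeros((32, 32)); x0[8:24, 8:24] = 1.0
x_early = bernoulli_q_sample(x0, 5, rng)     # still close to x0
x_late = bernoulli_q_sample(x0, T - 1, rng)  # near pure coin flips

# Early samples agree with the clean mask far more often than late ones.
assert (x_early == x0).mean() > (x_late == x0).mean()
```

Because every intermediate state is itself a valid binary mask, the reverse model never has to round a continuous estimate, which is the motivation for this kernel in binary segmentation.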
4.5. Transformer-Enhanced Diffusion Models
Transformer-enhanced diffusion models combine the probabilistic generative strength of diffusion processes with the global contextual modeling capability of transformers. This class of methods is designed to overcome the limited receptive field of convolutional architectures by enabling long-range dependency modeling, which is particularly important for medical image segmentation involving complex anatomical structures and large spatial contexts.
MedSegDiff-V2 Framework
MedSegDiff-V2 represents a novel Transformer-based Diffusion framework that addresses the limitation of simply combining UNet-based diffusion models with transformers. The framework integrates vision transformer mechanisms with diffusion models to enhance medical image segmentation across multiple modalities [
31]. The key innovation lies in the SS-Former architecture that learns the interaction between noise and semantic features. This approach leverages the long-range dependency modeling capabilities of transformers while maintaining the generative advantages of diffusion models.
4.6. Hybrid Diffusion Models
Hybrid diffusion models combine discriminative segmentation networks with generative diffusion processes to leverage the complementary strengths of both paradigms. While discriminative models provide efficient and accurate initial predictions, diffusion-based refiners enable probabilistic correction and structural refinement by modeling the underlying data distribution. This hybrid design aims to improve segmentation robustness without incurring the full computational cost of purely generative diffusion pipelines.
HiDiff: Hybrid Discriminative–Generative Framework
HiDiff represents a novel hybrid approach that synergizes discriminative and generative modeling paradigms for medical image segmentation [
46]. The framework comprises two fundamental components: a discriminative segmentor and a diffusion refiner. The discriminative segmentor utilizes conventional trained segmentation models to provide segmentation mask priors, while the diffusion refiner employs a Binary Bernoulli Diffusion Model to effectively refine segmentation masks.
The key innovation lies in the alternate-collaborative training strategy, where the segmentor and BBDM [
51] are trained to mutually enhance each other’s performance. This approach addresses the limitation of purely discriminative methods that neglect underlying data distribution and intrinsic class characteristics.
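The hybrid pattern above can be illustrated conceptually: a discriminative segmentor supplies a prior mask, and a refiner perturbs and cleans it rather than starting from pure noise. Both components below are crude placeholders (a threshold segmentor and a flip-then-majority-vote loop), chosen only to make the control flow runnable; they are not the trained networks of HiDiff.

```python
import numpy as np

# Hybrid discriminative-generative sketch: segmentor prior -> iterative
# Bernoulli-style perturbation and "denoising" of the prior mask.
rng = np.random.default_rng(0)

def segmentor_prior(image):
    # Stand-in discriminative model: threshold the image intensities.
    return (image > 0.7).astype(np.float64)

def diffusion_refine(prior, steps=5, flip_p=0.05):
    mask = prior.copy()
    for _ in range(steps):
        flips = rng.random(mask.shape) < flip_p         # Bernoulli noise
        mask = np.where(flips, 1.0 - mask, mask)
        # "Denoise": 3x3 majority vote (placeholder for the trained BBDM).
        padded = np.pad(mask, 1, mode="edge")
        votes = sum(padded[i:i + 32, j:j + 32]
                    for i in range(3) for j in range(3))
        mask = (votes >= 5).astype(np.float64)
    return mask

image = rng.random((32, 32)); image[10:22, 10:22] += 0.4
prior = segmentor_prior(image)
refined = diffusion_refine(prior)
```

The point of the sketch is the division of labor: the cheap discriminative pass does most of the work, and the iterative refiner only has to correct structural errors around the prior.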
8. Challenges and Limitations
8.1. Computational Complexity and Resource Requirements
Despite their strong representational capacity, diffusion-based segmentation models face notable challenges related to computational cost and hardware constraints, which continue to hinder large-scale and real-time clinical deployment.
8.1.1. Iterative Denoising Overhead
A core limitation of diffusion models arises from their reliance on an iterative denoising process, in which segmentation masks are progressively refined over multiple inference steps. Conventional Denoising Diffusion Probabilistic Models (DDPMs) typically require hundreds to thousands of sampling iterations to achieve high-quality outputs, resulting in substantial computational overhead [
11,
19].
This inference latency is particularly restrictive in clinical environments where real-time or near-real-time performance is essential, such as ultrasound-guided interventions, emergency diagnostics, and intraoperative imaging. Although accelerated sampling methods have alleviated this burden to some extent, the trade-off between sampling speed and segmentation fidelity remains a critical challenge [
21].
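One widely used acceleration strategy is to visit only a strided subsequence of the timestep grid, in the spirit of DDIM. The sketch below shows the schedule bookkeeping and the deterministic (eta = 0) update; the denoiser is a hypothetical zero-predictor so the loop stays runnable without a trained network.

```python
import numpy as np

# Strided sampling: 50 reverse steps instead of 1000, reusing the same
# noise schedule. Fewer steps means lower latency at some cost in fidelity.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_timesteps(num_steps):
    """Evenly strided subsequence of [0, T), visited in reverse."""
    return np.linspace(0, T - 1, num_steps, dtype=int)[::-1]

def ddim_step(x_t, t, t_prev, eps_hat):
    """Deterministic DDIM-style update (eta = 0)."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
steps = ddim_timesteps(50)
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps_hat = np.zeros_like(x)      # stand-in for the trained denoiser
    x = ddim_step(x, t, t_prev, eps_hat)
```

The 20x reduction in network evaluations (50 vs. 1000) translates almost directly into a 20x latency reduction, which is why strided samplers dominate clinically oriented deployments.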
8.1.2. Memory Limitations for 3D Processing
Diffusion models also encounter significant memory constraints when applied to high-resolution 3D medical imaging data, including volumetric CT and MRI scans. The need to store intermediate feature maps across multiple diffusion steps often leads to excessive GPU memory consumption, limiting batch size and spatial resolution during both training and inference [
74].
These memory limitations pose a major obstacle for comprehensive volumetric analysis, which is essential for accurate anatomical delineation and clinical decision-making. While latent diffusion and patch-based processing strategies offer partial mitigation, achieving efficient full-volume 3D diffusion-based segmentation remains an open research problem [
70].
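The patch-based mitigation mentioned above is typically implemented as sliding-window inference with overlap averaging. The sketch below shows the mechanics on a toy volume; `predict_patch` is a trivial thresholding placeholder standing in for a per-patch diffusion model.

```python
import numpy as np

# Sliding-window 3D inference: process overlapping sub-volumes
# independently and average predictions where patches overlap, so only
# one patch ever needs to be resident in accelerator memory at a time.
def predict_patch(patch):
    return (patch > 0.5).astype(np.float64)   # placeholder model

def sliding_window(volume, patch=32, stride=16):
    out = np.zeros_like(volume, dtype=np.float64)
    count = np.zeros_like(volume, dtype=np.float64)
    D, H, W = volume.shape
    for z in range(0, D - patch + 1, stride):
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                sl = (slice(z, z + patch), slice(y, y + patch),
                      slice(x, x + patch))
                out[sl] += predict_patch(volume[sl])
                count[sl] += 1.0
    return out / np.maximum(count, 1.0)       # overlap averaging

rng = np.random.default_rng(0)
vol = rng.random((64, 64, 64))                # toy stand-in for a CT volume
seg = sliding_window(vol)
```

Overlapping strides smooth seams between patches but multiply the number of (already expensive) diffusion passes, which is the trade-off noted above.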
8.2. Training and Data Challenges
While diffusion models provide powerful mechanisms for modeling uncertainty and ambiguity, their effectiveness in medical image segmentation is strongly constrained by data availability, annotation quality, and evaluation limitations.
8.2.1. Annotation Requirements and Cost
Training diffusion-based models for ambiguous medical image segmentation often necessitates annotations from multiple expert radiologists to accurately capture inter-observer variability. Acquiring such multi-rater annotations is both time-consuming and financially expensive, significantly increasing the data collection burden compared to conventional single-label segmentation pipelines [
75,
76].
This challenge is particularly pronounced in specialized clinical domains such as oncology, cardiology, and neuroimaging, where expert availability is limited and annotation protocols are complex. As a result, the scalability of diffusion-based ambiguous segmentation methods remains constrained, motivating research into weakly supervised learning, annotation-efficient training strategies, and synthetic label generation [
77].
8.2.2. Ground Truth Distribution Limitations
In real-world clinical settings, ground truth for ambiguous segmentation is typically represented by a small and finite set of expert annotations, rather than a well-defined continuous distribution. This sparse sampling of the annotation space poses challenges for both robust model training and reliable performance evaluation [
78].
Evaluation metrics such as the Generalized Energy Distance (GED) are commonly used to assess the alignment between predicted and ground-truth segmentation distributions. However, GED and related distribution-based metrics can be unstable or biased when ground truth distributions are approximated using limited samples, potentially leading to unreliable assessments of model performance [
79].
These limitations highlight the need for improved evaluation protocols and uncertainty-aware metrics that remain robust under sparse annotation regimes, as well as training objectives that can better exploit limited distributional supervision.
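To make the GED instability discussed above concrete, the metric itself is straightforward to compute from two finite sample sets. The sketch below uses the common choice d(a, b) = 1 - IoU(a, b); the masks are random toy data, not annotations from any dataset.

```python
import numpy as np

# Generalized Energy Distance between model samples S and expert
# annotations Y:  GED^2 = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')],
# estimated here from small finite sets, which is exactly the sparse
# regime where the estimate becomes unstable.
def iou_dist(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - (inter / union if union > 0 else 1.0)

def ged_squared(samples, annotations):
    cross = np.mean([iou_dist(s, y) for s in samples for y in annotations])
    within_s = np.mean([iou_dist(s, s2) for s in samples for s2 in samples])
    within_y = np.mean([iou_dist(y, y2) for y in annotations
                        for y2 in annotations])
    return 2 * cross - within_s - within_y

rng = np.random.default_rng(0)
base = rng.random((32, 32)) > 0.5
# Toy "samples" and "annotations": the base mask with sparse random flips.
samples = [np.logical_xor(base, rng.random((32, 32)) > 0.95)
           for _ in range(4)]
annotations = [np.logical_xor(base, rng.random((32, 32)) > 0.95)
               for _ in range(3)]
g = ged_squared(samples, annotations)
```

With only three annotations, the E[d(Y, Y')] term is averaged over nine pairs (three of them trivially zero), which illustrates why the estimate is biased and high-variance under sparse multi-rater supervision.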
8.3. Clinical Translation Barriers
Despite strong methodological advances, the translation of diffusion-based medical image segmentation models from research prototypes to routine clinical use remains limited. Key barriers include insufficient clinical validation and challenges in integrating these models into real-world healthcare workflows.
8.3.1. Limited Clinical Validation
Although diffusion models demonstrate strong segmentation performance in controlled experimental settings, systematic clinical validation remains insufficient. Most existing studies evaluate performance using technical metrics on retrospective, single-institution datasets, a standard that is insufficient to demonstrate clinical readiness. Specific gaps and concrete recommendations include the following: (a) External validation: Validation on at least two independent cohorts from different institutions, scanner vendors, and imaging protocols is necessary to demonstrate robustness beyond the training distribution [
80]. (b) Prospective reader studies: Clinical impact should be quantified through controlled reader studies comparing AI-assisted versus unassisted radiologist performance, measuring inter-reader agreement improvement, time-to-diagnosis reduction, and diagnostic accuracy changes. (c) Clinically meaningful endpoints: Beyond Dice/IoU, evaluations should report treatment decision concordance (e.g., radiotherapy target volume agreement within clinical margins), false positive/negative rates at actionable thresholds, and time savings per workflow step. (d) Population diversity: Studies must validate across heterogeneous patient populations (varied age, comorbidities, disease stage, ethnicity) and include pathological edge cases, not only typical presentations present in public benchmarks. Addressing these gaps is a prerequisite for regulatory approval and broad clinical adoption of diffusion-based segmentation systems [
81].
8.3.2. Integration with Clinical Workflows
Integrating diffusion-based segmentation models into routine clinical practice involves practical and technical challenges across three areas. First, AI systems must interoperate with established medical imaging infrastructure, including Picture Archiving and Communication Systems (PACS) and Electronic Health Records (EHR) [
82], which manage image storage and patient data. Second, inference latency is a critical operational constraint: for time-sensitive applications such as emergency diagnostics, intraoperative imaging, and point-of-care ultrasound, segmentation must complete within seconds. Third, user interaction design must allow clinicians to efficiently review, adjust, and approve automatically generated segmentations. Because diffusion models produce probabilistic outputs, communicating uncertainty through interpretable visualizations or confidence maps is essential for clinical trust. Evaluation should therefore extend beyond algorithmic accuracy to include workflow-level metrics: reductions in reporting time, improvements in inter-observer consistency, and compatibility with downstream tasks such as radiotherapy planning and surgical guidance [
83,
84].
To clarify how current limitations motivate future research efforts,
Table 6 summarizes the key challenges identified in diffusion-based medical image segmentation and the corresponding future directions discussed in this review.
8.4. Explainability Challenges
A key limitation of diffusion-based segmentation models is their lack of explainability, which poses a significant barrier to regulatory approval and clinical adoption of generative AI in medicine [FDA, 2024; EU AI Act, 2024]. Unlike deterministic methods such as U-Net, which allow direct inspection of feature maps, diffusion models operate as black boxes: the final segmentation mask emerges from many iterative denoising steps driven by learned score functions, making it difficult for clinicians to understand why a particular boundary was chosen or why uncertainty is elevated in a given region. This gap is especially evident when compared to established reconstruction methods. In CT imaging, filtered back-projection (FBP) remains a clinical standard because its mathematical formulation is transparent: streak artifacts from metal implants or beam hardening are well understood and teachable. In contrast, diffusion-based methods such as MedSegDiff-V2 and HiDiff offer only post hoc approximations (e.g., attention maps, uncertainty heatmaps) that are not guaranteed to reflect the true decision process. Current attempts to mitigate this include the following:
attention visualization in transformer-enhanced models (e.g., SS-Former in MedSegDiff-V2) [
31]; uncertainty maps from stochastic sampling [
4]; and counterfactual generation (what-if mask perturbations).
These post hoc methods remain insufficient for regulatory purposes: to date, no diffusion segmentation framework has been validated under explainability requirements for FDA Class II/III devices or CE marking under the EU AI Act. Alongside computational cost and data scarcity, this explainability gap is a primary reason diffusion models remain largely confined to research settings.
10. Discussion
This comprehensive review has examined the transformative impact of diffusion models on medical image segmentation, revealing a field that has rapidly evolved from theoretical foundations to practical clinical applications. The analysis demonstrates that diffusion models represent a paradigm shift in medical imaging, offering unique capabilities that address fundamental challenges while introducing novel possibilities for improved patient care and diagnostic accuracy.
10.1. Key Achievements and Contributions
The field has witnessed remarkable progress across multiple dimensions. Computational efficiency improvements have substantially reduced inference latency: LDSeg reduced per-frame processing time from 91.23 s (pixel-space DDPM baseline) to 0.34 s using 2-step DDIM sampling on the Echo dataset, while maintaining a DSC of 0.92, competitive with the baseline. The development of latent diffusion models, accelerated sampling protocols, and memory-efficient architectures has made real-time clinical deployment feasible. Methodological innovations have established diffusion models as uniquely suited for medical imaging challenges. The introduction of ambiguous segmentation modeling through frameworks like CIMD addresses the fundamental reality that medical image interpretation inherently involves uncertainty and inter-observer variability. This capability to generate multiple plausible segmentation outputs provides clinicians with valuable uncertainty quantification that traditional deterministic models cannot offer. The versatility across imaging modalities has been conclusively demonstrated, with successful applications spanning MRI, CT, ultrasound, and X-ray imaging. Notable achievements include DSC scores of 0.96 for knee cartilage segmentation, a +13.87% improvement in breast ultrasound segmentation, and superior performance across 20 different medical segmentation tasks.
10.2. Addressing Critical Medical Imaging Challenges
Diffusion models have successfully addressed several critical limitations of traditional approaches. Data scarcity, a persistent challenge in medical imaging due to privacy concerns and annotation costs, has been mitigated through sophisticated synthetic data generation capabilities. Studies demonstrate that CNNs trained on diffusion-generated synthetic data achieve comparable performance to those trained on original datasets. The uncertainty quantification capabilities inherent in diffusion models address a fundamental gap in traditional segmentation approaches. The ability to provide pixel-wise uncertainty maps and multiple plausible outputs enables more informed clinical decision-making, particularly in ambiguous cases where segmentation accuracy directly impacts treatment planning.
10.3. Current Limitations and Ongoing Challenges
Despite significant progress, several challenges remain. Computational complexity continues to pose barriers for widespread deployment, particularly in resource-constrained clinical environments. While substantial improvements have been achieved, further optimization is needed for universal clinical adoption. Limited clinical validation represents a significant gap between research achievements and clinical translation. More comprehensive multi-institutional studies focusing on clinical outcomes rather than technical metrics are essential for demonstrating real-world utility.
10.4. Future Outlook and Transformative Potential
The future of diffusion models in medical image segmentation appears exceptionally promising. Emerging research directions including edge computing deployment, multi-modal fusion frameworks, and personalized medicine applications suggest continued rapid advancement. The development of specialized medical architectures designed specifically for healthcare applications will likely yield substantial improvements. Regulatory framework development and standardized evaluation protocols will facilitate broader clinical adoption and ensure patient safety. The establishment of open-source ecosystems and international collaboration networks will accelerate progress and enable global impact.
10.5. Broader Implications for Medical Imaging
This review reveals that diffusion models represent more than incremental improvements in segmentation accuracy; they fundamentally change how we approach medical image analysis. The ability to model uncertainty, generate synthetic training data, and provide multiple plausible interpretations aligns closely with the inherent nature of medical diagnosis, where ambiguity and uncertainty are natural components of clinical decision-making. The democratization of advanced medical imaging through synthetic data generation and reduced annotation requirements has the potential to improve healthcare access globally, particularly in resource-limited settings where expert annotations are scarce.
10.6. Final Perspectives
The comprehensive analysis presented in this review demonstrates that diffusion models have successfully established themselves as transformative technologies in medical image segmentation. The combination of superior technical performance, unique uncertainty quantification capabilities, computational efficiency improvements, and clinical validation results positions these models as essential tools for the future of medical imaging. The field has progressed from initial theoretical explorations to practical clinical implementations, with a clear trajectory toward widespread adoption and integration into routine medical practice. The consistent year-on-year increase in relevant publications, from fewer than 10 diffusion-segmentation papers in 2020 to more than 60 in 2025 according to our systematic search, reflects sustained and growing community investment in this technology. As we look toward the future, diffusion models are poised to play an increasingly central role in medical imaging, contributing to improved diagnostic accuracy, enhanced clinical decision-making, and ultimately better patient outcomes. The continued collaboration between computer scientists, medical professionals, and healthcare institutions will be essential for realizing the full potential of these promising technologies.
11. Conclusions
In conclusion, this review establishes Denoising Diffusion Probabilistic Models as a transformative advancement in medical image segmentation. By effectively tackling critical challenges such as data scarcity, inter-observer variability, and uncertainty quantification, diffusion models have demonstrated superior performance across diverse modalities like MRI, CT, and ultrasound. The field has seen remarkable progress in terms of both accuracy and efficiency, evidenced by Dice scores reaching up to 0.96 and drastic reductions in processing time from 91.23 s to 0.34 s. While challenges remain regarding computational costs, the continued development of latent diffusion and transformer-based architectures offers a promising path for clinical translation. This work serves as a comprehensive resource and roadmap for future research, paving the way for more precise and reliable diagnostic tools.