1. Introduction
Cardiovascular diseases are among the major health issues worldwide, significantly impacting patients' quality of life [1]. Cine cardiac MRI captures dynamic images of the heart at different phases of the cardiac cycle, providing detailed information about cardiac structure and function [2]. It is an important non-invasive diagnostic tool for cardiovascular diseases. Accurate segmentation of cardiovascular magnetic resonance (CMR) images is an important prerequisite in clinical practice for reliably diagnosing and assessing a number of major cardiovascular diseases. Segmenting different regions of the heart, such as the ventricles and myocardium, helps physicians accurately assess cardiac function (e.g., pumping capacity) and formulate personalized treatment plans.
In recent years, segmentation algorithms for cardiac MR images based on deep neural networks (DNNs) have demonstrated remarkable accuracy, establishing state-of-the-art performance [3]. The U-Net segmentation network [4], a validated and effective DNN, has been extensively utilized in models for cardiac MR image segmentation. The authors of [5] developed a fully automatic three-dimensional model based on the 3D U-Net network. The authors of [6] employed the U-Net network to segment three cardiac structures in the short-axis cine sequence and then used an ensemble classifier for the classification and diagnosis of cardiac diseases. The authors of [7] combined the interpretability of the level set method with the U-Net network's ability to handle more complex images, proposing an automatic medical image segmentation method that integrates constraint terms and the level set method. These methods all leverage the segmentation capabilities of the U-Net network itself and propose new loss functions or combinable feature modules to enhance segmentation accuracy. However, when the dataset is small and limited, these DNN-based methods can suffer from severe overfitting. The difficulty of collecting large numbers of annotated medical samples has always been an objective obstacle in artificial intelligence medical imaging research. Therefore, data augmentation is a necessary means of alleviating the limitation of having few annotated samples in medical datasets.
Data augmentation is a commonly used strategy in machine learning and deep learning to increase the variability and quantity of samples in an existing dataset [8]. This is achieved by applying a series of transformations to the original data or by creating new samples through generative models, without collecting additional samples. Data augmentation aims to enhance the diversity and variability of input data without increasing the volume of collected data, while promoting the deep neural network's learning of invariance to transformed inputs, enabling the network to generalize better to unseen data and perform better in new scenarios [9]. Therefore, in medical image research, data augmentation can effectively alleviate the limited heterogeneity, inconspicuous features, and annotation difficulties caused by the small scale of medical datasets. It also mitigates the overfitting and poor generalization in deep neural networks caused by these issues [10,11].
Mainstream data augmentation methods can generally be divided into two major categories: transformation of original data and generation of artificial data [8]. The former typically increases the diversity of the dataset through various geometric or pixel-level transformations. The latter overcomes the limitations of the former's image transformation techniques and can produce more diverse and challenging samples. Among generative approaches, Generative Adversarial Networks (GANs) [12,13], Variational Autoencoders [14], and diffusion models [15] have all been applied to generating artificial samples for data augmentation. In particular, owing to their solid theoretical foundation and desirable properties in terms of feature distribution coverage, ease of training, and scalability, diffusion models have quickly been applied to various visual tasks. Diffusion models do not require adversarial training and excel in the diversity of generated images compared to GANs, making them naturally well suited for data augmentation; they have achieved remarkable results on natural image tasks. At the same time, diffusion models can be constrained by prior conditions, enabling them to generate samples of specified types. The conditioning inputs for diffusion-based data generation support multiple modalities, including text features, encoded features, segmentation maps, and more implicit feature vectors. In segmentation tasks, the input most relevant to model performance is the segmentation map of the sample data, and this work primarily utilizes segmentation maps as conditioning inputs.
Therefore, this paper proposes a conditional diffusion data augmentation framework based on multi-label spatial mask constraints for cardiac MR images. The framework leverages a conditional generative model to simulate the cardiac morphologies of different patients by generating pseudo-labels based on the spatial structural probability distribution of segmentation maps, thereby enhancing the diversity of training samples. The framework consists of three main steps. First, a multi-class pseudo-label mask diffusion model that conforms to the spatial probability distribution is trained on a labeled cardiac MR dataset. Second, a cardiac MR image latent diffusion model is trained using sample-mask pairs from the labeled dataset, with the mask serving as the generation condition; to ensure that the mask conditions are aligned at the pixel level with the cardiac MR images, this paper integrates the SPADE module into the decoding process of the diffusion model, thus constraining image generation. Third, during augmentation, the trained pseudo-label diffusion model is first used to generate multi-class cardiac masks, which serve as constraint conditions to control the generation of 2D cardiac pseudo-samples by the cardiac sample diffusion model. Because the diffusion model is prone to generating scattered label points, a label screening strategy is designed to filter the generated labels, preventing the creation of noisy images during the cardiac image generation phase.
Our contributions are as follows:
A diffusion-based data augmentation framework capable of generating cardiac MR images and their segmentation labels from scratch;
A multi-category cardiac mask generation model and a conditional cardiac MR image synthesis model;
Experiments demonstrating that our method achieves performance improvements on multiple datasets compared to traditional data augmentation methods.
3. Methods
The goal of the cardiac image segmentation model $S$ is to learn a mapping from the space of cardiac images $\mathcal{X}$ to the space of pixel-level labels $\mathcal{Y}$. Assume the training set $D_l$ consists of $N$ labeled data pairs, denoted as $D_l = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{H \times W}$ represents cardiac images and $y_i \in \{0, 1, \dots, K-1\}^{H \times W}$ represents segmentation maps, with $K$ being the number of classes (in this case, $K = 4$, where 0 stands for background, 1 for the left ventricle, 2 for the left ventricular myocardium, and 3 for the right ventricle). In supervised learning, an objective function $\mathcal{L}$ is used to quantify the probabilistic discrepancy between the predicted labels $\hat{y}_i = S(x_i; \theta)$ and the ground-truth labels $y_i$. The objective is to minimize this discrepancy with respect to the set of learnable parameters $\theta$ of the segmentation network, as shown in Equation (1):

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(S(x_i; \theta),\, y_i\big). \tag{1}$$
After the application of data augmentation in supervised learning, the augmented training set $D$ is composed of the original labeled dataset $D_l$ and the synthesized dataset $D_a$: $D = D_l \cup D_a$, where $D_a = \{(x_j^{a}, y_j^{a})\}_{j=1}^{M}$, $x_j^{a}$ represents the augmented cardiac images obtained through enhancement, and $y_j^{a}$ are the corresponding segmentation maps. Now, the objective function for training the segmentation network $S$ can be computed through optimization as shown in Equation (2):

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N + M} \sum_{(x, y) \in D_l \cup D_a} \mathcal{L}\big(S(x; \theta),\, y\big). \tag{2}$$

Obviously, obtaining a rich and diverse set of reasonable pseudo-samples $(x^{a}, y^{a})$ is crucial for enhancing the performance of cardiac region segmentation. Therefore, our goal is to generate virtual pseudo-labels $y^{a}$ using a diffusion model, given a limited number of training sample pairs $(x_i, y_i)$. Subsequently, based on these pseudo-labels, we aim to produce a certain number of reliable pseudo-images $x^{a}$ to improve the model's generalization capabilities and robustness.
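For illustration, a minimal PyTorch sketch of the combined objective in Equation (2) is given below. The names `real_ds`, `synthetic_ds`, and `seg_net` are assumed placeholders (Dataset objects yielding image-mask pairs and a segmentation network such as a U-Net), not components of a released implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader

# Sketch of Equation (2): the segmentation network is trained on the
# union D = D_l (real_ds) U D_a (synthetic_ds).
def train_epoch(seg_net, optimizer, real_ds, synthetic_ds, device="cuda"):
    loader = DataLoader(ConcatDataset([real_ds, synthetic_ds]),
                        batch_size=16, shuffle=True)
    criterion = nn.CrossEntropyLoss()              # pixel-wise discrepancy L
    for images, masks in loader:                   # masks hold class indices 0..3
        images, masks = images.to(device), masks.to(device).long()
        loss = criterion(seg_net(images), masks)   # logits: (B, K=4, H, W)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```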
To achieve this goal, this paper proposes a data augmentation method based on conditional generation with cardiac multi-label spatial mask constraints, as shown in Figure 1. First, a pseudo-label generative diffusion model is trained on the cardiac spatial structural masks, enabling the generation of pseudo-labels that conform to the spatial structural probability distribution of the label images. Next, we propose a conditional generative latent diffusion network based on cardiac segmentation labels. Cardiac images are first mapped from pixel space to a latent feature space using an encoder and then decoded to reconstruct the image, with the aim of obtaining an effective and stable encoder-decoder and latent feature space; a diffusion network is then trained on the features within this latent space. To ensure pixel-level alignment of the input segmentation labels with the final generated images, a SPADE module is integrated into the decoding part of the diffusion model to constrain image synthesis. Furthermore, during label generation, this paper introduces a label screening strategy that filters out effective and reasonable labels to be input into the subsequent conditional cardiac image generation model.
Next, we introduce the pseudo-label generative diffusion model, the pseudo-image conditional latent diffusion model, and the filtering strategy for effective labels.
3.1. Pseudo-Label Generative Diffusion Model
Cardiac mask labels represent the structure of the heart; therefore, different mask labels can enhance the diversity of cardiac structures. The goal of this phase is to train a pseudo-label generative diffusion model $p_\theta$ that takes random Gaussian noise as input and generates pseudo-labels $y^{a}$, resulting in a pseudo-label dataset $D_a^{y} = \{y_j^{a}\}_{j=1}^{M}$, where $M$ is the number of pseudo-labels. This process synthesizes a larger number of label images to be used as conditioning labels in the second phase. The label images contain the three spatial structural labels of the heart: the left ventricle, the left ventricular myocardium, and the right ventricle, allowing the basic cardiac spatial structure distribution to be learned directly. We refer to this as the multi-structure label diffusion generative module, which provides more direct ventricular structure guidance for subsequent image generation.
For the sake of conciseness, the true cardiac mask structure is denoted as $y$, with $y_0 \sim q(y)$. To maximize the data likelihood, the diffusion model defines both a forward (diffusion) process and a reverse (generative) process, as shown in Figure 2. During the forward process, a small amount of Gaussian noise is sequentially added to the mask $y_0$ over $T$ steps, following Equations (3) and (4):

$$q(y_t \mid y_{t-1}) = \mathcal{N}\big(y_t;\ \sqrt{1-\beta_t}\, y_{t-1},\ \beta_t I\big), \tag{3}$$

$$q(y_{1:T} \mid y_0) = \prod_{t=1}^{T} q(y_t \mid y_{t-1}), \tag{4}$$
where the added noise is sampled as $\epsilon \sim \mathcal{N}(0, I)$, with $I$ being the identity matrix. The set $\{\beta_t \in (0, 1)\}_{t=1}^{T}$ represents a schedule of variances, and $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The resulting sequence $y_0, y_1, \dots, y_T$ forms a Markov chain. Given $y_0$, the conditional probability of $y_t$ follows the Gaussian distribution in Equation (5):

$$q(y_t \mid y_0) = \mathcal{N}\big(y_t;\ \sqrt{\bar{\alpha}_t}\, y_0,\ (1-\bar{\alpha}_t) I\big). \tag{5}$$
In the reverse process, since $q(y_{t-1} \mid y_t)$ is not easily estimated, a neural network model $p_\theta$ is utilized to approximate it, which also follows a Gaussian distribution, as described in Equation (6):

$$p_\theta(y_{t-1} \mid y_t) = \mathcal{N}\big(y_{t-1};\ \mu_\theta(y_t, t),\ \Sigma_\theta(y_t, t)\big). \tag{6}$$
The network optimizes the negative log-likelihood through its variational lower bound, as shown in Equations (7) and (8):

$$-\log p_\theta(y_0) \le \mathbb{E}_q\left[ -\log \frac{p_\theta(y_{0:T})}{q(y_{1:T} \mid y_0)} \right] = L_{VLB}. \tag{7}$$

The objective function is the variational lower bound loss:

$$L_{VLB} = \mathbb{E}_q\Big[ \underbrace{D_{KL}\big(q(y_T \mid y_0) \,\|\, p(y_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(y_{t-1} \mid y_t, y_0) \,\|\, p_\theta(y_{t-1} \mid y_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(y_0 \mid y_1)}_{L_0} \Big], \tag{8}$$

where each term except $L_0$ represents the Kullback-Leibler (KL) divergence between two Gaussian distributions. In practice, a simplified version of $L_t$ is commonly used [13], as shown in Equation (9):

$$L_{simple} = \mathbb{E}_{t,\, y_0,\, \epsilon}\Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, y_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\big) \big\|^2 \Big]. \tag{9}$$
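As a minimal sketch, the simplified objective in Equation (9) can be implemented as follows, assuming a denoising network `eps_model(y_t, t)` (e.g., a U-Net) and a precomputed tensor `alpha_bar` holding the cumulative products $\bar{\alpha}_t$ for $t = 0, \dots, T-1$.

```python
import torch
import torch.nn.functional as F

# One training step of the simplified DDPM loss in Equation (9).
def ddpm_loss(eps_model, y0, alpha_bar):
    b = y0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=y0.device)  # random timestep
    eps = torch.randn_like(y0)                                        # target noise
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    y_t = a_bar.sqrt() * y0 + (1 - a_bar).sqrt() * eps                # forward diffusion (Eq. 5)
    return F.mse_loss(eps_model(y_t, t), eps)                         # || eps - eps_theta ||^2
```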
Once network training is complete, a new sample can be generated by starting from random Gaussian noise $y_T \sim \mathcal{N}(0, I)$ and denoising stepwise through the $T$ time steps, as shown in Equation (10):

$$y_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( y_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(y_t, t) \right) + \sigma_t z, \tag{10}$$

where $z \sim \mathcal{N}(0, I)$ for $t > 1$, $z = 0$ for $t = 1$, and $\sigma_t^2$ is typically set to $\beta_t$.
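A corresponding sampling-loop sketch is given below, reusing the assumed `eps_model` and 1-D schedule tensors `beta`, `alpha`, and `alpha_bar` of length $T$ from the training sketch above.

```python
import torch

# Reverse (sampling) process of Equation (10): iterative denoising from pure noise.
@torch.no_grad()
def ddpm_sample(eps_model, shape, beta, alpha, alpha_bar, device="cuda"):
    y = torch.randn(shape, device=device)                 # start from noise y_T
    for t in reversed(range(beta.shape[0])):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(y, t_batch)
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        y = (y - coef * eps) / alpha[t].sqrt()            # posterior mean
        if t > 0:                                         # no noise at the last step
            y = y + beta[t].sqrt() * torch.randn_like(y)  # sigma_t = sqrt(beta_t)
    return y                                              # synthesized pseudo-label mask
```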
To synthesize cardiac mask structures, an unconditional Denoising Diffusion Probabilistic Model (DDPM) is trained on the original cardiac masks. Following [22], this unconditional DDPM adopts a U-Net architecture. Ultimately, this process generates the pseudo-label dataset $D_a^{y} = \{y_j^{a}\}_{j=1}^{M}$.
3.2. Pseudo-Image Conditional Latent Diffusion Model
This stage involves synthesizing cardiac MR images conditioned on cardiac segmentation maps. In the absence of constraints, unconditional diffusion models generate a diverse range of samples. There are generally two approaches to conditionally constraining image synthesis: classifier-guided diffusion [24] and classifier-free guidance [23]. Since classifier-guided diffusion requires training a separate classifier, which is not suitable for this task and incurs additional training costs, we opt for classifier-free guidance to control the sampling process. Additionally, medical images are often high-resolution, the MR image space is more complex than the label space, and diffusion models have substantial computational requirements at high resolutions. To reduce the demand for computing time and resources, a latent diffusion model [25] is therefore trained: an autoencoder first learns a space that is perceptually equivalent to the image space, significantly reducing computational complexity, after which the diffusion model learns within this latent space. This approach enhances computational efficiency, while the U-Net architecture continues to effectively learn the spatially structured data.
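For clarity, the following is a minimal sketch of how classifier-free guidance combines conditional and unconditional predictions at sampling time. The denoiser `eps_model(z_t, t, cond)`, the null condition `null_cond`, and the guidance scale `w` are illustrative assumptions; it presumes training with the condition randomly dropped, as in [23].

```python
# Classifier-free guidance: blend conditional and unconditional noise estimates.
def cfg_eps(eps_model, z_t, t, cond, null_cond, w=1.5):
    eps_cond = eps_model(z_t, t, cond)          # condition: cardiac mask
    eps_uncond = eps_model(z_t, t, null_cond)   # unconditional prediction
    return (1 + w) * eps_cond - w * eps_uncond  # guided estimate (w = 0: no guidance)
```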
As shown in Figure 1, from the labeled dataset $D_l$, any image $x$ is selected and encoded by the encoder $E$ into the latent features $z = E(x)$. The decoder $D$ then reconstructs the image from the latent space, $\tilde{x} = D(z) = D(E(x))$. Here, $z \in \mathbb{R}^{h \times w \times c}$, and the encoder downsamples the image by a factor $f = H/h = W/w$, where $f$ is a hyperparameter. Once a stable and effective autoencoder is trained, each image $x$ can be encoded into its corresponding latent feature space. Subsequently, a diffusion network can be trained within the latent space, akin to Equation (9), yielding the loss function in the latent space, Equation (11):

$$L_{LDM} = \mathbb{E}_{E(x),\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[ \big\| \epsilon - \epsilon_\theta(z_t, t) \big\|^2 \Big]. \tag{11}$$
Given that the forward process is fixed, during training, the latent features can be efficiently obtained from E, and the features z can be decoded into the image space with a single pass through D.
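A minimal sketch of this latent-space training step follows, assuming a pretrained, frozen `encoder` ($E$) and a conditional denoiser `eps_model(z_t, t, mask)` that receives the cardiac label map as condition; `alpha_bar` is the cumulative noise schedule as before.

```python
import torch
import torch.nn.functional as F

# Latent diffusion training step (Equation (11)): diffuse and denoise in z-space.
def ldm_loss(eps_model, encoder, x, mask, alpha_bar):
    with torch.no_grad():
        z0 = encoder(x)                                   # map image to latent space
    b = z0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps    # forward diffusion in z-space
    return F.mse_loss(eps_model(z_t, t, mask), eps)       # noise-prediction loss
```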
Cardiac labels and cardiac MR images possess distinct feature spaces. Simply concatenating them within a denoising U-Net network, or passing them through a cross-attention module, can reduce image fidelity and result in unclear structural correspondence between synthesized cardiac MR images and their segmentation labels. Therefore, we employ a Spatially-Adaptive Normalization (SPADE) [27] module to align label information with the cardiac images. SPADE modules are built into the decoding process of the conditional synthesis U-Net generative network at different resolution layers, so as to leverage the multi-scale information of cardiac label structures. The encoder consists of stacks of residual blocks (ResBlocks) and attention blocks (AttnBlocks). The decoder is a stack of SPADE blocks and attention blocks. Each SPADE block is composed of SPADE, SiLU, and convolution layers, and takes feature maps and cardiac labels as inputs.
The SPADE module has been proven effective in semantic image synthesis by adjusting the normalized feature maps using spatially adaptive transformations learned from input semantic layouts, allowing for better preservation of semantic information compared to conventional normalization layers. This approach is particularly suitable for tasks such as semantic image synthesis, where the generation of realistic images from semantic masks is desired. By incorporating the SPADE module, the network can effectively propagate semantic information throughout the generative process, leading to the synthesis of images that are not only realistic but also aligned with the input semantic structures.
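A minimal sketch of a SPADE layer, following the idea in [27], is shown below: the label map is resized to the feature resolution and mapped to per-pixel scale and bias terms that modulate the normalized features. The layer sizes are illustrative assumptions, not the exact configuration used in our network.

```python
import torch.nn as nn
import torch.nn.functional as F

# SPADE: spatially-adaptive modulation of normalized feature maps by a label map.
class SPADE(nn.Module):
    def __init__(self, feat_channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)  # parameter-free norm
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.SiLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, label_map):
        # label_map: one-hot mask (B, label_channels, H, W), resized to feat resolution
        label_map = F.interpolate(label_map, size=feat.shape[-2:], mode="nearest")
        h = self.shared(label_map)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)
```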
3.3. Filtering Strategy for Effective Labels
The pseudo-label generative diffusion model denoises random Gaussian noise to generate pseudo-labels. However, the structure of these pseudo-labels influences the subsequent phase of cardiac MR image generation. We have identified four abnormal conditions that can occur after pseudo-label generation, which are crucial for the next phase:
(1) Disjointed parts within the cardiac structure; (2) Spatial configurations that do not align with clinical logic; (3) Excessively small pixel occupation by cardiac labels relative to the whole content; and (4) A large number of discrete labels appearing in the background.
In subsequent ablation studies, we evaluated the impact of including or excluding these abnormal labels. We discovered that the input of these abnormal labels affects the generation of pseudo-cardiac images and the performance of downstream segmentation tasks. It is essential to address these issues to enhance the quality and accuracy of the generated images and the subsequent diagnostic or therapeutic applications in the medical field.
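A minimal sketch of such a screening check is given below. The thresholds are illustrative assumptions (exact values are not reported here), and the clinical spatial-configuration check, condition (2), is omitted from the sketch.

```python
import numpy as np
from scipy import ndimage

# Screen a generated mask against conditions (1), (3), and (4) above.
def is_valid_label(mask: np.ndarray, min_pixels: int = 100) -> bool:
    for cls in (1, 2, 3):                        # LV, MYO, RV
        region = mask == cls
        num = ndimage.label(region)[1]           # connected components of this class
        if num != 1:                             # missing or disjointed structure (1)
            return False
        if region.sum() < min_pixels:            # structure occupies too few pixels (3)
            return False
    stray = ndimage.label(mask > 0)[1]           # foreground blobs overall
    return stray == 1                            # no discrete labels in background (4)
```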
4. Experiments
4.1. Cardiac MR Datasets
The cardiac dataset is derived from the M&Ms Challenge [28]. This public dataset includes multi-disease cardiac MR data from multiple centers and different scanners. We divided it into sub-datasets according to the scanner vendor and conducted experimental verification on the Canon and Siemens datasets. The Canon dataset includes cardiac MR data from 50 patients, which we divided into a training set of 35 cases, a validation set of 5 cases, and a test set of 10 cases, a 7:1:2 ratio. The Siemens dataset includes data from 94 patients, which we likewise divided into a training set of 66 cases, a validation set of 9 cases, and a test set of 19 cases, approximately a 7:1:2 ratio. Each case includes the left ventricle, left ventricular myocardium, and right ventricle in the end-diastolic and end-systolic frames, manually labeled by experts. We preprocessed the images, and to save computational cost, all images were centrally cropped and resized to a fixed resolution.
4.2. Experiments Details
To validate the effectiveness of the proposed augmentation method, its performance was rigorously evaluated through a downstream image segmentation network. The experimental setup involved specific configurations for the image generation module and the segmentation task.
Within the image generation module, the pseudo-label generative diffusion model's encoder and decoder each consist of six layers, with channel dimensions progressively increasing through 64, 128, 256, and 512, up to 1024. Each layer within both the encoder and decoder comprises two ResNet blocks, with the final three layers additionally incorporating attention blocks. This network was trained using the Adam optimizer with a batch size of 16. For the pseudo-image conditional latent diffusion model, the autoencoder component's encoder and decoder each feature five layers, with channel dimensions progressing through 64, 128, and 256, up to 1024. Each layer in both the encoder and decoder includes two ResBlocks, and Group Normalization is applied with 64 groups. This autoencoder network was trained using the Adam optimizer, with varying learning rates applied to different subsets of the training data. For the downstream image segmentation task, a U-Net architecture was employed, with both its encoder and decoder comprising four layers. This segmentation network was trained using the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.00001. The pseudo-label generative diffusion model was trained for 1000 epochs, the pseudo-image conditional latent diffusion model's autoencoder and LDM networks for 500 epochs each, and the segmentation model for 400 epochs. All experiments in this study were conducted on an NVIDIA A40 GPU.
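For reference, the training configuration described above can be consolidated as follows (a sketch; the dictionary keys are illustrative, and learning rates not specified in the text are marked `None`).

```python
# Consolidated training configuration from Section 4.2 (illustrative names).
CONFIG = {
    "label_ddpm":   {"optimizer": "Adam", "lr": None, "batch_size": 16,
                     "epochs": 1000, "channels": [64, 128, 256, 512, 1024]},
    "autoencoder":  {"optimizer": "Adam", "lr": None, "epochs": 500,
                     "channels": [64, 128, 256, 1024], "group_norm_groups": 64},
    "latent_ddpm":  {"optimizer": "Adam", "lr": None, "epochs": 500},
    "segmentation": {"arch": "U-Net", "optimizer": "Adam", "lr": 1e-4,
                     "weight_decay": 1e-5, "epochs": 400},
}
```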
4.3. Performance Evaluation of Methods
The experiment includes data augmentation and image segmentation. In the data augmentation phase, six different techniques are applied to increase the diversity and quantity of the training dataset, which is crucial for improving the robustness and generalization capability of the segmentation models. Each method has its unique way of altering the image data:
No Data Augmentation: This serves as the baseline, where the original dataset is used without any augmentation.
Affine Transformations: These include operations like rotation, translation, scaling, and shearing, which preserve the collinearity of points but not necessarily the distances.
Elastic Deformations: Also known as non-linear transformations, these allow for more complex distortions that can simulate various deformations in the image.
Pixel Intensity Transformations: Adjustments to the brightness, contrast, and color properties of the pixels to enhance the visibility and quality of the images.
CutMix Method: A data augmentation technique that combines two images by cutting a portion from one image and pasting it into another, encouraging the model to learn from the context of mixed images (see the sketch after this list).
Proposed Method: The novel data augmentation approach introduced in this study, which aims to generate more realistic and diverse samples to improve segmentation performance.
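The following is a minimal sketch of CutMix adapted to segmentation; as a judgment call for dense prediction, the same patch is pasted into both the image and its mask (all inputs are assumed NumPy arrays with spatial dimensions last).

```python
import numpy as np

# CutMix for segmentation: paste a random rectangle from sample B into sample A.
def cutmix(img_a, mask_a, img_b, mask_b, rng=None):
    rng = rng or np.random.default_rng()
    h, w = img_a.shape[-2:]
    lam = rng.beta(1.0, 1.0)                      # mixing ratio
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)     # random patch center
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    img, mask = img_a.copy(), mask_a.copy()
    img[..., y1:y2, x1:x2] = img_b[..., y1:y2, x1:x2]    # paste patch from image B
    mask[..., y1:y2, x1:x2] = mask_b[..., y1:y2, x1:x2]  # and its labels
    return img, mask
```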
The effectiveness of these methods is evaluated in the image segmentation phase, where the U-Net network, a popular choice for medical image segmentation due to its encoder-decoder structure with skip connections, is employed. The performance of the segmentation models is quantitatively assessed using the DSC coefficient, which measures the overlap between the predicted and actual segmentation masks, and the IoU value, which calculates the ratio of the overlapping area to the total area covered by the predicted and actual masks.
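For concreteness, the two metrics can be computed per class as in the straightforward sketch below (mask arrays are assumed to hold class indices).

```python
import numpy as np

# Per-class DSC and IoU between predicted and ground-truth masks.
def dsc_iou(pred: np.ndarray, gt: np.ndarray, cls: int):
    p, g = pred == cls, gt == cls
    inter = np.logical_and(p, g).sum()
    dsc = 2 * inter / (p.sum() + g.sum() + 1e-8)          # overlap vs. total mask sizes
    iou = inter / (np.logical_or(p, g).sum() + 1e-8)      # overlap vs. union area
    return dsc, iou
```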
By comparing the DSC and IoU values across different augmentation methods, we can determine which approach contributes the most to the improvement of segmentation accuracy and model generalizability. This comprehensive experimental setup ensures a thorough evaluation of the proposed data augmentation method against existing techniques in the context of cardiac MR image segmentation.
5. Results and Discussion
Table 1 and Table 2 display the quantitative outcomes of employing the various augmentation techniques on the two datasets. The reported values are the mean Dice coefficients for each technique, with the best and second-best methods highlighted in red and blue, respectively. The proposed method yields significant enhancements compared to the other augmentation strategies. Notably, the absence of any data augmentation results in the poorest segmentation performance. Data augmentation through affine transformations significantly boosts performance, and the incorporation of random elastic transformations further improves accuracy. However, augmentation that solely alters the pixel intensity of images can lead to a decline in segmentation performance. The CutMix method also improves the segmentation model's effectiveness; despite the counterintuitive appearance of images generated by CutMix, preserving authentic anatomical forms may not be essential for neural networks to attain superior segmentation outcomes. On the Canon dataset, our method achieved an average DSC improvement of 0.19 over the baseline model without augmentation, with increases of 0.16 for the left ventricle (LV), 0.23 for the myocardium (MYO), and 0.16 for the right ventricle (RV). Regarding the IoU values, an average enhancement of 0.18 was observed, with improvements of 0.16 for the LV, 0.24 for the MYO, and 0.15 for the RV.
In contrast to other data augmentation techniques, our method combines the strengths of preceding approaches on two crucial fronts. Firstly, by modeling the label space, the framework generates pseudo-label maps that stochastically adhere to the distribution of the original label space. This ensures significant variation in the structural aspects of the image space, mirroring and even extending the benefits typically achieved through conventional geometric transformations such as elastic or affine transformations. Secondly, the subsequent generative training of cardiac MR images ensures robust variability in the intensity at each pixel, substantially augmenting the overall diversity of the images. Consequently, the two phases we propose correspond precisely to the spatial variations and pixel-intensity alterations characteristic of conventional data augmentation methods. This dual-phase approach allows our framework to synthesize realistic and diverse data while maintaining anatomical plausibility, which is often a challenge for simpler augmentation techniques.
Figure 3 illustrates the generation process of our proposed framework, while Figure 4 presents the visualization results of the comparative data augmentation methods, offering a clear visual comparison of their distinct outputs.
6. Discussion
Our proposed framework demonstrates significant potential for broader application beyond the scope of this study, opening avenues for future research and deployment in diverse medical imaging contexts.
Firstly, the adaptability of our method is highlighted by its extensibility to segment other anatomical regions. By simply adjusting the training data to include images of different organs, such as the liver, brain, or kidneys, along with their corresponding mask conditions, the framework can be retrained to accurately delineate these new structures. This modularity suggests that the underlying principles of our approach are not confined to a single anatomical area but can be generalized across various body parts, making it a versatile tool for comprehensive medical image analysis.
Secondly, the framework’s applicability extends to different imaging modalities. While this study primarily focused on Cardiac MRI, our diffusion model’s core design allows for its retraining on the unique characteristics of other image types, such as CT or ultrasound, and their associated labels. This flexibility means the model can learn to interpret and generate segmentations from distinct data representations, greatly expanding its utility in clinical practice where multiple imaging techniques are often used in conjunction.
However, transitioning to these new domains or modalities does come with potential challenges and considerations. Variations in image quality, resolution, and pathological appearance across different anatomical regions or imaging modalities can significantly impact model performance. For instance, the inherent noise levels in ultrasound, the anisotropic resolution in some CT scans, or the diverse manifestations of diseases in different organs (e.g., tumors in liver vs. brain) would necessitate careful data curation and potentially modality-specific architectural adaptations or training strategies. Addressing these nuances will be crucial for ensuring robust and accurate segmentation performance in expanded applications.