MAM-E: Mammographic Synthetic Image Generation with Diffusion Models

Generative models are used as an alternative data augmentation technique to alleviate the data scarcity problem faced in the medical imaging field. Diffusion models have gathered special attention due to their innovative generation approach, the high quality of the generated images, and their relatively less complex training process compared with Generative Adversarial Networks. Still, the implementation of such models in the medical domain remains at an early stage. In this work, we propose exploring the use of diffusion models for the generation of high-quality, full-field digital mammograms using state-of-the-art conditional diffusion pipelines. Additionally, we propose using stable diffusion models for the inpainting of synthetic mass-like lesions on healthy mammograms. We introduce MAM-E, a pipeline of generative models for high-quality mammography synthesis controlled by a text prompt and capable of generating synthetic mass-like lesions on specific regions of the breast. Finally, we provide quantitative and qualitative assessment of the generated images and easy-to-use graphical user interfaces for mammography synthesis.


Introduction
Data scarcity is an important problem in the medical imaging domain, caused by factors such as expensive image acquisition, processing and labeling procedures, data privacy concerns, and the rare incidence of some pathologies [7]. This leads to a reduction in the volume of medical data available for training deep learning models, which limits the models' performance and holds back the development of computer-aided systems compared with non-medical imaging applications.
Generative models have been used to complement traditional data augmentation techniques and expand medical datasets, with Generative Adversarial Networks (GANs) being the state of the art (SOTA) due to their high image quality and photorealism. Nevertheless, unstable training, low generation diversity and low sample quality make the use of GAN-like architectures challenging for medical data [7], as medical diagnosis can depend on subtle changes in the organs' appearance reflected in the images, affecting the performance of computer-assisted diagnosis and intervention systems [11].
Diffusion models (DM) captured special attention from the generative models community when they were proposed for image generation and seemingly outperformed GANs in 2021 [2]. Since then, applications and research papers for medical images have been published to explore this new image generation principle. For instance, [3] proposed using the original diffusion model pipeline for computer vision, denoising diffusion probabilistic models (DDPM) [6], for the generation of high-quality MRI of brain tumors. This first implementation of diffusion models for 3D medical images reached SOTA results and outperformed the baseline models based on 3D GANs. Latent diffusion was used by [14] to generate high-resolution 3D brain images, increasing the image resolution from 64x64x64 to 160x224x160 without requiring more GPU memory or overall training time. The Fréchet Inception Distance (FID), for image fidelity, and the multi-scale structural similarity index measure (MS-SSIM), for generation diversity, were computed, and in both cases DM surpassed the GAN baseline metrics.
A Stable Diffusion (SD) implementation for medical images was introduced by [1], who proposed a model for chest X-ray generation. Their model, named RoentGen, was able to create visually convincing, diverse chest X-rays, controlling the generation output using text prompts with radiology-specific language.
A key characteristic of this work is the use of SD weights pretrained on natural images as a baseline. Instead of training from scratch, specific parts of the network were fine-tuned to adapt the weights from their original domain to a new medical domain. This DM fine-tuning approach is called DreamBooth and was first introduced by [19] for natural images.
Besides full-field image generation, DM can be used for other tasks such as image inpainting. Some works have explored lesion inpainting using DM for brain MRI, such as [18] from the Mayo Clinic. They developed a DDPM to execute several inpainting tasks, like generating synthetic lesions or healthy tissue, on slices of the 3D volumes in various sequences. Their model was capable of generating realistic tumoral lesions and tumor-free brain tissue, although the performance of the model was only assessed visually.
Despite all this, the use of diffusion models in the medical imaging field remains at an early stage, especially for mammography. Prior to this publication, we released the source code, weights and user interface for the first implementation of SD for mammographic image synthesis in [10]. Subsequent works have explored the generation of synthetic mammograms using DM, such as the release of a synthetic mammography dataset by [13], composed of 100k 512x512 synthetic images with masking level labeling, and the proposal of [8] to explore the use of SD for brain imaging and contrast-enhanced spectral mammography.
We introduce MAM-E, a pipeline of generative models for high-quality mammographic image synthesis, capable of generating images based on a text prompt description and of generating lesions on a specific section of the breast using a mask. Our pipeline was developed using stable diffusion, a SOTA diffusion model technique that uses both conditioning, to control the image generation, and a latent space, to allow high resolution without requiring large computational resources. The generated images are for presentation, meaning that their appearance and pixel intensities are meant for radiologist inspection, with the limitations on resolution and pixel depth inherent to the current state of diffusion pipelines. To the knowledge of the authors, this is the first work to use stable diffusion fine-tuning for lesion inpainting in mammography. Moreover, the publication of this work's source code represented the first implementation of SD for mammographic image generation.
Fig. 1. Graphical user interface of MAM-E for generation of synthetic healthy mammograms.
Our main pipeline can be separated into two tasks: healthy mammogram generation and lesion inpainting. For the first task, the generation process is controlled by text conditioning with a description of the image that includes view, breast density, breast area and vendor. For the second task, we use a stable diffusion inpainting model designed to generate synthetic lesions in desired regions of a mammogram. The name of our model was inspired by OpenAI's DALL-E [16]. The source code (https://github.com/Likalto4/diffusion-modelsmaster) and the pretrained weights (https://huggingface.co/Likalto4) are publicly available. Additionally, graphical user interfaces for both synthesis tasks were designed for easy-to-use image generation; their source code can be found in the same repository, and their characteristics are shown in figure 1.

Datasets
We decided to use two datasets for the training of the diffusion models so that different patient populations and mammography unit vendors were considered.

OMI-H
We used a subset of the OPTIMAM Mammography Image Database, consisting of around 40k Hologic vendor full-field digital mammograms (FFDM) from several UK breast screening centers and with different image views [5]. The dataset was composed of images with and without lesions (benign, malignant and interval cancers), and expert annotations are included in the respective cases, including the coordinates of a bounding box surrounding the lesion.

VinDr-Mammo
A second dataset composed of around 20k FFDM with breast-level assessment and extensive lesion annotation was also used. It consists of 5,000 mammography exams, each with 4 standard views (CC and MLO for both lateralities), coming from two primary hospitals in Vietnam, giving a total of 20,000 images in DICOM files [12]. Metadata for each image, consisting of both technical and clinical information, was also available in a CSV file. We filtered the images so that only mammograms coming from a Siemens vendor unit were used.
Table 1 shows the distribution of the cases among both datasets and their combination.

Data preprocessing and preparation
Both datasets were subject to the same preprocessing and preparation steps. First, mammograms were saved as PNG files to ensure faster access and lower disk usage. Secondly, to be able to use pretrained weights, the images were saved in RGB format, repeating the original gray channel into each RGB channel. The original image intensities, stored as uint16, were scaled to the [0, 255] range and stored with the reduced uint8 data type.

Healthy image generation
For each healthy mammogram, a text prompt description was created and saved along with the image ID in a JSON file. In the case of the OMI-H dataset, we created a prompt with the image view and breast area size information. We defined a criterion to categorize the breast area sizes into three main groups: small, medium and large. For the VinDr dataset, the breast density information was included in the prompt description instead of the breast area. Breast density was available on the BI-RADS scale, so we needed to transform this information into semantically meaningful text. The criteria used for both cases are defined in table 2.
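The preparation steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the min-max scaling rule, the function names `to_uint8_rgb` and `build_prompt`, and the JSON field names are assumptions. The prompt wording follows the example prompts quoted in the captions of figures 3 and 4; the thresholds of table 2 are not reproduced, so the category words are passed in directly.

```python
import json
import numpy as np


def to_uint8_rgb(image16: np.ndarray) -> np.ndarray:
    """Scale a 2-D uint16 image to [0, 255] and repeat it into three channels.

    Min-max scaling is an assumption; the paper only states that intensities
    were scaled to the [0, 255] range.
    """
    img = image16.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi > lo:
        img = (img - lo) / (hi - lo)
    else:
        img = np.zeros_like(img)  # constant image: avoid division by zero
    img8 = np.round(img * 255).astype(np.uint8)
    # Repeat the gray channel into R, G and B (channels-last layout).
    return np.stack([img8, img8, img8], axis=-1)


def build_prompt(view, area=None, density=None, vendor=None):
    """Compose a text prompt from the available image descriptors."""
    subject = f"a {vendor} mammogram" if vendor else "a mammogram"
    traits = []
    if density:
        traits.append(f"{density} density")
    if area:
        traits.append(f"{area} area")
    prompt = f"{subject} in {view} view"
    if traits:
        prompt += " with " + " and ".join(traits)
    return prompt


# Hypothetical record, as it could be stored next to the image ID in a JSON file.
record = {"image_id": "example_id", "prompt": build_prompt("MLO", area="small")}
print(json.dumps(record))
```

Keeping the prompt builder separate from the categorization criteria means the same function can serve the OMI-H prompts (view and area), the VinDr prompts (view and density) and the combined model (vendor added).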

Lesion inpainting
The inpainting task requires only mammograms with confirmed lesions. Binary masks were generated from the bounding-box coordinates available in the metadata. Because of the resizing and cropping performed during preprocessing, the original coordinates had to be transformed to the new image geometry. The mask has pixel values of 255 inside the bounding box and zero elsewhere. Because the SD architecture used for the inpainting task requires an input text prompt for the generation, a toy prompt with the text "a mammogram with a lesion" was used for all training images.
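As an illustration of the mask construction described above, the sketch below rescales a bounding box to a resized image and fills it with 255. It assumes a plain resize with no cropping (a crop would additionally require subtracting the crop offset), and the function name `bbox_to_mask` and the coordinate order are hypothetical.

```python
import numpy as np


def bbox_to_mask(bbox, original_size, target_size=(512, 512)):
    """Build a binary mask (255 inside the box, 0 elsewhere) on the resized image.

    bbox is (x_min, y_min, x_max, y_max) in original pixel coordinates;
    original_size and target_size are (height, width).
    Assumption: the image was only resized, not cropped.
    """
    x_min, y_min, x_max, y_max = bbox
    scale_y = target_size[0] / original_size[0]
    scale_x = target_size[1] / original_size[1]

    # Round outwards so the rescaled box still covers the whole lesion.
    row0 = max(int(np.floor(y_min * scale_y)), 0)
    row1 = min(int(np.ceil(y_max * scale_y)), target_size[0])
    col0 = max(int(np.floor(x_min * scale_x)), 0)
    col1 = min(int(np.ceil(x_max * scale_x)), target_size[1])

    mask = np.zeros(target_size, dtype=np.uint8)
    mask[row0:row1, col0:col1] = 255
    return mask


# Hypothetical lesion box on a 2048x1024 (height x width) mammogram.
mask = bbox_to_mask((100, 200, 300, 600), original_size=(2048, 1024))
```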

Diffusion models
The original diffusion model idea was presented by [20] and consisted of using a Markov chain, a sequence of stochastic events in which each step depends only on the previous one, to gradually convert one known distribution (e.g. a Gaussian distribution) into another (the target distribution). Inspired by non-equilibrium statistical physics, the main idea is to systematically and iteratively destroy structure in a data distribution through a process called forward diffusion. Then, the reverse diffusion process is learned and used to restore structure in data. The first practical implementation of the DM premise on images was developed by [6], who introduced denoising diffusion probabilistic models (DDPM). In this framework, the data is destroyed by adding Gaussian noise to the image in an iterative fashion described by the Markov chain. The total number of diffusion timesteps T is defined by the user, but a usual value is around T = 1000. To learn the reverse process, a UNet is used to carry out the denoising.
To solve the image size limitation, latent diffusion was introduced, which uses encoders to compress images from their original sizes in the image space into a smaller representation in the latent space. The motivation behind this is that images usually contain redundant information and an encoder can produce a smaller representation that can later be reconstructed using a decoder. Therefore, in latent diffusion the diffusion process is performed on the latent representations rather than the original images [17].
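The forward process described above has a closed form that lets a noisy sample at any timestep be drawn directly from the clean image. The sketch below illustrates it under stated assumptions: the linear noise schedule from 1e-4 to 0.02 is the one used in the DDPM paper and is not necessarily the schedule of the pipeline in this work, and the function names are hypothetical. The loss shown is the usual DDPM objective, the mean squared error between the true and the predicted noise.

```python
import numpy as np

T = 1000                                  # number of diffusion timesteps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor


def q_sample(x0, t, noise):
    """Draw x_t from x_0 in one step: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise.

    t is a zero-based timestep index in [0, T - 1].
    """
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise


def noise_prediction_loss(predicted_noise, noise):
    """Mean squared error between the predicted and the true noise."""
    return float(np.mean((predicted_noise - noise) ** 2))


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(64, 64))    # stand-in for a normalized image
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 0, noise)              # almost identical to x0
x_late = q_sample(x0, T - 1, noise)           # almost pure Gaussian noise
```

In training, a random timestep is drawn per image, the UNet receives the noisy sample and the timestep, and its output is compared with `noise` through the loss above.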
Stable diffusion builds on the latent diffusion work of [17] by adding text conditioning to the model for additional control over the generation process. The text conditioning is a prompt with the description of the image. To create a numeric representation of the prompt, a pretrained transformer called CLIP is used [15]. CLIP, which stands for Contrastive Language-Image Pre-training, maps both text and images into the same representational space, allowing comparison and similarity quantification between them [4].
Our experiments were conducted using stable diffusion models for both generation tasks, adapting the DreamBooth fine-tuning technique with pretrained stable-diffusion-v1-5 weights as baseline, publicly available in the Hugging Face model hub [17].

Healthy image generation
For each dataset we trained a separate model using only healthy images, as each dataset contains independent semantic information in the prompt and because the intensity ranges and image details differ between populations. Additionally, a third model was trained on the combination of mammograms from both vendors, with the vendor's name added to the prompt.
We decided not to fine-tune the VAE encoder and decoder after testing their encoding-decoding performance on our mammograms using weights pretrained on natural images. Moreover, [1] found that a VAE pretrained on natural images can perform well on chest X-ray images. Using this VAE encoder, an original image of 512x512 pixels is compressed to 4 latent representations of 64x64, which holds 16 times fewer values than the original single-channel image [9]. Consequently, the diffusion process is performed on the latent representations rather than the original images, allowing lower memory usage, fewer layers in the UNet, and faster training and generation.
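The compression figures quoted above can be checked with a few lines of arithmetic. The sketch assumes the spatial downsampling factor of 8 and the 4 latent channels of the Stable Diffusion v1 VAE; the helper name `latent_shape` is hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (latent_channels, height // downsample, width // downsample)


channels, lat_h, lat_w = latent_shape(512, 512)       # (4, 64, 64)
latent_values = channels * lat_h * lat_w               # values stored in the latent
gray_values = 512 * 512                                # single-channel mammogram

ratio_vs_gray = gray_values / latent_values            # reduction vs. one gray channel
ratio_vs_rgb = 3 * gray_values / latent_values         # reduction vs. the RGB copy
print(latent_shape(512, 512), ratio_vs_gray, ratio_vs_rgb)
```

Because the three RGB channels of the prepared images are identical copies, the ratio against the single gray channel is the more meaningful of the two.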
Since the VAE was kept frozen, only the CLIP text encoder and the UNet weights were trained. The UNet architecture is the original SD UNet proposed by [17]. The network has four 2D down- and upsampling blocks. Except for the last downsampling block (and its corresponding upsampling block), all blocks are composed of two ResNet blocks and two transformer blocks, one after the other. The timestep embedding is added to the ResNet blocks, whereas the text embedding is added through cross-attention in the transformer blocks. For the last downsampling block (and the first upsampling block), only the timestep information is fed.
The number of training steps in our experiments ranged from 1k up to 16k.
To select the best hyperparameters and to track the performance of the models, a validation process was conducted by generating 4 sample images from the same random Gaussian noise every 100 or 200 training steps. The training loss (mean squared error) and the GPU memory usage were also logged.

Lesion inpainting
The SD pipeline described for task 1 can be modified in some key aspects to perform the inpainting task. We propose using the modified DreamBooth fine-tuning pipeline to inpaint lesions in a designated region of the breast.
For each mammogram with a lesion, two new elements are added per example: the mask and a masked version of the original image. In the masked version, the pixel values inside the bounding box are set to zero. At training time, both the image and the masked image are first encoded into the latent space. The mask must also be resized to the latent representation size. The rest of the diffusion process remains the same except for one crucial difference: instead of feeding only the latent representation to the UNet, the latent representation, the mask, and the masked latent representation are stacked into one tensor. This small change in the training process allows the network to pay attention only to the pixels inside the mask, as the pixels outside of it are always provided. This process is described in figure 2.
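The stacking step described above can be sketched with plain arrays. This is an illustration under assumptions: random arrays stand in for the VAE latents, the mask is reduced to the latent grid by nearest-neighbour subsampling (the paper does not state the interpolation used), and the channel order (noisy latents, mask, masked-image latents) follows the convention of the Hugging Face diffusers inpainting pipeline. Function names are hypothetical.

```python
import numpy as np


def downsample_mask(mask, factor=8):
    """Reduce a (H, W) mask with values {0, 255} to the latent grid as {0, 1}."""
    return (mask[::factor, ::factor] > 0).astype(np.float32)


def build_unet_input(noisy_latents, mask_latent, masked_image_latents):
    """Stack 4 + 1 + 4 channels into the 9-channel tensor fed to the UNet."""
    return np.concatenate(
        [noisy_latents, mask_latent[None, ...], masked_image_latents], axis=0
    )


rng = np.random.default_rng(0)

# Pixel-space mask: 255 inside a hypothetical lesion bounding box.
mask = np.zeros((512, 512), dtype=np.uint8)
mask[128:256, 192:320] = 255

# Random stand-ins for the VAE latents of the noisy image and the masked image.
noisy_latents = rng.standard_normal((4, 64, 64)).astype(np.float32)
masked_image_latents = rng.standard_normal((4, 64, 64)).astype(np.float32)

mask_latent = downsample_mask(mask)
unet_input = build_unet_input(noisy_latents, mask_latent, masked_image_latents)
print(unet_input.shape)
```

Only the first convolution of the UNet has to change to accept the extra five channels; the rest of the network is the same as in the healthy image generation task.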

Independent datasets
Training examples of the conditional model using prompt text are shown in figure 3 for the OMI-H dataset. We observe that the fine-tuning technique allows the generation of meaningful images from the first epoch. In this example, as training progresses, the breast in the generated mammogram shrinks in accordance with the area described in the prompt text. Thanks to the combined fine-tuning of the CLIP text encoder and the UNet weights, our conditional models can learn the anatomical structure and form of a mammogram, and can also push the generated image in the direction of the text prompt semantics as training progresses.

Concept extrapolation
Besides allowing us to select the vendor type of the generated mammogram, the combination of both datasets made it possible to extrapolate the characteristics of one dataset to the other. This means that, for example, the breast density of the Hologic mammograms could be controlled, even though this information was not available in the Hologic dataset.

Lesion generation
Initial results of the lesion generation pipeline show that lesions can be inpainted in any part of the mammogram, as shown in figure 5. As the experimentation with this pipeline was preliminary, only a CAD assessment was conducted to investigate the sensitivity of a lesion classification model when a synthetic lesion is presented. Results of this assessment are shown in figure 6 and discussed in the following section.

Assessment
Radiological assessment
A visual assessment experiment was performed with the radiological evaluation of 53 synthetic images by a radiologist. The experiment consisted of asking a radiologist with 30 years of experience to rate the mammograms on a scale from 0 (definitely synthetic image) to 4 (definitely real image). The distribution of the mammograms had a 50/50 real-synthetic ratio. The results of the test are summarized as a ROC curve in figure 6. The shape of the ROC curve resembles the random-guess curve, suggesting that the radiologist could not easily distinguish real from synthetic images. Moreover, the AUROC value obtained by the radiologist for this synthetic classification task was 0.49.
For the lesion inpainting, the heatmaps of three explainable AI (XAI) methods were computed for a healthy mammogram with an inpainted synthetic lesion. The CAD system used was a full-field mammogram classification model for benign and malignant breast lesions. The XAI interpretation methods applied were Grad-CAM, saliency and occlusion, and their respective heatmaps can be seen in figure 6. The hypothesis is that, when a mammogram with a synthetic lesion is used as input, the algorithm should highlight the synthetic lesion area, indicating that synthetic lesions have a pixel distribution similar to that of lesions present in real images.
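For readers who want to reproduce this kind of reader study, the AUROC can be computed directly from the ordinal ratings as the probability that a randomly chosen real image receives a higher rating than a randomly chosen synthetic one, with ties counted as one half. The sketch below is a generic implementation of that definition; the function name and the ratings are hypothetical and are not the data of the experiment reported here.

```python
def auroc(ratings_real, ratings_synthetic):
    """AUROC of a reader who rates images from 0 (synthetic) to 4 (real).

    Computed over all real/synthetic pairs: a pair scores 1 when the real image
    has the higher rating, 0.5 on a tie, and 0 otherwise.
    """
    total = 0.0
    for real in ratings_real:
        for synthetic in ratings_synthetic:
            if real > synthetic:
                total += 1.0
            elif real == synthetic:
                total += 0.5
    return total / (len(ratings_real) * len(ratings_synthetic))


# Hypothetical ratings, for illustration only.
example_real = [4, 2, 1, 3, 0]
example_synthetic = [3, 2, 4, 0, 1]
print(auroc(example_real, example_synthetic))
```

A value close to 0.5 means the ratings carry no information about which images are real, which is the desired outcome for a generative model in this test.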

Conclusion
Stable diffusion text conditioning is a suitable generative model implementation to synthesize mammograms with specific characteristics and properties. Moreover, fine-tuning an SD model pretrained on natural images with mammographic images is feasible; the training objective is to shift the learned data distribution from a non-medical domain to our mammography datasets.
We also found that SD can be modified for inpainting of synthetic lesions over healthy mammograms. The developed pipeline essentially only requires modifying the input latent representation to include a mask that focuses the generation process on that region. The inference pipelines of all these models were made accessible and ready to use through graphical user interfaces, and the weights and code were made available through personal repositories.
Finally, we found initial evidence that the synthetic images coming from our implementation of SD could potentially be used for CAD systems in need of specific image characteristics or of images with lesions. A radiological assessment showed that the initial image quality is comparable to that of real mammograms, and the use of explainable AI methods helped to explore the behavior of a classification model.
The first clear limitation of this work is the resolution and pixel depth of the synthetic mammograms. This limited resolution restricts the use of our synthetic images in CAD systems that require higher resolution, such as micro-calcification detection. The pixel depth was also reduced from its original 16 bits to 8 bits to match the pretrained model requirements. This reduction loses some information in the images and reduces the overall contrast. With the release of pretrained SD weights for 768x768 images, we expect that minimal changes to our current pipeline will allow higher-resolution mammography generation. We also plan to train complete CAD pipelines with and without synthetic images to analyze performance changes.

Fig. 2. Inpainting training pipeline. The mask is reshaped to match the image size of the latent representations (64x64). The same UNet as in the SD pipeline is used.

Fig. 3. Training evolution of SDM with Hologic images at epoch 1, 3, 6 and 10. The prompt is: "a mammogram in MLO view with small area".

Fig. 4. Training evolution of the diffusion process on a conditional pretrained model trained with both Siemens and Hologic images at epoch 1, 3, 7 and 40. The prompt is: "a siemens mammogram in MLO view with high density and small area".

Fig. 6. Heatmaps of explainable AI methods for a synthetic lesion over a real healthy mammogram (left) and ROC curve of the radiological assessment experiment (right).

Table 1. Distribution of cases for both datasets.

Table 2. Criteria for breast area size and breast density.