Bioengineering
  • Article
  • Open Access

15 July 2025

Implementation of a Conditional Latent Diffusion-Based Generative Model to Synthetically Create Unlabeled Histopathological Images

1 Computational Systems Biology Laboratory, Division of Information Science, Nara Institute of Science and Technology, Nara 630-0192, Japan
2 Data Science Center, Nara Institute of Science and Technology, Nara 630-0192, Japan
3 Department of Surgery and Oncology, Kyushu University, Fukuoka 819-0395, Japan
* Authors to whom correspondence should be addressed.
This article belongs to the Section Biosignal Processing

Abstract

Generative image models have revolutionized artificial intelligence by enabling the synthesis of high-quality, realistic images. These models utilize deep learning techniques to learn complex data distributions and generate novel images that closely resemble the training dataset. Recent advancements, particularly in diffusion models, have led to remarkable improvements in image fidelity, diversity, and controllability. In this work, we investigate the application of a conditional latent diffusion model in the healthcare domain. Specifically, we trained a latent diffusion model using unlabeled histopathology images. Initially, these images were embedded into a lower-dimensional latent space using a Vector Quantized Generative Adversarial Network (VQ-GAN). Subsequently, a diffusion process was applied within this latent space, and clustering was performed on the resulting latent features. The clustering results were then used as a conditioning mechanism for the diffusion model, enabling conditional image generation. Finally, we determined the optimal number of clusters using cluster validation metrics and assessed the quality of the synthetic images through quantitative methods. To enhance the interpretability of the synthetic image generation process, expert input was incorporated into the cluster assignments.

1. Introduction

Generative artificial intelligence (GAI) has emerged as one of the most transformative technological advancements in recent years. From text generation and image synthesis to code automation and molecular design, generative models are reshaping industries, redefining creativity, and challenging conventional thinking. In particular, generative AI has revolutionized image synthesis, opening new frontiers in creativity, automation, and practical applications. Leading this progress are diffusion models, a class of generative models that have demonstrated remarkable abilities in producing high-fidelity images, inpainting, super-resolution, and conditional synthesis. By employing iterative denoising processes inspired by thermodynamics, diffusion models have surpassed traditional generative adversarial networks (GANs) and variational autoencoders (VAEs) in terms of sample quality, stability, and controllability.
The remarkable capability of diffusion models to generate high-quality images has been highlighted in numerous recent research articles. Ho et al. first proposed denoising diffusion probabilistic models (DDPMs) as a powerful alternative to GANs for high-quality image synthesis []. Unlike GANs, which often suffer from mode collapse and training instability [], diffusion models offer more predictable training, making them more robust and reliable for large-scale image generation tasks. Rombach et al. later proposed latent diffusion models (LDMs) []. Unlike standard diffusion models that operate directly in pixel space, LDMs compress images into a lower-dimensional latent space before applying the diffusion process, significantly reducing computational complexity while maintaining high-quality generative outputs. The authors of [] presented techniques to improve the performance of DDPMs in terms of efficiency, sampling speed, and likelihood estimation, all while preserving high-quality generative performance. Furthermore, the findings in [] suggest that DDPMs can automatically and optimally adapt to unknown low-dimensional structures within data, thereby enhancing their practical efficiency.
In addition to the impressive capabilities offered by GAI models, ensuring the integrity and reliability of generated or derived features is a critical concern, especially in sensitive domains such as medical imaging. Recent studies have shown that even minor corruption, such as noisy, false, or mislabeled features, can significantly degrade the performance of both classical and deep learning classifiers, regardless of the model architecture []. This vulnerability underscores the importance of robust feature engineering and the careful validation of synthetic data pipelines. In the context of histopathological image analysis, where ground truth annotations are often scarce, the use of unsupervised or synthetic representations requires particular attention to data quality, as corrupted latent features can silently undermine downstream classification and interpretation. These concerns further motivate the development and evaluation of conditional generative models that not only synthesize realistic images but preserve structural fidelity and robust feature representations.
Another study has highlighted several critical vulnerabilities associated with the use of synthetic medical data generated by artificial intelligence []. AI models trained on such data can be highly sensitive to its quality and representativeness, which may hinder their ability to generalize to real-world clinical scenarios. Furthermore, the process of generating synthetic data may inadvertently omit rare but important clinical features, perpetuate existing biases, or introduce artificial patterns that undermine model reliability. In addition, models trained on synthetic datasets may produce explanations that differ from those based on real data, thereby challenging interpretability and clinical trust. These vulnerabilities underscore the need for the rigorous evaluation and validation of generative AI systems intended for medical applications.
While it is important to recognize the potential vulnerabilities associated with GAI, the adoption of LDMs offers several distinct advantages that help address these concerns. LDMs operate in a compressed latent space, which not only enhances computational efficiency but enables the generation of high-fidelity, structurally realistic synthetic images. This latent-space approach helps preserve both global and local features, making the synthetic data more representative of true biological variability. Moreover, conditional LDMs allow for flexible and controllable synthesis without the need for explicit annotations, enabling the creation of new data that reflects real-world complexity. By employing robust perceptual and structural similarity metrics during training, both the realism and feasibility of generated images can be ensured. Consequently, LDMs represent a promising direction for generating high-quality, privacy-preserving, and application-specific synthetic medical data, which can improve model training, validation, and interpretability in challenging domains such as histopathology.
In our research, we implemented a conditional latent diffusion model (cLDM) to generate synthetic histopathology images. First, we trained a VQ-GAN to compress high-dimensional image data into a lower-dimensional latent space. Next, we performed a diffusion process on the pre-trained latent space, in which noise was added to the latent features and a reverse process was learned to reconstruct clean latent representations over multiple steps. We then applied clustering to these latent features and used the clustering results to guide the diffusion model in generating specific outputs. This conditioning approach was necessary due to the absence of labels in our dataset. Subsequently, we determined the optimal number of clusters using cluster validation metrics. Finally, we generated a large number of synthetic images in the context of each cluster set and evaluated the quality of the generated images using several metrics to assess their similarity to the original references.

3. Materials and Methods

3.1. Description of the Dataset

In this study, we used histopathology images of KPC mice as our research dataset. The Department of Surgery and Oncology at the Graduate School of Medical Sciences, Kyushu University, provided these images. The dataset is unlabeled and consists of images stained using the Hematoxylin and Eosin (HE) technique, which is the most commonly used staining method in histopathology for detailed visualization of cell structures and tissues. Hematoxylin stains the cell nuclei purple, while eosin stains the fibers pink []. The dataset contains a total of 11,000 images, of which 10,000 were used for training the model and 1,000 were reserved for testing. All images have a resolution of 128 × 128 pixels.

3.2. Vector Quantized Generative Adversarial Network (VQ-GAN)

We selected the VQ-GAN as the autoencoder framework for our latent diffusion model. Using an autoencoder in the LDM makes the diffusion process more efficient, as it is applied in the image’s latent space rather than in the pixel space, thereby reducing computational requirements. The VQ-GAN is a generative model that combines vector quantization (VQ), autoencoders, and GANs to efficiently generate high-quality images while operating in a discrete latent space. It is an improved version of the VQ-VAE (Vector Quantized Variational Autoencoder), incorporating adversarial training to produce sharper and more realistic images. The inclusion of a discriminator and a discretized latent space in the VQ-GAN results in less blurry images compared to conventional VAEs [].
Our VQ-GAN model consists of three main components: the VQ-VAE architecture, the LPIPS (learned perceptual image patch similarity) model, and the PatchGAN discriminator. First, we briefly describe the VQ-VAE network. In [], the VQ-VAE model was introduced, pioneering the use of a discrete codebook to represent latent variables. Unlike traditional VAEs, which use continuous latent variables, the VQ-VAE quantizes latent representations using a learned codebook of discrete embeddings. This approach makes the autoencoder well-suited for applications such as image generation, speech synthesis, and reinforcement learning. The VQ-VAE loss function consists of three components as follows:
  $L_{VQVAE} = L_{rec} + L_{cb} + \beta L_{commit}$ (1)
Here, $L_{rec}$ is the reconstruction loss, measured as the mean squared error (MSE) between the original image and the reconstructed image; $L_{cb}$, or codebook loss, updates the codebook embeddings by bringing them closer to the encoded latent representations; $L_{commit}$ is the commitment loss, which encourages the encoder’s latent outputs to remain close to the discrete codebook vectors and prevents codebook collapse. The parameter $\beta$ is the commitment coefficient, with values typically chosen between 0 and 1. For our experiment, we selected 0.2 as the value of $\beta$.
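To make the three terms of Equation (1) concrete, the following PyTorch sketch shows one way they could be computed. The tensor names (`z_e` for the continuous encoder output, `z_q` for the selected codebook vectors) are assumptions for illustration, and the straight-through gradient trick of the full VQ-GAN training loop is omitted.

```python
import torch
import torch.nn.functional as F

def vqvae_losses(x, x_rec, z_e, z_q, beta=0.2):
    """Illustrative sketch of the three VQ-VAE loss terms in Equation (1).

    x, x_rec : original and reconstructed images
    z_e      : continuous encoder output
    z_q      : nearest codebook embeddings selected for z_e
    beta     : commitment coefficient (0.2 in this work)
    """
    # Reconstruction loss: MSE between the original and reconstructed image
    l_rec = F.mse_loss(x_rec, x)
    # Codebook loss: pull codebook vectors toward the (detached) encoder outputs
    l_cb = F.mse_loss(z_q, z_e.detach())
    # Commitment loss: keep encoder outputs close to the (detached) codebook vectors
    l_commit = F.mse_loss(z_e, z_q.detach())
    return l_rec + l_cb + beta * l_commit
```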
Secondly, an LPIPS model is used to measure the perceptual loss in the VQ-GAN framework. Perceptual loss ensures that the generated images not only match the pixel values of the original images but preserve high-level features, such as textures and structure. By comparing deep feature representations rather than raw pixel values, perceptual loss helps produce sharper and more realistic images. To calculate this loss, high-level feature maps of both the original and the generated images are extracted from a pre-trained neural network (such as VGG or ResNet), and the perceptual distance between them is computed—a concept introduced in []. In our work, we used a pre-trained VGG-16 model via the PyTorch (version 2.4.1) deep learning library []. Feature representations were extracted at various layers of the VGG-16, and the L2 distance was computed between the corresponding features. We then applied normalization and linear weighting layers before aggregating the differences. Finally, the perceptual loss was averaged over the spatial dimensions.
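As an illustration of the perceptual loss described above, the sketch below extracts VGG-16 feature maps at several layers and averages the distances between channel-normalized features. It is a simplified stand-in for the LPIPS model used in our framework: the learned linear weighting layers and the input normalization of the full metric are omitted, and the chosen layer indices are an assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class VGGPerceptualLoss(torch.nn.Module):
    """Simplified LPIPS-style perceptual loss built on frozen VGG-16 features."""
    def __init__(self, layer_ids=(3, 8, 15, 22, 29)):  # relu1_2 ... relu5_3
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)

    def forward(self, x, y):
        loss, fx, fy = 0.0, x, y
        for i, layer in enumerate(self.vgg):
            fx, fy = layer(fx), layer(fy)
            if i in self.layer_ids:
                # Unit-normalize along channels, then average squared differences
                loss = loss + (F.normalize(fx, dim=1) - F.normalize(fy, dim=1)).pow(2).mean()
            if i >= max(self.layer_ids):
                break
        return loss
```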
Finally, we describe the discriminator used in our VQ-GAN model. The discriminator plays a crucial role in enhancing the realism and sharpness of generated images by enforcing adversarial training []. In adversarial training, two neural networks, a generator and a discriminator, are trained in a competitive setting. The objective is to improve the quality of the generated outputs by encouraging the generator to produce increasingly realistic samples. In our experiment, we used a PatchGAN discriminator, while the decoder of the VQ-VAE serves as the generator. Unlike traditional discriminators that classify an entire image as real or fake, the PatchGAN discriminator divides the image into smaller patches and classifies each patch independently []. The output is a feature map in which each value corresponds to a specific patch in the input image. The final adversarial loss is computed over all patches, which encourages the generator to focus on producing realistic local details.
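The sketch below shows a minimal PatchGAN-style discriminator in PyTorch. The channel widths and number of downsampling blocks are illustrative assumptions; the essential property is that the output is a grid of patch-level real/fake logits rather than a single scalar per image.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: each output value scores one image patch."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        layers = [nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(2):  # two further downsampling blocks
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.BatchNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        # Final 1-channel map: one real/fake logit per spatial location (patch)
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # shape (B, 1, H', W')
```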

3.3. VQ-GAN Hyperparameters

We have listed the hyperparameters selected for our VQ-GAN model in Table 1.
Table 1. Hyperparameters for the VQ-GAN.

3.4. The Conditional Latent Diffusion Model (cLDM)

After training the VQ-GAN model, we utilize the pre-trained latent space and apply a diffusion process to it, hence the term ‘latent diffusion’. Operating in the latent space significantly reduces computational cost. The denoising diffusion probabilistic model (DDPM) in the latent space involves two main steps: the forward (diffusion) process and the reverse (denoising) process.
In the forward process, the latent representation is gradually corrupted by Gaussian noise over a finite number of timesteps. Equation (2) illustrates this forward process.
  $q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{\alpha_t}\, z_{t-1},\ (1 - \alpha_t)\mathbf{I}\right)$ (2)
Here, $\alpha_t$ defines the noise schedule. For our experiment, we used a linear noise scheduler. The entire process is carried out over $T$ timesteps. After the forward process, we obtain a noisy latent $z_t$ from the original clean latent $z_0$.
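For illustration, the sketch below sets up a linear variance schedule and draws a noisy latent using the closed form obtained by iterating Equation (2), $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The number of timesteps and the schedule endpoints are common DDPM defaults assumed here for the example, not necessarily the exact values listed in Table 2.

```python
import torch

T = 1000                                    # number of diffusion timesteps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear variance schedule (assumed endpoints)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product, i.e. alpha-bar_t

def q_sample(z0, t, noise=None):
    """Forward process: sample z_t ~ q(z_t | z_0) in closed form."""
    if noise is None:
        noise = torch.randn_like(z0)
    ab = alpha_bars.to(z0.device)[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * noise
    return z_t, noise
```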
In the reverse process, a neural network, typically a U-Net, is trained to predict and remove the noise step-by-step, reconstructing $z_0$. The reverse process is represented by Equation (3).
  $p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_\theta(z_t, t, c)\right)$ (3)
Once the latent representation $z_0$ is recovered, the pre-trained VQ-VAE decoder reconstructs the final image. The objective function for the U-Net is given by Equation (4).
  $L_t = \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert^2$ (4)
This objective is essentially the MSE between the actual and the predicted noise. Here, $c$ represents the conditioning input to the U-Net. In our approach, we used the clustering results obtained from clustering the latent space as the conditioning input for the U-Net.
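A minimal training step for the cluster-conditioned noise predictor might look as follows, reusing `T` and `q_sample` from the previous sketch. The `unet(z_t, t, cluster_ids)` call signature is an assumption for illustration; the actual conditioning interface of our U-Net may differ.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, z0, cluster_ids, optimizer):
    """One optimization step of the Equation (4) objective with cluster conditioning."""
    t = torch.randint(0, T, (z0.size(0),), device=z0.device)  # random timesteps
    z_t, eps = q_sample(z0, t)               # forward-diffuse the clean latents
    eps_pred = unet(z_t, t, cluster_ids)     # predict the added noise, conditioned on cluster IDs
    loss = F.mse_loss(eps_pred, eps)         # MSE between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```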

3.5. Information Maximization-Based Clustering

Information maximization-based clustering is an approach that learns discrete cluster assignments by maximizing the mutual information (MI) between the input data and the assigned clusters. This method ensures that clusters are both meaningful and well-separated by promoting high-confidence predictions while maintaining balanced cluster sizes. Specifically, mutual information is maximized by increasing the difference between the marginal entropy (which encourages well-balanced clusters) and the conditional entropy (which ensures each sample is confidently assigned to a single cluster) []. Self-augmented training can further enhance cluster stability by ensuring that cluster assignments remain consistent under data transformations such as rotation, translation, and scaling. This approach is particularly powerful in unsupervised learning scenarios, especially for representation learning and clustering in deep neural networks.
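The core of this objective is the difference between the marginal entropy and the conditional entropy of the cluster assignments. The sketch below shows only this mutual-information term; the self-augmented consistency loss that enforces invariance to data transformations is omitted for brevity.

```python
import torch.nn.functional as F

def mutual_information_loss(logits, eps=1e-8):
    """Negative mutual information between inputs and predicted cluster assignments.

    I(X; Y) = H(Y) - H(Y|X): the marginal entropy H(Y) rewards balanced clusters,
    while the conditional entropy H(Y|X) rewards confident per-sample assignments.
    Minimizing the returned value therefore maximizes mutual information.
    """
    p = F.softmax(logits, dim=1)                # per-sample cluster probabilities
    p_marginal = p.mean(dim=0)                  # empirical marginal over the batch
    h_marginal = -(p_marginal * (p_marginal + eps).log()).sum()
    h_conditional = -(p * (p + eps).log()).sum(dim=1).mean()
    return -(h_marginal - h_conditional)
```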

3.6. cLDM Hyperparameters

We have listed the hyperparameters selected for the cLDM model in Table 2.
Table 2. Hyperparameters for cLDM.
Here, finetuning-1 and finetuning-2 refer to two additional training phases of our conditional latent diffusion model following the initial training. The reasons for these additional training steps will be explained in a later section.

4. Results

In this section, we present the various results obtained from our research.

4.1. VQ-GAN Reconstructions

In Figure 1, we present some original images from our test set, which were not used during the training of the VQ-GAN model. Their corresponding reconstructions closely resemble the originals, indicating that the VQ-GAN achieved high-quality reconstructions after training.
Figure 1. Reconstructed samples generated by the VQ-GAN model: (a) randomly selected samples from the test set; and (b) corresponding reconstructions by VQ-GAN.

4.2. Cluster Validation

We trained our model using unlabeled histopathology images. To differentiate between the images, we experimented with cluster sets containing 10, 11, 12, 13, 14, 15, and 16 clusters. We then applied several internal cluster validation metrics to identify the optimal number of clusters. Internal cluster validation is necessary when ground truth information is unavailable, as it evaluates the similarity among objects within each cluster [].
We used five different internal cluster validation metrics: the Calinski–Harabasz index, C index, Dunn index, Hartigan index, and McClain–Rao index. The Calinski–Harabasz (CH) index assesses cluster quality by measuring the ratio of between-cluster dispersion to within-cluster dispersion []. The CH index is defined as follows:
  $CH = \frac{\mathrm{trace}(B_k)}{\mathrm{trace}(W_k)} \times \frac{n - k}{k - 1}$ (5)
Here, $B_k$ is the between-cluster scatter matrix, which measures how well-separated the clusters are, while $W_k$ is the within-cluster scatter matrix, which measures the compactness of each cluster, $n$ is the total number of samples, and $k$ is the number of clusters. A higher CH index indicates better clustering, meaning clusters are well-separated and compact, whereas a lower CH index suggests poor clustering, with overlapping or dispersed clusters. The C index is another metric for evaluating cluster quality, focusing on both compactness and separation. It compares the sum of intra-cluster distances to the best and worst possible clustering scenarios []. The C index is defined as follows:
  $C = \frac{S - S_{min}}{S_{max} - S_{min}}$ (6)
Here, $S$ is the sum of the intra-cluster distances, that is, the sum of the distances between all points within the same cluster, $S_{min}$ is the smallest possible sum of intra-cluster distances (the ideal case), and $S_{max}$ is the largest possible sum (the worst case). Lower C index values indicate better clustering, meaning intra-cluster distances are small and clusters are compact. Conversely, higher C index values suggest poor clustering, with data points within a cluster being far apart. The Dunn index evaluates the quality of clustering by calculating the ratio of the smallest inter-cluster distance to the largest intra-cluster distance []. It is defined as follows:
  $D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_k d(C_k)}$ (7)
Here, $d(C_i, C_j)$ is the inter-cluster distance between clusters $C_i$ and $C_j$ (typically the minimum distance between points in different clusters), and $d(C_k)$ is the intra-cluster distance for cluster $C_k$ (usually the maximum distance between points within the same cluster). A higher Dunn index indicates better clustering, meaning clusters are well-separated and compact, while a lower Dunn index suggests poor clustering, with overlapping or poorly-formed clusters. The Hartigan index is calculated using the logarithmic ratio of the between-cluster sum of squares (SSB) to the within-cluster sum of squares (SSW) []. It is given by Equation (8):
  $H = \log\!\left(\frac{SSB}{SSW}\right)$ (8)
A higher Hartigan index suggests well-separated clusters, while a lower index indicates less distinct clustering. Finally, the McClain–Rao (MR) index is an internal cluster validation metric used to assess the quality of a clustering configuration. It evaluates the ratio of average intra-cluster distances to average inter-cluster distances, providing insight into both the compactness and separation of clusters. The MR index is defined as follows:
  $MR = \frac{\text{Average intra-cluster distance}}{\text{Average inter-cluster distance}}$ (9)
Lower MR index values indicate better clustering quality, as they suggest that data points within clusters are closer together (greater compactness) and clusters are well-separated from one another. Higher MR index values suggest poorer clustering quality, indicating that clusters may be overlapping or not well-defined.
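As a rough illustration of how these indices can be computed in practice, the sketch below evaluates the Calinski–Harabasz index with scikit-learn and implements the McClain–Rao index directly from pairwise distances. The variable names `latents` and `labels_k` are placeholders for the latent features and a candidate cluster assignment; the remaining indices would follow analogous pairwise-distance computations.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import calinski_harabasz_score

def mcclain_rao(features, labels):
    """McClain-Rao index (Eq. 9): mean intra-cluster pairwise distance divided by
    mean inter-cluster pairwise distance; lower values indicate better clustering."""
    labels = np.asarray(labels)
    d = squareform(pdist(features))             # full pairwise distance matrix
    same = labels[:, None] == labels[None, :]   # True where a pair shares a cluster
    iu = np.triu_indices_from(d, k=1)           # count each unordered pair once
    intra = d[iu][same[iu]]
    inter = d[iu][~same[iu]]
    return intra.mean() / inter.mean()

# Hypothetical usage, where `latents` holds flattened latent features and
# `labels_k` the assignments for a candidate k-cluster configuration:
# ch = calinski_harabasz_score(latents, labels_k)  # Eq. (5): higher is better
# mr = mcclain_rao(latents, labels_k)              # Eq. (9): lower is better
```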
From Table 3, we observe that the 14-cluster set achieves the highest CH index, Dunn index, and Hartigan index, as well as the lowest C index and MR index. Therefore, 14 is the optimal number of clusters for our dataset.
Table 3. Internal validation metrics for different cluster sets.

4.3. Clustering Results

The clustering results for the 14-cluster set are presented in Figure 2.
Figure 2. Clustering results for the 14-cluster set.
The results of the optimal cluster set are shown, and the clustering output from this set serves as the conditioning mechanism for generating samples and for comparison with the source images during evaluation. We consulted a clinical expert, who assigned a descriptive label to each cluster; further details are provided in Section 5.
In this study, we used information maximization-based clustering due to its effectiveness in generating balanced and well-separated clusters in an unsupervised manner. We chose this approach because it had previously performed well with this dataset. Although clustering inherently involves dimensionality reduction and abstraction, we minimized the risk of losing critical information by optimizing the number of clusters using multiple internal validation indices, ensuring that the clusters are both meaningful and representative of the data’s diversity. Additionally, we incorporated expert feedback to interpret the clusters, providing an extra safeguard against the loss of important histopathological features.

4.4. Generated Image Evaluation

We generated 10,000 images after training our cLDM for each cluster set. Using the latent features of the images in the training set, we applied the clustering and then used the clustering results as conditioning inputs for generating new samples with each model. To evaluate the quality of the generated images for each cluster set, we employed three different metrics: SSIM (structural similarity index measure), MS-SSIM (multi-scale structural similarity index measure), and LPIPS (learned perceptual image patch similarity). LPIPS can serve both as a trainable loss component and as a metric for image quality assessment; previously, we used LPIPS as a model component in our VQ-GAN architecture.
SSIM evaluates image quality based on structural information rather than simple pixel-wise differences, such as those measured by MSE []. It takes into account luminance, contrast, and structural similarities between images, making it useful for assessing how well a generated image preserves the structure of the reference image. SSIM is particularly useful for detecting distortions. MS-SSIM is an extension of SSIM that evaluates images at multiple scales, enhancing its robustness to variations in viewing conditions []. By capturing details at different resolutions, MS-SSIM provides a more comprehensive evaluation of high-resolution images and complex textures. LPIPS is a deep learning-based metric that measures perceptual similarity by comparing feature representations extracted from deep neural networks. Unlike SSIM and MS-SSIM, which rely on traditional structural comparisons, LPIPS uses deep features to assess visual similarity in a manner that aligns more closely with human perception. This makes it especially valuable for evaluating images generated by modern deep learning models, such as GANs and diffusion models, where pixel-wise comparisons alone may not be sufficient.
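For reference, the snippet below shows one way these three metrics could be computed with the torchmetrics library (assuming a recent release with the image extras installed). The dummy tensors stand in for batches of reference and generated 128 × 128 images scaled to [0, 1], and the reduced set of MS-SSIM scales is an assumption so that 128-pixel inputs satisfy the metric's minimum-size requirement.

```python
import torch
from torchmetrics.image import (StructuralSimilarityIndexMeasure,
                                MultiScaleStructuralSimilarityIndexMeasure)
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Placeholder batches of reference and generated images, (N, 3, 128, 128) in [0, 1]
real = torch.rand(8, 3, 128, 128)
fake = torch.rand(8, 3, 128, 128)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
# Three scales instead of the default five, so 128x128 inputs are large enough
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0, betas=(0.2, 0.3, 0.5))
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

print("SSIM   :", ssim(fake, real).item())      # higher is better
print("MS-SSIM:", ms_ssim(fake, real).item())   # higher is better
print("LPIPS  :", lpips(fake, real).item())     # lower is better
```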
We have listed the SSIM, MS-SSIM, and LPIPS values for each cluster set in Table 4.
Table 4. Generated image evaluation metrics for different cluster sets.
Higher SSIM and MS-SSIM values (closer to 1) indicate that the generated image is structurally more similar to the reference image, reflecting better image quality. Conversely, a lower LPIPS score means the generated image is perceptually more similar to the real image. For the 14-cluster set, we achieved both the highest SSIM and MS-SSIM values, as well as the lowest LPIPS score. These results confirm that the cluster configuration with the highest quality partitioning also yields the best conditional image generation for our unlabeled dataset. Moreover, by using SSIM, MS-SSIM, and LPIPS, we ensure that both structural fidelity and perceptual realism are accounted for in our evaluations.
The authors of [] evaluated various deep learning-based compression methods and reported that their model achieved higher MS-SSIM values and lower LPIPS scores compared to baseline methods, indicating better preservation of structural details and perceptual similarity across all compression rates. In [], the researchers proposed StainGAN—an automated, unpaired, end-to-end stain style transfer solution inspired by CycleGAN—that preserves tissue structure without requiring a reference template. StainGAN achieved SSIM scores 10% higher than those of traditional stain normalization techniques, demonstrating a notable improvement in structural similarity to target images. Liang et al. introduced SSIM-GAN, which incorporated SSIM into its reconstruction loss to better maintain structural content []. Their results showed that SSIM-GAN ensures greater structural fidelity compared to baseline models. Another study addressed the need for reliable evaluation metrics in medical image synthesis [], showing that SSIM and MS-SSIM effectively capture structural fidelity but are sensitive to preprocessing steps, whereas LPIPS better aligns with perceptual quality, particularly under realistic distortions. While none of these studies utilized clustering, our research employed clustering as a conditioning mechanism for image synthesis. This comparison highlights that our conditional latent diffusion model achieves competitive results in terms of image quality and structural fidelity, particularly given the unlabeled nature of our dataset.

4.5. Conditional Image Sampling Using the 14-Cluster Set

We generated 100 samples for each cluster using cluster IDs 0 to 13, as the 14-cluster set was identified as optimal in our experiment. After generating these samples for each cluster, we also assessed how many were correctly predicted by our model. Representative samples that closely resemble each cluster are shown in Figure 3.
Figure 3. Conditionally generated samples for each cluster in the 14-cluster set.
Figure 3 displays a selection of generated samples produced using the cluster IDs. We observe that these samples exhibit similar tissue patterns to the corresponding clusters shown in Figure 2. For most clusters, the model predicted the cluster IDs with high accuracy (over 70%). In several clusters, the accuracy exceeded 60%, while only one cluster showed a lower accuracy of 54%.

5. Discussion

As previously mentioned, we finetuned our cLDM model twice following the initial training. In this section, we explain the rationale behind these two additional finetuning steps.
Figure 4 illustrates samples generated by conditioning on the latent features of selected test images using the 14-cluster set, following the initial training, finetuning-1, and finetuning-2. The original test samples are shown in Figure 4a. As seen in Figure 4b, the quality of the generated images after the first training is quite poor. During this initial phase, our model was trained for 600 epochs; however, the MSE loss (the difference between the actual and predicted noise) remained too high, resulting in low-quality outputs. To address this issue, we froze the parameters of the neural network responsible for clustering and proceeded to finetune our model, focusing specifically on reducing the MSE loss.
Figure 4. Comparison of conditionally generated samples: (a) random samples from the test set; (b) samples generated after initial training; (c) samples generated after finetuning-1; and (d) samples generated after finetuning-2.
We finetuned our model (finetuning-1) for 130 epochs to improve the quality of the generated samples by further reducing the MSE loss. The results after finetuning-1 are shown in Figure 4c, where the images are noticeably better than those produced after the initial training. However, the color distribution in some samples is still suboptimal and visibly inconsistent. This issue arose due to a high transformation loss from the information maximization-based clustering. To address this, we conducted an additional finetuning (finetuning-2) for 70 epochs, this time focusing on reducing the transformation loss and enhancing the color quality of the generated images. During finetuning-2, we did not freeze any model parameters. The results after this second finetuning are presented in Figure 4d.
Although our dataset lacks annotations, we consulted with a specialist to help interpret the images within each cluster shown in Figure 2. The specialist provided descriptions for each cluster, which are presented in Table 5.
Table 5. Descriptions for clusters in the 14-cluster set.
These descriptive labels, based on the expert pathologist’s review, add further clarity and interpretability to the synthetic image generation process.
While our study focused on histopathology images, the proposed conditional latent diffusion model is fundamentally adaptable to other medical imaging modalities, such as radiology and dermatology. The core components, including latent space compression utilizing VQ-GAN, unsupervised clustering, and diffusion-based generation, are not specific to histopathology and can be retrained on different image types to capture modality-specific features. However, we acknowledge that each modality presents unique challenges, including variations in resolution, texture, anatomical diversity, and potential imaging artifacts, which may require adjustments to the autoencoder architecture, clustering strategies, or preprocessing steps. We anticipate that, with careful retraining, domain-specific cluster validation, and expert input for interpretation, our method could be effectively extended to other medical imaging modalities.
Finally, we highlight the main difference between our research and several recent techniques that utilize LDMs. Unconditional LDMs have demonstrated impressive image synthesis capabilities; however, they lack any conditioning mechanism, which limits their usefulness for targeted image generation tasks. Class-conditional LDMs use labels for conditioning, enabling more specific output generation, but this approach depends on annotated datasets that are often unavailable in medical imaging. Other conditioning techniques in LDMs, such as those for super-resolution, inpainting, or mask-based synthesis, have been successfully applied to enhance image details or enable region-specific synthesis using explicit inputs or guidance (e.g., text, segmentation masks, or partial images). By contrast, our method introduces clustering-based conditioning within the latent space, enabling effective and interpretable image synthesis even in the absence of annotations. This approach distinguishes our work by allowing for the generation of diverse and clinically meaningful synthetic images from unlabeled data, while still leveraging the strengths of LDMs in terms of image quality and flexibility.

6. Conclusions

In this work, we conditionally generated unlabeled histopathology images using our cLDM model. We began by selecting a VQ-GAN architecture as the autoencoder for our latent diffusion model, which enabled us to reconstruct very high-quality samples. Next, we performed a forward diffusion process on the latent space extracted from the pre-trained VQ-GAN, using a linear scheduler. We then applied a deep learning-based clustering technique to the latent space, and the resulting clusters were used to guide the U-Net in generating targeted outputs. Since our dataset lacked annotations, this conditioning procedure allowed us to effectively direct the generation process. During the reverse process, the U-Net learned to reconstruct a clean latent space from a noisy one, and the resulting clean latent image was passed through the pre-trained VQ-VAE decoder to generate a new image.
We experimented with different numbers of clusters and determined the optimal number using five internal cluster validation methods. All metrics consistently indicated that the 14-cluster set was optimal. Subsequently, we generated a large number of synthetic images, equal to the number of training images, within the context of each cluster set, using the latent features of the training images as conditioning inputs. The quality of these artificially generated images was evaluated using three different metrics, all of which confirmed that the images generated with the 14-cluster set were most similar to the training dataset. By incorporating expert pathologist insights into the cluster assignments, we were able to link data-driven groupings with meaningful histopathological features, thereby enhancing both the interpretability and the relevance of our synthetic image generation process.
Possible future research directions include extending our approach by incorporating semi-supervised and weakly supervised strategies, which would allow for the integration of limited annotated data alongside unlabeled samples to achieve even more refined and clinically relevant image synthesis. Additionally, our methodology can be adapted to analyze other types of medical images. Other deep learning models, beyond LDMs, could also be explored for this purpose. Finally, future work should focus on further clinical validation through collaboration with experts, in order to assess the interpretability and utility of synthetic image generation in real-world diagnostic settings.
In our study, all synthetic images were generated solely for research purposes using unlabeled data and were not intended for direct clinical decision-making. We acknowledge that deploying synthetic images in clinical settings requires strict validation to ensure their reliability and safety. Therefore, we recommend that synthetic images undergo thorough expert review and quantitative validation before any clinical application. Additionally, transparency in data provenance and clear labeling of synthetic content are essential to prevent misuse. We emphasize that our methodology should remain a research tool until comprehensive validation protocols for clinical use are established.
We have provided the architecture of our cLDM framework, the loss function used to train it, and some additional information regarding the linear scheduler in the Supplementary Materials.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering12070764/s1. Figure S1: The architecture of our proposed conditional latent diffusion model (cLDM); Equation (S1): cLDM loss function; and Figure S2: β t and α ¯ t plots for linear variance scheduler.

Author Contributions

Conceptualization, M.I.R. and N.O.; methodology, M.I.R. and N.O.; programming, M.I.R.; validation, M.I.R.; formal analysis, M.I.R.; investigation, K.O.; resources, K.O.; data curation, N.O.; writing—original draft preparation, M.I.R.; writing—review and editing, M.I.R.; visualization, M.I.R.; supervision, A.K.N., M.A., M.A.-U.-A. and S.K.; project administration, A.K.N., M.A. and M.A.-U.-A.; funding acquisition, N.O. and K.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Scientific Research (C): grant number 21K12111.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Upon acceptance of this article, the dataset will be made available from public domain resources. In addition, the code is publicly available on GitHub at https://github.com/randomaccess2023/KPC_LDM_128, accessed on 13 July 2025.

Acknowledgments

The authors are grateful to the JSPS KAKENHI for their support in funding this work under the grant number 21K12111. The authors would like to thank the reviewers for their feedback that helped improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
VQ-GANVector Quantized Generative Adversarial Network
GAIGenerative Artificial Intelligence
GANGenerative Adversarial Network
VAEVariational Autoencoder
LDMLatent Diffusion Model
DDPMDenoising Diffusion Probabilistic Model
cLDMConditional Latent Diffusion Model
HEHematoxylin and Eosin
VQVector Quantization
VQ-VAEVector Quantized Variational Autoencoder
LPIPSLearned Perceptual Image Patch Similarity
MSEMean Squared Error
MIMutual Information
CH indexCalinski–Harabasz index
MR indexMcClain–Rao index
SSIMStructural Similarity Index Measure
MS-SSIMMulti-Scale Structural Similarity Index Measure

References

  1. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  2. Saxena, D.; Cao, J. Generative adversarial networks (GANs): Challenges, solutions, and future directions. ACM Comput. Surv. (CSUR) 2021, 54, 1–42. [Google Scholar] [CrossRef]
  3. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022. [Google Scholar]
  4. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021. [Google Scholar]
  5. Huang, Z.; Wei, Y.; Chen, Y. Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality. arXiv 2024, arXiv:2410.18784. [Google Scholar] [CrossRef]
  6. Moldovanu, S.; Munteanu, D.; Sîrbu, C. Impact on Classification Process Generated by Corrupted Features. Big Data Cogn. Comput. 2025, 9, 45. [Google Scholar] [CrossRef]
  7. Munteanu, D.; Moldovanu, S.; Miron, M. The Explanation and Sensitivity of AI Algorithms Supplied with Synthetic Medical Data. Electronics 2025, 14, 1270. [Google Scholar] [CrossRef]
  8. Letafati, M.; Ali, S.; Latva-Aho, M. Denoising diffusion probabilistic models for hardware-impaired communications. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024. [Google Scholar]
  9. Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. Improving latent diffusion models for high-resolution image synthesis. arXiv 2023, arXiv:2307.01952. [Google Scholar] [CrossRef]
  10. Singh, J.; Gould, S.; Zheng, L. High-fidelity guided image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  11. Liu, M.F.; Lyu, S.; Vinaroz, M.; Park, M. Differentially private latent diffusion models. arXiv 2023, arXiv:2305.15759. [Google Scholar]
  12. Zhang, J.; Xu, Z.; Cui, S.; Meng, C.; Wu, W.; Lyu, M.R. On the robustness of latent diffusion models. arXiv 2023, arXiv:2306.08257. [Google Scholar] [CrossRef]
  13. Guo, J.; Xu, X.; Pu, Y.; Ni, Z.; Wang, C.; Vasu, M.; Song, S.; Huang, G.; Shi, H. Smooth diffusion: Crafting smooth latent spaces in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  14. Di Federico, G.; Durlofsky, L.J. Latent diffusion models for parameterization and data assimilation of facies-based geomodels. arXiv 2024, arXiv:2406.14815. [Google Scholar] [CrossRef]
  15. Jiang, L.; Mao, Y.; Wang, X.; Chen, X.; Li, C. Cola-diff: Conditional latent diffusion model for multi-model mri synthesis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023. [Google Scholar]
  16. Yuan, Y.; Xiao, J.; Li, X. Are conditional latent diffusion models effective for image restoration? arXiv 2024, arXiv:2412.09324. [Google Scholar] [CrossRef]
  17. Du, P.; Parikh, M.H.; Fan, X.; Liu, X.Y.; Wang, J.X. Conditional neural field latent diffusion model for generating spatiotemporal turbulence. Nat. Commun. 2024, 15, 10416. [Google Scholar] [CrossRef]
  18. Gu, J.; Shen, Y.; Zhai, S.; Zhang, Y.; Jaitly, N.; Susskind, J.M. Kaleido diffusion: Improving conditional diffusion models with autoregressive latent modeling. arXiv 2024, arXiv:2405.21048. [Google Scholar] [CrossRef]
  19. Kaufman, B.; Williams, E.C.; Pederson, R.; Underkoffler, C.; Panjwani, Z.; Wang-Henderson, M.; Mardirossian, N.; Katcher, M.H.; Strater, Z.; Grandjean, J.M.; et al. Latent diffusion for conditional generation of molecules. bioRxiv 2024. [Google Scholar] [CrossRef]
  20. Zhuang, P.; Abnar, S.; Gu, J.; Schwing, A.; Susskind, J.M.; Bautista, M.A. Diffusion probabilistic fields. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  21. Khader, F.; Müller-Franzes, G.; Tayebi Arasteh, S.; Han, T.; Haarburger, C.; Schulze-Hagen, M.; Schad, P.; Engelhardt, S.; Baeßler, B.; Foersch, S.; et al. Denoising diffusion probabilistic models for 3d medical image generation. Sci. Rep. 2023, 13, 7303. [Google Scholar] [CrossRef] [PubMed]
  22. Dar, S.U.H.; Ghanaat, A.; Kahmaan, J.; Ayx, I.; Papavassiliu, T.; Schoenberg, S.O.; Engelhardt, S. Investigating data memorization in 3d latent diffusion models for medical image synthesis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023. [Google Scholar]
  23. Dar, S.U.H.; Seyfarth, M.; Ayx, I.; Papavassiliu, T.; Schoenberg, S.O.; Siepmann, R.M.; Laqua, F.C.; Kahmann, J.; Frey, N.; Baeßler, B.; et al. Unconditional latent diffusion models memorize patient imaging data: Implications for openly sharing synthetic data. arXiv 2024, arXiv:2402.01054. [Google Scholar]
  24. Nafi, A.A.N.; Hossain, M.A.; Rifat, R.H.; Zaman, M.M.U.; Ahsan, M.M.; Raman, S. Diffusion-based approaches in medical image generation and analysis. arXiv 2024, arXiv:2412.16860. [Google Scholar] [CrossRef]
  25. Hung, A.L.U.; Zhao, K.; Zheng, H.; Yan, R.; Raman, S.S.; Terzopoulos, D.; Sung, K. Med-cDiff: Conditional medical image generation with diffusion models. Bioengineering 2023, 10, 1258. [Google Scholar] [CrossRef]
  26. Siddiqui, A.A.; Tirunagari, S.; Zia, T.; Windridge, D. A latent diffusion approach to visual attribution in medical imaging. Sci. Rep. 2025, 15, 962. [Google Scholar] [CrossRef]
  27. Song, W.; Jiang, Y.; Fang, Y.; Cao, X.; Wu, P.; Xing, H.; Wu, X. Medical image generation based on latent diffusion models. In Proceedings of the International Conference on Artificial Intelligence Innovation (ICAII), Wuhan, China, 22–24 September 2023. [Google Scholar]
  28. Pinaya, W.H.L.; Tudosiu, P.D.; Dafflon, J.; Da Costa, P.F.; Fernandez, V.; Nachev, P.; Ourselin, S.; Cardoso, M.J. Brain imaging generation with latent diffusion models. In Proceedings of the MICCAI Workshop on Deep Generative Models, Singapore, 22 September 2022. [Google Scholar]
  29. Müller-Franzes, G.; Niehues, J.M.; Khader, F.; Arasteh, S.T.; Haarburger, C.; Kuhl, C.; Wang, T.; Han, T.; Nolte, T.; Nebelung, S.; et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 2023, 13, 12098. [Google Scholar] [CrossRef]
  30. Han, Y.; He, J.; Gupta, M.; Delp, E.J.; Zhu, F. Diffusion model with clustering-based conditioning for food image generation. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management, Ottawa, ON, Canada, 29 October 2023. [Google Scholar]
  31. Moghadam, P.A.; Van Dalen, S.; Martin, K.C.; Lennerz, J.; Yip, S.; Farahani, H.; Bashashati, A. A morphology focused diffusion probabilistic model for synthesis of histopathology images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023. [Google Scholar]
  32. Harb, R.; Pock, T.; Müller, H. Diffusion-based generation of histopathological whole slide images at a gigapixel scale. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar]
  33. Yellapragada, S.; Graikos, A.; Prasanna, P.; Kurc, T.; Saltz, J.; Samaras, D. PathLDM: Text conditioned latent diffusion model for histopathology. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar]
  34. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  35. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  36. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI): 18th International Conference, Munich, Germany, 5–9 October 2015. [Google Scholar]
  38. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  39. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning (ICML), Online, 13–18 July 2020. [Google Scholar]
  40. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  42. Tajbakhsh, N.; Jeyaseelan, L.; Li, Q.; Chiang, J.N.; Wu, Z.; Ding, X. Embracing Imperfect Datasets: A Review of Deep Learning Solutions for Medical Image Segmentation. Med. Image Anal. 2020, 63, 101693. [Google Scholar] [CrossRef]
  43. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  44. Rumman, M.I.; Ono, N.; Ohuchida, K.; Arasteh Altaf-Ul-Amin, M.D.; Huang, M.; Kanaya, S. Information maximization-based clustering of histopathology images using deep learning. PLoS Digit. Health 2023, 2, e0000391. [Google Scholar] [CrossRef] [PubMed]
  45. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  46. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  47. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  48. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  49. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://dl.acm.org/doi/10.5555/2969033.2969125 (accessed on 13 July 2025).
  50. Li, C.; Wand, M. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  51. Hu, W.; Miyato, T.; Tokui, S.; Matsumoto, E.; Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  52. Liu, Y.; Li, Z.; Xiong, H.; Gao, X.; Wu, J. Understanding of internal clustering validation measures. In Proceedings of the IEEE International Conference on Data Mining, Sydney, Australia, 13–17 December 2010. [Google Scholar]
  53. Saitta, S.; Raphael, B.; Smith, I.F.C. A comprehensive validity index for clustering. Intell. Data Anal. 2008, 12, 529–548. [Google Scholar] [CrossRef]
  54. Ansari, Z.; Azeem, M.F.; Ahmed, W.; Babu, A.V. Quantitative evaluation of performance and validity indices for clustering the web navigational sessions. arXiv 2015, arXiv:1507.03340. [Google Scholar] [CrossRef]
  55. Palacio, N.; Julio, O.; Berzal, F. Evaluation metrics for unsupervised learning algorithms. arXiv 2019, arXiv:1905.05667. [Google Scholar] [CrossRef]
  56. Hartigan, J.A. Clustering Algorithms, 1st ed.; John Wiley & Sons, Inc.: New York, NY, USA, 1975; pp. 97–98. [Google Scholar]
  57. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  58. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems, & Computers, Pacific Grove, CA, USA, 9–12 December 2003. [Google Scholar]
  59. Fischer, M.; Neher, P.; Schüffler, P.; Xiao, S.; Almeida, S.D.; Ulrich, C.; Muckenhuber, A.; Braren, R.; Götz, M.; Kleesiek, J.; et al. Enhanced Diagnostic Fidelity in Pathology Whole Slide Image Compression via Deep Learning. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Vancouver, BC, Canada, 8 October 2023. [Google Scholar]
  60. Shaban, M.T.; Baur, C.; Navab, N.; Albarqouni, S. StainGAN: Stain Style Transfer for Digital Histological Images. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Venice, Italy, 8–11 April 2019. [Google Scholar]
  61. Liang, H.; Plataniotis, K.N.; Li, X. Stain Style Transfer of Histopathology Images Via Structure-Preserved Generative Learning. In Proceedings of the Machine Learning for Medical Image Reconstruction, Lima, Peru, 8 October 2020. [Google Scholar]
  62. Dohmen, M.; Klemens, M.A.; Baltruschat, I.M.; Truong, T.; Lenga, M. Similarity and Quality Metrics for MR Image-to-Image Translation. arXiv 2024, arXiv:2405.08431. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
