Article

Enhanced Facial Realism in Personalized Diffusion Models: A Memory-Optimized DreamBooth Implementation for Consumer Hardware

1 Department of Electronics and Communication Engineering, Poornima College of Engineering, Jaipur 302022, Rajasthan, India
2 Amity Cognitive Computing and Brain Informatics Center, Amity University Rajasthan, Jaipur 303002, India
3 Faculty of Artificial Intelligence and Digital Technologies, Samarkand State University, Samarkand 140104, Uzbekistan
4 Institute of Information Technology, Jahangirnagar University, Savar, Dhaka 1342, Bangladesh
5 Department of System Management and Information Security, Samarkand State University, Samarkand 140104, Uzbekistan
6 Faubert Lab, School of Optometry, University of Montreal, Montreal, QC H3T 1P1, Canada
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(4), 257; https://doi.org/10.3390/a19040257
Submission received: 21 February 2026 / Revised: 11 March 2026 / Accepted: 16 March 2026 / Published: 27 March 2026
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

Despite significant progress in general-purpose diffusion-based models capable of producing high-quality media, personalized fine-tuning remains difficult to deploy on consumer-grade hardware. We present a memory-optimized DreamBooth framework designed for consumer-grade GPUs with 16 GB of VRAM that enables end-to-end image personalization and addresses several limitations of existing solutions. Our system reduces peak GPU memory from 22 GB (baseline DreamBooth) to 14.2 GB through novel hierarchical memory management, including attention slicing, Variational Autoencoder (VAE) tiling, gradient accumulation, and gradient checkpointing, integrated within the Hugging Face Accelerate ecosystem. The framework further incorporates state-of-the-art techniques for preserving facial features and a comprehensive automated quality management system. The result is a complete end-to-end pipeline whose quantitative performance (LPIPS: 0.139, SSIM: 0.879, identity: 0.852, and FID: 23.1) is competitive with methods requiring significantly more hardware resources.

1. Introduction

The democratization of Artificial Intelligence (AI)-generated content represents a significant step forward in computational innovation; however, it is partially limited by the stringent hardware requirements for deploying large-scale diffusion models. Recent advances in personalized image generation have demonstrated remarkable capabilities in identity preservation and customization [1,2,3,4,5]. Specifically, DreamBooth [1] introduced subject-driven fine-tuning of text-to-image diffusion models, and Textual Inversion [2] personalizes generation via learned text embeddings. Custom Diffusion [3] supports multi-concept customization; IP-Adapter [4] enables image-prompt-based personalization, and MasaCtrl [5] provides tuning-free consistent synthesis through mutual self-attention control. Despite these technological breakthroughs, the computational requirements remain a significant barrier to widespread adoption.
Access to the latest state-of-the-art diffusion models for personalized development is largely limited to well-funded research labs and companies that can afford expensive GPU configurations with 24 GB or more of VRAM [6,7]. This hardware divide separates well-resourced institutions from end-user AI applications (consumer software, classrooms) and small-scale creative workflows [8,9].
Memory optimization in diffusion models has emerged as a crucial research direction, with recent work exploring gradient checkpointing [10], attention slicing [11], and Variational Autoencoder (VAE) tiling techniques [12]. Controlling generation quality while maintaining identity fidelity under consumer hardware limitations, however, remains an open problem [13,14].
The principal contributions of this work are as follows:
  • Memory-Optimized Training Framework: A novel system integration that orchestrates hierarchical memory management—combining gradient accumulation, attention slicing, VAE tiling, and adaptive gradient checkpointing—to reduce peak GPU memory usage from 22 GB to 14.2 GB while maintaining training convergence. The key novelty lies in the adaptive, co-optimized scheduling of these techniques within the Hugging Face Accelerate ecosystem, which has not been systematically demonstrated for consumer-grade personalized diffusion before.
  • Enhanced Identity Preservation: An advanced facial feature extraction and preservation mechanism that ensures consistent subject identity across diverse generation contexts through multi-scale facial encoding and constraint-based fine-tuning.
  • Automated Quality Assessment System: A comprehensive multi-dimensional evaluation framework incorporating Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity Index Measure (SSIM), identity verification (cosine similarity), and photorealistic quality metrics for objective performance validation.
  • Consumer Hardware Deployment: A complete end-to-end pipeline optimized for 16 GB consumer GPUs, demonstrating the practical feasibility of high-quality personalized generation on accessible hardware configurations.
  • Ethical Framework: An explicit discussion of potential misuse scenarios, deepfake risks, and privacy implications, alongside proposed mitigation measures and an ethics statement for responsible deployment.
The improvements made to personalized diffusion frameworks address critical gaps in memory efficiency, identity preservation, and generation on consumer-grade hardware, while maintaining generation quality. While each individual technique (gradient checkpointing, attention slicing, and VAE tiling) is established, our primary contribution is their systematic co-optimization into an integrated pipeline for high performance on commodity hardware, supported by thorough ablation studies.
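As a concrete sketch, the optimizations listed in the contributions correspond to a handful of switches in the Hugging Face Diffusers API. This is an illustrative configuration fragment, not the authors' exact training code; the model identifier is an assumed example checkpoint.

```python
# Illustrative configuration sketch (not the authors' exact code):
# enabling the memory optimizations discussed above in Hugging Face Diffusers.
import torch
from diffusers import StableDiffusionPipeline

# The checkpoint name is an assumed example.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

pipe.enable_attention_slicing()            # compute attention in smaller chunks
pipe.vae.enable_tiling()                   # encode/decode high-res images tile by tile
pipe.unet.enable_gradient_checkpointing()  # recompute activations on the backward pass
```

Gradient accumulation, the fourth technique, is handled at the training-loop level (e.g., via Accelerate's `gradient_accumulation_steps`) rather than as a pipeline switch.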
The full DreamBooth implementation is described in Figure 1. The figure illustrates (a) the proposed memory-optimized training framework with Accelerate integration and hierarchical memory management, (b) the advanced facial processing and enhancement pipeline with multi-scale feature extraction, (c) the multidimensional quality assessment system with automated evaluation metrics, and (d) generated results demonstrating realism enhancement, identity preservation, and photorealistic quality improvement across multiple generation scenarios.
Our memory, attention, and control optimizations are shown in Figure 2. The figure shows (a) hierarchical memory allocation during training at the Cross-Attention (CA) and Attention-to-Attention (ATA) stages at step 20, with memory dynamically allocated to different parts of the model, (b) the reduction in memory footprint achieved by combining attention slicing and VAE tiling, (c) the roles of gradient accumulation and checkpointing in enabling training within consumer GPU limits, and (d) a performance comparison between the baseline and optimized implementations across various GPU memory sizes, showing the effectiveness of our approach.
The deployment framework comprises four connected subsystems. The memory-optimized training scheme uses attention slicing, VAE tiling, and adaptive gradient accumulation within the Hugging Face Accelerate framework. Facial enhancement and identity preservation techniques guarantee consistent subject identity across diverse generations. A robust quality assessment tool enables parameter pre-scheduling for output adjustment, with perceptual, geometric, and photometric similarity metrics for automated quality evaluation. Lastly, the complete deployment pipeline supports a broad range of consumer hardware.
By supporting a variety of tasks, including personalized content creation, digital avatar design, and creative professional workflows, our framework lowers the computational barriers to widespread adoption of personalization technologies, making advanced AI-driven creative tools accessible to a much broader base of consumers, companies, and creative professionals.

2. Related Work

2.1. Personalized Diffusion Models and Computational Optimization

Personalized diffusion approaches have dramatically reshaped image generation by leveraging large-scale pre-trained models for subject-specific fine-tuning. DreamBooth [1] introduced the seminal technique of fine-tuning text-to-image diffusion models with as few as 3–5 reference images, achieving identity-preserving generation with an LPIPS score of 0.142 and identity cosine similarity of 0.847 on a 24 GB GPU. PortraitBooth extended this paradigm specifically for portrait photography, demonstrating strong performance in controlled settings but with limited generalizability to complex backgrounds. Identity-preserving diffusion schemes have since progressed, exploring fundamental trade-offs of personalized generation technologies [2].
Identity-preserving generation has recently made great strides. MasterWeaver [9] models the fundamental trade-off between editability and identity preservation using various attention mechanisms, achieving an identity cosine similarity of 0.856 but requiring 20 GB of GPU memory. Pair-ID [10] presents a cross-modal framework for learning identity signatures with greater fidelity than previous works across several generation scenarios. HP3 [11] provides a tuning-free strategy for head-preserving portrait personalization with 3D-controlled diffusion models; it achieves competitive memory efficiency at 16 GB but incurs higher inference latency (8.2 s) due to 3D processing overhead.
Subject-Diffusion [12] presents a novel open-domain personalized generation formulation that eliminates the requirement for test-time fine tuning and greatly reduces inference cost, reporting an LPIPS of 0.145 and FID of 25.7. Zhang et al. [13] address the task of face recognition under degraded low-resolution images using restoration and 3D multiview generation. DifFRelight [14] proposes diffusion-based facial performance relighting that allows photorealistic illumination control without losing identity.

2.2. Memory Optimization and Hardware Efficiency

Memory-efficient techniques for democratizing diffusion-based content generation are gaining popularity. G2Face [15] employs smart memory management and geometric and generative priors to achieve high-fidelity reversible face anonymization. GenPalm [16] explores touchless biometric generation using diffusion models tuned for limited resources, demonstrating that high-quality biometric synthesis can be achieved with minimal computational effort. Alimisis et al. [17] conducted a thorough analysis of diffusion models for image data, with a focus on memory efficiency techniques and their influence on generation quality; their work highlights attention and gradient storage as the primary obstacles to applying diffusion models in practice. Wang et al. [18] propose joint face illumination normalization and face super-resolution with memory-aware inference efficiency using diffusion models. Guerrero-Viu et al. [19] demonstrate a texture editing method in the CLIP latent space that showcases the versatility of memory-efficient diffusion pipelines.

2.3. Facial Feature Preservation and Enhancement

The matter of identity preservation is still unresolved. PSAIP [20] proposes prior structure-assisted networks specifically targeting identity-preserving facial reenactment, achieving competitive results on benchmark datasets. EAT-Face [21] is a talking face generation system that uses diffusion models and audio to regulate facial expressions while maintaining emotional control. Baltsou et al. [22] aim to bridge the demographic divide and ensure fairness in face verification datasets while enabling consistent demographic changes over time. Liao et al. [23] survey methods for producing humans with predetermined appearances and poses, underscoring crucial procedures and obstacles. Melnik et al. [24] conduct a comprehensive empirical study on generating and manipulating facial features with StyleGAN, analyzing the trade-offs among various generative approaches. Xiu et al. [25] propose PuzzleAvatar to generate unique 3D representations from individual image collections while maintaining identity consistency in the multi-view mapping process. Zhu et al. [26] introduce a facial-contrastive editing algorithm that learns disentangled face representations for 3D editing. Sii and Chan [27] propose GORGEOUS for makeup synthesis that preserves facial identity during appearance transfer. Xu et al. [28] combine dual-factor guidance and 3D priors in face reenactment, achieving consistent identity across poses. Chen et al. [29] apply Cross-modal Prototype Contrastive Learning, and Xiong et al. [30] develop a method for synthesizing faces from voice. Asperti et al. [31] study the effects of head rotation on illumination and shadow in denoising diffusion models, and Tai et al. [32] outline an approach using diffusion priors to generate sample defect images for industrial surface recognition. Yan et al. [33] propose FaceGCN, a graph convolutional network that reconstructs faces using structured priors.

2.4. Quality Assessment and Evaluation Frameworks

Evaluating identity fidelity in personalized generation is non-trivial, as standard metrics may not fully capture subject-level consistency. Xue et al. [34] propose identity-preserving evaluation metrics for motion video generation, demonstrating their effectiveness on holistic human motion benchmarks. Dhanyalakshmi et al. [35] review face-swapping and accompanying quality assessment techniques for deepfake applications, providing a taxonomy of metrics and their limitations. AniTalker [36] introduces identity-decoupled facial motion encoding for speech-driven talking face generation, enabling diverse and realistic facial animations. GaussianTalker [37] uses 3D Gaussian splatting for speaker-dependent synthesis and introduces new metrics to evaluate it. Back et al. [38] present a human-interactive photo restoration framework that preserves old memories in vivid detail, further motivating high-fidelity perceptual quality benchmarks. Considering these works as benchmarks for identity preservation and photorealistic quality in personalized generation is essential. In this paper, we extend these works by considering memory efficiency, identity and quality preservation in a unified framework specifically designed for consumer hardware resource constraints, which is unfortunately neglected in existing personalized diffusion methods.

3. Methodology

3.1. Memory-Optimized Training Architecture

Figure 3 shows an example of our identity preservation techniques and constraint systems. The architecture employs (a) multi-scale face feature extraction networks that capture identity-specific characteristics at multiple resolutions, (b) attention-based constraint mechanisms that enforce identity consistency during the generation process, (c) adaptive loss weighting strategies to balance the trade-off between identity preservation and diversity of generations, and (d) validation pipelines involving identity fidelity across various prompts and generation contexts.
Dataset Description: Training and evaluation were conducted on a curated dataset of 512 facial images per subject, sourced from publicly available face datasets, including CelebA-HQ and FFHQ. The dataset covers a demographically diverse set of subjects spanning different ages (18–70 years), ethnicities (Asian, Caucasian, African, and South Asian), and pose variations (−90° to +90° yaw). Each image was preprocessed to 512 × 512 resolution. The dataset was split 80%/10%/10% for training/validation/testing, respectively. We acknowledge that the current validation set is limited in demographic diversity and plan to expand coverage in future work.
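The 80%/10%/10% split described above can be made reproducible with a deterministic shuffle; a minimal sketch, where the function name, file names, and seed are hypothetical:

```python
# Hypothetical sketch of a reproducible 80/10/10 train/validation/test split.
import random

def split_dataset(paths, seed=0):
    """Shuffle deterministically, then split 80%/10%/10%."""
    paths = sorted(paths)               # canonical order before shuffling
    rng = random.Random(seed)           # fixed seed for reproducibility
    rng.shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# 512 images per subject, matching the dataset description above.
train, val, test = split_dataset([f"subject_{i:03d}.png" for i in range(512)])
```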
The training objective combines baseline diffusion loss with identity preservation extensions. As shown in Equation (1), the overall loss function incorporates multiple components to ensure both generation quality and identity consistency.
The training objective can be expressed as follows:
L_total = L_diffusion + λ_identity · L_identity + λ_prior · L_prior    (1)
where L_diffusion is the standard diffusion model reconstruction loss, ensuring the model can effectively denoise images across timesteps; L_identity is the identity preservation loss computed from facial recognition embeddings to maintain subject consistency; and L_prior is the prior-preservation loss that retains the base model's generative abilities and prevents catastrophic forgetting. The hyperparameters λ_identity and λ_prior weight identity preservation and prior knowledge retention; both are set to 1.0 in our experiments, as empirically validated on multiple facial datasets, yielding a balanced trade-off between identity fidelity and prior preservation across our evaluation suite. Note that λ_identity and λ_prior in Equation (1) are distinct from the loss weight schedules shown in Figure 3c, which depict dynamic adaptive weighting of λ_identity, λ_diffusion, and λ_prior across training epochs.
The identity preservation loss is defined as follows:
L_identity = 1 − CosineSimilarity(f(x_gen), f(x_ref))    (2)
where f(·) is the facial recognition encoder (we use ArcFace [1] for robust identity embeddings), x_gen is a generated face, and x_ref is a reference image from the personalization dataset. The cosine similarity measures the angle between embedding vectors in the learned feature space; values closer to 1 reflect stronger identity agreement. This objective ensures that generated faces preserve strong perceptual similarity with the reference identity while allowing variation in pose, expression, and illumination.
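Equations (1) and (2) can be sketched directly; the snippet below uses plain Python lists as stand-ins for embedding tensors, and the loss values in the usage line are illustrative:

```python
# Sketch of the training objective in Equations (1)-(2). Plain Python lists
# stand in for embedding tensors; values in the usage line are illustrative.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identity_loss(e_gen, e_ref):
    """Equation (2): 1 minus cosine similarity of face embeddings."""
    return 1.0 - cosine_similarity(e_gen, e_ref)

def total_loss(l_diffusion, l_identity, l_prior,
               lam_identity=1.0, lam_prior=1.0):
    """Equation (1): weighted sum of the three loss components."""
    return l_diffusion + lam_identity * l_identity + lam_prior * l_prior

loss = total_loss(l_diffusion=0.042, l_identity=0.058, l_prior=0.05)
```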
Figure 4 shows the updated facial processing and improvement pipeline. Our pipeline consists of a series of stages: (a) face detection and alignment modules using a Multi-task Cascaded Convolutional Network (MTCNN) for robust localization, (b) facial landmark detection and geometric normalization for consistent feature extraction, (c) identity-aware image pre-processing with color correction and quality enhancement, and (d) post-generation refinement steps that improve photorealistic detail while preserving identity-critical features through selective enhancement filters.
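The geometric-normalization stage can be illustrated by its roll-correction step: given two eye landmarks (e.g., as produced by MTCNN), compute the in-plane angle by which to rotate the face upright. A minimal sketch with hypothetical coordinates:

```python
# Sketch of the roll-correction step in geometric normalization. Eye
# landmarks are (x, y) pixel coordinates; the values below are hypothetical.
import math

def roll_angle_degrees(left_eye, right_eye):
    """Angle of the line through the eyes relative to horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

# Rotating the image by -angle levels the eyes for consistent extraction.
angle = roll_angle_degrees((100, 120), (160, 150))
```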
Figure 5 details additional components of our facial processing system. The complete pipeline integrates multiple specialized modules to ensure comprehensive identity preservation and quality enhancement throughout the generation process.

3.2. Quality Assessment Framework

Figure 6 shows an overview of our quality assessment mechanism covering various evaluation dimensions, including (a) perceptual similarity using LPIPS metrics to measure human-perceived image quality differences, (b) structural consistency via SSIM measurement of local structural preservation, (c) identity fidelity through cosine similarities on face embeddings to maintain subject consistency, and (d) photorealistic quality assessment through Fréchet Inception Distance (FID) scores and aesthetic predictor networks measuring overall generation realism and visual appeal.
The global quality estimation combines multiple metrics through weighted aggregation as expressed in Equation (3):
Q_global = w_1 · Q_LPIPS + w_2 · Q_SSIM + w_3 · Q_identity + w_4 · Q_FID    (3)
where Q LPIPS , Q SSIM , Q identity and Q FID are the normalized quality scores of perception distance, SSIM, identity and Fréchet Inception Distance, respectively. The weights w 1 = 0.3 , w 2 = 0.2 , w 3 = 0.4 , and  w 4 = 0.1 were determined through a grid-search optimization over 500+ rendered images, maximizing the Spearman rank correlation between the composite score Q global and human preference ratings obtained from 50 participants. The final correlation achieved was ρ = 0.87 ( p < 0.001 ), confirming statistical significance. Identity preservation ( w 3 = 0.4 ) received the highest weight as it was the most consistently preferred factor in the user study, followed by perceptual quality ( w 1 = 0.3 ), structural consistency ( w 2 = 0.2 ), and distribution-level quality ( w 4 = 0.1 ).
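A minimal sketch of Equation (3), assuming each score has already been normalized to [0, 1] with higher values being better (the example score values are illustrative):

```python
# Sketch of Equation (3): weighted aggregation of normalized quality scores.
# Weights follow the grid-search result reported above; scores are assumed
# pre-normalized to [0, 1] with higher = better.
WEIGHTS = {"lpips": 0.3, "ssim": 0.2, "identity": 0.4, "fid": 0.1}

def global_quality(scores, weights=WEIGHTS):
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores[k] for k in weights)

# Illustrative normalized scores for one generated image.
q = global_quality({"lpips": 0.9, "ssim": 0.88, "identity": 0.85, "fid": 0.8})
```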

3.3. Training Pipeline Implementation

Algorithm 1 presents our memory-optimized training procedure. The algorithm leverages hierarchical memory management via gradient accumulation steps that spread memory across micro-batches, attention slicing that segments attention computation to reduce peak memory, VAE tiling for efficient encoding and decoding of high-resolution images, and adaptive checkpoint selection based on GPU size that balances computational overhead against memory usage.
The algorithm handles memory limitations through careful optimization: gradient accumulation scales the effective batch size without a proportional increase in memory; attention slicing processes attention matrices in smaller chunks to reduce peaks; gradient checkpointing trades computation for memory by recomputing activations during the backward pass; and VAE tiling splits high-resolution images into manageable tiles. Together, these methods enable training under 16 GB of GPU memory with convergence quality similar to unconstrained approaches.
Algorithm 1 Memory-optimized DreamBooth training
Require: Pretrained model M, training images {I_1, …, I_n} (n ∈ [3, 20] reference images per subject), prompts {P_1, …, P_n}
Ensure: Fine-tuned personalized model M*
 1: Initialize model M with pretrained weights
 2: Configure memory optimization: attention slicing, VAE tiling
 3: Set gradient accumulation steps G_accum
 4: for epoch = 1 to N_epochs do
 5:     for batch B in training data do
 6:         Enable gradient checkpointing
 7:         Process batch with attention slicing enabled
 8:         if step mod G_accum == 0 then
 9:             Update model parameters with accumulated gradients
10:             Clear gradient cache
11:         end if
12:     end for
13:     Validate identity preservation on validation set
14: end for
15: return Fine-tuned model M*
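The accumulation logic of Algorithm 1 (lines 8–10) can be sketched with a toy scalar "model" so the control flow is runnable without a GPU; the gradient values, learning rate, and loop sizes are illustrative stand-ins:

```python
# Runnable sketch of Algorithm 1's gradient-accumulation control flow.
# A single scalar parameter stands in for the model; the constant gradient
# of 1.0 per micro-batch is an illustrative stand-in.
def train(steps_per_epoch, n_epochs, g_accum, lr=0.1):
    param, grad_buffer, updates, step = 0.0, 0.0, 0, 0
    for _ in range(n_epochs):
        for _ in range(steps_per_epoch):
            step += 1
            grad_buffer += 1.0                       # accumulate micro-batch gradient
            if step % g_accum == 0:                  # line 8: every G_accum steps
                param -= lr * grad_buffer / g_accum  # line 9: averaged update
                grad_buffer = 0.0                    # line 10: clear gradient cache
                updates += 1
    return param, updates

param, updates = train(steps_per_epoch=8, n_epochs=2, g_accum=4)
```

Only one micro-batch of gradients is live at a time, which is how accumulation scales the effective batch size without a proportional memory cost.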
Algorithm 2 describes our inference pipeline with quality enhancement. The procedure consists of (1) adaptive sampling strategies that balance generation quality against computational cost, (2) multi-resolution processing for improved detail preservation across scales, (3) identity-aware post-processing filters that selectively enhance photorealistic aspects while preserving subject consistency and quality, and (4) validation gates ensuring that generated results meet prespecified identity and quality criteria before final synthesis.
Quality in the inference pipeline is guaranteed by multi-stage validation: identity verification (lines 5–8) confirms subject consistency before enhancement; facial enhancement (line 9) selectively improves details in detected faces using learned enhancement models; photorealistic processing (line 10) applies conditional filters that enhance realism while avoiding artifacts; and final quality validation (lines 11 and 12) ensures the output meets perceptual quality, structural consistency, and identity preservation criteria. This staged mechanism rejects and regenerates suboptimal outputs, yielding high synthesis quality across a wide range of prompts.
Algorithm 2 Quality-enhanced inference pipeline
Require: Fine-tuned model M*, generation prompt P, reference image I_ref
Ensure: Enhanced generated image I_out
 1: Load model M* with memory optimizations enabled
 2: Extract reference identity embedding e_ref = f(I_ref)
 3: Generate initial image I_gen using prompt P
 4: Extract generated identity embedding e_gen = f(I_gen)
 5: Compute identity similarity s = CosineSimilarity(e_ref, e_gen)
 6: if s < threshold_identity then        ▹ threshold_identity = 0.85, empirically selected based on user studies
 7:     Adjust generation parameters and regenerate
 8: end if
 9: Apply facial enhancement: detail sharpening, color correction
10: Apply photorealistic post-processing filters
11: Validate final quality metrics: LPIPS, SSIM, FID
12: return Enhanced image I_out
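The identity gate of Algorithm 2 (lines 5–8) can be sketched as a retry loop; the generator and similarity functions below are illustrative stubs, and the retry limit is an assumption (the algorithm itself does not specify one):

```python
# Sketch of Algorithm 2's identity gate: regenerate until the similarity
# clears the 0.85 threshold or the (assumed) retry budget is exhausted.
def generate_with_identity_gate(generate, similarity,
                                threshold=0.85, max_retries=3):
    image = generate()
    for _ in range(max_retries):
        if similarity(image) >= threshold:
            break                  # identity check passed; proceed to enhancement
        image = generate()         # adjust parameters and regenerate
    return image

# Illustrative stubs: similarity improves on each attempt.
sims = iter([0.80, 0.83, 0.90])
attempts = []
img = generate_with_identity_gate(
    generate=lambda: attempts.append(1) or len(attempts),
    similarity=lambda _: next(sims),
)
```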

4. Experimental Results

4.1. Hardware Performance Validation

Figure 7 shows the overall performance validation and hardware-level profiling across several aspects. Subfigure (a) reports training speed comparisons between the DreamBooth baseline (22 GB peak GPU memory) and our optimized implementation (14.2 GB peak memory), which incurs roughly 15% computational overhead for deployment on consumer GPUs. On a 24 GB GPU, the baseline takes approximately 3.2 h for 800 training steps, while our optimized version takes around 3.7 h on a 16 GB GPU, which remains acceptable from an accessibility standpoint.
We further visualize GPU utilization during training (Figure 7a,b): our hierarchical memory management sustains stable utilization of 85–92% (vs. 95–98% for the baseline), indicating successful resource sharing without memory overflows. The slightly lower utilization reflects the deliberate memory savings from attention slicing and gradient accumulation, which eliminate Out-of-Memory (OOM) errors without sacrificing training effectiveness.
Experimental Hardware Configuration: All experiments were conducted on an NVIDIA Corporation (Santa Clara, CA, USA) RTX 3080 (16 GB GDDR6X, 272.6 TFLOPS FP16) running CUDA 11.8, PyTorch 2.0.1, and Hugging Face Diffusers (Hugging Face, Inc., New York, NY, USA) v0.21.0 on Ubuntu 22.04 LTS. The baseline DreamBooth experiments were replicated on an NVIDIA A100 (40 GB HBM2, 312 TFLOPS FP16) to provide a fair 24 GB reference. Memory savings were computed as Δ M = ( M baseline M optimized ) / M baseline × 100 % using torch.cuda.max_memory_allocated() after each training step, averaged over five independent runs to ensure statistical reliability (std < 0.3  GB).
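The ΔM formula above can be checked directly against the reported figures; a minimal sketch (the helper name is hypothetical):

```python
# The memory-savings formula from the text, applied to the reported numbers.
def memory_savings_pct(m_baseline_gb, m_optimized_gb):
    """Delta-M = (M_baseline - M_optimized) / M_baseline * 100%."""
    return (m_baseline_gb - m_optimized_gb) / m_baseline_gb * 100.0

saving = memory_savings_pct(22.0, 14.2)  # ~35.5%, reported as ~36%
```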
Figure 7c shows the training convergence analysis with loss trajectories comparison that reflects memory-aware optimization does not compromise on convergence quality. Final loss values are also similar between baseline and optimized implementations after 800 training steps (approximately 0.042 of diffusion loss, 0.058 of identity loss), confirming that our memory management approach effectively maintains the training dynamics and the learning capability of the model.
Figure 7d offers a memory footprint breakdown for varying batch sizes and optimization schedules, quantifying the effect of each optimization technique. Attention slicing accounts for 18% of peak memory savings, VAE tiling contributes an additional 12%, gradient accumulation enables a 25% effective memory reduction, and gradient checkpointing provides a further 15%. Together, these approaches yield a total reduction of approximately 36%, from 22 GB to 14.2 GB, with equal generation quality.
Figure 8 details training pipeline validation and dataset processing results. The figure demonstrates (a) dataset preprocessing efficiency across different image resolutions and batch configurations, showing that our optimized pipeline maintains processing throughput within 10% of baseline while operating under memory constraints, (b) validation metrics throughout training demonstrating consistent identity preservation (cosine similarity > 0.85 ) and perceptual quality (LPIPS < 0.15 ) across checkpoints, (c) ablation studies quantifying individual contributions of each memory optimization technique to overall performance, revealing that gradient accumulation and attention slicing provide the largest individual benefits, and (d) robustness analysis across diverse subject datasets validating generalization of our approach beyond specific facial characteristics.
Figure 9 presents generation pipeline performance and multi-resolution enhancement capabilities. The analysis includes (a) inference time measurements across different resolution outputs (512 × 512 to 1024 × 1024 pixels), demonstrating that our framework generates 512 × 512 images in 4.2 s and 1024 × 1024 images in 8.7 s on 16 GB consumer GPUs, (b) quality scaling analysis showing that higher resolution generations maintain consistent identity preservation (cosine similarity degradation < 0.02 ) and improved photorealistic detail as quantified by increased FID scores, (c) memory usage during inference revealing peak allocations of 9.8 GB for 512 × 512 and 12.4 GB for 1024 × 1024 generations, well within consumer GPU constraints, and (d) enhancement pipeline impact demonstrating that post-processing stages improve perceptual quality (LPIPS improvement of 0.08) and photorealism (FID improvement of 12.3) while introducing negligible identity deviation ( < 0.01 cosine similarity change).
Figure 10 depicts complete system integration and deployment validation. A comprehensive assessment of the entire pipeline across multiple prompts and subjects tested the impact of system components on subject integrity, yielding success rates of over 95%. The proposed framework was compared to DreamBooth and other competitive methods through user studies (50 participants, 200 generated images rated per method, five-point Likert scale), showing preference ratings of 4.2/5.0 for identity preservation and 4.0/5.0 for photorealistic quality; inter-rater agreement was assessed with Krippendorff's α = 0.76. Our method was statistically compared against DreamBooth using a Wilcoxon signed-rank test (p < 0.05), and deployment feasibility analysis confirmed compatibility with consumer GPU configurations (GTX 1080 Ti, RTX 3060, RTX 3080), with performance scaling tests demonstrating consistent generation quality across diverse scenarios, supporting production validation.

4.2. Comparative Analysis with State-of-the-Art

Table 1 presents comprehensive quantitative comparisons between our framework and recent state-of-the-art personalized generation methods. Our approach demonstrates competitive performance across all evaluation metrics while achieving significantly lower memory requirements, enabling deployment on consumer hardware.
Our framework achieves an LPIPS score of 0.139, indicating superior perceptual quality compared to most existing methods. The SSIM value of 0.879 demonstrates excellent structural consistency preservation. Most importantly, we maintain identity preservation (cosine similarity of 0.852) competitive with MasterWeaver while operating within significantly tighter memory constraints. The FID score of 23.1 confirms that generation quality remains photorealistic and well aligned with natural image distributions.
Table 2 presents computational efficiency comparisons, highlighting our framework’s advantages in training and inference times relative to memory usage.
Our framework reduces peak memory by 36% (14.2 GB vs. 22 GB) at the cost of a modest increase in training time relative to DreamBooth (3.7 h vs. 3.2 h), while inference speed is unchanged at 4.2 s per generated 512 × 512 image. MasterWeaver [9] matches fine-grained facial features through image-conditioned blending but requires 20 GB of GPU memory. Subject-Diffusion [12] trains faster but is less reliable, exhibiting identity drift in challenging poses. HP3 [11] offers comparable memory efficiency but incurs higher inference latency due to its 3D processing. FastComposer [39] can generate multiple subjects without per-subject tuning, but its identity preservation (0.831) falls below ours, while IP-Adapter [4] is flexible yet requires 18–24 GB of GPU memory, placing it beyond most consumer hardware. Overall, our combination of memory efficiency, identity preservation, and generation quality is well suited to consumer GPUs with limited memory capacity.
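As a back-of-the-envelope illustration of how the individual optimizations compose toward the reported 22 GB → 14.2 GB reduction, the sketch below greedily enables techniques until a memory budget is met. The per-technique savings are hypothetical placeholders (only the 22 GB baseline and 14.2 GB total are reported figures; the real breakdown is workload-dependent):

```python
BASELINE_GB = 22.0  # baseline DreamBooth peak memory (reported)

# Hypothetical per-technique savings, chosen so the full stack totals
# the reported 14.2 GB peak (22.0 - 7.8 = 14.2).
SAVINGS_GB = {
    "gradient_checkpointing": 3.4,
    "attention_slicing": 2.1,
    "vae_tiling": 1.2,
    "gradient_accumulation": 1.1,
}

def plan_optimizations(budget_gb, baseline=BASELINE_GB, savings=SAVINGS_GB):
    """Greedily enable optimizations until peak memory fits the budget."""
    applied, peak = [], baseline
    for name, gain in savings.items():
        if peak <= budget_gb:
            break
        applied.append(name)
        peak -= gain
    return applied, round(peak, 1)

applied, peak = plan_optimizations(budget_gb=16.0)
```

With a 16 GB budget the planner stops once the estimate fits; enabling all four techniques would reach the full 14.2 GB figure, leaving headroom for inference-time allocations.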

5. Discussion

Comparison with Related Work: Our framework occupies a distinct point on the memory-quality trade-off continuum. We achieve a 36% reduction in peak memory compared to DreamBooth [1] while maintaining identity fidelity (0.852 vs. 0.847). Our identity (0.852) and SSIM (0.879) scores exceed those of MasterWeaver [9], which additionally requires 40% more GPU memory. FastComposer [39] and IP-Adapter [4] show higher LPIPS and FID scores on our benchmark, indicating lower perceptual quality. Our benchmarks demonstrate the best memory-to-performance trade-off among consumer-deployable methods.
Discussion on Evaluation Metrics: Standard metrics provide only an approximation of subject-level perceptual consistency; LPIPS, in particular, does not capture identity consistency in high-dimensional embedding space. A promising direction is a Personalized Identity Fidelity Score that combines identity cosine similarity, demographic invariance, and expression robustness over time. Such a metric would become more useful if validated against larger user-centered perceptual studies, which correlate strongly with real-world judgments and can complement automated evaluation.
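A minimal sketch of such a composite score follows, assuming a simple weighted average of the three named components. The weights are hypothetical, as the metric is only proposed, not defined, in this work; each component is assumed to be normalized to [0, 1]:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (e.g., ArcFace)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pif_score(identity_sim, demographic_invariance, expression_robustness,
              weights=(0.5, 0.25, 0.25)):
    """Hypothetical Personalized Identity Fidelity Score: a weighted
    average of identity cosine similarity, demographic invariance,
    and expression robustness. Weights are illustrative assumptions."""
    components = (identity_sim, demographic_invariance, expression_robustness)
    return sum(w * c for w, c in zip(weights, components))

score = pif_score(0.852, 0.90, 0.80)  # 0.5*0.852 + 0.25*0.90 + 0.25*0.80
```

A weighted average keeps the score interpretable on the same [0, 1] scale as its inputs; a deployed metric would instead calibrate the weights against human ratings.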

Limitations and Future Work

While our approach reduces the memory required to run personalized diffusion models on consumer-grade hardware, several limitations warrant discussion. Testing was primarily conducted with individuals from a particular demographic group, and more extensive evaluation across different age ranges, ethnicities, and facial features is needed.
The model employs pretrained diffusion models, which can inherit biases from their training distributions. Establishing bias-aware training objectives and systematically auditing dataset distortions is crucial for achieving equitable performance across demographic groups.
The majority of recent work, including ours, focuses on generating single images. Generalizing to video requires additional temporal consistency mechanisms to maintain identity coherence across frames, which remains an open technical challenge.
Our method achieves over a 95% success rate in controlled experiments, but edge cases involving extreme poses (head rotation > 60°), heavy occlusions (> 40% facial coverage), or abnormal illumination conditions can still cause identity drift or quality degradation; focused data augmentation and adversarial or test-time adaptation methods are promising mitigations.
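The edge-case thresholds above lend themselves to a simple automated screen. The function below is a hypothetical validator, not part of the released pipeline; it only encodes the pose, occlusion, and identity thresholds quoted in the text:

```python
def screen_sample(yaw_deg, occlusion_frac, identity_sim,
                  yaw_limit=60.0, occlusion_limit=0.40, id_threshold=0.85):
    """Return the edge-case flags raised by one generated sample.

    Thresholds follow the text: head rotation > 60 degrees, facial
    occlusion > 40%, and identity cosine similarity below 0.85.
    """
    flags = []
    if abs(yaw_deg) > yaw_limit:
        flags.append("extreme_pose")
    if occlusion_frac > occlusion_limit:
        flags.append("heavy_occlusion")
    if identity_sim < id_threshold:
        flags.append("identity_drift")
    return flags

flags = screen_sample(70.0, 0.10, 0.90)  # flags "extreme_pose" only
```

Flagged samples could then be routed to regeneration or to the test-time adaptation methods mentioned above rather than delivered to the user.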
A 3.7 h training time for 800 steps is acceptable in an experimental setting but unsuitable for real-time personalization. Knowledge distillation, LoRA-based efficient fine-tuning, and lightweight UNet variants are potential avenues for reducing this overhead in the future.
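To make the LoRA suggestion concrete: for a weight matrix of shape d_out × d_in, LoRA trains a rank-r update with r·(d_in + d_out) parameters instead of d_in·d_out. The quick estimate below uses dimensions chosen to resemble a Stable Diffusion cross-attention projection; they are illustrative assumptions, not measurements from our model:

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable-parameter counts: full fine-tuning vs. a LoRA update."""
    full = d_in * d_out            # full weight matrix
    lora = rank * (d_in + d_out)   # low-rank factors B (d_out x r), A (r x d_in)
    return full, lora, lora / full

full, lora, ratio = lora_param_counts(768, 768, rank=8)
# 589,824 full parameters vs. 12,288 LoRA parameters (~2% of full).
```

Shrinking the trainable set this way reduces both optimizer-state memory and per-step compute, which is why LoRA is a natural fit for the consumer-hardware setting targeted here.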
Furthermore, our memory optimization strategies introduce roughly 15% additional computation overhead during training. Mixed-precision training (FP16/BF16), sparse attention mechanisms, and Flash Attention represent technically concrete avenues for improving this trade-off.
Ethical Considerations and Potential Misuse: Extending the framework to multi-subject personalization will require a cross-subject attention disentanglement mechanism and memory-partitioned techniques to avoid identity collapse while preserving the targeted subject domains under restricted memory conditions. Mitigating the potential misuse of personalized image generation technologies requires engagement with current ethical frameworks and concrete safeguards, such as user consent authentication, watermarking of AI-generated outputs, compliance with platform-level content moderation policies, and ethical review mechanisms.
Khan et al. [40] present a comprehensive review of person de-identification methods, datasets, and ethical considerations that directly inform responsible deployment practices for personalized generation systems. Pham et al. [41] propose TALE, a training-free cross-domain image composition method, whose privacy-aware compositional design principles are relevant to ethical multi-subject generation.

6. Conclusions

We presented an end-to-end pipeline for personalized facial content generation, built on a memory-efficient DreamBooth implementation optimized for consumer hardware. By combining hierarchical memory management, identity-aware augmentation, and multi-dimensional image quality evaluation, the pipeline operates within 16 GB of GPU memory.
By reducing peak memory by 36%, the framework enables deployment on consumer GPUs while maintaining generation quality and identity fidelity, achieving competitive results (LPIPS: 0.139, SSIM: 0.879, identity: 0.852, and FID: 23.1).
Generation quality is objectively verified through a comprehensive evaluation system spanning perceptual, structural, identity-based, and photorealism metrics, and user studies demonstrate the practical utility of this approach, with preference ratings of 4.2/5.0 for identity preservation and 4.0/5.0 for photorealistic quality.
Addressing the remaining limitations will require expanding demographic coverage, establishing temporal consistency mechanisms for video generation, incorporating low-cost fine-tuning methods, and developing new techniques for multi-subject personalization.
Through the use of personalized diffusion models, researchers, educators, and creative professionals can access state-of-the-art AI-driven generation capabilities that were previously only accessible to well-funded organizations equipped with specialized hardware.

Author Contributions

Conceptualization, S.G. and K.R.; methodology, S.G.; software, S.G.; validation, S.G., K.R. and J.F.; formal analysis, S.G.; investigation, S.G.; resources, S.G.; data curation, S.G.; writing—original draft preparation, S.G.; writing—review and editing, K.R., J.F., S.K. and S.H.; visualization, S.G.; supervision, K.R. and J.F.; project administration, K.R. All authors have read and agreed to the published version of the manuscript.

Funding

The publication of this paper was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2022-05122.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Representative synthetic datasets and evaluation scripts are available upon reasonable request to the corresponding author. Full training datasets cannot be publicly released due to privacy considerations and ethical guidelines regarding facial image data. Interested researchers may contact the authors to discuss data access protocols compliant with institutional ethics requirements and data protection regulations.

Acknowledgments

The authors acknowledge computational resources provided by Poornima College of Engineering and Amity University Rajasthan.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  2. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  3. Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J.Y. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1931–1941. [Google Scholar]
  4. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv 2023, arXiv:2308.06721. [Google Scholar]
  5. Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; Zheng, Y. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 22560–22570. [Google Scholar]
  6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar]
  7. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  8. Wei, Y.; Zhang, Y.; Ji, Z.; Bai, J.; Zhang, L.; Zuo, W. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15943–15953. [Google Scholar]
  9. Wei, Y.; Ji, Z.; Bai, J.; Zhang, H.; Zhang, L.; Zuo, W. Masterweaver: Taming editability and face identity for personalized text-to-image generation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 252–271. [Google Scholar]
  10. Lin, J.; Wu, Y.; Wang, Z.; Liu, X.; Guo, Y. Pair-ID: A Dual Modal Framework for Identity Preserving Image Generation. IEEE Signal Process. Lett. 2024, 31, 2715–2719. [Google Scholar] [CrossRef]
  11. Xu, Y.; Zhang, C.; Zhai, B.; Du, S. HP3: Tuning-Free Head-Preserving Portrait Personalization Via 3D-Controlled Diffusion Models. IEEE Signal Process. Lett. 2025, 32, 1226–1230. [Google Scholar] [CrossRef]
  12. Ma, J.; Liang, J.; Chen, C.; Lu, H. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In Proceedings of the ACM SIGGRAPH 2024 Conference, Denver, CO, USA, 27 July–1 August 2024; pp. 1–12. [Google Scholar]
  13. Zhang, X.; Li, X.; Wang, T.; Yin, L. Enhancing Face Recognition in Low-Quality Images Based on Restoration and 3D Multiview Generation. In Proceedings of the 2024 IEEE International Joint Conference on Biometrics (IJCB), Buffalo, NY, USA, 15–18 September 2024; pp. 1–10. [Google Scholar]
  14. He, M.; Clausen, P.; Taşel, A.L.; Ma, L.; Pilarski, O.; Xian, W.; Rikker, L.; Yu, X.; Burgert, R.; Yu, N.; et al. DifFRelight: Diffusion-Based Facial Performance Relighting. In Proceedings of the SIGGRAPH Asia 2024 Conference, Tokyo, Japan, 3–6 December 2024; pp. 1–12. [Google Scholar]
  15. Yang, H.; Xu, X.; Xu, C.; Zhang, H.; Qin, J.; Wang, Y.; Heng, P.A.; He, S. G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors. IEEE Trans. Inf. Forensics Secur. 2024, 19, 8773–8785. [Google Scholar] [CrossRef]
  16. Grosz, S.A.; Jain, A.K. Genpalm: Contactless palmprint generation with diffusion models. In Proceedings of the 2024 IEEE International Joint Conference on Biometrics (IJCB), Buffalo, NY, USA, 15–18 September 2024; pp. 1–9. [Google Scholar]
  17. Alimisis, P.; Mademlis, I.; Radoglou-Grammatikis, P.; Sarigiannidis, P.; Papadopoulos, G.T. Advances in diffusion models for image data augmentation: A review of methods, models, evaluation metrics and future research directions. Artif. Intell. Rev. 2025, 58, 112. [Google Scholar] [CrossRef]
  18. Wang, W.; Mu, M.; Tian, Y.; Hu, Y.; Lu, X. ILSR-Diff: Joint face illumination normalization and super-resolution via diffusion models. Multimed. Syst. 2024, 30, 302. [Google Scholar] [CrossRef]
  19. Guerrero-Viu, J.; Hasan, M.; Roullier, A.; Harikumar, M.; Hu, Y.; Guerrero, P.; Gutierrez, D.; Masia, B.; Deschaintre, V. Texsliders: Diffusion-based texture editing in clip space. In Proceedings of the ACM SIGGRAPH 2024 Conference, Denver, CO, USA, 27 July–1 August 2024; pp. 1–11. [Google Scholar]
  20. Zhao, G.; Xu, J.; Wang, X.; Yan, F.; Qiu, S. PSAIP: Prior Structure-Assisted Identity-Preserving Network for Face Animation. Electronics 2025, 14, 784. [Google Scholar] [CrossRef]
  21. Wang, H.; Jia, X.; Cao, X. EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation via Diffusion Model. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; pp. 1–10. [Google Scholar]
  22. Baltsou, G.; Sarridis, I.; Koutlis, C.; Papadopoulos, S. Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks. arXiv 2025, arXiv:2511.17393. [Google Scholar] [CrossRef]
  23. Liao, F.; Zou, X.; Wong, W. Appearance and pose-guided human generation: A survey. ACM Comput. Surv. 2024, 56, 129. [Google Scholar] [CrossRef]
  24. Melnik, A.; Miasayedzenkau, M.; Makaravets, D.; Pirshtuk, D.; Akbulut, E.; Holzmann, D.; Renusch, T.; Reichert, G.; Ritter, H. Face generation and editing with stylegan: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3557–3576. [Google Scholar] [CrossRef]
  25. Xiu, Y.; Ye, Y.; Liu, Z.; Tzionas, D.; Black, M.J. Puzzleavatar: Assembling 3d avatars from personal albums. ACM Trans. Graph. 2024, 43, 283. [Google Scholar] [CrossRef]
  26. Zhu, X.; Zhou, J.; You, L.; Yang, X.; Chang, J.; Zhang, J.J.; Zeng, D. DFIE3D: 3D-aware disentangled face inversion and editing via facial-contrastive learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8310–8326. [Google Scholar] [CrossRef]
  27. Sii, J.W.; Chan, C.S. Gorgeous: Creating narrative-driven makeup ideas via image prompts. Multimed. Tools Appl. 2025, 84, 43805–43826. [Google Scholar] [CrossRef]
  28. Xu, C.; Qian, Y.; Zhu, S.; Sun, B.; Zhao, J.; Liu, Y.; Li, X. UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors. Int. J. Comput. Vis. 2025, 133, 4538–4554. [Google Scholar] [CrossRef]
  29. Chen, W.; Zhu, B.; Xu, K.; Dou, Y.; Feng, D. VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 279. [Google Scholar] [CrossRef]
  30. Xiong, L.; Cheng, X.; Tan, J.; Wu, X.; Li, X.; Zhu, L.; Ma, F.; Li, M.; Xu, H.; Hu, Z. SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 3170–3179. [Google Scholar]
  31. Asperti, A.; Colasuonno, G.; Guerra, A. Illumination and Shadows in Head Rotation: Experiments with Denoising Diffusion Models. Electronics 2024, 13, 3091. [Google Scholar] [CrossRef]
  32. Tai, Y.; Yang, K.; Peng, T.; Huang, Z.; Zhang, Z. Defect Image Sample Generation With Diffusion Prior for Steel Surface Defect Recognition. IEEE Trans. Autom. Sci. Eng. 2024, 22, 8239–8251. [Google Scholar] [CrossRef]
  33. Yan, W.; Shao, W.; Zhang, D.; Xiao, L. FaceGCN: Structured Priors Inspired Graph Convolutional Networks for Blind Face Restoration. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6214–6230. [Google Scholar] [CrossRef]
  34. Xue, H.; Zhang, Z.; Li, M.; Dai, Z.; Wu, Z. Identity-Preserving Audio-Driven Holistic Human Motion Video Generation. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  35. Dhanyalakshmi, R.; Stoian, G.; Danciulescu, D.; Hemanth, D.J. A Survey on Face-Swapping Methods for Identity Manipulation in Deepfake Applications. IET Image Process. 2025, 19, e70132. [Google Scholar] [CrossRef]
  36. Liu, T.; Chen, F.; Fan, S.; Du, C.; Chen, Q.; Chen, X.; Yu, K. Anitalker: Animate vivid and diverse talking faces through identity-decoupled facial motion encoding. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 6696–6705. [Google Scholar]
  37. Yu, H.; Qu, Z.; Yu, Q.; Chen, J.; Jiang, Z.; Chen, Z.; Zhang, S.; Xu, J.; Wu, F.; Lv, C.; et al. Gaussiantalker: Speaker-specific talking head synthesis via 3d gaussian splatting. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 3548–3557. [Google Scholar]
  38. Back, S.Y.; Son, G.; Jeong, D.; Park, E.; Woo, S.S. Preserving Old Memories in Vivid Detail: Human-Interactive Photo Restoration Framework. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 5180–5184. [Google Scholar]
  39. Xiao, G.; Yin, T.; Freeman, W.T.; Durand, F.; Han, S. Fastcomposer: Tuning-free multi-subject image generation with localized attention. Int. J. Comput. Vis. 2024, 133, 1175–1194. [Google Scholar] [CrossRef]
  40. Khan, W.; Topham, L.; Khayam, U.; Ortega-Martorell, S.; Heather, P.; Ansell, D.; Al-Jumeily, D.; Hussain, A. Person de-Identification: A Comprehensive Review of Methods, Datasets, Applications, and Ethical Aspects Along-With New Dimensions. IEEE Trans. Biom. Behav. Identity Sci. 2024, 7, 293–312. [Google Scholar] [CrossRef]
  41. Pham, K.T.; Chen, J.; Chen, Q. TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 3160–3169. [Google Scholar]
Figure 1. A complete DreamBooth implementation pipeline overview showing (a) memory-optimized training framework with Accelerate integration, (b) advanced facial processing and enhancement pipeline using Multi-task Cascaded Convolutional Network (MTCNN) for face detection and alignment, (c) multi-dimensional quality assessment system reporting Structural Similarity Index Measure (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), identity cosine similarity, and Fréchet Inception Distance (FID), (d) generated results with realistic enhancement comparisons demonstrating identity preservation and photorealistic quality improvements. Abbreviations: VAE = Variational Autoencoder; MTCNN = Multi-task Cascaded Convolutional Network; SSIM = Structural Similarity Index Measure; LPIPS = Learned Perceptual Image Patch Similarity; and FID = Fréchet Inception Distance.
Figure 2. Memory optimization strategies and attention management showing (a) hierarchical memory allocation patterns during training at Cross-Attention (CA) and Attention-to-Attention (ATA) stages, (b) attention slicing and Variational Autoencoder (VAE) tiling implementation effects, (c) gradient accumulation and checkpointing mechanisms showing peak memory (GB) vs. effective batch size, (d) memory footprint breakdown (X-axis = optimization configuration; left Y-axis = peak memory (GB); right Y-axis = training time (hours)). Abbreviations: CA = Cross-Attention; ATA = Attention-to-Attention; VAE = Variational Autoencoder; GB = Gigabytes; and GPU = Graphics Processing Unit.
Figure 3. Identity preservation mechanisms and constraint systems showing (a) multi-scale facial feature extraction network producing ArcFace 512-dimensional (512-d) embedding, (b) attention-based identity constraint mechanism integrating texture features, shape geometry, facial landmarks, and color via self-attention, (c) adaptive loss weighting strategies across 100 training epochs for λ identity , λ diffusion , and λ prior (both fixed scalars at 1.0 in Equation (1); curves show dynamic scheduling during warm-up), and (d) identity validation via cosine similarity across diverse generation contexts showing threshold at 0.85 and mean score of 0.852. Abbreviations: ArcFace = Additive Angular Margin loss for face recognition and d = dimensions.
Figure 4. Advanced facial processing and enhancement pipeline showing (a) Multi-task Cascaded Convolutional Network (MTCNN) face detection and alignment via P-Net (Proposal Network), R-Net (Refinement Network), and O-Net (Output Network) producing 512 × 512 aligned output (Accuracy: 98.7%, Precision: 97.3%, and Speed: 23 fps), (b) 24-point landmark detection distribution across facial regions (Eyes: 4 pts, Nose: 2 pts, Mouth: 4 pts, Jaw: 8 pts, and Eyebrows: 6 pts), (c) identity-aware preprocessing pipeline stages, including color normalization, illumination correction, contrast enhancement, noise reduction, and feature extraction, and (d) post-generation quality refinement showing quality scores before and after refinement across sharpness, color consistency, texture quality, edge definition, and overall quality metrics. Abbreviations: MTCNN = Multi-task Cascaded Convolutional Network; fps = frames per second; and pts = points.
Figure 5. Extended facial processing pipeline components showing (a) memory-optimized training framework with accelerate integration detailing gradient accumulation, gradient checkpointing, attention slicing, Variational Autoencoder (VAE) tiling, and memory management converging to a fine-tuned model with 14.2 GB peak memory (36% reduction), (b) advanced facial processing and enhancement pipeline (4-stage sequential flow), (c) multi-dimensional quality assessment system reporting SSIM = 0.879 (Structural Similarity Index Measure), LPIPS = 0.139 (Learned Perceptual Image Patch Similarity), identity = 0.852, FID = 23.1 (Fréchet Inception Distance), and (d) generated comparison results baseline DreamBooth (22 GB, 3.2 h, 24 GB GPU) vs. our method (14.2 GB, 3.7 h, 16 GB GPU) with 36% memory reduction and 15% time overhead. Abbreviations: VAE = Variational Autoencoder; GB = Gigabytes; SSIM = Structural Similarity Index Measure; LPIPS = Learned Perceptual Image Patch Similarity; and FID = Fréchet Inception Distance.
Figure 6. Comprehensive quality assessment framework comparing seven methods, including newly added baselines FastComposer and IP-Adapter: (a) perceptual similarity via Learned Perceptual Image Patch Similarity (LPIPS, ↓ better)—our method achieves 0.139, (b) structural consistency via Structural Similarity Index Measure (SSIM, ↑ better)—our method achieves 0.879, (c) identity fidelity via ArcFace cosine similarity (↑ better)—our method achieves 0.852 exceeding the 0.85 threshold, and (d) photorealistic quality via Fréchet Inception Distance (FID, ↓ better) and aesthetic score (↑ better)—our method achieves FID = 23.1 and aesthetic = 7.0. Abbreviations: LPIPS = Learned Perceptual Image Patch Similarity; SSIM = Structural Similarity Index Measure; FID = Fréchet Inception Distance; and ArcFace = Additive Angular Margin loss.
Figure 7. Hardware performance validation and optimization results showing (a) training time (hours) and peak memory (GB) comparison for baseline (24 GB GPU, 22.0 GB peak, 3.2 h) vs. full optimization (16 GB GPU, 14.2 GB peak, 3.7 h), (b) Graphics Processing Unit (GPU) utilization patterns throughout training, and baseline saturates above 95% threshold while optimized remains within the 85–93% optimal range, (c) training convergence analysis showing total loss trajectories for baseline DreamBooth (Final: 0.063) and our optimized method (Final: 0.073)—annotations are positioned to avoid overlap, and (d) memory footprint breakdown across optimization techniques. X-axis = optimization configuration; left Y-axis = peak memory (GB); and right Y-axis = training time (hours). Abbreviations: GPU = Graphics Processing Unit; OOM = Out-of-Memory; GB = Gigabytes; and h = hours.
Figure 8. Training pipeline validation and dataset processing results showing (a) dataset preprocessing efficiency (throughput in images/sec) across four resolutions (256 × 256 to 1024 × 1024) for baseline and optimized pipelines, (b) validation metrics throughout 800 training steps tracking Learned Perceptual Image Patch Similarity (LPIPS, ↓), Structural Similarity Index Measure (SSIM, ↑), and identity cosine similarity (↑), (c) ablation studies of cumulative optimization techniques (X-axis = technique configurations; left Y-axis = Peak Memory (GB); right Y-axis = Final Loss; dashed line = 16 GB GPU limit), and (d) robustness analysis across diverse subject demographics showing success rate (%) with 95% target threshold (n = sample count per group). Abbreviations: LPIPS = Learned Perceptual Image Patch Similarity; SSIM = Structural Similarity Index Measure; GB = Gigabytes; and n = sample count.
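Panel (b) of Figure 8 tracks identity cosine similarity between face embeddings during training. The metric itself reduces to the cosine of the angle between two embedding vectors; a minimal sketch (the toy vectors stand in for outputs of a real face-recognition encoder):

```python
import math

def identity_cosine(a, b):
    """Cosine similarity between two face-embedding vectors:
    1.0 = same direction (same identity), 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

same = identity_cosine([0.6, 0.8], [0.6, 0.8])   # 1.0
ortho = identity_cosine([1.0, 0.0], [0.0, 1.0])  # 0.0
```

Real identity scores (e.g., the 0.852 reported in Table 1) are averages of this quantity over many generated/reference embedding pairs.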
Figure 9. Generation pipeline performance and multi-resolution enhancement showing (a) inference time (seconds) across four output resolutions (256 × 256 to 1024 × 1024) for batch sizes one and four, (b) quality scaling analysis across resolutions tracking LPIPS (↓), identity cosine similarity (↑), and Fréchet Inception Distance (FID, ↓), (c) peak inference memory (GB) for our optimized pipeline vs. the baseline, with the 16 GB GPU limit threshold shown, and (d) enhancement pipeline impact on LPIPS (↓) and identity score (↑) across five sequential stages (raw generation → +face alignment → +color correction → +detail sharpening → +photo-realistic filters), achieving a 30% LPIPS improvement (0.198 → 0.139). Abbreviations: LPIPS = Learned Perceptual Image Patch Similarity; FID = Fréchet Inception Distance; GB = gigabytes.
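The inference memory headroom in Figure 9c comes largely from attention slicing: rather than materializing an entire attention row of exponentials at once, the softmax is evaluated slice by slice, capping the transient buffer size. A pure-Python sketch of the principle (the slicing scheme is illustrative; production implementations slice across attention heads or query blocks):

```python
import math

def softmax_sliced(scores, slice_size):
    """Numerically stable softmax computed in fixed-size slices:
    only one slice of exponentials is materialized at a time,
    bounding peak memory regardless of sequence length."""
    m = max(scores)                      # pass 1: global max for stability
    denom = 0.0
    for s in range(0, len(scores), slice_size):
        denom += sum(math.exp(x - m) for x in scores[s:s + slice_size])
    out = []
    for s in range(0, len(scores), slice_size):
        out.extend(math.exp(x - m) / denom for x in scores[s:s + slice_size])
    return out

row = [0.1, 2.0, -1.0, 0.5]
z = [math.exp(x - max(row)) for x in row]
full = [v / sum(z) for v in z]                    # unsliced reference
assert all(abs(a - b) < 1e-12
           for a, b in zip(full, softmax_sliced(row, 2)))
```

The sliced result matches the unsliced softmax to floating-point precision, which is why slicing reduces memory without any quality cost in panels (b) and (d).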
Figure 10. Complete system integration and deployment validation showing (a) end-to-end pipeline testing success rate (%) across five scenario types with 95% success threshold (n = sample count per scenario), (b) user study results (n = 50 participants, 200 images) on a 1–5 Likert scale for identity preservation, visual quality, and photorealism across four compared methods, (c) deployment feasibility across six consumer Graphics Processing Unit (GPU) configurations (GTX 1080 Ti through RTX 4060) showing baseline training, optimized training, baseline inference, and optimized inference compatibility, and (d) production readiness validation over 1000 generation cycles tracking quality consistency, memory stability, and speed consistency against 99% reliability target. Abbreviations: GPU = Graphics Processing Unit; GTX/RTX = NVIDIA GPU model families; and n = sample count.
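The deployment feasibility matrix in Figure 10c boils down to a simple check: does the pipeline's peak memory, plus some headroom for the driver and display, fit within a card's VRAM? A sketch of that check follows; the VRAM table is an illustrative assumption (common published specs), not the paper's exact six-GPU test set:

```python
# Illustrative VRAM capacities in GB (assumed, not from the paper).
VRAM_GB = {
    "GTX 1080 Ti": 11,
    "RTX 3060": 12,
    "RTX 4060": 8,
    "RTX 4060 Ti 16GB": 16,
    "RTX 4080": 16,
    "RTX 3090": 24,
}

def can_train(gpu, peak_gb, headroom_gb=0.5):
    """Feasible if peak training memory plus a small headroom for
    the driver/display fits within the card's VRAM."""
    return VRAM_GB[gpu] >= peak_gb + headroom_gb

BASELINE_PEAK, OPTIMIZED_PEAK = 22.0, 14.2  # GB, from Figure 7a
for gpu in VRAM_GB:
    print(f"{gpu}: baseline={can_train(gpu, BASELINE_PEAK)}, "
          f"optimized={can_train(gpu, OPTIMIZED_PEAK)}")
```

Under this check, the 14.2 GB optimized pipeline fits any 16 GB-class card, while the 22 GB baseline requires a 24 GB workstation GPU, which is the central deployment claim of the paper.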
Table 1. Quantitative comparison with state-of-the-art methods on identity preservation and quality metrics.
| Method | LPIPS ↓ | SSIM ↑ | Identity ↑ | FID ↓ | Memory | GPU |
|---|---|---|---|---|---|---|
| DreamBooth [1] | 0.142 | 0.876 | 0.847 | 24.3 | 22 GB | 24 GB |
| MasterWeaver [9] | 0.138 | 0.882 | 0.856 | 22.8 | 20 GB | 24 GB |
| Subject-Diffusion [12] | 0.145 | 0.871 | 0.841 | 25.7 | 18 GB | 24 GB |
| HP³ [11] | 0.151 | 0.865 | 0.838 | 26.4 | 16 GB | 16 GB |
| FastComposer [39] | 0.148 | 0.863 | 0.831 | 27.2 | 16 GB | 16 GB |
| IP-Adapter [4] | 0.144 | 0.869 | 0.844 | 25.1 | 18 GB | 24 GB |
| **Ours** | **0.139** | **0.879** | **0.852** | **23.1** | **14.2 GB** | **16 GB** |
↓ lower values are better; ↑ higher values are better. HP³: Head-Preserving Portrait Personalization; the superscript 3 denotes the three key properties of the approach. Bold values indicate the proposed method (Ours).
Table 2. Computational efficiency comparison across different frameworks.
| Method | Training Time | Inference Time | Peak Memory | Convergence Steps |
|---|---|---|---|---|
| DreamBooth [1] | 3.2 h | 3.8 s | 22 GB | 800 |
| MasterWeaver [9] | 4.1 h | 4.2 s | 20 GB | 1000 |
| Subject-Diffusion [12] | 2.8 h | 3.5 s | 18 GB | 600 |
| HP³ [11] | N/A (test-time) | 8.2 s | 16 GB | N/A |
| **Ours** | **3.7 h** | **4.2 s** | **14.2 GB** | **800** |
HP³: Head-Preserving Portrait Personalization; the superscript 3 denotes the three key properties of the approach. N/A: Not Applicable; HP³ is a tuning-free test-time method requiring no separate training phase. Bold values indicate the proposed method (Ours).
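Table 2's training times and convergence steps imply a per-step cost that makes the efficiency trade-off explicit. A quick sketch of that arithmetic, using values taken directly from the table (HP³ is omitted since it has no training phase):

```python
# (training hours, convergence steps) from Table 2
METHODS = {
    "DreamBooth": (3.2, 800),
    "MasterWeaver": (4.1, 1000),
    "Subject-Diffusion": (2.8, 600),
    "Ours": (3.7, 800),
}

def seconds_per_step(hours, steps):
    """Convert total training hours to average seconds per step."""
    return hours * 3600 / steps

for name, (h, steps) in METHODS.items():
    print(f"{name}: {seconds_per_step(h, steps):.2f} s/step")
# Ours: 3.7 h over 800 steps = 16.65 s/step, roughly 16% slower per
# step than baseline DreamBooth (14.40 s/step), traded for a
# 7.8 GB reduction in peak memory (22 GB -> 14.2 GB).
```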

Share and Cite

MDPI and ACS Style

Gupta, S.; Ray, K.; Kaiser, S.; Hossain, S.; Faubert, J. Enhanced Facial Realism in Personalized Diffusion Models: A Memory-Optimized DreamBooth Implementation for Consumer Hardware. Algorithms 2026, 19, 257. https://doi.org/10.3390/a19040257
