Review

Efficient and Controllable Image Generation on the Edge: A Survey on Algorithmic and Architectural Optimization

Department of Computer Education, Sungkyunkwan University, Seoul 03063, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(4), 828; https://doi.org/10.3390/electronics15040828
Submission received: 9 January 2026 / Revised: 7 February 2026 / Accepted: 12 February 2026 / Published: 14 February 2026
(This article belongs to the Special Issue Advances in Computer Vision Research and Applications)

Abstract

Since the introduction of denoising diffusion probabilistic models (DDPM) in 2020, diffusion-based image generation has achieved remarkable quality but remains computationally demanding for resource-constrained environments. This survey systematically analyzes over 100 publications from 2020 to 2025, presenting a four-layer optimization stack that encompasses model architecture, controllable mechanisms, sampling algorithms, and model compression. We address the fundamental “quality–efficiency–control” trilemma through three research questions: (1) the architectural complexity gap between U-shaped network (UNet) and diffusion transformer (DiT) models, (2) the parameter overhead spectrum of control mechanisms from ControlNet (42%) to NanoControl (0.024%), and (3) the theoretical impact of quantization and bit-width reduction on information loss. Our analysis reveals that near-instant image generation is achievable through algorithmic innovations such as step distillation and architectural pruning, reducing the sampling steps from 50 to 4–8 (or even 1) and the computational cost by over 90%. We utilize the floating point operations (FLOPs) efficiency ratio (FER) to highlight the discrepancy between theoretical FLOPs reduction and actual efficiency, pointing towards the need for system-level optimization. Key findings demonstrate that DiT architectures exhibit high computational density (FER > 1.6), that low-bit quantization such as 8-bit weight and activation (W8A8) maintains an optimal balance between compression and quality (Fréchet inception distance degradation ΔFID < 1.0), and that lightweight control mechanisms enable sophisticated image control with negligible parameter overhead. This survey provides a comprehensive algorithmic optimization roadmap for practitioners targeting efficient on-device image generation.

1. Introduction

1.1. Motivation: The Gap Between GenAI Demands and Edge Constraints

Deep learning-based generative models have evolved rapidly over the past decade. Generative adversarial networks (GANs) [1] enabled real-time image synthesis but suffered from training instability and mode collapse. Variational autoencoders (VAEs) [2] provided more stable training with explicit latent representations, and subsequent works explored adaptive variants for improved efficiency [3]. Notably, GAN-based approaches continue to advance in specialized domains: recent work on adaptive fused domain-cycling variational GANs [4] has demonstrated improved data synthesis quality through multi-domain feature coordination, while multimodal techniques combining denoising diffusion probabilistic models with audiovisual semantic decoding have achieved robust arc detection in railway systems under data-scarce conditions [5]. These developments illustrate that both adversarial and diffusion paradigms remain actively evolving, with increasing integration of multimodal learning strategies. While these approaches established foundational principles for efficient generation, diffusion models have emerged as the dominant paradigm for image synthesis, due to their superior quality–diversity trade-off.
Since the introduction of denoising diffusion probabilistic models (DDPM) [6] in 2020, diffusion models have driven a paradigm shift in image generation. The original DDPM achieved an unprecedented FID of 3.17 on CIFAR-10 [6], demonstrating high-quality synthesis through iterative denoising. Subsequently, denoising diffusion implicit models (DDIM) [7] enabled 10×–50× faster sampling through non-Markovian processes. In 2022, the latent diffusion model (LDM) [8] by Rombach et al. dramatically reduced computational complexity by performing diffusion in a compressed latent space, publicly released as Stable Diffusion [8], and becoming the foundation for widespread adoption.
However, the inherent characteristics of these large-scale generative models—billions of parameters and tens of iterative sampling steps—have intensified dependency on high-performance GPU servers. An unoptimized Stable Diffusion v1.5 requires prohibitively high latency (exceeding 10 s per image) on mobile GPUs [9], which is unacceptable for real-time applications. In contrast, SnapFusion [9] achieved sub-2 s generation on mobile devices through architecture optimization and progressive distillation. Meanwhile, AI demand in edge environments such as IoT devices, mobile devices, and autonomous vehicles is experiencing explosive growth [10]. While cloud-based inference introduces network latency, privacy concerns, and ongoing operational costs, on-device AI holds the potential to address these challenges [10,11]. Figure 1 illustrates the timeline of this evolution from early diffusion models to today’s edge-native era.

1.2. Problem Definition: The “Efficiency–Quality–Control” Trilemma

Deploying image generation AI in edge environments inherently involves fundamental trade-offs among three conflicting objectives: quality, efficiency, and controllability [19].
The first trade-off exists between quality and efficiency. Diffusion models generate high-quality images through iterative denoising steps, which inevitably increases the inference time [6,7]. Recent methods such as latent consistency models (LCM) [16] and adversarial diffusion distillation (ADD) [17], which extend the concept of consistency models [20], have enabled four-to-eight-step or even single-step generation [16,17]; however, this may be accompanied by texture loss or a “waxy appearance” artifact [21].
The second trade-off exists between controllability and efficiency. ControlNet [12] enables precise spatial control but adds approximately 361 M parameters [12], effectively doubling the memory footprint. This is a primary cause of out-of-memory (OOM) errors in edge environments [22]. T2I-Adapter [23] and IP-Adapter [24] have proposed lightweight adapter approaches, yet substantial overhead remains [23,24].
The third challenge arises from architectural evolution. The transition from UNet-based models to the DiT architecture [13] (SD3 [14], FLUX [15]) has brought advances in quality; however, the computational cost scaling quadratically with the number of tokens has exacerbated memory bottleneck issues [13,25]. SDXL [26] represents the largest UNet-based model with 2.6 B parameters [26], while SD3 has scaled up to 8 B parameters with its Multimodal DiT (MMDiT) architecture [14,27].

1.3. Scope and Contribution: The Four-Layer Software Optimization Stack

This survey provides two main contributions focused on software-centric optimization for edge deployment. First, we propose a four-layer software optimization stack that reflects the practical deployment pipeline from model design to on-device inference. The four layers are defined as follows:
  • Layer I (model architecture optimization): Design-time structural decisions made before or during initial training, including backbone selection (UNet vs. DiT), block configuration, and macro-architecture design (e.g., channel width, attention placement).
  • Layer II (efficient controllability): Integration of control mechanisms into the base architecture, ranging from high-overhead encoder replication (ControlNet) to minimalist attention injection (NanoControl).
  • Layer III (sampling and algorithmic acceleration): Algorithmic techniques that reduce the number of sampling steps or computational cost per step, independent of the underlying architecture (e.g., LCM, rectified flow).
  • Layer IV (model compression): Post-training optimizations applied to pre-trained models prior to deployment, including quantization (FP16 to INT8), pruning, and memory optimization techniques.
This layered taxonomy reflects the sequential nature of the deployment pipeline: practitioners first select an architecture (Layer I), attach control modules if needed (Layer II), choose sampling strategies (Layer III), and finally apply compression for target hardware (Layer IV). Notably, Layers I and IV are distinguished by their temporal position in this pipeline—Layer I involves structural choices that define the model’s computational graph, while Layer IV preserves the architecture and reduces numerical precision or parameter redundancy.
Second, we present a comprehensive analysis of algorithmic efficiency by introducing quantitative frameworks, including the control-efficiency score (CES) and the FLOPs efficiency ratio (FER), and normalizing performance metrics across more than 50 publications to enable a systematic comparison of theoretical gains across optimization techniques.

1.4. Comparison with Existing Surveys

Several recent surveys have addressed efficient diffusion models from various perspectives. Table 1 summarizes the key differences between our work and representative prior surveys.
Our survey differs from prior works in three key aspects. First, we provide the first systematic analysis of the control overhead spectrum, introducing the control-efficiency score (CES) to quantify the trade-off between parameter cost and control fidelity—a dimension that is not formally addressed in existing surveys. Second, we introduce the FLOPs efficiency ratio (FER) to reveal the critical gap between theoretical computational reduction and actual inference performance, highlighting that software optimization alone cannot fully overcome hardware limitations. Third, we specifically target the edge deployment scenario with a practical four-layer optimization pipeline, providing actionable guidance for practitioners deploying diffusion models on resource-constrained devices.

2. Background: Metrics and Baselines

This section establishes the foundational evaluation framework for this survey by defining the key metrics used to assess image generation model performance across three dimensions: distributional quality (FID [30], IS [31]), semantic alignment (CLIP score [32], esthetic score [33]), and control accuracy (mIoU, SSIM [34], RMSE). Additionally, we present the quantitative specifications of baseline diffusion models—including both UNet-based architectures (SD1.5 [8], SDXL [26]) and DiT-based architectures (SD3 [14], FLUX [15])—which serve as reference points for comparative analysis throughout subsequent sections.

2.1. Quality Metrics

2.1.1. Distributional Metrics (FID, IS)

Fréchet inception distance (FID) [30], as defined in Equation (1), is the most widely used quality metric for measuring distributional similarity between generated and real images. The FID is computed as
FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})
where μ_r and μ_g denote the mean feature vectors of the real and generated image distributions, Σ_r and Σ_g represent their respective covariance matrices, and Tr(·) denotes the trace operator [30].
In general, FID scores below 10 are considered excellent, scores in the range of 10–30 are regarded as good, and scores between 30 and 50 are considered moderate [35]. Along with Inception Score (IS) [31], it has become a standard benchmark for evaluating generative models [35].
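The closed form in Equation (1) can be evaluated directly from the two Gaussians’ statistics; a minimal sketch, assuming feature extraction (e.g., via Inception-V3) has already produced the means and covariances:

```python
import numpy as np
from scipy import linalg

def fid_from_stats(mu_r, sigma_r, mu_g, sigma_g):
    """FID between two Gaussians N(mu_r, sigma_r) and N(mu_g, sigma_g),
    following Equation (1)."""
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; keep the real part,
    # since numerical error can introduce tiny imaginary components.
    covmean = linalg.sqrtm(sigma_r @ sigma_g).real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Identical statistics yield FID = 0; shifting the generated mean adds ‖Δμ‖² to the score.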

2.1.2. Semantic and Esthetic Metrics (CLIP, Esthetic Score)

Contrastive Language-Image Pre-training (CLIP) Score [32] measures the semantic alignment between generated images and input text prompts. It is computed as the cosine similarity in the image-text embedding space of the CLIP model [32], with typical values ranging from 20 to 35. Higher scores indicate better text-image alignment [36].
Esthetic score evaluates the visual appeal of images using a model trained to predict human esthetic preferences [33]. The LAION-Esthetics Predictor V2 [33] is a linear model trained on top of CLIP ViT-L/14 embeddings, producing scores on a 1–10 scale. Other human preference-based evaluation models, such as ImageReward [37] and human preference score (HPS) [38], are also being actively researched.
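Concretely, the CLIP score reduces to a scaled cosine similarity between the two embeddings; a minimal sketch, assuming pre-computed image and text embeddings (the ×100 scaling is a common implementation convention consistent with the 20–35 range cited above, not a universal standard):

```python
import numpy as np

def clip_score(image_emb, text_emb, scale=100.0):
    """Scaled cosine similarity between L2-normalized embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    return scale * float(img @ txt)
```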

2.1.3. Control Accuracy Metrics (mIoU, Edge-F1)

Evaluating controllable generation requires specialized metrics for each condition type [12,23]. The following metrics are used according to each condition type, as summarized in Table 2:
  • Depth: Correlation coefficient measurement after depth estimation via MiDaS [39].
  • Pose: Keypoint detection via OpenPose [40], followed by PCKh/OKS calculation.
  • Edge: F1 score measurement after edge detection via Canny [41] or HED [42].
  • Semantic: Segmentation mIoU measurement based on ADE20K [43].
  • Compositionality: Attribute binding, object relationships, and complex composition evaluation via T2I-CompBench [44].
Table 2. Condition-specific control accuracy metrics. This table summarizes the evaluation protocols for different control conditions, detailing the primary metrics, measurement methods, and standard datasets used for quantitative analysis.
Condition Type | Primary Metric | Measurement Method | Standard Dataset
Segmentation | mIoU | Mask2Former [45] extraction | ADE20K [43]
Depth | RMSE, δ thresholds | MiDaS [39] extraction | NYU Depth V2 [46]
Pose (OpenPose) | PCK, mAP (OKS) | OpenPose [40] detection | COCO-Pose [47]
Edge (Canny) | F1-Score | OpenCV Canny [41] | BSDS500 [48]
Edge (HED) | ODS-F, OIS-F | HED [42] detector | BSDS500 [48]
Identity | DINO similarity, ArcFace cosine | Feature extraction | Custom
mIoU, mean intersection over union; RMSE, root mean square error; PCK, percentage of correct keypoints; mAP, mean average precision; OKS, object keypoint similarity; ODS-F, optimal dataset scale F-measure; OIS-F, optimal image scale F-measure; DINO [49], self-distillation with no labels; HED, holistically nested edge detection [42]; and ArcFace [50].

2.2. Theoretical Efficiency Metrics

In the context of software optimization, we focus on theoretical metrics that are independent of specific hardware implementations [51].

2.2.1. Computational Complexity (FLOPs, MACs)

Floating point operations (FLOPs) and multiply accumulate operations (MACs) represent the theoretical computational complexity of a model, which is independent of specific hardware implementations [51]. Since one MAC consists of one multiplication and one addition, 1 MAC is approximately equivalent to 2 FLOPs. To ensure consistency across the literature, this survey standardizes all computational cost values in Giga MACs (GMACs), converting reported FLOPs values where necessary.
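The conversion underlying this standardization is a one-liner:

```python
def flops_to_gmacs(flops: float) -> float:
    """One MAC = one multiply + one add, i.e., approximately 2 FLOPs,
    so GMACs = FLOPs / 2 / 1e9."""
    return flops / 2.0 / 1e9

# Example: a model reported at 119 GFLOPs corresponds to 59.5 GMACs.
```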

2.2.2. Model Size and Parameters

The number of parameters is directly related to the model’s storage requirements and memory bandwidth consumption during inference [52]. At FP32 precision, 1 M parameters occupy approximately 4 MB of memory, which reduces to 2 MB at FP16, 1 MB with INT8 quantization, and 0.5 MB at INT4 [53]. For instance, SD1.5 with approximately 860 M parameters requires roughly 1.7 GB at FP16 precision [8], while FLUX.1-dev with 12 B parameters demands approximately 24 GB [15], highlighting the critical importance of quantization for edge deployment [53,54].
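The storage figures above follow directly from the bytes-per-parameter at each precision; a small helper reproducing them:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_gb(params_millions: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * BYTES_PER_PARAM[precision] / 1e9

# SD1.5: ~860 M params at FP16 -> ~1.72 GB
# FLUX.1-dev: 12 B params at FP16 -> ~24 GB
```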

2.2.3. Data Precision (Bit-Width)

The primary precision levels applied to diffusion models span a wide range, depending on deployment targets [53,54]: FP32 serves as the training baseline with maximum numerical stability; FP16/BF16 has become the GPU inference standard, offering 2× memory reduction with negligible quality loss [55]; INT8 represents the key target for edge and mobile neural processing unit (NPU) deployment, due to broad hardware support and favorable quality–efficiency trade-offs [54]; and INT4 enables extreme compression for memory-constrained scenarios, albeit with potential quality degradation [56]. The choice of bit-width directly impacts both model size and inference throughput, making it a critical optimization lever for on-device deployment [53].

2.3. Baseline Architectures

2.3.1. UNet-Based Models (SD1.5, SDXL)

Stable Diffusion 1.5 (SD1.5) [8] has been the most widely studied baseline model since its release in 2022. It adopts a hybrid architecture combining ResNet blocks [57] and transformer blocks [58] on a UNet [59] backbone—a symmetric encoder–decoder architecture with skip connections that was originally proposed for biomedical image segmentation. SDXL [26] employs a U-Net approximately three times larger than that of SD1.5 (~2.6 B parameters) [26] and improves prompt understanding through dual text encoders [26].

2.3.2. DiT-Based Models (SD3, FLUX)

In 2024, a paradigm shift occurred toward pure transformer-based DiT architectures [13]. SD3 [14] adopts the MMDiT architecture and introduces the rectified flow [60] training approach. FLUX.1 [15] is a rectified flow transformer model with 12 B parameters [15]. Table 3 summarizes the quantitative specifications of these baseline models, and Figure 2 illustrates the architectural differences between UNet-based and DiT-based approaches.

3. Layer I: Model Architecture Optimization

This layer focuses on structural modifications to the diffusion backbone to minimize computational redundancy while maintaining generative capability [63].

3.1. Structural Complexity Analysis

3.1.1. Convolution (UNet) vs. Attention (DiT) Complexity

UNet [59] features a symmetric encoder–decoder structure dominated by convolution operations, which is advantageous for leveraging convolution accelerators in mobile NPUs [64]. In contrast, DiT [13] is centered on attention operations, making a surge in computational cost inevitable as the number of tokens increases, with hardware optimization requirements that differ substantially [13,25].

3.1.2. The Quadratic Cost of Global Attention

DiT architectures demonstrate significant computational intensity (see the baseline specifications in Table 3); for instance, DiT-XL/2 achieves an FID of 2.27 on ImageNet 256 × 256 [13] but requires 119 GFLOPs of computation [13]. The standard UNet in Stable Diffusion [8] also suffers from inefficiencies, characterized by excessive channel counts (up to 1280) [8], uniform block allocation, and high-resolution attention complexity (O(N²)) [25].

3.1.3. Memory Access Patterns and Hardware Friendliness

Beyond computational complexity, UNet and DiT architectures exhibit fundamentally different memory access characteristics. UNet is dominated by convolution operations with localized, predictable memory access patterns, enabling efficient cache utilization on mobile NPUs optimized for CNN workloads [52,64]. In contrast, DiT relies on self-attention operations requiring global memory access with O(N²) intermediate storage, resulting in higher bandwidth demands and reduced cache efficiency [25,65]. This memory-bound characteristic explains the elevated FER values (>1.6) for DiT models, as quantified later in the FER analysis (Section 8).
These differences have direct deployment implications: UNet achieves better hardware utilization on current mobile NPUs [64], while DiT benefits from memory-efficient implementations like FlashAttention [25]. Hybrid approaches such as MobileDiffusion [18] leverage both paradigms by concentrating attention in low-resolution bottleneck regions.

3.2. Neural Architecture Search (NAS) Approaches

To address these structural inefficiencies, NAS-based optimization has been applied [66]. SnapFusion [9] utilized evolutionary algorithms and discovered that while the full structure is necessary in the early stages of generation, the contribution of the middle blocks decreases sharply in later stages [9]. It also confirmed that blocks in the up-sampling path are more important than those in the down-sampling path [9].

3.3. Efficient Backbone Design

3.3.1. Block Removal and Channel Reduction

Building on NAS findings, SnapFusion [9] reduced the number of channels in the U-Net middle block and replaced computationally intensive blocks with efficient operators [9]. Similarly, MobileDiffusion [18] extended lightweight design beyond the U-Net to the text encoder and variational autoencoder (VAE) decoder [2], adopting a hybrid strategy that concentrates attention operations in lower-resolution bottleneck regions [18].

3.3.2. Depth-Wise Separable Convolutions (Mobile-Optimized UNet)

Decomposing standard 3 × 3 convolutions into depth-wise and point-wise operations theoretically enables 8–9× FLOPs reduction [67]. MobileDiffusion [18] applied this principle, first introduced in MobileNet [67], to the U-Net architecture, achieving substantial performance improvements [18]. Figure 3 compares the structural differences between the standard UNet and mobile-optimized variants, and Table 4 quantifies the FLOPs reduction achieved by each optimization.
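The 8–9× figure follows from the MACs ratio of a standard k × k convolution to its depthwise + pointwise factorization; a quick check:

```python
def separable_speedup(k: int, c_in: int, c_out: int, h: int, w: int) -> float:
    """MACs of a standard k x k convolution divided by MACs of its
    depthwise + pointwise (1x1) factorization; the ratio approaches
    k^2 (= 9 for k = 3) as c_out grows."""
    standard = k * k * c_in * c_out * h * w
    separable = k * k * c_in * h * w + c_in * c_out * h * w
    return standard / separable
```

For example, with k = 3 and 256 output channels the ratio is roughly 8.7, consistent with the theoretical 8–9× reduction.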

4. Layer II: Efficient Controllability

This layer analyzes the spectrum of control mechanisms, ranging from heavy model replication to minimalist architectural integration [28].

4.1. High-Overhead Mechanisms

4.1.1. ControlNet: The Cost of Model Replication

Since its proposal by Zhang et al. in 2023 [12], ControlNet [12] has become the de facto standard for spatial control. The core idea is to create a trainable copy of the pre-trained Stable Diffusion encoder and connect it to the original model through zero convolutions [12]. Zero convolution refers to a 1 × 1 convolution layer with the weights and biases initialized to zero [12]. This design does not affect the behavior of the original model during early training, while progressively injecting control signals [12]. For SD1.5, ControlNet adds approximately 361 M parameters [12] per condition type, corresponding to approximately 42% overhead, relative to the base model [12]. Figure 4 illustrates the evolution of control overhead from heavy CNN-based adapters to efficient DiT-centric solutions.

4.1.2. ControlNet++ and Cycle Consistency

ControlNet++ [69] (ECCV 2024) introduced cycle consistency optimization to address pixel-level alignment issues. By leveraging a pre-trained discriminative reward model to optimize consistency loss between generated images and input conditions, it achieved improvements of +11.1% mIoU for segmentation, +13.4% SSIM for edges, and +7.6% RMSE for depth [69], compared to ControlNet [12].

4.2. Low-Overhead Adapters

4.2.1. T2I-Adapter and IP-Adapter

T2I-Adapter [23] (AAAI 2024) is a lightweight adapter using only approximately 77 M parameters [23]. Unlike ControlNet [12], which replicates the encoder, it injects condition features through simple addition at four scales of the U-Net encoder [23]. IP-Adapter [24] introduced decoupled cross-attention layers for image prompt conditioning [24]. It achieves competitive performance with only approximately 22 M trainable parameters [24], with full compatibility with text prompts, ControlNet, and T2I-Adapter being its key advantage [24].

4.2.2. LoRA-Based Control

Low-Rank Adaptation (LoRA) [70] (ICLR 2022) learns low-rank decomposition matrices A and B instead of directly modifying weight matrices [70]. With typical rank settings of 4–128, it requires only 1–6 MB of storage, which is approximately 500–1000× more efficient than DreamBooth [71]. Weight-Decomposed Low-Rank Adaptation (DoRA) [72] (ICML 2024) further improved LoRA’s performance by decomposing weights into magnitude and direction components, enhancing learning capacity and stability without additional inference overhead compared to LoRA [72].
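The storage figures above follow from the rank decomposition: for a weight W with shape d_out × d_in, LoRA trains an update ΔW = B·A with B of shape d_out × r and A of shape r × d_in. A minimal sketch of the parameter count (the layer dimensions here are illustrative, not taken from any specific model):

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of the low-rank update Delta-W = B @ A."""
    return rank * (d_out + d_in)

# Illustrative 4096 x 4096 projection at rank 8:
full = 4096 * 4096                    # ~16.8 M parameters
low_rank = lora_params(4096, 4096, 8) # 65,536 parameters (~0.4%)
```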

4.3. Architecture-Integrated Control (Minimalist)

4.3.1. ControlNet-XS: Architectural Bottlenecking

ControlNet-XS [73] (ECCV 2024) introduces high-bandwidth bidirectional communication between the control branch and generation branch, using only 14 M parameters for SD1.5 and 48 M parameters for SDXL [73]. This corresponds to 1–2% of the base model, achieving lower FID than the original ControlNet while reducing memory usage to approximately 1.5× the baseline [73].

4.3.2. NanoControl and OminiControl: Attention/KV Injection

OminiControl [74] (ICCV 2025) is a universal control framework for DiT models that enables arbitrary condition control with only 0.1% additional parameters [74]. It leverages parameter reuse, utilizing DiT’s own VAE encoder and transformer blocks, along with unified sequence processing, as its core mechanisms [74]. NanoControl [68] (arXiv 2025) achieved an extremely low overhead of a 0.024% parameter increase and a 0.029% computational increase [68] compared to the original ControlNet [12]. The key innovation is the “KV-Context Augmentation” technique, which uses LoRA-style low-rank matrices to directly inject control signals into the key and value of the attention layers [68]. As visualized in Figure 5, this minimalist approach achieves competitive control fidelity with dramatically reduced parameter overhead, revealing diminishing returns beyond approximately 100 M parameters.

4.4. Control Accuracy vs. Generalization Trade-Off

An inherent tension exists between the control fidelity and generalization capability across different control mechanisms, as summarized in Table 5.
Specialist mechanisms like ControlNet achieve superior per-condition accuracy but require separate model instances for each control type, posing significant memory constraints for edge deployment. In contrast, generalist approaches such as UniControl and NanoControl trade marginal accuracy for universal applicability, enabling scenarios where loading multiple control models is infeasible. LoRA-based approaches offer a flexible middle ground through composability, allowing condition-specific modules to be merged or swapped at inference time with minimal overhead [70,76]. The choice between these approaches ultimately depends on the deployment requirements: precision-critical applications favor specialists, while resource-constrained interactive tools benefit from lightweight generalist solutions.

5. Layer III: Sampling and Algorithmic Acceleration

This layer addresses the iterative nature of diffusion models, focusing on algorithms that reduce the number of sampling steps or the computational cost per step without compromising generative quality [19,28].

5.1. Step Reduction Algorithms

5.1.1. Latent Consistency Models (LCM)

LCM [16] leverages the consistency principle [20] to train models to instantly predict the original data from any point along the ordinary differential equation (ODE) trajectory [77]. The core idea is to impose the constraint that all intermediate states in the diffusion process must converge to the same final output [16,20]. LCM achieves approximately 5–12× speedup at 4–8 steps compared to conventional 20–50 step solvers, with FID degradation typically limited to ΔFID < 3 [16]. When deployed as a lightweight adapter in the form of LCM-LoRA [76], existing models can be converted to high-speed four-step models with less than 100 MB of additional parameters [76].
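The consistency constraint can be written as a simple self-consistency loss: the model’s data predictions from any two points on the same trajectory must agree. A toy sketch, where f stands in for the consistency model:

```python
import numpy as np

def consistency_loss(f, z_t, t, z_s, s):
    """Mean squared disagreement between the model's predictions from
    two points (z_t, t) and (z_s, s) on one ODE trajectory."""
    return float(np.mean((f(z_t, t) - f(z_s, s)) ** 2))
```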

5.1.2. Rectified Flow and Flow Matching

Flow matching [77] is an approach that models the optimal path connecting noise and data distributions as a deterministic ODE [77]. As formalized in Equation (2), rectified flow [60] performs ODE learning to “straighten the trajectory,” whereby the closer the path is to a straight line, the more accurately noise can be mapped to data with a single Euler step [60].
dz_t/dt = v_θ(z_t, t),  z_0 ∼ N(0, I),  z_1 = x
where z_t represents the latent state at time t ∈ [0, 1], v_θ is the learned velocity field parameterized by θ, z_0 is the initial noise sampled from a standard Gaussian distribution, and z_1 corresponds to the target data [60].
InstaFlow [78] was the first to apply rectified flow to text-to-image generation, achieving an FID of 13.1 (MS COCO 2014-30k) [78] in 0.09 s [78] and demonstrating the feasibility of single-step generation. PeRFlow [79] proposed dividing the entire time window into multiple segments and straightening each segment individually, enabling it to serve as a plug-and-play acceleration module while maintaining a high quality, even in four-step generation [79].
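Equation (2) makes single-step generation intuitive: if the learned trajectory is perfectly straight, one Euler step from noise reaches the data. A toy sketch with a hypothetical velocity field (v_theta here is any callable, not a trained network):

```python
import numpy as np

def euler_sample(v_theta, z0, num_steps):
    """Integrate dz/dt = v_theta(z, t) from t = 0 (noise) to t = 1 (data)
    with fixed-step Euler."""
    z, dt = z0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * v_theta(z, i * dt)
    return z

# A perfectly "rectified" (constant-velocity) field maps noise to the
# target in a single step: v(z, t) = x_target - z0.
```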

5.2. One-Step Generation Techniques

Adversarial diffusion distillation (ADD) [17] is a hybrid approach combining knowledge distillation with adversarial training [17]. By leveraging pre-trained discriminators such as DINOv2 [80] to train generated images to be indistinguishable from real images, it maintains quality even in single-step generation [17]. Latent adversarial diffusion distillation (LADD) [81] extended the discriminative process to operate directly in latent space, enabling generation of high-resolution multi-aspect-ratio images in only four steps, as demonstrated by SD3-Turbo [81].

5.3. Algorithmic Caching Strategies

DeepCache [82] proposed a training-free acceleration method that exploits temporal redundancy across sequential denoising steps by reusing high-level features and only updating low-level features [82]. It achieved 2.3× speedup on Stable Diffusion v1.5 with only a 0.05 decrease in CLIP score [82], and 4.1× speedup on LDM-4-G (ImageNet) with only a 0.22 decrease in FID [82].
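The scheduling idea behind this caching can be sketched independently of any network: recompute the expensive deep features only every N steps and reuse the cached copy otherwise (deep_fn and shallow_fn are stand-ins for the UNet’s deep and shallow paths):

```python
def cached_denoise(z, num_steps, refresh_every, deep_fn, shallow_fn):
    """Run num_steps denoising iterations, refreshing the cached
    high-level features only every `refresh_every` steps."""
    cache = None
    for step in range(num_steps):
        if step % refresh_every == 0:
            cache = deep_fn(z)       # expensive path, run sparsely
        z = shallow_fn(z, cache)     # cheap path, reuses the cache
    return z
```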

6. Layer IV: Model Compression Techniques

This layer focuses on reducing the bit-width and redundancy of the model parameters to minimize memory footprint and bandwidth usage [53].

6.1. Quantization Algorithms

6.1.1. Post-Training Quantization (PTQ) for Diffusion

PTQ4DM [56] introduced a time-aware calibration strategy that maintains, or even improves, performance while quantizing diffusion models to 8-bit [56]. Q-Diffusion [54] combined split quantization with timestep-wise calibration [54]. PTQD [83] separated quantization noise into correlated and uncorrelated components and introduced a mixed-precision approach, minimizing the FID increase to 0.06 [83] compared to FP LDM-4 on ImageNet 256 × 256 while achieving 19.9× reduction in bit operations [83]. MixDQ [84] applies a metric-decoupled strategy that quantizes quality-sensitive layers at high precision and latency-sensitive layers at low precision [84]. Among the latest research, SVDQuant [85] utilizes low-rank branches to absorb outliers in both weights and activations, thereby maintaining visual quality even at 4-bit quantization [85].
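All of the above build on the same primitive: mapping floating-point tensors onto a low-bit grid with a per-tensor (or per-channel) scale. A minimal symmetric INT8 sketch — not the time-aware calibration itself, which additionally selects calibration samples across timesteps:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q.
    Assumes w is not all-zero (a real implementation guards the scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The worst-case rounding error is half the scale, which is why outlier values (addressed by SVDQuant’s low-rank branches) inflate the error for every other weight.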

6.1.2. Impact of Bit-Width Reduction (Information Loss Analysis)

The primary precision levels applied to diffusion models range from FP32 to INT4 [53]. Table 6 summarizes the effects of various quantization precisions on model size and image quality [53,54,56]. The W8A8 and W4A16 combinations have emerged as the optimal region where substantial acceleration and minimal quality loss are balanced [54,84], as visualized in the precision trade-off heatmap in Figure 6.

6.2. Structural Pruning and Distillation

Pruning physically removes unnecessary parameters from a model, simultaneously reducing FLOPs and model size [51,86]. Diff-Pruning [87] utilized Taylor expansion over timesteps to identify important weights, achieving approximately 50% FLOPs reduction [87] while maintaining consistent generation quality. LD-Pruner [88] proposed a task-agnostic metric for estimating output sensitivity in latent space, achieving approximately 34.9% inference speedup [88]. EcoDiff [89] is a model-agnostic structural pruning framework that is capable of removing up to 20% of parameters without retraining [89]. Synthesizing data from SnapFusion [9] and MobileDiffusion [18], the threshold of FLOPs that can be removed without visual collapse in diffusion models is estimated at approximately 50–60% [9,18,87]. Compression beyond this point leads to quality degradation, such as texture loss and structural distortion [21,87].

6.3. Algorithmic Memory Optimization

Memory consumption during diffusion model inference consists of model weights, intermediate activations, and key-value (KV) cache [52,65]. Tiling is a technique that controls memory usage by dividing high-resolution images into smaller tiles and processing them sequentially [90]; when splitting a 1024 × 1024 image into 512 × 512 tiles, peak video random access memory (VRAM) usage can be reduced by approximately 62%, from approximately 4 GB to 1.5 GB [90]. FlashAttention [25] reduces the memory complexity of attention operations from O(N²) to O(N) [25]. FlashAttention-2 [91] achieved approximately 2× additional speedup over FlashAttention through improved parallelization and work partitioning [91]. Gradient checkpointing [92] reduces memory consumption from O(n) to O(√n) by storing only a subset of intermediate activations during training and recomputing them as needed [92]. Table 7 summarizes the effectiveness of these memory optimization techniques, comparing their impact on peak VRAM usage and processing speed.
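The tiling arithmetic can be made concrete. Activation memory scales with pixel count, so processing 512 × 512 tiles of a 1024 × 1024 image sequentially needs roughly a quarter of the peak activation memory; the per-pixel byte count below is a hypothetical constant, and constant weight memory plus tile-overlap overheads explain why the reported saving is ~62% rather than the full 75%:

```python
# Illustrative sketch of the tiling example above. bytes_per_pixel is a
# hypothetical constant standing in for per-pixel activation footprint.
def activation_mem(height, width, bytes_per_pixel):
    """Peak activation memory (bytes) for one forward pass at this resolution."""
    return height * width * bytes_per_pixel

full = activation_mem(1024, 1024, bytes_per_pixel=4096)
tile = activation_mem(512, 512, bytes_per_pixel=4096)
assert tile * 4 == full       # four tiles cover the full image
assert tile / full == 0.25    # peak activation memory drops to 25% per tile
```

The trade-off is latency: the four tiles are processed sequentially, so tiling trades peak VRAM for wall-clock time, which Table 7 quantifies.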

7. Quantitative Analysis (Software-Centric)

This chapter presents an integrated quantitative analysis of the software-centric optimization techniques examined in previous sections, focusing on theoretical efficiency and algorithmic performance [28,63].

7.1. Control-Efficiency Analysis

7.1.1. Parameter Efficiency Leaderboard (CES)

To enable fair comparison across different control mechanisms, we define the control-efficiency score (CES), as shown in Equation (3). Table 8 presents the comprehensive Control-Cost Leaderboard, ranked by CES.
CES = Control Quality Score / log₁₀(Added Params (M) + 1)
where control quality score represents condition-specific accuracy metrics (mIoU, SSIM, F1, etc.) normalized to a 0–100 scale and added params (M) denotes the number of additional trainable parameters in millions, introduced by the control mechanism. Additionally, we calculate the normalized overhead ratio (NOR), using SD1.5 [8] (~860 M) as the baseline model.
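Equation (3) and the NOR baseline can be transcribed directly; the quality score and parameter counts in the example below are illustrative placeholders, not values from Table 8:

```python
# Sketch of Equation (3) (CES) and the normalized overhead ratio (NOR).
# The quality score and module sizes here are hypothetical examples.
import math

SD15_PARAMS_M = 860.0  # SD1.5 baseline size in millions, for NOR

def ces(quality_0_100, added_params_m):
    """Control-efficiency score: quality per log-scaled parameter overhead."""
    return quality_0_100 / math.log10(added_params_m + 1)

def nor(added_params_m):
    """Normalized overhead ratio relative to the SD1.5 baseline."""
    return added_params_m / SD15_PARAMS_M

# A 361 M-parameter ControlNet-style module (~42% overhead) vs. a
# hypothetical 0.2 M lightweight module, at the same quality score:
heavy, light = ces(80, 361.0), ces(80, 0.2)
assert light > heavy                 # fewer added params -> higher CES
assert abs(nor(361.0) - 0.42) < 0.01 # matches the ~42% ControlNet overhead
```

The logarithmic denominator is what makes the leaderboard sensitive to order-of-magnitude differences in overhead rather than absolute parameter counts, which is why sub-megabyte mechanisms dominate the ranking.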

7.1.2. Parameter Count vs. Control Quality

The key pattern observed in high-overhead mechanisms is a log-linear relationship between parameter increase and control fidelity (C-FID) [96]. Initial parameter additions yield rapid improvements in control quality, but diminishing returns appear beyond a certain threshold (approximately 100 M) [12,69,95]. Table 9 provides a detailed condition-specific comparison of control quality across different methods.

7.2. Algorithmic Efficiency Analysis

We analyze the relationship between algorithmic speedup and quality degradation [97]. Table 10 presents a comprehensive comparison of state-of-the-art models and optimization techniques, from which the Pareto frontier reveals a “Sweet Spot” at 6–12× speedup, where models achieve significant acceleration with minimal quality loss (ΔFID < 3) [9,16,18].

7.3. Theoretical Computational Reduction

Table 11 decomposes the optimization process from SD1.5 [8] to MobileDiffusion [18] stage by stage, analyzing the individual contribution of each technique to the theoretical speedup factor.
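Under a simple independence assumption, such per-stage speedup factors compose multiplicatively. The sketch below illustrates this composition; only the 6.25× step-reduction factor is taken from this survey's analysis, while the other factors are hypothetical placeholders rather than Table 11 entries:

```python
# Hedged sketch of stage-by-stage speedup composition. Only the 6.25x
# step-reduction factor comes from the survey's analysis; the remaining
# factors are hypothetical placeholders, not Table 11 values.
from functools import reduce

stages = {
    "step reduction (50 -> 8 steps)": 6.25,  # reported in this survey
    "architecture slimming":          2.0,   # hypothetical
    "W8A8 quantization":              2.0,   # hypothetical
}

total = reduce(lambda acc, factor: acc * factor, stages.values(), 1.0)
assert total == 25.0  # factors multiply under the independence assumption
```

In practice the factors are not fully independent (e.g., quantization overhead per step shrinks as steps are removed), which is exactly why the FER analysis in Section 8 is needed to reconcile theoretical and measured speedups.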

8. Discussion: The Limits of Software Optimization

8.1. Summary of Algorithmic Optimization Gains

Through the four-layer software optimization stack (architecture, controllability, sampling, compression), we have observed dramatic reductions in theoretical computational costs [28,63]. Step reduction algorithms like LCM [16] and ADD [17] have decreased sampling steps by over 90% (from 50 to 1–4 steps), while architectural optimizations [9,18] and quantization [53,54] have reduced model size and FLOPs by more than 50%. Collectively, these software-centric interventions theoretically offer up to 100× efficiency gains compared to baseline models [63].

8.2. The Gap: Theoretical FLOPs vs. Real-World Performance

However, theoretical FLOPs reduction does not directly translate to actual latency reduction [52,103]. To quantify this discrepancy, we define the FLOPs efficiency ratio (FER), as shown in Equation (4):
FER = FLOPs Ratio / Latency Ratio = (F_baseline / F_optimized) / (L_baseline / L_optimized)
where F_baseline and F_optimized denote the FLOPs of the baseline and optimized models, respectively, and L_baseline and L_optimized represent their corresponding inference latencies. An FER of 1.00 indicates perfect alignment between theoretical and actual speedup, FER < 1.00 suggests optimizations exceeding theoretical expectations (e.g., improved memory access patterns), and FER > 1.00 indicates a slower-than-expected performance due to memory bottlenecks or hardware inefficiencies [103,104]. As quantified in Table 12, this discrepancy is particularly pronounced in DiT architectures [13].

8.3. Future Work: Towards System-Level Optimization

The correlation analysis reveals that the actual throughput has a higher correlation with memory bandwidth (r = 0.89) than with NPU TOPS (r = 0.42), confirming that diffusion models are memory-bound workloads [103,104,105]. Software optimizations alone cannot fully overcome physical bandwidth limitations [52,104]. Our quantitative findings from Section 7 and Section 8 suggest several concrete directions for future research.

8.3.1. Hardware–Software Co-Design for Memory-Bound Architectures

The FER analysis in Table 12 reveals that DiT architectures exhibit FER values exceeding 1.6, indicating that their actual performance falls significantly behind theoretical expectations due to memory bottlenecks. This finding motivates research on memory-aware scheduling algorithms that dynamically allocate computation based on bandwidth availability [106], as well as dedicated inference engines [101,107] optimized for the global memory access patterns that are characteristic of attention-dominant architectures. Hardware techniques such as memory tiling [90] and FlashAttention variants [25,91] should be co-designed with NPU hardware to bridge the gap between theoretical FLOPs reduction and real-world latency improvement [108].

8.3.2. Lightweight Control for Next-Generation Architectures

The CES analysis in Table 8 demonstrates that NanoControl achieves the highest parameter efficiency (CES = 156.4) with only 0.024% overhead, suggesting a paradigm shift toward KV-context injection as the preferred control mechanism for DiT models. Future work should explore whether this minimalist injection principle can be generalized to other conditional generation tasks beyond spatial control, such as temporal conditioning for video generation or 3D structural control. The diminishing returns observed beyond approximately 100 M parameters (Figure 5) further indicate that research efforts would be better directed toward more efficient injection architectures, rather than scaling the existing control modules.

8.3.3. Cross-Layer Optimization Interactions

While this survey analyzed each optimization layer independently, significant opportunities exist in understanding cross-layer interactions. For instance, the impact of INT8 quantization (Layer IV) on control signal fidelity (Layer II) remains underexplored—preliminary evidence from Table 6 suggests that W8A8 maintains ΔFID < 1.0, but the interaction with lightweight control mechanisms such as NanoControl warrants dedicated investigation. Similarly, the interplay between step reduction algorithms (Layer III) and architectural pruning (Layer I) could yield synergistic efficiency gains beyond their individual contributions, as suggested by the cumulative analysis in Table 11.

8.3.4. Extension to Multimodal and Domain-Specific Generation

The optimization techniques surveyed in this paper are not limited to text-to-image generation. Recent advances have demonstrated the effectiveness of diffusion-based feature extraction in multimodal industrial applications, such as arc detection in railway systems [5], and adaptive GAN-based approaches for data-scarce environments [4]. These developments suggest that the four-layer optimization framework proposed in this survey could be adapted for video generation, 3D content creation, and domain-specific applications where real-time, on-device inference is critical [10,11]. The extension of step reduction (Layer III) and quantization (Layer IV) techniques to video diffusion models, where temporal consistency adds an additional dimension to the quality–efficiency–control trilemma, represents a particularly promising research direction.

9. Conclusions

This survey systematically analyzed over 100 publications from 2020 to 2025, presenting a software-centric taxonomy that encompasses control overhead, algorithmic efficiency, and model compression [19,28,29]. We addressed the “efficiency–quality–control” trilemma through three core research questions.
  • RQ1 (Architecture): The DiT architecture [13] exhibits substantially increased parameters and computational requirements compared to UNet [59], with FER analysis indicating significant memory bottlenecks [103,104,109,110].
  • RQ2 (Control): Control mechanism overhead spans a wide spectrum, with OminiControl [74] being essential for DiT architectures and ControlNet-XS [73] providing the optimal balance for UNet-based models.
  • RQ3 (System): W8A8 quantization [54] represents an efficient precision level for deployment, balancing model size reduction with minimal quality degradation [53,111].
This survey makes three primary contributions to the field of efficient diffusion models. First, we proposed the four-layer software optimization stack, providing a clear hierarchical framework for understanding and applying optimization techniques across architecture, controllability, sampling, and compression layers. Second, we established comprehensive benchmarking frameworks, including the Control-Cost Leaderboard and the efficiency–quality Pareto frontier [97], enabling systematic comparison of different approaches. Third, we identified and quantified the discrepancy between theoretical efficiency gains and real-world performance through the FLOPs efficiency ratio (FER), revealing critical bottlenecks that software optimization alone cannot address [103].
Our analysis reveals several key insights that shape the future of on-device image generation. Sub-second generation has transitioned from aspiration to reality through the synergistic combination of architectural innovations like MobileDiffusion [18] and advanced optimization techniques [16,54,112,113], enabling truly real-time interactive applications on mobile devices [10,11]. The Pareto frontier analysis (Table 10) identifies a practical “sweet spot” at 6–12× speedup with ΔFID < 3, providing practitioners with a concrete guideline for balancing generation speed and visual quality in deployment scenarios. The cumulative optimization pathway from SD1.5 to MobileDiffusion (Table 11) further demonstrates that step reduction contributes the largest individual speedup factor (6.25×), suggesting that sampling algorithm selection should be prioritized in the optimization pipeline.
While DiT architectures [13] guarantee superior image quality, they introduce significant memory constraints that demand specialized optimizations such as FlashAttention [25] to achieve practical deployment [25,91]. The emergence of lightweight control technologies exemplified by OminiControl [74] demonstrates a paradigm shift: sophisticated image control with minimal parameter overhead is not only possible but achievable at scale, challenging the conventional wisdom that control quality requires substantial computational resources [68,69,74].
This survey establishes a comprehensive algorithmic foundation for efficient image generation on resource-constrained devices [10,11]. However, the gap between theoretical efficiency and actual performance underscores the limitations of purely software-centric approaches [103]. The FER metric introduced in this work provides a quantitative tool for identifying such bottlenecks, enabling researchers and hardware designers to target specific architectural inefficiencies, rather than relying on aggregate speedup figures. Furthermore, the growing adoption of diffusion models in domain-specific applications—including multimodal fault detection in industrial systems [4,5]—highlights the broad applicability of the optimization techniques surveyed here. Future research must pivot toward bridging the theory–practice divide through hardware–software co-design [106,114,115], exploring cross-layer optimization interactions that exploit synergies between architecture, control, sampling, and compression, and extending the four-layer framework to emerging modalities such as video and 3D generation. Developing acceleration strategies tailored to the memory-bound nature of diffusion workloads [104,105] and investigating how lightweight control mechanisms perform under aggressive quantization remain critical open problems for realizing the full potential of on-device generative AI.

Author Contributions

Conceptualization, S.-J.H. and C.-S.P.; methodology, S.-J.H.; software, S.-J.H. and C.-S.P.; validation, S.-J.H. and C.-S.P.; formal analysis, S.-J.H.; investigation, S.-J.H.; resources, C.-S.P.; data curation, S.-J.H.; writing—original draft preparation, S.-J.H.; writing—review and editing, C.-S.P.; visualization, S.-J.H.; supervision, C.-S.P.; project administration, C.-S.P.; funding acquisition, C.-S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Republic of Korea, under Grant RS-2025-23323861. This research was supported by the Technology Development Program (RS-2025-25439892) funded by the Ministry of SMEs and Startups (MSS, Republic of Korea). This research was supported by the MSIT (Ministry of Science and ICT), Republic of Korea, under the Graduate School of Virtual Convergence support program (IITP-2026-RS-2023-00254129) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). This research was supported by the “Regional Innovation System & Education (RISE)” through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-018-05).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

This research received technical consultation from AIPro Co., Ltd.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  2. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  3. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1558–1566. [Google Scholar]
  4. Wang, X.; Jiang, H.; Zeng, T.; Dong, Y. An Adaptive Fused Domain-Cycling Variational Generative Adversarial Network for Machine Fault Diagnosis under Data Scarcity. Inf. Fusion 2026, 126, 103616. [Google Scholar] [CrossRef]
  5. Yan, J.; Cheng, Y.; Zhang, F.; Li, M.; Zhou, N.; Jin, B.; Wang, H.; Yang, H.; Zhang, W. Research on Multimodal Techniques for Arc Detection in Railway Systems with Limited Data. Struct. Health Monit. 2025. [Google Scholar] [CrossRef]
  6. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
  7. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  8. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  9. Li, Y.; Wang, H.; Jin, Q.; Hu, J.; Chemerys, P.; Fu, Y.; Wang, Y.; Tulyakov, S.; Ren, J. SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  10. Chen, J.; Ran, X. Deep Learning with Edge Computing: A Review. Proc. IEEE 2019, 107, 1655–1674. [Google Scholar] [CrossRef]
  11. Xu, D.; Li, T.; Li, Y.; Su, X.; Tarkoma, S.; Jiang, T.; Crowcroft, J.; Hui, P. Edge Intelligence: Empowering Intelligence to the Edge of Network. Proc. IEEE 2021, 109, 1778–1837. [Google Scholar] [CrossRef]
  12. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
  13. Peebles, W.; Xie, S. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4195–4205. [Google Scholar]
  14. Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorber, D.; Sber, D.; Salimans, T.; et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  15. Black Forest Labs. FLUX.1 Technical Report. 2024. Available online: https://blackforestlabs.ai/ (accessed on 14 February 2025).
  16. Luo, S.; Tan, Y.; Huang, L.; Li, J.; Zhao, H. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  17. Sauer, A.; Lorenz, D.; Blattmann, A.; Rombach, R. Adversarial Diffusion Distillation. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  18. Zhao, Y.; Li, Y.; Ge, Z.; Lin, G. MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  19. Cao, H.; Tan, C.; Gao, Z.; Xu, Y.; Chen, G.; Heng, P.A.; Li, S.Z. A Survey on Generative Diffusion Models. IEEE Trans. Knowl. Data Eng. 2024, 36, 2814–2830. [Google Scholar] [CrossRef]
  20. Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 32211–32252. [Google Scholar]
  21. Kang, M.; Zhu, J.Y.; Zhang, R.; Park, J.; Shechtman, E.; Paris, S.; Park, T. Scaling up GANs for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 10124–10134. [Google Scholar]
  22. Ravi, S.; Anand, V.; Kumar, A.; Athikomrattanakul, S. Efficient Memory Management for On-Device AI Inference. In Proceedings of the ACM International Conference on Mobile Computing and Networking (MobiCom), Madrid, Spain, 2–6 October 2023. [Google Scholar]
  23. Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y.; Qie, X. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 4296–4304. [Google Scholar]
  24. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv 2023, arXiv:2308.06721. [Google Scholar]
  25. Dao, T.; Fu, D.Y.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 16344–16359. [Google Scholar]
  26. Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  27. Chen, J.; Yu, J.; Ge, C.; Yao, L.; Xie, E.; Wu, Y.; Wang, Z.; Kwok, J.; Luo, P.; Lu, H.; et al. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  28. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Comput. Surv. 2023, 56, 105. [Google Scholar] [CrossRef]
  29. Ma, Z.; Zhang, Y.; Liu, B.; Sun, T.; Ge, Z.; Feng, Y.; Yang, S.; Zhang, K. Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices. arXiv 2024, arXiv:2410.11795. [Google Scholar] [CrossRef] [PubMed]
  30. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6626–6637. [Google Scholar]
  31. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; pp. 2234–2242. [Google Scholar]
  32. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  33. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 25278–25294. [Google Scholar]
  34. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  35. Borji, A. Pros and Cons of GAN Evaluation Measures: New Developments. Comput. Vis. Image Underst. 2022, 215, 103329. [Google Scholar] [CrossRef]
  36. Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7514–7528. [Google Scholar]
  37. Xu, J.; Liu, X.; Wu, Y.; Tong, Y.; Li, Q.; Ding, M.; Tang, J.; Dong, Y. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  38. Wu, X.; Hao, Y.; Sun, K.; Chen, Y.; Zhu, F.; Zhao, R.; Li, H. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv 2023, arXiv:2306.09341. [Google Scholar]
  39. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021; pp. 12179–12188. [Google Scholar]
  40. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
  41. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698. [Google Scholar] [CrossRef]
  42. Xie, S.; Tu, Z. Holistically-Nested Edge Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1395–1403. [Google Scholar]
  43. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing through ADE20K Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 633–641. [Google Scholar]
  44. Huang, K.; Sun, K.; Xie, E.; Li, Z.; Liu, X. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-Image Generation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  45. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  46. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 746–760. [Google Scholar]
  47. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  48. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour Detection and Hierarchical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 898–916. [Google Scholar] [CrossRef] [PubMed]
  49. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
  50. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4690–4699. [Google Scholar]
  51. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  52. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329. [Google Scholar] [CrossRef]
  53. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. In Low-Power Computer Vision; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 291–326. [Google Scholar]
  54. Li, Y.; Shen, Y.; Gao, S.; Liu, Z.; Xu, Y.; Zhang, W.; Lu, H.; Huang, G. Q-Diffusion: Quantizing Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 17535–17545. [Google Scholar]
  55. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed Precision Training. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  56. Shang, Y.; Yuan, Z.; Xie, B.; Wu, B.; Yan, Y. Post-Training Quantization on Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 1972–1981. [Google Scholar]
  57. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  58. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  59. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  60. Liu, X.; Gong, C.; Liu, Q. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  61. Cherti, M.; Beaumont, R.; Wightman, R.; Wortsman, M.; Ilharco, G.; Gordon, C.; Schuhmann, C.; Schmidt, L.; Jitsev, J. Reproducible Scaling Laws for Contrastive Language-Image Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 2818–2829. [Google Scholar]
  62. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  63. Shen, H.; Zhang, J.; Xiong, B.; Hu, R.; Chen, S.; Wan, Z.; Wang, X.; Zhang, Y.; Gong, Z.; Bao, G.; et al. Efficient Diffusion Models: A Survey. arXiv 2025, arXiv:2502.06805. [Google Scholar]
  64. Ignatov, A.; Timofte, R.; Kulik, A.; Yang, S.; Wang, K.; Baum, F.; Wu, M.; Xu, L.; Van Gool, L. AI Benchmark: All About Deep Learning on Smartphones in 2019. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3617–3635. [Google Scholar]
  65. Pope, R.; Douglas, S.; Chowdhery, A.; Devlin, J.; Bradbury, J.; Heek, J.; Xiao, K.; Agrawal, S.; Dean, J. Efficiently Scaling Transformer Inference. In Proceedings of the Machine Learning and Systems (MLSys), Miami, FL, USA, 4–8 June 2023. [Google Scholar]
  66. Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res. 2019, 20, 1–21. [Google Scholar]
  67. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  68. Liu, S.; Zhu, J.; Lu, J.; Gong, Y.; Li, L.; Cheng, B.; Ma, Y.; Wu, L.; Wu, X.; Leng, D.; et al. NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer. arXiv 2025, arXiv:2508.10424. [Google Scholar] [CrossRef]
  69. Li, M.; Cong, Y.; Zhang, G.; Wu, Y.; Xu, P.; Gu, Q.; Xu, K. ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  70. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
  71. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22500–22510. [Google Scholar]
  72. Liu, S.Y.; Wang, C.Y.; Yin, H.; Molchanov, P.; Wang, Y.C.F.; Cheng, K.T.; Chen, M.H. DoRA: Weight-Decomposed Low-Rank Adaptation. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  73. Zavadski, D.; Ryll, J.P.; Kneip, L. ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  74. Tan, Z.; Cong, Y.; Li, M.; Wang, Y.; Wu, S.; Zhang, G. OminiControl: Minimal and Universal Control for Diffusion Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–25 October 2025. [Google Scholar]
  75. Qin, C.; Zhang, S.; Zhang, N.; Bai, J.; Zhang, Y.; Shen, H.; Yang, H. UniControl: A Unified Diffusion Model for Controllable Visual Generation in the Wild. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  76. Luo, S.; Tan, Y.; Patil, S.; Gu, D.; von Platen, P.; Passos, A.; Huang, L.; Li, J.; Zhao, H. LCM-LoRA: A Universal Stable-Diffusion Acceleration Module. arXiv 2023, arXiv:2311.05556. [Google Scholar]
  77. Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  78. Liu, X.; Zhang, X.; Ma, J.; Peng, J.; Liu, Q. InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  79. Yan, H.; Yang, L.; Zhang, Z.; Luo, J. PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  80. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. [Google Scholar] [CrossRef]
  81. Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; Rombach, R. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. In Proceedings of the ACM SIGGRAPH Asia Conference, Tokyo, Japan, 3–6 December 2024. [Google Scholar]
  82. Ma, X.; Fang, G.; Wang, X. DeepCache: Accelerating Diffusion Models for Free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15762–15772. [Google Scholar]
  83. He, Y.; Liu, L.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. PTQD: Accurate Post-Training Quantization for Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  84. Zhao, J.; Zhang, B.; Chen, Z.; Wang, Z.; Zhao, Y.; Tian, Y.; Yuan, C. MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  85. Li, Y.; Lin, J.; Tang, H.; Sun, K.; Song, Y.; Han, S. SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
  86. LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal Brain Damage. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Denver, CO, USA, 27–30 November 1989; pp. 598–605. [Google Scholar]
  87. Fang, G.; Ma, X.; Wang, X. Structural Pruning for Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  88. Castells, T.; Yamamoto, H.; Moro, A.; Kobayashi, T.; Otani, M. LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  89. Zhang, K.; Li, D.; Li, Y.; Liu, Z. EcoDiff: Economizing Diffusion Models for Better Efficiency. arXiv 2024, arXiv:2403.11111. [Google Scholar]
  90. Bar-Tal, O.; Yariv, L.; Lipman, Y.; Dekel, T. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 1737–1752. [Google Scholar]
  91. Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  92. Chen, T.; Xu, B.; Zhang, C.; Guestrin, C. Training Deep Nets with Sublinear Memory Cost. arXiv 2016, arXiv:1604.06174. [Google Scholar] [CrossRef]
  93. von Platen, P.; Patil, S.; Lozhkov, A.; Cuenca, P.; Lambert, N.; Rasul, K.; Davaadorj, M.; Wolf, T. Diffusers: State-of-the-Art Diffusion Models. GitHub Repository. 2022. Available online: https://github.com/huggingface/diffusers (accessed on 14 February 2025).
94. Rhu, M.; Gimelshein, N.; Clemons, J.; Zulfiqar, A.; Keckler, S.W. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–13. [Google Scholar]
  95. Zhao, S.; Chen, D.; Chen, Y.C.; Bao, J.; Hao, S.; Yuan, L.; Wong, K.Y.K. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  96. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
  97. Dehghani, M.; Djolonga, J.; Mustafa, B.; Padlewski, P.; Heek, J.; Gilmer, J.; Steiner, A.P.; Caron, M.; Geirhos, R.; Alabdulmohsin, I.; et al. Scaling Vision Transformers to 22 Billion Parameters. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 7480–7512. [Google Scholar]
  98. Lin, S.; Wang, A.; Yang, X. SDXL-Lightning: Progressive Adversarial Diffusion Distillation. arXiv 2024, arXiv:2402.13929. [Google Scholar] [CrossRef]
  99. Yang, Z.; Zhang, L.; Chen, Y.; Zhou, Y. EdgeFusion: On-Device Text-to-Image Generation. arXiv 2024, arXiv:2404.11925. [Google Scholar]
  100. Apple Inc. Core ML Framework Documentation. 2023. Available online: https://developer.apple.com/documentation/coreml (accessed on 14 February 2025).
  101. Qualcomm Technologies Inc. Qualcomm AI Engine Direct SDK (QNN). 2023. Available online: https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk (accessed on 14 February 2025).
  102. Podell, D.; English, Z.; Lacey, K.; Dockhorn, T.; Blattmann, A.; Rombach, R. Efficient VAE Decoding for High-Resolution Image Synthesis. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  103. Williams, S.; Waterman, A.; Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 2009, 52, 65–76. [Google Scholar] [CrossRef]
  104. Ivanov, A.; Dryden, N.; Ben-Nun, T.; Li, S.; Hoefler, T. Data Movement Is All You Need: A Case Study on Optimizing Transformers. In Proceedings of the Machine Learning and Systems (MLSys), Virtual, 5–9 April 2021. [Google Scholar]
  105. Kim, S.; Hooper, C.; Gholami, A.; Dong, Z.; Li, X.; Shen, S.; Mahoney, M.W.; Keutzer, K. SqueezeLLM: Dense-and-Sparse Quantization. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  106. Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles (SOSP), Koblenz, Germany, 23–26 October 2023; pp. 611–626. [Google Scholar]
107. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12. [Google Scholar]
  108. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138. [Google Scholar] [CrossRef]
  109. Fang, G.; Li, K.; Ma, X.; Wang, X. TinyFusion: Diffusion Transformers Learned Shallow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–16 June 2025; pp. 18144–18154. [Google Scholar]
  110. You, H.; Barnes, C.; Zhou, Y.; Kang, Y.; Du, Z.; Zhou, W.; Zhang, L.; Nitzan, Y.; Liu, X.; Lin, Z.; et al. Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–16 June 2025; pp. 18072–18082. [Google Scholar]
  111. Wu, J.; Wang, H.; Shang, Y.; Shah, M.; Yan, Y. PTQ4DiT: Post-training Quantization for Diffusion Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  112. Lee, Y.; Park, K.; Cho, Y.; Lee, Y.J.; Hwang, S.J. KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  113. Yin, T.; Gharbi, M.; Park, T.; Zhang, R.; Shechtman, E.; Durand, F.; Freeman, W.T. Improved Distribution Matching Distillation for Fast Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  114. Marchisio, A.; Massa, A.; Mrazek, V.; Bussolino, B.; Martina, M.; Shafique, M. A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar]
  115. Sze, V.; Chen, Y.H.; Emer, J.; Suleiman, A.; Zhang, Z. Hardware for Machine Learning: Challenges and Opportunities. In Proceedings of the IEEE Custom Integrated Circuits Conference (CICC), Austin, TX, USA, 30 April–3 May 2017; pp. 1–8. [Google Scholar]
Figure 1. Timeline of the evolution of image generative AI models (2020–2025). This diagram shows the progression from early diffusion models (DDPMs [6], DDIMs [7]) to latent diffusion models (LDMs) [8], followed by 2023–2024 breakthroughs in control methods (ControlNet [12]), transformer architectures (DiT [13], SD3 [14], FLUX [15]), and fast generation (LCM [16], ADD [17]), culminating in today’s edge-native era (2025) where ultra-lightweight models like MobileDiffusion [18] and similar approaches generate high-quality images in under 0.5 s on mobile devices with minimal computational requirements. Blue boxes represent UNet-era models (2020–2022), green boxes indicate the paradigm shift period (2023), and orange/red boxes denote the DiT and edge-native era (2024–2025).
Figure 2. Architectural comparison between UNet-based and DiT-based diffusion models [13,59]. Comparison of UNet (encoder–decoder with skip connections) and DiT (stacked Transformer blocks) architectures. UNet relies on convolution operations for local spatial features, while DiT uses attention mechanisms for global context, resulting in distinct computational and hardware optimization characteristics. The ellipsis (“…”) between Transformer blocks denotes the repetition of the blocks for N iterations, as labeled “N Blocks.” The ⊕ symbol represents element-wise addition (i.e., residual connection), and the arrows indicate the direction of data flow. Different colors distinguish functionally distinct components within each architecture.
Figure 3. Structural comparison between the standard UNet [59] and the mobile-optimized UNet architecture [9,18]. The standard UNet (left) employs a uniform block placement with full attention across all levels, while the mobile-optimized variant (right) uses depth-wise separable convolutions [67], removes high-resolution attention, and reduces the middle block by 62% [9] for efficient on-device inference. In both architectures, blue blocks represent convolution (ResNet) blocks, and yellow blocks denote attention blocks. This color distinction highlights the key structural difference: the mobile-optimized variant selectively removes attention blocks at higher resolutions to reduce computational cost.
Figure 4. Evolution of control overhead and architectural paradigms in controllable diffusion models. The timeline shows the progression from heavy CNN-based adapters (ControlNet [12], 361 M parameters) to efficient DiT-centric solutions (NanoControl [68], ~0.1 MB), demonstrating the dramatic reduction in parameter overhead from 2023 to 2025. The ellipsis (“…”) between LoRA modules indicates that the modules are repeated across multiple layers within the network. Vertical dashed lines separate distinct stages of architectural evolution (2023, 2024, and 2025). Colors denote specific architectural components: orange represents trainable encoders, blue indicates lightweight control branches, green represents DiT blocks, red denotes LoRA modules, and purple indicates input conditions.
Figure 5. Correlation between parameter overhead and control fidelity (C-FID) in controllable diffusion models. This scatter plot shows the relationship between model size (trainable parameters on X-axis) and image quality (Control-FID on Y-axis, lower is better) for various AI control models, including ControlNet [12], ControlNet++ [69], UniControl [75], ControlNet-XS [73], T2I-Adapter [23], and NanoControl [68]. The models are grouped into three efficiency zones: minimalist (<10 M parameters), low overhead (10–100 M), and high overhead (>100 M), with a trend line revealing diminishing returns, meaning that adding more parameters beyond a certain point provides only minimal quality improvements.
Figure 6. Mobile NPU quantization sweet spot: Weight vs. activation precision trade-offs [53,54]. This heatmap shows the balance between quantization precision for weights (X-axis) and activations (Y-axis) on mobile NPUs. Colors indicate speedup (1×–4×, green to red), while the blue dashed contour lines represent iso-contours of ΔFID, indicating levels of quality degradation (e.g., ΔFID = 1, 2, 3) relative to the FP16 baseline. The sweet spot around W8A8 and W4A16 offers the best balance between speed and quality, with representative methods like PTQ4DM [56], Q-Diffusion [54], and MixDQ [84] mapped to their optimal precision regions.
Table 1. Comparison with existing surveys on efficient diffusion models. This table summarizes the comparative analysis between the proposed survey and prior studies [19,28,29], highlighting differences in research focus, the depth of controllability review, and the inclusion of quantitative efficiency metrics.
| Aspect | Cao et al. [19] | Ma et al. [29] | Yang et al. [28] | Ours |
|---|---|---|---|---|
| Focus | General diffusion | Efficient diffusion | Diffusion applications | Edge deployment |
| Controllability | Limited | Partial | Partial | Comprehensive |
| Control-Cost Analysis | ✗ | ✗ | ✗ | ✓ (CES metric) |
| FLOPs-Latency Gap | ✗ | Mentioned | ✗ | ✓ (FER metric) |
| 4-Layer Stack | ✗ | ✗ | ✗ | ✓ |
CES, control-efficiency score; FER, FLOPs efficiency ratio; and FLOPs, floating point operations. ✓ indicates the feature is included; ✗ indicates the feature is not included.
Table 3. Quantitative specifications of baseline diffusion models. This table compares the quantitative specifications of baseline diffusion models, highlighting differences in architecture, computational requirements, and performance metrics across different releases.
| Specification | SD1.5 (UNet) | SDXL (UNet) | SD3 Medium (DiT) | FLUX.1-dev (DiT) |
|---|---|---|---|---|
| Release | August 2022 | July 2023 | June 2024 | August 2024 |
| Architecture | U-Net + ResNet + Transformer | U-Net (3× larger) | MMDiT | Rectified Flow Transformer |
| Parameters | ~860 M | ~2.6 B | ~2 B | ~12 B |
| Native Resolution | 512 × 512 | 1024 × 1024 | 1024 × 1024 | 1024 × 1024+ |
| Text Encoder | CLIP-L [32] | CLIP-L + OpenCLIP bigG [61] | CLIP-L + OpenCLIP bigG + T5-XXL [62] | T5-XXL [62] + CLIP [32] |
| Typical Steps | 20–50 | 20–50 | 20–30 | 20–30 |
| FID (COCO-30K) | ~8–12 | ~6–10 | ~5–8 | ~4–7 |
| Est. FLOPs/step | ~50 G | ~150 G | ~100 G | ~250 G |
| Memory (FP16) | ~4 GB | ~8 GB | ~6 GB | ~24 GB |
MMDiT, multimodal diffusion transformer; CLIP-L, CLIP Large; FLOPs, floating point operations; FID, Fréchet inception distance.
Table 4. FLOPs comparison: standard UNet vs. mobile-optimized UNet (512 × 512, per step). This table compares the computational cost (in GMACs) and structural parameters of the standard UNet (SD1.5) [8] against mobile-optimized variants (SnapFusion [9] and MobileDiffusion [18]). The reduction percentages indicate the efficiency gains relative to the standard model.
| Component | Standard UNet (SD1.5) [8] | SnapFusion UNet [9] | MobileDiffusion [18] | Reduction |
|---|---|---|---|---|
| Encoder (Down) | ~18 GMACs | ~12 GMACs | ~8 GMACs | 33–56% |
| Middle Block | ~8 GMACs | ~3 GMACs | ~2 GMACs | 62–75% |
| Decoder (Up) | ~20 GMACs | ~14 GMACs | ~10 GMACs | 30–50% |
| Attention Layers | ~6 GMACs | ~4 GMACs | ~3 GMACs | 33–50% |
| Total | ~52 GMACs | ~33 GMACs | ~23 GMACs | 36–56% |
| Conv Type | Standard | Mixed | Separable | - |
| Channel Width | 320–1280 | 256–1024 | 192–768 | 20–40% |
GMACs, giga multiply-accumulate operations; SD1.5, Stable Diffusion version 1.5. Reduction percentages are calculated relative to the standard UNet baseline.
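As a sanity check, the Reduction column of Table 4 follows directly from the per-component GMAC figures. The short Python sketch below uses only the approximate values quoted in the table:

```python
def reduction(baseline: float, optimized: float) -> float:
    """Percentage reduction of `optimized` relative to `baseline`."""
    return 100.0 * (baseline - optimized) / baseline

# Approximate per-step GMACs from Table 4.
standard = {"encoder": 18, "middle": 8, "decoder": 20, "attention": 6}
mobilediffusion = {"encoder": 8, "middle": 2, "decoder": 10, "attention": 3}

total_std = sum(standard.values())         # 52 GMACs
total_mob = sum(mobilediffusion.values())  # 23 GMACs
print(f"Total reduction (MobileDiffusion): {reduction(total_std, total_mob):.0f}%")  # 56%

# SnapFusion's middle block is ~3 GMACs, MobileDiffusion's ~2 GMACs.
print(f"Middle block: {reduction(8, 3):.0f}-{reduction(8, 2):.0f}%")  # 62-75%
```

The middle block shows the largest relative cut, consistent with the 62% middle-block reduction reported for SnapFusion [9].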
Table 5. Control accuracy vs. generalization trade-off across control mechanisms. This table summarizes the comparative analysis of various control paradigms, evaluating the inherent trade-offs between the control precision (accuracy) and the breadth of applicable conditions (generalization). It contrasts heavy ‘specialist’ models with lightweight ‘generalist’ or ‘flexible’ architectures, detailing the parameter overhead required for each approach to highlight the progression towards efficient controllability.
| Mechanism | Control Accuracy | Generalization | Overhead | Approach |
|---|---|---|---|---|
| ControlNet [12] | High | Low (single) | 361 M/cond | Specialist |
| ControlNet++ [69] | Very High | Low (single) | 361 M/cond | Specialist |
| UniControl [75] | Moderate | High (9 types) | 140 M | Generalist |
| T2I-Adapter [23] | Moderate | Medium | 77 M | Balanced |
| LoRA-based [70] | Variable | High (composable) | 4 M (avg) | Flexible |
| NanoControl [68] | Moderate–High | High (universal) | 0.1 M | Balanced |
M, million parameters; cond, per condition model; LoRA, Low-Rank Adaptation; avg, average value; overhead, number of additional trainable parameters relative to the base model; specialist, models optimized for a single specific control type; generalist, models capable of handling multiple control conditions simultaneously; balanced, models achieving a moderate trade-off between accuracy and generalization; and flexible, models supporting composable or swappable control modules.
Table 6. Quantization precision vs. quality and speed trade-off. This table illustrates the trade-offs between quantization precision, model size reduction, and inference speedup, relative to the FP16 baseline [53], alongside the impact on image quality (FID Δ) and hardware (NPU) support status [64].
| Precision | Weight Bits | Activation Bits | Model Size (SD1.5) | FID Δ | Speed vs. FP16 | NPU Support |
|---|---|---|---|---|---|---|
| FP32 | 32 | 32 | ~3.4 GB | Baseline | 0.5× | ✗ (Not Supported) |
| FP16 | 16 | 16 | ~1.7 GB | 0 | 1.0× | Δ (Partial) |
| BF16 | 16 | 16 | ~1.7 GB | 0 | 1.0× | Limited |
| W8A8 | 8 | 8 | ~0.85 GB | <1.0 | 1.8–2.5× | ✓ (Supported) |
| W4A16 | 4 | 16 | ~0.5 GB | 1.0–2.0 | 1.5–2.0× | Δ (Partial) |
| W4A8 | 4 | 8 | ~0.5 GB | 1.0–3.0 | 2.0–3.0× | ✓ (Supported) |
| W4A4 | 4 | 4 | ~0.5 GB | >5.0 | 2.5–4.0× | Limited |
FP32, 32-bit floating point; FP16, 16-bit floating point; BF16, 16-bit brain floating point; W8A8, 8-bit weights and 8-bit activations; W4A16, 4-bit weights and 16-bit activations; W4A8, 4-bit weights and 8-bit activations; W4A4, 4-bit weights and 4-bit activations; FID Δ, change in Fréchet inception distance relative to the FP16 baseline; NPU, neural processing unit; SD1.5, Stable Diffusion version 1.5. ✓, fully supported; Δ, partially supported (hardware-dependent); ✗, not supported.
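The model-size column in Table 6 follows from bit-width alone (size ≈ parameters × bits / 8), while the FID Δ column reflects the rounding error that quantization introduces. The sketch below is illustrative only, assuming plain uniform symmetric post-training quantization; practical methods such as Q-Diffusion [54] or MixDQ [84] add calibration and mixed precision on top of this basic rounding step:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Uniform symmetric quantization: returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def model_size_gb(n_params: float, bits: int) -> float:
    """Weight storage alone: parameters x bits / 8, in GB."""
    return n_params * bits / 8 / 1e9

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

q8, s8 = quantize_symmetric(w, 8)
q4, s4 = quantize_symmetric(w, 4)
err8 = float(np.abs(w - q8 * s8).mean())
err4 = float(np.abs(w - q4 * s4).mean())
print(f"mean |error|  8-bit: {err8:.4f}   4-bit: {err4:.4f}")

# SD1.5 (~860 M parameters): weight storage alone at 8- and 4-bit precision.
print(f"W8: {model_size_gb(860e6, 8):.2f} GB, W4: {model_size_gb(860e6, 4):.2f} GB")
```

The much larger 4-bit rounding error is the source of the steep FID Δ growth in the W4A4 row; techniques like SVDQuant [85] and PTQD [83] exist precisely to compensate for it (outlier absorption, quantization-noise correction).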
Table 7. Memory optimization technique effectiveness summary. This table summarizes the effectiveness of various memory optimization techniques [25,90,91,92], comparing their impact on peak VRAM usage and processing speed, alongside their implementation complexity and optimal use cases.
| Technique | Peak VRAM Reduction | Speed Impact | Complexity | Best for |
|---|---|---|---|---|
| VAE Tiling [90] | 50–70% | −10.2 | Low | High-res decode |
| U-Net Tiling [90] | 30–50% | −20.3 | Medium | Panorama |
| FlashAttention [25] | 30–50% | +19.6 | Low | All attention |
| FlashAttention-2 [91] | 40–60% | +39.2 | Low | All attention |
| Attention Slicing [93] | 20–40% | −30.5 | Very Low | Fallback |
| Activation Offload [94] | 50–80% | −102 | Medium | OOM prevention |
| Gradient Checkpoint [92] | 60–80% (train) | −30% (train) | Low | Training only |
VRAM, video random access memory; VAE, variational autoencoder; and OOM, out-of-memory. Speed impact values indicate a percentage change relative to the baseline (negative values denote slowdown).
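The VRAM savings from tiling in Table 7 come from decoding one tile at a time instead of materializing full-resolution feature maps. A back-of-the-envelope sketch, with an assumed channel count and FP16 activations (an estimate, not a measurement):

```python
def decode_activation_mb(h: int, w: int, channels: int = 512,
                         dtype_bytes: int = 2) -> float:
    """Rough footprint (MB) of one FP16 decoder feature map at spatial size h x w."""
    return h * w * channels * dtype_bytes / 1e6

full = decode_activation_mb(1024, 1024)  # decode the whole image at once
tile = decode_activation_mb(512, 512)    # decode one 512x512 tile at a time
print(f"full: {full:.0f} MB  per-tile: {tile:.0f} MB  "
      f"ideal reduction: {100 * (1 - tile / full):.0f}%")  # 75%
```

In Diffusers [93], this is exposed as pipeline switches such as `enable_vae_tiling()` and `enable_attention_slicing()` (exact APIs vary by version). Tiles must overlap to avoid seams, which partly explains why measured savings (50–70% in Table 7) fall short of the ideal 75% above.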
Table 8. Control-Cost Leaderboard: comprehensive comparison of control mechanisms (2023–2025). This table presents the leaderboard of control mechanisms ranked by the control-efficiency score (CES). It compares added parameters, normalized overhead ratio (NOR), memory overhead, and task-specific performance metrics (mIoU, SSIM, RMSE) across different base architectures (DiT, UNet).
| Rank | Method | Year | Venue | Base | Added Params | NOR (%) | Memory Overhead | Control Types | mIoU (Seg) | SSIM (Edge) | RMSE (Depth) | CES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NanoControl [68] | 2025 | arXiv | DiT | 0.1 M | 0.02% | 1.01× | Multi | 78.2 | 0.891 | 0.082 | 156.4 |
| 2 | OminiControl [74] | 2025 | ICCV | DiT | 0.5 M | 0.10% | 1.02× | Universal | 79.5 | 0.895 | 0.079 | 92.3 |
| 3 | ControlNet-XS [73] | 2024 | ECCV | UNet | 14 M (SD1.5) | 1.60% | 1.15× | 8+ | 81.3 | 0.902 | 0.075 | 70.8 |
| 4 | ControlNet-XS [73] | 2024 | ECCV | UNet | 48 M (SDXL) | 1.40% | 1.12× | 8+ | 82.1 | 0.908 | 0.072 | 48.9 |
| 5 | IP-Adapter [24] | 2023 | arXiv | UNet | 22 M | 2.60% | 1.08× | Style/ID | - | - | - | 45.2 * |
| 6 | T2I-Adapter [23] | 2024 | AAAI | UNet | 77 M | 9.00% | 1.12× | 8+ | 76.8 | 0.875 | 0.089 | 40.6 |
| 7 | LoRA (Control) [70] | 2024 | - | UNet | 4 M (avg) | 0.50% | 1.01× | Domain | 72.5 | 0.852 | 0.095 | 36.3 |
| 8 | UniControl [75] | 2023 | NeurIPS | UNet | 140 M | 16.30% | 1.35× | 9+ | 80.2 | 0.889 | 0.078 | 35.4 |
| 9 | ControlNet++ [69] | 2024 | ECCV | UNet | 361 M | 42.00% | 2.00× | 8+ | 89.3 | 0.938 | 0.068 | 34.1 |
| 10 | ControlNet [12] | 2023 | ICCV | UNet | 361 M | 42.00% | 2.00× | 8+ | 78.1 | 0.825 | 0.086 | 29.8 |
| 11 | Uni-ControlNet [95] | 2023 | NeurIPS | UNet | 380 M | 44.20% | 2.10× | Multi | 79.5 | 0.842 | 0.083 | 29.2 |
NOR, normalized overhead ratio; CES, control-efficiency score; mIoU, mean intersection over union; SSIM, structural similarity index measure; RMSE, root mean square error; DiT, diffusion transformer; SD1.5, Stable Diffusion version 1.5; and SDXL, Stable Diffusion XL. * CES for IP-Adapter is calculated using style transfer metrics.
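Among the columns in Table 8, NOR is the most mechanical: added trainable parameters divided by the base model's parameters. A quick check against the SD1.5-based rows (~860 M base; the CES formula is defined in the body of the survey and is not reproduced here):

```python
def nor_percent(added_params_m: float, base_params_m: float) -> float:
    """Normalized overhead ratio: added trainable params relative to the base model."""
    return 100.0 * added_params_m / base_params_m

# ControlNet trains a copy of the SD1.5 encoder: 361 M added on an 860 M base.
print(f"ControlNet NOR:    {nor_percent(361, 860):.1f}%")  # ~42.0%
# ControlNet-XS (SD1.5 variant) adds only 14 M parameters.
print(f"ControlNet-XS NOR: {nor_percent(14, 860):.1f}%")   # ~1.6%
```

Both values match the NOR column, confirming that the ~2600× spread in added parameters (361 M vs. 0.1 M) is what drives the CES ranking.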
Table 9. Condition-specific control quality comparison. This table compares the control quality of various methods across different conditions [12,23,68,69,73]. The arrows (↑, ↓) indicate whether higher or lower values represent better performance. Values in parentheses denote the relative improvement rate compared to the ControlNet [12] baseline.
| Method | Canny Edge (SSIM↑) | Depth (RMSE↓) | Pose (mAP↑) | Segmentation (mIoU↑) | Normal (Angular↓) |
|---|---|---|---|---|---|
| ControlNet [12] | 0.825 | 0.086 | 72.3 | 78.1 | 12.5° |
| ControlNet++ [69] | 0.938 (+13.7%) | 0.068 (+20.9%) | 79.8 (+10.4%) | 89.3 (+14.3%) | 9.8° (+21.6%) |
| ControlNet-XS [73] | 0.902 (+9.3%) | 0.075 (+12.8%) | 76.2 (+5.4%) | 81.3 (+4.1%) | 11.2° (+10.4%) |
| T2I-Adapter [23] | 0.875 (+6.1%) | 0.089 (−3.5%) | 69.8 (−3.5%) | 76.8 (−1.7%) | 13.1° (−4.8%) |
| NanoControl [68] | 0.891 (+8.0%) | 0.082 (+4.7%) | 74.5 (+3.0%) | 78.2 (+0.1%) | 11.8° (+5.6%) |
SSIM, structural similarity index measure; RMSE, root mean square error; mAP, mean average precision; and mIoU, mean intersection over union. ↑ indicates higher values are better; ↓ indicates lower values are better. Values in parentheses denote relative improvement compared to the ControlNet baseline.
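The parenthesized percentages in Table 9 are relative changes against the ControlNet baseline, with the sign flipped for lower-is-better metrics (RMSE, angular error) so that positive always means improvement. A minimal sketch reproducing two entries:

```python
def rel_improvement(baseline: float, value: float, higher_is_better: bool) -> float:
    """Signed relative improvement (%) over the ControlNet baseline."""
    delta = (value - baseline) / baseline
    return 100.0 * (delta if higher_is_better else -delta)

# ControlNet baselines from Table 9: SSIM 0.825 (higher is better), RMSE 0.086 (lower).
print(f"ControlNet++ SSIM: {rel_improvement(0.825, 0.938, True):+.1f}%")   # +13.7%
print(f"ControlNet++ RMSE: {rel_improvement(0.086, 0.068, False):+.1f}%")  # +20.9%
```

The same convention applies to the pose, segmentation, and surface-normal columns.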
Table 10. Efficiency–quality master comparison: key models and optimization techniques (2022–2025). This table provides a comprehensive comparison of state-of-the-art diffusion models and various optimization techniques. It details architecture, computational complexity (Params, Steps, FLOPs), image quality (FID), and latency across the server (A100) and mobile platforms.
Category/Model | Year | Architecture | Params | Steps | FID (COCO) | Latency (A100) | Latency (Mobile) | FLOPs/Img | Precision
Baseline Models
SD1.5 [8] | 2022 | UNet | 860 M | 50 | 8.59 | 2850 ms | >60,000 ms | 2600 G | FP16
SD1.5 [8] | 2022 | UNet | 860 M | 20 | 9.12 | 1140 ms | >25,000 ms | 1040 G | FP16
SDXL [26] | 2023 | UNet | 3.5 B | 50 | 6.82 | 8500 ms | N/A | 7800 G | FP16
SD3 Medium [14] | 2024 | DiT | 2.0 B | 28 | 5.95 | 4200 ms | N/A | 2800 G | FP16
FLUX.1-dev [15] | 2024 | DiT | 12 B | 28 | 4.72 | 12,500 ms | N/A | 7000 G | FP16
Step Reduction
SD1.5 + LCM [16] | 2023 | UNet | 860 M | 4 | 10.85 | 228 ms | 5200 ms | 208 G | FP16
SD1.5 + LCM-LoRA [76] | 2024 | UNet | 860 M + 67 M | 4 | 11.23 | 232 ms | 5350 ms | 208 G | FP16
SDXL + LCM [16] | 2024 | UNet | 3.5 B | 4 | 8.45 | 680 ms | N/A | 624 G | FP16
SD1.5 + PeRFlow [79] | 2024 | UNet | 860 M | 4 | 9.52 | 228 ms | 5100 ms | 208 G | FP16
SDXL Turbo (ADD) [17] | 2023 | UNet | 3.5 B | 1 | 12.35 | 170 ms | N/A | 156 G | FP16
SDXL Lightning [98] | 2024 | UNet | 3.5 B | 4 | 7.89 | 680 ms | N/A | 624 G | FP16
Mobile-Optimized
SnapFusion [9] | 2023 | UNet (Opt) | 380 M | 8 | 9.85 | 185 ms | 1840 ms | 416 G | INT8
MobileDiffusion [18] | 2024 | UNet (Opt) | 320 M | 8 | 10.42 | 165 ms | 520 ms | 184 G | INT8
MobileDiffusion [18] | 2024 | UNet (Opt) | 320 M | 1 | 14.25 | 42 ms | 158 ms | 23 G | INT8
EdgeFusion [99] | 2024 | UNet (Opt) | 295 M | 4 | 11.78 | 112 ms | 890 ms | 92 G | INT8
Quantized Models
SD1.5 W8A8 [54] | 2023 | UNet | 860 M | 20 | 9.45 | 685 ms | 8500 ms | 1040 G | INT8
SD1.5 W4A8 (MixDQ) [84] | 2024 | UNet | 860 M | 8 | 10.12 | 380 ms | 3200 ms | 416 G | W4A8
SDXL W8A8 [54] | 2024 | UNet | 3.5 B | 20 | 7.28 | 2550 ms | N/A | 3120 G | INT8
FID, Fréchet inception distance; FLOPs, floating point operations; LCM, latent consistency model; LoRA, Low-Rank Adaptation; ADD, adversarial diffusion distillation; DiT, diffusion transformer; FP16, 16-bit floating point; INT8, 8-bit integer; W4A8, 4-bit weights and 8-bit activations; and N/A, not available or not applicable.
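A useful sanity check on Table 10's FLOPs/Img column is that per-image FLOPs scale linearly with the step count for a fixed backbone, so dividing FLOPs/Img by Steps recovers a constant per-step cost. A minimal sketch (values from the table; variable names are our own):

```python
# Per-step compute implied by Table 10: FLOPs/Img divided by step count.
# A fixed backbone yields a constant per-step cost; only architectural
# optimization changes it.

rows = [
    # (model, GFLOPs per image, steps)
    ("SD1.5, 50 steps",          2600, 50),
    ("SD1.5 + LCM, 4 steps",      208,  4),
    ("MobileDiffusion, 8 steps",  184,  8),
    ("MobileDiffusion, 1 step",    23,  1),
]

for name, gflops, steps in rows:
    print(f"{name}: {gflops / steps:.0f} GFLOPs/step")

# SD1.5 and its LCM variant both run at 52 GFLOPs/step: step distillation
# removes steps but not per-step compute. MobileDiffusion's optimized UNet
# cuts the per-step cost itself to 23 GFLOPs.
```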
Table 11. Optimization technique contribution analysis from SD1.5 to MobileDiffusion. This table analyzes the contribution of each optimization stage in the transition from the baseline SD1.5 [8] to the fully optimized MobileDiffusion [18]. It details the latency reduction factor and FID trade-off at each step, resulting in the final cumulative latency.
Optimization Stage | Technique | Latency Reduction | FID Impact | Cumulative Latency
Baseline | SD1.5 (50 steps, FP16) [8] | – | 8.59 (baseline) | 60,000+ ms
Step Reduction | 50 → 8 steps (LCM) [16] | 6.25× | +2.26 | 9600 ms
Architecture | UNet Optimization (NAS) [9,66] | 2.5× | +0.32 | 3840 ms
Quantization | FP16 → INT8 [54] | 2.0× | 0 | 1920 ms
Engine | PyTorch → CoreML [100]/QNN [101] | 2.5× | 0 | 768 ms
VAE | VAE Optimization [102] | 1.5× | 0 | 512 ms
Total | All Combined | 117× | +2.83 | ~520 ms
LCM, latent consistency model; NAS, neural architecture search; FP16, 16-bit floating point; INT8, 8-bit integer; VAE, variational autoencoder; QNN, Qualcomm Neural Network SDK; and FID, Fréchet inception distance. PyTorch denotes the general-purpose inference framework used as the baseline engine prior to platform-specific compilation in [18]; no specific version is reported. FID Impact indicates the increase in FID at each optimization stage.
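The cumulative-latency column of Table 11 is the baseline latency divided by the running product of the stage-wise reduction factors; the sketch below (numbers taken from the table) reproduces the chain:

```python
# Reproduce Table 11's cumulative latency: each optimization stage
# divides the running latency by its reduction factor.

baseline_ms = 60_000.0
stages = [
    ("Step Reduction (50 -> 8, LCM)", 6.25),
    ("Architecture (UNet NAS)",       2.5),
    ("Quantization (FP16 -> INT8)",   2.0),
    ("Engine (CoreML/QNN)",           2.5),
    ("VAE Optimization",              1.5),
]

latency = baseline_ms
for name, factor in stages:
    latency /= factor
    print(f"{name}: {latency:.0f} ms")

print(f"Total speedup: {baseline_ms / latency:.0f}x")  # ~117x
```

The chain ends at 512 ms, matching the ~520 ms measured for MobileDiffusion in Table 10 up to measurement noise.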
Table 12. FLOPs efficiency ratio (FER) analysis: theoretical vs. actual speedup. This table analyzes the FLOPs efficiency ratio (FER), defined as the ratio of theoretical speedup (FLOPs Ratio) to actual speedup (Latency Ratio). An FER of 1.00 indicates perfect alignment with the theory, while values below 1.00 suggest optimizations exceeding theoretical expectations.
Model Comparison | FLOPs Ratio | Latency Ratio (A100) | FER | Interpretation
SD1.5 (50 → 20 steps) [8] | 2.50× | 2.50× | 1.00 | Matches Theory
SD1.5 (50 steps) → LCM (4 steps) [16] | 12.50× | 12.50× | 1.00 | Matches Theory
SD1.5 → SnapFusion (8 steps) [9] | 6.25× | 15.41× | 0.41 | Faster than Theory
SD1.5 → MobileDiffusion (8 steps) [18] | 14.13× | 17.27× | 0.82 | Faster than Theory
SDXL → SD3 Medium [14,26] | 2.79× | 2.02× | 1.38 | Slower than Theory
SD1.5 → SD3 Medium [8,14] | 0.92× | 0.67× | 1.37 | Slower than Theory
UNet (SD1.5) → DiT (FLUX.1-dev) [8,15] | 0.37× | 0.23× | 1.61 | Slower than Theory
FER, FLOPs efficiency ratio (defined as FLOPs ratio divided by latency ratio); LCM, latent consistency model; SD1.5, Stable Diffusion version 1.5; SDXL, Stable Diffusion XL; SD3, Stable Diffusion 3; and DiT, diffusion transformer. FER = 1.00 indicates perfect alignment with theoretical expectations; FER < 1.00 indicates faster than theory; and FER > 1.00 indicates slower than theory due to memory bottlenecks.
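FER follows directly from Table 10's FLOPs and A100 latency columns; a minimal sketch (the `fer` helper is our own; raw values from Table 10):

```python
# FER = (FLOPs ratio) / (latency ratio). FER > 1: the model is slower
# than its FLOPs count predicts (memory-bandwidth-bound); FER < 1:
# faster than predicted (hardware-friendly operators).

def fer(base_gflops: float, base_ms: float,
        new_gflops: float, new_ms: float) -> float:
    flops_ratio = base_gflops / new_gflops
    latency_ratio = base_ms / new_ms
    return flops_ratio / latency_ratio

# SD1.5 (50 steps) -> MobileDiffusion (8 steps): faster than theory
print(f"{fer(2600, 2850, 184, 165):.2f}")    # 0.82

# SD1.5 -> FLUX.1-dev (DiT): slower than theory (compute-dense)
print(f"{fer(2600, 2850, 7000, 12500):.2f}") # ~1.63 (Table 12 reports
                                             # 1.61 from rounded ratios)
```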
Share and Cite

Ham, S.-J.; Park, C.-S. Efficient and Controllable Image Generation on the Edge: A Survey on Algorithmic and Architectural Optimization. Electronics 2026, 15, 828. https://doi.org/10.3390/electronics15040828
