5.2. ECAE Component Analysis
This subsection conducts a hierarchical ablation under the mid-stage injection configuration, denoted as ECA-Diff (Mid). At the macro level we toggle the VAE latent encoder and the LCN; at the micro level we dissect the HTC block by enabling or disabling the Transformer branch, the CNN branch, and Bi-CA. Beyond these switches, we further isolate two specific design choices: the external prior of the pretrained latent encoder, evaluated against controlled alternatives (Table 5), and the HTC fusion strategy (Bi-CA versus addition/concatenation, Table 6).
(1) Component Ablations. From Table 7, three observations emerge: (i) the LCN is indispensable: directly injecting the VAE latent into the mid stage without the LCN substantially degrades quality, showing that structured context extraction is necessary; (ii) within the HTC block, the Transformer and CNN branches are complementary, and removing either branch reduces performance; and (iii) disabling Bi-CA further erodes accuracy, indicating that cross-branch fusion is an effective mechanism for leveraging heterogeneous features. Overall, the complete ECAE injection achieves the best quality among the tested variants, and ECA-Diff (Mid) improves both PSNR and SSIM over the baseline (Table 7).
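To make these ablation switches concrete, the following minimal PyTorch sketch shows one possible HTC-style block whose Transformer branch, CNN branch, and Bi-CA fusion can be toggled independently. The module name, layer widths, and attention layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation) of an HTC-style block whose
# Transformer branch, CNN branch, and Bi-CA fusion can be toggled for ablation.
import torch
import torch.nn as nn


class HTCBlockSketch(nn.Module):
    def __init__(self, dim=64, heads=4,
                 use_transformer=True, use_cnn=True, use_bica=True):
        super().__init__()
        self.use_transformer = use_transformer
        self.use_cnn = use_cnn
        self.use_bica = use_bica
        # Transformer branch: global semantics via self-attention over flattened tokens.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # CNN branch: local details via a depthwise + pointwise convolution pair.
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        # Bi-CA: each branch attends to the other (two cross-attention layers).
        self.t2c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.c2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, HW, C)
        t = self.attn(tokens, tokens, tokens)[0] if self.use_transformer else None
        s = self.cnn(x).flatten(2).transpose(1, 2) if self.use_cnn else None
        if t is not None and s is not None and self.use_bica:
            # Bidirectional cross-attention: global queries local and vice versa.
            t = t + self.t2c(t, s, s)[0]
            s = s + self.c2t(s, t, t)[0]
        feats = [f for f in (t, s) if f is not None] or [tokens]
        fused = sum(feats) / len(feats)          # plain average when Bi-CA is disabled
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return x + self.out(fused)               # residual connection


# Ablation variant, e.g. disabling the CNN branch:
block = HTCBlockSketch(use_cnn=False)
y = block(torch.randn(1, 64, 32, 32))
```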
(2) Impact of the Pretrained Latent Encoder. To assess the benefit of the external prior and the potential domain gap, we evaluate encoder variants under the mid-stage injection setting with the LCN enabled (Table 5). The trends are clear: freezing the latent encoder leads to noticeable degradation, and freezing a pretrained encoder harms performance even more than freezing a randomly initialized one, indicating a mismatch between the pretraining domain and LLIE when the encoder is reused without adaptation. Training the encoder from scratch recovers and slightly improves over the baseline, while fine-tuning a pretrained encoder consistently achieves the best PSNR/SSIM among all settings. These results show that large-scale pretraining provides useful global semantics, but light task-specific adaptation is necessary to mitigate the domain gap, motivating our default choice of a pretrained and fine-tuned latent encoder.
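The encoder variants in Table 5 differ only in how the latent encoder's parameters are initialized and updated. A minimal PyTorch sketch of the three regimes is shown below; the `latent_encoder` module and the helper name are hypothetical, introduced only for illustration.

```python
# Minimal sketch of the three encoder regimes compared in Table 5; the
# latent_encoder module and this helper are illustrative assumptions.
import torch.nn as nn


def configure_encoder(latent_encoder: nn.Module, regime: str) -> list:
    """Return the encoder parameters that should be handed to the optimizer."""
    if regime == "frozen":            # pretrained or random weights, never updated
        for p in latent_encoder.parameters():
            p.requires_grad = False
        return []
    if regime == "from_scratch":      # re-initialize, then train jointly with the backbone
        for m in latent_encoder.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
        return list(latent_encoder.parameters())
    if regime == "finetune":          # keep pretrained weights, allow updates (our default)
        for p in latent_encoder.parameters():
            p.requires_grad = True
        return list(latent_encoder.parameters())
    raise ValueError(f"unknown regime: {regime}")
```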
(3) Effect of Fusion Strategies in the HTC Block. Table 6 compares three fusion strategies in the HTC block under the same backbone and mid-stage injection setting. Element-wise addition fails to exploit ECAE and even degrades performance, indicating that naive linear fusion cannot properly align the Transformer branch (global semantics) and the CNN branch (local details). Channel concatenation delivers clear quality improvements over the baseline but increases parameters and runtime due to the wider projection layer. The proposed Bi-CA achieves the best PSNR/SSIM with only a moderate overhead relative to concatenation, showing that explicitly modeling bidirectional interactions between global and local features is a more effective and efficient strategy for context refinement in the HTC block.
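The parameter overhead of concatenation mentioned above comes from the wider projection that follows it. The short sketch below makes the contrast among the three fusion variants explicit; the dimensions are illustrative assumptions, not the paper's configuration.

```python
# Illustrative fusion variants for two same-shape feature maps a and b (B, C, H, W);
# layer widths are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

dim = 64
a, b = torch.randn(1, dim, 32, 32), torch.randn(1, dim, 32, 32)

# (a) Element-wise addition: no extra parameters, but no learned interaction.
fused_add = a + b

# (b) Channel concatenation: needs a wider 2*dim -> dim projection, adding parameters.
proj = nn.Conv2d(2 * dim, dim, kernel_size=1)
fused_cat = proj(torch.cat([a, b], dim=1))

# (c) Bi-CA (as in the HTC sketch above): two cross-attention layers let each branch
# query the other before projection, trading moderate overhead for explicit
# bidirectional interaction.
print(sum(p.numel() for p in proj.parameters()))  # concat projection: 2*dim*dim + dim params
```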
5.3. Efficiency and Complexity Analysis
To comprehensively assess the efficiency of ECA-Diff and its design choices, we establish a unified evaluation protocol consisting of (i) step-count analysis, (ii) overall latency comparison, (iii) comparable-capacity studies, and (iv) condition injection overhead analysis. All measurements are conducted on the LOL dataset at full image resolution, with a batch size of 1 and a shared hardware/software environment (a single RTX 4090 GPU). We report (i) PSNR/SSIM, (ii) the total number of learnable parameters (M), (iii) the per-step sampling latency $t_{\text{step}}$ of the denoiser (s/step), and (iv) the end-to-end inference latency $T_{\text{total}}$ (s/image). Unless otherwise specified, all efficiency experiments are based on the fully optimized ECA-Diff (Full) configuration, whose performance is reported in Table 1.
For standard diffusion-based models without additional one-off modules, the inference latency is well approximated by

$T_{\text{total}} \approx N \cdot t_{\text{step}}$,  (16)

where N is the number of DDIM sampling steps and $t_{\text{step}}$ is the average denoising-step latency. For ECA-Diff, which introduces a once-per-image ECAE, the total latency can be decomposed as

$T_{\text{total}} \approx t_{\text{ECAE}} + N \cdot t_{\text{step}}$,  (17)

where $t_{\text{ECAE}}$ denotes the one-off ECAE cost and $t_{\text{step}}$ is measured from the iterative sampler alone. In practice, Equations (16) and (17) closely match the observed runtimes, with minor deviations due to data transfer, kernel scheduling, and measurement noise. For single-pass Transformer-based baselines (N = 1) without iterative sampling, we directly report the end-to-end latency as $T_{\text{total}}$.
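As a quick numerical companion to Equations (16) and (17), the following minimal Python sketch implements the latency model; the function name and arguments are our own illustration, not part of any released code.

```python
# Minimal sketch of the latency model in Equations (16) and (17).
def estimate_total_latency(n_steps: int, t_step: float, t_ecae: float = 0.0) -> float:
    """Approximate end-to-end latency (s/image).

    n_steps: number of DDIM sampling steps N (1 for single-pass baselines)
    t_step:  average per-step denoising latency (s/step)
    t_ecae:  one-off ECAE cost per image (0 for models without such modules)
    """
    return t_ecae + n_steps * t_step
```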
(1) Effect of DDIM Steps. Table 8 reports the behavior of ECA-Diff under different DDIM sampling step counts N. With very few steps (N = 5), the model is clearly under-converged, leading to ineffective denoising and significant quality degradation. From N = 10 onward, performance improves substantially, and from N = 25 the results become highly stable: for N = 25, 50, 75, and 100, both PSNR and SSIM fluctuate only marginally, indicating that ECA-Diff quickly saturates once a moderate number of steps is used. The cumulative sampling time grows approximately linearly with N, with a nearly constant per-step cost of around 0.18–0.20 s. The one-time ECAE overhead (latent encoder plus LCN) is about 0.07 s per image, which remains small compared with the cumulative sampling cost in typical multi-step settings and is well captured by Equation (17). In the main quantitative comparisons (Table 1), we adopt a conservative default of N = 100 steps to report a stable high-quality configuration, while the above results show that ECA-Diff maintains competitive performance even with reduced step counts (e.g., N = 25 or 50).
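Plugging the approximate costs quoted above (about 0.19 s per step and 0.07 s for ECAE) into the sketch of Equation (17) gives a feel for the step-count trade-off; the numbers below are rough estimates, not the measured values of Table 8.

```python
# Rough per-image latency estimates from Equation (17), using the approximate
# costs quoted above (t_step ~ 0.19 s, t_ECAE ~ 0.07 s); measured timings in
# Table 8 also include data transfer and scheduling overheads.
for n in (5, 10, 25, 50, 100):
    t_total = estimate_total_latency(n_steps=n, t_step=0.19, t_ecae=0.07)
    print(f"N = {n:3d}: ~{t_total:.2f} s/image")
# N = 100 gives ~19.07 s/image, consistent in magnitude with the ~19.35 s
# reported for ECA-Diff (Full) in the comparable-capacity study below.
```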
(2) Overall Latency Comparison. Table 3 compares representative transformer-based and diffusion-based LLIE methods on LOL. For transformer-based methods (SNR-Aware [14], RetinexFormer [16]), $T_{\text{total}}$ is the direct single-pass latency. For diffusion-based methods (DiffLL [20], PyDiff [19], CLEDiff [21]), we list the default number of sampling steps N, the measured end-to-end latency $T_{\text{total}}$, and the average per-step sampling cost $t_{\text{step}}$. For methods without additional one-off modules, the measured $T_{\text{total}}$ closely matches $N \cdot t_{\text{step}}$; for ECA-Diff, $t_{\text{step}}$ is obtained from the isolated sampler runtime and the total latency follows Equation (17).
The transformer baselines achieve low latency thanks to their lightweight architectures and single-pass inference, with RetinexFormer reaching 27.18 dB PSNR using only 1.61 M parameters. Diffusion-based LLIE models offer stronger generative capacity, but their absolute latency is heavily influenced by backbone design and step scheduling. DiffLL exploits wavelet-domain diffusion and PyDiff adopts a pyramid formulation with only four sampling steps, both of which explicitly optimize the sampling process for efficiency. Under this setting, our ECA-Diff, built upon a classical DDPM-style framework with standard DDIM sampling, is not directly tailored to the same extreme low-latency regime as these specialized designs.
Our focus here is on the efficiency of the proposed ECA-Diff framework itself. CLEDiff serves as a strong and fair baseline, since it follows the same diffusion paradigm and training strategy. Compared with CLEDiff, ECA-Diff removes the self-attention blocks in the iterative backbone and offloads context modeling to an external, once-executed ECAE, which is coupled back through lightweight TCF modules. Although this redesign roughly doubles the total parameter count (from 84.88 M to 105.74 M + 61.68 M), it yields both faster sampling and higher restoration quality. With N = 10 steps, ECA-Diff reduces the per-step latency from 0.224 s to 0.179 s (about 20% reduction) and the end-to-end latency from 2.240 s to 1.857 s (about 17% reduction), while improving performance by +2.13 dB PSNR and +0.056 SSIM. With N = 100 steps, the per-step cost still decreases (0.224 s → 0.192 s, ∼14% reduction), and ECA-Diff achieves gains of +2.21 dB PSNR and +0.072 SSIM over CLEDiff. These results indicate that reallocating capacity into an external encoder and simplifying the recurrent denoising pathway leads to a more efficient diffusion framework, even when all parameters are accounted for. To further disentangle the benefits of the proposed framework from pure model scaling, the next subsection evaluates ECA-Diff under comparable-capacity settings against CLEDiff and related baselines.
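These figures are also consistent with Equation (17): at N = 10, the gap between ECA-Diff's end-to-end latency and its accumulated per-step cost is roughly the one-off ECAE overhead quoted earlier, as the short check below illustrates (plain arithmetic on the reported numbers).

```python
# Consistency check of the N = 10 figures with Equation (17): end-to-end latency
# minus accumulated per-step cost should roughly equal the ~0.07 s ECAE overhead.
n_steps, t_step, t_total = 10, 0.179, 1.857
print(t_total - n_steps * t_step)   # ~0.067 s, close to the ~0.07 s ECAE cost

# Relative savings over CLEDiff at N = 10:
print((0.224 - 0.179) / 0.224)      # ~0.20 per-step reduction
print((2.240 - 1.857) / 2.240)      # ~0.17 end-to-end reduction
```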
It is also worth noting that the proposed ECA-Diff is orthogonal to backbone-lightening and step-reduction techniques used by DiffLL and PyDiff. Since ECAE and TCF are designed as plug-in components on top of a standard DDPM backbone, they can in principle be combined with wavelet-domain parameterization, pyramid formulations, or other advanced sampling schedules, opening up additional room for jointly improving both efficiency and restoration quality.
(3) Efficiency Under Comparable Capacity. In this study, we compare the ECAE-equipped model against backbones whose capacity is increased without ECAE to a comparable level (i.e., a similar parameter count and/or per-step FLOPs). We use ch_mult to denote the stage-wise channel multipliers of the U-Net encoder–decoder, and attn_use to list the stages (1-indexed) at which global attention (GA) blocks are inserted. For a fair comparison, all models listed in Table 9, including ECA-Diff (Full), are trained from scratch under the same setting for 8000 rounds. We report PSNR/SSIM together with model size, FLOPs per step, and $T_{\text{total}}$ under DDIM sampling with N = 100 on full-resolution LOL images.
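As a concrete illustration of this notation, backbone variants of the kind listed in Table 9 could be specified along the following lines; the multipliers and attention placements below are assumptions for illustration, not the settings actually used.

```python
# Hypothetical backbone configurations in the ch_mult / attn_use notation used
# above; the specific values are illustrative, not those of Table 9.
variants = {
    #                        stage-wise channel multipliers   stages (1-indexed) with GA
    "Baseline":            {"ch_mult": (1, 2, 4, 4),    "attn_use": ()},
    "Deeper":              {"ch_mult": (1, 2, 4, 4, 4), "attn_use": ()},
    "Deeper + Wider + GA": {"ch_mult": (1, 2, 4, 8, 8), "attn_use": (4, 5)},
}

def describe_backbone(ch_mult, attn_use, base_ch=64):
    """Sketch: derive per-stage widths and flag which stages get global attention."""
    stages = []
    for i, mult in enumerate(ch_mult, start=1):
        stages.append({"channels": base_ch * mult, "global_attention": i in attn_use})
    return stages

for name, cfg in variants.items():
    print(name, describe_backbone(**cfg))
```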
From Table 9 it is clear that enlarging the backbone alone does not guarantee higher LLIE quality. The Deeper model approximately doubles the parameter count (83.1 M → 179.1 M) and slightly increases per-step FLOPs from 3400 G to 3593 G, yet its accuracy decreases to 24.93/0.823, indicating that the improvement delivered by our method arises from task-aligned design rather than sheer model capacity. Increasing capacity further with Deeper + Wider + GA strengthens long-range modeling, but the quality remains below that of ECA-Diff (Full) (25.31/0.833 versus 26.46/0.864) despite similar compute (4286 G versus 4417 G FLOPs per step). In addition, adding GA to shallow, large-resolution stages leads to a much higher per-image latency (27.59 s), whereas ECA-Diff (Full) maintains a lower per-image latency (19.35 s) by computing global context once in the latent space and reusing it across diffusion steps. Overall, we conclude that the superior performance-efficiency trade-off of ECA-Diff results from external, one-off context extraction with lightweight per-step injection rather than brute-force increases in parameters or FLOPs.
(4) TCF vs. LDM-Style Conditioning. Table 10 compares three context injection strategies under incremental multi-stage injection. To focus on the fusion overhead, we report backbone parameters and the per-step sampling latency $t_{\text{step}}$, excluding the one-time ECAE cost. In the LDM-style token cross-attention (CA) scheme, the ECAE output is compressed into a fixed-size token map, and each selected stage performs cross-attention with these tokens. Adding more injection stages accumulates additional cross-attention blocks over multi-scale features, leading to a steady increase in $t_{\text{step}}$ from the mid-only setting to the full M + U configuration. In the spatial CA scheme, ECAE features are resized to match each stage, and full-resolution cross-attention is applied in the spatial domain. The complexity grows rapidly with spatial size, causing a sharp rise in $t_{\text{step}}$ when multiple decoder stages are enabled and resulting in out-of-memory failures for deeper multi-stage variants.
The proposed TCF instead aligns ECAE features with backbone resolutions and fuses them via lightweight time modulation and channel concatenation, without explicit attention. Its cost scales approximately linearly with the number of injection stages, and the ECA-Diff (M + U) configuration maintains a per-step latency of 0.180 s with 99.18 M backbone parameters, substantially lower than the LDM-style and spatial CA counterparts. This contrast is consistent with the original motivation of LDM-style cross-attention, which is efficient when the backbone operates in a low-resolution latent space with fixed-length text tokens, but becomes less suitable for full-resolution, detail-sensitive LLIE; TCF better matches the design characteristics of ECA-Diff by providing scalable external guidance with well-controlled overhead.
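For intuition, the following minimal sketch shows a TCF-like fusion step in the spirit described above (resolution alignment, time-conditioned modulation, then channel concatenation and projection); the module name, layer choices, and widths are assumptions, not the actual TCF implementation.

```python
# Minimal sketch (not the actual TCF module) of attention-free context fusion:
# resize the external context, modulate it with the diffusion-time embedding,
# then concatenate along channels and project back to the backbone width.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TCFSketch(nn.Module):
    def __init__(self, backbone_ch=64, ctx_ch=64, time_dim=128):
        super().__init__()
        # Time-conditioned scale/shift for the external context (FiLM-style).
        self.to_scale_shift = nn.Linear(time_dim, 2 * ctx_ch)
        # 1x1 projection after channel concatenation back to the backbone width.
        self.proj = nn.Conv2d(backbone_ch + ctx_ch, backbone_ch, kernel_size=1)

    def forward(self, feat, ctx, t_emb):
        # feat:  (B, backbone_ch, H, W) backbone feature at some stage
        # ctx:   (B, ctx_ch, h, w) once-computed ECAE context, reused every step
        # t_emb: (B, time_dim) diffusion timestep embedding
        ctx = F.interpolate(ctx, size=feat.shape[-2:], mode="bilinear",
                            align_corners=False)           # align resolution
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=1)
        ctx = ctx * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.proj(torch.cat([feat, ctx], dim=1))     # concat + project


# Example: fuse a 64-channel stage feature with a reused 64-channel context map.
fuse = TCFSketch()
out = fuse(torch.randn(2, 64, 64, 64), torch.randn(2, 64, 16, 16), torch.randn(2, 128))
```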