Article

SNR-Guided Enhancement and Autoregressive Depth Estimation for Single-Photon Camera Imaging

1 School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai 201209, China
2 Department of Engineering, Durham University, Durham DH1 3LE, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 245; https://doi.org/10.3390/app16010245
Submission received: 12 November 2025 / Revised: 17 December 2025 / Accepted: 23 December 2025 / Published: 25 December 2025

Abstract

Recent advances in deep learning have intensified the need for robust low-light image processing in critical applications like autonomous driving, where single-photon cameras (SPCs) offer high photon sensitivity but produce noisy outputs requiring specialized enhancement. This work addresses this challenge through a unified framework integrating three key components: an SNR-guided adaptive enhancement framework that dynamically processes regions with varying noise levels using spatial-adaptive operations and intelligent feature fusion; a specialized self-attention mechanism optimized for low-light conditions; and a conditional autoregressive generation approach applied to robust depth estimation from enhanced SPC images. Our comprehensive evaluation across multiple datasets demonstrates improved performance over state-of-the-art methods, achieving a PSNR of 24.61 dB on the LOL-v1 dataset and effectively recovering fine-grained textures in depth estimation, particularly in real-world SPC applications, while maintaining computational efficiency. The integrated solution effectively bridges the gap between single-photon sensing and practical computer vision tasks, facilitating more reliable operation in photon-starved environments through its novel combination of adaptive noise processing, attention-based feature enhancement, and generative depth reconstruction.

1. Introduction

Over the last ten years, deep learning has revolutionized computational inference, setting new benchmarks for accuracy in fields ranging from natural language processing to autonomous systems. However, as these technologies migrate from controlled laboratory environments to safety-critical real-world applications—such as self-driving vehicles—robustness has become a primary performance metric alongside accuracy. Visual perception systems in autonomous agents face severe challenges when operating in suboptimal conditions, such as extreme darkness, adverse weather, or high-speed motion blur. For instance, nighttime tracking of fast-moving objects poses a fundamental difficulty: the photon count is critically low, resulting in inherently sparse and noisy data acquisition. Under these photon-starved scenarios, traditional inference algorithms often falter due to the severe degradation of visual information.
To address this, computational imaging fusion techniques have gained attention. In particular, deep learning-based low-light enhancement methods have been proposed to recover image details and suppress noise from a single frame [1,2]. In this work, a signal-to-noise ratio (SNR)-based enhancement method that significantly improves visual clarity in low-light conditions is proposed, thereby enabling more robust performance in downstream tasks. At the same time, the field of artificial intelligence is undergoing a major paradigm shift with the emergence of large autoregressive models, such as GPT [3,4] and its successors [5,6,7]. These models have demonstrated strong generalization and versatility, despite challenges like hallucination, and are widely regarded as a step toward artificial general intelligence (AGI). Fundamentally, they are powered by a simple yet effective self-supervised learning mechanism: predicting the next token in a sequence. Inspired by the success of autoregressive models in natural language processing, researchers in computer vision have begun to explore similar architectures for image generation [8]. Pioneering work like DALL·E [9] demonstrated how continuous images can be discretized into 2D grids of visual tokens and flattened into 1D sequences for autoregressive training. The ability to accurately model spatial dependencies in such token sequences is central to the success of these models.
In this context, we propose an autoregressive image generation framework based on single-photon imaging, specifically targeting robust modeling and depth estimation under extremely low-light conditions. To address the inherent challenges of weak signals, high noise, and sparse photon information, our approach integrates signal-to-noise ratio (SNR) guidance with advanced sequence modeling. To summarize, the main contributions of this work are as follows:
  • An SNR-Guided Adaptive Enhancement Framework: We propose a dual-branch architecture that dynamically fuses short-range and long-range features based on an estimated SNR map. This effectively distinguishes between signal-dominated and noise-dominated regions, allowing for spatially adaptive processing.
  • Specialized Attention Mechanism: A novel SNR-guided self-attention module is introduced to suppress noise propagation in low-light regions. By masking irrelevant features in low-SNR areas, the model preserves global dependencies in high-quality regions without amplifying artifacts.
  • Autoregressive Depth Estimation for SPC: We present an end-to-end autoregressive generation framework tailored for single-photon data. By integrating conditional tokens and a control encoder, this solution effectively bridges the gap between photon-limited sensing and downstream depth estimation tasks, demonstrating superior robustness compared to traditional methods. Unlike standard depth estimation models (e.g., AdaBins [X], DPT [Y]) which are designed for dense RGB images and rely on rich texture information, our autoregressive approach is specifically tailored to handle the sparsity and uncertainty inherent in single-photon data. Direct application of these RGB-based baselines to SPAD data without extensive retraining typically yields suboptimal results due to the significant domain gap.

2. Related Work

2.1. Current Research on Visual Neural Networks

Visual neural networks (VNNs) represent one of the most significant research topics in the field of deep learning. They have been widely applied to various computer vision tasks such as image classification, object detection, and scene segmentation, with application domains extending to transportation, healthcare, finance, and more. Currently, VNNs can be broadly categorized into two main types: CNNs and Transformer-based networks.
Over the past decade, CNNs have dominated visual tasks, continuously evolving through architectural innovations and a series of milestone models. However, in recent years, Transformer networks have achieved remarkable success in both natural language processing and computer vision, demonstrating breakthrough performance across a wide range of vision tasks. Notably, in specific applications, Transformer-based backbones have significantly outperformed CNNs, largely due to their superior modeling capabilities. CNNs are designed to process structured multi-dimensional data, typically through hierarchical convolution operations for feature extraction. Their parameters are trained with gradients computed by the backpropagation algorithm, and their design rests on two core concepts: local connectivity and weight sharing, further enhanced by various activation functions and neural units. In 2012, AlexNet [10] demonstrated the feasibility of deep CNNs by applying the ReLU activation function and dropout regularization, achieving a top-5 error rate of 16.4% on the ImageNet classification task. In 2014, VGGNet [11] explored the impact of network depth by stacking 3 × 3 convolutional kernels, revealing a strong correlation between model depth and feature extraction capability. It achieved a top-5 accuracy of 92.3% in the ILSVRC 2014 competition. In the same year, GoogLeNet [12] introduced the Inception module, a “fundamental neuron” design that enabled a sparse and computationally efficient network architecture, reducing the parameter count to 1/12 of AlexNet while achieving a 6.7% top-5 error rate and winning ILSVRC 2014.
To address the issues of gradient vanishing and difficulty in modeling long-range dependencies in CNNs, ResNet [13] introduced skip connections within a residual learning framework, enabling networks to scale up to 152 layers. ResNet achieved a top-5 error rate of 3.57% on ImageNet, with its core innovation lying in the direct propagation of low-level features to deeper layers, which alleviated gradient vanishing and improved feature reuse. These transformative advances have continuously pushed the development of CNNs, introducing new perspectives and fostering ongoing innovation in the field.
The introduction of the Transformer architecture marked a new era for visual neural networks. Its core self-attention mechanism dynamically establishes global contextual dependencies by computing similarity weights between pixels. DETR [14] was the pioneering work to apply vision Transformers to object detection, using a bipartite matching loss to align predicted and ground-truth bounding boxes. Inspired by the GPT model, iGPT [15] first introduced Transformers to image generation and classification tasks. The Vision Transformer (ViT) [16] subsequently followed a standard Transformer architecture and became the first large-scale vision network fully based on Transformers. In ViT, input images are split into patches, which are linearly projected and fed into a Transformer encoder. The features extracted in this way have demonstrated strong performance across various downstream tasks. Recent research has focused on proposing new variants of the Transformer structure to further improve model performance. This study aims to explore the representational power of Transformer architectures in depth and apply them to autoregressive modeling for image generation tasks.

2.2. Current Research Status of Image Generation

With the continuous development of deep learning technologies, the capabilities and precision of image generation have significantly improved, demonstrating great potential in fields such as image processing, artistic creation, and medical imaging.
Diffusion models have emerged as the dominant paradigm in generative modeling, particularly excelling in image synthesis. These models progressively add noise to the original image during training and reconstruct the image through iterative denoising during inference. This training mechanism enables the model to sample in spatial space and generate images via the denoising process. Since the introduction of diffusion models, research has primarily focused on optimizing training and sampling strategies. To reduce computational complexity and improve efficiency, many studies have attempted to migrate the generation process to latent spaces. In text-to-image generation tasks, mainstream frameworks typically adopt U-Net [17] as the denoising network and utilize pretrained CLIP [18] as the text encoder to extract textual features, which are then integrated into the denoising process through cross-attention mechanisms. DiT [19], which employs a Transformer-based denoising architecture, has achieved highly competitive results in image generation. Current research on diffusion models mainly concentrates on improved learning and sampling strategies [20], guidance mechanisms [21], latent space modeling [22], and architectural optimizations [23].
In contrast to the iterative denoising process of diffusion models, autoregressive models generate images by predicting the next image token based on previously generated ones. Early autoregressive models, such as raster-scan methods, focused on pixel-by-pixel prediction, requiring the encoding of 2D images into 1D token sequences. Previous work [24] has demonstrated that images can be generated sequentially in a standard raster-scan order. VQGAN [25] builds upon VQVAE and performs autoregressive modeling in the latent space, adopting a GPT-2 style Transformer decoder to generate tokens in raster order. This approach is similar to how ViT [16] serializes 2D images into patches. Subsequent models such as VQVAE-2 [26] and RQ-Transformer [27] also follow this raster-scan paradigm, introducing multi-scale or hierarchical encoding mechanisms to enhance generation efficiency and stability. More recent studies increasingly employ efficient language models as autoregressive generators. For instance, LlamaGen [28] and Open-MAGVIT2 utilize the LLaMA architecture as the generation backbone, achieving state-of-the-art performance on multiple datasets and showcasing the architecture’s tremendous potential in image generation. AiM [29] explores a novel generation approach using the Mamba model, yielding promising results. Lumina-mGPT [30] introduces a series of multimodal autoregressive models capable of performing a wide range of visual and language tasks, particularly excelling in generating diverse and realistic images from text descriptions. Additionally, some recent works aim to integrate autoregressive and diffusion models into unified multimodal systems that support both image generation and understanding.
Masked prediction models represent a class of self-supervised learning methods based on predicting randomly masked tokens from the remaining context, without requiring explicitly labeled data. By masking parts of the input and training the model to recover the missing information, this learning paradigm forces the model to understand the semantic relationships and contextual dependencies between tokens. As a result, masked prediction models significantly improve performance across a wide range of downstream tasks. MaskGIT [31], for example, combines a VQ encoder with a BERT-like [32] masked Transformer architecture and applies a greedy algorithm to generate VQ tokens, achieving impressive results.
In text-to-image generation tasks, textual prompts often struggle to precisely convey unique artistic styles or complex visual details. To overcome this limitation, some studies explore guiding image synthesis using conceptual features extracted from example images—features that are difficult to describe accurately in language. This research direction is generally referred to as controllable personalized image generation. Representative works employ spatial structure information such as edge maps, segmentation masks, and depth maps to control the generation process. Subsequent research has proposed improved designs for conditional encoders and more effective training strategies. Other studies combine multimodal encoders with stable diffusion models to enable sound-to-image generation. In contrast, controllable generation based on autoregressive image models remains relatively underexplored. Although some approaches attempt to jointly model control signals and image content through next-token prediction, they fundamentally differ from the standard token-by-token generation paradigm. Therefore, this work aims to investigate effective paradigms for controllable image generation within the autoregressive framework, in order to fully unlock the potential of autoregressive models.

2.3. Current Situation of Inference on Single-Photon Cameras

Currently, large-scale single-photon camera arrays primarily adopt two technical approaches: SPADs (Single-Photon Avalanche Diodes) and jots. Jots amplify single-photon signals through active pixels with high conversion gain [33], thereby avoiding the avalanche effect. This enables smaller pixel sizes, higher quantum efficiency, and lower dark current, although at the cost of relatively lower temporal resolution. Although the experiments in this paper are conducted on data collected using SPAD-based sensors (see Figure 1), the proposed computational methods are also applicable to various single-photon sensors, including jots.
Since the early theoretical proposition of performing computer vision tasks directly on photon streams, the application of quantum sensors in scene inference has shown continuous growth. Use cases include high-speed tracking based on quantum sensors, as well as more recent tasks such as object recognition and image classification. This paper further explores this research direction by providing a high-performance solution for inference tasks based on single-photon camera data.
The core idea is to enhance images captured under low-light conditions with single-photon cameras. Traditional image enhancement methods mainly rely on histogram equalization and gamma correction to expand the dynamic range and improve image contrast. However, these basic techniques often introduce artifacts when applied to real-world scenes, leading to suboptimal results. Methods based on Retinex theory estimate the reflectance component as an approximation for image enhancement and can produce more natural and realistic outputs [34,35], though they tend to suffer from local color distortions when handling complex real-world images. In recent years, a variety of deep learning-based methods [36,37,38] have emerged. For instance, Wang et al. [39] enhanced underexposed images by predicting illumination maps, while Sean et al. [40] developed an enhancement strategy based on learning three types of spatial local filters. Similarly, Yang et al. [41] introduced a semi-supervised approach to restore the linear frequency representation of low-light images, and Guo et al. [37] constructed a lightweight network to estimate pixel-wise high-order curves for dynamic range adjustment. Additionally, various unsupervised methods [42] have also contributed to this field of research.
Unlike existing methods, this paper introduces a novel Signal-to-Noise Ratio (SNR)-aware framework, which features a newly designed SNR-perception Transformer and convolutional model capable of enhancing images in a spatially adaptive manner.

3. Problem Description and Model Design

Low-light imaging plays a critical role in visual tasks such as nighttime object detection and behavior recognition [38]. However, due to limited illumination conditions, images captured in low-light scenarios often suffer from poor visibility, which not only impairs human perception but also degrades the performance of downstream vision models. To address this issue, various image enhancement methods have been proposed. Mainstream approaches utilize neural networks to learn adjustment strategies for image attributes such as color, tone, and contrast to improve overall quality [11], while some recent studies further consider the impact of noise characteristics [43]. Nevertheless, most existing methods treat enhancement and image generation as separate processes, lacking effective integration and controllability across tasks.
To bridge this gap, a controllable low-light image enhancement and generation framework tailored for single-photon camera imagery is proposed. The framework consists of two main stages. First, the input low-light image I is enhanced using a signal-to-noise-ratio-aware neural network, producing an improved image I′. Then, control conditions are introduced to guide the generation process, enhancing controllability and task adaptability. Specifically, a control encoder is designed to extract features from a reference control image, generating a control condition sequence of length L. Next, a VQGAN-based encoder is used to convert the image into a one-dimensional token sequence, into which control tokens are fused via positional alignment and element-wise addition. Specifically, the control tokens are first projected to match the dimension of the autoregressive latent tokens using a learnable linear layer, and then directly added to the image tokens to inject conditional information at each step during autoregressive generation. In addition, a multi-layer perceptron (MLP) is applied to map control tokens before fusion, and three standard layers in the autoregressive model are replaced with conditional sequence layers to further strengthen the influence of control conditions. The entire model is trained in an end-to-end manner, continuously improving the expressiveness and controllability of both enhancement and generation.
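The snippet below is a minimal sketch of the control-token fusion step described above: control-encoder features are projected to the width of the autoregressive latent tokens by a learnable linear layer, optionally mapped through an MLP, and added element-wise to the image tokens. The module and dimension names (d_ctrl, d_model) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ControlTokenFusion(nn.Module):
    """Project position-aligned control tokens and add them to image tokens."""
    def __init__(self, d_ctrl: int = 512, d_model: int = 768):
        super().__init__()
        self.proj = nn.Linear(d_ctrl, d_model)          # match AR token width
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, image_tokens: torch.Tensor, control_tokens: torch.Tensor):
        # image_tokens:   (B, L, d_model) VQGAN token embeddings
        # control_tokens: (B, L, d_ctrl)  control-encoder features
        c = self.mlp(self.proj(control_tokens))
        return image_tokens + c                         # element-wise additive injection

# Example: fuse 256 control tokens into 256 image tokens.
fused = ControlTokenFusion()(torch.randn(2, 256, 768), torch.randn(2, 256, 512))
```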

3.1. Low-Light Image Enhancement

In images captured by single-photon cameras, data enhancement is essential for supporting subsequent image generation tasks. Low-light images typically exhibit significant spatial variation in illumination, noise levels, and visibility. Within the same image, certain regions may preserve adequate contrast and visibility, while others—particularly those with extremely low illumination—tend to be severely corrupted by noise. This observation suggests that high-quality enhancement requires region-specific and adaptive strategies, rather than applying a uniform approach across the entire image.
To address this, the concept of SNR [31] is introduced to quantify the local relationship between signal and noise, enabling spatially adaptive enhancement. Pixels are categorized based on their local SNR characteristics. In high-SNR regions, local information remains relatively reliable, allowing enhancement to focus on preserving and refining local details. In contrast, low-SNR regions suffer from insufficient valid local information and are dominated by noise, requiring the integration of non-local, long-range features from more reliable areas to guide enhancement.
To accommodate both scenarios, a dual-branch architecture is designed. The short-range branch, built with convolutional residual blocks, emphasizes local feature modeling and refinement. The long-range branch, based on a Transformer architecture, captures global dependencies through self-attention mechanisms. To prevent the attention mechanism from being misled by noise in low-SNR areas, an SNR-guided attention module is incorporated. This mechanism dynamically modulates the attention weights to suppress noise-dominated regions while enhancing the contribution of reliable areas.
To integrate the complementary strengths of local and global features, an SNR-aware fusion strategy is applied, enabling adaptive combination of information from both branches. Furthermore, skip connections between the encoder and decoder are included to preserve fine-grained structural details throughout the enhancement process.

3.1.1. Design of Short- and Long-Range Branches

The short-range branch is built from convolutional residual blocks and mainly captures local information. In contrast, the long-range branch is based on the Transformer architecture, which can effectively model long-range dependencies through global self-attention, a property validated in multiple high-level vision tasks [41] as well as low-level vision tasks [44,45]. In the long-range branch, the feature map $F$ extracted by the encoder from the input image $I \in \mathbb{R}^{H \times W \times 3}$ is first divided into $m$ feature blocks, denoted as $F_i \in \mathbb{R}^{p \times p \times C}$ with $i = 1, \ldots, m$. Assuming the feature map $F$ has size $h \times w \times C$ and each feature block has size $p \times p$, there are $m = \frac{h}{p} \times \frac{w}{p}$ feature blocks in total, which fully cover the feature map.
The proposed SNR-Transformer is a patch-based structure. It mainly consists of a multi-head self-attention module and a feedforward network, each containing two fully connected layers. The output features $F'_1, F'_2, \ldots, F'_m$ of the Transformer match the size of the input feature blocks. The features $F_1, F_2, \ldots, F_m$ are flattened into one-dimensional tokens and computed as follows:
$y_0 = [F_1, F_2, \ldots, F_m]$
$q_i = k_i = v_i = \mathrm{LN}(y_{i-1}), \quad i = 1, \ldots, l$
$\hat{y}_i = \mathrm{MSA}(q_i, k_i, v_i) + y_{i-1}$
$y_i = \mathrm{FFN}(\mathrm{LN}(\hat{y}_i)) + \hat{y}_i$
$[F'_1, F'_2, \ldots, F'_m] = y_l$
Herein: LN denotes Layer Normalization, which normalizes input features across the channel dimension to stabilize training dynamics and accelerate convergence; $y_i$ represents the output of the $i$-th Transformer block; MSA refers to the signal-to-noise-ratio-guided multi-head self-attention module designed in this study; $q_i, k_i, v_i$ denote the query, key, and value vectors, respectively, in the MSA module of the $i$-th layer; and $l$ indicates the total number of Transformer layers. Finally, the features $F'_1, F'_2, \ldots, F'_m$ processed by the Transformer are reassembled into a two-dimensional feature map $F_l$.
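A minimal PyTorch sketch of one long-range branch block following the equations above is shown below, using a standard multi-head attention layer as a stand-in for the SNR-guided MSA (the mask construction appears in Section 3.1.3); the dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class LongRangeBlock(nn.Module):
    """Pre-LN Transformer block: y_hat = MSA(LN(y)) + y; y' = FFN(LN(y_hat)) + y_hat."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        # Feedforward network with two fully connected layers, as stated in the text.
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, y: torch.Tensor, attn_mask=None):
        q = self.ln1(y)                                  # q_i = k_i = v_i = LN(y_{i-1})
        y_hat = self.msa(q, q, q, attn_mask=attn_mask)[0] + y
        return self.ffn(self.ln2(y_hat)) + y_hat

# m flattened patch tokens of dimension C: (batch, m, C)
out = LongRangeBlock(dim=64)(torch.randn(1, 196, 64))
```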

3.1.2. Spatially Varying Feature Fusion Based on Signal-to-Noise Ratio

The proposed framework first requires estimating the SNR map of the input image. Since only a single noisy image is provided, estimating its noise level and preparing a corresponding clean image I to compute the per-pixel SNR is a challenging and laborious task. Following traditional learning-free denoising methods [40], the noise is modeled as discontinuous variations between spatially adjacent pixels. The noise component can be formulated as the difference between the noisy image and its corresponding clean image.
This modeling approach is leveraged to estimate the SNR map of image $I$ and utilize it as an effective prior for spatially varying feature fusion. Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, we first convert it into a grayscale image $I_g \in \mathbb{R}^{H \times W}$ and then compute the SNR map $S \in \mathbb{R}^{H \times W}$ as follows:
$\hat{I}_g = \mathrm{denoise}(I_g)$
$N = \mathrm{abs}(I_g - \hat{I}_g)$
$S = \hat{I}_g / (N + \varepsilon)$
where $\varepsilon = 10^{-6}$ prevents division by zero.
Herein: $\mathrm{denoise}(\cdot)$ denotes a learning-free denoising operation, implemented in subsequent experiments as local pixel block averaging; $\mathrm{abs}(\cdot)$ represents the absolute value operation; and $N \in \mathbb{R}^{H \times W}$ is the estimated noise map. Although the derived signal-to-noise ratios are approximate estimates due to the inherent bias of learning-free noise estimation, this lightweight strategy avoids the heavy computational burden of complex noise modeling networks. More importantly, in our framework the SNR map serves primarily as a spatial guidance weight to distinguish signal-dominated regions from noise-dominated ones, rather than as a precise metrological measurement. Experimental results demonstrate that using this efficient SNR map as prior information effectively enhances the performance of our framework while maintaining fast inference speeds.
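The following sketch implements the learning-free SNR map estimation above: grayscale conversion, local mean filtering as the denoiser, and the ratio $S = \hat{I}_g / (N + \varepsilon)$. The kernel size of the block averaging is an assumed hyperparameter not specified in the paper.

```python
import torch
import torch.nn.functional as F

def estimate_snr_map(img_rgb: torch.Tensor, kernel: int = 5, eps: float = 1e-6):
    """Estimate a per-pixel SNR map from a single noisy RGB image in [0, 1]."""
    gray = img_rgb.mean(dim=1, keepdim=True)                   # I_g, shape (B, 1, H, W)
    denoised = F.avg_pool2d(gray, kernel, stride=1,
                            padding=kernel // 2)               # Î_g via local block averaging
    noise = (gray - denoised).abs()                            # N = |I_g - Î_g|
    return denoised / (noise + eps)                            # S = Î_g / (N + ε)

snr = estimate_snr_map(torch.rand(1, 3, 256, 256))
```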
After obtaining the SNR map, it is employed for spatially varying feature fusion. An encoder $E$ first extracts a feature $F$ from the input image $I$. This feature is then fed into the long-range and short-range branches, producing long-range features $F_l \in \mathbb{R}^{h \times w \times C}$ and short-range features $F_s \in \mathbb{R}^{h \times w \times C}$, respectively. To adaptively fuse these features, the SNR map is resized to dimensions $h \times w$ and its values are normalized to the range [0, 1]. The normalized SNR map $S'$ serves as the interpolation weight for fusion, formulated as:
$F' = F_s \times S' + F_l \times (1 - S')$
$F' \in \mathbb{R}^{h \times w \times C}$ denotes the final fused feature, which is subsequently passed to the decoder to generate the output image. Since the values in the SNR map dynamically reflect the noise levels across different regions of the input image, this fusion strategy adaptively combines local (short-range) and non-local (long-range) image information, thereby generating the final feature $F'$ more effectively.
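A short sketch of this SNR-guided fusion is given below: the SNR map is resized to the feature resolution, normalized to [0, 1] per image, and used as the interpolation weight between the two branches. The min-max normalization is an assumed choice consistent with the description above.

```python
import torch
import torch.nn.functional as F

def snr_fuse(f_short: torch.Tensor, f_long: torch.Tensor, snr: torch.Tensor):
    """F' = F_s * S' + F_l * (1 - S'), with S' resized and normalized to [0, 1]."""
    s = F.interpolate(snr, size=f_short.shape[-2:], mode="bilinear",
                      align_corners=False)
    s_min = s.amin(dim=(2, 3), keepdim=True)
    s_max = s.amax(dim=(2, 3), keepdim=True)
    s = (s - s_min) / (s_max - s_min + 1e-6)                   # normalized SNR weight S'
    return f_short * s + f_long * (1.0 - s)

fused = snr_fuse(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64),
                 torch.rand(1, 1, 256, 256))
```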

3.1.3. SNR-Guided Transformer Attention Mechanism

While conventional Transformer architectures excel at capturing non-local image information, they exhibit significant limitations when processing noisy images. In standard Transformers, attention is computed across all image patches indiscriminately. When enhancing a given pixel, its long-range attention may incorporate information from any image region—regardless of local signal or noise characteristics. This leads to a critical flaw: low-SNR regions, being noise-dominated and inherently unreliable, can propagate erroneous features during attention computation, ultimately degrading enhancement performance.
To address this, an SNR-guided attention mechanism that adaptively modulates attention based on local noise levels is proposed. Given an input image $I \in \mathbb{R}^{H \times W \times 3}$ and its corresponding SNR map $S \in \mathbb{R}^{H \times W}$, we first resize $S$ to match the feature map dimensions, obtaining $S \in \mathbb{R}^{h \times w}$. Both $S$ and the feature map $F$ are partitioned into $m$ patches, and the average SNR value $S_i \in \mathbb{R}$ is computed for each patch. These values form a vector $S' \in \mathbb{R}^{m}$, which serves as a binary mask to block information flow from severely noise-corrupted regions. The mask is defined as:
$S'_i = \begin{cases} 0, & S_i < s \\ 1, & S_i \geq s \end{cases}, \quad i = 1, \ldots, m$
where $s$ is a predefined threshold set to 0.5 in our experiments. The mask is then replicated into a matrix $S'' \in \mathbb{R}^{m \times m}$.
For the $b$-th attention head in the $i$-th Transformer layer, the computations proceed as:
$Q_{i,b} = q_i W_b^{q}, \quad K_{i,b} = k_i W_b^{k}, \quad V_{i,b} = v_i W_b^{v}$
$\mathrm{Attention}_{i,b}\left(Q_{i,b}, K_{i,b}, V_{i,b}\right) = \mathrm{Softmax}\left(\frac{Q_{i,b} K_{i,b}^{T}}{\sqrt{d_k}} + (1 - S'')\,\sigma\right) V_{i,b}$
where $q_i, k_i, v_i \in \mathbb{R}^{m \times p \times p \times C}$ denote the input features and $W_b^{q}, W_b^{k}, W_b^{v} \in \mathbb{R}^{(p \times p \times C) \times C_k}$ are projection matrices. The attention weight matrix has shape $m \times m$, and the final attention output is of size $m \times C_k$. $\sqrt{d_k}$ serves as a scaling factor to stabilize gradient magnitudes during training, and $\sigma = -1 \times 10^{9}$ drives masked (low-SNR) positions toward near-zero attention weights after the softmax. Multi-head outputs are concatenated and projected to yield the final layer output. This mechanism ensures attention focuses primarily on high-SNR regions.
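The sketch below illustrates the masked attention of a single head: patch-level SNR values are thresholded into a binary mask, and low-SNR patches receive a large negative bias so their softmax weights collapse toward zero. Shapes and the threshold value are illustrative.

```python
import torch
import torch.nn.functional as F

def snr_guided_attention(q, k, v, patch_snr, thresh: float = 0.5, sigma: float = -1e9):
    """Single-head attention with SNR-based key masking.

    q, k, v: (B, m, d_k) projected patch features; patch_snr: (B, m) average SNR per patch.
    """
    mask = (patch_snr >= thresh).float()                       # S'_i ∈ {0, 1}
    bias = (1.0 - mask).unsqueeze(1) * sigma                   # (B, 1, m), broadcast over queries
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5) + bias
    return F.softmax(logits, dim=-1) @ v                       # (B, m, d_k)

out = snr_guided_attention(torch.randn(1, 196, 32), torch.randn(1, 196, 32),
                           torch.randn(1, 196, 32), torch.rand(1, 196))
```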

3.1.4. Loss Functions

The input image $I$ is encoded into a feature $F$ via a convolutional encoder (stacked Conv + LeakyReLU layers with residual blocks). $F$ is processed by the long-range and short-range branches to obtain $F_l$ and $F_s$, which are fused into $F'$ using SNR-guided weights. A symmetric decoder reconstructs a residual image $R$, with final output $I' = I + R$.
This paper employs two reconstruction loss functions to train the model: Charbonnier loss and perceptual loss.
The Charbonnier loss is defined as:
$\mathcal{L}_r = \sqrt{\left\| I' - \hat{I} \right\|^{2} + \varepsilon^{2}}$
where $\hat{I}$ is the ground-truth image and $\varepsilon$ is a hyperparameter set to $10^{-3}$ in our experiments. The perceptual loss measures the L1-norm difference between VGG features of the output and the ground truth:
$\mathcal{L}_{vgg} = \left\| \phi(I') - \phi(\hat{I}) \right\|_{1}$
where $\phi(\cdot)$ denotes the feature extraction operation of the VGG network. The total loss function is:
$\mathcal{L} = \mathcal{L}_r + \lambda \mathcal{L}_{vgg}$
where $\lambda$ is a hyperparameter set to 0.1.
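A compact sketch of this training objective is shown below, combining the Charbonnier term with a VGG-based perceptual term as $\mathcal{L} = \mathcal{L}_r + \lambda \mathcal{L}_{vgg}$. The specific VGG variant and feature depth are assumptions; the paper only states that a VGG network is used.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class EnhancementLoss(nn.Module):
    """Charbonnier reconstruction loss plus a frozen-VGG perceptual loss."""
    def __init__(self, lam: float = 0.1, eps: float = 1e-3):
        super().__init__()
        self.lam, self.eps = lam, eps
        self.vgg = vgg16(weights="DEFAULT").features[:16].eval()   # assumed feature depth
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, pred: torch.Tensor, target: torch.Tensor):
        charbonnier = torch.sqrt((pred - target) ** 2 + self.eps ** 2).mean()
        perceptual = (self.vgg(pred) - self.vgg(target)).abs().mean()  # L1 on VGG features
        return charbonnier + self.lam * perceptual

loss = EnhancementLoss()(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```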

3.2. Learning Quantized Representations for Images

The Transformer architecture is designed to model long-range dependencies in sequential data and has consistently demonstrated state-of-the-art performance across various tasks. Unlike CNNs, Transformers lack an inherent inductive bias toward local interactions. This property grants them strong expressive power and is a key factor in their success.
This paper presents a method to combine the inductive bias advantages of CNNs with the expressive capabilities of Transformers, enabling more effective modeling and synthesis of images. Our approach consists of two key steps. First, learning a context-rich codebook of image components using a CNN, and secondly employing a Transformer to efficiently model the compositional relationships among these components in high-resolution images.
The core hypothesis of our method is that low-level image structures can be effectively modeled via local connectivity (i.e., convolutional operations), whereas this assumption becomes less valid at higher semantic levels. Furthermore, while CNNs exhibit both locality bias and spatial invariance through weight sharing, they may struggle to capture global contextual understanding. We argue that integrating CNNs and Transformers is crucial for developing highly expressive models capable of effectively representing visual structures. Specifically, a convolutional architecture is used to efficiently learn a semantically meaningful local codebook, followed by a Transformer to model global compositional patterns among these components. To ensure that the local codebook captures perceptually significant image structures, an adversarial learning strategy is adopted, reducing the Transformer’s burden in modeling low-level details.
The final task involves depth estimation from single-photon image data. To leverage the Transformer's strong generative capabilities, the image content is represented as a discrete sequence [25]. Rather than modeling raw pixels directly, the complexity of images necessitates a discrete codebook-based representation. Formally, an image $x \in \mathbb{R}^{H \times W \times 3}$ can be represented as a spatially arranged collection of codebook entries $z_q \in \mathbb{R}^{h \times w \times n_z}$, where $n_z$ denotes the code dimension. Equivalently, the image can be represented as a sequence of $h \times w$ indices, each referencing a corresponding entry in the codebook.

3.2.1. Learning Efficient Image Codebooks

To effectively learn a discrete codebook of image components in this space, the proposed method incorporates the inductive bias of CNNs into the encoding architecture, drawing inspiration from neural discrete representation learning [46].
A convolutional model, comprising an encoder $E$ and a decoder $G$, is first trained to represent images using a learned discrete codebook $\mathcal{Z} = \{ z_k \}_{k=1}^{N} \subset \mathbb{R}^{n_z}$. More precisely, the objective is to approximate the reconstruction of an image $x$ as:
$\tilde{x} = G(z_q)$
Here, the quantization step maps encoder outputs to the nearest codebook entries, ensuring a compact and structured representation.
The quantized representation $z_q$ is obtained through a two-step procedure:
  • Encoding: The encoder produces a continuous intermediate representation:
    $\tilde{z} = E(x) \in \mathbb{R}^{h \times w \times n_z}$
  • Element-wise Quantization: Each spatial component $\tilde{z}_{ij} \in \mathbb{R}^{n_z}$ is mapped to its nearest codebook entry $z_k$:
    $z_q = q(\tilde{z}) := \arg\min_{z_k \in \mathcal{Z}} \left\| \tilde{z}_{ij} - z_k \right\| \in \mathbb{R}^{h \times w \times n_z}$
The final reconstruction is formulated as:
$\tilde{x} = G(z_q) = G\left(q\left(E(x)\right)\right)$
Since the quantization operation q is non-differentiable, the straight-through gradient estimator is employed to enable end-to-end training [47]. During backpropagation, the decoder’s gradients are directly passed through to the encoder while bypassing the quantization step.
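The sketch below illustrates the nearest-neighbour quantization and the straight-through gradient trick described above; codebook size and dimensions mirror the settings reported later in Section 4.3, and the class layout is illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour VQ with a straight-through gradient estimator."""
    def __init__(self, num_codes: int = 1024, code_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)        # Z = {z_k}

    def forward(self, z: torch.Tensor):
        # z: (B, h, w, n_z) continuous encoder output E(x)
        flat = z.reshape(-1, z.shape[-1])                        # (B*h*w, n_z)
        dists = torch.cdist(flat, self.codebook.weight)          # ||z̃_ij − z_k|| for all k
        idx = dists.argmin(dim=1)                                # index sequence s
        z_q = self.codebook(idx).view_as(z)                      # quantized z_q
        # Straight-through: forward pass uses z_q, gradients flow back to z unchanged.
        z_q_st = z + (z_q - z).detach()
        return z_q_st, idx.view(z.shape[:-1])

z_q, indices = VectorQuantizer()(torch.randn(2, 16, 16, 256))
```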

3.2.2. Loss Function

The entire model is trained using the following composite loss function:
$\mathcal{L}_1 = \left\| x - \tilde{x} \right\|^{2} + \left\| \mathrm{sg}[E(x)] - z_q \right\|_{2}^{2} + \left\| \mathrm{sg}[z_q] - E(x) \right\|_{2}^{2}$
where
  • $\left\| x - \tilde{x} \right\|^{2}$ represents the reconstruction loss, measuring the discrepancy between the original and reconstructed images;
  • $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, preventing variable updates during backpropagation;
  • $\left\| \mathrm{sg}[E(x)] - z_q \right\|_{2}^{2}$ serves as the codebook training loss, pulling codebook entries toward the encoder outputs;
  • $\left\| \mathrm{sg}[z_q] - E(x) \right\|_{2}^{2}$ corresponds to the commitment loss [48], encouraging encoder outputs to remain close to their assigned codebook entries.
To enable Transformer-based representation of image component distributions while achieving high compression rates, the proposed method incorporates a discriminator and perceptual loss to maintain perceptual quality under aggressive compression [46,49,50,51,52]. Previous approaches typically applied pixel-level methods [26] or Transformer-based autoregressive models on shallow quantization models, while this work advances the quantization stage’s modeling capability.
Specifically, the L2 reconstruction loss in the original objective is replaced with perceptual loss, combined with an adversarial training mechanism employing a patch-based discriminator D to distinguish between real and reconstructed images [53]. The adversarial loss is formulated as:
$\mathcal{L}_2 = \log D(x) + \log\left(1 - D(\tilde{x})\right)$
The complete loss function for this stage becomes:
$\mathcal{L} = \mathcal{L}_1 + \lambda \mathcal{L}_2$
where λ represents an adaptive weight computed as:
$\lambda = \frac{\nabla_{G_L}\left[\mathcal{L}_{rec}\right]}{\nabla_{G_L}\left[\mathcal{L}_{GAN}\right] + \delta}$
where:
  • $\mathcal{L}_{rec}$ denotes the perceptual reconstruction loss [54];
  • $\nabla_{G_L}[\cdot]$ indicates the gradient with respect to the input of the decoder's final layer $L$;
  • $\delta = 10^{-6}$ serves as a small constant for numerical stability.
For global context aggregation, a single-layer attention mechanism is applied at the lowest resolution. This training strategy significantly reduces the sequence length when unfolding latent codes, enabling effective application of more powerful Transformer models.
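The following is a minimal sketch of how the adaptive weight $\lambda$ could be computed, reading $\nabla_{G_L}[\cdot]$ as the gradient norm with respect to the decoder's last-layer weights, which mirrors common VQGAN-style implementations; this is an assumed interpretation of the formula above, not the authors' code.

```python
import torch

def adaptive_weight(rec_loss: torch.Tensor, gan_loss: torch.Tensor,
                    last_layer: torch.nn.Parameter, delta: float = 1e-6):
    """Balance the adversarial term against the perceptual reconstruction term."""
    rec_grad = torch.autograd.grad(rec_loss, last_layer, retain_graph=True)[0]
    gan_grad = torch.autograd.grad(gan_loss, last_layer, retain_graph=True)[0]
    lam = rec_grad.norm() / (gan_grad.norm() + delta)
    return torch.clamp(lam, 0.0, 1e4).detach()   # clamp and detach for stable training
```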

3.2.3. Learning Image Structures with Transformer

Given the availability of the encoder $E$ and decoder $G$, an image can be represented as a sequence of codebook indices corresponding to its quantized encoding. Specifically, for an image $x$ with quantized encoding $z_q = q(E(x))$, this representation is equivalent to an index sequence $s \in \{0, \ldots, |\mathcal{Z}| - 1\}^{h \times w}$, where each value $s_{ij} = k$ at position $(i, j)$ satisfies:
$(z_q)_{ij} = z_k$
indicating that each quantized vector is replaced by its index in the codebook $\mathcal{Z}$. The original $z_q$ can be recovered by mapping the index sequence $s$ back to the corresponding codebook entries, enabling image reconstruction through the decoder: $\hat{x} = G(z_q)$.
When a specific ordering is established for the index sequence $s$, the image generation task can be formulated as an autoregressive next-index prediction problem. Given preceding indices $s_{<i}$, the Transformer learns to predict the probability distribution of the $i$-th index:
$p(s) = \prod_{i=1}^{T} p\left(s_i \mid s_{<i}\right)$
For conditional image generation tasks, given an image $I \in \mathbb{R}^{H \times W \times 3}$ and a pixel-level control condition $C \in \mathbb{R}^{1 \times H \times W}$, where $H$ and $W$ denote the image dimensions, the conditional distribution $P(I \mid C)$ is learned. Each image $I$ is encoded into a sequence of $T$ discrete tokens $(s_1, s_2, \ldots, s_T)$. Through autoregressive modeling, the conditional probability can be expressed as:
$P(I \mid C) = P\left(s_1, s_2, \ldots, s_T \mid C\right) = \prod_{t=1}^{T} p\left(s_t \mid s_{<t}, C\right)$
The objective is to minimize the next-token prediction loss, formulated as:
$l_{train} = \mathrm{CE}\left( M\left(c, (s_1, s_2, \ldots, s_{T-1})\right),\ (s_1, s_2, \ldots, s_T) \right)$
where $M$ denotes the sequence model and CE represents the cross-entropy function:
$\mathrm{CE} = -\sum_{t=1}^{T} \log p\left(s_t \mid c, s_1, s_2, \ldots, s_{t-1}\right)$
In this work, the conditional input consists of images simulated from single-photon cameras, with the objective of generating corresponding depth maps of target images.
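The sketch below shows one conditional next-token training step in this style: the sequence model receives the control tokens $c$ and the shifted target tokens $s_{<t}$, and the loss is the cross-entropy against $s_t$. The model interface and the convention that the control prefix predicts the first token are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def ar_training_step(model, control_tokens, target_indices):
    """One step of conditional next-token prediction.

    control_tokens: (B, L, d) prefix condition c from the control encoder.
    target_indices: (B, T)   VQGAN codebook indices s_1 ... s_T of the depth map.
    """
    inputs = target_indices[:, :-1]                  # s_1 ... s_{T-1} fed to the model
    logits = model(control_tokens, inputs)           # (B, T, |Z|): token t predicted from (c, s_<t)
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           target_indices.reshape(-1))
```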

4. Experiment and Analysis

4.1. Datasets

4.1.1. Low-Light Image Enhancement Datasets

The proposed low-light image enhancement framework was evaluated on multiple datasets where noise is prevalent in low-light regions, including LOL (v1 and v2) [55,56], SID [2], SMID [57], and SDSD [58]. The LOL dataset exhibits noticeable noise in both v1 and v2 versions. The v1 version [59] contains 485 training pairs and 15 test pairs, with each pair consisting of one low-light image and its corresponding normal-exposure counterpart. The v2 version [60] is divided into LOL-v2-real and LOL-v2-synthetic subsets. LOL-v2-real comprises 689 training pairs and 100 test pairs, where low-light images were acquired by adjusting exposure time and ISO while keeping other camera parameters constant. LOL-v2-synthetic was generated through analysis of illumination distribution in RAW format images.
Both SID and SMID datasets provide paired short-exposure and long-exposure images. The low-light images in these datasets were captured under extremely dark conditions, resulting in high noise levels. For the SID dataset, the Sony camera-captured subset (Sony Corporation, Tokyo, Japan) was selected, and RAW images were converted to RGB format using the default image signal processing (ISP) pipeline provided by rawpy (version 0.20.0), following official scripts. Crucially, we applied this identical, neutral linear demosaicing pipeline across all comparison methods (both baselines and our proposed framework). This ensures that the evaluation remains fair and that any observed performance gains are attributable solely to the efficacy of the proposed enhancement network, rather than artifacts or biases introduced by the ISP conversion process. For the SMID dataset, the complete image collection was utilized, with RAW data similarly converted to RGB format, as this study focuses on low-light enhancement in the RGB domain. The training-test splits were determined according to the settings described in reference [59]. Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 illustrate sample image pairs from the aforementioned datasets, each showing a normal-light image paired with its corresponding low-light version.

4.1.2. Depth Estimation Dataset

The performance of the proposed autoregressive model was evaluated on the NYU Depth V2 dataset [60] (see Figure 10). NYU Depth V2 is an RGB-D dataset released by New York University and has been widely used for indoor scene understanding tasks. This dataset is rich in content and was constructed from multiple indoor environment video sequences captured by Microsoft Kinect’s RGB camera and depth sensor. The data mainly includes the following aspects: First, the dataset contains a total of 1449 pairs of densely annotated and aligned RGB and depth images. Second, the dataset covers 464 new scenes across three cities, providing 407,024 additional unlabeled image sequences. Furthermore, the dataset provides category labels and corresponding instance numbers for each object in the data.
Regarding data partitioning, this paper follows the same training and test set settings as previous work: the training part includes 50,000 images used to train the model’s generation capability, while the test part consists of 654 images used to evaluate the model’s generalization ability and actual performance.

4.2. Evaluation Metrics

For the low-light image enhancement task, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [61] were adopted as image quality assessment metrics.
PSNR is commonly used to measure pixel-level errors between two images, calculated as:
$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX_I^{2}}{\mathrm{MSE}}\right)$
where $MAX_I$ represents the maximum possible pixel value (e.g., 255 for 8-bit images).
MSE denotes the Mean Squared Error:
$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ I(i, j) - K(i, j) \right]^{2}$
with $I$ as the reference image, $K$ as the evaluated image, and $m \times n$ as the image dimensions.
SSIM evaluates image quality through luminance, contrast, and structure components, better aligning with human visual perception:
$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^{2} + \mu_y^{2} + C_1\right)\left(\sigma_x^{2} + \sigma_y^{2} + C_2\right)}$
where $\mu_x$ and $\mu_y$ are the means of images $x$ and $y$, $\sigma_x^{2}$ and $\sigma_y^{2}$ are their variances, and $\sigma_{xy}$ is their covariance. $C_1 = (K_1 L)^{2}$ and $C_2 = (K_2 L)^{2}$ are stabilization constants, with $K_1 = 0.01$, $K_2 = 0.03$, and $L$ the pixel value range. Higher SSIM values indicate better preservation of high-frequency details and structural information.
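For reference, the short sketch below computes PSNR and a single-window SSIM directly from these formulas for 8-bit images; reported SSIM scores are normally computed with a sliding Gaussian window, so this global form is a simplification.

```python
import numpy as np

def psnr(ref: np.ndarray, img: np.ndarray, max_i: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_i ** 2 / mse))

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```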
For depth estimation, four standard metrics were employed:
  • Mean Relative Error (rel):
    $\mathrm{rel} = \frac{1}{n} \sum_{p=1}^{n} \frac{\left| y_p - \tilde{y}_p \right|}{y_p}$
  • Root Mean Squared Error (rms):
    $\mathrm{rms} = \sqrt{\frac{1}{n} \sum_{p=1}^{n} \left( y_p - \tilde{y}_p \right)^{2}}$
  • Mean log10 Error (log10 error):
    $\frac{1}{n} \sum_{p=1}^{n} \left| \log_{10} y_p - \log_{10} \tilde{y}_p \right|$
  • Threshold Accuracy ($\delta_i$), the percentage of pixels satisfying:
    $\max\left( \frac{y_p}{\tilde{y}_p}, \frac{\tilde{y}_p}{y_p} \right) < \delta_i$
    where $\delta_i \in \{1.25, 1.25^{2}, 1.25^{3}\}$. Here $y_p$ is the ground-truth depth value, $\tilde{y}_p$ is the predicted depth value, and $n$ is the total number of pixels per depth image. These four metrics collectively provide a comprehensive evaluation of the generated depth maps.
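A minimal implementation of these four depth metrics, computed over the valid pixels of a single depth map, is sketched below.

```python
import numpy as np

def depth_metrics(y: np.ndarray, y_tilde: np.ndarray) -> dict:
    """rel, rms, mean log10 error, and threshold accuracies δ1-δ3."""
    y = y.astype(np.float64).ravel()
    y_tilde = y_tilde.astype(np.float64).ravel()
    rel = np.mean(np.abs(y - y_tilde) / y)
    rms = np.sqrt(np.mean((y - y_tilde) ** 2))
    log10_err = np.mean(np.abs(np.log10(y) - np.log10(y_tilde)))
    ratio = np.maximum(y / y_tilde, y_tilde / y)
    deltas = {f"delta_{i + 1}": float(np.mean(ratio < 1.25 ** (i + 1))) for i in range(3)}
    return {"rel": float(rel), "rms": float(rms), "log10": float(log10_err), **deltas}
```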

4.3. Experimental Setup

All experiments in this study were conducted on the Ubuntu 18.04 operating system using an RTX 3090 GPU with 24 GB of memory. To ensure reproducibility and facilitate future research, the source code and pre-trained models will be made publicly available upon acceptance of this paper.
For all low-light image enhancement experiments, the framework was implemented in PyTorch (version 2.3). The network parameters were randomly initialized using a Gaussian distribution, and standard data augmentation strategies were applied, including vertical and horizontal flipping. In terms of architecture design, the encoder consists of three convolutional layers with stride sizes of 1, 2, and 2, respectively, followed by a residual block. The decoder adopts a symmetric structure to the encoder, employing a pixel shuffle layer for upsampling. For optimization, the Adam optimizer [58] was utilized with a momentum parameter of 0.9, and the model was trained for 50 epochs with a batch size of 256.
During the training of VQGAN, the codebook size was set to |Z| = 1024, with each codebook vector having a dimensionality of 256. These hyperparameters were empirically selected to balance reconstruction fidelity with the computational complexity of the subsequent autoregressive prediction. A larger codebook improves textural details but significantly increases the difficulty of next-token prediction, while a smaller codebook leads to loss of high-frequency information. For downsampling, the channel scaling factors were set to [1,2,4], where the numbers in brackets indicate the multiplicative increase in the number of channels. An attention operation was applied when the original image resolution was reduced to 16 × 16.
For all depth estimation experiments, the Adam optimizer was employed with a learning rate of 1 × 10−3. The exponential decay rates for the first- and second-moment estimates were set to β1 = 0.9 and β2 = 0.999, respectively, with a weight decay coefficient of 0 and ε = 1 × 10−8. The AMSGrad variant was not used. Due to computational constraints, the reported quantitative results reflect the model’s performance under a fixed random seed configuration. While statistical variance is not quantified, the consistent improvements observed across multiple metrics and datasets suggest the method’s effectiveness.

4.4. Experimental Results

4.4.1. Image Enhancement

The proposed method is compared with a series of recent state-of-the-art (SOTA) approaches in low-light image enhancement, including LPNet [62], MIR-Net [63], Retinex [64], and IPT [61]. Recent state-of-the-art methods such as Retinexformer [65] and SNR-Aware Transformer [66] have demonstrated impressive results on standard low-light datasets. However, these models are optimized for continuous RGB intensity values. Applying them directly to the binary and extremely sparse photon-counting data from SPAD sensors requires significant architectural adaptation and retraining, which is beyond the scope of this work. Therefore, we focus our comparison on methods that are either training-free or specifically adapted for similar noise distributions. Furthermore, recent studies published in Applied Sciences [67,68] and other leading venues [65,66,69] have validated the efficacy of combining multi-scale attention mechanisms with physics-based priors to resolve complex degradation in photon-starved environments. As shown in Table 1, the proposed method achieves superior performance over all baseline approaches. It should be noted that some of the listed results are obtained from the referenced literature, while others are reproduced by running publicly available code. Figure 11 provides a visualization of the enhanced results on low-light images after applying the proposed data augmentation strategy.

4.4.2. Encoding Capability of VQGAN

The proposed autoregressive framework employs VQGAN as both the encoder and decoder, making the reconstruction quality of VQGAN crucial. To evaluate this, two structurally identical VQGAN models were trained—one using the original images and the other using their corresponding depth maps. The VQGAN model trained on the original images was configured with a discrete codebook size of 1024, where each token had a dimensionality of 256. The only difference between the two models was the input channel configuration: the original images used three channels (RGB), while the depth maps, being single-channel, used one.
Since depth maps are not in RGB format, a color enhancement technique was applied to improve visualization. The reconstructed results are presented below:
The experimental results demonstrate that the VQGAN model achieves satisfactory reconstruction of original images after 30 training epochs. As shown in Figure 12, the model exhibits further improvements in detail preservation and edge sharpness when trained for 50 epochs. Particularly, the 50th-epoch model shows enhanced capability in reconstructing fine textures and maintaining structural integrity compared to earlier training stages. Therefore, the model obtained at the 50th training epoch was selected for subsequent experiments in this study.

4.4.3. Depth Estimation

The proposed method is compared with Joint Denoising and Photon Net, and the experimental results are shown in Figure 13 (↑ indicates that higher values are better, while ↓ indicates that lower values are better). As shown in Table 2, the proposed method outperforms the baseline methods across all metrics.

4.4.4. Experiments on Real Single-Photon Camera Data

To evaluate the effectiveness of the proposed method on real single-photon camera images, we conducted experiments using a dataset captured by a SwissSPAD2 camera (EPFL, Lausanne, Switzerland).
Camera Configuration: The camera was operated in binary mode, capturing binary frames with a spatial resolution of 512 × 256 at a maximum frame rate of 96.8 kHz. Since the sensor was not equipped with a Bayer filter, only grayscale (single-channel) images were acquired. The captured images contained hot pixels, which were corrected during post-processing. Specifically, a dark-field image was first captured to identify the positions of hot pixels, followed by spatial neighborhood-based filtering to remove them.
Dataset Preparation: For the image classification task, original RGB images are projected onto a monitor screen (Dell P2419H, 60 Hz) and captured using the SPAD sensor. The camera was positioned approximately 1 m away from the screen, with its field of view adjusted to cover the entire display area.
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset [46], a widely used benchmark for fine-grained visual classification, contains 11,788 images from 200 bird subcategories (5994 for training and 5794 for testing). From this dataset, a subset of 122 categories (denoted as CUB-subset) was randomly sampled from the original 200 categories. This selection was made to maintain statistical diversity representative of fine-grained textures while keeping the data acquisition and training process within a computationally manageable timeframe. The subset comprises 3656 training images and 3518 test images.
Table 3 summarizes the experimental results on real SPAD images, where the models were trained following the methodology described in Section 4.3. Despite the challenges posed by the low resolution and grayscale nature of the real SPAD images, the proposed method consistently outperforms both baseline approaches. We specifically focused our evaluation on extremely low-light conditions (average PPP ≈ 0.1). This setting represents the most challenging operational threshold for SPAD sensors, serving as a stress test to characterize the model’s performance degradation limits under severe photon starvation. Under these conditions, the designed framework demonstrates robust performance by accurately reconstructing images while preserving critical details and enabling reliable depth estimation from the restored images. As indicated by the quantitative metrics (rel, rms) in Table 3, our method significantly reduces depth estimation errors compared to baselines, validating its effectiveness and robustness for real-world imaging applications. These metrics also serve as proxies for perceptual quality, aligning with the visual improvements shown in Figure 14.

4.5. Ablation Study

4.5.1. Ablation Study on Low-Light Image Enhancement Framework

To thoroughly evaluate the contribution of each component in the proposed framework, systematic ablation experiments with four different configurations are conducted by progressively removing key modules: (1) “Ours w/o L” which removes the long-range branch while keeping only the convolutional operations, thereby eliminating global modeling capability; (2) “Ours w/o S” which removes the short-range branch while maintaining the complete long-range branch and SNR-guided attention mechanism; (3) “Ours w/o SA” that further removes the SNR-guided attention module based on “Ours w/o S”, leaving only the fundamental Transformer structure; and (4) “Ours w/o A” which specifically removes the SNR-guided attention mechanism while preserving all other architectural components. Comprehensive experiments performed across four benchmark datasets (results summarized in Table 4) demonstrate that our complete framework achieves the highest scores in both PSNR and SSIM metrics compared to all ablated variants, clearly validating the critical importance and synergistic effects of each module for optimal image quality enhancement. Notably, the performance degradation observed in each ablated configuration provides concrete evidence for the indispensable role of these carefully designed components in our framework.

4.5.2. Ablation Study on Autoregressive Model Training Strategies

The experimental results demonstrate that the three configurations—“Ours w/o L”, “Ours w/o S”, and “Ours w/o SA”—reveal the limitations of using either pure convolutional structures or basic Transformer architectures alone in modeling capability, which further proves the importance of synergistic integration between short-range (convolutional model) and long-range (Transformer architecture) feature extraction mechanisms. Moreover, the results highlight the critical role of both the SNR-guided attention mechanism (comparison between “Ours w/o A” and “Ours”) and the SNR-guided fusion strategy (comparison between “Ours w/o S” and “Ours”) in performance improvement.
For the autoregressive model, several ablation experiments are designed on parameter update strategies during training to investigate their impact on controllable generation performance. In current controllable generation tasks, common parameter update methods for generative models mainly include three categories: (1) completely frozen model parameters, (2) fine-tuning using Low-Rank Adaptation (LoRA), and (3) full fine-tuning of the entire model. Below are brief descriptions of these three approaches:
  • Frozen Strategy: During training, all parameters of the generative model itself remain unchanged, and only external additional modules are trained. This approach involves the fewest trainable parameters but generally yields the poorest performance.
  • LoRA: As a novel lightweight parameter adaptation method, LoRA achieves model transfer and adaptation by superimposing low-rank trainable parameters on the original weight matrices without directly modifying the original model parameters. This method can achieve good performance while adjusting relatively few parameters.
  • Full Fine-tuning: The core idea is to include all parameters of the backbone generative model in the training process, allowing the model to fully adapt to specific control conditions or task requirements. This approach typically requires training a large number of parameters but often achieves the best results.
In this experiment, the LlamaGen-B model is used as the generative backbone network and trained and evaluated on the NYU-v2 dataset under the three aforementioned settings. The results are shown in Table 5. The experiments demonstrate that full fine-tuning outperforms both the frozen and LoRA strategies across all metrics, showcasing its superior performance. The results indicate that, given sufficient training resources, full fine-tuning of the backbone model remains an effective approach to improving performance.

4.5.3. Ablation Study on Control Encoders

In Table 6, for the task of depth estimation, this paper conducts experiments using different encoders or pre-training schemes. First, this paper designs a CNN-based base control encoder that consists of four consecutive residual blocks with a total downsampling multiplier of 16. This CNN-based base control encoder contains about 21.8 M parametric quantities. Further, in this paper, we explore the effect of using a pre-trained ViT as a control encoder by employing ViT models with different pre-training methods, i.e., supervised pre-training and self-supervised pre-training based on ImageNet, and in the experiments, the C2I model of LlamaGen-B is utilized. As shown in the table, different control encoders exhibit performance differences on different datasets, and the VQGAN model using self-training performs better on the dataset.

4.5.4. Ablation Study on Control Fusion Strategies

The ablation experiments investigate different strategies for fusing control-condition tokens with image tokens, using NYU Depth V2 images as control conditions and LlamaGen-B as the generative model. When cross-attention is employed for control fusion, the control-condition tokens serve as Key and Value while the image tokens act as Query, following the standard cross-attention formulation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
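A minimal sketch of this cross-attention fusion, with image tokens supplying the query and control tokens supplying key and value, is given below; the embedding dimension, head count, and residual/normalization placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ControlCrossAttention(nn.Module):
    """Image tokens attend to control-condition tokens (Q = image, K/V = control)."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, control_tokens):
        # softmax(Q K^T / sqrt(d_k)) V, with control tokens providing K and V
        fused, _ = self.attn(query=image_tokens, key=control_tokens, value=control_tokens)
        return self.norm(image_tokens + fused)   # residual connection

img = torch.randn(2, 256, 768)    # image token sequence
ctrl = torch.randn(2, 256, 768)   # control-condition token sequence
out = ControlCrossAttention()(img, ctrl)
```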
In the 12-layer Transformer architecture of LlamaGen-B, direct additive fusion is compared at different layer positions: at layer 1 only, at layers 1/5/9, and across all 12 layers. As shown in Table 7, direct additive fusion outperforms cross-attention, likely because cross-attention must additionally learn the positional relationships between image patches and control tokens, which can slow convergence. While more frequent additive fusion improves the conditional consistency of the generated images, excessive addition (e.g., fusion at every layer) degrades image quality, suggesting that intermediate-layer fusion offers the best balance.
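The direct additive variant can be sketched as follows: projected control tokens are simply added to the hidden states before a chosen subset of layers. The sketch uses a generic Transformer stack as a stand-in for the LlamaGen-B decoder, and all dimensions and module choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveControlStack(nn.Module):
    """12-layer Transformer stack with additive control injection at selected layers (1-indexed)."""
    def __init__(self, dim: int = 768, num_layers: int = 12, fuse_layers=(1, 5, 9)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True) for _ in range(num_layers)]
        )
        self.fuse_layers = set(fuse_layers)
        self.control_proj = nn.Linear(dim, dim)   # aligns control tokens with the hidden space

    def forward(self, image_tokens, control_tokens):
        h = image_tokens
        ctrl = self.control_proj(control_tokens)
        for idx, layer in enumerate(self.layers, start=1):
            if idx in self.fuse_layers:
                h = h + ctrl                      # direct additive fusion
            h = layer(h)
        return h

# fuse_layers=(1,), (1, 5, 9), and tuple(range(1, 13)) correspond to the three
# additive settings compared in Table 7 (layer 1 only, layers 1/5/9, all layers).
model = AdditiveControlStack(fuse_layers=(1, 5, 9))
out = model(torch.randn(2, 256, 768), torch.randn(2, 256, 768))
```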

5. Conclusions

The difficulty of performing downstream tasks under low-light conditions has driven increasing research on low-light image enhancement and related applications. To address these challenges, this paper proposes several solutions. A generic image enhancement framework combining Transformer and convolutional architectures is introduced, incorporating signal-to-noise ratio priors to guide image information fusion. A novel self-attention module designed specifically for low-light enhancement demonstrates favorable performance across multiple datasets. For depth estimation, an improved autoregressive model is presented that integrates conditional tokens through additive operations in an end-to-end image generation architecture. Experimental results show that this approach, combined with the aforementioned enhancement method, achieves promising performance on real single-photon camera data.
Regarding computational efficiency, the proposed dual-branch design reduces computational cost by restricting the heavy Transformer operations to low-resolution features, while the CNN branch handles high-resolution details efficiently. With a control encoder of approximately 21.8 M parameters, the model maintains a reasonable inference speed suitable for near-real-time applications, striking a favorable balance between performance and resource usage. Regarding practical deployment, the proposed framework is designed to be modular. While the current experiments were conducted on high-end GPUs (RTX 3090) to validate robustness, the model’s parameter size (≈21.8 M) is within a manageable range for modern embedded AI platforms (e.g., NVIDIA Jetson Orin) used in autonomous vehicles. Future implementation efforts could further optimize the autoregressive step using model quantization or caching mechanisms to meet the strict low-latency requirements (<30 ms) of real-time driving safety systems.
Future research directions include further exploration of prior semantic information to enhance spatially adaptive mechanisms in image enhancement, as well as extending the framework to low-light video enhancement by jointly modeling temporal and spatial variations. The autoregressive architecture’s compatibility with text-based models presents opportunities to leverage advancements in large-scale language models for improving framework performance or generation speed. Additionally, conditional image generation based on textual inputs remains a promising area for investigation.
Limitations: It is important to acknowledge that the real-world validation in this study was conducted using a controlled monitor-capture setup. While this approach allowed for verifiable ground truth comparison, it introduces specific domain gaps, such as monitor refresh rate artifacts and pixel quantization, which differ from directly captured natural scenes. Additionally, due to hardware constraints, our current evaluation did not include direct capture with LiDAR-aligned ground truth or testing across a broad range of photon flux levels (PPP) to plot full degradation curves. Future work will focus on acquiring large-scale diverse indoor and outdoor SPAD datasets with synchronized LiDAR ground truth to further validate the method’s robustness in unconstrained wild environments.

Author Contributions

Conceptualization, Q.Y., D.D. and T.Z.; Methodology, Q.Y., F.M., Q.W. and Z.F.; Software, F.M. and Z.F.; Validation, T.Z.; Formal analysis, Q.W.; Resources, Q.Y. and D.D.; Writing—original draft, Q.Y. and T.Z.; Writing—review & editing, F.M., Q.W., Z.F. and T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article. The source code and pre-trained models will be made publicly available upon publication.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. The SwissSPAD2 camera system (EPFL, Lausanne, Switzerland) used for real-world data acquisition. Although visually compact, the setup integrates several critical components: the central SPAD sensor array (512 × 256 resolution) coupled with a high-speed FPGA controller (bottom board) for binary frame buffering, and a C-mount optical lens assembly (front) for light collection. The system transmits binary photon streams via a USB 3.0 interface to the host workstation for subsequent SNR-guided enhancement and depth reconstruction processing.
Figure 2. Sample low-light input images from the LOL-v1 dataset. These images are characterized by low contrast and significant noise. Note: For visualization purposes, the brightness of these images has been increased using gamma correction (γ = 2.2) to make the contents visible.
Figure 3. Corresponding normal-light ground truth images for the LOL-v1 dataset shown in Figure 2. These high-quality reference images are used to supervise the training of the enhancement network.
Figure 4. Representative low-light input samples from the LOL-v2 dataset. The scenes include varying illumination conditions that challenge standard enhancement algorithms. (Gamma correction applied for visibility).
Figure 5. Corresponding normal-light ground truth images for the LOL-v2 real subset shown in Figure 4. These high-quality reference images provide the target supervision signal, enabling the network to learn accurate color restoration and detail enhancement during training. Note: The non-English text appearing on the posters in the background consists of safety slogans inherent to the original scene and is not relevant to the image enhancement metrics.
Figure 6. Representative low-light input samples from the LOL-v2 synthetic subset. The scenes include varying illumination conditions that challenge standard enhancement algorithms; the low visibility and noise are characteristic of the low-light input data. (Gamma correction applied for visibility).
Figure 7. High-quality reference images from the LOL-v2 synthetic dataset. These serve as the ground truth, from which low-light versions were synthesized by simulating noise and illumination degradation. Comparing results against these images allows for precise quantitative evaluation of the model’s generalization ability.
Figure 8. Paired samples from the SID dataset. The left/top images show the short-exposure (low-light) inputs, which suffer from extreme noise and color distortion. The right/bottom images show the corresponding long-exposure ground truth.
Figure 9. Representative sample scenes from the SMID dataset used in our evaluation. This dataset features dynamic scenes captured under extreme darkness, providing paired data consisting of short-exposure noisy inputs and corresponding long-exposure ground truth images (shown here) to benchmark performance in real-world motion scenarios. The images are displayed in standard RGB true color, and the center image features a color checker chart used for color reference.
Figure 10. Representative samples from the NYU Depth V2 dataset. The top row displays the input RGB images of indoor scenes, while the bottom row shows the corresponding ground truth depth maps (visualized as heatmaps). These samples are used to evaluate the depth estimation branch of the proposed framework. In the depth maps, the color gradient represents distance from the camera, where blue indicates near objects and red indicates far objects.
Figure 11. Visual comparison of image enhancement results. (Top row): Original low-light inputs (displayed with digital gain), showing severe noise and loss of detail. (Bottom row): Enhanced results generated by our proposed SNR-guided framework. Our method effectively suppresses noise while recovering structural details and accurate colors, verifying the effectiveness of the proposed data augmentation and enhancement strategy.
Figure 12. Visualization of the VQGAN reconstruction process across different training epochs. From (left) to (right): The Ground Truth (GT) image, and the reconstructed outputs at Epoch 14, Epoch 30, and Epoch 50. As training progresses, the model (at Epoch 50) demonstrates superior capability in preserving high-frequency textures and edge details compared to earlier stages. The depth maps are visualized using a false-color heatmap, where blue represents near distances and red represents far distances.
Figure 13. Qualitative results of depth estimation on the NYU Depth V2 dataset. The figure compares the Ground Truth depth maps (Left/Top) with the Predicted depth maps (Right/Bottom) generated by our autoregressive model. The results show that our model accurately infers depth gradients and object boundaries even in complex scenes. The depth maps are visualized as heatmaps, where blue indicates near distances and red indicates far distances.
Figure 14. Comprehensive evaluation on real-world single-photon data (CUB-200-2011 subset) captured by SwissSPAD2. From (top) to (bottom): (1) The reference monitor image; (2) The raw captured SPAD image, characterized by extreme noise and sparsity; (3) The data-enhanced image restored by our SNR-guided module; and (4) The generated depth map. This demonstrates the framework’s robustness in recovering visual content and estimating depth from actual photon-limited sensors. The depth maps are visualized as heatmaps, where blue indicates near distances and red indicates far distances.
Table 1. Comparison of image enhancement performance.

Methods    | LOL-v1          | LOL-v2-r        | LOL-v2-s        | SID
           | PSNR    SSIM    | PSNR    SSIM    | PSNR    SSIM    | PSNR    SSIM
LPNet      | 21.46   0.802   | 17.80   0.792   | 19.51   0.846   | 20.08   0.598
MIR-Net    | 24.14   0.830   | 20.02   0.820   | 21.94   0.876   | 20.84   0.605
Retinex    | 18.23   0.720   | 18.37   0.723   | 16.55   0.652   | 18.44   0.581
IPT        | 16.27   0.504   | 19.80   0.813   | 18.30   0.811   | 20.53   0.561
Ours       | 24.61   0.842   | 21.48   0.849   | 24.14   0.928   | 22.87   0.625

Note: Bold indicates the best performance.
Table 2. Depth estimation results.

Method           | δ1 ↑   | δ2 ↑   | δ3 ↑   | rel ↓  | rms ↓  | log10 ↓
Joint Denoising  | 0.671  | 0.896  | 0.967  | 0.209  | 1.412  | 0.087
Photon Net       | 0.713  | 0.917  | 0.976  | 0.183  | 1.275  | 0.078
Ours             | 0.725  | 0.941  | 0.984  | 0.162  | 1.177  | 0.069

Note: Bold indicates the best performance. ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 3. Performance on real single-photon data.

Method           | δ1 ↑   | δ2 ↑   | δ3 ↑   | rel ↓  | rms ↓  | log10 ↓
Joint Denoising  | 0.660  | 0.878  | 0.934  | 0.256  | 1.476  | 0.092
Photon Net       | 0.682  | 0.924  | 0.941  | 0.197  | 1.293  | 0.084
Ours             | 0.721  | 0.961  | 0.976  | 0.171  | 1.184  | 0.074

Note: Bold indicates the best performance. ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 4. Ablation study results of different modules in the low-light image enhancement framework.

Methods      | LOL-v1          | LOL-v2-r        | LOL-v2-s        | SID
             | PSNR    SSIM    | PSNR    SSIM    | PSNR    SSIM    | PSNR    SSIM
Ours w/o L   | 16.27   0.638   | 16.98   0.687   | 20.81   0.881   | 19.10   0.593
Ours w/o S   | 23.06   0.828   | 18.98   0.790   | 23.47   0.919   | 22.30   0.604
Ours w/o SA  | 20.67   0.752   | 18.85   0.765   | 21.88   0.842   | 21.02   0.544
Ours w/o A   | 21.86   0.760   | 19.40   0.782   | 22.23   0.866   | 21.19   0.550
Ours         | 24.61   0.842   | 21.48   0.849   | 24.14   0.928   | 22.87   0.625

Note: Bold indicates the best performance.
Table 5. Ablation study results of different training strategies for the autoregressive model.

Training Strategies          | δ1 ↑   | rel ↓  | rms ↓
Frozen Training              | 0.706  | 0.249  | 1.309
Low-Rank Adaptation (LoRA)   | 0.718  | 0.208  | 1.224
Full Fine-tuning             | 0.763  | 0.162  | 1.177

Note: Bold indicates the best performance. ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 6. Ablation results for different control encoder configurations.

Control Encoder  | δ1 ↑   | rel ↓  | rms ↓
CNN              | 0.685  | 0.248  | 1.459
ViT-S            | 0.714  | 0.256  | 1.287
DINOv2-S         | 0.720  | 0.258  | 1.279
VQGAN            | 0.725  | 0.162  | 1.177

Note: Bold indicates the best performance. ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 7. Ablation results for control fusion strategies.

Fusion Strategies        | Layer(s)  | δ1 ↑   | rel ↓  | rms ↓
Cross-Attention Fusion   | 1         | 0.721  | 0.168  | 1.179
Direct Additive Fusion   | 1         | 0.714  | 0.171  | 1.205
Direct Additive Fusion   | 1, 5, 9   | 0.725  | 0.162  | 1.177
Direct Additive Fusion   | 1–12      | 0.720  | 0.165  | 1.184

Note: ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
