Article

Noise Suppressed Image Reconstruction for Quanta Image Sensors Based on Transformer Neural Networks

School of Microelectronics, Tianjin University, 92 Weijin Road, Tianjin 300072, China
* Authors to whom correspondence should be addressed.
J. Imaging 2025, 11(5), 160; https://doi.org/10.3390/jimaging11050160
Submission received: 26 March 2025 / Revised: 10 May 2025 / Accepted: 13 May 2025 / Published: 17 May 2025
(This article belongs to the Section Image and Video Processing)

Abstract

The photon detection capability of quanta image sensors (QIS) makes them an optimal choice for low-light imaging. To address the Poisson noise that arises in QIS reconstruction from its spatio-temporal oversampling characteristics, a deep learning-based noise suppression reconstruction method is proposed in this paper. The proposed neural network integrates convolutional neural networks and Transformers. Its architecture combines the Anscombe transformation with serial and parallel modules to enhance denoising performance and adaptability across various scenarios. Experimental results demonstrate that the proposed method effectively suppresses noise in QIS image reconstruction. Compared with representative methods such as TD-BM3D, QISNet, and DPIR, our approach achieves up to a 1.2 dB improvement in PSNR, demonstrating superior reconstruction quality.

1. Introduction

With the continuous advancement of Moore’s Law, the resolution of conventional complementary metal-oxide-semiconductor image sensors (CIS) has steadily increased, leading to a reduction in pixel size. However, at the deep sub-diffraction-limit (deep-SDL) scale, smaller pixels result in decreased full well capacity (FWC), which negatively impacts the signal-to-noise ratio (SNR) and limits the dynamic range of the sensors [1]. To address these limitations, the concept of the digital film sensor (DFS) was proposed in 2005 [2] and formally termed the quanta image sensor (QIS) in 2011 [3]. Recent advances [4] reveal that the latest QIS prototype achieves a read noise below 0.2 e− r.m.s. at room temperature, with a resolution of 16.7 megapixels and a pixel pitch of 1.1 μm. This state-of-the-art sensor is expected to incorporate smaller pixel pitches, higher frame rates, lower read noise, and reduced dark current while maintaining high performance at room temperature and production costs comparable to traditional CIS [5]. These features make QIS highly suitable for low-light imaging applications, and numerous studies in computer vision have demonstrated the sensor’s effectiveness in low-light scenarios [6,7,8,9].
The QIS is built on an array of ultra-small photodetectors referred to as ‘jots’ [10], the output of each of which is quantized by an analog-to-digital converter (ADC). A distinguishing feature of QIS is its ability to perform spatio-temporal oversampling through high-frequency sampling. In this context, the oversampling unit, known as a ‘cubicle’, is a 3D array of ‘jots’. The main purpose of the spatio-temporal oversampling process is to suppress the Poisson noise induced by photon arrival events, thereby improving image quality [11]. After oversampling, a reconstruction process converts the data in the ‘cubicle’ into reconstructed pixel values. The concept of spatio-temporal oversampling is illustrated in Figure 1.
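To make the oversampling idea concrete, the short NumPy sketch below simulates one ‘cubicle’: each jot exposure draws a Poisson photon count, the single-bit jot reports whether at least one photon arrived, and the bit-planes are summed. The single-bit response and the 4 × 4 × 16 cubicle size are illustrative assumptions, not the exact simulator used later in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cubicle(theta, K=4, T=16):
    """Simulate spatio-temporal oversampling for one reconstructed pixel.

    theta: mean photon count per jot per frame.
    K: spatial oversampling (K x K jots per pixel); T: temporal frames.
    Returns the jot-summation value used for reconstruction.
    """
    # Each of the K*K*T jot exposures receives a Poisson photon count.
    photons = rng.poisson(lam=theta, size=(T, K, K))
    # A single-bit jot only reports whether >= 1 photon arrived.
    bits = (photons >= 1).astype(np.uint8)
    # Summing over the cubicle suppresses Poisson noise at the cost of
    # spatial and temporal resolution.
    return int(bits.sum())

print(simulate_cubicle(0.5))  # e.g., a jot with 0.5 mean photons per frame
```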
The analysis of noise and bit error rates in QIS was first introduced in [12,13], followed by detailed mathematical modeling and SNR formulations [14]. One of QIS’s key advantages is its ability to programmatically adjust spatial and temporal oversampling rates for high-SNR image reconstruction [10]. Robust reconstruction algorithms are therefore essential for achieving high-quality imaging. Early research framed image reconstruction as a photon number estimation problem: Yang et al. [15] proposed a Maximum Likelihood Estimation (MLE)-based algorithm for this purpose, and Chan et al. [16] proposed a Maximum A Posteriori (MAP) estimation method based on the Alternating Direction Method of Multipliers (ADMM). Beyond photon number estimation, a non-iterative reconstruction method based on the Block-Matching and 3D Filtering (BM3D) denoiser [17] and a deep learning reconstruction method based on convolutional neural networks (CNN) [18] have also been proposed.
The photon detection capability makes QIS an optimal choice for imaging in low-light conditions. However, the low-SNR bit-planes pose challenges to image reconstruction, so an appropriate noise suppression method can improve the quality of reconstructed images. The photon number estimation methods [15,16] suppress noise based on imaging principles; however, their estimates deviate significantly from the true values, and the mathematical expressions become complex and difficult to solve as the bit depth increases. The non-iterative BM3D method [17] uses frequency-domain filtering to achieve better denoising results than photon number estimation methods, but its time complexity is high, and the BM3D denoiser has been shown to underperform neural-network denoisers [19]. The CNN-based QISNet reconstruction method [18] has been proposed for noise suppression in QIS image reconstruction. Nevertheless, with the continued development of denoising technology, more capable neural network denoisers for QIS reconstruction deserve consideration.
In recent years, various filtering techniques have been developed to improve image denoising performance. Traditional methods, such as Gaussian smoothing, median filtering, and particularly Wiener filtering, have laid the groundwork for noise suppression by modeling local statistics and frequency domain characteristics. A comprehensive review of these approaches can be found in [20]. Building on these classical methods, filtering approaches based on doubly stochastic models have demonstrated strong potential in handling signal-dependent noise through adaptive probabilistic frameworks. For example, ref. [21] proposed an image restoration method using a doubly stochastic model, while [22] introduced a noise suppression approach that considers the two-layer stochastic nature of image acquisition. These methods provide important theoretical foundations for filtering under complex noise characteristics. With the rapid development of data-driven techniques, deep learning-based methods have also shown impressive results in recent years. These methods are capable of learning powerful priors directly from data and have demonstrated state-of-the-art performance in a wide range of denoising tasks. A recent survey by [23] provides a comprehensive overview of the progress in this area.
The unique hardware characteristics of QIS inherently provide superior capability for low-light imaging. By leveraging high-speed frame acquisition and a high temporal oversampling rate, QIS can image even in extremely low-light conditions, but this comes at the cost of longer imaging time and significantly increased data volume. More specifically, our contributions to the literature on QIS image reconstruction are as follows:
  • This study proposes a reconstruction method that suppresses photon shot noise during the QIS imaging process at a practical and acceptable temporal oversampling rate, using a deep neural network instead of traditional denoising.
  • The proposed neural network framework integrates CNN and Transformer to improve denoising performance. A hybrid structure combining serial and parallel modules is introduced into the network framework to enhance the strength and robustness of denoising. The serial module deeply explores the key information in image denoising, while the parallel module widely explores more relevant and complementary information between pixels from different angles, thereby enhancing the adaptability of the QIS-SPFT denoiser to complex scenes. In the network, a variance-stabilizing transformation is used to convert the strong Poisson noise in QIS into approximately Gaussian noise, further enhancing performance.
The remainder of the paper is organized as follows. Section 2 introduces the background of QIS imaging. Section 3 provides a detailed theoretical analysis of the proposed image reconstruction method. Section 4 presents and compares experimental results. Section 5 discusses limitations and future work. Finally, Section 6 concludes the paper.

2. Background

In this section, the noise composition of the QIS is analyzed and the mathematical model for imaging is provided. The formulation of the QIS imaging model is fundamental to QIS image reconstruction. Based on the QIS imaging principle, a QIS imaging model is established, as shown in Figure 2.
Consider a light wave with intensity $I(x, y, t)$ incident on a sensor with duty cycle $T$ and integration time $\Delta$. The mean number of photons $\theta$ received by a ‘jot’ of area $A$ during the integration time is expressed as follows:

$$\theta = \int_{(x,y) \in A} \int_{T}^{T+\Delta} I(x, y, t)\, dt\, dx\, dy,$$
The number of photons $Y$ received by the ‘jot’, which has a small area and a short integration interval, follows the Poisson distribution:

$$P(Y = y) = \frac{e^{-\theta}\, \theta^{y}}{y!},$$
Dark current, generated by pixels in the absence of light, leads to the production of dark electrons, affecting the accuracy of photon counting. These electrons also follow a Poisson distribution.
When considering spatio-temporal oversampling, the image information containing $N$ light intensities $c_i$ is reconstructed from $M$ oversampled ‘jots’ $\theta_i$. The spatio-temporal oversampling rate is $K = M/N$.
In matrix-vector notation, the mean photon counts at the jots are represented as:

$$\Theta_{\text{photon}} = \mathbf{G}\mathbf{c},$$

where $\mathbf{c} = [c_0, \ldots, c_{N-1}]^{T}$ denotes the reconstructed light intensity, $\Theta_{\text{photon}} = [\theta_0, \ldots, \theta_{M-1}]^{T}$ denotes the light intensity sampled at the $M$ jots, and $\mathbf{G} \in \mathbb{R}^{M \times N}$ represents upsampling followed by the lowpass filter $g_k$, which can be assumed to be a box-car filter. The total count can now be described as:
$$\Theta_{\text{tot}} = \Theta_{\text{photon}} + \Theta_{\text{dark}},$$

$$P([\mathbf{Y}]_l = y) = \frac{(\gamma \cdot [\Theta_{\text{tot}}]_l)^{y}\, e^{-\gamma \cdot [\Theta_{\text{tot}}]_l}}{y!},$$

where $l = 1, \ldots, N$, $\gamma$ is the sensor gain, $\Theta_{\text{tot}} \in \mathbb{R}^{N}$ is the mean number of electrons physically generated, and $\mathbf{Y}$ denotes the electrons actually sensed by the pixels.
Once the electrons are sensed by the pixels, they undergo readout, during which the analog circuit introduces additive Gaussian noise $\eta_{\text{read}} \sim \mathcal{N}(0, \sigma_{\text{read}}^{2})$. Thus, the measured electron count $\mathbf{Z} \in \mathbb{R}^{N}$ follows a Gaussian-Poisson distribution:

$$P([\mathbf{Z}]_l = z) = \sum_{y=0}^{\infty} \left( \frac{(\gamma \cdot [\Theta_{\text{tot}}]_l)^{y}\, e^{-\gamma \cdot [\Theta_{\text{tot}}]_l}}{y!} \cdot \frac{1}{\sqrt{2\pi\sigma_{\text{read}}^{2}}}\, e^{-\frac{(z-y)^{2}}{2\sigma_{\text{read}}^{2}}} \right),$$
After readout, the electrons are converted into digital codes by the ADC. Due to the bit error rate (BER) [12] caused by mismatched ADC transition points, differences exist between ‘jots’ [24]. In conclusion, the quantization process is modeled as:

$$\mathbf{Q} = \mathrm{ADC}\big(\mathrm{GaussPoiss}(\gamma \cdot \Theta_{\text{tot}};\, \sigma_{\text{read}}) + \mathbf{O}\big),$$

where $\mathbf{O} \in \mathbb{R}^{N \times N}$ represents the ADC transition-point mismatch.
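The forward model of Equation (7) can be sketched in a few lines of NumPy. This is an illustrative simulation under simplifying assumptions (a uniform unit-step ADC and zero transition-point mismatch by default), not the exact sensor model used in the experiments below:

```python
import numpy as np

rng = np.random.default_rng(1)

def qis_forward(theta_tot, gain=1.0, sigma_read=0.2, n_bits=3, offset=None):
    """Sketch of Q = ADC(GaussPoiss(gamma * Theta_tot; sigma_read) + O)."""
    electrons = rng.poisson(gain * theta_tot)                      # Poisson photon/dark electrons
    z = electrons + rng.normal(0.0, sigma_read, theta_tot.shape)   # additive Gaussian read noise
    if offset is not None:
        z = z + offset                                             # ADC transition-point mismatch O
    # Uniform multi-bit ADC with unit step, clipped to the code range.
    return np.clip(np.round(z), 0, 2 ** n_bits - 1).astype(np.int32)

theta = np.full((8, 8), 1.5)   # toy 8 x 8 jot patch with 1.5 mean electrons
print(qis_forward(theta))
```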

3. Method Description

3.1. Architecture of the Proposed QIS Reconstruction Method

This paper introduces a neural network architecture named QIS-SPFT (QIS Serial-Parallel Fusion Transformer), which integrates both serial and parallel networks within Transformer frameworks to suppress Gaussian-Poisson noise in QIS reconstruction.
The QIS-SPFT method capitalizes on the interaction of diverse structural information, obtained through attention mechanisms applied in both breadth and depth across its serial and parallel networks, to extract salient features for improved reconstruction of clean images. Its denoising performance derives from the Serial Module (SM) and Parallel Module (PM). The SM utilizes linear and nonlinear components to thoroughly search for essential information for image denoising. To gather more complementary information from different perspectives, the PM extensively explores interactions and cross-features between pixels obtained from two mixed networks, Subnet1 and Subnet2, thereby enhancing the network’s adaptability to complex noise.
The denoising framework QIS-SPFT uses the Anscombe transformation, which makes the noise in jot-summation images closer to a Gaussian distribution with constant variance. This variance-stabilizing technique facilitates the suppression of Poisson noise in QIS images. In this study, the neural network is incorporated as a denoiser within the QIS image reconstruction pipeline, and it must learn to suppress the noise remaining after the Anscombe transformation. The architecture of the proposed QIS reconstruction method is shown in Figure 3.
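For reference, the Anscombe transformation and a simple algebraic inverse can be written as below. The asymptotically unbiased inverse used in practice adds small correction terms, so this is a sketch rather than the exact pair employed by the framework:

```python
import numpy as np

def anscombe(x):
    """Variance-stabilizing transform: Poisson data -> approx. unit-variance Gaussian."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    """Direct algebraic inverse of the forward transform."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

# Poisson samples with mean 4 map to roughly constant-variance values.
samples = np.random.default_rng(2).poisson(4.0, size=5)
print(anscombe(samples))
print(inverse_anscombe(anscombe(samples)))  # recovers the original counts
```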
Within this network framework, a deep denoising network module is employed to thoroughly capture the structural information of images after variance stabilization. This module primarily comprises convolutional layers, activation functions, and Transformers. The convolutional layers convert the noisy images into linear features, the nonlinear activation functions extract richer features, while the Transformers discover the relationship between different image blocks, thus enhancing denoising performance in complex scenarios. Additionally, the framework integrates a robustness-enhancement network module, where two parallel sub-networks interact to extract complementary features from diverse perspectives, thereby improving the model’s robustness and adaptability to various scenarios. To further enhance denoising performance, Transformers are embedded in both serial and parallel modules to extract more prominent features and effectively filter out noise. This network can be represented as:
$$I_C = RM(PM(SM(I_N))),$$
where $I_N$ represents the pre-reconstructed image, and $SM$, $PM$, and $RM$ denote the functions of the Serial Module, Parallel Module, and Residual Module, respectively. $I_C$ is the clean image produced by the network. Further details about each module are given in the next subsection.
To objectively assess the denoising performance of the proposed QIS-SPFT, the Mean Squared Error (MSE) is employed as the loss function for training the network parameters. Specifically, the MSE is used to train the denoiser with paired training samples $\{I_C^i, I_N^i\}$ $(1 \le i \le n)$, where $I_C^i$ and $I_N^i$ denote the $i$-th clean and noisy images in the training dataset, respectively, and $n$ represents the total number of training samples. This training process can be mathematically expressed as:

$$l(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left\| \text{QIS-SPFT}(I_N^i) - I_C^i \right\|^{2},$$

where $l$ represents the loss function and $\theta$ denotes the parameters of QIS-SPFT to be trained.
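A minimal PyTorch training step implementing this loss might look as follows; `model`, `optimizer`, and the (noisy, clean) patch pair are hypothetical stand-ins for the QIS-SPFT denoiser and its data loader:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, noisy, clean):
    """One gradient step on the 1/(2n)-scaled MSE loss of Equation (9)."""
    optimizer.zero_grad()
    pred = model(noisy)
    loss = 0.5 * F.mse_loss(pred, clean)  # 0.5 factor matches the 1/(2n) scaling
    loss.backward()
    optimizer.step()
    return loss.item()
```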

3.2. Detail of the QIS-SPFT Denoiser

The SM is utilized to extract structural information from quanta images that contain complex noise components. Its superior denoising performance is attributed to its serial architecture, which is composed of three components: convolutional layers, activation functions, and Transformers. The first and third convolutional layers are plain convolutions, denoted $Conv$. Given that this study focuses on grayscale noisy images directly generated from light intensity, the input and output channels of the first convolutional layer are set to 1 and 64, respectively. The second convolutional layer, $Conv+R$, integrates linear and nonlinear components to extract richer features. Here, $Conv+R$ is defined as the combination of a convolutional layer $Conv$ and a ReLU activation function: the convolutional layer extracts linear features, whereas ReLU acts as a piecewise function, transforming linear information into nonlinear features. The Transformer structure is employed to explore relationships between different patches and to dynamically learn weights for various inputs, enabling adaptive noise reduction across diverse areas of the image. The architecture of the SM is shown in Figure 4, and its output is given by Equation (10).
$$O_{SM} = SM(I_N) = T\big(C\big(C\big(R\big(C(I_N)\big)\big) + C(I_N)\big)\big),$$

where $C$, $R$, and $T$ denote the convolutional layer, the ReLU activation, and the Transformer, respectively.
In this network, the Transformer architecture is implemented through an encoder that incorporates Multi-Head Self-Attention (MHSA) and Channel Feature Enhancement (CFE) mechanisms. These components are mathematically represented by Equation (11).
$$T(I_T) = CFE(O_{MHSA} + I_T) + I_{CFE} = FCL\big(FCL\big(R\big(LN(O_{MHSA} + I_T)\big)\big)\big) + I_{CFE},$$
where $FCL$ denotes the fully connected layer, $LN$ represents layer normalization, $I_T$ is the Transformer input, $O_{MHSA}$ is the MHSA output, and $I_{CFE}$ is the CFE input. The MHSA mechanism extracts global contextual information to enhance significant features for image denoising. It employs normalization layers to standardize the feature distribution, which is then fed into three parallel branches, each comprising a fully connected layer. The outputs of these branches are designated as $Q$, $K$, and $V$. The features are integrated via self-attention, further refined through an $FCL$, and then used as the MHSA output. Residual operations merge features from the Transformer inputs and MHSA outputs into the CFE. This process is described by Equation (12):
$$\begin{aligned}
O_{SM} = T(O_{IN\_T}) &= CFE\big(MHSA(O_{IN\_T}) + O_{IN\_T}\big) + O_{IN\_CFE} \\
&= CFE\left(FCL\left(\mathrm{softmax}\left(\frac{FCL(LN(O_{IN\_T})) \times FCL(LN(O_{IN\_T}))^{T}}{\sqrt{d}}\right) \times FCL(LN(O_{IN\_T}))\right) + O_{IN\_T}\right) + O_{IN\_CFE} \\
&= CFE\left(FCL\left(\mathrm{softmax}\left(\frac{Q \times K^{T}}{\sqrt{d}}\right) \times V\right) + O_{IN\_T}\right) + O_{IN\_CFE} \\
&= CFE(O_{MHSA} + O_{IN\_T}) + O_{IN\_CFE},
\end{aligned}$$
The MHSA mechanism is illustrated in Equation (12), which efficiently captures global structural information. Concurrently, normalization layers and fully connected layers are employed to standardize features, thereby improving denoising performance. The output of the MHSA is then further processed through a series of fully connected layers and normalization steps within the CFE mechanism, where it is combined with the original inputs to achieve a more effective representation of clean image information.
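The PyTorch sketch below shows one plausible realization of the T component described by Equations (11) and (12): pre-norm multi-head self-attention with a residual connection, followed by the CFE branch and its own residual. The embedding width and head count are illustrative assumptions beyond the 64-channel features stated earlier:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of T: MHSA plus CFE, each with a residual connection."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        # nn.MultiheadAttention provides the Q/K/V projections (the FCLs in
        # Equation (12)) and the softmax(Q K^T / sqrt(d)) V aggregation.
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        # CFE per Equation (11): LN -> ReLU -> FCL -> FCL.
        self.cfe = nn.Sequential(
            nn.LayerNorm(dim), nn.ReLU(), nn.Linear(dim, dim), nn.Linear(dim, dim)
        )

    def forward(self, x):                  # x: (batch, patches, dim)
        h = self.ln(x)
        attn_out, _ = self.mhsa(h, h, h)   # O_MHSA
        x = x + attn_out                   # O_MHSA + I_T (residual)
        return x + self.cfe(x)             # CFE(O_MHSA + I_T) + I_CFE
```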
As shown in Figure 5, the parallel module consists of two interacting parallel subnetworks that capture complementary features at different levels. To enhance the effectiveness of deep learning, a residual learning mechanism is employed in the subnetworks, where the residual between the input to the convolutional layer and the output after the convolutional layer is computed and used as the input to the Transformer. This process can be expressed by Equation (13).
$$O_{PM} = PM(O_{SM}) = O_{Subnet1} = Subnet1(O_{SM}, O_{Subnet2}),$$

where $O_{PM}$ is the output of the parallel module and $O_{Subnet2}$ is the output of subnetwork 2. In the parallel module, subnetwork 2 can be expressed by Equation (14).
$$O_{Subnet2} = Subnet2(O_{SM}) = T\Big(C\big(CR\big(T\big(C(CR(O_{SM})) + O_{SM}\big)\big)\big)\Big) + T\big(C(CR(O_{SM})) + O_{SM}\big),$$
Subnetwork 1 can be expressed by Equation (15).
$$O_{Subnet1} = Subnet1(O_{SM}, O_{Subnet2}) = IT\Big(T\Big(CR\Big(C\big(\mathrm{Concat}\big(T(T(CR(O_{SM}))),\ O_{Subnet2}\big)\big)\Big)\Big)\Big),$$
The SM deeply searches for key information in image denoising, while the PM extensively explores more relevant and complementary information between pixels from different perspectives to enhance the adaptability of the QIS-SPFT denoiser to complex scenes. Its effectiveness is achieved through the interaction of two heterogeneous networks, subnetwork 1 and subnetwork 2, at multiple feature levels. These two subnetworks interact and acquire complementary features from different perspectives to enhance the robustness of the resulting denoising model. The $Conv+R$ component (denoted $CR$ above) is used to extract nonlinear information, while the $Conv$ component extracts linear information. The $T$ component operates between $Conv$ and $Conv+R$ to extract salient information. Furthermore, to augment the denoising network’s capability, the input of $Conv+R$ and the output of $Conv$ are integrated using a residual learning operation.
In the parallel module (PM), two robustness-enhancement operations are employed. The first aims to improve the robustness of the features obtained from the denoiser by introducing an interaction mechanism between subnetwork 1 and subnetwork 2, namely the merging operation $\mathrm{Concat}$. The $\mathrm{Concat}$ operation includes depthwise separable convolution layers, enabling information sharing and mutual influence between the two subnetworks and thereby enhancing the module’s robustness against noise. Here, robustness refers to the module’s ability to maintain effective denoising performance when facing different types of noise or complex noise conditions. The second operation, following the information interaction between the subnetworks, incorporates an improved Transformer component, $IT$, to eliminate interference information from previous interactions. The $IT$ component consists of a stack of $FCL$, $T$, $FCL+R$, and $FCL$ components. Additionally, to enhance information acquisition, residual learning operations are conducted between the input of $IT$ and the output of $FCL$, as well as between the outputs of $T$ and $FCL$.
The residual module (RM) consists of a convolutional layer and a residual learning operation, aiming to construct a clean image by performing residual learning between the original noisy image and the predicted image in QIS-SPFT. The original image in this module is the noisy image before the Anscombe transformation, while the predicted image is the one obtained after the inverse Anscombe transformation. The convolution kernel size in this module is 3 × 3 , with 64 input channels and 1 output channel. RM can be expressed by Equation (16).
$$I_C = RM(O_{PM}) = I_N - C(O_{PM}).$$
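A minimal PyTorch sketch of the RM follows; for brevity it subtracts the convolved features directly from the noisy input, leaving out the inverse Anscombe step that sits between the network output and this subtraction in the full pipeline:

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Sketch of Equation (16): I_C = I_N - C(O_PM)."""

    def __init__(self, channels=64):
        super().__init__()
        # 3x3 convolution mapping 64 feature channels to 1 image channel.
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, o_pm, noisy):
        # noisy: (B, 1, H, W) original image; o_pm: (B, 64, H, W) PM features.
        return noisy - self.conv(o_pm)
```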

4. Experimental Results

This section presents simulations and experimental validation of the QIS noise suppression method proposed in this paper. The effectiveness of the network architecture is first verified through ablation experiments, and the proposed reconstruction method is then evaluated across different datasets. All experiments were performed in a Python 3.11.4 environment on a computer equipped with an Intel 12400F 4.4 GHz 6-core CPU and 32 GB of memory; neural network training was accelerated with an NVIDIA GeForce RTX 3070 Ti using CUDA 12.3.
The training dataset consisted of the following: the BSD dataset containing 432 natural images [25], the DIV2K dataset containing 800 natural images [26], the Flickr2K dataset containing 2650 natural images [27], and the WED dataset containing 4744 natural images [28]. Together these formed a synthetic-noise training dataset of 8626 natural images. To enrich the training dataset, each image was randomly cropped into 108 patches of size $48 \times 48$, resulting in 931,608 image patches for training the grayscale denoising model. The inputs used for training were the $S_{nL}$, the three-dimensional data matrices summed over temporal oversampling rate $L$. During training, the temporal oversampling rate $T$ was set to 16, the spatial oversampling rate $K$ was set to 1, and $\sigma_{\text{read}}$ was set to 0.2 e−. The targets were the corresponding clean (ground-truth) images. For testing and evaluation, we adopted the standard benchmark datasets BSD68, Set12 [29], and Kodak24 [30].
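The patch-extraction step (random 48 × 48 crops, 108 per image) can be sketched as below; the grayscale image array is assumed to be already loaded and bit-plane summed:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_patches(image, n_patches=108, size=48):
    """Randomly crop square training patches from one grayscale image."""
    h, w = image.shape
    patches = []
    for _ in range(n_patches):
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        patches.append(image[top:top + size, left:left + size])
    return np.stack(patches)   # shape: (n_patches, size, size)
```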
The parameter settings for training the QIS-SPFT denoiser were as follows: the batch size was set to 8, the number of epochs to 24, and the initial learning rate to $1 \times 10^{-4}$.

4.1. Network Analysis Experiments

To explore the rationality of the network structure settings in this study, experiments were conducted comparing configurations with and without the Anscombe transform and with and without the IT component, in a combinatorial manner. The test datasets were Set12 [29] and BSD68 [29], with Noise level 1 ($T = 16$, $K = 4$, $\sigma_{\text{read}} = 0.2$ e−), Noise level 2 ($T = 16$, $K = 4$, $\sigma_{\text{read}} = 0.4$ e−), Noise level 3 ($T = 16$, $K = 2$, $\sigma_{\text{read}} = 0.2$ e−), and Noise level 4 ($T = 16$, $K = 1$, $\sigma_{\text{read}} = 0.4$ e−). This paper objectively evaluates the denoising effect using the peak signal-to-noise ratio (PSNR), which can be expressed as:
$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{n} - 1)^{2}}{MSE},$$

where $n$ is the bit depth of the image, generally set to 8, and $MSE$ is the mean squared error between images.
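As a reference, the PSNR used throughout the experiments can be computed as follows (a straightforward sketch assuming equal-sized grayscale images):

```python
import numpy as np

def psnr(reference, test, bit_depth=8):
    """PSNR per the equation above, in dB."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    peak = 2.0 ** bit_depth - 1.0
    return 10.0 * np.log10(peak ** 2 / mse)
```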
We computed the average PSNR of the reconstructed images across the test datasets and noise levels for each network configuration. The PSNR comparison across the network architectures is shown in Table 1.
The experimental results show that the network incorporating both the Anscombe transform and the IT component achieves the best denoising performance. The introduction of the Anscombe transform plays a positive role in improving noise suppression. In photon-counting imaging, Poisson noise is significant, and this non-Gaussian noise makes the denoising process challenging. The Anscombe transform alleviates the complexity of the noise distribution by converting Poisson noise into approximately Gaussian noise, reducing the modeling difficulty and improving noise suppression performance. This structure has been shown to achieve better results in QIS denoising tasks [17]. If the Anscombe transform and its inverse are not included, the neural network must learn this mapping autonomously during training, which is a relatively aggressive approach. Experimental results show that, whether or not the IT component is included, the configurations containing the Anscombe transform achieve slight improvements in PSNR, verifying the rationality and effectiveness of including it in the network.
Secondly, the IT component also significantly improves the denoising performance of the network. To further enhance denoising performance, the IT component—which comprises FCL, ReLU, T, and residual learning—extracts various features across layers. These features include both linear and nonlinear structural information, as well as significant structural details derived from inter-image pixel relationships. Experimental results indicate that the exclusion of the IT component results in a significant drop in PSNR, demonstrating that the IT component plays a crucial role in enhancing the network’s denoising performance.
When both the Anscombe transform and IT component are introduced simultaneously, the network achieves optimal performance. This indicates that the Anscombe transform and IT components are complementary within the network: the former effectively mitigates the complex noise composition problem in QIS, providing higher-quality input data for subsequent network processing, while the latter further exploits image correlation information between networks.

4.2. Image Denoising Experimental Results

To comprehensively evaluate the denoising performance of the proposed QIS-SPFT, both quantitative and qualitative assessments were applied. For quantitative evaluation, the proposed QIS-SPFT was compared with four existing image reconstruction and denoising methods for QIS: the MLE method [15], the TD-BM3D method [17], QISNet [18], and DPIR [31]. PSNR was adopted to quantitatively assess the performance of QIS-SPFT. For grayscale synthetic QIS noisy images, the Set12 [29] and BSD68 [29] datasets were selected for testing, with four noise levels: Noise level 1 ($T = 16$, $K = 4$, $\sigma_{\text{read}} = 0.2$ e−), Noise level 2 ($T = 16$, $K = 4$, $\sigma_{\text{read}} = 0.4$ e−), Noise level 3 ($T = 16$, $K = 2$, $\sigma_{\text{read}} = 0.2$ e−), and Noise level 4 ($T = 16$, $K = 1$, $\sigma_{\text{read}} = 0.4$ e−). The average PSNR results on the BSD68 dataset [29] are shown in Table 2.
The PSNR evaluation results for different noise levels on the Set12 dataset [29] are shown in Table 3.
To further illustrate the adaptability of the proposed reconstruction method across various scenarios, noisy images ($T = 16$, $K = 3$, $\sigma_{\text{read}} = 0.2$ e−) were processed using three different datasets: Set12 [29], BSD68 [29], and Kodak24 [30]. The PSNR results were averaged for each dataset, and they indicate that the proposed method outperforms the comparative methods. The experimental results are shown in Table 4.
Quantitative analysis results demonstrate that the proposed QIS-SPFT method exhibits significant performance advantages under various noise levels on the Set12 [29], BSD68 [29] and Kodak24 [30] datasets. Compared to existing methods MLE [15], TD-BM3D [17], QISNet [18] and DPIR [31], QIS-SPFT achieves higher PSNR values, proving its superiority in QIS image denoising tasks. The proposed method demonstrates excellent denoising results in both low-noise and high-noise scenarios, showing strong adaptability and robustness.
For the qualitative evaluation of the denoising effectiveness of the proposed QIS-SPFT method, visual comparisons of reconstructed images were conducted. Specifically, regions of interest were selected to highlight key differences among different methods, as illustrated in Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10. These figures provide a clear visual representation of how the proposed QIS-SPFT method enhances image clarity and effectively suppresses noise.
Compared with existing denoising techniques such as MLE [15], TD-BM3D [17], QISNet [18] and DPIR [31], QIS-SPFT demonstrates superior performance by preserving finer image details and improving the visibility of crucial regions. The results indicate that QIS-SPFT effectively reduces non-ideal noise while preserving the structural integrity and texture of the original images, which is essential for high-quality QIS image reconstruction. These findings further confirm that QIS-SPFT outperforms conventional denoising approaches and establishes itself as a robust method for enhancing imaging quality in QIS image reconstruction.

5. Discussion

  • Loss of Image Texture: The proposed method is primarily designed to enhance image fidelity in the context of QIS reconstruction, where input signals are extremely noisy due to photon-limited conditions and sparse bit-plane encoding. In such scenarios, suppressing heavy noise and faithfully recovering the underlying image structure become the main objectives. As a result, our model is trained to minimize pixel-wise reconstruction error, which is well reflected by metrics such as PSNR. However, this emphasis on fidelity can sometimes lead to a perceptual compromise. Specifically, the network exhibits a mild tendency toward over-smoothing, which may result in the suppression of some fine details, even when such details remain partially visible in the noisy input. This phenomenon stems from the inherent trade-off between aggressive noise suppression and texture preservation. To address this, our network architecture integrates modules for capturing local and global pixel correlations, which help preserve detailed structures to a certain extent. Nevertheless, the current design is still biased toward achieving higher fidelity scores rather than optimizing for perceptual quality. To further improve the balance between these two aspects, future extensions of this work will consider introducing perceptual-oriented loss functions and incorporating structural attention mechanisms. These improvements are expected to enhance the retention of fine details and visual textures, particularly in important regions, while maintaining strong denoising performance under extreme noise conditions.
  • Image Degradation: While the proposed QIS-SPFT framework demonstrates strong performance for photon shot noise suppression in QIS image reconstruction, it is specifically designed for binary bit-plane data under low-light conditions. Its generalization to other degradation types, such as motion blur [32] or haze [33], has not been explored in this work. Blur degradation is a convolution with the Poisson process of imaging and requires additional pre-processing or architectural adaptations. Equation (5) can now be described as:

    $$P([\mathbf{Y}]_l = y) = \frac{(\gamma \cdot [\mathbf{F}\,\Theta_{\text{tot}}]_l)^{y}\, e^{-\gamma \cdot [\mathbf{F}\,\Theta_{\text{tot}}]_l}}{y!},$$

    where $\mathbf{F}$ represents the blur kernel. The haze degradation follows the atmospheric scattering model. Future work may consider extending the framework to incorporate deblurring or dehazing modules to broaden its applicability.

6. Conclusions

This paper first introduced the shortcomings of current QIS noise reconstruction methods, then presented the non-ideal imaging model of QIS and proposed a noise suppression reconstruction method based on neural networks. The proposed neural network integrates convolutional neural networks and Transformers. Its architecture combines the Anscombe transformation with serial and parallel modules to enhance denoising performance and adaptability across various scenarios. Finally, the experiments demonstrate the rationality of the neural network architecture, showing that the proposed reconstruction method effectively suppresses photon shot noise in the QIS imaging process at a feasible and acceptable oversampling rate. This makes it an ideal solution for practical QIS imaging applications under extreme noise conditions. It achieves superior noise suppression compared to MLE [15], TD-BM3D [17], QISNet [18], and DPIR [31].

Author Contributions

Conceptualization, G.W.; methodology, G.W.; software, G.W.; validation, G.W.; formal analysis, G.W.; investigation, G.W.; resources, G.W.; data curation, G.W.; writing—original draft preparation, G.W.; writing—review and editing, Z.G.; supervision, Z.G.; project administration, Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fossum, E.R. Some Thoughts on Future Digital Still Cameras. In Image Sensors and Signal Processing for Digital Still Cameras; CRC Press: Boca Raton, FL, USA, 2017; pp. 305–314. ISBN 9781315221083.
  2. Fossum, E.R. What to do with sub-diffraction-limit (SDL) pixels?—A proposal for a gigapixel digital film sensor (DFS). In Proceedings of the IEEE Workshop on Charge-Coupled Devices and Advanced Image Sensors, Nagano, Japan, 9–11 June 2005; pp. 214–217.
  3. Fossum, E.R. The quanta image sensor (QIS): Concepts and challenges. In Proceedings of the Imaging and Applied Optics, Toronto, ON, Canada, 10–14 July 2011.
  4. Ma, J.; Zhang, D.; Elgendy, O.A.; Masoodian, S. A 0.19 e-rms read noise 16.7 Mpixel stacked quanta image sensor with 1.1 μm-pitch backside illuminated pixels. IEEE Electron. Device Lett. 2021, 42, 891–894.
  5. Ma, J.; Chan, S.; Fossum, E.R. Review of quanta image sensors for ultralow-light imaging. IEEE Trans. Electron. Devices 2022, 69, 2824–2839.
  6. Li, C.; Qu, X.; Gnanasambandam, A.; Elgendy, O.A.; Ma, J.; Chan, S.H. Photon-Limited Object Detection Using Non-Local Feature Matching and Knowledge Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021.
  7. Gnanasambandam, A.; Chan, S.H. HDR Imaging with Quanta Image Sensors: Theoretical Limits and Optimal Reconstruction. IEEE Trans. Comput. Imaging 2020, 6, 1571–1585.
  8. Gyongy, I.; Dutton, N.A.; Henderson, R.K. Single-Photon Tracking for High-Speed Vision. Sensors 2018, 18, 323.
  9. Elgendy, O.A.; Chan, S.H. Color Filter Arrays for Quanta Image Sensors. IEEE Trans. Comput. Imaging 2020, 6, 652–665.
  10. Chen, S.; Ceballos, A.; Fossum, E.R. Digital integration sensor. In Proceedings of the International Image Sensor Workshop, Snowbird, UT, USA, 12–16 June 2013.
  11. Fossum, E.R.; Ma, J.; Masoodian, S. Quanta image sensor: Concepts and progress. In Proceedings of the Advanced Photon Counting Techniques X, Baltimore, MD, USA, 5 May 2016.
  12. Fossum, E.R. Modeling the Performance of Single-Bit and Multi-Bit Quanta Image Sensors. IEEE J. Electron. Devices Soc. 2013, 1, 166–174.
  13. Fossum, E.R. Photon Counting Error Rates in Single-Bit and Multi-Bit Quanta Image Sensors. IEEE J. Electron. Devices Soc. 2016, 4, 136–143.
  14. Gnanasambandam, A.; Chan, S.H. Exposure-Referred Signal-to-Noise Ratio for Digital Image Sensors. IEEE Trans. Comput. Imaging 2022, 8, 561–575.
  15. Yang, F.; Lu, Y.M.; Sbaiz, L.; Vetterli, M. Bits from Photons: Oversampled Image Acquisition Using Binary Poisson Statistics. IEEE Trans. Image Process. 2012, 21, 1421–1436.
  16. Chan, S.H.; Lu, Y.M. Efficient Image Reconstruction for Gigapixel Quantum Image Sensors. In Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Atlanta, GA, USA, 3–5 December 2014.
  17. Chan, S.H.; Elgendy, O.A.; Wang, X. Images from Bits: Non-Iterative Image Reconstruction for Quanta Image Sensors. Sensors 2016, 16, 1961.
  18. Choi, J.H.; Elgendy, O.A.; Chan, S.H. Image Reconstruction for Quanta Image Sensors Using Deep Neural Networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  19. Tian, C.; Fei, L.; Zheng, W.; Xu, Y.; Zuo, W.; Lin, C.W. Deep Learning on Image Denoising: An Overview. Neural Netw. 2020, 131, 251–275.
  20. Buades, A.; Coll, B.; Morel, J.M. Image Denoising Methods. A New Nonlocal Principle. SIAM Rev. 2010, 52, 113–147.
  21. Andriyanov, N.; Belyanchikov, A.; Vasiliev, K.; Dementiev, V. Restoration of Spatially Inhomogeneous Images Based on Doubly Stochastic Filters. In Proceedings of the 2022 IEEE International Conference on Information Technologies (ITNT), Moscow, Russia, 18–20 May 2022.
  22. Krasheninnikov, V.; Kuvayskova, Y.; Subbotin, A. Pseudo-gradient Algorithm for Identification of Doubly Stochastic Cylindrical Image Model. In Proceedings of the 2020 International Conference on Information Technology and Nanotechnology (ITNT), Samara, Russia, 23–27 May 2020.
  23. Jiang, B.; Li, J.; Lu, Y.; Cai, Q.; Song, H.; Lu, G. Efficient Image Denoising Using Deep Learning: A Brief Survey. Inf. Fusion 2025, 118, 103013.
  24. Xu, J.; Zhao, X.; Han, L.; Nie, K.; Xu, L.; Ma, J. Effect of the Transition Points Mismatch on Quanta Image Sensors. Sensors 2018, 18, 4357.
  25. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A Database of Human Segmented Natural Images and Its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001.
  26. Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017.
  27. Timofte, R.; Agustsson, E.; Van Gool, L.; Yang, M.H.; Zhang, L. NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017.
  28. Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo Exploration Database: New Challenges for Image Quality Assessment Models. IEEE Trans. Image Process. 2016, 26, 1004–1016.
  29. Roth, S.; Black, M.J. Fields of Experts: A Framework for Learning Image Priors. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005.
  30. Franzen, R. Kodak Lossless True Color Image Suite. 1999, Volume 2. Available online: http://r0k.us/graphics/kodak (accessed on 18 March 2023).
  31. Zhang, K.; Zuo, W.; Zhang, L.; Zhang, D. Plug-and-Play Image Restoration with Deep Denoiser Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6360–6376.
  32. Wang, X.; Zhang, C.; Xu, J. Motion Deblurring Method of Quanta Image Sensor Based on Spatial Correlation and Frequency Domain Characteristics. Opt. Eng. 2024, 63, 083102.
  33. Liu, Y.; Wang, X.; Hu, E.; Wang, A.; Shiri, B.; Lin, W. VNDHR: Variational Single Nighttime Image Dehazing for Enhancing Visibility in Intelligent Transportation Systems via Hybrid Regularization. IEEE Trans. Intell. Transp. Syst. 2025, 1–15.
Figure 1. Concept of spatio-temporal oversampling.
Figure 2. The imaging model of the QIS.
Figure 3. Architecture of the proposed QIS reconstruction method.
Figure 4. Detail of the serial module.
Figure 5. Detail of the parallel module.
Figure 6. Comparison of denoising results of five methods for “castle” in the BSD68 dataset: (a) Ground Truth, (b) MLE [15], (c) TD-BM3D [17], (d) QISNet [18], (e) DPIR [31], (f) Proposed.
Figure 7. Comparison of denoising results of five methods for “mushroom” in the BSD68 dataset: (a) Ground Truth, (b) MLE [15], (c) TD-BM3D [17], (d) QISNet [18], (e) DPIR [31], (f) Proposed.
Figure 8. Comparison of denoising results of five methods for “boat” in the Set12 dataset: (a) Ground Truth, (b) MLE [15], (c) TD-BM3D [17], (d) QISNet [18], (e) DPIR [31], (f) Proposed.
Figure 9. Comparison of denoising results of five methods for “house” in the Kodak24 dataset: (a) Ground Truth, (b) MLE [15], (c) TD-BM3D [17], (d) QISNet [18], (e) DPIR [31], (f) Proposed.
Figure 10. Comparison of denoising results of five methods for “airplane” in the Kodak24 dataset: (a) Ground Truth, (b) MLE [15], (c) TD-BM3D [17], (d) QISNet [18], (e) DPIR [31], (f) Proposed.
Table 1. Network structure verification and comparison results.

Method                                     PSNR (dB)
QIS-SPFT without Anscombe, without IT      28.214
QIS-SPFT without Anscombe, with IT         28.253
QIS-SPFT with Anscombe, without IT         28.221
QIS-SPFT with Anscombe, with IT            28.269
Table 2. Average PSNR evaluation results under four different noise levels in the BSD68 dataset [29].

Methods                Noise Level 1   Noise Level 2   Noise Level 3   Noise Level 4
MLE [15]               20.32           19.98           19.65           13.84
TD-BM3D [17]           28.89           28.76           27.86           23.66
QISNet [18]            29.31           29.01           28.32           24.32
DPIR [31]              30.01           29.33           28.79           25.42
QIS-SPFT (proposed)    30.32           29.78           29.13           25.85
Table 3. PSNR evaluation results under four different noise levels in the Set12 dataset [29].

Images                 Cameraman   House   Pepper   Starfish   Monarch   Parrot

Noise level 1
MLE [15]               19.72       22.11   20.92    19.76      19.74     18.88
TD-BM3D [17]           29.21       33.32   31.44    30.31      29.34     29.04
QISNet [18]            29.74       33.77   31.91    30.89      29.83     29.64
DPIR [31]              30.43       34.42   32.63    31.52      30.62     30.31
QIS-SPFT (proposed)    30.77       34.75   32.80    31.95      30.91     30.75

Noise level 2
MLE [15]               18.21       21.42   18.91    18.43      19.22     17.84
TD-BM3D [17]           28.77       31.64   30.21    29.02      29.88     28.44
QISNet [18]            29.33       32.12   30.84    29.54      30.34     29.01
DPIR [31]              29.72       33.53   31.23    30.30      30.77     29.33
QIS-SPFT (proposed)    30.07       33.92   31.71    30.65      31.02     29.84

Noise level 3
MLE [15]               18.19       21.34   18.94    18.88      18.26     18.66
TD-BM3D [17]           27.22       31.24   28.34    27.21      27.84     27.34
QISNet [18]            28.41       31.86   28.91    27.86      28.52     27.88
DPIR [31]              29.66       32.91   30.64    29.15      29.72     28.81
QIS-SPFT (proposed)    29.90       33.24   30.89    29.46      30.13     29.16

Noise level 4
MLE [15]               13.23       15.44   14.10    12.98      13.32     13.58
TD-BM3D [17]           23.91       25.71   23.32    23.11      23.97     24.23
QISNet [18]            24.32       26.15   23.91    23.74      24.51     24.72
DPIR [31]              25.88       27.02   26.41    24.97      26.24     26.58
QIS-SPFT (proposed)    26.10       28.34   26.85    25.35      26.62     26.86
Table 4. Average PSNR evaluation results under three different testing datasets.

Datasets       MLE [15]   TD-BM3D [17]   QISNet [18]   DPIR [31]   QIS-SPFT
Set12 [29]     20.33      28.71          29.26         29.87       30.12
BSD68 [29]     20.12      27.97          28.48         29.44       29.76
Kodak24 [30]   20.85      29.24          30.12         30.68       31.13