Sensors
  • Article
  • Open Access

6 September 2025

Tracking-Based Denoising: A Trilateral Filter-Based Denoiser for Real-World Surveillance Video in Extreme Low-Light Conditions †

1 School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
2 School of Software, Taiyuan University of Technology, Taiyuan 030024, China
* Authors to whom correspondence should be addressed.
† This paper is an extended version of our paper published in Proceedings of the 2nd International Conference on Internet of Things, Communication and Intelligent Technology, Xuzhou, China, 22–24 September 2023.
This article belongs to the Section Sensing and Imaging

Abstract

Video denoising in extremely low-light surveillance scenarios is a challenging computer vision task, as it suffers from harsh noise and insufficient signal for reconstructing fine details. Denoising algorithms for these scenarios face challenges such as the lack of ground truth, and the noise distribution in the real world is far more complex than in normal scenes. Consequently, recent state-of-the-art (SOTA) video denoising methods such as VRT and Turtle perform poorly in this low-light environment. Additionally, some methods rely on raw video data, which are difficult to obtain from surveillance systems. In this paper, a denoising method based on the trilateral filter is proposed, which aims to denoise real-world low-light surveillance videos. Our trilateral filter is a weighted filter that allocates reasonable weights to different inputs to produce an appropriate output. Our idea is inspired by an experimental finding: noise on stationary objects can be easily suppressed by averaging adjacent frames. This led us to believe that if we can track moving objects accurately and filter along their trajectories, the noise can be effectively removed. Our proposed method involves four main steps. First, coarse motion vectors are obtained by a bilateral search. Second, an amplitude-phase filter is used to judge and correct erroneous vectors. Third, these vectors are refined by a full search in a small area for greater accuracy. Finally, the trilateral filter is applied along the trajectory to denoise the noisy frame. Extensive experiments demonstrate that our method achieves superior performance in both visual quality and quantitative tests.

1. Introduction

Reducing the noise inherent in video sensors is a critical challenge, particularly under the harsh conditions of low-light surveillance. The demand for reliable, high-quality video from surveillance systems is ever-increasing, driven by security needs such as preventing nighttime theft. This quality is also essential for applications like autonomous driving, object detection, and action recognition, which necessitates effective denoising techniques that can operate under severe signal degradation. The principle of video denoising is to reconstruct the true signal corrupted by sensor noise. This is achieved by exploiting the spatiotemporal redundancy inherent in the video data stream: patches with high similarity to a target region are identified across both space and time, and a weighted combination of these similar patches is used to reconstruct the original feature, effectively restoring the clean signal from its noisy frames. The challenges originate directly at the sensor level. First, the scarcity of incident photons in low-light environments results in a fundamentally low signal-to-noise ratio [1,2,3,4] and significant signal distortion. Second, unlike the sophisticated sensors in professional cameras from manufacturers like Sony or Panasonic, the sensors in cost-effective surveillance hardware are inherently more prone to thermal and read noise, which severely amplifies these issues when the captured signal itself is weak.
One low-light video denoising method relies on raw video data as input [5]. However, such raw data are difficult to obtain from standard surveillance systems, which typically output processed and compressed formats like H.264 or YUV/RGB. Attempting to reverse the on-camera Image Signal Processing (ISP) pipeline to recover the raw data often leads to severe artifacts and information loss, as critical sensor-level information is irrevocably discarded during processing. A recent method [6] divides surveillance video into static and moving regions. It introduces an optimization solution based on Dynamic Mode Decomposition (DMD) and Plug-and-Play Alternating Direction Method of Multipliers (PnP-ADMM), formulating a minimization problem that incorporates an implicit regularization term to reduce noise. However, it fails to effectively remove noise in extremely low-light, high-noise scenarios. Moreover, it primarily focuses on dynamic mode decomposition, neglecting structure and texture. Additionally, DMD relies on linear system approximations, which can be inadequate for complex, nonlinear motion, and high-frequency DMD modes may be over-smoothed, adversely affecting details. Another video denoising method [3] embeds the BM3D [7] algorithm into the HEVC processing workflow, reducing redundant computations by replacing BM3D's block matching with HEVC's motion estimation. This approach is considered efficient for video denoising but also faces challenges in extremely noisy low-light conditions, as such hybrid frameworks often prioritize computational efficiency over adaptive noise modeling, leading to insufficient robustness against the complex noise patterns of practical surveillance scenarios.
In this paper, we address the task of denoising surveillance videos captured in low-light environments. This task presents the following difficulties. First, we lack ground truth. Second, the noise distribution in each RGB channel is different. Third, the noise varies greatly across frames, and the low-light environment further complicates the problem. Fourth, the images obtained by many well-known denoising algorithms are either over-smoothed or still noisy, indicating how difficult it is to balance noise removal and detail retention in such a harsh environment. In our experiments, noise is largely removed after averaging adjacent frames. In Figure 1, the first two images contain only stationary objects, whose noise is largely reduced by multi-frame temporal averaging. The second two images include moving objects, but the moving electric bike (as shown in Figure 1d) disappears and leaves a trajectory blur. Inspired by this phenomenon, we believe that if the trajectory of a moving object is accurately found and we filter along that trajectory, the noise will be well suppressed, and reversal artifacts can be efficiently reduced, as in [8]. In this paper, a tracking-based denoising algorithm is proposed, and our contributions can be summarized as follows:
  • A simple but efficient motion vector estimation method is put forward, which can also be applied to other computer vision tasks.
  • A motion updating filter called an amplitude-phase filter is proposed, which improves the accuracy of these motion vectors. In addition, a denoising filter, namely the trilateral filter, is proposed by considering the gradient information, and it can suppress the gradient reversal artifact of the bilateral filter.
  • A tracking-based video denoising method is proposed.
Figure 1. Comparison of noisy images in video and their corresponding temporal average. (a) A single noisy frame from the static sequence. (b) The result of averaging adjacent frames in the static region. (c) A single noisy frame from the moving sequence. (d) The result of averaging adjacent frames in the moving region.

3. Methods

This work is a significantly extended version of our conference paper [46]. In this section, we will introduce our tracking-based algorithm shown in Figure 2.
Figure 2. Flowchart of our method. The process begins with coarse motion estimation to generate initial motion vectors, which are then iteratively updated and refined. Finally, a trilateral filter produces the clean frames.

3.1. Coarse Motion Estimation

The coarse motion estimation module is founded upon the classic Block-Matching Estimation (BME) framework, as exemplified by the method in [47]. This approach is widely adopted for its computational efficiency. Our primary contribution in this context is not the invention of the search mechanism itself, but its novel application and adaptation to the domain of low-light surveillance video denoising, which is a departure from its conventional use in Frame Rate Up-Conversion (FRUC). The objective is also fundamentally different: while in FRUC, motion vectors are primarily used for interpolating missing frames via methods like Overlapped Block Motion Compensation (OBMC), our implementation leverages them to enhance the robustness of cross-frame matching specifically for denoising purposes. We selected this bilateral search strategy because it is not only computationally efficient due to the simple Sum of Absolute Differences (SAD) matching criterion, but it also provides accurate results. It can effectively avoid artifacts such as overlaps or holes that can result from unidirectional motion estimation. Furthermore, the resulting motion vectors are well suited for the subsequent vector refinement stage. The process of our bilateral search is illustrated in Figure 3.
Figure 3. Coarse motion estimation via bilateral search. The search areas in the previous and subsequent frames are positioned symmetrically with respect to the location of the current block.
In this process, each frame is segmented into $8 \times 8$ blocks, where 8 is the block size ($B_s$). To improve matching robustness, we adopt OBMC, which extends the boundary of the original block by a margin of overlapped pixels ($O_p$). To facilitate the processing of edge blocks, the frame is padded with a certain width of padding pixels ($P_p$). Based on these steps, the search area is defined as $A_{ab} = \{(x, y) \mid 1 + (a-1) B_s + P_p - O_p \le x < a B_s + P_p + O_p,\ 1 + (b-1) B_s + P_p - O_p \le y < b B_s + P_p + O_p\}$, where the subscripts $a$ and $b$ denote the block's row and column indices, respectively. Naturally, the ranges of these indices are constrained by the dimensions of the input video frame (i.e., the resolution of the motion vector field), which we denote as $M \times N$. The lower bound for both $a$ and $b$ is 1, and the upper bound is determined by the frame's boundaries; specifically, index $a$ must satisfy $1 \le a B_s + P_p + O_p < M$. Once the block's spatial range is defined, the SAD is used to calculate the pixel differences.
$$\mathrm{SAD} = \sum_{\substack{(x, y) \in A_{ab} \\ (\Delta x, \Delta y) \in SA_{ab}}} \left| F_{n-1}(x - \Delta x,\, y - \Delta y) - F_n(x + \Delta x,\, y + \Delta y) \right| \tag{1}$$
In Equation (1), $F_{n-1}(x - \Delta x, y - \Delta y)$ represents a block with the starting point $(x - \Delta x, y - \Delta y)$ in the previous frame, and $F_n(x + \Delta x, y + \Delta y)$ is defined analogously for the subsequent frame. $SA_{ab}$ is the search area in the previous frame, with $SA_{ab} = \{(\Delta x, \Delta y) \mid -S_{ws} \le \Delta x < S_{ws},\ -S_{ws} \le \Delta y < S_{ws}\}$, in which $\Delta x$ and $\Delta y$ are integer offsets and $S_{ws}$ denotes the search window size. The motion vector $(\Delta x, \Delta y)$ that minimizes the SAD value is selected, as shown in Equation (2).
$$(\Delta x, \Delta y) = \arg\min \mathrm{SAD} \tag{2}$$
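To make the bilateral search of Equations (1) and (2) concrete, the following NumPy sketch estimates a coarse motion vector for a single block; the function signature, the grayscale input, and the boundary handling are illustrative assumptions of ours rather than the authors' implementation.

```python
import numpy as np

def coarse_motion_vector(prev, nxt, top, left, Bs=8, Op=4, Sws=6):
    """Bilateral SAD search for one (Bs x Bs) block whose top-left corner in
    the current frame is (top, left). prev/nxt are the previous and subsequent
    grayscale frames (same shape). Returns the (dx, dy) that minimizes the SAD
    of the symmetrically displaced overlapped blocks, as in Equations (1)-(2)."""
    h = Bs + 2 * Op                              # overlapped block size (OBMC margin)
    best, best_mv = np.inf, (0, 0)
    for dx in range(-Sws, Sws):
        for dy in range(-Sws, Sws):
            y0p, x0p = top - dy - Op, left - dx - Op     # block in previous frame
            y0n, x0n = top + dy - Op, left + dx - Op     # symmetric block in next frame
            if min(y0p, x0p, y0n, x0n) < 0:
                continue
            if max(y0p, y0n) + h > prev.shape[0] or max(x0p, x0n) + h > prev.shape[1]:
                continue
            sad = np.abs(prev[y0p:y0p + h, x0p:x0p + h].astype(np.float32)
                         - nxt[y0n:y0n + h, x0n:x0n + h].astype(np.float32)).sum()
            if sad < best:
                best, best_mv = sad, (dx, dy)
    return best_mv

# Toy usage: a bright patch moves by (+4, +2) pixels between the previous and
# subsequent frames, i.e., a symmetric half-offset of (dy, dx) = (2, 1).
prev = np.zeros((64, 64), np.float32); prev[20:28, 20:28] = 200
nxt = np.zeros((64, 64), np.float32); nxt[24:32, 22:30] = 200
print(coarse_motion_vector(prev, nxt, top=22, left=21))   # expected (dx, dy) = (1, 2)
```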

3.2. Motion Vector Updating

The initial motion vector $(\Delta x, \Delta y)$ is often coarse, and thus an update mechanism is introduced. We represent $(\Delta x, \Delta y)$ in terms of its amplitude and phase, where the amplitude $A = \sqrt{\Delta x^2 + \Delta y^2}$ and the phase $P = \arctan\!\left(\frac{\Delta y}{\Delta x}\right)$. Therefore, the vector can be modified by altering $A$ and $P$, transforming it as follows: $(\Delta x, \Delta y) = (A \cos P, A \sin P) \rightarrow (A' \cos P', A' \sin P') = (\Delta x', \Delta y')$.
The mechanism, therefore, is designed to transform $(A, P)$ into $(A', P')$. Let $V_{ab} = (\Delta x, \Delta y)$ be the vector to be updated, which belongs to the block in the $a$th row and the $b$th column, and let $V_1, \ldots, V_8$ be its 8-neighbor vectors, as illustrated in Figure 4.
Figure 4. The Amplitude-Phase Filtering process. The filter first computes the difference between each motion vector and the average of its eight neighbors. This difference is then thresholded to validate the vector’s reasonableness. Finally, an iterative median filter is applied to ensure spatial smoothness.
To reduce singular motions, the local continuity of the motion vector field is considered. Local continuity is measured as the Manhattan distance in terms of amplitude ($AL_{ab}$) and phase ($PL_{ab}$), as defined in Equation (3). Here, $A_{ab}$ and $P_{ab}$ are the amplitude and phase of vector $V_{ab}$. Similarly, $A_i$ and $P_i$ are the amplitude and phase of the neighbor vector $V_i$, respectively, where $i$ ranges from 1 to 8.
$$AL_{ab} = \left| A_{ab} - \frac{1}{8}\sum_{i=1}^{8} A_i \right|, \qquad PL_{ab} = \left| P_{ab} - \frac{1}{8}\sum_{i=1}^{8} P_i \right| \tag{3}$$
A rationality matrix $R$ is used to judge whether the current motion vector is reliable. Since this is a binary classification problem, the element $R_{ab}$ of this matrix can only be 0 or 1, where $R_{ab}$ indicates the rationality of the block in the $a$th row and the $b$th column. Equation (3) evaluates the motion vector $V_{ab}$ by calculating its local continuity in amplitude ($AL_{ab}$) and phase ($PL_{ab}$). Finally, Equation (4) is used to initialize the rationality matrix, where $\theta_{max}$ and $\theta_{min}$ are the maximum and minimum phase offset thresholds.
$$R_{ab} = \begin{cases} 0, & AL_{ab} \ge \dfrac{B_s}{8},\ \ PL_{ab} \ge \theta_{max} \\[4pt] 1, & AL_{ab} \le \dfrac{B_s}{16},\ \ PL_{ab} \le \theta_{min} \end{cases} \tag{4}$$
Vectors with a rationality of 1 are considered reasonable, while those with a rationality of 0 are deemed unreasonable. Reasonable vectors are left unchanged, but for unreasonable vectors a median filter is applied to update them. This process is called the Amplitude-Phase Filter (APF), and Equation (5) shows the details.
$$APF(V_{ab}) = \begin{cases} \mathrm{median}\{V_1, V_2, V_3, \ldots, V_8\}, & R_{ab} = 0 \\ V_{ab}, & R_{ab} = 1 \end{cases} \tag{5}$$
Furthermore, a stop condition for this iterative mechanism must be designed. Our design follows two principles: the number of updates cannot be excessively large, and the majority of motion vectors must be reasonable. The first principle is easily satisfied by setting a maximum number of iterations. The second can be measured by the sum of the rationality matrix, namely $\sum_{a=1}^{M}\sum_{b=1}^{N} R_{ab}$. In Equation (6), $M \times N$ is the resolution of the motion vector field, as mentioned before, and $\beta$ is a value between 0 and 1 that describes the proportion of reasonable vectors among all vectors.
$$\sum_{a=1}^{M}\sum_{b=1}^{N} R_{ab} \ge \beta M N \tag{6}$$
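A minimal sketch of the iterative amplitude-phase filtering loop (Equations (3)–(6)) is shown below. The component-wise vector median, the skipped border vectors, and the reading of the thresholds in Equation (4) are our assumptions rather than the authors' released code.

```python
import numpy as np

def amplitude_phase_filter(V, Bs=8, theta_max=45.0, theta_min=30.0,
                           beta=0.96, max_iter=15):
    """Iteratively update a float motion-vector field V of shape (M, N, 2).
    Vectors whose amplitude/phase deviate too much from their 8-neighbor
    average (Equations (3)-(4)) are replaced by the component-wise median of
    the neighbors (Equation (5)) until Equation (6) or the iteration cap holds."""
    M, N, _ = V.shape
    R = np.ones((M, N), dtype=np.uint8)
    for _ in range(max_iter):
        A = np.hypot(V[..., 0], V[..., 1])                    # amplitude
        P = np.degrees(np.arctan2(V[..., 1], V[..., 0]))      # phase in degrees
        R = np.ones((M, N), dtype=np.uint8)
        V_new = V.copy()
        for a in range(1, M - 1):                             # border vectors kept as-is
            for b in range(1, N - 1):
                nA = np.delete(A[a-1:a+2, b-1:b+2].ravel(), 4)   # 8 neighbors
                nP = np.delete(P[a-1:a+2, b-1:b+2].ravel(), 4)
                AL = abs(A[a, b] - nA.mean())                    # Equation (3)
                PL = abs(P[a, b] - nP.mean())
                if AL >= Bs / 8 or PL >= theta_max:              # one reading of Equation (4)
                    R[a, b] = 0
                    nbrs = np.delete(V[a-1:a+2, b-1:b+2].reshape(9, 2), 4, axis=0)
                    V_new[a, b] = np.median(nbrs, axis=0)        # Equation (5)
        V = V_new
        if R.sum() >= beta * M * N:                              # Equation (6)
            break
    return V, R
```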
Our proposed APF introduces a novel approach to motion vector updating. Unlike the approach in [48], which relies on the Bidirectional Prediction Difference and subsequent Outlier Detection, our classification is performed directly using the amplitude and phase of the motion vectors. This allows for a more nuanced evaluation of vector reliability, as the phase component explicitly accounts for directional variations in motion. Moreover, our adaptive, iterative update offers a more flexible solution than the fixed two-stage (median-then-mean) smoothing process in [48] without introducing significant computational overhead.

3.3. Motion Vector Refinement

After the second step, the motion vectors may still not be precise enough, and their accuracy cannot reach the pixel level without a full search. A full search is often avoided due to its time complexity; however, its idea can still be adopted. Our approach is to compensate the motion vectors by performing a full search only within a small area. In this way, we obtain more accurate motion vectors, while the small area keeps the computational cost low, as shown in Figure 5.
Figure 5. Motion vector refinement. A final adjustment is performed on the vectors using an exhaustive search within a small, local window.
Suppose a vector $(\Delta x, \Delta y)$ is initialized by coarse motion estimation and then updated to $(\Delta x', \Delta y')$. After the final refinement step, the vector becomes $(\Delta x' + V_{CX}, \Delta y' + V_{CY})$, where $(V_{CX}, V_{CY})$ is the compensation vector along the X/Y-axis obtained from a full search in a small area.
$$\mathrm{SAD}_p = \sum_{\substack{(x, y) \in A_{ab} \\ (V_{CX}, V_{CY}) \in SSA_{ab}}} \left| F_{n-1}(x - \Delta x' - V_{CX},\, y - \Delta y' - V_{CY}) - F_n(x + \Delta x' + V_{CX},\, y + \Delta y' + V_{CY}) \right| \tag{7}$$
In Equation (7), $\mathrm{SAD}_p$ is short for the sum of absolute differences, and the subscript $p$ denotes pixel-level precision. $F_{n-1}(x - \Delta x' - V_{CX}, y - \Delta y' - V_{CY})$ represents an image block of $F_{n-1}$ starting at $(x - \Delta x' - V_{CX}, y - \Delta y' - V_{CY})$ in the previous frame, and $F_n(x + \Delta x' + V_{CX}, y + \Delta y' + V_{CY})$ is defined analogously.
$SSA_{ab}$ is a new search area that differs from $SA_{ab}$ in two respects. First, $SA_{ab}$ is the search area used in vector initialization, and its size should be large to increase the likelihood of capturing a wide range of motions, whereas $SSA_{ab}$ is the small search area used in vector compensation. Second, the search in $SA_{ab}$ is a bilateral search, while the search in $SSA_{ab}$ is a full search. Specifically, $SSA_{ab} = \{(V_{CX}, V_{CY}) \mid -SS_{ws} \le V_{CX} < SS_{ws},\ -SS_{ws} \le V_{CY} < SS_{ws}\}$, where $SS_{ws}$ denotes the small search window size.
The final optimal vector $V_{ab}^*$ can be acquired using Equation (8), where $V_{ab}^*$ is the best vector for the block in row $a$ and column $b$. The entire process of motion vector estimation, updating, and refinement is summarized in Algorithm 1.
$$V_{ab}^* = (\Delta x' + V_{CX},\ \Delta y' + V_{CY}) = \arg\min \mathrm{SAD}_p \tag{8}$$
Algorithm 1 Motion vector estimation, updating, and refinement.
  • Require: Low-light surveillance video $I \in \mathbb{R}^{m \times n \times t}$, block size $B_s = 8$, padding pixels $P_p = 16$, overlap pixels $O_p = 4$, search window size $S_{ws} = 6$, vector updating number $num = 0$, rationality matrix $\mathbf{R}$, reasonable degree $\beta = 0.96$, and small search window size $SS_{ws} = 2$.
  • Ensure: The final vector $\mathbf{V}$ and rationality matrix $\mathbf{R}$.
  1: Initialization: $M = m / B_s$, $N = n / B_s$, zero matrices $\mathbf{V} \in \mathbb{R}^{M \times N \times (t-1)}$ and $\mathbf{R} \in \mathbb{R}^{M \times N \times (t-1)}$.
  2: for each $k \in [2, t]$ do
  3:   for each $a \in [1, M]$ do
  4:     for each $b \in [1, N]$ do
  5:       Get the areas $A_{ab}$, $SA_{ab}$, and $SSA_{ab}$.
  6:       Solve Equation (2) to get the vector $(\Delta x, \Delta y)$.
  7:       Repeat: compute $AL_{ab}$ and $PL_{ab}$ using Equation (3).
  8:         Update $R_{ab}$ and $V_{ab}$ using Equations (4) and (5).
  9:         $num = num + 1$, and $\mathbf{R}(a, b, k-1) = R_{ab}$.
  10:      Until: $\sum_{a=1}^{M}\sum_{b=1}^{N} R_{ab} \ge M N \beta$ or $num = 15$.
  11:      Get the updated vector $(\Delta x', \Delta y')$.
  12:      Solve Equation (8) to get the best vector.
  13:      $\mathbf{V}(a, b, k-1) = (\Delta x' + V_{CX}, \Delta y' + V_{CY})$.
  14:      Update $R_{ab}$ using Equation (5) for each vector.
  15:      $\mathbf{R}(a, b, k-1) = R_{ab}$.
  16:    end for
  17:  end for
  18: end for
  19: return the final vector $\mathbf{V}$ and rationality matrix $\mathbf{R}$.
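Continuing the earlier sketches, the small-area full search of Section 3.3 (Equations (7) and (8)) could be implemented as follows for one block; again, the interface and the boundary handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def refine_vector(prev, nxt, top, left, mv, Bs=8, Op=4, SSws=2):
    """Pixel-level refinement of an updated motion vector mv = (dx, dy) by an
    exhaustive search over compensation offsets (Vcx, Vcy) in [-SSws, SSws),
    minimizing the pixel-level SAD of Equation (7)."""
    dx, dy = mv
    h = Bs + 2 * Op                               # overlapped block size
    best, best_off = np.inf, (0, 0)
    for vcx in range(-SSws, SSws):
        for vcy in range(-SSws, SSws):
            y0p, x0p = top - dy - vcy - Op, left - dx - vcx - Op
            y0n, x0n = top + dy + vcy - Op, left + dx + vcx - Op
            if min(y0p, x0p, y0n, x0n) < 0:
                continue
            if max(y0p, y0n) + h > prev.shape[0] or max(x0p, x0n) + h > prev.shape[1]:
                continue
            sad = np.abs(prev[y0p:y0p + h, x0p:x0p + h].astype(np.float32)
                         - nxt[y0n:y0n + h, x0n:x0n + h].astype(np.float32)).sum()
            if sad < best:
                best, best_off = sad, (vcx, vcy)
    # Equation (8): the final vector V*_ab adds the best compensation offset.
    return (dx + best_off[0], dy + best_off[1])
```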

3.4. Trilateral Filter

Motion vectors can be classified into three categories using the criteria depicted in Figure 6, namely type 1, type 2, and type 3 vectors. Type 1 vectors are zero vectors; thus, only stationary objects can be found in their corresponding blocks. Type 2 vectors are not zero vectors but are reasonable, meaning their corresponding blocks have found well-matched counterparts in the adjacent frames. Type 3 vectors are correspondingly unreasonable, signifying that their corresponding blocks fail to find matching pairs in adjacent frames. Our trilateral filter builds on the bilateral filter [49], an edge-preserving filter that can preserve image edges while denoising. However, the bilateral filter has two disadvantages. The first is its slow speed, which has already been addressed by methods such as [50]. The second is that images filtered by a bilateral filter [49] exhibit gradient reversal artifacts.
Figure 6. Motion vector classification. Each motion vector is classified into one of three types based on $R_{ab}$ and $V_{ab}$.
To address this artifact, we propose an innovative trilateral filter designed to suppress the gradient reversal artifacts often associated with bilateral filters. While conventional bilateral filters operate on the spatial and pixel domain, and other trilateral filters like those in [51,52] have incorporated third domains such as depth edges or motion vector similarity, our method uniquely introduces a gradient similarity weight. This design choice is specifically tailored to our task of video denoising, as it effectively preserves texture and fine details that are critical in the restored images. Inspired by this concept, our trilateral filter considers three aspects of information: the spatial domain, the pixel domain, and the gradient domain.
The spatial domain information is measured by the local Euclidean distance in Equation (9), where $w_s(a, b)$ is the similarity coefficient for the block in row $a$ and column $b$. $\rho_s$ is a constant related to the block variance $\delta_s^2$, and it equals $\frac{1}{2\delta_s^2}$; $\rho_p$ in Equation (10) and $\rho_g$ in Equation (11) have analogous meanings.
The pixel domain information is measured by the Euclidean distance between the pixel values. In Equation (10), $w_p(a, b)$ is adopted as the similarity metric in the pixel domain for the block in row $a$ and column $b$.
The gradient domain information is measured by the Euclidean distance between the gradient images, namely $g\mathrm{SAD}_p$ in Equation (12). In Equation (11), $w_g(a, b)$ is the gradient-domain similarity coefficient for the block in row $a$ and column $b$, while $G_{n-1}$ and $G_n$ are the gradients of blocks $F_{n-1}$ and $F_n$. In the experiment, the kernels $[1, -1]$ and $[1, -1]^T$ are used to compute the gradients along the x-axis and y-axis.
$$w_s(a, b) = \exp\!\left( -\rho_s \sum_{i=1}^{8} \left\| V_{ab}^* - V_i \right\|^2 \right) \tag{9}$$
$$w_p(a, b) = \exp\!\left( -\rho_p \left| \mathrm{SAD}_p \right|^2 \right) \tag{10}$$
$$w_g(a, b) = \exp\!\left( -\rho_g \left| g\mathrm{SAD}_p \right|^2 \right) \tag{11}$$
$$g\mathrm{SAD}_p = \sum_{\substack{(x, y) \in A_{ab} \\ (V_{CX}, V_{CY}) \in SSA_{ab}}} \left| G_{n-1}(x - \Delta x' - V_{CX},\, y - \Delta y' - V_{CY}) - G_n(x + \Delta x' + V_{CX},\, y + \Delta y' + V_{CY}) \right| \tag{12}$$
The final coefficient is $w(a, b) = w_s(a, b) \times w_p(a, b) \times w_g(a, b)$. This coefficient integrates the information of the spatial, pixel, and gradient domains of an image, and it is therefore more robust. In addition, $0 < w(a, b) < 1$ for every $(x, y) \in A_{ab}$ and $(V_{CX}, V_{CY}) \in SSA_{ab}$, and the normalized weights in Equation (13) sum to 1; thus, $w(a, b)$ is suitable as a weighting factor for the block in row $a$ and column $b$. Based on this, the trilateral filter $TF(a, b)$ for vector $V_{ab}^*$ is defined in Equation (13), in which $F_n(*)$ denotes the block of $F_n$ starting at $(x + \Delta x' + V_{CX}, y + \Delta y' + V_{CY})$.
$$TF(a, b) = \frac{\displaystyle\sum_{\substack{(x, y) \in A_{ab} \\ (V_{CX}, V_{CY}) \in SSA_{ab}}} F_n(*)\, w(a, b)}{\displaystyle\sum_{\substack{(x, y) \in A_{ab} \\ (V_{CX}, V_{CY}) \in SSA_{ab}}} w(a, b)} \tag{13}$$
The trilateral filter suppresses noise differently for each vector type shown in Figure 6. For type 1 vectors (zero vectors), noise is reduced by averaging 10 adjacent frames. Type 2 vectors are denoised directly using Equation (13). For type 3 vectors, nonlocal similarity and a large search window ($BS_{ws}$) are employed to find matching pairs, after which Equation (13) is applied for denoising.
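The sketch below shows one way the weights of Equations (9)–(11) could be combined into the weighted average of Equation (13) for a single block; the candidate patches are assumed to have already been gathered along the refined trajectory, and the rho constants are illustrative values rather than those used in the paper.

```python
import numpy as np

def trilateral_weight(mv, neighbor_mvs, sad_p, gsad_p,
                      rho_s=0.05, rho_p=1e-4, rho_g=1e-4):
    """Combined spatial/pixel/gradient weight w = w_s * w_p * w_g for one
    candidate patch (Equations (9)-(11)); neighbor_mvs holds the 8-neighbor
    vectors used in the spatial term."""
    w_s = np.exp(-rho_s * np.sum((neighbor_mvs - mv) ** 2))
    w_p = np.exp(-rho_p * sad_p ** 2)
    w_g = np.exp(-rho_g * gsad_p ** 2)
    return w_s * w_p * w_g

def trilateral_filter(candidates, weights):
    """Normalized weighted average of candidate patches (Equation (13))."""
    weights = np.asarray(weights, dtype=np.float64)
    stack = np.stack(candidates).astype(np.float64)
    return (weights[:, None, None] * stack).sum(0) / weights.sum()

# Toy usage: three noisy 8x8 candidate patches of the same underlying block.
rng = np.random.default_rng(0)
clean = np.full((8, 8), 100.0)
cands = [clean + rng.normal(0, 20, (8, 8)) for _ in range(3)]
mv = np.array([1.0, 2.0])
w = [trilateral_weight(mv, mv + rng.normal(0, 0.5, (8, 2)), sad_p=s, gsad_p=g)
     for s, g in [(50, 10), (80, 15), (120, 30)]]
print(trilateral_filter(cands, w).mean())   # close to 100, noise reduced
```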

4. Experiments

4.1. Datasets for Real-World Low-Light Surveillance Video Denoising

The evaluation of denoisers for low-light surveillance video heavily relies on representative datasets, yet existing benchmarks present significant limitations for this specific task. While widely used benchmarks such as Set8 [23] and DAVIS [53] exist, they primarily feature well-lit common scenarios and are thus unsuitable. Several datasets have been proposed specifically for low-light environments, including CRVD [54], and those in [4,5]. CRVD [54] provides 1080p videos from an IMX385 sensor across five ISO levels, while [5] targets even more extreme conditions (<0.1 lux) at 4K resolution. However, both datasets generate motion by manually controlling static objects frame by frame, resulting in discontinuous motion patterns confined to indoor environments, which fail to capture the fluidity of real-world dynamics. Although the dataset in [4] offers more realistic and complex motion models across its 210 video pairs, it shares a fundamental issue with CRVD [5,54]: they are all provided in RAW format. This creates a critical domain gap, as consumer-grade surveillance cameras (Hikvision, Dahua) typically output compressed streams like YUV or H.264 due to hardware and bandwidth constraints. Attempting to reverse-engineer RAW data from these formats via an inverse ISP pipeline is an ill-posed problem that introduces substantial estimation errors. Compounding these issues, the datasets from [4,5] are not publicly available. Finally, while the public DID dataset [55] offers multi-camera diversity and is accessible, it is designed for video enhancement, and its dynamics are generated by camera motion across static scenes, making it inappropriate for evaluating denoising on scenes with independent object motion.
To address this, we collected real-world extreme low-light sequences at a resolution of 1920 × 1072 using Hikvision DS-IPC surveillance cameras. The sequences feature wide-angle residential views with static and slow traffic and top-down road perspectives with fast vehicles, encompassing complicated motions and challenging backgrounds. Our dataset is provided in the RGB color space, matching the typical output of surveillance systems and thus eliminating the domain gap associated with RAW-based datasets, and it comprises 14 video clips (around 800 noisy samples) covering static and dynamic regions for quantitative evaluation, as shown in Table 1 and Table 2. Equations (14)–(16) give the definitions of PSNR and SSIM. In Equation (14), $MAX = 255$, and MSE is defined in Equation (15). In Equation (15), $m \times n$ is the resolution of the ground-truth image $I_{gt}$, and $I_{denoise}$ is the denoised result of the input noisy image. In Equation (16), $x$ and $y$ are two signals, $\mu_x$ and $\mu_y$ are their average values, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\sigma_{xy}$ is their covariance. $c_1$ and $c_2$ are small values, both equal to 0.001, which serve to stabilize the division when the denominators are close to zero.
$$\mathrm{PSNR} = 10 \times \log_{10}\!\left( \frac{MAX^2}{\mathrm{MSE}} \right) \tag{14}$$
$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ I_{gt}(i, j) - I_{denoise}(i, j) \right]^2 \tag{15}$$
$$\mathrm{SSIM} = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{16}$$
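For reference, Equations (14)–(16) can be evaluated with a few lines of NumPy; note that the SSIM below is the global single-window form of Equation (16), whereas practical evaluations usually rely on a windowed implementation such as the one in scikit-image.

```python
import numpy as np

def psnr(gt, den, max_val=255.0):
    """PSNR between a ground-truth and a denoised frame (Equations (14)-(15))."""
    mse = np.mean((gt.astype(np.float64) - den.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(x, y, c1=0.001, c2=0.001):
    """Single-window SSIM following Equation (16); the paper sets c1 = c2 = 0.001."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```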
Table 1. Quantitative tests: average PSNR(dB)/SSIM values for the No. 1–7 video sequences, where bold and underlined texts indicate the best and second-best performance, respectively.
Table 2. Quantitative tests: average PSNR(dB)/SSIM values for the No. 8–14 video sequences, where bold and underlined texts indicate the best and second-best performance, respectively.
We provide six visual results for qualitative comparison in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12. Static areas use pseudo-GT via multi-frame averaging; dynamic areas lack reliable GT due to motion blur invalidating temporal averaging. This provides a stringent testbed for evaluating denoising robustness on actual surveillance artifacts.
Figure 7. Qualitative comparison on a dynamic scene with moving barriers.
Figure 8. Qualitative comparison on a static scene with residential buildings.
Figure 9. Qualitative comparison on a dynamic scene with a moving car.
Figure 10. Qualitative comparison on a static scene with a static vehicle.
Figure 11. Qualitative comparison on a static scene with a roadside staircase.
Figure 12. Qualitative comparison on a static scene with a tree and a street lamp.

4.2. Implementation Details

To evaluate the denoising performance of our method, nine popular denoising methods are used to make comparisons with our method, namely VBM4D [15], FastdvdNet [24], UDVD [25], FloRNN [32], RCD [27], ShiftNet [28], TAP [40], Turtle [42], and VRT [27].
The experimental setup employed Matlab R2023a and PyCharm 2023.1 (Python 3.8, CUDA 11.3) as the primary software environments. The hardware configuration included a single NVIDIA V100-SMX2-32GB GPU, a 12-core Intel Xeon Platinum 8255C CPU operating at 2.50 GHz, and 43 GB of system memory. The source code of the compared methods is available on GitHub, and we follow their default parameters. We use PSNR and SSIM because they are widely used evaluation metrics in video denoising: PSNR mainly measures the pixel-wise error between two images, while SSIM mainly measures their structural similarity.

4.3. Quantitative Tests and Visual Evaluations

Table 1 and Table 2 show the average PSNR and SSIM values over the 14 video sequences. In both the PSNR and SSIM tests, our algorithm performs best in all video sequences, demonstrating that our method effectively retains the structural features of each frame and suppresses the pixel-wise error to some extent.
The final denoising performance also needs to be judged by the human eye; thus, the visual quality of each denoised image is important. A low-light environment is filled with noise of different categories; therefore, during noise removal, high-frequency information such as image edges and texture can easily be mistaken for noise and then suppressed by the denoising algorithm. For instance, in Figure 8, our method sharply delineates the edges and contours of the window, whereas the results of competing methods are still plagued by complex noise artifacts. In Figure 10, our approach adeptly restores the structure of the car, rendering its wheels and the overhead fence distinctly discernible. In contrast, TAP [39] exhibits severe color distortion, and other algorithms also yield suboptimal outcomes. VBM4D [15], for example, achieves noise reduction but at the expense of fine details. Finally, in Figure 12, several methods show limitations. UDVD [25] introduces noticeable green artifacts on the tree, and TAP [39] once again suffers from severe color deviation. While Turtle [42] maintains structural details, it visually amplifies the noise.
Although ShiftNet [28] and RCD [27] achieve commendable results, our method demonstrates a superior trade-off, producing a perceptibly cleaner result that better preserves the tree's intricate structure. In general, our method has a strong ability to protect image structures in low-light conditions, and it remains robust in environments with extremely low luminance.

4.4. Speed

Speed is an important metric for evaluating an algorithm. As shown in Table 3, our method is not the fastest among the compared algorithms, taking approximately 1.17 s to process a single frame. These tests were all conducted on a system equipped with an RTX 4090D GPU with 24 GB VRAM, paired with an AMD EPYC 9754 128-core processor (18 vCPUs utilized, max frequency 3.1 GHz) and 60 GB of system memory, using Python 3.8 and CUDA 11.3. However, our proposed method runs on the CPU and does not leverage specialized hardware such as GPUs, in contrast to many contemporary deep learning techniques that rely on GPUs for computational acceleration. It is noteworthy that the majority of the time consumption in our algorithm arises from search operations, which are relatively independent and thus amenable to GPU acceleration. We intend to investigate GPU-accelerated versions in future work.
Table 3. Average speed (FPS) of different denoising algorithms.

4.5. Intensity Curve

The intensity curve is a visualization method. Suppose $F \in \mathbb{R}^{m \times n \times 3}$ is a frame in the noisy video, and $Y \in \mathbb{R}^{m \times n}$ is its luminance channel. A single line of pixels $y \in \mathbb{R}^{1 \times n}$ is sampled from $Y$. Similarly, the same line is extracted from the denoised results and the ground truth. In Figure 13, the red curve $y$ is a noisy signal from row 120 of frame 26 in video sequence 2, while the black signal $y_{gt} \in \mathbb{R}^{1 \times n}$ is the ground truth.
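The intensity-curve visualization can be reproduced with a short script such as the one below; the BT.601 luminance weights and the matplotlib plotting details are our assumptions, as the text does not specify them.

```python
import numpy as np
import matplotlib.pyplot as plt

def luminance(frame_rgb):
    """BT.601 luma of an RGB frame with values in [0, 255]."""
    return frame_rgb @ np.array([0.299, 0.587, 0.114])

def plot_intensity_curves(noisy, denoised, gt, row=120):
    """Plot one pixel row of the luminance channel for the noisy frame,
    a denoised result, and the ground truth, as in Figure 13."""
    plt.plot(luminance(noisy)[row], 'r', label='noisy')
    plt.plot(luminance(denoised)[row], 'g', label='denoised')
    plt.plot(luminance(gt)[row], 'k', label='ground truth')
    plt.xlabel('pixel column'); plt.ylabel('intensity'); plt.legend()
    plt.show()
```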
Figure 13. One example of the intensity curve. Our method demonstrates the most stable and effective noise removal, as evidenced by its signal (green line) being in the closest alignment with the ground truth (black line).
The green curve represents the denoising result of each method. It can be seen that deep learning-based methods, such as FastdvdNet [24], ShiftNet [28], and Turtle [42], perform poorly on low-light videos. This is because, in such a harsh environment, it is difficult to accurately model the complex noise distribution solely through the adjustment of network weights, leading to insufficient smoothing. Moreover, the intensity curve of the traditional VBM4D [15] is overly smoothed, yet significant residual noise remains around pixel column 80. This indicates a loss of detail, which is corroborated by the texture of the staircase in Figure 11. Finally, the overall difference between our curve and $y_{gt}$ is the smallest, and its profile follows the ground truth more closely and smoothly, which again demonstrates that our algorithm achieves the best denoising performance.

4.6. Video Denoising Performance on DAVIS and CRVD Benchmarks

To further validate the generalizability and robustness of our algorithm, we also evaluated its performance on standard benchmark datasets. Additional experiments were carried out on the DAVIS [53] and CRVD [54] datasets. For CRVD [54], we converted the RAW data to RGB format to ensure a fair comparison. For DAVIS [53], we followed common practice in video denoising by adding Gaussian noise with a standard deviation of 50. We compared our method against several representative top-performing approaches from our main experiments—VBM4D [15], ShiftNet [28], and RCD [27]—using both qualitative and quantitative evaluations. Representative results are presented below under the same evaluation protocol as in our main study. As shown in Figure 14, which depicts flowers and their pots, our method performs competitively, effectively suppressing prominent color noise artifacts while achieving performance comparable to other state-of-the-art methods. In the results, bold and underlined values indicate the best and second-best performance, respectively.
Figure 14. Comparison of the orchid scene from the DAVIS [53] dataset. Visual comparison of methods on a zoomed-in patch (left) with metrics; red box in the image (right) shows the patch location.
As shown in Figure 15, our algorithm achieves the highest PSNR and SSIM values, particularly in regions where the ground and foliage intersect. However, we observe that VBM4D [15] preserves fine textures slightly better, resulting in less blurring. This can be explained by the continuous camera motion in this scene, which results in an absence of static regions. Since our method employs distinct strategies for static and dynamic areas, the lack of static regions necessitates a motion-oriented approach throughout the entire frame, leading to a minor performance trade-off. The superior quantitative results of our approach are likely attributable to its enhanced capability to suppress dominant color noise, which significantly influences PSNR and SSIM metrics.
Figure 15. Comparison of the skate-jump scene from the DAVIS [53] dataset. Visual comparison of methods on a zoomed-in patch (left) with metrics; red box in the image (right) shows the patch location.
Figure 16 presents a detailed view of the ground texture from Scene 1 of the CRVD dataset [54].
Figure 16. Comparison of Scene 1 from the CRVD [54] dataset. Visual comparison of methods on a zoomed-in patch (left) with metrics; red box in the image (right) shows the patch location.
To ensure consistency across experiments, the RAW data from CRVD [54] were converted to the RGB color space using a standard linear transformation. This conversion follows well-established industrial imaging pipelines and may introduce minor color shifts, yet it does not significantly influence the denoising performance comparison. Since CRVD [54] includes both ground-truth and noisy image pairs, no synthetic noise was introduced. Although the dataset involves a static camera setup consistent with our application scenario, it does not represent a low-light or complex-noise environment. In this setting, our method achieves the second-best performance, effectively preserving horizontal ground textures. While the visual smoothness of our result is slightly inferior to that of RCD [27], our approach demonstrates significantly better noise suppression capability compared to both ShiftNet [28] and VBM4D [15].
Figure 17 displays the wall texture from Scene 9, in which our method again achieves the second-best performance. It successfully preserves the fine white gaps between tiles while providing effective noise removal. In summary, while our algorithm is specifically optimized for denoising low-light surveillance videos with complex real-world noise, its competitive performance on general benchmark datasets demonstrates that it is not narrowly specialized. These results indicate that our approach maintains strong generalization capability across different scenes and noise characteristics.
Figure 17. Comparison of Scene 9 from the CRVD [54] dataset. Visual comparison of methods on a zoomed-in patch (left) with metrics; red box in the image (right) shows the patch location.

5. Model Analysis

5.1. Why Is the Low-Light Environment Extremely Harsh?

In traditional denoising tasks, Gaussian noise is artificially added to an image. Usually, the variance is used to measure the noise complexity, with larger variances indicating more complex noise. The variance of Gaussian noise is typically set to 15, 25, 30, 50, 75, and so on, and Gaussian noise with a variance of 75 is already considered to have a complex distribution. However, noise in the real world is far more complex than Gaussian noise, and a low-light environment intensifies this complexity. Our 14 video sequences are split into their RGB channels, the mean noise variance in each channel is calculated, and Figure 18 and Figure 19 show the statistical results. In Figure 18, the mean noise variance differs across color channels, and all the mean variances are much larger than 75. In Figure 19, the noise varies from frame to frame, and the variation is almost random. In other words, the noise in such an environment changes constantly in both space and time, which undoubtedly increases the denoising difficulty.
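The per-channel statistics in Figures 18 and 19 can be gathered with a routine like the following; it assumes a static clip so that the temporal mean can serve as a pseudo-clean reference, which is how we interpret the described measurement.

```python
import numpy as np

def channel_noise_variance(frames):
    """frames: array of shape (T, H, W, 3) from a static clip.
    The temporal mean is used as a pseudo-clean reference, and the noise
    variance of each frame is computed per RGB channel."""
    frames = frames.astype(np.float64)
    reference = frames.mean(axis=0)            # pseudo ground truth
    noise = frames - reference                 # per-frame noise estimate
    per_frame = noise.var(axis=(1, 2))         # shape (T, 3): variance per frame and channel
    return per_frame, per_frame.mean(axis=0)   # plus the mean variance per channel
```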
Figure 18. The average variance of RGB channels across video sequences and noisy grayscale images confirms the extremely harsh noise in our low-light surveillance environment.
Figure 19. The average variances of RGB channels across frames in Video 2, demonstrating significant noise level fluctuations over time due to complex environmental conditions.

5.2. Why Does Our Model Work?

Lemma 1.
Matrix $D$ consists of $K$ random values $d_1, \ldots, d_K$, and $D$ can be decomposed into the sum of $L$ symmetric matrices (without considering the zero values in $D$), namely $D = D_1 + D_2 + \cdots + D_L$ with $L \le K$.
Proof. 
  • Case 1: If $L = K$, we set $D_i = [\underbrace{0, \ldots, 0}_{i-1}, d_i, \underbrace{0, \ldots, 0}_{K-i}]$ for $1 \le i \le K$; thus, $D = D_1 + D_2 + \cdots + D_K$, and the proof ends.
  • Case 2: If $L < K$, mathematical induction is used, which consists of three steps.
  • Step 1: When $n = 1$, we set $K = 1$ and $D = [d_1]$, which meets the lemma.
  • Step 2: Assume the lemma holds when $n = K$, that is, $D = [d_1, d_2, \ldots, d_K] = D_1 + D_2 + \cdots + D_L$ with $L < K$.
  • Step 3: Based on Step 2, when $n = K + 1$, $D = [d_1, d_2, \ldots, d_K, d_{K+1}]$. We set $D_{L+1} = [\underbrace{0, \ldots, 0}_{K}, d_{K+1}]$, so that $D = D_1 + \cdots + D_L + D_{L+1}$ and $L + 1 < K + 1$; thus, the lemma holds for $n = K + 1$ if it holds for $n = K$. By the property of mathematical induction, the lemma holds for all natural numbers.
For instance, if $D = [d_1, d_2, d_3, d_4]$, three symmetric matrices, namely $D_1 = [d_1, d_2, d_2, d_1]$, $D_2 = [0, 0, d_3 - d_2, 0]$, and $D_3 = [0, 0, 0, d_4 - d_1]$, can be found to support the lemma, in which $D = D_1 + D_2 + D_3$, $L = 3$, and $K = 4$. □
Suppose the noise is $N \in \mathbb{R}^{m \times n \times 3}$, and a part of $N$ is selected, namely the signal $N' \in \mathbb{R}^{1 \times 11}$ with $N' = [0, 5, 0, 1, 5, 0, 0.3, 3.9, 0.4, 0, 0]$, as shown in Figure 20. Based on the lemma above, $N'$ can theoretically be decomposed into the sum of several symmetric signals, and symmetric signals can be approximately fitted with different Gaussian functions. In this case, $N' = N_1 + N_2 + N_3 + N_4$, in which $N_1 = [0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 0]$, $N_2 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]$, $N_3 = [0, 0, 0, 0, 0, 0, 0.3, 3.9, 0.3, 0, 0]$, and $N_4 = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0, 0]$. For example, $N_1$ can be fitted by the Gaussian function $p(x) = 5\exp\!\left(-\frac{(x-1)^2}{0.18}\right)$.
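The decomposition in this example can be checked numerically; the short script below only verifies that the listed components sum back to the selected signal and is not part of the denoising pipeline.

```python
import numpy as np

# Noise segment and its decomposition from the Lemma 1 example.
N_sel = np.array([0, 5, 0, 1, 5, 0, 0.3, 3.9, 0.4, 0, 0])
N1 = np.array([0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 0])
N2 = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
N3 = np.array([0, 0, 0, 0, 0, 0, 0.3, 3.9, 0.3, 0, 0])
N4 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0, 0])
print(np.allclose(N_sel, N1 + N2 + N3 + N4))   # True
```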
Figure 20. The blue line represents an example of a noisy signal, $N$, while the orange line shows the selected noise component, $N'$, in a magnified view.
Suppose the noise in frame $i$ is $X_i$, and its variance is $D(X_i)$. In our experiment, we average 10 adjacent frames; thus, the denoised data are $\bar{X} = \frac{1}{10}\sum_{i=1}^{10} X_i$. By the scaling property of variance, namely $D(\alpha X) = \alpha^2 D(X)$ for a constant $\alpha$, and assuming the noise is independent across frames, the variance of $\bar{X}$ is $\frac{1}{100}\sum_{i=1}^{10} D(X_i)$. In other words, each frame's variance enters the average with a weight of $\frac{1}{100}$, so the overall noise variance drops to one tenth of its single-frame value. Assuming the mean noise variances in the three channels are 300, 400, and 900, after averaging they become approximately 30, 40, and 90, which greatly weakens the impact of noise.
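Assuming the noise is independent across frames, the effect of temporal averaging on the variance can be illustrated numerically; the Gaussian model and the specific variance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 300.0                                  # per-frame noise variance (illustrative)
frames = rng.normal(0.0, np.sqrt(sigma2), size=(10, 100_000))

averaged = frames.mean(axis=0)                  # average of 10 adjacent frames
print(frames.var(axis=1).mean())                # ~300: single-frame noise variance
print(averaged.var())                           # ~30: reduced by a factor of 10
```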

5.3. Why Is Our Model Fast?

Our algorithm is fast because its complexity is $O(kN)$, where $k$ is a constant and $N$ is the number of blocks.
Our algorithm consists of four parts. In the first part, the search exhausts all candidate steps within a search radius of $S_{ws}$. However, the search method is not a full search; thus, the search scope is $(S_{ws}+1)^2$, and the total number of steps in the first part is $(S_{ws}+1)^2 N$.
The second part is vector updating. Evaluating Equation (3) for all vectors consumes $16N$ steps, and Equations (4) and (5) consume $3N$ steps. Since the number of updates is $num$, the maximum number of steps in the second part is $num \times 19N$.
The third part is a full search whose search area is $(2SS_{ws}+1)^2$, where $SS_{ws}$ is the small search window size, so the total number of steps in the third part is $(2SS_{ws}+1)^2 N$.
The fourth part is the denoising. In this part, the vectors are classified into three categories, with $N_1$, $N_2$, and $N_3$ members, respectively, where $N_1 + N_2 + N_3 = N$. For type 1 vectors, the system consumes $10N_1$ steps. For type 2 vectors, it consumes $28N_2$ steps. For type 3 vectors, it consumes $10N_3 + (2BS_{ws}+1)^2 N_3$ steps, where $BS_{ws}$ is short for big search window size. The total number of steps in the fourth part is $10N_1 + 28N_2 + 10N_3 + (2BS_{ws}+1)^2 N_3$; for simplicity, the upper bound $(10 + (2BS_{ws}+1)^2) N$ is used as the total step count.
In summary, the total number of steps is $(S_{ws}+1)^2 N + num \times 19N + (2SS_{ws}+1)^2 N + (10 + (2BS_{ws}+1)^2) N = kN$; thus, the complexity of our system is $O(kN)$.

6. Ablation Study

6.1. Ablation on the Loop Count $num$ as the Stopping Criterion

In our method, $num$ denotes the number of loop iterations. A larger $num$ results in more elements equal to 1 in the rationality matrix but also increases the running time; thus, $num$ needs to be set appropriately to balance accuracy and efficiency. The sum of the rationality matrix, namely $\sum_{a,b} R_{ab}$, is taken as its value. It can be observed that the increment in this value gradually decreases as $num$ increases. In Figure 21, different frames are sampled, and their increment curves nearly coincide; because the frames are sampled randomly, this behavior is statistically representative. It is clear that once $num$ exceeds 15, further iterations bring little additional increase in the rationality matrix; thus, $num$ is set to 15.
Figure 21. Ablation analysis of $num$ to verify the appropriate stopping criterion.

6.2. Ablation on $\beta$, the Proportion of Reasonable Vectors for Stopping

In this experiment, the block size $B_s$ is set to 16. Therefore, if all motion vectors are reasonable, the number of elements equal to 1 ($R_{ab} = 1$) in the rationality matrix is $\frac{1920 \times 1072 \times 3}{16 \times 16 \times 3} = 8040$. However, it is unrealistic for all motion vectors to be reasonable due to motion complexity, and some motion vectors only converge to a stable value after many iterations ($num > 15$). $\beta$ is defined as $\frac{\sum_{a,b} R_{ab}}{8040}$, and Figure 22 shows the relationship between $\sum_{a,b} R_{ab}$ and $\beta$. Considering the convergence of $\sum_{a,b} R_{ab}$ and the time complexity of the system, the parameter $\beta$ is set to 0.96.
Figure 22. Ablation analysis of the relation between the value of the R matrix and the parameter β as the stopping criterion. The figure shows that after β exceeds 0.96, the increase in the R value is very limited.

6.3. Ablation on Search Window Size for Search Scope

The search window size controls the block-matching range. Theoretically, a larger window increases the likelihood of matching similar blocks and thereby improves performance. In our experiments, we varied the size from 2 to 10 in steps of 2 and evaluated the results using PSNR. Figure 23 shows that the PSNR gains plateau beyond a size of 6, while a larger range leads to increased computational and time costs. Thus, we selected 6 as the optimal balance point for the search window size.
Figure 23. Analysis of search window size, as the PSNR improvement is very weak when the search window size exceeds 6.

6.4. Ablation on $\theta_{max}$ and $\theta_{min}$ Used as Phase Clamping Thresholds in the Amplitude-Phase Filter

$\theta_{max}$ and $\theta_{min}$ are the parameters defined in Equation (4). As visualized in Figure 24, the sum of the rationality matrix, denoted as $\sum_{a,b} R_{ab}$, increases with higher values of $\theta_{max}$ and $\theta_{min}$, and it eventually converges within a specific range ($44° \le \theta_{max} \le 48°$, $29° \le \theta_{min} \le 35°$). In this paper, the values of $\theta_{max}$ and $\theta_{min}$ are set to 45° and 30°, respectively. This configuration not only ensures the rationality of the matrix but also maintains stable system runtime performance.
Figure 24. Ablation analysis of $\theta_{max}$ and $\theta_{min}$ for the appropriate division of reasonable vectors.

6.5. Ablation on Comparing Motion Estimation Methods Used for Denoising

Our algorithm mainly consists of motion estimation and denoising. Consider two adjacent frames in the video, denoted as $f_n$ and $f_{n+1}$. The estimated motion vectors are used to artificially insert a new frame $f_{n+0.5}$ between $f_n$ and $f_{n+1}$. This frame $f_{n+0.5}$ is generated algorithmically to improve video smoothness, and its quality depends on the accuracy of the motion estimation, so it serves as a means to evaluate motion estimation performance. For example, if the first frame $f_1$ and the third frame $f_3$ are used to interpolate the second frame $\hat{f}_2$, comparing $\hat{f}_2$ with the ground truth $f_2$ measures the validity of the estimated motion vectors.
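A minimal sketch of this interpolation-based check is given below; it synthesizes the middle frame by averaging the two neighboring frames along each block's symmetric motion vector, and the block-wise warping scheme (and the divisibility of the frame size by the block size) is a simplifying assumption of ours rather than the exact procedure of the cited FRUC methods.

```python
import numpy as np

def interpolate_middle_frame(f_prev, f_next, mv_field, Bs=8):
    """Build f_{n+0.5} from f_n (f_prev) and f_{n+1} (f_next) using a block-wise
    motion-vector field mv_field of shape (M, N, 2) holding (dx, dy) per block.
    Each interpolated block is the average of the two blocks reached by moving
    symmetrically backward/forward along its motion vector. Frame dimensions
    are assumed to be multiples of Bs."""
    H, W = f_prev.shape
    out = np.zeros_like(f_prev, dtype=np.float64)
    M, N = mv_field.shape[:2]
    for a in range(M):
        for b in range(N):
            dx, dy = mv_field[a, b]
            y, x = a * Bs, b * Bs
            yp = np.clip(y - int(dy), 0, H - Bs); xp = np.clip(x - int(dx), 0, W - Bs)
            yn = np.clip(y + int(dy), 0, H - Bs); xn = np.clip(x + int(dx), 0, W - Bs)
            out[y:y + Bs, x:x + Bs] = 0.5 * (f_prev[yp:yp + Bs, xp:xp + Bs]
                                             + f_next[yn:yn + Bs, xn:xn + Bs])
    return out
```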
Several motion estimation algorithms are compared, such as BME (Bidirectional Motion Estimation) [47], FBJME (Forward–Backward Joint Motion Estimation) [56], DME (Dual Motion Estimation) [57], DSME (Direction-Select Motion Estimation) [58], and LQME (Linear Quadratic Motion Estimation) [59]. PSNR and SSIM are adopted as evaluation metrics to quantify the accuracy of motion vectors, and the dataset from the Experiments section is used. Table 4 shows that our method achieves first or second place in most cases, indirectly indicating that our motion estimation algorithm is effective and reasonable.
Table 4. Quantitative comparison of different motion estimation algorithms on widely used video datasets. Results are reported as PSNR (dB)/SSIM. Bold and underlined texts indicate the best and second-best performance, respectively.

6.6. Ablation on How Component Changes Affect Speed

Our algorithm mainly includes four steps, namely coarse motion estimation, vector updating, vector refinement, and denoising. If step 1 is omitted, steps 2 and 3 cannot be performed. Similarly, step 4 is related to denoising, and omitting it would render the entire method ineffective. Therefore, the ablation study can only be conducted on steps 2 and 3 with four cases to consider.
Table 5 shows the results of the ablation study. In the experiment, only stationary sequences are used. Since no motion exists, omitting step 2 or step 3 has little impact on PSNR or SSIM values. For video sequences containing moving objects, only speed ablation experiments can be performed due to the lack of ground truth.
Table 5. Impact of different step combinations.

7. Limitations

Although our method achieves competitive results, it also has certain limitations. The first is that it is less effective at removing temporally correlated noise, such as certain types of Gaussian noise where the disturbance in each frame is nearly identical. Suppose the Gaussian noise in frame $i$ is $X_i$, with $X_i \approx X_j$ for $i \ne j$. When 10 adjacent frames are averaged, the noise variance is $D\!\left(\frac{1}{10}\sum_{i=1}^{10} X_i\right) \approx D\!\left(\frac{1}{10} \times 10 X_i\right) = D(X_i)$; therefore, the averaging operation cannot suppress such noise. However, when the noise varies temporally, namely $X_i \ne X_j$ for $i \ne j$, as in a low-light environment, the averaging operation can suppress the noise. Second, it is difficult to capture some extremely complex or isolated motions, as the search radius in our method is finite. However, such motions are rare in the real world, justifying our decision to limit the search radius to a reasonable range.

8. Conclusions

Denoising low-light surveillance video presents a formidable challenge, stemming from the severe and complex noise patterns inherent in such conditions. The core difficulty lies in the critical trade-off between effectively suppressing intense noise and simultaneously preserving fine-grained texture details. Achieving this intricate balance is paramount for practical surveillance applications, as failure to do so can result in the loss of crucial visual information. In this paper, we proposed a tracking-based denoising algorithm designed for surveillance videos captured in extremely low-light environments. Our algorithm integrates coarse motion estimation via bilateral search for initial vector accuracy, motion vector updating using an amplitude-phase filter and rationality matrix to ensure local continuity, motion vector refinement with a small-area full search to achieve pixel-level precision without excessive computational cost, and a trilateral filter that combines spatial, pixel, and gradient domain information to effectively suppress noise by classifying motion vectors into three types and applying adaptive denoising strategies. Our method can effectively suppress noise while retaining detailed information, as demonstrated by extensive quantitative experiments and visual comparisons. In the future, we intend to explore other weighting dimensions for the trilateral filter, such as temporal consistency and texture complexity. Furthermore, we aim to replace the rigid three-category block classification with a more flexible, adaptive mechanism that allows for a smoother transition between categories.

Author Contributions

Conceptualization, H.J.; methodology, H.J.; validation, P.W.; formal analysis, P.W. and C.L.; writing—original draft preparation, H.J. and P.W.; writing—review and editing, P.W. and Z.Z.; visualization, P.W., H.G. and F.Y.; supervision, H.J., C.L. and W.C.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant no. 52304182), the National Key Research and Development Program of China (grant no. 2023YFC2907600, 2021YFC2902701, 2021YFC2902702), the Open Fund of the Key Laboratory of System Control and Information Processing of the Ministry of Education of China (grant no. SCIP20240105), the Graduate Innovation Program of China University of Mining and Technology (grant no. 2024WLKXJ090), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant no. KYCX24_2778).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Malyugina, A.; Anantrasirichai, N.; Bull, D. Wavelet-based topological loss for low-light image denoising. Sensors 2025, 25, 2047. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, X.; Zhao, Q. Guided filter-inspired network for low-light RAW image enhancement. Sensors 2025, 25, 2637. [Google Scholar] [CrossRef] [PubMed]
  3. Lee, S.-Y.; Rhee, C.E. Motion estimation-assisted denoising for an efficient combination with an HEVC encoder. Sensors 2019, 19, 895. [Google Scholar] [CrossRef] [PubMed]
  4. Fu, Y.; Wang, Z.; Zhang, T.; Zhang, J. Low-light raw video denoising with a high-quality realistic motion dataset. IEEE Trans. Multimedia 2023, 25, 8119–8131. [Google Scholar] [CrossRef]
  5. Im, Y.; Pak, J.; Na, S.; Park, J.; Ryu, J.; Moon, S.; Koo, B.; Kang, S.-J. Supervised denoising for extreme low-light raw videos. IEEE Trans. Circuits Syst. Video Technol. 2025; in press. [Google Scholar] [CrossRef]
  6. Yamamoto, H.; Anami, S.; Matsuoka, R. Optimizing dynamic mode decomposition for video denoising via plug-and-play alternating direction method of multipliers. Signals 2024, 5, 202–215. [Google Scholar] [CrossRef]
  7. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image Denoising with Block-Matching and 3D Filtering. In Proceedings of the SPIE Electronic Imaging 2006: Image Processing, San Jose, CA, USA, 15–19 January 2006. [Google Scholar] [CrossRef]
  8. He, K.; Sun, J.; Tang, X. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1397–1409. [Google Scholar] [CrossRef]
  9. Chan, T.W.; Au, O.C.; Chong, T.S.; Chau, W.S. A Novel Content-Adaptive Video Denoising Filter. In Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, USA, 18–23 March 2005; pp. 649–652. [Google Scholar] [CrossRef]
  10. Selesnick, I.W.; Li, K.Y. Video denoising using 2D and 3D dual-tree complex wavelet transforms. In Proceedings of the Wavelets: Applications in Signal and Image Processing X, San Diego, CA, USA, 13 November 2003; pp. 607–618. [Google Scholar] [CrossRef]
  11. Jovanov, L.; Pizurica, A.; Schulte, S.; Schelkens, P.; Munteanu, A.; Kerre, E.; Philips, W. Combined Wavelet-Domain and Motion-Compensated Video Denoising Based on Video Codec Motion Estimation Methods. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 417–421. [Google Scholar] [CrossRef]
  12. Dugad, R.; Ahuja, N. Video Denoising by Combining Kalman and Wiener Estimates. In Proceedings of the 1999 International Conference on Image Processing, Kobe, Japan, 24–28 October 1999; pp. 152–156. [Google Scholar] [CrossRef]
  13. Buades, A.; Lisani, J.-L.; Miladinovic, M. Patch-Based Video Denoising with Optical Flow Estimation. IEEE Trans. Image Process. 2016, 25, 2573–2586. [Google Scholar] [CrossRef]
  14. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image Restoration by Sparse 3D Transform-Domain Collaborative Filtering. In Proceedings of the SPIE Electronic Imaging 2008: Image Processing: Algorithms and Systems VI, San Jose, CA, USA, 27–31 January 2008. [Google Scholar] [CrossRef]
  15. Maggioni, M.; Boracchi, G.; Foi, A.; Egiazarian, K. Video Denoising, Deblocking, and Enhancement Through Separable 4-D Nonlocal Spatiotemporal Transforms. IEEE Trans. Image Process. 2012, 21, 3952–3966. [Google Scholar] [CrossRef]
  16. Arias, P.; Morel, J.-M. Video Denoising via Empirical Bayesian Estimation of Space-Time Patches. J. Math. Imaging Vis. 2017, 60, 70–93. [Google Scholar] [CrossRef]
  17. Vaksman, G.; Elad, M.; Milanfar, P. Patch Craft: Video Denoising by Deep Modeling and Patch Matching. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2137–2146. [Google Scholar] [CrossRef]
  18. Davy, A.; Ehret, T.; Morel, J.-M.; Arias, P.; Facciolo, G. A Non-Local CNN for Video Denoising. In Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 2409–2413. [Google Scholar] [CrossRef]
  19. Davy, A.; Ehret, T.; Morel, J.-M.; Arias, P.; Facciolo, G. Video Denoising by Combining Patch Search and CNNs. J. Math. Imaging Vis. 2020, 63, 73–88. [Google Scholar] [CrossRef]
  20. Qu, Y.; Zhou, J.; Qiu, S.; Xu, W.; Li, Q. Recursive Video Denoising Algorithm for Low Light Surveillance Applications. In Proceedings of the 2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, Shanghai, China, 22–24 October 2021; pp. 1–7. [Google Scholar] [CrossRef]
  21. Kim, M.; Park, D.; Han, D.; Ko, H. A Novel Approach for Denoising and Enhancement of Extremely Low-Light Video. IEEE Trans. Consum. Electron. 2015, 61, 72–80. [Google Scholar] [CrossRef]
  22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  23. Tassano, M.; Delon, J.; Veit, T. DVDNET: A Fast Network for Deep Video Denoising. In Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 1805–1809. [Google Scholar] [CrossRef]
  24. Tassano, M.; Delon, J.; Veit, T. FastDVDnet: Towards real-time deep video denoising without flow estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1354–1363. [Google Scholar] [CrossRef]
  25. Sheth, D.Y.; Mohan, S.; Vincent, J.L.; Manzorro, R.; Crozier, P.A.; Khapra, M.M.; Simoncelli, E.P.; Fernandez-Granda, C. Unsupervised Deep Video Denoising. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1739–1748. [Google Scholar] [CrossRef]
  26. Qi, C.; Chen, J.; Yang, X.; Chen, Q. Real-time Streaming Video Denoising with Bidirectional Buffers. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 2758–2766. [Google Scholar] [CrossRef]
  27. Zhang, Z.; Jiang, Y.; Shao, W.; Wang, X.; Luo, P.; Lin, K.; Gu, J. Real-Time Controllable Denoising for Image and Video. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14028–14038. [Google Scholar] [CrossRef]
  28. Li, D.; Shi, X.; Zhang, Y.; Cheung, K.C.; See, S.; Wang, X.; Qin, H.; Li, H. A Simple Baseline for Video Restoration with Grouped Spatial-Temporal Shift. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9822–9832. [Google Scholar] [CrossRef]
  29. Shi, X.; Huang, Z.; Bian, W.; Li, D.; Zhang, M.; Cheung, K.C.; See, S.; Qin, H.; Dai, J.; Li, H. VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12435–12446. [Google Scholar] [CrossRef]
  30. Chen, Z.; Jiang, T.; Hu, X.; Zhang, W.; Li, H.; Wang, H. Spatiotemporal Blind-Spot Network with Calibrated Flow Alignment for Self-Supervised Video Denoising. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25 February–4 March 2025; pp. 2411–2419. [Google Scholar] [CrossRef]
  31. Wang, X.; Chan, K.C.K.; Yu, K.; Dong, C.; Loy, C.C. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar] [CrossRef]
  32. Li, J.; Wu, X.; Niu, Z.; Zuo, W. Unidirectional Video Denoising by Mimicking Backward Recurrent Modules with Look-Ahead Forward Ones. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 592–609. [Google Scholar] [CrossRef]
  33. Chen, X.; Song, L.; Yang, X. Deep RNNs for Video Denoising. In Proceedings of the Applications of Digital Image Processing XXXIX, San Diego, CA, USA, 28 August–1 September 2016. [Google Scholar] [CrossRef]
  34. Wang, Y.; Bai, X. Versatile recurrent neural network for wide types of video restoration. Pattern Recognit. 2023, 138, 109360. [Google Scholar] [CrossRef]
  35. Maggioni, M.; Huang, Y.; Li, C.; Xiao, S.; Fu, Z.; Song, F. Efficient Multi-Stage Video Denoising with Recurrent Spatio-Temporal Fusion. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3465–3474. [Google Scholar] [CrossRef]
  36. Liang, J.; Fan, Y.; Xiang, X.; Ranjan, R.; Ilg, E.; Green, S.; Cao, J.; Zhang, K.; Timofte, R.; Van Gool, L. Recurrent video restoration transformer with guided deformable attention. In Proceedings of the 36th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 378–393. [Google Scholar]
  37. Yue, H.; Cao, C.; Liao, L.; Yang, J. RViDeformer: Efficient Raw Video Denoising Transformer with a Larger Benchmark Dataset. IEEE Trans. Circuits Syst. Video Technol. 2025, 1. [Google Scholar] [CrossRef]
  38. Aiyetigbo, M.; Ravichandran, D.; Chalhoub, R.; Kalivas, P.; Luo, F.; Li, N. Unsupervised Coordinate-Based Video Denoising. In Proceedings of the 2024 IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 1438–1444. [Google Scholar] [CrossRef]
  39. Fu, Z.; Guo, L.; Wang, C.; Wang, Y.; Li, Z.; Wen, B. Temporal As a Plugin: Unsupervised Video Denoising with Pre-trained Image Denoisers. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 349–367. [Google Scholar] [CrossRef]
  40. Zheng, H.; Pang, T.; Ji, H. Unsupervised Deep Video Denoising with Untrained Network. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3651–3659. [Google Scholar] [CrossRef]
  41. Laine, S.; Karras, T.; Lehtinen, J.; Aila, T. High-Quality Self-Supervised Deep Image Denoising. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 6966–6976. [Google Scholar]
  42. Ghasemabadi, A.; Janjua, M.K.; Salameh, M.; Niu, D. Learning Truncated Causal History Model for Video Restoration. In Proceedings of the 38th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; pp. 27584–27615. [Google Scholar]
  43. Jin, Y.; Ma, X.; Zhang, R.; Chen, H.; Gu, Y.; Ling, P.; Chen, E. Masked Video Pretraining Advances Real-World Video Denoising. IEEE Trans. Multimed. 2025, 27, 622–636. [Google Scholar] [CrossRef]
  44. Dewil, V.; Anger, J.; Davy, A.; Ehret, T.; Facciolo, G.; Arias, P. Self-supervised Training for Blind Multi-Frame Video Denoising. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 2723–2733. [Google Scholar] [CrossRef]
  45. Lee, S.; Cho, D.; Kim, J.; Kim, T.H. Restore from Restored: Video Restoration with Pseudo Clean Video. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3536–3545. [Google Scholar] [CrossRef]
  46. Xu, P.; Zheng, P.; Zheng, L.; Zhang, X.; Shang, Y.; Zhang, H.; Geng, Y.; Gao, J.; Jiang, H. Denoising Real-World Low Light Surveillance Videos Based on Trilateral Filter. In Proceedings of the 2nd International Conference on Internet of Things, Communication and Intelligent Technology, Xuzhou, China, 22–24 September 2023; Dong, J., Zhang, L., Cheng, D., Eds.; Lecture Notes in Electrical Engineering. Springer: Singapore, 2024; Volume 1197, pp. 602–615. [Google Scholar]
  47. Choi, B.T.; Lee, S.H.; Ko, S.J. New Frame Rate Up-Conversion Using Bi-Directional Motion Estimation. IEEE Trans. Consum. Electron. 2000, 46, 603–609. [Google Scholar] [CrossRef]
  48. Guo, D.; Lu, Z. Motion-Compensated Frame Interpolation with Weighted Motion Estimation and Hierarchical Vector Refinement. Neurocomputing 2016, 181, 76–85. [Google Scholar] [CrossRef]
  49. Tomasi, C.; Manduchi, R. Bilateral Filtering for Gray and Color Images. In Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, 4–7 January 1998; pp. 839–846. [Google Scholar] [CrossRef]
  50. Paris, S.; Durand, F. A Fast Approximation of the Bilateral Filter Using a Signal Processing Approach. Int. J. Comput. Vis. 2007, 81, 24–52. [Google Scholar] [CrossRef]
  51. Chen, D.; Ardabilian, M.; Chen, L. Depth Edge Based Trilateral Filter Method for Stereo Matching. In Proceedings of the 2015 IEEE International Conference on Image Processing, Quebec City, QC, Canada, 27–30 September 2015; pp. 2280–2284. [Google Scholar] [CrossRef]
  52. Stoll, M.; Volz, S.; Bruhn, A. Joint Trilateral Filtering for Multiframe Optical Flow. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, VIC, Australia, 15–18 September 2013; pp. 3845–3849. [Google Scholar] [CrossRef]
  53. Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv 2017, arXiv:1704.00675. [Google Scholar]
  54. Yue, H.; Cao, C.; Liao, L.; Chu, R.; Yang, J. Supervised Raw Video Denoising With a Benchmark Dataset on Dynamic Scenes. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2298–2307. [Google Scholar] [CrossRef]
  55. Fu, H.; Zheng, W.; Wang, X.; Wang, J.; Zhang, H.; Ma, H. Dancing in the Dark: A Benchmark towards General Low-light Video Enhancement. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12831–12840. [Google Scholar] [CrossRef]
  56. Vinh, T.Q.; Kim, Y.-C.; Hong, S.-H. Frame Rate Up-Conversion Using Forward-Backward Jointing Motion Estimation and Spatio-Temporal Motion Vector Smoothing. In Proceedings of the 2009 International Conference on Computer Engineering & Systems, Cairo, Egypt, 14–16 December 2009; pp. 605–609. [Google Scholar] [CrossRef]
  57. Kang, S.J.; Yoo, S.; Kim, Y.H. Dual Motion Estimation for Frame Rate Up-Conversion. IEEE Trans. Circuits Syst. Video Technol. 2010, 20, 1909–1914. [Google Scholar] [CrossRef]
  58. Yoo, D.G.; Kang, S.J.; Kim, Y.H. Direction-Select Motion Estimation for Motion-Compensated Frame Rate Up-Conversion. J. Display Technol. 2013, 9, 840–850. [Google Scholar] [CrossRef]
  59. Guo, Y.; Chen, L.; Gao, Z.; Zhang, X. Frame Rate Up-Conversion Using Linear Quadratic Motion Estimation and Trilateral Filtering Motion Smoothing. J. Display Technol. 2015, 12, 89–98. [Google Scholar] [CrossRef]