Review

Comparative Study of Supervised Deep Learning Architectures for Background Subtraction and Motion Segmentation on CDnet2014

1 Department of Physics, Faculty of Sciences of Tunis, University of El Manar, Tunis 2092, Tunisia
2 LRMAN Laboratory, Higher Institute of Applied Sciences and Technology of Kasserine (ISSAT), Kasserine 1200, Tunisia
* Author to whom correspondence should be addressed.
Signals 2026, 7(1), 14; https://doi.org/10.3390/signals7010014
Submission received: 7 November 2025 / Revised: 31 December 2025 / Accepted: 5 January 2026 / Published: 2 February 2026

Abstract

Foreground segmentation and background subtraction are critical components in many computer vision applications, such as intelligent video surveillance, urban security systems, and obstacle detection for autonomous vehicles. Although extensively studied over the past decades, these tasks remain challenging, particularly due to rapid illumination changes, dynamic backgrounds, cast shadows, and camera movements. The emergence of supervised deep learning-based methods has significantly enhanced performance, surpassing traditional approaches on the benchmark dataset CDnet2014. In this context, this paper provides a comprehensive review of recent supervised deep learning techniques applied to background subtraction, along with an in-depth comparative analysis of state-of-the-art approaches available on the official CDnet2014 results platform. Specifically, we examine several key architecture families, including convolutional neural networks (CNN and FCN), encoder–decoder models such as FgSegNet and Motion U-Net, adversarial frameworks (GAN), Transformer-based architectures, and hybrid methods combining intermittent semantic segmentation with rapid detection algorithms such as RT-SBS-v2. Beyond summarizing existing works, this review contributes a structured cross-family comparison under a unified benchmark, a focused analysis of performance behavior across challenging CDnet2014 scenarios, and a critical discussion of the trade-offs between segmentation accuracy, robustness, and computational efficiency for practical deployment.

1. Introduction

Background subtraction (BS) [1,2] is a fundamental component in many computer vision applications, including video surveillance, autonomous driving, smart traffic control [3], urban security systems, behavior analysis, human activity recognition [4], and human–computer interaction [5,6]. For instance, in autonomous driving systems [7], background subtraction enables real-time detection of pedestrians and obstacles [8], while in urban surveillance scenarios, it supports the identification of abnormal behaviors in crowded environments. The objective of background subtraction is to extract moving objects (foreground) from a video sequence by detecting pixels that significantly deviate from a static or quasi-static background model.
Although extensively studied for decades, background subtraction remains an active research topic due to persistent challenges encountered in real-world environments, such as dynamic backgrounds, cast shadows, sudden illumination changes, camera motion, and foreground–background camouflage [9,10,11].
Traditional background subtraction techniques, including Gaussian Mixture Models (GMM) [12,13], ViBe [14], Mod-BFDO [15], and SuBSENSE [16], rely primarily on statistical modeling and pixel-level heuristics. While their performance can be partially improved through advanced image enhancement and photometric normalization techniques—such as contrast adjustment, illumination balancing, and noise reduction [17,18,19]—these methods remain fundamentally limited when confronted with complex and highly dynamic scenes.
In this context, the emergence of deep learning has significantly transformed the field by enabling the automatic learning of discriminative spatio-temporal representations from large-scale annotated data. Supervised approaches based on Convolutional Neural Networks (CNN), Fully Convolutional Networks (FCN), U-Net architectures [20], Generative Adversarial Networks (GAN), Recurrent Neural Networks (LSTM/GRU), and more recently Transformer-based models, have led to substantial improvements in segmentation accuracy, robustness, and generalization capability.
Notable supervised architectures such as FgSegNet [21,22], BSUV-Net [23], and Motion U-Net [24] have established strong performance baselines on the CDnet2014 benchmark dataset [25] by combining deep segmentation backbones with multi-scale feature extraction and spatio-temporal modeling strategies. Furthermore, models integrating recurrent units [26,27] or hybrid CNN–LSTM architectures [28] have demonstrated enhanced ability to capture temporal dynamics in video sequences. Transformer-based approaches [29] have also emerged as a promising direction, offering powerful mechanisms for modeling long-range spatial and temporal dependencies [30,31].
In parallel, the growing demand for real-time deployment on embedded and resource-constrained platforms has motivated the development of lightweight architectures such as MobileNet [32] and EfficientNet [32,33]. Additionally, self-supervised and weakly supervised learning strategies [34,35] have gained attention as potential solutions to reduce the heavy reliance on densely annotated data required by fully supervised methods.
One of the primary objectives of this review is to provide a comprehensive comparative analysis of supervised background subtraction methods that have been officially evaluated on the CDnet2014 benchmark [25], particularly those reported on the public results platform (http://jacarini.dinf.usherbrooke.ca/results2014, accessed on 15 May 2025). By focusing on this unified evaluation framework, the study aims to highlight the relative strengths, limitations, and performance behavior of different architectural families under diverse and challenging real-world conditions.
The main contributions of this paper are threefold. First, it presents a structured and unified comparison of supervised deep learning architectures for background subtraction, strictly based on methods officially evaluated on the CDnet2014 benchmark to ensure fairness and reproducibility. Second, it provides a cross-family analysis of representative architectures—including CNN/FCN-based models, encoder–decoder networks, motion-aware methods, GAN-based frameworks, Transformer-based approaches, and hybrid semantic systems—emphasizing their performance under challenging scenarios and their computational characteristics. Third, the review integrates recent advances published between 2020 and 2025 into the literature survey, situating CDnet2014-based methods within current research trends and emerging directions, even when these newer approaches are not yet available for direct quantitative comparison.
The remainder of this paper is organized as follows. Section 2 reviews existing literature on background subtraction, with a focus on supervised deep learning approaches categorized by architectural families. Section 3 describes the representative architectures analyzed in this study. Section 4 presents the evaluation protocol, quantitative and qualitative comparisons on CDnet2014, and an analysis of computational efficiency. Section 5 discusses the main findings and trade-offs observed across different approaches. Finally, Section 6 concludes the paper and outlines potential directions for future research.

2. Related Work

Background subtraction has been the subject of a vast body of research, ranging from traditional statistical models to recent supervised deep learning approaches. This section presents a structured and critical synthesis of the main supervised methods, categorized by architecture family, while highlighting their evolution, performance, and limitations with respect to our objective: accurate, robust, and real-time motion segmentation in video sequences.

2.1. Traditional Background Subtraction Methods

Traditional methods for background subtraction rely primarily on statistical modeling and filtering techniques. A widely used approach is Gaussian Mixture Models (GMM), which model pixel values using a mixture of Gaussian distributions. This technique adapts well to gradual background variations and can effectively distinguish moving objects in stable environments [12,36]. However, GMM struggles under rapid lighting changes, camera movements, or dynamic backgrounds. Another method is median filtering, where the background is estimated as the median of pixel values over a series of frames [37]. While this approach is simple and effective for static scenes, it performs poorly when the background frequently changes or when moving objects remain stationary for extended periods. In general, traditional methods lack the robustness to handle complex or dynamic environments, as shown in various studies [38].
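To make the classical baseline concrete, the snippet below shows a minimal sketch of GMM-based background subtraction using OpenCV's MOG2 implementation; the video path, parameter values, and the median-filter post-processing step are illustrative choices rather than a prescribed configuration.

```python
# Minimal sketch of GMM-based background subtraction with OpenCV's MOG2 model.
# The video path and parameters below are placeholders for illustration only.
import cv2

cap = cv2.VideoCapture("input_video.avi")
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,          # number of frames used to build the background model
    varThreshold=16,      # squared Mahalanobis distance threshold for the foreground decision
    detectShadows=True,   # mark shadow pixels with an intermediate gray value
)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)      # per-pixel foreground/background decision
    fg_mask = cv2.medianBlur(fg_mask, 5)   # simple spatial noise removal
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```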

2.2. CNN and FCN-Based Methods

Convolutional Neural Networks (CNNs) and Fully Convolutional Networks (FCNs) represent the earliest supervised deep learning paradigm for background subtraction. Their primary role is to learn discriminative spatial representations directly from labeled data, enabling pixel-wise foreground segmentation without handcrafted features. By exploiting hierarchical convolutional structures, these models effectively capture local appearance variations and fine spatial details, which are critical in moderately complex scenes.
Convolutional Neural Networks (CNNs) marked the beginning of supervised deep learning for background subtraction due to their ability to automatically learn discriminative features for pixel-wise segmentation. Representative models such as FgSegNet [21,22] and its variants, including FgSegNet_v2 [39] and FgSegNet_v2_co [40], have demonstrated strong performance on the CDNet2014 benchmark, achieving F-measures above 0.97. Their multi-scale encoding strategies allow effective handling of fine details and scale variations. Fully Convolutional Networks (FCNs), which avoid fully connected layers, have been adopted in architectures such as BSUV-Net [23] and its improved version BSUV-Net 2.0 [41]. These models incorporate spatio-temporal augmentations to improve generalization on unseen videos. In parallel, DeepBS [42] focuses on computational efficiency by combining shallow CNN architectures with spatial post-processing for near real-time applications. SemanticBGS [43] further enhances robustness by integrating semantic cues extracted from pre-trained networks. Despite their strong segmentation accuracy, CNN- and FCN-based methods generally rely on extensive annotated data and may exhibit limited robustness when deployed in highly dynamic or scene-specific environments, where generalization becomes more challenging.

2.3. Autoencoder and U-Net-Based Approaches

Encoder–decoder architectures, and U-Net-based models in particular, are designed to jointly learn compact latent representations and accurate spatial reconstructions. Their conceptual strength lies in the symmetric encoder–decoder structure combined with skip connections, which preserve spatial resolution while enabling contextual abstraction. This design makes them especially suitable for dense foreground segmentation, where precise object boundaries and spatial coherence are critical, even under moderately complex background dynamics.
Encoder–decoder architectures, particularly U-Net and its variants, have become standard for dense segmentation tasks due to their ability to capture spatial information while preserving object boundaries. Motion U-Net [24] introduces a multi-stream design that integrates motion and texture cues with effective feature fusion, leading to improved moving object detection in complex scenes.
More recent variants based on DeepLabv3+ have been adapted to specific application contexts. For instance, Sh-DeepLabv3+ [44] is optimized for lightweight segmentation in agricultural environments, while DeepLabv3+-Light [45] targets low-power satellite image segmentation. Similarly, Kwak and Sung [46,47] adapted encoder–decoder principles for dense 3D point cloud segmentation.
These approaches offer a favorable balance between architectural complexity and segmentation precision. However, in the absence of explicit temporal modeling components, their ability to capture long-term motion dynamics remains limited, particularly in scenes characterized by intermittent motion or strong temporal dependencies.

2.4. Generative Adversarial Networks (GAN) for Background Subtraction

Generative Adversarial Networks (GANs) introduce a fundamentally different learning paradigm for background subtraction by formulating the task as a generative modeling problem rather than a direct segmentation task. Their core objective is to learn a realistic representation of the background scene, such that foreground regions emerge implicitly as reconstruction residuals. This adversarial formulation enables GAN-based models to capture complex scene distributions, making them particularly suited for highly dynamic, textured, or illumination-variant environments.
Generative Adversarial Networks (GANs) offer an alternative paradigm by learning to reconstruct the background image realistically and implicitly separate the foreground. BSGAN [48] and BSPVGAN [49] employ Bayesian GAN formulations within a parallel vision framework to address scenes involving fast motion and significant illumination changes. These architectures rely on a generator to produce plausible background reconstructions and a discriminator to identify inconsistencies associated with dynamic foreground regions. This adversarial interaction allows GAN-based methods to model complex appearance patterns that are often difficult for conventional CNN-based segmentation approaches, especially in visually noisy or highly textured environments. However, GAN training is inherently unstable and sensitive to hyperparameter tuning. In addition, the computational cost associated with adversarial learning and background reconstruction limits the deployability of these models on resource-constrained or real-time systems without architectural simplification.

2.5. Recurrent Neural Networks (RNNs, LSTM)

Recurrent neural networks introduce an explicit temporal modeling capability that is particularly relevant for video-based background subtraction, where motion evolution over time carries critical information. By incorporating memory mechanisms, RNN-based architectures are designed to capture temporal dependencies that static convolutional models cannot represent, making them well suited for scenarios involving gradual motion, partial occlusions, or long-term temporal consistency.
Recurrent neural networks, particularly Long Short-Term Memory (LSTM) units, are widely used for their ability to model temporal relationships in video sequences. The ConvLSTM architecture [50] integrates convolutional operations within LSTM cells, enabling the joint modeling of spatial structures and temporal dynamics. Recent studies, such as those by Sabbu & Ganesan and Vrskova et al. [26,27], apply CNN–LSTM frameworks to human activity recognition from video or sensor data, effectively leveraging temporal context to detect subtle or slowly evolving motion patterns. Harb et al. [28] further demonstrate the versatility of this paradigm by proposing an optimized CNN–LSTM fusion strategy for extracting diverse motion features from wearable sensor data. While these approaches exhibit strong robustness in scenarios involving slow motion or partial occlusion, they typically suffer from high inference latency and increased training complexity, particularly for long video sequences. These limitations significantly restrict their applicability in real-time background subtraction systems and resource-constrained deployment environments.

2.6. Transformer-Based Models

Transformer-based architectures introduce a fundamentally different modeling paradigm by relying on attention mechanisms rather than convolution or recurrence. Their core strength lies in the ability to capture long-range spatial and temporal dependencies through global self-attention, which is particularly relevant for background subtraction in complex scenes involving camera motion, large-scale context variations, or long-term temporal correlations.
Originally developed for natural language processing, Transformers have recently demonstrated strong potential in computer vision tasks. Their multi-head attention mechanism enables effective modeling of long-range spatial and temporal relationships. The Background Subtraction Transformer (BST) [32] applies this principle to video segmentation by explicitly capturing long-term frame dependencies, resulting in improved foreground detection under challenging conditions. Similarly, the Swin Transformer [51], through its hierarchical and window-based attention design, has shown strong performance in person re-identification tasks within complex visual environments. Despite their representational power, Transformer-based models remain relatively underexplored in the context of background subtraction. Their high memory footprint and computational cost pose significant challenges, particularly for real-time processing and deployment on embedded or resource-constrained systems. Nevertheless, their capacity to unify spatial and temporal reasoning within a global representation framework positions Transformers as a promising direction for next-generation background subtraction architectures, especially when combined with model compression or hybrid designs.

2.7. Lightweight Models (MobileNet, EfficientNet)

Lightweight architectures are designed to address the constraints of embedded and real-time systems, where limited computational resources, memory, and energy consumption impose strict efficiency requirements. Their core objective is to preserve acceptable segmentation accuracy while drastically reducing model complexity, making them particularly relevant for deployment in edge devices.
In embedded environments such as drones, surveillance cameras, and IoT devices, computational efficiency becomes a key requirement. MobileNet [32], based on depthwise separable convolutions, was specifically designed to operate efficiently on mobile and low-power platforms while maintaining competitive performance. More recently, segmentation-adapted variants such as EfficientNet-Lite and Sh-DeepLabv3+ [33,44] have emerged, aiming to balance accuracy and efficiency through optimized scaling strategies and lightweight encoder–decoder designs.
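To illustrate the building block behind these lightweight backbones, the sketch below shows a depthwise separable convolution in PyTorch; the single normalization/activation placement is a simplification of the original MobileNet block, which applies batch normalization and ReLU after both the depthwise and pointwise stages.

```python
# Simplified depthwise separable convolution: a per-channel (depthwise) 3x3
# convolution followed by a 1x1 (pointwise) convolution, which reduces the
# parameter count and multiply-adds compared to a standard convolution.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU6()

    def forward(self, x):
        # depthwise filtering per channel, then channel mixing via 1x1 convolution
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```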
These networks are well suited for real-time deployment in resource-constrained settings, although they may exhibit slightly lower segmentation accuracy compared to heavier architectures. Nevertheless, their favorable trade-off between performance and efficiency makes them a viable and often preferred solution for industrial, embedded, and large-scale deployment scenarios.

2.8. Self-Supervised and Semi-Supervised Methods

Self-supervised and semi-supervised methods aim to alleviate the reliance on large-scale manually annotated datasets by exploiting intrinsic spatial or temporal cues present in video sequences. Kapoor et al. [52], for example, propose a graph-based approach for segmenting moving objects in underwater videos using geometric transformations as implicit supervision. Other strategies, such as TransBlast [31] and pseudo-labeling frameworks, leverage spatial–temporal coherence to generate surrogate foreground masks without dense annotations. While these approaches improve accessibility in annotation-scarce scenarios, their performance often degrades in the presence of strong noise, low contrast, or complex illumination variations.
More recently, the impact of noisy labels on deep neural networks has received increased attention, motivating the development of noise-robust learning frameworks. Progressive sample selection strategies, exemplified by methods such as PSSCL [53], address this issue by explicitly separating reliable samples from noisy ones during training through confidence-driven selection and contrastive learning. Although primarily developed for image-level classification, these frameworks provide relevant insights for background subtraction, where pixel-level annotations may be affected by shadows, illumination changes, and camouflage. Nonetheless, the adaptation of such noise-aware learning paradigms to dense foreground segmentation remains largely unexplored and represents an open research direction.

3. Architectures and Models for Segmentation Methods

The different approaches for foreground segmentation and moving object detection rely on a variety of architectures and models. This section presents the main configurations used in the studied methods, ranging from traditional techniques to deep neural networks, autoencoders, generative adversarial networks (GANs), as well as post-processing techniques. The models are described in the form of detailed architectures or mathematical equations, in order to illustrate the principles underlying each method. Figure 1 summarizes the different supervised detection methods discussed in this section.

3.1. CNN-Based Architectures

3.1.1. DeepBS

DeepBS is a supervised background subtraction model based on a shallow Convolutional Neural Network (CNN) designed for real-time applications [42]. The model combines feature learning with classical image processing by incorporating spatial median filtering as a post-processing step. Its design aims to achieve a trade-off between computational efficiency and segmentation accuracy in constrained environments.
DeepBS is composed of three convolutional layers, each followed by ReLU activation and batch normalization, and a multilayer perceptron (MLP) with two fully connected layers for binary classification. The model receives as input RGB patches from both the current frame and a background image generated by the SuBSENSE algorithm. The final output is a binary foreground mask, which is refined using a median filter to reduce noise and improve spatial consistency.
$$f(x) = W \ast x + b$$
where $W$ represents the convolution kernel, $b$ the bias term, and $\ast$ denotes the convolution operation. The loss function used during training is the Binary Cross Entropy (BCE), defined as:
$$L_{BCE}(x, y) = -\left[ x \cdot \log(y) + (1 - x) \cdot \log(1 - y) \right]$$
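As an illustration of this design, the following PyTorch sketch implements a DeepBS-style shallow patch classifier trained with the BCE loss above; the patch size, channel widths, and the 6-channel input (an RGB frame patch concatenated with the corresponding RGB background patch) are assumptions made for illustration, not the exact configuration of the original model.

```python
# Hedged sketch of a shallow patch-based classifier in the spirit of DeepBS:
# three convolutional layers (ReLU + batch normalization) followed by a
# two-layer MLP, trained with binary cross-entropy.
import torch
import torch.nn as nn

class ShallowBSNet(nn.Module):
    def __init__(self, patch_size=37):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 24, kernel_size=5), nn.ReLU(), nn.BatchNorm2d(24),
            nn.MaxPool2d(2),
            nn.Conv2d(24, 48, kernel_size=5), nn.ReLU(), nn.BatchNorm2d(48),
            nn.MaxPool2d(2),
            nn.Conv2d(48, 96, kernel_size=3), nn.ReLU(), nn.BatchNorm2d(96),
        )
        with torch.no_grad():  # infer the flattened feature size for the MLP
            n_feat = self.features(torch.zeros(1, 6, patch_size, patch_size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, frame_patch, background_patch):
        x = torch.cat([frame_patch, background_patch], dim=1)    # 6-channel input
        return torch.sigmoid(self.classifier(self.features(x)))  # foreground probability

model = ShallowBSNet()
loss_fn = nn.BCELoss()  # the L_BCE term defined above
```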
The principal advantages of DeepBS lie in its lightweight architecture and ability to generalize to new scenes without complex post-processing. Its low computational cost makes it suitable for deployment on embedded systems. However, its limited depth restricts its ability to capture complex scene variations, and it does not explicitly model temporal dependencies, making it less robust in highly dynamic environments.

3.1.2. Cascade CNN

Cascade CNN [11] is a deep supervised model developed to improve the precision of foreground segmentation by introducing a two-stage convolutional refinement mechanism. Rather than relying on a single-pass convolutional network, this architecture uses two consecutive CNNs to progressively enhance the segmentation quality, supported by a multi-scale input strategy to improve robustness across varying object sizes.
The process begins with the first CNN, which takes a local RGB image patch $I$ as input and generates an initial foreground probability map $P_{fg}$. This step can be described as:
$$I \xrightarrow{\text{CNN}_1} P_{fg}$$
To refine this initial estimation, the probability map is concatenated with the original input patch, forming a four-channel input $[I, P_{fg}]$, which is then passed through a second CNN to produce the final, more accurate output:
$$[I, P_{fg}] \xrightarrow{\text{CNN}_2} P_{fg}^{\,refined}$$
Each CNN in this cascade contains four convolutional layers (with 7 × 7 filters), two max-pooling layers, and two fully connected layers. Importantly, the model incorporates a multi-scale processing strategy, where the input images are resized to three different resolutions—1.0×, 0.75×, and 0.5×—and each resolution is processed independently. The outputs are then upsampled and averaged to produce the final segmentation map, allowing the model to capture features at different scales.
To train the model, a pixel-wise binary cross-entropy loss function is used. This loss measures the difference between the predicted probabilities $p_k$ and the ground-truth labels $C_k$ over all $K$ pixels:
$$L = -\frac{1}{K} \sum_{k=1}^{K} \left[ C_k \log(p_k) + (1 - C_k) \log(1 - p_k) \right]$$
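The two-stage refinement and multi-scale averaging described above can be outlined as in the following sketch; the small convolutional blocks are placeholders that only loosely follow the original four-layer design (which also includes pooling and fully connected layers), so this should be read as a conceptual outline rather than a re-implementation.

```python
# Conceptual sketch of the cascade: CNN_1 predicts an initial probability map,
# which is concatenated with the input patch and refined by CNN_2. Predictions
# at three scales are upsampled and averaged.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_stage(in_channels):
    # placeholder stage standing in for the original four-layer CNN with 7x7 filters
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
        nn.Conv2d(32, 1, kernel_size=7, padding=3),
    )

class CascadeCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn1 = make_stage(3)   # RGB patch -> initial probability map P_fg
        self.cnn2 = make_stage(4)   # [RGB patch, P_fg] -> refined map

    def forward(self, image):
        p_init = torch.sigmoid(self.cnn1(image))
        return torch.sigmoid(self.cnn2(torch.cat([image, p_init], dim=1)))

def multiscale_predict(model, image):
    """Average predictions computed at 1.0x, 0.75x, and 0.5x resolution."""
    h, w = image.shape[-2:]
    outputs = []
    for scale in (1.0, 0.75, 0.5):
        scaled = F.interpolate(image, scale_factor=scale, mode="bilinear",
                               align_corners=False)
        pred = model(scaled)
        outputs.append(F.interpolate(pred, size=(h, w), mode="bilinear",
                                     align_corners=False))
    return torch.stack(outputs).mean(dim=0)
```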
This architecture offers several advantages. It significantly improves spatial accuracy and reduces false detections near object boundaries, thanks to the two-step refinement. The multi-scale approach also helps handle varying object sizes and scene complexities. However, the model does have limitations. The two-stage design increases computational cost and inference time, making it less suitable for real-time deployment. Additionally, since it processes frames independently, it does not explicitly model temporal dynamics—a drawback when dealing with video sequences involving motion or temporal dependencies.

3.2. Encoder-Decoder Architectures: FgSegNet Family

The FgSegNet family, including FgSegNet_M, FgSegNet_S, FgSegNet_v2, and FgSegNet_v2_CO, represents a series of supervised encoder-decoder architectures specifically designed for foreground segmentation in complex video sequences. These models are recognized for their ability to combine multi-scale representations, regularization techniques, and efficient decoders to address spatial and contextual variations in scenes. Figure 2 illustrates the FgSegNet architecture.

3.2.1. FgSegNet_M—Multi-Scale Parallel Encoding

FgSegNet_M introduces a three-stream encoder based on VGG-16 [54], each branch processing a different resolution of the same input image using a Gaussian pyramid. The outputs from the three encoders (F1, F2, F3) are spatially aligned using bilinear interpolation and concatenated before being fed into the decoder (TCNN). This allows learning a mapping:
$$f : I \in \mathbb{R}^{H \times W \times 3} \mapsto M \in \{0, 1\}^{H \times W}$$
where I is the RGB input image and M is the binary foreground mask. This architecture improves robustness to object size variation and spatial context.

3.2.2. FgSegNet_S—Feature Pooling Module (FPM)

To reduce computational complexity, FgSegNet_S replaces the three encoders with a single encoder followed by a Feature Pooling Module (FPM). The FPM consists of four parallel dilated convolutions with dilation rates $d \in \{1, 4, 8, 16\}$, followed by a 2 × 2 pooling and concatenation across the channel axis. Batch Normalization and Spatial Dropout are used after each convolution for regularization and to improve generalization with limited data. The architecture maintains strong multi-scale capability at a lower computational cost.
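The parallel dilated-convolution structure of such an FPM can be sketched as follows; the channel widths and dropout rate are assumptions, and the 2 × 2 pooling branch of the original module is omitted for brevity.

```python
# Simplified Feature Pooling Module: four parallel dilated 3x3 convolutions
# (dilation rates 1, 4, 8, 16), each followed by batch normalization and
# spatial dropout, concatenated along the channel axis.
import torch
import torch.nn as nn

class FeaturePoolingModule(nn.Module):
    def __init__(self, in_channels=512, branch_channels=64, dropout=0.25):
        super().__init__()
        def branch(dilation):
            return nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(),
                nn.BatchNorm2d(branch_channels),
                nn.Dropout2d(dropout),   # spatial dropout for regularization
            )
        self.branches = nn.ModuleList([branch(d) for d in (1, 4, 8, 16)])

    def forward(self, x):
        # multi-scale context from parallel dilated branches, fused by concatenation
        return torch.cat([b(x) for b in self.branches], dim=1)
```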

3.2.3. FgSegNet_v2—Modified FPM and Decoder with GAP

FgSegNet_v2 introduces two key components:
  • Modified Feature Pooling Module (M-FPM): A hierarchical and progressive fusion of dilated convolutions based on the intermediate feature map F. The sequence is as follows:
    $$f_a = \mathrm{conv}_{3\times3}(F), \quad f_b = \mathrm{conv}_{3\times3}^{d=4}([F, f_a]), \quad f_c = \mathrm{conv}_{3\times3}^{d=8}([F, f_b]), \quad f_d = \mathrm{conv}_{3\times3}^{d=16}([F, f_c]), \quad F' = [F, f_a, f_b, f_c, f_d]$$
    This allows enlarging the receptive field while keeping the channel size fixed.
  • Decoder with Global Average Pooling (GAP): A scalar weight vector $\gamma_i$ is computed for each feature channel via GAP on shallow encoder layers. The decoder activation map is modulated using element-wise multiplication:
    $$\tilde{f}_{i,j} = \gamma_i \odot f_{i,j} + f_{i,j}$$
    where $\odot$ denotes the element-wise product. This enhances feature fusion between shallow and deep layers prior to bilinear interpolation (a minimal sketch of this modulation is given after this list).
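The GAP-based modulation referenced above can be sketched as follows; the sigmoid-activated linear projection used to map shallow-layer statistics to per-channel weights is an illustrative assumption.

```python
# Sketch of GAP-based feature modulation: per-channel weights gamma are computed
# from a shallow encoder feature map by global average pooling and used to
# reweight the decoder features, with a residual addition (f~ = gamma*f + f).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAPModulation(nn.Module):
    def __init__(self, shallow_channels, decoder_channels):
        super().__init__()
        # projects pooled shallow-layer statistics to one weight per decoder channel
        self.project = nn.Linear(shallow_channels, decoder_channels)

    def forward(self, decoder_feat, shallow_feat):
        gamma = F.adaptive_avg_pool2d(shallow_feat, 1).flatten(1)   # (B, C_shallow)
        gamma = torch.sigmoid(self.project(gamma))                  # (B, C_dec)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)                   # (B, C_dec, 1, 1)
        return gamma * decoder_feat + decoder_feat                  # modulation + residual
```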

3.2.4. FgSegNet_v2_CO—CDA Contour Optimizer (CO)

FgSegNet_v2_CO extends FgSegNet_v2 by introducing a post-processing module called the CDA Contour Optimizer. The CO aims to correct common artifacts such as over-segmented contours (RMO-SL) and small disjoint regions (DSR). It performs:
  • Detection of over-segmented regions.
  • Extraction of edge maps using Sobel filters.
  • Energy-based optimization of the contour mask.
The CO minimizes the energy:
$$E = (I - C)^2 + \lambda \, \|\nabla C\|$$
where I is the original image, C the contour map, and λ a regularization parameter. This energy formulation smooths and sharpens object boundaries.

3.3. MU-Net Family

The MU-Net series, including MU-Net1 (Motion U-Net, 2020) and MU-Net2 (2022), represents a lightweight family of encoder-decoder architectures specifically designed for foreground segmentation in videos by integrating explicit motion cues. Unlike conventional U-Net models relying solely on intensity or color input, MU-Net models incorporate motion tensors and spatial features to improve segmentation accuracy under challenging dynamics. Figure 3 shows the architecture of the Motion U-Net encoder, built upon ResNet-18 and composed of five convolution layers.

3.3.1. MU-Net1 (Motion U-Net—2020)

MU-Net1 [24] is based on a standard U-Net structure enhanced with motion information, such as optical flow or frame differencing tensors. The encoder simultaneously processes the input intensity image and a motion map, typically computed using optical flow methods like TV-L1. These inputs are concatenated at the feature level and passed through 3 × 3 convolutional layers within the encoder. The decoder reconstructs the foreground mask from these fused features. The segmentation output is produced by:
$$\hat{M} = \sigma\left(f_{dec}\left(f_{enc}([I, M])\right)\right)$$
where $I$ is the input image, $M$ is the motion map, $f_{enc}$ and $f_{dec}$ are the encoder and decoder functions, and $\sigma$ denotes the sigmoid activation for binary classification.
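The input construction can be sketched as follows; a simple frame-difference tensor stands in here for the TV-L1 optical flow, and `MotionUNet` is a hypothetical placeholder for any U-Net-style backbone accepting a four-channel input.

```python
# Hedged sketch of the MU-Net1 input pipeline: the current frame and a motion map
# are concatenated channel-wise before being passed to an encoder-decoder network.
import torch

def build_motion_input(frame_t, frame_t_minus_1):
    """frame_*: (B, 3, H, W) tensors with values in [0, 1]."""
    # crude motion map via frame differencing, standing in for TV-L1 optical flow
    motion_map = (frame_t - frame_t_minus_1).abs().mean(dim=1, keepdim=True)
    return torch.cat([frame_t, motion_map], dim=1)   # (B, 4, H, W)

# mask = torch.sigmoid(MotionUNet(in_channels=4)(build_motion_input(f_t, f_prev)))
# corresponds to M^ = sigma(f_dec(f_enc([I, M]))) above
```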

3.3.2. MU-Net2 (2022)

MU-Net2 [24] expands upon its predecessor by introducing two key modules: the Multi-Cue Fusion Module (MFM) and the Attention Refinement Module (ARM). The MFM aggregates multiple spatial and motion cues—such as gradients, texture descriptors, and optical flow—into a unified representation. The ARM then applies spatial attention to refine the focus on informative foreground regions. The process can be summarized as:
$$F = \mathrm{ARM}\left(\mathrm{concat}([I, G, M])\right)$$
where $G$ is the gradient map, $M$ is the motion tensor, and $F$ is the attention-weighted feature map sent to the decoder.
In addition, MU-Net2 employs a combined loss function involving Dice Loss and Binary Cross-Entropy (BCE) to better handle fine structures and imbalanced pixel distributions:
$$L_{Total} = \lambda \, L_{Dice} + (1 - \lambda) \, L_{BCE}$$
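A compact sketch of this combined objective is given below; the smoothing constant and the default value of λ are assumptions.

```python
# Hedged sketch of L_Total = lambda * L_Dice + (1 - lambda) * L_BCE.
import torch.nn.functional as F

def dice_bce_loss(pred, target, lam=0.5, smooth=1.0):
    """pred: foreground probabilities in [0, 1]; target: binary ground-truth mask."""
    bce = F.binary_cross_entropy(pred, target)
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)
    return lam * (1.0 - dice) + (1.0 - lam) * bce
```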
This architectural evolution reflects a strategic enhancement of spatial-temporal modeling capabilities without the added complexity of LSTM or Transformer-based architectures. Both MU-Net versions are efficient and well suited for deployment in embedded systems, real-time robotics, or smart city surveillance.

3.4. GAN-Based Architectures

Generative Adversarial Networks (GANs) have recently gained attention as a promising solution for foreground-background separation, especially in highly dynamic scenes. Rather than directly predicting foreground masks, GAN-based approaches take a generative route: they learn to reconstruct a clean version of the background, and the foreground is implicitly derived by subtracting this reconstruction from the input frame. This strategy makes GANs particularly suited for dealing with motion blur, dynamic backgrounds, and challenging lighting conditions.

3.4.1. BSGAN—Background Subtraction with Bayesian GANs

BSGAN [48] introduces a probabilistic adversarial model where the generator $G$ learns to reconstruct the background $\hat{B} = G(X)$, and the discriminator $D$ is trained to distinguish between real and generated backgrounds. The training process optimizes a hybrid loss function combining an L1 reconstruction loss and adversarial feedback:
$$L_G = \mathbb{E}_X\left[\left\| X - G(X) \right\|_1\right] + \lambda \, \mathbb{E}_X\left[\log\left(1 - D(G(X))\right)\right]$$
The first term ensures the generated background is close to the original input, while the second encourages realism. What sets BSGAN apart is the incorporation of Bayesian inference, introducing uncertainty modeling into the network via dropout or learned distributions over weights. This helps the system better generalize under uncertain or noisy conditions. However, training remains sensitive to initialization and hyperparameters, and inference speed is relatively slow compared to CNN-based alternatives.
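The generator objective can be sketched as follows; the non-saturating adversarial term is substituted for the log(1 − D(G(X))) form, as is common in practice, and `generator`, `discriminator`, and the weight λ are placeholders.

```python
# Hedged sketch of the generator loss: L1 reconstruction of the background plus
# an adversarial term that pushes the discriminator's score on G(X) towards "real".
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, frames, lam=0.01):
    background = generator(frames)                 # B^ = G(X)
    recon = F.l1_loss(background, frames)          # ||X - G(X)||_1 term
    fake_score = discriminator(background)         # raw logits assumed
    adv = F.binary_cross_entropy_with_logits(
        fake_score, torch.ones_like(fake_score))   # non-saturating adversarial term
    return recon + lam * adv
```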

3.4.2. BSPVGAN—Enhancing Motion Understanding with Parallel Vision

BSPVGAN [49] builds upon BSGAN by integrating ideas from Parallel Vision Theory, which emphasizes decomposing motion into distinct perspectives for better understanding. Its architecture is composed of two generators: one processes the original frame, and the other processes motion-enhanced inputs. These outputs are fused using attention-based mechanisms designed to highlight spatial and temporal patterns. The training objective combines three components:
$$L_{Total} = L_{rec} + \beta \, L_{motion} + \gamma \, L_{adv}$$
Here, $L_{rec}$ ensures the background is faithfully reconstructed, $L_{motion}$ maintains temporal consistency, and $L_{adv}$ refines the generated output using adversarial feedback. This architecture has shown improved performance in scenes with fast-moving objects, fluctuating textures, and challenging illumination. Yet, as with most GANs, BSPVGAN requires careful training and remains computationally intensive.
GAN-based methods offer compelling results in environments where traditional segmentation models fail. Their strength lies in their ability to model complex scene priors and produce high-fidelity background reconstructions. However, they face significant limitations in real-time applications due to high computational demands and instability during training. Future research could explore lightweight GAN variants, hybrid models that integrate GANs with CNNs or Transformers, and more stable training paradigms. These directions could bring the robustness of GANs into practical, real-world systems.

3.5. Fully Convolutional Networks and the BSUV-Net Series

The BSUV-Net family represents a class of Fully Convolutional Neural Network (FCN) architectures designed for supervised background subtraction on unseen video sequences. Unlike methods tailored to a specific video or set of videos, BSUV-Net adopts a video-agnostic strategy, ensuring that no frames from the test videos are used during training. This approach promotes better generalization to real-world scenarios, where annotated ground truth is typically unavailable. The architecture of the BSUV-Net model is presented in Figure 4.

3.5.1. BSUV-Net

BSUV-Net [23] (Background Subtraction for Unseen Videos) is a fully convolutional U-Net-based architecture designed to perform accurate foreground-background segmentation on video sequences not seen during training. The model receives a 12-channel input tensor, composed of the current frame, a static background image (“empty scene”), a recent background image, and their corresponding foreground probability maps generated by a semantic segmentation network such as DeepLabv3+. This fusion of spatial and semantic cues significantly enhances the network’s ability to distinguish foreground objects in complex environments, including scenes with noise, shadows, or camera motion.
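The 12-channel input tensor can be assembled as in the following hedged sketch; the channel ordering is an assumption made for illustration.

```python
# Sketch of a 12-channel BSUV-Net-style input: three RGB images (empty background,
# recent background, current frame), each paired with a single-channel semantic
# foreground probability map (FPM).
import torch

def build_bsuv_input(empty_bg, empty_fpm, recent_bg, recent_fpm, frame, frame_fpm):
    """RGB tensors: (B, 3, H, W); FPM tensors: (B, 1, H, W). Returns (B, 12, H, W)."""
    return torch.cat([empty_bg, empty_fpm,
                      recent_bg, recent_fpm,
                      frame, frame_fpm], dim=1)
```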
BSUV-Net maintains the symmetric encoder-decoder design typical of U-Net, with skip connections linking encoder and decoder layers. Its training relies on a differentiable formulation of the Jaccard Index (IoU) loss, which is effective in addressing class imbalance:
$$L_{Jaccard}(Y, \hat{Y}) = \frac{T + \sum_{i,j} Y_{i,j} \cdot \hat{Y}_{i,j}}{T + \sum_{i,j} \left( Y_{i,j} + \hat{Y}_{i,j} - Y_{i,j} \cdot \hat{Y}_{i,j} \right)}$$
where $Y$ is the ground-truth mask, $\hat{Y}$ is the predicted mask, and $T$ is a smoothing constant (typically $T = 1$). Experimental results on the CDnet2014 benchmark demonstrate that BSUV-Net generalizes effectively to unseen videos, achieving an average F-measure of around 0.82. However, its inference speed is limited to 6.4 frames per second (fps) on GPU, which poses challenges for real-time deployment.
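A differentiable form of this smoothed Jaccard measure can be sketched as follows; whether it is maximized directly or wrapped as 1 − J during optimization is left open here.

```python
# Sketch of a smoothed, differentiable Jaccard (IoU) measure matching the formula above.
def soft_jaccard(pred, target, smooth=1.0):
    """pred: foreground probabilities in [0, 1]; target: binary ground-truth mask."""
    intersection = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return (smooth + intersection) / (smooth + union)

# a typical training loss would then be:  loss = 1.0 - soft_jaccard(pred, target)
```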

3.5.2. BSUV-Net 2.0

BSUV-Net 2.0 [41] builds upon the original model by introducing a training-time augmentation strategy using realistic synthetic perturbations that mimic real-world surveillance scenarios, such as:
  • Camera jitter and motion,
  • Pan-Tilt-Zoom (PTZ) movement,
  • Intermittent object appearance,
  • Sudden lighting changes.
These augmentations are applied without requiring extra annotations, maintaining the fully supervised framework while increasing the diversity of training conditions. Although the input tensor remains 12-channel, the use of these perturbations results in notable gains in generalization. The model achieves an F-measure of approximately 0.837 on CDNet2014, with improved performance on challenging categories such as Dynamic Background and Intermittent Object Motion. Despite the performance gain, the inference speed remains around 6.4 fps, similar to the original version, leaving room for optimization in time-sensitive applications.

3.5.3. Fast BSUV-Net 2.0

Fast BSUV-Net 2.0 is a streamlined version of BSUV-Net 2.0 [41] specifically designed for real-time processing on GPUs and embedded systems. The main optimization involves removing the semantic foreground probability maps (FPM) from the input tensor, reducing it from 12 to 9 channels (current frame, static background, recent background only). The architecture remains U-Net-based but has been further optimized for computational and memory efficiency.
These architectural adjustments allow Fast BSUV-Net 2.0 [13] to reach 29 fps at a resolution of 320 × 240 on an NVIDIA Titan X GPU, making it well-suited for applications such as smart surveillance cameras, drones, and IoT platforms. Importantly, this speed gain comes at only a minimal cost to accuracy, with the F-measure still exceeding 0.81, making this model a highly practical solution for deployment under resource constraints.

3.6. Hybrid Methods

3.6.1. BSUV-Net + SemanticBGS

The BSUV-Net + SemanticBGS approach represents a hybrid strategy that enriches a fully convolutional segmentation model (BSUV-Net) [23] with high-level semantic cues provided by a pre-trained semantic segmentation network such as DeepLabv3+ or SemanticBGS. While BSUV-Net was originally designed to segment foreground objects in unseen videos without relying on any frame from the test video during training, this extension incorporates an additional binary semantic map highlighting likely object regions (e.g., pedestrians, vehicles).
This semantic map is injected as an additional input channel alongside the current frame, static background, and recent background. By doing so, the model benefits from semantic awareness, helping it distinguish meaningful foreground objects from static or textured background elements like trees or buildings. The integration enhances segmentation accuracy, particularly in visually complex or cluttered environments.
However, the performance of this hybrid model is highly dependent on the quality of the semantic segmentation. If key object classes are misclassified or missed entirely by the semantic backbone, the resulting foreground mask can suffer. Despite this, the model maintains BSUV-Net’s video-agnostic nature, ensuring robust generalization by avoiding exposure to test videos during training. Figure 5 illustrates the semantic segmentation algorithm underlying this approach.

3.6.2. RT-SBS-v2

RT-SBS-v2 [55] (Real-Time Semantic Background Subtraction) offers a lightweight and practically efficient hybrid framework for foreground detection, combining deep semantic understanding with real-time background subtraction. Rather than relying exclusively on deep architectures like Transformers, RT-SBS-v2 strategically leverages semantic segmentation outputs at sparse intervals (e.g., every 10 frames) while employing fast motion detection algorithms—like ViBe or Flux Tensor—to ensure frame-level responsiveness.
The method is built around three modular components, whose interaction is outlined in the conceptual sketch after the following list:
  • Semantic Change Mask (SCM): Created by comparing the current semantic segmentation map to that of a background reference frame. This captures category-level scene changes.
  • Pixel-wise Change Mask (PCM): Generated by traditional fast foreground detection methods that detect low-level appearance changes.
  • Fusion Module: Combines SCM and PCM through adaptive thresholding to produce the final binary mask.
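In the conceptual sketch below, the semantic model, fast detector, refresh period, and the simple logical fusion rule are placeholders standing in for the adaptive mechanisms of RT-SBS-v2.

```python
# Conceptual sketch: the semantic change mask (SCM) is refreshed only every
# `semantic_period` frames, the pixel-wise change mask (PCM) is computed on every
# frame, and the two are fused into the final binary mask.
import numpy as np

def run_pipeline(frames, semantic_model, fast_detector, semantic_period=10):
    reference_semantics = semantic_model(frames[0])   # background reference labels
    scm = None
    for t, frame in enumerate(frames):
        if t % semantic_period == 0:                  # sparse semantic updates
            current_semantics = semantic_model(frame)
            scm = current_semantics != reference_semantics
        pcm = fast_detector(frame)                    # per-frame low-level changes
        # naive fusion standing in for the adaptive thresholding of the original method
        final_mask = np.logical_and(pcm, scm) if scm is not None else pcm
        yield final_mask.astype(np.uint8) * 255
```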
RT-SBS-v2 strikes a rare balance between speed and semantic reasoning, running efficiently (over 30 FPS on CPU) while maintaining robustness against semantic noise and inconsistencies. Moreover, its modular architecture allows easy replacement of the semantic segmenter or the motion detector, making it versatile for deployment in constrained environments.
That said, the approach’s success remains tied to the semantic segmenter’s quality and scope. If the network fails to predict certain foreground classes or misses small objects, detection performance may degrade. Nonetheless, RT-SBS-v2 stands out as a pragmatic, flexible method bridging semantic awareness and real-time execution.

4. Results

4.1. Experimental Setup and Evaluation Metrics

To ensure a rigorous and transparent comparative analysis of supervised deep-learning methods for background subtraction, this review strictly follows the evaluation protocol established by the CDnet2014 benchmark dataset. CDnet2014 [25] is specifically designed to evaluate foreground segmentation algorithms across 11 diverse video categories that simulate challenging real-world conditions, including dynamic backgrounds, illumination changes, camera jitter, shadows, and camouflage effects. This benchmark provides a standardized and widely adopted framework enabling meaningful and reproducible comparisons between different supervised architectures.
The methods analyzed in this study were selected from the official CDnet2014 results platform (available online at: http://jacarini.dinf.usherbrooke.ca/results2014, accessed on 15 May 2025). All quantitative results reported in this paper correspond to the performance values published by the original authors under the official CDnet2014 evaluation protocol. No methods were re-implemented, re-trained, or re-evaluated by the authors of this review. This choice avoids implementation bias and ensures that all comparisons rely on identical dataset partitions, ground-truth annotations, and evaluation criteria.
The performance metrics adopted strictly follow the CDnet2014 evaluation guidelines and are expressed using standard terminology widely accepted in the computer vision community. These metrics include Precision (PR), Recall (RE), Specificity (SP), F-measure (F1-score), False Positive Rate (FPR), False Negative Rate (FNR), and Percentage of Wrong Classifications (PWC). The main definitions are provided below:
Precision (PR): Measures the fraction of correctly classified foreground pixels out of all pixels classified as foreground.
$$PR = \frac{TP}{TP + FP}$$
Recall (RE): Measures the fraction of correctly identified foreground pixels among all actual foreground pixels.
$$RE = \frac{TP}{TP + FN}$$
F-measure (F-M): The harmonic mean of precision and recall, providing a balanced performance metric.
$$F\text{-}M = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
Percentage of Wrong Classifications (PWC): Represents the percentage of incorrectly classified pixels across the frames.
$$PWC = \frac{100 \cdot (FN + FP)}{TP + FN + FP + TN}$$
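For reference, these quantities can be computed from pixel counts accumulated over the evaluated frames, as in the brief sketch below.

```python
# CDnet2014-style metrics from accumulated pixel counts (true/false positives/negatives).
def cdnet_metrics(TP, FP, FN, TN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f_measure = 2 * precision * recall / (precision + recall)
    pwc = 100.0 * (FN + FP) / (TP + FN + FP + TN)
    return {"PR": precision, "RE": recall, "F-M": f_measure, "PWC": pwc}
```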
In addition to segmentation accuracy, computational efficiency is discussed based on the average inference speed reported in frames per second (FPS) in the original publications. These measurements were obtained in heterogeneous experimental environments, typically involving GPU-accelerated platforms (e.g., NVIDIA GTX or Tesla series) or CPU-based systems implemented using common deep learning frameworks such as TensorFlow or PyTorch. Consequently, the reported FPS values should be interpreted as indicative measures of computational behavior rather than strictly comparable benchmarks.
It is important to note that differences in training strategies and architectural design may influence the reported performance. Some supervised approaches rely on video-specific training, which can yield strong results on individual sequences but may favor local optima and limit generalization. Other methods are explicitly designed to operate in a video-agnostic manner, prioritizing robustness across unseen scenes. Accordingly, the analysis emphasizes relative performance trends across architectural families rather than the absolute optimality of individual models, providing a fair and reproducible comparative perspective.

4.2. Selection of Baseline Architectures

The baseline architectures were selected to ensure a fair and representative comparison of the main families of supervised background subtraction methods evaluated on the CDnet2014 benchmark. Rather than conducting an exhaustive assessment of all available approaches, the analysis focuses on models that are widely recognized as standard or early representative architectures within their respective categories.
The selection of baseline methods was guided by three criteria: (i) the availability of official performance results on the CDnet2014 platform to ensure adherence to a unified evaluation protocol; (ii) their role as canonical representatives of major architectural families, including CNN-based refinement, encoder–decoder, motion-aware, GAN-based, and hybrid approaches; and (iii) their ability to reflect the core design principles of each category without extensive task-specific optimization.
More recent and optimized variants were also included to analyze performance evolution within each architectural family and to assess the impact of architectural refinements such as multi-scale encoding, motion-aware fusion, and structured post-processing. While other segmentation models could potentially achieve comparable performance levels after further optimization, this study deliberately avoids re-training or re-optimization in order to preserve fairness, reproducibility, and methodological consistency. Instead, the comparison emphasizes architectural design choices under the standardized CDnet2014 evaluation protocol.

4.3. Quantitative Comparative Analysis

In this study, a comprehensive quantitative comparison of recent supervised background subtraction methods was performed using the well-established CDnet2014 [25] benchmark dataset. CDnet2014 encompasses 11 distinct categories of real-world video sequences, each presenting unique challenges such as dynamic backgrounds, illumination changes, shadows, weather conditions, and camera jitter, thus providing a robust platform for evaluating method effectiveness and reliability.
Table 1 summarizes the comparative results for representative methods from all major supervised architectures: classical convolutional neural networks (CNNs), generative adversarial networks (GANs), U-Net-based models, fully convolutional networks (FCNs), Transformer-inspired models, and hybrid semantic approaches.
To clearly highlight the evolution of performance within each architectural family, we adopted a two-step visualization strategy:
Firstly (Figure 6), we selected one representative method per architectural family from their initial or earlier-generation models: FgSegNet_M (CNN), BSGAN (GAN), MU-Net1 (motion-aware U-Net), BSUV-Net (FCN), DeepBS (refined CNN), and RT-SBS-v2 (semantic hybrid). The evaluation in Figure 6 focused on five specific categories from the CDnet2014 dataset: Baseline, Dynamic Background, Bad Weather, Shadow, and Camera Jitter. The results demonstrate that FgSegNet_M clearly achieves the highest overall performance among the selected models, with an average F-measure of 0.977, particularly excelling in challenging categories such as Shadow and Camera Jitter, where spatial precision and stability are crucial. BSGAN also shows notable robustness, especially in dynamic scenes (Dynamic Background) and under challenging weather conditions (Bad Weather). In contrast, methods such as MU-Net1, BSUV-Net, DeepBS, and RT-SBS-v2 deliver somewhat lower and less stable performances but offer valuable trade-offs in terms of simplicity or computational efficiency.
Secondly (Figure 7), we evaluated newer, optimized variants from each architectural family to assess intra-family evolution. Specifically, we selected the advanced models FgSegNet_v2_CO (CNN), BSPVGAN (GAN), MU-Net2 (enhanced U-Net with attention), Fast BSUV-Net 2.0 (optimized FCN), Cascade CNN (advanced refinement CNN), and BSUV-Net + SemanticBGS (enhanced semantic hybrid). This second comparison focused on the remaining six categories from CDnet2014: Low Framerate, Night Videos, PTZ, Turbulence, Intermittent Object Motion, and Thermal. Results from this advanced set clearly highlight significant performance improvements within each architectural family. Most notably, FgSegNet_v2_CO achieves the highest average F-measure of the entire comparative study (0.985), confirming the effectiveness of its multi-scale integration and contour optimization techniques. The second-generation U-Net method (MU-Net2) clearly surpasses its predecessor MU-Net1 (0.9369 vs. 0.9147), thanks to the incorporation of spatial attention and improved motion fusion. Similarly, the advanced GAN method (BSPVGAN) achieves a noticeable robustness enhancement with an F-measure of 0.9501, despite its higher computational cost. Meanwhile, Fast BSUV-Net 2.0 and Cascade CNN display significant improvements in both precision and robustness compared to their earlier versions.
Overall, this two-stage visualization and comparison strategy not only provides a clear quantitative overview of current state-of-the-art supervised methods but also effectively illustrates the progressive improvements and architectural evolution within each methodological family.
These insights highlight the beneficial integration of attention mechanisms, semantic fusion, and multi-scale processing to tackle the practical challenges of moving object segmentation in diverse, real-world video scenarios.

4.4. Performance Under Challenging Scenarios

The focused evaluation on challenging CDnet2014 categories (Table 2) reveals structural performance gaps that are not captured by global metrics. Under strong illumination variations, camera motion, dynamic backgrounds, intermittent object motion, and shadows, several supervised architectures exhibit pronounced instability, particularly those relying mainly on local appearance modeling.
Classical CNN-based refinement approaches show clear limitations under PTZ and intermittent motion conditions (Table 2), confirming their reduced ability to handle complex temporal and geometric variations. This behavior is further illustrated by the radar-based comparisons in Figure 8A,C, which highlight the substantial gap between cascaded CNNs and modern encoder–decoder architectures.
Encoder–decoder models demonstrate significantly greater robustness in all challenging scenarios. The consistent dominance of FgSegNet_v2_CO in Table 2 reflects the effectiveness of multi-scale feature integration and structured refinement, a trend that is clearly visible in Figure 8B.
Motion-aware architectures, exemplified by MU-Net2, occupy a strategically important position. Although they do not reach the peak performance of FgSegNet_v2_CO, they markedly outperform classical baselines in dynamic and intermittent motion scenarios (Table 2). The intra-family comparison in Figure 8D confirms the benefit of explicit motion fusion.
GAN-based approaches exhibit complementary strengths. As shown in Table 2 and Figure 9D, BSPVGAN improves robustness in dynamic and illumination-sensitive scenes compared to earlier adversarial models, though performance remains less uniform across categories. The comparisons in Figure 9A–C further emphasize the gap between classical CNN-based methods and state-of-the-art supervised architectures under challenging conditions.
Overall, the combined evidence from Table 2 and Figure 8 and Figure 9 confirms that robustness under challenging scenarios is driven primarily by architectural design choices—particularly multi-scale encoding, motion-aware fusion, and structured refinement—rather than model complexity alone.

4.5. Qualitative Analysis

A detailed qualitative analysis of visual performance is essential to complement the quantitative results and better understand the real capabilities of the supervised methods studied. To this end, we present in Figure 10 and Figure 11 a visual comparison of segmentation masks obtained by different families of supervised methods, applied to several complex scenarios extracted from the CDnet2014 benchmark.
Figure 10 illustrates the qualitative results for five particularly demanding scene categories: Bad Weather, Shadow, Camera Jitter, Dynamic Background, and Baseline. For each category, a representative method per architectural family (FgSegNet_M, BSGAN, MU-Net1, BSUV-Net, DeepBS, RT-SBS-v2) was selected in order to accurately observe their respective behaviors.
We can clearly see that FgSegNet_M generates masks that are visually very close to the ground truth in all the situations represented, particularly in the Shadow and Camera Jitter scenarios. The resulting segmentation is fine and precise, with little noise or artifacts. Similarly, the BSGAN model offers good visual quality, particularly in the case of Dynamic Background, although it does show some minor flaws such as less sharp contours. The MU-Net1 method also produces satisfactory results, but shows slight visual degradation in the presence of Bad Weather. On the other hand, BSUV-Net, DeepBS, and RT-SBS-v2 clearly show difficulties in complex situations, revealing noisier masks and objects that are sometimes incomplete or poorly segmented, particularly in the presence of dynamic movements or camera jitter.
In order to visually assess the recent evolution of methods within each architectural family, Figure 11 presents a comparison on the remaining CDnet2014 categories: Low Framerate, Night Videos, PTZ, Turbulence, Intermittent Object Motion, and Thermal, this time choosing advanced or optimized variants of each family (FgSegNet_v2_CO, BSPVGAN, MU-Net2, Fast BSUV-Net 2.0, Cascade CNN, BSUV-Net + SemanticBGS).
A clear visual improvement in advanced methods is immediately apparent. FgSegNet_v2_CO demonstrates exceptional, near ground-truth segmentation in all scenarios, even the most complex such as Night Videos and PTZ, confirming its superior effectiveness over previous models.
MU-Net2 also marks a very significant visual improvement over MU-Net1, particularly in difficult scenarios such as Intermittent Object Motion and Thermal, with better-defined contours and effective false-positive suppression. BSPVGAN makes significant progress in producing precise, well-defined masks in low-light, low-frequency scenes (Low Framerate), even if a few imperfections remain in turbulent contexts. Similarly, Fast BSUV-Net 2.0 and Cascade CNN show a noticeable visual improvement, albeit less impressive than that observed with previous methods. The hybrid BSUV-Net + SemanticBGS method, despite satisfactory segmentation, remains visually inferior in some difficult cases, such as Turbulence and PTZ, with visible errors due to residual noise or partially segmented objects.
This qualitative analysis thus highlights the concrete visual evolution of the different families of supervised models, revealing how recent optimizations to architectures have significantly improved the actual quality of the results obtained, especially under complex and realistic conditions.

4.6. Analysis of Computational Efficiency

Beyond segmentation accuracy, computational efficiency plays a central role in determining whether a background subtraction method can be deployed in real-world systems, particularly in time-critical and resource-constrained environments such as video surveillance, drones, and embedded platforms. While performance under challenging scenarios is analyzed separately in Table 2, the present section focuses on runtime behavior and deployability aspects. Assessing and comparing inference speed across supervised approaches, however, remains inherently challenging due to differences in hardware platforms, software implementations, and evaluation pipelines.
Accordingly, computational efficiency is discussed based on inference speeds reported in the literature for a 320 × 240 video resolution, with the objective of identifying general performance trends rather than establishing a unified benchmark, as summarized in Table 3. Within this context, encoder–decoder CNN architectures from the FgSegNet family typically operate in the near real-time range, with reported speeds between 18 and 23 FPS on GPU-based platforms. Although FgSegNet_v2_CO reports a very high processing rate, this figure corresponds exclusively to the post-processing stage and should not be interpreted as end-to-end performance.
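As a point of reference for how such figures are typically obtained, the sketch below shows one simple way to estimate single-frame inference speed at 320 × 240 resolution; the stand-in model, warm-up length, and timing loop are illustrative assumptions and do not reproduce the exact protocols of the cited works.

```python
import time
import torch

def measure_fps(model, device="cuda", n_frames=200, warmup=20):
    """Estimate single-frame inference speed (FPS) on 320x240 RGB input.

    Illustrative benchmark only: the reviewed studies differ in framework,
    pre/post-processing, and hardware, so absolute numbers are not comparable.
    """
    model = model.to(device).eval()
    frame = torch.rand(1, 3, 240, 320, device=device)  # one 320x240 RGB frame

    with torch.no_grad():
        for _ in range(warmup):           # warm-up: stabilize clocks and caches
            model(frame)
        if device == "cuda":
            torch.cuda.synchronize()      # ensure queued GPU kernels have finished
        start = time.perf_counter()
        for _ in range(n_frames):
            model(frame)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    return n_frames / elapsed

# Usage with a stand-in network (a real evaluation would load a trained model):
if __name__ == "__main__":
    dummy = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
        torch.nn.Conv2d(16, 1, 3, padding=1), torch.nn.Sigmoid(),
    )
    print(f"{measure_fps(dummy, device='cpu'):.1f} FPS")
```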
Motion-aware architectures, such as MU-Net1 and MU-Net2, exhibit higher reported efficiency, achieving approximately 35 FPS on high-end GPUs. This observation indicates that the integration of explicit motion cues and attention mechanisms does not necessarily imply prohibitive computational overhead. In parallel, architectural simplification and modular design emerge as effective strategies for improving runtime performance. Fast BSUV-Net 2.0, for example, significantly increases inference speed compared to earlier BSUV-Net variants by reducing input complexity, while hybrid approaches such as RT-SBS-v2 combine intermittent semantic processing with fast traditional methods to achieve near real-time operation.
In contrast, refinement-based CNNs and GAN-based architectures generally present lower reported inference speeds, reflecting the computational burden introduced by multi-stage refinement pipelines or adversarial learning frameworks. Consequently, these approaches remain more suitable for offline processing or scenarios where latency constraints are less stringent.
Taken together, this analysis emphasizes the trade-offs between model complexity, segmentation accuracy, and computational efficiency across different architectural families. Rather than ranking methods by absolute speed, the results reported in Table 3 provide practical insights into how architectural design choices influence deployability, thereby supporting informed method selection under application-specific constraints.

5. Discussion

The comprehensive analysis conducted across quantitative metrics, qualitative evaluations, and computational efficiency reveals several important insights into the current landscape of supervised background subtraction methods.
First, CNN-based encoder–decoder architectures, particularly the FgSegNet family, demonstrate superior segmentation accuracy across nearly all CDnet2014 categories. The enhanced variant FgSegNet_v2_CO achieves the highest overall F-measure (0.985), benefiting from multi-scale encoding, precise contour optimization, and dense training supervision. However, this accuracy comes at the cost of relatively high computational demand, with standard variants like FgSegNet_v2 operating at around 23 FPS. While acceptable for semi-real-time applications, further optimization is needed for deployment in real-time or embedded systems.
In contrast, GAN-based methods such as BSGAN and BSPVGAN show promising performance in challenging scenarios (e.g., dynamic background, bad weather), leveraging adversarial training to generate clean foreground masks. Nevertheless, their computational footprint remains substantial, typically limited to 5 FPS on standard CPUs, making them impractical for real-time usage without dedicated acceleration.
A notable strength emerges from motion-aware U-Net architectures, especially MU-Net2, which offers a remarkable balance between precision and speed. With an average F-measure of 0.9369 and real-time inference speed (~35 FPS on Tesla V100), MU-Net2 demonstrates the impact of integrating motion attention and fusion mechanisms into lightweight architectures. This confirms that real-time performance no longer requires sacrificing segmentation quality.
Similarly, Fast BSUV-Net 2.0 offers a pragmatic compromise: while slightly behind in precision (F-measure ~0.80), it reaches ~29 FPS and operates reliably across multiple categories. Its streamlined input processing and semantic reduction make it well suited for edge deployment.
Hybrid approaches, such as RT-SBS-v2, underscore an emerging direction in background subtraction: modular architectures that combine semantic context with fast traditional models. Achieving 25 FPS with consistent results in complex scenes (e.g., PTZ, intermittent motion), this method shows how selectively incorporating semantic cues can enhance robustness without incurring excessive computational costs.
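To make this modular idea concrete, the sketch below illustrates a simplified SemanticBGS-style decision rule [43], in which a per-pixel semantic foreground probability confirms or vetoes the output of a fast conventional subtractor; the threshold values and the override logic are illustrative assumptions rather than the exact RT-SBS-v2 implementation.

```python
import numpy as np

def semantic_bgs_fusion(bgs_mask, semantic_prob, tau_fg=0.8, tau_bg=0.2):
    """Combine a fast BGS mask with a per-pixel semantic foreground probability.

    Simplified, hypothetical illustration of a SemanticBGS-style rule:
    strong semantic evidence overrides the BGS decision, otherwise the
    fast subtractor's output is kept unchanged.

    bgs_mask:      HxW boolean mask from a conventional subtractor (e.g., SuBSENSE)
    semantic_prob: HxW array in [0, 1], probability of foreground-relevant classes
    """
    fused = bgs_mask.copy()
    fused[semantic_prob >= tau_fg] = True    # confident semantic foreground -> force FG
    fused[semantic_prob <= tau_bg] = False   # confident semantic background -> suppress FP
    return fused

# Usage on dummy data (in practice, semantic_prob would come from a segmentation
# network run intermittently on keyframes and propagated between them):
h, w = 240, 320
bgs = np.random.rand(h, w) > 0.9
sem = np.random.rand(h, w)
print(semantic_bgs_fusion(bgs, sem).sum(), "foreground pixels after fusion")
```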
Collectively, the comparative study highlights distinct strengths and limitations across architectural families:
  • CNN/U-Net architectures remain the most accurate but demand GPU-level resources.
  • GAN-based models offer perceptually clean masks but are computationally intensive.
  • Lightweight architectures like MU-Net2 and Fast BSUV-Net 2.0 strike an optimal balance for real-time use.
  • Hybrid models illustrate the potential of flexible, modular designs for adaptive background subtraction.
Looking ahead, the field would benefit from cross-family integrations, combining the strengths of GAN-based learning, motion attention, and semantic modulation in a unified, resource-efficient framework. Furthermore, addressing challenges such as generalization to unseen scenes, reducing dependency on labeled data, and optimizing inference pipelines for edge hardware remains essential for large-scale, real-world deployment.

6. Conclusions

In this paper, we presented a comprehensive and updated review of supervised deep-learning-based methods for background subtraction, thoroughly evaluated on the CDnet2014 benchmark dataset. Through detailed comparative analyses—including quantitative assessments, qualitative visual inspections, and computational efficiency evaluations—we clearly demonstrated the significant advancements achieved by recent deep neural network approaches such as advanced CNN architectures (FgSegNet_v2_CO), motion-aware U-Net models (MU-Net2), optimized GAN-based methods (BSPVGAN), and hybrid semantic techniques (RT-SBS-v2).
Our experimental results confirm that despite notable progress made in this field over recent years, several challenges remain open. In particular, current models still exhibit limitations in effectively generalizing to highly complex video categories, notably “Intermittent Object Motion” and “PTZ” scenarios. Additionally, the prevalent reliance on traditional RGB inputs constitutes another constraint; thus, incorporating additional modalities such as depth (RGB-D) or multi-spectral data could substantially enhance segmentation performance in critical applications, including camouflage detection or low-light scenarios.
Balancing accuracy with computational efficiency continues to be essential for real-world, real-time deployments. Our study identified promising methods such as MU-Net2 and Fast BSUV-Net 2.0, which offer an excellent trade-off between precision and inference speed. However, further research into lightweight and hardware-adapted architectures remains essential for broader practical applicability.
Finally, our review highlights the increasing importance of modular architectures that intelligently combine intermittent semantic segmentation with rapid foreground detection. The future of background subtraction research lies in developing innovative hybrid models capable of adaptive and context-aware performance. Consequently, we recommend that future studies explore advanced architectural designs such as multi-scale pyramidal networks, probabilistic neural networks, and neuromorphic or memristive neural models, as well as self-supervised learning techniques, to reduce reliance on extensive manual annotations.
Overall, this study provides a clear, modern perspective on recent advances in supervised background subtraction and offers concrete directions for the next generation of intelligent video processing systems.

Author Contributions

Conceptualization, O.B. and M.B.; methodology, O.B. and W.S.; resources, M.B. and O.B.; writing—original draft preparation, O.B. and W.S.; visualization, O.B. and M.B.; supervision, W.S. and O.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The CDnet2014 dataset used in this study is publicly available at: http://jacarini.dinf.usherbrooke.ca/dataset2014, accessed on 15 June 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ramírez-Alonso, G.; Chacón-Murguía, M.I. Auto-Adaptive Parallel SOM Architecture with a modular analysis for dynamic object segmentation in videos. Neurocomputing 2016, 175, 1000. [Google Scholar] [CrossRef]
  2. Martins, I.; Carvalho, P.; Corte-Real, L.; Alba-Castro, J.L. BMOG: Boosted Gaussian Mixture Model with Controlled Complexity. Pattern Anal. Appl. 2018, 21, 641–654. [Google Scholar] [CrossRef]
  3. Grimson, W.E.L.; Lee, L.; Romano, R.; Stauffer, C. Using adaptive tracking to classify and monitor activities in a site. In Proceedings of the CVPR98, Santa Barbara, CA, USA, 25 June 1998; pp. 22–31. [Google Scholar]
  4. Shastry, A.C.; Schowengerdt, R.A. Airborne video registration and traffic-flow parameter estimation. IEEE Trans. Intell. Transp. Syst. 2005, 6, 391–405. [Google Scholar] [CrossRef]
  5. Melo, J.; Naftel, A.; Bernardino, A.; Santos-Victor, J. Detection and classification of highway lanes using vehicle motion trajectories. IEEE Trans. Intell. Transp. Syst. 2006, 7, 188–200. [Google Scholar] [CrossRef]
  6. Lv, Z.; Poiesi, F.; Dong, Q.; Lloret, J.; Song, H. Deep learning for intelligent human–computer interaction. Appl. Sci. 2022, 12, 11457. [Google Scholar] [CrossRef]
  7. Khlifi, A.; Othmani, M.; Kherallah, M. A novel approach to autonomous driving using double deep Q-network-based deep reinforcement learning. World Electr. Veh. J. 2025, 16, 138. [Google Scholar] [CrossRef]
  8. Park, S.; Aggarwal, J. A hierarchical Bayesian network for event recognition of human actions and interactions. Multimed. Syst. 2004, 10, 164–179. [Google Scholar] [CrossRef]
  9. Bouwmans, T.; Zahzah, E.H. Robust PCA via Principal Component Pursuit: A review for a comparative evaluation in video surveillance. Comput. Vis. Image Underst. 2014, 122, 22–34. [Google Scholar] [CrossRef]
  10. Bouwmans, T.; Javed, S.; Sultana, M.; Jung, S.K. Deep neural network concepts for background subtraction: A systematic review and comparative evaluation. Neural Netw. 2019, 117, 8–66. [Google Scholar] [CrossRef]
  11. Wang, Y.; Luo, Z.; Jodoin, P.-M. Interactive deep learning method for segmenting moving objects. Pattern Recognit. Lett. 2017, 96, 66–75. [Google Scholar] [CrossRef]
  12. Stauffer, C.; Grimson, W.E.L. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Fort Collins, CO, USA, 23–25 June 1999. [Google Scholar]
  13. Zivkovic, Z. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, 26 August 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 28–31. [Google Scholar]
  14. Barnich, O.; Van Droogenbroeck, M. ViBe: A universal background subtraction algorithm for video sequences. IEEE Trans. Image Process. 2011, 20, 1709–1724. [Google Scholar] [CrossRef]
  15. Boufares, O.; Boussif, M.; Saadaoui, W.; Miraoui, I. Moving object detection: A new method combining background subtraction, fuzzy entropy thresholding and differential evolution optimization. Acta Mech. Autom. 2025, 19, 106–116. [Google Scholar] [CrossRef]
  16. St-Charles, P.L.; Bilodeau, G.A.; Bergevin, R. SuBSENSE: A universal change detection method with local adaptive sensitivity. IEEE Trans. Image Process. 2015, 24, 359–373. [Google Scholar] [CrossRef] [PubMed]
  17. Bhandari, A.K.; Subramani, B.; Veluchamy, M. Multi-exposure optimized contrast and brightness balance color image enhancement. Digit. Signal Process. 2022, 123, 103406. [Google Scholar] [CrossRef]
  18. Veluchamy, M.; Subramani, B. Fuzzy dissimilarity contextual intensity transformation with gamma correction for color image enhancement. Multimed. Tools Appl. 2020, 79, 19945–19961. [Google Scholar] [CrossRef]
  19. Veluchamy, M.; Bhandari, A.K.; Subramani, B. Optimized Bezier Curve Based Intensity Mapping Scheme for Low Light Image Enhancement. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 6, 602–612. [Google Scholar] [CrossRef]
  20. Rahmon, G.; Palaniappan, K.; Toubal, I.E.; Bunyak, F.; Rao, R.; Seetharaman, G. DeepFTSG: Multi-stream asymmetric USE-Net trellis encoders with shared decoder feature fusion architecture for video motion segmentation. Int. J. Comput. Vis. 2024, 132, 776–804. [Google Scholar] [CrossRef]
  21. Lim, L.A.; Keles, H.Y. Foreground Segmentation Using a Triplet Convolutional Neural Network for Multiscale Feature Encoding. arXiv 2018, arXiv:1801.02225. [Google Scholar] [CrossRef]
  22. Lim, L.A.; Keles, H.Y. Foreground Segmentation Using Convolutional Neural Networks for Multiscale Feature Encoding. Pattern Recognit. Lett. 2018, 112, 256–262. [Google Scholar] [CrossRef]
  23. Tezcan, M.O.; Ishwar, P.; Konrad, J. BSUV-Net: A Fully-Convolutional Neural Network for Background Subtraction of Unseen Videos. arXiv 2019, arXiv:1907.11371. [Google Scholar]
  24. Rahmon, G.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation. In Proceedings of the International Conference on Pattern Recognition (ICPR), Milan, Italy, 13–18 September 2020. [Google Scholar]
  25. Wang, Y.; Jodoin, P.M.; Porikli, F.; Konrad, J.; Benezeth, Y.; Ishwar, P. CDnet 2014: An expanded change detection benchmark dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Columbus, OH, USA, 23–28 June 2014; pp. 387–394. [Google Scholar]
  26. Sabbu, S.; Ganesan, V. LSTM-based neural network to recognize human activities using deep learning techniques. Comput. Intell. Neurosci. 2022, 2022, 1681096. [Google Scholar] [CrossRef]
  27. Vrskova, R.; Kamencay, P.; Hudec, R.; Sykora, P. A new deep-learning method for human activity recognition. Sensors 2023, 23, 2816. [Google Scholar] [CrossRef] [PubMed]
  28. Koşar, E.; Barshan, B. A new CNN-LSTM architecture for activity recognition employing wearable motion sensor data: Enabling diverse feature extraction. Eng. Appl. Artif. Intell. 2023, 124, 106529. [Google Scholar] [CrossRef]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.H.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Gelly, S.; Uszkoreit, J.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference for Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  30. Homeyer, C.; Schnörr, C. On moving object segmentation from monocular video with transformers. In Proceedings of the ICCV Workshops, Paris, France, 2–6 October 2023. [Google Scholar]
  31. Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. TransBTS: Multimodal brain tumor segmentation using transformer. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12901, pp. 109–119. [Google Scholar] [CrossRef]
  32. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  33. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790, IEEE/CVF. [Google Scholar] [CrossRef]
  34. Ciocarlan, A.; Lefebvre, S.; Le Hégarat-Mascle, S.; Woiselle, A. Self-Supervised Learning for Real-World Object Detection: A Survey. arXiv 2024, arXiv:2410.07442. [Google Scholar]
  35. Fasana, C.; Pasini, S.; Milani, F.; Fraternali, P. Weakly Supervised Object Detection for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 5362. [Google Scholar] [CrossRef]
  36. Boufares, O.; Aloui, N.; Cherif, A. Adaptive threshold for background subtraction using stationary wavelet transforms 2D. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 312–319. [Google Scholar]
  37. Li, X.; Ng, M.K.; Yuan, X. Median filtering-based methods for static background extraction from surveillance video. Numer. Linear Algebra Appl. 2015, 22, 845–865. [Google Scholar] [CrossRef]
  38. Sobral, A.; Vacavant, A. A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos. Comput. Vis. Image Underst. 2014, 122, 4–21. [Google Scholar] [CrossRef]
  39. Lim, L.A.; Keles, H.Y. Learning Multi-scale Features for Foreground Segmentation. arXiv 2018, arXiv:1808.01477. [Google Scholar]
  40. Gao, F.; Li, Y.; Lu, S. Extracting moving objects more accurately: A CDA contour optimizer. In IEEE Transactions on Circuits and Systems for Video Technology; IEEE: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  41. Tezcan, M.O.; Ishwar, P.; Konrad, J. BSUV-Net 2.0: Spatio-temporal data augmentations for video-agnostic supervised background subtraction. IEEE Access 2021, 9, 53841–53855. [Google Scholar] [CrossRef]
  42. Mohammadreza, B.; Dinh, D.T.; Rigoll, G. A Deep Convolutional Neural Network for Video Sequence Background Subtraction. Pattern Recognit. 2018, 76, 635–649. [Google Scholar] [CrossRef]
  43. Braham, M.; Piérard, S.; Van Droogenbroeck, M. Semantic Background Subtraction. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017. [Google Scholar]
  44. Wang, Y.; Gao, X.; Sun, Y.; Wang, L.; Liu, M. Sh-DeepLabv3+: An improved semantic segmentation lightweight network for corn straw cover form plot classification. Agriculture 2024, 14, 628. [Google Scholar] [CrossRef]
  45. Chen, H.; Qin, Y.; Liu, X.; Wang, H.; Zhao, J. An improved DeepLabv3+ lightweight network for remote-sensing image semantic segmentation. Multimed. Tools Appl. 2024, 83, 11069–11091. [Google Scholar] [CrossRef]
  46. Kwak, J.; Sung, Y. DeepLabV3-refiner-based semantic segmentation model for dense 3D point clouds. Remote Sens. 2021, 13, 1565. [Google Scholar] [CrossRef]
  47. Liu, Y.; Bai, X.; Wang, J.; Li, G.; Li, J.; Lv, Z. Image Semantic Segmentation Approach Based on DeepLabV3 Plus Network with an Attention Mechanism. Eng. Appl. Artif. Intell. 2023, 125, 107260. [Google Scholar] [CrossRef]
  48. Zheng, W.; Wang, K.; Wang, F.-Y. Background Subtraction Algorithm Based on Bayesian Generative Adversarial Networks. Acta Autom. Sin. 2018, 44, 878–890. [Google Scholar] [CrossRef]
  49. Zheng, W.; Wang, K.; Wang, F.-Y. A Novel Background Subtraction Algorithm Based on Parallel Vision and Bayesian GANs. Neurocomputing 2020, 394, 178–200. [Google Scholar] [CrossRef]
  50. Jiang, M.; Deng, C.; Pan, Z.G.; Wang, L.; Sun, X. Multiobject tracking in videos based on lstm and deep reinforcement learning. Complexity 2018, 2018, 4695890. [Google Scholar] [CrossRef]
  51. Wang, Q.; Huang, H.; Zhong, Y.; Duan, Y. Swin transformer based on two-fold loss and background adaptation re-ranking for person reidentification. Electronics 2022, 11, 1941. [Google Scholar] [CrossRef]
  52. Kapoor, M.; Prummel, W.; Giraldo, J.H.; Subudhi, B.N.; Zakharova, A.; Bouwmans, T.; Bansal, A. Graph-Based Moving Object Segmentation for Underwater Videos Using Semi-Supervised Learning. Comput. Vis. Image Underst. 2025, 252, 104290. [Google Scholar] [CrossRef]
  53. Zhang, Q.; Zhu, Y.; Cordeiro, F.R.; Chen, Q. PSSCL: A progressive sample selection framework with contrastive loss designed for noisy labels. Pattern Recognit. 2025, 161, 111284. [Google Scholar] [CrossRef]
  54. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; Available online: https://arxiv.org/abs/1409.1556 (accessed on 18 September 2025).
  55. Cioppa, A.; Van Droogenbroeck, M.; Braham, M. Real-Time Semantic Background Subtraction. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: New York, NY, USA, 2020; pp. 3214–3218. [Google Scholar] [CrossRef]
Figure 1. Tree diagram illustrating the taxonomy of major supervised approaches for background subtraction and moving object segmentation. Methods are categorized by architectural families: CNN/FCN, U-Net, GAN, FgSegNet, Transformers, lightweight models, and hybrid approaches.
Figure 2. The FgSegNet architecture.
Figure 3. ResNet-18 encoder backbone used in Motion U-Net with five convolution layers.
Figure 4. Network architecture of BSUV-Net.
Figure 5. Semantic segmentation algorithm.
Figure 6. F-Measure comparison among FgSegNet_M, BSGAN, MU-Net1, BSUV-Net, DeepBS, and RT-SBS-v2. Notably, FgSegNet_M achieves the highest performance across all categories.
Figure 7. F-Measure comparison among FgSegNet_v2_CO, BSPVGAN, MU-Net2, Fast BSUV-Net 2.0, Cascade CNN, and BSUV-Net + SemanticBGS. Notably, FgSegNet_v2_CO achieves the highest performance across all categories.
Figure 8. First row: (A) Performance gain between classical cascaded CNN-based refinement methods and early encoder–decoder architectures. (B) Performance gap among representative top deep neural networks (FgSegNet_v2_CO, FgSegNet_M, BSPVGAN) under challenging CDnet2014 scenarios. Second row: (C) Performance gain from cascaded CNNs to state-of-the-art encoder–decoder architectures (FgSegNet_v2_CO). (D) Performance comparison within the MU-Net family, highlighting the gain from MU-Net1 to MU-Net2.
Figure 9. First row: (A) Performance gain between classical CNN-based background subtraction (DeepBS) and advanced supervised deep learning architectures. (B) Performance gap among mixed top deep neural networks under challenging CDnet2014 scenarios. Second row: (C) Performance gain from traditional CNN-based methods (DeepBS) to a state-of-the-art encoder–decoder model (FgSegNet_v2_CO). (D) Performance comparison within the GAN family, highlighting the gain achieved by advanced adversarial architectures.
Figure 10. Qualitative evaluation of supervised foreground segmentation techniques on diverse CDnet2014 scenarios: Bad Weather, Shadow, Camera Jitter, Dynamic Background, and Baseline.
Figure 11. Qualitative evaluation of supervised foreground segmentation techniques on diverse CDnet2014 scenarios: Low Framerate, Night Videos, PTZ, Turbulence, Intermittent Object Motion, and Thermal.
Table 1. Average evaluation metrics of object detection methods on the CDnet2014 dataset (Re = recall, Sp = specificity, FPR = false positive rate, FNR = false negative rate, PWC = percentage of wrong classifications, F-M = F-measure).

| Group Description | Method | Re | Sp | FPR | FNR | PWC | F-M | Precision |
|---|---|---|---|---|---|---|---|---|
| CNN-Encoder-Decoder | FgSegNet_v2_CO | 0.989 | 0.9998 | 0.0002 | 0.011 | 0.0395 | 0.985 | 0.9828 |
| | FgSegNet_v2_GOP | 0.989 | 0.9998 | 0.0002 | 0.011 | 0.0395 | 0.985 | 0.9828 |
| | FgSegNet_v2 | 0.9891 | 0.9998 | 0.0002 | 0.0109 | 0.0402 | 0.9847 | 0.9823 |
| | FgSegNet_S | 0.9896 | 0.9997 | 0.0003 | 0.0104 | 0.0461 | 0.9804 | 0.9751 |
| | FgSegNet_M | 0.9836 | 0.9998 | 0.0002 | 0.0164 | 0.0559 | 0.977 | 0.9758 |
| GAN-Based | BSPVGAN | 0.9544 | 0.999 | 0.001 | 0.0456 | 0.2272 | 0.9501 | 0.9472 |
| | BSGAN | 0.9476 | 0.9983 | 0.0017 | 0.0524 | 0.3281 | 0.9339 | 0.9232 |
| Motion-Aware Encoder-Decoder | MU-Net2 | 0.9454 | 0.9991 | 0.0009 | 0.0546 | 0.2347 | 0.9369 | 0.9407 |
| | MU-Net1 | 0.9277 | 0.999 | 0.001 | 0.0723 | 0.2097 | 0.9147 | 0.9414 |
| CNN Refinement | DeepBS | 0.7545 | 0.9905 | 0.0095 | 0.2455 | 1.9920 | 0.7458 | 0.8332 |
| | Cascade CNN | 0.9506 | 0.9968 | 0.0032 | 0.0494 | 0.4052 | 0.9209 | 0.8997 |
| Fully Convolutional | BSUV-Net | 0.8203 | 0.9946 | 0.0054 | 0.1797 | 1.1402 | 0.7868 | 0.8113 |
| | BSUV-Net 2.0 | 0.8136 | 0.9979 | 0.0021 | 0.1864 | 0.7614 | 0.8387 | 0.9011 |
| | Fast BSUV-Net 2.0 | 0.8181 | 0.9956 | 0.0044 | 0.1819 | 0.9054 | 0.8039 | 0.8425 |
| Hybrid Methods | RT-SBS-v2 | 0.8361 | 0.9941 | 0.0059 | 0.1639 | 0.9439 | 0.8045 | 0.7934 |
| | BSUV-Net + SemanticBGS | 0.8179 | 0.9944 | 0.0056 | 0.1821 | 1.1326 | 0.7986 | 0.8319 |
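The column abbreviations above follow the standard CDnet2014 per-pixel metrics, derived from true/false positive and negative counts. As a reference, a minimal NumPy sketch of how these quantities are computed from a binary prediction and ground-truth mask is given below; the example masks are synthetic.

```python
import numpy as np

def cdnet_metrics(pred, gt):
    """Standard CDnet2014 metrics from binary prediction and ground-truth masks.

    pred, gt: boolean arrays of identical shape (True = foreground).
    Returns Recall (Re), Specificity (Sp), FPR, FNR, Percentage of
    Wrong Classifications (PWC), Precision, and F-Measure.
    """
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)

    re = tp / (tp + fn)                       # recall / sensitivity
    sp = tn / (tn + fp)                       # specificity
    fpr = fp / (fp + tn)                      # false positive rate
    fnr = fn / (tp + fn)                      # false negative rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * re / (precision + re)

    return {"Re": re, "Sp": sp, "FPR": fpr, "FNR": fnr,
            "PWC": pwc, "Precision": precision, "F-Measure": f_measure}

# Synthetic example (a real evaluation averages these quantities over the
# labeled frames of each CDnet2014 video and category):
gt = np.zeros((240, 320), dtype=bool)
gt[100:140, 150:200] = True
pred = np.zeros_like(gt)
pred[102:142, 148:198] = True
print(cdnet_metrics(pred, gt))
```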
Table 2. Performance (F-measure) under selected challenging CDnet2014 categories (B.W = Bad Weather, N.V = Night Videos, PTZ = pan–tilt–zoom, D.B = Dynamic Background, C.J = Camera Jitter, I.O.M = Intermittent Object Motion, SHD = Shadow).

| Method | B.W | N.V | PTZ | D.B | C.J | I.O.M | SHD |
|---|---|---|---|---|---|---|---|
| FgSegNet_v2_CO | 0.9905 | 0.9740 | 0.9864 | 0.9959 | 0.9971 | 0.9964 | 0.9958 |
| FgSegNet_M | 0.9845 | 0.9655 | 0.9843 | 0.9958 | 0.9954 | 0.9951 | 0.9937 |
| BSPVGAN | 0.9644 | 0.9001 | 0.9486 | 0.9849 | 0.9893 | 0.9951 | 0.9849 |
| BSGAN | 0.9465 | 0.8965 | 0.9194 | 0.9763 | 0.9828 | 0.9366 | 0.9680 |
| Cascade CNN | 0.9431 | 0.8965 | 0.9168 | 0.9658 | 0.9758 | 0.8505 | 0.9414 |
| DeepBS | 0.8301 | 0.5835 | 0.3133 | 0.8761 | 0.8990 | 0.6098 | 0.9304 |
| MU-Net2 | 0.9343 | 0.8362 | 0.8185 | 0.9892 | 0.9824 | 0.9894 | 0.9845 |
| MU-Net1 | 0.9319 | 0.8575 | 0.7946 | 0.9836 | 0.9802 | 0.9872 | 0.9845 |
| BSUV-Net | 0.8844 | 0.6987 | 0.6282 | 0.7967 | 0.7743 | 0.7499 | 0.9233 |
| Fast BSUV-Net 2.0 | 0.8909 | 0.6551 | 0.5014 | 0.7320 | 0.8828 | 0.9016 | 0.8890 |
| RT-SBS-v2 | 0.8279 | 0.5599 | 0.5808 | 0.9217 | 0.8233 | 0.8946 | 0.9497 |
Table 3. Reported processing time and configuration of supervised background subtraction methods on 320 × 240 videos.

| Method | Group Description | FPS | Hardware | Key Parameters |
|---|---|---|---|---|
| FgSegNet_v2_CO | CNN-Encoder-Decoder | 231 (post-process only) | Intel i5-8300H CPU, NVIDIA GTX | R = 20, Cp = 1, Cmin = 0, Cmax = 36, D1 = 5, D2 = 11 |
| FgSegNet_v2 | CNN-Encoder-Decoder | 23 | NVIDIA GTX 970 | 200 frames, val_split = 0.2, threshold = 0.9, lr = 1 × 10−4 |
| FgSegNet_S (FPM) | CNN-Encoder-Decoder | 21 | NVIDIA GTX 970 | 200 frames, threshold = 0.8, dropout = 0.5, reg = 5 × 10−4 |
| FgSegNet_M | CNN-Encoder-Decoder | 18 | NVIDIA GTX 970 | 200 frames, epoch = 50, threshold = 0.8, dropout = 0.5, reg = 5 × 10−4 |
| Cascade CNN | CNN Refinement | ~2 | NVIDIA GTX 970 | Epochs = 20, lr = 0.01, batch = 5, threshold = 0.7, Adadelta |
| DeepBS | CNN Refinement | 10 | Xeon E5-1620 + GeForce Titan X | CNN (3 conv + 2 FC layers) |
| MU-Net1 | Motion-Aware Encoder-Decoder | ~35 | NVIDIA Tesla V100 | Epochs = 40, Adam, lr = 1 × 10−4, batch = 8 |
| MU-Net2 | Motion-Aware Encoder-Decoder | ~35 | NVIDIA Tesla V100 | Epochs = 40, Adam, lr = 1 × 10−4, batch = 8 |
| BSUV-Net | Fully Convolutional | ~6 | NVIDIA Titan-X | lr = 1 × 10−4, batch = 8, input = 224 × 224 |
| BSUV-Net 2.0 | Fully Convolutional | ~6 | NVIDIA Tesla P100 | lr = 1 × 10−4, batch = 8, input = 224 × 224 |
| Fast BSUV-Net 2.0 | Fully Convolutional | ~29 | NVIDIA Tesla P100 | Reduced input (no semantics), lr = 1 × 10−4, batch = 8 |
| BSUV-Net + SemanticBGS | Hybrid Methods | ~6 | NVIDIA Titan-X | BSUV-Net + semantic segmentation, lr = 1 × 10−4, batch = 8 |
| RT-SBS-v2 | Hybrid Methods | 25 | NVIDIA Tesla V100 | Semantic + traditional fusion, Bayesian optimization |
| BSPVGAN | GAN-Based | ~5 | Intel i7-6700HQ, 4 GB RAM | Epochs = 300, Adam (momentum = 0.5), batch = 64, lr = 0.001 |
| BSGAN | GAN-Based | ~5 | Intel i7-6700HQ, 4 GB RAM | Epochs = 300, Adam (momentum = 0.5), batch = 64, lr = 0.001 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
