1. Introduction
Pedestrian detection, as a core task in computer vision, aims to rapidly identify and accurately localize pedestrian targets within images or videos. It has found widespread application in critical scenarios such as security surveillance, industrial inspection, and autonomous driving [1]. Under normal lighting conditions, deep learning-based object detection algorithms have achieved significant progress [2,3]. However, when the scene shifts to low-light environments at night, visible light image quality deteriorates sharply. Prominent challenges arise, including uneven illumination, blurred object contours, significant background noise, and highly variable object scales, posing severe difficulties for pedestrian detection [4,5]. Although hyperspectral image processing demonstrates highly efficient feature enhancement capabilities through multi-band information extraction and fusion under normal illumination [6,7,8,9], its performance degrades under low-light conditions at night due to noise interference and spectral information attenuation. While this provides insights for visible-light single-modality detection, it also highlights its limitations.
Early pedestrian detection methods primarily relied on convolutional neural networks (CNNs) and can be categorized into two major schools: two-stage and single-stage approaches. Two-stage algorithms such as R-CNN [10] and Faster R-CNN [11] gained attention for their high detection accuracy, but their slower processing speeds made them unsuitable for scenarios demanding high real-time performance [12]. With the advancement of deep learning, single-stage algorithms have gradually become mainstream. Among them, the YOLO series [13,14,15,16,17,18,19,20] offers high efficiency by eliminating the need for bounding box proposals and directly predicting both object classification and location. However, these CNN-based detectors remain constrained by their sensitivity to low-light features in nighttime scenarios. For instance, the MDCFVit-YOLO model proposed by Zhang et al. [21] incorporates a ViT module to enhance features for low-light small-object detection [22], yet it remains sensitive to nighttime noise, leading to reduced precision for small-scale pedestrians; it also lacks adaptive high-frequency enhancement and incurs high computational overhead, making it inefficient at processing multi-scale targets with blurred boundaries. Li et al. [23] introduced the YOLO-FFRD model with a fast fusion module for dynamic small-scale pedestrian detection; however, it suffers from background noise interference under low-light conditions and insufficient feature extraction capability, resulting in high miss rates for distant small-scale pedestrians. Zou et al. [24] improved the YOLOv8-ECSH model for nighttime pedestrian and vehicle detection, utilizing the ECSH module to enhance recall; however, it suffers from insufficient feature recalibration under nighttime noise and relatively static multi-scale fusion, making it difficult to adapt to small pedestrians in shadows [25]. Overall, CNN-based detectors are constrained by their local receptive fields, making it challenging to effectively capture global illumination changes; they are sensitive to noise in complex nighttime scenes and exhibit insufficient feature extraction capability [26].
In recent years, Transformer-based detectors have demonstrated tremendous potential [27,28]. The DETR model proposed by Carion et al. [29] achieves true end-to-end detection by replacing traditional anchor designs with a query mechanism. Zhu et al. [30] further enhanced computational efficiency with Deformable DETR through a deformable attention mechanism. Notably, Baidu's RT-DETR [31] achieved real-time performance for Transformer-based detectors for the first time while maintaining high precision, by introducing an efficient hybrid encoder and an IoU-aware query selection mechanism. However, despite the global modeling capability of Transformer architectures, they still face significant limitations in nighttime low-light detection tasks [32,33]. First, attention mechanisms are sensitive to noise: uneven illumination and background artifacts prevalent in nighttime images easily divert attention resources [34]. For instance, while the Swin DETR of Huang et al. [35] optimizes nighttime detection through local attention, its attention remains susceptible to noise distraction, resulting in low efficiency for small-pedestrian detection and high computational overhead. Second, fine-grained features are insufficiently captured and fused: global attention struggles to perceive locally blurred contours, and existing feature enhancement methods remain inadequate in nighttime scenarios. Li et al. [36] applied FATNet to DETR, enhancing detection through feature attention; however, this mechanism remains sensitive to nighttime artifacts and illumination interference, resulting in insufficient feature interaction, blurred boundaries, and missed detections, and it also suffers from slow training convergence and limited multi-scale processing capability [37]. Furthermore, computational efficiency and multi-scale adaptability remain critical challenges: while introducing complex mechanisms improves performance, model complexity often increases, hindering deployment. Traditional Transformers also provide relatively limited mechanisms for cross-scale feature fusion, making it difficult to adapt dynamically to the drastic scale variations inherent in nighttime scenes [38,39].
To overcome the limitations of single-modal visible light data, multimodal fusion methods have been extensively explored. The DETR-based multimodal fusion approach proposed by Zhao et al. [40] enhances nighttime pedestrian detection performance by integrating visible light and millimeter-wave radar data. Such methods leverage the complementary advantages of different modalities, improving detection effectiveness under low-light conditions to a certain extent. However, they also introduce new challenges: their performance heavily relies on precise pairing and high-quality multimodal data, computational complexity increases significantly, system deployment costs are high, and they struggle to function effectively in mainstream scenarios where only visible light cameras are available. Consequently, due to their strong hardware dependency, system complexity, and limited generalization capabilities, multimodal approaches face significant limitations in general-purpose nighttime visible light pedestrian detection tasks.
Single-modal detection constitutes a core technological branch within nighttime target detection, with the thermal imaging modality demonstrating significant performance potential owing to its inherent physical advantages. This technique generates images by capturing infrared radiation emitted by targets, rendering it entirely independent of illumination conditions. It inherently possesses high target-to-background discrimination capability, enabling stable detection without reliance on complex feature enhancement modules. For instance, the lightweight thermal imaging network proposed by Jhong et al. [41], through structural simplification and quantization optimization, successfully achieved efficient deployment in vehicle-to-everything (V2X) edge scenarios, further validating the technical feasibility of thermal imaging for nighttime tasks. However, thermal imaging solutions face serious engineering bottlenecks: the hardware manufacturing costs of their core sensors far exceed those of visible-light cameras, and their adoption rate remains extremely low in consumer-grade terminals such as standard in-vehicle cameras and civilian surveillance equipment.
Despite the significant engineering value of single-modal visible light detection, its technical challenges far exceed those of thermal imaging solutions. A review of existing research reveals that current visible light nighttime detection methods still face notable bottlenecks. On the one hand, CNN-based detectors are constrained by their local receptive fields, making it difficult to handle global illumination variations in nighttime scenes effectively [42,43], while also being sensitive to background noise. On the other hand, although the Transformer architecture possesses global modeling capability, its attention mechanism is prone to distraction by illumination artifacts and extraneous light sources within nighttime imagery, resulting in inadequate capture of features for blurred, small-scale pedestrians and elevated false-negative rates. Furthermore, while multimodal fusion approaches can enhance performance through cross-modal information complementarity, their effectiveness heavily relies on precise multimodal data pairing, which significantly increases hardware complexity and deployment costs and renders them unsuitable for mainstream devices equipped solely with visible light sensors. In summary, constructing a detection model under single visible light conditions that balances global and local feature perception, effectively suppresses noise interference, and maintains high computational efficiency has become a critical issue requiring urgent breakthroughs.
Addressing the three core challenges in nighttime visible-light pedestrian detection—weak feature representation, noise interference, and multi-scale adaptation difficulties—this paper proposes the multi-level adaptive model BNE-DETR, using RT-DETR-R18 as the baseline. The core innovation lies in constructing an end-to-end solution tailored for nighttime signal degradation within a lightweight Transformer architecture. This is achieved through a synergistic module design integrating feature enhancement, spatial awareness, and multi-scale fusion. Key contributions are as follows:
Introduction of the SECG Feature Enhancement Module: Replacing the C2f bottleneck layer, this module integrates single-head self-attention (enhancing pedestrian contours), dynamic gating (filtering noisy attention), and convolutional gating units (calibrating feature weights). This addresses weak features and noise interference, providing subsequent modules with high signal-to-noise-ratio features.
Design of the AIFI-SEFN Spatial Awareness Module: A dual-branch feedforward network is introduced within the encoder. Through gated fusion of global attention semantics and the original spatial structure, it prevents distraction by invalid light sources, enhances spatial localization accuracy for blurred targets, and compensates for the spatial awareness bias inherent in traditional Transformers.
Construction of the MANStar Multi-Scale Fusion Module: Centered on a star topology, this module combines depthwise separable convolutions (reducing computational redundancy at small scales) with gated feature selection (dynamically allocating scale weights). It adapts to multi-scale variations in synergy with the SECG and SEFN features, resolving the inadequate adaptability of traditional fusion methods to nighttime scale degradation.
Model effectiveness is validated across multiple nighttime pedestrian datasets. Experimental results demonstrate that the proposed model outperforms mainstream methods in key metrics, including Precision, Recall, and mAP50, while maintaining low computational complexity and parameter count.
The organizational structure of this paper is as follows:
Section 2 reviews the research progress on foundational models and related techniques.
Section 3 details the three core modules and the overall network architecture proposed in this paper.
Section 4 presents the experimental results and ablation studies for the proposed model and its individual modules, further conducting generalization experiments on several public datasets.
Section 5 summarizes the contributions of this research and discusses future work directions.
3. Methods
3.1. Improved Model BNE-DETR
In this paper, we propose an enhanced RT-DETR model, BNE-DETR, which addresses the unique challenges of nighttime visible-light scenarios by combining feature enhancement mechanisms with multi-scale adaptive fusion. The model comprises three core modules: the Adaptive Screening Enhancement Module (C2f-SECG), the Spatially Enhanced Attention Feedforward Module (AIFI-SEFN), and the Mixed Aggregation Network with Star Blocks (MANStar). While preserving the real-time capability of the RT-DETR-R18 baseline architecture, our model integrates three key enhancement components: C2f-SECG introduces single-head self-attention and a dynamic gating mechanism at the backbone stage, effectively addressing nighttime illumination variations and object degradation through feature purification and contour enhancement; AIFI-SEFN constructs a spatially enhanced feedforward network at the encoder layer, whose dual-path feature selection mechanism focuses computational resources on critical pedestrian regions, significantly improving feature discriminability in noisy environments; and MANStar employs a hybrid-aggregation star topology during feature fusion, achieving adaptive representation of multi-scale pedestrian features through multi-branch collaboration and gated selection mechanisms. Through the synergistic interaction of these three modules, BNE-DETR achieves significant detection performance improvements in complex nighttime scenarios, particularly excelling at handling challenges such as uneven illumination, blurred objects, and multi-scale distributions. The improved model architecture is illustrated in Figure 2.
3.2. C2f-SECG Module
To alleviate the insufficient feature representation caused by blurred pedestrian contours and background noise interference in nighttime visible light images, this section designs the C2f-SECG module, as shown in Figure 3. This module replaces the bottleneck layer in C2f with the designed SECG Block. By integrating depthwise convolutions, incorporating single-head self-attention (SHSA [55]) with dynamic gating (EPGO [56]), and utilizing Convolutional Gated Linear Units (CGLU [57]), it constructs a multi-level, adaptive feature enhancement module that effectively improves the model's pedestrian feature representation capability under low-light conditions.
The C2f-SECG module adopts a cross-stage partial fusion design philosophy, dividing the input feature stream into two parallel processing paths. One path preserves the original feature stream to maintain gradient flow and critical information, while the other path processes features through a series of our proposed SECG modules. This architecture ensures efficient computation while enabling deep multi-level feature extraction, which is crucial for identifying low-contrast pedestrians against noisy nighttime backgrounds.
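A minimal PyTorch-style sketch of this cross-stage layout is given below. The class names, the half-channel split, and the placeholder SECGBlock body are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SECGBlock(nn.Module):
    """Stand-in for the SECG block detailed later; a simple residual depthwise-pointwise pair."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise 3x3
            nn.Conv2d(channels, channels, 1),                              # pointwise 1x1
            nn.SiLU(),
        )

    def forward(self, x):
        return x + self.body(x)

class C2fSECG(nn.Module):
    """Cross-stage partial fusion: one split half is kept as-is, the other is passed
    through a chain of SECG blocks; all streams are concatenated and fused by a 1x1 conv."""
    def __init__(self, c_in: int, c_out: int, n: int = 2):
        super().__init__()
        self.hidden = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.hidden, 1)
        self.blocks = nn.ModuleList(SECGBlock(self.hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.hidden, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # [identity path, processing path]
        for block in self.blocks:
            y.append(block(y[-1]))              # keep every intermediate output
        return self.cv2(torch.cat(y, dim=1))

# usage: out = C2fSECG(64, 64)(torch.randn(1, 64, 40, 40))
```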
The SECG Block, serving as the core processing unit of the C2f-SECG module, employs three key components to achieve adaptive processing of nighttime scenes, as shown in Figure 3b. The module's input feature map X ∈ ℝ^(B×C×H×W), where B denotes the batch size, C the number of channels, and H and W the height and width of the feature map, first undergoes a feature preprocessing stage. To address the dual requirements of parameter efficiency and feature retention in nighttime scenes, the module employs a combined strategy of separable convolutions and residual connections [58]:
This design significantly reduces the number of parameters while preserving local feature extraction capabilities, effectively preventing overfitting on limited nighttime data. The introduction of residual structures ensures that subtle yet critical pedestrian contour information is transmitted losslessly to deeper layers, preventing the loss of valuable clues during the initial feature transformation stage.
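For illustration, this preprocessing step can be sketched as a depthwise plus pointwise convolution wrapped in a residual shortcut; the kernel size and normalization choice here are assumptions.

```python
import torch
import torch.nn as nn

class SECGPreprocess(nn.Module):
    """Separable convolution with a residual shortcut: cheap local feature extraction
    while the original activations, including faint contours, pass through unchanged."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # depthwise
        self.pw = nn.Conv2d(channels, channels, 1)                              # pointwise
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        return x + self.norm(self.pw(self.dw(x)))

# usage: y = SECGPreprocess(64)(torch.randn(1, 64, 40, 40))
```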
The preprocessed features enter the core component SHEG, which splits the input features into two parts: one for attention calculation and the other retained in its original state, as shown in Figure 4. Specifically, a leading subset of the channels is used for attention calculation, while the remaining channels retain the original feature information:
This channel partitioning enables the model to leverage the attention mechanism for capturing global context while relying on the raw features to maintain precise spatial localization, which is particularly crucial when pedestrian boundaries become blurred due to insufficient illumination.
To address the unstable feature distribution caused by uneven lighting, we apply group normalization to the attention-branch features within the convolutional layer before projecting them onto the query and key matrices:
Setting the query and key dimension to a small fixed value essentially forces the attention mechanism to focus on the most discriminative components within the noise-dense nighttime features, preventing the dispersion of limited attention resources on irrelevant noise. The core innovation of the SHEG component lies in the design of its dynamic gating optimization mechanism, which directly addresses the uneven light distribution characteristic of nighttime scenes:
The gating network generates spatial weight maps through two 1 × 1 convolutional layers followed by a sigmoid activation function σ(·). The gated output, averaged across the entire feature map, reflects the importance of the current features; N denotes the sequence length, while k is the number of attention connections dynamically selected according to this feature importance. This mechanism adaptively adjusts the attention scope based on local information density, attending to finer detail in well-lit areas while concentrating on key contours in dim regions.
Subsequently, attention computation employs dynamic sparsity:
The scaling factor prevents vanishing gradients caused by the low vector dimension. Next is the dynamic mask generation process: first, a zero-filled mask matrix of size N × N is created; the Top-k operation then identifies the k positions with the highest attention weights and marks them as 1 in the mask matrix; finally, the attention weights at positions masked as 0 are set to negative infinity, ensuring that after the softmax operation the probabilities at these positions approach zero. The masked attention weights are normalized through the softmax function:
The sparse attention mechanism automatically filters out unimportant regions in nighttime scenes, focusing on key features of pedestrian targets. The sparse attention then aggregates the value vectors, which are concatenated with the unprocessed features along the channel dimension. After SiLU activation and 1 × 1 projection, the final output is obtained. This design effectively reduces computational complexity while preserving the expressive power of the attention mechanism. The specific implementation formula is as follows:
Here, the 1 × 1 projection layer effectively fuses the attention-processed features with the retained original features to generate the final enhanced feature representation, providing high-quality input for subsequent CGLU processing. The features processed by the attention mechanism then enter the CGLU module, as shown in Figure 5.
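Before turning to the CGLU, the dynamically sparse single-head attention described above can be sketched roughly as follows. The channel split ratio, the query/key dimension, the gate width, and the rule mapping the average gate response to the number of retained connections k are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SHEGAttention(nn.Module):
    """Sketch of gated, dynamically sparse single-head attention: a slice of the channels
    is attended over, a small gate predicts spatial importance, and the attention matrix
    is restricted to the top-k strongest connections."""
    def __init__(self, dim: int, attn_ratio: float = 0.5, qk_dim: int = 16):
        super().__init__()
        self.attn_dim = int(dim * attn_ratio)          # channels used for attention
        self.qk_dim = qk_dim                           # small fixed query/key dimension
        self.norm = nn.GroupNorm(1, self.attn_dim)
        self.qkv = nn.Conv2d(self.attn_dim, 2 * qk_dim + self.attn_dim, 1)
        self.gate = nn.Sequential(                     # two 1x1 convs + sigmoid
            nn.Conv2d(self.attn_dim, self.attn_dim, 1),
            nn.Conv2d(self.attn_dim, 1, 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Sequential(nn.SiLU(), nn.Conv2d(dim, dim, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        x_att, x_keep = torch.split(x, [self.attn_dim, c - self.attn_dim], dim=1)

        # dynamic sparsity: the average gate response decides how many connections to keep
        importance = self.gate(x_att).mean()
        k = max(1, int(n * importance.item()))

        q, key, v = torch.split(self.qkv(self.norm(x_att)),
                                [self.qk_dim, self.qk_dim, self.attn_dim], dim=1)
        q = q.flatten(2).transpose(1, 2)               # (b, n, qk_dim)
        key = key.flatten(2)                           # (b, qk_dim, n)
        v = v.flatten(2).transpose(1, 2)               # (b, n, attn_dim)

        attn = (q @ key) / (self.qk_dim ** 0.5)        # (b, n, n) scaled scores
        topk = attn.topk(k, dim=-1).indices
        mask = torch.zeros_like(attn).scatter_(-1, topk, 1.0)
        attn = attn.masked_fill(mask == 0, float("-inf")).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, self.attn_dim, h, w)
        return self.proj(torch.cat([out, x_keep], dim=1))   # SiLU + 1x1 projection

# usage: y = SHEGAttention(64)(torch.randn(1, 64, 20, 20))
```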
The CGLU first performs a linear transformation on the channel dimension via a 1 × 1 convolution, then splits it into two parts:
Here, one half provides the gate-signal branch features, while the other half carries the value-signal branch features. The gate branch further extracts spatial features through a 3 × 3 depthwise convolution (DWConv) and employs the GELU activation function to enhance nonlinear expressive capability:
The depthwise convolution captures local spatial patterns, while the GELU function, owing to its smooth gradient, improves optimization stability under low-quality image conditions. Subsequently, feature selection and weighting are achieved through element-wise multiplication:
Here, ⊙ denotes element-wise multiplication. This gating mechanism enables the model to dynamically adjust the information flow according to spatial location and feature-channel importance, making it particularly suitable for scenarios in which targets and background are poorly distinguished, such as nighttime environments. Finally, the gated features undergo a 1 × 1 convolution projection and are fused with the original input via a residual connection:
The residual structure not only alleviates the vanishing gradient problem in deep networks but also ensures the complete preservation of original features, providing dual safeguards against potential feature weakening in nighttime images.
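A compact sketch of the CGLU path described above is shown below; the channel expansion ratio is an assumption.

```python
import torch
import torch.nn as nn

class CGLU(nn.Module):
    """Convolutional Gated Linear Unit sketch: expand with a 1x1 convolution, split into
    gate/value halves, refine the gate with a 3x3 depthwise convolution + GELU,
    multiply element-wise, project back, and add the residual."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, 2 * hidden, 1)                           # linear expansion
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)   # gate-branch DWConv
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)                               # projection back

    def forward(self, x):
        gate, value = self.fc1(x).chunk(2, dim=1)     # gate signal / value signal
        gated = self.act(self.dw(gate)) * value       # element-wise feature selection
        return x + self.fc2(gated)                    # residual keeps the original features

# usage: y = CGLU(64)(torch.randn(1, 64, 20, 20))
```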
The entire C2f-SECG module adheres to the design principle of residual connections:
The original input features are combined with the output features processed through all steps to form the final module output.
The C2f-SECG module design incorporates targeted optimizations for nighttime pedestrian detection scenarios: dynamically sparse single-head self-attention (SHSA) captures global contextual information, while 3 × 3 depthwise convolutions extract local spatial patterns. These components complement each other functionally, effectively enhancing the feature representation of blurred pedestrian contours in nighttime environments. The gating mechanism selectively transmits effective feature information, significantly reducing interference from nighttime background noise and further enhancing feature discriminability. The Convolutional Gated Linear Unit (CGLU) integrates the local modeling capability of convolutions with the global regulatory properties of the gating mechanism, providing crucial technical support for the efficient representation of pedestrian features in low-light environments.
3.3. AIFI-SEFN Module
The AIFI-SEFN module addresses the core challenges of inadequate multi-scale feature fusion and complex background interference in nighttime visible-light pedestrian detection. The standard feedforward network in the traditional AIFI architecture performs poorly in nighttime scenarios, as its uniform feature processing struggles to handle the variations in feature quality caused by uneven illumination and fails to effectively enhance blurred pedestrian contour information [59]. To overcome this limitation, we replace the standard FFN with the Spatially Enhanced Feedforward Network (SEFN) [60]. Through a dual-branch architecture and a gated fusion mechanism, SEFN enables refined processing of spatial information, an enhancement particularly well suited to the feature enhancement demands of low-light nighttime environments.
As shown in Figure 6, this module first receives the input feature map from the backbone network. Considering that input features in nighttime scenes often suffer from blurred pedestrian contours and redundant background noise, the module first preserves the original spatial feature map to retain complete spatial position information. The feature map is then flattened into a [B, H × W, C] sequence format to accommodate the computational requirements of the self-attention mechanism and undergoes linear projection to generate the query (Q), key (K), and value (V) components. Recognizing the critical role of spatial position information for target localization in nighttime scenes, the module employs a 2D sine-cosine position encoding to enhance spatial perception. This encoding is generated via the following formula:
Here, pos denotes the position index, i the dimension index, and d the feature dimension. By integrating the encoding with Q and K, respectively, we obtain position-aware queries and keys. This enables the self-attention mechanism to account for spatial positional relationships when computing feature similarity, thereby adapting to the modeling requirements of nighttime pedestrian targets across varying positions and scales. The output of the multi-head self-attention is combined with the original input features via a residual connection and undergoes layer normalization, ultimately yielding the following formula:
This process effectively alleviates the vanishing gradient problem in deep network training, ensuring the stability of complex feature learning during nighttime.
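A commonly used construction of the 2D sine-cosine position embedding described above is sketched below; the temperature constant and the channel layout (x-frequency terms followed by y-frequency terms) are assumptions.

```python
import torch

def sincos_pos_embed_2d(h: int, w: int, dim: int, temperature: float = 10000.0):
    """Build a 2D sine-cosine position embedding of shape (h*w, dim). Half of the
    channels encode the x coordinate and half the y coordinate, each with sin/cos
    terms of geometrically spaced frequencies."""
    assert dim % 4 == 0, "dim must be divisible by 4 for a 2D sin-cos encoding"
    grid_y, grid_x = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                                    torch.arange(w, dtype=torch.float32),
                                    indexing="ij")
    pos_dim = dim // 4
    omega = 1.0 / temperature ** (torch.arange(pos_dim, dtype=torch.float32) / pos_dim)
    out_x = grid_x.flatten()[:, None] * omega[None, :]     # (h*w, pos_dim)
    out_y = grid_y.flatten()[:, None] * omega[None, :]
    return torch.cat([out_x.sin(), out_x.cos(), out_y.sin(), out_y.cos()], dim=1)

# usage: pe = sincos_pos_embed_2d(20, 20, 256)
#        q, k = q + pe, k + pe   # added to the queries and keys, as described in the text
```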
Among these components, SEFN serves as the core unit of the module; its dual-branch architecture and gated fusion mechanism are pivotal for achieving spatial information enhancement and effective feature selection (as shown in Figure 7). In branch one (the attention-feature processing branch), the input first undergoes a 1 × 1 convolution to expand the channel dimension, enhancing feature expression capacity to accommodate complex nighttime features. A depthwise convolution then extracts local spatial correlation information; this operation maintains a 3 × 3 receptive field to capture local pedestrian structures while significantly reducing computational complexity. Finally, a channel-dimension split, chunk(2, dim = 1), yields X′ and X″, which serve, respectively, as the feature carrier for subsequent fusion and the gating base feature, reserving interfaces for dynamic fusion. Branch two (the spatial enhancement branch) takes the previously saved spatial feature map as input. It first downsamples via average pooling, compressing local noise and extracting global pedestrian contour information, thereby mitigating interference from nighttime road reflections and light spots. It then undergoes a 3 × 3 convolution for feature transformation, with layer normalization and the ReLU activation function enhancing feature discriminability. A further 3 × 3 convolution and layer normalization deepen the modeling of global spatial correlations. Finally, an upsampling operation restores the features to the same spatial dimensions as branch one's output, ensuring precise spatial alignment during the subsequent fusion.
Building upon this foundation, SEFN achieves adaptive integration of the dual-branch features through a gated fusion mechanism. First, branch one's fusion-carrier features X′ are concatenated with branch two's upsampled output Y along the channel dimension, forming a composite feature that integrates local attention correlations with global spatial contours and provides multidimensional discriminative evidence for gating-signal generation. The concatenated feature undergoes a 1 × 1 convolution to compress redundant channels and consolidate information, a 3 × 3 depthwise convolution further enhances local spatial consistency, and the GELU activation function finally generates a smooth gating signal. This gating signal is multiplied element-wise with branch one's gating base feature X″, as expressed in the formula:
to achieve dynamic filtering of effective features. In nighttime scenes, the gating signal assigns high weights to well-lit pedestrian regions and low weights to shadowed or noisy regions, thereby highlighting pedestrian features while suppressing interference.
The SEFN output features undergo a 1 × 1 convolutional projection to restore the original channel dimension. They are then combined with the output of the attention stage via a residual connection and processed through layer normalization to yield the final module output. This process ensures that the fused features remain within global contextual constraints while preserving the spatially detailed information enhanced by SEFN, as shown in the formula:
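The SEFN computation described above can be sketched as follows. The channel expansion ratio, the pooling factor, the normalization layers, and the assignment of the two split halves to the gate-generation and gated roles are assumptions based on the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEFN(nn.Module):
    """Spatially Enhanced FFN sketch: an attention-feature branch (expand, depthwise conv,
    split) and a spatial branch (pool, conv, upsample) are fused by a gating signal that
    re-weights the attention features before projecting back to the input width."""
    def __init__(self, dim: int, expansion: int = 2, pool: int = 2):
        super().__init__()
        hidden = dim * expansion
        # branch one: attention-feature processing
        self.expand = nn.Conv2d(dim, 2 * hidden, 1)
        self.dw1 = nn.Conv2d(2 * hidden, 2 * hidden, 3, padding=1, groups=2 * hidden)
        # branch two: spatial enhancement of the saved feature map
        self.spatial = nn.Sequential(
            nn.AvgPool2d(pool),
            nn.Conv2d(dim, hidden, 3, padding=1), nn.GroupNorm(1, hidden), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GroupNorm(1, hidden),
        )
        # gated fusion: concat -> 1x1 conv -> depthwise 3x3 -> GELU
        self.gate = nn.Sequential(
            nn.Conv2d(2 * hidden, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
        )
        self.proj = nn.Conv2d(hidden, dim, 1)

    def forward(self, x_attn, x_spatial):
        # x_attn: attention output reshaped to (b, dim, h, w); x_spatial: saved spatial map
        x_fuse, x_base = self.dw1(self.expand(x_attn)).chunk(2, dim=1)
        y = self.spatial(x_spatial)
        y = F.interpolate(y, size=x_fuse.shape[-2:], mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([x_fuse, y], dim=1))   # smooth gating signal
        return self.proj(g * x_base)                   # dynamic filtering, restore channels

# usage (residual connection and LayerNorm applied by the enclosing encoder layer):
# out = SEFN(256)(attn_map, spatial_map)
```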
The AIFI-SEFN module successfully integrates global spatial contour information with local attention features by introducing a spatial enhancement branch and a gated fusion mechanism. This design enables the model to dynamically filter and enhance beneficial spatial details when encountering uneven nighttime illumination and low-contrast targets. Consequently, it provides the decoder with more discriminative feature representations, effectively improving detection accuracy.
3.4. MANStar Module
To address the challenges of varying pedestrian scales and insufficient feature fusion capability in pedestrian detection tasks under nighttime visible light conditions, this paper proposes a Mixed Aggregation Network with Star Blocks (MANStar). Building upon the efficient Mixed Aggregation Network (MANet) architecture [38], this module innovatively incorporates Star feature extraction blocks [39]. It aims to enhance the model's perception and recognition capabilities for pedestrians of varying scales in complex nighttime scenes through multi-branch aggregation and deep spatial feature learning.
As shown in Figure 8, the input feature map of the MANStar module has c channels and is denoted as the input feature X. First, a 1 × 1 convolutional layer performs channel projection to expand the feature dimension to 2c, laying the foundation for subsequent multi-branch processing. This process can be represented as:
After completing the dimensional projection of input features, the 2c-dimensional feature map is fed into a carefully designed four-branch aggregation structure. This architecture comprehensively captures diverse feature representations through four distinct processing paths, providing multidimensional support for subsequent precise modeling of nighttime pedestrian features. It effectively adapts to the complexity and diversity of pedestrian characteristics in nighttime scenarios.
The first path directly extracts fundamental pedestrian contour features through a single 1 × 1 convolution layer while reducing the number of channels to c:
The second path employs depthwise separable convolution (DSConv) for refined processing, comprising three steps. First, a 1 × 1 convolution expands the 2c-dimensional feature channels to 4c, reserving ample channel capacity for subsequent spatial feature extraction. Next, a 3 × 3 depthwise convolution, while maintaining the 4c channel count, efficiently captures spatial features with computational complexity significantly lower than that of standard convolutions. Finally, a 1 × 1 pointwise convolution compresses the channel count back to c, achieving feature reconstruction and dimensional unification. This path balances computational efficiency with feature expressiveness while effectively enhancing the discriminative power of local features and improving the accuracy of detail extraction, as detailed below:
The third path directly splits the initial 2c-dimensional features equally along the channel dimension, yielding two independent c-dimensional feature streams:
One of these c-dimensional feature streams is fed directly into the subsequent feature fusion stage, preventing the loss of critical original information through excessive processing.
The fourth path constitutes the Star Block sequence path, which takes the other c-dimensional feature stream obtained from the splitting operation as input and feeds it into a serial processing unit composed of n Star Blocks. Each Star Block strictly maintains consistency between its input and output feature dimensions (both c-dimensional); features are optimized through internal gated feature selection and spatial enhancement via depthwise convolutions. Unlike traditional designs that utilize only the final output of the sequence, the output of every Star Block in the sequence participates in the subsequent feature fusion. This creates multi-level representations, enabling more effective modeling of complex nighttime pedestrian features. The process is as follows:
After the four branches described above have been processed, all generated feature streams are concatenated along the channel dimension to form a composite feature map of (4 + n) × c dimensions, which is finally fused and output through a 1 × 1 convolutional layer. This aggregation process can be represented as:
This multi-branch, multi-level aggregation strategy ensures the model comprehensively integrates feature information from different processing paths and depth levels, effectively addressing the diversity of pedestrian targets in nighttime scenes.
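A rough sketch of this four-path aggregation is given below. Concatenating the Star-path input alongside the other streams is an assumption made so that the fused width matches the stated (4 + n) × c; the channel widths and the stand-in StarBlock are likewise illustrative.

```python
import torch
import torch.nn as nn

def StarBlock(c: int) -> nn.Module:
    """Stand-in for the Star Block sketched at the end of this subsection."""
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c), nn.SiLU())

class MANStar(nn.Module):
    """Hybrid aggregation sketch: project to 2c, run four parallel paths (1x1 conv,
    depthwise-separable conv, a direct split half, and a Star Block chain whose every
    intermediate output is kept), then fuse everything with a 1x1 convolution."""
    def __init__(self, c_in: int, c: int, n: int = 2):
        super().__init__()
        self.proj = nn.Conv2d(c_in, 2 * c, 1)
        self.path1 = nn.Conv2d(2 * c, c, 1)
        self.path2 = nn.Sequential(                       # depthwise-separable path
            nn.Conv2d(2 * c, 4 * c, 1),
            nn.Conv2d(4 * c, 4 * c, 3, padding=1, groups=4 * c),
            nn.Conv2d(4 * c, c, 1),
        )
        self.stars = nn.ModuleList(StarBlock(c) for _ in range(n))
        self.fuse = nn.Conv2d((4 + n) * c, c, 1)

    def forward(self, x):
        x = self.proj(x)
        direct, star_in = x.chunk(2, dim=1)               # two c-channel streams
        feats = [self.path1(x), self.path2(x), direct, star_in]   # star_in kept -> (4+n)c
        for block in self.stars:
            star_in = block(star_in)
            feats.append(star_in)                         # every Star Block output is kept
        return self.fuse(torch.cat(feats, dim=1))

# usage: out = MANStar(c_in=128, c=64, n=2)(torch.randn(1, 128, 40, 40))
```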
The design of the Star Block (as shown in Figure 9) draws inspiration from StarNet [24], an architecture that achieves multi-branch coordination through a star topology. However, to address the specific challenges of low-light nighttime pedestrian detection, such as uneven illumination, blurred targets, and background noise interference, we have modified StarNet. The module achieves effective enhancement of discriminative features by synergistically integrating a dual-branch gated feature selection mechanism with large-receptive-field depthwise convolutions. Given the input feature X, the processing flow is as follows.
First, the Star Block employs two applications of large-kernel depthwise convolution (7 × 7) to ensure a broad receptive field that captures the contextual relationships of blurred nighttime targets. Simultaneously, it integrates a stochastic depth mechanism as a regularization technique, enhancing generalization in scenes with variable lighting conditions:
Subsequently, a dual-branch gated feature selection mechanism is constructed to perform parallel transformation and fusion on the convolved features. This mechanism employs two independent 1 × 1 convolutions with non-shared parameters to project the features into a higher-dimensional space (where r denotes the channel expansion ratio). It enhances robustness against noise through nonlinear activation and gating operations, dynamically emphasizing information-rich channels while suppressing redundant responses:
Here, ⊙ denotes element-wise multiplication.
To demonstrate the effectiveness of the gating mechanism in high-noise, low-light environments, we provide a theoretical analysis from the perspective of signal-to-noise ratio (SNR). Assume the input feature X comprises a useful signal S and noise N (X = S + N). In the dual-branch gating scheme, X1 undergoes a 1 × 1 convolution and ReLU6 activation to generate the gating signal G(X1) ∈ [0, 6]. The saturation property of this activation ensures that G approaches 0 for low-value noise channels, thereby suppressing noise amplification (G ⊙ N ≈ 0); conversely, for useful signal channels G takes higher values (approaching 6), amplifying the response (G ⊙ S > S). From an SNR perspective, the SNR of the output feature can be approximated as SNR_out ≈ (G ⊙ S)² / (G ⊙ N)². The noise suppression provided by G mitigates the risks of weak gradients vanishing or noise being amplified. In nighttime pedestrian detection, this adaptive gating preserves gradients for blurred targets while mitigating the impact of noise on multi-scale fusion.
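A toy numeric check of this argument, using made-up per-channel signal, noise, and gate values rather than measurements from the model:

```python
import torch

# One strong "pedestrian" channel and one weak channel, both with the same noise level,
# pass through a ReLU6-style gate; the gate values shape the output SNR.
signal = torch.tensor([2.0, 0.1])                         # useful response per channel
noise = torch.tensor([0.3, 0.3])                          # additive noise per channel
gate = torch.clamp(torch.tensor([5.0, 0.2]), 0.0, 6.0)    # ReLU6-bounded gate values

snr_in = (signal ** 2).sum() / (noise ** 2).sum()
snr_out = ((gate * signal) ** 2).sum() / ((gate * noise) ** 2).sum()
print(f"SNR in:  {snr_in:.2f}")    # ~22.3
print(f"SNR out: {snr_out:.2f}")   # ~44.4 -> the gate favours the informative channel
```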
To project the features back to the original channel dimension and further refine their spatial structure, the gated output is first compressed via a 1 × 1 convolution and then passed through a second large-kernel depthwise convolution that completes the spatial refinement:
Ultimately, the module output is obtained through a residual connection and the stochastic depth mechanism:
These modifications render the Star Block more suitable for nighttime tasks, achieving targeted enhancements in multi-scale fusion and noise suppression compared with the original StarNet. It integrates seamlessly into the MANStar module, forming a hybrid aggregation architecture. When deployed within the deep feature processing path of the MANStar module, its serialized structure preserves the intermediate feature output of each Star Block and ultimately achieves hierarchical aggregation along the channel dimension. This design enables multi-granularity synergistic integration, from shallow texture details and mesoscale structural information to deep semantic features, thereby enhancing the model's ability to precisely capture and represent nighttime pedestrian targets.
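A self-contained sketch of the modified Star Block described above follows; the expansion ratio r, the drop-path probability, and the omission of normalization layers are assumptions.

```python
import torch
import torch.nn as nn

def drop_path(x, p: float, training: bool):
    """Stochastic depth: randomly drop the residual branch per sample during training."""
    if p == 0.0 or not training:
        return x
    keep = 1.0 - p
    mask = x.new_empty(x.shape[0], 1, 1, 1).bernoulli_(keep)
    return x * mask / keep

class StarBlock(nn.Module):
    """Star Block sketch: large-kernel depthwise conv, dual-branch gated feature selection
    (two unshared 1x1 convs, ReLU6 gate, element-wise product), 1x1 compression, a second
    large-kernel depthwise conv, and a residual connection with stochastic depth."""
    def __init__(self, c: int, r: int = 4, drop_prob: float = 0.1):
        super().__init__()
        self.dw1 = nn.Conv2d(c, c, 7, padding=3, groups=c)
        self.f1 = nn.Conv2d(c, r * c, 1)         # gate branch (unshared parameters)
        self.f2 = nn.Conv2d(c, r * c, 1)         # value branch (unshared parameters)
        self.act = nn.ReLU6()
        self.compress = nn.Conv2d(r * c, c, 1)
        self.dw2 = nn.Conv2d(c, c, 7, padding=3, groups=c)
        self.drop_prob = drop_prob

    def forward(self, x):
        y = self.dw1(x)                          # broad receptive field
        y = self.act(self.f1(y)) * self.f2(y)    # gated feature selection (element-wise)
        y = self.dw2(self.compress(y))           # back to c channels, spatial refinement
        return x + drop_path(y, self.drop_prob, self.training)

# usage: y = StarBlock(64)(torch.randn(1, 64, 20, 20))
```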
5. Discussion and Conclusions
Nighttime pedestrian detection is a critical and challenging task in the field of computer vision. Due to complex and variable lighting conditions at night, significant variations in target sizes, and the need to balance real-time detection with stringent edge deployment requirements, achieving high-accuracy, high-efficiency pedestrian detection under low-light and high-noise conditions has become a research hotspot. Timely and accurate identification and localization of pedestrian targets in nighttime environments hold significant practical importance for enhancing the performance of intelligent surveillance systems, ensuring public safety, and reducing traffic accidents.
This paper proposes a lightweight detection model, BNE-DETR, based on the RT-DETR architecture. It aims to systematically address core challenges in pedestrian detection under nighttime visible light conditions, including insufficient accuracy and high rates of false positives and false negatives. While maintaining low computational complexity and parameter count, the model significantly enhances detection performance in complex environments with low illumination and uneven lighting by introducing multi-level feature enhancement and spatial perception mechanisms. Specifically, three core modules are designed. First, the C2f-SECG module, embedded within the lightweight backbone CSPDarknet, fuses single-head self-attention with a dynamic gating mechanism, effectively enhancing the representation of pedestrian contours and edge information while suppressing background noise interference. Second, the AIFI-SEFN module introduces a Spatially Enhanced Feedforward Network within the encoder; its dual-branch architecture and gated fusion mechanism strengthen the extraction of subtle details, overcoming the susceptibility of traditional attention mechanisms to noise interference in nighttime scenes. Finally, the MANStar module constructs a broad receptive field through its hybrid aggregation architecture and large-kernel star topology, enabling efficient fusion of multi-scale features; its gated selection mechanism dynamically focuses on key features, effectively addressing the core challenges of scale variability and background interference in nighttime scenes and significantly enhancing the model's overall robustness.
Experimental results demonstrate that BNE-DETR achieves significant improvements over the baseline model RT-DETR-R18 on the LLVIP dataset across key metrics, including Precision, Recall, and mAP50. Concurrently, it reduces the number of parameters by 20.2% and lowers the GFLOPs requirement, showcasing outstanding lightweight characteristics and potential for edge deployment. Furthermore, cross-dataset generalization experiments on NightSurveillance and NightOwls further validate the model's robustness and adaptability across diverse nighttime scenarios, particularly excelling in detecting small-scale and low-contrast objects.
This study not only proposes a model that excels in nighttime pedestrian detection tasks but, more importantly, provides a modular design approach for applying lightweight Transformer architectures to complex visual tasks. The concepts embodied by each module—such as “combining global perception with local refinement”, “adaptive feature selection”, and “multi-scale fusion”—exhibit strong transferability and can serve as a reference for other low-quality image analysis tasks.
Although BNE-DETR has made progress in multiple aspects, there remains room for further optimization in the future. We will continue to focus on enhancing model compression and inference speed to meet stricter edge deployment requirements. Simultaneously, addressing more challenging scenarios—such as extreme lighting conditions, adverse weather interference, and object occlusion—particularly the detection of distant small-scale pedestrians and partially occluded targets, will be key priorities for future research. By continuously refining model architecture and training strategies, we aim to further advance the practical application and development of nighttime intelligent surveillance and security systems.