Article

LCB-Net: Long-Range Context and Box Distribution Network for Small Object Detection

1 School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
2 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Nanjing 211189, China
3 Huawei Technologies Co., Ltd., Nanjing 210000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(22), 4487; https://doi.org/10.3390/electronics14224487
Submission received: 14 October 2025 / Revised: 4 November 2025 / Accepted: 13 November 2025 / Published: 17 November 2025
(This article belongs to the Topic Recent Advances in Label Distribution Learning)

Abstract

Small object detection (SOD) remains a critical challenge in computer vision, with vital applications in areas like UAV inspection, autonomous driving, and medical image analysis. Existing methods are often limited by inadequate feature representation for small objects and insufficient utilization of contextual information. To tackle these issues, this paper proposes a novel LCB-Net. First, we design a plug-and-play Saliency-guided Long-range Mamba (SL-Mamba) module, which leverages spatially attentive maps from shallow features to explicitly guide the model’s focus toward small target regions. This module captures long-range contextual dependencies through state space modeling and enhances local–global feature synergy via cross-scale fusion. Second, we introduce a Bounding Box Distribution Loss (BDL) that employs label distribution learning (LDL) to explicitly model localization ambiguity and improve accuracy. Extensive experiments on standard small object benchmarks such as VisDrone, WiderPerson, and NWPU-VHR-10 demonstrate that our approach achieves significant performance gains over strong baselines. Specifically, on the VisDrone dataset, it yields a 4.3% improvement in mAP@0.5:0.95. Furthermore, evaluations across small object benchmarks and the general-purpose MS-COCO dataset confirm that the proposed BDL consistently surpasses traditional IoU-based losses, including CIoU and ProbIoU, in localization tasks.

1. Introduction

Object detection, as one of the core tasks in computer vision, aims to localize and identify objects of interest from images or video data. Small objects typically refer to targets occupying less than 32 × 32 pixels in an image, such as distant pedestrians, traffic signs, vehicles, birds, etc. Small object detection (SOD) plays a critical role in numerous fields including autonomous driving, medical image analysis, industrial quality inspection, as well as remote sensing and military applications [1,2]. In autonomous driving, systems require rapid and accurate detection of tiny objects like distant pedestrians, traffic signs, and small obstacles to ensure driving safety. In medical image analysis, SOD technology is widely used in early cancer screening, particularly for precisely detecting lesions smaller than 5 mm in diameter within CT or MRI images, which holds significant importance for early clinical diagnosis and intervention [3]. In the industrial quality inspection domain, detecting surface micro-cracks in electronic components and PCB solder joint defects can effectively improve the accuracy of product quality control, meeting the demands of high-reliability manufacturing. Furthermore, in remote sensing and military applications, SOD is employed to rapidly identify small buildings, vehicles, or military installations from high-resolution satellite imagery, providing core technical support for national defense security, urban management, and disaster monitoring [4]. As a cross-disciplinary critical technology, SOD is increasingly becoming a key guarantee for enhancing system intelligence and safety.
However, current SOD still faces numerous challenges. First, at the data level, the extremely high annotation accuracy requirements for small objects make them prone to labeling errors. Furthermore, the relatively low proportion of small object samples in mainstream detection datasets further restricts the feature modeling capability of detectors during the training phase [5]. Second, in SOD tasks, complex backgrounds and environmental noise often cause small targets to blend with background textures, leading to issues such as blurred boundaries and reduced localization accuracy [6,7]. Additionally, compared to medium and large objects, the limited pixel area occupied by small objects in images results in the loss of critical semantic information during feature extraction, making it difficult for detectors to accurately capture discriminative feature representations [8,9]. Moreover, to compensate for the feature loss in SOD, higher-resolution input images are often required, which significantly increases computational and storage overhead and poses challenges for deployment on resource-constrained devices [10].
To address the issues of insufficient fine-grained feature modeling and missing contextual relationships in SOD, this paper proposes a novel SOD framework, which integrates long-range context modeling and bounding box distribution learning. Firstly, a plug-and-play saliency-guided long-range Mamba (SL-Mamba) module is proposed, which adopts Mamba as the state space modeler to model long-range contextual relationships through a linear recurrence mechanism. To compensate for Mamba’s relative weakness in modeling local details, a residual fusion mechanism is further employed to perform spatial alignment and fusion between Mamba’s output features and the original features, achieving synergistic enhancement of global and local information, and significantly improving the model’s detection accuracy and robustness for small objects in complex scenarios.
Secondly, based on the theory of label distribution learning, a multivariate Gaussian distribution is used to model the probability distribution of bounding boxes, and a bounding box distribution loss (BDL) function is constructed in the continuous probability space. Traditional IoU-based losses rely on discrete coordinate predictions and treat regression as a deterministic point estimate; they ignore the uncertainty inherent in prediction and often suffer from sparse or abruptly changing gradients when bounding boxes are close but not overlapping. In contrast, the proposed method captures the uncertainty of predicted boxes through probabilistic modeling. It provides fine-grained gradient information in the continuous space, making the loss more sensitive to subtle shifts of bounding boxes and reducing the impact of extreme values on regression, thereby enhancing the accuracy and stability of bounding box regression. The main contributions are as follows:
  • To address the limited ability of SOD models in modeling long-range contextual relationships, we propose a novel SOD framework that integrates long-range context modeling and bounding box distribution learning.
  • We introduce the SL-Mamba module, which efficiently establishes long-range dependencies between pixels, overcoming the architectural limitations of traditional CNNs and local attention mechanisms. Additionally, a residual fusion mechanism is incorporated to synergistically enhance local and global information.
  • We employ multivariate Gaussian distribution to model bounding box probability distributions and construct a corresponding loss function. This approach mitigates localization ambiguity and uncertainty, significantly improving detection accuracy.
  • Extensive experiments validate the effectiveness of SL-Mamba on small object detection datasets. Specifically, on the VisDrone dataset, our method achieves a 4.3% improvement in mAP@0.5:0.95 compared to baseline approaches. Furthermore, the proposed BDL demonstrates superior localization performance over both CIoU and ProbIoU on both small object and general object detection datasets.

2. Related Work

Research in SOD has advanced along several key pathways to overcome the fundamental challenges of weak features and limited contextual information. Prevailing approaches primarily focus on: multi-scale feature fusion, the integration of contextual information, image enhancement, and refined region proposal generation. Situated at the confluence of feature fusion and context modeling, our work introduces novel mechanisms for capturing long-range dependencies and learning probabilistic bounding box distributions.

2.1. Feature Fusion-Based Small Object Detection

The performance of SOD is constrained by inherently weak feature representation. Convolutional Neural Networks (CNNs) compress feature maps through successive downsampling operations, leading to significant degradation of small object details in deeper layers. Moreover, features at different levels exhibit distinct spatial and semantic characteristics: deep features possess large receptive fields and strong semantic abstraction but low spatial resolution, while shallow features retain fine-grained details but lack sufficient semantic context. This representational conflict forms a critical bottleneck for SOD.
To address these issues, researchers have proposed feature fusion-based methods. In early work, Liu et al. introduced SSD, which balanced multi-scale feature fusion and detection efficiency by leveraging parallel feature maps for multi-task prediction [5]. Subsequently, Fu et al. proposed MDSSD, which incorporated deconvolution modules and skip connections to fuse high-level semantic features with low-level details, effectively supplementing spatial information and contextual awareness for small objects [6]. Further, Zhang et al. developed DR-CNN, which achieved consistent multi-scale feature alignment through deconvolutional reconstruction and l2 normalization of VGG16 backbone features, significantly improving small object recall [7]. To enhance cross-scale fine-grained fusion, Tang et al. proposed MR-CNN, which directly concatenated upsampled deep features with shallow features via multi-scale deconvolution, improving region proposal quality [11].
Recently, attention mechanisms and dynamic feature fusion have emerged as promising directions. Xu et al. proposed TransFuse-SOD, which combined Transformer and CNN backbones with cross-attention to dynamically calibrate feature importance across scales, achieving notable performance gains [8]. Chen et al. developed Dynamic-FPN, which employed a learnable gating mechanism to adaptively regulate feature fusion paths, optimizing both accuracy and efficiency in complex scenes [10].

2.2. Context-Aware Small Object Detection

Since small objects occupy minimal pixel areas, relying solely on local features often proves insufficient for accurate detection. Consequently, leveraging contextual information has become a key strategy for performance improvement.
ION employed spatial RNNs to model external context, generating directional feature copies processed by recurrent units and integrated via attention mechanisms [12]. PCNN enhanced SSD with dilated convolutions and deconvolution to extract multi-scale context, while incorporating memory networks to store semantic history, improving classification accuracy [13]. For domain-specific challenges, Müller et al. introduced TL-SSD, which combined shallow and deep features with pyramid pooling to detect traffic lights and other small objects [14]. For highway-specific scenarios, Chan et al. proposed regional contextual information modeling to enhance small object detection in traffic environments [15,16].

2.3. Image Enhancement-Based Small Object Detection

The limited pixel coverage of small objects restricts feature expressiveness, prompting methods that enhance image or feature resolution. Generative Adversarial Networks (GANs) have shown particular promise in this area.
Li et al. proposed Perceptual GAN, the first GAN-based method for SOD [17]. Its generator learned mapping between small and large object features, while a dual-branch discriminator performed adversarial and detection tasks, achieving 9.7% AP gain on KITTI. To address semantic inconsistency, Bai et al. developed SOD-MTGAN, a multi-task GAN where the discriminator jointly evaluated image authenticity and performed detection, guiding the generator to produce more discriminative features [18]. Yuan et al. designed HRT, which integrated multi-resolution parallel design from HRNet with window-based self-attention to reduce computational cost while enhancing context awareness [19]. EdgeSR-GAN combined edge and grayscale information with GANs to generate structurally rich high-resolution images, outperforming traditional super-resolution methods in SSIM and mitigating mode collapse [20].

2.4. Region Proposal-Based Small Object Detection

Prior to deep learning, region proposals relied on methods like Selective Search, which were computationally expensive. Faster R-CNN [21] revolutionized this with the Region Proposal Network (RPN), enabling end-to-end candidate generation. To overcome the reliance on manual anchor design in RPNs, Sun et al. proposed PBLS-SRPN, which used evolutionary algorithms (PSO and BFO) to automate hyperparameter optimization, generating higher-quality proposals [22]. Sparse R-CNN replaces the dense anchor-based mechanism with a small set of learnable object proposals that interact iteratively with image features. This approach eliminates the computational cost of evaluating thousands of anchor boxes per location while maintaining high recall for small objects through focused feature reasoning [23]. Concurrently, modern backbone networks have demonstrated remarkable capability in enhancing proposal quality through richer feature representations. The Swin Transformer, with its hierarchical architecture and shifted windowing mechanism, constructs intrinsic feature pyramids that preserve fine-grained details crucial for small object detection while capturing long-range dependencies. This results in more discriminative features for the proposal stage, significantly improving localization accuracy for challenging small instances [24].

3. Method

3.1. Overview

The overall architecture of our proposed framework is illustrated in Figure 1. Building upon the YOLOv8 baseline [25,26], we introduce two key innovations to address the specific challenges of SOD. First, we integrate a Saliency-guided Long-range Mamba (SL-Mamba) module to enhance feature representation by suppressing irrelevant background information and capturing long-range contextual dependencies through state space modeling. Second, we propose a Bounding Box Distribution Loss (BDL) function that reformulates bounding box regression as a label distribution learning problem, effectively mitigating gradient instability and convergence difficulties in small object localization.

3.2. Preliminaries

3.2.1. YOLOv8 Architecture

YOLOv8 has emerged as a widely adopted object detection framework due to its favorable balance between accuracy and inference efficiency. The architecture follows an anchor-free paradigm and comprises three main components:
Backbone: The enhanced CSPDarknet backbone incorporates cross-stage partial connections and spatial pyramid pooling (SPPF) modules, enabling efficient multi-scale feature extraction while preserving fine-grained details essential for SOD.
Neck: The path aggregation network with feature pyramid network (PAN-FPN) facilitates bidirectional feature fusion across different resolution layers, enhancing both semantic richness and spatial precision for multi-scale object detection [27,28].
Head: YOLOv8 employs a decoupled head structure that separates classification and regression branches, allowing specialized processing for each task and improving overall detection performance.

3.2.2. State Space Models and Mamba

State Space Models: State space models (SSMs) provide a principled framework for modeling sequential data through hidden state dynamics [29]. The continuous-time formulation of a linear time-invariant (LTI) SSM is described by:
$$\dot{h}(t) = A\,h(t) + B\,x(t),$$
$$y(t) = C\,h(t) + D\,x(t),$$
where $x(t) \in \mathbb{R}$ is the input signal, $h(t) \in \mathbb{R}^{N}$ is the hidden state, $y(t) \in \mathbb{R}$ is the output, $A \in \mathbb{R}^{N \times N}$ governs state transitions, $B \in \mathbb{R}^{N \times 1}$ maps the input to the state, $C \in \mathbb{R}^{1 \times N}$ projects the state to the output, and $D \in \mathbb{R}$ represents a skip connection.
For discrete-time processing with step size $\Delta$, the system is discretized via zero-order hold:
$$h_k = \bar{A} h_{k-1} + \bar{B} x_k,$$
$$y_k = C h_k,$$
where the discretized parameters are computed as:
$$\bar{A} = \exp(\Delta A),$$
$$\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B.$$
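To make the zero-order-hold step concrete, the following minimal NumPy/SciPy sketch discretizes a toy LTI system and runs the resulting recurrence. The matrices, step size, and function names are illustrative assumptions for exposition, not values or code from the paper.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x_seq):
    """Discrete recurrence h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_k in x_seq:                      # x_seq: sequence of scalar inputs
        h = A_bar @ h + B_bar * x_k
        ys.append(float(C @ h))
    return np.array(ys)

# toy 2-state system (illustrative values only)
A = np.array([[-1.0, 0.0], [0.0, -2.0]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 1.0]])
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
print(ssm_scan(A_bar, B_bar, C, np.ones(5)))
```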
Mamba Mechanism: Mamba [30] enhances SSMs through input-dependent parameter selection, overcoming the limitations of time-invariant systems. The selective SSM formulation introduces:
$$B_k = \mathrm{Linear}_B(x_k),$$
$$C_k = \mathrm{Linear}_C(x_k),$$
$$\Delta_k = \mathrm{Softplus}\left(\mathrm{Linear}_\Delta(x_k) + \mathrm{bias}\right).$$
The discretization becomes input-dependent:
$$\bar{A}_k = \exp(\Delta_k A),$$
$$\bar{B}_k = (\Delta_k A)^{-1}\left(\exp(\Delta_k A) - I\right)\Delta_k B_k,$$
yielding the selective recurrence:
$$h_k = \bar{A}_k h_{k-1} + \bar{B}_k x_k,$$
$$y_k = C_k h_k.$$
This selective mechanism enables: (1) context-aware filtering through $B_k$; (2) adaptive memory via $\Delta_k$; and (3) content-dependent output projection through $C_k$. Mamba maintains efficiency through hardware-aware parallelization of the selective scan operation.
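For clarity, the sketch below implements the selective recurrence above in PyTorch with a diagonal state matrix $A$ per channel, a common simplification in SSM implementations. All layer names, sizes, and the sequential (non-parallelized) scan are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Selective scan with input-dependent B_k, C_k, Delta_k and a diagonal A per channel."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # A is parameterized as -exp(A_log): negative real entries, N states per channel
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_delta = nn.Linear(d_model, d_model)

    def forward(self, x):                                  # x: (batch, length, d_model)
        b, L, d = x.shape
        A = -torch.exp(self.A_log)                         # (d, N)
        Bk = self.to_B(x)                                  # (b, L, N)
        Ck = self.to_C(x)                                  # (b, L, N)
        delta = F.softplus(self.to_delta(x))               # (b, L, d)

        h = x.new_zeros(b, d, A.shape[1])                  # hidden state, one SSM per channel
        ys = []
        for t in range(L):                                 # explicit sequential scan for clarity
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)              # A_bar_k, shape (b, d, N)
            dB = (dA - 1.0) / A * Bk[:, t].unsqueeze(1)                # B_bar_k for diagonal A
            h = dA * h + dB * x[:, t].unsqueeze(-1)                    # h_k = A_bar h_{k-1} + B_bar x_k
            ys.append(torch.einsum('bdn,bn->bd', h, Ck[:, t]))         # y_k = C_k h_k
        return torch.stack(ys, dim=1)                      # (b, L, d_model)

seq = torch.randn(2, 32, 64)
print(SelectiveSSM(64)(seq).shape)                         # torch.Size([2, 32, 64])
```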

3.2.3. Label Distribution Learning

Label distribution learning (LDL) extends single-label and multi-label learning by associating each instance with a probability distribution over the label space [31,32]. Formally, let $\mathcal{X} \subseteq \mathbb{R}^{q}$ be the feature space and $\mathcal{Y} = \{y_1, y_2, \ldots, y_c\}$ the label space. Given a training set $S = \{(x_1, D_1), (x_2, D_2), \ldots, (x_n, D_n)\}$, where $D_i : \mathcal{Y} \to [0, 1]$ is a label distribution satisfying $\sum_{j=1}^{c} D_i(y_j) = 1$, the goal of LDL is to learn a conditional probability mass function $p(y \mid x; \theta)$ parameterized by $\theta$ [33,34,35].
The optimal parameters are obtained by minimizing the Kullback–Leibler divergence between the true and predicted distributions, which, up to a constant, is equivalent to maximizing the expected log-likelihood:
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{j=1}^{c} D_i(y_j) \ln p(y_j \mid x_i; \theta).$$
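As a minimal illustration of this objective, the sketch below minimizes the KL divergence between a ground-truth label distribution and a softmax prediction, which matches the maximum-likelihood form above up to an additive constant. The tensors and sizes are hypothetical and unrelated to the paper's model.

```python
import torch
import torch.nn.functional as F

def ldl_loss(logits, target_dist):
    """KL(D || p) averaged over the batch; minimizing it maximizes sum_j D(y_j) ln p(y_j | x)."""
    log_p = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_p, target_dist, reduction='batchmean')

# toy example: 4 instances, 5 labels, each target label distribution sums to 1
logits = torch.randn(4, 5, requires_grad=True)
target = torch.softmax(torch.randn(4, 5), dim=-1)
loss = ldl_loss(logits, target)
loss.backward()
print(float(loss))
```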

3.3. Saliency-Guided Long-Range Mamba

Saliency Detection Module: In small object detection tasks, targets typically occupy minimal pixel areas and are surrounded by extensive low-response background regions. Such redundant features not only dilute the discriminative capacity of target representations but may also introduce substantial interference during subsequent global modeling stages.
To mitigate this issue, this paper introduces a saliency detection module (SDM) between the backbone network and the state space modeling component. The proposed SDM employs a soft mask sparsification mechanism aimed at enhancing computational efficiency and perceptual focus, while preserving critical target-related information [24,36].
The module takes multi-scale feature maps $P_l \in \mathbb{R}^{C \times H \times W}$ from the PAN-FPN as input, where $l \in \{3, 4, 5\}$ denotes the pyramid level. For each level, a saliency map $M_l \in \mathbb{R}^{H \times W}$ is generated by computing the $\ell_2$-norm along the channel dimension at each spatial location:
$$M_l(i, j) = \left\| P_l(:, i, j) \right\|_2,$$
where each element $M_l(i, j)$ reflects the activation intensity at position $(i, j)$ for pyramid level $l$. To ensure consistent saliency responses across different input images, $M_l$ is normalized to the range $[0, 1]$ via min-max normalization:
$$\hat{M}_l(i, j) = \frac{M_l(i, j) - \min(M_l)}{\max(M_l) - \min(M_l)}.$$
The normalized saliency maps are used as soft attention masks and multiplied element-wise with the corresponding original feature maps:
$$F_l = P_l \odot \hat{M}_l.$$
This operation enhances informative regions while suppressing irrelevant background noise, thereby directing the model's attention toward potential regions of interest. The resulting saliency-weighted features $F_l$ are then flattened into sequential format and fed into the Mamba state space module. This integration effectively improves both representational quality and localization stability in small object detection, as demonstrated in our experimental results.
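The saliency-weighting step above reduces to a few tensor operations. Below is a minimal PyTorch sketch (channel-wise L2 norm, per-map min-max normalization, and element-wise re-weighting); the class and variable names, the epsilon, and the input shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SaliencyDetectionModule(nn.Module):
    """Soft-mask saliency weighting of a pyramid feature map P_l of shape (B, C, H, W)."""
    def forward(self, p):
        # L2 norm over channels gives one activation-intensity value per spatial location
        m = torch.linalg.vector_norm(p, ord=2, dim=1)            # (B, H, W)
        # min-max normalize each saliency map to [0, 1]
        m_min = m.amin(dim=(1, 2), keepdim=True)
        m_max = m.amax(dim=(1, 2), keepdim=True)
        m_hat = (m - m_min) / (m_max - m_min + 1e-6)
        # broadcast the soft mask over channels: F_l = P_l * M_hat_l
        return p * m_hat.unsqueeze(1)                            # (B, C, H, W)

# usage on a P3-sized pyramid level (illustrative shape)
p3 = torch.randn(2, 256, 80, 80)
print(SaliencyDetectionModule()(p3).shape)
```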
Long-Range State Space Modeling: Traditional local convolutional operations struggle to adequately capture the semantic and boundary features of small objects, particularly in scenarios with sparse targets or complex backgrounds. Modeling long-range contextual dependencies is crucial for accurate detection of small objects. Although Transformer architectures possess powerful global modeling capabilities, the $O(N^2)$ computational complexity of their self-attention mechanism results in prohibitive computational and memory overhead when processing high-resolution inputs, limiting their scalability in practical detection tasks [37].
To address this, we propose a long-range Mamba encoder, which utilizes a structurally efficient state space model to achieve long-range dependency modeling of sparse small object features. The module takes the saliency-weighted features $F_l$ from the SDM as input, flattens them into a sequence, and processes them through stacked Mamba blocks. Through a recurrent state-update mechanism, Mamba maintains modeling depth and spatial coverage while effectively avoiding the redundant computations inherent in explicit global attention.
To enhance the spatial semantic representation of small object regions, long-range Mamba incorporates 2D sinusoidal positional encoding at the token input stage, explicitly restoring spatial geometric relationships between features. Let the feature vector of the $t$-th token be $f_t \in \mathbb{R}^{d}$ with spatial coordinates $(h_t, w_t)$. By computing the 2D sinusoidal positional encoding $\phi(h_t, w_t) \in \mathbb{R}^{d_\phi}$, concatenating it with the original features, and applying a linear transformation matrix $W_p \in \mathbb{R}^{d \times (d + d_\phi)}$ with bias term $b_p$ followed by a residual connection, we obtain the enhanced feature representation:
$$\tilde{f}_t = W_p \left[ f_t ; \phi(h_t, w_t) \right] + b_p + f_t.$$
This operation explicitly encodes spatial location information of sparse tokens without introducing additional resolution expansion.
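The sketch below shows one way to realize this positional-encoding step in PyTorch. The particular sin/cos layout, the encoding width d_phi, and the module names are our assumptions; the paper only specifies the concatenate-project-residual form of the equation above.

```python
import math
import torch
import torch.nn as nn

def sincos_2d(h, w, d_phi):
    """2D sinusoidal encoding phi(h, w): half the channels encode rows, half encode columns."""
    d_half = d_phi // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d_half, 2).float() / d_half)
    def enc(coords):                                       # coords: (n,)
        ang = coords[:, None] * freqs[None, :]
        return torch.cat([ang.sin(), ang.cos()], dim=-1)   # (n, d_half)
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing='ij')
    return torch.cat([enc(ys.flatten()), enc(xs.flatten())], dim=-1)   # (h*w, d_phi)

class PosEnhance(nn.Module):
    """f_tilde = W_p [f ; phi] + b_p + f, applied to a flattened token sequence."""
    def __init__(self, d, d_phi=64):
        super().__init__()
        self.d_phi = d_phi
        self.proj = nn.Linear(d + d_phi, d)                # carries W_p and b_p
    def forward(self, tokens, h, w):                       # tokens: (B, h*w, d)
        phi = sincos_2d(h, w, self.d_phi).to(tokens)       # (h*w, d_phi)
        phi = phi.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.proj(torch.cat([tokens, phi], dim=-1)) + tokens

tokens = torch.randn(2, 40 * 40, 256)
print(PosEnhance(256)(tokens, 40, 40).shape)               # torch.Size([2, 1600, 256])
```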
The position-encoded sequence is fed into $L$ Mamba blocks for recursive state modeling. The $l$-th layer takes $u_t^{(l)} = \tilde{f}_t^{(l)}$ as the input signal and recursively updates the hidden state $x_t^{(l)} \in \mathbb{R}^{d}$:
$$x_{t+1}^{(l)} = A^{(l)}\!\left(u_t^{(l)}\right) x_t^{(l)} + B^{(l)}\!\left(u_t^{(l)}\right) u_t^{(l)},$$
$$y_t^{(l)} = C^{(l)}\!\left(u_t^{(l)}\right) x_t^{(l)} + D^{(l)}\!\left(u_t^{(l)}\right) u_t^{(l)},$$
where $A^{(l)}(\cdot)$, $B^{(l)}(\cdot)$, $C^{(l)}(\cdot)$, and $D^{(l)}(\cdot)$ are generated through channel-wise dynamic linear mapping, with an overall computational complexity of $O(Nd)$, avoiding quadratic growth in computation and memory requirements.
Notably, the recursive kernel within Mamba possesses an exponential decay property, causing background tokens to rapidly converge to zero after a few iterations, while high-saliency small object tokens are progressively enhanced through state accumulation, thereby naturally optimizing the semantic signal-to-noise ratio.
Building upon this, we optionally incorporate a Convolutional Block Attention Module (CBAM) after the SL-Mamba block to further refine the feature representation [38]. The role of CBAM is to perform sequential attention weighting in both channel and spatial dimensions, complementing Mamba’s selection mechanism by explicitly reinforcing discriminative small-object features and suppressing residual background noise.
Consequently, SL-Mamba achieves refined enhancement of small object features and adaptive suppression of redundant noise while maintaining linear inference efficiency, providing more robust feature representations for subsequent detection tasks.
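For reference, the optional refinement step mentioned above follows the standard CBAM design of Woo et al. [38]: channel attention followed by spatial attention. The sketch below is a compact, generic CBAM implementation; the reduction ratio and kernel size are common defaults, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sequential channel and spatial attention applied to (B, C, H, W) features."""
    def __init__(self, c, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
                                 nn.Conv2d(c // reduction, c, 1))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        # channel attention: shared MLP over global avg- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: convolution over channel-wise avg and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

print(CBAM(256)(torch.randn(2, 256, 40, 40)).shape)        # torch.Size([2, 256, 40, 40])
```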

3.4. Bounding Box Distribution Loss Function

The loss function in object detection typically comprises classification loss and localization loss. Widely used localization losses, such as CIoU, GIoU, and ProbIoU, measure spatial discrepancies between predicted and ground-truth bounding boxes. However, these traditional losses often suffer from optimization challenges, including gradient discontinuity, sensitivity to local optima, and limited responsiveness to small objects [39,40,41].
The BDL is proposed to overcome these limitations. Our approach reformulates bounding box localization as a label distribution learning problem. Instead of relying on a single hard label, we model the neighborhood around the ground-truth box using a multivariate Gaussian distribution, generating a set of soft labels that offer a more comprehensive representation of the target.
Let the bounding box be parameterized by its center coordinates and dimensions, $b = [x, y, w, h]^{\top}$. We model the label distribution of $b$ using a multivariate Gaussian with probability density function:
$$p(z \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{k} |\Sigma|}} \exp\left(-\frac{1}{2}(z - \mu)^{\top} \Sigma^{-1} (z - \mu)\right),$$
where $z = [x, y, w, h]^{\top}$ denotes the predicted bounding box, $\mu = [x_g, y_g, w_g, h_g]^{\top}$ represents the ground-truth bounding box, $k$ is the dimensionality of the box vector (here $k = 4$), and $\Sigma$ is a $4 \times 4$ covariance matrix capturing correlations among the variables:
$$\Sigma = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xw} & \sigma_{xh} \\ \sigma_{yx} & \sigma_{yy} & \sigma_{yw} & \sigma_{yh} \\ \sigma_{wx} & \sigma_{wy} & \sigma_{ww} & \sigma_{wh} \\ \sigma_{hx} & \sigma_{hy} & \sigma_{hw} & \sigma_{hh} \end{bmatrix}.$$
Here, $\sigma_{xy}$ denotes the covariance between $x$ and $y$, and so on. Both $z$ and $\Sigma$ are direct outputs of the detection network.
We adopt the negative log-likelihood (NLL) as our loss function due to its probabilistic interpretation, inherent uncertainty modeling, and stable gradient properties. The BDL loss is defined as:
$$\mathcal{L}_{\mathrm{BDL}} = -\log p(z \mid \mu, \Sigma).$$
Substituting the Gaussian density above yields the explicit form:
$$\mathcal{L}_{\mathrm{BDL}} = \frac{1}{2}\left[(z - \mu)^{\top} \Sigma^{-1} (z - \mu) + \log |\Sigma| + k \log(2\pi)\right].$$
By leveraging distributional learning, BDL enhances boundary precision for small objects, mitigates numerical instability caused by tiny target sizes, and ultimately improves detection accuracy.
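A minimal PyTorch sketch of this negative log-likelihood is given below. For numerical stability we assume the network predicts a lower-triangular Cholesky factor of Σ rather than Σ itself; that parameterization is our assumption, not a detail stated in the paper.

```python
import torch

def bdl_loss(z, mu, chol):
    """NLL of a 4-D Gaussian: 0.5 * [ (z-mu)^T Sigma^{-1} (z-mu) + log|Sigma| + k log(2*pi) ].

    z, mu: (B, 4) predicted and ground-truth boxes [x, y, w, h].
    chol:  (B, 4, 4) lower-triangular Cholesky factor L with Sigma = L L^T.
    """
    k = z.size(-1)
    diff = (z - mu).unsqueeze(-1)                               # (B, 4, 1)
    # solve L a = diff  =>  diff^T Sigma^{-1} diff = ||a||^2  (Mahalanobis term)
    a = torch.linalg.solve_triangular(chol, diff, upper=False)
    maha = a.pow(2).sum(dim=(1, 2))
    logdet = 2.0 * torch.log(torch.diagonal(chol, dim1=-2, dim2=-1)).sum(-1)
    return 0.5 * (maha + logdet + k * torch.log(torch.tensor(2 * torch.pi)))

# toy usage: identity covariance via an identity Cholesky factor
z = torch.tensor([[10.0, 12.0, 4.0, 6.0]])
mu = torch.tensor([[10.5, 11.5, 4.5, 5.5]])
L = torch.eye(4).unsqueeze(0)
print(bdl_loss(z, mu, L))                                       # ≈ tensor([4.1758])
```

Because the Mahalanobis term stays finite and differentiable even when the predicted and ground-truth boxes do not overlap, gradients do not vanish in the IoU = 0 regime discussed later in Section 4.4.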

4. Results

4.1. Datasets and Evaluation Metrics

Datasets. We evaluate our method on three specialized SOD datasets, VisDrone [42], NWPU-VHR-10 [43], and WiderPerson [44], which collectively cover pedestrian detection in complex scenes, multi-scale remote sensing object recognition, and high-density aerial object detection. To further assess the generalizability of the proposed BDL, we also conduct experiments on two general-purpose datasets: COCO128 and COCO1000 [45].
VisDrone, captured from UAV viewpoints, contains objects such as pedestrians and vehicles characterized by small size, high density, and complex background clutter. It serves as a robust benchmark for testing model generalization in aerial SOD scenarios. WiderPerson is a large-scale pedestrian detection dataset featuring diverse pedestrians in complex scenes. It is particularly rich in small-scale, occluded, and variably posed instances. The high-resolution images and cluttered backgrounds make it well-suited for validating SOD performance under challenging real-world conditions. NWPU-VHR-10 is a high-resolution remote sensing dataset containing 10 object categories (e.g., aircraft, ships). It exhibits substantial scale variation, with a high proportion of small objects such as vehicles and vessels, making it ideal for evaluating multi-scale detection capability. COCO128 is a lightweight subset of MS COCO 2017, comprising the first 128 training images. It maintains a subset of the original 80 categories and is suitable for rapid prototyping and lightweight model validation due to its compact size and rich annotations. COCO1000 includes approximately 1000 images with higher scene complexity and object density, better approximating the full COCO dataset's class distribution. It supports multi-task evaluation including detection and segmentation, making it suitable for assessing algorithmic robustness.
Evaluation Metrics. We adopt mAP@0.5, mAP@0.5:0.95, Precision, and Recall as our primary evaluation metrics.
Here, mAP@0.5 denotes the mean Average Precision at an IoU threshold of 0.5, reflecting detection performance under moderate localization requirements. mAP@0.5:0.95 represents the average mAP computed over IoU thresholds from 0.5 to 0.95 with a step size of 0.05, providing a comprehensive measure of localization robustness. Precision quantifies the proportion of correctly identified positive predictions, while Recall measures the model’s ability to detect all relevant ground-truth objects. These metrics collectively offer a multi-faceted assessment of detection performance across both localization accuracy and classification reliability.

4.2. Implementation Details

All experiments were conducted on a server equipped with four NVIDIA RTX 4080 GPUs. Our model builds upon the official YOLOv8 implementation (version 8.0.110), where we integrate the proposed SL-Mamba module into the backbone and replace the original localization loss with our BDL. Distributed Data Parallel (DDP) training was employed across all available GPUs to accelerate the process.
The model was trained for 200 epochs with a batch size of 16. We utilized the AdamW optimizer with Cosine Annealing learning rate scheduling. Training stability was enhanced through a linear warmup phase at initialization and the application of Exponential Moving Average (EMA). Standard data augmentation techniques including random flipping and color jittering were applied during training.
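For readers who want to reproduce a comparable schedule, the sketch below wires up AdamW, a linear warmup followed by cosine annealing, and a weight EMA in PyTorch. The specific learning rate, decay, warmup length, and EMA momentum are illustrative assumptions, not hyperparameters reported in the paper, and the model is a stand-in for the detector.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Conv2d(3, 16, 3)                  # stand-in for the detector
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_iters = 500
sched = SequentialLR(
    opt,
    [LinearLR(opt, start_factor=0.01, total_iters=warmup_iters),   # linear warmup
     CosineAnnealingLR(opt, T_max=200 * 1000)],                    # 200 epochs x iters/epoch (illustrative)
    milestones=[warmup_iters])

# exponential moving average of the weights, as mentioned in the text
ema = AveragedModel(model, avg_fn=lambda avg, w, n: 0.999 * avg + 0.001 * w)

for step in range(3):                              # toy training loop
    loss = model(torch.randn(2, 3, 64, 64)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
    ema.update_parameters(model)
```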

4.3. Comparison with State-of-the-Art Methods

Quantitative Comparison. Table 1 presents a quantitative comparison of different detection architectures on the VisDrone dataset. The proposed method achieves superior performance across all four evaluation metrics, mAP@0.5, mAP@0.5:0.95, Precision, and Recall, consistently outperforming both Transformer-based and CNN-based counterparts.
Specifically, our method attains 44.5% in mAP@0.5 and 27.3% in mAP@0.5:0.95, surpassing YOLOv10s (42.8% and 25.5%, respectively) and Sparse DETR (44.0% and 27.0%). This balanced improvement across different IoU thresholds suggests enhanced localization robustness. In terms of Precision and Recall, our method reaches 56.2% and 42.3%, respectively, exceeding YOLOv10s (53.2% and 40.1%). The simultaneous improvement in both metrics indicates a better trade-off between reducing false positives and maintaining high detection coverage. The comprehensive performance gains demonstrate the overall effectiveness of our integrated approach for small object detection.
To further validate generalization capability, extended experiments were conducted on WiderPerson and NWPU-VHR-10. The consistent improvements across these diverse benchmarks confirm the robustness of our method in various small object detection scenarios, from crowded pedestrian scenes to high-resolution remote sensing imagery.
Qualitative Comparison. Figure 2, Figure 3 and Figure 4 illustrate the visual detection results of the proposed method and other representative models on typical datasets. The proposed method demonstrates superior target perception and localization accuracy across multiple complex environments.
In high-density scenarios such as urban street views, as shown in Figure 2, the proposed LCB-Net accurately detects numerous occluded or distant small objects, producing bounding boxes that are more complete and uniformly distributed. In contrast, while the YOLO series models achieve high inference speeds, they exhibit noticeable missed detections, particularly in shadowed areas and low-resolution regions. Concurrently, DETR-based models (e.g., DETR, Deformable DETR) often generate incomplete or redundant boxes in these cluttered scenes. A further comparison using pedestrian detection examples in Figure 3 reveals that the bounding boxes generated by our method align more closely with the actual target contours in both size and position. The YOLO-series models are prone to issues such as bounding box misalignment and incorrect merging of multiple adjacent objects. Meanwhile, certain DETR-series models (e.g., Sparse DETR) show improved localization over standard DETR but still struggle with inconsistent detection confidence in crowded areas. Furthermore, in aerial imagery scenarios containing distant small objects (Figure 4), our method demonstrates higher detection sensitivity for small vehicles, consistently and reliably identifying target locations. The YOLO series, on the other hand, suffers from low confidence scores and frequent missed detections of small targets. While RT-DETR shows competitive speed among transformer-based detectors, its detection completeness for extremely small objects (e.g., distant vehicles) remains inferior to our approach.
In summary, the proposed LCB-Net effectively captures subtle target features, enhances contextual relationships, and suppresses background interference. It mitigates the localization inaccuracy and missed detections observed in YOLO-series models, while also addressing the slow inference and incomplete detections characteristic of many DETR-based models, thereby exhibiting excellent detection performance across complex, high-density, and high-resolution conditions.
Efficiency Comparison. The experimental results in Table 2 demonstrate that our proposed LCB-Net achieves an excellent balance between model complexity and inference efficiency. While maintaining competitive performance (125 FPS on RTX 4090), our method significantly reduces computational overhead compared to other DETR-based approaches. Specifically, LCB-Net requires 15.2 M parameters and 42.5 G FLOPs, which are approximately 2.7× and 3.4× lower than the standard DETR baseline, respectively. Although YOLO-series models exhibit superior inference speed, their performance in complex scenarios remains limited. The substantial efficiency improvement of LCB-Net over other transformer-based detectors makes it particularly suitable for real-time applications on resource-constrained devices.

4.4. Ablation Study

Impact of Core Modules. We conducted systematic ablation studies on the VisDrone, WiderPerson, and NWPU-VHR-10 datasets, using YOLOv8 as the baseline model. The contributions of each component were evaluated by incrementally adding or replacing corresponding modules. The experimental results are summarized in Table 3.
As shown in the table, after introducing the proposed long-range Mamba mechanism on the VisDrone dataset, the model achieved improvements of 1.4%, 1.1%, 1.5%, and 1.0% in mAP@0.5, mAP@0.5:0.95, Precision, and Recall, respectively. Similar trends were observed on WiderPerson and NWPU-VHR-10, where the Mamba module consistently enhanced perception of small object features across all datasets, demonstrating its effectiveness in modeling long-range dependencies.
Upon further integrating the SDM, all metrics showed additional gains on all three datasets, indicating that explicitly modeling salient local regions effectively strengthens small object representation. It is worth noting that the multi-scale saliency aggregation mechanism (across P3-P5 levels) effectively mitigates the risk of false positives caused by background activations, as evidenced by the improved precision metrics across all datasets. The CBAM module further improved performance, particularly on VisDrone, where it contributed 0.2% in mAP@0.5 and 1.4% in mAP@0.5:0.95, highlighting its ability to focus on key regions while suppressing background noise.
Moreover, replacing the original CIoU loss with the proposed BDL led to consistent improvements across all datasets and metrics, confirming that the distribution-based bounding box modeling effectively mitigates gradient vanishing issues and enhances localization performance for small objects. Notably, on NWPU-VHR-10, the full model (with SL-Mamba, CBAM, and BDL) achieved the highest scores of 94.1% Precision, 84.6% Recall, 92.4% mAP@0.5, and 62.7% mAP@0.5:0.95.
These experiments demonstrate that the SL-Mamba module enhances perceptual capability for small object features, while BDL improves the robustness and confidence of bounding box regression. The two modules operate in a complementary manner—one in contextual modeling and the other in bounding box localization—jointly contributing to significant and consistent performance gains across diverse small object detection scenarios.
BDL Analysis. Figure 5 illustrates the per-class performance in terms of mAP@0.5 and mAP@0.5:0.95 for different loss functions on the VisDrone dataset. The results demonstrate the superior capability of our BDL in detecting a wide range of object categories. Furthermore, we validated the generalizability of BDL on generic datasets, as summarized in Table 4. These consistent gains across datasets confirm that BDL is not only a more effective loss function for small objects but also a robust and versatile solution for object detection in general. These consistent improvements can be directly attributed to the core algorithmic advantages of BDL:
  • Addressing gradient vanishing at IoU = 0. Traditional IoU-based loss functions exhibit fundamental limitations when handling non-overlapping bounding boxes: once IoU becomes zero, the loss saturates at a fixed value, failing to reflect the spatial relationship between completely separated boxes. This binary failure mode leads to gradient vanishing and optimization stagnation. In contrast, the proposed BDL framework circumvents this issue through distributional modeling. By measuring the Mahalanobis distance between the predicted and ground-truth distributions, BDL maintains sensitivity even in non-overlapping scenarios. The covariance matrix Σ enables quantitative assessment of distributional divergence, ensuring continuous gradient flow and stable optimization regardless of overlap conditions.
  • Mitigating oversensitivity to small object annotations. Small object detection is particularly susceptible to annotation noise and variance due to the low pixel coverage and inherent localization ambiguity. Conventional regression losses treat all dimensional errors equally, often forcing the model to overfit to annotation inaccuracies. BDL introduces an intelligent weighting mechanism through the inverse covariance matrix Σ 1 . When high uncertainty exists in certain dimensions (e.g., height and width of tiny objects), the corresponding elements in Σ 1 automatically downweight their contribution to the total loss. This uncertainty-aware design redirects the model’s focus toward more reliable dimensions during optimization, effectively suppressing overfitting to noisy annotations and improving generalization performance.
  • Probabilistic representation for enhanced robustness. By modeling bounding boxes as multivariate Gaussian distributions, BDL fundamentally enhances the robustness of small object detection. The probabilistic representation naturally accommodates the inherent ambiguity in small object localization, transforming the learning objective from deterministic fitting to distributional alignment. This approach not only resolves the gradient vanishing issue but also provides a principled mechanism for handling annotation uncertainties, ultimately leading to more efficient and accurate detection of challenging small objects.

4.5. Analysis of Challenging Conditions

We further analyze the model’s performance under challenging real-world conditions such as severe weather, varying lighting, occlusion, and extreme scale variations. Among the three datasets employed in our study, VisDrone contains numerous challenging scenarios including adverse weather conditions (e.g., fog, haze), varying illumination (e.g., overexposure, shadows), and severe occlusion (e.g., crowded streets, overlapping objects). NWPU-VHR-10, while minimally affected by weather variations, presents significant challenges in terms of extreme scale variations and complex background clutter.
Experimental results across the three datasets demonstrate that the proposed LCB-Net achieves the most substantial performance improvements on the VisDrone dataset, followed by NWPU-VHR-10. This performance pattern can be attributed to two key design elements of our approach. First, the SL-Mamba module enhances long-range dependency modeling, which helps maintain contextual awareness under severe occlusion and weather degradation conditions, allowing the model to infer missing visual information from surrounding environmental contexts. Second, BDL loss provides more stable training through probabilistic bounding box regression, particularly effective under ambiguous conditions where object boundaries become unclear due to weather effects or occlusion.
Despite these improvements, our analysis reveals that missed detections still occur under extreme occlusion scenarios and cases with drastic scale variations. These failure modes primarily manifest in heavily crowded scenes with significant object overlap and in remote sensing images with substantial scale differences. To address these limitations, enhanced multi-scale feature fusion strategies and explicit occlusion reasoning mechanisms will be explored to further improve robustness in these challenging scenarios.

5. Conclusions

In this work, we present LCB-Net, a novel framework that combines long-range context modeling with probabilistic bounding box regression to address fundamental challenges in SOD. Our approach introduces two core contributions: the SL-Mamba module, which enables efficient global dependency modeling through state space mechanisms, and the BDL, which overcomes critical limitations of traditional IoU-based losses by reformulating localization as a distribution learning task. Comprehensive evaluations demonstrate that LCB-Net achieves state-of-the-art performance across specialized small object benchmarks. Notably, the proposed BDL consistently outperforms existing loss functions not only in small object scenarios but also in general object detection, demonstrating its universal applicability and robustness.
This work provides valuable insights into probabilistic localization and efficient context modeling, establishing a solid foundation for future research in developing more adaptive and robust vision systems. The principles demonstrated here could potentially benefit a wider range of computer vision tasks beyond object detection. Future research will explore alternative probability distributions to enhance regression robustness and investigate the integration of attention mechanisms with probabilistic frameworks [52,53,54].

Author Contributions

Conceptualization, methodology, and writing—original draft, Y.Q. and Y.L.; formal analysis and investigation, Y.Q. and Y.L.; validation and writing—review and editing, Y.Q. and M.L.; supervision and project administration, Y.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62306072.

Data Availability Statement

The original data presented in the study are openly available. The citations for all these datasets are provided in the reference list.

Conflicts of Interest

Author Yun Liang was employed by the company Huawei Technologies Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, H.; Gao, P. Survey Of Small Object Detection Methods Based On Deep Learning. In Proceedings of the 2024 9th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 21–23 November 2024; Volume 9, pp. 221–224. [Google Scholar] [CrossRef]
  2. Zheng, X.; Bi, J.; Li, K.; Zhang, G.; Jiang, P. SMN-YOLO: Lightweight YOLOv8-Based Model for Small Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8001305. [Google Scholar] [CrossRef]
  3. Wang, G.; Liu, X.; Wang, Y.; Li, X.; Zhang, H.; Jiang, M.; Song, S. Deep learning for pulmonary nodule detection: A comparative study of 2D and 3D convolutional neural networks. Med. Image Anal. 2021, 71, 102052. [Google Scholar]
  4. Zhang, M.; Wang, Y.; Lin, J.; Zhao, Y. Transformer-based small object detection in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2023, 197, 309–322. [Google Scholar]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  6. Cui, L.; Jiang, R.; Li, Z. MDSSD: Multi-scale deconvolutional single shot detector for small objects. IEEE Trans. Image Process. 2019, 28, 829–840. [Google Scholar] [CrossRef]
  7. Liu, Z.; Li, D.; Ge, S.S.; Tian, F. Small traffic sign detection from large image. Appl. Intell. 2019, 49, 2001–2013. [Google Scholar] [CrossRef]
  8. Li, Y.; Huang, Q.; Pei, X.; Chen, Y.; Jiao, L.; Shang, R. Cross-Layer Attention Network for Small Object Detection in Remote Sensing Imagery. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2021, 14, 2148–2161. [Google Scholar] [CrossRef]
  9. Lee, G.; Hong, S.; Cho, D. Self-Supervised Feature Enhancement Networks for Small Object Detection in Noisy Images. IEEE Signal Process. Lett. 2021, 28, 1026–1030. [Google Scholar] [CrossRef]
  10. Wang, X.; Zhang, S.; Yu, Z.; Feng, L.; Zhang, W. Scale-Equalizing Pyramid Convolution for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 13356–13365. [Google Scholar]
  11. Wang, Y.; Li, H.; Bai, X. MR-CNN: Multi-scale region-based CNN for small object recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Palo Alto, CA, USA, 2019; pp. 9213–9220. [Google Scholar]
  12. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2874–2883. [Google Scholar]
  13. Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for fine-grained category detection. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2014; Volume 8689. [Google Scholar]
  14. Müller, J.; Dietmayer, K. Traffic light detection using deep learning with spatial pyramid pooling. In Proceedings of the IEEE Intelligent Vehicles Symposium, Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 680–685. [Google Scholar]
  15. Chan, S.; Yu, M.; Chen, Z.; Mao, J.; Bai, C. Regional Contextual Information Modeling for Small Object Detection on Highways. IEEE Trans. Instrum. Meas. 2023, 72, 2531613. [Google Scholar] [CrossRef]
  16. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  17. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1222–1230. [Google Scholar]
  18. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. SOD-MTGAN: Small object detection via multi-task generative adversarial network. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2018; Volume 11217. [Google Scholar]
  19. Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-resolution transformer for dense prediction. arXiv 2021, arXiv:2110.09408. [Google Scholar] [CrossRef]
  20. Nazeri, K.; Thasarathan, H.; Ebrahimi, M. Edge-informed single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3008–3017. [Google Scholar]
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  22. Sun, Y.; Huang, X.; Zhou, H.; Zhang, Q. SRPN: Similarity-based region proposal networks for nuclei and cells detection in histology images. Med. Image Anal. 2021, 72, 102142. [Google Scholar] [CrossRef] [PubMed]
  23. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–26 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 14449–14458. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9992–10002. [Google Scholar]
  25. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  26. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO, (Version 8.0.0) [Computer Software]; Ultralytics: Ballenger Creek, MD, USA, 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 November 2025).
  27. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  28. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  29. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  30. Gu, A.; Goel, K.; Re, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar] [CrossRef]
  31. Geng, X. Label distribution learning. IEEE Trans. Knowl. Data Eng. 2016, 28, 1734–1748. [Google Scholar] [CrossRef]
  32. Geng, X.; Yin, C.; Zhou, Z.H. Facial age estimation by learning from label distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 3422–3429. [Google Scholar]
  33. Xu, N.; Liu, Y.P.; Geng, X. Label Enhancement for Label Distribution Learning. IEEE Trans. Knowl. Data Eng. 2021, 33, 1632–1643. [Google Scholar] [CrossRef]
  34. Xu, N.; Hu, Y.; Qiao, C.; Geng, X. Aligned Objective for Soft-Pseudo-Label Generation in Supervised Learning. In Proceedings of the Forty-First International Conference on Machine Learning, ICML 2024, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  35. Xu, N.; Qiao, C.; Zhao, Y.; Geng, X.; Zhang, M.L. Variational Label Enhancement for Instance-Dependent Partial Label Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11298–11313. [Google Scholar] [CrossRef]
  36. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 16000–16009. [Google Scholar]
  37. Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual Event, 26–30 April 2020. [Google Scholar]
  38. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  39. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; AAAI Press: Palo Alto, CA, USA, 2020; pp. 12993–13000. [Google Scholar]
  40. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 21002–21012. [Google Scholar]
  41. Murrugarra-Llerena, J.; Kirsten, L.N.; Zeni, L.F.; Jung, C.R. Probabilistic Intersection-Over-Union for Training and Evaluation of Oriented Object Detectors. IEEE Trans. Image Process. 2024, 33, 671–681. [Google Scholar] [CrossRef]
  42. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
  43. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  44. Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild. IEEE Trans. Multimed. (TMM) 2019, 22, 380–393. [Google Scholar] [CrossRef]
  45. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  46. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; Xie, T.; Fang, J.; Imphxy; et al. Ultralytics, YOLOv5 [Computer Software]; Ultralytics: Ballenger Creek, MD, USA, 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 12 November 2025).
  47. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  48. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  49. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  50. Zhu, F.; Chen, X.; Wang, J.; Loy, C.C.; Lin, D. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 14464–14473. [Google Scholar]
  51. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  52. Nix, D.; Weigend, A. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN’94), Orlando, FL, USA, 27 June–2 July 1994; Volume 1, pp. 55–60. [Google Scholar] [CrossRef]
  53. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Proceedings of the Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  54. Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. Fine-grained activity classification in assembly based on multi-visual modalities. J. Intell. Manuf. 2023, 35, 2215–2233. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of our LCB-Net. The model integrates the Saliency-guided Long-range Mamba (SL-Mamba) module into the backbone for enhanced feature representation and leverages the Bounding Box Distribution Loss (BDL) in the detection head for improved localization accuracy.
Figure 2. Qualitative results on VisDrone, arranged from top to bottom and left to right: top row (YOLOv5s, YOLOv8s, YOLOv10s); middle row (DETR, Deformable DETR, Sparse DETR); bottom row (RT-DETR, LCB-Net, Ground Truth). Green boxes indicate magnified detail regions.
Figure 3. Qualitative results on WiderPerson, arranged from top to bottom and left to right: top row (YOLOv5s, YOLOv8s, YOLOv10s); middle row (DETR, Deformable DETR, Sparse DETR); bottom row (RT-DETR, LCB-Net, Ground Truth). Green boxes indicate magnified detail regions.
Figure 4. Qualitative results on NWPU-VHR-10, arranged from top to bottom and left to right: top row (YOLOv5s, YOLOv8s, YOLOv10s); middle row (DETR, Deformable DETR, Sparse DETR); bottom row (RT-DETR, LCB-Net, Ground Truth). Green boxes indicate magnified detail regions.
Figure 5. Class-wise comparison of different loss functions on VisDrone dataset.
Table 1. Quantitative comparison of different detection models on the VisDrone, WiderPerson, and NWPU-VHR-10 datasets. The best performance is highlighted in bold.
Methods | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
VisDrone
YOLOv5s [46] | 48.3 | 35.2 | 34.9 | 19.5
YOLOv8s [26] | 51.7 | 39.2 | 40.2 | 23.0
YOLOv10s [47] | 54.6 | 42.1 | 44.1 | 26.7
DETR [48] | – | – | 39.3 | 23.1
Deformable DETR [49] | – | – | 43.7 | 26.8
Sparse DETR [50] | – | – | 44.0 | 27.0
RT-DETR-R18 [51] | – | – | 44.6 | 26.7
LCB-Net (Ours) | 56.2 | 42.3 | 44.5 | 27.3
WiderPerson
YOLOv5s [46] | 81.1 | 66.2 | 78.7 | 59.3
YOLOv8s [26] | 87.3 | 76.5 | 84.9 | 66.1
YOLOv10s [47] | 89.2 | 77.2 | 86.3 | 66.6
DETR [48] | – | – | 69.5 | 43.2
Deformable DETR [49] | – | – | 76.5 | 48.6
Sparse DETR [50] | – | – | 78.5 | 51.7
RT-DETR-R18 [51] | – | – | 80.2 | 52.9
LCB-Net (Ours) | 90.1 | 77.4 | 86.7 | 66.9
NWPU-VHR-10
YOLOv5s [46] | 90.6 | 80.9 | 90.1 | 58.9
YOLOv8s [26] | 92.1 | 81.1 | 91.3 | 61.8
YOLOv10s [47] | 93.5 | 83.7 | 92.1 | 62.5
DETR [48] | – | – | 89.7 | 45.9
Deformable DETR [49] | – | – | 91.2 | 58.6
Sparse DETR [50] | – | – | 91.6 | 59.2
RT-DETR-R18 [51] | – | – | 92.5 | 59.9
LCB-Net (Ours) | 94.1 | 84.6 | 92.4 | 62.7
Table 2. Comparison of Model Efficiency.
Methods | Params (M) | FLOPs (G) | FPS (RTX 4090)
YOLOv5s [46] | 7.2 | 16.5 | 244
YOLOv8s [26] | 11.1 | 28.6 | 302
YOLOv10s [47] | 8.3 | 23.8 | 395
DETR [48] | 41.3 | 86.2 | 28
Deformable DETR [49] | 39.8 | 173.0 | 19
Sparse DETR [50] | 38.2 | 142.0 | 42
RT-DETR-R18 [51] | 32.0 | 58.0 | 108
LCB-Net (Ours) | 15.2 | 42.5 | 125
Table 3. Ablation study of different components. The best performance is highlighted in bold.
Module Configuration | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
VisDrone
Baseline | 51.7 | 39.2 | 40.2 | 23.0
+ SL-Mamba * | 54.4 | 40.8 | 43.1 | 25.0
+ SL-Mamba † | 55.6 | 41.5 | 44.3 | 25.9
+ SL-Mamba †, CBAM | 56.2 | 42.3 | 44.5 | 27.3
+ SL-Mamba †, CBAM, BDL | 56.7 | 42.5 | 44.9 | 27.9
WiderPerson
Baseline | 87.3 | 76.5 | 84.9 | 66.1
+ SL-Mamba * | 88.5 | 77.0 | 85.6 | 66.5
+ SL-Mamba † | 89.2 | 77.3 | 86.1 | 66.7
+ SL-Mamba †, CBAM | 89.7 | 77.4 | 86.4 | 66.7
+ SL-Mamba †, CBAM, BDL | 90.1 | 77.4 | 86.7 | 66.9
NWPU-VHR-10
Baseline | 92.1 | 81.1 | 91.3 | 61.8
+ SL-Mamba * | 92.8 | 82.0 | 91.8 | 62.1
+ SL-Mamba † | 93.4 | 83.5 | 92.2 | 62.4
+ SL-Mamba †, CBAM | 93.8 | 84.2 | 92.3 | 62.6
+ SL-Mamba †, CBAM, BDL | 94.1 | 84.6 | 92.4 | 62.7
* SDM not activated; † SDM activated.
Table 4. Object detection performance comparison of different loss functions on generic datasets. The best performance is highlighted in bold.
Datasets | Loss Function | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
COCO128 | CIoU | 94.6 | 84.9 | 95.1 | 82.5
COCO128 | ProbIoU | 93.1 | 97.1 | 97.6 | 90.0
COCO128 | BDL (Ours) | 96.4 | 96.3 | 97.7 | 91.1
COCO1000 | CIoU | 97.8 | 96.9 | 97.7 | 89.3
COCO1000 | ProbIoU | 97.3 | 96.8 | 98.2 | 90.2
COCO1000 | BDL (Ours) | 97.9 | 97.2 | 98.0 | 90.5
