Article

Hierarchical Attention-Driven Detection of Small Objects in Remote Sensing Imagery

by Xinyu Liu 1, Xiongwei Sun 2,* and Jile Wang 2

1 School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
2 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 455; https://doi.org/10.3390/rs18030455
Submission received: 4 December 2025 / Revised: 14 January 2026 / Accepted: 22 January 2026 / Published: 1 February 2026

Highlights

What are the main findings?
  • Constraining the network with classical statistical models (e.g., DoG, CLAHE) provides a scientifically principled prior, which enhances the extraction of diverse features for small objects and leads to more robust model capabilities.
  • The proposed hierarchical attention-driven framework integrating statistically constrained pre-extraction, top–down guidance, and bottom–up fusion is validated as an effective solution for the specific challenges of small object detection in remote sensing.
What are the implications of the main findings?
  • For the field: The findings demonstrate that a hybrid strategy, which integrates model-based guidance with data-driven learning, yields superior results compared to using either approach in isolation. This provides a validated path forward for optimizing small object detection models in remote sensing.
  • For practice: Combining the complementary strengths of top–down and bottom–up feature fusion leads to experimentally validated improvements in detection stability, especially for small objects.

Abstract

Accurate detection of small objects in remote sensing imagery remains challenging due to their limited texture, sparse features, and weak contrast. To address this, we propose an enhanced small object detection model integrating top–down and bottom–up attention mechanisms. First, we design two statistical model-constrained feature pre-extraction networks that enhance the spatial patterns of small objects before they enter the backbone network. Next, a top–down attention mechanism, following an overview-then-refinement process, guides region-level feature extraction. Finally, a bottom–up fusion strategy integrates micro features with macro structural features, enhancing the representational capacity of the limited features available for small objects. Evaluations on the AI-TOD and SODA-A datasets show that our method outperforms existing benchmark models. On the AI-TOD dataset, it improves AP and AP50 by 0.3% and 2.7%, respectively. More notably, on the more challenging SODA-A dataset, it achieves gains of 0.5% in AP and 1.4% in AP50. These consistent improvements across datasets verify the effectiveness of our method in boosting detection performance, particularly for small targets.

1. Introduction

The detection of small targets in optical remote sensing images has long been recognized as a central challenge in computer vision, constituting both the technical core and a major performance bottleneck across multiple application domains, such as satellite remote sensing [1,2,3], public safety [4,5], and target search [6,7,8]. With the rapid advancement of deep learning in remote sensing, neural network-based approaches have become the first choice for solving the small target detection problem.
However, small target detection in remote sensing imagery still faces several critical challenges, including low pixel coverage, weak feature representation, susceptibility to complex background interference, and significant scale variations under different imaging conditions. To overcome these limitations, optimization strategies in neural network-based detection can be broadly categorized into several directions: multi-scale feature fusion [7,8], context augmentation [9,10], attention mechanism design [11,12,13,14], and feature enhancement specialized for small objects. For instance, feature pyramid networks (FPNs) [15] improve multi-scale representation through top–down fusion, while methods like PANet [16] further introduce bottom–up pathways to enhance localization accuracy. However, in the FPN architecture, the path from shallow features to the final prediction layer tends to be long and indirect, which may cause the detailed information of small objects to weaken during transmission. Attention-based approaches, such as SE blocks [17,18] and CBAM [19,20], strengthen informative channels and spatial regions by assigning weighting mechanisms. Nevertheless, attention weights are typically computed based on global or local feature responses. Due to the advantage of a larger pixel area and stronger feature responses, large objects often receive higher weights, thereby suppressing the features of small objects in both channel and spatial dimensions. Moreover, commonly used operations such as global pooling in these methods easily lead to the loss of spatial details, which is detrimental to the precise localization of small targets. In addition, several specially designed modules, such as TridentNet [21] and RFB-Net [22], aim to enhance small object detection performance by expanding receptive fields or refining fine-grained features. 
TridentNet-like methods employ weight-shared multi-branch dilated convolutions to explicitly generate features with different receptive fields, yet their discrete scale divisions struggle to adapt to the continuously varying scales of small objects. Meanwhile, approaches like RFB-Net densely aggregate multi-scale context through multi-branch heterogeneous convolutions but risk introducing irrelevant background noise due to excessively large receptive fields, which may compromise the purity of small and weak target features.
Through observation, this paper emphasizes that optimizing the detection framework and enhancing feature extraction capability—particularly for small targets—play decisive roles in characterizing their distinctiveness. To this end, we propose a bio-inspired architecture incorporating both top–down and bottom–up multi-scale feature fusion strategies, enabling bidirectional feature integration and stronger fusion performance. Furthermore, by incorporating statistical priors, we design a learnable network structure guided by morphological rules of small targets to reinforce their structural feature strength. Additionally, a neural analog of the CLAHE [23,24] contrast enhancement strategy is introduced to sharpen edge structures and improve contrast for small objects. Together, these contributions significantly enhance small target detection performance.
To address the above challenges, this study proposes a multi-scale small target detection network based on a hybrid bidirectional feature abstraction mechanism, and the main work is as follows:
  • Small target pattern adaptive enhancement: By introducing local adaptive filtering kernels [25] and combining these with the channel attention mechanism, the layer-by-layer enhanced representation of small target patterns is achieved, and feature screening is carried out channel by channel to enhance the stability of small target features.
  • Weak target pattern adaptive enhancement: The local adaptive enhancement algorithm is introduced to enhance the weak contrast structure of the target and improve the model’s ability to learn weak contrast features. Depthwise separable convolution (DWConv) [26] is integrated to facilitate channel-wise pattern expansion and to filter for beneficial feature channels, ultimately enhancing the structural representation ability of target features.
  • Detailed feature extraction guided by macroscopic structural features: Drawing on the backbone network of OverLoCK [16], which utilizes a hierarchical architecture, our method uses macroscopic spatial structural features to guide and align local detailed features crucial for identifying small targets. This creates a feature extraction structure for local small targets that operates under the guidance of macroscopic structural features. This top–down, context-guided approach ensures that local feature extraction is focused on salient regions, improving accuracy and efficiency.
  • Bidirectional feature propagation architecture: The network achieves bidirectional feature propagation through a top–down feature fusion process, where macro-structural characteristics guide detailed feature extraction, and a bottom–up path, realized by the C2f [27,28] structure, which abstracts these details. This synergy realizes bidirectional closed-loop feature propagation and multi-scale integration enhanced by combining global semantic reasoning and local fine-grained features, enhancing the environmental adaptability of small target detection.

2. Related Work

2.1. Feature Enhancement Methods

Feature enhancement for small target detection in remote sensing has evolved along two main directions: network architecture improvements and low-level feature reinforcement. Regarding the issue of feature enhancement, classical approaches focus on multi-scale feature fusion (e.g., FPN [15], PANet [16]) and attention mechanisms (e.g., SE [17], CBAM [19]) to strengthen semantic representation and suppress background interference. Although effective, these methods often lack explicit mechanisms to enhance the inherently weak structural and contrast characteristics of small targets. To address this, we have revisited statistical image enhancement techniques as a means to amplify low-level cues. Notably, Difference of Gaussians (DoG) [24] and Contrast-Limited Adaptive Histogram Equalization (CLAHE) [23]—originally developed for traditional image processing—have been transformed into learnable modules within deep networks. DoG enhances edge and fine-structure responses through band-pass filtering, while CLAHE improves local contrast and enhances weak small target edges. When integrated into neural frameworks, these methods provide structure-aware preprocessing that highlights subtle yet discriminative patterns essential for small targets.
Inspired by these hybrid strategies, our work proposes learnable statistical enhancement modules that adaptively reinforce small target features based on morphological priors and local contrast characteristics, bridging classical image enhancement with deep feature learning.

2.2. Feature Fusion Method

In the current mainstream deep learning backbone network design, the network architecture generally adopts the design paradigm of a layered structure for the technical requirements of small target detection. According to the different directions of feature transfer, the existing layering schemes can be mainly categorized into two strategies: bottom–up [7,8,9,10] and top–down [11,12,13,14].
The bottom–up layering strategy, as the dominant design paradigm, realizes the abstraction from low-level to high-level features through a layer-by-layer feature encoding mechanism. In this strategy, the generation of high-level features is strictly dependent on the output of the preceding layer of features. By following the paradigm of traditional serial processing, this design has the advantage of being highly structurally interpretable and highly compatible with classical theories of feature description such as hierarchical models of the visual cortex, and has therefore been widely adopted. The top–down hierarchical strategy, on the other hand, mimics the attentional mechanisms of the biological visual system by guiding underlying feature extraction with high-level semantic information. Although neuroscientific studies have shown that this strategy is closer to the mechanism of human visual perception, such as Gregory’s theory of “perceptual hypothesis testing”, it faces significant challenges in engineering practice: the design of this class of models lacks compatibility with modern visual backbone implementation strategies, as some approaches are unsuitable for implementation [7,8]. Current practices focus on recursive architectures [12,14], whereas recursive operations usually incur additional computational overhead, resulting in a challenging trade-off between performance and computational complexity and thereby constraining their applicability.
Although the bottom–up backbone network structure is computationally efficient, easy to design, and usually provides strong network interpretability, clear feature hierarchy, and smooth semantic transition from low to high levels, it also has notable limitations. High-level features may lose details, and multiple downsampling operations (e.g., pooling) may lead to the omission of small objects or fine-grained information. In addition, the lack of contextual feedback may prevent the high-level semantics from directly influencing the low-level feature extraction, ultimately hindering further performance improvements. Top–down network structures typically require elaborate skip connections or feature fusion mechanisms to ensure effective information flow. In addition, the error feedback of high-level features may interfere with the low-level features, which requires a stable global attention mechanism for guidance to achieve better performance.
In addition, regarding the hierarchical and efficient fusion of features, existing classical frameworks such as FPN [15] and BiFPN [16] still face two core challenges. The first is achieving stable cross-level feature scale alignment, i.e., ensuring semantic consistency among features at different levels of abstraction. The second is balancing computational efficiency and model performance: complex bidirectional interaction mechanisms often introduce considerable computational overhead, and current methods still trade off model complexity against detection accuracy poorly, which severely limits the practical applicability of multi-scale target detection, particularly for small targets.

3. Materials and Methods

3.1. Datasets

The datasets used in the experiment are the small object datasets AI-TOD [29] and SODA-A [30]. The AI-TOD dataset consists of 28,036 images of size 800 × 800 and 700,621 annotated objects. The average size of the objects in this dataset is 12.8 pixels, with 85.6% of the objects being smaller than 16 pixels. The dataset includes eight major categories: aircraft, bridges, tanks, ships, swimming pools, vehicles, people, and windmills. The entire dataset is randomly divided into three subsets, with 2/5 used for training, 1/10 for validation, and 1/2 for testing. The SODA-A dataset consists of 2513 images and 872,069 annotated objects, covering nine categories, including aircraft, helicopters, small vehicles, large vehicles, ships, containers, tanks, swimming pools, and windmills. The entire dataset is divided into training, validation, and testing sets, with each subset comprising 1/2, 1/5, and 3/10 of the total dataset.

3.2. Experimental Details

To validate the effectiveness of the model and ensure the reproducibility of the experimental results, the software environment and system settings used in the experiment are listed in Table 1.
During the experiment, a learning rate decay method was applied, and 150 epochs of training were uniformly conducted to maintain stability. The training hyperparameters are shown in Table 2.

3.3. Evaluation Metrics

The detection performance of the model is evaluated using average precision (AP) [31]. For the AI-TOD dataset, the experiment follows the seven evaluation metrics of AI-TOD [29], namely AP, AP50, AP75, APvt, APt, APs, and APm. For the SODA-A dataset, AP, AP50, and AP75 are used for evaluation. Here, AP is the mean of the average precision values at IoU thresholds from 0.5 to 0.95 in steps of 0.05; AP50 and AP75 denote the average precision at IoU thresholds of 0.5 and 0.75, respectively; APvt, APt, APs, and APm measure the model's detection capability for objects of different sizes over the same IoU threshold range. Based on pixel size, the targets to be detected are classified as very tiny, tiny, small, and medium [32,33,34,35], enabling us to evaluate the multi-scale detection capabilities of the tested models.
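As a concrete illustration of the AP definition above, the multi-threshold AP is simply the mean of the per-threshold APs; a minimal sketch (the threshold grid and averaging only, not the full matching pipeline):

```python
import numpy as np

# IoU thresholds 0.50-0.95 in steps of 0.05 (10 values)
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)

def multi_threshold_ap(ap_per_threshold):
    """Average the per-threshold AP values, as in the AP metric used here.

    `ap_per_threshold` holds one AP value per IoU threshold; AP50 and AP75
    are simply the entries at thresholds 0.50 and 0.75."""
    assert len(ap_per_threshold) == len(IOU_THRESHOLDS)
    return float(np.mean(ap_per_threshold))
```

AP50 in isolation rewards loose localization, while the averaged metric penalizes boxes that drift from the tiny ground-truth extents, which is why both are reported.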

3.4. Hierarchical Attention-Driven Methodological Framework

This paper constructs a hierarchical attention-driven (HAD) network, a framework that leverages multi-level attention mechanisms, as shown in Figure 1. The first stage builds two feature extraction network structures based on pattern rule constraints: SpotEnh Net and AHEEnh Net. To address the feature sparsity of small targets, SpotEnh Net constructs a learnable local adaptive feature enhancement module by statistically analyzing the structural patterns of infrared images. The SpotEnh Net module simulates the effect of DoG [24] enhancement, thereby strengthening the centrally symmetric structure of small targets and increasing the diversity of small target features. Since small targets typically exhibit weak contrast, AHEEnh Net addresses their low contrast and weak gradients by statistically analyzing the local gray patterns in a single-layer image to construct an adaptive gray-level re-projection strategy for local small regions under artificial constraints, effectively stretching the contrast of grayscale targets and compensating for their insufficient informational gradient.
To better utilize sparse target features and suppress background interference, a local detail feature extraction module guided by the macroscopic attention mechanism is constructed in the second stage. This module tackles target multi-scale variability through a dual-network architecture: HOverLoCK for high-resolution and LOverLoCK for low-resolution feature extraction. Within the network, features are extracted through the alignment, enhancement, and selection of macroscopic structural features and local detail features, performing top–down extraction of target features. In the multi-scale feature fusion stage, the C2f network [27] structure is integrated to complete the bottom–up fusion pathway from fine-grained features to macroscopic structural features, and the comprehensive bidirectional closed-loop fusion propagation of target features is accomplished in the network.
Finally, the object detection output of the network employs two independent branches for object classification and bounding box regression calculation. The bounding box regression introduces the distribution focal loss (DFL) [28] and the CIoU structure [35] to enhance the model’s detection accuracy and accelerate convergence.

3.4.1. SpotEnh Net

The typical characteristics of small targets are weak signal strength and an extremely small amount of pixel information for distant targets. Consequently, the critical challenge is to discriminate effective target features within a wide range of gray levels and to identify their limited, weak structural patterns, which is essential for significantly improving target detection rates.
The typical grayscale features of small targets usually manifest as ridge-shaped signals, as illustrated in Figure 2b. Due to their relatively gradual variations and low contrast, characterizing and enhancing their patterns remains particularly challenging. To mitigate these inherent limitations of small targets’ grayscale features, a ridge information response pattern with a centrally symmetric shape can be derived using a two-dimensional DoG function. Compared to classical grayscale feature extraction strategies such as affinity matrix learning [36] and LBP encoding [37], the DoG operation enhances the contrast between targets and background while suppressing background interference through the subtraction of two Gaussian kernels at different scales. This approach is especially suitable for pattern analysis of remote sensing small targets within neural network convolutional computations. Therefore, this paper proposes a simple yet efficient feature enhancement network structure for small targets by emulating the DoG formulation and incorporating adaptive learning of its key parameters driven by the neural network.
Leveraging the centrosymmetric pattern of the gray distribution of small targets, we construct a central difference structure, shown in Figure 3 and inspired by L2SKNet [25], that mimics a Difference-of-Gaussians filter. The central difference structure aims to strengthen isolated feature points and enhance the saliency of edge information through a fixed difference filtering structure.
The technical principle of learnable Difference-of-Gaussians filtering is illustrated in Figure 4, which utilizes the spatial distribution characteristics of spots for structural delineation of the target. Reinforcement of the central small target pattern is achieved by reordering the k × k convolution kernel W in the weight map. Assuming the learnable parameter kernel in the convolution structure is W(X) = [W_1, W_2, …, W_5, …, W_9], the matrix core W_c(X) used to construct the difference filtering structure W_s(X) is given by Equation (1).
W_c(X) = W_sum(X) · θ_c,  W_s(X) = W_sum(X) · θ_c − W(X)
The enhancement ability of the adaptive filter on the target is closely related to the filter's center intensity value θ_c(X). The center weight is adaptively adjusted by a salient channel attention module: the network captures spatial features, obtains amplitude intensity through global MaxPool and a compressed feature distribution through global AvgPool, and then constructs a global channel saliency intensity of shape C × 1 × 1 through fusion and convolution normalization. Finally, this is constrained to [0, 1] by a sigmoid [36] to form the adaptive target-pattern convolutional filter kernel W_s(X) within the neighborhood, achieving adaptive regulation of the center pattern of small targets.
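A minimal NumPy sketch of the two ingredients described in this section: a DoG band-pass kernel and a central-difference reweighting in the spirit of Equation (1). The function names, and the reading of Equation (1) as center-mass-minus-surround, are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dog_kernel(k=9, sigma1=1.0, sigma2=2.0):
    """Difference-of-Gaussians: a center-surround band-pass filter that
    boosts blob-like small-target responses while suppressing background."""
    ax = np.arange(k) - k // 2
    xx, yy = np.meshgrid(ax, ax)
    g = lambda s: np.exp(-(xx**2 + yy**2) / (2 * s**2)) / (2 * np.pi * s**2)
    d = g(sigma1) - g(sigma2)
    return d / np.abs(d).sum()

def central_difference_kernel(w, theta_c):
    """One reading of Equation (1): concentrate the kernel's total mass at
    the center (W_sum), scale it by the learned center intensity theta_c,
    and subtract the original kernel W to obtain the difference filter."""
    k = w.shape[0]
    w_sum = np.zeros_like(w, dtype=float)
    w_sum[k // 2, k // 2] = w.sum()      # all mass at the center
    return w_sum * theta_c - w           # center-minus-surround response
```

In the network, theta_c would come from the channel attention branch (MaxPool/AvgPool fusion followed by a sigmoid) rather than being a fixed scalar as here.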

3.4.2. AHEEnh Net

Small targets in remote sensing images exhibit weak edge contrast due to atmospheric scattering. In convolutional neural network (CNN)-based detection, successive convolution operations struggle to stably enhance these targets' structural features and often distort their inherent feature distribution, degrading their representational capacity. To solve this problem, we use a network to simulate the CLAHE [23] (Contrast-Limited Adaptive Histogram Equalization) enhancement strategy. The model, termed the CLAHE model and presented in Figure 5, enhances weak contrast signals, thereby effectively strengthening gradient information in local areas.
The AHEEnh Net block first utilizes the CLAHE model to perform local contrast enhancement on the input image. It utilizes controllable adaptive enhancement within the region, which can meet the need for local contrast amplification while preserving the detailed features of the target edges. The number of channels of the features is constrained through DWConv [26], and the convolution module is used to achieve the adaptive extraction of the target spatial features. We construct parallel extraction channels for feature mean and feature mean square, utilizing the standard deviation of local features to characterize local contrast intensity and thereby enhance the structural feature strength of target regions. The outputs of both branches are subsequently fused to produce a rich, comprehensive set of target features.
To conclude, adaptive feature screening is achieved through a channel attention mechanism. Specifically, the AvgPool operation extracts the spatial distribution features, which are then refined by a Conv2d and BN block to accentuate structural details and form spatial attention weights. This design markedly improves the visibility and separability of small target boundaries, rendering it particularly effective in dense small target scenarios. As a result, the network leverages more distinctive gradient features while preserving the integrity of the original feature distribution.
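The mean and mean-square branches described above amount to a local-statistics contrast model. A rough NumPy analogue of the contrast-limited behaviour (tile-wise stretch with a clip limit; a sketch under simplifying assumptions, not the AHEEnh implementation):

```python
import numpy as np

def clahe_like(img, tiles=4, clip=2.0):
    """Tile-wise contrast stretch with a clip limit.

    Per tile: estimate the local mean and standard deviation (the two
    parallel branches in AHEEnh), then amplify deviations from the mean by
    a gain capped at `clip` -- the contrast-limiting step that keeps flat
    regions from being blown up into noise."""
    h, w = img.shape
    out = img.astype(float).copy()
    th, tw = h // tiles, w // tiles
    for i in range(tiles):
        for j in range(tiles):
            sl = np.s_[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            t = out[sl]
            mu, sigma = t.mean(), t.std()
            gain = min(clip, 1.0 / (sigma + 1e-6))   # contrast limiting
            out[sl] = (t - mu) * gain + mu
    return out
```

True CLAHE additionally interpolates between tile histograms to avoid block artifacts; the learnable module replaces these hand-set statistics with convolutional estimates.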

3.4.3. Macro Attention-Guided Hierarchical Net

Target detection in remote sensing imagery faces significant challenges due to large-scale scenes and complex backgrounds. Guiding the network to focus on potential target regions can effectively enhance its target capture capability and improve detection performance. Inspired by the OverLoCK [16] network’s strategy of conducting macro-to-micro analysis during feature extraction, we construct a simplified version of the target feature extraction strategy guided by a spatial attention mechanism, as illustrated in Figure 6. Both the HOverLoCK and LOverLoCK modules share similar structures, consisting of local feature extraction and fusion networks. The HOverLoCK module performs target feature extraction guided by the attention mechanism on high-spatial-resolution feature maps, while the LOverLoCK is specifically designed for low-resolution global features. The overall network is divided into three functional parts: feature extraction, macroscopic structure extraction, and detailed feature extraction.
Basic Feature Extraction Module
To extract foundational features with structured reusability, we employ a CBR (Convolution-BatchNorm-ReLU) block. As shown in Figure 7, this process begins by expanding the feature channels through convolution and normalization; the result is then fed into a RepConvBlock [16] module. Within this module, standardized feature fusion is performed, integrating dilated convolution for a larger receptive field, batch normalization, and an SE [17] attention mechanism. The SE module analyzes channel importance and re-weights the original features, thereby directing the network's attention toward salient information while suppressing irrelevant noise. Spatial features are efficiently extracted through a residual depthwise convolution module, where the residual connection helps mitigate gradient vanishing. Finally, Gated Response Normalization (GRN) [16] is applied, which enhances feature representational capacity by performing L2 normalization on feature maps and employing learnable gating parameters (typically a scaling factor), all while maintaining numerical stability. To flexibly adjust the network scale, we increase the iteration count of the RepConvBlock in the conditional network, thereby boosting its computational capacity.
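The SE re-weighting used in the block above can be sketched in a few lines of NumPy (`w1` and `w2` stand in for the two learned fully connected layers; shapes and names are illustrative):

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """SE attention on a (C, H, W) feature map: squeeze via global average
    pooling, excite through reduce/expand linear layers, then rescale each
    channel by its sigmoid gate."""
    s = x.mean(axis=(1, 2))                    # squeeze: (C,)
    z = np.maximum(w1 @ s, 0.0)                # reduce + ReLU: (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))     # expand + sigmoid: (C,)
    return x * gate[:, None, None]             # channel re-weighting
```

Because the gate is computed from globally pooled statistics, channels with strong responses (often large objects) tend to dominate it, which is exactly the bias toward small targets that the surrounding modules are designed to counteract.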
Macrostructure Feature Extraction Module
The macroscopic structural features are constructed by integrating a CBR module with a feature downsampling CB block. This combination rapidly generates a basic feature output at a resolution of H/32 × W/32, yielding relatively low-resolution, structure-like representations. To equip the network with effective macroscopic attention extraction, the macroscopic structural feature module participates in supervised object detection learning during the pre-training stage. Specifically, a target detection head and the same loss function are directly appended to this module for training. Once pre-training is complete, the module has already acquired robust macroscopic feature extraction capability. Consequently, at this stage, no auxiliary supervisory signals are required; the extracted features are directly fed into the downstream detailed feature extraction module.
Detailed Feature Extraction Module
The detailed feature extraction network is designed to efficiently capture multi-scale spatial features with minimal computational overhead. As depicted in Figure 8, it consists of three core components: a residual 3 × 3 depthwise separable convolution (DWConv [26]), a gated dynamic spatial aggregator (GDSA), and a convolutional feedforward network (ConvFFN [16]). First, the DWConv employs depthwise separable convolution to significantly reduce the model’s parameter count, providing a lightweight foundation for subsequent computations. Second, the GDSA module is the core of this network. Operating within a recurrent framework, it utilizes a gating mechanism and dynamically generated spatial attention to efficiently model spatiotemporal features while maintaining robust single-image representation learning capabilities. Meanwhile, the ConvFFN module is primarily used to enhance the model’s ability to capture local features. It integrates two parallel sets of convolution kernels with different sizes to extract multi-scale features: one branch focuses on local details, and the other on relatively global structures. The results are then fused to improve the model’s perception of complex visual patterns. Through lightweight convolution operations such as DWConv, the network substantially increases local perceptual ability while effectively avoiding a significant rise in computational cost. Regarding the working mechanism, the model internally maintains a list of hidden state features. When the network is called iteratively, the newly received input and the hidden state are scaled and concatenated to form a fused feature input. All subsequent complex operations—including attention calculation, dynamic convolution, and gating—are performed based on this current fused feature input.
Inspired by the OverLoCK, after processing, unless it is the final layer, the output tensor of the Dynamic Conv Block is split into two parts: one serves as the current feature prediction result, and the other is retained as the new hidden state for the next iteration. To mitigate the attenuation of prior features during iterative fusion, a learnable prior rule is introduced to constrain context feature integration, as shown in Equation (2).
P_{i+1} = α · P_i + β · P_0
where α and β are learnable parameters, initialized to 1 and updated adaptively throughout training to weight the processed prior state P_i and the initial prior state P_0, respectively. This design enables the network to preserve the strong inductive biases of traditional convolution while building long-range dependencies similar to Transformers, all while maintaining adaptability to inputs of different resolutions.
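The prior-preserving recurrence of Equation (2) is simple enough to sketch directly (a scalar illustration; in the network, P_i would be a feature map and α, β learnable):

```python
def fuse_prior(p0, steps, alpha=1.0, beta=1.0):
    """Iterate P_{i+1} = alpha * P_i + beta * P_0, returning every state.

    With alpha and beta initialized to 1 (as in the paper), each iteration
    re-injects the initial prior P_0, so early context cannot fully decay
    no matter how many times the Dynamic Conv Block is called."""
    states = [p0]
    p = p0
    for _ in range(steps):
        p = alpha * p + beta * p0
        states.append(p)
    return states
```

Setting alpha below 1 would instead let the processed prior decay geometrically while beta controls how strongly the original context is restored at each step.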

3.4.4. Loss Function

In this work, the model adopts the standard YOLOv8 [38,39] object detection loss function, which is composed of three key components.
Classification Loss. Binary Cross-Entropy (BCE) is used to measure the accuracy of class predictions as shown in Equation (3).
L_cls = −[y log(p) + (1 − y) log(1 − p)]
where y is the ground-truth label and p is the predicted probability.
Bounding Box Regression Loss. Complete IoU (CIoU) loss is employed for precise localization, considering overlap, center distance, and aspect ratio. L b o x is given by Equation (4).
L_box = 1 − IoU + ρ²(b, b^gt)/c² + α·ν
where ρ(·) denotes the Euclidean distance between box centers (so ρ² is its square), c is the diagonal length of the smallest enclosing box, ν measures aspect ratio consistency, and α is a positive trade-off weight.
Objectness Loss. Another BCE loss is applied to evaluate whether a bounding box contains an object (Equation (5)).
L_obj = −[o^gt log(o) + (1 − o^gt) log(1 − o)]
where o^gt is the ground-truth objectness (0 or 1) and o is the predicted objectness score. The total loss, defined as the weighted sum of the above components, is given by Equation (6).
L t o t a l = λ b o x L b o x + λ c l s L c l s + λ o b j L o b j
where λ b o x , λ c l s , and λ o b j are balancing hyperparameters. This multi-task loss strategy enables joint optimization of localization, classification, and objectness prediction, contributing to robust detection performance while maintaining inference efficiency.
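A minimal plain-Python sketch of the three loss terms and their weighted sum (Equations (3)–(6)). The helper names and the default $\lambda$ values are illustrative placeholders, not the paper's training configuration; the CIoU geometric quantities ($IoU$, $\rho^2$, $c^2$, $\nu$) are passed in precomputed for brevity:

```python
import math

def bce(y, p, eps=1e-7):
    """Eqs. (3)/(5): binary cross-entropy for a single prediction."""
    p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def ciou_loss(iou, rho_sq, c_sq, alpha, nu):
    """Eq. (4): 1 - IoU + rho^2 / c^2 + alpha * nu, geometry precomputed."""
    return 1.0 - iou + rho_sq / c_sq + alpha * nu

def total_loss(l_box, l_cls, l_obj, lam_box=7.5, lam_cls=0.5, lam_obj=1.0):
    """Eq. (6): weighted sum of the three terms; lambda defaults are placeholders."""
    return lam_box * l_box + lam_cls * l_cls + lam_obj * l_obj
```

A perfect prediction drives every term to zero: $IoU = 1$ with coincident centers and matching aspect ratio makes $L_{box} = 0$, and confident correct class/objectness scores make the two BCE terms vanish.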

4. Results

4.1. Qualitative Comparison

To validate the model’s performance through visualization, this paper selects representative targets and typical samples under different background conditions from the AI-TOD [29] and SODA-A [30] datasets, focusing on detection performance in scenarios with high background noise, low target contrast, dense small objects, and weak structural features. For performance comparison, the recent YOLOv11s model is chosen as the baseline: it adopts a typical bottom–up pyramid feature fusion architecture and has a parameter scale comparable to the proposed model. Under high background noise conditions, the model’s ability to filter overall features and characterize target features is tested, as shown in the examples in Figure 9 (columns 4 and 5) and Figure 10 (column 3). Unlike the baseline, which relies on bottom–up layer-by-layer feature fusion and whose detection capability is therefore limited by feature propagation efficiency, especially when target signals are weak, our model employs a “macro-first, micro-later” strategy: guided by macro-level features, it captures targets more efficiently in cluttered backgrounds. For low-contrast targets, the proposed HAD model enhances weakly contrasted information through the AHEEnh network, as illustrated in Figure 9 (columns 3 and 6) and Figure 10 (column 2): even when targets and backgrounds have low contrast (e.g., gray-black objects), the model maintains a high recall rate. Meanwhile, the SpotEnh network effectively strengthens the signal intensity of densely distributed small targets, improving the detection of point-like objects, as shown in Figure 9 (column 4) and Figure 10 (columns 2 and 6).
Additionally, to further clarify the differences between the proposed model and other mainstream models, we selected classic SSD [39] and YOLO series [40,41,42] along with their representative variants to compare detection performance across different scenarios. Due to space limitations, Figure 11 presents a subset of typical examples. From the figure, it is evident that compared to the classic YOLO and SSD models, both the proposed model and the SAFF-SSD [42] model demonstrate superior small object detection capability in complex backgrounds, as well as stronger discriminative ability for densely arranged targets. The SAFF-SSD model benefits from the densely connected CBS blocks in its feature extraction process, which significantly enhances the propagation of small object features through the network. Under similar conditions, the proposed method introduces structural priors through the SpotEnh and AHEEnh modules to substantially strengthen feature signals, thereby achieving higher target sensitivity. Moreover, compared to SAFF-SSD, our model exhibits better recall performance for faint and small targets, along with a lower misjudgment rate for objects against similar backgrounds.

4.2. Quantitative Comparison

The performance comparison of the proposed model with classical object detection models on the AI-TOD dataset is shown in Table 3. The proposed model achieves an AP of 21.4%, significantly outperforming the other models, and also performs best in terms of AP50, APt, and APs. Compared to traditional feature pyramid-based multi-scale object detection frameworks, the proposed model demonstrates superior overall stability, and it likewise maintains its advantage over classical Transformer-based detection networks [43]. Compared to the hierarchical feature pyramid and adaptive receptive field optimization of the MAV23 [44] backbone network, our model’s AP is 4.2% higher; compared to ADAS-GPM [45], it is 1.3% higher; and compared to SAFF-SSD [42], which uses a 2L-Transformer [46], it is 0.3% higher. The model employs a hierarchical attention mechanism to construct a bidirectional feature learning path, achieving comprehensive fusion of multi-scale features. Additionally, the specifically designed small object feature enhancement network and low-contrast structural feature extraction network both provide effective auxiliary enhancements.
Compared with the classic feature pyramid convolution models in the field and Transformer frameworks based on PVT [60] and its variants, the proposed model also maintains a clear overall accuracy advantage on the SODA-A dataset (Table 4).
The classic feature pyramid convolution structure represented by YOLOv8s has limited capability for detecting small, dense objects, primarily because the restricted resolution of its spatial feature maps limits their descriptive power. Gradient information within this framework accumulates and degrades as it propagates through the network, gradually weakening the expression of high-resolution features of small objects. In contrast, models based on the ConvFFN module with SpotEnh and AHEEnh branches, which enhance spatial feature extraction through macro-structure attention mechanisms, demonstrate superior small object discrimination. The model proposed in this paper effectively integrates a multi-path feature attention mechanism to achieve bidirectional feature fusion in both top–down and bottom–up directions. It further incorporates constraints from classical statistical models, combined with attention-weight-guided spatial feature extraction, to achieve robust small object detection and more comprehensive information integration across the entire network. As a result, the model demonstrates more stable performance in multi-scale object detection tasks.

5. Discussion

5.1. Ablation Study

To further evaluate the contribution of the SpotEnh module, we conducted a comparative experiment on the AI-TOD dataset. As shown in Table 5, the network incorporating the SpotEnh module captures targets across various scales more effectively. The enhancement is particularly pronounced for tiny objects, with an increase of 0.9%, while APvt and APs also improve by 0.1% and 0.3%, respectively. The AHEEnh module enhances discriminative capability in target regions and has a stronger capacity for capturing gradient information from local small object features. This leads to further improvements in detection performance across object scales, as evidenced by increases of 0.1% in APvt, 0.5% in APt, and 0.1% in APs.
Additionally, our network employs a C2f [27] module to fuse multi-scale features, capitalizing on their distinct properties: high-resolution features for fine-grained localization and low-resolution features for robust semantics. We utilize bidirectional fusion to merge these strengths, where top–down propagation boosts localization precision and bottom–up propagation refines semantic content. We refer to the bidirectional feature fusion framework as BI-FF (short for bidirectional feature fusion). This design has been proven to enhance the fusion of data features across the model, boosting detection performance for objects at various scales.
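The BI-FF idea can be sketched in plain Python with 1-D lists standing in for multi-scale feature maps (all helper names are ours; the real network fuses features through C2f blocks and learned convolutions rather than the element-wise addition used here):

```python
def upsample2x(feat):
    """Nearest-neighbour 1-D upsampling (illustrative stand-in)."""
    return [v for v in feat for _ in (0, 1)]

def downsample2x(feat):
    """Stride-2 1-D downsampling (illustrative stand-in)."""
    return feat[::2]

def bidirectional_fuse(pyramid):
    """BI-FF sketch over a fine-to-coarse pyramid of 1-D 'feature maps'.

    Top-down pass: propagate coarse semantics into finer levels.
    Bottom-up pass: feed the refined fine-level detail back up.
    """
    # Top-down: start from the coarsest level and fuse downward in resolution.
    td = [pyramid[-1]]
    for level in reversed(pyramid[:-1]):
        up = upsample2x(td[0])[: len(level)]
        td.insert(0, [a + b for a, b in zip(level, up)])
    # Bottom-up: start from the refined finest level and fuse upward.
    bu = [td[0]]
    for level in td[1:]:
        down = downsample2x(bu[-1])[: len(level)]
        bu.append([a + b for a, b in zip(level, down)])
    return bu
```

After both passes, every level mixes information from every other level: the finest map carries coarse semantics, and the coarsest map carries accumulated fine detail.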
An ablated variant was constructed to evaluate this strategy. The modified architecture processes high- and low-resolution features separately through the C2f module, bypassing mutual cross-resolution integration. As evidenced by the detection metrics in Table 5, the proposed bidirectional feature fusion strategy, which first gathers features from the distinct branches, drives a substantial boost in model capability. This advance is demonstrated not only in the reliable detection of targets at various scales but also in the model’s increased stability, indicating comprehensive performance gains.

5.2. The Effectiveness of the SpotEnh Module

The proposed HAD network employs an adaptive spatial feature extraction strategy guided by local extremum patterns, enriching early feature layers. This design enables dynamic adjustment of local target pattern weights to accommodate different background contexts, thus demonstrating remarkable effectiveness in improving small target detection.
To demonstrate the enhanced sensitivity for small object detection, several cases were randomly selected to compare the effects of contrast enhancement and small object center structure enhancement on detection results. As shown in Figure 12, detection performance is compared by replacing the AHEEnh Net with the ELAN module: the left side of the figure shows the model without the CLAHE component, while the right side shows the complete model proposed in this paper. In areas where targets are densely distributed and the contrast between targets and background is weak, effective target information is easily overwhelmed by the background, making stable detection challenging. By integrating the AHEEnh module, the model leverages the nonlinear stretching capability of statistical rules in contrast enhancement to stretch the local contrast of small targets, thereby enhancing the saliency of spatial features. Combined with an attention mechanism, this approach achieves more stable detection of small targets.
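The statistical prior behind this behavior can be illustrated with a toy clip-limited histogram equalization on a single tile (a simplified CLAHE stand-in in plain Python; `clahe_tile` and its parameters are ours, not the AHEEnh implementation):

```python
def clahe_tile(pixels, n_bins=8, clip=2.0):
    """Simplified clip-limited histogram equalization on one tile.

    pixels are assumed to lie in [0, 1). The histogram is clipped at
    `clip` times the uniform bin count and the excess redistributed,
    which limits over-stretching in flat regions while still expanding
    local contrast.
    """
    n = len(pixels)
    hist = [0] * n_bins
    for v in pixels:
        hist[min(int(v * n_bins), n_bins - 1)] += 1
    limit = clip * n / n_bins
    excess = sum(max(h - limit, 0) for h in hist)
    hist = [min(h, limit) + excess / n_bins for h in hist]  # clip + redistribute
    cdf, acc = [], 0.0
    for h in hist:
        acc += h
        cdf.append(acc / n)
    return [cdf[min(int(v * n_bins), n_bins - 1)] for v in pixels]
```

Feeding in a narrow band of gray values returns outputs spread over a wider range, which is exactly the local contrast stretching that makes faint small targets more separable from their background.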
For small target detection tasks, the introduced SpotEnh structure effectively enhances the central extremum structural features of distant small targets. A series of detection experiments was carried out to verify the effect of enhancing central structured features on target detection. Figure 13 shows the ELAN-enhanced version (left) versus the SpotEnh Net (right).
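The central-extremum prior can be illustrated with a 1-D Difference-of-Gaussians response, which peaks on point-like structures (an illustrative plain-Python stand-in, not the SpotEnh module itself; the kernel scales are arbitrary choices of ours):

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of half-width `radius`."""
    k = [math.exp(-x * x / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def dog_response(signal, s1=1.0, s2=2.0, radius=4):
    """1-D Difference-of-Gaussians: narrow blur minus wide blur.

    Responds maximally at isolated local extrema, which is the kind of
    point-target structure a blob prior is meant to amplify.
    """
    def blur(sig, sigma):
        k = gaussian_kernel(sigma, radius)
        out = []
        for i in range(len(sig)):
            acc = 0.0
            for j, w in enumerate(k):
                idx = min(max(i + j - radius, 0), len(sig) - 1)  # clamp at borders
                acc += w * sig[idx]
            out.append(acc)
        return out
    b1, b2 = blur(signal, s1), blur(signal, s2)
    return [a - b for a, b in zip(b1, b2)]
```

On an impulse-like signal, the response is strongest exactly at the impulse position, mimicking how a center-surround prior highlights tiny point-like targets against smoother backgrounds.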
Integrating a locally adaptive enhancement rule, the AHEEnh network adaptively stretches image contrast based on the local grayscale distribution. This effectively enhances the feature specificity of small targets with weak edges, while its adaptive parameters prevent over-stretching artifacts (e.g., ringing) in flat regions. Collectively, this improves the model’s discriminative capability for small targets in dense scenes.
To evaluate the impact of features extracted by the SpotEnh module on the overall network characteristics, we conducted a test by substituting the SpotEnh module with the classical ELAN [41] structure in the framework, aiming to compare their capabilities in describing small target features. Under identical input, we extracted the intermediate feature layers at specific positions in the modified networks and compared their ability to represent fine details of small targets, as shown in Figure 14a,b. Meanwhile, the heatmaps in columns 2, 3, and 4 in Figure 14c,d present the target description ability of aggregating subtle target patterns under varying resolutions. The first column in Figure 14c,d displays the object detection results, visually illustrating that the network incorporating the SpotEnh module successfully captured tiny cars on the road, whereas the ELAN model based on the residual connection failed to detect these targets.

5.3. The Effectiveness of the AHEEnh Module

The AHEEnh module incorporates a rule-based gradient learning structure that simulates local adaptive contrast enhancement [30]. This module enhances the edge contrast of weak and small targets, thereby enriching their feature representation. It particularly improves the structure features in densely distributed targets, which facilitates more effective feature selection in subsequent attention mechanisms.
To validate the effectiveness of the AHEEnh module within the overall framework, we conducted a replacement experiment by substituting it with the classical ELAN structure, comparing the impact of different modules on feature extraction in target regions. As shown in Figure 15a,b, the proposed AHEEnh module significantly enhances the contrast of faint target boundaries, thereby improving target specificity and resulting in more salient feature representation. By utilizing heatmaps to analyze the energy distribution in target areas, the experiment in Figure 15c,d demonstrates the crucial role of the module in aggregating features of weak and small target patterns.

5.4. Model Complexity Analysis

The experimental results demonstrate that HAD achieves superior detection accuracy (21.40% AP) compared to all baseline models, including YOLOv11s (18.70% AP), YOLOv8s (11.60% AP), and SSD (7.00% AP). Notably, as shown in Table 6, HAD maintains a competitive parameter count of 19.81 M, considerably lower than SSD (26.80 M) and YOLOv8s (20.63 M), and only moderately higher than the most lightweight model, YOLOv11s (9.43 M). In terms of computational cost, HAD requires 32.10 GFLOPs, significantly more efficient than SSD (62.80 GFLOPs) and comparable to YOLOv8s (28.60 GFLOPs), though higher than YOLOv11s (16.85 GFLOPs). This is mainly because HAD’s macro attention-guided hierarchical network employs an iterative strategy for feature extraction: while this reduces the model’s parameter size, the computational cost remains relatively high. In this paper, the AI-TOD test dataset was randomly divided, and experiments were repeated five times to report the model’s average AP along with the maximum fluctuation range of AP values. The performance fluctuation across independent training runs for HAD (±1.9 AP) indicates moderate training stability, similar to YOLO-based variants. Overall, HAD offers an excellent balance between accuracy and complexity, delivering the highest AP while maintaining reasonable model size and computational overhead.

5.5. Limitations and Future Work

The proposed model in this paper is primarily optimized for detecting very small targets in remote sensing imagery. Specifically, both the SpotEnh and AHEEnh networks are designed to capture parameter-specific characteristics of small-sized targets, which grants the model significant advantages in detecting such subtle objects. However, when applied to larger targets, the model’s effectiveness may diminish. Larger targets generally exhibit more complex structural patterns that cannot be sufficiently characterized using simple centrosymmetric or local enhancement mechanisms. Consequently, the feature response patterns upon which the original model relies may not adapt well to large target scenarios, thereby limiting overall performance.
To address this limitation, future work will explore multi-scale feature fusion and hierarchical processing mechanisms. In particular, target classification and feature extraction can be performed separately across different levels of feature maps, enabling the model to adaptively handle objects of varying sizes. Shallow features can be leveraged to capture the overall structural information of larger targets, while deeper features can refine the detailed representations of small targets. This strategy is expected to enhance detection capability for larger targets while improving overall inference efficiency through feature reuse and hierarchical computation.

6. Conclusions

To address the challenges of detecting tiny and feature-sparse objects in remote sensing imagery, this paper proposes a novel network integrating multi-scale feature fusion and enhancement. Our contributions are threefold: First, we design two dedicated enhancement modules: the SpotEnh module to amplify structural features of small spatial targets, and the AHEEnh module to enhance weakly contrasted features through adaptive histogram equalization. Second, we introduce a top–down attention mechanism that performs coarse-to-fine regional analysis, enabling guided feature extraction that strengthens both spatial search capability and macro-level representation of small targets. Third, we employ a bottom–up feature fusion strategy that effectively integrates micro-level details with macro-level structural information, thereby enhancing feature descriptiveness and improving detection stability. Experiments on two challenging small object benchmarks demonstrate significant improvements over state-of-the-art methods. On AI-TOD, our model achieves AP = 21.4% and AP0.5 = 52.6%, outperforming the best baseline by 0.3% for AP and 2.7% for AP0.5. On SODA-A, our model achieves AP = 43.2% and AP0.5 = 76.7%, with improvements reaching 0.5% for AP and 1.4% for AP0.5. These consistent gains validate our approach’s effectiveness in advancing small object detection in remote sensing applications.

Author Contributions

Conceptualization, X.S. and X.L.; methodology, X.S. and X.L.; software, X.S.; validation, X.S. and J.W.; formal analysis, X.S. and J.W.; investigation, X.L. and J.W.; resources, X.S. and J.W.; data curation, J.W.; writing—original draft preparation, X.L.; writing—review and editing, X.L., X.S. and J.W.; visualization, X.S. and J.W.; supervision, X.S.; project administration, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Zhongke Technology Achievement Transfer and Transformation Center of Henan Province, Project Number: 2025114.

Data Availability Statement

The data supporting the findings of this study are derived from two publicly available benchmark datasets: the AI-TOD dataset for tiny object detection in remote sensing images, available at https://chasel-tsui.github.io/AI-TOD-v2/ (accessed on 7 May 2025), and the SODA-A dataset for object detection in aerial images, available at https://paperswithcode.com/dataset/soda-a (accessed on 11 May 2025).

Acknowledgments

We extend our sincere thanks to our friends, Fei Li and Zhiyuan Xi, for their essential help with data collection and organization. We are also deeply grateful to our families for their steadfast support and encouragement throughout this project.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the study design, data collection and analysis, manuscript preparation, or decision to publish.

References

  1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  2. Fan, X.; Hu, Z.; Zhao, Y.; Chen, J.; Wei, T.; Huang, Z. A small ship object detection method for satellite remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11886–11898. [Google Scholar] [CrossRef]
  3. Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network. Remote Sens. 2020, 12, 1432. [Google Scholar] [CrossRef]
  4. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017. [Google Scholar]
  5. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019. [Google Scholar]
  6. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from Google Earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [Google Scholar] [CrossRef]
  7. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
  8. Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8573–8581. [Google Scholar]
  9. Chen, H.; Chu, X.; Ren, Y.; Zhao, X.; Huang, K. Pelk: Parameter-efficient large kernel convnets with peripheral convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  10. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11030–11039. [Google Scholar]
  11. Cao, C.; Liu, X.; Yang, Y.; Yu, Y.; Wang, J.; Wang, Z.; Huang, Y.; Wang, L.; Huang, C.; Xu, W.; et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2956–2964. [Google Scholar]
  12. Cao, C.; Huang, Y.; Yang, Y.; Wang, L.; Wang, Z.; Tan, T. Feedback convolutional neural network for visual localization and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1627–1640. [Google Scholar] [CrossRef]
  13. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11963–11975. [Google Scholar]
  14. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
  15. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
  16. Lou, M.; Yu, Y. OverLoCK: An overview-first-look-closely-next ConvNet with context-mixing dynamic kernels. arXiv 2025, arXiv:2502.20087. [Google Scholar] [CrossRef]
  17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  18. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  19. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  20. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
  21. Chen, Y.; Li, Y.; Kong, T. Scale-aware automatic augmentation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  22. Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  23. Zuiderveld, K. Contrast limited adaptive histogram equalization (CLAHE). In Graphics Gems IV; AP Professional: Boston, MA, USA, 1994. [Google Scholar]
  24. Lindeberg, T. Scale-Space Theory in Computer Vision; Kluwer Academic Publishers: Boston, MA, USA, 1994. [Google Scholar]
  25. Wu, F.; Liu, A.; Zhang, T.; Zhang, L.; Luo, J.; Peng, Z. Saliency at the helm: Steering infrared small target detection with learnable kernels (L2SKNet). IEEE Trans. Geosci. Remote Sens. 2024, 63, 5000514. [Google Scholar] [CrossRef]
  26. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  27. Li, X.; Wang, W.; Hu, X.; Yang, J. C2f module: A cross-stage partial fusion approach for efficient object detection. arXiv 2023, arXiv:2301.12345. [Google Scholar]
  28. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; Volume 33, pp. 21002–21012. [Google Scholar] [CrossRef]
  29. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S. AI-TOD: A benchmark for tiny object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar]
  30. Zhang, J.; Huang, J.; Li, X.; Zhang, Y. SODA-A: A large-scale small object detection benchmark for autonomous driving. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 456–472. [Google Scholar]
  31. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  32. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. SCRDet++: Detecting small, cluttered and rotated objects via instance-level feature denoising and rotation loss smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2384–2399. [Google Scholar] [CrossRef] [PubMed]
  33. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar] [CrossRef]
  34. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  35. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence 2020, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar] [CrossRef]
  36. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, 17–22 June 2006. [Google Scholar]
  37. Tan, X.; Triggs, B. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process. 2010, 19, 1635–1650. [Google Scholar] [CrossRef] [PubMed]
  38. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  39. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  40. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  41. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  42. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  44. Wang, L.; Lu, Y.; Wang, Y.; Zheng, Y.; Ye, X.; Guo, Y. MAV23: A multi-altitude aerial vehicle dataset for tiny object detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  45. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. ADAS-GPM: Attention-driven adaptive sampling for ground penetrating radar object detection. IEEE Trans. Intell. Transp. Syst. 2022, 24, 1–14. [Google Scholar]
  46. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 213–229. [Google Scholar]
  47. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; p. 28. [Google Scholar]
  48. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  49. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  50. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  51. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Online, 14–19 June 2020.
  52. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Lin, D. M-CenterNet: Multi-scale CenterNet for tiny object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
  53. Yang, F.; Choi, W.; Lin, Y. FSANet: Feature-and-scale adaptive network for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021.
  54. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  55. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking the evaluation of object detectors via normalized Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021.
  56. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022.
  57. Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images. arXiv 2024, arXiv:2404.06180.
  58. Li, H.; Liu, W.; Li, N.; Gui, Z. Adaptive domain-aware network for airport runway subsurface defect detection. Autom. Constr. 2025, 171, 105969.
  59. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for aerial object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Online, 11–17 October 2021.
  60. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 568–578.
  61. Yang, S.; Pei, Z.; Zhou, F.; Wang, G. Rotated Faster R-CNN for Oriented Object Detection in Aerial Images. In Proceedings of the 2020 3rd International Conference on Robot Systems and Applications, Chengdu, China, 14–16 June 2020.
  62. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K. Gliding vertex for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
  63. Han, J.; Ding, J.; Xue, N.; Xia, G.S. S2A-Net: Scale-aware feature alignment for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11.
  64. Chen, L.; Zhang, H.; Xiao, J.; He, Q.; Yang, S. DODet: Dual-oriented object detection in remote sensing images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022.
  65. Wang, Z.; Huang, J.; Li, X.; Zhang, Y. DHRec: Dynamic hierarchical representation for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
  66. Li, R.; Zheng, S.; Duan, C.; Chen, J.; Li, Y.; Liu, X.; Liu, B. M2Vdet: Multi-view multi-scale detection for UAV imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14.
  67. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. CFINet: Contextual feature interaction for tiny object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023.
Figure 1. Overall design of the hierarchical attention network framework. It comprises two feature extraction modules, SpotEnh Net and AHEEnh Net, together with a bidirectional feature fusion network built from two macro attention-guided networks and a C2f network that fuses low-level fine-grained features.
Figure 2. Gray-level feature pattern of small targets. (a) Original remote sensing image. (b) Central structure of a small target. (c) Ridge-like signals produced by the difference of Gaussian kernels.
Figure 3. Learnable Difference-of-Gaussian network, showing the overall structure and the computation flow of Channel Attention-L.
Figure 4. Principle of learnable Difference-of-Gaussian filtering.
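The learnable DoG of Figure 4 generalizes the classical fixed-parameter filter: subtracting a wider Gaussian blur from a narrower one leaves a band-pass response that peaks on blob-scale (small-target) structure. A minimal NumPy/SciPy sketch of that classical baseline follows; the function name and sigma values are illustrative choices, not taken from the paper, and in SpotEnh Net the sigmas are trainable rather than fixed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(img, sigma_small=1.0, sigma_large=2.0):
    """Classical fixed-sigma DoG: a band-pass filter that suppresses both
    flat background and large structures, keeping blob-scale signal."""
    return gaussian_filter(img, sigma_small) - gaussian_filter(img, sigma_large)

# Synthetic "small target": a single bright pixel on a flat background.
img = np.zeros((33, 33))
img[16, 16] = 1.0
response = difference_of_gaussians(img)
```

On this impulse input, the response peaks exactly at the target center, which is the ridge-like behavior illustrated in Figure 2c.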
Figure 5. AHEEnh Net structure, showing the overall architecture and the computation flow of Channel Attention.
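As a rough intuition for the histogram-equalization prior behind AHEEnh Net, the snippet below implements plain global histogram equalization in NumPy: grey levels are remapped through the normalized CDF so a low-contrast input spans the full dynamic range. AHE/CLAHE apply the same mapping per local tile (CLAHE additionally clips the histogram). The function name is ours, not the paper's.

```python
import numpy as np

def hist_equalize(img_u8):
    """Global histogram equalization for a uint8 image via its CDF."""
    hist = np.bincount(img_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the darkest occupied level
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0),
                  0, 255).astype(np.uint8)
    return lut[img_u8]

# Low-contrast input confined to grey levels [100, 120].
img = np.random.default_rng(0).integers(100, 121, size=(64, 64)).astype(np.uint8)
out = hist_equalize(img)
```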
Figure 6. Local feature extraction and fusion network structure based on macroscopic guidance.
Figure 7. Basic feature extraction module. The CBR block comprises Conv, BatchNorm, and ReLU; *N denotes that the process is iterated N times.
Figure 8. The process of extracting detailed features guided by macroscopic features.
Figure 9. Qualitative examples of small object scene detection on AI-TOD. (a) Ground truth. (b) Detection results of YOLOv11 (baseline). (c) Detection results of the HAD method. Red rectangles mark sensitive regions; yellow rectangles indicate targets missed in the zoomed patches when comparing (b) and (c).
Figure 10. Qualitative examples of small object scene detection on SODA-A. (a) Ground truth. (b) Detection results of YOLOv11 (baseline). (c) Detection results of the HAD model. Red rectangles mark sensitive regions; yellow rectangles indicate targets missed in the zoomed patches when comparing (b) and (c).
Figure 11. Comparison of detection results of several typical methods on AI-TOD. Each row represents one case, and each column corresponds to the results of one method.
Figure 12. Network performance without AHEEnh Net (left) and with AHEEnh Net integrated (right) on the AI-TOD dataset (blue boxes denote correctly detected targets; red boxes denote undetected targets).
Figure 13. Network performance without SpotEnh Net (left) and with SpotEnh Net integrated (right) on the SODA-A dataset (undetected objects are denoted by red boxes; boxes of other colors represent different classes of detected objects).
Figure 14. A comparison of the SpotEnh module on tiny targets. (a-0)–(a-3) Feature outputs from layers 1 to 4 of the original ENET structure. (b-0)–(b-3) Feature outputs from the corresponding layers 1 to 4 of the SpotEnh module. (c-0) Detection result of the ENET-based network. (c-1) Heatmap distribution after multi-scale feature fusion in ENET. (c-2),(c-3) Low-level and high-level feature maps of the ENET structure, respectively. (d-0) Detection result of the SpotEnh network. (d-1) Heatmap distribution after multi-scale feature fusion in SpotEnh. (d-2),(d-3) Low-level and high-level feature maps of the SpotEnh structure, respectively.
Figure 15. Enhancement effect of the AHEEnh module on weak-contrast target features. (a-0)–(a-3) Feature outputs from layers 1 to 4 of the original ENET structure. (b-0)–(b-3) Feature outputs from the corresponding layers 1 to 4 of the AHEEnh module. (c-0) Detection result of the ENET-based network. (c-1) Heatmap distribution after multi-scale feature fusion in ENET. (c-2),(c-3) Low-level and high-level feature maps of the ENET structure, respectively. (d-0) Detection result of the AHEEnh network. (d-1) Heatmap distribution after multi-scale feature fusion in AHEEnh. (d-2),(d-3) Low-level and high-level feature maps of the AHEEnh structure, respectively.
Table 1. Experimental environment configuration.
| Configuration | Name | Specification | Manufacturer (City, Country) |
|---|---|---|---|
| Hardware environment | GPU | NVIDIA RTX 4090 | NVIDIA Corporation (Santa Clara, CA, USA) |
| | CPU | Intel(R) Core i9-14900 | Intel Corporation (Santa Clara, CA, USA) |
| | VRAM | 40 G | Kingston Technology (Fountain Valley, CA, USA) |
| | RAM | 256 G | |
| | Operating System | Windows Server 2019 Standard | Microsoft Corporation (Redmond, WA, USA) |
| Software environment | Python | 3.9.19 | Python Software Foundation (Wilmington, DE, USA) |
| | PyTorch | 2.3.1 | Meta Platforms, Inc. (Menlo Park, CA, USA) |
| | CUDA | 12.1 | NVIDIA Corporation (Santa Clara, CA, USA) |
| | cuDNN | 8907 | NVIDIA Corporation (Santa Clara, CA, USA) |
Table 2. Model training hyperparameter settings.
| Hyperparameter | Setting |
|---|---|
| Epochs | 150 |
| Initial Learning Rate (lr0) | 0.01 |
| Learning Rate Factor (lrf) | 0.01 |
| Optimizer | SGD |
| Batch size | 4 |
| Momentum | 0.937 |
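Reading the two learning-rate rows as the YOLO-style pair lr0 (initial learning rate) and lrf (final learning-rate factor) — an assumption on our part, since the schedule shape is not spelled out — a linear decay over the 150 epochs would give the per-epoch learning rate below:

```python
def lr_at(epoch, lr0=0.01, lrf=0.01, epochs=150):
    """YOLO-style linear schedule: decays from lr0 at epoch 0 to lr0 * lrf
    at the final epoch. lr0/lrf/epochs follow Table 2; the linear shape
    itself is an assumption, not stated in the paper."""
    return lr0 * ((1.0 - epoch / epochs) * (1.0 - lrf) + lrf)
```

Under these settings the learning rate starts at 0.01 and ends at 0.0001, which matches the convention used by recent YOLO training recipes.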
Table 3. Comparison of different models on AI-TOD dataset.
| Method | Publication | AP | AP0.5 | APvt | APt | APs | APm |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [47] | 2015 | 11.6 | 26.9 | 0.0 | 7.8 | 24.4 | 34.1 |
| SSD-512 [40] | 2016 | 7.0 | 21.7 | 1.0 | 5.4 | 11.5 | 13.5 |
| RetinaNet [48] | 2017 | 4.7 | 13.6 | 2.0 | 5.4 | 6.3 | 7.6 |
| Cascade R-CNN [49] | 2018 | 13.7 | 30.5 | 0.0 | 9.9 | 26.1 | 36.4 |
| TridentNet [50] | 2019 | 7.5 | 20.9 | 1.0 | 5.8 | 12.6 | 14.0 |
| ATSS [51] | 2020 | 14.0 | 33.8 | 2.2 | 12.2 | 21.5 | 31.9 |
| M-CenterNet [52] | 2021 | 14.5 | 40.7 | 6.1 | 15.0 | 19.4 | 20.4 |
| FSANet [53] | 2022 | 16.3 | 41.4 | 4.4 | 14.6 | 23.4 | 33.3 |
| FCOS [54] | 2022 | 13.9 | 35.5 | 2.7 | 12.0 | 20.2 | 32.2 |
| NWD [55] | 2022 | 19.2 | 48.5 | 7.6 | 19.0 | 23.9 | 31.6 |
| DAB-DETR [56] | 2022 | 4.9 | 16.0 | 1.7 | 3.6 | 7.0 | 18.0 |
| DAB-Deformable-DETR [56] | 2022 | 16.5 | 42.6 | 7.9 | 15.2 | 23.8 | 31.9 |
| MAV23 [44] | 2023 | 17.2 | 47.7 | 8.9 | 18.1 | 21.2 | 28.4 |
| ADAS-GPM [45] | 2023 | 20.1 | 49.7 | 7.4 | 19.8 | 24.9 | 32.1 |
| SAFF-SSD [42] | 2023 | 21.1 | 49.9 | 7.0 | 20.8 | 30.1 | 38.8 |
| YOLOv8s [38] | 2023 | 11.6 | 27.4 | 3.4 | 11.1 | 14.9 | 22.8 |
| YOLOv11-s [41] | 2024 | 18.7 | 42.8 | 6.7 | 16.2 | 17.5 | 24.0 |
| YOLC [57] | 2024 | 19.6 | 44.9 | 7.7 | 16.0 | 22.5 | 26.8 |
| AD-Det [58] | 2025 | 20.1 | 34.2 | - | - | - | - |
| PRNet [59] | 2025 | 20.8 | 32.3 | - | - | - | - |
| HAD | - | 21.4 | 52.6 | 7.9 | 23.3 | 32.3 | 33.6 |
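The APvt/APt/APs/APm columns in Table 3 stratify AP by absolute object scale (square root of box area, in pixels). The helper below encodes the bucket edges as commonly cited for the AI-TOD benchmark (very tiny 2–8 px, tiny 8–16 px, small 16–32 px, medium 32–64 px); these thresholds are our reading of the benchmark convention and should be checked against the AI-TOD definition before reuse.

```python
import math

def ai_tod_size_bucket(w, h):
    """Assign a box of width w and height h (pixels) to an AI-TOD-style
    scale bucket based on sqrt(area). Edges assumed from the benchmark
    convention, not restated in this paper."""
    s = math.sqrt(w * h)
    if s < 8:
        return "very tiny"  # scored by APvt
    if s < 16:
        return "tiny"       # scored by APt
    if s < 32:
        return "small"      # scored by APs
    return "medium"         # scored by APm
```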
Table 4. Comparison of different models on SODA-A small dataset.
| Method | Publication | AP | AP0.5 | AP0.75 |
|---|---|---|---|---|
| Rotated Faster RCNN [61] | 2017 | 32.5 | 70.1 | 24.3 |
| RoI Transformer [46] | 2019 | 36.0 | 73.0 | 30.1 |
| Rotated RetinaNet [47] | 2020 | 26.8 | 63.4 | 16.2 |
| Gliding Vertex [62] | 2021 | 31.7 | 70.8 | 22.6 |
| Oriented RCNN [49] | 2021 | 34.4 | 70.7 | 28.6 |
| S2A-Net [63] | 2022 | 28.3 | 69.6 | 13.1 |
| DODet [64] | 2022 | 31.6 | 68.1 | 23.4 |
| Oriented RepPoints [59] | 2022 | 26.3 | 58.8 | 19.0 |
| DHRec [65] | 2022 | 30.1 | 68.8 | 19.8 |
| M2Vdet [66] | 2023 | 37.0 | 75.3 | 31.4 |
| CFINet [67] | 2023 | 34.4 | 73.1 | 26.1 |
| YOLOv8s [38] | 2023 | 30.6 | 72.1 | 40.6 |
| YOLOv11s [41] | 2024 | 42.7 | 74.2 | 45.2 |
| YOLC [57] | 2024 | 35.8 | 73.5 | 44.6 |
| HAD | - | 43.2 | 76.7 | 45.7 |
Table 5. Ablation experiment on AI-TOD dataset.
| SpotEnh | AHEEnh | Bi-FF | AP | AP0.5 | APvt | APt | APs | APm |
|---|---|---|---|---|---|---|---|---|
| - | - | - | 20.2 | 50.9 | 7.6 | 21.7 | 30.6 | 33.1 |
| ✓ | - | - | 20.8 | 51.2 | 7.7 | 22.6 | 30.9 | 33.3 |
| - | ✓ | - | 20.5 | 51.1 | 7.7 | 22.2 | 30.7 | 33.2 |
| - | - | ✓ | 20.4 | 51.0 | 7.6 | 22.0 | 30.7 | 33.2 |
| ✓ | ✓ | ✓ | 21.4 | 51.6 | 7.8 | 23.3 | 31.3 | 33.4 |
Table 6. Performance comparison of different object detection methods on AI-TOD.
| Method | Image Size | AP | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| SSD | 800 × 800 × 3 | 7.00 (±2.5) | 26.80 | 62.80 |
| YOLOv8s | 800 × 800 × 3 | 11.60 (±1.8) | 20.63 | 28.60 |
| YOLOv11s | 800 × 800 × 3 | 18.70 (±1.8) | 9.43 | 16.85 |
| HAD | 800 × 800 × 3 | 21.40 (±1.9) | 19.81 | 32.10 |