Electronics
  • Article
  • Open Access

27 June 2025

A Remote Sensing Image Object Detection Model Based on Improved YOLOv11

1 Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China
2 School of Integrated Circuit, Shenzhen Polytechnic University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning and Computational Intelligence in Remote Sensing

Abstract

Due to the challenges posed by high resolution, substantial background noise, significant object scale variation, and long-tailed data distribution in remote sensing images, traditional techniques often struggle to maintain both high accuracy and low latency. This paper proposes YOLO11-FSDAT, an advanced object detection framework tailored for remote sensing imagery, which integrates not only modular enhancements but also theoretical and architectural innovations to address these limitations. First, we propose the frequency–spatial feature extraction fusion module (Freq-SpaFEFM), which breaks the conventional paradigm of spatial-domain-dominated feature learning by introducing a multi-branch architecture that fuses frequency- and spatial-domain features in parallel. This design provides a new processing paradigm for multi-scale object detection, particularly enhancing the model’s capability in handling dense and small-object scenarios with complex backgrounds. Second, we introduce the deformable attention-based global–local fusion module (DAGLF), which combines fine-grained local features with global context through deformable attention and residual connections. This enables the model to adaptively capture irregularly oriented objects (e.g., tilted aircraft) and effectively mitigates the issue of information dilution in deep networks. Third, we develop the adaptive threshold focal loss (ATFL), which is the first loss function to systematically address the long-tailed distribution in remote sensing datasets by dynamically adjusting focus based on sample difficulty. Unlike traditional focal loss with fixed hyperparameters, ATFL decouples hard and easy samples and automatically adapts to varying class distributions. Experimental results on the public DOTAv1, SIMD, and DIOR datasets demonstrated that YOLO11-FSDAT achieved 75.22%, 82.79%, and 88.01% mAP, respectively, outperforming baseline YOLOv11n by up to 4.11%. These results confirm the effectiveness, robustness, and broader theoretical value of the proposed framework in addressing key challenges in remote sensing object detection.

1. Introduction

With the rapid development of satellite and UAV technologies, object detection in high-resolution remote sensing imagery—an essential task in remote sensing analysis—has seen remarkable advancements in recent years. It plays a significant role in various fields, including smart city development, emergency rescue, and military reconnaissance [1].
Traditional remote sensing object detection methods rely on manually designed feature representations, making them sensitive to object size and deformation, leading to poor robustness and weak generalization ability. When implemented on remote sensing data with densely distributed small objects and low contrast between objects and background, these methods are prone to missed detections and even false detections [2]. Aiming to boost accuracy in object detection and motion tracking, deep-learning-based remote sensing object detection techniques have emerged. However, with the increasing demand and the rapid growth in the scale of remote sensing images, higher accuracy in object detection is required, necessitating continuous technological advancements. Consequently, object detection in remote sensing imagery continues to encounter various difficulties.
The core mission of detecting objects within remote sensing images is to determine whether there are objects of interest in the image, accurately predict the positional information of the objects, and identify their categories. Remote sensing imagery exhibits the following characteristics:
(1)
The trade-off between high resolution and computational efficiency: Remote sensing images with excessively high resolution can result in a significant loss of detailed information when inputted at a smaller resolution. However, using high-resolution images directly leads to excessive hardware resource consumption. Moreover, cropping the image may result in incomplete detection of objects at the cropped boundaries, making the image slicing process a challenging problem to address.
(2)
Interference from complex backgrounds and inter-class similarity: In remote sensing scenes, the material reflectance characteristics of objects and backgrounds can be similar (e.g., camouflage vehicles in deserts blending with the sand). Additionally, objects of the same class may exhibit polymorphism due to viewpoint differences (e.g., flat-roofed houses and peaked-roofed houses both being classified as “buildings”). This leads to overlapping feature spaces, interfering with the model’s capability for learning the distinguishing features of the detected objects, reducing the effectiveness of object feature extraction and raising the complexity of detecting objects.
(3)
Arbitrary object orientation and dense arrangements: The overhead perspective of remote sensing images causes objects (such as port containers or tilted aircraft) to appear in arbitrary orientations. Horizontal bounding boxes tend to include a significant amount of background noise, while densely packed objects (such as small boats in ports) are prone to missed detections.
(4)
Significant variations in object scale: In remote sensing images, object sizes span multiple orders of magnitude (e.g., 10 m-class oil tankers versus 1 m-class cars). A single architecture typically struggles to capture features uniformly across different object sizes, such as the reliance of large objects on a large receptive field and global information, and the need for small objects to focus on local details.
(5)
Long-tailed data distribution: In satellite remote sensing object detection datasets, commonly occurring targets dominate the data, whereas small or infrequent objects are underrepresented, resulting in a distinctly long-tailed distribution. This imbalance hampers the generalization capability of deep learning models, which tend to perform well on frequent categories but poorly on rare or small targets. Consequently, the detection accuracy for small objects is notably affected.

3. Materials and Methods

3.1. Architecture of the Proposed Object Detection Model

This work introduces an enhanced neural network architecture based on YOLOv11n, called YOLO11-FSDAT, which incorporates three enhancements over the original network.
Firstly, considering the characteristics of remote sensing images—such as a high proportion of high-frequency components, dense small targets prone to omission, significant background noise, and large inter-class size variations [22]—we designed the Freq-SpaFEFM (frequency–spatial feature extraction and fusion module). This module was integrated into both the backbone and neck of the network to effectively address these challenges.
Secondly, to tackle the increased computational complexity caused by high-resolution remote sensing images and the challenge of fusing global and local information due to target scale variations [23], we proposed the deformable attention global feature and local feature fusion (DAGLF) module, which was elegantly and efficiently incorporated into the neck of the network to enhance its feature fusion capabilities.
Finally, we employed adaptive threshold focal loss (ATFL) as the classification loss to address class imbalance in remote sensing datasets, enabling the model to better focus on hard-to-classify targets and improve detection performance across underrepresented categories [24].
The architecture of the YOLO11-FSDAT model is depicted in Figure 1. It should be noted that, in our effort to enhance the visual representation of the network structure, the colored feature maps shown in Figure 1 are only used to characterize the visual presentation of the feature intensity distribution, and the colors do not represent the real image color channel information. The purple, green, and blue colors in the figure were derived from common pseudo-color mappings for feature visualization and are intended to reflect the activation of features extracted by the network at different stages and their spatial distribution. This approach helps to demonstrate the multi-scale fusion process of the features and the role of each module in the network structure.
Figure 1. Network architecture of YOLO11-FSDAT (the colored feature map in Figure 1 is a pseudo-color visualization of the feature activation distribution—the colors are only used to show the feature information at different stages and do not represent the real image channel).

3.2. Frequency–Spatial Feature Extraction Fusion Module

3.2.1. WaveletConv

Through an analysis of the typical features of remote sensing imagery, we noticed that due to their high resolution and large size, small targets in the image were often arranged in dense spatial patterns and were typically accompanied by a significant amount of complex background noise [25]. This characteristic makes the model prone to missing detections. Figure 2 shows grayscale images from the COCO128 dataset (left) and the DOTAv1 dataset (right), along with their corresponding frequency spectrum images after fast Fourier transform (FFT). It should be noted that the spectrograms in the figure are all grayscale graph visualization results—their grayscale intensity indicates the magnitude size of different frequency components, and the color does not represent the actual color information of the original graph. As observed, the frequency spectrum of the remote sensing images exhibited a distinct “cross-shaped” highlighted region that extends outward, indicating a significant presence of periodic high-frequency components in the image. In contrast, the frequency spectrum of the ordinary image on the left has its highlights concentrated at the center, signifying that the image predominantly contained low-frequency components. This comparative experiment demonstrated that extracting more high-frequency information from remote sensing images can significantly enhance the model’s detection performance in object detection tasks, while the effective utilization of such high-frequency features can substantially improve the network’s robustness in complex scenarios [26].
Figure 2. Grayscale images of example images and their frequency spectrum after fast Fourier transform (the spectrogram in the figure is the result of the magnitude visualization of the image after FFT, and the gray intensity indicates the energy distribution of the frequency components).
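The comparison in Figure 2 is straightforward to reproduce: compute the 2D FFT of a grayscale image, center it, and display the log magnitude. The snippet below is our own minimal NumPy sketch, not the authors' script.

```python
import numpy as np

def log_magnitude_spectrum(gray: np.ndarray) -> np.ndarray:
    """Centered log-magnitude spectrum of a 2D grayscale image."""
    f = np.fft.fft2(gray.astype(np.float64))   # 2D FFT
    f = np.fft.fftshift(f)                     # move the zero-frequency component to the center
    return np.log1p(np.abs(f))                 # log scale for display

# A remote sensing tile typically shows bright energy extending away from the center
# (strong high-frequency content), whereas a natural image concentrates energy near the center.
```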
However, the image obtained after the Fourier transform loses spatial information, so performing convolution directly on the frequency spectrum is of little use. Inspired by WTConv [27], we chose the wavelet transform, a time–frequency analysis tool, to extract sub-bands of different frequency components from the image while retaining the spatial information. This allowed the convolutional kernels operating on high-frequency sub-bands to more accurately capture the key high-frequency component information in the image. Furthermore, the frequency features contained rich information about scale, texture, and angle, which can be a good complement to the spatial features [28].
Figure 3 shows the sub-band images obtained after performing a two-level discrete Haar wavelet transform on the remote sensing image from Figure 2. It should be noted, in particular, that the colors in the plots are used for visualization purposes only. The approximation (low-frequency) maps are still in grayscale and represent the low-frequency backbone information, while the detail component maps are rendered in blue pseudo-colors, which are used to show the high-frequency texture details in different directions more clearly, and are not the original color information of the image.
Figure 3. Sub-band images obtained after performing two-level discrete Haar wavelet transform on the remote sensing image (from the DOTAv1 dataset). The blue detail map in the figure shows the results of the visualization of the wavelet high-frequency sub-bands, which are used to enhance the readability of the high-frequency information.
In dense detection tasks for remote sensing images, target edge features often manifest as high-frequency signals, particularly evident in the final three columns. In contrast, the first column on the left contains more complex low-frequency background information, which can interfere with the convolution layer’s ability to extract key features. The high-frequency sub-bands after wavelet decomposition mainly represent the edge information of the target. Performing convolution on these high-frequency sub-bands helps the convolutional kernel capture the crucial edge information of the target, while minimizing the impact of cluttered background information.
As can be observed from the comparison between Figure 2 and Figure 3, the wavelet transform demonstrated significant advantages over the Fourier transform in preserving spatial information. The Fourier transform converted the entire image into the frequency domain, where each frequency component corresponded to the whole image area, resulting in complete loss of positional information and making it difficult to locate the spatial positions of specific frequency components within the image. While this global processing approach is suitable for periodic and stationary signals, it proves inadequate for remote sensing images rich in local details and sharp edges. In contrast, the wavelet transform performed multi-scale analysis using localized window functions at different scales, enabling simultaneous provision of both frequency and spatial information—known as the “time–frequency localization” property. This means wavelets can precisely capture the specific locations of high-frequency texture features in images, making them particularly suitable for processing key visual characteristics, such as edges, textures, and corner points. By decomposing the image into multi-level high- and low-frequency components, wavelets not only preserve the main image structure (low frequency) but also extract directional high-frequency details, significantly enhancing the representation of object boundaries and structural features in remote sensing images. Therefore, in remote sensing object detection tasks, wavelet transforms are more effective than Fourier transforms in maintaining spatial information and improving detail perception, enabling networks to achieve better performance when detecting densely distributed small objects or structurally complex targets.
As illustrated in Figure 4, the WaveletConv architecture employs a tensor parallel replication mechanism. The input feature map $X$ is duplicated along the batch dimension to generate two identical computational branches, $X_1$ and $X_2$, which subsequently undergo multi-scale wavelet decomposition and cross-branch feature fusion:
$$X_2 = X_1$$
Figure 4. Information flow of WaveletConv (the input grayscale image is from the DOTAv1 dataset, showing how the image flows through WaveletConv and how the feature maps change). The color shades in the figure reflect the response intensity in different regions of the feature map, which is used to visualize the information flow changes in each stage of WaveletConv.
The branch $X_1$ undergoes a wavelet transform (WT) first, which splits it into four sub-bands: "HH," "HL," "LH," and "LL":
$$X_{LL}^{1},\, X_{LH}^{1},\, X_{HL}^{1},\, X_{HH}^{1} = \mathrm{WT}\left(X_1\right)$$
Then, the $X_{LL}^{1}$ sub-band undergoes another wavelet transform, which is also split into four sub-bands:
$$X_{LL}^{2},\, X_{LH}^{2},\, X_{HL}^{2},\, X_{HH}^{2} = \mathrm{WT}\left(X_{LL}^{1}\right)$$
After the second-level wavelet transform, the four sub-bands, $X_{LL}^{2}$, $X_{LH}^{2}$, $X_{HL}^{2}$, and $X_{HH}^{2}$, each pass through a convolution layer, and then an inverse wavelet transform (IWT) is utilized to recover $Y_2$:
$$Y_2 = \mathrm{IWT}\left(\mathrm{Conv}\left(X_{LL}^{2}\right), \mathrm{Conv}\left(X_{LH}^{2}\right), \mathrm{Conv}\left(X_{HL}^{2}\right), \mathrm{Conv}\left(X_{HH}^{2}\right)\right)$$
Afterward, $Y_2$ is added to the four sub-bands $X_{LL}^{1}$, $X_{LH}^{1}$, $X_{HL}^{1}$, and $X_{HH}^{1}$ from the first wavelet transform, which have also passed through the convolution layer. The result is then subjected to an inverse wavelet transform (IWT) to obtain $Y_1$:
$$X_1 = \left(\mathrm{Conv}\left(X_{LL}^{1}\right), \mathrm{Conv}\left(X_{LH}^{1}\right), \mathrm{Conv}\left(X_{HL}^{1}\right), \mathrm{Conv}\left(X_{HH}^{1}\right)\right)$$
$$Y_1 = \mathrm{IWT}\left(X_1 + Y_2\right)$$
Finally, $Y_1$ is added to the other branch $X_2$, which has passed through a convolution layer, resulting in the final output $Y$ of the wavelet convolution:
$$Y = Y_1 + \mathrm{Conv}\left(X_2\right)$$
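To make this data flow concrete, the following PyTorch sketch implements a two-level Haar decomposition with depthwise convolutions on the sub-bands. The Haar filter construction, the 3 × 3 depthwise kernels, and the choice to add $Y_2$ back onto the LL position of the level-1 sub-bands (which is how the dimensions work out, and how the WTConv reference design handles it) are our assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _haar_filters(channels):
    # Orthonormal 2x2 Haar filters (LL, LH, HL, HH), repeated per channel for depthwise use.
    h = torch.tensor([1.0, 1.0]) / 2 ** 0.5
    g = torch.tensor([1.0, -1.0]) / 2 ** 0.5
    bank = torch.stack([torch.outer(a, b) for a in (h, g) for b in (h, g)])  # (4, 2, 2)
    return bank.unsqueeze(1).repeat(channels, 1, 1, 1)                       # (4C, 1, 2, 2)

def wt(x):   # one-level Haar DWT; channel layout: [c0_LL, c0_LH, c0_HL, c0_HH, c1_LL, ...]
    c = x.shape[1]
    return F.conv2d(x, _haar_filters(c).to(x), stride=2, groups=c)

def iwt(y):  # exact inverse of wt() for the orthonormal Haar basis
    c = y.shape[1] // 4
    return F.conv_transpose2d(y, _haar_filters(c).to(y), stride=2, groups=c)

class WaveletConvSketch(nn.Module):
    """Two-level wavelet convolution sketch following Section 3.2.1 (H and W must be multiples of 4)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv_l1 = nn.Conv2d(4 * channels, 4 * channels, k, padding=k // 2, groups=4 * channels)
        self.conv_l2 = nn.Conv2d(4 * channels, 4 * channels, k, padding=k // 2, groups=4 * channels)
        self.conv_spatial = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)

    def forward(self, x):
        x1, x2 = x, x                                  # X2 = X1: two identical branches
        s1 = wt(x1)                                    # level-1 sub-bands (LL1, LH1, HL1, HH1)
        y2 = iwt(self.conv_l2(wt(s1[:, 0::4])))        # decompose LL1 again, conv, IWT -> Y2
        b, _, h, w = s1.shape
        s1c = self.conv_l1(s1).view(b, -1, 4, h, w)    # conv on the level-1 sub-bands
        s1c = torch.cat([s1c[:, :, :1] + y2.unsqueeze(2), s1c[:, :, 1:]], dim=2)  # add Y2 on LL1
        y1 = iwt(s1c.view(b, -1, h, w))                # IWT back to the input resolution -> Y1
        return y1 + self.conv_spatial(x2)              # Y = Y1 + Conv(X2)

# Quick shape check: WaveletConvSketch(16)(torch.randn(1, 16, 64, 64)).shape -> (1, 16, 64, 64)
```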

3.2.2. Freq-SpaFEFM

The proposed frequency–spatial feature extraction fusion module (Freq-SpaFEFM) aims to efficiently address challenges, such as complex background with noise, small and densely arranged targets, and significant size differences between targets. Specifically, the Freq-SpaFEFM module innovatively combines time–frequency domain analysis with spatial-domain feature extraction strategies. Through the collaboration of convolutional branch structures and the WaveletConv module, it significantly improves detection accuracy for both small and large objects in remote sensing imagery.
As illustrated in Figure 5, the input feature map was first processed by a convolutional layer to generate feature map $X_2$, which was then evenly split along the channel dimension into continuous sub-tensor branches, a and b, using a Split operation. One of these branches, a, was fed into a region called the "Fre-Spa feature extraction" area, where frequency-domain and spatial-domain features are jointly extracted. In this region, the feature map was first split into two consecutive sub-branches, a1 and a2, using a Split operation. Each branch contained half of the original number of channels. This design ensured that different branches focus on extracting multi-scale and multi-level fine-grained information. Then, branch a1 underwent sequential feature extraction through two Conv–WaveletConv combination blocks. Each combination block was designed with residual connections (represented by yellow dashed lines) [29], further enhancing the flexibility and diversity of feature flow within the module. This effectively alleviated feature degradation and gradient vanishing problems that can occur in deep networks.
Figure 5. Structural diagram of the Freq-SpaFEFM module.
Afterward, the a2 branch passed through a convolutional layer and was then concatenated with a1 along the channel dimension to fuse the features extracted by each sub-branch. The resulting feature map b was then output from the Fre-Spa feature extraction region. In this region, thanks to the time–frequency domain characteristics of the WaveletConv module, the module can effectively filter out low-frequency background noise that interferes with the features in remote sensing images, while highlighting important information, such as the contours and edges of small targets in the high-frequency sub-bands. Therefore, performing convolution on the high-frequency sub-bands effectively extracted the key features of small targets. At the same time, larger targets, which are usually preserved in the low-frequency sub-bands along with background noise, can still have their features extracted by performing convolution on the low-frequency sub-bands. The background noise primarily interfered with small targets, so convoluting the low-frequency sub-bands helped in extracting the features of large targets as well. Additionally, thanks to the downsampling operation of the wavelet transform (WT) and its time–frequency analysis properties, the receptive field during convolution on the sub-bands was mapped to a larger area of the original image [27]. Specifically, applying a 3 × 3 convolution to the sub-bands in the second-level wavelet domain of the WaveletConv module will map back to a 12 × 12 area of the original image through the inverse wavelet transform (IWT). This expansion of the receptive field undoubtedly enhanced the model’s ability to extract features of large-sized targets. Subsequently, the frequency domain features, which have been filtered and enhanced, were reconstructed into the spatial domain through an inverse wavelet transform (IWT). After the feature map b passed through a 1 × 1 convolution layer to adjust the channel number, it entered another identical second-level Fre-Spa feature extraction region. The output feature map c, after being processed by a Concat operation and a convolution layer, was fused with feature map X and the output feature map b from the first-level Fre-Spa feature extraction region. This fusion formed a rich feature representation that incorporated deep and shallow information, large and small receptive fields, and both frequency domain and spatial information. The ability to detect targets of different scales, particularly closely located small ones, was notably enhanced, and the model became more robust to background interference in remote sensing images.
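The overall wiring can be summarized in a simplified structural sketch. For brevity we keep only one Fre-Spa feature extraction region with two residual Conv–WaveletConv blocks; the channel splits, kernel sizes, and fusion convolution are our assumptions, and WaveletConvSketch refers to the sketch given in Section 3.2.1.

```python
import torch
import torch.nn as nn

class FreqSpaFEFMSketch(nn.Module):
    """Simplified sketch of Freq-SpaFEFM with a single Fre-Spa feature extraction region.
    `wavelet_conv` is any shape-preserving module, e.g. WaveletConvSketch from Section 3.2.1."""
    def __init__(self, c, wavelet_conv):
        super().__init__()
        assert c % 4 == 0
        self.stem = nn.Conv2d(c, c, 1)
        self.block1 = nn.Sequential(nn.Conv2d(c // 4, c // 4, 3, padding=1), wavelet_conv(c // 4))
        self.block2 = nn.Sequential(nn.Conv2d(c // 4, c // 4, 3, padding=1), wavelet_conv(c // 4))
        self.a2_conv = nn.Conv2d(c // 4, c // 4, 3, padding=1)
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        a, skip = self.stem(x).chunk(2, dim=1)                # split into the two sub-tensor branches
        a1, a2 = a.chunk(2, dim=1)                            # Fre-Spa region: further split of branch a
        a1 = a1 + self.block1(a1)                             # Conv-WaveletConv block, residual connection
        a1 = a1 + self.block2(a1)
        region = torch.cat([a1, self.a2_conv(a2)], dim=1)     # fuse the frequency and spatial sub-branches
        return self.fuse(torch.cat([x, skip, region], dim=1))  # fuse with the module input and skip branch

# Quick shape check: FreqSpaFEFMSketch(64, WaveletConvSketch)(torch.randn(1, 64, 64, 64)).shape -> (1, 64, 64, 64)
```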

3.3. Deformable Attention Global Feature and Local Feature Fusion Module (DAGLF)

3.3.1. Deformable Attention Module

In various computer vision tasks, the attention mechanisms are highly regarded for their ability to capture global dependencies. Traditional attention mechanisms mainly include global self-attention (e.g., vision transformer) and local self-attention (e.g., Swin transformer, PVT). Global self-attention computes dependencies across all positions, enabling the capture of long-range dependencies. However, this comes with a significant computational and memory cost and can sometimes lead to attention being distributed to irrelevant information. On the other hand, local attention uses fixed windows or downsampling strategies, which are computationally more efficient but are often data-independent, potentially overlooking crucial global information or long-range relationships. The deformable attention transformer (DAT) addressed these issues by introducing a data-dependent deformable self-attention mechanism. It started by generating a uniform reference point grid on the input features. Then, a lightweight offset network dynamically generated offsets based on the queries, mapping the keys and values to more representative local regions. At the same time, it captured spatial geometric information more precisely through deformable relative position biases, allowing for finer-grained attention to important areas. This design significantly reduced computational complexity and memory overhead while adaptively focusing on important regions. It effectively balanced global contextual dependencies with local feature details, thereby enhancing performance in image classification, object detection, and semantic segmentation tasks [30].
DAM is a key component of the deformable attention transformer (DAT). In the previous section, we proposed methods for addressing missed and misdetected small and densely arranged targets. However, in remote sensing images, there is another scenario where small targets are sparsely distributed in high-resolution images. Traditional multi-head attention mechanisms, for input features $x \in \mathbb{R}^{H \times W \times C}$ (where $H$ and $W$ represent the image height and width, and $C$ represents the number of channels), typically map the input to query, key, and value spaces through a 1 × 1 convolution:
$$q = x W_q, \quad k = x W_k, \quad v = x W_v$$
where $W_q, W_k, W_v \in \mathbb{R}^{C \times C}$ are the linear projection matrices. In multi-head attention, the channel dimension $C$ is divided into $M$ attention heads, each with dimension $d = C / M$. For the $m$-th head, its attention output is:
$$z_m = \mathrm{softmax}\!\left(\frac{q_m k_m^{T}}{\sqrt{d}}\right) v_m, \quad m = 1, \ldots, M$$
Finally, the outputs of all heads were concatenated and passed through the output projection $W_o$ to obtain the final result:
$$z = \mathrm{Concat}\left(z_1, \ldots, z_M\right) W_o$$
This global dependency modeling performed excellently in capturing the overall semantics of an image, but in remote sensing images, where the target regions are often small and sparsely distributed, directly applying global dense attention introduced a large amount of irrelevant background information, which affected the detection accuracy.
To mitigate this problem, the DAT module incorporated a deformable sampling mechanism. First, the query $q$ was decomposed by groups, dividing the channels $C$ into $G$ groups. The number of channels per group is:
$$C_g = \frac{C}{G}$$
The corresponding number of attention heads for each group is:
$$M_g = \frac{M}{G}$$
For each group of query features $q_{\mathrm{group}}$, a lightweight offset generation network calculated the 2D offsets:
$$\Delta \tilde{p} = \theta_{\mathrm{offset}}\left(q_{\mathrm{group}}\right)$$
where $\theta_{\mathrm{offset}}(\cdot)$ consists of a depth-wise (grouped) convolution with kernel size $k \times k$, followed by LayerNorm and GELU activation, and then a 1 × 1 convolution with an output channel size of 2. To prevent the offsets from becoming too large, they were normalized using a hyperbolic tangent function and multiplied by an offset range derived from the downsampled size:
$$\mathrm{offset\_range} = \left(\frac{1}{H_K - 1}, \frac{1}{W_K - 1}\right)$$
Then, multiplying by a predefined scaling factor $\mathrm{offset\_range\_factor}$, the normalized offset was obtained:
$$\Delta p = \mathrm{offset\_range\_factor} \cdot \tanh\left(\Delta \tilde{p}\right) \odot \mathrm{offset\_range}$$
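A compact sketch of this offset branch is shown below; the stride-1 depthwise convolution, the channel-last LayerNorm, and the default kernel size and scaling factor are illustrative assumptions (the DAT implementation additionally downsamples the query when generating offsets).

```python
import torch
import torch.nn as nn

class OffsetNetSketch(nn.Module):
    """Offset generation: depthwise k x k conv -> LayerNorm -> GELU -> 1x1 conv with 2 output channels,
    followed by tanh normalization scaled by offset_range and offset_range_factor."""
    def __init__(self, c_group, k=5, offset_range_factor=2.0):
        super().__init__()
        self.dw = nn.Conv2d(c_group, c_group, k, padding=k // 2, groups=c_group)
        self.norm = nn.LayerNorm(c_group)            # applied channel-last in forward()
        self.act = nn.GELU()
        self.proj = nn.Conv2d(c_group, 2, 1)         # two channels: per-position (dy, dx)
        self.offset_range_factor = offset_range_factor

    def forward(self, q_group):                      # q_group: (B, c_group, Hk, Wk)
        x = self.dw(q_group)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        dp = self.proj(self.act(x))                  # raw offsets, shape (B, 2, Hk, Wk)
        hk, wk = dp.shape[-2:]
        offset_range = torch.tensor([1.0 / (hk - 1), 1.0 / (wk - 1)], device=dp.device).view(1, 2, 1, 1)
        return self.offset_range_factor * torch.tanh(dp) * offset_range   # normalized offsets
```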
Next, we constructed a uniform reference grid $\bar{p}$:
$$\bar{p}(i, j) = \left(\frac{2\left(i + 0.5\right)}{H_K} - 1,\; \frac{2\left(j + 0.5\right)}{W_K} - 1\right), \quad i = 0, \ldots, H_K - 1,\; j = 0, \ldots, W_K - 1$$
The offsets were added to the reference grid, resulting in the final sampling positions:
$$p = \bar{p} + \Delta p$$
The interpolation formula is as follows:
$$\phi\left(z; \left(p_x, p_y\right)\right) = \sum_{\left(r_x, r_y\right)} \max\left(0, 1 - \left|p_x - r_x\right|\right) \max\left(0, 1 - \left|p_y - r_y\right|\right) z\left[r_y, r_x, :\right]$$
where $(r_x, r_y)$ indexes the integer pixel positions of $z$.
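In PyTorch, the reference grid and this bilinear interpolation can be expressed with torch.nn.functional.grid_sample, which performs exactly the weighted sum above; the following sketch (ours, not the authors' code) uses the align_corners=False convention so that the grid matches the pixel-center normalization of the reference points.

```python
import torch
import torch.nn.functional as F

def reference_grid(hk, wk, device=None):
    """Uniform reference points in normalized [-1, 1] coordinates, shape (1, hk, wk, 2), (x, y) order."""
    ys = (torch.arange(hk, device=device, dtype=torch.float32) + 0.5) / hk * 2 - 1
    xs = (torch.arange(wk, device=device, dtype=torch.float32) + 0.5) / wk * 2 - 1
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).unsqueeze(0)

def deformable_sample(x, dp):
    """Sample x (B, C, H, W) at p = ref + dp, where dp is (B, 2, hk, wk) ordered as (dy, dx)."""
    b, _, hk, wk = dp.shape
    ref = reference_grid(hk, wk, x.device).expand(b, -1, -1, -1)
    dp_xy = dp.permute(0, 2, 3, 1).flip(-1)                 # -> (B, hk, wk, 2) in (x, y) order
    pos = (ref + dp_xy).clamp(-1.0, 1.0)
    return F.grid_sample(x, pos, mode="bilinear", align_corners=False)   # (B, C, hk, wk)
```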
In remote sensing object detection, this dynamic sampling strategy enables the model to adaptively focus on target regions. Especially in scenarios with significant target size variations and sparse distribution, it effectively enhances the representation of target features.
After dynamic sampling, the DAT module used 1 × 1 convolutions to project the deformed sampled feature elements to generate the keys and values:
$$\tilde{k} = \mathrm{proj}_k\left(\tilde{x}\right), \quad \tilde{v} = \mathrm{proj}_v\left(\tilde{x}\right)$$
The original query $q$ was also passed through a 1 × 1 convolution to retain the full channel information. Then, for each attention head, the query was rearranged to the shape $(B \cdot M, d, H \times W)$, while the keys and values took the shape $(B \cdot M, d, N_s)$, where the number of sampled points is $N_s = H_K \times W_K$. The multi-head attention calculation formula is:
$$A = \mathrm{softmax}\!\left(\frac{q \tilde{k}^{T}}{\sqrt{d}}\right)$$
It computed the weighted sum to obtain the output of each head:
$$z_m = A \tilde{v}$$
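With the tensors reshaped as described above (query of shape $(B \cdot M, d, H \times W)$, sampled keys and values of shape $(B \cdot M, d, N_s)$), the attention itself is a standard scaled dot product. A minimal sketch, omitting the relative position bias introduced next:

```python
import math
import torch

def deformable_attention(q, k_tilde, v_tilde):
    """q: (B*M, d, H*W); k_tilde, v_tilde: (B*M, d, Ns). Returns (B*M, d, H*W)."""
    d = q.shape[1]
    attn = torch.softmax(q.transpose(1, 2) @ k_tilde / math.sqrt(d), dim=-1)  # (B*M, H*W, Ns)
    out = attn @ v_tilde.transpose(1, 2)                                      # (B*M, H*W, d)
    return out.transpose(1, 2)                                                # back to (B*M, d, H*W)
```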
In remote sensing object detection, preserving complete channel information combined with multi-head attention facilitates comprehensive fusion of large-scale structures and small-object features within the image. This makes the model more accurate when handling complex backgrounds.
To further enhance spatial information modeling capabilities, the DAT module introduced relative position encoding into the attention calculation. First, the query grid $q_{\mathrm{grid}}$ was generated through a function _get_q_grid (normalized coordinates, analogous to the reference grid $\bar{p}$), in the form:
$$q_{\mathrm{grid}}(i, j) = \left(\frac{2i}{H - 1} - 1,\; \frac{2j}{W - 1} - 1\right), \quad i = 0, \ldots, H - 1,\; j = 0, \ldots, W - 1$$
The relative displacement between each query position and each sampling position was calculated as:
$$R = q_{\mathrm{grid}} - p$$
The relative displacement $R$ describes the geometric relationship of the targets in space and was normalized. Then, using a predefined relative position bias table $\hat{B}$ (e.g., of shape $(2H - 1) \times (2W - 1)$) and the interpolation function, the relative position bias term was calculated as $\phi\left(\hat{B}; R\right)$. Finally, this bias term was added to the attention scores, and the extended attention calculation formula is:
$$\mathrm{Attn} = \mathrm{softmax}\!\left(\frac{q \tilde{k}^{T}}{\sqrt{d}} + \phi\left(\hat{B}; R\right)\right)$$
In remote sensing images, the spatial relationships between targets are often crucial. This mechanism provides the model with precise spatial prior information, which enhances the precision of target localization and detection.
The traditional global attention mechanism typically has a computational complexity of $O\left(H^2 W^2 C\right)$ on high-resolution images. However, the DAT module reduced the computational cost of the offset generation part through dynamic sampling and the grouped shared-offset strategy. Let the downsampling factor be $r$; the number of sampling points is then:
$$N_s = \frac{H W}{r^2}$$
The overall computational complexity is approximately:
$$\Omega\left(\mathrm{DMHA}\right) = 2 H W N_s C + 2 H W C^2 + \left(k^2 + 2\right) N_s C$$
The first term $2 H W N_s C$ corresponds to the dot-product attention computation, the second term $2 H W C^2$ comes from the full-channel 1 × 1 convolution projections, and the third term $\left(k^2 + 2\right) N_s C$ represents the computational load of the offset generation network. This complexity optimization enabled real-time remote sensing object detection under high-resolution inputs, while maintaining low computational and memory overhead.
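As a quick sanity check on these three terms, the helper below evaluates the expression for one feature-map size; the example numbers (a 64 × 64 × 256 feature map, r = 4, k = 5) are arbitrary illustrations, not the configuration used in the paper.

```python
def dmha_complexity(H, W, C, r, k):
    """Approximate operation count of deformable multi-head attention, following the three terms
    above: dot-product attention, full-channel 1x1 projections, and the offset generation network."""
    Ns = (H * W) // (r * r)
    return 2 * H * W * Ns * C + 2 * H * W * C * C + (k * k + 2) * Ns * C

print(dmha_complexity(64, 64, 256, 4, 5) / 1e9)   # ~1.08 (in units of 1e9 operations)
```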
In summary, dynamic sampling automatically focused on the target area, multi-head attention and full-channel projection ensured that target details were not lost, and relative position encoding provided more refined spatial prior information. Additionally, the optimization of computational complexity met the real-time detection requirements for large-scale high-resolution images. Therefore, the DAM is well-suited as an efficient and robust attention mechanism module in the field of image object detection.

3.3.2. Deformable Attention Global Feature and Local Feature Fusion Module

Although the DAM offers excellent dynamic attention performance and relatively lightweight computational complexity, it is inherently based on the transformer architecture, which emphasizes global information modeling. Direct insertion of this module may result in local features being diluted by global information, thereby reducing the model’s sensitivity to fine-grained target detection.
We designed the DAGLF module, which retains the convolutional layers in C2PSA for extracting local features and introduces the global information captured by the DAM, together with the spatial sensitivity provided by its dynamic sampling, to combine the advantages of the two network architectures and thus achieve higher accuracy and robustness in remote sensing image target detection. Specifically, after the feature map entered the DAGLF module, it first passed through a Conv layer, and then was split into two parts, $x_1$ and $x_2$. The feature map $x_2$ underwent the DAM and then passed through a 1 × 1 Conv layer with $2C$ output channels, followed by another 1 × 1 Conv layer with $C$ output channels, resulting in $y_2$, thus performing an operation similar to a fully connected layer:
$$y_2 = f\!\left(\mathrm{Conv}_2\!\left(f\!\left(\mathrm{Conv}_1\left(x_2\right) + b_1\right)\right) + b_2\right)$$
where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ correspond to convolutional layers for channel expansion and compression, respectively, with $f$ representing the activation function. This feed-forward network structure effectively applied a fully connected transformation at each spatial location, providing additional nonlinear mapping for pixel-level features without significantly increasing computational overhead, thereby further enhancing the feature representation capability. Additionally, considering the deep network layers of the module, residual connections [29] were designed for both the DAM and the Conv layers to provide smoother gradient flow during backpropagation. This helped mitigate the risk of gradient vanishing as the network depth increased, allowing the network to maintain the flow of gradients on the "original input path" even after introducing the attention module. On the other hand, the attention mechanism combined with the residual connections can be seen as the original features plus supplemental features from the attention mechanism's output:
$$x_{\mathrm{out}} = x_{\mathrm{in}} + x_{\mathrm{atten}}$$
During training, this structure allows the global information provided by the attention mechanism to progressively complement the local features extracted by the convolutional layers.
However, the feature map $y_2$ was not directly output. Since this module was embedded into the final layer of the backbone network, we first concatenated $y_2$ and $x_1$ along the channel dimension using the Concat operation. Then, a 1 × 1 convolutional layer was applied to adjust the number of channels, and the result was used as the final output of the module. This architecture enriched feature representation by combining global semantics from the module with backbone-derived local features, which is vital for managing large inter-class scale variations in remote sensing tasks. Figure 6 visualizes the DAGLF module's structure.
Figure 6. The structural block diagram of the DAGLF module.
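A structural sketch of this wiring is given below. The deformable attention block is passed in as a parameter (any module with matching input and output channels will do, e.g. nn.Identity() for a shape check); the 1 × 1 expansion and compression follow the 2C → C description in the text, while the activation choice and channel bookkeeping are our assumptions.

```python
import torch
import torch.nn as nn

class DAGLFSketch(nn.Module):
    """Simplified sketch of the DAGLF module (Figure 6): split -> deformable attention and 1x1 "FFN"
    with residual connections on one branch -> concatenation with the untouched local branch."""
    def __init__(self, c, dam: nn.Module):
        super().__init__()
        self.stem = nn.Conv2d(c, c, 1)
        self.dam = dam                                   # global branch: deformable attention module
        self.ffn = nn.Sequential(nn.Conv2d(c // 2, c, 1), nn.SiLU(), nn.Conv2d(c, c // 2, 1))
        self.out = nn.Conv2d(c, c, 1)

    def forward(self, x):
        x1, x2 = self.stem(x).chunk(2, dim=1)            # local branch x1, attention branch x2
        y2 = x2 + self.dam(x2)                           # residual connection around the DAM
        y2 = y2 + self.ffn(y2)                           # residual connection around the 1x1 convs
        return self.out(torch.cat([x1, y2], dim=1))      # fuse local and global features

# Quick shape check: DAGLFSketch(256, dam=nn.Identity())(torch.randn(1, 256, 20, 20)).shape -> (1, 256, 20, 20)
```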

3.4. Adaptive Threshold Focal Loss

In remote sensing image object detection, imbalanced sample distribution across categories is a prevalent issue, primarily arising from the inherent class composition of the datasets. Using the traditional loss functions, such as cross-entropy loss, often leads to significantly poorer detection performance for classes with fewer samples compared to others. Although the classic focal loss mitigates the impact of easily classified samples by introducing a modulation factor $(1 - p_t)^{\gamma}$, it also reduces the loss weight for difficult samples, which is detrimental to the learning of challenging samples. To address this issue, we adopted the adaptive threshold focal loss (ATFL), which is a dynamic loss function that adjusts sample weights based on classification difficulty. By decoupling easily and difficultly classified targets, it suppresses the influence of easy samples and emphasizes learning from hard examples, thereby enhancing model performance on imbalanced remote sensing datasets. ATFL combines the targeted nature of threshold decoupling and the flexibility of adaptive weight adjustment, enabling adaptive focusing based on sample difficulty [31]. It is capable of addressing more difficult samples while effectively mitigating the imbalance between foreground objects and background regions.
The classic cross-entropy loss function can be represented as:
$$L_{BCE} = -\left[y \log\left(p\right) + \left(1 - y\right) \log\left(1 - p\right)\right]$$
where $p$ represents the predicted probability and $y$ represents the true label. It can be succinctly expressed as:
$$L_{BCE} = -\log\left(p_t\right)$$
where
$$p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases}$$
The standard cross-entropy loss is insufficient for handling sample imbalance issues. Therefore, the focal loss function introduced a modulation factor $(1 - p_t)^{\gamma}$, with the focusing parameter $\gamma$ adjusted to reduce the loss contribution of easily classified samples. The focal loss function can be expressed as:
$$FL\left(p_t\right) = -\left(1 - p_t\right)^{\gamma} \log\left(p_t\right)$$
Although this modulation reduces the impact of easy samples, it simultaneously diminishes the gradient contribution of hard examples, which adversely affects their learning. To address this issue, the threshold focal loss (TFL) function was introduced, which reduces the loss weight of easily classified samples while increasing the loss weight assigned to difficult samples, effectively mitigating the impact of easy samples. Specifically, samples with prediction probability values above 0.5 were designated as easy samples, and those at or below this threshold were considered difficult samples. The TFL loss can be expressed as:
$$TFL = \begin{cases} -\lambda\, p_t^{\,\eta} \log\left(p_t\right), & p_t \le 0.5 \\ -\left(1 - p_t\right)^{\gamma} \log\left(p_t\right), & p_t > 0.5 \end{cases}$$
where η , γ , and λ (>1) are hyperparameters. For different models and datasets, these hyperparameters need to be adjusted multiple times to achieve optimal performance. In deep learning, each training process takes a significant amount of time, resulting in high time costs. Therefore, the adaptive improvements were made to η and γ .
For the simple samples, the loss value decreased as $p_t$ increased, further reducing the loss generated by simple samples. At the beginning of training, even the simple samples had relatively low predicted probabilities, similar to the difficult samples, and as training progressed, these probabilities gradually increased; accordingly, $\gamma$ should gradually approach 0 as training progresses. The predicted probabilities of ground truth targets can serve as a mathematical proxy for modeling training progress, and their evolution can be effectively estimated using exponential smoothing. This adaptive improvement can be expressed as:
$$\hat{p}_c = 0.05 \times \frac{1}{t - 1} \sum_{i = 0}^{t - 1} \bar{p}_i + 0.95 \times \bar{p}_t$$
where $\hat{p}_c$ represents the expected probability for the next epoch, $\bar{p}_t$ represents the average predicted probability of the current epoch, and $\bar{p}_i$ represents the average predicted probability of the $i$-th earlier epoch. According to Shannon's information theory, the larger the probability value of an event, the smaller the amount of information it brings; conversely, the greater the information amount. Therefore, the adaptive modulation factor $\gamma$ can be expressed as:
$$\gamma = -\ln\left(\hat{p}_c\right)$$
However, in the later stages of model training, an overly large expected probability $\hat{p}_c$ would reduce the weight of difficult samples. Therefore, we defined $\eta$ from the current prediction itself as:
$$\eta = \ln\left(p_t\right)$$
Since $p_t \le 0.5$ for difficult samples, $\eta$ is negative, so the factor $p_t^{\eta} \ge 1$ and the loss of hard samples is amplified rather than suppressed. By combining $\eta$ and $\gamma$ into the TFL, the final expression for ATFL was obtained as:
$$ATFL = \begin{cases} -\lambda\, p_t^{\ln\left(p_t\right)} \log\left(p_t\right), & p_t \le 0.5 \\ -\left(1 - p_t\right)^{-\ln\left(\hat{p}_c\right)} \log\left(p_t\right), & p_t > 0.5 \end{cases}$$
First, the threshold was set to decouple easily classified targets from hard-to-classify targets. Second, by reinforcing the loss related to difficult-to-classify targets and reducing the loss associated with easy-to-classify samples, we forced the model to allocate more attention to the features of difficult targets, thus alleviating the problem of class imbalance. Finally, an adaptive design was applied to the hyperparameters to reduce the time consumption caused by hyperparameter tuning.
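A sketch of this loss in PyTorch is given below. The 0.5 threshold and the λ factor follow the text, whereas the signs of the adaptive exponents follow our reading of the reconstructed equations above, so treat it as illustrative rather than a reference implementation; the smoothed expectation p̂_c must be tracked across epochs by the training loop.

```python
import torch

def atfl_loss(p, y, p_hat_c, lam=2.0, thresh=0.5, eps=1e-4):
    """Adaptive threshold focal loss (sketch). p: predicted probabilities, y: binary targets of the
    same shape, p_hat_c: smoothed expected probability for the current epoch (a float). lam is a
    hyperparameter; eps keeps the hard-sample amplification factor finite."""
    p = p.clamp(eps, 1 - eps)
    p_t = torch.where(y == 1, p, 1 - p)
    gamma = max(-torch.log(torch.tensor(float(p_hat_c))).item(), 0.0)    # gamma = -ln(p_hat_c)
    eta = torch.log(p_t)                                                 # eta = ln(p_t), negative
    loss_hard = -lam * p_t.pow(eta) * torch.log(p_t)                     # p_t <= 0.5: emphasized
    loss_easy = -(1 - p_t).pow(gamma) * torch.log(p_t)                   # p_t > 0.5: down-weighted
    return torch.where(p_t <= thresh, loss_hard, loss_easy).mean()

# Example: atfl_loss(torch.tensor([0.2, 0.1, 0.9]), torch.tensor([1.0, 0.0, 1.0]), p_hat_c=0.6)
```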
In remote sensing object detection tasks, the class imbalance often leads to the underperformance of models in correctly classifying target classes with fewer samples. The integration of ATFL loss with data augmentation strategies effectively enhanced the model’s attention to challenging targets during training, leading to improved detection accuracy.

4. Results

4.1. Description of the Experimental Environment and Dataset

4.1.1. Experiment Environment Settings

The experimental environment used CUDA 12.1 and the PyTorch 2.1.2 deep learning framework, with Ubuntu as the operating system. The model training was performed on an AMD EPYC 9754 @ 2.3 GHz processor and an NVIDIA GeForce RTX 4090 GPU for 300 epochs. Data augmentation techniques were applied, with Mosaic probability set to 1.0, Mixup probability to 0.2, Copy_paste to 0.2, and fliplr to 0.5. None of the models involved in ablation and comparison experiments used pre-trained weights during training.

4.1.2. Evaluate Metrics

In this study, we evaluated the algorithm’s performance by comparing the differences between the pre-improvement model and the post-improvement model under the same image detection experimental setup. The evaluation system for object detection was based on four key metrics: mean average precision (mAP), precision, recall, and F1 score:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 p\left(r\right) dr$$
$$mAP = \frac{1}{Q} \sum_{q = 1}^{Q} AP_q$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
TP (true positive), FN (false negative), and FP (false positive) denote the number of correctly detected targets, missed targets, and incorrectly detected background samples, respectively. Precision and recall were used to evaluate detection accuracy and completeness. The precision–recall curve was used to derive the AP for each class, and mAP was calculated as the mean of the APs across all $Q$ categories. The F1 score was computed as the harmonic mean of precision and recall.
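The precision, recall, and F1 definitions above reduce to a few lines of code once the detection counts are available; AP and mAP additionally require integrating the per-class precision–recall curve. A minimal helper:

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from raw detection counts, following the equations above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(detection_metrics(80, 10, 20))   # -> (0.888..., 0.8, 0.842...)
```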

4.1.3. Experimental Datasets

To validate the detection performance of the improved YOLOv11n network on remote sensing image object detection, the experiments were conducted on three publicly available datasets: DOTAv1, SIMD, and DIOR. The specific details are as follows:
  • The DOTAv1 dataset [32] is a large-scale remote sensing dataset specifically designed for aerial image object detection, released by the Wuhan University team in 2018. It contains 2806 aerial images, covering 15 categories with 188,282 instances.
  • The SIMD dataset [33] was proposed by the research team from the National University of Sciences and Technology (NUST) in Pakistan in 2020, primarily for vehicle detection tasks. This dataset contains 5000 high-resolution remote sensing images (size 1024 × 768) and is annotated with 45,096 target instances. It has a high proportion of small to medium-sized targets (both width and height smaller than 0.4).
  • The DIOR dataset [34] was introduced in 2020 as a comprehensive benchmark for optical remote sensing image object detection. It includes 23,463 images and 190,288 annotated instances, covering 20 object categories (such as airplanes, bridges, ports, vehicles, etc.).
These datasets include a diverse spectrum of detection targets, ranging from small-scale objects to large-scale infrastructure, thereby offering a comprehensive benchmark for assessing detection accuracy and model robustness. Table 1 presents the instance count for each category across the three datasets.
Table 1. The number of instances for each category in the three datasets.

4.1.4. Analysis of Experimental Results

To verify the target detection performance of the improved YOLO11-FSDAT network in remote sensing images, this paper compared the proposed improved algorithm with the currently mainstream algorithms in the field of remote sensing image target detection, specifically including YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and RT-DETR. The mean average precision (mAP) was adopted as the primary evaluation metric for assessing detection performance in the comparative experiments, and the corresponding results are presented in Table 2. To improve readability and facilitate comparison, we used red, blue, and green to label the first, second, and third rankings of the indicators.
Table 2. Comparison of detection results of different models on the DOTAv1 dataset (%).
The mAP50 represents the value when the IOU threshold is set to 0.5. Intersection over union (IoU) is a widely used metric that quantifies the overlap between the predicted bounding box and the corresponding ground truth bounding box. The higher the ratio, the better the detection performance. As shown in Table 2, our model achieved a mAP of 75.22% on the DOTAv1 dataset, outperforming YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and RT-DETR by 4.43%, 4.22%, 2.05%, 4.11%, 1.40%, and 4.32%, respectively. There were significant improvements in detecting objects, such as “ground track field,” “bridge,” and “swimming pool.” Specifically, the progress in detecting “ground track field” was attributed to the introduction of the Freq-SpaFEFM module, which helped expand the receptive field, enabling the model to extract key information from larger-sized targets. “Bridge” targets often suffer from background similarity and can be easily drowned in background noise. The Freq-SpaFEFM module’s convolutional features across different frequency sub-bands effectively enhanced high-frequency edge features and, combined with the DAGLF module’s ability to focus on key information from large targets, it significantly improved the model’s performance in detecting “bridge” targets. Additionally, as seen in the previous section, the sample sizes of “ground track field” and “swimming pool” targets were relatively small in the dataset. The adaptive threshold focal loss (ATFL) improved the model’s attention to low-frequency and hard-to-detect targets by adaptively adjusting its focus based on sample difficulty, thereby effectively mitigating the class imbalance problem in object detection.
Figure 7 illustrates the detection performance of the enhanced YOLO11-FSDAT model on the DOTAv1, SIMD, and DIOR datasets. The results demonstrate that the three proposed enhancements significantly improved the model’s performance over the baseline across a range of challenging remote sensing scenarios, leading to higher detection accuracy and a reduced missed detection rate.
Figure 7. Object detection results of YOLOv11n and YOLO11-FSDAT on remote sensing images. (a) Sample image with complex background noise and small-target scale (DOTAv1 dataset). (b) Densely arranged multi-sample image (DOTAv1 dataset). (c) Multi-sample image with large class size differences (DOTAv1 dataset). (d) Multi-sample image with foreground occlusion (SIMD dataset). (e) Single-class multi-angle small-target sample image (SIMD dataset). (f) Multi-sample image with large class size differences (SIMD dataset). (g) Single-class multi-angle sample image (DIOR dataset). (h) Multi-sample image with large class size differences (DIOR dataset). (i) Single-sample image with high-target-background similarity (DIOR dataset).
Specifically, as shown in Figure 7a, the image had significant background noise, and the target to be detected was small in size, easily drowning in the background noise. Before the improvement, the baseline model only detected one target, but the improved model detected three targets with higher confidence. Additionally, Figure 7d shows a sample with foreground occlusion, where a small car at the lower center is partially blocked by a bus, and a bus in the upper right corner is largely obscured. The baseline model failed to detect these targets, while the improved model detected them both. In Figure 7i, the target to be detected was large in size and very similar to the background. The improved model correctly detected the target, and the confidence level was significantly higher compared to the baseline model.
To further evaluate the detection performance across different target categories, we analyzed the AP, precision, recall, and F1-score for each class in the DOTAv1 dataset based on the proposed improved method, as presented in Table 3. Among all categories, “tennis court” achieved the highest scores across all four metrics, which can be attributed to its moderate object size and relatively clean surrounding background, allowing for more distinctive feature extraction. In contrast, the “bridge” category showed the lowest performance, likely due to its less distinctive structural features and frequent blending with complex backgrounds, resulting in a higher rate of missed detections. In order to improve readability and to facilitate comparison of indicator sizes for different types of detection targets, we used red, blue, and green to label the first, second, and third rankings of indicators. To clearly observe the variations in different metrics across target categories, we plotted bar charts for visualization, as shown in Figure 8.
Table 3. Performance analysis of the improved model for each category in the DOTAv1 dataset.
Figure 8. Bar charts of various metrics and categories on the DOTAv1 dataset. (a) Bar chart of objects by category and their AP. (b) Bar chart of objects by category and their precision. (c) Bar chart of objects by category and their recall. (d) Bar chart of objects by category and their F1 measure.

5. Discussion

5.1. Comparative Experiments

To assess the robustness and generalization capability of the proposed model, we performed comparative experiments against several state-of-the-art object detection algorithms, including YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and RT-DETR, using three publicly available remote sensing datasets: DOTAv1, SIMD, and DIOR. These datasets differ in terms of scene complexity, object distribution, and scale variation, thus serving as a reliable benchmark for evaluating model performance. The detailed experimental results are presented in Table 4. In order to improve readability and facilitate comparison of the size of the corresponding indicators on different datasets, we used red, blue, and green to label the first, second, and third rankings of the indicators. We plotted bar charts to compare the mAP and precision of each dataset and model, as shown in Figure 9. Additionally, to demonstrate the real-time performance of the model, we plotted a point–line graph showing the relationship between GFLOPs and precision, as shown in Figure 10.
Table 4. Comparative experiments of the three datasets.
Figure 9. Bar charts of mAP and precision for different datasets and models. (a) Bar charts of mAP50 (%) for different datasets and models. (b) Bar charts of mAP50-95 (%) for different datasets and models. (c) Bar charts of precision (%) for different datasets and models.
Figure 10. Point–line chart of GFLOPs and accuracy for models on DOTAv1, SIMD, and DIOR datasets.
The results in Table 4 show that YOLO11-FSDAT achieved improvements of 4.11%, 3.82%, and 0.99% in mAP50, and 1.99%, 2.73%, and 0.72% in mAP50-95, compared to its baseline model YOLOv11n on the DOTAv1, SIMD, and DIOR datasets, respectively. This demonstrates the effectiveness of the proposed improvements. Furthermore, our model outperformed YOLOv5n, YOLOv8n, YOLOv10n, YOLOv12n, and RT-DETR on all three datasets, showcasing its potential and advantages in remote sensing image object detection tasks. As shown in Figure 10, our model not only demonstrated excellent accuracy with a high detection rate but also significantly reduced computational load, effectively lowering the demand for computing resources. Overall, the model achieved a good balance between accuracy and efficiency, showcasing strong practical value and potential for widespread application.

5.2. Ablation Experiments

In response to the issues of uneven sample distribution across categories, missed detection in dense small-target scenes, large-scale differences between object categories, and severe background contamination typical of remote sensing object detection scenes, we proposed three network improvement methods in this paper. To evaluate the effectiveness of each method and their optimal combination, we conducted ablation experiments on the DOTAv1 dataset. Method A introduces the frequency–spatial feature extraction fusion module (Freq-SpaFEFM) into the backbone and neck of the network. Method B integrates the deformable attention global–local fusion module (DAGLF) into the last layer of the backbone. Method C replaces the classification loss with ATFL. The experimental results are presented in Table 5. To improve readability and facilitate comparison of the size of different indicators, we used red, blue, and green to label the first, second, and third rankings of the indicators. At the same time, we plotted bar charts to compare the metrics of different models, as shown in Figure 11. Additionally, to demonstrate the real-time performance of the models, we plotted point–line charts illustrating the relationship between GFLOPs and accuracy, as shown in Figure 12.
Table 5. Results of ablation experiments.
Figure 11. Comparison bar charts of different models’ mAP50 (%), mAP50-95 (%), precision (%), and recall (%).
Figure 12. Scatter and line plot of GFLOPs versus accuracy for different models.
The results indicate that each approach significantly contributed to enhancing model performance, and their combination led to optimal outcomes. The training process of the ablation experiment on the DOTAv1 dataset is presented in Figure 13.
Figure 13. The training process of the ablation experiment on the DOTAv1 dataset.

5.3. Visual Analytics

5.3.1. Visual Analysis of the Freq-SpaFEFM Module

To visually highlight the enhanced feature extraction ability of the Freq-SpaFEFM module, Figure 14 and Figure 15 display the sixth-layer output feature maps for both our model and the baseline model, using a sample image from the DOTAv1 test set. In the baseline, this corresponded to the output of the C3K2 module, while in our model, it corresponded to the Freq-SpaFEFM module.
Figure 14. A sample image from the validation set of the DOTAv1 data and the output feature map of layer 6 of our model.
Figure 15. The output feature map of the sixth layer of the baseline model.
First, from an overall perspective, the feature maps output by each channel of the Freq-SpaFEFM module contained more detailed information about the small target "ship." This was particularly evident in the output feature maps, where more channels highlighted the "ship" at corresponding positions. On a more detailed level, for the small target "ship," some channels in the feature maps output by the Freq-SpaFEFM module had relatively complete small-target features, with minimal noise (as indicated by the red dashed box in Figure 14). In contrast, the feature maps output by the C3K2 module contained incomplete small-target information and more background noise, as shown by the purple dashed box in Figure 15, where certain small-target positions did not appear highlighted. For the larger target "harbor," the output feature map from the C3K2 module had only one channel that contained clear and relatively complete target features (as shown by the green dashed box in Figure 15). However, the output feature map from the Freq-SpaFEFM module had three channels that contained clear and relatively complete target features (as shown by the yellow dashed box in Figure 14). This is thanks to the robustness of the Freq-SpaFEFM module in dealing with background noise and its excellent feature extraction capability in dense small-target scenarios, as well as the stronger feature extraction capability for larger targets due to the larger receptive field. These results demonstrate that our Freq-SpaFEFM module had a stronger feature extraction ability in scenarios with dense small targets and large differences in target scales across multiple categories.

5.3.2. Visual Analysis for the DAGLF Module

To more intuitively analyze the superior performance of the DAGLF module, we selected two images from the DOTAv1 test set to extract the attention matrices from the DAGLF module in our model and the C2PSA module in the baseline model, and obtained the attention heatmaps, as shown in Figure 16 and Figure 17, corresponding to small-target images and large-target images. In the figures, (a) represents the original image, (b) represents the attention heatmap output by the DAGLF module, and (c) represents the attention heatmap output by the C2PSA module.
Figure 16. Comparison chart of attention visualization under small-object detection. (a) Validation sample image 1 of DOTAv1 dataset. (b) DAGLF module output image. (c) C2PSA module outputs images.
Figure 17. Comparison chart of attention visualization under large-object detection. (a) Validation sample image 2 of DOTAv1 dataset. (b) DAGLF module output image. (c) C2PSA module outputs images.
It is clear that, whether for small targets or larger targets, the DAGLF module exhibited superior performance. When dealing with small targets, the DAGLF module evenly distributed attention to each small target, which is shown as deeper red in the figure, while the C2PSA module failed to assign attention to some targets or assigned lower attention, represented by light blue or light pink in the figure. When dealing with large targets, the DAGLF module accurately allocated attention to key positions of the target, such as the nose and engine in Figure 17b, which are marked with deeper red. The high performance was largely due to the deformable attention’s adaptability and the DAGLF module’s capability for streamlined multi-source information integration.
To further verify the impact of the attention matrix on the final detection results, we selected Figure 18a and used the LayerCAM algorithm to compute the class activation map (CAM) for the tenth layer (where the DAGLF and C2PSA modules are located). As shown in Figure 18, in the class activation map of the DAGLF module, each small target was highlighted, while in the class activation map of the C2PSA module, some small targets were not highlighted. This indicates that with the help of the DAGLF module, the network can indeed focus more effectively on the small targets in the image, reducing the risk of missing small targets and improving the network’s detection performance.
Figure 18. Class activation diagram of layer 10 computed by the LayerCAM algorithm: (a) the class activation map of the DAGLF module and (b) the class activation map of the C2PSA module.

6. Conclusions

This study presented an enhanced architecture based on YOLOv11n, named YOLO11-FSDAT, specifically designed to address the unique challenges of object detection in remote sensing imagery. Built upon the lightweight YOLOv11n framework, the model incorporates three innovative modules: the frequency–spatial feature extraction fusion module (Freq-SpaFEFM), the deformable attention-based global–local feature fusion module (DAGLF), and the adaptive threshold focal loss (ATFL). Together, these components form an efficient detection framework tailored for difficult remote sensing scenarios characterized by small objects, dense object distributions, large-scale variations, and complex backgrounds.
The main findings and contributions of this study are summarized as follows:
  • The Freq-SpaFEFM module effectively integrated time–frequency analysis with spatial-domain feature extraction. By adopting a multi-branch architecture that separately processes small and large targets, the module enhanced the model’s ability to detect multi-scale and densely packed objects while suppressing complex background interference.
  • The DAGLF module enabled the organic fusion of local details with global contextual information. Through the incorporation of a deformable attention mechanism, the model can adaptively focus on relevant target regions, significantly improving performance on large-scale and fine-grained object detection tasks.
  • The ATFL loss function dynamically adjusted loss weights so that the model focused more on hard-to-classify objects. This was especially effective in addressing the long-tailed distribution commonly found in remote sensing data, thereby significantly improving the detection accuracy of underrepresented classes and small-sample targets (an illustrative sketch of this weighting idea is given after this list).
  • Experimental results on three public remote sensing datasets (DOTAv1, SIMD, and DIOR) demonstrated that YOLO11-FSDAT outperformed current mainstream methods, including YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, and RT-DETR, in detection accuracy. On the DOTAv1 dataset, for instance, the proposed model achieved an mAP50 of 75.22%, a 4.11% improvement over the baseline YOLOv11n, validating its superior performance and generalization ability in complex remote sensing scenarios. The model also exhibited stronger precision and robustness when detecting small, dense, and scale-varying targets in challenging environments.
  • The model maintained high detection accuracy while also offering fast inference speed and a lightweight architecture, making it well-suited for real-time deployment and practical applications in diverse remote sensing contexts.
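As referenced in the ATFL summary above, the following is a minimal, illustrative sketch of a focal-loss-style weighting with a batch-adaptive threshold between easy and hard samples. It is not the exact ATFL formulation; the mean-based threshold, the focusing exponents, and the binary cross-entropy base loss are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def adaptive_threshold_focal_loss(logits, targets, gamma_hard=2.0, gamma_easy=0.5):
    """Illustrative ATFL-style loss (not the paper's exact formulation).

    Samples whose predicted probability for the true class falls below a
    threshold derived from the batch (here: the batch mean) are treated as
    "hard" and up-weighted with a larger focusing exponent; the remaining
    "easy" samples are down-weighted.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)   # probability of the true class

    threshold = p_t.mean().detach()                   # batch-adaptive easy/hard split
    gamma = torch.where(p_t < threshold,
                        torch.full_like(p_t, gamma_hard),
                        torch.full_like(p_t, gamma_easy))
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy usage with random logits and targets (e.g., 15 DOTA classes).
logits = torch.randn(8, 15)
targets = torch.randint(0, 2, (8, 15)).float()
loss = adaptive_threshold_focal_loss(logits, targets)
```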
In future work, we plan to incorporate model pruning techniques to eliminate redundant parameters and network connections, achieving a more lightweight design and reducing computational overhead without compromising accuracy. At the same time, we aim to further enhance the overall robustness, precision, and adaptability of YOLO11-FSDAT to ensure its effective deployment across various real-world applications, thereby advancing the development of intelligent remote sensing technologies.

Author Contributions

Conceptualization, Z.F., Y.Z., H.C. and A.W.; methodology, Z.F., Y.Z. and A.W.; software, Z.F.; validation, Z.F., Y.Z., A.W. and H.C.; writing—review and editing, Z.F., Y.Z., H.C. and A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Key Research and Development Plan Project of Heilongjiang (No. JD2023SJ19), the Natural Science Foundation of Heilongjiang Province (Nos. LH2023F034 and LH2022E061), the Shenzhen Polytechnic University Research Fund (No. 6025310007K), and the Science and Technology Project of Heilongjiang Provincial Department of Transportation (No. HJK2024B002).

Data Availability Statement

DOTA: https://captain-whu.github.io/DOTA/dataset.html (accessed on 1 June 2025). SIMD: https://github.com/ihians/simd (accessed on 1 June 2025). DIOR: https://aistudio.baidu.com/datasetdetail/53045 (accessed on 1 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.
