Article

An Improved Geospatial Object Detection Framework for Complex Urban and Environmental Remote Sensing Scenes

1 College of Robotics, Beijing Union University, Beijing 100101, China
2 Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
3 Nanjing Vocational College of Finance and Economics, Nanjing 211121, China
4 Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1288; https://doi.org/10.3390/app16031288
Submission received: 28 October 2025 / Revised: 24 December 2025 / Accepted: 23 January 2026 / Published: 27 January 2026
(This article belongs to the Topic Geospatial AI: Systems, Model, Methods, and Applications)

Featured Application

The RS-YOLO framework can be used in urban infrastructure monitoring, land use change detection, and transportation facility management. It offers a powerful GeoAI instrument for sustainable urban planning and environmental governance.

Abstract

The development of Geospatial Artificial Intelligence (GeoAI), which combines deep learning with remote sensing imagery, is of great interest for automated spatial inference and decision-making support. In this paper, an efficient GeoAI-based object detection framework named RS-YOLO is introduced, building on the YOLOv11 architecture. The model integrates Dynamic Convolution for adaptive receptive field adjustment, Selective Kernel Attention for multi-path feature aggregation, and the MPDIoU loss function for geometry-aware localization. Experiments on the TGRS-HRRSD dataset, which covers 13 object categories drawn from diverse geospatial scenarios, show that the proposed approach achieves an mAP of 89.0% and an F1-score of 87. Beyond algorithmic advancement, RS-YOLO provides a GeoAI-based analytical tool for applications such as urban infrastructure monitoring, land use management, and transportation facility recognition, enabling spatially informed and sustainable decision-making in complex remote sensing environments.

1. Introduction

Recently, the explosive development of Geospatial Artificial Intelligence (GeoAI) has revolutionized the way spatial data are studied and understood. Leveraging geospatial data, remote sensing imagery, and AI algorithms, GeoAI has emerged as a tool for uncovering spatial patterns and dynamics in natural and built environments. This paradigm enables data-driven geographic knowledge discovery in applications such as urban planning, environmental monitoring, and disaster management, where massive spatial data can be converted into actionable geographic knowledge. GeoAI represents a significant shift in geospatial analysis from traditional methods to intelligent spatial reasoning. Unlike conventional remote sensing approaches that treat pixels as independent visual units, GeoAI incorporates geographic context, spatial relationships, and domain knowledge into the learning process. This allows models not only to detect objects but also to understand their spatial meaning, for example, recognizing the functional connection between ports and vessels or identifying the topological continuity of road networks. Such spatial awareness is crucial for applications requiring real-time decision-making, including urban infrastructure monitoring, disaster response, and environmental management. Within this framework, object detection plays a fundamental role. It transforms raw image pixels into geographically meaningful entities, enabling further analysis, modeling, and decision-making within Geographic Information Systems (GIS). In doing so, it forms an essential link between remote sensing imagery and actionable geospatial intelligence.
With the development of satellite imaging technology and Unmanned Aerial Vehicle (UAV) platforms, the spatial resolution of remote sensing images has reached the sub-meter level, clearly recording the geometric structure and texture features of ground objects. Such high-resolution images have important application value in military reconnaissance [1], disaster emergency response [2], and port logistics monitoring [3]. Within GeoAI, they provide a rich source of features for intelligent perception and spatial cognition, supporting fine-grained, reliable understanding for automatic analysis and decision-making. Nevertheless, remote sensing object detection remains one of the most challenging problems in information processing. Classical techniques are predominantly based on handcrafted feature extraction and pattern matching. Although they offer a degree of interpretability when combined with geospatial rules or physical models in some situations, their expressive power at the feature level is comparatively limited. As a result, such handcrafted-feature methods struggle to adapt to the multi-scale, densely distributed, and occluded targets in remote sensing scenes, leading to missed detections of small objects and insufficient feature fusion [4]. The rapid progress of deep convolutional neural networks, with end-to-end learning frameworks such as the Single Shot MultiBox Detector (SSD) and the You Only Look Once (YOLO) series, has greatly improved detection accuracy and efficiency and has accelerated the development of AI-based remote sensing object detection within GeoAI [5].
The YOLO series has been the mainstream framework for remote sensing object detection owing to its good trade-off between speed and accuracy. Researchers have focused on optimizing its performance by incorporating adaptive feature fusion, attention mechanisms, and lightweight designs. Xie et al. [6] proposed the use of the Adaptively Spatial Feature Fusion (ASFF) structure in YOLOv4 to optimize multi-scale information extraction. They also improved Spatial Pyramid Pooling (SPP) and the lightweight backbone network to accelerate inference and enhance detection accuracy. Lv et al. [7] proposed Multi-scale Feature Adaptive Fusion (MFAF) based on YOLOv4. By introducing Feature Integration (FI) and Spatial Attention Weight (SAW), combined with Detail Enhancement (DE), Squeeze-and-Excitation (SE), and Cross Stage Partial (CSP), MFAF improved feature representation and adaptive fusion in multi-scale object detection. Liu et al. [8], based on the YOLOv3 architecture, introduced the Focus module for image slicing preprocessing, integrated Spatial Pyramid Pooling Network (SPPNet) and Cross Stage Partial Network (CSPNet), and adopted MobileNetv3 as a lightweight backbone. This enhanced the detection of small traffic signs in complex scenes. Yang et al. [9], building on the YOLOX framework, incorporated the Efficient Channel Attention (ECA) module and the ASFF structure. They also employed Varifocal Loss (VFL) to optimize positive–negative sample balance and combined the Slicing Aided Hyper Inference (SAHI) framework to improve small-object detection in remote sensing imagery.
In addition, many studies have explored model architectures beyond YOLO, including improvements based on Faster Region-based Convolutional Neural Network (Faster R-CNN), the introduction of Transformers, and novel designs such as dynamic convolution. These approaches aim to address the challenges of variable object scales and complex backgrounds in remote sensing imagery. Li et al. [10] enhanced small-object and dense-object detection using Structure-Guided Feature Transform (SGFT) and Hybrid Residual (HR). Huang et al. [11] proposed the DConvTrans-LGA model, which integrates Dynamic Convolution (DConv) with Local Attention (LocalAttn) to strengthen local feature extraction and global context modeling. They further designed the Feature Residual Pyramid Network (FRPN) to optimize multi-scale feature fusion, thereby improving the detection of multi-scale objects, especially small objects, in optical remote sensing images. Yuan et al. [12] proposed the Optimized Low-Coupling Network (OLCN) based on Faster R-CNN. By introducing Low-Coupling Robust Regression (LCRR) and the Receptive Field Optimization Layer (RFOL), OLCN enhanced the robustness of small-object localization and the accuracy of classification. Li et al. [13] introduced a deformable DETR-based remote sensing detection method. By integrating Multi-Scale Split Attention (MSSA), Multi-Scale Deformable Prescreening Attention (MSDPA), and the A-D loss function, this method improved multi-scale feature extraction, decoding efficiency, and the detection and localization of small objects. Yuan et al. [14] proposed the Noise-Adaptive Context-Aware Detector (NACAD) based on Faster R-CNN. The method introduced a Noise-Adaptive Module (NAM) to expand small-object regions and retain more positive samples, combined with a Context-Aware Module (CAM) to enhance contextual feature representation, and a Position-Refined Module (PRM) to filter background noise. These improvements boosted the recall rate and robustness of small-object detection.
Despite the great progress in remote sensing object detection, several problems remain unresolved. High-resolution remote sensing images usually contain various types of scenes, including urban building complexes, port terminals, mountainous forests, and ocean vessels. In such scenes, small objects account for a high proportion of targets and occlusion among targets is severe. This exposes the deficiencies of current detection methods in receptive field generation and multi-scale feature aggregation; as a result, models are easily disturbed by complex backgrounds and struggle to balance detection precision with real-time performance. To address challenges in diverse remote sensing scenarios, we propose RS-YOLO, a systematically designed detection framework. Unlike existing methods that improve modules in isolation, RS-YOLO integrates three complementary mechanisms under a unified GeoAI-oriented design:
  • Scale-adaptive feature extraction: Dynamic Convolution adjusts receptive fields via learnable attention, enabling fine-grained representation for small objects and global context for large ones.
  • Context-aware multi-path attention: Selective Kernel Attention (SKAttention) fuses multi-branch features with channel-wise attention, enhancing relevant spatial-semantic cues while suppressing background noise.
  • Geometry-aware localization loss: Multi-Polar Distance Intersection over Union (MPDIoU) jointly optimizes center distance, scale, aspect ratio, and vertex alignment, reducing orientation-induced localization errors.

2. YOLOv11 Object Detection Network and Improvements

2.1. YOLOv11 Algorithm

YOLOv11 is a new generation of real-time detection models introduced by the Ultralytics team, built upon the YOLO series framework, as shown in Figure 1. It inherits features from YOLOv3, YOLOv5, and YOLOv8, while further integrating CSPNet’s [15] cross-stage partial connection mechanism, FPN [16]/PANet [17] multi-scale feature fusion, and SPPNet spatial pyramid pooling. Structurally, YOLOv11 refines the CSPDarknet53 backbone and introduces the C3k2 module, which achieves more efficient feature extraction by dynamically adjusting kernel combinations. In the Backbone, the Spatial Pyramid Pooling-Fast (SPPF) module is added before the neck to extract multi-scale contextual features, which helps address scale sensitivity in remote sensing images while preserving feature map resolution. The Head adopts a decoupled design, separating classification and regression tasks. For the loss function, YOLOv11 combines Focal Loss with CIoU loss to reduce localization errors. For training, mosaic data augmentation and progressive learning rate scheduling are used. Early training benefits from the image mosaic to enhance local feature sensitivity, while switching to single-image mode later ensures stable convergence. The functional descriptions of the key modules of YOLOv11 are shown in Table 1.
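To make the SPPF idea concrete, the following PyTorch sketch shows how a single 5 × 5 max-pool applied three times in sequence approximates pooling at 5 × 5, 9 × 9, and 13 × 13 scales; the channel widths are illustrative assumptions rather than the exact YOLOv11 configuration.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling-Fast: one 5x5 max-pool reused three times in
    sequence, then concatenation of the original and pooled feature maps."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # Concatenate the original and three pooled maps along the channel axis.
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```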

2.2. RS-YOLO Within the GeoAI Framework

RS-YOLO is designed as a detection framework that integrates computer vision with geospatial information. Within the GeoAI paradigm, object detection is not only a visual recognition task but also a process of extracting spatially meaningful information from remote sensing imagery. The model therefore needs to account for geographic context, scale variation, and spatial distribution patterns. The network structure diagram of RS-YOLO is shown in Figure 2. Gray parts in the diagram correspond to the original YOLOv11 structure, and colored parts are newly integrated modules. Based on this understanding, RS-YOLO follows three core principles:
  • Scale Adaptation: Remote sensing images present significant scale variation challenges, with objects ranging from sub-meter vehicles to kilometer-level ports. To enhance RS-YOLO’s multi-scale adaptation capabilities, a Dynamic Convolution module is innovatively embedded into the feature transmission path from the Backbone to the Neck, enabling better handling of such diverse scale scenarios.
  • Use of Geographical Context: Remote sensing scenes exhibit strong spatial structures, such as vehicles clustering in parking areas and vessels appearing near ports. To leverage these spatial co-occurrence patterns, the SKAttention module is inserted into the feature fusion pathway of the Neck region, enabling the model to capture multi-scale contextual information and improve detection accuracy.
  • Geometry-Aware Localization: Many geospatial entities are direction-sensitive, such as roads, bridges, and runways. RS-YOLO adopts MPDIoU as the regression loss to provide orientation-aware bounding box prediction, which benefits accurate mapping and GIS integration.

2.3. Enhancing Model Representation with Dynamic Convolution

Dynamic Convolution [18] addresses the performance degradation of lightweight CNNs, which is caused by limited depth and width due to computational constraints, as shown in Figure 3. Unlike static convolution, which applies a single weight matrix and bias per layer, Dynamic Convolution assigns K parallel convolution kernels of the same size and dimensions to each layer. These kernels are weighted and aggregated through an input-dependent attention mechanism, resulting in adaptive transformations of input features. This enhances the non-linear representation capacity without increasing network depth, width, or output dimensions.
As shown in Figure 4, the process begins with global average pooling, which captures the global spatial information of the input features. The results are then mapped into a K-dimensional space through two fully connected layers with ReLU activation. A Softmax function then generates normalized attention weights $\pi_k(x)$, ensuring $\sum_{k=1}^{K} \pi_k(x) = 1$. Weighted summation of convolution kernels and biases produces an adaptive operator tailored to the input, enabling non-linear feature composition. At little additional computational cost, small kernels can be aggregated efficiently to significantly boost model performance. To stabilize training, large-temperature Softmax is applied at the early stage for near-uniform attention distribution, later annealed for sharper allocation. Dynamic Convolution can be embedded into any convolutional structure and complements attention modules such as SENet. It is particularly suitable for lightweight detection scenarios that require high accuracy under limited computational resources.
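A minimal PyTorch sketch of this attention-over-kernels mechanism is given below; the kernel count K, reduction ratio, and fixed temperature are illustrative assumptions rather than the exact RS-YOLO settings (in practice the temperature would be annealed towards 1 during training, as described above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Dynamic Convolution: K parallel kernels aggregated by input-dependent
    attention weights pi_k(x) with sum_k pi_k(x) = 1 (softmax over K)."""
    def __init__(self, c_in, c_out, k=3, K=4, reduction=4, temperature=30.0):
        super().__init__()
        self.K, self.c_out, self.k = K, c_out, k
        self.temperature = temperature  # large early in training, annealed later
        # K sets of kernels and biases, all with identical size and dimensions.
        self.weight = nn.Parameter(torch.randn(K, c_out, c_in, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, c_out))
        # GAP -> FC -> ReLU -> FC -> softmax produces the K attention weights.
        self.fc = nn.Sequential(
            nn.Linear(c_in, c_in // reduction), nn.ReLU(inplace=True),
            nn.Linear(c_in // reduction, K))

    def forward(self, x):
        b, c_in, h, w = x.shape
        attn = self.fc(x.mean(dim=(2, 3)))               # (B, K)
        attn = F.softmax(attn / self.temperature, dim=1)
        # Aggregate the K kernels per sample, then run one grouped convolution.
        weight = torch.einsum("bk,koihw->boihw", attn, self.weight)
        weight = weight.reshape(b * self.c_out, c_in, self.k, self.k)
        bias = torch.einsum("bk,ko->bo", attn, self.bias).reshape(-1)
        out = F.conv2d(x.reshape(1, b * c_in, h, w), weight, bias,
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.c_out, h, w)
```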
To address the challenges of blurred small objects, occlusion, and unclear boundaries, we introduce the Dynamic Convolution module into the YOLOv11 framework. This module is suitable for small targets, objects with indistinct boundaries, or partially occluded objects, such as vehicles, airplanes, basketball courts, and tennis courts. These targets often exhibit large shape variations, small scales, or complex backgrounds, which make traditional fixed convolutions ineffective in capturing their features. The core of Dynamic Convolution lies in its dynamic region-aware mechanism: through the adaptive convolution response, the model can recognize object shapes and positions more flexibly and therefore handles small objects and targets in complicated regions better.

2.4. Feature Refinement with Selective Kernel Attention

The core idea of SKAttention, illustrated in Figure 5, is motivated by the way biological visual neurons adaptively modulate their receptive fields. Unlike conventional convolutional neural networks with fixed receptive field sizes, SKAttention introduces multi-scale feature fusion and soft attention, enabling neurons to dynamically adjust their receptive fields according to input characteristics. The mechanism consists of three key steps: Split, Fuse, and Select. In the Split stage, input features are fed through several parallel convolution branches with diverse kernel sizes (e.g., 3 × 3 and 5 × 5) to model multi-scale spatial information. In the Fuse stage, the outputs of these branches are summed element-wise to form a single feature representation, after which Global Average Pooling (GAP) and lightweight fully connected layers produce a compact feature vector. In the Select stage, attention weights are computed for each branch and normalized with a Softmax to obtain channel-wise preferences for the different scales. Finally, weighted fusion is applied to achieve adaptive receptive field adjustment. This mechanism enhances multi-scale object recognition while introducing only a small number of extra parameters and computations.
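The Split/Fuse/Select pipeline can be written as a compact PyTorch sketch; the kernel sizes and reduction ratio below are illustrative assumptions rather than the configuration used in RS-YOLO.

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    """Selective Kernel attention: Split into multi-scale branches, Fuse by
    element-wise summation + GAP, Select via a softmax across branches."""
    def __init__(self, channels, kernel_sizes=(3, 5), reduction=8):
        super().__init__()
        d = max(channels // reduction, 8)
        # Split: parallel convolutions with different receptive fields.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, k, padding=k // 2,
                                    bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for k in kernel_sizes])
        # Fuse: GAP followed by a small fully connected bottleneck.
        self.fc = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        # Select: one linear head per branch, softmax across branches.
        self.heads = nn.ModuleList([nn.Linear(d, channels)
                                    for _ in kernel_sizes])
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=0)  # (M,B,C,H,W)
        fused = feats.sum(dim=0)                          # element-wise summation
        z = self.fc(fused.mean(dim=(2, 3)))               # (B, d) compact vector
        attn = torch.stack([head(z) for head in self.heads], dim=0)  # (M,B,C)
        attn = self.softmax(attn).unsqueeze(-1).unsqueeze(-1)        # (M,B,C,1,1)
        return (feats * attn).sum(dim=0)  # channel-wise weighted fusion
```

For example, `SKAttention(256)(torch.randn(2, 256, 40, 40))` returns a tensor of the same shape, so the module can be dropped into an existing feature pathway.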
In order to handle variations in object scale, shape, and context better than standard CNNs, RS-YOLO incorporates SKAttention into its backbone. This module performs particularly well on three types of targets: (1) objects with significant scale differences, such as vehicles and ships; (2) structurally complex or shape-varying objects, such as airplanes, ports, and storage tanks; and (3) objects located in complex backgrounds or regions with strong contextual variations. In these cases, SKAttention improves feature representation through channel attention and multi-scale adaptation, enhancing the model's generalization across different targets.

2.5. Efficient and Accurate Bounding Box Regression Loss

Bounding box regression is designed to predict object positions in detection or localization tasks. However, when the predicted box and the ground-truth box share the same aspect ratio but differ in width or height, common regression loss functions often fail. This leads to slower convergence and reduced accuracy. To address this issue and further improve the precision of bounding box regression, the MPDIoU [19] is introduced as the optimization metric. MPDIoU builds on traditional IoU, DIoU, and CIoU. By applying a multi-polar constraint, it simultaneously accounts for differences in center distance, scale ratio, aspect ratio, and edge alignment between predicted and ground-truth boxes. This ensures high fitting accuracy across targets of different shapes and spatial distributions.
Specifically, MPDIoU first computes the IoU to measure the overlap between predicted and ground-truth boxes. It then incorporates geometric distances, including the Euclidean distance between box centers, aspect ratio deviations, and edge misalignments, into a unified loss function LMPDIoU. With balanced weighting, the loss function LMPDIoU minimizes the distance between the top-left and bottom-right corners of predicted and ground-truth boxes, as shown in Figure 6.
MPDIoU can also be used to evaluate the similarity of two convex shapes A and B in 2D space. The calculation process is as follows:
  • The input consists of two convex shapes $A, B \subseteq \mathcal{S} \subseteq \mathbb{R}^n$, within an image of width $w$ and height $h$;
  • Define the coordinates of the top-left and bottom-right corners of shapes $A$ and $B$: $(x_1^A, y_1^A)$ and $(x_2^A, y_2^A)$ denote the top-left and bottom-right corners of shape $A$, while $(x_1^B, y_1^B)$ and $(x_2^B, y_2^B)$ denote those of shape $B$;
  • Compute the squared distance between the top-left corners of $A$ and $B$, $d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2$, and the squared distance between the bottom-right corners, $d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2$;
  • Calculate and output the MPDIoU value,
$$\mathrm{MPDIoU} = \frac{|A \cap B|}{|A \cup B|} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}.$$
Based on the definition of MPDIoU, the loss function derived from MPDIoU is shown in Equation (1):
$$L_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU} \qquad (1)$$
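A minimal sketch of Equation (1) for batches of axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ format is given below; the function name and the reduction by mean are assumptions for illustration.

```python
import torch

def mpdiou_loss(pred, target, img_w, img_h, eps=1e-7):
    """L_MPDIoU = 1 - MPDIoU for (N, 4) boxes in (x1, y1, x2, y2) format, where
    MPDIoU = IoU - d1^2/(w^2+h^2) - d2^2/(w^2+h^2) and d1/d2 are the distances
    between the top-left / bottom-right corners of predicted and ground-truth boxes."""
    # Intersection-over-Union term.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Corner-distance penalties normalised by the squared image diagonal.
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    mpdiou = iou - d1 / (img_w ** 2 + img_h ** 2) - d2 / (img_w ** 2 + img_h ** 2)
    return (1.0 - mpdiou).mean()
```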
To enhance localization accuracy for irregular targets, RS-YOLO incorporates MPDIoU to optimize bounding box regression. This loss is suitable for elongated, T-shaped, and structurally complex objects, such as crossroads, T-junctions, bridges, athletic fields, and baseball fields. For these targets, rotation angles vary considerably, and the standard IoU is insensitive to angular changes, often failing when bounding box offsets are significant. MPDIoU emphasizes the maximum potential overlap, which effectively improves the localization accuracy of non-square or rotated objects, thereby enhancing the model's ability to detect structurally complex targets.

3. Experimental Results and Analysis

3.1. Experimental Setup and Data

The experiments were conducted on a workstation equipped with an Intel Xeon Gold 5218 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. The deep learning framework was PyTorch 2.6.0. Each training batch contained 64 images, and the total number of training epochs was 200. The optimizer was SGD (Stochastic Gradient Descent) with an initial learning rate of 0.01 and a momentum of 0.9. During training, both the best model from the validation process and the final model from the last epoch were saved.
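For reference, a training run with these hyperparameters could be reproduced with the Ultralytics training interface roughly as follows; the dataset configuration file name and the input image size are assumptions, since neither is stated above.

```python
from ultralytics import YOLO  # assumes the Ultralytics package is installed

model = YOLO("yolo11n.pt")  # YOLOv11n baseline weights
model.train(
    data="hrrsd.yaml",   # TGRS-HRRSD in YOLO format (hypothetical config name)
    epochs=200,          # total training epochs
    batch=64,            # images per training batch
    imgsz=640,           # input resolution (assumption; not reported above)
    optimizer="SGD",
    lr0=0.01,            # initial learning rate
    momentum=0.9,
)
# The trainer keeps both best.pt (best validation checkpoint) and last.pt (final epoch).
```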
This study employed the TGRS-HRRSD dataset [20], released by the Optical Image Analysis and Learning Center of the Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences. The dataset is designed for object detection in very high-resolution remote sensing imagery (VHR RSI) and covers diverse scenes and a wide range of object categories. It includes 13 typical object classes, as shown in Figure 7 and Table 2. All samples are accurately annotated and stored in PASCAL VOC format, making the dataset well-suited for training and evaluation of deep learning models.
To establish clear, measurable criteria for class grouping, quantifiable attributes are identified for each category in the TGRS-HRRSD dataset.
  • Background complexity is determined by the distinguishability between objects and their typical scenes: high complexity applies to categories situated along coastlines, in waters, or in dense urban or otherwise cluttered environments, where interference elements are dense and object-background boundaries tend to be confused; medium complexity characterizes categories in general urban or suburban areas, which feature regular scene structures, scattered interference, and clearly recognizable object outlines; low complexity pertains to categories in structured industrial zones, where interference remains limited and object-background boundaries stay distinct.
  • Object scale is defined using a dual criterion combining the median pixel size of samples and the upper limit of the typical size, where typical size denotes the minimum and maximum pixel sizes across all samples of a given category: small scale corresponds to a median below 100 px; medium scale to a median between 100 and 300 px with a typical-size upper limit below 800 px; large scale to a median above 300 px, or a median between 100 and 300 px with a typical-size upper limit of at least 800 px.
  • Boundary complexity reflects the geometric characteristics of objects: high complexity is assigned to objects with an aspect ratio greater than 3, a rotation range exceeding 90°, or composite shapes such as T-shaped or cross-shaped configurations; medium complexity to objects with an aspect ratio between 1.5 and 3, a rotation range of 30° to 90°, and moderate structural variation; low complexity to compact, regular shapes with an aspect ratio between 0.8 and 1.5 and a rotation range below 30°.
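Transcribed as simple predicates, the scale and boundary criteria read as follows; this is a sketch, and the function names and the fallback behaviour for attribute combinations the rules do not explicitly cover are assumptions.

```python
def scale_group(median_px: float, typical_max_px: float) -> str:
    """Scale grouping used for Table 2, as a direct transcription of the rules above."""
    if median_px < 100:
        return "Small"
    if median_px <= 300 and typical_max_px < 800:
        return "Medium"
    return "Large"  # median > 300 px, or median 100-300 px with max >= 800 px

def boundary_complexity(aspect_ratio: float, rotation_range_deg: float,
                        composite_shape: bool = False) -> str:
    """Boundary complexity grouping; T-/cross-shaped objects count as composite."""
    if composite_shape or aspect_ratio > 3 or rotation_range_deg > 90:
        return "High"
    if 1.5 <= aspect_ratio <= 3 and 30 <= rotation_range_deg <= 90:
        return "Medium"
    return "Low"  # compact, regular shapes (aspect ratio ~0.8-1.5, rotation < 30°)
```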
To further verify the generalizability of RS-YOLO beyond the TGRS-HRRSD dataset, two widely used public remote sensing detection datasets, DOTA-v2.0 [21] and DIOR [22], were additionally adopted for supplementary experiments. DOTA-v2.0 is a large-scale dataset containing multi-source aerial images with 18 object categories and significant variations in rotation, scale, and density. It is widely regarded as a challenging benchmark for evaluating multi-scale and densely distributed targets. DIOR contains 23,463 images and 20 object categories, covering diverse imaging conditions and complex backgrounds, making it suitable for evaluating the robustness and scene adaptability of detection models.

3.2. Performance Comparison of Different Improvement Modules

To validate the performance of the proposed model, this study conducted systematic experiments on 13 object categories using the TGRS-HRRSD dataset, with mean Average Precision (mAP) as the quantitative evaluation metric. The results, as shown in Table 3, indicate that different modules have distinct effects on different target types, while the combination of all three modules provides complementary gains.
The SKAttention module enhances the feature representation of the network for medium-to-large objects and targets in complex backgrounds through multi-scale convolution branches and channel attention. Experimental results show that this module significantly improves the detection of Vehicle, Ship, Ground track field, Storage tank, and Harbor. Specifically, the mAP of Vehicle increased from 93.1% to 94.5%, and that of Ship increased from 89.1% to 89.8%, Ground track field from 96.7% to 97.5%, Storage tank remained at 98.2%, and Harbor improved from 97.9% to 99.0%. These results indicate that SKAttention has clear advantages in handling objects with large scale variations, rich textures, or complex backgrounds. However, for elongated or small objects such as Bridge and T junction, the mAP slightly decreased, dropping from 90.9% to 88.8% and from 71.6% to 71.0%, respectively. This suggests that the module is limited in capturing local features or localizing rotated targets. For medium-to-large, texture-clear objects such as Airplane and Baseball diamond, the mAP remained at 99.3% and 79.2%, consistent with the baseline. SKAttention does not uniformly improve all small-object categories. Its dynamic receptive-field adjustment benefits targets with strong shape variation (e.g., bridge-like structures) but may weaken dense or texture-sparse categories such as Parking. This selective behavior explains why some small-object classes improve while others slightly drop.
The Dynamic Convolution module generates convolution kernels dynamically to enhance the perception of small objects and local features. Results show that the mAP improved for Vehicle (95.5%), Ground track field (97.8%), T junction (72.5%), Basketball court (69.7%), and Bridge (91.4%), demonstrating clear advantages in detecting small objects and capturing features in diverse regions. However, this improvement is not universal. For large or elongated objects such as Harbor and Tennis court, the gains were limited, and in some cases performance even declined; for instance, the mAP of Harbor decreased from 97.9% to 96.8%. For categories with fine boundaries and low-contrast backgrounds (e.g., Tennis court), the adaptive kernels may overfit local textures, leading to slight performance degradation.
The MPDIoU module improves the localization accuracy of irregular and rotated targets by refining the regression loss function. Experimental results show that the mAP improved for Bridge (91.9%), T junction (72.4%), Baseball diamond (82.9%), and Ground track field (97.0%). These results indicate a significant enhancement in localizing elongated or T-shaped targets. However, for densely distributed small objects or large-scale, complex-background targets such as Parking and Ship, the mAP decreased to 64.0% and 88.2%, respectively, showing that the module is less effective for small-object localization or under complex background conditions.
Although the SKAttention module alone achieves a slightly higher overall mAP than the complete RS-YOLO, it exhibits greater inter-class performance fluctuation. To quantify stability across categories, the standard deviation σ of the Average Precision (AP) over the 13 classes was calculated. The baseline YOLOv11n model shows a σ of 11.2%, YOLOv11n + SKAttention yields σ = 11.8%, while RS-YOLO achieves σ = 10.5%. The lower σ of RS-YOLO indicates more consistent performance across different object categories, with significant improvements observed in key underperforming classes such as T-junctions and parking areas. In contrast, single-module solutions tend to excel only in certain dominant categories. For practical GeoAI applications such as road network analysis and port monitoring, system reliability often depends on the detection capability of the weakest classes. Therefore, RS-YOLO adopts a complementary design: dynamic convolution enhances features of small objects, SKAttention integrates multi-scale contextual information, and MPDIoU optimizes the localization of geometrically sensitive targets. This modular and synergistic mechanism achieves a better trade-off between overall performance and inter-category balance, offering greater practical value compared to approaches that solely pursue peak performance on a single metric.
Figure 8 presents a qualitative comparison between YOLOv11n and RS-YOLO across diverse remote sensing scenarios. The results demonstrate that RS-YOLO significantly reduces missed detections in dense harbor and urban scenes, with noticeably higher confidence scores across detected objects. For challenging categories such as bridges and vehicles, RS-YOLO successfully detects previously missed instances while substantially improving detection confidence. In sports facility scenes, RS-YOLO maintains robust detection performance for large-scale objects while enhancing localization accuracy for smaller targets. Overall, the visual comparison confirms RS-YOLO’s stronger generalization capability and detection stability across complex backgrounds, dense distributions, and multi-scale objects.
The combination of the three modules demonstrates significant complementarity, as shown in Figure 9. Combining all three modules yields better detection of small objects, multi-scale targets, and irregular objects than any single module. For example, the mAP of Vehicle increased to 95.1%, Ship to 93.0%, T junction to 81.0%, and Parking to 71.8%, indicating that the combination effectively balances small-object enhancement, multi-scale feature fusion, and the localization accuracy of irregular targets. For medium-to-large, texture-clear targets such as Airplane and Storage tank, the combined mAP remains on par with the baseline. In general, the combination of the three modules endows the model with stronger capability in complex remote sensing scenes.
Cross-category stability was further examined by grouping classes into hard classes (baseline mAP < 75%) and easy classes (baseline mAP > 90%). RS-YOLO, which attains the lowest per-class AP standard deviation reported above (σ = 10.5%, versus 11.2% for the baseline and 11.8% for SKAttention alone), improves the hard classes by an average of +4.4%, including a substantial +9.4% gain on T junction, which is critical for urban road network monitoring. For the easy classes, RS-YOLO maintains or slightly improves upon the single-module variants. These results demonstrate that RS-YOLO strengthens the weakest classes without sacrificing performance on the easiest ones, which is crucial for real-world multi-object scenarios.
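The stability figures can be checked directly from the per-class values in Table 3; the short script below, using the population standard deviation, is a sketch of that calculation.

```python
import numpy as np

# Per-class AP values (%) taken from Table 3, in the row order of the table.
baseline_ap = np.array([90.9, 99.3, 96.7, 93.1, 67.7, 71.6, 79.2,
                        91.7, 67.0, 89.1, 88.3, 97.9, 98.2])  # YOLOv11n
rs_yolo_ap = np.array([94.4, 99.3, 98.0, 95.1, 71.8, 81.0, 78.8,
                       91.2, 66.6, 93.0, 91.4, 99.2, 98.2])   # RS-YOLO

for name, ap in (("YOLOv11n", baseline_ap), ("RS-YOLO", rs_yolo_ap)):
    print(f"{name}: mean AP = {ap.mean():.1f}%, sigma = {ap.std():.1f}%")
# -> sigma ≈ 11.2% for the baseline and ≈ 10.5% for RS-YOLO, matching the text;
#    the means differ from the reported mAP by at most 0.1% due to rounding in Table 3.
```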

3.3. Comparison of Different Object Detection Algorithms

To verify the effectiveness of the proposed RS-YOLO, we compared it with a range of popular and state-of-the-art object detectors, including Swin Transformer, the YOLO series, and Faster R-CNN. Evaluation was performed using mAP (mean Average Precision) and F1-score, and the results are reported in Table 4. In the overall comparison, RS-YOLO (Ours) achieved the best comprehensive detection performance, with an mAP of 89.0%. This value is not only significantly higher than that of traditional models such as Swin Transformer and Faster R-CNN, but also superior to mainstream YOLO series models. Even when compared with state-of-the-art improved methods, RS-YOLO maintained a clear accuracy advantage.
In terms of F1-score, RS-YOLO reached 87, significantly exceeding Swin Transformer, Faster R-CNN, and the other models. These findings demonstrate the dual merits of RS-YOLO in detection accuracy and robustness, validating the effectiveness of the introduced Dynamic Convolution, Selective Kernel Attention, and MPDIoU modules in enhancing object detection performance under complex remote sensing scenarios.
To further evaluate the cross-dataset generalization capability of the proposed RS-YOLO, this paper conducts additional experiments on two representative remote sensing detection datasets, DOTA-v2.0 and DIOR. RS-YOLO achieves 76.8% mAP on DOTA-v2.0 and 65.5% mAP on DIOR, as shown in Table 5, demonstrating stable and consistent performance across different benchmarks. It is noteworthy that these datasets differ significantly from the primary benchmark TGRS-HRRSD in terms of image resolution, target scale distribution, and category diversity—for example, DOTA contains densely arranged and rotation-sensitive objects, while DIOR includes a larger number of categories and multi-source imaging conditions. Despite these differences, RS-YOLO maintains reliable detection accuracy across varying scenes, indicating that the architectural improvements proposed in this paper offer strong generalization rather than dataset-specific optimization.

3.4. Geospatial Application Perspective

The detection results of RS-YOLO contain spatial information and can therefore be incorporated into GIS platforms for subsequent Geospatial Artificial Intelligence (GeoAI) applications. For instance, accurate identification of transportation infrastructure, ports, and urban facilities enables dynamic mapping of urban growth and land-use change. Moreover, the robustness of the model under complex environmental conditions makes it a useful tool for climate adaptation studies, such as monitoring coastal infrastructure and flood-prone areas. Integrated into GeoAI workflows, RS-YOLO can help spatial decision-makers improve automation and situational awareness in urban management and environmental governance.

4. Conclusions

This study addresses key challenges in high-resolution remote sensing object detection and proposes RS-YOLO, an enhanced framework built upon YOLOv11. By integrating SKAttention, Dynamic Convolution, and MPDIoU, the model improves feature representation, small-object modeling, and localization accuracy. Instead of pursuing peak mAP with a single module, RS-YOLO emphasizes system-level robustness by reducing inter-class variance and improving the hardest categories—an essential property for practical remote sensing applications, where overall reliability is often constrained by the weakest class.
With more stable performance across diverse object types, RS-YOLO advances deep learning–based detection for GeoAI applications in smart cities, environmental monitoring, and disaster management. Future work will further integrate RS-YOLO with geospatial databases and GIS platforms to enable real-time, multi-source spatiotemporal analysis.

Author Contributions

Methodology, Y.Z., A.C. and N.Y.; software, Y.Z., X.L., Y.P., Y.Y., J.H., J.C. and N.Y.; validation, X.L. and Y.Y.; formal analysis, A.C.; investigation, Y.P.; writing—original draft preparation, Y.Z., W.C. and A.C.; writing—review and editing, X.L. and Y.P.; visualization, Y.Z.; supervision, A.C. and H.F.; project administration, H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Plan project “Multidimensional visual information edge intelligent processor chip” (2022YFB2804402).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are from open-source datasets, and their original papers have been cited in the references. Detailed information on the datasets, including links to their publicly archived repositories, is provided within the article to ensure readers can access the data supporting the reported results.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASFF: Adaptively Spatial Feature Fusion
SPP: Spatial Pyramid Pooling
MFAF: Multi-scale Feature Adaptive Fusion
FI: Feature Integration
SAW: Spatial Attention Weight
DE: Detail Enhancement
SE: Squeeze-and-Excitation
CSP: Cross Stage Partial
SPPNet: Spatial Pyramid Pooling Network
CSPNet: Cross Stage Partial Network
ECA: Efficient Channel Attention
VFL: Varifocal Loss
SAHI: Slicing Aided Hyper Inference
SGFT: Structure-Guided Feature Transform
HR: Hybrid Residual
DConv: Dynamic Convolution
LocalAttn: Local Attention
FRPN: Feature Residual Pyramid Network
OLCN: Optimized Low-Coupling Network
LCRR: Low-Coupling Robust Regression
RFOL: Receptive Field Optimization Layer
MSSA: Multi-Scale Split Attention
MSDPA: Multi-Scale Deformable Prescreening Attention
NACAD: Noise-Adaptive Context-Aware Detector
NAM: Noise-Adaptive Module
CAM: Context-Aware Module
PRM: Position-Refined Module
DConvTrans-LGA: Dynamic Convolution Transformer with Local-Global Attention
LGA: Local-Global Attention
SKAttention: Selective Kernel Attention
GAP: Global Average Pooling
SGD: Stochastic Gradient Descent
VHR RSI: Very High Resolution Remote Sensing Imagery
GeoAI: Geospatial Artificial Intelligence
GIS: Geographic Information Systems
UAV: Unmanned Aerial Vehicle
Faster R-CNN: Faster Region-based Convolutional Neural Network

References

  1. Jiang, Q.; Wang, Q.; Miao, S.; Jin, X.; Lee, S.J.; Wozniak, M.; Yao, S. SR_ColorNet: Multi-path attention aggregated and mask enhanced network for the super resolution and colorization of panchromatic image. Expert Syst. Appl. 2025, 276, 127091. [Google Scholar] [CrossRef]
  2. Fang, C.; Fan, X.; Wang, X.; Nava, L.; Zhong, H.; Dong, X.; Qi, J.; Catani, F. A globally distributed dataset of coseismic landslide mapping via multi-source high-resolution remote sensing images. Earth Syst. Sci. Data 2024, 16, 4817–4842. [Google Scholar] [CrossRef]
  3. Hou, M.; Li, Y.; Xie, M.; Wang, S.; Wang, T. Monitoring vessel deadweight tonnage for maritime transportation surveillance using high resolution satellite image. Ocean Coast. Manag. 2023, 239, 106607. [Google Scholar] [CrossRef]
  4. Mo, N.; Zhu, R. A novel transformer-based object detection method with geometric and object co-occurrence prior knowledge for remote sensing images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 18, 2383–2400. [Google Scholar] [CrossRef]
  5. Li, M.; Pi, D.; Qin, S. An efficient single shot detector with weight-based feature fusion for small object detection. Sci. Rep. 2023, 13, 9883. [Google Scholar] [CrossRef] [PubMed]
  6. Xie, T.; Han, W.; Xu, S. Yolo-rs: A more accurate and faster object detection method for remote sensing images. Remote Sens. 2023, 15, 3863. [Google Scholar] [CrossRef]
  7. Lv, H.; Qian, W.; Chen, T.; Yang, H.; Zhou, X. Multiscale feature adaptive fusion for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6511005. [Google Scholar] [CrossRef]
  8. Liu, Y.; Shi, G.; Li, Y.; Zhao, Z. M-YOLO: Traffic sign detection algorithm applicable to complex scenarios. Symmetry 2022, 14, 952. [Google Scholar] [CrossRef]
  9. Yang, L.; Yuan, G.; Zhou, H.; Liu, H.; Chen, J.; Wu, H. RS-YOLOX: A high-precision detector for object detection in satellite remote sensing images. Appl. Sci. 2022, 12, 8707. [Google Scholar] [CrossRef]
  10. Li, J.; Zhang, H.; Song, R.; Xie, W.; Li, Y.; Du, Q. Structure-guided feature transform hybrid residual network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610713. [Google Scholar] [CrossRef]
  11. Huang, Y.; Jiao, D.; Huang, X.; Tang, T.; Gui, G. A hybrid CNN-transformer network for object detection in optical remote sensing images: Integrating local and global feature fusion. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 18, 241–254. [Google Scholar] [CrossRef]
  12. Yuan, Y.; Zhang, Y. OLCN: An optimized low coupling network for small objects detection. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8022005. [Google Scholar] [CrossRef]
  13. Li, M.; Cao, C.; Feng, Z.; Xu, X.; Wu, Z.; Ye, S.; Yong, J. Remote sensing object detection based on strong feature extraction and prescreening network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 8000505. [Google Scholar] [CrossRef]
  14. Yuan, Y.; Zhao, Y.; Ma, D. NACAD: A noise-adaptive context-aware detector for remote sensing small objects. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1001413. [Google Scholar] [CrossRef]
  15. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
  16. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  17. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  18. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  19. Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  20. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  21. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  22. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  25. Zhang, X.; Shen, T.; Xu, D. Object detection in remote sensing images based on improved YOLOv8 algorithm. Laser Optoelectron. Prog. 2024, 61, 1028001. [Google Scholar]
  26. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]
Figure 1. YOLOv11 Network Architecture.
Figure 2. RS-YOLO Network Architecture.
Figure 3. Dynamic Convolution Weighted Aggregation Mechanism.
Figure 4. Lightweight detection process with dynamic convolution and attention fusion.
Figure 5. Workflow of the SK Attention mechanism.
Figure 6. Illustration of MPDIoU computation and bounding box regression.
Figure 7. Different Categories in the TGRS-HRRSD Dataset.
Figure 8. Qualitative comparison of detection results between YOLOv11n and RS-YOLO.
Figure 9. Detection accuracy of RS-YOLO across 13 categories.
Table 1. Functional Description of Key Modules in YOLOv11.
Module | Function
Conv | Basic convolution with BN and SiLU activation
C2f | CSP bottleneck with residual connections
C3k2 | Dynamic kernel feature extraction
SPPF | Multi-scale spatial pooling for context
Upsample | Resolution enhancement for feature fusion
Concat | Feature concatenation along the channel dimension
Table 2. Category Distribution and Key Statistical Characteristics of the TGRS-HRRSD Dataset.
Category | Number of Samples | Typical Size (px) | Rotation Range | Scene Type | Background Complexity | Scale Group | Boundary Complexity
Airplane | 757 | 50–200 | 0–360° | Urban | Medium | Medium | Medium
Baseball diamond | 390 | 100–300 | 0°, 180° | Urban | Medium | Medium | High
Basketball court | 159 | 80–250 | 0°, 180° | Urban | Medium | Medium | Low
Bridge | 124 | 150–500 | 0–90° | River/canyon and urban overpasses | High and medium | Medium | High
Cross road | 5401 | 200–800 | 0–90° | Urban | High | Large | High
Ground track field | 163 | 300–600 | 0–45° | Urban | Medium | Medium | Low
Harbor | 224 | 500–2000 | 0–180° | Coastline | High | Large | Medium
Parking | 5417 | 400–1500 | 0–90° | Urban | High | Large | Low
Ship | 302 | 100–500 | 0–360° | Coastline | High | Medium | Medium
Storage tank | 655 | 100–400 | | Industrial zone | Low | Medium | Low
T junction | 543 | 150–400 | 0–90° | Urban | High | Medium | High
Tennis court | 524 | 100–300 | 0°, 90° | Urban | Medium | Medium | Low
Vehicle | 4961 | 20–100 | 0–360° | Urban | High | Small | Medium
Table 3. Detection accuracy of YOLOv11n, YOLOv11n with a single module, and RS-YOLO.
Category | YOLOv11n | (+) SKAttention | (+) Dynamic | (+) MPDIoU | RS-YOLO (Ours)
mAP | 87.0% | 89.8% | 88.7% | 87.0% | 89.0%
Bridge | 90.9% | 88.8% | 91.4% | 91.9% | 94.4%
Airplane | 99.3% | 99.3% | 99.4% | 99.2% | 99.3%
Ground track field | 96.7% | 97.5% | 97.8% | 97% | 98%
Vehicle | 93.1% | 94.5% | 95.5% | 91% | 95.1%
Parking | 67.7% | 67.0% | 66.7% | 64% | 71.8%
T junction | 71.6% | 71.0% | 72.5% | 72.4% | 81.0%
Baseball diamond | 79.2% | 79.2% | 78.4% | 82.9% | 78.8%
Tennis court | 91.7% | 91.7% | 91.5% | 89.7% | 91.2%
Basketball court | 67% | 65.5% | 69.7% | 63% | 66.6%
Ship | 89.1% | 89.8% | 86.9% | 88.2% | 93.0%
Cross road | 88.3% | 89.1% | 89.5% | 88.9% | 91.4%
Harbor | 97.9% | 99.0% | 96.8% | 96.5% | 99.2%
Storage tank | 98.2% | 98.2% | 97.9% | 97.8% | 98.2%
Table 4. Comparison of the method proposed in this paper with other models on the TGRS-HRRSD dataset.
Model | mAP (%) | F1-Score
Swin Transformer [23] | 82.00 | 63.10
YOLOv5 | 84.40 | 85.00
YOLOv8n | 85.70 | 83.00
YOLOv11n | 87.00 | 85.00
Faster R-CNN [24] | 81.40 | 64.90
Xie's [6] | 88.39 | /
MFAF [7] | 86.90 | /
Liu's [8] | 85.50 | /
SGFTHR [10] | 86.38 | /
DConvTrans-LGA [11] | 82.10 | 63.88
OLCN [12] | 60.00 | /
Li's [13] | 86.30 | /
NACAD [14] | 88.60 | /
RS-YOLO (Ours) | 89.00 | 87.00
Table 5. Comparison of the method proposed in this paper with other models on the DOTA-v2.0 and DIOR datasets.
Model | DOTA-v2.0 (mAP%) | DIOR (mAP%)
YOLOv11n | 74.70 | 63.50
MFAF [7] | / | 53.50
DConvTrans-LGA [11] | / | 61.30
Li's [13] | 75.60 | /
Zhang's [25] | 80.66 | /
Cheng's [26] | 72.10 | /
RS-YOLO (Ours) | 77.40 | 65.50
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhu, Y.; Chen, A.; Li, X.; Pan, Y.; Yuan, Y.; Yang, N.; Chen, W.; Huang, J.; Cai, J.; Fu, H. An Improved Geospatial Object Detection Framework for Complex Urban and Environmental Remote Sensing Scenes. Appl. Sci. 2026, 16, 1288. https://doi.org/10.3390/app16031288

