Article

Scene Heatmap-Guided Adaptive Tiling and Dual-Model Collaboration-Based Object Detection in Ultra-Wide-Area Remote Sensing Images

1 School of Mechanical and Material Engineering, North China University of Technology, Beijing 100144, China
2 Information Center of Ministry of Natural Resources, Beijing 100812, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2158; https://doi.org/10.3390/sym17122158
Submission received: 6 September 2025 / Revised: 24 November 2025 / Accepted: 10 December 2025 / Published: 15 December 2025

Abstract

This work addresses computational inefficiency in object detection for ultra-wide-area remote sensing images (RSIs). Traditional homogeneous tiling strategies enforce computational symmetry by processing all image regions uniformly, ignoring the intrinsic spatial asymmetry of target distribution, where target-dense regions coexist with vast target-sparse areas (e.g., deserts, farmlands), thereby wasting computational resources. To overcome this symmetry mismatch, we propose a heatmap-guided adaptive blocking and dual-model collaboration (HAB-DMC) framework. First, a lightweight EfficientNetV2 classifies initial 1024 × 1024 tiles into semantic scenes (e.g., airports, forests). A target–scene relevance metric converts scene probabilities into a heatmap, identifying high-attention regions (HARs, e.g., airports) and low-attention regions (LARs, e.g., forests). HARs undergo fine-grained tiling (640 × 640 with 20% overlap) to preserve small targets, while LARs use coarse tiling (1024 × 1024) to minimize processing. Crucially, a dual-model strategy deploys (1) a high-precision LSK-RTDETR-base detector (with a Large Selective Kernel backbone) for HARs to capture multi-scale features, and (2) a streamlined LSK-RTDETR-lite detector for LARs to accelerate inference. Experiments show 23.9% faster inference on 30k-pixel images and a 72.8% relative reduction in invalid computations (from 50% to 13.6%) versus traditional methods, while maintaining competitive mAP (74.2%). The key innovation lies in repurposing heatmaps from localization tools into dynamic computation schedulers, enabling system-level efficiency for ultra-wide-area RSIs.

1. Introduction

In recent years, growing Chinese investment in space infrastructure has driven increased utilization of domestic remote sensing satellites. Concurrently, improvements in sensor resolution (e.g., GF-2 satellite images with 14,865 × 8138 pixels) [1,2] have expanded applications of ultra-wide-area RSIs in domains such as land resource management and ecological monitoring. However, the large dimensions and high resolution of these RSIs introduce significant challenges, including excessive computational load and loss of cross-tile contextual information. Moreover, object detection in RSIs must contend with highly complex backgrounds and extreme scale variations [3], underscoring the critical need for more efficient acquisition and interpretation of wide-area remote sensing data.
Although deep learning techniques [4,5,6,7] and large-scale annotated remote sensing datasets [8] have markedly improved object detection accuracy in large-scale images, the efficient interpretation of ultra-wide-area RSIs is fundamentally constrained by GPU memory. As a result, raw images cannot be processed directly by detection networks, and downsampling to fit memory limits often sacrifices critical detail, especially for small objects [9,10]. Consequently, a homogeneous tiling paradigm is predominantly employed—segmenting the image into dense, fixed-size tiles (e.g., 640 × 640 pixels) via sliding windows, applying identical detection models on all tiles, and subsequently aggregating results.
However, target distribution within ultra-wide-area RSIs is typically sparse yet highly clustered within specific scenes (e.g., aircraft concentrated at airports, vehicles prevalent in residential areas), while regions such as forests, farmland, and bare land often lack targets (see Figure 1). Under traditional methods, the entire image is uniformly segmented into fixed-size sub-images, and the same detection model is applied across both high- and low-interest regions. From the perspective of symmetry, traditional sliding-window approaches rely on translational symmetry, applying identical computational operators across the entire image grid regardless of content. However, the information density in ultra-wide-area RSIs is highly asymmetrical: targets are concentrated in specific semantic zones while vast backgrounds remain empty. Applying a symmetrical processing grid to such asymmetrical data leads to extreme redundancy. This practice not only produces many target-free tiles, thereby wasting computational resources, but also potentially increases false positives and reduces detection accuracy.
To mitigate this computational redundancy, previous research has explored coarse-to-fine strategies, such as the superpixel-based filtering method proposed by Lin et al. [11] and the grid-level objectness estimation network introduced by Xie et al. [12]. However, these methodologies are fundamentally constrained by their reliance on low-level visual features and binary classification of target presence, which lack a deep understanding of scene context. In complex remote sensing environments, such approaches are prone to error; intricate background textures can mimic target features leading to false positives, while minute targets in low-contrast regions are frequently discarded. In contrast, we propose that utilizing a scene-level heatmap for semantic guidance represents a superior strategy for ultra-wide-area processing. Distinct from methods that depend solely on visual saliency, our approach leverages scene classification to predict target probability distributions based on semantic priors. Based on this semantic guidance, we define regions with high heatmap values and dense targets as high-attention regions (HARs), and regions with low heatmap values, sparse targets, or dominant background as low-attention regions (LARs). A core challenge is the reliable prediction of each tile’s salience category, which enables prioritization of HARs and thus improves inference speed and detection accuracy.
To address these challenges, we propose a scene heatmap–guided adaptive tiling with dual-model collaborative detection framework. First, a lightweight scene-classification network generates a heatmap that distinguishes target-dense HARs from sparse LARs. Next, HARs are subdivided into fine-grained tiles processed by a high-precision detector, while LARs are covered by coarser tiles processed by a lightweight detector. This demand-driven allocation of computational resources substantially reduces redundant processing and accelerates inference while preserving detection accuracy. We evaluate our method on ultra-wide-area Gaofen-2 images; results demonstrate that, compared to traditional homogeneous tiling, our approach markedly reduces invalid computations, achieves faster detection speeds, and maintains or improves detection accuracy—benefits that become increasingly pronounced as image resolution increases. In summary, this paper presents a novel scene heatmap-guided adaptive tiling and dual-model collaboration (HAB-DMC) framework tailored for computationally efficient object detection in ultra-wide-area RSIs. The contributions of this work are fourfold.
  • A Novel Computation Scheduling Scheme: We repurpose the scene heatmap from a mere localization tool into a dynamic computation scheduler. This semantic-guided approach enables demand-aware resource allocation, fundamentally addressing the inefficiency of homogeneous processing in ultra-wide-area images.
  • An Adaptive Tiling Mechanism: We introduce a heatmap-guided adaptive secondary tiling strategy that dynamically partitions high-attention regions (HARs) into fine-grained tiles for small-object preservation, while processing low-attention regions (LARs) with coarse tiles to minimize redundant computation.
  • A Dual-Model Collaborative Detection Framework: We propose a collaborative inference scheme that deploys a high-precision LSK-RTDETR-base detector for HARs and a streamlined LSK-RTDETR-lite detector for LARs. This ensures accuracy in critical areas while maximizing overall throughput.
  • Comprehensive Empirical Validation: We conduct extensive experiments and a thorough sensitivity analysis on large-scale images, demonstrating that our framework significantly reduces invalid computations and accelerates inference while maintaining competitive detection accuracy.
The rest of the paper is organized as follows. Section 2 surveys related work, critically analyzing the limitations of existing approaches to motivate our contributions. Section 3 provides a detailed description of our novel framework, including the scene heatmap-guided computation scheduler, the adaptive tiling mechanism, and the dual-model collaboration strategy. Section 4 offers extensive experimental validation and ablation studies to demonstrate the effectiveness and efficiency of our method. Finally, Section 5 discusses the limitations of this work and outlines promising future directions, and Section 6 concludes the paper with a summary of our contributions.

2. Related Works

2.1. Tiling Strategies for Large-Scale Remote Sensing Images

Ultra-wide-area RSIs inherently possess extremely high resolution, making direct GPU processing infeasible. While significant progress in object detection for ultra-high-resolution RSIs has been achieved through convolutional neural networks (CNNs) and advanced detection frameworks [13,14,15,16], fundamental computational challenges persist. Early approaches predominantly relied on sliding-window uniform tiling [17,18]. Although it ensures comprehensive spatial coverage, this paradigm incurs substantial computational waste in target-scarce regions such as forests, farmland, and bare land.
To mitigate this, Lin et al. proposed a superpixel-based tiling strategy [11], reducing tile quantity by approximately 28%. However, the requisite superpixel segmentation imposes significant computational overhead, increasing overall inference time by 29.9% and adding data-dependent pre-processing steps that reduce portability across datasets.
Alternatively, Xie et al. introduced an objectness activation network that filters target-free sub-images through grid-level prediction, achieving 30% acceleration [12]. Nevertheless, this method demonstrates suboptimal performance for small-scale objects, frequently misclassifying sub-images containing small targets as target-free regions, consequently inducing missed detections, a failure mode tied to coarse grid granularity and lack of multi-scale semantic fusion.
Critically, existing methodologies predominantly rely on low-level visual features while lacking scene-level semantic understanding, and thus struggle to balance efficiency and accuracy effectively. Consequently, achieving substantial inference acceleration without compromising detection accuracy remains a core challenge in ultra-wide-area RSI processing. To address this limitation, we propose a semantics-guided secondary tiling mechanism driven by scene heatmaps: HARs utilize fine-grained sliding-window tiling and high-accuracy detection models to maintain sensitivity to small-scale targets, whereas LARs apply coarse-grained tiling combined with computationally efficient lightweight detectors for accelerated processing. This approach enhances detection efficiency while effectively preventing small-target omissions, providing a new paradigm for efficient intelligent interpretation of ultra-wide-area RSIs.

2.2. Efficient Object Detection in Ultra–High–Resolution Remote Sensing Images

With the rapid advancement of imaging technologies, the resolution of remotely sensed data continues to increase, from satellite and Unmanned Aerial Vehicle (UAV) platforms to automotive scenes and 4K/8K video, making efficient object detection in such large-scale images a critical challenge. Although the design of lightweight detectors (e.g., YOLO [19,20], RetinaNet [21], RefineDet [22], HSD [23], FCOS [24]) can accelerate inference, applying these models directly to ultra-high-resolution images incurs prohibitive computational costs. Model compression techniques, such as channel pruning on YOLOv5 [25], have been shown to improve Frames Per Second (FPS) by up to 40%, but often at the expense of small-object mAP.
Recent studies have explored coarse-to-fine cascade strategies to further accelerate detection on large images. For instance, the Objectness Activation Network (OAN) employs a lightweight binary branch to predict the presence of objects in each tile, thereby skipping empty regions [12]; ClusDet generates clustered candidate regions via a clustering network before selective detection [26]; DMNet uses a density map to localize regions of interest and restricts detection to those areas [27].
However, these approaches still exhibit significant limitations in inference speed, training complexity, and general applicability. In particular, density-map-guided cropping may improve small-object recall under some conditions, but it struggles with scale variation, requires dense supervision that is costly to produce, and can create fragmented or redundant crops that hurt large-object integrity. Such methods therefore present a trade-off: improved local recall versus higher annotation and computation costs and reduced robustness to scale and dataset shift. Conceptually, our approach differs fundamentally from density-map-based methods like DMNet [27] in several key aspects. While DMNet relies on object density estimation, which requires instance-level annotations and may struggle with scale variations, our scene heatmap approach operates at the semantic level, requiring only scene category labels that are easier to obtain and generalize better across datasets. This semantic foundation provides superior scalability to new geographical areas and object categories, as scene semantics tend to be more transferable than precise object density distributions.
ClusDet exhibits similar shortcomings. OAN’s fixed-grid objectness predictions cannot reliably detect objects at grid boundaries or those spanning multiple cells, and its lack of multi-scale semantic fusion hampers the balance between precision and speed. To address these limitations, some works have shifted toward feature-domain optimization. Bai et al. [28] propose a remote sensing detection framework that integrates wavelet time–frequency analysis with reinforcement-learning-based feature selection. Their method employs a dueling Deep Q-Network (DQN) to identify dominant time–frequency channels and reduce computational complexity, and introduces a discrete wavelet multi-scale attention module to suppress background interference. In practice, such feature-domain optimizations reduce some spatial processing costs but introduce algorithmic complexity.
In addition, for efficient detection on ultra-wide-area RSIs, CoF-Net [29] presents a progressive coarse-to-fine framework: candidate regions are first rapidly filtered at low resolution, then detection is gradually refined, thereby substantially reducing computation while preserving accuracy.
Although conceptually similar to our heatmap-guided adaptive tiling approach in pursuing coarse-to-fine computation, CoF-Net's reliance on multi-resolution image pyramids introduces nontrivial memory and interpolation overheads that can hamper throughput on resource-constrained hardware and complicate end-to-end deployment, whereas Bai et al.'s scheme bypasses spatial-pyramid costs via time–frequency feature reconstruction.
More critically, CoF-Net’s low-resolution filtering lacks semantic understanding, making it prone to discarding regions containing small but semantically important objects.
In contrast, our method leverages scene-semantic priors to drive adaptive tiling, enabling more flexible and demand-aware allocation of computational resources. Concretely, the scene classifier we use requires only coarse tile-level supervision (classification labels) rather than dense pixel-wise annotations, which reduces training complexity and annotation cost and improves scalability and transferability across datasets.
Furthermore, by assigning high-resolution processing only to semantically important HARs (high heat, dense target), our approach preserves cross-tile object integrity while avoiding the memory burdens of maintaining multiple full-resolution pyramids, leading to better practical scalability on large images and constrained hardware.
To overcome these limitations, we propose a scene heatmap–guided adaptive tiling and dual-model collaborative framework tailored for ultra-wide-area RSIs. Unlike methods that rely solely on low-level features or grid-level activations, our approach introduces a global semantic perception mechanism. An EfficientNetV2-based classifier generates a heatmap encoding the target–scene correlation for each coarse tile; the image is then partitioned into HARs (high heat, dense targets) and LARs (low heat, sparse targets) via a dynamic threshold. High-attention regions undergo fine-grained tiling and are processed by a high-precision detector, while low-attention regions use coarse tiling and a lightweight model. This demand-driven allocation markedly reduces redundant computation in empty areas, maintains high accuracy for both small and cross-tile large objects, and achieves superior inference efficiency and robustness across diverse resolutions.

2.3. Evolution of Heatmap-Guided Strategies

Heatmap-guided methods have progressed from simple object localization to holistic system optimization. Chen et al.’s THNet [30] first introduced Gaussian-based heatmap supervision into remote sensing object detection, achieving end-to-end regression with a lightweight encoder–decoder and reducing parameter count by approximately 90% compared to Faster R-CNN. However, THNet employs the heatmap solely as a localization cue and does not address the more fundamental issue of allocating computational resources in ultra-wide-area scenarios.
In contrast, our approach leverages EfficientNetV2 [31] to perform scene classification on coarse tiles and constructs a target–scene correlation heatmap that encodes the semantic importance of each region. This scene heatmap then drives an adaptive tiling strategy, dynamically scheduling computational resources according to regional value. On 14,865 × 8138-pixel ultra-wide-area RSIs, our method filters out substantial amounts of redundant computation, thereby markedly improving overall detection efficiency. This paradigm shift, from using heatmaps merely as localization aids to employing them as system-level computation schedulers, represents a significant advance in end-to-end remote sensing object detection.

2.4. Transformer-Based Hybrid Detection Models

Recent studies have sought to combine convolutional feature extractors with Transformer architectures to exploit both local detail and global context in RSI object detection. Li et al. proposed the TRD (Transformer with Transfer CNN) framework, which fuses a CNN's local representation power with a Transformer's long-range dependency modeling in an end-to-end detection pipeline [32]. Building on this, Zhang et al. introduced RT-DETR by integrating a residual-enhanced CNN backbone with a deformable Transformer detection head, yielding robust query embeddings and dynamic attention maps without requiring region proposal networks, and achieving state-of-the-art mAP on aerial benchmarks [33]. Although these hybrid approaches confirm the feasibility of coupling multi-scale CNN features with end-to-end Transformer decoders, they do not simultaneously address computation redundancy and adaptive receptive-field requirements in ultra-wide-area RSIs.
While pure Transformer architectures such as Swin Transformer [34] and Mask2Former [35] achieve excellent results on general vision benchmarks, we argue that a hybrid CNN-Transformer design is better suited to the object-detection module in ultra-wide-area RSIs. The primary reasons are engineering efficiency at scale and the value of CNN inductive biases when processing thousands of tiles of varying sizes and content.
In our LSK-RTDETR detector, the LSKNet backbone builds multi-scale representations via a large selective-kernel mechanism that adaptively adjusts receptive fields, yielding context modeling comparable to windowed self-attention in Swin, while relying on large-kernel convolutions rather than global attention. These convolutions are typically more hardware-friendly and incur lower latency on common deep-learning accelerators, which is important when the detector is applied to many tiles per image [34,36].
Moreover, LSKNet's inductive biases (translation equivariance and locality) naturally favor learning compact, position-consistent features that are crucial for detecting small, densely packed objects in remote-sensing images. Although Vision Transformers can learn similar properties from data, they often require larger-scale pretraining or careful multi-scale engineering to match CNNs on fine-grained localization tasks [34,35]. Our hybrid approach therefore combines efficient, locality-aware feature extraction with Transformer-based global reasoning where it matters most.
To this end, we propose the enhanced LSK-RTDETR (Large Selective Kernel–RT-DETR) model integrated into a demand-driven adaptive tiling framework. HARs use an LSKNet large-kernel backbone with a deformable Transformer decoder to strengthen multi-scale representations of small and cross-tile large objects, while the LARs employ a lightweight variant to boost throughput. This dual-model collaborative inference strategy not only ensures precise detection in high-value areas but also significantly reduces invalid computation in background regions, achieving a balanced trade-off between detection accuracy and inference efficiency.

3. Methods

Figure 2 illustrates the proposed scene heatmap-guided adaptive tiling with dual-model collaborative detection framework for object detection in ultra-wide-area RSIs. The method comprises three core modules: scene-classification-driven heatmap generation, heatmap-guided adaptive secondary tiling, and an enhanced LSK-RTDETR dual-model detection scheme. Specifically, the original ultra-high-resolution image is first partitioned into 1024 × 1024 pixel tiles, which are classified by a lightweight EfficientNetV2 [31] to obtain per-tile scene probabilities. These probabilities are combined with predefined target–scene correlation scores to compute a heat value for each tile. A threshold is applied to distinguish dense “high-attention” regions from sparse “low-attention” regions, thereby producing a full-image scene heatmap. Next, high-attention regions undergo fine-grained 640 × 640-pixel sliding tiling with 20% overlap to preserve small and dense object details, while low-attention regions use coarse 1024 × 1024-pixel tiling with reduced overlap to minimize computation. During detection, high-attention tiles are processed by the full-capacity LSK-RTDETR detector, leveraging large-kernel selection and multi-scale fusion for high-precision small-object detection, whereas low-attention tiles are handled by a lightweight LSK-RTDETR variant for accelerated inference and reduced redundant computation. This heatmap-driven, demand-aware allocation of computational resources effectively filters out low-value regions and concentrates effort on high-value areas, thereby achieving significant inference acceleration while preserving detection accuracy. Experimental results demonstrate that, compared to conventional uniform tiling, our method offers superior practicality and scalability for object detection in ultra-wide-area RSIs.

3.1. Scene Classification–Driven Heatmap Generation

This module describes how to construct a target–scene correlation heatmap that quantifies the spatial asymmetry of target distribution from coarse scene classification results, and how to use it to guide subsequent tiling and detection. It comprises four main steps: initial tiling and preprocessing, transfer-learning-based scene classification using EfficientNetV2 [37,38], target–scene correlation mapping, and heatmap generation via interpolation.

3.1.1. Initial Tiling and Preprocessing

As illustrated in Figure 3, the original ultra-wide-area RSIs (e.g., 14,865 × 8138 pixels) are too large for direct network inference. Therefore, they are first partitioned into fixed-size tiles (e.g., 1024 × 1024 pixels) to balance tile count and computational cost. Denoting the resulting tiles as $\{B_i\}_{i=1}^{N}$, each tile undergoes standard preprocessing: (1) Size normalization—resizing to the EfficientNetV2 input resolution to ensure scale consistency; (2) Pixel-value normalization—subtracting ImageNet means and dividing by standard deviations. The preprocessed tile $B_i'$ is then ready for classification.
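The coarse partitioning step can be sketched in a few lines of Python. The 1024-pixel tile size and the GF-2 scene dimensions come from the text; clipping edge tiles to the image boundary (rather than padding them) is an implementation assumption of this sketch.

```python
def initial_tiles(width, height, tile=1024):
    """Partition a width x height image plane into fixed-size coarse tiles.

    Edge tiles are clipped to the image boundary (an assumption; a real
    pipeline might instead pad them back to `tile` pixels).
    Returns a list of (x0, y0, x1, y1) boxes in pixel coordinates.
    """
    boxes = []
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            boxes.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return boxes

# A 14,865 x 8,138 GF-2 scene yields a 15 x 8 grid of 120 coarse tiles.
grid = initial_tiles(14865, 8138)
```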

3.1.2. Transfer Learning–Based Scene Classification with EfficientNetV2

To capture high-level semantic context, we adopt EfficientNetV2 as the backbone for scene classification. Compared to conventional CNN backbones, EfficientNetV2 offers: (1) Compound scaling across depth, width, and resolution for parameter and compute efficiency; (2) Fused-MBConv and MBConv modules for faster inference without sacrificing accuracy; and (3) Seamless transfer learning, leveraging ImageNet pretraining to quickly adapt to remote-sensing scene categories with minimal fine-tuning. During training, a labeled scene dataset (see Figure 4) is used to supervise transfer learning on the initial tiles. At inference, each preprocessed tile $B_i'$ is passed through EfficientNetV2 to extract globally pooled features, followed by a fully connected layer that outputs a probability vector
$$\mathbf{p}_i = [\,p_{i,1}, p_{i,2}, \ldots, p_{i,C}\,],$$
where $C$ denotes the number of scene classes and $p_{i,j}$ is the probability that tile $B_i'$ belongs to class $j$. This probability vector $\mathbf{p}_i$ serves as the basis for computing the target–scene correlation heat value in Section 3.1.3.
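The fully connected layer's raw outputs are mapped to the per-tile probability vector with a softmax; the paper does not name the activation, so treating it as a standard softmax is an assumption of this stdlib-only sketch, and the logits are hypothetical.

```python
import math

def softmax(logits):
    """Map raw classifier logits to a probability vector summing to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for C = 4 scenes: airport, residential, forest, bare land.
p_i = softmax([4.0, 1.0, 0.5, 0.2])
```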

3.1.3. Target–Scene Correlation Mapping

The scene classification output by itself only indicates the semantic category of each tile (e.g., residential area, forest, airport, bare land), whereas our ultimate objective is to estimate the likelihood of object presence within each tile. To this end, we define a prior correlation for each scene–object pair. Let the set of scene classes be $\{S_1, S_2, \ldots, S_C\}$, and let there be $K$ object categories of interest (e.g., “airplane,” “vehicle,” etc.). For each scene $S_j$ and object class $k$, we assign a prior correlation score
$$a_j^{(k)} \in [0, 1],$$
where $a_j^{(k)}$ reflects the a priori probability that an object of class $k$ appears in a region labeled as scene $S_j$. These scores can be obtained either by statistical analysis of the training dataset or by expert annotation.
For example, if scene $S_j$ corresponds to “airport” and object class $k$ is “airplane,” one may set $a_j^{(k)} = 1.0$; if scene $S_j$ is “forest” and $k$ is “vehicle,” one may set $a_j^{(k)} = 0.0$. All other scene–object combinations can be assigned a correlation score in $[0, 1]$ based on statistical analysis or expert knowledge.
First, initial estimates for each $a_j^{(k)}$ were derived by calculating the co-occurrence frequency between scene categories and object classes within the annotated training dataset. Specifically, for a given scene $S_j$ and object $k$, we computed the proportion of image tiles labeled as scene $S_j$ that contained at least one instance of object $k$. This provided a data-driven foundation, ensuring the scores reflect actual patterns present in the remote sensing images. These initial frequency-based values were then reviewed and calibrated using domain expertise. This refinement step was crucial for handling edge cases, such as semantically implausible object–scene pairs that may sporadically appear in the data due to annotation noise or rare occurrences. For example, while a vehicle might occasionally appear in a forest, domain knowledge justifies assigning a consistently low correlation score. Conversely, for strong semantic relationships, such as aircraft in airports, scores were ensured to be high and stable. This hybrid methodology guarantees that the final correlation scores $a_j^{(k)}$ are both empirically grounded and semantically robust, thereby enhancing the reliability of the subsequent heatmap generation. The complete set of refined correlation scores for all scene–object pairs is presented in Figure 5, which provides a detailed visualization of the correlation matrix used in our framework.
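The frequency-based initialization described above (the fraction of tiles of scene $S_j$ containing at least one instance of object $k$) can be sketched as follows; the per-tile annotation layout is a hypothetical simplification of the real training set, and the subsequent expert calibration step is not modeled.

```python
from collections import defaultdict

def cooccurrence_scores(tiles):
    """Estimate a_j^(k) as the fraction of tiles labeled scene j that
    contain at least one instance of object class k.

    `tiles` is a list of (scene_label, set_of_object_classes_present).
    Returns {scene: {object: score}} with scores in [0, 1].
    """
    tile_counts = defaultdict(int)
    hit_counts = defaultdict(lambda: defaultdict(int))
    for scene, objects in tiles:
        tile_counts[scene] += 1
        for obj in objects:
            hit_counts[scene][obj] += 1
    return {s: {o: hit_counts[s][o] / tile_counts[s] for o in hit_counts[s]}
            for s in tile_counts}

# Toy annotations: airplanes in every airport tile, vehicles in half the tiles.
data = [("airport", {"airplane"}), ("airport", {"airplane", "vehicle"}),
        ("forest", set()), ("forest", {"vehicle"})]
scores = cooccurrence_scores(data)
```

In practice these raw frequencies would then be reviewed and calibrated by domain experts, as the text describes.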
After defining each $a_j^{(k)}$, we reduce the per-scene, per-object vector to a single scalar by taking the maximum across all $K$ object categories:
$$a_j = \max_{1 \le k \le K} a_j^{(k)}, \qquad j = 1, 2, \ldots, C.$$
For example, if $S_j$ is “airport” and its object-category correlation vector is $[1.0, 0.2, \ldots]$, then $a_j = 1.0$.
Finally, given a tile’s scene-probability vector $\mathbf{p}_i = [p_{i,1}, \ldots, p_{i,C}]$, its heat value $h_i$ is computed as
$$h_i = \sum_{j=1}^{C} p_{i,j} \cdot \max_{1 \le k \le K} a_j^{(k)}, \qquad h_i \in [0, 1].$$
Because the scene probabilities sum to one and each $a_j \in [0, 1]$, $h_i$ is a convex combination of the $a_j$ and therefore lies in $[0, 1]$. This formulation integrates the classifier’s soft assignment to each scene with the corresponding prior likelihood of object presence. Thus, a tile that the model assigns a high probability of being “airport” (and hence a high $a_j$) will yield a larger $h_i$, indicating a greater chance of containing targets; conversely, tiles resembling “forest” or “bare land” will produce $h_i$ values near zero, signifying minimal expected object content.
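The two reductions above (a maximum over object classes, then a probability-weighted sum over scenes) amount to a dot product between $\mathbf{p}_i$ and the collapsed scores $a_j$; a minimal sketch with hypothetical two-scene, two-object numbers:

```python
def heat_value(p, corr):
    """Heat h_i = sum_j p[j] * max_k corr[j][k].

    `p` is the tile's scene-probability vector; `corr[j]` lists the
    K object-correlation scores a_j^(k) for scene j.
    """
    a = [max(scene_scores) for scene_scores in corr]   # a_j = max_k a_j^(k)
    return sum(p_j * a_j for p_j, a_j in zip(p, a))

# Two scenes (airport, forest), two objects (airplane, vehicle).
corr = [[1.0, 0.2],    # airport: airplanes certain, some vehicles
        [0.0, 0.1]]    # forest: no airplanes, vehicles rare
h_airport = heat_value([0.9, 0.1], corr)   # mostly-airport tile: high heat
h_forest = heat_value([0.05, 0.95], corr)  # mostly-forest tile: near-zero heat
```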

3.1.4. Heatmap Generation

After computing the discrete heat values $h_i$ for all initial tiles, a full-resolution scene heatmap $H(x, y)$ is constructed that matches the dimensions of the original ultra-wide-area RSI. First, each tile’s heat $h_i$ is assigned to its corresponding location in a low-resolution grid $\tilde{H}$. Then, $\tilde{H}$ is upsampled to the full image size $W \times H$ using bicubic interpolation, which ensures smooth transitions and avoids oscillations between adjacent tiles. The resulting heatmap $H(x, y)$ is normalized to $[0, 1]$ and can be visualized with a pseudocolor palette (red for high values, blue for low values) to highlight areas of varying target–scene correlation. Figure 6 presents a typical example of $H(x, y)$, where high-heat regions (near 1.0) coincide with target-dense areas such as airports, residential zones, and roads, while low-heat regions (near 0.0) correspond to background areas like forests, grasslands, and bare land.
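The grid-to-full-resolution upsampling can be sketched in stdlib Python. Bilinear interpolation stands in here for the paper's bicubic kernel (an intentional simplification; both produce smooth transitions between adjacent tiles), and the grid is assumed to have at least two rows and columns.

```python
def upsample_bilinear(grid, out_h, out_w):
    """Bilinearly upsample a small 2D heat grid (list of lists, values in
    [0, 1]) to out_h x out_w. Assumes grid and output have >= 2 rows/cols."""
    gh, gw = len(grid), len(grid[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        fy = y * (gh - 1) / (out_h - 1)        # fractional source row
        y0 = min(int(fy), gh - 2)
        ty = fy - y0
        for x in range(out_w):
            fx = x * (gw - 1) / (out_w - 1)    # fractional source column
            x0 = min(int(fx), gw - 2)
            tx = fx - x0
            out[y][x] = ((1 - ty) * (1 - tx) * grid[y0][x0]
                         + (1 - ty) * tx * grid[y0][x0 + 1]
                         + ty * (1 - tx) * grid[y0 + 1][x0]
                         + ty * tx * grid[y0 + 1][x0 + 1])
    return out

# One hot tile in a 2 x 2 grid, upsampled to 5 x 5: heat decays smoothly.
H = upsample_bilinear([[1.0, 0.0], [0.0, 0.0]], 5, 5)
```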

3.2. Heatmap-Guided Adaptive Secondary Tiling

Leveraging the generated scene heatmap, a secondary tiling of the original image is performed as follows. First, the entire image is partitioned into uniform coarse tiles (e.g., 1024 × 1024 pixels). For each coarse tile, its corresponding heatmap response is consulted to determine whether further subdivision is warranted.
As depicted in Figure 7, if a coarse tile’s heat value exceeds a predefined threshold, it is subdivided into multiple smaller tiles (optionally with overlap) to enable fine-grained detection; otherwise, the tile is retained at its original scale or subjected to minimal processing. By applying fine-grained tiling in target-dense regions, coverage of small objects is enhanced and detection precision is improved; in target-sparse regions, coarse tiles are maintained to reduce computational load. This adaptive tiling strategy dynamically adjusts detection granularity according to the spatial distribution of objects, effectively eliminating the redundancy inherent in conventional fixed-window approaches. Moreover, tiles with exceptionally low heat values may be skipped entirely or assigned a lower detection priority to further accelerate inference. It should be noted that both the heat threshold and subdivision parameters can be tuned to achieve the desired trade-off between accuracy and processing speed for specific application scenarios.
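The per-tile decision above can be sketched as a small scheduler. The 640-pixel fine tile and 20% overlap come from the text; the fixed 0.5 threshold is illustrative only (the text describes a tunable threshold), and the skip path for exceptionally low-heat tiles is omitted for brevity.

```python
def schedule_tile(box, heat, threshold=0.5, fine=640, overlap=0.2):
    """Route one coarse tile based on its heat value.

    Heat below the threshold: the tile stays coarse (LAR) and goes to
    the lightweight detector. Otherwise it is re-cut into overlapping
    fine tiles (HAR) for the high-precision detector.
    Returns (detector_name, list_of_tile_boxes).
    """
    x0, y0, x1, y1 = box
    if heat < threshold:
        return "lite", [box]                   # LAR: single coarse tile
    stride = int(fine * (1 - overlap))         # 20% overlap -> 512-px stride
    subs = []
    for yy in range(y0, y1, stride):
        for xx in range(x0, x1, stride):
            subs.append((xx, yy, min(xx + fine, x1), min(yy + fine, y1)))
    return "base", subs                        # HAR: fine tiles, full model
```

For a 1024 × 1024 tile, a heat value above the threshold produces four overlapping fine tiles routed to the base detector, while a cold tile passes through unchanged to the lite detector.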

3.3. LSK-RTDETR Dual-Model Collaborative Detection

In this section, LSK-RTDETR, an enhanced detection network based on RT-DETR [33], is presented and the dual-model collaboration mechanism is described to show how efficient inference is achieved across different attention regions. Specifically, two variants of LSK-RTDETR are deployed: a full-capacity model for HARs and a lightweight model for LARs, thereby balancing detection accuracy and inference speed. The backbone integrates Large Selective Kernel (LSK) modules to dynamically adjust receptive fields for multi-scale object representation, while Adaptive Intra-scale Feature Interaction (AIFI) and Cross-scale Contextual Fusion Module (CCFM) strengthen feature expressiveness. Furthermore, the Transformer detection head produces bounding-box predictions end-to-end, eliminating the need for post-processing via non-maximum suppression—thus further reducing computational overhead and accelerating detection.

3.3.1. LSK-RTDETR Model Architecture

Once tiling is complete, each tile is forwarded to the detection network. Our method employs a dual-model collaboration strategy (Figure 7), where tiles are processed by differently scaled detectors based on their heat values. Specifically, we extend the RT-DETR object detection framework to create LSK-RTDETR, incorporating the LSKNet backbone in the encoder to enhance multi-scale feature extraction [33,36] (Figure 8).
Conventional CNN backbones often struggle to capture objects with large scale variations in ultra-wide-area RSIs. To address this, LSK-RTDETR adopts LSKNet, which integrates LSK convolutions. At each stage, the LSK module applies multiple large-kernel convolutions in parallel and employs a channel-wise attention mechanism to adaptively weight their outputs. Concretely, the LSK block implements two complementary spatial branches to realize multi-scale receptive fields: the first branch is a depthwise 5 × 5 convolution that focuses on fine local detail and small-object cues, and the second branch is a depthwise 7 × 7 convolution used together with a dilation rate of three, which preserves the parameter cost of a 7 × 7 kernel while substantially enlarging the effective receptive field to approximately 19 × 19 for broader contextual modeling and medium-to-large object capture. Outputs from both branches are reduced in channel dimension by 1 × 1 projections and then concatenated; the concatenated feature is aggregated by average pooling and max pooling, followed by a 7 × 7 squeeze convolution and sigmoid gating to produce spatially varying weights that adaptively re-scale each branch’s contribution at every spatial location. All spatial convolutions are implemented as depthwise operations with grouping equal to the number of channels to minimize parameter count and computational cost. This combination of multi-scale branches, lightweight projections, and adaptive gating allows the network to dynamically adjust its receptive field according to input features, effectively capturing both small and large objects while preserving global context, and thus significantly improves multi-scale feature representation without substantially increasing parameter overhead.
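The branch structure described above can be sketched in PyTorch as follows. This is an illustrative approximation; the exact projection widths and gating layout of the authors’ LSK module may differ:

```python
import torch
import torch.nn as nn

class LSKBlock(nn.Module):
    """Sketch of a Large Selective Kernel block: a depthwise 5x5 branch,
    a depthwise 7x7 branch with dilation 3 (effective receptive field
    ~19x19), 1x1 projections, pooling-based spatial attention with a
    7x7 squeeze conv, and sigmoid gating over the two branches."""
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.context = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.proj1 = nn.Conv2d(dim, dim // 2, 1)
        self.proj2 = nn.Conv2d(dim, dim // 2, 1)
        self.squeeze = nn.Conv2d(2, 2, 7, padding=3)  # one weight map per branch
        self.out = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x):
        b1 = self.proj1(self.local(x))      # fine local detail, small objects
        b2 = self.proj2(self.context(x))    # broad context, larger objects
        cat = torch.cat([b1, b2], dim=1)
        # Aggregate by channel-wise average and max pooling, then produce
        # spatially varying branch weights via 7x7 conv + sigmoid gating.
        attn = torch.cat([cat.mean(1, keepdim=True),
                          cat.max(1, keepdim=True).values], dim=1)
        w = torch.sigmoid(self.squeeze(attn))
        fused = b1 * w[:, 0:1] + b2 * w[:, 1:2]
        return self.out(fused)
```

With dilation 3, the 7 × 7 kernel spans 3 × (7 − 1) + 1 = 19 pixels per axis, matching the ~19 × 19 effective receptive field stated in the text.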
In the detection head, LSK-RTDETR retains RT-DETR’s Transformer decoder architecture but augments each decoder layer with an auxiliary prediction branch to enable coarse bounding-box predictions at earlier stages. By directly regressing box coordinates and class probabilities, the model obviates the need for a separate non-maximum suppression step, reducing post-processing overhead. Furthermore, the integration of the Cross-scale Contextual Fusion Module leverages multi-scale contextual information to refine box regression, enhancing both accuracy and stability.

3.3.2. Dual-Model Collaboration Mechanism

The full-capacity LSK-RTDETR-base model (heavy variant) is dedicated to detecting tiles with high heat values. It employs the LSKNet backbone coupled with the RT-DETR detection head, offering strong feature representation and precise detection capabilities essential for high-value regions. The RT-DETR framework inherently manages multi-scale features and allows inference speed adjustments via decoder depth.
Conversely, the lightweight LSK-RTDETR-l model (lite variant) targets tiles with lower heat values. Structurally more compact than its full-capacity counterpart, it proportionally reduces the backbone depth while retaining the dynamic Large Selective Kernel mechanism, thereby maximizing inference throughput with minimal accuracy degradation. Deploying the lite model in low-heat regions significantly increases overall processing speed without compromising detection performance.
Both variants operate in parallel, each loading its own weights and executing inference independently. Upon completion, detections from both models are mapped back to the original image coordinates and fused into the final output via a merging strategy. Figure 9 illustrates example detection results.
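The paper does not specify the merging strategy. One common choice, shown here purely as an assumption, is to map tile-local boxes into global image coordinates and suppress cross-tile duplicates by class-wise IoU:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_detections(per_tile, iou_thr=0.5):
    """per_tile: list of (tile_x, tile_y, boxes), where each box is
    (x1, y1, x2, y2, score, cls) in tile-local coordinates.
    Returns deduplicated detections in global image coordinates."""
    globl = []
    for tx, ty, boxes in per_tile:
        for x1, y1, x2, y2, s, c in boxes:
            globl.append((x1 + tx, y1 + ty, x2 + tx, y2 + ty, s, c))
    globl.sort(key=lambda d: -d[4])   # keep highest-scoring duplicate
    kept = []
    for d in globl:
        if all(d[5] != k[5] or iou(d[:4], k[:4]) < iou_thr for k in kept):
            kept.append(d)
    return kept
```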
The dual-model collaborative detection framework implements a conditional computing paradigm: different regions are processed with tailored computational budgets, thereby substantially improving overall efficiency.

4. Experiments

4.1. Implementation Details

All experiments were conducted on a single workstation equipped with Ubuntu 22.04, Python 3.9.18, PyTorch 1.13.1, CUDA 11.7, and NVIDIA RTX 4090 GPUs. Random seeds were fixed across all experiments to ensure reproducibility.
Scene Heatmap Generation. The scene-heatmap module is based on EfficientNetV2 [37,38] and initialized with ImageNet-pretrained weights. Initial 1024 × 1024 tiles are resized to 224 × 224 and normalized following the ImageNet protocol. EfficientNetV2 is fine-tuned using stochastic gradient descent (SGD) for 80 epochs with an initial learning rate of 1 × 10−3, momentum of 0.9, and weight decay of 1 × 10−4. Simple data augmentations (random horizontal flips and color jitter) are applied; adversarial training is not used. After training, each tile’s scene-probability vector is mapped to a prior correlation vector (Section 3.1.3) via a max-aggregation rule to produce the tile heat value. A heat threshold of 0.5 is then applied to separate HARs and LARs.
Object Detector Training. LSK-RTDETR is constructed by replacing the RT-DETR [33] ResNet backbone with LSKNet while preserving the original Transformer decoder and loss configuration. The full-capacity variant (LSK-RTDETR-base) and the lightweight variant (LSK-RTDETR-l) are trained in parallel on two RTX 4090 GPUs for 100 epochs each, with a batch size of 2 per GPU. Training uses 640 × 640 inputs, random horizontal flips, and multi-scale resizing. SGD is employed with an initial learning rate of 5 × 10−4, momentum 0.9, and weight decay 1 × 10−4; the learning rate decays by a factor of 0.1 at epochs 30 and 40. Both variants share the same hyperparameters and are applied uniformly to tiles from both attention regions. During inference, tiles are kept at 640 × 640 with no additional preprocessing; all detections are mapped back to the original image coordinates to produce the final output. We followed the official train/validation/test split of the DIOR dataset. Detection accuracy is reported as mean Average Precision at IoU = 0.50 (mAP@50).
Inference Settings. For ultra-high-resolution test images of approximately 15,000 × 8000 pixels, initial tiling is performed with 1024 × 1024 non-overlapping tiles. HARs use 640 × 640 tiles with 20% overlap, while LARs use 1024 × 1024 non-overlapping tiles. The heat threshold is set to 0.5. All comparative evaluations were executed under identical hardware and software conditions to ensure fairness.

4.2. Datasets

DIOR: a large-scale benchmark dataset for optical RSI object detection [39], containing 23,463 images and 192,472 annotated instances across 20 categories: Airplane, Airport, BaseballField, BasketballCourt, Bridge, Chimney, Dam, ExpresswayServiceArea, ExpresswayTollStation, Port, GolfCourse, GroundTrackField, Overpass, Ship, Stadium, StorageTank, TennisCourt, TrainStation, Vehicle, Windmill. All images are standardized to 800 × 800 pixels, with ground sampling distances ranging from 0.5 m to 30 m, covering multi-source scenarios from UAV to satellite images. DIOR provides a diverse evaluation platform for remote sensing detection algorithms; per-class statistics are presented in Figure 10.
Remote Sensing Scene Classification Dataset: a labeled dataset comprising 15 scene categories: Airport, BareLand, Bridge, DenseResidential, Desert, Farmland, Forest, Industrial, Meadow, SparseResidential, Park, Road, Parking, Port, and Tailing. These image samples are used to train the lightweight EfficientNetV2 classifier to recognize the scene type of each initial tile, thereby providing the semantic basis for generating the target–scene correlation heatmap. Representative examples for each category are shown in Figure 4.

4.3. Evaluation Metrics

In this study, we employ four complementary metrics—Precision, mean Average Precision (mAP), Frames Per Second (FPS), and Invalid Compute Ratio—to comprehensively assess both individual modules and overall system performance.
Precision: Measures the accuracy of the scene classification network (EfficientNetV2) in distinguishing high-attention from low-attention tiles, i.e., its ability to correctly identify tiles likely to contain targets. Precision is defined as:
Precision = TP / (TP + FP),
where TP (true positives) denotes the number of tiles correctly classified as strong-attention (containing at least one object), and FP (false positives) denotes the number of background tiles incorrectly classified as strong-attention.
Mean Average Precision (mAP): mAP evaluates the overall accuracy of the detection network (LSK-RTDETR) on the tiled inputs. For each object class c, we compute the area under its precision–recall curve, denoted AP_c, and then average over all C classes:
mAP = (1/C) Σ_{c=1}^{C} AP_c.
Each class-specific AP_c is computed by integrating (or approximating via discretization) the precision values across different recall levels. mAP thus provides a comprehensive measure of LSK-RTDETR’s detection performance over varying object scales and densities—higher mAP indicates more accurate localization and classification across all target categories.
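The discretized AP computation described above can be written as a short sketch (function names are illustrative):

```python
def average_precision(recalls, precisions):
    """Approximate the area under a precision-recall curve by discrete
    summation: AP = sum_i (r_i - r_{i-1}) * p_i, with recalls sorted
    in increasing order."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(ap_per_class):
    """Average the per-class APs over all C classes."""
    return sum(ap_per_class) / len(ap_per_class)
```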
Inference Speed (FPS): Frames Per Second (FPS) denotes the average number of tile inferences the entire framework can process per second on the target hardware. To measure FPS, we execute the complete pipeline (heatmap generation, adaptive tiling, and dual-model detection) on all test tiles, record the total processing time, and divide the total number of tiles by this duration. A higher FPS indicates superior real-time performance and practical applicability on large-scale, ultra-high-resolution RSIs.
Invalid Compute Ratio: This metric quantifies the proportion of tiles processed by the detection network that do not contain any ground-truth objects. A lower Invalid Compute Ratio indicates more effective filtering of empty or background regions, thereby reducing wasted computational effort.
Invalid Compute Ratio = (N_total − N_valid) / N_total = 1 − N_valid / N_total,
Here, N_total is the total number of tiles submitted to the detector, and N_valid is the number of tiles containing at least one ground-truth object. The Invalid Compute Ratio thus reflects the effectiveness of the tiling and heatmap-guidance modules—lower values indicate stronger filtering of empty or background tiles, reducing unnecessary detection computation and improving overall inference efficiency.
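Computed directly from per-tile ground-truth counts, the metric reads (a trivial sketch with an illustrative function name):

```python
def invalid_compute_ratio(gt_counts_per_tile):
    """gt_counts_per_tile: number of ground-truth objects in each tile
    submitted to the detector. Returns 1 - N_valid / N_total."""
    n_total = len(gt_counts_per_tile)
    n_valid = sum(1 for n in gt_counts_per_tile if n > 0)
    return 1 - n_valid / n_total
```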

4.4. Quantitative Results and Comparative Analysis

Figure 4 and Table 1 present the per-class scene classification accuracies on our remote sensing scene dataset using VGG16 [40], ResNet50 [41], MobileNetV2 [42], EfficientNetV2 [31], and a transfer-learned EfficientNetV2. As shown in Table 2, the fine-tuned EfficientNetV2 consistently attains the highest average classification accuracy across all scene categories, substantially outperforming the four classical architectures. These results demonstrate that transfer-learning-enhanced EfficientNetV2 offers a clear advantage for remote sensing scene categorization, thereby ensuring high semantic fidelity for subsequent tiling and heatmap generation.
Table 3 summarizes the mean Average Precision (mAP) achieved by a range of detectors on the DIOR dataset’s 20 object categories, including Faster R-CNN [16], Faster R-CNN With FPN [43], YOLOv3 [44], YOLOv4 [45], YOLOv5, YOLOX [46], the multi-scale feature fusion with attention model by Yao et al. [47], SSD [48], ASSD [49], MSFC [50], AFADet [51], FSoD-Net [52], RSADet [53], RT-DETR [33], and our proposed LSK-RTDETR variants-lightweight (LSK-RTDETR-l) and full-capacity (LSK-RTDETR-base). Notably, Yao et al.’s multi-scale feature fusion and attention mechanism model achieves 73.2% mAP, while RT-DETR registers 72.6% mAP. Our lightweight LSK-RTDETR-l improves this to 74.0%, and the full-capacity LSK-RTDETR-base further climbs to 77.5%. It is observed that while our method secures the highest overall mAP, it does not strictly outperform competing methods in every individual category; certain specialized architectures may exhibit marginal advantages on specific targets due to their distinct inductive biases regarding local geometric features. However, thanks to the coupling of large-kernel multi-scale feature extraction with end-to-end Transformer regression, our model excels in establishing robust global context modeling. Consequently, it secures decisive gains across the majority of challenging scenarios, thereby driving the overall state-of-the-art performance.
These results indicate that, by coupling large-kernel multi-scale feature extraction with end-to-end Transformer regression and applying targeted pruning and model-lightening, LSK-RTDETR attains a superior trade-off between detection accuracy and computational efficiency.
Inference Efficiency. Figure 11 presents a quantitative comparison of inference speed between the proposed heatmap-guided adaptive secondary tiling method and the conventional homogeneous tiling approach across images of varying resolutions. In both methods, the initial RSIs is partitioned using sliding windows of 640 × 640 pixels with 20% overlap. For our method, HARs use the same 640 × 640 tiles (20% overlap), whereas LARs use larger 1024 × 1024 tiles without overlap. Homogeneous tiling employs single-model inference with the base LSK-RTDETR, while our approach utilizes the dual-model pipeline.
In the left panel of Figure 11, the computational complexity (FLOPs) is plotted against image size (pixels). The “Traditional homogeneous tiling” curve shows a steep increase in computational cost as image dimensions increase. In contrast, the “Heatmap-guided adaptive tiling” curve rises more gradually, indicating that our method better controls computational overhead on large images. This theoretical behavior is consistent with a complexity-level advantage: while the homogeneous tiling approach incurs cost proportional to the full tile count, our adaptive scheme reduces the effective number of processed tiles and therefore lowers the overall computational load. The reduction factor depends on scene sparsity and object density, which explains why the observed efficiency gain increases for larger images or for scenes with more background area. Consequently, Figure 11 demonstrates a significantly reduced theoretical computational burden achieved by the proposed adaptive tiling.
The right panel quantifies the relative efficiency gain of our method over the baseline (“Efficiency Gain %”). For smaller images (e.g., 2112, 2436, 3136, and 3200 pixels), initial overhead causes negative gains (−22.2% at 2112 pixels, −4.4% at 3200 pixels). However, as image size increases, our method’s advantages become pronounced: 26.4% gain at 10,000 pixels, 22.6% at 20,000 pixels, 23.9% at 30,000 pixels, 22.6% at 40,000 pixels, and 24.9% at 50,000 pixels. These results confirm that the proposed tiling strategy significantly accelerates inference for ultra-high-resolution RSIs.

4.5. Ablation Study

To quantify the contributions of the heatmap-guidance and dual-model modules, the following ablation experiments were conducted.
Baseline (B): The entire 14,856 × 8138 pixel image is uniformly partitioned into 640 × 640 tiles with 20% overlap. All tiles are processed by the base LSK-RTDETR model. This homogeneous approach incurs a massive computational load of 38,544 G FLOPs, resulting in a total inference time of 11.82 s per image.
Ablation A (Heatmap Guidance): We introduce the heatmap-driven adaptive secondary tiling to the Baseline, while retaining the single-model detection pipeline. Specifically, the image is first divided into 1024 × 1024 non-overlapping tiles, classified by EfficientNetV2 to generate a scene heatmap. Based on the heatmap, HARs are re-tiled into 640 × 640 tiles (20% overlap) and LARs remain at 1024 × 1024. All resulting tiles are then detected by the base LSK-RTDETR model. By reducing tile density in background areas, the computational complexity decreases significantly to 28,358 G FLOPs, shortening the inference time to 9.33 s.
Ablation B (Dual-Model): Building upon Ablation A, we incorporate the lightweight LSK-RTDETR variant to exclusively handle low-heat (low-attention regions) tiles. This lite model substantially increases inference throughput on LARs while maintaining comparable detection accuracy. This optimization further reduces the computational cost to 26,810 G FLOPs and achieves the fastest inference speed of 8.92 s.
As summarized in Table 4, these experiments clearly demonstrate that both the heatmap-guided adaptive tiling and the dual-model collaboration are critical for enhancing overall system performance. Specifically, the scene heatmap effectively filters out empty or low-value tiles, directly slashing the computational load from 38,544 G to 28,358 G FLOPs and accelerating inference. The subsequent addition of the lightweight detector further optimizes resource allocation, bringing the total complexity down to 26,810 G FLOPs. Consequently, this substantial decrease in theoretical complexity directly yields the minimum inference latency and the lowest Invalid Compute Ratio.

4.6. Hyperparameter Sensitivity Analysis

To evaluate the robustness of our HAB-DMC framework in its intended operational context, we conducted a hyperparameter sensitivity analysis using the same 14,856 × 8138-pixel test image as in our ablation study (Section 4.5). This ensures that the results directly reflect the method’s performance on ultra-wide-area images.

4.6.1. Sensitivity to Heatmap Threshold

The heatmap threshold controls the partitioning between HARs processed by the full-capacity detector and LARs handled by the lightweight detector, and therefore mediates a direct trade-off between detection accuracy and computational complexity. Table 5 reports the pipeline’s computational complexity, time per image, invalid-compute ratio, and detection performance on HARs and LARs for a single 14,856 × 8138 pixel test image across five representative threshold settings. Here we define the invalid-compute ratio as the fraction of compute expended on tiles that contain no ground-truth objects.
The results in Table 5 expose the expected accuracy–efficiency trade-off. Permissive thresholds increase the proportion of tiles treated as HARs, resulting in a high Invalid Compute Ratio of 18.2%. This inefficiency forces the high-capacity detector to process redundant tiles, inflating the computational complexity to 32,450 G FLOPs and dragging the inference time to 9.85 s. As the threshold increases to 0.5, the Invalid Compute Ratio drops to 13.6%, indicating that more empty tiles are correctly assigned to the lightweight LAR pathway. This reduces the complexity to 26,810 G FLOPs, which directly translates into the optimal inference time of 8.92 s. Conversely, aggressive thresholds push more area into the lightweight LAR pathway and expose a critical non-linear relationship: although they further lower the Invalid Compute Ratio to 11.8% and achieve the lowest theoretical complexity of 23,950 G FLOPs, the inference time unexpectedly rebounds to 9.73 s.
This discrepancy highlights that inference time is not solely determined by FLOPs. We attribute it primarily to region fragmentation and its downstream effects: higher thresholds fragment formerly contiguous HARs into a larger number of smaller, disconnected tiles. This increases the count of region-of-interest crops dispatched to the detector, amplifying per-region scheduling and post-processing overheads. These implementation-level fixed costs, together with reduced batching efficiency on the GPU and potential load imbalance between pipeline stages, offset the pixel-level savings in FLOPs that motivated the larger threshold. From a practical standpoint, a modest threshold range (in our case, 0.4–0.5) yields a robust accuracy–efficiency compromise; when deploying to small-object-dense scenes, one should bias toward a lower threshold (and slightly increased overlap), whereas throughput-critical deployments can raise the threshold but should apply mitigation measures such as morphological smoothing or minimum-area filtering of HAR masks, batching of region-of-interest crops, or region merging to avoid fragmentation-induced overheads.

4.6.2. Tiling Granularity Trade-Off Analysis for HARs and LARs Processing

The granularity of tiling in both high-attention regions (HARs) and low-attention regions (LARs) fundamentally determines the computational workload distribution between the two detector pathways. To systematically examine this trade-off, we varied the tile sizes for HARs and LARs while keeping the detector architectures fixed (LSK-RTDETR-base for HARs and LSK-RTDETR-l for LARs), and the corresponding performance metrics are listed in Table 6.
Using smaller HARs (512 × 512) yields the highest overall accuracy (74.8% mAP) by preserving fine spatial details crucial for small-object detection. However, this configuration significantly increases the computational burden to 33,500 G FLOPs. The finer granularity generates a larger number of tiles with increased overlap redundancy, which must be processed by the high-capacity detector. Consequently, the total inference time rises to 11.23 s per image, only marginally faster than the uniform tiling baseline. This indicates that finer HARs partitioning can improve accuracy but offers diminishing efficiency gains once the number of processed HARs becomes excessive.
In contrast, employing larger LARs (1280 × 1280) successfully reduces the computational complexity to 25,200 G FLOPs and accelerates inference to 8.15 s by substantially reducing the number of background tiles analyzed. Nevertheless, this setting forces the lightweight detector to process larger image regions at once, degrading feature resolution and causing small targets to be overlooked. As a result, the accuracy drops to 73.3% mAP, reflecting the reduced capacity to localize small or densely packed objects when LARs become excessively large.
Our chosen configuration, with HARs of 640 × 640 and LARs of 1024 × 1024, achieves the best overall balance between accuracy and efficiency. It maintains competitive performance (74.2% mAP, within 0.6% of the highest) while achieving a 24.5% speedup over the uniform tiling baseline. The 640 × 640 HARs retain sufficient local detail for the high-capacity detector without introducing excessive fragmentation, whereas the 1024 × 1024 LARs allow the lightweight detector to efficiently process large background regions while maintaining reliable precision in less dense areas.
Overall, these findings demonstrate that the advantage of the proposed framework lies not merely in employing detectors of different capacities, but in tailoring the input granularity to each detector’s operational role. This demand-driven allocation of computational resources enables a more favorable accuracy–efficiency Pareto frontier, underscoring the importance of jointly optimizing tiling parameters and detector assignments within the framework.

4.6.3. Impact of Overlap Rate on Object Integrity and Computational Cost

To mitigate the risk of object truncation at tile boundaries—a critical concern in tiling-based detection—we systematically evaluate the effect of the overlap percentage between adjacent tiles in high-attention regions (HARs). This analysis quantifies the trade-off between preserving object integrity and managing computational redundancy. Adjacent tiles in HARs are generated with a specific overlap, and the parameter is evaluated using three metrics: Boundary Object Recall (measuring detection completeness for vulnerable objects), time per image, and computational complexity.
Boundary Object Recall is defined as the proportion of ground-truth objects that intersect predefined boundary buffer zones of the non-overlapping tiling grid which are successfully detected. Formally, it is computed as:
Boundary Object Recall = TP_boundary / GT_boundary
To quantify the effectiveness of the overlap strategy in mitigating missed detections of objects at tile boundaries, we employ the Boundary Object Recall metric. This metric is derived from two core components: the number of ground-truth objects designated as GTboundary and the number of true positive detections among them, designated as TPboundary. The GTboundary is defined as the count of ground-truth objects whose center points fall within a pre-defined buffer zone. This buffer zone is a fixed-width region (e.g., 50 pixels) extending inwards from all edges of the tiles in the initial non-overlapping grid. These objects are considered most vulnerable to being truncated or missed during the tiling process. The TPboundary is then quantified as the number of objects within this GTboundary set that are successfully detected by our pipeline, meaning a predicted bounding box has an Intersection-over-Union of at least 0.5 with the corresponding ground-truth bounding box. This focused assessment allows for a precise evaluation of the detection system’s capability to preserve objects that are at high risk due to tiling artifacts.
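Under these definitions, the metric can be sketched as follows, assuming upstream IoU matching (threshold 0.5) supplies a per-object detected flag; the tile and buffer sizes follow the text (1024-pixel non-overlapping grid, 50-pixel buffer), and the function names are illustrative:

```python
def in_boundary_buffer(cx, cy, tile=1024, buffer=50):
    """True if a ground-truth center (cx, cy) lies within `buffer` px of
    any edge of the non-overlapping `tile` x `tile` grid."""
    dx, dy = cx % tile, cy % tile
    near_x = dx < buffer or tile - dx < buffer
    near_y = dy < buffer or tile - dy < buffer
    return near_x or near_y

def boundary_object_recall(gt_centers, detected_flags, tile=1024, buffer=50):
    """gt_centers: list of (cx, cy) ground-truth centers;
    detected_flags[i]: whether GT i was matched by a prediction with
    IoU >= 0.5 (matching is assumed to be done upstream)."""
    idx = [i for i, (cx, cy) in enumerate(gt_centers)
           if in_boundary_buffer(cx, cy, tile, buffer)]
    if not idx:
        return 1.0  # no boundary-vulnerable objects in this image
    return sum(detected_flags[i] for i in idx) / len(idx)
```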
The analysis presented in Table 7 demonstrates a clear and critical trade-off mediated by the overlap percentage in fine-grained tiling. Operating without any overlap (0%) yields the lowest computational complexity (22,500 G FLOPs) and the fastest processing speed but incurs a severe performance penalty, with over 30% of objects situated near tile boundaries being missed. This confirms that object truncation at boundaries is a fundamental challenge in tiling-based detection, necessitating an overlap strategy. Adopting a 20% overlap rate strikes the optimal balance, producing a dramatic 20.1 percentage-point improvement in Boundary Object Recall (from 68.5% to 88.6%) for a relatively modest increase in computational complexity to 26,810 G FLOPs. This configuration effectively resolves the majority of object fragmentation issues. Beyond this 20% threshold, the system exhibits strongly diminishing returns; increasing the overlap further to 40% provides a negligible gain of only 2.5 percentage points in recall, while the associated computational complexity surges to 36,200 G FLOPs, a 35% increase. Consequently, an overlap rate of 20% is validated as the optimal compromise, successfully mitigating the primary source of accuracy loss in tiled detection without undermining the computational efficiency gains central to our adaptive framework.

5. Discussions

The heat-guided adaptive blocking and dual-model collaboration framework offers an effective, demand-driven strategy for object detection in ultra-wide images, but several practical considerations remain. Its effectiveness depends on the upstream scene-heatmap; therefore, imperfect scene predictions can influence downstream resource allocation and detection consistency. To be specific, two potential failure modes may arise. First, regarding heatmap misidentification, if a small, target-dense area (e.g., a remote expressway service area) is surrounded by a vast dominant scene (e.g., desert) and is misclassified as a Low-Attention Region due to dominant scene features, the system will default to coarse tiling. This reduced resolution may lead to missed detections of minute targets like vehicles. Second, regarding boundary effects, although we employ an overlap strategy, extremely large objects (e.g., long-span bridges) that traverse multiple tiles may still suffer from truncation or fragmentation if the object scale exceeds the tile’s receptive field, potentially complicating the final result merging.
However, unlike hard filtering methods, our dual-model strategy mitigates the first failure mode by processing even misclassified LARs with the lightweight model (74.0% mAP) instead of discarding them, effectively preventing the skipping of crucial signatures. Regarding tile sizing, while fully dynamic sizing is theoretically ideal, it hampers GPU parallelism. We therefore adopt a dual-fixed-size strategy (640 × 640 for HARs and 1024 × 1024 for LARs) as a necessary engineering trade-off that balances flexibility with computational efficiency. Key hyperparameters were tuned on our evaluation images, and their behavior across different sensors, spatial resolutions, and geographical environments warrants broader validation. In addition, heatmap generation introduces an initial computational overhead that may limit the benefit for smaller or less heterogeneous scenes, making the framework most advantageous at very large image scales where adaptive allocation significantly reduces redundant processing.
To strengthen and extend this work, we plan targeted next steps. We will conduct additional repeated experiments to quantify variability, report statistical confidence measures, and further validate cross-dataset robustness. In the mid to long term, we will explore dynamic tile size adjustment methods to overcome current engineering bottlenecks. We also aim to explore end-to-end formulations that jointly learn scene understanding and detection, enabling the system to dynamically adjust computation based on learned spatial priors. Moreover, replacing the current binary HARs/LARs split with a finer-grained, learnable attention hierarchy may improve flexibility across varying densities and object distributions. Finally, we intend to extend the demand-driven computation paradigm to other data modalities and streaming scenarios (e.g., drone video, multi-temporal monitoring, and SAR images), where adaptive resource allocation can yield substantial efficiency gains and broader real-world applicability.

6. Conclusions

Traditional remote sensing image object detection methods employ homogeneous tiling, meaning the entire image is divided into patches of the same size and overlap ratio, with the same detection model applied to all patches. This strategy embodies translational symmetry, applying identical processing operations to every image region in space. However, the distribution of targets in remote sensing images exhibits high spatial asymmetry: targets are concentrated in specific semantic regions (e.g., airports, residential areas), while large areas (e.g., forests, farmlands) contain almost no targets. This research uses “symmetry” as a theoretical starting point and, by breaking away from the traditional symmetric processing paradigm, proposes a heatmap-guided adaptive tiling and dual-model collaborative detection framework, aiming to address redundant computation and real-time performance bottlenecks in object detection for ultra-wide-area remote sensing images (RSIs). The experimental results demonstrate the effectiveness of the proposed framework: compared to conventional homogeneous tiling schemes, it significantly improves inference speed and reduces invalid computational overhead. These performance gains are primarily attributed to the scene classification-driven heatmap generation module, which serves as a core computation scheduling engine capable of filtering a substantial portion of redundant processing and enhancing overall detection efficiency. Furthermore, by employing a high-precision LSK-RTDETR-base model in high-attention regions (HARs) and a lightweight variant in low-attention regions (LARs), the proposed method effectively maintains detection accuracy while improving inference efficiency, achieving a favorable balance between speed and precision.

Author Contributions

Conceptualization, F.H.; methodology, Y.L., J.Z. and F.H.; software, Y.L. and J.Z.; validation, F.H. and C.M.; formal analysis, Y.L. and J.Z.; investigation, Y.L., J.Z. and F.H.; data curation, Y.L. and J.Z.; writing—original draft preparation, Y.L., J.Z., F.H. and C.M.; writing—review and editing, Y.L., J.Z. and F.H.; visualization, Y.L. and J.Z.; supervision, F.H.; project administration, C.M. and F.H.; funding acquisition, C.M. and F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Grant No. 2023YFB3406300).

Data Availability Statement

The high-resolution remote sensing images used in this study were obtained from the GEOVIS Earth platform (https://daily.geovisearth.com/), first accessed on 1 October 2025. These images are commercial satellite data acquired via a paid subscription service that grants a license for academic research and publication. All map data and satellite imagery are used in compliance with the provider's terms of service.

Conflicts of Interest

Author Chunping Min was employed by the Information Center of Ministry of Natural Resources. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNNs: Convolutional Neural Networks
DQN: Deep Q-Network
HAB-DMC: Heatmap-Guided Adaptive Tiling and Dual-Model Collaboration
HARs: High-Attention Regions
LARs: Low-Attention Regions
LSK: Large Selective Kernel
OAN: Objectness Activation Network
RSIs: Remote Sensing Images
UAV: Unmanned Aerial Vehicle

References

  1. Gu, X.; Angelov, P.P.; Zhang, C.; Atkinson, P. A Semi-Supervised Deep Rule-Based Approach for Complex Satellite Sensor Image Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2281–2292. [Google Scholar] [CrossRef]
  2. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef]
  3. Chen, L.; Liu, C.; Chang, F.; Li, S.; Nie, Z. Adaptive Multi-Level Feature Fusion and Attention-Based Network for Arbitrary-Oriented Object Detection in Remote Sensing Images. Neurocomputing 2021, 451, 67–80. [Google Scholar]
  4. Bai, C.; Bai, X.; Wu, K. A Review: Remote Sensing Image Object Detection Algorithm Based on Deep Learning. Electronics 2023, 12, 4902. [Google Scholar] [CrossRef]
  5. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote Sensing Object Detection in the Deep Learning Era—A Review. Remote Sens. 2024, 16, 327. [Google Scholar]
  6. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote Sensing Image Super-Resolution and Object Detection: Benchmark and State of the Art. arXiv 2021. [Google Scholar] [CrossRef]
  7. Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A Comprehensive Survey of Oriented Object Detection in Remote Sensing Images. Expert Syst. Appl. 2023, 224, 119960. [Google Scholar] [CrossRef]
  8. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  9. Lin, Q.; Zhao, J.; Fu, G.; Yuan, Z. CRPN-SFNet: A High-Performance Object Detector on Large-Scale Remote Sensing Images. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 416–429. [Google Scholar] [CrossRef]
  10. Gao, Y.; Wang, Y.; Zhang, Y.; Li, Z.; Chen, C.; Feng, H. Feature Super-Resolution Fusion With Cross-Scale Distillation for Small-Object Detection in Optical Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  11. Lin, J.; Zhao, Y.; Wang, S.; Chen, M.; Lin, H.; Qian, Z. Aerial Image Object Detection Based on Superpixel-Related Patch. In Proceedings of the International Conference on Image and Graphics, Haikou, China, 26–28 December 2021; pp. 256–268. [Google Scholar]
  12. Xie, X.; Cheng, G.; Li, Q.; Miao, S.; Li, K.; Han, J. Fewer Is More: Efficient Object Detection in Large Aerial Images. Sci. China Inf. Sci. 2024, 67, 112106. [Google Scholar] [CrossRef]
  13. Liu, B.; Mo, P.; Wang, S.; Cui, Y.; Wu, Z. A Refined and Efficient CNN Algorithm for Remote Sensing Object Detection. Sensors 2024, 24, 7166. [Google Scholar] [CrossRef]
  14. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870. [Google Scholar]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  17. Zhang, L.; Zhang, Y. Airport detection and aircraft recognition based on two-layer saliency model in high spatial resolution remote-sensing images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 1511–1524. [Google Scholar] [CrossRef]
  18. Yokoya, N.; Iwasaki, A. Object detection based on sparse representation and Hough voting for optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2015, 8, 2053–2062. [Google Scholar] [CrossRef]
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  20. Lou, H.; Liu, X.; Bi, L.; Liu, H.; Guo, J. BD-YOLO: Detection Algorithm for High-Resolution Remote Sensing Images. Phys. Scr. 2024, 99, 066003. [Google Scholar] [CrossRef]
  21. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 318–327. [Google Scholar]
  22. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
  23. Cao, J.; Pang, Y.; Han, J.; Li, X. Hierarchical Shot Detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9705–9714. [Google Scholar]
  24. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  25. Li, Z.; Wang, Y.; Chen, K.; Yu, Z. Channel Pruned YOLOv5-based Deep Learning Approach for Rapid and Accurate Outdoor Obstacles Detection. arXiv 2022, arXiv:2204.13699. [Google Scholar] [CrossRef]
  26. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered Object Detection in Aerial Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8310–8319. [Google Scholar]
  27. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density Map Guided Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 737–746. [Google Scholar]
  28. Bai, J.; Ren, J.; Yang, Y.; Xiao, Z.; Yu, W.; Havyarimana, V.; Jiao, L. Object Detection in Large-Scale Remote-Sensing Images Based on Time-Frequency Analysis and Feature Optimization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5405316. [Google Scholar] [CrossRef]
  29. Zhang, C.; Lam, K.M.; Wang, Q. CoF-Net: A Progressive Coarse-to-Fine Framework for Object Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  30. Chen, H.; Zhang, L.; Ma, J.; Zhang, J. Target Heat-Map Network: An End-to-End Deep Network for Target Detection in Remote Sensing Images. Neurocomputing 2019, 331, 375–387. [Google Scholar]
  31. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  32. Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984. [Google Scholar]
  33. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1203–1213. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10022. [Google Scholar]
  35. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  36. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16748–16759. [Google Scholar]
  37. Hendrycks, D.; Lee, K.; Mazeika, M. Using Pre-Training Can Improve Model Robustness and Uncertainty. arXiv 2019, arXiv:1901.09960. [Google Scholar] [CrossRef]
  38. Huh, M.; Agrawal, P.; Efros, A.A. What Makes ImageNet Good for Transfer Learning? arXiv 2016, arXiv:1608.08614. [Google Scholar] [CrossRef]
  39. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  40. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  43. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  44. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  45. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  46. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  47. Yao, Y.; Cheng, G.; Xie, X.; Han, J. Optical Remote Sensing Image Object Detection Based on Multi-Resolution Feature Fusion. Nat. Remote Sens. Bull. 2021, 25, 1124–1137. [Google Scholar]
  48. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In European Conference on Computer Vision; LNCS 9905; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  49. Xu, T.; Sun, X.; Diao, W.; Zhao, L.; Fu, K.; Wang, H. ASSD: Feature Aligned Single-Shot Detection for Multiscale Objects in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  50. Zhang, T.; Zhuang, Y.; Wang, G.; Dong, S.; Chen, H.; Li, L. Multiscale Semantic Fusion-Guided Fractal Convolutional Object Detection Network for Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–20. [Google Scholar]
  51. Wang, J.; Gong, Z.; Liu, X.; Guo, H.; Yu, D.; Ding, L. Adaptive Feature-Aware Object Detection in Optical Remote Sensing Images. Remote Sens. 2022, 14, 3616. [Google Scholar] [CrossRef]
  52. Wang, G.; Zhuang, Y.; Chen, H.; Liu, X.; Zhang, T.; Li, L.; Dong, S.; Sang, Q. FSoD-Net: Full-Scale Object Detection from Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5602918. [Google Scholar] [CrossRef]
  53. Yu, D.; Ji, S. A New Spatial-Oriented Object Detection Framework for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4407416. [Google Scholar] [CrossRef]
Figure 1. Non-uniform distribution of targets in an ultra-wide-area remote sensing image. An example of an ultra-wide-area remote sensing image (14,865 × 8138 pixels, center panel) illustrating the typical non-uniform distribution of targets. The lower-left region depicts an agricultural area largely lacking discernible targets. In contrast, target-aggregated zones, such as the airport and residential areas, are evident; detected aircraft and vehicles within these zones are annotated with blue and red bounding boxes, respectively. A stadium is further annotated in green. This figure exemplifies the spatial heterogeneity in target density inherent to ultra-wide-area RSIs.
Figure 2. The scene heatmap-guided adaptive tiling and dual-model detection framework. The input ultra-wide-area RSI is first partitioned into coarse tiles, each of which is classified by a lightweight EfficientNetV2 to produce an attention heatmap. Tiles are then split into HARs (core target areas such as buildings and airports) and LARs (background areas such as farmland and desert) based on a heat threshold. HARs undergo fine-grained secondary tiling and are processed by a high-precision detector to preserve detailed features, while LARs are handled by a lightweight model for efficient inference. Finally, detection outputs are fused and stitched onto the original image.
Figure 3. Computation of tile heat values. Each 1024 × 1024 tile is classified by a transfer-learned EfficientNetV2 to produce a scene-probability vector p_i. For each scene S_j, the maximum object-category correlation a_j = max_{1≤k≤K} a_j^{(k)} is selected, and the tile heat h_i is obtained by the weighted sum h_i = Σ_j p_{i,j} a_j. Finally, the discrete heat values are interpolated to generate a continuous scene heatmap H(x, y) (see Figure 6).
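For concreteness, the tile-heat computation in Figure 3 can be sketched as follows (our illustration; the correlation values are made up, not the paper's calibrated scene-object table of Figure 5):

```python
from typing import Sequence

def tile_heat(scene_probs: Sequence[float],
              corr: Sequence[Sequence[float]]) -> float:
    """Heat of one tile: h_i = sum_j p_{i,j} * a_j, with a_j = max_k a_j^(k)."""
    a = [max(row) for row in corr]                      # per-scene weight a_j
    return sum(p * w for p, w in zip(scene_probs, a))   # weighted sum h_i

# Toy example: 3 scenes x 2 object categories (illustrative values only)
corr = [[1.00, 0.20],   # e.g. airport: strong airplane correlation
        [0.10, 0.05],   # e.g. forest: near-zero correlation
        [0.40, 0.60]]   # e.g. residential: moderate vehicle correlation
p = [0.7, 0.2, 0.1]     # classifier's scene-probability vector for the tile
h = tile_heat(p, corr)  # 0.7*1.0 + 0.2*0.1 + 0.1*0.6 = 0.78
```

A tile dominated by a high-correlation scene thus receives a heat near 1 and lands in the HAR set, while a forest-dominated tile stays near 0.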
Figure 4. Representative samples from the Remote Sensing Scene Classification dataset.
Figure 5. Detailed scene-object correlation table. In this heatmap, the vertical axis denotes the 15 scene categories defined in our classification branch, while the horizontal axis corresponds to the 20 object categories to be detected. The value in each cell represents the specific correlation score a_j^{(k)} defined in Equation (2), which is derived from statistical co-occurrence frequencies in the training set and refined by domain expert calibration. The color intensity reflects the magnitude of the correlation: deep red cells (e.g., value 1.00 for Airport–Airplane) indicate a strong semantic dependency where targets are highly likely to appear, whereas light or white cells indicate a negligible probability. This look-up table serves as the basis for calculating the final tile heat values via Equation (4).
Figure 6. Remote sensing image tiles and their corresponding scene heatmaps. Examples of scene heatmaps for three representative RSI tiles (left). In the heatmaps (middle), red denotes high-heat regions—areas densely populated with targets or semantically important scenes such as airports; green indicates medium-heat regions with moderate target density, e.g., residential zones; and blue highlights low-heat background areas with negligible target presence, such as open fields. The color bar (right) maps heat values from 0 to 1.
Figure 7. Flowchart of the adaptive tiling and detection framework. The framework first generates a scene heatmap (center) and partitions the image into HARs (heat ≥ threshold) and LARs (heat < threshold). HARs undergo fine-grained tiling and are processed by the full-capacity LSK-RTDETR model, while LARs use coarse tiling and the lightweight LSK-RTDETR variant, enabling demand-driven allocation of computational resources.
Figure 8. Schematic of the LSK-RTDETR network architecture. Note: The backbone comprises multiple LSK Blocks, each consisting of a Large Selective Kernel (LSK Selection) convolution module and a Feed-Forward Network (FFN) sub-block. The LSK module dynamically selects among large convolution kernels of varying sizes. Feature maps at three scales (P3–P5) are extracted by the backbone and then fused (Fusion) and concatenated (CAT). The IOU-Aware Query Selection module refines object queries based on intersection-over-union scores, and the Decoder denotes the Transformer-based detection head.
Figure 9. Full-scene detection results on a Gaofen-2 ultra-wide-area image (14,865 × 8138 pixels). The central panel shows the fused and stitched inference output for the entire scene, with bounding boxes color-coded by object category. Surrounding panels display inference results for cropped high-density regions (top row) and low-density regions (bottom row), demonstrating that the dual-model framework maintains high detection precision for both densely clustered and sparsely distributed targets. Note: the detection targets include vehicles, aircraft, and ground track fields.
Figure 10. Representative samples from the DIOR dataset.
Figure 11. Comparative performance of remote sensing image tiling strategies. (a) Computational complexity (FLOPs) versus image size; (b) efficiency gains under different image sizes.
Table 1. Metrics per class.

| Class | Precision |
|---|---|
| Airport | 98.20% |
| BareLand | 98.20% |
| Bridge | 99.88% |
| DenseResidential | 99.38% |
| Desert | 99.48% |
| Farmland | 97.48% |
| Forest | 97.50% |
| Industrial | 98.07% |
| Meadow | 99.77% |
| SparseResidential | 98.38% |
| Park | 98.88% |
| Parking | 99.00% |
| Port | 97.80% |
| Tailing | 99.42% |
| Road | 99.83% |
Table 2. Classification precision of different models.

| Model | Precision |
|---|---|
| VGG16 [40] | 85.20% |
| ResNet50 [41] | 89.45% |
| MobileNetV2 [42] | 91.86% |
| EfficientNetV2 [31] | 94.55% |
| EfficientNetV2 with transfer learning | 95.76% |
Table 3. Detection results (mAP and per-class AP, %) on DIOR: (a) the first 10 object categories; (b) the remaining 10 object categories.

(a)

| Model | mAP | Airplane | Airport | Baseball Field | Basketball Court | Bridge | Chimney | Dam | Expressway Service Area | Expressway Toll Station | Golf Course |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [16] | 54.1 | 53.6 | 49.3 | 78.8 | 66.0 | 28.0 | 70.9 | 62.3 | 69.0 | 55.2 | 68.0 |
| Faster R-CNN with FPN [43] | 63.1 | 54.1 | 71.4 | 63.3 | 81.0 | 42.6 | 72.5 | 57.5 | 68.7 | 62.1 | 73.1 |
| Yolov3 [44] | 57.1 | 72.2 | 29.2 | 74.0 | 78.6 | 31.2 | 69.7 | 26.9 | 48.6 | 54.4 | 31.1 |
| Yolov4 [45] | 66.7 | 75.2 | 69.9 | 70.9 | 88.7 | 39.9 | 76.6 | 54.0 | 59.9 | 60.6 | 67.6 |
| Yolov5 [25] | 69.6 | 85.9 | 76.1 | 72.3 | 89.4 | 43.6 | 80.8 | 61.5 | 59.5 | 58.0 | 75.5 |
| YOLOX [46] | 72.2 | 89.3 | 72.0 | 75.3 | 90.2 | 47.8 | 79.3 | 61.5 | 60.1 | 66.2 | 74.2 |
| Yao et al. [47] | 73.2 | 66.7 | 83.6 | 74.8 | 89.1 | 50.5 | 80.6 | 69.0 | 84.9 | 75.2 | 83.9 |
| ASSD [49] | 71.1 | 85.6 | 82.4 | 75.8 | 89.5 | 40.7 | 77.6 | 64.7 | 67.1 | 61.7 | 80.8 |
| SSD [48] | 58.6 | 59.5 | 72.7 | 72.4 | 75.7 | 29.7 | 65.8 | 56.6 | 63.5 | 53.1 | 65.3 |
| MSFC [50] | 70.0 | 85.8 | 76.2 | 74.3 | 90.1 | 44.1 | 78.1 | 55.5 | 60.9 | 59.5 | 76.9 |
| AFADet [51] | 66.1 | 85.6 | 66.5 | 76.3 | 88.1 | 37.4 | 78.3 | 53.6 | 61.8 | 58.4 | 54.3 |
| FSoD-Net [52] | 71.8 | 88.9 | 66.9 | 86.8 | 90.2 | 45.5 | 79.6 | 48.2 | 86.9 | 75.5 | 67.0 |
| RSADet [53] | 72.2 | 73.6 | 86.0 | 72.6 | 89.6 | 43.6 | 75.3 | 62.3 | 79.5 | 68.7 | 78.6 |
| RT-DETR [33] | 72.6 | 91.8 | 75.9 | 91.2 | 78.5 | 43.5 | 85.9 | 56.2 | 67.8 | 67.0 | 76.5 |
| LSK-RTDETR-l (ours) | 74.0 | 92.3 | 76.5 | 91.8 | 79.2 | 43.9 | 86.4 | 56.7 | 68.3 | 67.5 | 77.1 |
| LSK-RTDETR-base (ours) | 77.5 | 94.0 | 78.7 | 93.3 | 81.0 | 45.2 | 88.7 | 58.4 | 70.5 | 69.8 | 79.0 |

(b)

| Model | mAP | Ground Track Field | Harbor | Overpass | Ship | Stadium | Storage Tank | Tennis Court | Train Station | Vehicle | Wind Mill |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [16] | 54.1 | 56.9 | 50.2 | 50.1 | 27.7 | 73.0 | 39.8 | 75.2 | 38.6 | 23.6 | 45.4 |
| Faster R-CNN with FPN [43] | 63.1 | 76.5 | 42.8 | 56.0 | 71.8 | 57.0 | 53.5 | 81.2 | 53.0 | 43.1 | 80.9 |
| Yolov3 [44] | 57.1 | 61.1 | 44.9 | 49.7 | 87.4 | 70.6 | 68.7 | 87.3 | 29.6 | 42.7 | 78.6 |
| Yolov4 [45] | 66.7 | 70.1 | 58.7 | 57.3 | 87.7 | 50.2 | 75.6 | 86.5 | 52.6 | 52.7 | 88.6 |
| Yolov5 [25] | 69.6 | 73.8 | 62.1 | 57.6 | 89.1 | 55.7 | 72.7 | 86.9 | 61.0 | 57.8 | 82.7 |
| YOLOX [46] | 72.2 | 76.8 | 58.1 | 62.3 | 89.9 | 71.1 | 77.5 | 89.9 | 61.0 | 57.3 | 83.5 |
| Yao et al. [47] | 73.2 | 84.2 | 53.8 | 65.2 | 75.6 | 74.6 | 62.7 | 88.1 | 65.8 | 46.4 | 88.8 |
| ASSD [49] | 71.1 | 78.6 | 62.0 | 58.0 | 84.9 | 65.3 | 65.3 | 87.9 | 62.4 | 44.5 | 76.3 |
| SSD [48] | 58.6 | 68.6 | 49.4 | 48.1 | 59.2 | 61.0 | 46.6 | 76.3 | 55.1 | 27.4 | 65.7 |
| MSFC [50] | 70.0 | 73.6 | 49.5 | 57.2 | 89.6 | 69.2 | 76.5 | 86.7 | 51.8 | 55.2 | 84.3 |
| AFADet [51] | 66.1 | 67.2 | 70.4 | 53.1 | 82.7 | 62.8 | 63.9 | 88.2 | 50.3 | 43.9 | 79.2 |
| FSoD-Net [52] | 71.8 | 77.3 | 53.6 | 59.7 | 78.3 | 69.9 | 75.0 | 91.4 | 52.3 | 52.0 | 90.6 |
| RSADet [53] | 72.2 | 79.1 | 57.9 | 59.2 | 90.0 | 55.8 | 77.0 | 87.8 | 65.3 | 55.3 | 86.5 |
| RT-DETR [33] | 72.6 | 79.1 | 62.8 | 63.1 | 90.1 | 89.3 | 80.7 | 88.4 | 60.9 | 75.1 | 77.8 |
| LSK-RTDETR-l (ours) | 74.0 | 79.8 | 63.2 | 63.5 | 90.5 | 89.7 | 81.2 | 88.9 | 61.4 | 75.6 | 78.2 |
| LSK-RTDETR-base (ours) | 77.5 | 81.4 | 65.8 | 65.0 | 92.2 | 91.5 | 83.0 | 90.6 | 63.3 | 77.3 | 80.7 |
Table 4. Ablation study on 14,865 × 8138-pixel ultra-wide-area RSIs.

| Method | Time per Image (s) | FLOPs | Invalid Compute Ratio |
|---|---|---|---|
| Baseline | 11.82 | 38,544 G | 50% |
| Ablation A (+Heatmap Guidance) | 9.33 | 28,358 G | 13.6% |
| Ablation B (+Dual-Model) | 8.92 | 26,810 G | 13.6% |
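The "72.8% reduction in invalid computations" quoted in the abstract is the relative drop of the invalid-compute ratio between the baseline and the heatmap-guided variants, which can be checked directly:

```python
baseline_invalid = 0.50  # Baseline row, invalid compute ratio
guided_invalid = 0.136   # Ablation A/B rows, invalid compute ratio

# Relative reduction: (0.50 - 0.136) / 0.50 = 0.728, i.e. 72.8%
relative_reduction = (baseline_invalid - guided_invalid) / baseline_invalid
```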
Table 5. Performance variation with different heatmap thresholds.

| Threshold | Time per Image (s) | FLOPs | Invalid Compute Ratio |
|---|---|---|---|
| 0.3 | 9.85 | 32,450 G | 18.2% |
| 0.4 | 9.21 | 29,120 G | 15.1% |
| 0.5 | 8.92 | 26,810 G | 13.6% |
| 0.6 | 9.14 | 25,200 G | 12.3% |
| 0.7 | 9.73 | 23,950 G | 11.8% |
Table 6. The impact of tiling strategy on overall system performance.

| Tiling Strategy (HARs × LARs) | Overall mAP (%) | FLOPs | Time per Image (s) |
|---|---|---|---|
| 640 × 1024 (Ours) | 74.2 | 26,810 G | 8.92 |
| 512 × 1024 | 74.8 | 33,500 G | 11.23 |
| 768 × 1024 | 73.1 | 24,100 G | 7.85 |
| 640 × 768 | 74.5 | 29,800 G | 9.87 |
| 640 × 1280 | 73.3 | 25,200 G | 8.15 |
| Uniform 640 × 640 (Baseline) | 74.8 | 38,544 G | 11.82 |
Table 7. The trade-off between overlap rate, boundary object recall, and inference speed.

| Overlap Rate | Time per Image (s) | FLOPs | Boundary Object Recall (%) |
|---|---|---|---|
| 0% | 7.96 | 22,500 G | 68.5 |
| 10% | 8.41 | 24,300 G | 79.3 |
| 20% | 8.92 | 26,810 G | 88.6 |
| 30% | 9.68 | 30,500 G | 90.2 |
| 40% | 10.75 | 36,200 G | 91.1 |
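The overlap rates above fix the stride between adjacent tiles. A sketch (ours, not the paper's implementation) of how tile origins along one axis would be generated: for tile size t and overlap ratio r the stride is t·(1−r), so a 20% overlap on 640-pixel tiles gives a 512-pixel stride, and objects near a tile border tend to appear whole in at least one tile.

```python
from typing import List

def tile_origins(length: int, tile: int = 640, overlap: float = 0.2) -> List[int]:
    """Origins of tiles along one image axis, with a final tile flush
    against the far border so no pixels are dropped."""
    stride = int(tile * (1 - overlap))              # e.g. 640 * 0.8 = 512
    xs = list(range(0, max(length - tile, 0) + 1, stride))
    if xs[-1] + tile < length:                      # cover the remainder
        xs.append(length - tile)
    return xs

origins = tile_origins(1600)  # [0, 512, 960] for a 1600-px axis
```

Raising the overlap shrinks the stride, which is why FLOPs and inference time in the table grow while boundary recall saturates past 20–30%.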
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
