2.1. Tiling Strategies for Large-Scale Remote Sensing Images
Ultra-wide-area RSIs inherently possess extremely high resolution, making direct GPU processing infeasible. While convolutional neural networks (CNNs) and advanced detection frameworks have driven significant progress in object detection for ultra-high-resolution RSIs [13,14,15,16], fundamental computational challenges persist. Early approaches predominantly relied on sliding-window uniform tiling [17,18]. Although this paradigm ensures comprehensive spatial coverage, it incurs substantial computational waste in target-scarce regions such as forests, farmland, and bare land.
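For concreteness, the sliding-window uniform tiling paradigm can be sketched as follows; the tile size and overlap are illustrative assumptions, not values taken from the cited works:

```python
def uniform_tiles(width, height, tile=1024, overlap=128):
    """Generate (x, y, w, h) crops for uniform sliding-window tiling.

    Every region is covered regardless of content, which is why this
    paradigm wastes computation on target-scarce areas: empty tiles
    cost exactly as much as target-dense ones.
    """
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Append a final tile so the right/bottom borders stay covered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, tile, tile) for y in ys for x in xs]
```

Every tile is forwarded to the detector unconditionally, which is the computational-waste behavior the adaptive strategies below try to avoid.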
To mitigate this, Lin et al. proposed a superpixel-based tiling strategy [11], reducing tile quantity by approximately 28%. However, the requisite superpixel segmentation imposes significant computational overhead, increasing overall inference time by 29.9% and adding data-dependent pre-processing steps that reduce portability across datasets.
Alternatively, Xie et al. introduced an objectness activation network that filters target-free sub-images through grid-level prediction, achieving a 30% acceleration [12]. Nevertheless, this method performs poorly on small-scale objects: it frequently misclassifies sub-images containing small targets as target-free regions and thereby induces missed detections, a failure mode tied to its coarse grid granularity and lack of multi-scale semantic fusion.
Critically, existing methodologies predominantly rely on low-level visual features and lack scene-level semantic understanding, so they struggle to balance efficiency and accuracy. Consequently, achieving substantial inference acceleration without compromising detection accuracy remains a core challenge in ultra-wide-area RSI processing. To address this limitation, we propose a semantics-guided secondary tiling mechanism driven by scene heatmaps: high-attention regions (HARs) undergo fine-grained sliding-window tiling with a high-accuracy detection model to maintain sensitivity to small-scale targets, whereas low-attention regions (LARs) receive coarse-grained tiling combined with a computationally efficient lightweight detector for accelerated processing. This approach enhances detection efficiency while effectively preventing small-target omissions, providing a new paradigm for efficient intelligent interpretation of ultra-wide-area RSIs.
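The secondary tiling step described above can be sketched as follows; the grid layout, tile sizes, and detector labels are illustrative assumptions rather than our actual configuration:

```python
def secondary_tiles(region_mask, coarse=1024, fine=512):
    """Sketch of semantics-guided secondary tiling.

    region_mask[r][c] is True for a high-attention region (HAR) and
    False for a low-attention region (LAR). HAR cells are re-tiled at
    fine granularity and routed to the high-accuracy detector; LAR
    cells keep one coarse tile for the lightweight detector.
    """
    jobs = []
    for r, row in enumerate(region_mask):
        for c, is_har in enumerate(row):
            x0, y0 = c * coarse, r * coarse
            if is_har:
                # Fine-grained tiles preserve sensitivity to small targets.
                for dy in range(0, coarse, fine):
                    for dx in range(0, coarse, fine):
                        jobs.append(("precise", x0 + dx, y0 + dy, fine))
            else:
                # One coarse tile per sparse cell for fast processing.
                jobs.append(("light", x0, y0, coarse))
    return jobs
```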
2.2. Efficient Object Detection in Ultra-High-Resolution Remote Sensing Images
With the rapid advancement of imaging technologies, the resolution of remotely sensed data continues to increase, from satellite and Unmanned Aerial Vehicle (UAV) platforms to automotive scenes and 4K/8K video, making efficient object detection in such large-scale images a critical challenge. Although lightweight detectors (e.g., YOLO [19,20], RetinaNet [21], RefineDet [22], HSD [23], FCOS [24]) can accelerate inference, applying these models directly to ultra-high-resolution images still incurs prohibitive computational costs. Model compression techniques, such as channel pruning on YOLOv5 [25], have been shown to improve frames per second (FPS) by up to 40%, but often at the expense of small-object mAP.
Recent studies have explored coarse-to-fine cascade strategies to further accelerate detection on large images. For instance, the Objectness Activation Network (OAN) employs a lightweight binary branch to predict the presence of objects in each tile, thereby skipping empty regions [12]; ClusDet generates clustered candidate regions via a clustering network before selective detection [26]; and DMNet uses a density map to localize regions of interest and restricts detection to those areas [27].
However, these approaches still exhibit significant limitations in inference speed, training complexity, and general applicability. In particular, density-map-guided cropping may improve small-object recall under some conditions, but it struggles with scale variation, requires dense supervision that is costly to produce, and can create fragmented or redundant crops that compromise the integrity of large objects. Such methods therefore trade improved local recall for higher annotation and computation costs and reduced robustness to scale and dataset shift. Conceptually, our approach differs fundamentally from density-map-based methods such as DMNet [27]. While DMNet relies on object density estimation, which requires instance-level annotations and may struggle with scale variations, our scene heatmap approach operates at the semantic level, requiring only scene category labels that are easier to obtain and generalize better across datasets. This semantic foundation provides superior scalability to new geographical areas and object categories, as scene semantics tend to be more transferable than precise object density distributions.
ClusDet exhibits similar shortcomings. OAN's fixed-grid objectness predictions cannot reliably detect objects at grid boundaries or those spanning multiple cells, and its lack of multi-scale semantic fusion hampers the balance between precision and speed. To address these limitations, some works have shifted toward feature-domain optimization. Bai et al. [28] propose a remote sensing detection framework that integrates wavelet time-frequency analysis with reinforcement-learning-based feature selection. Their method employs a dueling Deep Q-Network (DQN) to identify dominant time-frequency channels and reduce computational complexity, and introduces a discrete wavelet multi-scale attention module to suppress background interference. In practice, such feature-domain optimizations reduce some spatial processing costs but introduce algorithmic complexity.
In addition, for efficient detection on ultra-wide-area RSIs, CoF-Net [29] presents a progressive coarse-to-fine framework: candidate regions are first rapidly filtered at low resolution, then detection is gradually refined, thereby substantially reducing computation while preserving accuracy. Although conceptually similar to our heatmap-guided adaptive tiling in pursuing coarse-to-fine computation, CoF-Net's reliance on multi-resolution image pyramids introduces nontrivial memory and interpolation overheads, which can hamper throughput on resource-constrained hardware and complicate end-to-end deployment; Bai et al.'s scheme, by contrast, bypasses spatial-pyramid costs via time-frequency feature reconstruction.
More critically, CoF-Net’s low-resolution filtering lacks semantic understanding, making it prone to discarding regions containing small but semantically important objects.
In contrast, our method leverages scene-semantic priors to drive adaptive tiling, enabling more flexible and demand-aware allocation of computational resources. Concretely, the scene classifier we use requires only coarse tile-level supervision (classification labels) rather than dense pixel-wise annotations, which reduces training complexity and annotation cost and improves scalability and transferability across datasets.
Furthermore, by assigning high-resolution processing only to semantically important HARs (high heat, dense target), our approach preserves cross-tile object integrity while avoiding the memory burdens of maintaining multiple full-resolution pyramids, leading to better practical scalability on large images and constrained hardware.
To overcome these limitations, we propose a scene heatmap-guided adaptive tiling and dual-model collaborative framework tailored for ultra-wide-area RSIs. Unlike methods that rely solely on low-level features or grid-level activations, our approach introduces a global semantic perception mechanism. An EfficientNetV2-based classifier generates a heatmap encoding the target-scene correlation for each coarse tile; the image is then partitioned into HARs (high heat, dense targets) and LARs (low heat, sparse targets) via a dynamic threshold. High-attention regions undergo fine-grained tiling and are processed by a high-precision detector, while low-attention regions use coarse tiling and a lightweight model. This demand-driven allocation markedly reduces redundant computation in empty areas, maintains high accuracy for both small and cross-tile large objects, and achieves superior inference efficiency and robustness across diverse resolutions.
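The dynamic-threshold partition can be illustrated with a minimal sketch; the mean-plus-scaled-deviation rule below is a stand-in assumption for exposition, not our exact thresholding criterion:

```python
import numpy as np

def partition_regions(heatmap, k=0.5):
    """Split a coarse-tile scene heatmap into HARs and LARs.

    heatmap holds the classifier-derived target-scene correlation per
    coarse tile. The threshold adapts to the heat distribution of the
    image at hand rather than being fixed globally.
    """
    tau = heatmap.mean() + k * heatmap.std()  # illustrative dynamic rule
    har_mask = heatmap >= tau                 # True -> HAR, False -> LAR
    return har_mask, tau
```

Tiles where `har_mask` is True would then receive fine-grained tiling and the high-precision detector, and the rest the lightweight path.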
2.4. Transformer-Based Hybrid Detection Models
Recent studies have sought to combine convolutional feature extractors with Transformer architectures to exploit both local detail and global context in RSI object detection. Li et al. proposed the TRD (Transformer with Transfer CNN) framework, which fuses a CNN's local representation power with a Transformer's long-range dependency modeling in an end-to-end detection pipeline [32]. Building on this, Zhang et al. introduced RT-DETR by integrating a residual-enhanced CNN backbone with a deformable Transformer detection head, yielding robust query embeddings and dynamic attention maps without requiring region proposal networks, and achieving state-of-the-art mAP on aerial benchmarks [33]. Although these hybrid approaches confirm the feasibility of coupling multi-scale CNN features with end-to-end Transformer decoders, they do not simultaneously address computational redundancy and adaptive receptive-field requirements in ultra-wide-area RSIs.
While pure Transformer architectures such as Swin Transformer [34] and Mask2Former [35] achieve excellent results on general vision benchmarks, we argue that a hybrid CNN-Transformer design is better suited to the object-detection module in ultra-wide-area RSIs. The primary reasons are engineering efficiency at scale and the value of CNN inductive biases when processing thousands of tiles of varying sizes and content.
In our LSK-RTDETR detector, the LSKNet backbone builds multi-scale representations via a large selective-kernel mechanism that adaptively adjusts receptive fields, yielding context modeling comparable to windowed self-attention in Swin while relying on large-kernel convolutions rather than global attention. These convolutions are typically more hardware friendly and incur lower latency on common deep-learning accelerators, which matters when the detector is applied to many tiles per image [34,36].
Moreover, LSKNet's inductive biases (translation equivariance and locality) naturally favor learning compact, position-consistent features that are crucial for detecting small, densely packed objects in remote-sensing images. Although Vision Transformers can learn similar properties from data, they often require larger-scale pretraining or careful multi-scale engineering to match CNNs on fine-grained localization tasks [34,35]. Our hybrid approach therefore combines efficient, locality-aware feature extraction with Transformer-based global reasoning where it matters most.
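The selective-kernel idea, mixing responses from different receptive fields with a data-dependent gate, can be sketched conceptually; real LSKNet uses learned depthwise large-kernel convolutions and a learned selection branch, whereas this NumPy version only illustrates the adaptive-receptive-field mechanism:

```python
import numpy as np

def box_filter(x, k):
    """Naive k x k mean filter with zero padding, standing in for a
    depthwise large-kernel convolution on a single-channel map."""
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def selective_kernel_mix(x, k_small=3, k_large=7):
    """Mix a small and a large receptive field per pixel.

    The sigmoid gate favors whichever branch responds more strongly at
    each location, mimicking how a selective-kernel module adapts its
    effective receptive field to local content.
    """
    f_small = box_filter(x, k_small)
    f_large = box_filter(x, k_large)
    gate = 1.0 / (1.0 + np.exp(-(f_small - f_large)))  # sigmoid in (0, 1)
    return gate * f_small + (1.0 - gate) * f_large
```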
To this end, we propose the enhanced LSK-RTDETR (Large Selective Kernel–RT-DETR) model integrated into a demand-driven adaptive tiling framework. HARs are processed with an LSKNet large-kernel backbone and a deformable Transformer decoder to strengthen multi-scale representations of small and cross-tile large objects, while LARs employ a lightweight variant to boost throughput. This dual-model collaborative inference strategy not only ensures precise detection in high-value areas but also significantly reduces invalid computation in background regions, achieving a balanced trade-off between detection accuracy and inference efficiency.
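The efficiency argument behind dual-model collaboration admits a back-of-envelope cost model; the parameters below (fine tiles per coarse cell, relative cost of the lightweight detector) are illustrative assumptions rather than measured numbers:

```python
def relative_cost(har_fraction, fine_per_coarse=4, light_cost=0.25):
    """Cost of dual-model inference relative to running the
    high-precision detector on fine tiles everywhere.

    har_fraction    -- fraction of coarse cells classified as HARs
    fine_per_coarse -- fine tiles generated per HAR coarse cell
    light_cost      -- lightweight detector cost per coarse tile,
                       relative to one high-precision fine-tile pass
    """
    baseline = fine_per_coarse * 1.0  # all cells fine-tiled, precise model
    dual = (har_fraction * fine_per_coarse * 1.0
            + (1.0 - har_fraction) * light_cost)
    return dual / baseline
```

Under these assumptions, an image in which only 20% of coarse cells are HARs would cost roughly a quarter of the uniform fine-tiling baseline, which is the source of the speedup when targets are spatially sparse.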