A Unified Framework with Dynamic Kernel Learning for Bidirectional Feature Resampling in Remote Sensing Images

Xiang, Jiajun; Xiao, Zixuan; Wang, Shuojie; Fu, Ruigang; Zhong, Ping

doi:10.3390/rs17213599


by Jiajun Xiang, Zixuan Xiao, Shuojie Wang, Ruigang Fu * and Ping Zhong

National Key Laboratory of Science and Technology on ATR, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(21), 3599; https://doi.org/10.3390/rs17213599 (registering DOI)

Submission received: 24 September 2025 / Revised: 22 October 2025 / Accepted: 24 October 2025 / Published: 30 October 2025

(This article belongs to the Section Remote Sensing Image Processing)


Highlights

What are the main findings?

A lightweight dynamic resampling kernel based on a compact source space and bilinear interpolation achieves competitive performance with significantly fewer parameters, eliminating the need for learnable offsets and channel compression.
A unified bidirectional resampling framework ensures architectural consistency across upsampling and downsampling operations, enhancing multiscale feature learning for remote sensing object detection.
Extensive validation on DIOR and DOTA benchmarks shows consistent performance improvements over baseline methods under substantially reduced parameter constraints.

What is the implication of the main finding?

The method provides a highly parameter-efficient alternative to existing learnable resamplers, making it suitable for resource-constrained deployment scenarios.
The unified design facilitates more coherent feature representation learning across scales, which is essential for accurate detection in complex remote sensing imagery.

Abstract

The inherent multiscale nature of objects poses a fundamental challenge in remote sensing object detection. To address this, feature pyramids have been widely adopted as a key architectural component. However, the effectiveness of these pyramids critically depends on the sampling operations used to construct them, highlighting the need to move beyond traditional fixed-kernel methods. While conventional interpolation approaches (e.g., nearest-neighbor and bilinear) are computationally efficient, their content-agnostic nature often leads to detail loss and artifacts. Recent dynamic sampling operators improve performance through content-aware mechanisms, yet they typically incur substantial computational and parametric costs, hindering their applicability in resource-constrained scenarios. To overcome these limitations, we propose Lurker, a learned and unified resampling kernel that supports both upsampling and downsampling within a consistent framework. Lurker constructs a compact source kernel space and employs bilinear interpolation to generate adaptive kernels, enabling content-aware feature reassembly while maintaining a lightweight parameter footprint. Extensive experiments on the DIOR and DOTA datasets demonstrate that Lurker achieves a favorable trade-off between detection accuracy and efficiency, outperforming existing resampling methods in terms of both accuracy and parameter efficiency, making it especially suitable for remote sensing object detection applications.

Keywords:

feature resample; sampling operator; unified and lightweight; object detection

1. Introduction

Object detection is a fundamental task in remote sensing image processing. Through object detection algorithms, aircraft, vehicles, and ships can be accurately localized and identified, which plays an essential role in military intelligence analysis and operational decision-making. However, remote sensing images, typically acquired from high altitudes, contain objects with substantial scale variations. These objects are often small and densely arranged. Combined with other factors such as weather and sensor parameters, this often makes it a significant challenge to separate the objects of interest from their surroundings. Furthermore, deployment in resource-constrained environments such as onboard satellites, unmanned aerial vehicles (UAVs), and embedded platforms [1] imposes stringent requirements on computational efficiency and power consumption.

The key strategy to tackle the challenges mentioned above is to build efficient multiscale feature representations. The Feature Pyramid Network (FPN) and its variants have become standard structures for this purpose. However, the performance of FPN largely depends on the quality of its built-in sampling operations—specifically, upsampling and downsampling. These operations are responsible for transferring and combining features across different levels of the pyramid. Therefore, the quality of sampling methods directly impacts the completeness and semantic richness of multiscale features, as well as the final object detection performance.

In remote sensing image analysis, upsampling and downsampling operations serve as fundamental components for multiscale feature representation. Upsampling enhances spatial resolution to recover fine details and facilitates the fusion of low-level spatial information with high-level semantics, thereby constructing feature representations that are simultaneously rich in both spatial and semantic content. However, traditional interpolation-based upsampling methods often fail to adequately capture semantic context within feature maps, leading to the loss of fine-grained details that are critical for detecting small objects in remote sensing imagery. To mitigate this, learnable upsampling operators have been developed, which incorporate trainable parameters to enhance representational flexibility. Representative dynamic upsampling approaches, such as CARAFE [2], A2U [3], DLU [4], FADE [5], SAPA [6], DIP [7] and DySample [8], perform instance-specific processing in a data-driven manner by adaptively adjusting upsampling strategies based on input content. Conversely, downsampling reduces spatial resolution to alleviate computational complexity, commonly implemented via max pooling or strided convolution, which aggregate local features into higher-level abstractions. Nevertheless, such fixed operations tend to inadequately preserve fine-grained details, often resulting in feature blurring or irreversible information loss—an issue particularly critical in remote sensing, where small and structurally complex targets demand rich detail for accurate recognition. Recent efforts have introduced content-aware downsampling strategies to address this limitation. For instance, CARAFE++ [9] integrates a complementary content-aware downsampling operation, unifying both upsampling and downsampling within a single framework. 
Similarly, modules like the Adaptive Downsampling Module (ADM) [10], Robust Feature Downsampling (RFD) [11], and Content-aware Pooling and Downsampling Module (CPDM) [12] aim to preserve essential details during resolution reduction.

Although advanced sampling operators demonstrate improved adaptability and performance, they exhibit systematic limitations that hinder practical deployment in remote sensing applications. Current methods typically specialize in either upsampling or downsampling operations, lacking a unified framework for efficient bidirectional resampling. Static approaches fail to preserve critical details across diverse remote sensing scenarios [13], while non-unified architectures necessitate separate modules for different sampling directions, increasing structural complexity [14]. Moreover, dynamically parameterized operators often introduce substantial computational costs and parameter overhead in pursuit of feature preservation, rendering them unsuitable for resource-constrained environments where efficiency is paramount [13]. For instance, while CARAFE++ supports bidirectional operations, its parameter burden remains considerable; DLU has made progress in lightweight design but relies on learnable offset prediction and has not been extended to downsampling or thoroughly evaluated in remote sensing contexts. This architectural separation and methodological limitation creates a fundamental trade-off where reducing model size or improving inference speed typically comes at the expense of detection accuracy. Consequently, a unified framework capable of delivering high-performance bidirectional dynamic sampling under strict lightweight constraints remains unavailable, significantly impeding deployment in real-time remote sensing applications and edge computing systems where computational resources are severely limited [14].

To achieve a better balance between detection accuracy and computational complexity, we propose a novel resampling framework named Lurker, which builds upon the foundation of DLU. Lurker extends the principles of DLU to both upsampling and downsampling but eliminates the need for learnable guidance offsets. This is achieved by constructing a compact source kernel space and generating the target kernels for both upsampling and downsampling via bilinear interpolation. Consequently, Lurker not only reduces the parameter count and computational overhead but also improves the mean Average Precision ($mAP$), thereby achieving a more favorable overall performance trade-off. Figure 1 illustrates the working mechanism of Lurker in an upsampling case. We visualize the feature maps in the top-down pathway of the feature pyramid network (FPN) [15] and compare Lurker with the nearest neighbor interpolation baseline. After being upsampled by Lurker, a feature map can more accurately represent the informational characteristics of objects, enabling the model to achieve superior remote sensing object detection results.

The main contributions of our work are summarized as follows:

1. Lightweight and Unified Dynamic Resampling Kernel: To improve parameter efficiency, we propose Lurker, which extends the dynamic lookup (DLU) principle to perform both upsampling and downsampling using a compact source kernel space and bilinear interpolation. In contrast to DLU, our method removes the need for learnable guidance offsets and a channel compressor, yet maintains competitive performance with significantly fewer parameters.
2. Unified Bidirectional Resampling Framework: To resolve architectural inconsistency across resampling operations, we design a unified bidirectional resampling framework based on consistent design principles. Unlike existing methods that employ separate designs for each resampling direction, our framework ensures architectural coherence and promotes more effective multiscale feature learning, which is critical for remote sensing object detection.
3. Comprehensive Experimental Validation: To fill the gap in remote sensing validation, we conduct extensive experiments on two authoritative benchmarks. We compare Lurker against several notable resampling modules on the challenging DIOR [16] and DOTA [17] datasets. The results confirm that our approach consistently outperforms baseline methods while operating under considerably lower parameter constraints compared to existing learnable resampling techniques.

The remainder of this paper is organized as follows. Section 2 reviews related work on feature upsampling and downsampling operators in remote sensing. Section 3 presents our proposed methodology, including an overview of the Lurker framework and detailed descriptions of the kernel generation and dynamic reassembly modules. Section 4 provides comprehensive experiments and discussions, covering dataset descriptions, implementation details, evaluation metrics, result analysis with both qualitative and quantitative comparisons, effectiveness analysis, and ablation studies. Finally, Section 5 concludes the paper with a summary of our main findings and contributions.

2. Related Work

In remote sensing imagery, objects often exhibit a wide range of sizes, requiring the use of multiscale feature representations to effectively detect them. To address this challenge, many modern detectors leverage Feature Pyramid Networks (FPN) [15] as the neck structure, which constructs multiscale features to represent objects of varying sizes.

2.1. Feature Upsampling Operators in Remote Sensing

Within the FPN architecture, the upsampling operation plays a fundamental role. In the FPN, nearest-neighbor interpolation is typically employed to upsample feature maps from coarser resolutions to higher resolutions. Traditional interpolation methods, such as nearest-neighbor and bilinear interpolation, rely on predetermined rules for upsampling feature maps. However, these methods often fail to capture semantic information when processing small objects and tend to lose critical fine-grained details [18].

To overcome these limitations, several learnable upsampling operators have been proposed. These approaches incorporate trainable parameters, often leveraging the concept of convolution, to improve model performance. For instance, deconvolution is employed to upsample high-level features, enhancing detection performance for small objects in SAPNet [19]. Deconvolution [20] achieves feature upsampling by reversing the standard convolution process. In Info-FPN [21], pixel shuffle upsampling is proposed for multiscale feature fusion. Pixel Shuffle [22] reduces the number of channels, but redistributes this information to the spatial dimension, ensuring no loss of information. Despite these advancements, these methods still have inherent limitations. Deconvolution relies on fixed learned kernels during inference, and Pixel Shuffle is constrained by its predefined channel-to-space transformation rule. These upsampling approaches struggle to tackle the unique challenges posed by remote sensing images, such as complex backgrounds, varying target scales, and densely arranged objects.
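The channel-to-space rule behind Pixel Shuffle can be made concrete with a small sketch. The function below is an illustrative pure-Python version written for this note (the name `pixel_shuffle` and the nested-list layout are assumptions, not code from any cited work); it follows the commonly used convention in which channel `c*r*r + i*r + j` supplies sub-pixel `(i, j)` of output channel `c`.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0, "channel count must be divisible by r^2"
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        # Channel c*r*r + i*r + j supplies sub-pixel (i, j).
                        out[c][h * r + i][w * r + j] = x[c * r * r + i * r + j][h][w]
    return out


# Four 1x1 channels become one 2x2 channel; every value is kept.
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
print(shuffled)  # [[[1, 2], [3, 4]]]
```

The rearrangement is fixed in advance: no value is lost, but nothing about the mapping depends on the image content, which is the limitation noted above.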

In response to these limitations, significant research efforts have shifted toward dynamic upsampling operators that adapt to input content. As a class of kernel-based methods, they generate adaptive convolution kernels and have demonstrated considerable promise in addressing the complex challenges of remote sensing imagery. CARAFE [2] stands as a prominent example, employing a subnetwork to produce dynamic convolution kernels for content-aware feature reorganization. Its integration into detection frameworks demonstrates improved feature representation for small targets through larger receptive fields and instance-specific processing. Building upon this kernel generation paradigm, several specialized variants have emerged to address particular challenges in remote sensing imagery. CAFUS [23] introduces a feature modification kernel that refines interpolated outputs to better preserve semantic information. CAU [24] adaptively generates upsampling kernels according to contextual information, proving particularly valuable for recovering detailed object boundaries in high-resolution scenes. More recently, DLU was proposed to reduce parameter count while maintaining performance through a compact source kernel space and learnable guided offsets. However, this approach still requires an additional offset predictor that increases computational complexity, and it has yet to see widespread adoption in remote sensing applications compared to earlier methods.

In contrast to kernel-based methods, the pixel displacement paradigm achieves upsampling through spatial transformation rather than convolution operations. DySample exemplifies this approach by splitting single points into multiple locations from a point-sampling perspective, creating sharper edges through precise semantic clustering. Its successful integration into networks like ADD-YOLO and 4SDC-YOLOv8 demonstrates particular effectiveness in enhancing small target detection in remote sensing imagery. Following similar principles, Spatial-Guided Feature Upsampler (SGFU) [25] dynamically computes sampling positions by incorporating both higher-level and lower-level features, with specialized offset prediction improving geometric accuracy. Similarly, Guided Upsampling Module (GU) [26] employs a guidance table of offset vectors to direct sampling toward correct semantic categories. Another notable approach, Flow Guided Upsampling Module (FGUM) [27], addresses feature shift issues by constructing flow fields that enable shallow features to guide the upsampling of deep features through grid sampling operations. Together, these pixel displacement methods offer complementary advantages for remote sensing applications where precise spatial alignment and edge preservation are critical.

The feature rearrangement paradigm represents another important approach for resolution enhancement in remote sensing, which operates through spatial reorganization of existing features rather than kernel generation or pixel displacement. A foundational method in this category is Sub-Pixel Conv, which employs periodic shuffling to efficiently increase feature resolution. Building upon this concept, SP-Conv extends the rearrangement strategy with more sophisticated spatial processing. Further advancing this paradigm, Local Relationship Upsampling (LRU) [28] calculates similarity relationships between high-level feature points and their corresponding low-level regions to enhance point-to-region integration. These rearrangement-based methods collectively offer computationally efficient alternatives for remote sensing applications, particularly valuable in scenarios demanding rapid processing while maintaining adequate feature representation.

2.2. Feature Downsampling Operators in Remote Sensing

Downsampling represents a critical operation in deep neural networks for expanding receptive fields, reducing computational costs, and feature aggregation. In remote sensing object detection, however, conventional downsampling approaches often incur substantial information loss that adversely affects performance, particularly for small objects. Traditional methods including pooling operations and strided convolution provide computational efficiency but frequently sacrifice essential spatial information.

To mitigate these limitations, recent research has developed hybrid multi-path pooling mechanisms that better preserve critical information. Several works employ dual-branch architectures to combine complementary downsampling strategies. The Enhanced Effective Channel Attention in ABNet [29] integrates both average and max pooling to generate enriched channel attention maps. Similarly, the Haar wavelet-based downsampling and max pooling (HWD-MP) [30] module preserves complete information while providing diverse feature representations. The Efficient Downsample Module (EDown) [31] combines max pooling with depthwise separable convolution, utilizing batch normalization to maintain feature continuity and computational efficiency.

Beyond conventional downsampling techniques, several advanced content-aware approaches have emerged that are specifically designed for remote sensing challenges. These methods can be broadly categorized by their underlying mechanisms. One line of research focuses on dynamic kernel generation, exemplified by CARAFE++ [9], which extends the content-aware paradigm to both upsampling and downsampling through adaptive kernels within large receptive fields, though its application in remote sensing remains limited. Another direction employs adaptive weighting strategies, as seen in the Adaptive Downsampling Module (ADM) [10], which enhances tiny object detection through local detail preservation, and the Content-Aware Downsampling Module (CADM) [12], which implements a three-stage process of channel expansion, content-aware weight prediction, and feature aggregation to maintain small object information. Further advancing this concept, the Scale-Enhanced Detection Head combines adaptive downsampling with multi-scale feature enhancement without increasing parameters. A distinct approach explores multi-branch feature integration through the Robust Feature Downsampling (RFD) module [11], which combines features from different downsampling techniques to create complementary feature sets that overcome the limitations of single-method approaches. Collectively, these advanced methods demonstrate the evolving sophistication in addressing information preservation challenges during downsampling in remote sensing applications.

Despite these advancements, a fundamental challenge in resampling methods for remote sensing object detection (RSOD) lies in achieving an optimal balance between lightweight design and effective information retention. Traditional approaches often rely on complex module structures or additional operator parameters to preserve useful information while filtering out background interference. To address this issue, we propose a learned and unified resampling kernel that extends the DLU framework. Our kernel incorporates downsampling operations while eliminating the requirement for learnable guidance offsets and channel compression, thereby achieving efficient and effective sampling within a unified architecture. Unlike existing dynamic sampling operators such as DySample, which only support upsampling, our framework achieves content-aware upsampling and downsampling in a unified structure.

3. Methodology

3.1. Overview

Based on the DLU [4] framework, the proposed Lurker method comprises two core components: the Kernel Generation Module and the Dynamic Reassembly Module. As illustrated in Figure 2, the overall process is designed to dynamically produce content-adaptive resampling kernels for each target location, enabling unified and flexible feature resampling for both upsampling ($\sigma > 1$) and downsampling ($\sigma < 1$).

The workflow consists of two stages. First, the Kernel Generation Module predicts a compact kernel representation for each spatial location by performing a series of transformations, normalizations, and expansions on the input feature map. Then, the Dynamic Reassembly Module leverages these generated kernels to perform weighted aggregation over local regions of the input features, effectively extracting relevant neighborhoods and computing output values via kernel-pixel interactions. Together, these modules facilitate high-quality feature resampling while preserving semantic representation integrity.

In the following sections, we detail the designs of the Kernel Generation Module and the Dynamic Reassembly Module.

3.2. Kernel Generation Module

As shown in the upper part of Figure 2, the kernel generation module in our framework comprises three core components: a kernel space generator, a kernel space normalizer, and a kernel space expander. The process begins with the kernel space generator, which employs a convolutional encoder with a kernel size of $k_{encoder}$ to transform the input features. This operation produces a structured tensor with dimensions $H \times W \times k_{resize}$, where $H$ and $W$ represent the height and width of the input feature map, and $k_{resize}$ denotes the size of the resampling neighborhood, thereby establishing the source kernel space. Each spatial position within this source kernel space contains adjustable $k_{resize}$ convolutional kernels that adaptively govern receptive field scaling through learnable spatial resampling. Inspired by classical interpolation methods, dynamic kernel space normalization is implemented via a spatial softmax activation. This normalization ensures resampled features maintain magnitude consistency with input features while enhancing physical interpretability. Subsequently, bilinear interpolation is applied to the normalized source kernels through an upsampling operation to generate the target kernels. Critically, this expansion process preserves softmax normalization across all kernels in the target space. The detailed proof of this preservation can be found in the appendix of DLU [4]. The normalized kernel space thus enables effective resampling into desired kernel configurations through bilinear interpolation. For downsampling operations, the total reconstructed kernel count equals $\lceil H / \sigma \rceil \times \lceil W / \sigma \rceil \times k_{resize}^{2}$. Conversely, upsampling generates $\sigma H \times \sigma W \times k_{resize}^{2}$ kernels. Our methodological design ensures the downsampled dimensions $\lceil H / \sigma \rceil \times \lceil W / \sigma \rceil$ maintain integer values. This approach relaxes the conventional constraint that $\sigma H$ and $\sigma W$ must be integers, thereby enabling the dynamic reassembly module to handle non-integer upsampling ratios. Such architectural flexibility significantly enhances the practical applicability and scalability of our Lurker framework.
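The claim that bilinear expansion preserves softmax normalization follows from the bilinear weights being non-negative and summing to one, so each target kernel is a convex combination of normalized source kernels. The sketch below checks this numerically. It is a minimal illustration under stated assumptions: the helper names (`softmax`, `bilinear_kernel`) and the choice of four flattened kernels with $k_{resize}^{2} = 9$ weights are hypothetical, and the logits are arbitrary numbers rather than encoder outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of kernel logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_kernel(k00, k01, k10, k11, dy, dx):
    """Blend four neighbouring source kernels with bilinear weights.

    dy and dx in [0, 1] are the fractional offsets of the target location
    inside the cell spanned by the four source locations.
    """
    w00 = (1 - dy) * (1 - dx)
    w01 = (1 - dy) * dx
    w10 = dy * (1 - dx)
    w11 = dy * dx
    return [w00 * a + w01 * b + w10 * c + w11 * d
            for a, b, c, d in zip(k00, k01, k10, k11)]


# Four hypothetical source kernels, each flattened to 9 weights (assumed 3x3).
source = [softmax([math.sin(s + t) for t in range(9)]) for s in range(4)]
target = bilinear_kernel(*source, dy=0.25, dx=0.75)
print(round(sum(target), 12))  # 1.0
```

Because the four weights always sum to one for any `dy` and `dx`, the same holds at every target location, which is why no second normalization pass is needed after expansion.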

3.3. Dynamic Reassembly Module

As shown in the lower part of Figure 2, the dynamic reassembly module is designed to generate the resampled feature map with dimensions $\sigma H \times \sigma W \times C$ for upsampling and $\lceil H / \sigma \rceil \times \lceil W / \sigma \rceil \times C$ for downsampling. To compute the response at position $(i, j)$ in the $c$-th feature map of the output, two sequential operations are performed. First, the predicted kernel at position $(i, j)$ is extracted from the expanded kernel space. Second, a $k_{resize} \times k_{resize}$ neighborhood is retrieved from the $c$-th input feature map, centered at the pixel location corresponding to output position $(i, j)$. For upsampling operations, this corresponds to locations within the $\lfloor H / \sigma \rfloor \times \lfloor W / \sigma \rfloor$ input space, while for downsampling it operates within the $\sigma H \times \sigma W$ input space. The final output response derives from the inner product between the selected kernel and retrieved pixel neighborhood. This mechanism enhances the feature representation by aggregating relevant local information through the content-adaptive kernels.
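The two steps above (pick the kernel for an output position, then take its inner product with the matching input neighborhood) can be sketched for a single channel as follows. This is an illustrative pure-Python version, not the authors' implementation: the function name `reassemble`, the floor-based mapping from output position to source center, and the zero padding at the borders are all assumptions made for the example. The ratio `sigma` follows the convention of Section 3.1, with values above 1 for upsampling and below 1 for downsampling.

```python
import math


def reassemble(feature, kernels, sigma, k):
    """Resample one channel with per-location kernels.

    feature: H x W nested list (one channel of the input map).
    kernels: out_h x out_w nested list; each entry is a flat list of k*k
             weights, assumed to be normalized already.
    sigma:   resampling ratio (> 1 upsamples, < 1 downsamples).
    """
    height, width = len(feature), len(feature[0])
    out_h, out_w = len(kernels), len(kernels[0])
    radius = k // 2
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Assumed mapping from output position to source centre.
            ci = min(height - 1, math.floor(i / sigma))
            cj = min(width - 1, math.floor(j / sigma))
            acc = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = ci + di, cj + dj
                    if 0 <= y < height and 0 <= x < width:  # zero padding
                        weight = kernels[i][j][(di + radius) * k + (dj + radius)]
                        acc += weight * feature[y][x]
            out[i][j] = acc
    return out


# Sanity check: kernels that select only the centre pixel reduce the
# operator to nearest-neighbour upsampling.
centre_only = [0.0] * 9
centre_only[4] = 1.0
kernels = [[centre_only for _ in range(4)] for _ in range(4)]
up = reassemble([[1.0, 2.0], [3.0, 4.0]], kernels, sigma=2, k=3)
print(up[0])  # [1.0, 1.0, 2.0, 2.0]
```

With content-dependent kernels in place of `centre_only`, each output value becomes a weighted mix of its source neighborhood, and the same loop serves both resampling directions.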

4. Experiment and Discussion

In this section, we conduct extensive experiments on two widely used remote sensing object detection benchmarks: DIOR [16] and DOTA [17]. To establish a comprehensive and impartial comparison, we integrate our proposed Lurker module into the FPN structure for upsampling, evaluating its performance against established methods including nearest neighbor interpolation, CARAFE++ [9], DySample [8] and DLU [4]. Similarly, for downsampling within the ResNet50 backbone, we compare Lurker against methods with low parameter overhead, including average pooling, max pooling, CARAFE++ [9], and strided convolution. Methods such as RFD [11] and LIP [32] are excluded from our comparison because of their substantial parameter complexity: they introduce approximately 1 to 2 million additional parameters. This ensures our analysis remains focused on computationally efficient downsampling strategies. All compared modules were carefully re-implemented and evaluated under identical experimental settings to ensure a fair comparison.

4.1. Datasets

The proposed resampling module is evaluated on two established remote sensing object detection benchmarks, the DIOR and DOTA datasets, both employing horizontal bounding boxes for object annotation. These datasets consist of optical satellite imagery acquired in the visible spectrum with three-channel RGB color components. The spatial resolution varies across images in each dataset, reflecting realistic acquisition conditions and presenting significant scale-related challenges for detection algorithms.

4.1.1. DIOR Dataset

The DIOR dataset serves as a large-scale benchmark for optical remote sensing object detection, containing 23,463 images uniformly sized at $800 \times 800$ pixels. The dataset is divided into 11,725 training and 11,738 testing images. The imagery exhibits diverse spatial resolutions ranging from 0.5 to 30 m, introducing substantial scale variations. It also displays natural diversity in illumination, atmospheric conditions, and sensor noise, representing realistic operational environments. Expertly annotated using Google Earth imagery, DIOR includes 192,472 object instances spanning 20 common geographic categories including Airplane, Airport, Bridge, Harbor, Ship, Stadium, and Windmill.

4.1.2. DOTA Dataset

The DOTA dataset serves as a pivotal benchmark for aerial image object detection, distinguished by its rigorous annotation standards and complex scene composition. It comprises 2806 high-resolution images divided into 1403 for training, 468 for validation, and 935 for testing. The original image sizes vary considerably, with dimensions ranging from $800 \times 800$ to $4000 \times 20{,}000$ pixels. To facilitate model processing, all images were cropped into $1024 \times 1024$ patches with a 200-pixel overlap between adjacent patches. The dataset exhibits a spatial resolution range of 0.3 to 4.5 m [33], representing one of the highest-resolution aerial imagery collections publicly available. It encompasses 15 object categories and introduces challenging real-world conditions, notably high object densities of up to 2000 instances per image and extreme scale variations. These characteristics collectively establish DOTA as a robust testbed for evaluating detection algorithms under demanding operational conditions.
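The cropping step can be illustrated with a short sketch that computes where the sliding windows start along one image axis. The patch size and overlap come from the text; the function name `patch_origins` and the handling of the image border (a final window placed flush with the edge) are assumptions, since the text does not say how the remainder is treated.

```python
def patch_origins(length, patch=1024, overlap=200):
    """Top-left offsets of sliding windows along one image axis."""
    stride = patch - overlap
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    # Assumed edge handling: add a final window flush with the border.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


print(patch_origins(4000))  # [0, 824, 1648, 2472, 2976]
```

Applying the function to both axes and pairing the offsets gives the full grid of crops; an image no larger than one patch yields a single window at offset 0.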

4.2. Implementation Details

All experiments were executed on a high-performance computing workstation equipped with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz, 256 GB DDR4 RAM (3000 MHz), and an NVIDIA RTX A6000 GPU (48 GB VRAM). Utilizing the mmdetection framework with PyTorch 2.0.0, we implemented the proposed method and comparative approaches. We employed ResNet-50 as the consistent backbone architecture across all experiments. Experiments were conducted on two publicly available remote sensing datasets, DIOR and DOTA, both comprising satellite imagery. Random horizontal flipping served as the primary data augmentation technique during training. To address the dense target distributions characteristic of aerial imagery, we adjusted two critical testing parameters: nms_pre was raised from 1000 to 2000 and max_per_img from 100 to 2000. These adjustments, consistent with the default configurations in mmrotate, are necessary to handle the high target density and prevalence of small objects in aerial imagery.

Faster R-CNN served as the baseline detection model, optimized using Stochastic Gradient Descent [34] with a mini-batch size of 8, momentum set to 0.9, and weight decay fixed at $1 \times 10^{-4}$. A random seed of 2025 was used for all experiments to ensure reproducibility. For the upsampling experiments, a 12-epoch training schedule (schedule 1x) was employed with an initial learning rate of 0.01, which was reduced by a factor of 10 at epochs 8 and 11. The downsampling experiments utilized a 24-epoch schedule (schedule 2x) with the same initial learning rate of 0.01, similarly reduced 10-fold at epochs 16 and 22.
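The two step-decay schedules can be written out explicitly. The sketch below is framework-independent and only mirrors the numbers given above; the function name `step_lr` is hypothetical, and it assumes zero-based epoch indices with the decay taking effect once the index reaches a milestone, which is a common convention but is not stated in the text.

```python
def step_lr(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate for a zero-based epoch index under step decay."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# 1x schedule (upsampling experiments) and 2x schedule (downsampling).
schedule_1x = [step_lr(e) for e in range(12)]
schedule_2x = [step_lr(e, milestones=(16, 22)) for e in range(24)]
```

Under these assumptions, the 1x schedule trains at 0.01 for eight epochs, at 0.001 for three, and at 0.0001 for the last one; the 2x schedule has the same shape over twice as many epochs.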

The Lurker resampling operator was systematically integrated into both the backbone and neck of the detection network. In the ResNet50 backbone, all standard $3 \times 3$ convolutions with stride = 2 inside Bottleneck blocks were replaced by Lurker modules (Figure 3a), each followed by a $3 \times 3$ convolution with stride = 1 to improve feature extraction. Correspondingly, within the downsampling layers, the conventional stride = 2 convolution was replaced with the Lurker operator along with a $1 \times 1$ convolution using stride = 1. In the Feature Pyramid Network (FPN) [15] neck, nearest neighbor interpolation was replaced by the Lurker module (Figure 3b), which performs 2× upsampling via a learnable kernel. This kernel-based method adaptively produces resampling patterns optimized for reconstructing features while preserving spatial details, without introducing additional convolutional layers.

4.3. Evaluation Metrics

To comprehensively evaluate object detection performance, we employ Mean Average Precision (mAP) and its variants, which are standard metrics in remote sensing detection tasks. The evaluation begins with two fundamental metrics: precision and recall. Precision (P), which measures the reliability of the detected objects, is defined as the ratio of true positives to all positive detections:

$$P = \frac{TP}{TP + FP} \tag{1}$$

Recall (R), which measures the ability to find all relevant objects, is defined as the ratio of true positives to all actual ground-truth objects:

R = \frac{T P}{T P + F N}

(2)

Here $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. These detections are determined based on the Intersection over Union (IoU) metric, which measures the spatial overlap between predicted bounding boxes and ground-truth annotations. The IoU is calculated as the ratio of the intersection area to the union area of the predicted and ground-truth bounding boxes:

$$IoU = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|} \tag{3}$$

where $B_{p}$ represents the predicted bounding box and $B_{gt}$ represents the ground-truth bounding box. A detection is considered a true positive when the IoU exceeds a predefined threshold. The Precision-Recall (PR) curve is then plotted by varying the detection confidence threshold. The Average Precision (AP) for a single class is computed as the area under this PR curve:

$$AP = \int_{0}^{1} P(r) \, dr \tag{4}$$

In practice, this is typically approximated using a discrete summation over a set of equally spaced recall levels. In our evaluation, we report both $mAP_{50}$ and $mAP_{75}$, where $mAP_{50}$ uses an IoU threshold of 0.5, providing a more lenient evaluation suitable for general object detection, while $mAP_{75}$ uses a stricter IoU threshold of 0.75, demanding more precise localization. The overall mean Average Precision (mAP) is finally obtained by averaging the AP values across all object categories:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{5}$$

where $N$ is the total number of classes. The overall mAP serves as our primary accuracy indicator, as it is the most widely adopted benchmark. This is complemented by the scale-specific variants $mAP_{s}$ for small objects (pixel area less than $32^{2}$), $mAP_{m}$ for medium objects (pixel area between $32^{2}$ and $96^{2}$), and $mAP_{l}$ for large objects (pixel area greater than $96^{2}$), which provide detailed performance insights across different object sizes.
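The metric definitions above can be made concrete with a short sketch. It is illustrative only: the 101-point interpolated summation follows the common COCO-style approximation, which is an assumption here since the text only says "equally spaced recall levels", and the function names and the treatment of the exact $32^{2}$ and $96^{2}$ boundaries are likewise hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(is_tp, num_gt, num_points=101):
    """AP for one class from detections sorted by descending confidence.

    is_tp[i] is True when the i-th detection matched a ground-truth box at
    the chosen IoU threshold; num_gt must be positive. The integral of P(r)
    is approximated by averaging the interpolated precision at equally
    spaced recall levels.
    """
    precisions, recalls = [], []
    tp = fp = 0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    total = 0.0
    for k in range(num_points):
        r = k / (num_points - 1)
        # Interpolated precision: best precision at any recall >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / num_points


def mean_ap(per_class_ap):
    """mAP as the plain average of per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


def size_bucket(area):
    """Scale bucket by pixel area (boundary handling is an assumption)."""
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"
```

Evaluating `average_precision` with matches decided at an IoU threshold of 0.5 or 0.75 and averaging over classes with `mean_ap` corresponds to the $mAP_{50}$ and $mAP_{75}$ variants, respectively.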

Model efficiency is evaluated through architectural lightweightness, measured by the parameter count (#Params) of the resampling module, and inference speed, quantified by Frames Per Second (FPS) on the test platform. Additionally, we include computational complexity metrics: Floating Point Operations (FLOPs) to measure computational requirements, and GPU memory usage (GPU RAM) during inference to assess practical deployment constraints. The FLOPs for a convolutional layer can be calculated as [35]:

$$FLOPs = 2 \times H \times W \times C_{in} \times C_{out} \times K_{h} \times K_{w} \tag{6}$$

where $H$ and $W$ are the output feature map dimensions, $C_{in}$ and $C_{out}$ are the input and output channels, and $K_{h}$ and $K_{w}$ are the kernel dimensions. The total FLOPs of the network is the sum over all layers. GPU memory usage is measured as the peak memory consumption during inference of a single image batch. Note that #Params refers to the total number of parameters of the compared resampling modules used to replace either all downsampling modules in the ResNet50 backbone or all upsampling modules in the FPN structure. Together, these metrics address accuracy, efficiency, and deployability.
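As a worked instance of Equation (6), the sketch below evaluates the formula for two convolutional layers and sums them. The layer sizes are hypothetical values chosen only for illustration, and the function name `conv_flops` is not from the paper; the factor 2 is commonly read as one multiplication plus one addition per multiply-accumulate.

```python
def conv_flops(out_h, out_w, c_in, c_out, k_h, k_w):
    """FLOPs of one convolutional layer following Equation (6).

    Bias terms and grouped convolutions are ignored in this sketch.
    """
    return 2 * out_h * out_w * c_in * c_out * k_h * k_w


# Hypothetical layer sizes, chosen only for illustration.
layers = [
    dict(out_h=56, out_w=56, c_in=64, c_out=64, k_h=3, k_w=3),
    dict(out_h=56, out_w=56, c_in=64, c_out=256, k_h=1, k_w=1),
]

# Network total is the sum over layers.
total_flops = sum(conv_flops(**layer) for layer in layers)
print(f"{total_flops / 1e9:.3f} GFLOPs")  # prints "0.334 GFLOPs"
```

Because $H$ and $W$ refer to the output map, halving the spatial resolution of a layer's output cuts its cost by a factor of four, which is why the placement of downsampling operators affects the totals reported in GFLOPs.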

4.4. Results and Analysis

4.4.1. Results and Analysis of Upsampling

Figure 4 and Table 1 present comprehensive qualitative and quantitative comparisons of upsampling methods on the DIOR and DOTA datasets, evaluating detection performance, inference speed, model complexity, and computational efficiency.

The qualitative visualization in Figure 4 compares detection results using nearest neighbor interpolation, CARAFE, DySample, DLU, and our proposed Lurker method. On the DIOR dataset (first row), our method successfully detects all targets while other approaches exhibit detection misses, particularly when objects are difficult to distinguish from cluttered backgrounds. On the DOTA dataset (second row), all baseline methods produce multiple false alarms when processing small objects embedded in visually similar backgrounds, whereas Lurker maintains strong discriminative ability with only one false positive. These observations suggest that Lurker offers superior capability in distinguishing small targets within complex scenes.

Quantitatively, Table 1 shows that Lurker achieves a favorable balance between performance and efficiency across multiple metrics. On the DIOR dataset, Lurker attains a competitive mAP of 43.9 while achieving the highest inference speed among learnable upsamplers, 70.7 FPS, a 56% speed improvement over CARAFE. Lurker accomplishes this with only 149.29 GFLOPs and 1758 MB of GPU RAM, the lowest computational complexity and memory footprint among all learnable methods.

On the more challenging DOTA dataset, Lurker ranks second in both small object detection, with an $mAP_{s}$ of 27.6, and medium object detection, with an $mAP_{m}$ of 46.1, while maintaining the highest inference speed of 57.2 FPS. Notably, Lurker achieves these results with only 226.54 GFLOPs and 2312 MB of GPU RAM. The combination of low GFLOPs and low GPU RAM usage, coupled with competitive detection performance, underscores Lurker’s practical advantage for real-world applications where computational resources are constrained.

The strong performance on small and medium objects, combined with substantial efficiency gains across all metrics (FPS, GFLOPs, GPU RAM, and parameters), validates Lurker’s design approach of replacing complex learnable components with efficient bilinear interpolation while preserving effective feature representation. This efficiency-performance trade-off makes Lurker particularly suitable for deployment in resource-constrained environments typical of remote sensing applications.

4.4.2. Results and Analysis of Downsampling

Figure 5 and Table 2 present comprehensive qualitative and quantitative comparisons of downsampling methods on the DIOR and DOTA datasets, demonstrating Lurker’s exceptional balance between detection accuracy and computational efficiency.

The qualitative visualization in Figure 5 compares detection results using strided convolution, average pooling, max pooling, CARAFE++, and our Lurker method. On the DIOR dataset (first row), traditional methods including average pooling, max pooling, and strided convolution suffer from noticeable missed detections when handling dense similar targets. While the learnable CARAFE++ method introduces false alarms in such scenarios, our approach detects all targets in this example without any errors. On the DOTA dataset (second row), all four comparison methods exhibit either missed detections or false alarms when processing small targets against complex backgrounds, whereas our method identifies all targets without any such errors. These visual comparisons illustrate Lurker’s capability in detecting small targets and distinguishing them from challenging backgrounds.

Quantitatively, Table 2 demonstrates Lurker’s efficiency advantages across multiple metrics. On the DIOR dataset, Lurker delivers competitive performance with an mAP of 41.1 while achieving a processing speed of 69.7 FPS. This frame rate substantially surpasses that of strided convolution (43.2 FPS) and CARAFE++ (54.6 FPS). Importantly, Lurker attains these results with only 148.35 GFLOPs and 1818 MB of GPU RAM, showing better efficiency than other learnable methods. This advantage is reinforced by Lurker’s small parameter count of 24.85K, an 82% reduction relative to CARAFE++.

Lurker’s advantages become more pronounced on the complex DOTA dataset, where it achieves the best mAP of 38.7 among all compared methods, including CARAFE++ (38.2). Scale-specific analysis shows particularly strong performance on large objects, with an $mAP_{l}$ of 47.9 compared to 46.5 for CARAFE++, which we attribute to Lurker’s content-aware dynamic kernels that adaptively aggregate features over semantically meaningful receptive fields. Lurker also maintains the highest inference speed of 51.2 FPS while using 225.11 GFLOPs and 2188 MB of GPU RAM, whereas CARAFE++ requires 226.42 GFLOPs and 4358 MB of GPU RAM. This roughly 50% reduction in memory usage, combined with slightly lower computational complexity and higher accuracy, underscores Lurker’s practical advantages for processing large-scale remote sensing imagery.
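The relative savings quoted for the DOTA downsampling comparison can be checked directly from the reported figures. The helper name `relative_reduction` below is hypothetical; the numbers are those given in the text.

```python
def relative_reduction(reference, value):
    """Fractional reduction of `value` relative to `reference`."""
    return (reference - value) / reference


# Figures quoted in the text: CARAFE++ as reference, Lurker as value.
mem_saving = relative_reduction(4358, 2188)        # GPU RAM in MB
flops_saving = relative_reduction(226.42, 225.11)  # GFLOPs

print(f"GPU RAM reduction: {mem_saving:.1%}")    # prints "GPU RAM reduction: 49.8%"
print(f"GFLOPs reduction:  {flops_saving:.1%}")  # prints "GFLOPs reduction:  0.6%"
```

The memory saving is therefore close to one half, while the difference in FLOPs is under one percent, so the efficiency gap between the two operators is driven mainly by memory footprint and speed rather than arithmetic cost.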

The combination of leading accuracy, high speed, minimal parameters, and low computational footprint validates Lurker’s design principle: a simple yet effective dynamic kernel generation mechanism, based on bilinear interpolation from a compact source space, enables superior feature resampling for remote sensing object detection. Its consistent performance across all object scales and computational metrics confirms its effectiveness in handling the multiscale challenges inherent in remote sensing imagery while maintaining exceptional efficiency.

4.4.3. Visualization of the Lurker Mechanism

To intuitively demonstrate the operational principles of our proposed Lurker module, we present comparative visualizations of feature representations in Figure 6. The left three columns display downsampling results from ResNet50’s final residual layer, while the right three columns show upsampling results from the P2 level of the FPN architecture. Specifically, columns 1–3 compare the original ResNet50 features, average pooling output, and Lurker downsampling results. The visualization clearly demonstrates that our Lurker achieves more concentrated attention on target regions, as shown in the first row where Lurker exhibits stronger focus on dam structures compared to average pooling. Similarly, columns 4–6 present the original P2 features, nearest-neighbor interpolation results, and Lurker upsampling outputs. These comparisons reveal that Lurker effectively captures higher-level semantic information during upsampling, providing more precise target localization and superior background suppression. For instance, in the second row, Lurker accurately localizes expressway service areas while significantly reducing background interference compared to nearest-neighbor interpolation.

4.5. Ablation Study

This subsection presents an ablation study to validate the effectiveness of our proposed Lurker method. The study is structured into two parts: module quantity analysis and hyperparameter sensitivity evaluation. All ablated models are assessed using the absolute performance in terms of $mAP$, $mAP_{50}$, and $mAP_{75}$, as well as their relative improvement $\Delta$ over the baseline. Here, $mAP_{50}$ and $mAP_{75}$ denote the mean average precision computed at IoU thresholds of 0.5 and 0.75, respectively. We first investigate the effect of progressively inserting the Lurker module into the FPN and ResNet50 architectures. We then analyze the sensitivity of key hyperparameters, including the encoder kernel size and the receptive field settings for upsampling and downsampling operations. These experiments collectively substantiate the design rationale of Lurker and its robustness in handling multiscale objects in remote sensing imagery.

Table 3 systematically evaluates the impact of progressively replacing nearest neighbor interpolation with our Lurker modules throughout the FPN architecture. The baseline configuration without Lurker modules establishes reference performance at 43.2 mAP, 70.6 $mAP_{50}$, and 45.7 $mAP_{75}$. The results show that performance generally improves as more Lurker modules are inserted. While a single module shows minimal performance variation, incorporating more modules yields consistent gains across all metrics. The four-module configuration achieves the largest improvement, raising mAP by 0.7 points to 43.9, $mAP_{50}$ by 1.2 points to 71.8, and $mAP_{75}$ by 0.9 points to 46.6. These progressive enhancements confirm Lurker’s effectiveness in improving multiscale feature representation for remote sensing object detection, with complete replacement delivering the most substantial performance benefits.
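The improvement $\Delta$ reported for the full FPN replacement is a plain difference against the baseline. The short sketch below recomputes it from the values quoted in this paragraph; the dictionary layout is only an illustrative choice.

```python
# Values quoted in the text for the FPN ablation (Table 3).
baseline = {"mAP": 43.2, "mAP50": 70.6, "mAP75": 45.7}
four_modules = {"mAP": 43.9, "mAP50": 71.8, "mAP75": 46.6}

# Round to one decimal to suppress floating-point noise in the differences.
delta = {k: round(four_modules[k] - baseline[k], 1) for k in baseline}
print(delta)  # prints {'mAP': 0.7, 'mAP50': 1.2, 'mAP75': 0.9}
```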

Table 4 presents a systematic ablation study evaluating the impact of progressively replacing standard downsampling operations with our Lurker modules throughout the ResNet50 architecture. The baseline configuration without any Lurker modules establishes reference performance at 39.5 mAP, 63.5 $mAP_{50}$, and 43.0 $mAP_{75}$. Introducing a single Lurker module to replace the maxpool layer demonstrates immediate performance improvements, achieving a significant 1.2 point mAP gain and a remarkable 2.9 point improvement in $mAP_{50}$. While the two-module configuration experiences a slight performance dip, the three-module setup recovers with a solid 0.7 point mAP improvement and substantial 2.5 point gain in $mAP_{50}$. The most impressive results emerge with full replacement of four downsampling operations, where Lurker delivers maximum performance gains of 1.6 points in mAP, 3.8 points in $mAP_{50}$, and 0.3 points in $mAP_{75}$. These progressive improvements demonstrate Lurker’s capacity to enhance feature representation throughout the network architecture, with complete replacement yielding the most substantial performance benefits. The consistent enhancement in $mAP_{50}$ across all configurations highlights Lurker’s particular effectiveness in improving detection accuracy for standard IoU thresholds, making it highly suitable for practical object detection applications in remote sensing imagery.

Table 5 and Table 6 present comprehensive ablation studies on the key hyperparameters of the Lurker module, focusing on the encoder kernel size and receptive field configuration for both upsampling and downsampling operations. In the case of upsampling, the configuration with an encoder kernel size of 3 and an upsampling kernel size of 3 achieves the best performance of 44.2 mAP. However, we ultimately select a configuration with kernel sizes of 1 and 5, which attains a competitive mAP of 43.9. This decision is driven by practical deployment needs, as the minimal performance drop of only 0.3 mAP is offset by a substantial reduction in parameter count and computational overhead, consistent with Lurker’s focus on extreme lightweight design. For downsampling, the highest performance of 41.1 mAP is achieved using encoder and downsampling kernel sizes of 5 and 7, respectively. In contrast, we adopt a more efficient setup with sizes of 1 and 3, yielding 40.9 mAP, which represents only a marginal decrease of 0.2 mAP. This configuration significantly reduces computational complexity, an important advantage given the frequent use of downsampling operations in the backbone network. These hyperparameter choices reflect the core design philosophy of Lurker, which aims to maintain high detection accuracy while minimizing computational cost. The results confirm that our method successfully balances performance and efficiency, making it highly suitable for resource-constrained remote sensing applications.

4.6. Discussion

In this section, we first position our proposed framework relative to existing research in dynamic kernel learning and remote sensing feature resampling. We then discuss the limitations of Lurker and conclude by exploring potential directions for future research.

Current approaches in remote sensing can be grouped into three main categories: kernel-based generation (e.g., CAFUS [23], CAU [24], ADM [10], CADM [12], EDown [31], ScDown [36], CARAFE [2], and DLU [4]), pixel displacement methods (e.g., SGFU [25], DySample [8], FGUM [27], and GU [26]), and feature rearrangement techniques (e.g., SP-Conv [37], LRU [28], and Sub-Pixel Conv). While these approaches demonstrate promise, they exhibit distinct limitations that we analyze per category below.

Kernel-based methods: Kernel-based methods typically implement content-aware filtering but often require substantial computational resources. This pattern is evident in CARAFE’s large kernel prediction, the multi-stage architectures of CAU and CAFUS, and the channel-spatial attention mechanisms used in ADM and CADM, all of which introduce considerable parameters and complexity. While EDown and ScDown pursue more efficient designs, they usually sacrifice kernel adaptability or dynamic range to obtain these efficiency gains.
Pixel displacement methods: Pixel displacement methods like DySample, GU, and SGFU achieve operational efficiency by sampling from a small fixed receptive field, generally limited to a $2 \times 2$ area. This approach naturally constrains their capacity to incorporate wider contextual information, frequently causing loss of fine details and boundary artifacts in complex remote sensing imagery.
Feature rearrangement methods: Feature rearrangement techniques such as Sub-Pixel Conv and LRU present another limitation through their dependence on static transformations that operate uniformly across content. Since these methods cannot adjust to local semantic variations, they struggle to manage the high heterogeneity found in geospatial imagery.

As a representative of the kernel-based generation paradigm, Lurker preserves the essential concept of dynamic kernel generation through a significantly simplified structure, standing out as the most lightweight operator in its category while maintaining competitive accuracy. In contrast to pixel-displacement methods, Lurker addresses their inherent limitation of constrained receptive fields by employing content-adaptive kernels with configurable sizes. This design allows it to capture multi-scale contextual information, which is crucial for interpreting complex geospatial scenes. Likewise, whereas feature rearrangement techniques apply uniform transformations irrespective of local content, which limits their flexibility, Lurker incorporates a dynamic sampling mechanism that adapts to local semantic variations. This capability is particularly vital for handling the high heterogeneity present in remote sensing imagery. By integrating these strengths, Lurker demonstrates that effective resampling in remote sensing must strike a balance between semantic awareness and operational efficiency, suggesting a direction for future designs that achieve robust performance without excessive complexity.

Despite these advantages, certain limitations warrant attention. The main difficulty arises when incorporating Lurker as a downsampling operator into pre-trained models, where structural changes interfere with weight compatibility and hinder effective use of pre-trained initialization. This limitation is clearly reflected in experimental outcomes, where downsampling setups consistently yield poorer results than upsampling across both datasets, with particular impact on small and medium object detection. Moreover, the qualitative assessment, though demonstrating Lurker’s attention mechanisms, depends on limited scene examples that cannot comprehensively represent performance across varied environmental conditions and sensor types. Consequently, generalizability to wider operational situations remains partially unconfirmed. Additionally, the evaluation uses original-resolution images without considering image compression effects, leaving practical implementation concerns partially unresolved given the common presence of compression artifacts in real-world remote sensing applications.

To overcome these challenges, subsequent research will concentrate on three primary directions. First, we will create weight adaptation methods to settle compatibility problems between Lurker and pre-trained architectures, maintaining output distribution consistency while protecting learned representations. Next, we will thoroughly examine how image compression influences Lurker’s performance, investigating its incorporation into complete preprocessing pipelines to reduce information loss from standard compression techniques. Finally, we will extend validation across diverse geographical settings, sensor properties, and operational scenarios to completely evaluate generalization capacity and robustness. Through these organized investigations, we intend to improve Lurker’s practical value and reliability for real-world remote sensing implementations while methodically addressing existing constraints.

5. Conclusions

This paper presents Lurker, a learned and unified resampling framework that addresses critical challenges in remote sensing object detection. The core contribution lies in the development of a lightweight content-aware kernel generation mechanism that handles both upsampling and downsampling operations within a single architectural paradigm. Through systematic evaluation on the challenging DIOR and DOTA datasets, our experiments demonstrate that Lurker achieves detection accuracy comparable or superior to that of state-of-the-art methods while maintaining exceptionally low parameter overhead. The framework reduces parameters by approximately 90% compared to CARAFE and 82% compared to CARAFE++. The ablation studies support Lurker’s design principles, confirming the effectiveness of its three-component architecture and the robustness of hyperparameter selections across different operational scenarios. Our analysis indicates that the combination of compact source kernel generation with efficient bilinear interpolation enables effective feature representation while minimizing computational complexity. The method demonstrates particular strength in handling the multiscale challenges inherent in remote sensing imagery, showing performance improvements across small, medium, and large object categories. While Lurker represents a significant advancement in efficient resampling for remote sensing applications, our discussion has identified important limitations that warrant future investigation. The framework’s current limitation in fully utilizing pre-trained weights during downsampling integration presents an opportunity for further refinement. Additionally, the exploration of image compression effects and broader environmental validation will be crucial for enhancing practical deployment capabilities.
Despite these challenges, Lurker establishes a solid foundation for future research in lightweight dynamic resampling, offering a compelling solution for resource-constrained remote sensing applications where the balance between accuracy and efficiency is paramount.

Author Contributions

Conceptualization, J.X. and R.F.; Data curation, J.X.; Formal analysis, J.X., Z.X. and S.W.; Investigation, J.X., Z.X. and S.W.; Methodology, J.X. and R.F.; Resources, J.X.; Software, J.X. and R.F.; Supervision, R.F. and P.Z.; Validation, J.X.; Visualization, J.X.; Writing—original draft, J.X., Z.X. and S.W.; Writing—review & editing, J.X. and R.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Foundation of the National Key Laboratory of Automatic Target Recognition, grant number JKWATR-240301.

Data Availability Statement

The data used in this study are public datasets. The DIOR dataset can be obtained from https://opendatalab.org.cn/OpenDataLab/DIOR (accessed on 22 October 2025). The DOTA dataset can be obtained from https://captain-whu.github.io/DOTA/dataset.html (accessed on 22 October 2025). Code will be available soon.

Acknowledgments

The authors would like to express their gratitude for the valuable feedback and suggestions provided by all the anonymous reviewers and the editorial team.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

FPN	Feature Pyramid Network
Lurker	Learned Unified Resampling Kernel
CARAFE	Content-Aware ReAssembly of FEatures
DLU	Dynamic Lightweight Upsampling
FPS	Frames Per Second
mAP	Mean Average Precision
DIOR	Dataset for Object Detection in Optical Remote Sensing
DOTA	Dataset for Object Detection in Aerial Images

References

1. Nong, Y.J.; Wang, J.J. Real-time Remote Sensing Object Detection Method Based on Embedded System. Acta Opt. Sin. 2021, 41, 171–178.
2. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 3007–3016.
3. Dai, Y.; Lu, H.; Shen, C. Learning Affinity-Aware Upsampling for Deep Image Matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6841–6850.
4. Fu, R.; Hu, Q.; Dong, X.; Gao, Y.; Li, B.; Zhong, P. Lighten CARAFE: Dynamic lightweight upsampling with guided reassemble kernels. In Proceedings of the International Conference on Pattern Recognition, Kolkata, India, 1–5 December 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 383–399.
5. Lu, H.; Liu, W.; Fu, H.; Cao, Z. FADE: Fusing the Assets of Decoder and Encoder for Task-Agnostic Upsampling. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 231–247.
6. Lu, H.; Liu, W.; Ye, Z.; Fu, H.; Liu, Y.; Cao, Z. SAPA: Similarity-aware point affiliation for feature upsampling. Adv. Neural Inf. Process. Syst. 2022, 35, 20889–20901.
7. Liu, Y.; Li, J.; Pang, Y.; Nie, D.; Yap, P.T. The devil is in the upsampling: Architectural decisions made simpler for denoising with deep image prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12408–12417.
8. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6027–6037.
9. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe++: Unified content-aware reassembly of features. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4674–4687.
10. Zhang, Y.; Liu, T.; Zhen, J.; Kang, Y.; Cheng, Y. Adaptive Downsampling and Scale Enhanced Detection Head for Tiny Object Detection in Remote Sensing Image. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6003605.
11. Lu, W.; Chen, S.B.; Tang, J.; Ding, C.H.Q.; Luo, B. A Robust Feature Downsampling Module for Remote-Sensing Visual Tasks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4404312.
12. Zhang, L.; Liu, Y.; Wang, X.; He, Y.; Li, G.; Zhang, Y.; Liu, C.; Jiang, Z.; Liu, Y. CADDN: A Content-Aware Downsampling-Based Detection Method for Small Objects in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404517.
13. Zheng, P.; Zhao, Y.; Cui, Z.; Li, Y. PRNet: Original Information Is All You Have. arXiv 2025, arXiv:2510.09531.
14. Li, Q.; Fan, Z.; Zhao, X. An advanced adaptive detector for oriented objects in remote sensing imagery. Sci. Rep. 2025, 15, 33877.
15. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
16. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
17. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983.
18. Hong, Y.; Shu, Y.; Guo, S. AAD-YOLO: An Improved YOLOv8 Model for Complex Remote Sensing Scenarios. IEEE Access 2025, 13, 102578–102588.
19. Zhang, S.; He, G.; Chen, H.B.; Jing, N.; Wang, Q. Scale Adaptive Proposal Network for Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 864–868.
20. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
21. Chen, S.; Zhao, J.; Zhou, Y.; Wang, H.; Yao, R.; Zhang, L.; Xue, Y. Info-FPN: An Informative Feature Pyramid Network for object detection in remote sensing images. Expert Syst. Appl. 2023, 214, 119132.
22. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
23. Zhang, K.; Shen, H. Multi-Stage Feature Enhancement Pyramid Network for Detecting Objects in Optical Remote Sensing Images. Remote Sens. 2022, 14, 579.
24. Yang, F.; Yuan, X.; Ran, J.; Shu, W.; Zhao, Y.; Qin, A.; Gao, C. Accurate Instance Segmentation for Remote Sensing Images via Adaptive and Dynamic Feature Learning. Remote Sens. 2021, 13, 4774.
25. Li, Z.; Hu, X.; Qian, J.; Zhao, T.; Xu, D.; Wang, Y. Self-Supervised Feature Contrastive Learning for Small Weak Object Detection in Remote Sensing. Remote Sens. 2025, 17, 1438.
26. Mazzini, D. Guided Upsampling Network for Real-Time Semantic Segmentation. arXiv 2018, arXiv:1807.07466.
27. Li, Z.; Li, E.; Xu, T.; Samat, A.; Liu, W. Feature Alignment FPN for Oriented Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6001705.
28. Lin, B.; Yang, G.; Zhang, Q.; Zhang, G. Semantic Segmentation Network Using Local Relationship Upsampling for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8006105.
29. Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive Balanced Network for Multiscale Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5614914.
30. Wang, Z.; Bai, J.; Zhang, Q.; Shao, C. Dual-Path Downsampling Algorithm Based on HWD-MP. In Proceedings of the 2024 3rd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), Hangzhou, China, 11–13 October 2024; pp. 148–152.
31. Li, H.; Ma, J.; Zhang, J. ELNet: An Efficient and Lightweight Network for Small Object Detection in UAV Imagery. Remote Sens. 2025, 17, 2096.
32. Gao, Z.; Wang, L.; Wu, G. LIP: Local Importance-Based Pooling. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3354–3363.
33. Xie, X.; Cheng, G.; Yao, Y.; Yao, X.; Han, J. Dynamic Feature Fusion for Remote Sensing Image Object Detection. J. Comput. 2022, 45, 735–747.
34. Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407.
35. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. arXiv 2016, arXiv:1611.06440.
36. Zheng, X.; Bi, J.; Li, K.; Zhang, G.; Jiang, P. SMN-YOLO: Lightweight YOLOv8-Based Model for Small Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 7002105.
37. Liu, Y.; Yang, D.; Song, T.; Ye, Y.; Zhang, X. YOLO-SSP: An object detection model based on pyramid spatial attention and improved downsampling strategy for remote sensing images. Vis. Comput. 2025, 41, 1467–1484.

Figure 1. Illustration of the Lurker working mechanism. Left of the dotted line: multi-level FPN features from the Faster R-CNN baseline. Right of the dotted line: multi-level FPN features from Faster R-CNN with Lurker. For sampled locations, the figure shows the accumulated reassembled regions in the top-down pathway of the FPN. Information inside each such region is reassembled into the corresponding reassembly center, and salient regions exhibit information gain.

Figure 2. The overall framework of our Lurker. A feature map of size $C \times H \times W$ is upsampled or downsampled by a factor of $σ = 2$ in this figure.

Figure 3. Integration of Lurker modules in detection architectures: (a) deployment in the FPN’s top-down pathway, replacing nearest-neighbor interpolation with content-aware upsampling to expand features by $2 \times$ while recovering spatial details; (b) integration within ResNet’s Bottleneck structure, replacing strided convolutions with learnable sub-pixel reorganization for $2 \times$ downsampling followed by stride-free convolutions.
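
The two fixed operations named in this caption can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: `nearest_upsample` is the content-agnostic baseline that Lurker replaces in the FPN, and `space_to_depth` is the standard non-learnable sub-pixel rearrangement, used here only as an assumed stand-in for the reorganization step (the learned kernels of Lurker are omitted). The function names and the nested-list layout (C × H × W) are hypothetical.

```python
def nearest_upsample(x, r=2):
    """Nearest-neighbor upsampling of a C x H x W nested list by factor r."""
    return [
        [[row[j // r] for j in range(len(row) * r)]  # repeat each column r times
         for row in channel for _ in range(r)]       # repeat each row r times
        for channel in x
    ]


def space_to_depth(x, r=2):
    """Rearrange C x H x W into (C*r*r) x (H//r) x (W//r) without losing values.

    Assumption: plain (non-learnable) sub-pixel rearrangement; each r x r
    spatial block is spread over r*r output channels.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    if h % r or w % r:
        raise ValueError("H and W must be divisible by r")
    out = []
    for ch in range(c):
        for dy in range(r):          # vertical offset inside each block
            for dx in range(r):      # horizontal offset inside each block
                out.append([
                    [x[ch][i * r + dy][j * r + dx] for j in range(w // r)]
                    for i in range(h // r)
                ])
    return out


# A single-channel 4 x 4 feature map with values 0..15.
feature = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
down = space_to_depth(feature)               # 4 channels of size 2 x 2
up = nearest_upsample([[[1, 2], [3, 4]]])    # 1 channel of size 4 x 4
print(len(down), len(down[0]), len(down[0][0]))
print(up[0])
```

Unlike pooling or a strided convolution, the rearrangement is a pure permutation: halving the spatial resolution moves every input value into the channel dimension rather than discarding it, which is why a stride-free convolution can follow it.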

Figure 4. Qualitative visual comparisons of various upsampling operators combined with FPN across two remote sensing datasets, arranged from top to bottom as DIOR and DOTA, with methods ordered left to right: nearest-neighbor interpolation, CARAFE, DySample, DLU, and Lurker (Ours). GT represents Ground Truth. True positives, false positives, and false negatives are indicated by green, blue, and red rectangles, respectively.

Figure 5. Qualitative visual comparisons of various downsampling operators combined with ResNet50 across two remote sensing datasets, arranged from top to bottom as DIOR and DOTA, with methods ordered left to right: Stride Conv, Avgpool, Maxpool, CARAFE++, and Lurker (Ours). GT represents Ground Truth. True positives, false positives, and false negatives are indicated by green, blue, and red rectangles, respectively.

Figure 6. Visual comparison of feature maps generated by baseline methods and our Lurker framework on the DIOR dataset. The left three columns present downsampling effects extracted from the final residual layer of ResNet50, while the right three columns display upsampling results obtained from the largest P2 level in the FPN architecture. Redder areas indicate higher model sensitivity at those locations.

Table 1. Upsampling detection results using Faster R-CNN on the DIOR and DOTA datasets with a $1 \times$ training schedule. The GFLOPs and GPU RAM values reported correspond to the full Faster R-CNN model integrated with the respective upsampling methods listed in the second column. Best performance is highlighted in boldface and the second-best is underlined, except for the non-learnable methods above the dashed line.

Dataset	Method	FPS	GFLOPs	GPU RAM (MB)	# Params (K)	${mAP}_{s}$	${mAP}_{m}$	${mAP}_{l}$	$mAP$
DIOR	Nearest	73.1	148.31	1802	0	12.7	35.4	60.1	43.2
DIOR	CARAFE	45.2	150.18	1856	296.60	13.3	35.4	60.8	44.0
DIOR	DySample	69.3	149.36	1850	32.92	13.2	35.4	60.7	43.8
DIOR	DLU	64.3	149.67	1848	141.96	12.8	35.7	61.1	44.2
DIOR	Lurker	70.7	149.29	1758	25.72	12.8	35.4	60.8	43.9
DOTA	Nearest	59.7	225.06	2136	0	25.6	45.3	49.7	41.9
DOTA	CARAFE	33.9	228	2328	296.60	26.5	46.2	49.9	42.6
DOTA	DySample	56.9	226.66	2184	32.92	27.5	45.9	50.6	42.5
DOTA	DLU	54.7	227.17	2330	141.96	28.0	45.7	51.8	42.9
DOTA	Lurker	57.2	226.54	2312	25.72	27.6	46.1	50.6	42.4
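
To make the parameter comparison in Table 1 concrete, the sketch below expresses Lurker's added parameter count as a fraction of each other learnable upsampler's. The counts are copied from the parameter column of the table; the dictionary and function names are illustrative only.

```python
# Added parameters (in thousands) of each learnable upsampler, from Table 1.
PARAMS_K = {"CARAFE": 296.60, "DySample": 32.92, "DLU": 141.96, "Lurker": 25.72}


def relative_params(reference="Lurker"):
    """Return the reference method's parameter count as a fraction of each other method's."""
    ref = PARAMS_K[reference]
    return {
        name: ref / count
        for name, count in PARAMS_K.items()
        if name != reference
    }


for name, ratio in relative_params().items():
    print(f"Lurker adds {ratio:.1%} of the parameters added by {name}")
```

By this measure Lurker adds roughly 9% of CARAFE's parameters, 18% of DLU's, and 78% of DySample's, while the mAP values in the table stay within a few tenths of a point of those methods.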

Table 2. Downsampling detection results using Faster R-CNN on the DIOR and DOTA datasets with a $2 \times$ training schedule. The GFLOPs and GPU RAM values reported correspond to the full Faster R-CNN model integrated with the respective downsampling methods listed in the second column. Best performance is highlighted in boldface and the second-best is underlined, except for the non-learnable methods above the dashed line.

Dataset	Method	FPS	GFLOPs	GPU RAM (MB)	# Params (K)	${mAP}_{s}$	${mAP}_{m}$	${mAP}_{l}$	$mAP$
DIOR	Avgpool	70.8	148.31	1736	0	10.3	32.7	58.5	41.1
DIOR	Maxpool	71.0	148.31	1756	0	10.5	32.0	59.3	41.6
DIOR	Conv	43.2	149.79	2206	5885.90	9.8	31.0	56.5	39.5
DIOR	CARAFE++	54.6	149.14	3094	137.20	10.7	32.9	58.4	41.4
DIOR	Lurker	69.7	148.35	1818	24.85	10.4	32.8	58.8	41.1
DOTA	Avgpool	58.3	225.06	2254	0	21.4	39.0	45.3	37.1
DOTA	Maxpool	58.0	225.06	2294	0	21.5	39.9	45.6	37.5
DOTA	Conv	41.0	227.47	2870	5885.90	23.5	39.5	43.2	36.4
DOTA	CARAFE++	39.9	226.42	4358	137.20	22.3	40.9	46.5	38.2
DOTA	Lurker	51.2	225.11	2188	24.85	23.6	40.2	47.9	38.7

Table 3. Ablation study on the effect of replacing nearest-neighbor interpolation with our Lurker in FPN. Δ values indicate the performance change relative to the baseline (zero). Positive Δ values (↑) are bolded to highlight performance improvements.

Number	$mAP$	$Δ mAP ↑$	${mAP}_{50}$	$Δ {mAP}_{50} ↑$	${mAP}_{75}$	$Δ {mAP}_{75} ↑$
zero	43.2	–	70.6	–	45.7	–
one	43.1	−0.1	70.9	+0.3	45.3	−0.4
two	43.5	+0.3	71.3	+0.7	46.0	+0.3
three	43.6	+0.4	71.4	+0.8	46.3	+0.6
four	43.9	+0.7	71.8	+1.2	46.6	+0.9
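
The Δ columns are plain differences from the `zero` row. As a quick consistency check, the short sketch below recomputes them from the reported values in Table 3. The variable and function names are illustrative, and rounding to one decimal place is an assumption chosen to match the table's precision.

```python
# (mAP, mAP_50, mAP_75) keyed by the table's "Number" column, copied from Table 3.
RESULTS = {
    "zero": (43.2, 70.6, 45.7),
    "one": (43.1, 70.9, 45.3),
    "two": (43.5, 71.3, 46.0),
    "three": (43.6, 71.4, 46.3),
    "four": (43.9, 71.8, 46.6),
}


def deltas(results, baseline="zero"):
    """Return per-metric differences from the baseline row, rounded to 0.1."""
    base = results[baseline]
    return {
        name: tuple(round(value - ref, 1) for value, ref in zip(values, base))
        for name, values in results.items()
        if name != baseline
    }


for name, row in deltas(RESULTS).items():
    print(name, " ".join(f"{d:+.1f}" for d in row))
```

The recomputed differences agree with the Δ columns as printed, including the small drops in $mAP$ and ${mAP}_{75}$ for the `one` row.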

Table 4. Ablation study on the effect of replacing Stride Conv with our Lurker in ResNet50. Δ values indicate the performance change relative to the baseline (zero). Positive Δ values (↑) are bolded to highlight performance improvements.

Number	$mAP$	$Δ mAP ↑$	${mAP}_{50}$	$Δ {mAP}_{50} ↑$	${mAP}_{75}$	$Δ {mAP}_{75} ↑$
zero	39.5	–	63.5	–	43.0	–
one	40.7	+1.2	66.4	+2.9	43.1	+0.1
two	39.2	−0.3	64.7	+1.2	41.4	−1.6
three	40.2	+0.7	66.0	+2.5	42.6	−0.4
four	41.1	+1.6	67.3	+3.8	43.3	+0.3

Table 5. Detection results for various encoder kernel sizes $k_{encoder}$ and upsampling receptive fields $k_{up}$. Bolded figures represent the highest values.

$k_{encoder}$	$k_{up}$	$mAP$	${mAP}_{50}$	${mAP}_{75}$
1	3	43.6	71.6	46.6
1	5	43.9	71.8	46.6
3	3	44.2	71.9	47.1
3	5	43.9	71.9	46.6
3	7	42.9	71.0	45.1
5	5	43.6	71.4	46.5
5	7	43.7	71.5	46.5

Table 6. Performance ($mAP$, ${mAP}_{50}$, ${mAP}_{75}$) under different combinations of $k_{encoder}$ and $k_{down}$. Higher AP values indicate better detection performance. Bolded figures represent the highest values.

$k_{encoder}$	$k_{down}$	$mAP$	${mAP}_{50}$	${mAP}_{75}$
1	3	40.9	66.8	43.5
1	5	40.5	66.7	43.1
3	3	40.5	66.5	42.9
3	5	40.1	66.2	42.5
3	7	40.2	66.3	42.5
5	5	40.7	66.6	43.2
5	7	41.1	67.2	43.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xiang, J.; Xiao, Z.; Wang, S.; Fu, R.; Zhong, P. A Unified Framework with Dynamic Kernel Learning for Bidirectional Feature Resampling in Remote Sensing Images. Remote Sens. 2025, 17, 3599. https://doi.org/10.3390/rs17213599
