The overall structure of our FCDet is shown in Figure 3. Our proposed method is built on Faster R-CNN [37]. Given a remote sensing image, the backbone first performs hierarchical feature extraction. Next, the feature pyramid network (FPN) [38] aggregates features across different scales. We integrate SGFU into the FPN, replacing the original interpolation method. SGFU dynamically samples spatial positions to enable a learnable upsampling process, ensuring spatial alignment during adjacent-feature fusion. Subsequently, the fused features, along with the proposals generated by the RPN, are fed into the contrastive R-CNN. The proposed FCH projects RoI features into an embedding space to measure the distance between samples. To regulate the influence of sample pairs on contrastive feature learning, ICLA selects appropriate positive and negative samples for the FCH. Finally, the regression head in the contrastive R-CNN predicts refined object locations, while the classification head outputs category scores. The model parameters are optimized through iterative training.
3.1. Spatial-Guided Feature Upsampler
FPN employs a top-down feature aggregation flow, allowing high-level semantic information to be transmitted to lower-level features. This is a crucial discriminative factor, particularly for small, weak objects with limited information. However, FPN relies on interpolation to align adjacent features. This fixed-rule approach tends to cause spatial misalignment, where features from different levels do not correspond spatially. Inspired by DySample [32], we introduce low-level features rich in spatial information to guide feature upsampling. The sampling positions are determined dynamically by considering the spatial distribution of semantic information.
The full process of the proposed SGFU is illustrated in Figure 4. Its core concept is to dynamically compute the spatial position of each upsampling point. Unlike linear interpolation, SGFU takes both the higher-level feature $X_h$ and the lower-level feature $X_l$ from adjacent levels as input, producing the upsampled feature $X_{up}$ as output. First, for the inputs $X_h \in \mathbb{R}^{C_h \times H \times W}$ and $X_l \in \mathbb{R}^{C_l \times 2H \times 2W}$, a 1 × 1 convolution followed by group normalization (GN) is applied to unify the channel dimensions to $C$ and normalize the data to enhance robustness. Next, we compute the offset for each sampling point corresponding to an upsampling position by linearly mapping $X_h$ to $\mathrm{off}_i \in \mathbb{R}^{2 \times H \times W}$ in parallel. Here, $H$ and $W$ represent the coordinates of the offset points, while 2 denotes the offset values in the $x$- and $y$-directions. For specialized prediction of the offsets at the top-left, top-right, bottom-left, and bottom-right positions, we decouple the prediction into four parallel linear mapping branches. These parallel offsets are then concatenated along the channel dimension and reorganized into an offset matrix $O \in \mathbb{R}^{2 \times 2H \times 2W}$ using pixel shuffling. It can be expressed as
$$O = \mathrm{PS}\big(\mathrm{Concat}(\mathrm{off}_1, \mathrm{off}_2, \mathrm{off}_3, \mathrm{off}_4)\big), \quad \mathrm{off}_i = f_{1 \times 1}^{i}(X_h),$$
where $f_{1 \times 1}$ represents the 1 × 1 convolution, $\mathrm{off}_i$ represents the predicted offset of the $i$-th branch, $\mathrm{Concat}$ represents the concatenation operation, and $\mathrm{PS}$ denotes pixel shuffling.
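As a concrete illustration, the decoupled offset prediction can be sketched in PyTorch as follows. The module name, the realization of each linear mapping branch as a 1 × 1 convolution, and the variable names are our assumptions based on the description above; this is a minimal sketch, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetPredictor(nn.Module):
    """Sketch of the four-branch offset prediction (names are illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per corner position (top-left, top-right,
        # bottom-left, bottom-right), each predicting (dx, dy).
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, 2, kernel_size=1) for _ in range(4)]
        )

    def forward(self, x_h: torch.Tensor) -> torch.Tensor:
        # x_h: (B, C, H, W) -> four offset maps of shape (B, 2, H, W)
        offsets = [branch(x_h) for branch in self.branches]
        # Concatenate along channels: (B, 8, H, W), then pixel-shuffle
        # with factor 2 to rearrange into (B, 2, 2H, 2W).
        o = torch.cat(offsets, dim=1)
        return F.pixel_shuffle(o, upscale_factor=2)
```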
Although offset prediction introduces flexibility into feature upsampling, relying solely on the position of the sampled features is insufficient to accurately represent the geometric information during feature aggregation. Even after applying offsets, the upsampled features may lack precise spatial correspondence. To address this issue, we incorporate low-level features to guide the prediction of upsampling positions. For the feature $X_l$, we similarly apply convolution and GN to align its data distribution with that of $X_h$. We then linearly project it to $\mathbb{R}^{2 \times 2H \times 2W}$ to obtain the spatial offsets $O_s$. A sigmoid function is adopted to constrain the offsets and prevent excessive shifts that could lead to instability. Next, we add $O_s$ to $O$ to obtain the offset corresponding to each upsampling position. This incorporates the spatial information from the lower-level features, reducing the likelihood of inaccurate pixel interpolation. The computed offsets are added to the grid points to generate the final sampling coordinates, which are subsequently normalized to the range of (−1, 1). Leveraging these coordinates, SGFU performs pixel-wise sampling from $X_h$ to construct the upsampled feature $X_{up}$. The value of $X_{up}$ not only captures the semantic relationships in the image but also incorporates crucial spatial information, enabling it to effectively convey useful information about small, weak objects during feature aggregation. The above process can be expressed as
$$S = \mathrm{Norm}(G + O + O_s), \qquad X_{up} = \mathrm{GS}(X_h, S),$$
where $G$ represents the coordinates of the grid points, $S$ denotes the sampling coordinates, and $\mathrm{GS}$ refers to the grid sampler, whose workflow is illustrated in Figure 5. Based on the specified positions within the sampling coordinates, the sampler extracts the corresponding pixel values by performing bilinear interpolation over the four nearest pixels and fills the results into the corresponding positions. This process queries the key feature locations and interpolates the fine-grained features. After traversing the entire grid pixel by pixel, the network obtains the high-resolution features.
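The coordinate computation and sampling step can be sketched as follows, assuming torch.nn.functional.grid_sample as the grid sampler and align_corners=True as the normalization convention; both choices and all variable names are our assumptions.

```python
import torch
import torch.nn.functional as F

def sample_features(x_h: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Sample an upsampled feature map from x_h at offset grid positions.

    x_h:     (B, C, H, W) high-level feature.
    offsets: (B, 2, 2H, 2W) combined offsets (O + O_s).
    """
    b, _, h_up, w_up = offsets.shape
    # Base grid of pixel coordinates at the upsampled resolution.
    ys, xs = torch.meshgrid(
        torch.arange(h_up, dtype=x_h.dtype, device=x_h.device),
        torch.arange(w_up, dtype=x_h.dtype, device=x_h.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0)  # (1, 2, 2H, 2W)
    coords = grid + offsets                           # add predicted offsets
    # Normalize to (-1, 1) as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w_up - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h_up - 1, 1) - 1.0
    sampling_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, 2H, 2W, 2)
    # Bilinear interpolation over the four nearest pixels of x_h.
    return F.grid_sample(x_h, sampling_grid, mode="bilinear", align_corners=True)
```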
To provide a comprehensive overview of SGFU, we formally summarize the aforementioned pipeline in Algorithm 1. In summary, the proposed upsampler adaptively adjusts sampling positions based on the semantic content and spatial distribution of the input features. This ensures spatial consistency while preserving both feature details and semantic representations. Compared with fixed upsampling methods, SGFU provides precise alignment during feature aggregation.
Algorithm 1 Spatial-guided feature upsampling (SGFU).
Input: high-level feature $X_h$, low-level feature $X_l$
Output: upsampled feature $X_{up}$
1: Feature Normalization:
2: $X_h \leftarrow \mathrm{GN}(f_{1 \times 1}(X_h))$ ▹ 1 × 1 conv + GroupNorm
3: $X_l \leftarrow \mathrm{GN}(f_{1 \times 1}(X_l))$
4: Offset Prediction:
5: for $i = 1$ to 4 do ▹ Parallel offset branches
6:  $\mathrm{off}_i \leftarrow f_{1 \times 1}^{i}(X_h)$
7: $O \leftarrow \mathrm{PS}(\mathrm{Concat}(\mathrm{off}_1, \ldots, \mathrm{off}_4))$
8: Spatial Guidance:
9: $O_s \leftarrow \sigma(f_{1 \times 1}(X_l))$ ▹ Constrained to (0, 1)
10: Coordinate Computation:
11: $S \leftarrow \mathrm{Norm}(G + O + O_s)$ ▹ Normalize to (−1, 1)
12: Feature Upsampling:
13: $X_{up} \leftarrow \mathrm{GS}(X_h, S)$ ▹ Bilinear interpolation
14: return $X_{up}$
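For reference, Algorithm 1 can be assembled into the following minimal PyTorch sketch; the module structure, the GroupNorm group count, and the guidance-branch design are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGFU(nn.Module):
    """End-to-end sketch of Algorithm 1 (names are illustrative)."""
    def __init__(self, c_high: int, c_low: int, channels: int, groups: int = 32):
        super().__init__()
        # Feature normalization: 1x1 conv + GroupNorm for both inputs.
        self.norm_h = nn.Sequential(nn.Conv2d(c_high, channels, 1),
                                    nn.GroupNorm(groups, channels))
        self.norm_l = nn.Sequential(nn.Conv2d(c_low, channels, 1),
                                    nn.GroupNorm(groups, channels))
        # Four parallel offset branches (one per corner position).
        self.offset_branches = nn.ModuleList(
            [nn.Conv2d(channels, 2, 1) for _ in range(4)])
        # Spatial guidance branch on the low-level feature.
        self.guide = nn.Conv2d(channels, 2, 1)

    def forward(self, x_h: torch.Tensor, x_l: torch.Tensor) -> torch.Tensor:
        # x_h: (B, C_high, H, W); x_l: (B, C_low, 2H, 2W).
        x_h, x_l = self.norm_h(x_h), self.norm_l(x_l)
        # Offset prediction: concat four (B, 2, H, W) maps, shuffle to (B, 2, 2H, 2W).
        o = F.pixel_shuffle(
            torch.cat([b(x_h) for b in self.offset_branches], dim=1), 2)
        # Spatial guidance: sigmoid-constrained offsets from the low-level feature.
        o_s = torch.sigmoid(self.guide(x_l))
        # Coordinate computation and grid sampling (see earlier sketches).
        b, _, h_up, w_up = o.shape
        ys, xs = torch.meshgrid(
            torch.arange(h_up, dtype=x_h.dtype, device=x_h.device),
            torch.arange(w_up, dtype=x_h.dtype, device=x_h.device),
            indexing="ij")
        coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + o + o_s
        grid = torch.stack(
            (2 * coords[:, 0] / max(w_up - 1, 1) - 1,
             2 * coords[:, 1] / max(h_up - 1, 1) - 1), dim=-1)
        return F.grid_sample(x_h, grid, mode="bilinear", align_corners=True)
```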
3.2. Feature-Contrastive Learning
As illustrated in Figure 1, small and weak objects often exhibit high visual similarity to their surrounding context, making it challenging for the network to extract valuable discriminative information from backgrounds with similar colors. Existing methods [19,20,21] attempt to enhance feature representation by incorporating complex structures or attention mechanisms. In contrast, we aim to strengthen the model’s response to challenging objects using contrastive learning, which allows the network to discover inherent patterns in unlabeled data. By projecting data into an embedding space, contrastive learning brings similar data points closer together while pushing dissimilar points apart. Inspired by this, we leverage contrastive learning to exploit the similarity and dissimilarity structure of the feature space, thereby improving the network’s ability to detect weak-response objects.
The complete process of FCL is shown in Figure 6. A central concept in contrastive learning is the definition of positive and negative sample pairs, which are used to distinguish between different data. Upon extracting the RoI features, we define anchors as well as positive and negative sample pairs. The network extracts the RoI features of the ground truth as the anchor, denoted as $f_a$. For a given anchor $f_a$, the proposals assigned as positive samples during label assignment are defined as the positive samples here, marked as $f^{+}$. The positive samples of the remaining ground truths, along with all negative samples, are designated as negative samples relative to $f_a$, labeled as $f^{-}$. This can be expressed as
$$y_i = \begin{cases} +1, & p_i \in \mathcal{P}_a, \\ -1, & \text{otherwise}, \end{cases}$$
where $y_i$ is an identifier used to mark samples, $\mathcal{P}_a$ represents the set of positive samples assigned to the anchor’s ground truth, and $f_i$ is the RoI feature extracted from the proposal $p_i$.
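This labeling rule can be sketched as follows, assuming the label assigner exposes a per-proposal tensor of assigned ground-truth indices; the function and variable names are illustrative.

```python
import torch

def mark_samples(assigned_gt: torch.Tensor, gt_index: int) -> torch.Tensor:
    """Label proposals relative to the anchor of one ground truth.

    assigned_gt: (N,) index of the ground truth each proposal was assigned
                 to as a positive sample, or -1 for negative proposals.
    gt_index:    index of the ground truth serving as the anchor.
    Returns y in {+1, -1}: +1 for this anchor's positives, -1 for
    positives of other ground truths and for all negatives.
    """
    return torch.where(assigned_gt == gt_index, 1, -1)
```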
During the RoI feature extraction stage, the RPN generates proposals based on the multi-scale feature maps produced by the FPN. These proposals are subsequently projected onto their corresponding regions within the feature maps. The FCH then utilizes RoI Align with bilinear interpolation to achieve precise feature sampling, producing standardized RoI features with fixed dimensions ($C \times S \times S$). These extracted features effectively combine fine-grained local patterns with high-level semantic information. After labeling these RoIs, the features are forwarded to the FCH for representation learning.
Figure 3 illustrates the constructed contrastive R-CNN, which integrates the FCH in parallel with the classification and regression heads. The FCH comprises two linear mapping layers that project the input features into a low-dimensional embedding space for distance measurement. Specifically, we define the dimensions of the input RoI features $f$ and $f_a$ as $N \times C \times S \times S$ and $G \times C \times S \times S$, respectively, where $N$ denotes the number of samples and $G$ denotes the number of ground truths. The FCH initially applies global average pooling followed by a flattening operation to transform the dimensions of $f$ and $f_a$ into $N \times C$ and $G \times C$, respectively. Subsequently, a linear projection layer reduces their dimensionality to $N \times D$ and $G \times D$ in a channel-wise manner. The compressed embeddings preserve the semantic information of the original features while reducing representation costs. In the embedding space, we utilize cosine similarity to measure the distance between samples, which can be expressed as
$$\mathrm{sim}(i, j) = \frac{z_i \cdot z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}, \quad (7)$$
where $i$ and $j$ index the data used to calculate the cosine similarity, and $z_i$ and $z_j$ are their embedding vectors.
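The projection and similarity computation can be sketched as follows; the hidden width of the two linear layers and all names are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCHProjector(nn.Module):
    """Sketch of the feature-contrastive head's embedding projection."""
    def __init__(self, channels: int, embed_dim: int):
        super().__init__()
        # Two linear mapping layers projecting pooled RoI features
        # into a low-dimensional embedding space.
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, embed_dim),
        )

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (N, C, S, S) -> global average pool + flatten -> (N, C)
        pooled = roi_feats.mean(dim=(2, 3))
        return self.proj(pooled)  # (N, D)

def cosine_similarity_table(z_a: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # z_a: (G, D) anchor embeddings; z: (N, D) sample embeddings.
    # Returns a (G, N) table of cosine similarities (Equation (7)).
    return F.normalize(z_a, dim=1) @ F.normalize(z, dim=1).t()
```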
In general, the InfoNCE loss [39] serves as a fundamental tool in contrastive learning, driving the optimization of similarity between samples. It enhances the ability of models to aggregate positive samples while increasing the separation from negative samples. This loss can be expressed as follows:
$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i, z_i^{+}) / \tau\right)}{\exp\left(\mathrm{sim}(z_i, z_i^{+}) / \tau\right) + \sum_{j} \exp\left(\mathrm{sim}(z_i, z_j^{-}) / \tau\right)}, \quad (9)$$
where $N$ represents the number of samples, $z_i$ and $z_i^{+}$ denote the representation vectors of positive sample pairs, and $z_j^{-}$ denotes the representation vector of a negative sample; $\tau$ is the temperature parameter, which adjusts the sensitivity to negative samples. In Equation (9), the more similar a negative sample is to the anchor, the larger the penalty term generated by $\exp(\mathrm{sim}(z_i, z_j^{-})/\tau)$. This loss drives the network to separate sample distances in the embedding space, facilitating data discrimination.
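A minimal sketch of this loss, assuming one positive per anchor and a shared set of precomputed negative embeddings (names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_pos: torch.Tensor,
             z_neg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE loss (Equation (9)) with one positive per anchor.

    z:     (N, D) anchor embeddings.
    z_pos: (N, D) matching positive embeddings.
    z_neg: (M, D) negative embeddings shared across anchors.
    """
    z = F.normalize(z, dim=1)
    z_pos = F.normalize(z_pos, dim=1)
    z_neg = F.normalize(z_neg, dim=1)
    pos = torch.exp((z * z_pos).sum(dim=1) / tau)    # (N,)
    neg = torch.exp(z @ z_neg.t() / tau).sum(dim=1)  # (N,)
    # The more similar a negative is to the anchor, the larger its
    # contribution to the denominator, and hence the penalty.
    return -torch.log(pos / (pos + neg)).mean()
```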
We design a feature contrastive loss $\mathcal{L}_{\mathrm{FC}}$ that optimizes the aforementioned computations. Specifically, the network first evaluates the cosine similarity between the anchor $f_a$ and each sample $f_i$ using Equation (7). Subsequently, the similarity table is transformed into a normalized probability distribution $P$ through the softmax function, with a temperature coefficient $\tau$ set to control the sample distribution:
$$P_i = \frac{\exp\left(\mathrm{sim}(f_a, f_i) / \tau\right)}{\sum_{k=1}^{N} \exp\left(\mathrm{sim}(f_a, f_k) / \tau\right)}.$$
This process effectively assigns attention weights to the samples, highlighting the relative importance of the data. Spatially, positive samples are predominantly extracted from specific structures or local contextual regions of the ground truth. These samples carry significant co-occurrence information that indicates the fundamental attributes of the ground truth. Therefore, positive samples deserve particular attention in feature measurement, as they provide valuable guidance for feature learning. By introducing the indicator $y_i$, we retain only the positive-sample probability scores for each anchor. It is important to note that this approach does not overlook the contribution of negative samples. On the contrary, negative samples corresponding to similar backgrounds are implicitly amplified through the denominator of $P_i$, which acts as a penalty probability density. Finally, we compute the feature contrastive loss to facilitate discriminative feature learning. The overall process can be represented as follows:
$$\mathcal{L}_{\mathrm{FC}} = -\frac{1}{N^{+}} \sum_{i:\, y_i = 1} \log P_i,$$
where $P_i$ represents the probability distribution of the samples, and $N^{+}$ denotes the number of positive samples.
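Under these definitions, the loss can be sketched per anchor as follows; the row-wise formulation and all names are our assumptions.

```python
import torch

def feature_contrast_loss(sim: torch.Tensor, y: torch.Tensor,
                          tau: float = 0.1) -> torch.Tensor:
    """Feature contrastive loss over one anchor's similarity row.

    sim: (N,) cosine similarities between the anchor and all samples.
    y:   (N,) indicator, +1 for this anchor's positives, -1 otherwise.
    """
    # Softmax with temperature: negatives similar to the anchor shrink
    # the positive probabilities, implicitly amplifying their penalty.
    p = torch.softmax(sim / tau, dim=0)
    pos_mask = y == 1
    n_pos = pos_mask.sum().clamp(min=1)
    return -torch.log(p[pos_mask]).sum() / n_pos
```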
Our proposed FCL differs from PCLDet in three key aspects: (1) In terms of loss computation, FCL focuses on optimizing the similarity of positive samples while implicitly balancing the influence of negative samples, achieving automatic adjustment between positive and negative contributions; (2) For anchor sample selection, FCL directly utilizes RoI features of real instances, which better captures intra-class variations compared to PCLDet’s use of class prototypes, making it particularly suitable for small, weak object detection; (3) In sample generation, FCL adopts a decoupled ICLA strategy instead of PCLDet’s shared sampler, offering greater flexibility and avoiding interference with other training tasks.
For SWOD, the proposed method guides the network to progressively learn to distinguish and associate features of real samples. The model is capable of capturing and implicitly amplifying subtle feature differences, which is particularly crucial for detecting small, weak objects that closely resemble the background.
3.3. Instance-Controlled Label Assignment
In FCL, positive samples enhance the consistency of representations for the same instance and improve the robustness of the model. Negative samples help prevent feature confusion and reduce interference from similar backgrounds. Therefore, the selection of positive and negative samples is crucial for representation learning. However, existing methods fail to provide the most suitable samples due to the difficulty in controlling sample definitions. To address this, and to explore the potential of contrastive learning in object detection, we propose a simple yet effective label assignment strategy. This method allows manual control over the ratio and number of positive and negative samples according to the network’s requirements.
Specifically, given that the total number of proposals to be sampled is $N$ and the number of ground truths is $G$, the number of positive samples $N^{+}$ can be calculated as
$$N^{+} = \alpha \cdot N,$$
where $\alpha$ is a ratio parameter controlling the balance between positive and negative samples. We adopt a simple averaging strategy, treating each ground truth equally. Accordingly, the number of positive samples $k$ assigned to each ground truth can be calculated as
$$k = \left\lfloor N^{+} / G \right\rfloor,$$
where $\lfloor \cdot \rfloor$ represents the rounding function. Positive samples should have semantic representations similar to the ground truth, as they directly influence the quality of feature contrast. Thus, these samples should be spatially close to the ground truth and overlap with it. To achieve this, we use the intersection over union (IoU) metric to determine positive samples. The sampler selects the top $k$ samples with the highest overlap with the ground truth as positive samples, which can be expressed as
$$\mathcal{P}_g = \operatorname{TopK}_{p \in \mathcal{C}}\left(\mathrm{IoU}(p, g),\, k\right),$$
where $\mathcal{P}_g$ represents the set of positive samples for the ground truth $g$, and $\mathcal{C}$ denotes the set of candidate samples.