Article

SAR-CDSS: A Semi-Supervised Cross-Domain Object Detection from Optical to SAR Domain

1
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2
Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
3
School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(6), 940; https://doi.org/10.3390/rs16060940
Submission received: 14 January 2024 / Revised: 5 March 2024 / Accepted: 6 March 2024 / Published: 7 March 2024

Abstract

The unique imaging modality of synthetic aperture radar (SAR) poses significant challenges for object detection, as SAR images are more difficult to acquire and interpret than optical images. Recently, numerous studies have proposed cross-domain adaptive methods based on convolutional neural networks (CNNs) to promote SAR object detection using optical data. However, existing cross-domain methods focus on image features, lack improvements at the input-data level, and ignore the valuable supervision provided by the few labeled SAR images that are available. Therefore, we propose a semi-supervised cross-domain object detection framework that uses optical data and a small number of labeled SAR images to achieve knowledge transfer for SAR object detection. Our method focuses on data processing to gradually reduce the domain shift at the image, instance, and feature levels. First, we propose a data augmentation method of image mixing and instance swapping to generate a mixed domain that is more similar to the SAR domain. This method makes full use of the few available SAR annotations to reduce domain shift at the image and instance levels. Second, at the feature level, we propose an adaptive optimization strategy that filters out mixed-domain samples deviating significantly from the SAR feature distribution before training the feature extractor. In addition, we employ the Vision Transformer (ViT) as the feature extractor to handle global feature extraction from mixed images, and we propose a detection head based on the normalized Wasserstein distance (NWD) to enhance objects with smaller effective regions in SAR images. The effectiveness of the proposed method is evaluated on public SAR ship and oil tank datasets.

1. Introduction

Synthetic aperture radar (SAR) is an advanced active microwave sensor. Its all-weather, all-day operation and its insensitivity to lighting and climate conditions have led to its widespread application in fields such as geological surveying, disaster relief, climate monitoring, agricultural management, and maritime surveillance [1]. Under adverse weather conditions, high-resolution SAR images can provide effective detection and monitoring of typical objects such as ships, buildings, aircraft, and oil tanks [2,3,4]. Due to the unique imaging modality of SAR, its imaging results differ significantly from optical remote sensing [5], as shown in Figure 1. Compared with optical images, SAR images have lower signal-to-noise ratios and spatial resolution and are heavily contaminated with image noise, which makes object detection difficult. In the SAR object detection community, the approach of using optical images to assist the interpretation of SAR images has therefore attracted substantial attention.
Traditional approaches for object detection in SAR images can be broadly classified into two groups [6]. The first group uses non-deep-learning feature extraction methods, including detection algorithms based on structural features, grayscale characteristics, and image texture properties. Gu et al. [7] proposed a multifeature joint algorithm that extracts both the size and orientation of the object, with binary search employed for precise orientation estimation. The constant false alarm rate (CFAR) algorithm [8], a typical grayscale-based detection approach, calculates the detection threshold from background clutter statistics to determine whether a given pixel in the SAR image belongs to an object. Image texture segmentation and classification are performed using rotation-invariant features [9], which effectively represent scale-related texture features. However, these algorithms have low detection accuracy, high rates of missed detections and false alarms, and are easily affected by background clutter. The second group of methods capitalizes on deep learning techniques, specifically convolutional neural networks (CNNs), and is mainly divided into two-stage object detection algorithms that generate candidate bounding boxes and single-stage object detection algorithms based on regression. Kang et al. [10] were the first to apply the two-stage detector Faster R-CNN to SAR target detection. Cui et al. [11] proposed a dense attention pyramid network for multiregion analysis, specifically tailored for multiscale ship detection in SAR. Sun et al. [12] proposed an innovative SAR ship detector based on YOLO, which incorporates bidirectional feature fusion and angle classification, enabling detection in arbitrary orientations. Numerous studies [13,14,15] have demonstrated that CNN-based methods achieve exceptional performance in the field of SAR object detection.
However, methods based on CNN require a large and diverse dataset [16]. Unlike optical images, SAR images exhibit speckles and intricate details related to texture, and this low degree of visualization brings challenges to SAR image interpretation [17]. In addition, the rapid advancement of remote sensing technology has led to an abundance of SAR image data. However, these data originate from various carriers and imaging platforms, each with distinct technical specifications (e.g., radar parameters, imaging modes, and angles). The feature distributions of SAR images from different imaging platforms often vary significantly. This variation results in a phenomenon where a model trained on one dataset frequently underperforms when tested on another dataset.
Domain adaptation (DA) has emerged as a valuable technique to mitigate the challenge posed by limited annotated samples in the field of object detection [18]. The fundamental goal of domain adaptation is to train a model on a source dataset and ensure its robust performance on a significantly different target dataset. Typically, the source domain comprises the data distribution used for model training, while the target domain represents the distinct data distribution encountered during testing [19]. Domain adaptation methods have been widely applied in cross-domain object detection tasks [20,21,22]. Recently, several studies have employed domain adaptation for SAR image object detection. Shi et al. [23] extended the pioneering domain-adaptive object detection work of Chen et al. [24] by integrating multiple discriminators. Pan et al. [25] proposed an end-to-end domain-adaptation-based ship detection network consisting of imbalanced discriminant alignment and imbalanced prediction consistency. Xu et al. [26] introduced a multilevel alignment network that transfers knowledge from the optical domain to the SAR domain using domain adversarial strategies. However, most current cross-domain methods overlook two issues. The first is the neglect of the supervisory role of the few annotated SAR images on the network; in practical applications, a small number of annotated SAR images (target domain) can often be accessed. The second is the focus on adversarial feature alignment while ignoring fusion and alignment of the raw data across the two domains.
Therefore, we propose a semi-supervised cross-domain object detection framework from the optical to the SAR domain. This method leverages a large amount of labeled optical images (source domain) and a few labeled SAR images (target domain) to facilitate knowledge transfer for SAR object detection. Our method focuses on data processing to gradually transfer knowledge at the image, instance, and feature levels. First, we propose a data augmentation method of image mixing and instance swapping to generate a mixed domain whose distribution is more similar to that of the SAR domain. This method focuses on data processing and makes full use of the few available SAR annotations to reduce domain shift at the image and instance levels. Second, at the feature level, we propose an adaptive optimization strategy that filters out mixed-domain samples deviating significantly from the SAR feature distribution and selects data similar to SAR samples to train the feature extractor. For mixed SAR and optical data, convolution-based local feature extraction still attends to the SAR and optical image regions separately, which limits the extraction of fused features. Therefore, we adopt the Vision Transformer (ViT) as the feature extractor; through its global receptive field, ViT can better extract the features of the mixed images. This choice enhances the feature extraction process and improves the overall performance of the model in handling mixed SAR and optical data. In addition, in SAR images, manmade objects such as ships, aircraft, and oil tanks have smaller effective energy regions and relatively scarce detail information, and the conventional intersection over union (IoU) metric, along with its extensions, is highly sensitive to small deviations in the localization of such objects. To address this problem, we model bounding boxes as two-dimensional Gaussian distributions and adopt the normalized Wasserstein distance (NWD) [27] as an alternative metric. The NWD measures the similarity of the corresponding Gaussian distributions and can be seamlessly integrated into any anchor-based detector, replacing commonly used IoU metrics. In summary, the main contributions of this paper are as follows.
(1)
Feature-level processing: We propose a novel adaptive optimization strategy that utilizes metric learning to identify and filter feature samples that are more conducive to knowledge transfer, rather than blindly increasing the amount of data. This approach not only focuses on optimizing the quality of the selected samples but also effectively prevents overfitting by addressing the issue of data scarcity.
(2)
Image-level and instance-level processing: We construct a two-step data augmentation method called Domain Mix. In the first step, we randomly combine images from the optical and SAR domains to enhance the diversity of the training data. In the second step, we separate a limited number of instance annotations from the SAR domain and interchange these annotations with the optical domain. This augmentation technique focuses on aligning the two domains at the image and instance levels, significantly enhancing the diversity of the dataset and playing a crucial role in improving detection performance.
(3)
Two improvements for SAR object detection: In contrast to the local feature extraction of convolution, we employ ViT on images mixed from SAR and optical data, aligning SAR and optical image features better through a global receptive field. Additionally, considering the smaller effective energy regions of objects in SAR images, we model the bounding box as a two-dimensional Gaussian distribution and utilize the normalized Wasserstein distance (NWD) metric to improve detection accuracy in complex scenarios.

2. Related Work

2.1. Object Detection in SAR Images

In recent years, high-quality SAR datasets have become increasingly abundant, and researchers have proposed various automatic, fast, and accurate SAR image object detection algorithms [28,29,30]. Traditional SAR image detection methods have predominantly focused on enhancing the well-known constant false alarm rate (CFAR) algorithm. For example, Wang et al. [31] proposed a two-stage hierarchical scheme that fuses intensity and spatial information. Pappas et al. [32] achieved excellent results by using superpixels instead of a sliding window to define the CFAR guard band and background. For complex scenarios, Shi et al. [33] converted panchromatic images into a pseudo-hyperspectral form to enhance the separability between objects and backgrounds and used histograms of oriented gradients for feature extraction. With the improvement in hardware computing capabilities, deep learning technology has been widely applied. Kang et al. [10] were the first to apply the two-stage detection model Faster R-CNN to SAR object detection. Cui et al. [34] introduced a lightweight framework based on threshold neural networks that adaptively determines optimal detection thresholds within sliding windows. Recent research has also introduced the Transformer into SAR detection tasks: Ma et al. [35] improved upon end-to-end Transformers by incorporating incidence angles as prior labels and introducing feature descriptor operators based on scattering centers, and Zhou et al. [36] proposed a lightweight meta-learning approach for few-shot SAR object detection.

2.2. Domain Adaptation Object Detection

Domain adaptation [37] plays a crucial role in mitigating the domain shift between training and testing datasets with distinct data distributions, thereby addressing the difficulty of interpreting SAR images for target detection and the lack of annotation information. For instance, Xu et al. [26] proposed a multilevel alignment network based on domain-invariant features to achieve adaptation from labeled optical images to unlabeled SAR images. Zhao et al. [38] proposed a feature-decomposition cross-domain object detection method that decomposes domain-invariant and domain-specific features to improve object detection performance. Shi et al. [23] developed an unsupervised cross-domain method that progressively transfers knowledge at the image, feature, and prediction levels, effectively transferring ship-related information from optical images to SAR images. Chen et al. [39] proposed a teacher–student mutual learning framework based on pseudo-label multilevel feature alignment, fully utilizing unlabeled SAR images to iteratively generate high-quality pseudo-labels and further narrow the domain shift. However, directly aligning feature distributions on a CNN backbone yields limited improvements. Therefore, Wang et al. [31] introduced a sequential feature alignment Transformer; comprising domain query feature alignment and labeled feature alignment modules, it reduces domain shift across global, local, and instance-level feature representations in both the encoder and the decoder. Despite these advancements, Transformer-based DA object detection for SAR images remains under-explored, and these methods mainly utilize the source domain and an unlabeled target domain while ignoring the supervisory effect of a few labeled images in the target domain.

2.3. Semi-Supervised Object Detection

The core of semi-supervised object detection lies in the rational use of a few labeled samples and a large number of unlabeled samples [40]. Traditional methods mainly rely on consistency regularization, using unlabeled data to regularize the network by requiring predictions to remain consistent under perturbations of the inputs and network parameters. For example, Jeong et al. [41] proposed a consistency-based semi-supervised object detection method that uses consistency constraints as its main tool. Du et al. [42] constructed a scene feature learning branch and designed a scene classifier and a scene aggregation loss to exploit scene-level annotations; this allows the feature extraction network to fully learn SAR image scene characteristics, thereby enhancing its ability to discriminate ship target features from clutter. Zheng et al. [43] proposed a semi-supervised cross-domain ship detection network that constructs a dual-teacher framework to resolve the mutual interference between optical supervision and SAR supervision. However, existing semi-supervised methods often rely heavily on labeled data, which poses challenges for SAR images with limited annotations.

3. Methodology

We present the details of a novel semi-supervised cross-domain object detection framework from the optical to the SAR domain. The overall architecture of the proposed method is shown in Figure 2; it mainly involves three iterative steps to optimize the detector and transfer knowledge from the fully labeled optical domain to the SAR domain with only a few labeled samples. First, to address the shortage of SAR image data in the target domain and align image-level and instance-level features between the two domains, we construct a new data augmentation method that generates a new domain more similar to the distribution of the SAR domain. Second, we enhance the acquisition of global features by using ViT to extract features from both the augmented hybrid images and the SAR images. Third, to improve the effectiveness of the data features, we introduce a metric function to filter the extracted features. Finally, we feed the optimized samples into the detector for iterative optimization, and the IoU in the detector head is replaced with the NWD.
In our cross-domain detection task, we have a large labeled optical domain (source domain) $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and a few examples from the SAR domain (target domain) $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{n_t}$, where $x_i^s$ and $x_j^t$ denote optical and SAR images, respectively, and $y_i^s$ and $y_j^t$ denote the bounding boxes and categories of the objects in the images. The target domain also has test data. The goal of our method is to train an adaptive detector that alleviates the performance drop caused by the domain gap. In the following subsections, we present the details of SAR-CDSS.

3.1. Data Augmentation: Domain Mix

To reduce the domain gap between the source and target domains, we propose a new data augmentation strategy called Domain Mix. The hybrid images generated by this method are close to the distribution of the target domain. The mixing is performed at both the image level and the instance level.
Image-level augmentation: To increase image diversity and align image-level features, we randomly mix images from the source and target domains. Given a batch of source-domain data $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and target-domain data $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{n_t}$, we sample $m \le n_s$ and $n \le n_t$ samples, respectively, from $D_s$ and $D_t$ and randomly mix them into a single image $x^{mix}$ as follows:
$$x^{mix} = x_0^{mix} + \sum_{i=1}^{m} \sum_{j=1}^{n} H(i,j)\,\big(\lambda_i x_i^s + \lambda_j x_j^t\big),$$
where $x_0^{mix}$ denotes an initialized empty image with dimensions different from those of $x_i^s$ and $x_j^t$, and the transformation matrix $H(i,j)$ is hand-crafted for the image pair $(x_i^s, x_j^t)$. The weights $\lambda_i$ and $\lambda_j$, satisfying $0 \le \lambda_i + \lambda_j \le 1$, correspond to $x_i^s$ and $x_j^t$, respectively.
Instance-level augmentation: To make the most of the limited instance annotations and align instance-level features between the two domains, we separate the limited instance annotations from the background and randomly place them in other images. Unlike previous pixel-level copy–paste methods [44], we separate the entire bounding box annotation and then paste it onto other images. Given bounding boxes $b^s$ and $b^t$ from the source and target domains, each resized to width $w$ and height $h$, we perform an exchange operation to combine their distinctive characteristics. This can be expressed as follows:
$$b^{mix}_{(r,c)} = \varepsilon_{(r,c)}\, b^{s}_{(r,c)} + \big(1 - \varepsilon_{(r,c)}\big)\, b^{t}_{(r,c)},$$
where $(r, c)$, with $r = 1, 2, \ldots, w$ and $c = 1, 2, \ldots, h$, denotes the pixel indices within the bounding boxes $b^s$ and $b^t$, and the weight $\varepsilon_{(r,c)} \in [0, 1]$ corresponds to each index. The visualization of image-level and instance-level augmentations is shown in Figure 3.
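A minimal sketch of the two Domain Mix steps is given below. It is not the authors' implementation: the function names, the 2 × 2 canvas layout standing in for the hand-crafted placement $H(i,j)$, and the Beta-sampled mixing weight are illustrative assumptions, and 3-channel uint8 inputs are assumed.

```python
import numpy as np
import cv2  # OpenCV, used here only for resizing

def mix_images(opt_imgs, sar_imgs, canvas_hw=(1280, 1280)):
    """Image-level mixing (simplified form of Equation (1)): blend optical/SAR
    pairs with weights lambda and paste them onto an empty canvas x_0^mix."""
    canvas = np.zeros((canvas_hw[0], canvas_hw[1], 3), dtype=np.float32)
    h, w = canvas_hw[0] // 2, canvas_hw[1] // 2
    cells = [(0, 0), (0, w), (h, 0), (h, w)]        # stand-in for H(i, j)
    for (r, c), img_s, img_t in zip(cells, opt_imgs, sar_imgs):
        lam = float(np.random.beta(1.5, 1.5))       # lambda_i; lambda_j = 1 - lam
        patch_s = cv2.resize(img_s, (w, h)).astype(np.float32)
        patch_t = cv2.resize(img_t, (w, h)).astype(np.float32)
        canvas[r:r + h, c:c + w] = lam * patch_s + (1.0 - lam) * patch_t
    return canvas.astype(np.uint8)

def swap_instances(img_a, box_a, img_b, box_b, eps=0.5):
    """Instance-level swapping (Equation (2)): exchange whole bounding-box
    crops between two images; eps in [0, 1] is the per-pixel blend weight."""
    xa1, ya1, xa2, ya2 = box_a                      # boxes as (x1, y1, x2, y2)
    xb1, yb1, xb2, yb2 = box_b
    h, w = ya2 - ya1, xa2 - xa1                     # common crop size
    crop_a = img_a[ya1:ya2, xa1:xa2].astype(np.float32)
    crop_b = cv2.resize(img_b[yb1:yb2, xb1:xb2], (w, h)).astype(np.float32)
    mixed = eps * crop_a + (1.0 - eps) * crop_b     # b^mix
    out_a, out_b = img_a.copy(), img_b.copy()
    out_a[ya1:ya2, xa1:xa2] = mixed.astype(img_a.dtype)
    out_b[yb1:yb2, xb1:xb2] = cv2.resize(mixed, (xb2 - xb1, yb2 - yb1)).astype(img_b.dtype)
    return out_a, out_b
```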

3.2. Global Feature Extractor: Vision Transformer

In ViT, the role of self-attention is similar to that of the convolutional layer in a CNN. For the input token embedding sequence $e \in \mathbb{R}^{N \times d}$, we create the query $Q \in \mathbb{R}^{N \times d}$, the key $K \in \mathbb{R}^{N \times d}$, and the value $V \in \mathbb{R}^{N \times d}$, obtained through three learnable linear projectors $W_Q$, $W_K$, and $W_V$ applied to layer-normalized features. We then match the query sequence with the keys to construct an $N \times N$ self-attention matrix, where each element signifies the semantic relevance between the corresponding query–key pair. These embedded states can be learned and used as image representations. In both the pretraining and fine-tuning stages, the classification head is attached with the same dimension, and one-dimensional position embeddings are added to the patch embeddings to retain positional information. Notably, ViT employs the standard Transformer encoder and produces its output prior to the multilayer perceptron (MLP) head. Typically, ViT is pretrained on large datasets and subsequently fine-tuned for downstream tasks on smaller datasets. ViT's output is computed as weighted sums of the values based on the self-attention matrix, as expressed by the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V,$$
where $\mathrm{softmax}(\cdot)$ denotes the softmax normalization function. Equation (3) computes scores between different vectors using $Q K^{T}$, determining the attention weights for encoding tokens at the current position. These scores are scaled for gradient stability and transformed into probabilities; each value vector is then weighted by these probabilities and summed. Following the principles of SAR image target detection, we extract SAR image blocks centered on individual pixels and create training sets by randomly sampling from labeled blocks. These patches are then fed into the Vision Transformer pipeline to construct feature representations. ViT is built layer by layer, comprising data preprocessing layers, self-attention layers, and multilayer perceptron layers. Analogous to the role of convolution in CNNs, ViT primarily constructs features through self-attention. The underlying idea is to assess the importance of each pixel relative to the others, capturing their long-range interactions; self-attention produces a weighted average of the value embeddings, facilitating robust representation learning.
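As a concrete illustration of Equation (3), the following sketch implements a single self-attention layer over a token sequence in PyTorch. It is a simplified, single-head version for clarity; class and variable names are our own, and the multi-head attention, MLP blocks, and position embeddings of a full ViT are omitted.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention over token embeddings e in R^{N x d} (Equation (3))."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
        self.norm = nn.LayerNorm(dim)                # applied before the projections
        self.scale = dim ** -0.5                     # 1 / sqrt(d)

    def forward(self, e):                            # e: (batch, N, d)
        x = self.norm(e)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, N, N)
        return attn @ v                              # weighted sum of values

# usage on a batch of 196 patch tokens with embedding dimension 768
tokens = torch.randn(2, 196, 768)
out = SelfAttention(dim=768)(tokens)                 # shape (2, 196, 768)
```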

3.3. Adaptive Optimization Strategy

Domain adaptation assumes that the source and target domains lie in distinct feature distribution spaces and attempts to align their data distributions for effective knowledge transfer. Two common approaches for achieving feature alignment are discrepancy-metric-based and adversarial-based methods [19]. Discrepancy-metric-based methods typically design metrics to quantify the distribution difference between the source and target domains and minimize these metrics to achieve alignment. As shown in Figure 2, the Domain Mix augmentation introduced above generates a set of data that we expect to be as close as possible to the target-domain distribution. In our framework, the detector $d_\theta = (g_\theta, h_\theta)$ with parameters $\theta$ is composed of a backbone $g_\theta$ and a head $h_\theta$. SAR-CDSS uses $g_\theta$ as a feature extractor to produce the representations $\tilde{f}_i$ and $f_j^t$ as follows:
$$\tilde{f}_i = g_\theta\big(x_i^{mix}\big), \qquad f_j^t = g_\theta\big(x_j^t\big).$$
Then, SAR-CDSS employs a discrepancy-metric-based method to measure the distance between $\tilde{f}_i$ and $f_j^t$ and sort the mixed candidate samples $x_i^{mix}$ as follows:
$$d_i^{mix} = \mathrm{dist}_f\!\left(\tilde{f}_i, \big\{f_j^t\big\}_{j=1}^{M_t}\right).$$
To mitigate noise in $\tilde{D}$, we introduce a shrinkage ratio $0 < k \le 1$ to reduce the number of expanded candidates. Subsequently, we define an optimization function $\varphi_{opt}$ to refine $\tilde{D}$, resulting in the optimized extended domain $\tilde{D}^{opt}$:
$$\tilde{D}^{opt} = \big\{(x_i^{opt}, y_i^{opt})\big\}_{i=1}^{n_2} = \varphi_{opt}\!\left(\tilde{D}, \big\{d_i^{mix}\big\}_{i=1}^{n_1}, k\right).$$
Through $\varphi_{opt}$, we select the top $n_2$ ($n_2 = n_1 k$) candidates from $\tilde{D}$. This process yields an optimized extended domain $\tilde{D}^{opt}$ that better aligns with the target-domain distribution $P_t$. However, the suitability of $\tilde{D}^{opt}$ may change as $d_\theta$ converges. To address this, we iteratively optimize $\tilde{D}^{opt}$. Assume that detectors $d_{\theta_a}$ and $d_{\theta_b}$ have gone through $a$ and $b$ training epochs with $a > b \ge 0$. Since $g_\theta$ and $h_\theta$ are updated, the errors of $d_\theta$ on the source and target domains, $\varepsilon_{D_s}(d_{\theta_a})$ and $\varepsilon_{D_t}(d_{\theta_a})$, are expected to be smaller than $\varepsilon_{D_s}(d_{\theta_b})$ and $\varepsilon_{D_t}(d_{\theta_b})$. Consequently, the feature extractor $g_{\theta_a}$ represents both $x_i^{mix}$ and $x_i^t$ more accurately than $g_{\theta_b}$. Leveraging this insight, we iteratively optimize $\tilde{D}^{opt}$ using the $\mathrm{dist}_f$ metric, updating the feature representations $f_i^{mix}$ and $f_i^t$. After the $n$th ($n \ge 1$) training epoch, we obtain $\tilde{D}_n^{opt}$ by filtering $\tilde{D}$ as follows:
$$\tilde{D}_n^{opt} = \varphi_{opt}\!\left(\tilde{D}, \left\{\mathrm{dist}_f\!\left(g_{\theta_n}\big(x_i^{mix}\big), \big\{g_{\theta_n}\big(x_j^t\big)\big\}_{j=1}^{M}\right)\right\}_{i=1}^{n_a}, k\right).$$
Finally, the adaptive detector $d_{\theta_n} = (g_{\theta_n}, h_{\theta_n})$ is itself iteratively optimized using $(x_i^{opt}, y_i^{opt}) \in \tilde{D}_n^{opt}$ as follows:
$$\big(g_{\theta_{n+1}}, h_{\theta_{n+1}}\big) \leftarrow \mathrm{optimizer}\!\left(\big(g_{\theta_n}, h_{\theta_n}\big),\ \nabla_\theta L\!\left(d_{\theta_n}\big(x_i^{mix}\big), y_i^{mix}\right),\ \eta\right).$$
Here, $\mathrm{optimizer}$ denotes the optimizer, $\eta$ the learning rate for $g_\theta, h_\theta$, and $L$ the loss function. To assess the correlation between $\tilde{f}_i$ and $\{f_j^t\}_{j=1}^{M}$, we adopt the widely used maximum mean discrepancy (MMD) as the metric function $\mathrm{dist}_f$. In a reproducing kernel Hilbert space, MMD quantifies the distance between two distributions. Specifically, we instantiate it as $\mathrm{MMD}^2$ using the following formula:
$$\mathrm{MMD}^2\!\left(\tilde{f}_i, \big\{f_j^t\big\}_{j=1}^{M}\right) = \left\| \frac{1}{M} \sum_{j=1}^{M} f_j^t - \tilde{f}_i \right\|_2^2.$$
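The following sketch shows one way Equations (5)-(7) and (9) can be realized: compute the MMD-style discrepancy of each mixed sample to the mean target feature and keep the top fraction k. The function names, the use of a generic backbone callable, and the flattening of feature maps into vectors are illustrative assumptions rather than the authors' implementation.

```python
import torch

def mmd2(f_mix, f_tgt):
    # Squared discrepancy of Equation (9): distance between one mixed-sample
    # feature f_mix (d,) and the mean of M target-domain features f_tgt (M, d).
    return torch.sum((f_tgt.mean(dim=0) - f_mix) ** 2)

@torch.no_grad()
def filter_mixed_domain(backbone, mixed_imgs, mixed_labels, target_imgs, k=0.85):
    """Adaptive optimization (Equations (5)-(7)): keep the n2 = k * n1 mixed
    samples whose features lie closest to the SAR feature distribution
    under the current feature extractor g_theta (here, `backbone`)."""
    f_tgt = torch.stack([backbone(x.unsqueeze(0)).flatten() for x in target_imgs])
    dists = torch.stack([
        mmd2(backbone(x.unsqueeze(0)).flatten(), f_tgt) for x in mixed_imgs
    ])
    n_keep = max(1, int(k * len(mixed_imgs)))
    keep = torch.argsort(dists)[:n_keep].tolist()   # smallest discrepancy first
    return [mixed_imgs[i] for i in keep], [mixed_labels[i] for i in keep]
```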

3.4. NWD-Based Head

As is well known, IoU computes the Jaccard similarity coefficient of two finite sets. However, IoU exhibits varying sensitivity to objects of different scales: for small objects, even a minor positional deviation can lead to a substantial IoU drop and hence inaccurate label assignment, whereas for objects at a normal scale, IoU remains relatively stable under the same deviation. IoU is therefore not well suited to evaluating the positional relationship between small objects, so we adopt the Wasserstein distance, based on optimal transport theory, as an alternative metric for SAR object detection. Real-world objects are rarely strictly rectangular and often include background pixels inside their bounding boxes; foreground and background pixels tend to concentrate around the center and the boundary, respectively. To better characterize the importance of different pixels within the bounding box, we model it as a two-dimensional Gaussian distribution, in which the center pixel carries the highest weight and pixel importance gradually decreases from the center toward the boundary. Specifically, consider a horizontal bounding box $R = (c_x, c_y, w, h)$, where $(c_x, c_y)$ denotes the center coordinates and $w$ and $h$ represent the width and height, respectively. The equation of its inscribed ellipse can be expressed as follows:
$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1,$$
where $(\mu_x, \mu_y)$ represents the center coordinates of the ellipse and $(\sigma_x, \sigma_y)$ denote the semi-axis lengths along the $x$ and $y$ axes, respectively. Consequently, $\mu_x = c_x$, $\mu_y = c_y$, $\sigma_x = \frac{w}{2}$, and $\sigma_y = \frac{h}{2}$. Next, we derive the probability density function of the two-dimensional Gaussian distribution as follows:
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{\exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{2\pi \,|\Sigma|^{1/2}},$$
where $\mathbf{x}$, $\boldsymbol{\mu}$, and $\Sigma$ represent the coordinates $(x, y)$, the mean vector, and the covariance matrix of the Gaussian distribution, respectively. When $(\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) = 1$, the ellipse above corresponds to a density contour of the two-dimensional Gaussian distribution. Consequently, we can model a horizontal bounding box $R = (c_x, c_y, w, h)$ as a two-dimensional Gaussian distribution $N(\boldsymbol{\mu}, \Sigma)$, where
$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \dfrac{w^2}{4} & 0 \\ 0 & \dfrac{h^2}{4} \end{bmatrix}.$$
We employ the Wasserstein distance from optimal transport theory to quantify the dissimilarity between distributions. Consider two two-dimensional Gaussian distributions $\mu_1 = N(m_1, \Sigma_1)$ and $\mu_2 = N(m_2, \Sigma_2)$. The second-order Wasserstein distance between $\mu_1$ and $\mu_2$ can be expressed as follows:
$$W_2^2(\mu_1, \mu_2) = \|m_1 - m_2\|_2^2 + \left\|\Sigma_1^{1/2} - \Sigma_2^{1/2}\right\|_F^2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm. For Gaussian distributions $N_a$ and $N_b$ modeled from bounding boxes $A = (cx_a, cy_a, w_a, h_a)$ and $B = (cx_b, cy_b, w_b, h_b)$, the formula simplifies further to
$$W_2^2(N_a, N_b) = \left\| \left[cx_a, cy_a, \frac{w_a}{2}, \frac{h_a}{2}\right]^{T} - \left[cx_b, cy_b, \frac{w_b}{2}, \frac{h_b}{2}\right]^{T} \right\|_2^2.$$
However, $W_2(N_a, N_b)$ is a distance metric. To obtain a value range akin to IoU (i.e., between 0 and 1), we apply an exponential nonlinear transformation that remaps the Wasserstein distance into another space. This transformation yields the normalized Wasserstein distance (NWD), defined as follows:
$$\mathrm{NWD}(N_a, N_b) = \exp\!\left(-\frac{\sqrt{W_2^2(N_a, N_b)}}{C}\right),$$
where C represents a constant closely tied to the dataset. Empirically, setting C to the average absolute size of the dataset yields optimal performance. By adopting this approach, we achieve a more accurate evaluation of the positional relationships among SAR objects.
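A small sketch of how NWD can be computed for axis-aligned boxes, following Equations (12)-(15), is given below; the function name, the (cx, cy, w, h) tensor layout, and the example value of C are our own illustrative choices (in practice C is set to the average absolute object size of the dataset).

```python
import torch

def nwd(boxes_a, boxes_b, C=32.0, eps=1e-7):
    """Normalized Wasserstein distance between boxes given as (cx, cy, w, h).
    Each box is modeled as N(mu, Sigma) with mu = (cx, cy) and
    Sigma = diag(w^2/4, h^2/4), so W_2 reduces to a Euclidean distance
    between the vectors (cx, cy, w/2, h/2)."""
    ga = torch.cat([boxes_a[..., :2], boxes_a[..., 2:] / 2.0], dim=-1)
    gb = torch.cat([boxes_b[..., :2], boxes_b[..., 2:] / 2.0], dim=-1)
    w2_sq = ((ga - gb) ** 2).sum(dim=-1)               # W_2^2(N_a, N_b)
    return torch.exp(-torch.sqrt(w2_sq + eps) / C)     # remap into (0, 1]

# Two nearly identical small boxes keep a high NWD score, whereas their IoU
# already drops sharply under the same two-pixel shift.
a = torch.tensor([[100.0, 100.0, 8.0, 8.0]])
b = torch.tensor([[102.0, 101.0, 8.0, 8.0]])
print(nwd(a, b))   # ~0.93 with C = 32
```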

4. Evaluation

4.1. Datasets Description and Implementation Details

In this section, we evaluate our proposed method for domain adaptation from optical to SAR object detection datasets. To assess the generalization ability of our framework, we select four datasets, covering ship detection and oil tank detection. Ship detection uses the optical HRSC2016 dataset [45] (source domain) and the SAR HRSID dataset [46] (target domain); oil tank detection uses optical and SAR images from the SpaceNet 6 (SN6) dataset [47], which contains 820 pairs of single-polarization SAR images (target domain) and corresponding optical images (source domain) of oil tanks. The statistics of these real-world public datasets are shown in Table 1.
Ship detection (HRSC2016 → HRSID): In this scenario, we use HRSC2016 as the source dataset. It is extracted from six major ports in Google Earth and contains 1061 validly annotated images with sizes ranging from 300 × 300 to 1500 × 900, all of which are used for training. HRSID contains 5604 SAR images with resolutions ranging from 0.5 m to 3 m and sizes of 800 × 800, including 16,951 ship instances. To make the detection task more challenging in complex scenes, we use only the 471 inshore images as the target dataset, with 5% of the images for training and 95% for testing.
Oil tank detection (SN6 Optical → SAR): SpaceNet 6 is a multisensor all-weather mapping dataset that combines SAR and optical image datasets. The spatial resolution of its SAR and optical images is 0.5 m/pixel, and the size is 900 × 900. It contains 820 pairs of oil tank images. We use its optical images as the source dataset and SAR images as the target dataset. We use all optical images and 5% of SAR images as the training set, and 95% of SAR images as the test set.
All experiments are conducted in PyTorch. We use the supervised single-stage detector YOLOv5 as the baseline network and compare it with other unsupervised and semi-supervised cross-domain methods. For the unsupervised domain adaptation (UDA) setting, we use the labeled optical dataset as the source domain and the unlabeled SAR dataset as the target domain. For the semi-supervised domain adaptation (SSDA) setting, we use the labeled optical dataset as the source domain and a few labeled SAR images as the target domain. In our semi-supervised setting, we randomly select 5% of the labeled images in the SAR dataset for supervision, and the results are averaged over five runs with the same number of images. The shrinkage ratio in the adaptive optimization is k = 0.85, all experiments resize the input images to 640 × 640, the number of iterations is 300, and the average precision at an IoU threshold of 0.5 is reported as AP50.

4.2. Evaluation Metrics

We employ the standard MS COCO evaluation metrics to compare the ship and oil tank detection performance on SAR images. All results are reported on the test set. The intersection over union (IoU) measures the overlap between the ground truth and predicted bounding boxes as the ratio of the area of their intersection to the area of their union. If the IoU is above a given threshold, the prediction is counted as a true positive (TP); if it is below the threshold, it is counted as a false positive (FP). A false negative (FN) is a real object that is not detected. Precision (PR) and recall (RE) are calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Then, we sort the precision and recall according to the confidence score to generate the PR curve. Based on the PR curve, the average precision (AP) score can be obtained by calculating the area under the PR curve. The formula for calculating AP is as follows:
$$AP = \int_0^1 PR(RE)\, \mathrm{d}(RE) \times 100\%.$$
In the experiments of this paper, we use AP50 as the final evaluation indicator, which is the AP calculated under the condition that the IoU threshold is 0.5.
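A compact sketch of this computation is shown below: detections are sorted by confidence, flagged as TP or FP at the 0.5 IoU threshold, and precision is integrated over recall. The function name and the simple step-wise integration (without COCO's interpolated sampling) are our own simplifications.

```python
import numpy as np

def average_precision(scores, tp_flags, num_gt):
    """AP as the area under the precision-recall curve (Equation (18)).
    scores: detection confidences; tp_flags: 1 if the detection matches a
    ground-truth box with IoU >= 0.5, else 0; num_gt: number of ground truths."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.asarray(tp_flags, dtype=np.float64)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # step-wise integration of precision over recall
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap * 100.0   # AP50 in percent when TP flags use the 0.5 IoU threshold
```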

4.3. Comparison Experiments

To demonstrate the effectiveness and superiority of our proposed method for cross-domain object detection on the HRSC2016, HRSID, and SpaceNet 6 datasets, this section compares our method with several advanced cross-domain methods.
YOLOv5 [48]: This is the baseline of our approach and a mainstream, high-performing single-stage detection method. It achieves detection accuracy comparable to the two-stage detector Faster R-CNN with faster inference, and our detector is built on YOLOv5.
SWDA [20]: This is a method for UDA object detection based on the two-stage detector FasterRCNN. It performs strong alignment on local features while conducting weak alignment on global scenarios, which is one of the pioneering works in domain adaptation object detection and provides ideas for subsequent research.
SSDA-YOLO [49]: This is a new semi-supervised domain adaptation method based on YOLOv5. It uses the style transfer algorithm CUT for scene style transfer, cross-generates pseudo-images in different domains to bridge image-level differences, and adapts the knowledge distillation framework with the mean–teacher model to enable the student model to obtain instance-level features of the unlabeled target domain.
OS-SSL [50]: This is a method that enhances SAR images with multichannel optical images for oil tank detection. The feature extractor is trained with the OS-SSL method in the pretraining stage, and an optical-image knowledge distillation algorithm with an attention mechanism is used when training the detection network to further learn optical feature knowledge. The oil tank dataset established in that work is the one used in our experiments.
SoftTeacher [51]: This is an end-to-end semi-supervised object detection method. Its soft-teacher mechanism effectively weights the classification loss of unlabeled boxes, and its box jittering mechanism selects reliable pseudo boxes for learning box regression.
To ensure fairness, we use the same dataset partition for all algorithms. In Table 2, we report the AP (%) of the compared algorithms on the test set. The third column of Table 2 shows the ship detection results when the source domain is HRSC2016 and the target domain is HRSID; the fourth column shows the oil tank results when the source domain is SN6 optical images and the target domain is SN6 SAR images. Table 2 shows that using optical remote sensing data to assist SAR image target detection yields significant gains: the unsupervised cross-domain methods improve detection performance by about 9%. The large differences in imaging mechanisms between optical and SAR images pose a great challenge for cross-domain detection. In practical applications, a small amount of annotated SAR data is often available, so on top of the unsupervised cross-domain setting we add a small number of labeled SAR images for supervision; this increases detection performance by more than 15% over the unsupervised cross-domain methods in both the ship and oil tank experiments. Compared with the advanced semi-supervised cross-domain object detection method SoftTeacher, our proposed method improves detection performance by more than 7%. This also validates that our approach of performing fusion on the raw data can effectively reduce the shift between the two domains.
Figure 4 and Figure 5 show the visualization results of SAR image object detection in the ship and oil tank scenarios, respectively. Our method accurately regresses bounding boxes while handling densely distributed ships and oil tanks. This is attributed to our proposed data augmentation method, which aligns features between the two domains at both the image and instance levels and thereby improves detection performance when objects are densely distributed. In the ship experiment, we focus on inshore scenes, one of the most challenging contexts for SAR ship detection; in the oil tank experiment, we show oil tanks of different sizes and high-density oil tank areas. In the comparison experiments, our method exhibits relatively few false alarms but occasionally misses some objects, because convolution-based methods focus on local features whereas our ViT-based feature extractor attends more to global features. From Figure 4 and Figure 5, we can see that our method still performs well in complex scenes. Overall, our method achieves the best detection performance.
The specific implementation details of our proposed framework are presented in Algorithm 1.
Algorithm 1: Details of SAR-CDSS.
Input: Initialized detector $\theta_{in}$, the labeled source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, the few-labeled target domain $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{n_t}$, total epochs $T$, metric function $\mathrm{dist}_f$, number of steps per epoch $N$, domain-adaptive optimization function $\varphi_{opt}$, shrinkage ratio $k$, loss function $L$.
Output: Adaptive detector $d_\theta = (g_\theta, h_\theta)$
  1: Initialize $\theta = \theta_{in}$
  2: Initialize feature extractor $g_\theta$ and head $h_\theta$
  3: for $epoch \in \{1, \ldots, T\}$ do
  4:    $\tilde{D} = \{(x_i^{mix}, y_i^{mix})\}_{i=1}^{n_a} = \mathrm{Mix}(D_s, D_t)$
  5:    $\tilde{D}_n^{opt} = \varphi_{opt}\big(\tilde{D}, \{\mathrm{dist}_f(g_{\theta_n}(x_i^{mix}), \{g_{\theta_n}(x_j^t)\}_{j=1}^{M})\}_{i=1}^{n_a}, k\big)$
  6:   for $step \in \{1, \ldots, N\}$ do
  7:       sample batch $B = \{(x_i^{opt}, y_i^{opt})\}_{i=1}^{b_n}$ from $\tilde{D}_n^{opt}$
  8:       $pred = d_\theta(x_1^{opt}, \ldots, x_{b_n}^{opt})$
  9:       $loss = L\big(pred, (y_1^{opt}, \ldots, y_{b_n}^{opt})\big)$; update $\theta$ to minimize the loss
 10:   end for
 11:    $(g_\theta, h_\theta) \leftarrow \theta$
 12: end for
 13: $d_\theta \leftarrow \theta$
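Below is a condensed PyTorch-style rendering of Algorithm 1 that ties the previous sketches together. It is an orchestration sketch, not the authors' code: `domain_mix` is assumed to wrap the image- and instance-level mixing sketched in Section 3.1, `filter_mixed_domain` is the adaptive-optimization sketch from Section 3.3, `batchify` is a hypothetical batching helper, and the detector is assumed to expose a `backbone` attribute and a `loss` method.

```python
import torch

def train_sar_cdss(detector, source_data, target_data,
                   epochs=300, k=0.85, lr=1e-3, batch_size=16):
    """Sketch of Algorithm 1: Domain Mix -> adaptive sample filtering ->
    supervised updates of the NWD-based detector on the optimized domain."""
    optimizer = torch.optim.SGD(detector.parameters(), lr=lr, momentum=0.9)
    target_imgs = [img for img, _ in target_data]          # few labeled SAR images
    for epoch in range(epochs):
        # Step 1: build the mixed domain from optical + few labeled SAR images.
        mixed_imgs, mixed_labels = domain_mix(source_data, target_data)
        # Step 2: keep only mixed samples close to the SAR feature distribution.
        opt_imgs, opt_labels = filter_mixed_domain(
            detector.backbone, mixed_imgs, mixed_labels, target_imgs, k=k)
        # Step 3: update the detector on the optimized extended domain.
        for imgs, labels in batchify(opt_imgs, opt_labels, batch_size):
            loss = detector.loss(detector(imgs), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return detector
```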

4.4. Ablation Experiment

4.4.1. Component Analysis

To evaluate the impact of each component of our proposed method, we conducted ablation experiments in the ship and oil tank scenarios to further validate the effectiveness of our modules. Table 3 shows the detailed results for the various modules. The ablation experiments are divided into four groups: (1) incorporating only the Domain Mix data augmentation module, while keeping all other components constant; (2) only substituting YOLOv5's original Darknet-53 feature extractor with ViT, while keeping all other components constant; (3) incorporating the Domain Mix augmentation and using ViT as the feature extractor, while keeping all other components constant; and (4) retaining the data augmentation and ViT and replacing the IoU used in the YOLOv5 head with NWD. The results indicate that the proposed Domain Mix augmentation significantly enhances the detection performance of the model, improving it by over 5%. ViT, which emphasizes global feature extraction, reduces false alarms and improves detection performance by approximately 2%. Given the limited effective energy area of SAR image targets, the NWD-based detection head effectively enhances the detection precision of SAR objects. Furthermore, Table 4 illustrates the improvement in small object detection brought by the NWD module under the multiscale metrics of the COCO evaluation protocol: the NWD module boosts small-object detection performance by more than 1% in AP_S.

4.4.2. SAR Data Quantity Analysis

We conduct a series of experiments to verify the effect of different amounts of labeled data in the target domain, choosing 1%, 5%, and 10% of the data from HRSID and SN6-SAR for model training. Table 5 shows the performance of three methods under the supervision of different quantities of SAR images. Due to the limited number of training images, the performance of all algorithms is unsatisfactory with 1% of the SAR data; however, our method exhibits outstanding performance in all cases.

4.4.3. Shrinkage Ratio k Analysis

Furthermore, we investigated the impact of the shrinkage ratio k used to filter samples during the adaptive optimization process on detection performance, where k = 1 indicates no optimization, as shown in Figure 6. The line graph suggests that the proposed adaptive optimization better assists the model in cross-domain training. It can also be observed that the detection performance of the model remains relatively stable over a certain range of k values.

5. Discussion

Extensive experiments and analyses on the oil tank and ship datasets have substantiated the efficacy of our proposed method. As indicated in Table 2, the detection performance of the detector is low when only a small number of labeled SAR images is used for training. Introducing data from another modality, such as optical remote sensing, can significantly enhance the detection performance; optical data provide a more potent representation for learning from SAR images. Compared with optical images, SAR images exhibit distinct characteristics: discontinuous contours, severe geometric deformation, and speckle noise. These factors weaken the semantic and appearance correlations among objects, rendering semantic interpretation uncertain and SAR image analysis challenging.
To address these challenges, we leverage both labeled SAR image information and optical remote sensing data for detector training. The results show that the proposed training method significantly improves the detection performance. Our method combines the scattering intensity information of SAR images and optical remote sensing prior knowledge to reduce the domain gap between SAR images and optical images at the image level, instance level, and feature level. In addition, unlike previous convolutional operations that focus on local feature extraction, we use a global feature extractor to better extract common features of targets in both domains, which is suitable for the characteristics of SAR image targets. Our method was validated on oil tank and ship targets and can be extended to other SAR image object detection tasks. The use of optical data to assist SAR image target detection tasks has significant implications for deep-learning-based SAR image interpretation.

6. Conclusions

In this study, we propose an innovative semi-supervised cross-domain object detection method that bridges the gap between the optical domain and the SAR domain. By leveraging a limited number of annotated SAR images and a large number of annotated optical images, our method effectively performs cross-domain detection on SAR images. This approach addresses the scarcity of SAR data, its labeling difficulties, and the domain shift between optical and SAR images, and provides new insights toward more stable and efficient object detection in SAR images. We experimentally validated our method on ship and oil tank objects, and the results show that, compared with existing advanced cross-domain detection methods, our method significantly improves detection performance and model robustness, demonstrating excellent generalization ability.
Looking forward, we believe there is still much to explore in this field. Integrating more advanced semi-supervised learning techniques or introducing additional data sources may further enhance the performance of our method. In addition, our method can be extended to other tasks in computer vision, such as segmentation and classification. We expect our work to inspire future research in this exciting field.

Author Contributions

Conceptualization, C.L. and Y.Z.; methodology, C.L., J.G. and Y.Z.; validation, G.Z., H.Y. and X.N.; writing—original draft preparation, C.L. and Y.Z.; supervision, Y.Z. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grants 61991420 and 61991421.

Data Availability Statement

The HRSID dataset is available at https://github.com/chaozhong2010/HRSID (accessed on 17 November 2023). The HRSC2016 dataset is available at https://link.zhihu.com/?target=https//sites.google.com/site/hrsc2016/ (accessed on 17 November 2023). The SpaceNet 6 dataset is available at https://EIS-VIPG.github.io/SpaceNet6-OTD/ (accessed on 17 November 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43. [Google Scholar] [CrossRef]
  2. Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5614914. [Google Scholar] [CrossRef]
  3. Xu, H.; Chen, W.; Sun, B.; Chen, Y.; Li, C. Oil tank detection in synthetic aperture radar images based on quasi-circular shadow and highlighting arcs. J. Appl. Remote Sens. 2014, 8, 083689. [Google Scholar] [CrossRef]
  4. Zhao, Y.; Zhao, L.; Li, C.; Kuang, G. Pyramid attention dilated network for aircraft detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 662–666. [Google Scholar] [CrossRef]
  5. Abdallah, R.B.; Mian, A.; Breloy, A.; Taylor, A.; El Korso, M.N.; Lautru, D. Detection methods based on structured covariance matrices for multivariate SAR images processing. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1160–1164. [Google Scholar] [CrossRef]
  6. Huang, Q.; Zhu, W.; Li, Y.; Zhu, B.; Gao, T.; Wang, P. Survey of target detection algorithms in SAR images. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; IEEE: Piscataway, NJ, USA, 2021; Volume 5, pp. 1756–1765. [Google Scholar]
  7. Gu, D.; Xu, X. Multi-feature extraction of ships from SAR images. In Proceedings of the 2013 6th International Congress on Image and Signal Processing (CISP), Hangzhou, China, 16–18 December 2013; IEEE: Piscataway, NJ, USA, 2013; Volume 1, pp. 454–458. [Google Scholar]
  8. Gao, G.; Liu, L.; Zhao, L.; Shi, G.; Kuang, G. An adaptive and fast CFAR algorithm based on automatic censoring for target detection in high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2008, 47, 1685–1697. [Google Scholar] [CrossRef]
  9. Charalampidis, D.; Kasparis, T. Wavelet-based rotational invariant roughness features for texture classification and segmentation. IEEE Trans. Image Process. 2002, 11, 825–837. [Google Scholar] [CrossRef] [PubMed]
  10. Kang, M.; Ji, K.; Leng, X.; Lin, Z. Contextual region-based convolutional neural network with multilayer fusion for SAR ship detection. Remote Sens. 2017, 9, 860. [Google Scholar] [CrossRef]
  11. Cui, Z.; Li, Q.; Cao, Z.; Liu, N. Dense attention pyramid networks for multi-scale ship detection in SAR images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8983–8997. [Google Scholar] [CrossRef]
  12. Sun, Z.; Leng, X.; Lei, Y.; Xiong, B.; Ji, K.; Kuang, G. BiFA-YOLO: A novel YOLO-based method for arbitrary-oriented ship detection in high-resolution SAR images. Remote Sens. 2021, 13, 4209. [Google Scholar] [CrossRef]
  13. Zhao, Y.; Zhao, L.; Liu, Z.; Hu, D.; Kuang, G.; Liu, L. Attentional feature refinement and alignment network for aircraft detection in SAR imagery. arXiv 2022, arXiv:2201.07124. [Google Scholar] [CrossRef]
  14. Yang, R.; Pan, Z.; Jia, X.; Zhang, L.; Deng, Y. A novel CNN-based detector for ship detection based on rotatable bounding box in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1938–1958. [Google Scholar] [CrossRef]
  15. Zeng, L.; Zhu, Q.; Lu, D.; Zhang, T.; Wang, H.; Yin, J.; Yang, J. Dual-polarized SAR ship grained classification based on CNN with hybrid channel feature loss. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4011905. [Google Scholar] [CrossRef]
  16. Shin, H.C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298. [Google Scholar] [CrossRef] [PubMed]
  17. Cheng, G.; Yan, B.; Shi, P.; Li, K.; Yao, X.; Guo, L.; Han, J. Prototype-CNN for few-shot object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604610. [Google Scholar] [CrossRef]
  18. Peng, J.; Sun, W.; Ma, L.; Du, Q. Discriminative transfer joint matching for domain adaptation in hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 972–976. [Google Scholar] [CrossRef]
  19. Wang, M.; Deng, W. Deep visual domain adaptation: A survey. Neurocomputing 2018, 312, 135–153. [Google Scholar] [CrossRef]
  20. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6956–6965. [Google Scholar]
  21. Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4091–4101. [Google Scholar]
  22. Li, W.; Liu, X.; Yuan, Y. Sigma: Semantic-complete graph matching for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5291–5300. [Google Scholar]
  23. Shi, Y.; Du, L.; Guo, Y.; Du, Y. Unsupervised domain adaptation based on progressive transfer for ship detection: From optical to SAR images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5230317. [Google Scholar] [CrossRef]
  24. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  25. Pan, B.; Xu, Z.; Shi, T.; Li, T.; Shi, Z. An Imbalanced Discriminant Alignment Approach for Domain Adaptive SAR Ship Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5108111. [Google Scholar] [CrossRef]
  26. Xu, C.; Zheng, X.; Lu, X. Multi-level alignment network for cross-domain ship detection. Remote Sens. 2022, 14, 2389. [Google Scholar] [CrossRef]
  27. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  28. Chen, Z.; Liu, C.; Filaretov, V.; Yukhimets, D. Multi-Scale Ship Detection Algorithm Based on YOLOv7 for Complex Scene SAR Images. Remote Sens. 2023, 15, 2071. [Google Scholar] [CrossRef]
  29. Han, P.; Liao, D.; Han, B.; Cheng, Z. SEAN: A Simple and Efficient Attention Network for Aircraft Detection in SAR Images. Remote Sens. 2022, 14, 4669. [Google Scholar] [CrossRef]
  30. He, F.; Zhou, F.; Gui, C.; Xing, M. SAR Target Detection Based on Improved SSD with Saliency Map. In Proceedings of the 2021 CIE International Conference on Radar (Radar), Hainan, China, 15–19 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 918–922. [Google Scholar]
  31. Wang, Y.; Liu, H. A hierarchical ship detection scheme for high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2012, 50, 4173–4184. [Google Scholar] [CrossRef]
  32. Pappas, O.; Achim, A.; Bull, D. Superpixel-level CFAR detectors for ship detection in SAR imagery. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1397–1401. [Google Scholar] [CrossRef]
  33. Shi, Z.; Yu, X.; Jiang, Z.; Li, B. Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature. IEEE Trans. Geosci. Remote Sens. 2013, 52, 4511–4523. [Google Scholar]
  34. Cui, J.; Jia, H.; Wang, H.; Xu, F. A fast threshold neural network for ship detection in large-scene SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6016–6032. [Google Scholar] [CrossRef]
  35. Ma, C.; Zhang, Y.; Guo, J.; Hu, Y.; Geng, X.; Li, F.; Lei, B.; Ding, C. End-to-end method with transformer for 3-D detection of oil tank from single SAR image. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5217619. [Google Scholar] [CrossRef]
  36. Zhou, Z.; Chen, J.; Huang, Z.; Wan, H.; Chang, P.; Li, Z.; Yao, B.; Wu, B.; Sun, L.; Xing, M. FSODS: A lightweight metalearning method for few-shot object detection on SAR images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5232217. [Google Scholar] [CrossRef]
  37. Farahani, A.; Voghoei, S.; Rasheed, K.; Arabnia, H.R. A brief review of domain adaptation. In Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020; Springer: Cham, Switzerland, 2021; pp. 877–894. [Google Scholar]
  38. Zhao, S.; Luo, Y.; Zhang, T.; Guo, W.; Zhang, Z. A feature decomposition-based method for automatic ship detection crossing different satellite SAR images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5234015. [Google Scholar] [CrossRef]
  39. Chen, S.; Zhan, R.; Wang, W.; Zhang, J. Domain adaptation for semi-supervised ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4507405. [Google Scholar] [CrossRef]
  40. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased teacher for semi-supervised object detection. arXiv 2021, arXiv:2102.09480. [Google Scholar]
  41. Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-based semi-supervised learning for object detection. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  42. Du, Y.; Du, L.; Guo, Y.; Shi, Y. Semisupervised SAR Ship Detection Network via Scene Characteristic Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5201517. [Google Scholar] [CrossRef]
  43. Zheng, X.; Cui, H.; Xu, C.; Lu, X. Dual Teacher: A Semi-Supervised Co-Training Framework for Cross-Domain Ship Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613312. [Google Scholar] [CrossRef]
  44. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
  45. Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1074–1078. [Google Scholar] [CrossRef]
  46. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
  47. Shermeyer, J.; Hogan, D.; Brown, J.; Van Etten, A.; Weir, N.; Pacifici, F.; Hansch, R.; Bastidas, A.; Soenen, S.; Bacastow, T.; et al. SpaceNet 6: Multi-sensor all weather mapping dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 196–197. [Google Scholar]
  48. Jocher, G.; Chaurasia, A.; Borovec, J.; Stoken, A.; Kwon, Y.; Michael, K.; Fang, J.; Xie, T.; Zeng, Y.; Sonck, V.; et al. Yolov5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 October 2023).
  49. Zhou, H.; Jiang, F.; Lu, H. SSDA-YOLO: Semi-supervised domain adaptive YOLO for cross-domain object detection. Comput. Vis. Image Underst. 2023, 229, 103649. [Google Scholar] [CrossRef]
  50. Zhang, R.; Guo, H.; Xu, F.; Yang, W.; Yu, H.; Zhang, H.; Xia, G.S. Optical-Enhanced Oil Tank Detection in High-Resolution SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5237112. [Google Scholar] [CrossRef]
  51. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3060–3069. [Google Scholar]
Figure 1. Some examples of SAR and optical images. The first row displays optical images, while the second row shows SAR images. The first two columns show ships, and the last two columns show oil tanks.
Figure 2. Overall architecture of the proposed model. SAR-CDSS consists of three steps: (1) the proposed Domain Mix augmentation generates mixed candidates that are more similar to the SAR domain; (2) the augmented and target-domain data are fed into the ViT to obtain feature vectors, and the adaptive optimization strategy filters out samples that are unsuitable for mitigating domain shift; (3) the detector iteratively updates the model using the optimized samples.
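To make step (2) of the pipeline concrete, the following is a minimal sketch of how mixed-domain candidates could be filtered by their ViT feature similarity to the few labeled SAR images. The top-k cosine-similarity-to-centroid criterion and the function name `filter_mixed_candidates` are illustrative assumptions, not the exact selection rule defined in the method section.

```python
import torch
import torch.nn.functional as F

def filter_mixed_candidates(mixed_feats: torch.Tensor,
                            sar_feats: torch.Tensor,
                            k: int) -> torch.Tensor:
    """Keep the k mixed-domain samples whose ViT features lie closest to the
    SAR feature distribution (summarized here by its centroid).

    mixed_feats: (N, D) features of mixed-domain candidates.
    sar_feats:   (M, D) features of the few labeled SAR images.
    Returns the indices of the retained candidates.
    """
    centroid = sar_feats.mean(dim=0, keepdim=True)            # (1, D)
    sims = F.cosine_similarity(mixed_feats, centroid, dim=1)  # (N,)
    return sims.topk(min(k, mixed_feats.shape[0])).indices

# Toy usage with random stand-ins for real ViT embeddings.
mixed = torch.randn(100, 768)
sar = torch.randn(8, 768)
kept = filter_mixed_candidates(mixed, sar, k=20)
print(kept.shape)  # torch.Size([20])
```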
Figure 3. Visualization of Domain Mix augmentation, including both image-level and instance-level augmentation. Dashed boxes of different colors indicate that the instances are interchanged.
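As a rough illustration of the augmentation in Figure 3, the sketch below performs image-level blending and instance-level patch swapping between an optical and a SAR image. The fixed blend ratio, the helper names, and the use of OpenCV resizing are assumptions made for illustration; the actual Domain Mix operation is specified in the method section.

```python
import numpy as np
import cv2

def mix_images(optical: np.ndarray, sar: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Image-level mixing: linearly blend an optical image with a SAR image
    of the same size (the fixed blend ratio is an illustrative choice)."""
    blended = alpha * optical.astype(np.float32) + (1.0 - alpha) * sar.astype(np.float32)
    return blended.astype(np.uint8)

def swap_instances(img_a: np.ndarray, img_b: np.ndarray,
                   box_a: tuple, box_b: tuple) -> None:
    """Instance-level swapping: exchange two annotated object patches in place.
    Boxes are (x1, y1, x2, y2); each patch is resized to fit the other box."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    patch_a = img_a[ay1:ay2, ax1:ax2].copy()
    patch_b = img_b[by1:by2, bx1:bx2].copy()
    img_a[ay1:ay2, ax1:ax2] = cv2.resize(patch_b, (ax2 - ax1, ay2 - ay1))
    img_b[by1:by2, bx1:bx2] = cv2.resize(patch_a, (bx2 - bx1, by2 - by1))
```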
Figure 4. Visualization of ship detection results for three scenes in the HRSC → HRSID task. Green rectangles mark correctly detected objects, red rectangles mark missed objects, and blue rectangles mark false alarms. (a) Ground truth, (b) YOLOv5, (c) SoftTeacher, (d) proposed method.
Figure 5. Visualization of oil tank detection results for three scenes in the SN6-OPT → SN6-SAR task. Green rectangles mark correctly detected objects, red rectangles mark missed objects, and blue rectangles mark false alarms. (a) Ground truth, (b) YOLOv5, (c) SoftTeacher, (d) proposed method.
Figure 6. Experiments with different values of k in the ship and oil tank scenarios. (a) Ship, (b) Oil tank.
Table 1. Statistics of the public datasets.

|                     | SAR: HRSID                        | SAR: SpaceNet 6 | Optical: HRSC2016    | Optical: SpaceNet 6 |
|---------------------|-----------------------------------|-----------------|----------------------|---------------------|
| Satellite           | Sentinel-1B, TerraSAR-X, TanDEM-X | Capella Space   | Google Earth         | Maxar WorldView-2   |
| Polarization        | HH, HV, VV                        | HH, HV, VH, VV  | -                    | -                   |
| Resolution (m)      | 0.5, 1, 3                         | 0.5             | 0.4–2                | 0.5                 |
| Image number        | 16,951                            | 3401            | 1061                 | 3401                |
| Image size (pixels) | 800 × 800                         | 900 × 900       | 300 × 300–1500 × 900 | 900 × 900           |
Table 2. Comparison with different methods on the ship and oil tank datasets (AP50).

| Setting         | Method      | HRSC → HRSID (Ship) | SN6_OPT → SN6_SAR (Oil Tank) |
|-----------------|-------------|---------------------|------------------------------|
| Baseline        | YOLOv5      | 25.35%              | 28.02%                       |
| Unsupervised    | SWDA        | 31.20%              | 33.42%                       |
|                 | SSDA-YOLO   | 35.21%              | 39.66%                       |
| Semi-supervised | OS-SSL      | -                   | 54.60%                       |
|                 | SoftTeacher | 42.80%              | 46.30%                       |
|                 | SAR-CDSS    | 50.63%              | 55.42%                       |
Table 3. Ablation study on the components of SAR-CDSS (Domain Mix data augmentation, ViT feature extractor, NWD detector head).

| Method   | Domain Mix | ViT | NWD | Ship   | Oil Tank |
|----------|------------|-----|-----|--------|----------|
| SAR-CDSS |            |     |     | 37.74% | 43.38%   |
|          |            |     |     | 45.68% | 50.65%   |
|          |            |     |     | 39.96% | 46.45%   |
|          |            |     |     | 48.32% | 53.23%   |
|          |            |     |     | 50.63% | 55.42%   |
Table 4. Detection performance of SAR-CDSS for small objects.

| Method      | Ship AP_S | Ship AP_M | Ship AP_L | Oil Tank AP_S | Oil Tank AP_M | Oil Tank AP_L |
|-------------|-----------|-----------|-----------|---------------|---------------|---------------|
| With NWD    | 12.2%     | 24.8%     | 39.4%     | 11.6%         | 39.3%         | 57.1%         |
| Without NWD | 11.0%     | 25.2%     | 38.5%     | 10.1%         | 39.8%         | 56.3%         |
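Table 4 isolates the contribution of the NWD-based head on objects with small effective regions. For reference, the sketch below evaluates the commonly used normalized Gaussian Wasserstein distance between two boxes modeled as 2-D Gaussians; the normalizing constant c and the way this score would enter the detection loss are assumptions here rather than the paper's exact settings.

```python
import math

def nwd(box_a, box_b, c: float = 12.8) -> float:
    """Normalized Wasserstein distance similarity between two boxes.

    Boxes are (cx, cy, w, h). Each box is modeled as a 2-D Gaussian
    N([cx, cy], diag(w^2/4, h^2/4)); the squared 2-Wasserstein distance
    between the two Gaussians then has the closed form below.
    The constant c is a dataset-dependent normalizer (the value here is a
    placeholder assumption, not the paper's setting).
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2 = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
          + ((wa - wb) / 2.0) ** 2 + ((ha - hb) / 2.0) ** 2)
    return math.exp(-math.sqrt(w2) / c)

# Two small, slightly offset boxes: NWD stays informative even where IoU is 0.
print(nwd((10, 10, 6, 6), (13, 10, 6, 6)))
```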
Table 5. Results of methods with different numbers of supervised images.

| Method      | 1% Ship | 1% Oil Tank | 5% Ship | 5% Oil Tank | 10% Ship | 10% Oil Tank |
|-------------|---------|-------------|---------|-------------|----------|--------------|
| YOLOv5      | 6.22%   | 8.16%       | 25.35%  | 28.02%      | 38.44%   | 43.64%       |
| SoftTeacher | 15.38%  | 18.54%      | 42.80%  | 46.30%      | 54.14%   | 59.22%       |
| SAR-CDSS    | 20.32%  | 21.82%      | 50.63%  | 55.42%      | 68.70%   | 74.78%       |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
