Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery

Chen, Man; Xu, Kun; Chen, Enping; Zhang, Yao; Xie, Yifei; Hu, Yahao; Pan, Zhisong

doi:10.3390/rs15215201

Open AccessArticle

Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery

by

Man Chen

^1,2

,

Kun Xu

¹,

Enping Chen

²,

Yao Zhang

¹,

Yifei Xie

¹,

Yahao Hu

¹ and

Zhisong Pan

^1,*

¹

College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China

²

School of Electrical and Information Engineering, Changsha University of Science and Technology, Changsha 410114, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(21), 5201; https://doi.org/10.3390/rs15215201

Submission received: 17 July 2023 / Revised: 29 October 2023 / Accepted: 30 October 2023 / Published: 1 November 2023

(This article belongs to the Section AI Remote Sensing)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Instance segmentation in remote sensing (RS) imagery aims to predict the locations of instances and represent them with pixel-level masks. Thanks to the more accurate pixel-level information for each instance, instance segmentation has enormous potential applications in resource planning, urban surveillance, and military reconnaissance. However, current RS imagery instance segmentation methods mostly follow the fully supervised paradigm, relying on expensive pixel-level labels. Moreover, remote sensing imagery suffers from cluttered backgrounds and significant variations in target scales, making segmentation challenging. To accommodate these limitations, we propose a semantic attention enhancement and structured model-guided multi-scale weakly supervised instance segmentation network (SASM-Net). Building upon the modeling of spatial relationships for weakly supervised instance segmentation, we further design the multi-scale feature extraction module (MSFE module), semantic attention enhancement module (SAE module), and structured model guidance module (SMG module) for SASM-Net to enable a balance between label production costs and visual processing. The MSFE module adopts a hierarchical approach similar to the residual structure to establish equivalent feature scales and to adapt to the significant scale variations of instances in RS imagery. The SAE module is a dual-stream structure with semantic information prediction and attention enhancement streams. It can enhance the network’s activation of instances in the images and reduce cluttered backgrounds’ interference. The SMG module can assist the SAE module in the training process to construct supervision with edge information, which can implicitly lead the model to a representation with structured inductive bias, reducing the impact of the low sensitivity of the model to edge information caused by the lack of fine-grained pixel-level labeling. Experimental results indicate that the proposed SASM-Net is adaptable to optical and synthetic aperture radar (SAR) RS imagery instance segmentation tasks. It accurately predicts instance masks without relying on pixel-level labels, surpassing the segmentation accuracy of all weakly supervised methods. It also shows competitiveness when compared to hybrid and fully supervised paradigms. This research provides a low-cost, high-quality solution for the instance segmentation task in optical and SAR RS imagery.

Keywords:

weakly supervised instance segmentation; remote sensing imagery; semantic attention; structured model; multi-scale feature extraction

1. Introduction

In recent years, there has been a growing diversification in remote sensing imagery, accompanied by an expansion of its application domains. Researchers have made significant progress in scene classification [1,2,3], object detection [4,5,6], and semantic segmentation [7,8,9,10] in RS imaging, achieving impressive results. Among them, scene classification aims to assign a predefined class label or category to an entire image. Object detection aims to identify and locate multiple objects in an image. Semantic segmentation involves labeling each pixel in an image with its corresponding class or category, and its objective is to partition the image into meaningful regions based on semantic content. In contrast to the simple tasks mentioned above, the instance segmentation task in RS imagery combines the benefits of object detection and semantic segmentation tasks, aiming to predict the locations of instances in RS imagery and represent them through pixel-level masks. This is currently a hot research topic in remote sensing imagery. This task can extract more comprehensive information than scene classification, object detection, and semantic segmentation tasks. Benefiting from each instance’s more precise pixel-level information, instance segmentation in RS imagery has tremendous potential applications in resource planning, urban surveillance, and military reconnaissance. Therefore, this study concentrates on the more challenging task of instance segmentation in RS imagery.

Several instance segmentation methods for RS images have been developed in the past few years, influenced by deep convolutional neural networks (DCNNs). These methods still have considerable room for improvement concerning label dependency, segmentation accuracy, and scene adaptability. Su et al. [11] proposed a high-resolution feature pyramid network (FPN) with multiple levels of features in the network architecture. They also refined the information flow between branches, achieving good results in instance segmentation tasks for optical and SAR RS imagery. Chen et al. [12] carefully designed a bounding box attention module and a bounding box filter module to suppress noise in the background and enable the network to adapt to objects in the RS imagery, resulting in more accurate instance masks. Zhao et al. [13] proposed a SAR ship instance segmentation method based on the synergistic attention mechanism, which incorporates the advantages of multiple attention mechanisms and effectively extracts instance masks of objects in the image. Fan et al. [14] improved the feature extraction performance of the module by integrating the Swin Transformer into the feature extraction stage. They promoted the information exchange between the bounding box branch and the mask branch during the network’s design process, significantly improving the segmentation quality of optical and SAR RS imagery. Wei et al. [15] paid full attention to the low-level features in the feature extraction and fusion stages and introduced super-resolution (SR) denoising techniques for the network, achieving adaptation to various complex scenes in the instance segmentation of RS imagery.

However, almost all previous methods for the instance segmentation of RS imagery are limited to the fully supervised paradigm. These methods rely on fine-grained pixel-level labels (as shown in Figure 1a), which undoubtedly incur expensive manual annotation costs, limiting their application and generalization. Therefore, this research focuses on the instance segmentation task in a weakly supervised paradigm to reduce the model’s dependence on expensive pixel-level labels. Existing weakly supervised instance segmentation methods can be further divided into two categories based on the type of labels: image-level label-based methods [16,17,18,19,20,21,22,23] and bounding box label-based methods [24,25,26,27,28]. Image-level label-based methods can extract instance masks of objects through mask proposals [16,17], object ranking [18], or pseudo-label acquisition [19,20,21,22,23]. Although this approach can save a significant amount of label production costs, the supervisory power of image-level labels is too weak to be applied to complex remote sensing imagery. In contrast, bounding box label-based methods can perform instance segmentation in images using only simple bounding box labels, making it easier to achieve the trade-off between label production costs and supervisory power. Some bounding box label-based methods [24,25,26] typically employ extra auxiliary salient data or unsupervised methods to generate pseudo-labels, which are then used to guide network training. However, these methods involve cumbersome training steps and cannot achieve end-to-end training. Recent methods [27,28] model the relationships in the spatial domain and construct supervisory information to guide the weakly supervised training of the module in an end-to-end manner, achieving mask perception. In this study, we concentrate on bounding box label-based instance segmentation methods for RS imagery and incorporate the modeling of spatial relationships into the network design process to reduce the dependence on expensive pixel-level labels while ensuring visual processing effectiveness.

Furthermore, we observe that the supervisory information provided by bounding box labels during the network training process is coarse, leading to insufficient network sensitivity to edges and affecting the accuracy of instance segmentation. To accommodate this, we introduce the structured model to guide the model progressively during training, enabling the model to develop a representation with structured induction bias and enhance its perception of object edge information. In addition to the above limitations, as shown in Figure 1b, remote sensing imagery often has unique characteristics different from natural imagery due to factors such as imagery distance and mechanism, which may affect the segmentation performance. First, optical remote sensing imagery contains various background elements, such as the ground, vegetation, water bodies, and streets, mixed with the objects of interest, making it difficult to distinguish them from the background. SAR remote sensing imagery is also prone to interference from coastlines and noise, resulting in a cluttered background. Second, significant scale variations within and between object classes in RS images can increase the difficulty of learning neural network features. Given the challenges posed by the unique characteristics of RS imagery to the instance segmentation task, we consider these limitations when designing the network to achieve high-performance instance segmentation of remote sensing imagery.

Specifically, we propose a weakly supervised instance segmentation network for remote sensing imagery called the semantic attention enhancement and structured model-guided multi-scale weakly supervised instance segmentation network (SASM-Net) to accommodate practical problems, such as too-expensive label production costs, insufficient edge sensitivity of bounding box labels, background clutter and significant variations in target scales. The network constructs weakly supervised constraints by modeling the spatial domain relationships through the segmentation branch. It mainly includes three modules: a multi-scale feature extraction module (MSFE module), a semantic attention enhancement module (SAE module), and a structured model guidance module (SMG module). Among them, the MSFE module can create equivalent feature scales in the feature extraction using a hierarchical approach similar to the residual structure to achieve efficient multi-scale feature extraction. It is adaptable to the significant scale variations of targets in remote sensing imagery. The SAE module includes a semantic information prediction stream and an attention enhancement stream, which can enhance the network’s activation to the instances and reduce the interference of the cluttered backgrounds in the RS imagery. The SMG module can assist the SAE module in the training process to construct supervision containing edge information, thereby reducing the impact of the model’s insufficient sensitivity to edge information caused by the lack of fine-grained pixel-level labeling.

The contributions of this research can be summarized as follows:

We propose SASM-Net for weakly supervised instance segmentation tasks in optical and SAR remote sensing imagery. The segmentation branch of this network incorporates spatial relationship modeling to establish weak supervision constraints, allowing the accurate prediction of instance masks without the requirement of pixel-level labels.
We introduce an MSFE module to build equivalent feature scales through a hierarchical approach similar to the residual structure during feature extraction, achieving efficient multi-scale feature extraction to adapt to the challenge of significant scale variations of targets in remote sensing imagery.
We construct an SAE module that includes a semantic information prediction stream and an attention enhancement stream, which enhances the activation of instances and reduces interference from cluttered backgrounds in remote sensing imagery.
We propose an SMG module to assist the SAE module in building supervision containing edge information during training, reducing the impact of insufficient sensitivity to edge information caused by the lack of fine-grained pixel-level labels and improving the model’s perceptual ability for target edge information.

2. Relate Work

2.1. Supervised Instance Segmentation

Instance segmentation aims to predict the locations of instances in the image and represent them with pixel-level masks. Existing instance segmentation methods can be divided into two types of architecture: two-stage [29,30,31,32,33] and one-stage [34,35,36,37,38]. Two-stage architectures generate a series of region proposals and then predict masks for proposals. Mask R-CNN [29] indicates the mask of the region of interest (RoI) by adding a mask branch to Faster R-CNN [39]. Generally, the region proposals in two-stage-architecture methods often suffer from redundancy, resulting in subsequent branches having to encode the features redundantly. Consequently, the processing speed of these models is limited.

One-stage architectures directly cluster pixels in an image into different instances without relying on region proposals. Specifically, YOLACT [34] predicts masks through prototype masks and correlated mask coefficients, achieving faster prediction speeds. BlendMask [35] proposes a blender module based on FCOS [40] that effectively combines instance-level and fine-grained semantic information for the instance segmentation task. CondInst [36] dynamically generates conditional convolutions to adapt to different instances to perceive object instance masks accurately. SOLO [37] transforms the mask perception task into a pixel classification task by assigning pixel categories based on instance positions and sizes. SOLOv2 [38] extends SOLO by predicting mask kernels, learning mask features, and matrix non-maximum suppression, achieving more competitive segmentation performance. However, the two-stage and one-stage architectures both require expensive pixel-level labels for training to extract instance masks of the target, which undoubtedly incurs expensive human labeling costs.

2.2. Weakly Supervised Instance Segmentation

Recently, weakly supervised instance segmentation methods have gained attention as a feasible solution to alleviate the cost of label creation. Based on the type of labels, existing weakly supervised methods can be further divided into two categories: image-level label-based methods [16,17,18,19,20,21,22,23] and bounding box label-based methods [24,25,26,27,28]. Image-level label-based methods extract instance masks through mask proposals [16,17], object ranking [18], or pseudo-label acquisition [19,20,21,22,23]. Zhou et al. [18] propose a method called peak response map (PRM), which uses class peak responses as image-level annotations to train neural networks for segmentation. Laradji et al. [19] first train an image classifier to obtain PRM from image-level class labels, then use MCG to extract candidate regions and take PRM as the weights of candidate regions to provide pseudo-labels for training the instance segmentation network. Although these image-level label-based methods can save many labeling costs, their supervision capabilities are too weak and cannot be applied to complex remote sensing imagery.

On the other hand, bounding box label-based methods can segment instances of interest only with simple bounding box labels, making it easier to balance label production costs and supervision capabilities. Some bounding box label-based methods [24,25,26] usually generate pseudo-labels through extra auxiliary salient data or unsupervised methods. Specifically, Khoreva et al. [24] adopt unsupervised methods to generate pseudo-labels and then utilize recursive training to denoise the network so that it can predict masks. Wang et al. [25] regard the instance segmentation problem in bounding box supervision as a category-independent target mask extraction and propose to combine bounding box supervision information with salient detection to generate pseudo-labels using additional auxiliary salient data, which achieves good segmentation results. However, these methods involve tedious training steps and struggle to achieve end-to-end training. Recent methods [27,28] model relationships and construct supervisory information for guidance during weakly supervised training to realize mask perception. Hsu et al. [27] formulate the problem of bounding box label-based instance segmentation into a multi-instance learning problem and model the pairwise similarity relationship between pixels to generate target instance masks. Lan et al. [28] combine the bounding box label-based instance segmentation task with the semantic correspondence task and introduce consistency constraints between degenerated and original models to improve mask perception. This study concentrates on the bounding box label-based instance segmentation method and adopts the idea of modeling spatial relationships in network design to ensure visual processing effects while reducing dependence on expensive pixel-level labels.

3. Methodology

3.1. Overview

The proposed SASM-Net follows a one-stage architecture, and its holistic framework is displayed in Figure 2. Its holistic framework comprises the MSFE module, SAE module, SMG module, and segmentation branch. Our MSFE module facilitates efficient multi-scale feature extraction using a hierarchical approach similar to the residual structure, which enables the construction of equivalent feature scales and adapts to the significant scale variations often present in remote sensing imagery. The SAE module adopts a dual-stream structure comprising a semantic information prediction stream and an attention enhancement stream, which enhances the network’s activation towards the instances and attenuates interference from cluttered backgrounds in remote sensing imagery. The SMG module assists the SAE module in constructing supervised representations that incorporate edge information during training. It employs a progressive approach to guide the model’s learning with structured induction biases, thereby mitigating the impact of insufficient sensitivity to edges due to the deficiency of fine-grained pixel-level labels. The segmentation branch leverages spatial relationships to establish weakly supervised constraints, enabling the model to obtain masks without relying on expensive pixel-level labels and reducing the model’s dependency on fine-grained pixel-level labels.

3.2. Multi-Scale Feature Extraction Module

Remote sensing imagery has significant scale variations within and across object classes, making it more difficult for neural networks to learn efficient feature representations. Therefore, it is valuable to accommodate this limitation during the network design process to achieve high-performance instance segmentation of RS imagery. We employ a hierarchical approach similar to residual structures to establish equivalent feature scales to accommodate this. In this way, we can effectively explore richer receptive fields, accommodating the prominent scale variations exhibited by targets in remote sensing imagery.

In our work, we utilize Res2Net [41], a simple and efficient multi-scale backbone network, to extract features at different scales. Res2Net can mine richer receptive fields at a finer level. According to Figure 3, the Bottle2neck module of Res2Net employs multiple 1 × 1 and 3 × 3 convolutional layers to extract features, with the connection of these 3 × 3 convolutional layers resembling a hierarchical manner similar to the residual structure. Specifically, given an input feature map

F_{i n p u .}

, it undergoes initial processing via a 1 × 1 convolutional layer, and then the resulting feature map

F_{r e s u .}

is partitioned into four distinct groups denoted as

F_{r e s u .}^{i}

, where

i = 1, \dots, S

and

S = 4

. Subsequently, the feature map

F_{r e s u .}^{1}

from the first group is directly considered as the processed outcome, and the features of the second group’s feature map

F_{r e s u .}^{2}

are extracted through a 3 × 3 convolutional layer. Commencing from the third group’s feature map

F_{r e s u .}^{3}

, the output feature map from the previous group’s convolutional layer is integrated with the feature map of the current group, combining them as the input for the subsequent convolutional layer in this group. This operation continues until all group’s feature maps have been processed, yielding four groups of feature maps denoted as

F_{t r e a .}^{i}

. The feature maps

F_{t r e a .}^{i}

are combined and fed into a 1 × 1 convolutional layer, leading to the generation of the new feature map

F_{n e w}

. Finally, the input feature map

F_{i n p u .}

is skip-connected with

F_{n e w .}

, culminating in producing the output feature map

F_{o u t p .}

. This hierarchical residual-like approach employed within the Bottle2neck module results in numerous equivalent feature scales, enabling the efficient extraction of multi-scale features and better adaptation to significant variations in target scales observed in remote sensing imagery.

3.3. Semantic Attention Enhancement Module

Optical remote sensing imagery contains elements such as the ground, vegetation, water bodies, and streets that mix with the targets of interest, making it difficult to distinguish them from the background. SAR remote sensing imagery is also susceptible to adverse factors such as coastlines and noise, resulting in complicated backgrounds. To accommodate this limitation, we introduce an SAE module to help improve the activation of instances and reduce the interference of cluttered backgrounds in remote sensing imagery.

As depicted in Figure 4, similar to the semantic attention mechanism with semantic segmentation supervision [42], the SAE module adopts a dual-stream structure consisting of a semantic information prediction stream and an attention enhancement stream. Firstly, the feature maps

F^{i}, i = 3, \dots, 7

from the P3 to P7 layers of the FPN at different scales are normalized and fused by means of element-wise summation followed by averaging, resulting in the fused feature map

F_{f u s e .}

, which can be denoted as

F_{f u s e .} = \frac{1}{5} \sum_{i = 3}^{7} N (F^{i}),

(1)

where

N

represents the normalization operation. The fused feature map

F_{f u s e .}

is then fed into a feature extraction branch consisting of four 3 × 3 convolution layers, resulting in the intermediate feature map

F_{i n t e .}

.

F_{i n t e .} = E (F_{f u s e .}; θ_{E}),

(2)

where

E

denotes the process of extracting intermediate feature maps, while

θ_{E}

represents the parameters of four 3 × 3 convolution layers. Subsequently, the intermediate feature map

F_{i n t e .}

is sent to the semantic information prediction stream and the attention enhancement stream, respectively. The semantic information prediction stream comprises a 1 × 1 convolutional layer followed by a softmax layer. It is capable of the semantic segmentation of objects in the image, yielding semantic segmentation result

M_{s e m a .}

, represented as

M_{s e m a .} = S (F_{i n t e .}; θ_{s}),

(3)

where

S

signifies the prediction process of the semantic information prediction stream and

θ_{s}

represents the parameters of the semantic information prediction stream. The supervision for the semantic information prediction stream is assisted by the SMG module, which is further explained in Section 3.4. On the other hand, the attention enhancement stream involves two steps: generating the semantic attention map and performing element-wise multiplication semantic enhancement. Specifically, for the intermediate feature map

F_{i n t e .}

, we first employ a 1 × 1 convolutional layer to generate the semantic attention map

F_{a t t e .}

. Then, we multiply the fused feature map

F_{f u s e .}

by

F_{a t t e .}

to generate the enhanced feature map

F_{e n h a .}

, which activates the instances. The generation of the semantic attention map and the element-wise multiplication semantic enhancement can be represented by Equations (4) and (5).

F_{a t t e .} = A (F_{i n t e .}; θ_{A}),

(4)

F_{e n h a .} = F_{f u s e .} ⊙ F_{a t t e .},

(5)

where

A

is the operation of generating the semantic attention map and

θ_{A}

corresponds to the parameters of the convolutional layer in the attention enhancement stream. Finally, we resize the enhanced feature map

F_{e n h a .}

to match the scale of the feature maps outputted by the FPN. The adjusted feature maps are denoted as

F_{e n h a .}^{i}

, where

i = 3, \dots, 7

. We integrate

F_{e n h a .}^{i}

and

F^{i}

with skip connections to generate the final feature maps

F_{f i n a .}^{i}

of our SAE module, expressed as follows:

F_{f i n a .}^{i} = F_{e n h a .}^{i} + F^{i} .

(6)

The final output feature maps

F_{f i n a .}^{i}

can provide inputs to subsequent task branches and facilitate the network’s efficient perception of instance masks for objects of interest. In summary, the SAE module mentioned above introduces the semantic attention mechanism through a carefully designed dual-stream structure, which can enhance the activation of SASM-Net towards the objects of interest, demonstrating adaptability to cluttered backgrounds in optical and SAR remote sensing imagery.

3.4. Structured Model Guidance Module

From an intuitive perspective, the supervision information provided by bounding box labels during the network’s training process is coarse, leading to insufficient sensitivity to edges. Additionally, constructing the segmentation branch through spatial relationship modeling lacks a design that explicitly considers edge information. Considering the refined capabilities of structured models in the segmentation task [43], we introduce the SMG module into SASM-Net to enhance the network’s sensitivity towards edges.

Unlike existing partially weakly supervised instance segmentation methods that simply treat the structured model as a mechanical post-processing approach [27,44], our SMG module implicitly guides the model to produce representations with a structured inductive bias using the structured model to improve the model’s ability to perceive target edge information. We first correct the predicted segmentation masks of the segmentation branch using a structured model. Then, by combining these updated masks with the semantic segmentation output from the semantic information prediction stream, we construct a teacher loss function to supervise the SAM module. In this manner, the SMG module cleverly integrates the capabilities of the structured model with the supervised task of the SAM module’s prediction stream, reducing the impact of the low sensitivity of the model to edge information caused by the lack of fine-grained pixel-level labeling and improving the model’s awareness of target edges.

Expressly, we represent the input image as

I

and the generated instance mask of the segmentation branch in SASM-Net as

M_{p r e d .} = {M_{p r e d .}^{1}, \dots, M_{p r e d .}^{N}}

. Each

M_{p r e d .}^{n}

in

M_{p r e d .}

corresponds to a bounding box

B^{n}

, where

n = 1, \dots, N

and

N

is the total number of instances in the image. Firstly, we construct an undirected graph

G = (V_{g r a p .}, E_{g r a p .})

on the

I

, where

V

is the set of pixels in the image and

E_{g r a p .}

is the set of edges. Then, we define a random field

R^{n}

on

G

and let

X^{n}

be the labeling of random field

R^{n}

within bounding box

B^{n}

. Each node

v^{i}, i \in B^{n}

in bounding box

B^{n}

is connected to its eight neighboring nodes

v^{j}, j \in N_{n e i g .}^{i}

, where

i

and

j

are indices representing pixels, and

N_{n e i g .}^{i}

represents the set of neighboring pixels of pixel

i

. Next, we define a structured model using the Gibbs energy

E_{G i b b .}

as follows:

E_{G i b b .} = τ_{u} (X^{n}) + τ_{p} (X^{n}),

(7)

where

τ_{u} (X^{n}) = \sum_{i \in B^{n}} ψ (X_{i}^{n})

is the unary potential constructed based on the predicted target mask

M_{p r e d .}

from the segmentation branch.

τ_{p} (X^{n})

is the pairwise potential that simulates pairwise pixel relationships within the bounding box, which can be further written as

τ_{p} (X^{n}) = \sum_{i \in B^{n}, j \in N_{n e i g .}^{i}} 1_{X_{i}^{n} \neq X_{j}^{n}} w \exp (- \frac{| C_{i}^{n} - C_{j}^{n} |}{2 η^{2}}),

(8)

where

w

and

η

are the weight and scale of the Gaussian kernel.

1_{X_{i}^{n} \neq X_{j}^{n}}

is the indicator function that becomes 1 if

X_{i}^{n} \neq X_{j}^{n}

and 0 otherwise.

C_{i}^{n}

and

C_{j}^{n}

represent pixels

i

and

j

color information within bounding box

B^{n}

. During the training process, we optimize the structured model by minimizing the Gibbs energy

E_{G i b b .}

using the standard mean-field, injecting edge detail information into the mask prediction results of the segmentation branch, and then achieving further refinement of the predicted instance mask

M_{p r e d .}

.

For the refined instance mask

M_{r e f i .}

, we consider it as the ground truth for the semantic information prediction stream in the SAE module. On this basis, we construct a teacher loss function

L_{t e a c .}

to supervise the prediction stream (referred to as the student) in the SAE module, thereby incorporating the edge detail information into SASM-Net and enhancing its segmentation capability for remote sensing imagery. More specifically, the teacher loss function

L_{t e a c .}

is constructed based on cross-entropy loss and can be expressed as

L_{t e a c .} = - \frac{1}{H \times W} \sum_{\begin{array}{l} {\tilde{y}}_{i j} \in M_{p r e d .} \\ y_{i j} \in M_{r e f i .} \end{array}} \sum_{k = 0}^{C} y_{i j}^{k} \log ({\tilde{y}}_{i j}^{k}),

(9)

where

k

is the index of the class.

{\tilde{y}}_{i j}^{k}

denotes the probability that a pixel in row i, column j, in the semantic segmentation prediction of the prediction stream belongs to category

k

. Similarly,

y_{i j}^{k}

represents the probability that a pixel in row i, column j, in the refined instance mask

M_{r e f i .}

belongs to class

k

.

C

is the total number of categories.

Our STM module effectively integrates the structured model with the semantic information prediction stream of SAM, implicitly guiding the model to progressively learn representations with structured inductive biases, thereby enhancing the model’s perception of target edge information.

3.5. Segmentation Branch

Existing instance segmentation methods for remote sensing imagery are predominantly based on the fully supervised paradigm, relying heavily on fine-grained pixel-level labels. These methods undoubtedly incur expensive manual label production costs, limiting their application and generalization. Bounding box label-based weakly supervised instance segmentation methods can offer a trade-off between labeling cost and supervision capability, as they can segment objects of interest using simple bounding box annotations. Unlike approaches that utilize extra auxiliary salient data or generate pseudo-labels through unsupervised methods, some advanced bounding box label-based weakly supervised instance segmentation methods typically establish weakly supervised losses by modeling relationships in the spatial domain, which allows constraints to be imposed on the model, facilitating end-to-end weakly supervised training [27,28].

The segmentation branch in SASM-Net inherits the concept of establishing weakly supervised constraints by modeling spatial relationships. It enables the model to obtain masks without needing pixel-level labels and reduces its dependency on expensive fine-grained annotations. Specifically, we first model the projection consistency relationship between the bounding box and the mask in the spatial domain through the projection consistency loss term

L_{p r o j .}

, which can enforce that the predicted mask is enclosed by its corresponding bounding box. The formulation of

L_{p r o j .}

is as follows:

\begin{matrix} L_{p r o j .} & = L_{d i c e} (P_{h} (M_{p r e d .}), P_{h} (B_{g r o u .})) \\ + L_{d i c e} (P_{v} (M_{p r e d .}), P_{v} (B_{g r o u .})), \end{matrix}

(10)

where

M_{p r e d .}

represents the predicted mask from the segmentation branch and

B_{g r o u .}

is the ground truth bounding box corresponding to the predicted mask

M_{p r e d .}

.

P_{h}

and

P_{v}

denote the horizontal and vertical projection calculations, respectively.

L_{d i c e}

refers to the dice loss [45].

Furthermore, we model the pairwise similarity relationship between pixels in the spatial domain by introducing the color similarity loss term

L_{c o l o .}

, which enables the separation of objects from the background. The formulation of

L_{c o l o .}

can be expressed as

L_{c o l o .} = - \frac{1}{N_{t o t a .}} \sum_{(a, b) \in E_{a l l}} 1_{S^{(a, b)} \geq σ} \log P (y^{(a, b)} = 1),

(11)

where

(a, b)

represents a pixel pair composed of pixels a and b within the same neighborhood, and

y^{(a, b)} \in {0, 1}

indicates whether the pixel pair

(a, b)

belongs to the same class, where

y^{(a, b)} = 1

when pixels a and b belong to the same class and

y^{(a, b)} = 0

when their classes are different.

P (y^{(a, b)} = 1)

denotes the prediction probability of pixels a and b belonging to the same class when predicting the mask.

E_{a l l}

is the set of all pixel pairs in the bounding box, and

N_{t o t a .}

is the whole number of pixel pairs in

E_{a l l}

. The indicator function

1_{S^{(a, b)} \geq σ}

takes a value of 1 if

S^{(a, b)}

is greater than or equal to

σ

and 0 otherwise.

S^{(a, b)}

represents the color similarity between pixels a and b, which is computed using the pixel values

a_{L A B}

and

b_{L A B}

of pixel pair

(a, b)

in the LAB color space, as defined below:

S^{(a, b)} = \exp (- \frac{1}{λ} | a_{L A B} - b_{L A B} |),

(12)

where

λ

is a hyperparameter set to 2. The conversion of pixel pair

(a, b)

to the LAB color space in the above formula is performed because the LAB color space is more uniformly distributed and more consistent with human visual perception. It is worth noting that for SAR images, which are single-channel grayscale images, we ignore the color channels of the LAB color space and only focus on the luminance channel because it is closely related to the brightness perception of the human eye. For three-channel optical images, we retain the color channels to represent color characteristics.

Finally, the loss function

L_{s e g m .}

for the segmentation branch is composed of two loss terms mentioned above, and can be written as

L_{s e g m .} = L_{p r o j .} + L_{c o l o .} .

(13)

This loss function allows the modeling of the relationships in the spatial domain and imposing constraints on the network during network training, which enables instance segmentation using only bounding box annotations and achieves a balance between labeling costs and supervisory capability.

Regarding the network architecture, the segmentation branch of SASM-Net inherits the dynamic perception concept from CondInst [36], allowing for the dynamic adjustment of the weights based on different instances, thus providing excellent flexibility. As illustrated in Figure 5, this branch includes a controller, a bottom-level feature converter, and an FCN head. The controller is used repeatedly for multiple output feature maps from the SAE module and has a structure of a 3 × 3 convolutional layer, primarily responsible for predicting the parameters of the FCN head. The bottom-level feature converter is connected to the bottom-level feature map

F_{f i n a .}^{3}

, which can convert the bottom-level feature map

F_{f i n a .}^{3}

into the input of the FCN head through four 3 × 3 convolutional layers and one 1 × 1 convolutional layer. The FCN head consists of three 1 × 1 convolutional layers with parameters predicted by the controller, which can further process the output from the bottom-level feature converter to obtain the corresponding instance segmentation results

M_{p r e d .}

. Compared to methods where the parameters of the segmentation branch are fixed for all instances, the segmentation branch of SASM-Net can be dynamically adjusted based on the predictions from the controller, allowing for effective and flexible adaptation to different instances with good flexibility.

Overall, the segmentation branch models the projection consistency relationship between the bounding box and the mask in the spatial domain and the pairwise similarity relationship between pixels, thereby imposing constraints on the network during training. These constraints guide the model to obtain masks without requiring pixel-level labels, reducing the model’s reliance on expensive fine-grained annotations. Additionally, this branch can dynamically adapt to different targets in remote sensing imagery through the controller, enabling efficient and flexible segmentation of the instances.

4. Experiments

4.1. Datasets

NWPU VHR-10 instance segmentation dataset: This dataset [11] expands the NWPU VHR-10 dataset [46,47] and was annotated at the pixel level to facilitate the instance segmentation task. This dataset comprises 800 optical images, with 650 images containing instances and 150 pure background images. The image sizes vary from 533 × 597 to 1728 × 1028 pixels. This dataset includes ten categories, namely airplane (AI), baseball diamond (BD), ground track field (GTF), vehicle (VC), ship (SH), tennis court (TC), harbor (HB), storage tank (ST), basketball court (BC), and bridge (BR). For our experiments, we randomly selected 70% of the images containing instances (i.e., 454 images) as the training set and 30% of the images containing instances (i.e., 196 images) as the testing set. It is important to note that the pixel-level labels were only used during the testing process of SASM-Net and not during the training process.

SSDD dataset: This dataset [11] comprises 1160 SAR images obtained from RadarSat-2, TerraSAR-X, and Sentinel-1, with 2456 ships in this dataset. In our experiments, we divided this dataset randomly into training and testing sets in a 7:3 ratio, with 812 images in the training set and 348 images in the testing set. During the training process of SASM-Net, we did not use pixel-level labels and relied solely on box-level labels to supervise the training of MGWIS-Net. During the testing process, we utilized pixel-level labels to compute the test results.

4.2. Implementation Details

Our experiments were conducted using the PyTorch framework. SASM-Net was optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 1 × 10⁻⁴. The GPU used in the experiments was NVIDIA Tesla V100. The training lasted a total of 12 epochs. A batch size of 4 was set, and the base learning rate was initialized as 2.5 × 10⁻³. Furthermore, the learning rate was reduced by a factor of 10 at epochs 8 and 11. During training, we randomly resized the input image’s shorter side to 600, 700, 800, 900, and 1000 while keeping the longer side fixed at 1333. We set the size of the feature maps from the P4 to P7 layers of the FPN to the size of the feature map from the P3 layer in the normalization process to focus appropriately on low-level features. In the segmentation branch,

σ

was 0.3 to balance the positive sample ratio and the number of samples. To prevent overfitting of the network to the results of the structured model processing, we removed the structured model at the eighth epoch. The training process of SASM-Net relied solely on simple bounding box labels to supervise the network and achieve the segmentation of objects in RS images. For testing, pixel-level labels were utilized to evaluate the network’s segmentation performance.

4.3. Evaluation Metrics

The segmentation methods were comprehensively evaluated according to the evaluation metric of MS COCO [48]. Firstly, the intersection over union IoU between the predicted mask

M_{p r e d .}

generated by the network and their corresponding ground truth

M_{g r o u .}

was calculated as follows:

I o U = \frac{M_{p r e d .} \cap M_{g r o u .}}{M_{p r e d .} \cup M_{g r o u .}} .

(14)

Subsequently, the instance segmentation results were then categorized by

I o U

into true positive (TP), false positive (FP), true negative (TN), and false negative (FN), and the precision

P

and recall

R

can be written as

P = \frac{T P}{T P + F P},

(15)

R = \frac{T P}{T P + F N} .

(16)

The average precision for a given

I o U

(

A P_{I o U}

) can be expressed as

A P_{I o U} = \int_{0}^{1} P (r) d r,

(17)

where

r

denotes recall, while

P (r)

represents the precision corresponding to the recall

r

. AP is the average of

A P_{I o U}

at 10

I o U

thresholds ranging from 0.50 to 0.95 with a step size of 0.05, which can be obtained by

A P = \frac{1}{10} \sum_{I o U = 0.5}^{0.95} A P_{I o U} .

(18)

where AP₅₀ and AP₇₅ signify the computed AP at

I o U

thresholds of 0.5 and 0.75, respectively. Additionally, AP_S, AP_M, and AP_L are designed to evaluate the model’s mask prediction performance on small (<32² pixels), medium (>32² pixels and <96² pixels), and large (>96² pixels) objects, respectively. These metrics can enable a complete assessment of the quality of the model’s mask predictions.

4.4. Experimental Results on the NWPU VHR-10 Instance Segmentation Dataset

The SASM-Net proposed in this study not only achieves basic weakly supervised instance segmentation by modeling spatial relationships, but also adapts to the limitations of cluttered backgrounds, insufficient sensitivity to edges, and significant scale variations. We conducted many experiments to evaluate the effectiveness of SASM-Net in instance segmentation tasks based on optical remote sensing images. The quantitative experimental performance is documented in Table 1 and Table 2, and the qualitative performance is illustrated in Figure 6. It is important to note that we compare SASM-Net with other methods, which can be categorized into three groups: weakly supervised paradigm methods, fully supervised paradigm methods, and hybrid supervised paradigm methods. The details of these methods used for comparison are similar to [49] and are briefly described as follows:

Weakly supervised paradigm methods: We categorize the compared weakly supervised methods into two types: adaptations of fully supervised methods and dedicated weakly supervised instance segmentation methods. Adaptations of fully supervised methods directly treat the object-level labels from annotations as bounding box labels to train the original fully supervised methods. Dedicated weakly supervised methods are designed explicitly for bounding box labels, including BoxInst [50], DiscoBox [28], DBIN [51], and MGWI-Net [49]. For DBIN, we exclude the domain adaptation aspect as it is beyond the scope of this paper. It should be noted that adaptations of fully supervised methods directly adopt the bounding box labels from annotations as pixel-level labels, thus requiring only consistent labeling with bounding box labels, and we classify them as weakly supervised paradigm methods.
Fully supervised paradigm methods: Fully supervised methods perform instance segmentation by training with finely annotated pixel-level labels, which imposes expensive labeling costs. We select several representative fully supervised methods for comparison with the proposed SASM-Net.
Hybrid supervised paradigm methods: To further compare with our proposed method, we also design a series of hybrid supervised methods. Specifically, we combine partial pixel-level labels with some object-level labels for network training. The labeling cost of this paradigm falls between weakly supervised and fully supervised methods.

As shown in Table 1, compared to the adaptations of fully supervised methods in the weakly supervised paradigm, the proposed SASM-Net achieves excellent segmentation accuracy. Its AP values are 33.3 and 36.0 higher than those of Mask R-CNN and CondInst, respectively. Dedicated weakly supervised paradigm methods can achieve better segmentation results than adaptations of fully supervised methods, and their AP values are significantly higher than those of fully supervised methods. Furthermore, our network also demonstrates advantages over dedicated weakly supervised methods, with an AP value 5.5 higher than that of BoxInst, a popular method used for comparison. Overall, the segmentation performance of the proposed SASM-Net surpasses that of the weakly supervised instance segmentation methods. Table 2 reports the performance for different categories in the NWPU VHR-10 instance segmentation dataset. Except for the slightly inferior AP value of category HB, our SASM-Net consistently provides the best segmentation results compared to the other weakly supervised paradigm methods. This indicates that our SASM-Net performs exceptionally well in completing instance segmentation tasks based on optical remote sensing imagery.

We also compare the proposed SASM-Net with hybrid supervised paradigm methods and fully supervised paradigm methods, as shown in Table 1 and Table 2. Regarding the hybrid supervised paradigm methods, the higher the proportion of pixel-level labels during training, the better the instance segmentation performance. We find that even when compared to a hybrid supervised method with 75% pixel-level labels, our method achieves higher AP values, further highlighting the strong segmentation capabilities of SASM-Net. Regarding the fully supervised paradigm methods, the proposed SASM-Net significantly outperforms the one-stage YOLACT and has only a 5.7% lower AP value than the classical Mask R-CNN. This demonstrates that our SASM-Net achieves competitive segmentation results compared to fully supervised methods without the need for complex training with pixel-level labels.

Furthermore, the rightmost column of Table 1 and Table 2 reflects the processing speed and parameters of the models. Since the methods under the hybrid supervised and fully supervised paradigms differ from weakly supervised methods only in the training process, they are essentially similar in processing speed and model parameters. Therefore, we only report the processing speed and model parameters of YOLACT, Mask R-CNN, and CondInst under the weakly supervised paradigm. YOLACT demonstrates a significant advantage over other methods regarding processing speed and model parameters. However, its segmentation accuracy metrics are notably poor, indicating that the speed and parameter advantages come at the cost of substantially reduced segmentation accuracy. On the other hand, BoxInst, DBIN, and MGWI-Net are all weakly supervised segmentation approaches built upon CondInst. Consequently, they share similar structures and exhibit comparable processing speeds and model parameters. DiscoBox achieves a slightly faster processing speed than SASM-Net, albeit with more parameters. Due to its intricate design that includes the multi-scale feature extraction module, semantic attention enhancement module, and structured model guidance module, SASM-Net has a slightly slower processing speed. Although SASM-Net does not possess a significant advantage with regard to processing speed, it achieves higher segmentation accuracy compared to other weakly supervised segmentation methods and offers lower annotation costs than fully supervised and hybrid supervised segmentation methods. Overall, SASM-Net demonstrates commendable performance.

Figure 6 presents the qualitative experimental results of SASM-Net and its compared methods based on the NWPU VHR-10 instance segmentation dataset. The first column depicts the ground truth, the fifth column showcases the segmentation results obtained by SASM-Net, and the sixth column displays the feature maps of SASM-Net.

In Figure 6a and Figure 6b, we compare SASM-Net with hybrid supervised paradigm methods. The second through fourth columns in Figure 6a demonstrate the instance segmentation performance of CondInst at pixel-level ratios of 25%, 50%, and 75%, respectively. Similarly, the second to fourth columns in Figure 6b exhibit the segmentation results of Mask R-CNN at pixel-level ratios of 25%, 50%, and 75%, respectively. The resulting segmentation is still coarse, even with a pixel-level label ratio of 75% in hybrid supervised methods. In contrast, SASM-Net produces more refined masks that effectively cover the regions of interest. Moving on to Figure 6c and Figure 6d, we compare SASM-Net with fully supervised paradigm methods. The second through fourth columns illustrate the performance of YOLACT, Mask R-CNN, and CondInst, respectively. This set of experimental results shows that our proposed SASM-Net achieves excellent segmentation performance, even when compared to fully supervised segmentation methods. It can avoid noticeable over-segmentation or under-segmentation artifacts, demonstrating its exceptional mask prediction capability. Next, in Figure 6e and Figure 6f, we compare SASM-Net with adaptations of fully supervised methods within weakly supervised paradigms. The second through fourth columns depict the instance segmentation performance obtained by YOLACT, Mask R-CNN, and CondInst, respectively. We observed that the absence of pixel-level labels leads to severe segmentation errors, whereas our SASM-Net accurately extracts object masks from the images. Subsequently, in Figure 6g and Figure 6h, we compare SASM-Net with dedicated weakly supervised networks. The second through fourth columns display the instance segmentation performance of BoxInst, DBIN, and MGWI-Net, respectively. Compared to these methods, our SASM-Net better handles edge details for the ground track field in Figure 6g and accurately extracts masks even for indistinct ships in Figure 6h.

In addition, the feature maps in the sixth column provide a clear visualization of SASM-Net’s perception of objects within the image. It can be observed that SASM-Net exhibits higher intensity in the feature maps near the target locations, further confirming its high sensitivity to regions surrounding the objects of interest. This enhanced heat map response around the object-related feature maps signifies the powerful capability of SASM-Net in extracting target-specific features, laying the foundation for its outstanding performance in instance segmentation tasks. In conclusion, our proposed SASM-Net exhibits remarkable advantages in optical remote sensing imagery segmentation. It can accurately acquire the desired instances’ masks even without pixel-level annotations.

4.5. Experimental Results on the SSDD Dataset

Similarly, we compare SASM-Net with weakly supervised paradigm methods, fully supervised paradigm methods, and hybrid supervised paradigm methods to validate the performance of the proposed approach in SAR remote sensing imagery instance segmentation. The quantitative and qualitative results are presented in Table 3 and Figure 7, respectively.

As illustrated in Table 3, our method provides significant advantages over adaptations of fully supervised methods, with a noticeable improvement in instance segmentation metrics. The AP values of our proposed SASM-Net are higher than those of the weakly supervised Mask R-CNN and CondInst, respectively. Additionally, the AP value of SASM-Net is higher than those of all other dedicated weakly supervised methods used for comparison, indicating its strong adaptability for SAR remote sensing imagery instance segmentation tasks.

Compared to hybrid supervised paradigm methods, our SASM-Net demonstrates superior segmentation performance, with AP values no lower than those achieved with 75% pixel-level label supervision in YOLACT, Mask R-CNN, and CondInst. Compared to fully supervised paradigm methods, the segmentation results of the proposed SASM-Net implementation exceed the fully supervised YOLACT 10.0 AP values, even without relying on pixel-level label supervision. This further highlights the excellent performance of SASM-Net. Overall, the comprehensive comparisons with weakly supervised, fully supervised, and hybrid supervised paradigm methods show that our SASM-Net can obtain good segmentation results for SAR remote sensing imagery without relying on costly pixel-level labels, striking a balance between labeling cost and segmentation accuracy.

The rightmost column of Table 3 illustrates the processing speed of the models. YOLACT achieves the best results in terms of processing speed. However, this comes at the expense of significantly reduced segmentation accuracy. CondInst, BoxInst, DBIN, and MGWI-Net exhibit comparable processing speeds due to the similar structures employed. SASM-Net, consisting of MSFE, SAE, and SMG modules, has a relatively complex structure, resulting in a slightly slower processing speed when compared to other approaches. Based on the previous analysis, it is evident that although SASM-Net does not possess a significant advantage in terms of inference speed, it still offers lower annotation costs compared to fully supervised and hybrid supervised segmentation methods while achieving higher segmentation accuracy than other weakly supervised methods. Overall, SASM-Net shows a commendable performance.

Figure 7 presents the qualitative results of SASM-Net and the compared methods based on the SSDD dataset. In Figure 7a and Figure 7b, we compare SASM-Net with hybrid supervised paradigm methods. We find that SASM-Net can obtain accurate instance masks, while the hybrid supervised paradigm methods, particularly with a 25% and 50% pixel-level label proportion, tend to produce poor segmentation quality. In Figure 7c and Figure 7d, we compare SASM-Net with fully supervised paradigm methods. The proposed SASM-Net successfully extracts instance masks of objects and accurately locates small objects in Figure 7d. In Figure 7e and Figure 7f, we compare SASM-Net with adaptations of fully supervised methods in the weakly supervised paradigm. While SASM-Net is able to generate good segmentation masks, the adapted fully supervised methods suffer from severe segmentation errors due to the lack of pixel-level labels. In Figure 7f, the image exhibits significant speckle noise, but our SASM-Net can accurately extract the masks of the objects from this noise. In Figure 7g and Figure 7h, we showcase the comparison of SASM-Net with dedicated weakly supervised paradigm methods, and our method can accurately locate all the ships in Figure 7g and Figure 7h and extract their masks accurately. Furthermore, in the feature maps of the rightmost column, regions near the ships exhibit higher intensity. This phenomenon underscores the ability of SASM-Net to effectively focus on the areas surrounding the ships and extract relevant vital features, highlighting the exceptional capability of SASM-Net in extracting target features. Overall, the proposed SASM-Net can accurately predict instance masks without relying on pixel-level labels, which is crucial for balancing labeling costs and segmentation results.

4.6. Ablation Study

Considering that SASM-Net consists primarily of the MSFE, SAE, and SMG modules, we conducted detailed ablation studies based on the NWPU VHR-10 instance segmentation and SSDD datasets in this section to explore the contributions of these modules to instance segmentation in optical and SAR remote sensing imagery. The main results are presented in Table 4 and Table 5. During the ablation studies, we treated the SASM-Net without the MSFE, SAE, and SMG modules as the baseline model, and the segmentation branch directly supervised the predicted flow of the SAE module.

Table 4 reflects the results of the ablation experiments based on the NWPU VHR-10 instance segmentation dataset for the different modules in SASM-Net. After incorporating the MSFE module, SASM-Net improved the AP, with the segmentation accuracy metrics AP_S (for small objects) and AP_L (for large objects) increasing by 1.1 and 1.5, respectively. This indicates that establishing equivalent feature scales enhances the network’s perception of interested targets and improves its ability to perceive objects of different scales in remote sensing imagery. With the introduction of the SAE module, the segmentation performance of SASM-Net was further improved, demonstrating the benefits of this module for enhancing the model’s segmentation effectiveness. After introducing the SMG module, the segmentation performance also exhibited improvement, with an increase of 2.1 in AP₇₅, indicating that this module’s focus on the edge details of instances enables the network to obtain more accurate masks.

Table 5 reports the results of the ablation experiments on the SSDD dataset for the different modules in SASM-Net. With the introduction of the MSFE module, SASM-Net showed improvements in AP_S, AP_M, and AP_L, indicating the positive impact of multi-scale feature extraction. With the inclusion of the SAE and SMG modules, the proposed method also exhibited improvements in AP value, demonstrating the contributions of these two modules to enhancing SAR remote sensing image segmentation performance. The network achieved the best segmentation performance when all three modules were incorporated into SASM-Net. These ablation studies demonstrate that the carefully designed three modules can enhance the segmentation performance of the network for SAR remote sensing imagery.

In this study, our SMG module ingeniously integrates the structured model with the semantic information prediction stream of the SAM module, which can implicitly lead the model to obtaining a representation with structured inductive bias. In some weakly supervised instance segmentation methods, the structured model is commonly employed as a post-processing technique. To compare the effectiveness of our implicit guidance approach with post-processing methods, we further designed the experiments shown in Table 6. In these experiments, the baseline model is SASM-Net with only the SAE module, and it is essential to note that the predicted flow of the SAE module, in this case, is also directly supervised by the segmentation branch.

As demonstrated in Table 6, incorporating the structured model through post-processing and implicitly guided approaches in SASM-Net led to improvements in instance segmentation performance compared to the baseline. This indicates the positive influence of introducing the structured model on instance segmentation tasks for remote sensing imagery. Comparatively, the implicitly guided approach employed in SASM-Net outperformed mechanical post-processing techniques, underscoring its effectiveness in enabling the network to achieve better mask prediction capabilities. In conclusion, our ablation study reveals that integrating the structured model into SASM-Net using an implicitly guided approach yields improved instance segmentation performance. This approach surpasses traditional post-processing techniques and facilitates superior mask prediction abilities in the network.

5. Discussion

We propose SASM-Net, a network for weakly supervised instance segmentation tasks in optical and SAR remote sensing imagery. It addresses practical challenges such as too-expensive label production costs, insufficient edge sensitivity of bounding box labels, background clutter, and significant variations in target scales. This network establishes weak supervision constraints by modeling spatial relationships and incorporates the MSFE, SAE, and SMG modules. Overall, this network accurately predicts instance masks without relying on pixel-level labels, surpasses the segmentation accuracy of all weakly supervised methods, and demonstrates strong competitiveness over hybrid and fully supervised approaches. This research provides a low-cost, high-quality instance segmentation solution in optical and synthetic aperture radar remote sensing imagery.

Although our proposed method has achieved remarkable results, it has some limitations. First, compared to fully supervised methods, our approach significantly reduces the annotation costs but may affect the segmentation accuracy, although this effect is minimal. Second, our method relies on bounding box labels, which, although relatively easy to create, still have certain production costs that may affect the practical application and generalization of this method. Additionally, since SASM-Net takes into account practical issues such as the insufficient edge sensitivity of bounding box labels, background clutter, and significant target scale variations and has carefully designed modules (MSFE, SAE, and SMG) to address these challenges, it does not provide significant advantages in terms of processing speed and model parameters. In practical applications, there may be certain limitations in extending this method to domains that require very accurate masks or high-speed processing, such as medical image diagnosis and autonomous driving.

In future work, we will further explore the characteristics of remote sensing imagery and design novel network structures to improve segmentation quality. We will continue to explore methods to reduce the cost of label generation and lightweight techniques to further reduce the annotation cost and processing speed of our proposed method, thus laying a better foundation for its practical application and widespread adoption. Additionally, we will consider the broader adoption of weakly supervised segmentation tasks in more challenging fields, aiming to advance its application in engineering.

6. Conclusions

In this paper, we design SASM-Net to accommodate the problems of expensive label production costs, insufficient edge sensitivity of bounding box labels, background clutter, and significant variations in target scales in remote sensing imagery instance segmentation. SASM-Net constructs weakly supervised constraint modeling spatial relationships and includes the MSFE, SAE, and SMG modules that can balance the label production costs and visual processing. The MSFE module constructs equivalent feature scales during the feature extraction, enabling efficient multi-scale feature extraction to accommodate the significant scale variations in remote sensing imagery. The SAE module is designed as a dual-stream structure with a semantic information prediction stream and an attention enhancement stream, which can enhance the network’s activation for objects of interest, reducing interference from cluttered backgrounds in remote sensing imagery. The SMG module assists the SAE module during training by constructing supervision containing edge information, thereby improving the model’s perception of target edge information. We conducted a series of experiments based on the NWPU VHR-10 and SSDD datasets. The experimental results demonstrate that SASM-Net achieves accurate instance segmentation of remote sensing imagery without relying on pixel-level labels. Its segmentation accuracy outperforms weakly supervised methods and demonstrates competitiveness compared with hybrid and fully supervised paradigms. This research provides a low-cost, high-quality solution to the instance segmentation task in optical and SAR RS imagery.

Author Contributions

Conceptualization, M.C. and Z.P.; methodology, M.C. and K.X.; software, M.C., E.C. and Y.Z.; validation, M.C., K.X. and Y.X.; formal analysis, M.C., K.X. and Y.X.; investigation, K.X. and Z.P.; resources, Z.P.; data curation, E.C. and Y.Z.; writing—original draft preparation, M.C., E.C. and Y.H.; writing—review and editing, Z.P. and Y.X.; visualization, M.C. and Y.H.; supervision, Z.P.; project administration, M.C.; funding acquisition, Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Chen, X.; Ma, M.; Li, Y.; Cheng, W. Fusing Deep Features by Kernel Collaborative Representation for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 12429–12439. [Google Scholar] [CrossRef]
Jing, W.; Zhang, M.; Tian, D. Improved U-Net Model for Remote Sensing Image Classification Method Based on Distributed Storage. J. Real-Time Image Process. 2021, 18, 1607–1619. [Google Scholar] [CrossRef]
Zhang, J.; Liu, J.; Pan, B.; Chen, Z.; Xu, X.; Shi, Z. An Open Set Domain Adaptation Algorithm via Exploring Transferability and Discriminability for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
Li, B.; Xie, X.; Wei, X.; Tang, W. Ship Detection and Classification from Optical Remote Sensing Images: A Survey. Chin. J. Aeronaut. 2021, 34, 145–163. [Google Scholar] [CrossRef]
Geng, J.; Xu, Z.; Zhao, Z.; Jiang, W. Rotated Object Detection of Remote Sensing Image Based on Binary Smooth Encoding and Ellipse-Like Focus Loss. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Yang, L.; Yuan, G.; Zhou, H.; Liu, H.; Chen, J.; Wu, H. RS-YOLOX: A High-Precision Detector for Object Detection in Satellite Remote Sensing Images. Appl. Sci. 2022, 12, 8707. [Google Scholar] [CrossRef]
Alam, M.; Wang, J.-F.; Guangpei, C.; Yunrong, L.; Chen, Y. Convolutional Neural Network for the Semantic Segmentation of Remote Sensing Images. Mob. Netw. Appl. 2021, 26, 200–215. [Google Scholar] [CrossRef]
Wang, J.-X.; Chen, S.-B.; Ding, C.H.Q.; Tang, J.; Luo, B. Semi-Supervised Semantic Segmentation of Remote Sensing Images with Iterative Contrastive Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Zhao, S.; Feng, Z.; Chen, L.; Li, G. DANet: A Semantic Segmentation Network for Remote Sensing of Roads Based on Dual-ASPP Structure. Electronics 2023, 12, 3243. [Google Scholar] [CrossRef]
Yang, Z.; Wu, Q.; Zhang, F.; Zhang, X.; Chen, X.; Gao, Y. A New Semantic Segmentation Method for Remote Sensing Images Integrating Coordinate Attention and SPD-Conv. Symmetry 2023, 15, 1037. [Google Scholar] [CrossRef]
Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet: High-Quality Instance Segmentation for Remote Sensing Imagery. Remote Sens. 2020, 12, 989. [Google Scholar] [CrossRef]
Chen, L.; Fu, Y.; You, S.; Liu, H. Efficient Hybrid Supervision for Instance Segmentation in Aerial Images. Remote Sens. 2021, 13, 252. [Google Scholar] [CrossRef]
Zhao, D.; Zhu, C.; Qi, J.; Qi, X.; Su, Z.; Shi, Z. Synergistic Attention for Ship Instance Segmentation in SAR Images. Remote Sens. 2021, 13, 4384. [Google Scholar] [CrossRef]
Fan, F.; Zeng, X.; Wei, S.; Zhang, H.; Tang, D.; Shi, J.; Zhang, X. Efficient Instance Segmentation Paradigm for Interpreting SAR and Optical Images. Remote Sens. 2022, 14, 531. [Google Scholar] [CrossRef]
Wei, S.; Zeng, X.; Zhang, H.; Zhou, Z.; Shi, J.; Zhang, X. LFG-Net: Low-Level Feature Guided Network for Precise Ship Instance Segmentation in SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
Pont-Tuset, J.; Arbelaez, P.; Barron, J.T.; Marques, F.; Malik, J. Multiscale Combinatorial Grouping for Image Segmentation and Object Proposal Generation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 128–140. [Google Scholar] [CrossRef] [PubMed]
Zhou, Y.; Zhu, Y.; Ye, Q.; Qiu, Q.; Jiao, J. Weakly Supervised Instance Segmentation Using Class Peak Response. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3791–3800. [Google Scholar]
Laradji, I.H.; Vazquez, D.; Schmidt, M. Where are the Masks: Instance Segmentation with Image-Level Supervision. arXiv 2019, arXiv:1907.01430. [Google Scholar]
Ahn, J.; Cho, S.; Kwak, S. Weakly Supervised Learning of Instance Segmentation with Inter-Pixel Relations. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2204–2213. [Google Scholar]
Zhu, Y.; Zhou, Y.; Xu, H.; Ye, Q.; Doermann, D.; Jiao, J. Learning Instance Activation Maps for Weakly Supervised Instance Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3111–3120. [Google Scholar]
Ge, W.; Guo, S.; Huang, W.; Scott, M.R. Label-PEnet: Sequential Label Propagation and Enhancement Networks for Weakly Supervised Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 3345–3354. [Google Scholar]
Arun, A.; Jawahar, C.V.; Kumar, M.P. Weakly Supervised Instance Segmentation by Learning Annotation Consistent Instances. In Proceedings of the European Conference on Computer Vision (ECCV), 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 254–270. [Google Scholar]
Khoreva, A.; Benenson, R.; Hosang, J.; Hein, M.; Schiele, B. Simple Does It: Weakly Supervised Instance and Semantic Segmentation. arXiv 2016, arXiv:1603.07485. [Google Scholar]
Wang, X.; Feng, J.; Hu, B.; Ding, Q.; Ran, L.; Chen, X.; Liu, W. Weakly-Supervised Instance Segmentation via Class-Agnostic Learning with Salient Images. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10225–10235. [Google Scholar]
Lee, J.; Yi, J.; Shin, C.; Yoon, S. BBAM: Bounding Box Attribution Map for Weakly Supervised Semantic and Instance Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2643–2651. [Google Scholar]
Hsu, C.-C.; Hsu, K.-J.; Tsai, C.-C.; Lin, Y.-Y.; Chuang, Y.-Y. Weakly Supervised Instance Segmentation Using the Bounding Box Tightness Prior. In Proceedings of the 2019 Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 6586–6597. [Google Scholar]
Lan, S.; Yu, Z.; Choy, C.; Radhakrishnan, S.; Liu, G.; Zhu, Y.; Davis, L.S.; Anandkumar, A. DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3386–3396. [Google Scholar]
He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
Chen, K.; Ouyang, W.; Loy, C.C.; Lin, D.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; et al. Hybrid Task Cascade for Instance Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4969–4978. [Google Scholar]
Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Wang, C.; Feng, J. Improving Convolutional Networks with Self-Calibrated Convolutions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10093–10102. [Google Scholar]
Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 9156–9165. [Google Scholar]
Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8570–8578. [Google Scholar]
Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 254–270. [Google Scholar]
Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting Objects by Locations. In Proceedings of the European Conference on Computer Vision (ECCV), 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 649–665. [Google Scholar]
Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 17721–17732. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; p. 9626. [Google Scholar]
Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
Zhang, T.; Zhang, X.; Zhu, P.; Tang, X.; Li, C.; Jiao, L.; Zhou, H. Semantic Attention and Scale Complementary Network for Instance Segmentation in Remote Sensing Images. IEEE Trans. Cybern. 2022, 52, 10999–11013. [Google Scholar] [CrossRef] [PubMed]
Krähenbühl, P.; Koltun, V. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. arXiv 2012, arXiv:1210.5644. [Google Scholar]
Hao, S.; Wang, G.; Gu, R. Weakly Supervised Instance Segmentation Using Multi-Prior Fusion. Comput. Vis. Image Underst. 2021, 211, 103261. [Google Scholar] [CrossRef]
Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
Su, H.; Wei, S.; Yan, M.; Wang, C.; Shi, J.; Zhang, X. Object Detection and Instance Segmentation in Remote Sensing Imagery Based on Precise Mask R-CNN. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1454–1457. [Google Scholar]
Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Chen, M.; Zhang, Y.; Chen, E.; Hu, Y.; Xie, Y.; Pan, Z. Meta-Knowledge Guided Weakly Supervised Instance Segmentation for Optical and SAR Image Interpretation. Remote Sens. 2023, 15, 2357. [Google Scholar] [CrossRef]
Tian, Z.; Shen, C.; Wang, X.; Chen, H. BoxInst: High-Performance Instance Segmentation with Box Annotations. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5443–5452. [Google Scholar] [CrossRef]
Li, Y.; Xue, Y.; Li, L.; Zhang, X.; Qian, X. Domain Adaptive Box-Supervised Instance Segmentation Network for Mitosis Detection. IEEE Trans. Med. Imaging 2022, 41, 2469–2485. [Google Scholar] [CrossRef]

Figure 1. (a) The primary limitations of pixel-level and bounding box labels and the strategies employed in this study to accommodate these constraints. (b) The challenges posed by remote sensing imagery and the strategies employed in this research to accommodate them.

Figure 2. The holistic framework of SASM-Net.

Figure 3. Schematic representation of the Bottle2neck module.

Figure 4. Diagrammatic representation of the SAE module.

Figure 5. Diagrammatic representation of the segmentation branch.

Figure 6. Qualitative results based on the NWPU VHR-10 instance segmentation dataset. (a,b) Comparison between SASM-Net and hybrid supervised paradigm methods. (c,d) Comparison between SASM-Net and fully supervised paradigm methods. (e,f) Comparison between SASM-Net and adaptations of fully supervised methods. (g,h) Comparison between SASM-Net and dedicated weakly supervised networks.

Figure 7. Qualitative results based on the SSDD dataset. The distribution of rows and columns is the same as in Figure 6. (a,b) Comparison between SASM-Net and hybrid supervised paradigm methods. (c,d) Comparison between SASM-Net and fully supervised paradigm methods. (e,f) Comparison between SASM-Net and adaptations of fully supervised methods. (g,h) Comparison between SASM-Net and dedicated weakly supervised networks.

Table 1. Quantitative results based on the NWPU VHR-10 instance segmentation dataset. P_pix is the proportion of pixel-level labels used during the training phase to the total number of labels. T_spee. Denotes the average processing speed of the method on the testing set, measured in milliseconds (ms).

Paradigm	Method	P_pixe.	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	T_spee.
Hybrid supervised	YOLACT [34]	25%	15.2	41.2	7.8	7.7	16.8	12.6	-
	YOLACT [34]	50%	22.5	49.7	17.0	9.6	19.9	31.5	-
	YOLACT [34]	75%	27.5	54.4	27.4	12.1	25.9	34.2	-
	Mask R-CNN [29]	25%	25.7	59.4	18.8	16.9	25.3	29.3	-
	Mask R-CNN [29]	50%	35.5	70.8	31.3	24.6	34.2	39.9	-
	Mask R-CNN [29]	75%	49.3	82.6	51.7	36.9	47.0	53.9	-
	CondInst [36]	25%	23.9	59.8	14.8	19.8	23.7	25.3	-
	CondInst [36]	50%	34.5	73.4	27.6	23.7	34.1	35.9	-
	CondInst [36]	75%	49.5	85.1	50.3	35.9	48.6	53.7	-
Fully supervised	YOLACT [34]	100%	35.6	68.4	36.4	14.8	33.3	56.0	-
	Mask R-CNN [29]	100%	58.8	86.6	65.2	47.1	57.5	62.4	-
	CondInst [36]	100%	58.5	90.1	62.9	29.4	56.8	71.3	-
Weakly supervised	Adaptations of fully supervised methods
	YOLACT [34]	0	9.8	32.9	1.3	4.4	11.3	8.0	61.0
	Mask R-CNN [29]	0	19.8	54.7	9.7	7.8	19.4	24.6	74.1
	CondInst [36]	0	17.1	50.5	6.7	10.7	17.7	18.5	94.3
	Dedicated weakly supervised methods
	BoxInst [50]	0	47.6	78.9	49.0	33.8	43.9	55.5	94.3
	DiscoBox [28]	0	46.2	79.7	47.4	29.4	42.9	57.1	90.9
	DBIN [51]	0	48.3	80.2	50.5	34.5	46.1	57.0	99.0
	MGWI-Net [49]	0	51.6	81.3	53.3	37.6	48.2	59.1	96.2
	SASM-Net	0	53.1	82.4	55.2	38.6	49.9	61.0	107.5

Table 2. Class-wise quantitative results based on the NWPU VHR-10 instance segmentation dataset. N_para. is the number of parameters in the model, and the unit of P_para. is megabytes (M).

Paradigm	Method	R_pixe.	AI	BD	GTF	VC	SH	TC	HB	ST	BC	BR	N_para_.
Hybrid supervised	YOLACT [34]	25%	0	39.1	14.2	12.2	1.5	7.6	12.2	49.8	10.3	1.4	-
	YOLACT [34]	50%	0.1	55.0	49.9	11.7	4.6	17.0	19.3	52.5	13.6	1.2	-
	YOLACT [34]	75%	0.7	64.0	62.9	19.4	5.9	19.8	17.4	60.7	16.9	7.2	-
	Mask R-CNN [29]	25%	0.1	34.2	37.4	12.5	4.7	39.3	22.5	56.6	42.0	7.8	-
	Mask R-CNN [29]	50%	8.8	49.7	50.8	28.1	14.9	44.1	30.6	59.4	52.1	16.7	-
	Mask R-CNN [29]	75%	27.6	71.2	68.6	40.5	30.4	63.2	33.0	70.1	64.7	23.9	-
	CondInst [36]	25%	0	37.5	35.7	18.2	3.0	36.6	18.2	53.8	32.5	3.5	-
	CondInst [36]	50%	14.8	49.0	45.6	23.0	12.8	45.6	28.4	57.1	56.8	11.3	-
	CondInst [36]	75%	30.9	68.4	64.5	41.5	31.8	68.4	30.3	65.0	66.0	27.8	-
Fully supervised	YOLACT [34]	100%	8.2	70.5	70.8	22.7	21.5	24.3	34.8	63.4	26.5	13.5	-
	Mask R-CNN [29]	100%	35.3	78.8	84.8	46.1	50.2	72.0	48.1	80.9	64.2	28.0	-
	CondInst [36]	100%	26.7	77.7	89.1	46.2	46.1	69.7	46.8	73.4	74.0	35.4	-
Weakly supervised	Adaptations of fully supervised methods
	YOLACT [34]	0	0	20.7	12.1	4.8	0.1	9.6	2.1	33.5	14.9	0.1	34.8
	Mask R-CNN [29]	0	0	33.3	34.2	8.0	2.3	21.5	16.4	48.6	26.9	6.6	63.3
	CondInst [36]	0	0	30.7	26.8	6.6	1.1	19.2	14.2	46.1	23.1	3.1	53.5
	Dedicated weakly supervised methods
	BoxInst [50]	0	12.5	76.6	89.7	38.0	47.9	65.5	11.3	75.4	58.9	6.8	53.5
	DiscoBox [28]	0	12.0	77.7	91.5	33.7	42.8	64.3	10.6	74.6	57.9	6.0	65.0
	DBIN [51]	0	14.0	77.1	91.2	37.8	48.6	67.8	13.0	75.2	61.9	5.4	55.6
	MGWI-Net [49]	0	17.0	77.3	91.9	41.0	50.8	71.2	15.7	76.5	64.6	10.9	53.7
	SASM-Net	0	19.6	78.6	92.7	42.6	51.7	72.4	14.5	77.0	66.8	11.3	58.1

Table 3. Quantitative results based on the SSDD dataset. The limited number of large objects (>96² pixels) in the SSDD dataset makes the AP_L metric unstable. Therefore, we opted not to include this specific metric in our report.

Paradigm	Method	R_pixe.	AP	AP₅₀	AP₇₅	AP_S	AP_M	T_spee.
Hybrid supervised	YOLACT [34]	25%	17.4	59.0	1.5	19.7	21.0	-
	YOLACT [34]	50%	28.6	76.7	9.0	32.1	34.2	-
	YOLACT [34]	75%	39.0	79.9	32.5	40.3	45.5	-
	Mask R-CNN [29]	25%	22.8	72.4	6.3	27.2	28.5	-
	Mask R-CNN [29]	50%	39.3	86.2	28.0	42.7	44.4	-
	Mask R-CNN [29]	75%	54.6	90.2	63.0	56.6	57.1	-
	CondInst [36]	25%	18.6	65.7	2.7	22.1	23.7	-
	CondInst [36]	50%	38.4	87.4	28.6	41.3	43.8	-
	CondInst [36]	75%	54.1	93.0	59.6	54.6	56.8	-
Fully supervised	YOLACT [34]	100%	44.6	86.6	41.0	45.3	48.5	-
	Mask R-CNN [29]	100%	64.2	94.9	80.1	62.0	64.7	-
	CondInst [36]	100%	63.0	95.9	78.4	63.7	63.6	-
Weakly supervised	Adaptations of fully supervised methods
	YOLACT [34]	0	12.4	49.4	0.6	15.9	17.3	43.9
	Mask R-CNN [29]	0	15.5	61.0	1.6	20.2	21.1	50.8
	CondInst [36]	0	14.8	59.1	1.4	17.7	19.6	63.7
	Dedicated weakly supervised methods
	BoxInst [50]	0	49.9	90.1	52.7	50.6	52.3	64.1
	DiscoBox [28]	0	48.4	90.2	50.4	47.2	50.6	60.6
	DBIN [51]	0	50.6	91.7	52.8	51.3	52.0	65.4
	MGWI-Net [49]	0	53.0	92.4	57.1	53.7	54.9	64.9
	SASM-Net	0	54.6	93.0	60.8	56.6	57.9	69.9

Table 4. Ablation study of modules in SASM-Net based on the NWPU VHR-10 instance segmentation dataset.

Method	MSFE	SAE	SMG	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
Baseline				49.8	80.7	51.0	35.2	46.1	57.4
Models	$\sqrt$			50.7	81.0	52.1	36.3	47.3	58.9
	$\sqrt$	$\sqrt$		51.6	81.2	53.1	36.8	48.7	59.9
		$\sqrt$	$\sqrt$	52.2	81.7	54.4	37.6	48.8	60.4
SASM-Net	$\sqrt$	$\sqrt$	$\sqrt$	53.1	82.4	55.2	38.6	49.9	61.0

Table 5. Ablation study of modules in SASM-Net based on the SSDD dataset.

Method	MSFE	SAE	SMG	AP	AP₅₀	AP₇₅	AP_S	AP_M
Baseline				51.8	91.9	54.0	52.7	53.2
Models	$\sqrt$			52.9	92.1	55.9	54.8	54.7
	$\sqrt$	$\sqrt$		53.5	92.3	57.5	55.7	56.7
		$\sqrt$	$\sqrt$	53.9	92.2	59.4	55.3	57.0
SASM-Net	$\sqrt$	$\sqrt$	$\sqrt$	54.6	93.0	60.8	56.6	57.9

Table 6. Ablation study of the role types of structured model based on the NWPU VHR-10 instance segmentation and SSDD datasets.

Dataset	Method	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
NWPU VHR-10	Baseline	50.9	80.6	52.2	36.0	47.4	59.0
	Post-processing	51.1	80.2	52.9	35.7	47.9	59.8
	Implicit guidance	52.2	81.7	54.4	37.6	48.8	60.4
SSDD	Baseline	52.6	91.7	56.2	54.1	55.2	-
	Post-processing	53.0	91.8	57.1	54.4	56.1	-
	Implicit guidance	53.9	92.2	59.4	55.3	57.0	-

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, M.; Xu, K.; Chen, E.; Zhang, Y.; Xie, Y.; Hu, Y.; Pan, Z. Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery. Remote Sens. 2023, 15, 5201. https://doi.org/10.3390/rs15215201

AMA Style

Chen M, Xu K, Chen E, Zhang Y, Xie Y, Hu Y, Pan Z. Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery. Remote Sensing. 2023; 15(21):5201. https://doi.org/10.3390/rs15215201

Chicago/Turabian Style

Chen, Man, Kun Xu, Enping Chen, Yao Zhang, Yifei Xie, Yahao Hu, and Zhisong Pan. 2023. "Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery" Remote Sensing 15, no. 21: 5201. https://doi.org/10.3390/rs15215201

APA Style

Chen, M., Xu, K., Chen, E., Zhang, Y., Xie, Y., Hu, Y., & Pan, Z. (2023). Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery. Remote Sensing, 15(21), 5201. https://doi.org/10.3390/rs15215201

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery

Abstract

1. Introduction

2. Relate Work

2.1. Supervised Instance Segmentation

2.2. Weakly Supervised Instance Segmentation

3. Methodology

3.1. Overview

3.2. Multi-Scale Feature Extraction Module

3.3. Semantic Attention Enhancement Module

3.4. Structured Model Guidance Module

3.5. Segmentation Branch

4. Experiments

4.1. Datasets

4.2. Implementation Details

4.3. Evaluation Metrics

4.4. Experimental Results on the NWPU VHR-10 Instance Segmentation Dataset

4.5. Experimental Results on the SSDD Dataset

4.6. Ablation Study

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI