Article

Global Context Relation-Guided Feature Aggregation Network for Salient Object Detection in Optical Remote Sensing Images

by Jian Li 1, Chuankun Li 1, Xiao Zheng 2,*, Xinwang Liu 2 and Chang Tang 3
1 National Key Laboratory of Electronic Testing Technology, North University of China, Taiyuan 030051, China
2 School of Computer, National University of Defense Technology, Changsha 410073, China
3 School of Computer Science, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2978; https://doi.org/10.3390/rs16162978
Submission received: 4 July 2024 / Revised: 4 August 2024 / Accepted: 8 August 2024 / Published: 14 August 2024
(This article belongs to the Section Remote Sensing Image Processing)

Abstract:
With the rapid development of deep neural networks, salient object detection has achieved great success in natural images. However, detecting salient objects in optical remote sensing images remains a challenging task due to the diversity of object types, variations in scale, shape and orientation, as well as cluttered backgrounds. It is therefore impractical to directly apply methods designed for natural images to detect salient objects in optical remote sensing images. In this work, we present an end-to-end deep neural network for salient object detection in optical remote sensing images via global context relation-guided feature aggregation. Since the objects in remote sensing images often have a scattered distribution, we design a global context relation module to capture the global relationships between different spatial positions. To effectively integrate low-level appearance features and high-level semantic features for enhancing the final performance, we develop a feature aggregation module that takes the global context relation information as guidance and embed it into the backbone network to refine the deep features in a progressive manner. Instead of using the traditional binary cross entropy as the training loss, which treats all pixels equally, we design a weighted binary cross entropy that captures the local surrounding information of different pixels. Extensive experiments on three public datasets are conducted to validate the effectiveness of the proposed network, and the results demonstrate that our proposed method consistently outperforms the other competitors.

1. Introduction

As a practical and important task in the computer vision field, salient object detection aims to locate the most attractive objects in an image or video, and it has been widely used in many vision-related tasks, such as image segmentation [1,2], content-aware image editing [3,4], quality assessment [5,6] and so on. In contrast, traditional object detection [7] aims to detect all of the objects in an image or video without distinguishing whether a certain object is salient or not. Over the past decade, benefiting from their powerful feature learning and representation capability, deep neural networks have achieved great success in salient object detection from natural images [8,9,10,11,12,13,14]. Compared with natural images, however, several challenges make salient object detection from remote sensing images more intractable [15,16,17,18,19,20,21,22,23,24,25,26,27]. Firstly, the objects in remote sensing images are often of diverse types, with varying scales, shapes and orientations (as shown in Figure 1a,b). Secondly, the distribution of salient objects is often scattered due to the large observation range (as shown in Figure 1d). Thirdly, a certain object may be surrounded by a complex background (as shown in Figure 1c). To address the above issues, we propose an end-to-end deep neural network for salient object detection in remote sensing images via multi-scale feature aggregation and global context relation learning. By using a pre-trained backbone network, we obtain a series of feature maps that capture multi-scale information of the original image. We then generate an initial saliency score map from the highest-level feature maps, which contain sufficient semantic clues for locating the salient objects. However, the initial saliency map lacks accurate appearance details. Therefore, we aggregate multi-scale features into the initial output in a progressive manner. In addition, in order to capture the global context relation between similar objects with varying distributions in a remote sensing scene, we design a global context relation learning module for the feature maps of each stage and embed it into the feature aggregation process.
In summary, the contributions of this work are twofold, as follows:
  • We design a global context relation learning module to capture the scattered distribution of salient objects in remote sensing images. In order to exploit the high-level semantic information as well as the low-level appearance details, we propose a global context relation-guided feature aggregation module to refine the initial saliency score map in a progressive manner.
  • Instead of using the traditional binary cross entropy as the training loss, which treats all pixels equally, we embed a weighted binary cross entropy that captures the local surrounding information of different pixels, ensuring that pixels located in hard areas such as edges and holes are assigned larger weights.
A large number of methods have been proposed for salient object detection in natural images, and many of them have been adapted for remote sensing images. Below, we give a brief review of previous salient object detection methods designed for natural images as well as for remote sensing images.
Salient object detection for natural images. In earlier years, most salient object detection methods were prior-driven and unsupervised, adopting hand-crafted low-level visual features such as color, texture and spatial location [28,29,30], or heuristic priors such as the center prior [31] and boundary prior [32,33]. Although these hand-crafted feature-based methods made remarkable progress, low-level features and priors are not enough to understand an image scene in a high-level manner, i.e., they lack semantic information about the image contents, which results in unsatisfactory saliency maps. Motivated by the rapid adoption of deep neural networks in computer vision, the performance of image salient object detection has also been improved substantially by deep learning techniques. The main idea of deep neural network-based salient object detection is to integrate different levels of feature maps to locate salient objects while preserving appearance details [8,9,34,35,36,37,38]. Apart from the RGB image source, many other information sources are also used to assist salient object detection for natural images, e.g., depth images [39,40,41,42], image labels/tags and image captions [43,44,45]. The depth image provides extra 3D structure of the scene, which helps separate salient objects whose colors and textures are highly similar to the background. Image tags and captions can be plugged into a multi-task framework, in which the salient objects can be learned to boost the classification task.
Salient object detection for remote sensing images. Driven by the strong capability of deep networks for feature extraction from image data [46], many convolutional neural network (CNN)-based salient object detection methods for optical remote sensing images have been put forward. For example, Zhao et al. [47] proposed a sparse learning model that extracts both global context information and background noise to locate salient objects in remote sensing images. For the special case of building extraction, Li et al. [48] combined boundary connectivity, region contrast and background constraints to calculate the saliency values of buildings. The contrast of a multi-level color histogram has also been utilized to generate saliency in a hierarchical manner for different regions of interest [49]. Since knowledge is seldom used in image saliency detection, in [50], knowledge-oriented saliency and vision-oriented saliency are combined to help locate airports in remote sensing images. In [51], Li et al. proposed LV-Net, an end-to-end deep neural network for remote-sensing-image salient object detection. In LV-Net, a two-stream pyramid module (the L-shaped module) is designed to extract a set of hierarchical deep features that perceive the scale diversity and local details of salient objects, and an encoder–decoder module with nested connections (the V-shaped module) then gradually integrates low-level details and semantic features, which helps suppress background clutter as well as highlight the salient objects. In addition, they constructed the first public remote sensing image dataset for salient object detection (ORSSD), which contains 800 remote sensing images with manually annotated ground truths. Li et al. [52] proposed a fusion network in a parallel down–up manner for detecting salient objects in remote sensing images; to highlight diversely scaled salient objects, it applies successive down-sampling operations to capture varied scale information. Feature aggregation is also a widely used strategy for saliency detection in remote sensing images. For example, in [19], Zhang et al. generated an edge map to supervise the network training, thereby preserving the details of the final saliency map. Multi-level features extracted by a CNN backbone have also been fused with other geometric information, such as edges, to enhance salient object boundaries [23]. To adapt to lightweight application scenarios, lightweight salient object detection in optical remote sensing images has also become a hot research field [24,26]. Since global contextual information is very important for understanding remote sensing images, the local-to-contextual paradigm is widely used in recent works to explore the relation between global and local feature embeddings [25,53,54,55], of which transformer-based structures are typical examples [17,18].
Although remarkable success has been achieved by deep neural networks for detecting salient objects in remote sensing images, several major issues remain. Firstly, salient objects with diverse scales are hard to detect accurately at the same time. Secondly, the global context relation of scattered objects has not been fully exploited to enhance the detection results.

2. Materials

Datasets

In this work, we conduct experiments on three public datasets: ORSSD [51], EORSSD [19] and ORSI-4199 [21]. The detailed information of these datasets is as follows:
ORSSD contains 800 images with corresponding pixel-level annotated binary maps, of which 600 images are used for training and the remaining 200 for testing.
EORSSD is an extension of ORSSD with 1200 newly collected images, giving a total of 2000 images, of which 1400 are used for model training and the remaining 600 for testing.
ORSI-4199 is the largest dataset for remote-sensing-image salient object detection; it contains 4199 images with pixel-level ground truths, of which 2000 are used for training and the remaining 2199 for testing.
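For readers who want to reproduce the data pipeline, the sketch below shows one way to load such image/ground-truth pairs as a PyTorch dataset. The directory layout, file extensions and input resolution are assumptions for illustration and are not specified by the datasets themselves.

```python
# Minimal sketch of loading an ORSI saliency dataset (e.g., EORSSD) as paired
# image/ground-truth samples. Directory layout and extensions are assumptions.
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ORSISaliencyDataset(Dataset):
    def __init__(self, image_dir, gt_dir, size=256):
        self.image_dir, self.gt_dir = image_dir, gt_dir
        self.names = sorted(os.listdir(image_dir))
        self.img_tf = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])
        self.gt_tf = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),  # binary mask in [0, 1]
        ])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        gt_name = os.path.splitext(name)[0] + ".png"  # assumed mask extension
        gt = Image.open(os.path.join(self.gt_dir, gt_name)).convert("L")
        return self.img_tf(image), self.gt_tf(gt)
```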

3. Methods

In this work, we introduce a novel end-to-end deep neural network for accurately detecting salient objects from remote sensing scenes by exploring the global context relation of different regions, which plays an important role in remote sensing images.

3.1. Overview

As illustrated in Figure 2, we take the ResNet architecture [56] as the backbone to extract multi-level features, and its parameters are initialized with a ResNeXt network [57] pre-trained on the ImageNet dataset. Therefore, given an input image, we obtain five levels of hierarchical features: conv1, conv2_x, conv3_x, conv4_x and conv5_x. Since conv5_x contains abundant high-level features that are beneficial for locating salient objects, we utilize conv5_x to generate an initial saliency score map. However, due to a succession of convolution and pooling operations, the appearance details of the image contents are significantly lost in conv5_x. To tackle this issue, we design a feature aggregation module (FA) and insert it into the proposed framework to refine the initial output in a progressive manner by capturing both high-level semantic information and low-level details. Since the objects in remote sensing images are often scattered due to the large observation range, the global context relation should also be taken into consideration. Therefore, we develop a global context relation learning module (GCRL) and utilize its learned features to guide the feature aggregation process. For each feature aggregation stage, we generate an intermediate score map that captures the scale information of the corresponding feature extraction layer. Finally, we introduce a multi-scale fusion module (MSF) to fuse the multiple intermediate score maps of the different feature aggregation stages and obtain the final saliency score map.
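As a concrete illustration of this feature extraction step, the following sketch taps the five hierarchical levels from a torchvision ResNeXt-50 pre-trained on ImageNet. The node names and the 256 × 256 input size are assumptions tied to torchvision's ResNet layout rather than details taken from the paper.

```python
# Extracting the five hierarchical feature levels (conv1, conv2_x, ..., conv5_x)
# from an ImageNet-pretrained ResNeXt-50 backbone with torchvision.
import torch
from torchvision.models import resnext50_32x4d
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnext50_32x4d(weights="IMAGENET1K_V1")
return_nodes = {
    "relu": "conv1",      # stride 2,  64 channels
    "layer1": "conv2_x",  # stride 4,  256 channels
    "layer2": "conv3_x",  # stride 8,  512 channels
    "layer3": "conv4_x",  # stride 16, 1024 channels
    "layer4": "conv5_x",  # stride 32, 2048 channels
}
extractor = create_feature_extractor(backbone, return_nodes=return_nodes)

x = torch.randn(1, 3, 256, 256)      # dummy input image
feats = extractor(x)                 # dict of the five feature levels
for name, f in feats.items():
    print(name, tuple(f.shape))
```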

3.2. Global Context Relation Learning Module (GCRL)

For a given remote sensing image, there are usually some challenging cases such as cluttered background and scattered object distribution. Therefore, the global context relationship among multiple salient objects as well as the different parts of a certain salient object should be exploited to boost the saliency map. In this section, we introduce a GCRL module to learn the global context information and utilize it to guide the feature aggregation process in each stage. The structure of each GCRL module is shown in Figure 3.
For two spatial position indices $p$ and $q$ in the $i$-th-layer feature map $\mathbf{F}_i \in \mathbb{R}^{H \times W \times C}$, we define their global context relationship $GCR(f_p, f_q)$ as follows:
$$GCR(f_p, f_q) = \Phi(\Psi(f_p, f_q)),$$
where $f_p$ and $f_q$ are two feature vectors of size $1 \times 1 \times C$. In linear algebra, a simple way to describe the relationship between two feature vectors is the dot product. Therefore, for simplicity and computational efficiency, we use the dot product to calculate the relationship between $f_p$ and $f_q$, and define $\Psi(f_p, f_q)$ as follows:
$$\Psi(f_p, f_q) = w_1(f_p)^{T} w_2(f_q),$$
where $w_1(f_p)$ and $w_2(f_q)$ can be obtained by using two different convolution layers. In our GCRL module, we use two $1 \times 1$ convolution layers to obtain $w_1(\mathbf{F})$ and $w_2(\mathbf{F})$ from $\mathbf{F}_i$. Then, we reshape and transpose $w_1(\mathbf{F})$ and $w_2(\mathbf{F})$ to $HW \times C$ and $C \times HW$, respectively. By using a simple matrix multiplication, we obtain an $HW \times HW$ matrix, which we further reshape to form a global context relation feature $\mathbf{GF}_i$ of size $H \times W \times HW$. Finally, we apply a softmax layer to eliminate negative spatial relations, which is defined as
$$\Phi(\mathbf{GF}_{i,pq}) = \frac{\exp(\mathbf{GF}_{i,pq})}{\sum_{p,q=1}^{W \times H} \exp(\mathbf{GF}_{i,pq})}.$$
By using the abovementioned GCRL module, the global relationships between any two spatial positions in the deep feature maps can be fully exploited to further enhance the feature representations, especially for haphazardly distributed salient objects, which are very common in remote sensing images. In our experiments, we will intuitively demonstrate the efficacy of the proposed GCRL module.
It should be noted that our proposed GCRL module is quite different from previous non-local (NL) neural networks [58] and global context learning networks [59]. Previous NL blocks are essentially a generalization of the self-attention mechanism and compute the response at a position as a weighted sum of the features at all positions. The work in [59] captures global context information in the segmentation process by weighting the response at a feature location by the features at all locations in the input feature map, where the weights are determined by the feature similarity between the two corresponding locations. On the contrary, our proposed GCRL does not learn weighted features, but learns the global context relationships between different parts of a remote sensing scene. Therefore, the output of each GCRL module is not simply a set of weighted features, but reflects the spatial context relationships of a certain remote sensing scene. In addition, the output of each GCRL module is used to guide the feature aggregation process at each step.
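To make the above definitions concrete, the following is a minimal PyTorch sketch of the GCRL computation: two 1 × 1 convolutions embed the input feature map, a matrix product yields the HW × HW relation matrix, and a softmax removes negative relations. The embedding width and the tensor layout are our own assumptions.

```python
# Minimal sketch of the global context relation learning (GCRL) idea.
import torch
import torch.nn as nn

class GCRL(nn.Module):
    def __init__(self, in_channels, embed_channels=64):
        super().__init__()
        self.w1 = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
        self.w2 = nn.Conv2d(in_channels, embed_channels, kernel_size=1)

    def forward(self, feat):                           # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        q = self.w1(feat).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.w2(feat).flatten(2)                   # (B, C', HW)
        relation = torch.bmm(q, k)                     # (B, HW, HW)
        relation = torch.softmax(relation, dim=-1)     # suppress negative relations
        # reshape so every position carries an HW-dimensional relation vector
        return relation.view(b, h, w, h * w)           # (B, H, W, HW)

# usage: GCRL(256)(torch.randn(2, 256, 32, 32)) -> tensor of shape (2, 32, 32, 1024)
```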

3.3. Feature Aggregation Module (FA)

It is well known that both the high-level semantic information and low-level details are important for pixel-wise salient object detection. Therefore, the way in which to integrate the features extracted from different network layers is critical. A simple and commonly used operation in previous methods is concatenation. However, through this method, the complementary information between different layers of features cannot be fully captured, and the noisy and redundant features cannot be effectively eliminated. In this paper, a feature aggregation module is proposed to fully integrate low-level detail features, high-level semantic features, as well as global context relation information to suppress the background noises and recover more structural and semantic information.
Suppose that the feature maps of the $i$-th layer, the global context relation of the $i$-th layer, and the output of the $(i-1)$-th FA module are denoted as $\mathbf{F}_i$, $\mathbf{G}_i$ and $\mathbf{FA}_{i-1}$, respectively; the output of the $i$-th FA module is then obtained as illustrated in Figure 4. The proposed FA module aims to preserve the consistency and suppress the inconsistency between $\mathbf{F}_i$ and $\mathbf{FA}_{i-1}$. Instead of directly using the common addition or concatenation adopted for feature integration in most existing studies, we first extract the consistent parts between $\mathbf{F}_i$ and $\mathbf{FA}_{i-1}$ by element-wise multiplication. In this manner, the common parts of the cross-level features are enhanced while the inconsistent parts are weakened. Then, we combine $\mathbf{F}_i$ and the enhanced features $\mathbf{F}_{i,m}$ by element-wise addition to retain part of the information from the layer-wise features. In addition, in order to inject global context relation information into the integrated features, we embed $\mathbf{G}_i$ into each FA module to guide the feature aggregation. Since the number of feature channels of $\mathbf{G}_i$ differs from that of $\mathbf{F}_i$, we first use a convolution operation to adapt $\mathbf{G}_i$ to $\mathbf{F}_i$. After obtaining the global context relation-enhanced feature by multiplying the adapted $\mathbf{G}_i$ with $\mathbf{F}_i$, we concatenate it with the features derived from $\mathbf{F}_{i,m}$ to obtain the output of the $i$-th FA module, $\mathbf{FA}_i$. Mathematically, $\mathbf{F}_{i,m}$ and $\mathbf{FA}_i$ are generated as follows:
$$\mathbf{F}_{i,m} = W_{i,2}(W_{i,1}(\mathbf{FA}_{i-1})) * W_{i,4}(W_{i,3}(\mathbf{F}_i)), \quad \mathbf{FA}_i = Cat\big(W_{i,6}(W_{i,5}(\mathbf{F}_{i,m}) + W_{i,3}(\mathbf{F}_i)),\; W_{i,7}(\mathbf{F}_i) * W_{i,8}(\mathbf{G}_i)\big),$$
where $W_{i,k}$ $(k = 1, 2, \ldots, 8)$ is the $k$-th combination of convolution, batch normalization and ReLU operations in the $i$-th FA module, $Cat$ represents the concatenation operation, and $*$ represents element-wise multiplication.
It should be noted that each "Conv" block in both the GCRL and FA modules is actually a combination of convolution, batch normalization and activation, i.e., the so-called "convolution block". There are no residual connections within these blocks. We simply use the ResNet architecture as the backbone for feature extraction, which consists of a series of residual connections for avoiding gradient vanishing. However, in our proposed modules, we only use the extracted features for salient object detection, so we do not use any residual connection in the convolution blocks. Since there are only a small number of convolution blocks compared to the backbone, gradient vanishing does not occur in our network.
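The sketch below instantiates the FA equations above with plain convolution blocks. The channel widths, the bilinear upsampling of FA_{i-1} and the assumption that G_i arrives as a channels-first tensor are illustrative choices, not details confirmed by the paper.

```python
# Sketch of one feature aggregation (FA) step; each W_k is a Conv + BN + ReLU block.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

class FA(nn.Module):
    def __init__(self, feat_ch, prev_ch, rel_ch, out_ch=64):
        super().__init__()
        self.w1, self.w2 = conv_block(prev_ch, out_ch), conv_block(out_ch, out_ch)
        self.w3, self.w4 = conv_block(feat_ch, out_ch), conv_block(out_ch, out_ch)
        self.w5, self.w6 = conv_block(out_ch, out_ch), conv_block(out_ch, out_ch)
        self.w7 = conv_block(feat_ch, out_ch)
        self.w8 = conv_block(rel_ch, out_ch)   # adapts G_i to the feature channels

    def forward(self, feat_i, fa_prev, g_i):
        # bring the previous FA output to the current spatial resolution (assumed)
        fa_prev = F.interpolate(fa_prev, size=feat_i.shape[2:],
                                mode="bilinear", align_corners=False)
        f3 = self.w3(feat_i)
        f_m = self.w2(self.w1(fa_prev)) * self.w4(f3)   # consistent cross-level parts
        branch1 = self.w6(self.w5(f_m) + f3)            # retain layer-wise cues
        branch2 = self.w7(feat_i) * self.w8(g_i)        # relation-guided cues
        return torch.cat([branch1, branch2], dim=1)
```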

3.4. Multi-Scale Fusion Module (MSF)

Since different levels of features capture different scale information of the original image, the outputs of the different FA modules can be used to help detect salient objects of varying sizes. Therefore, we design an MSF module to combine the multiple intermediate outputs and generate the final saliency score map. In addition, we also utilize the original hierarchical features extracted from the backbone network to guide the fusion process. Although many previous works use a fusion strategy to fuse multiple outputs, they ignore the deep features extracted by shallow layers. The structure of our designed MSF module is shown in Figure 5. We first concatenate $\mathbf{F}_1$, $\mathbf{F}_2$, $\mathbf{F}_3$, $\mathbf{F}_4$ and $\mathbf{F}_5$ and reduce the channel dimension of the concatenated features by a convolution. Then, the five side outputs are combined with the reduced features to obtain the final prediction result $\mathbf{S}$. The whole process can be described as follows:
$$\mathbf{F}_c = Cat\big(Up(\mathbf{F}_1), Up(\mathbf{F}_2), Up(\mathbf{F}_3), Up(\mathbf{F}_4), Up(\mathbf{F}_5)\big), \quad \mathbf{S} = G_2\big(Cat(G_1(\mathbf{F}_c), \mathbf{O}_1, \mathbf{O}_2, \mathbf{O}_3, \mathbf{O}_4, \mathbf{O}_5)\big),$$
where $Up$ represents the up-sampling operation; $\mathbf{F}_c$ is the concatenated feature of the different layers; $G_1$ and $G_2$ are two convolution operations; and $\mathbf{O}_1$, $\mathbf{O}_2$, $\mathbf{O}_3$, $\mathbf{O}_4$ and $\mathbf{O}_5$ represent the outputs of the five feature aggregation modules, respectively.
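A minimal sketch of this fusion step is given below; the channel numbers and the choice of a 1 × 1 reduction convolution followed by a 3 × 3 prediction convolution are assumptions.

```python
# Sketch of the multi-scale fusion (MSF) step: upsample and concatenate the five
# backbone feature levels, reduce channels, then fuse with the five side outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSF(nn.Module):
    def __init__(self, feat_channels=(64, 256, 512, 1024, 2048), mid_ch=64):
        super().__init__()
        self.g1 = nn.Conv2d(sum(feat_channels), mid_ch, kernel_size=1)  # channel reduction
        self.g2 = nn.Conv2d(mid_ch + 5, 1, kernel_size=3, padding=1)    # final prediction

    def forward(self, feats, side_outputs, out_size):
        up = lambda t: F.interpolate(t, size=out_size, mode="bilinear", align_corners=False)
        f_c = torch.cat([up(f) for f in feats], dim=1)                  # Cat(Up(F1..F5))
        fused = torch.cat([self.g1(f_c)] + [up(o) for o in side_outputs], dim=1)
        return self.g2(fused)                                           # saliency logits S
```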

3.5. Local Surrounding Aware Loss

In previous works, binary cross entropy (BCE) is the most widely used loss function for network training. However, BCE treats each pixel equally and calculates the loss for each pixel independently, so the local surrounding information of each pixel is not captured. In fact, pixels located in cluttered or boundary areas should be given more attention, since these pixels often have complex local surrounding structures. Therefore, we design a local surrounding aware loss ($L_{SA}$) as follows:
$$L_{SA} = \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} \omega_{ij} L_{BCE}^{ij}}{\sum_{i=1}^{H}\sum_{j=1}^{W} \omega_{ij}},$$
where $L_{BCE}^{ij}$ and $\omega_{ij}$ are the traditional BCE loss and the weight of the $ij$-th pixel, respectively. Given the ground truth saliency map $\mathbf{GT}$ of a training image, the weight map of all its pixels can be calculated as follows:
$$\mathbf{W} = \left| \Phi(\mathbf{GT}) - \mathbf{GT} \right|,$$
where $\Phi(\cdot)$ is a mapping function that calculates the surrounding information of pixels. In this work, we use the average pooling operation to implement $\Phi(\cdot)$. In this manner, pixels located in hard areas such as edges and holes are assigned larger weights.
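The following sketch shows one way to implement such a loss in PyTorch. The pooling window, the unit base weight and the scale factor lam are assumptions added to keep every pixel contributing; the paper itself only specifies that the weights come from average pooling of the ground truth.

```python
# Sketch of a local-surrounding-aware (LSA) weighted BCE loss.
import torch
import torch.nn.functional as F

def lsa_loss(logits, gt, kernel=15, lam=5.0):
    # gt: binary ground-truth map in [0, 1], shape (B, 1, H, W)
    local_avg = F.avg_pool2d(gt, kernel_size=kernel, stride=1, padding=kernel // 2)
    weight = 1.0 + lam * (local_avg - gt).abs()          # larger near edges and holes
    bce = F.binary_cross_entropy_with_logits(logits, gt, reduction="none")
    return (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))  # weighted mean per image

# usage: loss = lsa_loss(pred_logits, gt_mask).mean()
```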

3.6. Model Training and Testing

Training: As mentioned in the previous section, we initialize our proposed network with a ResNeXt-50 network pre-trained on ImageNet [57]. The size of the pre-trained ResNeXt-50 model is 191 MB, and the size of our final trained model with the designed modules added is 205 MB; the increase in model size is therefore only about 7.2%. Each dataset is divided into two parts, one for training and the other for testing.
Testing: During testing, for an input image, we feed it into our trained network and obtain the final saliency score map.
Implementation Details: Our proposed network is implemented with the PyTorch library on a PC with an Intel Xeon E5-2603 v4 CPU, 32 GB RAM and an NVIDIA GeForce GTX 1080Ti GPU. During training, the batch size (the number of training samples in each iteration) is set to 8. The whole network is optimized using the stochastic gradient descent (SGD) algorithm with a momentum of 0.9 and a weight decay of 0.0005. We adjust the learning rate with the "poly" policy with a power of 0.9 and stop the learning process after 10k iterations. The whole training process takes about 1.5 h. The code will be released with this paper for public academic use.
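A minimal sketch of this training configuration is given below; the base learning rate and the placeholder model are assumptions, since the paper does not report the initial learning rate.

```python
# Sketch of the optimizer and "poly" learning-rate policy described above.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)           # placeholder for the full network
base_lr, power, max_iters = 1e-3, 0.9, 10_000   # base_lr is an assumption
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=5e-4)
# poly policy: lr = base_lr * (1 - iter / max_iters) ** power, stepped per iteration
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** power)

for it in range(3):
    optimizer.step()        # placeholder for a real training iteration
    scheduler.step()
    print(it, scheduler.get_last_lr())
```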
Loss Function: In this work, we apply the local surrounding aware loss to each side output and to the final output during the training process. The total loss is calculated as follows:
$$L = L_{SA}^{d} + \sum_{k=1}^{5} L_{SA}^{k},$$
where $L_{SA}^{k}$ represents the loss for the $k$-th side output and $L_{SA}^{d}$ is the dominant loss for the final output layer.

4. Results

4.1. Evaluation Metrics

In order to validate the efficacy of the proposed network, we use six widely used metrics for quantitative performance evaluation: the S-measure ($S_\alpha$, $\alpha = 0.5$) [60], the F-measure ($F_\beta$, $\beta^2 = 0.3$) [61], the E-measure ($E_\xi$) [62], the mean absolute error (MAE, $\mathcal{M}$), the precision–recall (PR) curve and the F-measure curve. As an overall performance measurement, the F-measure is defined as
$$F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}.$$
Since neither precision nor recall evaluates the true negative saliency assignments, we use the mean absolute error (MAE) as a complement. The MAE score calculates the average difference between the detected saliency score map $\mathbf{S}$ and the ground truth $\mathbf{G}$; it is computed as
$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| \mathbf{S}(x, y) - \mathbf{G}(x, y) \right|,$$
where H and W are the height and width of the input image, respectively.
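For reference, the following sketch computes the MAE and F-measure as defined above for a single prediction; the adaptive threshold (twice the mean saliency value) is a common convention and an assumption here.

```python
# Sketch of the MAE and F-measure computations for a predicted saliency map S
# and a binary ground truth G, both valued in [0, 1].
import numpy as np

def mae(s, g):
    return np.abs(s.astype(np.float64) - g.astype(np.float64)).mean()

def f_measure(s, g, beta2=0.3, threshold=None):
    if threshold is None:
        threshold = min(2.0 * s.mean(), 1.0)       # adaptive threshold (assumed)
    pred = (s >= threshold).astype(np.float64)
    tp = (pred * g).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (g.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# usage with random maps, just to show the call signature
s = np.random.rand(256, 256)
g = (np.random.rand(256, 256) > 0.5).astype(np.float64)
print(mae(s, g), f_measure(s, g))
```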

4.2. Comparison with State-of-the-Art Methods

In order to validate the efficacy of our proposed network, we compare its performance with 24 other salient object detection methods, including 17 salient object detection models for natural images and 7 for optical remote sensing images. The compared methods are R3Net [63], PiCANet [64], PoolNet [65], EGNet [66], BASNet [67], CPD [68], RAS [69], CSNet [70], SAMNet [71], HVPNet [72], ENFNet [73], SUCA [74], PA-KRN [75], VST [76], DPORTNet-VGG [77], DNTD-Res [78], ICON-PVT [79], MJRBMVGG [21], EMFINetVGG [25], ERPNetVGG [23], ACCoNet-VGG [27], CorrNet [24], MCCNet [20] and HFANet [80]. Among these methods, the last seven are specifically designed for salient object detection in remote sensing images.
Quantitative Comparison. In Table 1 and Table 2, we report the results of the different metrics obtained by the different models on the three datasets. On the ORSSD dataset, our proposed network consistently performs favourably against the other methods. On the EORSSD and ORSI-4199 datasets, our method also performs better than the other competitors in nearly all cases. In Figure 6, Figure 7 and Figure 8, we plot the PR curves and F-measure curves of the different methods on the different datasets (since 24 salient object detection methods are used for comparison, which inevitably produces many curves per figure, we intentionally render some curves in gray and only label the curves corresponding to representative or competitive methods). From the results, we observe that our method also consistently outperforms its counterparts on the three datasets.
Qualitative Comparison. In order to give a more intuitive illustration of the comparison results, we show some visual results of our proposed network and others in Figure 9. As can be seen, our method generates more accurate saliency score maps when the input images contain background clutter. In addition, the boundaries of the salient objects are well preserved in our results. It should be noted that our proposed network also obtains good results when the salient objects are slender, and it handles many other challenging cases well, such as varying object shapes, sizes and orientations. Therefore, high-level semantic information, low-level details and different scale information are all well captured by our proposed network.

4.3. Ablation Analysis

Effectiveness of the GCRL module. In this work, we design the GCRL module to learn global context relation information that guides the feature aggregation process. In order to demonstrate the efficacy of this module, we remove it from our network, i.e., there is no $\mathbf{G}_i$ in Figure 4 (denoted as no_GCRL), and the results are shown in Figure 10. As can be seen from the results, without the GCRL module, some scattered objects cannot be detected well simultaneously (see the blue factory buildings in the first image). With the GCRL module, our network can detect the scattered salient objects.
Effectiveness of the MSF module. In order to validate the efficacy of the proposed MSF module, we give an intuitive presentation of the results with/without MSF in Figure 11. As can be seen, if we directly use the output of the final FA module as the saliency score map (no_MSF), the salient objects can be roughly located, but the details are lost. On the contrary, if we use the MSF module to fuse the side outputs of different FA modules, the high-level semantic information and low-level details can be well preserved.
In Table 3, we also display the quantitative ablation results for the GCRL and MSF modules. "OURS_noGCRL" denotes the proposed network without the GCRL modules, and "OURS_noMSF" denotes the proposed network without the MSF module. As can be seen from the results, both the GCRL and MSF modules bring obvious improvements to the final results, which validates the efficacy of the designed modules in the proposed network.

4.4. Running Efficiency Comparison

In order to intuitively show the efficiency of the proposed network, we compare its inference time with that of the other methods on the different datasets in Table 4. To eliminate the bias caused by the different image sizes in each dataset, we report the average inference time (in seconds) for a single image in each dataset. As can be seen, although the proposed network is not the fastest, it is still faster than most of the other competitors.
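For readers who wish to reproduce such timings, the sketch below measures the average per-image forward time of a placeholder model on a GPU; the warm-up pass and explicit CUDA synchronization are needed to obtain meaningful numbers.

```python
# Sketch of measuring average per-image inference time, in the spirit of Table 4.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 1, 3, padding=1).to(device).eval()   # placeholder network
images = [torch.randn(1, 3, 256, 256, device=device) for _ in range(20)]

with torch.no_grad():
    model(images[0])                                       # warm-up pass
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for img in images:
        model(img)
    if device == "cuda":
        torch.cuda.synchronize()
print("average inference time per image: %.4f s" % ((time.time() - start) / len(images)))
```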

5. Conclusions

In this work, we propose an end-to-end deep neural network for salient object detection in remote sensing images. In order to capture the global context relation of scattered objects, we propose a global context relation learning module. Since both high-level semantic information and low-level details are important for the final results, we design a feature aggregation module and embed it into the network for feature boosting. The learned global context relation is also used to guide the feature aggregation in each stage, and the side outputs of the different FA modules are fused to generate the final result. In addition, instead of using the traditional binary cross entropy as the training loss, which treats all pixels equally, we design a weighted binary cross entropy to capture the local surrounding information of different pixels. Extensive experiments on three public datasets validate the effectiveness of the proposed network.

Author Contributions

Methodology, J.L., C.L. and X.Z.; Validation, C.L.; Formal analysis, X.L. and C.T.; Investigation, X.Z. and X.L.; Writing—original draft, J.L.; Supervision, C.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 62101512 and 62271453), and in part by the Fundamental Research Program of Shanxi Province (20210302124031) and Shanxi Scholarship Council of China (2023-131).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Z.; Davaasuren, D.; Wu, C.; Goldstein, J.A.; Gernand, A.D.; Wang, J.Z. Multi-region Saliency-aware Learning for Cross-domain Placenta Image Segmentation. Pattern Recognit. Lett. 2020, 140, 165–171. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, W.; Shen, J.; Yang, R.; Porikli, F. Saliency-aware video object segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 20–33. [Google Scholar] [CrossRef] [PubMed]
  3. Battiato, S.; Farinella, G.M.; Puglisi, G.; Ravi, D. Saliency-based selection of gradient vector flow paths for content aware image resizing. IEEE Trans. Image Process. 2014, 23, 2081–2095. [Google Scholar] [CrossRef] [PubMed]
  4. Cho, D.; Park, J.; Oh, T.H.; Tai, Y.W.; So Kweon, I. Weakly-and self-supervised learning for content-aware deep image retargeting. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4558–4567. [Google Scholar]
  5. Zhang, L.; Shen, Y.; Li, H. VSI: A visual saliency-induced index for perceptual image quality assessment. IEEE Trans. Image Process. 2014, 23, 4270–4281. [Google Scholar] [CrossRef] [PubMed]
  6. Oszust, M. No-Reference quality assessment of noisy images with local features and visual saliency models. Inf. Sci. 2019, 482, 334–349. [Google Scholar] [CrossRef]
  7. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  8. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-Scale Interactive Network for Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422. [Google Scholar]
  9. Zhang, L.; Wu, J.; Wang, T.; Borji, A.; Wei, G.; Lu, H. A Multistage Refinement Network for Salient Object Detection. IEEE Trans. Image Process. 2020, 29, 3534–3545. [Google Scholar] [CrossRef]
  10. Zhou, S.; Wang, J.; Zhang, J.; Wang, L.; Huang, D.; Du, S.; Zheng, N. Hierarchical U-Shape Attention Network for Salient Object Detection. IEEE Trans. Image Process. 2020, 29, 8417–8428. [Google Scholar] [CrossRef]
  11. Zhang, J.; Yu, X.; Li, A.; Song, P.; Liu, B.; Dai, Y. Weakly-Supervised Salient Object Detection via Scribble Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12546–12555. [Google Scholar]
  12. Wei, J.; Wang, S.; Wu, Z.; Su, C.; Huang, Q.; Tian, Q. Label Decoupling Framework for Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13025–13034. [Google Scholar]
  13. Fu, K.; Fan, D.P.; Ji, G.P.; Zhao, Q. Jl-dcf: Joint learning and densely-cooperative fusion framework for rgb-d salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3052–3062. [Google Scholar]
  14. Piao, Y.; Rong, Z.; Zhang, M.; Ren, W.; Lu, H. A2dele: Adaptive and Attentive Depth Distiller for Efficient RGB-D Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9060–9069. [Google Scholar]
  15. He, X.; Tang, C.; Liu, X.; Zhang, W.; Sun, K.; Xu, J. Object detection in hyperspectral image via unified spectral-spatial feature aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5521213. [Google Scholar] [CrossRef]
  16. Tang, C.; Wang, J.; Zheng, X.; Liu, X.; Xie, W.; Li, X.; Zhu, X. Spatial and spectral structure preserved self-representation for unsupervised hyperspectral band selection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531413. [Google Scholar] [CrossRef]
  17. Li, G.; Bai, Z.; Liu, Z.; Zhang, X.; Ling, H. Salient object detection in optical remote sensing images driven by transformer. IEEE Trans. Image Process. 2023, 32, 5257–5269. [Google Scholar] [CrossRef] [PubMed]
  18. Yan, R.; Yan, L.; Geng, G.; Cao, Y.; Zhou, P.; Meng, Y. ASNet: Adaptive Semantic Network Based on Transformer-CNN for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608716. [Google Scholar] [CrossRef]
  19. Zhang, Q.; Cong, R.; Li, C.; Cheng, M.M.; Fang, Y.; Cao, X.; Zhao, Y.; Kwong, S. Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Image Process. 2020, 30, 1305–1317. [Google Scholar] [CrossRef] [PubMed]
  20. Li, G.; Liu, Z.; Lin, W.; Ling, H. Multi-content complementation network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5614513. [Google Scholar] [CrossRef]
  21. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI salient object detection via multiscale joint region and boundary model. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607913. [Google Scholar] [CrossRef]
  22. Cong, R.; Zhang, Y.; Fang, L.; Li, J.; Zhao, Y.; Kwong, S. RRNet: Relational reasoning network with parallel multiscale attention for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613311. [Google Scholar] [CrossRef]
  23. Zhou, X.; Shen, K.; Weng, L.; Cong, R.; Zheng, B.; Zhang, J.; Yan, C. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2023, 53, 539–552. [Google Scholar] [CrossRef] [PubMed]
  24. Li, G.; Liu, Z.; Bai, Z.; Lin, W.; Ling, H. Lightweight salient object detection in optical remote sensing images via feature correlation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5617712. [Google Scholar]
  25. Wang, Z.; Guo, J.; Zhang, C.; Wang, B. Multiscale feature enhancement network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5634819. [Google Scholar] [CrossRef]
  26. Li, G.; Liu, Z.; Zhang, X.; Lin, W. Lightweight salient object detection in optical remote-sensing images via semantic matching and edge alignment. IEEE Trans. Geosci. Remote Sens. 2023, 60, 5617712. [Google Scholar] [CrossRef]
  27. Li, G.; Liu, Z.; Zeng, D.; Lin, W.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2023, 53, 526–538. [Google Scholar] [CrossRef]
  28. Cheng, M.M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef]
  29. Wang, J.; Tang, C.; Zheng, X.; Liu, X.; Zhang, W.; Zhu, E.; Zhu, X. Fast approximated multiple kernel k-means. IEEE Trans. Knowl. Data Eng. 2023, 1–10. [Google Scholar] [CrossRef]
  30. Wang, J.; Tang, C.; Wan, Z.; Zhang, W.; Sun, K.; Zomaya, A.Y. Efficient and effective one-step multiview clustering. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–12. [Google Scholar] [CrossRef]
  31. Chen, S.; Zheng, L.; Hu, X.; Zhou, P. Discriminative saliency propagation with sink points. Pattern Recognit. 2016, 60, 2–12. [Google Scholar] [CrossRef]
  32. Zhu, W.; Liang, S.; Wei, Y.; Sun, J. Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2814–2821. [Google Scholar]
  33. Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.H. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3166–3173. [Google Scholar]
  34. Zhang, P.; Wang, D.; Lu, H.; Wang, H.; Xiang, R. Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 202–211. [Google Scholar]
  35. Hu, X.; Zhu, L.; Qin, J.; Fu, C.W.; Heng, P.A. Recurrently Aggregating Deep Features for Salient Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6943–6950. [Google Scholar]
  36. Chen, S.; Tan, X.; Wang, B.; Hu, X. Reverse Attention for Salient Object Detection. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
  37. Zhang, X.; Wang, T.; Qi, J.; Lu, H.; Wang, G. Progressive Attention Guided Recurrent Network for Salient Object Detection. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  38. Wang, T.; Zhang, L.; Wang, S.; Lu, H.; Yang, G.; Ruan, X.; Borji, A. Detect Globally, Refine Locally: A Novel Approach to Saliency Detection. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  39. Chen, H.; Li, Y. Progressively complementarity-aware fusion network for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3051–3060. [Google Scholar]
  40. Liu, N.; Zhang, N.; Han, J. Learning Selective Self-Mutual Attention for RGB-D Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13756–13765. [Google Scholar]
  41. Fan, D.P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.M. Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2075–2089. [Google Scholar] [CrossRef]
  42. Li, C.; Cong, R.; Kwong, S.; Hou, J.; Fu, H.; Zhu, G.; Zhang, D.; Huang, Q. ASIF-Net: Attention steered interweave fusion network for RGB-D salient object detection. IEEE Trans. Cybern. 2020, 32, 2075–2089. [Google Scholar] [CrossRef]
  43. Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 136–145. [Google Scholar]
  44. Zhang, L.; Zhang, J.; Lin, Z.; Lu, H.; He, Y. Capsal: Leveraging captioning to boost semantics for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6024–6033. [Google Scholar]
  45. Zeng, Y.; Zhuge, Y.; Lu, H.; Zhang, L.; Qian, M.; Yu, Y. Multi-source weak supervision for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6074–6083. [Google Scholar]
  46. Tang, C.; Liu, X.; Zheng, X.; Li, W.; Xiong, J.; Wang, L.; Zomaya, A.Y.; Longo, A. DeFusionNET: Defocus blur detection via recurrently fusing and refining discriminative multi-scale deep features. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 955–968. [Google Scholar] [CrossRef]
  47. Zhao, D.; Wang, J.; Shi, J.; Jiang, Z. Sparsity-guided saliency detection for remote sensing images. J. Appl. Remote Sens. 2015, 9, 095055. [Google Scholar] [CrossRef]
  48. Li, E.; Xu, S.; Meng, W.; Zhang, X. Building extraction from remotely sensed images by integrating saliency cue. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 10, 906–919. [Google Scholar] [CrossRef]
  49. Li, T.; Zhang, J.; Lu, X.; Zhang, Y. SDBD: A hierarchical region-of-interest detection approach in large-scale remote sensing image. IEEE Geosci. Remote Sens. Lett. 2017, 14, 699–703. [Google Scholar] [CrossRef]
  50. Zhang, Q.; Zhang, L.; Shi, W.; Liu, Y. Airport extraction via complementary saliency analysis and saliency-oriented active contour model. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1085–1089. [Google Scholar] [CrossRef]
  51. Li, C.; Cong, R.; Hou, J.; Zhang, S.; Qian, Y.; Kwong, S. Nested network with two-stream pyramid for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9156–9166. [Google Scholar] [CrossRef]
  52. Li, C.; Cong, R.; Guo, C.; Li, H.; Zhang, C.; Zheng, F.; Zhao, Y. A parallel down-up fusion network for salient object detection in optical remote sensing images. Neurocomputing 2020, 415, 411–420. [Google Scholar] [CrossRef]
  53. Huang, K.; Li, N.; Huang, J.; Tian, C. Exploiting Memory-based Cross-Image Contexts for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5614615. [Google Scholar] [CrossRef]
  54. Zhao, R.; Zheng, P.; Zhang, C.; Wang, L. Progressive Complementation Network with Semantics and Details for Salient Object Detection in Optical Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8626–8641. [Google Scholar] [CrossRef]
  55. Quan, Y.; Xu, H.; Wang, R.; Guan, Q.; Zheng, J. ORSI Salient Object Detection via Progressive Semantic Flow and Uncertainty-aware Refinement. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608013. [Google Scholar] [CrossRef]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  57. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  58. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  59. Lin, C.Y.; Chiu, Y.C.; Ng, H.F.; Shih, T.K.; Lin, K.H. Global-and-local context network for semantic segmentation of street view images. Sensors 2020, 20, 2907. [Google Scholar] [CrossRef]
  60. Cheng, M.M.; Fan, D.P. Structure-measure: A new way to evaluate foreground maps. Int. J. Comput. Vis. 2021, 129, 2622–2638. [Google Scholar] [CrossRef]
  61. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
  62. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 698–704. [Google Scholar]
  63. Deng, Z.; Hu, X.; Zhu, L.; Xu, X.; Qin, J.; Han, G.; Heng, P.A. R3net: Recurrent residual refinement network for saliency detection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 684–690. [Google Scholar]
  64. Liu, N.; Han, J.; Yang, M.H. Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3089–3098. [Google Scholar]
  65. Liu, J.J.; Hou, Q.; Cheng, M.M.; Feng, J.; Jiang, J. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3917–3926. [Google Scholar]
  66. Zhao, J.X.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
  67. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. Basnet: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7479–7489. [Google Scholar]
  68. Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar]
  69. Chen, S.; Tan, X.; Wang, B.; Lu, H.; Hu, X.; Fu, Y. Reverse attention-based residual network for salient object detection. IEEE Trans. Image Process. 2020, 29, 3763–3776. [Google Scholar] [CrossRef]
  70. Gao, S.H.; Tan, Y.Q.; Cheng, M.M.; Lu, C.; Chen, Y.; Yan, S. Highly efficient salient object detection with 100k parameters. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 702–721. [Google Scholar]
  71. Liu, Y.; Zhang, X.Y.; Bian, J.W.; Zhang, L.; Cheng, M.M. SAMNet: Stereoscopically attentive multi-scale network for lightweight salient object detection. IEEE Trans. Image Process. 2021, 30, 3804–3814. [Google Scholar] [CrossRef]
  72. Liu, Y.; Gu, Y.C.; Zhang, X.Y.; Wang, W.; Cheng, M.M. Lightweight salient object detection via hierarchical visual perception learning. IEEE Trans. Cybern. 2020, 51, 4439–4449. [Google Scholar] [CrossRef]
  73. Tu, Z.; Ma, Y.; Li, C.; Tang, J.; Luo, B. Edge-guided non-local fully convolutional network for salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 582–593. [Google Scholar] [CrossRef]
  74. Li, J.; Pan, Z.; Liu, Q.; Wang, Z. Stacked U-shape network with channel-wise attention for salient object detection. IEEE Trans. Multimed. 2020, 23, 1397–1409. [Google Scholar] [CrossRef]
  75. Xu, B.; Liang, H.; Liang, R.; Chen, P. Locate globally, segment locally: A progressive architecture with knowledge review network for salient object detection. Proc. Aaai Conf. Artif. Intell. 2021, 35, 3004–3012. [Google Scholar] [CrossRef]
  76. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual saliency transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 4722–4732. [Google Scholar]
  77. Liu, Y.; Zhang, D.; Liu, N.; Xu, S.; Han, J. Disentangled capsule routing for fast part-object relational saliency. IEEE Trans. Image Process. 2022, 31, 6719–6732. [Google Scholar] [CrossRef]
  78. Fang, C.; Tian, H.; Zhang, D.; Zhang, Q.; Han, J.; Han, J. Densely nested top-down flows for salient object detection. Sci. China Inf. Sci. 2022, 65, 182103. [Google Scholar] [CrossRef]
  79. Zhuge, M.; Fan, D.P.; Liu, N.; Zhang, D.; Xu, D.; Shao, L. Salient object detection via integrity learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3738–3752. [Google Scholar] [CrossRef]
  80. Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624915. [Google Scholar] [CrossRef]
Figure 1. Some challenging cases of salient object detection for remote sensing images. The first row shows some example images and the second row presents the corresponding annotated ground truth.
Figure 2. A schematic illustration of our proposed network for salient object detection from remote sensing images. GCRL, FA and MSF denote the global context relation learning module, feature aggregation module and multi-scale fusion module, respectively.
Figure 3. The architecture of the global context relation learning module.
Figure 4. The architecture of the feature aggregation module.
Figure 5. The architecture of the multi-scale fusion module.
Figure 6. Comparison of precision–recall curves and F-measure curves of different methods on the ORSSD dataset.
Figure 7. Comparison of precision–recall curves and F-measure curves of different methods on the EORSSD dataset.
Figure 8. Comparison of precision–recall curves and F-measure curves of different methods on the ORSI-4199 dataset.
Figure 9. Visual comparison of saliency score maps generated from different methods. The results demonstrate that our method consistently outperforms other approaches in most cases, and produces results closer to the ground truth.
Figure 10. Salient object detection with/without the designed GCRL module.
Figure 11. Salient object detection with/without the MSF module.
Table 1. Quantitative comparison of different methods on ORSSD and EORSSD datasets. The symbols ↑ and ↓ indicate that higher and lower values, respectively, are better for different metrics. The best results are highlighted in red for clear visualization.
Methods | ORSSD: Sα↑ Fβ^max↑ Fβ^mean↑ Fβ^adp↑ Eξ^max↑ Eξ^mean↑ Eξ^adp↑ M↓ | EORSSD: Sα↑ Fβ^max↑ Fβ^mean↑ Fβ^adp↑ Eξ^max↑ Eξ^mean↑ Eξ^adp↑ M↓
Salient object detection methods for natural images
R3Net18 [63] | 0.8141 0.7456 0.7383 0.7379 0.8913 0.8681 0.8887 0.0399 | 0.8192 0.7516 0.6320 0.4180 0.9500 0.8307 0.6476 0.0171
PiCANet18 [64] | 0.8124 0.7489 0.7410 0.7391 0.8988 0.8752 0.8902 0.0323 | 0.8204 0.7544 0.6364 0.4297 0.9501 0.8351 0.6517 0.0155
PoolNet19 [65] | 0.8403 0.7706 0.6999 0.6166 0.9343 0.8650 0.8124 0.0358 | 0.8217 0.7575 0.6432 0.4627 0.9318 0.8215 0.6851 0.0210
EGNet19 [66] | 0.8721 0.8332 0.7500 0.6452 0.9731 0.9013 0.8226 0.0216 | 0.8601 0.7880 0.6967 0.5379 0.9570 0.8775 0.7566 0.0110
BASNet19 [67] | 0.8716 0.8357 0.7621 0.6558 0.9766 0.9083 0.8317 0.0204 | 0.8751 0.7916 0.7018 0.5417 0.9581 0.8797 0.7662 0.0111
CPD19 [68] | 0.8955 0.8524 0.8243 0.7717 0.9439 0.9208 0.9211 0.0186 | 0.8873 0.8094 0.7661 0.6637 0.9391 0.8978 0.8664 0.0110
RAS20 [69] | 0.8961 0.8634 0.8250 0.7761 0.9491 0.9220 0.9271 0.0176 | 0.8864 0.8123 0.7679 0.6685 0.9412 0.8994 0.8681 0.0112
CSNet20 [70] | 0.8910 0.8790 0.8285 0.7615 0.9628 0.9171 0.9068 0.0186 | 0.8364 0.8341 0.7656 0.6319 0.9535 0.8929 0.8339 0.0169
SAMNet21 [71] | 0.8761 0.8137 0.7531 0.6843 0.9478 0.8818 0.8656 0.0217 | 0.8622 0.7813 0.7214 0.6114 0.9421 0.8700 0.8284 0.0132
HVPNet21 [72] | 0.8610 0.7938 0.7396 0.6726 0.9320 0.8717 0.8471 0.0225 | 0.8734 0.8036 0.7377 0.6202 0.9482 0.8721 0.8270 0.0110
ENFNet20 [73] | 0.8604 0.7894 0.7301 0.6674 0.9217 0.8641 0.8362 0.0231 | 0.8654 0.7932 0.7267 0.6163 0.9383 0.8645 0.8147 0.0123
SUCA21 [74] | 0.8989 0.8484 0.8237 0.7748 0.9584 0.9400 0.9194 0.0145 | 0.8988 0.8229 0.7949 0.7260 0.9520 0.9277 0.9082 0.0097
PA-KRN21 [75] | 0.9239 0.8890 0.8727 0.8548 0.9680 0.9620 0.9579 0.0139 | 0.9192 0.8639 0.8358 0.7993 0.9616 0.9536 0.9416 0.0104
VST21 [76] | 0.9365 0.9095 0.8817 0.8262 0.9810 0.9621 0.9466 0.0094 | 0.9208 0.8716 0.8263 0.7089 0.9743 0.9442 0.8941 0.0067
DPORTNet-VGG22 [77] | 0.8827 0.8309 0.8184 0.7970 0.9214 0.9139 0.9083 0.0220 | 0.8960 0.8363 0.7937 0.7545 0.9423 0.9116 0.9150 0.0150
DNTD-Res22 [78] | 0.8698 0.8231 0.8020 0.7645 0.9286 0.9086 0.9081 0.0217 | 0.8957 0.8189 0.7962 0.7288 0.9378 0.9225 0.9047 0.0113
ICON-PVT23 [79] | 0.9256 0.8939 0.8671 0.8444 0.9704 0.9637 0.9554 0.0116 | 0.9185 0.8622 0.8371 0.8065 0.9687 0.9619 0.9497 0.0073
Salient object detection methods for optical remote sensing images
EMFINetVGG22 [25] | 0.9366 0.9002 0.8857 0.8616 0.9737 0.9672 0.9663 0.0110 | 0.9291 0.8720 0.8486 0.7984 0.9712 0.9605 0.9501 0.0084
ERPNetVGG22 [23] | 0.9254 0.8975 0.8745 0.8357 0.9710 0.9565 0.9520 0.0135 | 0.9210 0.8633 0.8304 0.7554 0.9603 0.9402 0.9228 0.0089
CorrNet22 [24] | 0.9380 0.9128 0.9001 0.8875 0.9790 0.9745 0.9720 0.0098 | 0.9289 0.8778 0.8621 0.8310 0.9696 0.9647 0.9594 0.0084
MCCNet22 [20] | 0.9437 0.9155 0.9054 0.8957 0.9800 0.9758 0.9735 0.0087 | 0.9327 0.8904 0.8604 0.8137 0.9755 0.9685 0.9538 0.0066
HFANet22 [80] | 0.9399 0.9112 0.8981 0.8819 0.9770 0.9712 0.9722 0.0092 | 0.9380 0.8876 0.8681 0.8365 0.9740 0.9679 0.9644 0.0070
MJRBMVGG23 [21] | 0.9204 0.8842 0.8567 0.8022 0.9622 0.9414 0.9327 0.0163 | 0.9197 0.8657 0.8238 0.7066 0.9646 0.9350 0.8897 0.0099
ACCoNet-VGG23 [27] | 0.9437 0.9149 0.8971 0.8806 0.9796 0.9754 0.9721 0.0088 | 0.9290 0.8837 0.8552 0.7969 0.9727 0.9653 0.9450 0.0074
OURS | 0.9440 0.9170 0.9073 0.8990 0.9801 0.9762 0.9758 0.0084 | 0.9346 0.8923 0.8710 0.8448 0.9688 0.9738 0.9697 0.0069
Table 2. Quantitative comparison of different methods on the ORSI-4199 dataset. The symbols ↑ and ↓ indicate that higher and lower values, respectively, are better for different metrics. The best results are highlighted in red for clear visualization.
Methods | ORSI-4199: Sα↑ Fβ^max↑ Fβ^mean↑ Fβ^adp↑ Eξ^max↑ Eξ^mean↑ Eξ^adp↑ M↓
Salient object detection methods for natural images
R3Net18 [63] | 0.8142 0.7847 0.7790 0.7776 0.8880 0.8722 0.8645 0.0401
PiCANet18 [64] | 0.8145 0.7920 0.7792 0.7786 0.8894 0.8891 0.8674 0.0421
PoolNet19 [65] | 0.8271 0.8010 0.7779 0.7382 0.8964 0.8676 0.8531 0.0541
EGNet19 [66] | 0.8464 0.8267 0.8041 0.7650 0.9161 0.8947 0.8620 0.0440
BASNet19 [67] | 0.8341 0.8157 0.8042 0.7810 0.9069 0.8881 0.8882 0.0454
CPD19 [68] | 0.8476 0.8305 0.8169 0.7960 0.9168 0.9025 0.8883 0.0409
RAS20 [69] | 0.7753 0.7343 0.7141 0.7017 0.8481 0.8133 0.8308 0.0671
CSNet20 [70] | 0.8241 0.8124 0.7674 0.7162 0.9096 0.8586 0.8447 0.0524
SAMNet21 [71] | 0.8409 0.8249 0.8029 0.7744 0.9186 0.8938 0.8781 0.0432
HVPNet21 [72] | 0.8471 0.8295 0.8041 0.7652 0.9201 0.8956 0.8687 0.0419
ENFNet20 [73] | 0.7766 0.7285 0.7177 0.7271 0.8370 0.8107 0.8235 0.0608
SUCA21 [74] | 0.8794 0.8692 0.8590 0.8415 0.9438 0.9356 0.9186 0.0304
PA-KRN21 [75] | 0.8491 0.8415 0.8324 0.8200 0.9280 0.9168 0.9063 0.0382
VST21 [76] | 0.8790 0.8717 0.8524 0.7947 0.9481 0.9348 0.8997 0.0281
DPORTNet-VGG22 [77] | 0.8094 0.7789 0.7701 0.7554 0.8759 0.8687 0.8628 0.0569
DNTD-Res22 [78] | 0.8444 0.8310 0.8208 0.8065 0.9158 0.9050 0.8963 0.0425
ICON-PVT23 [79] | 0.8752 0.8763 0.8664 0.8531 0.9521 0.9438 0.9239 0.0282
Salient object detection methods for optical remote sensing images
EMFINetVGG22 [25] | 0.8675 0.8584 0.8479 0.8186 0.9340 0.9257 0.9136 0.0330
ERPNetVGG22 [23] | 0.8670 0.8553 0.8374 0.8024 0.9290 0.9149 0.9024 0.0357
CorrNet22 [24] | 0.8623 0.8560 0.8513 0.8534 0.9330 0.9206 0.9142 0.0366
MCCNet22 [20] | 0.8746 0.8690 0.8630 0.8592 0.9413 0.9348 0.9182 0.0316
HFANet22 [80] | 0.8767 0.8700 0.8624 0.8323 0.9431 0.9336 0.9191 0.0314
MJRBMVGG23 [21] | 0.8593 0.8493 0.8309 0.7995 0.9311 0.9102 0.8891 0.0374
ACCoNet-VGG23 [27] | 0.8775 0.8686 0.8620 0.8581 0.9412 0.9342 0.9167 0.0314
OURS | 0.8821 0.8834 0.8776 0.8647 0.9542 0.9431 0.9258 0.0266
Table 3. Quantitative ablation analysis results of GCRL and MSF modules.
Methods | Sα Fβ^max Fβ^mean Fβ^adp Eξ^max Eξ^mean Eξ^adp M
ORSSD
OURS_noGCRL | 0.9288 0.9001 0.8862 0.8810 0.9688 0.9523 0.9561 0.0132
OURS_noMSF | 0.9373 0.9079 0.9002 0.8893 0.9758 0.9686 0.9696 0.0120
OURS | 0.9440 0.9170 0.9073 0.8990 0.9801 0.9762 0.9758 0.0084
EORSSD
OURS_noGCRL | 0.9211 0.8775 0.8516 0.8267 0.9579 0.9578 0.9504 0.0081
OURS_noMSF | 0.9276 0.8861 0.8648 0.8364 0.9629 0.9688 0.9602 0.0072
OURS | 0.9346 0.8923 0.8710 0.8448 0.9688 0.9738 0.9697 0.0069
ORSI-4199
OURS_noGCRL | 0.8676 0.8642 0.8590 0.8396 0.9403 0.9237 0.9113 0.0311
OURS_noMSF | 0.8704 0.8711 0.8668 0.8587 0.9478 0.9406 0.9221 0.0299
OURS | 0.8821 0.8834 0.8776 0.8647 0.9542 0.9431 0.9258 0.0266
Table 4. Average inference time (in seconds) for a single image on the different datasets.
Methods | ORSSD/EORSSD | ORSI-4199
R3Net18 [63] | 0.512 | 0.228
PiCANet18 [64] | 0.612 | 0.273
PoolNet19 [65] | 0.043 | 0.019
EGNet19 [66] | 0.111 | 0.049
BASNet19 [67] | 0.204 | 0.090
CPD19 [68] | 0.197 | 0.087
RAS20 [69] | 0.107 | 0.048
CSNet20 [70] | 0.026 | 0.012
SAMNet21 [71] | 0.023 | 0.010
HVPNet21 [72] | 0.017 | 0.008
ENFNet20 [73] | 0.027 | 0.012
SUCA21 [74] | 0.018 | 0.008
PA-KRN21 [75] | 0.063 | 0.028
VST21 [76] | 0.051 | 0.023
DPORTNet-VGG22 [77] | 0.012 | 0.006
DNTD-Res22 [78] | 0.157 | 0.069
ICON-PVT23 [79] | 0.028 | 0.012
EMFINetVGG22 [25] | 0.041 | 0.018
ERPNetVGG22 [23] | 0.034 | 0.015
CorrNet22 [24] | 0.021 | 0.009
MCCNet22 [20] | 0.020 | 0.009
HFANet22 [80] | 0.037 | 0.016
MJRBMVGG23 [21] | 0.031 | 0.014
ACCoNet-VGG23 [27] | 0.019 | 0.008
OURS | 0.022 | 0.014
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
