Global and Multiscale Aggregate Network for Saliency Object Detection in Optical Remote Sensing Images

Abstract: Salient Object Detection (SOD) is increasingly applied to natural scene images. However, because optical remote sensing images differ markedly from natural scene images, SOD methods designed for natural scenes capture global context information poorly when applied directly to optical remote sensing images. Salient object detection in optical remote sensing images (ORSI-SOD) is therefore challenging. Optical remote sensing images usually exhibit large scale variations, yet the vast majority of networks rely on Convolutional Neural Network (CNN) backbones such as VGG and ResNet, which can only extract local features. To address this problem, we design a new model that employs a transformer-based backbone network capable of extracting global information and long-range dependencies. The resulting framework is named Global and Multiscale Aggregate Network for Saliency Object Detection in Optical Remote Sensing Images (GMANet). In this framework, the Pyramid Vision Transformer (PVT) serves as the encoder to capture long-range dependencies. A Multiscale Attention Module (MAM) is introduced to extract multiscale information. Meanwhile, a Global Guided Branch (GGB), to which four MAMs are densely connected, learns the global context information and obtains the complete structure. The Aggregate Refinement Module (ARM) enriches the details of edges and low-level features: it fuses global context information with multilevel encoder features to complement the details while keeping the structure complete. Extensive experiments on two public datasets show that GMANet outperforms 28 state-of-the-art methods on six evaluation metrics, especially E-measure and F-measure. We attribute this to the coarse-to-fine strategy used to merge global context information and multiscale information.


Introduction
Salient Object Detection (SOD) endeavours to emulate the human visual system, empowering computers to identify the most compelling objects or regions within a given scene [1]. Serving as a pivotal image preprocessing step, salient object detection finds diverse applications, including object segmentation [2], image relocation [3], image retrieval, classification [4], image compression, and target recognition.
The conventional methodology for salient object detection involves hand-crafted features [1], yet its efficacy and precision are suboptimal. With the advent of deep learning, researchers have incorporated Convolutional Neural Networks (CNN) into computer vision [5], resulting in superior outcomes compared to hand-crafted features [6]. Different conceptualisations have been proposed, such as multiscale, attention, and edge guidance. Recently, salient object detection has garnered increased attention, manifesting in various branches such as salient object detection for natural scene images (NSI-SOD) [7], RGB-D salient object detection [8,9], RGB-T salient object detection, and video salient object detection.
There are other tasks similar to salient object detection in computer vision, such as object detection, novelty detection, anomaly detection, and clustering. Object detection is used to detect and classify all objects in an image. It usually outputs a bounding box around each detected object and the corresponding class label [10]. Salient object detection focuses on identifying visually salient objects or regions, while object detection focuses on detecting and classifying multiple instances of different classes. Novelty detection and anomaly detection are both recognition problems that deal with rare or abnormal instances. Novelty detection focuses on detecting new patterns [11], whereas anomaly detection aims to detect instances that differ significantly from the norm and is usually used in network intrusion detection, medical diagnosis, and other fields [12]. Clustering divides data into clusters based on the similarity or proximity between samples and aims to discover natural groupings in the data [13]. Unlike the other tasks, clustering does not require prior knowledge of the class labels.
This study is specifically dedicated to a distinct application of salient object detection, salient object detection in optical remote sensing images (ORSI-SOD). In optical remote sensing images, salient objects are the obvious or prominent features that stand out from the surrounding environment in terms of colour, brightness, texture, shape, and other important attributes. Salient objects are often meaningful for specific applications or analyses. For example, in urban planning, the salient objects can be buildings, roads, or other infrastructure. In topographic analysis, the salient objects can be rivers, farmlands, and islands. Detecting salient objects is critical in processing and analysing optical remote sensing images because it focuses attention on areas of interest and facilitates more efficient data-driven decisions.
Distinctive features of optical remote sensing images relative to natural scene images include: (1) Optical remote sensing images offer surface information encompassing cities, farmland, rivers, buildings, and roads, reflecting a diversity of object types [14]. (2) Objects in optical remote sensing images exhibit varying sizes, e.g., ships, aeroplanes, bridges, rivers, and islands, signifying diversity in target size [14]. (3) The background of an optical remote sensing image may comprise intricate textures and structures, surpassing the complexity of a natural image [14].
Consequently, tools or methodologies for salient object detection in natural scene images may not be directly applicable to ORSI-SOD. Research in ORSI-SOD typically adopts a CNN-based encoder-decoder structure, with VGG [15] and ResNet [16] as the backbone network. Additionally, related studies have introduced modules to enhance model accuracy, including the attention module [17], multiscale module, and edge guidance module [18]. However, CNN-based network models predominantly focus on the convolution of local features and lack the capability to learn long-range relations, resulting in issues such as misdetection, omission of salient objects, and inaccurate localisation. This limitation is particularly pronounced in ORSI-SOD, where the predicted results lack global structural consistency.
To address the challenges above, this study replaces the CNN backbone with the Pyramid Vision Transformer v2 (PVT-v2) and introduces a novel salient object detection method for ORSI-SOD, the Global and Multiscale Aggregate Network (GMANet). GMANet is specifically tailored for Optical Remote Sensing Images (ORSI) and comprises a PVT-v2 encoder, a Global Guidance Branch (GGB), an Aggregation Refinement Module (ARM), and a Dense Decoder (DD). Notably, the GGB incorporates four densely connected Multiscale Attention Modules (MAM) to address the identified challenges effectively.
The key contributions of this study are as follows:
(1) This research replaces the traditional CNN-based ResNet or VGG with a transformer-based backbone network, PVT-v2, to enhance the completeness of salient regions. Unlike CNN-based methods, which primarily capture local information, transformer-based approaches excel in learning long-range dependencies and acquiring global information. The proposed encoder-decoder architecture includes a PVT-v2 encoder for learning multiscale features and a DD for hierarchical feature map decoding. At the same time, a Global Guidance Branch is built on the encoder.
(2) The study introduces the MAM, recognising the challenge of large variations in object scales within optical remote sensing images. This module extracts multiscale features and establishes densely connected structures for the GGB. The GGB leverages four MAMs to generate global semantic information, guiding low-level features towards more precise localisation.
(3) The ARM is proposed to amalgamate global guidance information with fine features through a coarse-to-fine strategy. Global guidance information ensures accurate localisation of salient objects and captures the complete structural context, while the fine features add details to the preliminary saliency map.
We conducted a series of comparative experiments between GMANet and 28 state-of-the-art methods on two publicly available ORSI-SOD datasets. The outcomes show that the proposed GMANet is highly competitive with previously introduced methods. Particularly noteworthy is a 3.50% improvement in $E_\xi^{adp}$ over the second-ranking method. Among all methods evaluated on the ORSSD dataset, the proposed method is the only one to attain $F_\beta^{adp}$ exceeding 0.86, $F_\beta^{mean}$ surpassing 0.88, and $E_\xi^{adp}$ exceeding 0.97. This indicates that the proposed method improves object accuracy and region completeness relative to alternative methods.
The subsequent sections of this paper are organised as follows: Section 2 provides an extensive review of the pertinent literature on ORSI-SOD. Section 3 offers a detailed exposition of the GMANet components. In Section 4, a thorough analysis of the experimental results and ablation experiments is conducted. Finally, Section 5 presents concluding remarks and a comprehensive summary of the study.

Related Work
This section provides a comprehensive review of research outcomes in the domains of NSI-SOD and ORSI-SOD. The investigation spans both traditional and CNN-based methodologies.

Traditional Methods for NSI-SOD
Pioneering the field, Itti et al. [19] introduced the first computer vision attention model, founded on the centre-surround difference mechanism for localisation. Traditional NSI-SOD approaches predominantly rely on hand-crafted features [1] and fall into three primary categories: unsupervised, semi-supervised, and supervised methods. While the majority are unsupervised [19][20][21][22][23][24], there are fewer semi-supervised [25] and supervised methods [26]. Notable examples include Kim et al.'s extension [22] of the SOD method based on high-dimensional colour transformations and Zhou et al.'s iterative semi-supervised learning framework [25]. Some studies [20,21] used techniques such as random walk and ranking. The random walk algorithm computes a saliency score for each pixel, and the ranking algorithm ranks the images according to the saliency scores of their pixels. In [22], a high-dimensional colour transform maps the colours of image pixels to a high-dimensional space, which better captures the differences and relationships between colours. Liang et al. [26] employed support vector machines for feature selection through supervised learning. Although traditional methods may lack generalisability in novel scenarios, they form the foundation for subsequent methodologies.

CNN-Based Methods for NSI-SOD
CNN-based NSI-SOD methods have surpassed the limitations of traditional approaches [1]. Unlike their traditional counterparts, these methods predominantly operate through supervised learning. Enhancements to model accuracy include Zhao et al.'s introduction of an edge-aware network [18], Liu et al.'s design of a pooling-based module [27], and the widespread integration of attention mechanisms [17,28]. The use of various loss functions, such as the IoU loss introduced by Ma et al. [29] and the SSIM loss by Qin et al. [30], further refines the supervised learning process. GateNet proposed Folded Atrous Spatial Pyramid Pooling (FASPP) to summarise and combine the output feature maps of atrous convolutions with different atrous rates [31]. DSS introduces short connections into the network to fuse features from different levels and achieve multiscale feature aggregation [32]. PoolNet extracts and aggregates multiscale information in a bottom-up and top-down manner [33]. While these methods significantly influence ORSI-SOD, their direct application is hindered by the distinctive characteristics of optical remote sensing images.

CNN-Based Methods for ORSI-SOD
The increasing ubiquity of optical remote sensing images has spurred the combination of salient object detection with these images, giving rise to a novel research area, ORSI-SOD. CNN-based ORSI-SOD methods have overcome the limitations of traditional approaches, exhibiting substantial improvements in experimental results. Diverse solutions have emerged since the introduction of the first ORSI-SOD dataset, ORSSD. Li et al. [32] proposed an LV network with a two-stream pyramid and encoder-decoder architecture. Zhang et al. [34] introduced an end-to-end dense attention fluid network. Li et al.'s [35] parallel up-down fusion network and Tu et al.'s [36] joint learning scheme based on bidirectional feature transformation are notable advancements. Additionally, Li et al.'s [37] multi-content complementary module leverages an attention mechanism to highlight useful features.
These methodologies cater specifically to the unique characteristics of optical remote sensing images. Their collective findings underscore the critical roles played by global context, feature fusion, and dense connections in the SOD task. However, when used independently, these components lead to misdetection, omission of salient objects, and inaccurate localisation. To address these issues comprehensively, we combine all three components, explore their interrelationships, and obtain a model that can accurately locate salient objects.

Proposed Method
This section presents a detailed exposition of the proposed GMANet. Section 3.1 gives an overview of the network architecture before the individual components are described. Section 3.2 describes the MAM, which extracts the multiscale features that comprehensive salient object detection requires. Section 3.3 describes the GGB, which comprises interconnected MAMs, and explains how it generates global semantic information that guides low-level features towards more precise localisation. Section 3.4 explains how the ARM fuses global guidance information with fine features through a coarse-to-fine strategy, so that salient objects are localised accurately and the overall structural context is refined. Section 3.5 describes the components and functions of the DD and explains how it decodes feature maps hierarchically, adding fine details to the saliency map. The final part of this section describes the loss functions used to train and optimise GMANet for salient object detection.

Network Overview
Figure 1 illustrates the overall framework of the proposed GMANet. GMANet comprises four essential components: the PVT-v2 encoder, the GGB, the ARM, and the DD. Notably, the GGB integrates four densely connected MAMs. The overall architecture follows a coarse-to-fine approach. The input image is processed by the PVT-v2 encoder, which generates four feature maps at distinct scales. The MAM combines convolution kernels of different sizes on the same scale to aggregate multiscale features and enhance the perception of scale changes. These multiscale aggregated features are the input of the GGB, which captures the correlation between the features through dense connections and learns global context information. The global context information is used to determine the location of salient regions. The ARM then merges the global context information with high-level and low-level features. The merged features are input into the DD, which decodes them and ultimately yields a refined saliency map. The framework thus progresses systematically from the original image to an accurate delineation of salient regions.
To better establish long-range dependencies and image continuity, we use PVT-v2 [43] as the backbone network of the encoder to extract multiscale features. Specifically, we first cut the input image into uniform patches for self-attention, and PVT outputs four groups of feature maps with sizes of 64 × 64, 32 × 32, 16 × 16, and 8 × 8, respectively. To compute multi-head self-attention more efficiently, PVT uses a sequence reduction method. The input sequence $x_i \in \mathbb{R}^{HW \times C}$ is first reshaped into $\hat{x}_i \in \mathbb{R}^{\frac{HW}{r} \times (C \cdot r)}$, and then the MLP is applied to reduce the channel dimension from $C \cdot r$ to $C$. The PVT-v2 encoder generates four blocks at different scales through a series of convolution, down-sampling, self-attention and multilayer perceptron operations. The multiscale feature maps output by these blocks are denoted $\{f_{x1}, f_{x2}, f_{x3}, f_{x4}\}$. These feature maps are then fed into the Global Guidance Branch (GGB) to mine their multiscale contextual information. The multiscale features are densely connected and aggregated step by step to learn the global context information $f_a$, which carries the global guidance information. Then, $f_a$ and the multiscale feature maps $f_{xi}$ ($i = 1, 2, 3, 4$) are fed into the Aggregate Refinement Module (ARM) together, and $f_a$ is fused with the feature map at each level to combine the global guidance information, high-level semantic information, and local detail information. The fused features are input into the Dense Decoder (DD). The final fine saliency map is generated after step-by-step decoding.
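The shape bookkeeping in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: it assumes a 256 × 256 input (the text does not state the input resolution, but this is the size that yields 64, 32, 16 and 8 under PVT's stage strides of 4, 8, 16 and 32), and the channel width of 64 and reduction ratio r = 64 are hypothetical values chosen only for illustration.

```python
def pyramid_sizes(input_size, strides=(4, 8, 16, 32)):
    """Spatial side length of each encoder stage for a square input."""
    return [input_size // s for s in strides]


def sequence_reduction_shapes(height, width, channels, r):
    """Shapes involved in shortening the key/value sequence before attention.

    Returns three (length, channels) pairs: the input sequence (HW, C),
    the reshaped sequence (HW / r, C * r), and the sequence after the
    projection that maps C * r channels back to C.
    """
    tokens = height * width
    if tokens % r != 0:
        raise ValueError("r must divide H * W")
    return (tokens, channels), (tokens // r, channels * r), (tokens // r, channels)


# Assumed 256 x 256 input; the text only gives the four output sizes.
sizes = pyramid_sizes(256)

# Hypothetical first-stage values: 64 x 64 tokens, 64 channels, r = 64.
shapes = sequence_reduction_shapes(64, 64, 64, r=64)
```

With these assumed values the attention keys and values shrink from 4096 tokens to 64, which is where the efficiency gain comes from.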

Multiscale Attention Module (MAM)
Optical remote sensing images exhibit distinct characteristics, including extreme scale variation and variable numbers of objects, which makes multiscale contextual information paramount in ORSI-SOD. DSS [32] and PoolNet [33] apply fixed-size convolution kernels to extract and aggregate information at different scales. However, a fixed-size convolution kernel can only learn fixed features and captures very limited context information; thus, current methods do not perform well on images with great scale variation, such as optical remote sensing images. We design a Multiscale Attention Module (MAM) to address this challenge. A smaller convolution kernel captures detailed features, while a larger convolution kernel captures a wider range of contextual information. MAM combines convolution kernels of different sizes at the same scale, enhancing the perception of scale changes while reducing information loss. Our approach harnesses both local and global information from features of different resolutions, which proves highly effective in determining the precise locations of salient regions.
Recognising the limitations of fixed-size convolutional kernels, which can only learn a predetermined number of features and capture a limited context, we apply convolutions with various kernel sizes to the same features. This approach enlarges the receptive field, allowing for a more comprehensive understanding of the context. High-level features are rich in semantic information and low-level features are abundant in detailed information, so we aggregate these cross-scale features. The aggregation produces a global context feature that combines multiscale contextual and global information. This enriched feature is further integrated with the encoder features, providing multiscale and global contextual information that enhances the precision of salient region localisation. Figure 2 provides a detailed illustration of the MAM.
We use multiscale feature fusion to combine feature maps of different levels, and the combination of max pooling and fully connected layers helps the network focus on the key features of small objects, addressing the problem that small objects are easily missed. Specifically, the input feature $f_x$ is convolved with kernel sizes of 1, 3, 5 and 7, respectively, to obtain four feature maps of different scales, denoted $f_1$, $f_3$, $f_5$ and $f_7$. We concatenate the three multiscale features $f_3$, $f_5$ and $f_7$ and use a 3 × 3 convolution to fuse them across scales. The fused feature and the feature $f_1$ are multiplied. This feature intersection extracts the common features and aims to minimise the interference of noise on salient regions while extracting multiscale features. Then $f'_1$ is obtained by adding the fused feature and $f_1$.
The enhanced feature $f'_1$ is converted into a channel vector by max pooling. We then input this channel vector into two fully connected layers to obtain the channel weights of $f'_1$. We multiply these weights with $f'_1$ channel by channel, which highlights the informative channels and yields the feature $f'_{1c}$. Feature maps with larger resolutions contain more detailed information, but not all of it belongs to the salient object. Therefore, we simultaneously perform max pooling and a 3 × 3 convolution on $f'_1$ to transform it into a single-channel feature. We multiply the single-channel feature and the channel-weighted feature $f'_{1c}$ pixel by pixel to highlight the salient region and suppress background interference, which gives the feature $f'_{1s}$. Finally, following the residual idea, we add the original input feature $f_x$ and the feature $f'_{1s}$, and obtain the final output feature $f'_x$ after a 1 × 1 convolution.
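The MAM steps described above can be written compactly in symbols. The block below is one possible reading of the prose rather than the authors' exact formulation, and several points are assumptions: the operator names ($\mathrm{Conv}_{k \times k}$, $\mathrm{Cat}$, $\mathrm{GMP}$ for global max pooling, $\mathrm{FC}$, and $\mathrm{MaxPool}_c$ for max pooling along the channel axis) are notation introduced here; $\otimes$ and $\oplus$ stand for element-wise multiplication and addition; the second line takes "adding the fused feature and $f_1$" to mean adding the product back to $f_1$; and any activation functions are omitted because the text does not name them.

```latex
\begin{align}
f_k     &= \mathrm{Conv}_{k \times k}(f_x), \qquad k \in \{1, 3, 5, 7\} \\
f'_1    &= \bigl(\mathrm{Conv}_{3 \times 3}(\mathrm{Cat}(f_3, f_5, f_7)) \otimes f_1\bigr) \oplus f_1 \\
f'_{1c} &= f'_1 \otimes \mathrm{FC}\bigl(\mathrm{FC}(\mathrm{GMP}(f'_1))\bigr) \\
f'_{1s} &= f'_{1c} \otimes \mathrm{Conv}_{3 \times 3}\bigl(\mathrm{MaxPool}_{c}(f'_1)\bigr) \\
f'_x    &= \mathrm{Conv}_{1 \times 1}(f_x \oplus f'_{1s})
\end{align}
```

Read this way, the third line acts as a channel attention and the fourth as a spatial attention, both computed from the same enhanced feature $f'_1$.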

Global Guided Branch (GGB)
GateNet [31] performs simple global information extraction, which cannot fully use the rich correlation information between features at different scales. In this paper, we design a global guided branch that densely connects features across scales to better capture the correlation between features at different scales, capture the long-range semantic dependencies between all spatial locations, and improve global feature consistency. As shown in Figure 1, the GGB comprises four MAMs interconnected through dense connections. Each MAM individually explores the multiscale context information embedded in feature maps of a given resolution, facilitating the dense aggregation of multiscale features. However, given the semantic disparities among features at different scales, direct aggregation may incur partial information loss and introduce new noise interference. To mitigate these problems, we use dense connections to process the features. The dense connections emphasise inter-layer feature correlation and allow the branch to learn the global context information, denoted $f_a$.

Aggregation Refinement Module (ARM)
Effectively combining local details and global semantic information is pivotal for accurate salient region detection. However, simply merging these two types of information may not yield optimal results. We introduce a specialised ARM to address this. The ARM employs global information to guide local details and uses detailed information to enhance global semantics. This reciprocal optimisation culminates in the aggregation of the two types of information, producing a feature map characterised by precise positioning and rich details. A detailed depiction of the ARM is provided in Figure 3.
The feature maps $f_x$ generated by the PVT-v2 encoder at different scales represent local details: they are rich in detail but lack semantic information, which introduces noise. $f_x$ is first enhanced by a Transformer (TF) block, yielding the augmented feature $\hat{f}_x$. Conversely, the global information $f_a$ generated by the GGB is semantically rich but lacks fine details. To address this imbalance, $f_a$ undergoes a channel attention process [44] for channel selection, resulting in the refined feature $\hat{f}_a$. We then multiply $\hat{f}_x$ and $\hat{f}_a$ to make the localisation of the salient region more accurate, and the resulting feature map is denoted $f_x^a$. At the same time, we jointly optimise the global semantic features and the local detail features, which yields $f_a^x$. Finally, the features $f_a^x$ and $f_x^a$ are concatenated, so that the semantic information and the detail information are better fused, the boundary is clearer, and the noise is reduced.

Dense Decoder (DD)
Traditional decoders [45] typically adopt a cascade structure involving multiple convolutional connections. However, ORSI-SOD is characterised by substantial scale variations, with scenes containing both small and large objects. In such cases, conventional decoders prove suboptimal. Drawing inspiration from [46], we introduce the DD. Unlike conventional counterparts, dense decoders employ Dense Separable Convolution (DSConv) blocks with a dense structure, as illustrated in Figure 4. Each dense decoder comprises three DSConvs with dilation rates [47] of 2, 4, and 6, alongside three 1 × 1 convolutions. The dilated DSConvs expand the receptive field while keeping the parameter count low, and the 1 × 1 convolutions merge the densely connected features. The input to the dense decoder is denoted $f_{xa}$, and $\mathrm{DSConv}_r(\cdot)$ denotes a 3 × 3 DSConv with a dilation rate of $r$. The DD can obtain features of different scales, better localise the salient regions, and greatly improve accuracy.
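The claim that dilated DSConvs enlarge the receptive field while keeping the parameter count low can be illustrated with simple arithmetic. The sketch below makes several assumptions that are not stated in the text: it treats a DSConv as a depthwise 3 × 3 convolution followed by a 1 × 1 pointwise convolution, applies the three convolutions in sequence with stride 1, ignores biases, and uses a hypothetical width of 64 channels.

```python
def effective_kernel(kernel, dilation):
    """Side length covered by a dilated kernel: k + (k - 1)(d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def stacked_receptive_field(kernel, dilations):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    field = 1
    for d in dilations:
        field += effective_kernel(kernel, d) - 1
    return field


def conv_params(c_in, c_out, kernel):
    """Weights in a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(c_in, c_out, kernel):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    return kernel * kernel * c_in + c_in * c_out


rates = (2, 4, 6)                          # dilation rates given in the text
kernels = [effective_kernel(3, d) for d in rates]
field = stacked_receptive_field(3, rates)

# 64 input and output channels is a hypothetical width, not from the paper.
standard = conv_params(64, 64, 3)
separable = separable_conv_params(64, 64, 3)
```

Under these assumptions the three layers cover 5, 9 and 13 pixels individually and 25 pixels when stacked, while each separable layer needs roughly one eighth of the weights of a standard 3 × 3 convolution of the same width.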
In the decoding process, the decoder upsamples the compressed feature map layer by layer, which not only restores the size of the feature map but also reconstructs the features by convolution and gradually restores the image details on the basis of accurate positioning. As shown in Figure 1, the features are decoded by D1-D4 to generate four different saliency maps S1-S4, respectively. The D4 decoder is responsible for recovering the low-level details and texture information of the image. It produces S4 with high resolution to capture subtle variations and details of the input image. The D3 decoder is used to recover the shape of the image. It generates S3, which shows the general outline and structure of salient objects. The D2 decoder is dedicated to recovering the semantic information of the image. It generates S2, which reflects a better understanding of the image content. The D1 decoder is responsible for the overall image reconstruction and salient object recovery. The S1 it generates presents the salient object completely and uniquely with high quality. The final output is simply an upsampling of S1 to the size of the input image. It is a saliency map with accurate object localisation, complete structure and clear quality.

Loss Function
Our approach incorporates deep supervision [48] during the training process. This means the loss function supervises feature layers at different levels and scales, ensuring timely parameter adjustments that facilitate comprehensive learning of features across scales and speed up network convergence. Instead of relying on a single loss function, we employ a hybrid loss function that combines Binary Cross-Entropy loss (BCE) and Intersection over Union loss (IoU) [49].
In SOD tasks, the commonly employed BCE loss measures the pixel-wise discrepancy between the predicted mask and the ground truth, emphasising pixel-level loss evaluations.The BCE loss is denoted as follows: Here G(i) ∈ {0, 1} represents the ground truth label of the ith pixel, and S(i) ∈ {0, 1} signifies the predicted salient score.On the other hand, the IoU loss assesses overall architectural similarity, measuring structural congruence rather than individual pixel discrepancies.The IoU loss is expressed as: In combining both losses, we address both pixel-level differences and overall structural disparities concurrently, enhancing the supervision of the saliency map and aiding in network training.The combined loss function is denoted as: Here, G represents the ground truth, and l bce (•) and l iou (•) represent BCE loss and IoU loss, respectively.
In the training phase, as illustrated at the bottom of Figure 1, we employ pixel-level supervision for each decoder block to ensure rapid convergence.Specifically, a convolution is designed after each decoder to generate the saliency map S t .The combined BCE and IoU losses are iteratively applied to generate the final saliency map.
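The hybrid loss and its use under deep supervision can be sketched as follows. This is a minimal illustration built on the standard textbook forms of the BCE and soft IoU losses, not the authors' implementation: averaging BCE over pixels, the `eps` stabiliser, treating each map as a flat list of equal length, and all function names are assumptions. The default equal weighting mirrors the 50% BCE + 50% IoU setting reported in the ablation study.

```python
import math


def bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross-entropy between scores in [0, 1] and labels in {0, 1}."""
    total = 0.0
    for s, g in zip(pred, gt):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= g * math.log(s) + (1 - g) * math.log(1.0 - s)
    return total / len(pred)


def iou_loss(pred, gt, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union, computed over the whole map."""
    inter = sum(s * g for s, g in zip(pred, gt))
    union = sum(s + g - s * g for s, g in zip(pred, gt))
    return 1.0 - inter / (union + eps)


def hybrid_loss(pred, gt, bce_weight=0.5):
    """Weighted sum of BCE and IoU losses (equal weights by default)."""
    return bce_weight * bce_loss(pred, gt) + (1.0 - bce_weight) * iou_loss(pred, gt)


def deep_supervision_loss(side_outputs, gt):
    """Sum the hybrid loss over every decoder output S_t.

    Assumption: every side output has already been resized to match gt.
    """
    return sum(hybrid_loss(s, gt) for s in side_outputs)


gt = [1, 1, 0, 0]
good = [0.9, 0.8, 0.2, 0.1]
poor = [0.4, 0.3, 0.6, 0.7]
print(hybrid_loss(good, gt), hybrid_loss(poor, gt))
```

A prediction close to the ground truth gives a small value for both terms, whereas a structurally wrong map is penalised by the IoU term even when individual pixel errors are moderate.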
ORSSD [32], pioneered by Li et al., marks the inception of public datasets for Remote Sensing Images (RSI). It comprises 800 optical RSI images portraying diverse scenes such as aircraft, islands, lakes, cars, and ships, and each image is accompanied by a corresponding pixel-level ground truth. For training and testing, we utilise 600 and 200 images, respectively.
EORSSD [34] represents an expanded and more challenging version of ORSSD. It currently stands as the largest public dataset for Optical Remote Sensing Images (ORSI), featuring 2000 images. Here, we allocate 1400 images for training and 600 for testing.
Our network training encompasses the use of EORSSD for model training and subsequent evaluation on both ORSSD and EORSSD datasets.

Network Training Details
The dataset undergoes preprocessing, including augmentation through flipping and rotating, which adds seven transformed copies of each training image so that the training set is eight times its original size. Specifically, 4800 augmented pairs are generated for ORSSD [32], and 11,200 augmented pairs for EORSSD [34]. The training process spans 40 epochs for both datasets, employing the PyTorch [50] 1.11.0 platform with an NVIDIA GeForce RTX 3060 Ti for accelerated training. PVT-v2 serves as the encoder and initialises the corresponding network parameters, while new layers are initialised using a normal distribution [51]. The learning rate is initialised at $1 \times 10^{-4}$ and is divided by 10 every 30 epochs. The batch size is set at 4 to align with GPU memory constraints, and the Adam optimiser [52] is employed. The code will be available at https://github.com/houjiayue/GMANet (accessed date: 31 January 2024).
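The training schedule and augmentation counts above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: epochs are counted from zero, and the factor of eight (the original image plus seven transformed copies) is inferred from the reported totals of 4800 and 11,200 pairs. The function names are hypothetical.

```python
def step_lr(epoch, base_lr=1e-4, step=30, gamma=0.1):
    """Learning rate at a zero-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step)


def augmented_pairs(n_train, copies_per_image=8):
    """Training pairs after augmentation (original plus seven transformed copies)."""
    return n_train * copies_per_image


for epoch in (0, 29, 30, 39):
    print(epoch, step_lr(epoch))

print(augmented_pairs(600), augmented_pairs(1400))
```

With a 40-epoch run, the learning rate therefore drops only once, at epoch 30, and stays at the reduced value for the last ten epochs.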
S-measure assesses region-aware and object-aware structural similarity, measuring the similarity between foreground pixels and ground truth, with larger values indicating better performance.
F-measure strikes a balance between precision and recall, serving as a weighted average of both, with higher values indicating superior performance.
E-measure combines pixel-level local information with image-level global information, with larger values indicating improved performance.

MAE calculates the average of absolute errors between the predicted and true values, with smaller values signifying better pixel-wise accuracy.
Precision-Recall (PR) curves portray the relationship between precision and recall, with thresholds ranging from 0 to 255. A PR curve closer to the top-right corner indicates superior performance.
Traditional NSI-SOD methods (five methods): RRWR [20], HDCT [22], DSG [23], SMD [24], RCRR [21].
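To make the MAE, F-measure, and PR-curve thresholds described among the metrics above concrete, the following sketch computes them for one flattened saliency map. It is illustrative rather than the evaluation code used in the paper: scores are assumed to be normalised to [0, 1], the weight β² = 0.3 is the value commonly used in SOD evaluation and is not stated in this section, and the function names are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between predicted scores and binary labels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def precision_recall(pred, gt, threshold):
    """Precision and recall after binarising the prediction at a threshold."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(b * g for b, g in zip(binary, gt))
    predicted = sum(binary)
    positives = sum(gt)
    precision = tp / predicted if predicted else 0.0
    recall = tp / positives if positives else 0.0
    return precision, recall


def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall (beta_sq is an assumption)."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def max_f_measure(pred, gt):
    """Best F-measure over the 256 thresholds t / 255 for t = 0..255."""
    return max(f_measure(*precision_recall(pred, gt, t / 255)) for t in range(256))


pred = [0.9, 0.8, 0.2, 0.1]
gt = [1, 1, 0, 0]
print(mae(pred, gt), precision_recall(pred, gt, 0.5), max_f_measure(pred, gt))
```

Sweeping the same thresholds and collecting the (precision, recall) pairs instead of the maximum would give the points of a PR curve.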
We took the following steps to ensure a fair comparison. The same datasets are used: each method is evaluated on the ORSSD and EORSSD datasets. The same training schedule and parameters are used: we retrained AccoNet [66], CorrNet [46], MSCNet [67], and MCCNet [37] with identical training parameters, initialising the learning rate to $1 \times 10^{-4}$, scaling it down by a factor of 10 every 30 epochs, and setting the batch size to 4, so that all networks are trained and tested under the same conditions. The same optimisation strategy is adopted: all methods use the Adam optimiser [52]. The same performance metrics are used: all methods are evaluated with a unified set of metrics, which comprehensively reflects the strengths and weaknesses of the different models. The same backbone network is used: DAFNet [34], MJRBM [36], and AccoNet [66] are available with different backbones, and we uniformly use the VGG versions so that they are compared on the same base network.
On the EORSSD dataset, our method ranked first in four metrics, second in one, and third in one, emerging as the overall best performer. Notably, among existing NSI-SOD methods, PA-KRN demonstrated superior performance because it introduces a location-aware mechanism that better models the location of the object in the image. However, our proposed method exhibited significant advantages across all indicators, except for a marginal 0.04% shortfall in $F_{\beta}^{\max}$. Specifically, our method surpassed PA-KRN by 2.67%, 1.18%, 3.50%, and 0.34% in $F_{\beta}^{\mathrm{adp}}$, $F_{\beta}^{\mathrm{mean}}$, $E_{\xi}^{\mathrm{adp}}$, and $S_{\alpha}$, respectively, while registering a modest 0.33% decrease in M. This advantage arises because our method applies multiple convolution kernels of different sizes to the feature map, which better fuses the feature information of multiple scales. This multiscale feature fusion helps improve object detection performance and adapts well to objects with extreme scale changes. Additionally, our method outperformed the leading ORSI-SOD method, MCCNet, across all metrics, with a notable 1.15% improvement in $E_{\xi}^{\mathrm{adp}}$ and a 0.97% reduction in M. This benefits from the dense connections in the global guidance branch and the decoder, which better capture the correlation between features at different scales.

Quantitative Comparison
On the ORSSD dataset, our method secured the top position in all five metrics, distinguishing itself as the only method with $F_{\beta}^{\mathrm{adp}}$ surpassing 0.86, $F_{\beta}^{\mathrm{mean}}$ exceeding 0.88, and $E_{\xi}^{\mathrm{adp}}$ surpassing 0.97. Compared to the leading PA-KRN method, our approach exhibited significant advantages with higher values of 1.12%, 0.97%, 3.05%, and 0.28% in $F_{\beta}^{\max}$, $F_{\beta}^{\mathrm{adp}}$, $F_{\beta}^{\mathrm{mean}}$, $E_{\xi}^{\mathrm{adp}}$, $S_{\alpha}$, respectively. In contrast to eight traditional methods, encompassing both NSI-SOD and ORSI-SOD, as well as 11 CNN-based NSI-SOD methods, our method consistently outperformed the competition. This is because our approach focuses on capturing remote dependencies, overcoming the disadvantage of focusing on local feature learning. At the same time, the coarse-to-fine strategy adds rich details to the global information and improves object detection accuracy.
In terms of speed, GMANet is not dominant compared to other salient object detection methods. This is because we extract features using PVT-v2, which consists of multiple transformer blocks and pays more attention to modelling long-range dependencies in the image, so it requires self-attention computation at more locations. The model therefore performs more computation on the input image, which slows down inference. Despite its relatively slow speed, GMANet is more accurate in perceptual ability and semantic understanding. Regarding model size, CSNet is the smallest network, but it is inferior to our method on every salient object detection evaluation metric. Compared with methods of similar size, GMANet performs better on the evaluation metrics. GMANet can be used for image editing and enhancement tasks, such as highlighting important objects or adjusting the focus of an image in photo processing software.
Furthermore, we include the Precision-Recall (PR) curve in Figure 5, revealing that the PR curve associated with our method resides closer to the top-right corner than those of all the methods under comparison. This supports the claim that our proposed method is the most effective performer.
Upon scrutinising the tabulated results, a discernible trend emerges: the CNN-based ORSI-SOD methods consistently outperform their NSI-SOD counterparts. This observation indicates that a specialised approach yields superior performance, and it underscores the importance of devising methods explicitly tailored for ORSI.

Visual Comparison
In Figure 6, we present illustrative examples showcasing the qualitative efficacy of our method. These instances encompass scenarios with multiple tiny objects, irregular geometric structures, objects with shadows, objects against complex backgrounds, objects with low contrast, and objects with interferences. Additionally, we compare the saliency maps generated by our method with those from eight advanced methods. This set includes three CNN-based ORSI-SOD methods (MCCNet, AccoNet, and LVNet), three CNN-based NSI-SOD methods (PA-KRN, GateNet, and MINet), one traditional ORSI-SOD method (CMC), and one traditional NSI-SOD method (SMD).
(1) Multiple tiny objects. This scenario features a combination of multiple and tiny objects. The distinct shooting distance and angle in ORSI images make small objects significantly smaller than those in NSI, presenting a challenge in detecting all small objects comprehensively. The CNN-based methods in the first row often miss or misdetect salient objects, and traditional methods struggle to adapt to ORSI. In contrast, our method comprehensively detects all objects in scenes with multiple salient objects. This is due to the multiscale feature fusion technique that we use in MAM to combine features from different levels. The shallow detail and deep semantic information are fused to better deal with objects of different sizes. In addition, we introduce an attention mechanism to focus on the key features of small objects. In the deep layers of the network, we use upsampling to enlarge the feature map and fuse it with the shallow features so as to recover the lost detailed information. In this way, our network maintains effectiveness and accuracy on small objects.
(2) Irregular geometry structure. These structures exhibit intricate and irregular topologies, making accurate edge delineation challenging. They appear at various positions and sizes in the image. While AccoNet, LVNet, and MINet can only detect a portion of the river, other methods encounter difficulties, such as introducing noise and unclear edges. Our method, however, accurately detects rivers with complete structures and clear boundaries, notably capturing the lower-left region of the island. We extract the global context information to improve the clarity of the boundary, which helps identify irregular geometric structures in the image.
(3) Objects with shadows. Shadows, often misdetected as salient objects, can create inaccurate detection results. Other methods may miss one or two circles, and GateNet incorrectly highlights shadows. In contrast, our method adeptly detects objects without redundant shadow regions.
(4) Objects with complex backgrounds. The multiscale attention module we designed uses the attention mechanism to highlight salient objects while effectively suppressing background information, which enhances the ability to recognise objects against complex backgrounds. Our results exhibit superior noise reduction, effectively shielding background interference and precisely capturing salient objects.
(5) Objects with low contrast. When salient objects closely resemble the background, many existing methods struggle to highlight them accurately. The lines detected using the three NSI-SOD methods appear fuzzy, and MCCNet fails to detect lines altogether. Conversely, our method yields clear detections, particularly demarcating the accurate boundaries of small islands.
(6) Objects with interferences. Some non-salient objects may interfere with detection, leading to incorrect highlights. Our method can distinguish the interfering objects by modelling the context information around the target, including object shape and texture. In addition, we use the attention mechanism to weight the features during selection, which makes the model pay more attention to the features that help identify the target and reduces the impact of interfering objects. Our method excels in distinguishing and accurately highlighting salient objects in the presence of potential interferences.
Our method adeptly leverages contextual information, global semantic details, and intricate image features. It effectively addresses challenges related to scale, location, number, and shape variations in ORSI, demonstrating robustness and accuracy in highlighting salient objects across diverse scenarios.
Inspired by the visualisation results, we also consider some applications of GMANet in specific domains. In terms of urban planning, our network can be used for urban development, infrastructure layout, and land use planning to help planners make rational decisions from a clear view of the urban layout. In terms of environmental monitoring, relevant personnel can monitor forest cover change, water pollution, land degradation, etc., based on the saliency map provided by the network, which is crucial for environmental protection and sustainable management. In terms of resource exploration, this method supports exploration in remote or inaccessible areas, which is conducive to discovering natural resources such as water and minerals. In the future, the network has potential application value in marine and coastal detection, agricultural monitoring, etc.

Ablation Experiment
This section presents comprehensive experiments designed to assess the effectiveness of crucial components within our GMANet on both the EORSSD and ORSSD datasets. The experiments focus on the following aspects: (1) the distinct contributions of the ARM and the GGB, (2) the significance of dense links within the GGB branch, (3) the rationale behind the dilation rate design in the MAM, (4) the effectiveness of the Transformer (TF) block and Channel Attention (CA) block within the ARM module, and (5) the efficacy of the MAM module. Additionally, (6) we explore the complementarity between Binary Cross-Entropy (BCE) and Intersection over Union (IoU) in the loss function.
In each variant experiment, modifications are made to only one component at a time, and the model is retrained on both datasets, adhering strictly to the parameters and training methods outlined in Section 4.1.
(1) Individual contribution of each module in the network: To assess the distinct contributions of each module, namely the ARM module and GGB, we propose three variants of GMANet in Table 2. Baseline: The base network comprises only the encoder-decoder, where the encoder is PVT-v2, and the decoder is the dense decoder.
Baseline + ARM: GGB is removed, retaining only the ARM module. Given the absence of GGB, the dual input of the ARM module is reduced to a single input: the multiscale feature map output by the encoder. This feature map passes directly through a transformer and a convolution layer with a 3 × 3 kernel.
Baseline + GGB: The ARM module is omitted, and the feature maps generated by GGB are directly connected to the dense decoder.
Baseline + ARM + GGB: This represents the complete network structure, where both the ARM module and GGB are incorporated into the network to form GMANet. Quantitative results are presented in Table 2.
As presented in Table 2, on the EORSSD dataset, the "Baseline" achieves 86.08% on $F_{\beta}^{\max}$, 94.94% on $E_{\xi}^{\mathrm{adp}}$, 91.35% on $S_{\alpha}$, and 0.0094 on M. Compared to the "Baseline," the "ARM" module yields increases of 0.83%, 0.67%, and 0.60% in $F_{\beta}^{\max}$, $E_{\xi}^{\mathrm{adp}}$, and $S_{\alpha}$, respectively. Similarly, the "GGB" branch demonstrates improvements of 0.22%, 1.33%, and 0.04% over the "Baseline" in the corresponding metrics. When "ARM" and "GGB" are applied together, the respective increases over the "Baseline" are 1.37%, 1.29%, and 0.92%, validating the efficacy of the "ARM" and "GGB" modules and their synergistic impact. The trend observed on the ORSSD dataset aligns consistently with the EORSSD dataset, thus affirming the effectiveness of each proposed module.
(2) Importance of Dense Links in GGB: Two branch structures have been proposed for GGB to maximise model accuracy. One consists of four MAM modules directly spliced, while the other features four MAM modules densely connected, as illustrated in Figure 7. Quantitative results are detailed in Table 3.
As indicated in Table 3, on the EORSSD dataset, GGB-1 achieves 86.88% on $F_{\beta}^{\max}$, 95.87% on $E_{\xi}^{\mathrm{adp}}$, and 90.44% on $S_{\alpha}$. In comparison, GGB-2 attains 87.45% in $F_{\beta}^{\max}$, 96.23% in $E_{\xi}^{\mathrm{adp}}$, and 92.27% in $S_{\alpha}$, representing increases of 0.57%, 0.36%, and 0.44%, respectively. Similarly, these three metrics show improvement on the ORSSD dataset, with increases of 1.29%, 0.45%, and 0.59%, respectively. The rationale behind this lies in the notable feature of ORSI, characterised by substantial scale variation. A direct connection may result in inadequate fusion of feature maps across different scales, whereas a dense connection facilitates more effective layer-by-layer fusion of feature maps at varying scales. Therefore, we adopt the more effective GGB-2 structure as the GGB of the network.
(3) The rationality of dilation rate design in the MAM module: We present two MAM module variants to assess the rationality of dilation rates in dilated convolutions within the MAM module. The first variant features dilation rates of 1, 3, 5, and 7, mirroring the dilation rates employed by our network. The second variant adopts dilation rates of 3, 5, 7, and 9, respectively, while keeping other components unchanged. The quantitative results are presented in Table 4.
With d = 1, 3, 5, and 7 on the EORSSD dataset, $F_{\beta}^{\max}$ is 0.8745, $E_{\xi}^{\mathrm{adp}}$ is 0.9623, and $S_{\alpha}$ is 0.9227. However, with an increase in dilation rate to d = 3, 5, 7, and 9, these three indices decrease by 0.76%, 0.24%, and 0.33%, respectively. The trend observed in the ORSSD dataset aligns with the pattern identified in the EORSSD dataset. A larger dilation rate corresponds to a wider receptive field, thereby enhancing the network's perceptual capabilities. Distinct dilation rates result in varied receptive fields, acquiring multiscale information. However, with a continuous increase in the dilation rate, diminishing returns are noted. This is attributed to the overly large receptive field, which makes it hard for the network to accurately capture variable-scale salient objects in optical remote sensing images. Optimal results are achieved with d = 1, 3, 5, and 7 on both the EORSSD and ORSSD datasets, affirming the rationality of our chosen dilation rates.
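The relationship between dilation rate and receptive field discussed in this experiment can be illustrated by the spatial extent a single dilated kernel covers. The sketch below assumes 3 × 3 kernels in the MAM branches, which is not stated in this section, and uses the standard relation k + (k − 1)(d − 1) for a kernel of size k with dilation d.

```python
def effective_kernel(k, dilation):
    """Side length of the area covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


# Dilation sets compared in the ablation; the 3 x 3 kernel size is an assumption.
chosen = [effective_kernel(3, d) for d in (1, 3, 5, 7)]
larger = [effective_kernel(3, d) for d in (3, 5, 7, 9)]
print("d = 1, 3, 5, 7 ->", chosen)
print("d = 3, 5, 7, 9 ->", larger)
```

Under this assumption, the second variant no longer has a branch covering a compact 3 × 3 neighbourhood, which is consistent with the explanation that overly large receptive fields make small salient objects harder to capture.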
(4) The efficacy of the Transformer (TF) and Channel Attention (CA) components in the ARM is assessed through ablation experiments, where two ARM variants are presented: (1) "w/o TF," which excludes transformer blocks, and (2) "w/o CA," which omits the channel attention module. The complete ARM module, denoted as "w/TF + CA," is also included for reference. The quantitative results are presented in Table 5. The ablation results in Table 5 show that performance degrades when either the TF or the CA blocks are removed from the ARM module. Specifically, on the EORSSD dataset, the removal of TF blocks results in a decrease of 0.64% in $F_{\beta}^{\max}$, 0.24% in $E_{\xi}^{\mathrm{adp}}$, and 0.39% in $S_{\alpha}$. Similarly, without CA blocks, these metrics decrease by 0.65%, 0.78%, and 0.52%, respectively. The ORSSD dataset exhibits a consistent trend with the EORSSD dataset. The transformer is adept at capturing remote dependencies, showcasing a robust ability to model relationships across distant regions and adaptively extract global context information. This characteristic is particularly beneficial for images with significant scale variations, such as those encountered in ORSI. On the other hand, channel attention predicts channel importance and assigns varying weights to each channel to accentuate salient regions while disregarding less relevant information. Consequently, channel attention facilitates the redistribution of feature weights, reducing noise. This substantiates the efficacy of TF and CA in the ARM module.
(5) To demonstrate the roles of the BCE loss and the IoU loss in the loss function, we designed three variants. The first uses only the BCE loss, the second uses only the IoU loss, and the third uses the mixed BCE and IoU loss, which is the loss adopted in this paper. The quantitative results are shown in Table 6. On the ORSSD dataset, employing both loss functions increases $F_{\beta}^{\max}$ by 3.47%, $E_{\xi}^{\mathrm{adp}}$ by 6.57%, and $S_{\alpha}$ by 7.13%. BCE loss, offering pixel-wise supervision, measures the loss between the predicted mask and true values at each pixel. In contrast, IoU loss, providing map-level supervision, evaluates structural similarity without concentrating solely on individual pixels. Their combination yields a synergistic effect, with the two losses complementing each other. Therefore, we conclude that training the network with the combined BCE and IoU loss functions produces superior results.
The results in Figure 8 show that as the BCE ratio increases, the results gradually improve, and the best result is achieved at 50% BCE + 50% IoU. However, as the IoU ratio continues to increase, the results gradually deteriorate. This is because the BCE loss provides pixel-wise supervision, and the IoU loss provides map-level supervision, evaluating the similarity of structures. Both are equally important, and setting them in equal proportions provides full supervision of the images. Therefore, we choose the mixed loss of 50% BCE + 50% IoU as the loss function of this method.

Conclusions
In this paper, we combine the three aspects of global context, feature fusion, and dense connection, deeply explore the relationship between features, and propose GMANet, a network designed specifically for optical remote sensing images. First, we use the Pyramid Vision Transformer (PVT-v2) encoder to capture remote dependencies and address the limitations of CNN-based models. To adapt to the large-scale variation of ORSI, we propose the MAM module for learning multiscale information. We then propose the Global Guidance Branch, which consists of four densely connected MAM modules for learning global context information. We place the ARM module between the encoder and decoder to better fuse global and detailed information. We also adopt the Dense Decoder to increase the receptive field and obtain accurate localisation information. In particular, we employ the supervision of a hybrid loss to improve the network's performance. Extensive comparative and ablation experiments show that our proposed method outperforms the 28 compared methods and obtains relatively complete and accurate salient regions. Nevertheless, the proposed method may encounter challenges in accurately detecting objects with extremely fine edges, such as aeroplanes. Future work will explore integrating edge detection methods to enhance model accuracy in such scenarios.

Figure 1. The overall framework of the proposed Global and Multiscale Aggregate Network for Saliency Object Detection in Optical Remote Sensing Images (GMANet). GMANet consists of four main parts: the PVT-v2 encoder, the Global Guidance Branch (GGB), the Aggregate Refinement Module (ARM), and the Dense Decoder (DD), where the GGB consists of four densely connected Multiscale Attention Modules (MAM). First, four feature maps of different levels are generated by the encoder PVT-v2, which are fed into the GGB to learn global context information. The global context information and high-level and low-level features are fused through the ARM and then input into the DD for further analysis. Notably, in the training phase, we adopt the deep supervision strategy and attach supervision to each decoder block. GT denotes ground truth.

Figure 5. Quantitative comparison of the PR curves of SOD methods on EORSSD and ORSSD datasets.

Figure 6. Visual comparisons with eight representative state-of-the-art methods. Please zoom in for the best view.

Figure 7. GGB Variant. (a) Consists of four MAM modules directly spliced, and (b) consists of four MAM modules densely connected.

Figure 8. Ablation studies to evaluate the contribution of the BCE and IoU in loss functions. The best result for each column is in bold.
State-of-the-Arts

Comparison Methods
Our proposed method was systematically compared against 28 contemporary techniques, categorised into four groups: traditional Natural Scene Image Salient Object Detection (NSI-SOD) methods, CNN-based NSI-SOD methods, traditional Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) methods, and CNN-based ORSI-SOD methods. The breakdown of methods in each category is as follows:

Table 1 reports six evaluation metrics: $F_{\beta}^{\max}$, $F_{\beta}^{\mathrm{adp}}$, $F_{\beta}^{\mathrm{mean}}$, $E_{\xi}^{\mathrm{adp}}$, $S_{\alpha}$, and M. Notably, the first five indicators reflect a superior performance with larger values, while the last indicator, M, signifies better results with smaller values. This thorough comparison aims to elucidate the efficacy and competitiveness of our proposed method in relation to existing state-of-the-art techniques.

Table 1. Quantitative results on two datasets, EORSSD and ORSSD. A total of 28 methods are compared, including five traditional salient object detection in natural scene images (NSI-SOD) methods, 11 CNN-based NSI-SOD methods, three traditional salient object detection in optical remote sensing images (ORSI-SOD) methods, and nine CNN-based ORSI-SOD methods. ↑/↓ indicates that a larger or smaller score is better. The top three results are highlighted in red, blue, and green.

Table 2. Ablation analysis measuring the overall contribution of ARM and GGB in GMANet. The baseline is the encoder-decoder network. The best result for each column is in bold.

Table 3. Ablation experiments for two classes of GGB variants in the GMANet. The best result for each column is in bold.

Table 4. Rationality of dilation rate design in the GMANet. The best result for each column is in bold.

Table 5. Effectiveness of TF and CA in the ARM module. The best result for each column is in bold. w/o TF: ARM without TF blocks. w/o CA: ARM without CA blocks. w/TF + CA: ARM with both the TF and CA blocks.

Table 6. Ablation studies to evaluate the complementarity of the BCE and IoU in loss functions. The best result for each column is in bold.

Examination of Table 6 reveals that training GMANet with either BCE loss or IoU loss alone yields decent performance. For BCE loss on the EORSSD dataset, $F_{\beta}^{\max}$ is 0.8448, $E_{\xi}^{\mathrm{adp}}$ is 0.8667, and $S_{\alpha}$ is 0.8849. On the ORSSD dataset, $F_{\beta}^{\max}$ is 0.8721, $E_{\xi}^{\mathrm{adp}}$ is 0.9057, and $S_{\alpha}$ is 0.8555. Meanwhile, IoU loss exhibits superior performance compared to the BCE loss. However, employing both loss functions in tandem during network training further improves performance. On the EORSSD dataset, $F_{\beta}^{\max}$ increases by 2.97%, $E_{\xi}^{\mathrm{adp}}$ increases by 9.56%, and $S_{\alpha}$ increases by 3.78%.