Feature Refine Network for Salient Object Detection

Different feature learning strategies have enhanced the performance of recent deep neural network-based salient object detection. The multi-scale strategy and the residual learning strategy are two representative feature learning strategies. However, some problems remain, such as the inability to effectively utilize multi-scale feature information and the lack of fine object boundaries. We propose a feature refine network (FRNet) to overcome these problems, which includes a novel feature learning strategy that combines the multi-scale and residual learning strategies to generate the final saliency prediction. We introduce spatial and channel ‘squeeze and excitation’ blocks (scSE) at the side outputs of the backbone, which allow the network to concentrate more on saliency regions at various scales. Then, we propose the adaptive feature fusion module (AFFM), which efficiently fuses multi-scale feature information to predict superior saliency maps. Finally, to supervise the network in learning more information about object boundaries, we propose a hybrid loss that combines the properties of four fundamental losses. Comprehensive experiments on five datasets demonstrate the effectiveness of FRNet, with competitive results compared to other relevant approaches.


Introduction
Visual saliency refers to the most interesting region or object to the human visual system. The detection and segmentation of salient objects in natural scenes is often referred to as salient object detection (SOD). SOD usually serves as an important image pre-processing step in many applications, including strengthening high-resolution satellite image scene classification under unsupervised feature learning [1], unsupervised video target segmentation [2], video summarization [3], image editing and manipulation [4], visual tracking [5], etc. Detecting salient objects demands a semantic comprehension of the entire image as well as specific structural information about the object. Thus, saliency detection is a crucial and difficult basic problem in computer vision.
Deep convolutional neural networks (CNNs) such as VGG [6] and ResNet [7] have shown great potential in computer vision. However, problems arise when such networks are applied to dense prediction tasks, such as medical image segmentation and scene segmentation: as the network becomes deeper, the spatial size of the features becomes smaller, due to the strides of multiple convolution operations. Therefore, early SOD methods were based on super-pixels [8] or image patches [9]. However, these frameworks do not make full use of high-level information, and spatial information cannot propagate to the last fully connected layer, which leads to the loss of global information [10]. These problems prompted the introduction of fully convolutional networks [11] into dense prediction tasks. However, there remains the difficulty of recovering object boundaries (low-level features), which prompts researchers to explore the restoration of spatial details with multi-scale architectures.
Amulet [12] introduced a multi-scale feature aggregation strategy. The architecture utilizes VGG-16 as the backbone to extract features from the outputs of its convolutional layers.

• We introduce an scSE module for pre-processing the outputs at five scales of the backbone. The scSE consists of channel and spatial attention. It improves the representation ability of the features at each scale and enhances the effectiveness of the residual learning strategy.
• We propose an AFFM module that enables the overall fusion of multi-scale information to compose the final saliency prediction. AFFM combines two kinds of multi-scale feature fusion strategies to improve saliency detection performance.
• A hybrid loss is proposed for supervision during the residual learning processes at five scales and for the generation of the final saliency maps. It fuses BCE, structural similarity (SSIM), dice and intersection-over-union (IOU) losses to train the model at the local and global levels.
The remainder of the work is structured as follows: the related SOD works are discussed in Section 2. In Section 3, the overall architecture of FRNet and the details of each part in FRNet are discussed. In Section 4, the experimental results of FRNet with other methods are reported. Then, the ablation experiments are conducted for each module to verify the effectiveness of the proposed modules. Finally, the paper is concluded in Section 5.

Related Work
Amulet, DSS, R3Net [18] and R2Net are based on fully convolutional neural networks and have achieved impressive results in the SOD task. They employ holistically-nested edge detection (HED) [14] and iteratively improve network architecture and learning strategies to reconstruct different low-scale features into high-scale features. To improve SOD, this paper combines two feature reconstruction strategies: one is residual learning through iterative refinement, and the other is high-resolution map reconstruction directly from all scales. Therefore, we can also consider SOD as a reconstruction problem [19].

Residual Learning
The residual learning strategy is inspired by ResNet [7]: encoding data as residual tensors is more efficient than learning the mapping directly. By stacking convolutional layers to fit a mapping function H(x), the residual form is written as H(x) = F(x) + x, where x denotes the initial input and can be regarded as prior knowledge; without residual learning, the formula is simply H(x) = F(x). The learning strategy of DSS and Amulet treats the learning process at each scale as a sub-network, and each sub-network learns a mapping function to regress the GT. R3Net and R2Net instead learn a residual mapping function at each scale to adjust the coarse prediction. Our approach employs a similar strategy.
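As a minimal numerical illustration of the difference between the two formulations (a simple element-wise transform stands in for the stacked convolutional mapping F; this is a sketch, not the network itself), the residual form reduces to the identity when F learns to output zero:

```python
import numpy as np

def conv_like(x, w):
    # Stand-in for a stacked convolutional mapping F(x);
    # here simply an element-wise linear transform for illustration.
    return w * x

def plain_mapping(x, w):
    # Without residual learning: H(x) = F(x).
    return conv_like(x, w)

def residual_mapping(x, w):
    # With residual learning: H(x) = F(x) + x, so F only has to
    # model the residual (correction) on top of the input x.
    return conv_like(x, w) + x

x = np.array([1.0, 2.0, 3.0])
# When the target is close to the input, the residual branch only
# needs F(x) ~ 0 (here w = 0), which eases optimization.
identity_out = residual_mapping(x, 0.0)
```

This is why learning a residual correction at each scale (as in R3Net and R2Net) is easier than regressing the full GT from scratch when the coarse prediction is already close.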

Multi-Scale Feature Fusion
Refs. [11,14] show that multi-scale deep features can generate better predictions [20]. Researchers have proposed many different strategies and methods for multi-scale feature fusion. Zhang et al. [21] introduce a reformulated dropout and a hybrid up-sampling module to eliminate the checkerboard effects of deconvolution operators. Luo et al. [22] present an NLDF network with a 4 × 5 grid architecture for saliency detection, in which feature maps from later layers are gradually blended with feature maps from earlier layers. Zhang et al. [23] utilize a sibling architecture to generate saliency maps by extracting feature maps from both original pictures and projection images. Wu et al. [24] propose a novel mutual learning module for improving SOD accuracy by more effectively exploiting the correlation of borders and regions. Chen et al. [25] iteratively use the side output of the backbone network as feature attention guidance to predict and refine saliency maps. For quick and accurate SOD, Wu et al. [26] offer the cascaded partial decoder (CPD). These SOD approaches make use of the multi-scale information retrieved by the backbone and outperform classic algorithms in the SOD task.

Attention Mechanism
The attention mechanism can be considered a dynamic selection process in computer vision, achieved by adaptively weighting features based on the relevance of the input data; it mimics the way the human visual system focuses on salient areas. Attention mechanisms provide performance improvements in many computer vision tasks, and some scholars have introduced them into SOD. Liu et al. [27] proposed PiCANet to generate attention over the context regions. Zhang et al. [28] offer a progressive attention-guiding mechanism that refines saliency maps by designing a spatial and channel attention module to acquire global information at each layer. The attention module proposed by Zeng et al. [29] computes the spatial distribution of foreground objects over image regions and aggregates the features of all the regions. Various pure and deep self-attention networks have emerged [30][31][32], demonstrating the enormous potential of transformer models. Liu et al. [33] proposed the first pure transformer for RGB and RGB-D SOD.
We utilize channel and spatial attention to design the scSE, which weights spatial attention to guide the network to learn more discriminative features for predicting the saliency map. We then design the feature fusion module AFFM, which processes the feature at each scale after the ARM and adaptively integrates the multi-scale features to compose the final saliency prediction.

Loss Function
In addition to the discussion of model architecture, the demand for innovative loss functions has also become more prominent. In most SOD models, the binary cross-entropy (BCE) loss is used for network training. It is formulated as follows:

L_BCE = −Σ_i [ y_i log(x_i) + (1 − y_i) log(1 − x_i) ],

where y_i denotes the ground truth pixel value and x_i denotes the predicted pixel value. Luo et al. [22] introduced the dice loss [34] into the SOD task as statistical supervision on the final prediction. Dice loss is formulated as follows:

L_Dice = 1 − 2TP / (2TP + FN + FP),

where TP denotes true positives, FN denotes false negatives and FP denotes false positives.
Qin et al. [35] added SSIM and IOU losses to the traditional BCE loss. SSIM loss and IOU loss are formulated as follows:

L_SSIM = 1 − (2 μ_x μ_y + C_1)(2 σ_xy + C_2) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)),

L_IOU = 1 − TP / (TP + FN + FP),

where x = {x_j : j = 1, ..., N²} and y = {y_j : j = 1, ..., N²} are the pixel values of two corresponding patches (size: N × N) cropped from the predicted map and the ground truth mask, respectively. In addition, μ_x, μ_y, σ_x, σ_y are the means and standard deviations of x and y, respectively, and σ_xy is the covariance of x and y. TP denotes true positives, FN false negatives and FP false positives. C_1 = 0.01² and C_2 = 0.03² are used to avoid division by zero. These loss functions focus on different aspects: BCE and SSIM provide local supervision, calculating the gradient of each pixel based on the prediction of a limited neighborhood, whereas dice and IOU provide global supervision, reducing the difference between the prediction and the GT and enhancing the prediction of object contours. Based on this, we propose a hybrid loss that combines these four loss functions to supervise network learning.
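The basic losses above can be sketched numerically as follows (a simplified illustration, not the paper's implementation: soft counts computed from continuous predictions stand in for TP/FN/FP, and SSIM is omitted here for brevity):

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    # Pixel-wise binary cross-entropy (local supervision).
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(gt * np.log(pred) + (1 - gt) * np.log(1 - pred))

def _soft_counts(pred, gt):
    # Soft TP / FN / FP from continuous predictions in [0, 1].
    tp = np.sum(pred * gt)
    fn = np.sum((1 - pred) * gt)
    fp = np.sum(pred * (1 - gt))
    return tp, fn, fp

def dice_loss(pred, gt, eps=1e-7):
    # L_Dice = 1 - 2TP / (2TP + FN + FP), global supervision.
    tp, fn, fp = _soft_counts(pred, gt)
    return 1 - (2 * tp + eps) / (2 * tp + fn + fp + eps)

def iou_loss(pred, gt, eps=1e-7):
    # L_IOU = 1 - TP / (TP + FN + FP), global supervision.
    tp, fn, fp = _soft_counts(pred, gt)
    return 1 - (tp + eps) / (tp + fn + fp + eps)

gt = np.array([[0., 1.], [1., 0.]])
perfect = gt.copy()          # dice_loss/iou_loss vanish for a perfect map
```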

Proposed Method
In Section 3.1, we show the architecture of FRNet. In Section 3.2, we show the details of the AFFM. Next, we describe the scSE module in Section 3.3. Finally, we explain the hybrid loss in Section 3.4.

Overview of FRNet
The feature refine network (FRNet) utilizes a fully convolutional network to solve the SOD task. The entire framework is illustrated in Figure 1. The orange region is the backbone, used to extract features from the input image. The output of ResNet-5 is processed by scSE and fed into the DCPP module for multi-scale dilated convolution to coarsely predict salient objects. Details of the DCPP are shown in Figure 2a. The feature fusion strategy in skip-layer style is used in the pale blue region containing the ARM blocks, where the data streams at every scale interact with each other. By calculating the loss against the GT at each scale, the ARM gradually adjusts the output towards the GT. More information about the ARM can be found in Figure 2b.
The original ResNet model is designed for the image classification task, so the spatial resolution of the features generated by the network is mismatched to the input image resolution.
To make it suitable for the SOD task, we make the following modifications: (1) the fully connected layer of the ResNet model used for image classification is removed; (2) the stride in ResNet-4 is changed to 1, so the spatial scale of the features is preserved and more feature information is retained.
After the backbone, the outputs from each layer are processed by scSE to guide the feature maps that focus on the saliency region. These maps are fed to the ARM modules for residual learning, to adjust the prediction output gradually from a small scale to a large scale. Finally, five scale outputs after ARM processing are fed to the AFFM module for adaptive fusion. The process generates the final prediction output. The entire network is trained under the supervision of the hybrid loss.

Adaptive Feature Fusion Module
The main idea of AFFM is to adaptively learn the spatial fusion weights of feature maps at various scales. As shown in Figure 1, AFFM consists of the following two steps: scale unification and adaptive fusion.
Unifying scale. The saliency map at scale S (S ∈ {1, 2, 3, 4, 5} for ResNet-50) is denoted as x_S. For scale S, each map x_n is resized from scale n (n ≠ S) to the same shape as x_S. Thus, AFFM can unify features on the five scales: it up-samples features smaller than the target scale and down-samples features larger than it.
Adaptive fusing. x^{n→S}_{i,j} denotes the feature value at position (i, j) of the feature map resized from scale n to scale S. The fusion process is formulated as follows:

y^S_{i,j} = Σ_{n=1}^{5} w^{n→S}_{i,j} · x^{n→S}_{i,j},

where y^S_{i,j} denotes the (i, j)-th value of the output feature map y^S in one of the channels, and w^{n→S}_{i,j} refers to the spatial weights for the saliency maps at the five scales, which are adaptively calculated by the network. We use 1 × 1 convolution layers to compute the weight maps w^{n→S}, so that they can be learned by standard back-propagation. Through the AFFM module, the features at all scales are adaptively aggregated. The outputs y^1, y^2, y^3, y^4, y^5 are combined by element-wise addition to compose the final saliency prediction.
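The two AFFM steps can be sketched as follows, under stated assumptions (nearest-neighbour resizing for scale unification and a softmax normalization of the learned weights across scales; neither detail is specified in the text, and the real module uses learned 1 × 1 convolutions rather than fixed logits):

```python
import numpy as np

def resize_nearest(x, h, w):
    # Nearest-neighbour resize standing in for the up/down-sampling
    # used to unify all scales to the target resolution.
    ri = (np.arange(h) * x.shape[0] / h).astype(int)
    ci = (np.arange(w) * x.shape[1] / w).astype(int)
    return x[np.ix_(ri, ci)]

def affm_fuse(maps, target_hw, weight_logits):
    # maps: list of single-channel feature maps at five scales.
    # weight_logits: per-scale spatial logits, as a 1x1 conv would produce;
    # the softmax across scales is an assumption -- the paper only states
    # that the weights are calculated adaptively by the network.
    h, w = target_hw
    resized = np.stack([resize_nearest(m, h, w) for m in maps])        # (5, h, w)
    logits = np.stack([resize_nearest(l, h, w) for l in weight_logits])
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return (weights * resized).sum(axis=0)                             # y^S

sizes = [12, 24, 48, 96, 192]
maps = [np.full((s, s), float(k)) for k, s in enumerate(sizes, 1)]
logits = [np.zeros((s, s)) for s in sizes]   # uniform weights -> plain average
fused = affm_fuse(maps, (48, 48), logits)
```

With uniform logits the five scales are simply averaged; during training the 1 × 1 convolutions would learn spatially varying weights instead.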

Mixed Channel and Spatial Attention
Inspired by Roy et al. [36], we propose the spatial and channel squeeze and excitation blocks (scSE blocks), using the spatial block as a complement to the channel SE block. The scSE combines the two blocks directly by element-wise addition and utilizes channel attention information to strengthen the role of the spatial attention block. The scSE is then applied to the single-channel ARM output to strengthen the representation ability of the features at each iteration. The architecture of scSE is shown in Figure 3.


The channel attention is formulated as follows:

x_cSE = σ(W_2 δ(W_1 GAP(x))) ⊗ x,

where W denotes the filter kernel weights and GAP denotes global average pooling (implemented with the nn.AdaptiveAvgPool2d method). Spatial attention is formulated as follows:

x_sSE = σ(W_s ∗ x) ⊗ x,

where σ denotes the sigmoid activation, δ denotes the ReLU activation, W_s is a 1 × 1 convolution and ⊗ denotes element-wise multiplication. scSE is formulated as follows:

x_scSE = f(x_cSE, x_sSE),

where f denotes the fusion function, which we set to the element-wise addition operation.
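The channel and spatial excitation paths can be sketched numerically as follows (an illustrative NumPy version with random weights; the actual module uses learned convolutional and fully connected layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scse(x, w1, w2, w_spatial):
    # x: feature tensor of shape (C, H, W).
    # Channel squeeze-and-excitation: GAP -> FC (ReLU) -> FC (sigmoid).
    z = x.mean(axis=(1, 2))                     # GAP, shape (C,)
    c = sigmoid(w2 @ np.maximum(w1 @ z, 0))     # channel weights, shape (C,)
    x_cse = x * c[:, None, None]
    # Spatial squeeze-and-excitation: 1x1 conv (a weighted channel
    # sum at each position) followed by a sigmoid.
    s = sigmoid(np.tensordot(w_spatial, x, axes=([0], [0])))  # (H, W)
    x_sse = x * s[None, :, :]
    # Fusion f: element-wise addition, as described in the text.
    return x_cse + x_sse

C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
out = scse(x, rng.standard_normal((C, C)), rng.standard_normal((C, C)),
           rng.standard_normal(C))
```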

The Hybrid Loss
The common loss functions in the SOD task fall into two categories. Local supervision losses, such as BCE and SSIM, are unable to supervise the network to learn fine object boundaries. Global supervision losses, such as dice and IOU, can detect finer object boundaries. Naturally, we expect the model to achieve better performance when it is trained with a hybrid loss, defined as follows:

L_hybrid = α (L_BCE + L_SSIM) + β (L_Dice + L_IOU),      (11)

where α and β are proportion parameters; in this paper, we set them to 0.4 and 0.6, respectively, allowing the model to focus more on the prediction of the object boundary. These loss functions complement each other: BCE and SSIM supervise the network as local losses to prompt the learning of the approximate location of the salient object, while dice and IOU supervise the network as global losses to prompt the learning of its fine boundary. FRNet employs the hybrid loss on the predicted output branch of the ARM blocks at all scales, as well as on the final output after AFFM processing, as shown in Figure 1. On the one hand, multi-scale supervision prompts the network to learn salient features more efficiently; on the other hand, it also improves the prediction accuracy of the model. Each ARM module predicts a saliency map M^(i), where i ∈ {1, 2, 3, 4, 5}, corresponding to the prediction output after ARM processing at the five scales, and each one contributes a loss term. Equation (11) is thus further expressed as follows:

L = Σ_{i=1}^{5} L_hybrid(M^(i), GT) + L_hybrid(M_final, GT).

More experiments on the ablation of the hybrid loss can be found in Section 4.5.
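A compact sketch of the hybrid loss follows (illustrative only: the SSIM term is computed in a single window over the whole map rather than over N × N patches, and soft counts stand in for TP/FN/FP):

```python
import numpy as np

def bce(p, g, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(g * np.log(p) + (1 - g) * np.log(1 - p))

def dice(p, g, eps=1e-7):
    tp, fn, fp = (p * g).sum(), ((1 - p) * g).sum(), (p * (1 - g)).sum()
    return 1 - (2 * tp + eps) / (2 * tp + fn + fp + eps)

def iou(p, g, eps=1e-7):
    tp, fn, fp = (p * g).sum(), ((1 - p) * g).sum(), (p * (1 - g)).sum()
    return 1 - (tp + eps) / (tp + fn + fp + eps)

def ssim(p, g, c1=0.01**2, c2=0.03**2):
    # Single-window SSIM over the whole map (the paper uses N x N patches).
    mx, my = p.mean(), g.mean()
    vx, vy = p.var(), g.var()
    cov = ((p - mx) * (g - my)).mean()
    s = ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
    return 1 - s

def hybrid_loss(p, g, alpha=0.4, beta=0.6):
    # L = alpha * (BCE + SSIM) + beta * (Dice + IOU), as in the text.
    return alpha * (bce(p, g) + ssim(p, g)) + beta * (dice(p, g) + iou(p, g))

g = np.array([[0., 1.], [1., 0.]])
good, bad = np.clip(g, 0.05, 0.95), np.clip(1 - g, 0.05, 0.95)
```

A good prediction yields a much smaller hybrid loss than an inverted one, with the β-weighted global terms dominating the penalty on wrong regions.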

Experiments
To validate the FRNet, we conduct comparison experiments on five public datasets (ECSSD [37], PASCAL-S [38], DUT-OMRON [39], HKU-IS [40] and DUTS [41]). DUTS is used for training the model, and the other datasets are used for evaluation. We choose precision-recall (PR) curves, mean absolute error (MAE), S-measure (Sm) and maximum F-measure (Max-F) as the evaluation metrics. The baseline is selected as R2Net. For data augmentation, the contrast, saturation and hue of the images are randomly changed to enhance the generalization ability of the model.

Parameter Setting
The model code is implemented on the public platform PyTorch, with two Tesla V100 GPUs (16 GB memory each) used for the experiments. First, each image is resized to 384 × 384 and normalized. During training, the number of epochs is set to 40 and the batch size to 16. The momentum is set to 0.9 and the weight decay to 0.0005. We set the base learning rate to 0.0005 and multiply it by 0.1 every 10 epochs. FRNet employs the Adam optimizer and the hybrid loss to optimize the network. The "Kaiming" method initializes the convolutional layers of ResNet.
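These hyper-parameters and the step-decay schedule can be summarized as follows (a sketch of the configuration, not the released training code):

```python
# Hyper-parameters as described above; the step-decay schedule
# multiplies the base learning rate by 0.1 every 10 epochs.
config = {
    "input_size": (384, 384),
    "epochs": 40,
    "batch_size": 16,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "base_lr": 5e-4,
    "lr_decay": 0.1,
    "lr_step": 10,
}

def learning_rate(epoch, cfg=config):
    # Learning rate for a given (0-indexed) epoch under the step schedule.
    return cfg["base_lr"] * cfg["lr_decay"] ** (epoch // cfg["lr_step"])
```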

Datasets
We evaluate the saliency detection performance of FRNet on five benchmark datasets. ECSSD [37] contains 1000 natural images with different sizes and contains multiple objects. Some images are derived from the challenging Berkeley-300 dataset [43].
PASCAL-S [38] has 850 images selected from the validation set of PASCAL VOC2010 [44] used for the segmentation task.
DUT-OMRON [39] contains 5172 images that are carefully annotated by 5 testers and hand-picked from over 140,000 natural images, each containing one or more salient objects along with intricate backgrounds.
HKU-IS [40] includes 4447 images, with high-quality pixel-level labels. Images are selected carefully, containing multiple disconnected objects or the object itself overlaps the background.
DUTS [41] allocates 10,533 images for training, and 5019 images for validation. The training images are collected from ImageNet DET training set [45]. The images of validation are collected from ImageNet DET test set and the SUN dataset [46] with pixel-level annotation.

Precision-Recall (PR) Curves
First, the saliency map S is converted into a binary mask M by thresholding. Precision and recall are then calculated from M and the ground truth G as follows:

Precision = |M ∩ G| / |M|,  Recall = |M ∩ G| / |G|.

Varying the binarization threshold yields a sequence of (precision, recall) pairs that form the PR curve.

Maximum F-Measure (Max-F)
Precision and recall alone cannot fully evaluate the quality of the prediction map; high precision and high recall are both required. Therefore, the F-measure is proposed as the weighted harmonic mean of precision and recall with a non-negative weight β². The formula is as follows:

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

where β² is commonly set to 0.3 in SOD to emphasize precision. Max-F is the maximum F_β over all binarization thresholds; a larger Max-F value indicates better model performance.
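The threshold sweep behind the PR curve and Max-F can be sketched as follows (illustrative; β² = 0.3 is the conventional SOD setting, not stated here):

```python
import numpy as np

def pr_at_threshold(s, g, t):
    # Binarize the saliency map s at threshold t and compare the mask
    # with the binary ground truth g.
    m = (s >= t).astype(float)
    inter = (m * g).sum()
    prec = inter / max(m.sum(), 1e-7)
    rec = inter / max(g.sum(), 1e-7)
    return prec, rec

def max_f_measure(s, g, beta2=0.3, n_thresholds=255):
    # Sweep thresholds over [0, 1] and keep the best F-measure.
    best = 0.0
    for t in np.linspace(0, 1, n_thresholds):
        p, r = pr_at_threshold(s, g, t)
        if p + r > 0:
            f = (1 + beta2) * p * r / (beta2 * p + r + 1e-7)
            best = max(best, f)
    return best

g = np.zeros((8, 8)); g[2:6, 2:6] = 1.0
s = 0.9 * g + 0.05            # confident, slightly noisy prediction
```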

Structure-Measure (Sm)
Sm evaluates the structural details of the prediction map by combining region-aware and object-aware structural similarity, giving the final formulation of the structure-measure:

S_m = α · S_o + (1 − α) · S_r,

where S_r denotes the region-aware structural similarity measure, S_o denotes the object-aware structural similarity measure and α is set to 0.5. For more details, readers may refer to Fan et al. [47]. A larger value of Sm indicates better model performance.

Mean Absolute Error (MAE)
The above metrics do not take the prediction of non-salient pixels into account, that is, pixels correctly labeled as non-salient. For this purpose, the MAE is calculated from the saliency map S and the binary GT G, both pre-normalized to the range [0, 1]:

MAE = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} |S(i, j) − G(i, j)|,

where W and H are the width and height of the map. A smaller value of MAE indicates better model performance.
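A minimal MAE computation consistent with the formula above:

```python
import numpy as np

def mae(s, g):
    # Mean absolute error between a saliency map and the binary GT,
    # with the map min-max normalized to [0, 1] first.
    s = (s - s.min()) / max(s.max() - s.min(), 1e-7)
    return np.abs(s - g).mean()

g = np.zeros((4, 4)); g[1:3, 1:3] = 1.0
```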

Visual Comparison
As shown in Figure 4, the results of FRNet against the other methods can be observed. FRNet accurately predicts the salient objects of all images. For objects that are easy to misjudge and connected with other objects, FRNet performs better than all other methods. These results indicate that our method is more robust. In the first row, Amulet and Basnet are unable to predict the saliency objects of the test image. For the overlapping and connected saliency objects, such as row 2 and 4, most of the methods cannot predict effectively. FRNet generates the finest object boundaries compared to other models. It also proves that scSE, loss, and AFFM all provide improvement to the FRNet. More experiment information on the three components can be found in Section 4.5.

Evaluation of Saliency Map
We evaluate the prediction maps on PR Curves and F-measures. Figures 5 and 6 visualize the results of these. It can be observed that FRNet outperforms most of the models. In Figure 5, FRNet maintains a higher precision value at a higher recall value. In Figure 6, FRNet also maintains a high F-measure value at high thresholds. This indicates that FRNet generates saliency maps closer to GT, with higher confidence in the saliency target region. This allows FRNet to predict the location of salient objects and to segment it finely.


Ablation Analysis
We carry out experiments on five public datasets to analyze the validity of hybrid loss, AFFM and scSE.

The Effectiveness of the Adaptive Feature Fusion Module
To demonstrate the effectiveness of AFFM, we design a set of experiments that only add the AFFM module to the baseline and then compare the results with the baseline. The first and second rows of Table 2 show the comparison. The MAE metric is greatly improved compared with the baseline, while the other two metrics decrease to different degrees on ECSSD, DUT-OMRON and HKU-IS, and Sm improves on PASCAL-S and DUTS-TEST. The results demonstrate the validity of the AFFM module. As shown in Figure 7, compared with the other variants, the backbone with the AFFM module produces fewer misjudgments of salient pixels and finer boundaries.

As shown in Table 2, the results of the first four rows show that different components play different roles on top of the baseline. At the same time, due to the differences among the five datasets, different components have different effects on each dataset. When the three components are combined, they constitute the complete model architecture. In addition, the multi-scale feature fusion strategy and the residual learning strategy can be effectively combined to obtain better performance, as reflected in the last four rows of Table 2.



The Effectiveness of Hybrid Loss
Firstly, we conduct ablation experiments on different combinations of loss functions. As shown in Table 3, the three combinations of BCE+SSIM, Dice+IOU and BCE+Dice each improve the MAE metric of the model to a certain degree; however, the other two metrics are not effectively improved. We therefore combine the three pairs to form the hybrid loss and obtain the best model performance. Secondly, we perform experiments on the parameters α, β of the hybrid loss, as shown in Table 4. The combination of losses with different proportions of α, β has different effects on the model: FRNet does not perform well when α or β is too large or too small, as the characteristics of the loss functions and their supervisory focus are not coordinated. Comparing the results in the second and fourth rows, we need to assign an appropriately higher proportion to β; in this case, the model achieves the best performance, and this hyper-parameter setting is adopted in the following experiments. As shown in Figure 7, the hybrid loss effectively improves the detection of salient objects, highlights their boundaries, and enhances the segmentation performance of the model. Table 4. Ablation analysis of hybrid loss hyper-parameters. α denotes the ratio of (BCE+SSIM), β denotes the ratio of (Dice+IOU). The best results are in bold.

The Effectiveness of scSE
The comparison with and without the scSE module is shown in Table 2. The scSE improves most of the metrics on the five datasets, which demonstrates its effectiveness in FRNet. As shown in Figure 7, the scSE module facilitates the model to predict salient regions and suppresses the attention of the model on non-salient regions. Comparing the results of rows 2, 3, 5 and 7 in Table 2, the scSE module combined with other modules can further improve the detection performance of the model.

Conclusions and Discussion
A novel FRNet for SOD is proposed in this paper. To construct the final prediction, FRNet gradually optimizes the prediction results from low to high scales and aggregates information from the optimized features at all scales, until the results match the GT as closely as possible. We propose a hybrid loss supervision method to obtain object boundary information and solve the problem of coarse object boundaries; the ablation experiments demonstrate the effectiveness of the hybrid loss. The extensive experimental results indicate the efficiency of the proposed strategy and of FRNet. However, there are still possibilities to improve our network architecture, and we will continue to investigate the potential of various network architectures in the field of SOD, such as transformers.

Data Availability Statement: Publicly available datasets were analyzed in this study. The training dataset DUTS can be obtained at: http://saliencydetection.net/duts/ (accessed on 15 January 2022). The testing datasets DUT-OMRON, ECSSD, PASCAL-S and HKU-IS are available at: http://saliencydetection.net/dut-omron/, https://www.cse.cuhk.edu.hk/leojia/projects/hsaliency/dataset.html, https://cbs.ic.gatech.edu/salobj/ and https://sites.google.com/site/ligb86/mdfsaliency/ (the above URLs were accessed on 15 March 2022).