Salient Object Detection Combining a Self-attention Module and a Feature Pyramid Network

Salient object detection has improved greatly with the use of Fully Convolutional Networks (FCNs). However, the FCN-based U-shape architecture can dilute high-level semantic information during the up-sampling operations in the top-down pathway, which weakens salient object localization and produces degraded boundaries. To overcome this limitation, we propose a novel pyramid self-attention module (PSAM) and adopt an independent feature-complementing strategy. In PSAM, self-attention layers are placed after multi-scale pyramid features to capture richer high-level features and bring larger receptive fields to the model. In addition, a channel-wise attention module is employed to reduce the redundant features of the FPN and provide refined results. Experimental analysis shows that the proposed PSAM contributes effectively to the whole model, which outperforms state-of-the-art results on five challenging datasets. Finally, qualitative results show that PSAM generates clear and complete saliency maps, which can benefit other computer vision tasks such as object detection and semantic segmentation.


Introduction
Salient object detection, or segmentation, aims to identify the visually distinctive parts of a natural scene. Because it provides high-level information, saliency detection is widely applied in computer vision applications such as object detection [24,39,17] and tracking [8], visual robotic manipulation [38,25], image segmentation [33,32] and video summarization [19,26]. In early studies, salient object detection was formulated as a binary segmentation problem. However, how salient object detection connected to other computer vision tasks remained unclear. Nowadays, convolutional neural networks (CNNs) attract more attention in the research community. Compared with classic hand-crafted feature descriptors [18,4], CNNs have stronger feature representation ability. Specifically, CNN kernels with small receptive fields provide local information, while kernels with large receptive fields provide global information. This characteristic enables CNN-based approaches to detect salient areas with refined boundaries [2]. Thus, CNN-based approaches have become the main research direction in salient object detection.
Recently, Fully Convolutional Networks (FCNs) have become the fundamental framework in salient object detection [15,31,22], as FCNs accept inputs of arbitrary size and retain richer spatial information than networks with fully connected layers. Although these works have greatly improved performance, they still have limitations. FCN-based approaches use multiple convolution and pooling layers to produce high-level semantic features, which help to locate objects, but information may be lost during the pooling operations. This can lead to degraded boundaries for the detected objects. Besides, when the high-level features are upsampled to generate a score prediction for each pixel, they are also diluted, which can reduce the ability to localize objects.
In this paper, we propose a novel pyramid self-attention module (PSAM) to overcome the feature dilution of previous FCN-based approaches. Figure 1(c) shows the inherent problems of Feature Pyramid Networks (FPNs). By incorporating a self-attention module with the multi-scale feature maps of the FPN, the model focuses on the high-level features. This leads to the extraction of features with richer high-level semantic information and larger receptive fields. In addition, a channel-wise attention module is employed to reduce the redundancy in the FPN, which refines the final results. Experimental results show that PSAM improves the performance of salient object detection and achieves state-of-the-art results on five challenging datasets. The contributions of this work can be summarized as follows: 1) We propose a novel pyramid self-attention structure which makes the model focus more on the high-level features and reduces feature dilution in the top-down pathway. 2) We adopt channel-wise attention to reduce the redundant information in the lateral connections of the FPN and refine the final results.

Related Works
Salient Object Detection. Owing to the outstanding feature representation ability of CNNs, methods based on hand-crafted features have been replaced by CNN models. In the work of [13], Li and Yu used fully connected layers on top of the CNN layers to extract features of a single region at different scales. The multi-scale features were then used to predict a score for each region. [43] utilized two independent CNNs to extract the global context of the full image and the local context of the detailed area, and trained the model jointly. However, spatial information was lost in these CNN-based methods because of the fully connected layers.
Recently, FCN-based methods have attracted more attention in salient object detection. [22] proposed a boundary-aware salient object detection network which incorporates a predict module and a residual refinement module (RRM). The predict module estimates the saliency map from the raw images, and the RRM, which is trained on the residual between the saliency map and the ground truth, refines the results of the predict module. [15] introduced the PoolNet structure, which has two pooling-based modules: a global guidance module (GGM) and a feature aggregation module (FAM). The GGM was designed to acquire more high-level information, which tackles the feature dilution problem in the U-shape network structure. The FAM then merges the multi-scale features of the FPN, which reduces the aliasing caused by up-sampling and enlarges the receptive fields. In the experiments, PoolNet localizes salient objects more precisely and with sharper details than other baseline approaches. In the work of [35], a Cascaded Partial Decoder (CPD) structure that contains two main branches was proposed. The first branch improves computation speed by dropping the features of the shallow layers. The second branch uses the saliency map from the first branch to refine the features of the deeper layers, which ensures both the speed and the accuracy of the framework.
Attention Mechanism. The attention mechanism is mainly used in the area of Natural Language Processing (NLP). [27] introduced a framework called the Transformer, which replaces recurrent layers with an attention mechanism that captures global dependencies between input and output. This framework also allows parallel computing, which makes it faster than recurrent networks. Beyond sequence models, this kind of attention mechanism is also needed in CNN models. Unlike the attention mechanism in sequence models, self-attention was introduced to apply attention within a single context. [1] proposed the Attention Augmented Convolutional Network, which produces attentional feature maps via a self-attention module and combines them with CNN feature maps to capture the spatial dependencies of the input; it achieves a large improvement in object classification and detection. The stand-alone self-attention layer was introduced in the work of [23]. It can be used to build a fully attentional model by replacing all spatial convolution layers with self-attention layers. This layer builds on components from previous works and shows that it can serve as a stand-alone layer which easily replaces the spatial convolution layer.
In this section, we describe the proposed architecture, which integrates two attention modules. More specifically, we use a pyramid self-attention module which aims to enhance the high-level semantic features and transmit the enhanced semantic information to different feature levels. In addition, when feature maps are merged in the top-down pathway, a simple channel-wise attention module [10,3,44] is added to each lateral connection to focus on the high responses of salient objects. The proposed architecture is based on a classic feature pyramid network (FPN) [14] which uses ResNet [7] as a backbone.
This basic FPN architecture has been widely used in many computer vision tasks, especially detection tasks, where its robust structure leads to accurate results. As shown in Figure 2, we retain the basic structure and introduce two modules to achieve state-of-the-art performance. A pyramid self-attention module, built between the bottom-up and top-down pathways, helps the model to focus on the high-level features which contain semantic information. This module then transfers the processed high-level information to each feature level in the top-down path. Meanwhile, feature maps from different stages of ResNet pass through a channel-wise attention module to further emphasize the context information.
Figure 3: The structure of Pyramid Self-attention Module.

Pyramid Self-Attention Module
In this subsection, we describe the proposed module in detail and explain how it differs from previous works. [9] has demonstrated that high-level semantic features are more representative and discriminative, so that the position of salient objects can be located more accurately. Figure 1(c) shows that, without any extra attention modules, the FPN baseline generates rough saliency maps in which the salient objects are incomplete. Meanwhile, some non-salient objects which should not be detected also appear in the saliency maps. These erroneous predictions are caused by two main problems which cannot be avoided in the FPN architecture. The first problem is that the high-level information is diluted progressively as it is integrated into the different feature levels of the top-down pathway. The second problem is that the architecture can be affected by non-essential information, which may reduce the final performance of the model. In other words, the FPN architecture detects incomplete salient objects and also detects unnecessary objects.
To overcome these two intrinsic problems of the baseline, we propose a novel pyramid self-attention module (PSAM) which contains stand-alone self-attention layers [23] at different scales, further focusing on important regions and enlarging the receptive field of the model. Specifically, as shown in Figure 3, PSAM first transforms the feature map produced by the bottom-up pathway into multi-scale feature regions, and then each self-attention layer learns to pay more attention to important semantic information. After being processed by the self-attention layers, these multi-scale representations, which contain effective semantic information, are concatenated to complement the high-level semantic information in the top-down pathway. More technically, let $X_{in}$ denote the feature map produced by the top-most layer. We downsample the feature map $X_{in} \in \mathbb{R}^{H \times W \times C}$ into three different scales denoted as $\{X_1, X_2, X_3\}$. Given a pixel $x_{ij}$ of a feature map $X_s \in \{X_1, X_2, X_3\}$, a corresponding local memory block $r_k$ is extracted from the same feature map $X_s$. This $r_k$ is a $k \times k$ region which surrounds $x_{ij}$. There are three crucial learnable components in this self-attention algorithm: queries, keys and values. We use $W_Q$, $W_K$ and $W_V$ to represent their learnable weights respectively. The final attention output pixel is computed as follows:
$$ s_{ij} = \sum_{(a,b) \in r_k} \mathrm{softmax}_{ab}\left(q_{ij}^{\top} k_{ab}\right) v_{ab} $$
where $q_{ij} = W_Q x_{ij}$, $k_{ab} = W_K x_{ab}$ and $v_{ab} = W_V x_{ab}$ denote the query, key and value, $s_{ij}$ denotes the output pixel of a self-attention layer, and $a \in \{i - \frac{k-1}{2}, \ldots, i + \frac{k-1}{2}\}$, $b \in \{j - \frac{k-1}{2}, \ldots, j + \frac{k-1}{2}\}$ define the coordinates of $r_k$. We use $\{Y_1, Y_2, Y_3\}$ to denote the feature maps obtained by applying an upsampling operation after each self-attention layer. We then concatenate them with the original $X_{in}$ to generate the final output of the PSAM.
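The local self-attention step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the clipping of the neighbourhood at image borders, and the omission of extras that a full stand-alone self-attention layer may use (such as positional information or multiple heads) are all assumptions made for brevity.

```python
import math


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_self_attention(X, W_Q, W_K, W_V, k=3):
    """Apply single-head local self-attention to an H x W grid of feature vectors.

    Each output pixel s_ij is a softmax-weighted sum of the values inside the
    k x k block r_k centred on (i, j). The block is clipped at the borders
    (an assumption; zero padding would be another option).
    """
    H, W = len(X), len(X[0])
    half = (k - 1) // 2
    out = []
    for i in range(H):
        out_row = []
        for j in range(W):
            q = matvec(W_Q, X[i][j])  # query for the centre pixel
            keys, values = [], []
            for a in range(max(0, i - half), min(H, i + half + 1)):
                for b in range(max(0, j - half), min(W, j + half + 1)):
                    keys.append(matvec(W_K, X[a][b]))
                    values.append(matvec(W_V, X[a][b]))
            # Attention weights over the neighbourhood: softmax of q . k
            weights = softmax([sum(qc * kc for qc, kc in zip(q, key)) for key in keys])
            dim = len(values[0])
            out_row.append([sum(w * v[c] for w, v in zip(weights, values))
                            for c in range(dim)])
        out.append(out_row)
    return out
```

With `k=1` the block contains only the centre pixel, so the output reduces to the value projection `W_V x_ij`; larger `k` lets each pixel aggregate information from a wider neighbourhood.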
Inspired by the previous work [15], we adopt a similar complementing strategy to avoid the dilution of high-level semantic information. Compared with that work, however, the proposed pyramid self-attention module achieves state-of-the-art performance. The module is built at the end of the bottom-up pathway and converts the high-level semantic features into different scales, further enlarging the receptive field of the model. Based on the multi-scale high-level feature maps, the attention layers view these semantic features at different scales and can therefore attend to them comprehensively. Further performance details are given in the ablation study.
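The overall PSAM data flow (downsample the top-most feature map to several scales, apply attention at each scale, upsample, and concatenate with the input) can be sketched as follows. The text does not state the pooling type, the scale factors, or the upsampling method, so average pooling, the hypothetical factors `(1, 2, 4)` and nearest-neighbour upsampling are assumptions, as are all function names. The attention layer is passed in as a callable so that the sketch stays self-contained.

```python
def avg_pool(X, f):
    """Downsample an H x W x C grid by averaging non-overlapping f x f windows.

    Assumes H and W are divisible by f.
    """
    H, W, C = len(X), len(X[0]), len(X[0][0])
    return [[[sum(X[i * f + di][j * f + dj][c]
                  for di in range(f) for dj in range(f)) / (f * f)
              for c in range(C)]
             for j in range(W // f)]
            for i in range(H // f)]


def upsample_nearest(X, f):
    """Upsample by repeating every pixel f times along both spatial axes."""
    H, W = len(X), len(X[0])
    return [[list(X[i // f][j // f]) for j in range(W * f)]
            for i in range(H * f)]


def psam(X_in, attention, factors=(1, 2, 4)):
    """Sketch of the pyramid flow: pool -> attention -> upsample -> concatenate.

    `attention` maps an h x w x C grid to a grid with the same spatial size.
    The result concatenates X_in with one upsampled branch per scale along
    the channel axis.
    """
    branches = [X_in]
    for f in factors:
        X_s = avg_pool(X_in, f)                     # one of {X_1, X_2, X_3}
        Y_s = upsample_nearest(attention(X_s), f)   # one of {Y_1, Y_2, Y_3}
        branches.append(Y_s)
    H, W = len(X_in), len(X_in[0])
    return [[[v for branch in branches for v in branch[i][j]]
             for j in range(W)]
            for i in range(H)]
```

Passing an identity function as `attention` makes it easy to check shapes: a 4 x 4 input with 2 channels yields a 4 x 4 output with 8 channels under the assumed three scales.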

Channel-Wise Attention
To enhance the context and structural information, lateral connections are used in the top-down pathway, which has led to state-of-the-art performance in detection tasks. However, this operation also introduces some uninformative features, which can reduce performance and affect the final prediction. In Figure 1(c), the two problems caused by feature redundancy are obvious. The first problem is that there are extra regions in the saliency map which should not be detected. The second problem is that the edges of salient objects are ambiguous. Both problems indicate that further refinement should be applied. [10] has pointed out that different channels carry different semantic features and that channel-wise attention can capture channel-wise dependencies. In other words, channel-wise attention can emphasize the salient objects and alleviate the inaccuracy caused by redundant features in the channels. Therefore, we add this simple channel-wise attention [10,3,44] to each lateral connection for refinement. The structure of channel-wise attention is shown in Figure 4. It consists of one pooling layer and two fully-connected layers, which are followed by a ReLU [21] and a sigmoid function respectively. First, global spatial information is squeezed into each channel. This step can be implemented by average pooling:
$$ z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j) $$
where $c$ refers to the channel index, $H \times W$ refers to the spatial dimensions of $X_c$, and $X_c(i, j)$ is the element of $X_c$ at position $(i, j)$. After the pooling operation, the generated channel descriptor $z$ is fed into the fully-connected layers to capture channel-wise dependencies:
$$ s = \sigma\left(fc_2\left(\delta\left(fc_1(z)\right)\right)\right) $$
where $\sigma$ refers to the sigmoid function, $\delta$ refers to the ReLU function and $fc_1$, $fc_2$ denote the fully-connected layers. Finally, the generated scalar $s_c$ multiplies the feature map $X_c$ to generate a weighted feature map $\tilde{X}_c$:
$$ \tilde{X}_c = s_c \cdot X_c $$
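The squeeze, excitation and re-weighting steps of the channel-wise attention can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, biases are omitted, and the size of the hidden layer (the number of rows of `W1`) is left to the caller because the text does not specify it.

```python
import math


def squeeze(X):
    """Global average pooling: one scalar per channel of a C x H x W input."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in X]


def channel_attention(X, W1, W2):
    """Re-weight each channel of X by a learned scalar in (0, 1).

    X  : list of C channels, each an H x W grid (list of rows).
    W1 : weights of the first fully-connected layer, shape hidden x C.
    W2 : weights of the second fully-connected layer, shape C x hidden.
    Biases are omitted for brevity (an assumption).
    """
    z = squeeze(X)
    # First fully-connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in W1]
    # Second fully-connected layer followed by a sigmoid, giving s_c per channel.
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
         for row in W2]
    # Scale every element of channel c by s_c.
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, X)]
```

Because the sigmoid keeps every weight between 0 and 1, the module can only attenuate channels relative to one another, which matches its stated role of suppressing redundant features in the lateral connections.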

Datasets and Evaluation Metrics
To evaluate the proposed methodology, we carry out a series of experiments on five popular saliency detection benchmarks: ECSSD [36], DUT-OMRON [37], DUTS-TE [29], HKU-IS [13] and SOD [20]. These five datasets contain a variety of objects and structures which remain challenging for salient object detection algorithms to locate and detect precisely. For training, we use the large-scale dataset DUTS [29], which contains 10553 training images and 5019 testing images. To evaluate the performance of the model, we use three representative evaluation metrics: precision-recall curves, F-measure score and mean absolute error (MAE). The F-measure indicates the overall performance and is computed from precision and recall:
$$ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} $$
where $\beta^2$ is set to 0.3 by default, and precision and recall are obtained by comparing the prediction with the ground truth under different thresholds. The MAE indicates the deviation between the predicted saliency map and the ground truth. In other words, this metric quantifies the average per-pixel difference between the prediction map and the ground truth mask:
$$ \mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - G(x, y) \right| $$
where $W$ denotes the width and $H$ denotes the height of the prediction, $P$ denotes the prediction map which is the output of the model and $G$ represents the ground truth.
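The two scalar metrics can be computed as in the following sketch. It assumes prediction values in [0, 1] and a binary ground truth, and it evaluates the F-measure at a single hypothetical threshold of 0.5 for simplicity; sweeping the threshold, as the text describes, would mean calling `f_measure` once per threshold. The function names are illustrative.

```python
def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of a saliency map binarized at `threshold`.

    pred : H x W grid of values in [0, 1].
    gt   : H x W grid of 0/1 ground-truth labels.
    """
    tp = fp = fn = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            salient = p >= threshold
            if salient and g:
                tp += 1
            elif salient and not g:
                fp += 1
            elif not salient and g:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0


def mae(pred, gt):
    """Mean absolute error between a prediction map and the ground truth."""
    total = sum(abs(p - g)
                for p_row, g_row in zip(pred, gt)
                for p, g in zip(p_row, g_row))
    return total / (len(pred) * len(pred[0]))
```

For a 2 x 2 prediction `[[0.9, 0.8], [0.1, 0.6]]` against the ground truth `[[1, 1], [0, 0]]`, precision is 2/3 and recall is 1 at the assumed threshold, and the per-pixel errors are 0.1, 0.2, 0.1 and 0.6.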

Implementation Details
Our model is implemented in PyTorch. We use ResNet-50 [7], pretrained on ImageNet [12], as the backbone. The proposed architecture is trained on a GTX TITAN X GPU for 24 epochs. As suggested in [15], the initial learning rate is set to 5e-5 for the first 15 epochs and then reduced to 5e-6 for the last 9 epochs. We use the Adam [11] optimizer with a weight decay of 0.0005 and the binary cross-entropy loss function. Finally, to increase the robustness of the model, we apply data augmentation in the form of random horizontal flipping.
Figure 6: Overall comparison of qualitative visual results between our method and selected baseline methods. It shows that our method is able to provide more complete saliency maps and smoother boundaries.
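The two-stage learning-rate schedule from the implementation details can be written as a small helper. The function name and the zero-based epoch index are assumptions; the values are those stated in the text.

```python
def learning_rate(epoch):
    """Learning rate for a zero-based epoch index over a 24-epoch run.

    Epochs 0-14 (the first 15) use 5e-5; epochs 15-23 (the last 9) use 5e-6.
    """
    if not 0 <= epoch < 24:
        raise ValueError("epoch must be in [0, 24)")
    return 5e-5 if epoch < 15 else 5e-6
```

In a PyTorch training loop the same effect is usually obtained with a step-based scheduler, but the explicit function makes the epoch boundary easy to verify.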

Comparisons with State-of-the-arts
We evaluate the proposed method on five datasets and compare it with 11 previous state-of-the-art methods: LEGS [28], UCF [42], DSS [9], Amulet [41], R3Net [5], DGRL [30], PiCANet [16], BMPM [40], MLMSNet [34], AFNet [6] and PAGE-Net [31]. For fair comparison, we use the results generated by the original works with default parameters and released by the authors. Moreover, all results are evaluated with the same evaluation method and without any other processing tools. Figure 5 and Table 1 show the evaluation results of the proposed framework in comparison with the eleven state-of-the-art methods on five challenging salient object datasets. More specifically, in Figure 5, the PR curve of the proposed methodology (red line) lies above those of the state-of-the-art methods, which indicates that our method is more robust than previous methods. Furthermore, the quantitative results are listed in Table 1. The proposed method achieves higher F-measure scores and lower error scores than the other methods, demonstrating that our model outperforms almost all previous state-of-the-art models on the different testing datasets. Figure 6 presents visual comparisons to further show the advantages of the method. Compared with other approaches, the detection results of our method are the best in the different challenging scenarios and remain close to the ground truth even in fine details.

Ablation Study
In this subsection, we conduct a series of experiments on five different datasets to investigate the effectiveness of the two modules. The ablation models are trained on the DUTS [29] training set in the same environment. From Table 1, the model which contains both PSAM and the channel-wise attention module achieves the best performance, demonstrating that the proposed modules effectively improve the salient object detection performance of the baseline. More specifically, we first conduct the baseline experiments on the FPN baseline with ResNet-50 as backbone. This basic model generates a rough saliency map, which is shown in Figure 1(c). We then add the pyramid self-attention module (PSAM) to the baseline, and the F-measure scores increase significantly on all benchmark datasets, especially on DUTS-TE [29] and SOD [20]. On this basis, we add channel-wise attention to the model to compose the proposed framework. The final best results show that the channel-wise attention modules further increase the performance and reduce erroneous predictions. To this end, Figure 1

Conclusion
In this paper, we propose a novel end-to-end salient object detection method. Considering the intrinsic problems of the FPN architecture, a pyramid self-attention module (PSAM) is designed. This module contains self-attention layers at multiple scales, which capture multi-scale high-level features, make the model focus on the high-level semantic information and further enlarge the receptive field. Furthermore, we employ channel-wise attention in the lateral connections to reduce feature redundancy and refine the prediction results. Experimental results on five challenging datasets demonstrate that the proposed model surpasses 11 state-of-the-art methods, and the ablation experiments demonstrate the effectiveness of the two modules.