Multiscale Balanced-Attention Interactive Network for Salient Object Detection

The purpose of saliency detection is to detect significant regions in an image. Great progress on salient object detection has been made using deep-learning frameworks, but how to effectively extract and integrate multiscale information from different depths remains an open problem. In this paper, we propose a processing mechanism based on a balanced-attention module and an interactive residual module. The mechanism addresses the acquisition of multiscale features by capturing both shallow and deep context information. For effective information fusion, a modified bidirectional propagation strategy was adopted. Finally, we used the fused multiscale information to predict saliency features, which were combined to generate the final saliency maps. Experimental results on five benchmark datasets show that the method is on a par with the state of the art, especially on the PASCAL-S dataset, where the MAE reaches 0.092, and on the DUT-OMRON dataset, where the F-measure reaches 0.763.


Introduction and Background
Salient object detection (SOD) aims to localize the most visually obvious regions in an image. SOD has been used in many computer-vision tasks, such as image retrieval [1,2], visual tracking [3], scene segmentation [4], object recognition [5], image contrast enhancement [6], assisted medicine [7-9], etc. Meanwhile, specific scenarios of salient object detection in mobile communications applications, such as image background filtering and background atomization in mobile applications, all rely on highly accurate foreground target extraction, as shown in Figure 1. Scholars have proposed many models [10-12], but the accurate extraction of salient objects in complex and changeable scenes is still an unsolved problem. Traditional salient detection methods [13-15] used bottom-up computational models and low-level hand-crafted features, such as image contrast [15,16], to predict saliency. More recently, attention mechanisms have been applied to SOD; however, self-attention computes attention within a single sample and ignores the potential correlations between different samples. Guo et al. [35] proposed an external attention mechanism based on two external, small, learnable, shared memories, which uses two cascaded linear layers and two normalization layers to compute the feature maps: the similarity between the query vector and the external learnable key memory is calculated, and the result is multiplied by another external learnable value memory to refine the feature map. External attention has linear complexity and implicitly considers the correlations between all samples. Combining the advantages of the self-attention mechanism and the beyond-attention mechanism, this paper proposes a balanced-attention mechanism (BAM). Firstly, the BAM inherits the self-attention mode and calculates the affinity between features at each position within a single sample.
Then, with the help of the external attention mechanism, memories are shared across the whole dataset through linear and normalization operations, which exploits the potential correlation between different samples and reduces the computational complexity.
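The external-attention computation described above can be sketched as follows. This is a minimal illustration in PyTorch, assuming a feature dimension of 64 and a memory size of 32 (both hypothetical choices, not the configuration of reference [35]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """Minimal sketch of external attention (after Guo et al. [35]).
    M_k and M_v are small learnable memories shared across the whole
    dataset; sizes here are illustrative assumptions."""
    def __init__(self, d_model=64, n_mem=32):
        super().__init__()
        self.mk = nn.Linear(d_model, n_mem, bias=False)  # key memory
        self.mv = nn.Linear(n_mem, d_model, bias=False)  # value memory
    def forward(self, x):                  # x: (B, N, d_model)
        attn = self.mk(x)                  # (B, N, n_mem): cost linear in N
        attn = F.softmax(attn, dim=1)      # double normalization: softmax over N...
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # ...then l1 over slots
        return self.mv(attn)               # (B, N, d_model)
```

Because the attention map is N × n_mem rather than N × N, the cost grows linearly with the number of pixels, which is the complexity advantage noted above.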
Pang et al. [12] proposed a self-interaction module (SIM) to extract features from the shallow and middle layers so that multiscale information can be adaptively extracted from the data. The multiscale information was used to deal with the scale changes of salient objects, which effectively meets the multiscale requirements of SOD tasks. Inspired by reference [12], the interactive residual model (IRM) was designed in this paper. The IRM learns the multiscale features of a single convolution block at two different resolutions, interactively. To extract richer features and reduce redundant information, the IRM removes the original feedback mechanism of the SIM, optimizes the internal sampling structure, and adds a dropout block to prevent overfitting. For the deep semantic information, we used the BAM to extract features in both space and channels. The spatial attention module computes the affinity between any two feature parameters at spatial positions. For the channel information, balanced attention was also used to calculate the affinities between any two channel maps. Finally, the two attention modules were fused together to further enhance the feature representation. The complete flowchart is shown in Figure 2. A multiscale balanced-attention interactive network (MBINet) is proposed for salient object detection. Our contributions are summarized in three items:
• An interactive residual model (IRM) was designed to capture the semantic information of multiscale features. The IRM can extract multiscale information adaptively from the samples and can deal with scale changes better.

• We proposed a balanced-attention model (BAM), which not only captures the dependence between different features of a single sample, but also considers the potential correlation between different samples, improving the generalization ability of the attention mechanism.
• To effectively fuse the outputs of the IRM and BAM cascade structure, an improved bidirectional propagation strategy was adopted, which can fully capture contextual information at different scales, thereby improving detection performance.
The rest of this paper is organized as follows. Section 2 discusses the improved MBINet algorithm. Section 3 shows the simulation experimental results, and Section 4 concludes the paper.

Proposed Method
In this section, we first introduce the overall framework of the proposed MBINet; Figure 3 shows the network architecture. Next, we introduce the principle of the interactive residual model and its formula derivation in detail in Section 2.2. In Section 2.3, we derive the balanced-attention model step by step and describe its implementation details. Finally, we describe how the BPM effectively merges all the features together to further achieve multiscale feature fusion.

Network Architecture
In our model, the VGG-16 [36] network is used as the pre-trained backbone. As in other SOD methods, we removed all fully connected layers and the last pooling layer, and marked the side outputs at different scales as {Conv1, Conv2, Conv3, Conv4, Conv5}. Since the receptive field of Conv1 is too small and it carries too much noise, we only use the side outputs of Conv2–Conv5 for feature extraction. First, we used dilated convolution (DC) with dilation rates of {1, 2, 4} to extract features from the side outputs, denoted C = {C1, C2, C3, C4}, and sent the results to the interactive residual models (IRMs). The characteristic of DC is to expand the receptive field with a fixed-size convolution kernel; the larger the receptive field, the richer the semantic information captured. The purpose of introducing DC is to extract features of different scales from the same feature map, whereas the IRMs use fixed-size convolution kernels to extract feature maps at different resolutions. Through the complementary learning of DC and IRMs, more effective multiscale features can be obtained. In addition, we added a residual structure (RES) after Conv5, as shown in Figure 4, in order to reduce the channel number of the deep features, and passed the output features to a BAM operating in both the spatial and channel directions. The deep information output by the BAM was fused together and, with the output of the IRMs, used as input to the bidirectional propagation network [4]. A prediction map is generated at each fusion node. Next, we introduce these models in detail.
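The multi-rate DC extraction applied to each side output can be sketched as follows; the channel sizes and the 1×1 fusion layer are assumptions for illustration, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class DilatedBranch(nn.Module):
    """Sketch of the dilated-convolution (DC) extractor with rates {1, 2, 4}.
    Padding equals the dilation rate so all branches keep the spatial size."""
    def __init__(self, in_ch=128, out_ch=64, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # merge the rates
    def forward(self, x):
        feats = [b(x) for b in self.branches]   # one scale per dilation rate
        return self.fuse(torch.cat(feats, dim=1))
```

With a 3×3 kernel, rate r gives an effective receptive field of 2r + 1, so the three branches see the same feature map at three scales before fusion.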

Interactive Residual Model
The side outputs at different depths contain different feature information. We use DC to extract features from the side outputs of the encoder, and the output is denoted F_d. The function of DC is to expand the receptive field, which improves the representation of small features and ensures the integrity of feature extraction. However, while improving the integrity of the features, the noise is also preserved. To further refine the features and complement DC, we introduce an interactive residual model (IRM), as shown in Figure 5. The IRM divides the input features into two streams by pooling, denoted f_l1 and f_h1, where f_h1 = F_d and i ∈ {1, 2, 3, 4} indexes the side-output depth. The two resolutions are processed in parallel and the two outputs are fused at their respective sizes, denoted f_l2 and f_h2; here Up(·) and Down(·) denote the de-convolution and pooling operations, and DownC(·) denotes channel reduction. The channel number of f_l2 is 1/4 of the input channel number, while the channel number of f_h2 equals the input channel number. To keep the output consistent with the input size, we upsampled f_l2, down-channeled f_h2, and merged the results of the two branches, denoted F_I. To ease optimization, every change of feature size was followed by normalization and nonlinear processing; in addition, the input F_d itself, after channel reduction, was passed directly to the IRM output as a residual connection.
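The two-resolution interaction described above can be sketched as follows. Since the exact layer settings are not reproduced here, the pooling factor, channel splits, and cross-branch convolutions below are assumptions consistent with the description (low branch at half resolution and 1/4 channels, residual path for F_d, dropout at the output):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IRM(nn.Module):
    """Sketch of the interactive residual model: a pooled low-resolution
    branch and a full-resolution branch exchange information, and the
    input F_d re-enters through a channel-matched residual path."""
    def __init__(self, ch=64):
        super().__init__()
        low = ch // 4
        self.to_low = nn.Conv2d(ch, low, 3, padding=1)   # DownC on pooled branch
        self.conv_l = nn.Conv2d(low, low, 3, padding=1)
        self.conv_h = nn.Conv2d(ch, ch, 3, padding=1)
        self.h2l = nn.Conv2d(ch, low, 1)                 # high -> low channel match
        self.l2h = nn.Conv2d(low, ch, 1)                 # low -> high channel match
        self.out = nn.Conv2d(ch + low, ch, 3, padding=1) # merge the branches
        self.res = nn.Conv2d(ch, ch, 1)                  # residual path for F_d
        self.drop = nn.Dropout2d(0.1)                    # dropout against overfitting
    def forward(self, fd):
        fh1 = fd
        fl1 = self.to_low(F.avg_pool2d(fd, 2))           # f_l1: half size, 1/4 channels
        # interactive exchange between the two resolutions
        fl2 = self.conv_l(fl1) + F.avg_pool2d(self.h2l(fh1), 2)
        fh2 = self.conv_h(fh1) + F.interpolate(
            self.l2h(fl1), size=fh1.shape[2:], mode='bilinear', align_corners=False)
        # upsample f_l2, merge with f_h2, add the channel-reduced residual
        fl2 = F.interpolate(fl2, size=fh2.shape[2:], mode='bilinear',
                            align_corners=False)
        fi = self.out(torch.cat([fh2, fl2], dim=1))
        return self.drop(F.relu(fi + self.res(fd)))
```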

Balanced-Attention Model
The attention mechanism is widely used in computer vision, and it is also popular in salient detection. Several attention models [35,37-40] have been proposed in recent years. Self-attention computes the similarity between local features in order to obtain long-range dependence. Specifically, the input features are linearly projected into a query matrix Q ∈ R^{N×d}, a key matrix K ∈ R^{N×d}, and a value matrix V ∈ R^{N×d} [33]. The self-attention model can be formulated as F_O = softmax(QK^T)V, where N is the number of pixels, the entry a_{i,j} of softmax(QK^T) is the affinity between pixels i and j, d is the feature dimension, and F_O is the output. Obviously, the high (quadratic) computational complexity of self-attention is a disadvantage. Based on this, the beyond-attention model uses two storage units, M_k and M_v, to substitute for the three self-attention matrices Q, K, and V, calculating the correlation between feature pixels and the external storage unit M, and using double normalization to normalize the features. Since M is a learnable parameter influenced by the whole dataset, it acts as a medium for the association of all features and is a memory of the whole training dataset. Double normalization is essentially a nesting of softmax, at extra computational cost. Although it can further refine features and reduce the impact of noise, the effect is not obvious for deep features that have already been processed multiple times. To this end, combining the advantages of the two attention models above, we proposed the balanced-attention model (BAM). The details of the BAM are shown in Figure 6.
First of all, we extracted the deep features at the spatial positions: the input feature F ∈ R^{C×H×W} was projected into a query matrix Q ∈ R^{N×C/8} and a key matrix K ∈ R^{N×C/8}, where N = H × W is the number of pixels; then Q was multiplied by the transpose of K, and the softmax function was applied to the resulting matrix. The result is the spatial correlation between different pixels. Then a linear layer, shared across the whole dataset, was used as the memory M_v, and the input was fused with the above result to obtain the final output F_P ∈ R^{C×H×W}, the output feature in the spatial direction. While paying attention to the affinity between features at different locations, we also noticed that the interdependence between channel mappings affects the feature representation of specific semantics. Therefore, we also used the attention mechanism in the channel direction to improve the interdependence between channels. First, the input feature F was reshaped into F̃ ∈ R^{C×N}; then F̃ was multiplied by its own transpose, and the resulting square matrix was passed through the softmax function to calculate the pixel correlation in the channel direction. Then a linear layer was used as the memory to transform the result back to R^{C×H×W}; the final result is denoted F_C.
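The two attention directions described above can be sketched as follows. The projection layers, the residual additions, and the final fusion by simple addition are assumptions based on the description, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BAM(nn.Module):
    """Sketch of the balanced-attention model: self-attention-style affinity
    (Q, K with C/8 channels) followed by a shared linear memory in place of
    the value matrix, applied in the spatial and channel directions."""
    def __init__(self, ch=64):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.mv_s = nn.Linear(ch, ch, bias=False)  # spatial value memory
        self.mv_c = nn.Linear(ch, ch, bias=False)  # channel value memory
    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w
        # spatial direction: affinity between pixel positions (N x N)
        q = self.q(x).view(b, -1, n).transpose(1, 2)     # (B, N, C/8)
        k = self.k(x).view(b, -1, n)                     # (B, C/8, N)
        attn = F.softmax(torch.bmm(q, k), dim=-1)        # (B, N, N)
        v = x.view(b, c, n).transpose(1, 2)              # (B, N, C)
        fp = self.mv_s(torch.bmm(attn, v))               # linear memory as value
        fp = fp.transpose(1, 2).view(b, c, h, w) + x     # F_P
        # channel direction: affinity between channel maps (C x C)
        f = x.view(b, c, n)
        cattn = F.softmax(torch.bmm(f, f.transpose(1, 2)), dim=-1)
        fc = torch.bmm(cattn, f).transpose(1, 2)         # (B, N, C)
        fc = self.mv_c(fc).transpose(1, 2).view(b, c, h, w) + x  # F_C
        return fp + fc                                   # fuse the two directions
```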

Model Interaction and Integration
In order to strengthen the interdependence between different features, we fused the output results in the spatial and channel directions together and denoted the result F_A. In addition, the BAM was regarded as an independent part, and the detected semantic information and the output features of the IRMs were fused in a propagation mode. First, we changed the number of output feature channels of the BAM and IRMs to 21 and produced two results of the same dimension simultaneously. Then, one result of each module was grouped for cascade fusion in the shallow-to-deep direction, the other group was cascaded and fused in the deep-to-shallow direction, and the output obtained after each level of fusion was used as a predicted value. The above process is referred to as the bidirectional propagation model (BPM), shown as BPM in Figure 3. Taking F_I as the IRM output and F_A as the BAM output, the features are gradually superimposed in the two directions, from shallow to deep and from deep to shallow, and each side output is used as a final prediction. In the above process, fuse(·) is the feature fusion operation, Cat(·) is the feature concatenation operation along the channel direction, F_A^{s2d} and F_A^{d2s} are the results of F_A after channel reduction, F_{O(s2d)} is the output of each level from shallow to deep, and F_{O(d2s)} is defined similarly. Finally, we used the standard binary cross-entropy loss to train the predicted values: L_bce = −Σ_{(x,y)} [G_{x,y} log P_{x,y} + (1 − G_{x,y}) log(1 − P_{x,y})], where G_{x,y} ∈ {0, 1} represents the true value of pixel (x, y), and P_{x,y} represents the predicted value of pixel (x, y).
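The bidirectional cascade can be sketched as follows; the paper's fuse(·) operation is approximated here by resize-and-add, which is an assumption made for illustration:

```python
import torch
import torch.nn.functional as F

def bpm_fuse(irm_feats, bam_feat):
    """Sketch of the bidirectional propagation model (BPM): features already
    reduced to a common channel number are cascaded shallow-to-deep and
    deep-to-shallow, and every fusion node yields one side prediction."""
    feats = irm_feats + [bam_feat]          # shallow -> deep ordering
    def cascade(fs):
        outs, acc = [], None
        for f in fs:
            if acc is not None:
                acc = F.interpolate(acc, size=f.shape[2:], mode='bilinear',
                                    align_corners=False)
                f = f + acc                 # fuse with the propagated feature
            acc = f
            outs.append(f)                  # one prediction per fusion node
        return outs
    s2d = cascade(feats)                    # shallow to deep: F_O(s2d)
    d2s = cascade(feats[::-1])              # deep to shallow: F_O(d2s)
    return s2d, d2s
```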

Experimental Setup
Datasets: We evaluated the proposed model on six benchmark datasets: DUT-OMRON [19], DUTS [41], ECSSD [42], PASCAL-S [43], HKU-IS [44], and SOD [45]. DUTS contains 10,553 training images (DUTS-TR) and 5019 testing images (DUTS-TE), both covering complex scenes. ECSSD contains 1000 semantic images with various complex scenes. PASCAL-S contains 850 images with cluttered backgrounds and complex salient regions, taken from the validation set of the PASCAL VOC 2010 segmentation dataset. HKU-IS contains 4447 images with high-quality annotations, many of which contain multiple unconnected salient objects. DUT-OMRON contains 5168 challenging images, most of which contain one or more salient objects against complex backgrounds. The SOD dataset has 300 very challenging images, most of which contain multiple objects with low contrast or touching the image boundaries.
Evaluation criteria: We used three metrics to evaluate the performance of the proposed MBINet and other state-of-the-art SOD algorithms: the F_β-measure [46], the mean absolute error (MAE) [15], and the S-measure [47]. The F_β-measure is the weighted harmonic mean of precision and recall: F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall), where β² is generally set to 0.3 to weight precision. Since precision and recall are calculated on a binary image, we first threshold the prediction map to a binary image; different thresholds correspond to different F_β scores, and we report the maximum F_β score over all thresholds. MAE measures the average difference between the predicted map P and the ground-truth map G ∈ {0, 1}: MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |P(x, y) − G(x, y)|, where W and H are the width and height of the image, respectively. The S-measure calculates the structural similarity of object perception S_o and region perception S_r between the predicted map and the ground-truth map: S = γ × S_o + (1 − γ) × S_r, where γ is set to 0.5 [47]. Implementation details: We implemented our model in the PyTorch framework. To facilitate comparison with other works, we chose VGG-16 [36] as the backbone network. Following most existing methods, we used DUTS-TR as the training set, SOD as a validation set to select the best weights, and DUTS-TE, ECSSD, PASCAL-S, HKU-IS, and DUT-OMRON as testing sets, with DUTS-TE as the benchmark for the ablation experiments. Our network was trained on an NVIDIA GTX 1080 Ti GPU under Ubuntu 16.04. To ensure convergence, we used the Adam [48] optimizer. The initial learning rate was set to 1 × 10⁻⁴, the batch size to 8, and the input resolution to 320 × 320.
The training process of our model took about 18 h and converged after 22 epochs.
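The thresholded F_β and MAE metrics above can be sketched as follows (the S-measure's S_o and S_r terms are more involved and omitted here; the threshold sweep granularity is an assumption):

```python
import numpy as np

def mae(pred, gt):
    """MAE between a predicted map and a ground-truth map, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def f_beta(pred, gt, beta2=0.3, thresh=0.5):
    """F_beta at a single binarization threshold, with beta^2 = 0.3."""
    b = pred >= thresh
    tp = np.logical_and(b, gt > 0.5).sum()
    precision = tp / max(b.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def max_f_beta(pred, gt, steps=255):
    """Maximum F_beta over a sweep of thresholds, as reported in the paper."""
    return max(f_beta(pred, gt, thresh=t / steps) for t in range(steps + 1))
```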

Ablation Studies
The proposed model is composed of three parts: IRM, BAM, and BPM. In this section, we conduct ablation studies to verify the effectiveness of each module combination. The experimental setup follows Table 1, and we mainly report results on the DUTS-TE dataset. Effectiveness of the BAM: The BAM combines the advantages of the self-attention and beyond-attention models. Each attention model was embedded into our model and compared on the DUTS-TE dataset, with the results recorded in Table 2. We found that the performance of the BAM was better than that of the self-attention or external attention model alone, especially in terms of MAE, which improved by 1.72% and 3.45% compared with the beyond-attention and self-attention models, respectively; the visual effects are shown in Figure 7. We can see that the BAM captures richer contextual information.
Effectiveness of the IRM and BPM: The IRM mainly processes the features of the shallow and middle layers in order to further refine the features extracted by DC. The BPM is the feature-fusion mechanism between the IRM and BAM. Table 1 shows different combinations of the three modules; it is obvious from Table 1 that integrating these independent modules lets them complement each other.
For a more comprehensive analysis of the IRM, we further studied the impact of the number of dilated convolutions and the dilation rate. First, we tested the number of convolutions k with the dilation rate fixed at {1, 2, 4}. We conducted experiments with five different values of k and recorded the results in Table 3, from which the best-performing k was selected. We then tested different dilation-rate combinations and recorded the results in Table 4. We found that the performance was best with the dilation rate {1, 2, 4}; therefore, we finally adopted a dilation rate of {1, 2, 4}.
Quantitative comparison: We evaluated the proposed method from the three aspects of F β -measure, MAE, and S-measure, and also compared it with other SOD methods, as shown in Table 5. It can be seen from the results that our method is significantly better than other salient detection methods. In particular, in terms of MAE, the best scores are obtained on all five datasets, and the performance is improved on average compared to the second-best method. Qualitative evaluation: Figure 8 shows some representative examples. These examples reflect different scenes, including small objects, complex scenes, and images with low contrast between foreground and background. It can be seen from the figure that the proposed method can predict the foreground area more accurately and completely.

Conclusions
In this paper, we proposed a novel multiscale balanced-attention interactive network for salient object detection. First, we used dilated convolutions to extract multiscale features from the side outputs of the encoder. Next, interactive residual modules (IRMs) were designed to further refine the edge information of the multiscale features; the features extracted by the interactive residual modules and the dilated-convolution module thus complement each other, and noise is suppressed. In addition, we proposed a balanced-attention model (BAM), which captures the deep context information of objects in the spatial and channel directions, respectively. The ablation experiments showed that the cascade structure of the BAM and IRMs can extract richer semantic information for features at different scales. Finally, in order to better describe and accurately locate the predicted objects, we adopted the improved bidirectional propagation module (BPM) to strengthen the interdependence between different features. Whether the IRM and BPM or the BAM and BPM were cascaded for testing, the results showed that the bidirectional propagation module integrates multiscale features more effectively. In conclusion, the experimental evaluation on five datasets demonstrated that the designed method predicts saliency maps more accurately than existing saliency detection methods under different evaluation metrics. In future research, we plan to maintain the existing strengths of our method while addressing the challenging problem of model lightweighting, and to further optimize our solution for better predictive performance.