Alignment Integration Network for Salient Object Detection and Its Application for Optical Remote Sensing Images

Salient object detection has made substantial progress due to the exploitation of multi-level convolutional features. The key point is how to combine these convolutional features effectively and efficiently. Due to the step-by-step down-sampling operations in almost all CNNs, multi-level features usually have different scales. Methods based on fully convolutional networks directly apply bilinear up-sampling to low-resolution deep features and then combine them with high-resolution shallow features by addition or concatenation, which neglects the compatibility of features and results in misalignment problems. In this paper, to solve this problem, we propose an alignment integration network (ALNet), which aligns adjacent level features progressively to generate powerful combinations. To capture long-range dependencies for high-level integrated features while maintaining high computational efficiency, a strip attention module (SAM) is introduced into the alignment integration procedures. Benefiting from SAM, multi-level semantics can be selectively propagated to predict precise salient objects. Furthermore, although integrating multi-level convolutional features can alleviate the blurred boundary problem to a certain extent, it is still unsatisfactory for the restoration of a real object boundary. Therefore, we design a simple but effective boundary enhancement module (BEM) to guide the network to focus on boundaries and other error-prone parts. Based on BEM, an attention weighted loss is proposed to boost the network to generate sharper object boundaries. Experimental results on five benchmark datasets demonstrate that the proposed method can achieve state-of-the-art performance on salient object detection. Moreover, we extend the experiments to remote sensing datasets, and the results further demonstrate the universality and scalability of ALNet.


Introduction
As an important research branch in computer vision, salient object detection (SOD) has received much attention in recent years. It can serve as a fundamental pre-processing technique to facilitate various computer vision applications, such as foreground map evaluation [1], image retrieval [2], visual tracking [3,4], remote sensing image segmentation [5], and semantic segmentation [6].
Benefiting from the development of deep learning technology, great advancements  in SOD have been made. In [38], Wang et al. provide a comprehensive survey that reviews deep SOD algorithms from various aspects, including network architecture, level of supervision, and so on. As summarized by [38], most of the current deep learning based methods design their architectures based on fully convolutional networks (FCN) [39] to integrate multi-level convolutional features. However, due to stepwise down-sampling operations, features from different levels have contradictions, and the contextual information they possess is asymmetric, which results in misalignment problems during the feature aggregation process; current work tends to ignore this problem. To increase interpretability of the models, we visualize the integrated feature maps of each model. An FCN-based model, which combines adjacent level features by direct addition, is utilized as the baseline model (i.e., w.o.Align). As we can see, features without alignment are fuzzy and unfocused. The important semantic and structural information is not well represented because of misalignment. Flow alignment (see Figure 1a), which has been proven to be effective in semantic segmentation [40], provides us with a feasible solution to alleviate the misalignment. Motivated by [40], we propose a flow alignment model to align adjacent level features for SOD. In flow alignment, semantic flow (i.e., offset ∆) is learned for spatial warping of high-level features. The visualized results in Figure 1 demonstrate the effectiveness of flow alignment. However, the flow alignment only learns one offset at each spatial position of a feature, which is sometimes not enough to handle complex misalignment. Therefore, we further propose a deformable alignment model (see Figure 1b) by substituting deformable convolution for spatial warping to increase the offset diversity for better alignment. 
Compared with flow alignment, deformable alignment can better highlight the salient region as well as maintain useful spatial details. The details of flow alignment and deformable alignment are explained in Section 3.2.
Moreover, the ability of a network to model global context is also critical to performance improvement. Recently, the non-local self-attention mechanism [41] has been proven to be effective in capturing long-range dependencies. However, how to effectively incorporate it in SOD is still challenging. First of all, we need to consider computational efficiency. In this paper, we introduce strip attention [42] into our network to augment contextual information as well as ensure computational efficiency. Second, the adaptation of the self-attention mechanism for SOD is also an important factor to consider. Different from [42], where strip attention is utilized once to enhance the final feature for scene parsing, in our ALNet, strip attention modules (SAMs) are embedded in the intermediate procedure of alignment integration to augment contextual information for the high-level integrated features. Due to SAM, the global semantics are selectively incorporated in the alignment integration to recover precise salient objects.
Furthermore, to strengthen the model's learning ability at the object boundary, we design a simple but effective boundary enhancement module, which can output an attention map for the network. Based on the attention map, an attention weighted loss (AW loss) function is proposed to make the network pay more attention to the ambiguous and hard regions. Features from this branch are utilized as a complement for the multi-level integrated features to conduct the final prediction.
Finally, to prove the robustness and scalability of the proposed method, we directly apply our network to optical remote sensing images (RSIs) and compare it with state-of-the-art RSI-SOD methods [32,43-46] (salient object detection methods that are specially designed for RSIs). The extensive experiments demonstrate the effectiveness of our method.
The main contributions of our proposed method are summarized as follows:
1. We propose an alignment integration network (ALNet) to alleviate the misalignment problem in multi-level feature fusion, thereby generating effective representation for salient object detection.
2. Strip attention is introduced into our network to augment global contextual information for the high-level integrated features as well as keep computational efficiency.
3. To make the network focus more on the boundary and error-prone regions, we propose a boundary enhancement module and an attention weighted loss function to help the network generate results with sharper boundaries.
4. Experimental results on SOD benchmarks as well as remote sensing datasets demonstrate the effectiveness and scalability of the proposed ALNet.

Related Work
Existing deep SOD methods can be roughly categorized into multi-level features integration based and boundary learning based approaches.

Integrating Multi-Level Features for SOD
A simple but effective way to integrate multi-level features is adding or concatenating features step by step, as with FCN [39], which is usually taken as a baseline model. However, with this direct integration, associations between features cannot be well modeled, resulting in unsatisfactory performance. In contrast, Amulet [10] integrates multi-scale features in a fully connected way. Nevertheless, fusing features from all levels at every specific scale may introduce unnecessary redundant information. Based on FCN, PAGRN [12] introduces both channel-wise and spatial-wise attention to suppress the irrelevant interference from features and then combines attentive features by stepwise addition. A pyramid fusion structure is utilized by Wei et al. [23] to fuse high-level semantics with low-level details via lateral connections. In [17], Wang et al. design an ingenious network that conducts both top-down and bottom-up inference in an iterative and cooperative manner. The predicted saliency map is integrated with multi-level features step by step for coarse-to-fine saliency estimation. Sun et al. [28] leverage average- and max-pooling modules to integrate the multi-level features in the spatial and channel-wise dimensions, respectively. An architecture search framework is proposed by Zhang et al. [29] to automatically learn a multi-scale feature fusion strategy. All of the existing methods design ingenious modules to integrate features; nevertheless, they neglect the misalignment problem of multi-level features. To address this problem, we introduce alignment technology into SOD and further design an alignment integration network to relieve the misalignment for effective feature integration.

Boundary Learning for SOD
Precise salient object boundaries are beneficial for the performance of SOD methods. CNN-based methods suffer from blurred boundaries due to stride and pooling operations. Incorporating shallow layer features can alleviate the problem to a certain extent, but sometimes this is not enough. In order to obtain sharper object boundaries, some methods, such as [9,11,14], utilize CRF [47] as the post-processing step to enhance object edges. However, the post-processing operation is too time-consuming to be employed in real-time applications. In [16], Wang et al. design a salient edge detection module to emphasize the importance of boundary information, and L2-norm loss is employed to supervise salient edges. BASNet [20] employs a hybrid loss that incorporates SSIM [48] to capture the structural information in an image. Weighted BCE and IOU loss are utilized by F3Net [23], which synthesizes the local structure information of a pixel to guide the network to focus more on local details. In [29], Zhang et al. employ boundary loss [49] to penalize the misalignment of salient object boundaries. Mei et al. [37] adopt the patch-level edge preservation loss [50], which considers a local neighborhood of each pixel and assigns more attention to the object boundary. Different from these algorithms, in this paper, based on the boundary enhancement module, we propose an attention weighted loss, which can adaptively promote the network to focus on the hard pixels (i.e., pixels from boundaries or other error-prone parts).

Materials and Methods
In this section, we explain the details of our proposed ALNet, whose main framework is shown in Figure 2.

Figure 2. Main framework of our alignment integration network; CBR k×k means a k × k convolution followed by batch normalization and ReLU operations. We first side-out the multi-level convolutional features from the backbone and process them by the pre-process module. An additional 3 × 3 convolution operation is applied on the top level feature to encode high-level semantics. Then, the multi-level features are fed into the alignment integration module, in which adjacent level features are progressively combined by feature alignment. A strip attention module is utilized to capture non-local contextual information for the intermediate integrated feature. The final integrated feature is further enhanced by a boundary enhancement module, and the enhanced feature is exploited to conduct salient object prediction.
The backbone includes five convolutional blocks, denoted as $\{\mathrm{Block}_i\}_{i=0}^{4}$. Multi-level features with different resolutions (i.e., 1/4, 1/8, 1/16, and 1/32 of the original resolution) are side-outputted from Block1 to Block4 and are denoted as $\{X_i\}_{i=1}^{4}$. Then, the features are sent to the pre-process module, which is explained in Section 3.1. Next, we propose the alignment integration module (AIM) to combine adjacent level features by feature alignment in Section 3.2. The boundary enhancement module (BEM), which is utilized to equip AIM to generate more powerful features, is explained in Section 3.3. Finally, we introduce the proposed attention weighted loss and the supervision strategy in our work in Section 3.4.

Pre-Process Module
As shown in Figure 2, the shallower features $\{X_i\}_{i=1}^{3}$ are each fed into a 1 × 1 convolution followed by batch normalization and ReLU operations. As for the top-level feature (i.e., $X_4$), an additional 3 × 3 convolution is applied to extract high-level semantics for the network. After pre-processing, we obtain the multi-level features $\{F_i\}_{i=1}^{4}$. Then, alignment integration is carried out for them.

Alignment Integration Module
Most of the existing methods directly integrate multi-level features without considering the misalignment problem between them. To alleviate this problem, we propose a novel alignment integration module (AIM), which is constructed based on feature alignment (FA). As shown in Figure 2, in AIM, adjacent level features conduct FA to generate an aligned feature, which is then fed into the next FA with the shallower level feature, and so on. The procedures of FA are shown in Figure 1. For the adjacent level features $F_i$ and $F_{i+1}$, we first generate the alignment offset for them.

Offset Generation
First of all, $F_{i+1}$, which denotes a high-level feature with low resolution, is up-sampled to the same size as $F_i$. Next, we concatenate them and take the concatenated features as the input of a 3 × 3 convolution layer to output the alignment offset:

$$\Delta_i = \mathrm{Conv}_{3\times 3}\big(\mathrm{Cat}(F_i, \mathrm{Up}(F_{i+1}))\big), \tag{1}$$

where Cat(·) and Up(·) denote the concatenation and bilinear up-sampling operations, respectively. Then, we conduct feature alignment for them.

Feature Alignment
Two kinds of feature alignment models (i.e., flow alignment in Figure 1a and deformable alignment in Figure 1b), which intrinsically share the same formulation but differ in their offset diversity, are proposed in our work.
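Flow Alignment. For flow alignment, the offset $\Delta_i \in \mathbb{R}^{H_i \times W_i \times 2}$ is utilized for the spatial warping of $F_{i+1}$:

$$\tilde{F}_{i+1} = T(F_{i+1}, \Delta_i), \tag{2}$$

where $T(\cdot, \cdot)$ represents the alignment transformation function; $\Delta_i$ consists of two feature maps, which represent the offsets for the x- and y-coordinates of each position on the feature map to be aligned, respectively. Let $T_{hw}$ denote the output of $T(F, \Delta)$ at position $(h, w)$. The function is defined as follows:

$$T_{hw} = \sum_{h'=1}^{H} \sum_{w'=1}^{W} F_{h'w'} \cdot \max\left(0, 1 - \left|\frac{h + \Delta_{1hw}}{\delta} - h'\right|\right) \cdot \max\left(0, 1 - \left|\frac{w + \Delta_{2hw}}{\delta} - w'\right|\right), \tag{3}$$

which samples features at position $p = \left(\frac{h + \Delta_{1hw}}{\delta}, \frac{w + \Delta_{2hw}}{\delta}\right)$ of $F$ and linearly interpolates the values of the four neighbors (top-left, top-right, bottom-left, and bottom-right) of $p$ to approximate the output. The variable $\delta$ denotes the scale difference between $F$ and $\Delta$ (e.g., when the resolution of $F$ is half that of $\Delta$, $\delta = 2$); $\Delta_{1hw}$ and $\Delta_{2hw}$ represent the learned 2D transformation offsets for position $(h, w)$.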
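The warping step can be illustrated with a minimal single-channel sketch in plain Python. It follows the description above (sample at the offset position divided by δ, then interpolate the four neighbors), but the function name `flow_warp`, the nested-list layout, and the zero contribution of out-of-range neighbors are assumptions made for illustration, not details taken from the paper.

```python
import math

def flow_warp(feat, flow_h, flow_w, delta=1.0):
    """Bilinearly sample `feat` at ((h + flow_h[h][w]) / delta, (w + flow_w[h][w]) / delta).

    feat:   2D list, the feature map to be aligned (one channel).
    flow_h: 2D list, learned vertical offsets on the output grid.
    flow_w: 2D list, learned horizontal offsets on the output grid.
    delta:  scale difference between `feat` and the offset maps.
    """
    in_h, in_w = len(feat), len(feat[0])
    out_h, out_w = len(flow_h), len(flow_h[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for h in range(out_h):
        for w in range(out_w):
            # Sampling position expressed in the coordinate frame of `feat`.
            ph = (h + flow_h[h][w]) / delta
            pw = (w + flow_w[h][w]) / delta
            h0, w0 = math.floor(ph), math.floor(pw)
            value = 0.0
            # Four neighbours: top-left, top-right, bottom-left, bottom-right.
            for hh in (h0, h0 + 1):
                for ww in (w0, w0 + 1):
                    # Assumption: neighbours outside the map contribute zero.
                    if 0 <= hh < in_h and 0 <= ww < in_w:
                        weight = max(0.0, 1 - abs(ph - hh)) * max(0.0, 1 - abs(pw - ww))
                        value += weight * feat[hh][ww]
            out[h][w] = value
    return out

feat = [[1.0, 2.0], [3.0, 4.0]]
zeros2 = [[0.0, 0.0], [0.0, 0.0]]
identity = flow_warp(feat, zeros2, zeros2)            # zero offset: output equals input
zeros4 = [[0.0] * 4 for _ in range(4)]
upsampled = flow_warp(feat, zeros4, zeros4, delta=2)  # 2x2 map sampled on a 4x4 grid
```

With a zero offset and δ = 1 the warp is the identity; with δ = 2 the same routine reads a low-resolution map on a grid twice as large, which is how one offset per output position can move high-level content onto the finer grid.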
Deformable Alignment. As for deformable alignment, a 3 × 3 deformable convolution is utilized in our network. The number of offsets is proportional to the kernel size of the deformable convolution: each of the 3 × 3 = 9 sampling points has a two-dimensional offset. Therefore, the learned offset is $\Delta_i \in \mathbb{R}^{H_i \times W_i \times 18}$. The feature is aligned by modulated deformable convolution (i.e., DCN-v2 [51]) based on the offset:

$$\tilde{F}_{i+1} = \mathrm{DeformConv}(F_{i+1}, \Delta_i). \tag{4}$$

Let $Y$ denote the output of $\mathrm{DeformConv}(F, \Delta)$:

$$Y(p) = \sum_{k=1}^{n^2} \omega_k \cdot F(p + p_k + \Delta_k) \cdot m_k, \tag{5}$$

where $p$ is the spatial position, $p_k$ is the $k$th sampling offset in a standard convolution, $\Delta_k$ is the $k$th learned offset at $p$, and $n$ is the kernel size of the deformable convolution (i.e., 3); $\omega$ and $m$ are learnable parameters in the DeformConv. Compared with flow alignment, deformable alignment adaptively learns diverse offsets for features and can thus deal with the misalignment problem better, which corresponds with the experimental results in Section 5.1.
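The relation between kernel size and offset count can be made concrete with a short sketch. It only enumerates sampling positions of the form p + p_k + (learned offset) and does not implement the convolution itself; the helper names and the (vertical, horizontal) ordering of each offset pair are illustrative assumptions.

```python
def regular_grid(n=3):
    """Fixed sampling offsets p_k of an n x n convolution, in row-major order."""
    r = n // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def offset_channels(n=3):
    """Two learned offsets (vertical, horizontal) for each kernel sampling point."""
    return 2 * n * n

def sampling_positions(p, learned_offsets, n=3):
    """Positions p + p_k + learned offset visited by the deformable kernel at p."""
    grid = regular_grid(n)
    if len(learned_offsets) != len(grid):
        raise ValueError("expected one (dy, dx) offset per kernel sampling point")
    return [(p[0] + dy + oy, p[1] + dx + ox)
            for (dy, dx), (oy, ox) in zip(grid, learned_offsets)]

# With zero learned offsets the kernel visits the regular 3 x 3 neighbourhood.
regular = sampling_positions((5, 5), [(0.0, 0.0)] * 9)
# Each sampling point may move independently, unlike the single flow offset.
deformed = sampling_positions((5, 5), [(0.5, -0.25)] * 9)
```

For a 3 × 3 kernel, `offset_channels(3)` gives 18, matching the offset depth stated above, whereas flow alignment uses a depth of 2.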

Aligned Integration
The aligned integrated feature $\hat{F}_i$ is obtained by fusing $F_i$ with the aligned higher-level feature and applying $\mathrm{CBR}_{3\times 3}(\cdot)$, which denotes a 3 × 3 convolution with batch normalization and ReLU operations. In AIM, multi-level features are integrated step by step as in FCN, but with the misalignment alleviated. The integrated feature of the last step (i.e., $\hat{F}_1$) is equipped with both semantic information and spatial details.

Strip Attention Module
To augment the contextual information for the intermediate integrated features and promote their pixel-wise representative capacity, we incorporate non-local self-attention into our network. The standard non-local self-attention has a computational complexity of $O((H \times W) \times (H \times W))$, where $H$ and $W$ denote the spatial dimensions of the input feature map. In this paper, we introduce strip attention [42], which reduces the computational complexity to $O((H \times W) \times W)$ by a stripping operation, adding global context while remaining efficient. The strip attention module (SAM) is displayed in Figure 3. For simplicity, here we use $F \in \mathbb{R}^{C \times H \times W}$ to denote the input feature. First, $F$ is fed into three convolutional layers with 1 × 1 filters followed by batch normalization and ReLU to generate three new feature maps, which are $Q \in \mathbb{R}^{C' \times H \times W}$, $K \in \mathbb{R}^{C' \times H \times W}$, and $V \in \mathbb{R}^{C \times H \times W}$, respectively; $C'$ is the intermediate channel dimension of $Q$ and $K$. To make SAM efficient, we set $C'$ smaller than $C$.
A stripping operation (i.e., average pooling with pooling windows of size $H \times 1$) is applied on $K$ to encode global context representation in the vertical direction, which yields $\bar{K} \in \mathbb{R}^{C' \times 1 \times W}$. We also tried applying $1 \times W$ pooling on the feature to incorporate context in the horizontal direction, but it brought little performance improvement. Considering computational complexity, we only use a one-direction stripping operation.
Next, we reshape $Q$ and $\bar{K}$ to $\mathbb{R}^{C' \times N}$ and $\mathbb{R}^{C' \times W}$, respectively, where $N = H \times W$. Then, we can calculate the strip attention map $SA \in \mathbb{R}^{N \times W}$ along the horizontal direction as follows:

$$SA = \mathrm{Softmax}(Q^{T} \otimes \bar{K}),$$

where $\otimes$ means matrix multiplication and $T$ means matrix transposition. Similarly, we apply stripping and reshape operations to $V$ and obtain $\bar{V} \in \mathbb{R}^{C \times W}$. Then, we conduct a matrix multiplication between $SA$ and $\bar{V}^{T}$ and reshape the result to get $F_{SA} \in \mathbb{R}^{C \times H \times W}$, from which the output feature of SAM is generated. For inputs $\hat{F}_3$ and $\hat{F}_2$, the outputs of SAM are denoted as $\hat{F}^{SA}_3$ and $\hat{F}^{SA}_2$, respectively. As shown in Figure 2, after adding SAM in our network, when $i = 1$ or $i = 2$, the input of Equations (2) and (4) should be $\hat{F}^{SA}_{i+1}$. For SOD, a high-level feature is expected to be augmented by global context, whereas a shallow-level feature is supposed to place emphasis on structural details. Therefore, we do not add SAM in the shallow level integration (i.e., level 1 in Figure 2). The experimental results in Section 5.2 demonstrate the rationality of our design (i.e., SAM-ver vs. SAM-ver-1).
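The data flow of the strip attention computation can be sketched in plain Python on nested lists. This is only an illustration under stated assumptions: Q, K, and V are taken as given (the 1 × 1 convolutions are omitted), a softmax over the W strip positions is assumed as the normalization of the attention map, and the function names are hypothetical.

```python
import math

def strip_pool(x):
    """Average over the vertical axis: C x H x W -> C x W (an H x 1 pooling window)."""
    height = len(x[0])
    return [[sum(row[w] for row in ch) / height for w in range(len(ch[0]))] for ch in x]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def strip_attention(q, k, v):
    """q, k: C' x H x W; v: C x H x W (nested lists). Returns F_SA with shape C x H x W."""
    height, width = len(q[0]), len(q[0][0])
    k_strip, v_strip = strip_pool(k), strip_pool(v)   # C' x W and C x W
    out = [[[0.0] * width for _ in range(height)] for _ in v]
    for h in range(height):
        for w in range(width):
            # One row of SA (length W): similarity of pixel (h, w) to every column strip.
            scores = [sum(q[c][h][w] * k_strip[c][j] for c in range(len(q)))
                      for j in range(width)]
            attn = softmax(scores)                    # assumed normalization
            for c in range(len(v)):
                out[c][h][w] = sum(a * vs for a, vs in zip(attn, v_strip[c]))
    return out

def attention_map_sizes(height, width):
    """Entries in the full non-local map (N x N) versus the strip map (N x W)."""
    n = height * width
    return n * n, n * width

q = [[[0.0, 0.0], [0.0, 0.0]]]   # C' = 1, H = 2, W = 2
k = [[[1.0, 2.0], [3.0, 4.0]]]
v = [[[1.0, 3.0], [5.0, 7.0]]]   # C = 1
f_sa = strip_attention(q, k, v)
```

`attention_map_sizes(H, W)` makes the saving explicit: the strip map has H times fewer entries than the full non-local map, which is the source of the reduced complexity quoted above.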

Boundary Enhancement Module
An auxiliary boundary enhancement branch, which is simple but effective, is proposed to guide the network to focus on boundaries and other error-prone parts of the image. The boundary enhancement module (BEM) is illustrated in Figure 4. We apply two convolution layers with batch normalization on the input feature to generate the attention map A, which is utilized as a weight for the loss computation in Section 3.4. Ground-truth boundary maps, which are pre-computed by the method in [52], are used to provide guidance for the attention generation. In addition, we extract the intermediate feature as guidance to enhance and complement the input feature. As shown in Figure 4, the input feature and the guidance are concatenated and fused by a 1 × 1 convolution with batch normalization and ReLU operations. The enhanced feature is then used to conduct salient object prediction.

Supervision Strategy
In this paper, a hybrid loss function is proposed to supervise the network. First, we introduce BCE [53] and IOU loss [54] to ensure a pixel-wise smooth gradient as well as optimize the global structure. For the saliency map $S$ and ground truth $G$, the BCE loss can be calculated as follows:

$$L_{BCE}(S, G) = -\sum_{x=1}^{H} \sum_{y=1}^{W} \big[ G(x, y) \log S(x, y) + (1 - G(x, y)) \log(1 - S(x, y)) \big],$$

where $(x, y)$ denotes the spatial position; $H$ and $W$ represent the height and width of images. IOU loss is formulated as:

$$L_{IOU}(S, G) = 1 - \frac{\sum_{x=1}^{H} \sum_{y=1}^{W} S(x, y) G(x, y)}{\sum_{x=1}^{H} \sum_{y=1}^{W} \big[ S(x, y) + G(x, y) - S(x, y) G(x, y) \big]}.$$

Furthermore, to boost the network to learn sharper boundaries, we propose an attention weighted loss (AW loss), denoted $L_{AW}$, based on the attention map $A$ learned in Section 3.3. The AW loss can be considered as a combination of attention weighted BCE and IOU loss. In addition, to ensure the attention map focuses on the boundary, we use an auxiliary weighted BCE loss $L_{AX}(A, G_b)$ to supervise $A$, where $G_b$ is the ground-truth boundary (radius = 2) generated from $G$. The calculation of $L_{AX}$ is as in [55].
The final loss function for the proposed network combines the loss terms above, where β = 1 and λ = 20 are weighting coefficients for the loss function. We set these parameters empirically.
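The two base terms of the hybrid loss, BCE and IOU, can be written out in a few lines of plain Python. This sketch covers only the standard unweighted forms; the attention weighted variants and the final combination are not reproduced because their exact form is not given here. Summing BCE over positions without normalizing by H × W, and the clamping constant `eps`, are assumptions.

```python
import math

def bce_loss(s, g, eps=1e-7):
    """Binary cross-entropy summed over all positions of saliency map s and ground truth g."""
    total = 0.0
    for s_row, g_row in zip(s, g):
        for sv, gv in zip(s_row, g_row):
            sv = min(max(sv, eps), 1 - eps)   # clamp so that log() stays finite
            total -= gv * math.log(sv) + (1 - gv) * math.log(1 - sv)
    return total

def iou_loss(s, g):
    """One minus the soft intersection over the soft union."""
    inter = sum(sv * gv for sr, gr in zip(s, g) for sv, gv in zip(sr, gr))
    union = sum(sv + gv - sv * gv for sr, gr in zip(s, g) for sv, gv in zip(sr, gr))
    return 1 - inter / union if union > 0 else 0.0

ground_truth = [[1.0, 0.0], [0.0, 1.0]]
perfect = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(bce_loss(uncertain, ground_truth), iou_loss(uncertain, ground_truth))
```

BCE penalizes each pixel independently, while the IOU term looks at the overlap of the whole predicted region, which is why the two are combined to cover pixel-level and structure-level errors.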

Results
Experimental results of the proposed work are displayed in this section. In Sections 4.1 and 4.2, we introduce the datasets and evaluation metrics of the experimental results. Implementation details of the proposed ALNet are described in Section 4.3. In Section 4.4, we compare our method with the state-of-the-art models from both quantitative and qualitative aspects. Furthermore, we conduct extension experiments on optical remote sensing images (RSIs) and compare our ALNet with state-of-the-art RSI-SOD methods. The details are introduced in Section 4.5.

Datasets
The experiments are conducted on five benchmark datasets: ECSSD [56], HKU-IS [57], PASCAL-S [58], DUT-OMRON [59], and DUTS [60]. The ECSSD dataset contains 1000 natural images with complex structures. In HKU-IS, there are 4447 images, which include multiple salient objects or objects touching the image boundary. PASCAL-S, which is generated from the PASCAL VOC dataset [61], contains 850 images. DUT-OMRON is a challenging dataset with 5168 images. DUTS is a relatively large dataset that contains 10,553 training images and 5019 testing images. We train our network on the training images of DUTS for salient object detection.
In addition, in order to further demonstrate the stability and scalability of ALNet, we test the proposed method on two optical remote sensing datasets dedicated to SOD: ORSSD [44] and EORSSD [43]. ORSSD is the first publicly available dataset for SOD in optical remote sensing images. It contains 800 images (600 for training and 200 for testing), which are collected from Google Earth and some existing RSI datasets. EORSSD is a large public dataset for RSI-SOD that extends ORSSD to 2000 images (1400 for training and 600 for testing). Specifically, we augment the training sets of EORSSD and ORSSD by flipping and rotation, generating seven additional variants of the original training data. On EORSSD, we train our ALNet on 11,200 augmented pairs. On ORSSD, we train our ALNet on 4800 augmented pairs.
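The pair counts follow from keeping each original image together with seven flipped or rotated copies. The sketch below assumes the eight variants are the four 90-degree rotations, each with and without a horizontal flip; the exact set of flips and rotations used is not stated in the text, so this choice and the helper names are illustrative.

```python
def rotate90(img):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in img]

def eight_variants(img):
    """Original plus seven flipped/rotated copies (4 rotations, each with optional flip)."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(hflip(current))
        current = rotate90(current)
    return variants

sample = [[1, 2], [3, 4]]
variants = eight_variants(sample)
eorssd_pairs = 1400 * len(variants)   # EORSSD training images x 8
orssd_pairs = 600 * len(variants)     # ORSSD training images x 8
```

In practice the same transformation would be applied to the image and its ground-truth mask so that each augmented pair stays aligned.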

Metrics
We adopt the popular precision-recall (PR) curves, F-measure curves, mean F-measure ($F_\beta$) [62], weighted F-measure ($F^{\omega}_{\beta}$) [63], mean absolute error ($M$) [64], and mean E-measure ($E^{m}_{\xi}$) [1] as our evaluation metrics. Mean F-measure is an overall performance measurement, which is defined as:

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}},$$

where $\beta^2 = 0.3$ to emphasize the precision. Weighted F-measure offers an intuitive generalization of the F-measure by replacing precision and recall with their weighted counterparts, $\mathrm{Precision}^{\omega}$ and $\mathrm{Recall}^{\omega}$. As suggested in [65], $\beta^2$ for the weighted F-measure is set to 1.0. Mean absolute error is defined as the average pixel-wise absolute difference between the binary ground truth $G$ and the saliency map $S$, which can be computed by:

$$M = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$$

where $W$ and $H$ denote the width and height of the saliency map, respectively. The E-measure focuses on both local pixel values and image-level statistics. It can be computed by:

$$E_\xi = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \theta(\xi),$$

where $\theta(\xi)$ is the enhanced alignment matrix. Mean E-measure ($E^{m}_{\xi}$) is utilized in our experiment.
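The two simplest metrics, F-measure and mean absolute error, can be checked by hand with a few lines of plain Python. The sketch assumes precision and recall are already computed for a thresholded saliency map; the function names are illustrative, and the weighted F-measure and E-measure are omitted because they need the weighting and alignment terms from the cited papers.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """F-measure with beta^2 = 0.3 by default, which weights precision more than recall."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom > 0 else 0.0

def mean_absolute_error(s, g):
    """Average pixel-wise absolute difference between saliency map s and ground truth g."""
    height, width = len(s), len(s[0])
    total = sum(abs(sv - gv) for sr, gr in zip(s, g) for sv, gv in zip(sr, gr))
    return total / (height * width)

saliency = [[0.5, 1.0], [0.0, 0.0]]
ground_truth = [[1.0, 1.0], [0.0, 1.0]]
mae = mean_absolute_error(saliency, ground_truth)   # (0.5 + 0 + 0 + 1) / 4
high_precision = f_measure(0.9, 0.6)
high_recall = f_measure(0.6, 0.9)
```

Swapping precision and recall changes the score when β² is not 1: the high-precision case scores higher than the high-recall case, which is the emphasis on precision mentioned above.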

Implementation Details
The proposed method is implemented in PyTorch. We conduct our experiments on a PC with an Intel Core i7-9700KF CPU (with 3.9 GHz Turbo boost) and a single NVIDIA GeForce RTX 2080 Ti GPU. The input images are resized to 352 × 352 for both training and testing. We use data augmentation methods such as normalizing, cropping, and flipping. The parameters of the backbone are initialized from VGG16 [66], ResNet50 [67], and MSCAN-b [68] for fair comparison with existing methods. We utilize the SGD optimizer [69] to train the entire network end to end. The base learning rate is set to 0.05, and warm-up and linear decay strategies are used to adjust the learning rate. The momentum and the weight decay are set to 0.9 and $1 \times 10^{-4}$, respectively. Batch size is set to 30 (for the ResNet50 backbone) and 20 (for the VGG16 and MSCAN-b backbones), and we train the network for 60 epochs. Apex (https://github.com/NVIDIA/apex (accessed on 20 December 2022)) and fp16 are utilized to accelerate the training process.
For the extended experiments on remote sensing datasets, the implementation details are the same as for the original SOD experiments, except for the training dataset. Specifically, on the EORSSD and ORSSD datasets, we resize the input image to 288 × 288 and train our ALNet for 65 epochs and 45 epochs, respectively.
For VGG16, ResNet50, and MSCAN-b backbone, the inference time of the proposed method for a 352 × 352 image is 0.0235 s (43 fps), 0.0188 s (53 fps), and 0.0280 s (36 fps), respectively, which demonstrates the feasibility of our method for real-time applications. The source code will be released to facilitate reproducibility.
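The frame rates quoted above are the reciprocals of the per-image inference times, rounded to the nearest integer. A short check (the function name is illustrative):

```python
def frames_per_second(seconds_per_image):
    """Convert a per-image inference time in seconds into a rounded frame rate."""
    return round(1.0 / seconds_per_image)

# Per-image inference times reported for a 352 x 352 input.
inference_times = {"VGG16": 0.0235, "ResNet50": 0.0188, "MSCAN-b": 0.0280}
fps = {name: frames_per_second(t) for name, t in inference_times.items()}
```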
For fair comparisons, we directly use the saliency maps offered by the authors or use the provided codes to generate the results. As some algorithms employ various backbones, we compare with the best results of them.
Quantitative Comparison. Table 1 shows the quantitative comparison results in terms of mean F-measure, weighted F-measure, mean absolute error, and mean E-measure. We also compare the computational complexity and the number of parameters in the second and third columns of Table 1 (i.e., MACs and Params). Figure 5 shows the P-R curves and F-measure curves of our method and the state-of-the-art methods. From the results, we can see that, for the VGG16 and ResNet50 backbones, our proposed network performs favorably against other state-of-the-art methods on all datasets and metrics while keeping the complexity and model size relatively small, which demonstrates the effectiveness of our proposed network based on alignment integration.
For the attention based backbone, we implement our network based on MSCAN-b, which utilizes multi-scale convolution attention to encode features. Compared with the existing methods, ALNet-MS ranks first on most of the datasets and metrics. It is noteworthy that ALNet-MS has smaller MACs and Params than the existing methods. The experiments based on different backbones all prove that our proposed network can achieve state-of-the-art performance in both effectiveness and efficiency.
Table 1. Comparisons with 17 methods on 5 benchmark datasets. The best two results of each part are shown in red and blue; ↑ means higher value is better, whereas ↓ is the contrary. '-V': VGG16 [66], '-R': ResNet50 [67], '-T2': T2T-ViT [70], '-S': SWIN [71], '-MS': MSCAN-b [68].
Qualitative Comparison. In Figure 6, we compare the visual results of the methods for qualitative evaluation. Benefiting from multi-level alignment integration, our network can generate powerful integrated features, which contain both high-level semantics and spatial details, to segment salient regions even in very challenging scenes (e.g., 1st and 2nd rows in Figure 6). In addition, compared with other boundary learning based methods such as F3Net and A-MSF, our proposed method can generate relatively clear and accurate object boundaries.

Extension Experiment on the Remote-Sensing Datasets
To further discuss the proposed model's robustness and scalability, we conduct experiments on optical remote sensing datasets. We compare our ALNet with four state-of-the-art RSI-SOD methods: LVNet [44], DAFNet [43], MJRB [45], and ACCoNet [46]. For fair comparison with existing methods, the network is initialized from ResNet50.
Quantitative Comparison. The quantitative comparison results of mean F-measure, weighted F-measure, mean absolute error, and mean E-measure are shown in Table 2. For the EORSSD dataset, the proposed method ranks first on all metrics. For the ORSSD dataset, the result of our method is also competitive. In Figure 7, we display the F-measure curves of the proposed method with state-of-the-art methods on two remote sensing datasets. Our method performs well against state-of-the-art RSI-SOD methods. It is worth mentioning that the proposed method is a universal framework for salient object detection and is not dedicated to optical remote sensing images. Nevertheless, the results demonstrate the effectiveness and scalability of the proposed network.
Table 2. Comparisons with four state-of-the-art RSI-SOD methods on two remote sensing datasets (EORSSD and ORSSD). The best two results are shown in red and blue.
Qualitative Comparison. The qualitative results, including several challenging and representative scenes of optical remote sensing images, are shown in Figure 8.
For the first scene (i.e., object with shadows), being affected by the shadows, ACCoNet, MJRB, DAFNet, and LVNet cannot generate accurate and sharp boundaries, but our method can better highlight the object and produce relatively accurate results.
For the scene with a tiny object, which is typical in optical remote sensing images, our proposed method can segment the tiny object with fine details; compared with the other methods, the object shape generated by our method is closer to the ground truth.
Another difficult scene is one with multiple objects. As shown in Figure 8, ACCoNet and MJRB incorrectly predict non-salient interference in the background as foreground. DAFNet generates blurred salient regions, and LVNet fails to detect the real objects in the first row of this scene. In contrast, our method captures all objects finely without any redundant regions.
For the scene with irregular geometry structure (e.g., lakes and rivers), the saliency maps of our method obviously have sharper boundaries, and the highlighted regions are concentrated. From the visual results, we can see that our methods can better deal with the complex and challenging scenes in optical remote sensing images, which further proves the reliability of the method.

Discussion
In this section, we conduct ablation studies for all the proposed modules (i.e., feature alignment, strip attention module, and boundary enhancement module) in our ALNet and analyze the effectiveness of them in Sections 5.1-5.3, respectively. For a comprehensive analysis of the model, we further discuss the failure cases in Section 5.4.

Effectiveness of Feature Alignment
We use an FCN-based model, which combines adjacent level features by direct addition, as the baseline model (i.e., w.o.Align). Different alignment technologies are exploited on the baseline model. The results based on ResNet50 are shown in the first part of Table 3. F-Align and D-Align denote flow alignment and deformable alignment, respectively. The comparisons of the alignment methods and the baseline demonstrate that the misalignment problem in multi-level feature integration indeed decreases the performance. Deformable alignment performs better than flow alignment, which indicates the importance of offset diversity.
In addition, we visualize the last-stage integrated features of different methods to make the results explainable, as shown in Figure 9. As we can see, features without alignment are fuzzy and lack both semantic and structural information. After alignment, the models can generate more meaningful feature representation. Compared with flow alignment, deformable alignment features can better highlight salient regions and have more precise boundaries, which coincides with the quantitative results in Table 3.
Table 3. Ablation study of our proposed method. We show the results based on the ResNet50 backbone. The table can be divided into three parts to demonstrate effectiveness of the proposed modules in ALNet. Best results are shown in bold.

Effectiveness of SAM
On the basis of D-Align, we conduct stripping operations in both the vertical and horizontal directions for the integrated feature. In the second part of Table 3, we list the results of using vertical stripping (+SAM-ver), using horizontal stripping (+SAM-hori), and using both directions (+BSAM). The experimental results demonstrate that SAM is effective for adaptively encoding global contextual relations for the integrated feature. SAM-ver is superior to SAM-hori on most of the datasets, and using both directions does not bring further improvement. On the basis of SAM-ver, we also add SAM to the integration of feature level 1 (SAM-ver-1). The results show that SAM-ver is better than SAM-ver-1, which indicates that shallow level integration benefits more from spatial details than from global context. Furthermore, SAM is essentially a simplified self-attention mechanism, and to further verify its effectiveness, we compare it with a non-local module in Table 3. Compared with non-local, SAM performs better in our network, and, due to the stripping operations, SAM is more computationally efficient.
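The efficiency argument can be made concrete with a toy strip-pooled attention. The sketch below is only one plausible reading of a stripping operation and should not be taken as the SAM design: it assumes that each row of the feature map is averaged over its width into one descriptor, that every pixel attends over these H descriptors with a dot-product softmax, and that the result is added back residually. There are no learned projections, and the names `strip_attention` and `attention_pairs` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def strip_attention(feat):
    """feat: H x W x C nested lists. Returns a map of the same shape.

    Assumed stripping: average each row over its width, giving H strip descriptors.
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    strips = [[sum(feat[i][j][k] for j in range(w)) / w for k in range(c)]
              for i in range(h)]
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            q = feat[i][j]
            # Each pixel compares itself with H strips instead of H*W pixels.
            weights = softmax([sum(qk * sk for qk, sk in zip(q, s)) for s in strips])
            ctx = [sum(wt * s[k] for wt, s in zip(weights, strips)) for k in range(c)]
            row.append([qk + ck for qk, ck in zip(q, ctx)])  # residual connection
        out.append(row)
    return out

def attention_pairs(h, w):
    """Number of query-key pairs for strip attention versus full non-local attention."""
    return {"strip": h * w * h, "non_local": (h * w) ** 2}

print(attention_pairs(32, 32))  # {'strip': 32768, 'non_local': 1048576}
```

Under these assumptions, a 32 x 32 feature map needs 32 times fewer query-key comparisons than a non-local block, and the gap grows linearly with the strip length.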

Effectiveness of BEM
The boundary enhancement module is a simple but effective branch that helps the network generate clear boundaries. In the third part of Table 3, we conduct ablation studies for BEM based on +SAM-ver. The results indicate that BEM improves performance. Removing either $L_{AX}$ or $L_{AW}$ lowers the final results, which confirms the contribution of each part of BEM.
In Figure 10, we compare the saliency maps with and without BEM and visualize attention map A at the same time.

[Figure 10. Panels: Image, GT, w.o.BEM, Attention, BEM.]

From the results, we can see that BEM can learn reasonable attention maps, which make the network put more emphasis on boundaries and error-prone regions. The results with BEM have visibly sharper boundaries and can deal with more complex backgrounds.
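The way an attention map can steer a loss towards boundaries and error-prone pixels can be sketched as a weighted binary cross-entropy. This is a generic illustration rather than the loss defined for ALNet: the weighting form `1 + a`, the normalisation by the sum of the weights, and the function name `weighted_bce` are all assumptions made for this example.

```python
import math

def weighted_bce(pred, target, attention, eps=1e-7):
    """Binary cross-entropy where each pixel is weighted by (1 + attention).

    pred, target, attention: 2D lists of equal shape; pred and attention lie in [0, 1].
    The (1 + a) weighting is an assumed form, chosen so that zero attention
    falls back to the ordinary mean BCE.
    """
    total, norm = 0.0, 0.0
    for p_row, t_row, a_row in zip(pred, target, attention):
        for p, t, a in zip(p_row, t_row, a_row):
            p = min(max(p, eps), 1 - eps)  # avoid log(0)
            bce = -(t * math.log(p) + (1 - t) * math.log(1 - p))
            weight = 1.0 + a
            total += weight * bce
            norm += weight
    return total / norm

pred = [[0.9, 0.4]]      # second pixel is poorly predicted
target = [[1, 1]]
uniform = weighted_bce(pred, target, [[0.0, 0.0]])   # ordinary mean BCE
focused = weighted_bce(pred, target, [[0.0, 1.0]])   # attention on the hard pixel
```

In this example `focused` is larger than `uniform`, because the attention map increases the share of the loss that comes from the poorly predicted pixel.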

Failure Cases
The failure cases of our method are displayed in Figure 11.

Figure 11. Failure cases of our proposed method and other state-of-the-art methods. (Panels: Image, DCENet, GT, ALNet-R, ICON-R, A-MSF.)
In the first row, our method incorrectly predicts the oranges as the foreground objects. In the second row, our method takes the whole bed (not just the pillow) as the salient object. In the third row, our method cannot detect the real object (i.e., the board with "organic"). Other state-of-the-art methods also fail in these cases. We summarize the possible reasons for these failures as follows: (1) insufficient training samples (e.g., 1st row); (2) controversial annotations (e.g., 2nd row); (3) overly complex scenes that require additional information such as depth (e.g., 3rd row).

Conclusions
In this paper, an alignment integration network (ALNet) is proposed to alleviate misalignment problems in combining multi-level convolutional features. Feature alignment is designed in our network to align adjacent level features step by step and produce effective feature representations for salient object detection. To help the network encode global context, a strip attention module is introduced to augment the representative capacity of the features. Finally, we construct a boundary enhancement module and an attention weighted loss function to make the network focus on boundaries and hard regions. Comprehensive experiments are conducted on five SOD benchmarks and two remote sensing datasets. The experimental results demonstrate the state-of-the-art performance of ALNet as well as the effectiveness of each proposed module.
Author Contributions: X.Z. designed and implemented the whole model architecture and wrote the manuscript. Y.Y. and Y.W. proofread the manuscript. X.C. and C.W. provided suggestions and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.