Hybrid-Attention Network for RGB-D Salient Object Detection

Abstract: Depth information has been widely used to improve RGB-D salient object detection by extracting attention maps that determine the position of objects in an image. However, non-salient objects may be close to the depth sensor and thus present high pixel intensities in the depth map. This situation inevitably leads to erroneous emphasis of non-salient areas and may degrade the saliency results. To mitigate this problem, we propose a hybrid attention neural network that fuses middle- and high-level RGB features with depth features to generate a hybrid attention map that removes background information. The proposed network extracts multilevel features from RGB images using the Res2Net architecture and then integrates high-level features from depth maps using the Inception-v4-ResNet2 architecture. The mixed high-level RGB and depth features generate the hybrid attention map, which is then multiplied with the low-level RGB features. After decoding through several convolutions and upsampling operations, we obtain the final saliency prediction, achieving state-of-the-art performance on the NJUD and NLPR datasets. Moreover, the proposed network has good generalization ability compared with other methods. An ablation study demonstrates that the proposed network effectively performs saliency prediction even when non-salient objects interfere with detection. In fact, after removing the branch with high-level RGB features, the RGB attention map that guides the network for saliency prediction is lost, and all the performance measures decline. The prediction maps from the ablation study show the effect of non-salient objects close to the depth sensor, an effect that is absent when using the complete hybrid attention network. Therefore, RGB information can correct and supplement depth information, and the corresponding hybrid attention map is more robust than a conventional attention map constructed only with depth information.


Introduction
Saliency detection extracts relevant objects with pixel-level detail from an image. It has been widely used in many fields such as object segmentation [1], region proposal [2], object recognition [3], image quality assessment [4], and video analysis [5]. When the background has colors similar to those of a salient object, or is highly complex, or when salient objects are very large or small, saliency detection based solely on RGB images often fails to provide accurate results. Therefore, depth information is increasingly being used as a supplement to RGB information for saliency detection [6][7][8]. RGB-D salient object detection based on handcrafted features generally uses depth maps to determine edges, textures, and histogram statistics, and then bottom-up [9] or top-down [10] approaches are used to predict whether a pixel belongs to a salient object. Various methods consider the rarity of pixels in local and global regions of an image [11], while others use prior knowledge to support prediction and obtain accurate detection [12]. However, these methods rely on handcrafted features, empirical parameter settings, and statistical prediction, which limit their performance. In fact, such methods cannot fully extract representative features due to inadequate parameter settings, subjective factors, and redundant or erroneous information. In addition, models of the human visual system may be incomplete and misleading. Alternatively, deep learning methods have emerged in recent years, improving the accuracy of salient object detection [13][14][15][16]. By combining the advantages of deep learning and the features in depth maps, several stereoscopic saliency detection methods based on neural networks have achieved great leaps in accuracy. For instance, DF combines RGB images and depth maps into a deep learning framework [17]. Then, encoder-decoder networks, such as PDNet [18], provided high accuracy and robustness. Chen et al. further improved the results by proposing hidden structure conversion [19], complementary fusion [20], a dilated convolutional model [21], and modifications to loss functions [22] for highly accurate salient object detection. On the other hand, methods based on attention mechanisms can quickly identify the position of objects and then reconstruct their edges to improve salient object detection. Wang et al. proposed a residual network with an attention mechanism [23] and then DANet [24] to achieve accurate results using channel and spatial attention maps.
Current stereoscopic salient object detection based on deep learning usually adopts networks such as VGG [25], ResNet [26], and Inception [27] as the backbone and the U-Net encoder-decoder structure [28] as the framework. However, this is not an ideal solution for saliency detection. As the depth map (disparity map) is an image reflecting the distances to objects, many networks use it to generate an attention map that distinguishes objects from the background. However, depth maps have two major limitations. First, the depth map reflects the distance to all objects, and some non-salient objects are the closest to the camera, providing the lowest (or, in a disparity map, the highest) pixel intensities. Thus, the underlying network may consider such objects as salient, a phenomenon that we call the depth principle error. Second, data acquisition limitations may degrade the accuracy of edge information in the depth map.
Overall, the neural networks that determine the location of objects using only depth information to construct the attention map may be biased. Using the RGB image to discard the closest non-salient objects in depth maps may improve the detection accuracy. Based on spatial attention maps, we propose stereoscopic salient object detection using a hybrid attention network (HANet). Before processing features for saliency detection, high-level features extracted from the RGB image are encoded into an attention map, which is then mixed with the depth attention map for subsequent joint processing with the saliency features. Experimental results show that this novel method prevents non-salient object interference present in depth maps. In addition, unlike many symmetric neural networks, the proposed asymmetric network has fewer parameters, because the depth map has less information and a large network is unnecessary. Thus, we use a simplified Inception-v4-ResNet2 [29] architecture with fewer parameters to extract the depth attention map and a Res2Net [30] architecture for feature extraction to construct the RGB attention map containing more complex information. The proposed asymmetric HANet can prevent the depth principle error by filtering features with cross-modal attention maps separately obtained from RGB and depth data.

Proposed Method
The proposed HANet architecture achieves salient object detection and prevents the depth principle error. The processing pipeline of HANet is shown in Figure 1. HANet can be divided into two main parts. The first part extracts features through eight neural network blocks (shown in blue in Figure 1) for the RGB attention map and through two blocks (shown in green) for the depth attention map. The second part consists of six blocks (shown in orange in Figure 1) that fuse the two types of features to generate a hybrid attention map, and one block (shown in pink) that generates the saliency prediction map according to feature filtering based on the hybrid attention map.
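The second stage described above, which fuses the two attention maps and filters the low-level RGB features with the result, can be sketched in PyTorch as follows. The module name, channel sizes, and the 3 × 3 fusion convolution are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HybridAttentionHead(nn.Module):
    """Fuses an RGB attention map and a depth attention map, then
    filters low-level RGB features with the resulting hybrid map."""
    def __init__(self, low_ch=64):
        super().__init__()
        # mix the two single-channel attention maps into one hybrid map
        self.fuse = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # decode the filtered features into a saliency prediction
        self.decode = nn.Sequential(
            nn.Conv2d(low_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, low_feats, rgb_att, depth_att):
        hybrid = torch.sigmoid(self.fuse(torch.cat([rgb_att, depth_att], dim=1)))
        filtered = low_feats * hybrid   # element-wise attention filtering
        return self.decode(filtered)

head = HybridAttentionHead(low_ch=64)
pred = head(torch.randn(1, 64, 56, 56),   # low-level RGB features
            torch.rand(1, 1, 56, 56),     # RGB attention map
            torch.rand(1, 1, 56, 56))     # depth attention map
```

The multiplication broadcasts the one-channel hybrid map across all feature channels, so background activations are suppressed uniformly before decoding.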

Feature Extraction
We adopt two popular backbone networks for feature extraction. Specifically, Res2Net [30] extracts RGB features, and a simplified Inception-v4-ResNet2 [29] extracts depth features. The latter can handle the comparatively limited information in depth maps while preventing overfitting and reducing the computation time by omitting unnecessary parameters. Therefore, we establish an asymmetric architecture for this two-stream network.
For RGB images, the Res2Net backbone has been used to extract multilevel features for different tasks, being widely applied in semantic segmentation, key-point estimation, and salient object detection. We have conducted comprehensive experiments on many datasets and benchmarks and verified the excellent generalization ability of Res2Net. For salient object detection, we remove all the fully connected layers of Res2Net to ensure that the output is an image. To preserve feature information, we delete the first max pooling layer of the network and set the stride of the convolution to 1 (instead of 2) to prevent excessive downsampling. This prevents severe information loss and the consequent failure to reconstruct object details after saliency detection. As we obtain the features at each downsampling stage, Res2Net provides four outputs: low-level features extracted by Layer1, middle-level features extracted by Layer2 and Layer3, and high-level features extracted by Layer4. In [27], 1 × 1 convolutions serve a dual purpose: most critically, they act as dimension-reduction modules that remove computational bottlenecks which would otherwise limit network size, allowing both the depth and the width of a network to increase without a significant performance penalty. Inspired by [27], we use four 1 × 1 convolutions to reduce the number of channels to one-eighth of the original number, which would otherwise be too high and require long computation times during both training and inference.
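The channel-reduction step can be sketched as follows. The per-layer channel counts assume the standard Res2Net-50 layout and are therefore assumptions about the configuration used here.

```python
import torch
import torch.nn as nn

# Assumed Res2Net-50 output channels for Layer1..Layer4
res2net_channels = [256, 512, 1024, 2048]

# Four 1x1 convolutions, each cutting the channel count to one-eighth
reducers = nn.ModuleList(
    nn.Conv2d(c, c // 8, kernel_size=1) for c in res2net_channels
)

# Dummy multilevel features at progressively halved spatial resolutions
feats = [torch.randn(1, c, 56 // 2**i, 56 // 2**i)
         for i, c in enumerate(res2net_channels)]
reduced = [conv(f) for conv, f in zip(reducers, feats)]
```

After reduction the four levels carry 32, 64, 128, and 256 channels, which keeps the subsequent decoder convolutions inexpensive.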
For depth maps, we use a simplified Inception-v4-ResNet2. To reduce the computational complexity, we only adopt its Stem part and five Inception-ResNet-A blocks. In addition, we follow the same procedure as for the RGB branch to ensure that the output is an image. Likewise, we delete the first max pooling layer, set the stride of the convolution to 1, and use 3 × 3 convolutions to construct the depth attention map.
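A minimal sketch of the slimmed depth branch is shown below: a stem-like entry convolution followed by five residual blocks (standing in for the Inception-ResNet-A blocks) and a 3 × 3 convolution that emits the one-channel depth attention map. The block internals are simplified assumptions, not the exact Inception-v4-ResNet2 layout.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block used here as a stand-in for Inception-ResNet-A."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

depth_branch = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=1, padding=1),  # stem with stride 1, as in the text
    nn.ReLU(inplace=True),
    *[ResBlock(32) for _ in range(5)],         # five residual blocks
    nn.Conv2d(32, 1, 3, padding=1),            # 3x3 conv -> depth attention map
    nn.Sigmoid(),
)

att = depth_branch(torch.randn(1, 1, 224, 224))  # single-channel depth input
```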

Hybrid Attention Predictor
The depth principle error described above causes the objects closest to the depth sensor, salient or not, to have either the lowest or highest intensities in a disparity map. When a neural network searches for salient objects in depth maps, it can be misled by such objects. Therefore, a single-modal attention map containing only depth information is biased. By leveraging the complementarity between RGB and depth information, we can eliminate the depth principle error by constructing a hybrid attention map. This map combines the RGB and depth modes to obtain a weighted attention map in which each pixel encodes the likelihood of belonging to a specific object.
To obtain the hybrid attention map, we devise a decoder network (orange blocks in Figure 1) consisting of 3 × 3 convolutions and bilinear interpolation upsampling. After each upsampling, we concatenate the lower-level and current features. The decoder blocks can be represented as

R^k_n = F(U(R^k_{n−1}) ⊕ r^k_{n−1}; W),  (1)

where F represents convolution, U represents upsampling, k is the feature channel index, R^k_{n−1} is the k-th channel of the (n − 1)-th RGB attention features extracted by the corresponding block in the decoder network, r^k_{n−1} is the k-th channel of the (n − 1)-th RGB features extracted by Res2Net (whose number of channels is reduced by the 1 × 1 convolutions), ⊕ denotes concatenation, and W denotes the convolution parameters.
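One decoder block implementing the update above, upsampling the previous attention features to the skip connection's resolution, concatenating, and applying a 3 × 3 convolution, might look as follows. The channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # F(.; W): 3x3 convolution over the concatenated features
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)

    def forward(self, prev, skip):
        # U(.): bilinear upsampling to the skip feature's spatial size
        up = F.interpolate(prev, size=skip.shape[-2:], mode='bilinear',
                           align_corners=False)
        # concatenation followed by convolution
        return torch.relu(self.conv(torch.cat([up, skip], dim=1)))

block = DecoderBlock(in_ch=64, skip_ch=32, out_ch=32)
out = block(torch.randn(1, 64, 14, 14),   # previous decoder features
            torch.randn(1, 32, 28, 28))   # reduced Res2Net features
```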
When the RGB attention map is obtained after decoding, we aggregate the depth attention map to generate the hybrid attention map. This cross-modal attention map provides accurate localization of objects in the image. Then, we multiply the map with low-level RGB features, and several convolutions and upsampling operations lead to the prediction map for salient object detection.

Loss Function
We use binary cross-entropy as the loss function for HANet:

L(Y, G) = −Σ_{(h,w)} [G(h, w) log Y(h, w) + (1 − G(h, w)) log(1 − Y(h, w))],  (2)

where (h, w) indexes the pixel at the corresponding position, Y is the prediction map, and G is the ground truth. Thus, L(Y, G) provides the final loss value between the prediction and label maps.
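The binary cross-entropy above can be written out explicitly and checked against PyTorch's built-in version:

```python
import torch

def bce_loss(Y, G, eps=1e-7):
    """Y: predicted saliency in (0, 1); G: binary ground truth."""
    Y = Y.clamp(eps, 1 - eps)  # guard against log(0)
    return -(G * torch.log(Y) + (1 - G) * torch.log(1 - Y)).mean()

# Random prediction kept strictly inside (0, 1) and a random binary label
Y = torch.rand(1, 1, 8, 8) * 0.98 + 0.01
G = (torch.rand(1, 1, 8, 8) > 0.5).float()

manual = bce_loss(Y, G)
builtin = torch.nn.functional.binary_cross_entropy(Y, G)
```

The two values agree, so `binary_cross_entropy` can be used directly during training.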

Evaluation Measures
To comprehensively evaluate the detection performance of various saliency methods, we adopt five evaluation measures: precision-recall curve, maximum and mean F-measure, mean absolute error, and area under the precision-recall curve [31,32].
An adaptive threshold is applied to the grayscale saliency map to obtain the corresponding binary saliency map. The binary saliency map S corresponding to a threshold is then compared to the ground truth G, and precision P and recall R are computed as

P = |S ∩ G| / |S|,  (3)
R = |S ∩ G| / |G|.  (4)

The average precision and recall over the images in each dataset are plotted as a precision-recall curve. For each saliency map, the precision and recall are computed using (3) and (4). Then, F_β is defined as

F_β = (1 + β²) · P · R / (β² · P + R),  (5)

where β is a positive parameter specifying the relative importance of precision and recall. For consistency when comparing the performance of the proposed network with that of other methods, we set β² = 0.3.
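A NumPy sketch of the precision, recall, and F-measure defined above, using the common convention β² = 0.3 (an assumption about the exact weighting used here):

```python
import numpy as np

def f_measure(sal, gt, threshold=0.5, beta2=0.3):
    """F-measure of a saliency map against a binary ground truth."""
    pred = (sal >= threshold).astype(np.float64)   # binarize with the threshold
    tp = (pred * gt).sum()                         # |S ∩ G|
    precision = tp / max(pred.sum(), 1e-8)         # |S ∩ G| / |S|
    recall = tp / max(gt.sum(), 1e-8)              # |S ∩ G| / |G|
    return ((1 + beta2) * precision * recall /
            max(beta2 * precision + recall, 1e-8))

# Perfect prediction: both salient pixels found, no false positives
sal = np.array([[0.9, 0.1], [0.8, 0.2]])
gt = np.array([[1.0, 0.0], [1.0, 0.0]])
score = f_measure(sal, gt)
```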
The mean absolute error reflects the average absolute pixelwise difference between the predicted saliency maps and the corresponding ground truth. Thus, it is an important measure to evaluate the proposed HANet, and it is given by

MAE = (1 / (H · W)) Σ_{h=1}^{H} Σ_{w=1}^{W} |Y(h, w) − G(h, w)|,  (6)

where H and W are the numbers of rows and columns in the saliency map, respectively.
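The mean absolute error above is a one-liner in NumPy:

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute pixelwise difference between saliency map and ground truth."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

# One pixel off by 0.5 out of two pixels -> average error 0.25
err = mae(np.array([[0.5, 0.0]]), np.array([[1.0, 0.0]]))
```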

Implementation Details
We implement HANet using the popular PyTorch 1.2.0 library in Python. We apply Adam optimization with a learning rate of 0.001, which is halved if no improvement is observed in the validation performance over five consecutive epochs. The NJUD dataset [31], containing more than 2000 images, and the NLPR dataset [32], containing 1000 images with corresponding pixel-level ground truths, are used to evaluate the proposed HANet. We follow the dataset-splitting scheme proposed in [18,21]: 80% of the images are used for training and the remaining 20% for testing. All the images are resized to 224 × 224 pixels. The network is trained over 100 epochs with early stopping, and a minibatch of 2 images is used at every training iteration. In this study, HANet was trained on a computer equipped with an Intel i7-7750H CPU at 2.21 GHz and an NVIDIA GeForce GTX TITAN Xp GPU.
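The reported training setup, Adam with learning rate 0.001, halved after five consecutive epochs without validation improvement, maps directly onto PyTorch's `ReduceLROnPlateau` scheduler. The model below is a placeholder; HANet itself would be plugged in instead.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)   # placeholder standing in for HANet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5)

# In a real loop, scheduler.step(val_loss) runs once per epoch.
# Here we simulate ten epochs with a flat validation loss: after the
# patience of five non-improving epochs is exceeded, the rate is halved.
for epoch in range(10):
    scheduler.step(1.0)
```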

Comparison with State-of-the-Art Methods
We compared the proposed method with seven state-of-the-art methods: ACSD [31], CDCP [33], DCMC [34], DF [17], MBP [21], PDNet [18], and SFP [35]. Table 1 and Figure 2 show that the proposed HANet outperforms the other evaluated methods. Figure 3 shows the saliency maps obtained from each method in typical scenarios. In the first and second rows, the closest objects are non-salient and have the highest pixel intensities. The compared methods misjudge these two images due to the depth principle error. In contrast, HANet correctly detects the salient objects by using the information in the hybrid attention map.
To further demonstrate the effectiveness of HANet, we conducted an ablation study by removing the RGB attention map. The results are shown in the 12th column of Figure 3, where the miscalculation due to the depth principle error appears. The third and fourth rows show the saliency maps obtained from HANet in scenes with multiple and large salient objects, confirming the effectiveness of the proposed method.

Ablation Study
To analyze the effectiveness of the proposed hybrid attention mechanism and the ability of the RGB attention map to correct mistakes caused by the depth principle error, we removed Layer2, Layer3, and Layer4 and their corresponding 1 × 1 convolutions from HANet. In addition, we removed the upsampling and convolution used during fusion, thus omitting the RGB attention map and its combination with the depth attention map. Table 2 and Figure 4 show that the saliency results deteriorate substantially, as illustrated in the 12th column of Figure 3, where the depth principle error is evident. Therefore, the complete HANet accurately predicts salient objects and eliminates the interference caused by the depth principle error.

Computational Complexity
The computational complexity of the proposed HANet and the other methods was estimated from tests on the NJUD dataset. It takes approximately 4 h to train HANet using an Intel i5-7500 CPU at 3.4 GHz and an NVIDIA GeForce GTX TITAN Xp GPU. HANet achieves saliency detection at 11.6 fps for images of 224 × 224 pixels. Therefore, our model has low computational complexity and can be applied to real-time image processing systems.

Conclusions
We propose HANet, a hybrid network based on an attention mechanism for stereoscopic salient object detection. HANet uses a novel attention method that fuses RGB and depth attention maps to filter the original saliency features. Combined with an encoder-decoder network, HANet provides high performance on the NJUD and NLPR datasets. Furthermore, an ablation study confirms that the performance of HANet decreases when the RGB attention map is removed, indicating the effectiveness of the proposed hybrid attention mechanism. The RGB attention map helps resolve the interference caused by the depth principle error, which occurs when non-salient objects are close to the depth sensor. Moreover, HANet provides high performance in scenes containing multiple objects, large objects, and other complex content.