1. Introduction
Humans are able to detect visually distinctive, so-called salient, objects and regions effortlessly and rapidly. Benefitting from the selective attention mechanisms of the human visual system (HVS), people capture the most conspicuous objects in complex scenes and analyze them first, which greatly improves the efficiency of information perception [
1,
2,
3].
Different from fixation prediction, saliency detection aims to locate the most remarkable objects in a scene and to output a binary segmentation result [
4,
5,
6]. In general, these methods can be divided into two categories (i.e., bottom-up and top-down approaches), driven by stimuli and by tasks, respectively. Early bottom-up approaches usually adopted hand-designed features or combinations of them. However, because our cognition of the HVS is limited, the mechanisms of feature selection and combinational optimization remained unclear. The main disadvantage of hand-designed features lies in their low generalization ability: some handcrafted features are designed for specific tasks without accounting for variability in the input data, and their quality depends heavily on human expertise. Meanwhile, owing to the spatial pyramid schemes applied in multi-scale analysis, the resolution of the resulting saliency maps was relatively low, which caused salient objects to be located inaccurately.
With the breakthrough of deep learning, early methods relying on low-level vision cues have rapidly been superseded in many fields, including handwritten digit recognition, pedestrian detection, and autonomous driving. In the field of saliency detection, deep convolutional networks have also achieved enormous success [
7,
8,
9]. For the detection of salient objects, the fusion of multi-scale features is very important for producing good saliency maps and has been widely adopted in published work [
10,
11].
The inherent hierarchical structure of DNN models, which underlies the achievements of deep semantic segmentation, is also very conducive to the extraction of multi-scale features of salient objects [
12,
13]. In
Figure 1, it can be seen that saliency maps generated from the deep side outputs of DNN models mainly focus on the global appearance of the object. By contrast, saliency maps generated from the shallow side outputs capture detailed information such as textures and skeletons. Both are necessary for the high precision detection of salient objects [
14,
15].
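The complementary roles of shallow and deep side outputs can be sketched numerically. The following toy example (not the paper's actual network; a minimal numpy sketch with made-up values) upsamples a coarse "deep" map by nearest-neighbour repetition and averages it with a fine "shallow" map:

```python
import numpy as np

def fuse_side_outputs(shallow, deep):
    """Fuse a high-resolution shallow map (fine detail) with a
    low-resolution deep map (coarse global appearance). The deep map
    is upsampled by simple nearest-neighbour repetition and the two
    maps are averaged; real models use learned fusion weights."""
    kh = shallow.shape[0] // deep.shape[0]
    kw = shallow.shape[1] // deep.shape[1]
    deep_up = np.repeat(np.repeat(deep, kh, axis=0), kw, axis=1)
    return 0.5 * (shallow + deep_up)

# Illustrative values: a detailed 2x2 shallow map and a 1x1 deep map.
shallow = np.array([[0.0, 1.0], [1.0, 0.0]])
deep = np.array([[0.5]])
fused = fuse_side_outputs(shallow, deep)
```

The fused map retains the shallow map's detail while being biased toward the deep map's global estimate.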
In this paper, with proper combinations of shallow and deep connections across the various side outputs generated from different layers, a high precision salient object detection method based on a symmetric end-to-end DNN model is proposed. The improvements of our method mainly concern four aspects:
(1) A symmetric end-to-end architecture with stronger backbone networks.
An end-to-end encoder–decoder DNN architecture is proposed in this paper. Moreover, an ImageNet pre-trained ResNet50 is adopted instead of VGG16 to improve the feature extraction ability of the backbone network [
16].
(2) Resolution recovery based on nonlinear transposed convolution.
The nonlinear transposed convolution adopted in our model helps to recover the resolution of feature maps with higher accuracy, outperforming the bilinear interpolation applied in fully convolutional networks (FCNs) [
17].
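The resolution-recovery step can be illustrated with a minimal 1-D transposed convolution, whose kernel weights are learnable in a real network (the kernel values below are illustrative, not trained weights):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Naive 1-D transposed convolution: each input value scatters a
    scaled copy of the kernel into the stride-expanded output, which
    both enlarges the signal and lets the upsampling be learned."""
    out = np.zeros(stride * (len(x) - 1) + len(kernel))
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += v * kernel
    return out

x = np.array([1.0, 2.0])          # toy low-resolution feature
kernel = np.array([1.0, 0.5])     # illustrative (learnable) kernel
y = transposed_conv1d(x, kernel, stride=2)
```

Because the kernel is trained, the network can learn a reconstruction that preserves spatial relationships better than a fixed interpolation scheme.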
(3) Combinations of shallow and deep connections on various side outputs.
Combinations of shallow and deep connections on the various side outputs help to integrate feature maps with different receptive field sizes. Such integration of global and detailed information has been validated to be very helpful for achieving higher accuracy detection of salient objects [
18,
19].
(4) HED-based architecture with the fusion of multiple side outputs.
Outputs generated from the various layers of DNN models focus on different scales of information about salient objects. Inspired by the HED architecture, multiple side outputs are well fused to integrate multi-scale information, which helps to further improve accuracy [
20].
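The HED-style fusion of side outputs can be sketched as a weighted sum of side-output activations followed by a sigmoid. The weights below are hypothetical stand-ins; in HED and in our model they are learned during training:

```python
import numpy as np

def hed_fuse(side_outputs, weights):
    """HED-style fusion: a weighted sum of side-output activation maps
    followed by a sigmoid. The fusion weights are learned in practice;
    here they are fixed, illustrative values."""
    fused = sum(w * s for w, s in zip(weights, side_outputs))
    return 1.0 / (1.0 + np.exp(-fused))

# Three toy 1x1 side-output activations (pre-sigmoid logits).
sides = [np.array([[2.0]]), np.array([[-1.0]]), np.array([[1.0]])]
weights = [0.5, 0.25, 0.25]   # hypothetical learned fusion weights
fused = hed_fuse(sides, weights)
```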
2. Related Work
During the past two decades, a rich set of salient object detection methods has been developed. Conventional methods are primarily based on hand-crafted local features, global features, or combinations of them. A complete review is beyond the scope of this paper; details can be found in a recent survey published by Borji A. et al. [
21]. In this paper, we mainly focus on developments in deep-learning-based salient object detection methods.
The inherent hierarchical architecture of DNN models is conducive to the extraction of feature maps with various scale information, which is vital for the high precision detection of salient objects. The literature published by Li G. et al. demonstrates an early attempt [
22]. In their work, multi-scale information was acquired from manually selected local image regions centered on salient objects, so the generalization ability of the method is limited. The method of SALICON [
23] published by Huang X. et al. in ICCV2015 puts forward another effective solution. In their method, sub-sampled and full resolution images were fed into DNN models (e.g., AlexNet [
24]). Then, generated coarse and fine feature maps were concatenated and interpolated to produce a saliency map. Cornia M. et al. proposed a multi-level neural network for the detection of salient objects in ICPR2016 [
25]. In their work, with a feature encoding network, feature maps extracted from different layers of an end-to-end DNN model were well combined. They also proposed a new loss function to tackle the imbalance problem of saliency maps. Such a highly efficient end-to-end framework has also been applied in many other outstanding saliency detection models. Inspired by HED [
20], Hou Q. et al. proposed a deeply supervised salient object detection model by introducing short connections on skip layers in CVPR2017 [
26]. Different from previous methods, hierarchical side outputs produced from various layers were fused to form one integrated saliency map, and this helped to capture the multi-scale information of the target. Similarly, Wang W. et al. proposed a deep visual attention prediction model in TIP2018 [
27]. In their work, dense connections between skip layers were discarded from the organization structure of the reference [
26] so as to simplify the network architecture, at the expense of some performance degradation. The motivation for and impact of this trade-off are worth discussing further.
Since it addresses the similar task of object segmentation, the field of deep semantic segmentation offers achievements that can be referenced for the high precision detection of salient objects. The FCN proposed by Long J. et al. in CVPR2015 replaced the fully connected layers with
convolutional layers, and this provides the FCN with the ability to deal with arbitrary sizes of input images [
28]. As a strong baseline method, the FCN has been applied in a wide range of applications, including Mask-RCNN [
29]. However, the bilinear interpolation used to recover the resolution of extracted feature maps is the main shortcoming of the FCN, because this operation destroys the original spatial relationships between image pixels. In order to solve this problem, Badrinarayanan V. et al. proposed a deep fully convolutional neural network for semantic pixel-wise segmentation, termed SegNet, in PAMI2017 [
30]. The innovation of SegNet lies in its encoder–decoder architecture. The nonlinear resolution recovery through transposed convolution in the decoder part of SegNet achieves higher accuracy than the bilinear interpolation adopted by the FCN. However, the encoder–decoder architecture of SegNet is inclined to fall into over-fitting, even with abundant training samples. For better performance, Ronneberger O. et al. proposed a novel convolutional network for biomedical image segmentation called U-Net [
31]. The unique skip-layer architecture concatenates shallow and deep layers symmetrically, and this is conducive to the propagation of gradient information in DNN models [
32]. Zhao H. S. et al. proposed another effective model for the fusion of multi-scale information called PSPNet [
33]. With pyramid pooling and the integration of feature maps at various scales, PSPNet can extract both large and small objects with higher accuracy.
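The pyramid pooling idea can be sketched in a few lines of numpy: average-pool a feature map into several grid sizes, upsample each grid back by repetition, and stack the results with the original map. The bin sizes below are illustrative, not PSPNet's actual configuration:

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2)):
    """Pyramid pooling sketch for a single-channel square feature map:
    average-pool into b-by-b grids, upsample each grid back to the
    original size by repetition, and stack with the original map."""
    h, w = feat.shape
    pooled = [feat]
    for b in bins:
        grid = feat.reshape(b, h // b, b, w // b).mean(axis=(1, 3))
        pooled.append(np.repeat(np.repeat(grid, h // b, 0), w // b, 1))
    return np.stack(pooled)  # shape: (1 + len(bins), H, W)

feat = np.array([[1.0, 2.0], [3.0, 4.0]])
stack = pyramid_pool(feat, bins=(1, 2))
```

The coarsest bin (a single cell) captures global context, while finer bins preserve local structure; concatenating them gives the network simultaneous access to both.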
Based on convolutional neural networks (CNNs), the idea of feature clustering has also been widely applied in recent works of object detection and recognition. A recurrent convolutional neural network (RCNN)-based visual odometry approach for endoscopic capsule robots was proposed by Turan M. et al. in [
34]. On the one hand, just-in-moment features were well extracted by CNNs; on the other hand, by means of the RCNN, the dynamics across frames could also be inferred. Through the fusion of these two powerful deep-learning-based models, an outstanding monocular visual odometry method with high translational and rotational accuracy was achieved. Clustered features can also be used in intelligent systems for disease diagnosis. Połap D. et al. proposed a smart home system to diagnose the skin diseases of residents in the house [
35]. With the aid of the SIFT algorithm and histogram comparison, clusters of potential areas were first located. These cluster images were then fed into CNNs to find the real disease areas. With the aid of CNNs, Babaee M. et al. proposed a novel background subtraction method for video sequences [
36]. In their method, input frames along with the corresponding background images were processed patch-wise. These fused image patch stacks contained mixed information about the foreground and the background, so the segmentation could take the whole detection scene into account and produce high accuracy results. The idea of feature clustering can also be used in model optimization. A multi-threading mechanism that minimizes training time by rejecting unnecessary weight selections was proposed by Połap D. et al. in [
37]. The multi-core solution was utilized to select the best weights from all parallel trained networks. A systematic investigation of the impact of class imbalance on classification performance of convolutional neural networks (CNNs) was proposed by Buda M. et al. in [
38]. Frequently used methods including oversampling, undersampling, two-phase training, and thresholding were compared, and the optimal solution against class imbalance for deep learning was summarized. An unsupervised extraction of low or intermediate complexity visual features with spiking neural networks (SNNs) was proposed by Kheradpisheh S. R. et al. in [
39]. They introduced the technology of spike-timing-dependent plasticity (STDP) into a DNN and presented a temporal coding scheme to selectively fire neurons according to intensities of activation. With the combination of STDP and latency coding, the manner in which primate visual systems learn was well explored.
Based on reviews of related literature, connection mechanisms of various types of DNN models are deeply discussed in
Section 3. Through a comparative analysis of their merits and drawbacks, a scheme for the combinatorial optimization of shallow and deep connections will then be put forward.
3. Analysis and Optimization of the Connection Mechanism of DNNs
As mentioned above, existing DNN models for the detection of salient objects can be summarized into two typical architectures, i.e., the FCN-type architecture and the encoder–decoder type architecture. In this section, based on the analysis of various DNN model structures, a series of improvements will be carried out to generate the framework of our proposed DNN architecture.
3.1. Fully Convolutional Networks (FCNs)
FCNs have been adopted in many strong detection algorithms of salient objects (e.g., [
26,
27]). Based on comparisons of backbone network architectures, existing FCN-based saliency detection DNN models can be classified into three categories: (1) a single stream with a single output; (2) multi-streams with a single output; (3) a single stream with multiple side outputs. The model structures of the various types of FCNs are shown in
Figure 2.
(1) A single stream with a single output.
Figure 2a shows FCNs based on a single-stream backbone network. With the aid of convolutional layers, semantic features are extracted from the original input images. Afterwards, bilinear interpolation is applied to the extracted feature maps, and the locations of salient objects are obtained as a result. The fully convolutional architecture of FCNs makes the network capable of dealing with input images of arbitrary size. Meanwhile, the parameters shared between convolutional kernels also help to improve the efficiency of the network.
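The bilinear interpolation step can be illustrated in 1-D. Note that the interpolation weights are fixed rather than learned, which is the limitation of this recovery mode discussed later:

```python
import numpy as np

def bilinear_upsample_1d(x, factor):
    """Fixed (non-learnable) linear interpolation between neighbouring
    samples, the 1-D analogue of the bilinear upsampling used by FCNs
    to recover resolution."""
    n = len(x)
    pos = np.linspace(0, n - 1, factor * (n - 1) + 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = pos - lo
    return (1 - frac) * x[lo] + frac * x[hi]

y = bilinear_upsample_1d(np.array([0.0, 1.0]), factor=2)
```

Every upsampled value is a fixed blend of its two neighbours, regardless of image content; a transposed convolution replaces these fixed blends with learned ones.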
(2) Multi-streams with a single output.
Based on initial explorations of detection algorithms for salient objects, DNN models with multi-stream architecture shown in
Figure 2b have been widely applied. With the aid of pyramid spatial transformation, multi-scale information is extracted from original input images with various sizes [
40]. This approach has also been referenced by authors using traditional methods [
41].
(3) A single stream with multiple side outputs.
The third architecture of FCNs exhibited in
Figure 2c is inspired by HED architecture. The main difference between
Figure 2a,c is that the former exports only a single prediction from the network. By contrast, the latter generates multi-level predictions from the various hidden layers of the backbone network. With supervision directly propagated back to the hidden layers, this architecture helps networks converge quickly to a global optimal solution. Moreover, it can be regarded as a lightweight version of the multi-stream FCNs shown in
Figure 2b, and the quantity of parameters can be well controlled accordingly.
As a strong baseline architecture, FCNs still have drawbacks that need to be overcome. Spatial information among the pixels of input images is lost during max-pooling and/or strided convolution operations. In addition, backbone networks can be further strengthened to enhance feature extraction and to help locate salient objects with higher accuracy [
42].
3.2. Deep Convolutional Encoder–Decoder Networks (EN-DE Nets)
In order to overcome the drawbacks of FCNs, deep convolutional encoder–decoder networks (EN-DE nets) introduce many improvements in terms of model structure. Like FCNs, the architectures of EN-DE nets can be divided into three categories: (1) single-stream EN-DE nets; (2) EN-DE nets with a skip-layer architecture; (3) EN-DE nets with multiple side outputs. These structures are shown in
Figure 3.
(1) Single-stream EN-DE nets.
The architecture of EN-DE nets with single stream backbone networks exhibited in
Figure 3a can be separated into two symmetric parts. The former, called the "encoder", is in charge of feature extraction, and the latter, termed the "decoder", is responsible for the corresponding resolution reconstruction of the feature maps. The loss between the prediction and the ground truth is back-propagated to adjust the weights of the hidden neurons. However, once the outputs of neurons in the shallow layers fall into insensitive regions of the activation function, gradient information cannot be effectively transmitted through those layers. With increasing depth, such a single-stream architecture tends to become trapped in over-fitting. The detailed reason for this phenomenon is illustrated in [
16].
(2) EN-DE nets with a skip-layer architecture.
Inspired by the bottleneck layers adopted in the residual module, EN-DE nets with a skip-layer architecture (e.g., U-Nets) help to propagate the flow of information and prevent the model from over-fitting. Meanwhile, feature maps extracted from shallow layers with a smaller receptive field focus on the details of an object. By contrast, maps extracted from deep layers with a larger receptive field focus on global views. With the aid of the skip-layer architecture, feature maps with various receptive field sizes are well fused, which is beneficial for achieving higher precision detection of salient objects.
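The skip-layer fusion can be sketched as a channel-wise concatenation of encoder and decoder feature maps of the same spatial size (a simplified U-Net-style sketch; the shapes are illustrative):

```python
import numpy as np

def skip_concat(decoder_feat, encoder_feat):
    """U-Net-style skip connection: concatenate an encoder feature map
    (fine detail, small receptive field) with the decoder feature map
    of the same spatial size (global view) along the channel axis.
    Feature maps are (channels, height, width) arrays."""
    assert decoder_feat.shape[1:] == encoder_feat.shape[1:]
    return np.concatenate([decoder_feat, encoder_feat], axis=0)

dec = np.ones((4, 8, 8))    # hypothetical decoder features, 4 channels
enc = np.zeros((2, 8, 8))   # hypothetical encoder features, 2 channels
merged = skip_concat(dec, enc)
```

Subsequent convolutions over the merged channels let the network weigh detail against context, and the skip path also gives gradients a short route back to the shallow layers.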
(3) EN-DE nets with multiple side outputs.
Compared with
Figure 3b, the architecture shown in
Figure 3c integrates multiple side outputs with various sizes of receptive field and has already been applied, e.g., [
43]. Meanwhile, this multi-level output architecture also exhibits outstanding performance in the task of object detection (e.g., feature pyramid networks (FPNs) [
44]). Based on the fusion of multi-scale feature maps, further modifications of
Figure 3c will be discussed in
Section 3.3.
3.3. Deep Convolutional Models with Combinations of Shallow and Deep Connections
A multi-level side output architecture helps to integrate attention information from different layers of DNN models, and this is beneficial for achieving more accurate detection of variously sized salient objects [
13,
25]. In this section, through an analysis of the merits and drawbacks of deep networks with combinations of short connections based on an FCN architecture, our framework of a deep network with combinations of shallow and deep connections based on an EN-DE architecture is put forward, integrating improvements in multiple aspects.
(1) A deep network with combinations of short connections based on an FCN architecture.
The model shown in
Figure 4 was first published in [
26] as an improvement of that shown in
Figure 2c. The authors also compared various short-connection patterns, and the model shown in
Figure 4 outperformed the compared architectures in terms of accuracy. Intuitively, the main difference between these two models shown in
Figure 4 and
Figure 2c lies in the addition of short connections between the outputs generated from various layers of the backbone network. With the aid of this new model, the global and local information of salient objects can be well fused, and higher accuracy segmentation results are achieved as a consequence. Although it outperforms many state-of-the-art DNN models, based on the comparison of FCNs and EN-DE nets stated above, a series of improvements can still be applied to this structure to arrive at our DNN model.
(2) A deep network with combinations of shallow and deep connections on an EN-DE architecture.
Based on architectures of HED and densely connected convolutional networks [
45], our adopted network architecture in this paper, i.e., a deep network with combinations of shallow and deep connections based on an EN-DE architecture, is exhibited in
Figure 5. Three main advantages are integrated into this DNN architecture. First, backbone networks based on EN-DE nets with a skip-layer architecture help the model converge faster and better [
32]. Meanwhile, nonlinear resolution reconstruction based on symmetric EN-DE nets can also help recover the context information of pixels with higher accuracy. Second, the integration of multi-level side outputs generated from different layers helps to integrate the multi-scale saliency information. Third, shallow and deep connections on various side outputs help to fuse the global and local information of salient objects. Along with the advantages stated above, the details of this architecture will be discussed in
Section 4.
5. Performance Analysis and Assessment
Through comprehensive analysis, the characteristics of the DNN model based on combinations of shallow and deep connections have been illustrated in depth. In this section, through comparisons with other strong baseline methods, the performance of our DNN model will be comprehensively evaluated on extensive saliency detection benchmark datasets. According to the experimental results, the merits and drawbacks of our model will be analyzed, and applicable scenarios will be discussed.
5.1. Benchmark Datasets and Evaluation Indices
5.1.1. Benchmark Datasets
In order to evaluate the performance, comparative experiments were implemented on three widely used saliency detection benchmark datasets: ECSSD [
50], MSRA-10K [
52], and iCoSeg [
53,
54].
For the ECSSD dataset, the sizes of the original images and their corresponding masks are exactly the same, mainly , , and . RGB images are stored in jpg format, and the gray-scale mask images in png format. The ECSSD dataset contains 1000 image samples. Moreover, objects in ECSSD usually have complex appearances, which makes the dataset very suitable for evaluating the descriptive ability of DNN models.
For the MSRA-10K dataset, the sizes of the original images and their corresponding masks are also the same. The range of image size mainly covers , , etc. RGB images are stored in the format of jpg. The gray scale mask images are stored in the format of png. The MSRA-10K dataset has a large quantity of image samples from hundreds of different categories. Most of these images only include one main salient object near the center area.
For the iCoSeg dataset, as with ECSSD and MSRA-10K, each RGB image has a corresponding mask of the same size. The image size is mainly , , , or . RGB images are stored in jpg format, and the gray-scale mask images in png format. As a small dataset originally designed for co-segmentation, iCoSeg contains only 643 images. Meanwhile, some images in iCoSeg include multiple objects with complex shapes.
According to the design of our proposed DNN model structure, before training images are imported into the DNN model, all of these images and their corresponding masks should be normalized to the size of
for all three benchmark datasets. The entire process can also be seen very clearly from the system flow chart shown in
Figure 18.
5.1.2. Evaluation Indices
In order to assess the performance, four universally agreed, standard evaluation metrics were adopted: P-R curves, F-measure, intersection-over-union (IoU), and mean absolute error (MAE). Meanwhile, time consumption was also counted to evaluate the efficiency of the compared DNN models. P-R curves, F-measure, and the IoU score are illustrated above, and here we only provide the expression of MAE as follows.
Assuming that the width and height of the original RGB image are $W$ and $H$, the MAE score can be calculated by Equation (8):
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - Z(x, y) \right|, \qquad (8)$$
where
$S$ stands for the binary saliency map segmented with a threshold of twice the average value, and
$Z$ stands for the corresponding ground truth. MAE is useful for evaluating the applicability of a model, as it reveals the numerical distance between the saliency map and the ground truth.
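The evaluation protocol described above (binarization at twice the mean value, then precision, recall, F-measure, IoU, and MAE) can be sketched as follows. The value β² = 0.3 is the weighting commonly used in the saliency literature and is an assumption here, since this section does not restate the paper's β:

```python
import numpy as np

def evaluate(saliency, gt, beta2=0.3):
    """Binarize a continuous saliency map at twice its mean value,
    then compute precision, recall, F-measure, IoU, and MAE.
    Following Equation (8), MAE is computed on the binarized map S
    against the ground truth Z. beta2 = 0.3 is an assumed default."""
    s = (saliency >= 2 * saliency.mean()).astype(float)
    tp = (s * gt).sum()
    prec = tp / max(s.sum(), 1e-8)
    rec = tp / max(gt.sum(), 1e-8)
    f = (1 + beta2) * prec * rec / max(beta2 * prec + rec, 1e-8)
    iou = tp / max(np.maximum(s, gt).sum(), 1e-8)
    mae = np.abs(s - gt).mean()
    return prec, rec, f, iou, mae

sal = np.array([[1.0, 0.1], [0.9, 0.0]])  # toy continuous saliency map
gt = np.array([[1.0, 0.0], [1.0, 0.0]])   # toy binary ground truth
prec, rec, f, iou, mae = evaluate(sal, gt)
```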
5.2. Platform and Implementation Details
We constructed the DNN model with combinations of shallow and deep connections on the platform of Keras 2.0.9 with TensorFlow 1.2 as the backend. As mentioned above, the encoder part was designed as an ImageNet pre-trained ResNet50 without the fully connected layers. We invoked the weights from the Keras library instead of training from scratch. The rest of the model was randomly initialized and adjusted during training. All of the weights were adjustable, with none frozen.
The algorithm of Adam [
51] was adopted as the optimizer. During training, in order to converge stably to the optimal solution, an adjustment schedule was designed for the learning rate. Specifically, over the whole 45-epoch training process, the learning rate was set to 0.0022, 0.00022, and 0.000022 for epochs 1–30, 31–40, and 41–45, respectively. The other hyperparameters of Adam remained unchanged throughout training.
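The piecewise-constant schedule above can be expressed as a simple epoch-to-rate function. In Keras this would typically be passed to a LearningRateScheduler callback, which indexes epochs from 0, so an offset of one would be needed there:

```python
def learning_rate(epoch):
    """Piecewise-constant schedule from the paper (1-indexed epochs):
    0.0022 for epochs 1-30, 0.00022 for 31-40, 0.000022 for 41-45."""
    if epoch <= 30:
        return 0.0022
    if epoch <= 40:
        return 0.00022
    return 0.000022
```

Each drop divides the rate by 10, a common heuristic for letting the optimizer settle into progressively finer minima.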
When training finished, the best model, with the minimum loss on the validation set, was saved. It took about 2.5, 30, and 2 h to train the model on the ECSSD, MSRA-10K, and iCoSeg datasets, respectively. Two NVIDIA GEFORCE 1070Ti GPUs with 16 GB of total memory, operating in multi-GPU mode, provided the computing power. All experiments were carried out on this platform unless otherwise noted.
5.3. Performance Assessment by Verification on ECSSD
In order to evaluate the performance, five state-of-the-art deep-learning-based saliency detection methods including deep networks with short connections (DSCs) [
26], deep visual attention networks (DVAs) [
27], networks of static saliency (NSSs) [
55], networks of dynamic saliency (NDSs) [
55], and deep multi-level networks (DMLNs) [
25] were applied for comparison with our DNN model. Meanwhile, five conventional saliency detection methods, i.e., RBD [
56], DSR [
57], MC [
58], GR [
59], and CA [
60] were also employed, and source codes were all obtained from the project website of [
21]. First, on the ECSSD dataset, P-R curves of these compared methods were drawn, which can be seen in
Figure 19.
In
Figure 19, it can be seen that our model achieves the best performance. With a stronger backbone network based on an encoder–decoder architecture, our model outperforms DSC, which is built on FCNs. The DVA and DMLN models were also built on the FCN architecture; moreover, in these two models, feature maps with different receptive field sizes were merged directly, without fusion between one another. Similar situations also occurred for NSS and NDS: although these two models are based on the encoder–decoder architecture, their single-stream structures do not utilize the feature maps generated from the various layers of the network. Therefore, useful multi-scale information regarding salient objects was discarded, resulting in relatively poor performance.
Extensive indices, including precision, recall, F-measure, IoU, and MAE, were also adopted to assess the performance of the compared DNN models. In the comparison, the continuous saliency maps were first binarized by threshold segmentation. For the sake of fairness, the threshold was set to twice the average value of the saliency map for every saliency detection DNN model. The comparison results are shown in
Figure 20.
In
Figure 20, it can be seen that our model has obvious advantages in both precision and recall. This is mainly due to the coordination of multiple key modules. With the aid of the fusion of multiple side outputs, our model captures the local and global information of salient objects comprehensively. In addition, the IoU and MAE scores indicate that our model not only locates salient objects accurately but also keeps the probability of false alarms in irrelevant background areas at a low level.
Finally, a visual comparison of saliency maps is exhibited in
Figure 21. The first column exhibits the original RGB images, and the second column exhibits their corresponding binary ground truth masks. The third and later columns show the continuous saliency maps generated by the various compared methods. It can be clearly observed that our saliency maps are the most similar to the ground truth and highlight the salient objects with high accuracy.
5.4. Performance Assessment by Verification on MSRA-10K
Similarly, the performance of our proposed salient object detection model was also evaluated on the large-scale MSRA-10K dataset. Because the generalization ability of our DNN model has been well validated through 10-fold cross-validation in
Section 4.3.3, here we randomly selected 6000 samples from the dataset as the training set. The remaining 4000 samples were utilized as the test set to evaluate the performance of the trained DNN models. For the sake of fairness, all models were trained on the same training set. P-R curves are shown in
Figure 22.
In
Figure 22, it can be seen that our model still exhibits the best performance. Compared with the ECSSD dataset, objects in MSRA-10K generally have relatively simple shapes. As a consequence, the performance of all the compared saliency detection DNN models improves. Meanwhile, other indices were also compared, and the experimental results are shown in
Figure 23.
In
Figure 23, it can be seen that our detection model of salient objects shows obvious advantages compared with other DNN models. This superiority is mainly due to the design of the network architecture. The coordination of multiple key modules makes our model able to locate the most salient object with high accuracy. In addition, our model also demonstrates a strong ability in terms of recall and other performance indices.
Finally, a visual comparison of saliency maps on the MSRA-10K dataset is provided in
Figure 24. Like
Figure 21, the first and second columns exhibit the original RGB images and their corresponding ground truth masks, respectively. The third and later columns exhibit the saliency maps generated by the compared models. From the comparison of saliency maps generated by the various methods, the advantages of our salient object detection model can be observed intuitively.
5.5. Performance Assessment by Verification on iCoSeg
Based on a small-scale dataset, we also validated the performance of the DNN models on the iCoSeg dataset. Compared with MSRA-10K, the capacity of iCoSeg is too small for training a deep network with the structure shown in
Figure 6. However, with the aid of the unique multiple-side-output architecture, supervision can be directly propagated back to the hidden layers, which helps the network converge quickly to a global optimal solution. Meanwhile, the skip-layer architecture can also prevent the network from falling into over-fitting. Therefore, even with a smaller training set, our model can still perform well. Specifically, from the total of 643 image samples, 450 were randomly selected as the training set, and the remaining images were utilized to validate the performance of the trained DNN models. For the sake of fairness, all of the models were trained on the same training set, and the P-R curves are shown in
Figure 25.
In
Figure 25, it can be seen that the precision of our DNN model drops sharply only when the recall value approaches 1. For most of the other compared methods, precision drops gradually as recall increases. This phenomenon indicates that the saliency maps generated by our proposed model have very high contrast.
Meanwhile, we also compared other performance indices between these contrastive methods. It can be seen in
Figure 26 that, from a comprehensive comparison of the various indicators, our DNN model exhibits strong capability compared with the other DNN models. An outstanding model structure design leads to good performance, even on smaller-scale datasets.
For an intuitive visual comparison, saliency maps generated from various DNN models are provided in
Figure 27. It can be clearly seen that the saliency maps generated by our model are the most similar to the ground truth masks. The superiority of our model has thus been fully validated.
In order to further improve the performance of our model on small-scale datasets (e.g., iCoSeg), we also introduced the techniques of transfer learning and data augmentation into the training process. Specifically, we first performed off-line data augmentation to extend the capacity of the original dataset. The data augmentation operations and some off-line augmented image samples are shown in
Figure 28.
For each image, nine corresponding samples were generated, with operations listed in
Figure 28b. With the aid of data augmentation, the size of the training set was expanded to 10 times the original (e.g., 4500 images for the iCoSeg training set). However, the correlation between these augmented image samples is very high, which is a drawback. Thus, we introduced the technique of fine-tuning to further improve the performance. Specifically, the weights of the model pre-trained on MSRA-10K were introduced into the new model, which was then further trained on the augmented iCoSeg dataset. In order to evaluate the improvement in performance, the evolution curves of the validation loss during training are shown in
Figure 29.
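The off-line augmentation step can be sketched generically. The exact nine operations of Figure 28b are not reproduced here; flips and 90-degree rotations are merely common choices, and in practice the same transform must also be applied to each ground-truth mask:

```python
import numpy as np

def augment(image):
    """Illustrative off-line augmentation: the original image plus
    horizontal/vertical flips and three 90-degree rotations (six
    variants in total). This is a generic sketch, not the paper's
    exact nine operations from Figure 28b."""
    variants = [image, np.fliplr(image), np.flipud(image)]
    variants += [np.rot90(image, k) for k in (1, 2, 3)]
    return variants

img = np.arange(4).reshape(2, 2)  # toy 2x2 "image"
samples = augment(img)
```

Because the variants are deterministic transforms of one image, they are highly correlated, which is exactly the drawback noted above and the reason fine-tuning from MSRA-10K weights was added.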
In
Figure 29, it can be seen that, by means of data augmentation, the performance of our model improves remarkably. Data augmentation and transfer learning have become standard approaches for improving the performance of DNN models. The experiment implemented here simply shows that the performance of our model still has room for further improvement. Finally, quantitative experimental results of the deep-learning-based saliency detection methods on these benchmark datasets are shown in
Table 6.
From records shown in
Table 6, it can be seen that our model shows outstanding accuracy. However, the massive number of adjustable parameters in the symmetric network structure also reduces efficiency. The original intention of our network design was to improve the precision of salient object detection. Owing to the upgraded backbone network and the introduction of the fusion of multiple side outputs with shallow and deep connections, the number of parameters has increased. This is why our network obviously lags behind FCNs in terms of testing time.
6. Discussion and Comments
The effective extraction of deep features is vital for achieving high accuracy saliency detection. Generally speaking, deeper backbone networks improve the feature extraction capability of DNN models. However, with increasing layers and limited training samples, a model tends to become trapped in over-fitting. Skip-layer architectures connecting various depths of DNN models assist the transmission of the data flow. Benefitting from this, DNN models based on the encoder–decoder architecture outperform FCNs by a large margin.
In addition, global and local vision cues are both very important for locating salient objects in complex detection scenes. However, conventional saliency detection methods only utilize the final output of DNN models. The inherent hierarchical structure of DNN models can be used to extract multi-scale feature maps with various receptive field sizes. Through the fusion of feature maps extracted from the various layers of the DNN model with shallow and deep connections, different scales of information about salient objects are comprehensively utilized. In this way, the detection of salient objects is significantly improved.
Through comprehensive evaluations on benchmark datasets, the experimental results reveal that, with these improvements, our model yields state-of-the-art accuracy in the detection of salient objects. Through a series of comparisons, the effectiveness of combinations of shallow and deep connections has also been well validated. In our model, we did not deliberately choose the best network; what we want to emphasize is the idea of reinforcing saliency detection performance through the fusion of multi-scale feature maps on a symmetric encoder–decoder architecture. As research develops, stronger backbone networks will continue to be put forward. Based on the architecture proposed in this paper, the backbone network can easily be replaced by these stronger networks, and the performance of our model can be further improved accordingly.