Saliency Detection Based on Low-Level and High-Level Features via Manifold-Space Ranking

Abstract: Saliency detection, an active research direction in image understanding and analysis, has been studied extensively. In this paper, to improve the accuracy of saliency detection, we propose an efficient unsupervised salient object detection method. First, after segmenting the image into regions at different scales, we extract local low-level features for each superpixel, which helps to locate the approximate positions of salient objects. Then, we use a convolutional neural network to extract high-level, semantically rich features as complementary features for each superpixel; the low-level and high-level features of each superpixel are combined into a new feature vector used to measure the distance between superpixels. Finally, we use a manifold-space ranking method to calculate the saliency of each superpixel. Extensive experiments on four challenging datasets show that the proposed method surpasses state-of-the-art methods and produces results closer to the ground truth.


Introduction
Saliency detection, an important research task related to human visual and cognitive systems, has been widely studied. The purpose of saliency detection is to extract the most attractive objects from an image, and it is applied in various visual tasks such as visual tracking [1], image retargeting [2,3], object recognition [4,5], and image segmentation [6,7].
Early conventional methods [8][9][10][11][12][13][14][15] usually combine low-level features to capture salient objects. These low-level features include color, gradient, texture, and spatial distribution. Since color is the most salient cue in human vision, many methods based on center-surround contrast exploit color features to distinguish different regions of an image. Itti et al. [8] proposed a saliency detection method based on center-surround contrast, extracting color and orientation features at multiple scales to compute cross-scale similarity. Liu et al. [10] first shifted the goal of saliency detection from eye-fixation prediction to salient object detection, proposing a binary segmentation algorithm that trains a conditional random field and generates saliency maps from color histogram features. Wei et al. [16] exploited background prior knowledge and computed the saliency of an image patch from its color difference to the image boundary: the smaller the difference to the boundary, the greater the likelihood that the patch belongs to the background. However, this method is less effective when the foreground appears in the boundary region. Building on [16], Yang et al. [17] proposed a graph-based detection method with manifold ranking (MR). The method uses the mean color feature to represent nodes and assumes the boundary regions are connected, measuring the saliency of each superpixel by its relevance to the boundary regions in color space. To represent the color features of nodes more comprehensively and improve the precision of the MR method, Li et al. [18] extended MR by additionally extracting color histogram features. The saliency of each node was calculated with manifold ranking in both the mean-color and the color-histogram feature spaces, and the saliency maps obtained from the two color feature spaces were fused to form the final maps.
The above color-based methods may lose detailed but discriminative information, especially for low-contrast images. To address this issue, researchers have proposed methods based on multi-view features [19,20]; for instance, texture and structure are often exploited to supplement color features. Wang et al. [21] proposed a feature-integration detection method in which multiple regional feature vectors are mapped to saliency via supervised learning, and the salient values of different regions then form the final saliency map. Zhang et al. [22] extracted LBP, HOG, and DRFI features of an image; a cascade scheme makes this effective for diverse foreground regions and background objects. Deng et al. [23] extracted multiple image features and generated initial salient regions with the manifold ranking method, then improved the initial saliency map with an iterative optimization of the hyperparameters. These methods combine multiple features into one feature vector and generate the final saliency maps with bottom-up schemes.
These conventional saliency detection methods usually yield promising results. Nevertheless, without feature training and learning, they have limitations on low-contrast images. This drawback is more easily addressed by saliency detection methods based on deep features, as can be seen in Figure 1. Moreover, a low-level feature is usually good in only one particular respect: color is advantageous for differentiating image appearance, texture for gray-level spatial distribution, and the frequency spectrum for differentiating energy modes [24,25]. It is also often unrealistic to extract many different features with a single algorithm. Compared with bottom-up methods, deep learning methods have achieved a great level of performance [26][27][28][29]. Shen et al. [30] represented object regions with sparse noise and background regions with a low-rank matrix; to achieve better performance, high-level features were incorporated to produce prior maps. Li et al. [31] extracted high-level features from a deep convolutional neural network (CNN) at three different scales. The high-level feature maps and a spatial consistency model were applied to each multi-level segmentation, and the refined salient maps were fused into a final salient map. Zhao et al. [32] proposed a multi-context deep learning method using both local features and context. Girshick et al. [33] applied a convolutional neural network to a bottom-up model to locate and segment objects.
Deep neural networks facilitate the learning of complex high-level features and models. Deep features can discriminate objects in an image but have difficulty locating the precise boundaries of those objects. As evident in Figure 1, saliency maps that use only deep features do not locate objects precisely.
It is not optimal to use hand-crafted features or deep features alone in saliency detection; each has its own advantages, and they can represent images complementarily. Deep features can estimate the coarse spatial location of an object in an image. The later layers of a deep network are weak at delineating the boundaries between objects and background regions but are closely related to semantic classification. Meanwhile, low-level features allow us to evaluate appearance similarities between different image patches. The aim of saliency detection is to segment foregrounds from backgrounds precisely rather than to identify their semantic classes. Thus, in this paper, on the basis of [18], we incorporate semantic deep features extracted from a CNN together with hand-crafted features in an unsupervised way, and a ranking method in manifold space is used to construct the final saliency maps.
In summary, our method has three main contributions: (1) We propose an efficient multi-feature manifold ranking algorithm to detect salient objects. We use a feature vector that includes both low-level and high-level features of the input image to represent image regions, which can better capture object regions where conventional saliency detection may fail. (2) On the new feature vector, we utilize the complementary relationship among multiple features to further refine the detection results; the saliency maps are then formed with the manifold-space ranking method. (3) Empirical experiments show that our method achieves better results in terms of evaluation metrics than comparative state-of-the-art methods.

Related Work

Graph-Based Manifold Ranking
Following the method of [17], suppose an input image contains n regions X = {x_1, ..., x_l, x_{l+1}, ..., x_n}, where the n regions are disjoint sets. We select some regions as marked query seeds, and the other regions are ranked according to their relevance to the query seeds. A graph-based method must first compute the affinities of all regions by constructing an affinity matrix, an important step for obtaining a higher detection quality. The affinity matrix W is defined by the weight w_ij between nodes i and j:

w_ij = exp( -||n_i - n_j|| / δ² ),    (1)

where w_ii = 0 to avoid self-reinforcement, n_i and n_j are the extracted features of nodes i and j, and δ is a coefficient that controls the strength of the weight.
The ranking method can be regarded as operating on a manifold structure [22]. Let f = [f_1, ..., f_n]^T be a ranking function that assigns a score f_i to each node, and let y = [y_1, ..., y_n]^T be an indicator vector with y_i = 1 if x_i is a query seed and y_i = 0 otherwise. A graph G = (V, E) is then constructed, in which V is the set of n regions and the edges E are weighted by the affinity matrix W. The ranking results are obtained by solving the optimization problem of Equation (2):

f* = arg min_f (1/2) ( Σ_{i,j=1}^{n} w_ij || f_i/√(d_ii) − f_j/√(d_jj) ||² + µ Σ_{i=1}^{n} || f_i − y_i ||² ),    (2)

where f* denotes the ranking results and µ is a balancing constant. Setting the derivative of Equation (2) to zero yields the ranking function [17]:

f* = (I − αS)^{−1} y,    (3)

where I is the identity matrix, α = 1/(1+µ), and S = D^{−1/2} W D^{−1/2} is the normalized affinity matrix, with degree matrix D = diag{d_11, ..., d_nn} and d_ii = Σ_j w_ij. Using the unnormalized formulation instead, Equation (3) becomes [17]:

f* = (D − αW)^{−1} y.    (4)
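As a concrete illustration, the closed-form ranking of Equations (1)-(3) can be sketched in a few lines of NumPy. This is our own minimal sketch on toy inputs, not the authors' implementation; the function name and the example feature matrix are assumptions.

```python
import numpy as np

def manifold_ranking(features, y, delta2=0.1, alpha=0.99):
    """Closed-form graph-based manifold ranking (Equations (1)-(3)).

    features : (n, d) array, one feature vector per superpixel node.
    y        : (n,) indicator vector, 1 for query seeds, 0 otherwise.
    Returns the ranking score f* for every node.
    """
    # Equation (1): w_ij = exp(-||n_i - n_j|| / delta^2), with w_ii = 0.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    W = np.exp(-dist / delta2)
    np.fill_diagonal(W, 0.0)                 # avoid self-reinforcement

    d = W.sum(axis=1)                        # degree entries d_ii
    S = W / np.sqrt(np.outer(d, d))          # S = D^{-1/2} W D^{-1/2}

    # Equation (3): f* = (I - alpha * S)^{-1} y.
    return np.linalg.solve(np.eye(len(y)) - alpha * S, y)
```

On a toy input with one query seed, nodes whose features are close to the seed receive higher ranking scores than distant nodes, which is the behavior the boundary-query scheme below relies on.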

The Proposed Method
Our method has three steps. First, we extract low-level features and high-level features from a deep neural network to measure the similarity of different image patches. Then, the saliency of the image patches is computed in manifold space based on boundary priors. Finally, we fuse four coarse saliency maps to produce the final results. The overview of our method is depicted in Figure 2. We first use the efficient SLIC algorithm [34] to segment the image into irregular regions. An irregular region consists of adjacent pixels with similar features such as texture, color, and brightness; each region is regarded as a superpixel, or node. In this paper, multiple features are extracted for each node, and these features help to capture more information about each node than color features alone.
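The segmentation step can be sketched with scikit-image's SLIC implementation. The helper below is our own illustrative code, not the authors'; the `compactness=10` value is an assumed default, and the per-superpixel mean CIELab color it returns corresponds to the "mean color" low-level feature described next.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import slic

def superpixel_mean_colors(image, n_segments=200):
    """Segment an RGB image into SLIC superpixels and return the label map
    together with each superpixel's mean CIELab color."""
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    lab = rgb2lab(image)
    means = np.zeros((labels.max() + 1, 3))
    for k in range(labels.max() + 1):
        mask = labels == k
        if mask.any():                       # guard against unused labels
            means[k] = lab[mask].mean(axis=0)
    return labels, means
```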

Low-Level Feature Extraction
We extract five of the most common visual features to represent each superpixel:
- Mean color. The mean color feature is the average color of all pixels in a superpixel, extracted in the CIELab color space. This yields three features.
- Color histogram. A color histogram describes the statistical distribution and basic hue of the colors in an image. In the CIELab color space, we quantize the L channel into 16 bins and the a and b channels into 8 bins each. This yields 16 + 8 + 8 = 32 features.
- Gabor filters. Gabor filters are bandpass filters commonly used as a preprocessing step for feature extraction and texture analysis. The impulse response of these filters is the product of a complex oscillation and a Gaussian envelope. We extract Gabor filter responses in 12 directions at 3 scales, with the minimum filter bandwidth set to 3, the scaling factor set to 2, and the ratio of the standard deviation of the Gaussian envelope to the center frequency of the filter set to 0.65. This yields 12 × 3 = 36 features.
- Local binary pattern (LBP) [35]. LBP is a texture feature that is invariant to illumination, computed from the differences between a pixel and its neighbors. This yields one feature.
- Spectral residual. The spectral residual is computed on each color channel separately: a Fourier transform is applied to the image, the magnitude components are attenuated, and the frequency distribution is converted back to a gray-level distribution with the inverse Fourier transform. This yields one feature.
After extracting the low-level features, we combine these 73 features into a single feature vector.
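As an illustration of how the color part of this 73-dimensional descriptor might be assembled, the following sketch (our own, with assumed value ranges and an assumed joint normalization of the histogram) computes the 3 mean-Lab values and the 16/8/8-bin histogram for one superpixel:

```python
import numpy as np

def color_descriptor(lab_pixels):
    """Color part of the low-level descriptor for one superpixel:
    3 mean-CIELab values plus a 16/8/8-bin Lab histogram (35 of the 73 dims).
    lab_pixels : (m, 3) CIELab pixels of the superpixel, with L assumed in
    [0, 100] and a, b assumed in [-128, 127]."""
    mean_color = lab_pixels.mean(axis=0)                               # 3 dims
    h_L, _ = np.histogram(lab_pixels[:, 0], bins=16, range=(0, 100))   # 16 dims
    h_a, _ = np.histogram(lab_pixels[:, 1], bins=8, range=(-128, 127)) # 8 dims
    h_b, _ = np.histogram(lab_pixels[:, 2], bins=8, range=(-128, 127)) # 8 dims
    hist = np.concatenate([h_L, h_a, h_b]).astype(float)
    hist /= hist.sum()                     # normalize counts to a distribution
    return np.concatenate([mean_color, hist])
```

The Gabor, LBP, and spectral-residual features would be appended in the same way to reach the full 73 dimensions.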

High-Level Feature Extraction
Low-level features measure the differences between nodes in terms of image appearance. To represent image information more comprehensively, high-level features have been used in image-processing tasks, and saliency detection methods based on CNNs have recently achieved superior performance [36,37]. We therefore exploit the pretrained VGGNet-19 [36] to extract semantic information from an image. As features propagate to deeper layers, the semantic separation between objects and background is strengthened while the spatial resolution is gradually reduced. On the other hand, the earlier layers of VGGNet-19 retain higher spatial resolution for the precise localization of objects. For saliency detection, we care more about the accurate locations of objects than about their semantic categories. We thus employ the semantic representations of the later layers to discriminate obvious category differences from the background and use the information of the earlier layers to recover the precise localization of objects. We visualize the upsampled outputs of the conv1-2, conv2-2, conv3-4, conv4-4, and conv5-4 convolutional layers in Figure 3. As shown there, the conv1-2 layer carries precise localization information similar to a hand-crafted low-level feature map, the conv3-4 and conv4-4 layers capture large appearance changes, and the conv5-4 layer is poor at locating object contours because of its coarse spatial resolution [36]. There are two reasons for using a pretrained CNN. First, since a pretrained CNN requires no fine-tuning, extracting deep features becomes easier and more convenient. Second, since the pretrained model was trained on images from 1000 classes, it can help to recognize unknown objects well in a model-free way [36]. In this paper, we use the responses of the conv1-2, conv3-4, and conv4-4 layers as image representations, with 64, 256, and 512 features, respectively. These deep features and the hand-crafted features are concatenated into a new image feature space used to evaluate differences between image patches. The new feature vector is summarized in Table 1. For computational efficiency, we use the PCA (principal component analysis) algorithm [38] to eliminate redundant information, reducing the feature vector to a smaller number of principal components.
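The fusion-and-reduction step can be sketched with scikit-learn's PCA. This is our own sketch; the number of retained components below is an illustrative choice, since the paper does not state how many principal components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_and_reduce(low_feats, deep_feats, n_components=64):
    """Concatenate the per-superpixel low-level (n, 73) and deep (n, 832)
    descriptors (832 = 64 + 256 + 512 from conv1-2, conv3-4, conv4-4),
    normalize each dimension, and reduce with PCA to remove redundancy."""
    X = np.hstack([low_feats, deep_feats])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)   # per-dimension scaling
    n_components = min(n_components, *X.shape)           # PCA dimensionality cap
    return PCA(n_components=n_components).fit_transform(X)
```

The per-dimension scaling keeps the 512-dimensional conv4-4 block from dominating the 73 hand-crafted dimensions before the projection.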
Our method does not need to train on many images, so obtaining node features is straightforward. To show the effect of the new feature vector on the saliency detection task more intuitively, we apply several variants based on different features to the same images; the results are shown in Figure 1. The MFMR method [18] is based on color features, Figure 1d shows our method using only low-level features, and Figure 1e shows our method using only high-level features. Our full method detects the salient objects better than any variant using a single type of feature.

Ranking with Background Seeds
We extract the multiple features of each node to form a new feature vector, and the distance between two nodes in Equation (1) is measured with this vector. It is well known that background regions and boundary regions often have similar appearance [17,39]; that is, salient objects are relatively uniform in structure and spatial distribution, which the affinity matrix [17] also reflects. Many previous saliency detection methods have applied the boundary prior, which is more effective than the center prior [1,5]. In this paper, we also use the boundary regions as background queries. The nodes in the four boundary regions of the image are first marked as background seeds, and the saliency values of the other nodes are computed from their relevance to these seeds. Specifically, we use the boundary priors to obtain four saliency maps and then fuse them into a coarse saliency map, as shown in Figure 2. The fusion equation [17] is

P_bg(i) = P_t(i) × P_b(i) × P_l(i) × P_r(i),    (5)

where the saliency maps P_t, P_b, P_l, and P_r are generated by using the top, bottom, left, and right boundaries as background seeds, respectively. The coarse saliency map obtained by the above steps does not highlight the salient objects uniformly. Thus, we binarize the map P_bg using the mean saliency over the entire map and select the foreground nodes as query seeds. The ranking vector f* is computed with Equation (4) and normalized to [0, 1], and the final saliency map [17] is produced as

S(i) = f̄*(i),    (6)

where f̄*(i) denotes the normalized ranking score of superpixel i.
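Putting the pieces together, the two-stage ranking with boundary priors (Equations (1)-(6)) might look as follows in NumPy. This is a minimal sketch on toy inputs under the equations above, not the authors' implementation; the function and argument names are our own.

```python
import numpy as np

def two_stage_saliency(features, boundary_masks, delta2=0.1, alpha=0.99):
    """Two-stage manifold ranking with boundary priors.
    features       : (n, d) per-superpixel descriptors.
    boundary_masks : four (n,) boolean arrays marking the superpixels on the
                     top, bottom, left, and right image boundaries.
    Returns an (n,) saliency score normalized to [0, 1]."""
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=2)
    W = np.exp(-dist / delta2)                        # Equation (1)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    A = np.linalg.inv(np.eye(len(d)) - alpha * S)     # ranking operator, Eq. (3)

    def rank(y):
        f = A @ y
        return (f - f.min()) / (f.max() - f.min() + 1e-12)

    # Stage 1: each boundary serves as background queries; the complements
    # are multiplied as in Equation (5): P_bg = P_t * P_b * P_l * P_r.
    P_bg = np.ones(len(d))
    for mask in boundary_masks:
        P_bg *= 1.0 - rank(mask.astype(float))

    # Stage 2: binarize at the mean saliency, re-rank with foreground seeds.
    foreground = (P_bg > P_bg.mean()).astype(float)
    return rank(foreground)                           # Equation (6)
```

On a toy graph where four mutually similar nodes sit on the boundaries and one distinct node represents the object, the distinct node receives the highest final score.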

Experimental Validation and Analysis
To evaluate the performance of our algorithm, we measure it on four different datasets: MSRA-5000, DUT-OMRON, SOD, and ECSSD. MSRA-5000 contains 5000 natural images whose background regions are complex. DUT-OMRON includes 5168 high-quality, challenging images with more comprehensive backgrounds and objects. SOD contains 300 images, some of whose salient objects are close to the image boundary; it is a very challenging dataset. ECSSD includes 1000 semantically meaningful images with multiple objects. All images in the four datasets have corresponding ground truth images.

Experiment Setup
Since most images in the four datasets have a resolution of 300 × 400 pixels, we segment each image into 200 superpixels to balance the efficiency and accuracy of the computation, following previous works [15,17]. In Equations (3) and (4), the coefficient α of the affinity matrix is set to 0.99, as in a previous study [17]. In Equation (1), the parameter δ² is set to 0.1 [17].

Visual Performance Comparison
Some detection results produced by our model and the compared methods are shown in Figure 4. Apart from our method, the results of the other methods are taken from our previous paper [18]. We note that the AC, SR, MSS, and FT methods detect the whole object regions, but the contrast with the background is low. The GBVS, BMA, and SWD methods can detect salient objects roughly, but the object regions are blurred and uneven. The WTLL and LRMR methods find regions with distinctive pixels, but the edges of the salient regions are not clear. The MR, GR, RR, and MFMR methods highlight the edges of the salient regions, but their detections are incomplete when multiple objects appear in the image. The BFSS method can detect regions with high color contrast but erroneously identifies background areas as foreground. Our method exhibits superior performance, especially in challenging scenes. For instance, when the input image has low contrast, as shown in Figure 4h, the other compared methods cannot detect the objects accurately, while our method, benefiting from the multi-feature component, remains closest to the ground truth.

Quantitative Performance Comparison
We use precision-recall curves, ROC curves, and F-measure bar graphs to evaluate our method against the compared models; the results are reported in Figures 5-7. We also compare the algorithms in terms of precision, recall, AUC, and F-measure scores, as listed in Table 2. The graphs and table entries of the compared methods are taken from our previous paper [18].
Our model not only shows satisfactory results on the four datasets under all evaluation curves but also outperforms the compared methods by clear margins. This is due to the multiple features, which allow the salient objects to be better segmented from the backgrounds. In Table 2, we note that for methods whose precision is comparable to ours (i.e., MFMR, GR), recall and F-measure are often much lower on the MSRA-5000 and DUT-OMRON datasets; such methods are thus more likely to miss salient information. On the ECSSD dataset, our model achieves 81.08% precision and 72.29% recall, while the second-best model (MFMR) achieves 78.29% precision and 67.70% recall. The improvement comes from the multi-feature extraction: the new feature vector composed of multiple features exhibits their complementary advantages and achieves better results.
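For reference, an adaptive-threshold precision/recall/F-measure computation of the kind used in such comparisons can be sketched as follows. The twice-the-mean binarization threshold and β² = 0.3 are common conventions in the salient object detection literature, assumed here rather than taken from this paper:

```python
import numpy as np

def pr_fmeasure(saliency, gt, beta2=0.3):
    """Precision, recall, and F-measure of a saliency map against a binary
    ground truth. The map is binarized at twice its mean value (a common
    adaptive threshold); beta^2 = 0.3 weights precision more heavily."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-12)
    pred = s >= min(2.0 * s.mean(), 1.0)     # adaptive binarization
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # true-positive pixels
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f = (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)
    return precision, recall, f
```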

Limitation and Analysis
In Figure 8, we present a failure case of our method. When the color contrast between the object region and the background region is very low, our method fails to detect the whole object. This is because different features are good at different aspects of an image, and directly concatenating them may not fully exploit their complementary information. For existing methods, even deep-learning-based ones, saliency detection remains a challenge on low-contrast images. We therefore plan to improve detection on low-contrast images from a multi-view perspective in future research.

Conclusions
We proposed a saliency detection method based on multiple features via manifold-space ranking. The semantically rich deep features derived from a CNN model (VGGNet-19) help to accurately distinguish objects from backgrounds but fail to recover the precise localization of the objects, whereas hand-crafted low-level features are strongly discriminative for measuring the similarity between different image regions. Considering the different roles of these two kinds of features, we exploited both in a complementary way to describe image patches, and we adopted a manifold-space ranking method to rank the saliency of the image patches. Without any preprocessing or postprocessing, our method achieved superior performance with the help of multiple features.

Figure 1. A glimpse of different methods: (a) input images; (b) ground truth; (c) saliency maps of the MFMR method [18] based on color features; (d) saliency maps of our method based on low-level features only; (e) saliency maps of our method based on high-level features only; (f) our full method.

Figure 2. Overall pipeline of our method, including feature extraction, manifold-space ranking, and the final saliency map.

Figure 8. Failure case of our method: (a) input image; (b) saliency map of our method.

Table 1. The dimensions of the feature vectors.