Pre-Processing Filter Reflecting Human Visual Perception to Improve Saliency Detection Performance

Salient object detection is a method of finding an object within an image that a person determines to be important and is expected to focus on. Various features are used to compute visual saliency; in general, the color and luminance of the scene are widely used among the spatial features. However, humans perceive the same color and luminance differently depending on the influence of the surrounding environment. As the human visual system (HVS) operates through a very complex mechanism, both neurobiological and psychological aspects must be considered for the accurate detection of salient objects. To reflect these characteristics in the saliency detection process, we propose two pre-processing methods applied to the input image. First, we apply a bilateral filter to improve segmentation results by smoothing the image so that only its overall context remains while its important borders are preserved. Second, even when the amount of light is the same, it can be perceived with a different brightness owing to the influence of the surrounding environment. We therefore apply oriented difference-of-Gaussians (ODOG) and locally normalized ODOG (LODOG) filters that adjust the input image by predicting the brightness as perceived by humans. Experiments on five public benchmark datasets with ground truth show that our proposed method further improves the performance of previous state-of-the-art methods.


Introduction
The information received through the human visual system (HVS) constitutes a significant proportion of all information a person receives during the daytime [1]. Although the human brain can process large amounts of information, the amount of visual information exceeds this processing capacity. Therefore, the HVS selectively processes large amounts of visual information by classifying it according to importance [2][3][4][5]. Through this process, humans concentrate on areas of high importance, which are called salient regions. A saliency detection method is a computational model that quantifies salient areas or detects salient objects in an image by emulating the human selective visual attention mechanism.
Various methods were proposed to predict the areas where human attention is concentrated while staring at images or video scenes. Itti et al. [6] proposed a saliency detection algorithm that combines color, intensity, and orientation features in a center-surround manner based on feature integration theory. Harel et al. [7] proposed a graph-based method that combines the activation maps formed using feature vectors. Hou et al. [8] proposed a method for generating a saliency map using the spectral residuals obtained by considering the statistical characteristics of a natural image in the spectral domain.
Object-based saliency detection methods were also proposed from a different perspective from location-based methods. Liu et al. [9] segmented images to obtain regions of interest (ROIs) using density-based clustering and calculated the region saliency map based on the color, region size and location in the image. Cheng et al. [10] calculated the saliency value of the region using the measured global contrast score, but considering the influence of the contrast and spatial distance to the surrounding region that gains human attention. Li et al. [11] adopted a superpixel segmentation method and treated the saliency values of all regions as the optimization objective for each image rather than mapping visual features to saliency values with a unified model.
In order to perform quantitative comparisons of saliency detection performance, different public datasets have been created for the location-based and object-based detection approaches. For location-based detection, datasets were constructed by Einhäuser et al. [12], Bruce et al. [13], Judd et al. [14], and Achanta et al. [15]. The ground truth for each image is a record of randomly selected subjects wearing an eye-tracking device while staring at the image on a display for a set period of time. For object-based detection, datasets named DUT-OMRON [16], ECSSD [17], MSRA10K [10][18][19][20], PASCAL-S [21][22][23], and THUR15K [24] have mainly been used. The ground truth image for this type of dataset is a pixel-level binary mask of the regions that randomly selected annotators judged to be salient objects while viewing the image.
For saliency detection, various features that can be obtained from the scene are used. Among them, color and luminance, the spatial characteristics of a scene, are mainly used because they are intuitive and easy to obtain. However, humans do not perceive visual stimuli as absolute values but as relative values affected by surrounding information, causing visual illusions [25][26][27][28][29]. This means that detecting saliency from the original image data without any processing may not effectively reflect such human visual illusions. Therefore, the features obtained from the image need to be processed in consideration of the neurobiological and psychological mechanisms of human visual perception.
In this study, we verified the effect of applying a pre-filter that reflects the human visual perception system to the dataset used in the saliency detection method. We approached it from the following two perspectives:

• According to the visual perception model, the rapid initial analysis of visual features in the natural scene recognition process starts at low spatial frequencies, following a "coarse-to-fine" sequence [30][31][32][33]. In other words, when recognizing a scene, the HVS first takes in the overall characteristics of the whole scene and then recognizes the detailed characteristics. Saliency detection is a technology for detecting the area or object that a person will pay attention to when facing a scene. Therefore, in consideration of the human scene recognition process, it is necessary to pay attention to the overall features rather than the details of the scene. In addition, the performance of the superpixel method used for segmentation in saliency detection is judged by the similarity of the pixels constituting each superpixel and by whether the edges in the actual image are reflected. To satisfy both requirements, minute differences in pixel values must be removed while important edges are maintained. Therefore, if an edge-preserving filter is applied to the original image before superpixel segmentation, the segmentation result and the saliency detection performance can be expected to improve.
• The simultaneous contrast effect is a visual illusion in which the same gray color is perceived differently depending on the brightness of the background. Studies that approach this visual illusion as a low-level process attribute it to simple interactions between adjacent neurons, modeled by simple filters implementing lateral inhibition in the early stages of the visual system [34][35][36][37][38]. In addition, various methods have been proposed to predict the brightness perceived by humans under the influence of visual illusions [39][40][41][42][43].
The ground truth images used to compare the performance of saliency detection methods are created manually by subjects based on the brightness they perceive while observing the images. Nevertheless, the input image is used as-is, with its original per-pixel data, without considering human visual characteristics. Therefore, a pre-processing method that considers the brightness perception of the input image is required.
The remainder of this paper is organized as follows: In Section 2, we introduce our method to verify the effects of the pre-filtered datasets. Section 3 presents our experimental results and analysis, and Section 4 concludes the paper.

Proposed Methodology
This section describes two pre-processing filters that can be applied to improve saliency detection performance:

• Bilateral filter: Neither the foreground nor the background of an image exists as a single pixel; each gains its meaning from clusters of pixels of similar color and brightness in a certain area. The superpixel-based saliency detection method focuses on this characteristic: it divides the image into superpixels, which are clusters of similar pixels, and detects salient objects by considering the correlation between the clusters. The bilateral filter removes the detail within the clusters that degrades the correlation between superpixels. It also preserves prominent edges within the image so that the boundaries between superpixels better reflect real edges.
• Perceptual brightness prediction filter: In general, saliency detection methods use the original data values of the input images. However, since humans perceive brightness relatively, stimulus distortion occurs in the scene recognition process. Since saliency detection is a technique for detecting areas or objects that humans judge to be salient, such stimulus distortion must be reflected in the detection process. The perceptual brightness prediction filter calculates the relative brightness value that is actually perceived, given the light intensity received by the human eye.
More details on each filter are described in the following subsections.

Bilateral Filtering for Superpixel
In a natural image, although pixels that are spatially close to each other may appear to be identical, there exist slight differences in their values, which contribute to the rich details of the image. However, humans instantaneously acquire data from the scene and process it appropriately. In such a short time, the HVS ignores fine details and unnecessary edges in the scene and acquires the overall information that is necessary to understand the context of the scene. Therefore, it is more useful to process the image in units of superpixels, which represent perceptually meaningful groups of pixels, than in units of individual pixels. Hence, several object-based saliency detection methods [11,16,[44][45][46][47][48][49][50][51][52] mainly use the simple linear iterative clustering (SLIC) technique [53,54] to segment an image into superpixels. This technique uses a few parameters that can be easily adjusted and demonstrates reasonable segmentation results with a low computational cost.
However, although SLIC splits the image better than other state-of-the-art methods, it sometimes exhibits jagged boundaries along the actual image edges, as depicted in the upper row of Figure 1. To alleviate this problem and improve the superpixel result by removing details in the image, we applied a bilateral filter [55][56][57][58], which is an edge-preserving filter.
The bilateral filter developed by Tomasi and Manduchi [55] is a nonlinear filter in which the output is a weighted average of the input. The weight is based on a Gaussian distribution and depends on both the spatial distance and the intensity difference between pixels. For an input image I, the bilateral-filtered image Î at pixel p is obtained as

Î(p) = (1/W_p) ∑_{q∈S} G_{σs}(‖p − q‖) G_{σr}(|I(p) − I(q)|) I(q),        (1)

where W_p = ∑_{q∈S} G_{σs}(‖p − q‖) G_{σr}(|I(p) − I(q)|) is the normalization term, q is a pixel inside the window mask S centered on p, G(·) is the Gaussian function, σs and σr are the standard deviations of the space s and range r components, specifying the amount of filtering applied to the image, and ‖·‖ represents the Euclidean distance. As can be seen from the comparison depicted in Figure 1a, the bilateral filter removes the delicate details while preserving the major edges of the image. Figure 1b,c depict how the bilateral filter smooths the boundaries of the superpixels so that they better reflect the edges of the image.
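As a concrete illustration, the weighted average in Equation (1) can be sketched in NumPy with a brute-force double loop; the window size and standard deviations below are illustrative values for this sketch, not the parameters used in the experiments.

```python
import numpy as np

def bilateral_filter(img, size=5, sigma_s=2.0, sigma_r=0.1):
    """Brute-force bilateral filter for a 2-D grayscale image in [0, 1].

    Each output pixel is a weighted mean over a (size x size) window S,
    with Gaussian weights on both the spatial distance (sigma_s) and the
    intensity difference (sigma_r), as in Equation (1).
    """
    half = size // 2
    padded = np.pad(img, half, mode="edge")
    out = np.zeros_like(img)
    # Precompute the spatial term G_sigma_s(||p - q||) over the window.
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    g_s = np.exp(-(xx**2 + yy**2) / (2 * sigma_s**2))
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + size, j:j + size]
            # Range term G_sigma_r(|I(p) - I(q)|).
            g_r = np.exp(-((window - img[i, j])**2) / (2 * sigma_r**2))
            weights = g_s * g_r
            out[i, j] = np.sum(weights * window) / np.sum(weights)
    return out
```

On a step-edge image, pixels on either side of the edge receive almost no weight from the opposite side (the range term collapses), so flat regions are averaged while the edge itself survives.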

Brightness Perception
Brightness refers to the apparent luminance of a patch in the image itself, whereas lightness refers to the apparent reflectance of a surface in a scene [25]. Because humans respond to the proportion of light reflected by an object rather than the total amount of light reflected, they perceive the relative reflectance of objects despite changes in illumination, which is called lightness constancy [28]. This means that humans do not perceive the brightness of a scene absolutely but rather relatively, under the influence of the surrounding environment. A representative example of brightness perception is the checker shadow illusion [26], as depicted in Figure 2. Comparing the two blocks labeled A and B in Figure 2a, blocks A and B appear to consist of gray patches of different brightness. Although block B looks brighter than block A, as can be seen in Figure 2b, the luminance value of the two blocks in 8-bit grayscale is identical (120). To measure the degree of human perception of brightness within an image, we applied the oriented difference-of-Gaussians (ODOG) model [39][40][41][42][43]. The ODOG filter was designed to reflect the principle of orientation selectivity [59] of the primary visual cortex (V1) simple cells, which are located in the first stage of the cortical processing of visual information. An anisotropic Gaussian filter is used to accumulate the strength in a desired orientation over the lateral geniculate nucleus (LGN) center-surround response, which has the form of a difference-of-Gaussians (DoG) [60]. The ODOG filter is defined as follows:

ODOG_{σ1,θ}(x, y) = exp(−(u² + v²)/(2σ1²)) − exp(−(u²/(2σ1²) + v²/(2σ2²))),        (2)

where σ1 and σ2 are the standard deviations of the center and surround Gaussian functions, respectively, and u, v are the coordinates rotated by the orientation θ. The parameter σ1 is set to seven values arranged in octave intervals between 1.06 and 67.88 to reflect the space constants, and σ2 is set as σ2 = 2σ1, so that the surround is elongated along one axis of the filter. The orientation θ has six components at 30° intervals (from 0° to 150°).
The coordinates u, v, rotated by the orientation θ with respect to the x, y variables, are given as follows:

u = x cos θ + y sin θ,
v = −x sin θ + y cos θ.        (3)

The input image was processed linearly using the 42 filters generated by Equation (2). Thereafter, the weighted summations for each orientation θ were calculated, and the final result was obtained by averaging the normalized output values over all orientations.
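The 42-filter bank (7 space constants × 6 orientations) can be sketched as below. The unit-volume normalization of each Gaussian lobe, the choice of which filter axis the surround is elongated along, and the window sizing are assumptions of this sketch; the paper only fixes σ1, σ2 = 2σ1, and the orientations.

```python
import numpy as np

def odog_filter(sigma1, theta, size=None):
    """One ODOG filter: an isotropic center Gaussian minus a surround
    Gaussian elongated along one filter axis (sigma2 = 2 * sigma1)."""
    sigma2 = 2.0 * sigma1
    if size is None:
        size = int(6 * sigma2) | 1          # odd window covering the surround
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotated coordinates from Equation (3).
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    center = np.exp(-(u**2 + v**2) / (2 * sigma1**2))
    surround = np.exp(-(u**2 / (2 * sigma1**2) + v**2 / (2 * sigma2**2)))
    # Normalize each lobe so the filter integrates to zero.
    return center / center.sum() - surround / surround.sum()

# 7 space constants in octave steps times 6 orientations at 30-degree
# intervals gives the bank of 42 filters applied to the input image.
sigmas = [1.06 * 2**k for k in range(7)]          # 1.06 ... ~67.9
thetas = [np.deg2rad(30 * k) for k in range(6)]   # 0, 30, ..., 150 degrees
bank = [odog_filter(s, t) for s in sigmas for t in thetas]
print(len(bank))  # 42
```

Because each lobe is normalized to unit volume, every filter in the bank sums to zero, so a uniform image produces a zero response, as expected of a DoG-type operator.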
The ODOG model applies the same weights to the energies in each orientation and sums them globally. Meanwhile, reference [41] reported that the global normalization in the ODOG model may not effectively reflect the brightness of the image; to compensate for this limitation, a locally normalized ODOG (LODOG) was proposed. The LODOG calculates the normalized root mean square (RMS) for a Gaussian weight window centered on each pixel.
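The local RMS normalization that distinguishes LODOG from ODOG can be sketched as follows. The window scale sigma_n is a hypothetical parameter, and the pure-NumPy separable blur is a stand-in for a proper Gaussian filter; both are assumptions of this sketch rather than the paper's exact implementation.

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable Gaussian blur via 1-D convolutions (NumPy only).
    # The kernel must be no longer than the image side it convolves.
    half = int(3 * sigma)
    k = np.exp(-np.arange(-half, half + 1)**2 / (2 * sigma**2))
    k /= k.sum()
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def lodog_normalize(response, sigma_n=4.0, eps=1e-6):
    """Divide an orientation response by its local RMS, estimated with
    a Gaussian weighting window centered on each pixel."""
    local_ms = gaussian_blur(response**2, sigma_n)
    return response / (np.sqrt(local_ms) + eps)

# Toy orientation response standing in for one ODOG output channel.
rng = np.random.default_rng(0)
orientation_response = rng.standard_normal((32, 32))
normalized = lodog_normalize(orientation_response)
```

Replacing the ODOG model's single global normalization with this per-pixel RMS lets high-contrast and low-contrast regions of the same image be rescaled independently, which is the limitation [41] set out to address.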
A comparison of the images filtered by ODOG and LODOG is depicted in Figure 3. Figure 3b,c depict the predicted brightness values after applying each filter, in the RGB color space, to the original image in Figure 3a. The filtered images exhibit a slight color difference compared with the original image, and the contrast tends to increase. However, visually inspecting the filtered image would simply trigger a new round of brightness perception; therefore, in Figure 3d, the pixel values at the horizontal position indicated by the red line at the center of Figure 3a-c are compared directly. The x-axis of the graph represents the spatial location along the red line shown in Figure 3a-c, and the y-axis represents the pixel value. The blue solid line and the red dotted line are the pixel values filtered through ODOG and LODOG, respectively. Since the sneakers are centered on a dark background, the white sneakers stand out brighter and the dark background is perceived as relatively dark. The background is perceived as darker near the border adjacent to the bright sneakers and is less affected farther from the sneakers. The comparison with the original pixel values (solid black line) shows that the ODOG and LODOG filters predict this perceived brightness.

Datasets
The proposed method was applied to five public object-based datasets for the comparison of saliency detection performance. The characteristics of each dataset are as follows: The DUT-OMRON dataset contains 5168 high-quality images manually selected from more than 140,000 images, each with one or more salient objects and a relatively complex background [16]. All of the images were resized to a resolution of 400 × x or x × 400, where x is less than 400. The ECSSD dataset consists of 1000 images obtained from the internet, which are typically natural images [17]. The selected images include semantically meaningful but structurally complex backgrounds. The MSRA10K dataset, generated by Cheng et al. [10], consists of 10,000 images randomly selected from the MSRA dataset of more than 20,000 images provided by Liu et al. [61]. The PASCAL-S dataset is built on the validation set of the PASCAL VOC 2010 segmentation challenge [21]. It contains 850 natural images with multiple objects in the scene [22,23]. The THUR15K dataset consists of approximately 15,000 images classified by five keywords (butterfly, coffee mug, dog jump, giraffe, and plane) [24]. Each image contains an unambiguous object that matches the query keyword, with most of the object visible. As ground truth does not exist for all images, only the 6233 images with ground truth were used in the experiment.
Each dataset comes with ground truth images for salient objects. These ground truth images are binary masks, manually labeled by 3 to 12 subjects selected by the creators of each dataset.

Evaluation Metrics
For objective performance evaluation, we adopted the precision-recall (PR) curve, the area under the curve (AUC) score, and the F-measure.
The PR curve is a plot of precision on the y-axis against recall on the x-axis for different probability thresholds. Precision (also known as the positive predictive value) is the ratio of correctly predicted salient regions to all predicted salient regions. Recall (also known as the true positive rate or sensitivity) is the ratio of correctly predicted salient regions to the actual salient regions. Precision and recall are calculated using the following equations:

Precision = TP/(TP + FP),  Recall = TP/(TP + FN),

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The AUC score measures the overall performance based on the area under the receiver operating characteristic (ROC) curve. The ROC curve is a plot of recall on the y-axis against the false positive rate (FPR) on the x-axis for different probability thresholds. The FPR is calculated using the following equation:

FPR = FP/(FP + TN),

where TN is the number of true negatives. The better the saliency detection model, the closer its AUC score is to one; a model that predicts at chance level scores 0.5. The F-measure is the weighted harmonic mean of precision and recall. It is also adopted to measure the overall performance of the saliency detection model and is calculated as follows:

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

where the weighting parameter β² is set to one in our implementation.
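The metrics above can be computed at a single binarization threshold as follows; sweeping the threshold over [0, 1] then traces out the PR curve point by point.

```python
import numpy as np

def precision_recall_f(saliency, gt, threshold, beta2=1.0):
    """Precision, recall, and F-measure of a saliency map binarized at
    `threshold` against a binary ground-truth mask, following the
    equations above (beta2 is the weighting parameter)."""
    pred = saliency >= threshold
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        f = 0.0
    else:
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
    return precision, recall, f
```

With β² = 1, the F-measure reduces to the familiar harmonic mean 2PR/(P + R).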

Implementation Details
To verify the effect of the pre-processing filters applied to the input image on saliency detection performance, we applied six saliency detection methods. All of them are superpixel-based salient object detection methods that are either widely known and highly cited or recently published. The names of the six methods and the number-of-superpixels and compactness parameters used for SLIC in each method are listed in Table 1. The size of the mask S and the standard deviations σs and σr used for the bilateral filter in Equation (1) were 11 × 11, 50, and 0.1, respectively. The bilateral filter was applied twice iteratively to the image.

Method       Superpixel Size          Compactness
[48]         500 pixels/superpixel    20
RBD [49]     600 pixels/superpixel    20
GLGOV [50]   200 pixels/superpixel    20

Verification Framework
To verify that the filters described in Sections 2.1 and 2.2 improve saliency detection performance, we applied the filters both individually and in combination to the images. Table 2 lists the names and descriptions of the applied filters. The bilateral filter was applied iteratively to achieve sufficient smoothing while the edges were preserved. We also assumed that brightness interference occurs on a per-area basis rather than a per-pixel basis, meaning that the details of the scene are not important for it. Therefore, when combining the two filters, the bilateral filter preceded the brightness perception filter.

BF          Applying the bilateral filter only
ODOG        Predicting perceived brightness using ODOG only
LODOG       Predicting perceived brightness using LODOG only
BF+ODOG     Applying the bilateral filter, and thereafter predicting by using ODOG
BF+LODOG    Applying the bilateral filter, and thereafter predicting by using LODOG

Subjective Quality Comparison
To evaluate the effectiveness of the proposed method, we present both a subjective quality comparison and an objective performance comparison for all results of each algorithm applied to the five datasets.
A subjective quality comparison examines how similar the saliency map generated by applying a saliency detection method to the input image is to the ground truth image. In this process, different saliency detection methods can be compared by paying attention to the shape of the salient object, the shape of the superpixels, and the brightness, which indicates the degree of saliency. To effectively compare the results of the proposed method applied to each saliency detection method, several examples are depicted in Figures 4-9. Columns (a) and (b) depict the input image and the ground truth image, respectively. Column (c) depicts the saliency detection results obtained with the existing algorithm without any pre-processing. First, as depicted in column (d), the jagged patterns along the boundaries between superpixels are significantly reduced by the bilateral filter, as intended. This more effectively reflects the boundary between the object and the background that can be seen in the original image and the ground truth image. Second, the effects of ODOG and LODOG are depicted in columns (e) and (f), respectively. Both filters produce an image whose brightness values are affected by the surrounding area, similar to the effect of increasing the contrast ratio of the image. Accordingly, the greater the difference in contrast between the salient object and the background, the greater the influence of the filters. Third, the effects of ODOG and LODOG combined with the bilateral filter are depicted in columns (g) and (h), respectively. Although there are differences depending on the characteristics of the input image, it can be confirmed that the advantages obtained through each filter are effectively combined. The results indicate that the proposed method is effective for all saliency detection methods.
In particular, when applied to GMR and DSR, it exhibited the most noticeable improvement in results. In the first, second, and fourth rows of Figure 5 and the first and second rows of Figure 6, the results produced by the two methods are significantly different from the shape of the salient object. However, after applying the proposed method, the overall silhouette of the salient object present in the image appears.

Objective Performance Comparison
To objectively compare the performance of the proposed method, we first examined the PR curves depicted in Figures 10-15. First, in the case of the MC and RBD methods depicted in Figures 10 and 14, the results of the five pre-processing methods are similar to or slightly better than the original in all datasets. In particular, on PASCAL-S, the MC method demonstrated that ODOG produces better results than the other pre-processing methods, and the RBD method demonstrated that it is advantageous to combine BF with the brightness perception filters. Furthermore, Figures 11 and 12 show that the proposed method exhibits a noticeable performance improvement when applied to the GMR and DSR methods. In the case of the GMR method, all pre-processing methods led to effective performance improvements except on THUR15K, and for the DSR method, BF exhibited an effective improvement compared with the other pre-processing methods on all datasets. Figure 13 shows that only ODOG and LODOG perform similarly to the existing HDCT method on the datasets excluding THUR15K. Finally, in Figure 15, GLGOV exhibited improved results on ECSSD, PASCAL-S, and THUR15K when BF and the brightness perception filters were applied together, similar to RBD. To numerically evaluate the performance of our approach, we additionally compared the AUC and F-measure using the method proposed in [15]. The results of the two measurements are listed in Tables 3 and 4. Note that the red and blue scores in the tables represent the best and second-best performances, respectively. Although there are minor differences across combinations of detection methods and datasets, in general, the performance improves when the proposed pre-processing filter is applied. The most pronounced improvement is seen in the GMR and DSR methods, as in the subjective quality comparison.
In the case of the GMR method, the differences between the best case and the original for AUC and F-measure were 0.0072 and 0.0160, and for the DSR method, these values were 0.0019 and 0.0075, respectively. Unlike other detection methods, the HDCT method exhibits the best effect when only the brightness perception filter is applied. The remaining detection methods are generally more effective when BF and brightness perception filters are combined.

Conclusions
In this paper, we proposed a method of applying pre-processing filters that consider human visual perception characteristics to improve saliency detection performance. The pre-processing filters were used individually or in combination depending on the purpose. Experimental results on five publicly available datasets have shown that our method effectively improves the performance of six superpixel-based saliency detection methods. Further evaluation with PR curves, AUC, and F-measure has also confirmed the effectiveness of the proposed approach. We have found several advantages in terms of saliency detection performance by applying the proposed method. First, the bilateral filter smooths the detail components while preserving the edges of the image. The flattened image increases the similarity both within and between superpixels. Because superpixel-based saliency detection methods detect salient objects through superpixel correlation, this leads to improved performance. Second, a salient object contains stimulus components that are distinct from the surrounding environment, of which the most easily perceived is the brightness of the scene. Humans receive this component as a distorted value because ambient luminance interferes with it during acquisition. Therefore, the brightness component obtained through the perceptual brightness prediction filter amplifies the difference between the salient object and the background to the level perceived by humans, which decreases the correlation between the salient object and the background. It is known that when humans detect salient objects, bottom-up and top-down processes ultimately combine to drive the visual recognition system. Recently, top-down methods that train deep neural networks on saliency datasets and detect salient objects through the trained networks have been actively studied.
For future work, we will focus on bottom-up saliency detection methods that work in conjunction with pre-processing filters to improve performance and their combination with top-down methods.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

HVS     Human visual system
ROI     Region of interest
SLIC    Simple linear iterative clustering
DoG     Difference-of-Gaussians
ODOG    Oriented difference-of-Gaussians
LODOG   Locally normalized ODOG
LGN     Lateral geniculate nucleus
RMS     Root mean square
PR      Precision-recall
AUC     Area under the curve
ROC     Receiver operating characteristic
FPR     False positive rate