Saliency Map Estimation Using a Pixel-Pairwise-Based Unsupervised Markov Random Field Model

: This work presents a Bayesian statistical approach to the saliency map estimation problem. More speciﬁcally, we formalize the saliency map estimation issue in the fully automatic Markovian framework. The major and original contribution of the proposed Bayesian–Markov model resides in the exploitation of a pixel pairwise modeling and a likelihood model based on a parametric mixture of two different class-conditional likelihood distributions whose parameters are adaptively and previously estimated for each image. This allows us to adapt our saliency estimation model to the speciﬁc characteristics of each image of the dataset and to provide a nearly parameter-free—hence dataset-independent—unsupervised saliency map estimation procedure. In our case, the parameters of the likelihood model are all estimated under the principles of the iterative conditional estimation framework. Once the estimation step is completed, the MPM (maximum posterior marginal) solution of the saliency map (which we show as particularly suitable for this type of estimation), is then estimated by a stochastic sampling scheme approximating the posterior distribution (whose parameters were previously estimated). This unsupervised data-driven Markovian framework overcomes the limitations of current ad hoc or supervised energy-based or Markovian models that often involve many parameters to adapt and that are ﬁnely tuned for each different benchmark database. Experimental results show that the proposed algorithm performs favorably against state-of-the-art methods and turns out to be particularly stable across a wide variety of benchmark datasets.


Introduction
Saliency detection can be defined as the selection of visually interesting and important regions or objects in an image which catch immediate attention or naturally grab and hold the viewer's attention. In fact, this task tries to fundamentally simulate the early processing stage (namely, eye fixations and selective processing) of the human vision system (HVS), which has the natural and astonishing ability to quickly and accurately identify the most visually noticeable and informative foreground object, in a (possibly complex) scene, in order to then adaptively focus the attention, via the visual attention mechanism, on such perceived important regions. This allows humans (and some mammals) to efficiently and quickly analyze the scene with a minimal allocated processing (visual) resources.
Recent years have witnessed rapidly increasing interest in saliency detection since this task plays an important role in a variety of applications, especially as a first step of many graphics/vision applications for which it is necessary, before all, to tackle the problem of information overload. It includes content-based image or video retrieval, categorization, summarization, compression, browsing and/or resizing, automatic image cropping, adaptive image display on small device, infrared small target detection [1], object cosegmentation or advertising design to name a few.
Since the pioneer work of Itti et al. [2], who was one of the first authors to propose a saliency-based computational model relying on contrast-based image local features combined and computed across different scales, a rich literature on image saliency analysis has then been proposed to date. Within that literature, a significant number of methods have exploited the same principle based on local contrast concepts relying on pixel/region difference (with possibly different features) in the vicinity and expressing that a salient region/object exhibits a significant contrast to its immediate surroundings [3][4][5][6][7][8][9][10][11][12][13]. Another approach relies on the assumption that the salient object is globally distinct, i.e., it possesses discriminative color distributions with respect to the rest of the image or color uniqueness in terms of global statistics [7,11,[14][15][16][17][18][19][20].
Local-contrast-based methods tend to produce higher saliency values near edges instead of uniformly highlighting salient objects (whose inner region can be discarded) and are also degraded with the presence of high-frequency noise, while global-contrast-based methods cannot clearly distinguish among regions and are degraded with the presence of background with complex (or salient small-scale) patterns [21,22].
That is why some saliency detection (SD) models propose to mix local-and global-(and region) contrast-based features [23], to combine local contrast cues via a multilayer or multiscale approach [20,21], a tree structure [19] or to base their saliency map estimation on a multilevel segmentation [18]. Unlike those who use local-or global-contrast-based saliency features that depend on the statistics of the particular image being viewed, Zhang et al. derived a saliency measure from natural image statistics obtained in advance from a database of natural images [24].
Most of these above-mentioned saliency-based computational methods are ad hoc designed, partly because the overarching goal of these models (i.e., the criterion used to select the optimal solution or simply what it is designed to optimize) is not specified, and also because these techniques have many parameters which must either be finely tuned (hand-selected) or require a high degree of supervision and machine learning, which in turn makes them entirely dataset-dependent. In other words, there is no guarantee that these ad hoc algorithms can achieve similar performance on another dataset for which they have not been trained or tuned for [22,25].
In order to remedy these drawbacks and shortcomings, some saliency detection models have been guided by the Gestalt law [26], developed in the matrix decomposition model framework [27], incorporated into the framework of graph cuts [28] or conceptualized within the framework of quantum mechanics [22]. Some other models have been developed in the energy-based model framework [21,29] or equivalently in the regression framework [18] and thus consider the saliency estimation as a global optimization problem. Other saliency detection models have been designed in the Bayes statistical framework [30][31][32] or with the conditional random field (CRF)-based approach (which is often and simply built from an ad hoc conditional distribution with an associated graphical structure) [23,[33][34][35][36] or finally described in the (supervised) Markov Random Field (MRF) theory [5,[37][38][39][40][41][42].
More precisely, regarding the above-mentioned MRF models, most authors [5,[38][39][40][41] formulate the saliency estimation as a random-walk problem using the equilibrium distribution [5,[38][39][40][43][44][45][46] to simulate human attention or exploit this equilibrium distribution to weight the absorbed time, thereby suppressing the saliency of long-range background regions with a homogeneous appearance [41]. More generally, an absorbing Markov chain [47,48] possibly using different hierarchies of deep features extracted from fully convolutional networks [49] or guided by depth information [50] can also be used for this purpose. Finally, Han et al. proposed to formulate the saliency estimation problem as a maximum a posteriori (MAP) estimation in a more classical contextual classification problem [37]. More precisely, the saliency map was estimated in a two-step process. The first step allowed the authors to generate a rough saliency map obtained by Itti's algo-rithm [2]. In the second step, only a few attention seeds were first selected, according to the previously estimated saliency map and were combined and integrated, with some low-level features, in an MRF model to sequentially grow the attention objects from these selected attention seeds. Nevertheless, since no distribution or mixture of distributions were actually estimated from the image, this approach was not, stricto sensu, a Markovian approach. In addition, the resulting final algorithm remained an adhoc combination of several techniques with two algorithmic postprocesses to refine the extracted results. Moreover, that method relied on many internal parameters which had to be finely hand-selected and thus suffered from severe parameter dependencies, since the model was not cast within an unsupervised Markovian framework, which would have made it possible to estimate, in a criterion sense, the important internal parameters of the model.
Contrary to [37], we herein consider an unsupervised Markovian approach based on a field of observation built from a modeling by a pair of pixels and encoding the nonlocal pairwise pixel interactions (NLPPI) existing in the image. Based on these NLPPIs, our likelihood model is given by a mixture of class-conditional likelihood distributions of pairwise features (for each pairwise pixels existing in the image), whose parameters are (adaptively) estimated for each image. This allows us to adapt our saliency estimation model to the specific characteristics of each image of the dataset and to provide a nearly parameter-free-hence dataset-independent-unsupervised saliency map estimation procedure. In our case, the likelihood model parameters are fully estimated under the principles of the ICE (iterative conditional estimation) framework [51,52] with an ML (maximum likelihood) estimator (obtained with an iterative procedure minimizing the mean square error). Finally, the MPM (maximum posterior marginal) estimation of the saliency map is then computed via a Markov chain Monte Carlo (MCMC) sampling method to approximate the posterior distribution with the previously estimated parameters. This Bayesian criterion [53] is specifically well suited to our problem since it provides a practical way to obtain either the binary saliency map or its soft probabilistic version with values varying from zero to one.
Let us underline that, up to now, relatively few research works have been proposed in vision and image processing about energy-based or MRF models encoding or modeling via a likelihood (energy) function or (mixture of) distributions, all (or a subset of) the NLPPIs existing in an image. We can nevertheless mention some research works using energy-based models for texture synthesis [54], image segmentation [55,56], three-band compression model (for the color visualization of hyperspectral images) in remote sensing [57,58], gait analysis [59], edge histogram specification [60], for the compression of high-dynamic-range (HDR) images [61], with MRF models in segmentation fusion [62] or recently, in multimodal (heterogeneous) change detection in remote sensing [63].
The rest of this paper is structured as follows: section 2 details the proposed automatic Markovian model for the saliency map estimation problem by first defining and combining the likelihood and prior density functions and then the proposed two-step procedure, namely, a parameter estimation step followed by a saliency map estimation step, based on the previously estimated parameters. Section 3 presents a set of experimental results and comparisons with existing saliency map estimation methods, as well as the evaluation of the robustness of the proposed Bayesian approach. Finally, Section 4 concludes the paper.

Unsupervised Markovian Model For The Saliency Map Estimation Problem
Herein, we formalize the saliency estimation issue in the fully automatic Bayesian framework. To this end, an efficient and reliable method is a two-step approach [64]. First of all, a parameter estimation step is carried out to deduce the parameters of the likelihood model (in the sense of ML). Based on the value of these parameters, a second step is then devoted to the estimation of the saliency measure map (with range values between 0 and 1) or the binary saliency map.

Observation Field
Let us introduce some notation which is used throughout the paper. We first commonly consider a couple of random fields Z = (X, Y), with Y = {Y s , s ∈ S} the input image (for which a saliency map must be estimated and assumed to be toroidal) with N pixels located on a lattice S of N sites (or pixels) s, and X = {X s , s ∈ S}, the pixelwise random vector defining the label field or the saliency map. Each Y s takes its value in a color space and each X s is defined either in the discrete binary state space Λ = {e 0 = nonsalient, e 1 = salient} or (it will be explicit in the following) in the probabilities range from 0 to 1.
In addition to that, we also consider that the set of y <s,t> values are a realization of a random variable (rv) vector Y <s,t> = {Y <s,t> , Y <s,u> , . . . , Y <u,v> , . . .} gathering the N(N − 1) random variables associated to each site (or pixel) pair, that we herein call the random (pixel-pairwise) observation field and secondly that X <s,t> is its corresponding random (pairwise) label field taking its value in the discrete state space Λ <s,t> = {id, di}. The pixel-pairwise label id means that the pixel at location s an t must share the same (identical) class label in the final saliency mapx to be estimated (leading either to the configuration <x s =salient, x t =salient> or <x s = nonsalient, x t =nonsalient>). Conversely, x <s,t> = di means that we have a different configuration, i.e., either the configuration <x s =salient, x t =nonsalient> or <x s =nonsalient, x t =salient>.
In our pixel-pairwise modeling approach, in order to keep a quasi-linear complexity with respect to the number of pixels in the image (and therefore reduce the computational complexity of our stochastic algorithm), we consider for each pixel s and centered around it, a subsample G s of 8 pairs of pixels evenly located around an N w × N w square window (see Figure 1). Furthermore, we consider two feature vectors at site s, namely, V s and V s , encoding first the textural and structural information existing around each local squared region of size N T = 16 × N T = 16 centered at s and the color information, via the feature vector V • s , existing within a local squared region of size N c = 5 × N c = 5.
More precisely, in our application, we first estimate the discrete cosine transform (DCT) of each (gray-scale version of the) local squared window N T × N T , compute its module (i.e., its absolute value since the DCT is real) and then apply a half-circular or radial integration transform (RIT) (i.e., to get the mean of the absolute DCT coefficient values for each discrete radius using a bilinear interpolation) to estimate a spectral descriptor vector V s of size N T /2 (see Figure 1). From this DCT transform, we also compute an axial integration, for each of the five discretized possible orientations, as described in Figure 1 to get V s .
As this texture descriptor is obtained from the compressed domain, it has the ability to be both greatly reduced in size and also robust to noise (since several denoisers are built from a filtering in this DCT domain [65,66]). Moreover, this strategy also allows one to efficiently code a texture with the properties of translation and rotation invariance for V s and scale invariance for V s .
Moreover, remember that the DCT has a better compressive power than the discrete Fourier transform (DFT) and also the ability to estimate a less biased spectrum than that obtained by a DFT (especially when it is calculated on small images). This particularity comes from the even or mirror symmetry properties of the DCT which thus avoid the creation of false spectral components (or artifacts) generated by edge effects induced by the intrinsic periodic property of the DFT. In addition, the DCT calculates a spectrum which is real, contrary to the DCT, which estimates a complex spectrum (as a result, this also makes the calculation of the DCT extremely fast). (To this end, we have exploited the fast and optimized 16 × 16 DCT function implemented in C code by T. Euro (i.e., DDCT16X16S) and extracted from the FFT2D package available online at the http address given in [67]. The size N T = 16×N T = 16 was herein chosen because N T = 16 is, before all, a power of 2 allowing to compute the DCT with linear complexity. Let us note that a size N T = 8 × N T = 8 would have given almost the same (but very slightly less good) result.)  [21,68], assumed to be toroidal and showing a salient object and its ground truth. (b) Subsample G s of 8 pairs of pixels <s, t> associated to each pixel s (in which the pixel t is evenly located all around an N w ×N w square window (with N w = 50 in our case) with the two local square subwindows associated to each site s or t. (c) The spatial and spectral feature vectors V • s , V s and V s were built from the two local squared subwindows associated to the site s.
Finally, from the N c × N c local window, the color information around the pixel is taken into account, via the feature vector V • s of length 3 by specifying the mean L (luminance or lightness), A and B values contained in this subwindow (see Figure 1).
In our approach, we based our likelihood model on the assumption that the salient region or object was globally distinct, i.e., it possessed both discriminative colors and (to a lesser extent) discriminative textural properties with respect to the rest of the image. To this end, in our pixel-pairwise modeling using a subset of pixel pairs <s, t> existing in S (see Figure 1), y <s,t> was computed from each considered pairwise feature vectors (V , V , V • ) extracted from the pixel pair < s, t >, with the following symmetric relation: Color Features (1) where |.| 1 is the L 1 norm and ρ c is the weighting factor (see Section 3.1) between the color and spectral feature measures.

Likelihood Distributions
To use the observation measure y <s,t> (see Equation (1)) in a Bayesian settings, we must, before all, estimate the (marginal or conditional) likelihood distributions of Y <s,t> , namely P Y <s,t> |X <s,t> , in the two possible cases: identical pixel-pairwise label, x <s,t> = id, or not, x <s,t> = di.

Likelihood in the Identical Pixel-Pairwise Label Case
In our experiments, we noticed that if x <s,t> =id, P Y <s,t> |X <s,t> was well approximated, for a given s, t, by an exponential distribution p id = E (.; λ id ) with shape (or inverse rate) parameter λ id , i.e., P id (y <s,t> ) = P Y <s,t> |X <s,t> (y <s,t>|x <s,t> =id ) with λ id > 0 and the right-continuous Heaviside step function This approximation can be justified and understood if we notice that, for a site pair < s, t >, either fully included in a salient region or entirely within a nonsalient area (i.e., x <s,t> = id), the computation of y <s,t> (see Equation (1)) is, in fact, a (weighted) sum of three n-order spatial gradient norms (n is the distance in pixel between s and t) in the textural sense (i.e., a difference, in the L 1 norm sense, between pairwise textural feature vectors). The gradient norm, meanwhile, built either from grayscale, color levels or from two texture feature vectors of an image, is known to be well approximated by a simple exponential distribution [63,69] or its possible variants (e.g., its generalized version [70], truncated variant [71,72] or finally a long-tail version with a shape and scale factor [60,61]).

Likelihood in the Different Pixel-Pairwise Label Case
In the case of x <s,t> =di (different pixel-pairwise labels), we empirically noticed that a normal distribution P di = N (.; µ di , σ 2 di ) was well adapted to describe the measure y <s,t> : P di (y <s,t> ) = P Y <s,t> |X <s,t> (y <s,t>|x <s,t> =di ) Let us note that the Gaussian shape of this distribution is consistent with the central limit theorem saying that the Gaussian distribution is an attractor for the conditional random variable Y <s,t> |X <s,t> , whose realizations result from the sum of many i.i.d. (independent and identically distributed) random variables (in our case resulting from the difference of different spectral and spatial feature vectors for different (salient and nonsalient) texture feature vector pairs). Let us also add that the parameters of these two distributions, namely, (λ id , µ di , σ di ) closely depends on the statistics of the salient or nonsalient region contained in each input image. The estimation of these parameters (see Section 2.4) is crucial in order to provide a nearly parameter-free-hence dataset independent-unsupervised saliency map estimation procedure.

Posterior Distribution
Let us now assume the independence of the pairwise data Y <s,t> , conditionally on the pairwise labeling process X <s,t> relatively to the considered subsample G s of pairs of pixels defined in Section 2.1 (see Figure 1), we have: Moreover, if we assume that the distribution of X is Markovian and stationary and we specify a suitable prior distribution for the labeling process X and we agree that the saliency map x explicitely defines x <s,t> , using likelihood (Section (2.2)), the joint distribution of (X, X <s,t> , Y <s,t> ) becomes: We obtain for the posterior distribution: We finally get: We encoded a second-order isotropic Potts prior model related to the 8 closest neighbors of s, with equal potential value β for the various cliques s, v configurations (i.e., vertical, horizontal, left and right diagonal) of η s , thus a model favoring homogeneous regions of the same class (no-saliency or saliency label) forx, i.e., P X ( [73], wherex, the saliency map to be computed, is related to the following corresponding posterior probability: where δ is the delta Kronecker function and |G s | is here the cardinality of the graph G s (i.e., in our case |G s | = 8, see Figure 1).

Principle
In our automatic Markovian segmentation model, we have first to estimate (estimation step) the parameter vector Θ y <s,t> defining, respectively, the likelihood distributions P Y <s,t> |X <s,t> (y <s,t> |x <s,t> ) for each of the two class labels x <s,t> of y <s,t> (see Equations (2) and (3)), i.e., the parameter vector Θ y <s,t> (λ id , µ di , σ di ) encoding the shape parameter of the exponential distribution P id (y <s,t> ) and the mean and variance parameters of the normal law P di (y <s,t> ).
In our application, this estimation stage can be challenging for three main reasons; First of all, we find ourselves in the particular case of a mixture of different distributions (exponential and Gaussian). Second, these two distributions are also often strongly mixed (see Figure 2). Finally, this mixture also has very different mixture proportions; generally, the di class is very underweighted because this class is related to the reduced number of labels or transitions per pixel pairs existing between the class salient and non-salient with respect to the considered graph (see Figure 1).  Figure 1) and distribution mixture estimated by the ICE procedure (see Section 2.4) with an exponential distribution for the id entical class label and a Gaussian distribution for the di fferent pixel-pairwise label. Bottom: likelihood mixture composed of the two previous conditional likelihood distributions P id (y <s,t> ) and P di (y <s,t> ) (without weighting proportion) used in the likelihood part of the posterior density of our MRF SD model.
To this end, we resorted to the ICE [51,52,74] iterative procedure, which we used here in the particular case of our pixel-pair modeling and which was easily able to cope with different distributions and which experimentally turned out to be more efficient than the classical expectation maximization (EM) [75] algorithm or its stochastic version, the stochastic EM (SEM) [76]. This efficiency comes from the fact that the ICE estimation algorithm [51,52] can easily be understood as a direct improvement of the EM algorithm, and more precisely as both its stochastic and Markovian versions (since it is also constrained by the distribution of X assumed to be Markovian).
The ICE procedure first requires to find an estimatorΘ y <s,t> = Θ(x <s,t> , y <s,t> ) providing an estimate of Θ y <s,t> based on the complete data configuration (x <s,t> , y <s,t> ). The random field X <s,t> being unobservable, the iterative ICE procedure thus defines the param- y <s,t> , at step [k + 1], as the conditional expectations ofΘ y <s,t> given Y <s,t> = {Y <s,t> } and the current parameter Θ [k] y <s,t> . The good behavior of this fixed point for the estimation of Θ y <s,t> in the sense of the mean squared error was demonstrated in the simple case [52] and in many past experiments.
By denoting IE k the conditional expectation based on Θ [k] y <s,t> , this estimation procedure is described as follows:
We noticed that n = 1 gave good results (while ultimately minimizing the computational cost of the iterative procedure) as was also found in several other experiments. (In fact, this is certainly due to the ergodic property of any image, which makes the ensemble average equivalent to a spatial average in the case of a random variable modeling the data or the (modified) color levels of an image [51]). From this observation, we therefore kept this value n = 1 in the rest of our experiments.
It was the case in our unsupervised Markovian saliency model, and we actually chose n = 1 in our experiments.

Ml Estimators for the ICE
For the Gaussian law, an ML estimate of (µ di , σ 2 di ), based on the complete data configuration, can be easily given by the empirical mean and variance statistics. For example, If N di = #{x <s,t> = di}, one gets: µ di (x <s,t> , y <s,t> ) = ∑ x <s,t>=di y <s,t> N di (9) For the exponential law, if N id = #{x <s,t> = id}, an ML estimate of the shape parameter is: λ id (x <s,t> , y <s,t> ) = ∑ x <s,t>=id y <s,t> N id (11) In our Bayesian SD framework, we do not need to estimate the proportion of each class. Nevertheless, the mixing proportion can be easily estimated within this procedure with the empirical frequency estimator π id = N id /(N id + N di ) (and π di = N di /(N id + N di )).

ICE Algorithm
Θ y <s,t> (λ id , µ di , σ di ) is therefore estimated with the ICE algorithm as follows: • Parameter Initialization: We start with a randomly initialized saliency map x with two classes (salient/nonsalient) and from Θ [0] y <s,t> is then calculated from Θ [k] y <s,t> in the following way: (1) Stochastic step: using the Gibbs sampler, one realization x of the saliency map is simulated according to the posterior distribution P X/Y <s,t> (x/y <s,t> ), with parameter vector Θ [k] y <s,t> . More precisely, for each site s (lexicographically), we sample x s with the local version of Equation (7), i.e., P X s |Y <s,t> (.) ∝ ∏ <s,t> t∈G s P Y <s,t> |X <s,t> (.) · P X s (x s ) (12) with P Y <s,t> |X <s,t> an exponential law for x <s,t> = id with parameter λ id (see Section 2.2.1); with P Y <s,t> |X <s,t> a Gaussian law for x <s,t> = di with parameter (µ di , σ di ) (see Section 2.2.2). (2) Estimation step: the parameter vector Θ [k+1] y <s,t> is estimated with the ML estimator of the "complete data" (see Equations (9)-(11)).
(3) Repeat until a stopping criterion is met or until convergence is achieved, i.e., if y <s,t> (i.e., if the 1-norm of the difference between these two parameter vectors is below a threshold or after a maximum number of iterations), we return to the stochastic step.
In order to always ensure a good convergence of the ICE procedure, even in the presence of strongly mixed mixture distributions with unbalanced mixing proportions (as shown in Figure 1), we start this iterative procedure with Θ di as the average of the 10% highest values of y <s,t> in order to model the fact that the mean of the (x <s,t> =) di class is generally associated to the mean of the highest values of y <s,t> . We finally use the stochastic step with a Gibbs sampler with a temperature (with a temperature factor T, we recall that the local posterior distribution is 1 Z s exp 1 T log P X s |Y <s,t> (.) ) equal to 0.15 (empirically set after a couple of trials and errors) in order to allow a fast convergence and to reduce the number of explored solutions around the initialization values.

Saliency Map Generation Step
Once the estimation step is completed and based on the value of these crucial parameters, we now have to generate the saliency measure map with range values between 0 and 1 as efficiently as possible. To this end, the maximum a posteriori MAP [73] estimator, namely, the one that searches the configurationx such thatx = arg max x∈Ω P X|Y <s,t> (.) (with the configuration set Ω = Λ N ) would allow to find, in our application, the most probable, in the MAP sense, binary saliency mapx given the image data. Equivalently, this strategy would search to find the global minimum of the negative log-posterior. Nevertheless, this strategy would allow us to finally estimate a binary saliency map and not a saliency measure map with range values between 0 and 1.
If a good initialization is not available, the ICM [73] algorithm will be ineffective and will be stuck in a local minima (i.e., give a poor suboptimal binary saliency map solution). The only way to avoid suboptimal local minima is to use a simulated annealing scheme, which is very computational demanding [63,77], especially for such an energy function to be optimized.
In our case, there is an interesting alternative which makes it possible to circumvent the global solution of the MAP estimator [78] associated with a binary or hard classification map (the MAP) and which is well suited to provide (rather quickly) either a binary saliency map or (as rather required here) a soft saliency measurement map; it is a more local Bayesian estimator that associates to each couple of sites s, t, the value of X s,t that is the most probable given the image data. It is referred to as the "marginal posterior mode" (MPM) estimator [53]. It relies, as the ICE procedure (see Section 2.4), on a sitewise posterior, i.e., the local version, for each site, of the posterior (Equation (12)). Mathematically, the MPM estimator is the one that searches the configurationx such thatx = arg max x∈Λ P X s |Y <s,t> (.).
Algorithmically speaking, it consists in sampling N MPM realizations of the random field X with the local version of the posterior or equivalently, this means repeating the stochastic step of the ICE procedure (see Section 2.4.3 and point 1) in order to simulate N MPM samples of the binary saliency map. After these sampling steps, the proportion of salient labels for each pixel simply gives us the soft measure of salience for this pixel and thus the soft saliency map. Moreover, the median label for each site can give us the binary saliency map, in the MPM criterion sense.

Additional Location-Based Prior
In addition to the classical and standard isotropic Pott type prior P X s (x s ) used in the posterior (see Equation (8)), which aims to favor homogeneous salient or nonsalient regions in each of the N MPM binary saliency map samples generated by the MPM algorithm and which finally induces a regularizing effect on the soft saliency measure map, we added another prior. This additional prior (see Figure 3) was independent of the visual features and reflected prior knowledge of where the salient object/region was likely to appear, i.e., most likely in the center of the image [18,21,24,29]. In our case, this prior was modeled by the following additional prior distribution: where s row and s col designates, respectively, the row and column coordinates of pixel s. hgth and wdth are the height and width of the image and we recall that e 1 is the label associated to the class salient. With this additional prior, the prior part of the posterior becomes P X (x) = P X s (x s ) · P X s:c (x s:c ) (see Equations (7) and (8)).

Region-Based Constraint
We finally added a region constraint which aimed to better preserve the boundary of salient objects and regions. This kind of constraint exploits the concept of superpixel and has already been used successfully, in different ways, in several saliency SD models [79], especially in [18,19,29,30,[79][80][81]. Superpixel algorithms produce an oversegmentation of an image in which each oversegmented region (called superpixel) brings together a group of pixels (forming a perceptually meaningful atomic region) that can be used to replace the rigid structure of the pixel grid in images. In our application, we resorted to the superpixel algorithm proposed by Felzenswalb [82] with the default values (settings) suggested by the authors. To this end, we first computed, before the MPM, a superpixel map, and at the end of each sampling (among the existing N MPM samples) of the binary saliency map achieved by the MPM sampler, we simply counted the number of salient labels contained in each superpixel; if this one was less than half the number of pixels it contained, then we associated to all these pixels the nonsalient label.
We individually discuss and quantify the effectiveness of these two prior constraints in the following section.

Setup
First, we resized all images so that max(height,width) of the image was 200 pixels. In the following, all parameters were set with respect to this basic image size. Due to its strong correlation with the human visual perception, the perceptual CIE-Lab color space was herein used in this application. Experimentally speaking, it also turned out to be more effective in our application than the RGB color space.
The internal parameters of the model that remained to be set were the size of the squared window N w (size of the graph G s ), the size of the squared color window N c (see Figure 1) and ρ c , the weighting factor between the color and spectral feature measures (see Equation (1)).
Let us note that the number of MPM iterations was not critical since in fact the higher, the better the results (but the higher the computational time). In our application, the number of MPM iteration was set to 300 in order to ensure a computational time around 4 seconds per image. Finally, the MPM sampler was initialized, at the start of the sampling process, not by a random label image but by an image containing class labels change in a central square with a surface area half that of the image on a background of class labels no-change. This strategy allowed us to improve the SD results compared to a conventional random initialization.
The efficiency of the MPM sampler, in term of F measure, is also quantified in Section 3.4 for a random initialization, for different number of iterations and for the different location priors considered in Sections 2.5.1 and 2.5.2.

Dataset Description
We validated our algorithm on the Extended Complex scene saliency dataset (ECSSD) built by Shi et al. [21,68]. This database is composed of 1000 images, including many semantically meaningful but structurally complex images for evaluation and containing various categories, such as natural objects (vegetables, flowers, mammals, insects and human) along with man-made objects (cups, vehicles, clocks, etc.). The backgrounds of many of these images are not uniform but possibly contains a small-scale structure or are composed of several parts, and some of the salient objects or regions do not have a sharp, clear boundary and/or an obvious difference with the background (which sometimes changes in illumination intensity) and are sometimes transparent. In addition, multiple salient objects possibly exist in one image, while a part of or all of them are regarded as salient as decided by an expert. A rigorous protocol was employed using several experts and several manually drawn saliency maps combined with each other in order to find the binary mask or ground truths that minimized the intersubject label consistency in the majority vote sense. More details about the database and the protocol used to construct the ground truths can be found in [68]. We also tested our method on the saliency dataset MSRA-5000 (or MSRA-5K) [23]. Images of MSRA-5000 are relatively more uniform and therefore the SD estimation is somewhat less difficult than on the ECSSD dataset [68].

Quantitative Measure
Our quantitative evaluations and experiment followed the setting described in [11,15,68,79]. We first plotted the precision-recall curve for the set of saliency measure maps obtained by our model and compared the obtained curve against to other methods (see Figure 4a). In addition, since in many applications, a high precision is required, we thus also estimated the F β -measure, as a function of each possible threshold (within the range [0, 255]), as proposed in [68,79]: (14) in which thresholding was applied, and β 2 was set to 0.3 as suggested in [11,68,79] (the reason for weighting precision more than recall is that recall rate is not as important as precision [23,79] (since a 100% recall can be easily achieved by setting the whole region foreground)) and the obtained F β -measure was plotted and compared to other methods (see Figure 4b). Moreover, we further investigated the performance of the mean absolute error (MAE) following [68,79], which measures the quality of the weighted continuous saliency map which may be of higher importance than the binary mask itself. More precisely, the MAE measures the mean absolute error between the soft saliency measure map x and the binary ground truth x G , both normalized in the range [0, 1]. The MAE score is defined as follows: with i and j designating, respectively, the row and column coordinates of x or x G .

Results and Comparisons
First, we tested (see Figure 5) the efficiency of the MPM-based SD estimator, in terms of precision-recall measures, for: 1. different number of iterations (namely, 2, 10, 100, 300 and 3000); 2. for a random initialization of the MPM (see Section 3.1); 3. by considering just the color features (see Equation (1)); 4. without considering the location-based prior (see Section 2.5.1); or finally, 5. without considering the region-based constraint (see Section 2.5.2). In all these cases, the other internal parameters and the proposed model were kept unchanged. From these tests, we can conclude, in term of F β -measure, that the higher the number of iterations of the MPM, the better the results (but the higher the computation time too). The textural feature, the region-based constraint or a random initialization for the MPM slightly improved the results by only 2.1%, 1.2% and 1.7% (in term of F β measure), respectively, whereas the location-based prior was important to achieve better SD results with the highest F β -measure. Moreover, Figure 6 shows the distribution of the F β , F and MAE measures given by our model (saliency map estimation with pixel-pairwise-based MRF) model on the 1000 images of the ECSSD dataset. We evaluated our model on the ECSSD and MSRA-5000 datasets and compared our results in terms of the average MAE measure (see Equation (15)), with local schemes IT [3], GB [5], AC [9] and global schemes LC [14], FT [11], CA [16], HC [15], RC [15], RCC [20], LR [17], SR [7] and CHS [68] (see Table 1). Figure 4 shows the plot of the precision-recall curve and the F β measure (see Equation (14)), as a function of the threshold, between these different SD methods comparatively to our model on the ECSSD dataset. AC [9] CA [16] FT [11] GB [5] HC [15] IT [3] LC [14] LR [17] SR [7] RC [15] RCC [20] HS [

Discussion
Our current implementation took on average 4.2 s to process one image with resolution 400 × 300 (the 1000 images in the ESCCD database were processed in 70 min) in the benchmark data on a 3.33 GHz Intel i7 CPU (6675.25 bogomips and 8 GB memory) with an unoptimized C++ code running on Linux. Up to 80% of this total time was dedicated to the stochastic saliency map generation step achieved by the MPM sampler (see Section 2.5), which could be easily parallelized using a GPU implementation as described in [83] in order to reduce the computation time by a factor greater than 100.
Firstly, we can notice that the precision-recall curve corresponding to our model was comparable to the best existing state-of-the-art CD algorithms (see Figure 4a) on the ECSSD dataset.
Secondly, we can also notice that our F β -measure curve (Equation (14) with β = 1), as a function of the threshold, was the highest and flattest. This property allowed us to obtain the best F β -measure (F β = 0.727) but also to obtain, by very far, the best MAE score, i.e., the lowest mean absolute error between the continuous saliency map and the binary ground truth (both normalized in the range [0, 1]) compared to all other proposed SD methods (see Table 1). Indeed, the MAE score was proportional to the area under the plotted F β -measure curve (as a function of the threshold) and to the line of equation y = F β -measure = 1 in Figure 4b.
It is also worth mentioning that our F β curve had a very different shape from all the others (very flat even at the ends) which showed us (in addition to showing that the F β measure of our method was better than the other existing state-of-the-art algorithms whatever the value of the threshold (a curve constantly above)), two important things. First, our proposed SD method was very different conceptually and in terms of modeling than the other proposed methods. Second, and more importantly, the fact that the F β measure curve, as a function of this threshold, associated with our model, was also much flatter than the other curves, showed us that our model was less sensitive to the threshold necessary to convert the saliency probability map into a binary saliency map. Graphically speaking, this was indicated by lesser grayscale variations in the estimated soft saliency map (see Figure 7). Perhaps for this reason, a greater confidence could be placed in our SD estimation result, or conversely, there may be less ambiguity in our model for accurately detecting and locating salient areas compared to other methods. In addition, the F β score was better in our case, whereas our precision-recall curve remained comparable to the three best existing state-of-the-art CD methods; this meant that according to (Equation (14), since the F β score was favored by a better precision measure, our model allowed us to obtain the best precision measure. Concerning the MAE score, it is also worth recalling that this evaluation metric is actually quite different from the two F-based evaluation measures. Indeed, unlike the MAE score, the two F-based measures do not take into account the true negative saliency assignments [79], (i.e., the pixels correctly marked as nonsalient, which are, in fact, related to a quite large proportion of pixels in an image). A good MAE score combined with a good F-based measure rewards an SD method that successfully assigns saliency assignments to salient pixels and, unlike the methods with a lower MAE score, does not fail to correctly detect all the nonsalient regions existing in an image, as it should be.
For the MSRA-5000 dataset, the maximal F β we obtained was F β = 0.790 (for a threshold of 136) and F = 0.757 (β = 1, for a threshold of 51), which was a very competitive measure and also the best MAE measure (for the same reason above mentioned, because the F β measure curve, as a function of the threshold, reached a maximum as high as the best curves while remaining flatter than the other curves, thus inducing a lower surface area under the curve).
The estimation of each saliency map from the MSRA-5000 dataset depended on the image size but took, on average, 4.42 s using unoptimized C++ code running on a single core of an Intel i7 CPU with 3.33 GHz and 6675.25 bogomips.
In the future, we will study the fuzzy Markov chain modeling or the combination of a hidden Markov model and a fuzzy logic reasoning framework that will combine, on the one hand, the advantages of a Bayesian statistical analysis of the data with, on the other hand, the inaccuracies and uncertainties of the data on this difficult and imprecisely defined saliency map detection problem [84,85].

Conclusions
In this paper, we presented an unsupervised Markovian model for the saliency map estimation problem. This model was based on an original pixel-pair modeling and a likelihood model that used a parametric mixture of two different class-conditional likelihood distributions whose parameters were adaptively and previously estimated, according to a criterion that mixed maximum likelihood and least squares, for each image, with the iterative conditional estimation algorithm. Once the estimation step was completed, the MPM estimate solution of our model was particularly well-suited to our problem since it allowed us to obtain, by minimizing the Bayesian risk associated to the expected number of mis-(saliency) detection error, either a reliable binary saliency map or its (soft) probabilistic version. This unsupervised data-driven Markovian framework adapted our saliency estimation model to the specific characteristics of each image or/of each dataset in a widely unsupervised manner. Experimental results showed that the proposed algorithm performed very well against state-of-the-art methods for different and complementary measures of good saliency detection and was particularly stable across a wide variety of images. Moreover, the average F-measure curve, as a function of the threshold, associated with our model, also appeared much flatter than the curves associated to other state-of-the-art algorithms, which showed that our model was less sensitive to the threshold necessary to convert the saliency probability map into a binary saliency map comparatively to the other SD methods proposed in the literature. In addition, let us add that our model is perfectible by increasing the number of MPM iterations or by adding other informative prior knowledge (possibly via soft or hard constraints and may be expressed at different levels of abstraction such as pixel, segment, region, pairwise region, etc.), which can be easily integrated in the MPM random simulation scheme.