Efficient Learning of Spatial Patterns with Multi-Scale Conditional Random Fields for Region-Based Classification

Abstract: Automatic image classification is of major importance for a wide range of applications and is supported by a complex process that usually requires the identification of individual regions and spatial patterns (contextual information) among neighboring regions within images. Hierarchical conditional random fields (CRF) consider both multi-scale and contextual information in a unified discriminative probabilistic framework, yet they suffer from two main drawbacks. On the one hand, their current classification performance still leaves room for improvement, mostly due to the use of very simple or inappropriate pairwise energy expressions to model complex spatial patterns; on the other hand, their training remains complex, particularly for multi-class problems. In this work, we investigated alternative pairwise energy expressions to better account for class transitions and developed an efficient parameter learning strategy for the resultant expression. We propose: (i) a multi-scale CRF model with novel energies that involve information related to the multi-scale image structure; and (ii) an efficient maximum margin parameter learning procedure where the associated quadratic program is decoupled into simpler multi-class sub-problems that can be solved dually.


Introduction
In this work, we consider the following problem: after segmenting an image into several regions, classify each region into one of the predefined classes. We refer to this process as region-based image labeling. The result is both a segmentation of the image and a labeling of each region (segment) as a given object class. Automatic (region-based) image labeling is of importance for several applications, such as scene understanding, image database querying, etc.
In natural scenes, the classes of nearby pixels or regions are mutually dependent, and so are their co-occurrences. It is not surprising to find regions of certain classes (e.g., river and tree) located next to each other, but it is very unlikely to find aberrant co-occurrences, such as a building in the middle of a lake. Spatial patterns thus provide valuable contextual information for achieving consistent labeling. Moreover, since segmentation can produce ambiguous regions that are too small or too large, multi-scale analysis allows one to determine each object's class more accurately [1,2], at the most appropriate scale within a range of scales [3].
One of the popular choices to address image classification, including contextual and multi-scale information, has been Markov random fields (MRF) [4,5]. More recently, a discriminative probabilistic framework, named conditional random fields (CRF) [6,7], has been proposed for the same purpose [1,8,9]. However, despite the undeniable improvements of CRF with respect to MRF, two main issues remain common to most state-of-the-art CRF models.
The first issue is that classification results, in several cases, are still unsatisfactory. We associate this problem with inappropriate pairwise energy expressions that fail to model complex spatial patterns (e.g., class transition detections or co-occurrence patterns). This problem can be alleviated by defining proper features for the energies and by defining an appropriate framework that allows more complex energy expressions.
The work of [10] suggests the use of non-linear (exponential) potentials. Similarly, in this work, we model each potential as an unknown function $\psi_c : \mathcal{Y} \to \mathbb{R}$ (of unknown distribution) in a set of possible functions $\psi_c \in \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. We also introduce a set of additional weighting parameters (variables) in order to control the combination of potentials.
The work of Gould et al. [11] discusses the design of pairwise energies and augments them with relative location information, but their segmentation strategy does not allow the use of multi-scale analysis. The authors in [9] use multi-scale information to express pairwise potentials; however, only their local (intra-scale) dependencies involve image structure information (e.g., the number of neighbors), while their long-range (multi-scale or inter-scale) dependencies use only color distances. We also make use of spectral feature differences in the form of distances, and we go further by including hierarchical (multi-scale) image structure information, such as the scale of analysis, the lowest and highest spectral (color) distances to neighboring regions, the number of siblings (the regions sharing the same parent region on the next coarser scale) and the region's lifetime. We follow a strategy closer to [12], where region locations are determined by multi-scale information (inside, tangential, non-tangential, etc.), but based on hierarchical definitions (parents, siblings, neighbors), since our segmentation strategy guarantees a non-overlapping hierarchical decomposition into regions.
The first contribution of this work is a general energy expression that considers:
• possible non-linear potential functions $\psi_c(\cdot;\theta) = \langle \theta, f_\Phi \rangle_{\mathcal{H}}$ (inner product) that differ from the usual $\psi_c(\cdot;\theta) = \theta^T f$ (dot product), where $f$ is a feature vector;
• a weighted mixture of potentials $E_W(\cdot;\theta,d) = \sum_c d_c\, \psi_c(\cdot;\theta)$, with $d_c \ge 0$;
• simpler (in number of features), yet rich and expressive pairwise energies involving contextual and multi-scale information.
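To make the weighted mixture concrete, the following minimal Python sketch (not the authors' code; the potential values and weights are purely illustrative) evaluates $E_W(\cdot;\theta,d) = \sum_c d_c\, \psi_c(\cdot;\theta)$ for one site:

```python
# A minimal sketch of the weighted potential mixture E_W = sum_c d_c * psi_c.
# The potential values and weights below are hypothetical placeholders.
def weighted_energy(potentials, d):
    """Combine per-clique potential values with non-negative weights d_c."""
    assert all(dc >= 0 for dc in d), "weights d_c must be non-negative"
    return sum(dc * psi for dc, psi in zip(d, potentials))

# e.g., unary, intra-scale and inter-scale potential values for one site:
psi_values = [1.7, 0.4, 0.9]                  # psi_A, psi_I, psi_L (illustrative)
print(weighted_energy(psi_values, d=[1.0, 0.5, 0.8]))
```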
The second issue, present in most CRF models, is related to the complexity of the parameter learning stage. The conditional maximum likelihood estimation (CMLE) approach requires approximations to estimate the partition function [8,13,14] and achieves relatively simple and accurate parameter learning using gradient descent optimization methods [15]. Maximum margin approaches do not involve the partition function in their formulation, but pose a challenging quadratic program (QP) for the parameter learning, usually solved with (online) sub-gradient [10,16] or cutting-plane methods [17-19].
When using any of them in the context of CRF parameter learning, another challenge is the need to solve an inference problem at each iteration of the optimization loop, which increases the computational burden.
Other training strategies, such as the piecewise training framework [20] and its variants [21], propose simpler formulations of the parameter learning problem and have been shown to be very competitive [19,21]. They have been extensively used in the context of CMLE [11,19,21-23].
An example is the work in [24], where the authors start from a decomposition of the dual formulation of a max-margin QP [25] and introduce a piecewise pseudo-likelihood (PWPL) estimator [21] in order to reduce the (per iteration) inference complexity during training.
In this paper, we consider a particular set of values for the energy weighting coefficients ($d_c$), where the final energy form involves only single sites and their corresponding pairwise factors, while in [24], the single-site factors of neighboring sites are involved, too. Then, in order to learn the parameters, we decouple the associated QP into simpler individual multi-class sub-problems that can be efficiently solved dually using the Crammer and Singer fixed-point method [26].
The second contribution of this work is a more general energy expression considering contextual and multi-scale information within a max-margin, piecewise parameter learning approach, together with a novel strategy to tackle the resulting parameter learning problem. The proposed energy expression connects to and generalizes other works, recovering the same expression as in [22,24] for particular weighting schemes (values of the weights).
The remainder of this paper is organized as follows. The multi-scale CRF formulation is provided in Section 2. The proposed training strategy is detailed in Section 3, while the inference process is described in Section 4. Experimental results, using remote sensing images, are reported and discussed in Section 5, and conclusions are given in Section 6.

Multi-Scale Conditional Random Fields for Image Classification
The CRF formulation directly models the labels conditioned on the observations, decomposing the energy expression according to the different clique orders. The clique decomposition theorem defines the overall form of a general CRF model as follows:
$$P(x \mid y, \theta) = \frac{1}{Z(y, \theta)} \prod_{c \in C} \psi_c(x_c, y, \theta), \quad (1)$$
where $Z(y, \theta)$ is a normalization function named the partition function.
The above model is known to follow an exponential distribution [4], while the potentials $\psi_c$ are functions defined over cliques $c \in C$, usually assumed to be linear and expressed as an inner product between features of $y$ and model parameters $\theta$ [1,7,8,22]; $x_c$ are the labels restricted to factors of order $c \in C$.
This general CRF model can be specified for (defined on) the multi-scale region-based image labeling context. This includes an appropriate potential definition and features for the energy expression, among other elements. For region-based image labeling, some other stages are involved in the overall process, such as image pre-processing (filtering), segmentation, parameter learning and classification. The general framework is illustrated in Figure 1.

PDE-Based Smoothing
In order to avoid blurring and the delocalization of image features, an image-adaptive scale-space filter is used. In this work, we opted for a method that guides the filtering process in such a way that intra-region smoothing is preferred over inter-region smoothing and edges are gradually enhanced. The employed filter [27] belongs to the class of nonlinear anisotropic diffusion filters and corresponds to the regularization by Catté et al. [28] of the Perona and Malik filter [29].
Let $I = \{I^{(1)}, I^{(2)}, \ldots, I^{(R)}\}$ be a vector-valued image defined on a finite domain $\Omega$. The multiscale tower $u$ of $I$ is governed by the following system of coupled parabolic partial differential equations (PDEs):
$$\frac{\partial u^{(r)}}{\partial t} = \mathrm{div}\left(g(|\nabla u_\sigma|)\, \nabla u^{(r)}\right), \quad u^{(r)}(0) = I^{(r)}, \quad \left.\frac{\partial u^{(r)}}{\partial n}\right|_{\delta\Omega} = 0, \quad (2)$$
where $u^{(r)}$ represents the $r$-th image band, $t$ is the continuous scale parameter, $\delta\Omega$ is the image boundary (with $n$ denoting the normal direction to it) and $\sigma$ is a regularization parameter, which ensures the well-posedness of the above system; $g$ is the Lorentzian edge-stopping function [30], which is formulated as:
$$g(|\nabla u_\sigma|) = \frac{1}{1 + \left(\frac{|\nabla u_\sigma|}{k}\right)^2}. \quad (3)$$
The so-called contrast parameter $k$ in Equation (3) separates backward from forward diffusion and is estimated using the cumulative histogram of the regularized gradient magnitude ($|\nabla u_\sigma|$) [27]. In this work, a discrete version of the multiscale tower $u$, denoted as $U = \{u_0, \ldots, u_n, \ldots, u_N\}$, is obtained by applying the natural scale-space sampling method [31]. From the scale-space tower, we select a scale, referred to as the localization scale, at which the diffusion filter has removed the noise without affecting or dislocating the edges of the salient features in the image, and we apply to it the multi-scale region adjacency graph (MSRAG) generation process described in Section 2.2. The localization scale is determined using the maximum correlation criterion proposed in [32].
The maximum correlation criterion selects the scale that maximizes the correlation between the original image and the diffused image. The correlation is computed from $\sigma[u(t)]$ and $\sigma[u(0)]$, the standard deviations of the diffused image at scale $t$ and of the noisy image, respectively, and from $\sigma[n_{out}(t)]$ and $\sigma[n_{out}(0)]$, the standard deviations of the outlier noise of the diffused image at scale $t$ and of the noisy image, respectively. The values of $\sigma[n_{out}(t)]$ and $\sigma[n_{out}(0)]$ are estimated using the median absolute deviation of the gradient of the diffused and original images (see [33] for details).
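As an illustration, the sketch below implements one explicit time step of the regularized diffusion together with a generic scale-selection loop. It assumes a single-band image, and the step size `tau`, pre-smoothing `sigma` and the criterion callable are illustrative choices, not values prescribed by the paper:

```python
# A minimal 2D sketch of one explicit step of the regularized Perona-Malik
# diffusion with the Lorentzian edge-stopping function g(s) = 1/(1 + (s/k)^2).
import numpy as np
from scipy.ndimage import gaussian_filter

def diffusion_step(u, k, sigma=1.0, tau=0.2):
    u_sigma = gaussian_filter(u, sigma)            # regularization (Catte et al.)
    gy, gx = np.gradient(u_sigma)
    g = 1.0 / (1.0 + (gx**2 + gy**2) / k**2)       # Lorentzian edge-stopping
    uy, ux = np.gradient(u)
    # divergence of g * grad(u), approximated with central differences:
    div = np.gradient(g * ux, axis=1) + np.gradient(g * uy, axis=0)
    return u + tau * div

def localization_scale(tower, criterion):
    """Select the scale of the discrete tower maximizing a correlation criterion."""
    scores = [criterion(u_t, tower[0]) for u_t in tower]
    return int(np.argmax(scores))
```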

Multi-Scale Region Adjacency Graph (MSRAG)
In this work, image classification is treated as a hierarchical labeling problem applied to the nodes $S$ of a multi-scale region adjacency graph (MSRAG). The MSRAG represents a nested hierarchy of partitions, $S_n = \{r^n_1, r^n_2, \ldots, r^n_{m_n}\}$, $n = 1, \ldots, N$ (the $r^n_j$ being disjoint regions), which preserves the inclusion relationship $S_n \supseteq S_{n-1}$, implying that each atom of the set $S_n$ is a disjoint union of atoms from the set $S_{n-1}$. In other words, a partition at a coarse level is obtained by merging regions of the finer partition. The MSRAG is defined as a graph $G = (S, E)$, with the set of nodes $S$ being the partitions at each scale, $S = S_1 \cup S_2 \cup \cdots \cup S_N$, and the edges $E$ being the set of common boundaries between the regions at each scale (intra-scale) and the inter-scale (parent-child) links; see Figure 2.
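A schematic data structure for the MSRAG could look as follows; the field names and dictionary-based representation are hypothetical, chosen only to make the intra-scale/inter-scale distinction explicit:

```python
# A schematic sketch (hypothetical field names) of the MSRAG as a plain graph:
# nodes are regions indexed by (scale, region_id); intra-scale edges connect
# adjacent regions of the same partition, inter-scale links encode parent/child.
from dataclasses import dataclass, field

@dataclass
class MSRAG:
    scales: int
    regions: dict = field(default_factory=dict)    # (n, j) -> region features
    intra_edges: set = field(default_factory=set)  # {((n, i), (n, j)), ...}
    parent: dict = field(default_factory=dict)     # (n, j) -> (n + 1, p)

    def siblings(self, node):
        """Regions sharing the same parent as `node` (excluding itself)."""
        p = self.parent.get(node)
        return [c for c, cp in self.parent.items() if cp == p and c != node]

    def children(self, node):
        return [c for c, p in self.parent.items() if p == node]
```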
Hierarchical feature representation through multi-scale segmentation offers new possibilities in object-oriented and multi-scale image analysis [34].
State-of-the-art object-based and object-oriented segmentation algorithms are often based on region growing and merging, or on linear and non-linear scale-spaces. In the proposed framework, starting from the localization scale $s_0$ (obtained using the PDE of Section 2.1), we follow the approach of [35], where the input multi-spectral image, $u_0$, is first segmented using the watershed transform. Then, the waterfall algorithm [36] is used to produce a nested hierarchy of partitions: the multi-scale region adjacency graph (MSRAG), as illustrated in Figure 2. Such a hierarchy preserves the topology of the initial watershed lines and extracts homogeneous objects at larger scales. The waterfall algorithm removes from the current partition (hierarchical level) all of the boundaries completely surrounded by higher boundaries; thus, the saliency of a boundary is measured with respect to its neighborhood. The iteration of the waterfall algorithm ends with a partition of only one region. To apply the watershed, the gradient of the multi-spectral image is obtained by combining, using the approach of [37], the gradients of the texture (orientations) and the DiZenzo [38] multi-spectral gradient. For producing the nested hierarchy, we use the approach proposed in [35], where the saliency measure of a boundary, $E(r = r_i \cup r_j \mid r_i, r_j)$, between two neighboring segments $r_i$ and $r_j$ (being the cost of merging the regions $r_i$ and $r_j$) is defined as a combination of single-segment energies, $E(r)$, and pair-wise energies. The single-segment properties include area, convexity, compactness and color variances within the segment; the pair-wise properties include color mean differences between two segments and edge strength. The individual energies associated with the segment properties are given in Appendix C, and additional details can be found in [35].

Hierarchical Labeling for Region Classification
The labeling is performed using a finite set $L = \{1, \ldots, L\}$ of interpretation labels. We consider a couple of random fields $(X, Y)$ on $G$, with $X = \{X^n, \forall n \in [1, \ldots, N]\}$, $X^n = \{X^n_s, \forall s \in S_n\}$ being the label field, and $Y = \{Y^n, \forall n \in [1, \ldots, N]\}$, $Y^n = \{Y^n_s, \forall s \in S_n\}$ the observation field. In our current implementation, $Y^n_s$ represents the region properties at a certain site $s$ and scale $n$; a similar notation holds for the realizations $x$ and $y$. We define the general form of the proposed multi-scale CRF for determining the optimal labeling as:
$$P(x \mid y, \theta) = \frac{1}{Z(y, \theta)} \exp\Big( \sum_{n=1}^{N} \sum_{s \in S_n} \big[ d_1\, \psi_A(x^n_s, y; \omega) + d_2 \sum_{i \in \eta(s)} \psi_I(x^n_s, x^n_i, y; \beta^I) + d_3\, \psi_L(x^n_s, x^{n+1}_{\bar{s}}, y; \beta^L) \big] \Big),$$
with $\bar{s}$ the parent of a node $s$, $\eta(i)$ the set of neighbors of site $i$ at the considered level and the parameters $\theta = (\omega, \beta^I, \beta^L)^T$ defined in order to differentiate intra-scale and inter-scale spatial patterns; the weighting parameters $d_m \in \mathbb{R}^+$ regulate each potential's individual contribution to the overall energy expression. The pairwise potentials $\psi_I(\cdot; \beta^I)$ and $\psi_L(\cdot; \beta^L)$ encode contextual information between neighbors within the same scale and between parent/child regions at any two consecutive scales, respectively, while $\psi_A(\cdot; \omega)$ is the association potential, which provides local evidence. The considered model interactions are approximately represented in Figure 3. In order to make the representation simpler to follow, we illustrate only parent-child links and four neighbors, while in a real segmentation, regions may have any number of neighbors. This structure falls within the research line of higher-order potential design [39,40].
The partition function $Z(y, \theta)$ is defined as the sum, over every possible realization $x$, of the per-clique model energies, $Z(y, \theta) = \sum_x \prod_{c \in C} \psi_c(x_c, y, \theta)$, where $C$ accounts for cliques up to second order. The number of possible value combinations for $x$ equals the number of different labels raised to the number of sites ($|L|^{|S|}$); because of this, the exact computation of $Z(y, \theta)$ very easily becomes intractable, even for small graphs with a low number of labels.
For the labeling, we consider inter-scale interactions directed from parent to child, aiming to collect the classification results at the finest scale (see Figure 3), which provides the final classification map. In the considered structure, during classification, the information flows from coarser (higher) to finer (lower) scales to regulate long-range label consistency and flows within the same scale to capture local interactions between neighboring regions.

Potential Definition
The different potentials are defined as follows:
$$\psi_A(x_s = l, y; \omega) = \langle \omega_l, f_{y,\Phi} \rangle_{\mathcal{H}}, \quad (12)$$
$$\psi_L(x_s = l, x_{\bar{s}} = k, y; \beta^L) = \langle \beta^L_{1_{(l=k)}}, h^L_{y,\Phi} \rangle_{\mathcal{H}}, \quad (13)$$
$$\psi_I(x_s = l, x_i = k, y; \beta^I) = \langle \beta^I_{1_{(l=k)}}, h^I_{y,\Phi} \rangle_{\mathcal{H}}, \quad (14)$$
where $f_y$ is a unary feature vector, while $h^L_y$ and $h^I_y$ are pairwise feature vectors. The function $1_{(a=b)}$ returns one if the condition $(a = b)$ is met and zero otherwise. The separation among the different classes is determined by the above inner products, with $\omega_l \in \mathcal{H}$ a total of $L$ unknown parameter vectors ($l \in L$), while $\beta^L_{1_{(l=k)}}$ and $\beta^I_{1_{(l=k)}}$ are two and two unknown parameter vectors, respectively (four in total, with $1_{(l=k)} \in \{0, 1\}$).
The (mapped) features $f_{y,\Phi}$, $h^L_{y,\Phi}$ and $h^I_{y,\Phi}$ are the feature vectors ($f_y$, $h^L_y$ and $h^I_y$) mapped by a function $\Phi : \mathcal{Y} \to \mathcal{H}$. In practice, neither the mapping function $\Phi(\cdot)$ nor the space $\mathcal{H}$ needs to be explicitly defined; it is sufficient to define a symmetric and positive definite kernel function $K : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ to inherit both the hypothesis space $\mathcal{H}$, as its unique reproducing kernel Hilbert space (the Moore-Aronszajn theorem [41]), and the mapping function $\Phi(y) = K(y, \cdot)$. Note that the distribution of the potentials $\psi_c(\cdot; \theta)$ is assumed to be unknown. For the particular case of linear kernel functions, our work recovers the usual linear potential functions, while any other kernel function yields more complex potential functions. A bias value can easily be introduced in Equations (12)-(14) (if wanted or necessary) by means of standard algebraic methods.
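The practical consequence is that a potential can be evaluated purely through kernel calls. The sketch below is an assumption-laden illustration (the RBF kernel and `gamma` are arbitrary choices) of evaluating $\psi(f) = \langle \theta, \Phi(f) \rangle_{\mathcal{H}}$ via the representer form $\theta = \sum_i \alpha_i \Phi(f_i)$ recovered later from the KKT conditions (Section 3.3):

```python
# A minimal sketch of a non-linear potential evaluated through a kernel,
# without ever forming Phi explicitly: psi(f) = sum_i alpha_i * K(f_i, f).
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def potential(f, support_feats, alphas, kernel=rbf_kernel):
    """Representer-form evaluation of psi(f) = <theta, Phi(f)>_H."""
    return sum(a * kernel(fi, f) for a, fi in zip(alphas, support_feats))
```

With a linear kernel, this reduces to the usual dot-product potential; any other valid kernel yields a non-linear potential, as stated above.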
In this work, we consider class transition detection for the pairwise potential modeling, which can be seen as two binary models: a binary model (does a transition occur: yes or no?) for the intra-scale potential and a binary model (is the region merging correct: yes or no?) for the inter-scale potential. However, the model has been designed for a broader multi-class pairwise potential definition, as two multi-class co-occurrence models: a multi-class model of $L \times L$ classes (does the transition from class $a$ to class $b$ occur: yes or no?) for the intra-scale potential and a multi-class model of $L \times L$ classes (is the merging of a region of class $a$ with one of class $b$ correct: yes or no?) for the inter-scale potential.

Unary Local Features
The association potential, $\psi_A(\cdot; \omega)$, is defined as the probability of a site acquiring a certain label considering the site observations independently. The single-site observation feature vector $f_y$ is obtained as a concatenation of different features computed from an observation $y$. The following features are considered in the current implementation:
1. Spectral: the spectral signature of a region $y$ is characterized, for each channel, by its mean, standard deviation, kurtosis and skewness.
2. Textural: the texture of a region is characterized using the local binary pattern [42], from which we compute the region's mean value, repeating the process for eight different radius (neighborhood) values.
3. Morphological: for the shape information, we compute the following morphological features: elongation, area-to-length ratio and extent.
4. Scale: the actual scale of the region in question.
Considering a color RGB image, the resultant feature vectors will contain 24 components.
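A hedged sketch of how such a 24-component vector could be assembled is given below; the exact LBP settings and the shape definitions are our assumptions, not specifications from the paper:

```python
# Unary feature vector f_y for an RGB region: per-channel mean/std/kurtosis/
# skewness (12), mean LBP response at eight radii (8), three morphological
# features plus the scale (4) -- 24 components in total.
import numpy as np
from scipy.stats import kurtosis, skew
from skimage.feature import local_binary_pattern

def unary_features(rgb_pixels, gray_patch, region_mask, props, scale):
    spectral = np.concatenate([
        [ch.mean(), ch.std(), kurtosis(ch), skew(ch)]
        for ch in rgb_pixels.T])                       # 4 stats x 3 channels
    textural = [local_binary_pattern(gray_patch, P=8 * r, R=r)[region_mask].mean()
                for r in range(1, 9)]                  # eight radii (assumption)
    morpho = [props.major_axis_length / max(props.minor_axis_length, 1e-6),
              props.area / max(props.perimeter, 1e-6), # area-to-length ratio
              props.extent]                            # elongation, ratio, extent
    return np.concatenate([spectral, textural, morpho, [scale]])
```

Here `props` is assumed to be a `skimage.measure.regionprops` entry for the region; any equivalent shape descriptor source would do.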

Inter-Scale Pairwise Features
The inter-scale potential, $\psi_L(\cdot; \beta^L)$, aims at modeling the probability distribution of a good region merging having taken place during the segmentation process (i.e., detecting segmentation errors). This potential function is based on the observations of the region in question (named the actual region) and on additional information, such as:
1. its parent region: a good merging is supposed to keep spectral homogeneity between the actual region and its parent;
2. all of its siblings: a region that is merged with several siblings is supposed to keep spectral homogeneity with each one of them;
3. the number of siblings: the greater the number of regions that are merged (the number of siblings), the more likely it is for a bad merging to occur;
4. the scale of the actual region: in hierarchical segmentation, higher levels are more likely to exhibit regions that combine multiple objects (bad merging);
5. the lifetime of the actual region: whenever a region exhibits a high lifetime (strong edges), it is because it can be well differentiated (in terms of spectral homogeneity) from its neighboring regions; therefore, merging regions with a high lifetime suggests a bad merging.
For the inter-scale pairwise energy expression, we combine all of the previous information in a five-dimensional vector $h^L_y$, built from the spectral distance to the parent region, the spectral distances to the siblings, the number of siblings, the scale and the lifetime, with $\varsigma(s)$ being all of the siblings (the regions sharing a parent with the region in question), $\text{lifetime}(s)$ the number of scales at which a region remains the same (without being merged) according to the MSRAG structure, $n$ the level of the actual region in the MSRAG, $|\varsigma(s)|$ the number of siblings of the region in question and $D(\cdot, \cdot)$ the Bhattacharyya distance between the histograms of any two regions.
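The following sketch builds such a five-dimensional vector; since the text does not specify how the sibling distances are aggregated into a single component, the mean is used here as an assumption:

```python
# Inter-scale feature vector h^L_y, with D the Bhattacharyya distance between
# two (normalized) region histograms.
import numpy as np

def bhattacharyya(h1, h2, eps=1e-12):
    h1, h2 = h1 / h1.sum(), h2 / h2.sum()             # normalize histograms
    return -np.log(np.sum(np.sqrt(h1 * h2)) + eps)    # Bhattacharyya distance

def inter_scale_features(hist, parent_hist, sibling_hists, scale, lifetime):
    sib = [bhattacharyya(hist, h) for h in sibling_hists]
    return np.array([bhattacharyya(hist, parent_hist),  # homogeneity w.r.t. parent
                     np.mean(sib) if sib else 0.0,      # aggregated sibling distance
                     len(sibling_hists),                # number of siblings
                     scale,                             # level n in the MSRAG
                     lifetime])                         # region lifetime
```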

Intra-Scale Pairwise Features
The intra-scale potential, $\psi_I(\cdot; \beta^I)$, aims at modeling the probability distribution of two regions sharing the same label. This potential function is based on the observations of the two regions in question, but also benefits from additional information, such as:
1. the corresponding neighbor: a region that shares the same label with a neighbor is supposed to keep spectral homogeneity with it;
2. the lowest neighbor distance: the lowest spectral distance over all neighbors provides a measurement of how homogeneous the area surrounding the region in question is;
3. the highest neighbor distance: the highest spectral distance over all neighbors complements the previous measurement, bounding the homogeneity of the surrounding area from the other side;
4. the scale of the actual region: in hierarchical segmentation, higher levels are more likely to exhibit regions that combine multiple objects (bad merging).
For the intra-scale pairwise energy expression, we concatenate all of the previous information in a four-dimensional vector,
$$h^I_y(s, i) = \big[ D(y_s, y_i),\ \min_{j \in \eta(s)} D(y_s, y_j),\ \max_{j \in \eta(s)} D(y_s, y_j),\ n \big],$$
with $n$ being the actual processing scale of site $s$ in the MSRAG and $D(\cdot, \cdot)$ the Bhattacharyya distance between the histograms of the two regions.
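The analogous intra-scale vector can be sketched as follows, reusing the `bhattacharyya()` helper from the previous sketch (again an illustration rather than the authors' code):

```python
# Intra-scale feature vector h^I_y for a site s and one of its neighbors i.
import numpy as np

def intra_scale_features(hist_s, hist_i, neighbor_hists, scale):
    d_all = [bhattacharyya(hist_s, h) for h in neighbor_hists]
    return np.array([bhattacharyya(hist_s, hist_i),   # distance to this neighbor
                     min(d_all),                      # lowest neighbor distance
                     max(d_all),                      # highest neighbor distance
                     scale])                          # level n in the MSRAG
```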

Piecewise Training Framework
The piecewise training framework was proposed by [20] in order to reduce the complexity of the partition function estimation in CMLE approaches; it consists of dividing the full model into pieces $P \in T$, according to a set of piece types $T$ [21]. An illustration of the standard piecewise framework ($P = C_2$) can be seen in Figure 4b. From the pieces $P$, a new family of upper bounds $\log Z_P(y, \theta)$ can be used to provide a simpler (tractable) estimator [20] of the log-partition function, $\log Z(y, \theta) \le \sum_{P \in T} \log Z_P(y, \theta)$. This allows one to re-write the log-likelihood of Equation (1) as follows:
$$\log P(x \mid y, \theta) \approx \sum_{P \in T} \Big( \sum_{c \in P} \log \psi_c(x_c, y; \theta) - \log Z_P(y, \theta) \Big), \quad (18)$$
where $\psi_c(\cdot; \theta)$ is the per-clique and per-piece-type energy given $\theta$ and $Z_P(y, \theta)$ the local partition function estimator per model piece type $P$ [21]. Based on Equation (18), gradient descent techniques provide a simple way to learn $\theta$.

Max-Margin Parameter Learning
Within the max-margin parameter learning framework, the problem is posed differently. The parameter learning aims to find a $\theta$ such that, for the labels $\bar{x}$ and observations $y$ observed in the training samples, $P(\bar{x} \mid y, \theta)$ is the highest possible among all possible labelings:
$$P(\bar{x} \mid y, \theta) \ge P(x \mid y, \theta), \quad \forall x \ne \bar{x}.$$
This guarantees that $\arg\max_x P(x \mid y, \theta)$ attains its maximum at $x = \bar{x}$. Considering the clique decomposition theorem (Equation (1)), the previous inequality takes the following form:
$$\frac{1}{Z(y, \theta)} \prod_{c \in C} \psi_c(\bar{x}_c, y, \theta) \ge \frac{1}{Z(y, \theta)} \prod_{c \in C} \psi_c(x_c, y, \theta),$$
therefore:
$$\prod_{c \in C} \psi_c(\bar{x}_c, y, \theta) \ge \prod_{c \in C} \psi_c(x_c, y, \theta), \quad (20)$$
where the partition function $Z(y, \theta)$ no longer needs to be computed.
In the max-margin learning approach, we want the previous inequality (Equation (20)) not only to hold, but to hold with maximum margin. Denoting by $E = \sum_{c \in C} \psi_c(\bar{x}_c, y; \theta)$ the energy of the ground-truth labeling and by $\hat{E} = \sum_{c \in C} \psi_c(x_c, y; \theta)$ that of any other labeling $x$, and introducing a generic loss function $\ell(x, \bar{x})$, the parameter learning can be formulated as the following optimization problem:
$$\min_\theta \frac{1}{2} \|\theta\|^2 \quad \text{s.t.} \quad E - \hat{E} \ge \ell(x, \bar{x}), \quad \forall x \ne \bar{x}, \quad (23)$$
which is similar to the max-margin parameter learning problems posed in [18], without considering noisy data (hard margin). Cutting-plane or sub-gradient methods are a common choice to solve this problem, requiring a full inference at every iteration step [16,18]. This is the first drawback associated with max-margin parameter learning. The second drawback is related to the number of constraints induced by $x \ne \bar{x}$ in Equation (23), which is exponential in the number of sites: $|L|^{|S|} - 1$.

Proposed Efficient Max-Margin Parameter Learning
In our proposed learning strategy, we avoid using sub-gradient and cutting-plane methods by decomposing the original problem into several individual multi-class sub-problems that can be efficiently solved dually, and we significantly reduce the number of constraints induced by $x \ne \bar{x}$. Starting from Equation (23), a usual assumption [43] is that, like the per-clique energy functions $\psi_c(\cdot; \theta)$, the loss function $\ell(x, \bar{x})$ is decomposable according to the different clique orders. Using the notation $\ell_P = \ell_P(x_P, \bar{x}_P)$, we can state that $\ell(x, \bar{x}) = \sum_{P \in T} \ell_P(x_P, \bar{x}_P)$. Then, for any model partition (as simple or as complex as wanted) under $T$, we propose to decouple our formulation of Equation (23) into the following set of individual multi-class sub-problems (see Appendix A):
$$\min_\theta \frac{1}{2} \|\theta\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad E_P - \hat{E}_P \ge \ell_P - \xi_i, \quad \forall i = 1, \ldots, N_P, \ \forall x_i \in \Omega_P,$$
where $\Omega_P$ is the set of label combinations for a $P$-order clique (see Appendix A) and $N_P$ the total number of related factors [21]; $x_i \in \Omega_P$ represents the possible labels of the $i$-th factor ($i = 1, \ldots, N_P$), $\xi_i$ are slack variables and $\bar{x}_i$, along with $y_i$, are the corresponding ($i$-th) ground-truth labels and observations, respectively, present in the training samples. Note that, since $T$ defines the relations of each site with its neighborhood in a Markov blanket (the set of its parents, children and siblings), it determines the number of constraints for each $i$-th site, as well as the energy form related to each individual site $i$. Since they have a direct impact on the model complexity, simple model partitions, such as piecewise (PW) or piecewise pseudo-likelihood (PWPL), are often preferred, but any structure can be defined for $T$.
The per-site energy expression, usually defined as $E(\cdot; \theta) = \sum_c \psi_c(\cdot; \theta)$, can take different forms and vary according to the site's relations with its neighbors in the Markov blanket. It is known that such simple energy expressions are sometimes inappropriate, and extra weighting terms are necessary to adjust the combination of potentials [9,18,22,39].
We propose a more general formulation, $E_W(\cdot; \theta, d)$, for the per-site energy expression, which involves a weighted mixture of potential functions, resulting in a weighted combination of local, contextual and multi-scale information:
$$E_W(\cdot; \theta, d) = d_1\, \psi_A(\cdot; \omega) + d_2\, \psi_I(\cdot; \beta^I) + d_3\, \psi_L(\cdot; \beta^L) + d_4\, \psi^I_A(\cdot) + d_5\, \psi^L_A(\cdot).$$
In order to establish connections with other energy expressions, we define the model structure $G$ according to some MSRAG and consider, without loss of generality, some model partition, such as the PWPL proposed in [21] and assumed in [24].
As can be seen, $\psi^I_A(\cdot)$ and $\psi^L_A(\cdot)$ are unknown potential functions involving single-site factors of neighbors within the same scale and at parent/child scales, respectively. The weighting terms $d_m$, $m = 1, \ldots, 5$, control the mixture of the different sources of potential information.
The proposed energy expression, $E_W$, generalizes the one used in the work of Alahari et al. [24], referred to as $E_{PWPL}$: assuming a fixed set of weights $d = (1, 1, 0, 1, 0)$, both energy expressions become equivalent, $E_W \equiv E_{PWPL}$. It also generalizes the one used in [22] within the context of CMLE, referred to as $E_{PW}$, since, for the particular case of $d = (1, 1, 0, 0, 0)$, $E_W \equiv E_{PW}$. Then, we introduce the weighted potential mixture into our learning strategy to obtain the following set of QP problems:
$$\min_\theta \frac{1}{2} \|\theta\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad E_W - \hat{E}_W \ge \ell_P - \xi_i, \quad \forall i = 1, \ldots, N_P, \ \forall x_i \in \Omega_P. \quad (28)$$
We consider values for $d$ such that $d_m \in \mathbb{R}^+$, $m = 1, \ldots, 5$, with $d_4 = d_5 = 0$, and solve the parameter learning in a max-margin approach, where each resultant problem of Equation (28) takes the form of a standard multiclass support vector machine (SVM) model, very similar to the work of Crammer and Singer [26], that can be efficiently solved dually using the fixed-point method.
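As a rough illustration of the decoupled learning stage, each piece type can be trained as an independent Crammer-Singer multi-class SVM. The sketch below uses scikit-learn's LinearSVC as a stand-in solver; note that it is linear, whereas the dual fixed-point solver of [26] also accommodates the kernelized potentials defined above:

```python
# One independent multi-class QP per piece type; the dictionary layout and
# label encodings below are illustrative assumptions, not the authors' code.
from sklearn.svm import LinearSVC

def train_pieces(pieces):
    """pieces: dict mapping piece type -> (features, labels) training arrays."""
    models = {}
    for name, (F, x_bar) in pieces.items():
        clf = LinearSVC(multi_class='crammer_singer', C=1.0)
        clf.fit(F, x_bar)                 # solve this piece's QP independently
        models[name] = clf
    return models

# e.g., models = train_pieces({'unary': (f_y, labels),
#                              'intra': (h_I, transition_labels),   # {0, 1}
#                              'inter': (h_L, merge_labels)})       # {0, 1}
```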
Similar to [39] for max-margin and to [22] for CMLE, we assume a fixed value for $d$ during the training stage and tune it afterwards on a validation set.
Since each individual problem is solved dually, the obtained set of optimal dual variables ($\alpha^*$-coefficients) is related, under a strong duality assumption (see [25,26]), to the primal variables ($\omega$, $\beta^L$ and $\beta^I$) through the Karush-Kuhn-Tucker (KKT) conditions. Therefore, the optimal $\omega_l$ is:
$$\omega_l = \sum_i \alpha^*_{i,l}\, \Phi(y_i),$$
where the $\alpha^*_{i,l}$ are the (already optimal) $i$-th dual variables (also known as support vector coefficients) for class $l$ and $y_i$ is the observation of the $i$-th sample; the same holds for $\beta^I$ and $\beta^L$. For more details, we refer to [26].
The proposed training strategy thus reduces to solving, for each piece type $P \in T$, the corresponding multi-class sub-problem dually and recovering the primal parameters from the optimal dual variables. Regarding the observations: since the training data are a limited set of labeled pixels (regions) from the original image, they are structureless and have no related hierarchical information. Therefore, in our model, they are represented by the observations at the first (finest) scale, $y^1_s$, only. In order to obtain the training samples at the other (higher) scales, we propagate this information to the next coarser scales (see Figure 5), considering that a parent region will have the same label as its children only if all of the children share the same label. When the children have different labels, we select the most represented class by counting the number of pixels related to each class and taking the highest count, with $1_{(\bar{x}^n_i = l)} = 1$ if the region $r^n_{s_i}$ has been labeled with label $l$ in the ground-truth and zero otherwise, and $|y^n_i|$ the number of pixels inside the region in question. Then, a region at a higher scale inherits this class according to the following condition: the parent region receives the label of the class most represented among its children regions only if the total number of known (labeled) pixels in its children covers more than a certain percentage of the parent region. Such a percentage is given by the threshold $th$; otherwise, the parent remains an unknown (non-labeled, represented by $\bar{x}^{n+1}_s = 0$) region, and therefore, it will not be part of the training sample set. A similar threshold has been used in [39] for the same purpose. We usually keep this value at $th = \frac{1}{L}$, where $L$ is the total number of available classes.
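The label propagation to coarser scales can be sketched as follows (a minimal illustration of the pixel-weighted majority-vote rule and the threshold $th$):

```python
# A parent inherits the majority class of its children, weighted by pixel count,
# only when the labeled pixels cover more than `th` of the parent's area
# (0 marks an "unlabeled" region, as in the text).
import numpy as np

def propagate_labels(child_labels, child_sizes, parent_size, n_classes, th=None):
    th = th if th is not None else 1.0 / n_classes
    votes = np.zeros(n_classes + 1)                   # index 0 = unlabeled
    for lab, size in zip(child_labels, child_sizes):
        votes[lab] += size                            # pixel-weighted vote
    labeled = votes[1:].sum()
    if labeled / parent_size <= th:
        return 0                                      # remains unlabeled
    return int(np.argmax(votes[1:])) + 1              # most represented class
```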

Inference
Once the optimal parameters $\theta^*$ have been learned, we can use them to label new samples by finding the most likely label configuration given the observations and the parameters, $\arg\max_{x \in X} P(x \mid y, \theta^*)$, with the per-site energy given by the weighted potential mixture:
$$E_W(\cdot; \theta^*, d) = d_1\, \psi_A(\cdot; \omega) + d_2\, \psi_I(\cdot; \beta^I) + d_3\, \psi_L(\cdot; \beta^L). \quad (33)$$
For multiclass problems structured as a general graph with loops, the inference can be performed using several methods, among them message passing algorithms, such as loopy belief propagation (LBP) [44] or tree re-weighted LBP (TR-LBP) [45], and move-making (graph-cut) algorithms, such as $\alpha/\beta$-swap [46] or $\alpha$-expansion [46].
We selected the LBP method for the inference process, which approximates the marginalized posterior by values named beliefs, computed using a message passing algorithm [44].
Initially, each site's belief equals its association potential. Then, the neighbors and parent of each region iteratively influence the site's belief value until all of the belief values settle, providing the final belief values as an approximation of the marginalized posterior.
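A compact sum-product version of this message passing scheme is sketched below for a generic pairwise model; the data layout (per-site unary arrays, per-edge potential matrices) is an illustrative assumption rather than the paper's implementation:

```python
# Sum-product loopy BP: messages m_{ts}(x_s) are iterated until they settle,
# and the resulting beliefs approximate the marginal posteriors.  `unary[s]`
# is a length-L array; `pairwise[(t, s)]` an L_t x L_s compatibility matrix.
import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=30):
    msgs = {(t, s): np.ones(len(unary[s])) for (t, s) in edges}
    for _ in range(n_iters):
        for (t, s) in edges:
            prod = unary[t].copy()        # incoming messages at t, excluding s
            for (u, v) in edges:
                if v == t and u != s:
                    prod *= msgs[(u, t)]
            m = pairwise[(t, s)].T @ prod             # marginalize over x_t
            msgs[(t, s)] = m / m.sum()                # normalize for stability
    beliefs = {}
    for s in unary:
        b = unary[s].copy()
        for (u, v) in edges:
            if v == s:
                b *= msgs[(u, s)]
        beliefs[s] = b / b.sum()
    return beliefs
```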
Note that different classification maps are obtained from the inference process at different scales, and coarser-scale results influence the finer classification maps during inference (see Figure 6). This influence consists in the enforcement of long-range dependencies from coarser scales to finer scales. The convergence properties of LBP are associated with the properties of the Bethe free energy: LBP converges to stationary points of this free energy approximation; see [44] for further details. Recalling Equation (33), it is possible that the terms of the individual potentials are correlated, which can lead to over-counting problems during the inference. Once the parameters $\theta^*$ have been found for a fixed weighting scheme, as in [9,18,22,39], we learn the weights $d_1, d_2, d_3$ via cross-validation.

Experimental Results and Discussion
For the evaluation of the proposed model, we selected two well-known remote sensing datasets, namely the Pavia City Center and Pavia University scenes (both images available online at http://www.ehu.es/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes), and a high-resolution aerial RGB image. The Pavia images are used to compare the proposed method to state-of-the-art ones, and the high-resolution image is used to analyze the effectiveness of our multi-scale CRF approach.

Pavia University Dataset
For the Pavia University dataset (Figure 7a), both the training samples (Figure 7b) and the ground-truth (Figure 7c) were available and used in the work of Aksoy [47], where a pixel-based and a region-based classification method were compared. The details of the training sample pixel distribution can be found in [47]. The segmentation scheme described in Section 2.2 was applied to Pavia University using three spectral bands (Bands 68, 30 and 2) out of the 103 available bands. We further used the first five obtained scales in order to define the different potentials, learn the parameters and perform the classification. It has to be noted that state-of-the-art hierarchical methods use different numbers of scales: two in [39], three in [40] and four in [1]; however, we found that scales beyond the fifth usually do not provide complementary information, due to the limited ground-truth.
Results using the proposed multi-scale CRF model, the pixel-level Bayesian (P-L-B) model of [47] and the region-based Bayesian (R-B-B) model of [47] are given in Tables 1 and 2. It can be observed that the proposed approach outperforms both previous results, as well as other region-based approaches, despite the fact that our model was trained using only three of the 103 spectral bands used in [47].

Pavia Center Dataset
In the case of the Pavia Center dataset, the ground-truth was available, but the training samples were not. As a consequence, we decided to take 10% of the ground-truth as training samples. We used a random uniform distribution to determine: (1) squares of a random size; and (2) random positions for these squares within the image. During the random training selection, we also ensured that all of the classes were represented in a roughly balanced way, so that about 10% of each class was selected, amounting to 10% of the ground-truth. The randomly obtained training samples are detailed in Table 3 and are illustrated in Figure 8b. As such, compared to [47], we use fewer training pixels for five out of the nine considered classes: trees, meadows, bricks, bare soil and shadow. Results using the proposed multi-scale CRF model, the pixel-level Bayesian (P-L-B) model of [47] and the region-based Bayesian (R-B-B) model of [47] are given in Table 3.
Table 3. Pavia Center dataset: description of the randomly selected samples, and performance evaluation (as accuracy %) of the proposed approach.

From Tables 2 and 3, and according to the accuracies measured on the available ground-truth, it is apparent that the multi-scale CRF-based strategy is generally more effective for classification than the state-of-the-art Bayesian (region- and pixel-based) approaches.

Airborne RGB Color Image
In order to assess the effectiveness of the proposed multi-scale CRF-based classification, we considered an aerial color (RGB) image, illustrated in Figure 9a, with eight classes, namely: shadow, dark roads and paths, light roads and paths, grass land, blue rooftop, red rooftop, green rooftop and arid land. We conducted an assessment process consisting of: (i) evaluating the importance of the MSRAG structure, i.e., of the hierarchical segmentation; and (ii) analyzing the discriminative power of the different models along with the impact of the potentials on the result. For all of the experiments, we used the same training data of Figure 9a.

Segmentation and MSRAG Generation Analysis
The proposed multi-scale CRF-based classification can be applied to any hierarchical decomposition of the image. In this section, we evaluate the classification results using different hierarchical decomposition approaches, namely the proposed MSRAG approach of Section 2.2 and a scale-space-based hierarchy, denoted MRAT (multiscale region adjacency tree), as described in [3]. The creation of the MRAT is based on the multi-scale tower produced by the PDE approach of Section 2.1. Then, at a manually selected localization scale, $u_0$, a gradient watershed transformation is performed to detect a set of regions with well-localized contours. To create the hierarchy, the regions at the localization scale are tracked across the multi-scale tower, and a parent-child linking [27] process is performed. This is achieved by spatially projecting the regional minima at each scale onto the coarser one. Using the duality between the regional minima of the gradient and the catchment basins of the watershed, each regional minimum residing at a given scale is linked with a regional minimum at the coarser scale if and only if its projection falls in the catchment basin of that regional minimum (for more details, refer to [27]). It has to be noted that, in [3], the MRAT has the form of a truncated tree, with each node (apart from the ones at the highest scale) having a unique predecessor (its parent) and without intra-scale links (edges denoting region adjacency at a given scale).
An illustration of the different hierarchical segmentations is provided in Figures 10 and 11, along with the respective classification results. The multi-scale CRF classifications applied to the considered multi-scale region adjacency graphs can be compared by visual inspection in Figure 12, and their per-class accuracies are detailed in Table 4.
In our experiments, we considered the following multi-scale region adjacency graphs:
• A-MRAT (augmented MRAT): the MRAT produced using the approach of [3], where the localization scale was selected manually, augmented with the intra-scale edges/links to conform to our definition of MSRAG (see Figure 2);
• MSRAG A-D: four multi-scale region adjacency graphs obtained by starting from the same localization scale used in the A-MRAT and applying the hierarchical segmentation scheme of Section 2.2, with an extra step of small-region merging before the waterfall process (taking into account an expected granularity during the waterfall), using different parameters;
• MSRAG E: a multi-scale region adjacency graph obtained by applying (on the original image) the PDE filtering and localization scale selection of Section 2.1, followed by the hierarchical segmentation scheme of Section 2.2.
The resultant classification accuracies are given in Table 5, along with the number of regions and average region size, per scale and per multi-scale region adjacency graph. As can be seen, multi-scale region adjacency graphs with fewer regions seem to suit the proposed multi-scale CRF model better, as long as the reduction in the number of regions does not introduce excessive region-merging errors at the initial scales. Additionally, the average region size seems to influence the accuracy of the proposed multi-scale CRF classification results.
It is worth noting that the optimal region size depends on the size of the objects present in the image and naturally varies from one image to another. Moreover, the larger a region is, the better its spectral, textural and morphological features can be characterized. In summary, the experimental results suggest a direct influence of the total number of regions and their average size on the classification results. The obtained results also suggest that the proposed segmentation scheme, using the PDE smoothing and localization scale selection of Section 2.1 and the multi-scale segmentation scheme of Section 2.2, produces satisfactory results.

Different Classification Models
For the evaluation of the discriminative power of the proposed multi-scale CRF potentials, which involve both intra-scale and inter-scale potentials trained using the above-described max-margin training strategy, we compared the proposed model with: (i) a region-based support vector machine (SVM) model; (ii) the MRF classification model of [3]; and (iii) a CRF model using standard potentials.
Classification accuracy results are detailed in Table 6, using the ground-truth of Figure 13b. The classification maps of the MRF and the proposed CRF, using the A-MRAT hierarchy, are illustrated in Figure 13c,d, respectively. The SVM is a discriminative unstructured model, which has been widely used in remote sensing applications [48]. The related SVM classification is parameterized by a penalty factor $C$ and the kernel parameters. For our tests, we used the LibSVM package (available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/), a linear kernel ($K(x, x') = \langle x, x' \rangle$) and a hyper-parameter $C = 0.1$. Both training and classification were performed over the finest scale of the hierarchy; each region was described by its color, texture and shape features.
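For reference, an equivalent baseline configuration can be set up, e.g., with scikit-learn (whose SVC wraps LibSVM) instead of the LibSVM binaries; the data below are random placeholders standing in for the per-region descriptors:

```python
# Linear kernel and C = 0.1, as stated above; random features stand in for the
# color, texture and shape descriptors of the finest-scale regions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 24))      # 200 regions, 24-component vectors
labels = rng.integers(0, 8, size=200)      # eight hypothetical classes
svm = SVC(kernel='linear', C=0.1).fit(features, labels)
print(svm.score(features, labels))
```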
For the MRF, we used the implementation of the multiscale region-based classification of Katartzis et al. [3], where the authors proposed a causal Markovian model defined on the hierarchy of a multiscale region adjacency tree (MRAT). As mentioned above, their MRAT graph has the form of a truncated tree, with each node (apart from the ones at the highest scale) having a unique predecessor (its parent). It has to be noted that this model does not consider intra-scale information. In [3], the observation model is a unary region property that represents the spectral signature of the region in the original image. The parameters were learned using a maximum likelihood approach (via expectation-maximization), with an efficient exact inference procedure made possible by the tree-like structure of the model.
As the standard CRF, we used the Undirected Graphical Models (UGM) Matlab code [49]. The model is based on Equation (1), using linear unary and pairwise energies, $E(\cdot; \theta) = \theta^T f$. We adapted the implementation to the proposed multi-scale structure and used the max-margin parameter learning strategy of Section 3.3, while keeping the original standard energies. As can be seen from Figure 13, the MRF model of [3] does not exhibit the salt-and-pepper effect. This is due to the use of the region-based approach and the long-range dependencies enforced using the multi-scale information. However, it is unable to properly enforce spatial consistency, as it does not model intra-scale dependencies.
From Table 6, the higher performance of the proposed multi-scale CRF model compared to the MRF of [3] can be related to: (i) the CRF's discriminative nature [7] compared to the generative nature [6,7] of the MRF, which is also confirmed when comparing to the SVM results; (ii) the design of the unary and pairwise potentials; and (iii) the type of MSRAG.
The highest per-class performance is underlined in Table 6. It is worth noting that, using the MRAT as defined in [3], the MRF achieves slightly better performance in one of the eight classes compared to the CRF models and in five of the eight classes compared to the CRF applied on the A-MRAT. However, the accuracy improvements of the MRF over the CRF models are at most on the order of 1.5%, while the proposed CRF, using MSRAG E, improves the classification results by up to 22% in accuracy for certain classes (e.g., C3). The low overall accuracy of the standard CRF may be due to the parameter settings in our implementation.

The Impact of the Intra-and Inter-Scale Potentials
The above comparison between the MRF and CRF models does not use the same features to characterize the classes. Indeed, for the MRF, only the spectral signature of the regions was used, while, for the CRF, each region was characterized by its spectral, texture and shape information. Therefore, in order to measure the real impact of each individual potential (intra- and inter-scale) on the classification results, we considered different sets of values for the potential combination of the energy expression of Equation (33). The obtained results are detailed in Table 7 for the considered aerial color image.
The results indicate that, for the proposed model and for the considered image and model structure (MSRAG), the multi-scale information is valuable. This can be seen by comparing the results that involve multi-scale information ($d_2$ and $d_3$ used, with the second column as Yes) with those that do not involve it (only $d_1$ used, with the second column as No).
From the point of view of the potentials' influence, it is clear that their different combinations in the energy expression of Equation (33) lead to different classification results. However, each potential's contribution will vary depending on the nature of the image and on the training samples.
For instance, if certain images exhibit clear class transition patterns and these patterns are present in the training samples, then we expect the intra-scale potential to influence the energy expression in a way that improves the classification results. If, on the contrary, no clear class transition patterns can be distinguished in the image, or if they are not present in the training samples, then we should not expect the intra-scale potential to provide valuable information to the energy expression. A similar analysis can be made for the inter-scale potential.

For all evaluated models, we assume that they share the same computational complexity for the hierarchical segmentation stage, which can be more or less efficient (it varies depending on the segmentation scheme used). As can be seen, the SVM applied on the localization scale is the most computationally efficient, since it is simple, but it cannot benefit from any available contextual information (hierarchy). For the MRF and CRF models, the major difference in computational complexity is due to the parameter learning and the inference process for multiclass problems. The computational complexity depends on: (i) the parameter learning approach (maximum likelihood or max-margin); and (ii) the inference process according to the type of input hierarchy (with or without loops). From the theoretical point of view, as in [24], the proposed parameter learning is much more efficient, mainly due to the reduction of the number of constraints ($\forall x \ne \bar{x}$ in Equation (23)) to a much smaller number of constraints ($\forall x_P \ne \bar{x}_P$, $\forall P \in T$ in Equation (38) in Appendix A), but also due to the proposed problem decomposition into multiple individual problems that can be solved using the fixed-point method (see [26] for details on the method's efficiency). The execution time values detailed in Table 8 confirm the suitability of the proposed max-margin parameter learning strategy. The proposed CRF and the standard CRF models, both involving loopy structures and trained using the max-margin approach, perform similarly to the MRF with its loop-free, tree-like structure. The proposed CRF also largely outperforms the standard CRF model when the latter is trained in a traditional maximum likelihood fashion. For the CRF models, the reported execution time for the inference process accounts for a fixed set of values for $d_1, \ldots, d_5$.
From Corollary 1 (see Appendix B), under the maximum margin criterion, we can redefine Constraint (35) accordingly. Note that, for the case $x_P = \bar{x}_P$, Constraint (41) is trivially met, since both sides of the inequality ($E_P - \hat{E}_P$ and $\ell_P$) become zero; therefore, we consider $\forall x_P$ instead of $\forall x_P \ne \bar{x}_P$.
Let us define $\Omega_P$ as the set of label combinations for any clique order $P$ and $N_P = |P|$ as the total number of (related) factors [21]; in our problem, $\Omega_{P=\text{site}} = \{1, \ldots, L\}$, $\Omega_{P=\text{neighbor}} = \{0, 1\}$ and $\Omega_{P=\text{parent/child}} = \{0, 1\}$. To each $i$-th factor ($i = 1, \ldots, N_P$), we associate a possible label $x_i \in \Omega_P$, the corresponding observation $y_i$ and the corresponding true label $\bar{x}_i$ from the training data.

B. Necessary and Sufficient Conditions for Maximum Margin
Proposition 1: The constraints $x_n \ge l_n\ \forall n$ are sufficient (although not necessary) conditions for the constraint $\sum_n x_n \ge \sum_n l_n$ to be satisfied.
Proof: This is a property of inequalities in algebra. Let $\{x_n, l_n\}_{n=1}^N$ be a set of arbitrary real numbers satisfying $x_n \ge l_n\ \forall n$; summing the inequalities over $n$ yields $\sum_n x_n \ge \sum_n l_n$. Let $M(x) = \sum_n (x_n - l_n)$ be the solution margin of $P$ associated with $x$, $U = \{n \mid x_n < l_n\}$ and $S = A \setminus U$ the relative complement of $U$ in $A$. Any solution margin $M_{U \ne \emptyset}$ of $P$, such that $U \ne \emptyset$, is bounded above by another solution margin $M_{U = \emptyset}$ of $P$, where $U = \emptyset$.
The compactness energy $E_{\text{comp}}(r_i) = \frac{\text{Perimeter}(r_i)^2}{4\pi\, \text{Area}(r_i)}$ is always greater than or equal to one (one for a circle, $4/\pi$ for a square). We assume that regions with a compactness larger than 1.25 are not preferred; accordingly, the offset for the compactness energy is set to $-1.25$.

Homogeneity
$E_{\text{hom}}(r_i) = 1 - \sigma(r_i) V(r_i)$ represents the region's intensity ($I$) homogeneity. Homogeneity is largely related to the local information extracted from an image and reflects how uniform a region is.
The color variance represents the color homogeneity of a region, with $\sigma_c(r_i)$ the standard deviation of the color $c \in \{L, a, b\}$ within region $r_i$. The normalization factor for the color variances is derived from a statistical analysis of the color variance results on the image database [52].

Edgeness
$E_{\text{edge}}(r_i, r_j)$ represents the dynamics of the contour of the edge between $r_i$ and $r_j$.

Figure 1. Stages involved in the proposed multi-scale region-based image labeling process.

Figure 2. The multi-scale region adjacency graph (MSRAG) defines the hierarchical structure of an image. (a) Nested hierarchy of partitions; (b) multi-scale region adjacency graph (MSRAG).

Figure 3. In the considered structure, during classification, the information flows from coarser (higher) to finer (lower) scales to regulate long-range label consistency and flows within the same scale to capture local interactions between neighboring regions.

Figure 4. Piecewise pairwise factors of a Markovian model are considered individual pieces of the same original model. (a) Original model. (b) Standard piecewise training of the same model (pairwise factors).

Figure 5. Information flows from the finest to the coarsest scales in order to conform the training sample set structured according to the MSRAG. (a) The provided training samples. (b) Training at Scale 1. (c) Training at Scale 5.

The introduction of slack variables $\xi$ allows one to assume errors on the obtained separating hyperplane (soft margin):
$$\min_\theta \frac{1}{2} \|\theta\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad E_P - \hat{E}_P \ge \ell_P - \xi_i, \quad \forall i = 1, \ldots, N_P, \ \forall x_i \in \Omega_P,$$
relaxing the hard-margin constraints $E_P - \hat{E}_P \ge \ell_P$, $\forall i = 1, \ldots, N_P$, $\forall x_i \in \Omega_P$.

Proposition 2: The constraint $\sum_n x_n \ge \sum_n l_n$ admits a representation of the form $\sum_{u \in U} x_u \ge l$, where $U = \{n \mid x_n < l_n\}$ are the indexes related to unsatisfied constraints, $U \subseteq A$, $A = \{1, \ldots, N\}$ and $l$ is a slack variable. Proof: Let $S = A \setminus U$ be the relative complement of $U$ in $A$. Consider the residual $(x_p - l_p)$, where $x_p = \sum_{s \in S} x_s$ and $l_p = \sum_{s \in S} l_s$, as part of the following slack definition: $l = \sum_{u \in U} l_u - (x_p - l_p)$. Theorem 1: Let $x \in \mathbb{R}^N = \{x_n \mid \forall n \in A\}$ be a solution of a problem $P$ satisfying the constraints $x_n \ge l_n$, $l_n \in \mathbb{R}$, where $A = \{1, \ldots, N\}$, and let $M : \mathbb{R}^N \to \mathbb{R}$, $M(x) = \sum_n (x_n - l_n)$, be its solution margin.

$E_{\text{color}}(r_i, r_j)$ represents the normalized mean color difference. The normalization factor $T_d$ (for color merging) is estimated as $T_d = \mu_d - \sigma_v$, with $\mu_d$ the mean of the color differences $D_i$ and $\sigma_v = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (D_i - \mu_d)^2}$ the standard deviation of the $n = \frac{k(k-1)}{2}$ color differences between the $k$ regions.

Table 1. Pavia University dataset: performance evaluation of the proposed approach in comparison with previous results.

Table 2. Pavia University dataset: performance evaluation of the proposed approach in comparison with previous results; per-class accuracy (%) and overall accuracy (%). P-L-B, pixel-level Bayesian; R-B-B, region-based Bayesian.

Table 4. Per-class accuracy (%) and overall accuracy (%) of the multi-scale CRF applied to the considered multi-scale region adjacency graphs.

Table 5. The different hierarchical segmentation strategies provide different numbers of regions, as well as average region sizes (in brackets) per scale, which influence the classification results of the CRF model.

Table 6. Per-class accuracy (%) and overall accuracy (%) of the proposed CRF model considering different MSRAG structures and the region-based multi-scale Markov random field (MRF) of [3]; the highest per-class accuracy has been underlined. Std. CRF refers to a CRF model with standard energies/potentials.

Table 7. The potential mixture weights ($d_m$, $m = 1, \ldots, 3$ in Equation (33)) establish the final performance of the multi-scale CRF classifier. All of the reported experiments of Section 5.3.2 were run on an Intel Core i7 PC at 3.40 GHz with 8 GB of RAM. Table 8 summarizes the computational costs of the evaluated approaches, considering the same hierarchical segmentation as input.

Table 8. Execution time for the parameter learning and inference stages of the different models: SVM, support vector machine; Prop. CRF, the proposed CRF model; Std. CRF (Max-Margin), CRF model with standard energies/potentials trained using the max-margin approach; Std. CRF (Max-Likelihood), CRF model with standard energies/potentials trained using the max-likelihood approach.