Remote Sens. 2013, 5(5), 2275–2291; https://doi.org/10.3390/rs5052275
Article
Tile-Level Annotation of Satellite Images Using Multi-Level Max-Margin Discriminative Random Field
^{1} Signal Processing Laboratory, School of Electronic Information, Wuhan University, Wuhan 430072, China
^{2} TSI Department, TELECOM ParisTech, F-75013 Paris, France
^{*} Author to whom correspondence should be addressed.
Received: 1 March 2013; in revised form: 3 May 2013 / Accepted: 7 May 2013 / Published: 13 May 2013
Abstract:
This paper proposes a multi-level max-margin discriminative analysis (M^{3}DA) framework, which takes both coarse and fine semantics into consideration, for the annotation of high-resolution satellite images. In order to generate more discriminative topic-level features, M^{3}DA uses the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model. Moreover, to improve the spatial coherence of visual words neglected by M^{3}DA, a conditional random field (CRF) is employed to optimize the soft label field composed of multiple label posteriors. The M^{3}DA framework combines word-level features (generated by support vector machines) and topic-level features (generated by MedLDA) via the bag-of-words representation. Experimental results on high-resolution satellite images demonstrate that the proposed method not only obtains a suitable semantic interpretation, but also improves the annotation performance by taking into account the multi-level semantics and the contextual information.
Keywords:
satellite images annotation; topic model; MedLDA; multi-level max-margin; conditional random field
1. Introduction
Nowadays, information extraction and intelligent interpretation of high-resolution satellite images are frontier technologies in the remote sensing field. With the growing number of high-resolution satellite images, efficient content extraction and scene annotation that help us quickly understand very large images are becoming more and more desirable. Given such a large data volume, manual annotation typically requires a lot of human effort. Hence, an effective interpretation method based on mid-level or high-level semantics is strongly required in remote sensing applications.
However, low-level (physical) features usually cannot precisely represent the scene semantics of images, and bridging this semantic gap has consequently become the main issue to deal with. Recently, there has been ever-growing interest in image annotation using topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [1,2] and Latent Dirichlet Allocation (LDA) [3,4], which map low-level physical features to high-level semantic concepts and essentially reduce the dimensionality of the features. These generative probabilistic models were originally developed for text document modeling and can generate an infinite sequence of samples according to the distribution of latent topics. It is assumed that each document is a mixture over latent topics and that each topic is, in turn, a mixture over words. The latent topic representation builds a global information space, which relies more on content coherence than on local description. Meanwhile, the computational efficiency of approximate inference methods has also made these aspect models popular. Applying these models from the text domain to the image domain requires a correspondence between documents and images. Conventionally, the whole image is treated as the corpus and divided into rectangular tiles, which are regarded as documents. Each tile is further partitioned into multiple smaller patches. Local features extracted from the patches are transformed by vector quantization into “visual words”, and each tile is thus represented as a collection of words. Some researchers have demonstrated that aspect models provide an effective way of understanding aerial images. According to [5,6], a scene of a satellite image, modeled by LDA, is represented as a finite mixture over some underlying semantic classes. This discriminative representation leads to satisfactory annotation performance on large satellite images.
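The tile-to-document mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the codebook and the two-dimensional descriptors are toy values, and the helper names `nearest_word` and `tile_histogram` are hypothetical, not taken from the paper.

```python
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the codebook centroid closest (squared Euclidean) to a descriptor."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def tile_histogram(patch_descriptors, codebook):
    """Bag-of-words count vector for one tile (one 'document')."""
    counts = Counter(nearest_word(d, codebook) for d in patch_descriptors)
    return [counts.get(k, 0) for k in range(len(codebook))]

# Toy data (assumed): a 3-word codebook in a 2-D descriptor space, one tile with 4 patches.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tile = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
hist = tile_histogram(tile, codebook)
print(hist)  # [1, 2, 1]
```

The histogram discards the position of each patch inside the tile, which is exactly the loss of spatial information that motivates the random field models discussed next.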
Almost all topic models built on low-level features have their own limitations and drawbacks. Because each visual word in a tile is assumed independent, as are the tiles themselves, these models ignore the spatial relationship of adjacent regions and hence fail to capture important context information. Many methods have been proposed to address this problem. The authors of [5] introduced spatial information by cutting the large image into overlapping patches. Various extensions of the aspect models have also been designed, e.g., the spatial LDA (SLDA) model [7]. Different from LDA, the word assignment of SLDA is a hidden random variable, and the spatial information between visual words is encoded. In another line of work, random field models such as the Markov Random Field (MRF) and the Conditional Random Field (CRF) have been employed to improve the spatial coherence of aspect models. In particular, to describe the spatial relationship of latent topics, an MRF prior has been defined over the hidden topic labels obtained by PLSA, and experimental results in both supervised and weakly supervised settings have demonstrated that combining the two complementary models clearly improves segmentation and recognition accuracy [8,9].
In this paper, we present a method for annotating satellite images based on the combination of a novel topic model and the CRF. According to [8], each latent topic in PLSA is regarded as one semantic class. However, such a one-to-one mapping is inappropriate for representing local scene semantics in satellite images, because a single topic cannot capture complex scene information. It seems more reasonable that one semantic class should contain several latent topics. As illustrated in Figure 1, a randomly selected patch in a scene of a commercial area may consist of several objects, such as roads, houses, and trees, which could be represented in the form of latent topics in aspect models. We are therefore motivated by the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model [10], which was originally proposed for regression and classification in text analysis and trains supervised models based on a max-margin principle. This extension of LDA discovers latent topics by optimizing an objective function with a set of margin constraints. The coupling of parameter estimation and latent topic analysis makes the low-dimensional semantic vectors more suitable for a prediction task. Based on the MedLDA model, we propose a multi-level max-margin discriminative analysis (M^{3}DA) framework, which takes both coarse and fine semantics into consideration. Furthermore, we apply the CRF model to label inference over the soft label fields generated by the multi-level max-margin discriminative topic model. In this way, the final label field is optimized, since the spatial information of neighboring areas is taken into account and the local correlation between them is reinforced. The experimental results show the effectiveness and robustness of the proposed method for satellite image annotation.
The rest of the paper is organized as follows. Section 2 briefly introduces the MedLDA model and our proposed multi-level max-margin topic model. Section 3 describes the CRF and the improved algorithm obtained by combining the proposed model with the CRF. Section 4 gives the algorithm flowchart of our image annotation method and then shows the experimental results on two different satellite images. Section 5 discusses the results and future work. Finally, Section 6 concludes the paper.
2. Multi-Level Max-Margin Discriminative Topic Model Based on MedLDA
In this section, an overview of MedLDA for classification is given. Then, the multi-level max-margin discriminative topic model based on MedLDA is introduced. MedLDA is a crucial part of our method because it provides an appropriate latent semantic representation, which is usually difficult to obtain in the annotation of satellite images.
2.1. MedLDA Model
As explained in [10], MedLDA is derived from supervised topic models [11], depicted as a graphical model in Figure 2, which introduce a response variable to LDA for each document. It allows the number of topics to be decoupled from the number of classes, while discriminative latent topics are still learned. Hence, it may help to improve the overall accuracy. The experiments on text in [10] suggest that it works well and runs at a speed comparable to standard LDA. Consequently, we attempt to apply this aspect model to satellite image annotation.
The MedLDA model can handle both regression and classification. Here we only briefly introduce the classification part, which is employed for the annotation task. Suppose each document is a sequence of N words w_{n}, denoted by W = {w_{1}, w_{2}, …, w_{N}}, and the number of latent topics is K. The vector of discrete response variables in corpus D is y, where each y ∈ {1, 2, …, M}. The generative process of MedLDA is the same as that of supervised topic models [11]:
 (1) Draw topic proportions θ | α ∼ Dir(α);
 (2) For each of the N words w_{n}:
  (a) Draw a topic assignment z_{n} | θ ∼ Multinomial(θ);
  (b) Draw a word w_{n} from P(w_{n} | z_{n}, β), a multinomial probability conditioned on the topic z_{n}, namely w_{n} | z_{n}, β_{1:K} ∼ Multinomial(β_{z_{n}});
 (3) Draw a response variable y | z_{1:N}, η, σ^{2} ∼ N(η^{T} Z̄, σ^{2}), where $\overline{Z}=\frac{1}{N}{\sum}_{n=1}^{N}{z}_{n}$.
Here (α, β, η, σ^{2}) are the unknown hyperparameters. Marginalizing over θ and z_{1:N} gives the joint distribution of the words W and the response variable y of a document:
$$P\left(y,W\mid \alpha ,\beta ,\eta ,{\sigma}^{2}\right)=\int P(\theta \mid \alpha ){\sum}_{{z}_{1:N}}\left({\prod}_{n=1}^{N}P\left({z}_{n}\mid \theta \right)P\left({w}_{n}\mid {z}_{n},{\beta}_{1:K}\right)\right)P\left(y\mid {z}_{1:N},\eta ,{\sigma}^{2}\right)d\theta \qquad (1)$$
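To make the generative steps (1) to (3) concrete, the following minimal sketch draws one synthetic document. It assumes toy values for α, β, η and σ and uses only the Python standard library; the function name `sample_document` is hypothetical. It samples from the model and does not perform any inference.

```python
import random

def sample_document(alpha, beta, eta, sigma, n_words, rng):
    """Draw (words, topic assignments, z_bar, response) following steps (1)-(3)."""
    K = len(alpha)
    # (1) theta | alpha ~ Dir(alpha), obtained by normalising Gamma draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    theta = [v / total for v in g]
    words, topics = [], []
    for _ in range(n_words):
        # (2a) z_n | theta ~ Multinomial(theta)
        z = rng.choices(range(K), weights=theta)[0]
        # (2b) w_n | z_n, beta ~ Multinomial(beta[z_n])
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    # (3) y ~ N(eta^T z_bar, sigma^2), with z_bar the empirical topic frequencies.
    z_bar = [topics.count(k) / n_words for k in range(K)]
    mean = sum(e * zb for e, zb in zip(eta, z_bar))
    y = rng.gauss(mean, sigma)
    return words, topics, z_bar, y

rng = random.Random(0)
alpha = [0.5, 0.5]                           # assumed: K = 2 topics
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]    # assumed: vocabulary of 3 words
words, topics, z_bar, y = sample_document(alpha, beta, [1.0, -1.0], 0.1, 20, rng)
```

Note that step (3) draws a continuous response, as written in the list above; for classification the response is discrete and is handled through the margin constraints rather than a Gaussian likelihood.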
The variational EM algorithm is adopted for parameter estimation in supervised topic models, where the goal is to maximize the joint likelihood P(y, W | α, β, η, σ^{2}) by learning a point estimate of η. In contrast, the authors of MedLDA take a Bayesian-style approach and learn a distribution over the parameters by the max-margin principle, because the likelihood P(y, W | α, β, η, σ^{2}) (its normalization factor) is intractable. Unlike fully generative topic models, a partially generative model on (θ, z, W) is defined. The margin-constrained learning problem is written as follows:
$$\begin{array}{l}\underset{q,q(\eta ),\alpha ,\beta ,\xi}{\text{min}}\hspace{0.17em}L(q)+\mathit{KL}\left(q(\eta )\,\Vert \,{p}_{0}(\eta )\right)+C\sum _{d=1}^{D}{\xi}_{d}\\ \text{s.t.}\hspace{0.17em}\hspace{0.17em}\forall d,\,y\ne {y}_{d}:\left\{\begin{array}{l}E\left[{\eta}^{T}\mathrm{\Delta}{f}_{d}(y)\right]\ge 1-{\xi}_{d}\\ {\xi}_{d}\ge 0\end{array}\right.\end{array}\qquad (2)$$
$$L(q)=-E\left[\text{log}\hspace{0.17em}p(\theta ,z,W\mid \alpha ,\beta )\right]-H\left(q(z,\theta )\right)\qquad (3)$$
Here L(q) is the variational upper bound of −log P(W | α, β); p_{0}(η) is a prior distribution over the parameters, and KL(p ‖ q) ≜ E_{p}[log(p/q)] is the Kullback-Leibler (KL) divergence; C is a positive regularization constant; Δf_{d}(y) = f(y_{d}, Z̄_{d}) − f(y, Z̄_{d}), and ξ_{d} are slack variables; E[η^{T} Δf_{d}(y)] is the “expected margin” by which the true label y_{d} is favored over a prediction y; and H(q) is the entropy of q. Because of the margin constraints in Equation (2), the model learns a latent topic representation q(θ, z | γ, ϕ) and a parameter distribution q(η) that both predict the training data accurately and explain the data properly. The posterior distribution of the hidden variables is inferred during parameter estimation, which essentially distinguishes MedLDA from supervised topic models. After the distribution q(η) is learned, the label can be inferred as follows:
$${y}^{*}=\text{arg}\hspace{0.17em}\underset{y}{\text{max}}\hspace{0.17em}E\left[F\left(y,{z}_{1:N},\eta \right)\mid \alpha ,\beta \right]\qquad (4)$$
Here F is a linear discriminant function, which can be written as F(y, z_{1:N}, η) = η^{T} f(y, Z̄), where f(y, Z̄) is the feature vector. In the model, latent topic discovery is integrated with the max-margin principle by optimizing a single objective function with a set of margin constraints, which leads to a predictive topic representation.
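The prediction rule for y^{*} can be illustrated with a small sketch. It assumes that f(y, Z̄) places Z̄ in the block of class y, so that the score of class y reduces to the inner product of a per-class weight vector with Z̄, and that the expectation over η is replaced by its mean. The softmax used to turn scores into soft labels is an illustrative choice, not something stated in the paper, and classes are indexed from 0 here.

```python
import math

def predict(eta_mean, z_bar):
    """Return (label, soft_probs) from class scores F(y) = eta_y . z_bar."""
    scores = [sum(e * z for e, z in zip(row, z_bar)) for row in eta_mean]
    label = max(range(len(scores)), key=scores.__getitem__)
    # Softmax normalisation: an assumed way to obtain soft labels from scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return label, [e / total for e in exps]

# Toy values (assumed): M = 2 classes, K = 3 topics.
eta_mean = [[2.0, 0.0, -1.0],   # weights of class 0
            [0.0, 1.5, 0.5]]    # weights of class 1
z_bar = [0.1, 0.6, 0.3]         # average topic assignment of one tile
label, probs = predict(eta_mean, z_bar)
print(label)  # 1
```

The K-dimensional topic vector is thus mapped to an M-dimensional score vector, which is the quantity later smoothed by the random field.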
2.2. MultiLevel MaxMargin Discriminative Topic Model
MedLDA can discover sparse and highly discriminative topical representations by exploiting the popular and powerful max-margin principle. The support vector machine (SVM), a typical model learned by the max-margin mechanism, has been successfully applied to a wide range of discriminative problems such as image annotation and target recognition.
Formally, the linear SVM finds an optimal linear function by solving the following constrained optimization problem:
$$\begin{array}{c}\underset{w,\xi}{\text{min}}\hspace{0.17em}\frac{1}{2}{\Vert w\Vert}_{2}^{2}+C\sum _{d=1}^{D}{\xi}_{d}\\ \text{s.t.}\hspace{0.17em}\forall d:{y}_{d}{w}^{T}f\left({x}_{d}\right)\ge 1-{\xi}_{d},\hspace{0.17em}{\xi}_{d}\ge 0\end{array}\qquad (5)$$
where x_{d} ∈ X are the input feature vectors of the samples, which are the visual-word features in this paper; w is the parameter vector; ξ_{d} is a slack variable that tolerates some errors in the training data; y_{d} is the class label of sample d; and C is a positive regularization constant.
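As a worked illustration of the soft-margin objective, the sketch below evaluates it for a fixed w. It assumes binary labels in {+1, −1} and takes the feature map f as the identity; at the optimum each slack equals max(0, 1 − y_{d} w^{T} x_{d}), which is what the code computes. The function name `svm_objective` is hypothetical, and no optimization is performed.

```python
def svm_objective(w, X, y, C):
    """Value of 0.5*||w||^2 + C*sum(xi_d) with xi_d = max(0, 1 - y_d * w.x_d)."""
    slacks = [max(0.0, 1.0 - yd * sum(wi * xi for wi, xi in zip(w, xd)))
              for xd, yd in zip(X, y)]
    regulariser = 0.5 * sum(wi * wi for wi in w)
    return regulariser + C * sum(slacks), slacks

# Toy data (assumed): three 2-D samples, the second one inside the margin.
X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
y = [1, 1, -1]
value, slacks = svm_objective((1.0, 0.0), X, y, C=1.0)
print(value, slacks)  # 1.0 [0.0, 0.5, 0.0]
```

Only the second sample violates the margin, so it alone contributes a non-zero slack to the objective.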
In our proposed framework, the soft labels generated by MedLDA inference are essentially topic-level features. Because of the complexity of large-scale image scenes, a single-level, low-dimensional topic feature may not yield an effective semantic representation. In this paper, to handle the cases that remain inseparable under the max-margin mechanism of either SVM or MedLDA alone, we construct a multiple soft label posterior, as shown in Figure 3, which combines the word-level feature generated by SVM and the topic-level feature generated by MedLDA, both based on a bag-of-words (BOW) representation. Our M^{3}DA topic model, which is described at two different feature levels that may complement each other, can provide more discriminative labels. This improvement is verified in the subsequent experiments.
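A minimal sketch of the combination step follows. The concatenation mirrors the description of the multiple soft label posterior; how the concatenated vector is reduced to one probability per class is not detailed in the text, so the averaging function below is purely a hypothetical placeholder, and both function names are illustrative.

```python
def concat_soft_labels(p_svm, p_med):
    """Concatenate word-level and topic-level posteriors into one 2M-vector."""
    if len(p_svm) != len(p_med):
        raise ValueError("both posteriors must cover the same M classes")
    return list(p_svm) + list(p_med)

def average_fusion(p_svm, p_med):
    """Hypothetical reduction to M values: the plain mean of both posteriors."""
    return [(a + b) / 2.0 for a, b in zip(p_svm, p_med)]

# Toy posteriors (assumed) for one tile and M = 3 classes.
p_svm = [0.5, 0.4, 0.1]   # word-level soft label from the SVM
p_med = [0.2, 0.7, 0.1]   # topic-level soft label from MedLDA
combined = concat_soft_labels(p_svm, p_med)
fused = average_fusion(p_svm, p_med)
```

In this toy case the SVM alone would pick class 0, while the fused posterior favours class 1, which shows how one feature level can correct the other.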
3. M^{3}DA-Based Random Field
3.1. Conditional Random Field
The aforementioned topic models lose spatial information in supervised classification. To recover the lost contextual information, some researchers have extended aspect models with MRF [8,12]. The resulting MRF aspect models, which usually impose MRF properties at the latent topic level, have shown significant boosts in classification performance over standard aspect models. Here we use the CRF [13,14], which directly models the posterior probability of the classes, to optimize the soft label field. The basic form of the CRF can be written as:
$$P(x\mid y)=\frac{1}{Z}\text{exp}\left\{\sum _{i}{n}_{i}\varphi \left({x}_{i},{y}_{i}\right)+\sum _{i}\sum _{j\in N(i)}{w}_{ij}\phi \left({x}_{i},{x}_{j},{y}_{i},{y}_{j}\right)\right\}\qquad (6)$$
Here, x and y denote the predicted class labels and the observed image, respectively; Z is the normalization factor; n_{i} and w_{ij} are the model parameters. φ and ϕ denote, respectively, the unary potential function and the pairwise potential function, which together describe the interrelations among the basic elements of the CRF. In our experiments, the unary potential is given by the soft label probability, and the pairwise potentials are parameterized by the Potts model. Thus, the original CRF model can be transformed into the form below, where σ is the smoothing coefficient and [x_{i} = x_{j}] equals 1 if the two labels agree and 0 otherwise:
$$P(\mathbf{x}\mid \mathbf{y})\propto \text{exp}\left(\sum _{i}\text{log}\hspace{0.17em}P\left({x}_{i}\mid {y}_{i}\right)+\sum _{i}\sum _{j\in N(i)}\sigma \cdot \left[{x}_{i}={x}_{j}\right]\right)\qquad (7)$$
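The exponent of the Potts-form CRF can be evaluated directly for a candidate labeling. The sketch below does so for a toy chain of three tiles; the neighbourhood structure, the probabilities and the function name `log_potential` are assumptions chosen for illustration.

```python
import math

def log_potential(labels, probs, sigma, neighbours):
    """Unnormalised log P(x|y): sum_i log P(x_i|y_i) + sigma * number of agreeing (i, j) pairs."""
    total = 0.0
    for i, xi in enumerate(labels):
        total += math.log(probs[i][xi])
        total += sigma * sum(1 for j in neighbours[i] if labels[j] == xi)
    return total

# Three tiles in a row (assumed); each pair is visited once per direction, as in the double sum.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]   # soft labels P(x_i | y_i), 2 classes
smooth = log_potential([0, 0, 0], probs, 0.5, neighbours)
noisy = log_potential([0, 1, 0], probs, 0.5, neighbours)
```

With σ = 0.5 the homogeneous labeling scores higher than the per-tile MAP labeling, even though the middle tile individually prefers class 1; this is the smoothing behaviour the pairwise term is meant to produce.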
3.2. M^{3}DA-Based Random Field
In this section, we describe the M^{3}DA-based Random Field (M^{3}DA-RF for short) approach for the semantic annotation of large satellite images.
The semantic concepts and categories in the image are defined beforehand. Then, a training set is built in the following steps. A large satellite image S to be annotated, containing M semantic classes, can be considered as a testing set consisting of a set of image patches S_{d} of equal size. It can be written as:
$$\underset{d}{\cup}{S}_{d}=S\qquad (8)$$
The annotation process can thus be regarded as a classification procedure in which each document S_{d} is labeled with a semantic class C_{m}, m ∈ {1, 2, …, M}. Since the MedLDA model ignores the order of words and of documents in the corpus, we bring contextual information into label inference with the CRF in order to improve the annotation.
During parameter estimation, the MedLDA model seeks a latent topic representation q(θ, z | γ, ϕ) and a parameter distribution q(η) for multi-class classification, so that it predicts the training set as accurately as possible while also representing the data set well. The label inference for each test sample is based on the statistics of the discriminative latent topics. According to Equation (4), the K-dimensional vector of latent topics is transformed into an M-dimensional soft probability, and the final label is then inferred by the maximum a posteriori (MAP) principle. Here we propose another algorithm for label inference instead of MAP inference. A CRF prior with eight-neighbor connectivity is introduced over the soft label field derived from our M^{3}DA topic model, in which each site holds an M-dimensional vector that can be regarded as the probability of each semantic class. Considering the relevance of surrounding areas, the soft label field is optimized by the Graph Cut algorithm [15]. The labels inferred by the CRF are thus smoothed and lead to a better annotation result than those obtained without CRF inference. A remarkable smoothing effect is shown in the subsequent experiments.
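The smoothing of a soft label field with eight-neighbor connectivity can be sketched as follows. Note the substitution: instead of the Graph Cut algorithm used in the paper, this example uses iterated conditional modes (ICM), a simpler greedy optimizer, purely for illustration. The grid, the probabilities and the function names are assumptions.

```python
import math

def eight_neighbours(r, c, rows, cols):
    """Grid coordinates of the (up to) eight tiles surrounding tile (r, c)."""
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols]

def icm_smooth(soft, sigma, sweeps=5):
    """Iterated conditional modes on a grid of soft labels (stand-in for Graph Cut)."""
    rows, cols, n_classes = len(soft), len(soft[0]), len(soft[0][0])
    # Start from the per-tile MAP labels.
    labels = [[max(range(n_classes), key=p.__getitem__) for p in row] for row in soft]
    for _ in range(sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                nbrs = eight_neighbours(r, c, rows, cols)

                def score(k):
                    # Unary log-probability (clamped to avoid log(0)) plus Potts agreement.
                    agree = sum(1 for (i, j) in nbrs if labels[i][j] == k)
                    return math.log(max(soft[r][c][k], 1e-12)) + sigma * agree

                best = max(range(n_classes), key=score)
                if best != labels[r][c]:
                    labels[r][c], changed = best, True
        if not changed:
            break
    return labels

# 3 x 3 field (assumed): every tile favours class 0 except a weakly confident centre tile.
soft = [[[0.9, 0.1] for _ in range(3)] for _ in range(3)]
soft[1][1] = [0.4, 0.6]
smoothed = icm_smooth(soft, sigma=0.5)
print(smoothed)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

ICM only finds a local optimum, whereas Graph Cut offers stronger guarantees for this kind of energy, so the sketch shows the effect of the contextual term rather than the authors' optimizer.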
4. Tile-Level Annotation Algorithm and Experimental Result Analysis
4.1. M^{3}DA-RF Based Tile-Level Annotation Algorithm of Satellite Images
The flowchart of the M^{3}DA-RF based tile-level annotation algorithm is illustrated in Figure 4, and the pseudo-code of this algorithm is shown in Algorithm 1. The visual words are obtained through several steps, namely tile partition, feature extraction, vector quantization, and K-means clustering, in that order. Tiles and visual words play the roles of documents and words, respectively, in the topic models. The set of visual words (bag-of-words representation) is then used to represent an image regardless of their spatial arrangement, similar to how documents are represented as unordered sets of words in text analysis. The BOW representation of the image is processed in two ways simultaneously: by using the visual word histogram to train an SVM classifier, we get the scene class label distribution (the so-called soft label probability) of each tile; by training MedLDA, we also obtain a soft label distribution for each tile. By concatenating the two soft label probabilities, the multiple class label posterior is generated, which is described as a soft label field in the CRF in later steps. This combination is reasonable, since the soft probabilities P^{Med} and P^{SVM} are generated from two different feature levels (the former from the topic-feature level and the latter from the word-feature level).
Algorithm 1. M^{3}DA-RF based tile-level annotation of satellite images.
Input: original high-resolution image I^{O}
Output: the annotation image I^{A}
4.2. Experimental Data and Settings
In this section, we present experimental results on two large high-resolution satellite images, both acquired by GeoEye-1: one (image I) was taken near the airport of Tucson, USA (shown in Figure 5), and the other (image II) was taken over Majuqiao Town in the southwest of Tongzhou District, Beijing (shown in Figure 6). A series of experiments based on different methods is conducted on image I, but we do not analyze these results in detail. We deal with image II in depth, and its experimental results are interpreted both qualitatively and quantitatively.
Image I (Figure 5(a)) is 4,000 × 4,000 pixels in size and includes five semantic classes: residential area, bare land, factories, commercial area, and grassland. The large image is divided into 1,600 non-overlapping patches of 100 × 100 pixels, which are regarded as documents (tiles). We randomly choose 50% of each class as the training set and the remainder as the testing set. For generality, we only use the SIFT feature. We use K-means to quantize the descriptors, producing 300 clusters, whose centroids are regarded as words. A word corresponds to a window of 5 × 5 pixels; thus, each document contains 400 words. The number of latent topics in MedLDA is fixed to 35, and we set σ = 0.5 empirically. A linear SVM is selected in our experiments because of its high computational efficiency and satisfactory classification accuracy. In addition, the soft label field is optimized using Graph Cuts to obtain the final smoothed annotation result.
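The tile and word counts quoted above follow from simple arithmetic, which the short check below reproduces. It assumes square, non-overlapping cuts at both levels; the function name `tile_grid` is hypothetical.

```python
def tile_grid(image_size, tile_size, window_size):
    """Tiles per image and words per tile for non-overlapping square cuts."""
    tiles_per_side = image_size // tile_size
    windows_per_side = tile_size // window_size
    return tiles_per_side ** 2, windows_per_side ** 2

n_tiles, words_per_tile = tile_grid(image_size=4000, tile_size=100, window_size=5)
print(n_tiles, words_per_tile)  # 1600 400
```

That is, 40 × 40 tiles cover the image, and each tile holds 20 × 20 word windows.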
In the experiments, we compare the performance of the original PLSA and LDA with that of our proposed method. Furthermore, the annotation performance obtained by combining the soft probabilities P^{SVM} and P^{Med} is better than that of either one alone.
The same experimental settings and workflow were applied to image II, which has eight semantic classes: water (WAs), bare land (BLs), roads (ROs), factories (FAs), farmland (FLs), green land (GLs), high buildings (HBs, commercial buildings), and short buildings (SBs, residential buildings). The number of topics varies from 10 to 100, and OPP-SIFT features are used instead of SIFT features. Figure 6(c) shows one example of each class from the eight-class satellite scene.
4.3. Annotation Results and Analysis
Annotation accuracy for each category is calculated as the ratio of the correctly annotated pixels to the total number of the category pixels, given in percentage with reference to the ground truth map.
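The per-category accuracy defined above can be computed as in this minimal sketch, which assumes that the ground truth and the annotation are label maps of the same size with classes indexed from 0. The toy maps and the function name `per_class_accuracy` are illustrative only.

```python
def per_class_accuracy(truth, predicted, n_classes):
    """Percentage of pixels of each ground-truth class that received the same label."""
    correct = [0] * n_classes
    total = [0] * n_classes
    for t_row, p_row in zip(truth, predicted):
        for t, p in zip(t_row, p_row):
            total[t] += 1
            correct[t] += (t == p)
    # A class absent from the ground truth has no defined accuracy.
    return [100.0 * c / n if n else float("nan") for c, n in zip(correct, total)]

# Toy 2 x 4 label maps (assumed).
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
predicted = [[0, 0, 1, 0],
             [0, 1, 1, 1]]
acc = per_class_accuracy(truth, predicted, n_classes=2)
print(acc)  # [75.0, 75.0]
```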
With the experimental settings fixed, we ran nine tests using different aspect models or the SVM, with and without the CRF. In the BOW+SVM case, we specifically tested the accuracy and computational speed of the linear-kernel SVM and the radial basis function (RBF) kernel SVM, a nonlinear kernel. Compared with the classification accuracy of 88% obtained by the linear kernel, the RBF kernel achieves 90%; however, its running time increases by almost 250%. Therefore, given that the performance of the linear kernel is acceptable, we choose the linear kernel rather than the nonlinear one. The annotation results and classification accuracy of our proposed M^{3}DA-RF method are illustrated in Figures 7 and 8, respectively. As expected, our method outperforms the other methods. Owing to the simplicity of the scene structure, the accuracy of our method on image I reaches 98.19%.
In addition, to highlight the effect of the multiple soft label posterior probability, a partial enlarged view of the M^{3}DA-RF result is shown in Figure 9 (the yellow circles mark regions of misclassification between residential area and bare land, and the blue circles mark regions misclassified as green land). Compared with the other two methods, which use only a single soft posterior probability P^{Med} or P^{SVM}, the annotation result of M^{3}DA-RF is much closer to the ground truth and shows less confusion among semantic classes.
Given that image II is a color image with more complex scene structures and more semantic classes than image I, we focus on image II. The annotation results of image II are shown in Figure 10. The results in the first row are obtained directly from three different topic models. It is not difficult to see that most of FLs, BLs, and HLs are labeled correctly. The satisfactory performance on these three semantic classes results from larger quantities of training samples and more recognizable structures. However, the confusions between SBs and FAs and between GLs and ROs are obvious, because these semantic classes share a few similar topics. On the whole, the results obtained from all three topic models, which do not consider spatial dependencies among labels, are rather noisy. For instance, in the upper-left area of the image, a few BLs are misclassified as FLs; in the upper-middle area, some HBs are confused with other classes, and this confusion is more serious in PLSA and MedLDA than in LDA.
To take spatial contextual information into account, a soft label field described by the CRF is employed. The corresponding results are shown in the second row of Figure 10. As a result of the smoothing effect, only a small number of isolated patches remain in the annotation results, which therefore appear much more homogeneous. Meanwhile, as shown in Table 1, the classification accuracies of the three models improve after CRF smoothing.
The overall accuracies of the different methods with different topic numbers for image II are shown in Table 1. As mentioned above, the combination of the soft probabilities P^{Med} and P^{SVM} (the multiple class label posterior) is reasonable, since instances that are inseparable in one feature subspace may be separated in the other. The experimental results further validate our analysis. Our M^{3}DA-RF model performs better than both MedLDA+CRF and BOW+SVM+CRF, as shown in Figure 11. In addition, the classification accuracy is generally better with a larger number of topics, which is reflected in the first six groups of experiments in Table 1. Meanwhile, the accuracy of our M^{3}DA-RF model stays relatively stable as the number of topics grows, mainly because the word-level features (i.e., the soft label probability P^{SVM}) play a dominant role in the multi-level max-margin discriminative feature space, while the topic-level features, as a supplement, are only minor feature components.
An overview of the performance on image II (with the number of topics set to 35), given by the confusion matrix of all eight semantic classes, is presented in Figure 12. In the result of our proposed method, each semantic class is well preserved, especially SBs, FAs, GLs, and ROs, the four classes that are seriously misclassified by the former methods. The annotation results appear somewhat serrated because of the rectangular cutting of patches. To eliminate this edge effect, we could over-segment the original image and re-annotate it at the superpixel level with our proposed method. This work will be done in the future.
5. Discussion
In this work, we have attempted to improve the annotation performance on high-resolution satellite images from two different aspects. On the one hand, considering that low-level features may not precisely represent the scene semantics of images, the MedLDA model [10], a powerful discriminative topic model, is employed to extract high-level semantic features (also known as topic-level features). On the other hand, topic models ignore the spatial neighborhood relationship because of the independence assumption on visual words, and hence we introduce the CRF to strengthen the neighborhood coherence. Furthermore, MedLDA provides only topic-level features, whereas word-level features, which are important in image annotation tasks, cannot be obtained from it. We therefore propose the M^{3}DA framework, which takes both coarse and fine semantics into consideration, to combine the topic-level and word-level features.
The experimental results shown in Figure 8 and Table 1 suggest that our proposed M^{3}DA-RF model performs better than MedLDA alone and other typical topic models [2,3], mainly because it uses discriminative features from different levels and efficiently reinforces the local correlation of neighboring areas. Figures 9 and 11 show that the advantage of the M^{3}DA framework lies in less confusion among the different semantic classes, obtained by combining features in a multi-level max-margin fashion. Figures 7 and 10 show that the M^{3}DA-RF model leads to smoother and more accurate annotation. Meanwhile, the annotation accuracy of our M^{3}DA-RF model tends to a relatively stable value as the number of topics grows, mainly because the word-level features (i.e., the soft label probability P^{SVM}) play a dominant role in the multi-level max-margin discriminative feature space, while the topic-level features are helpful supplements.
The work most closely related to ours is detailed in [16], where the authors only exploit different types of feature representation and do not make full use of the contextual information that may be beneficial to annotation tasks. Some other related studies [1,5,9] have investigated the application of topic models to satellite image annotation. These studies did not apply multi-level features in the classification framework: [5] introduced spatial information by cutting the large image into small overlapping patches, and [9] employed a Markov random field to exploit the contextual information in satellite images. However, we suggest that the CRF model is more suitable for discriminative tasks such as image annotation or scene classification. Therefore, our M^{3}DA-RF model not only exploits features of different levels but also incorporates the CRF model to obtain smoother and more precise annotation.
Our proposed method is currently limited in that MedLDA and the CRF are not jointly optimized: MedLDA is first trained in a fully supervised manner using the training label of each tile, and once it is fully trained, the CRF is trained using the MedLDA output probabilities as feature potentials. Given the structure of this model, it should also be possible to combine the margin-based training of the tile-level classifiers with the margin-based training of the CRF layer into a single max-margin CRF with a discriminatively trained topic model structure. As future work, we intend to develop a coupled model in which both MedLDA and the CRF are trained together in a variational max-margin framework.
6. Conclusion
In this paper, we focus on the semantic annotation of large high-resolution satellite images. Our proposed method, multi-level max-margin discriminative analysis (M^{3}DA), can discover effective semantic representations and produce a more discriminative class label posterior within the framework of multi-level max-margin discrimination. The semantic annotation performance is clearly improved by the combination with the conditional random field (CRF), which accounts for contextual information, and the proposed algorithm yields an average annotation accuracy approximately 13.2% higher than the original maximum entropy discrimination latent Dirichlet allocation (MedLDA) method. The experimental results on two satellite images with quite different land covers demonstrate its robustness and effectiveness.
Acknowledgments
This work was supported in part by the National Key Basic Research and Development Program of China under Contract 2013CB733404 and the Chinese National Natural Sciences Foundation grants (NSFC) 61271401. The authors would like to specially thank Kan Xu for his helpful guidance on MedLDA.
Conflict of Interest
The authors declare no conflict of interest.
References
1. Yi, W.; Tang, H.; Chen, Y. An object-oriented semantic clustering algorithm for high-resolution remote sensing images using the aspect model. IEEE Geosci. Remote Sens. Lett. 2011, 8, 522–526.
2. Hofmann, T. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 2001, 42, 177–196.
3. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
4. Larlus, D.; Jurie, F. Latent mixture vocabularies for object categorization and segmentation. Image Vis. Comput. 2009, 27, 523–534.
5. Lienou, M.; Maitre, H.; Datcu, M. Semantic annotation of satellite images using latent Dirichlet allocation. IEEE Geosci. Remote Sens. Lett. 2010, 7, 28–32.
6. Xu, K.; Yang, W.; Liu, G.; Sun, H. Unsupervised satellite image classification using Markov field topic model. IEEE Geosci. Remote Sens. Lett. 2013, 10, 130–134.
7. Wang, X.; Grimson, E. Spatial Latent Dirichlet Allocation. Proceedings of 21st Neural Information Processing Systems, Vancouver, BC, Canada, 3–8 December 2007; pp. 1577–1584.
8. Verbeek, J.; Triggs, B. Region Classification with Markov Field Aspect Models. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8.
9. Yang, W.; Dai, D.; Triggs, B.; Xia, G.S. SAR-based terrain classification using weakly supervised hierarchical Markov aspect models. IEEE Trans. Image Process. 2012, 21, 4232–4243.
10. Zhu, J.; Ahmed, A.; Xing, E.P. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1257–1264.
11. Blei, D.M.; McAuliffe, J.D. Supervised Topic Models. Proceedings of 21st Neural Information Processing Systems, Vancouver, BC, Canada, 3–8 December 2007; pp. 121–128.
12. Zhao, B.; Li, F.; Xing, E. Image Segmentation with Topic Random Field. Proceedings of 11th European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; pp. 785–798.
13. Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001.
14. Delong, A.; Osokin, A.; Isack, H.N.; Boykov, Y. Fast approximate energy minimization with label costs. Int. J. Comput. Vis. 2012, 96, 1–27.
15. Kolmogorov, V.; Zabih, R. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 147–159.
16. Wang, Y.; Mori, G. Max-Margin Latent Dirichlet Allocation for Image Classification and Annotation. Proceedings of 22nd British Machine Vision Conference, Dundee, UK, 29 August–2 September 2011.
Figure 1.
Land-usage classes such as “Commercial area” often include several visually distinct kinds of image content. It is thus useful to associate several abstract visual “topics” with each class.
Figure 5.
Original image I to be annotated and the corresponding hand-labeled ground truth. (a) Original image (GeoEye-1). (b) Hand-labeled ground truth.
Figure 6.
Original image II to be annotated and examples of image II. (a) Original image (GeoEye-1). (b) Hand-labeled ground truth. (c) Example of each class in the eight-class satellite scene.
Figure 7.
The annotation results of different methods for image I (the number of topics is fixed at 35). (a) PLSA. (b) LDA. (c) MedLDA. (d) PLSA+CRF. (e) LDA+CRF. (f) MedLDA+CRF. (g) BOW+SVM. (h) BOW+SVM+CRF. (i) M^{3}DARF.
Figure 9.
Partial enlarged view of different annotation results. (a) Ground truth. (b) MedLDA+CRF. (c) BOW+SVM+CRF. (d) M^{3}DARF.
Figure 10.
The annotation results of different methods for image II (the number of topics is fixed at 35). (a) PLSA. (b) LDA. (c) MedLDA. (d) PLSA+CRF. (e) LDA+CRF. (f) MedLDA+CRF. (g) BOW+SVM. (h) BOW+SVM+CRF. (i) M^{3}DARF.
Figure 11.
Annotation accuracies of three methods for different numbers of topics. (a) Comparison between BOW+SVM+CRF and M^{3}DARF. (b) Comparison between MedLDA+CRF and M^{3}DARF.
Figure 12.
Confusion matrices of the semantic classes obtained by MedLDA and by our proposed method. (a) MedLDA. (b) M^{3}DARF.
Method \ Topics  10  20  30  35  40  50  60  75  100
PLSA  68.06%  69.44%  71.38%  72.25%  73.5%  73.13%  73.69%  73.94%  74.44% 
LDA  69.38%  73.13%  74.56%  76.13%  74.94%  75.94%  76.38%  77.94%  78.5% 
MedLDA  71.4%  73.6%  76.4%  77.6%  79%  79.4%  80.1%  83.18%  83.93% 
PLSA+CRF  72%  73%  75.75%  76.88%  76.94%  77.44%  78.125%  78.81%  78.81% 
LDA+CRF  71.88%  78.18%  79.13%  80.06%  80.5%  81%  80.81%  82.31%  83.5% 
MedLDA+CRF  76.69%  77.44%  80.31%  81%  80.5%  83%  81.69%  84.75%  86.44% 
M^{3}DARF  91.88%  91.38%  91.31%  91.38%  91.19%  91.63%  91.5%  91.75%  91.63% 
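As a quick arithmetic check, the accuracies in the table above can be averaged over the nine topic settings and compared. The short Python sketch below does this for the MedLDA and M^{3}DARF rows. Reading the gain reported in the conclusion as a difference between these two means is an assumption made here for illustration, and the names `medlda`, `m3darf` and `average_gain` are hypothetical.

```python
from statistics import mean

# Accuracies (%) copied from the MedLDA and M^{3}DARF rows of the table,
# for 10, 20, 30, 35, 40, 50, 60, 75 and 100 topics.
medlda = [71.4, 73.6, 76.4, 77.6, 79.0, 79.4, 80.1, 83.18, 83.93]
m3darf = [91.88, 91.38, 91.31, 91.38, 91.19, 91.63, 91.5, 91.75, 91.63]


def average_gain(baseline, proposed):
    """Difference between mean accuracies, in percentage points."""
    return mean(proposed) - mean(baseline)


gain = average_gain(medlda, m3darf)
spread = max(m3darf) - min(m3darf)  # sensitivity to the number of topics
print(f"mean MedLDA:   {mean(medlda):.2f}%")
print(f"mean M3DARF:   {mean(m3darf):.2f}%")
print(f"average gain:  {gain:.1f} points")
print(f"M3DARF spread: {spread:.2f} points")
```

Under this reading the gap between the two means comes out at roughly 13.2 points, in line with the figure quoted in the conclusion. The spread of the M^{3}DARF row is well under one point, whereas the MedLDA accuracies rise by more than twelve points between 10 and 100 topics.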
© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).