Coral Image Segmentation with Point-Supervision via Latent Dirichlet Allocation with Spatial Coherence

Abstract: Deep neural networks perform remarkably well on supervised learning tasks when extensive collections of labeled data are available. However, creating such large, well-annotated data sets requires a considerable amount of resources, time and effort, especially for underwater image data sets such as those of corals and marine animals. The overreliance on labels is therefore one of the main obstacles to the widespread application of deep learning methods. To overcome this need for large annotated data sets, this paper proposes a label-efficient deep learning framework for image segmentation using only very sparse point-supervision. Our approach employs latent Dirichlet allocation (LDA) with spatial coherence on the feature space to iteratively generate pseudo-labels. The method requires, as an initial condition, a Wide Residual Network (WRN) trained with sparse labels and mutual information constraints. The proposed method is evaluated on a sparsely labeled coral image data set collected from the Pulley Ridge region in the Gulf of Mexico. Experiments show that our method improves image segmentation performance over training on the sparsely labeled samples alone and achieves better results than other semi-supervised approaches.


Introduction
Semantic image segmentation is the process of automatically assigning a categorical label to each image pixel. Many critical applications require this procedure, such as marine species detection and conservation, object localization, and scene understanding. Coral detection in reef imagery is one such application, because coral reefs are struggling due to global warming and pollution. However, the quantification of coral abundance is currently performed by humans, and it is a time-consuming, tedious, and expensive task. For example, it takes 16 people several months to analyze the abundance of corals in hundreds of images collected from one typical two-week cruise. Semantic segmentation can instead be used to quantify the abundance of each species by counting the number of pixels belonging to that category. In recent years, this topic has been widely investigated using deep learning based methods such as SegNet [1], U-Net [2] and the fully convolutional network (FCN) [3]. However, such models require full pixel-level annotation to train, and existing marine species and biomedical image data sets lack such labels because of the cost of pixel-level annotation. In our work, humans provide labels for only 50 pixels per image. Figure 1 shows the sparse point-level labels in the coral image data set, where different colors represent different classes.
Semi-supervised semantic segmentation can be framed as semi-supervised image classification, in which a sliding-window patch is used to identify the class of the patch's central pixel. Prior works on semi-supervised classification fall into two main categories. The first is consistency regularization, which adds a regularizer to the loss function. This term is applied either to all images or only to the unlabeled samples, and is designed on the assumption that if a realistic perturbation is applied to the unlabeled data samples, the network prediction should not change significantly. The Π-model [4] encourages the distance between the network output for an original input and the output for its standard transformation (e.g., flipping, cropping) to be small. Virtual adversarial training (VAT) [5] approximates the tiny perturbation of the input that would most significantly affect the network output, and then adds a consistency regularizer to the objective function to penalize the difference between the network outputs for the perturbed and unperturbed samples. Methods in the second category are called pseudo-labeling because they assign pseudo-labels to the unlabeled samples based on either the predictions of a trained network or the similarity between labeled and unlabeled samples. The pseudo-labeled examples augment the human labels in the training process with a supervised loss, such as cross entropy. Both categories use a standard loss term that is trained with supervision from labeled samples. Our method belongs to the pseudo-labeling methodology. There are many different ways to assign pseudo-labels to unlabeled data. The simplest way is based on the distance from the true labels, as exemplified by our previous work [6,7], which generates pseudo-labels using superpixels in the input images. Lee et al. [8] were the first, to our knowledge, to use the trained network to infer pseudo-labels of unlabeled examples effectively by choosing the most confident class. Similarly, entropy minimization (EntMin) [9] encourages the network to make "confident" predictions for all unlabeled samples. The same principle was adopted by Shi et al. [10], where the authors further add a contrastive loss to the consistency loss in the feature space, combined with a Mean Teacher approach [11]. Blundell et al. [12] and Kendall et al. [13] infer the pseudo-labels using a Bayesian neural network (BNN) rather than a traditional neural network. Other methods for generating pseudo-labels employ a graph model, which considers samples as nodes and finds the labels of unlabeled nodes from labeled nodes. Zhu et al. [14] propose label propagation and Ahmet et al. [15] apply label propagation to a deep neural network. Carlini et al. [16] achieve better performance by incorporating ideas from consistency regularization, entropy minimization and the Mixup operation [17]. Recently, deep learning has been applied to coral images. Gonzalez-Rivero et al. [18] employ convolutional neural networks for coral image patch classification, but without data augmentation. Akbari Asanjan et al. [19] develop a deep learning model for extracting domain-invariant features from multimodal remote sensing imagery to create high-resolution coral images. Modasshir et al. [20] focus on coral video and use forward and backward tracking algorithms to generate labels. Our method differs from all of the above methods in that our pseudo-labels are inferred from a latent class distribution.
Our method focuses on the feature space because the input space is high dimensional, which makes clustering difficult. A good feature representation therefore plays a critical role in our proposed method. To this end, we apply the information maximization criterion, which maximizes the mutual information between the input and the latent features, in the training process to obtain a good representation. In this paper, we estimate the mutual information with the matrix-based Rényi's α-order entropy functional proposed by Giraldo et al. [21] and extended to the multivariate case by Yu et al. [22]. The main advantage of this approach is that it estimates the entropy and joint entropy directly from data without PDF estimation. This methodology is different from the variational information bottleneck (VIB) [23] and mutual information neural estimation (MINE) [24], which either approximate a variational lower bound of the mutual information or find a function that maximizes the lower bound, and whose accuracy on complex imagery is unclear.
The main idea of assigning pseudo-labels in our method is to find the probability of the image patch given the class and to assign to the image patch the label with the highest probability. To obtain the latent class distribution over the image patches, we need to fit the feature space with a statistical model. Latent Dirichlet Allocation (LDA) [25], a three-level hierarchical Bayesian model, is a good choice. Each item of a collection is modeled as a mixture of topics and each topic is modeled as a mixture over the codebook. To apply LDA to image processing, we regard whole images as documents, categories as topics, and small image patches as visual words. However, traditional LDA is a "bag-of-words" model and does not consider spatial information at all, which is essential for image processing. Therefore, we add spatial information to LDA by calculating the frequency of each category around the image patches. Unlike Wang et al. [26], who add another layer between the codebook and the category, our method is simpler and easier to train. Motivated by active learning, which allows a human in the loop to annotate data at each iteration, we also propose an iterative strategy to generate pseudo-labels. The key idea of our strategy is to use previously learned knowledge to improve model learning by adding pseudo-labels inferred from that knowledge.
In this paper, we propose a novel framework to generate pseudo-labels iteratively, depending only on the original sparse labels. To summarize, the contributions of this paper are as follows. First, we propose a simple yet effective framework for semantic image segmentation based on sparse point-supervision. Second, we modify latent Dirichlet allocation (LDA) by adding spatial coherence and use the latent distribution as the criterion to generate pseudo-labels iteratively. Finally, we add a mutual information constraint between the input and the feature space to obtain a good representation.
The rest of this paper is organized as follows. Section 2 provides an overview of our method and describes each part of our framework in detail. Section 3 presents the results of the proposed method on the coral image dataset compared with other semi-supervised approaches, together with an ablation study of the impact of the different components of our method. Conclusions and future work are given in Section 4.

Materials and Methods
In this section, we first provide an overview of the proposed method and then formulate coral image segmentation as a semi-supervised image classification problem. Each part of Figure 2 is then described in detail.
As summarized in Figure 2, our framework consists of three steps, starting from a randomly initialized network. The first step trains the network with labeled samples and a mutual information constraint between the input and the latent features. The second step applies LDA with spatial coherence to the embedding of the network trained in the previous step, in order to infer the category distribution over latent features and generate pseudo-labels. The third step trains the neural network on the entire training set, with labeled samples, pseudo-labeled samples and unlabeled samples. The pseudo-labeled samples are weighted per sample and per class.
Figure 2. Overview of the proposed framework. Step 1: train for T epochs. Step 2: pseudo-label generation by LDA with spatial coherence. Step 3: train for T' epochs.

Preliminaries
In this section, we first formulate semi-supervised coral image segmentation and then discuss the loss functions used in our work. For semi-supervised classification, we assume a collection of n examples $X = (x_1, x_2, \ldots, x_l, x_{l+1}, \ldots, x_n)$ with $x_i \in \mathcal{X}$. The first l examples $x_i$ for $i \in L = \{1, \ldots, l\}$, denoted by $X_L$, are labeled by $Y_L = (y_1, y_2, \ldots, y_l)$ with $y_i \in C$, where $C = \{1, \ldots, c\}$ is a discrete label set for c classes. The remaining $u = n - l$ examples $x_i$ for $i \in U = \{l+1, \ldots, n\}$, denoted by $X_U$, are unlabeled. The goal is to use all of X (image patches) and only the small label set $Y_L$ (point-supervision) to train a classifier that identifies the classes of the unlabeled samples $X_U$. In practice, the labeled set is much smaller than the unlabeled set. In our coral image dataset, only 0.0015% of the pixels are labeled.
The neural network takes input examples from $\mathcal{X}$ and produces a vector of class probabilities. We denote it by $f_w: \mathcal{X} \to \mathbb{R}^c$, where w represents the parameters of the network. The function $f_w$ is the mapping from the input space to the class space. The output of the network for the i-th example is $f_w(x_i)$ and the prediction is the index of the maximum probability, as shown in Equation (1):
$$\hat{y}_i = \arg\max_{j} f_w(x_i)_j \quad (1)$$
where the subscript j denotes the j-th dimension of the probability vector, corresponding to the j-th class. Training requires an objective function, which is minimized by following the gradient of the loss with respect to the parameters w. Our method has two stages. First, we train a classifier with labels and a mutual information constraint to obtain a good feature representation. Then, we generate pseudo-labeled samples via spatial LDA in the feature space extracted in the first stage and add them to the training set.
The objective function ($L_1$) for the first stage consists of two components, the supervised loss ($L_s$) and the mutual information constraint loss ($L_{MI}$), as shown in Equation (2), where a minus sign is placed before the mutual information term because we want to maximize the mutual information.
The objective function ($L_2$) for the second stage consists of three components: the supervised loss ($L_s$), the pseudo-label loss ($L_p$) and the mutual information constraint loss ($L_{MI}$), as shown in Equation (3). This brings the information from the pseudo-labeled samples into the loss function.
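As a compact reading of the two objectives described above, the following sketch writes them out. The weights $\lambda_p$ and $\lambda_{MI}$ are an assumption here, borrowed from the hyper-parameters named for the final training objective; the exact placement of weights in Equations (2) and (3) may differ, and setting both to 1 gives the unweighted form.

```latex
\begin{align}
L_1 &= L_s - \lambda_{MI}\, L_{MI}, \\
L_2 &= L_s + \lambda_p\, L_p - \lambda_{MI}\, L_{MI}.
\end{align}
```

In both cases the mutual information enters with a negative sign, so minimizing the objective maximizes the mutual information between the input and the features.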
The network is trained by minimizing a supervised loss term ($L_s$) on the labeled samples in $X_L$, as shown in Equation (4). A standard choice of $\ell_s$ in classification is the cross-entropy loss. The pseudo-label loss ($L_p$) is the second component of $L_2$ and is applied only to pseudo-labeled samples. $Y_p$ represents the pseudo-labels of $X_p$, and $\hat{y}_i$ in Equation (5) denotes the pseudo-label for each example $x_i$ with $i \in U$. This label is assigned according to the latent class distribution from LDA with spatial information, described in Section 2.3.
The third component of $L_2$ is the mutual information between the input space and the latent feature space, shown in Equation (6). We add this term because we want a representation that combines not only the label information but also the structure of the input. The classifier (dotted rectangle box in Figure 1) is conceptually divided into two parts. The first part is the feature extraction network $\phi_w: \mathcal{X} \to \mathbb{R}^d$, which maps the input to a d-dimensional feature vector; we write $v_i = \phi_w(x_i)$ for the i-th input sample $x_i$. The second part, the classifier proper, typically consists of a fully connected layer applied on top of $\phi_w$, followed by a softmax layer.
The classifier of choice is the Wide Residual Network (WRN) [27], which is widely used in semi-supervised methods for image classification. It consists of an initial convolutional layer and three groups of residual blocks, followed by average pooling and a final fully connected layer. The main difference between WRN and ResNet [28] is that WRN uses more kernels per layer than ResNet, which yields a better representation.

Feature Extraction with Information Maximization
To obtain a good feature representation for the input samples, we require that the feature space not only contains the label information but also preserves the structure of the input samples. Therefore, we maximize the mutual information between the input space and the feature space. The loss function is given in Equation (7). The first term is the cross-entropy loss between the predicted and true labels, and the second term is the mutual information between the input and its corresponding features.
For completeness, we briefly review below the matrix-based Rényi's α-order entropy functional on positive definite matrices and how it is used to calculate mutual information. We first give the definitions of entropy and joint entropy and then provide the equation for the mutual information.
Definition 1. Let $\kappa: \chi \times \chi \to \mathbb{R}$ be a real-valued positive definite kernel that is also infinitely divisible. Given $\{x_i\}_{i=1}^{n} \in \chi$, where each $x_i$ can be a real-valued scalar or vector, and the Gram matrix $K \in \mathbb{R}^{n \times n}$ computed as $K_{ij} = \kappa(x_i, x_j)$, a matrix-based analogue to Rényi's α-entropy is given by the following functional:
$$S_\alpha(A) = \frac{1}{1-\alpha} \log_2\left(\mathrm{tr}(A^\alpha)\right) = \frac{1}{1-\alpha} \log_2\left(\sum_{i=1}^{n} \lambda_i(A)^\alpha\right) \quad (8)$$
where $\alpha \in (0, 1) \cup (1, \infty)$, A is the normalized version of K, i.e., $A = K/\mathrm{tr}(K)$, and $\lambda_i(A)$ denotes the i-th eigenvalue of A.

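Definition 2. Given n pairs of samples $\{(x_i, y_i)\}_{i=1}^{n}$, where each sample contains two measurements $x \in \chi$ and $y \in \gamma$ obtained from the same realization, and given positive definite kernels $\kappa_1: \chi \times \chi \to \mathbb{R}$ and $\kappa_2: \gamma \times \gamma \to \mathbb{R}$, a matrix-based analogue to Rényi's α-order joint entropy can be defined as:
$$S_\alpha(A, B) = S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \quad (9)$$
where $A_{ij} = \kappa_1(x_i, x_j)$, $B_{ij} = \kappa_2(y_i, y_j)$, and $A \circ B$ denotes the Hadamard product between the matrices A and B.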
Given Equations (8) and (9), the matrix-based Rényi's α-order mutual information $I_\alpha(A; B)$, in analogy with Shannon's mutual information, is given by:
$$I_\alpha(A; B) = S_\alpha(A) + S_\alpha(B) - S_\alpha(A, B) \quad (10)$$
Throughout this work, we use the Gaussian kernel $\kappa(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$ to obtain the Gram matrices. For each sample, we evaluate the distances to its k (k = 10) nearest neighbors and take the mean. We choose the kernel width σ as the average of these mean values over all samples. Further information and the analytical gradient of Equation (10) are given in Appendix A.
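The estimator described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the default `alpha=1.01` is an assumption, since the value of α used in the experiments is not stated here.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances between all rows of x (flattened per sample)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.sqrt(d2)

def knn_kernel_width(x, k=10):
    """Mean distance to the k nearest neighbours, averaged over all samples."""
    d = np.sort(pairwise_distances(x), axis=1)[:, 1:k + 1]  # skip self-distance
    return float(d.mean())

def gaussian_gram(x, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return np.exp(-pairwise_distances(x) ** 2 / (2.0 * sigma ** 2))

def renyi_entropy(K, alpha=1.01):
    """Matrix-based Renyi entropy of the trace-normalised Gram matrix."""
    A = K / np.trace(K)
    eig = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # guard tiny negatives
    return float(np.log2(np.sum(eig ** alpha)) / (1.0 - alpha))

def renyi_joint_entropy(Ka, Kb, alpha=1.01):
    """Joint entropy from the Hadamard product; renyi_entropy renormalises."""
    return renyi_entropy(Ka * Kb, alpha)

def renyi_mutual_information(Ka, Kb, alpha=1.01):
    """I(A;B) = S(A) + S(B) - S(A,B)."""
    return (renyi_entropy(Ka, alpha) + renyi_entropy(Kb, alpha)
            - renyi_joint_entropy(Ka, Kb, alpha))
```

In use, one Gram matrix would be built from a mini-batch of input patches and the other from the corresponding feature vectors, each with its own kernel width from `knn_kernel_width`. Because the entropy depends only on eigenvalues, no density estimate is needed.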

LDA with Spatial Information
In this section, we first give a brief introduction to traditional LDA and then modify LDA by adding local spatial information. LDA is one of the most popular generative models. It was originally developed for natural language processing and has a three-level hierarchical structure. Recently, it has been adopted rapidly in image processing tasks such as image segmentation, classification and annotation. When LDA is applied to image processing, we treat the classes of objects as topics, local patches of images as words and the whole image as a document. A codebook is created by clustering all the local descriptors in the image set using K-means. Each local patch is quantized into a visual word according to the codebook. The graphical model of traditional LDA is shown in Figure 3. There are M images in the dataset. Each image m has $N_m$ image patches. $v_{m,n}$ is the observed feature value of the local image patch n in image m, and $z_{m,n}$ denotes the hidden class for $v_{m,n}$. All the local image patches in the corpus are clustered into K classes. Each image m is modeled as a multinomial distribution $p(z_{m,n} \mid \theta_m)$ with parameter $\theta_m$ over classes, and similarly each category k is modeled as a multinomial distribution $p(v_{m,n} \mid \phi_z)$ with parameter $\phi_z$ over the visual codebook; α and β are the Dirichlet priors for the multinomial distributions. Equation (11) shows the LDA model, in which $\theta_m$ and $\phi_z$ are hidden variables to be inferred. The generative process of LDA is shown in Algorithm 1. The hidden category variable $z_{m,n}$ can be sampled through a Gibbs sampling [29] procedure that integrates out $\theta_m$ and $\phi_k$. We first randomly assign a class to each image patch and then determine the class according to Equation (12). More details about Gibbs sampling for LDA are given in Appendix B.
$$P(z_{mn} = k \mid x_{mn} = t, z_{\neg mn}, x_{\neg mn}, \alpha, \beta) \propto \frac{n^{k}_{t,\neg mn} + \beta}{\sum_{t'} n^{k}_{t',\neg mn} + V\beta} \cdot \frac{n^{m}_{k,\neg mn} + \alpha}{\sum_{k'} n^{m}_{k',\neg mn} + K\alpha} \quad (12)$$
where V is the codebook size, $n^{k}_{t,\neg mn}$ is the number of visual words in the corpus with value t assigned to category k, excluding visual word n in document m, and $n^{m}_{k,\neg mn}$ is the number of visual words in document m assigned to category k, excluding word n in document m. Equation (12) is the product of two ratios: the probability of visual word $v_{mn} = t$ under category k ($\phi^{k}_{t}$) and the probability of category k in document m ($\theta^{m}_{k}$). However, traditional LDA is a "bag-of-words" model and does not consider spatial information at all, which is essential for image processing. Therefore, we add spatial information to the original formulation, based on the assumption that visual words from the same class of objects should also be close in space, so that image patches that are close in space are grouped together. One straightforward way is to calculate the frequency of each category in the neighborhood of the image patch and add it to the corresponding conditional category distribution. We therefore change the category distribution term from $p(z_{m,n} \mid \theta_m)$ to $p(z_{m,n} \mid \theta_m, z^{i}_{m,n})$, which brings in the local information, as shown in Equation (13). The LDA graphical model with spatial coherence is shown in Figure 4. Equation (13) gives the conditional $P(z_{mn} = k \mid x_{mn} = t, z_{\neg mn}, x_{\neg mn}, \alpha, \beta)$, where λ is a trade-off parameter that sets the weight of the local spatial information and $z^{i}_{mn}$ is the category of the i-th of the N neighboring image patches of $z_{mn}$. Recall that the indicator function $1(z^{i}_{mn} = k)$ equals 1 if and only if $z^{i}_{mn} = k$. Compared with Equation (12), Equation (13) makes an image patch more likely to take the category of its neighbors. In this paper, we set N = 8, corresponding to the eight-connected neighborhood of the center image patch. The LDA generative process with spatial coherence is almost the same as that of the original LDA (Algorithm 1), except that the category $z_{mn}$ is sampled from $P(z_{mn} = k \mid z^{i}_{mn}, z_{\neg mn}, x_{\neg mn})$.
Algorithm 2 describes inference of the parameters $\theta_m$ and $\phi_k$ of LDA with spatial information using Gibbs sampling.

Algorithm 2. Gibbs sampling for LDA with spatial coherence
Input: image patch feature value matrix (M × H × W), the number of categories K, initial category of each image patch feature.
Output: $\theta_m$ and $\phi_k$.
for each of T iterations do
    for each image m do
        for each image patch n do
            Sample the category of the n-th image patch based on Equation (13).
        end for
    end for
end for
Estimate $\theta_m$ and $\phi_k$.
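The sampler in Algorithm 2 can be sketched as follows. This is one plausible reading, not the authors' code: it assumes the spatial term enters by adding λ times the count of eight-connected neighbors currently assigned to category k to the document-category count. The function name, the symmetric scalar priors and their default values are also assumptions.

```python
import numpy as np

def gibbs_spatial_lda(words, K, V, alpha=0.1, beta=0.01, lam=0.01,
                      n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with an 8-neighbour coherence term.

    words: integer array (M, H, W) of visual-word indices in [0, V).
    Returns theta (M, K), phi (K, V) and the final assignments z (M, H, W).
    """
    rng = np.random.default_rng(seed)
    words = np.asarray(words)
    M, H, W = words.shape
    z = rng.integers(0, K, size=(M, H, W))          # random initial categories

    n_kt = np.zeros((K, V))                          # word counts per category
    n_mk = np.zeros((M, K))                          # category counts per image
    np.add.at(n_kt, (z.ravel(), words.ravel()), 1)
    np.add.at(n_mk, (np.repeat(np.arange(M), H * W), z.ravel()), 1)
    n_k = n_kt.sum(axis=1)

    for _ in range(n_iter):
        for m in range(M):
            for i in range(H):
                for j in range(W):
                    t, k_old = words[m, i, j], z[m, i, j]
                    # Remove the current patch from the counts.
                    n_kt[k_old, t] -= 1
                    n_mk[m, k_old] -= 1
                    n_k[k_old] -= 1
                    # Count categories among the 8-connected neighbours.
                    nb = np.zeros(K)
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            ii, jj = i + di, j + dj
                            if (di or dj) and 0 <= ii < H and 0 <= jj < W:
                                nb[z[m, ii, jj]] += 1
                    # Word term times (document term + assumed spatial term).
                    p = ((n_kt[:, t] + beta) / (n_k + V * beta)
                         * (n_mk[m] + alpha + lam * nb))
                    k_new = rng.choice(K, p=p / p.sum())
                    z[m, i, j] = k_new
                    n_kt[k_new, t] += 1
                    n_mk[m, k_new] += 1
                    n_k[k_new] += 1

    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (n_kt + beta) / (n_k[:, None] + V * beta)
    return theta, phi, z
```

With `lam=0` the update reduces to the standard collapsed Gibbs step for LDA, which makes the effect of the spatial term easy to isolate in experiments.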

Pseudo-Label Generation
In this section, we describe how pseudo-labels are generated based on LDA, as illustrated in Figure 5. The three heatmaps in the middle column show the higher-probability image patch codewords in areas with coral, red algae and green algae, respectively (from top to bottom), according to the category distribution (left-hand side of Figure 5). We annotate the pseudo-labels (star points) in the sample image on the right-hand side. We calculate the distance between the pseudo-labeled samples and the original labeled samples to determine the class of each cluster. One problem with generating pseudo-labels is that low-quality features extracted by the neural network at early training stages may mislead the training process, and such wrong information can spread to subsequent training. To overcome this problem, we introduce a confidence level for each pseudo-labeled sample, which indicates how reliable the pseudo-label is. For each labeled sample $x_i \in X_L$, we always set its confidence level r = 1. For each pseudo-labeled sample $x_p \in X_U$, we compute r using Equation (14), based on the assumption that $x_p$ is more reliable if it is located in a densely populated region.
where $x_i$ is an original labeled sample and $x_p$ is a generated pseudo-labeled sample. We adopt kernel density estimation to estimate the probability of the pseudo-labeled samples with respect to the labeled samples in the feature space. We use a Gaussian kernel and, for each sample, evaluate the distances to its k (k = 10) nearest neighbors and take the mean. We choose the average of these mean values over all samples as the kernel size σ. When a pseudo-labeled sample is far away from the original labeled samples, its confidence level r is small.
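The confidence level described above can be sketched as a kernel density estimate in the feature space. This is an illustration under stated assumptions rather than the exact form of Equation (14): it uses an unnormalized Gaussian kernel so that r lies in (0, 1], averages over all labeled features, and computes the kernel width from the labeled features only. The function names are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def kernel_width(feats, k=10):
    """Average over samples of the mean distance to the k nearest neighbours."""
    d = np.sort(pairwise_dist(feats, feats), axis=1)[:, 1:k + 1]  # skip self
    return float(d.mean())

def confidence(pseudo_feats, labeled_feats, k=10):
    """Mean Gaussian kernel value between each pseudo-labeled feature and
    all labeled features; small when the sample is far from labeled data."""
    sigma = kernel_width(labeled_feats, k)
    d = pairwise_dist(pseudo_feats, labeled_feats)
    return np.exp(-d ** 2 / (2.0 * sigma ** 2)).mean(axis=1)
```

A pseudo-labeled feature that sits inside the cloud of labeled features receives a value close to 1, while an outlier receives a value near 0 and therefore contributes little to the pseudo-label loss.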
In addition, we introduce a class weight ($\zeta_j$ for class j) to deal with class imbalance. $\zeta_j$ is defined in Equation (15) and is inversely proportional to the class population,
where $|L_j|$ denotes the number of labeled samples of class j and $|P_j|$ denotes the number of generated pseudo-labels of class j.
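A small sketch of such a class weight follows. The text only states that the weight is inversely proportional to the class population, so the normalization used here (weights sum to 1, absent classes get weight 0) is an assumption, as is the function name.

```python
from collections import Counter

def class_weights(labeled, pseudo, classes):
    """Weights inversely proportional to |L_j| + |P_j|, normalised to sum to 1.

    labeled, pseudo: iterables of class indices for human and pseudo labels.
    classes: iterable of all class indices.
    """
    counts = Counter(labeled) + Counter(pseudo)
    inverse = {j: 1.0 / counts[j] for j in classes if counts[j] > 0}
    total = sum(inverse.values())
    return {j: inverse.get(j, 0.0) / total for j in classes}
```

For example, with four samples of class 0 and two of class 1, class 1 receives twice the weight of class 0, which offsets its smaller population in the pseudo-label loss.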

Iterative Training
After pseudo-label generation, we train the neural network with labeled samples, pseudo-labeled samples and unlabeled samples together, using the objective function shown in Equation (16).
Equation (16) has three terms. The first term is the cross-entropy between the predictions for the labeled samples and their true labels, the second term is the cross-entropy between the predictions for the unlabeled samples and their pseudo-labels, and the last term is the mutual information between the unlabeled samples and their features. $\lambda_p$ and $\lambda_{MI}$ are hyper-parameters that adjust the importance of the last two terms.
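The three-term objective can be illustrated numerically as follows. This is a sketch under assumptions, not a transcription of Equation (16): it assumes each pseudo-labeled sample is weighted by the product of its confidence level r and the class weight ζ of its pseudo-label, that losses are averaged over the batch, and that the mutual information value is computed elsewhere and passed in. The function names are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels, weights=None):
    """Mean (optionally weighted) negative log-likelihood of the given labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    if weights is not None:
        nll = np.asarray(weights, dtype=float) * nll
    return float(nll.mean())

def stage2_loss(p_lab, y_lab, p_pseudo, y_pseudo, r, zeta, mi, lam_p, lam_mi):
    """Supervised loss + weighted pseudo-label loss - weighted mutual information."""
    l_s = cross_entropy(p_lab, y_lab)
    sample_w = np.asarray(r, dtype=float) * np.asarray(zeta, dtype=float)[np.asarray(y_pseudo)]
    l_p = cross_entropy(p_pseudo, y_pseudo, sample_w)
    return l_s + lam_p * l_p - lam_mi * mi
```

Because the mutual information is subtracted, a representation that preserves more of the input structure lowers the objective, while unreliable pseudo-labels are down-weighted through r.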
Given image patch feature extraction, pseudo-label generation and neural network training with labeled, pseudo-labeled and unlabeled samples, we plug these components into an iterative learning process. First, we train the network for T epochs with labeled samples and the mutual information constraint using Equation (7). Second, we obtain the class distribution over feature visual words via spatial LDA. Third, we assign pseudo-labels to unlabeled image patches by selecting those with the highest probability in the class distribution. Finally, we train the network on the entire dataset using Equation (16) for T' epochs. We repeat this iterative process for M iterations. The above steps are summarized in Algorithm 3.

Results
In this section, we first describe the coral image data set used in our experiments and the semi-supervised image segmentation setup. Then, we discuss the training details of our method. Finally, we compare our method experimentally with other semi-supervised image classification approaches and show the impact of the different components of the proposed method.

Dataset
The coral image data set was collected from the Pulley Ridge region in the Gulf of Mexico. It contains 120 images of size 2048 × 1536, each with only 50 labeled pixels. For each human label, we select a 30 × 30 pixel patch centered at the label. We use 100 images for training and 20 images for testing, which gives 5000 labeled image patches from the training images. Of these, we use 4000 patches for training and 1000 for validation. There are five classes: coral, rock, green algae, red algae and others.
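Patch extraction around the labeled points can be sketched as below. The handling of labels near the image border is not described in the text, so the reflect padding used here is an assumption, as are the function name and the (row, column) point convention.

```python
import numpy as np

def extract_patches(image, points, size=30):
    """Cut size x size patches around (row, col) label points.

    image: array of shape (H, W, C). The image is reflect-padded so that
    points near the border still yield full-size patches (an assumption).
    With an even size, the patch covers rows r - size/2 .. r + size/2 - 1.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    patches = [padded[r:r + size, c:c + size] for r, c in points]
    return np.stack(patches)
```

With 50 labeled points per image, this returns an array of shape (50, 30, 30, C) for each image, and the 100 training images together give the 5000 labeled patches mentioned above.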

Experiments Setup and Training Details
Experiments on the coral image dataset are performed with Wide Residual Networks (WRN). Specifically, we use "WRN-28-2", i.e., a ResNet with 28 convolutional layers and twice as many kernels as the standard ResNet, including average pooling, batch normalization and leaky ReLU nonlinearities. For training, the size of the input image patch is 30 × 30, and we use the Adam optimizer [30] with a learning rate of 0.001, a batch size of 64 for labeled samples and a batch size of 128 for unlabeled samples. We set $\lambda_{MI} = 0.1$ and linearly ramp up $\lambda_p$ to its maximum value (10 in our experiments) over the 500 training epochs. We use the mean intersection over union (mIOU) criterion [31] to evaluate our proposed method.
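Two of the quantities in this setup are simple enough to sketch directly: the linear ramp-up of the pseudo-label weight and the mIOU score. The function names are hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is an assumption about how mIOU is averaged.

```python
def ramp_weight(epoch, max_value=10.0, ramp_epochs=500):
    """Linearly increase the pseudo-label weight from 0 to max_value."""
    clipped = min(max(epoch, 0), ramp_epochs)
    return max_value * clipped / ramp_epochs

def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flat sequences of class indices."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

The ramp keeps the pseudo-label loss small early on, when the features and hence the pseudo-labels are least reliable, and lets it dominate later in training.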
We first train the network for 100 epochs with only sparse point-level labels and the mutual information constraint between the input and the output of the last layer before the softmax. Then, we use K-means to construct the visual codebook in the feature space. The codebook size is 200 and the dimension of each feature visual word is 128. Pseudo-labels are assigned to unlabeled image patches as follows: we first find the 10 highest-probability features for each class, based on the class distribution over feature visual words obtained by spatial LDA, and assign that class label to those features. Then we go back to the whole image, search for the corresponding image patches and give them the same pseudo-labels as their features. Finally, we train the neural network with labeled samples, pseudo-labeled samples and unlabeled samples together for 500 epochs. We repeat the above steps 5 to 10 times and generate about 5000 pseudo-labeled samples in each iteration.
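The assignment procedure just described can be sketched in two steps: quantize each patch feature to its nearest codeword, then label patches whose codeword is among the top-ranked words of a class. This is a simplified illustration; the function names are hypothetical, and the rule for a codeword that ranks in the top list of more than one class (keep the class with the larger probability) is an assumption.

```python
import numpy as np

def quantize(features, codebook):
    """Index of the nearest codeword (Euclidean) for each feature vector."""
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def pseudo_labels_from_phi(phi, words, top=10):
    """Label patches whose visual word is among the `top` most probable
    words of a class; patches with other words get -1 (no pseudo-label).

    phi: (K, V) class-over-word distribution; words: (n,) word indices.
    """
    phi = np.asarray(phi, dtype=float)
    K, V = phi.shape
    word_class = np.full(V, -1)
    top_words = np.argsort(-phi, axis=1)[:, :top]
    for k in range(K):
        for w in top_words[k]:
            # Assumed tie rule: keep the class with the larger probability.
            if word_class[w] == -1 or phi[k, w] > phi[word_class[w], w]:
                word_class[w] = k
    return word_class[np.asarray(words)]
```

With a codebook of 200 words and `top=10`, at most 10 words per class carry a pseudo-label, so only patches that the model associates strongly with a class are added to the training set.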

Parameter Analysis and Performance Comparison
We first show the performance for different codebook sizes, which is an essential parameter in our experiments. A small codebook cannot represent all image patches, while a large codebook increases the computational complexity of inferring the parameters of LDA. We therefore select an appropriate codebook size according to the image segmentation results shown in Figure 6. When the codebook size is larger than 200, the performance starts to decrease slowly, so we set the codebook size to 200 in our experiments. We then compare our method with other semi-supervised methods in Table 1. Pseudo-labeling methods are more accurate than the supervised approach (which uses sparse labels only). Entropy minimization, virtual adversarial training (VAT) and the Π-model work better than pseudo-labeling. Our proposed method performs better than the competing methods, and when combined with VAT it achieves the best performance of all. We combine it with VAT by adding an adversarial consistency loss term (the mean square error between the original sample and its corresponding adversarial example) to Equation (4) in stage 2 training. Figure 7 shows the coral image segmentation results for the different methods. Our proposed method detects coral well and the segmented areas are smoother than those of the other approaches. Table 2 shows the abundance of coral, green algae and red algae detected by the different methods on the coral image test dataset. Our proposed method performs much better than the others, especially for coral and red algae detection.

Figure 7. Coral image segmentation on the test dataset with different methods. Blue represents coral, green represents green algae, red represents red algae, gray represents rock and yellow represents other species.
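Abundance figures of the kind reported in Table 2 follow from counting pixels per class in a predicted label map. A minimal sketch is given below; the function name and the ordering of the class names are assumptions made for illustration.

```python
from collections import Counter

def abundance(label_map, class_names):
    """Fraction of pixels assigned to each class in a 2-D label map.

    label_map: nested sequence of class indices (rows of pixels).
    class_names: names indexed by class id (order is an assumption).
    """
    flat = [c for row in label_map for c in row]
    counts = Counter(flat)
    total = len(flat)
    return {name: counts.get(idx, 0) / total
            for idx, name in enumerate(class_names)}
```

For instance, a 2 × 2 map with two coral pixels, one rock pixel and one green algae pixel gives abundances of 0.5, 0.25 and 0.25 for those classes.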

Ablation Study
We investigate the impact of the different components of our proposed approach. First, we show the benefit of the weighting strategy (confidence level $r_i$ for samples and class weights $\zeta_j$ for different classes) for the generated pseudo-labeled samples. The green and orange curves in Figure 8 show that our weighting strategy makes a positive contribution. Then, we study the effectiveness of including spatial coherence in LDA. Figure 9a shows the log-likelihood during the Gibbs sampling process for LDA with and without spatial coherence, where λ denotes the weight that adjusts the importance of the spatial information. Adding spatial information improves the performance (the higher the log-likelihood the better), and the best performance is achieved when λ = 0.01, corresponding to the green curve in Figure 9. Similarly, we plot the log-likelihood for LDA with and without the mutual information constraint in Figure 9b, which shows that the features extracted with MI constraints are better than those extracted without them.

Figure 9. (a) Log-likelihood for LDA with or without spatial coherence; (b) log-likelihood for LDA with or without MI regularization.

Table 3 and Figure 8 demonstrate that the weighting strategy, spatial coherence and MI constraints in our proposed method all contribute positively to coral image segmentation. Spatial coherence in LDA takes the local patch information into account, and weighting the pseudo-labeled samples reduces the harmful effect of labeling errors. The MI constraints introduced in the loss function yield a better representation for feature extraction.

Conclusions
In this paper, we propose a novel and effective framework to generate pseudo-labels iteratively, depending only on sparse labels. The results on the coral image data set from Pulley Ridge show that our approach generates more correct pseudo-labels and yields better image segmentation results than other semi-supervised methods. The main advantage of generating pseudo-labels iteratively is that previously learned knowledge can be incorporated to improve model learning and the final results. However, a limitation of our method is that it does not work well for under-represented classes, i.e., classes that account for a low percentage of the overall pixels. Nevertheless, our method is a productive way to tell human experts which classes need more annotation and which classes already have sufficient labels to yield good identification results. Future work may follow four directions. First, we think that metric learning may quantify the uncertainty of the pseudo-labels by including distances in the input space, latent feature space and label space. Second, we want to improve the information theoretic methods to obtain more useful information besides the label information. Third, we want to change the current architecture for image patch classification to a fully convolutional network. One obvious weakness of the current architecture is that the network sees only small image patches and cannot capture the structure of the whole image. Last but not least, we want to develop graphical user interface (GUI) software that allows human-in-the-loop interaction to guide the annotation of more useful labels.

Data Availability Statement: Data is available from the authors.

Acknowledgments:
We would like to thank Stephanie Farrington, John Reed and Brian Cousin for preparing the coral image dataset.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Similarly, we can obtain the distribution of x given z and β.
Therefore, we can obtain the joint distribution of the image patch features x and their categories z, which is shown as follows.
$$P(x, z \mid \alpha, \beta) = P(z \mid \alpha)\, P(x \mid z, \beta)$$
We use the Gibbs sampling algorithm, which is one of the Markov chain Monte Carlo (MCMC) methods, to estimate the parameters of the LDA model.