Multi-Label Classification from Multiple Noisy Sources Using Topic Models †

Multi-label classification is a well-known supervised machine learning setting where each instance is associated with multiple classes. Examples include annotation of images with multiple labels, assigning multiple tags to a web page, etc. Since several labels can be assigned to a single instance, one of the key challenges in this problem is to learn the correlations between the classes. Our first contribution assumes that the labels come from a perfect source; for this setting, we propose a novel topic model (ML-PA-LDA). The distinguishing feature of our model is that the classes that are present as well as the classes that are absent generate the latent topics and hence the words. Extensive experimentation on real world datasets reveals the superior performance of the proposed model. A natural way to procure the training dataset is by mining user-generated content or directly through users on a crowdsourcing platform. In this more practical scenario of crowdsourcing, an additional challenge arises, as the labels of the training instances are provided by noisy, heterogeneous crowd-workers with unknown qualities. With this motivation, we further augment our topic model to the scenario where the labels are provided by multiple noisy sources and refer to this model as ML-PA-LDA-MNS. In experiments with simulated noisy annotators, the proposed model learns the qualities of the annotators well, even with minimal training data.


Introduction
With the advent of internet enabled hand-held mobile devices, there is a proliferation of user generated data. Often, there is a wealth of useful knowledge embedded within this data, and machine learning techniques can be used to extract it. However, as much of this data is user generated, it suffers from subjectivity. Any machine learning technique used in this context should address this subjectivity in a principled way.
Multi-label classification is an important problem in machine learning where an instance d is associated with multiple classes or labels. The task is to identify the classes for every instance. In traditional classification tasks, each instance is associated with a single class; however, in multi-label classification, an instance can be explained by several classes. Multi-label classification finds applications in several areas, for example, text classification, image retrieval, social emotion classification [1], sentiment based personalized search [2], financial news sentiment [3,4], etc. Consider the task of classification of documents into several classes such as crime, politics, arts, sports, etc. The classes are not mutually exclusive, since a document belonging to, say, the "politics" category may also belong to "crime". In the case of classification of images, an image belonging to the "forest" category may also belong to the "scenery" category, and so on.
A natural solution approach for multi-label classification is to generate a new label set that is the power set of the original label set, and then use traditional single label classification techniques. The immediate limitation here is an exponential blow-up of the label set (2^C labels, where C is the number of classes) and the availability of only a small training dataset for each of the generated labels. Another approach is to build one-vs-all binary classifiers, where, for each label, a binary classifier is built. This method results in a smaller number of classifiers than the power set based approach. However, it does not take into account the correlation between the labels.
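To make the exponential blow-up concrete, here is a minimal sketch (ours, purely illustrative) of the label powerset transformation:

```python
from itertools import combinations

def powerset_labels(C):
    """Enumerate all possible label subsets for C classes."""
    subsets = []
    for r in range(C + 1):
        subsets.extend(combinations(range(C), r))
    return subsets

def to_powerset_class(label_vector):
    """Map a binary label vector to a single powerset class (a frozenset)."""
    return frozenset(i for i, v in enumerate(label_vector) if v == 1)

# With only C = 10 classes, the transformed problem already has 2^10 = 1024 classes,
# so most generated classes receive very few training instances.
assert len(powerset_labels(10)) == 2 ** 10
# Documents with identical label subsets collapse to the same powerset class.
assert to_powerset_class([1, 0, 1]) == to_powerset_class([1, 0, 1])
```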
Topic models [5][6][7] have been used extensively in natural language processing tasks to model the process behind generating text documents. The model assumes that the driving force for generating documents arises from "topics" that are latent. The alternate representation of a document in terms of these latent topics has been used in several diverse domains such as images [8], population genetics [9], collaborative filtering [10], disability data [11], sequential data and user profiles [12], etc. Though the motivation for topic models arose in an unsupervised setting, they were gradually found to be useful in the supervised learning setting [13] as well. The topic models available in the literature for the multi-label setting include Wang et al. [14] and Rubin et al. [15]. However, these models either involve too many parameters [15] or learn the parameters by depending heavily on iterative optimization techniques [14], thereby making it hard to adapt them to the scenario where labels are provided by multiple noisy sources such as crowd-workers. Moreover, in all of these models, the topics and hence the words are assumed to be generated depending only on the classes that are present. They do not make use of the information provided by the absence of classes. The absence of a class often provides critical information about the words present. For example, a document labeled "sports" is less likely to have words related to "astronomy". Similarly, in the images domain, an image categorized as "portrait" is less likely to have the characteristics of "scenery". Needless to say, such correlations are dataset dependent. However, a principled analysis must account for such correlations. Motivated by this subtle observation, we introduce a novel topic model for multi-label classification.
In the current era of big data where large amounts of unlabeled data are readily available, obtaining a noiseless source for labels is almost impossible. However, it is possible to get instances labeled by several noisy sources. An emerging example of this case occurs in crowdsourcing, which is the practice of obtaining work by employing a large number of people over the internet. The multi-label classification problem is more interesting in this scenario, where the labels are procured from multiple heterogeneous noisy crowd-workers with unknown qualities. We use the terms crowd-workers, workers, sources and annotators interchangeably in the paper. The problem becomes harder as now the true labels are unknown and the qualities of the annotators must be learnt to train a model. We non-trivially extend our topic model to this scenario.

Contributions
1. We introduce a novel topic model for multi-label classification; our model has the distinctive feature of exploiting any additional information provided by the absence of classes. In addition, the use of topics enables our model to capture correlation between the classes. We refer to our topic model as ML-PA-LDA (Multi-Label Presence-Absence LDA).
2. We enhance our model to account for the scenario where several heterogeneous annotators with unknown qualities provide the labels for the training set. We refer to this enhanced model as ML-PA-LDA-MNS (ML-PA-LDA with Multiple Noisy Sources). A feature of ML-PA-LDA-MNS is that it does not require an annotator to label all classes for a document. Even partial labeling by the annotators, up to the granularity of labels within a document, is adequate.
3. We test the performance of ML-PA-LDA on several real world datasets and establish its superior performance over the state of the art.
4. Furthermore, we study the performance of ML-PA-LDA-MNS, with simulated annotators providing the labels for these datasets. In spite of the noisy labels, ML-PA-LDA-MNS demonstrates excellent performance, and the qualities learnt for the annotators closely approximate their true qualities.
The rest of the paper is organized as follows. In Section 2, we describe relevant approaches in the literature. We propose our topic model for multi-label classification from a single source (ML-PA-LDA) in Section 3 and discuss the parameter estimation procedure for our model using variational expectation maximization (EM) in Section 4. In Section 5, we discuss inference on unseen instances using our model. We adapt our model to account for labels from multiple noisy sources in Section 6. Parameter estimation for this revised model, ML-PA-LDA-MNS, is described in Section 7. In Section 8, we discuss how our model can incorporate smoothing, so that, in the inference phase, new words which were never seen in the training phase can be handled. We discuss our experimental findings in Section 9 and conclude in Section 10.

Related Work
Several approaches have been devised for multi-label classification with labels provided by a single source. The most natural approach is the Label Powerset (LP) method [16], which generates a new class for every combination of labels and then solves the problem using multiclass classification approaches. The main drawback of this approach is the exponential growth in the number of classes: several generated classes have very few labeled instances, leading to overfitting. To overcome this drawback, the RAndom k-labELsets method (RAkEL) [17] was introduced, which constructs an ensemble of LP classifiers, where each classifier is trained with a random subset of k labels. However, the large number of labels still poses challenges. The approach of pairwise comparisons (PW) improves upon the above methods by constructing C(C-1)/2 classifiers, one for every pair of classes, where C is the number of classes. Finally, a ranking of the predictions from each classifier yields the labels for a test instance. Rank-SVM [18] uses the PW approach to construct SVM classifiers for every pair of classes and then performs a ranking. Further details on the above approaches can be found in the survey [19].
The previously described approaches are discriminative. Generative models for multi-label classification capture the correlation between the classes through mixing weights for the classes [20]. Other probabilistic mixture models include the Parametric Mixture Models PMM1 and PMM2 [21]. After the advent of topic models such as Latent Dirichlet Allocation (LDA) [5], extensions have been proposed for multi-label classification, such as Wang et al. [14]. However, in [14], due to the non-conjugacy of the distributions involved, closed form updates cannot be obtained for several parameters, and iterative optimization algorithms such as conjugate gradient and Newton-Raphson are required in the variational E step as well as the M step, introducing additional implementation issues. Adapting this model to the case of multiple noisy sources would result in enormous complexity. The approach used in [22] makes use of Markov Chain Monte Carlo (MCMC) methods for parameter estimation, which is known to be expensive. The topic models proposed for multi-label classification in [15] involve far too many parameters, which can be learnt effectively only in the presence of large amounts of labeled data. For small and medium sized datasets, the approach suffers from overfitting. Moreover, it is not clear how this model can be adapted when labels are procured from crowd-workers with unknown qualities. Supervised Latent Dirichlet Allocation (SLDA) [13] is a single label classification technique that works well on multi-label classification when used with the one-vs-all approach. SLDA inherently captures the correlation between classes through the latent topics.
With crowdsourcing gaining popularity due to the availability of large amounts of unlabeled data and the difficulty in procuring noiseless labels for these datasets, aggregating labels from multiple noisy sources has become an important problem. Raykar et al. [23] look at training binary classification models with labels from a crowd with unknown annotator qualities. Being a model for multiclass classification, this model does not capture the correlation between classes and thereby cannot be used for multi-label classification from the crowd. Mausam et al. [24] look at multi-label classification for taxonomy creation from the crowd. They construct C classifiers by modeling the dependence between the classes explicitly. The graphical model representation involves too many edges, especially when the number of classes is large, and hence the model suffers from overfitting. Deng et al. [25] look at selecting the instance to be given to a set of crowd-workers. However, they do not look at aggregating these labels and developing a model for classification given these labels. In the report [26], Duan et al. look at methods to aggregate a multi-label set provided by crowd-workers. However, they do not look at building a model for classifying new test instances for which the labels are not provided by the crowd. Recently, the topic model SLDA has been adapted to learning from the labels provided by crowd annotators [27]. However, like its predecessor SLDA, it is only applicable to the single label setting and not to multi-label classification.
The existing topic models in the literature such as [28] assume that the presence of a class generates words pertaining to that class and do not take into account the fact that the absence of a class may also play a role in generating words. In practice, the absence of a class may yield information about the occurrence of words. We propose a model for multi-label classification based on latent topics where the presence as well as absence of a class could generate topics. The labels could be procured from multiple sources (e.g., crowd workers) whose qualities are unknown.

Proposed Approach for Multi-Label Classification from a Single Source: ML-PA-LDA
We now explain our model for multi-label classification assuming labels from a single source. (The single source model was not explained in the conference version of this paper; Sections 3 to 5 deal entirely with the single source model.) For ease of exposition, we use notation from the text domain. However, the model itself is general and can be applied to several domains by a suitable transformation of features into words. In our experiments, we have applied the model to domains other than text; we explain the transformation of features to words when we describe our experiments.
Let D be the number of documents in the training set, also known as a corpus. Each document is a set of several words. Let C be the total number of classes in the universe. In multi-label classification, a document may belong to any 'subset' of the C classes, as opposed to the standard classification setting where a document belongs to exactly one class. Let T be the number of latent topics responsible for generating words. The set of all possible words is referred to as a vocabulary. We denote by V the size of the vocabulary V = {ν_1, . . ., ν_V}, where ν_j refers to the jth word in V. Consider a document d comprising N words w = {w_1, w_2, . . ., w_N} from the vocabulary V. Let λ = [λ_1, . . ., λ_C] ∈ {0, 1}^C denote the true class membership of the document. In our notation, we denote by w_nj the value 1[w_n = ν_j], which is the indicator that the word w_n is the jth word of the vocabulary. Similarly, we denote by λ_ij the indicator that λ_i = j, where j = 0 or 1. Our objective is to predict the vector λ for every test document.

Topic Model for the Documents
We introduce a model to capture the correlation between the various classes generating a given document. Our model is based on topic models [5] that were originally introduced for the unsupervised setting. The key idea in topic models is to get a representation for every document in terms of "topics". Topics are hidden concepts that occur throughout the corpus of documents. Every document is said to be composed of some proportion of topics, where the proportion is specific to the document. Each topic is responsible for generating the words in the document and has its own distribution for generating the words. Neither the topic proportions of a document nor the topic-word distributions are known. The number of topics is assumed to be known. The topics can be thought of as capturing the concepts present in a document. For further details on topic models, the reader may refer to [5].
For the case of multi-label classification, we make the observation that the presence as well as absence of a class provides additional information about the topics present in a document. We now describe our generative process for each document, assuming labels are provided by a perfect source.

1. Draw class membership λ_i ∼ Bern(ξ_i) for i = 1, . . ., C.
2. Draw θ_{i,j,·} ∼ Dir(α_{i,j,·}) for i = 1, . . ., C and j ∈ {0, 1}, where α_{i,j,·} are the parameters of a Dirichlet distribution with T parameters; θ_{i,j,·} provides the parameters of a multinomial distribution for generating topics.
3. For every word w in the document:
   (a) Sample a class index u ∼ Unif{1, . . ., C} from one of the C classes.
   (b) Generate a topic z ∼ Mult(θ_{u,λ_u,·}), where θ_{i,j,·} are the parameters of a multinomial distribution in T dimensions.
   (c) Generate the word w ∼ Mult(β_{z,·}), where β_{z,·} are the parameters of a multinomial distribution in V dimensions.
Intuitively, for every class i, its presence or absence (λ_i) is first sampled from a Bernoulli distribution parameterized by ξ_i. The parameter ξ_i is the prior for class i. We capture the correlations across classes through latent topics. The corpus wide distribution Dir(α_{i,j,·}) is the prior for the distribution Mult(θ_{i,j,·}) of topics for class i taking the value j. Then, the latent class u is sampled, which, in turn, along with λ_u, generates the latent topic z. The topic z is then responsible for generating a word. The same process repeats for the generation of every word in the document.
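The generative process above can be sketched in a few lines (a toy simulation with illustrative dimensions and hyperparameter values, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

C, T, V, N = 4, 3, 50, 20                  # classes, topics, vocabulary size, words
xi = np.full(C, 0.3)                       # class priors xi_i (illustrative values)
alpha = np.ones((C, 2, T))                 # Dirichlet priors alpha_{i,j,.}
beta = rng.dirichlet(np.ones(V), size=T)   # topic-word distributions beta_{t,.}

lam = rng.binomial(1, xi)                  # step 1: lambda_i ~ Bern(xi_i)

# Step 2: theta_{i,j,.} ~ Dir(alpha_{i,j,.}) for each class i and j in {0, 1}.
theta = np.array([[rng.dirichlet(alpha[i, j]) for j in (0, 1)] for i in range(C)])

words = []
for _ in range(N):                         # step 3: generate each word
    u = rng.integers(C)                    # (a) sample a class uniformly
    z = rng.choice(T, p=theta[u, lam[u]])  # (b) topic from Mult(theta_{u, lambda_u, .})
    w = rng.choice(V, p=beta[z])           # (c) word from Mult(beta_{z, .})
    words.append(w)
```

Note how step (b) indexes θ by both the sampled class u and its presence bit λ_u, so absent classes shape the topic distribution just as present ones do.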
The generative process for the documents is depicted pictorially in Figure 1a. The parameters of our model consist of π = {α, ξ, β} and must be learnt. During the training phase, the observed variables for each document are d = {w, λ_i} for i = 1, . . ., C. The hidden random variables are Θ = {θ, u, z}. We refer to the above described topic model as ML-PA-LDA (Multi-Label Presence-Absence LDA).

Variational EM for Learning the Parameters of ML-PA-LDA
We now detail the steps for estimating the parameters of our proposed model ML-PA-LDA. Given the observed words w and the label vector λ for a document d, the objective of the model described above is to first obtain p(Θ|d), the posterior distribution over the hidden variables. Here, the challenge lies in the intractable computation of p(Θ|d), which arises due to the intractability in the computation of p(d|π). We use variational inference with mean field assumptions [29] to overcome this challenge.
The underlying idea in variational inference is the following. Suppose q(Θ) is any distribution over the hidden variables Θ = {θ, u, z} that approximates p(Θ|d). We refer to q(Θ) as the variational distribution.
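The idea rests on the standard evidence decomposition (the usual variational identity, stated here for completeness rather than taken from the paper's equations):

```latex
\log p(d \mid \pi)
  = \underbrace{\mathbb{E}_{q}\!\left[\log p(d, \Theta \mid \pi)\right]
    - \mathbb{E}_{q}\!\left[\log q(\Theta)\right]}_{\mathcal{L}}
  + \mathrm{KL}\!\left(q(\Theta) \,\middle\|\, p(\Theta \mid d)\right)
```

Since the KL term is non-negative, maximizing the lower bound L over q is equivalent to minimizing the KL divergence between q(Θ) and the intractable posterior p(Θ|d), and the bound is tight exactly when q(Θ) = p(Θ|d).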

E-Step Updates for ML-PA-LDA
The E-step involves computing the document-specific variational parameters Θ^d = {δ^d, γ^d, φ^d} for every document d, assuming a fixed value for the parameters π = {α, ξ, β}. As a consequence of the mean field assumptions on the variational distributions [29], we get the following update rules for the distributions by maximising L(Θ). From now on, when clear from context, we omit the superscript d. In the computation of the expectation E_{Θ\z}[log p(d, Θ)] above, only the terms in p(d, Θ) that are a function of z need to be considered, as the rest of the terms contribute to the normalizing constant of the density q(z). Hence, not all terms on the right hand side of Equation (2) need to be considered; expectations of only log p(z|u, λ, θ) (Equation (6)) and log p(w|z, β) (Equation (7)) must be taken with respect to u, λ, θ. Observe that Equation (8) follows the structure of the multinomial distribution, as assumed. Similarly, the updates for the other variational parameters follow; again, Equation (10) follows the structure of the multinomial distribution. In all of the above update rules, E[log θ_{i,j,t}] = ψ(γ_{i,j,t}) − ψ(∑_{t'=1}^{T} γ_{i,j,t'}), where ψ(·) is the digamma function.

M-Step Updates for ML-PA-LDA
In the M-step, the parameters ξ, β and α are estimated using the values of φ^d, δ^d, γ^d estimated from the E-step. The function L(Θ) in Equation (1) is maximized with respect to the parameters π, yielding the following update equations.
Updates for ξ: for i = 1, . . ., C. Intuitively, Equation (13) makes sense, as ξ_i is the probability that any document in the corpus belongs to class i. Therefore, ξ_i is the mean of λ^d_i over all documents.

Updates for β: for t = 1, . . ., T; for j = 1, . . ., V. Intuitively, the variational parameter φ^d_{nt} is the probability that the word w^d_n is associated with topic t. Having updated this parameter in the E-step, β_{tj} computes the fraction of times the word j is associated with topic t, by giving a weight φ^d_{nt} to its occurrence in document d.

Updates for α: There are no closed form updates for the α parameters. Hence, we use the Newton-Raphson (NR) method to iteratively obtain the solution.
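The two closed-form updates can be sketched as follows (our vectorized reading of the text; the array shapes are illustrative assumptions, and the actual update equations are Equation (13) and the β update in the paper):

```python
import numpy as np

def m_step_xi(lam):
    """Equation (13): xi_i is the mean of the observed labels lambda^d_i over all
    documents.  lam has shape (D, C)."""
    return lam.mean(axis=0)

def m_step_beta(phi, w_onehot):
    """beta_{tj} is proportional to the phi-weighted count of word j under topic t.
    phi: (D, N, T) topic responsibilities; w_onehot: (D, N, V) word indicators."""
    counts = np.einsum('dnt,dnv->tv', phi, w_onehot)
    return counts / counts.sum(axis=1, keepdims=True)
```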

Inference in ML-PA-LDA
In the previous sections, we introduced our model ML-PA-LDA and described how the parameters of the model π = {α, β, ξ} are learnt from a training dataset D = {(d, λ^d_1, . . ., λ^d_C)} comprising documents d and their corresponding labels. More specifically, we used variational inference for learning π. We now describe how inference can be performed on a test document d. (We provide complete details of the inference phase in this section; this is a new section that was not present in our conference paper.) Here, unlike the previous scenario, the labels λ^d_1, . . ., λ^d_C are unknown and the task is to predict them.
The graphical model depicting this scenario is provided in Figure 2a. The model is similar to the case of training, except that the variable λ is no longer observed and must be estimated. Therefore, the set of hidden variables is now Θ = {λ, z, u, θ}. We use ideas from variational inference to estimate λ. Now, the approximating variational model is given by Figure 2b. Note that λ is not observed in the original model (Figure 2a); therefore, an approximating distribution for λ is required in the variational model (Figure 2b). In this testing phase, the parameters π = {α, ξ, β} are known (from the training phase) and do not need to be estimated. Therefore, only the E-step of variational inference needs to be derived and executed, at the end of which estimates for the variational parameters {∆, γ, φ, δ} are obtained.
We begin by deriving the updates for the parameters of the posterior distribution of the latent variable z for a new document d.
Assume the following independent variational distributions (as per Figure 2b) over each of the variables in Θ for a document d. Note that, in the training phase, λ was observed and there was no variational distribution over λ. However, in the inference phase, λ is not observed and therefore a variational distribution over λ is also required. We are now set to derive the updates for the various distributions q(·). The key rule [29] that arises as a consequence of variational inference with mean field assumptions is given in Equation (16). We now apply Equation (16) to obtain the updates for the parameters of q(·). Similar to the derivations for the training phase, in the computation of the expectation E_{Θ\z}[log p(d, Θ)] in Equation (17), only the terms in p(d, Θ) that are a function of z need to be considered, as the rest of the terms contribute to the normalizing constant of the density q(z). Hence, only expectations of log p(z|u, λ, θ) (Equation (6)) and log p(w|z, β) (Equation (7)) need to be taken with respect to u, λ, θ. Note that Equation (18) is different from Equation (9): in Equation (18), we also have an expectation over the λ terms, since λ is not observed in the inference phase. Similarly, the updates for the other variational parameters follow. In all of the above update rules, E[log θ_{i,j,t}] = ψ(γ_{i,j,t}) − ψ(∑_{t'=1}^{T} γ_{i,j,t'}), where ψ(·) is the digamma function.

Aggregation rule for predicting document labels: Algorithm 1 provides the algorithm for the inference phase of ML-PA-LDA. After execution of Algorithm 1, ∆_i gives a probabilistic estimate corresponding to class i. In order to predict the labels of any document, a suitable threshold (say 0.5) can be applied to the value of ∆_i: if ∆_i > threshold, the estimate λ̂_i is set to 1.
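The aggregation rule amounts to a one-line thresholding step (a sketch of the prediction rule just described):

```python
def predict_labels(Delta, threshold=0.5):
    """Threshold the per-class probabilistic estimates Delta_i produced by the
    inference algorithm to obtain binary label predictions lambda-hat."""
    return [1 if d > threshold else 0 for d in Delta]

# Classes whose estimate exceeds the threshold are predicted as present.
assert predict_labels([0.9, 0.3, 0.51]) == [1, 0, 1]
```

The threshold can be tuned on held-out data if a metric other than accuracy (e.g., F1) is being optimized.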

Proposed Approach for Multi-Label Classification from Multiple Sources: ML-PA-LDA-MNS
So far, we have assumed the scenario where the labels for training instances are provided by a single source. Now, we move to the more realistic scenario where the labels are provided by multiple sources with varying qualities that are unknown to the learner. These sources could even be human workers with unknown and varying noise levels. We adopt the single coin model for the sources, which we now explain.

Single Coin Model for the Annotators
When the true labels of the documents are not observed, λ is unknown. Instead, noisy versions y_1, . . ., y_K of λ, provided by a set of K independent annotators with heterogeneous unknown qualities {ρ_1, . . ., ρ_K}, are observed. Here, y_j is the label vector given by annotator j, and y_ji can be either 0, 1 or −1. y_ji = 1 indicates that, according to annotator j, class i is present, while y_ji = 0 indicates that class i is absent as per annotator j. y_ji = −1 indicates that annotator j has not made a judgement on the presence of class i in the document. This allows for partial labeling up to the granularity of labels within a document. This flexibility in the modeling is essential, especially when the number of classes is large. ρ_j is the probability with which annotator j reports the ground truth corresponding to each of the classes; ρ_j is not known to the learning algorithm. For simplicity, we have assumed the single coin model for annotators and also that the qualities of the annotators are independent of the class under consideration. That is, P(y_j1 = 1|λ_1 = 1) = P(y_j1 = 0|λ_1 = 0) = . . . = P(y_jC = 1|λ_C = 1) = P(y_jC = 0|λ_C = 0) = ρ_j. This is a common assumption in the literature [23].
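A single-coin annotator is straightforward to simulate, which is how noisy labels of this kind can be generated for experiments (a sketch; the abstention probability miss_prob is an illustrative knob for partial labeling, not a parameter of the model):

```python
import numpy as np

def annotate(lam, rho_j, miss_prob, rng):
    """Single-coin annotator j: reports the true lambda_i with probability rho_j,
    flips it with probability 1 - rho_j, and abstains (-1) on each class with
    probability miss_prob (abstention models partial labeling)."""
    lam = np.asarray(lam)
    y = np.where(rng.random(lam.shape) < rho_j, lam, 1 - lam)
    y[rng.random(lam.shape) < miss_prob] = -1
    return y

rng = np.random.default_rng(0)
# A perfect annotator (rho = 1) who never abstains reproduces lambda exactly.
assert (annotate([1, 0, 1, 0], 1.0, 0.0, rng) == [1, 0, 1, 0]).all()
```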
The generative process for the documents is depicted pictorially in Figure 3a. The parameters of our model consist of π = {α, ξ, ρ, β}. The observed variables for each document are d = {w, y_ji} for i = 1, . . ., C, j = 1, . . ., K. The hidden random variables are Θ = {θ, λ, u, z}. We refer to our topic model trained with labels from multiple noisy sources as ML-PA-LDA-MNS (Multi-Label Presence-Absence LDA with Multiple Noisy Sources).

Variational EM for ML-PA-LDA-MNS
We now detail the steps for estimating the parameters of our proposed model ML-PA-LDA-MNS. Given the observed words w and the labels y_1, . . ., y_K for a document d (where y_j is the label vector provided by annotator j), the objective of the model described above is to obtain p(Θ|d). Here, the challenge lies in the intractable computation of p(Θ|d), which arises due to the intractability in the computation of p(d|π), where π = {α, β, ρ, ξ}. As in ML-PA-LDA, we use variational inference with mean field assumptions [29] to overcome this challenge.

E-Step Updates for ML-PA-LDA-MNS
The E-step involves computing the document-specific variational parameters Θ^d = {δ^d, ∆^d, γ^d, φ^d} for every document d, assuming a fixed value for the parameters π = {α, ξ, ρ, β}. As a consequence of the mean field assumptions on the variational distributions, we get the following update rules for the distributions by maximising L(Θ). When it is clear from context, we omit the superscript d. In the computation of the expectation E_{Θ\z}[log p(d, Θ)] in Equation (31), only the terms in p(d, Θ) that are a function of z need to be considered, as the rest of the terms contribute to the normalizing constant of the density q(z). Hence, expectations of log p(z|u, λ, θ) (Equation (28)) and log p(w|z, β) (Equation (29)) must be taken with respect to u, λ, θ. Similarly, the updates for the other variational parameters follow. In all of the above update rules, E[log θ_{i,j,t}] = ψ(γ_{i,j,t}) − ψ(∑_{t'=1}^{T} γ_{i,j,t'}), where ψ(·) is the digamma function. In addition, terms involving y_ji are considered only when y_ji ≠ −1, that is, only when annotator j has actually labeled class i. Observe that, for the single source model ML-PA-LDA, the variational parameter ∆_i is absent, as λ_i is observed.

M-Step Updates for ML-PA-LDA-MNS
In the M-step, the parameters ξ, ρ, β and α are estimated using the values of ∆^d, φ^d, δ^d, γ^d estimated from the E-step. The function L(Θ) in Equation (24) is maximized with respect to the parameters π, yielding the following update equations.
Updates for ξ: for i = 1, . . ., C. Intuitively, Equation (38) makes sense, as ξ_i is the probability that any document in the corpus belongs to class i, and ∆^d_i is the probability that document d belongs to class i, computed in the E-step. Therefore, ξ_i is an average of ∆^d_i over all documents. In ML-PA-LDA, as λ^d_i was observed, the average was taken over λ^d_i instead of ∆^d_i.

Updates for ρ: for j = 1, . . ., K. From Equation (39), we observe that ρ_j is the fraction of times that crowd-worker j has provided a label that is consistent with the probability estimates ∆^d_i, over all classes i. The implicit assumption is that every crowd-worker has provided at least one label; otherwise, such a crowd-worker need not be considered in the model.
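The ρ update just described can be sketched as follows (our vectorized reading of Equation (39), with hypothetical array shapes; Y uses the −1 convention for missing judgements):

```python
import numpy as np

def m_step_rho(Y, Delta):
    """rho_j: expected agreement of annotator j's provided labels with the current
    estimates Delta^d_i, counting only entries where y_ji != -1.
    Y: (D, K, C) labels in {-1, 0, 1}; Delta: (D, C) class probabilities."""
    K = Y.shape[1]
    rho = np.zeros(K)
    for j in range(K):
        provided = Y[:, j, :] != -1
        # Probability of agreeing with the truth: Delta where y = 1, 1 - Delta where y = 0.
        agree = np.where(Y[:, j, :] == 1, Delta, 1.0 - Delta)
        rho[j] = agree[provided].sum() / provided.sum()
    return rho
```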
Updates for β: for t = 1, . . ., T; for j = 1, . . ., V. Intuitively, the variational parameter φ^d_{nt} is the probability that the word w^d_n is associated with topic t. Having updated this parameter in the E-step, β_{tj} computes the fraction of times the word j is associated with topic t, by giving a weight φ^d_{nt} to its occurrence in document d.

Updates for α: There are no closed form updates for the α parameters. Hence, we use the Newton-Raphson (NR) method to iteratively obtain the solution. The M-step updates for β and α in ML-PA-LDA (the single source version) are the same as the updates in ML-PA-LDA-MNS; the parameter ρ is absent in ML-PA-LDA. The overall algorithm for learning the parameters is provided in Algorithm 2.
Algorithm 2 Algorithm for learning the parameters π during the training phase of ML-PA-LDA-MNS.

Inference
The inference in ML-PA-LDA-MNS is identical to the inference in ML-PA-LDA since, for both models, the labels of a new document are unknown in the test phase. In particular, in the inference stage of ML-PA-LDA-MNS, the sources do not provide any labels either.

Smoothing
In the model for the documents described in Section 3, we modeled β as a parameter that governs the multinomial distributions generating the words from each topic. In general, a new document can include words that have not been encountered in any of the training documents. The unsmoothed model described earlier does not handle this issue. In order to handle it, we must "smoothen" the multinomial parameters involved [5]. One way to perform smoothing is to treat β as a random quantity, with each row a probability vector over the vocabulary V drawn from a Dirichlet distribution with parameter η. Again, due to the intractable nature of the computations, we model the variational distribution for β as β ∼ Dir(χ). We estimate the variational parameter χ in the E-step of variational EM using Equation (42), assuming η is known. The model parameter η is estimated in the M-step using the Newton-Raphson method. The steps of the derivation are similar to those for the unsmoothed version.
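In its simplest form, the effect of the Dirichlet prior η on β is that of adding pseudo-counts, as in this illustrative additive-smoothing analogue (not the full variational treatment described above):

```python
import numpy as np

def smoothed_beta(counts, eta):
    """Add pseudo-count eta to the topic-word counts so that words never seen with
    a topic during training still receive nonzero probability at inference time."""
    smoothed = counts + eta
    return smoothed / smoothed.sum(axis=1, keepdims=True)

counts = np.array([[5.0, 0.0, 1.0]])   # word 1 never co-occurred with topic 0
beta = smoothed_beta(counts, 0.1)
assert (beta > 0).all()                # every word now has nonzero probability
```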

Experiments
In order to test the efficacy of the proposed techniques, we evaluate our model on datasets from several domains.

Dataset Descriptions
We have carried out our experiments on several datasets from the text domain as well as the non-text domain. Our code is available on bitbucket [30]. We now describe the datasets and the pre-processing steps below.

Text Datasets
In the text domain, we have performed studies on the Reuters-21578, Bibtex, Enron and Delicious datasets.
Reuters-21578: The Reuters-21578 dataset [31] is a collection of news articles. The original corpus had 10,369 documents and a vocabulary of 29,930 words. We performed stemming using the Porter Stemmer algorithm [32] and also removed stop words. From this set, the words that occurred more than 50 times across the corpus were retained, and only documents containing more than 20 words were retained. Finally, the 10 most commonly occurring labels were retained, namely, acq, crude, earn, fx, grain, interest, money, ship, trade, and wheat. This led to a total of 6547 documents and a vocabulary of size 1996. Of these, a random 80% was used as the training set and the remaining 20% as the test set.
Bibtex: The Bibtex dataset [33] was released as part of the ECML-PKDD 2008 Discovery Challenge. The task is to assign tags such as physics, graph, electrochemistry, etc. to Bibtex entries. There are a total of 4880 entries in the training set and 2515 in the test set. The size of the vocabulary is 1836 and the number of tags is 159.
Enron: The Enron dataset [34] is a collection of emails to which a set of pre-defined categories is to be assigned. There are a total of 1123 training and 573 test instances, with a vocabulary of 1001 words. The total number of email tags is 53.
Delicious: The Delicious dataset [35] is a collection of web pages tagged by users. Since the tags or classes are assigned by users, each web page has multiple tags. The corpus from Mulan [34] had a vocabulary of 500 words and 983 classes. The training set had 14,406 instances and the test set had 4671 instances. From the training set, only documents that contained more than 50 words were retained, and the 20 most commonly occurring classes were kept. The final dataset used had 430 training instances and 108 test instances.

Non-Text Datasets
We also evaluate our model on datasets from domains other than text, where the notion of words is not explicit.
Converting real-valued features to words: Since we assume a bag-of-words model, we must replace every real-valued feature with a 'word' from a 'vocabulary'. We begin by choosing an appropriate vocabulary size V. Thereafter, we collect every real number that occurs across features and instances in the corpus into a set. We then cluster this set into V clusters using the k-means algorithm. Each real-valued feature is then represented by the word corresponding to its nearest cluster center. The corpus is then generated with this new feature representation scheme.
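The quantization step above can be sketched as follows. This is a minimal, illustrative 1-D k-means (Lloyd's algorithm); the function names are ours, not from the paper, and in practice a library implementation such as scikit-learn's KMeans would typically be used.

```python
# Sketch of the feature-to-word quantization described above: pool all real
# feature values, cluster them into V groups with 1-D k-means, and map each
# value to the index of its nearest cluster center (its "word").
import random

def kmeans_1d(values, V, iters=50, seed=0):
    rng = random.Random(seed)
    # Initialize with V distinct values from the pooled set.
    centers = rng.sample(sorted(set(values)), V)
    for _ in range(iters):
        # Assignment step: bucket each value with its nearest center.
        buckets = [[] for _ in range(V)]
        for x in values:
            j = min(range(V), key=lambda c: abs(x - centers[c]))
            buckets[j].append(x)
        # Update step: recompute each center as the mean of its bucket.
        centers = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
    return centers

def to_word(x, centers):
    """Replace a real-valued feature by the index of its nearest center."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))
```

With the cluster centers fixed, every instance's real-valued feature vector is rewritten as a sequence of word indices, after which the corpus can be treated exactly like a text corpus.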
Yeast: The Yeast dataset [18] contains a set of genes that may be associated with several functional classes. There are 1500 training examples and 917 test examples, with a total of 14 classes and 103 real-valued features.

Scene:
The Scene dataset [36] is a dataset of images. The task is to classify images into six categories: beach, sunset, fall, field, mountain, and urban. The dataset contains 1211 training instances and 1196 test instances, with a total of 294 real-valued features.
In our experiments, we use three measures to evaluate our model on the test sets: accuracy across classes, micro-f1 score and average class log-likelihood. Let TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively, pooled over all classes. Then, the overall accuracy is computed as accuracy = (TP + TN)/(TP + TN + FP + FN). The micro-f1 is the harmonic mean of micro-precision and micro-recall, where micro-precision = TP/(TP + FP) and micro-recall = TP/(TP + FN).
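The pooled-count measures above can be computed directly; a minimal sketch (function names are ours):

```python
# Overall accuracy and micro-f1 from counts pooled over all classes.

def overall_accuracy(TP, TN, FP, FN):
    return (TP + TN) / (TP + TN + FP + FN)

def micro_f1(TP, FP, FN):
    precision = TP / (TP + FP)   # micro-precision
    recall = TP / (TP + FN)      # micro-recall
    # Harmonic mean of micro-precision and micro-recall.
    return 2 * precision * recall / (precision + recall)
```

Because the counts are pooled before the ratios are taken, micro-averaged measures weight every prediction equally, so frequent classes dominate; this is the standard behavior of micro-averaging in multi-label evaluation.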
The average class log-likelihood is computed by averaging the per-class log-likelihood over the D test instances, where D test is the number of instances in the test set. Further details on these measures can be found in the survey [37].

Results: ML-PA-LDA (with a Single Source)
We run our model first assuming labels from a perfect source.
In Table 1, we compare the performance of our non-annotator model against other methods such as RAKel, Monte Carlo Classifier Chains (MCC) [38], Binary Relevance Method - Random Subspace (BRq) [39], Bayesian Chain Classifiers (BCC) [40] and SLDA. BCC [40] is a probabilistic method that constructs a chain of classifiers by modeling the dependencies between the classes using a Bayesian network. MCC instead uses a Monte Carlo strategy to learn the dependencies. BRq improves upon binary relevance methods of combining classifiers by constructing an ensemble. As mentioned earlier, RAKel draws subsets of the classes, each of size k, and constructs ensemble classifiers. The implementations of RAKel, MCC, BRq and BCC provided by Meka [41] were used. For SLDA, the code provided by the authors of [13] was used. Since SLDA does not explicitly handle multi-label datasets, we built C SLDA classifiers (where C is the number of classes) and used them in a one-vs-rest fashion.
On the Reuters, Bibtex and Enron datasets, ML-PA-LDA (without the annotators) performs significantly better than SLDA. On the Scene and Yeast datasets, ML-PA-LDA and SLDA give the same performance; it is to be noted that these datasets, known to be hard, are from the image and biology domains, respectively. As can be seen from Table 1, our model ML-PA-LDA gives a better overall performance than SLDA and also does not require training C binary classifiers. This advantage is significant, especially on datasets such as Bibtex where the number of classes is 159. We also studied the performance of our algorithm as a function of the size of the dataset used for training as well as the number of topics used. The results of our model are shown in Figure 4c,f,i. An increase in the size of the dataset improves the performance of our model with respect to all of the measures in use. Similarly, an increase in the number of topics generally improves the measures under consideration. Note that an increase in the number of topics corresponds to an increased model complexity (and also increased running time). A striking observation is the low accuracy, log-likelihood and micro-f1 scores associated with the model when the number of topics is 80 (eight times the number of classes) and the size of the dataset is low (S = 25%). This is expected, as the number of parameters to be estimated is too large (the model complexity is high) to be learned from very few training examples. However, as more training data becomes available, the model achieves enhanced performance. This observation is consistent with Occam's razor [42].

Statistical Significance Tests:
From Table 1, since the performances of SLDA and ML-PA-LDA are similar on some of the datasets such as Enron and Bibtex, we performed tests to check for statistical significance.
Binomial Tests: We first performed binomial tests [43] for statistical significance. The null hypothesis was fixed as "Average accuracy of SLDA is the same as that of ML-PA-LDA" and the alternate hypothesis as "Average accuracy of ML-PA-LDA is better than that of SLDA". We executed our method as well as SLDA with 10 random initializations each on the datasets Reuters, Enron, Bibtex and Delicious. In order to reject the null hypothesis in favour of the alternate hypothesis at level α = 0.05, SLDA should have been better than ML-PA-LDA in either 0, 1 or 2 runs. However, during this experiment, it was observed that the accuracy of ML-PA-LDA was better than that of SLDA in every run for all the datasets. Therefore, for each of these datasets, the null hypothesis was rejected with a p-value of 0.0009.
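The reported p-value can be reproduced from the binomial tail probability: under the null hypothesis, each of the 10 runs is a fair coin flip, so observing 10 wins out of 10 has tail probability 0.5^10 ≈ 0.00098, consistent with the reported 0.0009. A minimal sketch (function name is ours; in practice a library routine such as scipy.stats.binomtest would be used):

```python
# One-sided binomial test: probability of observing at least k successes
# out of n trials when each success has probability p under the null.
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 10 wins out of 10 runs under a fair-coin null: tail probability 0.5**10.
p_value = binom_tail(10, 10)
```

The null hypothesis is rejected whenever this tail probability falls below the chosen level α.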
t-Tests: We also performed the unpaired two-sample t-test [44] with the null hypothesis "Mean accuracy of SLDA is the same as the mean accuracy of ML-PA-LDA (µ_SLDA = µ_ML-PA-LDA)" and the alternate hypothesis "Mean accuracy of ML-PA-LDA is higher than that of SLDA (µ_SLDA < µ_ML-PA-LDA)". The null hypothesis was rejected at level α = 0.05 for all of the datasets: Reuters, Delicious, Bibtex and Enron. For Reuters, for instance, the null hypothesis was to be rejected if µ_SLDA − µ_ML-PA-LDA < −0.0073, where µ_M denotes the empirically obtained mean accuracy for method M. The actual difference obtained was µ_SLDA − µ_ML-PA-LDA = −0.048 < −0.0073, and the null hypothesis was rejected with a p-value of 8.47 × 10^−8 for the Reuters dataset. For the Delicious dataset, the rule for rejecting the null hypothesis was µ_SLDA − µ_ML-PA-LDA < −0.001, while the actual difference obtained was µ_SLDA − µ_ML-PA-LDA = −0.003 < −0.001.
The above tests show that our results are statistically significant.

Results: ML-PA-LDA-MNS (with Multiple Noisy Sources)
To verify the performance of the annotator model ML-PA-LDA-MNS, where the labels are provided by multiple noisy annotators, we simulated 50 annotators with varying qualities. The ρ-values of the annotators were sampled from uniform distributions: for 10 of the annotators, ρ was sampled from U[0.51, 0.65]; for another 20, from U[0.66, 0.85]; and, for the remaining 20, from U[0.86, 0.9999]. This captures the heterogeneity in the annotator qualities. For each document in the training set, a random 10% (= 5) of the annotators were picked for generating the noisy labels.
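The simulation above can be sketched as follows. This is an illustrative reading of the setup, with function names of our own: each annotator's quality ρ is drawn from one of the three uniform bands, and a selected annotator reports each true binary label correctly with probability ρ.

```python
# Sketch of the annotator simulation: 50 annotators in three quality bands;
# a noisy label agrees with the true label with probability rho.
import random

rng = random.Random(0)

def simulate_qualities():
    """Sample the rho-values of the 50 simulated annotators."""
    return ([rng.uniform(0.51, 0.65) for _ in range(10)] +
            [rng.uniform(0.66, 0.85) for _ in range(20)] +
            [rng.uniform(0.86, 0.9999) for _ in range(20)])

def noisy_labels(true_labels, rho):
    """Flip each true 0/1 label independently with probability 1 - rho."""
    return [y if rng.random() < rho else 1 - y for y in true_labels]
```

For each training document, a random subset of 5 of the 50 annotators would then be drawn and each would produce its own noisy label vector via `noisy_labels`.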
In Table 1, we report the performance of ML-PA-LDA-MNS. We find that the performance of ML-PA-LDA-MNS is close to that of ML-PA-LDA and often better than or at par with SLDA (from Table 1), in spite of having access to only noisy labels. On the Scene and Yeast datasets, ML-PA-LDA-MNS, ML-PA-LDA and SLDA give the same performance. In Table 2, we compare the performance of ML-PA-LDA-MNS and ML-PA-LDA on the Reuters dataset under varying amounts of training data. With more training data, both models perform better. We also report the annotator root mean square error, or "Ann RMSE", between the predicted and true annotator qualities: Ann RMSE = √(∑_{j=1}^{K} (ρ̂_j − ρ_j)² / K), where ρ̂_j is the quality of annotator j as predicted by our variational EM algorithm and ρ_j is the true annotator quality, which is unknown during training. We find that Ann RMSE decreases as more training data becomes available, showing the efficacy of our model in learning the qualities of the annotators. Similar to the experiment carried out on ML-PA-LDA, we vary the number of topics as well as the dataset size and compute all of the measures used. The plots are shown in Figure 4 (first two columns) and help in understanding how T, the number of topics, must be tuned depending on the size of the available training set. As in ML-PA-LDA, an increase in the number of topics as well as the dataset size improves the performance of ML-PA-LDA-MNS in general. Therefore, as more training data becomes available, having a larger number of topics helps.
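The Ann RMSE measure defined above is a straightforward computation; a minimal sketch (function name is ours):

```python
# Root mean square error between predicted and true annotator qualities:
# Ann RMSE = sqrt( sum_j (rho_hat_j - rho_j)^2 / K ).
from math import sqrt

def ann_rmse(rho_hat, rho_true):
    K = len(rho_true)
    return sqrt(sum((rh - rt) ** 2
                    for rh, rt in zip(rho_hat, rho_true)) / K)
```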

Adversarial Annotators
We also tested the robustness of our model against labels from adversarial or malicious annotators, who occur in many scenarios such as crowdsourcing. An adversarial annotator is characterized by a quality parameter ρ < 0.5. As in the previous case, we simulated 50 annotators. The ρ-values of 10 of them were sampled from U[0.0001, 0.1]; for another 15 annotators, ρ was sampled from U[0.51, 0.65]; for another 20, from U[0.66, 0.85]; and, for the remaining five, from U[0.86, 0.9999]. The proportion of malicious annotators was chosen as per the literature [23]. On the Reuters dataset, we obtained an average accuracy of 0.955, an average class log-likelihood of −0.193, an average micro-f1 of 0.793 and an average Ann RMSE of 0.002 over five runs with 40 topics. This shows that, even in the presence of malicious annotators, our model remains unaffected and performs well.

Conclusions
We have introduced a new approach for multi-label classification using a novel topic model, which uses information about the presence as well as the absence of classes. For the scenario in which the true labels are not available and instead a noisy version of the labels is provided by multiple annotators, we have augmented the model to ML-PA-LDA-MNS, which learns the qualities of the annotators along with the model parameters.

Figure 1 .
Figure 1. Our model ML-PA-LDA (training phase). (a) Graphical model for ML-PA-LDA; (b) graphical model representation of the variational distribution used to approximate the posterior.

Figure 2 .
Figure 2. Graphical model representations of the inference phase of ML-PA-LDA. D test is the number of new documents. (a) Graphical model for ML-PA-LDA (test phase); (b) graphical model representation of the variational distribution used to approximate the posterior.

Figure 3 .
Figure 3. Graphical model representations of our model ML-PA-LDA-MNS, where labels are provided by multiple sources. (a) Graphical model for ML-PA-LDA-MNS; (b) graphical model representation of the variational distribution used to approximate the posterior.

Figure 4 .
Figure 4. Performance of ML-PA-LDA and ML-PA-LDA-MNS on the Reuters dataset. T is the number of topics, C is the number of classes and S is the percentage of the dataset used for training. The graphs show the trend in the various measures as a function of the number of examples in the training set as well as the number of topics. Other datasets follow a similar trend. The last column (Figure 4c,f,i) shows the results for the single-source version (ML-PA-LDA), whereas all other plots study the performance of the multiple-sources version (ML-PA-LDA-MNS).

Table 1 .
Comparison of average accuracy of various multi-label classification techniques. The following abbreviations are used: RAKel: Random k-label sets; MCC: Monte Carlo Classifier Chains; BRq: Binary Relevance Method - Random Subspace; BCC: Bayesian Chain Classifiers; SLDA: Supervised Latent Dirichlet Allocation.

Table 2 .
Performance of ML-PA-LDA and ML-PA-LDA-MNS for different sizes of training sets, with the number of topics fixed at 20. Results are shown for the Reuters dataset. A similar trend is exhibited by other datasets (omitted for space).