Iterative Learning for K-Approval Votes in Crowdsourcing Systems

Abstract: Crowdsourcing systems have emerged as cornerstones for collecting large amounts of qualified data for various human-powered problems at a relatively low budget. In eliciting the wisdom of crowds, many web-based crowdsourcing platforms encourage workers to select the top-K alternatives rather than just one choice, a rule called "K-approval voting". This setting has the advantage of inducing workers to make fewer mistakes when they respond to target tasks. However, there is little work on inferring the correct answer from crowdsourced data collected via K-approval voting. In this paper, we propose a novel and efficient iterative algorithm to infer correct answers under K-approval voting, which can be directly applied to real-world crowdsourcing systems. We analyze the average performance of our algorithm and prove a theoretical error bound that decays exponentially in terms of the quality of workers and the number of queries. Through extensive experiments, including a mixed case with various types of tasks, we show that our algorithm outperforms Expectation-Maximization (EM) and existing baseline algorithms.


Introduction
As the need for large-scale labeled data grows in various fields, crowdsourcing has become an attractive paradigm for human-powered problem-solving systems. Web-based crowdsourcing platforms such as Amazon Mechanical Turk and Zooniverse are now in widespread use for amassing enormous numbers of responses from large crowds in a short time at a relatively low budget [1-3]. For example, ImageNet, a large-scale image database, was a successful project that exploited the idea of crowdsourcing to label 3.2 million images hierarchically [4].
In eliciting the wisdom of crowds, many crowdsourcing platforms encourage workers to select the top-K alternatives that they believe to be correct. This voting rule is called "K-approval voting", and its interface gives workers more flexibility to respond and even takes advantage of their partial expertise [21,22]. Due to these merits, many crowdsourcers have adopted the K-approval voting setup to collect large numbers of responses. The two tasks described in Figures 1 and 2 are real-world examples of K-approval voting. Figure 1 shows a task distributed on Amazon Mechanical Turk; the task was requested by Amazon, one of the well-known online shopping platforms, and its goal was to classify the item in the picture into the best category among the given alternatives. Figure 2 shows another real task, named "Wisconsin Wildlife Watch", on Zooniverse, another popular crowdsourcing system; its goal is to correctly identify which Wisconsin wildlife animals were pictured. Although demand for K-approval voting is increasing, there is little work on inferring the correct answer from data collected via K-approval voting. One natural method to aggregate responses is majority voting. However, it is insufficient to exploit the wisdom of crowds appropriately, since it assumes that the reliability levels of all workers' responses are the same. In fact, it was recently demonstrated that majority voting is sub-optimal for any K-approval voting system [22].
In this paper, we design a novel algorithm for K-approval voting systems that evaluates workers' reliability to infer the correct answers to the tasks more precisely. This work is generally applicable in practice to real crowdsourcing platforms where tasks are multiple-choice questions with D alternatives that allow workers to select the top-K of them. Moreover, our algorithm can be applied to the case in which each problem has a different (D, K) value.
One of the main contributions of this paper is the performance guarantee of our algorithm. We rigorously prove that the error bound of our algorithm decays exponentially. An interesting aspect of the error bound is its dependency on the negative entropy of workers, from an information-theoretic perspective. Additionally, we verify the performance of our algorithm through numerical experiments on various cases, including a realistic case containing mixed tasks with various numbers of alternatives D and selections K. Moreover, through experiments, we show that our algorithm properly estimates the relative reliabilities of the workers.
The paper is organized as follows: In Section 3, we define the problem setup, and in Section 4, we describe our algorithm for inferring the correct answers from K-approval votes. Then, we provide performance guarantees for our algorithm in Section 5. In Section 6, we present comparative results from numerical experiments, and we draw conclusions in Section 7.

Related Work
Recently, there have been several studies of K-approval votes in the crowdsourcing field. These works aim to elicit the mode of a worker's belief and propose a new paradigm for amassing high-quality responses. Specifically, in [21], the authors endeavored to obtain higher-quality labels with a new incentive mechanism that rewards good workers with additional payments. In addition, [22] proved that simple majority voting is sub-optimal and raised the need for a probabilistic inference algorithm that can be applied to K-approval voting.
In single-choice voting, there have been various approaches to obtaining reliable results from unreliable responses. The simplest one is majority voting. However, it is insufficient for obtaining reliable results, since it regards the expertise of each worker as equal and gives the same weight to every worker. Typically, the levels of expertise differ widely across experts, novices, free-money collectors, and even adversarial workers [5]. In order to exploit differences of expertise among workers, Expectation-Maximization (EM) algorithms have been suggested with latent variables and unknown model parameters [9-13,23,24].
Alternative approaches to single voting on binary questions were suggested in [15] in the context of spectral methods that use low-rank matrix approximations, and the authors of [14,15,25] proposed novel iterative learning algorithms. While these works did not assume any prior knowledge, Liu et al. [16] showed that choosing a suitable prior can improve performance via a Bayesian approach. Lately, the authors of [18,26] proposed iterative learning algorithms for single voting on multiple-choice questions and real-valued vector regressions, respectively. However, their algorithms cannot be applied to K-approval voting.
In recent years, there have been studies that built models combining human and computer vision. The authors of [27] proposed an online collaborative crowdsourcing system that integrates crowdsourcing platforms and computer vision machines with a Bayesian predictive model. The authors of [28,29] developed combined systems using confidence modeling and applied these systems to various applications in binary and multiclass settings. For efficient matching between tasks and workers, the authors of [30] conducted a bibliometric analysis of task recommendations in crowdsourcing systems (TRCS), which help workers find their preferred tasks according to their abilities. Additionally, for more general task assignment, the authors of [31] proposed a new parallel allocation of delay-tolerant tasks in practical crowdsourcing systems. As an example of a crowdsourced dataset, the authors of [32] conducted sentiment analysis experiments on a product reviews dataset [33] in the Persian language.
There are also several approaches to crowdsourcing systems via evolutionary algorithms. The authors of [34] modeled trustworthy worker selection as a multi-objective combinatorial optimization problem and solved it using evolutionary algorithms. Similarly, the authors of [35] developed a worker selection method with multiple objectives, including bug detection and cost minimization. Recently, the authors of [36] suggested a genetic algorithm that uses video games as a way to gather novel solutions to optimization problems, and the authors of [37] formulated expertise estimation as a meta-heuristic harmony search optimization problem.

Problem Definition
In this section, we define the setup of the problem. Suppose a crowdsourcing system in which there are m tasks to be solved and n participating workers. For convenience, tasks and workers are indexed by i ∈ [m] and j ∈ [n], respectively, where [m] denotes the set of integers from 1 to m, i.e., {1, ..., m}.
As shown in Figure 3, each task is a multiple-choice question and consists of D_i alternatives (or options) with only one correct answer (the correct alternative). Each task is replicated l times to add redundancy for boosting performance, which is a common strategy in crowdsourcing systems. Therefore, in total, ml tasks are distributed. Since we use a random and equal assignment strategy to distribute tasks, each worker solves r tasks such that ml = nr. If we draw this scheme as a graph, a random (l, r)-regular bipartite graph can be generated by the pairing model, with m task nodes, n worker nodes, and ml (= nr) edges representing tasks assigned to workers.
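As an illustration, the pairing model can be sketched in a few lines: give each task l half-edges and each worker r half-edges, then match the two sides with a random permutation (function and variable names here are ours, not the paper's):

```python
import random
from collections import Counter

def assign_tasks(m, n, l, r, seed=0):
    """Pair m*l task half-edges with n*r worker half-edges uniformly at random."""
    assert m * l == n * r, "every half-edge must be matched"
    rng = random.Random(seed)
    task_stubs = [i for i in range(m) for _ in range(l)]    # l half-edges per task
    worker_stubs = [j for j in range(n) for _ in range(r)]  # r half-edges per worker
    rng.shuffle(worker_stubs)
    # Each matched pair (i, j) is an edge: worker j is assigned task i.
    return list(zip(task_stubs, worker_stubs))

edges = assign_tasks(m=6, n=4, l=2, r=3)
task_deg = Counter(i for i, _ in edges)
worker_deg = Counter(j for _, j in edges)
```

Note that the pairing model can occasionally create multi-edges (the same worker matched to the same task twice); for large sparse graphs this is rare and is usually ignored in the analysis.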
Each task allows a K_i-approval vote when workers make their responses. Here, we need to consider how workers formulate their responses when solving tasks. As a simple probabilistic approach, we assume that when worker j solves task i, the response includes the correct answer with probability p_ij ∈ [0, 1]. This probability depends on the type of the given task, including D_i and K_i, and on how reliable the worker is; we discuss this dependency in Section 3.2. We also assume that the distractors (the set of alternatives other than the correct answer) are independent of each other and equally difficult. Thus, in the K-approval voting system, a response includes the correct answer with probability p_ij, and the remaining alternatives in the response are selected uniformly at random from the distractors.
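Concretely, a single response on edge (i, j) can be simulated directly under exactly these assumptions (with probability p_ij, the correct answer plus K−1 random distractors; otherwise, K random distractors):

```python
import random

def make_response(correct, D, K, p, rng):
    """Return a 0/1 vector of length D with exactly K ones (one K-approval vote)."""
    distractors = [a for a in range(D) if a != correct]
    if rng.random() < p:                      # vote includes the correct answer
        picks = [correct] + rng.sample(distractors, K - 1)
    else:                                     # all K picks are distractors
        picks = rng.sample(distractors, K)
    vote = [0] * D
    for a in picks:
        vote[a] = 1
    return vote

rng = random.Random(1)
votes = [make_response(correct=0, D=5, K=2, p=0.8, rng=rng) for _ in range(2000)]
hit_rate = sum(v[0] for v in votes) / len(votes)   # close to p = 0.8
```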
The goal of the problem is to infer the correct answers of the tasks from the workers' responses. The error performance can be measured by the fraction of incorrectly inferred tasks, P_error = (1/m) Σ_{i∈[m]} I(t_i ≠ t̂_i), where t_i and t̂_i are the correct and inferred answers of task i, respectively, and I is an indicator function.

Worker Model for Various (D, K)
In our model, responses are generated by a simple probabilistic approach, so reliability should be calculated for every task and worker. To see why, consider two easy examples. Suppose that there is a worker who only gives random responses, and two tasks are given to the worker whose (D, K) pairs are (2, 1) and (3, 1). Then we can easily calculate the probability of obtaining a response that includes the correct answer. Since the worker picks a random choice, the reliabilities for the two tasks should be 0.5 and 0.33, respectively. As another example, suppose that the same worker solves two other tasks whose (D, K) pairs are (3, 1) and (3, 2). Then, the reliabilities for the two tasks should be 0.33 and 0.67, respectively. Therefore, we conclude that reliability depends on D_i and K_i. Here we assume that each worker has an intrinsic value q_j, called the quality of the worker, which represents the inherent diligence or expertise of the worker. This means that the quality of a worker is fixed regardless of the types of tasks given. From an information-theoretic perspective, the quality of a worker can be defined as negative entropy with an offset, which sets the quality to zero when the worker gives random responses. This is reasonable, since no information can be obtained from random responses. We define the quality of workers precisely, along with the relationship between p_ij and q_j, in Section 5.1, and skip the mathematical analysis here in order to present our algorithm first.
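These example values are simply K/D: a uniformly random K-subset of D alternatives contains the correct one with probability C(D−1, K−1)/C(D, K) = K/D, which is easy to check numerically:

```python
from math import comb

def random_guess_reliability(D, K):
    # Probability that a uniformly random K-subset contains the correct alternative.
    return comb(D - 1, K - 1) / comb(D, K)

assert random_guess_reliability(2, 1) == 0.5
assert round(random_guess_reliability(3, 1), 2) == 0.33
assert round(random_guess_reliability(3, 2), 2) == 0.67
# The closed form K/D holds for every valid (D, K) pair.
assert all(random_guess_reliability(D, K) == K / D
           for D in range(2, 10) for K in range(1, D))
```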

Algorithms
In this section, we explain the design of our algorithm and the process used to estimate the correct answers. To avoid confusion in notation, we use uppercase K to represent the number of selections and lowercase k to represent the iteration number of our algorithm. In our setting, workers to which task i is assigned are allowed to vote for K i alternatives.
For each edge (i, j), the response is denoted by a binary vector A_ij ∈ {0, 1}^{D_i} with exactly K_i nonzero components, indicating the K_i alternatives selected by worker j for task i. To extract the correct answers from the unreliable responses of workers, we propose an iterative algorithm for K-approval voting systems. Our algorithm takes advantage of two types of messages between a task node and a worker node. A task message, x_{i→j}, is a D_i-dimensional vector; each component of this vector corresponds to a likelihood, i.e., the possibility of that alternative being the correct answer to task i. A worker message, y_{j→i}, represents how reliable worker j is. Since these worker messages are strongly correlated with the reliability p_ij, our algorithm can assess relative reliability; we empirically verify the correlation between {y_{j→i}} and {p_ij} in Section 6. The initial messages of our iterative algorithm are sampled independently from the Gaussian distribution with unit mean and variance, i.e., y_{j→i} ∼ N(1, 1). Unlike EM-based algorithms [10,11], our approach is not sensitive to initial conditions as long as the consensus of the group of workers is positively biased. Now, we define the adjacent set of task i as ∂i and, similarly, the adjacent set of worker j as ∂j. Then, at the k-th iteration, both messages are updated using the following rules:

x^(k)_{i→j} = Σ_{j′∈∂i\j} A_{ij′} y^(k−1)_{j′→i},   (1)

y^(k)_{j→i} = Σ_{i′∈∂j\i} (A_{i′j} − (K_{i′}/D_{i′}) 1)ᵀ x^(k)_{i′→j}.   (2)

In the task message update (1), our algorithm weights each response according to the relative reliability of the worker. In the worker message update (2), it gives greater relative reliability to a worker who strongly follows the consensus of the other workers. Figure 4 depicts the two vectors in the message vector space: (A_ij − (K_i/D_i) 1) represents the difference between the response of worker j for task i and the expectation of a random response, and x^(k)_{i→j} is the weighted sum of the responses of the other workers who solved task i.
Thus, the inner product of these two vectors in (2) assesses the similarity between the response of worker j for task i and the sum of the responses of the other workers who solved task i. A large positive similarity value means that worker j is more reliable, whereas a negative value means that worker j does not follow the consensus of the other workers, and our algorithm regards worker j as unreliable. Specifically, when x^(k)_{i→j} and (A_ij − (K_i/D_i) 1) are orthogonal for a fixed task i, their inner product is close to zero, meaning that x^(k)_{i→j} does not contribute to the message of worker j. Then, y^(k)_{j→i} is defined as the sum of inner products over every task message except that of task i, representing the relative reliability of worker j. Returning to (1), x^(k)_{i→j} is determined by the weighted voting of the workers who solved task i, except for worker j: each worker j′ contributes its response A_{ij′} with weight y^(k−1)_{j′→i}, and the resulting sum represents the estimated likelihood of the correct answer for task i.
The pseudo-code of our algorithm is given in Algorithm 1. In practice, a dozen iterations are sufficient for convergence. After k_max iterations, our algorithm forms the final estimate vector x_i for task i, each component of which represents the possibility of that alternative being the correct answer. Our algorithm infers the correct answer by choosing the index u_i with the maximum value among the final likelihoods in x_i, and outputs the estimate of the correct answer as the unit vector e_{u_i}.
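To make the procedure concrete, here is a compact sketch of the whole scheme in Python, under our reading of the update rules above (adjacency structures and names are ours; this is an illustrative implementation, not the paper's reference code):

```python
import random
from collections import defaultdict

def k_approval_iter(edges, A, D, K, k_max=12, seed=0):
    """edges: list of (task i, worker j); A[(i, j)]: 0/1 vote vector of length D."""
    rng = random.Random(seed)
    adj_i, adj_j = defaultdict(list), defaultdict(list)
    for i, j in edges:
        adj_i[i].append(j)      # workers assigned to task i
        adj_j[j].append(i)      # tasks assigned to worker j
    y = {e: rng.gauss(1, 1) for e in edges}   # worker messages y_{j->i} ~ N(1, 1)
    x = {e: [0.0] * D for e in edges}         # task messages x_{i->j}
    for _ in range(k_max):
        for i, j in edges:      # (1) task update: weighted vote of the other workers
            x[(i, j)] = [sum(A[(i, j2)][d] * y[(i, j2)]
                             for j2 in adj_i[i] if j2 != j) for d in range(D)]
        for i, j in edges:      # (2) worker update: agreement with the consensus
            y[(i, j)] = sum(sum((A[(i2, j)][d] - K / D) * x[(i2, j)][d]
                                for d in range(D))
                            for i2 in adj_j[j] if i2 != i)
    # Final estimate: argmax of the aggregated likelihood vector of each task.
    return {i: max(range(D), key=lambda d: sum(A[(i, j)][d] * y[(i, j)]
                                               for j in adj_i[i]))
            for i in adj_i}
```

After convergence, the sign and magnitude of y_{j→i} act as per-worker weights, so the final vote can down-weight or even invert unreliable workers.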

Analysis of Algorithms
In this section, we verify the error performance of Algorithm 1. In Theorem 1, we show that the error bound depends on the task degree l and the quality of workers q. Furthermore, we show that the upper bound on the probability of error decays exponentially as the quality of workers increases. Here, we assume that D_i = D and K_i = K for all tasks i ∈ [m], and accordingly p_ij = p_j. However, we will show the performance of our algorithm in general scenarios in Section 6.

Quality of Workers
We can assume several worker models that reflect how workers behave when they receive tasks to solve. Common methods for generating their responses are simple probabilistic approaches. The basic assumption is that a worker has latent variables that influence how the worker's responses are generated. Latent variables usually represent the diligence or expertise of workers, or the level of difficulty of the given tasks. A recent model [22] assumes that workers' responses are corrupted versions of the true answer under a noise model.
Hereafter, we use U_D to denote the set of standard D-dimensional unit vectors. For a fixed (D, K), we associate task i with an unobserved "correct" solution s_i ∈ U_D. A_ij denotes the response vector of worker j and can be represented as the sum of K vectors in U_D (the K-approval setting), so each component of A_ij is binary (1 or 0). Worker j with reliability p_j has C(D, K) choices for a response, and those choices are divided into only two types: each of the C(D−1, K−1) responses containing the correct answer occurs with probability p_j / C(D−1, K−1), and each of the C(D−1, K) responses not containing it occurs with probability (1 − p_j) / C(D−1, K). From an information-theoretic perspective, the quality of a worker can be defined as negative entropy with an offset; the offset sets the quality of a worker with random responses to zero. Using the probabilities above, we can express the negative entropy and define the quality of worker j with reliability p_j as:

q_j = p_j log (p_j / C(D−1, K−1)) + (1 − p_j) log ((1 − p_j) / C(D−1, K)) + log C(D, K).   (4)

According to quality, we can divide workers into three types. "Reliable workers" are workers with a quality close to 1, who give mostly correct answers. At one extreme, workers with a quality close to 0 give arbitrary responses; we call them "non-informative workers". At the other extreme, there are workers who give wrong answers on purpose and harm the crowdsourcing system; they can be regarded as "malicious workers". In our algorithm, since the worker message value y_j is related to the quality, workers with positive y_j, negative y_j, and y_j close to zero correspond to reliable, malicious, and non-informative workers, respectively.
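Under the two-type distribution above, the quality can be computed directly. The snippet below follows our reconstruction of Equation (4) (natural logarithm; the paper's base or normalization may differ) and is zero exactly at the random-guess reliability p = K/D:

```python
from math import comb, log

def quality(p, D, K):
    """Negative entropy of the response distribution, offset so quality(K/D) = 0."""
    n_hit, n_miss = comb(D - 1, K - 1), comb(D - 1, K)  # responses with/without truth
    q = log(comb(D, K))                # offset: entropy of a uniformly random vote
    if p > 0:
        q += p * log(p / n_hit)
    if p < 1:
        q += (1 - p) * log((1 - p) / n_miss)
    return q
```

Under this reconstruction, quality(K/D, D, K) is 0 (a non-informative worker), and the quality grows as p moves away from K/D in either direction, so even a deliberately wrong worker carries recoverable information.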

Algorithm 1 K-approval Iterative Algorithm
Although the quality of workers theoretically follows the negative entropy, we found that a fourth-order polynomial approximation is sufficient for our analysis, as described in Figure 5. As the number of alternatives D increases, the approximation deviates from the true quality. Nevertheless, the fourth-order approximation fits the true quality well for the moderate values of D that our algorithm generally targets.
For simplicity, we will use this approximated quality in the following sections. One more assumption about the worker distribution is necessary: workers give correct answers on average, rather than random or adversarial ones, so that E[p_j] > K/D. Given only workers' responses, any inference algorithm deduces the correct answers from the general or popular choices of the crowd. Consider an extreme case in which everyone gives adversarial answers to a binary classification task; no algorithm can then correctly infer the reliability of the crowd. Hence, the assumption E[p_j] > K/D is unavoidable.
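The quartic fit itself is easy to reproduce with a least-squares polynomial fit; here we fit our reconstructed entropy-based quality (so the coefficients are illustrative, not the paper's):

```python
import numpy as np
from math import comb

def quality(p, D, K):
    # Reconstructed entropy-based quality (see Section 5.1), vectorized over p.
    n_hit, n_miss = comb(D - 1, K - 1), comb(D - 1, K)
    return (np.log(comb(D, K)) + p * np.log(p / n_hit)
            + (1 - p) * np.log((1 - p) / n_miss))

D, K = 4, 1
p = np.linspace(0.05, 0.95, 200)
coeffs = np.polyfit(p, quality(p, D, K), deg=4)     # fourth-order polynomial fit
max_err = np.max(np.abs(np.polyval(coeffs, p) - quality(p, D, K)))
```

For moderate D, the maximum fit error over this range stays small, which is the sense in which the quartic approximation "fits well" in Figure 5.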

Bound on the Average Error Probability
In this section, we show the error bound of our algorithm. Theorem 1 claims that the probability of error decays exponentially as the quality of workers q or the task degree l increases.
From now on, let l̂ ≡ l − 1 and r̂ ≡ r − 1, and we use the quality q as defined in (4). Additionally, σ²_k denotes an effective variance in the sub-Gaussian tail of the task message distribution after k iterations.
where T is the quantity defined in (5). The second term of the bound is the upper bound on the probability that the graph does not have a locally tree-like structure, and it is quite small as long as the number of tasks is large. Therefore, the dominant factor of the upper bound is the first, exponential term. As shown in (5), T > 1 is the crucial condition, and it can be satisfied by using a sufficiently large l or r. With T > 1, σ²_k converges to a finite limit σ²_∞, and we obtain (6); the bound in the first term of (6) depends neither on the number of tasks m nor on the number of iterations k.

Proof of the Theorem 1
The proof is roughly composed of three parts. First, the second term on the right-hand side of (6) is bounded using the locally tree-like property of the graph. Second, the remaining term on the right-hand side of (6) is bounded using a Chernoff bound, under the assumption that the task message estimates follow a sub-Gaussian distribution. Lastly, we prove that the assumption of the second part holds for a certain range of parameters.
Without loss of generality, we may assume that the correct answer for each task i ∈ [m] is t_i = e_1. Let t̂_i denote the estimated answer of task i, as defined in Section 5.2. If we draw a task I uniformly at random from the task set, the average probability of error can be written as P_error = (1/m) Σ_{i∈[m]} P(t_i ≠ t̂_i) = P(t_I ≠ t̂_I). Let G_{I,k} denote the subgraph of the random (l, r)-regular bipartite graph consisting of all nodes whose distance from node I is at most k. After k iterations, the local graph with root I is G_{I,2k−1}, since the update process operates twice per iteration. To take advantage of density evolution, full independence of each branch is needed. Thus, we bound the probability of error with two terms: one representing the probability that the subgraph G_{I,2k−1} is not a tree, and the other representing the probability that G_{I,2k−1} is a tree but yields a wrong answer.
The following lemma bounds the first term and proves that the probability that a local subgraph is not a tree vanishes as m grows; it applies to a random (l, r)-regular bipartite graph generated according to the pairing model, and its proof is provided in [38] (cf. Karger, Oh, and Shah 2011, Section 3.2). From the result of Lemma 1, we can concentrate on the second term of (9) and define the pairwise difference of task messages as x̂^(k)_d, the difference between the likelihood of the correct alternative and that of the d-th alternative, for d ∈ [2 : D]. Then, a Chernoff bound is applied to the independent message branches, which gives us the tight bound of our algorithm. A random variable z with mean m is said to be sub-Gaussian with parameter σ if, for any λ ∈ R, E[e^{λ(z−m)}] ≤ e^{σ²λ²/2}. Lemma 2 states that x̂^(k)_d is sub-Gaussian with mean m_k and parameter σ²_k for a specific region of λ, precisely for |λ| ≤ 1/(2 m_{k−1} r̂); its proof is presented in Appendix A.1. The mean m_k and variance σ²_k of the sub-Gaussian are defined recursively, and the locally tree-like property of a sparse random graph provides the distributional independence of each branch. Because of this full independence, we can apply a Chernoff bound with λ = −m_k/σ²_k. Since m_k m_{k−1}/σ²_k ≤ 1/(3r̂), we can check that |λ| ≤ 1/(2 m_{k−1} r̂). This finalizes the proof of Theorem 1.

Phase Transition
As shown in (5), the performance of our algorithm is only bounded when the condition T > 1 is satisfied. Meanwhile, with T < 1, σ²_k diverges as the number of iterations k increases, which means that the variance of x̂^(k)_d diverges. In this case, our performance guarantee is no longer valid, and the performance becomes worse than that of majority voting. Note that except for extreme cases, such as very low-quality workers or deficient assignments, T > 1 is easily satisfied and our performance guarantee is valid. In Section 6, we verify the existence of this critical point at T = 1 through several experiments under different conditions.

Experiments
In this section, we verify the performance of the K-approval iterative algorithm discussed in Section 4 with several sets of simulations. First, we check whether the error performance of our algorithm exhibits exponential decay as the task degree l or the quality of workers q increases. In addition, we show that our algorithm achieves better performance than majority voting above the phase transition at T = 1. The next simulation investigates the linear relationship between the y_j value and, for each worker, the ratio of the number of responses including the correct answer to r_j. Then, we conduct experiments on a task set consisting of various (D, K) pairs, where D and K indicate the number of alternatives and selections, respectively.

Error Performance with q and l
To show the competitiveness of our algorithm, we ran our K-approval iterative algorithm, majority voting, and the oracle estimator for 2000 tasks and 2000 workers with a fixed (D, K) pair (Figure 6). Here, the oracle estimator can give each worker an appropriate weight, since we assume it knows the quality of each worker; its error performance is presented as a lower bound. In Figure 6 (top), we can see that the probability of error decays exponentially as l increases and is lower than that of majority voting above the phase transition point T = 1. In addition, Figure 6 (bottom) shows that the probability of error decays as q increases.
As detailed in Section 5, we expect a phase transition at T = 1. Following this result, we can expect transitions to happen around l = 4.52 for (3, 1) and (4, 1), l = 3.39 for (5, 2), and l = 3.62 for (6, 2). From the experiments in Figure 6 (top), we find that the iterative algorithm starts to perform better than majority voting around l = 5, 6, 3, 4 for the respective (D, K) pairs. Note that these values are very similar to the theoretical values above. This follows from the fact that the error of our method increases with k when T < 1, as stated in Section 5. As can clearly be seen from the simulation results, the l values required for achieving T > 1 are not large.

Relationship between Reliability and y-Message
The inference of workers' relative reliability over the course of the iterations is the most important aspect of our algorithm. We define the estimated reliability p̂_j for each worker j as p̂_j = (# of responses including the correct answer) / r_j.
After k_max iterations, we can find reliable workers by the value of the worker message y_j, since this value is proportional to p̂_j, which is influenced by p_j. The relative reliability y_j is calculated by the corresponding update equation in Algorithm 1. Figure 7 shows that there are strong correlations between y_j and p̂_j. In one simulation, the correlation coefficients (the Pearson product-moment correlation coefficient (PPMCC) was used for evaluation) between y_j and p̂_j were measured as 0.991, 0.992, 0.904, and 0.954 for (D, K) = (3, 1), (4, 1), (5, 2), and (6, 2), respectively. We can also check that the fitted line approximately passes through the point (K/D, 0), where p_j = K/D is the reliability of a non-informative worker who gives random responses, as expected from Section 5.
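For reference, the PPMCC used here is the plain sample correlation; with a list of estimated reliabilities and final y-values, it can be computed in a few lines (the vectors below are illustrative toy data, not the paper's measurements):

```python
def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

p_hat = [0.2, 0.4, 0.5, 0.7, 0.9]     # toy estimated reliabilities
y_vals = [-0.8, 0.1, 0.4, 1.2, 2.0]   # toy final worker-message values
r = pearson(p_hat, y_vals)            # close to 1 for a near-linear relationship
```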

Error Performance with Various (D,K) Pairs
To show the performance of the K-approval iterative algorithm, we performed simulations on a task set consisting of various (D, K) pairs with Algorithm 1. In detail, we repeated the same experiments with a question set composed of an equal ratio of different task types whose (D, K) values are (3, 1), (4, 1), (5, 2), and (6, 2). We then investigated the general case in which q_j is calculated with a general version of Equation (4) from Section 5.1, q_j = Q(p_ij), where Q is the negative-entropy quality function for a (D_i, K_i)-type task. We define q_j as the individual inherent quality of worker j. To perform simulations and analyze the results, we assume that a worker with individual quality q_j solves questions with reliability p_ij for each (D_i, K_i). Thus, for each (D_i, K_i)-task, the corresponding reliability p_ij is determined by applying Newton's method to the quality function.
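Inverting the quality function can be sketched with a few Newton steps on the informative branch p > K/D; we use our reconstructed Q from Section 5.1, so the mapping is illustrative of the procedure rather than the paper's exact numbers:

```python
from math import comb, log

def Q(p, D, K):
    # Reconstructed negative-entropy quality, zero at p = K/D.
    n_hit, n_miss = comb(D - 1, K - 1), comb(D - 1, K)
    return log(comb(D, K)) + p * log(p / n_hit) + (1 - p) * log((1 - p) / n_miss)

def reliability_from_quality(q, D, K, tol=1e-10):
    """Solve Q(p) = q for p on the informative branch p > K/D (Newton's method)."""
    n_hit, n_miss = comb(D - 1, K - 1), comb(D - 1, K)
    p = (K / D + 1) / 2                  # start between random and perfect
    for _ in range(100):
        f = Q(p, D, K) - q
        df = log(p * n_miss / ((1 - p) * n_hit))   # Q'(p)
        step = f / df
        p = min(max(p - step, K / D + 1e-12), 1 - 1e-12)
        if abs(step) < tol:
            break
    return p

p = reliability_from_quality(Q(0.8, D=4, K=1), D=4, K=1)   # recovers p ≈ 0.8
```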
The same tendencies found in the previous simulations also appear in Figure 8. There is also a strong correlation between y_j and p̂_j (a PPMCC of 0.929). This result is notable because, in the real world, there are many more cases in which questions have various (D, K) pairs rather than a fixed one.

Conclusions
Web-based crowdsourcing systems have evolved to encourage workers to make multiple choices among given options. These changes are primarily aimed at reducing workers' mistakes and collecting higher-quality data. In this article, we have proposed an iterative algorithm for top-K selection questions. Moreover, our scheme covers the case of a task set consisting of various (mixed) (D, K) pairs. Furthermore, we proved that the error of our algorithm is upper-bounded and that the bound on the probability of error decays exponentially in terms of the quality of workers and the number of queries. We hope that our work on K-approval crowdsourcing systems will be of practical help to researchers who need to collect high-quality real data.
Presently, the algorithm proposed in this paper adopts an iterative regime; in future work, however, we will generalize our algorithm with generalized task-worker allocation using evolutionary algorithms (genetic algorithms and harmony search). For example, by ensuring that appropriate tasks are assigned according to each worker's experience, preferences, and abilities, a task master can increase worker efficiency and reduce the error rate.
for any z ∈ R and b ∈ [0, 1] (cf. Alon). Similarly to the process for (A3), from (A1) we obtain the corresponding bound using (1/2)(e^λ + e^{−λ}) ≤ e^{λ²/2} for any λ ∈ R. Substituting the result of Lemma A1 into the above inequality gives (A5). We are now left to bound (A5) using the following Lemma A2.
Proof of Lemma A1.
In the (k+1)-th step of the mathematical induction, we assume that, for any d ∈ [2 : D] and |λ| ≤ 1/(2 m_{k−1} r̂), all of the x̂^(k)_d follow a sub-Gaussian distribution with parameters m_k and σ²_k. From (A3), each component on the right-hand side can be represented as a product of several combinations of E[e^{λ x̂^(k)_d}], and a product of such terms corresponds to a linear combination in the exponent. We verify that a linear transformation of sub-Gaussian random variables also follows a sub-Gaussian distribution with some parameters. Moreover, these parameters are determined by D, K, and the mean m_k and variance σ²_k of each sub-Gaussian x̂^(k)_d. The second inequality holds since D ≥ 2K is natural for a K-approval setting; otherwise, the responses are too uninformative to infer the ground truth accurately. Putting these two results together finishes the proof of Lemma A1.