Optimal Number of Choices in Rating Contexts

In many settings people must give numerical scores to entities from a small discrete set. For instance, dating sites ask users to rate physical attractiveness from 1-5, and conference reviewers rate papers from 1-10. We study the problem of determining when a given number of options is optimal. For concreteness we assume the true underlying scores are integers from 1-100. We consider two cases: scores that are uniformly random and scores that are Gaussian. We study when using 2, 3, 4, 5, and 10 options is optimal in these models. One may expect that using more options would always improve performance in this model, but we show that this is not necessarily the case, and that using fewer choices, even just two, can surprisingly be optimal in certain situations. While in theory it would be optimal in this setting to use all 100 options, in practice this is prohibitive, and a smaller number of options is preferable due to humans' limited computational resources. Our results suggest that using a smaller number of options than is typical could be optimal in certain situations. This would have many potential applications, as settings requiring entities to be ranked by humans are ubiquitous.


Introduction
Humans rate items or entities in many important settings. For example, users of dating websites and mobile applications rate other users' physical attractiveness, teachers rate scholarly work of students, and reviewers rate the quality of academic conference submissions. In these settings, the users assign a numerical (integral) score to each item from a small discrete set. However, the number of options in this set can vary significantly between applications, and even within different instantiations of the same application. For instance, for rating attractiveness, three popular sites all use a different number of options. On "Hot or Not," users rate the attractiveness of photographs submitted voluntarily by other users on a scale of 1-10 (Figure 1). These scores are aggregated and the average is assigned as the overall "score" for a photograph. On the dating website OkCupid, users rate other users on a scale of 1-5 (if a user rates another user 4 or 5 then the rated user receives a notification) (Figure 2). And on the mobile application Tinder users "swipe right" (green heart) or "swipe left" (red X) to express interest in other users (two users are allowed to message each other if they mutually swipe right), which is essentially equivalent to using a binary {1, 2} scale (Figure 3). Education is another important application area requiring human ratings. For the 2016 International Joint Conference on Artificial Intelligence, reviewers assigned a "Summary Rating" score from -5 to 5 (equivalent to 1-10) for each submitted paper (Figure 4). The papers are then discussed and scores aggregated to produce an acceptance or rejection decision based largely on the average of the scores.
arXiv:1605.06588v2 [cs.AI] 30 May 2016
Despite the importance and ubiquity of the problem, there has been little fundamental research on determining the optimal number of options to allow in such settings. We study a model in which users have an underlying integral ground truth score for each item in $\{1, \ldots, n\}$ and are required to submit an integral rating in $\{1, \ldots, k\}$, for $k \ll n$. We use two generative models for the ground truth scores: a uniform random model in which the fraction of scores for each value from 1 to $n$ is chosen uniformly at random (by choosing a random value for each and then normalizing), and a model where scores are chosen according to a Gaussian distribution with a given mean and variance. We then compute a "compressed" score distribution by mapping each full score $s$ from $\{1, \ldots, n\}$ to $\{1, \ldots, k\}$. We then compute the average "compressed" score $a_k$ and its (normalized) error $e_k$ relative to $a_f$, the ground truth average score. The goal is to pick $\arg\min_k e_k$. While there are many possible generative models and cost functions to use, these seemed like the most natural ones to start with. We leave study of alternative choices for future work.
We derive a closed-form expression for $e_k$ that, for an arbitrary distribution, depends on only a small number ($k$) of parameters of the underlying distribution. This allows us to characterize exactly and efficiently the performance of each number of choices. In computational simulations we repeatedly compute $e_k$ for $m$ items and compare the average values. We focus on $n = 100$ and $k = 2, 3, 4, 5, 10$, which we believe are the most natural and interesting choices for first study.
One could argue that this model is somewhat "trivial" in the sense that it would be optimal to set $k = n$ to permit all the possible scores, as this would result in the "compressed" scores agreeing exactly with the full scores. However, there are several reasons that would lead us to prefer $k \ll n$ in practice (as all of the examples previously described have done), thus making this "thought experiment" worthwhile. It is much easier for a human to assign a score from a small set than from a large set, particularly when rating many items under time constraints. We could have included an additional term in the cost function $e_k$ that explicitly penalizes larger values of $k$, which would have a significant effect on the optimal value of $k$ (favoring smaller values). However, the selection of this function would be somewhat arbitrary and would make the model more complex, and we leave this for future study. Given that we do not include such a penalty term, one may expect that increasing $k$ will always decrease $e_k$ in our setting. While the simulations show a clear negative relationship between $k$ and $e_k$, we show that smaller values of $k$ can actually lead to smaller $e_k$ surprisingly often. These smaller values would receive further preference if a penalty term were included.
The most closely related work studies the impact of using finely grained numerical grades (e.g., 100, 99, 98) vs. coarse letter grades (e.g., A, B, C) [2]. They conclude that if students care primarily about their rank in class (relative to the other students), they are often better motivated to work when assigned to coarse categories (letter grades) than when given exact numerical exam scores. In a specific setting of "disparate" student abilities they show that the optimal absolute grading scheme is always coarse. Their model is game-theoretic; each player (student) selects an effort level, seeking to optimize a utility function that depends on both the relative score and effort level. Their setting is quite different from ours in many ways. For one, they study a setting where it is assumed that the underlying "ground truth" score is known, yet may be disguised for strategic reasons. In our setting the ultimate goal is to approximate the ground truth score as closely as possible.

Theoretical characterization
Suppose scores are given by continuous pdf f (with cdf F ) on (0, 100), and we wish to compress them to two options, {0, 1}. Scores below 50 are mapped to 0, and scores above 50 are mapped to 1.
The average of the full distribution is $a_f = E[X] = \int_0^{100} x f(x)\,dx$. The average of the compressed version is $a_2 = 0 \cdot F(50) + 1 \cdot (1 - F(50)) = 1 - F(50)$.
The case of three options and, in general, of $n$ total and $k$ compressed options is analogous, with the compressed scores taken over $\{0, \ldots, k-1\}$. In this setting we use a normalization factor of $n/k$ instead of $(n-1)/k$ for the $e_k$ term. Continuous approximations for large discrete spaces have been studied in other settings; for instance, they have led to simplified analysis and insight in poker games with continuous distributions of private information [1]. In practice the parameters of the underlying distribution that $e_k$ depends on could likely be closely approximated from historical data for small values of $k$.
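As a sketch of how the two-option case extends, assume equal-width cut points at multiples of $n/k$. This is an assumption: it matches the cut at 50 for $k = 2$ and $n = 100$, but other cut points are possible. Under it the compressed score takes value $j$ with probability $F((j+1)n/k) - F(jn/k)$, and with $F(0) = 0$ and $F(n) = 1$ the sums telescope:

```latex
\begin{align*}
a_3 &= 0 \cdot F\!\left(\tfrac{n}{3}\right)
      + 1 \cdot \left[F\!\left(\tfrac{2n}{3}\right) - F\!\left(\tfrac{n}{3}\right)\right]
      + 2 \cdot \left[1 - F\!\left(\tfrac{2n}{3}\right)\right]
     = 2 - F\!\left(\tfrac{n}{3}\right) - F\!\left(\tfrac{2n}{3}\right), \\
a_k &= \sum_{j=0}^{k-1} j \left[F\!\left(\tfrac{(j+1)n}{k}\right) - F\!\left(\tfrac{jn}{k}\right)\right]
     = (k-1) - \sum_{i=1}^{k-1} F\!\left(\tfrac{in}{k}\right).
\end{align*}
```

For $k = 2$ and $n = 100$ this reduces to $1 - F(50)$, in line with the two-option mapping described above. Under this assumption the compressed average depends on $F$ only through its values at $k - 1$ cut points, which together with the mean $a_f$ gives $k$ parameters in total.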

Computational simulations and analysis
The above analysis leads to the immediate question of whether the example for which $e_2 < e_3$ was just a fluke or whether using fewer choices can actually reduce error under reasonable assumptions on the generative model. We study this question using simulations with what we believe are the two most natural models. While we have studied the continuous setting where the full set of options is continuous over $(0, n)$ and the compressed set of options is the discrete space $\{0, \ldots, k-1\}$, we will now consider the perhaps more realistic setting where the full set is the discrete set $\{0, \ldots, n-1\}$ and the compressed set is $\{0, \ldots, k-1\}$ (though it should be noted that the two settings are likely extremely similar qualitatively).
The first generative model we consider is a uniform model in which the values of the pmf $p_f$ for each of the $n$ possible values are chosen independently and uniformly at random. Formally, we construct a histogram of $n$ scores for $p_f$ according to Algorithm 1. We then compress the full scores to a compressed distribution $p_k$ by applying Algorithm 2.
The second generative model is a Gaussian model in which the values are generated according to a normal distribution with specified mean µ and standard deviation σ. This model also takes as a parameter a number of samples s to use for generating the scores. The procedure is given by Algorithm 3. As for the uniform setting, Algorithm 2 is then used to compress the scores.
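The two generative models and the compression step can be sketched as follows. This is a minimal illustration and not the paper's Algorithms 1-3: the function names are hypothetical, the compression rule `floor(i * k / n)` is an assumed equal-width binning, and rounding and clamping out-of-range Gaussian samples is an assumption, since the text does not say how such draws are handled.

```python
import random


def uniform_pmf(n, rng):
    """Uniform model: draw an independent random weight per score, then normalize."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_pmf(n, s, mu, sigma, rng):
    """Gaussian model: normalized histogram of s normal samples over scores 0..n-1.

    Assumption: each sample is rounded to the nearest integer and clamped
    into [0, n-1].
    """
    counts = [0] * n
    for _ in range(s):
        x = round(rng.gauss(mu, sigma))
        x = min(max(x, 0), n - 1)
        counts[x] += 1
    return [c / s for c in counts]


def compress(pmf, k):
    """Assumed rule: full score i in 0..n-1 goes to bin floor(i * k / n) in 0..k-1."""
    n = len(pmf)
    compressed = [0.0] * k
    for i, p in enumerate(pmf):
        compressed[i * k // n] += p
    return compressed
```

For example, `compress(uniform_pmf(100, random.Random(0)), 5)` returns a five-entry distribution whose entries sum to 1 up to floating-point rounding.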

[Algorithms 1-3: pseudocode listings; only the fragment "+= scores[i]" survives.]
For our simulations we used n = 100, and considered k = 2, 3, 4, 5, 10, which are popular and natural values. For the Gaussian model we used s = 1000, µ = 50, σ = 50/3. For each set of simulations we computed the errors for all considered values of k for m = 100,000 "items" (each corresponding to a different distribution generated according to the specified model). The main quantities we are interested in computing are the number of times that each value of k produces the lowest error over the m items, and the average value of the errors over all items for each k value.
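The two quantities described above, the number of wins per k and the average error per k, could be tallied as in the following sketch for the uniform model. Several choices here are assumptions rather than details taken from the paper: the equal-width compression rule `floor(i * k / n)`, the normalization `(n - 1) / (k - 1)` (chosen so the largest compressed score maps back to the largest full score), and breaking ties in favor of the value of k listed first. All function names are hypothetical.

```python
import random


def uniform_pmf(n, rng):
    """Uniform model: independent random weights, normalized to sum to 1."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def error(pmf, k):
    """Absolute gap between the full average and the rescaled compressed average."""
    n = len(pmf)
    a_f = sum(i * p for i, p in enumerate(pmf))
    # Assumed compression: score i falls in bin floor(i * k / n).
    a_k = sum((i * k // n) * p for i, p in enumerate(pmf))
    scale = (n - 1) / (k - 1)  # assumed normalization factor
    return abs(a_f - scale * a_k)


def tally(m, n, ks, rng):
    """Return (wins per k, average error per k) over m simulated items."""
    wins = {k: 0 for k in ks}
    total_err = {k: 0.0 for k in ks}
    for _ in range(m):
        pmf = uniform_pmf(n, rng)
        errs = {k: error(pmf, k) for k in ks}
        best = min(ks, key=lambda k: errs[k])  # ties go to the earlier k
        wins[best] += 1
        for k in ks:
            total_err[k] += errs[k]
    avg_err = {k: total_err[k] / m for k in ks}
    return wins, avg_err
```

A call such as `tally(100_000, 100, [2, 3, 4, 5, 10], random.Random(0))` mirrors the setup in the text. Because the compression rule and normalization are assumed, the resulting counts need not match the reported tables.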
In the first set of experiments, we compared performance between using k = 2, 3, 4, 5, 10 to see for how many of the trials each value of k produced the minimal error. The results are in Table 1. As expected, larger values of k generally produced the minimal error more often. However, what is perhaps surprising is that using a smaller number of compressed scores produced the optimal error in a far from negligible number of the trials. For the uniform model, using 10 scores minimized error only around 53% of the time, while using 5 scores minimized error 17% of the time, and even using 2 scores minimized it 5.6% of the time. The results were similar for the Gaussian model, though a bit more in favor of larger values of k, which is what we would expect because the Gaussian model is less likely to generate "fluke" distributions that could favor the smaller values.
We next explored the number of victories between just k = 2 and k = 3, with results in Table 2. Again we observed that using a larger value of k generally reduces error, as expected. However, we find it extremely surprising that using k = 2 produces a lower error 37% of the time. As before, the larger k value performs relatively better in the Gaussian model.
We also looked at results for the most extreme comparison, k = 2 vs. k = 10. These results are provided in Table 3. Using 2 scores outperformed 10 scores 8.3% of the time in the uniform setting, which was larger than we expected. In Figures 8-10, we present a distribution for which k = 2 particularly outperformed k = 10.
Table 3: Number of times each value of k in {2,10} produces minimal error and average error values, over 100,000 items generated according to both generative models.
We next repeated the extreme k = 2 vs. 10 comparison, but we imposed a restriction that the k = 10 option could not give a score below 3 or above 6 (if it selected a score below 3 we set it to 3, and if above 6 we set it to 6). These results are given in Table 4. For some settings, for instance paper reviewing, extreme scores are very uncommon, and we strongly suspect that the vast majority of scores fall in this middle range. Some possible explanations are that reviewers who give extreme scores may be required to put in additional work to justify their scores, and are more likely to be involved in arguments with the other reviewers (or with the authors in the rebuttal). Reviewers could also experience higher regret or embarrassment for being "wrong" and possibly off-base in the review by missing an important nuance. In this setting using k = 2 outperforms k = 10 nearly
Table 4: Number of times each value of k in {2,10} produces minimal error and average error values, over 100,000 items generated according to both models. For k = 10, we only permitted scores between 3 and 6 (inclusive). If a score was below 3 we set it to be 3, and above 6 to 6.
We also considered the situation where we restricted the k = 10 scores to fall between 3 and 7 (as opposed to 3 and 6). Note that the possible scores range from 0-9, so this restriction is asymmetric in that the lowest three possible scores are eliminated while only the highest two are. This is motivated by the intuition that raters may be less inclined to give extremely low scores, which may hurt the feelings of an author (in the case of paper reviewing). In this setting, which is seemingly quite similar to the 3-6 setting, using k = 2 produced lower error 93% of the time in the uniform model!
Gaussian average error 1.13 1.09
Table 5: Number of times each value of k in {2,10} produces minimal error and average error values, over 100,000 items generated according to both generative models. For k = 10, we only permitted scores between 3 and 7 (inclusive). If a score was below 3 we set it to be 3, and above 7 to 7.
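The restricted variants can be illustrated by clamping each compressed score into an allowed range before accumulating probability mass. The sketch below is illustrative only: `compress_clipped` and `mean` are hypothetical names, and the equal-width rule `floor(i * k / n)` is an assumption about how scores are compressed.

```python
def compress_clipped(pmf, k, lo, hi):
    """Compress scores 0..n-1 to 0..k-1, then clamp each compressed score to [lo, hi]."""
    n = len(pmf)
    compressed = [0.0] * k
    for i, p in enumerate(pmf):
        c = i * k // n           # assumed equal-width compression
        c = min(max(c, lo), hi)  # scores below lo become lo, above hi become hi
        compressed[c] += p
    return compressed


def mean(pmf):
    """Average score of a distribution indexed from 0."""
    return sum(i * p for i, p in enumerate(pmf))


# A flat distribution over 0..99 has an unrestricted k = 10 average of 4.5.
flat = [0.01] * 100
symmetric = mean(compress_clipped(flat, 10, 3, 6))   # about 4.5
asymmetric = mean(compress_clipped(flat, 10, 3, 7))  # about 4.8
```

On this flat example the 3-6 restriction leaves the compressed average unchanged, while the asymmetric 3-7 restriction shifts it upward, which gives one intuition for why the two restrictions can behave differently.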

Conclusion
Settings in which humans must rate items or entities from a small discrete set of options are ubiquitous. We have singled out several important applications: rating attractiveness for dating websites and mobile applications, assigning grades to students, and reviewing academic papers for conferences. The number of available options can vary considerably, even within different instantiations of the same application. For instance, we saw that three popular sites for the attractiveness rating problem use completely different systems: Hot or Not uses a 1-10 system, OkCupid uses a 1-5 "star" system, and Tinder uses a binary 1-2 "swipe" system.
Despite the ubiquity and importance of the problem of selecting the optimal number of rating choices, we have not seen it studied previously in academic literature. Our goal is to select k to minimize the average (normalized) error between the compressed average score and the ground truth average. We studied two natural models for generating the scores. The first is a uniform model where the scores are selected independently and uniformly at random, and the second is a Gaussian model where they are selected according to a more structured procedure that gives more preference for the options near the center.
We provided a closed-form solution for continuous distributions with arbitrary cdf. This allows us to characterize the relative performance of choices of k for a given distribution. We saw that, counterintuitively, using a smaller value of k can actually produce a smaller error for some distributions (even though we know that as k approaches n the error approaches 0). We presented a specific example distribution f for which using k = 2 outperforms k = 3.
We performed numerous computational simulations, comparing the performance of different values of k for different generative models and metrics. The main metric we used was the absolute number of times each value of k produced the minimal error. We also considered the average error over all simulated items. Not surprisingly, we observe that performance generally improves monotonically with increased k, and more so for the Gaussian model than the uniform. However, we observe that small k values can be optimal a non-negligible fraction of the time, which is perhaps counterintuitive. In fact, using k = 2 outperformed k = 3, 4, 5, and 10 on 5.6% of the trials in the uniform setting. Just comparing 2 vs. 3, k = 2 performed better around 37% of the time. Using k = 2 outperformed k = 10 8.3% of the time, and significantly more often once we imposed some very natural restrictions on the k = 10 setting that are motivated by intuitive phenomena. When we restricted the k = 10 option to assign only values between 3 and 7 (inclusive), using k = 2 actually produced lower error 93% of the time! This could correspond to a setting where raters are ashamed to assign extreme scores (particularly extreme low scores).