Optimal Forgery and Suppression of Ratings for Privacy Enhancement in Recommendation Systems

Recommendation systems are information-filtering systems that tailor information to users on the basis of knowledge about their preferences. The ability of these systems to profile users is what enables such intelligent functionality, but at the same time, it is the source of serious privacy concerns. In this paper we investigate a privacy-enhancing technology that aims at hindering an attacker in its efforts to accurately profile users based on the items they rate. Our approach capitalizes on the combination of two perturbative mechanisms---the forgery and the suppression of ratings. While this technique enhances user privacy to a certain extent, it inevitably comes at the cost of a loss in data utility, namely a degradation of the recommendation's accuracy. In short, it poses a trade-off between privacy and utility. The theoretical analysis of said trade-off is the object of this work. We measure privacy as the Kullback-Leibler divergence between the user's and the population's item distributions, and quantify utility as the proportion of ratings users consent to forge and eliminate. Equipped with these quantitative measures, we find a closed-form solution to the problem of optimal forgery and suppression of ratings, and characterize the trade-off among privacy, forgery rate and suppression rate. Experimental results on a popular recommendation system show how our approach may contribute to privacy enhancement.


I. INTRODUCTION
Since the advent of the Internet and the World Wide Web, the amount of information available to users has grown exponentially. As a result, the ability to find information relevant to their interests has become a central issue in recent years. In this context of information overload, recommendation systems arise to provide information tailored to users on the basis of knowledge about their preferences [2]. In essence, a recommendation system may be regarded as a type of information-filtering system that suggests information items users may be interested in. Examples of such systems include recommending music at Last.fm and Pandora Radio, movies by MovieLens and Netflix, videos at YouTube, news at Digg and Google News, and books and other products at Amazon.
Most of these systems capitalize on the creation of profiles that represent interests and preferences of users. Such profiles are the result of the collection and analysis of the data that users communicate to those systems. A distinction is frequently made between explicit and implicit forms of data collection. The most popular form of explicit data collection is that users communicate their preferences by rating items. This is the case of many of the applications mentioned above, where users assign ratings to songs, movies or news they have already listened to, watched or read. Other strategies to capture users' interests include asking them to sort a number of items in order of preference, or suggesting that they mark the items they like. On the other hand, recommendation systems may collect data from users without requiring them to explicitly convey their preferences [3]. These practices comprise observing the items clicked by users in an online store, analyzing the time it takes users to examine an item, or simply keeping a record of the purchased items.
The prolonged collection of these personal data allows the system to extract an accurate snapshot of user interests, i.e., their profiles. With this invaluable source of information, the recommendation system applies some technique [4] to generate a prediction of users' interests for those items they have not yet considered. For example, Movielens and Digg use collaborative-filtering techniques to predict the rating that a user would give to a movie and to create a personalized list of recommended news, respectively. In a nutshell, the ability to profile users based on such personal information is precisely what enables the intelligent functionality of those systems.
Despite the many advantages recommendation systems are bringing to users, the information collected, processed and stored by these systems prompts serious privacy concerns. One of the main privacy risks perceived by users is that of a computer "figuring things out" about them [5]. Many users are worried about the idea that their profiles may reveal sensitive information such as health-related issues, political preferences, salary or religion. Such privacy risk is exacerbated especially when these profiles are combined across several information services or enriched with data from social networks. An illustrative example is [6], which demonstrates that it is possible to unveil sensitive information about a person from their movie rating history by cross-referencing data from other sources. The authors analyzed the Netflix Prize data set [7], which contained anonymous movie ratings of around half a million users of Netflix, and were able to uncover the identity, political leaning and even sexual orientation of some of those users, by simply correlating their ratings with reviews they posted on the popular movie Web site IMDb. Apart from the risk of cross-referencing, users are also concerned that the system's predictions may be totally erroneous and be later used to defame them. This latter situation is examined in [8], where the accuracy of the predictions provided by TiVo digital video recorder and Amazon is questioned. Lastly, other privacy risks embrace unsolicited marketing, information leaked to other users of the same computer, court subpoenas, and government surveillance [5].
Some parts of this paper (a reduced version of Secs. I and II) were presented at the International Workshop on Data Privacy Management, Leuven, Belgium, Sep. 2011 [1]. The formulation of the trade-off between privacy and utility (Sec. III), the theoretical analysis (Sec. IV), the experiments (Sec. V) and the conclusions (Sec. VI) are all new work.
The authors are with the Department of Telematics Engineering, Universitat Politècnica de Catalunya (UPC), E-08034 Barcelona, Spain (e-mail: javier.parra@entel.upc.edu; david.rebollo@entel.upc.edu; jforne@entel.upc.edu).
As a result of all this, it is not surprising that some users are reticent to reveal their interests. In fact, [9] reports that 24% of Internet users surveyed provided false information in order to avoid giving private information to a Web site. Likewise, another study [10] finds that 95% of the respondents refused, at some point, to provide personal information when requested by a Web site. In short, these studies seem to indicate that submitting false information and refusing to give private information are strategies accepted by users concerned with their privacy.

A. Contribution and Plan of this Paper
In this paper we approach the problem of protecting user privacy in those recommendation systems that profile users on the basis of the items they rate. Given the willingness of users to provide fake information and elude disclosing private data, we investigate a privacy-enhancing technology (PET) that combines these two forms of data perturbation, namely the forgery and the suppression of ratings. Concordantly, in our scenario users rate those items they have an opinion on. However, in order to avoid being accurately profiled by the recommender or, in general, by any privacy attacker capable of collecting this information, users may wish to refrain from rating some of those items and/or rate items that do not reflect their actual preferences. Our approach thus protects user privacy to a certain degree, without having to trust the recommendation system or the network operator, but at the cost of a loss in utility, namely a degradation of the quality of the recommendation. In other words, our PET poses a trade-off between privacy and utility.
The theoretical analysis of the trade-off between these two contrasting aspects is the object of this work. We tackle the issue in a systematic fashion, drawing upon the methodology of multiobjective optimization. Before proceeding, though, we adopt a quantifiable measure of user privacy: the Kullback-Leibler (KL) divergence between the probability distribution of the user's items and the population's distribution, a criterion that we introduced in previous work [11] and justified and interpreted in [12], [13] by leveraging the rationale behind entropy-maximization methods. Equipped with a measure of both privacy and utility, we formulate an optimization problem modeling the trade-off between privacy on the one hand, and on the other forgery rate and suppression rate as utility metrics. Our extensive theoretical analysis finds a closed-form solution to the problem of optimal forgery and suppression of ratings, and characterizes the optimal trade-off between the aspects of privacy and utility.
In addition, we provide an empirical evaluation of our data-perturbative approach. Specifically, we apply the forgery and the suppression of ratings in the popular movie recommendation system Movielens, and show how these two strategies may preserve the privacy of its users.
Sec. II reviews several data-perturbative approaches aimed at enhancing user privacy in the context of recommender systems. Sec. III introduces our privacy-enhancing technology, proposes a quantitative measure of the privacy of user profiles, and formulates the trade-off between privacy and utility. Sec. IV presents a theoretical analysis of the optimization problem characterizing the privacy-forgery-suppression trade-off. In this same section we also provide a numerical example that illustrates our formulation and theoretical results. Sec. V evaluates our privacy-protecting mechanism in a real recommendation system. Finally, conclusions are drawn in Sec. VI.

II. STATE OF THE ART
Numerous approaches have been proposed to protect user privacy in the context of recommendation systems. These approaches fundamentally suggest either perturbing the information provided by users or using cryptographic techniques.
In the case of perturbative methods for recommendation systems, [14] proposes that users add random values to their ratings and then submit these perturbed ratings to the recommender. After receiving these ratings, the system executes an algorithm and sends the users some information that allows them to compute the prediction. When the number of participating users is sufficiently large, the authors find that user privacy is protected to a certain extent and the system reaches a decent level of accuracy. However, even though a user disguises all their ratings, it is evident that the items themselves may uncover sensitive information. Simply put, the mere fact of showing interest in a certain item may be more revealing than the rating assigned to that item. For instance, a user rating a book called "How to Overcome Depression" indicates a clear interest in depression, regardless of the score assigned to this book. Apart from this critique, other works [15], [16] stress that the use of randomized data distortion techniques might not be able to preserve privacy.
In line with this work, [17] applies the same data-perturbative technique to collaborative-filtering algorithms based on singular-value decomposition. Specifically, the authors focus on the impact that their technique has on privacy. For this purpose, they use the privacy metric proposed by [18], which is essentially equivalent to differential entropy, and conduct some experiments with data sets from Movielens and Jester. The results show the trade-off curve between accuracy in recommendations and privacy. In particular, they measure accuracy as the mean absolute error between the predicted values from the original ratings and the predictions obtained from the perturbed ratings.
At this point, we would like to remark that the use of perturbative techniques is by no means new in other scenarios such as private information retrieval and the semantic Web. In the former scenario, users send general-purpose queries to an information service provider. A perturbative approach to protect user profiles in this context consists in combining genuine with false queries. Precisely, [11] proposes a nonrandomized method for query forgery and investigates the trade-off between privacy and the additional traffic overhead. In the semantic Web scenario, users annotate resources with the purpose of classifying them. In this application domain, the perturbation of user profiles for privacy preservation may be carried out by dropping certain annotations or tags. An example of this kind of perturbation may be found in [19]-[21], where the authors propose the elimination of tags as a privacy-enhancing strategy.
Regarding the use of cryptographic techniques, [22], [23] propose a method that enables a community of users to calculate a public aggregate of their profiles without revealing them on an individual basis. In particular, the authors use a homomorphic encryption scheme and a peer-to-peer communication protocol for the recommender to perform this calculation. Once the aggregated profile is computed, the system sends it to users, who finally use local computation to obtain personalized recommendations. This proposal prevents the system or any external attacker from ascertaining the individual user profiles. However, its main handicap is assuming that an acceptable number of users is online and willing to participate in the protocol. In line with this, [24] uses a variant of Paillier's homomorphic cryptosystem which improves the efficiency of the communication protocol. Another solution [25] presents an algorithm aimed at providing more efficiency by using the scalar product protocol.

III. PRIVACY PROTECTION VIA FORGERY AND SUPPRESSION OF RATINGS
In this section, first we present the forgery and the suppression of ratings as a privacy-enhancing technology. The description of our approach is prefaced by a brief introduction of the concepts of soft privacy and hard privacy. Secondly, we propose a model of user profile and set forth our assumptions about the adversary capabilities. Finally, we provide a quantitative measure of both privacy and utility, and present a formulation of the trade-off between these two contrasting aspects.

A. Soft Privacy vs. Hard Privacy
The privacy research literature [26] recognizes the distinction between the concepts of soft privacy and hard privacy. A privacy-enhancing mechanism providing soft privacy assumes that users entrust their private data to an entity, which is thereafter responsible for the protection of their data. In the literature, numerous attempts to protect privacy have followed the traditional method of anonymous communications [27]-[30], which is fundamentally based on the suppositions of soft privacy. Unfortunately, anonymous-communication systems are not completely effective [31]-[34]; they normally come at the cost of infrastructure and assume that users are willing to trust other parties.
Our privacy-protecting technique, per contra, relies on the principle of hard privacy, which assumes that users mistrust communicating entities and therefore strive to reveal as little private information as possible. In the motivating scenario of this work, hard privacy means that users need not trust an external entity such as the recommender or the network operator. Consequently, because users just trust themselves, it is their own responsibility to protect their privacy. In this state of affairs, the forgery and the suppression of ratings appear as a technique that may hinder privacy attackers in their efforts to accurately profile users on the basis of the items they rate. Specifically, when users adhere to this technique, they may submit ratings for items that do not reflect their genuine preferences, and/or refrain from rating some items of their interest; this is what we refer to as the forgery and the suppression of ratings, respectively.

B. User Profile and Adversary Model
In the scenario of recommendation systems, users rate items of a very different nature, e.g., music, pictures, videos or news, according to their personal preferences. The information conveyed allows those systems to extract a profile of interests or user profile, which turns out to be essential in the provision of personalized recommendations.
We mentioned in Sec. I that Movielens represents user profiles by using some kind of histogram. Other systems such as Jinni and Last.fm show this information by means of a tag cloud, which in essence may be regarded as another kind of histogram. In this same spirit, recent privacy-protecting approaches in the scenario of recommendation systems also propose using histograms of absolute frequencies for modeling user profiles [35], [36].
According to these examples and inspired by other works in the field [1], [11], [19]-[21], [37], we model the items rated by users as random variables (r.v.'s) taking on values in a common finite alphabet of categories, namely the set $\{1, \ldots, n\}$ for some integer $n \ge 2$. Concordantly, we model the profile of a user as a probability mass function (PMF) $q = (q_1, \ldots, q_n)$, that is, a histogram of relative frequencies of items within a predefined set of categories of interest.
We would like to emphasize that, under this model, user profiles do not capture the particular scores given to items, but what we consider to be more sensitive: the categories these items belong to. This is exactly the case of Movielens and numerous content-based recommendation systems. Fig. 1 provides an example that illustrates how user profiles are constructed in Movielens. In this particular example, a user assigns two stars to a movie, meaning that they consider it to be "fairly bad". However, the recommender updates their profile based only on the categories this movie belongs to. According to this model, a privacy attacker supposedly observes a perturbed version of this profile, resulting from the forgery and the suppression of certain ratings, and is unaware or ignores the fact that the observed user profile, also in the form of a histogram, does not reflect the actual profile of interests of the user in question. In principle, our passive attacker could be the recommender itself or the network operator. However, the set of potential attackers is not restricted merely to these two entities. Since ratings are often publicly available to other users of the recommendation system, any other attacker able to crawl through this information is taken into consideration in our adversary model.
When users adhere to the forgery and the suppression of ratings, they specify a forgery rate $\rho \in [0, \infty)$ and a suppression rate $\sigma \in [0, 1)$. The former is the ratio of forged ratings to total genuine ratings that a user consents to submit. The latter ratio is the fraction of genuine ratings that the user agrees to eliminate (a). Note that, in our approach, the number of false ratings submitted by the user can exceed the number of genuine ratings, that is, $\rho$ can be greater than 1. Nevertheless, the number of suppressed ratings is always lower than the number of genuine ratings.
By forging and suppressing ratings, the actual profile of interests $q$ is then perceived from the outside as the apparent PMF
$$t = \frac{q + r - s}{1 + \rho - \sigma},$$
according to a forgery strategy $r = (r_1, \ldots, r_n)$ and a suppression strategy $s = (s_1, \ldots, s_n)$. Such strategies represent the proportion of ratings that the user should forge and eliminate in each of the $n$ categories. Naturally, these strategies must satisfy, on the one hand, that $r_i \ge 0$, $s_i \ge 0$ and $q_i + r_i - s_i \ge 0$ for $i = 1, \ldots, n$, and on the other, that $\sum_{i=1}^n r_i = \rho$ and $\sum_{i=1}^n s_i = \sigma$. In conclusion, the apparent profile is the result of the addition and the subtraction of certain items to/from the actual profile, and the subsequent normalization by $\frac{1}{1 + \rho - \sigma}$ so that $\sum_{i=1}^n t_i = 1$.
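The construction of the apparent profile can be illustrated with a minimal Python sketch. The function name `apparent_profile` and the three-category numbers are hypothetical and chosen only for illustration; the code simply checks the constraints on the strategies and applies the normalization described above.

```python
def apparent_profile(q, r, s):
    """Return t = (q + r - s) / (1 + rho - sigma), with rho = sum(r), sigma = sum(s)."""
    if any(ri < 0 for ri in r) or any(si < 0 for si in s):
        raise ValueError("forgery and suppression strategies must be nonnegative")
    # Unnormalized profile after adding forged and removing suppressed ratings.
    u = [qi + ri - si for qi, ri, si in zip(q, r, s)]
    if any(ui < 0 for ui in u):
        raise ValueError("q_i + r_i - s_i must be nonnegative in every category")
    norm = 1.0 + sum(r) - sum(s)
    return [ui / norm for ui in u]


# Hypothetical profile over n = 3 categories.
q = [0.6, 0.3, 0.1]
r = [0.0, 0.0, 0.2]   # forgery strategy, rho = 0.2
s = [0.1, 0.0, 0.0]   # suppression strategy, sigma = 0.1
t = apparent_profile(q, r, s)
```

In this example the user forges ratings only in the third category and suppresses ratings only in the first, and the resulting `t` is again a PMF.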

C. Measuring the Privacy of User Profiles
Inspired by the privacy measures proposed in [11]-[13], [19], [38], and according to the model of user profile assumed in Sec. III-B, we define initial privacy risk as the KL divergence [39] between the user's genuine profile and the population's distribution $p = (p_1, \ldots, p_n)$, that is,
$$D(q \,\|\, p) = \sum_{i=1}^n q_i \log \frac{q_i}{p_i}.$$
Similarly, we define (final) privacy risk $R$ as the KL divergence between the user's apparent profile and the population's distribution,
$$R = D(t \,\|\, p) = D\!\left(\frac{q + r - s}{1 + \rho - \sigma} \,\Big\|\, p\right).$$
An intuitive justification of our privacy metric stems from the observation that, whenever the user's apparent item distribution diverges too much from the population's, a privacy attacker will have actually gained some information about the user, in contrast to the statistics of the general population.
A richer argument may be found in [12], [13], where we establish some riveting connections between Jaynes' rationale on entropy-maximization methods and the use of entropies and divergences as measures of privacy. The leading idea is that the method of types from information theory establishes an approximate monotonic relationship between the likelihood of a PMF in a stochastic system and its Shannon entropy. Loosely speaking and in our context, the higher the entropy of a profile, the more likely it is, and the more users behave similarly. This is in absence of a probability distribution model for the PMFs, viewed abstractly as r.v.'s themselves. Under this interpretation, Shannon's entropy is a measure of anonymity, not in the sense that the user's identity remains unknown, but only in the sense that higher likelihood of an apparent profile, believed by an external observer to be the actual profile, makes that profile more common, helping the user go unnoticed and making them less interesting to an attacker assumed to strive to target peculiar users.
(a) The description of an architecture implementing this data-perturbative approach may be found in [1].
If an aggregated histogram of the population were available as a reference profile, as we assume in this work, the extension of Jaynes' argument to relative entropy also gives an acceptable measure of privacy (or anonymity). Recall [39] that KL divergence is a measure of discrepancy between probability distributions, which includes Shannon's entropy as the special case when the reference distribution is uniform. Conceptually, a lower KL divergence hides discrepancies with respect to a reference profile, say the population's, and there also exists a monotonic relationship between the likelihood of a distribution and its divergence with respect to the reference distribution of choice, which enables us to regard KL divergence as a measure of anonymity in a sense entirely analogous to the above mentioned.
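As a small illustration of this privacy measure, the following Python sketch computes the KL divergence between a profile and a reference distribution, and checks numerically the relation to Shannon's entropy for a uniform reference, namely $D(q \,\|\, u) = \log n - H(q)$. The profiles `q` and `p` are hypothetical values, logarithms are natural, and the reference distribution is assumed strictly positive.

```python
import math


def kl_divergence(t, p):
    """D(t || p) in nats, with the convention 0 log 0 = 0; assumes p_i > 0."""
    return sum(ti * math.log(ti / pi) for ti, pi in zip(t, p) if ti > 0)


def shannon_entropy(t):
    """H(t) in nats, with the convention 0 log 0 = 0."""
    return -sum(ti * math.log(ti) for ti in t if ti > 0)


# Hypothetical user profile and population profile over n = 3 categories.
q = [0.6, 0.3, 0.1]
p = [0.2, 0.3, 0.5]
uniform = [1.0 / 3] * 3

initial_risk = kl_divergence(q, p)
# With a uniform reference, the divergence reduces to log n minus the entropy.
gap = kl_divergence(q, uniform) - (math.log(3) - shannon_entropy(q))
```

A profile identical to the population's yields zero divergence, which corresponds to the lowest possible privacy risk under this measure.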

D. Formulation of the Trade-Off among Privacy, Forgery and Suppression
Our data-perturbative mechanism allows users to enhance their privacy to a certain extent, since the resulting profile, as observed from the outside, no longer captures their actual interests. The price to be paid, however, is a loss in data utility, in particular in the accuracy of the recommender's predictions.
For the sake of tractability, in this work we consider as utility metrics the forgery rate and the suppression rate. This consideration enables us to formulate the problem of choosing a forgery strategy and a suppression strategy as a multiobjective optimization problem that takes into account privacy, forgery rate and suppression rate. Specifically, under the assumption that the population of users is large enough to neglect the impact of the choice of $r$ and $s$ on $p$, we define the privacy-forgery-suppression function
$$R(\rho, \sigma) = \min_{\substack{r,\, s \\ r_i \ge 0,\ s_i \ge 0,\ q_i + r_i - s_i \ge 0 \\ \sum_i r_i = \rho,\ \sum_i s_i = \sigma}} D\!\left(\frac{q + r - s}{1 + \rho - \sigma} \,\Big\|\, p\right), \tag{1}$$
which characterizes the optimal trade-off among privacy, forgery rate and suppression rate. Conceptually, the result of this optimization is a pair of strategies $r$ and $s$ that contain information about which ratings should be forged and which ones should be suppressed, in order to achieve the minimum privacy risk. More precisely, the component $r_i$ is the proportion of items that the user should forge in the category $i$. The component $s_i$ is defined analogously for suppression.
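To make the minimization behind the privacy-forgery-suppression function concrete, the sketch below approximates it by exhaustive search over a grid of candidate strategies. This is only a brute-force illustration for very small $n$, not the closed-form solution developed in this paper; the function names, the grid resolution `steps` and the example profiles are all hypothetical.

```python
import math
from itertools import product


def kl_divergence(t, p):
    """D(t || p) in nats, with 0 log 0 = 0; assumes p_i > 0."""
    return sum(ti * math.log(ti / pi) for ti, pi in zip(t, p) if ti > 0)


def simplex_grid(total, n, steps):
    """Yield nonnegative n-vectors summing to `total` on a grid of `steps` parts."""
    for parts in product(range(steps + 1), repeat=n - 1):
        last = steps - sum(parts)
        if last >= 0:
            yield [total * c / steps for c in parts + (last,)]


def brute_force_risk(q, p, rho, sigma, steps=20):
    """Approximate R(rho, sigma) by grid search; cost grows exponentially with n."""
    n = len(q)
    norm = 1.0 + rho - sigma
    suppressions = list(simplex_grid(sigma, n, steps))
    best = (math.inf, None, None)
    for r in simplex_grid(rho, n, steps):
        for s in suppressions:
            u = [qi + ri - si for qi, ri, si in zip(q, r, s)]
            if min(u) < 0:
                continue  # violates q_i + r_i - s_i >= 0
            risk = kl_divergence([ui / norm for ui in u], p)
            if risk < best[0]:
                best = (risk, r, s)
    return best


# Hypothetical user and population profiles.
q = [0.6, 0.3, 0.1]
p = [0.2, 0.3, 0.5]
risk, r, s = brute_force_risk(q, p, rho=0.1, sigma=0.1)
```

For these example profiles the search concentrates the forged ratings in the category where the user is most underrepresented with respect to the population, and the suppressed ratings in the category where the user is most overrepresented.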

IV. OPTIMAL FORGERY AND SUPPRESSION OF RATINGS
This section is entirely devoted to the theoretical analysis of the privacy-forgery-suppression function (1) defined in Sec. III-D. In our attempt to characterize the trade-off among privacy risk, forgery rate and suppression rate, we shall present a closed-form solution to the optimization problem inherent in the definition of this function. Afterwards, we shall analyze some fundamental properties of said trade-off. For the sake of brevity, our theoretical analysis only contemplates the case when all given probabilities are strictly positive: $q_i, p_i > 0$ for all $i = 1, \ldots, n$.
Additionally, we suppose without loss of generality that
$$\frac{q_1}{p_1} \le \cdots \le \frac{q_n}{p_n}.$$
Before diving into the mathematical analysis, it is immediate from the definition of the privacy-forgery-suppression function that its initial value is $R(0, 0) = D(q \,\|\, p)$. The characterization of the optimal trade-off surface modeled by $R(\rho, \sigma)$ at any other values of $\rho$ and $\sigma$ is the focus of this section.

A. Closed-Form Solution
Our first theorem, Theorem 3, will present a closed-form solution to the minimization problem involved in the definition of function (1). The solution will be derived from Lemma 1, which addresses a resource allocation problem. This is a theoretical problem encountered in many fields, from load distribution and production planning to communication networks, computer scheduling and portfolio selection [40]. Although this lemma provides a parametric-form solution, we shall be able to proceed towards an explicit closed-form solution, albeit piecewise.
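Lemma 1 (Resource Allocation): For all $k = 1, \ldots, n$, let $f_k$ be a real-valued function on $\{(x_k, y_k) \in \mathbb{R}^2 : \kappa_k + x_k - y_k \ge 0\}$, twice differentiable in the interior of its domain. Assume that the Hessian satisfies $H(f_k) \succeq 0$ and that $h_k = \partial f_k / \partial x_k = -\partial f_k / \partial y_k$ is strictly increasing in $x_k$ and strictly decreasing in $y_k$. Consequently, for a fixed $y_k$, $h_k(x_k, y_k)$ is an invertible function of $x_k$. Denote by $h_k^{-1}$ the inverse of $h_k(x_k, 0)$. Suppose further that $h_k(x_k, y_k) = h_k(x_k - y_k, 0)$ and finally that $\lim_{\kappa_k + x_k - y_k \to 0} h_k(x_k, y_k) = -\infty$. Consider the following optimization problem in the variables $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$:
$$\text{minimize } \sum_{k=1}^n f_k(x_k, y_k) \quad \text{subject to } x_k \ge 0,\ y_k \ge 0,\ \kappa_k + x_k - y_k \ge 0 \text{ for all } k,$$
and $\sum_{k=1}^n x_k = \eta$, $\sum_{k=1}^n y_k = \theta$ for some $\eta, \theta \ge 0$.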
(i) The solution to the problem, $(x_k^*, y_k^*)$, depends on two real numbers $\psi, \omega$ that satisfy the equality constraints $\sum_k x_k^* = \eta$ and $\sum_k y_k^* = \theta$. The solution exists provided that $\psi \le \omega$. If $\psi < \omega$, then the solution is unique and yields $x_k^* = \max\{0, h_k^{-1}(\psi)\}$ and $y_k^* = \max\{0, -h_k^{-1}(\omega)\}$. If $\psi = \omega$, then there exists an infinite number of solutions of the form $(x_k^* + \alpha_k, y_k^* + \alpha_k)$ for all $\alpha_k \in \mathbb{R}_+$ meeting the two aforementioned equality constraints.
Without loss of generality, suppose that $h_1(0, 0) \le \cdots \le h_n(0, 0)$.
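In each case, and for the corresponding indexes $i$ and $j$,
Proof: The proof of statement (i) consists of two steps. In the first step, we show that the optimization problem stated in the lemma is convex; then we apply Karush-Kuhn-Tucker (KKT) conditions to said problem, and finally reformulate these conditions into a reduced number of equations. The bulk of this proof comes later, in the second step, where we proceed to solve the system of equations for the two cases considered in the lemma, $\psi < \omega$ and $\psi = \omega$. Lastly, statements (ii) and (iii) follow from (i).
To see that the problem is convex, simply observe that the objective function is convex on account of $H(f_k) \succeq 0$, and that the inequality and equality constraint functions are affine. Since the objective and constraint functions are also differentiable and Slater's constraint qualification holds, KKT conditions are necessary and sufficient conditions for optimality [41]. Systematic application of these optimality conditions leads to the Lagrangian cost,
$$\mathcal{L} = \sum_k f_k(x_k, y_k) - \sum_k \lambda_k x_k - \sum_k \mu_k y_k - \sum_k \nu_k (\kappa_k + x_k - y_k) - \psi \Big(\sum_k x_k - \eta\Big) + \omega \Big(\sum_k y_k - \theta\Big),$$
and finally to the conditions
$$\begin{aligned}
&x_k \ge 0,\quad y_k \ge 0,\quad \kappa_k + x_k - y_k \ge 0,\quad \textstyle\sum_k x_k = \eta,\quad \sum_k y_k = \theta && \text{(primal feasibility)},\\
&\lambda_k \ge 0,\quad \mu_k \ge 0,\quad \nu_k \ge 0 && \text{(dual feasibility)},\\
&\lambda_k x_k = 0,\quad \mu_k y_k = 0,\quad \nu_k (\kappa_k + x_k - y_k) = 0 && \text{(complementary slackness)},\\
&h_k(x_k, y_k) - \lambda_k - \nu_k - \psi = 0,\quad -h_k(x_k, y_k) - \mu_k + \nu_k + \omega = 0 && \text{(dual optimality)}.
\end{aligned}$$
Because $\lim_{\kappa_k + x_k - y_k \to 0} h_k(x_k, y_k) = -\infty$, it follows from the dual optimality conditions that $\kappa_k + x_k - y_k > 0$, which implies, by complementary slackness, that $\nu_k = 0$. Subsequently, we may rewrite the dual optimality conditions as $\lambda_k = h_k(x_k, y_k) - \psi$ and $\mu_k = \omega - h_k(x_k, y_k)$. Lastly, we substitute the above expressions of $\lambda_k$ and $\mu_k$ into the complementary slackness conditions, so that we can formulate the dual optimality and complementary slackness conditions equivalently as
$$h_k(x_k, y_k) \ge \psi, \tag{4}$$
$$h_k(x_k, y_k) \le \omega, \tag{5}$$
$$\big(h_k(x_k, y_k) - \psi\big)\, x_k = 0, \tag{6}$$
$$\big(\omega - h_k(x_k, y_k)\big)\, y_k = 0. \tag{7}$$
In the following, we shall proceed to solve these equations which, together with the primal and dual feasibility conditions, are necessary and sufficient conditions for optimality. To this end, first note that, if $\psi > \omega$, then there exists no $(x_k, y_k)$ that satisfies equations (4) and (5) at the same time, and consequently, as stated in part (i) of the lemma, there is no solution. Concordantly, next we shall study the case when $\psi < \omega$; afterwards we shall tackle the other case when $\psi = \omega$.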
Before plunging into the analysis of the former case, recall that the function $h_k$ is strictly increasing in $x_k$ and strictly decreasing in $y_k$. Having said this, observe that, under the assumption $\psi < \omega$, the variables $x_k$ and $y_k$ cannot be positive simultaneously by virtue of equations (6) and (7). Bearing this in mind, consider these three possibilities for each $k$: $h_k(0, 0) < \psi$, $\psi \le h_k(0, 0) \le \omega$ and $\omega < h_k(0, 0)$.
When $h_k(0, 0) < \psi$, the only conclusion consistent with (4) and with the fact that $h_k$ is strictly increasing in $x_k$ is that $x_k > 0$. Since $x_k$ must be positive, the complementary slackness condition (6) implies that $h_k(x_k, y_k) = \psi$ and, because of (7), that $y_k = 0$. As a result, $x_k$ must satisfy $h_k(x_k, 0) = \psi$, or equivalently, $x_k = h_k^{-1}(\psi)$. Next, we show that the solution $(x_k, 0)$ is unique. For this purpose, suppose that $y_k > 0$ and, in consequence, that $x_k = 0$. It follows from (7), however, that $h_k(0, y_k) = \omega$, which contradicts the fact that $h_k$ is a strictly decreasing function of $y_k$, since then $h_k(0, y_k) < h_k(0, 0) < \psi < \omega$. In the end, we verify that $x_k = y_k = 0$ does not satisfy (4) and thus prove that $(x_k, y_k) = (h_k^{-1}(\psi), 0)$ is the unique minimizer of the objective function when $h_k(0, 0) < \psi$.
Now consider the case when $\psi \le h_k(0, 0) \le \omega$. First, suppose that $x_k > 0$, and therefore that $y_k = 0$. By complementary slackness, it follows that $h_k(x_k, 0) = \psi$, which is not consistent with the fact that $h_k$ is strictly increasing in $x_k$, because $h_k(x_k, 0) > h_k(0, 0) \ge \psi$. Consequently, $x_k$ cannot be positive. Secondly, assume that $x_k$ is zero and $y_k$ positive. Under this assumption, equation (7) implies that $h_k(0, y_k) = \omega$, a contradiction since $h_k$ is a strictly decreasing function of $y_k$ and hence $h_k(0, y_k) < h_k(0, 0) \le \omega$. Accordingly, $y_k$ cannot be positive either. Finally, check that $x_k = y_k = 0$ satisfies the optimality conditions and hence it is the unique solution.
The last possibility corresponds to the case when $\omega < h_k(0, 0)$. Note that, in this case, the only conclusion consistent with (5) and with the fact that $h_k$ is strictly decreasing in $y_k$ is that $y_k > 0$. Thus, because of (7), $y_k$ must satisfy $h_k(0, y_k) = \omega$.
Lastly, we check that this solution is unique in the case under study. To this end, note that a solution such that $x_k > 0$ and $y_k = 0$ contradicts the fact that $h_k$ is strictly increasing in $x_k$. As a result, $x_k$ cannot be positive. Finally, we confirm that equation (5) does not hold for $x_k = y_k = 0$ and therefore prove that $(x_k, y_k) = (0, -h_k^{-1}(\omega))$ is the unique solution when $\omega < h_k(0, 0)$.
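In summary,
$$(x_k, y_k) = \begin{cases} \big(h_k^{-1}(\psi),\, 0\big), & h_k(0, 0) < \psi, \\ (0,\, 0), & \psi \le h_k(0, 0) \le \omega, \\ \big(0,\, -h_k^{-1}(\omega)\big), & \omega < h_k(0, 0). \end{cases}$$
Accordingly, we may write the solution compactly as
$$x_k = \max\{0,\, h_k^{-1}(\psi)\}, \qquad y_k = \max\{0,\, -h_k^{-1}(\omega)\},$$
where $\psi, \omega$ must satisfy the primal equality constraints $\sum_k x_k = \eta$ and $\sum_k y_k = \theta$.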
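The three cases analyzed above for $\psi < \omega$ can be sketched numerically. The following Python code is a minimal illustration under a loudly stated assumption: it takes the hypothetical choice $h_k(x_k, y_k) = \log\big((\kappa_k + x_k - y_k)/p_k\big)$, which meets the monotonicity, shift and limit requirements of the lemma, and finds $\psi$ and $\omega$ by bisection so that the equality constraints hold. The function names, the bisection brackets and the example numbers are not taken from this paper.

```python
import math


def solve_allocation(kappa, p, eta, theta, iters=200):
    """Case psi < omega, for the hypothetical h_k(x, y) = log((kappa_k + x - y) / p_k).

    Assumes kappa_k, p_k > 0 and 0 <= theta < sum(kappa).
    """
    n = len(kappa)

    def h_inv(k, v):
        # Inverse of x -> h_k(x, 0) = log((kappa_k + x) / p_k).
        return p[k] * math.exp(v) - kappa[k]

    def x_of(psi):
        # x_k = h_k^{-1}(psi) if h_k(0, 0) < psi, and 0 otherwise.
        return [max(0.0, h_inv(k, psi)) for k in range(n)]

    def y_of(omega):
        # y_k = -h_k^{-1}(omega) if omega < h_k(0, 0), and 0 otherwise.
        return [max(0.0, -h_inv(k, omega)) for k in range(n)]

    def bisect(g, lo, hi):
        # g is nondecreasing with g(lo) <= 0 <= g(hi).
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    h0 = [math.log(kappa[k] / p[k]) for k in range(n)]  # values h_k(0, 0)
    sk, sp = sum(kappa), sum(p)
    psi = bisect(lambda v: sum(x_of(v)) - eta, min(h0), math.log((sk + eta) / sp))
    omega = bisect(lambda v: theta - sum(y_of(v)), math.log((sk - theta) / sp), max(h0))
    if psi >= omega:
        raise ValueError("inputs fall outside the psi < omega case covered by this sketch")
    return x_of(psi), y_of(omega), psi, omega


# Hypothetical example with n = 3.
kappa = [0.6, 0.3, 0.1]
p = [0.2, 0.3, 0.5]
x, y, psi, omega = solve_allocation(kappa, p, eta=0.1, theta=0.1)
```

In this example the components with the smallest $h_k(0, 0)$ receive a positive $x_k$, those with the largest $h_k(0, 0)$ receive a positive $y_k$, and the remaining ones are left untouched, in agreement with the case analysis. The case $\psi = \omega$ is deliberately not handled.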
Having examined the case when $\psi < \omega$, next we proceed to solve the optimality conditions at hand for $\psi = \omega$. Observe that, in this new case, (4) and (5) transform into the equation
$$h_k(x_k, y_k) = \psi. \tag{8}$$
Moreover, note that any pair $(x_k, y_k)$ satisfying (8) also meets the complementary slackness conditions (6) and (7). However, notice that this does not mean that all those pairs are optimal. To elaborate on this point, consider the following three possibilities for each $k$: $h_k(0, 0) < \psi$, $h_k(0, 0) = \psi$ and $\psi < h_k(0, 0)$.
In the case when $h_k(0, 0) < \psi$, the only condition consistent with (8) and with the fact that $h_k$ is strictly increasing in $x_k$ is that $x_k > 0$. From the lemma, it is immediate that $\frac{\partial h_k}{\partial x_k} = -\frac{\partial h_k}{\partial y_k}$, which implies that $x_k$ must also be greater than $y_k$. Hence, the set of solutions is
$$\{(x_k, y_k) : h_k(x_k, y_k) = \psi,\ x_k > y_k \ge 0\},$$
where every pair in this set must also fulfill the primal equality conditions. Let $x_k$ satisfy $h_k(x_k, 0) = \psi$, or equivalently, $x_k = h_k^{-1}(\psi)$. Then, because $h_k(x_k + \alpha_k, \alpha_k) = \psi$ for any $\alpha_k \ge 0$, this set may be recast equivalently as
$$\{(h_k^{-1}(\psi) + \alpha_k,\ \alpha_k) : \alpha_k \ge 0\}.$$
For the two remaining cases, i.e., $h_k(0, 0) = \psi$ and $\psi < h_k(0, 0)$, the set of solutions is obtained in a completely analogous way as above. In the former case, the pairs $(x_k, y_k)$ must satisfy $x_k = y_k$, and the set of solutions may be expressed as
$$\{(\alpha_k,\ \alpha_k) : \alpha_k \ge 0\}.$$
In the latter case, it follows that $y_k > x_k$ and, consequently, that the set of solutions is
$$\{(\alpha_k,\ -h_k^{-1}(\psi) + \alpha_k) : \alpha_k \ge 0\}.$$
To sum up, the case $\psi = \omega$ leads to the following solutions:
$$x_k = \max\{0,\, h_k^{-1}(\psi)\} + \alpha_k, \qquad y_k = \max\{0,\, -h_k^{-1}(\omega)\} + \alpha_k,$$
for some $\psi, \omega$ and nonnegative sequence $\alpha_1, \ldots, \alpha_n$ such that $\sum_k x_k = \eta$ and $\sum_k y_k = \theta$. Note that, although $\psi = \omega$, we intentionally write $\omega$ instead of $\psi$ to highlight that the solutions for $\psi < \omega$ and for $\psi = \omega$ just differ in the term $\alpha_k$, as we claimed in part (i) of the lemma.
Note that the particular case when the index i ranges from 1 to j − 1 and the index j goes from 2 to n is the case described in (ii) (a), which corresponds to η, θ > 0. Further, observe that the case assumed in (ii) (b), i.e., when j = n + 1, implies that θ = 0. Here, the index i starts at 1, therefore excluding η = 0, and ends at n, including the possibility that x i > 0 for all i. In part (ii) (c), we consider i = 0, which is equivalent to the condition η = 0. In this case, the index j starts at 1, permitting y j > 0 for all j, and ends at n, avoiding θ = 0. Finally, the case described in (ii) (d), namely when j = n + 1 and i = 0, is precisely the trivial case x = y = 0.
In order to verify statement (iii), we proceed analogously, now assuming that ψ = h_{i+1}(0, 0) = · · · = h_{j−1}(0, 0) holds for some i = 1, . . . , j − 2 and some j = 3, . . . , n.
The previous lemma presented the solution to a resource allocation problem that minimizes a rather general but convex objective function, subject to affine constraints. Our next theorem, Theorem 3, applies the results of this lemma to the special case of the objective function of problem (1). In doing so, we shall confirm the intuition that there must exist a set of ordered pairs (ρ, σ) where the privacy risk vanishes and another set where it does not. We shall refer to the former set as the critical-privacy region and formally define it as C = {(ρ, σ) : R(ρ, σ) = 0}. The latter set will be the complementary set C̄, and we shall refer to it as the noncritical-privacy region.
Before proceeding with Theorem 3, first we shall introduce what we term forgery and suppression thresholds, two sequences of rates that will play a fundamental role in the characterization of the solution to the minimization problem defining the privacy-forgery-suppression function. Secondly, we shall investigate certain properties of these thresholds in Proposition 2. And thereafter, we shall introduce some definitions that will facilitate the exposition of the aforementioned theorem.
Let Q_i = Σ_{k=1}^{i} q_k and P_i = Σ_{k=1}^{i} p_k be the cumulative distribution functions corresponding to q and p. Denote by Q̄_i = Σ_{k=i}^{n} q_k and P̄_i = Σ_{k=i}^{n} p_k the complementary cumulative distribution functions of q and p. Define the forgery thresholds as
$\rho_i = \frac{q_i}{p_i} P_i - Q_i$ for i = 1, . . . , j − 1, and $\rho_j(\sigma) = \frac{P_{j-1}}{\bar{P}_j} (\bar{Q}_j - \sigma) - Q_{j-1}$,
for j = 2, . . . , n. Additionally, define the suppression thresholds as
$\sigma_j = \bar{Q}_j - \frac{q_j}{p_j} \bar{P}_j$
for j = 1, . . . , n, and σ_0 = 1. Observe that ρ_1 = σ_n = 0 and that the forgery threshold ρ_j is a linear function of σ. We shall refer to this latter threshold as the critical forgery-suppression threshold and denote it also by ρ_crit(σ). The reason is that said threshold will determine the boundary of the critical-privacy region, as we shall see later. The following result, Proposition 2, characterizes the monotonicity of the forgery and the suppression thresholds.
(iii) Further, for any j = 2, . . . , n and any σ ∈ (σ_j, σ_{j−1}], the critical forgery-suppression threshold satisfies ρ_j(σ) ≥ ρ_{j−1}, with equality if, and only if, σ = σ_{j−1}.
Proof: The first statement can be shown from the definition of the forgery thresholds by routine algebraic manipulation and under the labeling assumption (3). To this end, it is helpful to note that $\rho_{i+1} - \rho_i = P_i \left( \frac{q_{i+1}}{p_{i+1}} - \frac{q_i}{p_i} \right)$. The second statement can be shown analogously, observing that $\sigma_{j-1} - \sigma_j = \bar{P}_j \left( \frac{q_j}{p_j} - \frac{q_{j-1}}{p_{j-1}} \right)$. For the last statement, use the definitions of the forgery and the suppression thresholds to note that the condition ρ_j(σ) ≥ ρ_{j−1} is equivalent to σ ≤ σ_{j−1}.
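As an illustration, the thresholds can be computed in a few lines of Python. This is a minimal sketch and not the authors' code. It assumes the closed forms ρ_i = (q_i/p_i) P_i − Q_i, σ_j = Q̄_j − (q_j/p_j) P̄_j and ρ_j(σ) = (P_{j−1}/P̄_j)(Q̄_j − σ) − Q_{j−1}, which are our reading of the threshold definitions and appear consistent with the values quoted in the numerical example of Sec. IV-F. It also assumes that q and p are already sorted by nondecreasing q_k/p_k. The function names and the toy distributions are hypothetical.

```python
from itertools import accumulate


def thresholds(q, p):
    """Forgery thresholds rho_1..rho_n and suppression thresholds sigma_1..sigma_n.

    ASSUMPTION: closed forms reconstructed from the text, with q and p sorted
    so that q[k] / p[k] is nondecreasing.
    """
    Q = list(accumulate(q))                       # Q_i = q_1 + ... + q_i
    P = list(accumulate(p))                       # P_i = p_1 + ... + p_i
    Qbar = list(accumulate(reversed(q)))[::-1]    # Qbar_i = q_i + ... + q_n
    Pbar = list(accumulate(reversed(p)))[::-1]    # Pbar_i = p_i + ... + p_n
    rho = [q[k] / p[k] * P[k] - Q[k] for k in range(len(q))]
    sigma = [Qbar[k] - q[k] / p[k] * Pbar[k] for k in range(len(q))]
    return rho, sigma


def rho_crit(q, p, sigma, j):
    """Critical forgery rate for sigma in (sigma_j, sigma_{j-1}]; j is 1-based, 2 <= j <= n."""
    P_low, Q_low = sum(p[:j - 1]), sum(q[:j - 1])      # P_{j-1}, Q_{j-1}
    P_high, Q_high = sum(p[j - 1:]), sum(q[j - 1:])    # Pbar_j, Qbar_j
    return P_low / P_high * (Q_high - sigma) - Q_low


# Hypothetical toy profiles with ratios q/p = (1/3, 1, 3/2).
q = (0.1, 0.3, 0.6)
p = (0.3, 0.3, 0.4)
rho, sigma = thresholds(q, p)
```

In this toy case the critical threshold evaluated at σ = σ_2 coincides with ρ_2, which is the equality case of statement (iii).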
Before investigating a closed-form solution to problem (1), we introduce some definitions for ease of presentation. For i = 1, . . . , j − 1 and j = 2, . . . , n, define
q̃ = (Q_i, q_{i+1}, . . . , q_{j−1}, Q̄_j),
r̃ = (ρ, 0, . . . , 0, 0),
s̃ = (0, 0, . . . , 0, σ),
p̃ = (P_i, p_{i+1}, . . . , p_{j−1}, P̄_j),
where q̃ and p̃ are distributions in the probability simplex of j − i + 1 dimensions, and r̃ and s̃ are tuples of the same dimension that represent a forgery strategy and a suppression strategy, respectively. Particularly, note that the indexes i = 1 and j = n lead to q̃ = q and p̃ = p.
(ii) For any (ρ, σ) ∈ cl C̄, either ρ ∈ [ρ_i, ρ_{i+1}] for i = 1 or ρ ∈ (ρ_i, ρ_{i+1}] for some i = 2, . . . , j − 1, and either σ ∈ [σ_j, σ_{j−1}] for j = n or σ ∈ (σ_j, σ_{j−1}] for some j = 2, . . . , n − 1. Then, for the corresponding indexes i, j, the optimal forgery and suppression strategies are
$r^*_k = \frac{p_k}{P_i} (Q_i + \rho) - q_k$ for k = 1, . . . , i, and $r^*_k = 0$ for k = i + 1, . . . , n,
$s^*_k = 0$ for k = 1, . . . , j − 1, and $s^*_k = q_k - \frac{p_k}{\bar{P}_j} (\bar{Q}_j - \sigma)$ for k = j, . . . , n,
and the corresponding, minimum KL divergence yields the privacy-forgery-suppression function
$R(\rho, \sigma) = D\left( \frac{\tilde{q} + \tilde{r} - \tilde{s}}{1 + \rho - \sigma} \,\Big\|\, \tilde{p} \right)$.
Proof: The proof is structured as follows. We begin by showing that the optimization problem (1) may be construed as a particular case of that stated in Lemma 1. Accordingly, we apply this lemma, namely the cases (ii) and (iii), to obtain the optimal forgery and suppression strategies. The application of the former case allows us to derive the solution for (ρ, σ) ∈ C̄. The latter case enables us, first, to confirm that this solution is also valid on ∂C̄, and secondly, to prove statement (i). Lastly, we complete the proof of (ii) by expressing function (1) in terms of the optimal apparent distribution.
Use the definition of KL divergence to write the objective function of the optimization problem as a sum of per-category terms f_k(r_k, s_k). Then, note that the functions f_k and h_k satisfy the assumptions of Lemma 1, and that the inequality and equality constraints of problem (1) coincide with those in the lemma. This exposes the structure of the optimization problem as a special case of the resource allocation lemma.
Before proceeding any further, notice from (10) that h k (r k , 0) is a strictly increasing function of r k and hence invertible. Note also that, according to the lemma, the solutions are completely determined by the inverse of this function, which is denoted by h −1 k and yields Finally, observe that the assumption h 1 (0, 0) · · · h n (0, 0) in the lemma is equivalent to the labeling assumption (3), as h k (0, 0) is a strictly increasing function of q k p k . Next we apply Lemma 1 (ii), where it is assumed the condition ψ < ω. We start with case (ii) (a). On account of part (i) of the lemma, the optimal forgery strategy must satisfy or equivalently, Analogously for the suppression strategy, Then it suffices to substitute the expressions of ψ and ω into the function h −1 k , to obtain the nonzero optimal solutions claimed in assertion (ii) of the theorem. Now we proceed to confirm the interval of values of ρ and σ where these solutions are defined. In the case under study, ψ and ω satisfy h i (0, 0) < ψ h i+1 (0, 0) for some i = 1, . . . , j − 1 and h j−1 (0, 0) ω < h j (0, 0) for some j = 2, . . . , n. We split the discussion into two cases, namely i < j − 1 and i = j − 1.
Assume the former case. Observe that the condition h_i(0, 0) < ψ is equivalent, after routine algebraic manipulation, to ρ > ρ_i. Similarly, the upper-bound condition ψ ≤ h_{i+1}(0, 0) leads to ρ ≤ ρ_{i+1}. Hence, the intervals resulting from imposing h_i(0, 0) < ψ ≤ h_{i+1}(0, 0) are of the form (ρ_i, ρ_{i+1}]. The monotonicity of the thresholds ρ_i, demonstrated in Proposition 2, guarantees that these intervals are contiguous and nonoverlapping. In an analogous manner, it can be shown that the condition h_{j−1}(0, 0) ≤ ω < h_j(0, 0) leads to intervals of the form (σ_j, σ_{j−1}], also contiguous and nonoverlapping by virtue of Proposition 2.
Now assume the latter case, where h_i(0, 0) < ψ < ω < h_j(0, 0) with i = j − 1. On the one hand, the assumption h_{j−1}(0, 0) < ψ is, as shown above, equivalent to the condition ρ > ρ_{j−1}. On the other hand, straightforward manipulation allows us to write the inequality ψ < ω as ρ < ρ_crit(σ). Combining these two bounds on ψ, we obtain the interval (ρ_{j−1}, ρ_crit(σ)). With this last interval, we complete the range of validity of the solution for the case (ii) (a) in the lemma. Ultimately, it is easy to verify that, in those intervals of ρ and σ, the optimal apparent profile t = (q + r − s)/(1 + ρ − σ) does not coincide with the population's profile p. In consequence, D(t ‖ p) > 0.
Next, we turn to case (ii) (b) of the lemma. Here, the assumption h_n(0, 0) ≤ ω leads to σ = 0, or equivalently, to the solution s = 0. Note that, precisely, this is the solution given in the theorem for σ = σ_j with j = n. On the other hand, the application of the condition Σ_{k=1}^{i} r_k = ρ results in the same optimal forgery strategy obtained in case (ii) (a). Proceeding analogously as in this case, from the assumptions on ψ we derive the intervals of values of ρ where the solution is defined: (ρ_i, ρ_{i+1}] for i = 1, . . . , n − 1 and (ρ_i, ρ_{i+1}) for i = n. Given these intervals, it is then straightforward to check that R(ρ, 0) = 0 if, and only if, ρ ≥ ρ_n. This provides us with the pairs (ρ, 0) that belong to cl C̄.
In case (ii) (c), the condition ψ ≤ h_1(0, 0) means that ρ = 0, or equivalently, r = 0. Observe that this is the solution stated in the theorem for ρ = ρ_i with i = 1. Then again, the condition Σ_{k=j}^{n} s_k = σ leads to the same optimal suppression strategy found in case (ii) (a). From the assumptions in the lemma on ω, we obtain the intervals (σ_j, σ_{j−1}] for j = 2, . . . , n and (σ_j, σ_{j−1}) for j = 1. Then, we verify that R(0, σ) = 0 if, and only if, σ ≥ σ_1, from which the pairs (0, σ) that belong to cl C̄ follow.
Finally, the case (ii) (d) in the lemma, in which h_n(0, 0) ≤ ω and ψ ≤ h_1(0, 0), corresponds to the trivial case σ = σ_j for j = n and ρ = ρ_i for i = 1, that is, the solution r = s = 0.
After having applied Lemma 1 (ii) to function (1), now we proceed with case (iii) (a). In applying it, we shall show that the solution claimed in the theorem is also valid for the extreme values of the intervals in case (ii) (a), specifically the set . . , n, and σ ∈ (σ j , σ j−1 ) for j = 2}.
Fig. 2: A user's item distribution is perturbed according to two optimal forgery and suppression strategies, in order for the resulting profile to minimize the KL divergence with respect to the population's distribution. (Plotted: the actual user profile q, the population's profile p, the optimal apparent profile t*, the optimal forgery strategy r* and the optimal suppression strategy s*.)
Now we assume the latter possibility, i.e., ρ > ρ_crit(σ), to show that the privacy-risk function also vanishes for these values of ρ and σ. On account of part (iii) (a) of the lemma and (11), we derive the optimal forgery and suppression strategies r k = p k (1 + ρ crit (σ) − σ) + p k ζ P j − q k + α k and s k = α k for k = 1, . . . , j − 1, and r k = α k for k = j, . . . , n. Then, we substitute r and s back into the apparent profile t and check that D(t ‖ p) = 0. In doing so, we determine the nonnegative pairs (ρ, σ) that belong to cl C̄, and finally obtain the expression for the boundary of the critical-privacy region claimed in statement (i) of the theorem.
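To conclude the proof, it remains only to write the privacy-risk function R(ρ, σ) = Σ_{k=1}^{n} t_k log(t_k/p_k) in terms of the optimal apparent distribution. With this aim, we split the summation into three parts. The first part, corresponding to $t_k = \frac{p_k (Q_i + \rho)}{P_i (1 + \rho - \sigma)}$ for k = 1, . . . , i, yields
$\sum_{k=1}^{i} t_k \log \frac{t_k}{p_k} = \frac{Q_i + \rho}{1 + \rho - \sigma} \log \frac{Q_i + \rho}{P_i (1 + \rho - \sigma)}$,
where we leverage the fact that t_k/p_k does not depend on k. The second part of the sum, corresponding to $t_k = \frac{q_k}{1 + \rho - \sigma}$ for k = i + 1, . . . , j − 1, yields
$\sum_{k=i+1}^{j-1} \frac{q_k}{1 + \rho - \sigma} \log \frac{q_k}{p_k (1 + \rho - \sigma)}$.
The last part, corresponding to $t_k = \frac{p_k (\bar{Q}_j - \sigma)}{\bar{P}_j (1 + \rho - \sigma)}$ for k = j, . . . , n, yields
$\sum_{k=j}^{n} t_k \log \frac{t_k}{p_k} = \frac{\bar{Q}_j - \sigma}{1 + \rho - \sigma} \log \frac{\bar{Q}_j - \sigma}{\bar{P}_j (1 + \rho - \sigma)}$,
where we note that t_k/p_k does not depend on k either. Now, it is straightforward to identify the terms of R(ρ, σ) as the KL divergence between the distributions $\frac{\tilde{q} + \tilde{r} - \tilde{s}}{1 + \rho - \sigma}$ and $\tilde{p}$, precisely the distributions stated in the theorem.
In light of Theorem 3, we would like to remark on the intuitive principle that both the optimal forgery and suppression strategies follow. On the one hand, the forgery strategy suggests adding ratings to those categories with a low ratio q_k/p_k, that is, to those in which the user's interest is considerably lower than the population's. On the other hand, the suppression strategy recommends eliminating ratings from those categories where the ratio q_k/p_k is high, i.e., where the interest of the user exceeds that of the population.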
Another straightforward consequence of Theorem 3 is the role of the forgery and the suppression thresholds. In particular, we identify ρ_i as the forgery rate beyond which the components r_k for k = 1, . . . , i become positive. A similar reasoning applies to σ_j, which indicates the suppression rate beyond which the components s_k for k = j, . . . , n are positive. In a nutshell, these thresholds determine the number of nonzero components of the optimal strategies. Also, from this theorem we deduce that the perturbation of the user profile does not only affect those categories where either r_k > 0 or s_k > 0. In fact, since we are dealing with relative frequencies, the components of the apparent distribution t_k belonging to the categories k = i + 1, . . . , j − 1 are normalized by 1/(1 + ρ − σ). Fig. 2 illustrates these three conclusions by means of a simple example with n = 5 categories of interest.
In this example we consider a user who is disposed to submit a percentage of false ratings ρ ∈ (ρ_2, ρ_3], and to refrain from sending a fraction of genuine ratings σ ∈ (σ_4, σ_3]. Given these rates, the optimal forgery strategy recommends that the user forge ratings belonging to the categories 1 and 2, where clearly there is a lack of interest, compared to the reference distribution. On the contrary, the suppression strategy specifies that the user eliminate ratings from the categories 4 and 5, that is, from those categories where they show too much interest, again compared to the population's profile. In adopting these two strategies, the apparent user profile approaches the population's distribution, especially in those components where the ratio q_k/p_k deviates significantly from 1. Finally, the component of the apparent profile t_3, which is not directly affected by the forgery and the suppression strategies, gets closer to p_3 as a result of the aforementioned normalization.
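The behavior described in this example can be reproduced with a small water-filling routine. The sketch below is only illustrative and is not the authors' implementation. It assumes that forged mass raises the lowest ratios q_k/p_k to a common level, that suppressed mass lowers the highest ratios to a common level, and that the apparent profile is t = (q + r − s)/(1 + ρ − σ). It further assumes that q and p are sorted by nondecreasing q_k/p_k, and it measures the KL divergence in bits. All function names and the toy profiles used to exercise it are hypothetical.

```python
import math


def kl_bits(t, p):
    """KL divergence D(t || p) in bits."""
    return sum(tk * math.log2(tk / pk) for tk, pk in zip(t, p) if tk > 0)


def _water_level(q, p, budget, sign):
    """Common ratio reached by the leading categories after moving `budget` mass.

    sign=+1 adds mass to the lowest ratios (forgery); sign=-1 removes mass
    from the highest ratios (suppression, lists passed in reverse order).
    """
    Q = P = 0.0
    count = 0
    for qk, pk in zip(q, p):
        if count and sign * ((Q + sign * budget) / P - qk / pk) <= 0:
            break  # the current level does not reach the next category
        Q, P, count = Q + qk, P + pk, count + 1
    return count, (Q + sign * budget) / P


def optimal_strategies(q, p, rho, sigma):
    """Return (r, s, t) for rates up to the critical threshold."""
    n = len(q)
    i, low = _water_level(q, p, rho, +1)                  # forged categories 1..i
    m, high = _water_level(q[::-1], p[::-1], sigma, -1)   # last m categories suppressed
    if low > high + 1e-12:
        raise ValueError("rates beyond the critical threshold: the risk is already zero")
    r = [p[k] * low - q[k] if k < i else 0.0 for k in range(n)]
    s = [q[k] - p[k] * high if k >= n - m else 0.0 for k in range(n)]
    t = [(q[k] + r[k] - s[k]) / (1 + rho - sigma) for k in range(n)]
    return r, s, t
```

For instance, calling `optimal_strategies((0.1, 0.3, 0.6), (0.3, 0.3, 0.4), 0.1, 0.1)` forges only in the first category and suppresses only in the last one, so the two strategies never act on the same category.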
In the following subsections, we shall analyze a number of important consequences of Theorem 3.

B. Orthogonality, Continuity and Proportionality
In this subsection we study some interesting properties of the closed-form solution obtained in Sec. IV-A. Specifically, we investigate the orthogonality and continuity of the optimal forgery and suppression strategies, and then establish a proportionality relationship between the optimal apparent user profile and the population's distribution.

Corollary 4 (Orthogonality and Continuity):
(i) For any (ρ, σ) ∈ cl C̄, the optimal forgery and suppression strategies satisfy r*_k s*_k = 0 for k = 1, . . . , n.
(ii) The components of r* and s*, interpreted as functions of ρ and σ respectively, are continuous on cl C̄.
Proof: The proof of (i) is immediate from Theorem 3. To prove statement (ii) we also resort to this theorem. According to it, each component r*_k may be regarded as a piecewise function of ρ defined on the contiguous, nonoverlapping intervals [ρ_i, ρ_{i+1}] for i = 1 and (ρ_i, ρ_{i+1}] for i = 2, . . . , j − 1. A direct verification shows that, for any k = j, . . . , n, the component r*_k is identically zero on the whole interval [ρ_1, ρ_j] and hence continuous. For any k = 1, . . . , j − 1, we immediately check the continuity of r*_k on the interior of each of the intervals parameterized by i. Now we examine the endpoints of such intervals. The continuity at the extreme points ρ_1 and ρ_j is verified straightforwardly as the intervals are closed at these points. Then, we check that the right-hand limit of r*_k at each of the remaining endpoints ρ_i exists for i = 2, . . . , j − 1. Because each limit coincides with the corresponding value r*_k(ρ_i), we prove the continuity of the components r*_1, . . . , r*_{j−1}. The proof of the continuity of the components of s* is analogous to that of r*.
The orthogonality of the optimal forgery and suppression strategies, in the sense indicated by Corollary 4 (i), conforms to intuition: it would not make any sense to submit false ratings to items of a particular category and, at the same time, eliminate genuine ratings from this category. This intuitive result is illustrated in Fig. 2. The second part of Corollary 4 is applied to show our next result, Proposition 5.
(i) For any j = 2, . . . , n and i = 1, . . . , j − 1, and for any σ ∈ [σ_j, σ_{j−1}] and ρ ∈ [ρ_i, ρ_{i+1}], the optimal apparent profile t* and the population's distribution p satisfy
$\varphi(\rho, \sigma) = \frac{t^*_1}{p_1} = \cdots = \frac{t^*_i}{p_i} \le \frac{t^*_{i+1}}{p_{i+1}} \le \cdots \le \frac{t^*_{j-1}}{p_{j-1}} \le \frac{t^*_j}{p_j} = \cdots = \frac{t^*_n}{p_n} = \chi(\rho, \sigma)$,
where $\varphi(\rho, \sigma) = \frac{Q_i + \rho}{P_i (1 + \rho - \sigma)}$ and $\chi(\rho, \sigma) = \frac{\bar{Q}_j - \sigma}{\bar{P}_j (1 + \rho - \sigma)}$.
(ii) The function φ is continuous and strictly increasing in each of its arguments, and satisfies φ(ρ, σ) ≤ 1, with equality if, and only if, (ρ, σ) = (ρ_j(σ), σ).
(iii) The function χ is continuous and strictly decreasing in each of its arguments, and satisfies χ(ρ, σ) ≥ 1, with equality if, and only if, (ρ, σ) = (ρ_j(σ), σ).
Proof: The continuity of the components of t* on cl C̄ follows from Corollary 4 (ii). This allows us to write the intervals in Theorem 3 as [ρ_i, ρ_{i+1}] and [σ_j, σ_{j−1}], in lieu of (ρ_i, ρ_{i+1}] and (σ_j, σ_{j−1}], respectively. From the expressions of r*_k and s*_k in the theorem, it is immediate to identify the ratios t*_k/p_k as either φ(ρ, σ) or χ(ρ, σ). The inner inequalities in statement (i) of this proposition also follow immediately from the labeling assumption (3). Direct manipulation shows that the outer inequalities are equivalent to ρ ≤ ρ_{i+1} and σ ≤ σ_{j−1}, respectively. This proves (i). Next, we proceed to demonstrate the strict monotonicity of φ. A simple calculation shows that $\frac{\partial \varphi}{\partial \rho} = \frac{\bar{Q}_{i+1} - \sigma}{P_i (1 + \rho - \sigma)^2}$. To prove that ∂φ/∂ρ > 0, it is sufficient to verify that Q̄_j > σ_{j−1}, or equivalently, that P̄_j q_{j−1}/p_{j−1} > 0. Then, by the positivity assumption (2), we immediately see that this latter inequality holds for any j = 2, . . . , n. The strict monotonicity of φ in σ also follows from assumption (2).
Our previous result tells us how perturbation operates. According to Proposition 5, the optimal strategies perturb the user profile in such a manner that, in those categories with the lowest and highest ratios q_k/p_k, the apparent profile becomes proportional to the population's distribution. More precisely, the common ratio t*_k/p_k increases with both ρ and σ in those categories affected by forgery, that is, k = 1, . . . , i. Exactly the opposite happens in those categories affected by suppression, k = j, . . . , n, where the common ratio t*_k/p_k decreases with both rates. This tendency continues until ρ = ρ_crit(σ), at which point t* = p. Fig. 3 illustrates this proportionality property in the case of the example depicted in Fig. 2.

C. Critical-Privacy Region
One of the results of Theorem 3 is that the boundary of the critical-privacy region is determined by the critical forgery-suppression threshold ρ_j(σ), which we also denote by ρ_crit(σ) to highlight this fact. The following proposition builds on this result and characterizes said region. In particular, Proposition 6 first examines some properties of this threshold and then investigates the convexity of the critical-privacy region.
(ii) C is convex.
Next, we prove that the slopes m_j of the linear pieces of ρ_j(σ) satisfy m_j < m_{j−1} for all j = 3, . . . , n. We proceed by contradiction, assuming that m_j ≥ m_{j−1}. Note that this inequality is equivalent to $P_{j-1} \bar{P}_{j-1} \le \bar{P}_j - \bar{P}_j \bar{P}_{j-1}$ and, after algebraic simplification, to p_{j−1} ≤ 0. This contradicts the positivity assumption (2), which also implies that m_j < 0 for all j = 2, . . . , n. Therefore, since ρ_j is a piecewise linear function defined by the strictly increasing sequence of negative slopes {m_n, . . . , m_2}, we can conclude that ρ_j is convex. This proves statement (i). The second statement follows from the first one. As ρ_j is convex, so is its epigraph, i.e., the critical-privacy region.
The conclusions drawn from Proposition 6 are illustrated in Fig. 4. In this figure we represent the critical and noncritical-privacy regions for n = 5 categories of interest; the distributions q and p assumed in this conceptual example are different from those considered in Figs. 2 and 3. That said, the figure in question shows a straightforward consequence of our previous proposition: the noncritical-privacy region is nonconvex. In this illustrative example, the sequences of forgery thresholds {ρ_1, . . . , ρ_5} and suppression thresholds {σ_5, . . . , σ_1} are strictly increasing. By Proposition 2, we can conclude then that the inequalities of the labeling assumption (3) hold strictly. Related to these thresholds is also the number of nonzero components of the optimal strategies, as follows from Theorem 3. Fig. 4 shows the sets of pairs (ρ, σ) where the number of nonzero components of r* and s* is fixed. Thus, in the triangular area shown darker, corresponding to the Cartesian product of the intervals [ρ_3, ρ_4] and [σ_4, σ_3], the solutions r* and s* have i = 3 and n − j + 1 = 2 nonzero components, respectively.

D. Case of Low Forgery and Suppression
This subsection characterizes the privacy-forgery-suppression function in the special case when ρ, σ ≈ 0.
Proposition 7 (Low Rates of Forgery and Suppression): Assume the nontrivial case in which q ≠ p. Then, there exist two indexes i, j such that 0 = ρ_1 = · · · = ρ_i < ρ_{i+1} and 0 = σ_n = · · · = σ_j < σ_{j−1}. For any ρ ∈ [0, ρ_{i+1}] and σ ∈ [0, σ_{j−1}], the number of nonzero components of the optimal forgery and suppression strategies is i and n − j + 1, respectively. Further, the gradient of the privacy-forgery-suppression function at the origin is
$\nabla R(0, 0) = \left( \log \frac{q_1}{p_1} - D(q \| p), \; D(q \| p) - \log \frac{q_n}{p_n} \right)^T$.
Proof: The existence of the indexes i and j is guaranteed by the assumption that q ≠ p. The number of nonzero components of r* and s* is immediate from Theorem 3. In view of this theorem, for any ρ ∈ [0, ρ_{i+1}] and σ ∈ [0, σ_{j−1}], we have
$R(\rho, \sigma) = D\left( \frac{\tilde{q} + \rho (1, 0, \ldots, 0) - \sigma (0, \ldots, 0, 1)}{1 + \rho - \sigma} \,\Big\|\, \tilde{p} \right)$.
The continuity of the components of r* and s* proven in Corollary 4 (ii) ensures the continuity of the privacy-forgery-suppression function on C̄. It is routine to check its differentiability in this region and to obtain its derivative with respect to σ at the origin. On account of Proposition 2, the conditions ρ_1 = · · · = ρ_i and σ_j = · · · = σ_n imply that the corresponding ratios coincide, that is, q_1/p_1 = · · · = q_i/p_i and q_j/p_j = · · · = q_n/p_n. Therefore, $\frac{\partial R}{\partial \sigma}(0, 0) = D(q \| p) - \log \frac{q_n}{p_n}$. The derivative of R with respect to ρ at ρ = σ = 0 follows analogously.
Next, we shall derive an expression for the relative decrement of the privacy-risk function at ρ, σ ≈ 0. To this end, define the forgery relative decrement factor
$\delta_\rho = -\frac{1}{R(0, 0)} \frac{\partial R}{\partial \rho}(0, 0) = 1 - \frac{\log (q_1 / p_1)}{D(q \| p)}$
and the suppression relative decrement factor
$\delta_\sigma = -\frac{1}{R(0, 0)} \frac{\partial R}{\partial \sigma}(0, 0) = \frac{\log (q_n / p_n)}{D(q \| p)} - 1$.
By dint of Proposition 7, the first-order Taylor approximation of function (1) around ρ = σ = 0 yields
$R(\rho, \sigma) \approx D(q \| p) + \left( \log \frac{q_1}{p_1} - D(q \| p) \right) \rho + \left( D(q \| p) - \log \frac{q_n}{p_n} \right) \sigma$,
or more compactly, in terms of the decrement factors,
$\frac{R(0, 0) - R(\rho, \sigma)}{R(0, 0)} \approx \delta_\rho \, \rho + \delta_\sigma \, \sigma$.
In words, the minimum and maximum ratios q_k/p_k characterize the relative reduction in privacy risk. The following result, Proposition 8, establishes a bound on these relative decrement factors.
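The low-rate behavior can be checked numerically. The following sketch is illustrative only. It assumes that, near the origin, forgery acts on the first category and suppression on the last one, that the gradient is (log(q_1/p_1) − D(q ‖ p), D(q ‖ p) − log(q_n/p_n)), and that the decrement factors are minus the gradient components divided by R(0, 0). These are our reading of Proposition 7 and of the definitions of δ_ρ and δ_σ, not formulas quoted verbatim. Logarithms are taken in base 2, and the function names are hypothetical.

```python
import math


def kl_bits(t, p):
    """KL divergence D(t || p) in bits."""
    return sum(tk * math.log2(tk / pk) for tk, pk in zip(t, p) if tk > 0)


def risk_low_rates(q, p, rho, sigma):
    """R(rho, sigma) when only category 1 is forged and only category n is suppressed."""
    t = list(q)
    t[0] += rho
    t[-1] -= sigma
    t = [tk / (1 + rho - sigma) for tk in t]
    return kl_bits(t, p)


def gradient_at_origin(q, p):
    """ASSUMED closed form of the gradient of R at rho = sigma = 0."""
    d = kl_bits(q, p)
    return math.log2(q[0] / p[0]) - d, d - math.log2(q[-1] / p[-1])


def decrement_factors(q, p):
    """Relative decrement factors: minus the gradient, divided by R(0, 0)."""
    d = kl_bits(q, p)
    g_rho, g_sigma = gradient_at_origin(q, p)
    return -g_rho / d, -g_sigma / d
```

A finite-difference quotient of `risk_low_rates` around the origin offers a simple sanity check of the assumed gradient on any pair of sorted, strictly positive profiles.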
Proof: Observe that the statement δ_ρ > 1 is equivalent to the condition q_1 < p_1. We prove this by contradiction. Suppose that q_1 > p_1. By the labeling assumption (3), it follows that q_k > p_k for all k, which leads to the contradiction that 1 = Σ_k q_k > Σ_k p_k = 1. Now assume that q_1 = p_1. Since q ≠ p, there must exist an index i such that q_i > p_i. But this implies that Σ_k q_k > Σ_k p_k, a contradiction. This proves the first part of the proposition. For the second part, note that the statement δ_σ > 0 is equivalent to
$q_1 \log \frac{q_1}{p_1} + \cdots + q_n \log \frac{q_n}{p_n} < \log \frac{q_n}{p_n}$,
and, after algebraic manipulation, to
$q_1 \log \left( \frac{q_1}{p_1} \frac{p_n}{q_n} \right) + \cdots + q_{n-1} \log \left( \frac{q_{n-1}}{p_{n-1}} \frac{p_n}{q_n} \right) < 0$.
The positivity and labeling assumptions (2), (3) ensure that all terms in the sum are nonpositive. However, the additional assumption q ≠ p implies that q_1/p_1 < q_n/p_n, which in turn implies that the first term is negative and so is, consequently, the entire summation.
Conceptually, the bound on δ_ρ tells us that the relative decrement in privacy risk is greater than the forgery rate introduced. This is under the assumption that q ≠ p and at low rates of forgery and suppression. The bound on δ_σ, however, is looser than the previous one and just ensures that an increase in the suppression rate always leads to a decrease in privacy risk, as one would expect.

E. Pure Strategies
In the previous subsections we investigated the forgery and the suppression of ratings as a mixed strategy that users may adopt to enhance their privacy. In this subsection we contemplate the case in which users may be reluctant to use these two mechanisms in conjunction; and as a consequence, they may opt for a pure strategy consisting in the application of either forgery or suppression. In this case, it would be useful to determine which is the most appropriate technique in terms of the privacy-utility trade-off posed. Our next result, Corollary 9, provides some insight on this, under the assumption that, from the user's perspective, the impact on utility due to forgery is equivalent to that caused by the effect of suppression.
(ii) The forgery and the suppression relative decrement factors satisfy δ_ρ > δ_σ if, and only if, $\sqrt{\frac{q_1}{p_1} \frac{q_n}{p_n}} < 2^{D(q \| p)}$.
Proof: Both statements are immediate from the definitions of ρ_n and σ_1 on the one hand, and δ_ρ and δ_σ on the other.
In conceptual terms, the condition ρ_n < σ_1 means that the pure forgery strategy is the most appropriate mechanism in terms of causing the minimum distortion to attain the critical-privacy region. On the other hand, the condition δ_ρ > δ_σ implies that, at low rates, the pure forgery strategy offers better privacy protection than the pure suppression strategy does. Therefore, the conclusion that follows from Corollary 9 is that, together with the quantity D(q ‖ p), the arithmetic and geometric means of the ratios q_1/p_1 and q_n/p_n determine which strategy to choose. Another interesting remark is the duality of these two ratios q_1/p_1 and q_n/p_n. The former characterizes the minimum rate for the pure suppression strategy to reach the critical-privacy region and, at the same time, it establishes the privacy gain at low forgery rates. Conversely, the latter ratio defines the critical rate of the pure forgery strategy and determines the relative decrement in privacy risk at low suppression rates.
Lastly, we would like to establish a connection between our work and that of [11], [20], where the pure forgery and suppression strategies are investigated. Denote by R_F the function derived in [11] modeling the trade-off between forgery rate and privacy risk, the latter being measured as the KL divergence between the user's apparent profile and the population's distribution. Define ρ′ as the ratio of forged ratings to the total number of ratings. Accordingly, it can be shown that ρ′ = ρ/(1 + ρ) and that R(ρ, 0) = R_F(ρ′). On the other hand, denote by P_S the function in [20] characterizing the trade-off between suppression rate and privacy gain. In this case, privacy is measured as the Shannon entropy of the user's apparent profile. Under the assumption that the population's profile is uniform, it can be proven that R(0, σ) = log n − P_S(σ). In short, our formulation of the problem of optimal forgery and suppression of ratings encompasses, as particular cases, the cited works.
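Both conversions rest on elementary identities, which the short sketch below checks numerically. It assumes that ρ is the number of forged ratings relative to the number of genuine ratings, so that forged ratings make up a fraction ρ/(1 + ρ) of all submitted ratings, and it uses the fact that the KL divergence to a uniform distribution over n categories equals log n minus the Shannon entropy. The function names are hypothetical, and the functions R_F and P_S of [11] and [20] themselves are not reproduced here.

```python
import math


def forged_fraction(rho):
    """Forged ratings over all submitted ratings, given rho = forged / genuine."""
    return rho / (1 + rho)


def entropy_bits(t):
    """Shannon entropy of the distribution t, in bits."""
    return -sum(tk * math.log2(tk) for tk in t if tk > 0)


def kl_to_uniform_bits(t):
    """KL divergence between t and the uniform distribution over len(t) categories."""
    n = len(t)
    return sum(tk * math.log2(tk * n) for tk in t if tk > 0)
```

For any apparent profile t with n components, `kl_to_uniform_bits(t)` and `math.log2(n) - entropy_bits(t)` agree up to rounding, which is the identity behind R(0, σ) = log n − P_S(σ).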

F. Numerical Example
This subsection presents a numerical example that illustrates the theoretical analysis conducted in the previous subsections. Later on in Sec. V we shall evaluate the effectiveness of our approach in a real scenario, namely in the movie recommendation system Movielens. In our numerical example we assume n = 3 categories of interest. Although the example shown here is synthetic, these three categories could very well represent interests across topics such as technology, sports and beauty. Accordingly, we suppose that the user's rating distribution is q = (0.130, 0.440, 0.430), together with a population distribution p. Note that these distributions satisfy the positivity and labeling assumptions (2), (3). From Sec. IV-A, we easily obtain the forgery thresholds ρ_1 = 0, ρ_2 ≈ 0.299 and ρ_3 ≈ 0.870 on the one hand, and on the other the suppression thresholds σ_3 = 0, σ_2 ≈ 0.171 and σ_1 ≈ 0.658. The thresholds ρ_3 and σ_1 are the critical rates of the pure strategies. If we are to reach the critical-privacy region and do not have any preference for either forgery or suppression, the fact that ρ_3 > σ_1 leads us to opt for suppression as pure strategy. However, the geometric mean of q_1/p_1 and q_3/p_3 is approximately 0.799, which is lower than 2^{D(q ‖ p)} ≈ 1.20. On account of Corollary 9, this means that the pure forgery strategy contributes to a greater reduction in privacy risk at low rates than suppression does. In fact, the gradient of the privacy-forgery-suppression function at the origin is ∇R(0, 0)^T ≈ (−1.81, −0.639), by virtue of Proposition 7. Fig. 5 shows the contour lines of this function, computed analytically from Theorem 3 and numerically (b). The region plotted in gray shades corresponds to the noncritical-privacy region C̄. The initial privacy risk is R(0, 0) ≈ 0.263. The white area represents the critical-privacy region C, where the apparent user profile coincides with the population's distribution and thus the privacy risk vanishes. An interesting observation arising from Fig. 5 is the synergistic effect of combining forgery and suppression. Just as an example, in the case when ρ = ρ_2 and σ = σ_2, the sum of these two distortion measures is lower than the critical rates of the pure strategies.
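The figures quoted in this example can be cross-checked with a few lines of Python. This is a hedged sketch: the population distribution p is not legible in this passage, so the value used below is an assumption, back-solved from the quoted critical rates through ρ_3 = q_3/p_3 − 1 and σ_1 = 1 − q_1/p_1. The threshold and gradient expressions are likewise our reading of Sec. IV-A and Proposition 7, and logarithms are taken in base 2. With these assumptions the computed quantities agree with the quoted ones to within rounding in the third significant digit.

```python
import math

q = (0.130, 0.440, 0.430)
p = (0.380, 0.390, 0.230)  # ASSUMPTION: back-solved from rho_3 and sigma_1, not quoted

ratio = [qk / pk for qk, pk in zip(q, p)]
risk0 = sum(qk * math.log2(rk) for qk, rk in zip(q, ratio))  # R(0, 0) = D(q || p), in bits

rho_3 = ratio[2] - 1                # critical rate of the pure forgery strategy
sigma_1 = 1 - ratio[0]              # critical rate of the pure suppression strategy
rho_2 = ratio[1] * p[0] - q[0]      # forgery rate at which category 2 is reached
sigma_2 = q[2] - ratio[1] * p[2]    # suppression rate at which category 2 is reached

grad = (math.log2(ratio[0]) - risk0, risk0 - math.log2(ratio[2]))
geo_mean = math.sqrt(ratio[0] * ratio[2])

# At (rho_2, sigma_2) every ratio t_k / p_k equals q_2 / p_2, hence t = p.
t = [q[0] + rho_2, q[1], q[2] - sigma_2]
t = [tk / (1 + rho_2 - sigma_2) for tk in t]
```

Under the assumed p, the pair (ρ_2, σ_2) already drives the apparent profile onto p, and ρ_2 + σ_2 is smaller than both ρ_3 and σ_1, which is the synergy noted in the text.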
Next, we examine the optimal apparent rating distribution for different values of ρ and σ. For this purpose, the user's genuine distribution q, the population's distribution p and the optimal apparent distribution t* are depicted in the probability simplices shown in Fig. 6. In each simplex, we also represent the contour lines of the KL divergence D(· ‖ p) between every distribution in the simplex and p. Further, we plot the set of feasible apparent user distributions, not necessarily optimal, for four different combinations of ρ and σ; in any of these cases, the set takes the form of a hexagon. Having said this, now we turn our attention to Fig. 6(a). In this case, the optimal forgery and suppression strategies have i = n − j + 1 = 1 nonzero component, since ρ ∈ [0, ρ_2] and σ ∈ [0, σ_2]. This places the solution t* at one vertex of the hexagon. A remarkable fact is that, for these rates, the privacy risk is approximately halved. In the end, consistently with Proposition 8, the forgery and the suppression relative decrement factors are δ_ρ ≈ 6.87 > 1 and δ_σ ≈ 2.42 > 0.
(b) The numerical method chosen is the interior-point algorithm [41] implemented by the Matlab R2012b function fmincon.

Table I
Index  Category name    Index  Category name    Index  Category name
1      animation        7      sci-fi           13     war
2      action           8      comedy           14     mystery
3      film-noir        9      thriller         15     musical
4      children's       10     fantasy          16     romance
5      adventure        11     horror           17     IMAX
6      crime            12     western          18     drama
                                                19     documentary
In the case shown in Fig. 6(b), r* still has i = 1 nonzero component, while s* contains n − j + 1 = 2 nonzero components. Geometrically, the optimal apparent distribution lies at one edge of the feasible region. This lowers privacy risk to 19% of its initial value. The case in which (ρ, σ) = (ρ_crit(σ), σ) is depicted in Fig. 6(c). Here, the number of nonzero components of r* and s* remains the same as in the previous case, but the privacy risk becomes zero. The last case, illustrated in Fig. 6(d), does not have any practical application, as R(ρ, σ) = 0 for any (ρ, σ) ∈ ∂C. In this figure we can observe that the solution t* is placed in the interior of the hexagon, and that the orthogonality principle of the strategies r* and s* stated in Corollary 4 is not satisfied.

V. EXPERIMENTAL EVALUATION
In this section we evaluate the extent to which the forgery and the suppression of ratings could enhance user privacy in a real-world recommendation system. The system chosen to conduct this evaluation is Movielens, a popular movie recommender developed by the GroupLens Research Lab [42] at the University of Minnesota. Like many other recommenders, Movielens allows users to both rate and tag movies according to their preferences. These preferences are then exploited by the recommender to suggest movies that users have not watched yet.

A. Data set
The data set that we used to assess our data-perturbative mechanism is the Movielens 10M data set [43], which contains 10 000 054 ratings and 95 580 tags. The ratings and tags included in this data set were assigned to 10 681 movies by 71 567 users. The data are organized in the form of quadruples (username, movie, rating, time), each one representing the action of a user rating a movie at a certain time. Usernames have been replaced with numbers in an attempt to anonymize the data set.
For our experiments, we only needed the data fields username and movie, together with the categories each movie belongs to. MovieLens contemplates n = 19 categories or movie genres, listed in alphabetical order as follows: action, adventure, animation, children's, comedy, crime, documentary, drama, fantasy, film-noir, horror, IMAX, musical, mystery, romance, sci-fi, thriller, war and western. As we shall see later in Sec. V-B, for each particular user, we shall have to rearrange those categories in such a way that the labeling assumption (3) is satisfied.
In our data set, every user rated at least 20 movies, which was the minimum number of ratings required for the recommender to start working. After eliminating those users who exclusively tagged movies, the total number of users was reduced to 69 878. Despite this large number of users, we found that only 4 099 satisfied the positivity assumption (2). Considering that this small group represents just 5.8% of the total number of users, we can assume that the application of our technique will have a negligible effect on the population's profile p, as supposed in Sec. III-D.
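The preprocessing described above can be sketched as follows. This is only an illustrative sketch on toy data, not the code used in the experiments: the function names are hypothetical, a rating of a movie with several genres is assumed to count once for each of its genres, the population's profile is taken as the average of the user profiles (as stated in Sec. V-B), and the positivity assumption (2) is assumed to mean that every component of a user's profile is strictly positive.

```python
from collections import Counter, defaultdict

def user_profiles(ratings, movie_genres, genres):
    """Return {user: [q_1, ..., q_n]}: relative frequency of each genre
    among the movies rated by that user."""
    counts = defaultdict(Counter)
    for user, movie in ratings:
        for genre in movie_genres[movie]:
            counts[user][genre] += 1  # assumption: one count per genre of the movie
    profiles = {}
    for user, c in counts.items():
        total = sum(c.values())
        profiles[user] = [c[g] / total for g in genres]
    return profiles

def population_profile(profiles):
    """Population's distribution p, computed by averaging the user profiles."""
    n_users = len(profiles)
    return [sum(column) / n_users for column in zip(*profiles.values())]

def satisfies_positivity(q):
    """Assumed reading of the positivity assumption (2): q_k > 0 for all k."""
    return all(x > 0 for x in q)

# Toy data (hypothetical): three genres, three movies, two users.
genres = ["action", "comedy", "drama"]
movie_genres = {1: ["action"], 2: ["comedy", "drama"], 3: ["drama"]}
ratings = [("u1", 1), ("u1", 2), ("u2", 2), ("u2", 3)]

profiles = user_profiles(ratings, movie_genres, genres)
p = population_profile(profiles)
eligible = [u for u, q in profiles.items() if satisfies_positivity(q)]
```

In this toy example only user u1 rates movies in every genre, so only u1 would be retained for the experiments.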

B. Results
In this subsection we examine how the forgery and the suppression of ratings may help users of MovieLens to enhance their privacy. With this aim, first, we analyze the effect of the perturbation of ratings on the privacy protection of a particular user from our data set. Secondly, we consider the entire set of 4 099 users and assess the relative reduction in privacy risk when these users apply the same forgery and suppression rates. Lastly, we investigate the forgery and the suppression strategies separately, and draw some conclusions about these two pure strategies.
To conduct our first experiments, we choose a particular user from our data set. Before perturbing the movie rating history of this user, the components of the user's profile q and the population's distribution p must be rearranged to satisfy the labeling assumption (3). Table I shows how movie categories have been sorted, and then indexed from 1 to n, to fulfill this assumption. Note that the indexing in this table need not coincide with that of other users in our data set.

Fig. 7: In this figure we represent (a) the item distribution q of a particular user as well as the population's item distribution p. In addition, we plot (b) the optimal forgery strategy r* and (c) the optimal suppression strategy s* that the user in question should adopt when they specify σ = 0.150 and ρ = ρ_crit(σ) ≈ 0.180.

Fig. 7(a) depicts the user profile and the population profile, the latter being computed by averaging across the 69 878 users. From this figure we note that the user's interest exceeds the population's in categories such as musical, romance, IMAX, drama and documentary. More precisely, the corresponding ratios are (q_k/p_k), k = 15, ..., 19, ≈ (1.300, 1.306, 1.451, 1.728, 2.292).
In this figure, we also observe that both the user's and the population's interest in category 17 are nearly zero, namely q_17 ≈ 0.0005 and p_17 ≈ 0.0003. On the other hand, Fig. 7(a) indicates that the user shows little interest, compared to the population's preferences, in categories such as animation, action, film-noir or children's, to name just a few. Specifically, the six smallest ratios are (q_k/p_k), k = 1, ..., 6, ≈ (0.444, 0.599, 0.651, 0.691, 0.705, 0.714).
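The relabeling step can be illustrated with a short sketch. It assumes that the labeling assumption (3) orders the categories by nondecreasing ratio q_k/p_k, which is consistent with Table I and with the ratios reported above but is not restated in this section. The function name and the toy values are hypothetical.

```python
def relabel(q, p, names):
    """Sort categories by nondecreasing ratio q_k / p_k (assumed form of the
    labeling assumption); requires p_k > 0 for every k."""
    order = sorted(range(len(q)), key=lambda k: q[k] / p[k])
    return ([q[k] for k in order],
            [p[k] for k in order],
            [names[k] for k in order])

# Toy profiles (hypothetical values, not taken from the data set).
names = ["drama", "action", "IMAX"]
q = [0.5, 0.2, 0.3]
p = [0.4, 0.4, 0.2]

q_sorted, p_sorted, names_sorted = relabel(q, p, names)
ratios = [qk / pk for qk, pk in zip(q_sorted, p_sorted)]
```

Since the ordering depends on each user's own profile q, the resulting indexing generally differs from one user to another.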

Figs. 7(b) and 7(c) show the optimal forgery and suppression strategies that this particular user should apply in the case when σ = 0.150 and ρ = ρ_crit(σ) ≈ 0.180. The solutions plotted in these figures are consistent with our two previous observations: the optimal forgery strategy recommends that the user submit false ratings to movies falling into the categories where the ratio q_k/p_k is low, and the optimal suppression strategy suggests that the user refrain from rating movies belonging to categories where the ratio q_k/p_k is high. Just as an example, the fact that s*_17 ≈ 0.0001, compared with q_17 ≈ 0.0005, means that the user at hand should eliminate one in five ratings to movies classified as IMAX.
The optimal trade-off surface among privacy, forgery rate and suppression rate is represented in Fig. 8. In this figure we plot the contour levels of the function R(ρ, σ), which we computed theoretically. The initial privacy risk is R(0, 0) ≈ 0.101, and the arithmetic mean of the ratios q_1/p_1 and q_19/p_19 is approximately 1.37. Since the mean is greater than 1, Corollary 9 tells us that the user should opt for suppression as pure strategy, in lieu of forgery. This is under the assumption that they wish to achieve the minimum privacy risk and do not have any preference for either pure strategy. Nevertheless, the fact that δ_ρ ≈ 12.6 > δ_σ ≈ 10.9 leads us to choose forgery as pure strategy for ρ, σ ≃ 0. When both strategies are combined, note that forgery and suppression rates of just 0.1% each lead to a relative reduction in privacy risk of 2.35%, on account of the first-order Taylor approximation derived in Sec. IV-D.

[...] Fig. 9. The aim is to show how the optimal apparent profile becomes proportional to the population's distribution as the user approaches the critical-privacy region. Fig. 9 [...] In Fig. 9(b) we double the rates of forgery and suppression. On the one hand, this leads to [...] p_7 [...]. On the other, the fact that σ ∈ [σ_15, σ_14] implies that [...] p_19 [...]. It is also interesting to note that, for these relatively small values of ρ and σ, the final privacy risk is 26% of the initial value D(q ‖ p).
As ρ and σ increase, so does the function φ. The contrary happens with the function χ, which decreases with both rates. In Fig. 9(c), for example, the proportionality relationship between t* and p holds for all except 4 categories. The last pair (ρ, σ) ≈ (0.18, 0.15) lies at the boundary of C, as shown in Fig. 8. This implies that t*/p = 1 and therefore that R(ρ, σ) = 0, as captured in Fig. 9(d).
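The privacy-risk measure used throughout this section, the KL divergence between the apparent profile and the population's distribution, can be sketched in a few lines. The sketch assumes natural logarithms (the base only rescales the risk) and uses hypothetical toy distributions. It illustrates why the risk vanishes exactly when the apparent profile coincides with p.

```python
import math

def kl_divergence(t, p):
    """D(t || p) = sum_k t_k * log(t_k / p_k), in nats (assumed base).
    Terms with t_k = 0 contribute zero; requires p_k > 0 wherever t_k > 0."""
    return sum(tk * math.log(tk / pk) for tk, pk in zip(t, p) if tk > 0)

# Toy distributions (hypothetical values).
p = [0.2, 0.3, 0.5]
q = [0.5, 0.3, 0.2]

initial_risk = kl_divergence(q, p)  # positive, since q differs from p
final_risk = kl_divergence(p, p)    # zero: apparent profile equals p
```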
Having examined the case of a specific user, in our next series of experiments we evaluate the privacy-protection level that users can achieve if they are willing to forge and eliminate a fraction of their ratings. For simplicity, we suppose that all users satisfying the positivity assumption (2) apply a common forgery rate and a common suppression rate. Fig. 10 depicts the contours of the 10th, 50th and 90th percentile surfaces of relative reduction in privacy risk, for different values of ρ and σ. Two conclusions can be drawn from this figure.
• First, for relatively small values of ρ and σ (lower than 15%), a vast majority of users lowered their privacy risk significantly. In quantitative terms, we observe in Fig. 10(a) that, for ρ = σ = 0.05, the 10th percentile of the relative reduction in privacy risk is 52.4%; that is, 90% of the users adopting our technique reduced their privacy risk by at least 52.4%. For those same rates of forgery and suppression, the 50th and 90th percentiles are 73.9% and 94.8%. For higher rates, e.g., ρ = σ = 0.15, Fig. 10(b) shows that the 50th percentile of the relative reduction reaches 100%.
• Secondly, the three percentile surfaces exhibit a certain symmetry with respect to the line ρ = σ. If this symmetry were exact, exchanging the rates of forgery and suppression would not have any impact on the resulting privacy protection. However, this is not the case. For example, Fig. 10(a) shows a lower reduction in privacy risk for ρ < σ, particularly accentuated when σ ≃ 0. The reason for this may be found in the fact that, for most users, ρ_n is greater than σ_1. We shall elaborate on this later on when we consider forgery and suppression as pure strategies.

Next, we analyze the privacy protection provided by our technique for ρ, σ ≃ 0. In the theoretical analysis conducted in Sec. IV-D we derived an expression for the relative reduction in privacy risk at low rates. In particular, said expression was in terms of two factors, namely δ_ρ and δ_σ. In Fig. 11 we show the probability distribution of these factors. Consistently with Proposition 8, the minimum values of these factors are δ_ρ ≈ 3.12 > 1 and δ_σ ≈ 2.30 > 0. The maximum values attained by the forgery and the suppression factors are approximately 324.98 and 266.13. On the other hand, in favour of suppression is the fact that the percentage of users with δ_ρ ≥ 30 is lower than that of users with δ_σ ≥ 30. More precisely, these percentages are 26.8% and 33.1%, respectively. In the end, an eye-opening finding is that δ_ρ > δ_σ in only 43.45% of users, which suggests introducing a suppression rate higher than that of forgery, at least at low rates.
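As a worked illustration of the low-rate behaviour, the sketch below assumes that the first-order approximation of the relative reduction in privacy risk takes the additive form δ_ρ ρ + δ_σ σ. This form is an assumption, since the expression from Sec. IV-D is not restated here, although it does reproduce the 2.35% figure quoted earlier for the sample user with δ_ρ ≈ 12.6 and δ_σ ≈ 10.9. The function name is hypothetical.

```python
def approx_relative_reduction(rho, sigma, delta_rho, delta_sigma):
    """Assumed first-order approximation of the relative reduction in
    privacy risk at low rates: delta_rho * rho + delta_sigma * sigma."""
    return delta_rho * rho + delta_sigma * sigma

# Sample user of Sec. V-B: both rates set to 0.1%.
reduction = approx_relative_reduction(0.001, 0.001, 12.6, 10.9)
print(f"{100 * reduction:.2f}%")  # 2.35%
```

Under this form, a user with δ_ρ > δ_σ gains more from a small amount of forgery than from the same amount of suppression, and vice versa.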
Fig. 9: Proportionality relationship between, on the one hand, the optimal apparent item distribution t* of the user identified as 3301 in our data set, and on the other, the population's item distribution p.

After analyzing the forgery and the suppression of ratings as a mixed strategy, our last experimental results contemplate the application of forgery and suppression as pure strategies. In Fig. 12 we illustrate the probability distribution of the critical rates ρ_n and σ_1. The critical forgery rate ranges approximately from 0.171 to 54.18, and its average is 3.45. The critical suppression rate, on the other hand, goes from 0.153 to 0.963, and its average is 0.632. These figures indicate that, on average, a user will have either to refrain from rating an item six out of ten times, or to submit nearly 3.45 false ratings for each original rating. This is, of course, when the user wishes to reach the critical-privacy region. Bearing these figures in mind, it is not surprising that 95.3% of the users in our data set would opt for suppression as pure strategy, as it comes at the cost of a lower impact on utility.
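The comparison between the two pure strategies can be made concrete with a small sketch. It rests on an assumption not restated in this section: that the apparent profile is (q + r)/(1 + ρ) under pure forgery and (q − s)/(1 − σ) under pure suppression, with nonnegative r and s. Requiring the apparent profile to equal p then gives the critical rates ρ_n = q_n/p_n − 1 and σ_1 = 1 − q_1/p_1. These expressions are derived here for illustration and are not quoted from the paper. The example uses the extreme ratios reported for the sample user.

```python
def critical_rates(min_ratio, max_ratio):
    """Critical rates of the pure strategies under the assumed model:
    rho_n = q_n/p_n - 1 (forgery) and sigma_1 = 1 - q_1/p_1 (suppression)."""
    rho_n = max_ratio - 1.0
    sigma_1 = 1.0 - min_ratio
    return rho_n, sigma_1

# Sample user of Sec. V-B: q_1/p_1 = 0.444 and q_19/p_19 = 2.292.
rho_n, sigma_1 = critical_rates(0.444, 2.292)

# The pure strategy with the smaller critical rate distorts the profile less.
preferred = "suppression" if sigma_1 < rho_n else "forgery"
```

Under these assumed expressions, σ_1 < ρ_n holds exactly when the arithmetic mean of q_1/p_1 and q_n/p_n exceeds 1, which agrees with the criterion attributed to Corollary 9 for this user (mean ≈ 1.37).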

VI. CONCLUSION
In the literature of recommendation systems there exists a variety of approaches aimed at protecting user privacy. Among these approaches, the forgery and the suppression of ratings emerge as a technique that may hinder attackers in their efforts to accurately profile users on the basis of the items they rate. Our technique does not require that users trust either the recommender or the network operator; it is simple in terms of infrastructure requirements, and it can be used in combination with other approaches providing soft privacy. However, like any data-perturbative approach, our privacy-enhancing technology comes at the expense of a loss in data utility, in particular a degradation of the quality of the recommender's predictions. Put another way, it poses a trade-off between privacy and utility.
The objective of this paper is to investigate mathematically said trade-off. For this purpose, first we propose a quantitative measure of both privacy and utility. We quantify privacy risk as the KL divergence between the user's rating distribution and the population's, and measure utility as the fraction of ratings the user is willing to forge and suppress. With these two quantities, we formulate a multiobjective optimization problem characterizing the trade-off between privacy risk on the one hand, and on the other forgery rate and suppression rate.
Our theoretical analysis provides a closed-form solution to this problem and characterizes the optimal trade-off surface between privacy and utility. The solution is confined to the closure of the noncritical-privacy region. The interior of the critical-privacy region is of no interest, as the privacy risk attains its minimum value at the boundary of C. In the region of interest, our analysis finds that the optimal forgery and suppression strategies are orthogonal. In addition, these two strategies follow an intuitive principle. The forgery strategy recommends adding ratings to those categories where the user's interest is lower than the population's. The suppression strategy suggests eliminating those ratings belonging to the categories where the user shows too much interest compared to the reference distribution.
Our theoretical study also examines how these optimal strategies perturb user profiles. It is interesting to observe that the optimal apparent profile becomes proportional to the population's distribution in those categories with the lowest and highest ratios q_k/p_k. Our analysis also includes the characterization of R at low rates of forgery and suppression. More accurately, we provide a first-order Taylor approximation of the privacy-utility trade-off function, from which we conclude that the ratios q_1/p_1 and q_n/p_n determine, together with the quantity D(q ‖ p), the privacy risk at low rates. An eye-opening fact is that the relative decrement in privacy risk is greater than the forgery rate introduced.

Fig. 10: We assume that the 4 099 users satisfying the positivity assumption (2) protect their privacy by using a common forgery rate and a common suppression rate. Under this assumption, we plot some percentile surfaces of relative reduction in privacy risk against these two common rates.
Further, we consider the special case when forgery and suppression are not used in combination. Under this consideration, we investigate which one is the most appropriate technique, first, in terms of causing the minimum distortion to reach the critical-privacy region, and secondly, in terms of offering better privacy protection at low rates. Our findings show that the arithmetic and geometric means of the maximum and minimum ratios q_k/p_k play a fundamental role in deciding the best technique to use. Afterwards, our formulation and theoretical analysis are illustrated with a numerical example. Finally, the last section is devoted to the experimental evaluation of our data-perturbative mechanism in a real-world recommendation system. In particular, we examine how the application of the forgery and the suppression of ratings may preserve user privacy in MovieLens. Among other results, we find that a large majority of users significantly reduce their privacy risk for forgery and suppression rates below 15%. In our data set, the minimum and maximum values of the relative decrement factors indicate that, at low rates, forgery may provide a higher reduction in privacy risk than suppression does. By contrast, we observe that the forgery relative decrement factor is greater than that of suppression in only 43.45% of users. Lastly, we consider the case when users must opt for either forgery or suppression, and find that the latter is the best strategy for 95.3% of the users who wish to eliminate privacy risk while causing the minimum distortion.