Protecting Private Information for Two Classes of Aggregated Database Queries

Abstract: An important direction of informatics is devoted to the protection of privacy of confidential information while providing answers to aggregated queries that can be used for analysis of data. Protecting privacy is especially important when aggregated queries are used to combine personal information stored in several databases that belong to different owners or come from different sources. Malicious attackers may be able to infer confidential information even from aggregated numerical values returned as answers to queries over large collections of data. Formal proofs of security guarantees are important, because they can be used for implementing practical systems protecting privacy and providing answers to aggregated queries. The investigation of formal conditions which guarantee protection of private information against inference attacks originates from a fundamental result obtained by Chin and Ozsoyoglu in 1982 for linear queries. The present paper solves similar problems for two new classes of aggregated nonlinear queries. We obtain complete descriptions of conditions which guarantee the protection of privacy of confidential information against certain possible inference attacks, if a collection of queries of this type is answered. Rigorous formal security proofs are given which guarantee that the conditions obtained ensure the preservation of privacy of confidential data. In addition, we give necessary and sufficient conditions for the protection of confidential information from special inference attacks aimed at achieving a group compromise.

The investigation of formal conditions which guarantee the preservation of private information against inference attacks using aggregated database queries originates from a fundamental result obtained by Chin and Ozsoyoglu [14] in the case of linear queries and linear inference attacks. It belongs to an important research direction devoted to the protection of privacy of confidential information while providing answers to aggregated queries that can be used for analysis of data [9,15]. Protecting privacy is especially important when aggregated queries are used to combine personal information stored in several databases that belong to different owners or come from different sources [16]. Malicious attackers may be able to infer confidential information even from aggregated numerical values returned as answers to queries over large collections of data [17]. Formal proofs of security guarantees are important, because they can be used for implementing practical systems protecting privacy and providing answers to aggregated queries.
The present paper obtains novel rigorous formal conditions, which guarantee the protection of privacy of confidential information against certain possible inference attacks for two new classes of aggregated nonlinear queries motivated by the main result of [14]. Section 2 of our paper gives a review of related previous work. Section 3 contains technical details on the materials and methods used in this paper. Section 4 presents main results of our article. Section 4.1 defines MEAN and VARIANCE queries (MVQ) and introduces a new class of inference attacks, quadratic equation attacks (QEA). In order to protect confidential information from QEA attacks we design a quadratic audit system (QAS). Theorems 2 and 3 establish that QAS systems guarantee the protection of confidential data from QEA attacks. Rigorous formal security proofs are given to ensure the preservation of privacy of confidential data. Section 4.2 introduces interval inference attacks (IIA). To protect sensitive data from IIA attacks, we design an interval audit system (IAS). Theorems 4 and 5 prove that the IAS ensures protection against IIA attacks. Finally, Theorem 6 in Section 4.3 gives rigorous matrix conditions for the protection of confidential information from a group compromise. The results obtained are discussed in Section 5, where directions for future research are also proposed. A conclusion is given in Section 6.
The present paper contributes to the advancement of knowledge on the preservation of privacy of confidential information by developing formal theory, designing new formal systems for the protection against inference attacks and obtaining novel rigorous conditions that guarantee that the confidential information remains protected. In summary, the main contributions of this paper are as follows:
• Formal definitions of the MVQ queries and a new class of inference attacks, the QEA attacks.
• The design of a QAS system for the protection of confidential information against the QEA attacks.
• Rigorous formal proofs of Theorems 2 and 3, which establish that QAS systems guarantee the protection of confidential data from the QEA attacks.
• Formal definition of a new class of inference attacks, the IIA attacks.
• The design of an IAS system for the protection of sensitive data from the IIA attacks.
• Rigorous formal proofs of Theorems 4 and 5, which demonstrate that IAS systems ensure protection against IIA attacks.
• Rigorous formal proof of Theorem 6, which provides stringent matrix conditions for the protection of confidential information from a group compromise.

Previous Work
This section is devoted to the existing literature related to the results of [14] and a brief review of other relevant research. The paper [14] investigated linear queries and designed the concept of an audit expert, which maintains a dynamic matrix for processing such queries. The paper [18] suggested using a static audit expert for arbitrary linear queries, where the query basis matrix is prepared and fixed by the system beforehand. The paper [19] proposed to apply a hybrid audit expert, which combined the advantages of the dynamic and static expert systems. The effectiveness of hybrid audit experts was further investigated in [20].
The majority of previous papers devoted to linear queries concentrated on the more special case of so-called SUM queries (see Section 3 for a mathematical definition). Databases where the clients are allowed to submit SUM queries were investigated in [21-23]. The readers are referred to our survey article [24] for more details.
Wu et al. [25] used the concept of differential privacy and designed a differentially private mechanism for answering linear queries, which achieves a near-optimal data utility subject to a fixed privacy protection constraint. Mckenna et al. [26] applied advanced optimisation methods to develop a mechanism for accurate answers to a user-provided set of linear queries under local differential privacy. Khalili et al. [27] proposed an incentive mechanism and a randomized response algorithm for generating differentially private answers to linear queries. Xiao et al. [28] devised a fine-grained strategy of adding Gaussian noise to query answers in the special case of answering linear queries under differential privacy subject to per-query constraints on accuracy.
Differential privacy has also been applied recently for privacy protection in various more advanced scenarios. For example, the paper by Qu et al. [29] proposed a customizable reliable differential privacy (CRDP) model and developed a modified Laplacian mechanism that enables CRDP to simultaneously minimize background knowledge attacks and eliminate collusion attacks in cyber-physical social networks. An application of differential privacy to the development of personalised privacy protection in cyber-physical social systems was investigated in [30].
Another important relevant direction of research deals with federated learning, which occurs when a query needs to be answered by using a large database that is a union of several separate databases that belong to different data owners who are not willing to share data with others due to privacy issues. For example, Wan et al. [31] proposed to integrate differential privacy and the Wasserstein Generative Adversarial Network (WGAN) for preserving the privacy of sensitive parameters in federated learning. Cui et al. [32] introduced a blockchain-empowered decentralized and asynchronous federated learning framework and designed an improved, differentially private federated learning based on generative adversarial nets. Qu et al. [33] proposed a blockchain-enabled adaptive asynchronous federated learning paradigm (FedTwin) and designed a tailor-made consensus algorithm that uses generative adversarial network-enhanced differential privacy and an improved Markov decision process. A trade-off optimization procedure and a hybrid model were developed by Qu et al. [34] for simultaneous protection of the identity and location privacy of smart mobile devices against dynamic adversaries. Blockchain-enabled federated learning and WGAN-enabled differential privacy were applied by Wan et al. [35] in order to protect confidential model parameters in the fifth-generation broadband cellular networks and beyond fifth-generation networks.
Thus, a lot of research has been conducted that investigates related directions. However, the protection of private information for the classes of nonlinear queries examined in the present paper has never been considered in the literature before.

Materials and Methods
If a data repository processes aggregated numerical queries for subsets of the records and provides the outcomes of these queries without giving access to individual records, then such a repository is often called a statistical database (cf. [36,37]). We use standard concepts and terminology, following [36,38-42]. Our proofs also apply the main theorem of [43].
The set of all real numbers is denoted by $\mathbb{R}$. The cardinality of a set $S$ is denoted by $|S|$. For positive integers $a \le b$, the symbol $[a : b]$ stands for the set
$$[a : b] = \{a, a+1, \ldots, b\}.$$
A summary of the main notation used in this paper is given in Table 1. Let $m$ be the number of attributes in every record of the database, and let $\mathbf{r} = (r_1, r_2, \ldots, r_m)$ be an arbitrary record. The attributes in the database are denoted by $A_1, \ldots, A_m$. For $1 \le i \le m$, the attribute $A_i$ is a function such that $A_i(\mathbf{r}) = r_i$. Let $n$ be the number of records stored in the database. Denote the records by $\mathbf{r}_1, \ldots, \mathbf{r}_n$. We assume that the users can submit aggregated queries regarding the confidential attribute $A_1$, and the attributes $A_2, \ldots, A_m$ are used to select subsets of records for these queries. Then $A_1$ is called a quantitative attribute and $A_2, \ldots, A_m$ are called characteristic attributes for such queries. Let $x_1, x_2, \ldots, x_n$ be the (confidential) values of the quantitative attribute $A_1$ in the records.

Table 1. Main terminology and notation used in the present paper.

| Term | Notation |
|---|---|
| Database with confidential data | $D$ |
| Number of records in $D$ | $n$ |
| All records in $D$ | $\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_n$ |
| Number of attributes in each record | $m$ |
| An arbitrary record in $D$ | $\mathbf{r} = (r_1, r_2, \ldots, r_m)$ |
| Quantitative attribute | $A_1$ |
| Characteristic attributes | $A_2, \ldots, A_m$ |
| Values of attribute $A_1$ in $\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_n$ | $x_1, x_2, \ldots, x_n$ |
| Boolean expression | $B$ |

The set of records chosen for a query by specifying conditions for the characteristic attributes is called the query sample or query set. To select a sample set for a query, the users can use inequalities and Boolean expressions. Denote by $\mathcal{B}$ the set of all Boolean expressions of inequalities involving the characteristic variables. This set can be defined inductively by the following rules:
(B1) For any $r \in \mathbb{R}$, $j \in [2 : m]$, the set $\mathcal{B}$ contains the inequalities $r_j \le r$, $r_j \ge r$, $r_j < r$, $r_j > r$ and the equality $r_j = r$.
(B2) For any $B_1, B_2 \in \mathcal{B}$, the set $\mathcal{B}$ contains $B_1 \wedge B_2$, $B_1 \vee B_2$ and $\neg B_1$, where $\wedge$, $\vee$ and $\neg$ denote the AND, OR and NOT operators, respectively.
Throughout, we consider a query using a Boolean expression $B \in \mathcal{B}$ to select the query sample. It specifies the records $\mathbf{r}$ stored in the database for which the Boolean expression holds true. The query sample, i.e., the set of all records in $D$ satisfying condition $B$, is denoted by $S = B(D)$.
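The selection of a query sample by a Boolean expression can be illustrated with a small sketch. The records, the attribute meanings and the helper names `le`, `eq`, `AND`, `OR`, `NOT` and `query_sample` are hypothetical and are introduced only for this illustration.

```python
# Hypothetical records (r_1, r_2, r_3): r_1 is the confidential value,
# r_2 and r_3 are characteristic attributes (say, age and department code).
D = [
    (52000, 34, 1),
    (61000, 45, 2),
    (48000, 29, 1),
    (75000, 51, 3),
]

def le(j, c):
    """Atomic condition r_j <= c, where j is the 1-based attribute index."""
    return lambda rec: rec[j - 1] <= c

def eq(j, c):
    """Atomic condition r_j = c."""
    return lambda rec: rec[j - 1] == c

def AND(b1, b2):
    return lambda rec: b1(rec) and b2(rec)

def OR(b1, b2):
    return lambda rec: b1(rec) or b2(rec)

def NOT(b):
    return lambda rec: not b(rec)

def query_sample(B, D):
    """Return S = B(D), the records of D for which the expression B holds."""
    return [rec for rec in D if B(rec)]

# Expression (r_2 <= 40) AND (r_3 = 1), built only from characteristic attributes.
B = AND(le(2, 40), eq(3, 1))
S = query_sample(B, D)
```

Only the characteristic attributes appear in the conditions; the confidential value in the first position is never tested directly.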
Thorough investigation in the literature has been devoted to linear queries [14,18,19,44]. A linear query can be recorded as a linear combination
$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n = \beta, \tag{3}$$
where $\beta$ is the outcome of the query, and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$. Linear queries are also called weighted sum queries. The COUNT query corresponding to the linear query (3) is defined as the number of nonzero coefficients $\alpha_i$, for $i \in [1 : n]$.
A SUM query is defined as a linear Equation (3), where $\beta$ is the outcome of the query, and where
$$\alpha_1, \ldots, \alpha_n \in \{0, 1\}.$$
When there is a set of linear queries indexed by $j = 1, \ldots, k$ with equations $\alpha_{j,1} x_1 + \alpha_{j,2} x_2 + \cdots + \alpha_{j,n} x_n = \beta_j$, we can collect them into the matrix $M = [\alpha_{j,i}]$ and the column vector $V = [\beta_j]$. Thus, every set of SUM queries (or linear queries) can be recorded as a system of linear equations of the form
$$MX = V, \tag{6}$$
where $X = [x_1, \ldots, x_n]^T$, $M = [\alpha_{j,i}]$, and $V = [\beta_j]$ is the column vector with the values returned by the queries. Each query corresponds to a row of the matrix $M$. To derive the confidential values $x_1, \ldots, x_n$, the user can try to solve the system of linear equations.
For linear queries, it is enough to consider one-dimensional databases, or databases with only one quantitative attribute. An arbitrary set of linear queries in a multi-dimensional database can be represented as a disjoint union of linear queries corresponding to different quantitative attributes, and each of these subsets can be viewed as a set of linear queries of the corresponding 1-dimensional database.
Every linear combination of linear queries is also a linear query. If the outcomes of several linear queries are known, then the outcomes of all their linear combinations are also known. Therefore, row and column operations can be used to simplify (6). Applying row interchange, row scaling, row addition, and column interchange, the system (6) can be reduced to a normalized basis matrix form. Therefore, without loss of generality we may assume that (6) has been simplified and is represented by a normalized query basis matrix
$$M_k = [\, I_k \mid M'_k \,], \tag{7}$$
where $M'_k$ is the block formed by the remaining $n - k$ columns and $I_k$ is the $(k \times k)$ identity matrix. Then the matrix $M$ is said to be in normalized form.
The row vectors of $M_k$ form a basis of the space of all queries whose outcomes are known, because all such queries can be derived as linear combinations of these query vectors.
Inference attacks can be used to derive private information from legitimately available data. It may be possible to deduce confidential information by comparing the results of several different queries. Let $x_1, x_2, \ldots, x_n$ be the values of a protected or confidential attribute in the records. If the value $x_i$ of a confidential attribute in one record is revealed to the user, for some $i \in [1 : n]$, then this event is called a compromise of the database. When it is essential to emphasize that the value in precisely one record has been revealed, the terms 1-compromise or classical compromise can also be used. Linear inference attacks occur when malicious adversaries try to solve the system of linear equations (6) to determine confidential values.
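A minimal sketch of a linear inference attack with two SUM queries is given next. The confidential values and the helper `sum_query` are hypothetical; the point is only that the difference of two legitimately answered queries is again a linear query with a known outcome.

```python
# Hypothetical confidential values x_1, ..., x_4 (known only to the database).
x = [10, 20, 30, 40]

def sum_query(sample):
    """Return the 0/1 coefficient row and the outcome of a SUM query.

    `sample` is a set of 0-based record indices chosen by the query.
    """
    row = [1 if i in sample else 0 for i in range(len(x))]
    beta = sum(a * xi for a, xi in zip(row, x))
    return row, beta

# Two answered queries: x_1 + x_2 + x_3 and x_2 + x_3.
row_a, beta_a = sum_query({0, 1, 2})
row_b, beta_b = sum_query({1, 2})

# The attacker subtracts the second equation from the first one.
diff_row = [a - b for a, b in zip(row_a, row_b)]
diff_val = beta_a - beta_b
```

The resulting row has a single nonzero entry, so the value of the first record is disclosed, which is a classical compromise.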
To provide protection against linear inference attacks, Chin and Ozsoyoglu [14] proposed a system called Audit Expert. It uses a normalized basis matrix to store all queries answered previously. When a new query is added, the Audit Expert adds it to the matrix and then reduces it to a normalized basis form again.

Theorem 1 ([14]).
A statistical database with linear queries is compromised if and only if the normalized query basis matrix $M_k$ of the Audit Expert has a row with exactly one nonzero entry. The time complexity of the algorithm dynamically processing the query matrix of the Audit Expert and maintaining it in a normalized form for a set of $k$ consecutive linear queries is $O(k^2)$.
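The criterion of Theorem 1 can be checked with a short sketch. It is a simplified illustration and not the dynamic Audit Expert of [14]: it recomputes a reduced row echelon form from scratch and performs no column interchange, so the pivot columns play the role of the identity block $I_k$. The function names `rref` and `is_compromised` are assumptions made for this example.

```python
from fractions import Fraction

def rref(rows):
    """Return the reduced row echelon form of a list of rows (zero rows dropped)."""
    m = [[Fraction(a) for a in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(n_cols):
        # Find a row at or below pivot_row with a nonzero entry in this column.
        pr = next((i for i in range(pivot_row, len(m)) if m[i][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        p = m[pivot_row][col]
        m[pivot_row] = [a / p for a in m[pivot_row]]      # row scaling
        for i in range(len(m)):
            if i != pivot_row and m[i][col] != 0:
                f = m[i][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[pivot_row])]  # row addition
        pivot_row += 1
        if pivot_row == len(m):
            break
    return m[:pivot_row]

def is_compromised(rows):
    """True if some row of the reduced matrix has exactly one nonzero entry."""
    return any(sum(1 for a in row if a != 0) == 1 for row in rref(rows))

# x_1 + x_2 + x_3 and x_2 + x_3 together reveal x_1.
unsafe = is_compromised([[1, 1, 1, 0], [0, 1, 1, 0]])
# x_1 + x_2 and x_3 + x_4 reveal no single value.
safe = not is_compromised([[1, 1, 0, 0], [0, 0, 1, 1]])
```

Exact rational arithmetic is used so that the test for zero entries is not affected by rounding.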

Results
This section presents new results obtained in this paper for the protection of confidential information against quadratic equation attacks (Section 4.1), interval inference attacks (Section 4.2), and group compromise (Section 4.3).

Quadratic Equation Attacks
In this subsection, we consider a new class of nonlinear queries based on the mean and variance. These notions play crucial roles in hypothesis testing, significance analysis, and other studies; see [39].
Let $S = B(D)$ be a query sample, i.e., the set of records chosen by the Boolean expression $B$. Denote by $V$ the set $\{r_1 \mid (r_1, \ldots, r_m) \in S\}$ of values of the confidential quantitative attribute $A_1$ in the records of the sample $S$ with the corresponding probability distribution. The mean of the values of the quantitative attribute is also called the expected value of the quantitative attribute. It is denoted by $\overline{V} = E(r_1)$ and is defined by the formula
$$\overline{V} = E(r_1) = \frac{1}{|S|} \sum_{(r_1, \ldots, r_m) \in S} r_1. \tag{8}$$
The variance of $V$ is the expected value $E[(r_1 - E(r_1))^2]$ of the squared differences $r_1 - E(r_1)$ of values of the quantitative attribute $r_1$ from the mean $E(r_1)$ (see [40]). The variance of $V$ is denoted by $\sigma_V^2$ and is defined by the following formula:
$$\sigma_V^2 = E[(r_1 - E(r_1))^2] = \frac{1}{|S|} \sum_{(r_1, \ldots, r_m) \in S} (r_1 - \overline{V})^2, \tag{9}$$
where $\overline{V}$ is the mean given by (8) (see [40,41]). The variance measures the variability of values of the quantitative attribute from the mean. It is explained in [40] with a complete proof (see also [41]) that formula (9) can be rewritten in the following equivalent form:
$$\sigma_V^2 = E(r_1^2) - (E(r_1))^2 = \frac{1}{|S|} \sum_{(r_1, \ldots, r_m) \in S} r_1^2 - \overline{V}^2. \tag{10}$$
For more explanations and worked examples, the readers are referred to [40,41].
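The equivalence of the two expressions for the variance can be verified numerically. The sketch below assumes that every record of the sample has the same weight $1/|S|$; the sample values are hypothetical.

```python
from fractions import Fraction

def mean(values):
    """Mean of integer values, with every value weighted by 1/|S|."""
    return Fraction(sum(values), len(values))

def variance_by_definition(values):
    """Mean of the squared deviations from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def variance_by_identity(values):
    """Mean of the squares minus the square of the mean."""
    return mean([v * v for v in values]) - mean(values) ** 2

sample = [0, 2, 4, 6]            # hypothetical confidential values in a sample
var_def = variance_by_definition(sample)
var_id = variance_by_identity(sample)
```

For this sample the mean is 3, the mean of the squares is 14, and both functions return the variance 5.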
A MEAN and VARIANCE query, or an MVQ query, can be defined as a pair $(f, B)$, where $B$ is a Boolean expression and $f$ is a function $f = (f_1, f_2)$, where $f_1$ is defined by (8) and $f_2$ is defined by (9). This means that an MVQ query submits a Boolean expression $B$ and asks to return the values of the sample mean and variance for the sample $S = B(D)$.
Equality (10) allows us to recover the outcome of each VARIANCE query from the mean value of the squares of the values of the confidential attribute. Therefore, in order to store an MVQ query in computer memory, it is enough to keep a record of the coefficients that occur in the MEAN query, the outcome of the MEAN query, and the mean value of the squares of the values of the confidential attribute. Consequently, to store a set of MVQ queries, we can use the following pair of matrix equalities:
$$MX = V, \qquad MY = W, \tag{11}$$
where $M$ is the matrix storing the coefficients of the MEAN queries, $X = [x_1, \ldots, x_n]^T$ is the column of the confidential values $x_1, \ldots, x_n$, $Y = [x_1^2, \ldots, x_n^2]^T$ is the column of the squares of the confidential values, $V$ is the column vector with the values returned by the MEAN queries, and $W$ is the column vector of the mean values of the squares of the confidential values. In concise matrix notation, the Equation (11) can be stored as the following matrix: $(M \mid V \mid W)$.
The following example illustrates our matrix notation.

Example 1.
Suppose that in a dataset with two records $\mathbf{r}_1, \mathbf{r}_2$ the values of the confidential attribute are $x_1 = 0$, $x_2 = 2$. Suppose that the MVQ queries have been answered for the following three samples:
Then we get the following matrix equalities
Here (13) keeps a record of the mean values of the samples, and (14) stores the corresponding mean values of the squares of the confidential attribute. We do not have to store long records of all coefficients of the VARIANCE queries, because equality (10) makes it easy to obtain the values of all VARIANCE queries from (14). The concise matrix notation we are going to use to keep a record of all MVQ queries is the matrix
Applying the row and column operations, we can reduce $M$ to a normalized form. Then the system (11) simplifies and reduces to the normalized form
$$M_k X = V', \qquad M_k Y = W', \tag{16}$$
where the normalized query basis matrix $M_k$ has the form
$$M_k = [\, I_k \mid M'_k \,], \tag{17}$$
where $I_k$ is the $(k \times k)$ identity matrix. In concise matrix notation, equations (16) can be stored as the matrix $(M_k \mid V' \mid W')$.
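The concise storage of MVQ queries as a triple of coefficient matrix, means and means of squares can be sketched as follows. The confidential values $x_1 = 0$, $x_2 = 2$ are those of Example 1, but the three samples used here are an assumption made for this illustration and are not claimed to be the samples of the example. The helper `mvq_row` is likewise hypothetical.

```python
from fractions import Fraction

x = [Fraction(0), Fraction(2)]      # confidential values x_1 = 0, x_2 = 2
samples = [{0}, {1}, {0, 1}]        # assumed samples (0-based record indices)

def mvq_row(sample, x):
    """Return (MEAN coefficients, mean, mean of squares) for one sample."""
    coeff = [Fraction(1, len(sample)) if i in sample else Fraction(0)
             for i in range(len(x))]
    mean = sum(c * xi for c, xi in zip(coeff, x))
    mean_sq = sum(c * xi * xi for c, xi in zip(coeff, x))
    return coeff, mean, mean_sq

rows = [mvq_row(s, x) for s in samples]
M = [r[0] for r in rows]            # coefficients of the MEAN queries
V = [r[1] for r in rows]            # outcomes of the MEAN queries
W = [r[2] for r in rows]            # mean values of the squares

# The VARIANCE outcomes need not be stored: mean of squares minus squared mean.
variances = [w - v * v for v, w in zip(V, W)]
```

Under these assumed samples the stored columns are $V = (0, 2, 1)^T$ and $W = (0, 4, 2)^T$, and the recovered variances are $0$, $0$ and $1$.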
Next, we define the first type of a nonlinear inference attack, the QEA attack, which can be used by an adversary to compromise MVQ queries. Steps of the QEA attack are explained in Algorithm 1.
To protect sensitive data from QEA attacks, we design a quadratic audit system (QAS). It is described in Algorithm 2 by using the following matrix notation.
Let $v$ be a vector with $n$ components, and let $T$ be a $(k \times n)$-matrix. Denote by $|v|$ the number of nonzero components in $v$. For $1 \le i \le k$, the $i$-th row of $T$ is denoted by $T(i, :)$. For $1 \le j \le n$, the $j$-th column of $T$ is denoted by $T(:, j)$. The deletion of the $j$-th column from $T$ is denoted by $T[:, j] \leftarrow [\,]$. For $1 \le j < \ell \le n$, the interchange of the columns $j$ and $\ell$ in $T$ is denoted by $T(:, j \leftrightarrow \ell)$. The $((k + 1) \times n)$-matrix obtained by adding $v$ as the last row to $T$ is denoted by $[T; v]$. Two vectors $u$ and $v$ are said to be parallel or collinear if and only if either at least one of them is a zero vector, or there exists a nonzero real number $\alpha$ such that $u = \alpha v$. If two vectors $u$, $v$ are collinear, then we write $u \parallel v$.
A formal proof establishing that the QAS system guarantees protection of sensitive data from QEA attacks is given in Theorem 3. It relies on Theorem 2, which gives matrix conditions necessary and sufficient for QEA attack to reveal confidential data.
Theorem 2 uses the concept of c-compromise, where c is a positive integer. This concept includes as a special case the notion of a classical compromise or 1-compromise treated in Theorem 1. Namely, the disclosure of a statistic based on c or fewer records in the database is called a c-compromise. The notion of a c-compromise has already been studied in the literature (see the survey paper [24] for more references).
For any row $r$ of the matrix $M_k$ in (17), denote by
$$r^{(k,*)} = (r_1, \ldots, r_k) \tag{19}$$
the vector of the first $k$ components of $r$. Denote by
$$r^{(*,n-k)} = (r_{k+1}, \ldots, r_n) \tag{20}$$
the vector of the last $n - k$ components of $r$. Then the row has the form
$$r = (r^{(k,*)}, r^{(*,n-k)}). \tag{21}$$
The vector $r^{(*,n-k)}$ will be called the projection of the row $r$ on the right block $M'_k$ of the matrix $M_k = [I_k \mid M'_k]$ in (17).

Algorithm 1 Quadratic Equation Attack.
Input: A set of MVQ queries.
Output: A compromise of the set of queries.
1: First, verify whether a compromise can be achieved by using only the set of MEAN queries as in Theorem 1. If not, then proceed to the next step.
2: Test all combinations of $t \in [1 : n]$ and $T \subseteq [1 : n]$ to find a pair $(t, T)$ with two properties (A1), (A2):
   (A1) The set of linear equations corresponding to the MEAN queries can be used to derive equalities
   $$x_i = \gamma_i x_t + \delta_i, \quad i \in T, \tag{22}$$
   where $\gamma_i, \delta_i \in \mathbb{R}$.
   (A2) The attackers may be able to use the outcomes of the VARIANCE queries to derive a quadratic equation of the form
   $$\sum_{i \in T} \omega_i x_i^2 = w \tag{23}$$
   depending only on $x_i$, $i \in T$, where $\omega_i, w \in \mathbb{R}$.
3: Substitute all expressions (22) into (23) so that it becomes a quadratic equation in one variable $x_t$.
4: Solve the resulting quadratic equation in one variable $x_t$ to achieve a compromise.
5: Output $t$, $x_t$.
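Steps 3 and 4 of the attack can be sketched for a small hypothetical instance. The function name `solve_for_xt` and the numbers are assumptions made for this illustration: the attacker is assumed to know linear relations $x_i = \gamma_i x_t + \delta_i$ and one equation $\sum_i \omega_i x_i^2 = w$ in the squares, and expands the substitution into $a x_t^2 + b x_t + c = 0$.

```python
import math

def solve_for_xt(relations, weights, w):
    """Substitute x_i = gamma_i * x_t + delta_i into sum(omega_i * x_i**2) = w.

    relations: dict i -> (gamma_i, delta_i); weights: dict i -> omega_i.
    Returns the sorted real roots of the resulting quadratic in x_t.
    """
    a = sum(weights[i] * g * g for i, (g, d) in relations.items())
    b = 2 * sum(weights[i] * g * d for i, (g, d) in relations.items())
    c = sum(weights[i] * d * d for i, (g, d) in relations.items()) - w
    if a == 0:
        raise ValueError("the equation is not quadratic in x_t")
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})

# Hypothetical instance with hidden values x_1 = 3, x_2 = 5 and t = 1.
# From the MEAN queries: x_1 = x_1 and x_2 = x_1 + 2.
relations = {1: (1.0, 0.0), 2: (1.0, 2.0)}
# From a VARIANCE query on {r_1, r_2}: (x_1^2 + x_2^2) / 2 = 17.
weights = {1: 0.5, 2: 0.5}
roots = solve_for_xt(relations, weights, 17.0)
```

The quadratic here is $x_t^2 + 2x_t - 15 = 0$ with roots $-5$ and $3$. A quadratic equation may have two real roots, so in this sketch the attacker would still need side information, such as a known sign or range of the attribute, to select the correct one.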

Algorithm 2 Quadratic Audit System.
Input: Normalized matrix $M_k = [I_k \mid M'_k]$ of the answered MVQ queries and the vector $v$ of the new MVQ query.
Output: New normalized matrix and answer to the query, or response that the query has been rejected.
12: if $T(i, :)(k + 1 : n) \parallel T(k + 1, :)(k + 1 : n)$ then
13: Reject the query, keep the matrix unchanged.
14: end if
15: end for
16: end if
17: Answer the query and set $M_{k+1} = [I_{k+1}; T]$.
18: end if

Theorem 2. Let $D$ be a database with the set of MVQ queries answered so far stored in matrix form (11) with the normalized form (16). Then the following conditions are equivalent.
(i) The QEA attack can be used to achieve a compromise of D.
(ii) The attackers can use the set consisting of only the MEAN queries answered so far to achieve a 2-compromise of D.
(iii) Either $M_k$ in (16) has a row with at most two nonzero entries, or $M_k$ has two rows with collinear projections on the right block $M'_k$ of $M_k = [I_k \mid M'_k]$ in (17).
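Condition (iii) is a purely combinatorial test on the normalized matrix and can be sketched directly. The function names `collinear` and `condition_iii` are assumptions made for this illustration, and the matrices are hypothetical; the sketch checks the condition from scratch and is not a reconstruction of Algorithm 2.

```python
from itertools import combinations

def collinear(u, v):
    """True if u and v are parallel, i.e., all 2x2 minors of [u; v] vanish.

    This covers both cases of the definition: one of the vectors is zero,
    or u = alpha * v for a nonzero real number alpha.
    """
    return all(u[a] * v[b] == u[b] * v[a]
               for a, b in combinations(range(len(u)), 2))

def condition_iii(M_k, k):
    """Check condition (iii) for a normalized matrix M_k = [I_k | M'_k]."""
    sparse_row = any(sum(1 for a in row if a != 0) <= 2 for row in M_k)
    projections = [row[k:] for row in M_k]       # rows of the right block
    collinear_pair = any(collinear(p, q)
                         for p, q in combinations(projections, 2))
    return sparse_row or collinear_pair

# Projections (1, 2, 3) and (2, 4, 6) are collinear: the queries are unsafe.
unsafe = condition_iii([[1, 0, 1, 2, 3], [0, 1, 2, 4, 6]], k=2)
# Projections (1, 2, 3) and (1, 0, 1) are not collinear and no row is sparse.
safe = not condition_iii([[1, 0, 1, 2, 3], [0, 1, 1, 0, 1]], k=2)
```

Integer or rational entries are assumed, so that the equality tests on the minors are exact.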

Proof of Theorem 2.
As in the proof of the main theorem of [14] and in other previous publications, we make the customary assumption that the attackers can gain knowledge of the COUNT query corresponding to each of their queries. It is important to ensure rigorous protection of privacy under this assumption, in view of the following three easy ways in which the attackers can gain access to the outcomes of the COUNT queries.
(a) The COUNT query is a legitimate query. It can be submitted to the database and may be answered as a separate query.
(b) The COUNT query can be included as an integral part of every SUM query or linear query.
(c) It may be easy for the attackers to gain access to the values of some COUNT queries by using additional information, legal knowledge, or insider knowledge.
Theorem 1 and its proof also assume that the audit system must provide protection against database compromise even if the attackers can gain access to the COUNT queries. Without this assumption, Theorem 1 is invalid. Indeed, even if the attackers manage to obtain an outcome of a query corresponding to the value of a confidential attribute in just one record, they will be unable to notice that they have achieved this, since without the knowledge of a COUNT query they will not know whether the outcome corresponds to just one record or to many records. This is why it is common practice to assume that the attackers can also gain access to the outcomes of the corresponding COUNT queries, and that the audit system must provide protection in these circumstances.
(i)⇒(ii): Suppose that condition (i) holds, i.e., the QEA could be used to achieve a compromise of D. Let us refer to the definition of the QEA attack in Algorithm 1.
First, we consider the case where the attackers managed to achieve a compromise in Step 1 of the Quadratic Equation Attack. In this case, Step 1 results in a compromise achieved by using only the set of MEAN queries. Every classical compromise is an example of a 2-compromise required for condition (ii). Therefore, in this case condition (ii) follows immediately. Now, we assume that the attackers had to proceed to the remaining steps of the QEA. This means that they found an element $t \in [1 : n]$ and a subset $T \subseteq [1 : n]$ with properties (A1) and (A2). Let us take the first equality of the system (22), say $x_i = \gamma_i x_t + \delta_i$ for some $i \in T$. It implies that $x_i - \gamma_i x_t = \delta_i$. Therefore, the attackers have managed to derive the value $\delta_i$ of the statistic $x_i - \gamma_i x_t$, which depends on at most two variables. This means that the attackers have achieved a 2-compromise by using only the set of MEAN queries, and so condition (ii) holds again.
(ii)⇒(iii): Suppose that condition (ii) holds, i.e., the attackers have managed to achieve a 2-compromise of $D$ by using only MEAN queries. This means that they derived the value $\eta$ of a statistic $\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2}$, for some $1 \le \ell_1 < \ell_2 \le n$, where $\nu_1^2 + \nu_2^2 \neq 0$. Denote the rows of the normalized matrix $M_k$ by $m_1, \ldots, m_k$. For $i \in [1 : k]$, let us denote by $\lambda_i$ the linear combination of the variables $x_1, \ldots, x_n$ corresponding to the $i$-th row of the matrix. This means that
$$\lambda_i = m_i X, \tag{24}$$
where $X = [x_1, \ldots, x_n]^T$. Then, as in (35), it follows that there exist $\xi_1, \ldots, \xi_k$ such that
$$\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2} = \xi_1 \lambda_1 + \cdots + \xi_k \lambda_k. \tag{25}$$
First, we consider the case where $\nu_1 = 0$. Then the value $\eta = \nu_2 x_{\ell_2}$ provides a 1-compromise. Hence, Theorem 1 implies that the normalized basis matrix $M_k$ of the audit system has a row with only one nonzero entry. Therefore condition (iii) is satisfied.
Second, if $\nu_2 = 0$, then it follows in the same way that condition (iii) holds true, as well.
Third, it remains to treat the case where $\nu_1, \nu_2 \neq 0$. Note that $M_k = [I_k \mid M'_k]$ as in (7). Because $I_k$ is an identity matrix, every nonzero linear combination of the rows of $M_k$ has at least one nonzero component in the first $k$ columns. Applying this to the linear combination (25), we see that $\ell_1 \le k$. Furthermore, the following two subcases are possible and we consider them separately.
Subcase 1. $\ell_2 > k$. This means that $x_{\ell_2}$ belongs to the columns of the matrix $M'_k$, which is the right block of the matrix $M_k = [I_k \mid M'_k]$ in (7). Clearly, the sum $\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2}$ has only one nonzero component in the first $k$ columns, namely the $\ell_1$-th component. Because $I_k$ is an identity matrix, it follows from (25) that $\xi_{\ell_1} \neq 0$ and
$$\xi_i = 0 \quad \text{for all } i \in [1 : k],\ i \neq \ell_1.$$
Hence, $\eta = \xi_{\ell_1} \lambda_{\ell_1}$. It follows that the $\ell_1$-th row of $M_k$ has precisely two nonzero entries, and so condition (iii) holds.
Subcase 2. $\ell_2 \le k$. This means that $x_{\ell_1}, x_{\ell_2}$ belong to the columns of the matrix $I_k$ in $M_k$. Hence, we get $\xi_{\ell_1}, \xi_{\ell_2} \neq 0$ and all the other coefficients $\xi_i$ are equal to 0, i.e.,
$$\eta = \xi_{\ell_1} \lambda_{\ell_1} + \xi_{\ell_2} \lambda_{\ell_2}.$$
Therefore, all entries in the last $(n - k)$ columns of $\eta$ are equal to zero. Denote by $p_1$ and $p_2$ the projections of the rows $m_{\ell_1}$ and $m_{\ell_2}$ on the matrix $M'_k$, respectively. It follows that $\xi_{\ell_1} p_1 + \xi_{\ell_2} p_2 = 0$. This implies that the projections $p_1$ and $p_2$ are collinear, and so condition (iii) is satisfied.

(iii)⇒(i):
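Suppose that condition (iii) holds. We first show that condition (ii) follows, and then derive condition (i) from (ii). The following two cases are possible.
Case 1. The matrix $M_k$ in (16) has a row with at most two nonzero entries. Denote by $\ell$ the index of this row, where $1 \le \ell \le k$. By using the same notation $m_\ell$ for this row and the same linear combination $\lambda_\ell$ of the variables as in (24), we get
$$\lambda_\ell = m_\ell X. \tag{29}$$
If $m_\ell$ has only one nonzero entry, then a classical compromise occurs, which is a special case of a 2-compromise. Otherwise, let $\ell_1, \ell_2$ be the indices of the two nonzero entries in $m_\ell$, where $1 \le \ell_1 < \ell_2 \le n$. Denote these two nonzero entries of $m_\ell$ by $\nu_1$ and $\nu_2$. Then it follows from (29) that
$$\lambda_\ell = \nu_1 x_{\ell_1} + \nu_2 x_{\ell_2}.$$
The $\ell$-th linear equation of the system (16) shows that
$$\lambda_\ell = v_\ell,$$
where $v_\ell$ is the $\ell$-th component of the right-hand-side column vector of the MEAN equations in (16). Therefore the value of the statistic $\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2}$ is equal to $v_\ell$. This establishes a 2-compromise derived by using only the set of MEAN queries. Thus, condition (ii) holds.
Case 2. The matrix $M_k$ in (16) has two rows with collinear projections on the right block $M'_k$ of $M_k = [I_k \mid M'_k]$ in (17). Denote by $\ell_1, \ell_2$ the indices of these rows, where $1 \le \ell_1 < \ell_2 \le k$. Denote by $p_1$ and $p_2$ the projections of the rows $m_{\ell_1}$ and $m_{\ell_2}$ on the matrix $M'_k$, respectively. Given that $p_1$ and $p_2$ are collinear, we can multiply one of these vectors by an appropriate coefficient and obtain the second vector. Without loss of generality, we may assume that there exists a coefficient $\varphi$ such that $p_1 = \varphi p_2$. Because $I_k$ is an identity matrix and the projection of the vector $m_{\ell_1} - \varphi m_{\ell_2}$ on the matrix $M'_k$ is equal to $p_1 - \varphi p_2 = 0$, it follows that
$$(m_{\ell_1} - \varphi m_{\ell_2}) X = x_{\ell_1} - \varphi x_{\ell_2} = v_{\ell_1} - \varphi v_{\ell_2}. \tag{32}$$
This establishes a 2-compromise again, because equalities (32) show that the value of the statistic $x_{\ell_1} - \varphi x_{\ell_2}$ is known and is equal to the constant $v_{\ell_1} - \varphi v_{\ell_2}$.
This establishes that condition (ii) is satisfied in each of the cases, i.e., the attackers can achieve a 2-compromise by using only the set of MEAN queries. Let us introduce notation for the set of MVQ queries answered so far. Suppose that a set of $k$ queries consisting of the corresponding pairs of mean and variance for the set of the corresponding $k$ samples $S_1, \ldots, S_k$ have been submitted to the audit system. Applying (8), we can record the set of MEAN queries as a system of linear equations
$$\alpha_{i1} x_1 + \alpha_{i2} x_2 + \cdots + \alpha_{in} x_n = \beta_i, \tag{33}$$
where $i \in [1 : k]$, where $\beta_i$ is the outcome of the MEAN query, and where
$$\alpha_{ij} = \begin{cases} 1/|S_i| & \text{if } \mathbf{r}_j \in S_i, \\ 0 & \text{otherwise,} \end{cases} \tag{34}$$
for $j \in [1 : n]$. Denote the left-hand side of equality (33) by $q_i$.
Given that the attackers have achieved a 2-compromise by using only the queries of the system (33), they have derived the value $\eta$ of a statistic $\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2}$, for some $1 \le \ell_1 < \ell_2 \le n$, where $\nu_1^2 + \nu_2^2 \neq 0$. It follows that there exist coefficients $\xi_1, \ldots, \xi_k$ such that
$$\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2} = \xi_1 q_1 + \cdots + \xi_k q_k, \tag{35}$$
and the value of the statistic $\nu_1 x_{\ell_1} + \nu_2 x_{\ell_2}$ is equal to $\eta = \xi_1 \beta_1 + \cdots + \xi_k \beta_k$. For each MEAN query of the system (33), the corresponding VARIANCE query of the form (9) can be rewritten in the form (10). It follows that all VARIANCE queries can be recorded as the following system of equations expressed in terms of the quadratic variables $x_1^2, \ldots, x_n^2$:
$$\gamma_{i1} x_1^2 + \gamma_{i2} x_2^2 + \cdots + \gamma_{in} x_n^2 = \delta_i, \tag{36}$$
where $i \in [1 : k]$, where $\delta_i = \sigma_i^2 + \beta_i^2$, where $\sigma_i^2$ is the outcome of the $i$-th VARIANCE query and $\beta_i$ is the outcome from (33), and where
$$\gamma_{ij} = \begin{cases} 1/|S_i| & \text{if } \mathbf{r}_j \in S_i, \\ 0 & \text{otherwise.} \end{cases} \tag{37}$$
Denote the left-hand side of equality (36) by $\hat{q}_i$. Equalities (34) and (37) show that the coefficients $\alpha_{i1}, \ldots, \alpha_{in}$ in the system (33) coincide with the corresponding coefficients $\gamma_{i1}, \ldots, \gamma_{in}$ in the system (36). Therefore, it follows from (35) that
$$\nu_1 x_{\ell_1}^2 + \nu_2 x_{\ell_2}^2 = \xi_1 \hat{q}_1 + \cdots + \xi_k \hat{q}_k = \xi_1 \delta_1 + \cdots + \xi_k \delta_k. \tag{38}$$
Because at least one of the coefficients $\nu_1, \nu_2$ is nonzero, without loss of generality we may assume that $\nu_1 \neq 0$. Hence, (35) implies that
$$x_{\ell_1} = \frac{\eta - \nu_2 x_{\ell_2}}{\nu_1}. \tag{39}$$
Substituting (39) for $x_{\ell_1}$ in (38), we get
$$\frac{(\eta - \nu_2 x_{\ell_2})^2}{\nu_1} + \nu_2 x_{\ell_2}^2 = \xi_1 \delta_1 + \cdots + \xi_k \delta_k.$$
This is a quadratic equation in one variable $x_{\ell_2}$. It can be solved to determine the value of $x_{\ell_2}$, which achieves a compromise of $D$. Thus, condition (i) is satisfied. This completes the proof of Theorem 2.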
Theorem 3. Let $\mathcal{M}_k = (I_k \mid M_k)$ be the normalized matrix of the answered MVQ queries, and let $v$ be the vector of the coefficients of the mean in the next MVQ query. Then Algorithm 2 answers the next query only if it is safe to do so, that is, only if the QEA attack cannot be used to disclose sensitive data. Algorithm 2 ensures that the next query is rejected if the QEA attack can reveal sensitive data after an answer to this query.
Proof. The fact that the QAS system protects sensitive data from QEA attacks follows immediately from Theorem 2, because Algorithm 2 verifies condition (iii) of Theorem 2 and answers the next query only if Theorem 2 guarantees that sensitive data cannot be revealed by using the QEA attack after the query is answered.

Interval Inference Attacks
The class of IIA inference attacks is defined in Algorithm 3. It uses the following concepts. For a positive real number ε, we say that an ε-approximate compromise or an approximate compromise with precision ε has been achieved, if the attackers can determine x ∈ R such that they can deduce that the value of the confidential attribute in a record belongs to the interval [x, x + ε]. We say that an approximate compromise occurs if there exists ε such that an ε-approximate compromise has been achieved.
To protect sensitive data from IIA attacks, we design an interval audit system (IAS). It is described in Algorithm 4.
A formal proof that the IAS system protects sensitive data from IIA attacks is presented in Theorem 4. It relies on Theorem 5, which gives necessary and sufficient conditions for an approximate compromise to occur.

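Algorithm 3 Interval Inference Attack.
Input: A set of MVQ queries with query sample $S_j$, mean $m_j$, variance $\sigma_j^2$, for $j \in [1 : \ell]$.
Output: Index $s$ of a record and the lower and upper bounds $L$, $U$ for the sensitive attribute in the record.
1: $S = \bigcup_{j=1}^{\ell} S_j$.
2: for all $r \in S$ do
3:   $L_r \leftarrow -\infty$; $U_r \leftarrow +\infty$.
4: end for
5: for all $j \in [1 : \ell]$ do
6:   for all $r \in S_j$ do
7:     $L_r \leftarrow \max\{L_r,\ m_j - \sigma_j \sqrt{|S_j| - 1}\}$;
8:     $U_r \leftarrow \min\{U_r,\ m_j + \sigma_j \sqrt{|S_j| - 1}\}$.
9:   end for
10: end for
11 to 16: Find the index $s$ of a record $r \in S$ for which the length $|U_r - L_r|$ of the interval $[L_r, U_r]$ is minimal, and return $s$, $L = L_r$, $U = U_r$.

Theorem 4. Let $\varepsilon > 0$, and suppose that a set of MVQ queries with samples $S_j$, means $m_j$ and variances $\sigma_j^2$ has already been answered, for $j \in [1 : \ell]$. Let $S$ be the set of all records occurring in any of these already answered queries, and let $L_r$, $U_r$ be the values defined for $r \in S$ in Algorithm 3. Let $T$ be the sample of records of the next submitted MVQ query. Then Algorithm 4 answers the next query only if it is safe to do so, that is, only if the IIA attack cannot result in an ε-approximate compromise of sensitive data. Algorithm 4 ensures that the next query is rejected if the IIA attack can result in an ε-approximate compromise after an answer to this query.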
Proof. The fact that the IAS system protects sensitive data from IIA attacks follows immediately from Theorem 5, because Algorithm 4 maintains the bounds $L_r$, $U_r$ described in Theorem 5 and answers the next query only if Theorem 5 guarantees that an ε-approximate compromise does not occur after the query is answered.
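The audit performed by Algorithm 4 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the class name `IntervalAudit` and its attributes are hypothetical, the variance is assumed to be the population variance (division by the sample size), and the sketch commits the updated bounds only when the query is answered, which is a design choice of this example that the paper does not specify.

```python
import math


class IntervalAudit:
    """Hypothetical helper that tracks the bounds [L_r, U_r] for every record."""

    def __init__(self, values, eps):
        self.values = values  # confidential attribute, keyed by record id
        self.eps = eps        # precision of the approximate compromise to prevent
        self.lower = {}       # L_r for all records seen in answered queries
        self.upper = {}       # U_r for all records seen in answered queries

    def query(self, sample):
        """Return (mean, variance) of the sample, or None if the query is rejected."""
        xs = [self.values[r] for r in sample]
        n = len(xs)
        m = sum(xs) / n
        var = sum((x - m) ** 2 for x in xs) / n  # population variance (assumption)
        half = math.sqrt(var) * math.sqrt(n - 1)  # half-width of the bound
        # Tentatively tighten the bounds of every record in the new sample.
        lower, upper = dict(self.lower), dict(self.upper)
        for r in sample:
            lower[r] = max(lower.get(r, -math.inf), m - half)
            upper[r] = min(upper.get(r, math.inf), m + half)
        # Reject if any interval would become as short as eps or shorter.
        if min(upper[r] - lower[r] for r in lower) <= self.eps:
            return None
        self.lower, self.upper = lower, upper
        return m, var
```

With the values 10, 20, 30, 40 for records 1 to 4 and `eps = 5`, the queries over `[1, 2, 3, 4]` and then `[1, 2]` are answered, leaving the interval [10, 20] for record 1, while a subsequent query over `[1]` alone is rejected because it would shrink that interval to a single point.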

Algorithm 4 Interval Audit System.
Input: $\varepsilon > 0$ such that the system must protect from ε-approximate compromise. The set of already answered MVQ queries with $m_j$, $\sigma_j^2$, $j \in [1 : \ell]$, $S$, and $L_r$, $U_r$ defined for $r \in S$ in Algorithm 3. The new MVQ query with sample $T$.
Output: Reject the query if it leads to an ε-approximate compromise. Otherwise, return $m$ and $\sigma^2$ for the new query.
1: Compute the mean $m$ and variance $\sigma^2$ for $T$.
2: for all $r \in S \cap T$ do
3:   $L_r \leftarrow \max\{L_r,\ m - \sigma \sqrt{|T| - 1}\}$;
4:   $U_r \leftarrow \min\{U_r,\ m + \sigma \sqrt{|T| - 1}\}$.
5: end for
6: for all $r \in T \setminus S$ do
7:   $L_r \leftarrow m - \sigma \sqrt{|T| - 1}$;
8:   $U_r \leftarrow m + \sigma \sqrt{|T| - 1}$.
9: end for
10: if $\min\{|U_r - L_r| : r \in S \cup T\} \le \varepsilon$ then
11:   Reject the query.
12: else
13:   Output $m$, $\sigma^2$.
14: end if

Theorem 5. Algorithm 3 returns the index $s$ of a record $r = (r_1, \dots, r_n) \in D$ and an interval $[L, U] = [L_r, U_r]$ such that it is guaranteed that $r_1 \in [L, U]$ and the length $|U_r - L_r|$ of the interval achieves the minimum value. There exist two databases $D_L$ and $D_U$ such that the record $r_L$ with index $s_L$ found by Algorithm 3 in $D_L$ has confidential attribute $r_1$ equal to $L$, and the record $r_U$ with index $s_U$ in $D_U$ has confidential attribute equal to $U$.
Proof. Suppose that Algorithm 3 is applied to a set of samples of MVQ queries indexed by $j \in [1 : \ell]$, with query sample $S_j$ consisting of records $r = (r_1, \dots, r_n) \in S_j$ such that the mean and variance of the confidential components $r_1$, for $r \in S_j$, are equal to $m_j$ and $\sigma_j^2$, respectively.
For each record $r \in S = \bigcup_{j=1}^{\ell} S_j$, it is easily seen that lines 2 to 9 of Algorithm 3 compute the following values

$$L_r = \max\{\, m_j - \sigma_j \sqrt{|S_j| - 1} \;:\; j \in [1 : \ell],\ r \in S_j \,\}, \tag{41}$$

$$U_r = \min\{\, m_j + \sigma_j \sqrt{|S_j| - 1} \;:\; j \in [1 : \ell],\ r \in S_j \,\}. \tag{42}$$

For any sample $S_j$, where $j \in [1 : \ell]$, and any record $r = (r_1, \dots, r_n) \in S_j$, the following Samuelson's inequalities were proven in [43]:

$$m_j - \sigma_j \sqrt{|S_j| - 1} \;\le\; r_1 \;\le\; m_j + \sigma_j \sqrt{|S_j| - 1}. \tag{43}$$

Combining equalities (41) and (42) with all inequalities (43) for one fixed record $r \in S$ and all samples $S_j$, for $j \in [1 : \ell]$, containing $r$, we get

$$L_r \le r_1 \le U_r. \tag{44}$$

It is clear that lines 11 to 16 of Algorithm 3 find the index $s$ of the record $r$ such that the length $|U_r - L_r|$ of the interval $[L_r, U_r]$ achieves the minimum value.

Let $D_L$ be a database with $n$ records $\vec{r}[1], \dots, \vec{r}[n]$. Suppose that there is just one sample $S$ containing all records of $D_L$ and that the mean $\mu$ and variance $\sigma^2$ are given and fixed. Let

$$\vec{r}[1]_1 = L, \qquad \vec{r}[i]_1 = \mu + \frac{\sigma}{\sqrt{n - 1}} \quad \text{for } i \in [2 : n],$$

where $L = \mu - \sigma\sqrt{n - 1}$ and $U = \mu + \sigma\sqrt{n - 1}$ are the values defined by (41) and (42), respectively, for this single sample. It is routine to verify that the mean of the confidential attributes of all records in $D_L$ is equal to $\mu$ and the variance is equal to $\sigma^2$. Then Algorithm 3 computes

$$L_{\vec{r}[i]} = L, \qquad U_{\vec{r}[i]} = U \quad \text{for all } i \in [1 : n].$$

Therefore, Algorithm 3 returns $s_L = 1$, $L$, $U$. Because $\vec{r}[1]_1 = L$, this example shows that in full generality, the value $L$ cannot be improved. A similar example shows that in general the value $U$ cannot be improved either.
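The sharpness of the lower bound can be checked numerically. The following minimal Python sketch builds one database that attains the bound. The particular construction (one value at the lower bound, all remaining values equal) is an assumption of this sketch, chosen because it is the standard extremal configuration for Samuelson's inequality, and the function names are hypothetical. The variance is taken to be the population variance.

```python
import math


def extremal_database(mu, sigma, n):
    """Confidential values with mean mu and population variance sigma**2
    whose first value sits exactly at the lower bound mu - sigma*sqrt(n - 1).
    Requires n >= 2."""
    first = mu - sigma * math.sqrt(n - 1)
    others = mu + sigma / math.sqrt(n - 1)
    return [first] + [others] * (n - 1)


def mean_and_variance(xs):
    """Mean and population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)
```

For `mu = 10`, `sigma = 3` and `n = 5` the values are 4 followed by four copies of 11.5. Their mean is 10 and their variance is 9, and the first value equals 10 − 3·√4 = 4, so no interval with a larger left end point can be guaranteed from the mean and variance alone.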

Group Compromise
Let $c, k$ be positive integers such that $c \le k$, and let $\mathcal{M}_k = (I_k \mid M_k)$ be the normalized basis matrix of a set of linear queries as in (6) and (7). We use the following well-known definitions and facts of matrix theory (see [38]). The rank of a matrix is equal to the dimension of the vector space spanned by the rows of the matrix. It is also equal to the maximum number of linearly independent rows of the matrix. The rank of a matrix with $k$ rows is less than $k$ if and only if the rows of the matrix are linearly dependent, i.e., there exists a nontrivial linear combination of the rows equal to zero. The rank of the matrix $\mathcal{M}_k$ is equal to $k$, because it contains the identity matrix $I_k$.

Theorem 6. Let $c, k$ be positive integers such that $c \le k$, and let $\mathcal{M}_k = (I_k \mid M_k)$ be the normalized basis matrix (7) of a set of linear queries for the database $D$. Then the following conditions are equivalent.
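(i) The database $D$ is $c$-compromised by the set of linear queries with the normalized basis matrix $\mathcal{M}_k$.
(ii) There exist $c$ columns in $\mathcal{M}_k$ such that after deletion of these columns the rank of the remaining matrix becomes less than $k$.
(iii) There exist $s$ and $t$ with $s + t = c$ such that it is possible to remove $s$ columns of $M_k$ and in this new matrix find $t$ rows that span a space of dimension less than $t$.

Proof. Denote the rows of the matrix $\mathcal{M}_k$ by $m_1, \dots, m_k$. For $j \in [1 : n]$, let $e_j = (e_{j1}, \dots, e_{jn})$ be the vector with components $e_{j\ell}$, for $\ell \in [1 : n]$, defined by

$$e_{j\ell} = \begin{cases} 1 & \text{if } \ell = j, \\ 0 & \text{otherwise.} \end{cases}$$

Let $X = [x_1, \dots, x_n]^T$ be the column of the confidential variables.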
(i)⇒(ii) Suppose that condition (i) holds. Then there exist coefficients $\nu_1, \dots, \nu_k$, not all equal to zero, such that the linear combination $\sum_{i=1}^{k} \nu_i m_i$ has at most $c$ nonzero components. Therefore it can be represented in the form

$$\sum_{i=1}^{k} \nu_i m_i = \xi_1 e_{i_1} + \dots + \xi_c e_{i_c} \tag{53}$$

for some positive integers $1 \le i_1 < \dots < i_c \le n$ and some $\xi_1, \dots, \xi_c \in \mathbb{R}$. Let $\mathcal{M}'_k$ be the matrix obtained from the matrix $\mathcal{M}_k$ by deleting all columns with indices $i_1, \dots, i_c$. Denote by $m'_1, \dots, m'_k$ the rows obtained from the rows $m_1, \dots, m_k$ by deleting all columns $i_1, \dots, i_c$. It follows from (53) that $\sum_{i=1}^{k} \nu_i m'_i = 0$. Therefore the rows of the matrix $\mathcal{M}'_k$ are linearly dependent. It follows that the rank of $\mathcal{M}'_k$ is less than $k$. Thus, condition (ii) is satisfied.
(ii)⇒(i) Suppose that condition (ii) holds. Then there exist $c$ columns in the matrix $\mathcal{M}_k$ such that the rank of the matrix $\mathcal{M}'_k$ obtained by deleting these columns is less than $k$. Denote the indices of these columns by $i_1, \dots, i_c$, where $1 \le i_1 < \dots < i_c \le n$. Let $m'_1, \dots, m'_k$ be the rows obtained from the rows $m_1, \dots, m_k$ of $\mathcal{M}_k$ by deleting all columns $i_1, \dots, i_c$. It follows that the rows $m'_1, \dots, m'_k$ are linearly dependent, i.e., there exist coefficients $\nu_1, \dots, \nu_k$, not all equal to zero, such that $\sum_{i=1}^{k} \nu_i m'_i = 0$. Hence, equality (53) holds true, for some $\xi_1, \dots, \xi_c$. Therefore the statistic (53) produces a $c$-compromise of the database $D$. Thus, condition (i) is satisfied.
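(i)⇒(iii) Suppose that there is a $c$-compromise. As above, there exist coefficients $\nu_1, \dots, \nu_k$, not all equal to zero, such that the sum $\sum_{i=1}^{k} \nu_i m_i$ can be represented in the form (53), for some $1 \le i_1 < \dots < i_c \le n$ and $\xi_1, \dots, \xi_c$. Let $s$ be the number of the indices $i_1, \dots, i_c$ that are greater than $k$. Put $t = c - s$. Then

$$1 \le i_1 < \dots < i_t \le k < i_{t+1} < \dots < i_c \le n. \tag{54}$$

Denote by $N$ the matrix obtained from $M_k$ by deleting the columns with indices

$$i_{t+1} - k, \dots, i_c - k. \tag{55}$$

Let $\mathcal{M}'$ be the matrix obtained from $\mathcal{M}_k$ by deleting the columns with indices

$$i_{t+1}, \dots, i_c. \tag{56}$$

This means that $\mathcal{M}'$ is obtained from $\mathcal{M}_k$ by replacing $M_k$ with $N$. Then (7) implies that

$$\mathcal{M}' = (I_k \mid N). \tag{57}$$

Denote the rows of the matrix $\mathcal{M}'$ by $m'_1, \dots, m'_k$. Let $\varepsilon_1, \dots, \varepsilon_k$ be the rows of the identity matrix $I_k$, and let $p_1, \dots, p_k$ be the rows of the matrix $N$. Then we have

$$m'_i = (\varepsilon_i \mid p_i) \tag{58}$$

for $i \in [1 : k]$. Denote by $e'_1, \dots, e'_n$ the vectors obtained from $e_1, \dots, e_n$ by deleting the components with indices (56). Clearly,

$$e'_{i_{t+1}} = \dots = e'_{i_c} = 0. \tag{59}$$

Therefore, equality (53) implies that

$$\sum_{i=1}^{k} \nu_i m'_i = \xi_1 e'_{i_1} + \dots + \xi_t e'_{i_t}. \tag{60}$$

It follows that the sum $\sum_{i=1}^{k} \nu_i m'_i$ has at most $t$ nonzero components corresponding to the $t$ vectors $e'_{i_\ell}$ in the right-hand side of (60). Therefore (58), (60) and the definition of $\varepsilon_i$ show that

$$\nu_i = 0 \quad \text{for all } i \notin \{i_1, \dots, i_t\}. \tag{61}$$

Hence, (58), (60) and (61) imply that

$$\sum_{\ell=1}^{t} \nu_{i_\ell} m'_{i_\ell} = \xi_1 e'_{i_1} + \dots + \xi_t e'_{i_t}. \tag{62}$$

Because the last $n - k - s$ components of the right-hand side of (62) are equal to zero, it follows that $\sum_{\ell=1}^{t} \nu_{i_\ell} p_{i_\ell} = 0$, where by (61) not all of the coefficients $\nu_{i_1}, \dots, \nu_{i_t}$ are equal to zero. This means that the vectors $p_{i_1}, \dots, p_{i_t}$ are linearly dependent. Because these vectors are rows of the matrix $N$, we see that these $t$ rows of the matrix $N$ span a space of dimension less than $t$. Thus, condition (iii) is satisfied.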
(iii)⇒(i) Suppose that condition (iii) holds. Then there exist $s$ and $t$ such that it is possible to remove $s$ columns with indices $i_{t+1} - k, \dots, i_c - k$ from the matrix $M_k$ and in this new matrix $N$ find $t$ rows $p_{i_1}, \dots, p_{i_t}$ that span a space of dimension less than $t$. (For consistency, here we introduce and use the same notation as in the proof of the preceding implication, so that the numbers $i_{t+1}, \dots, i_c$ refer to the indices of the corresponding columns in the matrix $\mathcal{M}_k$.) Then these rows are linearly dependent, and so there exists a nontrivial linear combination equal to zero,

$$\sum_{\ell=1}^{t} \nu_{i_\ell} p_{i_\ell} = 0, \tag{63}$$

for some $\nu_{i_1}, \dots, \nu_{i_t}$. Consider the following linear combination

$$\phi = \sum_{\ell=1}^{t} \nu_{i_\ell} m_{i_\ell}. \tag{64}$$

Because $I_k$ is the identity matrix, it follows from (58) that, if we look at the last $n - k$ components of the vector $\phi$, then all nonzero values among these components correspond to the $s$ columns $i_{t+1}, \dots, i_c$ of the matrix $\mathcal{M}_k$, that is, to the columns of the submatrix $M_k$ that were deleted in the discussion above. All the other values among the last $n - k$ components of $\phi$ are equal to zero by (63). Therefore, there are at most $s$ nonzero values among the last $n - k$ components of the vector $\phi$.
On the other hand, because $I_k$ is an identity matrix and $\phi$ is a linear combination of $t$ rows of $\mathcal{M}_k$, it follows that there are at most $t$ nonzero coordinates among the first $k$ components of the vector $\phi$. In total, we see that $\phi$ has at most $s + t = c$ nonzero components. It follows that the linear combination (64) of the rows of the matrix $\mathcal{M}_k$ produces a $c$-compromise of the database $D$. Thus, condition (i) is satisfied. This completes the proof of Theorem 6.
Note that the running times of the algorithms for the detection of a $c$-compromise using conditions (ii) and (iii) are $O\bigl(k^2 \binom{n}{c}\bigr)$ and $O\bigl(2^c c^2 \binom{n}{c}\bigr)$, respectively.
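Condition (ii) of Theorem 6 translates directly into a brute-force test. The following Python sketch is an illustration under stated assumptions rather than the algorithm analysed in the paper: the function names `rank` and `find_c_compromise` are hypothetical, the matrix is passed as a list of rows of the full normalized matrix, columns are indexed from 0, and ranks are computed by plain Gaussian elimination over exact rationals, so the running time of this sketch need not match the bounds stated above.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over rationals."""
    mat = [[Fraction(x) for x in row] for row in rows]
    ncols = len(mat[0]) if mat else 0
    rk = 0
    for col in range(ncols):
        # Look for a pivot in this column among the rows not yet used.
        pivot = next((i for i in range(rk, len(mat)) if mat[i][col] != 0), None)
        if pivot is None:
            continue
        mat[rk], mat[pivot] = mat[pivot], mat[rk]
        for i in range(rk + 1, len(mat)):
            factor = mat[i][col] / mat[rk][col]
            mat[i] = [a - factor * b for a, b in zip(mat[i], mat[rk])]
        rk += 1
    return rk


def find_c_compromise(matrix, c):
    """Return a tuple of c column indices whose deletion makes the rank
    drop below the number of rows k, or None if no such columns exist."""
    k, n = len(matrix), len(matrix[0])
    for cols in combinations(range(n), c):
        kept = [[row[j] for j in range(n) if j not in cols] for row in matrix]
        if rank(kept) < k:
            return cols
    return None
```

For the matrix with rows (1, 0, 1, 1) and (0, 1, 1, 1), no single column can be deleted to reduce the rank, so `find_c_compromise(matrix, 1)` returns `None`. Deleting the first two columns leaves two equal rows, so `find_c_compromise(matrix, 2)` returns `(0, 1)`, which matches the fact that the difference of the two rows, (1, −1, 0, 0), has exactly two nonzero components.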

Discussion
The results obtained in this paper advance theoretical knowledge devoted to the protection of private and confidential information and prepare a foundation for the development of future comprehensive privacy protection systems.
At the same time, the results obtained have certain limitations, which motivate future work. Next, we formulate and discuss examples of directions for future research, which are motivated by our results and will need to be addressed in separate subsequent publications.
The first limitation of our results is explained by the general approach adopted in the previous papers [14,18–24]. This approach gives only exact and correct answers to the queries submitted by the clients. However, if the system detects that a query can compromise confidential information, then it only replies that the query cannot be answered. The present paper also uses this approach.
The advantage of this approach is that, when a new query submitted by a client is found not to lead to a disclosure of confidential information, the client receives an exact answer to the query. However, if it is discovered that a query leads to disclosure of confidential information, then no answer is given, and the client does not receive any helpful response.
To tackle this issue, it may be a good idea to investigate how to supply the client with some additional information expressed, for example, in terms of evaluation of probabilities. We suggest the following direction for future research.

Direction 1.
Investigate and develop hybrid systems, which provide an exact answer to a query if it does not lead to disclosure of confidential information, and which use differential privacy techniques to provide a randomised probabilistic response to a query if it leads to disclosure of confidential information.
The second limitation of our proposed systems is their focus on particular novel classes of attacks that have not been considered previously. A system that provides protection against these attacks may still remain vulnerable to various other types of attacks. Therefore, for practical applications it is essential to consider systems providing simultaneous protection against various types of attacks without incurring a prohibitive computational overhead.

Direction 2.
Design and investigate combined comprehensive systems providing answers to aggregated queries with simultaneous protection of confidential data against various different types of attacks without incurring a prohibitive computational overhead. Consider novel approaches to the optimisation of the performance of these systems.
The third limitation of [14,18–24] and our systems is explained by the fact that this research still remains at the theoretical stage of development, when it is paramount to develop a comprehensive theory. Clearly, useful systems can be implemented as practical software only when there is a sufficiently rigorous theoretical foundation and only after significant advances on Direction 2 are achieved. After that, it will become important to design software implementations and conduct experimental studies comparing their performance for various categories of practical datasets. This motivates the following direction.
The fourth limitation of our systems is the assumption that the whole collection of data is known to the system answering queries. Therefore, the systems cannot operate in the federated learning scenario. Because federated learning is a rapidly growing area of research where aggregation techniques play significant roles (see, for example, the surveys [45,46]), we propose the following direction for future research.
Directions 1 to 4 are recorded here in general form for arbitrary queries, even though the present article motivates the investigation of these directions with a focus on the MVQ queries as the very first option for consideration.

Conclusions
This paper investigated nonlinear queries, which had not been considered in the literature before. It contributed to the development of formal theory by designing new systems for the protection against inference attacks and obtaining novel rigorous conditions that guarantee that the confidential information remains protected. The paper presented the following contributions to the advancement of knowledge on the preservation of privacy of confidential information:
• Definitions of the MVQ queries (Section 4.1) and the QEA attacks (Algorithm 1).
• The design of a QAS system for the protection of confidential information against the QEA attacks (Algorithm 2).
• Theorems 2 and 3 prove that QAS systems guarantee protection against the QEA attacks.
• Definition of the IIA attacks (Algorithm 3).
• The design of an IAS system for the protection of sensitive data from the IIA attacks (Algorithm 4).
• Theorems 4 and 5 prove that IAS systems ensure protection against IIA attacks.
• Theorem 6 provides stringent matrix conditions for the protection of confidential information from a group compromise.
Four directions for future research were discussed and presented in Section 5.