Equality (10) allows us to recover the outcome of each VARIANCE query from the mean value of the squares of the values of the confidential attribute. Therefore, in order to store an MVQ query in computer memory, it is enough to keep a record of the coefficients that occur in the MEAN query, the outcome of the MEAN query, and the mean value of the squares of the values of the confidential attribute. Accordingly, to store a set of MVQ queries, we can use the following pair of matrix equalities,
$$MX=V,\qquad MY=W, \tag{11}$$
where
M is the matrix storing the coefficients of the MEAN queries,
$X={[{x}_{1},\cdots ,{x}_{n}]}^{T}$ is the column of the confidential values
${x}_{1},\cdots $,
${x}_{n}$,
$Y={[{x}_{1}^{2},\cdots ,{x}_{n}^{2}]}^{T}$ is the column of the squares of the confidential values,
V is the column vector with the values returned by the MEAN queries, and where
W is the column vector of the mean values of the squares of the confidential values. In concise matrix notation, Equation (11) can be stored as the following matrix:
The following example illustrates our matrix notation.
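As a minimal sketch of this storage scheme (an illustration with made-up values, not the example from the paper), the following Python code records each MVQ query as a row of M together with the MEAN outcome and the mean of the squares, and recovers the variance as the mean of the squares minus the square of the mean. The class `MVQStore`, its method names, and the assumption that a MEAN query over a sample S has coefficient 1/|S| on the records in S and 0 elsewhere are all hypothetical choices made for illustration.

```python
from fractions import Fraction


def dot(row, column):
    """Inner product of a coefficient row with a column of values."""
    return sum(r * c for r, c in zip(row, column))


class MVQStore:
    """Hypothetical container for answered MVQ queries over n records."""

    def __init__(self, n):
        self.n = n
        self.M = []  # one row of MEAN-query coefficients per query
        self.V = []  # outcomes of the MEAN queries
        self.W = []  # mean values of the squares

    def add_query(self, sample, mean, variance):
        # Assumption: record j in the sample gets coefficient 1/|S|, others 0.
        size = len(sample)
        row = [Fraction(1, size) if j in sample else Fraction(0)
               for j in range(self.n)]
        self.M.append(row)
        self.V.append(Fraction(mean))
        # Mean of squares = variance + mean^2, so the variance is recoverable.
        self.W.append(Fraction(variance) + Fraction(mean) ** 2)

    def variance(self, i):
        """Recover the i-th VARIANCE outcome from the stored V and W."""
        return self.W[i] - self.V[i] ** 2


# Made-up confidential values; the query covers records 0 and 1.
x = [2, 4, 9]
store = MVQStore(n=3)
store.add_query({0, 1}, mean=3, variance=1)
```

With these values, the stored row satisfies both matrix equalities: its product with the confidential values gives the MEAN outcome, and its product with their squares gives the stored mean of squares.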
Next, we define the first type of a nonlinear inference attack, the QEA attack, which can be used by an adversary to compromise MVQ queries. Steps of the QEA attack are explained in Algorithm 1.
To protect sensitive data from QEA attacks, we design a quadratic audit system (QAS). It is described in Algorithm 2 by using the following matrix notation.
A formal proof establishing that the QAS guarantees protection of sensitive data from QEA attacks is given in Theorem 3. It relies on Theorem 2, which gives matrix conditions that are necessary and sufficient for the QEA attack to reveal confidential data.
Proof of Theorem 2. As in the proof of the main theorem of [14] and in other previous publications, we assume that the attackers can learn the outcome of the COUNT query corresponding to each of their queries. It is important to ensure rigorous protection of privacy under this assumption, because the attackers have the following three easy ways to gain access to the outcomes of the COUNT queries.
(a) The COUNT query is a legitimate query. It can be submitted to the database and may be answered as a separate query.
(b) The COUNT query can be included as an integral part of every SUM query or linear query.
(c) It may be easy for the attackers to gain access to the values of some COUNT queries by using additional information, legal knowledge, or insider knowledge.
Theorem 1 and its proof also assume that the audit system must provide protection against database compromise even if the attackers can gain access to the COUNT queries. Without this assumption, Theorem 1 is invalid. Indeed, even if the attackers manage to obtain the outcome of a query that equals the value of a confidential attribute in just one record, they will be unable to recognize this, since without the outcome of the COUNT query they will not know whether the outcome corresponds to a single record or to many records. This is why it is common practice to assume that the attackers can also gain access to the outcomes of the corresponding COUNT queries, and that the audit system must provide protection in these circumstances.
(i)⇒(ii): Suppose that condition (i) holds, i.e., the QEA could be used to achieve a compromise of D. Let us refer to the definition of the QEA attack in Algorithm 1.
First, we consider the case where the attackers managed to achieve a compromise in Step 1 of the Quadratic Equation Attack. In this case, Step 1 results in a compromise achieved by using only the set of MEAN queries. Every classical compromise is an example of a 2-compromise required for condition (ii). Therefore in this case condition (ii) follows immediately.
Now, we assume that the attackers had to proceed to the remaining steps of the QEA. This means that they found an element
$t\in [1:n]$ and a subset
$T\subseteq [1:n]$ with properties (A1) and (A2). Let us take the equality
${x}_{1}={\gamma}_{1}{x}_{t}+{\delta}_{1}$, which is the first equality of the system (
22). It implies that
${x}_{1}-{\gamma}_{1}{x}_{t}={\delta}_{1}$. Therefore, the attackers have managed to derive the value
${\delta}_{1}$ of the statistic
${x}_{1}-{\gamma}_{1}{x}_{t}$, which depends on at most two variables. This means that the attackers have achieved a 2-compromise by using only the set of MEAN queries, and so condition (ii) holds again.
(ii)⇒(iii): Suppose that condition (ii) holds, i.e., the attackers have managed to achieve a 2-compromise of
D by using only MEAN queries. This means that they derived the value
$\eta $ of a statistic
${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}$, for some
$1\le {\ell}_{1}<{\ell}_{2}\le n$, where
${\nu}_{1}^{2}+{\nu}_{2}^{2}\ne 0$. Denote the rows of the matrix
M by
${m}_{1},\cdots ,{m}_{k}$. For
$i\in [1:k]$, let us denote by
${\lambda}_{i}$ the linear combination of the variables
${x}_{1},\cdots ,{x}_{n}$ corresponding to the
i-th row of the matrix
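M. This means that
$${\lambda}_{i}={m}_{i}X, \tag{24}$$
where
$X={[{x}_{1},\cdots ,{x}_{n}]}^{T}$. Then, as in (35) above, again it follows that there exist
${\xi}_{1},\cdots ,{\xi}_{k}$ such that
$${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}={\xi}_{1}{\lambda}_{1}+\cdots +{\xi}_{k}{\lambda}_{k}. \tag{25}$$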
First, we consider the case where ${\nu}_{1}=0$. Then the value $\eta ={\nu}_{2}{x}_{{\ell}_{2}}$ provides a 1-compromise. Hence, Theorem 1 implies that the normalized basis matrix ${M}_{k}$ of the audit system has a row with only one nonzero entry. Therefore condition (iii) is satisfied.
Second, if ${\nu}_{2}=0$, then it follows in the same way that condition (iii) holds true, as well.
Third, it remains to treat the case where
${\nu}_{1},{\nu}_{2}\ne 0$. Note that
${M}_{k}=[{I}_{k}\mid {M}_{k}^{\prime}]$ as in (
7). Let us keep in mind that because
${I}_{k}$ is an identity matrix, it follows that every nonzero linear combination of the rows of
M has at least one nonzero component in the first
k columns. Applying this to the linear combination (
25), we see that
${\ell}_{1}\le k$. Furthermore, the following two subcases are possible and we consider them separately.
Subcase 1. ${\ell}_{2}>k$. This means that
${x}_{{\ell}_{2}}$ belongs to the columns of the matrix
${M}_{k}^{\prime}$, which is the right block of the matrix
${M}_{k}=[{I}_{k}\mid {M}_{k}^{\prime}]$ in (
7). Clearly, the sum
${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}$ has only one nonzero component in the first
k columns. More specifically, the only nonzero component of this sum in the first
k columns is the
${\ell}_{1}$-th component. Because
${I}_{k}$ is an identity matrix, it follows from (
25) that
${\xi}_{{\ell}_{1}}\ne 0$ and
Subcase 2. ${\ell}_{2}\le k$. This means that
${x}_{{\ell}_{1}}$,
${x}_{{\ell}_{2}}$ belong to the columns of the matrix
${I}_{k}$ in
${M}_{k}$. Hence, we get
${\xi}_{{\ell}_{1}},{\xi}_{{\ell}_{2}}\ne 0$ and all the other coefficients
${\xi}_{i}$ are equal to 0, i.e.,
Therefore, all coefficients of this linear combination in the last $(n-k)$ columns are equal to zero. Denote by ${p}_{{\ell}_{1}}$ and ${p}_{{\ell}_{2}}$ the projections of the rows ${m}_{{\ell}_{1}}$ and ${m}_{{\ell}_{2}}$ on the matrix ${M}_{k}^{\prime}$, respectively. It follows that ${\xi}_{{\ell}_{1}}{p}_{{\ell}_{1}}+{\xi}_{{\ell}_{2}}{p}_{{\ell}_{2}}=0$. This implies that the projections ${p}_{{\ell}_{1}}$ and ${p}_{{\ell}_{2}}$ are collinear, and so condition (iii) is satisfied.
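As a small illustration of Subcase 2 with made-up numbers (not taken from the paper), let $k=2$, $n=4$ and suppose that
$${M}_{k}=\left[\begin{array}{cc|cc}1& 0& 1& 2\\ 0& 1& 2& 4\end{array}\right],\qquad {p}_{1}=(1,2),\qquad {p}_{2}=(2,4).$$
Taking ${\xi}_{1}=2$ and ${\xi}_{2}=-1$ gives
$$2{p}_{1}-{p}_{2}=(0,0),\qquad 2{m}_{1}-{m}_{2}=(2,-1,0,0),$$
so the combination has zero coefficients in the last $n-k=2$ columns. The statistic $2{x}_{1}-{x}_{2}$ then depends on only two variables, and the projections ${p}_{1}$ and ${p}_{2}$ are collinear, in agreement with the argument above.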
(iii)⇒(i): Suppose that condition (iii) holds. The following two cases are possible.
Case 1. The matrix
${M}_{k}$ in (
16) has a row with at most two nonzero entries. Denote by
ℓ the index of this row, where
$1\le \ell \le k$. By using the same notation
${m}_{\ell}$ for this row and the same linear combination
${\lambda}_{\ell}$ of the variables as in (
24), we get
Let
${\ell}_{1}$,
${\ell}_{2}$ be the indices of the two nonzero entries in
${m}_{\ell}$, where
$1\le {\ell}_{1}<{\ell}_{2}\le n$. Denote these two nonzero entries of
${m}_{\ell}$ by
${\nu}_{1}$ and
${\nu}_{2}$. Then it follows from (
29) that
The
ℓ-th linear equation of the system (
16) shows that
where
${v}_{\ell}$ is the
ℓ-th component of the column vector
${V}^{\prime}$ in (
16). Therefore the value of the statistic
${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}$ is equal to
${v}_{\ell}$. This establishes a 2-compromise derived by using only the set of MEAN queries. Thus, condition (ii) holds.
Case 2. The matrix
${M}_{k}$ in (
16) has two rows with collinear projections on the matrix
${M}_{k}^{\prime}$ in (
17). Denote by
${\ell}_{1},{\ell}_{2}$ the indices of these rows, where
$1\le {\ell}_{1}<{\ell}_{2}\le k$. Denote by
${p}_{{\ell}_{1}}$ and
${p}_{{\ell}_{2}}$ the projections of the rows
${m}_{{\ell}_{1}}$ and
${m}_{{\ell}_{2}}$ on the matrix
${M}_{k}^{\prime}$, respectively. Given that
${p}_{{\ell}_{1}}$ and
${p}_{{\ell}_{2}}$ are collinear, we can multiply one of these vectors by an appropriate coefficient and obtain the second vector. Without loss of generality, we may assume that there exists a coefficient
$\phi $ such that
${p}_{{\ell}_{1}}=\phi {p}_{{\ell}_{2}}$. Because
${I}_{k}$ is an identity matrix and the projection of the vector
${m}_{{\ell}_{1}}-\phi {m}_{{\ell}_{2}}$ on the matrix
${M}_{k}^{\prime}$ is equal to
${p}_{{\ell}_{1}}-\phi {p}_{{\ell}_{2}}$, it follows that
This establishes a 2-compromise again, because equalities (
32) show that the value of the statistic
${x}_{{\ell}_{1}}-\phi {x}_{{\ell}_{2}}$ is known and is equal to the constant
${v}_{{\ell}_{1}}-\phi {v}_{{\ell}_{2}}$. This establishes that condition (ii) is satisfied in each of the two cases, i.e., the attackers can achieve a 2-compromise by using only the set of MEAN queries.
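The two cases above suggest a direct test of condition (iii) on a normalized basis matrix. The following Python sketch is only an illustration and is not Algorithm 2. It assumes that ${M}_{k}$ is given as a list of rows of the form $[{I}_{k}\mid {M}_{k}^{\prime}]$ with exact entries (integers or fractions), and the function names are hypothetical.

```python
from itertools import combinations


def projections_collinear(p, q):
    """True if p and q are linearly dependent, i.e. all 2x2 minors vanish."""
    return all(p[a] * q[b] - p[b] * q[a] == 0
               for a, b in combinations(range(len(p)), 2))


def satisfies_condition_iii(Mk, k):
    """Check the two cases of condition (iii) for Mk = [I_k | M'_k]."""
    # Case 1: some row has at most two nonzero entries.
    if any(sum(1 for entry in row if entry != 0) <= 2 for row in Mk):
        return True
    # Case 2: two rows have collinear projections on M'_k (columns k onward).
    projections = [row[k:] for row in Mk]
    return any(projections_collinear(p, q)
               for p, q in combinations(projections, 2))


# Made-up matrices for illustration.
collinear_rows = [[1, 0, 1, 2], [0, 1, 2, 4]]        # projections (1,2), (2,4)
sparse_row = [[1, 0, 0, 3], [0, 1, 2, 4]]            # first row: two nonzeros
independent_rows = [[1, 0, 1, 2, 1], [0, 1, 2, 1, 1]]
```

In this sketch the first two matrices satisfy condition (iii), while the third does not, because each of its rows has four nonzero entries and its projections $(1,2,1)$ and $(2,1,1)$ are not collinear.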
Let us introduce notation for the set of MVQ queries answered so far. Suppose that a set of
k queries consisting of the corresponding pairs of mean and variance for the set of the corresponding
k samples
${S}_{1},\cdots ,{S}_{k}$ have been submitted to the audit system. Applying (
8), we can record the set of MEAN queries as a system of linear equations
where
$i\in [1:k]$, where
${\beta}_{i}$ is the outcome of the MEAN query, and where
for
$j\in [1:n]$. Denote the left-hand side of equality (
33) by
${q}_{i}$.
Given that the attackers have achieved a 2-compromise by using only the queries of the system (
33), they have derived the value
$\eta $ of a statistic
${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}$, for some
$1\le {\ell}_{1}<{\ell}_{2}\le n$, where
${\nu}_{1}^{2}+{\nu}_{2}^{2}\ne 0$. It follows that there exist coefficients
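${\xi}_{1},\cdots ,{\xi}_{k}$ such that
$${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}={\xi}_{1}{q}_{1}+\cdots +{\xi}_{k}{q}_{k}, \tag{35}$$
and the value of the statistic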
${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}$ is equal to
$\eta ={\xi}_{1}{\beta}_{1}+\cdots +{\xi}_{k}{\beta}_{k}$.
For each MEAN query of the system (
33), the corresponding VARIANCE query of the form (
9) can be rewritten in the form (
10). It follows that all VARIANCE queries can be recorded as the following system of equations expressed in terms of the quadratic variables
${x}_{1}^{2}$,
${x}_{2}^{2},\cdots $,
${x}_{n}^{2}$
where
$i\in [1:k]$, where
${\delta}_{i}={\sigma}_{i}^{2}+{\beta}_{i}^{2}$, where
${\sigma}_{i}^{2}$ is the outcome of the
i-th VARIANCE query and
${\beta}_{i}$ is the outcome from (
33), and where
Denote the left-hand side of equality (
36) by
${\varrho}_{i}$. Equalities (
34) and (
37) show that the coefficients
${\alpha}_{i1},\cdots ,{\alpha}_{in}$ in the system (
33) coincide with the corresponding coefficients
${\gamma}_{i1},\cdots ,{\gamma}_{in}$ in the system (
36). Therefore, it follows from (
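35) that
$${\nu}_{1}{x}_{{\ell}_{1}}^{2}+{\nu}_{2}{x}_{{\ell}_{2}}^{2}={\xi}_{1}{\varrho}_{1}+\cdots +{\xi}_{k}{\varrho}_{k}={\xi}_{1}{\delta}_{1}+\cdots +{\xi}_{k}{\delta}_{k}. \tag{38}$$
Because at least one of the coefficients
${\nu}_{1},{\nu}_{2}$ is nonzero, without loss of generality we may assume that
${\nu}_{1}\ne 0$. Hence, (35) implies that
$${x}_{{\ell}_{1}}=\frac{\eta -{\nu}_{2}{x}_{{\ell}_{2}}}{{\nu}_{1}}. \tag{39}$$
Substituting (39) for
${x}_{{\ell}_{1}}$ in (38), we get
$${\nu}_{1}{\left(\frac{\eta -{\nu}_{2}{x}_{{\ell}_{2}}}{{\nu}_{1}}\right)}^{2}+{\nu}_{2}{x}_{{\ell}_{2}}^{2}={\xi}_{1}{\delta}_{1}+\cdots +{\xi}_{k}{\delta}_{k}.$$
This is a quadratic equation in one variable ${x}_{{\ell}_{2}}$. It can be solved to determine the value of ${x}_{{\ell}_{2}}$, which achieves a compromise of D. Thus, condition (i) is satisfied.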
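As a sketch of this last step, assume the attacker knows ${\nu}_{1}\ne 0$, ${\nu}_{2}$, the value $\eta$ of ${\nu}_{1}{x}_{{\ell}_{1}}+{\nu}_{2}{x}_{{\ell}_{2}}$, and the value $c$ of ${\nu}_{1}{x}_{{\ell}_{1}}^{2}+{\nu}_{2}{x}_{{\ell}_{2}}^{2}$. Eliminating ${x}_{{\ell}_{1}}$ and multiplying by ${\nu}_{1}$ gives ${\nu}_{2}({\nu}_{1}+{\nu}_{2}){x}_{{\ell}_{2}}^{2}-2\eta {\nu}_{2}{x}_{{\ell}_{2}}+({\eta}^{2}-{\nu}_{1}c)=0$. The function name `qea_candidates` and the numbers below are hypothetical. The sketch returns every real root, since a quadratic equation can have two, in which case additional information would be needed to choose between them.

```python
import math


def qea_candidates(nu1, nu2, eta, c):
    """Real solutions x_l2 of nu2*(nu1+nu2)*x^2 - 2*eta*nu2*x + (eta^2 - nu1*c) = 0."""
    if nu1 == 0:
        raise ValueError("nu1 must be nonzero")
    a = nu2 * (nu1 + nu2)
    b = -2 * eta * nu2
    c0 = eta ** 2 - nu1 * c
    if a == 0:
        if b == 0:
            return []  # degenerate: the equation does not involve x_l2
        return [-c0 / b]  # the equation is linear in x_l2
    disc = b * b - 4 * a * c0
    if disc < 0:
        return []  # inconsistent inputs: no real solution
    root = math.sqrt(disc)
    # A set removes the duplicate when the discriminant is zero.
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})


# Made-up confidential values x_l1 = 3, x_l2 = 5 with nu1 = nu2 = 1:
# eta = 3 + 5 = 8 and c = 9 + 25 = 34.
candidates = qea_candidates(1, 1, eta=8, c=34)

# Made-up values x_l1 = 7, x_l2 = 4 with nu1 = 1, nu2 = -1:
# eta = 7 - 4 = 3 and c = 49 - 16 = 33; the equation becomes linear.
linear_case = qea_candidates(1, -1, eta=3, c=33)
```

In the first call the two roots are the two confidential values themselves, which reflects the symmetry of the inputs when ${\nu}_{1}={\nu}_{2}$. In the second call the quadratic term vanishes and the single root is the value of ${x}_{{\ell}_{2}}$.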
This completes the proof of Theorem 2. □