Article

Data Obfuscation for Privacy-Preserving Machine Learning Using Quantum Symmetry Properties

by Sebastian Raubitzek 1, Sebastian Schrittwieser 2, Alexander Schatten 3,* and Kevin Mallinger 2

1 SBA Research gGmbH, Floragasse 7/5.OG, 1040 Vienna, Austria
2 Christian Doppler Laboratory for Assurance and Transparency in Software Protection, Faculty of Computer Science, University of Vienna, Kolingasse 14-16, 1090 Vienna, Austria
3 Institute of Information Systems Engineering, TU Wien, Favoritenstrasse 9-11/194, 1040 Vienna, Austria
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(9), 223; https://doi.org/10.3390/bdcc9090223
Submission received: 1 July 2025 / Revised: 21 August 2025 / Accepted: 25 August 2025 / Published: 29 August 2025

Abstract

This study introduces a data obfuscation technique that leverages the exponential map of Lie-group generators. Originating from quantum machine learning frameworks, the method injects controlled noise into these generators, deliberately breaking symmetry and obscuring the source data while retaining predictive utility. Experiments on open medical datasets show that classifiers trained on obfuscated features match or slightly exceed the baseline accuracy obtained on raw data. This work demonstrates how Lie-group theory can advance privacy in sensitive domains by providing simultaneous data obfuscation and augmentation.

1. Introduction

Quantum technologies are increasingly being integrated into various disciplines, ranging from Quantum Key Distribution (QKD) [1], which enables secure key exchanges, to the advancement of quantum information processing technologies like quantum computers [2]. Furthermore, Quantum Machine Learning (QML) emerges as a promising field, leveraging quantum computational advantages to address complex problems [3]. Although the physical realization of quantum computers and quantum circuits currently trails behind classical hardware, the theoretical and conceptual frameworks of quantum technologies have demonstrated promising potential across a broad spectrum of applications. QML represents a frontier in computational science, blending quantum computing’s potential with classical machine learning’s algorithmic precision. The promise of QML lies in its capacity to process and analyze complex, high-dimensional datasets beyond the reach of current classical methodologies [3]. Central to these quantum information technologies are feature maps, which are instrumental in encoding classical data onto quantum circuits or qubits. These processes rely on the algebraic principles underlying the symmetries of the SU(2) Lie group [4], and these underlying algebraic structures of information encodings are the exact focus of this article.
Lie groups, which describe continuous transformations, are instrumental in numerous physical and mathematical theories, providing a rich lexicon for describing symmetries and the corresponding transformations [5,6].
This paper introduces a novel approach to data obfuscation using the exponential map of Lie-group generators, tested on publicly available medical data. Similar to how quantum computing aims to harness the multidimensional and symmetrical qualities of quantum states, our method applies these same properties—specifically, the symmetries found in Lie groups—to alter data within a high-dimensional space. This connection highlights our methodology’s interdisciplinary approach, effectively merging concepts from quantum mechanics, algebra, and machine learning in a novel way.
Central to our investigation is the following question: Can Lie-group theory effectively obfuscate sensitive data while retaining its essential properties for successful machine learning applications, thereby preserving the accuracy and informational value of the original dataset?
Large-scale healthcare analytics exemplify the pressing need for scalable privacy mechanisms in big-data pipelines. Federated learning studies show that sensitive imaging records can be queried without raw data exposure [7] and that verifiable aggregation further mitigates poisoning threats in distributed settings [8]. These findings motivate the Lie-group obfuscation presented here, positioning it as a complementary layer for secure, high-dimensional data processing within the journal’s core areas of big data and cognitive computing.
Our research builds on the foundational work of Schuld et al. [9,10,11] and IBM’s Qiskit [12] in the field of quantum machine learning (QML) and their exploration of feature maps for projecting data onto qubits.
Our main contributions are outlined as follows:
  • We develop a novel data obfuscation framework using the exponential map of Lie-group generators, tailored for privacy-preserving processing of medical data used in machine learning approaches.
  • We show where and how the invertibility of our obfuscation technique breaks down by injecting noise into the exponential map of Lie-group generators, thereby making it impossible to recover the original data.
  • We demonstrate the efficacy of this approach in maintaining and occasionally surpassing the predictive accuracy of machine learning models compared to non-obfuscated datasets.
  • We establish a conceptual link between the principles of quantum machine learning and our obfuscation methodology, highlighting the potential for cross-disciplinary innovation in leveraging symmetries for data privacy, thereby showing the applicability of quantum mechanical concepts in this context.
The remainder of this article is organized as follows: We provide a collection of related work in Section 2. Section 3 provides our methodology, i.e., a background on quantum feature maps, Lie groups and how to use them for data obfuscation, and where the invertibility of the exponential map breaks down when noise is injected. Section 4 describes our experimental setup and the employed datasets. Section 5 presents our results. Section 6 discusses our findings and their implications, and Section 7 concludes with an outlook on future applications.

2. Related Work

The protection of patient privacy is paramount in medical data processing, making data obfuscation a critical area of research. Data obfuscation techniques aim to mask sensitive information while maintaining the utility of the data for machine learning applications. This ongoing research is vital, as it addresses the dual challenge of protecting patient confidentiality and enabling the extraction of actionable insights from medical data [13].
One common approach is data anonymization, where identifiers such as names and social security numbers are removed or replaced with pseudonyms. For instance, the k-anonymity model [14] ensures that each record is indistinguishable from at least k-1 others regarding certain attributes. However, ref. [15] highlighted the vulnerability of k-anonymity to re-identification attacks, leading to the development of more sophisticated methods. Studies by Lu et al. [16] have applied homomorphic encryption to medical datasets, enabling secure analysis without compromising patient confidentiality.
Deep learning (DL)-based algorithms for image classification have demonstrated remarkable results in improving healthcare applications’ performance and efficiency. To address privacy concerns, especially in cloud-based solutions, data obfuscation techniques like variational autoencoders (VAEs) combined with random pixel intensity mapping can be used to enable DL model training on secured medical images while ensuring privacy [17].
Olatunji et al.’s comprehensive review [13] of healthcare data anonymization techniques underscores the delicate balance between privacy and utility in the context of modern big data and machine learning challenges, which also applies to data obfuscation.
Quantum information processing presents novel opportunities for advancing machine learning, particularly through quantum machine learning (QML). The integration of quantum computing with machine learning algorithms has the potential to revolutionize data processing, offering significant improvements in speed and efficiency.
A key concept in QML is the use of quantum feature maps, which embed classical data into high-dimensional quantum states. This process can enhance the representational capacity of machine learning models. Havlíček et al. [4] demonstrated that quantum feature maps could enable the classification of complex datasets that are challenging for classical models.
Quantum algorithms can potentially provide new methods for data privacy. For example, Lloyd et al. [18] proposed quantum algorithms for principal component analysis (PCA), which can be applied to obfuscate data while preserving essential features for machine learning tasks. Such approaches leverage the principles of quantum mechanics to enhance data security and utility simultaneously.
Synthetic data generation [19] involves creating artificial datasets that resemble real data but do not contain actual patient information. Techniques such as the use of generative adversarial networks (GANs) have been employed to generate realistic medical data for machine learning. While synthetic data can effectively preserve privacy, ensuring the fidelity and utility of such data is an area of active investigation. For a similar purpose but to create exemplary test classification datasets, Raubitzek et al. [20] showed that one can use Lie algebras to create synthetic and artificial data. This approach was tested using both quantum machine learning and classical machine learning algorithms.

3. Methodology

We start this section by discussing the fundamentals of quantum information processing and quantum machine learning necessary to understand our ideas, which we then expand to present our novel approach. Overall, the presented approach is based on ideas from [21], where researchers employed arbitrary Lie groups and their respective generators to construct kernel matrices for a quantum kernel estimator. The current approach differs in that we deliberately break these symmetries and exploit this both to obfuscate data and to increase the overall amount of data. In the following, we reiterate the initial steps of feature-map construction from [21] and then show how noise is inserted and how the employed feature maps and symmetry groups are altered.
Quantum machine learning consists of two main steps: the feature-encoding step and the actual quantum computation. In this work, we focus exclusively on the first step.
Standard quantum feature encoding can be described as follows:
$$|\psi(\mathbf{x})\rangle = U_{\Phi(\mathbf{x})}\,|0\rangle, \qquad \text{(1)}$$
where
  • $|\psi(\mathbf{x})\rangle$ denotes the quantum state resulting from the application of the feature map $U_{\Phi(\mathbf{x})}$ to the initial state $|0\rangle$;
  • $|\cdot\rangle$ represents a state in the complex Hilbert space $\mathcal{H}$;
  • $U_{\Phi(\mathbf{x})}$ is a unitary operator encoding the classical data $\mathbf{x}$ into a quantum state;
  • $|0\rangle$ is the initial state of the system before encoding.
Thus, Equation (1) describes the process of encoding classical data $\mathbf{x}$ into a quantum state $|\psi(\mathbf{x})\rangle$ via the unitary feature map $U_{\Phi(\mathbf{x})}$.
These feature maps, particularly those in the Pauli class, exploit the symmetry properties of $SU(2)$. The behavior of Pauli-class feature maps is governed by the Pauli matrices, which consist of three $2 \times 2$ complex matrices:
$$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$
These matrices represent the foundational elements of the Lie algebra associated with the $SU(2)$ group. $SU(2)$ refers to the set of all $2 \times 2$ unitary matrices with a determinant of 1. The corresponding Lie algebra, $\mathfrak{su}(2)$, is the collection of all $2 \times 2$ traceless anti-Hermitian matrices.
The Lie algebra $\mathfrak{su}(2)$ is spanned by the Pauli matrices scaled by $\frac{1}{2i}$:
$$\mathfrak{su}(2) = \operatorname{span}\left\{ \frac{1}{2i}\sigma_x,\; \frac{1}{2i}\sigma_y,\; \frac{1}{2i}\sigma_z \right\} \qquad \text{(2)}$$
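The following minimal NumPy/SciPy sketch (illustrative only, not part of the released code) verifies these properties numerically: the Pauli matrices are traceless and Hermitian, their $\frac{1}{2i}$ scalings are anti-Hermitian, and exponentiating $i\theta\sigma$ yields an $SU(2)$ element, i.e., a unitary matrix with determinant 1.

```python
import numpy as np
from scipy.linalg import expm

sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_y = np.array([[0, -1j], [1j, 0]], dtype=complex)
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)

for s in (sigma_x, sigma_y, sigma_z):
    assert np.isclose(np.trace(s), 0)       # traceless
    assert np.allclose(s, s.conj().T)       # Hermitian
    a = s / 2j                              # scaled generator, element of su(2)
    assert np.allclose(a, -a.conj().T)      # anti-Hermitian

# Exponentiating i*theta*sigma_z yields an SU(2) element.
theta = 0.7
U = expm(1j * theta * sigma_z)
assert np.allclose(U @ U.conj().T, np.eye(2))  # unitary
assert np.isclose(np.linalg.det(U), 1.0)       # determinant 1
```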
We expand this concept to include the Lie groups $SU(n)$ and $SL(n)$, introducing noise to the generators to subtly disrupt these symmetries and obfuscate the original data, thereby rendering it non-reproducible. To illustrate this, we first discuss quantum feature maps, with particular attention to two commonly used in IBM’s Qiskit [12], demonstrating where the mechanics of Lie groups come into play. We then extend this approach to incorporate $SU(n)$ and $SL(n)$ as described in [5,6].
Among the many quantum feature maps, the Z and ZZ feature maps are standard options implemented in IBM’s Qiskit [12]. These maps utilize the properties of Pauli matrices to generate rotations within a complex two-dimensional space, enabling the encoding of classical data into the quantum domain.
The core idea is akin to a standard rotation matrix parameterized by an angle $\theta \in [0, 2\pi]$; however, this approach extends to complex rotations parameterized by the Pauli matrices. This leads to the following feature maps, which are variations of Equation (1) and are illustrated in Figure 1:
  • The Z Feature Map:
    The Z feature map utilizes the Pauli-Z operator to encode classical data into quantum states. For a given data point $\mathbf{x}$, it applies a phase rotation to each qubit in a quantum register, proportional to the corresponding feature value in $\mathbf{x}$. This operation can be mathematically expressed as follows:
    $$U_{Z,j}(\mathbf{x}) = \exp\left( i\, x_j Z_j \right), \qquad \text{(3)}$$
    where $Z_j$ represents the Pauli-Z matrix acting on the j-th qubit and $x_j$ denotes the j-th component of the data vector $\mathbf{x}$. This results in a rotation around the Z axis of the Bloch sphere, encoding the data within the phase of the quantum state, as shown in Figure 1; a minimal numerical sketch of this encoding follows the list below.
  • The ZZ Feature Map:
    Extending the principles of the Z feature map, the ZZ feature map incorporates entanglement between qubits to enhance the richness of the feature space. It applies two-qubit gates that are modulated by the product of pairs of classical data features, further enriching the quantum representation. This operation is depicted in Figure 1.
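As a concrete illustration of the Z encoding above, the following sketch (again NumPy/SciPy, not taken from the paper’s code base) applies $U_{Z,j}(\mathbf{x}) = \exp(i x_j Z_j)$ to the initial state $|0\rangle$ for a single feature value:

```python
import numpy as np
from scipy.linalg import expm

Z = np.array([[1, 0], [0, -1]], dtype=complex)  # Pauli-Z
ket0 = np.array([1, 0], dtype=complex)          # initial state |0>

x_j = 0.8                                       # one normalized feature value
U = expm(1j * x_j * Z)                          # single-qubit Z feature map
psi = U @ ket0                                  # encoded state exp(i*x_j)|0>
assert np.isclose(psi[0], np.exp(1j * x_j))     # feature stored as a phase
```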
Efficient data encoding into a quantum circuit is crucial in quantum information processing. For each data feature, a corresponding manipulation is required. Our approach extends beyond standard quantum feature encoding, which utilizes $SU(2)$ transformations for individual qubits. For example, encoding 8 features necessitates at least 8 generators. This requirement is satisfied by groups such as $SU(3)$, which provides 8 generators, specifically the Gell-Mann matrices.
We parameterize these matrices using normalized features within the exponential map, yielding a group element that is applied to a normalized vector. The result is a complex three-component vector encoding the information of the data sample. This method enables the use of arbitrary Lie groups for data encoding, provided the group’s generators are constructible. This concept, along with an illustrative example, is shown in Figure 2.
In our approach, we select a Lie group from the families of $SU(n)$ or $SL(n)$ that is sufficiently large, ensuring it has a number of generators greater than or equal to the number of features. Using the normalized feature vector $\mathbf{x} = (x_0, x_1, \ldots, x_m)$, we parameterize the generators to obtain the corresponding group element $U_{\mathbf{x}}$:
$$U_{\mathbf{x}} = \exp\left( i \sum_j x_j T_j \right), \qquad \text{(4)}$$
where $x_j$ represents the individual components of the feature vector, $T_j$ represents the generators of the chosen symmetry group, and $U_{\mathbf{x}}$ is a $k \times k$ matrix representing the group element. If the number of generators exceeds the number of features, the parameters for the surplus generators are set to zero. This encoding maps our data samples or vectors into a new feature space, with the corresponding feature vector represented as follows:
$$\phi = \frac{1}{\sqrt{k}} \left( 1, 1, \ldots, 1 \right) \cdot U_{\mathbf{x}}, \qquad \text{(5)}$$
For subsequent machine learning processes, we separate the real and imaginary components of the resulting feature vector, thereby obtaining $2 \times k$ features.
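The encoding of Equations (4) and (5) can be sketched in a few lines of NumPy/SciPy. The sketch below uses the $SU(3)$ Gell-Mann matrices as generators; the function name encode and the example values are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.linalg import expm

# The eight Gell-Mann matrices, generators of su(3).
GELL_MANN = [np.array(m, dtype=complex) for m in (
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, -1j, 0], [1j, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, -1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [1, 0, 0]],
    [[0, 0, -1j], [0, 0, 0], [1j, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, -1j], [0, 1j, 0]],
)]
GELL_MANN.append(np.diag([1, 1, -2]).astype(complex) / np.sqrt(3))

def encode(x, generators):
    """Map a feature vector x (len(x) <= number of generators) to 2k features."""
    k = generators[0].shape[0]
    # Surplus generators are implicitly parameterized with zero.
    A = sum(x_j * T for x_j, T in zip(x, generators))
    U = expm(1j * A)                             # group element U_x, Eq. (4)
    phi = (np.ones(k) / np.sqrt(k)) @ U          # base vector times U_x, Eq. (5)
    return np.concatenate([phi.real, phi.imag])  # split into 2k real features

features = encode(np.array([0.3, 1.1, 2.0]), GELL_MANN)  # 6 real features
```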
Incorporating a noise term $\chi$ into each set of generators guarantees that the data is obfuscated. This is mathematically represented by a random uniform noise component added to each component of the summed-up set of parameterized generators; if the parameterized set of generators is complex, we add both a real and an imaginary noise component. This results in the following expression for our noisy group elements:
$$U_\chi(\mathbf{x}) = \exp\left( i \sum_j x_j T_j + \chi \right), \qquad \text{(6)}$$
This addition of noise effectively perturbs each group element $U(\mathbf{x})$ generated by the exponential map, leading to a slightly altered encoded quantum state.
This expansion of feature maps to arbitrary Lie groups enhances our ability to represent and manipulate data. By using the diverse symmetries and structures of different Lie groups, we can design feature maps that are tailor-made for specific types of data or learning tasks.
Mathematically, adding a small noise term $\chi$ ensures that the perturbed quantum state remains in the vicinity of the original state near the group manifold, preserving the relative distances and geometric relationships crucial for machine learning algorithms. This proximity guarantees that, while the data is obfuscated enough to protect privacy, it retains sufficient structure for effective learning.
Finally, we can apply the feature map from Equation (6) to each sample several times, each time with a different noise component, and thus use our approach not only to obfuscate data but also to synthesize additional data, i.e., to multiply the size of the dataset.
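A sketch of this noisy, augmenting variant (Equation (6)), reusing the GELL_MANN generators and conventions of the previous sketch, could look as follows; the uniform noise range and the multiplier of 3 are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import expm

def encode_noisy(x, generators, noise_level, rng):
    """Noisy encoding per Eq. (6); chi gets a real and an imaginary part."""
    k = generators[0].shape[0]
    A = sum(x_j * T for x_j, T in zip(x, generators))
    chi = (rng.uniform(-noise_level, noise_level, (k, k))
           + 1j * rng.uniform(-noise_level, noise_level, (k, k)))
    U = expm(1j * A + chi)                   # noisy group element U_chi(x)
    phi = (np.ones(k) / np.sqrt(k)) @ U
    return np.concatenate([phi.real, phi.imag])

rng = np.random.default_rng(0)
x = np.array([0.3, 1.1, 2.0])
# Multiplier 3: the same sample yields three distinct obfuscated copies,
# since chi is resampled on every call.
copies = [encode_noisy(x, GELL_MANN, 0.05, rng) for _ in range(3)]
```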

3.1. Retrieving the Original Data

Given the preceding discussion of how our data obfuscation is constructed from the exponential map of a Lie group, we want to ensure that our original data is not retrievable, which we achieve through the following construction of the noise component $\chi$.
First, we state the assumptions underlying this discussion. We assume that an attacker who wants to acquire the original data is familiar with our obfuscation approach and with Lie groups, their corresponding algebras, etc.; that the attacker knows our base vector, as discussed in Equation (5); and, finally, that the attacker is capable of reproducing the transformation matrix from our transformed feature vector, i.e.,
$$\phi_\chi = \frac{1}{\sqrt{k}} \left( 1, 1, \ldots, 1 \right) \cdot U_\chi(\mathbf{x}), \qquad \text{(7)}$$
thereby reproducing $U_\chi(\mathbf{x})$. This raises the question of how to choose the noise such that one cannot retrieve the original features $\mathbf{x}$ from the transformation matrix $U_\chi(\mathbf{x})$.
First, we need to discuss if and when the exponential map of a Lie group is invertible, and thus the data retrievable:

3.1.1. Local Invertibility

The exponential map, denoted as $\exp : \mathfrak{g} \to G$, where $\mathfrak{g}$ is the Lie algebra of a Lie group $G$, is locally invertible around the identity element of $G$. This follows from the Inverse Function Theorem, which applies because the differential of the exponential map at the identity (zero in the Lie algebra) is the identity map, making it a local diffeomorphism at this point.

3.1.2. Global Invertibility

Globally, the exponential map is generally not invertible, because it may fail to be injective (one-to-one) or surjective (onto):
  • Injectivity: The exponential map is not injective if there exist elements $X, Y \in \mathfrak{g}$ such that $X \neq Y$ but $\exp(X) = \exp(Y)$. This can occur, for example, when $X$ and $Y$ differ by a multiple of $2\pi i$ in certain directions in $\mathfrak{g}$, particularly for compact or periodic dimensions of $G$.
  • Surjectivity: The exponential map may not be surjective for some Lie groups, meaning not all elements of the group can be expressed as the exponential of some element in the algebra. A typical example is non-connected groups, where the exponential map reaches only the connected component of the identity.
Given these arguments, we need to consider the most extreme case: injectivity and surjectivity hold globally for the particular Lie group, and the attacker knows which Lie group we used to encode our data and, furthermore, knows the set of generators we used. Thus, we construct our noise in the following way to make our original data non-retrievable:
$$U_\chi(\mathbf{x}) = \exp\left( i \sum_j x_j T_j + \chi \right) \qquad \text{(8)}$$
This noise $\chi$ can be decomposed into two components:
$$\chi = \chi_G + \zeta, \qquad \text{(9)}$$
where $\chi_G$ denotes noise that can be expressed as a linear combination of the generators of the Lie group (with a different parameterization vector $\mathbf{c}$) and $\zeta$ is a residual noise matrix that cannot be expressed as a linear combination of the generators. This results in the following cases: If $\chi_G \neq 0$ and $\zeta = 0$, then, as discussed above for the most extreme case, one can reconstruct the generators and obtain a feature vector by assigning different parameterizations to them. However, one cannot retrieve the original feature vector exactly: each of its components carries a small deviation, because the noise injected into the exponential map slightly distorts the original features. Therefore, we construct $\chi_G$ such that
$$\chi_G = \sum_j \epsilon_j T_j, \quad \text{where} \quad \sum_j \epsilon_j = \epsilon, \qquad \text{(10)}$$
where $\epsilon$ is a controllable parameter, i.e., the level of noise that we inject into our dataset. Furthermore, we distribute $\epsilon$ randomly among the coefficients $\epsilon_j \geq 0$. In conclusion, one cannot retrieve the original feature vector unless one knows precisely the random coefficients $\epsilon_j$.
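A minimal sketch of this construction (the function name and the uniform random split are assumptions; only the constraint $\sum_j \epsilon_j = \epsilon$ with $\epsilon_j \geq 0$ is taken from the text above):

```python
import numpy as np

def chi_G(generators, epsilon, rng):
    """Algebra-aligned noise per Eq. (10): random eps_j >= 0 summing to epsilon."""
    w = rng.uniform(0.0, 1.0, len(generators))
    eps = epsilon * w / w.sum()            # eps_j >= 0 and sum(eps_j) == epsilon
    return sum(e * T for e, T in zip(eps, generators))
```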
The next case we need to discuss is the one in which our residual noise component is not zero ($\zeta \neq 0$), assuming that $\chi_G = 0$. In this case, the matrix $U_\chi(\mathbf{x})$ resulting from applying the exponential map, i.e., $U_\chi(\mathbf{x}) = \exp\left( i \sum_j x_j T_j + \chi \right)$, is not part of the regarded symmetry group; thus, our initial symmetry is broken, and we leave the Lie group’s manifold. However, this means the following:
  • Loss of Group Structure: The resulting matrix is no longer guaranteed to satisfy the properties that define the group (closure, associativity, identity, and invertibility). Hence, it cannot be inverted within the context of the group.
  • Breaking Symmetry: The exponential map no longer maps elements of the Lie algebra to the Lie group, breaking/violating the symmetry and making the inverse mapping undefined.
  • Non-recoverability of Original Features: Since the transformation is no longer within the group, one cannot apply the inverse of the exponential map to recover the original features. The noise ( ζ ) introduces components that do not belong to the algebra; hence, the original structure and information are obfuscated beyond recoverability.
In conclusion, the introduction of residual noise ζ that cannot be expressed as a linear combination of the generators fundamentally disrupts the structure and invertibility of the exponential map, ensuring that the original feature vector cannot be reconstructed from the transformed vector. Furthermore, the noise injected into the parameterizations of the regarded generators ensures a slight distortion of the original features, which further obfuscates the original data. Thus, we conclude that the obfuscated data cannot be reconstructed.
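To make the non-recoverability argument tangible, the following sketch (reusing the GELL_MANN generators defined earlier; the attacker helper is illustrative) plays the attacker of our threat model: it matrix-logs the group element and projects onto the known generators, which recovers the features in the noise-free case but deviates once noise is present:

```python
import numpy as np
from scipy.linalg import expm, logm

def attack(U, generators):
    """Attempt inversion: principal matrix log, then project onto the
    generators, using the Gell-Mann normalization Tr(T_a T_b) = 2 delta_ab."""
    L = logm(U)
    return np.array([np.trace(L @ T) / 2j for T in generators])

rng = np.random.default_rng(1)
x = np.array([0.3, 1.1, 2.0, 0, 0, 0, 0, 0], dtype=float)
A = sum(x_j * T for x_j, T in zip(x, GELL_MANN))

U_clean = expm(1j * A)
print(attack(U_clean, GELL_MANN).real[:3])   # ~[0.3, 1.1, 2.0]: recoverable

chi = 0.1 * (rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))
U_noisy = expm(1j * A + chi)
print(attack(U_noisy, GELL_MANN).real[:3])   # deviates from x: obfuscated
```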

4. Experiments

We performed experiments on four datasets to measure whether data obfuscated using our noise-augmented Lie-group approach can still be classified with a machine learning approach. This means we transformed all four datasets with varying amounts of noise and multipliers (i.e., amounts of synthetic data), performed machine learning classification with 80% training data and 20% test or validation data, and noted the accuracy of the machine learning prediction on the test data. We also compared this accuracy to that of the same machine learning approach without obfuscating/transforming the data. In addition, Appendix A provides results for other obfuscation techniques with the same setup. This experimental design is depicted in Figure 3. In the following, we discuss the details of our approach, such as the normalization, the employed machine learning algorithms, and the regarded datasets.
Normalization of Features: We normalized all features to the range [0, π] to effectively utilize the exponential map with our chosen Lie groups. Furthermore, all categorical features were projected into a numerical space by assigning each category a distinct value between 0 and π.
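A sketch of this preprocessing step (the exact categorical mapping is not specified in the paper; evenly spaced values in [0, π] are an assumption here):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Scale numerical features to [0, pi]; map categories to distinct values."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:  # categorical column
            cats = sorted(out[col].dropna().unique())
            lut = {c: i * np.pi / max(len(cats) - 1, 1) for i, c in enumerate(cats)}
            out[col] = out[col].map(lut)
    scaler = MinMaxScaler(feature_range=(0, np.pi))
    return pd.DataFrame(scaler.fit_transform(out), columns=out.columns)
```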
Datasets and Data Augmentation: We employed four distinct datasets, each subjected to five levels of noise and data augmentation, i.e., we varied the noise parameter that controls the interval from which the noise is sampled. Data synthetization was performed by multiplying the dataset size by factors ranging from 1 (no augmentation) up to 5, i.e., creating a different noisy sample for each copy of a data point.
Bayesian Optimization and LGBM Classifier: For the classification tasks, we utilized a Light Gradient Boosting Machine (LGBM) classifier. LGBM is known for its efficiency and effectiveness in handling large datasets and high-dimensional feature spaces, making it an apt choice for our experiments [22]. Bayesian optimization with 100 iterations was employed to search through the hyperparameter space, ensuring the optimal configuration for each experimental condition.
Evaluation Strategy: The datasets were split into training and testing sets using an 80/20 ratio. The performance of the LGBM classifier, trained on the feature-mapped data, was compared against the same LGBM implementation with standard preprocessing, scaling the data to the interval of [0, 1] for both numerical and categorical features. We chose the standard accuracy score as our primary metric for evaluation.
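The evaluation pipeline can be sketched as follows. The paper does not name the Bayesian-optimization library, so scikit-optimize’s BayesSearchCV is used here as one plausible choice; scikit-learn’s bundled Breast Cancer Wisconsin data stands in for the obfuscated features, and the search space is likewise an assumption:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from skopt import BayesSearchCV

X, y = load_breast_cancer(return_X_y=True)   # stand-in for obfuscated features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)    # 80/20 split

search = BayesSearchCV(
    LGBMClassifier(),
    {
        "num_leaves": (8, 128),                       # assumed ranges
        "learning_rate": (1e-3, 0.3, "log-uniform"),
        "n_estimators": (50, 500),
    },
    n_iter=100,  # 100 Bayesian-optimization iterations, as in the paper
)
search.fit(X_train, y_train)
print(accuracy_score(y_test, search.predict(X_test)))
```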

Datasets

The four datasets employed are medical datasets posing binary classification tasks, i.e., the outcome indicates whether a disease was identified: the Breast Cancer Wisconsin, Breast Cancer Coimbra, Pima Indians Diabetes, and Indian Liver Patient (ILPD) datasets (cf. the tables in Appendix A). All four datasets are publicly available and can easily be fetched from online databases.

5. Results

The experimental results highlight the efficacy of incorporating Lie group-based feature maps with noise for data obfuscation while maintaining the utility of machine learning models. Applying Bayesian optimization and LGBM classifiers across multiple datasets and conditions provides a robust evaluation framework for our methodology.
The performance of the LGBM classifier proved resilient to the levels of noise and the degree of data obfuscation applied, as reflected by the accuracy measurements. Injecting noise into the data, with the goal of making it more private, did not undermine the model’s ability to predict correctly, suggesting that our method is practical for privacy-preserving machine learning. The accuracy figures, alongside our baseline with original features, are listed in Table 1. In Figure 4 and Figure 5, we show how our method compares with the baseline: enhancements are highlighted in light blue, and cases where the baseline is better are indicated in purple. We also chart the differences; when there is no change, we consider it a win for our method, as the goal is to maintain the baseline accuracy, at least.
We also provide a benchmark comparison relative to other obfuscation methods in Appendix A.
Each dataset shows at least one instance in which our method outdid the baseline in accuracy. In fact, for some datasets, our method held up well under most test scenarios. This indicates that, regardless of how much noise we added or how much we increased the dataset size—up to five times—the method was as good as or better than the baseline. We did not expect to beat the baseline in every case, as that is not the main goal of data obfuscation, but our findings confirm that transforming the data and shifting it into a different feature space preserves enough information for machine learning models to work effectively.

6. Discussion

In this article, we introduce a novel approach for data obfuscation based on the mathematical framework of Lie groups, where noise is injected into the exponential map to generate obfuscated feature vectors. Experiments were conducted using two Lie-group families—SU(n) and SL(n)—together with a Light Gradient Boosting Machine (LGBM) classifier. The results serve as a proof of concept, showing that this methodology can enhance data privacy while maintaining utility for machine learning tasks. As shown in Table 1, the classifier performed well on transformed data, often achieving results close to or better than the benchmark with unobfuscated data. Additional comparisons with other obfuscation techniques are presented in Appendix A, where the accuracies of our approach are among the strongest across the tested methods.
The utility of the method arises from several factors: the chosen group is not disclosed and thus remains unknown to potential attackers; the parametrization of the generators hides the original feature space; and the injected noise creates an $\epsilon$-ball around the true information, both on and off the Lie manifold, preventing invertibility of the exponential map. This makes recovery of the original data infeasible.
This property is especially relevant for sensitive applications, such as medical data. The method renders raw data inaccessible while retaining the information content needed for reliable machine learning classification. For example, medical records can be obfuscated and still allow for accurate AI-based analysis without exposing patient-level data.
The presented experiments confirm that classification accuracy can remain high under Lie-group transformations, comparable to results obtained with unobfuscated data, particularly with appropriate parameterizations or data augmentation (Figure 4 and Figure 5). Furthermore, when evaluated against other obfuscation methods—such as PCA with noise, random projection, feature shuffling, and Gaussian mechanisms—our approach consistently matched or outperformed alternatives in terms of accuracy across multiple datasets.
When information leakage was evaluated via mutual information, Lie-group methods consistently achieved the lowest leakage across datasets, with the single exception of the ILPD dataset, where feature shuffling performed better. Overall, our method combines strong classification accuracy with consistently low leakage, showing favorable trade-offs compared to the baselines in Appendix A and Appendix B.
The appendix results further contextualize these findings. Random projections showed inconsistent performance across datasets. PCA with Gaussian perturbations yielded competitive results in some cases, but outcomes were highly sensitive to the chosen noise variance, raising concerns about robustness. Feature shuffling, when applied globally, severely reduced utility due to indiscriminate permutation. Gaussian mechanisms provided strong theoretical privacy guarantees but quickly degraded accuracy as noise increased. In contrast, Lie-group transformations offered a balanced middle ground: low leakage, higher stability than PCA + Gaussian, and better utility than feature shuffling and Gaussian mechanisms.
Beyond empirical results, the approach connects conceptually to quantum information processing. In quantum systems, noise often poses a challenge, yet in this context, injected noise enables privacy-preserving computation. By violating Lie-group symmetries slightly, we draw an analogy to noisy or non-Hermitian quantum evolutions, which break unitarity in a manner similar to how noise in our generators breaks strict invertibility. This suggests a natural alignment between Lie-group obfuscations and quantum-inspired machine learning.
An additional motivation for using noisy Lie-group exponential maps, rather than simpler methods such as random projection or differential privacy, lies in their hardware potential. In principle, one could build quantum or optical circuits implementing specific Lie-group symmetries, where unavoidable device-level noise (e.g., manufacturing imperfections, thermal fluctuations, or quantum effects) could be used for obfuscation. Since noise profiles are device-specific, reconstruction would require access to the hardware (or at least substantial information about it), making reverse engineering drastically more difficult. Our results show compatibility across multiple symmetry groups, suggesting that platforms already implementing SU(2)-like transformations, such as quantum computers or quantum key distribution systems, could provide proof-of-principle hardware realizations.
These observations also align with the No Free Lunch theorem [23], which states that no method is universally optimal across all tasks. Our experiments showed that performance depends on parameter choices such as multipliers and noise levels: in some cases, accuracy exceeded the baseline, while in others, it did not. This variability reflects the theorem’s principle according to which effectiveness depends on dataset and problem specifics.
In the broader landscape of privacy-preserving methods, Differential Privacy (DP) remains the most established standard [24,25]. DP guarantees bounded sensitivity by injecting calibrated noise into queries or model updates but often at the expense of model accuracy. For example, DP-SGD [26] prevents memorization of training data but typically reduces performance in complex tasks. Earlier methods such as k-anonymity [14] and l-diversity [27] rely on grouping records but are vulnerable to linkage attacks. In contrast, Lie-group obfuscations achieve privacy through algebraic structure, dimensional hiding, and non-invertible noise injection while preserving competitive utility.
In terms of computational cost, the proposed method is not part of the machine learning training loop itself but, instead, consists of straightforward matrix manipulations during the feature transformation stage. These operations decouple cleanly from the learning process, as they are applied once to the data prior to training. In practice, the exponential map–based transformations run within a few seconds per dataset, making them lightweight preprocessing steps. Thus, no severe computational overheads are expected for our technique compared to standard obfuscation or augmentation methods.
Taken together, our findings indicate that Lie-group obfuscations are a promising addition to privacy-preserving machine learning. They combine structured, symmetry-based richness with the robustness of noise injection, offering both theoretical grounding and practical utility. While not a replacement for existing standards such as DP, they represent a complementary approach that can extend the design space of privacy-preserving techniques, with potential applicability in both software and hardware contexts.

7. Conclusions

This work presented a proof-of-concept framework for privacy-preserving data obfuscation based on Lie groups and noisy exponential maps. Input features from four biomedical datasets were obfuscated through Lie-group transformations from two families (SU(n) and SL(n)), and the resulting representations were classified using a Light Gradient Boosting Machine (LGBM) model. The experiments showed that Lie-group obfuscations can maintain competitive predictive performance while reducing the recoverability of the original data. In addition to classification accuracy, we evaluated information leakage through mutual information, finding that Lie-group methods consistently achieved low leakage compared to the other baselines, with only isolated cases where alternatives performed better.
The approach differs from conventional techniques such as random projection or Gaussian mechanisms by combining symmetry-based transformations with injected noise, which makes the exponential map non-invertible and the reconstruction of original data infeasible. This makes it suitable for sensitive applications such as medical analytics, where reliable model predictions are needed without exposing raw patient records.
The results underline the proof-of-concept nature of the work: they varied depending on dataset and parameter settings, in line with the No Free Lunch theorem, meaning that effectiveness depends on the problem context and configuration. The present contribution is a first evaluation intended to establish Lie-group transformations as a quantum-inspired foundation for privacy-preserving machine learning.
Most importantly, while the approach is quantum-inspired and draws conceptual links to noisy quantum feature maps, no quantum hardware implementation was used in this work. All experiments were purely classical, meaning that practical deployment on quantum or hybrid systems remains to be demonstrated.
Future work should address scaling of the obfuscation pipeline to high-throughput data streams; integration with federated learning architectures for secure, decentralized analytics; and application to heterogeneous clinical records where privacy-aware retrieval is required. Further benchmarking with additional group families, classifiers, and adversarial models will show the robustness and generality of the method.
Overall, this study demonstrates that Lie group-based obfuscation can serve as a mathematically grounded, quantum-inspired mechanism for balancing privacy and utility. By leveraging symmetry, dimensional obfuscation, and noise, the method opens a concrete research direction at the intersection of privacy, machine learning, and quantum-inspired computation.
We also acknowledge that the study relies on openly available datasets, which is a limitation, and that the approach should ideally be tested on a broader variety of datasets to further validate its general applicability, as is the case with most techniques presented at the proof-of-concept stage. The program code is available at https://github.com/Raubkatz/Quantum_Data_Obfuscation (accessed on 24 August 2025).

Author Contributions

Conceptualization, S.R. and S.S.; methodology, S.R., S.S. and K.M.; software, S.R. and S.S.; validation, S.R. and A.S.; formal analysis, S.R. and K.M.; investigation, S.R., K.M. and S.S.; resources, K.M., S.S. and A.S.; data curation, S.R. and S.S.; writing—original draft preparation, S.R., S.S., A.S. and K.M.; writing—review and editing, S.R., S.S., A.S. and K.M.; visualization, S.R.; supervision, S.R., S.S., A.S. and K.M.; project administration, S.S., A.S. and K.M.; funding acquisition, S.S., A.S. and K.M. All authors have read and agreed to the published version of the manuscript.

Funding

Financial support from the Austrian Federal Ministry of Economy, Energy and Tourism; the National Foundation for Research, Technology and Development; and the Christian Doppler Research Association is gratefully acknowledged. SBA Research (SBA-K1 NGC) is a COMET Center within the COMET–Competence Centers for Excellent Technologies Programme and funded by BMIMI, BMWET, and the federal state of Vienna. The COMET Programme is managed by FFG.

Data Availability Statement

All data is public and openly available via the referenced and linked sources.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Comparative Experiments with Non-Lie-Group Obfuscation Techniques

To further validate the proposed Lie-group-based obfuscation method, we conducted additional experiments with established obfuscation techniques from the literature. The goal is to benchmark our approach against widely used feature-space obfuscation strategies that provide privacy protection while retaining predictive utility. We apply the same machine learning pipeline used in the main experiments to enable a direct comparison of classification performance, leakage behavior, and computational cost.

Appendix A.1. Overview of Considered Techniques

We selected four representative obfuscation techniques commonly studied in privacy-preserving machine learning. Each method transforms the original feature space to obscure sensitive structure, with different guarantees of privacy and utility. Below, we summarize the techniques and provide the corresponding references; a compact code sketch of all four baselines follows the list.
  • Random projection projects features into a lower-dimensional subspace using a random matrix (Gaussian or sparse). This preserves pairwise distances in expectation and prevents exact feature recovery. References: Bingham and Mannila (2001) [28]; Li, Hastie, and Church (2006) [29].
  • Principal Component Analysis with Gaussian Noise (PCA + Noise) applies PCA for dimensionality reduction, adds Gaussian noise in the latent space, and reconstructs features. PCA reduces redundancy; the added noise hinders inversion and contributes to privacy when calibrated as in differential privacy. References: Abdi and Williams (2010) [30]; Dwork and Roth (2014) [25]; Balle and Wang (2018) [31]; Shokri et al. (2017) for membership-inference motivation [32].
  • Feature shuffling randomly permutes feature columns (per feature) across samples. This simple anonymization baseline destroys cross-feature associations while preserving marginal distributions; it is widely discussed in statistical disclosure control. References: Muralidhar and Sarathy (2006) [33]; Domingo-Ferrer and Torra (2001) [34].
  • The Gaussian mechanism (differential-privacy style) adds calibrated Gaussian noise to features as a proxy for $(\varepsilon, \delta)$-differential privacy. The noise scale controls the privacy–utility trade-off. References: Dwork and Roth (2014) [25]; Balle and Wang (2018) [31]; Abadi et al. (2016) [26].
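The following sketch gives one plausible NumPy/scikit-learn implementation of the four baselines; the parameter choices are illustrative, and the Gaussian-mechanism calibration (sensitivity π for features in [0, π], δ = 10⁻⁵) is an inference that reproduces the σ values reported in Appendix B (e.g., ε = 0.5 gives σ ≈ 30.44):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)

def random_projection(X, n_components):
    """Project features with a random Gaussian matrix."""
    return GaussianRandomProjection(n_components=n_components).fit_transform(X)

def pca_gaussian(X, noise_std, keep=1.0):
    """PCA, Gaussian noise in the latent space, then reconstruction."""
    pca = PCA(n_components=max(1, int(keep * X.shape[1])))
    Z = pca.fit_transform(X)
    return pca.inverse_transform(Z + rng.normal(0.0, noise_std, Z.shape))

def feature_shuffling(X, fraction):
    """Permute a fraction of the feature columns independently across samples."""
    X = X.copy()
    cols = rng.choice(X.shape[1], int(fraction * X.shape[1]), replace=False)
    for c in cols:
        X[:, c] = rng.permutation(X[:, c])
    return X

def gaussian_mechanism(X, epsilon, delta=1e-5, sensitivity=np.pi):
    """Classic (eps, delta)-DP Gaussian noise; sensitivity pi assumes
    features normalized to [0, pi]."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return X + rng.normal(0.0, sigma, X.shape)
```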

Appendix A.2. Experimental Design

All four techniques were applied to the same datasets and evaluated with the identical pipeline as in the main experiments (LightGBM with Bayesian hyperparameter search). Each method was tested at multiple strength levels (e.g., projection dimension, noise variance, or shuffled fraction) to characterize the privacy–utility trade-off.

Appendix A.3. Results

Table A1. Non-Lie-group obfuscation results (accuracy only) on the Breast Cancer Wisconsin dataset (baseline = 0.974).

Random Projection
  n_components    Accuracy
  10              0.9386
  full            0.9649

PCA + Gaussian
  noise std (latent)    Accuracy
  0.00                  0.9561
  0.01                  0.9649
  0.03                  0.9561
  0.05                  0.9649
  0.10                  0.9561

Feature Shuffling
  shuffled fraction    Accuracy
  0.25                 0.9825
  0.50                 0.9649
  1.00                 0.5789

Gaussian Mechanism
  ε      Accuracy
  0.5    0.5877
  1.0    0.5965
  2.0    0.5877
  4.0    0.6404
Table A2. Non-Lie-group obfuscation results (accuracy only) on the Breast Cancer Coimbra dataset (baseline = 0.833).

Random Projection
  n_components    Accuracy
  9               0.7917
  full            0.5417

PCA + Gaussian
  noise std (latent)    Accuracy
  0.00                  0.7083
  0.01                  0.9167
  0.03                  0.7500
  0.05                  0.7917
  0.10                  0.7500

Feature Shuffling
  shuffled fraction    Accuracy
  0.25                 0.6250
  0.50                 0.5417
  1.00                 0.4167

Gaussian Mechanism
  ε      Accuracy
  0.5    0.6667
  1.0    0.6250
  2.0    0.6667
  4.0    0.5000
Table A3. Non-Lie-group obfuscation results (accuracy only) on the Pima Indians Diabetes dataset (baseline = 0.744).

Random Projection
  n_components    Accuracy
  8               0.7727
  full            0.7273

PCA + Gaussian
  noise std (latent)    Accuracy
  0.00                  0.7662
  0.01                  0.7597
  0.03                  0.7792
  0.05                  0.8117
  0.10                  0.7792

Feature Shuffling
  shuffled fraction    Accuracy
  0.25                 0.7338
  0.50                 0.7143
  1.00                 0.7143

Gaussian Mechanism
  ε      Accuracy
  0.5    0.7078
  1.0    0.6623
  2.0    0.7143
  4.0    0.6883
Table A4. Non-Lie-group obfuscation results (accuracy only) on the Indian Liver Patient dataset (baseline = 0.747).

Random Projection
  n_components    Accuracy
  10              0.7094
  full            0.6752

PCA + Gaussian
  noise std (latent)    Accuracy
  0.00                  0.6923
  0.01                  0.6752
  0.03                  0.6923
  0.05                  0.6923
  0.10                  0.6752

Feature Shuffling
  shuffled fraction    Accuracy
  0.25                 0.6752
  0.50                 0.6752
  1.00                 0.6752

Gaussian Mechanism
  ε      Accuracy
  0.5    0.6838
  1.0    0.6752
  2.0    0.6752
  4.0    0.6752
Figure A1. Non-Lie-group obfuscations: ΔAccuracy (model accuracy minus baseline) across obfuscation strengths. Each column shows one dataset; within each column, the four methods are stacked (top to bottom): Random Projection, PCA + Gaussian, Feature Shuffling, and Gaussian Mechanism. Color map matches prior figures (two-color split at 0).
Figure A2. Non-Lie-group obfuscations: ΔAccuracy (model accuracy minus baseline) across obfuscation strengths for ILPD and Breast Cancer Coimbra. Same ordering and color map as Figure A1.

Appendix B. Leakage Analysis

This appendix describes the leakage analysis performed to evaluate the extent to which different obfuscation methods preserve or obscure information from the original datasets via mutual information.

Appendix B.1. Objective

The objective of the leakage analysis is to measure how much mutual information (MI) remains between original and obfuscated data under various transformation settings. Higher MI values indicate that the obfuscated data kept more information about the original features, which could lead to privacy risks. Conversely, lower MI values suggest stronger obfuscation at the potential cost of reduced utility. The analysis seeks to characterize this trade-off systematically, following established practices in information-theoretic privacy assessment [35,36,37].

Appendix B.2. Experimental Setup

The experimental procedure follows these steps:
  • For each dataset, the test split is fixed in advance to ensure comparability and to isolate stochasticity to the obfuscation procedures.
  • Each obfuscation method is applied repeatedly with independent random seeds. The number of repeats is set to $n_{\text{runs}} = 10{,}000$. This ensures statistically stable estimates of leakage across random realizations of projections, shuffling, or noise.
  • For each repeat and each obfuscation setting, the mean feature-wise mutual information is computed between the original and the obfuscated features, as specified below. Discretization uses equal-width binning with $B = 30$ bins, and MI is computed with sklearn.metrics.mutual_info_score [38,39]. This discretized (binning) estimator is simple and model-free but is sensitive to the bin count and binning strategy [40].
  • For every configuration, results across repeats are aggregated by computing the sample mean and the standard error of the mean (SEM), following [41].

Appendix B.3. Mutual Information

For two random variables $U$ and $V$ with joint distribution $p(u,v)$ and marginals $p(u)$, $p(v)$, the mutual information (in nats) is
$$I(U;V) = \sum_{u,v} p(u,v) \log \frac{p(u,v)}{p(u)\, p(v)}, \qquad \text{(A1)}$$
with the usual convention that terms with $p(u,v) = 0$ contribute zero [35]. In our implementation, continuous features are discretized, empirical probabilities are formed from counts, and scikit-learn’s mutual_info_score computes the plug-in MI on the resulting labels [39]. We average the per-feature MI over the shared dimensionality of the original and obfuscated representations to obtain one scalar per configuration and repeat. Because transforms like random projection, PCA, or Lie maps can mix coordinates, the mean feature-wise MI is an axis-aligned summary; it can be complemented with cross-feature MI diagnostics when needed [40]. We then plot the mean MI versus obfuscation strength with error bars (SEM) to visualize variability across repeats.
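A sketch of this estimator (the helper name is illustrative; the binning with $B = 30$ and the plug-in MI via scikit-learn follow the description above):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mean_featurewise_mi(X_orig, X_obf, n_bins=30):
    """Mean feature-wise plug-in MI (nats) between original and obfuscated data."""
    d = min(X_orig.shape[1], X_obf.shape[1])  # shared dimensionality
    mi = []
    for j in range(d):
        u = np.digitize(X_orig[:, j], np.histogram_bin_edges(X_orig[:, j], n_bins))
        v = np.digitize(X_obf[:, j], np.histogram_bin_edges(X_obf[:, j], n_bins))
        mi.append(mutual_info_score(u, v))    # plug-in MI on binned labels
    return float(np.mean(mi))
```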

Appendix B.4. Interpretation

The leakage analysis provides a quantitative assessment of the information-preserving properties of the obfuscation methods. In particular, the following conclusions can be drawn:
  • Overall lower values of mutual information indicate less information leakage.
  • A curve that decays toward zero with increasing obfuscation strength indicates effective privacy preservation.
  • A curve that remains flat or only slowly decreases suggests that the method preserves significant information, which may imply weaker privacy guarantees.
  • The SEM values indicate the robustness of results to the stochasticity of the obfuscation procedure. Narrow error bars imply stable leakage behavior, while wide bars indicate sensitivity to random seeds.
In summary, the leakage analysis complements the classification experiments by focusing on privacy rather than predictive utility. It enables systematic comparisons across obfuscation techniques and parameter regimes, thereby informing the trade-off between information leakage and downstream learning performance.

Appendix B.5. Leakage Analysis Discussion: Breast Cancer Wisconsin Dataset

We analyze the leakage behavior of several obfuscation families on the breast cancer dataset by reporting the mean feature-wise mutual information (MI) between original and obfuscated features, as summarized in Table A5, with the corresponding visual trends shown in Figure A3 and Figure A4. Since lower MI values correspond to stronger obfuscation, reductions across settings indicate less leakage. Random projection (Figure A3a) shows a mild decrease in MI as the number of output components increases, from 1.404278 at $n_{\text{components}} = 8$ to 1.316021 at 30 components; SEM values remain below $5 \times 10^{-4}$, confirming stability, but the overall leakage reduction is modest.
PCA with Gaussian noise (Figure A3b) produces the clearest trend: MI decreases monotonically with higher noise levels across all keep ratios. Baseline leakage is highest for keep = 1.0 (2.832377 at zero noise) and decreases to 1.713168 at a noise level of 0.1, while lower keep ratios start with a smaller baseline MI (e.g., 2.156481 for keep = 0.5 at zero noise) and consistently yield lower MI values across all noise settings, highlighting the combined effect of dimensionality reduction and additive perturbations. Feature shuffling (Figure A3c) yields an almost linear drop in MI with increasing shuffled fraction, from 2.832377 with no shuffling to 1.226736 at full shuffling, making it one of the strongest leakage-reducing mechanisms in this dataset. The Gaussian mechanism in its $\varepsilon$ parameterization (Figure A3d) does not substantially change MI, which remains around 1.353 across the tested range ($0.5 \leq \varepsilon \leq 4.0$). The complementary view via the noise scale $\sigma$ (Figure A4c) confirms this flat trend, with MI effectively invariant to $\sigma$ up to 30.44. This suggests that, for this dataset, the DP-style Gaussian perturbation does not meaningfully alter feature dependence at the evaluated scales. Lie-group transformations (Figure A4a,b) show essentially flat MI across map noise levels: SU remains around 0.5386 and SL around 0.7201, with only negligible variation (<0.002). No significant monotonic reduction is observed.
Across all methods, SEM values remain very low, indicating robust results across $n_{\text{runs}} = 10{,}000$. In summary, the obfuscation families exhibit heterogeneous effectiveness: PCA with noise and feature shuffling provide strong and consistent leakage reduction, and random projection yields only a modest effect, while the Gaussian mechanism and Lie maps remain nearly constant and ineffective in reducing MI on this dataset. Overall, the findings underline that not all perturbation families translate into meaningful leakage suppression for this setting.
When specifically comparing the Lie-group maps (SU and SL) to the other techniques, their behavior stands out because of their lower overall MI values. Although their MI values remain almost constant across noise levels, they already operate at substantially lower absolute MI levels (∼0.54 for SU and ∼0.72 for SL) compared to the higher baselines of PCA, shuffling, and Gaussian noise. This indicates that Lie maps start in a natural “sweet spot” of low leakage without requiring heavy parameter tuning or large perturbations. As such, SU and SL emerge as the most effective obfuscation families in this dataset, combining stability with consistently strong privacy protection.
Table A5. Mutual information (mean ± SEM) between original and obfuscated features on the Breast Cancer Wisconsin dataset; $n_{\text{runs}} = 10{,}000$. Values are reported with a precision of $10^{-6}$. Lower MI indicates less leakage.

Random Projection
  n_components    Mean MI     SEM
  8               1.404278    0.000493
  15              1.352576    0.000328
  30              1.316021    0.000231

PCA + Gaussian (keep = 1.0)
  noise std    Mean MI     SEM
  0.00         2.832377    0.000000
  0.01         2.519922    0.000126
  0.03         2.192612    0.000106
  0.05         1.999148    0.000103
  0.10         1.713168    0.000110

PCA + Gaussian (keep = 0.75)
  noise std    Mean MI     SEM
  0.00         2.533936    0.000000
  0.01         2.437840    0.000115
  0.03         2.215164    0.000111
  0.05         2.047446    0.000106
  0.10         1.769352    0.000113

PCA + Gaussian (keep = 0.5)
  noise std    Mean MI     SEM
  0.00         2.156481    0.000000
  0.01         2.144932    0.000060
  0.03         2.094810    0.000086
  0.05         2.017133    0.000095
  0.10         1.826203    0.000116

Feature Shuffling
  shuffled fraction    Mean MI     SEM
  0.00                 2.832377    0.000000
  0.25                 2.404110    0.000071
  0.50                 2.029462    0.000085
  0.75                 1.654822    0.000089
  1.00                 1.226736    0.000076

Gaussian Mechanism (ε)
  ε      Mean MI     SEM
  0.5    1.353520    0.000148
  1.0    1.353598    0.000147
  2.0    1.353845    0.000149
  4.0    1.354915    0.000149

Gaussian Mechanism (σ)
  σ            Mean MI     SEM
  3.805101     1.354915    0.000149
  7.610202     1.353845    0.000149
  15.220405    1.353598    0.000147
  30.440809    1.353520    0.000148

LIE SU (map noise)
  map noise    Mean MI     SEM
  0.000        0.538630    0.000000
  0.001        0.538971    0.000009
  0.003        0.538927    0.000009
  0.010        0.538822    0.000009
  0.032        0.538451    0.000011
  0.100        0.538196    0.000019

LIE SL (map noise)
  map noise    Mean MI     SEM
  0.000        0.720121    0.000000
  0.001        0.720882    0.000015
  0.003        0.721000    0.000015
  0.010        0.721128    0.000014
  0.032        0.719613    0.000018
  0.100        0.715529    0.000033
Figure A3. Breast cancer dataset: leakage trends across random projection, PCA + Gaussian, feature shuffling, and Gaussian mechanism ($\varepsilon$-parameterized). (a) Random Projection: MI vs. $n_{\text{components}}$ (±SEM). (b) PCA + Gaussian: MI vs. noise std in PCA space (±SEM). (c) Feature Shuffling: MI vs. shuffled fraction (±SEM). (d) Gaussian Mechanism: MI vs. privacy parameter $\varepsilon$ (±SEM).
Figure A4. Breast cancer dataset: leakage trends for Lie-group maps and the Gaussian mechanism parameterized by $\sigma$. (a) LIE SU: MI vs. map noise (±SEM). (b) LIE SL: MI vs. map noise (±SEM). (c) Gaussian Mechanism: MI vs. noise $\sigma$ (±SEM).

Appendix B.6. Leakage Analysis Discussion: Breast Cancer Coimbra Dataset

We analyze the leakage behavior of several obfuscation families on the Breast Cancer Coimbra dataset by reporting the mean feature-wise mutual information (MI) between original and obfuscated features, as summarized in Table A6, with the corresponding visual trends shown in Figure A5 and Figure A6. Since lower MI values correspond to stronger obfuscation, reductions across settings indicate less leakage. Random projection (Figure A5a) yields a mild improvement, with MI decreasing from 2.007325 at $n_{\text{components}} = 4$ to 1.889524 at 9, and SEM values below $7 \times 10^{-4}$ confirming stability. PCA with Gaussian noise (Figure A5b) shows strong monotonic reductions, especially for keep = 1.0, where MI drops from 2.357974 at zero noise to 1.902990 at a noise level of 0.1. Lower keep ratios start from reduced baselines: for instance, keep = 0.5 begins at 1.895625 and stabilizes near 1.883164, demonstrating that dimensionality reduction alone already provides leakage suppression. Feature shuffling (Figure A5c) decreases MI nearly linearly with the shuffled fraction, falling from 2.357974 to 1.643320 at full shuffling. The Gaussian mechanism in both its $\varepsilon$ (Figure A5d) and $\sigma$ parameterizations (Figure A6c) remains invariant, with MI tightly clustered around 1.921, indicating a negligible effect. Finally, Lie-group transformations (Figure A6a,b) operate in a distinctly low-leakage regime, with SU centered around 1.497, SL near 1.587, and only minimal variation (<0.01) across map noise. SEM values remain consistently low across all methods, confirming robustness over $n_{\text{runs}} = 10{,}000$. In summary, the obfuscation families, again, show heterogeneous effectiveness: PCA and feature shuffling achieve clear monotonic reductions, random projection offers modest gains, and the Gaussian mechanism is ineffective at the tested scales, while Lie maps stand out for their inherently low leakage.
When focusing on SU and SL in particular, their strength becomes evident: despite showing little monotonic variation with map noise, they consistently maintain the lowest MI levels overall. Compared to PCA or shuffling, which require large perturbations or extreme fractions to reach similar suppression, SU and SL achieve strong privacy protection by default. This makes them not only more effective but also more robust to parameter choice. Overall, SU and SL emerge as the most reliable obfuscation families for this dataset, combining naturally low leakage with consistent stability.
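To make the reported quantity concrete, the following sketch shows one way to estimate the mean feature-wise MI, assuming the histogram-style discretization commonly paired with scikit-learn's mutual_info_score [38,39]. The helper names (mean_featurewise_mi, mi_mean_sem) and the bin count n_bins = 20 are illustrative assumptions rather than the exact experimental configuration, and the sketch applies only to obfuscations that preserve the feature dimension.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mean_featurewise_mi(X_orig, X_obf, n_bins=20):
    """Mean MI between matched original/obfuscated columns (binned estimate)."""
    mis = []
    for j in range(X_orig.shape[1]):
        # Discretize both columns so the clustering-based MI estimator applies.
        a = np.digitize(X_orig[:, j], np.histogram_bin_edges(X_orig[:, j], bins=n_bins))
        b = np.digitize(X_obf[:, j], np.histogram_bin_edges(X_obf[:, j], bins=n_bins))
        mis.append(mutual_info_score(a, b))
    return float(np.mean(mis))

def mi_mean_sem(X_orig, obfuscate, n_runs=10_000):
    """Repeat a stochastic obfuscation and report mean MI with its SEM [41]."""
    scores = np.array([mean_featurewise_mi(X_orig, obfuscate(X_orig))
                       for _ in range(n_runs)])
    return scores.mean(), scores.std(ddof=1) / np.sqrt(n_runs)
```

Any dimension-preserving obfuscation can be passed as the obfuscate callable, reproducing the mean ± SEM layout of the tables in this appendix.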
Table A6. Mutual information (mean ± SEM) between original and obfuscated features on the Breast Cancer Coimbra dataset; n_runs = 10,000. Values are reported with a precision of 10⁻⁶. Lower MI indicates less leakage.
Random Projection
n_components | Mean MI | SEM
4 | 2.007325 | 0.000685
8 | 1.895714 | 0.000474
9 | 1.889524 | 0.000449
PCA + Gaussian (keep = 1.0)
noise std | Mean MI | SEM
0.000000 | 2.357974 | 0.000000
0.010000 | 2.214374 | 0.000266
0.030000 | 2.070490 | 0.000302
0.050000 | 1.992021 | 0.000319
0.100000 | 1.902990 | 0.000346
PCA + Gaussian (keep = 0.75)
noise std | Mean MI | SEM
0.000000 | 2.001390 | 0.000000
0.010000 | 2.009164 | 0.000188
0.030000 | 1.992153 | 0.000260
0.050000 | 1.958256 | 0.000299
0.100000 | 1.897717 | 0.000348
PCA + Gaussian (keep = 0.5)
noise std | Mean MI | SEM
0.000000 | 1.895625 | 0.000000
0.010000 | 1.891296 | 0.000157
0.030000 | 1.884895 | 0.000248
0.050000 | 1.884485 | 0.000284
0.100000 | 1.883164 | 0.000335
Feature Shuffling
shuffled fraction | Mean MI | SEM
0.000000 | 2.357974 | 0.000000
0.250000 | 2.199312 | 0.000263
0.500000 | 2.040280 | 0.000323
0.750000 | 1.802430 | 0.000305
1.000000 | 1.643320 | 0.000208
Gaussian Mechanism (ε)
ε | Mean MI | SEM
0.500000 | 1.920998 | 0.000399
1.000000 | 1.921232 | 0.000399
2.000000 | 1.921035 | 0.000402
4.000000 | 1.921397 | 0.000401
Gaussian Mechanism (σ)
σ | Mean MI | SEM
3.805101 | 1.921397 | 0.000401
7.610202 | 1.921035 | 0.000402
15.220405 | 1.921232 | 0.000399
30.440809 | 1.920998 | 0.000399
LIE SU (map noise)
map noise | Mean MI | SEM
0.000000 | 1.496904 | 0.000000
0.001000 | 1.498628 | 0.000045
0.003000 | 1.498612 | 0.000048
0.010000 | 1.499831 | 0.000058
0.032000 | 1.501576 | 0.000061
0.100000 | 1.506421 | 0.000124
LIE SL (map noise)
map noise | Mean MI | SEM
0.000000 | 1.586272 | 0.000000
0.001000 | 1.587166 | 0.000014
0.003000 | 1.587273 | 0.000015
0.010000 | 1.588067 | 0.000019
0.032000 | 1.591662 | 0.000043
0.100000 | 1.593364 | 0.000127
Figure A5. Breast Cancer Coimbra dataset: leakage trends across random projection, PCA + Gaussian, feature shuffling, and Gaussian mechanism ( ε -parameterized). (a) Random Projection: MI vs. n components (±SEM). (b) PCA + Gaussian: MI vs. noise std in PCA space (±SEM). (c) Feature Shuffling: MI vs. shuffled fraction (±SEM). (d) Gaussian Mechanism: MI vs. privacy parameter ε (±SEM).
Figure A6. Breast Cancer Coimbra dataset: leakage trends for Lie-group maps and the Gaussian mechanism parameterized by σ . (a) LIE SU: MI vs. map noise (±SEM). (b) LIE SL: MI vs. map noise (±SEM). (c) Gaussian Mechanism: MI vs. noise σ (±SEM).

Appendix B.7. Leakage Analysis Discussion: Diabetes Dataset

We analyze the leakage behavior of several obfuscation families on the Pima Indians Diabetes dataset by reporting the mean feature-wise mutual information (MI) between original and obfuscated features, as summarized in Table A7, with corresponding visual trends shown in Figure A7 and Figure A8. Since lower MI values correspond to stronger obfuscation, reductions across settings indicate less leakage. Random projection (Figure A7a) shows only a slight effect: MI increases marginally from 1.045728 at n_components = 4 to 1.075660 at 8, suggesting little leakage suppression. PCA with Gaussian noise (Figure A7b) yields a clear monotonic reduction: for keep = 1.0, MI decreases from 2.629967 at zero noise to 1.675017 at noise = 0.1. Lower keep ratios already reduce leakage substantially at the baseline, with keep = 0.5 starting at 1.392278 and further improving to 1.357252 at noise = 0.1, demonstrating the combined effect of dimensionality reduction and perturbation. Feature shuffling (Figure A7c) is particularly strong here, producing a nearly linear drop from 2.645523 at zero shuffle to 0.903854 at full shuffling, the lowest MI observed among the non-Lie methods. The Gaussian mechanism, in both its ε (Figure A7d) and σ (Figure A8c) parameterizations, shows invariance, with MI values remaining tightly around 1.059 across all tested settings, indicating no meaningful dependence reduction. Lie-group transformations (Figure A8a,b) start from already low baselines (SU around 0.835 and SL around 0.881), with only slight variation across map-noise levels (changes < 0.01). SEM values remain consistently low across methods, confirming robustness over n_runs = 10,000. In summary, the obfuscation families again show heterogeneous effectiveness: PCA and feature shuffling exhibit strong monotonic improvements, random projection is ineffective, and Gaussian noise mechanisms remain flat, while Lie maps operate at inherently low leakage levels.
When focusing on SU and SL in particular, their strength becomes evident: despite showing little monotonic variation with map noise, they consistently maintain the lowest MI values overall. Compared to PCA or shuffling, which require large noise or extreme fractions to reach similar suppression, SU and SL achieve strong privacy protection inherently. This positions them as the most effective and reliable techniques for this dataset.
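As a reference point for the flat Gaussian-mechanism curves, a minimal sketch of the classical (ε, δ) Gaussian mechanism follows, using the standard calibration σ = Δ·sqrt(2 ln(1.25/δ))/ε [24,31]. The sensitivity Δ and δ below are placeholder assumptions, not the values used in our experiments, although the σ ∝ 1/ε scaling matches the ε–σ pairing in Table A7 (the product σ·ε is constant across the tested settings).

```python
import numpy as np

def gaussian_mechanism(X, epsilon, delta=1e-5, sensitivity=1.0, rng=None):
    """Perturb each feature with N(0, sigma^2) noise, with sigma calibrated
    from (epsilon, delta) as in the classical Gaussian mechanism [24,31]."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return X + rng.normal(0.0, sigma, size=X.shape), sigma
```

Because σ scales as 1/ε, all four tested ε values map to noise that is large relative to the standardized feature scale, which is consistent with the near-constant MI observed above.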
Table A7. Mutual information (mean ± SEM) between original and obfuscated features on the Pima Indians Diabetes dataset; n_runs = 10,000. Values are reported with a precision of 10⁻⁶. Lower MI indicates less leakage.
Random Projection
n_components | Mean MI | SEM
4 | 1.045728 | 0.000531
8 | 1.075660 | 0.000366
PCA + Gaussian (keep = 1.0)
noise std | Mean MI | SEM
0.000000 | 2.629967 | 0.000000
0.010000 | 2.406981 | 0.000196
0.030000 | 2.167315 | 0.000147
0.050000 | 1.992857 | 0.000160
0.100000 | 1.675017 | 0.000172
PCA + Gaussian (keep = 0.75)
noise std | Mean MI | SEM
0.000000 | 1.743887 | 0.000000
0.010000 | 1.743033 | 0.000077
0.030000 | 1.716163 | 0.000111
0.050000 | 1.669445 | 0.000134
0.100000 | 1.537582 | 0.000156
PCA + Gaussian (keep = 0.5)
noise std | Mean MI | SEM
0.000000 | 1.392278 | 0.000000
0.010000 | 1.395283 | 0.000068
0.030000 | 1.392868 | 0.000101
0.050000 | 1.386234 | 0.000116
0.100000 | 1.357252 | 0.000144
Feature Shuffling
shuffled fraction | Mean MI | SEM
0.000000 | 2.645523 | 0.000000
0.250000 | 2.210083 | 0.000261
0.500000 | 1.774980 | 0.000305
0.750000 | 1.339565 | 0.000276
1.000000 | 0.903854 | 0.000138
Gaussian Mechanism (ε)
ε | Mean MI | SEM
0.500000 | 1.058538 | 0.000243
1.000000 | 1.058796 | 0.000244
2.000000 | 1.059446 | 0.000244
4.000000 | 1.061787 | 0.000243
Gaussian Mechanism (σ)
σ | Mean MI | SEM
3.805101 | 1.061787 | 0.000243
7.610202 | 1.059446 | 0.000244
15.220404 | 1.058795 | 0.000244
30.440807 | 1.058537 | 0.000243
LIE SU (map noise)
map noise | Mean MI | SEM
0.000000 | 0.834774 | 0.000000
0.001000 | 0.834412 | 0.000016
0.003000 | 0.834530 | 0.000017
0.010000 | 0.835113 | 0.000025
0.032000 | 0.837033 | 0.000051
0.100000 | 0.838969 | 0.000072
LIE SL (map noise)
map noise | Mean MI | SEM
0.000000 | 0.881219 | 0.000000
0.001000 | 0.880126 | 0.000018
0.003000 | 0.879978 | 0.000033
0.010000 | 0.879444 | 0.000027
0.032000 | 0.877769 | 0.000045
0.100000 | 0.871098 | 0.000085
Figure A7. Diabetes dataset: leakage trends across random projection, PCA + Gaussian, feature shuffling, and Gaussian mechanism ( ε -parameterized). (a) Random Projection: MI vs. n components (±SEM). (b) PCA + Gaussian: MI vs. noise std (±SEM). (c) Feature Shuffling: MI vs. shuffled fraction (±SEM). (d) Gaussian Mechanism: MI vs. ε (±SEM).
Figure A8. Diabetes dataset: leakage trends for Lie-group maps and the Gaussian mechanism parameterized by σ . (a) LIE SU: MI vs. map noise (±SEM). (b) LIE SL: MI vs. map noise (±SEM). (c) Gaussian Mechanism: MI vs. noise σ (±SEM).

Appendix B.8. Leakage Analysis Discussion: ILPD Dataset

We analyze the leakage behavior of several obfuscation families on the Indian Liver Patient dataset by reporting the mean feature-wise mutual information (MI) between original and obfuscated features, as summarized in Table A8, with corresponding visual trends shown in Figure A9 and Figure A10. Since lower MI values correspond to stronger obfuscation, reductions across settings indicate less leakage. Random projection (Figure A9a) shows only modest suppression, with MI ranging from 0.829246 at n_components = 5 to 0.946211 at 10, and thus does not provide substantial leakage reduction. PCA with Gaussian noise (Figure A9b) displays a clear monotonic decline: for keep = 1.0, MI decreases from 1.992123 at zero noise to 1.246156 at noise = 0.1, while keep = 0.5 starts from a lower baseline of 1.455830 and reduces further to 1.237322, indicating the combined benefits of dimensionality reduction and noise injection. Feature shuffling (Figure A9c) strongly reduces leakage in a nearly linear fashion, from 1.755441 at fraction 0.25 down to 0.717366 at full shuffling, making it one of the most effective perturbations on this dataset. The Gaussian mechanism (Figure A9d and Figure A10c) yields consistently low MI values around 0.918 across both the ε and σ ranges but with no significant monotonic trend, essentially behaving as a stable baseline rather than a tunable suppression mechanism. Lie-group transformations (Figure A10a,b) operate at already low leakage: SU remains near 0.834 and SL near 0.918, with only negligible fluctuations (<0.01) across map-noise levels. SEM values remain low across all families, confirming robustness over n_runs = 10,000. In summary, the ILPD dataset exhibits heterogeneous responses: PCA with noise and feature shuffling provide strong monotonic suppression, random projection is modest, and Gaussian mechanisms remain flat, while Lie maps maintain inherently low MI regardless of tuning.
Importantly, SU and SL transformations stand out by starting from much lower baseline MI than most other methods, effectively placing them in a natural sweet spot of leakage protection. Unlike PCA or shuffling, which require stronger perturbations to reach similar levels, the Lie maps achieve strong and stable privacy with minimal parameter adjustments, marking them as the most effective obfuscation families for this dataset.
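For illustration, a minimal sketch of the PCA + Gaussian family follows, assuming the noise is added to the retained principal-component scores before projecting back; the keep and noise_std arguments mirror the table parameters, while the absolute (rather than variance-scaled) noise and the helper name pca_gaussian_obfuscate are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_gaussian_obfuscate(X, keep=0.5, noise_std=0.05, rng=None):
    """Keep a fraction of principal components, perturb the component scores
    with Gaussian noise, and reconstruct in the original feature space."""
    rng = np.random.default_rng() if rng is None else rng
    n_keep = max(1, int(round(keep * X.shape[1])))       # e.g., keep=0.5 on 10 features -> 5
    pca = PCA(n_components=n_keep).fit(X)
    scores = pca.transform(X) + rng.normal(0.0, noise_std, size=(X.shape[0], n_keep))
    return pca.inverse_transform(scores)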
Table A8. Mutual information (mean ± SEM) between original and obfuscated features on the Indian Liver Patient dataset; n_runs = 10,000. Values are reported with a precision of 10⁻⁶. Lower MI indicates less leakage.
Random Projection
n_components | Mean MI | SEM
5 | 0.829246 | 0.000538
8 | 0.846292 | 0.000398
10 | 0.946211 | 0.000384
PCA + Gaussian (keep = 1.0)
noise std | Mean MI | SEM
0.000000 | 1.992123 | 0.000000
0.010000 | 1.802029 | 0.000146
0.030000 | 1.580378 | 0.000152
0.050000 | 1.438586 | 0.000150
0.100000 | 1.246156 | 0.000153
PCA + Gaussian (keep = 0.75)
noise std | Mean MI | SEM
0.000000 | 1.825114 | 0.000000
0.010000 | 1.750205 | 0.000150
0.030000 | 1.585204 | 0.000146
0.050000 | 1.455445 | 0.000154
0.100000 | 1.260037 | 0.000160
PCA + Gaussian (keep = 0.5)
noise std | Mean MI | SEM
0.000000 | 1.455830 | 0.000000
0.010000 | 1.438615 | 0.000083
0.030000 | 1.389676 | 0.000116
0.050000 | 1.341065 | 0.000125
0.100000 | 1.237322 | 0.000147
Feature Shuffling
shuffled fraction | Mean MI | SEM
0.250000 | 1.755441 | 0.000409
0.500000 | 1.365865 | 0.000518
0.750000 | 0.977325 | 0.000420
1.000000 | 0.717366 | 0.000129
Gaussian Mechanism (ε)
ε | Mean MI | SEM
0.500000 | 0.917746 | 0.000203
1.000000 | 0.918131 | 0.000203
2.000000 | 0.919364 | 0.000203
4.000000 | 0.924021 | 0.000205
Gaussian Mechanism (σ)
σ | Mean MI | SEM
3.805101 | 0.924021 | 0.000205
7.610202 | 0.919364 | 0.000203
15.220405 | 0.918131 | 0.000203
30.440809 | 0.917746 | 0.000203
LIE SU (map noise)
map noise | Mean MI | SEM
0.000000 | 0.834060 | 0.000000
0.001000 | 0.833638 | 0.000012
0.003000 | 0.833685 | 0.000012
0.010000 | 0.834183 | 0.000017
0.032000 | 0.835897 | 0.000034
0.100000 | 0.836853 | 0.000056
LIE SL (map noise)
map noise | Mean MI | SEM
0.000000 | 0.917748 | 0.000000
0.001000 | 0.917342 | 0.000022
0.003000 | 0.917091 | 0.000023
0.010000 | 0.917502 | 0.000025
0.032000 | 0.917465 | 0.000033
0.100000 | 0.923466 | 0.000061
Figure A9. ILPD dataset: leakage trends across random projection, PCA + Gaussian, feature shuffling, and Gaussian mechanism ( ε -parameterized). (a) Random Projection: MI vs. n components (±SEM). (b) PCA + Gaussian: MI vs. noise std (±SEM). (c) Feature Shuffling: MI vs. shuffled fraction (±SEM). (d) Gaussian Mechanism: MI vs. ε (±SEM).
Figure A10. ILPD dataset: leakage trends for Lie-group maps and the Gaussian mechanism parameterized by σ . (a) LIE SU: MI vs. map noise (±SEM). (b) LIE SL: MI vs. map noise (±SEM). (c) Gaussian Mechanism: MI vs. noise σ (±SEM).

References

1. Kong, P.Y. A Review of Quantum Key Distribution Protocols in the Perspective of Smart Grid Communication Security. IEEE Syst. J. 2022, 16, 41–54.
2. S, N.; Singh, H.; N, A.U. An extensive review on quantum computers. Adv. Eng. Softw. 2022, 174, 103337.
3. Biamonte, J.; Wittek, P.; Pancotti, N.; Rebentrost, P.; Wiebe, N.; Lloyd, S. Quantum machine learning. Nature 2017, 549, 195–202.
4. Havlíček, V.; Córcoles, A.D.; Temme, K.; Harrow, A.W.; Kandala, A.; Chow, J.M.; Gambetta, J.M. Supervised learning with quantum-enhanced feature spaces. Nature 2019, 567, 209–212.
5. Hall, B.C. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction, 2nd ed.; Graduate Texts in Mathematics; Springer: Cham, Switzerland, 2015; Volume 222.
6. Georgi, H. Lie Algebras in Particle Physics: From Isospin to Unified Theories; CRC Press: Boca Raton, FL, USA, 2019.
7. Singh, G.; Violi, V.; Fisichella, M. Federated Learning to Safeguard Patients Data: A Medical Image Retrieval Case. Big Data Cogn. Comput. 2023, 7, 18.
8. Mbonu, W.E.; Maple, C.; Epiphaniou, G.; Panchev, C. A Verifiable, Privacy-Preserving, and Poisoning Attack-Resilient Federated Learning Framework. Big Data Cogn. Comput. 2025, 9, 85.
9. Schuld, M.; Killoran, N. Quantum Machine Learning in Feature Hilbert Spaces. Phys. Rev. Lett. 2019, 122, 040504.
10. Schuld, M.; Petruccione, F. Quantum ensembles of quantum classifiers. Sci. Rep. 2018, 8, 2772.
11. Schuld, M.; Sinayskiy, I.; Petruccione, F. An introduction to quantum machine learning. Contemp. Phys. 2015, 56, 172–185.
12. Qiskit Community. Qiskit: An Open-Source Framework for Quantum Computing. 2022. Available online: https://www.ibm.com/quantum/qiskit (accessed on 24 August 2025).
13. Olatunji, I.E.; Rauch, J.; Katzensteiner, M.; Khosla, M. A Review of Anonymization for Healthcare Data. Big Data 2022, ahead of print.
14. Sweeney, L. k-Anonymity: A Model for Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570.
15. Ganta, S.R.; Kasiviswanathan, S.P.; Smith, A. Composition Attacks and Auxiliary Information in Data Privacy. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 265–273.
16. Lu, W.J.; Yamada, Y.; Sakuma, J. Privacy-preserving genome-wide association studies on cloud environment using fully homomorphic encryption. BMC Med. Inform. Decis. Mak. 2015, 15, S1.
17. Popescu, A.B.; Taca, I.A.; Vizitiu, A.; Nita, C.I.; Suciu, C.; Itu, L.M.; Scafa-Udriste, A. Obfuscation Algorithm for Privacy-Preserving Deep Learning-Based Medical Image Analysis. Appl. Sci. 2022, 12, 3997.
18. Lloyd, S.; Mohseni, M.; Rebentrost, P. Quantum principal component analysis. Nat. Phys. 2014, 10, 631–633.
19. Choi, E.; Biswal, S.; Malin, B.; Duke, J.; Stewart, W.F.; Sun, J. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. In Proceedings of the 2nd Machine Learning for Healthcare Conference; Doshi-Velez, F., Fackler, J., Kale, D., Ranganath, R., Wallace, B., Wiens, J., Eds.; Proceedings of Machine Learning Research, Volume 68; PMLR: Cambridge, MA, USA, 2017; pp. 286–305.
20. Raubitzek, S.; Mallinger, K. On the Applicability of Quantum Machine Learning. Entropy 2023, 25, 992.
21. Raubitzek, S.; Schrittwieser, S.; Schatten, A.; Mallinger, K. Quantum inspired kernel matrices: Exploring symmetry in machine learning. Phys. Lett. A 2024, 525, 129895.
22. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, p. 17.
23. Wolpert, D.; Macready, W. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
24. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the Third Theory of Cryptography Conference (TCC 2006), New York, NY, USA, 4–7 March 2006; Halevi, S., Rabin, T., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3876, pp. 265–284.
25. Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407.
26. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016), Vienna, Austria, 24–28 October 2016; pp. 308–318.
27. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 3-es.
28. Bingham, E.; Mannila, H. Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, CA, USA, 26–29 August 2001; Association for Computing Machinery: New York, NY, USA, 2001; pp. 245–250.
29. Li, P.; Hastie, T.; Church, K.W. Very Sparse Random Projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), Philadelphia, PA, USA, 20–23 August 2006; pp. 287–296.
30. Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
31. Balle, B.; Wang, Y. Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; Proceedings of Machine Learning Research; Volume 80, pp. 394–403.
32. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership Inference Attacks Against Machine Learning Models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 3–18.
33. Muralidhar, K.; Sarathy, R. Data Shuffling—A New Masking Approach for Numerical Data. Manag. Sci. 2006, 52, 658–670.
34. Domingo-Ferrer, J.; Torra, V. A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies; Doyle, P., Lane, J.I., Theeuwes, J., Zayatz, L.M., Eds.; North-Holland: Amsterdam, The Netherlands, 2001; pp. 111–133.
35. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley Series in Telecommunications and Signal Processing; Wiley-Interscience: Hoboken, NJ, USA, 2006.
36. Ünsal, A.; Önen, M. Information-Theoretic Approaches to Differential Privacy. ACM Comput. Surv. 2023, 56, 1–18.
37. Liu, Y.; Zhu, X.; Wang, J.; Xiao, J. A Quantitative Metric for Privacy Leakage in Federated Learning. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3065–3069.
38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
39. scikit-learn developers. mutual_info_score—Mutual Information Between Two Clusterings. 2025. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html (accessed on 21 August 2025).
40. Paninski, L. Estimation of Entropy and Mutual Information. Neural Comput. 2003, 15, 1191–1253.
41. Altman, D.G.; Bland, J.M. Standard deviations and standard errors. BMJ 2005, 331, 903.
Figure 1. Depiction of the Z and the ZZ feature maps. Both schemes were adopted according to the implementation in IBM's Qiskit [12] (https://docs.quantum.ibm.com/api/qiskit/qiskit.circuit.library.PauliFeatureMap, accessed on 24 August 2025). The triangle on the left symbolizes the incoming qubit; afterward, for both feature maps, each qubit is first put into superposition via a Hadamard gate to make the encoding more expressive. Next, Pauli-Z rotations are applied to the qubits.
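The circuits in Figure 1 can be reproduced directly with Qiskit's circuit library [12]; the following sketch instantiates a two-qubit ZZ feature map, where the feature values (0.4, 0.8) are purely illustrative.

```python
from qiskit.circuit.library import ZZFeatureMap

# Two-qubit ZZ feature map: Hadamards followed by data-parameterized
# Pauli-Z rotations and ZZ entangling phases.
feature_map = ZZFeatureMap(feature_dimension=2, reps=1)
circuit = feature_map.assign_parameters([0.4, 0.8])  # encode one two-feature sample
print(circuit.decompose().draw())
```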
Figure 2. Illustration of our data encoding strategy using the SU(3) group. A data sample with 8 features parameterizes the Gell-Mann matrices, which are then transformed into a group element via the exponential map. This group element is applied to a normalized "empty" input vector, yielding a complex three-component vector that embeds the information of the data sample. Note that the imaginary unit i is part of the exponential map.
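A minimal sketch of this encoding follows, using the eight Gell-Mann matrices as su(3) generators and SciPy's matrix exponential. The uniform "empty" vector (1, 1, 1)/√3 and the point at which the symmetry-breaking noise enters are assumptions based on the caption; the precise construction is defined in Section 3.

```python
import numpy as np
from scipy.linalg import expm

# The eight Gell-Mann matrices, the standard generators of su(3).
GELL_MANN = [np.array(m, dtype=complex) for m in [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],     # lambda_1
    [[0, -1j, 0], [1j, 0, 0], [0, 0, 0]],  # lambda_2
    [[1, 0, 0], [0, -1, 0], [0, 0, 0]],    # lambda_3
    [[0, 0, 1], [0, 0, 0], [1, 0, 0]],     # lambda_4
    [[0, 0, -1j], [0, 0, 0], [1j, 0, 0]],  # lambda_5
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],     # lambda_6
    [[0, 0, 0], [0, 0, -1j], [0, 1j, 0]],  # lambda_7
]]
GELL_MANN.append(np.diag([1.0, 1.0, -2.0]).astype(complex) / np.sqrt(3))  # lambda_8

def su3_encode(x, noise_scale=0.0, rng=None):
    """Encode an 8-feature sample as U|v> with U = exp(i * sum_k x_k * lambda_k);
    Gaussian noise on the generator coefficients breaks the symmetry (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    coeffs = np.asarray(x, dtype=float) + rng.normal(0.0, noise_scale, size=8)
    generator = sum(c * g for c, g in zip(coeffs, GELL_MANN))
    U = expm(1j * generator)                    # group element via the exponential map
    v = np.ones(3, dtype=complex) / np.sqrt(3)  # normalized "empty" input vector
    return U @ v                                # complex three-component embedding
```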
Figure 3. Pipeline of our conducted machine learning experiments. The pipeline depicts the incoming original data as the feature vector x . Then, the noisy Lie-group transformations from Section 3 are applied to the original dataset to obtain the transformed feature vector ( ϕ ). This transformed feature vector is then used as the input for the employed machine learning algorithm to classify the individual diseases for each dataset.
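Putting the pieces together, the following hypothetical end-to-end run mirrors the Figure 3 pipeline under stated assumptions: it reuses the su3_encode sketch from above, splits the complex embedding into real and imaginary parts (a representation choice of ours), and trains LightGBM [22] on stand-in random data; dataset loading, normalization, and hyperparameter choices are omitted.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

def lie_obfuscate(X, noise_scale=0.01):
    """Row-wise noisy SU(3) encoding; the real/imag split yields 6 real features."""
    phi = np.array([su3_encode(x, noise_scale) for x in X])
    return np.hstack([phi.real, phi.imag])

X = np.random.rand(200, 8)             # stand-in for an 8-feature medical dataset
y = np.random.randint(0, 2, size=200)  # stand-in binary disease labels
X_tr, X_te, y_tr, y_te = train_test_split(lie_obfuscate(X), y, test_size=0.2)
clf = LGBMClassifier().fit(X_tr, y_tr)
print("accuracy on obfuscated features:", clf.score(X_te, y_te))
```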
Figure 4. Accuracy scores for the Breast Cancer Wisconsin and diabetes datasets are presented relative to the benchmark results for symmetry groups SU and SL. We calculated the plots by subtracting the benchmark accuracy from the accuracy of the individual transformed approaches. Light-blue areas indicate instances where the accuracy was unchanged or improved by the obfuscation technique.
Figure 5. Accuracy scores for the Breast Cancer Coimbra and ILPD datasets are presented relative to the benchmark results for symmetry groups SU and SL. We calculated the plots by subtracting the benchmark accuracy from the accuracy of the individual transformed approaches. Light-blue areas indicate instances where the accuracy was unchanged or improved by the obfuscation technique.
Table 1. Results from our experiments with noise levels logarithmically ranging from 0 to 0.1 and the corresponding multipliers for each group extending from 1 (indicating no additional data) to 5 (denoting five times the original data volume), where M stands for the multiplier and SU and SL denote the specific utilized symmetry groups. The benchmark accuracy refers to the result obtained by the standard approach using LightGBM and non-obfuscated features.
Breast Cancer Wisconsin, Benchmark Accuracy: 0.974
Noise Level ϵ | M. 1 SU | M. 2 SU | M. 3 SU | M. 4 SU | M. 5 SU | M. 1 SL | M. 2 SL | M. 3 SL | M. 4 SL | M. 5 SL
0.000 | 0.921 | 0.930 | 0.921 | 0.930 | 0.956 | 0.965 | 0.965 | 0.939 | 0.947 | 0.947
0.001 | 0.939 | 0.930 | 0.939 | 0.939 | 0.930 | 0.965 | 0.947 | 0.965 | 0.965 | 0.965
0.003 | 0.930 | 0.939 | 0.939 | 0.947 | 0.947 | 0.956 | 0.956 | 0.965 | 0.974 | 0.956
0.010 | 0.939 | 0.939 | 0.930 | 0.921 | 0.939 | 0.974 | 0.974 | 0.956 | 0.956 | 0.965
0.032 | 0.930 | 0.939 | 0.930 | 0.939 | 0.930 | 0.956 | 0.974 | 0.956 | 0.956 | 0.956
0.100 | 0.947 | 0.947 | 0.921 | 0.939 | 0.921 | 0.965 | 0.956 | 0.956 | 0.956 | 0.947
Pima Indians Diabetes, Benchmark Accuracy: 0.747
Noise Level ϵ | M. 1 SU | M. 2 SU | M. 3 SU | M. 4 SU | M. 5 SU | M. 1 SL | M. 2 SL | M. 3 SL | M. 4 SL | M. 5 SL
0.000 | 0.695 | 0.682 | 0.669 | 0.682 | 0.675 | 0.727 | 0.747 | 0.714 | 0.682 | 0.701
0.001 | 0.682 | 0.675 | 0.701 | 0.675 | 0.669 | 0.734 | 0.734 | 0.708 | 0.727 | 0.727
0.003 | 0.695 | 0.701 | 0.688 | 0.688 | 0.701 | 0.773 | 0.766 | 0.747 | 0.721 | 0.721
0.010 | 0.714 | 0.714 | 0.675 | 0.675 | 0.682 | 0.714 | 0.727 | 0.714 | 0.708 | 0.714
0.032 | 0.675 | 0.682 | 0.682 | 0.682 | 0.695 | 0.753 | 0.708 | 0.721 | 0.701 | 0.708
0.100 | 0.695 | 0.701 | 0.701 | 0.701 | 0.701 | 0.727 | 0.714 | 0.708 | 0.740 | 0.760
Indian Liver Patient, Benchmark Accuracy: 0.744
Noise Level ϵ | M. 1 SU | M. 2 SU | M. 3 SU | M. 4 SU | M. 5 SU | M. 1 SL | M. 2 SL | M. 3 SL | M. 4 SL | M. 5 SL
0.000 | 0.744 | 0.744 | 0.778 | 0.675 | 0.752 | 0.744 | 0.744 | 0.684 | 0.675 | 0.701
0.001 | 0.744 | 0.744 | 0.744 | 0.744 | 0.769 | 0.735 | 0.692 | 0.769 | 0.718 | 0.744
0.003 | 0.744 | 0.744 | 0.744 | 0.744 | 0.744 | 0.744 | 0.701 | 0.701 | 0.632 | 0.718
0.010 | 0.744 | 0.744 | 0.744 | 0.726 | 0.684 | 0.701 | 0.761 | 0.667 | 0.778 | 0.718
0.032 | 0.744 | 0.744 | 0.735 | 0.726 | 0.718 | 0.778 | 0.744 | 0.744 | 0.744 | 0.744
0.100 | 0.744 | 0.752 | 0.744 | 0.744 | 0.744 | 0.744 | 0.752 | 0.744 | 0.744 | 0.726
Breast Cancer Coimbra, Benchmark Accuracy: 0.833
Noise Level ϵ | M. 1 SU | M. 2 SU | M. 3 SU | M. 4 SU | M. 5 SU | M. 1 SL | M. 2 SL | M. 3 SL | M. 4 SL | M. 5 SL
0.000 | 0.500 | 0.750 | 0.750 | 0.792 | 0.833 | 0.792 | 0.708 | 0.708 | 0.792 | 0.833
0.001 | 0.625 | 0.792 | 0.708 | 0.792 | 0.708 | 0.833 | 0.792 | 0.708 | 0.792 | 0.750
0.003 | 0.500 | 0.750 | 0.667 | 0.792 | 0.875 | 0.708 | 0.750 | 0.792 | 0.792 | 0.792
0.010 | 0.500 | 0.750 | 0.833 | 0.708 | 0.708 | 0.750 | 0.792 | 0.708 | 0.875 | 0.792
0.032 | 0.500 | 0.833 | 0.875 | 0.708 | 0.875 | 0.708 | 0.750 | 0.875 | 0.833 | 0.792
0.100 | 0.500 | 0.708 | 0.708 | 0.750 | 0.667 | 0.708 | 0.833 | 0.792 | 0.792 | 0.833