Distinguishability-Driven Voice Generation for Speaker Anonymization via Random Projection and GMM

Wang, Chunxia; Zhang, Qiuyu; Hu, Yingjie; Wei, Huiyi

doi:10.3390/bdcc10020043

by Chunxia Wang, Qiuyu Zhang *, Yingjie Hu and Huiyi Wei

School of Computer Science and Artificial Intelligence, Lanzhou University of Technology, Lanzhou 730050, China

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(2), 43; https://doi.org/10.3390/bdcc10020043

Submission received: 9 December 2025 / Revised: 16 January 2026 / Accepted: 26 January 2026 / Published: 29 January 2026

(This article belongs to the Topic Generative AI and Interdisciplinary Applications)

Abstract

Speaker anonymization effectively conceals speaker identity in speech signals to protect privacy. To address issues in existing anonymization systems, including reduced voice distinguishability, limited anonymized voices, reliance on an external speaker pool, and vulnerability to privacy leakage against strong attackers, a novel distinguishability-driven voice generation for speaker anonymization via random projection and the Gaussian Mixture Model (GMM) is proposed. This method first applies the random projection to lower the dimensionality of the X-vectors from an external speaker pool, and then constructs a GMM in the reduced dimensional space to fit the generative model. By sampling from this generative model, anonymous speaker identity representations are generated, ultimately synthesizing anonymized speech that maintains both intelligibility and distinguishability. To ensure the anonymized speech remains sufficiently distinguishable from the original and prevents excessive similarity, a cosine similarity check is implemented between the original X-vector and pseudo-X-vector. Experimental results on the VoicePrivacy Challenge datasets demonstrate that the proposed method not only effectively protects speaker privacy across different attack scenarios but also preserves speech content integrity while significantly enhancing speaker distinguishability between original speakers and their corresponding pseudo-speakers, as well as among different pseudo-speakers.

Keywords:

speaker anonymization; privacy protection; random projection; GMM; distinguishability

1. Introduction

The breakthrough advancements in deep learning have accelerated the widespread adoption of Automatic Speech Recognition (ASR) systems [1,2,3]. Voice interaction technologies, such as speech-to-text document generation [4], real-time multilingual speech translation [5], and intelligent voice search engines [6], have become deeply integrated into diverse aspects of society. Nevertheless, the large-scale development of such technologies entails the massive collection of voice data, which enhances service efficiency while simultaneously raising serious privacy protection challenges [7,8].

Voice data, as a typical type of biometric data, contains multidimensional sensitive information embedded in its acoustic features, including identity attributes, social status, physiological states, psychological traits, political stance, and religious beliefs [9]. These deeply embedded privacy features in speech signals are not only exposed to traditional data leakage risks but are also vulnerable to novel threats such as reconstruction attacks [10]. Moreover, with the advancement of voice cloning [11,12] and speech synthesis technologies [13,14], attackers can now synthesize highly realistic fake audio from just a few seconds of voice samples, posing severe threats to personal reputation and information security. Against this backdrop, the enhancement of regulations such as the General Data Protection Regulation (GDPR) [15] has made voice data de-identification and the development of privacy-preserving technologies a critical and urgent challenge in the field of AI ethics.

Speaker anonymization aims to desensitize identity information embedded in speech signals, making them unusable for voice cloning while effectively resisting identity tracing by automatic speaker verification (ASV) and automatic speaker identification (ASI) systems [16]. Critically, this process must preserve the original linguistic content and natural usability of speech to meet downstream application requirements. It is important to note that a seemingly straightforward approach—first transcribing speech to text, then modifying the text to remove speaker-specific patterns, and finally resynthesizing speech—is generally unsuitable for this task. This is due to three fundamental limitations: first, text transformation merely modifies linguistic patterns (e.g., regional dialect expressions) but fails to eliminate speaker-specific acoustic cues; second, it tends to distort linguistic content and paralinguistic information (e.g., emotional prosody), thereby compromising the essential requirement of preserving speech usability in speaker anonymization; third, it cannot mitigate voice cloning attacks, as such attacks rely on acoustic features rather than on the linguistic content of speech.

To advance speaker anonymization, the VoicePrivacy Initiative was launched in 2020 [17], which has since evolved into a biennial VoicePrivacy Challenge (VPC). VPC 2022 [18] focused on optimizing the robustness of anonymization methods, while VPC 2024 [19] introduced emotion preservation constraints, requiring anonymization systems to eliminate identity features while maintaining emotional intensity and the affective category of the original speech.

The latest VPC submissions revealed two dominant research directions in speaker anonymization: traditional signal processing-based methods [20,21,22] and Deep Neural Network (DNN)-based methods [23,24,25]. Signal processing-based methods can obscure a speaker’s identity by modifying instantaneous speech features (e.g., pitch and spectral envelope) and time scale, or by altering vocal characteristics from a speech generation perspective. These modifications effectively conceal the original speaker’s identity, speech rate, and other source and vocal tract patterns. Although computationally efficient and training-free, these methods suffer from two critical drawbacks: first, the speech content is prone to noticeable distortion; second, a prior study [26] suggested that signal processing-based anonymization might be reversible after a limited number of attempts, potentially exposing the original voice. In contrast, DNN-based anonymization has pioneered a new paradigm, leveraging advances in neural voice conversion and speech synthesis, with a primary focus on disentangled representation learning. The key rationale for choosing disentanglement methods lies in their ability to separate linguistic and paralinguistic features using distinct encoders, enabling the independent modification of each attribute. For instance, Fang et al. [27] introduced an anonymization framework based on modified X-vectors, where pseudo-speaker identities were generated by averaging a set of candidate X-vectors from a speaker pool. Srivastava et al. designed three anonymization schemes based on probabilistic linear discriminant analysis (PLDA) distance metrics to obscure speakers’ identities [28]. However, Turner et al. [29] pointed out that pseudo-X-vectors in VPC were derived by averaging multiple X-vectors from a speaker pool, resulting in highly similar pseudo-X-vectors that reduced the diversity of anonymized voices. This high similarity not only made it harder to distinguish between anonymized voices, but also limited the number of distinguishable anonymized voices that could be generated.

In this paper, we propose a novel distinguishability-driven voice generation method using random projection and GMM, which addresses the limitations of existing speaker anonymization approaches. These limitations include the overly concentrated distribution of anonymized X-vectors, limited diversity in generated anonymized voices, excessive reliance on an external speaker pool, and a higher risk of privacy leakage when facing strong attackers.

The main contributions of this paper are summarized as follows:

(1): Random projection matrix for X-vector dimensionality reduction: We use a random projection matrix to construct a speaker-independent random generation mechanism. By injecting randomness, our approach not only enhances speaker privacy protection but also preserves the flexibility of high-dimensional data manifold structures, while significantly reducing computational overhead.
(2): GMM-based probabilistic sampler model: We employ GMM as the generative model, leveraging its strong distribution generalization capability to improve speech generation quality. The probabilistic mixture mechanism of GMM also helps defend against membership inference attacks. By sampling speaker embeddings directly from the GMM, we further reduce computational complexity. Additionally, the sampling model eliminates the need for a speaker pool during anonymization, mitigating privacy leakage risks while optimizing the anonymization process.
(3): Grid search strategy with GMM entropy for parameter optimization: We adopt a grid search strategy combined with GMM entropy metrics to evaluate probability distribution quality and optimize model parameters. This method not only enhances the model’s generalization ability but also achieves a strong balance between privacy protection and data distribution.
(4): Cosine similarity check for anonymity enhancement: We introduce a cosine similarity check mechanism to assess the discrimination between original X-vectors and pseudo-X-vectors. This effectively prevents insufficient distinguishability between original and anonymized voices due to excessive similarity, thereby improving speaker anonymization performance.
(5): Empirical evaluation on the VoicePrivacy Challenge datasets: The experimental results show that our method effectively protects speaker privacy in the ignorant and lazy informed attack scenarios (as formally defined in Section 5.3). After introducing the similarity check (SC), the average Equal Error Rate (EER) increased by 0.77 percentage points in the ignorant attack scenario (from 48.35% to 49.12%) and by 1.29 percentage points in the lazy informed attack scenario (from 45.54% to 46.83%). Furthermore, the voice distinctiveness ( $G_{VD}$ ) results indicate that our proposed approach markedly improves speaker distinguishability, not only between original and anonymized voices, but also among different anonymized samples. Across most datasets, voice distinctiveness improved significantly, reaching a peak value of 0.51.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces random projection. Section 4 details our proposed speaker anonymization method. Section 5 demonstrates the experimental setup, including datasets, evaluation metrics, and an analysis of the experimental results. Finally, Section 6 concludes this paper.

2. Related Work

Speaker anonymization is a flexible and effective privacy protection technique designed to remove speaker identity from voice signals, while preserving the intelligibility, naturalness, and distinguishability of the original speech. X-vectors are widely adopted as effective features for encoding speaker identity, owing to their proven performance in speaker verification and recognition tasks. X-vector-based speaker anonymization methods employed advanced neural network technologies, utilizing a Time Delay Neural Network (TDNN)-based ASV system to extract X-vectors [30], and a Factorized Time Delay Neural Network (TDNN-F)-based ASR acoustic model (AM) [31] to derive prosodic and linguistic features, including fundamental frequency (F0) and bottleneck (BN) features, from the original speech waveform. To effectively conceal speaker identity, the designated speaker anonymizer proposed by Srivastava et al. [28] generated an anonymized X-vector by averaging multiple X-vectors randomly drawn from an external speaker pool. For each source X-vector, the procedure first filtered the 200 most-distant candidates via cosine distance, and then randomly selected 100 of these for mean computation. The anonymized X-vector, along with F0 and BN features, was fed into a speech synthesis acoustic model (SS AM) to produce Mel-fbank features. These features, together with the anonymized X-vector and F0, were then used by a neural source filter (NSF) waveform model to synthesize the anonymized speech. Given its effectiveness in protecting speaker privacy over conventional signal processing techniques, this disentanglement-based approach has become a prevalent framework for achieving efficient and robust anonymization in subsequent studies.

In recent years, numerous researchers have enhanced X-vector-based anonymization from various perspectives to improve performance [23,32]. However, previous studies [26,33] indicated that existing anonymization systems remain constrained by their selection-based mechanisms. Specifically, the distribution of the external speaker pool significantly influenced anonymized speaker characteristics, and averaging selected X-vectors often diminished voice distinguishability. To address this, Mawalim et al. presented an X-vector-based anonymization method [34], which used singular value modification and statistics-based decomposition on the original X-vector with ensemble regression models to achieve transformation. They later introduced a clustering-based singular value correction method, further enhancing anonymization by adjusting F0 and speech duration [35]. Turner et al. [29] observed that the baseline methods produced overly compact pseudo-X-vector distributions, failing to utilize the full X-vector space and resulting in insufficient voice distinguishability. To mitigate this, they proposed a pseudo-speaker generator that combined PCA with a GMM, applying PCA dimensionality reduction to the external speaker pool and sampling anonymized X-vectors from the trained GMM. As another generative approach, Meyer et al. [36] utilized a Generative Adversarial Network (GAN) with a Wasserstein distance cost function to generate speaker embeddings, enforcing the distribution of the generated embeddings to be similar to that of the speaker embeddings corresponding to real speakers. Perero et al. introduced an autoencoder-based X-vector reconstruction, incorporating domain adversarial training to suppress speaker attributes such as gender, accent, and identity [37]. Another innovative direction was the Orthogonal Householder Neural Network (OHNN)-based anonymization [38], which rotated original X-vectors into anonymized ones while constraining their distribution to the original space for more natural results. Recently, Yao et al. [25] developed a speaker identity-related matrix model, transforming it into an anonymized singular value transformation-assisted matrix to conceal the original speaker identity, not only improving anonymized speech quality, but also strengthening privacy protection. Chen et al. [39] represented the speaker attributes with distributions to increase the voice distinguishability of the anonymized speech, where the nonlinear variations among the cohort speakers are leveraged to model the pseudo-speakers via neural networks and both utterances and frames are investigated as modeling units. Lee et al. [40] introduced the pinhole effect as a conceptual framework to explain identity linkability and found that the use of distinct pseudo-speakers increased speaker dispersion and reduced linkability compared to common pseudo-speaker mapping, thus enhancing privacy preservation.

3. Random Projection

Random projection (RP) is an efficient dimensionality reduction technique that maps high-dimensional data into a lower-dimensional space while preserving critical structural information [41]. Its core principle is to use a randomly generated projection matrix to transform the original high-dimensional data $X \in \mathbb{R}^{n \times d}$ into a reduced space $Y \in \mathbb{R}^{n \times k}$, where $n$ denotes the number of samples, $d$ is the original feature dimension, and $k$ is the reduced dimension ($k \ll d$). In this work, $X$ corresponds to X-vectors, which are fixed-dimensional (512-dimensional) speaker embeddings extracted from speech signals using a pre-trained TDNN. The mapping is defined as follows:

$$Y = X \cdot R \tag{1}$$

where $R \in \mathbb{R}^{d \times k}$ represents the random projection matrix.

The theoretical foundation of RP is the Johnson–Lindenstrauss (JL) lemma [42], which provides a probabilistic guarantee for preserving pairwise Euclidean ($l_{2}$) distances in the reduced space. The JL lemma states the following:

Johnson–Lindenstrauss lemma. For any $0 < \varepsilon < 1$ and any finite set of $n$ points in $\mathbb{R}^{d}$, there exists a mapping $f : \mathbb{R}^{d} \to \mathbb{R}^{k}$ with $k = \Omega\left(\frac{\log n}{\varepsilon^{2}}\right)$ such that, for every pair of points $x_{i}, x_{j}$ in the set,

$$(1 - \varepsilon) \lVert x_{i} - x_{j} \rVert^{2} \leq \lVert f(x_{i}) - f(x_{j}) \rVert^{2} \leq (1 + \varepsilon) \lVert x_{i} - x_{j} \rVert^{2}. \tag{2}$$

Here, $\lVert x_{i} - x_{j} \rVert$ denotes the Euclidean distance between points $x_{i}, x_{j}$ in the original $d$-dimensional space, $\lVert f(x_{i}) - f(x_{j}) \rVert$ represents their distance in the reduced $k$-dimensional space, $n$ is the number of points in the set, $\varepsilon$ controls the distortion tolerance, $k$ is independent of $d$ and grows only logarithmically with $n$, and $\Omega$ represents the asymptotic lower-bound symbol. The JL lemma ensures that speaker relationships and structural properties critical for anonymization are maintained even after dimensionality reduction. A smaller $\varepsilon$ preserves distance relationships more tightly but requires a larger $k$, increasing computational complexity. Conversely, a larger $\varepsilon$ permits a smaller $k$, trading accuracy for efficiency.
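The trade-off between $\varepsilon$ and $k$ can be made concrete with a small calculation. The sketch below assumes one commonly used explicit form of the JL bound, $k \geq 4 \ln n / (\varepsilon^{2}/2 - \varepsilon^{3}/3)$; the paper does not state which constant it uses to map $\varepsilon$ to $k$, so the function `jl_min_dim` and the pool size of 300 are illustrative assumptions only.

```python
import math


def jl_min_dim(n_samples: int, eps: float) -> int:
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3).

    This explicit constant is an assumption for illustration; the lemma
    itself only fixes the asymptotic order log(n) / eps^2.
    """
    if n_samples < 2:
        raise ValueError("need at least two points")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    denom = eps ** 2 / 2.0 - eps ** 3 / 3.0
    return math.ceil(4.0 * math.log(n_samples) / denom)


# Hypothetical pool of 300 embeddings; the eps grid mirrors the values
# explored later in the paper.
for eps in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"eps={eps:.1f} -> k >= {jl_min_dim(300, eps)}")
```

Under this assumed bound, loosening the distortion tolerance always lowers the required dimension, which is the accuracy-for-efficiency trade described above.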

Random matrices satisfying the JL lemma can be constructed via various methods, including sparse random matrices and Gaussian random matrices. Among these, Gaussian random matrices are the most widely adopted. Each element of a Gaussian random matrix $R$ is independently sampled from a normal distribution $N(0, \frac{1}{k})$. Owing to the favorable properties of Gaussian distributions, Gaussian RP demonstrates superior performance in preserving pairwise distances [43]. Therefore, this paper employs Gaussian RP to reduce the original 512-dimensional X-vectors to a lower-dimensional space. These reduced-dimensional X-vectors are subsequently used to generate pseudo-X-vectors (anonymized speaker embeddings) that are restored to the original 512-dimensional space via inverse random projection. These pseudo-X-vectors replace the original X-vectors in the anonymization pipeline and are specifically designed to obscure the original speaker’s identity while ensuring compatibility with downstream speech synthesis models.
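As a minimal sketch of the Gaussian construction described above, the following standard-library Python builds a projection matrix with entries drawn from $N(0, \frac{1}{k})$ and checks how well one pairwise squared distance survives the projection. The reduced dimension `k = 256`, the random seed, and the toy vectors standing in for X-vectors are assumptions made for illustration; they are not values from the paper.

```python
import math
import random


def gaussian_projection_matrix(d, k, rng):
    """Return a d x k matrix with i.i.d. entries drawn from N(0, 1/k)."""
    std = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, std) for _ in range(k)] for _ in range(d)]


def project(x, R):
    """Compute the row-vector product x . R, mapping d dimensions to k."""
    k = len(R[0])
    y = [0.0] * k
    for xi, row in zip(x, R):
        for j in range(k):
            y[j] += xi * row[j]
    return y


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


rng = random.Random(0)
d, k = 512, 256  # d matches the X-vector size; k = 256 is illustrative
R = gaussian_projection_matrix(d, k, rng)

# Toy stand-ins for two X-vectors (real ones come from the TDNN extractor).
x1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(d)]

ratio = squared_distance(project(x1, R), project(x2, R)) / squared_distance(x1, x2)
print(f"projected / original squared distance: {ratio:.3f}")
```

The $\frac{1}{k}$ variance is what keeps the expected squared distance unchanged, so the printed ratio should sit close to 1, with a spread that shrinks as `k` grows.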

4. Proposed Method

4.1. The Anonymization Framework

Figure 1 shows the framework of the proposed speaker anonymization system. Following the architecture of B1.b from VPC 2022 [18], the system workflow consists of three main stages: (1) Feature extraction: X-vector, F0, and BN features are extracted from the input speech waveform. (2) Modified X-vector anonymization: The anonymized X-vector is generated using RP and GMM based on an external speaker X-vector pool. (3) Speech synthesis: The anonymized X-vector, original F0, and BN features are combined to synthesize the anonymized speech waveform with Hifi-GAN [44].

As shown in Figure 1, the system employs three pre-trained models: ASR AM, X-vector extractor, and Hifi-GAN. Training details for these models can be found in prior work [18,27].

4.2. The Anonymization Method

This paper focuses on improving the pseudo-X-vector generation method of the baseline system [17,18] through a three-step process. First, we apply random projection (RP) to the 512-dimensional X-vector pool, reducing its dimensionality while preserving essential features. Second, we fit a Gaussian Mixture Model (GMM) in the RP-reduced space and sample pseudo-X-vectors from it. This approach preserves the diversity of the original embedding space while avoiding the entropy loss inherent in the direct averaging operation [45]. Third, we apply the inverse RP transformation to obtain a 512-dimensional pseudo-X-vector. The later stages of the anonymization pipeline follow the baseline system [18]: BN features are combined with F0 and pseudo-X-vectors for speech synthesis through the Hifi-GAN model. Both the training procedures and model reuse protocols remain consistent with the baseline system.

4.3. GMM Entropy Estimation

To optimize the parameters for the proposed RP + GMM anonymization method, we use GMM entropy as an evaluation metric. Higher entropy indicates a more uniform and diverse distribution of the generated pseudo-speaker X-vectors, better approximating the true mixture distribution. For a GMM with $C$ components, the entropy $H(x)$ is defined as follows:

$$H(x) = -\int p(x) \log p(x) \, dx \tag{3}$$

where $p(x) = \sum_{k=1}^{C} \pi_{k} N(x \mid \mu_{k}, \Sigma_{k})$, with $x$ being the generated samples, and $\pi_{k}$, $\mu_{k}$, and $\Sigma_{k}$ representing the weight, mean, and covariance matrix of the $k$-th component, respectively.

Since analytically calculating GMM entropy is intractable in general, we use Monte Carlo integration [46] to approximate it. Specifically, we estimate the entropy by averaging the log-likelihood values of pseudo-X-vectors sampled from the GMM according to Equation (4),

$$H(x) \approx -\frac{1}{N} \sum_{i=1}^{N} \log\left(\sum_{k=1}^{C} \pi_{k} N(x_{i} \mid \mu_{k}, \Sigma_{k})\right), \tag{4}$$

where $N$ is the number of samples and $C$ is the number of Gaussian components; $x_{i}$ denotes the $i$-th pseudo-speaker X-vector (sampled from the GMM); $N(x_{i} \mid \mu_{k}, \Sigma_{k})$ represents the probability density function of the $k$-th Gaussian component, with mean $\mu_{k}$ and covariance matrix $\Sigma_{k}$; and $\pi_{k}$ is the mixture weight of the $k$-th Gaussian component.
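The Monte Carlo estimate in Equation (4) can be sketched in a few lines of standard-library Python. The example below assumes diagonal covariances and uses a tiny three-dimensional, single-component mixture whose parameters are made up for illustration; in that special case the entropy has the closed form $\frac{1}{2}\sum_{j}\log(2\pi e\,\sigma_{j}^{2})$, which gives a reference value to compare against. All function names here are hypothetical helpers, not part of any published implementation.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_density(x, weights, means, variances):
    """log sum_k pi_k N(x | mu_k, Sigma_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_gmm(weights, means, variances, rng):
    """Pick a component by weight, then draw from its diagonal Gaussian."""
    c = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[c], variances[c])]


def mc_entropy(weights, means, variances, n_samples, rng):
    """Equation (4): minus the average log density over GMM samples."""
    total = 0.0
    for _ in range(n_samples):
        x = sample_gmm(weights, means, variances, rng)
        total += gmm_log_density(x, weights, means, variances)
    return -total / n_samples


rng = random.Random(0)
# Illustrative single-component mixture (not fitted to any real data).
weights, means, variances = [1.0], [[0.0, 0.0, 0.0]], [[1.0, 4.0, 0.25]]
estimate = mc_entropy(weights, means, variances, 1000, rng)
exact = 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in variances[0])
print(f"Monte Carlo: {estimate:.3f}  closed form: {exact:.3f}")
```

With more than one component no closed form is available, which is why the sampling estimate is needed in the general case.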

4.4. Similarity Check

The goal of speaker anonymization is to make the speech recordings unlinkable to their original speaker while ensuring they remain distinguishable from the corresponding original speaker. However, pseudo-X-vectors sampled from the GMM may occasionally remain close to the original X-vectors. To mitigate this, we introduce a similarity check (SC) between the pseudo-X-vector $X_{pseu}$ and the original X-vector $X_{orig}$. Given a threshold $\theta$, a pseudo-X-vector is accepted only if

$$\cos(X_{pseu}, X_{orig}) < \theta \tag{5}$$

where $\cos(\cdot)$ denotes cosine similarity. If the cosine similarity is not below $\theta$, the pseudo-X-vector is discarded and the sampling process is repeated until the condition in Equation (5) is met. The computational overhead of SC is negligible, as it only requires resampling from the GMM.

The selection of $\theta$ is centered on balancing privacy protection and speech utility in voice anonymization scenarios. Through multiple rounds of comparative experiments and iterative analysis, we set $\theta$ to 0.7 in our study.
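The similarity check amounts to rejection sampling against Equation (5). The sketch below is one possible way to express it, assuming a caller-supplied `sampler` callable that returns a fresh pseudo-X-vector on each call. The `max_tries` cap is an added safeguard that the paper does not describe (it simply repeats until the condition holds), and the 16-dimensional toy vectors are for illustration only.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def sample_with_similarity_check(x_orig, sampler, theta=0.7, max_tries=1000):
    """Resample until cos(x_pseu, x_orig) < theta, as in Equation (5).

    `sampler` is a hypothetical zero-argument callable; `max_tries` is an
    assumption added here so the loop cannot run forever.
    """
    for _ in range(max_tries):
        x_pseu = sampler()
        if cosine_similarity(x_pseu, x_orig) < theta:
            return x_pseu
    raise RuntimeError("no sample passed the similarity check")


rng = random.Random(0)
x_orig = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy original embedding
x_pseu = sample_with_similarity_check(
    x_orig, lambda: [rng.gauss(0.0, 1.0) for _ in range(16)]
)
print(f"accepted with cosine {cosine_similarity(x_pseu, x_orig):.3f}")
```

Because each rejected draw only costs one more sample and one dot product, the check adds little work when most samples already fall below the threshold.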

4.5. Computational Complexity and Scalability Analysis

In terms of computational cost, the proposed pseudo-X-vector generation process introduces minimal overhead. Applying RP to the 512-dimensional X-vectors reduces their dimensionality to a tractable subspace with a computational complexity of $O(n \cdot d \cdot k)$, where $n$ represents the number of X-vectors, and $d$ and $k$ denote the original and reduced dimensionalities, respectively, with $k \ll d$. This step significantly decreases both the memory and computational requirements in the subsequent modeling stage. Sampling from the GMM in the reduced subspace entails linear complexity $O(n \cdot k \cdot C)$, where $C$ represents the number of Gaussian components. Meanwhile, the lightweight similarity check and conditional resampling process introduce only negligible additional overhead. The inverse RP transformation also adds minimal latency. Compared with previous X-vector generation methods [18,27,28], our approach eliminates the need to retrieve N candidate speakers from an external pool and average their vectors, thereby further streamlining the X-vector generation process.

Regarding memory efficiency, the proposed method maintains a compact representation. Unlike the previous methods [18,27,28], which relied on pre-stored external speaker pools, our RP + GMM + SC approach eliminates this dependency during pseudo-X-vector generation, thereby incurring no additional memory overhead for external speaker resources. Moreover, the method exhibits favorable scalability with the increasing number of target speakers, since dimensionality reduction decouples sampling complexity from the size of any external speaker pool.

5. Experiment Results and Analysis

5.1. Datasets

The datasets used in this study are derived from the benchmark library of the VPC 2022 [18]. As illustrated in Figure 1, the anonymization framework comprises three pre-trained modules, with their corresponding training data configurations detailed in Table 1.

The ASR AM is a TDNN-F model that uses 40 MFCCs and a 100-dimensional i-vector as input. It outputs 256-dimensional BN features. This ASR AM is trained on Librispeech train-clean-100 and train-other-500.

The X-vector extractor is a TDNN model that takes 30-dimensional MFCCs as input and produces a 512-dimensional speaker embedding (X-vector) as output. This extractor is trained using Voxceleb 1 and 2. The system assumes that these components decouple the speech content (BN and F0) from the speaker identity X-vector. Then, using a generation technique, the X-vector is modified to generate a pseudo X-vector representing the new identity of the speaker.

The Hifi-GAN model processes the modified X-vector, F0, and BN features to generate anonymized speech. This Hifi-GAN model is trained on the LibriTTS train-clean-100 subset.

The final evaluation of the anonymization system is conducted on the official VPC [18] development sets and test sets of LibriSpeech [48] and VCTK [52] corpora, following the splits specified in the reference [18].

5.2. Determining the Optimal Parameters

To evaluate the performance of pseudo-X-vector generation, we systematically analyze their distribution by adjusting two key parameters: (1) $\varepsilon$, the distance distortion parameter in RP, and (2) $C$, the number of components in the GMM. The experiment explored the values $\varepsilon \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$ and $C \in \{1, 3, 5, 7, 9\}$. A grid search strategy was used to find the optimal parameters, with GMM entropy as the evaluation criterion. A higher entropy value indicates greater diversity in the pseudo-X-vectors generated.

The evaluation setup was as follows. First, all X-vectors were extracted from the LibriTTS train-other-500 dataset (600 males/560 females). For each gender, a 50%/50% train/test split was performed to train RP + GMM. Then, 1000 pseudo-X-vectors were sampled from the GMM to estimate entropy using Equation (4). The inverse RP transform was then applied to obtain 512-dimensional pseudo-X-vectors. The GMM was designed to learn a diagonal covariance matrix in the RP-reduced feature space. The maximum number of iterations of the Expectation–Maximization (EM) algorithm was set to 1000, with a convergence tolerance of $10^{-15}$. If the EM algorithm failed to converge, the maximum number of iterations was scaled by a factor of 1.1 and the convergence tolerance was doubled.
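The retry rule for non-converging EM runs can be written as a small loop. This is only a sketch of the schedule: `fit_once` is a hypothetical callback standing in for a real GMM fit that reports whether it converged, the `max_retries` cap is an assumption not mentioned in the paper, and the 1.1 factor is read here as multiplying the iteration budget.

```python
from typing import Callable, Tuple


def fit_with_retries(
    fit_once: Callable[[int, float], bool],
    max_iter: int = 1000,
    tol: float = 1e-15,
    max_retries: int = 50,
) -> Tuple[int, float]:
    """Relax the EM settings after each failed fit.

    Starts from 1000 iterations and a tolerance of 1e-15; after every
    failure the iteration budget is scaled by 1.1 and the tolerance is
    doubled. Returns the settings that finally converged.
    """
    for _ in range(max_retries + 1):
        if fit_once(max_iter, tol):
            return max_iter, tol
        max_iter = round(max_iter * 1.1)
        tol *= 2.0
    raise RuntimeError("EM did not converge within the retry budget")


def fake_fit(max_iter: int, tol: float) -> bool:
    """Stand-in for a real EM fit: pretends to converge once tol >= 3e-15."""
    return tol >= 3e-15


final_iter, final_tol = fit_with_retries(fake_fit)
print(final_iter, final_tol)
```

In practice `fit_once` would wrap whichever GMM implementation is in use and return its convergence flag.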

As shown in Figure 2, for a fixed $\varepsilon$, the entropy monotonically decreases as $C$ increases. Similarly, Figure 3 shows that with a fixed $C$, the entropy decreases as $\varepsilon$ increases. Since the entropy decreases with increasing $\varepsilon$ and $C$, this paper selects the configuration with the maximum entropy, $\varepsilon = 0.5$ and $C = 1$. Note that the enrollment utterances and the trial utterances are synthesized using the same random projection matrix and GMM.

5.3. Primary Objective Evaluation Metrics

Speaker anonymization seeks to protect speaker identity information and preserve speech intelligibility. Consequently, the evaluation of the anonymization system requires metrics that address both privacy and utility.

For the objective evaluation of anonymization performance, we employed two pre-trained systems: (1) ASVeval, an automatic speaker verification (ASV) system to assess speaker identity protection strength; (2) ASReval, an automatic speech recognition (ASR) system to evaluate content preservation quality. Both systems were developed using the Kaldi toolkit [53] and trained on a subset of LibriSpeech train-clean-360 under the VPC official settings [17,18].

Attackers targeting specific speakers may access one or more utterances (referred to as enrollment utterances), which could be original or anonymized. In detail, the utterances are anonymized at the speaker level. The ASVeval examines three scenarios [26]: (1) Unprotected scenario (o-o/Orig): No anonymization applied, and the attackers access both original trial and enrollment utterances. (2) Ignorant attack scenario (o-a): Users anonymize their trial utterances, but attackers remain unaware and attempt verification using original enrollment utterances against anonymized trial utterances. (3) Lazy informed attack scenario (a-a): Both enrollment and trial utterances are anonymized using the same system but generate different pseudo-speakers.

This paper presents an extensive and objective evaluation of the proposed anonymization system (RP + GMM) using the VPC framework with extended metrics [18]. To assess anonymity, we conduct a comparative analysis of outputs before and after applying the similarity check (RP + GMM + SC). Furthermore, we validate the system’s effectiveness by comparing it with several baseline systems (as shown in Table 2), including the official VPC benchmarks (B1, B2) and other state-of-the-art approaches.

5.3.1. Privacy Evaluation

For an objective privacy evaluation, ASVeval is based on X-vector speaker embeddings and probabilistic linear discriminant analysis (PLDA). The Equal Error Rate (EER) and the log-likelihood ratio cost function $C_{llr}$ are used as the principal metrics. The EER corresponds to the threshold $\theta_{EER}$ at which the false acceptance rate ($P_{fa}$) equals the false rejection (miss) rate ($P_{miss}$), that is, $EER = P_{fa}(\theta_{EER}) = P_{miss}(\theta_{EER})$ [54]. For each pair of enrollment and trial X-vectors, the PLDA classifier computes a log-likelihood ratio (LLR) score, which is then compared against a decision threshold to determine whether they come from the same speaker or different speakers [55]. Following [55], $C_{llr}$ is derived from the PLDA scores and can be decomposed into a discrimination loss ($C_{llr}^{min}$) and a calibration loss ($C_{llr} - C_{llr}^{min}$). Here, $C_{llr}^{min}$ is obtained via optimal calibration using a monotonic transformation of the scores to their empirical LLR values. Both EER and $C_{llr}^{min}$ serve as robust measures of classifier discrimination [56].
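To make the two metrics concrete, the sketch below computes an approximate EER by sweeping thresholds over the observed scores, and the standard $C_{llr}$ for scores that are already natural-log likelihood ratios. It is a simplified illustration, not the Kaldi-based ASVeval pipeline: the toy scores are invented, and $C_{llr}^{min}$ is not shown because it additionally requires the monotonic recalibration step mentioned above.

```python
import math


def eer(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # one threshold above every score
    best_gap, best_rate = float("inf"), 0.5
    for t in thresholds:
        # A trial is accepted when its score is at least the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_fa - p_miss)
        if gap < best_gap:
            best_gap, best_rate = gap, (p_fa + p_miss) / 2.0
    return best_rate


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits, for natural-log LLR scores."""
    tar = sum(math.log2(1.0 + math.exp(-s)) for s in target_llrs) / len(target_llrs)
    non = sum(math.log2(1.0 + math.exp(s)) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (tar + non)


# Toy scores: a perfectly separable case, a fully overlapping case, and
# LLRs of zero that carry no information about the speaker.
separable = eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
overlapping = eer([0.0, 1.0], [0.0, 1.0])
uninformative = cllr([0.0, 0.0], [0.0, 0.0])
print(separable, overlapping, uninformative)
```

From the attacker's side, an EER near 50% and a cost near 1 bit both correspond to chance-level verification, which is why higher values of these metrics are read as better privacy.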

This paper examines two attack scenarios: the ignorant (o-a) and lazy informed (a-a) attack models. Table 3 shows the evaluation results of EER and $C_{llr}^{min}$ under these two attack models. Higher EER and $C_{llr}^{min}$ values correspond to better privacy. The parameter values follow Section 5.2, specifically $\varepsilon = 0.5$ and $C = 1$. By combining the EER and $C_{llr}^{min}$ metrics, the effects of the similarity check and of the attack models are analyzed as follows.

(1): Impact of similarity check. After introducing SC, the average EER for the o-a scenario increased by 0.77% (from 48.35% to 49.12%) and the average $C_{l l r}^{min}$ increased by 0.012 (from 0.975 to 0.983). This indicates that SC may slightly strengthen privacy protection for the original enrollment utterances. For the a-a scenario, the average EER rose by 1.29% (from 45.54% to 46.83%), and $C_{l l r}^{min}$ increased by 0.002 (from 0.953 to 0.955). This suggests that SC enhances the privacy protection of anonymized enrollment utterances. One reason for this could be that SC further obscures identity features through thresholding, making it harder for the ASV system to identify speakers.
(2): Comparison across attack models. In the o-a scenario, the EER before and after introducing SC was 48.35% and 49.12%, with corresponding $C_{llr}^{min}$ values of 0.975 and 0.983. In the a-a scenario, EER increased from 45.54% to 46.83% after adding SC, while $C_{llr}^{min}$ increased slightly from 0.953 to 0.956. In other words, without SC, the gaps in EER and $C_{llr}^{min}$ between the two attack scenarios were 2.81 percentage points and 0.021, respectively (with o-a higher than a-a). After introducing SC, the EER gap narrowed to 2.29 percentage points, while the $C_{llr}^{min}$ gap widened to 0.027, with o-a remaining slightly higher. This indicates that RP + GMM + SC remains robust across the two attack scenarios.
(3): A gender-focused privacy analysis. From a gender perspective, the introduction of SC has a significant gender-differentiated impact on speaker privacy protection performance. Specifically in the o-a scenario, the EER for female speakers before and after adding SC was 50.23% and 50.02%, respectively, with a corresponding $C_{l l r}^{min}$ of 0.983 and 0.981, indicating a stable privacy protection performance that outperforms that of male speakers. In contrast, in the a-a scenario, the EER for male speakers increased from 46.83% to 50.51% after introducing SC, while $C_{l l r}^{min}$ improved from 0.954 to 0.973, demonstrating a more pronounced privacy enhancement effect. Overall, the results suggest that female speakers exhibit a relative advantage in ignorant attacks, while male speakers demonstrate greater robustness in lazy informed attacks. The underlying reason can be traced to gender-based differences in acoustic characteristics. For female speakers, whose speech exhibits higher fundamental frequency and more dispersed formants, anonymization sharply reduces the similarity between the original and anonymized speech, yielding a high EER (around 50%) and consistently stable performance even after introducing SC. In contrast, male speech—characterized by lower fundamental frequency, more concentrated formants, and smaller inter-speaker variability—tends to converge toward a shared “average male” feature subspace during anonymization, resulting in higher similarity across anonymized utterances. SC effectively counteracts this convergence, promoting feature dispersion and thus enhancing both EER and $C_{l l r}^{min}$ for male speakers.

These results demonstrate that the proposed method effectively conceals speaker identity while maintaining system stability. The experimental results deviate only marginally from the optimal theoretical values (EER = 50%; $C_{llr}^{min} = 1$), with the average EER and $C_{llr}^{min}$ reaching 47.46% and 0.97, respectively. This validates the robustness and privacy-preserving capability of the proposed methods in both the o-a and a-a attack scenarios.

For further validation, we compare the proposed method with those listed in Table 2. Figure 4 and Figure 5 present a comparison of the average EER and the average $C_{llr}^{min}$ across all eight systems. The average EER was computed by averaging the three EERs obtained on the development and test sets; the average $C_{llr}^{min}$ was computed in the same way. As shown in Figure 4, RP + GMM performs well in the o-a scenario on the development sets, with its EER ranking just below the top systems (B1, A1, and O1C1), and it is more effective on the test sets, where it is comparable to the other systems. In the a-a scenario, both proposed systems outperform the other systems, with RP + GMM + SC performing best. The $C_{llr}^{min}$ results in Figure 5 show a similar trend, further validating the generalization performance of the proposed systems.

5.3.2. Utility Evaluation

To assess the linguistic preservation capability and practical usability of anonymized speech, we employed the ASReval system for anonymization evaluation. Specifically, it decodes original and anonymized trial utterances into text transcripts and then computes the Word Error Rate (WER) as the primary metric for speech usability, defined as follows:

$$WER = \frac{N_{sub} + N_{del} + N_{ins}}{N_{ref}} \qquad (6)$$

where $N_{sub}$, $N_{del}$, and $N_{ins}$ denote the numbers of substitution, deletion, and insertion errors, respectively, and $N_{ref}$ is the total number of words in the reference.
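Equation (6) can be illustrated with a short Python sketch. It assumes whitespace tokenization and uses the word-level Levenshtein distance, whose minimum edit count equals $N_{sub} + N_{del} + N_{ins}$ for the best alignment; the function name `word_error_rate` and the example sentences are illustrative and do not reproduce the ASReval pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: one substitution ("sat" -> "sit") and one deletion ("the").
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {100 * wer:.2f}%")
```

In the example, one substitution and one deletion against six reference words give a WER of 2/6. Because insertions are counted, the WER can exceed 100% when the hypothesis is much longer than the reference.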

As shown in Table 4, the average WER of the two proposed systems increases by approximately 3% compared to that of the original speech without anonymization, which aligns with expectations. Notably, adding SC (from RP + GMM to RP + GMM + SC) has a negligible effect on WER: changes are minimal (e.g., 5.72% to 5.89% for Libri_dev; 14.64% to 14.63% for VCTK_test). This demonstrates that SC enhances speaker privacy (as shown in the preceding ASVeval results) without significantly compromising speech recognition performance. Dataset-wise, Librispeech exhibits stronger recognition robustness: its original WER is much lower, and the increase after anonymization is mild (around a 2% rise). In contrast, VCTK speech has a higher WER, and anonymization leads to more noticeable (but still acceptable) recognition degradation. Additionally, WER trends are consistent between the development and test sets, indicating stable experimental results. Overall, the proposed anonymization methods strike a balance between privacy protection and speech content usability, with SC being a privacy-enhancing component that barely affects recognition performance.

5.3.3. Privacy–Utility Trade-Off

Figure 6 illustrates the privacy–utility trade-off (EER vs. WER) across the development and test sets under both attack scenarios. The first row displays the EER and WER results for the development set in the two scenarios, while the second row shows the corresponding results for the test set. As seen across all subfigures, the original data (Orig) shows the best utility (lowest WER) but the weakest privacy protection (lowest EER). All anonymization methods significantly enhance privacy at the cost of moderate speech recognition degradation. Among the anonymization methods, B1 achieves the best privacy–utility balance in the o-a scenario on the development and test sets, though its privacy protection notably degrades in the a-a scenario. Our proposed systems achieve near-optimal EER values (approaching 50%) while maintaining WER within an acceptable range (typically below 11%), indicating effective protection of speaker identity without severely degrading speech intelligibility. Notably, RP + GMM + SC occupies a favorable position in the privacy–utility landscape, outperforming several systems by offering stronger anonymity at comparable or lower WER. These results confirm that disentangling speaker identity from speech content, coupled with probabilistic sampling in a reduced dimensional space, enables robust and adjustable anonymization suitable for varying privacy–utility requirements in practical applications.

5.4. Additional Objective Evaluation Metrics

This section evaluates additional metrics proposed during the VPC, including the ZEBRA framework, linkability, voice distinctiveness, and de-identification.

5.4.1. The ZEBRA Framework and Linkability

ZEBRA framework. The ZEBRA framework serves as a robust quantitative model for privacy assessment without requiring pre-registered biometric data [56]. This framework incorporates two principal metrics: the expected privacy disclosure at the population level, $D_{ECE}$, and the worst-case privacy disclosure for an individual, $\log_{10}(l)$ (referred to as $\log(l)$). $D_{ECE}$, measured in bits, estimates the amount of private information that attackers could access, independently of their prior knowledge. $D_{ECE}$ values range from 0 (indicating perfect privacy protection) to 0.721 (representing complete vulnerability). $\log(l)$, where $l$ stands for the log-likelihood ratio, is measured in decibels (dB), with lower values corresponding to stronger privacy protection and 0 dB signifying absolute privacy preservation.

The experimental results for $D_{ECE}$ and $\log(l)$ are shown in Table 5. $D_{ECE}$ values remain consistently below 0.06, confirming strong privacy protection. The $\log(l)$ results further indicate strong privacy performance, with Grade A outcomes outnumbering Grade B, particularly in the o-a scenario. Notably, RP + GMM + SC achieves more Grade A outcomes than RP + GMM, that is, it offers better privacy performance. The $\log(l)$ results also show that privacy protection varies among individuals, mainly due to the randomness in the GMM sampling. Overall, the systems show strong privacy preservation capabilities.

Furthermore, we conduct comparative analyses of the $D_{ECE}$ and $\log(l)$ results across different anonymization systems, with detailed comparisons presented in Figure 7 and Figure 8, respectively. The results demonstrate that B2 has significantly higher $D_{ECE}$ and $\log(l)$ values than all other anonymization systems, indicating relatively weaker privacy protection. In the o-a scenario, the remaining systems show minimal differences in their $D_{ECE}$ values, while in the a-a scenario, our two anonymization systems demonstrate superior $D_{ECE}$, with RP + GMM + SC showing particularly strong results. The $\log(l)$ results in Figure 8 exhibit comparative patterns similar to those observed in Figure 7.

Linkability. Linkability serves as a metric to evaluate the distributional differences between matching and non-matching scores. The global linkability $D_{\leftrightarrow}^{sys}$ is a robust privacy measurement [57] that quantifies the degree of linkability, where lower values indicate better privacy protection. This metric is calculated by averaging over all matching scores and ranges from zero to one, where the optimal value of zero signifies complete global identity unlinkability. Detailed information on linkability can be found in [57].
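As a rough sketch of how such a global linkability value can be estimated, the following Python code uses one common formulation of this kind of measure: a local linkability value is derived from the mated (same-speaker) and non-mated score densities and is then averaged over the mated score distribution. The histogram density estimate, the number of bins, the prior ratio `omega`, and the function name `global_linkability` are assumptions made for illustration and are not details taken from the evaluation setup of this paper.

```python
import numpy as np


def global_linkability(mated_scores, non_mated_scores, omega=1.0, n_bins=50):
    """Histogram-based estimate of a global linkability value in [0, 1]."""
    mated = np.asarray(mated_scores, dtype=float)
    non_mated = np.asarray(non_mated_scores, dtype=float)

    # Shared bin edges so that both densities live on the same score grid.
    lo = min(mated.min(), non_mated.min())
    hi = max(mated.max(), non_mated.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p_mated, _ = np.histogram(mated, bins=edges, density=True)
    p_non, _ = np.histogram(non_mated, bins=edges, density=True)

    # With lr = p_mated / p_non, the local measure 2*omega*lr/(1 + omega*lr) - 1
    # equals (omega*p_mated - p_non) / (omega*p_mated + p_non); this form
    # avoids dividing by an empty non-mated bin.
    numerator = omega * p_mated - p_non
    denominator = omega * p_mated + p_non
    safe_denominator = np.where(denominator > 0, denominator, 1.0)
    local = np.where(denominator > 0, numerator / safe_denominator, 0.0)
    # Scores that favor the non-mated hypothesis contribute no linkability.
    local = np.clip(local, 0.0, 1.0)

    # Average the local measure over the mated score distribution.
    bin_width = edges[1] - edges[0]
    return float(np.sum(p_mated * local * bin_width))


if __name__ == "__main__":
    # Synthetic scores, invented purely for demonstration.
    rng = np.random.default_rng(0)
    mated = rng.normal(0.5, 1.0, 5000)
    non_mated = rng.normal(0.0, 1.0, 5000)
    print(f"estimated global linkability: {global_linkability(mated, non_mated):.3f}")
```

Under this sketch, identical mated and non-mated distributions yield a value of zero (fully unlinkable), while completely separated distributions yield a value of one.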

The evaluation results of $D_{\leftrightarrow}^{sys}$ are also presented in Table 5. The $D_{\leftrightarrow}^{sys}$ values in the o-a scenario are generally lower than those in the a-a scenario, with most being around 0.1, reflecting stronger unlinkability.

Figure 9 provides an extensive comparison of linkability across anonymization systems. The results reveal that in the o-a scenario, all systems except the underperforming B2 show comparable linkability. Our two proposed systems (particularly RP + GMM + SC) demonstrate significant advantages with optimal linkability effects. From the perspective of an attack scenario, all systems demonstrate superior unlinkability performance in the o-a scenario compared to the a-a scenario, indicating more effective prevention of speaker identity linkage between original and anonymized utterances.

5.4.2. Gain of Voice Distinctiveness and De-Identification

Gain of voice distinctiveness ($G_{VD}$) and de-identification (DeID) measure the levels of voice distinguishability and identity concealment before and after anonymization, respectively. Voice distinguishability requires that all utterances from the same speaker be converted to the same pseudo-speaker after anonymization, effectively preventing speaker confusion during multi-party conversations [57]. De-identification, also known as speaker concealment or voice disguise, aims to make it impossible to trace anonymized speech back to the original speakers. Both $G_{VD}$ and DeID are calculated based on the voice similarity matrix [58,59]. $G_{VD} = 0$ indicates that the distinguishability of pseudo-speakers is unchanged after anonymization, $G_{VD} > 0$ indicates enhanced distinguishability (easier voice differentiation), and $G_{VD} < 0$ indicates weakened distinguishability (increased voice similarity); DeID ranges from 0% (no de-identification) to 100% (complete de-identification).
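For illustration, the sketch below computes both metrics from voice similarity matrices using the diagonal-dominance formulation commonly used in the VoicePrivacy evaluations; this formulation is assumed here rather than restated from the present paper. `m_oo`, `m_oa`, and `m_aa` are hypothetical names for the similarity matrices computed between original and original, original and anonymized, and anonymized and anonymized speech, respectively, and the toy matrices are invented for the example.

```python
import numpy as np


def diagonal_dominance(similarity):
    """|mean of diagonal entries - mean of off-diagonal entries| of a square matrix."""
    m = np.asarray(similarity, dtype=float)
    n = m.shape[0]
    diagonal_mean = np.trace(m) / n
    off_diagonal_mean = (m.sum() - np.trace(m)) / (n * (n - 1))
    return abs(diagonal_mean - off_diagonal_mean)


def deid_percent(m_oo, m_oa):
    """De-identification in percent: 100% means no residual diagonal dominance."""
    return 100.0 * (1.0 - diagonal_dominance(m_oa) / diagonal_dominance(m_oo))


def gain_vd(m_oo, m_aa):
    """Gain of voice distinctiveness: 10 * log10 of the diagonal-dominance ratio."""
    return 10.0 * np.log10(diagonal_dominance(m_aa) / diagonal_dominance(m_oo))


if __name__ == "__main__":
    # Toy 2-speaker matrices (assumed values, not experimental data).
    m_oo = [[0.9, 0.1], [0.1, 0.9]]  # originals are clearly distinguishable
    m_oa = [[0.5, 0.5], [0.5, 0.5]]  # anonymized voices unrelated to originals
    m_aa = [[0.5, 0.1], [0.1, 0.5]]  # pseudo-speakers less distinct than originals
    print(f"DeID = {deid_percent(m_oo, m_oa):.1f}%")
    print(f"G_VD = {gain_vd(m_oo, m_aa):.2f}")
```

With these toy matrices, DeID is 100% because the original-to-anonymized matrix has no diagonal dominance, and $G_{VD}$ is negative because the diagonal dominance of the anonymized matrix is half that of the original one.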

Table 6 presents the evaluation results for $G_{VD}$ and DeID. In terms of voice distinctiveness, significant improvement is observed across most datasets, with $G_{VD}$ reaching a peak value of 0.51. Female speakers consistently achieve a higher $G_{VD}$ than male speakers on the VCTK datasets, while the opposite trend holds for the LibriSpeech datasets. These gender-based differences in $G_{VD}$ across datasets can be largely attributed to the acoustic characteristics of the datasets. The VCTK corpus contains clean, controlled studio recordings with consistent acoustic features, while LibriSpeech includes varied real-world audiobook recordings with more complex acoustic conditions. Regarding de-identification effectiveness, all anonymization systems achieve similarly high performance, with most DeID values around 98%.

Figure 10 and Figure 11 present comparative evaluations of $G_{VD}$ and DeID performance across anonymization systems, respectively. Figure 10 clearly shows that our proposed method excels in voice distinctiveness, with RP + GMM + SC achieving even better results than RP + GMM. The main reason is that the proposed systems do not introduce additional identity information from other speakers. In Figure 11, the DeID results show that, except for the slightly weaker performance of B2, all other systems perform similarly to B1. These results further validate the effectiveness of our proposed method in maintaining speech distinguishability while successfully concealing speaker identity.

5.5. Further Performance Comparison of the Latest Speaker Anonymization Methods

5.5.1. Objective Privacy and Utility Comparison

To further validate our proposed methods, we include the most recent speaker anonymization approaches and evaluate their privacy and utility under the a-a scenario on the Librispeech and VCTK datasets, where both ASVeval and ASReval settings align with those used in our work. The comparison shown in Table 7 focuses on three critical metrics: EER, WER, and $G_{VD}$.

As shown in Table 7, among the anonymization methods, MUSA achieves the highest EER (48.29%) and the lowest WER (9.895%), demonstrating exceptional performance in both privacy and speech quality, but it shows a strong degradation in voice distinguishability, with a $G_{VD}$ of –2.78. Rano offers competitive privacy (EER = 47.81%) but at the cost of increased WER (11.91%); its moderately positive $G_{VD}$ (0.39) indicates effective discrimination of pseudo-speaker identities in the pseudo-speaker space. SALT yields a comparable EER of around 45%, with a slightly better WER (10.77%) and minimal $G_{VD}$ change (–0.03), indicating limited distinguishability. Our proposed RP + GMM method improves upon SALT with a modestly increased EER (45.54%) and reduced WER (10.46%), accompanied by a small positive $G_{VD}$ (0.10), showing a controlled modification of distinguishability. The enhanced version, RP + GMM + SC, further boosts EER to 46.83% and lowers WER to 10.26%, with a slightly higher $G_{VD}$ (0.12), confirming that SC effectively enhances both privacy and voice distinctiveness without sacrificing speech quality.

In summary, while MUSA excels in privacy protection and speech fidelity, its negative $G_{VD}$ limits its applicability in interactive, multi-speaker environments. Rano improves distinguishability but sacrifices speech quality. In contrast, our proposed RP + GMM + SC method balances all three core objectives: it provides strong privacy (its EER is close to those achieved by Rano and MUSA), preserves high speech intelligibility (its WER is lower than those achieved by Rano and SALT), and, crucially, sustains a positive $G_{VD}$. Notably, although MUSA attains marginally better speech utility, our method avoids the severe negative impact on $G_{VD}$ observed in MUSA, a critical limitation that impedes its real-world deployability. By ensuring that each original speaker is consistently and reliably transformed into a distinguishable pseudo-speaker, our approach directly addresses a key practical shortcoming of existing state-of-the-art methods. This ability to maintain $G_{VD}$ while upholding privacy and speech clarity is essential for interactive, multi-speaker environments, where pseudo-speaker distinctiveness is indispensable for natural communication and auditory coherence. Consequently, our method offers a practical and competitive solution for speaker anonymization, particularly suited to real-world applications that demand the simultaneous fulfillment of privacy, utility, and distinguishability.

5.5.2. Subjective Naturalness and Intelligibility Evaluation

Objective evaluation metrics may not always accurately reflect the performance of speech models, given the complexity and sensitivity of the human auditory system. Subjective evaluation addresses this gap and remains as important as objective assessment. In speech-related tasks, the Mean Opinion Score (MOS) is a widely adopted subjective evaluation method. For speaker anonymization, naturalness and intelligibility are commonly assessed through human evaluation. Naturalness measures the degree to which processed speech sounds natural, with particular attention paid to any artifacts or degradation introduced by the anonymization tools. Intelligibility evaluates how clearly and easily the speech content can be understood.

In practice, listeners rate the naturalness of anonymized speech on a scale from one (completely unnatural) to five (completely natural). They also assess the intelligibility of individual samples—whether anonymized test utterances or original enrollment utterances—using the same five-point scale, where one denotes speech that is completely unintelligible and five denotes speech that is completely intelligible. Higher scores for naturalness and intelligibility correspond to better perceived speech quality.

In the experiment, ten original speech samples were selected. Each sample was anonymized using the RP + GMM method and four baseline methods, resulting in a total of fifty anonymized speech samples. Each participant was required to listen to a single speech sample and rate its naturalness and intelligibility on a five-point scale (1 to 5). The evaluation results are presented in Table 8.

As can be seen from Table 8, B1 performs the weakest in terms of both naturalness (3.15) and intelligibility (3.28). SALT and Rano show stable and similar performance across all metrics, while MUSA holds a slight advantage in naturalness (3.79). In contrast, the proposed RP + GMM + SC method achieves the best intelligibility score (3.85), with naturalness (3.76) also close to the highest level, demonstrating its overall strong performance and particularly notable advantages in terms of preserving speech clarity.

6. Conclusions

In this paper, a novel distinguishability-driven voice generation for speaker anonymization via random projection and GMM is proposed. Random projection reduces the dimensionality of X-vectors from an external speaker pool, effectively preserving their topological structure while significantly decreasing computational complexity. By independently generating random projection matrices, the proposed method eliminates reliance on an external speaker pool, thereby enhancing both flexibility and security of the anonymization framework. GMM is used as the generative model due to its superior distribution generalization capability, with model parameters optimized through grid search and GMM entropy evaluation. A random sampling strategy is implemented to prevent concentration in anonymized spatial distributions and overcome limitations in the quantity of generated anonymous voices. Additionally, we introduce a similarity check mechanism that effectively protects privacy while ensuring sufficient distinguishability between the original and anonymized speech. The experimental results demonstrate that the proposed method effectively balances privacy preservation and speech utility, while significantly enhancing speaker distinguishability between the original speakers and their corresponding pseudo-speakers, as well as among different pseudo-speakers.

The proposed anonymization system can be integrated into real-world speech systems, such as voice assistants, telephony, and smart devices, by replacing user voice identity features in real time on the local device or at the edge. This approach not only meets the requirements of privacy regulations, but also preserves speech content intelligibility and speaker distinguishability. With a generative architecture that has low external dependency and is robust against enhanced attacks, the system provides a viable and trustworthy technical pathway for natural speech interaction in highly privacy-sensitive scenarios, significantly enhancing the practical applicability of speaker anonymization technology.

Future research will focus on optimizing voice generation quality, including reducing WER and preserving prosodic information. The proposed method will also be extended to disentanglement-based speaker anonymization systems, supporting a broader range of downstream tasks. These improvements aim to enhance the practical utility of the anonymization framework while maintaining its strong privacy protection capability.

Author Contributions

C.W.: Writing—original draft, writing—review and editing, methodology, and investigation. Q.Z.: Writing—review and editing, resources, investigation, and supervision. Y.H.: Validation and Investigation. H.W.: Validation and investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61862041).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are not publicly available due to privacy and ethical restrictions. The dataset contains sensitive information that cannot be shared openly to protect participant confidentiality.

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and suggestions on this paper. The authors are also very grateful to those who provided the benchmark datasets.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11, 1–64. [Google Scholar] [CrossRef]
Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
Mehrish, A.; Majumder, N.; Bharadwaj, R.; Mihalcea, R.; Poria, S. A review of deep learning techniques for speech processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
Sinlapanuntakul, P.; Skilton, K.S.; Mathew, J.N.; Chaparro, B.S. The effects of background noise on user experience and performance of mixed reality voice dictation. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Atlanta, GA, USA, 10–14 October 2022; pp. 1028–1032. [Google Scholar]
Wang, L.; Sun, S. Dictating translations with automatic speech recognition: Effects on translators’ performance. Front. Psychol. 2023, 14, 1108898. [Google Scholar] [CrossRef] [PubMed]
Melumad, S. Vocalizing search: How voice technologies alter consumer search processes and satisfaction. J. Consum. Res. 2023, 50, 533–553. [Google Scholar] [CrossRef]
Feng, T.; Hebbar, R.; Mehlman, N.; Shi, X.; Kommineni, A.; Narayanan, S. A review of speech-centric trustworthy machine learning: Privacy, safety, and fairness. APSIPA Trans. Signal Inf. Process. 2023, 12, 1–74. [Google Scholar] [CrossRef]
Acosta, L.H.; Reinhardt, D. “Alexa, how do you protect my privacy?” A quantitative study of user preferences and requirements about smart speaker privacy settings. Comput. Secur. 2025, 151, 104302. [Google Scholar] [CrossRef]
Cheng, P.; Roedig, U. Personal voice assistant security and privacy—A survey. Proc. IEEE 2022, 110, 476–507. [Google Scholar] [CrossRef]
Zhang, S.; Li, Z.; Das, A. Voicepm: A robust privacy measurement on voice anonymity. In Proceedings of the Security and Privacy in Wireless and Mobile Networks, Guildford, UK, 29 May–1 June 2023; pp. 215–226. [Google Scholar]
Kadam, S.; Jikamade, A.; Mattoo, P.; Hole, V. ReVoice: A neural network based voice cloning system. In Proceedings of the IEEE 9th International Conference for Convergence in Technology, Pune, India, 5–7 August 2024; pp. 1–6. [Google Scholar] [CrossRef]
Genelza, G.G. A systematic literature review on AI voice cloning generator: A game-changer or a threat? J. Emerge Technol. 2024, 4, 54–61. [Google Scholar]
Bilika, D.; Michopoulou, N.; Alepis, E.; Patsakis, C. Hello me, meet the real me: Voice synthesis attacks on voice assistants. Comput. Secur. 2024, 137, 103617. [Google Scholar] [CrossRef]
Tan, X.; Chen, J.; Liu, H.; Cong, J.; Zhang, C.; Liu, Y.; Wang, X.; Leng, Y.; Yi, Y.; He, L.; et al. Naturalspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4234–4245. [Google Scholar] [CrossRef] [PubMed]
Protection, F.D. General Data Protection Regulation (GDPR). Intersoft Consulting. 2018, Volume 24. Available online: https://gdpr-info.eu/ (accessed on 8 December 2025).
Srivastava, B.M.L.; Maouche, M.; Sahidullah, M.; Vincent, E.; Bellet, A.; Tommasi, M.; Tomashenko, N.; Wang, X.; Yamagishi, J. Privacy and Utility of X-Vector Based Speaker Anonymization. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2383–2395. [Google Scholar] [CrossRef]
Tomashenko, N.; Srivastava, B.M.L.; Wang, X.; Vincent, E.; Nautsch, A.; Yamagishi, J.; Evans, N.; Patino, J.; Bonastre, J.F.; Noe, P.G.; et al. Introducing the VoicePrivacy Initiative. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 1693–1697. [Google Scholar] [CrossRef]
Tomashenko, N.A.; Wang, X.; Miao, X.; Nourtel, H.; Champion, P.; Todisco, M.; Vincent, E.; Evans, N.W.D.; Yamagishi, J.; Bonastre, J. The VoicePrivacy 2022 Challenge Evaluation Plan. arXiv 2022, arXiv:2203.12468. [Google Scholar] [CrossRef]
Tomashenko, N.A.; Miao, X.; Champion, P.; Meyer, S.; Wang, X.; Vincent, E.; Panariello, M.; Evans, N.W.D.; Yamagishi, J.; Todisco, M. The VoicePrivacy 2024 Challenge Evaluation Plan. arXiv 2024, arXiv:2404.02677. [Google Scholar] [CrossRef]
Patino, J.; Tomashenko, N.; Todisco, M.; Nautsch, A.; Evans, N. Speaker anonymisation using the McAdams coefficient. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 1099–1103. [Google Scholar] [CrossRef]
Gupta, P.; Prajapati, G.P.; Singh, S.; Kamble, M.R.; Patil, H.A. Design of Voice Privacy System using Linear Prediction. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Auckland, New Zealand, 7–10 December 2020; pp. 543–549. [Google Scholar]
Pohlhausen, J.; Nespoli, F.; Bitzer, J. Enhancing Speech Privacy with LPC Modifications. In Proceedings of the Security and Privacy in Speech Communication, Kos, Greece, 6 September 2024; pp. 80–85. [Google Scholar]
Yao, J.; Wang, Q.; Zhang, L.; Guo, P.; Liang, Y.; Xie, L. NWPU-ASLP System for the VoicePrivacy 2022 Challenge. arXiv 2022, arXiv:2209.11969. [Google Scholar] [CrossRef]
Lv, Y.; Yao, J.; Chen, P.; Zhou, H.; Lu, H.; Xie, L. Salt: Distinguishable Speaker Anonymization Through Latent Space Transformation. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Taipei, Taiwan, 16–20 December 2023; pp. 1–8. [Google Scholar] [CrossRef]
Yao, J.; Wang, Q.; Guo, P.; Ning, Z.; Xie, L. Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2944–2956. [Google Scholar] [CrossRef]
Tomashenko, N.A.; Wang, X.; Vincent, E.; Patino, J.; Srivastava, B.M.L.; Noé, P.; Nautsch, A.; Evans, N.; Yamagishi, J.; O’bRien, B.; et al. The VoicePrivacy 2020 Challenge: Results and findings. Comput. Speech Lang. 2022, 74, 1–41. [Google Scholar] [CrossRef]
Fang, F.; Wang, X.; Yamagishi, J.; Echizen, I.; Todisco, M.; Evans, N.W.D.; Bonastre, J.-F. Speaker Anonymization Using X-vector and Neural Waveform Models. In Proceedings of the ISCA Workshop on Speech Synthesis, Vienna, Austria, 20–22 September 2019; pp. 155–160. [Google Scholar] [CrossRef]
Srivastava, B.M.L.; Tomashenko, N.A.; Wang, X.; Vincent, E.; Yamagishi, J.; Maouche, M.; Bellet, A.; Tommas, M. Design Choices for X-Vector Based Speaker Anonymization. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 1713–1717. [Google Scholar] [CrossRef]
Turner, H.; Lovisotto, G.; Martinovic, I. Generating identities with mixture models for speaker anonymization. Comput. Speech Lang. 2022, 72, 1–15. [Google Scholar] [CrossRef]
Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 999–1003. [Google Scholar] [CrossRef]
Povey, D.; Cheng, G.; Wang, Y.; Li, K.; Xu, H.; Yarmohammadi, M.; Khudanpur, S. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3743–3747. [Google Scholar] [CrossRef]
Yao, J.; Kuzmin, N.; Wang, Q.; Guo, P.; Ning, Z.; Guo, D.; Lee, K.A.; Chng, E.S.; Xie, L. NPU-NTU System for Voice Privacy 2024 Challenge. In Proceedings of the Security and Privacy in Speech Communication, Kos, Greece, 6 September 2024; pp. 67–71. [Google Scholar] [CrossRef]
Panariello, M.; Tomashenko, N.A.; Wang, X.; Miao, X.; Champion, P.; Nourtel, H.; Todisco, M.; Evans, N.; Vincent, E.; Yamagishi, J. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3477–3491.
Mawalim, C.O.; Galajit, K.; Karnjana, J.; Unoki, M. X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 1703–1707.
Mawalim, C.O.; Galajit, K.; Karnjana, J.; Kidani, S.; Unoki, M. Speaker anonymization by modifying fundamental frequency and X-vector singular value. Comput. Speech Lang. 2022, 73, 1–17.
Meyer, S.; Tilli, P.; Denisov, P.; Lux, F.; Koch, J.; Vu, N.T. Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy. In Proceedings of the IEEE Spoken Language Technology Workshop, Doha, Qatar, 9–12 January 2022; pp. 912–919.
Perero-Codosero, J.M.; Espinoza-Cuadros, F.M.; Hernandez-Gomez, L.A. X-vector anonymization using autoencoders and adversarial training for preserving speech privacy. Comput. Speech Lang. 2022, 74, 1–13.
Miao, X.; Wang, X.; Cooper, E.; Yamagishi, J.; Tomashenko, N.A. Speaker Anonymization Using Orthogonal Householder Neural Network. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 3681–3695.
Chen, L.; Gu, W.; Lee, K.A.; Guo, W.; Ling, Z.H. Pseudo-Speaker Distribution Learning in Voice Anonymization. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 272–285.
Lee, K.A.; Liu, Z.; Chen, L.; Ling, Z.H. Pinhole Effect on Linkability and Dispersion in Speaker Anonymization. IEEE Signal Process. Lett. 2025, 32, 4144–4148.
Yahaya, Y.; Ajadi, J.O.; Sanusi, R.A.; Sawlan, Z.; Adegoke, N.A. Hybrid random projection technique for enhanced representation in high-dimensional data. Expert Syst. Appl. 2025, 262, 1–15.
Ruidong, W. On linear extension of 1-Lipschitz mapping from Hilbert space into a normed space. Acta Math. Sci. 2010, 30, 161–165.
Richardson, E.; Weiss, Y. On GANs and GMMs. In Proceedings of the Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; pp. 5852–5863.
Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033.
Abdelnaby, M.; Moussa, M.R. A Benchmarking Study of Random Projections and Principal Components for Dimensionality Reduction Strategies in Single Cell Analysis. In Proceedings of the Computational Advances in Bio and Medical Sciences, Atlanta, GA, USA, 12–14 January 2025; pp. 1–15.
Scrucca, L. Assessing uncertainty in Gaussian mixtures-based entropy estimation. Commun. Stat. Simul. Comput. 2025, 1–23.
Juvela, L.; Wang, X.; Takaki, S.; Kim, S.; Airaksinen, M.; Yamagishi, J. The NII speech synthesis entry for Blizzard Challenge 2016. In Proceedings of the Blizzard Challenge, Cupertino, CA, USA, 16 September 2016; pp. 17–22.
Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the Acoustics, Speech and Signal Processing, Brisbane, Australia, 19–24 April 2015; pp. 5206–5210.
Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 2616–2620.
Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 1086–1090.
Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R.J.; Jia, Y.; Chen, Z.; Wu, Y. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 1526–1530.
Yamagishi, J.; Veaux, C.; MacDonald, K. CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit (Version 0.92); The Centre for Speech Technology Research: Edinburgh, UK, 2019.
Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.K.; Hannemann, M.; Motlícek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi Speech Recognition Toolkit. In Proceedings of the Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011; Available online: https://api.semanticscholar.org/CorpusID:1774023 (accessed on 16 January 2026).
Provost, F.J.; Fawcett, T. Robust Classification for Imprecise Environments. Mach. Learn. 2001, 42, 203–231.
Brümmer, N.; du Preez, J.A. Application-independent evaluation of speaker detection. Comput. Speech Lang. 2006, 20, 230–275.
Nautsch, A.; Patino, J.; Tomashenko, N.A.; Yamagishi, J.; Noé, P.; Bonastre, J.; Todisco, M.; Evans, N. The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 1698–1702.
Maouche, M.; Srivastava, B.M.L.; Vauquier, N.; Bellet, A.; Tommasi, M.; Vincent, E. A Comparative Study of Speech Anonymization Metrics. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 1708–1712.
Noé, P.; Nautsch, A.; Evans, N.W.D.; Patino, J.; Bonastre, J.; Tomashenko, N.A.; Matrouf, D. Towards a unified assessment framework of speech pseudonymisation. Comput. Speech Lang. 2022, 72, 1–18.
Noé, P.; Bonastre, J.; Matrouf, D.; Tomashenko, N.A.; Nautsch, A.; Evans, N.W.D. Speech Pseudonymisation Assessment Using Voice Similarity Matrices. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 1718–1722.
Wang, J.; Zhang, X.; Qu, X. Rano: Restorable Speaker Anonymization via Conditional Invertible Neural Network. In Proceedings of the International Joint Conference on Neural Networks, Rome, Italy, 30 June–5 July 2025; pp. 1–8.
Yao, J.; Wang, Q.; Guo, P.; Ning, Z.; Yang, Y.; Pan, Y.; Xie, L. MUSA: Multi-Lingual Speaker Anonymization via Serial Disentanglement. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 1664–1674.

Figure 1. The framework of the proposed speaker anonymization system.

Figure 2. Information entropy varying with the number of GMM components.

Figure 3. Information entropy varying with ε.

Figure 4. EER comparison across different anonymization methods and attack scenarios on development and test sets. Higher EER corresponds to better privacy.

Figure 5. $C_{llr}^{\min}$ comparison across different anonymization methods and attack scenarios on development and test sets. Higher $C_{llr}^{\min}$ corresponds to better privacy.

Figure 6. EER vs. WER results for the o-a and a-a scenarios on development and test sets. Higher EER and lower WER correspond to better anonymization performance.

Figure 7. $D_{ECE}$ comparison across different anonymization methods and attack scenarios on development and test sets. Lower $D_{ECE}$ corresponds to better privacy.

Figure 8. $\log(l)$ comparison across different anonymization methods and attack scenarios on development and test sets. Lower $\log(l)$ corresponds to better privacy.

Figure 9. $D_{\leftrightarrow}^{sys}$ comparison across different anonymization methods and attack scenarios on development and test sets. Lower $D_{\leftrightarrow}^{sys}$ corresponds to better privacy.

Figure 10. $G_{VD}$ comparison across different anonymization methods and attack scenarios on development and test sets. Higher $G_{VD}$ corresponds to better distinguishability of anonymized voices.

Figure 11. DeID comparison across different anonymization methods and attack scenarios on development and test sets. Higher DeID corresponds to better privacy.

Table 1. Detailed description of each module of the evaluation model, along with the corresponding training corpora.

Module	Description	Input Features (dim)	Output Features (dim)	Training Dataset
F0	F0 estimation using an ensemble of five F0 extractors [47]	–	–	–
ASR AM	TDNN-F [31]	MFCC (40) + i-vector (100)	BN (256)	LibriSpeech [48]: train-clean-100, train-other-500
X-vector extractor	TDNN model topology [30]	MFCC (30)	X-vector (512)	VoxCeleb-1 [49], VoxCeleb-2 [50]
Waveform generator	HiFi-GAN model [44]	F0 + X-vector (512) + BN features (256)	Speech waveform	LibriTTS [51]: train-clean-100

Table 2. Description of methods adopted in the comparative experiment.

System	Source	Method Description
B1	Fang et al. [27]	Selects the subset of X-vectors farthest from the original X-vector and averages them
B2	Patino et al. [20]	Uses the McAdams coefficients
A1	Mawalim et al. [34]	Anonymizes speech via variability-driven decomposition with a regression model
A2	Mawalim et al. [35]	Uses singular value decomposition
O1	Turner et al. [29]	PCA + GMM
O1c1	Turner et al. [29]	PCA + GMM with forced dissimilarity

Table 3. ASV results (EER, $C_{llr}$, $C_{llr}^{\min}$) of our proposed anonymization system on development and test sets across three scenarios.

Dataset	Gender	Scenario	EER (%), RP + GMM	$C_{llr}$, RP + GMM	$C_{llr}^{\min}$, RP + GMM	EER (%), RP + GMM + SC	$C_{llr}$, RP + GMM + SC	$C_{llr}^{\min}$, RP + GMM + SC
Libri (dev)	Female	o-o	8.81	42.908	0.305	8.81	42.908	0.305
		o-a	48.72	159.396	0.977	43.47	137.483	0.970
		a-a	49.29	58.498	0.995	49.01	48.223	0.958
	Male	o-o	1.24	14.268	0.035	1.24	14.268	0.035
		o-a	50.16	158.200	0.981	51.55	160.652	0.996
		a-a	39.44	57.145	0.939	52.19	69.941	0.987
VCTK diff (dev)	Female	o-o	2.92	1.146	0.102	2.92	1.146	0.102
		o-a	51.49	157.211	0.985	59.80	155.015	0.994
		a-a	49.24	49.844	0.925	39.70	44.201	0.932
	Male	o-o	1.39	1.155	0.052	1.39	1.155	0.052
		o-a	41.34	138.893	0.926	52.70	144.926	0.989
		a-a	53.90	72.231	0.983	47.15	62.277	0.988
VCTK common (dev)	Female	o-o	2.33	0.870	0.089	2.33	0.870	0.089
		o-a	47.09	158.910	0.968	50.00	145.902	0.974
		a-a	41.28	46.789	0.951	46.22	54.452	0.986
	Male	o-o	1.43	1.551	0.050	1.43	1.551	0.050
		o-a	42.74	150.900	0.937	47.58	152.512	0.983
		a-a	51.57	68.722	0.982	52.14	69.967	0.985
Libri (test)	Female	o-o	7.66	26.804	0.184	7.66	26.804	0.184
		o-a	47.81	151.146	0.993	43.80	140.360	0.969
		a-a	38.14	51.574	0.936	35.58	38.873	0.912
	Male	o-o	1.11	15.378	0.042	1.11	15.378	0.042
		o-a	44.54	140.752	0.955	45.21	142.983	0.959
		a-a	51.89	64.518	0.975	49.89	71.763	0.996
VCTK diff (test)	Female	o-o	4.94	1.497	0.170	4.94	1.497	0.170
		o-a	56.28	133.935	0.997	53.03	138.735	0.997
		a-a	42.44	42.835	0.955	44.34	45.323	0.898
	Male	o-o	2.07	1.831	0.072	2.07	1.831	0.072
		o-a	50.86	146.206	1.000	46.27	132.943	0.990
		a-a	41.22	50.635	0.908	50.63	74.723	0.952
VCTK common (test)	Female	o-o	2.89	0.866	0.092	2.89	0.866	0.092
		o-a	50.00	151.192	0.982	50.00	146.820	0.983
		a-a	45.09	43.370	0.954	46.24	42.348	0.945
	Male	o-o	1.13	1.042	0.036	1.13	1.042	0.036
		o-a	49.15	159.922	0.994	46.05	151.122	0.988
		a-a	42.94	50.736	0.935	48.87	60.755	0.930
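
The EER values in Table 3 are the operating point at which the false acceptance rate and the false rejection rate of the ASV system coincide, so an EER near 50% means the attacker does no better than chance. The following is a minimal sketch of that computation, assuming plain lists of similarity scores where higher means "same speaker". The function name `equal_error_rate` and the threshold sweep are illustrative assumptions; this is not the Kaldi-based evaluation pipeline used for the reported numbers.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER (in %) by sweeping every observed score as a threshold.

    Hypothetical helper for illustration; scores are assumed to be
    higher-is-more-similar, and both lists must be non-empty.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 0.0
    for t in thresholds:
        # False rejection: genuine (target) trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor (non-target) trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return 100.0 * eer


# Well-separated scores: the verifier makes no errors.
print(equal_error_rate([0.9, 0.8], [0.1, 0.2]))
# Identical score distributions: the verifier is at chance level.
print(equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]))
```

The two calls illustrate the extremes seen in the table: the o-o rows sit close to the first case, while the o-a and a-a rows sit close to the second.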

Table 4. WER results for original and anonymized speech recognition on development and test sets.

Dataset	WER (%), Orig	WER (%), RP + GMM	WER (%), RP + GMM + SC
Libri_dev	3.83	5.72	5.89
VCTK_dev	10.79	14.97	15.34
Aver_dev	7.31	10.35	10.66
Libri_test	4.15	5.80	5.86
VCTK_test	12.82	14.64	14.63
Aver_test	8.49	10.22	10.25
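
WER, as reported in Table 4, is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below shows that calculation for a single utterance pair, assuming whitespace tokenization. The function name `word_error_rate` is a hypothetical helper; the reported figures come from the ASR evaluation model, which is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER (in %) between two transcripts, using word-level Levenshtein distance.

    Hypothetical helper for illustration; assumes whitespace tokenization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```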

Table 5. Additional ASV results ($D_{ECE}$, $\log(l)$, and $D_{\leftrightarrow}^{sys}$) of our proposed anonymization system on development and test sets across three scenarios, with categorical tags for worst-case privacy disclosure based on posterior odds ratio (under a flat prior assumption): A (more disclosure than 50:50), B (one wrong in 10 to 100), C (one wrong in 100 to 10,000).

Dataset	Gender	Scenario	$D_{ECE}$, RP + GMM	$\log(l)$, RP + GMM	$D_{\leftrightarrow}^{sys}$, RP + GMM	$D_{ECE}$, RP + GMM + SC	$\log(l)$, RP + GMM + SC	$D_{\leftrightarrow}^{sys}$, RP + GMM + SC
Libri (dev)	Female	o-o	0.492	3.829 (C)	0.800	0.492	3.829 (C)	0.800
		o-a	0.016	1.094 (B)	0.099	0.020	0.538 (A)	0.129
		a-a	0.003	0.488 (A)	0.086	0.029	1.17 (B)	0.155
	Male	o-o	0.695	4.05 (D)	0.973	0.695	4.05 (D)	0.973
		o-a	0.014	1.388 (B)	0.061	0.003	0.689 (A)	0.074
		a-a	0.042	1.775 (B)	0.171	0.008	1.056 (B)	0.118
VCTK diff (dev)	Female	o-o	0.645	3.971 (C)	0.931	0.645	3.971 (C)	0.931
		o-a	0.011	1.422 (B)	0.065	0.004	0.981 (A)	0.142
		a-a	0.051	1.393 (B)	0.233	0.047	1.767 (B)	0.151
	Male	o-o	0.682	4.042 (D)	0.968	0.682	4.042 (D)	0.968
		o-a	0.050	1.207 (B)	0.200	0.008	1.381 (B)	0.083
		a-a	0.012	1.638 (B)	0.081	0.008	1.286 (B)	0.161
VCTK common (dev)	Female	o-o	0.652	3.596 (C)	0.936	0.652	3.596 (C)	0.936
		o-a	0.022	0.994 (A)	0.093	0.018	1.623 (B)	0.111
		a-a	0.033	0.668 (A)	0.127	0.009	0.233 (A)	0.093
	Male	o-o	0.683	3.616 (C)	0.966	0.683	3.616 (C)	0.966
		o-a	0.043	1.322 (B)	0.150	0.011	0.544 (A)	0.086
		a-a	0.012	0.447 (A)	0.118	0.010	1.059 (B)	0.122
Libri (test)	Female	o-o	0.584	3.978 (C)	0.898	0.584	3.978 (C)	0.898
		o-a	0.005	0.313 (A)	0.095	0.021	1.611 (B)	0.102
		a-a	0.044	1.611 (B)	0.185	0.059	0.947 (A)	0.191
	Male	o-o	0.690	3.924 (C)	0.957	0.690	3.924 (C)	0.957
		o-a	0.031	0.702 (A)	0.124	0.028	0.861 (A)	0.101
		a-a	0.018	1.212 (B)	0.072	0.003	0.429 (A)	0.085
VCTK diff (test)	Female	o-o	0.593	3.645 (C)	0.880	0.593	3.645 (C)	0.880
		o-a	0.002	0.796 (A)	0.091	0.002	0.562 (A)	0.079
		a-a	0.031	1.352 (B)	0.129	0.070	1.944 (B)	0.201
	Male	o-o	0.667	3.916 (C)	0.950	0.667	3.916 (C)	0.950
		o-a	0.000	0.043 (A)	0.046	0.007	0.978 (A)	0.064
		a-a	0.063	1.722 (B)	0.165	0.033	0.904 (A)	0.157
VCTK common (test)	Female	o-o	0.652	3.559 (C)	0.924	0.652	3.559 (C)	0.924
		o-a	0.012	0.884 (A)	0.105	0.011	0.324 (A)	0.110
		a-a	0.032	1.072 (B)	0.108	0.037	0.915 (A)	0.168
	Male	o-o	0.694	3.675 (C)	0.972	0.694	3.675 (C)	0.972
		o-a	0.005	0.386 (A)	0.054	0.008	0.845 (A)	0.096
		a-a	0.045	1.447 (B)	0.108	0.048	0.921 (A)	0.184
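
The categorical tags in Table 5 follow from the worst-case $\log(l)$ value alone. The sketch below maps a value to its tag, assuming that $\log(l)$ is a base-10 logarithm, which is consistent with the odds ranges in the caption (10, 100, 10,000). Two parts are assumptions not stated in the caption: the tag D, which appears in the table, is taken to cover one wrong in 10,000 to 100,000, as in the ZEBRA scale, and non-positive values are mapped to a "0" tag. The function name `zebra_tag` is hypothetical.

```python
def zebra_tag(log10_l):
    """Map a worst-case log10 likelihood ratio to a categorical tag.

    Hypothetical helper for illustration. Tags A to C follow the Table 5
    caption; the bound for D and the "0" tag are assumptions based on the
    ZEBRA scale.
    """
    if log10_l <= 0:
        return "0"  # assumed: no disclosure beyond a 50:50 decision
    # Each pair is (exclusive upper bound on log10(l), tag).
    for upper, tag in ((1, "A"), (2, "B"), (4, "C"), (5, "D")):
        if log10_l < upper:
            return tag
    raise ValueError("value lies beyond the tags covered in this sketch")


# Values taken from the first rows of Table 5 (Libri dev).
print([zebra_tag(x) for x in (3.829, 1.094, 0.488, 4.05)])
```

For the values above, the mapping reproduces the tags printed in the table: C, B, A, and D.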

Table 6. Results of $G_{VD}$ and DeID of our proposed anonymization system on development and test sets.

Datasets	Gender	$G_{VD}$ (dB), RP + GMM	$G_{VD}$ (dB), RP + GMM + SC	DeID (%), RP + GMM	DeID (%), RP + GMM + SC
Libri (dev)	Female	−0.14	−0.07	96.01	96.04
Libri (dev)	Male	−0.12	−0.06	99.27	100.00
VCTK diff (dev)	Female	0.37	0.34	98.05	99.41
VCTK diff (dev)	Male	0.15	0.10	97.57	99.90
VCTK common (dev)	Female	0.17	0.11	96.63	99.41
VCTK common (dev)	Male	0.05	0.19	90.30	98.43
Libri (test)	Female	−0.11	−0.10	98.92	92.70
Libri (test)	Male	0.09	0.18	100.00	99.61
VCTK diff (test)	Female	0.51	0.46	100.00	99.27
VCTK diff (test)	Male	0.09	0.11	99.37	98.98
VCTK common (test)	Female	0.14	0.22	99.41	97.43
VCTK common (test)	Male	−0.02	−0.08	97.62	98.63
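
Both metrics in Table 6 are usually derived from voice similarity matrices through their diagonal dominance, that is, the gap between the mean same-speaker similarity (diagonal) and the mean cross-speaker similarity (off-diagonal). The sketch below assumes the formulation of Noé et al.: $G_{VD} = 10 \log_{10}(D_{diag}(M_{aa}) / D_{diag}(M_{oo}))$ and $DeID = 1 - D_{diag}(M_{oa}) / D_{diag}(M_{oo})$. The paper's own equations are not part of this section, so treat these definitions, the function names, and the toy matrices as assumptions.

```python
import math


def diagonal_dominance(m):
    """|mean of diagonal - mean of off-diagonal| for a square similarity matrix."""
    n = len(m)
    diag = sum(m[i][i] for i in range(n)) / n
    off = sum(m[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return abs(diag - off)


def gain_of_voice_distinctiveness(m_oo, m_aa):
    """G_VD in dB: 0 means distinctiveness is preserved, negative means loss."""
    return 10.0 * math.log10(diagonal_dominance(m_aa) / diagonal_dominance(m_oo))


def de_identification(m_oo, m_oa):
    """DeID in %: 100 means original and anonymized voices are fully unlinked."""
    return 100.0 * (1.0 - diagonal_dominance(m_oa) / diagonal_dominance(m_oo))


# Toy 2-speaker matrices (illustrative values, not data from the paper).
m_oo = [[1.0, 0.0], [0.0, 1.0]]  # original vs. original
m_aa = [[0.5, 0.0], [0.0, 0.5]]  # anonymized vs. anonymized
m_oa = [[0.5, 0.5], [0.5, 0.5]]  # original vs. anonymized

print(round(gain_of_voice_distinctiveness(m_oo, m_aa), 2))
print(de_identification(m_oo, m_oa))
```

In this toy case the anonymized voices are half as distinctive as the originals, which gives a negative $G_{VD}$, and the original-to-anonymized matrix has no diagonal structure at all, which gives a DeID of 100%.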

Table 7. Comparison of different speaker anonymization methods with EER, WER and $G_{VD}$ results. For EER, closer to 50% is better. For WER, lower is better. For $G_{VD}$, higher is better.

Anonymization Methods	EER (%) ↑	WER (%) ↓	$G_{VD}$ (dB) ↑
SALT [24]	45.32	10.77	−0.03
Rano [60]	47.81	11.91	0.39
MUSA [61]	48.29	9.895	−2.78
RP + GMM (ours)	45.54	10.46	0.10
RP + GMM + SC (ours)	46.83	10.26	0.12

Table 8. Comparison of different speaker anonymization methods with naturalness and intelligibility results.

Anonymization Methods	Naturalness	Intelligibility
B1 [27]	3.15	3.28
SALT [24]	3.73	3.85
Rano [60]	3.75	3.84
MUSA [61]	3.79	3.82
RP + GMM + SC (ours)	3.76	3.85

Share and Cite

MDPI and ACS Style

Wang, C.; Zhang, Q.; Hu, Y.; Wei, H. Distinguishability-Driven Voice Generation for Speaker Anonymization via Random Projection and GMM. Big Data Cogn. Comput. 2026, 10, 43. https://doi.org/10.3390/bdcc10020043
