1. Introduction
The breakthrough advancements in deep learning have accelerated the widespread adoption of Automatic Speech Recognition (ASR) systems [
1,
2,
3]. Voice interaction technologies, such as speech-to-text document generation [
4], real-time multilingual speech translation [
5], and intelligent voice search engines [
6], have become deeply integrated into diverse aspects of society. Nevertheless, the large-scale development of such technologies entails the massive collection of voice data, which enhances service efficiency while simultaneously raising serious privacy protection challenges [
7,
8].
Voice data, as a typical type of biometric data, contains multidimensional sensitive information embedded in its acoustic features, including identity attributes, social status, physiological states, psychological traits, political stance, and religious beliefs [
9]. These deeply embedded privacy features in speech signals are not only exposed to traditional data leakage risks but are also vulnerable to novel threats such as reconstruction attacks [
10]. Moreover, with the advancement of voice cloning [
11,
12] and speech synthesis technologies [
13,
14], attackers can now synthesize highly realistic fake audio from just a few seconds of voice samples, posing severe threats to personal reputation and information security. Against this backdrop, the enhancement of regulations such as the General Data Protection Regulation (GDPR) [
15] has made voice data de-identification and the development of privacy-preserving technologies a critical and urgent challenge in the field of AI ethics.
Speaker anonymization aims to desensitize identity information embedded in speech signals, making them unusable for voice cloning while effectively resisting identity tracing by automatic speaker verification (ASV) and automatic speaker identification (ASI) systems [
16]. Critically, this process must preserve the original linguistic content and natural usability of speech to meet downstream application requirements. It is important to note that a seemingly straightforward approach—first transcribing speech to text, then modifying the text to remove speaker-specific patterns, and finally resynthesizing speech—is generally unsuitable for this task. This is due to three fundamental limitations: first, text transformation merely modifies linguistic patterns (e.g., regional dialect expressions) but fails to eliminate speaker-specific acoustic cues; second, it tends to distort linguistic content and paralinguistic information (e.g., emotional prosody), thereby compromising the essential requirement of preserving speech usability in speaker anonymization; third, it cannot mitigate voice cloning attacks, as such attacks rely on acoustic features rather than on the linguistic content of speech.
To advance speaker anonymization, the VoicePrivacy Initiative was launched in 2020 [
17], which has since evolved into a biennial VoicePrivacy Challenge (VPC). VPC 2022 [
18] focused on optimizing the robustness of anonymization methods, while VPC 2024 [
19] introduced emotion preservation constraints, requiring anonymization systems to eliminate identity features while maintaining emotional intensity and the affective category of the original speech.
The latest VPC submissions revealed two dominant research directions in speaker anonymization: traditional signal processing-based methods [
20,
21,
22] and Deep Neural Network (DNN)-based methods [
23,
24,
25]. Signal processing-based methods can obscure a speaker’s identity by modifying instantaneous speech features (e.g., pitch and spectral envelope) and time scale, or by altering vocal characteristics from a speech generation perspective. These modifications effectively conceal the original speaker’s identity, speech rate, and other source and vocal tract patterns. Although computationally efficient and training-free, these methods suffer from two critical drawbacks: first, the speech content is prone to noticeable distortion; second, the study [
26] suggested that signal processing-based anonymization might be reversible after a limited number of attempts, potentially exposing the original voice. In contrast, DNN-based anonymization has pioneered a new paradigm, leveraging advances in neural voice conversion and speech synthesis, with a primary focus on disentangled representation learning. The key rationale for choosing disentanglement methods lies in their ability to separate linguistic and paralinguistic features using distinct encoders, enabling the independent modification of each attribute. For instance, Fang et al. [
27] introduced an anonymization framework based on modified X-vectors, where pseudo-speaker identities were generated by averaging a set of candidate X-vectors from a speaker pool. Srivastava et al. designed three anonymization schemes based on probabilistic linear discriminant analysis (PLDA) distance metrics to obscure speakers’ identities [
28]. However, Turner et al. [
29] pointed out that pseudo-X-vectors in VPC were derived by averaging multiple X-vectors from a speaker pool, resulting in highly similar pseudo-X-vectors that reduced the diversity of anonymized voices. This high similarity not only made it harder to distinguish between anonymized voices, but also limited the number of distinguishable anonymized voices that could be generated.
In this paper, we propose a novel distinguishability-driven voice generation method using random projection and a Gaussian mixture model (GMM), effectively addressing the limitations of existing speaker anonymization approaches. These limitations include the overly concentrated distribution of anonymized X-vectors, limited diversity in generated anonymized voices, excessive reliance on an external speaker pool, and higher risks of privacy leakage when facing strong attackers.
The main contributions of this paper are summarized as follows:
- (1)
Random projection matrix for X-vector dimensionality reduction: We use a random projection matrix to construct a speaker-independent random generation mechanism. By injecting randomness, our approach not only enhances speaker privacy protection but also preserves the flexibility of high-dimensional data manifold structures, while significantly reducing computational overhead.
- (2)
GMM-based probabilistic sampler model: We employ GMM as the generative model, leveraging its strong distribution generalization capability to improve speech generation quality. The probabilistic mixture mechanism of GMM also helps defend against membership inference attacks. By sampling speaker embeddings directly from the GMM, we further reduce computational complexity. Additionally, the sampling model eliminates the need for a speaker pool during anonymization, mitigating privacy leakage risks while optimizing the anonymization process.
- (3)
Grid search strategy with GMM entropy for parameter optimization: We adopt a grid search strategy combined with GMM entropy metrics to evaluate probability distribution quality and optimize model parameters. This method not only enhances the model’s generalization ability but also achieves a strong balance between privacy protection and data distribution.
- (4)
Cosine similarity check for anonymity enhancement: We introduce a cosine similarity check mechanism to assess the discrimination between original X-vectors and pseudo-X-vectors. This effectively prevents insufficient distinguishability between original and anonymized voices due to excessive similarity, thereby improving speaker anonymization performance.
- (5)
Empirical evaluation on the VoicePrivacy Challenge dataset: The experimental results validated that our method effectively protected speaker privacy in the ignorant and lazy informed attack scenarios (as formally defined in Section 5.3). After introducing the similarity check (SC), the average Equal Error Rate (EER) increased by 0.77 percentage points in the ignorant attack scenario (from 48.35% to 49.12%) and by 1.29 percentage points in the lazy informed attack scenario (from 45.54% to 46.83%). Furthermore, the results of voice distinctiveness ($G_{VD}$) indicated that our proposed approach markedly improved speaker distinguishability, not only between original and anonymized voices, but also among different anonymized samples. Across most datasets, voice distinctiveness showed a significant improvement, reaching a peak value of 0.51.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces random projection. Section 4 details our proposed speaker anonymization method. Section 5 presents the experimental setup, including datasets and evaluation metrics, together with an analysis of the experimental results. Finally, Section 6 concludes this paper.
2. Related Work
Speaker anonymization is a flexible and effective privacy protection technique designed to remove speaker identity from voice signals, while preserving the intelligibility, naturalness, and distinguishability of the original speech. X-vectors are widely adopted as effective features for encoding speaker identity, owing to their proven performance in speaker verification and recognition tasks. X-vector-based speaker anonymization methods employed advanced neural network technologies, utilizing a Time Delay Neural Network (TDNN)-based ASV system to extract X-vectors [
30], and a Factorized Time Delay Neural Network (TDNN-F)-based ASR acoustic model (AM) [
31] to derive prosodic and linguistic features, including fundamental frequency (F0) and bottleneck (BN) features, from the original speech waveform. To effectively conceal speaker identity, the designated speaker anonymizer proposed by Srivastava et al. [
28] generated an anonymized X-vector by averaging multiple X-vectors randomly drawn from an external speaker pool. For each source X-vector, the procedure first filtered the 200 most-distant candidates via cosine distance, and then randomly selected 100 of these for mean computation. The anonymized X-vector, along with F0 and BN features, was fed into a speech synthesis acoustic model (SS AM) to produce Mel-fbank features. These features, together with the anonymized X-vector and F0, were then used by a neural source filter (NSF) waveform model to synthesize the anonymized speech. Given its effectiveness in protecting speaker privacy over conventional signal processing techniques, this disentanglement-based approach has become a prevalent framework for achieving efficient and robust anonymization in subsequent studies.
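The pool-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the baseline's actual code: the function name `baseline_pseudo_xvector`, the synthetic Gaussian pool, and the pool size of 1000 are hypothetical, while the 200/100 counts and the cosine distance follow the description in the text.

```python
import numpy as np

def baseline_pseudo_xvector(source, pool, n_farthest=200, n_average=100, rng=None):
    """Average randomly chosen pool X-vectors taken from those farthest from `source`."""
    rng = np.random.default_rng() if rng is None else rng
    # Cosine distance = 1 - cosine similarity between the source and every pool vector.
    pool_unit = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    source_unit = source / np.linalg.norm(source)
    cosine_distance = 1.0 - pool_unit @ source_unit
    # Indices of the n_farthest most distant pool speakers.
    farthest = np.argsort(cosine_distance)[-n_farthest:]
    # Randomly keep n_average of them and average their X-vectors.
    chosen = rng.choice(farthest, size=n_average, replace=False)
    return pool[chosen].mean(axis=0)

rng = np.random.default_rng(0)
pool = rng.standard_normal((1000, 512))   # synthetic stand-in for an external speaker pool
source = rng.standard_normal(512)         # synthetic stand-in for a source X-vector
pseudo = baseline_pseudo_xvector(source, pool, rng=rng)
```

Because the output is a mean over many pool vectors, it depends heavily on the pool itself, which is the dependence the later studies cited in this section set out to reduce.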
In recent years, numerous researchers enhanced X-vector-based anonymization from various perspectives to improve performance [
23,
32]. However, previous studies [
26,
33] indicated that existing anonymization systems remain constrained by their selection-based mechanisms. Specifically, the distribution of the external speaker pool significantly influenced anonymized speaker characteristics, and averaging selected X-vectors often diminished voice distinguishability. To address this, Mawalim et al. presented an X-vector-based anonymization method [
34], which used singular value modification and statistics-based decomposition on the original X-vector with ensemble regression models to achieve transformation. They later introduced a clustering-based singular value correction method, further enhancing anonymization by adjusting F0 and speech duration [
35]. Turner et al. [
29] observed that the baseline methods produced overly compact pseudo-X-vector distributions, failing to utilize the full X-vector space and resulting in insufficient voice distinguishability. To mitigate this, they proposed a pseudo-speaker generator that combined PCA with a GMM, applying PCA dimensionality reduction to the external speaker pool and sampling anonymized X-vectors from the trained GMM. As another generative approach, Meyer et al. [
36] utilized a Generative Adversarial Network (GAN) with a Wasserstein distance cost function to generate speaker embeddings, enforcing the distribution of the generated embeddings to be similar to that of the speaker embeddings corresponding to real speakers. Perero et al. introduced an autoencoder-based X-vector reconstruction, incorporating domain adversarial training to suppress speaker attributes such as gender, accent, and identity [
37]. Another innovative direction was the Orthogonal Householder Neural Network (OHNN)-based anonymization [
38], which rotated original X-vectors into anonymized ones while constraining their distribution to the original space for more natural results. Recently, Yao et al. [
25] developed a speaker identity-related matrix model, transforming it into an anonymized singular value transformation-assisted matrix to conceal the original speaker identity, not only improving anonymized speech quality, but also strengthening privacy protection. Chen et al. [
39] represented the speaker attributes with distributions to increase the voice distinguishability of the anonymized speech, where the nonlinear variations among the cohort speakers were leveraged to model the pseudo-speakers via neural networks, and both utterances and frames were investigated as modeling units. Lee et al. [40] introduced the pinhole effect as a conceptual framework to explain identity linkability and found that the use of distinct pseudo-speakers increased speaker dispersion and reduced linkability compared to common pseudo-speaker mapping, thus enhancing privacy preservation.
3. Random Projection
Random projection (RP) is an efficient dimensionality reduction technique that maps high-dimensional data into a lower-dimensional space while preserving critical structural information [
41]. Its core principle is to use a randomly generated projection matrix to transform the original high-dimensional data $X \in \mathbb{R}^{n \times d}$ into a reduced space $Y \in \mathbb{R}^{n \times k}$, where $n$ denotes the number of samples, $d$ is the original feature dimension, and $k$ is the reduced dimension ($k < d$). In this work, $X$ corresponds to X-vectors, that is, fixed-dimensional (512-dimensional) speaker embeddings extracted from speech signals using a pre-trained TDNN. The mapping is defined as follows:
$$Y = XR$$
where $R \in \mathbb{R}^{d \times k}$ represents the random projection matrix.
The theoretical foundation of RP relies on the Johnson–Lindenstrauss (JL) lemma [42], which provides a probabilistic guarantee for preserving pairwise Euclidean ($\ell_2$) distances in the reduced space. The JL lemma states the following:
Johnson–Lindenstrauss lemma. For any $0 < \epsilon < 1$ and any finite set of points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$, there exists a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^k$ with $k = \Omega(\epsilon^{-2} \log n)$ such that
$$(1 - \epsilon)\,\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1 + \epsilon)\,\|x_i - x_j\|_2^2 \quad \text{for all } i, j.$$
Here, $\|x_i - x_j\|_2$ denotes the Euclidean distance between points in the original $d$-dimensional space, $\|f(x_i) - f(x_j)\|_2$ represents their distance in the reduced $k$-dimensional space, $n$ is the number of points in the set, $\epsilon$ controls the distortion tolerance, $k$ is independent of $d$ and grows only logarithmically with $n$, and $\Omega$ represents the asymptotic lower-bound symbol. The JL lemma ensures that speaker relationships and structural properties critical for anonymization are maintained even after dimensionality reduction. A smaller $\epsilon$ preserves distance relationships more tightly but requires a larger $k$, increasing computational complexity. Conversely, a larger $\epsilon$ permits a smaller $k$, trading accuracy for efficiency.
Random matrices satisfying the JL lemma can be constructed via various methods, including sparse random matrices and Gaussian random matrices. Among these, Gaussian random matrices are the most widely adopted. Each element of a Gaussian random matrix $R$ is independently sampled from a zero-mean normal distribution (commonly $\mathcal{N}(0, 1/k)$). Owing to the favorable properties of Gaussian distributions, Gaussian RP demonstrates superior performance in preserving pairwise distances [43]. Therefore, this paper employs Gaussian RP to reduce the original 512-dimensional X-vectors to a lower-dimensional space. These reduced-dimensional X-vectors are subsequently used to generate pseudo-X-vectors (anonymized speaker embeddings), which are restored to the original 512-dimensional space via inverse random projection. These pseudo-X-vectors replace the original X-vectors in the anonymization pipeline and are specifically designed to obscure the original speaker’s identity while ensuring compatibility with downstream speech synthesis models.
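The projection and its inverse can be illustrated with a short NumPy sketch. Several choices here are assumptions made only for illustration: the reduced dimension `k = 128`, the entry variance $1/k$, the synthetic Gaussian data standing in for real X-vectors, and the use of the Moore-Penrose pseudo-inverse as the inverse random projection (a random $d \times k$ matrix has no exact inverse).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 512, 128            # k is an illustrative choice, not the paper's value

X = rng.standard_normal((n, d))    # synthetic stand-in for 512-dimensional X-vectors
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))   # entries drawn from N(0, 1/k)

Y = X @ R                          # forward projection: (n, d) -> (n, k)
R_pinv = np.linalg.pinv(R)         # (k, d) Moore-Penrose pseudo-inverse of R
X_restored = Y @ R_pinv            # back to (n, d); a lossy, minimum-norm reconstruction

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * A @ A.T

# Ratio of projected to original squared distances for every distinct pair.
iu = np.triu_indices(n, k=1)
ratio = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
print("distance ratio range:", ratio.min(), ratio.max())
```

The printed ratios concentrate around 1, which is the distance preservation the JL lemma describes. Projecting `X_restored` again reproduces `Y`, so the restored vectors are consistent with the reduced representation even though they differ from `X`.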
5. Experiment Results and Analysis
5.1. Datasets
The datasets used in this study are derived from the benchmark library of the VPC 2022 [
18]. As illustrated in
Figure 1, the anonymization framework comprises three pre-trained modules, with their corresponding training data configurations detailed in
Table 1.
The ASR AM is a TDNN-F model that uses 40 MFCCs and a 100-dimensional i-vector as input. It outputs 256-dimensional BN features. This ASR AM is trained on Librispeech train-clean-100 and train-other-500.
The X-vector extractor is a TDNN model that takes 30-dimensional MFCCs as input and produces a 512-dimensional speaker embedding (X-vector) as output. This extractor is trained using Voxceleb 1 and 2. The system assumes that these components decouple the speech content (BN and F0) from the speaker identity X-vector. Then, using a generation technique, the X-vector is modified to generate a pseudo X-vector representing the new identity of the speaker.
The HiFi-GAN model processes the modified X-vector, F0, and BN features to generate anonymized speech. It is trained on the LibriTTS train-clean-100 subset.
The final evaluation of the anonymization system is conducted on the official VPC [
18] development sets and test sets of LibriSpeech [
48] and VCTK [
52] corpora, following the splits specified in the reference [
18].
5.2. Determining the Optimal Parameters
To evaluate the performance of pseudo-X-vector generation, we systematically analyze their distribution by adjusting two key parameters: (1) $\epsilon$, the distance distortion parameter in RP, and (2) $C$, the number of components in the GMM. The experiment explored various values of $\epsilon$ and $C$. A grid search strategy was used to find the optimal parameters, with GMM entropy as the evaluation criterion. A higher entropy value indicates greater diversity in the pseudo-X-vectors generated.
The evaluation setup was as follows. First, all X-vectors were extracted from the LibriTTS train-other-500 dataset (600 males/560 females). For each gender, a 50%/50% train/test split was performed to train RP + GMM. Then, 1000 pseudo-X-vectors were sampled from the GMM to estimate entropy using Equation (4). The inverse RP transform was then applied to obtain 512-dimensional pseudo-X-vectors. The GMM was designed to learn a diagonal covariance matrix in the RP-reduced feature space. The maximum number of iterations of the Expectation–Maximization (EM) algorithm was set to 1000, with a fixed convergence tolerance. If the EM algorithm failed to converge, the maximum number of iterations was multiplied by 1.1 and the convergence tolerance was doubled.
As shown in Figure 2, for a fixed $\epsilon$, the entropy monotonically decreases as $C$ increases. Similarly, Figure 3 shows that, for a fixed $C$, the entropy decreases as $\epsilon$ increases. Since the entropy decreases with increasing $\epsilon$ and $C$, this paper selects the $(\epsilon, C)$ configuration with the maximum entropy. Note that the enrollment utterances and the trial utterances are synthesized using the same random projection matrix and GMM.
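The entropy criterion used in the grid search can be sketched as a Monte Carlo estimate: draw samples from a diagonal-covariance GMM and average the negative log-density. This is a plausible reading rather than a reproduction of Equation (4), which is not restated in this section. The helper names and the two hand-specified toy GMMs are hypothetical. In the actual procedure the candidates would be GMMs fitted by EM on RP-reduced X-vectors, one per $(\epsilon, C)$ pair.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at each row of x."""
    diff = x[:, None, :] - means[None, :, :]                           # (n, C, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))  # (n, C)
    a = log_comp + np.log(weights)
    m = a.max(axis=1, keepdims=True)                                   # log-sum-exp trick
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))).ravel()

def gmm_sample(n, weights, means, variances, rng):
    """Draw n samples: pick a component, then add scaled Gaussian noise."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + noise * np.sqrt(variances[comp])

def gmm_entropy_mc(weights, means, variances, n_samples=1000, rng=None):
    """Monte Carlo entropy estimate: H is approximated by -mean(log p(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = gmm_sample(n_samples, weights, means, variances, rng)
    return -gmm_log_density(x, weights, means, variances).mean()

# Toy candidates standing in for GMMs fitted under different (epsilon, C) settings.
rng = np.random.default_rng(0)
candidates = {
    "narrow": (np.array([1.0]), np.zeros((1, 2)), 0.1 * np.ones((1, 2))),
    "wide": (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2))),
}
scores = {name: gmm_entropy_mc(*params, n_samples=20000, rng=rng)
          for name, params in candidates.items()}
best = max(scores, key=scores.get)   # configuration with the maximum entropy
```

Selecting the maximum over `scores` mirrors the choice of the maximum-entropy configuration described above.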
5.3. Primary Objective Evaluation Metrics
Speaker anonymization seeks to protect speaker identity information and preserve speech intelligibility. Consequently, the evaluation of the anonymization system requires metrics that address both privacy and utility.
For the objective evaluation of anonymization performance, we employed two pre-trained systems: (1) ASVeval, an automatic speaker verification (ASV) system to assess speaker identity protection strength; (2) ASReval, an automatic speech recognition (ASR) system to evaluate content preservation quality. Both systems were developed using the Kaldi toolkit [
53] and trained on a subset of LibriSpeech train-clean-360 under the VPC official settings [
17,
18].
Attackers targeting specific speakers may access one or more utterances (referred to as enrollment utterances), which could be original or anonymized. In detail, the utterances are anonymized at the speaker level. The ASVeval examines three scenarios [
26]: (1) Unprotected scenario (o-o/Orig): No anonymization is applied, and the attackers access both original trial and enrollment utterances. (2) Ignorant attack scenario (o-a): Users anonymize their trial utterances, but attackers remain unaware of this and attempt verification using original enrollment utterances against anonymized trial utterances. (3) Lazy informed attack scenario (a-a): Both enrollment and trial utterances are anonymized using the same system, but they are mapped to different pseudo-speakers.
This paper presents an extensive and objective evaluation of the proposed anonymization system (RP + GMM) using the VPC framework with extended metrics [
18]. To assess anonymity, we conduct a comparative analysis of outputs before and after applying the similarity check (RP + GMM + SC). Furthermore, we validate the system’s effectiveness by comparing it with several baseline systems (as shown in
Table 2), including the official VPC benchmarks (B1, B2) and other state-of-the-art approaches.
5.3.1. Privacy Evaluation
For an objective privacy evaluation, the ASVeval is based on X-vector speaker embeddings and probabilistic linear discriminant analysis (PLDA). The Equal Error Rate (EER) and the log-likelihood ratio cost function $C_{llr}$ are utilized as principal metrics. The EER corresponds to the threshold $\theta_{EER}$ at which the False Acceptance rate (FA) is equal to the False Rejection rate (FR), that is, $EER = FA(\theta_{EER}) = FR(\theta_{EER})$ [54]. For each pair of enrollment and trial X-vectors, the PLDA classifier computes a log-likelihood ratio (LLR) score, which is then compared against a decision threshold to determine whether they come from the same speaker or different speakers [55]. Following [55], $C_{llr}$ is derived from the PLDA scores and can be decomposed into a discrimination loss ($C_{llr}^{min}$) and a calibration loss ($C_{llr} - C_{llr}^{min}$). Here, $C_{llr}^{min}$ is obtained via optimal calibration using a monotonic transformation of the scores to their empirical LLR values. Both EER and $C_{llr}^{min}$ serve as robust measures of classifier discrimination [56].
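Both metrics can be computed from lists of target (same-speaker) and non-target (different-speaker) scores, as in the sketch below. The function names are hypothetical and the EER is approximated by a simple threshold sweep, which may differ slightly from the interpolation used by the official evaluation scripts. The block computes $C_{llr}$ only: obtaining $C_{llr}^{min}$ additionally requires the optimal monotonic calibration step, which is omitted here.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tar, non]))
    # A trial is accepted when its score is >= the threshold.
    fr = np.array([(tar < t).mean() for t in thresholds])    # false rejection rate
    fa = np.array([(non >= t).mean() for t in thresholds])   # false acceptance rate
    i = np.argmin(np.abs(fa - fr))                           # point where FA and FR meet
    return 0.5 * (fa[i] + fr[i])

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood ratio cost, in bits, for natural-log LLR scores."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    # np.logaddexp(0, x) evaluates log(1 + exp(x)) in a numerically stable way.
    c_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2.0)
    c_non = np.mean(np.logaddexp(0.0, non)) / np.log(2.0)
    return 0.5 * (c_tar + c_non)
```

With identical target and non-target score distributions the sweep yields an EER of 50%, and uninformative LLRs of zero yield $C_{llr} = 1$, which correspond to the ideal privacy values referred to in the analysis that follows.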
This paper examines two attack scenarios: ignorant (o-a) and lazy informed (a-a) attack models. Table 3 shows the evaluation results of EER and $C_{llr}^{min}$ under these two attack models. Higher EER and $C_{llr}^{min}$ values correspond to better privacy. The parameter values $\epsilon$ and $C$ follow Section 5.2. By combining the EER and $C_{llr}^{min}$ metrics, the effects of the similarity check and the attack models are analyzed as follows.
- (1)
Impact of similarity check. After introducing SC, the average EER for the o-a scenario increased by 0.77 percentage points (from 48.35% to 49.12%) and the average $C_{llr}^{min}$ increased by 0.008 (from 0.975 to 0.983). This indicates that SC may slightly strengthen privacy protection for the original enrollment utterances. For the a-a scenario, the average EER rose by 1.29 percentage points (from 45.54% to 46.83%), and $C_{llr}^{min}$ increased by 0.002 (from 0.953 to 0.955). This suggests that SC enhances the privacy protection of anonymized enrollment utterances. One reason for this could be that SC further obscures identity features through thresholding, making it harder for the ASV system to identify speakers.
- (2)
Comparison across attack models. From the perspective of attack scenarios, the EER before and after introducing SC was 48.35% and 49.12% in the o-a scenario, with corresponding $C_{llr}^{min}$ values of 0.975 and 0.983. In the a-a scenario, the EER increased from 45.54% to 46.83% after SC was added, while $C_{llr}^{min}$ increased slightly from 0.953 to 0.956. In other words, without SC, the EER and $C_{llr}^{min}$ gaps between the two attack scenarios were 2.81 percentage points and 0.021, respectively (with o-a higher than a-a). After introducing SC, the EER gap narrowed to 2.29 percentage points while the $C_{llr}^{min}$ gap widened to 0.027, with o-a still slightly higher. This demonstrates that RP + GMM + SC remains robust against different attack scenarios.
- (3)
A gender-focused privacy analysis. From a gender perspective, the introduction of SC has a clearly gender-differentiated impact on speaker privacy protection performance. Specifically, in the o-a scenario, the EER for female speakers before and after adding SC was 50.23% and 50.02%, respectively, with corresponding $C_{llr}^{min}$ values of 0.983 and 0.981, indicating a stable privacy protection performance that outperforms that of male speakers. In contrast, in the a-a scenario, the EER for male speakers increased from 46.83% to 50.51% after introducing SC, while $C_{llr}^{min}$ improved from 0.954 to 0.973, demonstrating a more pronounced privacy enhancement effect. Overall, the results suggest that female speakers exhibit a relative advantage in ignorant attacks, while male speakers demonstrate greater robustness in lazy informed attacks. The underlying reason can be traced to gender-based differences in acoustic characteristics. For female speakers, whose speech exhibits higher fundamental frequency and more dispersed formants, anonymization sharply reduces the similarity between the original and anonymized speech, yielding a high EER (around 50%) and consistently stable performance even after introducing SC. In contrast, male speech, which is characterized by lower fundamental frequency, more concentrated formants, and smaller inter-speaker variability, tends to converge toward a shared “average male” feature subspace during anonymization, resulting in higher similarity across anonymized utterances. SC effectively counteracts this convergence, promoting feature dispersion and thus enhancing both EER and $C_{llr}^{min}$ for male speakers.
These results demonstrate that the proposed method effectively conceals speaker identity while maintaining system stability. The experimental data deviated only marginally from the optimal theoretical values (EER = 50%; $C_{llr}^{min} = 1$), with the average EER and $C_{llr}^{min}$ reaching 47.46% and 0.97, respectively. This validates the robustness and privacy-preserving capability of the proposed methods in both the o-a and a-a attack scenarios.
For further validation, we compare the proposed method with those listed in Table 2. Figure 4 and Figure 5 present a comparison of the average EER and the average $C_{llr}^{min}$ across all eight systems. The average EER was computed by averaging the three EERs obtained on the development and test sets. The same method was used for the average $C_{llr}^{min}$. As shown in Figure 4, RP + GMM performs well in the o-a scenario on the development sets, with its EER values ranking just below those of the top systems (B1, A1, and O1C1), and it is more effective on the test sets, where it is comparable to the other systems. In the a-a scenario, both proposed systems outperform the other systems, with RP + GMM + SC performing best. The results of $C_{llr}^{min}$ in Figure 5 show a similar trend, further validating the generalization performance of the proposed systems.
5.3.2. Utility Evaluation
To assess the linguistic preservation capability and practical usability of anonymized speech, we employed the ASReval system for anonymization evaluation. Specifically, it decodes original and anonymized trial utterances into text transcripts and then computes the Word Error Rate (WER) as the primary metric for speech usability, defined as follows:
$$\mathrm{WER} = \frac{N_{sub} + N_{del} + N_{ins}}{N_{ref}}$$
where $N_{sub}$, $N_{del}$, and $N_{ins}$ denote the numbers of substitution, deletion, and insertion errors, respectively, and $N_{ref}$ is the total number of words in the reference.
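As a worked illustration of this definition, the sketch below computes WER for one reference/hypothesis pair with a word-level edit distance. The function name and the example sentences are hypothetical, and an actual evaluation would aggregate error counts over all utterances decoded by ASReval rather than score a single sentence.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the minimum number of edits turning the processed
    # reference prefix into hyp[:j] (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion of a reference word
                            curr[j - 1] + 1,           # insertion of a hypothesis word
                            prev[j - 1] + (r != h)))   # substitution, or match at no cost
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```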
As shown in Table 4, the average WER of the two proposed systems increases by approximately 3 percentage points compared to that without anonymization, which aligns with expectations. Notably, adding SC (from RP + GMM to RP + GMM + SC) has a negligible effect on WER: changes are minimal (e.g., 5.72% to 5.89% for Libri_dev; 14.64% to 14.63% for VCTK_test). This demonstrates that SC enhances speaker privacy (as shown in the previous ASVeval results) without significantly compromising speech recognition performance. Dataset-wise, LibriSpeech exhibits stronger recognition robustness: its original WER is much lower, and the increase after anonymization is mild (around 2 percentage points). In contrast, VCTK speech has a higher WER, and anonymization leads to more noticeable (but still acceptable) recognition degradation. Additionally, WER trends are consistent between the development and test sets, indicating stable experimental results. Overall, the proposed anonymization methods strike a balance between privacy protection and speech content usability, with SC being a privacy-enhancing component that barely affects recognition performance.
5.3.3. Privacy–Utility Trade-Off
Figure 6 demonstrates the privacy–utility trade-off (EER vs. WER) across the development and test sets under both attack scenarios. The first row displays the EER and WER results for the development set in the two scenarios, while the second row shows the corresponding results for the test set. As seen across all subfigures, the original data (Orig) demonstrates the optimal utility (lowest WER) but the weakest privacy protection (lowest EER). All anonymization methods significantly enhance privacy at the cost of moderate speech recognition degradation. Among the anonymization methods, B1 achieves the best privacy–utility balance in the o-a scenario on the development and test sets, though its privacy protection notably degrades in the a-a scenario. Our proposed systems achieve near-optimal EER values (approaching 50%) while maintaining WER within an acceptable range (typically below 11%), indicating the effective protection of speaker identity without severely degrading speech intelligibility. Notably, RP + GMM + SC shows a favorable position in the privacy–utility landscape, outperforming several systems by offering stronger anonymity at comparable or lower WER. These results confirm that disentangling speaker identity from speech content, coupled with probabilistic sampling in a reduced-dimensional space, enables robust and adjustable anonymization suitable for varying privacy–utility requirements in practical applications.
5.4. Additional Objective Evaluation Metrics
This section evaluates additional metrics proposed during the VPC, including the ZEBRA framework, linkability, voice distinctiveness, and de-identification.
5.4.1. The ZEBRA Framework and Linkability
ZEBRA framework. The ZEBRA framework serves as a robust quantitative model for privacy assessment without requiring pre-registered biometric data [56]. This framework incorporates two principal metrics: expected privacy disclosure at a population level, $D_{ECE}$, and worst-case privacy disclosure for an individual, $\log_{10}(l)$ (referred to as $\log(l)$). $D_{ECE}$, measured in bits, estimates the amount of private information that attackers could access while remaining independent of their prior knowledge. $D_{ECE}$ values range from 0 (indicating perfect privacy protection) to 0.721 (representing complete vulnerability). $\log(l)$, where $l$ stands for the log-likelihood ratio, is measured in decibels (dB), with lower values corresponding to stronger privacy protection and 0 dB signifying absolute privacy preservation.
The experimental results for $D_{ECE}$ and $\log(l)$ are shown in Table 5. $D_{ECE}$ values remain consistently below 0.06, confirming strong privacy protection. The $\log(l)$ results further indicate superior privacy performance, with Grade A outcomes outnumbering Grade B, particularly in the o-a scenario. Notably, RP + GMM + SC achieves more Grade A outcomes than RP + GMM, indicating better privacy performance. The $\log(l)$ results also reflect that privacy protection varies among individuals, mainly due to the randomness in the GMM sampling. Overall, the systems show strong privacy preservation capabilities.
Furthermore, we conduct comparative analyses of the $D_{ECE}$ and $\log(l)$ results across different anonymization systems, with detailed comparisons presented in Figure 7 and Figure 8, respectively. The results demonstrate that B2 has significantly higher $D_{ECE}$ and $\log(l)$ values compared to all other anonymization systems, indicating its relatively weaker privacy protection capabilities. In the o-a scenario, the remaining systems show minimal differences in their $D_{ECE}$ values, while in the a-a scenario, our two anonymization systems demonstrate superior $D_{ECE}$, with RP + GMM + SC showing particularly outstanding results. The results of $\log(l)$ exhibit similar comparative patterns to those observed in Figure 7.
Linkability. Linkability serves as a metric to evaluate the distributional differences between matching and non-matching scores. The global linkability $D_{\leftrightarrow}^{\mathrm{sys}}$ represents a robust privacy measurement [57] that quantifies the degree of linkability, where lower values indicate superior privacy protection performance. This metric is calculated by averaging over all matching scores and ranges from zero to one, where the optimal value of zero signifies complete global identity unlinkability. Detailed information on linkability can be found in [57].
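As an illustration of how such a global linkability value can be obtained from two score sets, the sketch below follows the commonly used histogram-based formulation: a local linkability is computed per score bin from the likelihood ratio between matching and non-matching scores, and the result is averaged over the matching-score distribution. This is a simplified sketch under stated assumptions, not the evaluation code used in this work; the function name `global_linkability`, the bin count, and the prior ratio `omega` are hypothetical choices.

```python
def global_linkability(mated, non_mated, n_bins=20, omega=1.0):
    """Histogram-based estimate of global linkability in [0, 1].

    mated / non_mated: lists of comparison scores for matching and
    non-matching speaker pairs. omega: assumed prior ratio p(match)/p(non-match).
    """
    lo = min(min(mated), min(non_mated))
    hi = max(max(mated), max(non_mated))
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all scores are equal

    def histogram(scores):
        counts = [0] * n_bins
        for s in scores:
            idx = min(int((s - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(scores) for c in counts]

    p_m = histogram(mated)
    p_nm = histogram(non_mated)

    d_sys = 0.0
    for pm, pnm in zip(p_m, p_nm):
        if pm == 0.0:
            continue  # bin contributes nothing to the average over matching scores
        if pnm == 0.0:
            local = 1.0  # only matching scores fall here: fully linkable
        else:
            lr_w = omega * pm / pnm
            # Local linkability: posterior difference, clipped at zero.
            local = 2.0 * lr_w / (1.0 + lr_w) - 1.0 if lr_w > 1.0 else 0.0
        d_sys += pm * local
    return d_sys
```

When the two score distributions coincide, the estimate is 0 (unlinkable); when they do not overlap at all, it is 1 (fully linkable).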
The evaluation results of $D_{\leftrightarrow}^{\mathrm{sys}}$ are also presented in Table 5. The $D_{\leftrightarrow}^{\mathrm{sys}}$ values in the o-a scenario are generally lower than those in the a-a scenario, with most being around 0.1, reflecting stronger unlinkability.
Figure 9 provides an extensive comparison of linkability across anonymization systems. The results reveal that in the o-a scenario, all systems except the underperforming B2 show comparable linkability. Our two proposed systems (particularly RP + GMM + SC) demonstrate significant advantages, achieving the lowest linkability. From the perspective of the attack scenario, all systems demonstrate better unlinkability in the o-a scenario than in the a-a scenario, indicating more effective prevention of speaker identity linkage between original and anonymized utterances.
5.4.2. Gain of Voice Distinctiveness and De-Identification
Gain of voice distinctiveness ($G_{\mathrm{VD}}$) and de-identification (DeID) measure the levels of voice distinguishability and identity concealment before and after anonymization, respectively. Voice distinguishability requires that all utterances from the same speaker be converted to the same pseudo-speaker after anonymization, effectively preventing speaker confusion during multi-party conversations [57]. De-identification, also known as speaker concealment or voice disguise, aims to make it impossible to trace anonymized speech back to the original speakers. Both $G_{\mathrm{VD}}$ and DeID are calculated based on the voice similarity matrix [58,59]. $G_{\mathrm{VD}} = 0$ indicates that the pseudo-speakers' distinguishability is unchanged after anonymization, $G_{\mathrm{VD}} > 0$ indicates enhanced distinguishability (easier voice differentiation), and $G_{\mathrm{VD}} < 0$ indicates weakened distinguishability (increased voice similarity); DeID ranges from 0% (no de-identification) to 100% (complete de-identification).
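The following minimal sketch illustrates how both metrics can be derived from voice similarity matrices, using the diagonal-dominance formulation commonly adopted in the literature [58,59]. It assumes that the matrices are already available as square lists of lists (original-original `m_oo`, original-anonymized `m_oa`, anonymized-anonymized `m_aa`), that $G_{\mathrm{VD}}$ is expressed as ten times a base-10 logarithm of the dominance ratio, and that DeID is returned as a fraction rather than a percentage. All function names are hypothetical.

```python
import math


def diagonal_dominance(m):
    """Absolute gap between mean same-speaker and mean cross-speaker similarity."""
    n = len(m)
    if n < 2:
        raise ValueError("need at least two speakers")
    diag = sum(m[i][i] for i in range(n)) / n
    off = sum(m[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return abs(diag - off)


def deid(m_oo, m_oa):
    """De-identification: 0.0 (none) to 1.0 (complete)."""
    base = diagonal_dominance(m_oo)
    if base == 0.0:
        raise ValueError("original speakers are not distinguishable")
    return 1.0 - diagonal_dominance(m_oa) / base


def gain_voice_distinctiveness(m_oo, m_aa):
    """G_VD: 0 means unchanged, > 0 enhanced, < 0 weakened distinguishability."""
    base = diagonal_dominance(m_oo)
    anon = diagonal_dominance(m_aa)
    if base == 0.0 or anon == 0.0:
        raise ValueError("diagonal dominance must be non-zero")
    return 10.0 * math.log10(anon / base)
```

For example, if the anonymized-anonymized matrix has the same diagonal dominance as the original-original matrix, the gain is 0, and if the original-anonymized matrix has no dominant diagonal at all, DeID reaches 1.0 (100%).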
Table 6 presents the evaluation results for $G_{\mathrm{VD}}$ and DeID. In terms of voice distinctiveness, significant improvement is observed across most datasets, with $G_{\mathrm{VD}}$ reaching a peak value of 0.51. In practice, female speakers consistently achieve a higher $G_{\mathrm{VD}}$ than male speakers on the VCTK datasets, while the opposite trend holds for the LibriSpeech datasets. The observed gender-based differences in $G_{\mathrm{VD}}$ across datasets can be largely attributed to the acoustic characteristics of the datasets. The VCTK corpus contains clean, controlled studio recordings with consistent acoustic features, while LibriSpeech includes varied real-world audiobook recordings with more complex acoustic conditions. Regarding de-identification effectiveness, all anonymization systems achieve similarly high performance, with most DeID values around 98%.
Figure 10 and Figure 11 present comparative evaluations of $G_{\mathrm{VD}}$ and DeID performance across anonymization systems, respectively. Figure 10 clearly shows that our proposed method excels in voice distinctiveness, with RP + GMM + SC achieving even better results than RP + GMM. The main reason is that the proposed systems do not introduce additional identity information from other speakers. In Figure 11, the DeID results illustrate that, except for the slightly weaker performance of B2, all other systems perform similarly to B1. These results further validate the effectiveness of our proposed method in maintaining speech distinguishability while successfully concealing speaker identity.
5.5. Further Performance Comparison of the Latest Speaker Anonymization Methods
5.5.1. Objective Privacy and Utility Comparison
To further validate our proposed methods, we include the most recent speaker anonymization approaches and evaluate their privacy and utility under the a-a scenario on the LibriSpeech and VCTK datasets, where both ASVeval and ASReval settings align with those used in our work. The comparison shown in Table 7 focuses on three critical metrics: EER, WER, and $G_{\mathrm{VD}}$.
As shown in Table 7, among the anonymization methods, MUSA achieves the highest EER (48.29%) and the lowest WER (9.895%), demonstrating exceptional performance in both privacy and speech quality, but its $G_{\mathrm{VD}}$ of -2.78 reveals a strong degradation in the distinguishability of the anonymized voices. Rano offers competitive privacy (EER = 47.81%) but at the cost of an increased WER (11.91%); its moderately positive $G_{\mathrm{VD}}$ (0.39) indicates effective discrimination of pseudo-speaker identities in the pseudo-speaker space. SALT yields a comparable EER of around 45%, with a slightly better WER (10.77%) and a minimal $G_{\mathrm{VD}}$ change (-0.03), indicating limited distinguishability. Our proposed RP + GMM method improves upon SALT with a modestly increased EER (45.54%) and a reduced WER (10.46%), accompanied by a small positive $G_{\mathrm{VD}}$ (0.10), showing a controlled change in distinguishability. The enhanced version, RP + GMM + SC, further raises the EER to 46.83% and lowers the WER to 10.26%, with a slightly higher $G_{\mathrm{VD}}$ (0.12), confirming that SC enhances both privacy and voice distinctiveness without sacrificing speech quality.
In summary, while MUSA excels in privacy protection and speech fidelity, its negative $G_{\mathrm{VD}}$ limits its applicability in interactive, multi-speaker environments. Rano improves distinguishability but sacrifices speech quality. In contrast, our proposed RP + GMM + SC method balances all three core objectives: it provides strong privacy (its EER is close to that achieved by Rano and MUSA), preserves high speech intelligibility (its WER is lower than that achieved by Rano and SALT), and, crucially, sustains a positive $G_{\mathrm{VD}}$. Notably, although MUSA attains marginally superior performance in speech utility, our method avoids the severe negative impact on $G_{\mathrm{VD}}$ observed in MUSA, a critical limitation that impedes its real-world deployability. By ensuring that each original speaker is consistently and reliably transformed into a distinguishable pseudo-speaker, our approach directly addresses a key practical shortcoming of existing state-of-the-art methods. This ability to maintain $G_{\mathrm{VD}}$ while upholding privacy and speech clarity is essential for interactive, multi-speaker environments, where pseudo-speaker distinctiveness is indispensable for natural communication and auditory coherence. Consequently, our method offers a practical and competitive solution for speaker anonymization, particularly suited for real-world applications that demand the simultaneous fulfillment of privacy, utility, and distinguishability.
5.5.2. Subjective Naturalness and Intelligibility Evaluation
Objective evaluation metrics may not always accurately reflect the performance of speech models, given the complexity and sensitivity of the human auditory system. Subjective evaluation addresses this gap and remains as important as objective assessment. In speech-related tasks, the Mean Opinion Score (MOS) is a widely adopted subjective evaluation method. For speaker anonymization, naturalness and intelligibility are commonly assessed through human evaluation. Naturalness measures the degree to which processed speech sounds natural, with particular attention paid to any artifacts or degradation introduced by the anonymization tools. Intelligibility evaluates how clearly and easily the speech content can be understood.
In practice, listeners rate the naturalness of anonymized speech on a scale from one (completely unnatural) to five (completely natural). They also assess the intelligibility of individual samples—whether anonymized test utterances or original enrollment utterances—using the same five-point scale, where one denotes speech that is completely unintelligible and five denotes speech that is completely intelligible. Higher scores for naturalness and intelligibility correspond to better perceived speech quality.
In the experiment, ten original speech samples were selected. Each sample was anonymized using the RP + GMM method and four other baseline methods, resulting in a total of fifty anonymized speech samples. Each participant was required to listen to a single speech sample and rate its naturalness and intelligibility on a five-point scale (1 to 5). The evaluation results are presented in Table 8.
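For readers unfamiliar with MOS aggregation, the sketch below shows one common way to turn individual listener ratings into a mean score with an approximate 95% confidence half-width. It is an illustrative sketch only: the function name `mos_with_ci` is hypothetical, the normal-approximation constant 1.96 is an assumption, and the ratings in the usage example are made-up placeholders rather than data from this study.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, approximate confidence half-width)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0  # spread cannot be estimated from a single rating
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    # Placeholder ratings for one hypothetical system (not experimental data).
    example_ratings = [4, 3, 4, 5, 4, 3, 4, 4]
    mos, ci = mos_with_ci(example_ratings)
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the half-width alongside the mean helps indicate whether small MOS differences between systems are likely to be meaningful.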
As can be seen from Table 8, B1 performs the weakest in terms of both naturalness (3.15) and intelligibility (3.28). SALT and Rano show stable and similar performance across all metrics, while MUSA holds a slight advantage in naturalness (3.79). In contrast, the proposed RP + GMM + SC method achieves the best intelligibility score (3.85), with naturalness (3.76) also close to the highest level, demonstrating its overall strong performance and particularly notable advantages in terms of preserving speech clarity.
6. Conclusions
In this paper, a novel distinguishability-driven voice generation method for speaker anonymization based on random projection and GMM is proposed. Random projection reduces the dimensionality of X-vectors from an external speaker pool, effectively preserving their topological structure while significantly decreasing computational complexity. By independently generating random projection matrices, the proposed method eliminates reliance on an external speaker pool, thereby enhancing both the flexibility and the security of the anonymization framework. GMM is used as the generative model due to its superior distribution generalization capability, with model parameters optimized through grid search and GMM entropy evaluation. A random sampling strategy is implemented to prevent concentration in anonymized spatial distributions and to overcome limitations in the quantity of generated anonymous voices. Additionally, we introduce a similarity check mechanism that effectively protects privacy while ensuring sufficient distinguishability between the original and anonymized speech. The experimental results demonstrate that the proposed method effectively balances privacy preservation and speech utility, while significantly enhancing speaker distinguishability between the original speakers and their corresponding pseudo-speakers, as well as among different pseudo-speakers.
The proposed anonymization system can be integrated into real-world speech systems, such as voice assistants, telephony, and smart devices, by replacing user voice identity features in real time on the local or edge side. This approach not only meets the requirements of privacy regulations but also preserves speech content intelligibility and speaker distinguishability. With a generative architecture that has low external dependency and robustness against enhanced attacks, the system provides a viable and trustworthy technical pathway for natural speech interaction in highly privacy-sensitive scenarios, significantly enhancing the practical applicability of speaker anonymization technology.
Future research will focus on optimizing voice generation quality, including reducing WER and preserving prosodic information. The proposed method will also be extended to disentanglement-based speaker anonymization systems, supporting a broader range of downstream tasks. These improvements aim to enhance the practical utility of the anonymization framework while maintaining its strong privacy protection capability.