Article

Machine Unlearning for Speaker-Agnostic Detection of Gender-Based Violence Condition in Speech

by Emma Reyner-Fuentes 1,*, Esther Rituerto-González 2,3 and Carmen Peláez-Moreno 1

1 Department of Signal Theory and Communications, University Carlos III de Madrid, 28911 Leganés, Madrid, Spain
2 Department of Psychiatry and Psychotherapy, University Hospital, Ludwig-Maximilian-University of Munich, 80336 Munich, Germany
3 Partner Site Munich-Augsburg, German Center for Mental Health (DZPG), 80336 Munich, Germany
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12270; https://doi.org/10.3390/app152212270
Submission received: 20 October 2025 / Revised: 12 November 2025 / Accepted: 16 November 2025 / Published: 19 November 2025
(This article belongs to the Section Applied Biosciences and Bioengineering)

Featured Application

This work introduces a speaker-agnostic method for detecting the condition of gender-based violence (GBV) survivors from speech. By employing domain-adversarial learning, the proposed model reduces the influence of speaker identity and enhances the extraction of paralinguistic biomarkers related to trauma. The application of this approach is twofold: (i) it provides a privacy-preserving tool for early and non-invasive mental health screening in clinical contexts, and (ii) it enables integration into digital platforms such as helplines, virtual assistants, or telehealth services, supporting practitioners in the timely identification of GBV survivors and reducing the burden of underreporting.

Abstract

Gender-based violence is a pervasive social and public health issue that severely impacts women’s mental health, often leading to conditions such as anxiety, depression, post-traumatic stress disorder, and substance abuse. Identifying the combined presence of these mental health conditions may therefore help point to a victim of gender-based violence. While speech-based artificial intelligence tools appear to be a promising solution for mental health screening, their performance often deteriorates when encountering speech from previously unseen speakers, a sign that speaker traits may be confounding factors. This study introduces a speaker-agnostic approach to detecting the gender-based violence victim condition—defined as self-identified survivors who exhibit pre-clinical PTSD symptom levels—from speech, aiming to develop robust artificial intelligence models capable of generalizing across speakers. By employing domain-adversarial training, we reduce the influence of speaker identity on model predictions, achieving a 26.95% relative reduction in speaker identification accuracy while improving gender-based violence victim condition classification accuracy by 6.37% (relative). These results suggest that our models effectively capture paralinguistic biomarkers linked to the gender-based violence victim condition, rather than speaker-specific traits. Additionally, the model’s predictions show moderate correlation with pre-clinical post-traumatic stress disorder symptoms, supporting the relevance of speech as a non-invasive tool for mental health monitoring. This work lays the foundation for ethical, privacy-preserving artificial intelligence systems to support clinical screening of gender-based violence survivors.

1. Introduction

Gender-based violence (GBV)—encompassing physical, sexual, and psychological harm directed at individuals due to their gender—represents a widespread social and public health problem. Women and girls are disproportionately affected by GBV, with long-lasting consequences for their physical and mental well-being [1]. Research studies consistently link GBV with adverse psychological outcomes, including anxiety, depression, suicidal ideation, substance use, and, in particular, post-traumatic stress disorder (PTSD) [2,3,4,5,6]. PTSD is often the most prevalent mental health condition among GBV survivors [4,7,8], with symptoms that impair social functioning, daily activities, and overall quality of life.
According to the World Health Organization (WHO) and the United Nations (UN), a GBV victim is a woman subjected to violence that causes or risks causing physical, sexual, or psychological harm, including threats, coercion, or arbitrary restrictions of liberty, regardless of the context [9,10]. Some national legislations, such as Spain’s Organic Law 1/2004 [11], further recognize children who also suffer violence, inflicted by perpetrators with the aim of indirectly harming their mothers, as victims of “vicarious violence”.
Psychologists and social services professionals typically identify GBV victimization through clinical interviews or self-reported questionnaires, such as the EGS-R for PTSD assessment [12]. However, this process requires victims to disclose past abuse early, an obstacle given the stigma, fear, or denial that often surrounds these experiences. Under-reporting remains a serious challenge, leaving many cases unidentified [5,13,14].
In this context, speech-based artificial intelligence (AI) tools offer a promising, non-invasive approach to support mental health screening. These tools can be integrated into virtual assistants, therapy applications, or helplines to unobtrusively assess emotional and mental health states. Speech has already proven useful for diagnosing depression [15], suicide risk [16,17], and other physical and mental conditions [18,19]. Unlike traditional assessments, speech-based systems may reduce bias by not relying on explicit self-reports, thus improving accessibility and early detection.
Robust security is crucial when deploying such AI-enabled systems in sensitive environments, as devices often handle sensitive personal data that must be protected [20,21]. AI plays a key role by allowing real-time detection and decision-making while maintaining privacy and integrity. The architecture in [22] provides an example of how an AI-based IoT solution embedded in cyber–physical systems can balance security with real-time detection.
Machine learning (ML) and, more specifically, deep learning (DL) models can extract meaningful paralinguistic features from speech signals, such as tone, pitch, rhythm, and vocal variability, beyond linguistic content [23]. However, these models often encode speaker-specific traits, which can degrade models’ generalization and raise privacy concerns in clinical contexts. Although personalization may improve performance in tasks such as speech emotion recognition (SER) [24], it may introduce undesirable confounds and ethical issues in mental health applications.
Building on our previous work [25,26], we hypothesize that speaker identity contributes undesirably to GBV Victim Condition (GBVVC) detection models. Therefore, in order to attain a speaker-agnostic detector that can distinguish between GBV victims and non-victims, we explore the use of domain-adversarial training techniques to disentangle relevant from irrelevant speaker-related information.
Domain-Adversarial Neural Networks (DANNs) [27] approach domain adaptation by learning to ignore domain-sensitive characteristics that are inherently embedded in the data. These networks aim to actively learn feature representations that exclude information on the data domain. This methodology promotes the development of features that are discriminative for a main task while remaining domain-agnostic.
DANNs, as defined in [27], consist of two classifiers: one for the main classification task and another for domain classification. Figure 1 represents the original DANN architecture scheme. These classifiers share initial layers that shape the data representation. A gradient reversal layer is introduced between the domain classifier and the feature representation layers so that it forwards the data during forward propagation and reverses the gradient signs during backward propagation. In this manner, the network aims to minimize classification error for the main task while maximizing the error of the domain classifier, ensuring an effective and discriminative representation for the primary task, in which domain information is suppressed.
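For illustration, a minimal sketch of such a gradient reversal layer in TensorFlow/Keras (the framework used later in Section 4.2) is shown below; the closure-based structure and the scaling factor lam are choices of this sketch and are not taken from the original implementation.

```python
import tensorflow as tf

def make_grad_reverse(lam=1.0):
    """Return an op that is the identity in the forward pass and multiplies
    incoming gradients by -lam in the backward pass."""
    @tf.custom_gradient
    def grad_reverse(x):
        def grad(dy):
            return -lam * dy
        return tf.identity(x), grad
    return grad_reverse

class GradientReversalLayer(tf.keras.layers.Layer):
    """Keras layer wrapping the gradient-reversal op."""
    def __init__(self, lam=1.0, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam
        self._reverse = make_grad_reverse(lam)

    def call(self, inputs):
        return self._reverse(inputs)
```

Placed between the shared representation layers and the domain (speaker) classifier, such a layer leaves the forward pass untouched while flipping the sign of the domain gradients that reach the shared layers.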
In our specific context, the main or primary task entails GBVVC classification, while the domain classification refers to speaker identity. We aim to ensure that the feature representation remains discriminative for the primary task while eliminating any traces of speaker-identity information.
Speaker-agnostic models offer several advantages: they prevent the identification of individual speakers, thus preserving privacy in mental health applications. Additionally, these models are designed to be generalizable and independent of the speaker’s identity, making them effective for a wide range of users, even those never seen during training. By intentionally not learning speaker-specific information, they focus on improving performance for the task at hand, ensuring that the model’s resources are dedicated only to learning relevant information.
We evaluate our method on an extended version of the WEMAC database [28], which includes spontaneous speech samples and emotional self-assessments from women in response to audiovisual stimuli. The extended dataset comprises additional speech data from 39 GBV survivors, whose participation was ethically approved but whose data remains unpublished due to privacy concerns. All participants were screened with the EGS-R questionnaire, and only survivors with pre-clinical PTSD symptoms (scores ≤ 20) were included, ensuring the sample represented recovered individuals. For clarity, throughout this manuscript, we use the term GBVVC to denote a condition of prior victimization characterized by residual/pre-clinical PTSD symptomatology as measured by the EGS-R (≤20). The present work focuses on the detection of these subtle, persistent vocal signatures rather than on acute, clinically diagnosed PTSD. Despite this, our system was still able to detect these residual symptoms in the speech signal associated with past victimization. More details on the dataset and feature extraction methodology are presented in Section 4.1.
Our findings suggest that domain-adversarial learning effectively suppresses speaker-specific cues and enhances the models’ ability to focus on paralinguistic indicators of GBVVC. By promoting generalizability and privacy, this work contributes to the development of trustworthy AI tools for sensitive mental health applications.

State-of-the-Art in Speech-Based Mental Health Detection Models

Our study addresses the automatic detection of pre-clinical post-traumatic stress disorder (PTSD) symptoms in self-identified survivors of gender-based violence (GBV) using speech-only data. To our knowledge, apart from our prior work, no existing models have utilized audio data for the identification of GBVV, making the task itself a novel contribution to the field.
Most prior studies on PTSD detection using machine learning rely on textual data from clinical interviews. However, a growing body of evidence highlights the potential of voice-based biomarkers to identify trauma-related conditions. A systematic review [29] reported that speech-based machine learning approaches for trauma disorders typically achieve AUCs between 0.80 and 0.95, highlighting the diagnostic effectiveness of paralinguistic features.
More specifically, several studies have demonstrated this potential in the task of PTSD detection. One study analyzing acoustic markers from PTSD patient interviews (Clinician-Administered PTSD Scale for DSM-5, CAPS-5 [30]) achieved 89.1% accuracy and an AUC of 0.954 using a random-forest classifier [31]. Another study developed a multimodal deep learning model combining acoustic, visual, and linguistic features, reaching an AUC of 0.90 and an F1 score of 0.83 in distinguishing PTSD from major depressive disorder [32]. More recently, a study employing openSMILE features and random forest classification reported near-perfect performance (AUC = 0.97, accuracy = 0.92) in differentiating PTSD patients from controls [33].
Despite these advances, most speech-based datasets have focused on clinical or war veteran populations, leaving a critical gap in understanding pre-clinical and gender-specific manifestations of trauma. In contrast, our work represents the first effort to model GBV-linked psychological outcomes using audio-based methods, achieving an F1-score of 67%, and paving the way for non-invasive, privacy-preserving early digital mental health assessments.

2. Results

2.1. Prior Work

We previously explored the detection of the gender-based violence victim condition (GBVVC) from speech signals in our work [25,26]. In the first paper [25], we demonstrated that using paralinguistic features extracted from speech, it is possible to differentiate between victims and non-victims with an accuracy of 71.53% under a speaker-independent (SI) setting. In our subsequent work [26], we incorporated additional data and investigated the relationship between model performance and self-reported psychological and physical symptoms. The system achieved a user-level SI accuracy of 73.08% via majority voting over individual utterances. However, in both studies, although we employed a Leave-One-Subject-Out (LOSO) strategy, our hypothesis is that speaker-related traits remained entangled with the target GBVVC labels. This hypothesis was based on the consistently lower performance observed under the LOSO evaluation scheme compared to other data-splitting strategies, which we believe indicated that the model was inadvertently leveraging speaker-specific characteristics to perform the classification task. This potential confound motivated the present work, where we aim to remove speaker-specific information while preserving task-relevant patterns. As we will further discuss in Section 3, several authors have already warned about the prevalence of this kind of problem in speech-related research for health diagnosis, where data is less abundant than for other speech technologies.

2.2. Baseline Models

Since this is, to the best of the authors’ knowledge, the first study addressing this specific task, we established a set of baseline models to provide a reference performance and thereby contextualize the performance of the proposed Domain-Adversarial Model (DAM). Although our previous models [25,26] focused on GBVV classification, they were not designed with an adversarial structure to explicitly enforce speaker-invariant feature learning. Hence, they are not directly comparable to the DAM approach. The baseline models introduced here serve this role by providing a clear benchmark under similar experimental conditions.
To this end, we first trained two baseline models using speech embeddings: (1) the Isolated Condition Model (ICM), which predicts GBVVC status, and (2) the Isolated Speaker Model (ISM), which identifies the speaker. These models are trained on frame-level (FL) speech segments of 1 s duration. Majority voting (MV) is then applied at the user-level (UL) to obtain final predictions per subject.
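As a minimal illustration of this aggregation step (the function and variable names below are ours, not from the released code), majority voting over the frame-level predictions of each subject can be sketched as follows, with ties resolved in favor of the positive class:

```python
import numpy as np

def user_level_majority_vote(frame_preds, subject_ids):
    """Aggregate binary frame-level predictions into one label per subject.

    frame_preds: array of 0/1 predictions, one per 1-s frame.
    subject_ids: array of the same length identifying the subject of each frame.
    """
    user_labels = {}
    for subject in np.unique(subject_ids):
        votes = frame_preds[subject_ids == subject]
        # The subject is labelled with the most frequent frame-level prediction.
        user_labels[str(subject)] = int(votes.mean() >= 0.5)
    return user_labels

# Example: frames from two subjects.
preds = np.array([1, 1, 0, 0, 0, 1])
subjects = np.array(["s01", "s01", "s01", "s02", "s02", "s02"])
print(user_level_majority_vote(preds, subjects))  # {'s01': 1, 's02': 0}
```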
The ICM model achieves a frame-level accuracy of 58.53%, with a precision of 59.04%, recall of 59.28%, and F1-score of 59.16%. At the user-level, the ICM reaches an accuracy of 60.00%, with a precision of 61.90%, recall of 59.09%, and F1-score of 60.47%. These results are presented in the corresponding confusion matrices in Figure 2. The figure presents both user-level and frame-level confusion matrices for a visual assessment of model behavior across scales.
In contrast, the ISM model, designed to identify speaker identity from speech, achieves a frame-level accuracy of 91.34% (Table 1), underscoring the strong presence of speaker-specific information in the embeddings. However, since this model performs a multi-class classification task involving 78 different speakers, a confusion matrix or performance metrics such as precision, recall, or F1-score at the user-level are not meaningful or directly comparable. Each prediction corresponds to a specific speaker identity rather than a binary classification problem, thus precluding the application of majority voting or binary confusion-based metrics, as we did with the ICM.

2.3. Speaker-Agnostic GBVVC Detection

In this section, we introduce our Domain-Adversarial Model (DAM), specifically designed to remove speaker-dependent features from the learnt speech embeddings, thereby promoting the emergence of representations that are informative for GBVVC detection but invariant to speaker identity. This is achieved through a gradient reversal layer that enables adversarial training between the main GBVVC classification task and an auxiliary speaker-identification task.
Figure 3 presents the confusion matrices at both user and frame levels for the main task of the DAM. At the user-level, DAM achieves an accuracy of 64.10%, precision of 61.70%, recall of 74.36%, and F1-score of 67.44%. At the frame-level, the model yields an accuracy of 58.66%, precision of 59.57%, recall of 65.55%, and F1-score of 62.42%. These results reveal a more balanced classification performance compared to the ICM. Specifically, DAM improves upon the ICM by a relative +1.14% in frame-level accuracy (from 58.11% to 58.66%) and +8.47% in user-level accuracy (from 60.26% to 64.10%). It also outperforms the baseline in F1-score at both levels: +0.99 points at frame-level and +4.06 points at user-level, indicating improved balance between precision and recall.
Nonetheless, the dataset is perfectly balanced at the subject level and nearly balanced at the frame level (52.36% GBVVC vs. 47.64% non-GBVVC), which validates the use of accuracy as the primary evaluation metric. The results are reported in Table 1, with the mean and standard deviation of the accuracy across 78 LOSO folds, reflecting the real-world variability due to differences in recording quality, vocal traits, and symptom expression. High inter-subject variance is expected in this kind of clinical speech dataset, and the fact that our DAM approach reduces the standard deviation (STD) in the GBVVC classification task while increasing the STD in the speaker classification task highlights its effectiveness in promoting speaker-invariant representations and enhancing the model’s generalization capability.
Table 1 compares four models across two tasks: GBVVC detection and speaker identification. The first column shows the performance of the ICM, which acts as our non-adversarial baseline. While we previously reported better results [26], those models are not directly comparable with the ones presented here, which are trained in an adversarial learning fashion. Hence, the ICM serves as a fairer baseline for this analysis. The second column shows the results of the DAM, which, despite the adversarial unlearning of speaker identity, maintains and slightly improves GBVVC classification performance. This suggests that speaker-related information is not essential—or potentially even detrimental—for detecting the condition.
The third and fourth columns assess the extent to which speaker identity is removed from the learnt representations. The Isolated Speaker Model (ISM), trained directly to identify speakers, achieves 91.34% accuracy, confirming that speaker traits are easily learnable from feature embeddings. The fourth column shows the results for the Unlearnt Speaker Model (USM), which uses the same architecture but operates on the frozen embeddings from the DAM (see Figure 4). The USM, however, shows a steep drop in speaker identification accuracy to 66.72%, a relative degradation of 26.95%. This demonstrates the success of the adversarial strategy in attenuating speaker-specific information from the learnt feature embeddings.
We emphasize that speaker identification models cannot be evaluated under a true LOSO scheme because, by definition, they cannot classify previously unseen identities. Therefore, we use a subject-dependent evaluation strategy with fewer metrics for ISM and USM, ensuring that no segments from the same audio recording are shared between training and test partitions. This avoids data leakage while preserving speaker identity in the training data. Note that for the multi-class speaker identification task (ISM, USM), user-level metrics (Precision/Recall/F1) are not defined in the same way as for GBVVC (binary) and thus only accuracy is reported for speaker identification.
The attenuation of speaker information by over 26% does not harm GBVVC detection. On the contrary, it enhances generalization, yielding a notable 6.37% improvement in user-level MV accuracy. This strongly supports our hypothesis that speaker traits act as a confounding variable, and their removal allows the model to better attend to condition-relevant vocal biomarkers. We believe that refining our adversarial strategy to achieve even greater disentanglement of speaker identity could further amplify these gains in future iterations.
In addition, in the comparison of Precision versus Recall in Table 1, we observe a recall value moderately higher than precision (by approximately 5–10%), indicating that the models exhibit greater sensitivity than specificity. In such cases, the systems effectively identify most relevant instances but tend to generate a moderate proportion of false positives, reflecting a detection strategy oriented toward coverage rather than strict selectivity. Should the requirements of the application change (e.g., becoming time-sensitive or urgent), adjustments to the models should be made to achieve results better aligned with such needs.
To complement the quantitative comparison in Table 1, Figure 5 provides a visual summary of the performance differences between the non-adversarial (ICM) and domain-adversarial (DAM) models. At the frame-level (FL), both models show comparable results, with marginal gains under the adversarial configuration. However, at the user-level (UL), the domain-adversarial model yields consistent and more pronounced improvements across all metrics, particularly for Recall (+11.5%) and F1-Score (+7.6%). These results illustrate the advantage of domain-adversarial training in mitigating speaker-related bias and enhancing generalization to unseen speakers.
In summary, adversarial unlearning reduced speaker identification accuracy by 26.95% (ISM → USM) while producing a relative improvement of 6.37% in user-level GBVVC accuracy and an improvement of 7.64% in user-level GBVVC F1-score (ICM → DAM). This co-occurrence strongly supports the claim that suppressing speaker identity facilitates improved focus on condition-relevant vocal biomarkers.

2.4. The Correlation with EGS-R Score

To further understand what cues the model is leveraging to make GBVVC predictions, we explored the relationship between model performance and clinical symptomatology, as measured by the EGS-R score. This scale, ranging from 0 to 20 in our case, quantifies pre-clinical PTSD-related symptoms such as re-experiencing, emotional numbing, and hyperarousal.
In our previous work [26], we found that correctly classified GBVVC subjects tended to have higher EGS-R scores, suggesting that our models implicitly relied on speech markers associated with trauma-related symptomatology. Here, we replicate this analysis for both the ICM and DAM models by comparing the mean EGS-R scores of correctly and incorrectly classified GBVVC subjects.
Table 2 shows a statistically significant gap in symptom severity between correctly and incorrectly classified victims for both models (DAM: p = 0.0017, ICM: p = 0.0351, Student’s t-test). This shows that the mean EGS-R score for correctly classified subjects is significantly higher than that of incorrectly classified subjects. This reinforces the notion that our models are not making arbitrary predictions, but instead exploit subtle voice markers correlated with PTSD-like symptoms. Furthermore, the difference in EGS-R scores between correct and incorrect classifications increases under the DAM, although not significantly (p = 0.0711, Student’s t-test), indicating that adversarial training may have the potential to enhance the model’s focus on trauma-related vocal features once speaker confounds are removed.
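For reference, the reported p-values correspond to a two-sample Student’s t-test of the kind sketched below; the EGS-R score arrays here are hypothetical and serve only to illustrate the computation, not the study data:

```python
import numpy as np
from scipy import stats

# Hypothetical EGS-R scores (0-20) for illustration only.
egsr_correct = np.array([14, 16, 12, 18, 15, 17, 13])    # correctly classified victims
egsr_incorrect = np.array([8, 10, 9, 11, 7])              # incorrectly classified victims

# Two-sample Student's t-test comparing the group means.
t_stat, p_value = stats.ttest_ind(egsr_correct, egsr_incorrect)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```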
Taken together, these findings align with our broader hypothesis: the ability to detect GBVVC from speech is closely tied to the manifestation of early trauma symptoms, which can be encoded in paralinguistic cues such as pitch variability, jitter, or prosodic flattening. By eliminating speaker-dependent information, the DAM enables a clearer identification of these subtle cues, contributing to both improved accuracy and better interpretability grounded in clinical theory.

2.5. Feature Importance

To identify which acoustic features contributed most to the model’s predictions, we performed a SHAP (SHapley Additive exPlanations) analysis. The resulting summary plot is shown in Figure 6.
The SHAP analysis was conducted on a random subset of 200 samples from the full dataset. As observed, the mean Zero-Crossing Rate emerges as the most influential feature for the detection of both classes. Additionally, the plot is largely dominated by Mel-Frequency Cepstral Coefficients (MFCCs), highlighting their strong contribution to the classification process. Besides the standard deviation of the spectral roll-off, which also shows a significant impact for both classes, most of the top-ranking features correspond to average values rather than dispersion measures. Overall, this interpretability analysis provides insight into which acoustic descriptors are most discriminative between GBVVs and non-GBVVs, suggesting that the model primarily relies on spectral and cepstral dynamics to identify trauma-related vocal characteristics.
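As a rough sketch of how such an analysis can be set up with the shap package, one possible workflow is shown below; the helper name, the background-sample size, and the use of the model-agnostic KernelExplainer are assumptions of this illustration rather than a description of our exact pipeline:

```python
import numpy as np
import shap

def shap_summary(predict_fn, X, feature_names, n_samples=200, seed=0):
    """Summarize feature importance for a fitted classifier on a random subset.

    predict_fn: callable mapping a (n, n_features) array to class probabilities.
    X: full feature matrix; feature_names: list of descriptor names.
    """
    rng = np.random.default_rng(seed)
    subset = X[rng.choice(len(X), size=min(n_samples, len(X)), replace=False)]
    # Model-agnostic Kernel SHAP over the classifier's predicted probabilities,
    # using a small background sample to keep the computation tractable.
    explainer = shap.KernelExplainer(predict_fn, shap.sample(subset, 50))
    shap_values = explainer.shap_values(subset)
    shap.summary_plot(shap_values, subset, feature_names=feature_names)
```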

3. Discussion and Conclusions

The automatic detection of the gender-based violence victim condition (GBVVC) through speech analysis remains a nascent and underexplored research area. Most prior work in this field has focused on detecting GBV from textual data, particularly through the analysis of social media posts using natural language processing (NLP) techniques [34,35,36]. To the best of our knowledge, no previous studies have attempted to identify the condition of GBV victims directly from their speech, apart from our prior contributions [25,26].
In our earlier work, we demonstrated that machine learning (ML) models can distinguish between speech samples from GBV victims and non-victims. However, we identified a potential confounding factor: the models could rely on speaker identity cues, performing speaker identification alongside the detection of speech traits associated with the GBVVC. This risk is widely acknowledged in ML-based clinical diagnostics, especially in speech-related applications, where datasets are typically small and prone to overfitting to speaker-specific features [37,38]. As highlighted in [39], such biases present a significant barrier to the clinical applicability of ML technologies in mental health and diagnostic settings. To address this issue, we pursued a speaker-agnostic approach that mitigates the influence of speaker-specific information in the classification task. This direction aligns with emerging trends in ML such as “Machine Unlearning”, which aims to erase irrelevant or potentially biased information from model representations [40]. In our case, the objective was to remove speaker identity from the learnt embeddings, thereby ensuring that predictions of GBVVC are not driven by user-specific traits and enhancing model generalizability and privacy.
Inspired by domain-adversarial training methods previously used to disentangle speaker traits from emotion in speech emotion recognition (SER) tasks [41,42], we adapted the DANN (Domain-Adversarial Neural Network) framework to our task. By treating each speaker as a domain, we trained an encoder to learn features that are discriminative for GBVVC classification while being invariant to speaker identity. This was accomplished via adversarial training using a gradient reversal layer that discourages the encoder from encoding speaker-related information.
The implementation of this model achieved two main outcomes. First, it reduced the capacity of the model to identify speakers by 26.95%, validating that speaker-specific information had been effectively reduced. Second, this reduction coincided with a relative 6.37% improvement in GBVVC classification accuracy, as shown in Table 1. These results support our hypothesis that removing speaker-specific information allows the model to focus on more relevant, potentially clinical, acoustic patterns associated with the GBVVC. In doing so, we alleviate previous concerns about the model’s over-reliance on speaker identity and advance towards robust, speaker-independent speech biomarkers. Also, beyond improved technical performance, the adversarial unlearning approach we propose has a meaningful ethical benefit: by reducing speaker identity leakage from learned representations, the model reduces the risk of unauthorized re-identification and thereby supports privacy-preserving screening. This aligns with the objectives of Machine Unlearning, namely, removing or attenuating unwanted information from models, and with fairness goals by discouraging the misuse of identity cues. In sensitive deployments (helplines, clinical triage), the reduction of identity information can lower barriers for individuals seeking help and limit the risk of stigmatization or surveillance.
Additionally, the analysis of EGS-R scores (Table 2) offers further insights into the model’s behavior. The data suggest that the model performs better for victims exhibiting higher EGS-R scores, a scale associated with pre-clinical PTSD symptoms. Notably, the gap between correctly and incorrectly classified victims widens when speaker information is removed, suggesting that the model is increasingly relying on clinically relevant speech cues—rather than speaker identity—to make predictions. These findings strengthen the hypothesis that pre-clinical trauma manifestations may be encoded in acoustic patterns and can be detected by properly trained models. Furthermore, the SHAP feature importance analysis provides a deeper understanding of the acoustic features that drive the discrimination between GBVVs and non-GBVVs. The results reveal that the model relies primarily on spectral and cepstral descriptors—such as the mean and standard deviation of the spectral roll-off and several MFCCs—but also on the zero-crossing rate to differentiate between conditions. Overall, this interpretability analysis complements the EGS-R score analysis of the model’s behavior by clarifying which specific features contribute most to the decision process, offering a more transparent view of how the model performs GBVVC detection.
A key limitation of the present study is that the participant pool consisted exclusively of Spanish-speaking women residing in Spain, recruited in collaboration with local institutions. This linguistic and regional homogeneity prevents direct assessment of the model’s generalizability across languages, dialects, and cultural contexts. Moreover, in accordance with ethical requirements designed to prevent revictimization, only participants without a formal PTSD diagnosis (EGS-R ≤ 20) were included. Consequently, the full spectrum of clinical variability could not be captured. We hypothesize that the inclusion of individuals with higher symptom severity might enhance the detectability of GBVVC, as more pronounced trauma could yield stronger acoustic markers. Future studies should therefore aim to expand the dataset to multilingual and multicultural populations, encompassing a broader range of symptom severity, age groups, and socioeconomic backgrounds, to rigorously evaluate cross-linguistic robustness and fairness. Another important limitation is the relatively small number of participants in the victims’ group, which constrains both statistical power and the representativeness of the learned acoustic space. In addition, the current approach relies solely on paralinguistic acoustic features, omitting potentially informative physiological or behavioral modalities—such as heart-rate variability, galvanic skin response, linguistic cues, or facial expressions—that could be incorporated in a multimodal framework to enhance classification robustness and interpretability. From a modeling perspective, the Domain-Adversarial Model (DAM) architecture was intentionally kept simple to prioritize interpretability and training stability; however, more expressive architectures—such as transformer-based encoders, hierarchical attention mechanisms, or recurrent-convolutional hybrids—could better capture fine-grained temporal dynamics and yield further improvements in both speaker-independence and GBVVC detection accuracy. Finally, the use of fixed-length 1-s frames may discard longer-term dependencies relevant for condition identification; future work could explore variable-length or sequence-to-sequence modeling strategies.
The proposed domain-adversarial scheme also holds potential for broader applications beyond GBVVC. It may prove beneficial in other speech-based diagnostic tasks where speaker identity or other personal traits act as confounding variables, particularly in fairness-sensitive contexts such as disease detection across different demographic groups [43].
As a final note on security, we want to highlight the importance of privacy-preserving and generalizable AI for sensitive detection tasks. Integrating AI-based applications with IoT devices can enable scalable, real-time operation, but could also introduce important security and privacy concerns. By adopting speaker-agnostic approaches, models can provide accurate and fair assessments while safeguarding personal privacy. Ensuring the security of both the AI systems and the IoT infrastructures where they are deployed is, therefore, critical in the context of trustworthy and ethical mental health applications.
In summary, this work presents the first application of domain-adversarial learning to detect the GBVVC from speech, extending our prior efforts and contributing a novel approach to this under-investigated field. The proposed model achieves both technical and ethical objectives: it enhances classification performance while mitigating identity-related biases. These findings open promising avenues for the development of speech-based tools that are both effective and privacy-preserving, which is particularly valuable in sensitive domains such as GBV detection, mental health support, and non-invasive early diagnosis.

4. Materials and Methods

4.1. Database

The data used in this study stems from an extended version of the WEMAC database [28], a multimodal corpus for affective computing research. The original WEMAC dataset includes speech recordings and biosignals—Blood Volume Pulse (BVP), Galvanic Skin Response (GSR), and Body Temperature (BT)—from Spanish-speaking women with no known history of gender-based violence victimization (non-GBVV). These recordings were captured under controlled laboratory conditions while participants were exposed to 14 emotionally evocative audiovisual stimuli via a Virtual Reality (VR) headset. Participants also provided self-reported emotional annotations following each stimulus. The original participant pool comprised 100 non-GBVV women aged 20 to 77 years (mean = 39.92, SD = 14.26), evenly distributed across five predefined age groups: G1 (18–24), G2 (25–34), G3 (35–44), G4 (45–54), and G5 (≥55). Most participants were of Spanish nationality. Following each stimulus, participants responded to two open-ended questions regarding their emotional state, with their answers recorded for speech analysis.
In addition to this publicly available dataset, an ethically approved extension includes speech and questionnaire data from 39 women with a documented history of GBV victimization. For the purposes of this extension, participants were adults who were recruited from gender-based violence and mental health support centers. Potential participants were screened through an initial structured interview with a certified psychologist to verify eligibility and symptom severity. These participants were considered GBV victims based on (1) self-identification and (2) subsequent confirmation by a certified clinical psychologist specialized in trauma and GBV. Such confirmation resulted from a non-standardized psychological questionnaire designed ad hoc by the licensed psychologist. This instrument aimed to gather nuanced information on personal history, relationships, well-being, and trauma. Specifically, a subset of questions was derived from the standardized EGS-R questionnaire [12], a validated tool for assessing PTSD. To avoid revictimization, participants displaying symptoms of severe psychological distress, i.e., participants scoring above 20 on the EGS-R were excluded from the study. Additionally, all participants completed a biopsychosocial questionnaire that covered sociodemographic information, habits, and lifestyle aspects [44].
For classification purposes, we employed a balanced dataset comprising the 39 GBVV participants and an age-matched subset of 39 non-GBVV participants. To mitigate the imbalance in speech segment lengths between groups (GBVV responses tended to be longer), we subsampled the GBVV recordings by randomly removing one out of every four audio segments. All speech signals were resampled to 16 kHz and normalized using z-score normalization per speaker. The final dataset consisted of 19,128 GBVV samples and 17,400 non-GBVV samples, each 1 s in duration.
Acoustic features were extracted using the Python library librosa v0.11.0 (the feature extraction code is available in the Supplementary Materials, Section S5). From each 1-s sample, a total of 19 standard low-level descriptors were computed using a 20 ms window and 10 ms overlap. These include 13 Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, Zero Crossing Rate, Spectral Centroid, Spectral Roll-off, Spectral Flatness, and Pitch. For each descriptor, both the mean and standard deviation were calculated, resulting in a 38-dimensional feature vector per second, following the methodologies established in [25,26]. Other feature sets were not considered, as our previous ablation studies [26] demonstrated that librosa-based features yielded superior performance.
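Since the exact extraction script is provided in the linked repository, the snippet below is only an approximate re-implementation of the configuration described above (19 descriptors, 20 ms windows with 10 ms steps, mean and standard deviation per 1-s clip); in particular, the use of librosa.yin for pitch and the specific parameter values are assumptions of this sketch.

```python
import numpy as np
import librosa

def extract_features(y, sr=16000):
    """Return a 38-dim vector (mean and std of 19 low-level descriptors) for a 1-s clip."""
    frame, hop = int(0.020 * sr), int(0.010 * sr)  # 20 ms window, 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame, hop_length=hop)
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame, hop_length=hop)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=frame, hop_length=hop)
    flatness = librosa.feature.spectral_flatness(y=y, n_fft=frame, hop_length=hop)
    # Pitch estimated here with the YIN algorithm; the original pipeline may differ.
    f0 = librosa.yin(y, fmin=65, fmax=500, sr=sr, frame_length=frame, hop_length=hop)
    descriptors = np.vstack([mfcc, rms, zcr, centroid, rolloff, flatness, f0[np.newaxis, :]])
    # Mean and standard deviation of each descriptor over the clip -> 38 values.
    return np.concatenate([descriptors.mean(axis=1), descriptors.std(axis=1)])
```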

4.2. Proposed Model Architecture

The proposed model architecture was implemented in Python using the TensorFlow Keras library, with the code available in the Supplementary Materials, Section S5. It comprises three main components: an encoder, a speaker identification block (SPK), and a condition classification block (COND). The model was trained under a Leave-One-Subject-Out (LOSO) cross-validation scheme to ensure speaker independence.
  • Encoder. Projects the 38-dimensional acoustic features into a 128-dimensional embedding space. This component is optimized to extract information relevant for GBVV classification while removing speaker identity traits.
  • Speaker Identification Block (SPK). Trained to classify the speaker identity. It plays a key role during adversarial training by propagating gradients back through a gradient reversal layer, encouraging the encoder to eliminate speaker-specific information.
  • Condition Classification Block (COND). Classifies the speech sample as belonging to a GBVV or non-GBVV speaker.
A schematic representation of the full architecture, including the interconnection of blocks, is shown in the next section (Section 4.3.2, Figure 7), and layer-level details are available in the Supplementary Materials (Section S6, Figure S11). The hyperparameters used in training are reported in Table 3 and were selected based on a series of ablation studies designed to assess the contribution of different architectural and training choices. These studies, which are discussed in detail in the Supplementary Materials (Section S4), guided the optimization of the model.
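A minimal Keras sketch of this topology is given below; the hidden-layer sizes and activations are illustrative assumptions (the exact layer-level configuration is given in Supplementary Figure S11), the number of speaker classes corresponds to the 77 training speakers seen in each LOSO fold, and the GradientReversalLayer from the sketch in the Introduction is assumed to be in scope.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dam(n_features=38, emb_dim=128, n_speakers=77):
    """Sketch of the encoder + COND + SPK topology; hidden sizes are illustrative."""
    inputs = layers.Input(shape=(n_features,), name="acoustic_features")

    # Encoder: projects the 38-dim acoustic feature vector into a 128-dim embedding.
    x = layers.Dense(256, activation="relu")(inputs)
    embedding = layers.Dense(emb_dim, activation="relu", name="embedding")(x)

    # COND block: binary GBVV / non-GBVV classifier on top of the embedding.
    cond = layers.Dense(64, activation="relu")(embedding)
    cond_out = layers.Dense(2, activation="softmax", name="cond")(cond)

    # SPK block: multi-class speaker classifier reached through a gradient
    # reversal layer, so speaker gradients are inverted before the encoder.
    reversed_emb = GradientReversalLayer(lam=1.0)(embedding)
    spk = layers.Dense(64, activation="relu")(reversed_emb)
    spk_out = layers.Dense(n_speakers, activation="softmax", name="spk")(spk)

    # The lambda / (1 - lambda) trade-off of Section 4.3.2 can be applied through
    # the loss weights when compiling the model.
    return Model(inputs, [cond_out, spk_out], name="dam_sketch")

model = build_dam()
model.summary()
```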

4.3. Domain-Adversarial Training Strategy

The primary goal of this work is to assess whether it is possible to distinguish victims of gender-based violence (GBVV) from non-victims using speech-based features, without relying on speaker-specific information. To achieve this, we adopt a domain-adversarial training strategy that explicitly aims to preserve information relevant to the GBVV condition while suppressing speaker identity cues. This section details the proposed training approach.

4.3.1. Initialization Phase: Isolated Models

Before introducing adversarial training, the encoder, the condition classifier, and the speaker classifier are independently pre-trained. These initial models are referred to as the Isolated Condition Model (ICM) and the Isolated Speaker Model (ISM). Their architectures are illustrated in Figure 8.
The purpose of this initialization is twofold. First, it ensures that each module starts from a meaningful set of parameters, thereby facilitating convergence during adversarial training. Second, it establishes baseline models for speaker and condition classification, which are essential to rigorously assess the impact of domain-adversarial training.

4.3.2. Adversarial Training Procedure

Once the ICM and ISM are initialized, we proceed with the domain-adversarial training of the full model (Figure 7). Each training epoch is composed of three sequential steps: one domain step followed by two main steps.
1. Domain Step: Speaker Specialization
The goal of this step is to make the speaker block specialize in speaker identification while ensuring that the encoder does not contribute to this task. To achieve this, the encoder is frozen (non-trainable), and only the speaker classifier is trained using embeddings generated by the fixed encoder. The condition classifier is not used at this stage.
Trained in isolation, the speaker block learns to exploit whatever speaker information remains in the embeddings. This implicitly drives the adversarial process: in later steps, the encoder will be encouraged to remove this information. The speaker classification loss is computed using categorical cross-entropy over all training speakers (excluding the held-out test speaker), and backpropagation stops at the gradient reversal layer, preventing updates to the encoder.
2. Main Steps (x2): GBVV Detection and Speaker Information Removal
The subsequent two steps focus on the main task: detecting GBVV while suppressing speaker identity. Here, the encoder and condition classifier are both active, while the speaker block is frozen. The model is trained to predict the GBVV condition while unlearning speaker traits via the gradient reversal layer.
At the core of this mechanism is a custom loss function defined as:
L_T = λ · L_COND − (1 − λ) · L_SPK
where L_COND is the sparse categorical cross-entropy loss for GBVV condition classification, and L_SPK is the categorical cross-entropy loss for speaker classification. The hyperparameter λ regulates the trade-off between the two objectives; after several ablation studies detailed in the Supplementary Materials (Section S4), the parameter was set to λ = 0.2 to prioritize the removal of speaker information over condition classification.
The gradient reversal layer plays a key role in this setup: it inverts the gradient of the speaker loss before it reaches the encoder. Consequently, the encoder is trained to maximize speaker classification error (i.e., to confuse the speaker classifier), while minimizing the GBVV classification loss. This adversarial mechanism encourages the encoder to produce embeddings that are invariant to speaker identity and discriminative for the GBVV condition.
Iterative Optimization
This three-step process is repeated for each of the 100 training epochs. Each epoch comprises one domain step, where only the speaker classifier is trained and the encoder is frozen; followed by two consecutive main steps, where all components are updated adversarially based on the custom loss. This iterative approach promotes stable training and ensures a gradual removal of speaker information from the learnt embeddings, enabling speaker-independent detection of GBVV.
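The following simplified sketch illustrates this schedule; it assumes that the encoder, condition block, and speaker block are separate Keras sub-models, that labels are integer-encoded, and that batches is a re-iterable list of (features, condition labels, speaker ids) mini-batches. The gradient reversal is expressed here explicitly through the sign of the speaker term in the combined loss, which is equivalent to the gradient reversal layer used in the full model; the exact implementation is available in the repository cited in the Data Availability Statement.

```python
import tensorflow as tf

cce = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(1e-4)
lam = 0.2  # lambda in L_T = lam * L_COND - (1 - lam) * L_SPK

def domain_step(encoder, spk_head, x, speaker_ids):
    """Domain step: the encoder is frozen and only the speaker block is updated."""
    embeddings = encoder(x, training=False)
    with tf.GradientTape() as tape:
        spk_loss = cce(speaker_ids, spk_head(embeddings, training=True))
    grads = tape.gradient(spk_loss, spk_head.trainable_variables)
    optimizer.apply_gradients(zip(grads, spk_head.trainable_variables))
    return spk_loss

def main_step(encoder, cond_head, spk_head, x, cond_labels, speaker_ids):
    """Main step: the speaker block is frozen; encoder and condition block are updated."""
    with tf.GradientTape() as tape:
        embeddings = encoder(x, training=True)
        cond_loss = cce(cond_labels, cond_head(embeddings, training=True))
        spk_loss = cce(speaker_ids, spk_head(embeddings, training=False))
        # Subtracting the speaker loss pushes the encoder to increase speaker
        # confusion, which is what the gradient reversal layer achieves in the
        # full model graph.
        total_loss = lam * cond_loss - (1.0 - lam) * spk_loss
    variables = encoder.trainable_variables + cond_head.trainable_variables
    grads = tape.gradient(total_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return total_loss

def train_epoch(encoder, cond_head, spk_head, batches):
    """One epoch: one domain step followed by two main steps over the training data."""
    for x, _, speaker_ids in batches:          # domain step
        domain_step(encoder, spk_head, x, speaker_ids)
    for _ in range(2):                         # two consecutive main steps
        for x, cond_labels, speaker_ids in batches:
            main_step(encoder, cond_head, spk_head, x, cond_labels, speaker_ids)
```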

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app152212270/s1.

Author Contributions

Conceptualization: E.R.-F., E.R.-G. and C.P.-M.; Data curation: E.R.-F. and E.R.-G.; Formal analysis: E.R.-F., E.R.-G. and C.P.-M.; Funding acquisition: C.P.-M.; Investigation: E.R.-F. and E.R.-G.; Methodology: E.R.-F., E.R.-G. and C.P.-M.; Project administration: C.P.-M.; Resources: C.P.-M.; Software: E.R.-F. and E.R.-G.; Supervision: C.P.-M.; Validation: E.R.-F., E.R.-G. and C.P.-M.; Visualization: E.R.-F. and E.R.-G.; Writing—original draft: E.R.-F. and E.R.-G.; Writing—review and editing: E.R.-F., E.R.-G. and C.P.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by the SAPIENTAE4Bindi Grant PDC2021-121071-I00 funded by MICIU/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU/PRTR”; PID2021-125780NB-I00 funded by MICIU/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”; the Federal Ministry of Education and Research of Germany (Bundesministerium für Bildung und Forschung [BMBF]) and the Ministry of Bavaria within the initial phase of the German Center for Mental Health (DZPG), Grant: 01EE2303A.

Institutional Review Board Statement

The present study utilizes data collected under the same experimental settings as the publicly available WEMAC dataset [28]. The original data collection procedures for the WEMAC dataset, including the recruitment of human subjects, were rigorously reviewed and approved by the University Carlos III of Madrid Ethics and Data Protection Committee (Approval Code: CEI22_01_LOPEZ_Celia; Approval Date: 6 May 2022). The submission to the Ethical Committee covered essential topics such as the research goals and plans, the data management and pseudo-anonymization procedures, and the compliance with the European General Data Protection Regulation (GDPR), as fully detailed in the original data descriptor [28]. All investigations were conducted in accordance with the principles outlined in the Declaration of Helsinki (1975, revised in 2013).

Informed Consent Statement

Informed consent was obtained from all subjects, including those involved in the original WEMAC dataset collection. Participants were given detailed written and verbal information about the study objectives, procedures, potential risks and benefits, and their right to withdraw at any time. Detailed information about the procedure to request the withdrawal of all their data from every public data repository is also provided for up to five years after the signature of the informed consent, at which point all the data necessary to retrieve the identities of the participants will be destroyed. Additional care was taken to explain the potential emotional distress due to PTSD-related questions, and participants were provided with psychological support. This ensured participant privacy and autonomy for the secondary use of the anonymized data in this research.

Data Availability Statement

The data used in this study stems from an extended version of the WEMAC database [28], a multimodal corpus for affective computing research, https://edatos.consorciomadrono.es/dataverse/empatia, accessed on 11 November 2025. However, part of the dataset used here constitutes an independent, non-publicly released extension of the WEMAC database. The dataset used in this research includes sensitive information from participants with post-traumatic stress disorder (PTSD) pre-clinical symptoms. To protect participant privacy and comply with institutional ethics board requirements and, in particular, to guarantee the right of the participants to request the elimination of their data, individual-level data cannot be made publicly available. Requests for access to pseudo-anonymized and aggregated data may be directed to the corresponding author upon reasonable request and are subject to appropriate institutional ethical approval. The code supporting the findings of this study is openly available in the following GitHub repositories, regarding the Adversarial Training https://github.com/emmareyner/AdversarialTraining (accessed on 11 November 2025), and the Feature Extraction https://github.com/BINDI-UC3M/wemac_dataset_signal_processing (accessed on 11 November 2025).

Acknowledgments

The authors would like to thank the UC3M4Safety team and Eva Martínez Rubio, the psychologist in charge of the interviews with the GBV victims, for the insightful discussions on the GBV victim condition.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI	Artificial Intelligence
BVP	Blood Volume Pulse
COND	Condition Classification Block
DAM	Domain-Adversarial Model
DANN	Domain-Adversarial Neural Network
EGS-R	Escala de Gravedad de Síntomas Revisada
F1	F1-Score
FL	Frame-Level
GBV	Gender-Based Violence
GBVV	Gender-Based Violence Victim
GBVVC	Gender-Based Violence Victim Condition
GSR	Galvanic Skin Response
ICM	Isolated Condition Model
ISM	Isolated Speaker Model
LOSO	Leave-One-Subject-Out
ML	Machine Learning
MV	Majority Voting
NT	Not Trainable
PTSD	Post-Traumatic Stress Disorder
SER	Speech Emotion Recognition
SI	Speaker-Independent
SPK	Speaker Identification Block
STD	Standard Deviation
T	Trainable
UL	User-Level
USM	Unlearnt Speaker Model
VR	Virtual Reality
WEMAC	Women and Emotion Multi-modal Affective Computing dataset

References

  1. European Institute for Gender Equality (EIGE). What Is Gender-Based Violence? Available online: https://eige.europa.eu/gender-based-violence/what-is-gender-based-violence (accessed on 12 November 2025).
  2. Oram, S.; Khalifeh, H.; Howard, L.M. Violence against women and mental health. Lancet Psychiatry 2017, 4, 159–170. [Google Scholar] [CrossRef] [PubMed]
  3. Escribà-Agüir, V.; Ruiz-Pérez, I.; Montero, I.; Vives-Cases, C.; Plazaola-Castaño, J.; Martin-Baena, D. Partner Violence and Psychological Well-Being: Buffer or Indirect Effect of Social Support. Psychosom. Med. 2010, 72, 383–389. [Google Scholar] [CrossRef] [PubMed]
  4. Ferrari, G.; Agnew-Davies, R.; Bailey, J.; Howard, L.; Howarth, E.; Peters, T.J.; Sardinha, L.; Feder, G.S. Domestic violence and mental health: A cross-sectional survey of women seeking help from domestic violence support services. Glob. Health Action 2016, 9, 29890. [Google Scholar] [CrossRef] [PubMed]
  5. Spencer, C.N.; Khalil, M.; Herbert, M.; Aravkin, A.Y.; Arrieta, A.; Baeza, M.J.; Bustreo, F.; Cagney, J.; Calderon-Anyosa, R.J.C.; Carr, S.; et al. Health Effects Associated with Exposure to Intimate Partner Violence against Women and Childhood Sexual Abuse: A Burden of Proof Study. Nat. Med. 2023, 29, 3243–3258. [Google Scholar] [CrossRef]
  6. Haering, S.; Seligowski, A.V.; Linnstaedt, S.D.; Michopoulos, V.; House, S.L.; Beaudoin, F.L.; An, X.; Neylan, T.C.; Clifford, G.D.; Germine, L.T.; et al. Disentangling Sex Differences in PTSD Risk Factors. Nat. Mental Health 2024, 2, 605–615. [Google Scholar] [CrossRef]
  7. Chandan, J.; Thomas, T.; Bradbury-Jones, C.; Russell, R.; Bandyopadhyay, S.; Nirantharakumar, K.; Taylor, J. Female survivors of intimate partner violence and risk of depression, anxiety and serious mental illness. Br. J. Psychiatry J. Ment. Sci. 2019, 217, 562–567. [Google Scholar] [CrossRef]
  8. Shen, S.; Kusunoki, Y. Intimate partner violence and psychological distress among emerging adult women: A bidirectional relationship. J. Women’s Health 2019, 28, 1060–1067. [Google Scholar] [CrossRef]
  9. World Health Organization. Violence Against Women. 2024. Available online: https://www.who.int/news-room/fact-sheets/detail/violence-against-women (accessed on 12 November 2025).
  10. United Nations. Declaration on the Elimination of Violence Against Women. 1993. Available online: https://digitallibrary.un.org/record/179739 (accessed on 12 November 2025).
  11. Organic Law 1/2004 on Comprehensive Protection Measures Against Gender Violence. 2004. Available online: https://www.boe.es/eli/es/lo/2004/12/28/1/con (accessed on 12 November 2025).
  12. Echeburúa, E.; Amor, P.; Sarasua, B.; Zubizarreta, I.; Holgado-Tello, F.; Muñoz, J. Escala de Gravedad de Síntomas Revisada (EGS-R) del Trastorno de Estrés Postraumático según el DSM-5: Propiedades psicométricas. Ter. Psicol. 2016, 34, 111–128. [Google Scholar] [CrossRef]
  13. Shanmugam, D.; Hou, K.; Pierson, E. Quantifying Disparities in Intimate Partner Violence: A Machine Learning Method to Correct for Underreporting. npj Women’s Health 2024, 2, 15. [Google Scholar] [CrossRef]
  14. Chen, Z.; Ma, W.; Li, Y.; Guo, W.; Wang, S.; Zhang, W.; Chen, Y. Using Machine Learning to Estimate the Incidence Rate of Intimate Partner Violence. Sci. Rep. 2023, 13, 5533. [Google Scholar] [CrossRef] [PubMed]
  15. Koops, S.; Brederoo, S.G.; de Boer, J.N.; Nadema, F.G.; Voppel, A.E.; Sommer, I.E. Speech as a biomarker for depression. CNS Neurol. Disord. Drug Targets 2023, 22, 152–160. [Google Scholar] [CrossRef]
  16. Scherer, S.; Pestian, J.; Morency, L.P. Investigating the Speech Characteristics of Suicidal Adolescents. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 709–713. [Google Scholar] [CrossRef]
  17. Belouali, A.; Gupta, S.; Sourirajan, V.; Yu, J.; Allen, N.; Alaoui, A.; Dutton, M.A.; Reinhard, M.J. Acoustic and language analysis of speech for suicidal ideation among US veterans. BioData Min. 2021, 14, 11. [Google Scholar] [CrossRef] [PubMed]
  18. Latif, S.; Qadir, J.; Qayyum, A.; Usama, M.; Younis, S. Speech Technology for Healthcare: Opportunities, Challenges, and State of the Art. IEEE Rev. Biomed. Eng. 2020, 14, 342–356. [Google Scholar] [CrossRef]
  19. Wanderley Espinola, C.; Gomes, J.C.; Mônica Silva Pereira, J.; dos Santos, W.P. Detection of major depressive disorder, bipolar disorder, schizophrenia and generalized anxiety disorder using vocal acoustic analysis and machine learning: An exploratory study. Res. Biomed. Eng. 2022, 38, 813–829. [Google Scholar] [CrossRef]
  20. Abomhara, M.; Køien, G.M. Cyber Security and the Internet of Things: Vulnerabilities, Threats, Intruders and Attacks. J. Cyber Secur. Mobil. 2015, 4, 65–88. [Google Scholar] [CrossRef]
  21. Kumar, D.; Pawar, P.P.; Addula, S.R.; Meesala, M.K.; Oni, O.; Cheema, Q.N.; Haq, A.U.; Sajja, G.S. AI-Powered Security for IoT Ecosystems: A Hybrid Deep Learning Approach to Anomaly Detection. J. Cybersecur. Priv. 2025, 5, 90. [Google Scholar] [CrossRef]
  22. Miranda Calero, J.A.; Rituerto-González, E.; Luis-Mingueza, C.; Canabal, M.F.; Bárcenas, A.R.; Lanza-Gutiérrez, J.M.; Peláez-Moreno, C.; López-Ongil, C. Bindi: Affective Internet of Things to Combat Gender-Based Violence. IEEE Internet Things J. 2022, 9, 21174–21193. [Google Scholar] [CrossRef]
  23. Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; MüLler, C.; Narayanan, S. Paralinguistics in speech and language—State-of-the-art and the challenge. Comput. Speech Lang. 2013, 27, 4–39. [Google Scholar] [CrossRef]
  24. La Mura, M.; Lamberti, P. Human-Machine Interaction Personalization: A Review on Gender and Emotion Recognition Through Speech Analysis. In Proceedings of the 2020 IEEE International Workshop on Metrology for Industry 4.0 & IoT, Rome, Italy, 3–5 June 2020; pp. 319–323. [Google Scholar] [CrossRef]
  25. Reyner Fuentes, E.; Rituerto González, E.; Mingueza, C.L.; Peláez Moreno, C.; López Ongil, C. Detecting Gender-Based Violence Aftereffects from Emotional Speech Paralinguistic Features. In Proceedings of the Proc. IberSPEECH 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 96–100. [Google Scholar] [CrossRef]
  26. Reyner-Fuentes, E.; Rituerto-González, E.; Trancoso, I.; Peláez-Moreno, C. Prediction of the Gender-based Violence Victim Condition using Speech: What do Machine Learning Models rely on? In Proceedings of the INTERSPEECH 2023, Dublin, Ireland, 20–24 August 2023; pp. 1768–1772. [Google Scholar] [CrossRef]
  27. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. arXiv 2016, arXiv:1505.07818. Available online: http://arxiv.org/abs/1505.07818 (accessed on 15 November 2025). [CrossRef]
  28. Miranda Calero, J.A.; Gutiérrez-Martín, L.; Rituerto-González, E.; Romero-Perales, E.; Lanza-Gutiérrez, J.M.; Peláez-Moreno, C.; López-Ongil, C. WEMAC: Women and Emotion Multi-modal Affective Computing Dataset. Sci. Data 2024, 11, 1182. [Google Scholar] [CrossRef] [PubMed]
  29. Iyortsuun, N.K.; Kim, S.H.; Jhon, M.; Yang, H.J.; Pant, S. A Review of Machine Learning and Deep Learning Approaches on Mental Health Diagnosis. Healthcare 2023, 11, 285. [Google Scholar] [CrossRef]
  30. Weathers, F.W.; Blake, D.D.; Schnurr, P.P.; Kaloupek, D.G.; Marx, B.P.; Keane, T.M. The Clinician-Administered PTSD Scale for DSM-5 (CAPS-5). 2013. Available online: https://www.ptsd.va.gov/ (accessed on 15 November 2025).
  31. Marmar, C.; Brown, A.; Qian, M.; Laska, E.; Siegel, C.; Li, M.; Abu-Amara, D.; Tsiartas, A.; Richey, C.; Smith, J.; et al. Speech-based markers for posttraumatic stress disorder in US veterans. Depress. Anxiety 2019, 36, 607–616. [Google Scholar] [CrossRef] [PubMed]
  32. Schultebraucks, K.; Yadav, V.; Shalev, A.; Bonanno, G.; Galatzer-Levy, I. Deep learning-based classification of posttraumatic stress disorder and depression following trauma utilizing visual and auditory markers of arousal and mood. Psychol. Med. 2020, 52, 957–967. [Google Scholar] [CrossRef]
  33. Hu, J.; Zhao, C.; Shi, C.; Zhao, Z.; Ren, Z. Speech-based recognition and estimating severity of PTSD using machine learning. J. Affect. Disord. 2024, 362, 859–868. [Google Scholar] [CrossRef] [PubMed]
  34. Castorena, C.M.; Abundez, I.M.; Alejo, R.; Granda-Gutiérrez, E.E.; Rendón, E.; Villegas, O. Deep Neural Network for Gender-Based Violence Detection on Twitter Messages. Mathematics 2021, 9, 807. [Google Scholar] [CrossRef]
  35. Subramani, S.; Michalska, S.; Wang, H.; Du, J.; Zhang, Y.; Shakeel, H. Deep Learning for Multi-Class Identification from Domestic Violence Online Posts. IEEE Access 2019, 7, 46210–46224. [Google Scholar] [CrossRef]
  36. Yallico Arias, T.; Fabian, J. Automatic Detection of Levels of Intimate Partner Violence Against Women with Natural Language Processing Using Machine Learning and Deep Learning Techniques. In Proceedings of the Information Management and Big Data; Lossio-Ventura, J.A., Valverde-Rebaza, J., Díaz, E., Muñante, D., Gavidia-Calderon, C., Valejo, A.D.B., Alatrista-Salas, H., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 189–205. [Google Scholar]
  37. Berisha, V.; Krantsevich, C.; Stegmann, G.; Hahn, S.; Liss, J. Are Reported Accuracies in the Clinical Speech Machine Learning Literature Overoptimistic? In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 2453–2457. [Google Scholar] [CrossRef]
  38. Rutowski, T.; Harati, A.; Shriberg, E.; Lu, Y.; Chlebek, P.; Oliveira, R. Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 3343–3347. [Google Scholar] [CrossRef]
  39. Milling, M.; Pokorny, F.B.; Bartl-Pokorny, K.D.; Schuller, B.W. Is Speech the New Blood? Recent Progress in AI-Based Disease Detection From Audio in a Nutshell. Front. Digit. Health 2022, 4, 886615. [Google Scholar] [CrossRef]
  40. Xu, H.; Zhu, T.; Zhang, L.; Zhou, W.; Yu, P.S. Machine Unlearning: A Survey. ACM Comput. Surv. 2023, 56, 9:1–9:36. [Google Scholar] [CrossRef]
  41. Li, H.; Tu, M.; Huang, J.; Narayanan, S.; Georgiou, P. Speaker-Invariant Affective Representation Learning via Adversarial Training. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7144–7148. [Google Scholar] [CrossRef]
  42. Han, J.; Zhang, Z.; Cummins, N.; Schuller, B. Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives. arXiv 2018, arXiv:1809.08927. Available online: http://arxiv.org/abs/1809.08927 (accessed on 15 November 2025). [CrossRef]
  43. Li, B.; Shi, X.; Gao, H.; Jiang, X.; Zhang, K.; Harmanci, A.O.; Malin, B. Enhancing Fairness in Disease Prediction by Optimizing Multiple Domain Adversarial Networks. PLoS Digit. Health 2025, 4, e0000830. [Google Scholar] [CrossRef]
  44. Miranda Calero, J.A.; Gutiérrez Martín, L.; Martínez Rubio, E.; Blanco Ruiz, M.; Sainz de Baranda Andújar, C.; Romero Perales, E.; San Segundo Manuel, R.; López Ongil, C. UC3M4Safety Database—WEMAC: Biopsychosocial Questionnaire and Informed Consent; Consorcio Madroño: Madrid, Spain, 2022. [Google Scholar] [CrossRef]
Figure 1. Original Domain-Adversarial Neural Network architecture [27].
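The DANN of [27], sketched in Figure 1, hinges on a gradient reversal layer: an identity mapping in the forward pass whose gradient is multiplied by a negative factor during backpropagation, so that the shared feature extractor is pushed away from representations that help the domain classifier. The following is a minimal PyTorch-style sketch of such a layer; the class name, variable names, and λ value are illustrative placeholders, not the code used in this work.

```python
import torch


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient discourages the shared feature extractor from
        # encoding information that is useful to the domain (speaker) classifier.
        return -ctx.lambda_ * grad_output, None


def grad_reverse(x, lambda_=1.0):
    """Placed between the feature extractor and the domain classifier head."""
    return GradientReversal.apply(x, lambda_)


# Toy usage: features pass through unchanged, gradients come back reversed.
features = torch.randn(4, 8, requires_grad=True)
reversed_features = grad_reverse(features, lambda_=0.2)
```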
Figure 2. Confusion matrices for the Isolated Condition Model (ICM).
Figure 3. Confusion matrices for the Domain-Adversarial Model (DAM).
Figure 4. Unlearnt Speaker Model architecture. The thinner black arrows correspond to the forward pass; the thicker, colored arrows correspond to the backpropagation steps.
Figure 5. Non-Adversarial baseline comparison: bar plot comparing all metrics between the ICM and DAM models, both at frame-level (FL) and user-level (UL). Each bar represents the mean performance for accuracy, precision, recall, and F1-Score. Percentage labels indicate the relative improvement achieved by DAM over ICM.
Figure 6. Summary plot of the SHAP feature importance analysis for the Domain-Adversarial Model (DAM). Red points indicate feature contributions towards the prediction of non-GBVVs (Class 0), while blue points represent contributions towards GBVVs (Class 1).
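Summary plots like the one in Figure 6 are typically produced with the shap Python package. The snippet below is a self-contained toy illustration with synthetic data and a placeholder classifier; it only shows the plotting mechanics, not the features, model, or data of this study.

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 "frames" with 10 placeholder paralinguistic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
feature_names = [f"feat_{i}" for i in range(10)]

clf = LogisticRegression().fit(X, y)

# Kernel SHAP over the predicted probability of the positive (GBVV) class.
explainer = shap.KernelExplainer(lambda a: clf.predict_proba(a)[:, 1], shap.sample(X, 50))
shap_values = explainer.shap_values(X[:50])

# Bee-swarm summary of per-feature contributions, analogous to Figure 6.
shap.summary_plot(shap_values, X[:50], feature_names=feature_names)
```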
Figure 7. Domain-Adversarial Model Architecture. Black arrows indicate the forward pass; colored arrows denote backpropagation. Solid blocks are trainable in the respective step, while dashed blocks are frozen. Top: domain step. Bottom: main step.
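The alternation depicted in Figure 7 can be summarised, under our own assumptions about which blocks are updated in each step, as a domain step that trains the speaker head on frozen shared features, followed by a main step that updates the shared encoder and condition head while treating the speaker head as a frozen adversary. The PyTorch-style sketch below is schematic; module sizes, optimizer settings, and the exact freezing scheme are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative modules only; feature size (88), number of speakers (30), learning
# rates, and momentum are placeholders, not the configuration used in the paper.
encoder = nn.Sequential(nn.Linear(88, 64), nn.ReLU())   # shared feature extractor
cond_head = nn.Linear(64, 2)                             # GBVVC condition classifier
spk_head = nn.Linear(64, 30)                             # speaker (domain) classifier

ce = nn.CrossEntropyLoss()
opt_spk = torch.optim.SGD(spk_head.parameters(), lr=1e-3, momentum=0.9)
opt_main = torch.optim.SGD(
    list(encoder.parameters()) + list(cond_head.parameters()), lr=1e-3, momentum=0.9
)
lambda_ = 0.2  # adversarial weight (cf. Table 3)


def train_batch(x, y_cond, y_spk):
    # Domain step: update only the speaker head on features from the frozen encoder.
    with torch.no_grad():
        z = encoder(x)
    loss_spk = ce(spk_head(z), y_spk)
    opt_spk.zero_grad()
    loss_spk.backward()
    opt_spk.step()

    # Main step: update encoder + condition head against the frozen speaker head;
    # subtracting the speaker loss pushes the encoder to unlearn speaker identity.
    z = encoder(x)
    loss = ce(cond_head(z), y_cond) - lambda_ * ce(spk_head(z), y_spk)
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
    return loss.item()


# Toy batch to exercise both steps once.
x = torch.randn(16, 88)
y_cond = torch.randint(0, 2, (16,))
y_spk = torch.randint(0, 30, (16,))
train_batch(x, y_cond, y_spk)
```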
Figure 8. Isolated Model Architectures. Black, thinner arrows indicate the forward pass. Thicker, colored arrows denote backpropagation flows. Top: Isolated Condition Model (ICM). Bottom: Isolated Speaker Model (ISM).
Table 1. Frame-level (FL) and user-level (UL) metrics for condition classification (ICM, DAM) and classification accuracy for the speaker identification tasks (ISM, USM). For ICM and DAM, we report accuracy, precision, recall, and F1-score, together with the relative increment (%) per metric when using DAM. For ISM/USM (speaker identification), we report frame-level accuracy with its standard deviation; user-level confusion metrics are not applicable to multi-class speaker identification and are therefore omitted.
| Level | Metric | ICM (GBVVC) | DAM (GBVVC) | Relative Increment (%) | ISM (Speaker ID) | USM (Speaker ID) |
|---|---|---|---|---|---|---|
| FL | Mean Accuracy | 58.11 ± 31.85 | 58.66 ± 31.52 | 0.95 | 91.34 ± 3.72 | 66.72 ± 6.71 |
| FL | Precision | 58.18 | 59.57 | 2.39 | – | – |
| FL | Recall | 64.87 | 65.55 | 1.05 | – | – |
| FL | F1-Score | 61.43 | 62.42 | 1.61 | – | – |
| UL | Accuracy | 60.26 | 64.10 | 6.37 | – | – |
| UL | Precision | 59.09 | 61.70 | 4.42 | – | – |
| UL | Recall | 66.67 | 74.36 | 11.53 | – | – |
| UL | F1-Score | 62.65 | 67.44 | 7.65 | – | – |
Table 2. Mean EGS-R score for GBVVC subjects by classification outcome and model. EGS-R is only available for GBVVC participants.
|  | DAM: Correctly Classified | DAM: Misclassified | ICM: Correctly Classified | ICM: Misclassified |
|---|---|---|---|---|
| Mean EGS-R | 10.52 | 6.90 | 10.35 | 8.07 |
| Scaled (%) | 52.59 | 34.50 | 51.73 | 40.39 |
Table 3. Training hyperparameters of the proposed models.
| Hyperparameter | Value |
|---|---|
| Number of epochs | 100 |
| Optimizer | Momentum SGD |
| Batch size | 16 |
| Lambda (λ) | 0.2 |
| Starter learning rate | 10⁻⁹ |
| Decay steps | 10,000 |
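For illustration only, the hyperparameters of Table 3 could be wired into a PyTorch training setup as follows; the model, the momentum value, and the decay rule (step-wise decay with an assumed factor) are our own assumptions, not the authors' code.

```python
import torch

# Placeholder model; the architecture used in the paper differs.
model = torch.nn.Linear(88, 2)

# Momentum SGD with the starter learning rate of Table 3 (momentum value assumed).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-9, momentum=0.9)

# Learning-rate decay every 10,000 steps; the decay factor (gamma) is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.96)

num_epochs = 100   # cf. Table 3
batch_size = 16    # cf. Table 3
lambda_ = 0.2      # adversarial weight, cf. Table 3
```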