Article

Novel Speech Recognition Systems Applied to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper

by Juan Camilo Vásquez-Correa * and Aitor Álvarez Muniain *
Fundación Vicomtech, Basque Research and Technology Alliance (BRTA), Mikeletegi 57, 20009 Donostia-San Sebastián, Spain
* Authors to whom correspondence should be addressed.
Sensors 2023, 23(4), 1843; https://doi.org/10.3390/s23041843
Submission received: 21 December 2022 / Revised: 1 February 2023 / Accepted: 2 February 2023 / Published: 7 February 2023
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)

Abstract:
The growth in online child exploitation material is a significant challenge for European Law Enforcement Agencies (LEAs). One of the most important sources of such online information corresponds to audio material that needs to be analyzed to find evidence in a timely and practical manner. That is why LEAs require a next-generation AI-powered platform to process audio data from online sources. We propose the use of speech recognition and keyword spotting to transcribe audiovisual data and to detect the presence of keywords related to child abuse. The considered models are based on two of the most accurate neural-based architectures to date: Wav2vec2.0 and Whisper. The systems were tested under an extensive set of scenarios in different languages. Additionally, keeping in mind that data from LEAs are highly sensitive and difficult to obtain, we explore the use of federated learning to provide more robust systems for the addressed application, while maintaining the privacy of the data from LEAs. The considered models achieved a word error rate between 11% and 25%, depending on the language. In addition, the systems are able to recognize a set of spotted words with true-positive rates between 82% and 98%, depending on the language. Finally, the federated learning strategies show that they can maintain and even improve the performance of the systems when compared with centrally trained models. The proposed systems set the basis for an AI-powered platform for automatic analysis of audio in the context of forensic applications related to child abuse. The use of federated learning is also promising for the addressed scenario, where data privacy is an important issue to be managed.

1. Introduction

The growth in online child exploitation and abuse material is a significant challenge for European Law Enforcement Agencies (LEAs). Currently, the review of online child abuse material exceeds the capacity of LEAs to respond in a practical and timely manner. One of the most important sources of information that needs to be analyzed to find evidence about child abuse is the audiovisual material from multimedia content. With the aims of safeguarding victims, prosecuting offenders, and limiting the spread of online child abuse-related material, LEAs need a next-generation AI-powered platform to process multimedia data from online sources. One of the main goals of the GRACE project (https://www.grace-fct.eu/ accessed on 1 February 2023) is to develop robust AI-based technology to equip LEAs with such a platform. Two core applications to be incorporated, in order to accurately transcribe audiovisual online material and to detect the presence of specific keywords about child abuse in the transcriptions, are automatic speech recognition (ASR) and keyword spotting (KWS).
Within this context, ASR technology has been applied in various forensic scenarios—for instance, to collect evidence via the examination of electronic devices [1] or to analyze multimedia content related to specific threats [2,3]. Nevertheless, the successful implementation of an ASR system in forensics introduces a series of issues to be solved, which are not present in other domains where ASR is applied. For instance, it is common to find audio coming from different sources, which are highly affected by background noise, overlapping speakers, and audio reverberation, among other factors. All these aspects affect the quality of the obtained transcription and the capability of the system to detect specific keywords.
Despite the aforementioned problems, recent advances in ASR have introduced novel end-to-end architectures [4] that have been shown to be accurate enough under such adverse conditions. The core idea of end-to-end models is to directly map the input speech signal to character sequences, thereby greatly simplifying training, fine-tuning, and inference [5,6,7,8,9]. Two main approaches are distinguished in the literature to train end-to-end ASR systems: fully supervised and self-supervised models. Regarding the first group, NVIDIA proposed Quartznet [10] with the aim of building a competitive but lighter end-to-end ASR model. The architecture consists of multiple blocks of 1D convolutions stacked with residual connections. The model has been trained and tested on the Common Voice corpus, achieving word error rates (WERs) between 7.7% and 12.5%, depending on the language [11]. A Quartznet model also produced WERs of 19.2% and 18.3% on French and Spanish multimedia data, respectively, from the MediaSpeech corpus [12]. Researchers from NVIDIA recently proposed Citrinet [13] as an evolution of Quartznet. The model consists of a residual network formed by 1D time-channel separable convolutions combined with a sub-word encoding and a squeeze-and-excitation mechanism [14]. The authors reported a WER of 5.6% on the TEDLIUMv2 corpus. Another architecture that has proven to be accurate in many ASR benchmark scenarios is the recurrent neural network transducer (RNN-T) [15]. The RNN-T is formed by three main blocks: (1) an encoder network that receives input acoustic frames and produces high-level speech representations, (2) a predictor that acts as a decoder by processing the previously produced token, and (3) a joint network that combines the outputs of the two previous blocks and produces the distribution of the next predicted token or blank symbol. Recent models based on RNN-T achieved a WER of 14.0% on the TEDLIUMv2 corpus [16].
In contrast to fully supervised models, recent studies have focused on the use of large acoustic models trained with self-supervised learning methods and a large amount of unlabeled data. Researchers from Meta AI demonstrated the capabilities of models of this type by introducing Wav2Vec2.0 [17]. This system outperformed many benchmark results, especially when considering ASR for low-resource languages in the Common Voice corpus [18]. In particular, the authors in [19] considered a Wav2vec2.0 model combined with their proposed language modeling approach and achieved state-of-the-art results on the German Common Voice corpus, with a WER of 3.7%. Wav2Vec2.0-based models have also been successfully tested in more adverse acoustic environments, such as the multimedia Portuguese data from the CORAA database [20]. For these reasons, Wav2Vec2.0 has become one of the most widely considered neural-based models for ASR. Self-supervised approaches such as Wav2Vec2.0 are challenging to train because there is no predefined lexicon for the input sound units during the pre-training phase. Moreover, sound units have variable length with no explicit segmentation [21]. With the aim of solving such issues, Meta AI released HuBERT as a new approach to learn self-supervised speech representations [22]. The combination of convolutional and transformer networks from Wav2Vec2.0 and HuBERT has achieved state-of-the-art results in many ASR scenarios. With the aim of combining the best features of both types of networks in a single neural block, researchers from Google introduced the “convolution-augmented transformer”, or Conformer [23]. A Conformer network achieved a WER of 7.2% on the TEDLIUMv2 corpus [24].
Self-supervised audio encoders such as Wav2Vec2.0, HuBERT, and the Conformer learn high-quality audio representations. However, due to their self-supervised pre-training, they lack a proper decoder to transform such representations into usable outputs. This is why a fine-tuning stage is always necessary to obtain accurate models for ASR or audio classification. With the aim of solving this issue, researchers from OpenAI recently proposed “Whisper” [25]. Whisper is a sequence-to-sequence transformer trained in a fully supervised manner, using up to 680,000 h of labeled audio from the Internet. The model has achieved state-of-the-art WER results on many benchmark datasets for ASR, including Librispeech, TEDLIUM, and Common Voice, among others.
Two main issues appear when designing ASR solutions for forensic scenarios. The first is finding the most appropriate neural architecture, among those previously described, to deal with different acoustic environments. The second is related to data privacy and protection [26]. Generally, obtaining operative data from LEAs for the addressed scenario is not possible. In this context, federated learning (FL) has emerged as an alternative with which to train machine learning models on remote devices, such as mobile phones and remote data centers, in a non-centralized manner, preserving data privacy [27,28,29,30]. The procedure is as follows: LEAs’ operative data are stored in on-premise data servers. FL strategies then transfer only local model updates to a central server, keeping LEAs’ data private. The central server aggregates the information obtained from multiple clients, i.e., LEAs, and updates a central model that is transmitted back to the clients for their consumption. FL has been applied to train robust federated acoustic models for ASR [31,32,33] and KWS [34]. In [32], the authors proposed a client-adaptive federated training scheme to mitigate data heterogeneity when training ASR models. The proposed system achieved a WER similar to that obtained using fully centralized training. In [33], the authors proposed a strategy to compensate for non-independent and identically distributed (non-IID) data in federated training of ASR systems. The proposed strategy involved random client data sampling, which resulted in a cost-quality trade-off. The optimization of such a trade-off led to ASR systems with WERs similar to those obtained by centrally trained systems. The authors in [34] demonstrated the capabilities of federated training to obtain robust KWS systems locally trained on edge devices such as smartphones, reaching accuracies similar to those of centrally trained models.
According to the reviewed literature, the two main paradigms and solutions for ASR to date include self-supervised models based on Wav2Vec2.0 and fully supervised models such as Whisper. This work considered and compared these two approaches to test their capabilities to perform robust ASR and KWS in a large set of test scenarios. We also evaluated the use of FL in the context where different LEAs can share a common ASR and KWS system, keeping the privacy of their data. In summary, the main contributions of this paper are four-fold:
  • We performed an extensive comparison between two of the most accurate neural-based ASR architectures to date: a fine-tuned version of Wav2Vec2.0 and Whisper. The evaluation was performed in many scenarios, paying special attention to corpora coming from multimedia content. The models were tested on data from seven Indo-European languages: English, Spanish, German, French, Italian, Portuguese, and Polish. This evaluation can be useful in other domains besides forensic ASR, making our contribution open and viable for other scenarios.
  • We created and released an in-domain corpus that includes specific keywords of the child abuse domain, together with a set of accompanying audio files in which the keywords are present. The included audio was selected from openly available corpora used in the literature. The created corpus can be used as a benchmark to test ASR systems under uncontrolled acoustic conditions.
  • The two neural architectures are also compared on the created corpus within the scope of child abuse forensics. To the best of our knowledge, this is the first study to assess open ASR solutions and their capabilities to recognize specific words within a forensic domain.
  • We validated the use of FL strategies to train ASR systems in the context of forensic applications. The core idea is that different LEAs can share a common model while keeping the privacy of their data.
The rest of the paper is organized as follows. Section 2 details the technical aspects of the Wav2Vec2.0 and Whisper architectures for ASR. Section 3 describes the corpora considered to test the ASR systems and the process followed to deliver an in-domain corpus for KWS in the context of forensics. Section 4 describes the pilot study on the use of FL for the addressed application. Section 5 presents the main results obtained for ASR, KWS, and FL. Section 6 discusses the main insights obtained from the results. Finally, Section 7 presents the main conclusions derived from this work.

2. Methods

We considered two of the most accurate neural-based ASR architectures to date: (1) Wav2vec2.0, which is trained following a self-supervised paradigm, and (2) Whisper, which is trained following a fully supervised strategy. Details about each model are found in the following sub-sections.

2.1. Wav2vec2.0

Wav2vec2.0 [17] is a self-supervised end-to-end architecture based on convolutional and transformer layers (see Figure 1). The model encodes raw audio waveforms X into latent speech representations z_1, ..., z_T via a multi-layer convolutional feature encoder f: X → Z. These latent representations feed a masked transformer network g: Z → C. The transformer network initially quantizes the continuous representations, forming a discrete set of outputs q_1, ..., q_T that represent the targets in the self-supervised learning objective [17,35]. Those quantized representations are then contextualized using the attention blocks of the transformer module, obtaining a set of discrete contextual representations c_1, ..., c_T. The feature encoder is formed by seven convolutional blocks with 512 channels, strides of {5, 2, 2, 2, 2, 2, 2}, and kernel widths of {10, 3, 3, 3, 3, 2, 2}. The transformer network is formed by 24 blocks with a model dimension of 1024, an inner dimension of 4096, and a total of 16 attention heads.
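For illustration, the encoder and transformer dimensions described above can be expressed with the Wav2Vec2Config class of the Hugging Face transformers library. This is a minimal sketch of the configuration only, not the authors' training code; the library and parameter names are assumptions based on the publicly documented API.

```python
# Sketch: a Wav2Vec2 configuration mirroring the architecture described above
# (7 convolutional blocks with 512 channels, 24 transformer blocks, model
# dimension 1024, inner dimension 4096, 16 attention heads).
from transformers import Wav2Vec2Config, Wav2Vec2Model

config = Wav2Vec2Config(
    conv_dim=(512,) * 7,                 # 7 convolutional blocks, 512 channels each
    conv_stride=(5, 2, 2, 2, 2, 2, 2),   # feature-encoder strides
    conv_kernel=(10, 3, 3, 3, 3, 2, 2),  # feature-encoder kernel widths
    num_hidden_layers=24,                # transformer blocks
    hidden_size=1024,                    # model dimension
    intermediate_size=4096,              # inner (feed-forward) dimension
    num_attention_heads=16,
)
model = Wav2Vec2Model(config)            # randomly initialized; the pre-trained checkpoint is described below
```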
We considered a pre-trained Wav2vec2.0 acoustic model based on the Wav2Vec2-XLS-R-300M model, which is available via Huggingface (https://huggingface.co/facebook/wav2vec2-xls-r-300m accessed on 1 February 2023). The model was pre-trained in a self-supervised manner using 436k hours of unlabeled speech data in 128 languages from the VoxPopuli [36], Multilingual Librispeech (MLS) [37], Common Voice [38], BABEL, and VoxLingua107 [39] corpora. Wav2Vec2-XLS-R-300M is one of the versions of Meta AI’s multilingual XLS-R model [40], composed of 300 million parameters. The multilingual pre-trained model was fine-tuned with labeled speech data (see Section 3.1) in seven languages: English, German, French, Spanish, Italian, Portuguese, and Polish. Each model was trained for 50 epochs, with a batch size of 2, 16 gradient accumulation steps, and a learning rate of 5 × 10⁻⁵, which was warmed up during the initial 10% of the training.
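A minimal sketch of this fine-tuning setup is shown below, assuming the Hugging Face Trainer API; the output directory is a placeholder, and data loading, the tokenizer, and the CTC data collator are omitted for brevity.

```python
# Sketch of the fine-tuning configuration described above: 50 epochs, batch
# size 2, 16 gradient-accumulation steps, and a learning rate of 5e-5 warmed
# up during the initial 10% of training.
from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m")  # vocab size must match the target-language tokenizer

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-300m-finetuned",  # placeholder output path
    num_train_epochs=50,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    warmup_ratio=0.1,                           # warm-up during the initial 10% of training
)
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=common_voice_train, data_collator=ctc_collator)
# trainer.train()
```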
The trained acoustic representations were decoded using a connectionist temporal classification (CTC) layer with a beam-search decoding strategy (beam width = 256). The CTC decoding included separate 3-gram language models trained on large text corpora, which were incorporated into the decoding with weights of α = 0.5 and β = 1.5.
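The decoding step can be sketched with the pyctcdecode library, which implements CTC beam search with an external KenLM language model; the label set, LM path, and logits below are placeholders, and the exact decoder used by the authors is not specified beyond the parameters above.

```python
# Sketch: CTC beam-search decoding with a 3-gram KenLM language model
# (beam width 256, LM weight alpha = 0.5, word-insertion bonus beta = 1.5).
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = [""] + list("abcdefghijklmnopqrstuvwxyz' ")   # hypothetical CTC vocabulary (blank first)
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="3gram.arpa",   # placeholder path to the 3-gram language model
    alpha=0.5,
    beta=1.5,
)

logits = np.random.randn(200, len(labels)).astype(np.float32)  # stand-in for acoustic-model outputs
transcription = decoder.decode(logits, beam_width=256)
```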

2.2. Whisper

Whisper is an ASR system recently introduced by OpenAI [25]. Contrary to Wav2vec2.0, Whisper is trained in a fully supervised manner, using up to 680k hours of labeled speech data from multiple sources. The model is based on an encoder-decoder Transformer, which is fed with 80-channel log-Mel spectrograms. The encoder is formed by two convolution layers with a kernel size of 3, followed by a sinusoidal positional encoding and a stacked set of Transformer blocks. The decoder uses learned positional embeddings and the same number of Transformer blocks as the encoder. Figure 2 illustrates the general Whisper architecture. Different pre-trained models are available, with variations in the number of layers and attention heads. We considered the “Whisper-large” model, which consists of 1550 million parameters distributed across 32 layers and 20 attention heads. The model is available via Huggingface (https://huggingface.co/openai/whisper-large accessed on 1 February 2023).
The model was not fine-tuned in this study; thus, the evaluation for all languages was conducted in a zero-shot setting. The decoding was performed using a beam-search strategy with 5 beams, an array of temperature weights of {0.2, 0.4, 0.6, 0.8, 1.0}, and a no-repeat n-gram size of 3, in order to take advantage of the language modeling head and to avoid loops, in a similar way to [25].
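A minimal sketch of this zero-shot decoding, assuming the Hugging Face implementation of Whisper, is shown below; the audio file name is a placeholder, and the temperature-fallback schedule listed above is applied in the full decoding loop but omitted here for brevity.

```python
# Sketch: zero-shot transcription with the pre-trained Whisper-large model,
# using beam search with 5 beams and a no-repeat n-gram size of 3.
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

audio, sample_rate = sf.read("utterance.wav")   # placeholder; Whisper expects 16 kHz mono audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

predicted_ids = model.generate(
    inputs.input_features,
    num_beams=5,
    no_repeat_ngram_size=3,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```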

3. Materials

This section describes the set of open corpora used to benchmark the two considered ASR systems (Section 3.1), the process followed to derive the set of keywords to be spotted by the considered systems (Section 3.2), and the in-domain dataset built to test the considered models (Section 3.3).

3.1. Data

The ASR and KWS models were trained and evaluated on a set of seven Indo-European languages: English, Spanish, German, French, Italian, Portuguese, and Polish. These languages were selected for two main reasons: (1) they cover Germanic-, Latin-, and Slavic-based languages, which represent the majority of the language families spoken in Europe, and (2) they were specifically selected by the Law Enforcement Agencies (LEAs) for the applications related to detecting child abuse in online sources. Different public corpora were considered to train/test the ASR and KWS models in each language. Wav2vec2.0 models were fine-tuned using the Common Voice corpus [38] for each considered language. The amount of available labeled data varies greatly depending on the language: 1600 h for English, 777 h for German, 623 h for French, 324 h for Spanish, 158 h for Italian, 63 h for Portuguese, and 43 h for Polish. These data are freely available via Huggingface (https://huggingface.co/datasets/common_voice accessed on 1 February 2023). The training data for the Spanish model also included 57 h from the RTVE2018 dataset [41] from the Albayzin 2018 evaluation challenge.
The corpora covered in our paper include both European and American accents for the aforementioned languages. In addition, the Common Voice corpus, which was used as our training set, was crowd-sourced from many countries and includes a large number of accents, which helps to improve the generalization capabilities of our models.
The performances of the fine-tuned Wav2vec2.0 and the Whisper-based models were evaluated in a cross-corpus fashion, considering a large set of databases from the literature that are available in the different languages. The list of considered corpora is shown in Table 1. These corpora were selected in order to test the performance of the models under several recording conditions, which can be more similar to the realistic scenarios found by LEAs. Notice that, due to the sensitive nature of the target application, it is not possible to obtain access to realistic operative data from LEAs. However, we created an in-domain synthetic dataset using these open source corpora, which is described in Section 3.3.

3.2. Spotted Keywords

In order to test the capabilities of the ASR models to spot specific keywords within the child abuse domain, we defined a list of keywords to be spotted. The keyword list was obtained from a set of open documents that include: (1) the “Best Practices on Victim support for LEA first responders” deliverable from the GRACE project (https://www.grace-fct.eu/deliverables/70 accessed on 1 February 2023), (2) the 2021 “Barriers to Compensation for Child Victims of Sexual Exploitation” report from ECPAT (https://ecpat.org/wp-content/uploads/2021/05/Barriers-to-Compensation-for-Child_ebook.pdf accessed on 1 February 2023) [47], (3) the study from [48], (4) EUROPOL technical reports [49,50,51], (5) EUROPOL press-releases from 2018 to 2022 using the keyword “child abuse” (https://www.europol.europa.eu/media-press/newsroom?q=child%20abuse accessed on 1 February 2023), (6) Wikipedia articles about “child abuse” and “online child abuse”, and (7) UNICEF press-releases about “child abuse” (https://www.unicef.org/search?force=0&query=child+abuse&created%5Bmin%5D=&created%5Bmax%5D= accessed on 1 February 2023). All documents were text crawled and pre-processed by performing lemmatization and removing stop words, numbers, and date entities. After this process, we obtained a corpus with 55,059 words, of which 6028 are unique. Figure 3 shows the most important keywords found in the crawled corpus.
Afterwards, we selected the 100 most repeated words from the corpus, which represent 33% of the information within the whole set of crawled documents. Finally, we excluded 12 terms because they were very broad concepts not related to child abuse, leading to a final set of 88 keywords to be spotted. The obtained keyword list (in English) was translated into the remaining six considered languages in order to have a common benchmark for all languages.
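The keyword-extraction procedure can be sketched as follows, assuming spaCy and its English pipeline (en_core_web_sm); the document list and the excluded broad terms are placeholders, not the actual crawled material.

```python
# Sketch: build the keyword list by lemmatizing the crawled documents,
# removing stop words, numbers and date entities, counting term frequencies,
# and keeping the most frequent terms after excluding overly broad concepts.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
documents = ["... text of a crawled report ...", "... text of a press release ..."]  # placeholders

counts = Counter()
for doc in nlp.pipe(documents):
    for token in doc:
        if token.is_stop or token.is_punct or token.like_num or token.ent_type_ == "DATE":
            continue
        counts[token.lemma_.lower()] += 1

broad_terms = {"information", "report"}                 # hypothetical exclusions
top_terms = [term for term, _ in counts.most_common(100)]
keywords = [term for term in top_terms if term not in broad_terms]
```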

3.3. GRACE Dataset

We considered an additional corpus to test the implemented ASR systems by merging and filtering the data described in Section 3.1. We selected audio samples from all datasets that contain at least one of the 88 selected keywords. Table 2 shows the data distribution for each language after selection. The table includes the datasets considered for each language where the keywords were found, the number of utterances, and the total audio duration (in hours).
The selected audio files were processed in order to obtain more realistic acoustic conditions, like those expected in forensic applications within the considered domain. The process included: (1) adding background noise with signal-to-noise ratios (SNRs) between 5 and 30 dB (chosen randomly), (2) adding reverberation using room impulse responses from the VOiCES dataset [52], and (3) randomly applying the ogg-vorbis codec [53], since it is commonly found in audio material from online sources. The final ASR and KWS evaluation was performed considering the two versions of the corpus: clean and noisy. This corpus is available online (https://datasets.vicomtech.org/di01-grace-automatic-speech-recognition-and-keyword-spotting/GRACE_ASR.zip accessed on 1 February 2023) to be used as a benchmark dataset for speech recognition in different languages under uncontrolled acoustic conditions.
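A minimal sketch of this degradation pipeline is given below, using NumPy, SciPy, and SoundFile; the file names are placeholders, and the ogg-vorbis re-encoding step is omitted since it is typically applied with an external encoder.

```python
# Sketch: degrade a clean utterance with additive noise at a random SNR in
# [5, 30] dB and with reverberation via convolution with a room impulse
# response, as done for the noisy version of the GRACE dataset.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, sr = sf.read("clean_utterance.wav")       # placeholder, mono signal
noise, _ = sf.read("background_noise.wav")        # placeholder noise recording
rir, _ = sf.read("room_impulse_response.wav")     # placeholder RIR (e.g., from VOiCES)

# Additive noise at a random SNR between 5 and 30 dB
snr_db = np.random.uniform(5, 30)
noise = np.resize(noise, speech.shape)            # tile/trim the noise to the speech length
gain = np.sqrt(np.sum(speech ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
noisy = speech + gain * noise

# Reverberation: convolve with the room impulse response and trim to length
reverberant = fftconvolve(noisy, rir)[: len(noisy)]

sf.write("degraded_utterance.wav", reverberant, sr)
```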

4. Federated Learning

The considered FL pipeline was run only with English data and included five nodes used for federated training, a dummy node used to test the evolution of the learning process, and a central server in charge of aggregating the weights received from the five nodes. Figure 4 shows the implemented architecture. Three of the servers were located at Vicomtech premises (Spain), one server was located in Greece, another in Portugal, and the remaining one in Cyprus. The aim of these connections was to create a realistic environment for the pilot, in conditions similar to those expected if the model were eventually trained by different LEAs across Europe. In addition, secure communication between the clients and the server was established through a VPN connection to ensure that sensitive data (parameters) were safely transmitted and to prevent unauthorized access. Each node contained data from a different dataset: TEDLIUMv2, debating technologies, Librispeech-other, Librispeech-clean, and SWC. This data configuration aimed to evaluate the impact of a non-IID data distribution, which is more realistic for the addressed forensic application.
The FL pilot test was performed only with the Wav2Vec2.0 system, starting from the pre-trained Wav2Vec2-XLS-R-300M model. The training hyperparameters were the same for the five clients and included a batch size of 2, a learning rate of 5 × 10⁻⁵ warmed up during the first 10% of the training time, and a gradient accumulation of 16 steps. The local training was performed for 5 epochs. The central server was configured to run 10 rounds of federated training, using the federated averaging (FedAvg) aggregation mechanism to update the central model. The architecture configuration and the training process were implemented using Nvidia Flare (https://nvflare.readthedocs.io/en/main/index.html accessed on 1 February 2023).
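The aggregation rule itself can be illustrated with a short FedAvg sketch; this is only the weighted parameter average, whereas orchestration, communication, and security were handled by NVIDIA FLARE in the actual pilot. The function and variable names are hypothetical.

```python
# Sketch: FedAvg aggregation of client model weights. Each client returns its
# locally updated state_dict; the server averages them, weighting each client
# by its number of training examples.
import torch

def fedavg(client_states, client_sizes):
    """Return the weighted average of the clients' state_dicts."""
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_states[0]:
        averaged[name] = sum(
            state[name].float() * (size / total)
            for state, size in zip(client_states, client_sizes)
        )
    return averaged

# Usage after each federated round (names are placeholders):
# central_model.load_state_dict(fedavg(states_from_clients, examples_per_client))
```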

5. Results

5.1. Speech Recognition

The Wav2Vec2.0 and Whisper models were evaluated on the corpora described in Section 3.1. The results of the ASR systems in terms of WER are shown in Table 3. The results cover the seven languages and include both the open benchmark corpora and the two versions (clean and noisy) of the synthetic GRACE corpus.
On average, the WER for each language using Whisper ranged from 11.3% (in Spanish) to 24.9% (in French). The results using Wav2Vec2.0 ranged from 13.1% (in Spanish) to 34.8% (in Portuguese). In general, Whisper produced fewer errors than Wav2Vec2.0 (see Figure 5, left). The difference between the models was statistically significant according to a Mann–Whitney test (U = 1203.5, p-value = 0.016). Whisper outperformed Wav2Vec2.0 especially under the most adverse acoustic conditions, such as in the GRACE noisy, TEDLIUMv2, Debates, and CORAA corpora. However, there are some scenarios where Wav2Vec2.0 outperformed Whisper and which should be considered with special attention, such as the results for the Spanish Common Voice.
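For reference, the two quantities used in this comparison can be computed with the jiwer and SciPy packages; the sketch below uses toy strings and placeholder WER lists rather than the reported values.

```python
# Sketch: word error rate for a single hypothesis and a Mann-Whitney U test
# over per-corpus WER values of the two systems.
import jiwer
from scipy.stats import mannwhitneyu

reference = "the suspect shared the files online"      # toy example
hypothesis = "the suspect shared files online"
wer = jiwer.wer(reference, hypothesis)                  # 1 deletion / 6 words ≈ 0.17

wers_wav2vec = [16.1, 20.6, 33.8, 12.6]                 # placeholder per-corpus WERs
wers_whisper = [10.0, 18.3, 22.5, 19.6]
statistic, p_value = mannwhitneyu(wers_wav2vec, wers_whisper)
```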
The results obtained were compared with those found in the literature for the multilingual corpora: Common Voice, MLS, MTEDx, and MediaSpeech. The comparison is shown in Table 4. The Wav2Vec2.0-based model outperformed previously reported results on the Spanish versions of the Common Voice and MediaSpeech corpora, with WERs of 4.3% and 14.5%, respectively, compared with the results reported in [18] for Common Voice (WER = 6.2%) and in [12] for MediaSpeech (WER = 18.3%). We also reported state-of-the-art results for the Spanish, Portuguese, Italian, and German versions of the MTEDx corpus (WERs of 9.4%, 12%, 11.6%, and 21.7%, respectively) with respect to the WERs of 16.2%, 20.2%, 16.4%, and 42.3% reported in [43]. The Whisper model also achieved state-of-the-art results on the CORAA corpus (WER = 21.7%) with respect to the results reported in [54] (WER = 21.9%), and on the TEDLIUMv2 corpus (WER = 5.4%) compared with [13] (WER = 5.6%). Regarding MLS, the state-of-the-art results are still those from [55]. However, notice that the results reported here correspond to cross-corpus tests, whereas the experiments performed in [55] were on Wav2Vec2.0 models trained and tested using MLS, thereby making the models adapted specifically to that corpus.

5.2. Keyword Spotting

The text transcriptions from Wav2Vec2.0 and Whisper were post-processed in order to find the presence of the defined keywords to be spotted. The process involved converting the transcriptions to lowercase and lemmatizing them. Lemmatization was performed to reduce each word to its base form so that all inflectional variants of a keyword could be detected within the transcription. The lemmatization was performed using the set of large open dictionaries available in Spacy (https://spacy.io/usage/models accessed on 1 February 2023). The results obtained for KWS in each corpus are shown in Table 5. The results are presented in terms of the true positive rate (TPR). This is a common metric in applications of this type, where it is more important to avoid false-negative than false-positive errors [58,59].
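The post-processing and TPR computation can be sketched as follows, assuming spaCy for lemmatization; function names and example values are illustrative, not the authors' implementation.

```python
# Sketch: lower-case and lemmatize an ASR transcription, check which keywords
# appear in lemmatized form, and compute the true positive rate (TPR) as the
# fraction of reference keywords that were detected.
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmas(text):
    return {token.lemma_.lower() for token in nlp(text.lower()) if not token.is_punct}

def spotted_keywords(transcription, keyword_list):
    transcript_lemmas = lemmas(transcription)
    return {kw for kw in keyword_list if lemmas(kw) <= transcript_lemmas}

def true_positive_rate(detected, reference):
    return len(detected & reference) / len(reference) if reference else 0.0
```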
On average, the TPRs were higher using Whisper, and the results per language using Whisper ranged from 81.5% (for Polish) to 98.4% (for Italian). The results using Wav2Vec2.0 ranged from 82.9% (for Portuguese) to 94.9% (for Spanish). As with the ASR results, the difference between Whisper and Wav2Vec2.0 was larger when considering speech signals in uncontrolled acoustic conditions, such as the ones from the GRACE noisy corpus, where we can guarantee the presence of the spotted keywords in every utterance. Large differences were also observed for the CORAA corpus, Common Voice, and the German SWC. The differences between the results obtained using Wav2Vec2.0 and Whisper were also statistically significant (see Figure 5, right) according to a Mann–Whitney test with U = 589.0 and a p-value of 0.003.

5.3. Federated Learning

The FL experiment involved training the Wav2Vec2.0 system using five separate real servers for training and one additional node (dummy) used only to test the final model. Each node contained data from a different dataset (only in English) in order to evaluate the contribution of each corpus to the global aggregated model. The aim was also to cover non-IID conditions, which have been shown to be one of the most important drawbacks when training models in an FL approach. The results are shown in Table 6. The results using the FL training are compared with those obtained when training the system in a completely centralized manner. Similar WERs were obtained by each node in the federated and centralized training. The main difference is that with FL there is a single aggregated model covering the five nodes, instead of the five different models of the centralized approach. This greatly reduces the time required to train the system and, most importantly, makes it possible to take advantage of data from different data centers to train a more robust and general model without the need for sharing data among clients.

6. Discussion

The evaluation of the Wav2Vec2.0 and Whisper-based ASR systems was performed in a large set of different scenarios, including one specifically designed for forensic applications within the child abuse domain. On average, Whisper is more accurate than the Wav2Vec2.0-based system. Whisper achieved WERs ranging from 11.3% to 24.9%, depending on the language, compared with Wav2Vec2.0’s WERs of between 13.1% and 34.8%. The difference between the two models was even larger for languages trained with fewer resources, such as Portuguese or Italian. Despite these differences, Wav2vec2.0 is competitive with Whisper when the number of hours for fine-tuning is large, e.g., for English, Spanish, or French.
Results using the GRACE dataset showed relatively similar WERs between Wav2Vec2.0 and Whisper when considering the clean version of the corpus: the average WER was 22.1% for Whisper and 23.2% for Wav2Vec2.0. However, the difference between the two models greatly increased with the noisy version of the corpus: the average WER was 26.3% for Whisper and 45.1% for Wav2Vec2.0. This is a strong indicator of the ability of Whisper to perform accurate transcriptions under uncontrolled and noisy acoustic conditions, since it keeps similar WERs across the two versions of the GRACE corpus. Despite the differences between the two types of models, there are some surprising results where Wav2Vec2.0 outperformed Whisper and which should be considered with special attention, for instance, when evaluating the GRACE clean corpus in languages such as English, French, and Spanish. The models for these three languages were fine-tuned with more data, which likely explains the lower WER for Wav2Vec2.0 compared with that of Whisper.
Our systems achieved state-of-the-art results on several of the considered benchmark corpora. We reported state-of-the-art results for some of the languages in the Common Voice corpus. State-of-the-art results were also achieved for almost all languages in the MTEDx and MediaSpeech corpora. These results are good indicators of the ability of the considered systems to accurately recognize speech under more natural and spontaneous scenarios, closer to those expected in forensic domains.
The KWS evaluation indicated that both Wav2Vec2.0 and Whisper were accurate enough to recognize the considered child-abuse-related keywords in the seven languages. TPRs obtained for Wav2Vec2.0 ranged from 82.9% to 94.9%, depending on the language. Results using Whisper ranged from 81.5% to 98.4%. The evaluation of KWS on the GRACE dataset also showed that both models are equally accurate at recognizing the selected keywords under controlled acoustic conditions. On the contrary, when considering the noisy version of the corpus, the results for Wav2Vec2.0 were reduced by 20%, whereas the results for Whisper were only reduced by 3%. This again indicates the ability of Whisper to accurately process speech recordings in uncontrolled acoustic conditions.
The last experiment involved a pilot study on the use of FL to train ASR systems. The results indicated that an ASR system trained in a federated way maintains, and in some cases outperforms, the performance of individual ASR systems trained in a centralized manner by each LEA. In addition to the performance, the most important aspect of FL is that the ASR training does not involve any data sharing among LEAs, since only updates of the network parameters are transferred to a central server in charge of aggregating the model. These results indicate the potential of FL to obtain a joint (and potentially richer) model combining sources of data that could not otherwise be combined. Despite the benefits of using FL, it is important to consider external factors that may degrade the performance and reliability of the system. For instance, there is evidence of FL attacks that are able to retrieve speaker information from the transferred weights [60], as well as data-poisoning attacks inside LEA servers. Different strategies can be considered to mitigate attacks of these kinds, such as the use of differential privacy algorithms [61] or trusted execution environments.

7. Conclusions

This paper proposed the use of speech recognition and keyword spotting technologies in forensic scenarios, particularly in child exploitation settings. The aim is to provide LEAs with technology to detect the presence of offensive online audiovisual material related to child abuse. State-of-the-art ASR systems based on Wav2Vec2.0 and Whisper were considered for the addressed application. The performance of both models was tested on a large set of open benchmark corpora from the literature; therefore, the results obtained can be extended to other ASR domains. We additionally created an in-domain corpus using different open source datasets from the research community, with the aim of testing the models in more realistic and operative conditions.
The ASR and KWS models were evaluated on corpora from seven Indo-European languages: English, German, French, Spanish, Italian, Portuguese, and Polish. We obtained overall WERs ranging from 11.3% to 24.9%, depending on the language. The performance of the KWS models for the different languages ranged from 81.5% to 98.4%. The most accurate results were obtained from models trained with more data, such as English or German. The comparison between the Wav2Vec2.0 and Whisper models indicated that the latter was the more accurate system in the majority of cases, especially when considering utterances in uncontrolled acoustic conditions.
We also proposed a strategy for using FL to train robust ASR systems in the context of the addressed application. This is a suitable approach considering that collecting operational data from LEAs is not possible. FL approaches allow LEAs to build a common technological platform without the need to share their operational data. The results of the FL pilot indicated that similar WERs were achieved when comparing the model trained in a federated way with individual models trained in a centralized manner, even under non-IID conditions, which have been shown to be one of the main drawbacks in FL.
For future work, the considered approaches can be extended to other forensic applications where there is a need to monitor audiovisual material from online sources. In addition, the considered technology can be combined with other speech processing methods, such as speaker and language identification, age and gender recognition, and speaker diarization. The ultimate goal is to provide LEAs with accurate tools to monitor audio from online sources, allowing them to respond in a practical and timely manner.

Author Contributions

Conceptualization, J.C.V.-C. and A.Á.M.; methodology, J.C.V.-C. and A.Á.M.; software, J.C.V.-C.; validation, J.C.V.-C.; formal analysis, J.C.V.-C. and A.Á.M.; investigation, J.C.V.-C. and A.Á.M.; resources and data curation, J.C.V.-C.; writing—original draft preparation, J.C.V.-C.; writing—review and editing, J.C.V.-C. and A.Á.M.; visualization, J.C.V.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under project GRACE, grant agreement No. 883341.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the GRACE consortium.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data considered in this study are from open repositories under Creative Commons licenses.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ASR  Automatic Speech Recognition
CTC  Connectionist Temporal Classification
FL   Federated Learning
KWS  Keyword Spotting
LEA  Law Enforcement Agency
MLS  Multilingual Librispeech
SWC  Spoken Wikipedia Corpus
TPR  True Positive Rate

References

  1. Negrão, M.; Domingues, P. SpeechToText: An open-source software for automatic detection and transcription of voice recordings in digital forensics. Forensic Sci. Int. Digit. Investig. 2021, 38, 301223. [Google Scholar] [CrossRef]
  2. Alghowinem, S. A safer youtube kids: An extra layer of content filtering using automated multimodal analysis. In Proceedings of the SAI Intelligent Systems Conference, London, UK, 6–7 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 294–308. [Google Scholar]
  3. Mariconti, E.; Suarez-Tangil, G.; Blackburn, J.; De Cristofaro, E.; Kourtellis, N.; Leontiadis, I.; Serrano, J.L.; Stringhini, G. “You Know What to Do” Proactive Detection of YouTube Videos Targeted by Coordinated Hate Attacks. ACM Hum.-Comput. Interact. 2019, 3, 1–21. [Google Scholar] [CrossRef]
  4. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, ICML, New York, NY, USA, 20–22 June 2016; pp. 173–182. [Google Scholar]
  5. Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, ICML, Beijing, China, 21–26 June 2014; pp. 1764–1772. [Google Scholar]
  6. Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the ICASSP, Shanghai, China, 7–13 May 2016; IEEE: Piscataway Township, NJ, USA, 2016; pp. 4960–4964. [Google Scholar]
  7. Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  8. Lu, L.; Zhang, X.; Renais, S. On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition. In Proceedings of the ICASSP, Shanghai, China, 7–13 May 2016; IEEE: Piscataway Township, NJ, USA, 2016; pp. 5060–5064. [Google Scholar]
  9. Yao, Z.; Wu, D.; Wang, X.; Zhang, B.; Yu, F.; Yang, C.; Peng, Z.; Chen, X.; Xie, L.; Lei, X. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv 2021, arXiv:2102.01547. [Google Scholar]
  10. Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proceedings of the ICASSP, Online, 7–13 May 2020; IEEE: Piscataway Township, NJ, USA, 2020; pp. 6124–6128. [Google Scholar]
  11. Bermuth, D.; Poeppel, A.; Reif, W. Scribosermo: Fast Speech-to-Text models for German and other Languages. arXiv 2021, arXiv:2110.07982. [Google Scholar]
  12. Kolobov, R.; Okhapkina, O.; Omelchishina, O.; Platunov, A.; Bedyakin, R.; Moshkin, V.; Menshikov, D.; Mikhaylovskiy, N. Mediaspeech: Multilanguage asr benchmark and dataset. arXiv 2021, arXiv:2103.16193. [Google Scholar]
  13. Majumdar, S.; Balam, J.; Hrinchuk, O.; Lavrukhin, V.; Noroozi, V.; Ginsburg, B. Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition. arXiv 2021, arXiv:2104.01721. [Google Scholar]
  14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 7–12 June 2018; pp. 7132–7141. [Google Scholar]
  15. Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar]
  16. Zhou, W.; Zheng, Z.; Schlüter, R.; Ney, H. On language model integration for rnn transducer based speech recognition. In Proceedings of the ICASSP, Singapore, 22–27 May 2022; IEEE: Piscataway Township, NJ, USA, 2022; pp. 8407–8411. [Google Scholar]
  17. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. 2020, 33, 12449–12460. [Google Scholar]
  18. Pham, N.Q.; Waibel, A.; Niehues, J. Adaptive multilingual speech recognition with pretrained models. In Proceedings of the INTERSPEECH, Incheon, Korea, 18–22 September 2022; pp. 3879–3883. [Google Scholar] [CrossRef]
  19. Krabbenhöft, H.N.; Barth, E. TEVR: Improving Speech Recognition by Token Entropy Variance Reduction. arXiv 2022, arXiv:2206.12693. [Google Scholar]
  20. Junior, A.C.; Casanova, E.; Soares, A.; de Oliveira, F.S.; Oliveira, L.; Junior, R.C.F.; da Silva, D.P.P.; Fayet, F.G.; Carlotto, B.B.; Gris, L.R.S.; et al. CORAA: A large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. arXiv 2021, arXiv:2110.15731. [Google Scholar]
  21. Hsu, W.N.; Tsai, Y.H.H.; Bolte, B.; Salakhutdinov, R.; Mohamed, A. HuBERT: How much can a bad teacher benefit ASR pre-training? In Proceedings of the ICASSP, Online, 7–13 May 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 6533–6537. [Google Scholar]
  22. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  23. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proceedings of the INTERSPEECH, Online, 25–29 October 2020; pp. 5036–5040. [Google Scholar] [CrossRef]
  24. Guo, P.; Boyer, F.; Chang, X.; Hayashi, T.; Higuchi, Y.; Inaguma, H.; Kamo, N.; Li, C.; Garcia-Romero, D.; Shi, J.; et al. Recent developments on espnet toolkit boosted by conformer. In Proceedings of the ICASSP, Online, 7–13 May 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 5874–5878. [Google Scholar]
  25. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision; Technical Report; OpenAI: San Francisco, CA, USA, 2022. [Google Scholar]
  26. Voigt, P.; Von dem Bussche, A. The eu general data protection regulation (gdpr). In A Practical Guide, 1st ed.; Springer International Publishing: Cham, Switzerland, 2017; Volume 10, pp. 10–5555. [Google Scholar]
  27. Konečnỳ, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
  28. Yang, Q.; Liu, Y.; Cheng, Y.; Kang, Y.; Chen, T.; Yu, H. Federated learning. Synth. Lect. Artif. Intell. Mach. Learn. 2019, 13, 1–207. [Google Scholar]
  29. Li, L.; Fan, Y.; Tse, M.; Lin, K.Y. A review of applications in federated learning. Comput. Ind. Eng. 2020, 149, 106854. [Google Scholar] [CrossRef]
  30. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  31. Dimitriadis, D.; Kumatani, K.; Gmyr, R.; Gaur, Y.; Eskimez, S.E. A Federated Approach in Training Acoustic Models. In Proceedings of the INTERSPEECH, Online, 25–29 October 2020; pp. 981–985. [Google Scholar]
  32. Cui, X.; Lu, S.; Kingsbury, B. Federated acoustic modeling for automatic speech recognition. In Proceedings of the ICASSP, Online, 7–13 May 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 6748–6752. [Google Scholar]
  33. Guliani, D.; Beaufays, F.; Motta, G. Training speech recognition models with federated learning: A quality/cost framework. In Proceedings of the ICASSP, Online, 7–13 May 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 3080–3084. [Google Scholar]
  34. Hard, A.; Partridge, K.; Nguyen, C.; Subrahmanya, N.; Shah, A.; Zhu, P.; Moreno, I.L.; Mathews, R. Training Keyword Spotting Models on Non-IID Data with Federated Learning. In Proceedings of the INTERSPEECH, Online, 25–29 October 2020; pp. 4343–4347. [Google Scholar] [CrossRef]
  35. Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv 2020, arXiv:2006.13979. [Google Scholar]
  36. Wang, C.; Riviere, M.; Lee, A.; Wu, A.; Talnikar, C.; Haziza, D.; Williamson, M.; Pino, J.; Dupoux, E. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 993–1003. [Google Scholar]
  37. Pratap, V.; Xu, Q.; Sriram, A.; Synnaeve, G.; Collobert, R. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proceedings of the INTERSPEECH, Online, 25–29 October 2020. [Google Scholar]
  38. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the LREC, Marseille, France, 20–25 June 2020; pp. 4211–4215. [Google Scholar]
  39. Valk, J.; Alumäe, T. VoxLingua107: A dataset for spoken language recognition. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Online, 19–22 January 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 652–658. [Google Scholar]
  40. Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proceedings of the INTERSPEECH, Incheon, Republic of Korea, 18–22 September 2022; pp. 2278–2282. [Google Scholar] [CrossRef]
  41. Lleida, E.; Ortega, A.; Miguel, A.; Bazán-Gil, V.; Pérez, C.; Gómez, M.; De Prada, A. Albayzin 2018 evaluation: The iberspeech-rtve challenge on speech technologies for spanish broadcast media. Appl. Sci. 2019, 9, 5412. [Google Scholar] [CrossRef] [Green Version]
  42. Baumann, T.; Köhn, A.; Hennig, F. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening. Lang. Resour. Eval. 2019, 53, 303–329. [Google Scholar] [CrossRef]
  43. Salesky, E.; Wiesner, M.; Bremerman, J.; Cattoni, R.; Negri, M.; Turchi, M.; Oard, D.W.; Post, M. The Multilingual TEDx Corpus for Speech Recognition and Translation. In Proceedings of the INTERSPEECH, Online, 30 August–3 September 2021; pp. 3655–3659. [Google Scholar] [CrossRef]
  44. Rousseau, A.; Deléglise, P.; Esteve, Y. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the LREC, Reykjavik, Iceland, 26–31 May 2014; pp. 3935–3939. [Google Scholar]
  45. Mirkin, S.; Jacovi, M.; Lavee, T.; Kuo, H.K.; Thomas, S.; Sager, L.; Kotlerman, L.; Venezian, E.; Slonim, N. A Recorded Debating Dataset. In Proceedings of the LREC, Miyazaki, Japan, 7–12 May 2018; pp. 250–254. [Google Scholar]
  46. Ogrodniczuk, M. Polish parliamentary corpus. In Proceedings of the LREC, Miyazaki, Japan, 7–12 May 2018; pp. 15–19. [Google Scholar]
  47. ECPAT. Barriers to Compensation for Child Victims of Sexual Exploitation A Discussion Paper Based on a Comparative Legal Study of Selected Countries; ECPAT International: Bangkok, Thailand, 2021. [Google Scholar]
  48. Richards, K. Misperceptions about child sex offenders. In Trends and Issues in Crime and Criminal Justice; EUROPOL: Hague, The Netherlands, 2011; pp. 1–8. [Google Scholar]
  49. EUROPOL. Online sexual coercion and extortion as a form of crime affecting children. In European Union Agency for Law Enforcement Cooperation; EUROPOL: Hague, The Netherlands, 2017. [Google Scholar]
  50. EUROPOL. Internet Organised Crime Threat Assessment. In European Union Agency for Law Enforcement Cooperation; EUROPOL: Hague, The Netherlands, 2019. [Google Scholar]
  51. EUROPOL. Exploiting Isolation: Offenders and victims of online child sexual abuse during the COVID-19 pandemic. In European Union Agency for Law Enforcement Cooperation; EUROPOL: Hague, The Netherlands, 2020. [Google Scholar]
  52. Richey, C.; Barrios, M.A.; Armstrong, Z.; Bartels, C.; Franco, H.; Graciarena, M.; Lawson, A.; Nandwana, M.K.; Stauffer, A.; van Hout, J.; et al. Voices Obscured in Complex Environmental Settings (VOiCES) Corpus. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 1566–1570. [Google Scholar] [CrossRef]
  53. Moffitt, J. Ogg Vorbis—Open, free audio—Set your media free. Linux J. 2001, 2001, 9-es. [Google Scholar]
  54. Marcacini, R.M.; Candido Junior, A.; Casanova, E. Overview of the Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese (SE&R) Shared-tasks at PROPOR 2022. In Proceedings of the PROPOR, Fortaleza, Brazil, 21–23 March 2022. [Google Scholar]
  55. Bai, J.; Li, B.; Zhang, Y.; Bapna, A.; Siddhartha, N.; Sim, K.C.; Sainath, T.N. Joint unsupervised and supervised training for multilingual asr. In Proceedings of the ICASSP, Singapore, 22–27 May 2022; IEEE: Piscataway Township, NJ, USA, 2022; pp. 6402–6406. [Google Scholar]
  56. Zheng, H.; Peng, W.; Ou, Z.; Zhang, J. Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers. arXiv 2021, arXiv:2107.03007. [Google Scholar]
  57. Stefanel Gris, L.R.; Casanova, E.; Oliveira, F.S.d.; Silva Soares, A.d.; Candido Junior, A. Brazilian Portuguese Speech Recognition Using Wav2vec 2.0. In Proceedings of the International Conference on Computational Portuguese Language, Fortaleza, Brazil, 21–23 March 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 333–343. [Google Scholar]
  58. Keshet, J.; Grangier, D.; Bengio, S. Discriminative keyword spotting. Speech Commun. 2009, 51, 317–329. [Google Scholar] [CrossRef]
  59. Lengerich, C.; Hannun, A. An end-to-end architecture for keyword spotting and voice activity detection. arXiv 2016, arXiv:1611.09405. [Google Scholar]
  60. Tomashenko, N.; Mdhaffar, S.; Tommasi, M.; Estève, Y.; Bonastre, J.F. Privacy attacks for automatic speech recognition acoustic models in a federated learning framework. In Proceedings of the ICASSP, Singapore, 22–27 May 2022; IEEE: Piscataway Township, NJ, USA, 2022; pp. 6972–6976. [Google Scholar]
  61. Geyer, R.C.; Klein, T.; Nabi, M. Differentially private federated learning: A client level perspective. arXiv 2017, arXiv:1712.07557. [Google Scholar]
Figure 1. Wav2vec2.0 architecture representation. The raw audio signal is mapped to speech representations that are fed into a transformer network to output context representations. Figure based on the one presented in [17].
Figure 2. Whisper architecture representation. The log Mel-spectrograms are encoded by a transformer network. Encoded representations are transformed into character outputs and no-speech tokens via the transformer decoder. Figure based on the one presented in [25].
Figure 3. Top 20 of the most important keywords related to child abuse, which were used to test the capability of the ASR system to detect specific terminology within the domain.
Figure 4. Configuration of the FL architecture. Central server with five client nodes (site-1, 2, ..., 5) and a dummy node only used to test the performance of the aggregated model.
Figure 5. Comparison between the results obtained using Wav2Vec2.0 and Whisper for ASR (left) and KWS (right).
Table 1. List of public speech corpora considered to test the performances of ASR and KWS systems based on Wav2Vec2.0 and Whisper.
Corpus Name | Description | Languages (Test Duration, h)
Common Voice [38] | Read sentences collected and validated via crowd-sourcing | English 173; German 72; French 38; Spanish 26; Italian 23; Portuguese 6; Polish 7
Spoken Wikipedia Corpus (SWC) [42] | Volunteer readers of Wikipedia articles | English 42; German 36
MediaSpeech [12] | Speech segments from YouTube videos | French 10; Spanish 10
Multilingual TEDx [43] | Audio recordings and transcripts from TED talks | German 2; French 2; Spanish 2; Italian 2; Portuguese 2
TEDLIUMv2 [44] | Audio recordings from TED talks | English 3
Multilingual Librispeech (MLS) [37] | Audio recordings from audiobooks | German 14; French 10; Spanish 10; Italian 5; Polish 2; Portuguese 4
Voxforge | Crowdsourced read speech | German 3; French 4; Spanish 5; Italian 2; Portuguese 1
Debating technologies [45] | Audio recordings from transcribed public debates | English 1
Polish Parliamentary Corpus [46] | Recordings from the Polish parliament | Polish 1
CORAA [20] | Combination of five corpora in Portuguese | Portuguese 13
Table 2. Data distribution for the GRACE dataset, which combines different corpora into a single one within the child abuse domain.
Language | Base Corpora | # Utterances | Duration (h)
English | SWC, Debating technologies, TEDLIUMv2 | 2979 | 9.2
German | Multilingual TEDx, SWC, Voxforge | 1712 | 5.9
French | Multilingual TEDx, MediaSpeech, Voxforge | 1250 | 4.1
Spanish | Multilingual TEDx, MediaSpeech, Voxforge | 557 | 2.0
Italian | Multilingual TEDx, Voxforge | 354 | 1.0
Portuguese | Multilingual TEDx, Voxforge, CORAA | 1503 | 2.3
Table 3. Results of the ASR models in different languages considering all benchmark datasets. Results in terms of WER.
Model | Common Voice | MLS | TEDLIUMv2 | MTEDx | SWC | MediaSpeech | Voxforge | Debates | Polish Parl. | CORAA | GRACE Clean | GRACE Noisy | AVG.
English
Wav2Vec 2.0 | 16.1 | - | 17.2 | - | 20.6 | - | - | 11.7 | - | - | 18.9 | 32.6 | 19.5
Whisper | 10.0 | - | 5.4 | - | 20.6 | - | - | 7.0 | - | - | 24.5 | 19.8 | 14.6
German
Wav2Vec 2.0 | 11.9 | 12.9 | - | 36.7 | 34.5 | - | 7.5 | - | - | - | 20.0 | 33.8 | 22.5
Whisper | 7.1 | 6.7 | - | 21.7 | 18.3 | - | 4.2 | - | - | - | 15.5 | 22.5 | 13.7
French
Wav2Vec 2.0 | 16.7 | 17.0 | - | 25.3 | - | 29.1 | 16.7 | - | - | - | 26.5 | 56.3 | 26.8
Whisper | 21.7 | 8.0 | - | 23.3 | - | 35.8 | 14.6 | - | - | - | 36.8 | 34.1 | 24.9
Spanish
Wav2Vec 2.0 | 4.7 | 7.2 | - | 12.9 | - | 14.5 | 6.3 | - | - | - | 12.6 | 33.3 | 13.1
Whisper | 6.2 | 5.3 | - | 9.4 | - | 15.8 | 4.2 | - | - | - | 19.6 | 18.8 | 11.3
Italian
Wav2Vec 2.0 | 12.8 | 21.1 | - | 22.2 | - | - | 14.3 | - | - | - | 18.0 | 46.3 | 22.5
Whisper | 7.9 | 13.6 | - | 11.6 | - | - | 10.5 | - | - | - | 14.1 | 20.2 | 13.0
Portuguese
Wav2Vec 2.0 | 12.9 | 20.1 | - | 33.8 | - | - | 17.8 | - | - | 48.5 | 42.7 | 68.1 | 34.8
Whisper | 5.4 | 8.8 | - | 13.1 | - | - | 11.2 | - | - | 21.7 | 22.1 | 42.3 | 17.8
Polish
Wav2Vec 2.0 | 11.5 | 12.7 | - | - | - | - | - | - | 32.1 | - | - | - | 18.8
Whisper | 8.9 | 6.0 | - | - | - | - | - | - | 32.5 | - | - | - | 15.8
Table 4. WER comparison between the results reported and those from the state-of-the-art for Common Voice, MLS, and MTEDx corpora. The best result for each corpus and language is highlighted in bold.
Corpus | Reference | English | German | French | Spanish | Italian | Portuguese | Polish
Common Voice | [11] | - | 7.7 | 12.5 | 10.9 | - | - | -
Common Voice | [18] | - | 7.2 | 11.2 | 6.2 | 6.5 | 6.1 | 7.6
Common Voice | [36] | - | 7.8 | 9.6 | 10.0 | - | - | -
Common Voice | [25] | 10.1 | 7.7 | 14.7 | 6.4 | 8.1 | 7.1 | 9.0
Common Voice | [19] | - | 3.6 | - | - | - | - | -
Common Voice | [56] | - | 9.8 | - | - | - | - | -
Common Voice | [57] | - | - | - | - | - | 9.2 | -
Common Voice | Wav2vec2.0 | 16.1 | 11.9 | 16.7 | 4.7 | 12.8 | 12.9 | 11.5
Common Voice | Whisper-large | 10.0 | 7.1 | 21.7 | 6.2 | 7.9 | 5.4 | 8.9
MLS | [37] | - | 6.5 | 5.6 | 6.1 | 10.5 | 19.5 | 20.4
MLS | [40] | - | 7.4 | 10.0 | 6.9 | 12.0 | 15.6 | 9.8
MLS | [55] | - | 4.1 | 5.0 | 3.7 | 8.2 | 8.0 | 6.6
MLS | [25] | - | 6.6 | 8.9 | 5.4 | 14.3 | 9.2 | 6.6
MLS | [57] | - | - | - | - | - | 12.3 | -
MLS | Wav2vec2.0 | - | 12.9 | 17.0 | 7.2 | 21.1 | 20.1 | 12.7
MLS | Whisper-large | - | 6.7 | 8.0 | 5.3 | 15.8 | 8.8 | 9.9
MTEDx | [43] | - | 42.3 | 19.4 | 16.2 | 16.4 | 20.2 | -
MTEDx | [57] | - | - | - | - | - | 21.0 | -
MTEDx | Wav2vec2.0 | - | 36.7 | 25.3 | 12.9 | 22.2 | 33.8 | -
MTEDx | Whisper-large | - | 21.7 | 23.3 | 9.4 | 11.6 | 13.1 | -
MediaSpeech | [12] | - | - | 19.2 | 18.3 | - | - | -
MediaSpeech | Wav2vec2.0 | - | - | 29.1 | 14.5 | - | - | -
MediaSpeech | Whisper-large | - | - | 35.8 | 15.8 | - | - | -
Table 5. Results of KWS in different languages considering all benchmark datasets. Results in terms of TPR (%).
Model | Common Voice | MLS | TEDLIUMv2 | MTEDx | SWC | MediaSpeech | Voxforge | Debates | Polish Parl. | CORAA | GRACE Clean | GRACE Noisy | AVG.
English
Wav2Vec 2.0 | 93.3 | - | 95.4 | - | 92.5 | - | - | 96.6 | - | - | 94.6 | 79.5 | 92.0
Whisper | 96.8 | - | 97.4 | - | 94.1 | - | - | 97.7 | - | - | 91.8 | 93.6 | 95.2
German
Wav2Vec 2.0 | 91.3 | 96.9 | - | 93.5 | 80.4 | - | 99.8 | - | - | - | 94.3 | 79.7 | 90.8
Whisper | 97.8 | 98.8 | - | 97.7 | 97.6 | - | 99.8 | - | - | - | 97.2 | 90.6 | 97.1
French
Wav2Vec 2.0 | 90.6 | 90.1 | - | 94.5 | - | 82.8 | 90.7 | - | - | - | 90.5 | 60.1 | 85.6
Whisper | 94.6 | 98.0 | - | 93.9 | - | 84.9 | 94.2 | - | - | - | 88.0 | 81.3 | 90.7
Spanish
Wav2Vec 2.0 | 96.7 | 98.1 | - | 98.1 | - | 96.3 | 100.0 | - | - | - | 97.0 | 78.1 | 94.9
Whisper | 98.2 | 99.8 | - | 98.6 | - | 94.4 | 99.8 | - | - | - | 92.0 | 94.2 | 96.7
Italian
Wav2Vec 2.0 | 90.7 | 97.2 | - | 95.5 | - | - | 98.9 | - | - | - | 96.1 | 80.9 | 93.2
Whisper | 97.8 | 99.9 | - | 97.3 | - | - | 99.8 | - | - | - | 98.7 | 96.9 | 98.4
Portuguese
Wav2Vec 2.0 | 93.1 | 93.4 | - | 94.1 | - | - | 99.1 | - | - | 74.3 | 76.8 | 49.5 | 82.9
Whisper | 96.6 | 97.5 | - | 99.4 | - | - | 100.0 | - | - | 88.1 | 88.3 | 81.3 | 93.0
Polish
Wav2Vec 2.0 | 93.9 | 96.9 | - | - | - | - | - | - | 83.3 | - | - | - | 91.4
Whisper | 95.4 | 98.7 | - | - | - | - | - | - | 50.3 | - | - | - | 81.5
Table 6. Results of the FL pilot comparing WERs from Wav2Vec2.0 models trained in a federated or centralized way.
Node | Data | WER Federated | WER Centralized
node-1 | TED-LIUMv2 | 13.5 | 13.3
node-2 | Debates | 12.4 | 12.3
node-3 | Librispeech-other | 7.9 | 7.8
node-4 | Librispeech-clean | 2.8 | 3.2
node-5 | SWC | 25.7 | 24.3
dummy | Librispeech-clean | 3.6 | 3.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
