Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database

Abstract: This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, employed as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve the performance of the systems previously submitted by the authors to the challenge, in which their primary system ranked second. The novel systems are based on both DNN-HMM and E2E acoustic models, trained with both fully supervised and self-supervised learning methods. As a result, the new speech recognition engines clearly outperformed the initial systems, reducing the previous best WER of 19.27 to a new best of 17.60, achieved by the DNN-HMM based system. This work therefore describes an interesting benchmark of the latest acoustic models over a highly challenging dataset, and identifies the most suitable ones depending on the expected quality, the available resources and the required latency.


Introduction
The Albayzín-RTVE 2020 Speech to Text Transcription Challenge (http://catedrartve.unizar.es/s2tc2020.html, accessed on 28 December 2021) called for Automatic Speech Recognition (ASR) systems that were robust against realistic TV shows. Nowadays, applying speech to text technologies to the broadcast domain is a growing trend that aims to bring ASR technology to applications such as subtitling or metadata generation for archives. Although most of this work is still performed manually or through semiautomatic methods (e.g., re-speaking), the current state of the art (SoA) in speech recognition suggests that this technology can be exploited autonomously, with little or no post-editing effort, mainly on contents with optimal audio quality and clean speech conditions. The use of Deep Learning algorithms, along with the increasing availability of speech data, has made it possible to introduce this technology into such a complex scenario through systems based on Deep Neural Networks (DNNs) or more recent architectures based on the End-To-End (E2E) principle.
Besides the broadcast domain, the significant progress in the ASR field has sparked interest in integrating this technology into many other applications and devices. For instance, considering speech as the most natural means of communication between humans, conversational assistants have acquired great relevance in our daily lives, both in personal and professional environments [1]. In addition, other major sectors such as Industry, Healthcare or Automotive have already discovered the usability of speech technologies. The remainder of this paper reviews recent advances in speech recognition, describes the training and evaluation corpora, details the systems and their configuration, and reports the results and the processing time and resources required per system to process the whole test set. Finally, Section 6 draws the main conclusions and the lines of future work.

Recent Advances in Speech Recognition
During the last few years, ASR systems have evolved considerably in acoustic modelling, where the integration of DNNs in combination with Hidden Markov Models (HMMs) outperformed traditional approaches [8]. More recently, efforts have focused on building E2E ASR architectures [9], which directly map the input speech signal to character sequences and therefore greatly simplify training, fine-tuning and inference [10][11][12][13][14]. Once the potential of the E2E architectures for speech recognition was demonstrated [9], and considering the need to reduce the size and complexity of ASR models due to their large hardware requirements, new architectures arose to make these models suitable for deployment on embedded hardware without loss of quality. More recently, driven by the increasing availability of data in major languages and the scarcity of annotated data in minority languages or specific domains, novel approaches have emerged that train big neural models through self-supervised learning methods, making use of hundreds of thousands of hours of unlabelled acoustic data. Nowadays, most of the efforts in the field seem to be focused in this last direction, given the availability of pre-trained models and their high performance when used as feature extractors or when fine-tuned with in-domain data [15].
Nevertheless, the more conventional hybrid acoustic models also continue to evolve, improving benchmark results over well-known datasets like LibriSpeech [16]. Within the Kaldi community, a novel neural network architecture was recently presented, known as Multistream CNN [17]. The Multistream CNN acoustic model was inspired by the work presented in [18] but leaves out the multi-headed self-attention layers. This model processes input speech with diverse temporal resolutions by applying stream-specific dilation rates to the CNNs across multiple streams in order to achieve robustness. A factorised time-delay neural network (TDNN-F) is stacked in each stream, and the dilation rate for the TDNN-F layers in each stream is chosen from multiples of 3, the default sub-sampling rate for both training and decoding. The output embedding vectors from each stream are concatenated and followed by ReLU, batch normalisation and a dropout layer, and finally projected to the output layer via a couple of fully connected layers. This hybrid model achieved very competitive results on the LibriSpeech test-clean/other sets in combination with the efficient self-attentive SRU [19] language model. Specifically, WER values of 1.75 and 4.46 were reported on the test-clean/other partitions of the LibriSpeech dataset, respectively [19].
With the aim of building lighter but competitive E2E models for speech recognition, NVIDIA proposed Quartznet [6], derived from the Jasper architecture [20]. This architecture consists of a new E2E neural acoustic model composed of multiple blocks with residual connections in between. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalisation and ReLU layers. They reached near-SoA error rates on the LibriSpeech dataset, with WER values of 2.69 and 7.25 on the test-clean/other sets [6], respectively. These results were reached with the Quartznet 15×5 model, which contained 18.8 million parameters, in contrast to other larger E2E architectures such as Jasper 10×5 (333 million) [20], PaddlePaddle DeepSpeech2 (49 million) [9] or Wav2vec2.0 (95 to 317 million) [2].
Nowadays, many of the recent works in the ASR field seem to be focused on taking advantage of big acoustic models trained with self-supervised learning methods and large amounts of unlabelled data. Facebook AI demonstrated for the first time that learning powerful representations through self-supervised methods from large amounts of unlabelled speech, and then fine-tuning on transcribed speech, can outperform neural models trained with semi-supervised techniques [2]. Their experiments demonstrated the great potential of large models pre-trained with unlabelled data and the impact of employing different amounts of labelled data for fine-tuning the model. For instance, using a large model pre-trained on 60,000 h and fine-tuned on 10 min of labelled training data, they reached Word Error Rates (WER) of 4.8 and 8.2 on test-clean/other of LibriSpeech, respectively, using a Transformer language model (LM). Increasing the labelled data to 100 h and using the same LM, they achieved WERs of 2.0 and 4.0 on the same test sets.
Nevertheless, self-supervised approaches are challenging, mainly because there is no predefined lexicon for the input sound units during the pre-training phase. Moreover, sound units can be variable in length, since an explicit segmentation is not provided [21]. In order to deal with these problems, Facebook AI released HuBERT (Hidden-Unit BERT), a new approach for learning self-supervised speech representations [3]. Unlike Wav2vec2.0, the HuBERT model learns not only acoustic but also language models from continuous inputs. First, the model encodes unmasked audio inputs into meaningful continuous latent representations, which map to the classical acoustic modelling problem. Next, the model captures the long-range temporal relations between learned representations to reduce the prediction error. This work mainly focuses on the consistency of the k-means mapping from audio inputs into discrete targets, which enables the model to focus on modelling the sequential structure of the input data. HuBERT has shown that it can outperform other SoA approaches to speech representation learning for speech recognition. In the work presented in [21], WER values of 1.8 and 2.9 were reported on test-clean/other of the LibriSpeech dataset, respectively, using a Transformer LM.
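The k-means mapping at the heart of HuBERT's pre-training can be illustrated with a toy sketch. The function name and the data below are hypothetical and do not reproduce the actual HuBERT pipeline; the point is only that continuous feature frames are assigned to discrete pseudo-labels by nearest centroid:

```python
import math

def assign_discrete_targets(frames, centroids):
    """Map continuous feature frames to discrete targets by
    nearest-centroid (Euclidean) assignment, mimicking the
    k-means pseudo-labelling step used by HuBERT."""
    targets = []
    for frame in frames:
        distances = [math.dist(frame, c) for c in centroids]
        targets.append(distances.index(min(distances)))
    return targets

# Two toy centroids and three 2D "frames"
centroids = [(0.0, 0.0), (10.0, 10.0)]
frames = [(1.0, 0.5), (9.0, 9.5), (0.2, 0.1)]
labels = assign_discrete_targets(frames, centroids)  # [0, 1, 0]
```

In the real model, the frames are MFCC features (first iteration) or intermediate HuBERT representations (later iterations), and the discrete targets serve as prediction labels for the masked positions.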
At the language model level, deep Transformer or LSTM-RNN based language models have shown better performance than traditional n-gram models, especially during the re-scoring of the initial lattices [22]. More recently, the novel self-attentive Simple Recurrent Unit (SRU) architecture has shown interesting improvements in the rescoring of initial hypotheses [19].
The ASR engines presented in this work were built following the hybrid DNN-HMM, Quartznet and Wav2vec2.0 architectures, in order to compare the performance of systems trained with the same corpora, as well as their feasibility for deployment on different platforms, from high-performance servers to embedded systems.

Corpora Description
In this section, the training and evaluation corpora are described in detail. With respect to the training corpora, all the systems presented in this work shared the same acoustic and text data to train and/or fine-tune the initial models. On the other hand, the evaluation data are composed entirely of the RTVE2020 database [4].

Acoustic Corpus
The acoustic corpus was composed of annotated audio contents from seven different datasets, summing up a total of 743 h and 35 min. Table 1 presents the final number of hours containing only speech in each of the datasets. The RTVE2018 dataset [23] was released by RTVE and comprises a collection of TV shows drawn from diverse genres and broadcast by the public Spanish National Television (RTVE) from 2015 to 2018. This dataset was originally composed of 569 h and 22 min of audio with a high portion of imperfect transcriptions, which therefore could not be used as such for training. A forced alignment was applied in order to recover only the segments transcribed with high literality, obtaining a total of 112 h and 30 min of nearly correctly transcribed speech segments.
The SAVAS corpus [24] is composed of broadcast news contents in Spanish from 2011 to 2014 from the Basque Country's public broadcast corporation EiTB (Euskal Irrati Telebista), and includes annotated and transcribed audios in both clean (studio) and noisy (outside) conditions. The IDAZLE corpus comprises TV shows from the EiTB broadcaster as well, covering a more varied and rich collection of programs of different genres and styles. TV shows also make up the A la Carta acoustic corpus (https://www.rtve.es/alacarta/, accessed on 28 December 2021), which includes 265 contents broadcast between 2018 and 2019 by RTVE.
The Common Voice dataset [25] is a crowdsourcing project started by Mozilla to create a free and massively-multilingual speech corpus to train speech recognition systems. Finally, the well-known and clean Albayzin [26] and Multext [27] datasets were also included, mainly to favour the initial training steps and alignments of the systems.

Text Corpus
Regarding text data, different sources were employed to obtain language and domain coverage as close as possible to the contents of the RTVE2020 database. Table 2 presents the number of words provided by each of the text corpora. A total of almost 661 million words were thus compiled and used to estimate the language models for decoding and rescoring purposes. The Transcriptions text corpus corresponds to the text transcriptions of all the audio contents used to train the acoustic models. The RTVE2018 text corpus contains all the text transcriptions and re-spoken subtitles included within the RTVE2018 dataset, whilst the A la Carta corpus comprises subtitles taken from the "A la Carta" web portal, as a result of a collaboration between RTVE and Vicomtech. Finally, the Wikipedia corpus contains texts from the Wikipedia portal gathered in 2017 from Wikimedia (https://dumps.wikimedia.org/, accessed on 28 December 2021).

RTVE2020 Database
The RTVE2020 database served as the principal evaluation set of this work and of the Albayzín Speech To Text Transcription Challenge 2020. It is composed of a series of TV shows of different genres which were broadcast by the public Spanish Television (RTVE) from 2018 to 2019. The database comprises a total of 55 h and 40 min of audio, and it was fully transcribed by humans to obtain literal references. These references are presented in STM format, which contains time-marked segments, each including the waveform's filename and channel identifier, the speaker, the begin and end times, an optional subset label and the literal transcription of the segment. The TV shows included in the database are presented in Table 3.

Table 3. TV shows included in the RTVE2020 dataset. This information was partially gathered from [4].

TV Program Duration Description
Ese programa del que Ud. habla 01:58:36 A TV program that reviews daily political, cultural, social and sports news from the perspective of comedy.
Neverfilms 00:11:41 A web series that humorously parodies trailers of series and movies well known to the public.
Si fueras tú 00:51:14 An interactive series that tells the story of a young girl.
Bajo la red 00:59:01 A youth fiction series whose plot revolves around a chain of favours on the internet.

As can be observed in the descriptions of the programs shown in Table 3, most of the TV shows include content with spontaneous speech, which significantly increases the difficulty of automatically transcribing this database. Although the database is not balanced with respect to the duration of each TV program, most of the shows share similar artefacts typical of informal speech.
Los desayunos de RTVE and Aquí la tierra are the two shows with the longest duration in the database. Although some segments with orderly and formal speech can be found in the former, it also includes interviews and political debates. Similarly, Aquí la tierra includes formal and informal speech segments as well, combining weather reports usually performed by one speaker with interviews in which spontaneous speech is much more present. The soap opera Mercado central is the third in duration, and its main difficulty is related to its acted and emotional speech, which ASR engines are not usually trained with. Nevertheless, probably the most complex contents correspond to Cómo nos reímos, Vaya crack and Ese programa del que Ud. habla, which together sum up almost ten hours in the database. These contents are composed entirely of spontaneous speech, including sketches, overlapping speech, artistic performances, laughs and fillers, among other artefacts. Comando actualidad adds the difficulty of including spontaneous interviews and reports on the street, whilst Imprescindibles incorporates multi-channel and far-field low-quality recordings. Finally, programs like Millenium and Versión española, which sum up to less than five hours, pose a priori a lower difficulty, since they include content with more formal speech, although the latter can include excerpts from movies.
In summary, the RTVE2020 database is a challenging evaluation set of Spanish TV shows, dominated by the types of content that are most problematic for today's ASR engines.
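As an illustration of the STM reference format described above, the segments can be read with a minimal parser. This is a sketch that follows the usual STM field layout (filename, channel, speaker, begin, end, optional '<...>' subset label, transcript); the helper name and sample line are ours, not part of the challenge tooling:

```python
def parse_stm_line(line):
    """Parse one STM reference segment into its fields: waveform
    filename, channel, speaker, begin/end times in seconds, an
    optional subset label (e.g. '<o,f0,male>') and the transcript."""
    fields = line.split()
    filename, channel, speaker, begin, end = fields[:5]
    rest = fields[5:]
    label = None
    if rest and rest[0].startswith("<"):
        label, rest = rest[0], rest[1:]
    return {
        "filename": filename,
        "channel": channel,
        "speaker": speaker,
        "begin": float(begin),
        "end": float(end),
        "label": label,
        "transcript": " ".join(rest),
    }

segment = parse_stm_line("show1.wav 1 spk01 12.50 15.80 <o,f0,male> hola a todos")
```

A real scoring pipeline would also skip ';;' comment lines and handle the ignore-segment markers used by NIST tools, omitted here for brevity.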

Systems Description and Configuration
In this section, all the systems presented in this work are described in more detail. In the first subsection, the three novel ASR architectures are described. These architectures include systems built on top of (1) the Kaldi toolkit, (2) the Quartznet Q15×5 architecture of NVIDIA, and (3) the Wav2vec2.0 model.
On the other hand, the two systems selected as baselines are presented in the second subsection. These baseline systems correspond to the DNN-HMM based primary system presented to the Albayzín S2T Challenge 2020, which ranked second in the competition, and the third contrastive system based on the Quartznet Q5×5 architecture.

Multistream CNN Based System
The Multistream CNN based ASR engine was built on top of the Kaldi toolkit through the nnet3 DNN setup, following the egs/librispeech/s5/local/chain/run_multistream_cnn_1a.sh recipe from the ASAPP Research repository (https://github.com/asappresearch, accessed on 28 December 2021).
The Multistream CNN architecture is illustrated in Figure 1. The acoustic model is composed of an initial set of five 2D-CNN layers in charge of processing the input speech frames, which are augmented dynamically through the SpecAugment [28] technique. The embedding vector output by the single-streamed set of CNN layers at each time step is then fed into each of the three stacks of TDNN-F layers, configured with dilation rates of 6-9-12. Each stack is composed of 17 TDNN-F layers, with an internal cell dimension of 512, a bottleneck dimension of 80 and a dropout schedule of '0,0@0.20,0.5@0.5,0'. The number of training epochs was set to six, with initial and final learning rates of 10^-3 and 10^-5, respectively, and a mini-batch size of 64. The input vector corresponded to a concatenation of 40-dimensional high-resolution MFCC coefficients, augmented through speed (using factors of 0.9, 1.0 and 1.1) [29] and volume (with a random factor between 0.125 and 2) [30] perturbation techniques, and the appended 100-dimensional i-vectors. This system included a 3-gram language model for decoding and a 4-gram pruned RNNLM model for lattice rescoring, following the work presented in [31]. The 3-gram LM was trained with texts from the Transcriptions, RTVE2018 and A la Carta corpora presented in Table 2, and the 4-gram pruned RNNLM model was estimated adding the Wikipedia text corpus as well.
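To see the effect of the 6-9-12 dilation configuration, the temporal receptive field of each stream can be estimated with a short calculation. This assumes each TDNN-F layer spans a ±1 frame context (an effective kernel of 3); the helper below is illustrative and not part of the Kaldi recipe:

```python
def stream_receptive_field(num_layers, dilation, context=1):
    """Receptive field (in frames) of a stack of dilated TDNN-F
    layers, each spanning +/- `context` frames at the given
    dilation rate."""
    return 1 + num_layers * 2 * context * dilation

# 17 TDNN-F layers per stream with dilation rates 6, 9 and 12
fields = {d: stream_receptive_field(17, d) for d in (6, 9, 12)}
# -> progressively wider temporal contexts across the three streams
```

Under this assumption the three streams cover roughly 205, 307 and 409 frames, which is what gives the model its diverse temporal resolutions.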

Quartznet Q15×5 Based System
The Quartznet family of models are E2E ASR architectures based entirely on 1D time-channel separable convolutional layers with residual connections, as illustrated in Figure 2. This design derives from the Jasper architecture [20] but with many modifications focused on considerably reducing the number of parameters and, therefore, the computing resources needed. As the most novel architecture, we trained and evaluated the Q15×5 variant, described as follows. The model initially integrates a 1D convolutional layer (kernel (k) = 33, output channels (c) = 256) processing the spectrogram input from each speech frame. This layer is then followed by five groups of blocks. In the Q15×5 architecture, each group is composed of a block B_i repeated three times with residual connections in between. Each B_i is composed of a module repeated five times, consisting of (i) a k-sized depth-wise convolutional layer, (ii) a point-wise convolution, (iii) a batch normalisation layer, and (iv) a ReLU. The CNN configuration for each B_i was: B1 (k = 33, c = 256), B2 (k = 39, c = 256), B3 (k = 51, c = 512), B4 (k = 63, c = 512) and B5 (k = 75, c = 512). Finally, there are three additional convolutional layers, C1 (k = 87, c = 512), C2 (k = 1, c = 1025) and a point-wise convolutional layer (k = 1, c = labels), followed by a Connectionist Temporal Classification (CTC) layer.
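The parameter savings of the 1D time-channel separable convolutions can be checked with a quick back-of-the-envelope calculation (biases and batch-norm parameters ignored; the helper names are ours):

```python
def full_conv1d_params(k, c_in, c_out):
    """Weights of a standard 1D convolution: k * c_in * c_out."""
    return k * c_in * c_out

def separable_conv1d_params(k, c_in, c_out):
    """Depth-wise convolution (k weights per input channel) followed
    by a point-wise (1x1) convolution mixing the channels."""
    return k * c_in + c_in * c_out

# Block B1 of Q15x5: kernel 33, 256 channels in and out
full = full_conv1d_params(33, 256, 256)      # 2,162,688 weights
sep = separable_conv1d_params(33, 256, 256)  #    73,984 weights
```

For this block the separable form uses roughly 29 times fewer weights than a standard convolution, which is the main reason the whole Q15×5 model stays under 19 million parameters.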
During training, a CTC loss function was employed to measure the prediction errors, together with the Novograd optimiser with beta values of 0.8 and 0.5, and a triangular cyclical learning rate policy over five cycles of 60 epochs for a total of 300 epochs, as described in [32]. The initial and minimum learning rates were set to 0.015 and 10^-5, respectively, whilst the weight decay was set to 10^-3. The training process was performed on four GPU cards, applying a batch size of 20 each and mixed-precision training. The resulting Q15×5 network configuration contained 18.9 million parameters.
Additionally, a 5-gram language model was trained with the Transcriptions, RTVE2018 and A la Carta corpora. This language model was employed during decoding with a CTC beam-search decoder based on the pyctcdecode library [33]. The decoding was performed with a beam width of 256, α = 0.5 and β = 1.
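The pyctcdecode beam search combines acoustic scores with the n-gram LM through the α (LM weight) and β (word insertion bonus) parameters. The underlying CTC decoding rule itself, merge repeated labels and then drop blanks, can be illustrated with a minimal greedy sketch (a toy illustration, not the beam-search decoder actually used):

```python
def ctc_greedy_collapse(best_path, blank="_"):
    """Collapse a best-path CTC label sequence: merge consecutive
    repeated labels, then remove the blank symbol."""
    out = []
    prev = None
    for sym in best_path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Frame-wise argmax labels -> collapsed transcription
decoded = ctc_greedy_collapse(list("hh_ool_aa"))  # "hola"
```

Note how the blank separates genuine repetitions: "aa_a" collapses to "aa", while "aaa" collapses to a single "a".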

Wav2vec2.0 Based System
Wav2vec2.0 [2] is a self-supervised E2E architecture based on CNN and Transformer layers, schematically represented in Figure 3. The Wav2vec2.0 model maps speech audio through a multi-layer convolutional feature encoder f : χ → Z to latent speech representations z_1, ..., z_T, which are fed into a Transformer network g : Z → C to output context representations c_1, ..., c_T. The latent representations are also quantised to q_1, ..., q_T to represent the targets in the self-supervised learning objective [2,34]. The feature encoder contains seven blocks, and the temporal convolutions in each block include 512 channels with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths (10, 3, 3, 3, 3, 2, 2). The Transformer was composed of 24 blocks, with a model dimension of 1024, an inner dimension of 4096 and a total of 16 attention heads. As the main baseline Wav2vec2.0 model, we selected the pre-trained Wav2Vec2-XLS-R-300M model (https://huggingface.co/facebook/wav2vec2-xls-r-300m, accessed on 28 December 2021), which was pre-trained in a self-supervised manner with 436k h of unlabelled speech data in 128 languages from the VoxPopuli [35], MLS [36], CommonVoice [25], BABEL (a corpus collected under the IARPA BABEL research program) and VoxLingua107 [37] corpora. Wav2Vec2-XLS-R-300M is one of the versions of Facebook AI's XLS-R multilingual model [38], composed of 300 million parameters.
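From the stride and kernel configuration above, the encoder's overall frame rate and receptive field follow directly for 16 kHz input audio:

```python
from math import prod

strides = (5, 2, 2, 2, 2, 2, 2)
kernels = (10, 3, 3, 3, 3, 2, 2)

total_stride = prod(strides)        # samples between consecutive latent frames
frame_rate = 16_000 / total_stride  # latent vectors emitted per second

# Receptive field of one latent frame, accumulated layer by layer:
# each layer widens it by (kernel - 1) times the product of the
# strides of all preceding layers.
rf, jump = 1, 1
for k, s in zip(kernels, strides):
    rf += (k - 1) * jump
    jump *= s
```

The encoder therefore emits one 512-dimensional latent vector every 320 samples (20 ms, i.e. 50 per second), each covering a 400-sample (25 ms) window of raw audio.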
With the aim of adapting this pre-trained model to the domain, we fine-tuned the Wav2Vec2-XLS-R-300M model for a total of 50 epochs in two steps, using the acoustic corpus defined above, augmented dynamically during training through the SpecAugment technique. First, we trained the pre-trained model for 30 epochs with a maximum learning rate of 10^-4. We used the cosine function as the learning rate annealing function, with a warm-up during the initial 10% of the training, whilst the batch size was set to 256. This model was later fine-tuned for 20 additional epochs, modifying the maximum learning rate to 5 × 10^-5.
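The schedule used in each fine-tuning step, a 10% linear warm-up followed by cosine annealing, can be sketched as follows. The annealing floor (zero here) and the helper name are our assumptions; only the warm-up fraction and maximum rates come from the description above:

```python
import math

def lr_at(step, total_steps, max_lr, warmup_frac=0.10):
    """Linear warm-up over the first `warmup_frac` of training,
    then cosine annealing from max_lr down to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

# First step: max_lr = 1e-4; second step would rerun with max_lr = 5e-5
```

The rate rises linearly to the maximum during warm-up, then decays smoothly, which tends to stabilise the early updates when adapting a large pre-trained model.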
Finally, the decoding was performed with the same 5-gram language model employed for the decoding of the previous Q15×5 based system, but with a beam-width of 256, α = 0.3 and β = 1.6.

DNN-HMM Based System
The first baseline system considered in this work corresponds to the primary system [7] presented in the Albayzín-RTVE 2020 Speech to Text Transcription Challenge. This ASR engine was built through the nnet3 DNN setup of the Kaldi recognition system, using the so-called chain acoustic model based on Convolutional Neural Network (CNN) layers and a factorised time-delay neural network (TDNN-F) [39], which reduces the number of parameters of the network by factorising the weight matrix of each TDNN layer into the product of two low-rank matrices.
The acoustic model integrated a CNN-TDNN-F based network, with six CNN layers followed by 12 TDNN-F layers. The internal cell dimension of the TDNN-F layers was 1536, with a bottleneck dimension of 160 and a dropout schedule of '0,0@0.2,0.5@0.5,0'. The number of training epochs was set to four, with a learning rate of 1.5 × 10^-4 and a mini-batch size of 64. The input vector corresponded to a concatenation of 40-dimensional high-resolution MFCC coefficients, augmented through speed (using factors of 0.9, 1.0 and 1.1) [29] and volume (with a random factor between 0.125 and 2) [30] perturbation techniques, and the appended 100-dimensional i-vectors.
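The parameter reduction obtained by the TDNN-F factorisation can be quantified with the dimensions above. This is a rough estimate that ignores the temporal context splicing and biases of the actual Kaldi layer:

```python
def tdnnf_factorised_weights(dim, bottleneck):
    """Weights of one factorised layer: the dim x dim matrix is
    replaced by (dim x bottleneck) times (bottleneck x dim), with
    the first factor constrained to be semi-orthogonal."""
    return dim * bottleneck + bottleneck * dim

full_weights = 1536 * 1536                           # unfactorised layer
fact_weights = tdnnf_factorised_weights(1536, 160)   # factorised layer
```

With a cell dimension of 1536 and a bottleneck of 160, each layer shrinks from about 2.36 million to about 0.49 million weights, a 4.8-fold reduction.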
This system included a 3-gram language model for decoding and a 4-gram pruned RNNLM model for lattice-rescoring following the work presented in [31]. The 3-gram LM was trained with texts coming from the Transcriptions, RTVE2018 and A la Carta corpora presented in Table 2, and the 4-gram pruned RNNLM model was estimated adding the Wikipedia text corpus as well.

Quartznet Q5×5 Based System
This ASR engine was presented as the third contrastive system to the Albayzín-RTVE 2020 Speech to Text Transcription Challenge, and we selected it as the second baseline system in this work in order to compare its performance with the previously presented Quartznet Q15×5 based system. The Quartznet Q5×5 architecture is similar to the Q15×5 architecture explained in Section 4.1.2 but, in this case, each group of blocks is composed of a block B_i repeated only once instead of three times, thus decreasing the total number of parameters from 18.9 to 6.7 M.
In contrast to the Q15×5 based system, the training of the Q5×5 was performed for 100 epochs, with cosine annealing of the learning rate and a batch size of 40.
The decoding was performed using the same 5-gram language model described above. The parameters of the beam-search CTC decoder corresponded to a beam width of 1000, α = 1.2 and β = 0.

Results and Resources
Table 4 presents the total WER values for each system over all the TV programs in the RTVE2020 database. The first objective of this work was to improve our initially best CNN-TDNN-F based ASR engine, presented as the primary system to the Albayzín-RTVE 2020 S2T Transcription Challenge, which ranked second among all the systems presented to the competition. Although these improvements could target both acoustic and language models, we decided to put our efforts into enhancing the acoustic model, which probably poses the most challenging task, while maintaining the same Kaldi based ASR architecture. Moreover, one of the main conclusions of the authors' previous study [7] was that improving acoustic models helped our ASR engines more than improving language models, since most of the evaluation contents included spontaneous speech and our text corpus was mainly composed of contents with formal language. Given the difficulty of the RTVE2020 evaluation dataset at the acoustic and phonemic level, we decided to evaluate a more complex CNN and TDNN-F based neural network. The Multistream CNN architecture clearly improved the performance of the CNN-TDNN-F acoustic model by tripling the stack of TDNN-F layers with diverse temporal resolutions. Therefore, replacing the CNN-TDNN-F acoustic model with the novel Multistream CNN one, while maintaining the same lexicon and language models for decoding (3-gram model) and rescoring (4-gram based RNNLM model) of the initial lattices, improved the results from a WER of 19.27 to a very competitive 17.60.
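All comparisons in this section rely on the Word Error Rate. For reference, it can be computed with a word-level Levenshtein distance normalised by the reference length; the following is a standard textbook implementation, not the scoring tool used in the challenge:

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = (substitutions + deletions + insertions) / reference
    words, via word-level Levenshtein distance in O(len(ref)*len(hyp))."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                          dp[j - 1] + 1,     # insertion
                                          prev_diag + cost)  # substitution
    return 100.0 * dp[-1] / len(ref)
```

For example, one deleted word out of a three-word reference yields a WER of 33.33%.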
The second objective of this work was to see how we could improve the performance of NVIDIA's Quartznet Q5×5 based system. Although, according to the current state of the art, these E2E systems do not perform as well as the Kaldi based engines, the Quartznet architectures present an interesting option for integrating ASR functionalities into embedded systems, considering their scarce HW resources and the low inference times required. With the aim of improving the results obtained by the Q5×5 based system, we included three principal improvements. First, we extended the architecture by adding two more groups of blocks, thus building a bigger Q15×5 architecture based ASR system. The second improvement corresponded to the inclusion of a triangular cyclical learning rate policy. This method [32] lets the learning rate vary cyclically between reasonable boundary values instead of monotonically decreasing during training, thus improving classification accuracy. Finally, we included more training epochs, extending the 100 epochs employed for the Q5×5 network to the 300 training epochs used for the new Q15×5 acoustic model. Applying these evolutions, we managed to improve the error rate by 5.47 points, from the WER of 28.42 achieved by the Q5×5 baseline to the new WER of 22.95 obtained by the Q15×5 based novel architecture. The 5-gram based external language model remained the same for both systems.
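The triangular cyclical policy [32] applied to the Q15×5 training can be sketched with the formulation from the original paper, using the stated boundary rates of 10^-5 and 0.015; the step size below is illustrative, not the value used in the actual training:

```python
import math

def triangular_lr(step, step_size, base_lr, max_lr):
    """Triangular cyclical learning rate (Smith, 2017): the rate
    oscillates linearly between base_lr and max_lr, completing a
    full cycle every 2 * step_size steps."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

The rate starts at the base value, peaks halfway through each cycle and returns to the base, rather than decaying monotonically.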
The last objective of this study was to evaluate the performance of the self-supervised Wav2vec2.0 model in these challenging scenarios. Although these types of models seem more oriented to situations in which there is not enough annotated in-domain data to train acoustic models from scratch, advantages such as (i) the clarity of the architecture, (ii) the extensive acoustic knowledge obtained by the pre-trained model, and (iii) the low latency in inference make it a very interesting ASR system to explore. To this end, we selected the Wav2Vec2-XLS-R-300M model, which was pre-trained with 436k hours of unlabelled speech data from diverse corpora and conditions, as described in Section 4.1.3. In this case, it is worth mentioning that the fine-tuned Wav2Vec2-XLS-R-300M model improved on the performance of the Q15×5 model, even though the former was fine-tuned for only 50 epochs while the Quartznet model was trained for a total of 300 epochs. The Wav2Vec2-XLS-R-300M model fine-tuned with the in-domain data reached a very promising WER of 20.68. This result clearly demonstrates the power of self-supervised models trained with huge amounts of unlabelled data, while maintaining similar inference latencies in comparison with the lighter Quartznet architectures (see Table 4). For the language model, the same 5-gram based external language model was employed.
Table 5 presents the total WER results obtained by the systems for each TV program in the RTVE2020 dataset. As can be observed, the systems perform consistently across all the contents in the RTVE2020 dataset. The Multistream CNN based system obtained the best results for all the contents except for Millenium, for which the baseline CNN-TDNN-F system performed best, and for Imprescindibles, where Wav2vec2.0 performed slightly better. In contrast, the Quartznet Q5×5 is the system with the worst performance on all the contents. The rest of the systems maintain consistency across most TV programs with respect to their total WER. The Wav2vec2.0 based system outperforms the Quartznet Q15×5 based ASR engine in all cases except one (Millenium), where the difference of 0.27 can be considered negligible. In the rest of the contents, the differences between these two E2E systems are remarkable, even though the Wav2vec2.0 system was fine-tuned with six times fewer epochs than the Quartznet Q15×5 based system. The greatest differences between the two E2E systems are found on Vaya crack and Imprescindibles, which were classified as two of the most complex programs given their spontaneous style and far-field low quality, respectively.
In general, the behaviour of the systems with regard to the content profiles is as expected. In those programs with cleaner speech, the WER decreases significantly compared to programs with adverse acoustic conditions, overlapping or spontaneous speech. More specifically, in TV shows such as Aquí la Tierra, Los desayunos de RTVE, Millenium and Versión Española, with controlled acoustic conditions (studio) and many segments with formal and well-structured speech, the error rates are below the 20% mark and lower than in the other TV shows. In contrast, in more complicated contents like Cómo nos reímos, Boca Norte or Wake-up, which include many segments with spontaneous and acted speech, acoustically adverse conditions and overlapping, the results degrade appreciably for all ASR engines.

Processing Time and Resources
The decoding processes of the transcription systems were performed on a server with an Intel Xeon E5-2683 v4 CPU at 2.10 GHz, 256 GB of DDR4 2400 MHz RAM and seven GPUs. The GPU used for decoding was an NVIDIA GeForce RTX 3090 Ti graphics card with 24 GB of memory.
Table 6 presents the processing time and computational resources needed by each ASR system to decode the 55 h and 40 min of the RTVE2020 dataset. It should be noted that the first four systems were decoded using a single CPU core, whilst the Wav2vec2.0 based system took advantage of 40 CPU cores. In terms of RAM, the Kaldi based and Quartznet based architectures occupied a similar amount of memory, mainly determined by the size of the language model. In contrast, the Wav2vec2.0 based system required much more memory, since its pre-trained model comprises 300 million parameters. Although the Multistream CNN based system outperforms the CNN-TDNN-F in quality (see Table 5), it is worth noting that it took almost the same time to decode the whole RTVE2020 database, even though its acoustic model is much bigger and the decodings were performed using a single CPU core. Decoding could be sped up by segmenting the original waveform with a Speech Activity Detection (SAD) module and processing the segments in parallel over several CPU threads. However, the SAD module could introduce new segmentation errors that would irreparably impact the final result. Finally, it is worth highlighting how competitive the latency of the Wav2vec2.0 based system is (RTF of 0.14) despite the large size of the model.
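The real-time factor (RTF) used here is simply processing time divided by audio duration, so an RTF below 1 means faster-than-real-time decoding. A short sketch backing out the implied wall-clock time for the Wav2vec2.0 system; the 0.14 figure is the reported RTF, while the derived hours are our own arithmetic, not a measurement from the paper:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # RTF < 1 means the system decodes faster than real time
    return processing_seconds / audio_seconds

AUDIO_SECONDS = 55 * 3600 + 40 * 60   # 55 h 40 min of RTVE2020 audio = 200,400 s
processing = 0.14 * AUDIO_SECONDS     # invert the reported RTF to estimate decoding time

print(real_time_factor(processing, AUDIO_SECONDS))  # 0.14
print(round(processing / 3600, 2))                  # 7.79 hours of wall-clock decoding
```

Under this definition, an RTF of 0.14 over the full dataset corresponds to roughly eight hours of decoding on the 40-core configuration described above.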

Conclusions and Future Work
In this work, three novel ASR architectures have been presented and evaluated on the RTVE2020 database. These systems are an evolution of two similar ASR engines, previously built and evaluated on the same database, which were taken as the baseline systems.
The RTVE2020 database is the result of an interesting and necessary initiative to collect real broadcast Spanish speech data with the aim of building competitive ASR engines for this language. Together with the RTVE2018 database [40], it constitutes the largest, most complete and most challenging speech corpus available to the community for the Spanish language.
Over this interesting and challenging dataset, we explored different alternatives to outperform the initial DNN-HMM and Quartznet Q5×5 based systems submitted to the Albayzín S2T Transcription Challenge 2020. In total, we presented three novel ASR engines, which clearly improved the performance of these baseline systems. The first system, based on the novel Multistream CNN acoustic model, reached the best results for almost all the contents, while the Quartznet Q15×5 outperformed the Q5×5 model by almost six WER points, by extending the size of the model, training for more epochs and applying a triangular cyclical method to compute the optimal learning rate. In addition, we evaluated the performance of the Wav2vec2.0 self-supervised model, which achieved better results than the Quartznet based systems after a fine-tuning process of only 50 epochs over the pre-trained model. In summary, this work constitutes an interesting and complete benchmark of several architectures for selecting the optimal ASR engine depending on the required quality, the available hardware resources and the expected latency.
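The triangular cyclical learning-rate policy mentioned above ramps the learning rate linearly between a lower and an upper bound over each cycle, rather than decaying it monotonically. A minimal sketch of the schedule; the bounds and step size below are illustrative placeholders, not the values used for the Quartznet Q15×5 training:

```python
import math

def triangular_lr(step: int, step_size: int, base_lr: float, max_lr: float) -> float:
    # Triangular cyclical learning rate: ramps linearly from base_lr up to
    # max_lr and back down over each cycle of 2 * step_size training steps.
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Illustrative schedule: the LR rises from 1e-4 toward 1e-3 over 1000 steps,
# peaks at mid-cycle, then falls back to 1e-4 by the end of the cycle.
for step in (0, 500, 1000, 1500, 2000):
    print(step, triangular_lr(step, 1000, 1e-4, 1e-3))
```

Sweeping the learning rate in this way serves both to locate a good operating range for the learning rate and to periodically escape sharp minima during training.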
As future work, now that the acoustic models have been considerably enhanced, we will focus on improving the language models by incorporating new rescoring processes based on Transformer or Self-attentive Simple Recurrent Unit architectures. It would also be interesting to explore strategies to reduce the size of the Multistream CNN acoustic model, by reducing the number of TDNN-F layers and/or the internal cell dimensions without loss of quality. These techniques would shrink the model and speed up inference. Moreover, bigger Wav2vec2.0 pre-trained models, containing one and two billion parameters, will also be explored, with longer fine-tuning processes in order to allow the model to better adapt to the application domains. Finally, new self-supervised models such as HuBERT [3] and SUPERB [41] will also be studied on the same dataset.