Special Issue "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages"

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (31 December 2021) | Viewed by 14140

Special Issue Editors

Prof. Dr. Francesc Alías
Guest Editor
Grup de recerca en Tecnologies Mèdia (GTM), La Salle—Universitat Ramon Llull, Barcelona, Spain
Interests: speech processing; speech analysis and synthesis; voice production; expressive speech; human-computer interaction; acoustic event detection; acoustic signal processing; machine listening; real-time noise monitoring; impact of noise events; real-life acoustic datasets; wireless acoustic sensor networks
Prof. Dr. Valentin Cardeñoso-Payo
Guest Editor
ECA-SIMM Research Group, Universidad de Valladolid, Valladolid, Spain
Interests: human language technology; human computer interaction; graphics and visualization; computational prosody; computational physics
Dr. David Escudero-Mancebo
Guest Editor
ECA-SIMM Research Group, Universidad de Valladolid, Valladolid, Spain
Interests: artificial intelligence; spoken technologies; human computer interaction; graphics and visualization; computational prosody
Dr. César González-Ferreras
Guest Editor
ECA-SIMM Research Group, Universidad de Valladolid, Valladolid, Spain
Dr. António Joaquim da Silva Teixeira
Guest Editor
BIT – Biomedical Informatics and Technologies, Institute of Electronics and Informatics Engineering of Aveiro (IEETA), Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal
Interests: speech processing; speech synthesis; speech production studies; dialog systems; human-computer interaction; multimodal interaction; silent speech interfaces; natural language processing; evaluation of auditive processing; pervasive assistance; smart environments; development for all; computer applications in health

Special Issue Information

Dear Colleagues,

Following previous editions, IberSPEECH 2020 will be held in Valladolid from 24 to 26 March 2021. The IberSPEECH event, the fifth of its kind under this name, brings together the XI Jornadas en Tecnologías del Habla and the VII Iberian SLTech Workshop. The conference provides a platform for scientific and industrial discussion and exchange around Iberian languages, with the following main topics of interest:

  1. Speech technology and applications;
  2. Human speech production, perception, and communication;
  3. Natural language processing and applications;
  4. Speech, language, and multimodality;
  5. Resources, standardization, and evaluation.

The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language processing, paying special attention to work focused on Iberian languages. We invite researchers in these fields to contribute to this issue, which covers all topics of IberSPEECH 2020 (https://iberspeech2020.eca-simm.uva.es/authors/call-for-papers/). Authors of papers selected from the conference are asked to follow the journal's instructions for “Preprints and Conference Papers”: https://www.mdpi.com/journal/applsci/instructions#preprints

Prof. Dr. Francesc Alías
Prof. Dr. Valentin Cardeñoso-Payo
Dr. David Escudero-Mancebo
Dr. César González-Ferreras
Prof. Dr. António Joaquim da Silva Teixeira
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website, then proceeding to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2300 CHF (Swiss francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Speech technology and applications
  • Human speech production, perception, and communication
  • Natural language processing and applications
  • Speech, language, and multimodality
  • Resources, standardization, and evaluation

Published Papers (24 papers)


Research

Article
Data-Driven Analysis of European Portuguese Nasal Vowel Dynamics in Bilabial Contexts
Appl. Sci. 2022, 12(9), 4601; https://doi.org/10.3390/app12094601 - 03 May 2022
Abstract
European Portuguese (EP) is characterized by a large number of nasals encompassing five phonemic nasal vowels. One notable characteristic of these sounds is their dynamic nature, involving both oral and nasal gestures, which makes their study and characterization challenging. The study of nasal vowels, in particular, has been addressed using a wide range of technologies: early descriptions were based on acoustics and nasalance, later expanded with articulatory data obtained from electromagnetic articulography (EMA) and real-time magnetic resonance imaging (RT-MRI). While providing important results, these studies were limited by the discrete nature of the EMA pellets, which cover only a small part of the vocal tract; by the low temporal resolution of the MRI data; and by the small number of speakers. To tackle these limitations, and to take advantage of recent advances in RT-MRI allowing 50 fps, novel articulatory data were acquired for 11 EP speakers. The work presented here explores the capabilities of recently proposed data-driven approaches to model articulatory data extracted from RT-MRI and to assess their suitability for investigating the dynamic characteristics of nasal vowels. To this end, we explore vocal tract configurations over time, along with the coordination of velum and lip aperture in oral and nasal bilabial contexts, for nasal vowels and their oral congeners. Overall, the results show that both generalized additive mixed models (GAMMs) and functional linear mixed models (FLMMs) provide an elegant approach to tackling data from multiple speakers. More specifically, we found oro-pharyngeal differences in the tongue configurations for low and mid nasal vowels: vocal tract aperture was larger in the pharyngeal region and smaller in the palatal region for the three non-high nasal vowels, providing evidence of a raised and more advanced tongue position for the nasal vowels. Even though this work is aimed at exploring the applicability of the methods, the outcomes already highlight interesting data for the dynamic characterization of EP nasal vowels. Full article

Article
Automatic Classification of Synthetic Voices for Voice Banking Using Objective Measures
Appl. Sci. 2022, 12(5), 2473; https://doi.org/10.3390/app12052473 - 27 Feb 2022
Abstract
Speech is the most common way of communication among humans. People who cannot communicate through speech due to partial or total loss of the voice can benefit from Alternative and Augmentative Communication devices and Text to Speech technology. One problem of using these technologies is that the included synthetic voices might be impersonal and badly adapted to the user in terms of age, accent or even gender. In this context, the use of synthetic voices from voice banking systems is an attractive alternative. New voices can be obtained by applying adaptation techniques to recordings from people with healthy voices (donors) or from the users themselves before losing their own voice. In this way, the goal is to offer a wide voice catalog to potential users. However, as there is no control over the recording or the adaptation processes, some method to control the final quality of the voice is needed. We present the work developed to automatically select the best synthetic voices using a set of objective measures and a subjective Mean Opinion Score (MOS) evaluation. A prediction algorithm for the MOS has been built, which correlates with the subjective scores as well as the most correlated individual measure. Full article
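Selection pipelines of this kind are often built on a simple regression from objective measures to MOS. The sketch below is a hypothetical numpy illustration (invented measures and scores, not the paper's data or algorithm): fit a linear predictor by least squares and check how well it correlates with the subjective scores.

```python
# Hypothetical sketch: predict MOS from objective quality measures with
# ordinary least squares, then check the Pearson correlation with the
# subjective MOS. Measures and scores below are simulated, not real data.
import numpy as np

rng = np.random.default_rng(0)

# Each row: objective measures for one synthetic voice (e.g. MCD, F0 RMSE, ...)
X = rng.normal(size=(40, 3))
true_w = np.array([-0.8, -0.3, 0.1])                 # hidden relation for the demo
mos = 3.5 + X @ true_w + 0.1 * rng.normal(size=40)   # 1-5 opinion scores

# Fit MOS ~ measures with least squares (bias term appended).
A = np.hstack([X, np.ones((40, 1))])
w, *_ = np.linalg.lstsq(A, mos, rcond=None)
pred = A @ w

# Pearson correlation between predicted and subjective MOS.
r = np.corrcoef(pred, mos)[0, 1]
print(round(float(r), 3))
```

In practice the measures would be real objective metrics and the fit validated on held-out voices.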

Article
Contribution of Vocal Tract and Glottal Source Spectral Cues in the Generation of Acted Happy and Aggressive Spanish Vowels
Appl. Sci. 2022, 12(4), 2055; https://doi.org/10.3390/app12042055 - 16 Feb 2022
Abstract
The source-filter model is one of the main techniques applied to speech analysis and synthesis. Recent advances in voice production by means of three-dimensional (3D) source-filter models have overcome several limitations of classic one-dimensional techniques. Despite preliminary attempts to improve the expressiveness of 3D-generated voices, they are still far from achieving realistic results. Towards this goal, this work analyses the contribution of both the vocal tract (VT) and the glottal source spectral (GSS) cues in the generation of happy and aggressive speech through a GlottDNN-based analysis-by-synthesis methodology. Paired neutral and expressive utterances are parameterised to generate different combinations of expressive vowels, applying the target expressive GSS and/or VT cues on the neutral vowels after transplanting the expressive prosody on these utterances. The conducted objective tests, focused on Spanish [a], [i] and [u] vowels, show that both GSS and VT cues significantly reduce the spectral distance to the expressive target. The results from the perceptual test show that VT cues make a statistically significant contribution to the expression of happy and aggressive emotions for [a] vowels, while the GSS contribution is significant in [i] and [u] vowels. Full article

Article
TASE: Task-Aware Speech Enhancement for Wake-Up Word Detection in Voice Assistants
Appl. Sci. 2022, 12(4), 1974; https://doi.org/10.3390/app12041974 - 14 Feb 2022
Abstract
Wake-up word spotting in noisy environments is a critical task for an excellent user experience with voice assistants. Unwanted activation of the device is often due to the presence of noises coming from background conversations, TVs, or other domestic appliances. In this work, we propose the use of a speech enhancement convolutional autoencoder, coupled with on-device keyword spotting, aimed at improving trigger word detection in noisy environments. The end-to-end system learns by optimizing a linear combination of losses: a reconstruction-based loss, both at the log-mel spectrogram and at the waveform level, as well as a specific task loss that accounts for the cross-entropy error of the keyword spotting detection. We experiment with several neural network classifiers and report that deeply coupling the speech enhancement with a wake-up word detector, e.g., by jointly training them, significantly improves performance in the noisiest conditions. Additionally, we introduce a new publicly available speech database recorded for Telefónica's voice assistant, Aura. The OK Aura Wake-up Word Dataset incorporates rich metadata, such as speaker demographics or room conditions, and comprises hard negative examples that were carefully selected to present different levels of phonetic similarity with respect to the trigger words “OK Aura”. Full article
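The linear combination of losses described in the abstract can be sketched as follows; the weights, tensor shapes and two-class setup are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a task-aware combined objective: a weighted sum of
# spectrogram and waveform reconstruction errors plus the keyword-spotting
# cross-entropy. All weights and shapes are illustrative.
import numpy as np

def combined_loss(mel_hat, mel, wav_hat, wav, logits, label,
                  w_mel=1.0, w_wav=0.5, w_task=1.0):
    l_mel = np.mean((mel_hat - mel) ** 2)           # log-mel reconstruction
    l_wav = np.mean((wav_hat - wav) ** 2)           # waveform reconstruction
    p = np.exp(logits - logits.max())
    p /= p.sum()                                    # softmax over classes
    l_task = -np.log(p[label])                      # cross-entropy (task loss)
    return w_mel * l_mel + w_wav * l_wav + w_task * l_task

rng = np.random.default_rng(1)
mel = rng.normal(size=(80, 10))
wav = rng.normal(size=1600)
logits = np.array([2.0, -1.0])                      # "wake word" vs "other"

# Perfect reconstruction: only the task term remains.
print(combined_loss(mel, mel, wav, wav, logits, 0))
```

In the real system the three terms would be backpropagated jointly through the enhancement autoencoder and the detector.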

Article
Automatic Identification of Emotional Information in Spanish TV Debates and Human–Machine Interactions
Appl. Sci. 2022, 12(4), 1902; https://doi.org/10.3390/app12041902 - 11 Feb 2022
Abstract
Automatic emotion detection is a very attractive field of research that can help build more natural human–machine interaction systems. However, several issues arise when real scenarios are considered, such as the tendency toward neutrality, which makes it difficult to obtain balanced datasets, or the lack of standards for the annotation of emotional categories. Moreover, the intrinsic subjectivity of emotional information increases the difficulty of obtaining valuable data to train machine learning-based algorithms. In this work, two different real scenarios were tackled: human–human interactions in TV debates and human–machine interactions with a virtual agent. For comparison purposes, an analysis of the emotional information was conducted in both. Thus, a profiling of the speakers associated with each task was carried out. Furthermore, different classification experiments show that deep learning approaches can be useful for detecting speakers' emotional information, mainly for arousal, valence, and dominance levels, reaching a 0.7 F1-score. Full article

Article
Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database
Appl. Sci. 2022, 12(4), 1889; https://doi.org/10.3390/app12041889 - 11 Feb 2022
Abstract
This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, employed as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve the performance of the systems previously submitted by the authors to the challenge, in which the primary system scored second position. The novel systems are based on both DNN-HMM and E2E acoustic models, for which fully and self-supervised learning methods were included. As a result, the new speech recognition engines clearly outperformed the initial systems, improving the previous best WER of 19.27 to a new best of 17.60, achieved by the DNN-HMM-based system. This work therefore describes an interesting benchmark of the latest acoustic models on a highly challenging dataset, and identifies the most suitable ones depending on the expected quality, the available resources and the required latency. Full article
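The WER figures quoted above are computed in the standard way, as the word-level edit distance normalised by the reference length. A minimal generic implementation (not the challenge's official scoring tool):

```python
# Word error rate via Levenshtein distance: (S + D + I) / N, where N is the
# number of reference words. Generic sketch with a toy Spanish example.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("el tiempo es oro", "el tiempo es loro"))  # 1 substitution / 4 words
```

On the paper's numbers, moving from 19.27 to 17.60 WER is a relative improvement of (19.27 − 17.60) / 19.27, roughly 8.7%.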

Article
Unsupervised Adaptation of Deep Speech Activity Detection Models to Unseen Domains
Appl. Sci. 2022, 12(4), 1832; https://doi.org/10.3390/app12041832 - 10 Feb 2022
Abstract
Speech Activity Detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These applications usually show a significant drop in performance when test data differ from training data due to the domain shift observed. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both issues, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new unseen domain, namely, data from Apollo space missions coming from the Fearless Steps Challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, the Deep CORAL method achieves a 13% relative improvement in the original evaluation metric when compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can further improve the results, yielding a significant 24% relative improvement in the evaluation metric when compared to the baseline system. Full article
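The statistical-distribution-matching idea behind Deep CORAL can be illustrated in a few lines: the CORAL loss is the scaled squared Frobenius distance between the source and target feature covariances, which Deep CORAL minimises on DNN activations. A numpy sketch with invented data:

```python
# CORAL aligns second-order statistics: the loss is the squared Frobenius
# distance between source and target covariances, scaled by 1/(4 d^2).
# Here it is computed on raw features; Deep CORAL applies it to activations.
import numpy as np

def coral_loss(source, target):
    d = source.shape[1]
    cs = np.cov(source, rowvar=False)
    ct = np.cov(target, rowvar=False)
    return np.sum((cs - ct) ** 2) / (4 * d * d)

rng = np.random.default_rng(2)
src = rng.normal(size=(500, 8))
tgt = 2.0 * rng.normal(size=(500, 8))     # domain-shifted: larger variance

print(coral_loss(src, src))               # identical domains -> zero loss
print(coral_loss(src, tgt) > 0.0)         # shifted domain -> positive loss
```

During adaptation this term is added to the task loss, pulling the network's feature statistics for the unseen domain toward those of the training domains.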

Article
Active Correction for Incremental Speaker Diarization of a Collection with Human in the Loop
Appl. Sci. 2022, 12(4), 1782; https://doi.org/10.3390/app12041782 - 09 Feb 2022
Abstract
State-of-the-art diarization systems now achieve decent performance, but that performance is often not good enough to deploy them without any human supervision. Additionally, most approaches focus on single audio files, while many use cases involving multiple recordings with recurrent speakers require the incremental processing of a collection. In this paper, we propose a framework that solicits a human in the loop to correct the clustering by answering simple questions. After defining the nature of the questions for both single files and collections of files, we propose two algorithms to list those questions, together with the associated stopping criteria necessary to limit the workload on the human in the loop. Experiments performed on the ALLIES dataset show that a limited interaction with a human expert can lead to considerable improvement: up to 36.5% relative diarization error rate (DER) reduction for single files and 33.29% for a collection. Full article
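The question-and-merge interaction with the human in the loop might be sketched as below; the cluster representation, similarity function, question-selection rule and stopping criterion are all simplified assumptions, not the paper's algorithms.

```python
# Toy sketch of soliciting a human to correct a diarization clustering:
# ask "are these two clusters the same speaker?" for the most similar pair,
# merge on a yes, and stop on a no or when the question budget runs out.
def correct_clusters(clusters, similarity, oracle, budget=5):
    questions = 0
    while questions < budget and len(clusters) > 1:
        # pick the most similar pair of clusters (most likely wrong split)
        pairs = [(similarity(a, b), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        score, a, b = max(pairs, key=lambda t: t[0])
        questions += 1
        if oracle(a, b):                      # human answers: same speaker
            clusters = [c for c in clusters if c is not a and c is not b]
            clusters.append(a + b)
        else:
            break                             # stopping criterion: a "no"
    return clusters

# Segments labelled by true speaker; the system over-split speaker "A".
clusters = [["A1", "A2"], ["A3"], ["B1"]]
sim = lambda a, b: sum(x[0] == y[0] for x in a for y in b)
same = lambda a, b: a[0][0] == b[0][0]       # stand-in for the human expert
print(sorted(correct_clusters(clusters, sim, same), key=len))
```

The real framework orders questions so each answer maximally reduces the expected DER, which is what keeps the human workload bounded.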

Article
Evaluation of Tacotron Based Synthesizers for Spanish and Basque
Appl. Sci. 2022, 12(3), 1686; https://doi.org/10.3390/app12031686 - 07 Feb 2022
Abstract
In this paper, we describe the implementation and evaluation of Text to Speech synthesizers based on neural networks for Spanish and Basque. Several voices were built, all of them trained on a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to obtain the audio signals from the spectrograms. The limited amount of training data leads to synthesis errors in some sentences. To automatically detect those errors, we developed a new method that is able to find the sentences that have lost the alignment during the inference process. To mitigate the problem, we implemented guided attention, providing the system with the explicit duration of the phonemes. The resulting system was evaluated to assess its robustness, quality and naturalness with both objective and subjective measures. The results reveal the capacity of the system to produce good-quality and natural audio. Full article
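One common way to detect sentences that have lost the alignment, possibly different from the method developed in the paper, is to inspect the attention matrix: a healthy Tacotron alignment is sharp and roughly diagonal, so a low mean per-step maximum attention weight flags a failure. A toy numpy sketch (the 0.5 threshold is illustrative):

```python
# "Focus rate" heuristic for alignment failures: the mean of the maximum
# attention weight per decoder step. Diffuse attention -> low focus rate.
import numpy as np

def focus_rate(attention):     # attention: (decoder_steps, encoder_steps)
    return float(np.mean(attention.max(axis=1)))

good = np.eye(10)              # sharp diagonal alignment
bad = np.full((10, 10), 0.1)   # diffuse attention: alignment lost

print(focus_rate(good), focus_rate(bad))   # 1.0 vs 0.1
print(focus_rate(bad) < 0.5)               # flagged as a synthesis error
```

Sentences flagged this way can then be re-synthesized, e.g. with the guided-attention model mentioned in the abstract.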

Article
Fine-Tuning BERT Models for Intent Recognition Using a Frequency Cut-Off Strategy for Domain-Specific Vocabulary Extension
Appl. Sci. 2022, 12(3), 1610; https://doi.org/10.3390/app12031610 - 03 Feb 2022
Abstract
Intent recognition is a key component of any task-oriented conversational system. The intent recognizer can be used first to classify the user's utterance into one of several predefined classes (intents) that help to understand the user's current goal. Then, the most adequate response can be provided accordingly. Intent recognizers also often appear as a form of joint model performing the natural language understanding and dialog management tasks together as a single process, thus simplifying the set of problems that a conversational system must solve. This happens to be especially true for frequently asked question (FAQ) conversational systems. In this work, we first present an exploratory analysis in which different deep learning (DL) models for intent detection and classification were evaluated. In particular, we experimentally compare and analyze conventional recurrent neural networks (RNN) and state-of-the-art transformer models. Our experiments confirmed that the best performance is achieved by using transformers. Specifically, the best performance was achieved by fine-tuning the so-called BETO model (a Spanish pretrained bidirectional encoder representations from transformers (BERT) model from the Universidad de Chile) on our intent detection task. Then, as the main contribution of the paper, we analyze the effect of inserting unseen domain words to extend the vocabulary of the model as part of the fine-tuning or domain-adaptation process. In particular, a very simple word frequency cut-off strategy is experimentally shown to be a suitable method for driving the vocabulary learning decisions over unseen words. The results of our analysis show that the proposed method helps to effectively extend the original vocabulary of the pretrained models. We validated our approach with a selection of the corpus acquired with the Hispabot-Covid19 system, obtaining satisfactory results. Full article
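A word frequency cut-off for vocabulary extension can be sketched as below: count in-domain words unseen by the pretrained vocabulary and keep only those at or above a threshold. The corpus, vocabulary and cut-off value are toy placeholders, not the Hispabot-Covid19 data.

```python
# Frequency cut-off strategy: only unseen domain words that occur at least
# `cutoff` times in the in-domain corpus are proposed for vocabulary
# extension; rare words stay segmented into subwords by the tokenizer.
from collections import Counter

def words_to_add(corpus, known_vocab, cutoff=2):
    counts = Counter(w for line in corpus for w in line.lower().split())
    return sorted(w for w, c in counts.items()
                  if c >= cutoff and w not in known_vocab)

corpus = ["cita para la vacuna", "la vacuna covid", "vacuna y mascarilla"]
known = {"cita", "para", "la", "y"}
print(words_to_add(corpus, known))   # only frequent unseen words survive
```

The selected words would then be added to the tokenizer vocabulary, with their embeddings learned during fine-tuning.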

Article
A Study of Data Augmentation for ASR Robustness in Low Bit Rate Contact Center Recordings Including Packet Losses
Appl. Sci. 2022, 12(3), 1580; https://doi.org/10.3390/app12031580 - 01 Feb 2022
Abstract
Client conversations in contact centers are nowadays routinely recorded for a number of reasons; in many cases, just because it is required by current legislation. However, even if not required, conversations between customers and agents can be a valuable source of information about clients or future clients, call center agents, market trends, etc. Analyzing these recordings provides an excellent opportunity to gain awareness about the business and its possibilities. The current state of the art in Automatic Speech Recognition (ASR) allows this information to be effectively extracted and used. However, conversations are usually stored in highly compressed formats to save space and typically contain packet losses, which produce short interruptions in the speech signal due to the common use of Voice-over-IP (VoIP) in these systems. These effects, and especially the latter, have a negative impact on ASR performance. This article presents an extensive study of the importance of these effects on modern ASR systems and of the effectiveness of several data augmentation techniques in increasing their robustness. In addition, ITU-T G.711, a well-known Packet Loss Concealment (PLC) method, is applied in combination with data augmentation techniques to analyze ASR performance improvement on signals affected by packet losses. Full article
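Packet-loss data augmentation of the kind studied here can be simulated by zeroing random fixed-size packets of the training waveforms; the packet size and loss rate below are illustrative, not the paper's settings.

```python
# Sketch of packet-loss augmentation: drop random fixed-size "packets"
# (zeroing their samples) so the ASR model sees the short interruptions
# typical of VoIP recordings. 160 samples = 10 ms at 16 kHz.
import numpy as np

def drop_packets(wav, packet=160, loss_rate=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    out = wav.copy()
    n_packets = len(wav) // packet
    for k in range(n_packets):
        if rng.random() < loss_rate:
            out[k * packet:(k + 1) * packet] = 0.0   # lost packet
    return out

rng = np.random.default_rng(3)
wav = rng.normal(size=16000)                         # 1 s of toy audio
aug = drop_packets(wav, rng=rng)
print(np.mean(aug == 0.0))                           # fraction of zeroed samples
```

A PLC algorithm such as the one in ITU-T G.711 would then interpolate over the zeroed gaps instead of leaving silence.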

Article
GANBA: Generative Adversarial Network for Biometric Anti-Spoofing
Appl. Sci. 2022, 12(3), 1454; https://doi.org/10.3390/app12031454 - 29 Jan 2022
Abstract
Automatic speaker verification (ASV) is a voice biometric technology whose security might be compromised by spoofing attacks. To increase the robustness against spoofing attacks, presentation attack detection (PAD) or anti-spoofing systems for detecting replay, text-to-speech and voice conversion-based spoofing attacks are being developed. However, it was recently shown that adversarial spoofing attacks may seriously fool anti-spoofing systems. Moreover, the robustness of the whole biometric system (ASV + PAD) against this new type of attack is completely unexplored. In this work, a new generative adversarial network for biometric anti-spoofing (GANBA) is proposed. GANBA has a twofold basis: (1) it jointly employs the anti-spoofing and ASV losses to yield very damaging adversarial spoofing attacks, and (2) it trains the PAD as a discriminator in order to make it more robust against these types of adversarial attacks. The proposed system is able to generate adversarial spoofing attacks which can fool the complete voice biometric system. The resulting PAD discriminators of the proposed GANBA can then be used as a defense technique for detecting both original and adversarial spoofing attacks. The physical access (PA) and logical access (LA) scenarios of the ASVspoof 2019 database were employed to carry out the experiments. The experimental results show that the GANBA attacks are quite effective, outperforming other adversarial techniques when applied in white-box and black-box attack setups. In addition, the resulting PAD discriminators are more robust against both original and adversarial spoofing attacks. Full article

Article
Exploring the Age Effects on European Portuguese Vowel Production: An Ultrasound Study
Appl. Sci. 2022, 12(3), 1396; https://doi.org/10.3390/app12031396 - 28 Jan 2022
Abstract
For aging speech, there is limited knowledge regarding the articulatory adjustments underlying the acoustic findings observed in previous studies. In order to investigate age-related articulatory differences in European Portuguese (EP) vowels, the present study analyzes the tongue configuration of the nine EP oral vowels (in isolated and pseudoword contexts) produced by 10 female speakers of two different age groups (young and old). From the tongue contours automatically segmented from the ultrasound (US) images and manually revised, two parameters (tongue height and tongue advancement) were extracted. The results suggest that the tongue tends to be higher and more advanced for the older females compared to the younger ones for almost all vowels. Thus, the vowel articulatory space tends to be higher, more advanced, and bigger with age. For older females, unlike the younger ones, who presented a sharp reduction of the articulatory vowel space in disyllabic sequences, the vowel space tends to be more advanced for isolated vowels than for vowels produced in disyllabic sequences. This study extends our pilot research by reporting articulatory data from more speakers, based on an improved automatic method of tongue contour tracing, and performs an inter-speaker comparison through the application of a novel normalization procedure. Full article

Article
Non-Parallel Articulatory-to-Acoustic Conversion Using Multiview-Based Time Warping
Appl. Sci. 2022, 12(3), 1167; https://doi.org/10.3390/app12031167 - 23 Jan 2022
Abstract
In this paper, we propose a novel algorithm called multiview temporal alignment by dependence maximisation in the latent space (TRANSIENCE) for the alignment of time series consisting of sequences of feature vectors with different lengths and dimensionalities. The proposed algorithm, which is based on the theory of multiview learning, can be seen as an extension of the well-known dynamic time warping (DTW) algorithm but, as mentioned, it allows the sequences to have different dimensionalities. Our algorithm attempts to find an optimal temporal alignment between pairs of nonaligned sequences by first projecting their feature vectors into a common latent space where both views are maximally similar. To do this, powerful, nonlinear deep neural network (DNN) models are employed. Then, the resulting sequences of embedding vectors are aligned using DTW. Finally, the alignment paths obtained in the previous step are applied to the original sequences to align them. In the paper, we explore several variants of the algorithm that mainly differ in the way the DNNs are trained. We evaluated the proposed algorithm on an articulatory-to-acoustic (A2A) synthesis task involving the generation of audible speech from motion data captured from the lips and tongue of healthy speakers using a technique known as permanent magnet articulography (PMA). In this task, our algorithm is applied during the training stage to align pairs of nonaligned speech and PMA recordings that are later used to train DNNs able to synthesize speech from PMA data. Our results show that the quality of speech generated in the nonaligned scenario is comparable to that obtained in the parallel scenario. Full article
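The DTW stage of the pipeline aligns the two embedding sequences once they live in the common latent space. A plain generic DTW implementation on toy one-dimensional embeddings (the multiview projection step is omitted, so this is not the authors' code):

```python
# Dynamic time warping: accumulate pairwise distances, then backtrack the
# minimal-cost monotonic path pairing frames of x with frames of y.
import numpy as np

def dtw_path(x, y):
    nx, ny = len(x), len(y)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # backtrack the optimal warping path
    i, j, path = nx, ny, []
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = {(i - 1, j - 1): cost[i - 1, j - 1],
                 (i - 1, j): cost[i - 1, j],
                 (i, j - 1): cost[i, j - 1]}
        i, j = min(moves, key=moves.get)
    return path[::-1], float(cost[nx, ny])

x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [0.0], [1.0], [2.0]])   # same curve, stretched in time
path, total = dtw_path(x, y)
print(path, total)                           # perfect alignment, zero cost
```

In TRANSIENCE the path found on the embeddings is then applied to the original speech and PMA sequences, which is what makes differing dimensionalities unproblematic.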
Article
Multimodal Diarization Systems by Training Enrollment Models as Identity Representations
Appl. Sci. 2022, 12(3), 1141; https://doi.org/10.3390/app12031141 - 21 Jan 2022
Viewed by 494
Abstract
This paper describes a post-evaluation analysis of the system developed by the ViVoLAB research group for the IberSPEECH-RTVE 2020 Multimodal Diarization (MD) Challenge. This challenge focuses on the study of multimodal systems for the diarization of audiovisual files and the assignment of an identity to each segment where a person is detected. In this work, we implemented two different subsystems to address this task, using the audio and the video from audiovisual files separately. To develop our subsystems, we used state-of-the-art speaker and face verification embeddings extracted from publicly available deep neural networks (DNN). Different clustering techniques were also employed in combination with the tracking and identity assignment process. Furthermore, we included a novel back-end approach in the face verification subsystem that trains an enrollment model for each identity, which we have previously shown to improve results compared to averaging the enrollment data. Using this approach, we trained a learnable vector to represent each enrollment character. The loss function employed to train this vector was an approximated version of the detection cost function (aDCF), inspired by the DCF, a metric widely used to measure performance in verification tasks. In this paper, we also focused on exploring and analyzing the effect of training this vector with several configurations of this objective loss function. This analysis allows us to assess the impact of the configuration parameters of the loss on the number and type of errors produced by the system.
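A minimal sketch of the aDCF idea, not the paper's exact formulation: the hard miss and false-alarm counts of the DCF are replaced by sigmoids so that the loss becomes differentiable with respect to the scores (and hence the enrollment vectors). All parameter names and default weights below are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adcf_loss(scores, labels, threshold=0.0, c_miss=1.0, c_fa=1.0,
              p_target=0.5, steepness=10.0):
    """Differentiable approximation of the detection cost function.

    scores: verification scores; labels: 1 for target trials, 0 for
    non-target trials. Step functions are replaced by sigmoids whose
    sharpness is controlled by `steepness`.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    tgt, non = labels == 1, labels == 0
    # Soft fraction of target trials scored below the threshold (misses).
    p_miss = sigmoid(steepness * (threshold - scores[tgt])).mean()
    # Soft fraction of non-target trials above the threshold (false alarms).
    p_fa = sigmoid(steepness * (scores[non] - threshold)).mean()
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
```

With well-separated scores the loss approaches zero; with inverted scores it approaches the full cost.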
Article
Neural Models for Measuring Confidence on Interactive Machine Translation Systems
Appl. Sci. 2022, 12(3), 1100; https://doi.org/10.3390/app12031100 - 21 Jan 2022
Viewed by 387
Abstract
Reducing the human effort required to use interactive-predictive neural machine translation (IPNMT) systems is one of the main goals in this sub-field of machine translation (MT). Prior works have focused on changing the human–machine interaction method and simplifying the feedback provided. Applying confidence measures (CM) to an IPNMT system helps decrease the number of words that the user has to check throughout the translation session, reducing the human effort needed, although this means losing a few points in translation quality. The effort reduction comes from decreasing the number of words that the translator has to review: they only have to check the ones with a score lower than the set threshold. In this paper, we studied the performance of four confidence measures based on the metrics most used in MT. We trained four recurrent neural network (RNN) models to approximate the scores of the metrics BLEU, METEOR, chrF, and TER. In the experiments, we simulated user interaction with the system to obtain and compare the quality of the generated translations with the effort reduction achieved. We also compared the four models against each other to see which obtains the best results. The results showed an effort reduction of 48% with a BLEU score of 70 points: a significant effort reduction for translations that are almost perfect.
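The thresholding idea behind the effort reduction can be sketched in a few lines; the helper names below are hypothetical, not from the paper:

```python
def words_to_review(confidences, threshold):
    """Indices of translated words whose confidence score falls below
    the threshold; only these are shown to the human translator."""
    return [i for i, c in enumerate(confidences) if c < threshold]

def effort_reduction(confidences, threshold):
    """Fraction of words the translator no longer has to check."""
    review = words_to_review(confidences, threshold)
    return 1.0 - len(review) / len(confidences)
```

Raising the threshold sends more words to the translator (higher quality, more effort); lowering it trusts the system more (less effort, some quality loss).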
Article
Cascade or Direct Speech Translation? A Case Study
Appl. Sci. 2022, 12(3), 1097; https://doi.org/10.3390/app12031097 - 21 Jan 2022
Viewed by 434
Abstract
Speech translation has traditionally been tackled under a cascade approach, chaining speech recognition and machine translation components to translate from an audio source in a given language into text or speech in a target language. Leveraging deep learning approaches to natural language processing, recent studies have explored the potential of direct end-to-end neural modelling to perform the speech translation task. Though several benefits may come from end-to-end modelling, such as a reduction in latency and error propagation, the comparative merits of each approach still deserve detailed evaluation and analysis. In this work, we compared state-of-the-art cascade and direct approaches on the under-resourced Basque–Spanish language pair, which features challenging phenomena such as marked differences in morphology and word order. This case study thus complements other studies in the field, which mostly revolve around the English language. We describe and analyse in detail the mintzai-ST corpus, prepared from sessions of the Basque Parliament, and evaluate the strengths and limitations of cascade and direct speech translation models trained on this corpus, with variants exploiting additional data as well. Our results indicate that, despite significant progress with end-to-end models, which may outperform alternatives in some cases in terms of automated metrics, a cascade approach proved optimal overall in our experiments and manual evaluations.
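Structurally, the cascade approach is just function composition of two components, which is where its error propagation comes from. A minimal sketch under that assumption (the callables stand in for real ASR and MT models and are purely illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadeST:
    """Cascade speech translation: ASR output feeds an MT component.

    The two callables are placeholders for real models, e.g. a Basque
    ASR system and a Basque-to-Spanish MT system.
    """
    asr: Callable[[bytes], str]   # audio -> source-language text
    mt: Callable[[str], str]      # source-language text -> target text

    def translate(self, audio: bytes) -> str:
        # Any recognition error here propagates into the MT step --
        # the error propagation that direct end-to-end models avoid
        # by mapping audio to target text in a single model.
        return self.mt(self.asr(audio))
```

A direct system would instead expose a single `audio -> target text` model with no intermediate transcript.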
Article
A Comparison of Hybrid and End-to-End ASR Systems for the IberSpeech-RTVE 2020 Speech-to-Text Transcription Challenge
Appl. Sci. 2022, 12(2), 903; https://doi.org/10.3390/app12020903 - 17 Jan 2022
Viewed by 365
Abstract
This paper describes a comparison between hybrid and end-to-end Automatic Speech Recognition (ASR) systems, which were evaluated on the IberSpeech-RTVE 2020 Speech-to-Text Transcription Challenge. Deep Neural Networks (DNNs) are currently the most promising technology for ASR. In the last few years, traditional hybrid models have been evaluated and compared to other end-to-end ASR systems in terms of accuracy and efficiency. We contribute two different approaches: a hybrid ASR system based on a DNN-HMM and two state-of-the-art end-to-end ASR systems based on Lattice-Free Maximum Mutual Information (LF-MMI). To address the difficulty of speech-to-text transcription of recordings whose speaking styles and acoustic conditions range from TV studios to live recordings, data augmentation and Domain Adversarial Training (DAT) techniques were studied. Multi-condition data augmentation applied to our hybrid DNN-HMM yielded WER improvements in noisy scenarios (about 10% relative). In contrast, the results obtained using an end-to-end PyChain-based ASR system were far from our expectations. Nevertheless, we found that including DAT techniques yielded a relative WER improvement of 2.87% compared to the PyChain-based system.
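A common building block of multi-condition data augmentation is mixing noise into clean speech at a chosen signal-to-noise ratio, replicating each utterance under several noise types and SNR levels before training. A minimal sketch of that mixing step (not the paper's exact pipeline):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR (in dB).

    speech and noise are 1-D float arrays of equal length. The noise
    is scaled so that 10*log10(P_speech / P_noise_scaled) == snr_db.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Training on copies of the data generated at, say, 20, 10, and 5 dB SNR is what makes the acoustic model robust to the noisy live-recording conditions.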
Article
MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension
Appl. Sci. 2022, 12(2), 804; https://doi.org/10.3390/app12020804 - 13 Jan 2022
Viewed by 304
Abstract
This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions of the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 s. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, with a configuration similar to the primary system but a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small relative WER degradation of 6%. As an extension, the equivalent closed-condition systems obtained 23.3% and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% and 20.4% WER; i.e., not far behind the top-performing systems, with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.
Article
A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset
Appl. Sci. 2022, 12(1), 327; https://doi.org/10.3390/app12010327 - 30 Dec 2021
Cited by 2 | Viewed by 1252
Abstract
Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we proposed an automatic emotion recognizer system that consisted of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned the whole model by appending a multilayer perceptron on top of it, confirming that training was more robust when it did not start from scratch and the network's prior knowledge was close to the target task. Regarding the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models. Results showed that sequential models beat static models by a narrow margin. Error analysis indicated that the visual systems could improve with a detector of high-emotional-load frames, which opens a new line of research on ways to learn from videos. Finally, combining these two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset in a subject-wise 5-CV evaluation, classifying eight emotions. The results demonstrate that these modalities carry relevant information to detect users' emotional state, and that their combination improves the final system performance.
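In its simplest form, late fusion combines the per-class posteriors of the two independently trained recognizers. A minimal sketch under that assumption (the weighted-average rule and the 0.5 default weight are illustrative, not necessarily the paper's exact strategy):

```python
import numpy as np

def late_fusion(p_audio, p_video, w_audio=0.5):
    """Fuse per-class probabilities from the speech (SER) and facial
    (FER) recognizers with a weighted average, then pick the argmax.

    p_audio, p_video: arrays of class probabilities (same length,
    e.g. one entry per emotion). Returns the fused class index.
    """
    p_audio, p_video = np.asarray(p_audio), np.asarray(p_video)
    fused = w_audio * p_audio + (1.0 - w_audio) * p_video
    return int(np.argmax(fused))
```

Because fusion happens after each modality's full pipeline, either recognizer can be retrained or replaced without touching the other.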
Article
An Analysis of Sound Event Detection under Acoustic Degradation Using Multi-Resolution Systems
Appl. Sci. 2021, 11(23), 11561; https://doi.org/10.3390/app112311561 - 6 Dec 2021
Viewed by 466
Abstract
The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field has risen due to the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions, such as high- or low-pass filtering and clipping or dynamic range compression, as well as under a scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multiresolution approach that improves Sound Event Detection performance through the use of several resolutions during the extraction of Mel-spectrogram features. We provide insights into the benefits of this multiresolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets.
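The multiresolution idea rests on the time-frequency trade-off of the analysis window: long windows give fine frequency resolution, short windows give fine time resolution. A minimal sketch of extracting features at several resolutions (raw magnitude spectrograms rather than the mel-spectrograms the paper uses, and with illustrative window/hop values):

```python
import numpy as np

def stft_mag(signal, win_len, hop):
    """Magnitude spectrogram with a Hann window of length win_len."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_resolution_features(signal, resolutions=((2048, 512), (512, 128))):
    """One spectrogram per (window, hop) pair.

    Long windows favour frequency resolution; short windows favour
    time resolution. A multiresolution system feeds each version
    (or a combination of them) to the detector.
    """
    return [stft_mag(signal, w, h) for w, h in resolutions]
```

Each resolution yields a feature matrix with a different time-frequency shape, which is exactly what lets the detector trade temporal precision against spectral detail.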
Article
The Domain Mismatch Problem in the Broadcast Speaker Attribution Task
Appl. Sci. 2021, 11(18), 8521; https://doi.org/10.3390/app11188521 - 14 Sep 2021
Cited by 1 | Viewed by 526
Abstract
The demand for high-quality metadata for the available multimedia content requires the development of new techniques able to correctly identify more and more information, including speaker information. The task known as speaker attribution aims at identifying all or part of the speakers in the audio under analysis. In this work, we carry out a study of the speaker attribution problem in the broadcast domain. Through our experiments, we illustrate the positive impact of diarization on the final performance. Additionally, we show the influence of the variability present in broadcast data, depicting the broadcast domain as a collection of subdomains with particular characteristics. Taking these two factors into account, we also propose alternative approximations that are robust against domain mismatch. These approximations include a semisupervised alternative as well as a new, fully unsupervised hybrid solution fusing diarization and speaker assignment. Thanks to these two approximations, our performance is boosted by around 50% relative. The analysis was carried out using the corpus for the Albayzín 2020 challenge, a diarization and speaker attribution evaluation working with broadcast data. These data, provided by Radio Televisión Española (RTVE), the Spanish public radio and TV corporation, include multiple shows and genres to analyze the impact of new speech technologies in real-world scenarios.
Article
The Multi-Domain International Search on Speech 2020 ALBAYZIN Evaluation: Overview, Systems, Results, Discussion and Post-Evaluation Analyses
Appl. Sci. 2021, 11(18), 8519; https://doi.org/10.3390/app11188519 - 14 Sep 2021
Viewed by 855
Abstract
The large amount of information stored in audio and video repositories makes search on speech (SoS) a challenging area that continuously receives much interest. Within SoS, spoken term detection (STD) aims to retrieve speech data given a text-based representation of a search query (which can include one or more words). In contrast, query-by-example spoken term detection (QbE STD) aims to retrieve speech data given an acoustic representation of a search query. This is the first paper to present an internationally open multi-domain evaluation for SoS in Spanish that includes both STD and QbE STD tasks. The evaluation was carefully designed so that several post-evaluation analyses of the main results could be carried out. The evaluation tasks aim to retrieve the speech files that contain the queries, providing their start and end times and a score that reflects how likely it is that the query occurs within the given time interval of the given speech file. Three Spanish speech databases covering different domains were employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the SPARL20 database, which contains Spanish parliament sessions. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the evaluation results, and detailed post-evaluation analyses based on specific query properties (in-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). The most novel features of the submitted systems are a data augmentation technique for the STD task and an end-to-end system for the QbE STD task. The results suggest that there is clearly room for improvement in the SoS task and that performance is highly sensitive to changes in data domain.
Article
Automatic Speech Recognition (ASR) Systems Applied to Pronunciation Assessment of L2 Spanish for Japanese Speakers
Appl. Sci. 2021, 11(15), 6695; https://doi.org/10.3390/app11156695 - 21 Jul 2021
Cited by 2 | Viewed by 1171
Abstract
General-purpose automatic speech recognition (ASR) systems have improved in quality and are being used for pronunciation assessment. However, the assessment of isolated short utterances, such as words in minimal pairs for segmental approaches, remains an important challenge, even more so for non-native speakers. In this work, we compare the performance of our own tailored ASR system (kASR) with that of Google ASR (gASR) for the assessment of Spanish minimal-pair words produced by 33 native Japanese speakers in a computer-assisted pronunciation training (CAPT) scenario. Participants in a pre/post-test training experiment spanning four weeks were split into three groups: experimental, in-classroom, and placebo. The experimental group used the CAPT tool described in the paper, which we specially designed for autonomous pronunciation training. The experiment revealed a statistically significant improvement for the experimental and in-classroom groups, and moderate correlations between gASR and kASR results were obtained, in addition to strong correlations between the post-test scores of both ASR systems and the CAPT application scores found at the final stages of application use. These results suggest that, in the current configuration, both ASR alternatives are valid for assessing minimal pairs in CAPT tools. A discussion of possible ways to improve our system and possibilities for future research is included.