IberSPEECH 2022: Speech and Language Technologies for Iberian Languages

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (23 June 2023) | Viewed by 13332

Special Issue Editors


Prof. Dr. Francesc Alías
Guest Editor
Grup de recerca en Tecnologies Mèdia (GTM), La Salle - Universitat Ramon Llull, 08022 Barcelona, Spain
Interests: speech processing; speech analysis and synthesis; voice production; expressive speech; Human-Computer Interaction; acoustic event detection; acoustic signal processing; machine listening; real-time noise monitoring; impact of noise events; real-life acoustic datasets; wireless acoustic sensor networks

Dr. José Luis Pérez Córdoba
Guest Editor
SigMAT Research Group, Universidad de Granada, 18071 Granada, Spain
Interests: signal processing; speech processing; speech coding; robust multimedia signals transmission; silent speech processing/interfaces; joint source-channel coding and applications; communication

Dr. Zoraida Callejas Carrión
Guest Editor
SISDIAL Research Group, Universidad de Granada, 18071 Granada, Spain
Interests: dialogue systems; conversational systems; dialogue management; speech and language technologies; affective computing; emotion recognition

Dr. António Joaquim da Silva Teixeira
Guest Editor
Department of Electronics, Telecommunications and Informatics, Universidade de Aveiro, 3810-193 Aveiro, Portugal
Interests: multimodal interaction; natural user interaction; natural language processing; speech and language processing

Special Issue Information

Dear Colleagues,

IberSPEECH 2022 will be held in Granada (Spain), from 14 to 16 November 2022. The IberSPEECH event—the sixth of its kind under this name—brings together the XII Jornadas en Tecnologías del Habla and the VIII Iberian SLTech Workshop events. The conference provides a platform for scientific and industrial discussion and exchange around Iberian languages, with the following main topics of interest:

  • Speech technology and applications;
  • Human speech production, perception, and communication;
  • Natural language processing and applications;
  • Speech, language, and multimodality;
  • Resources, standardization, and evaluation.

The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language processing, paying special attention to those focused on Iberian languages. We invite researchers interested in these fields to contribute to this issue, which covers all fields of IberSPEECH 2022 (http://iberspeech2022.ugr.es/call-for-papers). Authors of papers selected from the conference are asked to follow the journal's instructions for “Preprints and Conference Papers”: https://www.mdpi.com/journal/applsci/instructions#preprints.

Prof. Dr. Francesc Alías
Dr. José Luis Pérez Córdoba
Dr. Zoraida Callejas Carrión
Dr. António Joaquim da Silva Teixeira
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech technology and applications
  • human speech production, perception, and communication
  • natural language processing and applications
  • speech, language, and multimodality
  • resources, standardization, and evaluation

Published Papers (11 papers)

Research

13 pages, 1595 KiB  
Article
Prosodic Feature Analysis for Automatic Speech Assessment and Individual Report Generation in People with Down Syndrome
by Mario Corrales-Astorgano, César González-Ferreras, David Escudero-Mancebo and Valentín Cardeñoso-Payo
Appl. Sci. 2024, 14(1), 293; https://doi.org/10.3390/app14010293 - 28 Dec 2023
Viewed by 551
Abstract
Evaluating prosodic quality poses unique challenges due to the intricate nature of prosody, which encompasses multiple form–function profiles. These challenges are more pronounced when analyzing the voices of individuals with Down syndrome (DS) due to increased variability. This paper introduces a procedure for selecting informative prosodic features based on both the disparity between human-rated DS productions and their divergence from the productions of typical users, utilizing a corpus constructed through a video game. Individual reports of five speakers with DS are created by comparing the selected features of each user with recordings of individuals without intellectual disabilities. The acquired features primarily relate to the temporal domain, reducing dependence on pitch detection algorithms, which encounter difficulties when dealing with pathological voices compared to typical ones. These individual reports can be instrumental in identifying specific issues for each speaker, assisting therapists in defining tailored training sessions based on the speaker’s profile.

12 pages, 318 KiB  
Article
esCorpius-m: A Massive Multilingual Crawling Corpus with a Focus on Spanish
by Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Ksenia Kharitonova and Zoraida Callejas
Appl. Sci. 2023, 13(22), 12155; https://doi.org/10.3390/app132212155 - 08 Nov 2023
Viewed by 982
Abstract
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling. However, there are notable limitations in the results for some languages, including Spanish. These datasets are either smaller compared to other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. It is the most extensive corpus for some languages with such a level of high-quality content extraction, cleanliness, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL.

15 pages, 1067 KiB  
Article
Cross-Corpus Training Strategy for Speech Emotion Recognition Using Self-Supervised Representations
by Miguel A. Pastor, Dayana Ribas, Alfonso Ortega, Antonio Miguel and Eduardo Lleida
Appl. Sci. 2023, 13(16), 9062; https://doi.org/10.3390/app13169062 - 08 Aug 2023
Cited by 2 | Viewed by 1365
Abstract
Speech Emotion Recognition (SER) plays a crucial role in applications involving human-machine interaction. However, the scarcity of suitable emotional speech datasets presents a major challenge for accurate SER systems. Deep Neural Network (DNN)-based solutions currently in use require substantial labelled data for successful training. Previous studies have proposed strategies to expand the training set in this framework by leveraging available emotion speech corpora. This paper assesses the impact of a cross-corpus training extension for a SER system using self-supervised (SS) representations, namely HuBERT and WavLM. The feasibility of training systems with just a few minutes of in-domain audio is also analyzed. The experimental results demonstrate that augmenting the training set with EmoDB (German), RAVDESS, and CREMA-D (English) datasets leads to improved SER accuracy on the IEMOCAP dataset. By combining a cross-corpus training extension and SS representations, state-of-the-art performance is achieved. These findings suggest that the cross-corpus strategy effectively addresses the scarcity of labelled data and enhances the performance of SER systems.

22 pages, 958 KiB  
Article
Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots
by Mario Rodríguez-Cantelar, Marcos Estecha-Garitagoitia, Luis Fernando D’Haro, Fernando Matía and Ricardo Córdoba
Appl. Sci. 2023, 13(16), 9055; https://doi.org/10.3390/app13169055 - 08 Aug 2023
Cited by 1 | Viewed by 996
Abstract
Current State-of-the-Art (SotA) chatbots are able to produce high-quality sentences, handling different conversation topics and larger interaction times. Unfortunately, the generated responses depend greatly on the data on which they have been trained, the specific dialogue history and current turn used for guiding the response, the internal decoding mechanisms, and ranking strategies, among others. Therefore, it may happen that for semantically similar questions asked by users, the chatbot may provide a different answer, which can be considered as a form of hallucination or producing confusion in long-term interactions. In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detecting inconsistent answers using k-means and the Silhouette coefficient. To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach enables the detection of up to 18 different topics and 102 subtopics. For the purpose of detecting inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. Our experimental results demonstrate a weighted F-1 value of 0.34 for topic detection, a weighted F-1 value of 0.78 for subtopic detection in DailyDialog, then 81% and 62% accuracy for topic and subtopic classification in SGC5, respectively. Finally, to predict the number of different responses, we obtained a mean squared error (MSE) of 3.4 when testing smaller generative models and 4.9 in recent large language models.

22 pages, 771 KiB  
Article
Evaluation of Glottal Inverse Filtering Techniques on OPENGLOT Synthetic Male and Female Vowels
by Marc Freixes, Luis Joglar-Ongay, Joan Claudi Socoró and Francesc Alías-Pujol
Appl. Sci. 2023, 13(15), 8775; https://doi.org/10.3390/app13158775 - 29 Jul 2023
Viewed by 570
Abstract
Current articulatory-based three-dimensional source–filter models, which allow the production of vowels and diphthongs, still present very limited expressiveness. Glottal inverse filtering (GIF) techniques can become instrumental to identify specific characteristics of both the glottal source signal and the vocal tract transfer function to resemble expressive speech. Several GIF methods have been proposed in the literature; however, their comparison becomes difficult due to the lack of common and exhaustive experimental settings. In this work, first, a two-phase analysis methodology for the comparison of GIF techniques based on a reference dataset is introduced. Next, state-of-the-art GIF techniques based on iterative adaptive inverse filtering (IAIF) and quasi closed phase (QCP) approaches are thoroughly evaluated on OPENGLOT, an open database specifically designed to evaluate GIF, computing well-established GIF error measures after extending male vowels with their female counterparts. The results show that GIF methods obtain better results on male vowels. The QCP-based techniques significantly outperform IAIF-based methods for almost all error metrics and scenarios and are, at the same time, more stable across sex, phonation type, F0, and vowels. The IAIF variants improve the original technique for most error metrics on male vowels, while QCP with spectral tilt compensation achieves a lower spectral tilt error for male vowels than the original QCP.

30 pages, 405 KiB  
Article
An Overview of the IberSpeech-RTVE 2022 Challenges on Speech Technologies
by Eduardo Lleida, Luis Javier Rodriguez-Fuentes, Javier Tejedor, Alfonso Ortega, Antonio Miguel, Virginia Bazán, Carmen Pérez, Alberto de Prada, Mikel Penagarikano, Amparo Varona, Germán Bordel, Doroteo Torre-Toledano, Aitor Álvarez and Haritz Arzelus
Appl. Sci. 2023, 13(15), 8577; https://doi.org/10.3390/app13158577 - 25 Jul 2023
Viewed by 1035
Abstract
Evaluation campaigns provide a common framework with which the progress of speech technologies can be effectively measured. The aim of this paper is to present a detailed overview of the IberSpeech-RTVE 2022 Challenges, which were organized as part of the IberSpeech 2022 conference under the ongoing series of Albayzin evaluation campaigns. In the 2022 edition, four challenges were launched: (1) speech-to-text transcription; (2) speaker diarization and identity assignment; (3) text and speech alignment; and (4) search on speech. Different databases that cover different domains (e.g., broadcast news, conference talks, parliament sessions) were released for those challenges. The submitted systems also cover a wide range of speech processing methods, which include hidden Markov model-based approaches, end-to-end neural network-based methods, hybrid approaches, etc. This paper describes the databases, the tasks and the performance metrics used in the four challenges. It also provides the most relevant features of the submitted systems and briefly presents and discusses the obtained results. Despite employing state-of-the-art technology, the relatively poor performance attained in some of the challenges reveals that there is still room for improvement. This encourages us to carry on with the Albayzin evaluation campaigns in the coming years.

17 pages, 726 KiB  
Article
Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR
by Mikel Penagarikano, Amparo Varona, Germán Bordel and Luis Javier Rodriguez-Fuentes
Appl. Sci. 2023, 13(14), 8492; https://doi.org/10.3390/app13148492 - 23 Jul 2023
Cited by 1 | Viewed by 624
Abstract
In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code switchings. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, involving five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, Word Error Rate (WER) reduces from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), which corresponds to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER reduces on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), which corresponds to relative WER reductions of 66.7% and 6.9%, respectively. As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme and a fully bilingual trigram language model.
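
The relative WER reductions quoted in this abstract can be checked from the absolute WER values it reports. The following is a minimal sketch of that arithmetic, not code from the paper; the function and variable names are illustrative assumptions, and each reduction is taken relative to the WER of the preceding stage.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


def stepwise_reductions(wers):
    """Relative reduction between each consecutive pair of WER values."""
    return [round(relative_reduction(a, b), 1) for a, b in zip(wers, wers[1:])]


# Average WER values (%) quoted in the abstract:
# baseline, first iteration, second iteration.
wer_all = [16.57, 4.41, 4.02]      # all segments
wer_basque = [16.57, 5.51, 5.13]   # Basque segments only

print(stepwise_reductions(wer_all))     # [73.4, 8.8]
print(stepwise_reductions(wer_basque))  # [66.7, 6.9]
```

Both printed lists match the percentages stated in the abstract, which confirms that the second figure in each pair is measured against the first-iteration WER and not against the baseline.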

19 pages, 813 KiB  
Article
Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics
by Ander González-Docasal and Aitor Álvarez
Appl. Sci. 2023, 13(14), 8049; https://doi.org/10.3390/app13148049 - 10 Jul 2023
Viewed by 2173
Abstract
Voice cloning, an emerging field in the speech-processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigated the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also used two high-quality corpora for comparative analysis. We conducted exhaustive evaluations of the quality of the gathered corpora in order to select the most-suitable data for the training of a voice-cloning system. Following these measurements, we conducted a series of ablations by removing audio files with a lower signal-to-noise ratio and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduced a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 text-to-speech system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice-cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increased the quality of synthesised audio for the challenging low-quality corpus. Notably, our findings indicated that models trained on a 3 h corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.

15 pages, 23185 KiB  
Article
Frame-Based Phone Classification Using EMG Signals
by Inge Salomons, Eder del Blanco, Eva Navas, Inma Hernáez and Xabier de Zuazo
Appl. Sci. 2023, 13(13), 7746; https://doi.org/10.3390/app13137746 - 30 Jun 2023
Viewed by 906
Abstract
This paper evaluates the impact of inter-speaker and inter-session variability on the development of a silent speech interface (SSI) based on electromyographic (EMG) signals from the facial muscles. The final goal of the SSI is to provide a communication tool for Spanish-speaking laryngectomees by generating audible speech from voiceless articulation. However, before moving on to such a complex task, a simpler phone classification task in different modalities regarding speaker and session dependency is performed for this study. These experiments consist of processing the recorded utterances into phone-labeled segments and predicting the phonetic labels using only features obtained from the EMG signals. We evaluate and compare the performance of each model considering the classification accuracy. Results show that the models are able to predict the phonetic label best when they are trained and tested using data from the same session. The accuracy drops drastically when the model is tested with data from a different session, although it improves when more data are added to the training data. Similarly, when the same model is tested on a session from a different speaker, the accuracy decreases. This suggests that using larger amounts of data could help to reduce the impact of inter-session variability, but more research is required to understand if this approach would suffice to account for inter-speaker variability as well.

16 pages, 648 KiB  
Article
Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish
by David Gimeno-Gómez and Carlos-D. Martínez-Hinarejos
Appl. Sci. 2023, 13(11), 6521; https://doi.org/10.3390/app13116521 - 26 May 2023
Viewed by 780
Abstract
Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods based on the conventional fine-tuning technique, the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%.

22 pages, 1045 KiB  
Article
Attentional Extractive Summarization
by José Ángel González, Encarna Segarra, Fernando García-Granada, Emilio Sanchis and Lluís-F. Hurtado
Appl. Sci. 2023, 13(3), 1458; https://doi.org/10.3390/app13031458 - 22 Jan 2023
Cited by 1 | Viewed by 1105
Abstract
In this work, a general theoretical framework for extractive summarization is proposed: the Attentional Extractive Summarization framework. Although abstractive approaches are generally used in text summarization today, extractive methods can be especially suitable for some applications, and they can help with other tasks such as Text Classification, Question Answering, and Information Extraction. The proposed approach is based on the interpretation of the attention mechanisms of hierarchical neural networks, which compute document-level representations of documents and summaries from sentence-level representations, which, in turn, are computed from word-level representations. The models proposed under this framework are able to automatically learn relationships among document and summary sentences, without requiring Oracle systems to compute the reference labels for each sentence before the training phase. These relationships are obtained as a result of a binary classification process, the goal of which is to distinguish correct summaries for documents. Two different systems, formalized under the proposed framework, were evaluated on the CNN/DailyMail and the NewsRoom corpora, which are some of the reference corpora in the most relevant works on text summarization. The results obtained during the evaluation support the adequacy of our proposal and suggest that there is still room for the improvement of our attentional framework.