Special Issue "IberSPEECH 2018: Speech and Language Technologies for Iberian Languages"

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (14 July 2019).

Special Issue Editors

Prof. Dr. Francesc Alías
Guest Editor
GTM – Grup de Recerca en Tecnologies Mèdia. La Salle – Universitat Ramon Llull. Barcelona, Spain
Interests: acoustic event detection; acoustic signal processing; machine listening; real-time noise monitoring; impact of noise events; real-life acoustic datasets; wireless acoustic sensor networks
Dr. Jordi Luque
Guest Editor
Telefónica Research, Barcelona, Spain
Dr. Antonio Bonafonte
Guest Editor
Universitat Politècnica de Catalunya, Barcelona, Spain
Dr. António Teixeira
Guest Editor
Universidade de Aveiro, Portugal

Special Issue Information

Dear Colleagues,

Following previous editions, IberSPEECH’2018 will be held in Barcelona, from 21 to 23 November 2018. The IberSPEECH event—the fourth of its kind using this name—brings together the X Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop events. The conference provides a platform for scientific and industrial discussion and exchange around Iberian languages, with the following main topics of interest:

  1. Speech technology and applications
  2. Human speech production, perception, and communication
  3. Natural language processing and applications
  4. Speech, language and multimodality
  5. Resources, standardization, and evaluation

The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language processing, paying special attention to those focused on Iberian languages. We invite researchers interested in these fields to contribute to this issue, which covers all topics of IberSPEECH2018 (http://iberspeech2018.talp.cat/index.php/call-for-papers/). Papers submitted from the conference will enjoy a 10% discount. (The Article Processing Charge (APC) for publication in Applied Sciences is 1260 CHF (Swiss Francs) for submissions before the end of December 2018 and 1350 CHF after 1 January 2019.)

For papers selected from the conference, authors are asked to follow the journal’s instructions on “Preprints and Conference Papers”: https://www.mdpi.com/journal/applsci/instructions#preprints

Prof. Dr. Francesc Alías
Dr. Jordi Luque
Dr. Antonio Bonafonte
Dr. António Teixeira
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Speech technology and applications
  • Human speech production, perception, and communication
  • Natural language processing and applications
  • Speech, language and multimodality
  • Resources, standardization, and evaluation

Published Papers (14 papers)

Editorial

Open Access Editorial
Editorial for Special Issue “IberSPEECH2018: Speech and Language Technologies for Iberian Languages”
Appl. Sci. 2020, 10(1), 384; https://doi.org/10.3390/app10010384 - 04 Jan 2020
Abstract
The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language technologies based on the works presented at the IberSPEECH edition held in Barcelona in 2018, paying special attention to those focused on Iberian languages. IberSPEECH is the international conference of the Special Interest Group on Iberian Languages (SIG-IL) of the International Speech Communication Association (ISCA) and of the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla, or RTTH for short). Several researchers were invited to extend their contributions presented at IberSPEECH2018 due to their interest and quality. As a result, this Special Issue is composed of 13 papers that cover different topics of investigation related to perception, speech analysis and enhancement, speaker verification and identification, speech production and synthesis, natural language processing, together with several applications and evaluation challenges. Full article

Research

Open Access Article
Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media
Appl. Sci. 2019, 9(24), 5412; https://doi.org/10.3390/app9245412 - 11 Dec 2019
Cited by 2
Abstract
The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla (RTTH)). That series was focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporacion Radio Television Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available for scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies, such as the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, or specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained. Full article

Open Access Article
Glottal Source Contribution to Higher Order Modes in the Finite Element Synthesis of Vowels
Appl. Sci. 2019, 9(21), 4535; https://doi.org/10.3390/app9214535 - 25 Oct 2019
Cited by 1
Abstract
Articulatory speech synthesis has long been based on one-dimensional (1D) approaches. They assume plane wave propagation within the vocal tract and disregard higher order modes that typically appear above 5 kHz. However, such modes may be relevant in obtaining a more natural voice, especially for phonation types with significant high frequency energy (HFE) content. This work studies the contribution of the glottal source at high frequencies in the 3D numerical synthesis of vowels. The spoken vocal range is explored using an LF (Liljencrants–Fant) model enhanced with aspiration noise and controlled by the Rd glottal shape parameter. The vowels [ɑ], [i], and [u] are generated with a finite element method (FEM) using realistic 3D vocal tract geometries obtained from magnetic resonance imaging (MRI), as well as simplified straight vocal tracts of a circular cross-sectional area. The symmetry of the latter prevents the onset of higher order modes. Thus, the comparison between realistic and simplified geometries enables us to analyse the influence of such modes. The simulations indicate that higher order modes may be perceptually relevant, particularly for tense phonations (lower Rd values) and/or high fundamental frequency values, F0s. Conversely, vowels with a lax phonation and/or low F0s may result in inaudible HFE levels, especially if aspiration noise is not considered in the glottal source model. Full article

Open Access Article
Towards a Universal Semantic Dictionary
Appl. Sci. 2019, 9(19), 4060; https://doi.org/10.3390/app9194060 - 28 Sep 2019
Cited by 1
Abstract
A novel method for finding linear mappings among word embeddings for several languages, taking as pivot a shared, multilingual embedding space, is proposed in this paper. Previous approaches learned translation matrices between two specific languages, while this method learns translation matrices between a given language and a shared, multilingual space. The system was first trained on bilingual, and later on multilingual corpora as well. In the first case, two different training data were applied: Dinu’s English–Italian benchmark data, and English–Italian translation pairs extracted from the PanLex database. In the second case, only the PanLex database was used. The system performs on English–Italian languages with the best setting significantly better than the baseline system given by Mikolov, and it provides a comparable performance with more sophisticated systems. Exploiting the richness of the PanLex database, the proposed method makes it possible to learn linear mappings among an arbitrary number of languages. Full article
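The core of such an approach, mapping one embedding space into another with a linear transform, can be sketched as follows. This is an illustrative least-squares fit in the spirit of Mikolov's translation matrices, not the exact training procedure of the paper; the variable names and synthetic data are ours:

```python
import numpy as np

def fit_linear_mapping(X, Z):
    """Least-squares translation matrix W minimising ||X @ W - Z||_F.

    X holds source-language word vectors (one per row), Z the vectors of
    their translations in the target (here: shared) space.
    """
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

# Toy check: recover a known mapping from synthetic "embeddings".
rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 4))
X = rng.normal(size=(100, 5))
Z = X @ W_true
W = fit_linear_mapping(X, Z)
print(np.allclose(W, W_true))  # True on noise-free data
```

With a shared pivot space, one such matrix per language suffices, instead of one per language pair.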

Open Access Article
Summarization of Spanish Talk Shows with Siamese Hierarchical Attention Networks
Appl. Sci. 2019, 9(18), 3836; https://doi.org/10.3390/app9183836 - 12 Sep 2019
Cited by 1
Abstract
In this paper, we present an approach to Spanish talk show summarization. Our approach is based on the use of Siamese Neural Networks on the transcription of the show audios. Specifically, we propose to use Hierarchical Attention Networks to select the most relevant sentences for each speaker about a given topic in the show, in order to summarize their opinion about the topic. We train these networks in a siamese way to determine whether a summary is appropriate or not. A previous evaluation of this approach on the summarization task of English newspapers achieved performance similar to other state-of-the-art systems. In the absence of enough transcribed or recognized speech data to train our system for talk show summarization in Spanish, we acquired a large corpus of document–summary pairs from Spanish newspapers and used it to train our system. We chose the newspaper domain due to its high similarity with the topics addressed in talk shows. A preliminary evaluation of our summarization system on Spanish TV programs shows the adequacy of the proposal. Full article

Open Access Article
An Analysis of the Short Utterance Problem for Speaker Characterization
Appl. Sci. 2019, 9(18), 3697; https://doi.org/10.3390/app9183697 - 05 Sep 2019
Cited by 1
Abstract
Speaker characterization has always been conditioned by the length of the evaluated utterances. Despite performing well with large amounts of audio, significant degradations in performance are obtained when short utterances are considered. In this work, we present an analysis of the short utterance problem, providing an alternative point of view. From our perspective, the performance in the evaluation of short utterances is highly influenced by the phonetic similarity between enrollment and test utterances. Both enrollment and test should contain similar phonemes to properly discriminate, being degraded otherwise. In this study, we also interpret short utterances as incomplete long utterances where some acoustic units are either unbalanced or just missing. These missing units are responsible for the speaker representations being unreliable. These unreliable representations are biased with respect to the reference counterparts obtained from long utterances. These undesired shifts increase the intra-speaker variability, causing a significant loss of performance. According to our experiments, short utterances (3–60 s) can perform as accurately as long utterances by just reassuring the phonetic distributions. This analysis is determined by the current embedding extraction approach, based on the accumulation of local short-time information. Thus, it is applicable to most state-of-the-art embeddings, including traditional i-vectors and Deep Neural Network (DNN) x-vectors. Full article

Open Access Article
Exploring Efficient Neural Architectures for Linguistic–Acoustic Mapping in Text-To-Speech
Appl. Sci. 2019, 9(16), 3391; https://doi.org/10.3390/app9163391 - 17 Aug 2019
Cited by 1
Abstract
Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU. Full article
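The first mechanism rests on the quasi-recurrent idea: gates and candidate activations are produced by convolutions over the input alone, leaving only a cheap element-wise recurrence along time. The sketch below (our illustration, not the authors' implementation; the convolutional gate computation is assumed given) shows that recurrence, often called f-pooling:

```python
import numpy as np

def qrnn_f_pooling(Z, F):
    """Element-wise f-pooling recurrence of a quasi-recurrent layer.

    Z: (T, d) candidate activations, F: (T, d) forget gates in (0, 1).
    Both are assumed to come from convolutions over the input alone, so
    the only strictly sequential work left is this cheap loop:
        h_t = f_t * h_{t-1} + (1 - f_t) * z_t
    """
    h = np.zeros(Z.shape[1])
    H = np.empty_like(Z)
    for t in range(Z.shape[0]):
        h = F[t] * h + (1.0 - F[t]) * Z[t]
        H[t] = h
    return H
```

Because no weight matrix sits on the temporal connection, the expensive affine transforms parallelise across time, which is the source of the reported speedups.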

Open Access Article
Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification
Appl. Sci. 2019, 9(16), 3295; https://doi.org/10.3390/app9163295 - 11 Aug 2019
Cited by 4
Abstract
In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from the global average pooling of the temporal dimension. Our system replaces this reduction mechanism by a phonetic phrase alignment model to keep the temporal structure of each phrase since the phonetic information is relevant in the verification task. Moreover, we can apply a convolutional neural network as front-end, and, thanks to the alignment process being differentiable, we can train the network to produce a supervector for each utterance that will be discriminative to the speaker and the phrase simultaneously. This choice has the advantage that the supervector encodes the phrase and speaker information providing good performance in text-dependent speaker verification tasks. The verification process is performed using a basic similarity metric. The new model using alignment to produce supervectors was evaluated on the RSR2015-Part I database, providing competitive results compared to similar size networks that make use of the global average pooling to extract embeddings. Furthermore, we also evaluated this proposal on the RSR2015-Part II. To our knowledge, this system achieves the best published results obtained on this second part. Full article

Open Access Article
Intelligibility and Listening Effort of Spanish Oesophageal Speech
Appl. Sci. 2019, 9(16), 3233; https://doi.org/10.3390/app9163233 - 08 Aug 2019
Cited by 2
Abstract
Communication is a huge challenge for oesophageal speakers, be it for interactions with fellow humans or with digital voice assistants. We aim to quantify these communication challenges (both human–human and human–machine interactions) by measuring intelligibility and Listening Effort (LE) of Oesophageal Speech (OS) in comparison to Healthy Laryngeal Speech (HS). We conducted two listening tests (one web-based, the other in laboratory settings) to collect these measurements. Participants performed a sentence recognition and LE rating task in each test. Intelligibility, calculated as Word Error Rate, showed significant correlation with self-reported LE ratings. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. More LE was reported for OS compared to HS even when OS intelligibility was close to HS. Listeners familiar with OS reported less effort when listening to OS compared to nonfamiliar listeners. However, such advantage of familiarity was not observed for intelligibility. Automatic speech recognition scores were higher for OS compared to HS. Full article
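Intelligibility here is measured as Word Error Rate, the edit distance between recognised and reference word sequences normalised by reference length. A minimal sketch of that computation (generic WER via the standard dynamic programme, not the authors' exact scoring script):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
```

Note that WER can exceed 1.0 when the hypothesis inserts many words, which matters when comparing highly degraded speech conditions.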

Open Access Article
Application of Pitch Derived Parameters to Speech and Monophonic Singing Classification
Appl. Sci. 2019, 9(15), 3140; https://doi.org/10.3390/app9153140 - 02 Aug 2019
Cited by 2
Abstract
Speech and singing voice discrimination is an important task in the speech processing area given that each type of voice requires different information retrieval and signal processing techniques. This discrimination task is hard even for humans depending on the length of voice segments. In this article, we present an automatic speech and singing voice classification method using pitch parameters derived from musical note information and f0 stability analysis. We applied our method to a database containing speech and a cappella singing and compared the results with other discrimination techniques based on information derived from pitch and spectral envelope. Our method obtains good results discriminating both voice types, is efficient, has good generalisation capabilities and is computationally fast. In the process, we have also created a note detection algorithm with parametric control of the characteristics of the notes it detects. We compared the agreement of this algorithm with a state-of-the-art note detection algorithm and performed an experiment that proves that speech and singing discrimination parameters can represent generic information about the music style of the singing voice. Full article
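The f0-stability idea can be illustrated with a toy measure: within a sung note the pitch range of a short analysis window stays small, whereas spoken f0 glides more or less continuously. This is our simplified illustration with hypothetical parameter values, not the parametrisation used in the paper:

```python
import numpy as np

def stable_pitch_ratio(f0, max_range_cents=50, win=5):
    """Fraction of voiced frames sitting in a window whose total pitch
    range stays below max_range_cents. High values suggest held notes
    (singing); low values suggest gliding pitch (speech)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = np.flatnonzero(f0 > 0)  # convention: f0 == 0 marks unvoiced
    stable = 0
    for t in voiced:
        local = f0[max(0, t - win): t + win + 1]
        local = local[local > 0]  # ignore unvoiced frames in the window
        span = 1200.0 * np.log2(local.max() / local.min())
        stable += span <= max_range_cents
    return stable / len(voiced)
```

A held 220 Hz note scores 1.0, while a continuous glide of one semitone per 12 frames scores near 0.0, which is the kind of contrast such pitch-derived parameters exploit.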

Open Access Article
Restricted Boltzmann Machine Vectors for Speaker Clustering and Tracking Tasks in TV Broadcast Shows
Appl. Sci. 2019, 9(13), 2761; https://doi.org/10.3390/app9132761 - 09 Jul 2019
Cited by 3
Abstract
Restricted Boltzmann Machines (RBMs) have shown success in both the front-end and back-end of speaker verification systems. In this paper, we propose applying RBMs to the front-end for the tasks of speaker clustering and speaker tracking in TV broadcast shows. RBMs are trained to transform utterances into a vector-based representation. Because of the lack of data for a test speaker, we propose RBM adaptation to a global model. First, the global model—which is referred to as universal RBM—is trained with all the available background data. Then, an adapted RBM model is trained with the data of each test speaker. The visible-to-hidden weight matrices of the adapted models are concatenated along with the bias vectors and are whitened to generate the vector representation of speakers. These vectors, referred to as RBM vectors, were shown to preserve speaker-specific information and are used in the tasks of speaker clustering and speaker tracking. The evaluation was performed on the audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed speaker clustering system gained up to 12% relative improvement, in terms of Equal Impurity (EI), over the baseline system. On the other hand, in the task of speaker tracking, our system has a relative improvement of 11% and 7% compared to the baseline system using cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring, respectively. Full article
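The whitening step applied to the concatenated weight-and-bias vectors can be sketched generically; the abstract does not spell out which variant is used, so the PCA whitening below is just one standard choice (our illustration):

```python
import numpy as np

def whiten(V, eps=1e-9):
    """PCA-whiten a set of vectors (one per row): centre them, rotate
    onto the principal axes and rescale each axis to unit variance, so
    that later cosine or PLDA scoring is not dominated by a few
    high-variance directions."""
    Xc = V - V.mean(axis=0)
    cov = Xc.T @ Xc / len(V)
    w, U = np.linalg.eigh(cov)       # eigendecomposition of covariance
    return Xc @ U / np.sqrt(w + eps) # rotate, then scale each axis
```

After whitening, the sample covariance of the vectors is (numerically) the identity, which puts cosine distances between RBM vectors on an equal footing across dimensions.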

Open Access Article
Dual-Channel Speech Enhancement Based on Extended Kalman Filter Relative Transfer Function Estimation
Appl. Sci. 2019, 9(12), 2520; https://doi.org/10.3390/app9122520 - 20 Jun 2019
Cited by 1
Abstract
This paper deals with speech enhancement in dual-microphone smartphones using beamforming along with postfiltering techniques. The performance of these algorithms relies on a good estimation of the acoustic channel and speech and noise statistics. In this work we present a speech enhancement system that combines the estimation of the relative transfer function (RTF) between microphones using an extended Kalman filter framework with a novel speech presence probability estimator intended to track the noise statistics’ variability. The available dual-channel information is exploited to obtain more reliable estimates of clean speech statistics. Noise reduction is further improved by means of postfiltering techniques that take advantage of the speech presence estimation. Our proposal is evaluated in different reverberant and noisy environments when the smartphone is used in both close-talk and far-talk positions. The experimental results show that our system achieves improvements in terms of noise reduction, low speech distortion and better speech intelligibility compared to other state-of-the-art approaches. Full article

Open Access Article
Data Augmentation for Speaker Identification under Stress Conditions to Combat Gender-Based Violence
Appl. Sci. 2019, 9(11), 2298; https://doi.org/10.3390/app9112298 - 04 Jun 2019
Cited by 4
Abstract
A Speaker Identification system for a personalized wearable device to combat gender-based violence is presented in this paper. Speaker recognition systems exhibit a decrease in performance when the user is under emotional or stress conditions; thus, the objective of this paper is to measure the effects of stress in speech to ultimately try to mitigate its consequences on a speaker identification task, by using data augmentation techniques specifically tailored for this purpose given the lack of data resources for this condition. Extensive experimentation has been carried out for assessing the effectiveness of the proposed techniques. First, we conclude that the best performance is always obtained when naturally stressed samples are included in the training set, and second, when these are not available, their substitution and augmentation with synthetically generated stress-like samples improves the performance of the system. Full article

Open Access Article
Automatic Assessment of Prosodic Quality in Down Syndrome: Analysis of the Impact of Speaker Heterogeneity
Appl. Sci. 2019, 9(7), 1440; https://doi.org/10.3390/app9071440 - 05 Apr 2019
Cited by 3
Abstract
Prosody is a fundamental speech element responsible for communicative functions such as intonation, accent and phrasing, and prosodic impairments of individuals with intellectual disabilities reduce their communication skills. Yet, technological resources have paid little attention to prosody. This study aims to develop an automatic classifier to predict the prosodic quality of utterances produced by individuals with Down syndrome, and to analyse how inter-individual heterogeneity affects assessment results. A therapist and an expert in prosody judged the prosodic appropriateness of a corpus of utterances by individuals with Down syndrome, collected through a video game. The judgments of the expert were used to train an automatic classifier that predicts prosodic quality by using a set of fundamental frequency, duration and intensity features. The classifier accuracy was 79.3% and its true positive rate 89.9%. We analyzed how informative each of the features was for the assessment and studied relationships between participants’ developmental level and results: interspeaker variability conditioned the relative weight of prosodic features for automatic classification, and participants’ developmental level was related to the prosodic quality of their productions. Therefore, since speaker variability is an intrinsic feature of individuals with Down syndrome, it should be considered to attain an effective automatic prosodic assessment system. Full article
