Search Results (28)

Search Parameters:
Keywords = vocal tract model

11 pages, 2591 KiB  
Article
Clarification of the Acoustic Characteristics of Velopharyngeal Insufficiency by Acoustic Simulation Using the Boundary Element Method: A Pilot Study
by Mami Shiraishi, Katsuaki Mishima, Masahiro Takekawa, Masaaki Mori and Hirotsugu Umeda
Acoustics 2025, 7(2), 26; https://doi.org/10.3390/acoustics7020026 - 13 May 2025
Viewed by 662
Abstract
A model of the vocal tract that mimicked velopharyngeal insufficiency was created, and acoustic analysis was performed using the boundary element method to clarify the acoustic characteristics of velopharyngeal insufficiency. The participants were six healthy adults. Computed tomography (CT) images were taken from the frontal sinus to the glottis during phonation of the Japanese vowels /i/ and /u/, and models of the vocal tracts were created from the CT data. To recreate velopharyngeal insufficiency, coupling of the nasopharynx was introduced in vocal tract models with no nasopharyngeal coupling, and the coupling site was enlarged in models that already had nasopharyngeal coupling. The vocal tract models were extended virtually by 12 cm in a cylindrical shape to represent the region from the lower part of the glottis to the tracheal bifurcation. The Kirchhoff–Helmholtz integral equation was used for the wave equation, and the boundary element method was used for discretization. Frequency response curves from 1 to 3000 Hz were calculated by applying the boundary element method. The curves showed the appearance of a pole–zero pair around 500 Hz, increased intensity around 250 Hz, decreased intensity around 500 Hz, decreased intensities of the first and second formants (F1 and F2), and a lower F2 frequency. Of these findings, the increased intensity around 250 Hz, decreased intensity around 500 Hz, decreased intensities of F1 and F2, and lower F2 frequency agree with the previously reported acoustic characteristics of hypernasality.
(This article belongs to the Special Issue Developments in Acoustic Phonetic Research)
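
For readers unfamiliar with the boundary element formulation mentioned above, the Kirchhoff–Helmholtz integral equation in its standard textbook form (not reproduced from the paper) relates the pressure at a field point to the pressure and its normal derivative on the vocal tract boundary:

```latex
% Kirchhoff-Helmholtz integral equation: p is the acoustic pressure, S the
% vocal tract boundary, n the outward normal, k the wavenumber, and G the
% free-space Green's function.
c(\mathbf{x})\,p(\mathbf{x})
  = \int_{S} \left[ G(\mathbf{x},\mathbf{y})\,\frac{\partial p(\mathbf{y})}{\partial n}
  - p(\mathbf{y})\,\frac{\partial G(\mathbf{x},\mathbf{y})}{\partial n} \right] \mathrm{d}S(\mathbf{y}),
\qquad
G(\mathbf{x},\mathbf{y}) = \frac{e^{\,ik\lvert\mathbf{x}-\mathbf{y}\rvert}}{4\pi\lvert\mathbf{x}-\mathbf{y}\rvert}
```

Here c(x) equals 1 inside the domain and 1/2 on a smooth boundary; discretizing S into boundary elements turns this relation into the linear system solved at each frequency to obtain the response curves.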

16 pages, 551 KiB  
Article
Dual-Channel Spoofed Speech Detection Based on Graph Attention Networks
by Yun Tan, Xiaoqian Weng and Jiangzhang Zhu
Symmetry 2025, 17(5), 641; https://doi.org/10.3390/sym17050641 - 24 Apr 2025
Viewed by 495
Abstract
In the field of voice cryptography, detecting forged speech is crucial for secure communication and identity authentication. While most existing spoof detection methods rely on monaural audio, the characteristics of dual-channel signals remain underexplored. To address this, we propose a symmetrical dual-branch detection [...] Read more.
In the field of voice cryptography, detecting forged speech is crucial for secure communication and identity authentication. While most existing spoof detection methods rely on monaural audio, the characteristics of dual-channel signals remain underexplored. To address this, we propose a symmetrical dual-branch detection framework that integrates Res2Net with coordinate attention (Res2NetCA) and a dual-channel heterogeneous graph fusion module (DHGFM). The proposed architecture encodes left and right vocal tract signals into spectrogram and time-domain graphs, and it models both intra- and inter-channel time–frequency dependencies through graph attention mechanisms and fusion strategies. Experimental results on the ASVspoof2019 and ASVspoof2021 LA datasets demonstrate the superior detection performance of our method. Specifically, it achieved an EER of 1.64% and a Min-tDCF of 0.051 on ASVspoof2019, and an EER of 6.76% with a Min-tDCF of 0.3638 on ASVspoof2021, validating the effectiveness and potential of dual-channel modeling in spoofed speech detection. Full article
(This article belongs to the Special Issue Applications Based on Symmetry in Applied Cryptography)
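
As a rough illustration of the graph attention mechanism underlying the framework above, here is a generic single-head graph attention update in NumPy; this is a sketch of the standard formulation, not the paper's DHGFM, and all names are hypothetical:

```python
import numpy as np

def graph_attention_layer(H, A, W, a, leaky_slope=0.2):
    """Generic single-head graph attention update.
    H: (N, F) node features, A: (N, N) adjacency with self-loops,
    W: (F, Fp) projection matrix, a: (2*Fp,) attention vector."""
    Z = H @ W                                        # project node features
    N = Z.shape[0]
    e = np.empty((N, N))
    for i in range(N):                               # attention logits e_ij
        for j in range(N):
            e[i, j] = np.concatenate([Z[i], Z[j]]) @ a
    e = np.where(e > 0, e, leaky_slope * e)          # LeakyReLU
    e = np.where(A > 0, e, -1e9)                     # keep only graph edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over neighbours
    return alpha @ Z                                 # attention-weighted aggregation
```

Per the abstract, layers of this kind are what model the intra- and inter-channel time–frequency dependencies on the per-channel spectrogram and time-domain graphs.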

16 pages, 2926 KiB  
Article
Acoustic and Clinical Data Analysis of Vocal Recordings: Pandemic Insights and Lessons
by Pedro Carreiro-Martins, Paulo Paixão, Iolanda Caires, Pedro Matias, Hugo Gamboa, Filipe Soares, Pedro Gomez, Joana Sousa and Nuno Neuparth
Diagnostics 2024, 14(20), 2273; https://doi.org/10.3390/diagnostics14202273 - 12 Oct 2024
Viewed by 1256
Abstract
Background/Objectives: The interest in processing human speech and other human-generated audio signals as a diagnostic tool has increased due to the COVID-19 pandemic. The project OSCAR (vOice Screening of CoronA viRus) aimed to develop an algorithm to screen for COVID-19 using a dataset of Portuguese participants with voice recordings and clinical data. Methods: This cross-sectional study aimed to characterise the pattern of sounds produced by the vocal apparatus in patients with SARS-CoV-2 infection documented by a positive RT-PCR test, and to develop and validate a screening algorithm. In Phase II, the algorithm developed in Phase I was tested in a real-world setting. Results: In Phase I, after filtering, the training group consisted of 166 subjects who were effectively available to train the classification model (34.3% SARS-CoV-2 positive/65.7% SARS-CoV-2 negative). Phase II enrolled 58 participants (69.0% SARS-CoV-2 positive/31.0% SARS-CoV-2 negative). The final model achieved a sensitivity of 85%, a specificity of 88.9%, and an F1-score of 84.7%, suggesting voice screening algorithms as an attractive strategy for COVID-19 diagnosis. Conclusions: Our findings highlight the potential of a voice-based detection strategy as an alternative method for respiratory tract screening.
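
For reference, the screening metrics quoted above follow directly from the confusion matrix; a minimal sketch (the counts passed in would be placeholders, not the study's data):

```python
def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and F1-score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on SARS-CoV-2-positive cases
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1
```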

28 pages, 660 KiB  
Article
Improving End-to-End Models for Children’s Speech Recognition
by Tanvina Patel and Odette Scharenborg
Appl. Sci. 2024, 14(6), 2353; https://doi.org/10.3390/app14062353 - 11 Mar 2024
Cited by 2 | Viewed by 3775
Abstract
Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and the limited amount of available annotated children’s speech data. We aim to improve CSR in the often-occurring scenario that no children’s speech data is available for training the Automatic Speech Recognition (ASR) systems. Traditionally, Vocal Tract Length Normalization (VTLN) has been widely used in hybrid ASR systems to address acoustic mismatch and variability in children’s speech when training models on adults’ speech. Meanwhile, End-to-End (E2E) systems often use data augmentation methods to create child-like speech from adults’ speech. For adult-speech-trained ASRs, we investigate the effectiveness of two augmentation methods, speed perturbation and spectral augmentation, along with VTLN, in an E2E framework for the CSR task, comparing these across Dutch, German, and Mandarin. We applied VTLN at different stages (training/test) of the ASR and conducted age and gender analyses. Our experiments showed highly similar patterns across the languages: speed perturbation and spectral augmentation yielded significant performance improvements, while VTLN provided further gains while maintaining recognition performance on adults’ speech (depending on when it was applied). Additionally, VTLN showed performance improvements for both male and female speakers and was particularly effective for younger children.
(This article belongs to the Special Issue Advances in Speech and Language Processing)
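
As a concrete illustration of one of the augmentations discussed above, a minimal speed-perturbation sketch (generic resampling, not the authors' exact pipeline):

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resample a waveform by `factor` (e.g. 0.9, 1.0, 1.1). Playing the result
    at the original rate changes tempo, pitch and formants together; factors
    above 1 shift pitch and formant frequencies upward, toward child-like speech."""
    n_out = int(round(len(waveform) / factor))
    t_out = np.arange(n_out) * factor                # sample positions in the input
    return np.interp(t_out, np.arange(len(waveform)), waveform)
```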

18 pages, 4569 KiB  
Article
Deep Learning for Neuromuscular Control of Vocal Source for Voice Production
by Anil Palaparthi, Rishi K. Alluri and Ingo R. Titze
Appl. Sci. 2024, 14(2), 769; https://doi.org/10.3390/app14020769 - 16 Jan 2024
Cited by 1 | Viewed by 2319
Abstract
A computational neuromuscular control system that generates lung pressure and three intrinsic laryngeal muscle activations (cricothyroid, thyroarytenoid, and lateral cricoarytenoid) to control the vocal source was developed. In the current study, LeTalker, a biophysical computational model of the vocal system, was used as the physical plant. In the LeTalker, a three-mass vocal fold model was used to simulate self-sustained vocal fold oscillation. A constant /ə/ vowel was used for the vocal tract shape. The trachea was modeled after MRI measurements. The neuromuscular control system generates control parameters to achieve four acoustic targets (fundamental frequency, sound pressure level, normalized spectral centroid, and signal-to-noise ratio) and four somatosensory targets (vocal fold length and longitudinal fiber stress in the three vocal fold layers). The deep-learning-based control system comprises one acoustic feedforward controller and two feedback (acoustic and somatosensory) controllers. Fifty thousand steady speech signals were generated using the LeTalker for training the control system. The results demonstrated that the control system was able to generate the lung pressure and the three muscle activations such that the four acoustic and four somatosensory targets were reached with high accuracy. After training, the motor command corrections from the feedback controllers were minimal compared to the feedforward controller, except for thyroarytenoid muscle activation.
(This article belongs to the Special Issue Computational Methods and Engineering Solutions to Voice III)
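
A highly simplified sketch of the controller layout described above (one feedforward path plus two feedback corrections); the function names and interfaces are hypothetical stand-ins for the trained networks and the LeTalker plant, not the authors' implementation:

```python
def control_step(targets, feedforward, fb_acoustic, fb_somatosensory, plant, u_corr):
    """One control iteration: the feedforward controller maps targets to motor
    commands (lung pressure plus three muscle activations); the plant returns
    the produced acoustic and somatosensory outputs; the two feedback
    controllers turn target errors into a correction applied at the next step."""
    u = feedforward(targets) + u_corr
    acoustic_out, somato_out = plant(u)
    u_corr_next = (fb_acoustic(targets["acoustic"], acoustic_out)
                   + fb_somatosensory(targets["somatosensory"], somato_out))
    return u, acoustic_out, u_corr_next
```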

21 pages, 9934 KiB  
Article
On the Alignment of Acoustic and Coupled Mechanic-Acoustic Eigenmodes in Phonation by Supraglottal Duct Variations
by Florian Kraxberger, Christoph Näger, Marco Laudato, Elias Sundström, Stefan Becker, Mihai Mihaescu, Stefan Kniesburges and Stefan Schoder
Bioengineering 2023, 10(12), 1369; https://doi.org/10.3390/bioengineering10121369 - 28 Nov 2023
Cited by 6 | Viewed by 1497
Abstract
Sound generation in human phonation and the underlying fluid–structure–acoustic interaction that describes the sound production mechanism are not fully understood. A previous experimental study, with a silicone vocal fold model connected to a straight vocal tract pipe of fixed length, showed that vibroacoustic coupling can cause a deviation in the vocal fold vibration frequency. This occurred when the fundamental frequency of the vocal fold motion was close to the lowest acoustic resonance frequency of the pipe. What is not fully understood is how the vibroacoustic coupling is influenced by a varying vocal tract length. Presuming that this effect is purely an acoustic coupling phenomenon, a numerical simulation model is established based on the computation of the coupled mechanical–acoustic eigenvalues. By varying the pipe length, the lowest acoustic resonance frequency was adjusted in the experiments and likewise in the simulation setup. In doing so, the evolution of the vocal folds’ coupled eigenvalues and eigenmodes is investigated, which confirms the experimental findings. Finally, it was shown that for normal phonation conditions, the mechanical mode is the most efficient vibration pattern whenever the acoustic resonance of the pipe (lowest formant) is far away from the vocal folds’ vibration frequency. Whenever the lowest formant is slightly lower than the mechanical vocal fold eigenfrequency, the coupled vocal fold motion pattern at the formant frequency dominates.
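
The coupled eigenvalue computation referred to above is, in its standard discretized form, a quadratic eigenvalue problem; shown here in textbook notation (not the paper's exact formulation), where complex eigenvalues encode both resonance frequency and damping:

```latex
% Coupled mechanical-acoustic eigenvalue problem: M, C, K are the assembled
% mass, damping and stiffness matrices of the coupled vocal-fold/duct system.
\left( \lambda^{2} \mathbf{M} + \lambda \mathbf{C} + \mathbf{K} \right) \boldsymbol{\varphi} = \mathbf{0},
\qquad
\lambda = -\delta + i\,2\pi f
```

Here f is the eigenfrequency of a coupled mode and δ its decay rate; tracking how λ and φ evolve as the supraglottal duct length changes is what reveals the alignment of acoustic and mechanical modes.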

18 pages, 24683 KiB  
Article
An Investigation of Acoustic Back-Coupling in Human Phonation on a Synthetic Larynx Model
by Christoph Näger, Stefan Kniesburges, Bogac Tur, Stefan Schoder and Stefan Becker
Bioengineering 2023, 10(12), 1343; https://doi.org/10.3390/bioengineering10121343 - 22 Nov 2023
Cited by 6 | Viewed by 1662
Abstract
In the human phonation process, acoustic standing waves in the vocal tract can influence the fluid flow through the glottis as well as vocal fold oscillation. To investigate the amount of acoustic back-coupling, the supraglottal flow field has been recorded via high-speed particle image velocimetry (PIV) in a synthetic larynx model for several configurations with different vocal tract lengths. Based on the obtained velocity fields, acoustic source terms were computed. Additionally, the sound radiation into the far field was recorded via microphone measurements and the vocal fold oscillation via high-speed camera recordings. The PIV measurements revealed that near a vocal tract resonance frequency fR, the vocal fold oscillation frequency fo (and therefore also the flow field’s fundamental frequency) jumps onto fR. This is accompanied by a substantial relative increase in aeroacoustic sound generation efficiency. Furthermore, the measurements show that fo-fR-coupling increases vocal efficiency, signal-to-noise ratio, harmonics-to-noise ratio and cepstral peak prominence. At the same time, the glottal volume flow needed for stable vocal fold oscillation decreases strongly. All of this results in an improved voice quality and phonation efficiency so that a person phonating with fo-fR-coupling can phonate longer and with better voice quality.

22 pages, 771 KiB  
Article
Evaluation of Glottal Inverse Filtering Techniques on OPENGLOT Synthetic Male and Female Vowels
by Marc Freixes, Luis Joglar-Ongay, Joan Claudi Socoró and Francesc Alías-Pujol
Appl. Sci. 2023, 13(15), 8775; https://doi.org/10.3390/app13158775 - 29 Jul 2023
Cited by 3 | Viewed by 1738
Abstract
Current articulatory-based three-dimensional source–filter models, which allow the production of vowels and diphthongs, still present very limited expressiveness. Glottal inverse filtering (GIF) techniques can become instrumental to identify specific characteristics of both the glottal source signal and the vocal tract transfer function to resemble expressive speech. Several GIF methods have been proposed in the literature; however, their comparison becomes difficult due to the lack of common and exhaustive experimental settings. In this work, first, a two-phase analysis methodology for the comparison of GIF techniques based on a reference dataset is introduced. Next, state-of-the-art GIF techniques based on iterative adaptive inverse filtering (IAIF) and quasi closed phase (QCP) approaches are thoroughly evaluated on OPENGLOT, an open database specifically designed to evaluate GIF, computing well-established GIF error measures after extending male vowels with their female counterparts. The results show that GIF methods obtain better results on male vowels. The QCP-based techniques significantly outperform IAIF-based methods for almost all error metrics and scenarios and are, at the same time, more stable across sex, phonation type, F0, and vowels. The IAIF variants improve the original technique for most error metrics on male vowels, while QCP with spectral tilt compensation achieves a lower spectral tilt error for male vowels than the original QCP.
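
To make the idea of glottal inverse filtering concrete, here is a deliberately crude single-pass sketch (plain linear prediction, not the IAIF or QCP methods evaluated in the paper):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def naive_inverse_filter(speech, order=24):
    """Estimate an all-pole vocal-tract filter by linear prediction and remove
    it from the speech signal; the residual approximates the glottal source
    (derivative). IAIF iterates this idea; QCP weights the closed phase."""
    x = speech - np.mean(speech)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]     # autocorrelation, lags 0..N-1
    a = solve_toeplitz(r[:order], r[1:order + 1])        # LPC coefficients
    inverse_filter = np.concatenate(([1.0], -a))         # A(z) = 1 - sum a_k z^-k
    return lfilter(inverse_filter, [1.0], x)
```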

39 pages, 2509 KiB  
Article
Deriving Vocal Fold Oscillation Information from Recorded Voice Signals Using Models of Phonation
by Wayne Zhao and Rita Singh
Entropy 2023, 25(7), 1039; https://doi.org/10.3390/e25071039 - 10 Jul 2023
Cited by 2 | Viewed by 2838
Abstract
During phonation, the vocal folds exhibit a self-sustained oscillatory motion, which is influenced by the physical properties of the speaker’s vocal folds and driven by the balance of bio-mechanical and aerodynamic forces across the glottis. Subtle changes in the speaker’s physical state can affect voice production and alter these oscillatory patterns. Measuring these can be valuable in developing computational tools that analyze voice to infer the speaker’s state. Traditionally, vocal fold oscillations (VFOs) are measured directly using physical devices in clinical settings. In this paper, we propose a novel analysis-by-synthesis approach that allows us to infer the VFOs directly from recorded speech signals on an individualized, speaker-by-speaker basis. The approach, called the ADLES-VFT algorithm, is proposed in the context of a joint model that combines a phonation model (with a glottal flow waveform as the output) and a vocal tract acoustic wave propagation model such that the output of the joint model is an estimated waveform. The ADLES-VFT algorithm is a forward-backward algorithm which minimizes the error between the recorded waveform and the output of this joint model to estimate its parameters. Once estimated, these parameter values are used in conjunction with a phonation model to obtain its solutions. Since the parameters correlate with the physical properties of the vocal folds of the speaker, model solutions obtained using them represent the individualized VFOs for each speaker. The approach is flexible and can be applied to various phonation models. In addition to presenting the methodology, we show how the VFOs can be quantified from a dynamical systems perspective for classification purposes. Mathematical derivations are provided in an appendix for better readability.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)

10 pages, 3866 KiB  
Article
An Acoustic Simulation Method of the Japanese Vowels /i/ and /u/ by Using the Boundary Element Method
by Mami Shiraishi, Katsuaki Mishima, Masahiro Takekawa, Masaaki Mori and Hirotsugu Umeda
Acoustics 2023, 5(2), 553-562; https://doi.org/10.3390/acoustics5020033 - 6 Jun 2023
Cited by 1 | Viewed by 2455
Abstract
This study aimed to establish and verify the validity of an acoustic simulation method during sustained phonation of the Japanese vowels /i/ and /u/. The study participants were six healthy adults. First, vocal tract models covering the range from the frontal sinus to the glottis were constructed based on computed tomography (CT) data acquired during sustained phonation of /i/ and /u/. To imitate the trachea, the vocal tract models were then virtually extended by 12 cm with cylindrical shapes representing the region between the lower part of the glottis and the tracheal bifurcation. Next, the Kirchhoff–Helmholtz integral equation was used to represent the wave equation for sound propagation, and the boundary element method was used for discretization. As a result, the relative discrimination thresholds of the vowel formant frequencies for /i/ and /u/ against the actual voice were 1.1–10.2% and 0.4–9.3% for the first formant and 3.9–7.5% and 5.0–12.5% for the second formant, respectively. In the vocal tract model with nasal coupling, a pole–zero pair was observed at around 500 Hz, and for both /i/ and /u/, a pole–zero pair was observed at around 1000 Hz regardless of the presence or absence of nasal coupling. Therefore, the boundary element method, which produces solutions by analyzing the boundary rather than the full three-dimensional domain, was considered effective for simulating the Japanese vowels /i/ and /u/ with high validity for vocal tract models encompassing a wide range, from the frontal sinuses to the trachea, constructed from CT data obtained during sustained phonation.
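
The formant comparison reported above can be expressed as a relative difference between simulated and measured formant frequencies; a minimal sketch (the interpretation and the numbers are illustrative, not taken from the study):

```python
def relative_difference_percent(f_simulated, f_measured):
    """Relative difference between a simulated and a measured formant
    frequency, in percent of the measured value."""
    return abs(f_simulated - f_measured) / f_measured * 100.0

# e.g. a simulated F1 of 280 Hz against a measured F1 of 300 Hz -> ~6.7%
print(relative_difference_percent(280.0, 300.0))
```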

20 pages, 4358 KiB  
Article
Automatic Multiple Articulator Segmentation in Dynamic Speech MRI Using a Protocol Adaptive Stacked Transfer Learning U-NET Model
by Subin Erattakulangara, Karthika Kelat, David Meyer, Sarv Priya and Sajan Goud Lingala
Bioengineering 2023, 10(5), 623; https://doi.org/10.3390/bioengineering10050623 - 22 May 2023
Cited by 4 | Viewed by 2898
Abstract
Dynamic magnetic resonance imaging has emerged as a powerful modality for investigating upper-airway function during speech production. Analyzing the changes in the vocal tract airspace, including the position of soft-tissue articulators (e.g., the tongue and velum), enhances our understanding of speech production. The advent of various fast speech MRI protocols based on sparse sampling and constrained reconstruction has led to the creation of dynamic speech MRI datasets on the order of 80–100 image frames/second. In this paper, we propose a stacked transfer learning U-NET model to segment the deforming vocal tract in 2D mid-sagittal slices of dynamic speech MRI. Our approach leverages (a) low- and mid-level features and (b) high-level features. The low- and mid-level features are derived from models pre-trained on labeled open-source brain tumor MR and lung CT datasets, and an in-house airway labeled dataset. The high-level features are derived from labeled protocol-specific MR images. The applicability of our approach to segmenting dynamic datasets is demonstrated in data acquired from three fast speech MRI protocols: Protocol 1: a 3 T-based radial acquisition scheme coupled with a non-linear temporal regularizer, where speakers were producing French speech tokens; Protocol 2: a 1.5 T-based uniform density spiral acquisition scheme coupled with a temporal finite difference (FD) sparsity regularization, where speakers were producing fluent speech tokens in English; and Protocol 3: a 3 T-based variable density spiral acquisition scheme coupled with manifold regularization, where speakers were producing various speech tokens from the International Phonetic Alphabet (IPA). Segmentations from our approach were compared to those from an expert human user (a vocologist) and from the conventional U-NET model without transfer learning. Segmentations from a second expert human user (a radiologist) were used as ground truth. Evaluations were performed using the quantitative DICE similarity metric, the Hausdorff distance metric, and a segmentation count metric. This approach was successfully adapted to different speech MRI protocols with only a handful of protocol-specific images (e.g., of the order of 20 images), and provided accurate segmentations similar to those of an expert human.
(This article belongs to the Special Issue AI in MRI: Frontiers and Applications)
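
For reference, the DICE similarity metric used in the evaluation above is the standard overlap measure between two binary masks; a minimal sketch:

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice similarity between two binary segmentation masks
    (1.0 = perfect overlap, 0.0 = no overlap)."""
    mask_a = mask_a.astype(bool)
    mask_b = mask_b.astype(bool)
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())
```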

18 pages, 1787 KiB  
Article
Deep Reinforcement Learning for Articulatory Synthesis in a Vowel-to-Vowel Imitation Task
by Denis Shitov, Elena Pirogova, Tadeusz A. Wysocki and Margaret Lech
Sensors 2023, 23(7), 3437; https://doi.org/10.3390/s23073437 - 24 Mar 2023
Cited by 1 | Viewed by 1858
Abstract
Articulatory synthesis is one of the approaches used for modeling human speech production. In this study, we propose a model-based algorithm for learning the policy to control the vocal tract of the articulatory synthesizer in a vowel-to-vowel imitation task. Our method does not require external training data, since the policy is learned through interactions with the vocal tract model. To improve the sample efficiency of the learning, we trained the model of speech production dynamics simultaneously with the policy. The policy was trained in a supervised way using predictions of the model of speech production dynamics. To stabilize the training, early stopping was incorporated into the algorithm. Additionally, we extracted acoustic features using an acoustic word embedding (AWE) model. This model was trained to discriminate between different words and to enable compact encoding of acoustics while preserving contextual information of the input. Our preliminary experiments showed that introducing this AWE model was crucial to guide the policy toward a near-optimal solution. The acoustic embeddings, obtained using the proposed approach, were revealed to be useful when applied as inputs to the policy and the model of speech production dynamics.
(This article belongs to the Special Issue Signal Processing in Biomedical Sensor Systems)

21 pages, 6664 KiB  
Article
3D Dynamic Spatiotemporal Atlas of the Vocal Tract during Consonant–Vowel Production from 2D Real Time MRI
by Ioannis K. Douros, Yu Xie, Chrysanthi Dourou, Karyna Isaieva, Pierre-André Vuissoz, Jacques Felblinger and Yves Laprie
J. Imaging 2022, 8(9), 227; https://doi.org/10.3390/jimaging8090227 - 25 Aug 2022
Cited by 1 | Viewed by 2607
Abstract
In this work, we address the problem of creating a 3D dynamic atlas of the vocal tract that captures the dynamics of the articulators in all three dimensions in order to create a global speaker model independent of speaker-specific characteristics. The core steps of the proposed method are the temporal alignment of the real-time MR images acquired in several sagittal planes and their combination with adaptive kernel regression. As a preprocessing step, a reference space was created in order to remove anatomical information of the speakers and keep only the variability in speech production for the construction of the atlas. The adaptive kernel regression makes the choice of atlas time points independent of the time points of the frames used as input for the construction. The evaluation of this atlas construction method was performed by mapping two new speakers to the atlas and checking how similar the resulting mapped images were. The use of the atlas helps in reducing subject variability. The results show that the use of the proposed atlas can capture the dynamic behavior of the articulators and is able to generalize the speech production process by creating a universal-speaker reference space.
(This article belongs to the Special Issue Spatio-Temporal Biomedical Image Analysis)
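
As a simplified stand-in for the kernel regression step mentioned above, here is a fixed-bandwidth Nadaraya–Watson estimate over time; the adaptive-bandwidth variant used in the paper would adjust the bandwidth per atlas time point, and all names here are hypothetical:

```python
import numpy as np

def kernel_regression_frame(t_atlas, frame_times, frames, bandwidth):
    """Estimate the atlas image at time `t_atlas` as a Gaussian-weighted
    average of the temporally aligned input frames.
    frames: array of shape (num_frames, height, width)."""
    weights = np.exp(-0.5 * ((t_atlas - frame_times) / bandwidth) ** 2)
    weights /= weights.sum()
    return np.tensordot(weights, frames, axes=(0, 0))   # weighted image average
```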

21 pages, 1855 KiB  
Article
A Robust and Low Computational Cost Pitch Estimation Method
by Desheng Wang, Yangjie Wei, Yi Wang and Jing Wang
Sensors 2022, 22(16), 6026; https://doi.org/10.3390/s22166026 - 12 Aug 2022
Cited by 4 | Viewed by 3406
Abstract
Pitch estimation is widely used in speech and audio signal processing. However, the current methods of modeling harmonic structure used for pitch estimation cannot always match the harmonic distribution of actual signals. Due to the structure of the vocal tract, the acoustic nature of musical instruments, and the spectrum leakage issue, the harmonic frequencies of speech and audio signals often deviate slightly from integer multiples of the pitch. This paper starts with the summation of residual harmonics (SRH) method and makes two main modifications. First, the spectral peak position constraint of strict integer multiples is relaxed to allow slight deviation, which helps capture harmonics. Second, a main pitch segment extension scheme with low computational cost is proposed to exploit the smoothness prior of pitch more efficiently. In addition, the pitch segment extension scheme is also integrated into the SRH method's voiced/unvoiced decision to reduce short-term errors. Accuracy comparison experiments with ten pitch estimation methods show that the proposed method has better overall accuracy and robustness. Time cost experiments show that the time cost of the proposed method is reduced to around 1/8 that of the state-of-the-art fast NLS method on the experimental computer.
(This article belongs to the Topic Human–Machine Interaction)
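
To illustrate the starting point of the method, here is a sketch of the SRH score (following Drugman and Alwan's formulation) with an optional relative tolerance around each harmonic; the tolerance handling is an assumption standing in loosely for the relaxed peak-position constraint described above, not the paper's exact scheme:

```python
import numpy as np

def srh_score(residual_spectrum, freqs, f0, n_harmonics=5, rel_tol=0.0):
    """Summation of Residual Harmonics score for a candidate pitch f0:
    reward spectral energy at harmonics k*f0, penalize energy between them.
    `rel_tol` lets each harmonic peak deviate slightly from k*f0."""
    def band_max(f_center):
        lo, hi = f_center * (1.0 - rel_tol), f_center * (1.0 + rel_tol)
        idx = np.where((freqs >= lo) & (freqs <= hi))[0]
        if idx.size == 0:                       # fall back to the nearest bin
            idx = np.array([np.argmin(np.abs(freqs - f_center))])
        return residual_spectrum[idx].max()

    score = band_max(f0)
    for k in range(2, n_harmonics + 1):
        score += band_max(k * f0) - band_max((k - 0.5) * f0)
    return score
```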

15 pages, 14826 KiB  
Article
Data-Driven Analysis of European Portuguese Nasal Vowel Dynamics in Bilabial Contexts
by Nuno Almeida, Samuel Silva, Conceição Cunha and António Teixeira
Appl. Sci. 2022, 12(9), 4601; https://doi.org/10.3390/app12094601 - 3 May 2022
Viewed by 2424
Abstract
European Portuguese (EP) is characterized by a large number of nasals encompassing five phonemic nasal vowels. One notable characteristic of these sounds is their dynamic nature, involving both oral and nasal gestures, which makes their study and characterization challenging. The study of nasal vowels, in particular, has been addressed using a wide range of technologies: early descriptions were based on acoustics and nasalance, later expanded with articulatory data obtained from EMA and real-time magnetic resonance imaging (RT-MRI). While providing important results, these studies were limited by the discrete nature of the EMA pellets, providing only limited coverage of the vocal tract; by the low temporal resolution of the MRI data; and by the small number of speakers. To tackle these limitations, and to take advantage of recent advances in RT-MRI allowing 50 fps, novel articulatory data have been acquired for 11 EP speakers. The work presented here explores the capabilities of recently proposed data-driven approaches to model articulatory data extracted from RT-MRI to assess their suitability for investigating the dynamic characteristics of nasal vowels. To this end, we explore vocal tract configurations over time, along with the coordination of velum and lip aperture in oral and nasal bilabial contexts for nasal vowels and their oral congeners. Overall, the results show that both generalized additive mixed models (GAMMs) and functional linear mixed models (FLMMs) provide an elegant approach to tackle the data from multiple speakers. More specifically, we found oro-pharyngeal differences in the tongue configurations for low and mid nasal vowels: vocal tract aperture was larger in the pharyngeal and smaller in the palatal region for the three non-high nasal vowels, providing evidence of a raised and more advanced tongue position for the nasal vowels. Even though this work is aimed at exploring the applicability of the methods, the outcomes already highlight interesting data for the dynamic characterization of EP nasal vowels.
