Search Results (43)

Search Parameters:
Keywords = linear prediction of speech

16 pages, 4224 KiB  
Article
Optimizing Museum Acoustics: How Absorption Magnitude and Surface Location of Finishing Materials Influence Acoustic Performance
by Milena Jonas Bem and Jonas Braasch
Acoustics 2025, 7(3), 43; https://doi.org/10.3390/acoustics7030043 - 11 Jul 2025
Viewed by 165
Abstract
The architecture of contemporary museums often emphasizes visual aesthetics, such as large volumes, open-plan layouts, and highly reflective finishes, resulting in acoustic challenges such as excessive reverberation, poor speech intelligibility, elevated background noise, and reduced privacy. This study quantified the impact of surface-specific absorption treatments on acoustic metrics across eight gallery spaces. Room impulse responses were used to calibrate virtual models, which simulated nine absorption scenarios (low, medium, and high on ceilings, floors, and walls) and evaluated reverberation time (T20), speech transmission index (STI), clarity (C50), distraction distance (rD), Spatial Decay Rate of Speech (D2,S), and Speech Level at 4 m (Lp,A,S,4m). The results indicate that replacing a concrete floor with a wooden one yields the largest T20 reductions (up to −1.75 s), ceiling treatments deliver the greatest STI and C50 gains (e.g., STI increases of +0.16), and high-absorption walls maximize the privacy metrics (D2,S and Lp,A,S,4m). A linear regression model further predicted STI from T20, total absorption (Sabins), and room volume with a conditional R² of 84.9%, enabling prediction to within ±0.03 without specialized testing. These findings provide empirically derived, surface-specific “first-move” guidelines for architects and acousticians, underscoring the necessity of integrating acoustics early in museum design to balance auditory and visual objectives and enhance the visitor experience. Full article
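
A minimal sketch of the regression step described above, fitting STI from T20, total absorption, and room volume with scikit-learn. The gallery data below are made up for illustration and are not the paper's measurements (the authors report a conditional R² of 84.9% for their model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical gallery data: [T20 (s), total absorption (Sabins), volume (m^3)]
X = np.array([
    [2.4, 310.0, 4200.0],
    [1.8, 520.0, 3900.0],
    [1.1, 880.0, 4100.0],
    [0.9, 1020.0, 3600.0],
])
sti = np.array([0.42, 0.51, 0.62, 0.68])  # measured STI per gallery (made up)

model = LinearRegression().fit(X, sti)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("predicted STI:", model.predict([[1.5, 700.0, 4000.0]]))
```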

29 pages, 9545 KiB  
Article
A Class of Perfectly Secret Autonomous Low-Bit-Rate Voice Communication Systems
by Jelica Radomirović, Milan Milosavljević, Sara Čubrilović, Zvezdana Kuzmanović, Miroslav Perić, Zoran Banjac and Dragana Perić
Symmetry 2025, 17(3), 365; https://doi.org/10.3390/sym17030365 - 27 Feb 2025
Viewed by 538
Abstract
This paper presents an autonomous perfectly secure low-bit-rate voice communication system (APS-VCS) based on the mixed-excitation linear prediction voice coder (MELPe), the Vernam cipher, and a sequential key distillation (SKD) protocol by public discussion. The authenticated public channel can be selected from a wide range of options, from internet connections to specially leased radio channels. We found the source of common randomness between the locally synthesized speech signal at the transmitter and the reconstructed speech signal at the receiver side. To avoid leaking information about the plaintext input speech, the SKD protocol is not executed on the actual transmitted speech signal but on artificially synthesized speech obtained by random selection of the line spectral pair (LSP) parameters of the speech production model. Experimental verification of the proposed system was performed on the Vlatacom Personal Crypto Platform for Voice encryption (vPCP-V). Empirical measurements show that, with an adequate selection of system parameters, voice transmission at 1.2 kb/s can achieve a secret key rate (KR) of up to 8.8 kb/s, with a negligible leakage rate (LR) and a bit error rate (BER) of the order of 10⁻³ across various communication channels, including GSM 3G and GSM VoLTE networks. At the same time, by ensuring perfect secrecy within a symmetric encryption system, it further highlights the importance of the symmetry principle in information-theoretic security. To our knowledge, this is the first autonomous, perfectly secret system for low-bit-rate voice communication that does not require explicit prior generation and distribution of secret keys. Full article
(This article belongs to the Special Issue Symmetry and Asymmetry in Cryptography, Second Edition)
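
The perfect-secrecy claim rests on the Vernam (one-time pad) construction; a minimal sketch, assuming the SKD protocol has already distilled enough key material (secrecy holds only if the key is uniformly random, at least as long as the frame, and never reused):

```python
import os

def vernam(data: bytes, key: bytes) -> bytes:
    # One-time pad: bytewise XOR; decryption is the same operation.
    assert len(key) >= len(data), "pad must cover the whole message"
    return bytes(d ^ k for d, k in zip(data, key))

frame = b"\x12\x34\x56\x78"          # e.g., one MELPe-coded voice frame
key = os.urandom(len(frame))         # stand-in for SKD-distilled key bits
cipher = vernam(frame, key)
assert vernam(cipher, key) == frame  # XOR twice restores the plaintext
```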

19 pages, 903 KiB  
Article
Deep-Learning Framework for Efficient Real-Time Speech Enhancement and Dereverberation
by Tomer Rosenbaum, Emil Winebrand, Omer Cohen and Israel Cohen
Sensors 2025, 25(3), 630; https://doi.org/10.3390/s25030630 - 22 Jan 2025
Viewed by 3290
Abstract
Deep learning has revolutionized speech enhancement, enabling impressive high-quality noise reduction and dereverberation. However, state-of-the-art methods often demand substantial computational resources, hindering their deployment on edge devices and in real-time applications. Computationally efficient approaches like deep filtering and Deep Filter Net offer an attractive alternative by predicting linear filters instead of directly estimating the clean speech. While Deep Filter Net excels in noise reduction, its dereverberation performance remains limited. In this paper, we present a generalized framework for computationally efficient speech enhancement and, based on this framework, identify an inherent constraint within Deep Filter Net that hinders its dereverberation capabilities. We propose an extension to the Deep Filter Net framework designed to overcome this limitation, demonstrating significant improvements in dereverberation performance while maintaining competitive noise-reduction quality. Our experimental results highlight the potential of this enhanced framework for real-time speech enhancement on resource-constrained devices. Full article
(This article belongs to the Section Physical Sensors)
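
Deep filtering replaces a per-bin mask with a short complex filter applied across past STFT frames. A minimal sketch of the application step only; in practice the taps come from the network, and the shapes and tap count here are illustrative:

```python
import numpy as np

def apply_deep_filter(noisy_stft: np.ndarray, taps: np.ndarray) -> np.ndarray:
    """noisy_stft: (F, T) complex spectrogram; taps: (K, F, T) complex filters."""
    K, F, T = taps.shape
    padded = np.pad(noisy_stft, ((0, 0), (K - 1, 0)))  # causal padding in time
    out = np.zeros_like(noisy_stft)
    for k in range(K):
        # Tap k multiplies the frame k steps in the past.
        out += taps[k] * padded[:, K - 1 - k : K - 1 - k + T]
    return out

F, T, K = 257, 100, 5
noisy = np.random.randn(F, T) + 1j * np.random.randn(F, T)
taps = np.zeros((K, F, T), dtype=complex)
taps[0] = 1.0                                          # identity filter
assert np.allclose(apply_deep_filter(noisy, taps), noisy)
```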

13 pages, 14573 KiB  
Article
A Feature Integration Network for Multi-Channel Speech Enhancement
by Xiao Zeng, Xue Zhang and Mingjiang Wang
Sensors 2024, 24(22), 7344; https://doi.org/10.3390/s24227344 - 18 Nov 2024
Viewed by 1298
Abstract
Multi-channel speech enhancement has become an active area of research, demonstrating excellent performance in recovering desired speech signals from noisy environments. Recent approaches have increasingly focused on leveraging spectral information from multi-channel inputs, yielding promising results. In this study, we propose a novel feature integration network that not only captures spectral information but also refines it through shifted-window-based self-attention, enhancing the quality and precision of the feature extraction. Our network consists of blocks containing a full- and sub-band LSTM module for capturing spectral information, and a global–local attention fusion module for refining this information. The full- and sub-band LSTM module integrates both full-band and sub-band information through two LSTM layers, while the global–local attention fusion module learns global and local attention in a dual-branch architecture. To further enhance the feature integration, we fuse the outputs of these branches using a spatial attention module. The model is trained to predict the complex ratio mask (CRM), thereby improving the quality of the enhanced signal. We conducted an ablation study to assess the contribution of each module, with each showing a significant impact on performance. Additionally, our model was trained on the SPA-DNS dataset using a circular microphone array and the Libri-wham dataset with a linear microphone array, achieving competitive results compared to state-of-the-art models. Full article
(This article belongs to the Section Sensor Networks)
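
A minimal sketch of the complex ratio mask (CRM) the model is trained to predict: the complex ratio of clean to noisy STFT, applied by complex multiplication (the compression usually used to bound the training target is omitted):

```python
import numpy as np

def ideal_crm(clean, noisy, eps=1e-8):
    return clean / (noisy + eps)   # complex division per TF bin

def apply_crm(noisy, crm):
    return crm * noisy             # complex multiplication per TF bin

F, T = 257, 100
clean = np.random.randn(F, T) + 1j * np.random.randn(F, T)
noise = 0.3 * (np.random.randn(F, T) + 1j * np.random.randn(F, T))
noisy = clean + noise
mask = ideal_crm(clean, noisy)
assert np.allclose(apply_crm(noisy, mask), clean, atol=1e-4)
```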

24 pages, 1052 KiB  
Article
German Noun Plurals in Simultaneous Bilingual vs. Successive Bilingual vs. Monolingual Kindergarten Children: The Role of Linguistic and Extralinguistic Variables
by Katharina Korecky-Kröll, Marina Camber, Kumru Uzunkaya-Sharma and Wolfgang U. Dressler
Languages 2024, 9(9), 306; https://doi.org/10.3390/languages9090306 - 23 Sep 2024
Viewed by 1178
Abstract
(1) Background: The complex phenomenon of German noun plural inflection is investigated in three groups of German-speaking kindergarten children: (a) monolinguals (1L1), (b) simultaneous bilinguals (2L1) also acquiring Croatian, and (c) successive bilinguals (L2) acquiring Turkish as L1. Predictions of the usage-based schema model and of Natural Morphology concerning different linguistic variables are used to explore their impact on plural acquisition in the three groups of children. (2) Methods: A longitudinal study (from mean age 3;1 to 4;8) is conducted using two procedures (a formal plural test and spontaneous recordings in kindergarten), and the data are analyzed using generalized linear (mixed-effects) regression models in R. (3) Results: All children produce more errors in the metalinguistically challenging test compared to spontaneous speech, with L2 children being particularly disadvantaged. Socioeconomic status (henceforth SES) and teachers’ plural type frequency are most relevant for 1L1 children, and kindergarten exposure is more relevant for L2 children, while the linguistic variables are more important for 2L1 children. (4) Conclusions: The main predictions of the schema model and of Natural Morphology are largely confirmed. All of the linguistic variables investigated show significant effects in some analyses, but morphotactic transparency turns out to be the most relevant variable for all three groups of children. Full article

18 pages, 1124 KiB  
Data Descriptor
SparrKULee: A Speech-Evoked Auditory Response Repository from KU Leuven, Containing the EEG of 85 Participants
by Bernd Accou, Lies Bollens, Marlies Gillis, Wendy Verheijen, Hugo Van hamme and Tom Francart
Data 2024, 9(8), 94; https://doi.org/10.3390/data9080094 - 26 Jul 2024
Cited by 6 | Viewed by 2182
Abstract
Researchers investigating the neural mechanisms underlying speech perception often employ electroencephalography (EEG) to record brain activity while participants listen to spoken language. The high temporal resolution of EEG enables the study of neural responses to fast and dynamic speech signals. Previous studies have successfully extracted speech characteristics from EEG data and, conversely, predicted EEG activity from speech features. Machine learning techniques are generally employed to construct encoding and decoding models, which necessitate a substantial quantity of data. We present SparrKULee, a Speech-evoked Auditory Repository of EEG data, measured at KU Leuven, comprising 64-channel EEG recordings from 85 young individuals with normal hearing, each of whom listened to 90–150 min of natural speech. This dataset is more extensive than any currently available dataset in terms of both the number of participants and the quantity of data per participant. It is suitable for training larger machine learning models. We evaluate the dataset using linear and state-of-the-art non-linear models in a speech encoding/decoding and match/mismatch paradigm, providing benchmark scores for future research. Full article
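
A minimal sketch of the kind of linear backward model used to benchmark such datasets: ridge regression from time-lagged EEG to the speech envelope, with match/mismatch decided by correlation. All shapes and signals below are synthetic stand-ins, not SparrKULee data:

```python
import numpy as np
from sklearn.linear_model import Ridge

def lag_eeg(eeg: np.ndarray, n_lags: int) -> np.ndarray:
    """eeg: (T, C) -> (T, C * n_lags); wrap-around at the edges is ignored."""
    cols = [np.roll(eeg, k, axis=0) for k in range(n_lags)]
    return np.concatenate(cols, axis=1)

T, C, n_lags = 5000, 64, 32            # e.g., 64-channel EEG
eeg = np.random.randn(T, C)
envelope = np.random.randn(T)          # stand-in for the speech envelope

X = lag_eeg(eeg, n_lags)
decoder = Ridge(alpha=1.0).fit(X[:4000], envelope[:4000])
rec = decoder.predict(X[4000:])
r = np.corrcoef(rec, envelope[4000:])[0, 1]
print(f"reconstruction correlation: {r:.3f}")  # "match" if r beats the mismatched segment
```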

18 pages, 2000 KiB  
Article
Accuracy and Consistency of Confidence Limits for Monosyllable Identification Scores Derived Using Simulation, the Harrell–Davis Estimator, and Nonlinear Quantile Regression
by Vijaya Kumar Narne, Dhanya Mohan, Sruthi Das Avileri, Saransh Jain, Sunil Kumar Ravi, Krishna Yerraguntla, Abdulaziz Almudhi and Brian C. J. Moore
Diagnostics 2024, 14(13), 1397; https://doi.org/10.3390/diagnostics14131397 - 30 Jun 2024
Viewed by 974
Abstract
Background: Audiological diagnosis and rehabilitation often involve the assessment of whether the maximum speech identification score (PBmax) is poorer than expected from the pure-tone average (PTA) threshold. This requires the estimation of the lower boundary of the PBmax values expected for a given PTA (one-tailed 95% confidence limit, CL). This study compares the accuracy and consistency of three methods for estimating the 95% CL. Method: The 95% CL values were estimated using a simulation method, the Harrell–Davis (HD) estimator, and non-linear quantile regression (nQR); the latter two are both distribution-free methods. The first two methods require the formation of sub-groups with different PTAs. Accuracy and consistency in the estimation of the 95% CL were assessed by applying each method to many random samples of 50% of the available data and using the fitted parameters to predict the data for the remaining 50%. Study sample: A total of 642 participants aged 17 to 84 years with sensorineural hearing loss were recruited from audiology clinics. Pure-tone audiograms were obtained and PBmax scores were measured using monosyllables at 40 dB above the speech recognition threshold or at the most comfortable level. Results: For the simulation method, 6.7 to 8.2% of the PBmax values fell below the 95% CL for both ears, exceeding the target value of 5%. For the HD and nQR methods, the PBmax values fell below the estimated 95% CL for approximately 5% of the ears, indicating good accuracy. Consistency, estimated from the standard deviation of the deviations from the target value of 5%, was similar for all the methods. Conclusions: The nQR method is recommended because it has good accuracy and consistency, and it does not require the formation of arbitrary PTA sub-groups. Full article
(This article belongs to the Special Issue Hearing Loss: From Diagnosis to Pathology)
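
The Harrell–Davis estimator is available off the shelf; a minimal sketch of estimating the one-tailed 95% CL (the 5th percentile) of PBmax within one hypothetical PTA sub-group:

```python
import numpy as np
from scipy.stats.mstats import hdquantiles

rng = np.random.default_rng(0)
pbmax = rng.normal(78, 10, size=60).clip(0, 100)  # made-up PBmax scores (%)

lower_cl = hdquantiles(pbmax, prob=[0.05])[0]     # Harrell-Davis 5th percentile
print(f"HD lower 95% CL: {lower_cl:.1f}%")
```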

20 pages, 1252 KiB  
Article
Imperceptible and Reversible Acoustic Watermarking Based on Modified Integer Discrete Cosine Transform Coefficient Expansion
by Xuping Huang and Akinori Ito
Appl. Sci. 2024, 14(7), 2757; https://doi.org/10.3390/app14072757 - 25 Mar 2024
Cited by 2 | Viewed by 1221
Abstract
This paper aims to explore an alternative reversible digital watermarking solution to guarantee the integrity of, and detect tampering with, data of probative importance. Since the payload for verification is embedded in the contents, the algorithms for reversible embedding and extraction, imperceptibility, payload capacity, and computational time are the issues to evaluate. Thus, we propose a reversible and imperceptible audio information-hiding algorithm based on modified integer discrete cosine transform (intDCT) coefficient expansion. In this work, the original signal is segmented into fixed-length frames, and intDCT is applied to each frame to transform the signal from the time domain into integer DCT coefficients. Expansion is applied to the higher-frequency DCT coefficients to reserve hiding capacity. Objective evaluation of speech quality is conducted using the mean opinion score for listening quality objective (MOS-LQO) and the segmental signal-to-noise ratio (segSNR). The audio quality of different frame lengths and capacities is evaluated. Averages of 4.41 for MOS-LQO and 23.314 dB for segSNR over 112 ITU-T test signals were obtained at a capacity of 8000 bps, which confirms the imperceptibility and sufficient capacity of the proposed method. The method shows audio quality comparable to conventional work based on Linear Predictive Coding (LPC) in terms of MOS-LQO, while all segSNR scores of the proposed method are comparable or better in the time domain. Additionally, comparing histograms of the normalized maximum absolute value of the stego data shows a lower possibility of overflow than with the LPC method. The computational cost, including hiding and transformation, averages 4.884 s for a 10 s audio clip. Blind tampering detection without the original data is achieved by the proposed embedding and extraction method. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
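
The reversible step can be illustrated generically. A minimal sketch of coefficient expansion on integer coefficients (c becomes 2c + b, so both the bit and the original value are recovered exactly); the paper's modified integer DCT, which guarantees integer coefficients, is not reproduced here:

```python
import numpy as np

def embed(coeffs: np.ndarray, bits: np.ndarray, start: int) -> np.ndarray:
    out = coeffs.copy()
    sel = slice(start, start + len(bits))
    out[sel] = 2 * coeffs[sel] + bits       # expansion: c -> 2c + b
    return out

def extract(coeffs: np.ndarray, n_bits: int, start: int):
    sel = slice(start, start + n_bits)
    bits = coeffs[sel] & 1                  # hidden payload
    restored = coeffs.copy()
    restored[sel] = coeffs[sel] >> 1        # exact inverse of the expansion
    return bits, restored

c = np.array([40, -12, 7, 3, -2, 1, 0, 0])  # integer DCT-like coefficients
payload = np.array([1, 0, 1, 1])
stego = embed(c, payload, start=4)          # expand higher-frequency coeffs
bits, restored = extract(stego, len(payload), start=4)
assert np.array_equal(bits, payload) and np.array_equal(restored, c)
```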

15 pages, 2722 KiB  
Article
Optimizing Speech Emotion Recognition with Deep Learning and Grey Wolf Optimization: A Multi-Dataset Approach
by Suryakant Tyagi and Sándor Szénási
Algorithms 2024, 17(3), 90; https://doi.org/10.3390/a17030090 - 20 Feb 2024
Cited by 5 | Viewed by 2799
Abstract
Machine learning and speech emotion recognition are rapidly evolving fields, significantly impacting human-centered computing. Machine learning enables computers to learn from data and make predictions, while speech emotion recognition allows computers to identify and understand human emotions from speech. These technologies contribute to the creation of innovative human–computer interaction (HCI) applications. Deep learning algorithms, capable of learning high-level features directly from raw data, have given rise to new emotion recognition approaches employing models trained on advanced speech representations like spectrograms and time–frequency representations. This study introduces CNN and LSTM models with GWO optimization, aiming to determine optimal parameters for achieving enhanced accuracy within a specified parameter set. The proposed CNN and LSTM models with GWO optimization underwent performance testing on four diverse datasets—RAVDESS, SAVEE, TESS, and EMODB. The results indicated superior performance of the models compared to linear and kernelized SVM, with or without GWO optimizers. Full article
(This article belongs to the Special Issue Bio-Inspired Algorithms)
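
A minimal sketch of the standard GWO update used for such hyperparameter searches: each wolf moves toward the three best solutions (alpha, beta, delta) with a linearly decaying exploration parameter. A toy function stands in for the CNN/LSTM validation loss:

```python
import numpy as np

def gwo(objective, dim, n_wolves=20, n_iters=100, lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(lo, hi, (n_wolves, dim))
    for it in range(n_iters):
        fitness = np.array([objective(w) for w in wolves])
        alpha, beta, delta = wolves[np.argsort(fitness)[:3]]
        a = 2.0 * (1 - it / n_iters)          # decays linearly from 2 to 0
        new = np.zeros_like(wolves)
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random((2, n_wolves, dim))
            A, C = 2 * a * r1 - a, 2 * r2
            D = np.abs(C * leader - wolves)   # distance to this leader
            new += (leader - A * D) / 3.0     # average of the three pulls
        wolves = new.clip(lo, hi)
    best = min(wolves, key=lambda w: objective(w))
    return best, objective(best)

best, val = gwo(lambda x: float(np.sum(x ** 2)), dim=4)  # toy sphere function
print("best:", best, "value:", val)
```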

10 pages, 1450 KiB  
Article
Outcome Prediction of Speech Perception in Quiet and in Noise for Cochlear Implant Candidates Based on Pre-Operative Measures
by Tobias Weissgerber, Marcel Löschner, Timo Stöver and Uwe Baumann
J. Clin. Med. 2024, 13(4), 994; https://doi.org/10.3390/jcm13040994 - 8 Feb 2024
Cited by 5 | Viewed by 1489
Abstract
(1) Background: The fitting of cochlear implants (CI) is an established treatment, even in cases with considerable residual hearing but insufficient speech perception. The aim of this study was to evaluate a prediction model for speech in quiet and to provide reference data and a predictive model for postoperative speech perception in noise (SPiN) after CI provision. (2) Methods: CI candidates with substantial residual hearing (either in hearing threshold or in word recognition scores) were included in a retrospective analysis (n = 87). Speech perception scores in quiet 12 months post-surgery were compared with the predicted scores. A generalized linear model was fitted to speech reception thresholds (SRTs) after CI fitting to identify predictive variables for SPiN. (3) Results: About two-thirds of the recipients achieved the expected outcome in quiet or were better than expected. The mean absolute error of the prediction was 13.5 percentage points. Age at implantation was the only predictive factor for SPiN showing a significant correlation (r = 0.354; p = 0.007). (4) Conclusions: Outcome prediction accuracy for speech in quiet was comparable to previous studies. For CI recipients in the included study population, the SPiN outcome could be predicted only based on the factor age. Full article
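
A minimal sketch of the kind of model fitted here, regressing post-operative SRT on pre-operative candidate variables with statsmodels. Column names and values are hypothetical; the paper found age at implantation to be the only significant SPiN predictor:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "srt_db": [-2.1, 1.4, 0.3, 3.8, -0.5, 2.9],  # post-op SRT in dB SNR
    "age":    [48, 71, 55, 79, 52, 68],          # age at implantation
    "pta_db": [85, 92, 88, 97, 83, 95],          # pre-op pure-tone average
})
fit = smf.glm("srt_db ~ age + pta_db", data=df).fit()  # Gaussian GLM
print(fit.summary())
```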

15 pages, 473 KiB  
Article
Crossband Filtering for Weighted Prediction Error-Based Speech Dereverberation
by Tomer Rosenbaum, Israel Cohen and Emil Winebrand
Appl. Sci. 2023, 13(17), 9537; https://doi.org/10.3390/app13179537 - 23 Aug 2023
Cited by 2 | Viewed by 1515
Abstract
Weighted prediction error (WPE) is a linear prediction-based method extensively used to predict and attenuate the late reverberation component of an observed speech signal. This paper introduces an extended version of the WPE method to enhance the modeling accuracy in the time–frequency domain by incorporating crossband filters. Two approaches to extending the WPE while considering crossband filters are proposed and investigated. The first approach improves the model’s accuracy. However, it increases the computational complexity, while the second approach maintains the same computational complexity as the conventional WPE while still achieving improved accuracy and comparable performance to the first approach. To validate the effectiveness of the proposed methods, extensive simulations are conducted. The experimental results demonstrate that both methods outperform the conventional WPE regarding dereverberation performance. These findings highlight the potential of incorporating crossband filters in improving the accuracy and efficacy of the WPE method for dereverberation tasks. Full article
(This article belongs to the Special Issue Automatic Speech Signal Processing)
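
For orientation, a minimal sketch of conventional single-channel WPE for one frequency bin; the crossband extension proposed in the paper adds neighboring-bin terms and is not shown:

```python
import numpy as np

def wpe_bin(Y, D=3, K=10, iters=3, eps=1e-6):
    """Y: complex STFT frames of one bin; D: prediction delay; K: filter taps."""
    T = len(Y)
    X = Y.copy()
    ybar = np.zeros((T, K), dtype=complex)  # ybar[t] = [Y[t-D], ..., Y[t-D-K+1]]
    for k in range(K):
        shift = D + k
        ybar[shift:, k] = Y[:T - shift]
    for _ in range(iters):
        lam = np.maximum(np.abs(X) ** 2, eps)        # per-frame power estimate
        R = (ybar.conj().T / lam) @ ybar             # weighted covariance
        r = (ybar.conj().T / lam) @ Y                # weighted correlation
        g = np.linalg.solve(R + eps * np.eye(K), r)  # delayed prediction filter
        X = Y - ybar @ g                             # subtract late reverberation
    return X

Y = np.random.randn(200) + 1j * np.random.randn(200)  # toy STFT bin
print(np.abs(wpe_bin(Y)).mean())
```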

17 pages, 1125 KiB  
Article
Investigations on the Optimal Estimation of Speech Envelopes for the Two-Stage Speech Enhancement
by Yanjue Song and Nilesh Madhu
Sensors 2023, 23(14), 6438; https://doi.org/10.3390/s23146438 - 16 Jul 2023
Cited by 2 | Viewed by 1704
Abstract
Using the source-filter model of speech production, clean speech signals can be decomposed into an excitation component and an envelope component that is related to the phoneme being uttered. Therefore, restoring the envelope of degraded speech during speech enhancement can improve the intelligibility and quality of output. As the number of phonemes in spoken speech is limited, they can be adequately represented by a correspondingly limited number of envelopes. This can be exploited to improve the estimation of speech envelopes from a degraded signal in a data-driven manner. The improved envelopes are then used in a second stage to refine the final speech estimate. Envelopes are typically derived from the linear prediction coefficients (LPCs) or from the cepstral coefficients (CCs). The improved envelope is obtained either by mapping the degraded envelope onto pre-trained codebooks (classification approach) or by directly estimating it from the degraded envelope (regression approach). In this work, we first investigate the optimal features for envelope representation and codebook generation by a series of oracle tests. We demonstrate that CCs provide better envelope representation compared to using the LPCs. Further, we demonstrate that a unified speech codebook is advantageous compared to the typical codebook that manually splits speech and silence as separate entries. Next, we investigate low-complexity neural network architectures to map degraded envelopes to the optimal codebook entry in practical systems. We confirm that simple recurrent neural networks yield good performance with a low complexity and number of parameters. We also demonstrate that with a careful choice of the feature and architecture, a regression approach can further improve the performance at a lower computational cost. However, as also seen from the oracle tests, the benefit of the two-stage framework is now chiefly limited by the statistical noise floor estimate, leading to only a limited improvement in extremely adverse conditions. This highlights the need for further research on joint estimation of speech and noise for optimum enhancement. Full article
(This article belongs to the Special Issue Machine Learning and Signal Processing Based Acoustic Sensors)
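
A minimal sketch of the two envelope representations compared here: an LPC envelope (all-pole spectrum) and a cepstrally smoothed envelope (low-quefrency liftering). The example file, orders, and frame size are illustrative:

```python
import numpy as np
import librosa
from scipy.signal import freqz

frame = librosa.load(librosa.ex("libri1"), sr=16000, duration=0.032)[0][:512]

# LPC envelope: fit an all-pole model and evaluate its magnitude response.
a = librosa.lpc(frame, order=16)
_, h = freqz(1.0, a, worN=257)
lpc_env_db = 20 * np.log10(np.abs(h) + 1e-12)

# Cepstral envelope: keep only the low-quefrency cepstral coefficients.
spec = np.abs(np.fft.rfft(frame, 512)) + 1e-12
ceps = np.fft.irfft(np.log(spec), 512)
ceps[30:-30] = 0.0                         # lifter: drop high quefrencies
cep_env_db = 20 / np.log(10) * np.fft.rfft(ceps, 512).real

print(lpc_env_db.shape, cep_env_db.shape)  # both sampled on 257 frequency bins
```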

14 pages, 2488 KiB  
Review
Review of Advances in Speech Processing with Focus on Artificial Neural Networks
by Douglas O’Shaughnessy
Electronics 2023, 12(13), 2887; https://doi.org/10.3390/electronics12132887 - 30 Jun 2023
Viewed by 2347
Abstract
Speech is the primary means by which most humans communicate. Computers facilitate this transfer of information, especially when people interact with databases. While some methods to manipulate and interpret speech date back many decades (e.g., Fourier analysis), other processing techniques were developed late in the last century (e.g., linear predictive coding and hidden Markov models). Nonetheless, the last 25 years have seen major advances leading to the wide acceptance of computer-based speech processing, e.g., cellular telephones and real-time online conversations. This paper reviews older techniques and recent methods that focus largely on artificial neural networks. The major highlights in speech research are examined, without delving into mathematical detail, while giving insight into the research choices that have been made. The focus of this work is to understand how and why the discussed methods function well. Full article
(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

19 pages, 988 KiB  
Article
Speech Rate and Turn-Transition Pause Duration in Dutch and English Spontaneous Question-Answer Sequences
by Damar Hoogland, Laurence White and Sarah Knight
Languages 2023, 8(2), 115; https://doi.org/10.3390/languages8020115 - 22 Apr 2023
Cited by 1 | Viewed by 2937
Abstract
The duration of inter-speaker pauses is a pragmatically salient aspect of conversation that is affected by linguistic and non-linguistic context. Theories of conversational turn-taking imply that, due to listener entrainment to the flow of syllables, a higher speech rate will be associated with shorter turn-transition times (TTT). Previous studies have found conflicting evidence, however, some of which may be due to methodological differences. In order to test the relationship between speech rate and TTT, and how this may be modulated by other dialogue factors, we used question-answer sequences from spontaneous conversational corpora in Dutch and English. As utterance-final lengthening is a local cue to turn endings, we also examined the impact of utterance-final syllable rhyme duration on TTT. Using mixed-effect linear regression models, we observed evidence for a positive relationship between speech rate and TTT: thus, a higher speech rate is associated with longer TTT, contrary to most theoretical predictions. Moreover, for answers following a pause (“gaps”) there was a marginal interaction between speech rate and final rhyme duration, such that relatively long final rhymes are associated with shorter TTT when foregoing speech rate is high. We also found evidence that polar (yes/no) questions are responded to with shorter TTT than open questions, and that direct answers have shorter TTT than responses that do not directly answer the questions. Moreover, the effect of speech rate on TTT was modulated by question type. We found no predictors of the (negative) TTT for answers that overlap with the foregoing questions. Overall, these observations suggest that TTT is governed by multiple dialogue factors, potentially including the salience of utterance-final timing cues. Contrary to some theoretical accounts, there is no strong evidence that higher speech rates are consistently associated with shorter TTT. Full article
(This article belongs to the Special Issue Pauses in Speech)
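
A minimal sketch of the kind of mixed-effects linear regression used, with TTT modeled on speech rate and question type and random intercepts per speaker. Column names and data are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "ttt_ms":      [220, 480, 150, 390, 310, 260, 540, 180, 330, 410, 200, 290],
    "speech_rate": [5.1, 3.8, 6.0, 4.2, 4.9, 5.5, 3.5, 5.8, 4.6, 4.0, 5.9, 5.0],
    "polar":       [1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1],  # polar vs. open question
    "speaker":     ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"],
})
m = smf.mixedlm("ttt_ms ~ speech_rate + polar", df, groups=df["speaker"]).fit()
print(m.summary())
```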

31 pages, 2550 KiB  
Review
A Narrative Review of Speech and EEG Features for Schizophrenia Detection: Progress and Challenges
by Felipe Lage Teixeira, Miguel Rocha e Costa, José Pio Abreu, Manuel Cabral, Salviano Pinto Soares and João Paulo Teixeira
Bioengineering 2023, 10(4), 493; https://doi.org/10.3390/bioengineering10040493 - 20 Apr 2023
Cited by 15 | Viewed by 4062
Abstract
Schizophrenia is a mental illness that affects an estimated 21 million people worldwide. The literature establishes that electroencephalography (EEG) is a well-implemented means of studying and diagnosing mental disorders. However, it is known that speech and language provide unique and essential information about human thought. Semantic and emotional content, semantic coherence, syntactic structure, and complexity can thus be combined in a machine learning process to detect schizophrenia. Several studies show that early identification is crucial to prevent the onset of illness or mitigate possible complications. Therefore, it is necessary to identify disease-specific biomarkers for an early diagnosis support system. This work contributes to improving our knowledge about schizophrenia and the features that can identify this mental illness via speech and EEG. The emotional state is a specific characteristic of schizophrenia that can be identified with speech emotion analysis. The most used speech features found in the literature review are fundamental frequency (F0), intensity/loudness (I), frequency formants (F1, F2, and F3), Mel-frequency cepstral coefficients (MFCCs), the duration of pauses and sentences (SD), and the duration of silence between words. Combining at least two feature categories achieved high accuracy in schizophrenia classification. Prosodic and spectral or temporal features achieved the highest accuracy. The work with the highest accuracy used the prosodic and spectral features QEVA, SDVV, and SSDL, which were derived from the F0 and the spectrogram. The emotional state can be identified with most of the features previously mentioned (F0, I, F1, F2, F3, MFCCs, and SD), linear prediction cepstral coefficients (LPCC), line spectral frequencies (LSF), and the pause rate. Among event-related potential (ERP) features, the most promising found in the literature are mismatch negativity (MMN), P2, P3, P50, N1, and N2. The EEG features achieving the highest accuracy in schizophrenia classification are nonlinear features such as Cx, HFD, and Lya. Full article
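
A minimal sketch of extracting several of the reviewed speech features (F0, intensity, MFCCs) with librosa; formants, pause durations, and the derived QEVA/SDVV/SSDL features are not shown:

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("libri1"), sr=16000, duration=5.0)

f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # F0 track (NaN when unvoiced)
rms = librosa.feature.rms(y=y)[0]                          # frame-wise intensity proxy
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # MFCCs

print(f"median F0: {np.nanmedian(f0):.1f} Hz")
print(f"mean RMS: {rms.mean():.4f}, MFCC shape: {mfcc.shape}")
```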