Spectral Bandwidth Effects on Emotion Classification and Representation in Spoken and Sung Signals

Garlitz, Rylen; Shamsi, Allen; Wayland, Ratree

doi:10.3390/signals7030050

by Rylen Garlitz, Allen Shamsi and Ratree Wayland *

Department of Linguistics, University of Florida, Gainesville, FL 32611-5454, USA

* Author to whom correspondence should be addressed.

Signals 2026, 7(3), 50; https://doi.org/10.3390/signals7030050

Submission received: 31 March 2026 / Revised: 23 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026

Abstract

Speech emotion recognition systems are typically trained on audio sampled at conventional bandwidths that exclude frequencies above approximately 8 kHz, yet the contribution of extended high-frequency information to vocal emotion recognition remains unclear. This study examines how spectral bandwidth influences automatic emotion classification using the RAVDESS corpus of acted speech and song. Recordings were low-pass filtered to simulate multiple bandwidth conditions (8, 12, and 16 kHz, along with the original full-bandwidth signal), and classification was performed using a Random Forest model trained on mel-spectral features. In addition to classification accuracy, we analyzed permutation-based spectral feature importance and the geometry of the classifier’s posterior-probability space. Bandwidth restriction had relatively modest effects on classification accuracy overall, with mean accuracy ranging from approximately 55% to 77% across conditions, although its impact was greater for speech than for song. Feature-importance analyses indicated that the model depends primarily on low- and mid-frequency spectral information, whereas higher-frequency and extended high-frequency (EHF) regions show increased importance when available. Geometry analyses showed no reliable evidence that bandwidth altered the global structure of the stimulus-level emotion space, although spectral truncation reduced separability for certain emotion contrasts, particularly in speech at normal emotional intensity. These results indicate that most acoustic information supporting categorical emotion recognition resides in lower spectral regions, while EHF content provides supplementary cues that may refine some emotional distinctions under specific conditions.

Keywords: speech emotion recognition; spectral bandwidth; extended high-frequency speech; acoustic features; machine learning; emotion-space geometry

1. Introduction

Automatic speech emotion recognition (SER) involves classifying emotion-related categories from acoustic signals and is widely used in affective computing and speech technology. Most current SER systems rely on time–frequency representations such as mel-spectrograms, which capture broad patterns of spectral energy associated with prosodic and voice-quality cues. In practice, however, these systems are typically trained on audio sampled at 16 kHz, limiting the usable spectral range to approximately 0–8 kHz. This effectively assumes that frequency content above 8 kHz contributes little to emotion recognition.

This assumption is rarely tested directly. Emotional vocalizations, particularly high-arousal or strong-intensity expressions, often involve increased vocal effort, turbulent airflow, and changes in voice quality that can redistribute energy toward higher frequencies [1,2]. In addition, emotional expression in song differs from speech in ways that may affect spectral structure, including more sustained phonation and more stable harmonic organization [3,4,5]. These differences raise the possibility that extended high-frequency (EHF) information may contribute to emotion classification, particularly in conditions such as strong-intensity expressions or sung vocalizations, where increased energy and more stable harmonic structure may enhance the availability of high-frequency cues.

Despite this, relatively little work has systematically examined the role of spectral bandwidth in SER while holding other aspects of the signal constant [6]. In particular, it remains unclear whether access to higher-frequency information improves classification performance, whether its contribution differs between speech and song, and whether it depends on emotional intensity.

The present study addresses these questions using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS [7]), which provides high-resolution recordings of emotional speech and song produced under controlled conditions. Because the recordings are stored at 48 kHz, the available spectral range can be manipulated directly while preserving all other properties of the signal.

In this study, we systematically evaluate the effect of spectral bandwidth by comparing full-band signals with low-pass filtered versions at 8, 12, and 16 kHz. In addition to classification accuracy, we examine how bandwidth influences model reliance on spectral cues and the structure of emotion representations derived from classifier outputs. This design allows us to test whether extended high-frequency information contributes meaningfully to emotion classification and how any such contribution depends on modality and emotional intensity.

1.1. Conceptual Framing: What Is Being Classified

A recurring issue in speech emotion research concerns what is meant by the term emotion and, more specifically, what is being classified in emotion recognition tasks. In affective science, the term is used to refer to several related but distinct constructs, including biologically grounded response systems, consciously experienced feelings, and the conceptual categories used to describe and interpret affective states [8,9]. These distinctions are not merely terminological—they reflect different levels of analysis that do not necessarily correspond directly to one another, nor map cleanly onto the discrete category labels typically used in emotion recognition tasks.

In practice, however, most speech emotion datasets do not attempt to measure internal emotional states directly. Instead, they consist of vocal expressions designed to be recognizable as instances of named emotion categories. The RAVDESS, for example, includes acted (as opposed to naturally occurring) emotional expressions produced by trained speakers and validated through listener judgments [7]. The labels associated with these recordings, therefore, reflect intended and perceptually recognized emotional expressions rather than independently verified psychological or physiological states.

This distinction is important for interpreting the present study. The classification models evaluated here operate on labeled acoustic signals and are trained to distinguish among categories defined at the level of vocal expression. Accordingly, the results should be understood as reflecting the discriminability of perceptually defined emotion categories under different acoustic conditions, rather than as evidence about the structure of emotion as a biological or psychological system.

At the same time, emotional expression in speech varies continuously along dimensions such as intensity and arousal, and these differences are reflected in systematic variation in acoustic cues. The inclusion of normal- and strong-intensity emotional expressions in the RAVDESS corpus provides a controlled way to examine how variation in emotional intensity interacts with spectral bandwidth. Thus, although the present analyses are conducted using discrete emotion categories, they remain sensitive to graded differences in vocal expression—captured here through systematic variation in intensity—that may influence classification performance.

1.2. Related Work: Bandwidth Assumptions in SER

Most speech emotion recognition (SER) systems rely on acoustic features such as mel-frequency cepstral coefficients (MFCCs), spectrogram-based representations, and related spectral descriptors [6,10]. In practice, these features are typically computed from audio sampled at 16 kHz, limiting the available spectral range to approximately 0–8 kHz. As a result, higher-frequency components of the signal are usually not represented in standard SER pipelines. However, this emphasis reflects methodological convention more than direct empirical evaluation, and controlled manipulations of spectral bandwidth—where lexical content, speaker identity, and emotion category are preserved—remain relatively rare. Specifically, few studies have systematically examined how access to higher-frequency information affects emotion classification while holding other aspects of the signal constant. Existing SER systems continue to rely primarily on limited spectral range [6,10], leaving the contribution of extended high-frequency (EHF) information largely untested.

The present study addresses this gap by evaluating emotion classification across systematically manipulated bandwidth conditions using high-resolution recordings. By comparing full-band signals with low-pass filtered versions of the same stimuli, we directly test whether access to higher-frequency information influences classification performance, model reliance on spectral cues, and the structure of emotion representations derived from classifier outputs.

1.3. Speech vs. Song and the Role of Intensity

Speech and song differ systematically in phonatory control, pitch organization, harmonic structure, and spectral shaping. Singing typically involves sustained phonation, expanded pitch range, slower temporal patterning, and more stable harmonic energy distribution than speech [3,4]. These differences influence how affective information is acoustically encoded and perceptually interpreted. Emotional portrayals in song may therefore rely on acoustic patterns that differ from spoken portrayals even when lexical content is held constant.

One important distinction concerns the stability of suprasegmental cues. In speech, rapid articulatory transitions and consonantal constrictions frequently interrupt voicing and introduce aperiodicity, which can obscure acoustic cues relevant to emotional expression, including pitch contour, spectral tilt, and intensity modulation [2]. In contrast, singing prolongs vowel nuclei and reduces consonantal masking, producing more continuous harmonic structure and allowing clearer expression of arousal-related acoustic cues [5]. Research in vocal emotion communication and music psychology shows that emotional expression in music often relies on exaggerated acoustic correlates that parallel—while extending beyond—those used in speech prosody [2,5]. These differences have been associated with higher recognition accuracy for sung compared with spoken emotional expressions in behavioral perception studies [11,12].

From a signal-processing perspective, sustained phonation in singing produces greater harmonic continuity and reduced spectral variability relative to speech [3]. Spectral representations such as mel spectrograms and MFCCs may therefore capture more stable emotion-relevant patterns in song than in speech, where coarticulation and consonantal noise introduce greater within-category variability [13,14]. Machine-learning studies using acted emotional corpora, including the RAVDESS, frequently report high classification accuracy and clear acoustic separability across emotional portrayals [7].

Emotional intensity introduces an additional dimension of variation. Intensity is closely linked to vocal effort, subglottal pressure, and physiological arousal, and it influences several acoustic parameters, including amplitude, spectral slope, and overall spectral characteristics [2,15]. High-intensity emotional portrayals typically involve increased vocal fold collision force, greater airflow, and stronger articulatory engagement, all of which are associated with increased turbulent and higher-frequency spectral energy [3,15]. Because extended high-frequency energy is often associated with turbulence, frication, and breath noise in speech production, stronger emotional expressions may amplify these cues relative to normal-intensity productions.

The RAVDESS corpus provides a particularly suitable dataset for examining these relationships. The corpus includes systematically controlled portrayals of emotional intensity (normal vs. strong) across both speech and song modalities while holding lexical content and actor identity constant [7]. This design allows direct comparison of acoustic and classification effects across modality and intensity, providing a controlled framework for testing whether EHF contributions to emotion classification vary as a function of vocal intensity and expressive modality.

Despite growing interest in speech emotion recognition, relatively little work has systematically examined how spectral bandwidth influences not only classification accuracy but also the spectral cue-weighting strategies and representational organization of emotion categories. Prior SER studies have typically relied on conventional sampling rates that exclude extended high-frequency (EHF) information and have rarely manipulated spectral bandwidth while holding lexical content, speaker identity, and emotional category constant. Moreover, previous work has focused primarily on overall classification performance, with comparatively little attention to how bandwidth affects the internal structure of emotion representations derived from classifier outputs.

The present study addresses these gaps using controlled bandwidth manipulations applied to high-resolution recordings from the RAVDESS corpus. In addition to evaluating classification accuracy across bandwidth conditions, we examine how spectral bandwidth influences model reliance on different spectral regions using permutation-based feature importance and investigate how bandwidth restriction affects the geometry of emotion representations in posterior-probability space. To our knowledge, this is among the first studies to combine controlled spectral-bandwidth manipulation with representational emotion-space analysis in speech emotion recognition. By integrating classification, feature-importance, and geometry analyses, the study provides a more comprehensive account of how bandwidth influences both emotion classification performance and the organization of emotion representations.

2. Materials and Methods

The analytic pipeline proceeded in four stages. First, full-resolution audio recordings from the RAVDESS corpus were subjected to controlled bandwidth manipulation, generating low-pass filtered versions at 8, 12, and 16 kHz alongside the unfiltered full-band signals. Second, acoustic representations were extracted from each stimulus using mel-spectrogram-based features summarized over time. Third, emotion classification was conducted using Random Forest models trained and evaluated separately across bandwidth, modality (speech vs. song), and emotional intensity conditions, with cross-validated prediction outputs retained for downstream analyses. Finally, we examined the representational organization of emotion categories by deriving emotion-space geometries from the classifier probability outputs, allowing us to assess how bandwidth restriction influenced both classification accuracy and the similarity structure among emotions.

2.1. Dataset (RAVDESS)

Stimuli were drawn from the audio-only recordings of the RAVDESS. The corpus comprises 7356 audiovisual files derived from 2496 distinct vocalizations produced by 24 professional actors (12 female, 12 male) speaking North American English. Each actor produced emotional expressions in both speech and song modalities, and recordings were exported into three formats (audio–video, video-only, and audio-only). The full database contains 4320 speech recordings and 3036 song recordings; however, song recordings from one female actor were lost due to technical issues during corpus construction, resulting in 132 missing song-related files [7].

Actors produced two semantically neutral statements (“Kids are talking by the door”; “Dogs are sitting by the door”). Speech stimuli span eight emotion categories (neutral, calm, happy, sad, angry, fearful, surprise, disgust), whereas song stimuli include six (neutral, calm, happy, sad, angry, fearful). All emotion categories except neutral were recorded at two intensity levels (normal, strong), and each vocalization was repeated twice.

Because lexical content is held constant across emotional expressions, emotional distinctions in the corpus are conveyed primarily through vocal-acoustic modulation. Acoustic analyses of the speech stimuli in RAVDESS have demonstrated systematic differentiation of emotion categories along multiple prosodic dimensions, including mean fundamental frequency (F0), F0 variability, duration, and vocal intensity [16]. High-arousal emotions such as anger, happiness, and surprise exhibit elevated pitch, greater pitch variability, and increased loudness relative to low-arousal emotions such as sadness, with strong-intensity expressions amplifying these patterns. Prosodic realization is further influenced by talker sex, sentence content, and repetition, confirming that the speech recordings constitute acoustically structured emotional signals suitable for perceptual and computational modeling.

To enable direct comparisons between speech and song, cross-modality analyses in the present study were restricted to the set of emotion categories available in both modalities (neutral, calm, happy, sad, angry, fearful) and to corresponding intensity conditions. Because the classification model outputs were generated repeatedly across cross-validation runs, the counts summarized below reflect cross-validated prediction instances rather than unique stimuli. The number of prediction instances contributing to the classification analyses for each emotion, modality, and intensity condition is summarized in Table 1.

Finally, to avoid conflating dataset category labels with underlying biological emotion states, we interpret RAVDESS emotion labels as conventional feeling-category terms indexing expressed and perceptually recognized affect. This interpretation is consistent with distinctions between emotion states, emotion concepts, and consciously experienced emotional feelings [8].

2.2. Bandwidth Manipulation and Experimental Design

To examine the contribution of extended high-frequency information to emotion classification, controlled spectral bandwidth manipulations were applied to the original full-resolution RAVDESS recordings. The audio files are distributed at a sampling rate of 48 kHz, allowing analysis of spectral information up to approximately 24 kHz.

For each stimulus, low-pass filtered versions were generated with cutoff frequencies at 8, 12, and 16 kHz, in addition to the unfiltered full-band condition. Filtering was implemented in Python (v3.13.12) with the SciPy signal processing library [17]: window-based finite impulse response (FIR) filters were designed with firwin using a 2 kHz transition band and applied with zero-phase forward-backward filtering (filtfilt). This approach avoids phase distortion and preserves temporal structure and lexical content while systematically restricting the available spectral information.
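The filtering step can be illustrated with a minimal sketch. The text names firwin, filtfilt, and the 2 kHz transition band; the Kaiser window, the 60 dB stop-band attenuation, the helper name lowpass_zero_phase, and the two-tone test signal are assumptions introduced here for illustration and may differ from the settings actually used.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, kaiserord

def lowpass_zero_phase(x, fs, cutoff_hz, transition_hz=2000.0, atten_db=60.0):
    """Low-pass filter x with a windowed FIR design applied forward and backward."""
    nyquist = fs / 2.0
    # kaiserord expects the transition width as a fraction of the Nyquist frequency.
    numtaps, beta = kaiserord(atten_db, transition_hz / nyquist)
    numtaps |= 1  # force an odd length (type I linear-phase filter)
    # Assumed window choice: Kaiser. The source only says "window-based".
    taps = firwin(numtaps, cutoff_hz, window=("kaiser", beta), fs=fs)
    # filtfilt runs the filter in both directions, which cancels the phase delay.
    return filtfilt(taps, [1.0], x)

# Toy signal at the corpus sampling rate: one tone below and one above the cutoff.
fs = 48_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 15000 * t)
y = lowpass_zero_phase(x, fs, cutoff_hz=8000)
```

Because the filter is applied twice, the effective magnitude response is the square of the single-pass response, so the stop-band attenuation is roughly doubled in decibels.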

The resulting dataset comprised four bandwidth conditions (8 kHz, 12 kHz, 16 kHz, Full). The experimental design crossed Bandwidth with Modality (speech vs. song) and Emotional Intensity (normal vs. strong), yielding a set of controlled acoustic conditions for evaluating how spectral truncation influences emotion classification performance and the structure of emotion representations derived from model outputs.

2.3. Feature Extraction

For each audio file in each bandwidth condition, fixed-dimensional acoustic representations were extracted using a 64-band log-mel spectrogram framework. Spectrograms were computed using the librosa signal-processing library [18] with an FFT size of 2048, a 25 ms analysis window, and a 10 ms hop length, yielding time–frequency representations sensitive to both harmonic structure and broadband spectral energy.

To obtain utterance-level feature vectors suitable for Random Forest classification, summary statistics were computed over time for each mel frequency band. These included the mean, standard deviation, 10th, 50th, and 90th percentiles, skewness, and kurtosis of spectral energy within each band, yielding a single fixed-length feature vector per stimulus per bandwidth condition.
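As a rough illustration of this summarization step, the sketch below collapses a 64-band log-mel matrix into seven statistics per band, giving 64 × 7 = 448 features. The function name summarize_bands, the statistic-major ordering of the output, and the random stand-in spectrogram are assumptions made for the example; the skewness and kurtosis estimators shown are SciPy defaults and may not match the ones used in the study.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def summarize_bands(log_mel):
    """Collapse a (n_bands, n_frames) log-mel matrix into one feature vector."""
    stats = [
        log_mel.mean(axis=1),
        log_mel.std(axis=1),
        np.percentile(log_mel, 10, axis=1),
        np.percentile(log_mel, 50, axis=1),
        np.percentile(log_mel, 90, axis=1),
        skew(log_mel, axis=1),
        kurtosis(log_mel, axis=1),  # SciPy default: excess (Fisher) kurtosis
    ]
    # Assumed layout: all band means first, then all band SDs, and so on.
    return np.concatenate(stats)

# Random stand-in for one utterance: 64 mel bands by 300 frames.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(64, 300))
features = summarize_bands(log_mel)
print(features.shape)  # (448,)
```

The vector length is independent of utterance duration, which is what allows stimuli of different lengths to share one Random Forest feature space.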

For interpretability analyses, mel bands were grouped into four frequency regions based on their center frequencies: Low (0–4 kHz), Mid (4–8 kHz), High (8–12 kHz), and Extended High Frequency (EHF; 12–24 kHz). These groupings were used to aggregate feature-importance measures derived from the classification models.
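One way to assign bands to these regions is sketched below. It assumes an HTK-style mel formula and a 0 to 24 kHz filterbank purely for illustration; librosa's default mel scale differs, so the exact band counts per region in the study may not match what this sketch produces. The helper names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; not necessarily the scale used in the study).
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def band_regions(n_mels=64, fmin=0.0, fmax=24_000.0):
    # n_mels triangular filters need n_mels + 2 equally spaced mel points;
    # the interior points are the filter center frequencies.
    mel_points = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    centers_hz = mel_to_hz(mel_points[1:-1])
    labels = np.array(["Low", "Mid", "High", "EHF"])
    # Region edges from the text: 4, 8, and 12 kHz.
    idx = np.digitize(centers_hz, [4000.0, 8000.0, 12000.0])
    return centers_hz, labels[idx]

centers_hz, regions = band_regions()
counts = {name: int((regions == name).sum()) for name in ["Low", "Mid", "High", "EHF"]}
```

Because mel spacing is denser at low frequencies, the Low region contains more bands than the others, which is worth keeping in mind when importance values are aggregated by region.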

Because the present analysis relied on temporally summarized spectral representations, fine-grained temporal sequencing information was not modeled directly. This approach emphasizes broad spectral-distributional characteristics while reducing sensitivity to dynamic temporal transitions that may contribute to emotional prosody. Accordingly, the present findings should be interpreted primarily in terms of how bandwidth influences spectral cue availability and representational organization rather than detailed temporal dynamics in emotional expression.

2.4. Classification and Feature Importance

Emotion classification was conducted using Random Forest (RF) models trained separately for each Modality × Intensity × Bandwidth condition. Model training and evaluation were implemented using the scikit-learn machine-learning framework [19]. The classification objective was to predict emotion category membership among the six shared RAVDESS emotions included in the analysis (neutral, calm, happy, sad, angry, fearful).

Random Forest models were selected because they provide stable permutation-based feature-importance estimates and probabilistic outputs that support the representational and geometry analyses central to the present study. The goal of the analysis was not to maximize predictive performance across classifier architectures, but rather to examine how bandwidth manipulation influences spectral cue weighting and emotion-space structure within a controlled and interpretable modeling framework.

Hyperparameter optimization was performed using a two-stage grid search. An initial coarse search explored multiple model configurations, including the number of trees (200, 500, 800), maximum tree depth (None, 10, 20, 40), minimum samples per leaf (1, 2, 4), and feature subsampling strategy (sqrt, log2). A subsequent fine-grained search was conducted within high-performing parameter regions to refine model configuration. Optimal parameters were selected within each condition based on cross-validation performance.
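To make the size of the coarse search concrete, the sketch below enumerates the grid described above. The dictionary keys are the corresponding scikit-learn RandomForestClassifier argument names, which is an assumed mapping from the wording in the text; expand_grid is a hypothetical helper.

```python
from itertools import product

# Coarse grid from the text, keyed by assumed scikit-learn argument names.
coarse_grid = {
    "n_estimators": [200, 500, 800],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

def expand_grid(grid):
    """Return every combination of grid values as a list of keyword dictionaries."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

configs = expand_grid(coarse_grid)
print(len(configs))  # 72
```

The coarse stage therefore covers 3 × 4 × 3 × 2 = 72 configurations per condition. The values explored in the fine-grained stage are not given in the text and are not reproduced here.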

Model performance was assessed using repeated stratified cross-validation (10 folds × 10 repeats) to ensure balanced class representation across training and test partitions. Cross-validated model outputs were retained for downstream statistical analyses. For the accuracy analysis, predictions were aggregated at the level of Run × Stimulus × Condition × Modality × Intensity × Sex × Emotion, so that classification performance could be modeled as binomial success/failure counts while preserving the correspondence among bandwidth-manipulated versions of the same underlying stimulus and avoiding pseudo-replication. Statistical inference on classification accuracy was then conducted using binomial generalized linear mixed-effects models.
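A minimal sketch of the repeated stratified cross-validation loop is given below, retaining held-out class probabilities for later analyses. It treats each repeat as one "run", which is an assumption about how runs were defined; the function name, the synthetic data, and the reduced fold and tree counts in the demonstration are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_probabilities(X, y, n_splits=10, n_repeats=10, seed=0, **rf_kwargs):
    """Return held-out probabilities with shape (n_repeats, n_samples, n_classes)."""
    classes = np.unique(y)
    probs = np.zeros((n_repeats, len(y), len(classes)))
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    for i, (train, test) in enumerate(cv.split(X, y)):
        run = i // n_splits  # folds are yielded repeat by repeat
        model = RandomForestClassifier(random_state=seed + i, **rf_kwargs)
        model.fit(X[train], y[train])
        # Each item is held out exactly once per repeat, so every row gets filled.
        probs[run, test] = model.predict_proba(X[test])
    return classes, probs

# Synthetic stand-in for one condition: 6 classes, 40 items each, 20 features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(6), 40)
X = rng.normal(size=(240, 20))
X[:, :6] += 3.0 * np.eye(6)[y]  # make each class separable on one feature

# Smaller settings than the 10 x 10 scheme in the text, to keep the demo fast.
classes, probs = repeated_cv_probabilities(X, y, n_splits=5, n_repeats=2, n_estimators=50)
accuracy_per_run = (classes[probs.argmax(axis=2)] == y).mean(axis=1)
```

Keeping one probability vector per item per run is what makes it possible to pair the bandwidth-manipulated versions of the same stimulus in the downstream models.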

To quantify model reliance on spectral information, permutation-based feature importance was computed for each mel-derived feature. For inferential analysis, these values were aggregated to one observation per Run × Bandwidth × Modality × Intensity × Frequency-band cell. The four band-specific importance measures (Low, Mid, High, EHF) were reshaped to long format and analyzed using linear mixed-effects models with frequency band, bandwidth condition, modality, and emotional intensity as fixed effects and cross-validation run as a random intercept. Fixed effects were evaluated using Type III Wald χ² tests, and planned follow-up comparisons were conducted using estimated marginal means with Holm-adjusted pairwise contrasts.
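The aggregation from per-feature importance to frequency regions can be sketched as follows, using scikit-learn's permutation_importance. Summing importances within a region, scoring on held-out data with the default accuracy scorer, and the toy eight-feature dataset are all assumptions for illustration; the text does not specify these details.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def region_importance(model, X_test, y_test, feature_region, n_repeats=10, seed=0):
    """Sum mean permutation importance over the features belonging to each region."""
    result = permutation_importance(
        model, X_test, y_test, n_repeats=n_repeats, random_state=seed
    )
    regions = ["Low", "Mid", "High", "EHF"]
    # Assumed aggregation: sum within region (a mean would be an equally plausible choice).
    return {r: float(result.importances_mean[feature_region == r].sum()) for r in regions}

# Toy data: 8 features, two per region, with class information only in "Low".
rng = np.random.default_rng(0)
feature_region = np.repeat(["Low", "Mid", "High", "EHF"], 2)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 8))
X[:, :2] += 2.0 * y[:, None]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:200], y[:200])
importance = region_importance(model, X[200:], y[200:], feature_region)
```

In this toy case the Low region dominates because only its features are informative, and the other regions hover near zero.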

An additional advantage of the present approach is its relatively modest computational complexity compared with large-scale deep-learning architectures commonly used in contemporary speech emotion recognition. Because the analysis relied on fixed-dimensional spectral summary representations and Random Forest classifiers, model training and inference remained computationally efficient and scalable across repeated cross-validation analyses while still supporting interpretable representational analyses.

2.5. Emotion-Space Geometry Analysis

To examine how bandwidth manipulation affected the representational structure of emotion categories, we analyzed the geometry of the classifier’s posterior-probability space. Because some condition cells lacked particular emotion categories (see Section 2.1), geometry analyses were restricted to the shared probability dimensions available across conditions (calm, happy, sad, angry, fearful), ensuring comparability of probability-space distances.

First, we tested whether the multivariate distribution of stimulus-level probability vectors differed across bandwidth conditions using permutational multivariate analysis of variance (PERMANOVA). For each stimulus and bandwidth condition, posterior class-probability vectors generated by the Random Forest classifier were averaged across cross-validation runs so that each stimulus contributed a single probability vector per bandwidth level, thereby avoiding pseudo-replication from repeated cross-validation predictions. Euclidean distance matrices were computed from these stimulus-averaged vectors, and PERMANOVA models were fit within each Modality × Intensity condition to test whether bandwidth altered the distribution of stimulus representations. Permutations were restricted within stimulus identity to preserve the repeated-measures structure across bandwidth conditions. Homogeneity of multivariate dispersion was assessed separately using permutation tests.
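The logic of this test can be sketched in a few lines. This is not the vegan implementation used in the study; it is a simplified illustration that computes the PERMANOVA pseudo-F from squared Euclidean distances and shuffles bandwidth labels only within each stimulus. The function names, the array layout, and the toy data are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pseudo_f(d2, labels):
    """PERMANOVA pseudo-F from a squared-distance matrix and group labels."""
    n = len(labels)
    groups = np.unique(labels)
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def restricted_permanova(probs, n_perm=999, seed=0):
    """probs: (n_stimuli, n_bandwidths, n_dims) stimulus-averaged probability vectors."""
    rng = np.random.default_rng(seed)
    n_stim, n_bw, n_dims = probs.shape
    points = probs.reshape(n_stim * n_bw, n_dims)  # rows ordered stimulus by stimulus
    labels = np.tile(np.arange(n_bw), n_stim)
    d2 = squareform(pdist(points, metric="euclidean")) ** 2
    observed = pseudo_f(d2, labels)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle bandwidth labels independently inside each stimulus block.
        perm = np.concatenate([rng.permutation(n_bw) for _ in range(n_stim)])
        exceed += pseudo_f(d2, perm) >= observed
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data: 30 stimuli, 4 bandwidth versions each, small noise and no bandwidth effect.
rng = np.random.default_rng(0)
base = rng.dirichlet(np.ones(5), size=30)[:, None, :]
probs = base + rng.normal(scale=0.02, size=(30, 4, 5))
f_stat, p_value = restricted_permanova(probs, n_perm=199)
```

Restricting the shuffles to within-stimulus blocks means that differences between stimuli never enter the null distribution, which mirrors the repeated-measures structure described above.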

Second, we examined the geometry of emotion-category prototypes derived from these probability representations. Emotion prototypes were computed separately within each cross-validation run for each Bandwidth × Modality × Intensity condition by averaging stimulus-level probability vectors within each true emotion category. Pairwise distances among emotion centroids were then calculated. Cosine distance was used as the primary metric because it captures similarity in the relative pattern of class-probability distributions independent of absolute magnitude differences, whereas Euclidean distance was computed as a complementary metric sensitive to absolute probability separation.
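The prototype computation can be illustrated with a short sketch: average the probability vectors within each true category, then take pairwise distances among the resulting centroids. The helper names and the idealized toy probabilities are assumptions for the example, and the mean over pairs anticipates the separation measure described in the following paragraph.

```python
import numpy as np
from scipy.spatial.distance import pdist

def centroid_distances(probs, true_labels, metric="cosine"):
    """Mean pairwise distance among per-emotion centroids of probability vectors."""
    emotions = np.unique(true_labels)
    centroids = np.vstack([probs[true_labels == e].mean(axis=0) for e in emotions])
    pairwise = pdist(centroids, metric=metric)  # condensed vector, one entry per pair
    return emotions, centroids, pairwise.mean()

def toy_probs(peak, labels, n_classes=5):
    """Every stimulus puts `peak` on its true class and shares the rest evenly."""
    probs = np.full((len(labels), n_classes), (1.0 - peak) / (n_classes - 1))
    probs[np.arange(len(labels)), labels] = peak
    return probs

labels = np.repeat(np.arange(5), 10)  # 5 emotions, 10 stimuli each
_, _, sharp_sep = centroid_distances(toy_probs(0.8, labels), labels)
_, _, blurred_sep = centroid_distances(toy_probs(0.4, labels), labels)
```

In this toy setting, confident predictions (peak 0.8) give a mean cosine distance of about 0.87 between centroids, whereas diffuse predictions (peak 0.4) give 0.25, showing how reduced separability appears in the metric. Passing metric="euclidean" yields the complementary measure.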

Overall emotion-space separation was quantified as the mean pairwise distance among emotion centroids within each condition. To isolate the effect of spectral truncation, distance changes were expressed relative to the Full-bandwidth condition within each run and Modality × Intensity cell (Δ-distance). Pairwise contrasts between specific emotion categories were also examined, and uncertainty was estimated using bootstrap confidence intervals across cross-validation runs.
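A hedged sketch of the Δ-distance computation follows. It assumes the sign convention Δ = condition minus Full (so negative values indicate reduced separation under truncation) and a percentile bootstrap over runs; neither detail is stated in the text, and the separation values below are simulated.

```python
import numpy as np

def delta_distance_ci(sep_condition, sep_full, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean change in separation relative to Full."""
    rng = np.random.default_rng(seed)
    # Pair the two conditions run by run before resampling.
    delta = np.asarray(sep_condition) - np.asarray(sep_full)
    idx = rng.integers(0, len(delta), size=(n_boot, len(delta)))
    boot_means = delta[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return delta.mean(), (lo, hi)

# Simulated per-run separations for 10 runs (not values from the study).
rng = np.random.default_rng(1)
sep_full = 0.60 + rng.normal(scale=0.01, size=10)
sep_8k = sep_full - 0.05 + rng.normal(scale=0.01, size=10)
mean_delta, (ci_lo, ci_hi) = delta_distance_ci(sep_8k, sep_full)
```

An interval that excludes zero would indicate a reliable change in separation for that condition relative to the Full-bandwidth baseline.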

All data wrangling and downstream statistical analyses were conducted in R 4.3.1 [20]. Data processing, reshaping, string handling, functional iteration, and visualization used tidyverse tools [21], with figures produced using ggplot2 [22]. Binomial generalized linear mixed-effects models and linear mixed-effects models were fit using lme4 [23], with additional inferential support for linear mixed models provided by lmerTest [24]. Type III tests were obtained using car [25], estimated marginal means and Holm-adjusted [26] post hoc comparisons were conducted using emmeans [27], and permutational multivariate analyses of variance and dispersion tests were conducted using vegan [28].

3. Results

We first report classification performance across bandwidth conditions, modality, and emotional intensity. We then analyze permutation-based feature importance to identify the spectral regions contributing most strongly to emotion classification. Finally, we examine how bandwidth manipulation affects the geometry of the emotion space derived from model probability outputs.

3.1. Emotion Classification Performance

Classification performance was examined across bandwidth conditions, modality (speech vs. song), and emotional intensity (normal vs. strong). Overall accuracy was computed to summarize classifier performance across conditions. Statistical inference on classification accuracy was conducted using binomial generalized linear mixed-effects models to determine whether performance differed as a function of bandwidth and its interaction with modality and intensity.

To ensure that comparisons across bandwidth conditions reflected the same underlying stimuli, predictions were aggregated at the level of Run × Stimulus × Condition × Modality × Intensity × Sex × Emotion. This aggregation allows classification accuracy to be modeled as binomial counts while preserving the correspondence between conditions and avoiding pseudo-replication. The resulting dataset, therefore, reflects repeated evaluations of the same stimuli across bandwidth conditions within each cross-validation run.

Table 2 summarizes mean classification accuracy for each Modality × Intensity × Bandwidth condition. Accuracy values were calculated after aggregating repeated cross-validated predictions at the stimulus level, so that repeated predictions for the same underlying recording were not treated as independent observations. Disgust and surprise were excluded because these categories were not available in both speech and song. Neutral items were retained where available but do not have strong-intensity counterparts in RAVDESS.

Figure 1 visualizes the same descriptive pattern, with mean classification accuracy shown across bandwidth conditions as a function of modality (speech vs. song), emotional intensity (normal vs. strong), and speaker sex (female vs. male). Classification accuracy is generally higher for song than for speech and for strong compared to normal emotional portrayals, while differences across bandwidth conditions appear comparatively modest.

In an earlier model comparison, adding bandwidth condition provided only weak evidence of improved model fit (χ²(3) = 6.66, p = 0.083), suggesting that spectral truncation has limited overall impact on classification performance. In the final mixed-effects model, however, Type III Wald tests indicated a statistically reliable but modest omnibus effect of bandwidth condition, χ²(3) = 8.10, p = 0.044. Classification accuracy in this model varied significantly as a function of modality, emotional intensity, speaker sex, and emotion category, while bandwidth effects remained modest and context-dependent. Modality exerted a large effect (χ²(1) = 83.88, p < 0.001), with higher accuracy for song than for speech, and emotional intensity also had a reliable effect (χ²(1) = 4.64, p = 0.031), with strong-intensity stimuli classified more accurately than normal-intensity stimuli. Classification accuracy additionally differed by speaker sex (χ²(1) = 8.57, p = 0.003), with higher accuracy for female speakers, and varied significantly across emotion categories (χ²(5) = 175.07, p < 0.001).

Importantly, the effect of bandwidth was not uniform across conditions. A significant bandwidth × modality interaction (χ²(3) = 8.01, p = 0.046) indicates that spectral reduction affects speech and song differently. Follow-up comparisons showed that bandwidth effects were small and non-monotonic. Within speech, no pairwise bandwidth contrast was reliable after correction, whereas within song, the only reliable contrast was between 8 kHz and Full (odds ratio = 1.235, SE = 0.085, z = 3.07, Holm-adjusted p = 0.0128), with higher classification odds at 8 kHz than at Full. Bandwidth effects also differed across speaker sex (χ²(3) = 39.20, p < 0.001), indicating that the impact of spectral reduction varies between male and female speakers. In contrast, there was no evidence that bandwidth effects differ systematically across emotion categories (χ²(15) = 19.90, p = 0.176). Descriptively, emotions such as anger and neutral tended to show higher classification accuracy, whereas sadness showed comparatively lower accuracy across conditions (see Figure 1). Full pairwise comparisons among emotion categories and interactions are reported in Supplementary Tables S2–S4.
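For readers relating the reported odds ratio to its test statistic, the following sketch reproduces the z value for the 8 kHz versus Full contrast in song. It assumes the standard error is reported on the odds-ratio scale via the delta method, as is typical for estimated marginal means back-transformed to the response scale; the function name is hypothetical.

```python
import math

def z_from_odds_ratio(odds_ratio, se_or):
    """Wald z for an odds ratio whose SE is given on the odds-ratio scale."""
    log_or = math.log(odds_ratio)
    # Delta method (assumed): SE(OR) is approximately OR * SE(log OR).
    se_log_or = se_or / odds_ratio
    return log_or / se_log_or

z = z_from_odds_ratio(1.235, 0.085)
print(round(z, 2))  # 3.07
```

An odds ratio of 1.235 means the odds of a correct classification were about 23.5% higher at 8 kHz than at Full bandwidth for song, which is a small effect in absolute accuracy terms.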

Overall, these results show that classification performance is driven primarily by modality, emotional intensity, and emotion category, whereas the contribution of spectral bandwidth is modest and context-dependent, and not well described by a simple monotonic degradation account.

Because overall classification accuracy showed relatively modest sensitivity to bandwidth restriction, the next analysis examined whether bandwidth nevertheless altered the spectral cue-weighting strategies used by the classifier.

3.2. Spectral-Band Importance Across Conditions

To examine how the classifier relied on different spectral regions across conditions, we analyzed permutation-based feature-importance values using linear mixed-effects models. Because permutation-based feature-importance values were computed once per cross-validation training run and condition cell, they were constant across individual trial predictions within that cell. Accordingly, feature-importance values were aggregated to a single observation per Run × Bandwidth × Modality × Intensity × Frequency-band cell. For each cross-validation run and each Bandwidth × Modality × Intensity condition, we retained the four band-specific importance measures (Low, Mid, High, and Extended High Frequency [EHF]), reshaped the data to long format, and treated frequency band as a within-cell factor.
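The aggregation and reshaping step described above can be sketched as follows. This is a minimal illustration with made-up numbers, not the authors' pipeline; the field names (`run`, `bandwidth`, `low`, `ehf`, and so on) and the helper `to_long` are assumptions introduced for the example.

```python
from collections import defaultdict
from statistics import mean

BANDS = ("low", "mid", "high", "ehf")

# Hypothetical trial-level rows: importance values repeat within a cell.
trials = [
    {"run": 1, "bandwidth": "8kHz", "modality": "speech", "intensity": "normal",
     "low": 0.040, "mid": 0.030, "high": 0.010, "ehf": 0.000},
    {"run": 1, "bandwidth": "8kHz", "modality": "speech", "intensity": "normal",
     "low": 0.040, "mid": 0.030, "high": 0.010, "ehf": 0.000},
    {"run": 1, "bandwidth": "Full", "modality": "speech", "intensity": "normal",
     "low": 0.035, "mid": 0.028, "high": 0.012, "ehf": 0.006},
]


def to_long(rows):
    """Collapse to one row per Run x Bandwidth x Modality x Intensity x Band."""
    cells = defaultdict(list)
    for row in rows:
        key = (row["run"], row["bandwidth"], row["modality"], row["intensity"])
        cells[key].append(row)
    long_rows = []
    for key, members in cells.items():
        for band in BANDS:
            # Values are constant within a cell, so the mean only removes duplicates.
            long_rows.append((*key, band, mean(m[band] for m in members)))
    return long_rows


long_rows = to_long(trials)
for row in long_rows:
    print(row)
```

Collapsing before modeling keeps the number of observations equal to the number of independent importance estimates, so repeated trial rows do not inflate the apparent sample size.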

Linear mixed-effects models were then fit with feature importance as the dependent variable and fixed effects for frequency band, bandwidth condition, modality, and emotional intensity, along with their interactions. A random intercept for cross-validation run was included to account for run-to-run variability in overall importance magnitude. Fixed effects were evaluated using Type III Wald χ² tests, and planned follow-up comparisons were conducted using estimated marginal means with Holm-adjusted pairwise contrasts.

Figure 2 shows the estimated permutation-based feature importance for each spectral band across bandwidth conditions, separated by modality (speech vs. song) and emotional intensity (normal vs. strong). The y-axis represents estimated feature importance scaled by 1000 for readability and reflects the extent to which classification performance depends on each spectral region.

Across conditions, the classifier showed greatest reliance on low- and mid-frequency spectral information, whereas high-frequency and EHF bands contributed less under the most restricted bandwidth conditions. As bandwidth increased, however, the relative importance of higher-frequency bands also increased, particularly for the EHF region, indicating that the importance of EHF information increased when it was available. These patterns were broadly similar across modalities and intensity levels, although the relative distribution of importance across spectral regions varied as a function of stimulus modality, emotional intensity, and bandwidth condition. Thus, broader bandwidth changed the model’s cue-weighting profile even when its effect on classification accuracy was modest.

Type III Wald χ² tests from the linear mixed-effects model revealed strong main effects of frequency band (χ²(3) = 264,220, p < 0.001), bandwidth condition (χ²(3) = 2427.9, p < 0.001), modality (χ²(1) = 1892.1, p < 0.001), and emotional intensity (χ²(1) = 1484.5, p < 0.001), indicating that overall permutation-based feature importance varied substantially across spectral regions, bandwidth levels, stimulus modality, and portrayal strength. Consistent with the descriptive patterns noted above, the model showed greatest dependence on low-frequency bands, followed by mid-frequency bands, with comparatively smaller contributions from high-frequency and EHF regions.

Several two-way interactions further clarified these effects. A significant condition × modality interaction (χ²(3) = 7.99, p = 0.046) indicated that bandwidth-related changes in feature importance differed somewhat between speech and song stimuli. A significant condition × intensity interaction (χ²(3) = 20.79, p < 0.001) indicated that bandwidth effects also varied as a function of emotional strength. The modality × intensity interaction was likewise significant (χ²(1) = 705.84, p < 0.001), demonstrating that the overall magnitude of feature importance differed across modality–intensity combinations.

Frequency band also interacted strongly with the experimental factors. The band × condition interaction was highly significant (χ²(9) = 77,224, p < 0.001), indicating that bandwidth expansion altered the relative importance of specific spectral regions. In addition, both band × modality (χ²(3) = 12,112, p < 0.001) and band × intensity (χ²(3) = 7157.4, p < 0.001) interactions were significant, demonstrating that the distribution of importance across spectral bands differed systematically between speech and song stimuli and between normal and strong emotional portrayals.

Higher-order interactions were also statistically reliable. The condition × modality × intensity interaction (χ²(3) = 13.57, p = 0.0035) indicated that bandwidth effects varied across specific combinations of modality and emotional intensity. Moreover, all higher-order interactions involving frequency band were significant: band × condition × modality (χ²(9) = 2229.6, p < 0.001), band × condition × intensity (χ²(9) = 402.42, p < 0.001), band × modality × intensity (χ²(3) = 3864.5, p < 0.001), and the four-way band × condition × modality × intensity interaction (χ²(9) = 1033.3, p < 0.001). These effects indicate that bandwidth-related changes in spectral reliance depend jointly on modality and emotional intensity and that the way bandwidth reshapes the importance of specific spectral regions differs across these conditions.

Follow-up comparisons using estimated marginal means were conducted to further characterize these effects. Pairwise contrasts comparing bandwidth conditions within each spectral band and Modality × Intensity cell are reported in Supplementary Tables S6–S9. These comparisons show that bandwidth expansion is associated with systematic changes in feature importance across spectral regions. In general, low-frequency bands exhibit the highest importance in the most spectrally restricted conditions, whereas the relative contribution of high-frequency and extended high-frequency bands increases as additional bandwidth becomes available.

These results indicate that the classifier’s reliance on spectral information is strongly structured by frequency band and varies systematically with bandwidth. Importantly, bandwidth did not simply raise or lower feature importance uniformly; instead, it redistributed the classifier’s reliance across spectral regions, and these changes also depended on modality, emotional intensity, and their interaction. The model depended most strongly on low-frequency information overall, whereas higher-frequency and EHF regions gained importance when they were available.

These feature-importance patterns suggest that bandwidth manipulation alters the relative contribution of different spectral regions even when overall classification performance changes only modestly. The next analysis, therefore, examined whether these shifts in spectral cue weighting were accompanied by changes in the representational geometry of the classifier’s emotion space.

3.3. Emotion-Space Structure

While the preceding analyses identify which spectral regions show the greatest importance for emotion classification, they do not reveal how these acoustic cues shape the organization of emotion categories in the model’s output space. Changes in spectral reliance across bandwidth conditions may influence not only overall classification accuracy but also the structure of the classifier’s internal representation of emotions. In particular, if certain spectral regions are important for distinguishing specific emotional contrasts, restricting bandwidth may compress or distort relationships among emotion categories in the model’s posterior-probability space. To evaluate this possibility, the next analysis examines the geometry of emotion representations derived from the classifier’s probability outputs.

To examine whether spectral bandwidth alters the structure of the emotion representations produced by the classifier, we analyzed the geometry of the model’s posterior-probability space. For each stimulus, the Random Forest classifier outputs a probability vector over emotion categories, reflecting the model’s graded confidence that the stimulus belongs to each emotion class. These probability vectors provide a representation of stimuli within a multivariate “emotion space,” where distances between vectors reflect the similarity of emotional representations learned by the classifier. Changes in this representational structure across bandwidth conditions would indicate that spectral truncation affects not only classification accuracy but also the organization of emotion categories in the model’s output space.

We evaluated this question at two complementary levels. First, we tested whether the multivariate distribution of stimulus-level probability vectors differed across bandwidth conditions using permutational multivariate analysis of variance (PERMANOVA). Second, we examined the geometry of emotion-category prototypes derived from these probabilities, quantifying both overall emotion-space separation and pairwise distances among emotion categories across bandwidth conditions.

3.3.1. Statistical Test of Stimulus-Level Emotion Space

To formally test whether spectral bandwidth alters the structure of the classifier’s stimulus-level probability space, we conducted permutational multivariate analysis of variance (PERMANOVA). This analysis evaluates whether the multivariate distribution of stimulus-level probability vectors differs across bandwidth conditions within each Modality × Intensity cell. For each stimulus and bandwidth condition, posterior class-probability vectors were averaged across cross-validation runs so that each stimulus contributed a single probability vector per bandwidth level, preventing pseudo-replication arising from repeated predictions across cross-validation folds.
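The averaging step that yields one probability vector per stimulus and bandwidth level can be illustrated with a small sketch. The tuple layout, stimulus identifiers, and function name below are hypothetical and are used only to make the operation concrete.

```python
from collections import defaultdict


def average_across_runs(predictions):
    """predictions: iterable of (stimulus_id, bandwidth, run, prob_vector)."""
    sums = {}
    counts = defaultdict(int)
    for stimulus, bandwidth, _run, probs in predictions:
        key = (stimulus, bandwidth)
        if key not in sums:
            sums[key] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[key][k] += p
        counts[key] += 1
    return {key: [s / counts[key] for s in vec] for key, vec in sums.items()}


# Hypothetical two-class posteriors for one stimulus across runs.
predictions = [
    ("stim01", "8kHz", 1, [0.6, 0.4]),
    ("stim01", "8kHz", 2, [0.8, 0.2]),
    ("stim01", "Full", 1, [0.5, 0.5]),
]
averaged = average_across_runs(predictions)
print(averaged)
```

After this step the unit of analysis is the stimulus rather than the individual prediction, which is what prevents repeated cross-validation predictions from being counted as independent observations.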

Euclidean distance matrices were computed from these stimulus-averaged probability vectors. Because some condition cells lacked particular emotion categories, the analysis was restricted to the emotion dimensions shared across all conditions (calm, happy, sad, angry, fearful), ensuring that distances were computed in comparable probability spaces across conditions. Separate PERMANOVA models were then run within each Modality × Intensity cell using the model distance ~ bandwidth. This approach isolates the effect of spectral bandwidth while holding modality and emotional intensity constant, avoiding higher-order interactions that would be difficult to interpret in a single global model. Permutations were restricted within stimulus identity (strata = stimulus) to preserve the repeated-measures structure across bandwidth conditions.
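To make the logic of a one-way PERMANOVA with stratified permutations concrete, the following is a minimal pure-Python sketch. It is not the authors' implementation (the source does not name the software), and it uses the standard pseudo-F partition of squared Euclidean distances into between-group and within-group sums of squares. The toy data, function names, and the choice of 99 permutations are all illustrative assumptions.

```python
import random


def pseudo_f(points, labels):
    """Pseudo-F and R^2 from squared Euclidean distances (one-way design)."""
    n = len(points)
    sizes = {g: labels.count(g) for g in set(labels)}
    ss_total = ss_within = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            ss_total += d2 / n
            if labels[i] == labels[j]:
                ss_within += d2 / sizes[labels[i]]
    ss_between = ss_total - ss_within
    a = len(sizes)
    f_stat = (ss_between / (a - 1)) / (ss_within / (n - a))
    return f_stat, ss_between / ss_total


def permanova_strata(points, labels, strata, n_perm=99, seed=0):
    """Permute bandwidth labels only within each stratum (stimulus)."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(points, labels)
    by_stratum = {}
    for idx, s in enumerate(strata):
        by_stratum.setdefault(s, []).append(idx)
    hits = 0
    for _ in range(n_perm):
        permuted = list(labels)
        for idxs in by_stratum.values():
            shuffled = [labels[i] for i in idxs]
            rng.shuffle(shuffled)
            for i, lab in zip(idxs, shuffled):
                permuted[i] = lab
        if pseudo_f(points, permuted)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)


# Toy data: three stimuli, each observed at two bandwidth levels.
base = {"s1": [0.6, 0.1, 0.1, 0.1, 0.1],
        "s2": [0.1, 0.6, 0.1, 0.1, 0.1],
        "s3": [0.2, 0.2, 0.2, 0.2, 0.2]}
points, labels, strata = [], [], []
for stim, vec in base.items():
    shifted = [vec[0] - 0.05] + vec[1:4] + [vec[4] + 0.05]
    for bw, v in (("Full", vec), ("8kHz", shifted)):
        points.append(v)
        labels.append(bw)
        strata.append(stim)

f_obs, r2, p_value = permanova_strata(points, labels, strata)
print(f_obs, r2, p_value)
```

Restricting shuffles to within-stimulus labels means the null distribution reflects only reassignment of bandwidth conditions for the same stimulus, which mirrors the repeated-measures structure described above.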

PERMANOVA analyses provided no reliable evidence that bandwidth altered the global structure of the stimulus-level probability space. Within each Modality × Intensity cell, bandwidth effects were not statistically significant after correction for multiple comparisons (all Holm-corrected p ≥ 0.080; see Supplementary Table S10). The only nominal effect occurred for song stimuli at normal intensity (pseudo-F = 0.012, R² = 1.91 × 10⁻⁵, p = 0.020), but this effect did not survive Holm correction (p(Holm) = 0.080) and explained a negligible proportion of variance (approximately 0.002%). The remaining conditions (speech–normal, speech–strong, song–strong) showed no evidence that bandwidth altered the multivariate distribution of stimulus representations (all p ≥ 0.368). Tests of homogeneity of multivariate dispersion (betadisper with permutation testing) were not significant in any condition (p = 0.979–0.997), indicating that the results are not driven by differences in within-condition dispersion.
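The relation between the nominal p value of 0.020 and the Holm-corrected value of 0.080 follows from the Holm step-down procedure applied to four tests (one per Modality × Intensity cell): the smallest p value is multiplied by the number of tests. The sketch below illustrates the procedure; only the first raw p value is taken from the text, and the other three are hypothetical placeholders chosen to respect the reported lower bound of 0.368.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# 0.020 is reported for song at normal intensity; the rest are HYPOTHETICAL.
raw_p = [0.020, 0.368, 0.500, 0.900]
adjusted_p = holm_adjust(raw_p)
print(adjusted_p)
```

With four tests, the smallest raw value of 0.020 becomes 4 × 0.020 = 0.080, matching the corrected value reported for the song, normal-intensity cell.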

3.3.2. Overall Emotion-Space Separation

In addition to the inferential test described above, we examined how bandwidth manipulation affected the global separability of emotion categories in the classifier’s output space. Emotion prototypes were computed separately within each cross-validation run for each Bandwidth × Modality × Intensity condition by averaging stimulus-level probability vectors within each true emotion category. These prototypes represent centroid representations of each emotion in the model’s posterior-probability space.
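A prototype in this sense is simply the mean probability vector of all stimuli that share a true emotion label. The following minimal sketch shows the computation on made-up three-class posteriors; the function name and the example values are assumptions for illustration.

```python
def emotion_prototypes(prob_vectors, true_labels):
    """Average probability vectors within each true emotion category."""
    sums, counts = {}, {}
    for vec, label in zip(prob_vectors, true_labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, p in enumerate(vec):
            acc[k] += p
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


# Hypothetical posteriors over (angry, happy, sad) for three stimuli.
probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
labels = ["angry", "angry", "happy"]
prototypes = emotion_prototypes(probs, labels)
print(prototypes)
```

Because prototypes are grouped by the true label rather than the predicted label, a centroid that drifts toward another category's corner of the space reflects systematic confusion for that emotion.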

Pairwise distances among emotion centroids were then computed. Cosine distance was used as the primary metric because it captures similarity in the pattern of class-probability distributions independent of magnitude differences. Euclidean distance was also computed as a complementary metric sensitive to absolute probability separation, and both measures produced qualitatively similar results. For interpretability, we summarize cosine distances in the main analyses.
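The difference between the two metrics can be seen in a small example: cosine distance ignores overall vector length, whereas Euclidean distance does not. The vectors below are hypothetical centroids, and the helper names are illustrative.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


angry = [0.70, 0.10, 0.10, 0.10]   # hypothetical centroid
happy = [0.10, 0.70, 0.10, 0.10]   # hypothetical centroid
angry_scaled = [0.5 * x for x in angry]  # same pattern, smaller magnitude

print(cosine_distance(angry, happy), euclidean_distance(angry, happy))
print(cosine_distance(angry, angry_scaled), euclidean_distance(angry, angry_scaled))
```

The rescaled copy has a cosine distance of essentially zero from the original but a nonzero Euclidean distance, which is the sense in which cosine distance captures pattern similarity independent of magnitude.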

Overall emotion-space separation was quantified as the mean pairwise distance among emotion centroids within each condition. To isolate the effect of spectral truncation, distance changes were expressed relative to the Full-bandwidth condition within each run and Modality × Intensity cell (Δ-distance). Uncertainty was estimated using bootstrap confidence intervals across cross-validation runs.
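The Δ-distance and its confidence interval can be sketched as follows. The per-run distances are invented for illustration, and a percentile bootstrap is used here as one common choice; the text does not specify which bootstrap variant the authors applied.

```python
import random
from statistics import mean

# HYPOTHETICAL mean pairwise cosine distances, one value per cross-validation run.
full_by_run = [0.412, 0.405, 0.418, 0.409, 0.415, 0.411]
cut_by_run = [0.404, 0.398, 0.409, 0.401, 0.408, 0.402]

# Delta-distance: truncated condition minus Full, paired within run.
deltas = [cut - full for cut, full in zip(cut_by_run, full_by_run)]


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling runs with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


delta_mean = mean(deltas)
ci_low, ci_high = bootstrap_ci(deltas)
print(delta_mean, ci_low, ci_high)
```

Pairing the subtraction within run removes run-level differences in overall separation, so the interval reflects only the change attributable to truncation; an interval lying entirely below zero indicates compression of the emotion space.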

Figure 3 shows the change in overall emotion-space separation relative to the Full-bandwidth condition. Negative values indicate compression of the emotion space (reduced separability among emotions), whereas positive values indicate expansion.

The largest compression occurred for speech stimuli at normal intensity in the 8 kHz condition, indicating that severe bandwidth restriction reduced overall separability among emotions in this condition. Specifically, mean pairwise cosine distance was reduced relative to Full bandwidth at 8 kHz, Δ = −0.008, 95% bootstrap CI [−0.010, −0.006]. More moderate compression was observed at 12 kHz, Δ = −0.003, 95% CI [−0.005, −0.002], while distances at 16 kHz remained close to the Full-bandwidth baseline, Δ = 0.001, 95% CI [−0.000, 0.002]. In contrast, song stimuli and strong emotional portrayals showed little change in global emotion-space separation, with Δ-distance values remaining near zero across bandwidth conditions.

These results indicate that bandwidth reduction has only modest effects on the global separability of emotion categories. Measurable changes in emotion-space structure appear limited to the most severe bandwidth restriction and primarily affect speech stimuli at normal intensity.

3.3.3. Pairwise Emotion Contrasts

To determine which emotion relationships accounted for these global patterns, we next examined changes in pairwise distances between individual emotion categories. Pairwise cosine distances were computed for all emotion pairs within each condition and compared with the corresponding distances in the Full-bandwidth condition.

Figure 4 shows changes in separability for selected emotion pairs relative to the Full-bandwidth baseline. The most pronounced compression was observed for the fearful–happy contrast in speech at 8 kHz. Specifically, in normal-intensity speech, the fearful–happy contrast showed the largest reduction in pairwise cosine distance at 8 kHz, Δ = −0.048, 95% bootstrap CI [−0.053, −0.043], indicating that these two emotion categories became less separable under severe bandwidth restriction. Additional localized changes in that same cell were observed for contrasts including angry–happy, Δ = −0.027, 95% CI [−0.031, −0.022], happy–sad, Δ = −0.014, 95% CI [−0.019, −0.009], angry–sad, Δ = 0.009, 95% CI [0.006, 0.012], and angry–fearful, Δ = 0.007, 95% CI [0.003, 0.012], although these were smaller than the fearful–happy effect, and most contrasts returned toward baseline at 12–16 kHz. Thus, bandwidth restriction did not uniformly compress all emotion relationships; instead, it selectively reduced separability for specific contrasts, most notably fearful–happy, while slightly increasing separability for others.

To further characterize the strongest local distortions, we identified emotion-pair contrasts whose bootstrap confidence intervals excluded zero. These effects were concentrated primarily in the 8 kHz condition, particularly for speech at normal intensity. However, the number of affected contrasts was limited, and most emotion relationships remained stable across bandwidth conditions.

Overall, these results indicate that spectral bandwidth reduction produces localized distortions in specific emotion contrasts rather than large-scale restructuring of the overall emotion space. The clearest effects were concentrated under the most severe bandwidth restriction and were most evident for speech at normal intensity.

4. Discussion

The present study investigated how spectral bandwidth influences automatic recognition of vocal emotion using bandwidth-manipulated recordings from the RAVDESS corpus. In addition to classification accuracy, we examined which spectral regions the model depends on and how emotion representations are organized in the classifier’s posterior-probability space. Three main findings emerged. First, restricting bandwidth had relatively modest effects on classification accuracy overall, and these effects were interaction-dependent (the effect was often larger for speech than for song) rather than uniformly monotonic across conditions. Second, feature-importance analyses showed that the model depends primarily on low- and mid-frequency spectral information, whereas higher-frequency regions increase in importance when available. Third, analyses of the classifier’s probability-space geometry showed no evidence of global restructuring of emotion representations, but instead revealed localized reductions in separability among certain emotion contrasts under severe spectral truncation. These findings indicate that EHF information is not the primary basis for vocal emotion recognition in this dataset but can provide complementary acoustic information under specific conditions.

4.1. Bandwidth and Acoustic Cues in Vocal Emotion Recognition

The relatively small overall effect of bandwidth restriction suggests that most acoustic information supporting categorical emotion recognition in these recordings lies below approximately 8 kHz. This finding is consistent with extensive research demonstrating that vocal emotion perception depends heavily on prosodic and spectral cues concentrated in lower frequency regions, including pitch variation, amplitude modulation, and spectral characteristics related to overall spectral balance [1,2,5]. These cues remain largely preserved even when upper spectral regions are removed, which likely explains why classification performance remained relatively stable across bandwidth conditions. More importantly, bandwidth affected the classifier’s spectral cue-weighting strategy more clearly than it affected overall classification accuracy.

At the same time, feature-importance analyses indicate that higher-frequency spectral regions are not redundant. Their importance increased systematically with available bandwidth, suggesting that extended high-frequency (EHF) information provides supplementary acoustic information rather than serving as the primary basis for emotion classification. This interpretation aligns with research on extended-bandwidth speech perception showing that frequencies above the traditional telephone band can enhance perceived clarity and contribute to fine-grained aspects of voice quality [29], and support speech perception, particularly in challenging listening conditions [30]. These frequencies have also been linked to perceptual aspects of voice quality, although such qualities are not themselves direct markers of emotion categories. In the present study, higher spectral regions appear to refine distinctions among emotion categories rather than determine classification performance.

Importantly, the relationship between bandwidth and classification performance was not monotonic. In some conditions, particularly for song stimuli, the 8 kHz condition slightly outperformed the Full-band condition. This pattern suggests that access to broader spectral information does not necessarily yield uniformly improved emotion classification. One possible explanation is that most emotion-relevant acoustic information in the present dataset is concentrated in lower-frequency spectral regions, such that higher-frequency information contributes comparatively little additional category-level information. Another possibility is that broader spectral representations introduce additional acoustic variability that does not consistently improve separability among emotion categories and may occasionally reduce classifier stability. In addition, because the present study used Random Forest models operating on temporally summarized spectral representations, higher-frequency detail may not have been exploited as effectively as it might be in architectures designed to capture fine-grained spectrotemporal structure. Importantly, however, the feature-importance and emotion-space analyses still indicate that broader bandwidth systematically alters spectral cue weighting and selectively affects certain emotion contrasts, even when overall classification accuracy changes only modestly.

These findings are also consistent with recent reviews of speech emotion recognition (SER) systems. Many approaches rely on spectral representations derived from mel-frequency features or learned embeddings, with prosodic and spectral-envelope information identified as primary contributors to emotional classification [6]. The present results extend this literature by demonstrating that even when higher spectral regions are removed, most information supporting categorical emotion recognition remains available in lower-frequency bands.

4.2. Effects of Modality and Emotional Intensity

The results also revealed systematic differences between speech and song stimuli. Both classification performance and emotion-space geometry were more stable for song than for speech across bandwidth conditions. This pattern likely reflects acoustic differences between the two modalities. Compared with speech, singing typically involves more sustained phonation, slower temporal modulation, and a more stable harmonic structure [3,4], which may help preserve emotion-relevant acoustic cues even when portions of the spectral signal are removed. This suggests that modality shapes the robustness of emotion-related acoustic information to bandwidth reduction.

Emotion perception research has long noted parallels between emotional expression in speech and music, particularly in the use of pitch, loudness, and timing cues [5]. Because singing often exaggerates these cues and maintains stable harmonic structure [3,4], emotion-relevant information may remain recoverable even under spectral degradation. Extended high-frequency components may further contribute to perceived timbral qualities such as brightness or spectral richness, potentially supporting the preservation of subtle acoustic distinctions without being essential for emotion classification. The present results show that this robustness extends beyond classification accuracy to the preservation of emotion-space structure.

Emotional intensity also moderated the effects of bandwidth restriction. Strong emotional expressions were classified more accurately and exhibited more stable emotion-space geometry than normal-intensity portrayals. This pattern likely reflects increased acoustic redundancy in strongly expressed emotions. High-intensity emotional expressions typically involve larger pitch excursions, greater amplitude variation, and more pronounced voice-quality differences [2]. When such cues are amplified simultaneously, the signal becomes more robust to degradation in any single spectral region. The present findings suggest that this redundancy extends to machine-learning representations: under strong emotional expression, sufficient acoustic information remains available to preserve category separation even under substantial bandwidth restriction.

4.3. Emotion-Space Geometry and Representational Structure

Beyond overall classification accuracy and spectral cue-weighting analyses, a central contribution of this study is the examination of emotion-space geometry derived from the classifier’s posterior-probability vectors. Rather than focusing exclusively on classification accuracy, this approach allows the internal structure of model representations to be examined directly. PERMANOVA analyses provided no reliable evidence that bandwidth manipulation was associated with a global restructuring of the stimulus-level probability space. However, prototype-based analyses revealed modest compression of the emotion space under bandwidth restriction, particularly for speech at normal intensity.

Importantly, this compression was not uniform across emotions. Pairwise analyses indicated that bandwidth reduction selectively affected particular emotion contrasts, most notably the fearful–happy contrast in the 8 kHz condition. This pattern suggests that bandwidth limitation selectively weakens distinctions among emotion categories rather than disrupting the overall organization of the emotion space represented by the classifier. In other words, spectral truncation appears to affect the precision of specific emotional contrasts more than the global structure of emotion categories.

From a machine-learning perspective, these findings illustrate the value of analyzing the geometry of classifier outputs rather than relying solely on accuracy metrics. Recent work using deep neural representations has shown that learned speech representations encode information relevant for emotion classification, and that different layers can contribute differently to performance [31]. The present results demonstrate that bandwidth manipulations can produce localized changes in representational structure even when overall classification performance remains relatively stable.

The results also connect to research on human perception of emotional speech under degraded listening conditions. Behavioral studies have shown that listeners can recognize vocal emotions with relatively high accuracy even when acoustic information is reduced or degraded, although specific emotional contrasts may become less distinct [2]. The present findings suggest a similar pattern in machine-learning representations: bandwidth restriction does not eliminate the core emotional structure of the signal but can selectively reduce the separability of acoustically similar emotions.

The present findings also have broader implications for applied speech and audio technologies operating under bandwidth-limited or acoustically degraded conditions. The observation that emotion classification remained relatively stable under substantial spectral truncation suggests that much of the information supporting categorical vocal emotion recognition is preserved within lower-frequency spectral regions typically retained in conventional telecommunication systems. At the same time, the finding that higher-frequency and EHF regions can selectively refine certain emotion contrasts suggests that broader bandwidth may still provide advantages in applications where subtle affective distinctions are important. These issues may be particularly relevant for hearing-assistive technologies, voice-assistant systems, and speech interfaces operating in noisy real-world environments, where spectral degradation may interact with background noise and listener-related limitations. The results may also have implications for music emotion recognition and clinical speech analysis, where fine-grained aspects of voice quality and emotional expression can carry diagnostically or perceptually relevant information. More generally, the present findings suggest that bandwidth limitations may affect not only overall recognition performance but also the relative perceptual salience of specific emotional contrasts and the organization of emotion representations under degraded listening conditions.

4.4. Limitations and Future Directions

Several limitations should be considered when interpreting the present results. First, the study used a single acted emotional corpus recorded under controlled studio conditions. Although RAVDESS provides well-validated emotional portrayals, acted emotional speech may differ acoustically from spontaneous emotional expression in natural communication. In particular, acted portrayals may exaggerate prosodic, articulatory, and spectral distinctions among emotions relative to everyday conversational speech, potentially increasing acoustic separability and classification performance under controlled conditions. Emotional expression in natural interaction is often more variable, context-dependent, and dynamically modulated than in laboratory-produced portrayals. Accordingly, the present findings should be interpreted primarily as evidence of how spectral bandwidth influences emotion-related acoustic structure under controlled conditions rather than as a direct model of emotional communication in fully naturalistic settings. Second, the classifier used mel-spectral summary features and a Random Forest architecture rather than representation-learning approaches based on deep neural networks. While this approach enabled interpretable analysis of feature importance, models such as self-supervised speech encoders may exhibit different sensitivity to spectral bandwidth effects.

Future work should therefore extend these analyses to more naturalistic emotional speech datasets and to contemporary representation-learning models, including convolutional neural networks, transformer-based systems, and self-supervised learning architectures. Such work would help determine whether the bandwidth-related changes in spectral cue weighting and emotion-space structure observed here generalize across modeling approaches. Additional work examining specific acoustic cues within high-frequency spectral regions may also help clarify which aspects of vocal production contribute to the localized distortions observed in specific emotion contrasts. Such research would help determine under what conditions extended bandwidth provides measurable advantages for emotion recognition systems operating in real-world conditions.

5. Conclusions

This study examined the role of spectral bandwidth in automatic vocal emotion recognition using bandwidth-manipulated speech and song stimuli from the RAVDESS corpus. Across analyses, the results indicate that most information supporting categorical emotion recognition is concentrated in low- and mid-frequency spectral regions, while extended high-frequency information contributes modest but measurable supplementary detail.

Bandwidth restriction produced only limited changes in classification accuracy and did not substantially alter the global structure of the classifier’s stimulus-level probability space. However, severe truncation reduced the separability of certain emotion contrasts, particularly in speech at normal emotional intensity. These findings suggest that extended bandwidth does not fundamentally determine emotion recognition performance in this dataset but refines emotional distinctions when acoustic cues are weaker or more ambiguous.

The results also highlight the importance of examining not only classification accuracy but also the internal representational structure of machine-learning models. Even when overall performance appears stable, bandwidth manipulations can produce localized changes in the geometry of emotional representations. Overall, these findings demonstrate that bandwidth primarily influences how acoustic information is structured and utilized within model representations, rather than how much information is available for classification. Understanding these representational effects will be important for future research on speech emotion recognition, particularly as models increasingly rely on richer acoustic representations and operate in more variable real-world environments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/signals7030050/s1.
Table S1: Omnibus tests for the final classification-accuracy model.
Table S2: Pairwise comparisons among emotion categories based on estimated marginal means.
Table S3: Pairwise comparisons for the bandwidth × modality interaction: bandwidth contrasts within speech and within song.
Table S4: Pairwise comparisons for the modality × intensity interaction: modality contrasts within each intensity level.
Table S5: Omnibus tests for the feature-importance model.
Table S6: Pairwise contrasts comparing bandwidth conditions within spectral band (speech stimuli, normal intensity). Pairwise contrasts are based on estimated marginal means; p values are Holm-adjusted.
Table S7: Pairwise contrasts comparing bandwidth conditions within spectral band (speech stimuli, strong intensity). Pairwise contrasts are based on estimated marginal means; p values are Holm-adjusted.
Table S8: Pairwise contrasts comparing bandwidth conditions within spectral band (song stimuli, normal intensity). Pairwise contrasts are based on estimated marginal means; p values are Holm-adjusted.
Table S9: Pairwise contrasts comparing bandwidth conditions within spectral band (song stimuli, strong intensity). Pairwise contrasts are based on estimated marginal means; p values are Holm-adjusted.
Table S10: PERMANOVA tests of bandwidth effects on stimulus-level probability space.

Author Contributions

Conceptualization, R.W.; methodology, R.G. and A.S.; software, R.G.; validation, R.G., A.S. and R.W.; formal analysis, R.G. and A.S.; investigation, R.G., A.S. and R.W.; resources, R.G., A.S. and R.W.; data curation, R.W.; writing—original draft preparation, R.W.; writing—review and editing, R.G., A.S. and R.W.; visualization, A.S.; supervision, R.W.; project administration, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are openly available on the Open Science Framework (OSF) at https://osf.io/m974h/overview?view_only=673ddd33730e4030a8fb31f3d0b0d996 (accessed on 27 May 2026). The repository includes model prediction outputs, aggregated results files, and analysis scripts for reproducing the classification, feature importance, and emotion-space geometry analyses.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (GPT-5.5, OpenAI, San Francisco, CA, USA) and Grammarly (Grammarly Inc., San Francisco, CA, USA) (accessed on 1 April 2026) for language editing and stylistic refinement. These tools were used to improve clarity and presentation only. The authors have reviewed and edited all outputs and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Banse, R.; Scherer, K.R. Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 1996, 70, 614–636.
Scherer, K.R. Vocal communication of emotion: A review of research paradigms. Speech Commun. 2003, 40, 227–256.
Sundberg, J. The Science of the Singing Voice; Northern Illinois University Press: DeKalb, IL, USA, 1987.
Patel, A.D. Music, Language, and the Brain; Oxford University Press: Oxford, UK, 2008.
Juslin, P.N.; Laukka, P. Communication of emotions in vocal expression and music performance: Different channels, same code? Psychol. Bull. 2003, 129, 770–814.
Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.
Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
Adolphs, R. How should neuroscience study emotions? By distinguishing emotion states, concepts, and experiences. Soc. Cogn. Affect. Neurosci. 2017, 12, 24–31.
LeDoux, J.E.; Brown, R. A higher-order theory of emotional consciousness. Proc. Natl. Acad. Sci. USA 2017, 114, E2016–E2025.
Madanian, S.; Chen, T.; Adeleye, O.; Templeton, J.M.; Poellabauer, C.; Parry, D.; Schneider, S.L. Speech emotion recognition using machine learning: A systematic review. Intell. Syst. Appl. 2023, 20, 200266.
Gabrielsson, A.; Juslin, P.N. Emotional expression in music performance: Between the performer’s intention and the listener’s experience. Psychol. Music 1996, 24, 68–91.
Thompson, W.F.; Russo, F.A.; Livingstone, S.R. Facial expressions of singers influence perceived pitch relations. Psychon. Bull. Rev. 2010, 17, 317–322.
Schuller, B.; Vlasenko, B.; Eyben, F.; Rigoll, G.; Wendemuth, A. Acoustic emotion recognition: A benchmark comparison of performances. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Merano, Italy, 13–17 December 2009; pp. 552–557.
Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Schuller, B.; Nicolaou, M.A.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
Titze, I.R. Principles of Voice Production; Prentice Hall: Englewood Cliffs, NJ, USA, 1994.
Major, D.P.; Chatterjee, M. Acoustic analyses of the RAVDESS corpus of emotional stimuli. JASA Express Lett. 2026, 6, 024801.
Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272.
McFee, B.; McVicar, M.; Faronbi, D.; Roman, I.; Gover, M.; Balke, S.; Seyfarth, S.; Malek, A.; Raffel, C.; Lostanlen, V.; et al. librosa/librosa: 0.11.0. Zenodo. 2025. Available online. (accessed on 1 February 2026).
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023; Available online: https://www.R-project.org/ (accessed on 28 March 2026).
Wickham, H.; Averick, M.; Bryan, J.; Chang, W.; McGowan, L.-D.; François, R.; Grolemund, G.; Hayes, A.; Henry, L.; Hester, J.; et al. Welcome to the Tidyverse. J. Open Source Softw. 2019, 4, 1686.
Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, NY, USA, 2016.
Bates, D.; Mächler, M.; Bolker, B.; Walker, S. Fitting Linear Mixed-Effects Models Using lme4. J. Stat. Softw. 2015, 67, 1–48.
Kuznetsova, A.; Brockhoff, P.B.; Christensen, R.H.B. lmerTest Package: Tests in Linear Mixed Effects Models. J. Stat. Softw. 2017, 82, 1–26.
Fox, J.; Weisberg, S. An R Companion to Applied Regression, 3rd ed.; Sage: Thousand Oaks, CA, USA, 2019.
Holm, S. A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 1979, 6, 65–70.
Lenth, R.; Piaskowski, J. emmeans: Estimated Marginal Means, aka Least-Squares Means, R Package Version 2.0.2; The Comprehensive R Archive Network (CRAN): Vienna, Austria, 2026. Available online: https://CRAN.R-project.org/package=emmeans (accessed on 28 March 2026).
Oksanen, J.; Simpson, G.L.; Blanchet, F.G.; Kindt, R.; Legendre, P.; Minchin, P.R.; O’Hara, R.B.; Solymos, P.; Stevens, M.H.H.; Szoecs, E.; et al. vegan: Community Ecology Package, R Package Version 2.7-3. 2026. Available online: https://CRAN.R-project.org/package=vegan (accessed on 28 March 2026).
Monson, B.B.; Hunter, E.J.; Lotto, A.J.; Story, B.H. The perceptual significance of high-frequency energy in the human voice. Front. Psychol. 2014, 5, 587.
Moore, B.C.J. A review of the perceptual effects of hearing loss for frequencies above 3 kHz. Int. J. Audiol. 2016, 55, 707–714.
Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3400–3404.

Figure 1. Mean classification accuracy across bandwidth conditions as a function of modality (speech vs. song), emotional intensity (normal vs. strong), and speaker gender (female vs. male). Points represent mean cross-validated accuracy, and error bars indicate 95% bootstrap confidence intervals.
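
As an illustration of how a 95% bootstrap confidence interval like those described in the caption might be obtained, the following is a minimal sketch using a percentile bootstrap. The percentile variant, the number of resamples, the resampling unit, and the toy correctness values are all assumptions made for illustration; the caption does not specify these details.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: a plain percentile bootstrap over individual observations.
    """
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        # Resample n observations with replacement and store the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical per-stimulus correctness values (1 = correct, 0 = incorrect).
correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
mean_acc = statistics.fmean(correct)
ci_low, ci_high = bootstrap_ci(correct)
print(f"mean = {mean_acc:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Fixing the seed makes the interval reproducible across runs; with real data the resampling unit would need to match the structure of the cross-validated predictions.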

Figure 2. Estimated spectral feature importance across bandwidth conditions. Estimated marginal means of permutation-based feature importance for four spectral bands (Low, Mid, High, and Extended High Frequency [EHF]) across bandwidth conditions. Panels show results separately for modality (speech vs. song) and emotional intensity (normal vs. strong). Higher values indicate greater reliance of the classifier on the corresponding frequency region for emotion classification.

Figure 3. Overall emotion-space separation relative to the Full-bandwidth condition. Negative values indicate compression of the emotion space (reduced separability among emotions), whereas positive values indicate expansion. Points represent mean Δ-distance across cross-validation runs, and error bars indicate 95% bootstrap confidence intervals.

Figure 4. Changes in pairwise emotion separability relative to the Full-bandwidth condition. Values represent Δ cosine distance between emotion-category prototypes, computed as the distance in each bandwidth condition minus the corresponding Full-bandwidth distance within the same Modality × Intensity cell. Negative values indicate reduced separability between the two emotion categories, whereas positive values indicate increased separability. Points represent mean Δ-distance across cross-validation runs, and error bars indicate 95% bootstrap confidence intervals.
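
The Δ cosine distance described in the caption can be sketched as follows. This is a hedged illustration only: it assumes that each emotion-category prototype is the mean posterior-probability vector over the stimuli of that category, and the three-class posterior values are hypothetical. Function names are illustrative and are not taken from the authors' scripts.

```python
import math


def mean_vector(rows):
    """Element-wise mean of equal-length vectors (assumed prototype definition)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def delta_distance(post_cond, post_full, emo_a, emo_b):
    """Prototype distance in a bandwidth condition minus the Full-bandwidth distance."""
    d_cond = cosine_distance(mean_vector(post_cond[emo_a]),
                             mean_vector(post_cond[emo_b]))
    d_full = cosine_distance(mean_vector(post_full[emo_a]),
                             mean_vector(post_full[emo_b]))
    return d_cond - d_full


# Hypothetical posterior vectors over three classes, two stimuli per emotion.
full = {
    "happy": [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    "sad": [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
}
low_pass_8k = {
    "happy": [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],
    "sad": [[0.3, 0.6, 0.1], [0.4, 0.5, 0.1]],
}

delta = delta_distance(low_pass_8k, full, "happy", "sad")
print(f"delta cosine distance (8 kHz vs. Full) = {delta:.3f}")
```

In this toy case the two prototypes move closer together under truncation, so the value is negative, which corresponds to reduced separability in the caption's terms.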

Table 1. Number of prediction instances used in the classification analyses across cross-validation runs.

Emotion	Speech (Normal)	Speech (Strong)	Song (Normal)	Song (Strong)	Total
Neutral	3840	-	3680	-	7520
Calm	3840	3840	3680	3680	15,040
Happy	3840	3840	3680	3680	15,040
Sad	3840	3840	3680	3680	15,040
Angry	3840	3840	3680	3680	15,040
Fearful	3840	3840	3680	3680	15,040
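
The row totals in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the counts from the table and treats the "-" cells (neutral has no strong-intensity recordings) as missing; the variable names are illustrative.

```python
# Counts copied from Table 1; None marks cells shown as "-" in the table.
# Column order: Speech (Normal), Speech (Strong), Song (Normal), Song (Strong).
table1 = {
    "Neutral": (3840, None, 3680, None),
    "Calm": (3840, 3840, 3680, 3680),
    "Happy": (3840, 3840, 3680, 3680),
    "Sad": (3840, 3840, 3680, 3680),
    "Angry": (3840, 3840, 3680, 3680),
    "Fearful": (3840, 3840, 3680, 3680),
}

# Row totals: sum the available cells for each emotion.
row_totals = {
    emotion: sum(c for c in counts if c is not None)
    for emotion, counts in table1.items()
}

# Column totals: sum each Modality x Intensity column over emotions.
column_totals = [
    sum(c for c in column if c is not None)
    for column in zip(*table1.values())
]

grand_total = sum(row_totals.values())
print(row_totals)
print(column_totals, grand_total)
```

The computed row totals match the Total column of the table (7520 for Neutral and 15,040 for each other emotion).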

Table 2. Mean classification accuracy by modality, emotional intensity, and bandwidth condition. Values show mean percent correct [95% CI].

Modality	Intensity	Bandwidth	Mean Accuracy, % [95% CI]	N Stimulus-Condition Cells
speech	normal	8 kHz	54.6 [50.7, 58.4]	576
speech	normal	12 kHz	55.3 [51.4, 59.1]	576
speech	normal	16 kHz	54.8 [51.2, 58.5]	576
speech	normal	Full	54.9 [51.2, 58.6]	576
speech	strong	8 kHz	67.9 [63.9, 71.9]	480
speech	strong	12 kHz	68.4 [64.5, 72.2]	480
speech	strong	16 kHz	68.1 [64.3, 71.8]	480
speech	strong	Full	68.1 [64.3, 72.2]	480
song	normal	8 kHz	73.1 [69.9, 76.3]	552
song	normal	12 kHz	72.6 [69.3, 75.9]	552
song	normal	16 kHz	72.6 [69.5, 75.7]	552
song	normal	Full	72.0 [68.5, 75.3]	552
song	strong	8 kHz	77.0 [73.5, 80.5]	460
song	strong	12 kHz	77.2 [73.7, 80.7]	460
song	strong	16 kHz	76.3 [72.8, 79.7]	460
song	strong	Full	76.6 [73.0, 80.1]	460
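
To make the size of the bandwidth differences in Table 2 concrete, the following sketch computes, for each Modality × Intensity cell, the spread between the highest and lowest mean accuracy across the four bandwidth conditions. The values are copied from the table; the helper name is illustrative, and this descriptive summary is not a substitute for the statistical models reported in the article.

```python
# Mean accuracy (%) copied from Table 2, keyed by (modality, intensity).
table2 = {
    ("speech", "normal"): {"8 kHz": 54.6, "12 kHz": 55.3, "16 kHz": 54.8, "Full": 54.9},
    ("speech", "strong"): {"8 kHz": 67.9, "12 kHz": 68.4, "16 kHz": 68.1, "Full": 68.1},
    ("song", "normal"): {"8 kHz": 73.1, "12 kHz": 72.6, "16 kHz": 72.6, "Full": 72.0},
    ("song", "strong"): {"8 kHz": 77.0, "12 kHz": 77.2, "16 kHz": 76.3, "Full": 76.6},
}


def accuracy_spread(by_bandwidth):
    """Largest minus smallest mean accuracy across bandwidth conditions."""
    values = by_bandwidth.values()
    # Round to one decimal place to match the precision of the table.
    return round(max(values) - min(values), 1)


spreads = {cell: accuracy_spread(acc) for cell, acc in table2.items()}
for (modality, intensity), spread in spreads.items():
    print(f"{modality:6s} {intensity:6s} spread = {spread:.1f} percentage points")
```

For the tabulated means, every spread is at most 1.1 percentage points, which is narrower than any of the reported 95% confidence intervals.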
