We first report classification performance across bandwidth conditions, modality, and emotional intensity. We then analyze permutation-based feature importance to identify the spectral regions contributing most strongly to emotion classification. Finally, we examine how bandwidth manipulation affects the geometry of the emotion space derived from model probability outputs.
3.1. Emotion Classification Performance
Classification performance was examined across bandwidth conditions, modality (speech vs. song), and emotional intensity (normal vs. strong). Overall accuracy was computed to summarize classifier performance across conditions. Statistical inference on classification accuracy was conducted using binomial generalized linear mixed-effects models to determine whether performance differed as a function of bandwidth and its interaction with modality and intensity.
To ensure that comparisons across bandwidth conditions reflected the same underlying stimuli, predictions were aggregated at the level of Run × Stimulus × Condition × Modality × Intensity × Sex × Emotion. This aggregation allows classification accuracy to be modeled as binomial counts while preserving the correspondence between conditions and avoiding pseudo-replication. The resulting dataset, therefore, reflects repeated evaluations of the same stimuli across bandwidth conditions within each cross-validation run.
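The aggregation step can be illustrated with a minimal sketch. The field names (`run`, `stimulus`, `condition`, `predicted`, and so on) and the toy rows are hypothetical; the source does not specify its data layout. The sketch only shows how trial-level predictions collapse into binomial counts per cell.

```python
from collections import defaultdict

# Hypothetical field names; the source does not name its variables.
KEYS = ("run", "stimulus", "condition", "modality", "intensity", "sex", "emotion")

def aggregate_binomial(predictions):
    """Collapse trial-level predictions into (n_correct, n_total) per cell."""
    cells = defaultdict(lambda: [0, 0])
    for row in predictions:
        key = tuple(row[k] for k in KEYS)
        cells[key][0] += int(row["predicted"] == row["emotion"])
        cells[key][1] += 1
    return {key: (correct, total) for key, (correct, total) in cells.items()}

def toy_row(condition, predicted):
    # One toy prediction for the same stimulus; only condition and prediction vary.
    return {"run": 1, "stimulus": "s01", "condition": condition,
            "modality": "speech", "intensity": "normal", "sex": "F",
            "emotion": "happy", "predicted": predicted}

rows = [toy_row("8 kHz", "happy"), toy_row("8 kHz", "fearful"),
        toy_row("Full", "happy"), toy_row("Full", "happy")]
counts = aggregate_binomial(rows)
for key, (correct, total) in sorted(counts.items()):
    print(key[2], correct, total)
```

Each resulting (correct, total) pair is the kind of count that a binomial mixed-effects model takes as its response, with the same stimulus appearing once per bandwidth condition.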
Table 2 summarizes mean classification accuracy for each Modality × Intensity × Bandwidth condition. Accuracy values were calculated after aggregating repeated cross-validated predictions at the stimulus level, so that repeated predictions for the same underlying recording were not treated as independent observations. Disgust and surprise were excluded because these categories were not available in both speech and song. Neutral items were retained where available but do not have strong-intensity counterparts in RAVDESS.
Figure 1 visualizes the same descriptive pattern, with mean classification accuracy shown across bandwidth conditions as a function of modality (speech vs. song), emotional intensity (normal vs. strong), and speaker sex (female vs. male). Classification accuracy is generally higher for song than for speech and for strong compared to normal emotional portrayals, while differences across bandwidth conditions appear comparatively modest.
In an earlier model comparison, adding bandwidth condition provided only weak evidence of improved model fit (χ²(3) = 6.66, p = 0.083), suggesting that spectral truncation had limited overall impact on classification performance. In the final mixed-effects model, however, Type III Wald tests indicated a statistically reliable but modest omnibus effect of bandwidth condition (χ²(3) = 8.10, p = 0.044). In this model, classification accuracy varied significantly as a function of modality, emotional intensity, speaker sex, and emotion category, while bandwidth effects remained modest and context-dependent. Modality exerted a large effect (χ²(1) = 83.88, p < 0.001), with higher accuracy for song than for speech, and emotional intensity also had a reliable effect (χ²(1) = 4.64, p = 0.031), with strong-intensity stimuli classified more accurately than normal-intensity stimuli. Classification accuracy additionally differed by speaker sex (χ²(1) = 8.57, p = 0.003), with higher accuracy for female speakers, and varied significantly across emotion categories (χ²(5) = 175.07, p < 0.001).
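For readers who wish to check how a Wald statistic maps onto its p-value, the following sketch computes the upper-tail probability of a chi-square distribution for integer degrees of freedom. It is an illustrative helper, not the software used for the reported analyses; the function name `chi2_sf` is our own.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    half = x / 2.0
    if df % 2:                      # odd df: start from the df = 1 closed form
        k, q = 1, math.erfc(math.sqrt(half))
    else:                           # even df: start from the df = 2 closed form
        k, q = 2, math.exp(-half)
    while k < df:
        # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# Statistics taken from the omnibus tests reported above.
for label, stat, df in [("bandwidth", 8.10, 3), ("intensity", 4.64, 1)]:
    print(f"{label}: chi2({df}) = {stat}, p = {chi2_sf(stat, df):.3f}")
```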
Importantly, the effect of bandwidth was not uniform across conditions. A significant bandwidth × modality interaction (χ²(3) = 8.01, p = 0.046) indicates that spectral reduction affects speech and song differently. Follow-up comparisons showed that bandwidth effects were small and non-monotonic. Within speech, no pairwise bandwidth contrast was reliable after correction, whereas within song, the only reliable contrast was between 8 kHz and Full (odds ratio = 1.235, SE = 0.085, z = 3.07, Holm-adjusted p = 0.0128), with higher classification odds at 8 kHz than at Full. Bandwidth effects also differed across speaker sex (χ²(3) = 39.20, p < 0.001), indicating that the impact of spectral reduction varies between male and female speakers. In contrast, there was no evidence that bandwidth effects differ systematically across emotion categories (χ²(15) = 19.90, p = 0.176). Descriptively, emotions such as anger and neutral tended to show higher classification accuracy, whereas sadness showed comparatively lower accuracy across conditions (see Figure 1). Full pairwise comparisons among emotion categories and interactions are reported in Supplementary Tables S2–S4.
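The 8 kHz versus Full contrast in song can be unpacked with a short worked calculation. The sketch assumes that the reported SE is on the odds-ratio scale (obtained by the delta method) and that the Holm family comprises the six pairwise contrasts among four bandwidth levels; neither assumption is stated explicitly in the text.

```python
import math

def wald_z_from_odds_ratio(odds_ratio, se_odds_ratio):
    """Wald z on the log-odds scale, assuming the SE is given on the odds-ratio scale."""
    log_or = math.log(odds_ratio)
    se_log_or = se_odds_ratio / odds_ratio   # delta method: SE(log OR) = SE(OR) / OR
    return log_or / se_log_or

z = wald_z_from_odds_ratio(1.235, 0.085)
p_raw = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal p-value
p_holm_smallest = min(1.0, 6 * p_raw)        # assumed family of six contrasts
print(f"z = {z:.2f}, raw p = {p_raw:.4f}, Holm-adjusted p = {p_holm_smallest:.4f}")
```

Under these assumptions, the recomputed z and Holm-adjusted p agree closely with the reported values (z = 3.07, Holm-adjusted p = 0.0128).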
Overall, these results show that classification performance is driven primarily by modality, emotional intensity, and emotion category, whereas the contribution of spectral bandwidth is modest and context-dependent, and not well described by a simple monotonic degradation account.
Because overall classification accuracy showed relatively modest sensitivity to bandwidth restriction, the next analysis examined whether bandwidth nevertheless altered the spectral cue-weighting strategies used by the classifier.
3.2. Spectral-Band Importance Across Conditions
To examine how the classifier relied on different spectral regions across conditions, we analyzed permutation-based feature-importance values using linear mixed-effects models. Because permutation-based feature-importance values were computed once per cross-validation training run and condition cell, they were constant across individual trial predictions within that cell. Accordingly, feature-importance values were aggregated to a single observation per Run × Bandwidth × Modality × Intensity × Frequency-band cell. For each cross-validation run and each Bandwidth × Modality × Intensity condition, we retained the four band-specific importance measures (Low, Mid, High, and Extended High Frequency [EHF]), reshaped the data to long format, and treated frequency band as a within-cell factor.
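The logic of permutation-based importance for a spectral band can be sketched as follows. This is a generic illustration, assuming importance is measured as the mean drop in accuracy when a band's feature columns are shuffled jointly across stimuli; the toy classifier, the toy data, and the column-to-band mapping are hypothetical and do not reproduce the feature set or Random Forest used here.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, columns, n_repeats=20, seed=0):
    """Mean accuracy drop when the given columns are shuffled jointly across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)
        X_perm = [list(row) for row in X]
        for i, j in enumerate(order):
            for c in columns:
                X_perm[i][c] = X[j][c]   # row i receives row j's values for this band
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats

# Toy data: column 0 stands in for a low band, column 1 for an EHF band.
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [row[0] for row in X]                # labels depend only on the "low" column
predict = lambda row: row[0]             # toy classifier that ignores column 1

importance_low = permutation_importance(predict, X, y, columns=[0])
importance_ehf = permutation_importance(predict, X, y, columns=[1])
print(importance_low, importance_ehf)
```

Because the toy classifier ignores the second column, shuffling it leaves accuracy unchanged and its importance is zero, whereas shuffling the first column degrades accuracy.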
Linear mixed-effects models were then fit with feature importance as the dependent variable and fixed effects for frequency band, bandwidth condition, modality, and emotional intensity, along with their interactions. A random intercept for cross-validation run was included to account for run-to-run variability in overall importance magnitude. Fixed effects were evaluated using Type III Wald χ² tests, and planned follow-up comparisons were conducted using estimated marginal means with Holm-adjusted pairwise contrasts.
Figure 2 shows the estimated permutation-based feature importance for each spectral band across bandwidth conditions, separated by modality (speech vs. song) and emotional intensity (normal vs. strong).
The y-axis shows estimated feature importance, scaled by 1000 for readability; larger values indicate that classification performance depends more strongly on that spectral region.
Across conditions, the classifier showed greatest reliance on low- and mid-frequency spectral information, whereas high-frequency and EHF bands contributed less under the most restricted bandwidth conditions. As bandwidth increased, however, the relative importance of higher-frequency bands also increased, particularly for the EHF region, indicating that the classifier drew on EHF information when it was available. These patterns were broadly similar across modalities and intensity levels, although the relative distribution of importance across spectral regions varied as a function of stimulus modality, emotional intensity, and bandwidth condition. Thus, broader bandwidth changed the model’s cue-weighting profile even when its effect on classification accuracy was modest.
Type III Wald χ² tests from the linear mixed-effects model revealed strong main effects of frequency band (χ²(3) = 264,220, p < 0.001), bandwidth condition (χ²(3) = 2427.9, p < 0.001), modality (χ²(1) = 1892.1, p < 0.001), and emotional intensity (χ²(1) = 1484.5, p < 0.001), indicating that overall permutation-based feature importance varied substantially across spectral regions, bandwidth levels, stimulus modality, and portrayal strength. Consistent with the descriptive patterns noted above, the model showed greatest dependence on low-frequency bands, followed by mid-frequency bands, with comparatively smaller contributions from high-frequency and EHF regions.
Several two-way interactions further clarified these effects. A significant condition × modality interaction (χ²(3) = 7.99, p = 0.046) indicated that bandwidth-related changes in feature importance differed somewhat between speech and song stimuli. A significant condition × intensity interaction (χ²(3) = 20.79, p < 0.001) indicated that bandwidth effects also varied as a function of emotional strength. The modality × intensity interaction was likewise significant (χ²(1) = 705.84, p < 0.001), demonstrating that the overall magnitude of feature importance differed across modality–intensity combinations.
Frequency band also interacted strongly with the experimental factors. The band × condition interaction was highly significant (χ²(9) = 77,224, p < 0.001), indicating that bandwidth expansion altered the relative importance of specific spectral regions. In addition, both band × modality (χ²(3) = 12,112, p < 0.001) and band × intensity (χ²(3) = 7157.4, p < 0.001) interactions were significant, demonstrating that the distribution of importance across spectral bands differed systematically between speech and song stimuli and between normal and strong emotional portrayals.
Higher-order interactions were also statistically reliable. The condition × modality × intensity interaction (χ²(3) = 13.57, p = 0.0035) indicated that bandwidth effects varied across specific combinations of modality and emotional intensity. Moreover, all higher-order interactions involving frequency band were significant: band × condition × modality (χ²(9) = 2229.6, p < 0.001), band × condition × intensity (χ²(9) = 402.42, p < 0.001), band × modality × intensity (χ²(3) = 3864.5, p < 0.001), and the four-way band × condition × modality × intensity interaction (χ²(9) = 1033.3, p < 0.001). These effects indicate that bandwidth-related changes in spectral reliance depend jointly on modality and emotional intensity and that the way bandwidth reshapes the importance of specific spectral regions differs across these conditions.
Follow-up comparisons using estimated marginal means were conducted to further characterize these effects. Pairwise contrasts comparing bandwidth conditions within each spectral band and Modality × Intensity cell are reported in Supplementary Tables S6–S9. These comparisons show that bandwidth expansion is associated with systematic changes in feature importance across spectral regions. In general, low-frequency bands exhibit the highest importance in the most spectrally restricted conditions, whereas the relative contribution of high-frequency and extended high-frequency bands increases as additional bandwidth becomes available.
These results indicate that the classifier’s reliance on spectral information is strongly structured by frequency band and varies systematically with bandwidth. Importantly, bandwidth did not simply increase or decrease performance uniformly; instead, it redistributed the spectral regions on which the classifier relied, and these changes also depended on modality, emotional intensity, and their interaction. The model depended most strongly on low-frequency information overall, whereas higher-frequency and EHF regions gained importance when they were available.
These feature-importance patterns suggest that bandwidth manipulation alters the relative contribution of different spectral regions even when overall classification performance changes only modestly. The next analysis, therefore, examined whether these shifts in spectral cue weighting were accompanied by changes in the representational geometry of the classifier’s emotion space.
3.3. Emotion-Space Structure
While the preceding analyses identify which spectral regions show the greatest importance for emotion classification, they do not reveal how these acoustic cues shape the organization of emotion categories in the model’s output space. Changes in spectral reliance across bandwidth conditions may influence not only overall classification accuracy but also the structure of the classifier’s internal representation of emotions. In particular, if certain spectral regions are important for distinguishing specific emotional contrasts, restricting bandwidth may compress or distort relationships among emotion categories in the model’s posterior-probability space. To evaluate this possibility, the next analysis examines the geometry of emotion representations derived from the classifier’s probability outputs.
To do so, we analyzed the geometry of the model’s posterior-probability space. For each stimulus, the Random Forest classifier outputs a probability vector over emotion categories, reflecting the model’s graded confidence that the stimulus belongs to each emotion class. These probability vectors provide a representation of stimuli within a multivariate “emotion space,” where distances between vectors reflect the similarity of emotional representations learned by the classifier. Changes in this representational structure across bandwidth conditions would indicate that spectral truncation affects not only classification accuracy but also the organization of emotion categories in the model’s output space.
We evaluated this question at two complementary levels. First, we tested whether the multivariate distribution of stimulus-level probability vectors differed across bandwidth conditions using permutational multivariate analysis of variance (PERMANOVA). Second, we examined the geometry of emotion-category prototypes derived from these probabilities, quantifying both overall emotion-space separation and pairwise distances among emotion categories across bandwidth conditions.
3.3.1. Statistical Test of Stimulus-Level Emotion Space
To formally test whether spectral bandwidth alters the structure of the classifier’s stimulus-level probability space, we conducted permutational multivariate analysis of variance (PERMANOVA). This analysis evaluates whether the multivariate distribution of stimulus-level probability vectors differs across bandwidth conditions within each Modality × Intensity cell. For each stimulus and bandwidth condition, posterior class-probability vectors were averaged across cross-validation runs so that each stimulus contributed a single probability vector per bandwidth level, preventing pseudo-replication arising from repeated predictions across cross-validation folds.
Euclidean distance matrices were computed from these stimulus-averaged probability vectors. Because some condition cells lacked particular emotion categories, the analysis was restricted to the emotion dimensions shared across all conditions (calm, happy, sad, angry, fearful), ensuring that distances were computed in comparable probability spaces across conditions. Separate PERMANOVA models were then run within each Modality × Intensity cell using the model distance ~ bandwidth. This approach isolates the effect of spectral bandwidth while holding modality and emotional intensity constant, avoiding higher-order interactions that would be difficult to interpret in a single global model. Permutations were restricted within stimulus identity (strata = stimulus) to preserve the repeated-measures structure across bandwidth conditions.
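The structure of this test can be sketched in a few lines. The code below is a simplified, self-contained illustration of a one-way PERMANOVA on Euclidean distances with permutations restricted within strata; it is not the routine used for the reported analyses, and the function names and toy probability vectors are hypothetical.

```python
import random
from itertools import combinations

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pseudo_f(points, groups):
    """One-way PERMANOVA pseudo-F and R^2 from squared Euclidean distances."""
    n = len(points)
    levels = sorted(set(groups))
    ss_total = sum(sq_dist(points[i], points[j])
                   for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in levels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(sq_dist(points[i], points[j])
                         for i, j in combinations(idx, 2)) / len(idx)
    ss_between = ss_total - ss_within
    f = (ss_between / (len(levels) - 1)) / (ss_within / (n - len(levels)))
    return f, ss_between / ss_total

def permanova_strata(points, groups, strata, n_perm=999, seed=0):
    """Permutation p-value with group labels shuffled only within each stratum."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(points, groups)
    members = {}
    for i, s in enumerate(strata):
        members.setdefault(s, []).append(i)
    hits = 0
    for _ in range(n_perm):
        permuted = list(groups)
        for idx in members.values():
            labels = [groups[i] for i in idx]
            rng.shuffle(labels)
            for i, lab in zip(idx, labels):
                permuted[i] = lab
        f_perm = pseudo_f(points, permuted)[0]
        hits += f_perm >= f_obs * (1 - 1e-9)   # small tolerance for float ties
    return f_obs, r2, (hits + 1) / (n_perm + 1)

# Toy data: six stimuli, each with a two-class probability vector per condition.
points, groups, strata = [], [], []
for s in range(6):
    base = 0.2 + 0.05 * s
    for cond, shift in (("8 kHz", 0.10), ("Full", 0.0)):
        p_first = base + shift
        points.append([p_first, 1.0 - p_first])
        groups.append(cond)
        strata.append(s)

f_obs, r2, p_value = permanova_strata(points, groups, strata)
print(f"pseudo-F = {f_obs:.3f}, R2 = {r2:.3f}, p = {p_value:.3f}")
```

Shuffling labels only within a stimulus keeps each stimulus represented once per bandwidth condition in every permutation, which is what preserves the repeated-measures structure.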
PERMANOVA analyses provided no reliable evidence that bandwidth altered the global structure of the stimulus-level probability space. Within each Modality × Intensity cell, bandwidth effects were not statistically significant after correction for multiple comparisons (all Holm-corrected p ≥ 0.080; see Supplementary Table S10). The only nominal effect occurred for song stimuli at normal intensity (pseudo-F = 0.012, R² = 1.91 × 10⁻⁵, p = 0.020), but this effect did not survive Holm correction (p(Holm) = 0.080) and explained a negligible proportion of variance (R² = 0.002%). The remaining conditions (speech–normal, speech–strong, song–strong) showed no evidence that bandwidth altered the multivariate distribution of stimulus representations (all p ≥ 0.368). Tests of homogeneity of multivariate dispersion (betadisper with permutation testing) were not significant across any condition (p = 0.979–0.997), indicating that the results are not driven by differences in within-condition dispersion.
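The Holm step-down correction applied across the four Modality × Intensity cells can be sketched as follows. Only the nominal p = 0.020 is taken from the text; the other three raw p-values are hypothetical placeholders chosen to satisfy p ≥ 0.368.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)   # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# song-normal p = 0.020 is reported; the remaining values are hypothetical.
raw = [0.020, 0.368, 0.700, 0.900]
adjusted = holm_adjust(raw)
print([round(p, 3) for p in adjusted])
```

With four tests, the smallest raw p-value is multiplied by four, which reproduces the reported Holm-corrected value of 0.080 for the song, normal-intensity cell.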
3.3.2. Overall Emotion-Space Separation
In addition to the inferential test described above, we examined how bandwidth manipulation affected the global separability of emotion categories in the classifier’s output space. Emotion prototypes were computed separately within each cross-validation run for each Bandwidth × Modality × Intensity condition by averaging stimulus-level probability vectors within each true emotion category. These prototypes represent centroid representations of each emotion in the model’s posterior-probability space.
Pairwise distances among emotion centroids were then computed. Cosine distance was used as the primary metric because it captures similarity in the pattern of class-probability distributions independent of magnitude differences. Euclidean distance was also computed as a complementary metric sensitive to absolute probability separation, and both measures produced qualitatively similar results. For interpretability, we summarize cosine distances in the main analyses.
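A minimal sketch of the prototype and distance computation is given below. The toy probability vectors and the three-emotion layout are hypothetical; the sketch only illustrates averaging stimulus-level vectors into centroids and taking cosine distances between them.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Element-wise mean of a list of equal-length probability vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(prototypes):
    """Mean cosine distance over all unordered pairs of emotion prototypes."""
    pairs = list(combinations(sorted(prototypes), 2))
    return sum(cosine_distance(prototypes[a], prototypes[b]) for a, b in pairs) / len(pairs)

# Toy probability vectors over (angry, happy, sad), keyed by true emotion.
stimuli = {
    "angry": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "happy": [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1]],
    "sad":   [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7]],
}
prototypes = {emotion: centroid(vectors) for emotion, vectors in stimuli.items()}
separation = mean_pairwise_distance(prototypes)
print(round(separation, 3))
```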
Overall emotion-space separation was quantified as the mean pairwise distance among emotion centroids within each condition. To isolate the effect of spectral truncation, distance changes were expressed relative to the Full-bandwidth condition within each run and Modality × Intensity cell (Δ-distance). Uncertainty was estimated using bootstrap confidence intervals across cross-validation runs.
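The Δ-distance and its uncertainty can be sketched as follows, assuming a percentile bootstrap over cross-validation runs; the text does not state which bootstrap variant was used, and the per-run separation values here are toy numbers.

```python
import random

def delta_from_full(separation_by_run, condition):
    """Per-run change in separation relative to the Full-bandwidth condition."""
    return [run[condition] - run["Full"] for run in separation_by_run]

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling runs with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int(n_boot * alpha / 2)]
    upper = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Toy mean pairwise distances per cross-validation run (hypothetical values).
runs = [
    {"Full": 0.500, "8 kHz": 0.492},
    {"Full": 0.505, "8 kHz": 0.495},
    {"Full": 0.498, "8 kHz": 0.492},
    {"Full": 0.502, "8 kHz": 0.493},
    {"Full": 0.499, "8 kHz": 0.492},
]
deltas = delta_from_full(runs, "8 kHz")
lower, upper = bootstrap_ci(deltas)
print(f"mean delta = {sum(deltas) / len(deltas):.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```

Computing the difference within each run before resampling keeps the comparison paired, so run-to-run variation in overall separation does not inflate the interval.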
Figure 3 shows the change in overall emotion-space separation relative to the Full-bandwidth condition. Negative values indicate compression of the emotion space (reduced separability among emotions), whereas positive values indicate expansion.
The largest compression occurred for speech stimuli at normal intensity in the 8 kHz condition, indicating that severe bandwidth restriction reduced overall separability among emotions in this cell. Specifically, mean pairwise cosine distance was reduced relative to Full bandwidth at 8 kHz, Δ = −0.008, 95% bootstrap CI [−0.010, −0.006]. More moderate compression was observed at 12 kHz, Δ = −0.003, 95% CI [−0.005, −0.002], while distances at 16 kHz remained close to the Full-bandwidth baseline, Δ = 0.001, 95% CI [−0.000, 0.002]. In contrast, song stimuli and strong emotional portrayals showed little change in global emotion-space separation, with Δ-distance values remaining near zero across bandwidth conditions.
These results indicate that bandwidth reduction has only modest effects on the global separability of emotion categories: measurable changes in emotion-space structure were limited to the most severe bandwidth restriction and primarily affected speech stimuli at normal intensity.
3.3.3. Pairwise Emotion Contrasts
To determine which emotion relationships accounted for these global patterns, we next examined changes in pairwise distances between individual emotion categories. Pairwise cosine distances were computed for all emotion pairs within each condition and compared with the corresponding distances in the Full-bandwidth condition.
Figure 4 shows changes in separability for selected emotion pairs relative to the Full-bandwidth baseline. The most pronounced compression was observed for the fearful–happy contrast in speech at 8 kHz. Specifically, in normal-intensity speech, the fearful–happy contrast showed the largest reduction in pairwise cosine distance at 8 kHz, Δ = −0.048, 95% bootstrap CI [−0.053, −0.043], indicating that these two emotion categories became less separable under severe bandwidth restriction. Additional localized changes in that same cell were observed for contrasts including angry–happy, Δ = −0.027, 95% CI [−0.031, −0.022], happy–sad, Δ = −0.014, 95% CI [−0.019, −0.009], angry–sad, Δ = 0.009, 95% CI [0.006, 0.012], and angry–fearful, Δ = 0.007, 95% CI [0.003, 0.012], although these were smaller than the fearful–happy effect, and most contrasts returned toward baseline at 12–16 kHz. Thus, bandwidth restriction did not uniformly compress all emotion relationships; instead, it selectively reduced separability for specific contrasts, most notably fearful–happy, while slightly increasing separability for others.
To further characterize the strongest local distortions, we identified emotion-pair contrasts whose bootstrap confidence intervals excluded zero. These effects were concentrated primarily in the 8 kHz condition, particularly for speech at normal intensity. However, the number of affected contrasts was limited, and most emotion relationships remained stable across bandwidth conditions.
Overall, these results indicate that spectral bandwidth reduction produces localized distortions in specific emotion contrasts rather than large-scale restructuring of the overall emotion space. The clearest effects were concentrated under the most severe bandwidth restriction and were most evident for speech at normal intensity.