1. Introduction
Understanding speech in the presence of competing talkers, often referred to as the cocktail party problem [1], relies heavily on spatial hearing mechanisms. Listeners with normal hearing (NH) routinely benefit from spatial release from masking (SRM), defined as improved speech recognition when a target and masker differ in spatial location. Classical psychoacoustic models attribute SRM to two primary components: better-ear listening, arising from acoustic head shadow effects that improve the signal-to-noise ratio (SNR) at one ear, and binaural unmasking, which leverages interaural time and level differences (ITDs and ILDs) to enhance segregation of competing sources [2,3]. Experimental work consistently shows that NH listeners can obtain substantial SRM, often 6–10 dB or more, depending on masker type, acoustic environment, and spatial configuration [4,5,6,7]. SRM also interacts with contextual factors such as reverberation, which disproportionately reduces energetic SRM while leaving informational masking benefits partially intact [5,8].
For cochlear implant (CI) users, speech-in-noise perception remains one of the most persistent functional limitations. CI processors transmit temporal envelope cues relatively well but degrade fine structure and low-frequency timing cues essential for accurate ITD-based binaural processing [9]. As a result, SRM for CI listeners is often significantly smaller than that observed for NH individuals, particularly when maskers are symmetrically arranged or at smaller spatial separations requiring precise binaural analysis [3,9,10]. In contrast, electric–acoustic stimulation (EAS; also called a hybrid cochlear implant, HCI), which combines a CI for high frequencies with preserved acoustic hearing for low frequencies, has demonstrated the ability to restore access to low-frequency temporal fine structure, ITDs, and other cues that strongly support spatial hearing and speech-in-noise performance. Large-scale clinical studies report improved speech perception, better subjective spatial hearing, and greater sound quality in EAS users than in CI listeners, provided that low-frequency hearing is adequately preserved [10,11].
To better understand the mechanisms underlying these effects, many studies employ vocoder simulations, which enable systematic manipulation of spectral resolution, channel interactions, spatial geometry, and masker type while approximating key characteristics of CI or EAS processing [12,13]. Critically, binaural advantages can persist even when spectral resolution is reduced by vocoding, provided that interaural cues are preserved. Listeners have been shown to obtain nearly normal binaural benefits with spectrally degraded speech when target and interferers were spatially separated, reinforcing the central role of binaural cues in SRM under CI-like processing [14]. Recent simulation work has further shown that spectral smearing, a correlate of channel interaction, elevates speech reception thresholds and modulates the magnitude of SRM, highlighting the dependence of spatial advantages on spectrotemporal fidelity [15]. EAS simulations, which combine low-pass acoustic speech with vocoded high-frequency bands, consistently demonstrate better masking release and spatial cue access than electric-only simulations, reflecting the critical role of residual low-frequency information [16].
A key knowledge gap concerns small spatial separations. Historically, studies of SRM in CI and EAS listeners have used large separations (e.g., ±90°) that maximize head shadow benefits but do not reflect everyday communication, where talkers are often only a few degrees apart. Recent work using simulated CI speech with small spatial separations (e.g., ±2° to ±30°) showed that NH listeners exhibit significantly reduced SRM for vocoded speech relative to natural speech, reflecting limitations imposed by poor spectral resolution and degraded interaural cues [7]. Work with children with bilateral CIs and their NH peers has emphasized cue trade-offs between head shadow and interaural differences and provides a functionally relevant metric that could be extended to adult simulations [17]. These findings underscore that when angular separation is small, head shadow advantages alone are insufficient; instead, listeners depend heavily on fine-grained binaural cues, precisely the cues that EAS seeks to restore [2,16].
Environmental factors further constrain SRM. Reverberation reduces interaural coherence and smears amplitude envelopes, producing stronger decrements in spatial benefits for CI and EAS users than for NH listeners [8,18]. Likewise, device-level factors such as behind-the-ear microphone placement can distort spatial cues before they ever reach binaural pathways [9]. Another critical layer involves electric–acoustic interactions. Psychoacoustic studies show that electric stimulation can elevate thresholds for acoustic probe tones within overlapping frequency regions, reducing the benefit of low-frequency hearing unless fittings are carefully optimized [19]. Computational modeling work reveals that electric–acoustic interactions influence neural firing synchrony, phase locking, and dynamic range, providing mechanistic explanations for why EAS sometimes yields robust benefits and other times does not [20,21].
Although the present study employs NH listeners, the primary goal is not to directly generalize outcomes to CI or EAS users. Rather, NH listeners are used as a mechanistic model to isolate the effects of spectral degradation and partial restoration of low-frequency acoustic cues on SRM under tightly controlled conditions. This approach minimizes confounding influences common in clinical populations, such as auditory deprivation, neural plasticity, etiology, electrode placement variability, and device-specific fitting strategies. Vocoder-based simulations in NH listeners have therefore been widely used to estimate process-level constraints and theoretical upper bounds on spatial hearing performance with CI and EAS processing, while acknowledging that real-world clinical outcomes may differ.
The present study builds on this foundation by quantifying SRM at small, conversationally relevant separations using natural speech, simulated CI speech, and simulated EAS in NH listeners. By holding the acoustic scene constant while manipulating listening mode, the study investigated the incremental contribution of low-frequency acoustic cues to SRM at small target-to-masker separations, where head shadow benefits alone are minimal. It is hypothesized that simulated EAS will yield lower speech identification thresholds and greater SRM than simulated CI speech but will not fully match natural speech performance. It is also expected that the differences among listening modes will reflect not only energetic SNR advantages but also sensitivity to perceptual segregation cues.
2. Methods
2.1. Listeners
Twenty-two young adults with normal hearing (mean age = 21.3 years; age range: 19–23 years) participated in the study. Air-conduction audiometric thresholds were obtained for all participants, and normal hearing was confirmed as thresholds ≤15 dB HL at octave frequencies from 250 to 8000 Hz. None of the participants demonstrated audiometric asymmetry, defined as interaural threshold differences greater than 10 dB HL. All study procedures were reviewed and approved by Towson University’s Institutional Review Board, and participants received financial compensation for their time.
2.2. Stimuli
Three male talkers from the Coordinate Response Measure (CRM [22]) corpus were used in the experiment. CRM sentences follow the fixed carrier phrase: “Ready [CALL SIGN] go to [COLOR] [NUMBER] now.” The corpus includes eight possible call signs (Arrow, Baron, Charlie, Eagle, Hopper, Laker, Ringo, and Tiger), four colors (Blue, Red, White, and Green), and eight numbers (1–8), resulting in 256 unique combinations. On each trial, listeners heard one target sentence, identified by the call sign “Charlie”, presented concurrently with two masker sentences that used different call signs. Each talker produced a distinct color–number combination, and listeners responded by selecting the color and number combination for the call sign “Charlie”.
The CRM corpus was selected because it provides tightly controlled speech materials that minimize semantic predictability while producing robust informational masking. Although the response set is constrained, chance performance is low (1/32), and the adaptive threshold-tracking procedure targets performance well above chance, making systematic guessing unlikely to influence threshold estimation. The CRM corpus has been extensively validated in multi-talker and spatial-hearing research, including studies of spatial release from masking in normal-hearing listeners and cochlear implant users [6,7,8,9,23].
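The corpus size and chance level quoted above follow directly from the response set. As a small illustrative check (the variable names are ours, not part of the CRM materials), the counts can be enumerated as:

```python
from itertools import product

CALL_SIGNS = ["Arrow", "Baron", "Charlie", "Eagle", "Hopper", "Laker", "Ringo", "Tiger"]
COLORS = ["Blue", "Red", "White", "Green"]
NUMBERS = list(range(1, 9))

# Every call sign x color x number sentence described for the corpus.
sentences = list(product(CALL_SIGNS, COLORS, NUMBERS))
n_sentences = len(sentences)

# A response is a single color-number pair, so a pure guess matches one
# of these alternatives.
responses = list(product(COLORS, NUMBERS))
chance = 1 / len(responses)

print(n_sentences, chance)
```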
For the natural, simulated CI, and simulated EAS speech conditions, the target and masker signals were first convolved with location-specific head-related impulse responses (HRIRs) to generate stimuli containing appropriate binaural spatial cues. In the natural-speech conditions, the resulting direction-dependent signals were summed and presented binaurally over headphones. In the simulated CI and simulated EAS conditions, the direction-dependent signals were summed and then processed through the corresponding simulator before being delivered over headphones. Across all listening conditions, the target level remained fixed while the masker level was adaptively varied on each trial.
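The spatialization step can be sketched as follows. This is a minimal illustration rather than the study's MATLAB code: the function names are hypothetical, the TMR is assumed to be defined relative to each individual masker's RMS level, and the level scaling is assumed to be applied before convolution.

```python
import numpy as np
from scipy.signal import fftconvolve


def rms(x):
    return np.sqrt(np.mean(np.square(x)))


def spatialize(signal, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair -> (n, 2) array."""
    left = fftconvolve(signal, hrir_left)
    right = fftconvolve(signal, hrir_right)
    return np.column_stack([left, right])


def mix_scene(target, maskers, hrirs, tmr_db):
    """Sum a spatialized target and maskers at a given TMR (dB).

    target  : 1-D array (its level stays fixed)
    maskers : list of 1-D arrays, same length as target
    hrirs   : list of (left, right) HRIR pairs, target first; equal lengths
    """
    gain = 10.0 ** (-tmr_db / 20.0)  # a higher TMR means quieter maskers
    scene = spatialize(target, *hrirs[0])
    for masker, (h_left, h_right) in zip(maskers, hrirs[1:]):
        # Equalize the masker to the target RMS, then apply the TMR offset.
        scaled = masker * (rms(target) / rms(masker)) * gain
        scene = scene + spatialize(scaled, h_left, h_right)
    return scene
```

For the colocated condition, every entry of `hrirs` would be the 0° pair; for a separated condition, the two masker entries would be the pairs for the left and right masker azimuths.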
2.3. Cochlear Implant Simulation
Spectral degradation was introduced using a noise-band vocoder, a widely used method for simulating cochlear implant (CI) signal processing [24]. An eight-channel vocoder configuration was selected because it yields speech-recognition outcomes comparable to those achieved by high-performing CI users [25,26]. The input bandwidth was restricted to 150–8000 Hz, after which the stimuli were divided into eight frequency bands using fourth-order Butterworth filters (24 dB/octave). Band cutoff frequencies were assigned according to the Greenwood frequency-position mapping [27], ensuring a distribution that reflects cochlear tonotopy. Within each band, the temporal envelope was extracted through half-wave rectification followed by low-pass filtering at 160 Hz (24 dB/octave). These envelopes were then used to modulate corresponding band-limited noise carriers. Finally, all eight modulated signals were summed to produce the noise-vocoded stimuli, which were presented bilaterally to simulate CI listening conditions.
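The processing chain described above can be sketched in Python with SciPy. This is an illustrative approximation, not the study's implementation: the Greenwood constants (A = 165.4, a = 2.1, k = 0.88) are the commonly used human values and are assumed here, the filters are applied causally, the output RMS is matched to the input as a convenience, and the function names are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def greenwood_edges(f_lo, f_hi, n_bands):
    """Band edges (Hz) equally spaced in cochlear place (Greenwood map)."""
    A, a, k = 165.4, 2.1, 0.88  # commonly used human constants (assumed)
    place_lo = np.log10(f_lo / A + k) / a
    place_hi = np.log10(f_hi / A + k) / a
    places = np.linspace(place_lo, place_hi, n_bands + 1)
    return A * (10.0 ** (a * places) - k)


def rms(x):
    return np.sqrt(np.mean(np.square(x)))


def noise_vocode(x, fs, f_lo=150.0, f_hi=8000.0, n_bands=8,
                 env_cut=160.0, seed=0):
    """Noise-band vocode a mono float signal x sampled at fs (Hz)."""
    rng = np.random.default_rng(seed)
    edges = greenwood_edges(f_lo, f_hi, n_bands)
    env_sos = butter(4, env_cut, btype="low", fs=fs, output="sos")
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        # butter(4, ..., btype="band") gives 24 dB/octave skirts on each side.
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(band_sos, x)
        # Half-wave rectify, then low-pass to obtain the temporal envelope.
        env = np.maximum(sosfilt(env_sos, np.maximum(band, 0.0)), 0.0)
        # Band-limited noise carrier modulated by the envelope.
        carrier = sosfilt(band_sos, rng.standard_normal(len(x)))
        out += env * carrier
    # Match the overall RMS of the input (an assumption, not from the text).
    level = rms(out)
    return out * (rms(x) / level) if level > 0 else out
```

A binaural stimulus would be processed one ear at a time, for example `np.column_stack([noise_vocode(s[:, 0], fs), noise_vocode(s[:, 1], fs)])`; whether the two ears shared noise carriers in the study is not stated, so that choice is left open here.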
2.4. Electric–Acoustic Simulation
To simulate EAS, the input signal was first bandwidth-limited to 150–7000 Hz and divided into eight analysis bands using fourth-order Butterworth band-pass filters. The two lowest-frequency vocoder channels were then replaced with a low-pass-filtered version of the original speech (cutoff = 500 Hz; fourth-order Butterworth), approximating the range of residual acoustic hearing typically preserved in EAS users. This approach follows established methods in the literature [28,29,30] and reflects patterns of low-frequency hearing retention commonly reported in cochlear implant recipients [31,32]. The remaining six channels underwent standard noise-band vocoder processing, in which the temporal envelope of each band was extracted and used to modulate a band-limited noise carrier. Finally, the low-frequency acoustic component, which replaced the two lowest channels, and the six vocoded high-frequency channels were combined to generate the simulated EAS stimuli.
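A corresponding sketch of the EAS simulation is given below, again as an approximation under stated assumptions rather than the study's code. It assumes Greenwood-spaced band edges with the commonly used human constants, the same 160 Hz envelope filter as in the CI simulation, and causal filtering; `simulate_eas` and its parameters are hypothetical names. With these assumed constants, the upper edge of the second channel falls close to the 500 Hz low-pass cutoff.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def greenwood_edges(f_lo, f_hi, n_bands):
    """Band edges (Hz) equally spaced in cochlear place (Greenwood map)."""
    A, a, k = 165.4, 2.1, 0.88  # commonly used human constants (assumed)
    place_lo = np.log10(f_lo / A + k) / a
    place_hi = np.log10(f_hi / A + k) / a
    places = np.linspace(place_lo, place_hi, n_bands + 1)
    return A * (10.0 ** (a * places) - k)


def simulate_eas(x, fs, f_lo=150.0, f_hi=7000.0, n_bands=8, n_acoustic=2,
                 lp_cut=500.0, env_cut=160.0, seed=0):
    """Low-pass acoustic speech plus noise-vocoded upper channels."""
    rng = np.random.default_rng(seed)
    edges = greenwood_edges(f_lo, f_hi, n_bands)

    # Acoustic component: low-pass-filtered speech replaces the lowest channels.
    lp_sos = butter(4, lp_cut, btype="low", fs=fs, output="sos")
    out = sosfilt(lp_sos, x)

    # Electric component: vocode only the channels above the acoustic range.
    env_sos = butter(4, env_cut, btype="low", fs=fs, output="sos")
    for lo, hi in zip(edges[n_acoustic:-1], edges[n_acoustic + 1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(band_sos, x)
        env = np.maximum(sosfilt(env_sos, np.maximum(band, 0.0)), 0.0)
        carrier = sosfilt(band_sos, rng.standard_normal(len(x)))
        out = out + env * carrier
    return out
```

No level matching between the acoustic and vocoded parts is attempted here; how the two components were balanced in the study is not specified in the text.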
2.5. Conditions
A virtual auditory spatial array was used to present all speech stimuli. Head-related impulse responses (HRIRs) were generated following the procedures described by [33], which use an image-model-based simulation to compute the directions, delays, and attenuations of early reflections [34]. These reflections, together with the direct path, were then spatially rendered using non-individualized head-related transfer functions (HRTFs). This simulation method has been shown to yield HRIRs that closely approximate the physical and perceptual characteristics of those measured in real acoustic environments. Five spatial configurations were tested: a colocated condition in which the target and both maskers were presented from 0° azimuth, and four spatially separated conditions with the target fixed at 0° and the maskers symmetrically positioned at ±5°, ±10°, ±15°, or ±30°. The HRIR set used in this study was the same as that employed in previous work examining the effects of small spatial separations between the target and the maskers on SRM using simulated CI speech [7].
2.6. Procedure
All participants were seated in a double-walled, sound-treated audiology booth in the Spatial Hearing and Auditory PErception (SHAPE) laboratory at Towson University. Auditory stimuli were delivered through circumaural headphones (Sennheiser HD 650; Sennheiser, Hanover, Germany). Stimulus generation was performed in MATLAB (MathWorks Inc., Natick, MA, USA), and signals were presented via a Lynx Hilo audio interface (Lynx Studio Technology, Costa Mesa, CA, USA).
A one-up/one-down adaptive procedure [35], based on the accuracy of reporting the color–number combination of the target sentence, was used to estimate the target-to-masker ratio (TMR) required for 50% correct identification of the target sentence. In all trial blocks, the target speech was presented in the presence of competing masking speech. The target speech was presented at 20 dB above the pure-tone average (PTA) across 0.5, 1, 2, and 4 kHz, and the masker levels were adjusted to achieve the required TMR. After each correct response, the masker level was increased by 5 dB; following each incorrect response, it was decreased by 5 dB. After the first three reversals, the step size was reduced to 1 dB. Each adaptive track included nine reversals, and the TMR threshold was calculated as the mean of the final six reversals. Stimulus type varied randomly across trial blocks, and the threshold estimates were averaged across three adaptive tracks for each of the spatial separations tested. Participants entered their responses on a computer monitor positioned directly in front of them. After each trial, feedback (“Correct” or “Incorrect”) was provided. Testing was self-paced, and listeners were encouraged to take breaks as needed to minimize fatigue and attentional effects. All testing was completed within a single experimental session lasting approximately two hours. These procedures were implemented to ensure stable performance.
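The tracking rule described above can be sketched as a short simulation. This is an illustrative sketch only: the starting TMR of 10 dB is an assumption (the text does not give one), and `run_track` and the `is_correct` callback, which stands in for a listener's response on one trial, are hypothetical names.

```python
def run_track(is_correct, start_tmr=10.0, big_step=5.0, small_step=1.0,
              n_big_reversals=3, n_reversals=9, n_average=6, max_trials=1000):
    """One-up/one-down track on TMR (dB).

    is_correct(tmr) -> bool simulates the response at a given TMR.
    Returns (threshold, list of TMRs at which reversals occurred).
    """
    tmr = start_tmr
    reversals = []
    last_direction = 0  # -1: TMR lowered (harder), +1: TMR raised (easier)
    for _ in range(max_trials):
        # Correct -> masker up (TMR down); incorrect -> masker down (TMR up).
        direction = -1 if is_correct(tmr) else +1
        if last_direction != 0 and direction != last_direction:
            reversals.append(tmr)
            if len(reversals) == n_reversals:
                break
        # 5 dB steps until three reversals have occurred, then 1 dB steps.
        step = big_step if len(reversals) < n_big_reversals else small_step
        tmr += direction * step
        last_direction = direction
    if len(reversals) < n_reversals:
        raise RuntimeError("track did not reach the required reversals")
    threshold = sum(reversals[-n_average:]) / n_average
    return threshold, reversals


# Toy deterministic listener: always correct at or above 0 dB TMR.
threshold, reversals = run_track(lambda tmr: tmr >= 0.0)
```

Because a one-up/one-down rule steps down after every correct response and up after every incorrect one, the track converges on the level at which the two are equally likely, which is why it estimates the 50% point.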
2.7. Data Analysis
All statistical analyses were conducted using SPSS version 28.0 (IBM Corp., Armonk, NY, USA). Repeated-measures analyses of variance (RM-ANOVAs) were employed to examine differences in speech identification thresholds across spatial-separation conditions for natural, simulated CI, and simulated EAS speech. In addition, Pearson correlation analyses were performed to assess the relationships among identification thresholds for the three types of speech stimuli at each spatial separation tested.
3. Results
Figure 1 shows the mean target-to-masker ratios (±1 standard error of the mean) required to identify the target signal 50% of the time at the five spatial separations tested in this experiment. Within each panel, the darker line indicates the mean thresholds, while the lighter lines indicate the thresholds of individual listeners. An RM-ANOVA was conducted with stimulus type (natural, EAS, and CI-simulated speech) and spatial separation (0°, ±5°, ±10°, ±15°, and ±30°) as within-subject factors and TMR threshold as the dependent variable. Mauchly’s test indicated that the assumption of sphericity had been violated for spatial separation (χ²(9) = 24.18, p = 0.004); therefore, degrees of freedom were corrected using Greenhouse–Geisser estimates of sphericity (ε = 0.57). Results indicated a significant main effect of stimulus type (F(2, 40) = 462.13, p < 0.001, partial η² = 0.96, indicating a very large effect) and a significant main effect of spatial separation (F(2.29, 45.70) = 178.90, p < 0.001, partial η² = 0.90, indicating a very large effect) on TMR thresholds. There was also a significant interaction between stimulus type and spatial separation on TMR thresholds (F(8, 160) = 21.02, p < 0.001, partial η² = 0.51, indicating a large effect).
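As a rough check on the corrected degrees of freedom for spatial separation, the Greenhouse–Geisser adjustment multiplies both the effect and the error degrees of freedom by ε. The sketch below assumes uncorrected values of 4 and 80, as in the follow-up ANOVAs reported for each stimulus type, and uses the rounded ε = 0.57:

```latex
\begin{align*}
df_{\text{effect}}^{\,\text{GG}} &= \varepsilon \cdot df_{\text{effect}} \approx 0.57 \times 4 = 2.28,\\
df_{\text{error}}^{\,\text{GG}}  &= \varepsilon \cdot df_{\text{error}} \approx 0.57 \times 80 = 45.6.
\end{align*}
```

These values agree with the reported F(2.29, 45.70) to within the rounding of ε.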
To better understand the significant interaction, separate RM-ANOVAs were conducted for each stimulus type. There was a significant effect of spatial separation on TMR thresholds for all three types of speech stimuli (natural speech: F(4, 80) = 167.20, p < 0.001, partial η² = 0.89, indicating a large effect; EAS speech: F(4, 80) = 44.17, p < 0.001, partial η² = 0.69, indicating a large effect; CI-simulated speech: F(4, 80) = 47.34, p < 0.001, partial η² = 0.70, indicating a large effect). A post hoc analysis using paired-samples t-tests with Bonferroni correction showed that TMR thresholds in all spatially separated conditions were significantly lower (better) than in the colocated condition (all p < 0.05) for natural, EAS, and simulated CI speech.
Spatial release from masking (SRM) was calculated as the difference between the TMR thresholds in the colocated condition and in each spatially separated condition.
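For concreteness, this definition can be written as below. The sign convention (colocated minus separated, so that a positive value indicates a benefit of separation) is the conventional one and is assumed here rather than stated explicitly in the text:

```latex
\[
\mathrm{SRM}(\theta) \;=\; \mathrm{TMR}_{0^\circ} - \mathrm{TMR}_{\pm\theta},
\qquad \theta \in \{5^\circ,\ 10^\circ,\ 15^\circ,\ 30^\circ\}
\]
```

Under this convention, a lower (better) threshold in a separated condition yields a larger SRM, and SRM is defined only for the four separated configurations.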
Figure 2 shows the mean SRM (±1 standard error of the mean) at the different spatial separations for the three types of speech stimuli. To investigate the effect of stimulus type on SRM, an RM-ANOVA was conducted with stimulus type (natural, EAS, and CI-simulated speech) and spatial separation (±5°, ±10°, ±15°, and ±30°) as within-subject factors. There was a significant interaction between stimulus type and spatial separation (F(6, 120) = 14.27, p < 0.001, partial η² = 0.42, indicating a large effect). Simple-effects analyses indicated that the natural speech condition resulted in significantly larger SRM than the EAS and simulated CI speech conditions at all spatial separations. There was no significant difference in SRM between the EAS and simulated CI speech conditions at 5° and 10° of spatial separation between the target and the maskers. However, at the larger spatial separations (15° and 30°), SRM was significantly higher for EAS speech than for CI-simulated speech (all p < 0.001). In addition, the difference in SRM between EAS and simulated CI speech grew as the spatial separation between the target and maskers increased.
Correlational analyses were performed to investigate the relationships between the individual TMR thresholds obtained for the three kinds of speech stimuli at each spatial separation. Scatterplots showing these relationships at each spatial separation tested are shown in Figure 3. All correlations were positive and statistically significant at every spatial separation, with Pearson’s r (df = 20 for all conditions) ranging between 0.39 and 0.65. These results indicate that listeners with better thresholds in the natural speech condition tend to have better thresholds in the EAS and simulated CI speech conditions as well.
4. Discussion
The present study examined speech identification thresholds and spatial release from masking (SRM) for natural, EAS, and CI simulated speech across a range of small-to-moderate spatial separations. Consistent with the “cocktail party” literature, spatial separation produced robust reductions in target-to-masker ratio (TMR) thresholds overall, but the size of the benefit depended strongly on stimulus type, revealing a large interaction between stimulus fidelity and spatial configuration. These findings reinforce and refine the established accounts of spatial hearing that integrate energetic unmasking via head shadow with binaural unmasking and object-based attentional selection.
4.1. Summary of Principal Findings
Across natural, EAS, and simulated CI speech, spatial separation between a target and competing talkers yielded robust spatial release from masking (SRM), with natural > EAS > simulated CI speech performance and a stimulus type × separation interaction. Post hoc contrasts confirmed significantly better TMR thresholds in all spatially separated conditions than in the colocated condition for each stimulus class, and SRM grew with increased spatial separation. Correlations of individual thresholds across stimulus types indicate shared listener factors that generalize from natural to degraded speech. These patterns dovetail with the classic SRM literature showing that separating speech sources in azimuth improves intelligibility via better-ear SNR, binaural unmasking (ITD/ILD), and facilitated auditory object selection, particularly when maskers are other talkers [4,5,6,7,8,23].
4.2. Main Effects of Stimulus Type and Spatial Separation
The strong main effect of stimulus type, with the best thresholds for natural speech, intermediate thresholds for EAS, and the poorest thresholds for simulated CI speech, aligns with evidence that spectral fidelity and access to pitch/fundamental-frequency cues facilitate segregation of competing talkers and reduce informational masking. Degrading speech via vocoding reduces spectral resolution and temporal fine-structure (TFS) cues that support stream formation and pitch-based grouping, elevating speech-on-speech thresholds [12,36]. The finding of the smallest SRM for simulated CI speech echoes reports that limited encoding of ITDs and coarse envelope cues constrain binaural advantages under vocoding or electric hearing. The pronounced main effect of spatial separation on TMR thresholds is a hallmark of SRM: separating target and maskers yields benefits from head shadow (better-ear SNR), interaural time differences (ITDs), and interaural level differences (ILDs) that enhance both energetic and informational unmasking. Classic and contemporary work shows that SRM can be substantial for speech maskers and that binaural cues contribute beyond better-ear SNR, especially when interferers are other talkers [4,5,6,7,8].
4.3. Interaction of Stimulus Type with Spatial Separation
The significant stimulus type × separation interaction indicates that spatial benefits scale with acoustic fidelity and the reliance of SRM on binaural cues [37]. Specifically, the magnitude of SRM varied substantially by stimulus type. With natural speech, SRM increased steadily with separation, consistent with listeners leveraging high-resolution spectral and TFS cues together with binaural differences to segregate competing talkers. The literature similarly reports larger binaural advantages and SRM when interferers are speech, with effects growing as spatial separations widen (within limits) and as scene complexity increases [6,23,38]. Also, prior studies show that when spectral cues are intact, binaural unmasking and spatial attention work synergistically to improve speech perception [39,40]. In contrast, CI-simulated speech produced the smallest SRM across separations, a pattern consistent with studies showing that vocoding and CI processors transmit envelope ITDs and ILDs only coarsely, often limiting binaural fusion and unmasking [41,42]. Notably, faithful ITD cues (in the fine structure or in the envelope) are critical; when ITDs are scrambled or poorly encoded, SRM collapses even if ILDs are available [42,43]. Because CI simulations attenuate or eliminate temporal fine structure cues critical for localization and segregation, listeners could not benefit from spatial separation to the same degree as with natural speech. The hybrid (EAS) condition yielded intermediate SRM overall, overlapping with the CI simulation at small separations (±5°, ±10°) but diverging at larger separations (±15°, ±30°), where the hybrid condition outperformed the CI simulation and the gap grew with angle. This pattern is coherent with models and data suggesting that SRM increases with angular separation but at a diminishing rate and that access to residual low-frequency acoustic information preferentially boosts use of ITD cues at wider angles. The hybrid results therefore imply partial preservation of cue sets (e.g., low-frequency timing) that become increasingly useful as spatial separation grows [38,42].
4.4. Correlational Structure Across Stimulus Types
The positive correlations across stimulus types suggest stable individual listener traits affecting speech-in-noise ability, regardless of spectral degradation. Listeners who performed well with natural speech generally performed well with EAS and simulated CI speech, consistent with individual differences in factors such as cognitive processing efficiency, working memory, or attentional control [44,45]. The moderate correlation magnitudes (r = 0.39–0.65) further imply that while there is shared variance, each stimulus type also engages partially distinct perceptual or cognitive processes. When bottom-up matches are poor (e.g., under vocoding), listeners depend more on explicit working memory-based mechanisms, preserving rank-ordering across conditions [45,46].
4.5. Relevance to CI Hearing and EAS Strategies
The reduced SRM for CI-simulated speech echoes extensive evidence that bilateral CI users often show modest SRM and smaller binaural benefits than normal-hearing peers, largely due to limited access to precise ITDs and to interaural place mismatches. Hybrid approaches that preserve low-frequency acoustic hearing can restore some ITD sensitivity and improve spatial perception relative to purely electric hearing, helping to explain the hybrid advantage at larger separations in our data [42,47]. Moreover, clinical and review papers emphasize that preserved low-frequency hearing in the implanted ear (EAS) can enhance speech-in-noise and localization outcomes beyond bimodal (CI + HA contralaterally) fittings, particularly in multi-talker scenarios, although benefits vary with how much low-frequency hearing is preserved and how devices are programmed (e.g., spectral overlap) [48,49].
Overall, these results sharpen the picture of SRM in speech-on-speech masking by demonstrating that the magnitude and growth of SRM with spatial separation depend on the integrity of spectral and temporal cues. With natural speech, listeners exploit complementary mechanisms (better-ear SNR, ITDs/ILDs, pitch and timbre cues, and object-based attention), yielding the largest SRM across separations. With CI-simulated speech, the selective loss of fine spectral detail and TFS reduces auditory object formation and limits access to precise interaural timing, compressing SRM. The hybrid condition sits between, indicating that even partial acoustic preservation restores some critical cues, particularly at larger separations where timing-based information exerts greater leverage. This pattern is consistent with SRM models that partition contributions of angular separation vs. asymmetry, and with object-based accounts of attention in complex scenes [38,50].
4.6. Limitations and Future Directions
Two limitations merit emphasis. First, the study tested relatively small separations; larger angles, more talkers, and reverberant environments often reveal different balances of better-ear, binaural, and attentional contributions. Extending to ±60–±90°, adding room acoustics, or manipulating talker sex/F0 and head-orientation would provide a richer stress-test of mechanisms and may enlarge stimulus-type differences. Second, while CI simulations are invaluable, clinical bilateral CI cohorts are essential to validate predicted constraints (e.g., ITD/ILD sensitivity, spatial attention) and to probe how device synchronization and mapping shape SRM. Future work combining computational SRM models with listener-specific device profiles could specify realistic upper bounds on SRM for various CI strategies.
4.7. Implications
These findings underscore the importance of spectral fidelity for maximizing SRM. For CI users, the substantially reduced SRM observed in the simulated CI condition suggests that real-world communication in multi-talker environments remains challenging, even with spatial separation. Improving access to fine-structure cues or enhancing binaural processing in CI systems may therefore yield significant functional benefits. Meanwhile, the hybrid condition’s intermediate performance suggests potential advantages of EAS strategies that preserve low-frequency acoustic hearing, which has been shown to support improved spatial perception and speech segregation [48].
4.8. Scope and Generalizability
The present findings should be interpreted within the constraints of a simulation-based paradigm using normal-hearing listeners. Performance obtained with simulated CI and EAS speech does not imply that congenitally or post-lingually deafened CI users would achieve comparable SRM magnitudes in real-world listening conditions. Instead, the results provide insight into how specific signal-processing constraints, such as reduced spectral resolution and partial restoration of low-frequency timing cues, shape spatial unmasking when other sources of biological and device-related variability are held constant. In this sense, the NH simulation framework offers a controlled means of examining the relative contributions of acoustic and binaural cues to SRM and identifying conditions under which EAS processing can offer advantages over electric-only stimulation. Clinical studies are necessary to determine how these mechanistic effects interact with long-term auditory experience, device use, and neural adaptation in CI and EAS users.