fNIRS Assessment of Speech Comprehension in Children with Normal Hearing and Children with Hearing Aids in Virtual Acoustic Environments: Pilot Data and Practical Recommendations

The integration of virtual acoustic environments (VAEs) with functional near-infrared spectroscopy (fNIRS) offers novel avenues to investigate behavioral and neural processes of speech-in-noise (SIN) comprehension in complex auditory scenes. Particularly in children with hearing aids (HAs), the combined application might offer new insights into the neural mechanisms of SIN perception in simulated real-life acoustic scenarios. Here, we present first pilot data from six children with normal hearing (NH) and three children with bilateral HAs to explore the applicability of this novel approach. Children with NH received a speech recognition benefit from low room reverberation and from spatial separation of target and distractors, particularly when the pitch of the target and the distractors was similar. On the neural level, the left inferior frontal gyrus appeared to support SIN comprehension during effortful listening. Children with HAs showed decreased SIN perception across conditions. The VAE-fNIRS approach is critically compared to traditional SIN assessments. Although the current study shows that feasibility still needs to be improved, the combined application potentially offers a promising tool to investigate novel research questions in simulated real-life listening. Future modified VAE-fNIRS applications are warranted to replicate the current findings and to validate the approach in research and clinical settings.


The Influence of Hearing Loss and Auditory Noise on Development
Hearing plays a crucial role in children's development when learning through verbal communication. Yet, such learning is often challenged by noise [1-4]. Hearing loss (HL) presents additional challenges.

Behavioral Speech-In-Noise Comprehension Assessments
In the past, a variety of behavioral tests have been designed to assess SIN recognition in adults, and some for application in children. Examples of such tests are the Hearing In Noise Test (HINT; Nilsson et al. [25]), the Words-in-Noise test (WIN; Wilson [26]), and the Listening in Spatialized Noise-Sentences test (LISN-S; Cameron and Dillon [27]). See Table 1, category A for further examples of speech recognition tests.

Table 1. Examples of tests and studies that investigated auditory processing and SIN comprehension by means of behavioral-only (A), VAE-based (B), or neuroimaging (C) assessments.

Category A - behavioral speech recognition tests:
- Hearing In Noise Test (HINT) [25]: headphone-based; recordings of 250 sentences by a male speaker, intended for adaptive SRT measurements in quiet or in spectrally matched noise. Adults with NH/HL.
- Oldenburger Satztest (Oldenburg sentence test, OlSa) [28,29]: headphone-based; recordings of sentences consisting of random combinations of 50 words, used to measure the SRT in quiet and in noise. Adults with NH/HL.
- Words-In-Noise test (WIN) [26], for clinical use: earphone-based; recordings of 70 words embedded in unique segments of multi-talker distractor noise, intended for adaptive SRT measurements. Adults with NH/HL.
- Döring test [30], for clinical use: loudspeaker-based; single syllables of the "Freiburger Sprachverständnistest" (Freiburg speech comprehension test), each repeated three times in background noise (words of the same test); the spatial locations of noise and target are varied (spatially separated vs. co-located). Children with NH/HL.
- Listening in Spatialized Noise-Sentences test (LiSN-S) [27]: headphone-based; recordings of 120 sentences by a female speaker, intended for adaptive SRT measurements against background speech by two masking talkers (two female speakers recording two distractor stories) in four conditions: maskers either spatially co-located with the target or at ±90° azimuth, and either sharing the same pitch as the target or a different pitch. Children with NH/HL.
- "Oldenburger Kinder-Satztest" (Oldenburg sentence test for children, OlKiSa) [31]: headphone-based; simplified version of the OlSa; recordings of sentences consisting of random combinations of 21 words, used to measure the SRT in quiet and in noise. Children with NH/HL.
- Children's Coordinate Response Measure (CCRM) [32]: headphone-based; sentence recordings for adaptive SRT measurements in either 20-talker babble or speech-shaped noise. Children with NH/HL.

Category B - VAE-based behavioral studies:
- Auditory sound localization, distance perception, and attention switching, using ear-/headphones, research HAs, or loudspeaker-based reproduction of auditory stimuli, with or without manipulation of acoustic variables including but not limited to reverberation, interaural level differences, and sound intensity: adults with NH, Bronkhorst [33], Wenzel et al. [34], Denk et al. [35], Pausch and Fels [36]; adults with HL, Best et al. [37], van den Bogaert et al. [38]; children with HL, Johnstone et al. [39].
- Auditory distance perception: blind and sighted adults with NH, Kolarik et al. [40], Kolarik et al. [41]; adults with NH, Shinn-Cunningham [42], Zahorik [43]; adults with NH/HL, Courtois et al. [44].
- Auditory attention switching: adults with NH, Oberem et al. [45], Oberem et al. [46].
- Behavioral auditory simulations of SIN tasks in simulated indoor environments: adults with NH, MacCutcheon et al. [47], Peng and Wang [48,49], Helms Tillery et al. [50].

Category C - neuroimaging studies:
- Investigations of speech or word (in noise) recognition, listening effort, and the influence of variables such as language skills, working memory, stimulus presentation (i.e., auditory-only or in combination with visual stimuli), and room acoustics such as reverberation times in simulated VAEs: adults with medically intractable epilepsy, ECoG, Zion Golumbic et al. [67]; adults with NH, EEG-fMRI, Puschmann et al. [68]; adults with age-related HL, fMRI, Wong et al. [69].

Abbreviations: SIN-speech-in-noise; SRT-speech reception threshold; VAE(s)-virtual acoustic environment(s); SNR-signal-to-noise ratio; NH-normal hearing; HL-hearing loss; CI-cochlear implant; HA-hearing aid; fNIRS-functional near-infrared spectroscopy; EEG-electroencephalography; ERP-event-related potential; ECoG-intracranial electrocorticography; fMRI-functional magnetic resonance imaging.

Many factors are known to affect listeners' SIN comprehension and can be varied within these assessments, including properties of the speech stimuli and of the distractors, such as competing talkers. For example, the LISN-S for children varies the location of the maskers (0° vs. ±90° azimuth) as well as their pitch similarity to the target speaker (same as or different from the target speaker; [27]). While several of the SIN tests have been applied to typically developing children and translated into several languages [70,71], a recent study showed that language skills were a significant predictor of Hearing In Noise Test (HINT) performance for children with CIs, HAs, and a developmental language disorder, but not for children with NH [71]. To investigate whether early executively demanding (linguistic) training might help children with HL to compensate, and to better understand the mechanisms supporting SIN comprehension, it is important to create testing environments that mimic complex real-world auditory situations and are appropriate for the assessment of children with HL. Such testing environments might guide the future design of audiological assessment tools that are applicable to daily listening in quiet and in noisy environments.

Speech Comprehension and Virtual Acoustic Reality
Recent advances in acoustic virtual reality enable the reliable application of increasingly plausible virtual acoustic environments (VAEs) in laboratory-based hearing research [42,72-75]. By manipulating auditory cues in VAEs, various factors influencing speech comprehension can be examined in isolation. Past studies have used VAEs to explore spatial hearing in free-field environments, such as sound localization [33,34], auditory distance perception [40-43], and auditory attention switching [45,46]; see Table 1, category B. In addition to free-field listening, recent VAE studies have also begun to examine more realistic indoor auditory environments, such as speech understanding in noisy classrooms ([47-49]; Table 1, category B). Thereby, SIN tests can be administered in simulated real-life settings. While most VAE work to date has focused on individuals with NH, there is increasing interest in applying VAEs to evaluate outcomes of assistive hearing devices [24,50,53,76].

Speech Comprehension and Functional Near-Infrared Spectroscopy
Next to assessments that mimic real-life listening scenarios, it is of interest to gain insights into the underlying neural processes that contribute to good SIN perception and individual differences in speech comprehension. Table 1, category C provides examples of past neuroimaging studies on word and speech (in noise) understanding utilizing different neuroimaging techniques. Functional near-infrared spectroscopy (fNIRS) has recently gained much traction as a versatile optical neuroimaging tool to assess auditory paradigms and language development both in NH listeners and in those fitted with CIs [57,77-83]. A recent study also examined auditory mechanisms in children fitted with HAs [84]. fNIRS is particularly suitable for investigations of auditory paradigms due to its silent operation, higher spatial resolution than electroencephalography (EEG), fewer motion restrictions, and, in contrast to functional magnetic resonance imaging (fMRI), compatibility with hearing device use [85]. Neural activity is inferred from continuously recorded changes in oxygenated, deoxygenated, and total hemoglobin concentration (∆HbO, ∆HbR, ∆HbT). In essence, fNIRS allows capturing the relation between speech recognition and cortical activation. In previous studies, superior temporal gyrus (STG) activity was considered predictive of speech comprehension [54,86,87]. Next to the temporal cortices, the left inferior frontal gyrus (IFG) has been shown to facilitate the differentiation of an auditory stream of interest from auditory noise during effortful listening, which requires a higher cognitive load [54-58,88]. Given the variety of auditory tasks applied in the previous literature, fNIRS has thus demonstrated its potential for combination with VAEs to elucidate the underlying neural mechanisms of speech comprehension during real-world listening.
Table 1 provides examples of past tests and studies that investigated auditory processing and SIN comprehension by means of behavioral-only, VAE-based, or neuroimaging assessments. In the current pilot study, we introduce a novel experimental approach to investigate how children with NH and children with HAs utilize auditory cues to understand SIN in complex simulated auditory environments. While children are exposed to a virtual acoustic simulation of a realistic classroom environment, our approach combines a simultaneous behavioral assessment of SIN performance with a neural measure through fNIRS. After a detailed description of the methods and testing equipment, its first application is illustrated by pilot data. Clear recommendations are provided to address current challenges of the novel approach.

Participants
Figure 1 depicts the individual unaided pure tone audiograms that were obtained within three weeks of study participation for the HA group. For all NH children, NH status was based on the mandatory early hearing screening (U9, including headphone-based audiometry) and parental report on the day of testing. Before participation, parents provided written informed consent and children gave their assent. The study was approved by the local ethics committee (Medical Faculty, University Hospital Aachen; EK 188/15) and conducted in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki). See Supplementary Table S2 for information on the demographic and hearing assessment.

Equipment and Virtual Acoustic Environment
See Supplementary Table S1 for a detailed overview of all testing equipment. Testing was performed in a custom-built sound-attenuated booth (L × W × H = 2.12 × 2.12 × 2 m; 9 m³; Figure 2A,C). A four-channel loudspeaker array (Neumann KH-120A; Georg Neumann GmbH, Berlin, Germany) was positioned at ear height, one loudspeaker in each corner, for audio playback using crosstalk cancellation (CTC) [89]. The child was seated in the center of the booth at a distance of 110 cm from each loudspeaker.

Figure 2. Illustration of the experimental setting. Panorama view of the inside of the sound-insulated booth (A), left behind-the-ear receiver-in-canal device used as research hearing aid (HA) (B), and schematic illustration of the setup (C). The participant was seated centrally within the sound-insulated booth. For a subset of participants in the HA group, research HAs were led into the booth. The fNIRS fibers and optodes were directed into the booth and placed on the participant's head within a cap. On top of the cap, the rigid body base holding reflective markers was mounted. The motion tracking cameras were positioned in each corner above the loudspeakers. The computer and the ETG-4000 (Hitachi Medical Corporation, Tokyo, Japan) were placed outside the booth to minimize equipment noise.
The behavioral paradigm was implemented in a simulated virtual classroom (L × W × H = 11.8 × 7.6 × 3 m, V = 244 m³). All room acoustic simulations were performed in the real-time auralization framework Room Acoustics for Virtual Acoustic Environments (RAVEN; Pelzer, Aspöck, Schröder and Vorländer [75], Pausch et al. [90], Schröder [91]). To achieve a realistic spatial percept, VAEs can be rendered using head-related transfer functions (HRTFs). To account for differences in head size between children and adults and to enhance the physical correctness of the spatial cues delivered in VAEs [92,93], an individualization procedure was applied for each child: the HRTFs of an adult artificial head [96], measured at a spatial resolution of 1° × 3° in azimuth and elevation angles, were scaled [94,95]. Merged with the room acoustic simulations, the scaled HRTFs were used to create the binaural stimuli. The software Virtual Acoustics was utilized for the binaural real-time reproduction. The acoustic simulation was updated for the current position and orientation of the child's interaural axis center based on the input of an optical motion tracking system (Flex 13, OptiTrack, Corvallis, OR, USA).
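The binaural rendering step can be pictured with a small sketch: a dry source signal is convolved with the left and right head-related impulse responses (HRIRs, the time-domain counterparts of the HRTFs) selected for the source direction relative to the tracked head orientation. The following Python code is a minimal illustration under stated assumptions, not the authors' RAVEN/Virtual Acoustics implementation; the HRIR array layout, file name, and grid indexing are hypothetical.

```python
# Minimal sketch of direction-dependent binaural rendering from a measured
# HRIR set on a 1 deg (azimuth) x 3 deg (elevation) grid, as in the paper.
import numpy as np
from scipy.signal import fftconvolve

AZ_RES, EL_RES = 1.0, 3.0   # measurement grid resolution in degrees

def nearest_hrir(hrirs, az_deg, el_deg):
    """Pick the measured HRIR pair closest to the source direction
    (already expressed relative to the tracked head orientation)."""
    az_idx = int(round((az_deg % 360.0) / AZ_RES)) % hrirs.shape[0]
    el_idx = int(round((el_deg + 90.0) / EL_RES)) % hrirs.shape[1]
    return hrirs[az_idx, el_idx]            # shape (2, n_taps): left, right

def render_binaural(dry, hrir_pair):
    """Convolve the dry (anechoic) source with the left/right HRIRs."""
    left = fftconvolve(dry, hrir_pair[0])
    right = fftconvolve(dry, hrir_pair[1])
    return np.stack([left, right], axis=0)  # 2-channel binaural signal

# Hypothetical usage: target at 0 deg azimuth, head turned by -10 deg
# hrirs = np.load("scaled_child_hrirs.npy")  # assumed shape (360, 61, 2, taps)
# out = render_binaural(speech, nearest_hrir(hrirs, 0.0 - (-10.0), 0.0))
```

In a real-time system such as the one described, this lookup and convolution would run block-wise and be re-evaluated whenever the motion tracker reports a new head pose.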
In addition to acoustic transmission through loudspeakers, the system also included a pair of research HAs (custom-made behind-the-ear receiver-in-canal devices with open fitting by GN ReSound, Ballerup, Denmark; Figure 2B) to play the auditory stimuli for children using HAs. This combined reproduction strategy aims at approaching the real-life equivalent, where individuals are likely to use their residual hearing. The simulated HA microphone signals were based on scaled hearing aid-related transfer functions (HARTFs; Pausch, Aspöck, Vorländer, and Fels [90]). Together with the results of the room acoustic simulation, they contained all spatial signal characteristics as they would be captured by the front HA microphones in the virtual classroom. To mimic the real-life delay of HA signal processing, a variable delay line delayed the HA path by 5 ms relative to the binaural loudspeaker reproduction at eardrum level [97]. Using the simulated signals as input, a MATLAB-based [98] real-time software platform for the emulation of HA algorithms with individual fitting capability was integrated [99].
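The relative timing of the two reproduction paths can be sketched in a few lines. This is a simplified illustration of the 5 ms offset described above, with hypothetical signal names; the actual system uses a variable (fractional) delay line rather than whole-sample zero padding.

```python
# Sketch: delay the hearing-aid path by 5 ms relative to the loudspeaker path.
import numpy as np

FS = 44100                  # audio sampling rate in Hz (assumed)
HA_DELAY_SEC = 0.005        # HA processing delay at eardrum level [97]

def delay(x, n_samples):
    """Delay a 1-D signal by n_samples via zero padding at the front."""
    return np.concatenate([np.zeros(n_samples, dtype=x.dtype), x])[:x.size]

n = int(round(HA_DELAY_SEC * FS))    # about 220 samples at 44.1 kHz
# ha_path = delay(ha_mic_signal, n)  # now lags the loudspeaker path by ~5 ms
```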
For this pilot study, the software platform was utilized for one child in the HA group, using gain prescription based on the individual's unaided audiogram [100]. No directional HA algorithms or other signal enhancement algorithms were included. The other two children were unable to use the research HAs due to their higher degree of HL that would have required amplification that exceeded the safety limits in the software platform. Instead, they were listening with their own HAs to the VAEs reproduced via the loudspeakers. Assuming negligible residual hearing capabilities for these individuals, the binaural playback over loudspeakers with CTC filters was based solely on individually scaled HARTFs instead of a mixture of playback HRTFs (loudspeaker playback) and HARTFs (HA playback) as in combined reproduction. All children with NH received binaural stimulus playback only via loudspeakers with CTC filters based on individually scaled HRTFs.
To minimize equipment noise, the computer and fNIRS system (ETG-4000, Hitachi Medical Corporation, Tokyo, Japan) were placed outside the booth. To ensure a firm hold, the 2 × 3 × 5 fNIRS probe holders with 2 × 22 measurement channels (CHs) were placed in an EEG cap (Easycap GmbH, Herrsching, Germany). The probe sets were positioned symmetrically on the left and right sides of the head (Figure 3A). The last receiving optode of the lowest row was placed above the ear (proximal to T3/T4 of the 10/20 system [101]). The anterior, lower corner of each probe set was directed towards the end of the eyebrows. A virtual registration approach was applied [102], with optode positions resembling the 2 × 3 × 5 CH configurations by the Jichi Medical University [103]. The 2 regions of interest (ROIs), STG and IFG, were based on anatomical labels of the highest probability (Figure 3A). Changes in HbO, HbR, and HbT were obtained via 2 wavelengths (695 and 830 nm) at a sampling frequency of 10 Hz. HbT has been suggested to be less susceptible to pial vein contamination [104], and 2 large pial veins (the superior anastomotic and the superficial middle cerebral vein) run beneath our fNIRS configuration. Therefore, the current exploratory analyses of the pilot data focused on ∆HbT.
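For readers less familiar with the optical measurement principle, the conversion from two-wavelength attenuation changes to ∆HbO, ∆HbR, and ∆HbT follows the standard modified Beer-Lambert law. The sketch below illustrates that step in Python; the extinction coefficients, differential pathlength factor, and source-detector separation are illustrative placeholders, not the ETG-4000's calibrated values.

```python
# Hedged sketch of the modified Beer-Lambert law: solve
# delta_OD(lambda) = (eps_HbO * dHbO + eps_HbR * dHbR) * d * DPF
# at two wavelengths for dHbO and dHbR, with dHbT = dHbO + dHbR.
import numpy as np

E = np.array([[0.30, 1.60],    # 695 nm: [eps_HbO, eps_HbR] (placeholder values)
              [1.05, 0.78]])   # 830 nm: [eps_HbO, eps_HbR] (placeholder values)
DPF = 6.0                      # differential pathlength factor (assumed)
D_CM = 3.0                     # source-detector separation in cm (assumed)

def mbll(delta_od):
    """delta_od: (2, n_samples) optical density changes at 695/830 nm.
    Returns dHbO, dHbR, dHbT time courses (relative concentration units)."""
    d_hbo, d_hbr = np.linalg.solve(E * D_CM * DPF, delta_od)
    return d_hbo, d_hbr, d_hbo + d_hbr   # total hemoglobin = sum of the two
```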

Experimental Design and Procedure
In the VAE, a modified SIN task (adjusted from the LISN-S task by Cameron and Dillon [27]) for the assessment of German-speaking children was implemented [105]. The target speech was a selection of 5-keyword sentences from the Hochmair-Schulz-Moser (HSM) test [106], recorded by a native German-speaking female voice (mean f0 = 213 Hz, measured from a 2 min utterance). The two-talker distractor speech consisted of passages from fairy tales by the Grimm brothers that children were less familiar with. Two pitch conditions were created by either using the target voice for the distractors (Psame) or two separate female voices (Pdiff; mean f0 = 191 and 198 Hz). In the VAE, the target speaker was always positioned in front of the listener. To introduce spatial separation, the distractor speakers were either located symmetrically on both sides of the participant at ±90° azimuth (Sdiff) or in the same virtual position as the target speaker (Ssame). The 2 spatial × 2 pitch conditions were tested in a low-reverberation virtual classroom (reverberation time (RT) of 0.4 s, averaged across octave bands between 500 and 2000 Hz; RTlow) and a high-reverberation virtual classroom (1.1 s; RThigh), created through variations in the absorption and scattering properties of the surface materials of the virtual classroom. Thus, eight conditions, with one test block each, were created through variations of pitch cue, spatial cue, and RT (Figure 4A,B). The order of the test blocks was pseudorandomized following a nested Latin square design.
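A cyclic Latin square is one simple way to derive such pseudorandomized block orders; the paper's exact nested design is not specified here, so the following Python sketch only illustrates the general principle of balancing each of the eight conditions across block positions.

```python
# Sketch: generate balanced orders of the 8 conditions (2 RT x 2 spatial x
# 2 pitch) from a cyclic Latin square, one row per participant slot.
from itertools import product

conditions = [f"{rt}-{sp}-{pi}" for rt, sp, pi in
              product(("RTlow", "RThigh"),
                      ("Ssame", "Sdiff"),
                      ("Psame", "Pdiff"))]

def latin_square_orders(items):
    """Row k is the condition list rotated by k, so every condition
    appears exactly once at every block position across rows."""
    n = len(items)
    return [[items[(k + i) % n] for i in range(n)] for k in range(n)]

for slot, order in enumerate(latin_square_orders(conditions)):
    print(slot, order)
```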
For each child, a short practice run was provided for task familiarization. During the main task, a manually initiated 15 s rest period, with a subsequent audio playback introducing the next condition, was presented prior to each block (Figure 4B,C). Note that due to the manual start of the rest block, which allowed each child to determine individually when they were ready to continue and thus accounted for fatigue, the total duration of rest was variable (M = 44.52 s; SD = 17.81 s; minimum of 34.30 s, allowing the fNIRS signal to return to baseline). At the beginning of each condition, the distractor stories started and continued throughout the entire test block. A leading 1 kHz sine tone of 200 ms was played, followed by 500 ms of silence, before each target sentence was presented. The child verbally repeated what was heard. The verbal response was manually scored based on the accurately identified keywords by an experimenter outside the booth. An excerpt of the procedure is shown in Figure 4C.

Figure 4. Illustration of the test conditions (A,B) and an excerpt of the experimental procedure (C). Each trial comprised a leading 1 kHz sine tone of 200 ms and subsequent 500 ms of silence that preceded each target sentence (see the Supporting Material for exemplary audio files and task instructions), as well as the verbal response. In between conditions, a playback introduced the next condition, and a manually initiated (asterisk) break with a total silence duration of at least 30.4 s was presented before each condition. Abbreviations: SRT-speech reception threshold; T-target voice; D-distractor voices; Ssame-target and distractors at the same spatial position in front of the participant; Sdiff-target at front and distractors at ±90°; Psame-same pitch of target and distractor voices (both "D" and "T" in black); Pdiff-different pitch of target and distractor voices ("T" in black and "D" in red); RTlow-low reverberation time; RThigh-high reverberation time. Footnotes: * The door is freshly painted. † The door is painted. ‡ The plane flies very quietly. § The plane flies. ¶ Last night there was a thunderstorm.
The speech reception threshold (SRT) is a measure of speech comprehension in noise, with lower values indicating better behavioral performance. For each condition/block, the SRT at 50% accuracy was tracked using a one-down one-up adaptive staircase procedure [107], adjusting the target presentation level with an initial step size of 4 dB. The distractors were always presented at 55 A-weighted decibels (dBA) sound pressure level (SPL). The target speech started at 70 dBA SPL. A trial was scored as correct when three or more keywords were correctly repeated, which lowered the target level, i.e., the SNR, for the subsequent trial; an incorrect trial raised it. After the first reversal, the step size changed to 2 dB. A reversal occurred when the direction of the SNR change reversed, for example from decreasing to increasing SNR. A test block terminated at the 6th reversal. To ensure safety, playback levels never exceeded 80 dBA SPL for children with NH and 105 dBA SPL for the child using the research HAs.
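A minimal Python sketch of this one-down one-up staircase is given below. The scoring callback stands in for the experimenter's manual keyword scoring, and the SRT itself was later derived from a logistic fit rather than from the reversals (see the Behavioral Data section).

```python
# Sketch of the adaptive staircase: 4 dB initial step, 2 dB after the first
# reversal, termination at the 6th reversal, 80 dBA safety cap (NH group).
def run_staircase(score_trial, target_start=70.0, step_init=4.0,
                  step_final=2.0, max_reversals=6, level_cap=80.0):
    """score_trial(level_dBA) -> True if >= 3 of 5 keywords were repeated."""
    level, step = target_start, step_init
    tested, reversals, last_dir = [], [], None
    while len(reversals) < max_reversals:
        tested.append(level)
        direction = -1 if score_trial(level) else +1   # down if correct, up if not
        if last_dir is not None and direction != last_dir:
            reversals.append(level)                    # SNR direction reversed
            step = step_final                          # switch to 2 dB steps
        last_dir = direction
        level = min(level + direction * step, level_cap)
    return tested, reversals                           # levels for logistic fit
```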

Behavioral Data
To derive an SRT for each acoustic condition, a logistic regression was fitted to all tested SNRs, and the SNR at 50% accuracy was interpolated [108]. For children, this approach is considered more robust [109] and more consistent in estimating the psychophysical threshold from disperse behavioral data [110] than SRTs calculated by averaging the last reversals.
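The interpolation step can be sketched as follows: fit a logistic function to the per-trial SNRs and binary scores from the staircase, then read off the SNR at its midpoint. Here, scipy's least-squares curve_fit serves as a simple stand-in for a formal logistic regression, and the data values are hypothetical.

```python
# Sketch: interpolate the SRT at 50% accuracy from staircase trial data.
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr, srt, slope):
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

def fit_srt(snrs, correct):
    """snrs: tested SNRs in dB; correct: 0/1 keyword-criterion trial scores."""
    (srt, slope), _ = curve_fit(logistic, snrs, correct,
                                p0=[np.mean(snrs), 1.0])
    return srt                       # logistic midpoint = SNR at 50% accuracy

snrs = np.array([15, 11, 7, 3, 5, 3, 5, 3, 5, 3], dtype=float)   # hypothetical
correct = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 1], dtype=float)
print(f"SRT = {fit_srt(snrs, correct):.1f} dB SNR")
```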

Neural Data
In line with previous findings [54,56,88], the bilateral STG and IFG formed our 4 a priori hypothesized ROIs. The CHs assessing bilateral IFG and STG activity are depicted in Figure 3A. Inclusion of a participant required at least 50% of the CHs in each ROI to have good signal quality. As automated, criterion-based detection of bad signal quality can be obstructed by large baseline shifts and trends in the data, particularly in paradigms of long duration, poor signal quality was identified by visual inspection before and after preprocessing (i.e., CHs showing large signal variation, spikes, or measurement errors/flat lines). On average, 1-2 CHs across all ROIs and probe sets were excluded in the remaining sample.
The fNIRS data were preprocessed in MATLAB [98] via self-written scripts and scripts from the HomER2 (Huppert et al. [111]; version homer2_src_v2_8_11022018) and SPM-fNIRS [112] toolboxes. All steps of the preprocessing pipeline are depicted in Figure 3B. Specifically, a combined spline interpolation and wavelet filtering approach was used to reduce motion artifacts, because the combination of the two techniques has yielded the best results for data obtained from challenging samples and tasks as well as for paradigms that involve motion [113,114]. Further, only the last 50 s of each block were considered for the neural analyses (Figure 4C), because they best captured the neural activation at each individual's SRT, i.e., the last 4 reversals comprising the last 5-6 sentences per condition. During this period, the SNR at which the child repeated 50% of the sentences correctly had been reached.
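At the ETG-4000's 10 Hz sampling rate, the last 50 s of each block correspond to the final 500 samples before each block offset, which makes the epoch selection a one-line slicing operation. The sketch below shows this in Python; array and index names are hypothetical, as the authors' scripts were MATLAB-based.

```python
# Sketch: extract the last 50 s of each test block from the dHbT time series.
import numpy as np

FS_NIRS = 10                 # fNIRS sampling frequency in Hz
WIN = 50 * FS_NIRS           # 50 s -> 500 samples

def last_50s_epochs(d_hbt, block_offsets):
    """d_hbt: (n_channels, n_samples); block_offsets: end sample per block.
    Returns a (n_blocks, n_channels, WIN) array of condition epochs."""
    return np.stack([d_hbt[:, off - WIN:off] for off in block_offsets])
```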

Analyses
Statistical analyses for the pilot data were performed in IBM SPSS [115] and R [116]. For the NH group, a repeated-measures analysis of variance (rmANOVA) was fitted separately to the behavioral measure of SRTs and to the neural activity (∆HbT) in each ROI, using 3 within-subject factors: RT (RTlow vs. RThigh), spatial cue (Ssame vs. Sdiff), and pitch cue (Psame vs. Pdiff). An a priori α = 0.05 was used to identify statistical significance. For post hoc analysis using pairwise comparisons, uncorrected t-test results are reported because of the small sample size in this pilot study and the main purpose of applicability assessment.
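For illustration, the same 2 × 2 × 2 repeated-measures ANOVA can be run in Python with statsmodels as an open-source alternative to the SPSS/R analyses used here; the synthetic long-format data frame below merely shows the required layout (one row per child and condition cell) and is not the study's data.

```python
# Sketch: 2 x 2 x 2 rmANOVA on SRTs with statsmodels' AnovaRM.
import numpy as np
import pandas as pd
from itertools import product
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = [dict(child=c, rt=rt, spatial=sp, pitch=pi, srt=rng.normal(-5.0, 2.0))
        for c in range(6)                                 # 6 NH children
        for rt, sp, pi in product(("RTlow", "RThigh"),
                                  ("Ssame", "Sdiff"),
                                  ("Psame", "Pdiff"))]    # 8 cells per child
df = pd.DataFrame(rows)

res = AnovaRM(data=df, depvar="srt", subject="child",
              within=["rt", "spatial", "pitch"]).fit()
print(res)   # F(1,5) and p for each main effect and interaction
```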
To examine behavioral SRTs during HA use, exploratory analyses were conducted in an available program (SINGLIMS.EXE; Crawford et al. [117], Crawford and Garthwaite [118]) using modified t-tests [119]. That is, individual data of children with HAs were compared against the group distribution estimated from the NH group. For further inspection, the effect size with 95% confidence intervals and a point estimate of the probability of an HA user's score falling above the value of the NH group are presented. For the neural data, the effects on ∆HbT that were identified in the NH group analysis were plotted separately for each child of the HA group for exploratory purposes. Importantly, the results are primarily intended to demonstrate applicability in children fitted with HAs rather than to provide generalizable evidence across the population that is hard of hearing and fitted with HAs.

Behavioral Data
No other comparisons were found to be significant. For children with HAs, the individual SRT (Figure 5B) was compared to the NH group in each test condition using modified t-tests [119]. Large individual variability was observed (Table 2). In general, most children of the HA group had elevated SRTs compared to the NH group in all eight test conditions. Using the SRT distributions from the NH group, the probability of an HA user's score falling above the value of the NH group ranged between 81.75% and 99.89% for child HA 1, between 78.26% and 99.84% for child HA 2, and between 57.21% and 97.90% for child HA 3 across all eight conditions. Among all children with HAs, child HA 3, who had the best unaided thresholds in the pure tone audiogram, had SRTs closest to the NH group.

Table 2. Modified t-test statistics, performed based on the methods described by Crawford and Howell [119], are listed for each child in the HA group, comparing the individual's speech reception threshold to the NH group in each test condition. Effect sizes with 95% confidence intervals are shown between the individual case and the control group. Abbreviations: HAs-hearing aids; NH-normal hearing; RTlow-low reverberation time; RThigh-high reverberation time; Ssame-same spatial position of target and distractor speakers; Sdiff-different position of target and distractor speakers; Psame-same pitch of target and distractor voices; Pdiff-different pitch of target and distractor voices; CI-confidence interval; M-mean; SD-standard deviation; * p < 0.05; ** p < 0.01.
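The Crawford and Howell [119] modified t-test used above compares a single case against a small control sample by inflating the control standard deviation by sqrt((n + 1)/n), with n - 1 degrees of freedom. The sketch below implements this in Python; the example values are hypothetical, not the study's data.

```python
# Sketch of the Crawford-Howell modified t-test for single-case comparisons.
import numpy as np
from scipy import stats

def crawford_howell(case, controls):
    """Returns (t, two-tailed p, point estimate of P(case above a control))."""
    c = np.asarray(controls, dtype=float)
    n = c.size
    t = (case - c.mean()) / (c.std(ddof=1) * np.sqrt((n + 1) / n))
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)   # two-tailed significance
    prob_above = stats.t.cdf(t, df=n - 1)    # share of control scores below case
    return t, p, prob_above

t, p, prob = crawford_howell(case=2.5,
                             controls=[-4.1, -3.0, -5.2, -2.8, -4.6, -3.9])
print(f"t(5) = {t:.2f}, p = {p:.3f}, P(above NH) = {100 * prob:.1f}%")
```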

Neural Data
The repeated-measures ANOVA of ∆HbT for the NH group revealed a significant interaction between the pitch cue and RT in the left IFG, F(1,5) = 7.38, p = 0.04, ηp² = 0.60. Irrespective of spatial cue availability, left IFG activation tended to be lower in the condition with low RT and an available pitch cue (RTlow, Pdiff) than in the condition with high RT and an available pitch cue (RThigh, Pdiff; Figure 5C; t(5) = 2.42, p = 0.06, d = 1.18). Similar activations in the left IFG were observed between the two RTs when the target and maskers shared the same pitch. The results suggest that the pitch cue reduced left IFG activation only under low RT but not under high RT. No other effects on ∆HbT, nor any analyses for the remaining ROIs, were found to be significant. To explore possible alterations, the activation of each child in the HA group was plotted for the interaction effect observed in the NH group (Figure 5D). Notably, child HA 3, with the best unaided pure tone audiogram and best behavioral performance, showed activation patterns different from the NH group. In contrast, child HA 2 showed neural activation patterns similar to the NH group, although the behavioral performance of child HA 2 was poorer.

Discussion
The main objective of the current pilot study was to provide a tentative assessment of applicability and to offer extensive recommendations for future applications of a novel paradigm and experimental setup that combines fNIRS and VAEs to investigate simulated complex real-world listening in children on behavioral and neural levels. Furthermore, the multimethod approach was tested in three children with bilateral HAs.
The findings of the pilot study suggest that excessive reverberation of 1.1 s impairs speech comprehension in children with NH on the behavioral level. This corroborates previous reports of negatively affected hearing and wellbeing of children in reverberant classrooms [120,121]. According to the American Speech-Language-Hearing Association guidelines, good classroom acoustics require reverberation to be kept under 0.6 s, and between 0.45 and 0.6 s according to European regulatory guidelines [122]. Next to room acoustics, the spatial separation between target and distractors might provide important auditory cues for children with NH to understand SIN. In the behavioral measure of SRT, the effect of spatial cues was moderated by the pitch similarity between target and distractors, in line with Cameron and Dillon [27]. The NH group received a larger speech comprehension benefit from spatial separation when target and distractors shared the same pitch. This might suggest that pitch similarity promoted the use of spatial cues for understanding SIN. Behavioral performance of the HA group was overall poorer across conditions. Nevertheless, most children fitted with HAs also benefited from the spatial separation of the talkers. There was large individual variability in the effects of RT and pitch cue on performance. The degree of HL likely accounted for the variance in performance. Indeed, better aided speech audibility as well as stronger vocabulary and working memory abilities have been shown to facilitate SIN recognition in reverberant environments [24]. Additional studies applying the multimethod approach are required to clarify the observed variability and its underlying mechanisms.
Crucially, while the current study pointed to the potential of the novel multimethod approach to investigate complex, realistic listening scenarios in children with NH and children with HAs, the fNIRS results in particular should be treated with caution due to the small sample size and the exploratory nature of the pilot study. On the neural level, in the low RT condition, introducing the pitch cue reduced left IFG activation. This might suggest that the left IFG assists SIN recognition only during more difficult, effortful conditions (i.e., when target and distractor talkers share the same pitch) to reach a behavioral performance comparable to easier listening conditions (i.e., when a pitch cue is introduced). This finding corroborates previous studies that considered attention-dependent left IFG activation a plausible neural marker of effortful listening [55,88,123]. While the behavioral measure of SRTs improved with the introduction of the pitch cue only if target and distractors shared the same spatial position, the neural finding might indicate that children received a general release from effortful listening through the pitch cue. Interestingly, this release from effort was only accessible to children under low RT but not under high RT, which might suggest that the high RT conditions require more effort irrespective of pitch cue availability. In contrast, no differences in STG activation were observed between test conditions. Analogously to the behavioral performance, large variability in the effects of RT and pitch cue on neural activation in the left IFG was observed for the children with HAs. Of note, child HA 3, who had the best behavioral performance and the best unaided pure tone audiometry, did not show activation patterns similar to the NH group. HL has been shown to lead to neural alterations within and beyond the auditory cortices [124,125]. Different neural resources might support speech comprehension in complex auditory conditions in children with bilateral HL and HAs compared to children with NH. Yet, future studies with a modified VAE-fNIRS approach and a larger sample size have yet to identify the exact neural mechanisms within and beyond bilateral STG and IFG that support speech comprehension after HL and HA use.
While fNIRS hyperscanning alone already allows capturing the neural activation of two people during natural conversations [66,126], the fNIRS-VAE approach could potentially enable the investigation of even more complex auditory situations in the long run. By simulating multiple speakers as well as varying room acoustic conditions, each factor that contributes to real-world hearing could be investigated in isolation and related to underlying neural mechanisms. It should be noted that real-life auditory perception generally takes place in a multisensory environment. While the role of tactile cues is still unclear, visual cues have been shown to affect auditory perception in (virtual) environments [127,128]. Importantly, visual cues improve SIN perception differently in children with HL compared to children with NH, with a larger audio-visual enhancement for children with HL [129]. Thus, while additional visual cues may have been particularly helpful for the children with HAs, the current design focused on first understanding the auditory aspect through highly controllable VAEs. Nevertheless, future studies incorporating other cues in the virtual environment, such as visual cues, could be of interest in the long run. Further, while most research utilizing traditional SIN paradigms is interested in performance differences (speech comprehension) and associated varying effort levels, for example between the NH population and individuals with HL, the staircase procedure enables the exploration of different research questions: constant accuracy levels between participants and conditions might enable the investigation of the behavioral and neural mechanisms that facilitate a comparable level of speech comprehension in children with NH as well as children with HAs. In addition to classical audiometric testing in simple acoustic environments, advanced fNIRS-VAE approaches might, after extensive validation, also offer the possibility to optimize HA fitting in complex auditory scenes and to identify factors that could be improved in assistive listening devices.
Despite its numerous advantages, future research using a modified VAE-fNIRS application is warranted to validate the current findings and to further elucidate the behavioral and neural mechanisms that underlie individual differences in SIN comprehension in children with NH and children fitted with HAs. To benefit future applications of the multimethod approach, Table 3 offers an extensive list of limitations of the presented approach together with recommendations on how to address each challenge.

Table 3. Challenges of the presented VAE-fNIRS approach and recommendations for future applications.

Loudness
Challenge: Loudness deviations when investigating SIN comprehension typically do not exceed 10 dB SPL; activation differences thus hardly reflect overall sound intensity differences. Nevertheless, individual loudness perception (rather than physical intensity) appears to be related to brain activation [130-132].
Recommendation: Subjective auditory loudness perception should be assessed and taken into consideration during interpretation.

Noise removal: head movements and high-pass filtering
Challenge: Head movements are warranted during VAE simulations; however, an excessive amount might distort the NIRS signal. The long duration of the task limits the strict application of high-pass filters.
Recommendation: For datasets acquired from challenging samples, with few trials, lengthy paradigms, or when head movements are an important aspect of the task, combined motion artifact detection and correction techniques are highly recommended (e.g., see Jahani, Setarehdan, Boas, and Yucel [113] or Di Lorenzo, Pirazzoli, Blasi, Bulgarelli, Hakuno, Minagawa, and Brigadoi [114]). Implementation of short-separation CHs, which are sensitive to changes in superficial blood flow, is considered crucial to remove noise (i.e., extra-cerebral signal) [133,134]; this is also highly relevant because the long task duration limits the application of strict high-pass filters. When investigating various age groups, an age-corrected differential pathlength factor is advised [135].

Localization/ROIs and lateralization
Challenge: Variability in head size and shape might affect the formation of ROIs, and differential lateralization of speech-related activity might add further variation.
Recommendation: The use of probe positioning units ensures correct and consistent fNIRS probe placement. Individual formation of ROIs by allocating relative weights to the CHs depending on their probability of falling into a respective ROI (e.g., see Huppert et al. [136]) might be considered for the analyses. In addition, variability in speech lateralization due to, inter alia, speech content [137-139] should be controlled for.

Participants: varying degrees of HL, HA devices, and frequency of HA use
Challenge: Due to time constraints and the explorative purpose of the study design, audiometry was performed only for the HA group, where it served as input for the research HAs.
Recommendation: Future studies assessing larger populations should aim to control for varying degrees of hearing (loss) and administer detailed questionnaires about HA use, device, and fitting.

Other factors affecting speech comprehension
Challenge: Due to the small sample size, the current pilot investigation could not control for variability in hearing abilities. Auditory, linguistic, as well as other cognitive mechanisms have been suggested to affect speech understanding (e.g., see the ease of language understanding model, Rönnberg, Lunner, Zekveld, Sörqvist, Danielsson, Lyxell, Dahlström, Signoret, Stenfelt, and Pichora-Fuller [19], Rönnberg, Holmer, and Rudner [20], Holmer, Heimann, and Rudner [21], or McCreery, Walker, Spratford, Lewis, and Brennan [24]). Speech represents a highly complex auditory signal that involves multiple brain networks. Animal models of cortical reorganization following HL highlight the widespread effects of HL beyond the auditory cortex and the interplay of multiple neural networks that, in turn, make the effects of HL on speech understanding highly individual [98,99].
Recommendation: Next to audiometry, additional measures of cognition and speech performance (e.g., assessment of (verbal) IQ and speech production) were beyond the scope of the current pilot study; however, they are highly recommended for future applications.
In conclusion, while several challenges remain to be overcome and future studies have to further evaluate adapted versions of this multimethod approach, advanced VAE-fNIRS applications could provide unique tools to understand children's listening abilities in complex real-world auditory situations and could, in the long run, offer crucial information to improve the fitting of HAs for complex (simulated) real-world listening.