An Auditory-Perceptual and Pupillometric Study of Vocal Strain and Listening Effort in Adductor Spasmodic Dysphonia

Abstract: This study evaluated ratings of vocal strain and perceived listening effort by normal hearing participants while listening to speech samples produced by talkers with adductor spasmodic dysphonia (AdSD). In addition, objective listening effort was measured through concurrent pupillometry to determine whether listening to disordered voices changed arousal as a result of emotional state or cognitive load. Recordings of the second sentence of the “Rainbow Passage” produced by talkers with varying degrees of AdSD served as speech stimuli. Twenty naïve young adult listeners perceptually evaluated these stimuli on the dimensions of vocal strain and listening effort using two separate visual analogue scales. While making the auditory-perceptual judgments, listeners’ pupil characteristics were objectively measured in synchrony with the presentation of each voice stimulus. Data analyses revealed moderate-to-high inter- and intra-rater reliability. A significant positive correlation was found between the ratings of vocal strain and listening effort. In addition, listeners displayed greater peak pupil dilation (PPD) when listening to more strained and effortful voice samples. Findings from this study suggest that when combined with an auditory-perceptual task, non-volitional physiologic changes in pupil response may serve as an indicator of listening and cognitive effort or arousal.

and effort elicited smaller PPDs. Our results revealed a strong, positive correlation between strain and PPD (0.73), and between effort and PPD (0.66) when averaged across all listeners. Given this positive correlation and the dependence of vocal strain on the presence of AdSD spasms and/or momentary aphonic breaks, the averaged pupillary responses can be


Introduction
Dysphonia describes an impairment of the speaking voice [1] which may occur due to a variety of reasons including those secondary to neurological disorders of the central or peripheral nervous system. Spasmodic dysphonia is a neurogenic voice disorder characterized by sudden, involuntary spasms of laryngeal musculature, either adductory, abductory, or in combination. Adductor spasmodic dysphonia (AdSD) is the most common diagnostic subtype which involves abnormal adduction of the vocal folds during voicing that may result in intermittent phonatory breaks that negatively impact the
1. Do normal hearing adult listeners expend effort while listening to intelligible speech samples from talkers with different degrees of AdSD severity?
2. Is there a relationship between the auditory-perceptual ratings of vocal strain and listening effort for these AdSD talker samples?
3. What is the relationship between the pupillometric measures of listening effort and perceived vocal strain and listening effort ratings, when listeners are presented with AdSD speech samples?

Participants
Twenty neurologically and vocally typical adults (11 males, 9 females; age range = 18-29 years; mean: 22.75 years) participated in the current study. The number of recruited participants was based on a power analysis calculated using G*Power (Version 3.1, Heinrich-Heine-Universität, Düsseldorf, Germany, 2007) with an effect size of 0.4. Each listener participated in a single listening session which required approximately 45 min (10-15 min for task instruction, instrumentation adjustment, and calibration, 7-10 min for the experimental protocol, a 10-min break, and 7-10 min for the retest procedure). All participants were native English speakers with self-reported normal hearing. In addition, participants did not have a professional background in speech-language pathology, had no formal exposure to or education related to voice disorders, and had not previously judged disordered speech or voice samples. We also excluded potential participants if they indicated use of medications which are pharmaceutically reported to influence pupil reactions (e.g., Levodopa). This was done by providing a list of medications to participants, who could then exclude themselves accordingly if use occurred. This list was provided to potential participants along with the letter of information. Additionally, potential participants were also excluded if they reported an upper respiratory infection during the week prior to the date of the experiment.
Appl. Sci. 2020, 10, 5907

Auditory Stimuli
Speech samples from 23 talkers (6 males, 17 females) with AdSD from an archive of the Voice Production and Perception Laboratory at the University of Western Ontario were used as stimuli for the current study. All talkers had been diagnosed with AdSD by a board-certified laryngologist. Speech samples were recorded using a professional quality cardioid condenser microphone (SHURE PG81) while the talkers read the Rainbow Passage [19] in their typical voice. Once the passage was collected, the second sentence ("The rainbow is a division of white light into many beautiful colors.") was extracted for use in the current study.
The experimental structure for each trial was as follows. Each trial began with the spoken cue "Please listen to the following stimulus", and this preparatory stimulus was spoken by a normal speaking male adult. This cue lasted three seconds and indicated the impending onset of the stimulus to be judged. Following the cue, one of the 23 sentences from the set of AdSD talkers was presented. One second after the sentence offset, the spoken sentence "Please indicate your ratings after the beep" instructed participants to begin rating strain and listening effort.

Assessment of Strain and Listening Effort
After the presentation of each sentence, listeners used two separate 100 mm long electronic sliders representing visual analog scales (Figure 1a) to rate first, how much strain they thought the talker exhibited and, second, how much effort they had to invest to comprehend the sentence. The end points of the slider for the feature of 'strain' were marked "mild" (value of 1) toward the left side of the scale and "profound" (value of 100) toward the right side. The end points of the slider for 'listening effort' indicated "none" (1) on the left and "extreme" (100) on the right. Listeners could move each slider handle and mark the scale at any point along the continuum where they thought it best indicated the degree of strain or listening effort that represented the stimulus.

Pupillometry Data Recording
Pupil dilation for each participant was recorded continuously using an EyeLink 1000 (SR Research, Ottawa, ON, Canada) eye tracker (Figure 1b,c) in Western's Brain and Mind Institute. Participants were seated comfortably on a stationary chair at the instrumental tower mount. The participant's chin was positioned on a chin rest and their forehead placed against a forehead rest while they faced the monitor in front of them. The device collected the pupil responses of the right eye at a sampling rate of 1000 Hz.

Procedure
On the day of the experiment, participants sat in a softly lit room. The light was consistent throughout the room to prevent reflexive dilation in reaction to changing luminance on the retina [20]. Each listener was individually familiarized with the tasks they would perform. Listeners were briefly trained about the voice dimensions of "strain" and "listening effort" and all were provided written definitions. Strain was explicitly defined to indicate the listener's perception of excessive vocal effort; listeners were asked if they understood the concept of strain relative to the laryngeal force that was exhibited in each talker's sample. Listening effort was defined as the amount of cognitive work that was required while listening to the talker samples. The height and general positioning for each listener were adjusted to provide the best and most direct view of the pupils. Listeners were instructed not to move their head or body or to look down or away from the monitor at any point during the experiment. During the task, they were asked to maintain focus at the center of the monitor and were requested to avoid blinks as much as possible or at least try not to blink excessively when listening to the stimulus. Listeners were asked to wear headphones (Sennheiser HD 205, Wedemark, Germany) and self-adjust the volume to a comfortable listening level before beginning the experiment. Unless listeners are hearing-impaired or the experimental task seeks to address varied signal-to-noise ratios, allowing normal-hearing listeners to adjust their own loudness level during auditory-perceptual experiments is common (e.g., [16]). Thus, control of listening level was unnecessary in this study.
Once the optimum position was reached and the listeners were ready to proceed, calibration of the visual gaze and its validation were performed. During this task, listeners were asked to maintain visual focus on a fixation circle on the screen and to follow it when requested in order to calibrate the eye tracker. Upon obtaining satisfactory calibration and subsequent validation, the auditory-perceptual rating procedure was initiated by the experimenter. The talker stimuli were presented to listeners in randomized order. After listening to each stimulus until the beep, the listeners used the first computer slider to indicate their ratings of the talker's vocal strain and the second slider to indicate their own listening effort. Once the listener completed both ratings for a given stimulus, they clicked the "next" button to hear the next stimulus. Once all stimuli were rated, a message appeared on the screen indicating the end of the test. After the first rating procedure, each listener was given a 10-min break to rest and then the re-test phase of the experiment was undertaken in order to provide test and re-test measures for intra-rater reliability. To synchronize the pupil recordings with the presentation stimulus, markers were embedded into the pupil data stream at the start and end of the stimulus presentation (which included the preparatory and rating auditory prompts at the beginning and end, respectively).

Figure 1. (a) Screenshot of the user interface for assessing the "strain" and "listening effort" using visual analog sliders. The label of the "Start!" button was changed to "Next" once the first stimulus was played back. (b) Participant positioned on the EyeLink 1000 tower in front of the monitor displaying the ratings screen set-up. (c) A secondary display showing the EyeLink 1000 tracker parameters and pupil image, which was monitored by the experimenter to ensure proper data collection during the experiment.

Auditory-Perceptual Data
Once all listeners had completed the experimental task, their ratings of strain and listening effort were first analyzed for reliability. Two sets (i.e., test and retest) of strain and listening effort ratings that could range between 1 and 100 were generated for each talker sample and listener. Intra-rater reliability for both strain and listening effort was obtained for each listener by computing the Pearson correlation coefficient between the test and retest session ratings across all talkers. These correlation values ranged from 0.56 to 0.96 for strain and from 0.58 to 0.90 for listening effort, indicating moderate-to-high intra-rater reliability. Interrater reliability was calculated separately with Cronbach's α in SPSS (Version 24, IBM, Armonk, NY, USA, 2020) for each of the two rated features. The Cronbach's α was 0.98 for strain and 0.97 for listening effort, indicating very high reliability among listeners for the rating tasks.
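The two reliability measures described above can be illustrated with a minimal sketch. It assumes ratings are stored as plain Python lists; the function names and all rating values are hypothetical and are not data from this study, and the study itself used SPSS for Cronbach's α.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(ratings):
    """Cronbach's alpha; ratings[i][j] is listener j's rating of talker i."""
    k = len(ratings[0])  # number of listeners (treated as "items")
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical VAS ratings (1-100) from one listener for five talkers.
test_session = [12, 35, 58, 77, 90]
retest_session = [15, 30, 62, 70, 95]
intra_rater_r = pearson_r(test_session, retest_session)

# Hypothetical ratings: rows are talkers, columns are three listeners.
by_listener = [[12, 20, 10], [35, 40, 30], [58, 55, 65], [77, 80, 70], [90, 85, 95]]
alpha = cronbach_alpha(by_listener)
```

In this sketch, intra-rater reliability is computed once per listener, whereas Cronbach's α treats the listeners as items and is computed once per rated feature.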
The strain and listening effort ratings for each AdSD talker were subsequently averaged across all listeners and the test-retest sessions. These averaged ratings along with their standard errors are displayed in Figure 2a. It can be seen from Figure 2a that Talkers 8 and 10 were rated to have the least and Talkers 1 and 18 were rated to exhibit the highest degrees of both perceived strain and effort. In addition, the strain and listening effort ratings appeared to vary in a similar pattern across talkers. This association is confirmed through the scatter plot shown in Figure 2b, where a linear regression fit to the strain-listening effort data accounted for 80% of the variance. In addition, linear mixed-effects models (LMMs) were developed to further probe the relationship between strain and listening effort. The LMMs were implemented in the R statistical software (v4.0.2, R Foundation for Statistical Computing, Vienna, Austria, 2020) with the nlme package. The basic LMM included listening effort as the dependent variable, strain as the fixed effect, and the talker and listener variables as random effects. Results showed that strain was a significant predictor of listening effort (F(1, 298.71) = 184.04, p < 0.001), with a correlation of 0.53 (t(298.7087) = 13.566, p < 0.001). A more complex LMM, which allowed different slope coefficients for each talker, revealed statistically similar results (χ²(1) = 0.244, p = 0.622). Thus, these results indicate a consistent relationship between the vocal strain and listening effort ratings.
A repeated measures ANOVA was conducted to statistically assess the effects of the auditory-perceptual features (i.e., vocal strain and listening effort), talkers, and any potential interaction between the features and talkers. The a priori significance level was set to 0.05 for all statistical tests, and the Greenhouse-Geisser correction was applied when the sphericity condition was violated. Significant effects were found for the auditory-perceptual features (F(1, 19) = 37.13, p < 0.001, ηp² = 0.662) and talkers (F(6.41, 121.73) = 72.08, p < 0.001, ηp² = 0.791). In addition, a significant interaction between auditory-perceptual features and talkers was found (F(4.04, 76.66) = 12.88, p < 0.001, ηp² = 0.404). Post-hoc comparisons using the Bonferroni correction revealed that Talkers 5 and 20 were rated differently on the auditory-perceptual features than the others. Unlike the rest of the talkers, Talkers 5 (red data point, Figure 2b) and 20 (green data point, Figure 2b) had higher listening effort ratings relative to their corresponding strain ratings.
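As a consistency check on the effect sizes reported above, partial eta squared can be recovered from an F statistic and its degrees of freedom through the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below is illustrative only; the function name is hypothetical, and the inputs are the rounded values quoted in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics and (Greenhouse-Geisser corrected) degrees of freedom as reported above.
features = partial_eta_squared(37.13, 1, 19)
talkers = partial_eta_squared(72.08, 6.41, 121.73)
interaction = partial_eta_squared(12.88, 4.04, 76.66)

for name, value in [("features", features), ("talkers", talkers), ("interaction", interaction)]:
    print(f"{name}: {value:.3f}")
```

Because the Greenhouse-Geisser correction scales both degrees of freedom by the same factor, the identity gives the same value with corrected or uncorrected degrees of freedom.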

Pupillometry Data
In this study, pupil size was parameterized by the pupil diameter estimates returned by the eye tracker. Raw pupil diameter data were recorded throughout the experiment and had to be processed in several steps before final visualization and analysis. The recorded time stamps for all stimuli were normalized first so that the starting point of each sentence was at 0 s. Given the nature of the experiment, eye blinks, or changes due to factors other than the listening task, were potential confounds that needed to be identified. Pupil tracks with shorter duration than the playback stimulus were discarded, as they signify loss of synchronous pupil data. Quick blinks (<125 ms) were identified, removed, and interpolated (linear interpolation began roughly 50 ms before the blink and ended at least 150 ms after the blink) without changing the overall pattern of the tracking sequence. Finally, the tracks were smoothed by an 11-point moving average filter. This pre-processing of pupil tracks resulted in the exclusion of approximately 13% of the tracks due to dropouts, too many variations, or long blinks. This process was required to eliminate the risk of data distortion.
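The blink handling and smoothing steps described above can be sketched as follows. This is a minimal illustration rather than the processing code used in the study: it assumes that blink samples are marked as None (a hypothetical convention), that a track with any blink of 125 ms or longer is discarded, and that the moving-average window simply shrinks at the track edges. All function names are hypothetical.

```python
def interpolate_blinks(track, fs=1000, max_blink_ms=125, pre_ms=50, post_ms=150):
    """Linearly interpolate short blinks; return None if the track must be discarded.

    Assumes blink samples are marked as None (a hypothetical convention).
    """
    out = list(track)
    n = len(out)
    pre, post = pre_ms * fs // 1000, post_ms * fs // 1000
    i = 0
    while i < n:
        if track[i] is None:
            j = i
            while j < n and track[j] is None:
                j += 1  # j is the first valid sample after the blink
            if (j - i) * 1000 / fs >= max_blink_ms:
                return None  # long blink: discard the whole track
            start, end = max(i - pre, 0), min(j + post, n - 1)
            if track[start] is None or track[end] is None:
                return None  # no valid anchor samples to interpolate between
            for k in range(start, end + 1):
                w = (k - start) / (end - start)
                out[k] = track[start] + w * (track[end] - track[start])
            i = end + 1
        else:
            i += 1
    return out


def moving_average(track, width=11):
    """Centered moving average; the window shrinks near the track edges."""
    half = width // 2
    smoothed = []
    for k in range(len(track)):
        window = track[max(k - half, 0):k + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical 1000 Hz track: constant diameter of 100 with a 60 ms blink.
raw = [100.0] * 400 + [None] * 60 + [100.0] * 540
clean = moving_average(interpolate_blinks(raw))
```

At the 1000 Hz sampling rate used here, one sample corresponds to 1 ms, so the 11-point filter spans roughly 11 ms of data.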
The validated and pre-processed pupil diameter tracks associated with each talker stimulus were averaged across all listeners and the test-retest sessions. These averaged pupil responses are plotted in Figures 3 and 4 to provide a visual representation of the time course of pupil dilation during the presentation of the talker stimuli. Figure 3a depicts the speech waveform of the Talker 1 stimulus presented to all listeners, while Figure 3b displays the averaged pupil track elicited while listening to this stimulus. The shaded region in Figure 3b represents the 95% confidence interval in the pupil track. As described earlier, the first three seconds of the waveform included the auditory prompt, and the last second of this prompt was designated as the baseline period. Prior to averaging, all listeners' individual pupil tracks were normalized by subtracting the track mean during the baseline period from pupil values at each time point.
In the current study, we focused on the peak pupil dilation (PPD) as a dependent measure. From each baseline-normalized average pupil track, the PPD was determined as the maximum pupil diameter during the presentation time of the talker speech sample following the baseline period. It can be observed from Figure 3b that the PPD for the Talker 1 speech sample is located at a latency of approximately 3000 ms from the end of the baseline period (i.e., from the start of playback of the Talker 1 stimulus).
Figure 3. (a) Waveform associated with the Talker #1 stimulus. The first three seconds comprise the auditory prompt "please listen to the following stimulus", while the following segment is the sentence "the rainbow is a division of white light into many beautiful colors" spoken by Talker #1. (b) The time course of the pupil diameter in response to the above stimulus, averaged across listeners and test sessions after baseline normalization. The shaded region represents the 95% confidence interval around the averaged pupil track. Note that the pupil diameter is in arbitrary units set by the EyeLink 1000 system.
Figure 4a displays the averaged pupil tracks for all talkers after baseline normalization. Salient features from Figure 4a include the differences in the temporal pattern of the pupil tracks and the PPD value for different talker stimuli, and the location of PPD between 2000 and 3500 ms after the initiation of the talker stimulus. To further illustrate how talkers who induced high and low PPD results appear relative to each other, the pupil tracks of four talkers were isolated and plotted separately in Figure 4b. Two of the tracks in Figure 4b were elicited while listening to talker stimuli that were judged to exhibit the highest vocal strain and required the most listening effort (Talkers 1 and 18). The other two tracks belonged to talkers who were rated as least strained and required the least listening effort (Talkers 8 and 10). It is evident that talker speech samples that resulted in the highest strain/effort ratings also resulted in the highest PPD values.
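The baseline normalization and PPD extraction described above can be sketched as follows. The sketch assumes a 1000 Hz track whose first three seconds hold the auditory prompt, with the last second of the prompt (2 to 3 s) serving as the baseline; the function names and the synthetic tracks are hypothetical and do not reproduce data from this study.

```python
def baseline_normalize(track, fs=1000, baseline_start_s=2.0, baseline_end_s=3.0):
    """Subtract the mean of the baseline period from every sample of the track."""
    a, b = int(baseline_start_s * fs), int(baseline_end_s * fs)
    baseline = sum(track[a:b]) / (b - a)
    return [v - baseline for v in track]


def average_tracks(tracks):
    """Sample-wise average across tracks, truncated to the shortest track."""
    n = min(len(t) for t in tracks)
    return [sum(t[k] for t in tracks) / len(tracks) for k in range(n)]


def peak_pupil_dilation(normalized, fs=1000, stimulus_onset_s=3.0):
    """Return (PPD, latency in ms) for the segment after the baseline period."""
    onset = int(stimulus_onset_s * fs)
    segment = normalized[onset:]
    peak = max(segment)
    latency_ms = segment.index(peak) * 1000 / fs
    return peak, latency_ms


def make_track(resting, gain):
    """Synthetic track: flat during the prompt, triangular dilation afterwards."""
    prompt = [resting] * 3000
    sentence = [resting + gain * max(0.0, 1 - abs(k - 3000) / 1000) for k in range(4000)]
    return prompt + sentence


# Two hypothetical listeners with different resting pupil sizes (arbitrary units).
tracks = [make_track(100.0, 50.0), make_track(200.0, 30.0)]
normalized = [baseline_normalize(t) for t in tracks]
ppd, latency_ms = peak_pupil_dilation(average_tracks(normalized))
```

Normalizing each track before averaging removes differences in resting pupil size between listeners, so the average reflects the change relative to baseline rather than absolute diameter.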
Figure 4a, which once again highlights the talker-dependent distribution of PPD values. To understand the relationship between the auditory-perceptual rating data and the PPD values, the scatter plots between the PPD and the strain and listening effort ratings are depicted in Figure 5b,c, respectively.
A trend for greater pupil dilation when listening to talker samples with higher perceived levels of strain and listening effort is evident in these scatter plots. Statistically significant positive Pearson correlation coefficients of 0.73 and 0.66 (both p < 0.001) were found between the strain ratings and PPD values, and between the listening effort ratings and PPD values, respectively. Linear regression fits explained 54% and 43% of the variability in the averaged strain ratings vs. averaged PPD values, and the averaged listening effort ratings vs. averaged PPD data, respectively.
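For a straight-line fit with a single predictor, the proportion of variance explained equals the square of the Pearson correlation coefficient. Squaring the reported coefficients gives 0.73² ≈ 0.53 and 0.66² ≈ 0.44, which is in line with the reported 54% and 43% if the coefficients were rounded before reporting. The sketch below illustrates the identity; the data values and function names are hypothetical and are not taken from this study.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def linear_fit_r_squared(x, y):
    """R^2 of an ordinary least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical talker-averaged strain ratings and PPD values (arbitrary units).
strain = [10, 25, 40, 55, 70, 85]
ppd = [12, 30, 22, 48, 41, 60]

r = pearson_r(strain, ppd)
r_squared = linear_fit_r_squared(strain, ppd)
```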
However, regression analyses between the auditory-perceptual ratings and PPD data at the individual listener level did not reveal similar results. The slopes of the regression lines fit to the individual strain vs. PPD and listening effort vs. PPD data were not statistically different from zero for any listener. Plausible reasons for this lack of significance include: (a) greater variability in the PPD data than the auditory-perceptual data for each individual listener and (b) missing PPD data associated with some talkers (due to discarding of invalid pupil tracks) for some listeners, further contributing to the PPD data variability. Therefore, the relationship between auditory-perceptual ratings and the pupil dilation was only evident at a group level, and not at the individual listener level.

Discussion
This study investigated auditory-perceptual and pupillometric evaluation of speech samples produced by talkers with AdSD. This involved ratings of the perceived degree of vocal strain exhibited by AdSD talkers and the perceived listening effort by naïve, normal hearing listeners. In addition, listeners' pupillary responses while listening to the AdSD speech samples were collected and analyzed. The AdSD speech samples utilized in this study varied widely in severity in order to capture potentially differential responses to the stimuli by listeners. Salient results from this study are discussed below.

Listener Ratings of Strain and Effort
Twenty normal hearing listeners rated speech samples from 23 AdSD talkers on a scale of 1-100 for two auditory-perceptual dimensions: vocal strain and listening effort. Reliability analyses of the rating data revealed: (a) moderate-to-strong intra-rater reliability, with test-retest rating correlations ranging from 0.56 to 0.96 for strain and from 0.58 to 0.97 for listening effort and (b) excellent inter-rater reliability, with Cronbach's α of 0.98 and 0.97 for strain and listening effort, respectively. These reliability results are consistent with previous studies by Nagle and Eadie [14,16] investigating the relationship between voice quality attributes and listening effort, albeit with different voice disorder populations (i.e., tracheoesophageal and electrolarynx voices, respectively).
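For readers unfamiliar with the inter-rater statistic, the following is a minimal sketch of Cronbach's α, treating each listener as an "item" and each talker as an observation. The function name `cronbach_alpha` and the example ratings are hypothetical; the sketch assumes a complete rating matrix with no missing values and is not the authors' analysis code.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for ratings[r][t] = rating by rater r of talker t.

    alpha = k / (k - 1) * (1 - sum of per-rater variances / variance of summed ratings)
    """
    k = len(ratings)                               # number of raters
    rater_vars = [variance(row) for row in ratings]
    totals = [sum(col) for col in zip(*ratings)]   # summed rating per talker
    return (k / (k - 1)) * (1 - sum(rater_vars) / variance(totals))

# Hypothetical ratings from three raters for four talkers (scale of 1-100).
example = [
    [20, 45, 70, 90],
    [25, 40, 75, 85],
    [15, 50, 65, 95],
]
print(f"alpha = {cronbach_alpha(example):.2f}")
```

Values close to 1 indicate that raters order the talkers in much the same way, which is the pattern reported above.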
Data from the auditory-perceptual evaluation of samples revealed that the talkers exhibited various degrees of vocal strain. For example, some of the speech stimuli were rated as less strained (e.g., Talkers 4, 8, 10, and 15) compared to others that were consistently judged as exhibiting increased levels of strain (e.g., Talkers 1, 2, 9, 18, and 21). More importantly, the auditory-perceptual data demonstrated that the higher the ratings for strain, the more listening effort was expended. For instance, Talkers 8 and 10 were rated the lowest in terms of strain and were also judged to require the lowest degree of listening effort; in contrast, Talkers 1 and 18 were judged as the most strained and were evaluated as requiring the most listening effort. Across the 23 speech stimuli, the averaged vocal strain and listening effort ratings exhibited a strong, statistically significant positive correlation (r = 0.90). To the best of our knowledge, no study to date has evaluated perceived listening effort in the context of talkers with AdSD, and our results indicate that increased listening effort is required as AdSD severity increases. Furthermore, given that the speech stimuli used in this study were highly intelligible, these results are consistent with previous findings suggesting that the challenges faced by listeners go beyond those related to audibility [13] or intelligibility [21]. Such perceptual challenges increase when more cognitive effort is expended to channel attention and concentration in order to achieve a listening goal. This is particularly important when the quality of an auditory signal is distanced from optimal [13], as is the case with speech samples from talkers with greater AdSD severity.
As shown in Figure 2a, of the 23 AdSD talkers evaluated, 21 were judged to have a higher strain rating relative to the listening effort rating, a finding that was not unexpected. Interestingly, listeners rated stimuli from Talkers 5 and 20 higher for listening effort than for strain (see Figure 2b). Closer inspection of the speech samples from these talkers revealed that their voices are characterized more by increased breathiness than by strain. Thus, the auditory-perceptual ratings for these two talkers (5 and 20) suggest that listeners were in fact attending to the rating task and rated the listening effort dimension holistically. These stimuli were not perceived to be highly strained, but they still deviated from normal, which subsequently required increased listening effort.

Pupil Dilation in Response to Vocal Samples
The other aim of this study was to examine the relationship between the pupil dilation in response to listening to AdSD speech samples and the perceived listening effort. To our knowledge, this is the first study to empirically evaluate pupil responses and the amount of effort expended while listening to disordered speech samples in general, and AdSD speech samples in particular. The goal herein was to explore the variability in processing effort as indicated by the peak pupil dilation (PPD). Pupil size is reported to be impacted by cognitive load and more specifically, language processing tasks such as hearing and reading words [12,17,22] or sentences [17,23]. The present aim was to determine whether a sample with increased strain would be associated with an increased PPD with respect to baseline, which would be consistent with increases in the amount of cognitive resources utilized by a listener in a speech reception task [24]. Processing demand is reported to be imposed by either stimulus factors such as linguistic complexity or noise, or as addressed in our study, the quality of the voice sample being assessed. Additionally, it is possible that listener factors such as the capacity of working memory or hearing impairment will influence both perceptual ratings and PPD. Thus, consideration of both speaker and listener factors is essential as they are reported to influence processing demands [25,26].
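The idea of PPD "with respect to baseline" can be sketched as the maximum pupil size after stimulus onset minus the mean pupil size in a pre-stimulus window. The function name `peak_pupil_dilation`, the baseline length, the units, and the use of `None` for invalid samples (e.g., blinks) are all assumptions made for illustration; the study's actual preprocessing is not restated in this section.

```python
def peak_pupil_dilation(trace, n_baseline):
    """Peak dilation after stimulus onset relative to the mean pre-stimulus baseline.

    trace: sequence of pupil sizes; the first n_baseline samples precede stimulus onset.
    None marks an invalid sample and is ignored. Returns None if either window
    contains no valid samples.
    """
    baseline = [s for s in trace[:n_baseline] if s is not None]
    response = [s for s in trace[n_baseline:] if s is not None]
    if not baseline or not response:
        return None
    return max(response) - sum(baseline) / len(baseline)

# Hypothetical trace (arbitrary units): three baseline samples, then the response.
trace = [3.0, 3.0, None, 3.2, 3.6, 3.4]
ppd = peak_pupil_dilation(trace, n_baseline=3)
print(f"PPD = {ppd:.2f}")
```

Subtracting the baseline makes PPD values comparable across trials that start from different resting pupil sizes.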
The averaged pupil track profiles shown in Figures 3 and 4 are consistent with previous studies investigating the relationship between pupillometry and speech perception in noisy environments [27]. A closer assessment of the pupillary data revealed that stimuli from the two talkers (1 and 18) that received the highest perceptual ratings for strain and listening effort also elicited the highest averaged PPDs. Stimuli from talkers rated lower on strain and effort elicited smaller PPDs. Our results revealed a strong, positive correlation between strain and PPD (0.73), and between effort and PPD (0.66) when averaged across all listeners. Given this positive correlation and the dependence of vocal strain on the presence of AdSD spasms and/or momentary aphonic breaks, the averaged pupillary responses can be deemed to have been evoked in response to the unique quality of the AdSD speech stimuli. The present findings are consistent with those reported by Kramer et al. [12,16,28] and Zekveld et al. [12,17,29], who examined listening effort through pupillometry and reported larger mean PPD for their normal hearing listeners in low-intelligibility than in high-intelligibility conditions, ascribing larger mental effort to such challenging listening conditions.
All participants were tested with all voices in random order, and then tested with all voices again in a different order. Both subjective and pupillometric responses were compared across test and retest. The test-retest correlation coefficients were generally high for the auditory-perceptual ratings. For the test-retest pupil dilation comparison, data from Talker 1 (who was one of the talkers with the highest level of strain and for whom listeners exhibited a high PPD value) were examined more closely in various test-retest presentation orders for a few listeners. These presentation orders included seventh (test) and 22nd (retest) (Figure 6a), and 21st (test) and 11th (retest) (Figure 6b). In all these instances, the first presentation elicited greater PPD than did the second presentation of the same stimulus.
The decrease in PPD values may be because listeners had already habituated to the stimulus, or it may be a consequence of fatigue or boredom. It is known that pupil dilation is influenced by emotional valence, which represents the attractiveness (positive affect) or aversiveness (negative affect) of an auditory stimulus [30]. Evidence exists for increased pupil dilation when listening to auditory stimuli with negative affective connotations [30,31]. As such, emotional valence may be a contributing factor to our pupil data, especially for our naïve listeners, who were exposed to abnormal voice samples for the first time and may have perceived them to be aversive. The fact that the PPD, albeit pronounced, is reduced in magnitude on the second presentation of the Talker 1 stimulus perhaps suggests that repeated exposure may reduce the negative emotional valence. We acknowledge that this explanation is speculative, but it is in line with previous studies that report habituation secondary to repetition and exposure [32,33]. Furthermore, this explanation is consistent with findings from Raman et al. [34], who reported that listening effort ratings from listeners who are familiar with and exposed to abnormal voice samples (in their case, esophageal voices) are significantly lower than similar ratings from naïve listeners. Given its speculative nature, further research is warranted to understand the relative contribution of cognitive load and emotional valence to pupillary responses when listening to disordered voice and speech samples.
To summarize, our data on AdSD samples support the notion that when confronted with stimuli characterized by an abnormal vocal quality, listeners, on average, demonstrate a physiologic response that corresponds to their auditory-perceptual assessments. These findings provide valuable insights into the demands of effective verbal communication in general, and the challenges that may occur in the presence of disordered speech or an abnormal vocal quality specifically.
While the present data offer valuable insights into various aspects of auditory-perceptual evaluation of voice quality, some limitations deserve mention. None of the talker samples used in the present study were characterized by reduced intelligibility; rather, the speech samples differed in the consistency and flow of speech production. Therefore, future research might investigate the relationship between auditory-perceptual features and pupil dilation when listeners make auditory-perceptual judgments of unique sentence stimuli, characterized by different degrees of AdSD severity, that simultaneously require comprehension (i.e., intelligibility). Furthermore, our study only assessed ratings of strain and listening effort in relation to pupillary responses from naïve normal hearing listeners. Gathering physiological responses along with subjective auditory-perceptual ratings from experienced listeners could be complementary. In fact, it would be interesting to observe the PPDs of experienced listeners who have ample exposure to disordered voices through their profession. Our listeners also rated talker stimuli based on their individual internal standards. While excellent reliability was documented in our study, it would be valuable to determine whether adding perceptual anchors might influence the ratings and concurrent PPD values. In addition, no acoustic measures were performed on our AdSD audio samples. Studies designed to evaluate potential correlations among acoustic measures of dysphonic speech, auditory-perceptual ratings, and pupillometry would be a valuable area for future work. Finally, the temporal gap between test and retest was relatively short (10-15 min). Future studies might assess longer gaps between test and retest to identify whether the effect of exposure to the stimuli fades and whether PPD is altered following a longer break.

Conclusions
This study addressed auditory-perceptual evaluation of features of voice quality in relation to pupil dilation. The present data offer important observations and provide valuable insights into how naïve listeners rate voice quality (more specifically, vocal strain) along with their simultaneous evaluation of listening effort. First, listeners consistently assigned greater listening effort to voice samples that were judged to exhibit more strain. Second, because listening effort may reflect multiple perceptual factors, a disordered voice might be rated relatively lower on strain but higher on listening effort due to the overall, composite quality of the voice. Given the nature of voice quality deviation in those diagnosed with AdSD, this finding was not unexpected. Third, as in previous studies, intelligible voices were rated as demanding varying degrees of listening effort, which supports the view that listening effort goes beyond simply understanding what is being said. Fourth, the stimuli that were subjectively rated by listeners as being more strained were also generally observed to provoke an increase in PPD. This finding suggests a potential relationship of the listening task to aspects of cognitive load and listening effort. It is, however, important to acknowledge that this effect was observed at a group level and was also found to decrease with exposure and habituation over the course of the experiment.