Acoustic Identification of the Voicing Boundary during Intervocalic Offsets and Onsets based on Vocal Fold Vibratory Measures

Methods for automating relative fundamental frequency (RFF)—an acoustic estimate of laryngeal tension—rely on manual identification of voiced/unvoiced boundaries from acoustic signals. This study determined the effect of incorporating features derived from vocal fold vibratory transitions for acoustic boundary detection. Simultaneous microphone and flexible nasendoscope recordings were collected from adults with typical voices (N=69) and with voices characterized by excessive laryngeal tension (N=53) producing voiced–unvoiced–voiced utterances. Acoustic features that coincided with vocal fold vibratory transitions were identified and incorporated into an automated RFF algorithm (“aRFF-APH”). Voiced/unvoiced boundary detection accuracy was compared between the aRFF-APH algorithm, a recently published version of the automated RFF algorithm (“aRFF-AP”), and gold-standard, manual RFF estimation. Chi-square tests were performed to characterize differences in boundary cycle identification accuracy among the three RFF estimation methods. Voiced/unvoiced boundary detection accuracy significantly differed by RFF estimation method for voicing offsets and onsets. Of 7721 productions, 76.0% of boundaries were accurately identified via the aRFF-APH algorithm, compared to 70.3% with the aRFF-AP algorithm and 20.4% with manual estimation. Incorporating acoustic features that corresponded with voiced/unvoiced boundaries led to improvements in boundary detection accuracy that surpassed the gold-standard method for calculating RFF.


Introduction
Excessive and/or imbalanced laryngeal muscle forces have been implicated in over 65% of individuals with voice disorders [1]. Laryngeal hypertonicity is a known characteristic of many functional, structural, and neurological disorders, including adductor-type laryngeal dystonia [2,3], hyperfunctional voice disorders [e.g., muscle tension dysphonia, nodules; 4], and Parkinson's disease [5]. Despite this prevalence, current clinical voice assessments fall short in objectively quantifying the degree of laryngeal muscle tension. For instance, auditory-perceptual judgments are a gold-standard technique used to assess voice quality, but the reliability and validity of these judgments remain questionable [6,7]. Likewise, manual laryngeal palpation techniques can be useful for evaluating tension of the extrinsic laryngeal and other superficial neck musculature; however, these methods do not assess the intrinsic laryngeal muscles and, moreover, are subject to the skill and experience of the practitioner [8]. Much of the research surrounding laryngeal hypertonicity has therefore turned to acoustic analyses, for which data can be non-invasively collected via a microphone. Acoustic signals can provide insight into characteristics of the glottal source (e.g., timing, frequency, and amplitude of vocal fold vibration). To date, however, a single acoustic indicator specific to laryngeal muscle tension has not been identified.
In recent years, relative fundamental frequency (RFF) has been suggested as an acoustic indicator of laryngeal muscle tension. Estimated from short-term changes in instantaneous f_o during intervocalic offsets and onsets, RFF is a non-invasive, objective measure that shows promise in estimating the degree of baseline laryngeal muscle tension. RFF can be calculated from a vowel-voiceless consonant-vowel (VCV) production as in Fig. 1. The instantaneous f_o of the ten voiced cycles preceding ("voicing offset") and following ("voicing onset") the voiceless consonant is normalized to the steady-state f_o of the adjacent vowel. Lower average RFF values have been reported in speakers with vocal hyperfunction (VH) [9,12,13], Parkinson's disease [14,15], and adductor-type laryngeal dystonia [11]. Specifically, those with voice disorders characterized by excessive laryngeal muscle tension tend to exhibit lower average RFF values, perhaps due to increased baseline laryngeal muscle tension that impedes their ability to leverage tension as a strategy for devoicing (voicing offset) and reinitiating voicing (voicing onset). RFF also normalizes (increases) in individuals with VH following voice therapy [i.e., a functional change; 9,16], but not in individuals with vocal nodules or polyps following therapy [i.e., structural intervention; 16]. This suggests that RFF reflects functional voice changes rather than structural voice changes. It has also been demonstrated that RFF captures within-speaker changes in vocal effort [17], or the perceived exertion of a vocalist in response to a communication scenario [i.e., vocal demand; 18], as well as indirect kinematic estimates of laryngeal tension [19].
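As a concrete illustration of the measure, the semitone normalization underlying RFF can be sketched as follows. This is a minimal sketch assuming each cycle's instantaneous f_o has already been extracted and that a steady vowel cycle supplies the reference f_o; the function name and example values are illustrative, not taken from the published algorithms.

```python
import math

def rff_semitones(cycle_fo, reference_fo):
    """Express an instantaneous fo relative to a steady-state reference fo,
    in semitones (ST); 12 ST correspond to one octave."""
    return 12.0 * math.log2(cycle_fo / reference_fo)

# Hypothetical voicing-offset cycles drifting below a 200-Hz steady-state fo,
# as is typical when voicing terminates before a voiceless consonant
offset_fos = [200.0, 198.0, 190.0, 180.0]
rff_values = [rff_semitones(f, 200.0) for f in offset_fos]
```

In this toy example the first cycle matches the reference (0 ST) and later cycles yield increasingly negative RFF values, mirroring the f_o drop at voicing offset described above.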
Despite the promise of RFF as an acoustic estimate of laryngeal muscle tension, this measure requires refinement before it will be appropriate for routine clinical use.
RFF can currently be calculated in two ways: manually or semi-automatically. The current gold-standard method of computing RFF is through manual estimation techniques using Praat software [20]. Due to the time- and training-intensive procedures that are necessary to reliably perform manual RFF estimation, a semi-automated RFF algorithm, called the aRFF algorithm, was developed [21,22]. Both manual and aRFF estimation techniques use autocorrelation to estimate f_o, comparing a segment of the voice signal with a copy of itself offset by a candidate period. Despite being a relatively fast f_o estimation method, autocorrelation assumes signal periodicity and f_o stationarity, requiring 2-3 complete pitch periods to examine the physiological f_o ranges encountered in speech [23]. These characteristics are not ideal for estimating f_o during the voicing offset and onset transitions examined in RFF, which specifically capture rapid changes in f_o. Indeed, Vojtech, et al. [24] compared the effects of different f_o estimation techniques on resulting RFF estimates, determining that f_o estimation via the Auditory-SWIPE′ algorithm (an algorithm that estimates the f_o of an input sound by first filtering the sound using an auditory-processing front-end and then comparing the spectrum of the filtered sound to a sawtooth waveform) [25] led to smaller errors between manual and semi-automated RFF estimates compared to autocorrelation. The results of this work led to a refined version of the aRFF algorithm that employs Auditory-SWIPE′ for f_o estimation and uses the acoustic metric pitch strength [26] to account for differences in voice sample characteristics (e.g., recording location, overall severity of dysphonia). Incorporating both Auditory-SWIPE′ and pitch strength-based sample categories, this algorithm is called aRFF-AP.
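The autocorrelation approach described above can be sketched as follows; the function and parameter names are illustrative (not those of Praat or the aRFF code), and the frame is a synthetic tone rather than real speech.

```python
import numpy as np

def fo_autocorrelation(frame, fs, fmin=75.0, fmax=500.0):
    """Estimate fo as the autocorrelation peak within a physiological lag
    range; reliable only when the frame spans >= 2-3 complete pitch periods,
    which is the stationarity assumption discussed in the text."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag range for fmax..fmin
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return fs / lag

fs = 30000                           # Hz, matching the 30-kHz digitization
t = np.arange(int(0.04 * fs)) / fs   # 40-ms frame (6 periods at 150 Hz)
frame = np.sin(2 * np.pi * 150.0 * t)
estimated_fo = fo_autocorrelation(frame, fs)
```

For a stationary 150-Hz tone this recovers the f_o closely; during the rapid f_o transitions at voicing offset and onset, the single-lag-peak assumption breaks down, which motivates the Auditory-SWIPE′ substitution described above.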
In both manual and semi-automated RFF estimation methods, the most tedious step of the RFF computational process is identifying the boundary between voiced and unvoiced speech. As RFF depends on the termination and initiation of voicing within a VCV production, these points in time must be identified from the acoustic signal prior to collecting vocal cycles for RFF estimation. Manual RFF estimation relies on trial-and-error techniques of trained technicians to locate this boundary [requiring 20-40 minutes of analysis time per RFF estimate; 11], whereas the semi-automated RFF algorithms (aRFF, aRFF-AP) take advantage of a faster, more objective approach. Specifically, three acoustic features (normalized peak-to-peak amplitude, number of zero crossings, and waveform shape similarity) are examined over time to locate the voiced/unvoiced boundary at the point where a state transition in feature values occurs. The aRFF and aRFF-AP algorithms assume that each acoustic feature will exhibit a substantial change in feature values over time and that this change will occur at the boundary between voiced and unvoiced segments.
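The three features can be illustrated per analysis window as follows. These are simplified stand-ins for the published definitions; in particular, waveform shape similarity is approximated here as the correlation between windows one pitch period apart, an assumption for illustration only.

```python
import numpy as np

def window_features(window, prev_window):
    """Illustrative per-window versions of the three aRFF boundary features
    (the published implementations differ in detail)."""
    # Peak-to-peak amplitude: large for voicing, small for frication
    p2p = float(np.ptp(window))
    # Number of zero crossings: small for voicing, large for frication
    zc = int(np.sum(np.abs(np.diff(np.sign(window))) > 0))
    # Waveform shape similarity: correlation with a window one period earlier
    n = min(len(window), len(prev_window))
    sim = float(np.corrcoef(window[:n], prev_window[:n])[0, 1])
    return p2p, zc, sim

fs = 8000
f0 = 160.0                                # pitch period = 50 samples at 8 kHz
n = np.arange(300)
tone = np.sin(2 * np.pi * f0 * n / fs)
win_a, win_b = tone[:250], tone[50:300]   # voiced windows one period apart
rng = np.random.default_rng(0)
noise = 0.05 * rng.standard_normal(250)   # weak broadband, frication-like

p2p_v, zc_v, sim_v = window_features(win_b, win_a)
p2p_n, zc_n, sim_n = window_features(noise, win_a)
```

As assumed by the algorithms, all three features shift abruptly between the voiced (tone) and unvoiced (noise) windows: amplitude and similarity drop while zero crossings rise.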
The methodology used to identify the voiced/unvoiced boundary in semi-automated RFF estimation requires further inquiry. First, it is unclear whether the three features used in the aRFF and aRFF-AP algorithms are the best choice of acoustic features to mark the initiation and termination of vibration, since voiced/unvoiced boundary detection accuracy has not been formally assessed among the three features or compared to that of other acoustic features (e.g., cepstral peak prominence). Second, there is uncertainty in how the boundary identified using these acoustic features corresponds to the physiological initiation or termination of vocal fold vibration. This is because both manual and semi-automated RFF methods rely only on the acoustic signal, which provides indirect information about the vibration of the vocal folds and may be masked by supraglottic resonances, coarticulation (e.g., due to concurrent aspiration and frication), and radiation [27]. Thus, in addition to a lack of f_o stationarity during vocal fold offset and onset transitions, signal masking adds to the complexity of identifying the initiation or termination of vocal fold vibration. The uncertainties in acoustic boundary cycle identification warrant further investigation to (i) inform the implementation of acoustic features used in the semi-automated RFF algorithm and (ii) validate manual RFF estimation as a gold standard that accurately represents changes in instantaneous f_o during voicing offsets and onsets.
High-speed videoendoscopy (HSV) may be a useful technique to examine the relationship between the acoustic signal and vocal fold vibration. By sampling at frame rates much greater than typical modal (i.e., the vocal register most typically used during conversational speech) f_o values, HSV can capture cycle-to-cycle changes in vocal fold vibratory behavior during voicing offsets and onsets [28][29][30]. Indeed, prior work has employed HSV to investigate voicing offsets and onsets relative to the acoustic signal: Patel, et al. [31] acquired simultaneous recordings via a microphone and rigid laryngoscope as vocally healthy speakers repeated /hi hi hi/ at their typical pitch and loudness. The results of this work indicated a tight coupling between the acoustic signal and the physiological vibrations of the vocal folds; however, this relationship may not be generalizable to the acoustic outputs typically examined with RFF. Transitioning between a vowel and the voiceless glottal fricative, /h/, may require different mechanisms than transitioning between a vowel and a voiceless obstruent produced via oral constriction (e.g., /f/, /s/, /ʃ/, /p/, /t/, /k/). For instance, Löfqvist, et al. [32] observed that glottal vibrations continued without interruption through the /h/ during the production of /aha/ sequences by some speakers; these vibrations were not present for voiceless consonant productions of /asa/ or /apa/. Such differences could ultimately affect the relationship between oscillatory events obtained from the laryngoscopic images and from the acoustic signal. Additionally, the participants in Patel, et al. [31] were limited to adults with typical voices, whereas the target population for employing RFF in clinical voice assessments includes speakers with voice disorders characterized by excessive laryngeal muscle tension. As such, additional investigations are needed to examine voicing offsets and onsets in the context of speakers with and without voice disorders.
To carry out the present investigation, speakers with typical voices and speakers with voices characterized by excessive laryngeal muscle tension were enrolled across a wide age range to investigate the relationship between acoustic features and vocal fold vibratory characteristics during intervocalic voicing offsets and onsets. Acoustic features were identified that corresponded with the physiological initiation and/or termination of vocal fold vibration. The aRFF-AP algorithm was then further refined by modifying algorithmic parameters corresponding to the HSV-tuned acoustic feature set ("aRFF-APH"). Voiced/unvoiced boundary detection accuracy was computed for each of the three RFF methods (manual estimation, aRFF-AP, aRFF-APH) relative to the actual vocal fold vibratory features identified via HSV. It was hypothesized that incorporating features related to the onset and offset of vocal fold vibration would improve the accuracy of acoustic voiced/unvoiced boundary detection (aRFF-APH) over methods that did not leverage these tuned features (manual, aRFF-AP).

Participants
Sixty-nine individuals with typical voices and 53 individuals with voices characterized by excessive laryngeal muscle tension (either a hyperfunctional voice disorder [HVD] or Parkinson's disease [PD]) participated. A speech-language pathologist specializing in voice disorders assessed the overall severity of dysphonia (OS; 0-100) of each participant using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V). The average OS for participants with typical voices was 8.3 (SD = 6.7, range = 0.6-34.2), and that of participants with either an HVD or PD was 15.6 (SD = 12.4, range = 0.9-51.3). The speech-language pathologist rerated 15% of participants in a separate sitting to ensure adequate intrarater reliability. The Pearson's product-moment correlation coefficient was calculated on the ratings using the statistical package R (Version 3.2.4), yielding an intrarater reliability of r = .96. The overall demographic information for participants with typical voices (split into young adults < 40 years of age and older adults ≥ 40 years of age), participants with HVDs, and participants with PD is included in Table 1.

Procedure
The current study comprised three main components: participant training, experimental setup, and data collection. In the training segment, each participant was instructed to produce the speech tokens that would be simultaneously captured via microphone and flexible nasendoscope during data collection. Following the training segment, participants were instrumented with recording equipment. Data collection then commenced, during which participants were cued to produce speech tokens during a nasendoscopic examination that totaled approximately 5-10 minutes. The overall experimental time (including consent, training, setup, and recording) required approximately 1-2 hours.

Participant Training-Participants were trained to produce eight iterations of the VCV utterance, /ifi/. This token was selected since the phoneme /i/ provides an open pharynx to better view the vocal folds under endoscopy [19], whereas the phoneme /f/ has been shown to minimize within-speaker variations in RFF [33]. Each participant was instructed to produce four /ifi/ utterances, take a breath, and then produce the remaining four /ifi/ utterances.
Participants were then trained to produce the eight /ifi/ utterances at different vocal rates and levels of vocal effort. These modifications were chosen to alter the stiffness of the laryngeal musculature [34] and, in turn, produce voice with varying degrees of laryngeal muscle tension. Using methodology employed by McKenna, et al. [19], a metronome was used to train three vocal rates: slow rate (50 beats per minute), regular rate (65 beats per minute), and fast rate (80 beats per minute). Similarly, participants were cued using methodology described by McKenna, et al. [19] to produce voice with varying levels of vocal effort (mild effort, moderate effort, maximum effort) while maintaining a comfortable speaking rate and volume. Participants were instructed to "increase your effort during your speech as if you are trying to push your air out," with mild effort cued as "mildly more effort than your regular speaking voice," moderate effort as "more effort than mild," and maximum effort as "as much effort as you can while still having a voice."

Experimental Setup-After the training segment, participants were seated in a sound-attenuated booth and instrumented with recording equipment. This included a directional headset microphone (Shure SM35 XLR) placed 45° from the midline and 7 cm from the lips, as well as a neck-surface accelerometer (BU series 21771 from Knowles Electronics, Itasca, IL) that was placed on the anterior neck, superior to the thyroid notch and inferior to the cricoid cartilage using double-sided adhesive. Microphone and accelerometer signals were pre-amplified (Xenyx Behringer 802 Preamplifier) and digitized at 30 kHz (National Instruments 6312 USB).
A flexible routine endoscope (Pentax, Model FNL-10RP3, 3.5-mm) was then passed transnasally through the inferior nasal turbinate, superior to the soft palate, and into the hypopharynx for laryngeal visualization. In cases in which participant nasal anatomy or reported discomfort interfered with image acquisition using the routine endoscope, a flexible slim endoscope (Pentax, Model FNL-7RP3, 2.4-mm) was used. A numbing agent was not administered so as not to affect laryngeal function [35], but a nasal decongestant was offered prior to insertion to minimize participant discomfort as the endoscope was passed through the nasal cavity. To record images of the larynx, the endoscope was attached to a camera (FASTCAM Mini AX100l; Model 540K-C-16GB; 256 × 256 pixels) with a 40-mm optical lens adapter. A steady xenon light was used for imaging (300 W KayPENTAX Model 7162B).

Experimental Recording-During the endoscopy procedure, participants were instructed to produce the eight /ifi/ repetitions for each condition, which were cued in the following order: slow rate, regular rate, fast rate, mild effort, moderate effort, and maximum effort. Participants completed a minimum of two recordings per condition, and recordings were repeated when the experimenter determined that the vocal folds were not adequately captured (e.g., obstruction by the epiglottis). Video images were acquired at a frame rate of 1 kHz using Photron Fastcam Viewer software (v.3.6.6) to track the fundamental frequency of vibration of the vocal folds, which is estimated to be 85-255 Hz during modal phonation in adults [36]. Recording was triggered by the camera software and a custom MATLAB (version 9.3; The MathWorks, Natick, MA) script that automatically time-aligned the video images with the microphone and accelerometer signals. Due to the recording limitations of the high-speed imaging system, the synchronized microphone, accelerometer, and HSV recordings were restricted in duration to 7.940 seconds when the 3.5-mm endoscope was used and 8.734 seconds when the 2.4-mm endoscope was used.

Technician Training:
A semi-automated algorithm was used to identify the physiological termination and initiation of vocal fold vibration from each /ifi/ production. To carry out this processing, a series of technicians used the algorithm to compute the glottic angle waveform, from which vocal fold abductory and adductory patterns were isolated. The training and experimental data processing schemes used to extract vocal fold vibratory features are described in detail below.
Prior to processing experimental data, nine technicians underwent a training scheme described by McKenna, et al. [37]. In brief, technicians were first trained to measure glottic angles (extending from the anterior commissure along the medial vocal fold edge to the vocal process) from images obtained during a flexible nasendoscopic procedure using a halogen light source and acquired at a conventional frame rate of 30 frames per second. Technicians were required to meet two-way mixed-effects intraclass correlation coefficients (ICC) for consistency of agreement ≥ .80 when compared to glottic angle markings made previously by a gold-standard technician [38]. The average reliability for the nine technicians was ICC(3,1) = .89 (SD = .01, range = .88-.91).
The nine technicians then completed training to use a semi-automated glottic angle tracking algorithm, as described in detail in Diaz-Cadiz, et al. [38]. Using this algorithm, the technicians were trained to use time-aligned microphone, accelerometer, and video frames captured during an /ifi/ utterance to semi-automatically estimate the glottic angle over time.

Experimental Data Processing:
To process experimental data, technicians first determined whether each /ifi/ production was analyzable based on manual inspection of the laryngoscopic recordings. An /ifi/ production was rejected from further analysis if the glottis was obstructed (e.g., by the epiglottis), if video quality was too poor to resolve the glottis, or if an /ifi/ production at the end of the recording was incompletely captured due to the pre-defined recording length. Of the potential 9172 /ifi/ productions recorded, 12.8% were considered unusable (1173 of 9172), leaving 7999 for further processing.
Technicians then used the semi-automated angle algorithm to calculate the glottic angle waveform for the usable /ifi/ productions (N = 7999). Within this analysis, each of the nine technicians was assigned to analyze a subset of the 122 speakers, wherein the assigned technician processed all /ifi/ productions of each speaker within the subset. For each speaker, the assigned technician determined whether the /ifi/ production was usable and, if so, obtained a quantitative estimate of the glottic angle for the production. Manual intervention was implemented if the algorithmic estimate of the glottic angle waveform was deemed inappropriate by the technician; if errors persisted following manual intervention, the technicians were instructed to mark the instance as unusable. The technicians accepted the fully automated results in 75.0% of cases (6000 of 7999), whereas the technicians accepted the automated results only after performing manual glottic angle intervention in 21.5% of cases (1721 of 7999). The remaining 3.5% of cases were discarded due to producing inappropriate glottic angle estimates even after manually assisted angle estimation (278 of 7999). This analysis resulted in 7721 usable /ifi/ productions for further processing. This initial data processing was then rechecked by a second technician.
A series of kinematic time points were then extracted from each usable /ifi/ production to mark the physiological termination or initiation of vocal fold vibration. Technicians were presented with a MATLAB GUI showing time-aligned high-speed video frames, the microphone signal, the previously extracted glottic angle waveform, and a high-pass filtered version of the quick vibratory profile (QVP; see Fig. 2). The QVP was included here as an alternative to the glottic angle waveform due to its sensitivity to HSV imagery and superior ability to track the vibrating glottis during the transition between voiced and unvoiced segments [29]. The QVP was calculated by (i) centering the HSV frame over the glottis using the semi-automated glottic angle extraction algorithm from Diaz-Cadiz, et al. [38], (ii) calculating vertical and horizontal profiles of the HSV frames using the methodology from Ikuma, et al. [29], and (iii) high-pass filtering the resulting QVP using a 7th-order Butterworth filter to attenuate low-frequency energy below a cut-off frequency of 50 Hz.
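The QVP filtering step (iii) can be sketched with SciPy as follows. The signal names are illustrative, and zero-phase forward-backward filtering is used here as a design choice the text does not specify.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass_qvp(qvp, frame_rate=1000.0, cutoff=50.0, order=7):
    """Attenuate energy below `cutoff` Hz in the quick vibratory profile
    (QVP), sampled at the 1-kHz HSV frame rate. Forward-backward filtering
    (zero phase) is an assumption not stated in the text."""
    sos = butter(order, cutoff, btype="highpass", fs=frame_rate, output="sos")
    return sosfiltfilt(sos, qvp)

# Synthetic QVP: slow laryngeal posturing drift (5 Hz) plus vibration (150 Hz)
t = np.arange(2000) / 1000.0
qvp = np.sin(2 * np.pi * 5 * t) + 0.2 * np.sin(2 * np.pi * 150 * t)
filtered = highpass_qvp(qvp)
```

After filtering, the large 5-Hz posturing component is removed while the 150-Hz vibratory component passes through, leaving only the oscillation relevant to tracking cycle-to-cycle vibration.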
With the MATLAB GUI, a total of three technicians used the time-aligned microphone signal, glottic angle waveform, and QVP to identify the time of voicing offset (t_off) and the time of voicing onset (t_on). For each participant, a single technician was assigned to identify t_off and t_on for all utterances. In this analysis, t_off was described as the termination of the last vibratory cycle before the voiceless consonant, whereas t_on was characterized as the initiation of the first vibratory cycle after the voiceless consonant. In the event that the vocal folds exhibited an abrupt closure at the start of voicing onset (i.e., prior to vocal fold vibration), t_on was extracted as the time point immediately before the point of abrupt vocal fold closure. Technicians were instructed to use the glottic angle waveform and QVP to identify these two time points, then corroborate the selected indices via manual visualization of the raw HSV images. This process was carried out to minimize errors that may occur if the glottic angle waveform failed to capture small glottal gaps during vibratory cycle phases or if the QVP was confounded by lighting artifacts (e.g., intensity saturation due to the epiglottis coming into view). The microphone signal was included within the GUI to orient technicians in the event that the glottic angle waveform and QVP both failed to properly track the vibrations of the vocal folds; in such cases, the technicians were instructed to indicate that the production needed to be rejected or reprocessed.
The technicians each reanalyzed 10% of participants in a separate sitting to ensure adequate intrarater reliability. The three technicians also analyzed the HSV images of the same participant to assess interrater reliability. Intrarater reliability was assessed via two-way mixed-effects ICCs for absolute agreement, whereas interrater reliability was computed using two-way mixed-effects ICCs for consistency of agreement (single measures). Intrarater reliability ranged from .86 to .99, with an overall mean reliability of .96 (SD = .05). Average interrater reliability was ICC(3,1) = .91 (95% CI = .86-.96) for t_off and ICC(3,1) = .97 (95% CI = .96-.99) for t_on.

Manual RFF Estimation-Five technicians were trained 1 to manually estimate RFF using methodology described in Vojtech and Heller Murray [39]. Table 2 shows the number of participants that each of the five technicians rated throughout the course of data collection. Two trained technicians carried out manual RFF estimation on each participant (7721 total /ifi/ productions). Mean RFF values were computed across the two technicians to use as the gold standard for RFF estimates.
Intrarater reliability was assessed via Pearson correlation coefficients within each technician when instructed to reanalyze 20% of participants in a separate sitting, whereas interrater reliability was computed via two-way mixed-effects ICCs for consistency of agreement. The average intrarater reliability was calculated as r = .90 (SD = .05, range = .84-.97), and the average interrater reliability was computed as ICC(3,1) = .93 (SD = .04, range = .87-.98).

Semi-automated RFF Estimation-Semi-automated RFF estimation was performed on all 7721 /ifi/ productions using the MATLAB-based aRFF-AP algorithm, which is described in detail in Vojtech, et al. [24]. The aRFF-AP algorithm estimates RFF using a 9-step process that begins by (1) identifying the voiceless consonant and vowels in each production via high-to-low energy ratios of the acoustic signal; the remaining steps are detailed in [24].

The relationship between acoustic and physiologic voicing transitions was assessed by examining acoustic features relative to the termination (t_off) or initiation (t_on) of voicing. This step is distinct from the development of the aRFF-AP algorithm, as Vojtech, et al. [24] examined acoustic features relative to the intervocalic voicing transitions indicated by manual RFF estimation. To perform this analysis, a literature review was conducted to select a set of acoustic features that showed promise in distinguishing voiced and unvoiced segments, as is the goal of the acoustic features in the semi-automated RFF algorithm (i.e., step 7 of the aRFF-AP algorithm). The acoustic features that best corresponded with the termination or initiation of voicing were then implemented in the aRFF-AP algorithm.
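As context for step (1), a frame-wise high-to-low spectral energy ratio that peaks during voiceless frication can be sketched as follows. The split frequency, window settings, and signal are illustrative assumptions, not the published parameters of the aRFF-AP algorithm.

```python
import numpy as np
from scipy.signal import stft

def high_low_energy_ratio(x, fs, split_hz=2000.0, nperseg=256):
    """Frame-wise ratio of high- to low-band spectral energy; voiceless
    frication concentrates energy above the split frequency, whereas
    voiced vowels concentrate energy below it."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2
    hi = power[f >= split_hz].sum(axis=0)
    lo = power[f < split_hz].sum(axis=0)
    return t, hi / (lo + 1e-12)

# Toy /ifi/-like signal: 0.1-s tone, 0.1-s broadband noise, 0.1-s tone
fs = 16000
rng = np.random.default_rng(2)
seg = int(0.1 * fs)
vowel = np.sin(2 * np.pi * 150 * np.arange(seg) / fs)
fric = 0.3 * rng.standard_normal(seg)
x = np.concatenate([vowel, fric, vowel])
t, ratio = high_low_energy_ratio(x, fs)
mid = float(ratio[(t > 0.12) & (t < 0.18)].mean())    # inside the "consonant"
edge = float(ratio[t < 0.08].mean())                  # inside the first "vowel"
```

The ratio is orders of magnitude larger during the noise segment than during the tone segments, which is the contrast a consonant-localization step can exploit.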

Acoustic Feature Selection:
In the aRFF-AP algorithm, acoustic feature trends are examined to identify a state transition in feature values that mark the boundary cycle, or the vocal cycle that distinguishes the vowel from the voiceless consonant. The boundary cycle is offset cycle 10 for voicing offset and onset cycle 1 for voicing onset (see Fig. 1). The aRFF-AP algorithm uses three acoustic features to characterize the transition between voiced and unvoiced segments: normalized peak-to-peak amplitude, number of zero crossings, and waveform shape similarity.
In addition to the three features included within the aRFF-AP algorithm, a set of 15 new acoustic features was examined with respect to classifying voiced and unvoiced speech segments. Of the resulting 18 features (the three features of the aRFF and aRFF-AP algorithms plus the 15 new acoustic features), 13 were calculated directly from the microphone signal: autocorrelation, mean and standard deviation of cepstral peak prominence, cross-correlation, low-to-high ratio of spectral energy, normalized cross-correlation, normalized peak-to-peak amplitude, number of zero crossings, short-time energy, short-time log energy, short-time magnitude, signal-to-noise ratio, and waveform shape similarity. The remaining five features were calculated using a processed version of the microphone signal. Specifically, the third step of the aRFF-AP algorithm leverages the Auditory-SWIPE′ algorithm to calculate the f_o contour and pitch strength contour of each /ifi/ production. Three features were calculated from the f_o contour (average, median, and standard deviation of voice f_o), and two features were computed using the pitch strength contour (average and median pitch strength).
In addition to examining the 13 acoustic features extracted from the raw microphone signal, filtered versions of these features were also considered. The aRFF and aRFF-AP algorithms employ a version of the microphone signal that has been band-pass filtered ±3 ST around the average f_o of the speaker to identify peaks and troughs in signal amplitude. The aRFF-AP algorithm also uses this filtered version of the signal to compute normalized peak-to-peak amplitude (whereas the number of zero crossings and waveform shape similarity are calculated using the raw microphone signal). Including the 13 features computed from the filtered microphone signal, the literature review thus yielded a total of 31 acoustic features for subsequent analysis. Table 3 provides an overview of the acoustic features, including the signal used to calculate each and the hypothesized change in feature values when used for voiced/unvoiced detection.

Feature Set Reduction:
The 31-feature set was first examined to remove features that did not sufficiently capture the transition between voiced and unvoiced segments. The sliding window process in step 6 of the aRFF-AP algorithm was simulated to estimate each feature over time, ranging from the midpoint of the voiceless consonant into the vowel. Acoustic feature trends were then examined relative to HSV-derived voicing transitions as a function of the number of pitch periods 2 away from the "true" boundary cycle; specifically, the true boundary cycle was set to reference the time of voicing offset (t_off) and the time of voicing onset (t_on) to investigate the relationship between these acoustic features and the physiologically derived termination and initiation of vocal fold vibration, respectively. To comprehensively examine trends in feature values, the acoustic features were analyzed as a function of ±10 pitch periods from the true boundary cycle, resulting in 21 feature values (i.e., one feature value for each pitch period) for each of the 31 acoustic features per /ifi/ production.
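The ±10-pitch-period sampling can be sketched as follows, assuming a per-window feature function and an HSV-derived boundary time; all names and the toy signal are illustrative.

```python
import numpy as np

def feature_trajectory(feature_fn, signal, fs, t_boundary, period_s, n_periods=10):
    """Evaluate a per-window acoustic feature at each of +/- n_periods pitch
    periods around the HSV-derived boundary time, yielding 21 values for
    n_periods = 10 (one value per pitch period)."""
    half = int(round(period_s * fs))  # samples in one pitch period
    values = []
    for k in range(-n_periods, n_periods + 1):
        center = int(round((t_boundary + k * period_s) * fs))
        values.append(feature_fn(signal[center - half:center + half]))
    return values

# Toy voicing offset: vowel-like tone up to t = 0.5 s, then weak noise
fs = 8000
rng = np.random.default_rng(3)
n = np.arange(fs)
tone = np.sin(2 * np.pi * 160 * n / fs)
noise = 0.05 * rng.standard_normal(fs)
signal = np.where(n < fs // 2, tone, noise)

energy = lambda w: float(np.mean(w ** 2))  # short-time energy as the feature
traj = feature_trajectory(energy, signal, fs, t_boundary=0.5, period_s=1 / 160)
```

For this toy offset, the 21-point trajectory is high on the vowel side of the boundary and low on the voiceless side, which is the state transition the feature screening looks for.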
The feature values were then visually inspected to determine which acoustic features failed to exhibit a substantial change in feature magnitude during the transition between the voiceless consonant and vowel; such features were removed from subsequent analysis.
2 "Pitch period" refers to the duration of one glottal cycle and was computed per /ifi/ production using the average f_o determined using Auditory-SWIPE′.
The remaining acoustic features were then used as predictors in a stepwise binary logistic regression to determine the probability of feature values corresponding to a voiced (1) or unvoiced (0) segment. The 21 values per acoustic feature for each of the 7721 /ifi/ productions served as continuous predictors. Feature values were assumed independent in the regression model, as the goal was to identify which features were significantly related to voicing status rather than to create a regression equation for predicting voicing status. Variable significance was set to p < .05. Highly correlated features (variance inflation factor > 10) were removed from the model to reduce multicollinearity. Acoustic features that exhibited significant predictive effects and were sufficiently independent were retained for further algorithmic refinement.
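The multicollinearity screen can be sketched directly: the variance inflation factor (VIF) for feature j regresses that feature on the remaining features and computes 1/(1 - R^2), and features exceeding the VIF > 10 cutoff are dropped. This is an illustrative implementation on toy data, not the study's statistical code.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: regress each (standardized)
    feature on the remaining features and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(y))])   # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / (y @ y)
        out.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(out)

# Three toy "features": the third nearly duplicates the first
rng = np.random.default_rng(1)
a = rng.standard_normal(500)
b = rng.standard_normal(500)
X = np.column_stack([a, b, a + 0.01 * rng.standard_normal(500)])
vifs = vif(X)   # columns with VIF > 10 would be removed from the model
```

Here the redundant third column produces a very large VIF while the independent second column stays near 1, matching the screening rule described above.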

Algorithmic Modifications
The acoustic features that exhibited significant predictive effects on voicing status were introduced into the aRFF-AP algorithm to produce a more physiologically relevant version of the RFF algorithm, called "aRFF-APH" (aRFF-AP with HSV-derived acoustic features). The pitch strength rejection criterion of the aRFF-AP algorithm, which removes VCV productions with average pitch strength values below .05 from subsequent analysis due to little-to-no presence of a pitch sensation, was retained in the current study to streamline data processing. A sliding window based on the speaker's estimated f_o then navigated from the voiceless consonant into the vowel of interest. Within each window of time, the acoustic features selected in the current study were calculated rather than those used within the aRFF-AP algorithm (i.e., normalized peak-to-peak amplitude, number of zero crossings, waveform shape similarity). Rule-based signal processing techniques were then adapted from the aRFF [21,22] and aRFF-AP algorithms to identify the boundary cycle separating voiced and unvoiced segments. To locate this cycle, the algorithm identified the feature value that maximized the effect size between the left and right components of each acoustic feature vector; the cycle index corresponding to this feature value was selected as the boundary cycle candidate for that feature. The median of these candidates was then taken as the final boundary cycle.
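The boundary-cycle selection rule just described can be sketched as follows. This is a hypothetical reading of the description, using a Cohen's-d-style statistic as the effect-size measure (the specific effect size is an assumption) and toy step-shaped feature vectors: for each feature, the cycle index whose left/right split maximizes the effect size becomes that feature's candidate, and the median across features is the final boundary cycle.

```python
import statistics

def effect_size(left, right):
    """Cohen's-d-style effect size between the two parts of a split
    feature vector (assumed measure for illustration)."""
    m1, m2 = statistics.mean(left), statistics.mean(right)
    pooled = ((statistics.pstdev(left) ** 2 + statistics.pstdev(right) ** 2) / 2) ** 0.5
    if pooled == 0:
        return float('inf') if m1 != m2 else 0.0
    return abs(m1 - m2) / pooled

def boundary_candidate(values):
    """Cycle index whose left/right split maximizes the effect size."""
    best_idx, best_d = 0, -1.0
    for i in range(2, len(values) - 1):
        d = effect_size(values[:i], values[i:])
        if d > best_d:
            best_idx, best_d = i, d
    return best_idx

def boundary_cycle(feature_vectors):
    """Median of the per-feature candidates gives the final boundary cycle."""
    return statistics.median(boundary_candidate(v) for v in feature_vectors)
```

With step-shaped inputs the candidate lands exactly at the step, and disagreement among features is resolved by the median, mirroring the rule-based combination step.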

Performance of Manual and Semi-automated RFF Estimation Methods
To examine the impact of introducing physiologically relevant acoustic features into the semi-automated RFF algorithms, the ability of manual and algorithmic RFF estimation methods to locate the true boundary cycle (derived via HSV; referenced to t_off for voicing offset and t_on for voicing onset) was assessed. First, the 7721 /ifi/ productions from 122 participants were processed using the aRFF-AP and aRFF-APH algorithms, as well as manual estimation techniques. The accuracy of the three RFF estimation methods was then quantified as the distance (in average pitch periods) between the true and selected boundary cycles for each voicing offset and onset instance of each /ifi/ production. This distance was compared across RFF estimation methods to determine which method best corresponded with vocal fold vibratory characteristics during intervocalic offsets and onsets.
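The accuracy metric reduces to a simple computation, sketched below with hypothetical boundary-cycle indices (the index values are made up for illustration): the signed distance in pitch periods per instance, and the percentage of instances with zero distance.

```python
def boundary_distances(true_cycles, selected_cycles):
    """Signed distance, in pitch periods, between the true (HSV-derived)
    and selected boundary cycle for each offset/onset instance."""
    return [s - t for t, s in zip(true_cycles, selected_cycles)]

def percent_exact(distances):
    """Percentage of instances whose selected cycle matches the true cycle."""
    return 100.0 * sum(1 for d in distances if d == 0) / len(distances)

# Hypothetical boundary-cycle indices for three productions.
dists = boundary_distances([10, 12, 9], [10, 14, 9])
```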

Statistical Analysis
Chi-square tests were performed to determine whether there was a relationship between RFF estimation method (manual, aRFF-AP, aRFF-APH) and boundary cycle classification accuracy. Two chi-square tests were conducted: one for voicing offset and one for voicing onset. In each analysis, a correctly classified boundary cycle referred to an instance in which the distance between the true and selected boundary cycles was zero, whereas a misclassified boundary cycle corresponded to a non-zero distance between the true and selected boundary cycles. Significance was set a priori to p < .05. Cramer's V was used to assess the effect sizes of significant associations, and resulting effect sizes were interpreted using criteria from Cohen [49]. Post hoc chi-square tests of independence were then performed for pairwise comparisons of the three RFF estimation methods using a Bonferroni-adjusted p value of .017 (.05 / 3 comparisons).

Results

Fig. 3 shows the relationship between the acoustic features and the true boundary cycle (relative to t_off) for the 7721 voicing offset instances, and Fig. 4 shows this relationship (relative to t_on) for the 7721 voicing onset instances. Manual inspection of the 31 features resulted in the removal of the filtered number of zero crossings, the raw and filtered autocorrelation, the filtered cepstral peak prominence, the filtered low-to-high ratio of spectral energy, the raw and filtered standard deviation of cepstral peak prominence, and the standard deviation of voice f_o due to a lack of discrimination between voiced and unvoiced segments (indicated by the dashed lines in Figs. 3 and 4). All further analyses were completed using the remaining 23 features (indicated by the solid lines in Figs. 3 and 4).
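The chi-square statistic and Cramer's V described under Statistical Analysis can be computed directly from a method-by-accuracy contingency table; the sketch below is a simplified illustration with made-up counts (it computes the test statistic and effect size only, not the p value).

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for an r x c contingency table
    (rows: estimation methods; columns: correct vs. misclassified)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Effect size for the method-by-accuracy association:
    sqrt(chi2 / (n * (min(r, c) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return (chi_square_stat(table) / (n * (k - 1))) ** 0.5

# Made-up (correct, misclassified) counts per method: manual, aRFF-AP, aRFF-APH.
counts = [[20, 80], [70, 30], [76, 24]]
```

A table with identical rows yields a statistic of zero (no association), while diverging rows push Cramer's V toward 1, which is how the method-accuracy relationship is quantified.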

Acoustic Feature Set Reduction
The results of the logistic regression (shown in Table 4) indicated that filtered waveform shape similarity, median of voice f_o, cepstral peak prominence, number of zero crossings, short-time energy, average pitch strength, normalized cross-correlation, and cross-correlation were all significant predictors of voicing status for voicing offset (p < .05). When using these eight features, the model for voicing offset accounted for 61.7% of the variance in voicing status (adjusted R² = 61.7%), with an area under the receiver operating characteristic (ROC) curve of .96. Inspection of the coefficients indicated that the log odds of voicing decreased per one unit increase in short-time energy, normalized cross-correlation, number of zero crossings, or filtered waveform shape similarity (i.e., negative coefficients). On the other hand, the log odds of voicing increased per one unit increase in median of voice f_o, cepstral peak prominence, average pitch strength, or cross-correlation (i.e., positive coefficients).
For voicing onset, the stepwise binary logistic regression revealed that filtered waveform shape similarity, median of voice f_o, cepstral peak prominence, number of zero crossings, average pitch strength, signal-to-noise ratio, filtered short-time energy, and filtered short-time log energy were all significant predictors of voicing status (p < .05; see Table 4). As with voicing offset, inspection of the coefficients in Table 4 indicated how the log odds of voicing changed per one unit increase in each predictor, including average pitch strength, signal-to-noise ratio, and filtered short-time log energy. The resulting acoustic features were then incorporated into the aRFF-APH algorithm to identify the boundary cycle of voicing.

Performance of Manual and Semi-automated RFF Estimation Methods
The comparison of the aRFF-APH, aRFF-AP, and manual RFF estimation techniques in identifying the true boundary cycle is shown in Fig. 5, with the 7721 voicing offset instances shown in Fig. 5a. The results of the chi-square tests are shown in Table 5. Boundary cycle classification accuracy significantly differed by RFF estimation method, producing large effect sizes for both voicing offset (p < .001, V = .52) and onset (p < .001, V = .58). Boundary cycle classification accuracy was significantly different between the manual and aRFF-AP methods: post hoc analyses revealed a large effect for voicing offset (p < .001, V = .53) and onset (p < .001, V = .52), wherein aRFF-AP was more likely than manual estimation to correctly identify the boundary cycle. Boundary cycle classification accuracy was also significantly different between the manual and aRFF-APH methods: post hoc analyses showed a large effect for both voicing offset (p < .001, V = .59) and onset (p < .001, V = .62), such that aRFF-APH was more likely than manual estimation to correctly identify the boundary cycle. Finally, boundary cycle classification accuracy was significantly different between the semi-automated RFF algorithms (aRFF-AP, aRFF-APH) for both voicing offset and onset (p < .001); however, the size of this effect was negligible for offset (V = .08) and small for onset (V = .15).

Discussion
The aim of the current study was to investigate the relationship between acoustic features and vocal fold vibratory characteristics during intervocalic voicing offsets and onsets. A large set of speakers with typical voices and speakers with voices characterized by excessive laryngeal muscle tension were instructed to produce the VCV utterance, /ifi/, while altering vocal rate and vocal effort. Simultaneous recordings were acquired using a microphone and flexible nasendoscope. The initiation (voicing onset) and termination (voicing offset) of vocal fold vibration were identified via inspection of the laryngoscopic images. A set of acoustic features were examined in reference to these time points, and a stepwise binary logistic regression was performed to identify which features best coincided with voicing offset and onset. The features that exhibited significant predictive effects were then implemented into the semi-automated RFF algorithm ("aRFF-APH"). The accuracy of the aRFF-APH algorithm in locating the transition between voiced and unvoiced segments was then assessed against (1) the current version of the semi-automated RFF algorithm ("aRFF-AP"), and (2) manual RFF estimation, the current gold-standard technique for calculating RFF.
The results of this investigation indicate that using the aRFF-APH algorithm led to the greatest percentage of correctly identified boundary cycles (76.0%), followed by the aRFF-AP algorithm (70.3%) and then manual estimation (20.4%). This suggests that using physiologically tuned acoustic features to identify the transition between voiced and unvoiced segments improves the correspondence between algorithmic and physiologic boundary cycles, even in the absence of methods to account for differences in voice sample characteristics (i.e., as in the aRFF-AP algorithm). These findings support our hypothesis that incorporating features related to the onset and offset of vocal fold vibration improves the accuracy of acoustic voiced/unvoiced boundary detection.
Although the aRFF-APH algorithm demonstrated greater accuracy in detecting voiced/unvoiced boundaries, the aRFF-AP algorithm remains the gold-standard method for semi-automatically estimating RFF. This is because the aRFF-AP algorithm was developed and validated using independent training and test sets to improve the clinical applicability of RFF. The aRFF-APH algorithm, on the other hand, was developed with the goal of improving acoustic voiced/unvoiced detection rather than clinical applicability and was specifically tuned to the limited database examined here. As part of this investigation, all speakers were recorded in a sound-attenuated booth in the presence of constant noise from the endoscopic light source. In addition to this single recording location, the voice sample characteristics captured in the speaker dataset were more limited than those used in the development of the aRFF-AP algorithm: Vojtech et al. [24] included over 20 different primary voice complaints with an overall severity of dysphonia ranging from 0 to 100, whereas the current study included a smaller range of diagnoses (57% typical, 16% MTD, 3% nodules, 2% polyp, 1% scarring, 1% lesion, 20% Parkinson's disease) and resulting dysphonia severity (0-51.3). Because of the limited spectrum of vocal function captured here, pitch strength-tuned parameters and independent training/test sets were not implemented in the development of the aRFF-APH algorithm in the current study.
Despite the aforementioned differences between the aRFF-AP and aRFF-APH algorithms, using either of these methods resulted in greater boundary cycle identification accuracy than using manual estimation. These findings are unexpected, since manual estimation has long been considered the gold-standard RFF estimation method, wherein trained technicians may exercise trial and error to identify the boundary cycle in difficult scenarios (e.g., a poor recording environment and/or equipment, severe dysphonia) when cycle masking is present. It is possible that the characteristics of the speaker database confounded this outcome, as noise from the endoscopic light source may have masked the voice signals and/or speaker productions may have deviated from the norm due to the flexible nasendoscope. Even though manual estimation makes use of trial and error to subjectively locate the boundary cycle when masking is present, it is possible that manual estimation techniques were not sensitive enough to isolate the physiological boundary cycle under these conditions. On the other hand, the aRFF-AP algorithm was designed to account for such variations, and the aRFF-APH algorithm was refined based on the physiologically determined vocal fold characteristics. Both algorithms also identify potential vocal cycles using a filtered version of the microphone signal that was designed to reduce the amplitude of vocal tract resonances, coarticulation due to concurrent frication and aspiration, and radiation at the lips. Because manual estimation uses only the raw microphone signal to identify vocal cycles, the RFF values it produces may not reflect the true offset or onset of voicing.
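The intuition behind filtering the microphone signal before cycle identification can be shown with a deliberately crude sketch. This is not the filter used by the aRFF algorithms (whose design is not reproduced here): a short moving average attenuates high-frequency, frication-like energy, and subtracting a longer moving average removes low-frequency drift, leaving a rough passband around the voice f_o.

```python
def moving_average(x, w):
    """Causal moving average of window length w (a simple FIR low-pass)."""
    out, acc = [], 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= w:
            acc -= x[i - w]
        out.append(acc / min(i + 1, w))
    return out

def crude_band_pass(x, drift_win, smooth_win):
    """Band-pass sketch: smooth to suppress high-frequency energy, then
    subtract a longer-window smooth to remove low-frequency drift."""
    smoothed = moving_average(x, smooth_win)
    drift = moving_average(smoothed, drift_win)
    return [s - d for s, d in zip(smoothed, drift)]
```

A constant (zero-frequency) input is rejected entirely, illustrating the drift-removal half of the passband; window lengths would be chosen relative to the sampling rate and the speaker's f_o.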
Although manual estimation resulted in the lowest boundary cycle identification accuracy, it is important to note that most misclassifications occurred within two pitch periods of the true boundary cycle for both voicing offset and onset (see Fig. 5). These findings are similar to those of Lien et al. [50], in which manual RFF estimation performed on a microphone signal was compared against that performed on a neck-surface accelerometer signal. Since a neck-surface accelerometer is able to capture the vibrations of the glottal source in the absence of vocal cycle masking due to frication and aspiration [as may occur during the production of an intervocalic fricative; 27], the accelerometer signal was considered the ground truth relative to the microphone signal. The authors observed that misclassifications occurred closer to the vowel for both voicing offset and onset when performing manual RFF estimation on a microphone signal rather than an accelerometer signal: whereas offset RFF values were extracted approximately two cycles closer to the vowel when using a microphone signal, onset RFF values were computed less than one cycle away from the voiceless consonant. The results of the current study support these findings and, moreover, lend support to the supposition that the aRFF-AP and aRFF-APH algorithms benefit from using a band-pass filtered version of the microphone signal to identify potential vocal cycles.
As semi-automated RFF algorithm accuracy is typically quantified in reference to manual RFF estimation [e.g., see 22,24], it is important to consider that manual estimation may not be a true gold-standard technique. Further investigation is necessary to examine the hypothesis that differences in boundary cycle identification may be attributed to the algorithms leveraging a band-pass filtered version of the microphone signal to reduce the impacts of vocal tract resonances, coarticulation due to concurrent frication and aspiration, and radiation at the lips. Such an investigation should include an analysis of both laryngeal imaging and acoustics to comprehensively assess the relevance and validity of manual estimation as the gold-standard technique for calculating RFF. Laryngeal imaging is a crucial component of this investigation, as this modality can provide physiological confirmation of the vocal fold vibrations that are indirectly captured via RFF. In addition to comparing manual and semi-automated boundary cycle selections, this investigation should aim to compare the boundary cycles obtained via manual RFF estimation when using each version of the acoustic signal (i.e., microphone, accelerometer). In the event that manual estimation is no longer considered the gold-standard RFF method, efforts should be made to develop new metrics of algorithmic performance to replace those that are calculated in reference to RFF values obtained via manual estimation (e.g., root-mean-square error, mean bias error).
Even though manual RFF estimation demonstrated the lowest voiced/unvoiced boundary detection accuracy, it should also be noted that this method is currently the only means by which RFF can be calculated on running speech. Whereas manual RFF technicians are trained to process both isolated VCV productions and VCV productions extracted from running speech, the semi-automated RFF algorithm has been designed, trained, and tested on isolated VCV productions since its origination [see 22]. Thus, although aRFF-AP and aRFF-APH demonstrated greater accuracy in capturing the voicing transitions necessary to compute RFF, both versions of the algorithm can only be used in scenarios where the voice samples are compatible. It is therefore recommended that the aRFF-AP algorithm, which was validated across a broad spectrum of vocal function and recording locations [24], be used in future investigations when compatible voice samples (i.e., isolated VCV productions) are available. In scenarios that require RFF to be computed from running speech, it is recommended that manual RFF estimation be used.
The results of the current study demonstrate the promise of using physiologically relevant acoustic features to locate the boundary cycle between voiced and unvoiced speech segments, specifically for estimates of RFF. However, additional steps must be undertaken to improve the clinical applicability of the aRFF-APH algorithm. These should include the use of independent training and test sets that span a broad range of vocal function. In doing so, the aRFF-APH algorithm could be modified to include pitch strength-tuned algorithm parameters to account for variations in voice sample characteristics. The aRFF-APH algorithm should also be expanded to neck-surface accelerometer signals, as there has been growing interest in using the neck-surface vibrations generated during speech for ecological momentary assessment and ambulatory voice monitoring [e.g., 27, 51-60]. By capturing daily vocal behavior through a neck-surface accelerometer, vocal behaviors associated with excessive or imbalanced laryngeal muscle forces could be identified and monitored via RFF. Although an accelerometer-based RFF algorithm has been developed [61], future work should aim to improve this algorithm by identifying physiologically tuned features that capture the true termination and initiation of vocal fold vibration. Doing so would further improve the clinical relevance of using RFF to assess and track laryngeal muscle tension.

Conclusions
The current study examined the relationship between acoustic outputs from the semi-automated RFF algorithm and physiological vocal fold vibratory characteristics during intervocalic offsets and onsets. By incorporating features that reflected the onset and offset of vocal fold vibration, the algorithmic accuracy of voiced/unvoiced detection increased. Voiced/unvoiced boundary detection accuracy when using the RFF algorithm exceeded that of the gold-standard, manual method for calculating RFF. These findings highlight the benefits of incorporating features related to vibratory offsets and onsets for acoustic voiced/unvoiced boundary detection. It is recommended that the recently validated version of the semi-automated algorithm be used to calculate RFF when voice samples containing isolated vowel-voiceless consonant-vowel productions are available, and that manual RFF estimation be used in scenarios that require RFF to be computed from running speech.

Fig. 4. Normalized feature values calculated from the raw microphone signal or Auditory-SWIPE′ output (teal) with respect to distance (pitch periods) from the true boundary cycle (thin black dotted line) for voicing onset. Normalized feature values calculated from the band-pass filtered microphone signal are overlaid in orange (when applicable). Top row: normalized peak-to-peak amplitude (PTP), short-time magnitude (STM), short-time energy (STE), cross-correlation (XCO), normalized cross-correlation (NXCO), autocorrelation (ACO). Middle row: mean and standard deviation of cepstral peak prominence (CPP, SD CPP), signal-to-noise ratio (SNR), number of zero crossings (NZC), waveform shape similarity (WSS), low-to-high ratio of spectral energy (LHR).

CPP reflects the distribution of energy at harmonically related frequencies [43] and is calculated as the magnitude of the peak with the highest amplitude in the cepstrum (i.e., the Fourier transform of the power spectrum).