Advances in Clinical Voice Quality Analysis with VOXplot

Background: Voice quality is assessed perceptually in standard clinical practice, and acoustic evaluation of digital voice recordings is used to validate and further interpret perceptual judgments. The goal of the present study was to determine the strongest acoustic voice quality parameters for perceived hoarseness and breathiness when analyzing the sustained vowel [a:] using a new clinical acoustic tool, the VOXplot software. Methods: A total of 218 voice samples of individuals with and without voice disorders were subjected to perceptual and acoustic analyses. Overall, 13 single acoustic parameters were included to determine validity aspects in relation to perceptions of hoarseness and breathiness. Results: Four single acoustic measures could be clearly associated with perceptions of hoarseness or breathiness. For hoarseness, the harmonics-to-noise ratio (HNR) and the pitch perturbation quotient with a smoothing factor of five periods (PPQ5), and, for breathiness, the smoothed cepstral peak prominence (CPPS) and the glottal-to-noise excitation ratio (GNE) were shown to be highly valid, with the validity of each measure differing significantly from its validity for the other perceptual voice quality aspect. Conclusions: Two acoustic measures, the HNR and the PPQ5, were both strongly associated with perceptions of hoarseness and were able to discriminate hoarseness from breathiness with good confidence. Two other acoustic measures, the CPPS and the GNE, were both strongly associated with perceptions of breathiness and were able to discriminate breathiness from hoarseness with good confidence.


Introduction
Standard clinical practice for the evaluation of voice disorders includes a battery of multidimensional assessments (e.g., visual analysis, auditory-perceptual judgment, aerodynamic analysis, acoustic analysis, and self-assessment [1]) aimed at describing and diagnosing the voice complaint. Voice disorders affect quality, volume, pitch, resonance, flexibility, and/or stamina. These vocal changes are the manifestation of disordered respiratory, laryngeal, and vocal tract functions, which might result, in many cases, from heterogeneous local etiologies [2]. Many voice disorders are associated with abnormal oscillation patterns of the vocal folds. The resulting voiced energy can vary as a function of vibrational changes at different vocal fold areas, but especially at the free vocal fold margin. Furthermore, the more a critical region of one or both vocal folds is affected by laryngeal pathology, the more variation in vocal sound energy and subsequent perceptions of voice quality severity can be expected [3].
Although voice quality is not a clearly defined term, there are two general approaches to its evaluation [4]. First, the subjective approach of listening to the patient's voice and assigning a score to different perceptual domains is considered the gold standard for perceptual voice analysis. Second, an objective instrumental approach can be used, in which a specific computer algorithm is applied to recorded voice signals. Examples of instrumental assessment of voice quality include analysis of the acoustic voice sound signal and the inverse-filtered oral airflow signal or its derivative. Although many different terms have been used to describe voice quality, terms such as hoarseness or overall voice quality, and major subtypes of the general anomalies in voice quality such as breathiness, roughness, and strain, have gained wide acceptance [4,5].
An objective acoustic analysis of voice signals is the most commonly used instrumental tool in clinical practice and research for objectively characterizing voice disorders [6]. Voice signals can be analyzed acoustically in the domains of time, frequency, amplitude, and quefrency. A large number of acoustic measures have been introduced and described to objectively predict dysphonia types and severities. This is illustrated in a taxonomy by Buder [6] with 15 signal-processing-based categories. The reliable and valid use of objective acoustic analysis in research or clinical practice depends on specific requirements (e.g., hardware, software, and examination circumstances) to enable voice analysis with high accuracy and reliability [4,7].
Voice quality has traditionally been quantified acoustically on sustained vowels. Although the assessment of voice quality based on sustained vowels (SV) does not necessarily correspond to that of continuous speech (CS) [8,9], acoustic measures from sustained vowels are ubiquitous in research and clinical practice. Acoustic parameters that correlate strongly with auditory-perceptual judgments are included in two examples of multiparametric acoustic indices: the acoustic voice quality index (AVQI) for the evaluation of hoarseness, and the acoustic breathiness index (ABI), which assesses the hoarseness subtype, breathiness [10]. Both AVQI and ABI have been used with wide international acceptance for research and clinical practice for a number of reasons: (a) their multivariate constructs based on linear regression analysis that combines relevant acoustic markers; (b) the inclusion of both continuous speech and sustained vowels in the acoustic analysis; (c) signal processing that uses algorithms of the freeware Praat; and (d) a single score ranging from 0 to 10 for the entire recording being analyzed (i.e., the higher the AVQI or ABI score, the more severe the related anomaly of voice quality, and vice versa) [10].
The acoustic measures of AVQI and ABI include smoothed cepstral peak prominence (CPPS); harmonics-to-noise ratio (HNR); shimmer percentage; shimmer dB; general slope of the spectrum (Slope); tilt of the regression line through the spectrum (Tilt); jitter local; glottal-to-noise excitation ratio with a maximum frequency of 4500 Hz (GNE); relative level of high-frequency noise between energy from 0 to 6 kHz and energy from 6 to 10 kHz (HF Noise); HNR by Dejonckere (HNR-D), which analyses the harmonic shape of the spectral display by using the frequency bandwidth between 500 and 1500 Hz and a cepstrum to determine F0, and thus locate the harmonic structure in the long-term average of the spectrum; differences between the amplitude of the first and second harmonics in the spectrum (H1H2); and period standard deviation (PSD).
Next to AVQI and ABI, a third multivariate index with a long tradition in the evaluation of overall voice quality on sustained vowels is the dysphonia severity index (DSI) [11,12]. The DSI includes four voice parameters (jitter local; highest frequency and lowest intensity of a voice range profile; and maximum phonation time), of which jitter local is the only single acoustic parameter directly associated with voice quality. To use the DSI with Praat algorithms for signal processing, the pitch perturbation quotient was considered in place of jitter local [13].
VOXplot (Lingphon, Straubenhardt, Germany; https://voxplot.lingphon.com, accessed on 11 June 2023) is a new freeware application for acoustic voice quality analysis based on the Praat algorithms for signal processing. Whereas Praat is a versatile and correspondingly complex software for acoustic analysis of arbitrary signals, VOXplot is specifically tailored to the analysis of voice quality. Only the algorithms of Praat are used, while the user interface of VOXplot is designed to meet the demands of standardized and intuitive ease of use for clinicians and researchers. VOXplot covers the entire workflow of acoustic voice quality assessment: recording and recording quality assessment, acoustic voice quality analysis, and generation of a concise PDF (or JPEG/PNG) sheet with the analysis results. The core of VOXplot is the voice quality analysis of continuous speech and sustained vowels with AVQI and ABI. VOXplot is currently available in 12 analysis languages for AVQI and ABI, which are based on more than one decade of research knowledge [14,15]. The validation results of both indices relate only to an objective evaluation of the hoarseness and breathiness levels for heterogeneous voice disorders in comparison with vocally healthy volunteers with no further specification of a specific disorder or vocal symptom. The user interface of VOXplot is currently available in three languages. Further details of sustained vowels can be analyzed qualitatively with the narrowband spectrogram and quantitatively with single acoustic parameters.
As mentioned before, AVQI, ABI, and DSI are used in combination with highly sensitive acoustic markers for the evaluation of hoarseness and breathiness. However, a direct comparison of these objective metrics using the VOXplot application with perceptual ratings of hoarseness or breathiness is missing. Therefore, the aim of this study was to compare the concurrent validity and diagnostic validity outcomes of 13 single acoustic voice quality measures between hoarseness and breathiness aspects on sustained vowels.

Participants
In the present study, the voice recordings and auditory-perceptual judgment of hoarseness and breathiness acquired in a previous study [16] were applied to new analyses. The group of dysphonic participants consisted of 175 patients with various organic and nonorganic voice disorders and various degrees of dysphonia severity. The control group of 43 vocally healthy volunteers reported no voice complaints, history of voice, speech, or hearing problems, and no impact of voice problems as measured with the voice handicap index [17]. Table 1 summarizes the demographic data and the types of dysphonia for the two groups. For further details regarding the data and recording acquisition, and inclusion and exclusion criteria, we refer to Barsties v. Latoszek et al. (2020) [16].
All the participants gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Greifswald University (BB072/16).

Auditory-Perceptual Judgment
For the auditory-perceptual judgment ratings, a panel of three male experts specialized in voice disorders, with experience ranging from 8 to 31 years, was used. The GRBAS scale was used for data collection. Each listener rated ordinally on a four-point scale the hoarseness level, which is represented by the G-parameter (Grade), and the breathiness severity, which is represented by the B-parameter (the extent of air leakage through the glottis).
For further details regarding the rating scale, rating procedure, anchor voices, reliability results of the raters, and deviation of the rating level results from the expert panel for hoarseness and breathiness, we refer to Barsties v. Latoszek et al. (2020) [16].

Acoustic Measurements
The acoustic analyses were conducted only on recordings of the sustained vowel [a:] across 3 s of the mid-vowel segment from a single trial. The [a:] vowel was used as a typical open front vowel for the clinical and scientific acoustic tasks, which is easily recognized regardless of the native language, linguistic competence, or individual health problems (e.g., hearing disorders) from the test person in comparison to other vowels [18,19]. These sound files were applied to a new analysis using VOXplot. In total, 13 single voice quality parameters were acquired from each recording, which are listed in Table 2.

Table 2. List of 13 acoustic measures for the voice quality evaluation.

Category | Acoustic Measures | Abbreviation
Fourier and linear prediction coefficient spectra | Smoothed cepstral peak prominence is the distance between the first harmonic peak and the point with equal quefrency on the regression line through the smoothed cepstrum. | CPPS (dB)
Fourier and linear prediction coefficient spectra | Differences between the amplitudes of the first and second harmonics in the spectrum. To localize the first harmonic peak, a cepstrum was performed for F0 determination. | H1H2 (dB)
Fourier and linear prediction coefficient spectra | Relative level of high-frequency noise between energy from 0 to 6 kHz and energy from 6 to 10 kHz. | HF-Noise (dB)
Fourier and linear prediction coefficient spectra | Harmonics-to-noise ratio is the base 10 logarithm of the ratio between the periodic energy and the noise energy, multiplied by 10. | HNR (dB)
Fourier and linear prediction coefficient spectra | Harmonics-to-noise ratio from Dejonckere and Lebacq, which analyzes the harmonic emergence of the spectral display comprised within the frequency bandwidth between 500 Hz and 1500 Hz. A cepstrum was performed to determine F0 and thus to localize the harmonic structure in the long-term average spectrum. | HNR-D (dB)
Fourier and linear prediction coefficient spectra | General slope of the spectrum is defined as the difference between the energy within 0-1000 Hz and the energy within 1000-10,000 Hz of the long-term average spectrum. | Slope (dB)
Fourier and linear prediction coefficient spectra | Tilt of the regression line through the spectrum is the difference between the energy within 0-1000 Hz and the energy within 1000-10,000 Hz of the trendline through the long-term average spectrum. | Tilt (dB)
Frequency of short-term perturbation measures | Period standard deviation is the variation in the standard deviation of periods in which the length of the sample is important for a valid computation of the standard deviation. | PSD (ms)
Frequency of short-term perturbation measures | Two jitter variations: Jitter local is the average difference between successive periods, divided by the average period. | Jitter local (%)
Frequency of short-term perturbation measures | Jitter of the five-point period perturbation quotient is the average absolute difference between a period and the average of it and its four closest neighbors, divided by the average period. | PPQ5 (%)
Amplitude of short-term perturbations measures | Two shimmer variations: Shimmer local is the absolute mean difference between the amplitudes of successive periods, divided by the average amplitude. | Shimmer (%)
Amplitude of short-term perturbations measures | Shimmer local dB is the base 10 logarithm of the difference between the amplitudes of successive periods, multiplied by 20. | Shimmer (dB)
Combines spectral and perturbation features | The glottal-to-noise-excitation (GNE) ratio with a maximum frequency of 4500 Hz. | GNE
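The perturbation definitions in Table 2 can be made concrete with a minimal Python sketch. It is not the Praat or VOXplot implementation: it assumes that glottal periods (in ms) and per-period peak amplitudes have already been extracted, it omits the period-validity checks a real analysis would apply, and all function names are hypothetical. For shimmer in dB, the sketch reads the "difference" on a logarithmic scale, i.e., as the logarithm of the ratio of successive amplitudes, which is an assumption about the intended definition.

```python
import math


def jitter_local(periods):
    """Jitter local (%): mean absolute difference between successive
    periods, divided by the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods[:-1], periods[1:])]
    mean_period = sum(periods) / len(periods)
    return 100.0 * (sum(diffs) / len(diffs)) / mean_period


def ppq5(periods):
    """PPQ5 (%): mean absolute difference between a period and the average
    of it and its four closest neighbours, divided by the mean period."""
    devs = [
        abs(periods[i] - sum(periods[i - 2:i + 3]) / 5.0)
        for i in range(2, len(periods) - 2)
    ]
    mean_period = sum(periods) / len(periods)
    return 100.0 * (sum(devs) / len(devs)) / mean_period


def shimmer_local(amplitudes):
    """Shimmer local (%): same construction as jitter local, on amplitudes."""
    return jitter_local(amplitudes)


def shimmer_db(amplitudes):
    """Shimmer (dB): mean absolute 20*log10 of the ratio of successive
    amplitudes (assumed reading of the table definition)."""
    vals = [
        abs(20.0 * math.log10(b / a))
        for a, b in zip(amplitudes[:-1], amplitudes[1:])
    ]
    return sum(vals) / len(vals)


def hnr_db(periodic_energy, noise_energy):
    """HNR (dB): 10*log10 of periodic energy over noise energy."""
    return 10.0 * math.log10(periodic_energy / noise_energy)
```

A perfectly periodic signal yields jitter and PPQ5 values of zero in this sketch; PPQ5 needs at least five periods because each evaluated period requires two neighbours on either side.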

Statistics
The association of the 13 acoustic parameters with the two auditory-perceptual evaluations of hoarseness and breathiness from 218 recorded voice samples was investigated by calculating Spearman's rank correlation coefficients. An absolute correlation score of ≥0.70 is marked as a high relationship for the concurrent validity aspect between the acoustic parameter and the perceived voice quality evaluation [20].
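As an illustration of this step, the following Python sketch computes Spearman's rank correlation as the Pearson correlation of average ranks, which handles the ties that a four-point ordinal rating scale necessarily produces. The study itself used SPSS; the function names and the example values are hypothetical and are not study data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values: HNR in dB and ordinal grade ratings (0-3).
    hnr = [25.1, 22.4, 18.0, 15.2, 9.7]
    grade = [0, 0, 1, 2, 3]
    rho = spearman_rho(hnr, grade)
    # The 0.70 cut-off is applied to the absolute value of rho.
    print(rho, abs(rho) >= 0.70)
```

Because measures such as HNR fall as severity rises, the sign of rho is negative for them, which is why the cut-off is applied to the absolute value.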
The Fisher r-to-z transformation was used to assess the statistical significance of the difference between the two correlation coefficients, i.e., between the correlation of an acoustic parameter with perceived hoarseness and its correlation with perceived breathiness.
A receiver operating characteristic (ROC) curve was then generated to analyze the diagnostic accuracy of the 13 acoustic metrics in terms of sensitivity (the proportion of participants with hoarseness or breathiness who were correctly identified) and specificity (the proportion of participants without hoarseness or breathiness who were correctly identified). The power of the acoustic markers to discriminate between the absence and presence of hoarseness or breathiness was estimated using the area under the ROC curve (A_ROC). An A_ROC of >0.90 is considered to be exceptionally good; an A_ROC of <0.70 is considered to be low; and an A_ROC of ≤0.50 corresponds to a chance level of diagnostic accuracy [21]. To find the optimal threshold value that best differentiates between voices with and without hoarseness or breathiness, the Youden index (a measure derived from the ROC curve that identifies the threshold best suited to distinguishing two groups) was calculated as sensitivity + specificity − 1.
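A minimal Python sketch of such a comparison is given below. It assumes the common independent-samples form of the test, in which each coefficient is transformed with z = atanh(r) and the difference is divided by sqrt(1/(n1 − 3) + 1/(n2 − 3)); whether this matches the exact calculation used in the study is an assumption, and the function name and example values are hypothetical.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Two-tailed test for the difference between two correlation
    coefficients via Fisher's r-to-z transformation.

    Assumes the independent-samples form of the test.
    Returns (z, p).
    """
    z1 = math.atanh(r1)  # Fisher transformation of the first coefficient
    z2 = math.atanh(r2)  # Fisher transformation of the second coefficient
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical absolute correlations of one parameter with the two
    # perceptual ratings, each based on 218 samples.
    z, p = compare_correlations(0.80, 218, 0.60, 218)
    print(z, p, p <= 0.05)
```

Absolute values of the coefficients would be passed in when the question is which association is stronger rather than whether the signed values differ.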
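The ROC area and the Youden-based threshold can be sketched in a few lines of Python. The sketch assumes that higher scores indicate a disordered voice (as for PPQ5); for measures where lower values indicate disorder (e.g., HNR or CPPS), the scores would be negated first. The area is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. Function names are hypothetical, and the study itself used SPSS.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve as the proportion of (positive, negative)
    pairs in which the positive case scores higher; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def youden_threshold(scores_pos, scores_neg):
    """Return (threshold, sensitivity, specificity, J) for the cut-off that
    maximises the Youden index J = sensitivity + specificity - 1.

    A case is classified as positive when its score is >= threshold.
    The first threshold reaching the maximum J is returned.
    """
    best = None
    for t in sorted(set(scores_pos) | set(scores_neg)):
        sens = sum(1 for s in scores_pos if s >= t) / len(scores_pos)
        spec = sum(1 for s in scores_neg if s < t) / len(scores_neg)
        j = sens + spec - 1.0
        if best is None or j > best[3]:
            best = (t, sens, spec, j)
    return best
```

With perfectly separated groups, the sketch returns an area of 1.0 and a threshold at the lowest positive score; with identical score distributions in both groups, the area is 0.5, the chance level.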
The significant differences between the two ROC curves (calculated for hoarseness and breathiness) of the acoustic measures were determined by the difference between the areas under the curves [22].
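One common way to carry out such a comparison for two independent ROC curves is the standard error approximation of Hanley and McNeil, sketched below in Python. Whether this is the exact procedure behind [22] is an assumption; the function names and the example numbers are hypothetical.

```python
import math


def auc_standard_error(auc, n_pos, n_neg):
    """Hanley and McNeil approximation of the standard error of an
    area under the ROC curve."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def compare_aucs(auc1, n_pos1, n_neg1, auc2, n_pos2, n_neg2):
    """Two-tailed z test for the difference between two areas, treating
    the two ROC curves as independent. Returns (z, p)."""
    se1 = auc_standard_error(auc1, n_pos1, n_neg1)
    se2 = auc_standard_error(auc2, n_pos2, n_neg2)
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical areas and group sizes, not taken from Table 3.
    print(compare_aucs(0.92, 100, 118, 0.78, 150, 68))
```

Treating the curves as independent ignores the correlation that arises when both curves come from the same recordings, so the test in this form is conservative for paired data.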
The statistical analyses were performed using SPSS, version 23, for Windows (IBM Corp., Armonk, NY, USA). The tests of significance between the two correlation coefficients and between the areas under two independent ROC curves were analyzed on VassarStats (R. Lowry, Vassar College, NY, USA, 1998-2023; http://vassarstats.net/, accessed on 11 June 2023). Results were considered statistically significant at p ≤ 0.05.

Results
Table 3 presents the validation outcomes for the 13 single acoustic voice quality parameters in direct comparison to the auditory-perceptual ratings of hoarseness and breathiness. The thresholds with sensitivity and specificity, based on the ROC statistics and the Youden index, are also listed in Table 3. For hoarseness, a strong correlation was present for CPPS, HNR, and PPQ5. No acoustic parameter reached an exceptionally good level of A_ROC, and 4 of the 13 acoustic parameters revealed a low level of A_ROC, one of which (H1H2) was characterized by a chance level of diagnostic accuracy.
For breathiness, a strong correlation was present for CPPS and GNE. In contrast to hoarseness, GNE reached an exceptionally good A_ROC result, and 9 of the remaining 12 acoustic parameters had a strong level of diagnostic accuracy.
To assign a single acoustic voice quality parameter with high validity to a type of voice abnormality, (a) the absolute correlation value and the A_ROC had to be >0.70, and (b) a significant difference in validity performance between hoarseness and breathiness had to be obtained in the correlation results or the A_ROC outcomes. According to the results listed in Table 3 for hoarseness, two acoustic parameters could be identified as highly valid (HNR and PPQ5) in comparison to breathiness. For breathiness, two acoustic metrics (CPPS and GNE) were also revealed to have outstanding validity results in comparison to hoarseness.
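The two-part assignment rule can be written as a small Python predicate. This is only a sketch of the criteria as stated here, with the 0.70 cut-off and the p ≤ 0.05 significance level taken from the text; the function and parameter names are hypothetical, and the example values are not from Table 3.

```python
def is_highly_valid(rho, auc, p_corr_diff, p_auc_diff,
                    cutoff=0.70, alpha=0.05):
    """Apply criteria (a) and (b) to one acoustic parameter for one
    voice quality aspect.

    rho          -- Spearman correlation with the perceptual rating
    auc          -- area under the ROC curve for that rating
    p_corr_diff  -- p-value of the correlation difference between aspects
    p_auc_diff   -- p-value of the ROC area difference between aspects
    """
    # (a) both the absolute correlation and the ROC area exceed the cut-off
    strong = abs(rho) > cutoff and auc > cutoff
    # (b) validity differs significantly in at least one of the two tests
    distinct = p_corr_diff <= alpha or p_auc_diff <= alpha
    return strong and distinct


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(is_highly_valid(rho=-0.75, auc=0.85,
                          p_corr_diff=0.01, p_auc_diff=0.20))
```

The predicate says only that a parameter meets the criteria for one aspect; deciding which aspect it belongs to additionally requires checking which of the two validity results is the higher one.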

Discussion
The aim of the present study was to investigate the validity of single acoustic parameters representing voice quality characteristics of hoarseness or breathiness in a direct comparison of the auditory-perceptual voice quality ratings of those domains from sustained vowel [a:] phonation. Although multiparametric models are preferred in highly valid evaluations of hoarseness or breathiness [4,9,23,24], single acoustic parameters are mostly used in clinical practice and recommended protocols for instrumental assessment of voice [7]. The present study attempted to reveal the most relevant acoustic markers for hoarseness and breathiness from a pool of metrics, which are already part of relevant multiparametric models in the evaluation of voice quality, such as DSI, AVQI, and ABI.
In general, the results from the initial AVQI and ABI studies were confirmed by the present study, with comparable correlation coefficients for hoarseness and breathiness [9,24]. Although continuous speech was also considered in the voice quality evaluation for AVQI and ABI, CPPS and HNR showed high agreement for hoarseness, and CPPS and GNE presented the strongest results for breathiness. Because perceptions of breathiness are associated with high irregularity in the acoustic spectrum (e.g., a lot of spectral aperiodicity or noise), while perceptions of hoarseness can be associated with multidimensional acoustic factors other than spectral aperiodicity, it was logical that the discriminative ability of CPPS (which measures the periodicity in the acoustic spectrum) for breathiness was significantly higher than for hoarseness in this study. CPPS was originally developed to evaluate breathiness [25], which is a main subtype of hoarseness [24]. The same holds for GNE, which was also developed for the evaluation of breathiness [26]: the present study confirmed its strength in the evaluation of this voice quality aspect, with significantly higher concurrent validity and diagnostic accuracy for breathiness than for hoarseness.
A clearer unique identifier for hoarseness versus breathiness was shown in this study by the two parameters HNR and PPQ5. In the case of HNR, it is the second most important acoustic parameter in the AVQI formula after CPPS, which is supported by the results of this study [9]. The findings of this study suggest that HNR is a general parameter that does not necessarily correspond to other strong breathiness measures such as CPPS or GNE. Only PPQ5 achieved a sufficiently high agreement with hoarseness and was significantly differentiated from breathiness in the current study. This result was contrary to the results of the original study on AVQI by Maryn et al. (2010) [9]. Furthermore, in a meta-analysis on the evaluation of hoarseness, jitter parameters generally ranked significantly lower than spectral or cepstral parameters and some shimmer markers [27], but, according to the present results, PPQ5 seems to be robust enough to assess hoarseness in the evaluation of sustained vowels, which may explain why this parameter is included in the DSI formula.
The new developments based on the present study were updated in VOXplot and are available from version 2.0 (see Figure 1).

Conclusions
For voice quality evaluation on the sustained vowel, HNR and PPQ5 (for hoarseness) and CPPS and GNE (for breathiness) yielded the highest significant validity results compared to the other voice quality aspect. These four acoustic parameters should have priority in the evaluation of hoarseness and breathiness and are prominently included in VOXplot (e.g., in the voice quality circle plot).

Informed Consent Statement: Informed consent was obtained from all the subjects involved in the study.

Data Availability Statement: The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.