Detrending the Waveforms of Steady-State Vowels

Steady-state vowels are vowels that are uttered with a momentarily fixed vocal tract configuration and with steady vibration of the vocal folds. In this steady-state, the vowel waveform appears as a quasi-periodic string of elementary units called pitch periods. Humans perceive this quasi-periodic regularity as a definite pitch. Likewise, so-called pitch-synchronous methods exploit this regularity by using the duration of the pitch periods as a natural time scale for their analysis. In this work, we present a simple pitch-synchronous method using a Bayesian approach for estimating formants that slightly generalizes the basic approach of modeling the pitch periods as a superposition of decaying sinusoids, one for each vowel formant, by explicitly taking into account the additional low-frequency content in the waveform which arises not from formants but rather from the glottal pulse. We model this low-frequency content in the time domain as a polynomial trend function that is added to the decaying sinusoids. The problem then reduces to a rather familiar one in macroeconomics: estimate the cycles (our decaying sinusoids) independently from the trend (our polynomial trend function); in other words, detrend the waveform of steady-state waveforms. We show how to do this efficiently.


Introduction
Formants are characteristic frequency components in human speech that are caused by resonances in the vocal tract (VT) during speech production and occur both in vowels and consonants (In the literature, the distinction between the physical resonance of the VT and the associated characteristic frequency in the resulting speech is often not made [2] (p. 179); as such, the term "formant" can mean both, and we will follow that custom here). In source-filter theory [3], the "standard model" of acoustic phonetics, the speech production process is modeled as a linear time-invariant system [4]. In a nutshell, the input to the system is the glottal source (i.e., the vibration of the vocal folds), the system's transfer

Background
Historically, formants and steady-state vowels (as opposed to vowels in general) are intrinsically connected because the concept of a formant was originally defined in terms of steady-state vowels-see Figure 1a. What we mean by steady-state vowels in this work is the steady-state portion (which may never be attained in some cases) of a vowel utterance. This is the time interval in which (a) the VT configuration can be taken to be approximately fixed, leading to essentially constant formants, and (b) the vocal folds are sustained in a reasonably steady vibration called the glottal cycle, during which the vocal folds open and close periodically. Because of (a) and (b) the vowel waveform appears as a quasi-periodic string of elementary units called pitch periods (We use "quasi-periodic" in the (colloquial) sense of Reference [22] (p. 75), i.e., designating a recurrent function of time for which the waveforms for successive periods are approximately the same. Examples are given in Figure 1a,c,d).
In practice, steady-state vowels are identified simply by looking for such quasi-periodic strings, which typically consist of about 3 to 5 pitch periods [5]. Results from clinical trials [23] indicate that normal (non-pathological) voices are usually Type I [24], i.e., phonation results in nearly periodic sounds, which supports the notion that uttered vowels in normal speech typically reach a steady-state before "moving on" [25].  [26] used Fourier transforms of single pitch periods of steady-state vowels to calculate their spectra, and coined the term "formant" to designate the peak frequencies which were characteristic to the vowel [27] (p. ix). Shown here are examples of steady-state vowels from Hermann's work. Small vertical arrows indicate the start (glottal closing instant or GCI) of the second pitch period. Adapted from [27] (p. ix); originally from [28] (p. 40). (b) The impulse-like response produced when one of the authors excited his vocal tract by flicking his thumb against his larynx whilst mouthing "o". This is an old trick to emulate the impulse response of the vocal tract normally brought about by sharp GCIs (also known as glottal closures). The impulse responses triggered by sharp GCIs can be observed occasionally in the waveforms of vocal fry sections or as glottal stops /P/ [27] (p. 49). (c) Three pitch periods taken from a synthesized steady-state instance of the vowel /7/ at a fundamental frequency of 120 Hz and sampled at 8000 Hz. The trend inferred by our model is a fifth order polynomial. This example is discussed in Section 4.1. (d) Three pitch periods taken from a steady-state instance of the vowel /ae/ at a fundamental frequency of 138 Hz. The trend inferred by our model is a weighted combination of a 4th and 5th order polynomial. This example is discussed in Section 4.2. Source: [29], bdl/arctic_a0017.wav, 0.51-0.54 s, resampled to 8000 Hz.
During the steady-state, the pitch periods happen in sync with the glottal cycle [30]. The start of the pitch period coincides with the glottal closing instant (GCI), which causes a sudden agitation in the waveform and can be automatically detected quite reliably [31]. The duration between the GCIs, i.e., the length of the pitch periods, defines the fundamental frequency of the vowel, which for women is on the order of (5 ms) −1 = 200 Hz and for men on the order of (8 ms) −1 = 125 Hz [32,33]. The GCI pulse at the start of each pitch period is often so sharp that, recalling source-filter theory, the resulting response of the VT approximates the VT impulse response-see Figure 1b. In contrast, the glottal opening instant (GOI) excites the VT only weakly, which injects additional low-frequency content into the waveform roughly halfway through the pitch period.

The Pinson Model
The observation that the GCI pulse is often sufficiently sharp underlies Pinson's basic model [19] of the portion of the pitch period between GCI and GOI as being essentially the VT impulse response. The model was originally proposed to estimate formants during voiced speech directly in the time domain. The reason why the model does not address the whole pitch period is an additional complication due to the GOI: when the glottis is open, the VT is coupled to the subglottal cavities (such as the lungs), which slightly increases the bandwidths of the formants, and thus slightly shifts the poles of the VT transfer function. If the "closed half" of the pitch period is characterized by Q formants with bandwidths α = {α 1 · · · α Q } and frequencies ω = {ω 1 · · · ω Q }, then the VT transfer function H P has Q conjugate pole pairs and up to (2Q − 1) zeros: where N(s) is a polynomial of constrained degree in order that H P be proper. (Note that, to emphasize the physical connection to resonances, we denote the bandwidth and frequency of the kth formant by (α k , ω k ) and not by (B k , F k ) as is more customary in acoustic phonetics.) Pinson's model for the "closed half" of one pitch period is then just a bias term plus the associated impulse response of H P (its inverse Laplace transform) constrained to live between GCI (t = 0) and GOI (t = T O ): Here the (2Q + 1) amplitudes b = {b 1 · · · b 1+2Q } are determined by N(s), α and ω through the partial fraction decomposition of Equation (1). The model is then put to use by locating a steady-state vowel, choosing the central pitch period and sampling the "closed half" between GCI and GOI (which at the time had to be estimated by hand). Denoting the N samples by d = {d 1 · · · d N } sampled at times t = {t 1 · · · t N }, estimates of the Q formants were found by weighted least-squares: Here w i is an error-weighting function which deemphasizes samples close to GCI and GOI. The amplitude estimatesb can be used to reconstruct the fit to the data and determine N(s) but are otherwise not used for the formant estimation. Two features of Pinson's model are of interest here. The first one is that the bandwidth estimates obtained by this method seem to be (much) more reliable than those obtained by today's standard linear predictive coding (LPC) methods [13,14], when compared to bandwidths measured by independent methods [4,34,35]. The second one is the direct parametrization of the model function in Equation (2) by the formant bandwidths and frequencies (α, ω), which, as we explain in Section 3, transparently enables uncertainty quantification for their estimates (α,ω) in a straightforward and transparent way-this is much harder in LPC-like methods.

The Proposed Model for a Single Pitch Period
The model for a single pitch period we propose in this paper is a simple generalization of Pinson's model in Equation (2) based on the empirical observation that the waveforms in pitch periods often seem to oscillate around a baseline or trend, which becomes more pronounced towards the end of the pitch period-see the examples in Figure 1c-d. In order to model this trend, we generalize the bias term in Equation (2) to an arbitrary polynomial of order P − 1 and widen the scope of the model to the full pitch period of length T: Entropy 2020, 22, 331

of 21
As before, the b = {b 1 · · · b P+2Q } are the model amplitudes, but now f is an implicit function of (P, Q), the model orders. P and Q will be subject to variation during model fitting, as opposed to the Pinson model where P ≡ 1 and Q was decided upon beforehand. The Bayesian approach in Section 3 avoids such a particular choice by using model averaging over all allowed values of (P, Q) to estimate the 2Q formant bandwidths and frequencies (α, ω). As we will discuss in Section 2.4, the trend is caused by the weak excitation of the VT by the GOI. The trend is essentially a low-frequency byproduct of the glottal "open-close" cycle driving steady-state vowels (and is therefore absent at isolated glottal closure events, as in Figure 1b). The main innovation of the model is the assumption that the low-frequency content can be modeled adequately by superimposing a polynomial to the impulse response of the VT in the time domain. This reduces the problem of estimating formants to one frequently encountered in macroeconomics. This problem is the detrending of nonstationary time series such as business cycles for which one needs to estimate the cycles (in our case, parametrized by α and ω) independently from the trend (in our case ∑ P k=1 b k t k−1 ) [36]. Since our proposed model for a single pitch period in Equation (3) is a generalization of Equation (2), it inherits the more reliable bandwidth estimation and the ability for straightforward uncertainty quantification from the Pinson model. In addition, the cumbersome labor of handpicking the "closed half" of the pitch period is eliminated by extending the scope of the model to full pitch periods, for which automatically estimated GCIs can be used. However, we pay for this convenience with a less precise model as we do not take into account the change in the formant bandwidths during the open part of the glottal cycle.

Outline
In Section 2 we state the model for a steady-state vowel, which is a chain of single pitch period models all sharing the same (α, ω) parameters, and discuss the origin of the trend.
In Section 3 we discuss how the formants and the uncertainty on their estimates are inferred using Bayesian model averaging and the nested sampling algorithm [37].
In Section 4 we apply the model to synthetic data and to real data. Section 5 concludes.

A Pitch-Synchronous Linear Model for Steady-State Vowels
The model we propose is a simple variation of the standard linear model [38]: Here d = {d 1 · · · d N } is a vector holding the dataset of N points sampled at times t = {t 1 · · · t N }, f is the model function, G is an N × m matrix holding the m basis functions which are function of t and the r "nonlinear" parameters θ, and b = {b 1 · · · b m } is a vector of m "linear" amplitudes. (The variables just listed describe the standard linear model in general terms; we connect these variables to our specific problem below in Section 2.1) The probabilistic aspect enters with our pdf for the vector of N errors e = {e 1 · · · e N } which is the classical separable multivariate Gaussian characterized by a single parameter, the noise amplitude σ (For more on the rationale behind assigning this pdf, see References [39,40] or, more concisely, Reference [41]). The noise power σ 2 may be also be expressed as the signal-to-noise ratio SNR = 10 log 10 f T f /Nσ 2 . It is well-known that for certain priors the simple form in Equation (4) allows for the marginalization over the amplitudes b and noise amplitude σ in the posterior distribution p(b, θ, σ|d, I) (where I ≡ our prior information), such that the posterior for the r nonlinear parameters p(θ|d, I) can be written in closed form. The variation on the standard linear model just mentioned consists of promoting the dataset to a set of n dataset vectors, one for each pitch period in the steady-state vowel, and we fit the model to each d i simultaneously while keeping θ and σ fixed but allowing each pitch period its own set of m amplitudes b i and errors e i . Thus Equation (4) becomes the set of equations: The form of Equation (5) and our choice of priors below ensure that the marginalization over the {b i } and σ is still possible. What remains is to specify the model function f and the priors for the {b i }, θ and σ [42].

The Model Function
Given a steady-state vowel {d 1 · · · d n } segmented into n pitch periods, typically with the help of an automatic GCI detector. We assume that the data have been normalized and sampled at regular intervals and choose dimensionless units, such that the sampling times are If we assume that the steady-state vowel is characterized by Q formants, the r = 2Q nonlinear parameters of the model are the Q formant bandwidths α = {α 1 · · · α Q } and Q formant frequencies Further assuming that the trend in each pitch period can be modeled by a polynomial of degree P − 1 (which may differ in shape in each pitch period but always has that same degree), the model function for the ith pitch period has m = P + 2Q basis functions and the same amount of amplitudes b i : where G i is an N i × m matrix holding the P polynomials and 2Q damped sinusoids: It is easy to verify that Equations (6) and (7) together are equivalent to the model for a single pitch period in Equation (3) we have motivated in Section 1.3. Thus, to summarize, our pitch-synchronous linear model for a steady-state vowel is essentially a chain of n linear single pitch period models, all constrained by (or rather, frustrated into) sharing the same parameters θ, which embodies our steady-state assumption that the formants do not change appreciably. The trend functions are only constrained by their degree, such that their shape may vary from period to period. In addition, the amplitudes of the damped sines may vary as well, which means the intensity and phase of the damped sines can vary from period to period.

The Priors
The prior pdfs for the model parameters are The pdfs are zero outside of the ranges indicated between the parentheses on the right. In our implementation of the model we set δ = 1 because (a) we normalize the data before fitting the model and (b) we use normalized Legendre functions instead of the computationally inconvenient polynomial functions in Equation (7) [43]. Thus this assignment for p({b i }|I) essentially expresses that we expect all amplitudes to be of order one and acts like a regularizer such that the amplitudes take on "reasonable" values [44].
The Jeffreys priors for the formant bandwidths and frequencies θ stem from representation invariance arguments reflecting our a priori ignorance about their true values-see in App. A in [39]. The ranges [θ lo j , θ hi j ] in which the true values are supposed to lie must be decided by the user. Typically, the user can estimate the θ directly with LPC. Additionally, based on the phoneme at hand, one can look up plausible ranges for the formant frequencies and bandwidths in the literature, e.g., References [4,45]. In our experiments in Section 4 we used both approaches to set the ranges for θ (see Table 1 below) [46].
Likewise, the Jeffreys prior for the noise amplitude σ is bounded by σ lo and σ hi . For the lower bound we may take σ lo 1 (e.g., on the order of the quantization noise amplitude) and for the upper bound we would conceivably have σ hi ≈ 1 because of the normalization imposed on the data. As we will discuss below, the precise values of these bounds have a negligible influence on our final conclusions (i.e., model averaging over values of (P, Q) and obtaining weighted samples from the posterior of θ) on the condition that they may be taken sufficiently wide such that most of the mass of the integral below in Equation (20) is contained within them.

The Likelihood Function
The likelihood function is where N = ∑ n i=1 N i and the scalar "least-squares" quadratic form

The Origin of the Trend
In this Section we present a heuristic derivation of the trend from source-filter theory.
According to the source-filter model of speech, the speech wave y(t) is in general the output of a linear time-invariant system which models the VT with impulse response (transfer function) h(t) (H(s)) and the radiation characteristic with impulse response (transfer function) r(t) (R(s)), and is driven by the glottal flow u(t): where * denotes convolution. We now proceed in the canonical way. As R(s) ∝ s for up to 4 kHz [4] (p. 128), and we work modulo rescaling, we can take [47] y(t) u (t) * h(t) (9) in this frequency range; accordingly, the glottal source may be taken to be the derivative of the glottal flow u (t). In this paper we will always work with signals resampled to a bandwidth of 4 kHz, which essentially limits us to the first three or four formants [48] (p. 20).
If we decompose the glottal cycle into a sharp delta-like excitation at GCI and a weak excitation at GOI, the glottal source during a pitch period may be written as where a 1 is a constant, and l(t) represents a slowly changing and broad function in which the spectral power |L(ω)| decreases quickly with increasing frequency ω. Substituting Equation (10) in Equation (9) gives where, as before, y(t) represents the speech signal during one pitch period. As l(t) is a smooth and broad function, the magnitude of its Fourier transform |L(ω)| will be quite narrowly concentrated around ω ≈ 0. In contrast, |H(iω)| will have its first peak only at F 1 0, the first formant, and therefore generally only a slowly rising slope near frequencies ω ≈ 0 if H(s) can be represented as an all-pole transfer function (which is the assumption behind LPC [49]). Since where L denotes the Fourier transform, if |L(ω)| falls off sufficiently fast, the slope of |H(iω)| near ω ≈ 0 will be near constant, and we may write Thus we may write, very roughly, that the speech signal during one pitch period As we have assumed that l(t) is a smooth and broad function, it is reasonable to assume that it can be modeled as a polynomial. Thus l(t) is our trend function, which modulo scaling essentially passes unscathed through convolution with h(t) because of the absolute mismatch in their frequency content.

Inferring the Formant Bandwidths and Frequencies: Theory
The posterior distribution for the (nm + r + 1) model parameters is We use the nested sampling algorithm [37] to infer the parameters of interest, the formant bandwidths and frequencies θ in the following way: Once the data {d i } are gathered, a "grid" of plausible model order values and prior ranges for the θ are proposed. Each point (P, Q) on that grid parametrizes a particular model for the {d i }. The evidence for a (P, Q) model where the integrand is a function of the model orders implicitly, can be estimated with the nested sampling algorithm. A (highly desirable) byproduct of this estimation is the acquisition of a set of weighted samples from the posterior in Equation (11), from which the estimates and error bars of the formant bandwidths and frequencies can be calculated, as well as any other function of them (such as the VT transfer function). However, we show in Section 3.1 that by performing the integrals over {b i } and σ Equation (12) can be written as where L I is the integrated likelihood. Thus we can sample our parameters of interest, the θ, directly from Equation (13) instead of Equation (12). The uncertainty quantification for the formants is then accomplished through Bayesian model averaging. Using Equation (13), the evidence Z(P, Q) is calculated and samples θ Finally, we note that sampling θ from Equation (13) instead of Equation (12) reduces the dimensionality of the parameter space from nm + r + 1 = n(P + 2Q) + 2Q + 1 → r = 2Q, which compares favorably to the increased cost of evaluating L I compared to L. In a typical application, n = 3, P = 5 and Q = 3, such that the dimensionality is reduced from 40 to a mere 6 dimensions. The dimensionality of the problem does not depend on the number of pitch periods n (but its complexity does).

The Integrated Likelihood
We begin by writing Equation (12) as where the quadratic form for the ith pitch period The expression for the integrated likelihood L I , defined implicitly by Equation (13), can be found from Equation (16) by marginalizing over the amplitudes {b i } and the standard deviation σ. We begin with the {b i }. Defining (see, e.g., App. A in [38]) it can be shown after some effort that (using, e.g., [50]) This is a good approximation if g i /σ 2 + I/δ 2 ≈ g i /σ 2 , i.e., if for all j = 1 · · · m it holds that In our implementation, where δ = 1, we found this to be an acceptable approximation for all states ({b i }, θ, σ) with an appreciable likelihood L.
It is interesting to note that when the least-squares measure is evaluated at the optimal amplitudesb i , it reduces to Thus, Equation (17) can be written from left to right as the sum of (a) a density term, (b) a term quantifying the goodness-of-fit and (c) a regularization term: Having completed the square in Q (i) F , the integral over the amplitudes b i is elementary, and we arrive at When it comes to the polynomial amplitudes, marginalizing over them can be seen as detrending (also called background removal [41]). Likewise, marginalization over the damped sinusoid amplitudes corresponds to removing their amplitudes and phases, i.e., we are only interested in the poles. The next step is to marginalize over the standard deviation by performing the integral in the large square brackets in Equation (19), i.e., We assume a reasonable amount of model functions m in relation to the number of data points N such that N > nm. For states (θ, σ) with appreciable likelihood practically all of the mass of Equation (20) is concentrated near the peak of its integrand at which we may assume to be within the bounds [σ lo , σ hi ] if these are sufficiently wide and a reasonable fit to the data is possible (see App. A in [39] for more details). Assuming this is the case, Equation (20) can be safely converted into an elementary gamma integral by letting σ lo → 0 and σ hi → ∞ and the marginalization can be performed analytically. Doing so we finally obtain the expression for the integrated likelihood L I : where C(P, Q) = 1 2 is a pure model order regularization term, being function only of P and Q. The factor [log σ hi /σ lo ] −1 due to the normalization of the Jeffreys prior for σ is a constant independent of (P, Q, θ) and subsequently cancels out in model averaging and the weighting of posterior samples of θ.

Optimization Approaches
Though the nested sampling approach proposed here is different from the optimization approach we used in the conference paper, it is still possible to formulate a straightforward iterative optimization scheme for this problem. Indeed, a least-squares search in θ can still be used-the marginalization over the amplitudes would then be called "variable projection" [51], and the amplitude regularization in Equation (18) can still be incorporated by treating the {b i } as model predictions for nm additional datapoints measured to be zero with errorbar δ. To fix the scale for the actual N datapoints, these could be assigned fictional errorbars as well with magnitudeσ as defined in Equation (21). Fast Fourier transformations, initially on the data and after on fit residuals, could be used for the initial guesses for the formant frequencies [39].
However, we have refrained from developing this approach mainly because of two reasons. First, due to the low dimensionality of the problem the nested sampling runs we ran to calculate Z(P, Q) for the next section were not unbearably slow (even in a Python/NumPy context) as most runs finished under two minutes. Second, nested sampling allowed us to calculate the evidence for a model order Z(P, Q) with confidence, while an optimization approach based on a Laplace approximation can easily give poor results. We did not, however, consider variational approaches.

Application to Data
In our experiments we used Praat [52] interfaced with the parselmouth Python library [53] with the default recommended settings to segment steady-state vowels into n pitch periods and get initial estimates for the formant bandwidths and frequencies to determine plausible ranges for their true values (Table 1). For the nested sampling we used the static sampler of the excellent dynesty Python library [54], again with default settings. Table 1. The prior ranges [θ lo j , θ hi j ] for the Jeffreys priors for θ used in the two applications of the model to data. The ranges for the formant bandwidths (α j ) and frequencies (ω j ) are given in Hz; for example, the prior range for the first formant ω 1 is 200-700 Hz. The data consists of a synthetic steady-state /7/ (Section 4.1) and a real steady-state /ae/ (Section 4.2). The range of the model order P was generally set to P = (1, 2, . . . , 10) while the allowed values of Q were more specific to the application (given that the signal bandwidth was limited to 4 kHz). In our experiments we found that high-degree polynomials tend to become too wiggly, thereby competing against the damped sinusoids for spectral power in awkwardly high frequency regions. It appeared that a good rule of thumb to prevent this behavior was to limit P ≤ 10 (and thus set the maximum polynomial degree to 9). We also note that the case P = 1 together with n = 1 would correspond to the Pinson model of Section 1.2, if we would disregard the fact that we model the entire pitch period (from GCI to GCI) as opposed to only the portion between GCI and GOI.

Synthesized Steady-State /7/
We apply the model first to a synthetic steady-state vowel /7/ to verify the model's prediction accuracy and to see whether the inferred polynomial correlates with u (t), which we would expect based on the arguments of Section 2.4. The vowel was generated with different parameter settings [32] (p. 121) which emulate female and male speakers at different fundamental frequencies F 0 spanning the entire range of normal (non-pathological) speakers [4].
The vowel /7/ was synthesized by first generating an artificial glottal source signal-the glottal flow derivative u (t)-at a sampling rate of 16 kHz, which was then filtered by an all-pole VT transfer function consisting of Q true = 3 poles with realistic formant values (which we will refer to as "the true values" from now on) based on Reference [55] (p. 163), and then downsampled to 8 kHz. Now yielding to the usual notation of acoustic phonetics, the true bandwidths B true (our αs) and frequencies F true (our ωs) used for the VT transfer function are B true = (54,22,19) Hz and F true = (430, 1088, 2142) Hz.
The glottal flow derivative u (t) was generated using the LF model [56]. For the male speakers, the 11 values of F 0 = (80, 90, . . . , 180) Hz, increasing in steps of 10 Hz. Likewise, for female speakers, the 11 values of F 0 = (160, 170, . . . , 260) Hz. We applied tiny but realistic values of jitter (0.5%) and shimmer (2%) to the generated pitch periods, which greatly improved the perceived naturalness of the steady-state vowel's sound. Finally, we selected the n = 3 central pitch periods for analysis.
As mentioned before, the range of P was set to P = (1, 2, . . . , 10). The allowed range of Q was set to Q = (1, 2, 3). The ranges for the formant bandwidths and frequencies [θ lo j , θ hi j ] are given in Table 1. For each synthesized steady-state /7/, the posterior probability of P and Q was calculated-see Figure 2. It is clear that the majority of the most probable values P MP and Q MP are close to unity, which indicates that typically the model has an outspoken preference for a particular value of P and Q. In this case we see that Q = 2 = Q true = 3 is heavily preferred, which means that in this particular experiment the model did not pick up the third formant [57]. Contrary to the number of formants Q, the preferred polynomial order P − 1 is more dependent on variations in F 0 , with P about 5 to 6 for male speakers and smaller for female speakers.   F 1 ) and second (B 2 , F 2 ) formants. Each point in the graphs represents an estimate either by our model or by Praat for a synthesized steady-state /7/ according to a speaker sex and fundamental frequency F 0 . The sex is indicated by black (male) or lightgreen (female). The model's estimates are averaged over all allowed model orders (i.e., values of (P, Q)) according to Equation (14), though in practice only one or two values of (P, Q) dominate (as suggested by  In Figure 3, the results of a test of the model's prediction accuracy according to the model-averaged estimates in Equation (14) are shown for the frequency and bandwidth of the first (B 1 , F 1 ) and second (B 2 , F 2 ) formants. In accordance with Figure 2, we did not show the estimates for the third (B 3 , F 3 ) formant as the most probable value of the number of formants in the data is Q MP = 2-indeed, the errorbars for the estimates for (B 3 , F 3 ) were huge, rendering those estimates practically useless.  . The model's prediction accuracy for the bandwidth and frequency of the first (B 1 , F 1 ) and second (B 2 , F 2 ) formants. Each point in the graphs represents an estimate either by our model or by Praat for a synthesized steady-state /7/ according to a speaker sex and fundamental frequency F 0 . The sex is indicated by black (male) or lightgreen (female). The model's estimates are averaged over all allowed model orders (i.e., values of (P, Q)) according to Equation (14), though in practice only one or two values of (P, Q) dominate (as suggested by Figure 2). The model's estimates are the dots with the errorbars at three standard deviations. The linear predictive coding (LPC) estimates acquired with Praat are plotted as crosses. The true values Btrue and Ftrue are drawn as dotted horizontal lines. Figure 3. The model's prediction accuracy for the bandwidth and frequency of the first (B 1 , F 1 ) and second (B 2 , F 2 ) formants. Each point in the graphs represents an estimate either by our model or by Praat for a synthesized steady-state /7/ according to a speaker sex and fundamental frequency F 0 . The sex is indicated by black (male) or lightgreen (female). The model's estimates are averaged over all allowed model orders (i.e., values of (P, Q)) according to Equation (14), though in practice only one or two values of (P, Q) dominate (as suggested by Figure 2). The model's estimates are the dots with the errorbars at three standard deviations. The linear predictive coding (LPC) estimates acquired with Praat are plotted as crosses. The true values B true and F true are drawn as dotted horizontal lines. Figure 3 shows a striking dichotomy between the results for male and female speakers. For the male speakers, the model's estimates seem to perform equally well or better than the LPC estimates. In contrast, the performance dramatically decreases for female speakers with the true values often outside of the already exceedingly large error bars. For B 2 and F 2 , the model does communicate its uncertainty about the true value of the bandwidths and frequencies by returning estimates with huge error bars, but unfortunately for B 1 and F 1 its estimates can be quite misleading.
The reason for this significant change in performance is mainly due to the change in the fundamental frequency F 0 . As F 0 rises, the near-impulse response waveforms triggered by the GCIs tend to "spill over" into the next pitch period, i.e., the damped sinusoids caused by a given GCI are still ringing out appreciably when the next GCI happens, thus contaminating the pack of newly triggered decaying sinusoids. These nearest-neighbor effects increasingly wreck the assumption that a pitch period is only made up of a trend plus the VT impulse response. This is also evident from Figure 2a, where the most probable degree of the trend polynomial (P MP − 1) drops to zero for very high fundamental frequencies, suggesting that the low-frequency content is picked up by F 1 , which Figure 3 confirms. From this experiment it appears that the threshold of F 0 is around 150 Hz. This essentially means that the model is limited to male speakers only.
The difficulties we encounter here reflect a known phenomenon in the literature: formant analysis for female voices is in general harder compared to male voices, regardless of the method used (e.g., [15], see also [16] (pp. 124-126) for an excellent discussion). Indeed, the negative correlation between F 0 and the estimates' accuracy in Figure 3 is exhibited for both our model and Praat's LPC algorithm. Next to the higher F 0 we already mentioned, another cause of this phenomenon is the fact that the coupling between the glottal source and VT is generally stronger in females [58], which violates the assumption of source and filter separability which underlies source-filter theory ("the female VT is not merely a small-scale version of the male VT" [59]).
Next, in Figure 4, we look at a typical case for male speakers, F 0 = 120 Hz [32], for which the model performed quite well. The third formant F 3 can be seen clearly in the spectrum of the residuals, though the model concluded it was noise as can be seen by the rather large error bars on the fit residuals. The bottom panels also show that the inferred polynomial trend correlates well with the true u (t).   Finally, using the same synthesized steady-state vowel as in Figure 4, we gauge how inaccuracies in the segmentation of the steady-state vowel into n pitch periods affect the model's estimates. In Figure 5, we simulate errors in this preprocessing step by parametrizing the relative error in estimating the pitch periods {T i = τ i+1 − τ i } as (0 ≤ ≤ 1) and perturb the n + 1 known GCIs at {τ i } according to (1 ≤ i ≤ n + 1).  Testing the robustness of the bandwidth and frequency estimates of the first (B 1 , F 1 ) and second (B 2 , F 2 ) formants against increasing relative error in pitch period {T i } estimation for a synthetic /7/ in the case F 0 = 120 Hz for a male speaker. Errors in {T i } estimation induce errors in the pitch period segmentation according to the GCIs {τ i } and thus transfer to the formant estimates, which are acquired through model averaging as defined in Equation (14). The method is explained in detail in the main text. In each panel the green star indicates the estimates for the unperturbed {τ i } (i.e. = 0%), for which no averaging has been done. (a) The fit quality as gauged by the SNR (defined in Section 2) as a function of . Each point and its errorbar are the empirical mean and standard deviation at three σ, respectively, over 6 draws. For reference, the prediction gain of adaptive LPC for stationary voiced speech sounds is typically about 20 dB [60] (p. 70). (b) Comparison of the formant estimates as a function of to their true values B true and F true (dotted horizontal lines). Each point is the empirical mean of the point estimates over 6 draws, and each errorbar is the empirical mean of the point estimates' errorbars at three standard deviations over the same 6 draws. The steady-state vowel is then segmented into n pitch periods {d ( ) 1 · · · d ( ) n } using the set of perturbed GCIs {τ ( ) i } and estimates of the noise amplitudeσ ( ) and theθ ( ) are obtained. We repeated this procedure 6 times by drawing 6 sets of the {T ( ) i } for each value of ∈ {1%, 5%, 10%, 15%, . . . , 30%}. The results averaged over the draws are shown in Figure 5a,b. The conclusion from this particular experiment is that while the fit quality deteriorates strongly as the relative error in estimating the pitch periods grows (a), the formant estimates degrade relatively gracefully (b). One contributing factor for this is a feature of our model: it performs a kind of generalized averaging [39] (Sec. 7.5) over the pitch periods to arrive at "robust" estimates of the θ.

Real Steady-State /ae/
For real data, we do not know the underlying glottal source u (t) as this is very hard to measure reliably. An alternative to measuring the glottal flow directly is the electroglottograph (EGG). The EGG signal can be used as a probe for the glottal source, as we will explain below.
The CMU ARCTIC database [29] consists of utterances which are recorded simultaneously with an EGG signal. The source of the steady-state /ae/ used in this section is a male speaker called BDL, file bdl/arctic_a0017.wav, 0.51-0.54 s, resampled to 8000 Hz. The fundamental frequency F 0 is about 138 Hz.
Once again the range of P was set to P = (1, 2, . . . , 10). The allowed range of Q was set to Q = (1, 2, 3, 4). The ranges for the formant bandwidths and frequencies [θ lo j , θ hi j ] are given in Table 1. In Figure 6, the posterior probability for the individual model orders is shown, from which a clear preference for P = 5 and Q = 4 arises.
In Figure 7, we show the model-averaged posterior distributions in Equation (14) for the formant bandwidths and frequencies. The estimates of the frequencies are reasonably sharp and agree quite well with the LPC estimates obtained with Praat. The error bars on the bandwidths increase gradually until the uncertainty on B 4 has become so large that it is essentially unresolved. This increase in the uncertainty on the bandwidths mirrors the fact that measuring bandwidths becomes increasingly difficult for higher formants [13].
Finally, we correlate the inferred trend together with the EGG signal in Figure 8. The EGG signal is the electrical conductance between two electrodes placed on the neck. When the glottis is closed and the glottal flow u(t) is zero, the measured conductance is high, and vice versa. The EGG signal rises sharply at the GCI, i.e., when the glottal flow drops abruptly to zero. From the discussion in Section 2.4 and in particular Equation (10), the glottal flow u(t) modulo a bias constant can be estimated very roughly from the inferred polynomial by integrating it over time. It is seen in the plot that the expected anticorrelation is borne out: when the u(t) estimate hits a through (GCI), the EGG signal rises sharply. Conversely, when the u(t) estimate hits a peak (GOI), the EGG signal hits a through, as the electrical conductance across the two electrodes drops due to the opening of the glotts.  Figure 6. The posterior probabilities of the joint (a) and separate (b) model orders for the steady-state /ae/ according to Equation (15). In this case, model averaging is for all practical purposes equivalent to model selection as the model (P = 5, Q = 4) occupies 98% of the posterior mass. Figure 6. The posterior probabilities of the joint (a) and separate (b) model orders for the steady-state /ae/ according to Equation (15). In this case, model averaging is for all practical purposes equivalent to model selection as the model (P = 5, Q = 4) occupies 98% of the posterior mass.    P,Q × Z(P, Q)/ ∑ P,Q Z(P, Q). The dotted vertical lines indicate a distance of three standard deviations from the mean, which is also stated in the panel titles together with the point estimate. The solid vertical lines indicate the LPC estimates obtained with Praat.  Comparison of the estimate of u(t) modulo a bias constant and the measured electroglottograph (EGG) signal. The speech signal, and therefore the u(t) estimate, lags behind the EGG signal by approximately 1 ms due to the distance between the glottis and the microphone.

Conclusions
The proposed model is a modest step towards formant estimation with reliable uncertainty quantification in the case of steady-state vowels. In our approach, the uncertainty quantification is implemented through Bayesian model averaging. The validity of our approach depends on the assumption that pitch periods can be modeled accurately as being composed of a slowly changing trend superimposed on a set of decaying sinusoids that represent the impulse response of the VT. It appears that this assumption likely holds only for fundamental frequencies F 0 below about 150 Hz, which poses a grave restriction on its use as this excludes most female speakers.