A Weakly Informative Prior for Resonance Frequencies †

Abstract: We derive a weakly informative prior for a set of ordered resonance frequencies from Jaynes' principle of maximum entropy. The prior facilitates model selection problems in which both the number and the values of the resonance frequencies are unknown. It encodes a weak inductive bias, provides a reasonable density everywhere, is easily parametrizable, and is easy to sample. We hope that this prior can enable the use of robust evidence-based methods for a new class of problems, even in the presence of multiplets of arbitrary order.


Introduction
An important problem in the natural sciences is the accurate measurement of resonance frequencies. The problem can be formalized by the following probabilistic model:

$$p(D, \mathbf{x} \mid I) = p(D \mid \mathbf{x})\, p(\mathbf{x} \mid I) \equiv L(\mathbf{x})\, \pi(\mathbf{x}), \tag{1}$$

where $D$ is the data, $\mathbf{x} = \{x_k\}_{k=1}^{K}$ are the $K$ resonance frequencies of interest, and $I$ is the prior information about $\mathbf{x}$. As an example instance of (1), we refer to the vocal tract resonance (VTR) problem discussed in Section 5, for which $D$ is audio recorded from the mouth of a speaker, $\mathbf{x}$ is a set of $K$ VTR frequencies, and the underlying model is a sinusoidal regression model. Any realistic problem will include additional model parameters $\boldsymbol{\theta}$, but these have been silently ignored by formally integrating them out of (1), i.e., $p(D, \mathbf{x} \mid I) = \int d\boldsymbol{\theta}\, p(D, \mathbf{x}, \boldsymbol{\theta} \mid I)$.
In this paper, we assume that the likelihood $L(\mathbf{x}) \equiv p(D \mid \mathbf{x})$ is given, and our task is to choose an uninformative prior $\pi(\mathbf{x}) \equiv p(\mathbf{x} \mid I)$ from limited prior information $I$. A conflict arises, however:

The uninformative priors $\pi$ most commonly chosen to express limited prior information $I$ are, in practice, often precluded by that same $I$. (2)
The goal of this paper is to describe this conflict (2) and to show how it can be resolved by adopting a specific choice for π. This allows robust inference of the number of resonances K in the important case of such limited prior information I, which in turn enables accurate measurement of the resonance frequencies x with standard methods such as nested sampling [1] or reversible jump MCMC [2].

Notation
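The symbol $\pi$ is intended to convey a vague notion of a generally uninformative or weakly informative prior. Definite choices for $\pi$ are indicated with the subscript $i$:

$$\pi_i(\mathbf{x}) \equiv \pi_i(\mathbf{x} \mid \boldsymbol{\beta}_i), \qquad i = 1, 2, 3, \tag{3}$$

where $\boldsymbol{\beta}_i$ is a placeholder for the hyperparameter specific to $\pi_i$. Note that in the plots below and for the experiments in Section 5, the values of the $\boldsymbol{\beta}_i$ are always set according to Table 1.
Each $\pi_i$ uniquely determines a number of important high-level quantities, since the likelihood $L(\mathbf{x})$ and data $D$ are assumed to be given. These quantities are the evidence for the model with $K$ resonances:

$$Z_i(K) = \int d^K\mathbf{x}\, L(\mathbf{x})\, \pi_i(\mathbf{x}), \tag{4}$$

the posterior:

$$P_i(\mathbf{x}) = \frac{L(\mathbf{x})\, \pi_i(\mathbf{x})}{Z_i(K)}, \tag{5}$$

and the information:

$$H_i(K) = \int d^K\mathbf{x}\, P_i(\mathbf{x}) \log \frac{P_i(\mathbf{x})}{\pi_i(\mathbf{x})}, \tag{6}$$

which measures the amount of information obtained by updating from prior $\pi_i$ to posterior $P_i$, i.e., $H_i(K) \equiv D_{\mathrm{KL}}(P_i \mid \pi_i)$, where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence.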
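To make the three quantities concrete, the following is a minimal sketch that evaluates the evidence, the posterior, and the information on a grid for a single frequency. The Gaussian-shaped likelihood, the uniform prior, and all numerical values are hypothetical choices made purely for illustration; they are not the model or the settings used in this paper.

```python
import numpy as np

# Toy 1-D setup (all values hypothetical): one frequency x in Hz on a grid.
x = np.linspace(100.0, 5000.0, 20001)
dx = x[1] - x[0]

# Hypothetical Gaussian-shaped likelihood peaked at 700 Hz.
likelihood = np.exp(-0.5 * ((x - 700.0) / 25.0) ** 2)

# Hypothetical prior: uniform over the grid range.
prior = np.full_like(x, 1.0 / (x[-1] - x[0]))

# Evidence: Z = integral of L(x) * pi(x) dx, here a simple Riemann sum.
Z = np.sum(likelihood * prior) * dx

# Posterior: P(x) = L(x) * pi(x) / Z.
posterior = likelihood * prior / Z

# Information: H = D_KL(P | pi) = integral of P * log(P / pi) dx, in nats.
# Grid points where the posterior underflows to zero contribute nothing.
mask = posterior > 0
H = np.sum(posterior[mask] * np.log(posterior[mask] / prior[mask])) * dx

print(f"Z = {Z:.4e}, H = {H:.3f} nats")
```

In realistic problems with several frequencies a grid is not feasible, which is why the experiments rely on nested sampling to estimate these quantities.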

Conflict
First, the uninformative priors $\pi$ referenced in (2) are of the independent and identically distributed type:

$$\pi(\mathbf{x}) = \prod_{k=1}^{K} g(x_k \mid \boldsymbol{\beta}), \tag{7}$$

where $g(x \mid \boldsymbol{\beta})$ is any wide distribution with hyperparameters $\boldsymbol{\beta}$. A typical choice for $g$ is the uniform distribution over the full frequency bandwidth; other examples include diffuse Gaussians or Jeffreys priors [3][4][5][6][7][8][9]. Second, the limited prior information $I$ in (2) about $K$ implies that the problem will involve model selection, since each value of $K$ implicitly corresponds to a different model for the data. It is thus necessary to evaluate and compare the evidence $Z(K)$ for the plausible values of $K$.

The conflict between these two elements is due to the label switching problem, which is a well-known issue in mixture modeling, e.g., [10]. The likelihood functions $L(\mathbf{x})$ used in models parametrized by resonance frequencies are typically invariant to switching the label $k$; i.e., the index $k$ of the frequency $x_k$ has no distinguishable meaning in the model underlying the data. The posterior $P(\mathbf{x}) \propto L(\mathbf{x})\pi(\mathbf{x})$ will inherit this exchange symmetry if the prior is of type (7). Thus, if the model parameters $\mathbf{x}$ are well determined by the data $D$, the posterior landscape will consist of one primary mode, which is defined as a mode living in the ordered region:

$$R_K(x_0) = \{\mathbf{x} \mid x_0 \le x_1 \le x_2 \le \cdots \le x_K\}, \tag{8}$$

and $(K! - 1)$ induced modes, which are identical to the primary mode up to a permutation of the labels $k$ and thus live outside of the region $R_K(x_0)$. The trouble is that correctly taking into account these induced modes during the evaluation of $Z(K)$ requires a surprising amount of extra work in addition to tuning the MCMC method of choice, and that is the label switching problem in our setting. In fact, there is currently no widely accepted solution for the label switching problem in the context of mixture models either [11,12].
This is, then, how in (2) uninformative priors $\pi$ are "precluded" by the limited information $I$: the latter implies model selection, which in turn implies evaluating $Z(K)$, which is hampered by the label switching problem due to the exchange symmetry of the former. Therefore, it seems better to avoid the label switching problem altogether by encoding our preference for primary modes directly into the prior. This results in abandoning the uninformative prior $\pi$ in favor of the weakly informative prior $\pi_3$, which is proposed in Section 4 as a solution to the conflict.

We use the VTR problem to briefly illustrate the label switching problem in Figure 1. The likelihood $L(\mathbf{x})$ is described implicitly in Section 5 and is invariant to switching the labels $k$ because the underlying model function (23) of the regression model is essentially a sum of sinusoids, one for each $x_k$. As frequencies can be profitably thought of as scale variables ([13], Appendix A), the uninformative prior (7) is represented by

$$\pi_1(\mathbf{x}) = \prod_{k=1}^{K} h(x_k \mid x_0, x_{\max}), \tag{9}$$

where $\boldsymbol{\beta}_1 \equiv (x_0, x_{\max})$ are a common lower and upper bound, and

$$h(x \mid a, b) = \begin{cases} \dfrac{1}{x \log(b/a)} & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

is the Jeffreys prior, the conventional uninformative prior for a scale variable (although any prior of the form (7) that is sufficiently uninformative would yield essentially the same results). We have visualized the posterior landscape $P_1(\mathbf{x})$ in Figure 1 by using the pairwise marginal posteriors $P_1(x_k, x_\ell)$ plotted in blue. Note the exchange symmetry of $P_1$, which manifests as an (imperfect) reflection symmetry around the dotted diagonal $x_k = x_\ell$ bordering the ordered region $R_3(x_0)$. The primary mode can be identified by the black dot; all other modes are induced modes. Integrating all $K!$ modes to obtain $Z(K)$ quickly becomes intractable for $K \ge 4$.
Figure 1. The exchange symmetry of the posterior $P_1(\mathbf{x})$ for a well-determined instance of the VTR problem from Section 5 with $K = 3$. The pairwise marginal posteriors $P_1(x_k, x_\ell)$ are shown using the isocontours of kernel density approximations calculated from posterior samples of $\mathbf{x}$. For each panel, the diagonal $x_k = x_\ell$ is plotted as a dotted line, and the ordered region $R_3(x_0)$ is shaded in grey. The black dot marks the mean of the primary mode for this problem. Axes are frequencies in Hz.
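The exchange symmetry can also be checked numerically on a toy problem. The sketch below is not the sinusoidal regression model of Section 5: it assumes a plain sum of sinusoids, a hypothetical 16 kHz sampling rate, hypothetical frequencies and noise level, and it profiles out the amplitudes by least squares instead of marginalizing them. Under these assumptions, all $K!$ relabelings of the frequencies give the same likelihood value.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200) / 16000.0          # 200 samples at a hypothetical 16 kHz rate
true_freqs = [600.0, 1100.0, 2500.0]  # hypothetical resonance frequencies in Hz

def design(freqs):
    """Design matrix with columns cos(2*pi*f*t) and sin(2*pi*f*t) for each f."""
    cols = [fn(2 * np.pi * f * t) for f in freqs for fn in (np.cos, np.sin)]
    return np.column_stack(cols)

# Synthetic data: random amplitudes plus Gaussian noise (toy values).
data = design(true_freqs) @ rng.normal(size=6) + 0.1 * rng.normal(size=t.size)

def log_likelihood(freqs, sigma=0.1):
    """Gaussian log-likelihood (up to a constant), amplitudes fitted by least squares."""
    G = design(freqs)
    amps, *_ = np.linalg.lstsq(G, data, rcond=None)
    resid = data - G @ amps
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

# Evaluate the likelihood at every permutation of the frequency labels.
values = [log_likelihood(p) for p in itertools.permutations(true_freqs)]
spread = max(values) - min(values)

print(len(values), math.factorial(len(true_freqs)))  # number of equivalent labelings
print(spread)                                         # zero up to rounding error
```

Each of these relabelings corresponds to one mode of the posterior when the prior is of type (7), which is why the number of modes to integrate grows as $K!$.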

A Simple Way Out?
A simple way out of the conflict is to break the exchange symmetry by assuming specialized bounds for each $x_k$:

$$\pi_2(\mathbf{x}) = \prod_{k=1}^{K} h(x_k \mid a_k, b_k), \tag{11}$$

where $h$ is the Jeffreys prior (10) and $\boldsymbol{\beta}_2 \equiv (\mathbf{a}, \mathbf{b})$, with $\mathbf{a} = \{a_k\}_{k=1}^{K}$ and $\mathbf{b} = \{b_k\}_{k=1}^{K}$ being hyperparameters specifying the individual bounds. However, in order to enable the model to detect doublets (a resolved pair of two close frequencies such as the primary mode in the leftmost panel in Figure 1), it is necessary to assign overlapping bounds in $(\mathbf{a}, \mathbf{b})$, presumably by using some heuristic. The necessary degree of overlap increases as the detection of higher order multiplets such as triplets (which can and do occur) is desired, but the more overlap in $(\mathbf{a}, \mathbf{b})$, the more the label switching problem returns. Despite this issue, there will be cases where we have sufficient prior information $I$ to set the $(\mathbf{a}, \mathbf{b})$ hyperparameters without too much trouble; the VTR problem is such a case, for which the overlapping values of $(\mathbf{a}, \mathbf{b})$ up to $K = 5$ are given in Table 1.

Solution
Our solution to the conflict (2) is a chain of $K$ coupled Pareto distributions:

$$\pi_3(\mathbf{x}) = \prod_{k=1}^{K} \mathrm{Pareto}(x_k \mid x_{k-1}, \lambda_k), \tag{12}$$

where

$$\mathrm{Pareto}(x \mid x_*, \lambda) = \begin{cases} \dfrac{\lambda\, x_*^{\lambda}}{x^{\lambda+1}} & x \ge x_* \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

and the hyperparameter $\boldsymbol{\beta}_3 \equiv \bar{\mathbf{x}}_0$ is defined as

$$\bar{\mathbf{x}}_0 = \{x_0, \bar{x}_1, \ldots, \bar{x}_K\}, \qquad \lambda_k = \frac{\bar{x}_k}{\bar{x}_k - \bar{x}_{k-1}}, \tag{14}$$

with $\bar{x}_{k-1}$ read as $x_0$ for $k = 1$. From Figure 2, it can be seen that $\pi_3$ encodes weakly informative knowledge about $K$ ordered frequencies: (12) and (13) together imply that $\pi_3(\mathbf{x})$ is nonzero only for $\mathbf{x} \in R_K(x_0)$. In other words, its support is precisely the ordered region $R_K(x_0)$, which solves the label switching problem underlying the conflict automatically, as the exchange symmetry of $\pi$ is broken. This is illustrated in Figure 2, where $P_3$ contracts to a single primary mode, which is just what we would like.
The $K + 1$ hyperparameters $\bar{\mathbf{x}}_0$ in (14) are a common lower bound $x_0$ plus the $K$ expected values $\bar{\mathbf{x}} = \{\bar{x}_k\}_{k=1}^{K}$ of the resonance frequencies $\mathbf{x}$. While the former is generally easily determined, the latter may seem difficult to set given the premise of this paper that we have only limited prior information $I$ at our disposal. Why do we claim that $\pi_3$ is only weakly informative if it is parametrized by the expected values of the very things it is supposed to be only weakly informative about? The answer is that for any reasonable amount of data, inference based on $\pi_3$ is completely insensitive to the exact values of $\bar{\mathbf{x}}$. Therefore, any reasonable guess for $\bar{\mathbf{x}}_0$ will suffice in practice. For example, for the VTR problem, we simply applied a heuristic where we take $\bar{x}_k = k \times 500$ Hz (see Table 1). This insensitivity is due to the maximum entropy status of $\pi_3$ and indicates the weak inductive bias it entails. On a more prosaic level, the heavy tails of the Pareto distributions in (12) ensure that the prior will eventually be overwhelmed by the data, no matter how a priori improbable the true value of $\mathbf{x}$ is. More prosaic still, in Section 5.1 below we show quantitatively that for the VTR problem $\pi_3$ is about as (un)informative as $\pi_2$.
Figure 2. Contraction of prior ($\pi_3$) to posterior ($P_3$) for the application of $\pi_3$ to the VTR problem used in Figure 1. The pairwise marginal prior $\pi_3(x_k, x_\ell)$ is obtained by integrating out the third frequency; for example, $\pi_3(x_1, x_2) = \int dx_3\, \pi_3(\mathbf{x})$. Unlike $P_1$ in Figure 1, $P_3$ exhibits only a single mode that coincides with the primary mode as marked by the black dot. Axes are frequencies in Hz.

Derivation of π 3
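Our ansatz consists of interpreting the $\mathbf{x}$ as a set of $K$ ordered scale variables that are bounded from below by $x_0$. Starting from (9) and not bothering with the bounds $(a, b)$, we obtain the improper pdf

$$m(\mathbf{x}) \propto \begin{cases} \prod_{k=1}^{K} \dfrac{1}{x_k} & \mathbf{x} \in R_K(x_0) \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$

We can simplify (15) using the one-to-one transformation $\mathbf{x} \leftrightarrow \mathbf{u}$ defined as

$$u_k = \log \frac{x_k}{x_{k-1}} \quad \Longleftrightarrow \quad x_k = x_0 \exp\Big(\sum_{j=1}^{k} u_j\Big), \qquad k = 1, \ldots, K, \tag{16}$$

which yields (with abuse of notation for brevity)

$$m(\mathbf{u}) \propto \begin{cases} 1 & \mathbf{u} \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$

Since model selection requires proper priors, we need to normalize $m(\mathbf{u})$ by adding extra information (i.e., constraints) to it; we propose to simply fix the $K$ first moments $\langle \mathbf{u} \rangle = \{\langle u_k \rangle\}_{k=1}^{K}$. This will yield the Pareto chain prior $\pi_3(\mathbf{u})$ directly, expressed in $\mathbf{u}$ space rather than $\mathbf{x}$ space. The expression for $\pi_3(\mathbf{u})$ is found by minimizing the Kullback-Leibler divergence [14]

$$D_{\mathrm{KL}}(\pi_3 \mid m) = \int d^K\mathbf{u}\, \pi_3(\mathbf{u}) \log \frac{\pi_3(\mathbf{u})}{m(\mathbf{u})} \quad \text{subject to} \quad \langle \mathbf{u} \rangle = \bar{\mathbf{u}}, \tag{18}$$

where $\bar{\mathbf{u}} = \{\bar{u}_k\}_{k=1}^{K}$ are the supplied first moments. This variational problem is equivalent to finding $\pi_3(\mathbf{u})$ by means of Jaynes' principle of maximum entropy with $m(\mathbf{u})$ serving as the invariant measure [15]. Since the exponential distribution $\mathrm{Exp}(x \mid \lambda)$ is the maximum entropy distribution for a random variable $x \ge 0$ with a fixed first moment $\langle x \rangle = 1/\lambda$, the solution to (18) is

$$\pi_3(\mathbf{u}) = \prod_{k=1}^{K} \mathrm{Exp}(u_k \mid \lambda_k), \tag{19}$$

where the rate hyperparameters $\lambda_k = 1/\bar{u}_k$ and

$$\mathrm{Exp}(u \mid \lambda) = \begin{cases} \lambda e^{-\lambda u} & u \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{20}$$

Transforming (19) to $\mathbf{x}$ space using (16) finally yields (12), but we still need to express $\lambda_k$ in terms of $\bar{\mathbf{x}}$, since we might find it hard to pick reasonable values of $\bar{u}_k = \langle \log (x_k / x_{k-1}) \rangle$ from limited prior information $I$. For this, we will need the identity

$$\langle x_k \rangle = x_0 \prod_{j=1}^{k} \frac{\lambda_j}{\lambda_j - 1}. \tag{21}$$

Constraining $\langle x_k \rangle = \bar{x}_k$ and solving for $\lambda_k$, we obtain $\lambda_k = \bar{x}_k / (\bar{x}_k - \bar{x}_{k-1})$, in agreement with (14). Note that the existence of the first marginal moments $\langle x_k \rangle$ requires that $\lambda_k > 1$.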
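For completeness, the step from the exponential rates to the expected frequencies can be sketched as follows. It assumes only the log-ratio transformation $u_k = \log(x_k / x_{k-1})$, the independence of the $u_j$, and the standard moment generating function of the exponential distribution; for $k = 1$ the symbol $\bar{x}_{k-1}$ is read as the lower bound $x_0$.

```latex
\begin{align*}
x_k &= x_0 \exp\Big(\sum_{j=1}^{k} u_j\Big),
  \qquad u_j \sim \mathrm{Exp}(\lambda_j) \ \text{independently}, \\
\langle e^{u_j} \rangle &= \int_0^\infty \lambda_j\, e^{-(\lambda_j - 1) u}\, du
  = \frac{\lambda_j}{\lambda_j - 1} \qquad (\lambda_j > 1), \\
\langle x_k \rangle &= x_0 \prod_{j=1}^{k} \frac{\lambda_j}{\lambda_j - 1}
  \quad \Longrightarrow \quad
  \frac{\langle x_k \rangle}{\langle x_{k-1} \rangle} = \frac{\lambda_k}{\lambda_k - 1}, \\
\frac{\bar{x}_k}{\bar{x}_{k-1}} &= \frac{\lambda_k}{\lambda_k - 1}
  \quad \Longrightarrow \quad
  \lambda_k = \frac{\bar{x}_k}{\bar{x}_k - \bar{x}_{k-1}} .
\end{align*}
```

The integral in the second line converges only for $\lambda_j > 1$, which is the origin of the condition on the existence of the first moments.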

Sampling from π 3
Sampling from $\pi_3$ is trivial because of the independence of the $u_k$ in $\mathbf{u}$ space (19). To produce a sample $\mathbf{x} \sim \pi_3(\mathbf{x})$ given the hyperparameter $\bar{\mathbf{x}}_0$, compute the corresponding rate parameters $\{\lambda_k\}_{k=1}^{K}$ from (14), and use them in (19) to obtain a sample $\mathbf{u} \sim \pi_3(\mathbf{u})$. The desired $\mathbf{x}$ is then obtained from $\mathbf{u}$ using the transformation (16).
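The sampling recipe can be sketched in a few lines of NumPy. The function names and the value $x_0 = 200$ Hz below are hypothetical and chosen only for illustration; the expected values follow the $k \times 500$ Hz heuristic mentioned in the text. The sketch assumes the rates $\lambda_k = \bar{x}_k / (\bar{x}_k - \bar{x}_{k-1})$ with $x_0$ in place of $\bar{x}_0$, and the transformation $x_k = x_0 \exp(u_1 + \cdots + u_k)$.

```python
import numpy as np

def pareto_chain_rates(x0, xbar):
    """Rates lambda_k = xbar_k / (xbar_k - xbar_{k-1}), using x0 for k = 1."""
    xbar = np.asarray(xbar, dtype=float)
    prev = np.concatenate(([x0], xbar[:-1]))
    if np.any(xbar <= prev):
        raise ValueError("expected values must be strictly increasing and exceed x0")
    return xbar / (xbar - prev)

def sample_pareto_chain(x0, xbar, size, rng=None):
    """Draw `size` samples from the Pareto chain via independent exponentials."""
    rng = np.random.default_rng(rng)
    lam = pareto_chain_rates(x0, xbar)
    # NumPy parametrizes the exponential by its scale, i.e. 1 / rate.
    u = rng.exponential(scale=1.0 / lam, size=(size, lam.size))
    # x_k = x0 * exp(u_1 + ... + u_k): the cumulative sum enforces the ordering.
    return x0 * np.exp(np.cumsum(u, axis=1))

# Hypothetical hyperparameters: x0 = 200 Hz, expected values k * 500 Hz, K = 3.
samples = sample_pareto_chain(200.0, [500.0, 1000.0, 1500.0], size=100_000, rng=0)

print(pareto_chain_rates(200.0, [500.0, 1000.0, 1500.0]))  # rates lambda_k
print(np.median(samples, axis=0))                           # robust location summary
```

Because the Pareto tails are heavy (and the variance is infinite whenever $\lambda_k \le 2$), sample means converge slowly to the expected values; medians or quantiles are more stable summaries when checking such a sampler.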

Application: The VTR Problem
We now present a relatively simple but realistic instance of the problem of measuring resonance frequencies, which will allow us to illustrate the above ideas. The VTR problem consists of measuring human vocal tract resonance (VTR) frequencies $\mathbf{x}$ for each of five representative vowel sounds taken from the CMU ARCTIC database [16]. The VTR frequencies $\mathbf{x}$ describe the vocal tract transfer function $T(\mathbf{x})$ and are fundamental quantities in acoustic phonetics [17]. The five vowel sounds are recorded utterances of the first vowel in the words $W = \{\text{shore}, \text{that}, \text{you}, \text{little}, \text{until}\}$. In order to achieve high-quality VTR frequency estimates $\hat{\mathbf{x}}$, only the quasi-periodic steady-state part of the vowel sound is considered for the measurement. The data $D$ thus consists of a string of highly correlated pitch periods. See Figure 3 for an illustration of these concepts.

The measurement itself is formalized as inference using the probabilistic model (1). The model assumed to underlie the data is the sinusoidal regression model introduced in [18]; due to limited space, we only describe it implicitly. The sinusoidal regression model assumes that each pitch period $\mathbf{d} \in D$ can be modeled as in (22) and (23), where $\mathbf{d} = \{d_t\}_{t=1}^{T}$ is a time series consisting of $T$ samples. The model function (23) consists of a sinusoidal part (first sum) and a polynomial trend correction (second sum). Note the additional model parameters $\boldsymbol{\theta} = \{A, \alpha, \sigma, L\}$. Formally, given the prior $p(\boldsymbol{\theta})$ ([18], Section 2.2), the marginal likelihood $L(\mathbf{x})$ is then obtained as $L(\mathbf{x}) = \int d\boldsymbol{\theta}\, L(\mathbf{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})$, where the complete likelihood $L(\mathbf{x}, \boldsymbol{\theta})$ is implicitly given by (22) and (23). Practically, we just marginalize out $\boldsymbol{\theta}$ from samples obtained from the complete problem $p(D, \mathbf{x}, \boldsymbol{\theta} \mid I)$. For inference, the computational method of choice is nested sampling [1] using the dynesty library [19][20][21][22][23], which scales roughly as $O(K^2)$ [24].
Since the VTR problem is quite simple ($H_i(K) \sim 30$ nats), we only perform single nested sampling runs and take the obtained $\log Z_i(K)$ and $H_i(K)$ as point estimates. Full details on the experiments and data are available at https://github.com/mvsoom/frequency-prior.

Experiment I: Comparing π 2 and π 3
In Experiment I, we perform a high-level comparison between $\pi_2$ and $\pi_3$ in terms of evidence (4) and information (6). The values of the hyperparameters used in the experiment are listed in Table 1. We did not include $\pi_1$ in this comparison, as the label switching problem prevented convergence of nested sampling runs for $K \ge 4$. The $(\mathbf{a}, \mathbf{b})$ bounds for $\pi_2$ were based on loosely interpreting the VTRs as formants and consulting formant tables from standard works [25][26][27][28][29][30]. These allowed us to compile bounds up to the fifth formant, so that $K_{\max} = 5$. For $\pi_3$, we simply applied a heuristic where we take $\bar{x}_k = k \times 500$ Hz. We selected $x_0$ empirically (although a theoretical approach is also possible [31]), and $x_{\max}$ was set to the Nyquist frequency. The role of $x_{\max}$ is to truncate $\pi_3$ in order to avoid aliasing effects, since the support of $\pi_3(x_k)$ is unbounded from above. We implemented this by using the following likelihood function in the nested sampling program:

$$L'(\mathbf{x}) = \begin{cases} L(\mathbf{x}) & x_K \le x_{\max} \\ 0 & \text{otherwise.} \end{cases} \tag{24}$$

First, we compare the influence of $\pi_2$ and $\pi_3$ on model selection. Given $D \in W$, the posterior probability of the number of resonances $K$ is given by the following:

$$p_i(K) = \frac{Z_i(K)}{\sum_{K'=1}^{K_{\max}} Z_i(K')}, \qquad i = 2, 3. \tag{25}$$

The results in the top row of Figure 4a are striking: while $p_2(K)$ shows individual preferences based on $D$, $p_3(K)$ prefers $K = K_{\max}$ unequivocally. Second, in Figure 4b, we compare $\pi_2$ and $\pi_3$ directly in terms of differences in evidence [$\log Z_i(K)$] and uninformativeness [$H_i(K)$] for each combination $(D, K)$.
Arrows pointing eastward indicate $Z_3(K) > Z_2(K)$. The $\pi_3$ prior dominates the $\pi_2$ prior in terms of evidence for almost all values of $K$, indicating that $\pi_3$ places its mass in regions of higher likelihood or, equivalently, that the data were much more probable under $\pi_3$ than under $\pi_2$. This implies that the hint of $\pi_3$ at more structure for $K > K_{\max}$ should be taken seriously; we investigate this in Section 5.2.
Arrows pointing northward indicate $H_3(K) > H_2(K)$, i.e., $\pi_3$ is less informative than $\pi_2$, since more information is gained by updating from $\pi_3$ to $P_3$ than from $\pi_2$ to $P_2$. We observe that $\pi_2$ and $\pi_3$ are roughly comparable in terms of (un)informativeness.
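In practice, the posterior probability of $K$ is computed from the log-evidences returned by nested sampling rather than from the evidences themselves, which would underflow. The following is a minimal sketch assuming a uniform prior over $K$, so that $p(K) \propto Z(K)$; the function name and the log-evidence values are hypothetical and are not results from the experiments.

```python
import numpy as np

def posterior_over_K(log_Z):
    """Normalize evidences given as logs, assuming a uniform prior over K."""
    log_Z = np.asarray(log_Z, dtype=float)
    # Subtracting the maximum before exponentiating avoids underflow.
    w = np.exp(log_Z - log_Z.max())
    return w / w.sum()

# Hypothetical log-evidences for K = 1..5 (illustrative values only).
log_Z = [-1250.0, -1180.0, -1172.0, -1170.5, -1171.0]
p = posterior_over_K(log_Z)

for K, prob in zip(range(1, len(log_Z) + 1), p):
    print(f"K = {K}: p(K) = {prob:.3f}")
```

Differences of a few nats in $\log Z$ already translate into strong preferences between values of $K$, which is why reliable evidence estimates matter for model selection.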

Experiment II: 'Free' Analysis
We now freely look for more structure in the data by letting $K$ vary up to $K_{\max} = 10$. This goes beyond the capacities of $\pi_1$ (because of the label switching problem) and $\pi_2$ (because no data are available to set the $(\mathbf{a}, \mathbf{b})$ bounds). Thus, the great advantage of $\pi_3$ is that we can use a simple heuristic to set $\bar{\mathbf{x}}_0$ and let the model perform the discovering without worrying about convergence issues or the obtained evidence values. The bottom row in Figure 4a shows that model selection for the VTR problem is well-defined, with the most probable values of $K \le 10$, except for $D = \text{until}$. That case is investigated in Figure 3, where the need for more VTRs (higher $K$) is apparent from the unmodeled broad peak centered at around 3000 Hz in the FFT power spectrum (right panel). Incidentally, this spectrum also shows that spectral peaks are often resolved into more than one VTR, which underlines the importance of using a prior that enables trouble-free handling of multiplets of arbitrary order. A final observation from the spectrum is that the inferred $\hat{x}_k$ differ substantially from the supplied values in $\bar{\mathbf{x}}$ (Table 1), which hints at the weak inductive bias underlying $\pi_3$.

Discussion
It is only when the information in the prior is comparable to the information in the data that the prior probability can make any real difference in parameter estimation problems or in model selection problems ( [32], p. 9).
Although the weakly informative prior for resonance frequencies π 3 is meant to be overwhelmed, its practical advantage (i.e., solving the label switching problem) will