Prior Elicitation, Assessment and Inference with a Dirichlet Prior

Methods are developed for eliciting a Dirichlet prior based upon stating bounds on the individual probabilities that hold with high prior probability. This approach to selecting a prior is applied to a contingency table problem where it is demonstrated how to assess the prior with respect to the bias it induces as well as how to check for prior-data conflict. It is shown that the assessment of a hypothesis via relative belief can easily take into account what it means for the falsity of the hypothesis to correspond to a difference of practical importance and provide evidence in favor of a hypothesis.

Bayesian inference requires a prior and the Dirichlet(α_1, . . ., α_k), for some choice of hyperparameters α_1, . . ., α_k > 0, is a convenient choice due to its conjugacy. The prior density is of the form π(p_1, . . ., p_k) = d(α_1, . . ., α_k) p_1^{α_1 − 1} · · · p_k^{α_k − 1} for (p_1, . . ., p_k) in the simplex, where d(α_1, . . ., α_k) = Γ(α_1 + · · · + α_k)/(Γ(α_1) · · · Γ(α_k)) is the normalizing constant. To employ such a prior it is necessary to have an elicitation algorithm to determine the hyperparameters. The purpose of this paper is to develop a particular algorithm based on bounds on the probabilities; to show how the chosen prior can be assessed with respect to the bias that it induces; to demonstrate how to check whether or not the prior conflicts with the data; to show how to modify the prior when such a conflict is encountered; and to implement inferences using the prior based on a measure of statistical evidence.
A key component of a Bayesian statistical analysis is the choice of the prior. This paper is concerned with the choice of a proper prior for a statistical analysis. It is generally acknowledged that the correct way to do this is through a process of elicitation where knowledgeable experts translate what is known about an application into the choice of a probability distribution reflecting beliefs about the unknown values of certain quantities. This is in contrast to the use of rules for the choice of default priors which are supposedly objective, such as the use of the principle of insufficient reason or the use of a Jeffreys prior. In fact, such default priors are also subjectively chosen as there appears to be no universal rule for this purpose and the specific rule itself needs to be chosen. In addition, these rules sometimes produce priors with characteristics that imply very specific beliefs, such as the Jeffreys prior for the multinomial which is a Dirichlet with all hyperparameters equal to 1/2. In essence, elicitation is honest about the subjectivity inherent in the choice of the prior and provides an argument for why the choice was made. In the context of the Dirichlet this knowledge will take the form of how likely a success is expected on each of the k categories being counted. Discussions about the process of elicitation for general problems can be found in [1,2].
In Section 2, current methods for eliciting a Dirichlet prior are reviewed and a new method is developed that possesses some advantages for situations where a weakly informative prior is required. Perhaps the main difference between the elicitation algorithm developed here and those already available in the literature is that the user is required to state a lower or upper bound on each probability that they are virtually certain holds. Thus, a user knows that a cell probability must be smaller than some upper bound or knows that a probability must be larger than some lower bound. Rather than stating that such bounds hold categorically, the bound is believed to hold with a large prior probability, hence the terminology "virtual certainty". This follows good practice as the support of the prior is still the whole simplex and so does not rule out any values as being impossible. Note that the lower bound of 0 and the upper bound of 1 on a probability always hold with absolute certainty, so there is no concern that such bounds cannot be provided, but in many cases much tighter bounds will be applicable. One of the primary contributions of this paper is to show how these bounds can be chosen consistently, in the sense that they determine a Dirichlet prior, and to develop an algorithm for obtaining this prior. In addition, it is shown in an example that this approach lends itself very naturally to determining a prior for the testing of independence. It is to be noted, however, that no elicitation methodology can be viewed as the correct approach and the existence of many approaches can only help to encourage the broad and effective use of priors. Thus, for a particular problem another elicitation algorithm, such as one among those reviewed in Section 2, may be felt to be more suitable.
A prior chosen via elicitation is proper. This allows for criticism of the prior in light of the observed data, namely, an assessment for prior-data conflict. If a prior is found to be in conflict with the data then, unless there is so much data that the effect of the prior is negligible, it is necessary to modify the prior to avoid this. These issues are discussed in Section 3.2.
In addition, one has to be concerned about whether or not the choice of the prior results in bias. In fact, the issue of bias could be considered one of the main reasons for doubts being expressed about the appropriateness of Bayesian methodology. To precisely define bias it seems necessary to formulate a measure of evidence and here we use the relative belief ratio, which is the ratio of the posterior to the prior, as this measures change in belief from a priori to a posteriori. The assessment of bias in the prior, using this measure of evidence, is addressed in Section 3.1.
All inferences are derived from the relative belief ratio. Such inferences are invariant under 1-1 increasing functions of the relative belief ratio (as well as being invariant under smooth reparameterizations) and so the measure of evidence can equivalently be defined as the log of the relative belief ratio. It is then immediate that the expected evidence under the posterior is the relative entropy between the posterior and prior. In essence, the relative entropy is a kind of average evidence and the log of the relative belief ratio at a specific parameter value is playing the role of the bit in the definition of entropy. It is to be noted, however, that for inference the concern is with measuring evidence, either in favor of or against a specific value, and not with the measurement of the more abstract concept of information. As such, there is an intimate connection between the concepts of entropy, evidence and relative belief inferences. Our purpose here, however, is not to consider this connection but to discuss a methodology for choosing a prior for one of the most basic statistical problems, demonstrate how the chosen prior is to be assessed for conflict with the data and for bias, and then used for the derivation of inferences. Relative belief inferences for the multinomial are discussed in Section 4.
This presents a full treatment of a statistical analysis for the multinomial, although it is assumed that the multinomial model is correct. Strictly speaking, provided the data are available, it should also be checked that the initial sample is i.i.d. from a multinomial(1, p_1, . . ., p_k) distribution, perhaps using a multivariate version of a runs test, but this is not addressed here.
Throughout the paper, Π denotes the prior probability measure on the full model parameter θ, which in the case of the multinomial is the vector of cell probabilities, and π denotes its density.
Dependence of Π on hyperparameters is indicated by subscripts, such as Π_{(α_1,...,α_k)} denoting the Dirichlet(α_1, . . ., α_k) distribution. When a particular prior Π is referenced and interest is in the marginal prior of some function ψ = Ψ(θ), then Π_Ψ is used for the marginal prior measure of ψ with corresponding density π_Ψ. In addition, M denotes the prior (predictive) probability measure of the data induced by Π and the sampling model and m denotes the corresponding density.
The following example, taken from [3], is considered as a practical application of the methodology.
Individuals were classified according to their blood type Y (O, A, B, and AB, although the AB individuals were eliminated, as they were small in number) and also classified according to X, their disease status (peptic ulcer = P, gastric cancer = G, or control = C). Thus, there are three populations; namely, those suffering from a peptic ulcer, those suffering from gastric cancer, and those suffering from neither, and it is assumed that the individuals involved in the study can be considered as random samples from the respective populations. The data are in Table 1 and the goal is to determine whether or not X and Y are independent. Thus, the counts are assumed to be multinomial(8766, p_11, p_12, p_13, p_21, p_22, p_23, p_31, p_32, p_33), where the first index refers to X and the second to Y and with a relabelling of the categories, e.g., X = G is relabeled as X = 2. Using the chi-squared test, the null hypothesis of no relationship is rejected with a value of the chi-squared statistic of 40.54 and a p-value of 0.0000. Table 2 gives the estimated cell probabilities based on the full multinomial as well as the estimated cell probabilities based on independence between X and Y. The difference between the two tables is very small and of questionable practical significance. For example, the largest difference between corresponding cells is 0.012 and, as a natural measure of difference between two distributions, the Kullback-Leibler divergence, based on the raw data, is estimated as 0.002. This suggests that in reality the deviation from independence is not meaningful. The cure for this is that, in assessing any hypothesis, it is necessary to say what size of deviation δ from the null is of practical significance and take this into account when performing the test. This arises as a natural aspect of the relative belief approach to this problem, which is discussed in Sections 3 and 4, where it is shown that a very different conclusion is reached in this example.

Elicitation
The problem of eliciting a Dirichlet prior is simplest when k = 2 and this corresponds to a beta distribution. Since this simple case contains the essence of the approach to elicitation for the Dirichlet presented here, this is considered first.

Eliciting a Beta Prior
Consider first the situation where k = 2 and the prior Π_{α_1,α_2} on p_1 is beta(α_1, α_2). Suppose it is known with "virtual certainty" that l_1 ≤ p_1 ≤ u_1, where l_1, u_1 ∈ [0, 1] are known. This immediately implies that 1 − u_1 ≤ p_2 = 1 − p_1 ≤ 1 − l_1 with virtual certainty. Here "virtual certainty" is interpreted to mean that the true value of p_1 is in the interval [l_1, u_1] with high prior probability γ, say γ = 0.99. Thus, this restricts the prior to those values of (α_1, α_2) satisfying Π_{α_1,α_2}([l_1, u_1]) = γ. Note that in general there may be several values of (α_1, α_2) that satisfy this equality. For example, if (l_1, u_1) = (0, 1), then Π_{α_1,α_2}([0, 1]) = 1 ≥ γ for all (α_1, α_2). To completely determine (α_1, α_2) another condition is added, namely, it is required that the mode of the prior be at the point ξ ∈ [l_1, u_1] as this allows the placement of the primary amount of the prior mass at an appropriate place within [l_1, u_1]. For example, a natural choice of the mode in this context is ξ = (l_1 + u_1)/2, namely, the midpoint of the interval. When α_1, α_2 ≥ 1 the mode of the beta(α_1, α_2) occurs at ξ = (α_1 − 1)/τ where τ = α_1 + α_2 − 2. There is thus a 1-1 correspondence between the values (α_1, α_2) and (ξ, τ) given by α_1 = 1 + τξ, α_2 = 1 + τ(1 − ξ). Hereafter, we restrict to the case α_i ≥ 1 to avoid singularities on the boundary as these seem difficult to justify a priori. Therefore, after specifying the mode, only the scaling of the beta prior is required through the choice of τ. This leads to the following result.

Theorem 1. The beta(1 + τξ, 1 + τ(1 − ξ)) distribution has its mode at ξ and, whenever u_1 − l_1 ≤ γ, there is a value τ ∈ [0, ∞) such that there is exactly γ of the probability in [l_1, u_1].
While the theorem establishes the existence of a value τ satisfying the requisite equation, it does not establish that this value is unique. Although uniqueness is not necessary for the methodology, based on examples and intuition it seems very likely that Π_{1+τξ,1+τ(1−ξ)}([l_1, u_1]) is a monotone increasing function of τ, which would imply that the τ in Theorem 1 is in fact unique. In any case, τ can be computed by choosing τ_0 = 0, finding a value τ_* such that Π_{1+τ_*ξ,1+τ_*(1−ξ)}([l_1, u_1]) > γ and then obtaining τ ∈ [τ_0, τ_*] satisfying the equality via the bisection root finding algorithm. This procedure is guaranteed to converge by the intermediate value theorem.
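The bisection computation just described can be sketched as follows. This is only an illustrative implementation under our own naming (`interval_content` and `elicit_tau` are not from the paper's software), with the beta interval probability computed by simple numerical integration rather than a library CDF:

```python
import math

def beta_logpdf(x, a, b):
    # log density of beta(a, b) at x, via log-gamma for numerical stability
    return ((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def interval_content(tau, xi, l1, u1, n=2000):
    # Prior probability of [l1, u1] under beta(1 + tau*xi, 1 + tau*(1 - xi)),
    # by the trapezoid rule; 2-3 decimals of accuracy suffice here.
    a, b = 1.0 + tau * xi, 1.0 + tau * (1.0 - xi)
    h = (u1 - l1) / n
    xs = [l1 + i * h for i in range(n + 1)]
    ys = [math.exp(beta_logpdf(x, a, b)) if 0.0 < x < 1.0 else 0.0 for x in xs]
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

def elicit_tau(xi, l1, u1, gamma, tol=1e-6):
    # Find tau whose prior puts content gamma in [l1, u1]: grow an upper
    # bracket, then bisect, relying on the intermediate value theorem.
    lo, hi = 0.0, 1.0
    while interval_content(hi, xi, l1, u1) < gamma:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if interval_content(mid, xi, l1, u1) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Midpoint mode and a symmetric interval: alpha_1 = alpha_2 here.
tau = elicit_tau(xi=0.5, l1=0.4, u1=0.6, gamma=0.99)
alpha1, alpha2 = 1.0 + tau * 0.5, 1.0 + tau * 0.5
```

With γ = 0.99 and the interval [0.4, 0.6] the search returns a fairly concentrated beta prior; replacing the trapezoid rule with a library beta CDF would be the natural refinement.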
Example 2. Determining a beta prior.
The concept of "virtual certainty" is interpreted as something being true "with high probability" and choosing γ close to 1 reflects this. For example, in rolling an apparently symmetrical die the analyst may be quite certain that the probability p_i of observing i pips is at least 1/8 and wants the prior to reflect this. In effect, the goal is to ensure that the prior concentrates its mass in the region satisfying these inequalities and choosing γ large accomplishes this. Actually, it is not necessary that exact equality is obtained to ensure virtual certainty. As long as γ is close to 1, small changes in γ will not lead to big changes in the prior, as in Example 2 where it is seen that choosing γ = 0.993 rather than γ = 0.990 makes very little difference in the prior. Specifying probabilities beyond 2 or 3 decimal places seems impractical in most applications, so taking γ in the range [0.990, 0.999] seems quite satisfactory for characterizing virtual certainty while allowing some flexibility for the analyst. Far more important than the choice of γ is the selection of what it is felt is known, for example, the bounds l_1 and u_1 on the probabilities for the beta prior, as mistakes can be made. Protection against a misleading analysis caused by a poor choice of a prior is approached through checking for prior-data conflict and modifying the prior appropriately when this is the case, as discussed in Section 3. It is also to be noted that the methodology does not require γ to be large, as the analyst may only be willing to say that the bounds on the probabilities for the die hold with prior probability γ = 0.50. However, choosing the bounds so that these are fairly weak constraints on the probabilities, and so almost certainly hold as is reflected by choosing γ close to 1, seems like an easy way to be weakly informative.

Eliciting a Dirichlet Prior
The approach to eliciting a beta prior allows for a great deal of flexibility as to where the prior allocates the bulk of its mass in [0, 1]. The question, however, is how to generalize this to the Dirichlet(α_1, . . ., α_k) prior. As will be seen, it is necessary to be careful about how (α_1, . . ., α_k) is elicited. Again, we make the restriction that each α_i ≥ 1 to avoid singularities for the prior on the boundary.
It seems quite natural to think about putting probabilistic bounds on the p_i, such as requiring l_i ≤ p_i ≤ u_i with high probability, for fixed constants l_i, u_i, to reflect what is known with virtual certainty about p_i. For example, it may be known that p_i is very small and so we put l_i = 0, choose u_i small and require that p_i ≤ u_i with prior probability at least γ. While placing bounds like this on the p_i seems reasonable, such an approach can result in a complicated shape for the region that is to contain the true value of (p_1, . . ., p_k) with virtual certainty. This complexity can make the computations associated with inference very difficult. In fact, it can be hard to determine exactly what the full region is. As such, it seems better to use an elicitation method that fits well with the geometry of the Dirichlet family. If it is felt that more is known a priori than a Dirichlet prior can express, then it is appropriate to contemplate using some other family of priors; see, for example, Elfadaly and Garthwaite [4,5]. Given the conjugacy property of Dirichlet priors and their common usage, the focus here is on devising elicitation algorithms that work well with this family. First, however, we consider elicitation approaches for this problem that have been presented in the literature.
Chaloner and Duncan [6] discuss an iterative elicitation algorithm based on specifying characteristics of the prior predictive distribution of the data, which is Dirichlet-multinomial. Regazzini and Sazonov [7] discuss an elicitation algorithm which entails partitioning the simplex, prescribing prior probabilities for each element of the partition and then selecting a mixture of Dirichlet distributions such that this prior has Prohorov distance less than some ε > 0 from the true prior associated with de Finetti's representation theorem. Both of these approaches are complicated to implement. Closest to the method presented here is that discussed in [8] where (α_1, . . ., α_k) is specified by choosing i ∈ {1, . . ., k}, stating two prior quantiles (p_{γ_{i1}}, p_{γ_{i2}}), where 0 < γ_{i1} < γ_{i2} < 1, for p_i and specifying a prior quantile p_{γ_j} for p_j for each j ≠ i, k. Thus, there are k constraints that the Dirichlet(α_1, . . ., α_k) has to satisfy and an algorithm is provided for computing (α_1, . . ., α_k). Drawbacks include the fact that the p_i are not treated symmetrically, as there is a need to place two constraints on one of the probabilities and p_k is treated quite differently than the other probabilities. In addition, precise quantiles need to be specified and values α_i < 1 can be obtained, which induce singularities in the prior. Furthermore, it is not at all clear what these constraints say about the joint prior on (p_1, . . ., p_k) as this elicitation does not take into account the dependencies that necessarily occur among the p_i. Zapata-Vázquez et al. [9] develop an elicitation algorithm based on eliciting beta distributions for the individual probabilities and then constructing a Dirichlet prior that represents a compromise among these marginals. Elfadaly and Garthwaite [4] determine a Dirichlet by eliciting the first quartile, median and third quartile for the conditional distribution of p_i | p_1, . . ., p_{i−1} and finding the beta distribution, rescaled by the factor 1 − ∑_{j=1}^{i−1} p_j, that best fits these quantiles. This requires the prescription of precise quantiles, an order in which to elicit the conditionals and an iterative approach to reconcile the elicited conditional quantiles when these quantiles are not consistent with a Dirichlet. A notable aspect of their approach is that it also works for the Connor-Mosimann distribution, a generalization of the Dirichlet, and in that case no reconciliation is required. Similarly, Elfadaly and Garthwaite [5] base the elicitation on the three quartiles of the marginal beta distributions of the p_i which, while independent of order, still requires reconciliation to ensure that the elicited marginals correspond to a Dirichlet. In addition, the elicitation procedure based on the conditionals is extended to develop an elicitation procedure for a more flexible prior based on a Gaussian copula. The approach in this paper is based on the idea of placing bounds on the probabilities that hold with virtual certainty and that are mutually consistent for any prior on S_k. The user need only check that the stated bounds satisfy the conditions stated in the theorems to ensure consistency, and these can be very simple to check and modify appropriately. Rather than being required to state precise quantiles or moments for the prior, all that is required are weak bounds on the probabilities. For example, we might be willing to say that we are virtually certain that p_i is greater than a value l_i. We consider l_i a weak bound because there may be some belief that the true value is much greater than l_i, but being precise about how to express such beliefs is more difficult and requires more refined judgements. Certainly elicitation methodology that requires more assessment than what is being required here is even more open to concerns about robustness and other issues with the prior. As discussed in Sections 3-5, such concerns are better addressed through considerations about prior-data conflict, bias and using inference methods that are as robust to the prior as possible.
There are several versions depending on whether lower or upper bounds are placed on the p_i. We start with the situation where a lower bound is given for each p_i as this provides the basic idea for the others. Generally, the elicitation process allows for a single lower or upper bound to be specified for each p_i. These bounds specify a subsimplex of the simplex S_k with all edges of the same length. As will be seen, this implicitly takes into account the dependencies among the p_i. With such a region determined, it is straightforward to find (α_1, . . ., α_k) such that the subsimplex contains γ of the prior probability for (p_1, . . ., p_k). It is worth noting that the bounds determined in Theorems 2-4 can be applied to any family of priors on S_k and it is only in Section 2.2.4 where specific reference is made to the Dirichlet.
Note that a (k − 1)-simplex can be specified by k distinct points in R^k, say a_1, . . ., a_k, and then taking all convex combinations of these points. This simplex will be denoted as S(a_1, . . ., a_k). In particular, S_k = S(e_1, . . ., e_k), where e_i is the i-th standard basis vector of R^k, and it is clear that S(a_1, . . ., a_k) ⊂ S_k whenever a_1, . . ., a_k ∈ S_k. The centroid of S(a_1, . . ., a_k) is equal to CS(a_1, . . ., a_k) = ∑_{i=1}^k a_i / k.

Lower Bounds on the Probabilities
For this we ask for a set of lower bounds l_1, . . ., l_k ∈ [0, 1] such that l_i ≤ p_i for i = 1, . . ., k. To make sense, there is only one additional constraint that the l_i must satisfy, namely,

L_{1:k} = l_1 + · · · + l_k ≤ 1. (1)

Thus, the p_i are completely determined when L_{1:k} = 1. Attention is thus restricted to the case where L_{1:k} < 1. The following result then holds.

Theorem 2. Specifying the lower bounds l_1, . . ., l_k ∈ [0, 1] such that l_i ≤ p_i for i = 1, . . ., k and L_{1:k} < 1 prescribes S(a_1, . . ., a_k) ⊂ S_k where a_i = (l_1, . . ., l_{i−1}, u_i, l_{i+1}, . . ., l_k) and

u_i = 1 − ∑_{j≠i} l_j = 1 − L_{1:k} + l_i. (2)

The edges of S(a_1, . . ., a_k) all have length √2 (1 − L_{1:k}).

Proof. Note that (1) implies that p_i = 1 − ∑_{j≠i} p_j ≤ 1 − ∑_{j≠i} l_j = u_i, and so stating the lower bounds implies a set of upper bounds, and also l_i < u_i ≤ 1. Consider now the set S = {(p_1, . . ., p_k) ∈ S_k : l_i ≤ p_i for i = 1, . . ., k}. If (p_1, . . ., p_k) ∈ S, then putting c_i = (p_i − l_i)/(1 − L_{1:k}) gives c_i ≥ 0, ∑_{i=1}^k c_i = 1 and ∑_{i=1}^k c_i a_i = (p_1, . . ., p_k), which proves that S ⊂ S(a_1, . . ., a_k). Conversely, any convex combination ∑_{i=1}^k c_i a_i ∈ S since, for example, the first coordinate satisfies c_1 u_1 + (1 − c_1) l_1 ≥ l_1. This proves that S = S(a_1, . . ., a_k). Finally, for i ≠ j, ||a_i − a_j||² = 2(1 − L_{1:k})² and so S(a_1, . . ., a_k) has edges all of the same length. This completes the proof.
It is relatively straightforward to ensure that the elicited bounds are consistent with a prior on S_k. For, if it is determined that L_{1:k} ≥ 1, then it is simply a matter of lowering some of the bounds to ensure (1) is satisfied. For example, multiplying all the bounds by a common factor can do this, and a lowered l_i means greater conservatism as it is a weaker bound. Furthermore, it is perfectly acceptable to set some l_i = 0 as this does not affect the result.
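As a concrete sketch of this consistency adjustment and of the subsimplex of Theorem 2, the following hypothetical helper rescales the lower bounds until their sum is below 1 and then forms the vertices a_i (the function name and the shrink factor are our own illustrative choices):

```python
def lower_bound_simplex(l, shrink=0.9):
    # Rescale conservatively until l_1 + ... + l_k < 1, then build the
    # vertices a_i = (l_1, ..., l_{i-1}, u_i, l_{i+1}, ..., l_k) with
    # u_i = 1 - sum_{j != i} l_j, as in Theorem 2.
    l = list(l)
    while sum(l) >= 1.0:
        l = [shrink * x for x in l]
    L = sum(l)
    vertices = []
    for i in range(len(l)):
        a = list(l)
        a[i] = 1.0 - (L - l[i])  # u_i = 1 - sum_{j != i} l_j
        vertices.append(a)
    return l, vertices

l, verts = lower_bound_simplex([0.2, 0.2, 0.3, 0.2])
# Each vertex lies in the simplex (its coordinates sum to 1) and all
# edges have the same squared length 2 * (1 - L)^2.
```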

Upper Bounds on the Probabilities
Of course, it may be that prior beliefs are instead expressed via upper bounds on the probabilities or a mixture of upper and lower bounds. The case of all upper bounds is considered first. Our goal is to specify the upper bounds in such a way that these lead unambiguously to lower bounds l_1, . . ., l_k ∈ [0, 1] satisfying (1) and so to the simplex S(a_1, . . ., a_k).
Suppose then that we have the upper bounds u_1, . . ., u_k ∈ [0, 1] such that p_i ≤ u_i. It is clear then that l_1, . . ., l_k must satisfy the system of linear equations given by (2) as well as 0 ≤ l_i ≤ u_i for i = 1, . . ., k and (1). Thus, the l_i must satisfy

u = 1_k − (1_k 1_k′ − I_k) l, (3)

where 1_k is the k-dimensional vector of 1's and I_k is the k × k identity. Noting that (1_k 1_k′ − I_k)^{−1} = (k − 1)^{−1} 1_k 1_k′ − I_k, this gives

l = ((k − 1)^{−1} 1_k 1_k′ − I_k)(1_k − u). (4)

Note that this requires that k ≥ 2, as is always the case. Putting U_{1:k} = u_1 + · · · + u_k and requiring

U_{1:k} > 1, (5)

then (4) implies that, for i = 1, . . ., k,

l_i = u_i − (U_{1:k} − 1)/(k − 1), (6)

and this implies that l_i ≥ 0 iff

u_i ≥ (U_{1:k} − 1)/(k − 1). (7)

In addition, when (5) is satisfied, then l_i < u_i for i = 1, . . ., k. This completes the proof of the following result.

Theorem 3. Specifying upper bounds u_1, . . ., u_k ∈ [0, 1], such that p_i ≤ u_i for i = 1, . . ., k, satisfying inequalities (5) and (7), determines the lower bounds l_1, . . ., l_k, given by (6), which determine the simplex S(a_1, . . ., a_k) defined in Theorem 2.
For this elicitation to be consistent with a prior on S_k it is necessary to make sure that the upper bounds satisfy (5) and (7). If we take u_i = u for all i with u > 1/k, then U_{1:k} = ku > 1 implies that (5) is satisfied and (k − 1)u ≥ ku − 1 implies that (7) is satisfied as well. If U_{1:k} ≤ 1, then the u_i need to be increased, which is conservative, and note that U_{1:k} ≤ k is true provided all the u_i ≤ 1, which is always the case. If (5) is satisfied but (7) is not for some i, then u_i must be increased, which is again conservative, and (5) is still satisfied. Thus, again, making sure the elicited bounds are consistent is straightforward. In addition, the bound u_i = 1 is an acceptable choice.
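A small sketch of this construction (our own function name; the conditions checked are that the sum of the upper bounds exceeds 1 and that each u_i is at least (U_{1:k} − 1)/(k − 1), as discussed above):

```python
def lower_from_upper(u):
    # Recover the implied lower bounds l_i = u_i - (U - 1)/(k - 1) from
    # elicited upper bounds; raise an error when a conservative
    # adjustment of some u_i is needed instead.
    k, U = len(u), sum(u)
    if U <= 1.0:
        raise ValueError("increase some u_i so that their sum exceeds 1")
    c = (U - 1.0) / (k - 1.0)
    if any(ui < c for ui in u):
        raise ValueError("increase the offending u_i")
    return [ui - c for ui in u]

l = lower_from_upper([0.4, 0.4, 0.5])  # c = 0.15, so l is about [0.25, 0.25, 0.35]
```

One can check that u_i = 1 − (l_1 + · · · + l_k) + l_i is recovered from the output, as required by Theorem 2.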

Upper and Lower Bounds on the Probabilities
Now, perhaps after relabelling the probabilities, suppose that lower bounds 0 ≤ l_i ≤ p_i for i = 1, . . ., m as well as upper bounds p_i ≤ u_i ≤ 1 for i = m + 1, . . ., k, where 1 ≤ m < k, have been provided. Again, it is required that L_{1:m} = l_1 + · · · + l_m < 1 and we search for conditions on the u_i that complete the prescription of a full set of lower bounds l_1, . . ., l_k so that Theorem 2 applies. Again the l and u vectors must satisfy (3). Let x_{r:s} denote the subvector of x given by its consecutive r-th through s-th coordinates and X_{r:s} the sum of these coordinates, provided r ≤ s, and be null otherwise. The following equations hold, from (2),

u_{1:m} = (1 − L_{1:m} − L_{m+1:k}) 1_m + l_{1:m},
u_{m+1:k} = (1 − L_{1:m} − L_{m+1:k}) 1_{k−m} + l_{m+1:k}.

Rearranging these equations so the knowns are on the left and the unknowns are on the right gives

(1 − L_{1:m}) 1_m + l_{1:m} = u_{1:m} + L_{m+1:k} 1_m, (8)
u_{m+1:k} − (1 − L_{1:m}) 1_{k−m} = l_{m+1:k} − L_{m+1:k} 1_{k−m}. (9)

It follows from (9), by summing its coordinates, that

L_{m+1:k} = ((k − m)(1 − L_{1:m}) − U_{m+1:k})/(k − m − 1) (10)

and hence

l_{m+1:k} = u_{m+1:k} − (1 − L_{1:m} − L_{m+1:k}) 1_{k−m}, (11)

and substituting this into (8) gives the solution for u_{1:m} as well. Thus, it is only necessary to determine what additional conditions have to be imposed on the l_1, . . ., l_m, u_{m+1}, . . ., u_k so that Theorem 2 applies. Note that it follows from (8) that u_{1:m} takes the correct form, as given by (2), so it is really only necessary to check that l is appropriate.
First it is noted that it is necessary that k − m > 1. The case k − m = 1 only occurs when m = k − 1 and then (2) forces u_k = 1 − l_1 − · · · − l_{k−1}, which is the required value for u_k for Theorem 2 to apply. Thus, when k − m = 1, there is no choice but to put u_k = 1 − l_1 − · · · − l_{k−1} and choose a lower bound for p_k, which of course could be 0, which means that Theorem 2 applies. It is assumed hereafter that k − m > 1.
The above argument establishes the following result.

Theorem 4. Suppose lower bounds l_1, . . ., l_m ∈ [0, 1] with L_{1:m} < 1 and upper bounds u_{m+1}, . . ., u_k ∈ [0, 1], where k − m > 1, are specified such that 1 − L_{1:m} < U_{m+1:k} ≤ (k − m)(1 − L_{1:m}) and, for i = m + 1, . . ., k,

u_i ≥ (U_{m+1:k} − (1 − L_{1:m}))/(k − m − 1). (13)

Then the lower bounds l_{m+1}, . . ., l_k, given by (10) and (11), determine the simplex S(a_1, . . ., a_k) defined in Theorem 2.

Ensuring that the elicited bounds are consistent with a prior on S_k can proceed as follows. First, ensuring L_{1:m} < 1 can be accomplished conservatively by lowering some of the l_i if necessary. In addition, the inequality 1 − L_{1:m} < U_{m+1:k} can be accomplished conservatively by raising some of the u_i if necessary. If U_{m+1:k} > (k − m)(1 − L_{1:m}), then some of the u_i need to be decreased or some of the l_i need to be increased or a combination of both. Indeed, setting a u_i = 1 to be conservative, so (13) is satisfied, may require lowering some of the lower bounds, but again this is conservative. Note that, if we assign the u_i such that U_{m+1:k} = (k − m)(1 − L_{1:m}), then (13) reduces to u_i ≥ 1 − L_{1:m} and the assignment u_i = 1 − L_{1:m}, for i = m + 1, . . ., k, ensures consistency, although an alternative assignment can be made provided (13) holds.

The purpose of Theorems 2-4 is to ensure that the bounds selected for the individual probabilities are consistent. It may be that an expert has a bound which they believe holds with virtual certainty but the consistency requirements are violated. The solution to this problem is to decrease a lower bound or increase an upper bound so that the requirements are satisfied. While this is not an entirely satisfactory solution to this problem, it does not violate the prescription that the bounds hold with virtual certainty. Furthermore, the lower bound of 0, or the upper bound of 1, is always available if a user feels they have absolutely no idea how to choose such a bound.
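The completion of a mixed specification can be sketched as follows; this is our own illustrative code, which computes the sum of the missing lower bounds from the rearranged linear system above and raises an error whenever a conservative adjustment of the bounds is required:

```python
def complete_mixed_bounds(l_low, u_up):
    # Lower bounds l_1..l_m and upper bounds u_{m+1}..u_k are completed
    # to a full set of lower bounds; requires k - m > 1.
    m, k = len(l_low), len(l_low) + len(u_up)
    if k - m <= 1:
        raise ValueError("need at least two upper bounds (k - m > 1)")
    L1m, Umk = sum(l_low), sum(u_up)
    if not (1.0 - L1m < Umk <= (k - m) * (1.0 - L1m)):
        raise ValueError("need 1 - L_{1:m} < U_{m+1:k} <= (k-m)(1 - L_{1:m})")
    # Sum of the unknown lower bounds, then the bounds themselves.
    Lmk = ((k - m) * (1.0 - L1m) - Umk) / (k - m - 1.0)
    l_up = [ui - (1.0 - L1m - Lmk) for ui in u_up]
    if any(li < 0.0 for li in l_up):
        raise ValueError("some u_i is too small; increase it conservatively")
    return l_low + l_up

l = complete_mixed_bounds([0.2, 0.2], [0.3, 0.4])  # roughly [0.2, 0.2, 0.2, 0.3]
```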
Consider an example.
Suppose that k = 4 and the lower bounds l_1 = 0.2, l_2 = 0.2, l_3 = 0.3, l_4 = 0.2 are placed on the probabilities. This results in the bounds 0.2 ≤ p_1 ≤ 0.3, 0.2 ≤ p_2 ≤ 0.3, 0.3 ≤ p_3 ≤ 0.4, and 0.2 ≤ p_4 ≤ 0.3, which are reasonably tight. The mode was placed at the centroid ξ = (0.225, 0.225, 0.325, 0.225). For γ = 0.99, an error tolerance of ε = 0.005 and a Monte Carlo sample of size N = 10^3 at each step, the values τ = 2560 and (α_1, α_2, α_3, α_4) = (577.0, 577.0, 833.0, 577.0) were obtained after 13 iterations. The prior content of S(a_1, a_2, a_3, a_4) was estimated to be 0.989. If greater accuracy is required, then N can be increased and/or ε decreased. This choice of lower bounds results in a fairly concentrated prior, as is reflected in the plots of the marginals in Figure 1. This concentration is not a defect of the elicitation as (2) indicates that it must occur when the sum of the bounds is close to 1. Thus, the concentration is forced by the dependencies among the probabilities. Consider now another example.
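The Monte Carlo side of this example can be sketched as follows, using stdlib gamma sampling for the Dirichlet and only a crude doubling search for τ; the actual computation also bisects down to the stated tolerance, which is omitted here:

```python
import random

def dirichlet_sample(alpha, rng):
    # Normalized independent gammas give a Dirichlet(alpha) draw.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def content(tau, xi, l, N, rng):
    # Estimated Dirichlet(1 + tau*xi_i) probability that every p_i >= l_i,
    # i.e., the prior content of S(a_1, ..., a_k).
    alpha = [1.0 + tau * x for x in xi]
    hits = sum(
        all(p >= b for p, b in zip(dirichlet_sample(alpha, rng), l))
        for _ in range(N)
    )
    return hits / N

rng = random.Random(1)
l = [0.2, 0.2, 0.3, 0.2]
L = sum(l)
xi = [b + (1.0 - L) / len(l) for b in l]  # centroid of S(a_1, ..., a_k)
tau = 1.0
while content(tau, xi, l, 500, rng) < 0.99:
    tau *= 2.0  # double until the estimated content reaches gamma = 0.99
```

As the text notes, only 2-3 decimals of accuracy are needed, so a modest Monte Carlo sample size per step suffices.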

Example 1 (continued). Choosing the prior.
Given that we wish to assess independence, it is necessary that any elicited prior include independence as a possibility so this is not ruled out a priori. A natural elicitation is to specify valid bounds (namely, bounds that satisfy our theorems) on the p_{i•} and the p_{•j} and then use these to obtain bounds on the p_{ij}, which in turn leads to the prior. Thus, suppose valid bounds have been specified that lead to the lower bounds a_i ≤ p_{i•}, b_j ≤ p_{•j}. Then l_{ij} = a_i b_j is taken as the lower bound on p_{ij}. Note that it is immediate that the l_{ij} satisfy the conditions of Theorem 2 and, from (2), p_{ij} ≤ 1 − ∑_{r,s} l_{rs} + l_{ij} = 1 − (∑_r a_r)(∑_s b_s) + a_i b_j, which is greater than l_{ij} = a_i b_j since 0 ≤ ∑_r a_r < 1 and 0 ≤ ∑_s b_s < 1. As such, the region for the p_{ij} contains elements of H_0.
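The product construction just described is a one-liner; the variable names below simply mirror the text:

```python
# Lower bounds a_i on the row marginals and b_j on the column marginals
# give cell lower bounds l_ij = a_i * b_j; their total is
# (sum a_i)(sum b_j) < 1, so Theorem 2's condition holds automatically.
a = [0.1, 0.0, 0.5]
b = [0.2, 0.2, 0.0]
l = [[ai * bj for bj in b] for ai in a]
total = sum(a) * sum(b)  # equals the sum of all the l_ij
```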
For this example, the lower bounds a_1 = 0.1, a_2 = 0.0, a_3 = 0.5, b_1 = 0.2, b_2 = 0.2, b_3 = 0.0 were chosen, which leads to the lower bounds

L = | 0.02 0.02 0.00 |
    | 0.00 0.00 0.00 |
    | 0.10 0.10 0.00 |

on the p_{ij}. Note that these are precisely the bounds used in Example 4 so the prior is as determined in that example, where the indexing is row-wise.
The software used in this paper to determine the prior from the bounds is available at http://utstat.utoronto.ca/mikevans/software/Dirichlet/RDirichlet.html.

Assessing the Prior
Here, we specialize the developments discussed in [10,11] to the multinomial problem with a Dirichlet prior. It is to be noted that the methods presented in this section for the assessment of a prior are applicable to any prior and not just in the special circumstances discussed here.
Suppose a quantity ψ = Ψ(p_1, . . ., p_k) is of interest and there is a need to assess the hypothesis H_0 : Ψ(p_1, . . ., p_k) = ψ_0. Let π_Ψ denote the prior density and π_Ψ(· | f_1, . . ., f_k) denote the posterior density of Ψ, where (f_1, . . ., f_k) gives the observed cell counts. When Ψ(p_1, . . ., p_k) = (p_1, . . ., p_k), these are the Dirichlet(α_1, . . ., α_k) and Dirichlet(α_1 + f_1, . . ., α_k + f_k) densities, respectively. The relative belief ratio RB_Ψ(ψ_0 | f_1, . . ., f_k) is defined as the limiting ratio of the posterior probability of a set containing ψ_0 to the prior probability of this set, where the limit is taken as the set converges (nicely) to the point ψ_0. Whenever π_Ψ(ψ_0) > 0 and π_Ψ is continuous at ψ_0,

RB_Ψ(ψ_0 | f_1, . . ., f_k) = π_Ψ(ψ_0 | f_1, . . ., f_k)/π_Ψ(ψ_0).

Thus, RB_Ψ(ψ_0 | f_1, . . ., f_k) is measuring how beliefs about ψ_0 have changed from a priori to a posteriori and is a measure of evidence concerning H_0: if RB_Ψ(ψ_0 | f_1, . . ., f_k) > 1, then there is evidence that H_0 is true, as belief in the truth of H_0 has increased; if RB_Ψ(ψ_0 | f_1, . . ., f_k) < 1, then there is evidence that H_0 is false, as belief in the truth of H_0 has decreased; and if RB_Ψ(ψ_0 | f_1, . . ., f_k) = 1, then there is no evidence either way.
Any 1-1 increasing transformation of a relative belief ratio can also be used to measure evidence. For example, log RB Ψ (ψ 0 | f 1 , . . ., f k ) works just as well, but now log RB Ψ (ψ 0 | f 1 , . . ., f k ) > (<) 0 provides evidence for (against) H 0 . As mentioned in the Introduction, this establishes a connection between relative belief and relative entropy. The Bayes factor is the ratio of the posterior odds to the prior odds and so is also a measure of change in belief and, as such, a measure of evidence. When the prior on ψ is discrete, the Bayes factor for the event {ψ 0 } equals BF Ψ (ψ 0 | f 1 , . . ., f k ) = RB Ψ (ψ 0 | f 1 , . . ., f k )/RB Ψ ({ψ 0 } c | f 1 , . . ., f k ), where {ψ 0 } c is the complement of {ψ 0 }. Thus, the Bayes factor can be expressed in terms of the relative belief ratio but not conversely. Furthermore, it can be proved that RB Ψ (ψ 0 | f 1 , . . ., f k ) > 1 iff RB Ψ ({ψ 0 } c | f 1 , . . ., f k ) < 1, which simply expresses the natural property that evidence for {ψ 0 } is evidence against {ψ 0 } c and conversely. Thus, it is seen that the relative belief ratio is the more fundamental measure of evidence and, moreover, the Bayes factor is not really comparing the evidence for {ψ 0 } with the evidence for its negation. When the prior on ψ is continuous, the issue is more complicated because of a common recommendation that such a prior be replaced by a mixture with a point mass at ψ 0 so that a Bayes factor can be defined. Alternatively, one could define the Bayes factor at ψ 0 in the continuous case as the limit of Bayes factors of shrinking sets, as was done for the relative belief ratio. When this definition is used, the Bayes factor is identical to the relative belief ratio. For these reasons, and a number of optimality properties proven for relative belief ratios, we adopt the relative belief ratio as the basic measure of evidence. These issues and results are more fully discussed in [11].
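To make the definition concrete, the relative belief ratio of an interval can be estimated by Monte Carlo as the ratio of the posterior to the prior content of that interval. This is a toy illustration (not the paper's software): a hypothetical uniform Dirichlet(1, 1, 1) prior, hypothetical counts (60, 30, 10), and ψ taken to be p 1 .

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0, 1.0])   # hypothetical uniform prior
f = np.array([60, 30, 10])          # hypothetical observed cell counts

# Sample psi = p_1 under the prior and under the posterior Dirichlet(alpha + f).
n_sim = 100_000
prior_p = rng.dirichlet(alpha, size=n_sim)[:, 0]
post_p = rng.dirichlet(alpha + f, size=n_sim)[:, 0]

# Relative belief ratio of the set {psi in [0.5, 0.7)}:
# posterior content divided by prior content.
lo, hi = 0.5, 0.7
prior_content = np.mean((prior_p >= lo) & (prior_p < hi))
post_content = np.mean((post_p >= lo) & (post_p < hi))
rb = post_content / prior_content

print(rb)   # > 1 here: the data increase belief that psi lies in [0.5, 0.7)
```

Since the observed proportion 60/100 lies inside the interval, the posterior content exceeds the prior content and the estimated ratio is greater than 1, i.e., evidence in favor.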

Assessing Bias in the Prior
Given that there is a measure of evidence for H 0 , it is possible to assess the bias in the prior with respect to H 0 . For this let M(• | ψ 0 ) denote the prior predictive distribution of ( f 1 , . . ., f k ) given that Ψ(p 1 , . . ., p k ) = ψ 0 . The bias against H 0 is assessed by the prior probability that evidence in favor of H 0 will not be obtained when H 0 is true, namely, M(RB Ψ (ψ 0 | F 1 , . . ., F k ) ≤ 1 | ψ 0 ) (14). If (14) is large, then there is bias in the prior against H 0 and, as such, if evidence against H 0 is obtained after seeing the data, then this should have little impact. In essence, the ingredients of the study are such that it is not meaningful to find evidence against H 0 . To measure bias in favor of H 0 , let ψ * be a value of Ψ that is just meaningfully different from ψ 0 . In other words, values ψ that differ from ψ 0 less than ψ * does are not considered as practically different from ψ 0 . Then the bias in favor of H 0 is measured by M(RB Ψ (ψ 0 | F 1 , . . ., F k ) ≥ 1 | ψ * ) (15). If (15) is large, then there is bias in favor of H 0 and, if evidence in favor of H 0 is obtained after seeing the data, then this should have little impact. It is shown in [11] that both (14) and (15) converge to 0 as n → ∞. Thus, bias can be controlled by sample size. The computation of (14) and (15) can be difficult in certain contexts, with the primary issue being the need to generate from the conditional prior predictives of the data. As in the following example, however, great accuracy is typically not required for these computations and so effective methods are available.

Example 1 (continued). Measuring bias and choosing δ.
To assess independence between X and Y, the marginal parameter ψ = Ψ(p 11 , p 12 , . . ., p kl ) = ∑ i,j p ij ln(p ij /(p i• p •j )) (16) is used. Note that (16) is the minimum Kullback-Leibler distance between the p ij values and an element of H 0 . Furthermore, ψ = 0 iff independence holds.
As discussed previously, it is necessary to specify a δ > 0 such that a practically meaningful lack of independence occurs iff the true value ψ ≥ δ. One approach is to specify a δ such that, if −δ ≤ (p ij − p i• p •j )/p ij < δ for all i and j, then any such deviation is practically insignificant, as the relative errors are all bounded by δ. Using ln(1 + x) ≈ x for small x, this condition implies that −δ ≤ ψ < δ. The range of ψ is then discretized using this δ and, because ψ ≥ 0 always, the hypothesis to be assessed is now H 0 : 0 ≤ ψ < δ. This assessment is carried out using the relative belief ratios based on the discretized prior and posterior of Ψ as discussed in Section 4. For the data in this problem, we take δ = 0.01, which corresponds to a 1% relative error. Thus, we do not consider independence as failing when the true probabilities differ from the probabilities based on independence with a relative error of less than 1%.
With this choice of δ the issue of bias is now addressed. The prior distribution of the discretized Ψ is determined by simulation. For this, generate the p ij from the elicited prior, compute ψ and record the prior probability contents of the intervals for ψ given by [0, δ), [δ, 2δ), . . ., [(k − 1)δ, kδ), where k is determined so as to cover the full range of the generated values of ψ. The plot of the prior density histogram for ψ is provided in Figure 3.
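This simulation can be sketched as follows, with the caveat that the hyperparameter matrix alpha is a hypothetical stand-in for the elicited values (the elicited hyperparameters from Example 4 are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Dirichlet hyperparameters for the 3x3 table (row-wise).
alpha = np.array([[4.0, 4.0, 1.0],
                  [1.0, 1.0, 1.0],
                  [12.0, 12.0, 1.0]])

delta = 0.01
n_sim = 20_000
p = rng.dirichlet(alpha.ravel(), size=n_sim).reshape(n_sim, 3, 3)

# psi = sum_ij p_ij ln(p_ij / (p_i. * p_.j)), the KL distance to independence.
row = p.sum(axis=2, keepdims=True)   # p_i.
col = p.sum(axis=1, keepdims=True)   # p_.j
psi = np.sum(p * np.log(p / (row * col)), axis=(1, 2))
psi = np.maximum(psi, 0.0)           # clip tiny negative rounding errors

# Discretize into the intervals [0, delta), [delta, 2*delta), ...
k = int(np.ceil(psi.max() / delta))
edges = delta * np.arange(k + 1)
contents, _ = np.histogram(psi, bins=edges)
contents = contents / n_sim          # estimated prior interval contents

print(contents[:5])                  # prior probabilities of the first intervals
```

The vector `contents` is the discretized prior of Ψ; a bar plot of it is the analogue of the prior density histogram in Figure 3.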
For inference, the posterior contents of these intervals are also determined by simulating from the posterior based on the observed data. For measuring bias, however, we proceed as follows. Each time a generated ψ satisfies ψ ∈ [0, δ), the corresponding p ij are used to generate a new data set F ij and RB Ψ ([0, δ) | F 11 , . . ., F kl ) is determined; note that this requires generating from the posterior based on the F ij . The probability M(RB Ψ ([0, δ) | F 11 , . . ., F kl ) ≤ 1 | [0, δ)) is then estimated by the proportion of these relative belief ratios that are less than or equal to 1. This gives an estimate of the bias against H 0 . Estimating the bias in favor of H 0 proceeds similarly, but now the F ij are generated whenever ψ ∈ [δ, 2δ) is satisfied, as these represent values that just differ meaningfully from independence. Clearly this procedure could be computationally quite demanding if highly accurate estimates of the biases are required. In general, however, high accuracy is not necessary; even accuracy to one decimal place will provide a clear indication of whether or not there is serious bias. In this problem the biases for the elicited prior are estimated to be 0.12 for the bias in favor and 0.02 for the bias against. Thus, there is only a probability of 0.02 of obtaining evidence against H 0 when it is true, which implies virtually no bias against H 0 . There is, however, a prior probability of 0.12 of obtaining evidence in favor of H 0 when it is just meaningfully false and so some bias in favor of H 0 . It is to be noted that the bias decreases as ψ * moves away from ψ 0 . These values depend on the chosen value of δ but in fact are reasonably robust to this choice. The prior probability content of the interval [0, 0.01) is 0.14, while [0.01, 0.02) contains 0.25 of the prior probability. Thus, there is a reasonable amount of prior probability allocated to effective independence and also to the smallest nonindependence of interest.
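The nested simulation for the bias against H 0 can be sketched as follows. The hyperparameters, sample size and simulation sizes are hypothetical stand-ins, chosen small so the sketch runs quickly; the hyperparameters concentrate near independence so that prior draws with ψ ∈ [0, δ) occur frequently.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical hyperparameters near independence: alpha proportional to an
# outer product of marginal weights (flattened row-wise).
alpha = 200.0 * np.outer([0.3, 0.2, 0.5], [0.4, 0.4, 0.2]).ravel()
n, delta = 500, 0.01

def psi_vec(P):
    # KL distance to independence for each 3x3 table in P (shape (m, 3, 3)).
    row = P.sum(axis=2, keepdims=True)
    col = P.sum(axis=1, keepdims=True)
    return np.maximum(np.sum(P * np.log(P / (row * col)), axis=(1, 2)), 0.0)

def rb_low_interval(counts, n_mc=500):
    # Estimated relative belief ratio of {psi in [0, delta)} given the counts.
    prior = psi_vec(rng.dirichlet(alpha, size=n_mc).reshape(-1, 3, 3))
    post = psi_vec(rng.dirichlet(alpha + counts, size=n_mc).reshape(-1, 3, 3))
    c_prior = max(np.mean(prior < delta), 1.0 / n_mc)   # guard against 0
    return np.mean(post < delta) / c_prior

# Bias against H_0: proportion of RB([0, delta)) <= 1 when H_0 is
# (effectively) true, i.e. when a prior draw has psi in [0, delta).
rbs = []
for _ in range(2000):
    p = rng.dirichlet(alpha)
    if psi_vec(p.reshape(1, 3, 3))[0] < delta:
        counts = rng.multinomial(n, p)       # generate a new data set F
        rbs.append(rb_low_interval(counts))
        if len(rbs) == 10:
            break
bias_against = float(np.mean(np.array(rbs) <= 1.0))
print(bias_against)
```

The bias in favor is estimated the same way but conditioning the outer loop on ψ ∈ [δ, 2δ) and recording the proportion of relative belief ratios that are greater than or equal to 1.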

Checking for Prior-Data Conflict
Anytime a prior is used it is reasonable to question whether or not the prior is contradicted by the data. Essentially, such a contradiction occurs when the data indicate that the true value of the model parameter lies in the tails of the prior. While opinions vary on this, the point of view taken here is that properly collected data are primary in determining inferences, and so models and priors that are contradicted by the data need to be modified when this occurs. The issue is somewhat less relevant for priors, as with enough data the effect of the prior is minimal, but on the other hand it often turns out to be relatively easy to modify the prior so that the conflict is avoided, see [12].
The elicitation discussed here could be in error, namely, if the true probabilities lie well outside the intervals obtained. If the data demonstrate this in a reasonably conclusive way, then it would seem incorrect to proceed with an analysis based on this prior unless there was an absolute conviction that the amount of data was sufficient to overwhelm the influence of the prior. To check for prior-data conflict we follow Evans and Moshonov [13] and compute the tail probability M(m(F 1 , . . ., F k ) ≤ m( f 1 , . . ., f k )) (17), where ( f 1 , . . ., f k ) is the observed value of the minimal sufficient statistic and M is the prior predictive distribution of this statistic with density m. In [14] it is proved that quite generally (17) converges to Π(π(p 1 , . . ., p k ) ≤ π(p 1,true , . . ., p k,true )) as n → ∞, where Π is the prior on (p 1 , . . ., p k ). Thus, a small value of (17) indicates that the true value of (p 1 , . . ., p k ) lies in a region where the prior is relatively low and so the data contradict the prior. Certainly a value like 0.01 for (17) suggests that the true value is well into the "tails" of the prior. It is to be noted that prior-data conflict can have a number of ill effects. For example, results in [15] show that robustness to the prior cannot be achieved in the presence of prior-data conflict.
When the prior is the uniform, a simple computation shows that (17) is equal to 1 and so there is no prior-data conflict. Intuitively, the closer τ is to 0, the less information the prior is putting into the analysis. This idea can be made precise in terms of the weak informativity of one prior with respect to another, as developed in [12]. As such, if prior-data conflict is obtained with the prior specified by a value of (ξ 1 , . . ., ξ k , τ), then this prior can be replaced by a prior that is weakly informative with respect to it so that the conflict can be avoided, and this entails choosing a value τ ′ < τ.
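The tail probability (17) can be estimated by simulation since, for a Dirichlet prior, the prior predictive of the counts is Dirichlet-multinomial, whose density is available in closed form. A minimal sketch, with hypothetical hyperparameters and counts:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(3)

alpha = np.array([2.0, 2.0, 2.0])          # hypothetical prior hyperparameters
f_obs = np.array([30, 40, 30])             # hypothetical observed counts
n = int(f_obs.sum())

def log_m(f, alpha):
    # Log Dirichlet-multinomial density: the prior predictive of the counts.
    nf = int(f.sum())
    out = lgamma(nf + 1) + lgamma(alpha.sum()) - lgamma(alpha.sum() + nf)
    for fi, ai in zip(f, alpha):
        out += lgamma(ai + fi) - lgamma(ai) - lgamma(fi + 1)
    return out

# Estimate (17) = M(m(F) <= m(f_obs)) by simulating F from the prior
# predictive: p ~ Dirichlet(alpha), then F ~ Multinomial(n, p).
obs = log_m(f_obs, alpha)
sims = 5000
hits = sum(log_m(rng.multinomial(n, rng.dirichlet(alpha)), alpha) <= obs
           for _ in range(sims))
tail = hits / sims
print(tail)   # small values (e.g. near 0.01) indicate prior-data conflict
```

Here the observed counts are close to the prior predictive mode, so the estimated tail probability is large and no conflict is indicated.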
For the elicited Dirichlet prior, the value of (17) is approximately equal to 1 (to the accuracy of the computations) and so there is definitely no prior-data conflict.

Inference
For data ( f 1 , . . ., f k ) and a Dirichlet(α 1 , . . ., α k ) prior, the posterior of (p 1 , . . ., p k ) is Dirichlet(α 1 + f 1 , . . ., α k + f k ). As such it is easy to generate from the posterior of ψ, estimate the posterior contents of the intervals [(i − 1)δ, iδ) and then estimate the relative belief ratios RB Ψ ([(i − 1)δ, iδ) | f 1 , . . ., f k ). From this a relative belief estimate of the discretized ψ can be obtained and various hypotheses assessed for this quantity.
The strength of the evidence provided by RB Ψ (ψ 0 | f 1 , . . ., f k ) is measured by, see [11], Π Ψ (RB Ψ (ψ | f 1 , . . ., f k ) ≤ RB Ψ (ψ 0 | f 1 , . . ., f k ) | f 1 , . . ., f k ) (18), namely, the posterior probability that the true value of ψ has a relative belief ratio no greater than the hypothesized value. When RB Ψ (ψ 0 | f 1 , . . ., f k ) < 1, so there is evidence against ψ 0 , a small value for (18) implies there is strong evidence against ψ 0 , since there is a large posterior probability that the true value has a larger relative belief ratio than ψ 0 . When RB Ψ (ψ 0 | f 1 , . . ., f k ) > 1, so there is evidence in favor of ψ 0 , a large value for (18) indicates there is strong evidence in favor of ψ 0 , since there is a small posterior probability that the true value has a larger relative belief ratio than ψ 0 . Note that, when RB Ψ (ψ 0 | f 1 , . . ., f k ) > 1, then ψ 0 is the best estimate of ψ in the set {ψ : RB Ψ (ψ | f 1 , . . ., f k ) ≤ RB Ψ (ψ 0 | f 1 , . . ., f k )}, as it has the most evidence in its favor. While the measure of strength looks like a p-value, it has a very different interpretation and it is not measuring evidence. Note that, if our goal was instead to estimate ψ, then the measure of evidence adopted dictates that this be given by the relative belief estimate ψ( f 1 , . . ., f k ) = arg sup ψ RB Ψ (ψ | f 1 , . . ., f k ), as this is the value with the most evidence in its favor (sup ψ RB Ψ (ψ | f 1 , . . ., f k ) is always greater than 1). In addition, an assessment of the accuracy of the estimate is given by the size of a λ-relative belief region C λ ( f 1 , . . ., f k ) = {ψ : RB Ψ (ψ | f 1 , . . ., f k ) ≥ c λ ( f 1 , . . ., f k )}, where c λ ( f 1 , . . ., f k ) is the smallest constant such that the posterior content of C λ ( f 1 , . . ., f k ) is at least λ for a choice of λ ∈ (0, 1). Note that the relative belief estimate is always in C λ ( f 1 , . . ., f k ).
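The strength computation can be sketched from the same discretization used for the relative belief ratios: estimate RB for every interval, then sum the posterior contents of the intervals whose RB is no greater than that of the hypothesized interval. This illustration uses a hypothetical setup (uniform Dirichlet(1, 1, 1) prior, hypothetical counts, ψ = p 1 ), not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha = np.array([1.0, 1.0, 1.0])   # hypothetical prior
f = np.array([60, 30, 10])          # hypothetical counts; psi = p_1
delta = 0.05

# Prior and posterior samples of psi.
n_sim = 100_000
prior = rng.dirichlet(alpha, size=n_sim)[:, 0]
post = rng.dirichlet(alpha + f, size=n_sim)[:, 0]

# Interval contents on the grid [(i-1)*delta, i*delta).
edges = np.arange(0.0, 1.0 + delta, delta)
c_prior, _ = np.histogram(prior, bins=edges)
c_post, _ = np.histogram(post, bins=edges)
c_prior = np.maximum(c_prior / n_sim, 1.0 / n_sim)   # avoid division by zero
c_post = c_post / n_sim
rb = c_post / c_prior                                # RB of each interval

# Strength (18) for the hypothesized interval [0.55, 0.60): the posterior
# probability of {psi : RB(psi) <= RB(psi_0)}.
i0 = 11
strength = c_post[rb <= rb[i0]].sum()
print(rb[i0], strength)
```

Since the hypothesized interval contains the bulk of the posterior, its RB exceeds 1 and the strength is large, i.e., strong evidence in favor.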
Relative belief inferences possess a number of optimality properties in the class of Bayesian inferences, see [11], and, of particular relevance for the choice of prior, optimal robustness-to-the-prior properties as developed in [15]. Whether or not the elicitation methodology itself assists in inducing such robustness is a matter for further investigation; especially when the bounds are chosen to be quite diffuse, this seems plausible. Given that there is no prior-data conflict with the elicited prior and little or no bias in this prior relative to the hypothesis H 0 of independence, we can proceed to inference in Example 1.
The posterior of the p ij is the Dirichlet(998.2, 694.2, 146.48, 395.48, 428.48, 96.48, 2918.1, 2651.1, 582.48) distribution. For the hypothesis H 0 of independence between the variables, and using the discretized Kullback-Leibler divergence with δ = 0.01, the value RB Ψ ([0, δ) | f 1 , . . ., f k ) = 7.13 was obtained, so there is evidence in favor of H 0 . For the strength of this evidence the value of (18) equals 1. Thus, the evidence in favor of H 0 is of the maximum possible strength. Of course, this is due to the large sample size and the fact that the posterior distribution concentrates entirely in [0, δ). Note that this is a very different conclusion than that obtained by the p-value based on the chi-squared test.

Conclusions
A very natural and easy-to-use method has been developed for eliciting Dirichlet priors based upon placing bounds on the individual probabilities that takes into account the dependencies among the probabilities. Of course, there may be more information available, such as upper and lower bounds on many of the probabilities. The price paid for using such information, however, is a much more complicated region where the bulk of the prior mass is located, and even difficulties in determining what that region is, so this represents a problem for further work. It is also relevant to consider the individual bounds holding with possibly different prior probabilities but, as with considering both lower and upper bounds simultaneously for each probability, mathematical issues arise that make this a problem for further work.
While we view the approach to elicitation presented here as being fairly simple, it is certainly reasonable that other approaches are practically useful and preferable in certain situations. There is no doubt that the Dirichlet imposes what may be unnatural constraints for some situations and so more general families of priors are also needed for the multinomial. As such, extending our approach to more general families of priors is another problem of interest. In particular, Theorems 2-4 are relevant to any family of priors placed on S k and so, provided sampling from such a prior is straightforward and there is a nice way to parameterize the prior as with the Dirichlet, the approach of this paper can be implemented.
The application of the Dirichlet prior to an inference problem has also been illustrated using a measure of statistical evidence, the relative belief ratio, as a basis for the inferences. Given that a measure of evidence has been identified, it is possible to assess the bias in the prior before proceeding to inference. In addition, the prior has been checked to see if it is contradicted by the data. While the adequacy of the prior in light of the data can be assessed via the methods discussed in Section 3, there is also a need to measure how closely an elicited prior reflects an expert's judgements and suitable methodology needs to be developed for that problem.
Finally, it is seen that the assessment of a hypothesis can be different than that obtained by a standard p-value and, in particular, provide evidence in favor of a hypothesis. Of course, this is based on a well-known defect in p-values, namely, with a large enough sample, a failure of the hypothesis of no practical importance can be detected. The solution to this problem is to say what difference matters and use an approach that incorporates this. Relative belief inferences are seen to do this in a very natural way. The choice of δ is not arbitrary but is rather a fundamental characteristic of the application. When such a δ cannot be determined, it is not a failure of the inference methodology, but rather reflects a failure of the analyst to understand an aspect of the application that is necessary for a more refined analysis to take place.

Figure 2 .
Figure 2. Plot of the nine marginal priors in Example 4.

Figure 3 .
Figure 3. Plot of the prior density histogram for ψ in Example 1.

Table 1 .
The data in Example 1.
[Columns: Y = O, Y = A, Y = B, Total; the table entries were not recovered.]

Table 2 .
The estimated cell probabilities in Example 1 based on the full and independence models.