Asymptotic Results for Multinomial Models

In this work, we derived new asymptotic results for multinomial models. To obtain these results, we started by studying limit distributions in models with a compact parameter space. This restriction holds since the key parameter, whose components are the probabilities of the possible outcomes, has non-negative components that add up to 1. Based on these results, we obtained confidence ellipsoids and simultaneous confidence intervals for models with normal limit distributions. We then studied the covariance matrices of the limit normal distributions for the multinomial models. This was a transition between the previous general results and the inference for multinomial models, in which we considered chi-square tests, confidence regions and non-linear statistics, namely log-linear models, with two numerical applications to those models. In particular, our approach overcame the hierarchical restrictions assumed when analysing multidimensional contingency tables.


Introduction
In several fields of study such as health, business, social sciences and education, the outcomes of variables are mainly discrete, i.e., the variables take only finitely or countably many values. A discrete variable that takes only finitely many values is called a categorical variable [1]. A categorical variable consists of a set of categories that are non-overlapping [2], and the outcome can be binary (dichotomous), i.e., with just two possible levels, such as "present" or "absent" for a desired condition, or polytomous, i.e., with more than two levels, as is the case of the "Likert" scale [3]. There are two common types of polytomous variables, as can be seen in [4], which are the ordinal and nominal scales of measurement. Categorical variables, such as one's eye colour, ethnicity and affiliations, whose categories cannot be ordered in any way, are nominal, while categories such as a patient's level of resistance to a drug, the level of education and economic status exhibit a natural order and are thus ordinal.
In a study in which all the observed variables are categorical, the most common way of representing the data is in a contingency table, which is a cross-tabulation of the variables [5,6]. When there are m variables, the contingency table is an m-dimensional table, also known as a multidimensional table when there are more than two attributes. The information of a contingency table is mainly summarized through appropriate measures, such as measures of association, or through models. Association measures, although easy to compute and interpret, lead to a great loss of information, as can be seen in [7]. Models are preferred in the case where a more sensitive analysis is required. A model is a "theory" or a conceptual framework about observations, and the parameters in the model represent the "effects" that particular variables or combinations of variables have in determining the values taken by the observations. The easiest and most common model for a contingency table is the log-linear model [8]. It is constructed by taking the natural logarithms of the cell probabilities, by analogy with the analysis of variance (ANOVA) models, as can be seen in [9][10][11]. Classical log-linear models are sometimes regarded in the framework of the generalized linear model (GLM). They are also important in connection with contingency matrices, as can be seen in [12]. Contemporary problems in categorical data analysis, with extremely high-dimensional data and demanding computational procedures, require the development of complex models. Much work has been done on the modelling of categorical data, as can be seen in [7,13]. For example, in [14], the author used regression models for modelling categorical data.
In our work, we derived new asymptotic results that will enable us to obtain confidence ellipsoids and simultaneous confidence intervals, respectively, for the vector of probabilities and its components, which will enable us to overcome some inference limitations of the existing procedures.
Inferential statistical analysis requires assumptions about the probability distribution of the response variable. For categorical data, the main distribution is the multinomial distribution. Most of the time, categorical data result from n independent and identical trials, with each trial having two or more possible outcomes. When the n independent and identical trials have the same category probabilities, the distribution of the counts in the various categories is the multinomial distribution. The binomial distribution is a special case of the multinomial distribution with just two possible outcomes for each trial. Usually, the parameters of the multinomial distribution are not known, and these parameters are often estimated from the sample data by several estimation methods, such as maximum likelihood estimation (MLE), as can be seen, for instance, in [15], minimum discrimination information (MDI) [16], weighted least squares (WLS) [17] and Bayesian estimation (BA) [18]. In a previous study [19], we wanted to minimize the average cost, so we used statistical decision theory (SDT) since there were only a finite number of possible choices. We point out that we achieved consistency since the probability of selecting the choice with the least average cost tends to 1 as the sample size tends to infinity.
If we have n realizations of an experiment with m possible results with probabilities p_1, ⋯, p_m, we have the probability mass function, as can be seen in [20,21]:

Pr(N_1 = n_1, ⋯, N_m = n_m) = (n!/(n_1! ⋯ n_m!)) p_1^{n_1} ⋯ p_m^{n_m}, with n_1 + ⋯ + n_m = n,

for the vector N = (N_1, ⋯, N_m) of the times we obtain the different results. This probability mass function corresponds to the singular multinomial distribution M(·|n, p). We name as multinomial the models describing these sets of independent realizations of experiments with a finite number of results. For the vector p = (p_1, ⋯, p_m) of probabilities, we have the vector of estimators p_n = (p_{n,1}, ⋯, p_{n,m}), with p_{n,j} = N_j/n, j = 1, ⋯, m. Moreover, as can be seen in [22], as n → ∞:

√n(p_n − p) ∼ N(0, U(p)),

where ∼ indicates the limit distribution, in this case N(0, U(p)), the normal distribution with the null mean vector and covariance matrix

U(p) = D(p) − p pᵗ,

where D(p) is the diagonal matrix with principal elements p_1, ⋯, p_m. This result will play an important role in the asymptotic treatment of the multinomial models, which is this paper's goal.
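As a quick numerical check of this limit result, the following sketch (assuming Python with NumPy; the probability vector, sample size and number of replicates are illustrative choices, not taken from the paper) simulates repeated multinomial samples and compares the empirical covariance of √n(p_n − p) with U(p) = D(p) − ppᵗ:

```python
import numpy as np

def U(p):
    """Limit covariance matrix U(p) = D(p) - p p^t of sqrt(n)(p_n - p)."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # illustrative outcome probabilities
n, reps = 2000, 20000

# Draw `reps` multinomial samples and form the scaled estimation errors.
counts = rng.multinomial(n, p, size=reps)      # shape (reps, m)
errors = np.sqrt(n) * (counts / n - p)         # sqrt(n)(p_n - p), one row per sample

# The empirical covariance of the errors should approximate U(p).
emp_cov = np.cov(errors, rowvar=False)
print(np.max(np.abs(emp_cov - U(p))))          # small for large n and reps
```

Note that the rows of `errors` sum to zero exactly, reflecting the singularity of U(p) (its rows also sum to zero, since U(p)1 = 0).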
To carry out that asymptotic treatment, we start by obtaining a convenient version of the continuous mapping theorem [23] in the next section, on limit distributions. We then obtain confidence regions in Section 3, namely confidence ellipsoids and simultaneous confidence intervals, and, in Section 4, we study the algebraic structure of the limit covariance matrix U(p).
In Section 5, we obtained chi-square tests for hypotheses on outcome probabilities and confidence ellipsoids and simultaneous confidence intervals for them. We also considered log-linear models for which we presented a numerical application. We pointed out that our approach to these models overcame the hierarchical restriction used to analyse multidimensional contingency tables.
Our use of both the classical and the new version of the parametrized continuous mapping theorem (PCMT) enabled us to carry out statistical inference for multinomial models. This inference was similar to ANOVA and related techniques, but F-tests were replaced by chi-square tests, which is highly convenient since we now have an infinity of degrees of freedom for the error.
Finally, we stress the close relationship between our ANOVA-like inference using chi-square tests and the usual treatment of fixed effects models. We point out that the F-tests in that treatment have interesting invariance properties that express the symmetry of those models, especially since those models are associated with orthogonal partitions into sub-spaces that are invariant under rotations.

Limit Distributions
Let C be the class of continuous functions. If l(·) ∈ C and the distribution F_{Y_n} of Y_n converges to F_Y ∈ C (that is, F_{Y_n}(y) → F_Y(y) whenever y is a continuity point of F_Y(·)), we have, as can be seen in [23][24][25][26]:

F_{l(Y_n)}(y) → F_{l(Y)}(y),

as follows from the continuous mapping theorem.
Remark 1 (Corollary to PCMT). (a) With θ a parameter both for F_Y and F_{l(Y)}; (b) when F_Y is N(0, V(θ)) and θ_n →_p θ, we have GY_n with limit distribution N(0, GV(θ)Gᵗ), with G a matrix and V(θ) the covariance matrix, a function of the parameter θ. This remark will be renamed as a corollary of PCMT (CPCMT), as can be seen in [6].
We now consider the case of a sequence of random vectors with the same limit distribution. The sequence {X_n} of random vectors is mean stable when all its vectors have the same mean vector µ.
between Z_n and µ. Since Z_n →_s µ, we also have θ_{n,j} →_s µ, j = 1, ⋯, m. Then, with Θ_ε(µ) the radius-ε sphere with centre µ, Pr(Z_n, θ_{n,1}, ⋯, θ_{n,m} ∈ Θ_ε(µ)) → 1. Now, g_j(z) is a continuous function of z, so it will have a maximum u_{j,ε} in Θ_ε(µ) that will exceed the supremum of the spectral radius of g_j(z) in Θ_ε(µ), and the thesis follows from Proposition 1.
Proof. The thesis follows from Propositions 1 and 2 since the continuous mapping theorem, as can be seen in [23,24], implies that the limit distribution of

Confidence Ellipsoids
We start by establishing the following.

Proposition 4.
If Y (not necessarily normal) has a covariance matrix C, with range space Ω = R(C), and mean vector µ, then Y − µ belongs to Ω with probability 1.

Proof. Let α_1, ⋯, α_k constitute an orthonormal basis for the orthogonal complement, Ω⊥, of Ω. Then, α_jᵗ(Y − µ) will have a null mean value and a null variance. Thus, according to the Bienaymé-Tchebycheff inequality, Pr(α_jᵗ(Y − µ) = 0) = 1, j = 1, ⋯, k. Therefore, we obtain, with A the matrix with row vectors α_1, ⋯, α_k, Pr(A(Y − µ) = 0) = 1, as follows from the generalized Boole inequalities, and so the thesis is established.
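A small numerical illustration of this proposition (a sketch assuming Python with NumPy; the probability vector and sample sizes are arbitrary choices): for the multinomial covariance C = U(p), the orthogonal complement of the range space is spanned by the vector of ones, and the estimation errors p_n − p indeed have a null component along it:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.4, 0.35, 0.25])
C = np.diag(p) - np.outer(p, p)      # singular covariance; R(C) = {1}^perp

# alpha spans the orthogonal complement of R(C) (the null space of C).
alpha = np.ones(3) / np.sqrt(3)
assert np.allclose(C @ alpha, 0)

# Multinomial errors Y - mu = p_n - p have zero component along alpha,
# because the estimated probabilities sum to 1 just as p does.
counts = rng.multinomial(500, p, size=1000)
Y_minus_mu = counts / 500 - p
proj = Y_minus_mu @ alpha            # alpha^t (Y - mu), null up to rounding
print(np.max(np.abs(proj)))
```

So every sample of Y − µ lies in Ω = R(C), as the proposition states (here with probability 1 exactly, not just asymptotically).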
We then have: Given B a positive semi-definite [definite] matrix with positive eigenvalues ε_1, ⋯, ε_h corresponding to eigenvectors α_1, ⋯, α_h, and with A the matrix with row vectors α_1, ⋯, α_h and D the diagonal matrix with principal elements ε_1, ⋯, ε_h, we have B = AᵗDA and B⁺ = AᵗD⁻¹A. We can now establish the following.
, degrees of freedom and σ 2 is the variance of Y.
Proof. As stated in Lemma 2, we have the corresponding decomposition. We now only have to point out that Ÿ ∼ N(0, σ²I_h), where I_h is the identity matrix, to establish the thesis.
We now consider confidence ellipsoids and simultaneous confidence intervals. Ellipsoids and their support planes are presented in [29]; a point x belongs to the ellipsoid ξ(µ, B, r) if and only if the support-plane inequalities hold for all possible vectors v. We now establish the following, with x_{h,1−q} the (1 − q)-th quantile of χ²_h (the central chi-square with h degrees of freedom), when rank(B) = h.
Proof. The proof for the case Y ∼ N(µ, B) directly follows from the previous considerations. Thus, we only have to point out that x ∈ ξ(µ, B, r) is equivalent to the support-plane condition, as can be seen in [29]. Moreover, we will take h = m − 1 because, as we shall see in the next section, rank(U(p)) = m − 1.
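To illustrate such confidence ellipsoids, here is a simulation sketch (assuming Python with NumPy and SciPy; p, n, the number of replicates and the level q are arbitrary choices). The ellipsoid built from the Moore-Penrose inverse of U(p_n) and the quantile x_{m−1,1−q} should cover p with limit probability 1 − q:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
p = np.array([0.4, 0.3, 0.2, 0.1])   # illustrative true probabilities
m, n, reps, q = 4, 2000, 5000, 0.05
crit = chi2.ppf(1 - q, df=m - 1)     # quantile x_{m-1, 1-q}

counts = rng.multinomial(n, p, size=reps)
cover = 0
for ph in counts / n:
    # Ellipsoid membership: n (p_n - p)^t U(p_n)^+ (p_n - p) <= x_{m-1,1-q}.
    U_hat = np.diag(ph) - np.outer(ph, ph)
    stat = n * (ph - p) @ np.linalg.pinv(U_hat) @ (ph - p)
    cover += stat <= crit
print(cover / reps)                  # should be close to 1 - q
```

The Moore-Penrose inverse is needed because U(p_n) is singular; the statistic then has a limit chi-square distribution with m − 1 degrees of freedom.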
In the next section, we will obtain results on U(p) that will be used to obtain chi-square confidence regions for p and, through duality, to test hypotheses on p.

Covariance Matrices
As we saw, for p_n, the limit covariance matrix of √n(p_n − p) is:

U(p) = D(p) − p pᵗ,

where p = (p_1, ⋯, p_m), p_j > 0, j = 1, ⋯, m, and ∑_{j=1}^m p_j = 1. For the rank of the covariance matrix, we have rank(U(p)) ≥ m − 1, since rank(D(p)) = m and rank(ppᵗ) = 1, as follows, as can be seen in [30], page 46, from |rank(A) − rank(B)| ≤ rank(A + B). In addition to this, U(p)1 = 0, so rank(U(p)) ≤ m − 1. Thus, rank(U(p)) = m − 1. Matrix U(p) is a covariance matrix which, as can be seen in [30], is positive semi-definite. There is therefore an orthogonal matrix P(p) and a diagonal matrix D(v), whose principal elements are the eigenvalues v_1, ⋯, v_m of U(p), such that:

U(p) = P(p)ᵗ D(v) P(p).

Since rank(U(p)) = m − 1, we may order its eigenvalues to have v_j > 0, j = 1, ⋯, m − 1, and v_m = 0. With D(v)^{1/2} the diagonal matrix with principal elements v_1^{1/2}, ⋯, v_m^{1/2}, we will have:

U(p)^{1/2} = P(p)ᵗ D(v)^{1/2} P(p).

We now establish: Proof. With g_j = rank(M_j) and M_j = [m_{j,1}, ⋯, m_{j,m}], j = 1, ⋯, w, there will be g_j linearly independent column vectors {ṁ_{j,l}; l ∈ D_j} of M_j, j = 1, ⋯, w. The vectors in ⋃_{j=1}^w {ṁ_{j,l}; l ∈ D_j} will be linearly independent, since, when j ≠ j′, they are orthogonal.
Thus, rank([M_1 ⋯ M_w]) = ∑_{j=1}^w rank(M_j). Moreover, if we join another column vector of [M_1 ⋯ M_w], say m_{j′,l′}, to the set ⋃_{j=1}^w {ṁ_{j,l}; l ∈ D_j}, it will depend linearly on the ṁ_{j′,l}; l ∈ D_{j′}. Thus, the vectors in the extended set will not be linearly independent.
, with ⊗ indicating the Kronecker matrix product, as can be seen in [31], and with rank(1_w ⊗ I_m) = rank(1_w)rank(I_m) = m. Thus, as can be seen in [30], the thesis is established. Let Q_1, ⋯, Q_w now be pairwise orthogonal orthogonal projection matrices (POOPM). Since rank(U(p)) = m − 1, the nullity space N(p) of U(p) will have dimension 1 and, since α_1 = (1/√m)1_m ∈ N(p), α_1α_1ᵗ will be the orthogonal projection matrix on N(p); since U(p) is symmetrical, its range space R(p) will be the orthogonal complement N(p)⊥ of N(p). The orthogonal projection matrix T(p) on R(p) will then be:

T(p) = I_m − α_1α_1ᵗ = I_m − (1/m)1_m 1_mᵗ.

Thus, if ∑_{j=1}^w Q_j = I_m, we will have T(p) = ∑_{j=2}^w Q_j as well as, according to Lemma 3, the corresponding rank decomposition. Now, rank(Q_jU(p)) ≤ rank(Q_j), j = 2, ⋯, w, and m − 1 = ∑_{j=2}^w rank(Q_j). Thus, we must have rank(Q_jU(p)) = rank(Q_j), j = 2, ⋯, w. We now highlight that U(p) and U(p)^{1/2} have the same eigenvectors associated with positive eigenvalues. These eigenvectors constitute an orthonormal basis for R(U(p)) = R(U(p)^{1/2}) and, reasoning as above, we obtain rank(Q_jU(p)^{1/2}) = rank(Q_j), j = 2, ⋯, w. Matrices Q_1, ⋯, Q_w naturally appear, as can be seen in [32], when there are factors that cross or groups of nested factors that cross. The sums of squares of the effects and interactions of these factors are the ‖A_jY‖², j = 2, ⋯, w, and ‖A_1Y‖² can be associated with the general mean.
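The rank and square-root factorization of U(p) discussed in this section are easy to check numerically. The following sketch (assuming Python with NumPy, with an arbitrary p) computes the spectral decomposition and verifies that rank(U(p)) = m − 1:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.1, 0.4])          # illustrative probabilities
m = p.size
Up = np.diag(p) - np.outer(p, p)            # U(p) = D(p) - p p^t

# Spectral decomposition: eigh returns eigenvalues v and eigenvectors as
# the columns of P, so that Up = P D(v) P^t; one eigenvalue is null.
v, P = np.linalg.eigh(Up)
rank = int(np.sum(v > 1e-12))               # number of positive eigenvalues

# Matrix square root built from the non-negative part of the spectrum,
# U(p)^{1/2} = P D(v)^{1/2} P^t.
root = (P * np.sqrt(np.clip(v, 0.0, None))) @ P.T
print(rank)
```

Note the convention difference: NumPy stores eigenvectors as columns, whereas the text writes U(p) = P(p)ᵗD(v)P(p) with eigenvectors as rows of P(p); the two factorizations agree up to that transposition.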

Chi-Square Tests
According to the PCMT, when H_0(A): Ap = 0 holds, the limit distribution of:

L_n(A) = n p_nᵗ Aᵗ (A U(p_n) Aᵗ)⁺ A p_n

will be that of χ²_r, with r = rank(AU(p_n)Aᵗ), since, when H_0(A) holds, √n A p_n ∼ N(0, AU(p)Aᵗ), and we also have p_n →_p p. We thus have, for H_0(A), a q limit level test with statistic L_n(A) and critical value x_{r,1−q}, the (1 − q)-th quantile of χ²_r. Moreover, under any alternative to H_0(A), we have, whatever K > 0, Pr(L_n(A) > K) → 1, so the chi-square tests will be strongly consistent. Let us assume that the probabilities p_1, ⋯, p_m correspond to the treatments of a fixed effects model in which d factors, with h_1, ⋯, h_d levels, cross (for instance, probabilities of cures for different treatments). We then have a model which, as can be seen in [32], tests the effects and interactions of the factors. These effects and interactions correspond to subsets of d̄ = {1, ⋯, d}. Thus, the null set, φ, will correspond to the general mean value; if the set has one element, it will be associated with the effects of the factor whose index belongs to the set; otherwise, if the set has more than one element, it will be associated with the interaction between the levels of the factors with those indices. The sets can be ordered by their indexes. Putting ϕ_j to indicate the j-th set, j = 1, ⋯, 2^d, we have, as can be seen in [30], the corresponding matrices, and then, with A_j = A_{ϕ_j}, j = 1, ⋯, 2^d, for testing the hypotheses H_0(A_j), we have the statistic L_n(A_j), with g_j degrees of freedom, j = 1, ⋯, 2^d. Another interesting case is that of cross-nesting factors. The factors in the h-th group nest within combinations of levels of the first v factors in group h, and we also put b_{h,0} = 1, h = 1, ⋯, d.

Each of these combinations contains a number of combinations of levels of the remaining factors, and we have the matrices built from T_r, where T_r is obtained by deleting the first row, equal to (1/√r)1_rᵗ, of an r × r orthogonal matrix, and ⊗ indicates the Kronecker matrix product. These matrices have the corresponding ranks, where b_{h,0} = 1, h = 1, ⋯, d. The effects and interactions in this cross-nesting are associated with the vectors j = (j_1, ⋯, j_d), with j_h = 0, ⋯, f_h, h = 1, ⋯, d, so j is associated with matrices A_j with ranks g_j. To test the hypotheses H_0(A_j): A_j p = 0, we have the statistic L_n(A_j) with g_j degrees of freedom, j ∈ Γ.
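To make the test concrete, here is a simulation sketch of the statistic L_n(A) under a null hypothesis H_0(A): Ap = 0 (assuming Python with NumPy and SciPy; the contrast matrix A, the uniform p and the sample sizes are illustrative choices). The rejection rate should approach the nominal level q:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
p = np.full(4, 0.25)                           # satisfies A p = 0 below
A = np.array([[1.0, -1.0, -1.0, 1.0]]) / 2.0   # a 2x2 interaction contrast
n, reps, q = 1000, 4000, 0.05

def L_n(A, p_hat, n):
    """Wald-type statistic n (A p_n)^t (A U(p_n) A^t)^+ (A p_n)."""
    U_hat = np.diag(p_hat) - np.outer(p_hat, p_hat)
    return n * (A @ p_hat) @ np.linalg.pinv(A @ U_hat @ A.T) @ (A @ p_hat)

# Degrees of freedom r = rank(A U(p) A^t) and critical value x_{r, 1-q}.
r = np.linalg.matrix_rank(A @ (np.diag(p) - np.outer(p, p)) @ A.T)
crit = chi2.ppf(1 - q, df=r)

rejections = 0
for c in rng.multinomial(n, p, size=reps):
    rejections += L_n(A, c / n, n) > crit
print(rejections / reps)                        # should be close to q
```

Under a fixed alternative, L_n(A) grows like n, which is the strong consistency property stated above.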

Confidence Regions
According to the continuity of the Moore-Penrose inverses, as can be seen in [30], pp. 221-224, we have U(p_n)⁺ →_p U(p)⁺, so the PCMT gives the limit chi-square distributions of the corresponding quadratic forms, both in general and for models with factors crossing and cross-nesting. We can now apply Proposition 6 to obtain the confidence regions.

Non-Linear Statistics
We assume that the component functions g_l(·), l = 1, ⋯, w, of G(·) have continuous partial derivatives of the second order, to apply Proposition 2 and its corollary in showing that:

√n(G(p_n) − G(p)) ∼ N(0, J(p)U(p)J(p)ᵗ),

where J(p) is the Jacobian matrix whose row vectors are the gradients of the g_l(·).
An interesting application of this result is to log-linear models, as can be seen in [12], in which we use:

G(p_n) = log(p_n) = (log p_{n,1}, ⋯, log p_{n,m}), G(p) = log(p) = (log p_1, ⋯, log p_m).

The Jacobian matrix is now D(p)⁻¹, so the limit covariance matrix is:

U_0(p) = D(p)⁻¹ U(p) D(p)⁻¹ = D(p)⁻¹ − 1_m 1_mᵗ,

since D(p)⁻¹p = 1. Moreover, since D(p)⁻¹ is invertible, we have rank(U_0(p)) = rank(U(p)) = m − 1. Thus, U_0(p)p = 0, so (1/‖p‖)p, belonging to the nullity space N_0(p) of U_0(p), constitutes an orthonormal basis for that space. Since U_0(p) is symmetrical, its range space R_0(p) will be N_0(p)⊥, and the orthogonal projection matrix on R_0(p) will be:

T_0(p) = I_m − (1/‖p‖²) p pᵗ.

Let A_0 have row vectors that constitute an orthonormal basis for a sub-space ∇_0. Putting l_0(p) = log(p) and l_0(p_n) = log(p_n), we have, according to the PCMT, the limit normal distribution for √n A(l_0(p_n) − l_0(p)). Thus, to test:

H_0(A): A l_0(p) = 0,

we have the limit q level chi-square test with the statistic n l_0(p_n)ᵗ Aᵗ (A U_0(p) Aᵗ)⁺ A l_0(p_n) and the critical value x_{r_0(A),1−q}, with r_0(A) = rank(AU_0(p)Aᵗ). These tests will be strongly consistent, as follows from l_0(p_n) →_p l_0(p). Moreover, we have the limit level 1 − q confidence ellipsoids. We can now apply Proposition 6 to obtain the simultaneous confidence intervals.
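These identities for U_0(p) can be verified numerically; the following sketch (assuming Python with NumPy, with an arbitrary p) checks that D(p)⁻¹U(p)D(p)⁻¹ = D(p)⁻¹ − 1 1ᵗ, that U_0(p)p = 0 and that rank(U_0(p)) = m − 1:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])    # illustrative probabilities
m = p.size
Up = np.diag(p) - np.outer(p, p)      # U(p) = D(p) - p p^t
Dinv = np.diag(1.0 / p)               # Jacobian of log(p), i.e. D(p)^{-1}

# U_0(p) = D(p)^{-1} U(p) D(p)^{-1} simplifies to D(p)^{-1} - 1 1^t,
# because D(p)^{-1} p = 1.
U0 = Dinv @ Up @ Dinv
simplified = Dinv - np.ones((m, m))

# p spans the nullity space N_0(p): U_0(p) p = D(p)^{-1} p - 1 (1^t p) = 0.
print(np.max(np.abs(U0 - simplified)))
print(np.max(np.abs(U0 @ p)))
```

The same computation gives the orthogonal projection matrix on the range space, I_m − ppᵗ/‖p‖², used for the confidence regions.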

Numerical Example
In this section, we apply our results for non-linear statistics to a dataset on coronary heart disease analysed in [12,33], with the log-linear model, a non-linear statistic. In all, a total of 1330 sick patients were categorized with respect to three variables, namely blood pressure, serum cholesterol and whether they had coronary heart disease or not. Blood pressure, the first variable, had four categorical levels. The second variable, serum cholesterol, also had four categorical levels, while the third variable, which indicated the presence of coronary heart disease, had two levels. So, in all, the data had 32 classes. Kindly refer to Section 5.6 of [12] for details on the variables and their levels, as well as a cross-classification of the frequencies for each category.
To proceed with the analysis by the application of our method, as can be seen in Equation (63), we first estimated the probabilities of each of the 32 classes. In order to apply the non-linear statistics, we calculated the logarithms of those estimates. As in Equation (63), the composite functions of the estimates were asymptotically normally distributed with a null mean vector and a certain covariance matrix. We proceeded by evaluating our covariance matrix: firstly by determining the Jacobian matrix of gradients of our composite function, as in Equations (65) and (66), and then by determining the covariance matrix for our composite function.
To test the hypothesis of the absence of effects and interactions, we started by obtaining matrices A using orthogonal matrices.
The first orthogonal matrix we considered is P_2. Since we have 32 classes, we build the P matrices up to P_32 with Kronecker matrix products, where ⊗ is the Kronecker product. After obtaining the orthogonal matrix P_32, we determined our A matrices. Since we have three factors, we have eight A matrices, defined as follows.
Let the set containing the factor indexes be ϕ = {1, 2, 3}; then, the subsets of ϕ correspond to the factor effects and interactions. Now, the A_j, j = 1, ⋯, 8, are obtained from P_32 as follows: A_1 is the first row; A_2 is the second to fourth rows; A_3 is the fifth, ninth and thirteenth rows; A_4 is the sixth, seventh, eighth, tenth, eleventh, twelfth, fourteenth, fifteenth and sixteenth rows. Similarly, A_5 is the seventeenth row; A_6 is the eighteenth, nineteenth and twentieth rows; A_7 is the twenty-first, twenty-fifth and twenty-ninth rows; and A_8 is the twenty-second, twenty-third, twenty-fourth, twenty-sixth, twenty-seventh, twenty-eighth, thirtieth, thirty-first and thirty-second rows. Now, according to Equation (44), we have the covariance matrices V_j(p), where W(p) is defined in Equation (66), for the statistics Z_j(p), where G(p) is given by Equation (64). The sums of squares (SS) for the statistics Z_j(p), j = 1, ⋯, 8, are given through the corresponding quadratic forms, where n is the total number of our observations and ⁺ indicates the Moore-Penrose inverses of the V_j(p), j = 1, ⋯, 8. The SS_j, j = 1, ⋯, 8, are chi-squares with degrees of freedom given by rank(A_j), j = 1, ⋯, 8. Table 1 is an ANOVA-like table that presents the results of our analysis of the coronary heart disease data. It gives the sources of variation: the general mean, the main effects and the interaction effects. It also presents the degrees of freedom and the sums of squares for these effects. The significance levels of these effects are indicated by *. From Table 1, we see that the general mean is highly significantly different from 0. Moreover, the factors blood pressure, serum cholesterol and coronary heart disease, as well as both interactions of the first and second factors with the third factor, were highly significant. The interactions in which the first and second factors both partook were not significant.
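The Kronecker construction of P_32 can be sketched as follows (assuming Python with NumPy; the seed matrix P_2, a normalized 2 × 2 Hadamard matrix whose first row is proportional to 1ᵗ, is our illustrative choice, since the text does not display it):

```python
import numpy as np

# Hypothetical seed: a 2x2 orthogonal matrix with first row (1/sqrt(2)) 1^t.
P2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
P4 = np.kron(P2, P2)                  # 4x4 orthogonal matrix
P32 = np.kron(np.kron(P4, P4), P2)    # 32x32, for the 4 x 4 x 2 classification

# Kronecker products of orthogonal matrices are orthogonal, and the first
# row of P32 is proportional to 1^t, matching the general-mean row A_1.
A1 = P32[:1]                          # first row
A2 = P32[1:4]                         # second to fourth rows
print(np.allclose(P32 @ P32.T, np.eye(32)))
```

The remaining A_j blocks are extracted from P_32 by the row indices listed above.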
We point out that, by using our approach, we were able to consider the interaction between the three factors. In the classical analysis of the data, as can be seen in [12] (Section 5.6), this would not be possible. We also applied our method to analyse the General Social Survey (GSS 2008). The General Social Survey (GSS) conducts basic scientific research on the structure and development of American society, with a data-collection program designed both to monitor societal change within the United States and to compare the United States with other nations [34]. We considered three categorical variables: "Education", "Political party affiliation" and "Gender". The "Education" variable had 5 categorical levels, while "Political party affiliation" and "Gender" had 7 and 2 categorical levels, respectively, as analysed in [7], Section 3.2.4.
Just as Table 1, Table 2, an ANOVA-like table, presents the results of our analysis. It gives the sources of variation, that is, the general mean, the main effects and the interaction effects, for our GSS (2008) data. It also presents the degrees of freedom and the sums of squares for these effects. The significance levels of these effects are indicated by *. From the table, both factors 1 and 2 have significant effects, as well as a significant interaction. The interaction between factors 2 and 3 is also significant. Factor 3 had neither significant effects nor interactions, with the exception of its interaction with factor 2.
We can thus conclude that the core of significance was in factors 1 and 2 (education and political party affiliation) and that both genders behave similarly.

Comparing Procedures
Our procedure is based on the asymptotic distributions given by:

√n(p_n − p) ∼ N(0, U(p))

and by:

√n(l(p_n) − l(p)) ∼ N(0, J(p)U(p)J(p)ᵗ),

where J(·) is the Jacobian matrix of l(·) (the row vectors of J(·) are the gradients of the components of l(·)). This enabled us, given an orthogonal partition of R^m, where m is the number of probabilities, to test the hypotheses H_0(A_j): A_j p = 0. This way, we overcame the requirement of using hierarchical models.

Conclusions
Multinomial models having their limit distribution given by √n(p_n − p) ∼ N(0, U(p)), with rank(U(p)) = m − 1 and p belonging to a compact set, led us to establish the parametrized continuous mapping theorem (PCMT) to carry out inference for them. Namely, we obtained confidence ellipsoids and simultaneous confidence intervals. Moreover, we obtained chi-square tests for the hypotheses:

H_0(A): Ap = 0,

which led to ANOVA-like inference for both multinomial and log-linear models. We point out that the hierarchical assumptions made on log-linear models are now no longer necessary, and all effects and interactions can be tested without any restrictions. Indeed, this was used in the two numerical examples we presented. In addition to this, the replacement of F-tests by chi-square tests increases the power of our inference by replacing a finite number of degrees of freedom for the error with an infinity of them.

Funding: This work was partially supported by Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through project UIDB/00297/2020.

Data Availability Statement:
The data on coronary heart disease used in this article were obtained from [12,33].