Minimum Penalized ϕ-Divergence Estimation under Model Misspecification

This paper focuses on the consequences of assuming a wrong model for multinomial data when minimum penalized ϕ-divergence estimators, also known as minimum penalized disparity estimators, are used to estimate the model parameters. These estimators are shown to converge to a well-defined limit. As an application of the results obtained, it is shown that a parametric bootstrap consistently estimates the null distribution of a certain class of test statistics for detecting model misspecification. An illustrative application to the accuracy assessment of the thematic quality in a global land cover map is included.


Introduction
In many practical settings, individuals are classified into a finite number of unique nonoverlapping categories, and the experimenter collects the number of observations falling in each of these categories. In statistics, this sort of data is called multinomial data. Examples arise in many scientific disciplines: in economics, when dealing with the number of different types of industries observed in a geographical area; in biology, when counting the number of individuals belonging to one of $k$ species (see, for example, Pardo [1], pp. 94-95); in sports, when considering the number of injured players in soccer matches (see, for example, Pardo [1], p. 146); and many others.
When dealing with multinomial data, one often finds zero cell frequencies, even for large samples. Although many examples can be given, we will focus on the following one, since two related data sets will be analyzed in Section 4. Zero cell frequencies are usually observed when the quality of geographic information data is assessed and, specifically, when we pay attention to the thematic component of this quality. Roughly speaking, the thematic quality refers to the correctness of the qualitative aspect of an element (pixel, feature, etc.). To assess the thematic accuracy, a comparison is needed between the label considered as true for a feature and the label assigned to the same feature after a classification (among a number of labels previously stated). This way, each element/feature, which really belongs to a particular category, can be classified as belonging to the same category (correct assignment), or as belonging to another one (incorrect assignment). Given a sample of $n$ elements belonging to a particular category, after collecting the number of elements correctly classified, $X_1$, and the number of incorrect classifications in a set of $k-1$ possible categories, $X_i$, $i = 2, \ldots, k$, we obtain a multinomial vector $(X_1, X_2, \ldots, X_k)^t$, for which small or zero cell frequencies are often observed associated with the incorrect classifications, $X_i$, $i = 2, \ldots, k$.
φ-divergence assigns to the empty cells. The resulting estimator is called the minimum penalized φ-divergence estimator (MPφE). Moreover, Mandal et al. [4] have shown that such estimators have the same asymptotic properties as the MφEs. Specifically, they are strongly consistent and, conveniently normalized, asymptotically normal. To derive these asymptotic properties, it is assumed that the probability model is correctly specified, that is to say, that we are sure that $\pi \in \mathcal{P}$. If the parametric model is not correctly specified, Jiménez-Gamero et al. [5] have shown that, under certain assumptions, the MφEs still have a well-defined limit, and, conveniently normalized, they are asymptotically normal. For the MLE, these results were known from those in [6]. Because, as argued before, the use of penalized φ-divergences may lead to better performance of the resulting estimators, the aim of this piece of research is to investigate the asymptotic properties of the MPφEs under model misspecification. If the model considered is true, we obtain as a particular case the results in [4].
The usefulness of the results obtained is illustrated by applying them to the problem of testing goodness-of-fit to the parametric family $\mathcal{P}$, $H_0: \pi \in \mathcal{P}$, against the alternative $H_1: \pi \notin \mathcal{P}$, using as a test statistic a penalized $\phi_1$-divergence between a nonparametric estimator of $\pi$, the relative frequencies, and a parametric estimator of $\pi$, obtained by assuming that the null hypothesis is true, $P(\hat\theta)$, $\hat\theta$ being an MP$\phi_2$E. Here, $\phi_1$ and $\phi_2$ may differ. The convenience of using this type of test statistic is justified in Mandal et al. [7]. Although these authors show that, under $H_0$, such test statistics are asymptotically distribution free, the asymptotic approximation to the null distribution of the test statistics in this class is rather poor. Some numerical examples illustrate this unsatisfactory behavior of the asymptotic approximation. By using the fact that the MPφE always converges to a well-defined limit, whether the model in $H_0$ is true or not, we prove that the bootstrap consistently estimates the null distribution of these test statistics. We then revisit the previously cited numerical examples to show the usefulness of the bootstrap approximation which, despite requiring more computing time, is more accurate than the asymptotic null distribution for small and moderate sample sizes.
The rest of the paper is organized as follows. Section 2 studies certain asymptotic properties of MP$\phi_2$Es; specifically, conditions are given for the strong consistency and asymptotic normality. Section 3 uses such results to prove that a parametric bootstrap provides a consistent estimator of the null distribution of test statistics based on penalized φ-divergences for testing $H_0$. Section 4 displays an application of the results obtained in the context of a classification task for a land cover map.
Before ending this section we introduce some notation: all limits in this paper are taken as $n \to \infty$; $\xrightarrow{L}$ denotes convergence in distribution; $\xrightarrow{P}$ denotes convergence in probability; $\xrightarrow{a.s.}$ denotes almost sure convergence; let $\{A_n\}$ be a sequence of random variables and let $\epsilon \in \mathbb{R}$, then $A_n = O_P(n^{-\epsilon})$ means that $n^{\epsilon} A_n$ is bounded in probability, $A_n = o_P(n^{-\epsilon})$ means that $n^{\epsilon} A_n \xrightarrow{P} 0$, and $A_n = o(n^{-\epsilon})$ means that $n^{\epsilon} A_n \xrightarrow{a.s.} 0$; $N_k(\mu, \Sigma)$ denotes the $k$-variate normal law with mean $\mu$ and variance matrix $\Sigma$; all vectors are column vectors; the superscript $t$ denotes transpose; if $x \in \mathbb{R}^k$, with $x^t = (x_1, \ldots, x_k)$, then $\mathrm{Diag}(x)$ is the $k \times k$ diagonal matrix whose $(i, i)$ entry is $x_i$, $1 \leq i \leq k$, and $I_k$ denotes the $k \times k$ identity matrix; to simplify notation, all 0s appearing in the paper represent vectors of the appropriate dimension.
Assumption 2 is assumed when dealing with estimators based on minimum divergence, since it lets us take Taylor series expansions of $D_\phi(\pi, P(\theta))$, which is useful to derive asymptotic properties of the MφEs. For example, Section 3 of Lindsay [9] assumes that the function $\phi$ (he calls $G$ what we call $\phi$) is a thrice differentiable function (which is stronger than Assumption 2); Theorem 3 in Morales et al. [3] requires, among other conditions, $\phi$ to meet Assumption 2 to derive the consistency and asymptotic normality of MφEs.
Assumption 2 is also assumed in Mandal et al. [4] (they call $G$ what we call $\phi$) to study the consistency and asymptotic normality of MPφEs. Specifically, these authors show that, if $\pi \in \mathcal{P}$ and $\theta_0$ is the true parameter value, then, under suitable regularity conditions including Assumption 2, the MPφE is consistent for $\theta_0$, and $\sqrt{n}(\hat\theta_{\phi,h} - \theta_0)$ is asymptotically normal with a mean of 0 and a variance matrix equal to the inverse of the information matrix.
Next we will only assume that $\pi \in \Delta_{0k}$, that is, the assumption that $\pi \in \mathcal{P}$ is dropped. In this context, we prove that the MPφE is consistent for $\theta_0$, where now $\theta_0$ is the parameter vector that minimizes $D_{\phi,h}(\pi, P(\theta))$, that is to say, $\theta_0 = \arg\min_\theta D_{\phi,h}(\pi, P(\theta))$. Note that $\theta_0$ also depends on $\phi$ and $h$, so to be rigorous we should denote it by $\theta_{0,\phi,h}$, but to simplify notation we will simply denote it as $\theta_0$. We also show that $\sqrt{n}(\hat\theta_{\phi,h} - \theta_0)$ is asymptotically normal with a mean of 0. With this aim, we will also assume the following.
Assumption 3 is assumed in papers on estimators based on minimum divergence estimation. For example, it is Assumption A3(b) in [6], which states that it is the fundamental identification condition for quasi-maximum likelihood estimators to have a well-defined limit; it is contained in Assumptions 7 and 9 in [10], required for minimum chi-square estimators to have a well-defined limit; and it also coincides with Assumption 30 in [9], imposed for the same reason.
Let $\theta_0$ be as defined in Assumption 3. Then $P(\theta_0)$ is the $(\phi, h)$-projection of $\pi$ on $\mathcal{P}$. Section 3 in [11] shows that Assumption 3 holds for two-way tables when $\mathcal{P}$ is the uniform association model, so the $(\phi, h)$-projection always exists for such a model. Nevertheless, this projection may not exist, or may not be uniquely defined. See Example 2 in [12] for an instance where there is no unique minimum (because, although $\Theta$ in that example is convex, the family $\{P(\theta), \theta \in \Theta\}$ is not convex, so the uniqueness of the projection is not guaranteed). Let $\Delta_k(\phi, \mathcal{P}, h) = \{\pi \in \Delta_{0k}$ such that Assumption 3 holds$\}$.
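The notion of a $(\phi, h)$-projection can be made concrete with a small numerical sketch. The code below is only illustrative and rests on several assumptions that are not taken from the text: the penalized divergence is assumed to have the form $\sum_{i:\pi_i>0} p_i(\theta)\phi(\pi_i/p_i(\theta)) + h\sum_{i:\pi_i=0} p_i(\theta)$, the generator is the Kullback-Leibler one, the trinomial family has the same form as the one used later in Example 2, and the minimizer is approximated by a plain grid search. All function names are hypothetical.

```python
import math


def phi_kl(x):
    """Kullback-Leibler generator phi(x) = x*log(x) - x + 1, for x > 0."""
    return x * math.log(x) - x + 1.0


def family(theta):
    """Illustrative trinomial family P(theta), valid for 0 < theta < 1/4."""
    return [0.5 - 2.0 * theta, 0.5 + theta, theta]


def penalized_divergence(pi, p, phi, h):
    """Assumed penalized divergence: cells with pi_i = 0 contribute h * p_i
    instead of phi(0) * p_i; all other cells contribute p_i * phi(pi_i / p_i)."""
    total = 0.0
    for pi_i, p_i in zip(pi, p):
        if pi_i > 0.0:
            total += p_i * phi(pi_i / p_i)
        else:
            total += h * p_i
    return total


def projection(pi, phi, h, lo=1e-6, hi=0.25 - 1e-6, steps=20000):
    """Grid-search approximation of theta_0 = argmin_theta D_{phi,h}(pi, P(theta)).
    Returns the approximate minimizer and the attained divergence."""
    best_theta, best_value = lo, float("inf")
    for j in range(steps + 1):
        theta = lo + (hi - lo) * j / steps
        value = penalized_divergence(pi, family(theta), phi, h)
        if value < best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


# Correctly specified case: pi = P(0.1), so the projection recovers theta = 0.1.
theta_in, d_in = projection(family(0.1), phi_kl, h=0.5)

# Misspecified case: pi is outside the family (p_2 >= 0.5 is impossible to match
# together with p_3 = 0.3), yet a minimizer theta_0 still exists.
theta_out, d_out = projection([0.2, 0.5, 0.3], phi_kl, h=0.5)

print(theta_in, d_in)
print(theta_out, d_out)
```

In the misspecified case the attained divergence stays strictly positive, and $P(\theta_0)$ is simply the member of the family closest to $\pi$ in the chosen divergence. Since both illustrative vectors $\pi$ have no empty cells, the value of $h$ does not affect the result here.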

Remark 1.
Observe that, if $m = k$, then the penalization has no effect asymptotically; by contrast, if $m < k$, then the presence of the tuning parameter $h$ influences the covariance matrix of the asymptotic law of $\sqrt{n}(\hat\theta_{\phi,h} - \theta_0)$ and $\sqrt{n}(P(\hat\theta_{\phi,h}) - P(\theta_0))$.

Remark 2.
If $\pi \in \mathcal{P}$, we obtain as a particular case the results in Mandal et al. [4]. Our conditions are weaker than those in [4]. The reason is that they allow an infinite number of categories, while we are assuming that such a number is finite, $k$. Therefore, when the number of categories is finite, the assumptions in [4] for the consistency and asymptotic normality of the MPφE can be weakened.
As a consequence of Theorem 1, the following corollary gives the asymptotic behavior of $D_{\phi_1,h_1}(\hat\pi, P(\hat\theta_{\phi_2,h_2}))$, for arbitrary $\phi_1$, $\phi_2$, and $h_1$, $h_2$, that may or may not coincide. Part (a) of Corollary 1, which assumes that the model $\mathcal{P}$ is correctly specified, has been previously proven in [7]. It is included here for the sake of completeness. Part (b), which describes the limit in law under alternatives, is, to the best of our knowledge, new.
Corollary 1. Let $\mathcal{P}$ be a parametric family satisfying Assumption 1. Let $\phi_1$ and $\phi_2$ be two real functions satisfying Assumption 2. Let $X \sim M_k(n; \pi)$ with $\pi \in \Delta_k(\phi, \mathcal{P}, h)$.

Remark 3.
If $\pi \in \mathcal{P}$, the asymptotic behavior of the statistic $T$ does not depend either on $\phi_1$, $\phi_2$, or on $h_1$, $h_2$. In fact, the asymptotic law of $T$ is the same as if non-penalized divergences were used.

Remark 4.
When $\pi \in \Delta_k(\phi_2, \mathcal{P}, h_2) - \mathcal{P}$, if $m = k$, then the asymptotic distribution of $W$ does not depend on $h_1$, $h_2$; by contrast, if $m < k$, then the asymptotic distribution of $W$ does depend on $h_1$ and $h_2$.

Application to Bootstrapping Goodness-Of-Fit Tests
As observed in Remark 5, the test that rejects $H_0$ when $T \geq \chi^2_{k-s-1,1-\alpha}$ is asymptotically correct and consistent against fixed alternatives. Nevertheless, the $\chi^2$ approximation to the null distribution of the test statistic is rather poor. Next we illustrate this fact with three examples. The last one is motivated by a real data set application in Section 4. All computations have been performed using programs written in the R language [13].
Example 1. Let $X \sim M_3(n; \pi)$, with $\pi \in \mathcal{P}$ so that
The problem of testing goodness-of-fit to this family is dealt with by considering as test statistic a penalized $\phi_1$-divergence and an MP$\phi_2$E, with $\phi_1$ and $\phi_2$ two members of the power-divergence family, defined as follows:
We thank an anonymous referee for pointing out that the power-divergence family is also known as the α-divergence family (see, for example, Section 4 of Amari [14]).
In order to evaluate the performance of the $\chi^2$ approximation to the null distribution of $T$, we carried out an extensive simulation experiment. As a preliminary part of the simulation experiment, we evaluated the possible effect of the tuning parameter $h_2$ on the accuracy of the MP$\phi_2$E. To this end, we generated 10,000 samples of size 200 from the parametric family with $\theta = 0.3333$, and calculated the MP$\phi_2$E with $h_2 = 0.5, 1, 2, 5, 10$ and $\phi_2 = PD_{-2}$, which corresponds to the modified chi-square test statistic (see, for example, [1], p. 114). We calculated the root mean square deviation (RMSD) of the resulting 10,000 estimates, obtaining 0.00156, 0.00128, 0.00128, 0.00128, and 0.00128, respectively. According to these results, there are rather small differences in the performance of the MP$\phi_2$E for the values of $h_2$ considered. Because of this, we fixed $\phi_2 = PD_{-2}$ and $h_2 = 0.5, 1, 2$.
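For reference, a common way of writing the power-divergence family is the Cressie-Read parameterization given below. It is stated here as an assumption, in its textbook form, and is not quoted from the authors' own display; it is, however, consistent with the correspondences mentioned in the text ($PD_{-2}$ with the modified chi-square statistic and $PD_1$ with the chi-square statistic).

```latex
PD_\lambda(x) =
\begin{cases}
\dfrac{x^{\lambda+1} - x - \lambda (x - 1)}{\lambda(\lambda + 1)}, & \lambda \neq 0, -1, \\[2mm]
x \log x - x + 1, & \lambda = 0, \\[1mm]
-\log x + x - 1, & \lambda = -1.
\end{cases}
```

With this form, $PD_1(x) = (x-1)^2/2$ and $PD_{-2}(x) = (x-1)^2/(2x)$. If the divergence is taken to be $\sum_i p_i(\theta)\,\phi(\hat\pi_i/p_i(\theta))$, which is also an assumption here, these two generators give $\sum_i (\hat\pi_i - p_i(\theta))^2/(2p_i(\theta))$ and $\sum_i (\hat\pi_i - p_i(\theta))^2/(2\hat\pi_i)$, respectively. The second expression is undefined when some $\hat\pi_i = 0$, which is the situation the penalization is meant to handle.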
Next, to study the goodness of the asymptotic approximation, we generated 10,000 samples of size $n = 100$ from the parametric family with $\theta = 0.3333$, and calculated the test statistic $T$ with $h_1 = h_2 = 0.5$ and $\phi_1(x) = \phi_2(x) = PD_{-2}(x)$, as well as the associated p-values corresponding to the asymptotic null distribution. We then computed the fraction of these p-values that are less than or equal to the nominal values $\alpha = 0.05, 0.10$ (top and bottom in the tables). This experiment was repeated for $n = 150, 200$, $h_1 = h_2 = 1, 2$, $\phi_1 = PD_1$ (which corresponds to the chi-square test statistic) and $\phi_1 = PD_2$. Table 1 shows the results obtained. We also considered the case $h_1 \neq h_2$, obtaining quite close outcomes. Table 2 displays the results obtained for $n = 200$ and $\phi_1 = \phi_2 = PD_{-2}$. Looking at these tables, we conclude that the asymptotic null distribution does not provide an accurate estimation of the null distribution of $T$, since the type I error probabilities are much greater than the nominal values, 0.05 and 0.10. Therefore, other approximations of the null distribution should be studied.
Table 1. Type I error probabilities obtained using asymptotic approximation for Example 1 with $\theta = 0.3333$, $\phi_1 = PD_\lambda$, $\lambda \in \{-2, 1, 2\}$, $\phi_2 = PD_{-2}$, and $h_1 = h_2 \in \{0.5, 1, 2\}$.
Example 2. Let $X \sim M_3(n; \pi)$, with $\pi \in \mathcal{P}$ so that $p_1(\theta) = 0.5 - 2\theta$, $p_2(\theta) = 0.5 + \theta$, $p_3(\theta) = \theta$, $0 < \theta < 1/4$.
We repeated the simulation schedule described in Example 1 for this law with $\theta = 0.24$. Tables 3 and 4 report the obtained results. In contrast to the results for Example 1, where the asymptotic approximation gives a rather liberal test, in this case the resulting test is very conservative. Therefore, we again conclude that the asymptotic null distribution does not provide an accurate estimation of the null distribution of $T$.
Table 3. Type I error probabilities obtained using asymptotic approximation for Example 2 with
Table 4. Type I error probabilities obtained using asymptotic approximation for Example 2 with
Example 3. Let $X \sim M_4(n; \pi)$, with $\pi \in \mathcal{P}$ so that
We repeated the simulation schedule described in Example 1 for this law with $\theta = 0.8$. Tables 5 and 6 report the results obtained. Looking at these tables, we see that the test based on asymptotic approximation is liberal, and conclude, as in the previous examples, that other approximations of the null distribution should be considered.
Table 5. Type I error probabilities obtained using asymptotic approximation for Example 3 with $\theta = 0.8$,
The reason for the unsatisfactory results in the three examples is that the asymptotic approximation requires unaffordably large sample sizes when some cells have extremely small probabilities, which cause zero cell frequencies to appear. To appreciate this fact, notice that Example 1 requires $n > 30{,}000$ to obtain expected cell frequencies greater than 10.
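The link between small cell probabilities and empty cells can be checked with elementary arithmetic. The sketch below uses the family of Example 2 with $\theta = 0.24$, whose smallest cell probability is $p_1 = 0.5 - 2\theta = 0.02$, and the sample sizes used in the simulations. It relies only on the standard facts that a multinomial cell with probability $p$ has expected count $np$ and is empty with probability $(1-p)^n$. The function names are hypothetical.

```python
from fractions import Fraction
import math


def example2_probs(theta):
    """Cell probabilities of the family in Example 2 (exact arithmetic)."""
    return [Fraction(1, 2) - 2 * theta, Fraction(1, 2) + theta, theta]


def prob_empty(n, p):
    """Probability that a multinomial cell with probability p is empty: (1 - p)^n."""
    return (1.0 - float(p)) ** n


def min_n_for_expected_above(p, target):
    """Smallest integer n such that the expected count n * p exceeds target."""
    return math.floor(Fraction(target) / p) + 1


theta = Fraction(24, 100)
p_min = min(example2_probs(theta))  # smallest cell probability, 1/50

for n in (100, 150, 200):
    # sample size, expected count of the rarest cell, probability that it is empty
    print(n, float(n * p_min), prob_empty(n, p_min))

# Sample size needed for every expected cell frequency to exceed 10
print(min_n_for_expected_above(p_min, 10))
```

Even at $n = 200$ the rarest cell has an expected count of only 4, so the sample sizes considered are far from the regime in which a chi-square approximation is usually trusted.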
Motivated by these examples, the aim of this section is to study another way of approximating the null distribution of $T$, the bootstrap. The null bootstrap distribution of $T$ is the conditional distribution of $T^*$ given $(X_1, \ldots, X_k)$, where $T^*$ is computed as $T$ with $\hat\pi$ and $\hat\theta_{\phi_2,h_2}$ replaced by $\hat\pi^*$ and $\hat\theta^*_{\phi_2,h_2}$; here $\hat\pi^*$ is defined as $\hat\pi$ with $(X_1, \ldots, X_k)$ replaced by $(X^*_1, \ldots, X^*_k) \sim M_k(n; P(\hat\theta_{\phi_2,h_2}))$, and $\hat\theta^*_{\phi_2,h_2} = \arg\min_\theta D_{\phi_2,h_2}(\hat\pi^*, P(\theta))$. Let $P_*$ denote the bootstrap conditional probability law, given $(X_1, \ldots, X_k)$. The next theorem gives the weak limit of $T^*$.
Theorem 2. Let $\mathcal{P}$ be a parametric family satisfying Assumption 1. Let $\phi_1$ and $\phi_2$ be two real functions satisfying Assumption 2. Let $X \sim M_k(n; \pi)$ with $\pi \in \Delta_k(\phi, \mathcal{P}, h)$. Then
Recall that, from Corollary 1(a), when $H_0$ is true, the test statistic $T$ converges in law to a $\chi^2_{k-s-1}$ law. Thus, the result in Theorem 2 implies the consistency of the null bootstrap distribution of $T$ as an estimator of the null distribution of $T$. It is important to remark that the result in Theorem 2 holds whether $H_0$ is true or not, that is, the bootstrap properly estimates the null distribution, even if the available data do not obey the law in the null hypothesis. This is due to the fact that, under the assumed conditions, the MPφE always converges to a well-defined limit.

Remark 6.
Properties of the Bootstrap Test. Similarly to Remark 5, as a consequence of Corollary 1(a) and Theorem 2, we have that, for testing $H_0$ vs. $H_1$, the test that rejects the null hypothesis when $T \geq T^*_{1-\alpha}$ is asymptotically correct, in the sense that $P_0(T \geq T^*_{1-\alpha}) \to \alpha$, where $T^*_{1-\alpha}$ stands for the $1-\alpha$ percentile of the bootstrap distribution of $T$. From Corollary 1(b) and Theorem 2, it follows that such a test is consistent against fixed alternatives $\pi \in \Delta_k(\phi_2, \mathcal{P}, h_2) - \mathcal{P}$, in the sense that $P(T \geq T^*_{1-\alpha}) \to 1$.
3. Approximate the p-value by means of the expression
For the numerical experiments previously described, whose results are displayed in Tables 1-6, we also calculated the bootstrap p-values. This was done by generating $B = 1000$ bootstrap samples to approximate each p-value, and calculating the fraction of these p-values that are less than or equal to 0.05 and 0.10 (top and bottom in the tables). Tables 7-12 display the estimated type I error probabilities obtained by using the bootstrap approximation as well as those obtained with the asymptotic approximation (bootstrap, B, and asymptotic, A, in the tables), taken from Tables 1-6 in order to facilitate the comparison between them. Looking at Tables 7-12, we conclude that the bootstrap approximation is superior to the asymptotic one for small and moderate sample sizes, since in all cases the bootstrap type I error probabilities were closer to the nominal values than those obtained using the asymptotic null distribution. This superior performance of the bootstrap null distribution estimator has been noticed in other inferential problems where φ-divergences are used as test statistics (see, for example, [5,12,15,16]).
Table 10. Asymptotic and bootstrap type I error probabilities for Example 2 with $n = 200$, $\theta = 0.24$,
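The parametric bootstrap described in this section can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' R implementation: the penalized divergence is assumed to give empty cells the weight $h\,p_i(\theta)$, the statistic is assumed to be $T = 2n\,D_{\phi_1,h_1}(\hat\pi, P(\hat\theta_{\phi_2,h_2}))$ (the power-divergence generators satisfy $\phi''(1) = 1$), the power-divergence family is taken in its usual Cressie-Read form, the estimator is approximated by a grid search over the family of Example 2, and the p-value is taken as the fraction of bootstrap statistics that are at least as large as the observed one. All function names and the example counts are hypothetical.

```python
import random


def pd(lam, x):
    """Power-divergence generator for lam not in {0, -1} (assumed Cressie-Read form)."""
    return (x ** (lam + 1) - x - lam * (x - 1.0)) / (lam * (lam + 1.0))


def family(theta):
    """Family of Example 2: valid for 0 < theta < 1/4."""
    return [0.5 - 2.0 * theta, 0.5 + theta, theta]


def pen_div(freq, p, lam, h):
    """Assumed penalized divergence: empty cells contribute h * p_i."""
    return sum(p_i * pd(lam, f_i / p_i) if f_i > 0 else h * p_i
               for f_i, p_i in zip(freq, p))


# Interior grid on (0, 1/4), used as a crude stand-in for a numerical optimizer
GRID = [0.25 * (j + 0.5) / 400 for j in range(400)]


def estimate(freq, lam, h):
    """Grid-search approximation of the minimum penalized divergence estimator."""
    return min(GRID, key=lambda t: pen_div(freq, family(t), lam, h))


def statistic(counts, lam1, h1, lam2, h2):
    """Assumed statistic T = 2 n D_{phi1,h1}(pi_hat, P(theta_hat)); returns (T, theta_hat)."""
    n = sum(counts)
    freq = [c / n for c in counts]
    theta_hat = estimate(freq, lam2, h2)
    return 2.0 * n * pen_div(freq, family(theta_hat), lam1, h1), theta_hat


def multinomial(rng, n, p):
    """Draw one multinomial sample of size n with cell probabilities p."""
    counts = [0] * len(p)
    for cell in rng.choices(range(len(p)), weights=p, k=n):
        counts[cell] += 1
    return counts


def bootstrap_p_value(counts, lam1=-2.0, h1=1.0, lam2=-2.0, h2=1.0, B=200, seed=0):
    """Parametric bootstrap p-value: resample from P(theta_hat), recompute the
    estimator and the statistic, and count how often T* >= T observed."""
    rng = random.Random(seed)
    t_obs, theta_hat = statistic(counts, lam1, h1, lam2, h2)
    n = sum(counts)
    exceed = 0
    for _ in range(B):
        boot = multinomial(rng, n, family(theta_hat))
        t_star, _ = statistic(boot, lam1, h1, lam2, h2)
        if t_star >= t_obs:
            exceed += 1
    return t_obs, exceed / B


# Hypothetical counts with an empty first cell
t_obs, p_value = bootstrap_p_value([0, 76, 24])
print(t_obs, p_value)
```

The key design point is that the bootstrap samples are drawn from $P(\hat\theta_{\phi_2,h_2})$, a member of the null family, regardless of whether the observed data follow the model. This is what allows the resampling distribution to mimic the null distribution of $T$ both under $H_0$ and under alternatives. In practice a larger $B$ (the text uses $B = 1000$) and a proper optimizer would replace the small values and the grid search used here.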

Application to the Evaluation of the Thematic Classification in Global Land Cover Maps
This section displays the results of an application of our proposal to two real data sets related to the thematic quality assessment of a global land cover (GLC) map. The data comprise the results of two thematic classifications of the land cover category "Evergreen Broadleaf Trees" (EBL) and summarize the number of sample units correctly classified in this class, and the number of confusions with other land cover classes: "Deciduous Broadleaf Trees" (DBL), "Evergreen Needleleaf Trees" (ENL), and "Urban/Built Up" (U). The results of these two classifications were collected from two different global land cover maps, the Globcover map and the LC-CCI map (see Tsendbazar et al. [17] for additional details), and they are displayed in Table 13.
Parametric specifications of the multinomial vector of probabilities are quite attractive since they describe the classification pattern in a concise way. Because of this, given the similarity between the two observed classifications in Table 13, we are interested in finding a parametric model suitable to depict the thematic accuracy of this class in both GLC maps. For this purpose, we consider the parametric family in Equation (3) of Example 3. The presence of a zero cell frequency in each data set leads us to consider a penalized φ-divergence as a test statistic for testing goodness-of-fit to such a parametric family. Table 14 displays the observed values of the test statistic $T$ and the associated bootstrap p-values for the goodness-of-fit test with respect to the parametric family in Equation (3) for the two observed classifications of the EBL class in Table 13. Looking at this table, it can be concluded that the null hypothesis cannot be rejected in either case. Therefore, the parametric model in Equation (3) provides an adequate description of the thematic classification of the EBL class.