Log-Normal or Over-Dispersed Poisson?

: Although both over-dispersed Poisson and log-normal chain-ladder models are popular in claim reserving, it is not obvious when to choose which model. Yet, the two models are obviously different. While the over-dispersed Poisson model imposes the variance to mean ratio to be common across the array, the log-normal model assumes the same for the standard deviation to mean ratio. Leveraging this insight, we propose a test that has the power to distinguish between the two models. The theory is asymptotic, but it does not build on a large size of the array and, instead, makes use of information accumulating within the cells. The test has a non-standard asymptotic distribution; however, saddle point approximations are available. We show in a simulation study that these approximations are accurate and that the test performs well in ﬁnite samples and has high power.


Introduction
Which is the better chain-ladder model for claim reserving: over-dispersed Poisson or log-normal? While the expert may have a go-to model, the answer should be informed by the data. Choosing the wrong model could substantially influence the quality of the reserve forecast. Yet, so far, no statistical theory is available that supports the actuary in his/her decision and that allows him/her to make a solid argument in favour of either model.
We develop a test that can distinguish between over-dispersed Poisson and log-normal data generating processes, both of which have a long history in claim reserving. The test exploits that the former model fixes the variance to mean ratio across the array, while the latter assumes a common standard deviation to mean ratio. Consequently, the test statistic is based on estimators for the variation in the respective models. The idea is drawn from the econometric literature on encompassing. Intuitively, the test asks whether the null-model can accurately predict the behaviour of the rival model's variation estimator when the null-model is true.
The over-dispersed Poisson model is appealing since it naturally pairs with Poisson quasi-likelihood estimation, replicating the popular chain-ladder technique in run-off triangles (Kremer 1985, pp. 130). Furthermore, this model makes for an appealing story due to its relation to compound Poisson distributions. Such distributions give the aggregate incremental claims an interpretation as the sum over a Poisson number of claims with random individual claim amounts (Beard et al. 1984, Section 3.2). A popular method to generate distribution forecasts for the over-dispersed Poisson model is bootstrapping (England and Verrall 1999;England 2002). While in widespread use, there is so far no theory proving the validity of the bootstrap in this setting. Furthermore, in some settings, the method seems to produce unsatisfactory results.
Recently, Harnau and Nielsen (2017) developed a theory that gives the over-dispersed Poisson model a rigorous statistical footing. They propose an asymptotic framework based on infinitely-divisible distributions that keeps the dimension of the data array fixed and instead builds on large cell means. This resolves the incidental parameter problem (Lancaster 2000;Neyman and Scott 1948) that renders a standard asymptotic theory based on a large array invalid and arises since the number of parameters grows with the size of the array. The class of infinitely-divisible distributions includes compound Poisson With the theoretical results for encompassing tests between over-dispersed Poisson and generalized log-normal models in place, we show that they perform well in a simulation study. First, we demonstrate that saddle point approximations to the limiting distributions of the statistics work very well. Second, we tackle an issue that disappears in the limit: we have the choice between a number of asymptotically-identical estimators that generally differ in finite samples. Simulations reveal substantial heterogeneity in finite sample performance, but also show that some choices generally do well. Third, we show that the tests have high power for parameterizations we may realistically encounter in practice. We also find that power grows quickly with the variation in the means. The simulation study is in Section 5.
Having convinced ourselves that the tests do well in simulations, we demonstrate their application in a range of empirical applications in Section 6. First, we revisit the empirical illustration of the problem from the beginning of the paper. We show that the test has no problem rejecting one of the two rival models. Second, we consider an example that perhaps somewhat cautions against starting with a model that may be misspecified to begin with. In this application, dropping a clearly needed calendar effect turns the results of the encompassing tests upside down. Third, taking these insights into account, we implement a testing procedure that makes use of a whole range of recent results: deciding between the over-dispersed Poisson and generalized log-normal model, evaluating misspecification and testing for the need for a calendar effect.
We conclude the paper with a discussion of potential avenues for future research in Section 7. These include further misspecification tests, a theory for the bootstrap and empirical studies assessing the usefulness of the recent theoretical developments in applications.

Empirical Illustration of the Problem
We illustrate in an empirical example that the choice between the over-dispersed Poisson and (generalized) log-normal model is not always obvious. Table 1 shows a run-off triangle taken from Verrall et al. (2010, Table 1) with accident years i in the rows and development years j in the columns. Calendar years k = i + j − 1 are on the diagonals. Table 1. Run-off triangle taken from Verrall et al. (2010) with an indication for splitting into sub-samples corresponding to the first and last five accident years. Accident years i in the rows; development years j in the columns.  Kuang et al. (2015) and Harnau (2018a) model the data in Table 1 as log-normal, it is not obvious whether a log-normal or an over-dispersed Poisson model is more appropriate. In a log-normal model, the aggregate incremental claims Y ij are independent: where α and β are accident and development effects, respectively. On the original scale, this implies that: Thus, the standard deviation to mean ratio, as well as the log data variance are common across cells. If we instead chose an over-dispersed Poisson model, we would maintain the independence assumption and specify the first two moments of the claims Y ij as: Thus, the variance to mean ratio σ 2 is identical for all cells.
To choose between the two models, we could take the misspecification tests by Harnau (2018a) as a starting point. To implement the tests, the data are first split into sub-samples. For the Verrall et al. (2010) data, Harnau (2018a) considers a split into two sub-samples consisting of cells relating to the first and last five accident years, as illustrated in Table 1. The idea is then to test for common parameters across the sub-samples. In the log-normal model, we first perform a Bartlett test for common log data variances ω 2 across sub-samples and, if this is not rejected, an F-test for common accident and development effects. Similarly, in the over-dispersed Poisson model, we first test for common over-dispersion σ 2 and then again for common accident and development effects.
If one of the models is flagged as misspecified, but not the other, the choice becomes obvious. However, in this application, we cannot reject either model based on these tests. For the log-normal model, the Bartlett test for common log data variances yields a p-value of 0.09 and the F-test for common effects a p-value of 0.91; the p-values for the equivalent tests in the over-dispersed Poisson model are 0.78 and 0.64, respectively. Therefore, the question remains: Which model should we choose?

Overview of the Rival Models
We first discuss two common elements of the rival models, namely the data structure and the chain-ladder predictor and its identification. Then, we in turn state assumptions, estimation and known theoretical results for the over-dispersed Poisson and the generalized log-normal chain-ladder model.

Data
We assume that we have data for a run-off triangle of aggregate incremental claims. We denote the claims for accident year i and development year j by Y ij . Further, we count calendar years with an offset, so the calendar year k = i + j − 1. Then, we can define the index set for a run-off triangle with I accident, development and calendar years by: We define the number of observations in I as n. We could also allow for data in a generalized trapezoid as defined by  without changing the results of the paper. Loosely, generalized trapezoids allow for an unbalanced number of accident and development years, as well as missing calendar years both in the past and the future.

Identification
We briefly discuss the identification problem of the chain-ladder predictor α i + β j + δ that is common to both over-dispersed and generalized log-normal models. Kremer (1985) showed that based on this predictor, Poisson quasi-likelihood estimation replicates the classical chain-ladder point forecasts in a run-off triangle.
The identification problem is that for any a and b, where α i and β j are accident and development effects, respectively. Thus, no individual effect is identified. Several ad hoc identification methods are available; for example, we could set  suggest a parametrization that is canonical in a Poisson model and allows for easy counting of degrees of freedom. The idea is to re-write the linear predictor in terms of a level and deviations from said level as: Thus, we can write: where the design x ij and identified parameter ξ are given by: For the asymptotic theory in the over-dispersed Poisson model, it turns out to be useful to explicitly decouple the level and its deviations by decomposing as: We can then define the aggregate predictor τ and the frequencies π ij as: .
(2) Importantly, the frequencies π ij are invariant to the level µ 11 , the first component of ξ. Therefore, we can vary the aggregate predictor τ by varying µ 11 without affecting the frequencies π ij . The frequencies π ij are, in turn, functions of ξ (2) alone. Further, we note that given ξ (2) , there is a one-to-one mapping between µ 11 and τ through τ = exp(µ 11 ) ∑ ij∈I exp(x (2) ij ξ (2) ). While this choice of identification scheme is useful for derivation of the theory in this paper, any scheme may be used in applications of the results. This is because, as  point out, the linear predictor µ ij is identified, unlike the individual effects. Since the main results of the paper rely on estimates of the linear predictors µ ij alone, they are unaffected by the choice of a particular identification scheme.
Furthermore, the results in this paper are not limited to the chain-ladder predictor; we could, for example, include a calendar effect. Nielsen (2015) derives the form of the design vector for extended chain-ladder predictors in generalized trapezoids. The identification method is implemented in the R (R Core Team 2017) package apc (Nielsen 2015), as well as in the Python package of the same name (Harnau 2017).
We note that the identification method can introduce arbitrariness into the forecast for models that require parameter extrapolation, such as the extended chain-ladder model with calendar effects. In the standard chain-ladder model, we can forecast claim reserves without parameter extrapolation; in a continuous setting, Lee et al. (2015) refer to this as in-sample forecasting. In contrast, in the extended chain-ladder model, we cannot estimate parameters for future calendar years from the run-off triangle. For this case,  and Nielsen and Nielsen (2014) explain how forecasts can be influenced by ad hoc constraints and lay out conditions for the identification method that make forecasts invariant to these arbitrary and untestable constraints.

Over-Dispersed Poisson Model
We give the assumptions of the over-dispersed Poisson model and discuss its estimation by Poisson quasi-likelihood. We state the sampling scheme proposed by Harnau and Nielsen (2017) and the asymptotic distribution of the estimators.

Assumptions
The first assumption imposes the over-dispersed Poisson structure on the moments. We can write it as: The second assumption is distributional and allows for the asymptotic theory later on. We assume that the independent aggregate claims Y ij have a non-degenerate, non-negative and infinitely-divisible distribution with at least three moments. As noted by Harnau and Nielsen (2017), an appealing example for claim reserving of such a distribution is compound Poisson. The interpretation is that the aggregate incremental claims Y ij can be written as Y ij = ∑ N ij =1 X for a Poisson number of claims N ij D = Poisson{exp(µ ij )} independent of the independent and identically distributed random claim amounts X .

Estimation
We estimate the over-dispersed Poisson model by Poisson quasi-likelihood. The appeal is that, as noted in Section 3.2, Poisson quasi-likelihood estimation replicates the chain-ladder technique. We explicitly distinguish between the model, subscripted with ODP, and its standard estimators, subor super-scripted with ql, to avoid confusion later on when we evaluate the estimators under the rival model.
The fitted values for the linear predictors are given by: The fitted value for the aggregate predictor τ is then given by: a result implied by the fact that the re-parametrization of the Poisson likelihood in terms of the mixed parameter (τ, ξ (2) ) is linearly separable, so the parameters are variation independent; see, for example, Harnau and Nielsen (2017); Martínez Miranda et al. (2015) or, for a more formal treatment, Barndorff-Nielsen (1978, Theorem 8.4). This implies that the estimator for the aggregate predictor is unbiased for the aggregate mean.
As an estimator for the over-dispersion σ 2 , Harnau and Nielsen (2017) use the Poisson deviance D scaled by the degrees of freedom. The deviance is the log likelihood ratio statistic against a model with as many parameters as observations, giving a perfect fit. The estimator is given by:

Sampling Scheme
For the asymptotic theory, we adopt the sampling scheme proposed by Harnau and Nielsen (2017). The idea is to grow the overall mean τ = ∑ ij∈I E(Y ij ) while holding the frequencies π ij and thus ξ (2) fixed. We note that this also implies that µ 11 is O{log(τ)}. In this sampling scheme, information accumulates in the estimated frequencies. In this sense, it is reminiscent of multinomial sampling as used, for example, by Martínez Miranda et al. (2015) in a Poisson model conditional on the data sum. Furthermore, we assume that τ increases in such a way that the skewness vanishes. Harnau and Nielsen (2017) remark that this is implicit for distributions such as Poisson, negative binomial and many compound Poisson distributions. Importantly, the sampling scheme holds the number of cells in the run-off triangle fixed. If we instead grew the dimension of the array, the number of parameters would also increase, thus making an asymptotic theory difficult.

Asymptotic Theory
Based on the assumptions in Section 3.3.1 and the sampling scheme Section 3.3.3, Harnau and Nielsen (2017) derived the asymptotic distribution of the estimators.
The theory hinges on Harnau and Nielsen (2017, Theorems 1, 2), which for our purposes can be formulated as: An implication of the sampling scheme is that we cannot consistently estimate µ 11 since the overall mean τ and thus the level µ 11 grow. However, the remaining parameters ξ (2) are fixed and can be estimated in a consistent way. To ease notation, we define the design matrix X and the diagonal matrix of frequencies Π so: Harnau and Nielsen (2017, Lemma 1) derive the distribution of the estimator for the mean parameters in terms of the mixed parametrization (τ, ξ (2) ) . The advantage is that the two components of the mixed parameter are variation independent, so the covariance matrix featured in the asymptotic distribution is block-diagonal. This property turns out to be useful for example in the derivation of distribution forecasts. However, we opt to state the results in terms of the original parameterization by ξ to ease the analogy with the generalized log-normal model below. For our purposes, this does not complicate the theory. As a corollary to Harnau and Nielsen (2017, Lemma 1), we can then state the distribution of the quasi-likelihood estimatorξ ql as follows. All proofs are in Appendix A.
Thus, even though the level µ 11 → ∞, the difference between estimator and level (μ 11 − µ 11 ) vanishes in probability. We note that τX ΠX corresponds to the Fisher information about ξ in a Poisson model.
Further, Harnau and Nielsen (2017, Lemma 1) find that the asymptotic distribution of the deviance is proportional to a χ 2 : Thus, the estimatorσ 2 has an asymptotic distribution, which is unbiased for σ 2 .

Generalized Log-Normal Model
Following the same structure as for the over-dispersed Poisson model above, we set up the generalized log-normal model as introduced by Kuang and Nielsen (2018) and discuss its estimation and theoretical results. This model nests the log-normal model. While the log-normal model allows for an exact distribution theory for the estimators, Kuang and Nielsen (2018) provide an asymptotic theory that covers the generalized model. We are going to employ this asymptotic theory for the encompassing tests below.

Assumptions
The assumptions for the generalized log-normal model mirror those for the over-dispersed Poisson model closely. The assumption of independent Y ij with a non-negative, non-degenerate infinitely-divisible distribution and at least three moments is maintained. The difference lies in the moment assumptions, which are replaced with: where o(1) vanishes as ω 2 goes to zero. Thus, in the generalized log-normal model, the standard deviation to mean ratio, also known as the coefficient of variation, is common across the data for small ω 2 . This is in contrast to the variance to mean ratio in the over-dispersed Poisson model. Kuang and Nielsen (2018, Theorem 3.2) point out that the log-normal model log(Y ij ) D = N(µ ij , ω 2 ) satisfies these assumptions. There, the standard deviation to mean ratio is exp(ω 2 ) − 1 as in (1).
Based on the infinite divisibility assumption, we can construct a story similar to the compound Poisson story for the over-dispersed Poisson model. By definition, Y is infinitely divisible if for any m > 0, there exist independent and identically distributed random variables X 1 , . . . , X m , so ∑ m =1 X has the same distribution as Y. Thus, as pointed out by Kuang and Nielsen (2018), we can again think of m as the unknown number of claims and of X as the individual claim amounts.

Estimation
We estimate the generalized log-normal model on the log scale by least squares. We define: Then, least squares fitted values for the linear predictors µ ij are, with the design X as defined in (5), given by:μ ls ij = x ijξls whereξ ls = (X X) −1 X Z.
We estimate the variation parameter ω 2 based on the residual sum of squares written as: The estimator for the aggregate predictor τ as defined in (2) is then: Unlike in the over-dispersed Poisson model, this estimator is generally not unbiased. Instead, the sum of linear predictors is unbiased for the sum of logs since ∑ ij∈Iμ ls ij = ∑ ij∈I Z ij .

Sampling Scheme
We adopt the sampling scheme Kuang and Nielsen (2018) put forward for the generalized log-normal model. In this scheme, ω 2 vanishes in such a way that the skewness of Y ij goes to zero while ξ remains fixed. In a log-normal model, this corresponds to letting the log data variance ω 2 , thus the standard deviation to mean ratio exp(ω 2 ) − 1, go to zero. Again, the dimension of the array I remains fixed.

Asymptotic Theory
The asymptotic theory Kuang and Nielsen (2018) introduced for the generalized log-normal model allows one to find parameter uncertainty, testing for nested model reduction and closed-form distribution forecasts. Kuang and Nielsen (2018, Theorem 3.4) find that for small ω 2 , Thus, generalized log-normal random variables are asymptotically normal, but heteroskedastic on the original scale. Furthermore, Kuang and Nielsen (2018, Theorem 3.3) prove that: Therefore, conversion to the log scale yields asymptotic normality, as well. The difference is that the variance is now homoskedastic. We recall that µ ij is fixed under the sampling scheme in the generalized log-normal model. Therefore, these results imply that Y ij This also means that the data sum ∑ ij∈I Y ij P → τ. The small ω 2 distribution of the estimators in the generalized log-normal model is given by Kuang and Nielsen (2018, Theorem 3.5) as: In an exact log-normal model, the results in (8) hold for any ω 2 . In contrast to the over-dispersed Poisson model, the full parameter vector ξ, including the level µ 11 , can now be consistently estimated since it is fixed under the sampling scheme. This comes at the cost that ω 2 and, thus, the standard deviation to mean ratio move towards zero.

Encompassing Tests
With the two rival models in place, we aim to test the over-dispersed Poisson against the generalized log-normal model and vice versa. Since the models are generally not nested, we cannot simply test for a reduction from one to the other. Instead, we investigate whether the null model can correctly predict the behaviour of the statistics of the rival model if the null model is true. We first consider identifiable differences between the two models. Then, we in turn look at scenarios where the null is the over-dispersed Poisson model versus where it is the generalized log-normal model.

Identifiable Differences
It is interesting to consider what key features let us differentiate between the generalized log-normal and the over-dispersed Poisson model. Looking first at the means, we find that differences between the two models are not identifiable. This is because for any ξ = (µ 11 , ξ (2) ) and ω 2 in the generalized log-normal model, we can define ξ † = (µ 11 + ω 2 /2, ξ (2) ) for the over-dispersed Poisson model, so: Thus, we could not even tell the models apart based on the means if we knew their true values. In contrast, differences in the second moments are identifiable. In the generalized log-normal model, the standard deviation to mean ratio is constant for small ω 2 , while the variance to mean ratio is constant in the over-dispersed Poisson model. Since: constancy in one ratio generally implies variation in the other, except when all means are identical. Thus, the standard deviation to mean ratio in an over-dispersed Poisson model varies by cell, and so does the variance to mean ratio in a generalized log-normal model. Thus, if nature presented us with the true ratios, we could tell the models apart. As noted, an exception arises when all cells have the same mean, a scenario that seems unlikely in claim reserving. If this were the case, the assumptions of the two models are identical: the over-dispersed Poisson model becomes a generalized log-normal model and vice versa. Thus, non-identifiable differences between the ratios imply that both models are congruent with the data generating process in this dimension. Loosely, the two models become more different as the variation in the means increases. We may thus conjecture that there is a relationship between the power of tests based on standard deviations and variance to mean ratios and the variation in the means.

Null Model: Over-Dispersed Poisson
We find the asymptotic distribution of the least squares estimators, motivated in the generalized log-normal model, when the data generating process is over-dispersed Poisson. We propose a test statistic based on these estimators and find its limiting distribution under an over-dispersed Poisson data generating process.
The estimators from the log-normal model are computed on the log scale. Thus, we first find the limiting distribution of over-dispersed Poisson Y ij on the log scale.
Lemma 1. In the over-dispersed Poisson model Sections 3.3.1 and 3.3.3, We stress again that µ ij is not fixed under the sampling scheme so that the result does not imply that Z ij converges to µ ij , rather it implies that their difference (Z ij − µ ij ) vanishes. We can relate this lemma to Harnau and Nielsen (2017, Theorem 2) Given the limiting distribution on the log scale, we can find the distribution of the estimators in the same way as we would in a Gaussian model. Since the asymptotic distribution of √ τ(Z ij − µ ij ) is now heteroskedastic, unlike in the generalized log-normal model as shown in (7), we can anticipate that the results will not match those found in the generalized log-normal model. This is confirmed by the following lemma, using the notation for the design matrix X and the diagonal matrix of frequencies Π introduced in (5).

Lemma 2.
Define Ω = (X X) −1 X Π −1 X(X X) −1 , and let U = N(0, I). Then, in the over-dispersed Poisson model Sections 3.3.1 and 3.3.3, As could be expected given Lemma 1, the results in Lemma 2 match finite sample results in a heteroskedastic independent Gaussian model. Notably, the residual sum of squares RSS is not asymptotically χ 2 . However, the over-dispersion σ 2 enters their distribution only multiplicatively. The frequency matrix Π enters as a nuisance parameter that we can, however, consistently estimate since it is a function of ξ (2) alone. For example, we could use plug-in estimators Π ql = Π(ξ ls ). If we knew σ 2 , we could feasibly approximate the limiting distribution of RSS. Besides Monte Carlo simulation, numerical methods are available; see, for example, Johnson et al. (1995, Section 18.8). These methods exploit that the distribution of the quadratic form can be written as a weighted sum of χ 2 1 . Generally, for a real symmetric matrix A and independent χ 2 1 variables V ij , where λ ij are the eigenvalues of A; this follows directly by the eigendecomposition of A.
Unfortunately, the over-dispersion σ 2 is generally unknown, so that we cannot simply base an encompassing test on the residual sum of squares RSS. Therefore, we require an estimator for σ 2 . An obvious choice in the over-dispersed Poisson model is the estimatorσ 2 = D/(n − p). However, computed on the same data, D and RSS are not independent. We could tackle this issue in two ways. First, similar to Harnau (2018a), we could split the data I into disjoint and thus independent sub-samples. Then, we could compute RSS on one sub-sample and D on the other, making the two statistics independent. However, in doing so, we would incorporate less information into each estimate and likely lose power. Beyond that, it seems little would be gained by this approach since no closed-form for the distribution of RSS is available in the first place. The second way to tackle the issue is to find the asymptotic distribution of the ratio RSS/D with each component computed over the full sample. This is the way we are going to go.
Before we proceed, we derive an alternative estimator for the over-dispersion σ 2 that gives us more choice later on for the encompassing test. Lemma 1 is suggestive of a weighted least squares approach on the log scale since the form of the heteroskedasticity is known, taking Π as given. For: the weighted least squares estimators on the log scale are given by: Of course, Π is unknown, so these estimators are infeasible. However, we can consistently estimate Π. Thus, we can compute feasible weighted least squares estimators. For a first stage estimation of the weights by least squares, we write: so the (least squares) feasible weighted least squares estimators are: Similarly, using instead the quasi-likelihood-based plug-in estimator Π ql = Π(ξ (2) ql ) for the weights, we write: so the (quasi-likelihood) feasible weighted least squares estimators are: While we would generally expect them to differ in finite samples, it turns out that the Poisson quasi-likelihood and the (feasible) weighted least squares estimators are asymptotically equivalent. We formulate this in a lemma.  (3), τRSS * − D P → 0. These results still hold ifξ * is replaced byξ * ls orξ * ql , RSS * is replaced by RSS * ls or RSS * ql , or τ is replaced byτ ql orτ ls .
We are now armed with four candidate statistics for an encompassing test: To find their asymptotic distribution, we exploit that the distribution of each one is asymptotically equivalent to a quadratic form of the same random vector Y. This is reflected in the limiting distribution. which we formulate in a theorem.
Theorem 1. In the over-dispersed Poisson model Sections 3.3.1 and 3.3.3, R ls , R ql , R * ls and R * ql are asymptotically equivalent, so that the difference of any two vanishes in probability. For U D = N(0, I), Π as in (5), M as in (6) and M * as in (9), each statistic is asymptotically distributed as: Crucially, the asymptotic distribution R ODP is invariant to σ 2 . While it is again a function of the unknown, but consistently estimable frequencies π ij , for large τ, the plug-in version R ODP = R ODP ( Π) has the same distribution as R ODP (Π).
Theorem 1 allows us to test whether the over-dispersed Poisson model encompasses the generalized log-normal model. For a given critical value, if we reject that the R-statistic was drawn from R ODP , then we reject that the over-dispersed Poisson model M ODP encompasses the generalized log-normal model. While this indicates that the over-dispersed Poisson model is likely wrong, it could mean that the generalized log-normal model is correct or that some other model is appropriate. Conversely, non-rejection means that we cannot reject that the over-dispersed Poisson model encompasses the generalized log-normal model.
The distribution R ODP does not have a closed-form, but precise saddle point approximations are available, as we show below. Furthermore, it is of interest to investigate the impact of the choice among the different test statistics and plug-in estimators for Π appearing in R ODP in finite samples. Above that, we may question the power properties of the test. We discuss these points below in Section 5.

Null Model: Generalized Log-Normal
We first derive the small-ω 2 asymptotic distribution of Poisson quasi-likelihood and weighted least squares estimators when the data generating process is generalized log-normal. Then, we find the asymptotic distribution of the R-statistic proposed for an encompassing test above.
First, given asymptotic standard-normality on the log scale as in (7), we can easily show asymptotic normality of the weighted least squares estimator. As it turns out, Poisson quasi-likelihood estimators are also asymptotically equivalent to the weighted least squares estimators when the data generating process is generalized log-normal. We formalize this result in a lemma.
These results still hold ifξ * is replaced byξ * ls orξ * ql , RSS * is replaced by RSS * ls or RSS * ql , or τ is replaced byτ ql orτ ls .
With these results in place, we can find the distribution of the R-statistics in the generalized log-normal model.

Theorem 2.
In the generalized log-normal model Sections 3.4.1 and 3.4.3, R ls , R ql , R * ls and R * ql as in (10) are asymptotically equivalent so that the difference of any two vanishes in probability. For U D = N(0, I), Π as in (5), M as in (6) and M * as in (9), each statistic is asymptotically distributed: Thus, the test statistics are asymptotically distributed as the ratio of quadratic forms in both data generating processes. The difference arises in the sandwich-matrices. While the orthogonal projections M and M * feature in both distributions, the frequency matrix Π acts in different ways on R ODP and R GLN . Intuitively, R ODP is the ratio of "bad" least squares to "good" weighted least squares residuals computed in a heteroskedastic Gaussian model. In contrast, R GLN has the interpretation as the ratio of "good" least squares to "bad" weighted least squares residuals now computed in a homoskedastic model. Thus, we may expect draws from R GLN to likely be smaller than those from R ODP .

Distribution of Ratios of Quadratic Forms
We discuss the support of and numerical saddle point approximations to the limiting distributions of the encompassing tests under either data generating process.
The limiting distribution under the null hypothesis in both models is a ratio of dependent quadratic forms in normal random variables. This class of distributions is rather common. Besides standard F distributions, which are a special case, they appear for example in the Durbin-Watson test for serial correlation Watson 1950, 1951). While the distributions generally do not permit closed-form computations of the cdf, fast and precise numerical methods are available.
Butler and Paolella (2008) study a setting that includes ours, but is more general. They consider R = A / B where A and B are symmetric n × n matrices, B is positive semidefinite and D = N(ν, I). In our scenario, both A and B are positive semidefinite, and ν = 0.
Butler and Paolella (2008, Lemma 2) state that R is degenerate if and only if A = cB for some constant c. In our setting, this occurs if Π = n −1 I, so all cells have the same mean. This matches our observation from Section 4.1 that generalized log-normal and over-dispersed Poisson model are indistinguishable if all cells Y ij have the same mean. In that case, both the standard deviation to mean and the variance to mean ratio are constant across cells. This manifests in the collapse of both R ODP and R GLN to a point mass at n.
Further, Butler and Paolella (2008, Lemma 3) derive the support of R for a variety of cases depending on the properties of A and B. Building on their work, we can prove the following result.
Lemma 5. The distributions R GLN and R ODP have the same support. In non-degenerate cases, the support is (l, r) for 0 < l < r < ∞.
The cumulative distribution functions and densities of ratios of quadratic forms admit saddle point approximations. We adapt the discussion in Butler and Paolella (2008) to our scenario in which ν = 0; a setting that matches Lieberman (1994). We aim to approximate: First, we compute the eigenvalues of A − rB denoted λ 1 , . . . , λ n . We can write the cumulant generating function K(s), the log of the moment generating function ϕ(s) = E[exp{s(A − rB)}], of X r and its -th derivative as: Except for the special case when all eigenvalues λ t are zero, so K (1) (s) = 0,ŝ is unique since K (1) (s) is strictly increasing. The former case occurs if and only if E(X r ) = 0, which is the case for r = trace(A)/trace(B). This case is dealt with separately. For the other cases, we compute: Then, denoting by Φ(.) and φ(.) the standard normal cdf and density, respectively, the first order approximation to the cdf of R is: This saddle point approximation is a special case of the more general form in Lugannani and Rice (1980). This is what Lieberman (1994) built on. Lugannani and Rice (1980) analysed the error behaviour for a sum of independent and identically distributed random variables and showed uniformity of the errors for a large sample. Butler and Paolella (2008) instead considered a fixed sample size and show uniformity of errors in the tail of the distribution. This seems appealing for our scenario, since we would expect the rejection region of the test to correspond to the tail of the distribution.

Power
We show that the conjecture of a link between the power of the tests and variation in the means raised above in Section 4.1 is correct. To prove this, we consider a sequential asymptotic argument in which first, depending on the data generating process, τ becomes large or ω 2 becomes small and then the means become "more dispersed" in a sense made precise below. Based on this argument, we can justify a one-sided test where the rejection region corresponds to the upper tail when the null model is generalized log-normal and to the lower tail when it is over-dispersed Poisson.
The sequential asymptotics allows us to exclusively consider the impact of more dispersed means on R GLN and R ODP without worrying about the effect on the distribution of R ls , R ql , R * ls or R * ql . However, larger mean dispersion would be linked to changes in ξ (2) , a parameter that we keep fixed when deriving the asymptotic distribution of the test statistics in the first stage of the asymptotics. Therefore, we would expect the approximation quality achieved in the first stage to be affected by the second stage. The interpretation of the results is thus for a given first stage approximation quality, however large τ or small ω 2 may be needed to achieve this.
We model "more dispersed" means by increasing the variation in the frequencies π ij and specifically by letting some frequencies go to zero. In this way, we do not make a statement about the means in absolute terms, but merely say that some cell means become large relative to others.
For our analysis, we exclude cells for which estimation would yield a perfect fit; equivalently, we can impose that the frequencies do not exclusively vanish for perfectly-fitted cells. For example, in a chain-ladder model for the run-off triangle in Table 1, the corner cells (1, 10) and (10, 1) would be fit perfectly as they have their own parameters ∆β 10 and ∆α 10 .
To increase the variation in the frequencies π ij , we decide on n − q cells of the run-off triangle for which we want the frequencies to vanish. We require that the remaining q cells with non-vanishing frequencies make up an array on which we can estimate a model with the same structure for the linear predictor µ ij as for the full data without obtaining a perfect fit. For example, for a chain-ladder model in which µ ij = α i + β j + δ, this would be the case for rectangular arrays with at least two columns and rows or for triangular arrays with at least three rows and columns. For the ease of notation, we sort rows and columns of the frequency matrix Π defined in (5) such that the cells with vanishing frequencies are in the bottom right block of the matrix. Then, for a q × q matrix Π 1 and an (n − q) × (n − q) matrix Π 2 , we define a new frequency matrix: so s(t) takes care of the normalization such that the elements of Π (t) are still frequencies.
The idea is to model the vanishing frequencies by letting t → 0. Clearly, Π (1) corresponds to Π, whereas Π (0) has all frequencies in the the bottom right block equal to zero. We assume that Π 1 = q −1 I, so that the limiting case does not correspond to a scenario without variation in the frequencies.
Similarly, we sort rows and columns of the design matrix X to obtain a convenient partition. We sort the rows such that the q cells relating to non-vanishing frequencies are in the first q rows. Further, we sort the columns so the p 1 , say, parameters relevant for these q cells are in the first p 1 columns. Then, we can partition: where X 11 is q × p 1 and X 22 is (n − q) × (p − p 1 ). Column sorting ensures that X 12 = 0. Imposing that there is no perfect fit for the q cells without vanishing frequencies implies that p 1 < q, so there are fewer parameters than cells.
We are now interested in the properties of the large τ or small ω 2 limiting distributions of the R statistics' when some frequencies are small. For fixed t, Theorems 1 and 2 apply. Thus, for fixed frequencies Π (t) , the large τ and small ω 2 distributions of the R statistics in an over-dispersed Poisson and generalized log-normal model, respectively, are: where M * (t) is the weighted least squares orthogonal projection matrix I − X * (t) (X * (t) X * (t) ) −1 X * (t) for X * (t) = Π 1/2 (t) X. Thus, Π (t) enters not only directly, but also indirectly through M * (t) . We study the tests' power by looking at the limit of R (t) ODP and R (t) GLN as t → 0. We reiterate that the sequential asymptotics neglect interactions between first stage asymptotics for large τ or small ω 2 and the second stage small t asymptotics. A first intuition that neglects the potential influence of M * (t) may tell us that R (t) GLN should be well behaved while R (t) ODP blows up for small t. This turns out to be correct.
Theorem 3 justifies one-sided tests and shows that the power of the tests under either data generating process goes to unity in the sequential asymptotic argument. Since the distribution of R ODP and R GLN coincides for equal means, the power of the tests to distinguish between the data generating processes comes entirely from the variation in means. As the mean variation becomes large, R ODP first order stochastic dominates R GLN . Thus, we can consider the lower tail of R ODP and the upper tail of R GLN as rejection regions. While still controlling the size of the test under the null, we gain power compared to two-sided tests as the mean variation increases.
The denominator of R (0) GLN can be interpreted as "bad" weighted least squares residuals in a homoskedastic Gaussian model computed on just the subset of q cells with non-vanishing frequencies.
For a brief intuition as to why only cells and parameters relating to X 11 matter in the limit of the denominator, we consider weighted least squares estimation for Z = Xξ + Π −1/2 (t) , taking Π (t) as given.
We solve this by minimizing |Π 1/2 (t) (Z − Xξ)| 2 . For t > 0, the minimum is given by Z Π 1/2 (t) M * (t) Π 1/2 (t) Z. When t = 0, the last n − q elements of Z and rows of X corresponding to the vanishing frequencies do not contribute to the norm. The same holds for the last p − p 1 parameters in ξ that are then not identified. Thus, for t = 0, letting Z 1 contain the first p 1 elements of Z, the minimum of the norm equals Z 1Π 1/2 1M * 11Π 1/2 1 Z 1 .

Simulations
With the theoretical results for encompassing tests between over-dispersed Poisson and generalized log-normal models in place, we show that they perform well in a simulation study. First, we show that saddle point approximations to the limiting distributions R ODP and R GLN are very accurate. Second, we tackle an issue that disappears in the limit; namely, the choice between asymptotically identical estimators that generally differ in finite samples. We show that finite sample performance is indeed affected by this choice. However, we find that for some choices, finite sample and asymptotic distributions are very close. Third, we show that the tests have high power in finite samples and, considering the behaviour of the limiting distributions alone, that power increases quickly with the variation in means. For the simulations and empirical applications below, we use the Python packages quad_form_ratio (Harnau 2018b) and apc (Harnau 2017). The package was inspired by the R (R Core Team 2017) package apc (Nielsen 2015) with similar functionality.

Quality of Saddle Point Approximations
We show that saddle point approximations work well compared to large Monte Carlo simulations.
We consider three parameterizations. First, we let the design X correspond to that of a chain-ladder model for a ten-by-ten run-off triangle and set the frequency matrix Π to the least squares estimates Π ls = Π(ξ (2) ls ) of the Verrall et al. (2010) data in Table 1 (V N J). Second, for the same design, we now set the frequency matrix to the least squares plug-in estimates based on a popular dataset by Taylor and Ashe (1983) (TA). We provide these data in the Appendix A in Table A2. Third, we consider a design X for an extended chain-ladder model in an eleven-by-eleven run-off triangle and set Π to the least squares plug-in estimates of the Barnett and Zehnwirth (2000) data (BZ), also shown in the Appendix in Table A1. We remark that in the computations, we drop the corner cells of the triangles that would be fit perfectly in any case; this helps to avoid numerical issues without affecting the results.
Given a data generating process R chosen from R ODP and R GLN , a design matrix X and a frequency matrix Π, we use a large Monte Carlo simulation as a benchmark for the saddle point approximation. First, we draw B = 10 7 realizations r b from R. For the Monte Carlo cdf P MC (R ≤ q) = B −1 ∑ B b=1 1(r b ≤ q), we then find the quantiles q α , so P MC (R ≤ q α ) = α for α = 0, 0.005, 0.01, . . . , 1. To compute the saddle point approximation P SP (R ≤ q), we use the implementation of the procedure described in Section 4.4 in the package quad_form_ratio. Then, for each Monte Carlo quantile q α , we compute the difference P SP (R ≤ q α ) − α. Taking the Monte Carlo cdf as the truth, we refer to this as the saddle point approximation error. Figure 1a shows the generalized log-normal saddle point approximation error P SP (R GLN ≤ q α ) − α plotted against α. One and two (pointwise) Monte Carlo standard errors α(1 − α)/B are shaded in blue and green, respectively. While the approximation errors for TA are generally not significantly different from zero, the same cannot be said for the other two sets of parameters. For the parameterizations V N J and BZ, the errors start and end in zero and are negative in between. Despite statistically-significant differences, the approximation is very good with a maximum absolute approximation error of just over −0.006. The errors in the tails are much smaller, as we might have expected given the results by Butler and Paolella (2008) Figure 1b shows the plot for the approximation error to R ODP produced in the same way as Figure 1a. The approximation error is positive and generally significantly different from zero across parameterizations. Yet, the largest error is about 0.005 with smaller errors in the tails.
We would argue that the saddle point approximation errors, while statistically significant, are negligible in applications. That is, using a saddle point approximation rather than a large Monte Carlo simulation is unlikely to affect the practitioner's modelling decision.

Finite Sample Approximations under the Null
The asymptotic theory above left us without guidance on how to choose between test statistics R and estimators for the nuisance parameter Π that appears in the limiting distributions R. While the choice is irrelevant for large τ or small ω 2 , we show that it matters in finite samples and that some combinations perform much better than others when it comes to approximation under the null hypothesis.
In applications, we approximate the distribution of R by R = R( Π). That is, defining the α quantile of R as q R α , we hope that P(R ≤ q R α ) ≈ α under the null hypothesis. To assess whether this is justified, we simulate the approximation quality across 16 asymptotically identical combinations of R-statistics and ratios of quadratic forms R. We describe the simulation process in three stages. First, we explain how we set up the data generating processes for the generalized log-normal and over-dispersed Poisson model. Second, we lay out explicitly the combinations we consider. Third, we explain how we compute the approximation errors. As in Section 5.1, we point out that we drop the corner cells of the triangles in simulations. This aids numerical stability without affecting the results.
For the generalized log-normal model, we simulate independent log-normal variables Y ij , so log(Y ij ) D = N(x ij ξ, ω 2 ). We consider three settings for the true parameters corresponding largely to the estimates from the same three datasets we used in Section 5.1, namely the Verrall et al. (2010) data (V N J), Taylor and Ashe (1983) data (TA) and Barnett and Zehnwirth (2000) data (BZ). Specifically, we consider pairs (ξ, ω 2 ) set to the estimated counterparts (ξ ls ,ω 2 /s) for s = 1, 2. The estimatesω 2 are 0.39 for V N J, 0.12 for TA and 0.001 for BZ. Theory tells us that the approximation errors should decrease with ω 2 , thus as s increases.
For the over-dispersed Poisson model, we use a compound Poisson-gamma data generating process, largely following Harnau and Nielsen (2017) and Harnau (2018a). We simulate independent } and X s are independent Gamma distributed with scale σ 2 − 1 and shape (σ 2 − 1) −1 . This satisfies the assumptions for the over-dispersed Poisson model in Sections 3.3.1 and 3.3.3. For the true parameters (τ, ξ (2) , σ 2 ), we consider three sets of estimates (sτ ql ,ξ (2) ls ,σ 2 ) from the same data as for the log-normal data generating process. We use least squares estimatesξ (2) ls so that the frequency matrix Π is identical within parameterization between the two data generating processes. The estimates forσ 2 are 10,393 for V N J, 52,862 for TA and 124 for BZ. Those forτ ql are 14, 633, 814 for V N J, 34, 358, 090 for TA and 10, 221, 194 for BZ. Again, we consider s = 1, 2, but this time scaling the aggregate predictor. If this increases, so should the approximation quality. We recall that ξ (2) and τ pin down µ 11 through the one-to-one mapping τ = exp(µ 11 ) ∑ ij∈I exp(x (2) ij ξ (2) ). Thus, multiplying τ by s corresponds to adding log(s) to µ 11 . For a given data generating process, we independently draw B = 10 5 run-off triangles b = {Y ij,b : (i, j) ∈ I} and compute a battery of statistics for each draw. First, we compute the four test statistics R ls , R ql , R * ls and R * ql as defined in (10). Second, we compute the estimates for the frequency matrices Π based on least squares estimates, quasi-likelihood estimates and feasible weighted least squares estimates with least squares and with the quasi-likelihood first stage. This leads to four different approximations to the limiting distribution, which, dropping the subscript for the data generating process, we denote by: Given a data generating process and a choice of test statistic and limiting distribution approximation (R, R), we approximate P(R ≤ q R α ) by Monte Carlo simulation. For each combination (R, R), we have B paired realizations; for example, R b and the distribution R b are based on the triangle b . Denote the saddle point approximation to the cdf of R b as G b (q) = P SP ( R b ≤ q). Neglecting the saddle point approximation error, we then compute P( We do this for α ∈ A = {0.005, 0.01, . . . , 0.995}. To evaluate the performance, we consider three metrics: area under the curve of absolute errors (also roughly the mean absolute error), maximum absolute error and error at (one-sided) 5% critical values. We compute the area under the curve as AUC = ∑ 199 =1 | P MC (R ≤ q R α ) − α |∆α where α = 0.005 · , so ∆α = 0.005; we can also roughly interpret this as the mean absolute error MAE = 200/199 · AUC since α = 200 −1 . The maximum absolute error is max α∈A | P MC (R ≤ q R α ) − α|. Finally, the error at 5% critical values is P MC (R > q R GLN,0.95 ) − 0.05 for the generalized log-normal and P MC (R ≤ q R ODP,0.05 ) − 0.05 for the over-dispersed Poisson data generating process. Figure 2 shows bar charts for the area under the curve for all 16 combinations of R and R stacked across the three parameterizations for s = 1. The chart is ordered by the sum of errors across parameterizations and data generating processes within combination, increasing from top to bottom. The maximum absolute error summed over parameterizations is indicated by "+". Since a bar chart for the maximum absolute errors is qualitatively very similar to the plot for the area under the curve, we do not discuss it separately and instead provide it as Figure A1 in the Appendix A.  Table 1, the Taylor and Ashe (1983) data in Table A2 and the Barnett and Zehnwirth (2000) data in Table A1, respectively. Based on 10 5 repetitions for each parametrization. s = 1.

Over-dispersed Poisson
Looking first at the sum over parameterizations and data generating processes within combinations, we see large differences in approximation quality both for the area under the curve of absolute errors and the maximum absolute error. The former varies from about 5 pp (percentage-points) for (R * ls , R * ls ) to close to 30 pp for (R ql , R * ls ), the latter from 8 pp to 45 pp. It is notable that the four combinations involving R ql are congregated at the bottom of the pack. In contrast, the three best performing combinations all involve R * ls . These three top-performers have a substantial head start compared to their competition. While their AUC varies from 4.8 pp to 6.0 pp, there is a jump to 13.9 pp for fourth place. Similarly, the maximum absolute errors of the top three contenders lie between 7.5 pp and 9.3 pp, while those for fourth place add up to 20.5 pp.
Considering next the contributions of the individual parameterizations to the area under the curve across data generating processes, the influence is by no means balanced. Instead, the average contribution over combinations of the V N J, TA and BZ parameterizations is about 35%, 57% and 8%, respectively. This ordering is well aligned in magnitude and ordering with that of ω 2 and σ 2 /τ, loosely interpretable as a measure for the expected approximation quality. Still, considering the contributions of the parameterizations within combinations, we see substantial heterogeneity. For example, the TA parameterization contributes much less to (R * ql , R * ql ) than V N J, while the reverse is true for (R ls , R ls ). Finally, we see substantial variation between the two data generating processes. While the range of areas under the curve of absolute errors aggregated over parameterizations for the generalized log-normal is 0.7 pp to 10 pp, that for the over-dispersed Poisson is 3.2 pp to 19.4 pp. The best performer for the generalized log-normal is, perhaps unsurprisingly, (R * ls , R ls ). Intuitively, since the data generating process is log-normal, the asymptotic results would be exact for this combination if we plugged the true parameters into the frequency matrices. Just shy of these, we plug in the least squares parameter estimates, which are maximum likelihood estimated. It is perhaps more surprising that using R ql is not generally a good idea for the over-dispersed Poisson data generating process even though the fact that these combinations take the bottom four slots is largely driven by the TA parametrization. Reassuringly, the top three performers across data generating processes also take the top three spots within data generating processes, albeit with a slightly changed ordering. Figure 3a shows box plots for the size error at 5% nominal size computed over the three parameterizations and two data generating processes within combinations (R, R) for s = 1. Positive errors indicate an over-sized and negative errors an under-sized test. In the plots, medians are indicated by blue lines inside the boxes. The boxes show the interquartile range. Whiskers represent the full range. The ordering is increasing in the sum of the absolute errors at 5% critical values from top to bottom. 0.00 0.02 0.04 Looking at the medians, we can see that these are close to zero, ranging from −0.15 pp to 0.37 pp. However, there is substantial variation in the interquartile range, 0.1 pp for (R ls , R * ql ) to 0.7 pp for (R * ql , R ls ) and range, 0.4 pp for (R * ls , R * ls ) to 6.7 pp for (R ql , R ls )). The best and worst performers from the analysis for the area under the curve and maximum absolute errors are still found in the top and bottom positions. Particularly the performance of (R * ls , R * ls ) seems close to perfection with a range from −0.2 pp to 0.2 pp. Figure 3b is constructed in the same way as Figure 3a, but for s = 2, halving the variance for the generalized log-normal and doubling the aggregate predictor for the over-dispersed Poisson data generating process. Theory tells us that the approximation quality should improve, and this is indeed what we see. The medians move towards zero, now taking values between −0.05 pp and 0.14 pp; the largest interquartile range is now 1.1 pp and the largest range 2.9 pp.
Overall, the combination (R * ls , R * ls ) performs very well across the considered parameterizations and data generating processes. This is not to say that we could not marginally increase performance in certain cases, for example by picking (R * ls , R ls ) when the true data generating process is log-normal. However, even in this case in which we get the data generating process exactly right, not much seems to be gained in approximation quality where it matters most, namely in the tails relevant for testing. Thus, it seems reasonable to simply use (R * ls , R * ls ) regardless of the hypothesized model, at least for size control.

Power
Having convinced ourselves that we can control size across a number of parameterizations, we show that the tests have good power. First, we consider how the power in finite sample approximations compares to power in the limiting distributions. Second, we investigate how power changes as the means become more dispersed based on the impact on the limiting distributions R GLN and R ODP alone, as discussed in Section 4.5.

Finite Sample Approximations Under the Alternative
We show that combinations of R-statistics and approximate limiting distributions R that do well for size control under the null hypothesis also do well when it comes to power at 5% critical values. The data generating processes are identical to those in Section 5.2 and so are the three considered parameterizations V N J, BZ and TA. To avoid numerical issues, we again drop the perfectly-fitted corner cells of the triangles without affecting the results.
To avoid confusion, we stress that we do not consider the impact of more dispersed means in this section. Thus, if we mention asymptotic results, we refer to large τ when the true data generating process is over-dispersed Poisson and for small ω 2 when it is generalized log-normal, holding the frequency matrix Π fixed.
For a given parametrization, we first find the asymptotic power. When the generalized log-normal model is the null hypothesis, we find the 5% critical values c GLN : P(R GLN > c GLN ) = 0.05, using the true parameter values for Π. Then, we compute the power P(R ODP > c GLN ). Conversely, when the over-dispersed Poisson is the null model, we find c ODP : P(R ODP ≤ c ODP ) = 0.05 and compute the power P(R GLN ≤ c ODP ). Lacking closed-form solutions, we again use saddle point approximations, iteratively solving the equations for the critical values to a precision of 10 −4 .
Next, we approximate the finite sample power of the top four combinations for size control in Section 5.2, (R * ls , R * ls ), (R * ls , R ls ), (R ls , R ql ) and (R * ls , R * ql ), by the rejection frequencies under the alternative for s = 1. For example, say the generalized log-normal model is the null hypothesis, and we want to compute the power for the combination (R * ls , R * ls ). Then, we first draw B = 10 5 triangles b from the over-dispersed Poisson data generating process. For each draw b, we find 5% critical values c R GLN,ls,b : P( R * GLN,ls,b > c R GLN,ls,b ) = 0.05. We compute these based on saddle point approximations, solving iteratively up to a precision of 10 −4 . Then, we approximate the power as B −1 ∑ B b=1 1{R * GLN,ls,b > c R GLN,ls,b }. For the over-dispersed Poisson null hypothesis, we proceed equivalently, using the left tail instead. In this way, we approximate power for all three parameterizations and all four combinations.
Before we proceed, we point out that we should be cautious to interpret power without taking into account the size error in finite samples. A test with larger than nominal size would generally have a power advantage purely due to the size error. One way to control for this is to consider size-adjusted power, which levels the playing field by using critical values not at the nominal, but at the true size. In our case, this would correspond to critical values from the true distribution of the test statistic R, rather than the approximated distribution R. Therefore, the choice of R would not play a role any more. To sidestep this issue, we take a different approach and compare how close the power of the finite sample approximations matches the asymptotic power. Table 2 shows the asymptotic power and the gap between power in finite sample approximations and asymptotic power.
Looking at the asymptotic power first, we can see little variation between data generating processes within parameterizations. The power is highest for the V N J parameterization with 99%, followed by BZ with 95% and TA with 65%. This ordering aligns with that of the standard deviations of the frequencies π ij under these parameterizations, which are given by 0.016, 0.012 and 0.009 for V N J, BZ and TA, respectively. Table 2. Power in % at 5% critical values for large τ (over-dispersed Poisson DGP) and small ω 2 (generalized log-normal DGP) along with the power gap in pp for the top four performers from Table 3. DGP is short for data generating process. Based on 10 5 repetitions. s = 1.  When considering the finite sample approximations, we see that their power is relatively close to the asymptotic power. For V N J, absolute deviations range from 0.14 pp to 0.34 pp and for BZ from 0.94 pp and 1.15 pp. Compared to that, discrepancies for the TA parameterization are larger. The smallest discrepancy of 0.14 pp arises for (R * ls , R ls ) when the data generating process is generalized log-normal. As before, this is intuitive since it corresponds to plugging maximum likelihood estimated parametersξ (2) into Π. With 5.28 pp, the largest discrepancy arises for (R ls , R ql ) for an over-dispersed Poisson data generating process. Mean absolute errors across parameterizations and data generating processes are rather close, ranging from 1.01 pp for (R * ls , R ls ) to 2.1 pp for (R ls , R ql ). Our proposed favourite from above (R * ls , R * ls ) comes in second with 1.27 pp. We would argue that we can still justify the use of (R * ls , R * ls ) regardless of the data generating process.

Increasing Mean Dispersion in Limiting Distributions
We consider the impact of more dispersed means on power based on the the test statistics' limiting distributions R GLN and R ODP . We show that the power grows quickly as we move from identical means across cells to a scenario where a single frequency hits zero.
For a given diagonal frequency matrix Π with values π ij , we define the linear combination: Thus, for t = 1, we recover Π, while for t = 0, we are in a setting where all cells have the same frequencies, so all means are identical. In the latter scenario, R GLN and R ODP collapse to a point-mass at n, as discussed in Section 4.4. We consider t ranging from just over zero to just under t max : t max min ij∈I (π ij ) + (1 − t max )n −1 = 0. The significance of t max is that Π (t max ) corresponds to the matrix where the smallest frequency is exactly zero.
For each t, we approximate one-sided 5% critical values of R (t) We iteratively solve the equations up to a precision of 10 −4 . Theorem 3 tells us that the critical values should grow for both models, but that c (t) GLN converges as t approaches t max , while c (t) ODP goes to infinity.
Then, for given t and critical values, we find the power when the null model is generalized GLN ) and when the null model is over-dispersed Poisson P(R ODP ). Again, we use saddle point approximation. Based on Theorem 3, we should see the power go to unity as t approaches t max .
We consider the same parameterizations V N J, TA and BZ of frequency matrices Π and design matrices X as above. The values for t max are 1.083 for V N J, 1.396 for TA and 1.103 for BZ. To avoid numerical issues, we again drop the perfectly-fitted corner cells from the triangles. In this case, while the power is not affected, the critical values are scaled down by the ratio ofτ ls computed over the smaller array without corner cells to that computed over the full triangle. Since this is merely proportional, the results are not affected qualitatively. Figure 4a shows the power when the generalized log-normal model is the null hypothesis. For all considered parameterizations, this is close to 5% for t close to zero, increasing monotonically with t and approaching unity as t approaches t max , as expected.  For t = 1, where Π (t) corresponds to the least squares estimated frequencies from the data, the power matches what we found in Table 2. Figure 4b shows the difference in power between the two models plotted over t. For the three settings we consider, these curves have a similar shape and start and end at zero. Generally, the power is very comparable, with differences between −2 pp and 1 pp, again matching our findings from Table 2 for t = 1. Figure 5a shows the one-sided 5% critical values c (t) GLN plotted over t. As expected, these are increasing for all settings. Figure 5b shows the ratio of the critical values c GLN . This starts at unity, initially decreases, then increases and, finally, explodes towards infinity as we approach t max .
Taking the plots together, we get the following interpretation. We recall that the two distributions are identical for t = 0. Further, the rejection regions for the generalized log-normal null is the upper tail, while the lower tail is relevant for the over-dispersed Poisson model. However, for small t, the mass of both R (t) GLN and R (t) ODP is highly concentrated around n, and the distributions are quite similar. This explains why the power is initially close to 5% for either. Further, due to the concentration, c

Empirical Applications
We consider a range of empirical examples. First, we revisit the empirical illustration of the problem from the beginning of the paper in Section 2. We show that the proposed test favours the over-dispersed Poisson model over the generalized log-normal model. Second, we consider an example that perhaps somewhat cautions against starting off with a model that may be misspecified to begin with: dropping a clearly needed calendar effect turns the results of the encompassing tests upside down. Third, taking these insights into account, we implement a testing procedure that makes use of a number of recent results: deciding between the over-dispersed Poisson and generalized log-normal model, evaluating misspecification and testing for the need for a calendar effect.

Empirical Illustration Revisited
We revisit the data in Table 1 discussed in Section 2 and show that we can reject that the (generalized) log-normal model encompasses the over-dispersed Poisson model, but cannot reject the alternative direction. Thus, the encompassing tests proposed in this paper have higher power to distinguish between the two models than the misspecification tests Harnau (2018a) applied to these data. We remark that the encompassing tests were designed explicitly to distinguish between the two models, in contrast to the more general misspecification tests. Table 3 shows p-values for all 16 combinations of R-statistics and R under both null hypotheses. This is consistent with the applications in Kuang et al. (2015) and Harnau (2018a), who consider these data in a log-normal model. Looking at our preferred combination (R * ls , R * ls ), we find a p-value of 0.001, rejecting the model. Reassuringly, we reject the generalized log-normal model for any combination of R and R. The most favourable impression to this null hypothesis is given by (R ls , R ls ) with a p-value of 0.004.
If we instead take the over-dispersed Poisson model as the null, so: H 0 : over-dispersed Poisson vs. H A : generalized log-normal, the model cannot be rejected with a p-value of 0.17 for (R * ls , R * ls ). Again, this decision is quite robust to the choice of estimators with a least favourable p-value of 0.09 obtained based on (R ls , R * ql ). If we accept the null, we can evaluate the power against the generalized log-normal model. For instance, the 5% critical value under the over-dispersed Poisson model is 95.7. The probability of drawing a value smaller than that from the generalized log-normal model is 0.99. Thus, the power at the 5% critical value is close to unity. We can also find the power at the value taken by R * ls , interpretable as the 17% critical value if we like. This is simply one minus the p-value of the generalized log-normal model, thus equal to 1 − 0.001 = 0.999.

Sensitivity to Invalid Model Reductions
The Barnett and Zehnwirth (2000) data are known to require a calendar effect for modelling. We show those data in Table A1 in the Appendix. Barnett and Zehnwirth (2000), Kuang et al. (2015) and Harnau (2018a) approached these datasets using log-normal models. Here, we find that an encompassing test instead heavily favours an over-dispersed Poisson model. Further, we show that dropping the needed calendar effect substantially affects the test results.
We again first consider a generalized log-normal model; however, we initially allow for a calendar effect. Adding the prefix "extended" to models with calendar effect, we test: Our preferred test statistic R * ls = 114.40. Paired with R * ls , this yields a p-value of 0.02. Thus, the generalized log-normal model is clearly rejected. For illustrative purposes, we continue anyway and test whether we can drop the calendar effect from the generalized log-normal model. Thus, the hypothesis is: H 0 : generalized log-normal vs. H A : extended generalized log-normal. Kuang and Nielsen (2018) show that for small ω 2 , we can use a standard F-test for this purpose. If we assumed that the data generating process is not generalized log-normal, but log-normal, the F-test would be exact. This test rejects the reduction with a p-value of 0.00. If again we decide to continue anyway, we can now test the generalized log-normal against an over-dispersed Poisson model, both without the calendar effect. Thus, the hypothesis is: Interestingly, the log-normal model does not look so bad any more now. For this model, R * ls = 87.54, which yields a p-value of 0.10. Of course, this should not encourage us to assume that the generalized log-normal model without the calendar effect is actually a good choice. Rather, it draws attention to the fact that tests computed on inappropriately-reduced models may yield misleading conclusions. The tests proposed in this paper assume that the null model is well specified, and the results are generally only valid if this is correct. In applications, we may relax this statement to "the tests only give useful indications if the null model describes the data well". In this case, we did not only ignore the initial rejection of the generalized log-normal model, but also that calendar effects are clearly needed to model the data well.
We now start over, switching the role of the two models, thus starting with an extended over-dispersed Poisson model. The first hypothesis is the mirror image from above: The test statistic is still R * ls = 114.40, but now, we cannot reject the null hypothesis with a p-value of 0.14. We may thus feel comfortable to model the data using an over-dispersed Poisson model with a calendar effect. Next, we investigate whether the calendar effect can be dropped, testing: H 0 : over-dispersed Poisson vs. H A : extended over-dispersed Poisson. Harnau and Nielsen (2017) showed that for large τ, this can be done with an F-test based on Poisson deviances. This reduction is clearly rejected, again with a p-value of 0.00. We move on anyway, drop the calendar effect and test: H 0 : over-dispersed Poisson vs. H A : generalized log-normal.
In this case, the p-value is 0.01, and we reject the null, so we get the opposite result. Comparing the outcomes of the tests, it seems clear that an over-dispersed Poisson model with the calendar effect is the most reasonable choice. However, if we had not started at this point, but rather never considered a calendar effect in the first place, we might have come to a very different conclusion. This indicates that the starting point can matter a great deal for the model choice and that it may be a good idea to start with a more general model and test for reductions, even if we were fairly certain that the reduced model is a good choice.

A General to Specific Testing Procedure
The Taylor and Ashe (1983) data have frequently been modelled as over-dispersed Poisson, for example by England and Verrall (1999), England (2002) and Harnau (2018a). We provide those data in Table A2 in the Appendix A. Based on the insight from the application to the Barnett and Zehnwirth (2000) data above, we start with a general model with the calendar effect and use a whole battery of tests to see if a generalized log-normal or over-dispersed Poisson chain-ladder model can be justified. We find that an over-dispersed Poisson chain-ladder model is reasonable for these data.
We first consider a generalized log-normal model with the calendar effect. We test: H 0 : extended generalized log-normal vs. H A : extended over-dispersed Poisson.
The null hypothesis is clearly rejected with a test statistic of R * ls = 81.5 and a p-value of 0.001. Thus, we do not proceed further with this model. Instead, we now start with an over-dispersed Poisson model with the calendar effect. The hypothesis: H 0 : extended over-dispersed Poisson vs. H A : extended generalized log-normal cannot be rejected with a p-value of 0.92. We point out that this indicates that the draw is in the right tail of R ODP . While we would argue this is not the case here, we may worry about values that are too far out in the right tail of R ODP , which would perhaps indicate that we should reject both models.
Next, we apply the misspecification tests by Harnau (2018a). We first split the run-off triangle into four sub-samples, as indicated in Figure 6. We can now test whether the over-dispersion is common across sub-samples: Harnau (2018a) showed that we can use a Bartlett test based on the Poisson deviance for this purpose. In the model with the calendar effect, this test yields a p-value of just above 0.05, a rather close call. In light of the fact that the ultimate goal of the exercise is forecasting reserves and that forecasting often benefits from simpler models, we decide to accept the hypothesis. Next, we consider the hypothesis that there are no breaks in accident, development and calendar effects between sub-samples: As demonstrated by Harnau (2018a), this can be tested with a deviance based F-test that is independent of the Bartlett tests for large τ. This test yields a p-value of 0.07. Based on the same argument as above, we accept the hypothesis. Now that we are reasonably happy with the over-dispersed Poisson extended chain-ladder model, we test whether the calendar effects can be dropped. Based on an F-test, this hypothesis cannot be rejected with a p-value of 0.30. Thus, we move on, retesting whether the over-dispersed Poisson model still encompasses the log-normal model. Based on a test statistic R * ls = 73.5, this cannot be rejected with a p-value of 0.73. We can now go back and apply the misspecification tests by Harnau (2018a) once again, except this time for models without the calendar effect. Using the same sub-sample structure, a Bartlett test cannot reject the hypothesis of common over-dispersion H 0 : σ 2 = σ 2 with a p-value of 0.08. Further, an F-test for the hypothesis of the absence of breaks in the mean parameters H 0 : α i, + β j, + δ = α i + β j + δ cannot be rejected with a p-value of 0.93.
In conclusion, an over-dispersed Poisson chain-ladder model for the Taylor and Ashe (1983) data survived a whole battery of specification tests, and we may at least be more comfortable with this model choice, having found no strong evidence telling us otherwise. In contrast, the generalized log-normal model was clearly rejected.

Discussion
While there has been a range of recent advances for both over-dispersed Poisson and (generalized) log-normal models, there are still several areas left for further research. This spans from further misspecification tests and refinements thereof over a potential theory for the bootstrap to empirical studies evaluating the impact of the theoretical procedures in practice.
As pointed out by Harnau (2018a), the misspecification tests require a specific choice for the number of sub-samples and their shape. A generalization that is agnostic about these choices would be desirable. Harnau (2018a) also remarked that a misspecification test for independence would be useful. The assumption of independence across cells is common to both over-dispersed Poisson and generalized log-normal models. It seems likely that a test that is valid in one model would translate easily to the other.
The closed-form distribution forecasts proposed by Harnau and Nielsen (2017) for the over-dispersed Poisson model and by Kuang and Nielsen (2018) for the generalized log-normal model are both based on t-distributions and thus symmetric. These forecasts seem to perform rather well and, in some settings, appear more robust than the bootstrap by England and Verrall (1999) and England (2002). However, with an appealing asymptotic theory in place for both types of models, it may be worth considering whether a theory for the bootstrap could be developed to allow for potential asymmetry of the forecast distribution that we might expect in finite samples.
Finally, given the range of recent theoretical developments, an empirical study that evaluates the impact of the contributions in applications seems appropriate. Since the main concern in claim reserving is forecasting, such a study would likely require data not just for run-off triangles, but also for the realized values in the forecast array, that is the lower triangle. Such data are available, for example, from Casualty Actuarial Society (2011). For instance, it would be interesting to see how the forecast performance between rival models differs if one were rejected by the theory, but not the other.

Conflicts of Interest:
The author declares no conflict of interest. D → N(0, σ 2 Π −1 ). To obtain the distribution of the least-squares estimator, we pre-multiply by (X X) −1 X .

This yields
with Ω as defined in the lemma. We find the distribution of the residual sum of squares using the continuous mapping theorem. With that, Finally, we show thatτ ls /τ P → 1 whereτ ls = ∑ ij∈I exp(x ijξ ls ). Define f (ξ) = ∑ ij∈I exp(x ij ξ).
The proof for τRSS * − D P → 0 follows by the same argument. The main insight is that the asymptotic distribution of the Poisson deviance is asymptotically equivalent to that of the quadratic form τ −1 {Y − exp(µ)} (Π −1 − X (X ΠX) −1 X ){Y − exp(µ)}, as Harnau and Nielsen (2017, Proof of Lemma 1) show building on Johansen (1979, Theorems 7.7, 7.8). This is again asymptotically identical to the sequential mapping from Y to Z followed by the map from Z to the scaled residual sum of weighted least squares τRSS * = { √ τΠ 1/2 (Z − µ)} M * { √ τΠ 1/2 (Z − µ)}. To show that we can replace the weight matrix Π in the weighted least squares estimator by Π ls or Π ql we note that both matrices converge in probability to Π and then apply Slutsky's theorem (Casella and Berger 2002, Theorem 5.5.17). Combining this argument with the proof of the equivalence of D and RSS * in the last paragraph, it also follows that we can replace the weights in RSS * without affecting the result. Finally, bothτ ls /τ P → 1 andτ ql /τ P → 1 so we can replace τ as well by Slutsky's theorem.
Appendix A.5. Proof of Theorem 1 Taking into account the results from Lemma 3, it follows that (τRSS * )/D P → 1 and that the result still holds if we replace RSS * by RSS * ls or RSS * ql and τ byτ ls orτ ql . Thus, for example, R ls P → R ql so their difference vanishes and similarly for any other of the six total combinations.
Appendix A.6. Proof of Lemma 4 The asymptotic distribution of the weighted least squares estimators follows by the same argument as in Lemma 2, except now taking (ω 2 ) −1/2 (Z − µ) D → N(0, I n ), as shown by Kuang and Nielsen (2018, Theorem 3.3), as a starting point.
The asymptotic equivalence of weighted least squares and Poisson quasi likelihood estimation follows from the same argument as in Lemma 3, except now (ω 2 ) −1/2 {Y − exp(µ)} D → N[0, diag{exp(µ)}]. The argument for replacing true parameters in the frequency matrix Π and the aggregate means τ by estimates is identical to that in Lemma 3 as well.
Further, we assume without loss of generality that rows and columns of X and Π have been sorted as described in Section 4.5.
First, we establish a useful equivalence of the denominators of both R M 22 = 0. Thus, since U 2 = 0 with probability one, the second term is positive with probability one so overall the numerator U Π −1/2 (t) MΠ −1/2 (t) U

Over-dispersed Poisson
V N J T A BZ Figure A1. Bar chart of maximum absolute errors for the considered combinations of R and R. Ordered by the sum of errors within combination across data generating processes and parameterizations increasing from top to bottom. Sum of maximum absolute errors across parameterizations indicated by "+". V N J, TA, and BZ is short for parameters set to their estimates from the Verrall et al. (2010), Taylor and Ashe (1983) and Barnett and Zehnwirth (2000) data, respectively. Based on 10 5 repetitions for each parametrization. s = 1. Table A1. Insurance run-off triangle taken from Barnett and Zehnwirth (2000 , Table 3.5) as used in the empirical application in Section 6.2 and the simulations in Section 5.  --------10 420,778 590,400  ---------11 496,200  ----------Table A2. Insurance run-off triangle taken from Taylor and Ashe (1983) as used in the empirical application in Section 6.3 and the simulations in Section 5.