Constrained Inference When the Sampled and Target Populations Differ

Abstract: In the analysis of contingency tables, one often faces two difficulties: the sampled and target populations are not identical, and prior information takes the form of general linear inequality restrictions. For these situations, we present new models for estimating cell probabilities related to four well-known methods of estimation. We prove that each model yields maximum likelihood estimators under those restrictions. The performance ranking of these methods under equality restrictions is known. We compare these methods under inequality restrictions in a simulation study. It reveals that these methods may rank differently under inequality restrictions than under equality restrictions. The four methods are also compared in an analysis of US census data.


Introduction
When working with a sample contingency table, a researcher might need to adjust it based on information available from other sources. This information might come from prior surveys, censuses, established theories or other sources. Often it comes as marginal information such as row and/or column totals. For example, consider a data set where each subject is cross-classified by income (low/high) and urbanity (urban/rural), and marginal information about income and urbanity is available from a census. One would like to adjust the sample data to conform to the desired margins from the census.
For two-way contingency tables of size (I × J), four well-known [1,2] margin-adjusting methods for estimating cell probabilities are raking (RAKE), least squares (LSQ), minimum chi-squared (MCSQ) and maximum likelihood under random sampling (MLRS). Assume that a random sample $\{n_{ij}, 1 \le i \le I, 1 \le j \le J\}$ is available from a multinomial$(n, \pi)$ probability distribution, where $n = \sum_{i,j} n_{ij}$ and $\pi = (\pi_{ij}, \forall i, j)$. Let $p_{ij} = n_{ij}/n$ denote the sample cell proportions. Then RAKE finds the estimates $\{\hat{\pi}^{RK}_{ij}\}$ that minimize the discrimination information $\sum_{i,j} \hat{\pi}_{ij} \ln(\hat{\pi}_{ij}/p_{ij})$ under the marginal constraints
$$\sum_{j} \hat{\pi}_{ij} = \pi_{i+}, \qquad \sum_{i} \hat{\pi}_{ij} = \pi_{+j}, \qquad \forall i, j, \qquad (1)$$
where $\hat{\pi}_{ij}$ denotes the estimators of the target cell probabilities $\pi_{ij}$, and the margins $\pi_{i+}$, $\pi_{+j}$ are known, $\forall i, j$.
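As an illustration of raking to known margins, the following minimal sketch uses iterative proportional fitting, the classical procedure whose limit minimizes the discrimination information under marginal constraints. The table `p` and the target margins are hypothetical values chosen for the example, and the function name `rake` is an assumption of this sketch, not something defined in the paper.

```python
def rake(p, row_targets, col_targets, tol=1e-12, max_iter=1000):
    """Iterative proportional fitting of a positive table p to target margins."""
    est = [row[:] for row in p]              # work on a copy of the sample table
    n_rows, n_cols = len(est), len(est[0])
    for _ in range(max_iter):
        # Rescale each row so that it sums to its target row margin.
        for i in range(n_rows):
            s = sum(est[i])
            est[i] = [v * row_targets[i] / s for v in est[i]]
        # Rescale each column so that it sums to its target column margin.
        for j in range(n_cols):
            s = sum(est[i][j] for i in range(n_rows))
            for i in range(n_rows):
                est[i][j] *= col_targets[j] / s
        # Columns now match exactly; stop once the rows match as well.
        row_err = max(abs(sum(est[i]) - row_targets[i]) for i in range(n_rows))
        if row_err < tol:
            break
    return est


# Hypothetical 2 x 2 sample proportions and target margins (illustration only).
p = [[0.30, 0.20], [0.25, 0.25]]
row_targets = [0.32, 0.68]
col_targets = [0.60, 0.40]
pi_hat = rake(p, row_targets, col_targets)
```

Because each step only rescales whole rows or columns, the adjusted table keeps the odds ratio of the sample table while matching the target margins.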
Under the same constraints (1), the methods LSQ, MCSQ and MLRS find their respective estimates. Instead of given marginal totals, one might like to use restrictions of a more general nature. Consider the survey data [3] from the second National Health and Nutrition Examination Survey (NHANES II).
Table 1a shows the sample proportions and corresponding census proportions of 2 × 2 contingency tables of income by urbanity, and Table 1b shows the sample proportions and corresponding census proportions of 2 × 2 contingency tables of education by urbanity. We observe differences between the census and sample values, possibly due to differences between the target and sampled populations. For example, in the Table 1a census data, the ordering of the row totals (0.3191 < 0.6809) is the reverse of that in the sample data (0.5260 > 0.4740). Similarly, in the Table 1b census data, the off-diagonal entries satisfy an order relation (0.2625 > 0.2107), but in the sample the relation goes in the opposite direction (0.2360 < 0.2682). If such constraints are known a priori (e.g., from a census or other sources), then it is wiser to incorporate them into the analysis while adjusting the sample data.
Much prior work (e.g., [2]) assumed that random samples were taken directly from the target population with known row and column margins ($\pi_{i+}$ and $\pi_{+j}$, respectively). However, in practice, there are situations in which a random sample from the target population is inaccessible. For example, sample units are often too expensive to locate or unwilling to participate in the survey. In this case, to estimate the target cell probabilities, we have to take a random sample from a sampled population that is systematically different from the target population. The resulting estimators are then typically biased. Researchers in [3] have studied such discrepancies under marginal row and column constraints. A similar problem in a regression context can be found in [4].
It is well known that all four margin-adjusting methods are asymptotically equivalent under simple random sampling. However, their small sample results can be different. Using simulation methods, [5] found that MCSQ is best, followed by MLRS, RAKE and LSQ, in order of performance in average root mean squared error. However, for margin adjusting, [3] found that both RAKE and MLRS dominate MCSQ, and that LSQ is inferior to all three methods when the sampled population is systematically different from the target population. In this paper, we consider general linear constraints (not necessarily marginal) under inequality restrictions and study the performance of those four methods. For simulation (Section 4), we have restricted our attention to (2 × 2) tables to facilitate comparison with Little and Wu [3].

Solutions from Each Method
First, we vectorize the I × J contingency table of probabilities lexicographically; that is, let $\pi = (\pi_{ij})$ denote the $IJ \times 1$ target population probability vector, so that each pair $(i, j)$ corresponds to an index $t$, for some $t$, $1 \le t \le IJ$. We assume that the available knowledge of the population can be expressed as $r$ constraints of the form
$$A^T \pi - c \le 0,$$
where $A = (a_{ij})$ denotes an $(IJ \times r)$, $r \le I + J - 1$, matrix of constants with rank$(A) = r$, and $c = (c_i)$ denotes the $(r \times 1)$ vector of constants. The resulting constrained minimization is known as the primal problem. The Kuhn-Tucker method [6] identifies an equivalent dual problem that could be substantially easier to solve than the primal problem (for larger I, J).
Consider maximization in λ and minimization in π of L(λ, π). Suppose there exists (λ*, π*) for which L(λ, π*) ≤ L(λ*, π*) ≤ L(λ*, π), ∀λ, π, if and only if then λ* is a Kuhn-Tucker vector and π* is an optimal solution of the primal problem, and L(λ*, π*) is the optimal value of L(λ, π). More generally, λ* is a Kuhn-Tucker vector if and only if The dual problem is given by sup_λ g(λ), where the function g is defined by g(λ) = inf_π L(λ, π) [6]. Often, the dual problem has a nice form, and λ* can be found by numerical methods. Then, one can use the relation (4) to find the solution π* to the primal problem.

Models Relating the Sampled and Target Populations
Suppose a random sample of size $n$ is taken from the sampled population. For the $(i, j)$th cell, let $\pi_{ij}$ and $\tau_{ij}$ be the target and sampled probabilities, respectively. Consider the RAKE model in Equation (5) below, which specifies how the sampled and target populations are connected. Theorem 1 shows that the solutions to the model in (5) are the maximum likelihood (ML) estimators under the RAKE model.
Theorem 1. Suppose the probabilities of the target and sampled populations are related by (5). Then $\pi^*_t$, given by (5), are the maximum likelihood estimates of the cell probabilities $\pi_t$ in the target population.
Proof. Consider the (primal) raking problem of minimizing $\sum_{t=1}^{IJ} \pi_t \ln(\pi_t/\tau_t)$ subject to $A^T \pi - c \le 0$. Using an example from (p. 309, [7]), the Lagrangian $L$ for this problem is given by Equation (6). The dual problem is solved numerically (e.g., using the Newton-Raphson method), and we obtain the dual solutions $\lambda = \lambda^*$ as functions of $\tau_t$. Differentiating $L$ in (6) with respect to each $\pi_t$ and setting the derivative equal to zero, the primal solutions $\hat{\pi}_t$ are obtained as $\hat{\tau}_t$ multiplied by an exponential factor that depends on $\lambda^*$. Assuming the counts $n_t$ follow a multinomial$(n, \tau_t, \forall t)$ distribution, the MLE of $\tau_t$ is given by $\hat{\tau}_t = p_t = n_t/n$. Since the $\lambda^*_i$ are functions of $p_t$, the $\hat{\pi}_t$ are functions of the MLEs $\hat{\tau}_t$ and hence, by the invariance property of maximum likelihood, are themselves MLEs. Thus, raking yields ML estimates for the RAKE model (5).
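The differentiation step in the proof can be sketched as follows. This is only an illustration under an assumed standard Lagrangian with multipliers $\lambda_i \ge 0$ for the inequality constraints $A^T \pi - c \le 0$; the exact form of (6) in the paper may differ (for example, in how the normalization of $\pi$ is handled), so the exponent below should not be read as a restatement of the paper's result.

```latex
% Assumed Lagrangian for minimizing sum_t pi_t ln(pi_t / tau_t) subject to A^T pi - c <= 0,
% with A = (a_{ti}) of size IJ x r and multipliers lambda_i >= 0.
L(\lambda, \pi) = \sum_{t=1}^{IJ} \pi_t \ln\frac{\pi_t}{\tau_t}
  + \sum_{i=1}^{r} \lambda_i \Big( \sum_{t=1}^{IJ} a_{ti}\, \pi_t - c_i \Big)

% Stationarity in each pi_t:
\frac{\partial L}{\partial \pi_t}
  = \ln\frac{\pi_t}{\tau_t} + 1 + \sum_{i=1}^{r} \lambda_i a_{ti} = 0
\quad\Longrightarrow\quad
\pi_t = \tau_t \exp\Big( -1 - \sum_{i=1}^{r} \lambda_i a_{ti} \Big)

% Substituting back gives the dual objective g(lambda) = inf_pi L(lambda, pi):
g(\lambda) = -\sum_{t=1}^{IJ} \tau_t \exp\Big( -1 - \sum_{i=1}^{r} \lambda_i a_{ti} \Big)
  - \sum_{i=1}^{r} \lambda_i c_i
```

Under this assumed form, the primal solution is the sampled probability times an exponential factor that is linear in the multipliers, which is why plugging in $\hat{\tau}_t = p_t$ and $\lambda^*$ yields estimates that are functions of the sample proportions alone.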
In general, consider the model in (8), indexed by a parameter α. By using similar arguments as above, LSQ is ML for the LSQ model obtained by setting α = 1, MLRS is ML for the MLRS model obtained by setting α = −1, and MCSQ is ML for the MCSQ model obtained by setting α = −2 in (8). The limit α → 0 in (8) corresponds to RAKE. Theorem 1 shows that for any model α, (8) yields MLEs of $\pi_t$ for that model. If $\pi_t$ is estimated by a method that does not match the α of the model in (8), the solution is still available, but it is not the MLE under (8). Hence, it is of interest how these four methods compare with each other (as MLE versus not MLE) in a given situation. To address this issue, a simulation study is conducted in the next section.

A Simulation Study
We performed a simulation study to compare the methods in a systematic way. We restrict our attention to (2 × 2) tables so that comparison with the equality-constrained results of [3] is facilitated. In contrast to margin-adjusting methods (e.g., [3]), where it is enough to consider only one parameter, e.g., $\pi_{11}$, for inequality constraints one needs to consider all $\pi_{ij}$, $\forall i, j$. In this simulation, we have sought the solution of the primal problem itself because the table dimensions (2 × 2) are the smallest possible, and the duality approach does little to reduce the computational load.
We have considered two types of inequality restrictions in the simulation: isotonic and nonisotonic (see [7] for definitions). For each of the 16 designs described below, sample sizes n = 30, 100, 1000 are considered. Thus, in each of the 16 × 3 = 48 cases, for a given π as the target population vector, we vary λ and find τ using (8). Then, we take multinomial random samples from this τ and calculate p. This process is repeated 200 times for each of the 48 cases.
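The sampling step of this design can be sketched as below. The sketch assumes that the sampled probabilities τ have already been obtained (the paper derives them from π and λ through (8), which is not reproduced here), so the vector `tau` is a hypothetical placeholder, and Python's standard `random` module stands in for the generator used in the paper.

```python
import random


def sample_proportions(tau, n, rng):
    """Draw one multinomial(n, tau) sample and return the proportions p_t = n_t / n."""
    counts = [0] * len(tau)
    # Each of the n units falls into cell t with probability tau[t].
    for cell in rng.choices(range(len(tau)), weights=tau, k=n):
        counts[cell] += 1
    return [c / n for c in counts]


rng = random.Random(2024)            # fixed seed so the illustration is repeatable
tau = [0.35, 0.15, 0.20, 0.30]       # hypothetical sampled probabilities for a 2 x 2 table

# 200 replicates for each of the three sample sizes used in the design.
replicates = {
    n: [sample_proportions(tau, n, rng) for _ in range(200)]
    for n in (30, 100, 1000)
}
```

Each entry of `replicates[n]` is one vector of sample proportions p to which the four estimation methods would then be applied.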
For isotonic constraints, we use a tree order. For these constraints, closed-form solutions (π*) are available for all four methods as follows. The LSQ solution under tree order is calculated using the algorithm on page 19 of [7], and MLRS = LSQ. The RAKE and MCSQ values are given by least squares projections of $\ln p$ and $p^2$ onto the constraints of interest, followed by the inverse of those transformations (see pages 240 and 278 of [7], respectively).
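For readers unfamiliar with tree-order projections, here is a minimal sketch of the standard pooling procedure for a simple tree order. It assumes that the order has the form "the first cell is no larger than every other cell" and that all cells have equal weight; the exact constraint and weighting used in the paper are not shown here, so both are assumptions, as is the function name `tree_order_lsq`.

```python
def tree_order_lsq(x):
    """Unweighted least-squares projection of x onto {m : m[0] <= m[t] for all t >= 1}."""
    # Visit the non-root cells from smallest to largest value.
    others = sorted(range(1, len(x)), key=lambda t: x[t])
    pooled = [0]            # indices currently pooled with the root
    pooled_sum = x[0]
    for t in others:
        # Pool the next smallest cell while it still violates the order
        # relative to the running average of the pooled block.
        if pooled_sum / len(pooled) > x[t]:
            pooled_sum += x[t]
            pooled.append(t)
        else:
            break
    est = list(x)
    average = pooled_sum / len(pooled)
    for t in pooled:
        est[t] = average    # pooled cells share their common average
    return est


# Hypothetical sample proportions in which the root cell violates the order.
example = [0.40, 0.10, 0.30, 0.20]
projected = tree_order_lsq(example)
```

When no cell violates the order the input is returned unchanged, which matches the remark that no adjustment is needed in the absence of a violation; pooling with equal weights also leaves the total of the cells unchanged.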
With given λ and the target probabilities $\pi_{ij}$, we first determine the sampled probabilities $\tau_{ij}$ using NEQNF of the IMSL libraries for Fortran (version 7, Rogue Wave Software, Inc., Louisville, CO, USA). Then, a multinomial random sample of size n is taken from the sampled population by using the multinomial random number generator GGMTN in the IMSL subroutine library, and we calculate p.
Next, π* is found for each of the four methods. When there is no violation, no adjustment is needed. When there is a violation, the solution is found by using the subroutine LCONG of IMSL.
After we find the estimates $\pi^* = \{\pi^*_i\}$ for either type of constraint, we calculate the root mean squared error of the estimates as $\mathrm{RMSE} = \sqrt{\sum_{i=1}^{4} (\pi_i - \pi^*_i)^2}$, where $\pi_i$ is the true value of the target probability. To provide a more systematic comparison between these four methods, we compute a relative RMSE (RRMSE), defined relative to RMSE*, the root mean squared error of the method that is ML under the model that generated the data, so that RMSE = RMSE* and RRMSE = 0 for each model under its corresponding method. Figures 1-3 give visual comparisons of the methods under each model, for sample sizes n = 30, 100, 1000, respectively. In each figure, the horizontal reference line at 0 RRMSE corresponds to the ML estimates under the model that was used to generate the data.
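The error measures can be sketched in a few lines. The RMSE below follows the root-sum-of-squares form over the four cells; the formula for RRMSE is not reproduced in the text, so the relative form (RMSE − RMSE*) / RMSE* used here is an assumption chosen only because it equals 0 when RMSE = RMSE*. The function names and the numerical values are hypothetical.

```python
import math


def rmse(pi_true, pi_est):
    """Root of the summed squared differences between true and estimated cells."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pi_true, pi_est)))


def relative_rmse(rmse_method, rmse_ml):
    """Assumed relative form: zero when the method matches the ML benchmark."""
    return (rmse_method - rmse_ml) / rmse_ml


# Hypothetical target probabilities and one hypothetical estimate.
pi_true = [0.20, 0.30, 0.10, 0.40]
pi_est = [0.25, 0.25, 0.10, 0.40]
error = rmse(pi_true, pi_est)
```

Under the assumed form, a negative value arises whenever a method has a smaller RMSE than the method that is ML for the generating model.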

Overall RMSE of estimators. A crude comparison of the estimators is presented in Table
Figures 1-3 present RRMSEs for data generated under each of the four models for all 16 problems with n = 30, 100, 1000, respectively. To interpret them, first note that smaller values of $c_1$, $c_2$ mean stronger constraints. In addition, a negative value of RRMSE indicates that the bias from model misspecification is more than offset by a lower variance than that of the method that is ML for the model that generated the data.
Certain reasonable patterns emerge from these figures. Estimates based on the correct model dominate the other methods when the sample size is large or when the constraints are isotonic; here, the bias from model misspecification dominates the RMSE. Results from nonisotonic constraints are more homogeneous. For them, the RRMSE of LSQ was generally larger than that of MLRS.
Panel a of the figures summarizes results for the data generated under the RAKE model. For nonisotonic constraints, RAKE and MLRS performed similarly. For n = 30, 100, LSQ is slightly inferior to the other methods for the nonisotonic constraints with $(c_1, c_2) = (0.4, 0.6)$ but is competitive when $(c_1, c_2) = (0.6, 0.7)$. RAKE is best or close to best, and MCSQ performs worst (except when n = 30), across the nonisotonic cases 1-8. RAKE performs slightly worse in the isotonic cases when n = 30, but is best again when n = 100, 1000.
Panel b of the figures summarizes results for data generated under the LSQ model. For all constraints with n = 1000, LSQ and MLRS performed similarly. For n = 30, 100, LSQ is much inferior to MLRS for the nonisotonic constraints with $(c_1, c_2) = (0.4, 0.6)$ but performs similarly when $(c_1, c_2) = (0.6, 0.7)$. MCSQ performs worst throughout, except for isotonic constraints with n = 30, when all three methods did better than RAKE, but this turned around when n = 100, 1000.
Panel c of the figures summarizes results for the data generated under the MCSQ model. Although LSQ = MLRS for isotonic constraints, LSQ performed much worse than MLRS for nonisotonic constraints. The MCSQ values were close to the LSQ values for all constraints, except for the isotonic designs 9 and 12 with n = 1000, where MCSQ is far off. RAKE performed competitively with MLRS for nonisotonic cases. However, for isotonic constraints, RAKE was outperformed by the other three methods for all n.
Panel d of the figures summarizes results for data generated under the MLRS model. Although MCSQ performed best for isotonic constraints for all n, with nonisotonic constraints MCSQ is beaten by all other methods for n = 100, and by RAKE and MLRS when n = 30. LSQ performed much worse than MLRS for all nonisotonic cases. MLRS performed best for nonisotonic constraints and was close to the best method (MCSQ) for isotonic constraints, for all n.

Applying Four Methods to Real World Data
In this section, we illustrate the four methods studied in this paper using the data from [3] on the second National Health and Nutrition Examination Survey (NHANES II). The data are presented in Table 1.
Table 1a shows that the sample cell and marginal proportions differ considerably from the census values (see Section 1). From the census values for the income case in Table 1a, we see $\pi_{11} + \pi_{12} = 0.3191$. Hence, it is reasonable to consider $\pi_{11} + \pi_{12} \le 0.32$. Observing other similar discrepancies, we consider the three inequality restrictions in (10) for the income case, and the restrictions in (11) for the education case. For each problem, the estimates and RMSEs are computed and the results are displayed in Table 3. We consider the census proportions as the target probabilities $\{\pi_{ij}\}$.
Let the NHANES data be our sample proportions $\{p_{ij}\}$. Let $\{\hat{\pi}_{ij}\}$ be the estimates under the constraints in (10) and, considered separately, under those in (11). Below, we define the adjusted RMSE, the unadjusted RMSE, and the proportional unexplained root mean square error (PURMSE), where "adjusted" means estimates under constraints, and "unadjusted" means unrestricted sample proportions. A larger value of PURMSE means that the adjusted MSEs are smaller than the unadjusted ones, which in turn means that the constraints used give estimates that are quite close to the target values. Table 3 shows that all four methods perform well. Among all methods, LSQ performs comparatively the worst.

Conclusions
The paper [3] compared four margin-adjusting methods using equality constraints of known marginal totals for (I × J) contingency tables. Here, under general linear inequality constraints, theoretical models are proposed for the differences between the sampled and target populations. To compare the performance of these four methods, a simulation is performed for the case of I = J = 2. Based on this simulation, we find that the performance of the methods depends on the specific type of constraints. For nonisotonic constraints, we find RAKE to perform the best, with MLRS being a close second. MCSQ and LSQ perform worse, with MCSQ slightly better than LSQ. These findings parallel those of [3].
However, we also find that the performance ranking of the methods changes for isotonic constraints (tree order). Here, MCSQ and LSQ (= MLRS) perform better than RAKE. In addition, MCSQ is slightly better than LSQ (= MLRS). These results are very different from those of [3].
The theoretical models for the differences between the sampled and target populations, and the corresponding methods and techniques described can be extended to higher dimensions in a similar manner as in Theorem 1.
As opposed to the case under equality restrictions, the distribution of estimators under inequality constraints is not known; hence, their means and standard errors are also not tractable. It is difficult to explain the different behavior of the estimators under isotonic and nonisotonic constraints seen in the simulation. It is well known that estimators under isotonic constraints have special properties of partial order relations.
We fixed the values of $c_1$, $c_2$ in the simulation in keeping with applications. This restricts the choices of $\pi_{ij}$ under inequality constraints. Note that with equality constraints, once we fix $\pi_{11}$, all values of $\pi_{ij}$ are fixed. This is not the case under inequality constraints. The choices of $\pi_{ij}$, $\lambda_i$ have to be such that the optimization problem has a solution.

Figure 1. RRMSEs for data generated under four models. The horizontal reference line at 0 RRMSE corresponds to ML estimates under the model.

Figure 3. RRMSEs for data generated under four models. The horizontal reference line at 0 RRMSE corresponds to ML estimates under the model.

Table 1. (a) Probability of income × urbanity and (b) probability of education × urbanity from NHANES II and the census.

Table 2. Average root mean square error (RMSE) over all 16 designs by method and model that generated the data.

Table 3. Proportional unexplained root mean square error (PURMSE) values for different methods under inequality restrictions for income and education.