Entropy-Based Solutions for Ecological Inference Problems: A Composite Estimator

Information-based estimation techniques are becoming more popular in the field of Ecological Inference. Within this branch of estimation techniques, two alternative approaches can be distinguished. The first is the Generalized Maximum Entropy (GME) approach, based on a matrix adjustment problem where the only observable information is given by the margins of the target matrix. The alternative approach is based on a distributionally weighted regression (DWR) equation. These two approaches have so far been studied as completely separate streams, even though there are clear connections between them. In this paper we present these connections explicitly. More specifically, we show that under certain conditions the generalized cross-entropy (GCE) solution for a matrix adjustment problem and the GME estimator of a DWR equation differ only in terms of the a priori information considered. We then move a step forward and propose a composite estimator that combines the two priors considered in both approaches. Finally, we present a numerical experiment and an empirical application based on Spanish data for the year 2010.


Introduction
Ecological inference (EI) is the process of drawing conclusions about individual-level behavior from aggregate (historically called "ecological") data, when no individual data are available. Situations where the only available data are aggregated at a level other than the level of interest are quite common in many application fields. This is the typical setting for Ecological Inference [1][2][3], Cross-level Inference [4,5], Small Area Estimation [6], or disaggregation methods [7]. The basic idea is that, in order to study the behavior of individuals (or sub-groups of individuals), a microeconomic analysis ought to be carried out using fairly localized individual data, and data aggregated by areal units may be used to investigate the behavior of the individuals comprising those units.
In this paper, we specifically refer to the process of drawing conclusions about individual-level behavior from aggregate data, when no individual data are available or when individual data are incomplete. In this inferential context, one problem is that many different possible relationships at the individual (or subgroup) level can generate the same observations at the aggregate (or group) level [8]. In the absence of individual (or subgroup) level measurements (in the form of survey data), such information needs to be inferred. Estimates of the disaggregated values for the variable of interest can be inferred from aggregate data by using appropriate statistical techniques. However, in many situations, given that the micro-data of interest are not available, the accuracy of any predicted value cannot be verified.
This research focuses on the estimation of disaggregated indicators by subclasses. Assume that we have an indicator, $y_{i\bullet}$, that is observable across the different areas $i = 1, \ldots, T$. Our objective is to disaggregate it into an indicator $y_{ij}$ for the $j = 1, \ldots, K$ different sub-categories (or sub-areas) that make up each class (or area) $i$. The information available for this inference exercise, together with the indicator $y_{i\bullet}$, is another disaggregated indicator $x_{ij}$ that is related to the target indicator $y_{ij}$.
This paper approaches this estimation problem in an attempt to unify two estimation strategies, and it is organized as follows. Section 2 explains the main features of the matrix adjustment following the ideas of the Generalized Cross Entropy (GCE) estimation introduced in [9], whereas Section 3 explains the basics of the Distributionally Weighted Regression (DWR) estimation. Section 4 studies these two strategies under a common approach and proposes a composite prior estimator in line with the Data Weighted Prior (DWP) proposed in [10,11]. The comparative performance of the three techniques is evaluated by means of a numerical experiment in Section 5. Finally, Section 6 presents the main conclusions of the paper.

Matrix-Adjustment and Distributionally Weighted Regression Problems
Within the family of IT estimators, [10] proposed a general solution for the estimation problem described in the introduction, based on minimizing the divergence between the target variable and some prior information. Following this approach, each indicator $y_{ij}$ is treated as a discrete random variable that can take $M$ different values. Defining a supporting vector (for the sake of simplicity assumed common to all the $y_{ij}$) $z = [z_1, z_2, \ldots, z_M]$ that contains the $M$ possible realizations of the targets, with unknown probabilities $p_{ij} = [p_{ij1}, p_{ij2}, \ldots, p_{ijM}]$, $y_{ij}$ can be written as:
$$y_{ij} = \sum_{m=1}^{M} p_{ijm} z_m \tag{1}$$
Alternatively, this idea can be generalized to include an error term, defining each $y_{ij}$ as:
$$y_{ij} = \sum_{m=1}^{M} p_{ijm} z_m + \varepsilon_{ij} \tag{2}$$
In such a case, we assume that the $y_{ij}$ elements come from two sources: a signal that keeps the resemblance with the priors $x_{ij}$, plus a noise term ($\varepsilon_{ij}$). The noise components can be included in order to account for potential spatial heterogeneity and our uncertainty about the target variable. Basically, we represent uncertainty about the realizations of the errors by treating each element $\varepsilon_{ij}$ as a discrete random variable with $L \geq 2$ possible outcomes contained in a convex set $v = \{v_1, \ldots, v_L\}$, which for the sake of simplicity is assumed common to all the $\varepsilon_{ij}$. We also assume that these possible realizations are symmetric around zero ($-v_1 = v_L$). The traditional way of fixing the upper and lower limits of this set is to apply the three-sigma rule [12]. Under these conditions, each $\varepsilon_{ij}$ can be defined as:
$$\varepsilon_{ij} = \sum_{l=1}^{L} w_{ijl} v_l \tag{3}$$
where $w_{ijl}$ is the unknown probability of the outcome $v_l$ for the cell $ij$. Now, the $y_{ij}$ elements can be written as:
$$y_{ij} = \sum_{m=1}^{M} p_{ijm} z_m + \sum_{l=1}^{L} w_{ijl} v_l \tag{4}$$
The solution to the estimation problem is given by the minimization of the Kullback-Leibler (KL) divergence between the posterior distributions $p_{ij}$ and the a priori probabilities $q_{ij} = [q_{ij1}, q_{ij2}, \ldots, q_{ijM}]$. The $q_{ij}$ reflect the information we have on the indicators $x_{ij}$, which are related to our target $y_{ij}$, and are defined by the expression:
$$x_{ij} = \sum_{m=1}^{M} q_{ijm} z_m \tag{5}$$
If we do not have an informative prior, the a priori distributions are specified as uniform, $q_{ijm} = 1/M$, $\forall m = 1, \ldots, M$, which leads to the GME solution. The uniform distribution is usually set as the natural prior $W^0$ for the error terms. Specifically, the constrained minimization problem is given by Equation (6), subject to the constraints in Equations (7) and (8). Restrictions (8) are just normalization constraints, whereas Equation (7) reflects the observable information that we have on the relationship between the aggregates $y_{i\bullet}$ and the indicators $y_{ij}$ through the observable $K$-dimensional vector $C_{\bullet j}$. Denoting by $\hat{y}^0_{ij}$ the solution in the absence of this information, it is given by the indicator $x_{ij}$; i.e., $\hat{y}^0_{ij} = x_{ij}$.
In Golan et al. (1994), the aggregate vectors $y_{i\bullet}$ and $C_{\bullet j}$ are, respectively, the row and column margins of a matrix of inter-industry flows. However, the availability of sample (observable) and out-of-sample (unobservable) information could be different in our estimation problem: in the inter-industry problem it is natural to know $K + T$ data, but in other estimation problems we only have aggregate information across the dimension $T$ through $y_{i\bullet}$. This is the case, for example, if we want to disaggregate the income per capita in each area $i$ ($y_{i\bullet}$) into the income per capita of its sub-populations (men and women, population classified by education levels, etc.), where the weight of each sub-population in the total population is observable, but not the overall income per capita of each sub-group.
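The cross-entropy principle described above can be illustrated on a single cell. The following is a minimal sketch, not the authors' estimator: it assumes that the value of one $y_{ij}$ is observed directly (in the actual problem only aggregates are observed), and the function name `min_cross_entropy` and all numerical values are hypothetical. It relies on the standard result that the distribution minimizing the KL divergence to a prior $q$ under a mean constraint is an exponential tilt of $q$.

```python
import numpy as np

def min_cross_entropy(z, q, target, tol=1e-12, max_iter=200):
    """Return p minimizing sum(p*ln(p/q)) s.t. sum(p) = 1 and p @ z = target."""
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any(q <= 0):
        raise ValueError("this sketch needs strictly positive prior weights")
    if not z.min() < target < z.max():
        raise ValueError("target must lie strictly inside the support")

    def tilted(lam):
        # The minimizer has the form p_m proportional to q_m * exp(lam * z_m).
        logw = np.log(q) + lam * z
        w = np.exp(logw - logw.max())  # subtract the max for numerical stability
        return w / w.sum()

    # The mean of the tilted distribution increases with lam: bracket, then bisect.
    lo, hi = -1.0, 1.0
    while tilted(lo) @ z > target:
        lo *= 2.0
    while tilted(hi) @ z < target:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if tilted(mid) @ z < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilted(0.5 * (lo + hi))

z = np.array([0.0, 10.0, 20.0])          # hypothetical support for one y_ij
q_uniform = np.full(3, 1.0 / 3.0)        # uninformative prior (GME case)
q_informed = np.array([0.2, 0.5, 0.3])   # prior whose mean plays the role of x_ij
x_ij = q_informed @ z                    # prior expectation implied by q_informed
p_gme = min_cross_entropy(z, q_uniform, 12.0)
p_gce = min_cross_entropy(z, q_informed, 12.0)
```

Both `p_gme` and `p_gce` reproduce the same observed value; they differ only in the prior they stay close to, which is the sense in which the GME solution is the uniform-prior case of GCE.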
Sometimes the aggregate $C_{\bullet j}$ is not observable and is replaced by the observed weights given to sub-category $j$ in each area $i$ ($\theta_{ij}$), which define the indicator $y_{i\bullet}$ as the weighted sum:
$$y_{i\bullet} = \sum_{j=1}^{K} \theta_{ij} y_{ij} \tag{9}$$
Additionally, the relation between the target indicators $y_{ij}$ and the prior information $x_{ij}$ is made explicit by means of a functional relationship like:
$$y_{ij} = \alpha_i + \beta_{ij} x_{ij} + \varepsilon_{ij} \tag{10}$$
and, consequently:
$$y_{i\bullet} = \sum_{j=1}^{K} \theta_{ij} \left( \alpha_i + \beta_{ij} x_{ij} + \varepsilon_{ij} \right) \tag{11}$$
Equations (10) and (11) are the starting point of the traditional approach to spatial disaggregation based on a Distributionally Weighted Regression (DWR) of the type proposed in [13,14]. In Equation (10), the unobservable $y_{ij}$ are defined as a linear function of $x_{ij}$, allowing for slope heterogeneity (note that the $\beta_{ij}$ can be different for each area and sub-class), plus an area-specific indicator $\alpha_i$ and an error term $\varepsilon_{ij}$. For the estimation of Equation (10), the same IT-based strategy is followed: for the $M$ possible realizations of each parameter we define the support vector $b = [b_1, b_2, \ldots, b_M]$ (again common to the parameters $\alpha_i$ and $\beta_{ij}$), with unknown probabilities $p^{\alpha}$, $p^{\beta}$ to be recovered. The noise components $\varepsilon_{ij}$ are treated in the same way as in Equation (3).
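The weighted aggregation and the reason the problem is ill-posed can be sketched numerically. This is an illustrative example only: the dimensions, the random seed and all parameter values are hypothetical and not taken from the paper. It builds the cells from a linear relationship with an area effect, aggregates them with weights that sum to one in each area, and then constructs a different set of cells with exactly the same aggregates.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 4, 3                                   # hypothetical numbers of areas and sub-categories

theta = rng.random((T, K))
theta /= theta.sum(axis=1, keepdims=True)     # weights sum to one within each area
x = rng.normal(10.0, 1.0, size=(T, K))        # observable disaggregated indicator

alpha = rng.normal(0.0, 1.0, size=T)          # area-specific effects (hypothetical values)
beta = rng.normal(1.0, 0.1, size=(T, K))      # heterogeneous slopes (hypothetical values)
eps = rng.normal(0.0, 0.1, size=(T, K))       # noise

y = alpha[:, None] + beta * x + eps           # unobservable cells
y_agg = (theta * y).sum(axis=1)               # observable aggregates y_i.

# A perturbation whose theta-weighted sum is zero in every area leaves y_agg unchanged.
d = rng.normal(size=(T, K))
d -= (theta * d).sum(axis=1, keepdims=True)
y_alt = y + d
y_alt_agg = (theta * y_alt).sum(axis=1)
```

Since `y_alt` differs from `y` while `y_alt_agg` equals `y_agg`, the aggregates alone cannot identify the cells, and some additional criterion (here, closeness to a prior) is needed to select one solution.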
Once the respective supporting vectors and the a priori probability distributions are set, the DWR estimation can be carried out by means of the GCE program in Equation (12), subject to the constraints in Equations (13) and (14). For both the parameters and the errors, the supporting vectors usually contain values symmetrically centered on zero. If all the a priori distributions ($q^{\alpha}$, $q^{\beta}$, $W^0$) are specified as uniform, then the GCE solution reduces to the GME one.
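For orientation, a GCE program of this kind can be sketched as follows. This is a hedged reconstruction assembled from the definitions in the text (common support $b$ for the parameters, support $v$ for the errors, weights $\theta_{ij}$), not a transcription of the authors' equations; the indexing of $p^{\alpha}_{im}$ and $p^{\beta}_{ijm}$ and the ordering of the constraints are assumptions.

```latex
\begin{aligned}
\min_{p^{\alpha},\,p^{\beta},\,W}\quad
& \sum_{i=1}^{T}\sum_{m=1}^{M} p^{\alpha}_{im}\ln\frac{p^{\alpha}_{im}}{q^{\alpha}_{im}}
 + \sum_{i=1}^{T}\sum_{j=1}^{K}\sum_{m=1}^{M} p^{\beta}_{ijm}\ln\frac{p^{\beta}_{ijm}}{q^{\beta}_{ijm}}
 + \sum_{i=1}^{T}\sum_{j=1}^{K}\sum_{l=1}^{L} w_{ijl}\ln\frac{w_{ijl}}{w^{0}_{ijl}} \\
\text{subject to}\quad
& y_{i\bullet} = \sum_{j=1}^{K}\theta_{ij}\left[\sum_{m=1}^{M} b_m p^{\alpha}_{im}
 + \left(\sum_{m=1}^{M} b_m p^{\beta}_{ijm}\right) x_{ij}
 + \sum_{l=1}^{L} v_l w_{ijl}\right], \qquad i = 1,\ldots,T \\
& \sum_{m=1}^{M} p^{\alpha}_{im} = 1,\qquad
  \sum_{m=1}^{M} p^{\beta}_{ijm} = 1,\qquad
  \sum_{l=1}^{L} w_{ijl} = 1
\end{aligned}
```

With uniform $q^{\alpha}$, $q^{\beta}$ and $w^{0}$, each divergence term differs from the corresponding negative entropy only by a constant, which is why the program then yields the GME solution.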

Unifying the Two Approaches: A Composite Prior Estimator
In this section, we unify the two previous approaches under a common framework, showing that the matrix adjustment problem introduced in [9] is simply a particular case of a DWR equation (if the available observable information is the same) with not necessarily uniform distributions for $q^{\alpha}$ and $q^{\beta}$. We leave the a priori distribution of the errors $W^0$ out of the discussion because the uniform solution is the most intuitive. We base our explanation on the most common case of supporting vectors with $M \geq 2$ values distributed symmetrically around zero.
Note that the GME solution to the DWR problem starts from a priori distributions which assume that the parameters can take any value as long as they remain within the bounds set by the supports. In contrast, in the solution offered in [9] for the estimation of inter-industry flows, no area-specific (row-specific in terms of the problem discussed there) effect was considered and the prior expectation on $y_{ij}$ is given by the corresponding cell $x_{ij}$. These assumptions can be formulated in terms of the a priori distributions used in the DWR approach, which means that both approaches can be treated as particular cases of a general estimation problem.
The a priori distribution $q^{\alpha}$ can be defined so as to reflect the assumption that there is no area-specific parameter $\alpha_i$ in Equation (10). As opposed to the GME solution to the DWR estimation, where the priors are specified as uniform ($q^{\alpha u}$), we now specify an alternative non-uniform distribution ($q^{\alpha n}$) with a point mass at $b_m = 0$. Similarly, the a priori distribution $q^{\beta}$ should reflect that the uninformative estimate of $y_{ij}$ is the regressor $x_{ij}$. This non-uniform distribution ($q^{\beta n}$), consequently, should be specified so as to fulfil the condition $\hat{y}^0_{ij} = x_{ij}$, or alternatively:
$$\sum_{m=1}^{M} b_m q^{\beta n}_{ijm} = 1 \tag{15}$$
Appendix A illustrates how to specify such an a priori distribution for the simplest case with $M = 2$ values in the supporting vectors.
Having made explicit that, under the same information availability, the two approaches differ only in the a priori distributions specified, it is possible to apply a composite prior estimator that considers both possibilities in the same fashion as in [10,11]. This estimator is very flexible in the assumptions made on the a priori distributions, given that it allows for including both uniform and non-uniform priors. The estimator is called Data Weighted Prior (DWP) because it is the observed information that weighs the two alternative priors considered. Furthermore, the authors of [10] prove that its estimates present relatively lower variance than those obtained from a GCE program.
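The two non-uniform priors can be made concrete with a short sketch. It assumes a two-point support for the slope, where the prior with a given mean is unique, and a three-point support containing zero for the area effect; the helper `two_point_prior` and the numerical values are illustrative and are not taken from Appendix A.

```python
import numpy as np

def two_point_prior(support, target_mean):
    """Unique prior on a two-point support (lo, hi) whose expected value is target_mean."""
    lo, hi = support
    if not lo <= target_mean <= hi:
        raise ValueError("target_mean must lie inside the support")
    q_hi = (target_mean - lo) / (hi - lo)
    return np.array([1.0 - q_hi, q_hi])

b_beta = np.array([-10.0, 10.0])              # M = 2 support for the slope (hypothetical bounds)
q_beta = two_point_prior(b_beta, 1.0)         # prior expected slope equal to one

b_alpha = np.array([-10.0, 0.0, 10.0])        # M = 3 support for the area effect
q_alpha = np.array([0.0, 1.0, 0.0])           # point mass at zero: no area-specific effect

x_ij = 9.7                                    # hypothetical value of the regressor
y0_ij = b_alpha @ q_alpha + (b_beta @ q_beta) * x_ij   # prior prediction of y_ij
```

Under these priors the prediction made before observing any aggregate, `y0_ij`, coincides with the regressor `x_ij`, which is the behavior the matrix-adjustment approach assumes.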
Specifically, the DWP program for our problem is given by the objective function in Equation (16), subject to the constraints in Equations (17) and (18). The $\gamma$ parameters are estimated simultaneously with the rest of the coefficients of the model. Each $\gamma$ measures the weight given to the uniform prior $q^u$ for each parameter and is defined over a support with $H$ points. The uniform priors assign equal probability to every support point, and the same applies to the errors ($w^0_{ijl} = 1/L$).
To understand the logic of this estimator, an explanation of the objective function of the minimization program is required. Note that Equation (16) is divided into four terms. The last term measures the Kullback divergence between the posterior and the prior probabilities for the noise component of the model. The first term quantifies the divergence between the recovered probabilities and the uniform priors for each coefficient, weighted by the corresponding $(1 - \gamma)$. Next, the second element of (16) measures the divergence with respect to the non-uniform priors, weighted by $\gamma$. The third element in (16) relates to the Kullback divergence of the weighting parameters $\gamma$. Equation (16) is minimized subject to the set of constraints in Equations (17) and (18). Again, the restrictions in (17) ensure that the posterior probability distributions of the estimates and the errors are compatible with the observations, and Equations (18) are just normalization constraints.
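The four-term structure of the objective can be sketched for a single coefficient. The code below only evaluates the objective for given probabilities; it does not solve the constrained program. The function names, the argument layout and the example numbers are assumptions, and the weighting follows the term-by-term description in the text: $(1 - \gamma)$ on the divergence from the uniform prior and $\gamma$ on the divergence from the non-uniform prior.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence sum(p * ln(p / q)), with 0 * ln(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def dwp_objective(p, q_uniform, q_nonuniform, p_gamma, q_gamma, gamma_support, w, w0):
    """Evaluate the four terms of a DWP-type objective for one coefficient and one error."""
    gamma = float(np.dot(gamma_support, p_gamma))   # weight recovered from its own support
    term_uniform = (1.0 - gamma) * kl(p, q_uniform)
    term_nonuniform = gamma * kl(p, q_nonuniform)
    term_gamma = kl(p_gamma, q_gamma)
    term_noise = kl(w, w0)
    return term_uniform + term_nonuniform + term_gamma + term_noise

# Hypothetical numbers for a slope with support (-10, 0, 10).
p = np.array([0.30, 0.30, 0.40])
q_u = np.full(3, 1.0 / 3.0)
q_n = np.array([0.25, 0.40, 0.35])                  # non-uniform prior with expected value 1
gamma_support = np.array([0.0, 0.5, 1.0])           # H = 3 support for the weight
p_gamma = np.array([0.2, 0.3, 0.5])
q_gamma = np.full(3, 1.0 / 3.0)
w = w0 = np.full(3, 1.0 / 3.0)                      # errors left at their uniform prior

value = dwp_objective(p, q_u, q_n, p_gamma, q_gamma, gamma_support, w, w0)
```

Note that a prior with an exact point mass gives an infinite divergence for any posterior that puts weight elsewhere, so a numerical implementation would typically need to handle that case separately.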

A Numerical Experiment
The numerical simulation compares the performance of the estimation strategies explained previously when estimating a set of latent indicators ($T \times K$). The target will be the unknown elements $y_{ij}$ (output per worker, income per capita, etc.) that measure the amount of a certain variable $z_{ij}$ per unit of another auxiliary variable $l_{ij}$. The values of the latter are drawn from a normal distribution as $l_{ij} \sim N(20, 2)$, and they define the weights as $\theta_{ij} = l_{ij}/l_{i\bullet}$. We also simulate an observable disaggregated indicator $x_{ij}$, drawn as $x_{ij} \sim N(10, 1)$, related to our unobservable target $y_{ij}$.
In the context of the simulation, we assume that the indicator $y_{ij}$ is generated as a convex combination of two possible schemes:
$$y_{ij} = \delta \left( \alpha_i + \beta_{ij} x_{ij} + \varepsilon_{ij} \right) + (1 - \delta) \left( \eta_{ij} x_{ij} + \varepsilon_{ij} \right) \tag{19}$$
This equation contains two sets of slope parameters, namely $\beta_{ij}$ and $\eta_{ij}$, which relate the regressor $x_{ij}$ to the target $y_{ij}$. Furthermore, a fixed area effect $\alpha_i$ is also included. These parameters have been arbitrarily set to fixed values, which are kept constant across the simulations. The error term $\varepsilon_{ij}$ is drawn as $\varepsilon_{ij} \sim N(0, 0.1)$ and is generated anew in each trial of the experiment.
The first part of the equation ($\alpha_i + \beta_{ij} x_{ij} + \varepsilon_{ij}$) shows that $y_{ij}$ can be generated from a process like the one depicted in (10): a linear function of $x_{ij}$ with slope heterogeneity plus a specific area effect (see 11). The second term ($\eta_{ij} x_{ij} + \varepsilon_{ij}$) does not include any area-specific indicator and assumes that $y_{ij}$ is exclusively affected by $x_{ij}$ (see 2). Equation (19) includes the scalar $\delta$, bounded between 0 and 1, which weighs the two possible sources that generate the variable. If $\delta \to 1$, the first mechanism takes over, and the contrary happens when $\delta \to 0$. Note that if we set $\delta = 1$ we are imposing a data-generating process in line with the assumptions made in the GME program depicted in Equations (12)-(14) for the DWR estimation. On the contrary, setting $\delta = 0$ gives a scenario compatible with the non-uniform priors for the parameters, which reflect the belief that there are no area-specific effects and that the slope parameter is close to 1 (labeled as GCE when the simulation results are shown). Any other value of $\delta$ between these two extreme cases gives a data-generating process that is not fully incorporated in the priors of either alternative. It is in this type of intermediate situation that the composite prior estimator (labeled as DWP in the simulation results) described in Equations (16)-(18) can be useful, because both priors are considered and we let the data speak for themselves and favor the most realistic one.
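The data-generating process can be sketched as follows. The parameter values used in the paper are not reproduced in this text, so the values of `alpha`, `beta` and `eta` below are purely hypothetical; the second argument of each normal distribution is read as a standard deviation, which is also an assumption, as are the function names and dimensions.

```python
import numpy as np

def make_design(T, K, rng):
    """Draw the elements that stay fixed across trials (hypothetical parameter values)."""
    l = rng.normal(20.0, 2.0, size=(T, K))          # auxiliary variable l_ij
    theta = l / l.sum(axis=1, keepdims=True)        # weights theta_ij = l_ij / l_i.
    x = rng.normal(10.0, 1.0, size=(T, K))          # observable indicator x_ij
    alpha = np.linspace(-1.0, 1.0, T)               # area effects (hypothetical)
    beta = rng.uniform(0.8, 1.2, size=(T, K))       # heterogeneous slopes (hypothetical)
    eta = np.ones((T, K))                           # slopes of the second scheme (hypothetical)
    return theta, x, alpha, beta, eta

def draw_trial(design, delta, rng, sigma=0.1):
    """Generate one trial: targets y_ij and observable aggregates y_i. for a given delta."""
    theta, x, alpha, beta, eta = design
    eps = rng.normal(0.0, sigma, size=x.shape)      # new noise in every trial
    with_area_effect = alpha[:, None] + beta * x + eps
    without_area_effect = eta * x + eps
    y = delta * with_area_effect + (1.0 - delta) * without_area_effect
    y_agg = (theta * y).sum(axis=1)
    return y, y_agg

rng = np.random.default_rng(0)
design = make_design(T=10, K=4, rng=rng)
y, y_agg = draw_trial(design, delta=0.5, rng=rng)
```

Only `x`, `theta` and `y_agg` would be passed to an estimator; `y` is kept aside to measure the estimation error afterwards.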
The unobservable indicators generated in (19) will be estimated by the three estimation strategies described in the paper (the DWR, GCE and DWP estimators) with equal amounts of observable information (the aggregates $y_{i\bullet} = \sum_{j=1}^{K} y_{ij} \theta_{ij}$). We have specified a common supporting vector for all the parameters with $M = 3$ points at $b = (-10, 0, 10)$. Similarly, a three-point ($H = 3$) support vector with values 0, 0.5 and 1 has been set for the weighting parameters $\gamma$. For the error terms, a support with $L = 3$ values has been chosen, applying the three-sigma rule with uniform a priori weights.
In the experiment, we compare the performance of the three approaches under different scenarios. Three different dimensions ($T \times K$) of the matrix with the target indicators $y_{ij}$ have been considered, and for each case we arbitrarily set six different values of the scalar $\delta$: 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0. In each of these 18 scenarios, we have carried out 200 trials and computed the mean absolute deviation, in percentage, between our estimates and the real $y_{ij}$. Table 1 shows the results.
Values in each cell of Table 1 report the mean absolute deviation (in %) between the real generated target values and the estimated ones. Values in parentheses show the average bias, in absolute terms (ABIAS), and the figures in brackets show the root of the mean squared errors of the estimates (RMSE).
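The three accuracy measures can be computed along the following lines. The exact formulas are not spelled out in the text, so this sketch makes explicit assumptions: the percentage deviation is taken relative to the true value, ABIAS is the absolute value of the across-trial mean error averaged over cells, and RMSE pools all cells and trials. The function name `accuracy_measures` is hypothetical.

```python
import numpy as np

def accuracy_measures(y_hat, y_true):
    """Return (mean absolute % deviation, ABIAS, RMSE) for arrays of shape (trials, T, K)."""
    err = y_hat - y_true
    mape = 100.0 * np.mean(np.abs(err) / np.abs(y_true))   # mean absolute deviation in %
    abias = np.mean(np.abs(err.mean(axis=0)))              # |mean error per cell|, averaged
    rmse = np.sqrt(np.mean(err ** 2))                      # root mean squared error
    return float(mape), float(abias), float(rmse)

# Toy check with two trials, one area and two sub-categories.
y_true = np.full((2, 1, 2), 10.0)
y_hat = np.array([[[11.0, 9.0]], [[11.0, 11.0]]])
mape, abias, rmse = accuracy_measures(y_hat, y_true)
```

In the toy check, the second cell is overestimated in one trial and underestimated in the other, so it contributes to the percentage deviation and the RMSE but not to the bias.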
Independently of the estimation approach, the numbers in Table 1 show some patterns common to all three. The deviations increase with the value of the scalar $\delta$, given that high values of this scalar give more weight to the part of the data-generating process that includes an area-specific effect, which makes the $y_{ij}$ indicators more difficult to predict. The errors seem more stable across the different sizes of the target matrices.
If we pay attention to the comparative performance of the three approaches evaluated in the experiment, the results indicate (not surprisingly) that, for low values of the scalar $\delta$, the GCE approach seems preferable, given that it does not introduce any area-specific effect and considers the regressor $x_{ij}$ as the best prediction in the absence of observable information. The larger the value of this scalar, the better the relative performance of the GME-DWR approach (based on uniform a priori distributions).
The rule of thumb would be, consequently, to use the former when we suspect that no area-specific effect is present (if the second term in Equation (19) dominates) and to favor the latter otherwise (if the first term is more important). In empirical estimation problems, it is virtually impossible to know beforehand which of the two terms is more important. It is in these situations that the composite prior estimator can be helpful. The DWP approach generally outperforms the competing estimators for intermediate values of $\delta$ (ranging from 0.4 to 0.8). These medium values indicate some degree of uncertainty about the type of process that generates the data to be estimated. Moreover, the DWP approach can be seen as a conservative solution: even when one of the two parts of the process is clearly dominant ($\delta = 0$ or $\delta = 1$), the composite prior does not perform much worse than the best of the three options. The losses in terms of prediction, however, can be larger if we choose one single-prior estimator when the other is the best option (see the first and last rows of Table 1).

An Empirical Application: Obtaining Disaggregated Information on Wages
In order to illustrate the performance of the proposed estimator, we apply it to an empirical problem: disaggregating data on average wages for Spain. The most detailed information about non-agricultural wages in Spain is published in the Wage Structure Survey (Encuesta de Estructura Salarial). The complete version of this survey is conducted by the Spanish Statistical Office (INE) every four years, the 2010 edition being one of the most recent. In intermediate years, however, only partial data are collected and the microdata are not released. If, for example, we want to explore the differences across industries in average wages by gender and type of working day in a year where the complete statistical operation is not conducted, the only information we have is at the aggregate level. This situation happens, for example, in 2011, where the only available data are the aggregates reported in Table 2, which do not allow disaggregated differences between male and female workers to be analyzed depending on the industry they belong to.
In such a context, if the researcher wants to study wage gender gaps across industries, it would be necessary to apply an estimation procedure that produces disaggregated values for this specific year, since the official aggregated data do not allow for this type of analysis. The values in Table 2 provide the aggregates required for applying our DWP estimator. Vector $y$, with dimension ($18 \times 1$) and elements $y_{i\bullet}$, contains the mean wage for each industry, and our estimation target will be the unknown $y_{ij}$ elements, where sub-index $j$ refers to the type of worker (classified into four categories: full-time males, full-time females, part-time males and part-time females). The information in Table 2 is also useful for setting a regressor ($x_{ij}$) for our analysis. In particular, the aggregate mean wages for each type of worker ($x_{\bullet j}$, in the four bottom rows of Table 2) will be used for this purpose, assuming that $x_{ij} = x_{\bullet j}$ for every industry $i$, with $j = 1, \ldots, 4$. The additional information required to define the weights ($\theta_{ij}$) has been taken from the Spanish Labor Force Survey (EPA) corresponding to that year, where we can find information about the number of workers classified by industry, type of working day and gender. With all this information, the DWP estimator has been applied, specifying support vectors identical to those described in the previous section for the numerical simulation, and the estimates obtained are shown in Table 3.
The aggregate information classified by industry in Table 2 displays high variability, ranging from slightly more than EUR 14,000 for the average worker in the Accommodation industry to almost three times as much in Financial and Insurance services. Additionally, the aggregates also show that male workers earned more on average than female workers. Specifically, full-time male workers earned on average around 16% more than their female counterparts, whereas this gap was around 11% in the case of part-time workers. This information, however, does not allow us to check whether these gender differences in wages remain stable across industries. The estimates obtained by the DWP estimator and reported in Table 3 help to shed some light on this matter.
According to the outcomes of the estimation, the gender gap for full-time workers is much larger in economic branches related to mining, manufacturing or construction than in service activities. Furthermore, for the specific case of Education and of Health and social services activities, we estimate a significant positive difference in favor of full-time female workers. Something similar, but to a lesser extent, happens in the case of part-time workers: the mean gender gap in favor of male workers, according to the estimates, is mainly produced by the higher wages received in mining, manufacturing and construction, but in general the activities related to services tend to alleviate this gap. Detecting these differential patterns across industries is possible due to the disaggregated information contained in the estimates, which was partially hidden in the aggregated averages.
Additionally, we have explored how robust the estimates and the patterns found are to modifications of the supporting vectors, which in turn affect the priors, as depicted in Equation (15). The estimates reported in Table 3 correspond to a case where the support vectors have been defined as $b = [-100, 0, 100]$ with $M = 3$, common to the parameters $\alpha_i$ and $\beta_{ij}$. Appendix B reports the same estimates as in Table 3 for support vectors defined as $b = [-10, 0, 10]$ (Table A1) and $b = [-1000, 0, 1000]$ (Table A2), in order to check whether having wider or narrower vectors affects the results. Despite some minor differences in the numerical results, the general patterns seem to be robust to this specification.

Conclusions
In this paper, we have tackled the problem of providing reliable estimates of a target variable in a set of small geographical areas, by showing that under certain conditions the generalized cross-entropy (GCE) solution for a matrix adjustment problem and the GME estimator of a DWR equation differ only in terms of the a priori information considered. A composite estimator that combines the priors considered in both approaches is then proposed, and the performance of the three approaches is evaluated through Monte Carlo experiments.
The proposed method may represent a new basis for recovering estimates at a disaggregated level in the presence of: (i) sampling and response errors; (ii) small samples. Within this framework, minimal distributional assumptions are necessary, and a dual loss function is used to take into account both the estimation precision and the prediction objectives. The choice of the prior is data based and endogenously determined, and the method provides a simple way of introducing and evaluating prior information in the estimation process. The DWP estimation procedure seems to be a promising alternative model-based estimation technique because its implementation involves a minimum outlay of computing, it does not depend on any hypotheses regarding the form of the error distribution in the model, and it produces good results for small-sized samples, especially in the presence of spatial heterogeneity. Finally, theoretical and other non-sample information may be imposed directly on the DWP estimates much more easily than with the classic maximum likelihood and Bayesian estimation techniques.
The results indicate that for low values of the parameter $\delta$ (which weighs the part of the data-generating process that includes an area-specific effect), the GCE approach seems preferable, since it does not introduce any area-specific effect and considers the indicator observed at the area level as the best prediction in the absence of observable information. The larger the value of this scalar, the better the relative performance of the GME-DWR approach (based on uniform a priori distributions).
The working of the proposed estimation procedure has also been illustrated by applying it to the estimation of average wages for the Spanish industries in 2011, classified by gender and type of working day. Our results have shown that the DWP estimation has the potential to obtain disaggregated estimates based on minimal assumptions about the data-generating process.

Table 2. Available information on annual wages by industry, type of working day and gender. Wage Structure Survey, 2011.

Table 3. DWP estimates of disaggregated mean annual wages (EUR) by industry, type of working day and gender, 2011.