Abstract
In an infinite-/super-population (SP) setup, regression analysis of longitudinal data, which involves repeated responses and covariates collected from a sample of independent individuals or from correlated individuals belonging to a cluster such as a household/family, has been intensively studied in the statistics literature over the last three decades. In general, a longitudinal correlation structure, such as an auto-correlation structure for repeated responses from an individual or a two-way cluster–longitudinal correlation structure for repeated responses from the individuals belonging to a cluster/household, is exploited to obtain consistent and efficient regression estimates. However, as opposed to the SP setup, a similar regression analysis for finite population (FP)-based longitudinal or clustered longitudinal data, using a survey sample (SS) taken from the FP based on a suitable sampling design, becomes complex: it requires first defining the FP regression and correlation (both longitudinal and/or clustered) parameters and then estimating them using appropriate sampling weighted-design unbiased (SWDU) estimating equations. Finite sampling inferences, such as predictions of longitudinal changes in FP totals, become much more complex, because it is necessary to predict the non-sampled totals after accommodating the longitudinal and/or clustered longitudinal correlation structures. Our objective in this paper is to deal with this complex FP prediction inference by developing a design cum model (DCM)-based estimation approach. Two competing FP total predictors, namely the design-assisted model-based (DAMB) and design cum model-based (DCMB) predictors, are compared using an intensive simulation study. The regression and correlation parameters involved in these prediction functions are optimally estimated using the proposed DCM-based approach.
Keywords:
clusters-based longitudinal survey sample; design assisted model-based prediction; design cum model-based estimators; design cum model-based total prediction; finite population in a longitudinal setup; finite population in a cluster-based longitudinal setup; individual-based longitudinal survey sample
1. Introduction
Clusters/household-based longitudinal survey data analysis for finite population (FP) inferences is an important research topic. For example, to help develop public policy, to understand the determinants of health, and to understand the relationship between health status and health care use, Statistics Canada conducted the National Population Health Survey (NPHS) to gather information on the health of Canadians. The survey began in 1994, collecting biennial information from selected households/clusters under a state/province/stratum until 2012. The responses can be linear, binary, or multinomial. In this paper, however, we concentrate on the analysis of linear cluster longitudinal data, such as the repeated body mass index (bmi) measures (ranging in general from 18 to 40) collected under the NPHS study from all members of the selected households over a period of time. Notice that the health status of an individual measured by bmi at a given time is likely to depend on (1) the health status at previous times, (2) a household/cluster random effect, and (3) certain time-dependent covariates such as gender, age group, education level, and lifestyle factors like smoking, diet, and physical activity. One may refer to this type of data as (a) a single-stage cluster-based longitudinal survey (SSCLS) sample. This is because the data set consists of repeated responses collected from all individuals belonging to a sample of households/clusters, where the sample is chosen in a single stage from the specified FP containing a large number of households/clusters. In a special case, when the household is considered to be the sampling unit, for example, when repeated responses are collected from the household leader only, one does not need to consider any household/cluster correlation. In such cases, one may refer to the data set as (b) a single-stage individual-based longitudinal survey (SSILS) sample.
In this paper, we study finite sampling inferences using both SSCLS and SSILS samples.
We remark that, except for some general discussion and exploratory analysis [1,2,3,4], there do not appear to be adequate discussions on inferences using the aforementioned SSILS and SSCLS samples. More specifically, to use these SSILS and SSCLS samples for inferences in a model-based approach, one needs to consider an appropriate longitudinal correlation model for the FP-based hypothetical data. In this vein, for linear longitudinal survey data analysis, some studies, such as [5] (Section 2, eqn. 2), assumed that the repeated responses from an individual in the FP follow a random effects-based linear mixed model, where an individual’s common random effect causes an equi-correlation structure among the repeated responses from the individual. This cluster correlation-oriented model, however, fails to accommodate the time-lag-dependent decaying correlations [6] (chps. 2–3) that appear to be more appropriate than an equi-correlation structure for longitudinal responses. Similarly, some studies, such as [7] (Section 7.4; see also the references therein), following [8], have summarized a ‘working’ correlation model and the so-called GEE (generalized estimating equation) estimation approach to fit the longitudinal survey sampled data, an approach that appears to be appropriate for non-longitudinal clustered data and/or classical multivariate data. More specifically, it was suggested by [7] (Section 7.4) to compute the correlations of the longitudinal survey data by using the standard Pearson’s correlation formula, which appears to be a naive approach, as these correlations fail to exhibit any auto-correlations, i.e., correlations that decay as the time lag increases, as is appropriate for longitudinal data. A similar working correlations approach using ‘working’ odds ratio parameters for longitudinal binary survey data was used by [9] (ch. 20), which provides inconsistent regression estimates [10] (ch. 4, Section 4.2).
As a remedy, following [6] (Section 2.2) (see also [11] (Section 3)), we use a general stationary auto-correlation structure-based correlation model to fit the FP-based hypothetical data involving repeated data from independent individuals, and similarly use a familial longitudinal correlation structure-based [6] (Section 3.1) correlation model to fit the FP-based hypothetical data involving repeated data from all members of a cluster/family exhibiting two-way clustered longitudinal correlations, the clusters being independent of each other. These individual-based longitudinal (IL) and cluster-based longitudinal (CL) correlation models for the FP, along with the estimation of the model parameters using the SSILS and SSCLS samples, are provided in Section 3 and Section 4, respectively. The construction of the SSILS and SSCLS samples from their respective FPs is given in Section 2.
Prediction functions for FP totals up to a given time are then constructed by replacing the non-sampled response totals with their model-based as well as design cum model (DCM)-based expectations. These expectations involve regression parameters which are not easy to estimate consistently using the survey sample. More specifically, as the responses in a survey sample are subject to randomness due to both sample selection (as covariates change from sample to sample) and model errors, it is not possible, contrary to some of the existing studies [12,13,14,15], to obtain any valid MUOLS/MUGLS (model unbiased ordinary/generalized least square) estimators for the regression parameters [16]. In Section 3 and Section 4, we thus develop suitable design cum model unbiased (DCMU) estimators for the regression parameters involved in these expectations using the SSILS and SSCLS samples, respectively. Next, in the same Section 3 and Section 4, we use these DCMU estimators to form the DCMU predictions for the FP total at a given time, using the SSILS and SSCLS samples, respectively. We also include an alternative theoretical result for predictions using the model unbiased (MU) function, but replacing the parameters in the expectations with DCMU estimators. We refer to this prediction approach as the design-assisted model-based (DAMB) approach. As expected, a simulation study in Section 5 shows that the DCMU predictors perform better than the DAMB predictors. The paper concludes with a discussion and some concluding remarks in Section 6.
2. Materials: Individual or Cluster-Based Longitudinal Survey Data
In practice, there are many situations where longitudinal data are recorded at the FP level, and it may be of interest to know the longitudinal pattern of a response variable over a small period of time by using a survey sample consisting of longitudinal responses chosen from the targeted FP. However, the nature of the sample would depend on the form of the underlying FP. For example, (a) suppose that an electric power company is interested in analyzing the household power consumption pattern over the last T years for a city with N (e.g., 10,000) households, where the sampling units are individual households. For this purpose, a single-stage individual household-based longitudinal sample (SSILS), say of size n, may be taken from the FP, and the responses of the sampled households, along with covariates, may be used to understand the longitudinal pattern. In other studies, for example, in health-related cases, (b) a health organization, such as Health Canada, as pointed out in the last section, may be interested in the longitudinal pattern of the health condition of all members of the households in a state/province. Suppose that there are K households in the state/province and that the c-th cluster/household has a small number of family members. Notice that these members are correlated and that the underlying correlation structure, unlike in case (a), has to be accommodated for any pattern-change analysis. In this example, one would take an SSCLS (single-stage cluster/household-based longitudinal sample) of size k households and use the sample to understand the longitudinal pattern of the FP.
For a better understanding of the differences between the two aforementioned samples, SSILS and SSCLS, we present them in notational detail, including their respective FPs, as follows. We remark that these samples will be exploited in Section 3 and Section 4, respectively, for a model-based prediction of their FP totals at a given marginal time. As far as the sample selection from the FP is concerned, we use the well-known equal-probability-based SRSWOR (simple random sampling without replacement) in both cases.
2.1. SSILS Sample from Individual-Based FP
Individual-based longitudinal finite population (FP). Let
be a longitudinal FP of N individuals, in which the ith individual contributes a vector of T repeated responses together with a covariates matrix containing the p-dimensional covariate vector recorded at each time t. In reality, this FP is unknown, and hence its data are hypothetical, unless a sample is taken to observe a part of the FP.
Survey sample from the FP using the SRSWOR design. For the prediction of the FP total at a given time t, namely
a sample may be chosen from the FP in (1), as follows:
according to a suitable design, say SRSWOR, as
which is specified through a sample inclusion indicator variable for each individual.
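As a minimal illustration of the SRSWOR selection described above, the following Python sketch draws a sample of n individuals from N and records the selection through 0/1 inclusion indicators. The function name `srswor_indicators`, the variable names, and the values N = 10 and n = 4 are hypothetical choices made for this sketch and are not taken from the paper.

```python
import random

def srswor_indicators(N, n, seed=None):
    """Return 0/1 inclusion indicators for an SRSWOR sample of size n from N units."""
    rng = random.Random(seed)
    # random.sample draws n distinct labels, each subset being equally likely
    sampled = set(rng.sample(range(N), n))
    return [1 if i in sampled else 0 for i in range(N)]

N, n = 10, 4                      # hypothetical FP and sample sizes
delta = srswor_indicators(N, n, seed=1)
weight = N / n                    # inverse of the inclusion probability n/N under SRSWOR
```

Under this design every individual has the same inclusion probability n/N, so a single sampling weight N/n applies to all sampled individuals.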
2.2. SSCLS Sample from Cluster-Based FP
Cluster-based longitudinal finite population (FP). Let
be the targeted finite population, where K denotes the number of independent clusters/households, the size of the c-th cluster being small and fixed, and where the i-th individual of the c-th cluster/household contributes a T-dimensional hypothetical linear response vector containing T potential repeated responses under the finite population. In this setup, at a given time point, the pair-wise hypothetical responses of any two distinct individuals within the c-th cluster are likely to be correlated, as they share an invisible random cluster/household effect, leading to the cross-sectional within-cluster correlations; and the repeated responses from the i-th individual under the c-th cluster at any two distinct time points are also likely to be correlated, maintaining a dynamic dependence relationship, leading to the longitudinal correlations.
Survey sample from the FP using the SRSWOR design. Unlike the construction of the SSILS sample by (3), clusters are now considered to be the primary sampling units. Thus, an SSCLS sample may be chosen from the FP in (5) as follows:
according to a suitable design, say SRSWOR, as
which is specified through a cluster inclusion indicator variable for each cluster.
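The single-stage cluster selection can be sketched in the same way: whole clusters are drawn by SRSWOR and every member of a selected cluster enters the sample. The function `sample_clusters`, the list `sizes`, and the choice of k = 2 clusters are hypothetical and serve only to illustrate the mechanism.

```python
import random

def sample_clusters(cluster_sizes, k, seed=None):
    """Draw k whole clusters by SRSWOR; all members of a drawn cluster are observed."""
    rng = random.Random(seed)
    K = len(cluster_sizes)
    chosen = set(rng.sample(range(K), k))
    delta = [1 if c in chosen else 0 for c in range(K)]   # cluster inclusion indicators
    # (cluster, member) labels of every individual who ends up in the sample
    members = [(c, i) for c in range(K) if delta[c] == 1
               for i in range(cluster_sizes[c])]
    return delta, members

sizes = [3, 2, 4, 3, 2, 5]        # hypothetical household sizes
delta_c, members = sample_clusters(sizes, k=2, seed=7)
```

Because no second-stage subsampling takes place, the number of sampled individuals is random and equals the total size of the selected clusters.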
Our purpose is to develop a suitable prediction function and its estimator for the FP total at a given time point, namely
3. Proposed DCMU Prediction Method Using SSILS Sample
Following (2), the FP total up to time t has the formula
Thus, once the SSILS sample is chosen, this longitudinal cumulative total (LCT) may be expressed in terms of survey sampled and non-sampled LCTs as
where the second term on the right-hand side of (9) is the non-sampled response total up to time t. For prediction inferences, this and similar non-sampled response totals are in general predicted by their model-based expectations. Notice that because the repeated responses for the i-th individual under the FP in (1) are supposed to be longitudinally correlated, we use a super-population correlation model as in Section 3.1 below. Taking expectations under this model, the model unbiased (MU) prediction function for the LCT in (9) has the following form:
However, as the estimation of the expected function in the second term in (10) has to be performed using the sample from (3), the decades-long existing studies (e.g., see [12] for independent data, and [14] (Section 2.6.2), [17] (Section 2.2) for clustered correlated data), conditional on the selected sample, have estimated the expected function by using a sampling sequence that goes directly from the super-population to the sample, obtaining the prediction estimator as
Clearly, this estimation, based on a sequence that treats the sample as though it were taken directly from the super-population, ignores the FP (1) as the source of the sample during the estimation process. Thus, this existing MU prediction approach is flawed, yielding invalid predictions.
As a remedy to the aforementioned anomaly, we propose a design cum model unbiased (DCMU) prediction function and estimate the expectation involved in the prediction function based on the true sampling sequence, from the super-population to the FP and then to the sample, as follows:
yielding the DCMU prediction estimator as
We further remark that, as the estimation of the expected function as in (11) is flawed, one may instead estimate the model-based expected function by using the true sampling sequence. We refer to this modified estimator as the design-assisted model-based (DAMB) prediction estimator, with its formula given by
Note that, because of this validity concern, we no longer follow the MU prediction estimator from (11). As far as the computation of the proposed DCMU prediction estimator in (13) and the DAMB prediction estimator given by (14) is concerned, we demonstrate it by considering a correlation model for the FP (1) data, as in the next section.
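The split of the cumulative total into sampled and non-sampled parts, which underlies all of the prediction functions in this section, can be checked on a toy FP. The sketch below is only an illustration: the response values, the indicator vector, and the function names are hypothetical, and in practice the non-sampled part is unobserved and must be predicted.

```python
def cumulative_total(y, t):
    """Total of all responses up to time t over the whole (hypothetical) FP."""
    return sum(sum(row[:t]) for row in y)

def split_cumulative_total(y, delta, t):
    """Split the cumulative total into sampled and non-sampled contributions."""
    sampled = sum(sum(row[:t]) for row, d in zip(y, delta) if d == 1)
    non_sampled = sum(sum(row[:t]) for row, d in zip(y, delta) if d == 0)
    return sampled, non_sampled

# Toy FP with N = 4 individuals and T = 3 time points (hypothetical values)
y = [[1, 2, 3], [2, 2, 2], [0, 1, 5], [4, 4, 4]]
delta = [1, 0, 1, 0]              # individuals 1 and 3 are sampled
sampled, non_sampled = split_cumulative_total(y, delta, t=2)
```

Only the first component is available from the survey; the prediction approaches discussed here differ in how the second component is replaced by an estimated expectation.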
3.1. Super-Population Longitudinal Auto-Correlation Model
As pointed out in Section 1, the so-called ‘working’ correlation structures used by some studies (e.g., [7] (Section 7.4), and [9] (ch. 20)) fail to accommodate correlations that decay as the time lag increases. As a remedy, in this section, by following [6] (Section 2.2) (see also [11] (Section 3)), we propose a lag-dependent correlation structure, i.e., a super-population correlation model for the longitudinal data in the FP. More specifically, we suggest using a general auto-correlation model, as follows, that accommodates the frequently encountered AR(1) (auto-regressive order 1), MA(1) (moving average order 1), and EQC (equi-correlation) correlation structures, among others. Thus, the hypothetical repeated responses for the ith individual are assumed to follow the correlation model given by
where for all and and R is the lag-dependent auto-correlation matrix defined as
where, for is known to be the ℓth lag auto-correlation. Notice that in (15) is the covariates matrix, as defined in (1), for the and is referred to as the regression parameters vector. Further notice that, as far as the general nature of the lag correlation matrix R in (17) is concerned, as mentioned above, these lag correlations maintain suitable special patterns under the AR(1), MA(1), and EQC models, respectively, as follows:
Thus, computing the correlation matrix R in (17) is sufficient for all of these three (and other similar) processes. The parameters for are lag correlation parameters.
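To make the role of the lag correlations concrete, the following sketch builds a lag-dependent correlation matrix from a vector of lag correlations and shows how the three special structures fit into it. The closed forms used for AR(1), MA(1), and EQC are the standard textbook ones (powers of a correlation index, a single non-zero lag 1 correlation, and a constant correlation, respectively); they are assumed here, and the function names and the index value 0.5 are hypothetical.

```python
import numpy as np

def lag_correlations(structure, rho, T):
    """Lag 1, ..., T-1 correlations under standard AR(1), MA(1), and EQC forms."""
    lags = np.arange(1, T)
    if structure == "AR1":
        return rho ** lags                     # geometric decay with the lag
    if structure == "MA1":
        out = np.zeros(T - 1)
        if T > 1:
            out[0] = rho / (1.0 + rho ** 2)    # only lag 1 is non-zero
        return out
    if structure == "EQC":
        return np.full(T - 1, rho)             # same correlation at every lag
    raise ValueError(f"unknown structure: {structure}")

def correlation_matrix(lag_rho):
    """T x T matrix whose (u, t) entry is the correlation at lag |u - t|."""
    T = len(lag_rho) + 1
    full = np.concatenate(([1.0], lag_rho))    # lag 0 correlation is 1
    idx = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return full[idx]

R_ar1 = correlation_matrix(lag_correlations("AR1", 0.5, 4))
```

Since the matrix depends on the data only through the T − 1 lag correlations, estimating those lag correlations directly avoids having to decide in advance which of the three processes generated the data.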
We remark that, even though the SP model is fitted to the FP data, as in (15)–(17), the super-population regression parameters, along with the lag correlations, cannot be estimated using the FP as a sample, because it is only a hypothetical sample. In reality, these parameters, therefore, have to be estimated as optimally as possible by using the survey sample (SS) constructed in (3). We provide this estimation in Section 3.3. Notice that these parameter estimates will then be used in (13) and (14) to obtain the DCMU and DAMB predictors in order to predict the targeted marginal FP totals. Now, because the super-population model is written as in (15) and (16), we can use the moment properties of the data and rewrite the DCMU and DAMB prediction functions following (13) and (14), respectively, as
where the regression parameter must be estimated by accommodating the lag correlation matrix given by (17).
We further remark that, for infinite-population-based inferences for longitudinal data, many studies (e.g., [11], [6] (chps. 2, 7)) have used the auto-correlation model (17). However, for finite sampling inferences for longitudinal data, this model is not adequately discussed in the literature. On the contrary, to model the correlations of the repeated data from the same individual in a finite population setup, some authors such as [5] (eqns. 2, 3) have used an individual-specific random-effects-based linear mixed model given by
where the model includes a random effect of the ith individual, which is shared by all of that individual’s responses over time. This model, however, produces equal correlations among the repeated responses, and hence, as pointed out in [18] (see also [19]), it fails to accommodate the time effects on correlations. Some other authors, such as [7] (Section 7.4), following the so-called ‘working’ correlations approach of [8], have suggested using an unstructured correlation matrix, in which there are T(T − 1)/2 correlations to compute by using a method of moments. There are, however, many inference issues with this ‘working’ correlation approach. For example, (a) this unstructured correlation matrix ignores the time effects on the repeated responses and hence fails to accommodate the lag effect on the association between two repeated responses. Consequently, unlike the lagged correlation structure shown in (17), which involves only T − 1 lag correlations, this approach amounts to computing a larger number of pairwise correlations. (b) Also, as demonstrated in [6] (Section 6.4), this ‘working’ correlations approach may produce inefficient estimates compared to the simpler ‘working’ independence-based approach, which makes it an unacceptable inference approach for longitudinal data.
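The equi-correlation property of the random-effects model can be verified in one line. The derivation below is a sketch under a generic form of such a mixed model; the symbols $\gamma_i$, $\epsilon_{it}$, $\sigma_\gamma^2$, and $\sigma_\epsilon^2$ are notation assumed here for illustration and need not match the notation of [5].

```latex
% Generic individual-specific random-effects model (assumed notation)
y_{it} = x_{it}^{\top}\beta + \gamma_i + \epsilon_{it},
\qquad \gamma_i \sim (0, \sigma_\gamma^2),
\quad \epsilon_{it} \sim (0, \sigma_\epsilon^2),
\quad \gamma_i \text{ and } \epsilon_{it} \text{ mutually independent}.

% Variance and covariance implied by the shared effect
\operatorname{var}(y_{it}) = \sigma_\gamma^2 + \sigma_\epsilon^2,
\qquad
\operatorname{cov}(y_{iu}, y_{it}) = \sigma_\gamma^2 \quad (u \neq t).

% Hence the correlation is the same at every lag
\operatorname{corr}(y_{iu}, y_{it})
  = \frac{\sigma_\gamma^2}{\sigma_\gamma^2 + \sigma_\epsilon^2}
  \quad \text{for all } u \neq t .
```

The correlation does not involve $|u - t|$, which is why such a model cannot reproduce correlations that decay with the time lag.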
We now proceed to the next section for the estimation of the regression parameters involved in the prediction functions in (21) and (22). This estimation also requires the estimation of the lag correlations. More specifically, in Section 3.2, we demonstrate how one could optimally estimate the regression parameters using the FP (1)-based data, provided that these data were available. However, as these data are not available in practice, these hypothetical estimates turn out to be the so-called FP regression parameters [20,21]. In Section 3.3, we then use the SS from (3) to develop a sampling weighted-design unbiased (SWDU) estimate for the finite-population regression parameters. The SWDU estimator, therefore, becomes the DCMU (design cum model unbiased) estimate for the super-population regression parameter.
3.2. Hypothetical Estimation of the Model Parameters Using FP Data
- Estimation of the regression parameters:
Notice that the estimation of the regression parameter involved in the prediction functions (21) and (22) depends on the repeated data available up to time t, for all t = 1, …, T. Hence, it is convenient to use multiple, time-specific parameters in these prediction functions, one for each t. Note that, for t = 1, the auto-correlation matrix in (17) does not play any role in the estimation. For t = 2, only one lag correlation, namely the lag 1 correlation, would influence the estimation. Similarly, for t = 3, the lag 1 and lag 2 correlations would play a role in the estimation, and so on. Thus, in view of the role of this time length in estimation, we first rewrite the DAMB and DCMU prediction functions from (22) and (21) using these time-specific parameters as follows.
We then obtain an optimal hypothetical estimator of the time-specific regression parameter using the FP (1)-based hypothetical data, as follows.
More specifically, as the FP is assumed to follow the regression model, as in (15) and (16), by writing
for each t, one could obtain an optimal HGLS (hypothetical generalized least square) estimate of the time-specific regression parameter, provided that the FP were available, by solving the underlying HGLS estimating equation as
Notice that this estimate is not computable: it is a hypothetical estimate only, because FP-based data are not available. Thus, it is referred to as the N-dependent FP regression parameter [21], which is yet to be unbiasedly estimated using the survey sample (3). This we do in Section 3.3.2.
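For readers who wish to see what solving such a GLS estimating equation involves computationally, the following sketch pools the individual contributions and solves the resulting linear system. It assumes the usual GLS form with a common correlation matrix R for all individuals (the variance scale cancels from the equation); the function `gls_estimate` and the simulated inputs are hypothetical and are not the paper's data.

```python
import numpy as np

def gls_estimate(X_list, y_list, R):
    """Solve sum_i X_i' R^{-1} (y_i - X_i beta) = 0 for beta."""
    R_inv = np.linalg.inv(R)
    p = X_list[0].shape[1]
    A = np.zeros((p, p))
    b = np.zeros(p)
    for X, y in zip(X_list, y_list):
        A += X.T @ R_inv @ X      # accumulates the weighted cross-products
        b += X.T @ R_inv @ y
    return np.linalg.solve(A, b)

# Hypothetical FP of N = 50 individuals, T = 3 time points, p = 2 covariates
rng = np.random.default_rng(0)
T, p, N = 3, 2, 50
R = np.array([[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]])
beta_true = np.array([1.0, -0.5])
X_list = [rng.normal(size=(T, p)) for _ in range(N)]
y_list = [X @ beta_true for X in X_list]   # noise-free responses for a simple check
beta_hat = gls_estimate(X_list, y_list, R)
```

With noise-free responses the solution reproduces `beta_true`, which serves only as a sanity check of the algebra. With the full FP this quantity would be the FP regression parameter, which is exactly what is unavailable in practice.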
- Estimation of the lag correlations:
Notice that the lag ℓ auto-correlation enters the correlation matrix in (26) (see also (17)). More specifically, its model (15)–(17)-based formula is given by
for all lags ℓ. Thus, if the FP-based data were available, one could use the well-known method of moments and consistently estimate the lag ℓ auto-correlation using the formula
where the regression parameter is replaced by its hypothetical estimate given by (27). Notice that the regression estimate in (27) and the lag correlation estimates in (29) are computed iteratively. Thus, one may use initial values of the lag correlations to obtain the initial regression estimate by (27), which is then used in (29) to obtain the first-step estimates of the lag correlations. The iteration continues until convergence.
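A moment estimate of a lag correlation can be sketched as the ratio of an averaged lagged cross-product of residuals to an averaged squared residual. The form below is one common method-of-moments version and is assumed here; it is not claimed to coincide term by term with (29). The function name and the toy residual arrays are hypothetical.

```python
import numpy as np

def lag_correlation_mm(resid, lag):
    """Moment estimate of the lag correlation from an N x T residual array."""
    resid = np.asarray(resid, dtype=float)
    N, T = resid.shape
    # average product of residuals that are `lag` time points apart
    num = np.sum(resid[:, : T - lag] * resid[:, lag:]) / (N * (T - lag))
    # average squared residual, used as the common variance estimate
    den = np.sum(resid ** 2) / (N * T)
    return num / den

constant = [[1, 1, 1], [-1, -1, -1]]   # residuals that never change over time
alternating = [[1, -1, 1, -1]]         # residuals that flip sign every period
rho1_constant = lag_correlation_mm(constant, 1)
rho1_alternating = lag_correlation_mm(alternating, 1)
```

In an iterative scheme of the kind described above, the residuals would be recomputed from the current regression estimate before each update of the lag correlations.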
3.3. Real Life Estimation of the Model Parameters Using the Survey Sample
We remark that, in the last section, we obtained the formulas for the hypothetical estimates of the super-population regression parameter and lagged correlations. These hypothetical estimates are referred to as the finite-population parameters. Thus, in reality, these parameters must be estimated based on the sampled data. The purpose of this section is to demonstrate how to exploit the survey sample in (3), taken from the FP in (1) by using the SRSWOR sampling design given by (4), in order to obtain design-optimal estimates for the FP parameters. Note that the sampling design in (4) is widely used in practice. However, other designs can also be applied when appropriate.
3.3.1. Estimating Function Approach for Design Unbiased (DU) Estimation of the FP Regression Parameters
Because the FP regression parameter is the solution of the FP-based estimating equation given by (27), for its design-optimal estimation, one needs to develop a sample-based estimating equation such that
(e.g., [21]). To achieve this goal, based on the sampling design from (4), we consider the sampling weight, i.e., the inverse of the inclusion probability, for the selection of the ith individual in the sample from the FP. Now, because the N individuals belonging to the FP are independent, one may follow the structure of the estimating function in (27) and develop the sampling weighted GLS (SWGLS) estimating function as
further producing the SWGLS estimating equation that yields the SWGLS estimate of the FP regression parameter as
which is DU (design unbiased) for the FP regression parameter. This is because the design expectation of the SWGLS estimating function is
which is the same as the FP-based estimating function in (27).
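The design unbiasedness argument can be checked numerically on a small FP by enumerating every possible SRSWOR sample. The sketch below assumes the usual weighted GLS form of the estimating function with sampling weight N/n; all data are simulated and the names (`unit_score`, `fp_function`, `design_mean`) are hypothetical.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
N, n, T, p = 6, 3, 3, 2                      # tiny hypothetical FP and sample size
X = rng.normal(size=(N, T, p))
y = rng.normal(size=(N, T))
R_inv = np.linalg.inv(np.array([[1.0, 0.5, 0.25],
                                [0.5, 1.0, 0.5],
                                [0.25, 0.5, 1.0]]))
beta = np.array([0.3, -0.2])                 # any fixed value of the parameter

def unit_score(i):
    """Contribution of individual i to the GLS estimating function."""
    return X[i].T @ R_inv @ (y[i] - X[i] @ beta)

# FP-based estimating function: sum over all N individuals
fp_function = sum(unit_score(i) for i in range(N))

# Design expectation: average of the weighted sample function over all samples
w = N / n
samples = list(combinations(range(N), n))
design_mean = sum(w * sum(unit_score(i) for i in s) for s in samples) / len(samples)
```

Each individual appears in a fraction n/N of the equally likely samples, so the weight N/n restores its full contribution on average, and `design_mean` agrees with `fp_function` up to rounding error.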
3.3.2. Estimating Function Approach for Design Unbiased (DU) Estimation of the FP Lag Correlations
To obtain the sample-based estimate for the finite population lag correlation parameters given in (29), we use the same estimating function approach as used for the estimation of the regression parameters in (27), with the resulting estimator given by (31). More specifically, one may use the sampling weighted auto-covariance function as a DU estimator for the auto-covariance in the numerator in (29), because it can be shown that
Similarly, it can be shown that
with respect to DU estimation for the variance term in the denominator in (29).
By combining (33) and (34), one then obtains a first-order DU estimator of the lag correlation given in (29) as
Finally, by using the lag correlation estimates from (35), we obtain the final DU regression estimator by (31). Further, because the FP regression parameter from (27) is MU (model unbiased) for the SP regression parameter, it then follows that the estimator from (31) is a DCMU (design cum model unbiased) estimator for the SP regression parameter involved in the prediction functions in (24) and (25). Hence, by using the estimator from (31) for the regression parameter in (24) and (25), we obtain the desired prediction function estimators as
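The common structure of such prediction estimators, namely observed responses for sampled individuals plus fitted regression means for non-sampled individuals, can be sketched as follows. This generic form is an assumption made for illustration and does not distinguish between the DAMB and DCMU versions; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def predict_cumulative_total(y, X, delta, beta_hat, t):
    """Observed responses of sampled units plus fitted means of non-sampled units, up to time t."""
    y = np.asarray(y, dtype=float)           # N x T responses (used only where delta is 1)
    X = np.asarray(X, dtype=float)           # N x T x p covariates, known for all units
    delta = np.asarray(delta, dtype=bool)    # sample inclusion indicators
    sampled_part = y[delta, :t].sum()
    fitted = X[~delta, :t, :] @ beta_hat     # fitted means for non-sampled units
    return sampled_part + fitted.sum()

# Toy example with N = 3, T = 2, p = 1 (hypothetical values)
X = np.array([[[1.0], [2.0]], [[3.0], [4.0]], [[5.0], [6.0]]])
y = np.array([[1.0, 1.0], [9.0, 9.0], [9.0, 9.0]])   # rows 2 and 3 are never read
delta = [1, 0, 0]
beta_hat = np.array([2.0])
total_t2 = predict_cumulative_total(y, X, delta, beta_hat, t=2)
```

The sketch requires covariates for every individual in the FP, which is consistent with the use of covariate information from the sampling frame.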
4. Proposed DCMU Prediction Method Using SSCLS Sample: A Generalization
As described in Section 2, more specifically in Section 2.2, dealing with cluster-based FP inferences requires estimating an additional cluster correlation parameter on top of the regression and longitudinal correlation estimation performed in Section 3. In an infinite population setup, this type of familial/cluster longitudinal data has been discussed extensively in the literature (e.g., [6] (chps. 3, 8, 9)).
Turning back to the FP setup, in Section 4.1 below, we write a cluster-based finite population (FP) and provide a cluster longitudinal super-population (SP) model to fit the FP-based hypothetical data. Similar to the previous section, this SP model fitted to the FP data is utilized to obtain hypothetical estimates for the SP parameters, where these estimates are referred to as the FP parameters [20,21]. This hypothetical estimation is given in Section 4.2. Next, we exploit the SSCLS sample from (6) for DCMU estimation of all parameters, including the cluster regression parameters. We remark here that, in the context of MU prediction, this type of cluster-based FP was used by many authors in the past in a two-stage cluster sampling setup [22,23]. However, as demonstrated in [16], the MU estimation approach is flawed and would result in an invalid MU prediction. On the contrary, similarly to Section 3, in this section, we provide valid DCMU estimation-based prediction functions.
4.1. Cluster-Based Longitudinal FP and Its SP Model
As opposed to the individual-based FP (1) studied in the last section, here we consider a cluster-based, such as household-based, FP involving a large number of independent households, with each member of a household having repeated potential responses in a longitudinal setup. Typically, cluster/household sizes are small. Next, because the FP is unknown or hypothetical, any finite sampling inferences require a sample to be chosen from the FP and exploited for design-optimal inferences about the targeted FP parameters. When the whole small-sized cluster is chosen as a sampling unit, the resulting longitudinal survey-based sample is referred to as the SSCLS (single-stage cluster-based longitudinal survey) sample. In Section 2.2, it was shown how one can construct the sample (6) from the FP defined by (5).
Now, by treating the FP in (5) as a large sample from a longitudinal super-population, its cluster-based hypothetical longitudinal data may be modeled as
equivalently in vector-matrix notations
where, for we write
where is the T-dimensional unit vector and is a general lag correlation matrix, as in (17) under Section 3.1, being referred to as the lag ℓ longitudinal correlation, and is the unit matrix. Further notice that, at every point of time, two individuals, i and j, belonging to the same cluster c are structurally correlated, and hence
have their covariance matrix as
where is the unit matrix. Thus, by combining (42) and (43), we obtain the mean and covariance structure for the longitudinal response vector for the individuals in the cth cluster, namely for as
where unit matrix, and is the longitudinal correlation index parameter representing all lagged correlations, namely
We remark that the proposed cluster-based longitudinal model (40)–(43) is a generalization of the individual-based longitudinal model (15)–(17) under Section 3.1 to the cluster setup. This model (40)–(43) may also be treated as a generalization of the cluster regression model (e.g., [22] (eqns. 3.1, 3.2), [24] (eqn. 2.1), [25] (eqns. 2.1, 2.2), [23] (eqn. 1), and [14] (eqn. 9.11)) to the longitudinal setup. There is also a difference at the cluster level: we consider small-sized clusters, such as households, whereas these studies dealt with larger clusters, prompting the need for two-stage cluster-sampling-based inferences, as opposed to our single-stage cluster-sampling-based inference.
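The two-way correlation structure of a cluster, longitudinal within each individual and cross-sectional between individuals through a shared cluster effect, can be assembled with Kronecker products. The parametrization below (an error variance times the lag correlation matrix for each individual, plus a common cluster-effect variance for every pair of responses in the cluster) is an assumption made for this sketch and may differ from the exact form of (42)–(45). The function and argument names are hypothetical.

```python
import numpy as np

def cluster_covariance(n_c, R, sigma2_eps, sigma2_gamma):
    """Covariance of the stacked response vector of a cluster with n_c members."""
    T = R.shape[0]
    J_T = np.ones((T, T))
    # longitudinal part: each individual has its own sigma2_eps * R block
    within_individual = np.kron(np.eye(n_c), sigma2_eps * R)
    # shared cluster effect: same variance added to every pair of responses
    shared_cluster = sigma2_gamma * np.kron(np.ones((n_c, n_c)), J_T)
    return within_individual + shared_cluster

R = np.array([[1.0, 0.5], [0.5, 1.0]])       # T = 2 with lag 1 correlation 0.5
Sigma_c = cluster_covariance(n_c=2, R=R, sigma2_eps=1.0, sigma2_gamma=0.5)
```

Under this assumed form, two responses of the same individual covary through both the lag correlation and the cluster effect, whereas responses of two different members covary through the cluster effect only.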
4.2. FP Data-Based Hypothetical Estimation Equations
Notice that if the FP data (5) were available, the SP regression parameters could be optimally estimated by exploiting the SP model-based moment properties (44) and (45), more specifically by solving the hypothetical GLS (HGLS) estimating equation as
where the solution denotes the hypothetical GLS estimator of the SP regression parameter, involving the FP-based hypothetical responses up to time T in a cluster setup, which is also referred to as an N-dependent FP regression parameter [16,20,21]. Note that the computation in (46) may be simplified by recalling the formula for the cluster covariance matrix from (45) as
and hence writing
(e.g., [6] (Section 3.1)) where
As the optimal estimation of the regression parameter by (46) depends on the estimates of the cluster variance and longitudinal correlation parameters, we estimate these remaining parameters using an MU estimating equation approach. More specifically, by generalizing the individual-based hypothetical MM (HMM) formula (29) under Section 3.2 to the cluster/household setup (see also [6] (Section 3.1.2)), we obtain the HMM estimating formula for the lag correlations as
with and where is the hypothetical estimate of obtained by the formula in (46) but using cluster-based individuals’ responses up to time for Furthermore, in (48) is an initial estimate of a specialized value of using The formulas for and are given below in (49).
For the purpose of formulating estimate, notice from (47) that the parameter is involved in variances and covariances of the within-cluster responses. Thus, by pooling the -based sum of squares and sum of products and equating to its -based expectation, after some algebra, we obtain
with the initial estimate used in (48), as
The remaining parameter, being a variance parameter for all individuals under all households, has its HMM formula already used in the denominators of (48) and (49). For the sake of completeness, we give its HMM formula as follows:
where
4.3. Survey Sample-Based DU Estimating Equations
Notice that all SP parameter estimates obtained in the last section, namely those of the regression, longitudinal correlation, cluster variance, and variance parameters, computed by (46), (48), (49), and (50), respectively, are FP parameters. Their real-life estimation has to be performed using the sampled data from (6), as well as the rest of the covariate information available from the sampling frame. Now, to obtain DU (design unbiased) estimates for these FP parameters, we simply use the sampling weighted (SW) total for each of the FP-based total functions involved in the formulas from (46) to (50). Thus, by using the sampling weight following (7) (which is the inverse of the inclusion probability for the cth cluster in the sample), we obtain the DU estimates for the aforementioned FP parameters as
where in (54)
and in (53)
4.4. Formulation of the DCMU Prediction Functions Using SSCLS Sample
As the FP is unknown, for the prediction of the FP total at a given time, we first write its LCT (longitudinal cumulative total) and split this total in terms of sampled (ss) and non-sampled (ns) response totals as
which is similar to, but different from, the LCT split in (9) based on the SSILS sample. By following the same technique used in the SSILS setup, more specifically by following (14) and (13) from Section 3, we write the final DAMB and DCMU predictor for the cumulative and marginal totals as follows:
Notice that the regression estimate in (57) and (60) is a DCMU estimate of the SP regression parameter, as in (51), which is computed based on the SSCLS sample from (6). As this estimate depends on the estimates of the longitudinal correlation and cluster correlation parameters, these latter estimates are computed step by step as in (52) and (53), respectively.
5. Prediction Comparison Using Simulation Results
Our objective in this section is to examine the finite sampling performance of the proposed DCMU and DAMB prediction estimators for the FP totals using both SSILS and SSCLS survey data. The precise formulas for these predictors are developed in Section 3.3.2 and Section 4.4 based on the SSILS and SSCLS samples, respectively. As a criterion to understand the performance of the parameter estimators, we checked the amount of bias of an estimator from its true value. However, because a large bias combined with a small standard error indicates the worst performance of an estimator, we have used the percentage relative bias (see (63) below under Section 5.1) as a criterion to compare the performance of the DCMU and DAMB total predictors.
We now proceed to the simulation studies, first for SSILS sample-based prediction and then for SSCLS sample-based prediction. Details, including how the data are generated and the estimators are obtained to compute the prediction functions, are given in Section 5.1 using the SSILS sample and in Section 5.2 using the SSCLS sample.
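As a small illustration of this comparison criterion, the helper below computes a percentage relative bias for a predicted total. Equation (63) is not reproduced in this excerpt, so the absolute-value form used here is an assumption based on the "relative absolute biases" wording of the table captions, and the function name is hypothetical.

```python
def percentage_relative_bias(predicted_total, true_total):
    """Assumed form: 100 * |predicted - true| / |true|."""
    return 100.0 * abs(predicted_total - true_total) / abs(true_total)


# A predicted total of 105 against a true FP total of 100 gives a PRB of 5 percent.
print(percentage_relative_bias(105, 100))  # prints 5.0
```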
5.1. Simulation Study 1: Prediction Performance Using Individual-Based Longitudinal Survey Sample
Recall from Section 3 that, even though the longitudinal responses from the individuals in (1) are hypothetical, in a model-based approach, it is assumed that the repeated responses of an individual are likely to follow an auto-correlation structure given in (15)–(17). In this section, we conduct a simulation study to examine first the performance of a longitudinal correlation-based SWGLS (sampling weighted GLS) approach (31) in estimating the regression parameters over time by using the survey sample given by (3) and (4). We then examine the performance of two competing predictors, namely the DAMB (design assisted model-based) and DCMU (design cum model unbiased) predictors given by (36)–(39), for FP totals over time. We remark that these predictors are formulated using DCMU estimates for the SP regression parameters involved in the prediction functions.
5.1.1. Simulation Design in Steps (S1–S7)
- S1. We specify the SP model (15) with a set of regression parameters as and and its longitudinal correlation structure (17) based on time periods involving lag correlations: and
- S2. We consider three widely used correlation structures, namely AR(1), MA(1), and EQC. As our estimation method is not model-dependent, we estimate the lag correlations and irrespective of the model. For example, when data are generated with the true model, say AR(1), estimates of lag correlations are obtained for the true SP lag correlations and and so on.
- S3. We consider household leaders with responses on household annual electricity consumption for years, in the FP (1), and their four time-independent household-related covariate values with covariates as the size of the household, as two categorical covariates representing three household income levels, more specifically with
- S4. Using and the parameter values, correlation structures, and covariates from steps 1 to 3 above, we generate, for a given the longitudinal responses
- S5. We then choose a sample (3) of size households from the of size using the SRSWOR sampling design, as in (4), along with their responses and covariates The covariate values for the non-sampled individuals are assumed to be known from an underlying sampling frame.
- S6. Finally, the sample from step 5 is used to compute the SWGLS estimate of (FP parameter), namely by (31), and lag correlation estimate for (FP parameter) by (35). These estimates, more specifically are then used to compute the marginal (at a given time t) total prediction estimates by (37) and by (39). The percentage relative biases (PRBs), namely are also computed.
- S7. We then repeat steps 5 and 6 25 times and compute the simulation average of the regression and correlation estimates under three correlation structures, AR(1), MA(1), and EQC, which are reported in Table 1, Table 2 and Table 3, respectively. The simulation average of the prediction estimates, along with their percentage relative biases (PRBs), are reported in Table 4.
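The flow of steps S1 to S7 can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the authors' simulation: the population size, sample size, regression coefficient, and correlation value are hypothetical (the actual values are not reproduced in this excerpt), a single covariate is used, and a plain sampling weighted total stands in for the SWGLS-based predictors (37) and (39). The lag correlations use the standard stationary forms, with MA(1) lag-1 correlation taken as rho / (1 + rho^2).

```python
import math
import random


def lag_correlations(structure, rho, T):
    """Lag 1..T-1 correlations under standard AR(1), MA(1), and EQC forms (assumed)."""
    if structure == "AR1":
        return [rho ** lag for lag in range(1, T)]
    if structure == "MA1":
        return [rho / (1 + rho ** 2)] + [0.0] * (T - 2)
    if structure == "EQC":
        return [rho] * (T - 1)
    raise ValueError(f"unknown structure: {structure}")


def generate_fp(N, T, beta, rho, rng):
    """S3-S4: one time-independent covariate per unit; responses with AR(1) errors."""
    x = [rng.uniform(1, 5) for _ in range(N)]
    y = []
    for i in range(N):
        e = rng.gauss(0, 1)
        row = [beta * x[i] + e]
        for _ in range(1, T):
            # Stationary AR(1) recursion keeps the error variance equal to 1.
            e = rho * e + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
            row.append(beta * x[i] + e)
        y.append(row)
    return x, y


def simulate(N=1000, n=100, T=4, beta=2.0, rho=0.5, reps=25, seed=1):
    rng = random.Random(seed)
    _, y = generate_fp(N, T, beta, rho, rng)
    true_totals = [sum(y[i][t] for i in range(N)) for t in range(T)]
    avg = [0.0] * T
    for _ in range(reps):                    # S7: repeat S5-S6
        s = rng.sample(range(N), n)          # S5: SRSWOR sample of units
        for t in range(T):                   # S6: weighted total at each time
            avg[t] += (N / n) * sum(y[i][t] for i in s) / reps
    prb = [100 * abs(a - tt) / abs(tt) for a, tt in zip(avg, true_totals)]
    return true_totals, avg, prb


true_totals, avg_estimates, prb = simulate()
print([round(p, 3) for p in prb])
```

Replacing the weighted total in the inner loop with a regression-based predictor would bring the sketch closer to the comparison reported in Table 4.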
Table 1. AR(1) correlation structure-based regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each of size chosen from the of size
Table 2. MA(1) correlation structure-based regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each of size chosen from the of size
Table 3. EQC correlation structure-based regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each of size chosen from the of size
Table 4. Design-assisted model-based and design cum model unbiased predictions along with their standard errors (given in parentheses) and percentage relative absolute biases [given in square brackets] for totals over time using 25 samples each of size chosen from the of size
5.1.2. Simulated Prediction Performance for Marginal FP Totals
Note that as the FP total predictors given by (36) and (38) depend on the sample -based estimates (31) for the SP regression parameters for it is therefore important to examine the performance of this estimator in estimating the FP parameter given by (27) corresponding to Furthermore, as in (27) depends on the auto-correlation structure in (26) up to time for we first exploit the FP generated by (15)–(17) to compute the FP regression parameters by (27), as well as the FP lag correlation parameters by (29). These FP parameters corresponding to SP parameter and the FP correlation parameters corresponding to SP correlation parameters for under all three auto-correlation structures, namely, the AR(1), MA(1), and EQC models, are displayed in the upper halves of Table 1, Table 2, and Table 3, respectively. For example, Table 1, based on the AR(1) model, shows four-dimensional (using four covariates) values at as
corresponding to the SP regression parameters
Similarly, FP and SP lag correlations may be interpreted. Next, because the sample -based (see step 5 in the last sub-section) estimation of amounts to the estimation of the final estimates computed by (31) (which is a simulation average based on 25 repetitions) under the AR(1), MA(1), and EQC processes are displayed in the lower halves of Table 1, Table 2, and Table 3, respectively. For example, the aforementioned under the AR(1) process are estimated as
which, together with other similar estimates from Table 2 and Table 3, appear to perform well, reflecting their design unbiasedness for the FP regression parameters
Next, as indicated in step 6 in the last sub-section, the aforementioned sample-based regression estimates are used to compute the design-assisted model-based (DAMB), as well as design cum model-based (DCMB), marginal (at a given time t) total predictors (37) and (39), for predicting/estimating the marginal total The results from Table 4 show that these two predictors appear to perform almost the same, and the estimates are very close to the true FP totals. For example, under the AR(1) process (displayed in the extreme left block), the FP total at time i.e., is predicted by the DAMB predictor as with PRB as and it is predicted by the DCMB predictor as with a slightly different PRB as Thus, they perform almost the same for the targeted prediction. Their equivalent performances under the MA(1) and EQC structures can be interpreted similarly.
5.2. Simulation Study 2: Prediction Performance Using Cluster-Based Longitudinal Survey Sample
In the last section, we examined the prediction performances using the individual-based longitudinal survey sample. As a generalization, we now examine the behavior of the cluster-based FP total predictors developed in (56)–(61) over a longitudinal period. Notice that, as opposed to the individual-based FP (1), a cluster-based FP is given by (5) with its SP model given by (40)–(42). The individuals under the cth cluster are now correlated with the cluster correlation coefficient defined in (45). As an illustration in the present simulation, we have generated a cluster-correlated population and, hence, the sample with for example. As far as longitudinal correlations are concerned, we use the widely used AR(1) structure with leading to 3 lag-dependent correlation coefficients under total time period
We now construct our hypothetical cluster-based as follows.
- Let consist of independent clusters/families. We label these clusters in sequence from 1 to Following the notations from (40), we consider four different family structures (FS1 to FS4) with their family/cluster sizes as follows:
- FS1: each with 2 parents (say, father (F) and mother (M)) and 2 children (C1 and C2);
- FS2: each with 2 parents (F and M) and 1 child (C1);
- FS3: each with 1 parent (F) and 2 children (C1 and C2);
- FS4: each with 1 parent (M) and 2 children (C1 and C2).
As far as the covariates are concerned, we consider three covariates, namely age smoking status of the individual member at initial time point and gender We explain below how we generated the covariates under FS1. The covariates under the remaining structures FS2–FS4 are generated similarly.
- 1. Generation of for FS1: (a) For father’s age, we generated 175 ages from a uniform distribution with range 50–60. (b) For mother’s age, in sequence (following father’s label), we generated one age difference indicator value, say , randomly from seven different age difference indicators and computed the selected mother’s age as . (c) For the age of C1, we used where now was chosen as a randomly selected value from a set of age difference values . (d) For the age of (corresponding to C2), we used the formula with as a randomly selected value from a set of age difference values
- 2. Generation of for FS1: Smoking habits for the members were determined using the Bernoulli (binary) distribution with a smoking probability of say. More specifically, we used
- 3. Generation of for FS1: We considered and to determine gender for both C1 and C2, we used
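The three generation steps for FS1 can be sketched as below. Only the number of families (175) and the father's age range (50–60) come from the text; the age-difference sets, the smoking probability, the 0/1 gender coding, and the direction of the age formulas are hypothetical placeholders, because the corresponding symbols and values are not reproduced in this excerpt.

```python
import random


def generate_fs1_covariates(n_families=175, smoke_rate=0.3, seed=0):
    """Return a list of families; each family is a list of four member dicts."""
    rng = random.Random(seed)
    parent_diffs = [-3, -2, -1, 0, 1, 2, 3]  # hypothetical: seven age-difference values
    c1_gaps = [25, 26, 27, 28, 29, 30]       # hypothetical mother-to-C1 age gaps
    c2_gaps = [1, 2, 3, 4]                   # hypothetical C1-to-C2 age gaps
    families = []
    for _ in range(n_families):
        father_age = rng.uniform(50, 60)                     # step 1(a)
        mother_age = father_age + rng.choice(parent_diffs)   # step 1(b), assumed form
        c1_age = mother_age - rng.choice(c1_gaps)            # step 1(c), assumed form
        c2_age = c1_age - rng.choice(c2_gaps)                # step 1(d), assumed form
        ages = {"F": father_age, "M": mother_age, "C1": c1_age, "C2": c2_age}
        members = []
        for role, age in ages.items():
            if role == "F":
                gender = 1                   # assumed coding: 1 = male
            elif role == "M":
                gender = 0                   # assumed coding: 0 = female
            else:
                gender = rng.randint(0, 1)   # step 3: random gender for children
            members.append({
                "role": role,
                "age": age,
                "smoker": int(rng.random() < smoke_rate),    # step 2: Bernoulli draw
                "gender": gender,
            })
        families.append(members)
    return families


fs1 = generate_fs1_covariates()
print(len(fs1), [m["role"] for m in fs1[0]])
```

The same function could be adapted to FS2–FS4 by dropping the roles that those family structures do not contain.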
Based on the aforementioned covariate values and using their effects as and further using cluster correlation and AR(1) longitudinal correlation process with the responses namely body mass index over a longitudinal period (equivalent to say 2 years) were generated using the SP correlation model (40). Before one can examine the prediction performance of the DAMB (56)–(58) and DCMU (59)–(61) marginal predictors at a given time, it is necessary to first compute the FP data-based regression estimates (FPRE) and then the survey sample ((51)–(53))-based regression estimates (SSRE) after accommodating the cluster correlations (indexed by ) and AR(1) longitudinal correlations indexed by
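One way to picture how cluster correlation and AR(1) longitudinal correlation combine is a shared random effect plus member-level AR(1) errors. The sketch below is an assumption-laden illustration and not the SP model (40): it assumes an additive Gaussian cluster effect, time-constant member means, and hypothetical function names, and it takes the cluster correlation to be the usual variance ratio.

```python
import math
import random


def generate_cluster_responses(means, sigma_gamma, sigma_e, rho, T, rng):
    """Responses[i][t] for one cluster: member mean + shared cluster effect + AR(1) error."""
    gamma = rng.gauss(0, sigma_gamma)  # one effect shared by all members of the cluster
    responses = []
    for mu in means:
        e = rng.gauss(0, sigma_e)
        row = [mu + gamma + e]
        for _ in range(1, T):
            # Stationary AR(1) recursion keeps the error variance at sigma_e ** 2.
            e = rho * e + math.sqrt(1 - rho ** 2) * rng.gauss(0, sigma_e)
            row.append(mu + gamma + e)
        responses.append(row)
    return responses


def cluster_correlation(sigma_gamma, sigma_e):
    """Same-time correlation between two members implied by the assumed model."""
    return sigma_gamma ** 2 / (sigma_gamma ** 2 + sigma_e ** 2)


rng = random.Random(0)
# Hypothetical BMI-like means for a four-member family over T = 4 time points.
family = generate_cluster_responses([27.0, 25.0, 21.0, 20.0], 1.0, 1.0, 0.5, 4, rng)
print(len(family), len(family[0]), cluster_correlation(1.0, 1.0))
```

Conditional on the cluster effect, each member's responses follow the AR(1) lag structure, which matches the conditional description given for the cluster-based setup.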
As far as the SS (6) is concerned, we use SRSWOR and chose 32 families from 175 families under FS1; 12 families from 50 families under FS2; 6 families from 25 families under FS3; and 6 other families from 25 families under FS4. Thus, altogether, we chose clusters/families with sample size from clusters under the FP of size The SP parameter estimates (i.e., the FP correlation and regression parameters) obtained by using the estimating Equations (49) (for cluster correlation ), (48) (for longitudinal correlations), and (46) (for regression parameter estimates) from Section 4.2, and their corresponding sample -based estimates computed by solving the SS-based estimating Equations (51)–(53) from Section 4.3, are provided in Table 5. Sampling was repeated 25 times to compute the sample-based parameter estimates. All sample-based estimates appear to be close to the FP-based estimates. In general, cluster-correlation estimates appear to work well when T is small, as more clusters cause more variation over the longitudinal period. However, the main regression parameter estimates, shown in the bottom rectangular box in Table 5, are not negatively affected by this slight difference in correlation estimates.
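The cluster and individual counts implied by this design follow from the family sizes in FS1–FS4 and the stated numbers of families. The sketch below draws one SRSWOR sample of families within each family structure and tallies those counts; the dictionary layout, function names, and seed are illustrative choices rather than anything taken from the paper.

```python
import random

# (members per family, families in the FP, families sampled), as stated in the text.
FAMILY_STRUCTURES = {
    "FS1": (4, 175, 32),
    "FS2": (3, 50, 12),
    "FS3": (3, 25, 6),
    "FS4": (3, 25, 6),
}


def draw_family_sample(structures, seed=0):
    """SRSWOR of family labels, drawn separately within each family structure."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(range(n_pop), n_samp))
        for name, (_size, n_pop, n_samp) in structures.items()
    }


def count_summary(structures):
    """Cluster and individual counts for the FP and for the sample."""
    return {
        "fp_clusters": sum(p for _s, p, _n in structures.values()),
        "fp_individuals": sum(s * p for s, p, _n in structures.values()),
        "sampled_clusters": sum(n for _s, _p, n in structures.values()),
        "sampled_individuals": sum(s * n for s, _p, n in structures.values()),
    }


sample = draw_family_sample(FAMILY_STRUCTURES)
summary = count_summary(FAMILY_STRUCTURES)
print(summary)
# prints {'fp_clusters': 275, 'fp_individuals': 1000, 'sampled_clusters': 56, 'sampled_individuals': 200}
```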
Table 5. AR(1) correlation structure-based regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each of size chosen from the cluster-based of size
Finally, the regression parameter estimates, both FP- and sample-based from Table 5, are used to compute the DAMB and DCMU prediction estimates by using (57) and (60), respectively. These estimates, along with the actual FP totals at all time points, are displayed in Table 6. The DCMU predictions shown in column 3 appear to have a smaller PRB (percentage relative bias) as compared to the DAMB predictions exhibited in column 1 for and 3, showing the relative superiority of the DCMU predictions over the DAMB predictions in the single-stage cluster setup.
Table 6. Design-assisted model-based and design cum model unbiased predictions, along with their standard errors (given in parentheses) and percentage relative absolute biases [given in square brackets] for totals over time using 25 samples each of size chosen from the cluster-based of size for AR(0.5) model.
6. Discussion and Concluding Remarks
In a finite population (FP) setup, the prediction of an FP total is a difficult problem, as one is required to predict the non-sampled response total well, which is customarily performed by replacing such a non-sampled total with its model-based expectation estimate computed from a survey sample. Following the estimating function approach [21] for independent data, the super-population (SP) model-based (for FP data) regression parameters involved in the model-based expectation (equivalently in the prediction function) may be estimated based on the survey sample using a sampling weighted OLS (ordinary least squares) (SWOLS) estimator. This SWOLS estimator is DCMU (design cum model unbiased) for the SP regression parameter and DU (design unbiased) for the FP parameter [20] corresponding to the SP parameter. However, in this paper, we have considered an FP with independent individuals or households, for example, where each individual or cluster/household member provides a set of longitudinally correlated responses, the members in a household being structurally cluster correlated. Clearly, as opposed to an SP regression model for independent data, in the proposed setup, one requires an SP correlation model, more specifically, longitudinal correlation and combined cluster-longitudinal correlation models. Existing studies based on a so-called ‘working’ correlation model (e.g., [7] (Section 7.4), [8]) have, for example, used an unstructured or standard Pearson correlation model, whereas [5] has used a random effects-based mixed model to accommodate the longitudinal correlations. However, as explained in Section 3.2, these models fail to accommodate the time effects on the correlations. More specifically, they fail to produce a correlation structure with decaying correlations as the time lag between two repeated responses increases. As a remedy, we have considered lag-based correlation models following [11] (Section 3), for example, to incorporate the time effects on the correlations.
Also, in a cluster-based longitudinal setup, we have generalized this lag-based correlation model to a dynamic mixed-model setup where, conditionally, on the cluster random effects, the repeated responses follow a lag-based correlation structure.
The aforementioned correlation structures and their sample-based estimates are discussed in detail, and it is demonstrated how to obtain DCMU (design cum model unbiased) estimators, namely, the SWGLS (sampling weighted GLS) estimators for the regression parameters after accommodating the longitudinal or cluster-longitudinal correlations. Subsequently, such DCMU regression estimators are used to develop design-assisted model-based (DAMB) and DCMB (design cum model-based) prediction functions. Also, the relative performance of these DAMB as well as DCMU predictors for FP total estimation is examined both theoretically and numerically.
In conclusion, this paper advances longitudinal survey data analysis for both individual- and cluster-based FPs. More specifically, it is demonstrated that the DCMU estimation of the parameters involved in a prediction function provides DCMU-valid prediction for the FP total. The step-by-step development of the estimation and prediction methods should be useful to practitioners from statistical agencies such as Statistics Canada and Health Canada, and the Bureau of Statistics, USA, or similar organizations in other countries.
Author Contributions
Conceptualization, B.C.S.; Methodology, B.C.S.; Software, A.M.V.; Formal analysis, A.M.V.; Investigation, A.M.V. and B.C.S.; Writing—Original draft, A.M.V. and B.C.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.
Acknowledgments
The authors would like to thank three reviewers for their comments and suggestions that led to the improvement of this paper.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Binder, D. Longitudinal surveys: Why are these surveys different from all other surveys? Surv. Methodol. 1998, 24, 101–108.
- Lynn, P. Methods for longitudinal surveys. In Methodology of Longitudinal Surveys; Lynn, P., Ed.; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 1–18.
- Smith, P.W.F.; Berrington, A.; Sturgis, P. A comparison of graphical models and structural equation models for the analysis of longitudinal survey data. In Methodology of Longitudinal Surveys; Lynn, P., Ed.; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 381–391.
- Thompson, M.E. Using longitudinal complex survey data. Annu. Rev. Stat. Appl. 2015, 2, 305–320.
- Skinner, C.J.; de Toledo Vieira, M. Variance estimation in the analysis of clustered longitudinal survey data. Surv. Methodol. 2007, 33, 3–12.
- Sutradhar, B.C. Dynamic Mixed Models for Familial Longitudinal Data; Springer: New York, NY, USA, 2011.
- Wu, C.; Thompson, M.E. Sampling Theory and Practice; Springer Nature: Cham, Switzerland, 2020.
- Liang, K.Y.; Zeger, S.L. Longitudinal data analysis using generalized linear models. Biometrika 1986, 73, 13–22.
- Roberts, G.; Ren, Q.; Rao, J.N.K. Using marginal mean models for data from longitudinal surveys with a complex design: Some advances in methods. In Methodology of Longitudinal Surveys; Lynn, P., Ed.; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 351–366.
- Sutradhar, B.C. Longitudinal Categorical Data Analysis; Springer: New York, NY, USA, 2014.
- Sutradhar, B.C.; Das, K. On the efficiency of regression estimators in generalized linear models for longitudinal data. Biometrika 1999, 86, 459–465.
- Bellhouse, D.R. Model-based estimation in finite population sampling. Am. Stat. 1987, 41, 260–262.
- Prasad, N.G.N.; Rao, J.N.K. The estimation of the mean squared error of small-area estimators. J. Am. Stat. Assoc. 1990, 85, 163–171.
- Valliant, R.; Dorfman, A.H.; Royall, R.M. Finite Population Sampling and Inference: A Prediction Approach; John Wiley and Sons, Inc.: New York, NY, USA, 2000.
- Melville, G.J.; Welsh, A.H. Model-based prediction in ecological surveys including those with incomplete detection. Aust. N. Z. J. Stat. 2014, 56, 257–281.
- Sutradhar, B.C. Doubly weighted estimation approach for linear regression analysis with two-stage cluster samples. Sankhya B Indian J. Stat. 2024, 86, 55–90.
- Kennel, T.L.; Valliant, R. Robust variance estimators for generalized regression estimators in cluster samples. Surv. Methodol. 2019, 45, 427–450.
- Jowaheer, V.; Sutradhar, B.C. Analyzing longitudinal count data with overdispersion. Biometrika 2002, 89, 389–399.
- Thall, P.F.; Vail, S.C. Some covariance models for longitudinal count data with overdispersion. Biometrics 1990, 46, 657–671.
- Binder, D. On the variances of asymptotically normal estimators from complex surveys. Int. Stat. Rev. 1983, 51, 279–292.
- Godambe, V.P.; Thompson, M.E. Parameters of super-population and survey population: Their relationships and estimation. Int. Stat. Rev. 1986, 54, 127–138.
- Royall, R.M. The linear least-squares prediction approach to two-stage sampling. J. Am. Stat. Assoc. 1976, 71, 657–664.
- Valliant, R. Generalized variance functions in stratified two-stage sampling. J. Am. Stat. Assoc. 1987, 82, 499–508.
- Isaki, C.T.; Fuller, W.A. Survey design under the regression super-population model. J. Am. Stat. Assoc. 1982, 77, 89–96.
- Scott, A.J.; Holt, D. The effect of two-stage sampling on ordinary least squares methods. J. Am. Stat. Assoc. 1982, 77, 848–854.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).