1. Introduction
Cluster/household-based longitudinal survey data analysis for finite population (FP) inference is an important research topic. For example, to help develop public policy, to understand the determinants of health, and to understand the relationship between health status and health care use, Statistics Canada conducted the National Population Health Survey (NPHS) to gather information on the health of Canadians. The survey began in 1994 and collected biennial information from selected households/clusters within a state/province/stratum until 2012. The responses can be linear, binary, or multinomial. In this paper, however, we concentrate on the analysis of linear cluster longitudinal data, such as the repeated body mass index (BMI) measures (ranging in general from 18 to 40), collected under the NPHS study from all members of the selected households over a period of time. Notice that the health status of an individual measured at a given time is likely to depend on (1) the health status at previous times, (2) the household/cluster random effect, and (3) certain time-dependent covariates such as gender, age group, education level, and lifestyle factors like smoking, diet, and physical activity. One may refer to this type of data as (a) a single-stage cluster-based longitudinal survey (SSCLS) sample. This is because the data set consists of repeated responses collected from all individuals belonging to a sample of households/clusters, where the sample is chosen in a single stage from the specified FP containing a large number of households/clusters. In a special case, when the household is considered to be the sampling unit, i.e., repeated responses are collected from the household leader only, one does not need to consider any household/cluster correlation. In such cases, one may refer to the data set as (b) a single-stage individual-based longitudinal survey (SSILS) sample.
In this paper, we study finite sampling inferences using both SSCLS and SSILS samples.
We remark that, except for some general discussion and exploratory analysis [1,2,3,4], there do not appear to be adequate discussions on inferences using the aforementioned SSILS and SSCLS samples. More specifically, to use these SSILS and SSCLS samples for inferences in a model-based approach, one needs to consider an appropriate longitudinal correlation model for the FP-based hypothetical data. In this vein, for linear longitudinal survey data analysis, some studies, such as [5] (Section 2, eqn. 2), assumed that the repeated responses from an individual in the FP follow a random effects-based linear mixed model, where an individual’s common random effect causes an equi-correlation structure among the repeated responses from the individual. This cluster correlation-oriented model, however, fails to accommodate the time-lag-dependent decaying correlations [6] (chps. 2–3) that appear to be more appropriate than an equi-correlation structure for longitudinal responses. Similarly, some studies, such as [7] (Section 7.4; see also the references therein), following [8], have summarized a ‘working’ correlation model and the so-called GEE (generalized estimating equation) estimation approach to fit the longitudinal survey sampled data, which appear to be appropriate for non-longitudinal clustered data and/or classical multivariate data. More specifically, it was suggested by [7] (Section 7.4) to compute the correlations of the longitudinal survey data by using the standard Pearson’s correlation formula, which appears to be a naive approach, as these correlations fail to exhibit any auto-correlations, i.e., decaying correlations as time lag increases, appropriate for the longitudinal data. A similar working correlations approach using ‘working’ odds ratio parameters for longitudinal binary survey data was used by [9] (ch. 20), which provides inconsistent regression estimates [10] (ch. 4, Section 4.2).
As a remedy, following [6] (Section 2.2) (see also [11] (Section 3)), we use a general stationary auto-correlation structure-based correlation model to fit the FP-based hypothetical data involving repeated data from independent individuals, and similarly use a familial longitudinal correlation structure-based [6] (Section 3.1) correlation model to fit the FP-based hypothetical data involving repeated data from all members in a cluster/family exhibiting two-way clustered longitudinal correlations, the clusters being independent of each other. These individual-based longitudinal (IL) and cluster-based longitudinal (CL) correlation models for the FP, along with the estimation of the model parameters using the SSILS and SSCLS samples, are provided in Section 3 and Section 4, respectively. The construction of the SSILS and SSCLS samples from their respective FPs is given beforehand in Section 2.
Prediction functions for FP totals up to a given time are then constructed by replacing the non-sampled response totals with their model-based as well as design cum model (DCM)-based expectations. These expectations involve regression parameters which are not easy to estimate consistently using the survey sample. More specifically, as the responses in a survey sample are subject to randomness due to both sample selection (as covariates change from sample to sample) and model errors, unlike some of the existing studies [12,13,14,15], it is not possible to obtain any valid MUOLS/MUGLS (model unbiased ordinary/generalized least squares) estimators for the regression parameters [16]. In Section 3 and Section 4, we thus develop suitable DCMU (design cum model unbiased) estimators for the regression parameters involved in these expectations using the SSILS and SSCLS samples, respectively. Next, we use these DCMU estimators in the same Section 3 and Section 4 to form the DCMU predictions for the FP total at a given time, using the SSILS and SSCLS samples, respectively. We also include an alternative theoretical result for predictions using the MU function, but replacing the parameters in the expectations with DCMU estimators. We refer to this prediction approach as the design-assisted model-based (DAMB) approach. As expected, a simulation study in Section 5 shows that the DCMU predictors perform better than the DAMB predictors. The paper concludes with a discussion and some concluding remarks in Section 6.
2. Materials: Individual or Cluster-Based Longitudinal Survey Data
In practice, there are many situations where longitudinal data are recorded at the FP level, and it may be of interest to know the longitudinal pattern of a response variable over a small period of time by using a survey sample consisting of longitudinal responses chosen from the targeted FP. However, the nature of the sample would depend on the form of the underlying FP. For example, (a) suppose that an electric power company is interested in analyzing the household power consumption pattern over recent years for a city with N (e.g., 10,000) households, where the sampling units are individual households. For this purpose, a single-stage individual household-based longitudinal sample (SSILS) may be taken from the FP, and the sampled responses, along with covariates, may be used to understand the longitudinal pattern. In other studies, for example in health-related cases, (b) a health organization, such as Health Canada, as pointed out in the last section, may be interested in knowing the longitudinal pattern of the health condition of all members of the households in a state/province. Suppose that there are K households in the state/province and that each cluster/household has a small number of family members. Notice that these members are correlated and that the underlying correlation structure, unlike in case (a), has to be accommodated for any pattern-change analysis. In this example, one would take an SSCLS (single-stage cluster/household-based longitudinal sample) of size k households and use the sample to understand the longitudinal pattern of the FP.
For a better understanding of the differences between the two aforementioned samples, SSILS and SSCLS, we present them in notational detail, including their respective FPs, as follows. We remark that these samples will be exploited in Section 3 and Section 4, respectively, for a model-based prediction of their totals at a given marginal time. As far as the sample selection from the FP is concerned, we use the well-known equal-probability-based SRSWOR (simple random sampling without replacement) in both cases.
2.1. SSILS Sample from Individual-Based FP
Individual-based longitudinal finite population . Let
be a longitudinal
with
as a vector of
T repeated responses for the
ith individual, and
denote the
covariates matrix with
as the
p-dimensional covariate vector recorded at time
t for the
i-th individual in the FP. In reality, this
is unknown, and hence its data are hypothetical, unless a sample is taken to observe a part of the FP.
Survey sample from using SRSWOR design. For the prediction of the
total at a given time
namely
a sample
may be chosen from the
in (1), as follows:
according to a suitable design, say
as
with
being a sample inclusion indicator variable.
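As an illustration of the SRSWOR selection described above, the following is a minimal Python sketch that draws an individual-based sample and records a 0/1 inclusion indicator for every unit of the FP. The function name `srswor_indicators`, the sample size of 500, and the seed are hypothetical choices made for this example only; the population size of 10,000 echoes the household example in Section 2.

```python
import numpy as np

def srswor_indicators(N, n, rng):
    """Return a 0/1 inclusion-indicator vector for an SRSWOR draw of n units from N."""
    delta = np.zeros(N, dtype=int)
    chosen = rng.choice(N, size=n, replace=False)  # equal-probability draw, no replacement
    delta[chosen] = 1
    return delta

rng = np.random.default_rng(2024)     # seed chosen arbitrarily for reproducibility
N, n = 10_000, 500                    # illustrative sizes (n is an assumption)
delta = srswor_indicators(N, n, rng)

inclusion_prob = n / N                # first-order inclusion probability under SRSWOR
weight = 1.0 / inclusion_prob         # sampling weight N/n, identical for every unit
```

Under SRSWOR every unit has the same inclusion probability n/N, so the sampling weight is the constant N/n; unequal-probability designs would need unit-specific weights.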
2.2. SSCLS Sample from Cluster-Based FP
Cluster-based longitudinal finite population . Let
be the targeted finite population, where
denotes the number of independent clusters/households with their sizes
being the size of the
c-th cluster, which is small and fixed; and
denotes a
T dimensional hypothetical linear response vector containing
T potential repeated responses from the
i-th
individual of the
c-th cluster/household under the finite population. In this setup, at a given time point
the pair-wise hypothetical responses within the
c-th cluster, namely
and
for
are likely to be correlated, as they share an invisible random cluster/household effect leading to the cross-sectional within-cluster correlations; and the repeated responses from the
i-th individual under the
c-th cluster, namely
and
for
are also likely to be correlated, maintaining a dynamic dependence relationship, leading to the longitudinal correlations.
Survey sample from using the SRSWOR design. Unlike the construction of
by (3), clusters are now considered to be the primary sampling units. Thus,
may be chosen from the
in (5) as follows:
according to a suitable design, say
as
with
being a cluster inclusion indicator variable.
Our purpose is to develop a suitable prediction function and its estimator for the
total, namely for
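The single-stage cluster selection described in this section can be sketched as follows: clusters are the sampling units, and every member of a selected cluster enters the sample. This is a minimal sketch only; the function name `sample_clusters`, the number of clusters, the household sizes, and the seed are all hypothetical.

```python
import numpy as np

def sample_clusters(cluster_sizes, k, rng):
    """SRSWOR of k whole clusters; all members of a chosen cluster are sampled."""
    K = len(cluster_sizes)
    chosen = np.sort(rng.choice(K, size=k, replace=False))
    # (cluster index, member index) pairs for every individual in the sample
    members = [(int(c), i) for c in chosen for i in range(cluster_sizes[c])]
    return chosen, members

rng = np.random.default_rng(7)                 # arbitrary seed
cluster_sizes = rng.integers(1, 6, size=200)   # hypothetical household sizes 1..5
chosen, members = sample_clusters(cluster_sizes, k=20, rng=rng)
```

Because the cluster is the sampling unit, the inclusion indicator is defined at the cluster level and is shared by all members of that cluster.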
3. Proposed DCMU Prediction Method Using SSILS Sample
Following (2), the
total up to time
t has the formula
Thus, once the SSILS
is chosen, this longitudinal cumulative total (LCT) may be expressed in terms of survey sampled
and non-sampled
LCTs as
where the second term on the right-hand side of (9) is the
response totals up to time
t. For prediction inferences, this and similar
response totals are predicted in general with their model-based expectations. Notice that because the repeated responses
for the
i-th individual
under the
in (1) are supposed to be longitudinally correlated, we use a super-population correlation model
as in
Section 3.1 below. We may then use
to denote the model-based expectations; hence, the model unbiased (MU) prediction functions for the LCT in (9) have the following forms:
However, as the estimation of the expected function in the second term in (10) has to be performed using the sample
from (3), studies spanning several decades (e.g., see [
12] for independent data with
[
14] (Section 2.6.2), [
17] (Section 2.2) for clustered correlated data), conditional on
have estimated the expected function by using the sampling sequence
obtaining the prediction estimator as
Clearly, this estimation, based on the sequence
(i.e., treating the sample
as though it is taken directly from the super-population
), ignores the FP (1) as the source of the sample during the estimation process. Thus, this existing MU prediction approach is flawed and yields invalid predictions.
As a remedy to the aforementioned anomaly, we propose a design cum model unbiased (DCMU) prediction function and estimate the expectation involved in the prediction function based on the true sampling sequence
as follows:
yielding the DCMU prediction estimator as
We further remark that, as the estimation of the expected function as in (11) is flawed, one may modify the estimation of the expected function by computing the true sampling sequence
-based estimation. We refer to this modified estimator as the design-assisted model-based (DAMB) prediction estimator, with its formula given by
Note that, because of this validity concern, we no longer follow the MU prediction estimator from (11). To demonstrate the computation of the proposed DCMU prediction estimator in (13) and the DAMB prediction estimator in (14), we consider a correlation model for the FP (1) data in the next section.
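To make the sampled/non-sampled split concrete, the following minimal Python sketch shows the generic plug-in structure of a model-based total predictor at a single time point: observed responses are summed as they are, and each non-sampled response is replaced by a fitted linear-model mean. It does not reproduce the exact expressions (13) or (14); the function name `predict_total` and the numerical values are hypothetical, and how the regression estimate is obtained is left outside the sketch.

```python
import numpy as np

def predict_total(y_sampled, X_nonsampled, beta_hat):
    """Sum of observed responses plus fitted means x' beta_hat for non-sampled units."""
    observed_part = np.sum(y_sampled)
    predicted_part = np.sum(X_nonsampled @ beta_hat)
    return float(observed_part + predicted_part)

# Toy illustration with made-up numbers (not from the paper)
y_sampled = np.array([25.0, 30.0])          # responses observed in the sample
X_nonsampled = np.array([[1.0, 2.0],        # covariates of the non-sampled units,
                         [1.0, 3.0]])       # assumed known from the sampling frame
beta_hat = np.array([20.0, 2.0])            # some estimate of the regression parameters
total_hat = predict_total(y_sampled, X_nonsampled, beta_hat)
```

The sketch requires covariates for the non-sampled units, which is why a sampling frame carrying covariate information is needed for prediction of this kind.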
3.1. Super-Population Longitudinal Auto-Correlation Model
As pointed out in
Section 1, the so-called ‘working’ correlation structures used by some studies (e.g., [
7] (Section 7.4), and [
9] (ch. 20)) fail to accommodate the decaying correlation properties as time lag increases. As a remedy, in this section, by following [
6] (Section 2.2) (see also [
11] (Section 3)), we propose a lag-dependent correlation structure, i.e., a super-population
correlation model for the longitudinal data in the
More specifically, we suggest using a general auto-correlation model, as follows, that accommodates frequently encountered so-called AR(1) (auto-regressive order 1), MA(1) (moving average order 1), and EQC (equi-correlation) correlation structures, among others. Thus, the hypothetical repeated responses for the
ith
individual, namely
is assumed to follow the correlation model given by
where
for all
and
and
R is the
lag-dependent auto-correlation matrix defined as
where, for
is known to be the
ℓth lag auto-correlation. Notice that
in (15) is the
covariates matrix, as defined in (1), for the
and
is referred to as the
regression parameters vector. Further notice that, as far as the general nature of the lag correlation matrix
R in (17) is concerned, as mentioned above, these lag correlations maintain suitable special patterns under the AR(1), MA(1), and EQC models, respectively, as follows:
Thus, computing the correlation matrix
R in (17) is sufficient for all of these three (and other similar) processes. The parameters
for
are
lag correlation parameters.
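As an illustration of why a single lag-correlation matrix covers all three processes, the sketch below builds the matrix from a list of lag correlations and then fills that list under AR(1), MA(1), and EQC patterns. The lag formulas used here are the standard textbook forms and are an assumption of this sketch, as are the function names and the scalar parameter `rho`.

```python
import numpy as np

def lag_correlation_matrix(rho_lags):
    """T x T matrix whose (u, t) entry is rho_{|u - t|}, with rho_0 = 1.
    rho_lags holds rho_1, ..., rho_{T-1}."""
    lags = np.concatenate(([1.0], np.asarray(rho_lags, dtype=float)))
    T = lags.size
    index = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return lags[index]

# Standard textbook lag patterns (assumed here, for illustration only)
def ar1_lags(rho, T):
    return [rho ** lag for lag in range(1, T)]           # geometric decay

def ma1_lags(rho, T):
    return [rho / (1.0 + rho ** 2)] + [0.0] * (T - 2)    # only lag 1 is non-zero

def eqc_lags(rho, T):
    return [rho] * (T - 1)                               # constant at every lag

R_ar1 = lag_correlation_matrix(ar1_lags(0.5, 4))
R_ma1 = lag_correlation_matrix(ma1_lags(0.5, 4))
R_eqc = lag_correlation_matrix(eqc_lags(0.5, 4))
```

Estimating the T - 1 lag correlations directly therefore avoids having to decide in advance which of the three processes generated the data.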
We remark that, even though the
model is fitted to the
data, as in (15)–(17), the super-population regression parameters
, along with lag correlations
, cannot, however, be estimated using
as a sample, because it is only a hypothetical sample. In reality, these parameters, therefore, have to be estimated as optimally as possible by using the survey sample (SS)
constructed in (3). We provide this estimation in
Section 3.3. Notice that these parameter estimates will then be used in (13) and (14) to obtain the DCMU and DAMB predictors in order to predict the targeted marginal FP totals. Now, because the super-population model is written as in (15) and (16), we can use the moment properties of the
data and re-write the DCMU and DAMB prediction functions following (13) and (14), respectively, as
where
must be estimated by accommodating the lag correlation matrix
given by (17).
We further remark that, for infinite-population-based inferences for longitudinal data, many studies (e.g., [
11], [
6] (ch. 2, 7)) have used the auto-correlation model (17). However, for finite sampling inferences for longitudinal data, this model is not adequately discussed in the literature. On the contrary, to model the correlations of the repeated data from the same individual in a finite population setup, some authors such as [
5] (eqns. 2, 3) have used an individual specific random-effects-based linear mixed model given by
where
denotes the random effect of the
ith individual which is shared by all responses over time
This model, however, produces equal correlations among the repeated responses, and hence, as pointed out in [
18] (see also [
19]), it fails to accommodate the time effects on correlations. Some other authors, such as [
7] (Section 7.4), following the so-called ‘working’ correlations approach of [
8], have suggested using an unstructured correlation matrix, say
where there are
correlations to compute by using a method of moments. There are, however, many inference issues with this ‘working’ correlation approach. For example, (a) this unstructured correlation matrix ignores the time effects on the repeated responses and hence fails to accommodate the lag effect on the association between two repeated responses. Consequently, unlike the lagged correlation structure
shown in (17), this approach amounts to computing a larger number of pairwise correlations. (b) Also, as demonstrated in [
6] (Section 6.4), this ‘working’ correlations approach may produce inefficient estimates compared to the simpler ‘working’ independence-based approach, which makes it an unacceptable inference approach for longitudinal data.
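Two of the points above can be checked numerically. The sketch below assumes a standard random-intercept model in which the individual effect and the errors are independent, with hypothetical variances `sigma_gamma2` and `sigma_eps2`. It shows that the implied correlation is the same at every lag, and it counts the pairwise correlations needed by an unstructured matrix against the lag correlations needed by a stationary lag structure.

```python
import numpy as np

def random_intercept_correlation(sigma_gamma2, sigma_eps2, T):
    """Correlation matrix implied by a shared random intercept plus independent errors."""
    cov = sigma_gamma2 * np.ones((T, T)) + sigma_eps2 * np.eye(T)
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def n_unstructured(T):
    return T * (T - 1) // 2      # one correlation per pair of time points

def n_lag(T):
    return T - 1                 # one correlation per lag

C = random_intercept_correlation(sigma_gamma2=1.0, sigma_eps2=3.0, T=4)
off_diagonal = C[~np.eye(4, dtype=bool)]   # every entry equals 1 / (1 + 3) = 0.25
```

With T = 4 repeated responses, the unstructured approach needs 6 correlations while the lag structure needs 3, and the gap widens quickly as T grows.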
We now proceed to the next section for the estimation of the regression parameters
, which is involved in the prediction functions in (21) and (22). This estimation also requires the estimation of the lag correlations
, as well for
More specifically, in
Section 3.2, we demonstrate how one could optimally estimate
using the
(1)-based data, provided that these data were available. However, as these data are not available in practice, these hypothetical estimates turn out to be the so-called
regression parameters [
20,
21]. In
Section 3.3, we then use the SS
from (3) to develop a sampling weighted-design unbiased (SWDU) estimate for the finite-population regression parameters. The SWDU estimator, therefore, becomes the DCMU (design cum model unbiased) estimate for the super-population regression parameter
3.2. Hypothetical Estimation of the Model Parameters Using FP Data
Notice that the estimation of the regression parameter
involved in the prediction functions (21) and (22) depends on the available repeated data up to time
for all
Hence, it is convenient to use multiple parameters, namely
for
, in these prediction functions, for all
Note that, for
the auto-correlation matrix
in (17) does not play any role in estimating
For
only one lag correlation, namely
would influence
estimation. Similarly, for
and
would play a role in
estimation, and so on. Thus, in view of the role of this time length in
estimation, we first rewrite the DAMB and DCMU prediction functions from (22) and (21) using
for
as follows.
We then obtain an optimal hypothetical estimator of
using the
(1)-based hypothetical data, as follows.
More specifically, as the
is assumed to follow the
regression model, as in (15) and (16), by writing
for
one could obtain an optimal HGLS (hypothetical generalized least square) estimate of
say
provided that
was available, by solving the underlying HGLS estimating equation as
Notice that this
estimate is not computable: it is only a hypothetical estimate, because
-based data are not available. Thus, it is referred to as the
N-dependent
regression parameter [
21], which is yet to be unbiasedly estimated using the survey sample
(3). This we do in
Section 3.3.2.
Notice that
is the lag
ℓ auto-correlation for
as in (26) (see also (17)). More specifically, its
model (15)–(17)-based formula is given by
for all
with
Thus, if the
-based data were available, one could use the well-known method of moments and consistently estimate
using the formula
where
is the hypothetical estimate of
given by (27). Notice that
in (27) and
in (29) are computed iteratively. Thus, one may use
to obtain the initial value of
by (27), which is then used in (29) to obtain the first-step estimate
for
The iteration continues until convergence.
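The iteration between the GLS step and the method-of-moments step can be sketched as follows. This is a minimal sketch under stated assumptions, not a reproduction of (27) and (29): it assumes a common lag-correlation matrix for all individuals, starts from independence, and uses residual-based moment estimates with divisors N(T - ℓ) for the lag-ℓ auto-covariance and NT for the variance. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def toeplitz_corr(rho_lags):
    """T x T matrix whose (u, t) entry is rho_{|u - t|}, with rho_0 = 1."""
    lags = np.concatenate(([1.0], np.asarray(rho_lags, dtype=float)))
    T = lags.size
    return lags[np.abs(np.subtract.outer(np.arange(T), np.arange(T)))]

def gls_beta(X, y, R):
    """Solve sum_i X_i' R^{-1} (y_i - X_i beta) = 0; X is (N, T, p), y is (N, T)."""
    R_inv = np.linalg.inv(R)
    A = np.einsum("itp,ts,isq->pq", X, R_inv, X)
    b = np.einsum("itp,ts,is->p", X, R_inv, y)
    return np.linalg.solve(A, b)

def moment_lag_correlations(X, y, beta):
    """Residual-based moment estimates of rho_1, ..., rho_{T-1}."""
    resid = y - X @ beta
    N, T = resid.shape
    variance = np.sum(resid ** 2) / (N * T)
    return np.array([
        np.sum(resid[:, : T - lag] * resid[:, lag:]) / (N * (T - lag)) / variance
        for lag in range(1, T)
    ])

def iterate_gls_mm(X, y, max_iter=50, tol=1e-8):
    """Alternate the GLS and moment steps until the lag correlations stabilise."""
    rho = np.zeros(y.shape[1] - 1)            # assumed start: independence
    for _ in range(max_iter):
        beta = gls_beta(X, y, toeplitz_corr(rho))
        rho_new = moment_lag_correlations(X, y, beta)
        converged = np.max(np.abs(rho_new - rho)) < tol
        rho = rho_new
        if converged:
            break
    return beta, rho

# Synthetic data with AR(1)-type errors (rho = 0.5), purely for illustration
rng = np.random.default_rng(0)
N, T = 2000, 4
X = np.stack([np.ones((N, T)), rng.normal(size=(N, T))], axis=-1)
R_true = toeplitz_corr([0.5, 0.25, 0.125])
errors = rng.normal(size=(N, T)) @ np.linalg.cholesky(R_true).T
y = X @ np.array([1.0, 2.0]) + errors
beta_hat, rho_hat = iterate_gls_mm(X, y)
```

In the hypothetical setting of this section, the arrays would hold the data for all N individuals of the FP; the same routine applies unchanged to a sample once sampling weights are introduced.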
3.3. Real Life Estimation of the Model Parameters Using the Survey Sample
We remark that, in the last section, we obtained the formulas for the hypothetical estimates of the super-population regression parameter and lagged correlations. These hypothetical estimates are referred to as the finite-population parameters. Thus, in reality, these parameters must be estimated based on the sampled data. The purpose of this section is to demonstrate how to exploit the survey sample in (3), taken from the FP in (1) by using the SRSWOR sampling design given by (4), in order to obtain design-optimal estimates for the parameters. Note that the sampling design in (4) is widely used in practice. However, other designs can also be applied when appropriate.
3.3.1. Estimating Function Approach for Design Unbiased
(DU) Estimation of
Because
is the solution of the
-based estimating equation
given by (27), for its design optimal estimation, one needs to develop a
-based estimating equation, say
such that
(e.g., [
21]). To achieve this goal, based on the sampling design
from (4), we consider the sampling weight
for the selection of
ith individual in the sample
from
Now, because
N individuals belonging to the
are independent, one may follow the structure of
in (27) and develop the sampling weighted GLS (SWGLS) estimating function
as
further producing the SWGLS estimating equation that yields the SWGLS estimate of
as
which is DU (design unbiased) for the
parameter
This is because
which is the same as the FP-based estimating function in (27).
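The design-unbiasedness argument can be verified by brute force on a tiny population. The sketch below enumerates every possible SRSWOR sample and checks that the design average of the weighted sample total equals the FP total. The per-unit contributions in `g` stand in for the unit-level terms of an estimating function and are made-up numbers; the function name `weighted_total` is hypothetical.

```python
import numpy as np
from itertools import combinations

def weighted_total(values, sample, N):
    """Sampling-weighted total with the constant SRSWOR weight N/n."""
    n = len(sample)
    return (N / n) * sum(values[i] for i in sample)

N, n = 6, 3
g = np.array([1.5, -0.7, 2.2, 0.3, -1.1, 0.9])   # hypothetical per-unit contributions

# Under SRSWOR every subset of size n is equally likely, so the design
# expectation is the plain average over all C(N, n) possible samples.
all_samples = list(combinations(range(N), n))
design_mean = np.mean([weighted_total(g, s, N) for s in all_samples])
fp_total = g.sum()
```

Each unit appears in the same fraction n/N of all samples, which is exactly what the weight N/n compensates for.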
3.3.2. Estimating Function Approach for Design Unbiased
(DU) Estimation of
To obtain the sample
-based estimate for the finite population
lag correlation parameters
given in (29), we use the same estimating function approach as used for the estimation of regression parameters
in (27) with
given by (31). More specifically, one may use the sampling weighted auto-covariance function as a DU estimator for the auto-covariance in the numerator in (29) because it can be shown that
Similarly, it can be shown that
with respect to DU estimation for the variance term in the denominator in (29).
By combining (33) and (34), one then obtains a first-order DU estimator of
given in (29) as
Finally, by using
from (35), we obtain the final DU regression estimator
for
by (31). Further, because
from (27) is MU (model unbiased) for
it then follows that
from (31) is a DCMU (design cum model unbiased) estimator for
involved in the prediction functions in (24) and (25). Hence, by using
from (31) for
in (24) and (25), we obtain the desired prediction function estimators as
4. Proposed DCMU Prediction Method Using SSCLS Sample: A Generalization
As described in
Section 2, more specifically in
Section 2.2, dealing with a cluster-based
inferences would require an additional cluster correlation parameter estimation on top of regression and longitudinal correlation estimation, which we have performed in
Section 3. In an infinite population setup, this type of familial/cluster longitudinal data has been discussed extensively in the literature (e.g., [
6] (chps. 3, 8, 9)).
Turning back to the
setup, in
Section 4.1 below, we write a cluster-based finite population (FP:
) and provide a cluster longitudinal super-population (SP:
) model to fit the
-based hypothetical data. Similar to the previous section, this SP model fitting to the FP data would be utilized to obtain hypothetical estimates for the SP parameters, where these estimates are referred to as the FP parameters [
20,
21]. This hypothetical estimation is given in
Section 4.2. Next, we exploit the SSCLS sample
from (6) for DCMU estimation for all parameters, including the cluster regression parameters. We remark here that, in the context of MU prediction, this type of cluster-based
was used by many authors in the past in a two-stage cluster sampling setup [
22,
23]. However, as demonstrated in [
16], the MU estimation approach is flawed and would result in an invalid MU prediction. On the contrary, similarly to
Section 3, in this section, we provide valid DCMU estimation-based prediction functions.
4.1. Cluster-Based Longitudinal FP and Its SP Model
As opposed to the individual-based FP (1) studied in the last section, here we consider clusters, such as household-based FP
involving a large number of independent households, with each member of a household having repeated potential responses in a longitudinal setup. Typically, cluster/household sizes are small. Next, because the
is unknown or hypothetical, any finite sampling inferences require a sample, say
, to be chosen from the FP and exploited to obtain design-optimal inferences for the targeted FP parameters. When the whole small-sized cluster is chosen as a sampling unit, the resulting longitudinal survey-based sample is referred to as the SSCLS (single-stage cluster-based longitudinal survey) sample. In
Section 2.2, it was shown how one can construct the sample
(6) from
defined by (5).
Now, by treating the
in (5) as a large sample from a longitudinal super-population
its cluster-based hypothetical longitudinal data may be modeled as
equivalently in vector-matrix notations
where, for
we write
where
is the
T-dimensional unit vector and
is a
general lag correlation matrix, as in (17) under
Section 3.1,
being referred to as the lag
ℓ longitudinal correlation, and
is the
unit matrix. Further notice that, at every point of time, two individuals,
i and
j, belonging to the same cluster
c are structurally correlated, and hence
have their covariance matrix as
where
is the unit matrix. Thus, by combining (42) and (43), we obtain the mean and covariance structure for the longitudinal response vector for the individuals in the
cth cluster, namely for
as
where
unit matrix,
and
is the longitudinal correlation index parameter representing all lagged correlations, namely
We remark that the proposed cluster-based longitudinal model (40)–(43) is a generalization of the individual-based longitudinal model (15)–(17) under
Section 3.1 to the cluster setup. This model (40)–(43) also may be treated as a generalization of the cluster regression model (e.g., [
22] (eqns. 3.1, 3.2), [
24] (eqn 2.1), [
25] (eqns. 2.1, 2.2), [
23] (eqn. 1), and [
14] (eqn. 9.11)) to the longitudinal setup. There is also a difference at the cluster level: we are considering small-sized clusters, such as households, whereas these studies dealt with larger clusters, prompting the need for two-stage cluster-sampling-based inferences, as opposed to our single-stage cluster-sampling-based inference.
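The two-way correlation structure of a cluster can be illustrated with a Kronecker product. The sketch below rests on one reading of the model (40)–(43), which is an assumption: a cluster effect with variance `sigma_gamma2` shared by every response in the cluster, plus within-individual errors with covariance `sigma_eps2` times the lag-correlation matrix. The function name and the parameter values are hypothetical.

```python
import numpy as np

def cluster_covariance(n_c, R, sigma_gamma2, sigma_eps2):
    """Covariance of the stacked (n_c * T) response vector of one cluster.
    Shared cluster effect: sigma_gamma2 on every entry.
    Within-individual errors: sigma_eps2 * R on each diagonal T x T block."""
    T = R.shape[0]
    unit_matrix = np.ones((n_c * T, n_c * T))
    return sigma_gamma2 * unit_matrix + sigma_eps2 * np.kron(np.eye(n_c), R)

# AR(1)-type lag correlations with rho = 0.5 and T = 3 (illustrative values)
R = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])
Sigma_c = cluster_covariance(n_c=2, R=R, sigma_gamma2=1.0, sigma_eps2=2.0)
```

Under this reading, two members of the same cluster are correlated only through the shared cluster effect, while responses of the same member also carry the lag-dependent longitudinal correlation.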
4.2. FP Data-Based Hypothetical Estimation Equations
Notice that if the FP
data (5) were available, the SP regression parameters
could be optimally estimated by exploiting the SP model-based moment properties (44) and (45), more specifically by solving the hypothetical GLS (HGLS) estimating equation as
where
is used to denote the hypothetical GLS estimator of
involving the
-based hypothetical responses up to time
T in a cluster setup, which is also denoted as
an
N-dependent FP regression parameter [
16,
20,
21]. Note that the computation for
in (46) may be simplified by recalling the formula for
from (45) as
and hence writing
(e.g., [
6] (Section 3.1)) where
As the optimal estimation of
by (46) depends on the estimates of cluster variance and longitudinal correlation parameters, we estimate the rest of these parameters using an MU estimating equation approach. More specifically, by generalizing the individual-based hypothetical MM (HMM) Formula (29) under
Section 3.2 to the cluster/household setup (see also [
6] (Section 3.1.2)), we obtain the HMM estimating formula for
as
with
and where
is the hypothetical estimate of
obtained by the formula in (46) but using cluster-based individuals’ responses up to time
for
Furthermore,
in (48) is an initial estimate of
a specialized value of
using
The formulas for
and
are given below in (49).
For the purpose of formulating
estimate, notice from (47) that the
parameter is involved in variances and covariances of the within-cluster responses. Thus, by pooling the
-based sum of squares and sum of products and equating to its
-based expectation, after some algebra, we obtain
with the initial
estimate used in (48), as
The remaining parameter
as it is a variance parameter for all individuals under all households, has its HMM formula already used in the denominators of (48) and (49). For the sake of completeness, we give its HMM formula as follows:
where
4.3. Survey Sample-Based DU Estimating Equations
Notice that all SP parameter estimates obtained in the last section, namely
and
computed by (46), (48), (49), and (50), respectively, are all FP parameters. Their real life estimation has to be performed using the sampled data
from (6), as well as the rest of the covariates information available from the sampling frame. Now, to obtain DU (design unbiased) estimates for these FP parameters, we simply use the sampling weighted (SW) total for each of the FP-based total functions involved in the formulas from (46) to (50). Thus, by using the sampling weight, say
following (7) (which is the inverse of the inclusion probability for the
cth cluster in the sample), we obtain the DU estimates for the aforementioned FP parameters as
where in (54)
and in (53)
4.4. Formulation of the DCMU Prediction Functions Using SSCLS Sample
As the
is unknown, for the prediction of the
total at a given time
we first write its LCT (longitudinal cumulative total) and split this total in terms of sampled (ss) and non-sampled (ns) response totals as
which is similar to, but different from, the LCT split in (9) based on the SSILS sample. By following the same technique used in the SSILS setup, more specifically by following (14) and (13) from
Section 3, we write the final DAMB and DCMU predictor for the cumulative and marginal totals as follows:
Notice that the regression estimate used in (57) and (60) is the DCMU estimate in (51), which is computed based on the SSCLS sample from (6). As this estimate depends on the estimates of the longitudinal correlation and cluster correlation parameters, these latter estimates were computed step by step as in (52) and (53), respectively.
5. Prediction Comparison Using Simulation Results
Our objective in this section is to examine the finite sampling performance of the proposed DCMU and DAMB prediction estimators for the FP totals using both SSILS and SSCLS survey data. The precise formulas for these predictors are developed in
Section 3.3.2 and
Section 4.4 based on the SSILS and SSCLS samples, respectively. As a criterion to understand the performance of the parameter estimators, we checked the amount of bias of an estimator from its true value. However, as a large bias combined with a small standard error indicates poor performance of an estimator, we have used the percentage relative bias (see (63) below under
Section 5.1) as a criterion to compare the performance of the DCMU and DAMB total predictors.
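As a small illustration of the comparison criterion, the sketch below computes a percentage relative bias for a predicted total. The formula is assumed here to be the absolute difference between the predicted and true totals, expressed as a percentage of the true total; the function name and the numbers are hypothetical.

```python
def percentage_relative_bias(predicted_total, true_total):
    """Absolute relative deviation of a predicted total, in percent (assumed form)."""
    return 100.0 * abs(predicted_total - true_total) / abs(true_total)

prb = percentage_relative_bias(105.0, 100.0)   # made-up totals for illustration
```

Expressing the bias relative to the true total makes predictors comparable across time points whose totals differ in magnitude.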
We now proceed to the simulation studies, first for SSILS sample-based prediction and then for SSCLS sample-based prediction. Details, including how the data are generated and the estimators are obtained to compute the prediction functions, are given in
Section 5.1 using the SSILS sample and in
Section 5.2 using the SSCLS sample.
5.1. Simulation Study 1: Prediction Performance Using Individual-Based Longitudinal Survey Sample
Recall from
Section 3 that, even though the longitudinal responses from the individuals in
(1) are hypothetical, in a model-based approach, it is assumed that the repeated responses of an individual are likely to follow an auto-correlation structure given in (15)–(17). In this section, we conduct a simulation study to examine first the performance of a longitudinal correlation-based SWGLS (sampling weighted GLS) approach (31) in estimating the
regression parameters over time by using the survey sample
given by (3) and (4). We then examine the performance of two competing predictors, namely the DAMB (design-assisted model-based) and DCMU (design cum model unbiased) predictors given by (36)–(39), for FP
totals over time. We remark that these predictors are formulated using DCMU estimates for SP
regression parameters involved in the prediction functions.
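The simulation below uses three lag-correlation structures: AR(1), MA(1), and EQC. As a minimal sketch of what these imply for the lag correlations, the following Python helper uses the standard textbook forms. Equations (15)–(17) are not reproduced in this section, so the single-parameter form, the argument name `rho`, and the function names are assumptions made for illustration and may differ from the paper's notation.

```python
def lag_correlations(structure: str, rho: float, T: int) -> list[float]:
    """Return the lag correlations [rho_1, ..., rho_{T-1}] for T time points.

    Standard textbook forms (assumed here, not copied from the paper):
      AR1: rho_l = rho**l              (decays with the lag l)
      MA1: rho_1 = rho / (1 + rho**2)  and rho_l = 0 for l > 1
      EQC: rho_l = rho                 (same at every lag)
    """
    lags = range(1, T)
    if structure == "AR1":
        return [rho ** l for l in lags]
    if structure == "MA1":
        return [rho / (1 + rho ** 2) if l == 1 else 0.0 for l in lags]
    if structure == "EQC":
        return [rho for _ in lags]
    raise ValueError(f"unknown structure: {structure}")


def correlation_matrix(lag_corr: list[float], T: int) -> list[list[float]]:
    """Build the T x T matrix whose (u, t) entry is the lag-|u - t| correlation."""
    rhos = [1.0] + list(lag_corr)
    return [[rhos[abs(u - t)] for t in range(T)] for u in range(T)]
```

With four time points, each structure yields three lag correlations, and only the AR(1) and MA(1) forms let the correlation change with the lag.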
5.1.1. Simulation Design in Steps (S1–S7)
- S1.
We specify the SP model (15) with a set of regression parameters as and and its longitudinal correlation structure (17) based on time periods involving lag correlations: and
- S2.
We consider three widely used correlation structures, namely AR(1), MA(1), and EQC. As our estimation method is not model-dependent, we estimate the lag correlations and irrespective of the model. For example, when data are generated with a true model, say AR(1), estimates of lag correlations are obtained for the true SP lag correlations and and so on.
- S3.
We consider
household leaders with responses on household annual electricity consumption for
years, in the FP
(1), and their four time-independent household-related covariate values with covariates
as the size of the household,
as two categorical covariates representing three household income levels, more specifically with
- S4.
Using and the parameter values, correlation structures, and covariates from steps 1 to 3 above, we generate, for a given the longitudinal responses
- S5.
We then choose a sample (3) of size households from the of size using the SRSWOR sampling design, as in (4), along with their responses and covariates The covariate values for the non-sampled individuals are assumed to be known from an underlying sampling frame.
- S6.
Finally, the sample
from step 5 is used to compute the SWGLS estimate of
(FP parameter), namely
by (31), and lag correlation estimate
for
(FP parameter) by (35). These estimates, more specifically
are then used to compute the marginal (at a given time
t) total prediction estimates
by (37) and
by (39). The percentage relative biases (PRB), namely
are also computed.
- S7.
We then repeat steps 5 and 6 a total of 25 times and compute the simulation averages of the regression and correlation estimates under the three correlation structures, AR(1), MA(1), and EQC, which are reported in Table 1, Table 2, and Table 3, respectively. The simulation averages of the prediction estimates, along with their percentage relative biases (PRBs), are reported in Table 4.
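The steps S1–S7 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's procedure. It uses one covariate and plain least squares at each time point instead of the SWGLS estimator (31). The predictor adds the sampled responses to fitted values for the non-sampled units, which follows the general prediction idea but is not formula (37) or (39). The PRB is taken as 100 × (estimate − truth) / truth because (63) is not reproduced here. The population size, sample size, parameter values, and covariate range are all hypothetical.

```python
import random
import statistics


def ar1_errors(T, rho, sigma, rng):
    """Stationary AR(1) errors whose lag-l correlation is rho**l."""
    e = [rng.gauss(0.0, sigma)]
    for _ in range(1, T):
        e.append(rho * e[-1] + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, sigma))
    return e


def generate_fp(N, T, beta0, beta1, rho, sigma, rng):
    """S1-S4: one time-independent covariate and T responses per FP unit."""
    x = [rng.randint(1, 6) for _ in range(N)]  # hypothetical household sizes
    y = []
    for i in range(N):
        e = ar1_errors(T, rho, sigma, rng)
        y.append([beta0 + beta1 * x[i] + e[t] for t in range(T)])
    return x, y


def ols(xs, ys):
    """Least-squares intercept and slope (stand-in for SWGLS)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    b1 = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return my - b1 * mx, b1


def predict_total(x, y, sample, t):
    """S6: sampled responses plus fitted values for non-sampled units."""
    in_sample = set(sample)
    b0, b1 = ols([x[i] for i in sample], [y[i][t] for i in sample])
    observed = sum(y[i][t] for i in sample)
    fitted = sum(b0 + b1 * x[i] for i in range(len(x)) if i not in in_sample)
    return observed + fitted


def prb(estimate, truth):
    """Percentage relative bias (assumed signed definition)."""
    return 100.0 * (estimate - truth) / truth


def simulate(N=1000, n=100, T=4, reps=25, seed=1):
    """S5-S7: repeat SRSWOR sampling and average the total predictions."""
    rng = random.Random(seed)
    x, y = generate_fp(N, T, beta0=2.0, beta1=1.5, rho=0.7, sigma=1.0, rng=rng)
    truth = [sum(y[i][t] for i in range(N)) for t in range(T)]
    avg = [0.0] * T
    for _ in range(reps):
        sample = rng.sample(range(N), n)  # SRSWOR from the frame
        for t in range(T):
            avg[t] += predict_total(x, y, sample, t) / reps
    return [(truth[t], avg[t], prb(avg[t], truth[t])) for t in range(T)]
```

Calling `simulate()` returns, for each time point, the true FP total, the simulation-averaged prediction, and its PRB. The FP is generated once and only the sample is redrawn, as in steps S5 and S6.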
5.1.2. Simulated Prediction Performance for Marginal FP Totals
Note that as the FP total predictors given by (36) and (38) depend on the sample
-based estimates
(31) for the SP regression parameters
for
it is therefore important to examine the performance of this estimator
in estimating the FP parameter
given by (27) corresponding to
Furthermore, as
in (27) depends on the auto-correlation structure
in (26) up to time
for
we first exploit the FP generated by (15)–(17) to compute the FP regression parameters
by (27), as well as the FP lag correlation parameters
by (29). These FP parameters
corresponding to SP parameter
and the FP correlation parameters
corresponding to SP correlation parameters
for
under all three auto-correlation models, namely AR(1), MA(1), and EQC, are displayed in the upper halves of Table 1, Table 2, and Table 3, respectively. For example,
Table 1, based on the AR(1) model, shows four-dimensional (using four covariates)
values at
as
corresponding to the SP regression parameters
The FP and SP lag correlations may be interpreted similarly. Next, because the sample
-based (see step 5 in the last sub-section) estimation of
amounts to the estimation of
the final estimates
computed by (31) (each a simulation average based on 25 repetitions) under the AR(1), MA(1), and EQC processes are displayed in the lower halves of Table 1, Table 2, and Table 3, respectively. For example, the aforementioned
under the AR(1) process are estimated as
which, along with the other similar estimates from Table 2 and Table 3, appear to perform well, reflecting their design unbiasedness for the FP regression parameters
Next, as indicated in Step 6 in the last sub-section, the aforementioned sample-based regression estimates
are used to compute the design-assisted model-based (DAMB), as well as the design cum model unbiased (DCMU), marginal (at a given time
t) total predictors
(37) and
(39), for predicting/estimating the marginal total
The results from
Table 4 show that these two predictors appear to perform almost the same, and the estimates are very close to the true FP totals. For example, under the AR(1) process (displayed in the extreme left block), the FP total
at time
i.e.,
is predicted by the DAMB predictor as
with a PRB of
and by the DCMU predictor as
with a slightly different PRB of
Thus, the two predictors perform almost the same for the targeted prediction. Their performances under the MA(1) and EQC processes can be interpreted similarly.
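The lag correlation estimates reported in Tables 1–3 are obtained without assuming a specific correlation model. A minimal sketch of one such model-free approach is the usual method-of-moments estimator computed from regression residuals, shown below. This is an assumption for illustration: formula (35) is not reproduced in this section, and the sketch omits sampling weights, which are equal under SRSWOR. The function and argument names are hypothetical.

```python
def lag_correlation_estimates(residuals):
    """Method-of-moments lag correlations from an n x T table of residuals.

    residuals[i][t] is the response of unit i at time t minus its fitted mean.
    For each lag l, the average lag-l cross product is divided by the
    average squared residual.
    """
    n, T = len(residuals), len(residuals[0])
    variance = sum(r[t] ** 2 for r in residuals for t in range(T)) / (n * T)
    estimates = []
    for lag in range(1, T):
        cross = sum(
            r[t] * r[t + lag] for r in residuals for t in range(T - lag)
        ) / (n * (T - lag))
        estimates.append(cross / variance)
    return estimates
```

Because the same formula is applied whether the data come from AR(1), MA(1), or EQC, comparing the output with the true lag correlations shows how well each structure is recovered.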
5.2. Simulation Study 2: Prediction Performance Using Cluster-Based Longitudinal Survey Sample
In the last section, we examined the prediction performances using the individual-based longitudinal survey sample. As a generalization, we now examine the prediction behavior of the cluster-based FP total predictors, developed in (56)–(61), over the longitudinal period of the study. Notice that, as opposed to the individual-based FP (1), a cluster-based FP is given by (5) with its SP model given by (40)–(42). The individuals under the cth cluster are now correlated, with the cluster correlation coefficient defined in (45). As an illustration in the present simulation, we have generated a cluster-correlated population and, hence, the sample with for example. As far as longitudinal correlations are concerned, we use the widely used AR(1) structure with leading to 3 lag-dependent correlation coefficients under total time period
We now construct our hypothetical cluster-based as follows.
Let consist of independent clusters/families. We label these clusters in sequence from 1 to Following the notations from (40), we consider four different family structures (FS1 to FS4) with their family/cluster sizes as follows:
FS1: each with 2 parents (say, father (F) and mother (M)) and 2 children (C1 and C2);
FS2: each with 2 parents (F and M) and 1 child (C1);
FS3: each with 1 parent (F) and 2 children (C1 and C2);
FS4: each with 1 parent (M) and 2 children (C1 and C2).
As far as the covariates are concerned, we consider three covariates, namely age smoking status of the individual member at the initial time point and gender We explain below how we generated the covariates under FS1. The covariates under the remaining FS2–FS4 are generated similarly.
- 1.
Generation of for FS1: (a) For the father’s age, we generated 175 ages from a uniform distribution with range 50–60. (b) For the mother’s age, in sequence (following the father’s label), we generated one age difference indicator value, say , randomly from seven different age difference indicators and computed the selected mother’s age as . (c) For the age of C1, we used where now was chosen as a randomly selected value from a set of age difference values . (d) For the age of (corresponding to C2), we used the formula with as a randomly selected value from a set of age difference values
- 2.
Generation of
for FS1: Smoking habits for the members were determined using the binary distribution with a smoking probability of
say. More specifically, we used
- 3.
Generation of for FS1: We considered and to determine gender for both C1 and C2, we used
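The covariate generation for FS1 described in items 1–3 can be sketched as follows. Only the uniform 50–60 range for the father's age and the use of seven age difference indicators for the mother come from the text. The actual age difference sets, the age formulas, the smoking probability, and the gender coding are not recoverable from this section, so every default value below (`smoke_prob`, `parent_gaps`, `child_gaps`, `sibling_gaps`, and the 0/1 gender coding) is a hypothetical placeholder.

```python
import random


def bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def generate_fs1_family(
    rng,
    smoke_prob=0.3,                       # hypothetical smoking rate
    parent_gaps=(0, 1, 2, 3, 4, 5, 6),    # seven hypothetical F-M age gaps
    child_gaps=(25, 26, 27, 28, 29, 30),  # hypothetical M-C1 age gaps
    sibling_gaps=(1, 2, 3, 4),            # hypothetical C1-C2 age gaps
):
    """Generate covariates for one FS1 family: F, M, C1, C2."""
    father = rng.uniform(50, 60)                # (a) father's age
    mother = father - rng.choice(parent_gaps)   # (b) assumed: father minus gap
    c1 = mother - rng.choice(child_gaps)        # (c) assumed: mother minus gap
    c2 = c1 - rng.choice(sibling_gaps)          # (d) assumed: C1 minus gap
    ages = {"F": father, "M": mother, "C1": c1, "C2": c2}
    # Assumed coding: 1 = male, 0 = female; children drawn with probability 0.5.
    genders = {"F": 1, "M": 0,
               "C1": bernoulli(0.5, rng), "C2": bernoulli(0.5, rng)}
    return [
        {"role": role, "age": ages[role],
         "smoker": bernoulli(smoke_prob, rng), "gender": genders[role]}
        for role in ("F", "M", "C1", "C2")
    ]
```

Calling `generate_fs1_family` 175 times with a shared `random.Random` instance would give the FS1 part of the FP. FS2–FS4 would drop or keep members according to their family structure.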
Based on the aforementioned covariate values and using their effects as and further using cluster correlation and AR(1) longitudinal correlation process with the responses namely body mass index over a longitudinal period (equivalent to say 2 years) were generated using the SP correlation model (40). Before one can examine the prediction performance of the DAMB (56)–(58) and DCMU (59)–(61) marginal predictors at a given time, it is necessary to first compute the FP data-based regression estimates (FPRE) and then the survey sample ((51)–(53))-based regression estimates (SSRE) after accommodating the cluster correlations (indexed by ) and AR(1) longitudinal correlations indexed by
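A minimal sketch of how cluster-correlated longitudinal responses of this kind might be generated is given below. It assumes an additive form in which every member of a cluster shares one normal random effect and, conditionally on that effect, each member's errors follow a stationary AR(1) process. Model (40) and the cluster correlation (45) are not reproduced in this section, so this form, the intraclass-correlation formula, and all names (`sigma_gamma`, `sigma_e`, `fitted`) are assumptions for illustration.

```python
import random


def generate_cluster_responses(fitted, T, sigma_gamma, sigma_e, rho, rng):
    """Generate T repeated responses for each member of one cluster.

    fitted: one regression mean (x'beta) per member, held fixed over time.
    A single random effect gamma is shared by all members of the cluster;
    given gamma, each member's errors follow a stationary AR(1) process.
    """
    gamma = rng.gauss(0.0, sigma_gamma)  # shared cluster random effect
    responses = []
    for mu in fitted:
        e = rng.gauss(0.0, sigma_e)
        row = []
        for t in range(T):
            if t > 0:
                e = rho * e + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, sigma_e)
            row.append(mu + gamma + e)
        responses.append(row)
    return responses


def cluster_correlation(sigma_gamma, sigma_e):
    """Same-time correlation between two members under the assumed model."""
    return sigma_gamma ** 2 / (sigma_gamma ** 2 + sigma_e ** 2)
```

Under this assumed form, a larger random-effect variance relative to the error variance gives a stronger within-family correlation, while `rho` controls how fast the correlation decays over time for the same member.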
As far as the SS
(6) is concerned, we used SRSWOR and chose 32 families from 175 families under FS1; 12 families from 50 families under FS2; 6 families from 25 families under FS3; and 6 other families from 25 families under FS4. Thus, altogether, we chose
clusters/families with sample size
from
clusters under the FP
of size
The SP parameter estimates (i.e., the FP correlation and regression parameters), obtained by using the estimating Equations (49) (for cluster correlation
), (48) (for longitudinal correlations), and (46) (for regression parameter estimates) from
Section 4.2; and their corresponding sample
-based estimates computed by solving the SS-based estimating Equations (51)–(53) from
Section 4.3 are provided in
Table 5. Sampling was repeated 25 times to compute the sample-based parameter estimates. All sample-based estimates appear to be close to the FP-based estimates. In general, cluster-correlation estimates appear to work well when
T is small, as more clusters cause more variation over the longitudinal period. However, the main regression parameter estimates, shown in the bottom rectangular box in
Table 5, are not negatively affected by this slight difference in correlation estimates.
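The family-level SRSWOR selection described above, with separate draws within each family structure, can be sketched as follows. The family counts and sample counts are those stated in the text, and the family sizes follow from the FS1–FS4 definitions (4, 3, 3, and 3 members). The dictionary and function names are hypothetical.

```python
import random

# Counts as stated in the text for the four family structures.
FP_FAMILIES = {"FS1": 175, "FS2": 50, "FS3": 25, "FS4": 25}
SAMPLE_FAMILIES = {"FS1": 32, "FS2": 12, "FS3": 6, "FS4": 6}
# Members per family implied by the FS1-FS4 definitions.
FAMILY_SIZE = {"FS1": 4, "FS2": 3, "FS3": 3, "FS4": 3}


def draw_family_sample(rng):
    """SRSWOR of family indices, drawn separately within each structure."""
    return {
        fs: sorted(rng.sample(range(FP_FAMILIES[fs]), SAMPLE_FAMILIES[fs]))
        for fs in FP_FAMILIES
    }


def totals(counts):
    """Return (number of families, number of individuals) for the counts."""
    families = sum(counts.values())
    individuals = sum(counts[fs] * FAMILY_SIZE[fs] for fs in counts)
    return families, individuals
```

With these counts, `totals(FP_FAMILIES)` gives 275 families and 1000 individuals, and `totals(SAMPLE_FAMILIES)` gives 56 families and 200 individuals, since all members of a selected family enter the sample.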
Finally, the regression parameter estimates, both FP- and sample-based from
Table 5, are used to compute the DAMB and DCMU prediction estimates by using (57) and (60), respectively. These estimates, along with actual FP totals at all time points, are displayed in
Table 6. The DCMU predictions shown in column 3 appear to have a smaller PRB (percentage relative bias) than the DAMB predictions exhibited in column 1 for
and 3, showing the relative superiority of the DCMU predictions over the DAMB predictions in the single-stage cluster setup.
6. Discussion and Concluding Remarks
In a finite population (FP) setup, the prediction of the FP total is a difficult problem, as one is required to predict the non-sampled response total well, which is customarily performed by replacing such a non-sampled total with its model-based expectation estimate computed from a survey sample. Following the estimating function approach [21] for independent data, the super-population (SP) model-based (for FP data) regression parameters involved in the model-based expectation (equivalently, in the prediction function) may be estimated based on the survey sample using a sampling weighted OLS (ordinary least squares) (SWOLS) estimator. This SWOLS estimator is DCMU (design cum model unbiased) for the SP regression parameter, and it is DU (design unbiased) for the FP parameter [20] corresponding to the SP parameter. However, in this paper, we have considered an FP with independent individuals or households, where each individual or cluster/household member provides a set of longitudinally correlated responses, and the members of a household are structurally cluster correlated. Clearly, as opposed to an SP regression model for independent data, the proposed setup requires an SP correlation model, more specifically, longitudinal correlation and combined cluster-longitudinal correlation models. Some existing studies using a so-called ‘working’ correlation model (e.g., [7] (Section 7.4), [8]), for example, have used an unstructured or standard Pearson correlation model, whereas [5] has used a random effects-based mixed model to accommodate the longitudinal correlations. However, as explained in Section 3.2, these models fail to accommodate the time effects on the correlations. More specifically, they fail to produce a correlation structure with decaying correlations as the time lag between two repeated responses increases. As a remedy, we have considered lag-based correlation models following [11] (Section 3), for example, to incorporate the time effects on the correlations. Also, in a cluster-based longitudinal setup, we have generalized this lag-based correlation model to a dynamic mixed-model setup where, conditionally on the cluster random effects, the repeated responses follow a lag-based correlation structure.
The aforementioned correlation structures and their sample-based estimates are discussed in detail, and it is demonstrated how to obtain DCMU (design cum model unbiased) estimators, namely, the SWGLS (sampling weighted GLS) estimators, for the regression parameters after accommodating the longitudinal or cluster-longitudinal correlations. Subsequently, these DCMU regression estimators are used to develop design-assisted model-based (DAMB) and design cum model unbiased (DCMU) prediction functions. Also, the relative performance of the DAMB and DCMU predictors for FP total estimation is examined both theoretically and numerically.
In conclusion, this paper advances longitudinal survey data analysis for both individual- and cluster-based FPs. More specifically, it is demonstrated that the DCMU estimation of the parameters involved in a prediction function provides DCMU-valid prediction for the FP total. The step-by-step development of the estimation and prediction methods should be useful to practitioners from statistical agencies such as Statistics Canada and Health Canada, the Bureau of Statistics, USA, or similar organizations in other countries.