Prediction Inferences for Finite Population Totals Using Longitudinal Survey Data

Asokan M. Variyath; Brajendra C. Sutradhar

doi:10.3390/stats8040110


Department of Mathematics and Statistics, Memorial University, St. John’s, NL A1C 5S7, Canada

* Authors to whom correspondence should be addressed.

Stats 2025, 8(4), 110; https://doi.org/10.3390/stats8040110


Abstract

In an infinite-/super-population (SP) setup, regression analysis of longitudinal data, which involves repeated responses and covariates collected from a sample of independent individuals or from correlated individuals belonging to a cluster such as a household/family, has been intensively studied in the statistics literature over the last three decades. In general, a longitudinal correlation structure, such as an auto-correlation structure for repeated responses from an individual or a two-way cluster–longitudinal correlation structure for repeated responses from the individuals belonging to a cluster/household, is exploited to obtain consistent and efficient regression estimates. However, as opposed to the SP setup, a similar regression analysis for finite population (FP)-based longitudinal or clustered longitudinal data, using a survey sample (SS) taken from the FP based on a suitable sampling design, becomes complex: it requires first defining the FP regression and correlation (longitudinal and/or clustered) parameters and then estimating them using appropriate sampling weighted-design unbiased (SWDU) estimating equations. Finite sampling inferences, such as predictions of longitudinal changes in FP totals, become still more complex, because the non-sampled totals must be predicted after accommodating the longitudinal and/or clustered longitudinal correlation structures. Our objective in this paper is to deal with this complex FP prediction inference by developing a design cum model (DCM)-based estimation approach. Two competitive FP total predictors, namely the design-assisted model-based (DAMB) and design cum model-based (DCMB) predictors, are compared using an intensive simulation study. The regression and correlation parameters involved in these prediction functions are optimally estimated using the proposed DCM-based approach.

Keywords:

clusters-based longitudinal survey sample; design assisted model-based prediction; design cum model-based estimators; design cum model-based total prediction; finite population in a longitudinal setup; finite population in a cluster-based longitudinal setup; individual-based longitudinal survey sample

1. Introduction

Clusters/household-based longitudinal survey data analysis for finite population (FP) inferences is an important research topic. For example, to help develop public policy, to understand the determinants of health, and to understand the relationship between health status and health care use, Statistics Canada conducted the National Population Health Survey (NPHS) to gather information on the health of Canadians. The survey began in 1994, collecting biennial information from selected households/clusters under a state/province/strata until 2012. The responses can be linear, binary, or multinomial. In this paper, however, we concentrate on the analysis of linear cluster longitudinal data, such as the repeated body mass index (bmi) measures (ranging in general from 18 to 40 $kg/m^{2}$), collected under the NPHS study from all members of the selected households over a period of time. Notice that the health status of an individual measured by $bmi$ at a given time is likely to depend on (1) the health status at previous times, (2) a household/cluster random effect, and (3) certain time-dependent covariates such as gender, age group, education level, and lifestyle factors like smoking, diet, and physical activity. One may refer to this type of data as (a) a single-stage cluster-based longitudinal survey (SSCLS) sample. This is because the data set consists of repeated $bmi$ responses collected from all individuals belonging to a sample of households/clusters, where the sample was chosen in a single stage from the specified FP $(F)$ containing a large number of households/clusters. In a specialized case, when the household is considered to be the sampling unit, i.e., repeated responses are collected from the household leader only, one does not need to consider any household/cluster correlation. In such cases, one may refer to the data set as (b) a single-stage individual-based longitudinal survey (SSILS) sample. In this paper, we study finite sampling inferences using both SSCLS and SSILS samples.

We remark that, except for some general discussion and exploratory analysis [1,2,3,4], there do not appear to be adequate discussions on $F$ inferences using the aforementioned SSILS and SSCLS samples. More specifically, to use these SSILS and SSCLS samples for $F$ inferences in a model-based approach, one needs to consider an appropriate $S$ longitudinal correlation model for the $F$-based hypothetical data. In this vein, for linear longitudinal survey data analysis, some studies, such as [5] (Section 2, eqn. 2), assumed that the repeated responses from an individual in the $F$ follow a random effects-based $S$ linear mixed model, where an individual’s common random effect causes an equi-correlation structure among the repeated responses from the individual. This cluster correlation-oriented model, however, fails to accommodate the time-lag-dependent decaying correlations [6] (chps. 2–3) that appear to be more appropriate than an equi-correlation structure for longitudinal responses. Similarly, some studies, such as [7] (Section 7.4; see also the references therein), following [8], have summarized a ‘working’ correlation model and the so-called GEE (generalized estimating equation) estimation approach to fit the longitudinal survey sampled data; this approach appears to be appropriate for non-longitudinal clustered data and/or classical multivariate data. More specifically, it was suggested by [7] (Section 7.4) to compute the correlations of the longitudinal survey data by using the standard Pearson’s correlation formula, which appears to be a naive approach, as these correlations fail to exhibit the auto-correlation pattern, i.e., correlations decaying as the time lag increases, that is appropriate for longitudinal data. A similar working correlations approach using ‘working’ odds ratio parameters for longitudinal binary survey data was used by [9] (ch. 20), which provides inconsistent regression estimates [10] (ch. 4, Section 4.2).

As a remedy, following [6] (Section 2.2) (see also [11] (Section 3)), we use a general stationary auto-correlation structure-based $S$ correlation model to fit the $F$-based hypothetical data involving repeated data from independent individuals. Similarly, we use a familial longitudinal correlation structure-based [6] (Section 3.1) $S$ correlation model to fit the $F$-based hypothetical data involving repeated data from all members of a cluster/family exhibiting two-way clustered longitudinal correlations, the clusters being independent of each other. These individual-based longitudinal (IL) and cluster-based longitudinal (CL) correlation models for the $F$, along with the estimation of the model parameters using the SSILS and SSCLS samples, are provided in Section 3 and Section 4, respectively. The construction of the SSILS and SSCLS samples from their respective $F$ is given beforehand, in Section 2.

Prediction functions for $F$ totals up to a given time are then constructed by replacing the non-sampled response totals with their model-based as well as design cum model (DCM)-based expectations. These expectations involve $S$ regression parameters, which are not easy to estimate consistently using the survey sample. More specifically, as the responses in a survey sample are subject to randomness due to both sample selection (as covariates change from sample to sample) and $S$ model errors, unlike in some of the existing studies [12,13,14,15], it is not possible to obtain any valid MUOLS/MUGLS (model unbiased ordinary/generalized least square) estimators for the regression parameters [16]. In Section 3 and Section 4, we thus develop suitable DCMU (design cum model unbiased) estimators for the regression parameters involved in the $S$-based expectations using the SSILS and SSCLS samples, respectively. Next, we use these DCMU estimators in the same Section 3 and Section 4 to form the DCMU predictions for the $F$ total at a given time, using the SSILS and SSCLS samples, respectively. We also include an alternative theoretical result for predictions using the MU (model unbiased) function, but replacing the parameters in the expectations with DCMU estimators. We refer to this prediction approach as the design-assisted model-based (DAMB) approach. As expected, a simulation study in Section 5 shows that the DCMU predictors perform better than the DAMB predictors. The paper concludes with a discussion and some concluding remarks in Section 6.

2. Materials: Individual or Cluster-Based Longitudinal Survey Data

In practice, there are many situations where longitudinal data are recorded at an $F$ level, and it may be of interest to know the longitudinal pattern of a response variable over a small period of time by using a survey sample $(s^{*})$ consisting of longitudinal responses chosen from the targeted $F$. However, the nature of the sample depends on the form of the underlying $F$. For example, (a) suppose that an electric power company is interested in analyzing the household power consumption pattern over the last $T = 4$ years for a city with $N$ (e.g., 10,000) households, where the sampling units are individual households. For this purpose, a single-stage individual household-based longitudinal sample (SSILS), say of size $n = 500$, may be taken from the $F$, and the sampled responses, along with covariates, may be used to understand the longitudinal pattern. In other studies, for example in health-related cases, (b) a health organization, such as Health Canada, as pointed out in the last section, may be interested in the longitudinal pattern of the health condition of all members of the households in a state/province. Suppose that there are $K$ households in the state/province and the $c$-th cluster/household has $N_{c}$ family members. Notice that these members are correlated and that the underlying correlation structure, unlike in case (a), has to be accommodated in any pattern-change analysis. In this example, one would take an SSCLS (single-stage cluster/household-based longitudinal sample) of $k$ households and use the sample to understand the longitudinal pattern of the $F$.

For a better understanding of the differences between the two aforementioned samples, SSILS and SSCLS, we present them in notational detail, including their respective $F$, as follows. We remark that these samples will be exploited in Section 3 and Section 4, respectively, for a model-based prediction of their $F$ totals at a given marginal time. As far as the sample selection from the $F$ is concerned, we use the well-known equal-probability-based SRSWOR (simple random sampling without replacement) in both cases.

2.1. SSILS Sample $(s_{1}^{*})$ from Individual-Based $F_{1}$

Individual-based longitudinal finite population $F_{1}$. Let

$$F_{1} : \{(y_{i} : T \times 1,\; x_{i} : T \times p),\; i = 1, \dots, N\} \tag{1}$$

be a longitudinal $F$ with $y_{i} = (y_{i1}, \dots, y_{it}, \dots, y_{iT})'$ as the vector of $T$ repeated responses for the $i$-th individual, and let $x_{i}' = (x_{i1}, \dots, x_{it}, \dots, x_{iT})$ denote the $p \times T$ covariates matrix with $x_{it}$ as the $p$-dimensional covariate vector recorded at time $t$ for the $i$-th individual in the FP. In reality, this $F$ is unknown, and hence its data are hypothetical, unless a sample is taken to observe a part of the FP.

Survey sample $s_{1}^{*}$ from $F_{1}$ using SRSWOR design. For the prediction of the $F_{1}$ total at a given time $t$, namely

$$\tau_{y,t} = \sum_{i=1}^{N} y_{it} \quad \text{for } t = 1, \dots, T, \tag{2}$$

a sample $s_{1}^{*}$ may be chosen from the $F_{1}$ in (1), as follows:

$$s_{1}^{*} \equiv \{(y_{i} : T \times 1,\; x_{i} : T \times p);\; i = 1, \dots, n\} \subset F_{1} \tag{3}$$

according to a suitable design, say $(D_{s_{1}^{*}})$, as

$$D_{s_{1}^{*}} : Pr(s_{1}^{*} \in F_{1}) = 1 \Big/ \binom{N}{n}; \quad \pi_{i} = Pr(i \in s_{1}^{*}) = Pr[\delta_{i \in s_{1}^{*}} = 1] = \frac{n}{N}, \tag{4}$$

with $\delta_{i \in s_{1}^{*}}$ being a sample inclusion indicator variable.
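The SRSWOR design in (4) can be illustrated with a short sketch. This is only a minimal illustration, not the authors' code: the function name `srswor_sample` and the seed are hypothetical, the sizes $N = 10{,}000$ and $n = 500$ are borrowed from the power-consumption example above, and units are labelled $0, \dots, N-1$ for convenience.

```python
import numpy as np

def srswor_sample(N, n, rng):
    """Draw n distinct unit labels from {0, ..., N-1} with equal probability.

    Returns the sorted sampled labels and the length-N inclusion
    indicator vector (delta_i = 1 if unit i is in the sample).
    """
    idx = rng.choice(N, size=n, replace=False)  # without replacement
    delta = np.zeros(N, dtype=int)
    delta[idx] = 1
    return np.sort(idx), delta

rng = np.random.default_rng(2025)   # hypothetical seed
N, n = 10_000, 500                  # sizes from the example in Section 2
sample_idx, delta = srswor_sample(N, n, rng)

pi = n / N      # first-order inclusion probability in (4)
w = 1.0 / pi    # corresponding sampling weight N / n
```

Every unit has the same inclusion probability $n/N$, so the sampling weight is the same constant $N/n$ for all sampled units.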

2.2. SSCLS Sample $(s_{2}^{*})$ from Cluster-Based $F_{2}$

Cluster-based longitudinal finite population $F_{2}$. Let

$$F_{2} \equiv \{(y_{ci} : T \times 1,\; x_{ci} : T \times p);\; i = 1, \dots, N_{c};\; c = 1, \dots, K\} \tag{5}$$

be the targeted finite population, where $K\ (\to \infty)$ denotes the number of independent clusters/households with sizes $N_{1}, \dots, N_{c}, \dots, N_{K}$, $N_{c}$ being the size of the $c$-th cluster, which is small and fixed; and $y_{ci} = (y_{ci1}, \dots, y_{cit}, \dots, y_{ciT})'$ denotes a $T$-dimensional hypothetical linear response vector containing $T$ potential repeated responses from the $i$-th $(i = 1, \dots, N_{c})$ individual of the $c$-th cluster/household under the finite population. In this setup, at a given time point $t$, the pair-wise hypothetical responses within the $c$-th cluster, namely $y_{cit}$ and $y_{cjt}$ for $i \neq j;\ i, j = 1, \dots, N_{c}$, are likely to be correlated, as they share an invisible random cluster/household effect, leading to cross-sectional within-cluster correlations. The repeated responses from the $i$-th individual under the $c$-th cluster, namely $y_{cit}$ and $y_{ci,t+\ell}$ for $t = 1, \dots, T - \ell;\ \ell = 1, \dots, T - 1$, are also likely to be correlated, maintaining a dynamic dependence relationship, leading to longitudinal correlations.

Survey sample $s_{2}^{*}$ from $F_{2}$ using the SRSWOR design. Unlike the construction of $s_{1}^{*}$ by (3), clusters are now considered to be the primary sampling units. Thus, $s_{2}^{*}$ may be chosen from the $F_{2}$ in (5) as follows:

$$s_{2}^{*} \equiv \{(y_{ci} : T \times 1,\; x_{ci} : T \times p);\; c = 1, \dots, k;\; i = 1, \dots, N_{c}\} \subset F_{2} \tag{6}$$

according to a suitable design, say $(D_{s_{2}^{*}})$, as

$$D_{s_{2}^{*}} : Pr(s_{2}^{*} \in F_{2}) = 1 \Big/ \binom{K}{k}; \quad \pi_{c} = Pr(c \in s_{2}^{*}) = Pr[\delta_{c \in s_{2}^{*}} = 1] = \frac{k}{K}, \tag{7}$$

with $\delta_{c \in s_{2}^{*}}$ being a cluster inclusion indicator variable.

Our purpose is to develop a suitable prediction function and its estimator for the $F_{2}$ total, namely for

$$\tau_{y,t} = \sum_{c=1}^{K} \sum_{i=1}^{N_{c}} y_{cit} \quad \text{for } t = 1, \dots, T. \tag{8}$$

3. Proposed DCMU Prediction Method Using SSILS Sample

Following (2), the $F_{1}$ total up to time $t$ has the formula $\tau_{y(t)} = \sum_{i=1}^{N} \sum_{u=1}^{t} y_{iu}$. Thus, once the SSILS $s_{1}^{*}$ is chosen, this longitudinal cumulative total (LCT) may be expressed in terms of survey sampled $(ss)$ and non-sampled $(ns)$ LCTs as

$$\tau_{y(t)} = \sum_{i=1}^{N} \sum_{u=1}^{t} y_{iu} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu}\ (ss) + \sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} y_{iu}\ (ns), \quad \text{for } t = 1, \dots, T, \tag{9}$$

where the second term on the right-hand side of (9) is the $ns$ response total up to time $t$. For prediction inferences, this and similar $ns$ response totals are in general predicted with their model-based expectations. Notice that, because the repeated responses $\{y_{i1}, \dots, y_{iu}, \dots, y_{iT}\}$ for the $i$-th individual $(i = 1, \dots, N)$ under the $F_{1}$ in (1) are supposed to be longitudinally correlated, we use a super-population correlation model $S_{1}$ as in Section 3.1 below. We may then use $E_{S_{1}}[\cdot] \equiv E_{F_{1} \subset S_{1}}$ to denote the model-based expectations; hence, the model unbiased (MU) prediction function for the LCT in (9) has the following form:

$$\text{MU Prediction Function}: \quad \hat{\tau}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + E_{F_{1} \subset S_{1}} \Big[\sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} y_{iu}\Big]. \tag{10}$$

However, as the estimation of the expected function in the second term in (10) has to be performed using the sample $s_{1}^{*}$ from (3), the existing studies of the past decades (e.g., see [12] for independent data with $T = 1$; [14] (Section 2.6.2), [17] (Section 2.2) for clustered correlated data), conditional on $s_{1}^{*}$, have estimated the expected function by using the sampling sequence $s_{1}^{*} \subset S_{1}$, obtaining the prediction estimator as

$$\text{MU Prediction Function Estimator}: \quad \hat{\tau}_{y(t)}^{\dagger} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \hat{E}_{s_{1}^{*} \subset S_{1}} \Big[\sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} y_{iu}\Big]. \tag{11}$$

Clearly, this estimation, based on the sequence $s_{1}^{*} \subset S_{1}$ (i.e., treating the sample $s_{1}^{*}$ as though it is taken directly from the super-population $S_{1}$), ignores the $F_{1}$ in (1) as the source of the sample during the estimation process. Thus, this existing MU prediction approach is flawed, yielding invalid predictions.

As a remedy to the aforementioned anomaly, we propose a design cum model unbiased (DCMU) prediction function and estimate the expectation involved in the prediction function based on the true sampling sequence $s_{1}^{*} \subset F_{1} \subset S_{1}$, as follows:

$$\begin{aligned} \text{DCMU Prediction Function}: \quad \tilde{\tau}_{y(t)} &= \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + E_{s_{1}^{*} \subset F_{1} \subset S_{1}} \Big[\sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} y_{iu}\Big] \\ &= \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \frac{N-n}{N} \Big[\sum_{i=1}^{N} \sum_{u=1}^{t} E_{F_{1} \subset S_{1}} [y_{iu}]\Big], \end{aligned} \tag{12}$$

yielding the DCMU prediction estimator as

$$\text{DCMU Prediction Function Estimator}: \quad \hat{\tilde{\tau}}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \frac{N-n}{N} \Big[\sum_{i=1}^{N} \sum_{u=1}^{t} \hat{E}_{s_{1}^{*} \subset F_{1} \subset S_{1}} [y_{iu}]\Big]. \tag{13}$$
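The factor $(N-n)/N$ in the second line of (12) can be checked in one step. The following is a sketch, assuming that the expectation $E_{s_{1}^{*} \subset F_{1} \subset S_{1}}$ is evaluated sequentially: first over the SRSWOR design (4) with the $F_{1}$ values held fixed, and then over the $S_{1}$ model. Writing the non-sampled total with the inclusion indicator $\delta_{i \in s_{1}^{*}}$ from (4),

$$\sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} y_{iu} = \sum_{i=1}^{N} (1 - \delta_{i \in s_{1}^{*}}) \sum_{u=1}^{t} y_{iu}, \qquad E_{D_{s_{1}^{*}}} [1 - \delta_{i \in s_{1}^{*}}] = 1 - \frac{n}{N} = \frac{N-n}{N},$$

so that the design expectation of the non-sampled total is $\frac{N-n}{N} \sum_{i=1}^{N} \sum_{u=1}^{t} y_{iu}$, and applying $E_{F_{1} \subset S_{1}}$ term by term gives the second line of (12).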

We further remark that, as the estimation of the expected function as in (11) is flawed, one may modify it by estimating the expected function based on the true sampling sequence $s_{1}^{*} \subset F_{1} \subset S_{1}$. We refer to this modified estimator as the design-assisted model-based (DAMB) prediction estimator, with its formula given by

$$\text{DAMB Prediction Estimator}: \quad \hat{\hat{\tau}}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \Big[\sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} \hat{E}_{s_{1}^{*} \subset F_{1} \subset S_{1}} (y_{iu})\Big]. \tag{14}$$

Note that, because of its validity concern, we no longer follow the MU prediction estimator from (11). As far as the computation of the proposed DCMU prediction estimator in (13) and the DAMB prediction estimator in (14) is concerned, we demonstrate it by considering an $S_{1}$ correlation model for the $F_{1}$ (1) data, as in the next section.

3.1. Super-Population $(S_{1})$ Longitudinal Auto-Correlation Model

As pointed out in Section 1, the so-called ‘working’ correlation structures used by some studies (e.g., [7] (Section 7.4), and [9] (ch. 20)) fail to accommodate the decaying correlation properties as the time lag increases. As a remedy, in this section, by following [6] (Section 2.2) (see also [11] (Section 3)), we propose a lag-dependent correlation structure, i.e., a super-population $(S_{1})$ correlation model for the longitudinal data in the $F_{1}$. More specifically, we suggest using a general auto-correlation model, as follows, that accommodates the frequently encountered AR(1) (auto-regressive order 1), MA(1) (moving average order 1), and EQC (equi-correlation) correlation structures, among others. Thus, the hypothetical repeated responses for the $i$-th $(i = 1, \dots, N)$ individual, namely $y_{i} = (y_{i1}, \dots, y_{it}, \dots, y_{iT})'$, are assumed to follow the correlation model given by

$$F_{1} \subset S_{1} : \quad y_{i} = x_{i} \beta + \epsilon_{i}, \tag{15}$$

$$\text{with } \epsilon_{i} = (\epsilon_{i1}, \dots, \epsilon_{it}, \dots, \epsilon_{iT})' \sim (0, \sigma_{\epsilon}^{2} R), \tag{16}$$

where $var[\epsilon_{it}] = \sigma_{\epsilon}^{2}$ for all $i = 1, \dots, N$ and $t = 1, \dots, T$, and $R$ is the $T \times T$ lag-dependent auto-correlation matrix defined as

$$R(\rho) = \begin{bmatrix} 1 & \rho_{1} & \rho_{2} & \dots & \rho_{T-1} \\ \rho_{1} & 1 & \rho_{1} & \dots & \rho_{T-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{T-1} & \rho_{T-2} & \rho_{T-3} & \dots & 1 \end{bmatrix}, \tag{17}$$

where, for $\ell = 1, \dots, T - 1$, $\rho_{\ell}$ is known to be the $\ell$-th lag auto-correlation. Notice that $x_{i}$ in (15) is the $T \times p$ covariates matrix, as defined in (1), for the $F_{1}$, and $\beta$ is referred to as the $S_{1}$ regression parameters vector. Further notice that, as far as the general nature of the lag correlation matrix $R$ in (17) is concerned, as mentioned above, these lag correlations maintain suitable special patterns under the AR(1), MA(1), and EQC models, respectively, as follows:

$$\text{For AR(1) model}: \quad \epsilon_{it} = \rho\, \epsilon_{i,t-1} + a_{it}, \ \text{with } a_{it} \overset{iid}{\sim} (0, \sigma_{\epsilon}^{2}) \ \Rightarrow\ \rho_{\ell} = \rho_{1}^{\ell}, \ \text{for } \ell = 1, \dots, T - 1. \tag{18}$$

$$\text{For MA(1) model}: \quad \epsilon_{it} = a_{it} - \rho\, a_{i,t-1}, \ \text{with } a_{it} \overset{iid}{\sim} (0, \sigma_{\epsilon}^{2}) \ \Rightarrow\ \rho_{1} = \frac{-\rho}{1 + \rho^{2}}, \ \rho_{\ell} = 0, \ \text{for } \ell = 2, \dots, T - 1. \tag{19}$$

$$\text{For EQC model}: \quad \epsilon_{it} = \rho\, a_{i0} + a_{it}, \ \text{with } a_{it} \overset{iid}{\sim} (0, \sigma_{\epsilon}^{2}), \ a_{i0} \overset{iid}{\sim} (0, \sigma_{\epsilon}^{2}); \ a_{it} \text{ and } a_{i0} \text{ are independent} \ \Rightarrow\ \rho_{\ell} = \frac{\rho^{2}}{1 + \rho^{2}}, \ \text{for } \ell = 1, \dots, T - 1. \tag{20}$$

Thus, computing the correlation matrix $R$ in (17) is sufficient for all three of these (and other similar) processes. The parameters $\rho_{\ell}$ for $\ell = 1, \dots, T - 1$ are the $S_{1}$ lag correlation parameters.
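To make the structure of (17) concrete, the sketch below builds $R(\rho)$ from a vector of lag correlations and fills that vector under the three special processes. It is an illustration only: the function names are hypothetical, and each lag pattern is computed directly from the stated process as covariance over variance (for the MA(1) process $\epsilon_{it} = a_{it} - \rho\, a_{i,t-1}$ this gives a lag-1 value of $-\rho/(1+\rho^{2})$, and for the EQC process the same value $\rho^{2}/(1+\rho^{2})$ at every lag).

```python
import numpy as np

def lag_corr_matrix(rho_lags):
    """Build the T x T matrix R(rho) of (17) from (rho_1, ..., rho_{T-1})."""
    rho = np.concatenate(([1.0], np.asarray(rho_lags, dtype=float)))
    idx = np.arange(rho.size)
    # Entry (u, t) depends only on the lag |u - t|.
    return rho[np.abs(idx[:, None] - idx[None, :])]

def ar1_lags(rho, T):
    """AR(1): the lag-l correlation is rho ** l."""
    return rho ** np.arange(1, T)

def ma1_lags(rho, T):
    """MA(1) with eps_t = a_t - rho * a_{t-1}: only lag 1 is non-zero."""
    lags = np.zeros(T - 1)
    if T > 1:
        lags[0] = -rho / (1.0 + rho ** 2)
    return lags

def eqc_lags(rho, T):
    """EQC with eps_t = rho * a_0 + a_t: the same correlation at every lag."""
    return np.full(T - 1, rho ** 2 / (1.0 + rho ** 2))

T = 4
R_ar1 = lag_corr_matrix(ar1_lags(0.5, T))
R_ma1 = lag_corr_matrix(ma1_lags(0.5, T))
R_eqc = lag_corr_matrix(eqc_lags(0.5, T))
```

Because all three processes are passed through the same `lag_corr_matrix` helper, any estimation routine that works with a general $R(\rho)$ covers them without needing to know which process generated the data.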

We remark that, even though the $S_{1}$ model is fitted to the $F_{1}$ data, as in (15)–(17), the super-population regression parameters $\beta$, along with the lag correlations $\rho_{\ell}$, cannot be estimated using $F_{1}$ as a sample, because it is only a hypothetical sample. In reality, these parameters have to be estimated as optimally as possible by using the survey sample (SS) $s_{1}^{*}$ constructed in (3). We provide this estimation in Section 3.3. Notice that these parameter estimates will then be used in (13) and (14) to obtain the DCMU and DAMB predictors in order to predict the targeted marginal FP totals. Now, because the super-population model is written as in (15) and (16), we can use the moment properties of the $F_{1}$ data and re-write the DCMU and DAMB prediction functions following (13) and (14), respectively, as

$$\text{DCMU Prediction Function}: \quad \tilde{\tau}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \frac{N-n}{N} \Big[\sum_{i=1}^{N} \sum_{u=1}^{t} x_{iu}' \beta\Big] \tag{21}$$

$$\text{DAMB Prediction Function}: \quad \hat{\tau}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} x_{iu}' \beta, \tag{22}$$

where $\beta$ must be estimated by accommodating the lag correlation matrix $R(\rho)$ given by (17).
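The two prediction functions (21) and (22) differ only in how the model-based term is aggregated, which a small numerical sketch makes visible. The code below is illustrative and rests on stated assumptions: the function name `lct_predictors` and the toy arrays are hypothetical, $\beta$ is treated as given (in practice it would be replaced by an estimate), and the covariates $x_{iu}$ are assumed to be available for all $N$ units.

```python
import numpy as np

def lct_predictors(y, x, in_sample, beta, t):
    """Evaluate the DCMU (21) and DAMB (22) prediction functions up to time t.

    y         : (N, T) responses; only rows of sampled units are read
    x         : (N, T, p) covariates for all units in the finite population
    in_sample : (N,) boolean inclusion indicators
    beta      : (p,) regression parameter, assumed given here
    """
    N = x.shape[0]
    n = int(in_sample.sum())
    sampled_total = y[in_sample, :t].sum()   # observed part of the total
    fitted = x[:, :t, :] @ beta              # (N, t) array of x_iu' beta
    dcmu = sampled_total + (N - n) / N * fitted.sum()
    damb = sampled_total + fitted[~in_sample].sum()
    return dcmu, damb

# Toy population: N = 4 units, T = 2 time points, p = 1 covariate.
x = np.array([[1.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])[:, :, None]
y = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, np.nan], [np.nan, np.nan]])
in_sample = np.array([True, True, False, False])
beta = np.array([2.0])

dcmu, damb = lct_predictors(y, x, in_sample, beta, t=2)
```

In this toy case the sampled total is 10 and the fitted values sum to 28 over all units and to 20 over the two non-sampled units, so the DCMU value is $10 + \tfrac{2}{4} \cdot 28 = 24$ and the DAMB value is $10 + 20 = 30$. The gap arises because DCMU averages the fitted values over the whole population, whereas DAMB uses the fitted values of the specific non-sampled units.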

We further remark that, for infinite-population-based inferences for longitudinal data, many studies (e.g., [11], [6] (ch. 2, 7)) have used the auto-correlation model (17). However, for finite sampling inferences for longitudinal data, this model is not adequately discussed in the literature. On the contrary, to model the correlations of the repeated data from the same individual in a finite population setup, some authors such as [5] (eqns. 2, 3) have used an individual-specific random-effects-based linear mixed model given by

$$y_{it} = x_{it}' \beta + \gamma_{i} + \epsilon_{it}, \tag{23}$$

where $\gamma_{i}$ denotes the random effect of the $i$-th individual, which is shared by all responses over time $t = 1, \dots, T$. This model, however, produces equal correlations among the repeated responses, and hence, as pointed out in [18] (see also [19]), it fails to accommodate the time effects on correlations. Some other authors, such as [7] (Section 7.4), following the so-called ‘working’ correlations approach of [8], have suggested using an unstructured correlation matrix, say $R^{*}(\rho) = (\rho_{ut},\ u \neq t;\ u, t = 1, \dots, T) : T \times T$, where there are $T(T-1)/2$ correlations to compute by using a method of moments. There are, however, many inference issues with this ‘working’ correlation approach. For example, (a) this unstructured correlation matrix ignores the time effects on the repeated responses and hence fails to accommodate the lag effect on the association between two repeated responses. Consequently, unlike the lagged correlation structure $R(\rho)$ in (17), which involves only $T - 1$ lag correlations, this approach amounts to computing a larger number of pairwise correlations. (b) Also, as demonstrated in [6] (Section 6.4), this ‘working’ correlations approach may produce less efficient estimates than the simpler ‘working’ independence-based approach, which makes it an unacceptable inference approach for longitudinal data.

We now proceed to the next section for the estimation of the regression parameters $\beta$ involved in the prediction functions (21) and (22). This estimation also requires the estimation of the lag correlations $\rho_{\ell}$ for $\ell = 1, \dots, T - 1$. More specifically, in Section 3.2, we demonstrate how one could optimally estimate $\beta$ using the $F_{1}$ (1)-based data, provided that these data were available. However, as these data are not available in practice, these hypothetical estimates turn out to be the so-called $F_{1}$ regression parameters [20,21]. In Section 3.3, we then use the SS $s_{1}^{*}$ from (3) to develop a sampling weighted-design unbiased (SWDU) estimate for the finite-population regression parameters. The SWDU estimator, therefore, becomes the DCMU (design cum model unbiased) estimate for the super-population regression parameter $\beta$.

3.2. Hypothetical Estimation of the $S_{1}$ Model Parameters Using $F_{1}$ Data

Estimation of $β$ :

Notice that the estimation of the regression parameter $\beta$ involved in the prediction functions (21) and (22) depends on the available repeated data up to time $t$, for all $t = 1, \dots, T$. Hence, it is convenient to use multiple parameters, namely $\beta_{(t)}$ for $\beta$, in these prediction functions, for all $t = 1, \dots, T$. Note that, for $t = 1$, the auto-correlation matrix $R(\rho)$ in (17) does not play any role in estimating $\beta_{(1)}$. For $t = 2$, only one lag correlation, namely $\rho_{1}$, would influence $\beta_{(2)}$ estimation. Similarly, for $t = 3$, $\rho_{1}$ and $\rho_{2}$ would play a role in $\beta_{(3)}$ estimation, and so on. Thus, in view of the role of this time length in $\beta$ estimation, we first rewrite the DAMB and DCMU prediction functions from (22) and (21) using $\beta_{(t)}$ for $\beta$, as follows.

$$\text{DAMB Prediction Function}: \quad \hat{\tau}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \sum_{i \notin s_{1}^{*}}^{N-n} \sum_{u=1}^{t} x_{iu}' \beta_{(t)}. \tag{24}$$

$$\text{DCMU Prediction Function}: \quad \tilde{\tau}_{y(t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u=1}^{t} y_{iu} + \frac{N-n}{N} \Big[\sum_{i=1}^{N} \sum_{u=1}^{t} x_{iu}' \beta_{(t)}\Big]. \tag{25}$$

We then obtain an optimal hypothetical estimator of $\beta_{(t)}$ using the $F_{1}$ (1)-based hypothetical data, as follows.

More specifically, as the $F_{1}$ is assumed to follow the $S_{1}$ regression model, as in (15) and (16), by writing

$$R_{(t)}(\rho) = \begin{bmatrix} 1 & \rho_{1} & \rho_{2} & \dots & \rho_{t-1} \\ \rho_{1} & 1 & \rho_{1} & \dots & \rho_{t-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{t-1} & \rho_{t-2} & \rho_{t-3} & \dots & 1 \end{bmatrix}, \tag{26}$$

with $\rho_{\ell}$ for $\ell = 1, \dots, t - 1$, one could obtain an optimal HGLS (hypothetical generalized least square) estimate of $\beta_{(t)}$, say $\hat{\beta}_{(t), HGQL} \equiv \beta_{(t), N}$, provided that $F_{1}$ was available, by solving the underlying HGLS estimating equation

$$\begin{aligned} G_{[N,(t)]}(\beta_{(t)} \mid (\rho_{0}, \rho_{1}, \dots, \rho_{t-1})) &= \sum_{i=1}^{N} x_{i}' R_{(t)}^{-1}(\rho) (y_{i} - x_{i} \beta_{(t)}) = 0 \\ \Rightarrow\ \beta_{(t)} &= \hat{\beta}_{(t), HGQL} \equiv \beta_{(t), N}, \quad \text{for } t = 1, \dots, T. \end{aligned} \tag{27}$$

Notice that this $\beta_{(t)}$ estimate is not computable; it is a hypothetical estimate only, because $F_{1}$-based data are not available. Thus, it is referred to as the $N$-dependent $F_{1}$ regression parameter [21], which is yet to be unbiasedly estimated using the survey sample $s_{1}^{*}$ (3). This we do in Section 3.3.1.

Estimation of $ρ_{ℓ}$ :

Notice that $\rho_{\ell}$ is the lag $\ell$ auto-correlation for $\ell = 1, \dots, t - 1$, as in (26) (see also (17)). More specifically, its $S_{1}$ model (15)–(17)-based formula is given by

$$\rho_{\ell} = \frac{E_{F_{1} \subset S_{1}} [(y_{iu} - x_{iu}' \beta_{(t)}) (y_{i,u+\ell} - x_{i,u+\ell}' \beta_{(t)})]}{E_{F_{1} \subset S_{1}} (y_{iu} - x_{iu}' \beta_{(t)})^{2}} \tag{28}$$

for all $u = 1, \dots, t$, with $t = 2, \dots, T$. Thus, if the $F_{1}$-based data were available, one could use the well-known method of moments and consistently estimate $\rho_{\ell}$ using the formula

$$\begin{aligned} \hat{\rho}_{\ell, HMM} &= \frac{\sum_{i=1}^{N} \sum_{u=1}^{t-\ell} [(y_{iu} - x_{iu}' \beta_{(t), N}) (y_{i,u+\ell} - x_{i,u+\ell}' \beta_{(t), N})] / N(t - \ell)}{\sum_{i=1}^{N} \sum_{u=1}^{t} [y_{iu} - x_{iu}' \beta_{(t), N}]^{2} / Nt} \\ &= \rho_{\ell, N}, \ \text{(say)}, \quad \text{for } \ell = 1, \dots, t - 1, \end{aligned} \tag{29}$$

where $\beta_{(t), N}$ is the hypothetical estimate of $\beta_{(t)}$ given by (27). Notice that $\beta_{(t), N}$ in (27) and $\hat{\rho}_{\ell, HMM} \equiv \rho_{\ell, N}$ in (29) are computed iteratively. Thus, one may use $\rho_{\ell} = 0$ to obtain the initial value of $\beta_{(t), N}$ by (27), which is then used in (29) to obtain the first-step estimate $\rho_{\ell, N}$ for $\rho_{\ell}$. The iteration continues until convergence.
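The iteration between (27) and (29) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the arrays `y` and `x` are assumed to hold only the first $t$ time points of every unit, the convergence tolerance is arbitrary, and the simulated data (normal AR(1)-correlated errors with $\rho = 0.5$) merely stand in for the unavailable $F_{1}$ data.

```python
import numpy as np

def toeplitz_corr(rho_lags):
    """Correlation matrix (26): ones on the diagonal, rho_l at lag l."""
    rho = np.concatenate(([1.0], np.asarray(rho_lags, dtype=float)))
    idx = np.arange(rho.size)
    return rho[np.abs(idx[:, None] - idx[None, :])]

def gls_beta(y, x, R):
    """Solve sum_i x_i' R^{-1} (y_i - x_i beta) = 0 for beta, as in (27)."""
    Rinv = np.linalg.inv(R)
    A = np.einsum("itp,ts,isq->pq", x, Rinv, x)
    b = np.einsum("itp,ts,is->p", x, Rinv, y)
    return np.linalg.solve(A, b)

def moment_lags(resid):
    """Moment estimates of rho_1, ..., rho_{t-1} from residuals, as in (29)."""
    N, t = resid.shape
    denom = (resid ** 2).sum() / (N * t)
    return np.array([
        (resid[:, : t - l] * resid[:, l:]).sum() / (N * (t - l)) / denom
        for l in range(1, t)
    ])

def fit_iteratively(y, x, max_iter=100, tol=1e-10):
    """Alternate between (27) and (29), starting from rho_l = 0."""
    t = y.shape[1]
    rho = np.zeros(t - 1)
    for _ in range(max_iter):
        beta = gls_beta(y, x, toeplitz_corr(rho))
        new_rho = moment_lags(y - x @ beta)
        converged = np.max(np.abs(new_rho - rho), initial=0.0) < tol
        rho = new_rho
        if converged:
            break
    return beta, rho

# Simulated stand-in for the finite population data (assumption).
rng = np.random.default_rng(1)
N, t, p = 2000, 4, 2
beta_true = np.array([1.0, -0.5])
R_true = toeplitz_corr(0.5 ** np.arange(1, t))
x = rng.normal(size=(N, t, p))
eps = rng.normal(size=(N, t)) @ np.linalg.cholesky(R_true).T
y = x @ beta_true + eps

beta_N, rho_N = fit_iteratively(y, x)
```

Starting from $\rho_{\ell} = 0$ means the first pass is an ordinary least squares fit; later passes reweight by the current $R_{(t)}(\rho)$. The sketch does not guard against a moment-based $R_{(t)}$ that is not positive definite, which a production routine would need to handle.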

3.3. Real Life Estimation of the $S_{1}$ Model Parameters Using the Survey Sample $s_{1}^{*}$

We remark that, for a given $t\ (t = 1, \dots, T)$, in the last section, we obtained the formulas for the hypothetical estimates of the super-population $(S_{1})$ regression parameter $\beta_{(t)}$ and lagged correlations $\rho_{1}, \dots, \rho_{\ell}, \dots, \rho_{t-1}$. These hypothetical estimates are $\beta_{(t), N}$ and $\rho_{1, N}, \dots, \rho_{\ell, N}, \dots, \rho_{(t-1), N}$, respectively, which are referred to as the finite-population $(F_{1})$ parameters. Thus, in reality, these parameters must be estimated based on the sampled data. The purpose of this section is to demonstrate how to exploit the survey sample $s_{1}^{*}$ in (3), taken from the $F_{1}$ in (1) by using the SRSWOR sampling design $D_{s_{1}^{*}}$ given by (4), in order to obtain design-optimal estimates for the $F_{1}$ parameters. Note that the sampling design $D_{s_{1}^{*}}$ in (4) is widely used in practice. However, other designs can also be applied when appropriate.

3.3.1. Estimating Function Approach for Design Unbiased (DU) Estimation of $β_{(t), N}$

Because

β_{(t), N}

is the solution of the

F_{1}

-based estimating equation

G_{[N, (t)]} (\cdot) = 0

given by (27), for its design-optimal estimation, one needs to develop an

s_{1}^{*}

-based estimating equation, say

g_{[n, (t)]} (\cdot) = 0

such that

E_{D_{s_{1}^{*}}} [g_{[n, (t)]} (\cdot)] \equiv E_{s_{1}^{*} \subset F_{1}} [g_{[n, (t)]} (\cdot)] = G_{[N, (t)]} (\cdot)

(e.g., [21]). To achieve this goal, based on the sampling design

D_{s_{1}^{*}}

from (4), we consider the sampling weight

w_{i \in s_{1}^{*}} = N / n

for the selection of the ith individual in the sample

s_{1}^{*}

from

F_{1} .

Now, because the N individuals belonging to the

F_{1}

are independent, one may follow the structure of

G_{[N, (t)]} (\cdot)

in (27) and develop the sampling weighted GLS (SWGLS) estimating function

g_{[n, (t)]} (\cdot)

as

\begin{matrix} g_{[n, (t)]} (β_{(t), N} | (ρ_{0}, ρ_{1}, \dots, ρ_{t - 1})) = \sum_{i = 1}^{n} w_{i \in s_{1}^{*}} x_{i}^{'} R_{(t)}^{- 1} (ρ) (y_{i} - x_{i} β_{(t), N}) \end{matrix}

(30)

further producing the SWGLS estimating equation that yields the SWGLS estimate of

β_{(t), N}

as

\begin{matrix} g_{[n, (t)]} (β_{(t), N} | (ρ_{0}, ρ_{1}, \dots, ρ_{t - 1})) = 0 \\ \Rightarrow {\hat{β}}_{(t), N} = {[\sum_{i = 1}^{n} w_{i \in s_{1}^{*}} x_{i}^{'} R_{(t)}^{- 1} (ρ) x_{i}]}^{- 1} \sum_{i = 1}^{n} w_{i \in s_{1}^{*}} x_{i}^{'} R_{(t)}^{- 1} (ρ) y_{i}, \end{matrix}

(31)

which is DU (design unbiased) for the

F_{1}

parameter

β_{(t), N} .

This is because

\begin{matrix} E_{D_{s_{1}^{*}}} [g_{[n, (t)]} (\cdot)] & = & E_{D_{s_{1}^{*}}} \sum_{i = 1}^{n} w_{i \in s_{1}^{*}} [x_{i}^{'} R_{(t)}^{- 1} (ρ) (y_{i} - x_{i} β_{(t), N})] \\ & = & \sum_{i = 1}^{N} w_{i \in s_{1}^{*}} E_{D_{s_{1}^{*}}} [δ_{i \in s_{1}^{*}}] [x_{i}^{'} R_{(t)}^{- 1} (ρ) (y_{i} - x_{i} β_{(t), N})] \\ & & \text{with } δ_{i \in s_{1}^{*}} \text{ as the inclusion indicator from (4), so that } w_{i \in s_{1}^{*}} E_{D_{s_{1}^{*}}} [δ_{i \in s_{1}^{*}}] = (N / n) (n / N) = 1 \\ & = & G_{[N, (t)]} (β_{(t), N} | (ρ_{0}, ρ_{1}, \dots, ρ_{t - 1})), \end{matrix}

(32)

which is the same as the

F_{1}

-based estimating function in (27).
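The design unbiasedness argument in (32) can be checked numerically on a small population. The sketch below is an illustration under stated assumptions: scalar values `z` stand in for the per-individual contributions $x_{i}^{'} R_{(t)}^{- 1} (ρ) (y_{i} - x_{i} β_{(t), N})$, the population has only six units, and the function name is hypothetical. It enumerates every SRSWOR sample of size n, applies the weight N/n, and averages the weighted sample totals with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations


def design_expectation_of_weighted_total(values, n):
    """Average the weighted sample total (weight N/n) over all SRSWOR samples."""
    N = len(values)
    weight = Fraction(N, n)
    # Under SRSWOR every subset of size n is equally likely.
    samples = list(combinations(range(N), n))
    total = sum(weight * sum(values[i] for i in s) for s in samples)
    return total / len(samples)


# Hypothetical per-individual contributions z_i (made-up numbers).
z = [Fraction(3), Fraction(-1), Fraction(4), Fraction(1), Fraction(-5), Fraction(9)]
expected = design_expectation_of_weighted_total(z, n=2)
population_total = sum(z)
# The design expectation of the weighted sample total equals the population total.
assert expected == population_total
```

The same enumeration applies component by component to the vector-valued estimating function, since the expectation is linear.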

3.3.2. Estimating Function Approach for Design Unbiased (DU) Estimation of $ρ_{ℓ, N}$

To obtain the sample

s_{1}^{*}

-based estimate for the finite population

(F_{1})

lag correlation parameters

ρ_{ℓ, N}

given in (29), we use the same estimating function approach as used for the estimation of regression parameters

(β_{(t), N})

in (27) with

{\hat{β}}_{(t), N}

given by (31). More specifically, one may use the sampling weighted auto-covariance function as a DU estimator for the auto-covariance in the numerator in (29) because it can be shown that

\begin{matrix} E_{D_{s_{1}^{*}}} [\sum_{i = 1}^{n} w_{i \in s_{1}^{*}} \sum_{u = 1}^{t - ℓ} (y_{i u} - x_{i u}^{'} β_{(t), N}) (y_{i, u + ℓ} - x_{i, u + ℓ}^{'} β_{(t), N})] \\ = & \sum_{i = 1}^{N} \sum_{u = 1}^{t - ℓ} [(y_{i u} - x_{i u}^{'} β_{(t), N}) (y_{i, u + ℓ} - x_{i, u + ℓ}^{'} β_{(t), N})] . \end{matrix}

(33)

Similarly, it can be shown that

\begin{matrix} E_{D_{s_{1}^{*}}} [\sum_{i = 1}^{n} w_{i \in s_{1}^{*}} \sum_{u = 1}^{t} {[y_{i u} - x_{i u}^{'} β_{(t), N}]}^{2}] = \sum_{i = 1}^{N} \sum_{u = 1}^{t} {[y_{i u} - x_{i u}^{'} β_{(t), N}]}^{2} \end{matrix}

(34)

with respect to DU estimation for the variance term in the denominator in (29).

By combining (33) and (34), one then obtains a first-order DU estimator of

ρ_{ℓ, N}

given in (29) as

\begin{matrix} {\hat{ρ}}_{ℓ, N} & = & \frac{\sum_{i = 1}^{n} [w_{i \in s_{1}^{*}} \sum_{u = 1}^{t - ℓ} [(y_{i u} - x_{i u}^{'} β_{(t), N}) (y_{i, u + ℓ} - x_{i, u + ℓ}^{'} β_{(t), N})]] / N (t - ℓ)}{\sum_{i = 1}^{n} w_{i \in s_{1}^{*}} \sum_{u = 1}^{t} {[y_{i u} - x_{i u}^{'} β_{(t), N}]}^{2} / N t} . \end{matrix}

(35)

Finally, by using

{\hat{ρ}}_{ℓ, N}

from (35), we obtain the final DU regression estimator

{\hat{β}}_{(t), N}

for

β_{(t), N}

by (31). Further, because

β_{(t), N}

from (27) is MU (model unbiased) for

β_{(t)},

it then follows that

{\hat{β}}_{(t), N}

from (31) is a DCMU (design cum model unbiased) estimator for

β_{(t)}

involved in the prediction functions in (24) and (25). Hence, by using

{\hat{β}}_{(t), N}

from (31) for

β_{(t)}

in (24) and (25), we obtain the desired prediction function estimators as

\begin{matrix} DAMB Prediction Function Estimator : \\ {\hat{\hat{τ}}}_{y (t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u = 1}^{t} y_{i u} + \sum_{i \notin s_{1}^{*}}^{N - n} \sum_{u = 1}^{t} x_{i u}^{'} {\hat{β}}_{(t), N} \end{matrix}

(36)

\begin{matrix} \Rightarrow Marginal Prediction at Time t : \\ {\hat{τ}}_{y (1)}^{*} = {\hat{\hat{τ}}}_{y (1)}; {\hat{τ}}_{y (t)}^{*} = [{\hat{\hat{τ}}}_{y (t)} - {\hat{\hat{τ}}}_{y (t - 1)}] for t = 2, \dots, T . \end{matrix}

(37)

\begin{matrix} DCMU Prediction Function Estimator : \\ {\hat{\tilde{τ}}}_{y (t)} = \sum_{i \in s_{1}^{*}}^{n} \sum_{u = 1}^{t} y_{i u} + \frac{N - n}{N} [\sum_{i = 1}^{N} \sum_{u = 1}^{t} x_{i u}^{'} {\hat{β}}_{(t), N}] \end{matrix}

(38)

\begin{matrix} \Rightarrow Marginal Prediction at Time t : \\ {\tilde{τ}}_{y (1)}^{*} = {\hat{\tilde{τ}}}_{y (1)}; {\tilde{τ}}_{y (t)}^{*} = [{\hat{\tilde{τ}}}_{y (t)} - {\hat{\tilde{τ}}}_{y (t - 1)}] for t = 2, \dots, T . \end{matrix}

(39)
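The two predictors in (36) and (38) and their marginal versions (37) and (39) differ only in how the fitted values enter the non-sampled part. A minimal sketch follows, assuming the fitted values $x_{i u}^{'} {\hat{β}}_{(t), N}$ are already available for all N individuals. For simplicity, one set of fitted values is reused for every t, whereas the text re-estimates the regression parameter at each t. All function names and numbers are illustrative assumptions.

```python
def cumulative_predictors(y, fitted, sampled, t):
    """Return (DAMB, DCMU) predictions of the cumulative total up to time t.

    y: dict {i: [y_i1, ..., y_iT]} holding responses of sampled individuals only
    fitted: list of [x_i1' b, ..., x_iT' b] for all N individuals (assumed given)
    sampled: set of sampled indices
    """
    N, n = len(fitted), len(sampled)
    observed = sum(sum(y[i][:t]) for i in sampled)
    # (36): add fitted values for the N - n non-sampled individuals only.
    damb = observed + sum(sum(f[:t]) for i, f in enumerate(fitted) if i not in sampled)
    # (38): add (N - n)/N times the fitted total over all N individuals.
    dcmu = observed + (N - n) / N * sum(sum(f[:t]) for f in fitted)
    return damb, dcmu


def marginal_predictions(y, fitted, sampled, T):
    """(37) and (39): difference successive cumulative predictions."""
    cum = [cumulative_predictors(y, fitted, sampled, t) for t in range(1, T + 1)]
    marg = [cum[0]]
    for prev, cur in zip(cum, cum[1:]):
        marg.append((cur[0] - prev[0], cur[1] - prev[1]))
    return marg


# Toy population of N = 4 with n = 2 sampled individuals and T = 2 (made-up numbers).
fitted = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
sampled = {0, 1}
y = {0: [1.5, 0.5], 1: [2.0, 2.5]}
marginals = marginal_predictions(y, fitted, sampled, T=2)
```

Each entry of `marginals` is a pair (DAMB, DCMU) for one time point. The two values differ whenever the average fitted value in the sample differs from that in the whole population.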

4. Proposed DCMU Prediction Method Using SSCLS Sample: A Generalization

As described in Section 2, more specifically in Section 2.2, cluster-based

F

inferences require the estimation of an additional cluster correlation parameter on top of the regression and longitudinal correlation estimation performed in Section 3. In an infinite population setup, this type of familial/cluster longitudinal data has been discussed extensively in the literature (e.g., [6] chps 3, 8, 9).

Turning back to the

F

setup, in Section 4.1 below, we write a cluster-based finite population (FP:

F_{2}

) and provide a cluster longitudinal super-population (SP:

S_{2}

) model to fit the

F_{2}

-based hypothetical data. Similar to the previous section, this SP model fitting to the FP data would be utilized to obtain hypothetical estimates for the SP parameters, where these estimates are referred to as the FP parameters [20,21]. This hypothetical estimation is given in Section 4.2. Next, we exploit the SSCLS sample

s_{2}^{*}

from (6) for DCMU estimation for all parameters, including the cluster regression parameters. We remark here that, in the context of MU prediction, this type of cluster-based

F

was used by many authors in the past in a two-stage cluster sampling setup [22,23]. However, as demonstrated in [16], the MU estimation approach is flawed and would result in an invalid MU prediction. On the contrary, similarly to Section 3, in this section, we provide valid DCMU estimation-based prediction functions.

4.1. Cluster-Based Longitudinal FP and Its SP Model

As opposed to the individual-based FP (1) studied in the last section, here we consider a cluster-based, such as a household-based, FP

(F_{2}),

involving a large number of independent households, with each member of a household having repeated potential responses in a longitudinal setup. Typically, cluster/household sizes are small. Next, because the

F_{2}

is unknown or hypothetical, any finite sampling inferences require a sample, say

s_{2}^{*}

, to be chosen from the FP and exploited to obtain design-optimal inferences for the targeted FP parameters. When the whole small-sized cluster is chosen as a sampling unit, the resulting longitudinal survey-based sample is referred to as the SSCLS (single-stage cluster-based longitudinal survey) sample. In Section 2.2, it was shown how one can construct the sample

s_{2}^{*}

(6) from

F_{2}

defined by (5).

Now, by treating the

F_{2}

in (5) as a large sample from a longitudinal super-population

(S_{2}),

its cluster-based hypothetical longitudinal data may be modeled as

\begin{matrix} F_{2} \in S_{2} : y_{c i t} & = & x_{c i t}^{'} β + γ_{c} + ϵ_{c i t}, \quad t = 1, \dots, T; \; i = 1, \dots, N_{c}; \; c = 1, \dots, K; \\ & & γ_{c} \overset{i i d}{\sim} (0, σ_{γ}^{2}), \quad ϵ_{c i t} \overset{i i d}{\sim} (0, σ_{ϵ}^{2}), \quad γ_{c} \text{ and } ϵ_{c i t} \text{ are independent}; \\ & & corr (ϵ_{c i t}, ϵ_{c i, t + ℓ}) \equiv ρ_{ℓ}, \quad ℓ = 1, \dots, T - 1, \end{matrix}

(40)

or, equivalently, in vector-matrix notation,

\begin{matrix} F_{2} \in S_{2} : y_{c i} & = & x_{c i} β + 1_{T} γ_{c} + ϵ_{c i}, \end{matrix}

(41)

where, for

y_{c i} = {(y_{c i 1}, \dots, y_{c i t}, \dots, y_{c i T})}^{'}, ϵ_{c i} = {(ϵ_{c i 1}, \dots, ϵ_{c i t}, \dots, ϵ_{c i T})}^{'}

we write

\begin{matrix} ϵ_{c i} & \sim & (0 \cdot 1_{T}, σ_{ϵ}^{2} R (ρ_{1}, \dots, ρ_{ℓ}, \dots, ρ_{T - 1})) \\ \Rightarrow & E_{S_{2}} (Y_{c i}) = x_{c i} β, {c o v}_{S_{2}} (Y_{c i}) = σ_{γ}^{2} 1_{T} 1_{T}^{'} + σ_{ϵ}^{2} R (ρ) = σ_{γ}^{2} U_{T T} + σ_{ϵ}^{2} R (ρ), \end{matrix}

(42)

where

1_{T}

is the T-dimensional unit vector (vector of ones) and

R (ρ)

is a

T \times T

general lag correlation matrix, as in (17) under Section 3.1,

ρ_{ℓ}

being referred to as the lag ℓ longitudinal correlation, and

U_{T T}

is the

T \times T

unit matrix (matrix of ones). Further notice that, at every time point, two individuals, i and j, belonging to the same cluster c are structurally correlated, and hence

\begin{matrix} y_{c i} & = & x_{c i} β + 1_{T} γ_{c} + ϵ_{c i}, and \\ y_{c j} & = & x_{c j} β + 1_{T} γ_{c} + ϵ_{c j} \end{matrix}

have their covariance matrix as

\begin{matrix} {c o v}_{S} (Y_{c i}, Y_{c j}^{'}) = σ_{γ}^{2} 1_{T} 1_{T}^{'} = σ_{γ}^{2} U_{T T} \end{matrix}

(43)

where

U_{T T} : T \times T

is the unit matrix. Thus, by combining (42) and (43), we obtain the mean and covariance structure for the longitudinal response vector for the individuals in the cth cluster, namely for

y_{c} = {(y_{c 1}^{'}, \dots, y_{c i}^{'}, \dots, y_{c N_{c}}^{'})}^{'} : N_{c} T \times 1,

as

\begin{matrix} E_{S_{2}} [Y_{c}] & = & X_{c} β = (\begin{matrix} x_{c 1} \\ ⋮ \\ x_{c i} \\ ⋮ \\ x_{c N_{c}} \end{matrix}) β : N_{c} T \times 1 \end{matrix}

(44)

\begin{matrix} {c o v}_{S_{2}} [Y_{c}] & = & σ_{ϵ}^{2} ⨁ [R (ρ), \dots, R (ρ), \dots, R (ρ)] + σ_{γ}^{2} U_{N_{c} T, N_{c} T} \\ \equiv & σ_{ϵ}^{2} [I_{N_{c}} \otimes R (ρ)] + σ_{γ}^{2} U_{N_{c} T, N_{c} T} \\ = & Σ_{c, N_{c} T} (σ^{2}, ϕ, ρ) : N_{c} T \times N_{c} T \end{matrix}

(45)

where

U_{N_{c} T, N_{c} T} : N_{c} T \times N_{c} T

is the unit matrix (all elements equal to one),

σ^{2} = [σ_{ϵ}^{2} + σ_{γ}^{2}], ϕ = σ_{γ}^{2} / σ^{2},

and

ρ

is the longitudinal correlation index parameter representing all lagged correlations, namely

ρ \equiv (ρ_{1}, \dots, ρ_{ℓ}, \dots, ρ_{T - 1}) .

We remark that the proposed cluster-based longitudinal model (40)–(43) is a generalization of the individual-based longitudinal model (15)–(17) under Section 3.1 to the cluster setup. This model (40)–(43) may also be treated as a generalization of the cluster regression model (e.g., [22] (eqns. 3.1, 3.2), [24] (eqn 2.1), [25] (eqns. 2.1, 2.2), [23] (eqn. 1), and [14] (eqn. 9.11)) to the longitudinal setup. There is also a difference at the cluster level: we consider small-sized clusters, such as households, whereas these studies dealt with larger clusters, which prompted the need for two-stage cluster-sampling-based inferences, as opposed to our single-stage cluster-sampling-based inference.

4.2. FP Data-Based Hypothetical Estimation Equations

Notice that if the FP

(F_{2})

data (5) were available, the SP regression parameters

β

could be optimally estimated by exploiting the SP model-based moment properties (44) and (45), more specifically by solving the hypothetical GLS (HGLS) estimating equation as

\begin{matrix} Hypothetical Estimating Equation for β : \\ G_{N} (β | σ^{2}, ϕ, ρ) = \sum_{c = 1}^{K} X_{c}^{'} Σ_{c, N_{c} T}^{- 1} (σ^{2}, ϕ, ρ) (y_{c} - X_{c} β) = 0 \\ \Rightarrow & {\hat{β}}_{(T), H G L S} = {[\sum_{c = 1}^{K} X_{c}^{'} Σ_{c, N_{c} T}^{- 1} (σ^{2}, ϕ, ρ) X_{c}]}^{- 1} \sum_{c = 1}^{K} X_{c}^{'} Σ_{c, N_{c} T}^{- 1} (σ^{2}, ϕ, ρ) y_{c} \\ = & β_{(T), N}, (say) \end{matrix}

(46)

where

{\hat{β}}_{(T), H G L S}

is used to denote the hypothetical GLS estimator of

β

involving the

F_{2}

-based hypothetical responses up to time T in a cluster setup, which is also denoted as

β_{(T), N},

an N-dependent FP regression parameter [16,20,21]. Note that the computation for

Σ_{c, N_{c} T}^{- 1} (σ^{2}, ϕ, ρ)

in (46) may be simplified by recalling the formula for

Σ_{c, N_{c} T} (\cdot)

from (45) as

Σ_{c, N_{c} T} (σ^{2}, ϕ, ρ) = σ_{ϵ}^{2} [I_{N_{c}} \otimes R (ρ)] + σ_{γ}^{2} U

and hence writing

\begin{matrix} Σ_{c, N_{c} T}^{- 1} (σ^{2}, ϕ, ρ) \\ = & \frac{1}{σ_{ϵ}^{2}} [I_{N_{c}} \otimes R^{- 1}] - \frac{σ_{γ}^{2}}{σ_{ϵ}^{4}} [\frac{[I_{N_{c}} \otimes R^{- 1}] U [I_{N_{c}} \otimes R^{- 1}]}{1 + \frac{σ_{γ}^{2}}{σ_{ϵ}^{2}} 1_{c}^{'} [I_{N_{c}} \otimes R^{- 1}] 1_{c}}] \end{matrix}

(47)

(e.g., [6] (Section 3.1)) where

1_{c}

is the

N_{c} T

-dimensional vector of ones,

U = 1_{c} 1_{c}^{'},

and

σ^{2} = [σ_{γ}^{2} + σ_{ϵ}^{2}], ϕ = σ_{γ}^{2} / σ^{2}, ρ \equiv (ρ_{1}, \dots, ρ_{ℓ}, \dots, ρ_{T - 1}) .
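The closed-form inverse can be checked numerically with the Sherman–Morrison identity, since the covariance in (45) is a block-diagonal matrix plus a rank-one matrix of ones. Written out in full, the identity puts the factor $σ_{γ}^{2} / σ_{ϵ}^{4}$ on the rank-one correction. The sketch below is a pure-Python illustration under assumed values ($T = 3$, $N_{c} = 2$, AR(1) correlations with $ρ = 0.5$, $σ_{ϵ}^{2} = 2$, $σ_{γ}^{2} = 1$); all helper names are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]


def block_diag_repeat(R, copies):
    """I_copies (Kronecker) R: block-diagonal matrix holding `copies` copies of R."""
    T = len(R)
    m = copies * T
    return [[R[i % T][j % T] if i // T == j // T else 0.0 for j in range(m)] for i in range(m)]


def sigma_and_inverse(R, n_c, s2_eps, s2_gam):
    """Build the cluster covariance as in (45) and its Sherman-Morrison inverse."""
    m = n_c * len(R)
    B = block_diag_repeat(R, n_c)
    Binv = block_diag_repeat(inverse(R), n_c)
    sigma = [[s2_eps * B[i][j] + s2_gam for j in range(m)] for i in range(m)]
    # a = (I kron R^{-1}) 1_c, i.e. the row sums of the block-diagonal inverse.
    a = [sum(row) for row in Binv]
    denom = 1.0 + (s2_gam / s2_eps) * sum(a)
    coef = s2_gam / s2_eps ** 2
    # (I kron R^{-1}) U (I kron R^{-1}) equals a a' because Binv is symmetric.
    sigma_inv = [[Binv[i][j] / s2_eps - coef * a[i] * a[j] / denom for j in range(m)]
                 for i in range(m)]
    return sigma, sigma_inv


R = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
sigma, sigma_inv = sigma_and_inverse(R, n_c=2, s2_eps=2.0, s2_gam=1.0)
product = matmul(sigma, sigma_inv)  # should be numerically close to the identity
```

The practical benefit is that only the $T \times T$ matrix $R$ has to be inverted, rather than the full $N_{c} T \times N_{c} T$ covariance for each cluster.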

As the optimal estimation of

β

by (46) depends on the estimates of cluster variance and longitudinal correlation parameters, we estimate the rest of these parameters using an MU estimating equation approach. More specifically, by generalizing the individual-based hypothetical MM (HMM) Formula (29) under Section 3.2 to the cluster/household setup (see also [6] (Section 3.1.2)), we obtain the HMM estimating formula for

ρ_{ℓ}

as

\begin{matrix} Hypothetical Method of Moments Estimating Formula for ρ_{ℓ} : ℓ = 1, \dots, t - 1 \\ {\hat{ρ}}_{ℓ, H M M} = \frac{1}{[1 - {\hat{ϕ}}_{0, H M M}]} \\ \times & [\frac{\sum_{c = 1}^{K} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t - ℓ} [(y_{c i u} - x_{c i u}^{'} β_{(t), N}) (y_{c i, u + ℓ} - x_{c i, u + ℓ}^{'} β_{(t), N})] / N (t - ℓ)}{\sum_{c = 1}^{K} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} {[y_{c i u} - x_{c i u}^{'} β_{(t), N}]}^{2} / N t} - {\hat{ϕ}}_{0, H M M}] \\ = & ρ_{ℓ, N}, (s a y), for ℓ = 1, \dots, t - 1, \end{matrix}

(48)

with

N = \sum_{c = 1}^{K} N_{c},

and where

β_{(t), N}

is the hypothetical estimate of

β_{(t)}

obtained by the formula in (46) but using cluster-based individuals’ responses up to time

t,

for

t = 1, \dots, T .

Furthermore,

{\hat{ϕ}}_{0, H M M}

in (48) is an initial estimate of

ϕ,

a specialized value of

{\hat{ϕ}}_{H M M}

using

{\hat{ρ}}_{ℓ} = 0 .

The formulas for

{\hat{ϕ}}_{H M M}

and

{\hat{ϕ}}_{0, H M M}

are given below in (49).

To formulate an estimate of

ϕ,

notice from (47) that the

ϕ

parameter is involved in variances and covariances of the within-cluster responses. Thus, by pooling the

F_{2}

-based sum of squares and sum of products and equating to its

S_{2}

-based expectation, after some algebra, we obtain

\begin{matrix} \text{Hypothetical Method of Moments Estimating Formula for } ϕ : \\ \tilde{S} = [\sum_{c = 1}^{K} \sum_{i = 1}^{N_{c}} \sum_{u, u^{'} = 1}^{t} [(y_{c i u} - x_{c i u}^{'} β_{(t), N}) (y_{c i u^{'}} - x_{c i u^{'}}^{'} β_{(t), N})] \\ + & 2 \sum_{c = 1}^{K} \sum_{i < j}^{N_{c}} \sum_{u, u^{'} = 1}^{t} [(y_{c i u} - x_{c i u}^{'} β_{(t), N}) (y_{c j u^{'}} - x_{c j u^{'}}^{'} β_{(t), N})]] \\ / & [\sum_{c = 1}^{K} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} {[y_{c i u} - x_{c i u}^{'} β_{(t), N}]}^{2}] \\ \Rightarrow & {\hat{ϕ}}_{H M M} = \frac{\tilde{S} - \frac{1}{t} [t + 2 \{(t - 1) ρ_{1} + \dots + 2 ρ_{t - 2} + ρ_{t - 1}\}]}{t (\sum_{c = 1}^{K} N_{c}^{2}) / N - \frac{1}{t} [t + 2 \{(t - 1) ρ_{1} + \dots + 2 ρ_{t - 2} + ρ_{t - 1}\}]} \\ = & ϕ_{N}, (say), \end{matrix}

(49)

with the initial

ϕ

estimate used in (48), as

\begin{matrix} {\hat{ϕ}}_{0, H M M} & = & \frac{\tilde{S} - 1}{t (\sum_{c = 1}^{K} N_{c}^{2}) / N - 1} \end{matrix}

The remaining parameter

σ^{2} = [σ_{γ}^{2} + σ_{ϵ}^{2}],

as it is a variance parameter for all individuals under all households, has its HMM formula already used in the denominators of (48) and (49). For the sake of completeness, we give its HMM formula as follows:

\begin{matrix} {\hat{σ}}_{H M M}^{2} & = & \sum_{c = 1}^{K} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} {[y_{c i u} - x_{c i u}^{'} β_{(t), N}]}^{2} / N t = σ_{N}^{2}, (s a y) . \end{matrix}

(50)

where

N = \sum_{c = 1}^{K} N_{c} .
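The moment formula (49) for the cluster correlation can be sketched in a few lines once the residuals are grouped by cluster. The sketch assumes precomputed residuals $y_{c i u} - x_{c i u}^{'} β_{(t), N}$ and uses the fact that the within-member and between-member products in the numerator of $\tilde{S}$ add up to the squared cluster total of residuals. The function name and the toy data are illustrative assumptions.

```python
def phi_hmm(cluster_residuals, rho):
    """Moment estimate of phi following (49).

    cluster_residuals: list over clusters; each cluster is a list over members;
    each member is a list [e_ci1, ..., e_cit] of precomputed residuals.
    rho: list [rho_1, ..., rho_{t-1}] of lag correlations; passing zeros gives
    the initial estimate phi_0 used in (48).
    """
    t = len(cluster_residuals[0][0])
    N = sum(len(c) for c in cluster_residuals)
    sum_nc_sq = sum(len(c) ** 2 for c in cluster_residuals)
    # Numerator of S-tilde: squared sum of all residuals within each cluster.
    num = sum(sum(sum(member) for member in c) ** 2 for c in cluster_residuals)
    # Denominator of S-tilde: pooled sum of squared residuals.
    den = sum(e * e for c in cluster_residuals for member in c for e in member)
    s_tilde = num / den
    # (1/t) [t + 2 {(t - 1) rho_1 + ... + 2 rho_{t-2} + rho_{t-1}}]
    r_term = (t + 2 * sum((t - lag) * rho[lag - 1] for lag in range(1, t))) / t
    return (s_tilde - r_term) / (t * sum_nc_sq / N - r_term)


# Two toy clusters of two members each, one time point (made-up residuals).
identical_within = [[[1.0], [1.0]], [[-1.0], [-1.0]]]
phi_identical = phi_hmm(identical_within, rho=[])
```

In this toy case the members of each cluster share the same residual, so the moment estimate reaches its upper limit of one.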

4.3. Survey Sample $(s_{2}^{*})$ -Based DU Estimating Equations

Notice that all SP parameter estimates obtained in the last section, namely

β_{(T), N}, ρ_{ℓ, N},

ϕ_{N},

and

σ_{N}^{2},

computed by (46), (48), (49), and (50), respectively, are all FP parameters. Their real life estimation has to be performed using the sampled data

s_{2}^{*}

from (6), as well as the rest of the covariate information available from the sampling frame. Now, to obtain DU (design unbiased) estimates for these FP parameters, we simply use the sampling weighted (SW) total for each of the FP-based total functions involved in the formulas from (46) to (50). Thus, by using the sampling weight, say

w_{c \in s_{2}^{*}} = K / k

following (7) (which is the inverse of the inclusion probability for the cth cluster in the sample), we obtain the DU estimates for the aforementioned FP parameters as

\begin{matrix} {\hat{\hat{β}}}_{(T), S W G L S} = {[\sum_{c = 1}^{k} w_{c \in s_{2}^{*}} X_{c}^{'} Σ_{c, N_{c} T}^{- 1} (\cdot) X_{c}]}^{- 1} \sum_{c = 1}^{k} w_{c \in s_{2}^{*}} X_{c}^{'} Σ_{c, N_{c} T}^{- 1} (\cdot) y_{c} = {\hat{β}}_{(T), N}; \end{matrix}

(51)

\begin{matrix} {\hat{\hat{ρ}}}_{ℓ, S W M M} \equiv {\hat{ρ}}_{ℓ, N} = \frac{1}{1 - {\hat{ϕ}}_{0, S W M M}} \\ \times & [(\sum_{c = 1}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t - ℓ} w_{c \in s_{2}^{*}} [(y_{c i u} - x_{c i u}^{'} β_{(t), N}) (y_{c i, u + ℓ} - x_{c i, u + ℓ}^{'} β_{(t), N})] / N (t - ℓ)) \\ \times & {(\sum_{c = 1}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} w_{c \in s_{2}^{*}} {[y_{c i u} - x_{c i u}^{'} β_{(t), N}]}^{2} / N t)}^{- 1} - {\hat{ϕ}}_{0, S W M M}] \end{matrix}

(52)

\begin{matrix} {\hat{\hat{ϕ}}}_{S W M M} \equiv {\hat{ϕ}}_{N} \\ = & \frac{{\hat{\tilde{S}}}_{S W M M} - \frac{1}{t} [t + 2 \{(t - 1) ρ_{1} + \dots + 2 ρ_{t - 2} + ρ_{t - 1}\}]}{t (\sum_{c = 1}^{K} N_{c}^{2}) / N - \frac{1}{t} [t + 2 \{(t - 1) ρ_{1} + \dots + 2 ρ_{t - 2} + ρ_{t - 1}\}]} \end{matrix}

(53)

\begin{matrix} {\hat{\hat{σ}}}_{S W M M}^{2} = \sum_{c = 1}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} w_{c \in s_{2}^{*}} {[y_{c i u} - x_{c i u}^{'} β_{(t), N}]}^{2} / N t = {\hat{σ}}_{N}^{2}, \end{matrix}

(54)

where in (53)

\begin{matrix} {\hat{\tilde{S}}}_{S W M M} = [\sum_{c = 1}^{k} \sum_{i = 1}^{N_{c}} \sum_{u, u^{'} = 1}^{t} w_{c \in s_{2}^{*}} [(y_{c i u} - x_{c i u}^{'} β_{(t), N}) (y_{c i u^{'}} - x_{c i u^{'}}^{'} β_{(t), N})] \\ + & 2 \sum_{c = 1}^{k} \sum_{i < j}^{N_{c}} \sum_{u, u^{'} = 1}^{t} w_{c \in s_{2}^{*}} [(y_{c i u} - x_{c i u}^{'} β_{(t), N}) (y_{c j u^{'}} - x_{c j u^{'}}^{'} β_{(t), N})]] \\ / & [\sum_{c = 1}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} w_{c \in s_{2}^{*}} {[y_{c i u} - x_{c i u}^{'} β_{(t), N}]}^{2}], \end{matrix}

and in (52)

\begin{matrix} {\hat{ϕ}}_{0, S W M M} = \frac{{\hat{\tilde{S}}}_{S W M M} - 1}{t (\sum_{c = 1}^{K} N_{c}^{2}) / N - 1} . \end{matrix}

4.4. Formulation of the DCMU Prediction Functions Using SSCLS Sample

As the

F_{2}

is unknown, for the prediction of the

F_{2}

total at a given time

t,

we first write its LCT (longitudinal cumulative total) and split this total in terms of sampled (ss) and non-sampled (ns) response totals as

\begin{matrix} τ_{y (t)} & = & \sum_{c = 1}^{K} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} \\ = & \sum_{c \in s_{2}^{*}}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} (s s) + \sum_{c \notin s_{2}^{*}}^{K - k} [\sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} (n s)], for t = 1, \dots, T, \end{matrix}

(55)

which is similar to, but different from, the LCT split in (9) based on the SSILS sample. By following the same technique used in the SSILS setup, more specifically by following (14) and (13) from Section 3, we write the final DAMB and DCMU predictors for the cumulative and marginal totals as follows:

\begin{matrix} DAMB Prediction Function Estimator : \\ {\hat{\hat{τ}}}_{y (t)} & = & \sum_{c \in s_{2}^{*}}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} + \sum_{c \notin s_{2}^{*}}^{K - k} [\sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} {\hat{E}}_{s_{2}^{*} \subset F_{2} \subset S_{2}} (y_{c i u})] \end{matrix}

(56)

\begin{matrix} = & \sum_{c \in s_{2}^{*}}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} + \sum_{c \notin s_{2}^{*}}^{K - k} [\sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} x_{c i u}^{'} {\hat{β}}_{(t), N} ({\hat{ρ}}_{N}, {\hat{ϕ}}_{N})] \end{matrix}

(57)

\begin{matrix} \Rightarrow DAMB Marginal Prediction at Time t : \\ {\hat{τ}}_{y (1)}^{*} = {\hat{\hat{τ}}}_{y (1)}; {\hat{τ}}_{y (t)}^{*} = [{\hat{\hat{τ}}}_{y (t)} - {\hat{\hat{τ}}}_{y (t - 1)}] for t = 2, \dots, T . \end{matrix}

(58)

\begin{matrix} DCMU Prediction Function Estimator : \\ {\hat{\tilde{τ}}}_{y (t)} & = & \sum_{c \in s_{2}^{*}}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} + \frac{K - k}{K} \sum_{c = 1}^{K} [\sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} {\hat{E}}_{s_{2}^{*} \subset F_{2} \subset S_{2}} (y_{c i u})] \end{matrix}

(59)

\begin{matrix} = & \sum_{c \in s_{2}^{*}}^{k} \sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} y_{c i u} + \frac{K - k}{K} \sum_{c = 1}^{K} [\sum_{i = 1}^{N_{c}} \sum_{u = 1}^{t} x_{c i u}^{'} {\hat{β}}_{(t), N} ({\hat{ρ}}_{N}, {\hat{ϕ}}_{N})] \end{matrix}

(60)

\begin{matrix} \Rightarrow DCMU Marginal Prediction at Time t : \\ {\tilde{τ}}_{y (1)}^{*} = {\hat{\tilde{τ}}}_{y (1)}; {\tilde{τ}}_{y (t)}^{*} = [{\hat{\tilde{τ}}}_{y (t)} - {\hat{\tilde{τ}}}_{y (t - 1)}] for t = 2, \dots, T . \end{matrix}

(61)

Notice that

{\hat{β}}_{(t), N} ({\hat{ρ}}_{N}, {\hat{ϕ}}_{N})

in (57) and (60) is a DCMU estimate of

β

, as in (51), which is computed based on the SSCLS sample

s_{2}^{*}

from (6). As this estimate depends on the estimates of longitudinal correlation

ρ

and cluster correlation

ϕ,

these latter estimates are computed step by step as in (52) and (53), respectively.

5. Prediction Comparison Using Simulation Results

Our objective in this section is to examine the finite sampling performance of the proposed DCMU and DAMB prediction estimators for the FP totals using both SSILS and SSCLS survey data. The precise formulas for these predictors are developed in Section 3.3.2 and Section 4.4 based on the SSILS and SSCLS samples, respectively. As a criterion to understand the performance of the parameter estimators, we checked the amount of bias of an estimator from its true value. However, because a large bias combined with a small standard error indicates the worst performance of an estimator, we use the percentage relative bias, which scales the absolute bias by the standard error (see (63) below under Section 5.1), as the criterion to compare the performance of the DCMU and DAMB total predictors.

We now proceed to the simulation studies, first for SSILS sample-based prediction and then for SSCLS sample-based prediction. Details, including how the data are generated and how the estimators are obtained to compute the prediction functions, are given in Section 5.1 using the SSILS sample and in Section 5.2 using the SSCLS sample.

5.1. Simulation Study 1: Prediction Performance Using Individual-Based Longitudinal Survey Sample

Recall from Section 3 that, even though the longitudinal responses from the individuals in

F_{1}

(1) are hypothetical, in a model-based approach, it is assumed that the repeated responses of an individual are likely to follow an auto-correlation structure given in (15)–(17). In this section, we conduct a simulation study to examine first the performance of a longitudinal correlation-based SWGLS (sampling weighted GLS) approach (31) in estimating the

F_{1}

regression parameters over time by using the survey sample

s_{1}^{*}

given by (3) and (4). We then examine the performance of two competing predictors, namely the DAMB (design-assisted model-based) and DCMU (design cum model unbiased) predictors given by (36)–(39), for FP

(F_{1})

totals over time. We remark that these predictors are formulated using DCMU estimates for SP

(S_{1})

regression parameters involved in the prediction functions.

5.1.1. Simulation Design in Steps (S1–S7)

S1.: We specify the SP $(S_{1})$ model (15) with a set of regression parameters as $β_{0} = 1.0, β_{1} = 0.5, β_{2} = 0.2,$ and $β_{3} = 0.5,$ and its longitudinal correlation structure (17) based on $T = 4$ time periods involving lag correlations: $ρ_{1}, ρ_{2},$ and $ρ_{3} .$
S2.: We consider three widely used correlation structures, namely AR(1) $(ρ = 0.7),$ MA(1) $(ρ = 0.4),$ and EQC $(ρ = 0.4) .$ As our estimation method is not model-dependent, we estimate the lag correlations $ρ_{1}, ρ_{2},$ and $ρ_{3}$ irrespective of the model. For example, when data are generated with the true model, say AR(1) $(ρ = 0.7),$ estimates of lag correlations are obtained for the true SP lag correlations $ρ_{1} = ρ = 0.7, ρ_{2} = 0.49,$ and $ρ_{3} = 0.34,$ and so on.
S3.: We consider $N = 1000$ household leaders with responses on household annual electricity consumption for $T = 4$ years, in the FP $(F_{1})$ (1), and their four time-independent household-related covariate values, with covariate $x_{2}$ as the size of the household and $(x_{3}, x_{4}) \equiv [(0, 0) or (1, 0) or (0, 1)]$ as two categorical covariates representing three household income levels, more specifically with

$\begin{matrix} x_{i 1} & = & 1 an intercept covariate for i = 1, \dots, 1000 \\ x_{i 2} & = & \{\begin{matrix} 1 & for i = 1, \dots, 100 \\ 2 & for i = 101, \dots, 500 \\ 3 & for i = 501, \dots, 700 \\ 4 & for i = 701, \dots, 1000 \end{matrix} \\ (x_{i 3}, x_{i 4}) & = & \{\begin{matrix} (0, 0) & for i = 1, \dots, 50; 101, \dots, 150; \\ 501, \dots, 550; 701, \dots, 750 \\ (1, 0) & for i = 51, \dots, 75; 151, \dots, 450; \\ 551, \dots, 650; 751, \dots, 950 \\ (0, 1) & for i = 76, \dots, 100; 451, \dots, 500; \\ 651, \dots, 700; 951, \dots, 1000 \end{matrix} \end{matrix}$

(62)
S4.: Using $σ_{ϵ}^{2} = 1,$ and the parameter values, correlation structures, and covariates from steps 1 to 3 above, we generate, for a given $t (t = 1, \dots, T),$ the longitudinal responses ${y_{i u}, u = 1, \dots, t; i = 1, \dots, N} .$
S5.: We then choose a sample $s_{1}^{*}$ (3) of size $n = 100$ households from the $F_{1}$ of size $N = 1000,$ using the SRSWOR sampling design, as in (4), along with their responses and covariates ${(y_{i u}, x_{i u}), u = 1, \dots, t; i = 1, \dots, n} .$ The covariate values for the non-sampled individuals are assumed to be known from an underlying sampling frame.
S6.: Finally, the sample $s_{1}^{*}$ from step 5 is used to compute the SWGLS estimate of $β_{(t), N}$ (FP parameter), namely ${\hat{β}}_{(t), N}$ by (31), and the lag correlation estimate ${\hat{ρ}}_{ℓ, N}$ for $ρ_{ℓ, N}$ (FP parameter) by (35). These estimates, more specifically ${\hat{β}}_{(t), N},$ are then used to compute the marginal (at a given time t) total prediction estimates ${\hat{τ}}_{(t)}^{*}$ by (37) and ${\tilde{τ}}_{(t)}^{*}$ by (39). The percentage relative biases (PRB), namely

$\begin{matrix} P R B ({\hat{τ}}_{(t)}^{*}) = \frac{| {\hat{τ}}_{(t)}^{*} - τ_{(t)} |}{s . e . ({\hat{τ}}_{(t)}^{*})} \times 100; P R B ({\tilde{τ}}_{(t)}^{*}) = \frac{| {\tilde{τ}}_{(t)}^{*} - τ_{(t)} |}{s . e . ({\tilde{τ}}_{(t)}^{*})} \times 100, \end{matrix}$

(63)

are also computed.
S7.: We then repeat steps 5 and 6 25 times and compute the simulation averages of the regression and correlation estimates under the three correlation structures, AR(1), MA(1), and EQC, which are reported in Table 1, Table 2 and Table 3, respectively. The simulation averages of the prediction estimates, along with their percentage relative biases (PRBs), are reported in Table 4.
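Three small ingredients of steps S2, S5, and S6 can be sketched as follows: the true lag correlations implied by each structure, an SRSWOR draw, and the PRB in (63). This is an illustration under stated assumptions. In particular, for MA(1) the value of $ρ$ is taken to be the lag-1 correlation itself, with higher lags equal to zero, which is one common reading that the text does not spell out. The function names, the seed, and the numbers passed to the PRB function are hypothetical.

```python
import random


def lag_correlations(structure, rho, T):
    """True lag-1, ..., lag-(T-1) correlations for the three structures in S2."""
    lags = range(1, T)
    if structure == "AR1":
        return [rho ** lag for lag in lags]
    if structure == "MA1":
        # Assumption: rho is the lag-1 correlation; all higher lags vanish.
        return [rho if lag == 1 else 0.0 for lag in lags]
    if structure == "EQC":
        return [rho for _ in lags]
    raise ValueError("structure must be 'AR1', 'MA1', or 'EQC'")


def draw_srswor(N, n, seed):
    """Simple random sample without replacement of n labels from 1..N (step S5)."""
    return sorted(random.Random(seed).sample(range(1, N + 1), n))


def percentage_relative_bias(estimate, truth, std_error):
    """PRB as in (63): absolute bias relative to the standard error, times 100."""
    return abs(estimate - truth) / std_error * 100


ar1 = lag_correlations("AR1", 0.7, T=4)   # approximately 0.7, 0.49, 0.343
sample_ids = draw_srswor(N=1000, n=100, seed=1)
prb_example = percentage_relative_bias(estimate=110.0, truth=100.0, std_error=20.0)
```

With the made-up inputs above, the absolute bias is half the standard error, so the PRB equals 50.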

Table 1. AR(1) $(ρ = 0.7)$ correlation structure-based $F_{1}$ regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each $(s_{1}^{*})$ of size $n = 100$ chosen from the $F_{1}$ of size $N = 1000 .$

Table 2. MA(1) $(ρ = 0.4)$ correlation structure-based $F_{1}$ regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each $(s_{1}^{*})$ of size $n = 100$ chosen from the $F_{1}$ of size $N = 1000 .$

Table 3. EQC $(ρ = 0.4)$ correlation structure-based $F_{1}$ regression parameters and their SWGLS (sampling weighted GLS) estimates along with standard errors (given in parentheses) using 25 samples each $(s_{1}^{*})$ of size $n = 100$ chosen from the $F_{1}$ of size $N = 1000 .$

Table 4. Design-assisted model-based $({\hat{τ}}_{(t)}^{*})$ and design cum model unbiased $({\tilde{τ}}_{(t)}^{*})$ predictions along with their standard errors (given in parentheses) and percentage relative biases (given in square brackets) for $F_{1}$ totals $(τ_{(t)})$ over time $t = 1, \dots, 4,$ using 25 samples each $(s_{1}^{*})$ of size $n = 100$ chosen from the $F_{1}$ of size $N = 1000 .$

5.1.2. Simulated Prediction Performance for Marginal FP Totals

Note that as the FP total predictors given by (36) and (38) depend on the sample

(s_{1}^{*})

-based estimates

{\hat{β}}_{(t), N}

(31) for the SP regression parameters

β_{(t)} \equiv β

for

t = 1, \dots, T,

it is therefore important to examine the performance of this estimator

{\hat{β}}_{(t), N}

in estimating the FP parameter

β_{(t), N}

given by (27) corresponding to

β_{(t)} .

Furthermore, as

β_{(t), N}

in (27) depends on the auto-correlation structure

R (ρ)

in (26) up to time

t,

for

t = 1, \dots, T,

we first exploit the FP generated by (15)–(17) to compute the FP regression parameters

β_{(t), N}

by (27), as well as the FP lag correlation parameters

ρ_{ℓ, N}

by (29). These FP parameters

β_{(t), N}

corresponding to SP parameter

β,

and the FP correlation parameters

ρ_{ℓ, N}

corresponding to SP correlation parameters

ρ_{ℓ}

for

t = 1, \dots, 4,

under all three auto-correlations, namely, the AR(1), MA(1), and EQC models, are displayed in the upper halves of Table 1, Table 2, and Table 3, respectively. For example, Table 1, based on the AR(1) model, shows four-dimensional (using four covariates)

β_{(t), N}

values at

t = 4

as

β_{(4), N} \equiv [1.0491, 0.5007, 0.1581, 0.4468]

corresponding to the SP regression parameters

β \equiv [1.0, 0.5, 0.2, 0.5] .

Similarly, FP and SP lag correlations may be interpreted. Next, because the sample

s_{1}^{*}

-based (see step 5 in the last sub-section) estimation of

β

amounts to the estimation of

β_{(t), N},

the final estimates

{\hat{β}}_{(t), N}

computed by (31) (which is a simulation average based on 25 repetitions) under the AR(1), MA(1), and EQC processes are displayed in the lower halves of Table 1, Table 2, and Table 3, respectively. For example, the aforementioned

β_{(4), N}

under the AR(1) process are estimated as

{\hat{β}}_{(4), N} \equiv [1.026, 0.514, 0.134, 0.394]

which, along with other similar estimates from Table 2 and Table 3, appear to perform well, reflecting their design unbiasedness for the FP regression parameters

β_{(t), N} .

Next, as indicated in Step 6 in the last sub-section, the aforementioned sample-based regression estimates

{\hat{β}}_{(t), N}

are used to compute the design-assisted model-based (DAMB), as well as design cum model-based (DCMB), marginal (at a given time t) total predictors

{\hat{τ}}_{(t)}^{*}

(37) and

{\tilde{τ}}_{(t)}^{*}

(39), for predicting/estimating the marginal total

τ_{(t)} = \sum_{i = 1}^{N} y_{i t} .

The results from Table 4 show that these two predictors appear to perform almost the same, and the estimates are very close to the true FP totals. For example, under the AR(1) process (displayed in the extreme left block), the FP total

τ_{(t)}

at time

t = 2,

i.e.,

τ_{(2)} = 2574.4

is predicted by the DAMB predictor as

{\hat{τ}}_{(2)}^{*} = 2559.1

with a PRB of

27.3,

and it is predicted by the DCMB predictor as

{\tilde{τ}}_{(2)}^{*} = 2559.1

with a slightly different PRB of

27.6 .

Thus, they perform almost the same for the targeted prediction. Their equivalent performances under the MA(1) and EQC structures can be interpreted similarly.

5.2. Simulation Study 2: Prediction Performance Using Cluster-Based Longitudinal Survey Sample

In the last section, we examined the prediction performances using the individual-based longitudinal survey sample. As a generalization, we now examine the prediction behavior of the cluster-based FP total predictors developed in (56)–(61) over the longitudinal period of the study. Notice that, as opposed to the individual-based FP

(F_{1})

(1), a cluster-based FP

(F_{2})

is given by (5) with its SP model

S_{2}

given by (40)–(42). The

N_{c}

individuals under the cth cluster are now correlated with the cluster correlation coefficient

ϕ = σ_{γ}^{2} / σ^{2}

defined in (45). As an illustration in the present simulation, we have generated a cluster-correlated population and, hence, the sample with

ϕ = 0.33,

for example. As far as longitudinal correlations are concerned, we use the widely used AR(1) structure with

ρ = 0.5

leading to 3 lag-dependent correlation coefficients

ρ_{1} = 0.50, ρ_{2} = 0.25, ρ_{3} = 0.125,

under total time period

T = 4 .
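The lag correlations quoted above follow from the AR(1) rule $ρ_{ℓ} = ρ^{ℓ}$. The short sketch below reproduces them and builds the implied $4 \times 4$ correlation matrix; the function names are illustrative only and do not come from the paper.

```python
def ar1_lag_correlations(rho, T):
    """Return [rho_1, ..., rho_{T-1}], where rho_l = rho**l under AR(1)."""
    return [rho ** lag for lag in range(1, T)]


def ar1_correlation_matrix(rho, T):
    """Return the T x T matrix whose (u, t) entry is rho**|u - t|."""
    return [[rho ** abs(u - t) for t in range(T)] for u in range(T)]


lags = ar1_lag_correlations(0.5, 4)
print(lags)  # [0.5, 0.25, 0.125]

for row in ar1_correlation_matrix(0.5, 4):
    print(row)
```

The same two functions give the super-population lag correlations of Table 1 when called with `rho = 0.7`, up to rounding.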

We now construct our hypothetical cluster-based $F_{2}$ as follows.

Let $F_{2}$ consist of $K = 275$ independent clusters/families, labeled in sequence from 1 to 275. Following the notation in (40), we consider four different family structures (FS1 to FS4) with family/cluster sizes $(N_{c})$ as follows:
FS1: $N_{c} = 4, c = 1, \dots, 175,$ each with 2 parents (say, father (F) and mother (M)) and 2 children (C1 and C2);
FS2: $N_{c} = 3, c = 176, \dots, 225,$ each with 2 parents (F and M) and 1 child (C1);
FS3: $N_{c} = 3, c = 226, \dots, 250,$ each with 1 parent (F) and 2 children (C1 and C2);
FS4: $N_{c} = 3, c = 251, \dots, 275,$ each with 1 parent (M) and 2 children (C1 and C2).
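As a quick consistency check on the four family structures listed above, the following sketch builds the list of all (cluster, member) pairs and counts clusters and individuals. The tuple layout and the name `build_frame` are assumptions made for illustration.

```python
# (label, first cluster, last cluster, cluster size N_c, member roles)
FAMILY_STRUCTURES = [
    ("FS1", 1, 175, 4, ("F", "M", "C1", "C2")),
    ("FS2", 176, 225, 3, ("F", "M", "C1")),
    ("FS3", 226, 250, 3, ("F", "C1", "C2")),
    ("FS4", 251, 275, 3, ("M", "C1", "C2")),
]


def build_frame(structures):
    """Return one (cluster id, structure label, member role) row per individual."""
    frame = []
    for label, first, last, size, roles in structures:
        assert size == len(roles)  # N_c must match the listed members
        for c in range(first, last + 1):
            for role in roles:
                frame.append((c, label, role))
    return frame


frame = build_frame(FAMILY_STRUCTURES)
K = len({c for c, _, _ in frame})  # number of clusters
N = len(frame)                     # number of individuals
print(K, N)  # 275 1000
```

The count $175 \times 4 + 50 \times 3 + 25 \times 3 + 25 \times 3 = 1000$ agrees with the FP size used later in this section.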

As far as the covariates are concerned, we consider three covariates, namely age $(x_{c i t, 1} \equiv x_{1})$, smoking status of the individual member at the initial time point $(x_{c i t, 2} \equiv x_{2})$, and gender $(x_{c i t, 3} \equiv x_{3})$. We explain below how we generated the covariates under FS1. The covariates under the remaining structures FS2–FS4 are generated similarly.

1. Generation of $x_{1}$ for FS1: (a) For the father’s age, we generated 175 ages from a uniform distribution with range 50–60. (b) For the mother’s age, in sequence (following the father’s label), we drew one age difference value, say $d_{a}$, at random from the seven age difference indicators $d_{a} \in [-2, -1, 0, 1, 2, 3, 4]$, and computed the mother’s age as $x_{1}(M) = x_{1}(F) - d_{a}$. (c) For the age of C1, we used $x_{1}(C1) = \min\{x_{1}(F), x_{1}(M)\} - d_{a}$, where $d_{a}$ is now a value selected at random from the set of age differences $[20, 21, 22, 23, 24, 25]$. (d) For the age of C2 (corresponding to C1), we used $x_{1}(C2) = x_{1}(C1) - d_{a}$, with $d_{a}$ selected at random from the set of age differences $[1, 2, 3, 4]$.
2. Generation of $x_{2}$ for FS1: Smoking status for each member was generated from the binary distribution with smoking probability $π$, say. More specifically, we used $x_{2}(F) \sim bin(π = 0.5)$; $x_{2}(M) \sim bin(π = 0.5)$; $x_{2}(C1) \sim bin(π = 0.1)$; $x_{2}(C2) \sim bin(π = 0.05)$.
3. Generation of $x_{3}$ for FS1: We set $x_{3}(F) = 1.0$ and $x_{3}(M) = 0$, and to determine gender for both C1 and C2, we used $x_{3}(C1) \equiv x_{3}(C2) \sim bin(π = 0.5)$.
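The three covariate-generation steps for FS1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes continuous uniform ages, equal selection probabilities for the age differences, and independent gender draws for C1 and C2, none of which is stated explicitly in the text. The names `bernoulli` and `generate_fs1_family` are hypothetical.

```python
import random


def bernoulli(p, rng):
    """Return 1 with probability p and 0 otherwise."""
    return 1 if rng.random() < p else 0


def generate_fs1_family(rng):
    """Covariates (age x1, smoking x2, gender x3) for one FS1 family."""
    # Step 1: ages, each derived from the previous family member's age
    age_f = rng.uniform(50, 60)  # assumption: continuous uniform
    age_m = age_f - rng.choice([-2, -1, 0, 1, 2, 3, 4])
    age_c1 = min(age_f, age_m) - rng.choice([20, 21, 22, 23, 24, 25])
    age_c2 = age_c1 - rng.choice([1, 2, 3, 4])
    # Steps 2 and 3: smoking status and gender (father 1, mother 0)
    return {
        "F": {"age": age_f, "smoker": bernoulli(0.5, rng), "gender": 1},
        "M": {"age": age_m, "smoker": bernoulli(0.5, rng), "gender": 0},
        # assumption: children's genders are drawn independently
        "C1": {"age": age_c1, "smoker": bernoulli(0.1, rng),
               "gender": bernoulli(0.5, rng)},
        "C2": {"age": age_c2, "smoker": bernoulli(0.05, rng),
               "gender": bernoulli(0.5, rng)},
    }


rng = random.Random(2025)  # arbitrary seed for reproducibility
fs1_families = [generate_fs1_family(rng) for _ in range(175)]
print(fs1_families[0])
```

FS2–FS4 would reuse the same steps with the absent family member dropped.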

Based on the aforementioned covariate values and using their effects as $β_{0}$ (intercept) $= 1.0$, $β_{1}$ (age effect) $= 0.5$, $β_{2}$ (smoking effect) $= 0.2$, and $β_{3}$ (gender effect) $= 0.5$, and further using the cluster correlation $ϕ = 0.33$ and the AR(1) longitudinal correlation process with $ρ = 0.5$, the $F_{2}$ responses $\{y_{c i t}\}$, namely body mass index (bmi) over a longitudinal period $T = 4$ (equivalent to, say, 2 years), were generated using the SP $(S_{2})$ correlation model (40). Before one can examine the prediction performance of the DAMB (56)–(58) and DCMU (59)–(61) marginal predictors at a given time, it is necessary to first compute the FP data-based regression estimates (FPRE) and then the survey sample $(s_{2}^{*})$-based regression estimates (SSRE), obtained from (51)–(53), after accommodating the cluster correlations (indexed by $ϕ$) and the AR(1) longitudinal correlations indexed by $ρ \; (\equiv ρ_{1}, ρ_{2}, ρ_{3})$.
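To make the response-generation step concrete, here is a hedged sketch of one way to simulate cluster-correlated AR(1) responses. Model (40) is not reproduced in this section, so the code assumes a linear mixed form $y_{cit} = x_{ci}^{\top} β + γ_{c} + ε_{cit}$ with normal errors, time-fixed covariates, a total variance $σ^{2} = 1$ split as $σ_{γ}^{2} = ϕ σ^{2}$ and $σ_{ε}^{2} = (1 - ϕ) σ^{2}$, and a stationary AR(1) recursion for $ε_{cit}$. All of these are assumptions, as is the function name.

```python
import math
import random

BETA = (1.0, 0.5, 0.2, 0.5)  # intercept, age, smoking, gender effects
PHI, RHO, T = 0.33, 0.5, 4   # cluster correlation, AR(1) parameter, periods
SIGMA2 = 1.0                 # assumption: total variance, not given in the text


def generate_cluster_responses(members, rng):
    """members: list of (x1, x2, x3); returns T responses per member."""
    sd_gamma = math.sqrt(PHI * SIGMA2)        # cluster random-effect s.d.
    sd_eps = math.sqrt((1.0 - PHI) * SIGMA2)  # within-member error s.d.
    gamma = rng.gauss(0.0, sd_gamma)          # shared by the whole cluster
    responses = []
    for x1, x2, x3 in members:
        mean = BETA[0] + BETA[1] * x1 + BETA[2] * x2 + BETA[3] * x3
        eps = rng.gauss(0.0, sd_eps)          # stationary start at t = 1
        series = [mean + gamma + eps]
        for _ in range(1, T):
            # AR(1) recursion that keeps the error variance constant
            eps = RHO * eps + math.sqrt(1.0 - RHO ** 2) * rng.gauss(0.0, sd_eps)
            series.append(mean + gamma + eps)
        responses.append(series)
    return responses


rng = random.Random(0)
y = generate_cluster_responses([(55.0, 1, 1), (53.0, 0, 0)], rng)
for series in y:
    print([round(v, 2) for v in series])
```

Under these assumptions, two members of the same cluster share $γ_{c}$, which induces the cluster correlation, while the recursion induces the decaying lag correlations within each member.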

As far as the SS $s_{2}^{*}$ (6) is concerned, we use SRSWOR and choose 32 families from the 175 families under FS1; 12 families from the 50 families under FS2; 6 families from the 25 families under FS3; and 6 families from the 25 families under FS4. Thus, altogether, we choose $k = 56$ clusters/families, with sample size $n = 200$, from the $K = 275$ clusters under the FP $F_{2}$ of size $N = 1000$.
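The selection of families by SRSWOR within each family structure can be sketched as below; the dictionary layout, function name, and seed are illustrative assumptions. The final counts confirm the stated $k = 56$ and $n = 200$.

```python
import random

# label: (first cluster, last cluster, cluster size N_c, families sampled)
STRATA = {
    "FS1": (1, 175, 4, 32),
    "FS2": (176, 225, 3, 12),
    "FS3": (226, 250, 3, 6),
    "FS4": (251, 275, 3, 6),
}


def draw_cluster_sample(strata, rng):
    """Draw families without replacement within each family structure."""
    sample = {}
    for label, (first, last, _size, k_h) in strata.items():
        # random.Random.sample selects k_h distinct cluster labels
        sample[label] = sorted(rng.sample(range(first, last + 1), k_h))
    return sample


rng = random.Random(2025)  # arbitrary seed
sample = draw_cluster_sample(STRATA, rng)
k = sum(len(ids) for ids in sample.values())
n = sum(len(ids) * STRATA[label][2] for label, ids in sample.items())
print(k, n)  # 56 200
```

Because whole families are selected, every member of a sampled cluster enters the sample, which is why $n = 32 \times 4 + (12 + 6 + 6) \times 3 = 200$.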

The FP-based estimates of the SP parameters (i.e., the FP correlation and regression parameters), obtained from the estimating Equations (49) (for the cluster correlation $ϕ$), (48) (for the longitudinal correlations), and (46) (for the regression parameters) from Section 4.2, and their corresponding sample $(s_{2}^{*})$-based estimates, computed by solving the SS-based estimating Equations (51)–(53) from Section 4.3, are provided in Table 5. Sampling was repeated 25 times to compute the sample-based parameter estimates. All sample-based estimates appear to be close to the FP-based estimates. In general, the cluster-correlation estimates appear to work well when T is small, as more clusters cause more variation over the longitudinal period. However, the main regression parameter estimates, shown in the bottom rectangular box in Table 5, are not negatively affected by this slight difference in correlation estimates.

Table 5. AR(1) $(ρ = 0.5)$ correlation structure-based $F_{2}$ regression parameters and their SWGLS (sampling weighted GLS) estimates, along with standard errors (given in parentheses), using 25 samples $(s_{2}^{*})$, each of size $n = 200$, chosen from the cluster-based $F_{2}$ of size $N = 1000$.

Finally, the regression parameter estimates from Table 5, both FP- and sample-based, are used to compute the DAMB and DCMU prediction estimates by using (57) and (60), respectively. These estimates, along with the actual FP totals at all time points, are displayed in Table 6. The SSRE-based DCMU predictions appear to have a smaller PRB (percentage relative bias) than the SSRE-based DAMB predictions for $t = 1, 2,$ and 3, showing the relative superiority of the DCMU predictions over the DAMB predictions in the single-stage cluster setup.

Table 6. Design-assisted model-based $({\hat{τ}}_{(t)}^{*})$ and design cum model unbiased $({\tilde{τ}}_{(t)}^{*})$ predictions, along with their standard errors (given in parentheses) and percentage relative absolute biases [given in square brackets], for the $F_{2}$ totals $(τ_{(t)})$ over time $t = 1, \dots, 4$, using 25 samples $(s_{2}^{*})$, each of size $n = 200$, chosen from the cluster-based $F_{2}$ of size $N = 1000$ under the AR(1) model with $ρ = 0.5$.

6. Discussion and Concluding Remarks

In a finite population (FP) setup, the prediction of the FP total is a difficult problem, as one is required to predict the non-sampled response total well, which is customarily performed by replacing the non-sampled total with its model-based expectation estimate computed from a survey sample. Following the estimating function approach [21] for independent data, the super-population (SP) model-based (for FP data) regression parameters involved in the model-based expectation (equivalently, in the prediction function) may be estimated from the survey sample using a sampling weighted OLS (ordinary least squares) (SWOLS) estimator. This SWOLS estimator is DCMU (design cum model unbiased) for the SP regression parameter and is DU (design unbiased) for the FP parameter [20] corresponding to the SP parameter. In this paper, however, we have considered an FP with independent individuals or households in which each individual or cluster/household member provides a set of longitudinally correlated responses, the members of a household being structurally cluster correlated. Clearly, as opposed to an SP regression model for independent data, the proposed setup requires an SP correlation model, more specifically, longitudinal correlation and combined cluster–longitudinal correlation models. Some existing studies use a so-called ‘working’ correlation model (e.g., [7] (Section 7.4), [8]), for example, an unstructured or standard Pearson correlation model, whereas [5] used a random effects-based mixed model to accommodate the longitudinal correlations. However, as explained in Section 3.2, these models fail to accommodate the time effects on the correlations. More specifically, they fail to produce a correlation structure in which the correlations decay as the time lag between two repeated responses increases. As a remedy, we have considered lag-based correlation models following [11] (Section 3), for example, to incorporate the time effects on the correlations.
Also, in a cluster-based longitudinal setup, we have generalized this lag-based correlation model to a dynamic mixed-model setup where, conditionally, on the cluster random effects, the repeated responses follow a lag-based correlation structure.
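The customary prediction step recalled in this paragraph (estimate the regression parameters by SWOLS, then replace the non-sampled total by its fitted model expectation) can be illustrated for the simplest case of independent data at a single time point. The sketch below is generic and is not the SWGLS estimator or the predictors (37) and (39) of this paper; it assumes one covariate, SRSWOR weights $w_{i} = N / n$, and synthetic data, and the function names are hypothetical.

```python
import random


def swols_simple(x, y, w):
    """Sampling weighted OLS for y = b0 + b1 * x with design weights w."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b1 = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    b0 = (swy - b1 * swx) / sw
    return b0, b1


def predict_total(x_pop, y_sample, sample_idx, b0, b1):
    """Observed sample total plus fitted means for the non-sampled units."""
    in_sample = set(sample_idx)
    fitted_rest = sum(b0 + b1 * x_pop[i]
                      for i in range(len(x_pop)) if i not in in_sample)
    return sum(y_sample) + fitted_rest


rng = random.Random(7)  # arbitrary seed
N, n = 1000, 100
x_pop = [rng.uniform(20, 60) for _ in range(N)]         # synthetic covariate
y_pop = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in x_pop]  # synthetic response
idx = rng.sample(range(N), n)                           # SRSWOR sample
x_s = [x_pop[i] for i in idx]
y_s = [y_pop[i] for i in idx]
b0, b1 = swols_simple(x_s, y_s, [N / n] * n)
tau_hat = predict_total(x_pop, y_s, idx, b0, b1)
tau = sum(y_pop)
print(round(tau, 1), round(tau_hat, 1))
```

With equal weights, as under SRSWOR, SWOLS coincides with ordinary least squares; the weights matter once the inclusion probabilities differ across units.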

The aforementioned correlation structures and their sample-based estimates are discussed in detail, and it is demonstrated how to obtain DCMU (design cum model unbiased) estimators, namely the SWGLS (sampling weighted GLS) estimators, for the regression parameters after accommodating the longitudinal or cluster–longitudinal correlations. Subsequently, these DCMU regression estimators are used to develop the design-assisted model-based (DAMB) and design cum model-based (DCMB) prediction functions. The relative performance of the DAMB and DCMB predictors for FP total estimation is also examined both theoretically and numerically.

In conclusion, this paper advances longitudinal survey data analysis for both individual-based and cluster-based FPs. More specifically, it is demonstrated that the DCMU estimation of the parameters involved in a prediction function provides DCMU-valid prediction for the FP total. The step-by-step development of the estimation and prediction methods should be useful to practitioners from statistical agencies such as Statistics and Health, Canada, and the Bureau of Statistics, USA, or similar organizations in other countries.

Author Contributions

Conceptualization, B.C.S.; Methodology, B.C.S.; Software, A.M.V.; Formal analysis, A.M.V.; Investigation, A.M.V. and B.C.S.; Writing—Original draft, A.M.V. and B.C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors would like to thank three reviewers for their comments and suggestions that led to the improvement of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Binder, D. Longitudinal surveys: Why are these surveys different from all other surveys? Surv. Methodol. 1998, 24, 101–108.
2. Lynn, P. Methods for longitudinal surveys. In Methodology of Longitudinal Surveys; Lynn, P., Ed.; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 1–18.
3. Smith, P.W.F.; Berrington, A.; Sturgis, P. A comparison of graphical models and structural equation models for the analysis of longitudinal survey data. In Methodology of Longitudinal Surveys; Lynn, P., Ed.; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 381–391.
4. Thompson, M.E. Using longitudinal complex survey data. Annu. Rev. Stat. Appl. 2015, 2, 305–320.
5. Skinner, C.J.; de Toledo Vieira, M. Variance estimation in the analysis of clustered longitudinal survey data. Surv. Methodol. 2007, 33, 3–12.
6. Sutradhar, B.C. Dynamic Mixed Models for Familial Longitudinal Data; Springer: New York, NY, USA, 2011.
7. Wu, C.; Thompson, M.E. Sampling Theory and Practice; Springer Nature: Cham, Switzerland, 2020.
8. Liang, K.Y.; Zeger, S.L. Longitudinal data analysis using generalized linear models. Biometrika 1986, 73, 13–22.
9. Roberts, G.; Ren, Q.; Rao, J.N.K. Using marginal mean models for data from longitudinal surveys with a complex design: Some advances in methods. In Methodology of Longitudinal Surveys; Lynn, P., Ed.; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 351–366.
10. Sutradhar, B.C. Longitudinal Categorical Data Analysis; Springer: New York, NY, USA, 2014.
11. Sutradhar, B.C.; Das, K. On the efficiency of regression estimators in generalized linear models for longitudinal data. Biometrika 1999, 86, 459–465.
12. Bellhouse, D.R. Model-based estimation in finite population sampling. Am. Stat. 1987, 41, 260–262.
13. Prasad, N.G.N.; Rao, J.N.K. The estimation of the mean squared error of small-area estimators. J. Am. Stat. Assoc. 1990, 85, 163–171.
14. Valliant, R.; Dorfman, A.H.; Royall, R.M. Finite Population Sampling and Inference: A Prediction Approach; John Wiley and Sons, Inc.: New York, NY, USA, 2000.
15. Melville, G.J.; Welsh, A.H. Model-based prediction in ecological surveys including those with incomplete detection. Aust. N. Z. J. Stat. 2014, 56, 257–281.
16. Sutradhar, B.C. Doubly weighted estimation approach for linear regression analysis with two-stage cluster samples. Sankhya B Indian J. Stat. 2024, 86, 55–90.
17. Kennel, T.L.; Valliant, R. Robust variance estimators for generalized regression estimators in cluster samples. Surv. Methodol. 2019, 45, 427–450.
18. Jowaheer, V.; Sutradhar, B.C. Analyzing longitudinal count data with overdispersion. Biometrika 2002, 89, 389–399.
19. Thall, P.F.; Vail, S.C. Some covariance models for longitudinal count data with overdispersion. Biometrics 1990, 46, 657–671.
20. Binder, D. On the variances of asymptotically normal estimators from complex surveys. Int. Stat. Rev. 1983, 51, 279–292.
21. Godambe, V.P.; Thompson, M.E. Parameters of super-population and survey population: Their relationships and estimation. Int. Stat. Rev. 1986, 54, 127–138.
22. Royall, R.M. The linear least-squares prediction approach to two-stage sampling. J. Am. Stat. Assoc. 1976, 71, 657–664.
23. Valliant, R. Generalized variance functions in stratified two-stage sampling. J. Am. Stat. Assoc. 1987, 82, 499–508.
24. Isaki, C.T.; Fuller, W.A. Survey design under the regression super-population model. J. Am. Stat. Assoc. 1982, 77, 89–96.
25. Scott, A.J.; Holt, D. The effect of two-stage sampling on ordinary least squares methods. J. Am. Stat. Assoc. 1982, 77, 848–854.

Table 1. AR(1) $(ρ = 0.7)$ correlation structure-based $F_{1}$ regression parameters and their SWGLS (sampling weighted GLS) estimates, along with standard errors (given in parentheses), using 25 samples $(s_{1}^{*})$, each of size $n = 100$, chosen from the $F_{1}$ of size $N = 1000$.

	Super-Population Regression Parameters				Super-Population Lag Correlation
Time	$β_{0}$	$β_{1}$	$β_{2}$	$β_{3}$	$ρ_{1}$	$ρ_{2}$	$ρ_{3}$
-	1	0.5	0.2	0.5	0.70	0.49	0.34
	Finite Population Regression Parameters				Finite Population Lag Correlation
Time	$β_{0, N}$	$β_{1, N}$	$β_{2, N}$	$β_{3, N}$	$ρ_{1, N}$	$ρ_{2, N}$	$ρ_{3, N}$
4	1.0491	0.5007	0.1581	0.4468	0.7082	0.4924	0.3509
3	1.0402	0.5059	0.1573	0.4246	0.6972	0.4908	-
2	1.0252	0.5038	0.1906	0.4418	0.6840	-	-
1	0.9902	0.5098	0.2246	0.4712	-	-	-
	Sample Estimate of Regression Parameters				Sample Estimate of Lag Correlation
Time	${\hat{β}}_{0, N}$	${\hat{β}}_{1, N}$	${\hat{β}}_{2, N}$	${\hat{β}}_{3, N}$	${\hat{ρ}}_{1, N}$	${\hat{ρ}}_{2, N}$	${\hat{ρ}}_{3, N}$
4	1.0262	0.5142	0.1336	0.3941	0.6827	0.4745	0.3265
	(0.1980)	(0.0679)	(0.1396)	(0.2120)	(0.0391)	(0.0534)	(0.0519)
3	1.0254	0.5171	0.1328	0.3568	0.6785	0.4760	-
	(0.1965)	(0.0700)	(0.1500)	(0.2087)	(0.0475)	(0.0635)	-
2	1.0073	0.5139	0.1737	0.3785	0.6697	-	-
	(0.1889)	(0.0715)	(0.1568)	(0.2082)	(0.0472)	-	-
1	0.9610	0.5239	0.2062	0.4323	-	-	-
	(0.1789)	(0.0673)	(0.1708)	(0.2244)	-	-	-

Table 2. MA(1) $(ρ = 0.4)$ correlation structure-based $F_{1}$ regression parameters and their SWGLS (sampling weighted GLS) estimates, along with standard errors (given in parentheses), using 25 samples $(s_{1}^{*})$, each of size $n = 100$, chosen from the $F_{1}$ of size $N = 1000$.

	Super-Population Regression Parameters				Super-Population Lag Correlation
Time	$β_{0}$	$β_{1}$	$β_{2}$	$β_{3}$	$ρ_{1}$	$ρ_{2}$	$ρ_{3}$
-	1	0.5	0.2	0.5	0.4	0	0
	Finite Population Regression Parameters				Finite Population Lag Correlation
Time	$β_{0, N}$	$β_{1, N}$	$β_{2, N}$	$β_{3, N}$	$ρ_{1, N}$	$ρ_{2, N}$	$ρ_{3, N}$
4	0.9595	0.4987	0.2387	0.5495	0.4103	0.0061	0.0293
3	0.9793	0.4955	0.2136	0.5438	0.3815	−0.0291	-
2	0.9904	0.4923	0.2057	0.5693	0.3643	-	-
1	1.0011	0.4742	0.2573	0.6486	-	-	-
	Sample Estimate of Regression Parameters				Sample Estimate of Lag Correlation
Time	${\hat{β}}_{0, N}$	${\hat{β}}_{1, N}$	${\hat{β}}_{2, N}$	${\hat{β}}_{3, N}$	${\hat{ρ}}_{1, N}$	${\hat{ρ}}_{2, N}$	${\hat{ρ}}_{3, N}$
4	0.9752	0.4891	0.2566	0.5945	0.3824	−0.0147	−0.0182
	(0.1570)	(0.0535)	(0.1092)	(0.1617)	(0.0459)	(0.0734)	(0.1011)
3	0.9991	0.4846	0.2308	0.5887	0.3560	−0.0557	-
	(0.1482)	(0.0555)	(0.1242)	(0.1702)	(0.0651)	(0.1165)	-
2	0.9927	0.4886	0.2162	0.6356	0.3229	-	-
	(0.1728)	(0.0624)	(0.1482)	(0.1788)	(0.0868)	-	-
1	0.9669	0.4795	0.2742	0.7516	-	-	-
	(0.2668)	(0.0762)	(0.2120)	(0.2339)	-	-	-

Table 3. EQ $(ρ = 0.4)$ correlation structure-based $F_{1}$ regression parameters and their SWGLS (sampling weighted GLS) estimates, along with standard errors (given in parentheses), using 25 samples $(s_{1}^{*})$, each of size $n = 100$, chosen from the $F_{1}$ of size $N = 1000$.

	Super-Population Regression Parameters				Super-Population Lag Correlation
Time	$β_{0}$	$β_{1}$	$β_{2}$	$β_{3}$	$ρ_{1}$	$ρ_{2}$	$ρ_{3}$
-	1	0.5	0.2	0.5	0.4	0.4	0.4
	Finite Population Regression Parameters				Finite Population Lag Correlation
Time	$β_{0, N}$	$β_{1, N}$	$β_{2, N}$	$β_{3, N}$	$ρ_{1, N}$	$ρ_{2, N}$	$ρ_{3, N}$
4	1.0443	0.5007	0.1616	0.4517	0.4253	0.3960	0.3668
3	1.0560	0.4949	0.1606	0.4730	0.4065	0.4154	-
2	1.0354	0.4928	0.2023	0.5156	0.4063	-	-
1	1.0532	0.4751	0.2373	0.5441	-	-	-
	Sample Estimate of Regression Parameters				Sample Estimate of Lag Correlation
Time	${\hat{β}}_{0, N}$	${\hat{β}}_{1, N}$	${\hat{β}}_{2, N}$	${\hat{β}}_{3, N}$	${\hat{ρ}}_{1, N}$	${\hat{ρ}}_{2, N}$	${\hat{ρ}}_{3, N}$
4	1.0241	0.5128	0.1393	0.4038	0.3971	0.3825	0.3252
	(0.1780)	(0.0609)	(0.1253)	(0.1911)	(0.0501)	(0.0690)	(0.0950)
3	1.0265	0.5097	0.1382	0.4395	0.3977	0.3999	-
	(0.2047)	(0.0648)	(0.1316)	(0.2128)	(0.0615)	(0.0955)	-
2	0.9889	0.5121	0.1827	0.5067	0.3872	-	-
	(0.1929)	(0.0609)	(0.1524)	(0.2238)	(0.0752)	-	-
1	1.0037	0.4899	0.2363	0.5340	-	-	-
	(0.2775)	(0.0907)	(0.1673)	(0.2545)	-	-	-

Table 4. Design-assisted model-based $({\hat{τ}}_{(t)}^{*})$ and design cum model unbiased $({\tilde{τ}}_{(t)}^{*})$ predictions, along with their standard errors (given in parentheses) and percentage relative absolute biases [given in square brackets], for the $F_{1}$ totals $(τ_{(t)})$ over time $t = 1, \dots, 4$, using 25 samples $(s_{1}^{*})$, each of size $n = 100$, chosen from the $F_{1}$ of size $N = 1000$.

	AR (0.7)			MA (0.4)			EQ (0.4)
Time	${\hat{τ}}_{(t)}^{*}$	${\tilde{τ}}_{(t)}^{*}$	$τ_{(t)}$	${\hat{τ}}_{(t)}^{*}$	${\tilde{τ}}_{(t)}^{*}$	$τ_{(t)}$	${\hat{τ}}_{(t)}^{*}$	${\tilde{τ}}_{(t)}^{*}$	$τ_{(t)}$
1	2580.2	2579.9	2589.4	2564.5	2563.9	2555.9	2567.5	2567.1	2579.6
	(54.4)	(56.5)	-	(72.9)	(72.0)	-	(55.8)	(55.1)	-
	[16.9]	[16.8]	-	[11.8]	[11.1]	-	[21.7]	[22.7]	-
2	2559.1	2559.1	2574.4	2551.9	2551.8	2539.8	2581.3	2581.0	2585.6
	(56.0)	(55.4)	-	(48.6)	(47.0)	-	(54.4)	(56.6)	-
	[27.3]	[27.6]	-	[24.9]	[25.5]	-	[7.9]	[8.1]	-
3	2562.1	2562.1	2573.0	2548.3	2548.3	2543.4	2549.3	2549.4	2561.1
	(59.2)	(59.2)	-	(51.5)	(51.0)	-	(63.1)	(61.7)	-
	[18.4]	[18.4]	-	[9.5]	[9.6]	-	[18.7]	[18.9]	-
4	2567.0	2566.7	2576.2	2576.0	2575.5	2572.5	2568.2	2568.2	2578.9
	(60.2)	(59.9)	-	(67.8)	(69.2)	-	(64.0)	(64.8)	-
	[15.3]	[15.9]	-	[5.6]	[4.3]	-	[16.7]	[16.5]	-

Table 5. AR(1) $(ρ = 0.5)$ correlation structure-based $F_{2}$ regression parameters and their SWGLS (sampling weighted GLS) estimates, along with standard errors (given in parentheses), using 25 samples $(s_{2}^{*})$, each of size $n = 200$, chosen from the cluster-based $F_{2}$ of size $N = 1000$.

	Super-Population Reg. Parameters				Super-Population Corr
Time	$β_{0}$	$β_{1}$	$β_{2}$	$β_{3}$	$ρ_{1}$	$ρ_{2}$	$ρ_{3}$	$ϕ$
-	1	0.5	0.2	0.5	0.50	0.25	0.125	0.33
	Finite Population Reg. Parameters				Finite Population Corr
Time	$β_{0, N}$	$β_{1, N}$	$β_{2, N}$	$β_{3, N}$	$ρ_{1, N}$	$ρ_{2, N}$	$ρ_{3, N}$	$ϕ_{N}$
4	0.9647	0.5011	0.2452	0.4929	0.5168	0.2765	0.1747	0.3023
3	0.9659	0.5012	0.2215	0.4772	0.5134	0.2657	-	0.3081
2	1.0315	0.4998	0.2583	0.4442	0.4894	-	-	0.3350
1	0.9316	0.5025	0.2532	0.3469	-	-	-	0.3496
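	Sample Estimate of Reg. Parameters				Sample Estimate of Corr
Time	${\hat{β}}_{0, N}$	${\hat{β}}_{1, N}$	${\hat{β}}_{2, N}$	${\hat{β}}_{3, N}$	${\hat{ρ}}_{1, N}$	${\hat{ρ}}_{2, N}$	${\hat{ρ}}_{3, N}$	${\hat{ϕ}}_{N}$
4	1.0255	0.5006	0.2155	0.4471	0.5176	0.2881	0.1789	0.2560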
	(0.1878)	(0.0041)	(0.0806)	(0.1414)	(0.0421)	(0.0584)	(0.0798)	(0.0484)
3	1.0409	0.5007	0.1765	0.4360	0.4966	0.2635	-	0.2769
	(0.1941)	(0.0043)	(0.0958)	(0.1569)	(0.0318)	(0.0633)	-	(0.0407)
2	1.1103	0.4993	0.2265	0.3947	0.4745	-	-	0.2986
	(0.1753)	(0.0042)	(0.1064)	(0.1723)	(0.0465)	-	-	(0.0492)
1	0.9797	0.5028	0.2251	0.2934	-	-	-	0.3189
	(0.2279)	(0.0041)	(0.1270)	(0.1767)	-	-	-	(0.0583)

Table 6. Design-assisted model-based $({\hat{τ}}_{(t)}^{*})$ and design cum model unbiased $({\tilde{τ}}_{(t)}^{*})$ predictions, along with their standard errors (given in parentheses) and percentage relative absolute biases [given in square brackets], for the $F_{2}$ totals $(τ_{(t)})$ over time $t = 1, \dots, 4$, using 25 samples $(s_{2}^{*})$, each of size $n = 200$, chosen from the cluster-based $F_{2}$ of size $N = 1000$ under the AR(1) model with $ρ = 0.5$.

	Using SSRE		Using FPRE
Time	${\hat{τ}}_{(t)}^{*}$	${\tilde{τ}}_{(t)}^{*}$	${\hat{τ}}_{(t)}^{*}$	${\tilde{τ}}_{(t)}^{*}$	$τ$
1	22,403	22,405	22,377	22,380	22,371
	(109)	(126)	(22)	(54)	-
	[29.4]	[27.0]	[27.3]	[16.7]	-
2	22,432	22,434	22,410	22,413	22,408
	(91)	(110)	(18.6)	(52)	-
	[26.4]	[13.6]	[10.8]	[9.6]	-
3	22,345	22,349	22,346	22,348	22,368
	(105)	(129)	(19)	(56)	-
	[21.9]	[14.7]	[115]	[35.7]	-
4	22,403	22,406	22,413	22,416	22,399
	(104)	(131)	(21)	(56)	-
	[3.8]	[5.3]	[66.7]	[30.4]	-

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
