Adaptive Regression Analysis of Heterogeneous Data Streams via Models with Dynamic Effects

Wei, Jianfeng; Yang, Jian; Cheng, Xuewen; Ding, Jie; Li, Shengquan

doi:10.3390/math11244899


by Jianfeng Wei ¹,†, Jian Yang ¹, Xuewen Cheng ¹, Jie Ding ²,*,† and Shengquan Li ¹,*

¹ Peng Cheng Laboratory, Shenzhen 518066, China

² School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2023, 11(24), 4899; https://doi.org/10.3390/math11244899

Submission received: 30 October 2023 / Revised: 22 November 2023 / Accepted: 5 December 2023 / Published: 7 December 2023

(This article belongs to the Special Issue Advances in Statistical Modeling)


Abstract: Streaming data sequences arise from various areas in the era of big data, and it is challenging to explore efficient online models that adapt to them. To address the potential heterogeneity, we introduce a new online estimation procedure to analyze the constantly incoming streaming datasets. The underlying model structures are assumed to be generalized linear models with dynamic regression coefficients. Our key idea lies in introducing a vector of unknown parameters to measure the differences between batch-specific regression coefficients from adjacent data blocks. This is followed by the use of the adaptive lasso penalization methodology to accurately select nonzero components, which indicate the existence of dynamic coefficients. We provide detailed derivations to demonstrate how our proposed method not only fits within the online updating framework, in which the old estimator is recursively replaced with a new one based solely on the current individual-level samples and historical summary statistics, but also adaptively avoids undesirable estimation biases coming from potential changes in the model parameters of interest. Computational issues are also discussed in detail to facilitate implementation. Its practical performance is demonstrated through both extensive simulations and a real case study. In summary, we contribute a novel online method that efficiently adapts to streaming data environments, addresses potential heterogeneity, and mitigates estimation biases from changes in coefficients.

Keywords:

adaptive lasso; data streams; dynamic coefficients; online estimation; regression analysis

MSC:

62-08

1. Introduction

As the digital age progresses, big data has become the new currency, generating more information than ever before. In regression analysis, new challenges arise when a sequence of data batches arrives constantly, that is, in an online manner; such data have become a core component of big data fields [1]. If the streaming characteristic is ignored, researchers typically need to load the entire set of historical individual-level observations at each stage of statistical analysis. In a streaming environment, however, practitioners must avoid storing previous data and instead update the model estimates using only the current individual-level data and historical summary statistics. Approaches that meet this demand are referred to as online methods. A detailed comparison of the key mechanisms for batch processing and streaming processing can be found in Figure 1. We clarify that, in the context of this paper, the term “summary statistics” refers to the estimated parameters up to the previous batches. This usage is consistent with the existing literature on online stream data analysis, such as Luo and Song [2], which will be mentioned later. We acknowledge, however, that this term traditionally refers to measures such as the mean, variance, and skewness. Our usage emphasizes that online methods do not require historical raw data but rather rely on aggregated quantities derived from them.

The generalized linear model (GLM) [3] is one of the most widely used models for examining how covariates affect the responses of interest, and it includes several specific models as special cases, such as the linear model, the logistic model, and the Poisson model, among many others. Recently, many statistical techniques have been proposed to analyze streaming data sequences within the GLM framework. These techniques can be grouped into several main categories. The first is the commonly used stochastic gradient descent algorithm, which has been investigated and applied to the online scenarios we are interested in [4,5,6,7]. The second comprises the cumulative estimating equation estimator for linear models and the cumulatively updating estimating equation estimator for non-linear models [8]. Additionally, a renewable estimation and inference procedure for the GLM has been developed [2], and its applications in analyzing streaming data with clustering [9] or other more general phenomena have also been discussed in the literature [10,11,12,13,14,15,16,17].

Nevertheless, most of the existing methods for analyzing streaming datasets are designed by assuming that all sequentially available data batches are sampled from the same pre-specified model structure with common unknown parameters. Since this widely adopted assumption for streaming data might be violated in many real-world applications, it is important to consider heterogeneity or data drift, which means that underlying model specifications are likely to be changed over time [18]. Directly applying the online methods that ignore this issue may invalidate the model accuracy for the current data batch due to the potentially incorrect estimates of the parameters of interest. Therefore, it is crucial to extend the existing online estimation procedures, originally designed for common model structures, to address the challenge of heterogeneity. We are grateful to an anonymous reviewer for pointing out that our setup shares similarities with random coefficient models, as discussed in Klein [19] and Hsiao [20], and with the regime-switching models in the field of time series, as described in Hamilton [21]. We appreciate the diverse methods of estimating random coefficient models, including Bayesian updating and estimation with all data, and our proposed method offers a valuable addition to this field. Our method, from the perspective of the frequentist, is suited for analyzing streaming datasets, a modern context that motivates our research.

As a motivating example, we focus on the publicly accessible airline on-time statistics database. Since October 1987, it has continuously recorded arrival and departure details for all commercial flights within the USA. Further details will be elaborated in Section 4.3. A preliminary analysis is conducted there by fitting a series of logistic models to all data batches collected by year separately, in which the estimates at each stage should be considered valid since they are not affected by the potentially different previous observations. We find that the patterns of the regression coefficient estimates changed dramatically, which implies significant population heterogeneity among individuals from various data batches.

Our main contributions in this paper are presented in this paragraph. We develop a novel approach that not only analyzes sequentially arrived data batches in a streaming manner but also adaptively avoids undesirable estimation biases arising from potential changes in the model parameters of interest. Specifically, we focus on the generalized linear model and extend it to fall within the online updating framework, in which the old estimator is recursively replaced with a new one based only on the current individual-level samples and historical summary statistics. To tackle the potential heterogeneity, our key idea lies in introducing a vector of unknown parameters to measure the differences between batch-specific regression coefficients from adjacent data blocks. Then, the adaptive lasso penalization methodology [22] is used to accurately select nonzero components. We show detailed derivations and discuss issues about computation to further facilitate the implementation of our approach, indicating that it can simultaneously enhance computing efficiency and reduce storage space. We also conduct extensive numerical studies to validate the practical performance of our newly proposed approach.

We structure the remaining parts of this article as follows. We provide a comprehensive summary of our assumed streaming framework and model setup in Section 2 and then propose an adaptive online estimation procedure for the corresponding dynamic regression coefficients in Section 3. Section 4 presents numerical comparisons based on both simulated data with various settings and real data concerning flight arrival and departure records. In Section 5, we summarize this paper with a discussion.

2. The Model Setup

Instead of loading the entire dataset all at once, as in the offline setting, we consider an online scenario in which observations become available sequentially in blocks. For the streaming environment we are interested in, we assume that a series of data batches arrives sequentially, and at the current time point b with

b \geq 1

, there are

n_{b}

independent and identically distributed copies, denoted by

O_{b} = {(Y_{b i}, X_{b i}) : i = 1, \dots, n_{b}},

from

(Y_{b}, X_{b})

, in which

Y_{b}

is a random variable that represents our response of interest and

X_{b}

is a vector of measured covariates whose dimension is p. Denote

N_{b} = \sum_{j = 1}^{b} n_{j}

, the cumulative sample size, and let

O_{b}^{*} = O_{1} \cup \dots \cup O_{b}

be the cumulative collection of datasets.

For

b \geq 1

, we assume that, conditional on

X_{b}

, the underlying response

Y_{b}

follows the generalized linear model [3], which assigns the conditional density function of

Y_{b}

given

X_{b}

to be formulated as

f_{b} (y; X_{b}, θ_{b}, ϕ_{b}) = exp \{\frac{y X_{b}^{T} θ_{b} - b (X_{b}^{T} θ_{b})}{a (ϕ_{b})} + c (y, ϕ_{b})\},

(1)

where

θ_{b} = {(θ_{b 1}, \dots, θ_{b p})}^{T}

is the corresponding bth vector of unknown batch-specific regression coefficients with dimension p,

ϕ_{b}

is a nuisance parameter, and

a (\cdot)

and

b (\cdot)

are known functions. In the Gaussian linear model, it is known that

ϕ_{b}

is the variance parameter, while in both the logistic and Poisson models,

ϕ_{b} = 1

instead. Note that we can also equivalently express (1) via

E (Y_{b} | X_{b}) = g (X_{b}^{T} θ_{b})

, where

g (\cdot)

is a known link function.
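To make the exponential-family form in (1) concrete, the following minimal Python sketch evaluates the log-density for the logistic and Poisson special cases. The function names (`glm_logpdf`, `b_logistic`, `b_poisson`) and the toy numbers are illustrative assumptions, not notation or data from the paper.

```python
import math
import numpy as np

def glm_logpdf(y, eta, b_fun, a_phi=1.0, c_val=0.0):
    """Log of density (1): (y * eta - b(eta)) / a(phi) + c(y, phi), eta = x^T theta."""
    return (y * eta - b_fun(eta)) / a_phi + c_val

def b_logistic(eta):
    # b(eta) = log(1 + e^eta), computed in a numerically stable way
    return np.logaddexp(0.0, eta)

def b_poisson(eta):
    # b(eta) = e^eta
    return np.exp(eta)

x = np.array([1.0, 0.5])
theta = np.array([0.2, -0.4])
eta = float(x @ theta)  # linear predictor, here 0.2 - 0.2 = 0

# Logistic model: a(phi) = 1 and c(y, phi) = 0, so P(Y = 1 | x) = expit(eta).
p1 = math.exp(glm_logpdf(1, eta, b_logistic))

# Poisson model: a(phi) = 1 and c(y, phi) = -log(y!); lgamma(4) = log(3!).
pois3 = math.exp(glm_logpdf(3, eta, b_poisson, c_val=-math.lgamma(4)))
```

With a zero linear predictor, `p1` is the Bernoulli probability 1/2 and `pois3` is the Poisson(1) probability of observing the value 3.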

Remark 1.

Although we suppose that samples in each data batch are independent and identically distributed, observations across different

{O_{b} : b \geq 1}

do not necessarily follow the same population distribution. More precisely, in view of our model setup presented in (1), potential heterogeneity is allowed to come from changes in regression coefficients. However, the underlying model structures for different data batches are the same, that is, a common generalized linear model. A similar strategy has been used in Luo and Song [18], although they only focus on linear models and impose stricter restrictions on the patterns of how coefficients change.

We aim at making valid estimates of

θ_{b}

for each currently arriving data batch

O_{b}

with

b \geq 1

. Generally speaking, for statistical analysis in the environment of streaming data with dynamic coefficients we are interested in, there are at least two challenges in practical application. The first one is that loading the whole dataset

O_{b}^{*}

is infeasible since the previous data

O_{b - 1}^{*}

is not available any more in an online setting; that is, we only have access to the current data batch

O_{b}

and a set of historical summary statistics, denoted by

H_{b - 1}

. Another one is that at each accumulation point b, it is possible that we have

θ_{b} \neq θ_{b - 1}

, meaning that

θ_{b} - θ_{b - 1}

has at least one nonzero component [18]. Due to the potential dynamic coefficients of the current data batch

O_{b}

, using the undesirable historical information

H_{b - 1}

directly might lead to incorrect estimates of the regression coefficients of interest at the latest time point. Therefore, any newly updated results or seeming efficiency gains should be treated with caution. In short, new adaptive online methods that can produce estimators tailored to heterogeneous streaming data driven by the generalized linear model with dynamic regression coefficients are desired.

3. Proposed Methodology

Our proposed adaptive online method, which can accommodate potential violations of homogeneity, is first presented in Section 3.1, along with a discussion of its advantages. A more detailed explanation of the underlying motivation and derivation is deferred to Section 3.2. Additionally, computational issues will be discussed in Section 3.3.

3.1. Online Estimation of Dynamic Coefficients

Before presenting our proposed approach, we first discuss two other estimators that will help us understand the advantages of the estimator we will propose shortly after. At the accumulation point b for

b \geq 1

, we define the following two estimators:

{\tilde{θ}}_{b} = arg max_{θ \in R^{p}} {\tilde{L}}_{b} (θ) \quad \text{and} \quad {\overset{ˇ}{θ}}_{b}^{*} = arg max_{θ \in R^{p}} {\overset{ˇ}{L}}_{b}^{*} (θ),

(2)

where for

b \geq 1

, we denote the logarithm of the likelihood function for observations

O_{b}

as follows:

{\tilde{L}}_{b} (θ) = \sum_{i = 1}^{n_{b}} log f_{b} (Y_{b i}; X_{b i}, θ, ϕ_{b}) = \sum_{i = 1}^{n_{b}} [\frac{Y_{b i} X_{b i}^{T} θ - b (X_{b i}^{T} θ)}{a (ϕ_{b})} + c (Y_{b i}, ϕ_{b})],

(3)

and the definition of the objective function

{\overset{ˇ}{L}}_{b}^{*} (θ)

used here is indeed the same as that of

{\tilde{L}}_{b} (θ)

, except that the summation is taken over all data samples in

O_{b}^{*}

. We can see that

{\tilde{θ}}_{b}

estimates the current batch-specific regression coefficient

θ_{b}

based on

O_{b}

only, meaning that it adapts to the online updating framework since its calculation does not need any historical individual-level observations in

O_{b - 1}^{*}

. By contrast, the offline estimator

{\overset{ˇ}{θ}}_{b}^{*}

needs all samples in

O_{b}^{*}

and is consequently infeasible in our setting of streaming data.
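As an illustration of the batch-only estimator in (2), the sketch below maximizes the log-likelihood (3) for the logistic model by Newton-Raphson, using the current batch alone. The function name `batch_mle_logistic`, the simulated data, and the iteration settings are assumptions made for this example.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_mle_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson maximizer of the batch log-likelihood (3), logistic case."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ theta)
        score = X.T @ (y - p)                         # first derivative of (3)
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # negative second derivative
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# One simulated batch O_b with an intercept and a single covariate (illustrative).
rng = np.random.default_rng(0)
theta_true = np.array([0.5, -1.0])
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
y = rng.binomial(1, expit(X @ theta_true))
theta_tilde = batch_mle_logistic(X, y)
```

Only `X` and `y` from the current batch enter the computation, which is why this estimator fits the online framework but cannot borrow strength from earlier batches.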

Remark 2.

It is evident that the estimator

{\tilde{θ}}_{b}

naturally avoids the underlying heterogeneity problem and maintains its consistency. Nevertheless,

{\tilde{θ}}_{b}

is potentially inefficient due to the absence of historical data

O_{b - 1}^{*}

. To see that, consider the most ideal scenario that all data batches come sequentially from the same mechanism of underlying population; that is,

θ_{b}

does not vary across data batches. In such cases, the efficiency of

{\tilde{θ}}_{b}

will certainly be lower than that of

{\overset{ˇ}{θ}}_{b}^{*}

. Due to these reasons, we term

{\tilde{θ}}_{b}

as the online conservative estimator, which can serve as a benchmark in comparison with the proposed method we will develop.

At each accumulation point b, we introduce a p-dimensional parameter vector to accommodate the increment between two adjacent batch-specific regression coefficients

δ_{b} = {(δ_{b 1}, \dots, δ_{b p})}^{T} = θ_{b} - θ_{b - 1},

(4)

for

b \geq 2

, which is also termed as the heterogeneous degree between

O_{b}

and

O_{b - 1}

in this article. Most of the existing methods in the context of online learning [2,8,9] ignore the potential heterogeneity; that is, they implicitly demanded that

δ_{b} = 0 for b \geq 2,

(5)

or equivalently,

θ_{1} = \dots = θ_{b}

, where

0

denotes a matrix consisting of zeros with suitable dimension. It is possible that the vector

δ_{b}

at the accumulation point b has nonzero components. Thus, adaptively detecting the sparsity structure in

δ_{b}

could be an effective way to validate the heterogeneity assumption.

We next provide the definition of our proposed adaptive online estimation methodology without delving into excessive explanation. Details of our motivation and derivation will be presented later in Section 3.2. At the accumulation point b with

b \geq 1

, our proposed online estimator

{\hat{θ}}_{b}

for the underlying dynamic covariate effect is recursively defined as the maximizer

{({\hat{θ}}_{b}^{T}, {\hat{δ}}_{b}^{T})}^{T} = arg max_{θ, δ \in R^{p}} {\hat{L}}_{b} (θ, δ),

where we denote the function with respect to

θ

and

δ

{\hat{L}}_{b} (θ, δ) = {\tilde{L}}_{b} (θ) - \frac{1}{2} {(θ - {\hat{θ}}_{b - 1} - δ)}^{T} {\hat{I}}_{b - 1} (θ - {\hat{θ}}_{b - 1} - δ) - N_{b} λ_{b} \sum_{k = 1}^{p} | {\tilde{δ}}_{b k} |^{- 1} | δ_{k} |,

(6)

at each bth accumulation point with

b \geq 1

, where

λ_{b}

is the tuning parameter and we set

{\tilde{δ}}_{b} = {\tilde{θ}}_{b} - {\tilde{θ}}_{b - 1} = {({\tilde{δ}}_{b 1}, \dots, {\tilde{δ}}_{b p})}^{T}

,

{\hat{θ}}_{0} = 0

,

{\hat{I}}_{0} = 0

and

{\hat{I}}_{b} = {\hat{I}}_{b - 1} + {\tilde{I}}_{b} ({\hat{θ}}_{b})

, in which

{\tilde{I}}_{b} (θ)

denotes the negative second derivative of the logarithm of the likelihood function

{\tilde{L}}_{b} (θ)

defined in (3). The specific penalty function we choose is the adaptive lasso penalty [22]. We point out that there are also other alternatives, such as the smoothly clipped absolute deviation penalty [23] and the minimax concave penalty [24], and all of them can produce the consistent estimates and sparse solutions that we are interested in. In comparison with other penalty functions, the adaptive lasso penalty we used is easier to optimize and the algorithm converges very fast. In Section 3.3, more details about the computational algorithm for this newly devised optimization issue in (6) will be discussed.
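The three pieces of the objective in (6), namely the current batch log-likelihood, the quadratic term linking the new coefficients to the previous estimate, and the adaptive lasso penalty, can be written out directly. The following is a minimal sketch for the logistic case; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def loglik_logistic(theta, X, y):
    """Batch log-likelihood (3) for the logistic model."""
    eta = X @ theta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def penalized_objective(theta, delta, X, y, theta_prev, info_prev,
                        delta_tilde, lam, N_b):
    """Value of (6) at (theta, delta)."""
    r = theta - theta_prev - delta
    quad = 0.5 * r @ info_prev @ r                       # link to history
    penalty = N_b * lam * np.sum(np.abs(delta) / np.abs(delta_tilde))
    return loglik_logistic(theta, X, y) - quad - penalty

# Toy inputs (illustrative only).
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(2)
theta_prev = np.array([-1.0, 0.0])
info_prev = 2.0 * np.eye(2)
delta_tilde = np.array([0.5, 2.0])

# delta = 0: the whole change is charged to the quadratic term.
val_no_change = penalized_objective(theta, np.zeros(2), X, y, theta_prev,
                                    info_prev, delta_tilde, lam=0.1, N_b=10)
# delta = theta - theta_prev: the quadratic term vanishes, the penalty is paid.
val_full_change = penalized_objective(theta, theta - theta_prev, X, y, theta_prev,
                                      info_prev, delta_tilde, lam=0.1, N_b=10)
```

The two evaluations show the trade-off that the penalty controls: explaining a change through the increment removes the quadratic cost but incurs a weighted absolute-value cost instead.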

Remark 3.

From the specific definition, we can see that our proposed online estimator at the accumulation point b (

b \geq 1

) depends on the data only through the current data batch

O_{b}

and the historical summary statistics

H_{b - 1} = {{\hat{θ}}_{b - 1}, {\tilde{θ}}_{b - 1}, {\hat{I}}_{b - 1}, N_{b - 1}} .

(7)

In other words, a previous estimator

{\hat{θ}}_{b - 1}

is recursively updated to

{\hat{θ}}_{b}

when the new data block of the current observations of interest

O_{b}

arrives for

b \geq 2

. After we obtain the new estimator, the data batch

O_{b}

no longer needs to be accessible, apart from its summary statistics

H_{b} = {{\hat{θ}}_{b}, {\tilde{θ}}_{b}, {\hat{I}}_{b}, N_{b}}

that will be useful in the next stage.
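The summary statistics in (7) can be held in a small fixed-size container. The sketch below is one possible layout; the class name `SummaryStats`, its methods, the placeholder values, and the zero initialization of the batch-only estimate before the first batch are all assumptions for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SummaryStats:
    """Hypothetical container for H_b in (7)."""
    theta_hat: np.ndarray    # proposed online estimator
    theta_tilde: np.ndarray  # batch-only estimator
    info_hat: np.ndarray     # accumulated information matrix (p x p)
    n_cum: int               # cumulative sample size N_b

    def update(self, theta_hat, theta_tilde, info_batch, n_batch):
        """Return H_b from H_{b-1}; info_batch is the batch information at theta_hat."""
        return SummaryStats(theta_hat, theta_tilde,
                            self.info_hat + info_batch, self.n_cum + n_batch)

    def n_stored(self):
        """Number of stored scalars: 2p + p^2 + 1, independent of b."""
        return self.theta_hat.size + self.theta_tilde.size + self.info_hat.size + 1

p = 3
# Starting values before any data arrive (zero batch-only estimate is an assumption).
H = SummaryStats(np.zeros(p), np.zeros(p), np.zeros((p, p)), 0)
for n_b in (100, 250):
    # Placeholder estimates and information, standing in for fitted quantities.
    H = H.update(np.ones(p), np.ones(p), np.eye(p), n_b)
```

The storage footprint depends only on the dimension p, not on the number of batches processed, which is the memory advantage the remark describes.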

Remark 4.

Note that if we are confident that the assumption stated in (5) holds true, we can drop the penalty term and fix δ = 0 in the maximization of (6). In this case, it can be verified that the resulting online estimator simplifies to the one that is adapted to homogeneous streaming data and has been previously discussed in Luo and Song [2]. The corresponding estimator, denoted as

{\overset{ˇ}{θ}}_{b}

, can be equivalently expressed using the first derivative of objective function; that is, we recursively solve

{\tilde{U}}_{b} ({\overset{ˇ}{θ}}_{b}) + {\overset{ˇ}{I}}_{b - 1} ({\overset{ˇ}{θ}}_{b - 1} - {\overset{ˇ}{θ}}_{b}) = 0,

for all

b \geq 1

, where

{\tilde{U}}_{b} (θ) = \nabla_{θ} {\tilde{L}}_{b} (θ)

and

{\overset{ˇ}{I}}_{b - 1} = \sum_{j = 1}^{b - 1} {\tilde{I}}_{j} ({\overset{ˇ}{θ}}_{j})

for

b \geq 1

with

{\overset{ˇ}{I}}_{0} = 0

. Hence, the existing homogeneous estimator

{\overset{ˇ}{θ}}_{b}

is a special case of the estimator we have just proposed. However, such estimators do not take the underlying dynamic nature of the regression coefficients into consideration and might not yield consistent estimates when the covariate effects change.
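For the homogeneous special case, the recursive estimating equation in Remark 4 can be solved numerically. The sketch below uses a plain Newton iteration for the logistic model; this is an illustrative choice and not necessarily the algorithm used by Luo and Song [2]. The function names and simulated stream are assumptions.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_info(theta, X, y):
    """Score and information of the logistic log-likelihood for one batch."""
    p = expit(X @ theta)
    return X.T @ (y - p), X.T @ (X * (p * (1.0 - p))[:, None])

def renew(theta_prev, info_prev, X, y, n_iter=50, tol=1e-10):
    """Solve U_b(theta) + info_prev (theta_prev - theta) = 0 by Newton steps."""
    theta = theta_prev.copy()
    for _ in range(n_iter):
        U, I_b = score_info(theta, X, y)
        g = U + info_prev @ (theta_prev - theta)
        step = np.linalg.solve(I_b + info_prev, g)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    # Accumulate the information evaluated at the new estimate.
    return theta, info_prev + score_info(theta, X, y)[1]

# Homogeneous simulated stream: the same coefficients in every batch.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -1.0])
theta_check, info_check = np.zeros(2), np.zeros((2, 2))
for b in range(5):
    X = np.column_stack([np.ones(500), rng.normal(size=500)])
    y = rng.binomial(1, expit(X @ theta_true))
    theta_before, info_before = theta_check, info_check
    theta_check, info_check = renew(theta_check, info_check, X, y)
```

At the first batch the accumulated information is zero, so the update reduces to the ordinary maximum likelihood fit on that batch; afterwards each batch is discarded once `theta_check` and `info_check` are updated.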

3.2. Motivation and Derivation of the Proposed Estimator

To explain the motivation for the construction in (6), we now revisit how it arises. From the definition of the nuisance vectors

{δ_{b} : b \geq 1}

defined in (4), we can deduce that the batch-specific vector of regression coefficients can be equivalently rewritten as

θ_{j} = θ_{b} - \sum_{l = j + 1}^{b} δ_{l},

(8)

for

j = 1, \dots, b

and

b \geq 2

. Starting from the first data batch, we assume that we have possessed a series of online updated estimators

{\hat{θ}}_{*, j}

and

{\hat{δ}}_{*, j}

until the

(b - 1)

th data block (

j \leq b - 1

). Here,

{\hat{θ}}_{*, b}

and

{\hat{δ}}_{*, b}

are just pre-defined notations and specific mathematical definitions will be presented later. They will be shown to coincide with

{\hat{θ}}_{b}

and

{\hat{δ}}_{b}

, which have been proposed in the previous subsection.

Intuitively, we can estimate our current regression coefficient of interest

θ_{b}

and heterogeneous degree

δ_{b}

by maximizing the objective function with respect to

θ

and

δ

\sum_{j = 1}^{b - 1} {\tilde{L}}_{j} (θ - δ - \sum_{l = j + 1}^{b - 1} {\hat{δ}}_{*, l}) + {\tilde{L}}_{b} (θ) - N_{b} λ_{b} \sum_{k = 1}^{p} | {\tilde{δ}}_{b k} |^{- 1} | δ_{k} |,

(9)

at the accumulation point b. However, optimizing the objective function (9) directly is infeasible since its first term relies on the whole historical raw dataset, but these data will no longer be stored in the streaming setting. To overcome this issue, at the bth accumulation point, we replace each component for summation, that is,

{\tilde{L}}_{j} (θ - δ - \sum_{l = j + 1}^{b - 1} {\hat{δ}}_{*, l})

, with

\begin{aligned} {\tilde{P}}_{j} (θ, δ) = {} & {\tilde{L}}_{j} ({\hat{θ}}_{*, j}) + {\tilde{U}}_{j} {({\hat{θ}}_{*, j})}^{T} (θ - δ - \sum_{l = j + 1}^{b - 1} {\hat{δ}}_{*, l} - {\hat{θ}}_{*, j}) \\ & - \frac{1}{2} {(θ - δ - \sum_{l = j + 1}^{b - 1} {\hat{δ}}_{*, l} - {\hat{θ}}_{*, j})}^{T} {\tilde{I}}_{j} ({\hat{θ}}_{*, j}) (θ - δ - \sum_{l = j + 1}^{b - 1} {\hat{δ}}_{*, l} - {\hat{θ}}_{*, j}), \end{aligned}

for

j \leq b - 1

, which can be viewed as a second-order Taylor expansion of

{\tilde{L}}_{j} (\cdot)

around the pre-specified point

{\hat{θ}}_{*, j}

.
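The quality of such a second-order surrogate can be checked numerically. The sketch below builds the quadratic expansion of a logistic batch log-likelihood around an expansion point and compares it with the exact value at nearby points; the function names, the expansion point, and the simulated data are assumptions for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(theta, X, y):
    eta = X @ theta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def quad_surrogate(theta, theta0, X, y):
    """Second-order Taylor expansion of the batch log-likelihood around theta0."""
    p = expit(X @ theta0)
    U = X.T @ (y - p)                              # score at theta0
    I = X.T @ (X * (p * (1.0 - p))[:, None])       # information at theta0
    d = theta - theta0
    return loglik(theta0, X, y) + U @ d - 0.5 * d @ I @ d

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
theta0 = np.array([0.5, -1.0])
y = rng.binomial(1, expit(X @ theta0))

# Approximation error at increasing distances h from the expansion point.
errors = {}
for h in (0.0, 0.01, 0.1):
    theta = theta0 + h
    errors[h] = abs(loglik(theta, X, y) - quad_surrogate(theta, theta0, X, y))
```

Once the value, score, and information at the expansion point are stored, the surrogate can be evaluated without the raw batch, which is what makes the replacement useful in the streaming setting.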

Consequently, we obtain the following approximated function:

{\hat{L}}_{*, b} (θ, δ) = {\tilde{L}}_{b} (θ) + \sum_{j = 1}^{b - 1} {\tilde{P}}_{j} (θ, δ) - N_{b} λ_{b} \sum_{k = 1}^{p} | {\tilde{δ}}_{b k} |^{- 1} | δ_{k} | .

Note that the calculation of

{\hat{L}}_{*, b} (θ, δ)

does not need any historical individual-level data in

O_{b - 1}^{*}

any more but needs a series of summary statistics. Hence, we can conduct adaptive online estimation based on the refined function. Specifically, at the accumulation point b with

b \geq 1

, we define an online estimator by recursively calculating

{({\hat{θ}}_{*, b}^{T}, {\hat{δ}}_{*, b}^{T})}^{T} = arg max_{θ, δ \in R^{p}} {\hat{L}}_{*, b} (θ, δ) .

We can show next in Theorem 1 that our proposed online estimator

{\hat{θ}}_{b}

and

{\hat{δ}}_{b}

presented in (6) at the accumulation point b is indeed equivalent to the maximizer of

{\hat{L}}_{*, b} (θ, δ)

.

Theorem 1.

For all

b \geq 1

, the maximizer

({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b})

of

{\hat{L}}_{*, b} (θ, δ)

equals that of

{\hat{L}}_{b} (θ, δ)

, that is,

{\hat{θ}}_{*, b} = {\hat{θ}}_{b} and {\hat{δ}}_{*, b} = {\hat{δ}}_{b} .

In other words, recursively optimizing the objective function

{\hat{L}}_{b} (θ, δ)

as defined in (6) is equivalent to recursively optimizing

{\hat{L}}_{*, b} (θ, δ)

.

Proof.

Write

{\hat{δ}}_{*, b} = {({\hat{δ}}_{*, b 1}, \dots, {\hat{δ}}_{*, b p})}^{T}

and let

{\hat{κ}}_{*, b} = {({\hat{κ}}_{*, b 1}, \dots, {\hat{κ}}_{*, b p})}^{T}

in which

| {\hat{κ}}_{*, b k} | \leq 1 / | {\tilde{δ}}_{b k} |

if

{\hat{δ}}_{*, b k} = 0

and

{\hat{κ}}_{*, b k} = sign ({\hat{δ}}_{*, b k}) / | {\tilde{δ}}_{b k} |

if

{\hat{δ}}_{*, b k} \neq 0

for

k = 1, \dots, p

. According to the Karush–Kuhn–Tucker (KKT) conditions, we know that at the accumulation point b (

b \geq 2

), the maximizer

({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b})

that maximizes the function

{\hat{L}}_{*, b} (θ, δ)

should satisfy the following set of equations:

\{\begin{matrix} h_{b} ({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b}) + {\tilde{U}}_{b} ({\hat{θ}}_{*, b}) = 0, \\ - h_{b} ({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b}) - N_{b} λ_{b} {\hat{κ}}_{*, b} = 0, \end{matrix}

(10)

where we define

h_{b} ({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b}) = \sum_{j = 1}^{b - 1} [{\tilde{U}}_{j} ({\hat{θ}}_{*, j}) - {\tilde{I}}_{j} ({\hat{θ}}_{*, j}) ({\hat{θ}}_{*, b} - {\hat{θ}}_{*, j} - \sum_{l = j + 1}^{b} {\hat{δ}}_{*, l})] .

Then, we further re-write

h_{b} ({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b})

into a recursive form as follows:

\begin{aligned} h_{b} ({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b}) & = h_{b - 1} ({\hat{θ}}_{*, b - 1}, {\hat{δ}}_{*, b - 1}) - h_{b - 1} ({\hat{θ}}_{*, b - 1}, {\hat{δ}}_{*, b - 1}) + h_{b} ({\hat{θ}}_{*, b}, {\hat{δ}}_{*, b}) \\ & = h_{b - 1} ({\hat{θ}}_{*, b - 1}, {\hat{δ}}_{*, b - 1}) + {\tilde{U}}_{b - 1} ({\hat{θ}}_{*, b - 1}) - {\hat{I}}_{b - 1} ({\hat{θ}}_{*, b} - {\hat{θ}}_{*, b - 1} - {\hat{δ}}_{*, b}) . \end{aligned}

Based on this formula, it can be deduced that solving the online estimators from the set of KKT conditions for all

b \geq 1

presented in (10) is equivalent to solving them from the following set of equations:

\{\begin{matrix} - {\hat{I}}_{b - 1} ({\hat{θ}}_{*, b} - {\hat{θ}}_{*, b - 1} - {\hat{δ}}_{*, b}) + {\tilde{U}}_{b} ({\hat{θ}}_{*, b}) = 0, \\ {\hat{I}}_{b - 1} ({\hat{θ}}_{*, b} - {\hat{θ}}_{*, b - 1} - {\hat{δ}}_{*, b}) - N_{b} λ_{b} {\hat{κ}}_{*, b} = 0, \end{matrix}

(11)

for all

b \geq 1

. The equivalence lies in the fact that Equation (11) is indeed the KKT conditions for maximizing

{\hat{L}}_{b} (θ, δ)

. Hence, we complete the proof. □

From Theorem 1, we know that defining our proposed adaptive online estimator based on

{\hat{L}}_{b} (θ, δ)

or

{\hat{L}}_{*, b} (θ, δ)

is equivalent. However, optimizing

{\hat{L}}_{*, b} (θ, δ)

directly is not satisfactory since the calculation of

{\hat{L}}_{*, b} (θ, δ)

relies on

O_{b}

and

H_{*, b - 1} = {{\hat{θ}}_{*, 1}, \dots, {\hat{θ}}_{*, b - 1}, {\tilde{θ}}_{*, b - 1}, {\tilde{U}}_{1}, \dots, {\tilde{U}}_{b - 1}, {\tilde{I}}_{1}, \dots, {\tilde{I}}_{b - 1}, N_{b - 1}},

at the time point b, and it is obvious that

H_{*, b - 1}

contains many more historical summary statistics in comparison to

H_{b - 1}

. Therefore, to minimize the use of summary statistics and conserve storage memory, it is necessary to further transform it into

{\hat{L}}_{b} (θ, δ)

, which is a more concise objective function.

3.3. Tuning Parameter Selection and Implementation Algorithm

We first discuss the selection of tuning parameters

{λ_{b} : b \geq 1}

. From the literature, we know that the accuracy of selection and the consistency of estimation for penalty-based approaches depend on a careful choice of the tuning parameter [25,26]. In the offline setting, various methods for parameter tuning have been used in the literature, such as cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC), among many others [27,28,29]. However, in the streaming data setting, where the entire cumulative dataset

O_{b}^{*}

at accumulation point b with

b \geq 2

is not assumed to be fully accessible, traditional offline criteria cannot be directly calculated and are no longer feasible.

To align with the streaming environment, we devise the following online BIC at the bth accumulation point for our proposed method:

\begin{aligned} BIC (λ_{b}) = {} & 2 {\tilde{L}}_{b} ({\hat{θ}}_{b, λ_{b}}) - {({\hat{θ}}_{b, λ_{b}} - {\hat{θ}}_{b - 1} - {\hat{δ}}_{b, λ_{b}})}^{T} {\hat{I}}_{b - 1} ({\hat{θ}}_{b, λ_{b}} - {\hat{θ}}_{b - 1} - {\hat{δ}}_{b, λ_{b}}) \\ & - log (N_{b}) | A_{b} |, \end{aligned}

where

A_{b} = {k : {\hat{δ}}_{b, λ_{b}, k} \neq 0, k = 1, \dots, p}

and

| A |

denotes the cardinality of an arbitrary set

A

. The online BIC criterion can appropriately balance the model fits and the model complexity. Thus, the chosen

λ_{b}

is defined as

{\hat{λ}}_{b} = arg max_{λ_{b} \in Ω_{b}} BIC (λ_{b}),

(12)

where

Ω_{b}

denotes the candidate set consisting of potentially desirable values of

λ_{b}

. We can find that this tuning parameter selection approach adapts to the online environment we are interested in since the newly defined online BIC criterion only depends on

O_{b}

and the historical summary statistics

H_{b - 1}

.
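The online BIC above needs only the current batch, the previous estimate, and the accumulated information matrix. A minimal sketch for the logistic case follows; the function name `online_bic` and the toy inputs are hypothetical, and the active set is taken to be the nonzero components of the fitted increment.

```python
import numpy as np

def loglik_logistic(theta, X, y):
    eta = X @ theta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def online_bic(theta, delta, X, y, theta_prev, info_prev, N_b):
    """Online BIC: 2 * loglik - quadratic history term - log(N_b) * |A_b|."""
    r = theta - theta_prev - delta
    n_active = int(np.count_nonzero(delta))   # size of the active set A_b
    return (2.0 * loglik_logistic(theta, X, y)
            - r @ info_prev @ r
            - np.log(N_b) * n_active)

# Toy inputs (illustrative only).
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(2)
theta_prev = np.array([-1.0, 0.0])
info_prev = 2.0 * np.eye(2)

bic_sparse = online_bic(theta, np.zeros(2), X, y, theta_prev, info_prev, N_b=10)
bic_jump = online_bic(theta, np.array([1.0, 0.0]), X, y, theta_prev, info_prev, N_b=10)
```

In practice the criterion would be evaluated at the fitted pair for each candidate tuning parameter, and the candidate with the largest value would be kept, as in (12).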

We next present an algorithm for implementing the proposed estimation method in an online updating scheme. To optimize the objective function in (6), we suggest an algorithm for finding the maximizer

{\hat{θ}}_{b}

by following similar steps to those in Zhang and Lu [30]. As mentioned in that paper, we approximate the objective function in (6) with the help of an iterative least-squares procedure coming from the updating step of the Newton–Raphson algorithm. Then, by imposing the adaptive lasso penalty, we solve the resulting problem with a lasso-penalized least-squares form at each iteration. Various algorithms can be used to optimize this standard problem, such as the proximal gradient descent, the coordinate descent, and the shooting algorithm, among others [29,30,31]. It has been demonstrated to be computationally efficient through numerical experiments. We briefly list the steps for the implementing of our proposed adaptive online estimator for dynamic regression coefficients coming from the generalized linear model as follows:

Step 1.

Sequentially input arriving datasets

{O_{1}, \dots, O_{b}, \dots}

from model (1).

Step 2.

Compute

{\hat{θ}}_{1} = {\tilde{θ}}_{1}

and

{\hat{I}}_{1} = {\tilde{I}}_{1} ({\hat{θ}}_{1})

using initial dataset

O_{1}

.

Step 3.

For each

b \geq 2

:

Read in dataset $O_{b}$ .
Calculate ${\tilde{θ}}_{b}$ using $O_{b}$ only and set ${\tilde{δ}}_{b} = {\tilde{θ}}_{b} - {\tilde{θ}}_{b - 1}$ .
For each $λ_{b}$ in the candidate set $Ω_{b}$ of tuning parameters, obtain ${\hat{θ}}_{b, λ_{b}}$ and ${\hat{δ}}_{b, λ_{b}}$ by optimizing the objective function (6).
Choose optimal ${\hat{λ}}_{b}$ via the online BIC criterion shown in (12).
Set ${\hat{θ}}_{b} \leftarrow {\hat{θ}}_{b, {\hat{λ}}_{b}}$ and ${\hat{δ}}_{b} \leftarrow {\hat{δ}}_{b, {\hat{λ}}_{b}}$ , and update ${\hat{I}}_{b} = {\hat{I}}_{b - 1} + {\tilde{I}}_{b} ({\hat{θ}}_{b})$ .
Save the newest set of summary statistics $H_{b}$ as defined in (7).
Release dataset $O_{b}$ from the memory.

Step 4.

Output the parameters of interest

{\hat{θ}}_{b}

and

{\hat{δ}}_{b}

for each

b \geq 1

.
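Steps 1 to 4 can be sketched end to end for the logistic model. Note the assumptions: instead of the iterative least-squares scheme described above, this sketch applies proximal gradient ascent directly to (6), with a step size based on an upper bound for the curvature of the smooth part. All function names, the tuning grid, and the simulated stream (a slope change at the third batch) are hypothetical choices for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(theta, X, y):
    eta = X @ theta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def info(theta, X):
    p = expit(X @ theta)
    return X.T @ (X * (p * (1.0 - p))[:, None])

def batch_mle(X, y, n_iter=50):
    # Newton-Raphson for the batch-only estimator.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score = X.T @ (y - expit(X @ theta))
        theta = theta + np.linalg.solve(info(theta, X), score)
    return theta

def solve_objective(X, y, theta_prev, info_prev, weights, lam_total,
                    n_iter=5000, tol=1e-9):
    # Proximal gradient ascent on (6); lam_total stands for N_b * lambda_b.
    lip = (0.25 * np.linalg.eigvalsh(X.T @ X)[-1]
           + 2.0 * np.linalg.eigvalsh(info_prev)[-1])   # curvature bound
    theta, delta = theta_prev.copy(), np.zeros_like(theta_prev)
    for _ in range(n_iter):
        pull = info_prev @ (theta - theta_prev - delta)
        score = X.T @ (y - expit(X @ theta))
        theta_new = theta + (score - pull) / lip         # gradient step in theta
        z = delta + pull / lip                           # gradient step in delta
        thr = lam_total * weights / lip
        delta_new = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft-threshold
        change = max(np.max(np.abs(theta_new - theta)),
                     np.max(np.abs(delta_new - delta)))
        theta, delta = theta_new, delta_new
        if change < tol:
            break
    return theta, delta

def online_fit(batches, lam_grid):
    X, y = batches[0]
    theta_hat = theta_tilde = batch_mle(X, y)            # Step 2
    info_hat, N = info(theta_hat, X), len(y)
    path = [(theta_hat, np.zeros_like(theta_hat))]
    for X, y in batches[1:]:                             # Step 3
        N += len(y)
        theta_tilde_new = batch_mle(X, y)
        weights = 1.0 / np.maximum(np.abs(theta_tilde_new - theta_tilde), 1e-8)
        best = None
        for lam in lam_grid:
            th, de = solve_objective(X, y, theta_hat, info_hat, weights, N * lam)
            r = th - theta_hat - de
            bic = (2.0 * loglik(th, X, y) - r @ info_hat @ r
                   - np.log(N) * np.count_nonzero(de))   # online BIC
            if best is None or bic > best[0]:
                best = (bic, th, de)
        _, theta_hat, delta_hat = best
        info_hat = info_hat + info(theta_hat, X)         # update summary statistics
        theta_tilde = theta_tilde_new
        path.append((theta_hat, delta_hat))              # batch can now be released
    return path

# Simulated stream: the slope changes from -1 to 1 at the third batch.
rng = np.random.default_rng(3)
thetas = [np.array([0.5, -1.0])] * 2 + [np.array([0.5, 1.0])] * 2
batches = []
for th in thetas:
    X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
    batches.append((X, rng.binomial(1, expit(X @ th))))
path = online_fit(batches, lam_grid=[1e-4, 1e-3, 1e-2, 1e-1])
```

Each entry of `path` holds the fitted coefficients and the fitted increment for one batch; only the previous estimates, the accumulated information matrix, and the cumulative sample size are carried from one batch to the next.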

We emphasize here, as pointed out in the previous literature, that analyzing data in a streaming manner not only saves storage space but also significantly reduces computation time, whereas offline methods that load and fit all historical data are time-consuming. However, neither offline nor previous online methods have considered the potential heterogeneity of the data, which our newly proposed method can flexibly address. Admittedly, compared to online methods that assume data homogeneity, our method increases the computational complexity due to the use of penalty terms. But thanks to the online setting, the increase in computational cost is linear in the number of data batches and should not be significant when the cumulative sample size is large, since only a small number of samples need to be loaded each time and the historical data never need to be reloaded.

4. Numerical Studies

We have conducted a series of numerical studies to evaluate the finite sample performance of our proposed adaptive online estimation procedure in the generalized linear model with dynamic regression coefficients. In this paper, we primarily focus on the logistic model, but it is worth noting that other types of generalized linear models can be implemented similarly. Both simulated experiments and a real case study are considered.

4.1. Mathematical Formulation for a Special Case: The Logistic Model

As we have derived the estimation procedures within the general framework of the generalized linear model, we now provide detailed mathematical expressions for the necessary quantities in this special case of interest. Denote $π (x) = expit (x) = e^{x} / (1 + e^{x})$. The density function for a single sample used in the logistic model is

f_{b} (y; X_{b}, θ_{b}) = exp \{y X_{b}^{T} θ_{b} - log [1 + exp (X_{b}^{T} θ_{b})]\} = {[π (X_{b}^{T} θ_{b})]}^{y} {[1 - π (X_{b}^{T} θ_{b})]}^{1 - y} .

In comparison with the general form presented in (1), we have $b (x) = log (1 + e^{x})$, $a (ϕ_{b}) = 1$ and $c (y, ϕ_{b}) = 0$ in the logistic model.

Then, at the accumulation point b, the log-likelihood function based on the data $O_{b}$ can be expressed as

{\tilde{L}}_{b} (θ) = \sum_{i = 1}^{n_{b}} log f_{b} (Y_{b i}; X_{b i}, θ) = \sum_{i = 1}^{n_{b}} \{Y_{b i} X_{b i}^{T} θ - log [1 + exp (X_{b i}^{T} θ)]\} .

With this formulation in hand, the first derivative (the score function) and the negative second derivative (the information matrix) are

\begin{matrix} {\tilde{U}}_{b} (θ) & = & \frac{\partial {\tilde{L}}_{b} (θ)}{\partial θ} = \sum_{i = 1}^{n_{b}} X_{b i} [Y_{b i} - \frac{exp (X_{b i}^{T} θ)}{1 + exp (X_{b i}^{T} θ)}], \\ {\tilde{I}}_{b} (θ) & = & - \frac{\partial^{2} {\tilde{L}}_{b} (θ)}{\partial θ \partial θ^{T}} = \sum_{i = 1}^{n_{b}} \frac{exp (X_{b i}^{T} θ) X_{b i} X_{b i}^{T}}{{[1 + exp (X_{b i}^{T} θ)]}^{2}}, \end{matrix}

respectively. In the following subsection, we illustrate our proposed method via the logistic model.
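As a concrete companion to these expressions, the following NumPy sketch evaluates the batch log-likelihood, score, and information matrix of the logistic model. The function names and the convention that `X` stores one sample per row are assumptions made for illustration only.

```python
import numpy as np


def expit(x):
    """pi(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + np.exp(-x))


def loglik(theta, X, y):
    """Batch log-likelihood: sum_i { y_i x_i' theta - log[1 + exp(x_i' theta)] }."""
    eta = X @ theta
    # logaddexp(0, eta) = log(1 + exp(eta)), computed in a numerically stable way
    return np.sum(y * eta - np.logaddexp(0.0, eta))


def score(theta, X, y):
    """Score: sum_i x_i [ y_i - pi(x_i' theta) ]."""
    return X.T @ (y - expit(X @ theta))


def information(theta, X):
    """Information: sum_i pi_i (1 - pi_i) x_i x_i' (negative Hessian)."""
    p = expit(X @ theta)
    w = p * (1.0 - p)  # equals exp(eta) / [1 + exp(eta)]^2
    return (X * w[:, None]).T @ X
```

A quick sanity check is to compare `score` with a finite-difference gradient of `loglik`, and `information` with the negative finite-difference Jacobian of `score`.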

4.2. Simulation Experiments

We adopt the logistic model structure to represent the potential response of interest. To facilitate our presentation, we suppose that there is a terminal point B, and for each $b = 1, \dots, B$, we sequentially generate the streaming data batches with binary outcomes $Y_{b}$ and covariates $X_{b}$, resulting in a full dataset with a total sample size $N_{B}$. For ease of presentation, we directly set $X = X_{b}$ for all b and also suppose that all data batches have the same sample size $n_{b} = N_{B} / B$ for $b = 1, \dots, B$. Given $X$, $Y_{b}$ follows a Bernoulli distribution with probability of success $π_{b} = P (Y_{b} = 1 | X)$ and dispersion parameter $ϕ_{b} = 1$. The logistic model takes the form

g (π_{b}) = log (\frac{π_{b}}{1 - π_{b}}) = θ_{b 0} + θ_{b 1} X_{1} + θ_{b 2} X_{2},

for $b \geq 1$, where we set $X = {(1, X_{1}, X_{2})}^{T} = {(1, X_{1}, I_{\{{\tilde{X}}_{2} \geq 1\}})}^{T}$, with $(X_{1}, {\tilde{X}}_{2})$ generated from a bivariate normal distribution (standard normal marginals with covariance 0.4). Note that $X_{1}$ and $X_{2}$ are correlated, and both discrete and continuous covariates are involved to mimic a more realistic scenario.
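The data-generating mechanism just described can be sketched as follows. The function name `simulate_batch` is hypothetical, and the sketch assumes that each batch draws fresh covariates from the same distribution, which is one reading of setting $X = X_{b}$ for all b.

```python
import numpy as np


def simulate_batch(theta_b, n_b, rng):
    """Draw one batch (X, y) from the logistic model with coefficient theta_b."""
    # (X1, X2_tilde): bivariate normal, unit variances, covariance 0.4
    cov = np.array([[1.0, 0.4], [0.4, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n_b)
    x1 = z[:, 0]
    x2 = (z[:, 1] >= 1.0).astype(float)   # X2 = 1{X2_tilde >= 1}
    X = np.column_stack([np.ones(n_b), x1, x2])
    prob = 1.0 / (1.0 + np.exp(-(X @ theta_b)))   # pi_b = expit(X' theta_b)
    y = rng.binomial(1, prob)                     # Bernoulli outcomes
    return X, y


# Example: one batch of size 400 with theta_1 = (1, -1, 1)
rng = np.random.default_rng(0)
X, y = simulate_batch(np.array([1.0, -1.0, 1.0]), 400, rng)
```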

We consider the following four simulation cases with different numbers of data batches and different magnitudes of heterogeneous degrees:

Case 1: $B = 15$, $θ_{1} = {(1, - 1, 1)}^{T}$, $δ_{6} = {(- 2, 0, 0)}^{T}$ and $δ_{11} = {(0, 2, 0)}^{T}$.
Case 2: $B = 15$, $θ_{1} = {(1, - 1, 1)}^{T}$, $δ_{6} = {(- 1, 0, 0)}^{T}$ and $δ_{11} = {(0, 1, 0)}^{T}$.
Case 3: $B = 25$, $θ_{1} = {(1, - 1, 1)}^{T}$, $δ_{6} = δ_{16} = {(- 1, 0, 0)}^{T}$ and $δ_{11} = δ_{21} = {(0, 1, 0)}^{T}$.
Case 4: $B = 50$, $θ_{1} = {(1, - 1, 1)}^{T}$, $δ_{11} = δ_{31} = {(- 1, 0, 0)}^{T}$ and $δ_{21} = δ_{41} = {(0, 1, 0)}^{T}$.

All remaining heterogeneous degrees that have not been specified are zero vectors. Two batch sizes, $n_{b} = 400$ and 800, are considered for these cases. Compared with Case 1, Case 2 features smaller heterogeneous degrees. Case 3 and Case 4 differ in the value of B and contain more data batches than Case 1 and Case 2, making them more complex. The value of B in Case 1 and Case 2 is deliberately kept relatively small so that the gradual changes in the performance of the online estimators can be viewed more clearly. A total of 500 repetitions are conducted for each simulation setting.
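To make the case definitions concrete, the batch-specific coefficients can be built from $θ_{1}$ and the heterogeneous degrees. The sketch below assumes the convention $θ_{b} = θ_{b - 1} + δ_{b}$, consistent with $δ_{b}$ measuring the difference between coefficients of adjacent batches; the helper name `coefficient_path` is hypothetical.

```python
import numpy as np


def coefficient_path(theta_1, deltas, B):
    """Return a (B, p) array whose row b-1 is theta_b.

    deltas maps a 1-based batch index b >= 2 to the vector delta_b;
    unspecified batches have delta_b = 0 (assumed: theta_b = theta_{b-1} + delta_b).
    """
    steps = np.zeros((B, len(theta_1)))
    steps[0] = theta_1
    for b, delta in deltas.items():
        steps[b - 1] = delta
    return np.cumsum(steps, axis=0)   # running sum of theta_1 and the deltas


# Case 1: B = 15, theta_1 = (1, -1, 1), delta_6 = (-2, 0, 0), delta_11 = (0, 2, 0)
path = coefficient_path([1.0, -1.0, 1.0], {6: [-2.0, 0.0, 0.0], 11: [0.0, 2.0, 0.0]}, 15)
```

Under this convention, Case 1 has $θ_{b} = {(1, - 1, 1)}^{T}$ for batches 1 to 5, ${(- 1, - 1, 1)}^{T}$ for batches 6 to 10, and ${(- 1, 1, 1)}^{T}$ for batches 11 to 15.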

First, we investigate whether the newly proposed method can accurately estimate the heterogeneous degrees $δ_{b}$ and detect the nonzero ones. Simulation results on sparsity estimation for all streaming stages are shown in Table 1 for Case 1 and in Table 2 for Case 2. We report the proportion of times the proposed method underselects (U), overselects (O), and exactly selects (E) the nonzero components in each streaming stage. These measures tell us how well the proposed method identifies potential changes of effects. To evaluate the estimation accuracy of heterogeneous degrees, we further report the estimated mean squared error (MSE). Table 1 and Table 2 also report how many times each component of the heterogeneous degrees is estimated as nonzero. The E values are consistently high, at least 93.8% in every stage. In addition, as the sample size increases from 400 to 800, the number of selection mistakes tends to decrease, most clearly in Case 2. The extent of these mistakes is small and can be considered negligible. The frequency results indicate that at batches 6 and 11, the truly changed component is selected in all 500 repetitions. From all these results, we find that our newly proposed online method performs well in identifying the potential existence of dynamic regression coefficients.
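The U, O, and E measures can be computed per repetition as in the sketch below. The exact classification rule is an assumption: a fit counts as underselecting if it misses any truly nonzero component, as overselecting if it recovers all truly nonzero components but flags at least one extra, and as exact otherwise. The function names are hypothetical.

```python
import numpy as np


def classify_selection(delta_hat, delta_true):
    """Label one estimate of delta_b as 'U', 'O' or 'E' (assumed definitions)."""
    est = np.asarray(delta_hat) != 0    # components estimated as nonzero
    true = np.asarray(delta_true) != 0  # components that are truly nonzero
    if np.any(true & ~est):
        return "U"   # at least one true change was missed
    if np.any(est & ~true):
        return "O"   # all true changes found, plus a spurious one
    return "E"       # selected set equals the true set


def selection_rates(estimates, delta_true):
    """Percentage of repetitions falling in each category."""
    labels = [classify_selection(d, delta_true) for d in estimates]
    return {k: 100.0 * labels.count(k) / len(labels) for k in "UOE"}
```

With these definitions the three percentages always sum to 100, which matches the layout of the U%, O%, and E% columns in the tables.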

Summarized simulation results for the sparsity estimation of heterogeneous degrees under both Case 3 and Case 4 are shown in Table 3. For ease of presentation, only results at and immediately after the batches with nonzero heterogeneous degree vectors are shown. We find that our proposed estimator performs similarly to before in identifying and quantifying dynamic effects, even when heterogeneous data batches occur many more times. When homogeneity holds relative to the former data batch, the MSE is greatly reduced owing to the use of useful historical information.

To provide a more intuitive presentation of the estimated regression coefficients, we employ boxplots to illustrate the estimated regression coefficients across all cases. In Figure 2, the results for Case 1 with the sample size 400 are shown. Similar results for Case 2 with the sample size 400, and both cases with the sample size 800 are respectively shown in Figure A1–Figure A3 in Appendix A. Note that in all reported figures, we have labeled our proposed estimator as “Online-Hetero”. In addition, we have considered various other methods for comparison, including the online estimator proposed by Luo and Song [2], referred to as “Online”, which has also been introduced in our Remark 4, and the online conservative estimator, termed as “Online-Conser”, which has been defined in Equation (2) and discussed in Remark 2. The method “Online” has been shown in Luo and Song [2] to outperform other existing methods under homogeneity. In addressing the heterogeneity problem, the “Online-Conser” method serves as a benchmark compared to the proposed method since it naturally mitigates the underlying heterogeneity issue, ensuring its consistency by estimating the current batch-specific regression coefficient solely based on the current data batch. Hence, these chosen methods for comparison are representative.

The boxplots for parameters 0 and 1 demonstrate that, in the presence of heterogeneity, the Online-Conser estimator can promptly detect it and provide consistent estimates. However, because it neglects the underlying dynamic regression coefficients, the Online estimator has quite large biases. The Online method also tends to converge progressively towards the altered coefficient values as the number of data batches increases, and even for parameter 2, where there is no heterogeneity, it displays a slight deviation, as shown in these boxplots. In contrast, our proposed method adaptively detects potential changes in regression coefficients, resulting in unbiased estimates. Furthermore, although coefficients change occasionally, our proposed estimator achieves a large efficiency gain over the online conservative estimator as homogeneous data accumulate. In other words, the conservative estimator is less efficient because it ignores historical information and uses only the current data.

We then turn our attention to Case 3 and Case 4. The boxplot of the estimated regression coefficients at the terminal point for Case 3 is shown in Figure 3, and a similar boxplot for Case 4 is presented in Figure A4 in Appendix A. We present only the terminal point because we additionally include the offline method (denoted by “Offline”), which requires the entire dataset. Moreover, given the larger number of data blocks in these cases, presenting the results for every batch as before would produce an overly lengthy display and obscure our findings. In these scenarios, where heterogeneity arises occasionally, both the Online estimator and the Offline estimator are noticeably biased at the terminal point. In contrast, our proposed method, Online-Hetero, delivers satisfactory results owing to its ability to adaptively detect heterogeneous degrees.

4.3. Real Data Analysis

4.3.1. Presentation of the Streaming Airline Data

In this subsection, our online methodology is applied to the airline on-time statistics mentioned in the introduction part, available at the corresponding website (https://zenodo.org/record/1246060 (accessed on 25 October 2023)).

To provide a general demonstration of the effectiveness of our newly proposed method, we use this dataset to investigate the effects of factors such as distance and departure time on late arrival for flights of medium and long distances (greater than 2400 miles) on weekends in the year 2008. The records are extracted from the database in accordance with these characteristics.

Here are more details about the real data we analyze. A flight is termed a late arrival if it arrives more than 15 min late. The variable named “distance” is continuous and measured in miles, and the variable named “night” is binary, taking the value 1 if the corresponding flight takes off between 8 p.m. and 5 a.m. and 0 otherwise. These are commonly chosen and analyzed variables in the literature [8]. In total, 46,264 eligible flights are included in our study. The proportion of late arrivals is 25.3%, night flights account for 20.9% of all records, and the median distance is 2553 miles. To analyze the data in a streaming manner, we arrange all records in chronological order and then partition them into 12 data blocks by month. The block sizes vary, ranging from 3109 samples in the smallest block to 5134 samples in the largest. While we could consider a larger dataset with more batches, doing so might complicate the graphical presentation of our results. By limiting our analysis to one year of data, we can present the advantages of our method clearly. However, our method is not limited to this number of data blocks and can handle many more in practical applications, as demonstrated in our previous simulation experiments.
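The variable definitions and the monthly partition can be illustrated with a small sketch. The input fields (month, arrival delay in minutes, departure time as an HHMM integer, distance in miles) and the function names are assumptions for illustration, not the actual column names of the database. The treatment of the boundaries (a departure at exactly 8 p.m. counts as night, one at exactly 5 a.m. does not) is also an assumption, and the weekend and distance filters are not shown.

```python
from collections import defaultdict


def derive_record(month, arr_delay_min, dep_hhmm, distance_miles):
    """Build one analysis record from hypothetical raw fields."""
    late = 1 if arr_delay_min > 15 else 0   # late arrival: more than 15 min
    # night: take-off between 8 p.m. (2000) and 5 a.m. (0500), boundaries assumed
    night = 1 if (dep_hhmm >= 2000 or dep_hhmm < 500) else 0
    return {"month": month, "late": late, "night": night, "distance": distance_miles}


def monthly_batches(records):
    """Partition records into data blocks by month, in calendar order."""
    batches = defaultdict(list)
    for r in records:
        batches[r["month"]].append(r)
    return [batches[m] for m in sorted(batches)]
```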

4.3.2. Fitting in an Online Manner Using Various Approaches

We first apply the conservative estimation approach to these streaming data batches; in other words, we fit a series of logistic models to the monthly data batches separately. In Figure 4, we show the trace plots of the regression coefficients for distance and night via dotted lines. The trend for the coefficients of distance changes noticeably around month 6, while the trend for the coefficients of night changes markedly around month 10. These observations indicate the potential existence of heterogeneity coming from dynamic effects. The estimation results of the conservative estimator are considered as a standard reference since each estimate is based solely on the current data batch and should be consistent. However, as demonstrated by the dotted line in Figure 4, this method exhibits large fluctuations, which is consistent with the phenomena observed in the simulation experiments in Section 4.2.
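The conservative approach, fitting each batch on its own, can be sketched with a plain Newton-Raphson iteration for the logistic model. This is a minimal illustration under the assumption that each batch is given as a pair `(X, y)` with an intercept column already included; the function names are hypothetical and no safeguards such as step halving are included.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Maximum likelihood fit of a logistic model by Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        score = X.T @ (y - p)                           # first derivative
        info = (X * (p * (1.0 - p))[:, None]).T @ X     # negative second derivative
        step = np.linalg.solve(info, score)             # Newton direction
        theta = theta + step
        if np.max(np.abs(step)) < tol:                  # stop once the update is tiny
            break
    return theta


def conservative_estimates(batches):
    """One separate fit per batch; no information is shared across batches."""
    return [fit_logistic(X, y) for X, y in batches]
```

Because every fit uses only its own batch, the resulting trace of estimates is free of bias from earlier coefficient values but inherits the full sampling variability of a single month of data.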

Subsequently, we apply the online estimator of Luo and Song [2], which has also been used in the simulation subsection and does not account for potential heterogeneity. As seen in Figure 4 (dashed lines), the estimates obtained from the Online method fluctuate less than those from the Online-Conser method, primarily because previous information is taken into account. Nevertheless, the estimated regression coefficient curves of the Online method differ markedly from those of the Online-Conser method. This discrepancy suggests that the estimates provided by the Online method could be biased, and that the Online method cannot accurately detect potential changes in regression coefficients.

Alternatively, we apply our proposed online estimation method to adaptively update estimates for regression coefficients. Trace plots of regression coefficients for different covariates have also been presented in Figure 4 using solid lines. Our proposed method successfully identifies the abrupt changes in distance and night that were not observed in the results from the Online-Conser method. In this graph, we have marked the locations where heterogeneity was identified with red plus signs. These marks correspond to non-zero elements in the estimates of the nuisance parameters provided by the penalty method. For example, the break of night indicates that after month 10, there is a larger negative effect on flights that take off at night. The 95% pointwise confidence intervals based on bootstrapping for our proposed Online-Hetero estimator are plotted in Figure 5. These results highlight the value of identifying dynamic regression coefficients to produce more convincing estimation results in the streaming data environment, which is becoming increasingly prevalent in practice.

5. Discussion

We present a novel online estimation procedure to analyze the streaming datasets, and the underlying model structures are assumed to be the generalized linear models with dynamic regression coefficients. Based on detailed derivations, we have demonstrated that our proposed method can not only adapt to the online updating framework but also avoid undesirable estimation biases coming from the potential changes in the model parameters of interest. From the numerical experiments of our simulations and a real-world case in the previous Section 4, we observe the crucial importance of identifying potential heterogeneity. Ignoring the changes in parameters can lead to significant biases, thereby losing important information contained in the data. The proposed online estimation procedure holds significant potential for real-world applications, particularly in the era of big data. It is especially relevant for scenarios where data is constantly incoming in a streaming manner, such as financial markets, social media analytics, and real-time health monitoring systems. By addressing potential heterogeneity and dynamically adjusting to changes in parameters, our method can provide more accurate insights. This could lead to improved decision-making and statistical accuracy in these fields.

Although there are many advantages, we admit that there still exist important issues that need to be solved in the future. First, we only focus on the simplest scenario of observations and one future direction would be to extend the current methods to analyze other types of data in a heterogeneous streaming data environment, including but not limited to clustered data [9] and survival data [32]. Second, we have assumed that the heterogeneity only comes from the changes in batch-specific regression coefficients for simplicity, while the model structure, that is, the generalized linear model, remains the same across all data batches. In an existing paper [18], an online estimation method that combines a linear state–space model and Kalman recursive technique has been proposed. However, a similar issue is encountered in their approach. More general heterogeneity needs to be considered, such as changes in the model structures. In addition, we only suggest using the bootstrap technique in the empirical analysis instead of deriving the asymptotic distributions with explicit formula for standard errors. More theoretical derivations need to be conducted to facilitate such an important step to further speed the computation. We will investigate these non-trivial issues in future projects.

Author Contributions

Conceptualization, J.W., J.D. and S.L.; Methodology, J.D. and J.W.; Formal analysis, J.D., J.W. and J.Y.; Investigation, J.D., J.W., S.L. and X.C.; Writing (original draft preparation), J.W., J.D. and S.L.; Supervision, S.L., J.Y. and X.C.; Funding acquisition, S.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors acknowledge the efforts in the creation of the database of the airline on-time statistics, which is available from the website at https://zenodo.org/record/1246060 (accessed on 25 October 2023).

Acknowledgments

The authors are grateful to the Joint Editor, Associate Editor, and reviewers for their constructive suggestions and insightful comments.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GLM	Generalized linear model

Appendix A. Supplementary Numerical Results

This appendix contains some additional numerical results referenced in Section 4 of the main text. Boxplots of the estimated regression coefficients for Case 2 with the sample size 400 and for both Case 1 and Case 2 with the sample size 800 are shown in Figure A1, Figure A2, and Figure A3, respectively. In Figure A4, we show the boxplots of estimated regression coefficients at the terminal point for Case 4.

Figure A1. Boxplots for regression coefficients for Case 2 with batch size 400.

Figure A2. Boxplots for regression coefficients for Case 1 with batch size 800.

Figure A3. Boxplots for regression coefficients for Case 2 with batch size 800.

Figure A4. Boxplots for regression coefficients at the terminal point B for Case 4.

References

1. Wang, C.; Chen, M.H.; Schifano, E.; Wu, J.; Yan, J. Statistical methods and computing for big data. Stat. Its Interface 2016, 9, 399–414.
2. Luo, L.; Song, P.X.K. Renewable estimation and incremental inference in generalized linear models with streaming data sets. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2020, 82, 69–97.
3. McCullagh, P.; Nelder, J.A. Generalized Linear Models; Routledge: New York, NY, USA, 2019.
4. Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407.
5. Toulis, P.; Airoldi, E.M. Scalable estimation strategies based on stochastic approximations: Classical results and new insights. Stat. Comput. 2015, 25, 781–795.
6. Toulis, P.; Airoldi, E.M. Asymptotic and finite-sample properties of estimators based on stochastic gradients. Ann. Stat. 2017, 45, 1694–1727.
7. Fang, Y. Scalable statistical inference for averaged implicit stochastic gradient descent. Scand. J. Stat. 2019, 46, 987–1002.
8. Schifano, E.D.; Wu, J.; Wang, C.; Yan, J.; Chen, M.H. Online updating of statistical inference in the big data setting. Technometrics 2016, 58, 393–403.
9. Luo, L.; Zhou, L.; Song, P.X.K. Real-time regression analysis of streaming clustered data with possible abnormal data batches. J. Am. Stat. Assoc. 2022, 543, 2029–2044.
10. Wang, K.; Wang, H.; Li, S. Renewable quantile regression for streaming datasets. Knowl. Based Syst. 2022, 235, 107675.
11. Jiang, R.; Yu, K. Renewable quantile regression for streaming data sets. Neurocomputing 2022, 508, 208–224.
12. Sun, X.; Wang, H.; Cai, C.; Yao, M.; Wang, K. Online renewable smooth quantile regression. Comput. Stat. Data Anal. 2023, 185, 107781.
13. Wang, T.; Zhang, H.; Sun, L. Renewable learning for multiplicative regression with streaming datasets. Comput. Stat. 2023, 1–28.
14. Ma, X.; Lin, L.; Gai, Y. A general framework of online updating variable selection for generalized linear models with streaming datasets. J. Stat. Comput. Simul. 2023, 93, 325–340.
15. Hector, E.C.; Luo, L.; Song, P.X.K. Parallel-and-stream accelerator for computationally fast supervised learning. Comput. Stat. Data Anal. 2023, 177, 107587.
16. Han, R.; Luo, L.; Lin, Y.; Huang, J. Online inference with debiased stochastic gradient descent. Biometrika 2023, asad046.
17. Luo, L.; Wang, J.; Hector, E.C. Statistical inference for streamed longitudinal data. arXiv 2022, arXiv:2208.02890.
18. Luo, L.; Song, P.X.K. Multivariate online regression analysis with heterogeneous streaming data. Can. J. Stat. 2023, 51, 111–133.
19. Klein, L. A Textbook of Econometrics; Prentice-Hall: Upper Saddle River, NJ, USA, 1953.
20. Hsiao, C. Analysis of Panel Data; Cambridge University Press: New York, NY, USA, 1986.
21. Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 1994.
22. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
23. Fan, J.; Li, R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
24. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
25. Wang, H.; Leng, C. Unified LASSO estimation by least squares approximation. J. Am. Stat. Assoc. 2007, 102, 1039–1048.
26. Wang, H.; Li, B.; Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2009, 71, 671–683.
27. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
28. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
29. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1996, 58, 267–288.
30. Zhang, H.H.; Lu, W. Adaptive Lasso for Cox’s proportional hazards model. Biometrika 2007, 94, 691–703.
31. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22.
32. Cox, D.R. Regression models and life tables (with discussion). J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1972, 34, 187–202.

Figure 1. Comparison of mechanisms for batch processing and streaming processing.

Figure 2. Boxplots for regression coefficients for Case 1 with batch size 400.

Figure 3. Boxplots for regression coefficients at the terminal point B for Case 3.

Figure 4. Trace plots of regression coefficients for different methods.

Figure 5. The 95% bootstrap pointwise confidence intervals for proposed method.

Table 1. Results for sparsity evaluation of heterogeneous degrees for Case 1.

									Frequency
	$n_{b} = 400$				$n_{b} = 800$				$n_{b} = 400$			$n_{b} = 800$
Batch	MSE	U%	O%	E%	MSE	U%	O%	E%	$δ_{b 1}$	$δ_{b 2}$	$δ_{b 3}$	$δ_{b 1}$	$δ_{b 2}$	$δ_{b 3}$
2	0.008	0.0	1.6	98.4	0.005	0.0	2.0	98.0	2	4	3	1	3	6
3	0.005	0.0	1.8	98.2	0.001	0.0	0.8	99.2	4	2	3	0	2	2
4	0.004	0.0	1.8	98.2	0.002	0.0	1.4	98.6	2	6	1	1	2	4
5	0.004	0.0	1.8	98.2	0.003	0.0	1.2	98.8	2	5	2	1	0	6
6	0.022	0.0	1.2	98.8	0.014	0.0	1.4	98.6	500	4	2	500	1	6
7	0.006	0.0	3.2	96.8	0.006	0.0	3.6	96.4	10	3	3	10	2	7
8	0.001	0.0	0.6	99.4	0.003	0.0	3.0	97.0	2	1	1	10	2	4
9	0.005	0.0	2.0	98.0	0.001	0.0	1.8	98.2	7	3	1	7	2	1
10	0.007	0.0	1.8	98.2	0.002	0.0	1.6	98.4	5	2	5	3	3	3
11	0.032	0.0	2.0	98.0	0.013	0.0	1.4	98.6	5	500	6	6	500	1
12	0.009	0.0	4.4	95.6	0.003	0.0	3.8	96.2	4	15	3	1	16	2
13	0.005	0.0	3.0	97.0	0.002	0.0	1.6	98.4	6	8	1	3	3	2
14	0.006	0.0	3.0	97.0	0.003	0.0	2.8	97.2	5	6	5	3	9	2
15	0.006	0.0	2.6	97.4	0.000	0.0	0.2	99.8	4	4	6	0	1	0

Notes: MSE, the average of mean squared errors; U, O, and E: the proportion of times (in percent) of the method underselecting, overselecting, and exactly selecting the parameters with nonzero values; columns $δ_{b 1}$ to $δ_{b 3}$ summarize the frequency of nonzero estimates for each coefficient.

Table 2. Results for sparsity evaluation of heterogeneous degrees for Case 2.

									Frequency
	$n_{b} = 400$				$n_{b} = 800$				$n_{b} = 400$			$n_{b} = 800$
Batch	MSE	U%	O%	E%	MSE	U%	O%	E%	$δ_{b 1}$	$δ_{b 2}$	$δ_{b 3}$	$δ_{b 1}$	$δ_{b 2}$	$δ_{b 3}$
2	0.010	0.0	3.2	96.8	0.001	0.0	0.8	99.2	8	3	5	3	0	1
3	0.005	0.0	1.6	98.4	0.001	0.0	1.2	98.8	3	2	3	4	1	1
4	0.006	0.0	2.2	97.8	0.001	0.0	0.8	99.2	2	4	5	4	0	1
5	0.002	0.0	1.0	99.0	0.002	0.0	1.4	98.6	2	1	2	2	3	3
6	0.040	0.0	4.8	95.2	0.012	0.0	1.8	98.2	500	6	18	500	4	5
7	0.011	0.0	6.2	93.8	0.002	0.0	2.2	97.8	19	6	9	7	2	2
8	0.005	0.0	2.8	97.2	0.002	0.0	1.6	98.4	10	4	1	3	2	3
9	0.005	0.0	2.8	97.2	0.003	0.0	2.4	97.6	4	5	6	7	0	5
10	0.005	0.0	2.6	97.4	0.002	0.0	2.2	97.8	8	4	3	2	4	5
11	0.026	0.0	3.8	96.2	0.009	0.0	1.2	98.8	10	500	9	3	500	3
12	0.007	0.0	5.4	94.6	0.003	0.0	2.8	97.2	2	22	4	3	8	4
13	0.004	0.0	2.6	97.4	0.002	0.0	2.6	97.4	4	10	2	3	7	4
14	0.003	0.0	2.0	98.0	0.001	0.0	1.2	98.8	1	4	5	1	5	0
15	0.003	0.0	2.0	98.0	0.001	0.0	1.4	98.6	3	4	3	0	5	2

Notes: The interpretations for MSE, U, O, E, and $δ_{b 1}$ to $δ_{b 3}$ are the same as for Table 1.

Table 3. Results for sparsity evaluation of heterogeneous degrees for Cases 3 and 4.

					Frequency
Batch	MSE	U%	O%	E%	$δ_{b 1}$	$δ_{b 2}$	$δ_{b 3}$
Results for Case 3
6	0.013	0.0	1.2	98.8	500	2	4
7	0.004	0.0	4.0	96.0	10	5	5
11	0.010	0.0	1.8	98.2	6	500	3
12	0.002	0.0	3.0	97.0	4	9	2
16	0.010	0.0	2.4	97.6	500	9	3
17	0.003	0.0	3.6	96.4	14	2	2
21	0.015	0.0	1.4	98.6	5	500	2
22	0.005	0.0	5.0	95.0	5	19	2
Results for Case 4
11	0.012	0.0	1.4	98.6	500	1	6
12	0.005	0.0	4.4	95.6	9	4	9
21	0.009	0.0	1.4	98.6	3	500	5
22	0.002	0.0	2.6	97.4	4	7	2
31	0.009	0.0	1.0	99.0	500	4	1
32	0.003	0.0	3.4	96.6	8	4	5
41	0.016	0.0	2.0	98.0	6	500	5
42	0.005	0.0	3.8	96.2	3	11	5

Notes: The interpretations for MSE, U, O, E, and $δ_{b 1}$ to $δ_{b 3}$ are the same as for Table 1.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
