Abstract
In this paper, a size-biased Lindley (SBL) first-order autoregressive (AR(1)) process is proposed, the so-called SBL-AR(1). Some probabilistic and statistical properties of the proposed process are determined, including the distribution of its innovation process, the Laplace transform, multi-step-ahead conditional measures, autocorrelation, and the spectral density function. In addition, the unknown parameters of the model are estimated via the conditional least squares and Gaussian estimation methods. The performance and behavior of the estimators are checked through a Monte Carlo simulation study. Additionally, two real-world datasets are utilized to examine the model’s applicability, and goodness-of-fit statistics are used to compare it to several pertinent non-Gaussian AR(1) models. The findings reveal that the proposed SBL-AR(1) model exhibits key theoretical properties, including a closed-form innovation distribution, multi-step conditional measures, and an exponentially decaying autocorrelation structure. Parameter estimation via the conditional least squares and Gaussian methods demonstrates consistency and efficiency in simulations. Real-world applications to inflation expectations and water quality data demonstrate a superior fit over competing non-Gaussian AR(1) models, evidenced by lower values of the AIC and BIC statistics. Forecasting comparisons show that the classical conditional expectation method achieves accuracy comparable to some modern machine learning techniques, underscoring its practical utility for skewed and fat-tailed time series.
Keywords: time series; non-Gaussian AR(1); size-biased Lindley distribution; estimation; simulation; forecasting
MSC: 62M10; 62M15
1. Introduction
Continuous-valued time series, in which realizations are continuously recorded over time, are useful in many domains, including engineering, economics, finance, and the natural sciences. In particular, they are employed in stock market analysis, scientific research, medical studies, economic forecasting, and weather forecasting. However, traditional time series analysis often assumes Gaussian-distributed marginals, which fail to capture the main features of real-world data, such as skewness, fat tails, positivity, and size-biased sampling (e.g., environmental, economic, and biomedical data). For instance, Gaussian models cannot readily accommodate strictly positive measurements like water turbidity or inflation rates, nor can they represent the high kurtosis observed in phenomena such as financial volatility. These gaps limit their applicability and forecasting accuracy in non-Gaussian contexts. To address these limitations, several non-Gaussian AR(1) models have been proposed, highlighting some key features, e.g., skewness, fat tails, and positivity; among them are the gamma distribution (Gaver and Lewis [1]), Weibull and gamma (Sim [2]), exponential (Mališić [3]), inverse Gaussian (Abraham and Balakrishna [4]), normal-Laplace (Jose et al. [5]), approximated beta (Popović [6]), Lindley (Bakouch and Popović [7]), double Lindley (Nitha and Krishnarani [8]), gamma-Lindley (Mello et al. [9]), logistic (Jilesh and Jayakumar [10]), and exponential-Gaussian (Nitha and Krishnarani [11]).
From the distribution theory perspective, size-biased distributions, a special class of weighted distributions, emerge naturally in scenarios where observations are recorded with probabilities proportional to their inherent size or magnitude, a common feature in ecological surveys (e.g., oversampling large organisms), econometric data (e.g., prioritizing high-value transactions), and biomedical studies (e.g., detecting severe disease cases). These distributions address the unequal detection probabilities inherent to real-world data collection, where larger or more prominent units (subjects) are systematically overrepresented. Their theoretical formulation weights the original probability density function $f(x)$ by the weight function x, giving the size-biased density
$$f_w(x) = \frac{x f(x)}{E[X]},$$
where the expected value $E[X]$ of the random variable X acts as a normalizing factor; this enables accurate modeling of such biased sampling mechanisms. Pioneered by Patil and Rao [12], size-biased distributions have been extensively applied across environmental science, forestry, and social science, as demonstrated by Scheaffer [13] in wildlife population studies, Singh and Maddala [14] in econometric inequality analysis, and Drummer and McDonald [15] in ecological sampling. Beyond applied fields, size-biasing plays a pivotal role in statistical estimation, renewal theory, and distributional infinite divisibility. Despite their broad utility, their integration into non-Gaussian time series models remains limited, creating a critical gap in analyzing temporally dependent data subject to size-based sampling biases. This omission compromises parameter estimation and forecasting in fields like epidemiology (e.g., disease case reporting) or hydrology (e.g., extreme event monitoring), where detection probabilities inherently correlate with observation magnitude.
To contribute to this gap, we propose the size-biased Lindley AR(1) (SBL-AR(1)) process by using the size-biased Lindley (SBL) distribution introduced by Ayesha [16]. The SBL distribution enhances the classical (non-Gaussian) Lindley distribution by incorporating size-biased sampling. Compared to other non-Gaussian distributions commonly used in time series models, such as the double Lindley, gamma-Lindley, and logistic, the size-biased Lindley distribution offers additional flexibility in capturing skewed and fat-tailed behaviors while inherently addressing size-biased sampling effects. This makes it particularly advantageous in applications. The probability density function (PDF), denoted as $f(x)$, and the cumulative distribution function (CDF), denoted as $F(x)$, of the SBL distribution are, respectively, given by
$$f(x) = \frac{\theta^{3}}{\theta+2}\,x(1+x)e^{-\theta x}, \quad x>0,$$
and
$$F(x) = 1-\left(1+\theta x+\frac{\theta^{2} x^{2}}{\theta+2}\right)e^{-\theta x}, \quad x>0,$$
where $\theta>0$ is a scale parameter of the SBL distribution.
The first two raw moments and the variance of the SBL distribution are, respectively, expressed as follows:
$$E[X]=\frac{2(\theta+3)}{\theta(\theta+2)}, \qquad E[X^{2}]=\frac{6(\theta+4)}{\theta^{2}(\theta+2)}, \qquad \mathrm{Var}(X)=\frac{2(\theta^{2}+6\theta+6)}{\theta^{2}(\theta+2)^{2}}.$$
For more details about the SBL distribution, see [16].
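As a quick sanity check, the closed-form quantities of the SBL law can be verified numerically. The sketch below (plain Python, assuming the standard size-biased Lindley density $\theta^3 x(1+x)e^{-\theta x}/(\theta+2)$ and the moment expressions $E[X]=2(\theta+3)/(\theta(\theta+2))$ and $E[X^2]=6(\theta+4)/(\theta^2(\theta+2))$) integrates the density on a truncated grid and compares the results with the closed forms.

```python
import math

def sbl_pdf(x, theta):
    # Size-biased Lindley density: theta^3 * x * (1 + x) * exp(-theta*x) / (theta + 2)
    return theta**3 * x * (1.0 + x) * math.exp(-theta * x) / (theta + 2.0)

def trapz(f, a, b, n):
    # Composite trapezoidal rule on [a, b] with n subintervals
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

theta = 1.5
upper = 60.0  # the exponential tail is negligible beyond this point
total = trapz(lambda x: sbl_pdf(x, theta), 0.0, upper, 50000)
mean  = trapz(lambda x: x * sbl_pdf(x, theta), 0.0, upper, 50000)
m2    = trapz(lambda x: x * x * sbl_pdf(x, theta), 0.0, upper, 50000)

# Closed forms for comparison
mean_cf = 2 * (theta + 3) / (theta * (theta + 2))
m2_cf   = 6 * (theta + 4) / (theta**2 * (theta + 2))
print(total, mean, mean_cf, m2, m2_cf)
```

The printed totals should agree to several decimal places, confirming that the density is proper and the moment formulas are consistent.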
The limitations of traditional Gaussian time series models in capturing non-Gaussian features, stated previously, motivate the proposed SBL-AR(1) process, which integrates SBL marginals to flexibly model positive-valued data with temporal dependence. The SBL-AR(1) addresses critical gaps in non-Gaussian autoregressive models through two key motivations: (1) SBL marginals, which enhance the classical Lindley distribution by weighting observations proportionally to their size, improving the fit for data where larger values are oversampled (e.g., economic datasets), and (2) a closed-form innovation mixture combining a Dirac delta and generalized gamma-exponential distributions, enabling precise modeling of excess zeros and continuous positive values. Parameter estimation via conditional least squares (CLS) and Gaussian estimation (GE) is validated through simulations demonstrating estimator consistency, while real-world applications, including inflation expectations and water turbidity monitoring, showcase a superior performance over competing models (e.g., Lindley, gamma, and inverse Gaussian AR(1)) via the AIC/BIC criteria. By balancing theoretical findings (Laplace transform, multi-step-ahead conditional measures, autocorrelation decay, spectral density) with practical utility (accurate forecasting of skewed/fat-tailed data), the SBL-AR(1) establishes a flexible model for analyzing non-Gaussian time series prevalent in environmental and economic domains.
The remainder of this paper is organized as follows: In Section 2, a first-order autoregressive process with SBL marginals is constructed, and the distribution of the innovation process is derived. Section 3 investigates some structural properties of the proposed SBL-AR(1) process, including the multi-step conditional Laplace transform, conditional variance, conditional mean, autocorrelation function, and spectral density function. In Section 4, we utilize the conditional least squares and Gaussian estimation techniques to estimate the parameters of the proposed process, and the performance of the estimators is assessed via a simulation study. Section 5 discusses the application of the model using two real-life datasets. Section 6 addresses forecasting of the data under the AR model, comparing the classical statistical method with some machine learning methods in terms of predictive ability. In Section 7, the conclusion provides suggestions for future research directions aligned with the proposed framework.
2. SBL-AR(1): Model Construction and Innovation Distribution
In this section, a first-order stationary autoregressive process with SBL marginals, denoted as SBL-AR(1), is presented, and then we obtain the distribution of the innovation term.
Suppose that $\{X_t\}$ is a stochastic process defined as follows:
$$X_t = \alpha X_{t-1} + \varepsilon_t, \quad t \ge 1, \qquad (2)$$
where $\{X_t\}$ is a stationary process with SBL($\theta$) marginals, $0 \le \alpha < 1$, and $\{\varepsilon_t\}$ is a sequence of independent and identically distributed (i.i.d.) random variables independent of $X_s$ for all $s < t$. The definition of the SBL-AR(1) model indicates that it is a first-order Markovian process.
Before investigating further features of the SBL-AR(1) model, it is worth defining generalized mixture distributions, as in the next definition.
Definition 1.
Let $F$ be a distribution function. $F$ is said to be a generalized mixture of the distribution functions $F_1, F_2, \ldots, F_n$ if
$$F(t) = \sum_{i=1}^{n} w_i F_i(t)$$
for all t, where the weights $w_i$, $i=1,\ldots,n$, are real numbers satisfying $\sum_{i=1}^{n} w_i = 1$, with $w_i < 0$ allowed for some indices i.
The following proposition gives the definition of the PDF $g(x)$, which will be used later to find the innovation distribution. It is important to note that $g(x)$ is a proper PDF for all admissible values of its parameters.
Proposition 1.
If , , and , then the generalized mixture
is a PDF.
Proof.
Equation (3) combines exponential and gamma densities, which are non-negative, and represents a mixture of exponential , gamma , gamma , and exponential , respectively. Therefore, it can be concluded that
We still need to verify that for . Equation (3) can be reformulated as follows:
where
As , we conclude that
Additionally, we have that
and for ,
From Equations (6)–(8), is a sum of non-negative terms due to and derivative , ensuring that , hence for , which completes the proof.
That is, represents a generalized mixture of the exponential (), gamma (2,), gamma (3,), and exponential () distributions such that the sum of weights in is equal to 1. □
The distribution of the innovation sequence plays a crucial role in the practical applications and further studies of this process. One frequently used technique to determine the innovation sequence involves the Laplace transform function. The subsequent theorem presents the distribution of the innovation random variable (rv) $\varepsilon_t$. Let $\Phi_X(s)$ and $\Phi_{\varepsilon}(s)$ represent the Laplace transforms (LTs) of the random variables $X_t$ and $\varepsilon_t$, respectively. The LT of the SBL rv X can be expressed as
$$\Phi_X(s) = E\left[e^{-sX}\right] = \frac{\theta^{3}(s+\theta+2)}{(\theta+2)(s+\theta)^{3}}, \quad s \ge 0. \qquad (9)$$
Theorem 1.
Assume that SBL($\theta$) represents the marginal distribution of the stochastic process given by Equation (2). Consequently, the distribution of the innovation sequence $\{\varepsilon_t\}$ is a mixture of singular and absolutely continuous distributions, expressed as follows:
$$f_{\varepsilon}(x) = \alpha^{2}\,\delta(x) + \left(1-\alpha^{2}\right)g(x),$$
where $\delta(x)$ is the Dirac delta function, defined as
$$\delta(x)=\begin{cases}+\infty, & x=0,\\ 0, & x\neq 0,\end{cases}\qquad \int_{-\infty}^{\infty}\delta(x)\,dx=1,$$
and $g(x)$ is given by Equation (3).
Proof.
As the process is stationary, the LT of Equation (2) can be expressed as
$$\Phi_X(s) = \Phi_X(\alpha s)\,\Phi_{\varepsilon}(s);$$
consequently, the LT of the innovation rv is given by
$$\Phi_{\varepsilon}(s) = \frac{\Phi_X(s)}{\Phi_X(\alpha s)}. \qquad (10)$$
By utilizing Equation (9), Equation (10) is represented as
$$\Phi_{\varepsilon}(s) = \frac{(s+\theta+2)(\alpha s+\theta)^{3}}{(s+\theta)^{3}(\alpha s+\theta+2)}. \qquad (11)$$
By applying partial fraction decomposition, the preceding equation can be expressed as
where
and
It can be seen that
Based on Equations (12)–(15) and the properties of inverse LTs, we conclude that the distribution of the innovation sequence $\{\varepsilon_t\}$ is composed of a discrete component at 0, with probability $\alpha^{2}$, and a generalized mixture of exponential ($\theta$), gamma (3, $\theta$), gamma (2, $\theta$), and exponential ($(\theta+2)/\alpha$) distributions, with probability $1-\alpha^{2}$. □
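The structure of the theorem can be checked numerically from the Laplace transforms alone. The sketch below (plain Python; the transform expressions are reconstructions from the stationarity identity $\Phi_X(s)=\Phi_X(\alpha s)\Phi_\varepsilon(s)$ with the SBL transform $\theta^3(s+\theta+2)/((\theta+2)(s+\theta)^3)$) confirms that the innovation LT equals 1 at $s=0$ (a proper law) and tends to $\alpha^2$ as $s\to\infty$, which is precisely the mass of the Dirac component at zero.

```python
def lt_sbl(s, theta):
    # Laplace transform of the SBL(theta) law
    return theta**3 * (s + theta + 2.0) / ((theta + 2.0) * (s + theta)**3)

def lt_innovation(s, theta, alpha):
    # Phi_eps(s) = Phi_X(s) / Phi_X(alpha*s), from stationarity of X_t = alpha*X_{t-1} + eps_t
    return lt_sbl(s, theta) / lt_sbl(alpha * s, theta)

theta, alpha = 1.5, 0.5
print(lt_innovation(0.0, theta, alpha))  # equals 1: eps is a proper random variable
print(lt_innovation(1e8, theta, alpha))  # approaches alpha^2: the atom at zero
```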
Figure 1 depicts the distribution of the innovation process through its density curves. From Figure 1a, it is clear that, as $\alpha$ increases in the interval (0, 0.5), the values of the innovation density increase, while for $\alpha > 0.5$ they decrease. Furthermore, the plots in Figure 1b,c indicate that smaller $\theta$ values lead to distributions with heavier tails. In summary, the innovation density is both uni-modal and right-skewed.
Figure 1.
Density curves for the innovation process $\varepsilon_t$: (a) $\theta$ = 1.5, (b) $\alpha$ = 0.97, (c) $\alpha$ = 0.2.
As a result of the previously mentioned theorem about the distribution of the innovation term, the stationary process in Equation (2) can be reformulated as follows.
Definition 2.
The SBL-AR(1) process in Equation (2) is restated as
$$X_t = \alpha X_{t-1} + \varepsilon_t, \qquad \varepsilon_t = \begin{cases}0, & \text{with probability } \alpha^{2},\\ W_t, & \text{with probability } 1-\alpha^{2},\end{cases} \qquad (16)$$
or, in other terms,
$$X_t = \alpha X_{t-1} + I_t W_t,$$
where $W_t$ has the PDF $g$ given by Equation (3) and $I_t$ is an indicator variable such that $P(I_t = 1) = 1 - P(I_t = 0) = 1-\alpha^{2}$.
Figure 2 depicts sample paths from the process in Equation (16) for different values of the parameters $\theta$ and $\alpha$. We generated 200 observations from the SBL-AR(1) process by setting $\theta$ = 1.6, 2, and 2.5 and $\alpha$ = 0.2, 0.5, and 0.7. Throughout these plots, the SBL-AR(1) behavior is investigated, and Figure 2 points out that the simulated series are stationary and take positive values.
Figure 2.
The sampling paths of the SBL-AR(1) process for (A) , , (B) , , (C) , , (D) , .
3. Structural Properties Associated with the SBL-AR(1) Model
This section develops the SBL-AR(1) model’s conditional mean and variance, along with its multi-step-ahead conditional Laplace transform. The mean and variance are required to establish the SBL-AR(1) prediction equations, while the Laplace transform provides insight into the joint distribution of vectors produced by the process. The conditional statistical measures of the SBL-AR(1) process are obtained using the same methodology as outlined by Bakouch and Popović [7].
3.1. Some Statistical Conditional Measures
The one- and multi-step-ahead conditional mean and variance of the SBL-AR(1) process are based on the mean and variance of the random variable $X_t$ and the innovation term $\varepsilon_t$. Thus, we first compute the mean and variance for $X_t$ and $\varepsilon_t$, respectively, as follows. According to the definition of the SBL-AR(1) process outlined in Equation (16), the random variable $X_t$ has the following mean and variance, respectively:
$$E[X_t]=\mu_X=\frac{2(\theta+3)}{\theta(\theta+2)}, \qquad \mathrm{Var}(X_t)=\sigma_X^{2}=\frac{2(\theta^{2}+6\theta+6)}{\theta^{2}(\theta+2)^{2}}.$$
Consequently, the mean and variance of the innovation process are, respectively, stated as
$$E[\varepsilon_t]=(1-\alpha)\mu_X, \qquad \mathrm{Var}(\varepsilon_t)=\left(1-\alpha^{2}\right)\sigma_X^{2}.$$
By utilizing these properties, the one-step-ahead conditional mean for the process in Equation (16) can be formulated as
$$E[X_{t+1}\mid X_t] = \alpha X_t + (1-\alpha)\mu_X. \qquad (22)$$
Consequently, the formula for the $(k+1)$-step-ahead conditional mean is expressed as
$$E[X_{t+k+1}\mid X_t] = \alpha^{k+1} X_t + \left(1-\alpha^{k+1}\right)\mu_X. \qquad (23)$$
When $k \to \infty$, the previously mentioned expression tends to the unconditional mean of the main process as follows:
$$\lim_{k\to\infty} E[X_{t+k+1}\mid X_t] = \mu_X = \frac{2(\theta+3)}{\theta(\theta+2)}.$$
The expressions for the one-step- and $(k+1)$-step-ahead conditional variance of the proposed model, respectively, are obtained as
$$\mathrm{Var}(X_{t+1}\mid X_t) = \left(1-\alpha^{2}\right)\sigma_X^{2} \qquad (24)$$
and
$$\mathrm{Var}(X_{t+k+1}\mid X_t) = \left(1-\alpha^{2(k+1)}\right)\sigma_X^{2}. \qquad (25)$$
Observe that when $k \to \infty$ in Equation (25), the unconditional variance of the main process is obtained as
$$\lim_{k\to\infty}\mathrm{Var}(X_{t+k+1}\mid X_t) = \sigma_X^{2} = \frac{2(\theta^{2}+6\theta+6)}{\theta^{2}(\theta+2)^{2}}.$$
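The convergence of the multi-step conditional mean to the unconditional mean can be illustrated in a few lines. This is a minimal sketch in Python, assuming the AR(1) conditional-mean recursion $E[X_{t+k}\mid X_t]=\alpha^k X_t+(1-\alpha^k)\mu_X$ with $\mu_X=2(\theta+3)/(\theta(\theta+2))$.

```python
def sbl_mean(theta):
    # Unconditional SBL mean: 2(theta + 3) / (theta (theta + 2))
    return 2.0 * (theta + 3.0) / (theta * (theta + 2.0))

def cond_mean(x_t, k, theta, alpha):
    # k-step-ahead conditional mean of the stationary AR(1):
    # E[X_{t+k} | X_t] = alpha^k * X_t + (1 - alpha^k) * mu_X
    mu = sbl_mean(theta)
    return alpha**k * x_t + (1.0 - alpha**k) * mu

theta, alpha, x_t = 1.5, 0.7, 5.0
for k in (1, 5, 50):
    # The forecasts decay geometrically from x_t toward the stationary mean
    print(k, cond_mean(x_t, k, theta, alpha))
```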
The multi-step-ahead conditional LT of the SBL-AR(1) process is obtained as
$$E\left[e^{-sX_{t+k}}\mid X_t\right] = e^{-s\alpha^{k} X_t}\prod_{j=0}^{k-1}\Phi_{\varepsilon}\left(\alpha^{j} s\right).$$
3.2. Joint Distribution, Autocorrelation, and Spectral Density Function
The joint LT of $(X_t, X_{t+1})$ is expressed as follows:
$$\Phi_{X_t,X_{t+1}}(s_1,s_2) = E\left[e^{-s_1X_t-s_2X_{t+1}}\right] = \Phi_X(s_1+\alpha s_2)\,\Phi_{\varepsilon}(s_2).$$
The SBL-AR(1) process is not time-reversible because the joint LT is not symmetric in $s_1$ and $s_2$.
Following a few basic calculations, the autocovariance and autocorrelation functions at lag k of the proposed process are, respectively, given by
$$\gamma(k) = \mathrm{Cov}(X_t, X_{t+k}) = \alpha^{k}\sigma_X^{2}, \qquad \rho(k) = \alpha^{k}, \quad k \ge 0.$$
The spectral density function of a stationary process is simply defined as the Fourier transform of the absolutely summable autocovariance function. Hence, the spectral density function of the SBL-AR(1) process is obtained as follows:
$$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}\gamma(k)e^{-i\omega k} = \frac{\sigma_X^{2}}{2\pi}\left(1+2\sum_{k=1}^{\infty}\alpha^{k}\cos(\omega k)\right).$$
Given that $\sum_{k=1}^{\infty}\alpha^{k}\cos(\omega k) = \dfrac{\alpha\cos\omega-\alpha^{2}}{1-2\alpha\cos\omega+\alpha^{2}}$, by substituting in the last equation we obtain
$$f(\omega) = \frac{\sigma_X^{2}}{2\pi}\cdot\frac{1-\alpha^{2}}{1-2\alpha\cos\omega+\alpha^{2}}, \qquad \omega\in[-\pi,\pi]. \qquad (27)$$
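As a numerical check of the spectral density's form, the sketch below (plain Python, assuming the standard AR(1) spectral density $f(\omega)=(\sigma_X^2/2\pi)(1-\alpha^2)/(1-2\alpha\cos\omega+\alpha^2)$ with $\sigma_X^2$ the SBL variance) verifies that the density integrates to the process variance over $[-\pi,\pi]$, as any valid spectral density must.

```python
import math

def sbl_var(theta):
    # Unconditional SBL variance: 2(theta^2 + 6 theta + 6) / (theta^2 (theta + 2)^2)
    return 2.0 * (theta**2 + 6.0 * theta + 6.0) / (theta**2 * (theta + 2.0)**2)

def spec_density(w, theta, alpha):
    # AR(1) spectral density with SBL marginal variance
    return sbl_var(theta) * (1.0 - alpha**2) / (
        2.0 * math.pi * (1.0 - 2.0 * alpha * math.cos(w) + alpha**2))

theta, alpha = 1.5, 0.5
n = 20000
h = 2.0 * math.pi / n
# Riemann sum over a full period of a smooth function: very accurate
integral = sum(spec_density(-math.pi + i * h, theta, alpha) for i in range(n)) * h
print(integral, sbl_var(theta))  # the two values agree
```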
The parametric spectral density estimator is obtained by replacing the parameters $\alpha$ and $\theta$ with their corresponding Gaussian estimators $\hat{\alpha}$ and $\hat{\theta}$, which will be discussed later, in the right-hand side of Equation (27).
Figure 3a displays the spectral density estimator of the SBL-AR(1) process given by Equation (27), evaluated at the parameter estimates obtained later in Section 5. Figure 3b depicts the theoretical SBL-AR(1) spectral density at a fixed $\theta$ and different values of $\alpha$; as the value of $\alpha$ increases, the spectral density becomes more leptokurtic. Also, from Figure 3c, it is clear that, at $\alpha$ = 0.5 and various $\theta$ values, the spectral density becomes more platykurtic as $\theta$ increases. Further, the curves in Figure 3a are very similar to those in Figure 3b.
Figure 3.
Spectral density curves of SBL-AR(1) process: (a) , (b) , and (c) .
4. Parameter Estimation and Simulation Studies
This section is devoted to estimating the parameters involved in the process, specifically $\alpha$ and $\theta$. Let $X_1, X_2, \ldots, X_n$ represent a realization from the SBL-AR(1) process. The next subsections discuss the conditional least squares and Gaussian estimation techniques. Additionally, a simulation study is performed.
4.1. Estimation via Conditional Least Squares Procedure
The conditional least squares (CLS) estimators for $\alpha$ and $\theta$, denoted as $\hat{\alpha}_{CLS}$ and $\hat{\theta}_{CLS}$, are derived by minimizing the conditional sum-of-squares function
$$Q(\alpha,\theta) = \sum_{t=2}^{n}\left(X_t - E[X_t\mid X_{t-1}]\right)^{2}.$$
By utilizing Equation (22) for $E[X_t\mid X_{t-1}]$, $Q(\alpha,\theta)$ takes the form
$$Q(\alpha,\theta) = \sum_{t=2}^{n}\left(X_t - \alpha X_{t-1} - (1-\alpha)\frac{2(\theta+3)}{\theta(\theta+2)}\right)^{2}.$$
Consequently, by setting $\mu = \dfrac{2(\theta+3)}{\theta(\theta+2)}$, the previous equation can be rewritten as
$$Q(\alpha,\mu) = \sum_{t=2}^{n}\left(X_t - \alpha X_{t-1} - (1-\alpha)\mu\right)^{2}. \qquad (28)$$
Estimating $\alpha$ and $\mu$ is achieved by solving the normal equations obtained from Equation (28), which are as follows:
$$\hat{\alpha}_{CLS} = \frac{(n-1)\sum_{t=2}^{n}X_tX_{t-1} - \sum_{t=2}^{n}X_t\sum_{t=2}^{n}X_{t-1}}{(n-1)\sum_{t=2}^{n}X_{t-1}^{2} - \left(\sum_{t=2}^{n}X_{t-1}\right)^{2}} \qquad (29)$$
and
$$\hat{\mu} = \frac{\sum_{t=2}^{n}X_t - \hat{\alpha}_{CLS}\sum_{t=2}^{n}X_{t-1}}{(n-1)\left(1-\hat{\alpha}_{CLS}\right)}. \qquad (30)$$
As $\hat{\mu} = \dfrac{2(\hat{\theta}+3)}{\hat{\theta}(\hat{\theta}+2)}$, the estimator for $\theta$ is obtained as
$$\hat{\theta}_{CLS} = \frac{(1-\hat{\mu}) + \sqrt{(1-\hat{\mu})^{2} + 6\hat{\mu}}}{\hat{\mu}},$$
where $\hat{\mu}$ is given by Equation (30).
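For illustration, the closed-form CLS slope estimator can be coded directly; the sketch below (plain Python) applies it to a synthetic AR(1) series. For simplicity, the illustration uses Gaussian innovations rather than the SBL innovation law, since the estimator itself depends only on the linear AR(1) mean structure.

```python
import random

def cls_alpha(x):
    # Closed-form CLS estimator of alpha: least-squares regression of X_t
    # on X_{t-1} with an intercept
    n = len(x) - 1
    y, z = x[1:], x[:-1]
    sy, sz = sum(y), sum(z)
    szz = sum(v * v for v in z)
    syz = sum(a * b for a, b in zip(y, z))
    return (n * syz - sy * sz) / (n * szz - sz * sz)

random.seed(42)
alpha_true, n = 0.6, 10000
x = [0.0]
for _ in range(n):
    # Gaussian innovations used purely for illustration
    x.append(alpha_true * x[-1] + random.gauss(0.0, 1.0))
est = cls_alpha(x)
print(est)  # close to the true value 0.6
```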
4.2. Gaussian Estimation Approach
Whittle [17] proposed this approach by utilizing the Gaussian likelihood function as the baseline distribution for estimation. Subsequently, Crowder [18] applied this estimation technique to analyze correlated binomial data. Both Al-Nachawati et al. [19] and Alwasel et al. [20] employed the same estimation technique within the context of a first-order autoregressive process. Despite its approximate nature, this method provides a reliable estimation for the proposed model. The Gaussian estimation (GE) approach is based on the one-step conditional expectation and variance of the model. The conditional maximum likelihood function is expressed as follows:
In this context, $f(x_t\mid x_{t-1})$ and $f(x_1)$ represent the conditional and marginal probability functions of $X_t\mid X_{t-1}$ and $X_1$, respectively. We assume that both follow a Gaussian PDF, with the conditional mean and conditional variance serving as their parameters. Thus, the likelihood function can be formulated as follows:
$$L(\alpha,\theta) \approx \prod_{t=2}^{n}\frac{1}{\sqrt{2\pi v_t}}\exp\left\{-\frac{(x_t-m_t)^{2}}{2v_t}\right\}.$$
Consequently, the log-likelihood function is given by
$$\ell(\alpha,\theta) = -\frac{1}{2}\sum_{t=2}^{n}\left[\log\left(2\pi v_t\right) + \frac{(x_t-m_t)^{2}}{v_t}\right],$$
where $m_t$ and $v_t$ are the one-step conditional mean and variance defined by Equation (22) and Equation (24), respectively. Hence, the Gaussian log-likelihood function related to the SBL-AR(1) process takes the form
$$\ell(\alpha,\theta) = -\frac{1}{2}\sum_{t=2}^{n}\left[\log\left(2\pi\left(1-\alpha^{2}\right)\sigma_X^{2}\right) + \frac{\left(x_t-\alpha x_{t-1}-(1-\alpha)\mu_X\right)^{2}}{\left(1-\alpha^{2}\right)\sigma_X^{2}}\right],$$
where $\mu_X$ and $\sigma_X^{2}$ are the SBL mean and variance given in Section 3.1.
Therefore, the Gaussian estimators, termed $\hat{\alpha}_{GE}$ and $\hat{\theta}_{GE}$, can be derived by solving the system of equations $\partial\ell/\partial\alpha = 0$ and $\partial\ell/\partial\theta = 0$.
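A minimal sketch of the GE idea: maximize the Gaussian quasi-log-likelihood built from the one-step conditional mean $\alpha x_{t-1}+(1-\alpha)\mu$ and constant conditional variance $(1-\alpha^2)\sigma^2$. For simplicity, the example below (plain Python) profiles over $\alpha$ only on a grid, treats $\mu$ and $\sigma^2$ as known, and uses Gaussian data rather than an SBL-AR(1) path.

```python
import math, random

def gauss_negloglik(alpha, x, mu, sig2):
    # Negative Gaussian quasi-log-likelihood with conditional mean
    # alpha*x_{t-1} + (1 - alpha)*mu and variance (1 - alpha^2)*sig2
    v = (1.0 - alpha**2) * sig2
    nll = 0.0
    for t in range(1, len(x)):
        m = alpha * x[t - 1] + (1.0 - alpha) * mu
        nll += 0.5 * (math.log(2.0 * math.pi * v) + (x[t] - m) ** 2 / v)
    return nll

random.seed(1)
alpha_true, mu, sig2 = 0.5, 2.0, 1.0
x = [mu]
for _ in range(5000):
    # Innovation variance (1 - alpha^2)*sig2 keeps the stationary variance at sig2
    x.append(alpha_true * x[-1] + (1 - alpha_true) * mu
             + random.gauss(0.0, math.sqrt((1 - alpha_true**2) * sig2)))

grid = [i / 100.0 for i in range(1, 100)]
best = min(grid, key=lambda a: gauss_negloglik(a, x, mu, sig2))
print(best)  # close to the true alpha of 0.5
```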
Crowder [18] indicated that when employing the Gaussian method for estimating the parameter , the expression is asymptotically normally distributed with a mean of zero and an asymptotic variance of , where denotes the conditional expected information matrix. An approximation can be achieved using the observed conditional information matrix, as discussed by Bakouch and Popović [7].
Now, we conduct a simulation study in the next subsection to check the performance of the CLS and GE estimation methods.
4.3. Monte Carlo Simulation and Experimental Analysis
In this subsection, we perform a simulation study to check the validity of the estimation methods used for the model parameters. The consistency and behavior of the CLS and GE parameter estimation techniques for the SBL-AR(1) process are investigated through a Monte Carlo simulation. Over 1000 replications, sample sizes of 50, 100, 500, and 1000 are simulated from the SBL-AR(1) process with actual parameter values $(\alpha, \theta)$ = (0.1, 1.5), (0.97, 1.24), (0.97, 1.6), (0.5, 1.4), (0.7, 1.4), and (0.5, 1.5).
The mean square error (MSE) is used to assess the performance of the estimates and comparison purposes.
A step-by-step simulation algorithm for the SBL-AR(1) process is provided as follows (Algorithm 1):
Algorithm 1: Simulation algorithm for the SBL-AR(1) process.
The values of $\alpha$ and $\theta$ used in the simulation study were subject to the constraints $0 \le \alpha < 1$ and $\theta > 0$. Notably, some of these values correspond to the parameter estimates obtained from the real-world datasets analyzed in this study.
Table 1 and Table 2 display the mean estimates, bias, and MSE; the bias and MSE are provided in parentheses as (bias, MSE). Generally, both approaches performed well. For both methodologies, the bias and MSE of all estimates tended to zero as the sample size increased, and the estimates became closer to the actual values. In terms of MSE and the mean estimates, the GE outperformed the CLS for $(\alpha, \theta)$ = (0.1, 1.5), (0.97, 1.24), (0.97, 1.6), (0.5, 1.4), and (0.7, 1.4). However, the CLS showed better performance for $(\alpha, \theta)$ = (0.5, 1.5).
Table 1.
Average, bias, and MSE (in parentheses) of the estimates for some different values of the parameters and .
Table 2.
Average, bias, and MSE (in parentheses) of the estimates for some different values of the parameters and .
5. Real-Life Data Analysis and Model Selection
In this section, we assess the applicability, performance, and competitiveness of the proposed model using two real-life datasets, outlined as follows:
- The first dataset consists of 451 observations, representing the monthly University of Michigan Inflation Expectation (MICH) from 5 January 1984 to 1 November 2021. These data can be found at https://fred.stlouisfed.org/series/MICH (accessed on 12 June 2024).
- The second dataset comprises 221 observations that represent the turbidity of water quality in Brisbane, measured every 10 min during the period from 23 June 2024, at 07:10, to 24 June 2024, at 19:30. These data can be obtained from https://www.kaggle.com/datasets/downshift/water-quality-monitoring-dataset (accessed on 3 October 2024).
The time series, autocorrelation (ACF), and partial autocorrelation (PACF) functions of the two datasets are displayed in Figure 4 and Figure 5, respectively. Based on these plots, the PACF cuts off after lag one, indicating that the datasets are appropriate for an AR(1) model, while the ACF dies down rapidly. These two figures suggest that the two datasets are stationary. We further validate this conclusion through a stationarity test as follows.
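The lag-one PACF cutoff noted above can be reproduced from first principles. The sketch below (plain Python, simulated rather than real data) computes the sample PACF via the Durbin-Levinson recursion on an AR(1) path; the lag-1 value is close to the autoregressive coefficient while higher lags are near zero.

```python
import random

def acf(x, maxlag):
    # Sample autocorrelations r[0..maxlag]
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    return [sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / (n * c0)
            for k in range(maxlag + 1)]

def pacf(x, maxlag):
    # Durbin-Levinson recursion: partial autocorrelations are phi_{kk}
    r = acf(x, maxlag)
    phi = [[0.0] * (maxlag + 1) for _ in range(maxlag + 1)]
    p = [0.0] * (maxlag + 1)
    phi[1][1] = p[1] = r[1]
    for k in range(2, maxlag + 1):
        num = r[k] - sum(phi[k - 1][j] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1][j] * r[j] for j in range(1, k))
        phi[k][k] = p[k] = num / den
        for j in range(1, k):
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
    return p

random.seed(7)
alpha = 0.6
x = [0.0]
for _ in range(5000):
    x.append(alpha * x[-1] + random.gauss(0.0, 1.0))
p = pacf(x, 5)
print(p[1], p[2])  # p[1] near alpha, p[2] near zero
```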
Figure 4.
The time series, ACF, and PACF plots of monthly University of Michigan Inflation Expectation (MICH).
Figure 5.
The time series, ACF, and PACF plots of the turbidity of water quality in Brisbane.
The augmented Dickey–Fuller (ADF) test is a statistical tool used to determine whether a time series is stationary or not. If the p-value from the test is less than the designated significance level (0.05), we reject the null hypothesis of non-stationarity, indicating that the time series is stationary. The ADF test was applied using the adf.test function from the tseries package in R. According to the p-values shown for each dataset in Table 3, we can conclude that the datasets are stationary.
Table 3.
Descriptive statistics of University of Michigan Inflation Expectation and Brisbane water quality datasets.
To compare the proposed SBL-AR(1) process with relevant alternatives, we utilize the two earlier datasets alongside the following non-Gaussian AR(1) models, together with their Gaussian log-likelihood functions.
- E-AR(1) with exponential marginals (Gaver and Lewis [1]):
- G-AR(1) with gamma marginals (Gaver and Lewis [1]):
- INGAR(1)-I with inverse Gaussian marginals (Abraham and Balakrishna [4]):
- INGAR(1)-II with inverse Gaussian marginals (Abraham and Balakrishna [4]):
- L-AR(1) with Lindley marginals (Bakouch and Popović [7]):
- GaL-AR(1) with gamma Lindley marginals (Mello et al. [9]):
- AR-L(1) with Lindley innovations (Nitha and Krishnarani [21]):
The GE method was used to estimate the unknown parameters of all the considered models. The parameter values were determined numerically, using the optim function in R with the conjugate gradient (CG) method. For each dataset and model, the Gaussian likelihood estimates, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the Hannan–Quinn information criterion (HQIC) were computed. Model performance was evaluated using these information criteria: the best-performing model is the one with the smallest values. The goodness-of-fit statistics and the GE estimates, including their standard errors (SE), are summarized in Table 4 and Table 5. From these tables, it is evident that the SBL-AR(1) model achieved the smallest values of AIC, BIC, and HQIC. Consequently, we can conclude that the proposed model performed well for both datasets; hence, the SBL-AR(1) model provides the best fit among the AR(1) models considered.
Table 4.
Estimated parameters, AIC, BIC, and HQIC for monthly University of Michigan Inflation Expectation dataset.
Table 5.
Estimated parameters, AIC, BIC, and HQIC for Brisbane water quality dataset.
For each dataset, the residuals were computed from the fitted model given by Equation (2), using the estimated parameters in Table 4 and Table 5. To assess the presence of autocorrelation in these residuals, both the Box–Pierce and Ljung–Box tests were performed. The results, summarized in Table 6, indicate that in all cases the p-values exceeded 0.05, indicating no significant autocorrelation in the residuals of the fitted models.
Table 6.
Ljung–Box and Box–Pierce test results for residual autocorrelation.
6. Forecasting
Forecasting time series data is essential for predicting future trends in non-Gaussian contexts, where traditional Gaussian models may fail to capture skewed or fat-tailed data distributions. The proposed SBL-AR(1) model utilizes size-biased Lindley marginals; both the classical conditional expectation method and machine learning techniques are employed and compared in predicting the considered real-world datasets.
6.1. Classical Conditional Expectation Method
This classical method, one of the most widely used forecasting approaches, relies on conditional expectations: the forecast is the expected value of the future observation given all the available information up to time t. The one- and k-step-ahead forecasts are obtained from Equation (23) as follows:
$$\hat{X}_{t+1} = E[X_{t+1}\mid X_t] = \alpha X_t + (1-\alpha)\mu_X$$
and
$$\hat{X}_{t+k} = E[X_{t+k}\mid X_t] = \alpha^{k} X_t + \left(1-\alpha^{k}\right)\mu_X.$$
The parameters $\alpha$ and $\theta$ are replaced with their GE estimates $\hat{\alpha}_{GE}$ and $\hat{\theta}_{GE}$. Then
$$\hat{X}_{t+1} = \hat{\alpha}_{GE} X_t + \left(1-\hat{\alpha}_{GE}\right)\hat{\mu}_X$$
and
$$\hat{X}_{t+k} = \hat{\alpha}_{GE}^{\,k} X_t + \left(1-\hat{\alpha}_{GE}^{\,k}\right)\hat{\mu}_X,$$
where $\hat{\mu}_X = 2(\hat{\theta}_{GE}+3)/(\hat{\theta}_{GE}(\hat{\theta}_{GE}+2))$.
6.2. Machine Learning Forecasting Methods
Unlike classical statistical methods, which rely on predefined theoretical assumptions, machine learning (ML) approaches are non-parametric and data-driven, learning patterns directly from observed data to adapt flexibly to complex structures. These methods typically involve training algorithms on historical data to learn underlying relationships, which can then be used to make predictions about future values. To model the autoregressive behavior of the time series using ML methods, we adopted an AR(1) structure where the current observation $X_t$ is predicted using its immediate lag, $X_{t-1}$. The data were first transformed into a supervised learning format by creating a lag-1 feature. The dataset was then divided into training and testing sets (80% and 20%, respectively). In the following subsections, we examine the use of three ML methods, namely support vector regression, k-nearest neighbors, and extreme gradient boosting, to forecast data generated by AR(1) processes.
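The lag-1 supervised framing and chronological split described above can be sketched as follows (plain Python with a short hypothetical series; the 80/20 split matches the proportions mentioned).

```python
def make_lag1_dataset(series):
    # Supervised pairs (x_{t-1}, x_t) for lag-1 learning
    return [(series[t - 1], series[t]) for t in range(1, len(series))]

def chrono_split(pairs, train_frac=0.8):
    # Chronological split with no shuffling, so the test set is strictly in the future
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

series = [3.1, 3.0, 3.4, 3.2, 2.9, 3.3, 3.5, 3.1, 3.0, 3.2]  # hypothetical values
pairs = make_lag1_dataset(series)
train_pairs, test_pairs = chrono_split(pairs)
print(len(train_pairs), len(test_pairs))  # 7 2
```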
6.2.1. Support Vector Regression
Support vector regression (SVR) models both linear and nonlinear relationships in time series data; it is adept in complex scenarios while remaining applicable to simpler linear cases such as the AR(1) process. It finds a function that predicts targets accurately within a small error margin $\epsilon$ (Smola and Schölkopf [22]), making it effective for time series forecasting.
Given a training set $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i$ represents the input features (e.g., lagged time series values) and $y_i$ denotes the corresponding target values, the goal is to find a linear function $f(x) = w^{\top}x + b$ that minimizes the cost function
$$\min_{w,\,b,\,\xi,\,\xi^{*}}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\left(\xi_i + \xi_i^{*}\right)$$
subject to $y_i - w^{\top}x_i - b \le \epsilon + \xi_i$, $w^{\top}x_i + b - y_i \le \epsilon + \xi_i^{*}$, and $\xi_i, \xi_i^{*} \ge 0$, where w is the weight vector, b is the bias term, $\xi_i$ and $\xi_i^{*}$ are slack variables allowing deviations beyond $\epsilon$, and C controls the trade-off between model complexity and training error.
We employed the svm function from the e1071 package in R (version 4.4.3), specifying a linear kernel and epsilon-insensitive loss (type = “eps-regression”), which aligns with the assumptions of the AR(1) model.
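To make the $\epsilon$-insensitive loss concrete, here is a minimal sketch (plain Python, not the e1071 implementation): deviations inside the $\epsilon$ tube are ignored, and only the excess beyond $\epsilon$ is penalized.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # SVR loss: residuals within the eps tube cost nothing;
    # beyond the tube the cost grows linearly in the excess
    return sum(max(0.0, abs(a - b) - eps) for a, b in zip(y_true, y_pred))

# Only the middle prediction leaves the tube: |2 - 4| - 0.5 = 1.5
print(eps_insensitive_loss([1.0, 2.0, 3.0], [1.0, 4.0, 3.0], eps=0.5))  # 1.5
```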
6.2.2. Extreme Gradient Boosting
Extreme gradient boosting (XGBoost) is based on gradient-boosted decision trees, optimized for speed and performance (Chen and Guestrin [23]). XGBoost constructs its model by building decision trees sequentially, where each new tree is trained to correct the errors made by the previous ones. This process is guided by an objective function that balances the accuracy and complexity of the model. The objective function minimized during training combines a loss term measuring prediction errors (e.g., squared error) and a regularization term penalizing complexity to prevent overfitting, expressed as
$$\mathrm{Obj} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k}\Omega\left(f_k\right),$$
where $f_k$ represents each regression tree, $l$ is the loss function, and $\Omega$ is a regularization function controlling the complexity of each tree.
The XGBoost algorithm was implemented in R using the xgboost package, with data formatted as a matrix or xgb.DMatrix for optimized performance and trained via xgb.train using the “reg:squarederror” objective function.
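The boosting idea, sequentially fitting each new tree to the current residuals, can be illustrated with depth-1 trees (stumps). This toy sketch (plain Python) omits XGBoost's regularization term $\Omega$ and second-order optimization; it only shows the residual-fitting loop on a small hypothetical dataset.

```python
def fit_stump(x, r):
    # Depth-1 regression tree: the threshold split minimizing squared error on residuals r
    best = None
    order = sorted(range(len(x)), key=lambda i: x[i])
    for j in range(1, len(x)):
        t = 0.5 * (x[order[j - 1]] + x[order[j]])
        left = [r[i] for i in range(len(x)) if x[i] <= t]
        right = [r[i] for i in range(len(x)) if x[i] > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1], best[2], best[3]

def boost(x, y, rounds=20, lr=0.3):
    # Gradient boosting for squared error: each stump fits the current residuals
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(rounds):
        r = [yi - pi for yi, pi in zip(y, pred)]
        t, ml, mr = fit_stump(x, r)
        pred = [p + lr * (ml if xi <= t else mr) for p, xi in zip(pred, x)]
    return pred

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
pred = boost(x, y)
mse = sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)
print(mse)  # training error shrinks toward zero as rounds grow
```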
6.2.3. K-Nearest Neighbors
The K-nearest neighbors (KNN) algorithm predicts outcomes by measuring distances (e.g., Euclidean) between a new data point and the training samples and selecting the K closest neighbors. The final prediction is the average of these neighbors’ target values:
$$\hat{y} = \frac{1}{K}\sum_{i\in N_K(x)} y_i,$$
where $y_i$ denotes the $i$-th neighbor’s value and $N_K(x)$ is the set of the K nearest neighbors of x (Kramer [24]).
We used the FNN package in R to apply the KNN method. The knn.reg function was used to make predictions based on the average of the closest points in the data.
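A from-scratch KNN regression in one dimension (plain Python, not the FNN implementation) makes the averaging rule explicit.

```python
def knn_predict(train_x, train_y, x_new, k=3):
    # Average the targets of the k nearest neighbours (absolute distance in 1-D)
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    return sum(train_y[i] for i in order[:k]) / k

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical lag values
train_y = [1.2, 2.1, 2.9, 4.2, 5.0]
print(knn_predict(train_x, train_y, 2.5, k=2))  # average of the two nearest targets
```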
6.3. Forecasting Evaluation
To assess the performance of the forecasting models, several commonly used error metrics were employed to quantify the accuracy of the predictions. Below are the most commonly used measures:
- Root mean squared error (RMSE): the square root of the MSE, in the same units as the target variable:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^{2}}.$$
- Mean absolute error (MAE): the average absolute difference between predicted and actual values:
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right|.$$
- Mean absolute percentage error (MAPE): the error as a percentage of the actual values:
$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t-\hat{y}_t}{y_t}\right|.$$
- For the considered machine learning methods, hyperparameter tuning was performed via grid search to determine the values of the key parameters that achieve the lowest RMSE for each method.
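The three metrics can be implemented in a few lines; this sketch (plain Python) follows the standard definitions and assumes strictly positive actual values for the MAPE, as holds for both datasets here.

```python
import math

def rmse(actual, pred):
    # Root mean squared error, in the units of the target
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    # Mean absolute error
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    # Mean absolute percentage error; requires nonzero actual values
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, pred)) / len(actual)

actual = [2.0, 4.0, 5.0]  # hypothetical values
pred = [2.5, 3.5, 5.5]
print(rmse(actual, pred), mae(actual, pred), mape(actual, pred))
```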
Table 7 and Table 8 evaluate the one-step-ahead forecasting performance of the classical (SBL-AR(1) conditional expectation) and machine learning methods. While XGBoost achieved the lowest MAE and MAPE for the MICH dataset and SVR excelled for the Turbidity dataset, the classical method demonstrated notable competitiveness. For the MICH dataset, the classical method outperformed SVR in all measures and outperformed KNN in MAE (0.1809712 vs. 0.1831944) and MAPE (6.098343 vs. 6.270608); it also closely matched XGBoost. For the Turbidity dataset, although the machine learning methods are superior, the values of the classical method remain close to those of the KNN method. These results highlight the competitiveness of the classical forecasting method under the SBL-AR(1) model against modern machine learning techniques: its parametric structure effectively captures skewed and fat-tailed data while avoiding overfitting. They also underscore that, while ML methods adapt flexibly to data patterns, the classical approach based on the structure of the SBL-AR(1) retains predictive power, particularly in scenarios where the parametric assumptions align well with the data’s inherent dynamics.
Table 7.
Model performance measures for forecasting MICH dataset values.
Table 8.
Model performance measures for forecasting Turbidity dataset values.
Actual and predicted values for each dataset under all of the forecasting methods are shown in Figure 6 and Figure 7. The classical SBL-AR(1) predictions in these figures closely follow the actual data trends, showing smoother alignment than the machine learning methods. This strong fit arises because the SBL-AR(1) model, through its parametric structure, is specifically designed for the data's skewed and positive-valued nature. While some ML methods achieved slightly lower errors in Table 7 and Table 8, the figures confirm the classical method's ability to capture the data's inherent patterns without overfitting.
Figure 6.
Actual and predicted values of University of Michigan Inflation Expectation.
Figure 7.
Actual and predicted values of turbidity of water quality in Brisbane.
7. Conclusions
In this study, we introduced a first-order autoregressive process based on the size-biased Lindley distribution (SBL-AR(1)) model. We explored several theoretical properties of the process, including its innovation distribution, Laplace transform, conditional mean and variance, autocorrelation structure, and spectral density. To estimate the model parameters, we employed both conditional least squares and Gaussian estimation methods.
A comprehensive Monte Carlo simulation was conducted to evaluate the behavior of the estimators, demonstrating their efficiency. Furthermore, the applicability of the SBL-AR(1) model was illustrated using two real-world datasets. In both cases, the model provided a superior fit compared to several alternative non-Gaussian AR(1) processes, as evidenced by goodness-of-fit statistics. In addition, both a classical statistical method and machine learning techniques were used for forecasting, and the classical method demonstrated strong competitiveness against the machine learning methods.
Future work may consider extending the model to higher-order autoregressive structures.
Author Contributions
Conceptualization, H.S.B., M.M.G. and H.M.E.-T.; methodology, H.S.B., M.M.G. and H.M.E.-T.; software, H.M.E.-T.; validation, H.S.B., M.M.G. and H.M.E.-T.; formal analysis, H.S.B., M.M.G. and H.M.E.-T.; investigation, H.S.B., M.M.G. and H.M.E.-T.; resources, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; data curation, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; writing—original draft preparation, H.S.B., M.M.G. and H.M.E.-T.; writing—review and editing, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; visualization, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; supervision, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; project administration, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; funding acquisition, S.M.A.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research work was funded by Umm Al-Qura University, Saudi Arabia under grant number: 25UQU4310037GSSR08.
Data Availability Statement
The original data presented in the study are available on the website https://fred.stlouisfed.org/series/MICH (accessed on 12 June 2024) and the website https://www.kaggle.com/datasets/downshift/water-quality-monitoring-dataset (accessed on 3 October 2024).
Acknowledgments
The authors extend their appreciation to Umm Al-Qura University, Saudi Arabia for funding this research work through grant number: 25UQU4310037GSSR08.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
| SBL | Size-biased Lindley |
| SBL-AR(1) | Size-biased Lindley autoregressive of order 1 |
| ACF | Autocorrelation function |
| PACF | Partial autocorrelation function |
| ADF | Augmented Dickey–Fuller |
| CLS | Conditional least squares |
| GE | Gaussian estimation |
| MSE | Mean squared error |
| E-AR | Exponential autoregressive |
| G-AR | Gamma autoregressive |
| INGAR | Inverse Gaussian autoregressive |
| L-AR | Lindley autoregressive |
| GaL-AR | Gamma Lindley autoregressive |
| AR-L | Autoregressive with Lindley innovations |
| AIC | Akaike information criterion |
| BIC | Bayesian information criterion |
| HQIC | Hannan–Quinn information criterion |
| SE | Standard error |
| ML | Machine learning |
| SVR | Support vector regression |
| XGBoost | Extreme gradient boosting |
| KNN | K-nearest neighbors |
| RMSE | Root mean squared error |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
References
- Gaver, D.P.; Lewis, P.A. First-order autoregressive gamma sequences and point processes. Adv. Appl. Probab. 1980, 12, 727–745. [Google Scholar] [CrossRef]
- Sim, C.H. Simulation of Weibull and gamma autoregressive stationary process. Commun. Stat. Simul. Comput. 1986, 15, 1141–1146. [Google Scholar] [CrossRef]
- Mališić, J.D. Mathematical Statistics and Probability Theory: Volume B; Springer: Berlin/Heidelberg, Germany, 1987. [Google Scholar]
- Abraham, B.; Balakrishna, N. Inverse Gaussian autoregressive models. J. Time Ser. Anal. 1999, 20, 605–618. [Google Scholar] [CrossRef]
- Jose, K.K.; Tomy, L.; Sreekumar, J. Autoregressive processes with normal-Laplace marginals. Stat. Probab. Lett. 2008, 78, 2456–2462. [Google Scholar] [CrossRef]
- Popović, B.V. AR (1) time series with approximated beta marginal. Publ. Inst. Math. 2010, 88, 87–98. [Google Scholar] [CrossRef]
- Bakouch, H.S.; Popović, B.V. Lindley first-order autoregressive model with applications. Commun. Stat. Theory Methods 2016, 45, 4988–5006. [Google Scholar] [CrossRef]
- Nitha, K.U.; Krishnarani, S.D. On a class of time series model with double Lindley distribution as marginals. Statistica 2021, 81, 365–382. [Google Scholar]
- Mello, A.B.; Lima, M.C.; Nascimento, A.D. The title of the cited article. Environmetrics 2022, 33, e2724. [Google Scholar] [CrossRef]
- Jilesh, V.; Jayakumar, K. On first order autoregressive asymmetric logistic process. J. Indian Soc. Probab. Stat. 2023, 24, 93–110. [Google Scholar] [CrossRef]
- Nitha, K.U.; Krishnarani, S.D. Exponential-Gaussian distribution and associated time series models. Revstat Stat. J. 2023, 21, 557–572. [Google Scholar]
- Patil, G.P.; Rao, C.R. Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics 1978, 34, 179–189. [Google Scholar] [CrossRef]
- Scheaffer, R. Size-biased sampling. Technometrics 1972, 14, 635–644. [Google Scholar] [CrossRef]
- Singh, S.K.; Maddala, G.S. Modeling Income Distributions and Lorenz Curves; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
- Drummer, T.D.; McDonald, L.L. Size bias in line transect sampling. Biometrics 1987, 43, 13–21. [Google Scholar] [CrossRef]
- Ayesha, A. Size biased Lindley distribution and its properties a special case of weighted distribution. J. Appl. Math. 2017, 8, 808–819. [Google Scholar] [CrossRef]
- Whittle, P. Gaussian estimation in stationary time series. Bull. Int. Stat. Inst. 1961, 39, 105–129. [Google Scholar]
- Crowder, M. Gaussian estimation for correlated binomial data. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1985, 47, 229–237. [Google Scholar] [CrossRef]
- Al-Nachawati, H.; Alwasel, I.; Alzaid, A.A. Estimating the parameters of the generalized Poisson AR (1) process. J. Stat. Comput. Simul. 1997, 56, 337–352. [Google Scholar] [CrossRef]
- Alwasel, I.; Alzaid, A.; Al-Nachawati, H. Estimating the parameters of the binomial autoregressive process of order one. Appl. Math. Comput. 1998, 95, 193–204. [Google Scholar] [CrossRef]
- Nitha, K.U.; Krishnarani, S.D. On autoregressive processes with Lindley-distributed innovations: Modeling and simulation. Stat. Transit. New Ser. 2024, 25, 31–47. [Google Scholar] [CrossRef]
- Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
- Kramer, O. Dimensionality reduction by unsupervised k-nearest neighbor regression. In Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, USA, 18–21 December 2011; pp. 275–278. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).