Article

A First-Order Autoregressive Process with Size-Biased Lindley Marginals: Applications and Forecasting

by Hassan S. Bakouch ¹, M. M. Gabr ², Sadiah M. A. Aljeddani ³,* and Hadeer M. El-Taweel ⁴,*

¹ Department of Mathematics, College of Science, Qassim University, Buraydah 51452, Saudi Arabia
² Department of Mathematics, Faculty of Science, Alexandria University, Alexandria 21515, Egypt
³ Mathematics Department, Al-Lith University College, Umm Al-Qura University, Al-Lith 21961, Saudi Arabia
⁴ Department of Mathematics, Faculty of Science, Damietta University, Damietta 34517, Egypt
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1787; https://doi.org/10.3390/math13111787
Submission received: 24 April 2025 / Revised: 21 May 2025 / Accepted: 24 May 2025 / Published: 27 May 2025
(This article belongs to the Special Issue Statistical Simulation and Computation: 3rd Edition)

Abstract:
In this paper, a size-biased Lindley (SBL) first-order autoregressive (AR(1)) process, the SBL-AR(1), is proposed. Some probabilistic and statistical properties of the proposed process are determined, including the distribution of its innovation process, the Laplace transform, multi-step-ahead conditional measures, autocorrelation, and spectral density function. In addition, the unknown parameters of the model are estimated via the conditional least squares and Gaussian estimation methods. The performance and behavior of the estimators are checked through a Monte Carlo simulation study. Additionally, two real-world datasets are utilized to examine the model's applicability, and goodness-of-fit statistics are used to compare it to several pertinent non-Gaussian AR(1) models. The findings reveal that the proposed SBL-AR(1) model exhibits key theoretical properties, including a closed-form innovation distribution, multi-step conditional measures, and an exponentially decaying autocorrelation structure. Parameter estimation via the conditional least squares and Gaussian methods demonstrates consistency and efficiency in simulations. Real-world applications to inflation expectations and water quality data reveal a superior fit over competing non-Gaussian AR(1) models, evidenced by lower values of the AIC and BIC statistics. Forecasting comparisons show that the classical conditional expectation method achieves accuracy comparable to some modern machine learning techniques, underscoring its practical utility for skewed and fat-tailed time series.

1. Introduction

Continuous-valued time series, in which realizations are continuously recorded over time, are useful in many domains, including engineering, economics, finance, and the natural sciences. In particular, they are employed in stock market analysis, scientific research, medical studies, economic forecasting, and weather forecasting. However, traditional time series analysis often assumes Gaussian-distributed marginals, which fail to capture the main features of real-world data, such as skewness, fat tails, positivity, and size-biased sampling (e.g., environmental, economic, and biomedical data). For instance, Gaussian models cannot adequately accommodate strictly positive measurements like water turbidity or inflation rates, nor can they represent the high kurtosis observed in phenomena such as financial volatility. These gaps limit their applicability and forecasting accuracy in non-Gaussian contexts. To address these limitations, several non-Gaussian AR(1) models have been proposed, each highlighting key features such as skewness, fat tails, and positivity; among their marginal distributions are the gamma (Gaver and Lewis [1]), Weibull and gamma (Sim [2]), exponential (Mališić [3]), inverse Gaussian (Abraham and Balakrishna [4]), normal-Laplace (Jose et al. [5]), approximated beta (Popović [6]), Lindley (Bakouch and Popović [7]), double Lindley (Nitha and Krishnarani [8]), gamma-Lindley (Mello et al. [9]), logistic (Jilesh and Jayakumar [10]), and exponential-Gaussian (Nitha and Krishnarani [11]).
From the distribution theory perspective, size-biased distributions, a special class of weighted distributions, emerge naturally in scenarios where observations are recorded with probabilities proportional to their inherent size or magnitude, a common feature in ecological surveys (e.g., oversampling large organisms), econometric data (e.g., prioritizing high-value transactions), and biomedical studies (e.g., detecting severe disease cases). These distributions address the unequal detection probabilities inherent to real-world data collection, where larger or more prominent units (subjects) are systematically overrepresented. Their theoretical formulation weights the original probability density function $f(x)$ by the weight function $x$, yielding the size-biased density:
$$f_{SB}(x) = \frac{x\, f(x)}{E(X)}, \qquad x > 0,$$
where the operator $E(\cdot)$ denotes the expected value of the random variable $X$ and acts as a normalizing factor; this enables accurate modeling of such biased sampling mechanisms. Pioneered by Patil and Rao [12], size-biased distributions have been extensively applied across environmental science, forestry, and social science, as demonstrated by Scheaffer [13] in wildlife population studies, Singh and Maddala [14] in econometric inequality analysis, and Drummer and McDonald [15] in ecological sampling. Beyond applied fields, size-biasing plays a pivotal role in statistical estimation, renewal theory, and distributional infinite divisibility. Despite their broad utility, their integration into non-Gaussian time series models remains limited, creating a critical gap in analyzing temporally dependent data subject to size-based sampling biases. This omission compromises parameter estimation and forecasting in fields like epidemiology (e.g., disease case reporting) and hydrology (e.g., extreme event monitoring), where detection probabilities inherently correlate with observation magnitude.
To help fill this gap, we propose the size-biased Lindley AR(1) (SBL-AR(1)) process, built on the size-biased Lindley (SBL) distribution introduced by Ayesha [16]. The SBL distribution enhances the classical Lindley distribution, itself non-Gaussian, by incorporating size-biased sampling. Compared to other non-Gaussian distributions commonly used in time series models, such as the double Lindley, gamma-Lindley, and logistic, the size-biased Lindley distribution offers additional flexibility in capturing skewed and fat-tailed behaviors while inherently addressing size-biased sampling effects. This makes it particularly advantageous in applications. The probability density function (PDF), denoted as $f_{SBL}(x)$, and the cumulative distribution function (CDF), denoted as $F_{SBL}(x)$, of the SBL distribution are, respectively, given by
$$f_{SBL}(x;\theta) = \frac{\theta^3 x (1+x)\, e^{-\theta x}}{2+\theta}, \qquad x > 0,\ \theta > 0,$$
$$F_{SBL}(x;\theta) = 1 - \frac{\big(2+\theta+\theta x (2+\theta+\theta x)\big)\, e^{-\theta x}}{2+\theta}, \qquad x > 0,\ \theta > 0,$$
where θ is a scale parameter of the SBL distribution.
The first two raw moments and the variance of the SBL distribution are, respectively, expressed as follows:
$$E(X) = \frac{2(\theta+3)}{\theta(\theta+2)},$$
$$E(X^2) = \frac{6(\theta+4)}{\theta^2(\theta+2)},$$
$$\mathrm{var}(X) = \frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2}.$$
For more details about the SBL distribution, see [16].
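As a quick computational illustration (a minimal sketch; the function names are ours, not from [16]), the SBL density, CDF, and moments above translate directly into R, and one can verify numerically that the density integrates to one:

```r
# Minimal R sketch of the SBL(theta) distribution; dsbl/psbl are our own names.
dsbl <- function(x, theta) theta^3 * x * (1 + x) * exp(-theta * x) / (2 + theta)
psbl <- function(x, theta)
  1 - (2 + theta + theta * x * (2 + theta + theta * x)) * exp(-theta * x) / (2 + theta)
sbl_mean <- function(theta) 2 * (theta + 3) / (theta * (theta + 2))
sbl_var  <- function(theta) 2 * (theta^2 + 6 * theta + 6) / (theta^2 * (theta + 2)^2)

theta <- 1.5
integrate(dsbl, 0, Inf, theta = theta)$value               # ~1, i.e., a valid density
integrate(function(x) x * dsbl(x, theta), 0, Inf)$value    # agrees with sbl_mean(theta)
```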
The limitations of traditional Gaussian time series models in capturing non-Gaussian features, as previously stated, motivate the proposed SBL-AR(1) process, which integrates SBL marginals to flexibly model positive-valued data with temporal dependence. The SBL-AR(1) addresses critical gaps in non-Gaussian autoregressive models through two key features: (1) SBL marginals, which enhance the classical Lindley distribution by weighting observations proportionally to their size, improving the fit for data where larger values are oversampled (e.g., economic datasets), and (2) a closed-form innovation distribution combining a Dirac delta component with a generalized mixture of gamma and exponential distributions, enabling precise modeling of both zero increments and continuous positive values. Parameter estimation via conditional least squares (CLS) and Gaussian estimation (GE) is validated through simulations demonstrating estimator consistency, while real-world applications, including inflation expectations and water turbidity monitoring, showcase a superior performance over competing models (e.g., Lindley, gamma, and inverse Gaussian AR(1)) via the AIC/BIC criteria. By balancing theoretical findings (Laplace transform, multi-step-ahead conditional measures, autocorrelation decay, spectral density) with practical utility (accurate forecasting of skewed/fat-tailed data), the SBL-AR(1) establishes a flexible model for analyzing non-Gaussian time series prevalent in environmental and economic domains.
The remainder of this paper is organized as follows: In Section 2, a first-order autoregressive process with SBL marginals is constructed, and the distribution of the innovation process is derived. Section 3 investigates some structural properties of the proposed SBL-AR(1) process, including the multi-step conditional Laplace transform, conditional variance, conditional mean, autocorrelation function, and spectral density function. In Section 4, we utilize the conditional least squares and Gaussian estimation techniques to estimate the parameters of the proposed process, and the performance of the estimators is assessed via a simulation study. Section 5 discusses the application of the model to two real-life datasets. Section 6 addresses forecasting for the fitted AR model, comparing the classical statistical method with some machine learning methods in terms of predictive ability. In Section 7, the conclusion provides suggestions for future research directions aligned with the proposed framework.

2. SBL-AR(1): Model Construction and Innovation Distribution

In this section, a first-order stationary autoregressive process with SBL marginals, denoted as SBL-AR(1), is presented, and the distribution of the innovation term is then obtained.
Suppose that $\{X_t;\ t = 1, 2, \ldots, n\}$ is a stochastic process defined as follows:
$$X_t = \rho X_{t-1} + a_t, \qquad \rho \in [0,1), \tag{2}$$
where $\{X_t\}$ is a stationary process with SBL($\theta$) marginals, $\theta > 0$, and $\{a_t\}$ is a sequence of independent and identically distributed (i.i.d.) random variables, with $a_t$ independent of $X_{t-s}$ for all $s \geq 1$. The definition of the SBL-AR(1) model indicates that it is a first-order Markovian process.
Before investigating further features of the SBL-AR(1) model, it is worth defining generalized mixture distributions.
Definition 1. 
Let $G(t)$ be a distribution function. $G(t)$ is said to be a generalized mixture of the distribution functions $G(t;1), G(t;2), \ldots$ if
$$G(t) = \sum_{i \geq 1} \omega_i\, G(t;i),$$
for all $t$, where $\omega_1, \omega_2, \ldots$ are real numbers satisfying $\sum_{i \geq 1} \omega_i = 1$ and $\sum_{i \geq 1} |\omega_i| < \infty$, with $\omega_i < 0$ for some indices $i$.
The following proposition defines the PDF $h(x)$, which will be used later to find the innovation distribution. It is important to note that $h(x)$ is a proper PDF for all admissible values of the parameters $\rho$ and $\theta$.
Proposition 1. 
If $\theta \geq 1$, $x > 0$, and $0 < \rho < 1$, then the generalized mixture
$$
\begin{aligned}
h(x) ={}& -\frac{8(1-\rho)\rho^2}{(1-\rho^2)\big(\theta(1-\rho)+2\big)^3}\,\frac{(\theta+2)}{\rho}\, e^{-\frac{(\theta+2)x}{\rho}} + \frac{2(1-\rho)^3}{(1-\rho^2)\big(\theta(1-\rho)+2\big)}\,\frac{1}{2}\theta^3 x^2 e^{-\theta x} \\
&+ \frac{(1-\rho)^2\big(\theta(1-\rho)(\theta(1-\rho)+4\rho+2)+12\rho\big)}{(1-\rho^2)\big(\theta(1-\rho)+2\big)^2}\,\theta^2 x\, e^{-\theta x} \\
&+ \frac{2\rho(1-\rho)\Big[\theta(1-\rho)\Big(\theta\big((\theta-1)\rho^2-2(\theta+2)\rho+\theta+5\big)+6(\rho+1)\Big)+12\rho\Big]}{(1-\rho^2)\big(\theta(1-\rho)+2\big)^3}\,\theta\, e^{-\theta x}
\end{aligned}
\tag{3}
$$
is a PDF.
Proof. 
Equation (3) is a weighted combination of exponential and gamma densities: its four terms are, respectively, weighted exponential$\big(\frac{\theta+2}{\rho}\big)$, gamma$(3,\theta)$, gamma$(2,\theta)$, and exponential$(\theta)$ densities, and the weights sum to one. Since each component density integrates to one, it can be concluded that
$$\int_0^\infty h(x)\, dx = 1. \tag{4}$$
We still need to verify that $h(x) \geq 0$ for $x > 0$. Equation (3) can be reformulated as follows:
$$h(x) = \frac{(1-\rho)}{(1-\rho^2)\big(\theta(1-\rho)+2\big)}\, e^{-\theta x}\, \bar{h}(x), \tag{5}$$
where
$$
\begin{aligned}
\bar{h}(x) ={}& -\frac{8\rho^2}{\big(\theta(1-\rho)+2\big)^2}\,\frac{(\theta+2)}{\rho}\, e^{-\frac{(\theta(1-\rho)+2)x}{\rho}} + 2(1-\rho)^2\,\frac{1}{2}\theta^3 x^2 + \frac{(1-\rho)\big(\theta(1-\rho)(\theta(1-\rho)+4\rho+2)+12\rho\big)}{\theta(1-\rho)+2}\,\theta^2 x \\
&+ \frac{2\rho\Big[\theta(1-\rho)\Big(\theta\big((\theta-1)\rho^2-2(\theta+2)\rho+\theta+5\big)+6(\rho+1)\Big)+12\rho\Big]}{\big(\theta(1-\rho)+2\big)^2}\,\theta.
\end{aligned}
$$
As $\theta \geq 1$, we conclude that
$$\bar{h}(0) = 2(\theta-1)\big(2+\theta(1-\rho)\big)\rho \geq 0. \tag{6}$$
Additionally, we have that
$$\lim_{x \to \infty} \bar{h}(x) = \infty, \tag{7}$$
and, for $x > 0$,
$$\bar{h}'(x) = \frac{8\rho^2}{\big(\theta(1-\rho)+2\big)^2}\,\frac{(\theta+2)}{\rho}\,\frac{\big(\theta(1-\rho)+2\big)}{\rho}\, e^{-\frac{(\theta(1-\rho)+2)x}{\rho}} + 2(1-\rho)^2\theta^3 x + \frac{(1-\rho)\big(\theta(1-\rho)(\theta(1-\rho)+4\rho+2)+12\rho\big)}{\theta(1-\rho)+2}\,\theta^2 > 0. \tag{8}$$
From Equations (6)–(8), $\bar{h}$ is non-negative at the origin (due to $\theta \geq 1$) and strictly increasing, ensuring that $\bar{h}(x) \geq 0$ and hence $h(x) \geq 0$ for $x > 0$, which completes the proof.
That is, $h(x)$ represents a generalized mixture of the exponential$\big(\frac{\theta+2}{\rho}\big)$, gamma$(3,\theta)$, gamma$(2,\theta)$, and exponential$(\theta)$ distributions such that the sum of the weights in $h(x)$ is equal to 1. □
The distribution of the innovation sequence $\{a_t\}$ plays a crucial role in the practical applications and further study of this process. One frequently used technique to determine the innovation sequence involves the Laplace transform. The subsequent theorem presents the distribution of the innovation random variable (rv) $a_t$. Let $\Phi_{X_t}$ and $\Phi_{a_t}$ represent the Laplace transforms (LTs) of the random variables $X_t$ and $a_t$, respectively. The LT of an SBL rv $X$ can be expressed as
$$\Phi_X(s) = E\big(e^{-sX}\big) = \frac{\theta^3 (2+s+\theta)}{(2+\theta)(s+\theta)^3}. \tag{9}$$
Theorem 1. 
Assume that SBL($\theta$) is the marginal distribution of the stochastic process given by Equation (2). Then the distribution of the innovation sequence $\{a_t\}$ is a mixture of singular and absolutely continuous distributions, expressed as follows:
$$f_a(x) = \rho^2\, \delta(x) + (1-\rho^2)\, h(x),$$
where $\delta(x)$ is the Dirac delta function defined as
$$\delta(x) = \begin{cases} \infty, & x = 0, \\ 0, & x \neq 0, \end{cases}$$
and $h(x)$ is given by Equation (3).
Proof. 
As the process $\{X_t\}$ is stationary, taking the LT of Equation (2) gives
$$\Phi_X(s) = \Phi_X(\rho s)\, \Phi_a(s);$$
consequently, the LT of the innovation rv $a_t$ is given by
$$\Phi_a(s) = \frac{\Phi_X(s)}{\Phi_X(\rho s)}. \tag{10}$$
By utilizing Equation (9), Equation (10) is represented as
$$\Phi_a(s) = \frac{(2+s+\theta)(s\rho+\theta)^3}{(s+\theta)^3 (2+s\rho+\theta)}. \tag{11}$$
By applying partial fraction decomposition, the preceding equation can be expressed as
$$\Phi_a(s) = \rho^2 + (1-\rho^2)\left[A\,\frac{2+\theta}{2+\theta+\rho s} + B\,\frac{\theta^3}{(\theta+s)^3} + C\,\frac{\theta^2}{(\theta+s)^2} + D\,\frac{\theta}{\theta+s}\right], \tag{12}$$
where
$$A = -\frac{8(1-\rho)\rho^2}{(1-\rho^2)\big(\theta(1-\rho)+2\big)^3} = -\frac{8\rho^2}{(1+\rho)\big(\theta(1-\rho)+2\big)^3}, \tag{13}$$
$$B = \frac{2(1-\rho)^3}{(1-\rho^2)\big(\theta(1-\rho)+2\big)} = \frac{2(1-\rho)^2}{(1+\rho)\big(\theta(1-\rho)+2\big)}, \qquad C = \frac{(1-\rho)^2\big(\theta(1-\rho)(\theta(1-\rho)+4\rho+2)+12\rho\big)}{(1-\rho^2)\big(\theta(1-\rho)+2\big)^2} = \frac{(1-\rho)\big(\theta(1-\rho)(\theta(1-\rho)+4\rho+2)+12\rho\big)}{(1+\rho)\big(\theta(1-\rho)+2\big)^2}, \tag{14}$$
and
$$D = \frac{2\rho(1-\rho)\Big[\theta(1-\rho)\Big(\theta\big((\theta-1)\rho^2-2(\theta+2)\rho+\theta+5\big)+6(\rho+1)\Big)+12\rho\Big]}{(1-\rho^2)\big(\theta(1-\rho)+2\big)^3} = \frac{2\rho\Big[\theta(1-\rho)\Big(\theta\big((\theta-1)\rho^2-2(\theta+2)\rho+\theta+5\big)+6(\rho+1)\Big)+12\rho\Big]}{(1+\rho)\big(\theta(1-\rho)+2\big)^3}. \tag{15}$$
It can be seen that $\Phi_a(0) = \frac{(2+\theta)\theta^3}{\theta^3(2+\theta)} = 1$.
Based on Equations (12)–(15) and the properties of inverse LTs, we conclude that the distribution of the innovation sequence $\{a_t\}$ is composed of a discrete component at 0, with probability $\rho^2$, and a generalized mixture of the exponential$\big(\frac{2+\theta}{\rho}\big)$, gamma$(3,\theta)$, gamma$(2,\theta)$, and exponential$(\theta)$ distributions, with probability $1-\rho^2$. □
Figure 1 depicts the distribution of the innovation process through its density curves. From Figure 1a, it is clear that, as $\rho$ increases over the interval (0, 0.5), the values of the innovation density increase, while for $0.5 \leq \rho < 1$ they decrease. Furthermore, the plots in Figure 1b,c indicate that smaller $\theta$ values lead to distributions with heavier tails. In summary, the innovation density is both unimodal and right-skewed.
As a result of the previously mentioned theorem about the distribution of the innovation term, the stationary process { X t } in Equation (2) can be reformulated as follows.
Definition 2. 
The SBL-AR(1) process $\{X_t\}$ in Equation (2) is restated as
$$X_t = \begin{cases} \rho X_{t-1}, & \textit{with probability } \rho^2, \\ \rho X_{t-1} + \varepsilon_t, & \textit{with probability } 1-\rho^2. \end{cases} \tag{16}$$
Or, in other terms,
$$X_t = \rho X_{t-1} + I_t\, \varepsilon_t, \tag{17}$$
where $I_t$ is an indicator variable such that $P(I_t = 0) = 1 - P(I_t = 1) = \rho^2$.
Figure 2 depicts sample paths of the process in Equation (16) for different values of the parameters $\theta$ and $\rho$. We generated 200 observations from the SBL-AR(1) process by setting $\theta$ = 1.6, 2, and 2.5 and $\rho$ = 0.2, 0.5, and 0.7. These plots illustrate the behavior of the SBL-AR(1) process; Figure 2 shows that the simulated series are stationary and take positive values.

3. Structural Properties Associated with the SBL-AR(1) Model

This section focuses on the development of the SBL-AR(1) model’s conditional mean and variance, along with the derivation of its multi-step-ahead conditional Laplace transform. The mean and variance are required to establish the SBL-AR(1) prediction equations, while the Laplace transform provides insight into the joint distribution of vectors produced by the process. In this section, we outline some statistical properties of the SBL-AR(1) model and discuss each one in detail. The conditional statistical measures of the SBL-AR(1) process are obtained using the same methodology as outlined by Bakouch and Popović [7].

3.1. Some Statistical Conditional Measures

The one- and multi-step-ahead conditional mean and variance of the SBL-AR(1) process are based on the mean and variance of the random variable $\varepsilon_t$ and the innovation term $a_t$; thus, we first compute these quantities. According to the definition of the SBL-AR(1) process outlined in Equation (16), the random variable $\varepsilon_t$ has the following mean and variance, respectively:
$$E(\varepsilon_t) = \frac{1}{1+\rho}\, E(X_t) = \frac{1}{1+\rho}\,\frac{2(\theta+3)}{\theta(\theta+2)}, \tag{18}$$
$$\mathrm{Var}(\varepsilon_t) = \frac{2\big[6+12\rho-12\rho^2+\theta^2(1+2\rho-\rho^2)+6\theta(1+2\rho-\rho^2)\big]}{\theta^2(2+\theta)^2(1+\rho)^2}. \tag{19}$$
Consequently, the mean and variance of the innovation process $a_t = I_t \varepsilon_t$ are, respectively, stated as
$$E(a_t) = (1-\rho)\,\frac{2(\theta+3)}{\theta(\theta+2)}, \tag{20}$$
$$\mathrm{Var}(a_t) = (1-\rho^2)\,\frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2}. \tag{21}$$
By utilizing these properties, the one-step-ahead conditional mean for the process in Equation (16) can be formulated as
$$E(X_t \mid X_{t-1} = x_{t-1}) = \rho x_{t-1} + E(I_t \varepsilon_t) = \rho x_{t-1} + (1-\rho)\,\frac{2(\theta+3)}{\theta(\theta+2)}. \tag{22}$$
Consequently, the formula for the $(k+1)$-step-ahead conditional mean is expressed as
$$E(X_{t+k} \mid X_{t-1} = x_{t-1}) = \rho^{k+1} x_{t-1} + (1-\rho^{k+1})\,\frac{2(\theta+3)}{\theta(\theta+2)}. \tag{23}$$
When $k \to \infty$, this expression tends to the unconditional mean of the main process:
$$E(X_{t+k} \mid X_{t-1} = x_{t-1}) \to \frac{2(\theta+3)}{\theta(\theta+2)}.$$
The expressions for the one-step- and $(k+1)$-step-ahead conditional variance of the proposed model, respectively, are obtained as
$$\mathrm{Var}(X_t \mid X_{t-1} = x_{t-1}) = (1-\rho^2)\,\frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2}, \tag{24}$$
$$\mathrm{Var}(X_{t+k} \mid X_{t-1} = x_{t-1}) = (1-\rho^{2(k+1)})\,\frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2}. \tag{25}$$
Observe that when $k \to \infty$ in Equation (25), the unconditional variance of the main process is obtained:
$$\mathrm{Var}(X_{t+k} \mid X_{t-1} = x_{t-1}) \to \frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2}.$$
The multi-step-ahead conditional LT of the SBL-AR(1) process is obtained as
$$
\begin{aligned}
\Phi_{X_{t+k} \mid X_{t-1}=x_{t-1}}(s) &= E\big(e^{-sX_{t+k}} \mid X_{t-1}=x_{t-1}\big) = E\Big(e^{-s\big(\rho^{k+1}X_{t-1} + \sum_{i=0}^{k}\rho^{k-i} a_{t+i}\big)} \,\Big|\, X_{t-1}=x_{t-1}\Big) \\
&= e^{-s\rho^{k+1} x_{t-1}} \prod_{i=0}^{k} \Phi_a\big(\rho^{k-i} s\big) = e^{-s\rho^{k+1} x_{t-1}} \prod_{i=0}^{k} \frac{(2+\theta+s\rho^{k-i})(s\rho^{k+1-i}+\theta)^3}{(s\rho^{k-i}+\theta)^3 (2+\theta+s\rho^{k+1-i})} \\
&= e^{-s\rho^{k+1} x_{t-1}}\, \frac{(2+\theta+s)(s\rho^{k+1}+\theta)^3}{(s+\theta)^3 (2+\theta+s\rho^{k+1})}.
\end{aligned}
$$
As $k \to \infty$, we obtain
$$\Phi_{X_{t+k} \mid X_{t-1}=x_{t-1}}(s) \to \frac{\theta^3(2+\theta+s)}{(2+\theta)(s+\theta)^3}. \tag{26}$$
Note that the limit in Equation (26) is equivalent to Equation (9), the unconditional LT of the process $\{X_t\}$.

3.2. Joint Distribution, Autocorrelation, and Spectral Density Function

The joint LT of $(X_{t-1}, X_t)$ is expressed as follows:
$$
\Phi_{X_{t-1},X_t}(s_1,s_2) = E\big(e^{-s_1 X_{t-1} - s_2 X_t}\big) = E\big(e^{-s_1 X_{t-1} - s_2(\rho X_{t-1} + a_t)}\big) = \Phi_X(s_1 + s_2\rho)\, \Phi_a(s_2) = \frac{\theta^3\big(2+\theta+(s_1+s_2\rho)\big)}{(2+\theta)\big((s_1+s_2\rho)+\theta\big)^3} \cdot \frac{(2+s_2+\theta)(s_2\rho+\theta)^3}{(s_2+\theta)^3(2+s_2\rho+\theta)}.
$$
The SBL-AR(1) process is not time-reversible because the joint LT $\Phi_{X_{t-1},X_t}(s_1,s_2)$ is not symmetric in $s_1$ and $s_2$.
Following a few basic calculations, the autocovariance and autocorrelation functions at lag $k$ of the proposed process are, respectively, given by
$$\gamma(k) = \rho^k\, \mathrm{var}(X_t) = \rho^k\, \frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2},$$
$$\eta(k) = \rho^k.$$
The spectral density function of a stationary process is defined as the Fourier transform of its (absolutely summable) autocovariance function. Hence, the spectral density function of the SBL-AR(1) process is obtained as follows:
$$f_X(w) = \frac{1}{2\pi} \sum_{z=-\infty}^{\infty} \gamma(z)\, e^{-iwz}, \qquad w \in [-\pi,\pi].$$
Given that $\sum_{z=-\infty}^{\infty} |\gamma(z)| < \infty$, substituting $\gamma(z)$ into the last equation yields
$$f_X(w) = \frac{(\theta^2+6\theta+6)(1-\rho^2)}{\pi\,\theta^2(\theta+2)^2\,(1+\rho^2-2\rho\cos w)}. \tag{27}$$
The parametric spectral density estimator is obtained by replacing the parameters $\theta$ and $\rho$ on the right-hand side of Equation (27) with their corresponding Gaussian estimators $\hat{\theta}_{GE}$ and $\hat{\rho}_{GE}$, which will be discussed later.
Figure 3a displays the spectral density estimator of the SBL-AR(1) process given by Equation (27) for $\hat{\theta}_{GE} = 1.24, 1.6$ and $\hat{\rho}_{GE} = 0.97$, where these estimated values are obtained later in Section 5. Figure 3b depicts the theoretical SBL-AR(1) spectral density function at $\theta = 1.5$ and different values of $\rho$. As the value of $\rho$ increases, the spectral density function becomes more sharply peaked. Also, from Figure 3c, it is clear that, at $\rho = 0.5$ and various $\theta$ values, the spectral density function becomes flatter as $\theta$ increases. Further, the curves in Figure 3a closely resemble those in Figure 3b.
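For reference, Equation (27) is a one-line function in R (a sketch; the function name is ours), and the plug-in estimator is obtained by passing $\hat{\theta}_{GE}$ and $\hat{\rho}_{GE}$:

```r
# Spectral density of the SBL-AR(1) process, Equation (27), for w in [-pi, pi].
sbl_ar1_spec <- function(w, theta, rho)
  (theta^2 + 6 * theta + 6) * (1 - rho^2) /
    (pi * theta^2 * (theta + 2)^2 * (1 + rho^2 - 2 * rho * cos(w)))

curve(sbl_ar1_spec(x, theta = 1.5, rho = 0.5), from = -pi, to = pi,
      xlab = "w", ylab = expression(f[X](w)))
```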

4. Parameter Estimation and Simulation Studies

This section is devoted to estimating the parameters involved in the process, specifically $\rho$ and $\theta$. Let $(X_1, X_2, \ldots, X_n)$ represent a realization from the SBL-AR(1) process. The next subsections discuss the conditional least squares and Gaussian estimation techniques. Additionally, a simulation study is performed.

4.1. Estimation via Conditional Least Squares Procedure

The conditional least squares (CLS) estimators of $\rho$ and $\theta$, denoted as $\hat{\rho}_{CLS}$ and $\hat{\theta}_{CLS}$, are derived by minimizing the conditional sum-of-squares function
$$Q_n(\rho,\theta) = \sum_{t=2}^{n} \big[X_t - E(X_t \mid X_{t-1})\big]^2.$$
By utilizing Equation (22) for $E(X_t \mid X_{t-1})$, $Q_n(\rho,\theta)$ takes the form
$$Q_n(\rho,\theta) = \sum_{t=2}^{n} \left[X_t - \rho X_{t-1} - (1-\rho)\,\frac{2(\theta+3)}{\theta(\theta+2)}\right]^2.$$
Consequently, setting $\frac{2(\theta+3)}{\theta(\theta+2)} = \mu_X$, the previous equation can be rewritten as
$$Q_n(\rho,\theta) = \sum_{t=2}^{n} \big[X_t - \rho X_{t-1} - (1-\rho)\mu_X\big]^2. \tag{28}$$
Estimates of $\rho$ and $\mu_X$ are obtained by solving the normal equations derived from Equation (28), which are as follows:
$$\hat{\rho}_{CLS} = \frac{(n-1)\sum_{t=2}^{n} X_t X_{t-1} - \sum_{t=2}^{n} X_t \sum_{t=2}^{n} X_{t-1}}{(n-1)\sum_{t=2}^{n} X_{t-1}^2 - \big(\sum_{t=2}^{n} X_{t-1}\big)^2}, \tag{29}$$
$$\hat{\mu}_{X,CLS} = \frac{\sum_{t=2}^{n} X_t - \hat{\rho}\sum_{t=2}^{n} X_{t-1}}{(n-1)(1-\hat{\rho})}. \tag{30}$$
As $\mu_X = \frac{2(\theta+3)}{\theta(\theta+2)}$, the estimator of $\theta$ is obtained as
$$\hat{\theta}_{CLS} = \frac{-(\hat{\mu}_{X,CLS}-1) + \sqrt{(\hat{\mu}_{X,CLS}-1)^2 + 6\hat{\mu}_{X,CLS}}}{\hat{\mu}_{X,CLS}}, \tag{31}$$
where $\hat{\mu}_{X,CLS}$ is given by Equation (30).
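Equations (29)–(31) are closed-form, so the CLS estimators can be computed directly; the following R sketch (our own function name; `x` is the observed series) implements them:

```r
# CLS estimators of rho and theta from Equations (29)-(31).
cls_estimates <- function(x) {
  n  <- length(x)
  xt <- x[2:n]                 # X_t,     t = 2, ..., n
  xl <- x[1:(n - 1)]           # X_{t-1}, t = 2, ..., n
  rho_hat <- ((n - 1) * sum(xt * xl) - sum(xt) * sum(xl)) /
             ((n - 1) * sum(xl^2) - sum(xl)^2)
  mu_hat  <- (sum(xt) - rho_hat * sum(xl)) / ((n - 1) * (1 - rho_hat))
  theta_hat <- (-(mu_hat - 1) + sqrt((mu_hat - 1)^2 + 6 * mu_hat)) / mu_hat
  c(rho = rho_hat, theta = theta_hat)
}
```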

4.2. Gaussian Estimation Approach

Whittle [17] proposed this approach by utilizing the Gaussian likelihood function as the baseline distribution for estimation. Subsequently, Crowder [18] applied this estimation technique to analyze correlated binomial data. Both Al-Nachawati et al. [19] and Alwasel et al. [20] employed the same estimation technique within the context of a first-order autoregressive process. Despite its approximate nature, this method provides a reliable estimation for the proposed model. The Gaussian estimation (GE) approach is based on the one-step conditional expectation and variance of the model. The conditional maximum likelihood function is expressed as follows:
$$L(\rho,\theta) = f(x_1) \prod_{t=2}^{n} f(x_t \mid x_{t-1}).$$
In this context, $f(x_t \mid x_{t-1})$ and $f(x_1)$ represent the conditional and marginal probability functions of $X_t \mid X_{t-1}$ and $X_t$, respectively. We assume that both $f(x_t \mid x_{t-1})$ and $f(x_1)$ follow a Gaussian PDF, with the conditional mean and conditional variance serving as their parameters. Thus, the likelihood function can be formulated as follows:
$$L(\rho,\theta) = (2\pi)^{-\frac{n}{2}} \big(\sigma_{x_{t-1}}^2\big)^{-\frac{n}{2}} \exp\left(-\sum_{t=2}^{n} \frac{(x_t - \mu_{x_{t-1}})^2}{2\sigma_{x_{t-1}}^2}\right).$$
Consequently, the log-likelihood function is given by
$$l(\rho,\theta) = \log L(\rho,\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left[\frac{(x_t-\mu_{x_{t-1}})^2}{\sigma_{x_{t-1}}^2} + \log \sigma_{x_{t-1}}^2\right],$$
where $\mu_{x_{t-1}} = E(X_t \mid X_{t-1} = x_{t-1})$ and $\sigma_{x_{t-1}}^2 = \mathrm{Var}(X_t \mid X_{t-1} = x_{t-1})$ are the one-step conditional mean and variance defined by Equations (22) and (24), respectively. Hence, the Gaussian log-likelihood function of the SBL-AR(1) process takes the form
$$l(\rho,\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[(1-\rho^2)\,\frac{2(\theta^2+6\theta+6)}{\theta^2(\theta+2)^2}\right] + \frac{\theta^2(\theta+2)^2}{2(1-\rho^2)(\theta^2+6\theta+6)}\left[x_t - \rho x_{t-1} - (1-\rho)\,\frac{2(\theta+3)}{\theta(\theta+2)}\right]^2\right\}. \tag{32}$$
Therefore, the Gaussian estimators, termed $\hat{\rho}_{GE}$ and $\hat{\theta}_{GE}$, can be derived by solving the system of equations $\frac{\partial l(\rho,\theta)}{\partial \rho} = 0$ and $\frac{\partial l(\rho,\theta)}{\partial \theta} = 0$.
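In practice, this system has no closed-form solution, and Equation (32) is maximized numerically. A minimal sketch follows (our own function name; the L-BFGS-B bounds enforcing $0 < \rho < 1$ and $\theta > 0$ are our choice here, whereas the real-data analysis in Section 5 uses the CG method):

```r
# Gaussian estimation: minimize the negative of Equation (32) with optim.
ge_estimates <- function(x, start = c(rho = 0.5, theta = 1.5)) {
  negloglik <- function(par) {
    rho <- par[1]; theta <- par[2]
    v <- (1 - rho^2) * 2 * (theta^2 + 6 * theta + 6) / (theta^2 * (theta + 2)^2)
    m <- rho * x[-length(x)] + (1 - rho) * 2 * (theta + 3) / (theta * (theta + 2))
    r <- x[-1] - m                              # x_t - E(x_t | x_{t-1})
    0.5 * (length(r) * log(2 * pi) + sum(log(v) + r^2 / v))
  }
  optim(start, negloglik, method = "L-BFGS-B",
        lower = c(1e-4, 1e-4), upper = c(1 - 1e-6, Inf))$par
}
```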
Crowder [18] indicated that when employing the Gaussian method for estimating a parameter $\Omega$, the expression $\sqrt{n}(\hat{\Omega} - \Omega)$ is asymptotically normally distributed with a mean of zero and an asymptotic variance of $[J(\Omega)]^{-1}$, where $J(\Omega)$ denotes the conditional expected information matrix. An approximation can be achieved using the observed conditional information matrix, as discussed by Bakouch and Popović [7].
Now, we conduct a simulation study in the next subsection to check the performance of the CLS and GE estimation methods.

4.3. Monte Carlo Simulation and Experimental Analysis

In this subsection, we perform a simulation study to check the validity of the estimation methods used for the model parameters. The consistency and behavior of the CLS and GE parameter estimation techniques for the SBL-AR(1) process are investigated through a Monte Carlo simulation. For each setting, 1000 replications with sample sizes of 50, 100, 500, and 1000 are simulated from the SBL-AR(1) process with the following true parameter values:
  • (a) $\rho = 0.97$, $\theta = 1.24$;   (b) $\rho = 0.97$, $\theta = 1.6$;
  • (c) $\rho = 0.5$, $\theta = 1.4$;   (d) $\rho = 0.7$, $\theta = 1.4$;
  • (e) $\rho = 0.1$, $\theta = 1.5$;   (f) $\rho = 0.5$, $\theta = 1.5$.
The mean squared error (MSE) is used to assess the performance of the estimates and for comparison purposes.
A step-by-step simulation algorithm for the SBL-AR(1) process is provided as follows (Algorithm 1):
Algorithm 1: Simulation algorithm for the SBL-AR(1) process
1.
Specify the values for the parameters ρ , θ , sample size (n) and the number of iterations ( N = 1000 ).
2.
Generate the innovation random variable $a_t\,(= I_t \varepsilon_t)$, where $I_t$ follows a Bernoulli distribution with parameter $1-\rho^2$ and $\varepsilon_t$ follows the mixture of exponential and gamma distributions, generated using the rexp and rgamma functions.
3.
Simulate the SBL-AR(1) process using the relation in Equation (17).
4.
For each generated sample, estimate the parameters using GE via the optim function in R, which maximizes the likelihood in Equation (32), along with the CLS approach outlined in Equations (29)–(31).
5.
Repeat steps 2–4 for $n$ = 50, 100, 500, and 1000.
6.
Compute the average of the estimates, biases, and the MSE.
The values of $\rho$ and $\theta$ used in the simulation study satisfy the constraints $\theta \geq 1$ and $0 < \rho < 1$. Notably, some of these values correspond to the parameter estimates obtained from the real-world datasets analyzed in this study.
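The following R sketch implements Algorithm 1 for a single replication. One caveat: since the weight $A$ in Equation (3) is negative, $h(x)$ is a generalized (not ordinary) mixture, so the innovation cannot be drawn by simply sampling a component at random; the sketch below instead uses the positive components of $h$ as a rejection-sampling envelope. All function names are ours, and the rejection step is one possible implementation choice, not necessarily the one used in the paper; it assumes $\theta \geq 1$ so that $h \geq 0$ (Proposition 1).

```r
# Weights of the generalized mixture h(x) in Equation (3); D via sum-to-one.
h_weights <- function(theta, rho) {
  b <- theta * (1 - rho) + 2
  A <- -8 * (1 - rho) * rho^2 / ((1 - rho^2) * b^3)      # exp((theta+2)/rho), negative
  B <- 2 * (1 - rho)^3 / ((1 - rho^2) * b)               # gamma(3, theta)
  C <- (1 - rho)^2 * (theta * (1 - rho) * (theta * (1 - rho) + 4 * rho + 2) +
       12 * rho) / ((1 - rho^2) * b^2)                   # gamma(2, theta)
  c(A = A, B = B, C = C, D = 1 - A - B - C)              # exp(theta)
}

h_density <- function(x, theta, rho) {
  w <- h_weights(theta, rho); r <- (theta + 2) / rho
  w["A"] * dexp(x, r) + w["B"] * dgamma(x, 3, theta) +
    w["C"] * dgamma(x, 2, theta) + w["D"] * dexp(x, theta)
}

r_innov <- function(theta, rho) {              # one draw from h by rejection
  w  <- h_weights(theta, rho)
  wp <- w[c("B", "C", "D")]; s <- sum(wp)      # positive part dominates h
  repeat {
    j <- sample(3, 1, prob = wp / s)
    x <- switch(j, rgamma(1, 3, theta), rgamma(1, 2, theta), rexp(1, theta))
    g <- (wp[1] * dgamma(x, 3, theta) + wp[2] * dgamma(x, 2, theta) +
          wp[3] * dexp(x, theta)) / s
    if (runif(1) < h_density(x, theta, rho) / (s * g)) return(x)
  }
}

r_sblar1 <- function(n, theta, rho) {          # simulate via Equation (17)
  x <- numeric(n)
  # X_1 ~ SBL(theta): mixture theta/(theta+2) Gamma(2) + 2/(theta+2) Gamma(3),
  # so the simulated path starts in the stationary regime.
  x[1] <- if (runif(1) < theta / (theta + 2)) rgamma(1, 2, theta)
          else rgamma(1, 3, theta)
  for (t in 2:n) {
    a <- if (runif(1) < rho^2) 0 else r_innov(theta, rho)   # a_t = I_t * eps_t
    x[t] <- rho * x[t - 1] + a
  }
  x
}
```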
Table 1 and Table 2 display the mean estimates, bias, and MSE; the bias and MSE are provided in parentheses as (bias, MSE). Generally, both approaches performed well and effectively. For both methodologies, the bias and MSE of all estimates tended to zero as the sample size increased, and the estimates became closer to the true values. In terms of MSE and the mean estimates, the GE outperformed the CLS for $(\rho, \theta)$ = (0.1, 1.5), (0.97, 1.24), (0.97, 1.6), (0.5, 1.4), and (0.7, 1.4). However, the CLS showed better performance for $(\rho, \theta)$ = (0.5, 1.5).

5. Real-Life Data Analysis and Model Selection

In this section, we assess the applicability of the proposed model by utilizing two real-life datasets.
To illustrate and evaluate the performance and competitiveness of the proposed model, we investigated two real-life datasets, outlined as follows:
  • The first dataset consists of 451 observations, representing the monthly University of Michigan Inflation Expectation (MICH) from 5 January 1984 to 1 November 2021. These data can be found at https://fred.stlouisfed.org/series/MICH (accessed on 12 June 2024).
  • The second dataset comprises 221 observations that represent the turbidity of water quality in Brisbane, measured every 10 min during the period from 23 June 2024, at 07:10, to 24 June 2024, at 19:30. These data can be obtained from https://www.kaggle.com/datasets/downshift/water-quality-monitoring-dataset (accessed on 3 October 2024).
The time series, autocorrelation (ACF), and partial autocorrelation (PACF) functions for the two datasets are displayed in Figure 4 and Figure 5, respectively. Based on these plots, we can conclude that the PACF cuts off after lag one, indicating that the datasets are appropriate for an AR(1) model. Additionally, the ACF dies down rapidly. These two figures suggest that the two datasets are stationary. We further validate this conclusion through a stationarity test as follows.
The augmented Dickey–Fuller (ADF) test is a statistical tool used to determine whether a time series is stationary or not. If the p-value from the test is less than the designated significance level (0.05), we reject the null hypothesis of non-stationarity, indicating that the time series is stationary. The ADF test was applied using the adf.test function from the tseries package in R. According to the p-values shown for each dataset in Table 3, we can conclude that the datasets are stationary.
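For reproducibility, the test reduces to a single call (a sketch; `x` denotes either series as a numeric vector):

```r
library(tseries)
adf.test(x)   # p-value < 0.05 rejects the unit-root (non-stationarity) null
```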
To compare the proposed SBL-AR(1) process with its competitors, we utilize the two datasets above alongside the following relevant non-Gaussian AR(1) models, listed with their Gaussian log-likelihood functions.
  • E-AR(1) with exponential marginals (Gaver and Lewis [1]):
$$l(\rho,\lambda) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[\frac{1-\rho^2}{\lambda^2}\right] + \frac{\lambda^2}{1-\rho^2}\left[x_t - \rho x_{t-1} - \frac{1-\rho}{\lambda}\right]^2\right\}.$$
  • G-AR(1) with gamma marginals (Gaver and Lewis [1]):
$$l(\rho,\lambda,k) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[\frac{k(1-\rho^2)}{\lambda^2}\right] + \frac{\lambda^2}{k(1-\rho^2)}\left[x_t - \rho x_{t-1} - \frac{k(1-\rho)}{\lambda}\right]^2\right\}.$$
  • INGAR(1)-I with inverse Gaussian marginals (Abraham and Balakrishna [4]):
$$l(\rho,\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[(1-\rho^2)\mu\right] + \frac{1}{(1-\rho^2)\mu}\big(x_t - \rho x_{t-1} - (1-\rho)\mu\big)^2\right\}.$$
  • INGAR(1)-II with inverse Gaussian marginals (Abraham and Balakrishna [4]):
$$l(\rho,\lambda,\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[\frac{(1-\rho^2)\mu^3}{\lambda}\right] + \frac{\lambda}{(1-\rho^2)\mu^3}\big(x_t - \rho x_{t-1} - (1-\rho)\mu\big)^2\right\}.$$
  • L-AR(1) with Lindley marginals (Bakouch and Popović [7]):
$$l(\rho,\lambda) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[(1-\rho^2)\,\frac{\lambda^2+4\lambda+2}{\lambda^2(1+\lambda)^2}\right] + \frac{\lambda^2(1+\lambda)^2}{(1-\rho^2)(\lambda^2+4\lambda+2)}\left[x_t - \rho x_{t-1} - (1-\rho)\,\frac{\lambda+2}{\lambda(1+\lambda)}\right]^2\right\}.$$
  • GaL-AR(1) with gamma-Lindley marginals (Mello et al. [9]), whose marginal mean and variance are $\frac{2\beta(1+\lambda)-\lambda}{\lambda\beta(1+\lambda)}$ and $\frac{2\beta^2(1+\lambda)^2-\lambda^2}{\lambda^2\beta^2(1+\lambda)^2}$, respectively:
$$l(\rho,\lambda,\beta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[(1-\rho^2)\,\frac{2\beta^2(1+\lambda)^2-\lambda^2}{\lambda^2\beta^2(1+\lambda)^2}\right] + \frac{\lambda^2\beta^2(1+\lambda)^2}{(1-\rho^2)\big(2\beta^2(1+\lambda)^2-\lambda^2\big)}\left[x_t - \rho x_{t-1} - (1-\rho)\,\frac{2\beta(1+\lambda)-\lambda}{\lambda\beta(1+\lambda)}\right]^2\right\}.$$
  • AR-L(1) with Lindley innovations (Nitha and Krishnarani [21]):
$$l(\rho,\lambda) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{t=2}^{n}\left\{\log\left[\frac{\lambda^2+4\lambda+2}{\lambda^2(1+\lambda)^2(1-\rho^2)}\right] + \frac{\lambda^2(1+\lambda)^2(1-\rho^2)}{\lambda^2+4\lambda+2}\left[x_t - \rho x_{t-1} - \frac{\lambda+2}{\lambda(1+\lambda)}\right]^2\right\}.$$
The GE method was used to estimate the unknown parameters for all the considered models. Numerical methods were applied to determine these parameter values, utilizing the optim function in R along with the conjugate gradients (CG) method for this purpose. For each dataset and model, the Gaussian likelihood estimates, the Akaike information criterion (AIC), Bayesian information criterion (BIC), and Hannan–Quinn information criterion (HQIC) were computed. We evaluated model performance using information criteria statistics. The model that performs best is the one with the smallest values for these statistics. The results of the goodness-of-fit statistics and the GE estimates, including their standard errors (SE), are summarized in Table 4 and Table 5. From these tables, it is evident that the SBL-AR(1) model achieved the smallest values for AIC, BIC, and HQIC. Consequently, we can conclude that the proposed model performed well for both datasets; hence, the SBL-AR(1) model provides the best fit among the AR(1) models considered.
For each dataset, the residuals, a t , were computed from the fitted model, given by Equation (2), using the estimated parameter ρ ^ in Table 4 and Table 5. To assess the presence of autocorrelation in these residuals, both the Box–Pierce and Ljung–Box tests were performed. The results, summarized in Table 6, indicate that for all cases, the p-values exceeded 0.05, indicating no significant autocorrelation in the residuals of the fitted model.

6. Forecasting

Forecasting time series data is essential for predicting future trends in non-Gaussian contexts, where traditional Gaussian models may fail to capture skewed or fat-tailed distributions. The proposed SBL-AR(1) model utilizes size-biased Lindley marginals and employs both the classical conditional expectation method and machine learning techniques, demonstrating superior accuracy in predicting the considered real-world datasets.

6.1. Classical Conditional Expectation Method

This classical method, one of the most widely used forecasting approaches, relies on conditional expectations. The forecast $\hat{X}_{t+k}$ is the expected value of $X_{t+k}$ given all the information available up to time $t$. The one- and $k$-step-ahead forecasts of $X_{t+1}$ and $X_{t+k}$, respectively, follow from Equation (23):
$$\hat{X}_{t+1} = E(X_{t+1} \mid X_t = x_t) = \rho x_t + (1-\rho)\,\frac{2(\theta+3)}{\theta(\theta+2)},$$
and
$$\hat{X}_{t+k} = E(X_{t+k} \mid X_t = x_t) = \rho^k x_t + (1-\rho^k)\,\frac{2(\theta+3)}{\theta(\theta+2)}.$$
The parameters $\rho$ and $\theta$ are replaced with their GE estimates, giving
$$\hat{X}_{t+1} = \hat{\rho} x_t + (1-\hat{\rho})\,\frac{2(\hat{\theta}+3)}{\hat{\theta}(\hat{\theta}+2)},$$
and
$$\hat{X}_{t+k} = \hat{\rho}^k x_t + (1-\hat{\rho}^k)\,\frac{2(\hat{\theta}+3)}{\hat{\theta}(\hat{\theta}+2)}.$$
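These plug-in forecasts are immediate to compute; a minimal R sketch (our own function name; `rho_hat` and `theta_hat` are the GE estimates and `x_t` the last observed value):

```r
# k-step-ahead conditional-expectation forecast for the SBL-AR(1) process.
forecast_sbl_ar1 <- function(x_t, k, rho_hat, theta_hat) {
  mu <- 2 * (theta_hat + 3) / (theta_hat * (theta_hat + 2))   # unconditional mean
  rho_hat^k * x_t + (1 - rho_hat^k) * mu
}
forecast_sbl_ar1(x_t = 3.1, k = 1:5, rho_hat = 0.97, theta_hat = 1.24)
```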

6.2. Machine Learning Forecasting Methods

Unlike classical statistical methods, which rely on predefined theoretical assumptions, machine learning (ML) approaches are non-parametric and data-driven, learning patterns directly from observed data to adapt flexibly to complex structures. These methods typically involve training algorithms on historical data to learn the underlying relationships, which can then be used to make predictions about future values. To model the autoregressive behavior of the time series using ML methods, we adopted an AR(1) structure where the current observation $x_t$ is predicted using its immediate lag, $x_{t-1}$. The data were first transformed into a supervised learning format by creating a lag-1 feature. The dataset was then divided into training and testing sets (e.g., 80% and 20%, respectively). In the following subsections, we examine the use of three ML methods, namely support vector regression, extreme gradient boosting, and k-nearest neighbors, to forecast data generated by AR(1) processes.

6.2.1. Support Vector Regression

Support vector regression (SVR) models both linear and nonlinear relationships in time series data; it is adept at handling complex scenarios while remaining applicable to simpler linear cases such as the AR(1) process. It finds a function that predicts targets accurately within a small error margin $\epsilon$ (Smola and Schölkopf [22]), making it effective for time series forecasting.
Given a training set $\{(x_1,y_1),\ldots,(x_N,y_N)\}$, where $x_i$ represents the input features (e.g., lagged time series values) and $y_i$ denotes the corresponding target values, the goal is to find a linear function $f(x) = w^T x + b$ that minimizes the cost function
$$\min_{w,b,\xi,\xi^*}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\big(\xi_i + \xi_i^*\big) \quad \text{subject to} \quad y_i - (w^T x_i + b) \leq \epsilon + \xi_i, \quad (w^T x_i + b) - y_i \leq \epsilon + \xi_i^*, \quad \xi_i,\ \xi_i^* \geq 0,$$
where $w$ is the weight vector, $b$ is the bias term, $\xi_i$ and $\xi_i^*$ are slack variables allowing deviations beyond $\epsilon$, and $C$ controls the trade-off between model complexity and training error.
We employed the svm function from the e1071 package in R (version 4.4.3), specifying a linear kernel and epsilon-insensitive loss (type = “eps-regression”), which aligns with the assumptions of the AR(1) model.
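A sketch of this setup follows (the 80/20 chronological split and object names are ours; `x` is the series being forecast):

```r
# Lag-1 supervised format and a linear epsilon-SVR fit with e1071.
library(e1071)
df    <- data.frame(y = x[-1], lag1 = x[-length(x)])
n_tr  <- floor(0.8 * nrow(df))
train <- df[1:n_tr, ]
test  <- df[-(1:n_tr), ]
fit_svr  <- svm(y ~ lag1, data = train, kernel = "linear", type = "eps-regression")
pred_svr <- predict(fit_svr, newdata = test)
```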

6.2.2. Extreme Gradient Boosting

Extreme gradient boosting (XGBoost) is based on gradient-boosted decision trees, optimized for speed and performance (Chen and Guestrin [23]). XGBoost constructs its model by building decision trees sequentially, where each new tree is trained to correct the errors made by the previous ones. This process is guided by an objective function that balances the accuracy and complexity of the model. The objective function minimized during training combines a loss term measuring prediction errors (e.g., squared error) and a regularization term penalizing complexity to prevent overfitting, expressed as
$$\mathrm{Obj}(\theta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{k=1}^{K} \Omega(f_k),$$
where $f_k$ represents each regression tree and $\Omega(f_k)$ is a regularization function controlling the complexity of each tree.
The XGBoost algorithm was implemented in R using the xgboost package, with data formatted as a matrix or xgb.DMatrix for optimized performance and trained via xgb.train using the “reg:squarederror” objective function.
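A sketch under the same lag-1 train/test objects as the SVR example (the hyperparameters shown are placeholders; Section 6.3 notes that the key parameters are tuned by grid search):

```r
# Gradient-boosted trees on the lag-1 feature with xgboost.
library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(train["lag1"]), label = train$y)
dtest  <- xgb.DMatrix(data = as.matrix(test["lag1"]),  label = test$y)
fit_xgb <- xgb.train(params = list(objective = "reg:squarederror",
                                   eta = 0.1, max_depth = 3),
                     data = dtrain, nrounds = 100)
pred_xgb <- predict(fit_xgb, dtest)
```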

6.2.3. K-Nearest Neighbors

The K-nearest neighbors (KNN) algorithm predicts outcomes by measuring distances (e.g., $d(x, x_i) = |x - x_i|$) between a new data point and the training samples and selecting the $K$ closest neighbors. The final prediction is the average of these neighbors' target values:
$$\hat{y} = \frac{1}{K} \sum_{j=1}^{K} y_{(j)},$$
where $y_{(j)}$ denotes the $j$th neighbor's value (Kramer [24]).
We used the FNN package in R to apply the KNN method. The knn.reg function was used to make predictions based on the average of the closest points in the data.
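A sketch reusing the same train/test objects (K = 5 is a placeholder; Section 6.3 notes that it is tuned by grid search):

```r
# KNN regression on the lag-1 feature with FNN; predictions are in $pred.
library(FNN)
pred_knn <- knn.reg(train = as.matrix(train["lag1"]),
                    test  = as.matrix(test["lag1"]),
                    y = train$y, k = 5)$pred
```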

6.3. Forecasting Evaluation

To assess the performance of the forecasting models, several commonly used error metrics were employed to quantify the accuracy of the predictions. Below are the most commonly used measures:
  • Root mean squared error (RMSE): the square root of the MSE, in the same units as the target variable:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}.$$
  • Mean absolute error (MAE): the average absolute difference between predicted and actual values:
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |y_t - \hat{y}_t|.$$
  • Mean absolute percentage error (MAPE): the error as a percentage of the actual values:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n} \left|\frac{y_t - \hat{y}_t}{y_t}\right| \times 100\%.$$
For the considered machine learning methods, hyperparameter tuning was performed via grid search to determine the values of the key parameters that achieve the lowest RMSE for each method.
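The three metrics are one-liners in R (a sketch; `y` are the held-out observations and `yhat` the corresponding one-step-ahead predictions):

```r
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))
mape <- function(y, yhat) 100 * mean(abs((y - yhat) / y))   # in percent
```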
Table 7 and Table 8 evaluate the one-step-ahead forecasting performance of the classical method (SBL-AR(1) conditional expectation) and the machine learning methods. While XGBoost achieved the lowest MAE and MAPE for the MICH dataset and SVR excelled for the Turbidity dataset, the classical method demonstrated notable competitiveness. For the MICH dataset, the classical method outperformed SVR on all measures and outperformed KNN in MAE (0.1809712 vs. 0.1831944) and MAPE (6.098343% vs. 6.270608%); it also closely matched XGBoost. For the Turbidity dataset, although the machine learning methods are superior, the classical method's values remain close to those of the KNN method. These results highlight the competitiveness of the classical forecasting method under the SBL-AR(1) model against modern machine learning techniques: its parametric structure effectively captures skewed and fat-tailed data and avoids overfitting (that is, the predicted values track the true observed values). They also underscore that, while ML methods adapt flexibly to data patterns, the classical approach based on the structure of the SBL-AR(1) retains predictive power, particularly in scenarios where the parametric assumptions align well with the data's inherent dynamics.
Actual and predicted values for each dataset under all of the forecasting methods are shown in Figure 6 and Figure 7. The classical SBL-AR(1) method's predictions in these figures closely follow the actual data trends, showing smoother alignment than the machine learning methods. This strong fit arises because the SBL-AR(1) model is specifically designed for the data's skewed, positive-valued nature, utilizing its parametric structure. While some ML methods achieved slightly lower errors in Table 7 and Table 8, the figures confirm the classical method's ability to capture the data's inherent patterns without overfitting.

7. Conclusions

In this study, we introduced a first-order autoregressive process based on the size-biased Lindley distribution (SBL-AR(1)) model. We explored several theoretical properties of the process, including its innovation distribution, Laplace transform, conditional mean and variance, autocorrelation structure, and spectral density. To estimate the model parameters, we employed both conditional least squares and Gaussian estimation methods.
A comprehensive Monte Carlo simulation was conducted to evaluate the behavior of the estimators, demonstrating their efficiency. Furthermore, the applicability of the SBL-AR(1) model was illustrated using two real-world datasets. In both cases, the model provided a superior fit compared to several alternative non-Gaussian AR(1) processes, as evidenced by goodness-of-fit statistics. In addition, both the classical statistical method and machine learning techniques were used for forecasting; the classical method demonstrated strong competitiveness when compared with the machine learning methods.
Future work may consider extending the model to higher-order autoregressive structures.

Author Contributions

Conceptualization, H.S.B., M.M.G. and H.M.E.-T.; methodology, H.S.B., M.M.G. and H.M.E.-T.; software, H.M.E.-T.; validation, H.S.B., M.M.G. and H.M.E.-T.; formal analysis, H.S.B., M.M.G. and H.M.E.-T.; investigation, H.S.B., M.M.G. and H.M.E.-T.; resources, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; data curation, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; writing—original draft preparation, H.S.B., M.M.G. and H.M.E.-T.; writing—review and editing, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; visualization, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; supervision, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; project administration, H.S.B., M.M.G., H.M.E.-T. and S.M.A.A.; funding acquisition, S.M.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was funded by Umm Al-Qura University, Saudi Arabia under grant number: 25UQU4310037GSSR08.

Data Availability Statement

The original data presented in the study are available on the website https://fred.stlouisfed.org/series/MICH (accessed on 12 June 2024) and the website https://www.kaggle.com/datasets/downshift/water-quality-monitoring-dataset (accessed on 3 October 2024).

Acknowledgments

The authors extend their appreciation to Umm Al-Qura University, Saudi Arabia for funding this research work through grant number: 25UQU4310037GSSR08.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

SBL: Size-biased Lindley
SBL-AR(1): Size-biased Lindley autoregressive of order 1
ACF: Autocorrelation function
PACF: Partial autocorrelation function
ADF: Augmented Dickey–Fuller
CLS: Conditional least squares
GE: Gaussian estimation
MSE: Mean squared error
E-AR: Exponential autoregressive
G-AR: Gamma autoregressive
INGAR: Inverse Gaussian autoregressive
L-AR: Lindley autoregressive
GaL-AR: Gamma-Lindley autoregressive
AR-L: Autoregressive with Lindley innovations
AIC: Akaike information criterion
BIC: Bayesian information criterion
HQIC: Hannan–Quinn information criterion
SE: Standard error
ML: Machine learning
SVR: Support vector regression
XGBoost: Extreme gradient boosting
KNN: K-nearest neighbors
RMSE: Root mean squared error
MAE: Mean absolute error
MAPE: Mean absolute percentage error

References

  1. Gaver, D.P.; Lewis, P.A. First-order autoregressive gamma sequences and point processes. Adv. Appl. Probab. 1980, 12, 727–745. [Google Scholar] [CrossRef]
  2. Sim, C.H. Simulation of Weibull and gamma autoregressive stationary process. Commun. Stat. Simul. Comput. 1986, 15, 1141–1146. [Google Scholar] [CrossRef]
  3. Mališić, J.D. Mathematical Statistics and Probability Theory: Volume B; Springer: Berlin/Heidelberg, Germany, 1987. [Google Scholar]
  4. Abraham, B.; Balakrishna, N. Inverse Gaussian autoregressive models. J. Time Ser. Anal. 1999, 20, 605–618. [Google Scholar] [CrossRef]
  5. Jose, K.K.; Tomy, L.; Sreekumar, J. Autoregressive processes with normal-Laplace marginals. Stat. Probab. Lett. 2008, 78, 2456–2462. [Google Scholar] [CrossRef]
  6. Popović, B.V. AR (1) time series with approximated beta marginal. Publ. Inst. Math. 2010, 88, 87–98. [Google Scholar] [CrossRef]
  7. Bakouch, H.S.; Popović, B.V. Lindley first-order autoregressive model with applications. Commun. Stat. Theory Methods 2016, 45, 4988–5006. [Google Scholar] [CrossRef]
  8. Nitha, K.U.; Krishnarani, S.D. On a class of time series model with double Lindley distribution as marginals. Statistica 2021, 81, 365–382. [Google Scholar]
  9. Mello, A.B.; Lima, M.C.; Nascimento, A.D. The title of the cited article. Environmetrics 2022, 33, e2724. [Google Scholar] [CrossRef]
  10. Jilesh, V.; Jayakumar, K. On first order autoregressive asymmetric logistic process. J. Indian Soc. Probab. Stat. 2023, 24, 93–110. [Google Scholar] [CrossRef]
  11. Nitha, K.U.; Krishnarani, S.D. Exponential-Gaussian distribution and associated time series models. Revstat Stat. J. 2023, 21, 557–572. [Google Scholar]
  12. Patil, G.P.; Rao, C.R. Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics 1978, 34, 179–189. [Google Scholar] [CrossRef]
  13. Scheaffer, R. Size-biased sampling. Technometrics 1972, 14, 635–644. [Google Scholar] [CrossRef]
  14. Singh, S.K.; Maddala, G.S. Modeling Income Distributions and Lorenz Curves; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  15. Drummer, T.D.; McDonald, L.L. Size bias in line transect sampling. Biometrics 1987, 43, 13–21. [Google Scholar] [CrossRef]
  16. Ayesha, A. Size biased Lindley distribution and its properties a special case of weighted distribution. J. Appl. Math. 2017, 8, 808–819. [Google Scholar] [CrossRef]
  17. Whittle, P. Gaussian estimation in stationary time series. Bull. Int. Stat. Inst. 1961, 39, 105–129. [Google Scholar]
  18. Crowder, M. Gaussian estimation for correlated binomial data. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1985, 47, 229–237. [Google Scholar] [CrossRef]
  19. Al-Nachawati, H.; Alwasel, I.; Alzaid, A.A. Estimating the parameters of the generalized Poisson AR (1) process. J. Stat. Comput. Simul. 1997, 56, 337–352. [Google Scholar] [CrossRef]
  20. Alwasel, I.; Alzaid, A.; Al-Nachawati, H. Estimating the parameters of the binomial autoregressive process of order one. Appl. Math. Comput. 1998, 95, 193–204. [Google Scholar] [CrossRef]
  21. Nitha, K.U.; Krishnarani, S.D. On autoregressive processes with Lindley-distributed innovations: Modeling and simulation. Stat. Transit. New Ser. 2024, 25, 31–47. [Google Scholar] [CrossRef]
  22. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  23. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  24. Kramer, O. Dimensionality reduction by unsupervised k-nearest neighbor regression. In Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, USA, 18–21 December 2011; pp. 275–278. [Google Scholar]
Figure 1. Density curves for the innovation process $a_t = I_t \varepsilon_t$: (a) $\theta$ = 1.5; (b) $\rho$ = 0.97; (c) $\rho$ = 0.2.
Figure 2. Sample paths of the SBL-AR(1) process for (A) $\theta = 1.6$, $\rho = 0.2$; (B) $\theta = 2$, $\rho = 0.2$; (C) $\theta = 2.5$, $\rho = 0.5$; (D) $\theta = 2.5$, $\rho = 0.7$.
Figure 3. Spectral density curves of the SBL-AR(1) process: (a) $\rho = 0.97$; (b) $\theta = 1.5$; (c) $\rho = 0.5$.
Figure 4. The time series, ACF, and PACF plots of the monthly University of Michigan Inflation Expectation (MICH).
Figure 5. The time series, ACF, and PACF plots of the turbidity of water quality in Brisbane.
Figure 6. Actual and predicted values of the University of Michigan Inflation Expectation.
Figure 7. Actual and predicted values of the turbidity of water quality in Brisbane.
Table 1. Average, bias, and MSE (in parentheses) of the estimates for some different values of the parameters $\rho$ and $\theta$.

(a) $(\rho, \theta)$ = (0.97, 1.24)

| n | $\hat{\rho}_{CLS}$ | $\hat{\theta}_{CLS}$ | $\hat{\rho}_{GE}$ | $\hat{\theta}_{GE}$ |
|---|---|---|---|---|
| 50 | 0.905966 (−0.064034, 0.008183) | 1.467342 (0.227342, 0.344839) | 0.976374 (0.007344, 0.000575) | 1.251286 (0.012526, 0.021865) |
| 100 | 0.93888 (−0.03112, 0.002239) | 1.446866 (0.206866, 0.248877) | 0.975805 (0.006775, 0.000366) | 1.262123 (0.023363, 0.019913) |
| 500 | 0.963309 (−0.006691, 0.000184) | 1.401944 (0.161944, 0.062201) | 0.977312 (0.008282, 0.000164) | 1.2481 (0.00934, 0.004442) |
| 1000 | 0.966316 (−0.003684, 0.000082) | 1.287918 (0.047918, 0.010303) | 0.977207 (0.008177, 0.00012) | 1.249197 (0.010437, 0.002362) |

(b) $(\rho, \theta)$ = (0.97, 1.6)

| n | $\hat{\rho}_{CLS}$ | $\hat{\theta}_{CLS}$ | $\hat{\rho}_{GE}$ | $\hat{\theta}_{GE}$ |
|---|---|---|---|---|
| 50 | 0.904868 (−0.065132, 0.009001) | 1.743993 (0.143993, 0.465764) | 0.973699 (0.004669, 0.000633) | 1.589438 (−0.008962, 0.030876) |
| 100 | 0.939137 (−0.030863, 0.002211) | 1.731105 (0.131105, 0.31464) | 0.975531 (0.006501, 0.000409) | 1.601991 (0.003591, 0.027661) |
| 500 | 0.962916 (−0.007084, 0.000178) | 1.695561 (0.095561, 0.062953) | 0.974227 (0.005197, 0.000148) | 1.596723 (−0.001677, 0.00524) |
| 1000 | 0.966588 (−0.003412, 0.00008) | 1.626271 (0.026271, 0.01708) | 0.974214 (0.005184, 0.000096) | 1.597994 (−0.000406, 0.001813) |

(c) $(\rho, \theta)$ = (0.5, 1.4)

| n | $\hat{\rho}_{CLS}$ | $\hat{\theta}_{CLS}$ | $\hat{\rho}_{GE}$ | $\hat{\theta}_{GE}$ |
|---|---|---|---|---|
| 50 | 0.468009 (−0.031991, 0.015481) | 1.607588 (0.207588, 0.128041) | 0.465563 (−0.033937, 0.013012) | 1.565282 (0.166682, 0.110757) |
| 100 | 0.480446 (−0.019554, 0.007864) | 1.594822 (0.194822, 0.073287) | 0.468969 (−0.030531, 0.006731) | 1.513646 (0.115046, 0.052598) |
| 500 | 0.493206 (−0.006794, 0.001392) | 1.569898 (0.169898, 0.035519) | 0.479231 (−0.020269, 0.001619) | 1.477341 (0.078741, 0.013248) |
| 1000 | 0.498065 (−0.001935, 0.000822) | 1.569096 (0.169096, 0.031902) | 0.478064 (−0.021436, 0.001079) | 1.474208 (0.075608, 0.009266) |
Table 2. Average, bias, and MSE (in parentheses) of the estimates for some different values of the parameters $\rho$ and $\theta$.

(d) $(\rho, \theta)$ = (0.7, 1.4)

| n | $\hat{\rho}_{CLS}$ | $\hat{\theta}_{CLS}$ | $\hat{\rho}_{GE}$ | $\hat{\theta}_{GE}$ |
|---|---|---|---|---|
| 50 | 0.654573 (−0.045427, 0.011906) | 1.661771 (0.261771, 0.225731) | 0.667823 (−0.031477, 0.008301) | 1.606182 (0.207582, 0.184589) |
| 100 | 0.67598 (−0.02402, 0.005558) | 1.596147 (0.196147, 0.107849) | 0.673644 (−0.025656, 0.004115) | 1.549443 (0.150843, 0.088059) |
| 500 | 0.69441 (−0.00559, 0.001021) | 1.569192 (0.169192, 0.041562) | 0.68007 (−0.01923, 0.001081) | 1.483746 (0.085146, 0.019932) |
| 1000 | 0.697751 (−0.002249, 0.000516) | 1.56935 (0.16935, 0.034756) | 0.682625 (−0.016675, 0.000684) | 1.474496 (0.075896, 0.012019) |

(e) $(\rho, \theta)$ = (0.1, 1.5)

| n | $\hat{\rho}_{CLS}$ | $\hat{\theta}_{CLS}$ | $\hat{\rho}_{GE}$ | $\hat{\theta}_{GE}$ |
|---|---|---|---|---|
| 50 | 0.094844 (−0.005156, 0.019144) | 1.661526 (0.161526, 0.061588) | 0.114486 (0.014486, 0.011496) | 1.541434 (0.041434, 0.046460) |
| 100 | 0.095026 (−0.004974, 0.009954) | 1.650353 (0.150353, 0.040085) | 0.104383 (0.004383, 0.006490) | 1.520137 (0.020137, 0.022047) |
| 500 | 0.100031 (0.000031, 0.002009) | 1.642151 (0.142151, 0.023514) | 0.099709 (−0.000291, 0.001577) | 1.502587 (0.002587, 0.004227) |
| 1000 | 0.099556 (−0.000444, 0.000976) | 1.640809 (0.140809, 0.021493) | 0.099946 (−0.000054, 0.000787) | 1.499580 (−0.00042, 0.002101) |

(f) $(\rho, \theta)$ = (0.5, 1.5)

| n | $\hat{\rho}_{CLS}$ | $\hat{\theta}_{CLS}$ | $\hat{\rho}_{GE}$ | $\hat{\theta}_{GE}$ |
|---|---|---|---|---|
| 50 | 0.470984 (−0.029016, 0.012418) | 1.571149 (0.071149, 0.065098) | 0.465673 (−0.034327, 0.012710) | 1.509039 (0.009039, 0.075593) |
| 100 | 0.483163 (−0.016837, 0.007294) | 1.550975 (0.050975, 0.036887) | 0.472834 (−0.027166, 0.006639) | 1.471724 (−0.028276, 0.037763) |
| 500 | 0.496301 (−0.003699, 0.001462) | 1.532860 (0.032860, 0.007463) | 0.476945 (−0.023055, 0.001755) | 1.437758 (−0.062242, 0.010739) |
| 1000 | 0.498410 (−0.001590, 0.000741) | 1.532106 (0.032106, 0.004187) | 0.477350 (−0.022650, 0.001148) | 1.434703 (−0.065297, 0.007641) |
Table 3. Descriptive statistics and ADF test results for the University of Michigan Inflation Expectation and Brisbane water quality datasets.

| Dataset | n | Min. | Q1 | Q2 | Mean | Q3 | Max. | Var | p-Value (ADF) |
|---|---|---|---|---|---|---|---|---|---|
| MICH | 451 | 0.400 | 2.700 | 3.000 | 3.044 | 3.200 | 5.200 | 0.3385 | 0.03571 |
| Turbidity | 221 | 2.344 | 2.667 | 2.821 | 2.843 | 2.981 | 4.625 | 0.0665 | 0.01031 |
Table 4. Estimated parameters, AIC, BIC, and HQIC for the monthly University of Michigan Inflation Expectation dataset.

| Model | GE Estimates (SE) | AIC | BIC | HQIC |
|---|---|---|---|---|
| SBL-AR | $\hat{\rho}$ = 0.9722689 (0.006470897), $\hat{\theta}$ = 1.2350076 (0.137368908) | 253.3235 | 261.5464 | 256.5642 |
| L-AR | $\hat{\rho}$ = 0.9615126 (0.00501119), $\hat{\lambda}$ = 1.1524668 (0.06931619) | 268.7283 | 276.9512 | 271.9689 |
| GaL-AR | $\hat{\rho}$ = 0.9695075 (0.005533153), $\hat{\lambda}$ = 1.0218001 (0.108046443), $\hat{\beta}$ = 0.9895284 (0.293700792) | 262.2251 | 274.5595 | 267.0861 |
| E-AR | $\hat{\rho}$ = 0.9597630 (0.00505366), $\hat{\lambda}$ = 0.8991799 (0.05621214) | 274.4090 | 282.6319 | 277.6497 |
| G-AR | $\hat{\rho}$ = 0.9736072 (0.007168259), $\hat{\lambda}$ = 0.5314403 (0.203203788), $\hat{k}$ = 0.5480252 (0.297268206) | 265.7606 | 278.0950 | 270.6216 |
| INGAR-I | $\hat{\rho}$ = 0.9582729 (0.003857915), $\hat{\mu}$ = 1.1860471 (0.096928851) | 274.3043 | 282.5272 | 277.5449 |
| INGAR-II | $\hat{\rho}$ = 0.9932729 (0.00163749), $\hat{\mu}$ = 1.9887342 (0.44900524), $\hat{\lambda}$ = 1.0075656 (0.58975847) | 260.5591 | 272.8935 | 265.4201 |
| AR-L | $\hat{\rho}$ = 0.9353976 (0.006316815), $\hat{\theta}$ = 8.6482145 (0.540094410) | 270.5227 | 278.7457 | 273.7634 |
Table 5. Estimated parameters, AIC, BIC, and HQIC for the Brisbane water quality dataset.

| Model | GE Estimates (SE) | AIC | BIC | HQIC |
|---|---|---|---|---|
| SBL-AR | $\hat{\rho}$ = 0.9677021 (0.00751236), $\hat{\theta}$ = 1.6010980 (0.17560826) | 46.89545 | 53.69178 | 49.63968 |
| L-AR | $\hat{\rho}$ = 0.9717271 (0.006628198), $\hat{\lambda}$ = 1.1756744 (0.127043976) | 49.18248 | 55.97881 | 51.92672 |
| GaL-AR | $\hat{\rho}$ = 0.9718353 (0.006936761), $\hat{\lambda}$ = 1.1164845 (0.175881692), $\hat{\beta}$ = 0.7821798 (0.246680549) | 51.60417 | 61.79866 | 55.72052 |
| E-AR | $\hat{\rho}$ = 0.9633879 (0.006277054), $\hat{\lambda}$ = 1.0429649 (0.090100243) | 56.95756 | 63.75388 | 59.70179 |
| G-AR | $\hat{\rho}$ = 0.9524641 (0.008869184), $\hat{\lambda}$ = 1.5172295 (0.395848932), $\hat{k}$ = 1.4776463 (0.495027829) | 71.70304 | 81.89753 | 75.81939 |
| INGAR-I | $\hat{\rho}$ = 0.9622654 (0.004742255), $\hat{\mu}$ = 0.8843855 (0.098468366) | 59.35630 | 66.15262 | 62.10053 |
| INGAR-II | $\hat{\rho}$ = 0.9970946 (0.0002890516), $\hat{\mu}$ = 1.5959593 (0.2560154481), $\hat{\lambda}$ = 0.3071916 (0.1316577834) | 50.63486 | 60.82935 | 54.75122 |
| AR-L | $\hat{\rho}$ = 0.9049345 (0.00977161), $\hat{\theta}$ = 7.0817431 (0.56402558) | 96.52965 | 103.32598 | 99.27389 |
Table 6. Ljung–Box and Box–Pierce test results for residual autocorrelation.

| Dataset | p-Value of Ljung–Box Test | p-Value of Box–Pierce Test |
|---|---|---|
| MICH | 0.6371 | 0.6382 |
| Turbidity | 0.3715 | 0.3747 |
Table 7. Model performance measures for forecasting the MICH dataset values.

| Method | RMSE | MAE | MAPE |
|---|---|---|---|
| SVR | 0.2695047 | 0.1890921 | 6.447078% |
| XGBoost | 0.2621252 | 0.1792437 | 6.073617% |
| KNN | 0.2508839 | 0.1831944 | 6.270608% |
| Classical | 0.2679958 | 0.1809712 | 6.098343% |
Table 8. Model performance measures for forecasting the Turbidity dataset values.

| Method | RMSE | MAE | MAPE |
|---|---|---|---|
| SVR | 0.1993664 | 0.1448703 | 4.984548% |
| XGBoost | 0.2093979 | 0.1462492 | 5.040056% |
| KNN | 0.2249225 | 0.1538455 | 5.342222% |
| Classical | 0.2300879 | 0.1660202 | 5.728669% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
