Article

Bootstrapping State-Space Models: Distribution-Free Estimation in View of Prediction and Forecasting

by José Francisco Lima 1,†, Fernanda Catarina Pereira 2,†, Arminda Manuela Gonçalves 1,2,† and Marco Costa 3,*,†
1 Department of Mathematics, University of Minho, 4710-057 Braga, Portugal
2 Centre of Mathematics, University of Minho, 4710-057 Braga, Portugal
3 Centre for Research and Development in Mathematics and Applications, Águeda School of Technology and Management, University of Aveiro, 3810-193 Aveiro, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Forecasting 2024, 6(1), 36-54; https://doi.org/10.3390/forecast6010003
Submission received: 13 November 2023 / Revised: 19 December 2023 / Accepted: 20 December 2023 / Published: 27 December 2023
(This article belongs to the Special Issue Feature Papers of Forecasting 2023)

Abstract:
Linear models, seasonal autoregressive integrated moving average (SARIMA) models, and state-space models have been widely adopted to model and forecast economic data. While modeling using linear models and SARIMA models is well established in the literature, modeling using state-space models has been extended with the proposal of alternative estimation methods to the maximum likelihood. However, maximum likelihood estimation assumes, as a rule, that the errors are normal. This paper suggests implementing the bootstrap methodology, utilizing the model’s innovation representation, to derive distribution-free estimates—both point and interval—of the parameters in the time-varying state-space model. Additionally, it aims to estimate the standard errors of these parameters through the bootstrap methodology. The simulation study demonstrated that the distribution-free estimation, coupled with the bootstrap methodology, yields point forecasts with a lower mean-squared error, particularly for small time series or when dealing with smaller values of the autoregressive parameter in the state equation of state-space models. In this context, distribution-free estimation with the bootstrap methodology serves as an alternative to maximum likelihood estimation, eliminating the need for distributional assumptions. The application of this methodology to real data showed that it performed well when compared to the usual maximum likelihood estimation and even produced prediction intervals with a similar amplitude for the same level of confidence without any distributional assumptions about the errors.

1. Introduction

The analysis and forecasting of time series play a pivotal role in the economic context, providing valuable tools for understanding and anticipating trends and variations in economic indicators over time [1,2]. Time series refer to sets of observations ordered chronologically, such as financial data, industrial production, exchange rates, and various other economic factors. This statistical approach allows economists and financial analysts to unravel patterns, seasonality, and underlying behaviors within the data, aiding in informed decision-making, strategic planning, and the identification of opportunities and risks.
Time series modeling and forecasting using linear models and seasonal autoregressive integrated moving average (SARIMA) models offer distinct advantages in handling complex temporal data [3]. Linear models provide simplicity and transparency in understanding the linear relationships between variables over time. They are well-suited for capturing trends and straightforward relationships, making them a valuable tool for initial exploration. Linear models offer interpretability, making it easier to identify how changes in one variable affect others. However, they may struggle with capturing nonlinear patterns and seasonality, which are often present in economic and financial data. On the other hand, SARIMA models are specifically designed to handle time series data with seasonality and autocorrelation. SARIMA models combine autoregressive, differencing, and moving average components, making them adaptable to a wide range of data patterns. This flexibility allows them to provide more-accurate and -reliable forecasts, making them a preferred choice in many economic forecasting scenarios. While linear models offer simplicity and transparency, SARIMA models excel in capturing complex temporal patterns, especially in economic and financial data. Selecting between these modeling techniques depends on the specific characteristics of the data and the level of accuracy required for forecasting and decision-making in the economic context [4].
Preferring state-space models over linear models and SARIMA models offers several advantages, particularly when dealing with complex and dynamic time series data. State-space models provide a more-flexible framework that can capture both linear and nonlinear relationships in the data. They provide a unified way to represent and analyze time series data, making them suitable for a wide range of applications. State-space models can handle hidden or unobservable states and capture irregular patterns, making them versatile in modeling economic and financial data [5,6]. Furthermore, state-space models are well suited to handling missing or irregularly sampled data, a common issue in real-world economic and financial datasets. They offer the capability to incorporate exogenous variables, which can be crucial for improving the accuracy of forecasts and enhancing the understanding of causal relationships in complex economic systems. State-space models also facilitate Bayesian inference, allowing for a probabilistic approach to modeling and forecasting. This probabilistic nature provides not only point estimates, but also uncertainty quantification, which is valuable for risk assessment and decision-making. State-space models offer greater flexibility, robustness, and adaptability when dealing with complex time series data, especially in scenarios where linear models and SARIMA models may not adequately capture the underlying dynamics.
State-space models have a versatile structure that allows them to model a time series with the aim of forecasting it. In this context, exponential smoothing methods can be considered in the state-space formulation [7,8] or even from a data assimilation perspective [9].
Estimating the parameters of state-space models through the maximum likelihood encounters several significant numerical challenges. This arises from the intrinsic complexity of these models, which often include nonlinear components, noisy observations, and unobservable latent states. Common numerical challenges include the following:
  • The nonlinearity of the log-likelihood function makes its optimization a computationally intensive task, and finding the optimal solution may require advanced numerical optimization algorithms, such as the Newton–Raphson method or Monte Carlo algorithms.
  • Optimization algorithms may fail to converge to a viable solution or may become stuck in local minima, resulting in inaccurate or unfeasible estimates.
  • The convergence of optimization algorithms can be highly sensitive to the initial parameter values, so finding an appropriate set of initial values is often a crucial step in successfully estimating the parameters.
  • Real data often contain noise and measurement errors, which can affect the accuracy of parameter estimation and make robust techniques for dealing with imperfect data necessary.
  • In some models, there may be identifiability problems, in which multiple sets of parameters produce similar results; this makes parameter estimation challenging, since there may not be a single well-defined solution.
In this context, the estimation of distribution-free parameters in state-space models is a valuable approach that does not rely on specific assumptions about the underlying data distribution. This method, often referred to as nonparametric or distribution-free estimation, is particularly useful when the true data distribution is unknown or complex. Reference [10] proposed to combine the Stochastic Expectation–Maximization (SEM) algorithm and Sequential Monte Carlo (SMC) approaches for non-parametric estimation in state-space models. In distribution-free parameter estimation for state-space models, the emphasis is on estimating the system’s hidden states and parameters in a way that does not assume a specific probability distribution for the observations. This flexibility is advantageous when dealing with real-world data that may exhibit non-standard or heavy-tailed behavior. Common techniques for distribution-free parameter estimation in state-space models include nonparametric methods such as kernel density estimation, local polynomial regression, or bootstrapping. These methods focus on data-driven approaches to estimate parameters and states, making them less sensitive to distributional assumptions.
The distribution-free approach is especially valuable when dealing with financial and economic time series data, where data characteristics can be challenging to model with traditional parametric assumptions. By allowing for more flexibility and adaptability, distribution-free estimation methods offer a robust way to capture complex dynamics and dependencies in time series data, making them a valuable tool in econometrics and quantitative finance. Distribution-free parameter estimation has been considered in time series modeling in various contexts (see, for example, [11,12]). Reference [13] proposed that estimators widen the scope of the application of the generalized method of moments to some heteroscedastic state-space models, as in the case of state-space models with varying coefficients. These estimators were extended to multivariate models in [14]. However, no asymptotic distributions have been determined that allow for standard errors or confidence intervals to be obtained for the estimates of these estimators.
This study proposes using the bootstrap methodology to obtain both point and interval estimates and the standard errors of these estimates. Bootstrapping is a technique used in this type of inference when the distributional assumptions are not guaranteed or the exact or asymptotic distribution of the estimators is not known [15]. The bootstrap technique has already been applied in the particular case of estimating the parameters of state-space models, either considering the normality of the errors [16] or as an approach for state-space models where the bootstrapping is used as a diagnostic tool [17]. However, this paper proposes the adoption of the bootstrap methodology to obtain inferential properties of the distribution-free estimators proposed in [13,14].
The modeling and forecasting will be illustrated using the Manufacturing PMI time series, which is a monthly economic indicator for the United States released by the Institute for Supply Management (ISM), a non-governmental and non-profit organization established in 1915. This index is constructed through surveys of purchasing managers at more than 300 industrial companies. It is a key indicator for assessing and monitoring the development of the American economy [18].
This study proposes employing distribution-free estimators to estimate the unknown parameters in the state-space model enhanced by the bootstrap methodology. This approach enables the derivation of bootstrap point estimates and confidence intervals. The proposed method outperformed the SARIMA time series modeling and demonstrated favorable results compared to the maximum likelihood estimation within the state-space framework. An additional advantage is that distribution-free estimators do not rely on distributional assumptions for the associated errors.
This paper is organized as follows. Section 2 introduces the materials and methods: time series modeling via SARIMA and state-space models and the parameter estimation considering both the maximum likelihood and distribution-free estimation. For estimation with distribution-free estimators, the bootstrap-based approach to obtaining both point and interval estimates of the parameters is presented. The final part of this section presents the design and results of the simulation study. Section 3 describes the database used in the application to real data, the modeling of real data, and the discussion of the results. Section 4 presents the main conclusions of this work.

2. Materials and Methods

2.1. SARIMA Modeling

Let $Y_t$ be a time series. In seasonal time series, it is expected that the seasonal component is related in some way to the non-seasonal components. In other words, if neighboring observations in a time series, $Y_t, Y_{t-1}, Y_{t-2}, \ldots$, are related, there is a probability that observations spaced by $s$ time units, $Y_t, Y_{t-s}, Y_{t-2s}, \ldots$, are also related. Seasonal differencing is a technique applied to capture this relationship. Seasonal differencing of order 1 is given by
$\nabla_s Y_t = Y_t - Y_{t-s} = (1 - B^s) Y_t,$
where $B$ is the lag operator. This seasonal differencing subtracts from the current observation the observation that occurred $s$ time units earlier, highlighting seasonal variations in the series. This technique is particularly useful when the series exhibits repetitive periodic behavior. Similarly, seasonal differencing can be applied multiple times, leading to the definition of a seasonal differencing operator of order $D$:
$\nabla_s^D Y_t = (1 - B^s)^D Y_t,$
where $D$ is an integer greater than or equal to 1. This is useful for dealing with more-complex seasonal patterns and highlighting higher-order seasonal variations.
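As a concrete illustration, the operator $(1 - B^s)^D$ can be implemented in a few lines of numpy; the function name and defaults below are ours, not from the paper:

```python
import numpy as np

def seasonal_diff(y, s=12, D=1):
    """Apply the seasonal differencing operator (1 - B^s)^D to a series.

    Each pass shortens the series by s observations."""
    y = np.asarray(y, dtype=float)
    for _ in range(D):
        y = y[s:] - y[:-s]
    return y
```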
A time series process $Y_t$ is considered a seasonal autoregressive integrated moving average (SARIMA) process, denoted SARIMA$(p,d,q)(P,D,Q)_s$, when it satisfies the SARIMA equation:
$\Phi_p(B)\, N_P(B^s)\, \nabla^d \nabla_s^D Y_t = \Theta_q(B)\, H_Q(B^s)\, \epsilon_t,$
where $\Phi_p(B)$, $N_P(B^s)$, $\Theta_q(B)$, and $H_Q(B^s)$ are the polynomials associated with the regular and seasonal autoregressive and moving average terms, and $d$ and $D$ are the orders of differencing for the regular and seasonal components, respectively.
SARIMA models are best suited to time series with well-defined seasonal behavior [8,19].
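For readers who want to experiment, a hedged sketch of fitting a SARIMA model with statsmodels follows; the toy series and the order $(1,1,1)(1,0,0)_{12}$ are illustrative choices of ours, not the model selected later in this paper:

```python
import numpy as np
import statsmodels.api as sm

# Toy monthly series with a stochastic trend and a 12-month cycle.
rng = np.random.default_rng(0)
t = np.arange(240)
y = np.cumsum(rng.normal(size=240)) + 5 * np.sin(2 * np.pi * t / 12)

# Fit SARIMA(1,1,1)(1,0,0)_12 by (Gaussian) maximum likelihood.
model = sm.tsa.statespace.SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
res = model.fit(disp=False)

fc = res.get_forecast(steps=12)
print(fc.predicted_mean)        # point forecasts
print(fc.conf_int(alpha=0.05))  # 95% prediction intervals
```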

2.2. State-Space Modeling

Linear state-space models can be viewed as an extension of multiple linear regression models, providing a powerful framework for modeling time series data with additional dynamics and unobservable components.
In a multiple linear regression, we seek to explain the variation in a dependent variable (the response) as a linear combination of several independent variables (the predictors). The relationship is expressed as $Y = X\beta + \varepsilon$, where $Y$ is the response, $X$ represents the predictors, $\beta$ is the vector of coefficients, and $\varepsilon$ represents the error term.
Linear state-space models take this idea a step further by considering that the observed data depend not only on observable predictors, but also on unobservable state variables. These state variables capture hidden dynamics in the data that evolve over time. The state-space model can be expressed as
$Y_t = W_t \beta_t + e_t,$
$\beta_t = \mu + \Phi(\beta_{t-1} - \mu) + \varepsilon_t.$
Here, $Y_t$ is the observed data, $W_t$ represents the observed predictors, and $\beta_t$ is the vector of unobservable state variables. The matrix $\Phi$ captures the transition of the state variables over time. This formulation extends the linear regression framework to account for temporal dependencies and latent states.
In essence, linear state-space models encompass the concept of multiple linear regression by considering it as a special case in which there are no unobservable state variables $\beta_t$ and the relationship between $Y_t$ and $X$ is purely linear. However, they go beyond simple regression by allowing for time-evolving relationships, dynamics, and noise, making them suitable for modeling complex time series data, including financial markets, economics, and more. By incorporating latent state variables, these models can capture hidden patterns and dynamics, enhancing our ability to understand and forecast time-dependent phenomena.
The state-space model has the following assumptions:
  • $W_t$ is a $p \times m$ matrix;
  • $e_t$ is a $p \times 1$ vector of independent and identically distributed errors following, in general, a multivariate normal distribution with zero mean and variance-covariance matrix $H$, $e_t \sim N_p(0, H)$;
  • $\Phi$ is an $m \times m$ matrix known as the autoregressive matrix;
  • $\varepsilon_t$ is an $m \times 1$ vector of independent and identically distributed errors following, in general, a multivariate normal distribution with zero mean and covariance matrix $Q$, $\varepsilon_t \sim N_m(0, Q)$.
The mixed-effects models defined by the observation and state equations allow for the establishment of various models to handle missing or omitted data. They also enable the definition of models with fixed or random effects (which can be time-invariant or time-varying).

2.2.1. Predicting and Forecasting with the Kalman Filter

The Kalman filter (KF) is a recursive algorithm used in state-space models to obtain optimal estimates and predictions of the unobserved state vector $\beta_t$. The KF equations form a system that obtains linear projections at each time instant $t$; in this way, the linear estimators with the lowest mean-squared error are calculated. The KF also provides one-step predictions for the vectors of the observed variables.
Let $\beta_{t|t-1}$ be the estimator of $\beta_t$ with the smallest mean-squared error based on the information available up to time $t-1$, represented by the vector of observations $\tilde{Y}_{t-1} = (Y_1, Y_2, \ldots, Y_{t-1})$, i.e., $\beta_{t|t-1} = E[\beta_t \mid \tilde{Y}_{t-1}]$, with $P_{t|t-1}$ being the $m \times m$ covariance matrix representing the mean-squared error:
$P_{t|t-1} = E[(\beta_t - \beta_{t|t-1})(\beta_t - \beta_{t|t-1})'].$
Consider that $\beta_{t|t} = E[\beta_t \mid \tilde{Y}_t]$ and $P_{t|t} = E[(\beta_t - \beta_{t|t})(\beta_t - \beta_{t|t})' \mid \tilde{Y}_t]$. The prediction of $Y_t$ based on the information up to time $t-1$, known as the prediction equation, is given by
$Y_{t|t-1} = E[Y_t \mid \tilde{Y}_{t-1}] = W_t \beta_{t|t-1}.$
When the observation $Y_t$ is obtained at time $t$, the prediction of the state vector $\beta_t$ is updated as
$\beta_{t|t} = \beta_{t|t-1} + K_t (Y_t - Y_{t|t-1}),$
with mean-squared error given by $P_{t|t} = [I - K_t W_t] P_{t|t-1}$, where $I$ is the identity matrix of order $m$ and $K_t$ is the Kalman gain, an $m \times p$ matrix defined as
$K_t = P_{t|t-1} W_t' \, [W_t P_{t|t-1} W_t' + H]^{-1}, \quad t = 1, \ldots, n.$
It is then possible to obtain a prediction for the state vector at time $t+1$ based on all the information available up to time $t$:
$\beta_{t+1|t} = \mu + \Phi(\beta_{t|t} - \mu),$
with mean-squared error $P_{t+1|t} = \Phi P_{t|t} \Phi' + Q$.
However, it is necessary to define $\beta_{1|0}$ and $P_{1|0}$ to initiate the recursive process. Knowing that the vector $\beta_t$ is a stationary process with mean $\mu$ and covariance matrix $\Sigma = E[(\beta_t - \mu)(\beta_t - \mu)']$, the process begins with the prediction of $\beta_1$. In the absence of any information, the mean value is $\beta_{1|0} = \mu$, and the matrix $P_{1|0}$ is equal to the covariance matrix of the state vector, $P_{1|0} = \Sigma$.
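The recursions above translate almost line by line into code. Below is a minimal sketch for the univariate model used later in the paper ($Y_t = W_t\beta_t + e_t$ with a scalar AR(1) state); the function name and interface are ours:

```python
import numpy as np

def kalman_filter(y, W, mu, phi, sig2_eps, sig2_e):
    """One-step predictions for Y_t = W_t beta_t + e_t,
    beta_t = mu + phi (beta_{t-1} - mu) + eps_t.

    Returns the predictions Y_{t|t-1} and their variances Sigma_{t|t-1}."""
    n = len(y)
    beta_pred = mu                       # beta_{1|0} = mu
    P_pred = sig2_eps / (1.0 - phi**2)   # P_{1|0} = stationary state variance
    y_pred, Sigma = np.empty(n), np.empty(n)
    for t in range(n):
        y_pred[t] = W[t] * beta_pred                    # prediction equation
        Sigma[t] = W[t] ** 2 * P_pred + sig2_e          # innovation variance
        K = P_pred * W[t] / Sigma[t]                    # Kalman gain
        beta_filt = beta_pred + K * (y[t] - y_pred[t])  # update step
        P_filt = (1.0 - K * W[t]) * P_pred
        beta_pred = mu + phi * (beta_filt - mu)         # one-step state forecast
        P_pred = phi**2 * P_filt + sig2_eps
    return y_pred, Sigma
```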

2.2.2. Confidence Intervals for 1-Step Forecasts

In some cases, point estimation is not sufficient to quantify the uncertainty of a prediction. Since the future is unknown, interval estimation allows quantifying the uncertainty associated with point predictions. The covariance matrix of the one-step-ahead prediction error is $\Sigma_{t|t-1} = W_t P_{t|t-1} W_t' + H$. To justify the interval estimation of a univariate prediction, the fitted model must describe the time series adequately, and its assumptions must be valid.
The statistic used to compute a univariate prediction confidence interval is
$\dfrac{Y_t - Y_{t|t-1}}{\sqrt{\Sigma_{t|t-1}}} \sim N(0, 1),$
where $z_{1-\alpha/2}$ denotes the corresponding quantile of the standard normal distribution and $1-\alpha$ the confidence level of the interval. The interval is obtained from
$P\left(-z_{1-\alpha/2} < \dfrac{Y_t - Y_{t|t-1}}{\sqrt{\Sigma_{t|t-1}}} < z_{1-\alpha/2}\right) = 1-\alpha,$
so that
$I_{(1-\alpha)100\%} = \left(Y_{t|t-1} - z_{1-\alpha/2}\sqrt{\Sigma_{t|t-1}},\; Y_{t|t-1} + z_{1-\alpha/2}\sqrt{\Sigma_{t|t-1}}\right).$
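Given the one-step predictions and innovation variances returned by the kalman_filter sketch above, these intervals are a one-liner (a sketch under the same assumptions):

```python
import numpy as np
from scipy.stats import norm

def one_step_intervals(y_pred, Sigma, alpha=0.05):
    """(1 - alpha) one-step prediction intervals from Y_{t|t-1} and Sigma_{t|t-1}."""
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(Sigma)
    return y_pred - half, y_pred + half
```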

2.3. Parameter Estimation

The parameters in state-space models, including transition matrices, covariances, and observation matrices, govern the behavior and structure of the underlying system; when these parameters are estimated rather than known, the KF may not produce optimal predictions [20]. Without precise parameter estimates, the model may not reflect the true dynamics of the system, leading to unreliable forecasts and inferences. Therefore, parameter estimation plays a fundamental role in ensuring that state-space models provide valuable insights and predictions for real-world applications.

2.3.1. Gaussian Likelihood Estimation

For Gaussian maximum likelihood estimation (MLE), the goal is to maximize the log-likelihood based on the observations $Y_1, Y_2, \ldots, Y_n$, considering that the initial state $\beta_1$ follows a normal distribution. In state-space models, the MLE is performed through the conditional probabilities given by the innovations $\eta_{t|t-1} = Y_t - W_t\beta_{t|t-1}$, $t = 1, \ldots, n$. Considering $\Theta$ the vector of unknown parameters, the likelihood is
$L(\Theta; \tilde{Y}_n) = \prod_{t=1}^{n} p(Y_t \mid \tilde{Y}_{t-1}).$
Here, $\tilde{Y}_n = (Y_1, Y_2, \ldots, Y_n)$, and $p(Y_t \mid \tilde{Y}_{t-1})$ corresponds to the density of $Y_t$ given $\tilde{Y}_{t-1}$; the log-likelihood function is given by
$\log L(\Theta; \tilde{Y}_n) = -\dfrac{n}{2}\ln(2\pi) - \dfrac{1}{2}\sum_{t=1}^{n}\log\left|\Sigma_{t|t-1}(\Theta)\right| - \dfrac{1}{2}\sum_{t=1}^{n}\eta_{t|t-1}'(\Theta)\,\Sigma_{t|t-1}^{-1}(\Theta)\,\eta_{t|t-1}(\Theta).$
It is important to emphasize that, in the function above, the dependence of the innovations $\eta_{t|t-1}(\Theta)$ and their covariance matrix $\Sigma_{t|t-1}(\Theta)$ on the parameter vector $\Theta$ to be estimated should be considered. In some applications where the process $\beta_t$, $t = 1, \ldots, n$, is not stationary, the initial values of the KF, $\beta_{1|0}$ and $P_{1|0}$, may be appended to $\Theta$ and estimated from the sample [21].
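A minimal sketch of this estimation, reusing the kalman_filter function above and scipy's general-purpose optimizer (the parameterization and starting values are our illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, y, W):
    """Gaussian negative log-likelihood computed from the KF innovations."""
    mu, phi, sig2_eps, sig2_e = theta
    if not (-1.0 < phi < 1.0) or sig2_eps <= 0.0 or sig2_e < 0.0:
        return np.inf  # keep the optimizer inside the parameter space
    y_pred, Sigma = kalman_filter(y, W, mu, phi, sig2_eps, sig2_e)
    innov = y - y_pred
    return 0.5 * np.sum(np.log(2.0 * np.pi * Sigma) + innov**2 / Sigma)

# theta0 = (mu, phi, sig2_eps, sig2_e); the distribution-free estimates of
# Section 2.3.2 are natural starting values here.
# fit = minimize(neg_loglik, x0=theta0, args=(y, W), method="Nelder-Mead")
```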

2.3.2. Distribution-Free Estimators

An alternative to maximum likelihood estimation is distribution-free or non-parametric estimators. These comprise a class of statistical estimators that do not make specific assumptions about the underlying data distribution. In other words, these estimators do not require the data to follow a particular distribution, such as the normal distribution. They are often used in situations where there is not enough information about the data distribution or when the data have distributions that are too complex to be adequately modeled by a parametric distribution. Distribution-free estimators provide a flexible and robust approach to estimating parameters and making statistical inferences when the data’s distributional assumptions are uncertain or not well defined. Reference [13] proposed distribution-free estimators based on the generalized method of moments, whose consistency conditions were established even for heteroscedastic models. However, although point estimates can be obtained, neither the sampling distributions nor the asymptotic distributions are known.
Considering the linear univariate state-space model, Reference [13] proposed to estimate the state mean $\mu$ by
$\hat{\mu} = n^{-1}\sum_{t=1}^{n} Y_t W_t^{-1}.$
The autoregressive parameter $\phi$ is estimated from the covariance structure of the process $\{Y_t W_t^{-1}\}$ based on its sample autocovariance function, $\hat{\gamma}(k)$, defined as
$\hat{\gamma}(k) = \dfrac{1}{n}\sum_{t=1}^{n-k}\left(\dfrac{Y_{t+k}}{W_{t+k}} - \hat{\mu}\right)\left(\dfrac{Y_t}{W_t} - \hat{\mu}\right),$
through the estimator
$\hat{\phi} = \left(\sum_{k=1}^{\ell}\hat{\gamma}(k+1)\,\hat{\gamma}(k)\right)\left(\sum_{k=1}^{\ell}\hat{\gamma}^2(k)\right)^{-1}.$
The choice of the truncation lag $\ell$ was discussed in the original work [13], and it depends on the time series size. To estimate $\sigma_\varepsilon^2$ and $\sigma_e^2$, the following distribution-free estimators are considered:
$\hat{\sigma}_\varepsilon^2 = \dfrac{1 - \hat{\phi}^2}{\hat{\phi}}\,\hat{\gamma}(1) \quad\text{and}\quad \hat{\sigma}_e^2 = \left(\sum_{t=1}^{n} W_t^{-2}\right)^{-1}\left[\sum_{t=1}^{n}\left(\dfrac{Y_t}{W_t} - \hat{\mu}\right)^2 - \dfrac{n\,\hat{\sigma}_\varepsilon^2}{1 - \hat{\phi}^2}\right].$
These estimators are consistent under simple regularity conditions on the sequence $W_t$, which must be bounded.
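A direct transcription of these estimators follows (a sketch; the default truncation lag ell and the clipping of a possibly negative variance estimate at zero are our choices):

```python
import numpy as np

def df_estimators(y, W, ell=10):
    """Distribution-free estimates of (mu, phi, sig2_eps, sig2_e) from [13].

    y and W are float arrays of equal length; W must be nonzero (and bounded)."""
    x = y / W                     # the process {Y_t / W_t}
    n = len(x)
    mu_hat = x.mean()
    # sample autocovariances gamma(0), ..., gamma(ell + 1)
    gamma = np.array([np.sum((x[k:] - mu_hat) * (x[:n - k] - mu_hat)) / n
                      for k in range(ell + 2)])
    phi_hat = np.sum(gamma[2:ell + 2] * gamma[1:ell + 1]) / np.sum(gamma[1:ell + 1] ** 2)
    sig2_eps = (1.0 - phi_hat**2) / phi_hat * gamma[1]
    sig2_e = (np.sum((x - mu_hat) ** 2) - n * sig2_eps / (1.0 - phi_hat**2)) \
        / np.sum(1.0 / W**2)
    return mu_hat, phi_hat, sig2_eps, max(sig2_e, 0.0)
```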

2.3.3. Point and Interval Distribution-Free Estimation of Parameters via Bootstrapping

The distribution-free estimators produce point estimates of the parameters based on the time series, but neither their exact distribution nor their asymptotic distribution is known. It is, therefore, proposed to bootstrap these estimates in order to obtain bootstrap point estimates, their standard errors, and their confidence intervals.
Bootstrapping state-space models is a resampling technique used to assess the uncertainty of parameter estimates in time series modeling when the underlying data distribution might not be well understood or when we have limited data. This method involves simulating new datasets by resampling from the standardized innovations, enabling the generation of multiple parameter estimates and intervals. By repeatedly applying the bootstrap procedure, we can construct a distribution of parameter estimates, which provides insights into the robustness and variability of the model. This technique has already been used in state-space models; in particular, Reference [16] considered it for assessing the precision of Gaussian maximum likelihood estimates of the parameters of linear state-space models. Reference [22] proposed a bootstrap procedure for constructing prediction intervals directly for the observations, which does not need the backward representation of the model. Reference [23] proposed parametric and nonparametric bootstrap methods for estimating the prediction mean-squared error of state vector predictors that use estimated model parameters.
In this work, it is proposed to consider the innovation form of the model's representation [24]:
$\beta_{t+1|t} = \mu + \phi(\beta_{t|t-1} - \mu) + \phi K_t \eta_{t|t-1},$
$Y_t = W_t \beta_{t|t-1} + \eta_{t|t-1}.$
The basic steps of the nonparametric bootstrap are as follows (a code sketch of the full procedure is given after the next paragraph):
  1. Construct the standardized innovations: calculate, for each observation, $\eta_t(\hat{\Theta}) = \Sigma_t^{-1/2}(\hat{\Theta})\,\eta_{t|t-1}(\hat{\Theta})$;
  2. Generate a bootstrap sample: create a new dataset by sampling, with replacement, from the set of standardized innovations $\{\eta_1(\hat{\Theta}), \ldots, \eta_n(\hat{\Theta})\}$ to obtain $\{\eta_1^*(\hat{\Theta}), \ldots, \eta_n^*(\hat{\Theta})\}$;
  3. Construct a bootstrap time series: create a time series $Y_1^*, \ldots, Y_n^*$ from the resampled standardized innovations by solving the equation
     $\xi_t = A_t \xi_{t-1} + C_t \eta_t,$
     where $\xi_t = [\beta_{t+1|t}' \;\; Y_t']'$ and
     $A_t = \begin{pmatrix} \phi & 0 \\ W_t & 0 \end{pmatrix}, \qquad C_t = \begin{pmatrix} \phi K_t \Sigma_t^{1/2} \\ \Sigma_t^{1/2} \end{pmatrix},$
     considering $\eta_t^*(\hat{\Theta})$, $t = 1, \ldots, n$, in place of $\eta_t(\hat{\Theta})$, $t = 1, \ldots, n$, with the initial conditions of the Kalman filter remaining fixed at their given values and the parameter $\Theta$ held fixed at $\hat{\Theta}$;
  4. Calculate the bootstrap distribution-free estimates: use the bootstrap time series $\{Y_t^*\}_{t=1,\ldots,n}$ to compute the distribution-free estimates $\hat{\Theta}^*$;
  5. Repeat the procedure: repeat steps 2 to 4 $B$ times to obtain a set of bootstrap parameter estimates $\{\hat{\Theta}_b^*,\ b = 1, \ldots, B\}$.
In this study, we considered $B = 1000$ replicates. Thus, at the end of this procedure, we have 1000 estimates $\hat{\Theta}^*$ of the vector of unknown parameters $\Theta$, i.e., of each of the parameters. These bootstrap estimates yield an empirical bootstrap distribution, from which a bootstrap confidence interval at the $1-\alpha$ level is constructed using the empirical quantiles of order $\alpha/2$ and $1-\alpha/2$. In this context, the bootstrap estimate of the parameter $\theta_i$ is taken to be the average of the 1000 bootstrap estimates obtained in the previous procedure.
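The following self-contained sketch chains the whole procedure together, reusing the df_estimators function above (the helper names, the seed, and the ell default are ours):

```python
import numpy as np

def bootstrap_df(y, W, theta_hat, B=1000, ell=10, seed=42):
    """Nonparametric bootstrap of the distribution-free estimates (steps 1-5)."""
    mu, phi, s2_eps, s2_e = theta_hat
    n = len(y)
    rng = np.random.default_rng(seed)
    # Step 1: run the KF once at theta_hat, storing Sigma_t, K_t, and the
    # standardized innovations.
    beta_pred, P_pred = mu, s2_eps / (1.0 - phi**2)
    Sigma, K, eta = np.empty(n), np.empty(n), np.empty(n)
    for t in range(n):
        Sigma[t] = W[t] ** 2 * P_pred + s2_e
        K[t] = P_pred * W[t] / Sigma[t]
        eta[t] = (y[t] - W[t] * beta_pred) / np.sqrt(Sigma[t])
        beta_pred = mu + phi * (beta_pred - mu) + phi * K[t] * np.sqrt(Sigma[t]) * eta[t]
        P_pred = phi**2 * (1.0 - K[t] * W[t]) * P_pred + s2_eps
    est = np.empty((B, 4))
    for b in range(B):
        e_star = rng.choice(eta, size=n, replace=True)  # step 2: resample
        beta_b, y_star = mu, np.empty(n)
        for t in range(n):                              # step 3: rebuild the series
            y_star[t] = W[t] * beta_b + np.sqrt(Sigma[t]) * e_star[t]
            beta_b = mu + phi * (beta_b - mu) + phi * K[t] * np.sqrt(Sigma[t]) * e_star[t]
        est[b] = df_estimators(y_star, W, ell)          # step 4: re-estimate
    return est  # B rows of bootstrap draws of (mu, phi, sig2_eps, sig2_e)
```

The bootstrap point estimate of each parameter is then the column mean of est, and the percentile interval at level $1-\alpha$ follows from np.quantile(est, [alpha / 2, 1 - alpha / 2], axis=0).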
The main advantage of this approach is that it does not require the assumption of normality or the implementation of optimization methods, as in the case of the maximum likelihood, which, in some cases, may not converge or may converge to a local maximum. On the other hand, even if normality is verified, the distribution-free approach can provide initial estimates for the iterative log-likelihood optimization procedure.

2.4. Simulation Study

To analyze the performance of the proposed methodology in comparison with the maximum likelihood estimation, a simulation study was designed with various scenarios. These scenarios were based on a time-invariant state-space model defined by
$Y_t = \beta_t + e_t, \qquad \beta_t = \phi\beta_{t-1} + \varepsilon_t,$
where the state process $\{\beta_t\}$ is a stationary AR(1) process with zero mean. This study was designed under the optimal conditions for maximum likelihood estimation, since normally distributed errors were considered; the proposed methodology was thus compared in the scenario most favorable to Gaussian maximum likelihood estimation. The time series were obtained by simulating errors with distributions $e_t \sim N(0, \sigma_e^2)$ and $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$, considering time series of dimension $n = 50, 500$, autoregressive parameter values of 0.5 and 0.9, and two pairs of variances $(\sigma_\varepsilon^2, \sigma_e^2)$.
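A sketch of the data-generating step for one scenario follows (the function is ours; here $W_t \equiv 1$, so the Kalman filter and the estimators sketched above can be applied directly to each simulated series):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_series(n, phi, s2_eps, s2_e):
    """Simulate Y_t = beta_t + e_t with a stationary zero-mean AR(1) state."""
    beta = rng.normal(0.0, np.sqrt(s2_eps / (1.0 - phi**2)))  # stationary start
    y = np.empty(n)
    for t in range(n):
        beta = phi * beta + rng.normal(0.0, np.sqrt(s2_eps))
        y[t] = beta + rng.normal(0.0, np.sqrt(s2_e))
    return y

# One of the scenarios: n = 50, phi = 0.5, (sig2_eps, sig2_e) = (0.05, 0.01).
y_sim = simulate_series(50, 0.5, 0.05, 0.01)
```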
In the scenario of long time series ($n = 500$) (Table 1 and Table 2), which is the most-favorable scenario for maximum likelihood estimation because it enhances the convergence of the optimization method, distribution-free estimation associated with the bootstrap methodology showed a parameter estimation performance very similar to that of maximum likelihood. The GML method performed better in terms of confidence interval coverage rates for the highest value of the autoregressive parameter, $\phi = 0.9$, but showed coverage rates below the nominal 95% confidence level for $\phi = 0.5$. DFb, in contrast, maintained coverage rates close to 100% in both cases, which means that the DFb method is conservative. In short, the proposed DFb method is more advantageous for long time series when the correlation structure, translated by the $\phi$ parameter, is weaker ($\phi$ closer to zero). This is also evident from the analysis of the average amplitudes of the confidence intervals, which are smaller in this scenario, without a reduction in their coverage rate.
Table 3 and Table 4 show the results of the simulation study for short time series, in this case with $n = 50$, for the various combinations of the parameter $\phi$ and the variances. For series of this size, both estimation methods had a lower rate of valid estimates, i.e., estimates within the parameter space. However, the distribution-free method associated with bootstrapping performed best in this respect, while the Gaussian maximum likelihood estimation had the lowest success rates, especially for $\phi = 0.5$. In terms of the accuracy of the estimates, from the RMSE perspective, the distribution-free method with the bootstrap had the best overall performance. The same held for the coverage rates of the confidence intervals and for their average amplitude: this method produced confidence intervals with smaller amplitudes without compromising the coverage rate (which, as a rule, remains above the confidence level considered).

3. Application to Economic Data

In this section, we implement the proposed methodology on real economic data and compare it with Gaussian maximum likelihood estimation, considering both the parameter estimation and the forecast quality.

3.1. Dataset

The ISM index chosen to illustrate an application to real data is the Manufacturing PMI, which is a monthly economic indicator of the United States of America. It is constructed through surveys conducted with purchasing managers in over 300 industrial companies. This index is a fundamental indicator for assessing and monitoring the development of the American economy. It was created by the "Institute for Supply Management", from which the designation ISM derives. This non-governmental, non-profit organization was established in 1915 and provides reports on development, education, and research to individuals, companies, and financial institutions with the purpose of creating value and enabling them to gain competitive advantages, as this information supports many decision-making processes in management.
The Manufacturing PMI index allows the analysis of changes in production levels between months. The reports are released on the first business day of a given month, making it one of the first economic indicators available to managers and investors. It is composed of five other subindicators with equal weight, as described by [25]. These subindicators are:
  • New Orders, reflecting the number of customer orders placed with companies;
  • Production, evaluating whether a company’s production has changed compared to a previous period (days, weeks, and months);
  • Employment, measuring changes in employment, whether it has increased or decreased;
  • Deliveries, revealing whether the delivery times between suppliers and the company have increased or decreased compared to a previous period;
  • Inventories, indicating how much a company’s inventories have increased or decreased.
The companies were categorized into 18 different sectors, including food and beverages, chemicals, machinery, and transportation equipment, among others. In summary, the data from the ISM Manufacturing Index, especially the PMI, allows for a comprehensive assessment of the performance of the U.S. manufacturing industry. The database, named “ISM”, considered in this work comprises 569 observations, including monthly ISM values and their respective dates. The time series analyzed included values from January 1975 up to May 2022. The data are reported on a monthly basis. For the purposes of modeling and estimating the models, a training time series up to December 2020 was considered (Figure 1), leaving the last observations for the model testing and evaluation series.
Figure 1 shows that the series may not be stationary in terms of variance. There were several oscillations throughout the series, most notably in the 1980s, between 2007 and 2010, and between 2019 and 2022. The minimum value of 29.40, in April 1980, reflects a period in which the economy was already in recession; in that decade, the unemployment rate was around 7.5%. Both the 1980 recession and the 1981–1982 recession were triggered by a restrictive monetary policy in an attempt to combat rising inflation. During the 1960s and 1970s, economists and policymakers believed that they could reduce unemployment through higher inflation, a trade-off known as the Phillips curve. This strategy severely affected U.S. industrial companies [26]. The maximum observed value of 69.90 was obtained after the recovery from the aforementioned recession.

3.2. Modeling with Regression Linear Models

To identify possible components in the ISM index time series, it was decomposed into the usual level, trend, seasonality, and noise components (Figure 2). The decomposition of the time series indicated a possible trend and a seasonal component with a 12-month period, if any, of low amplitude (around −2 to 3 points). Based on this exploratory analysis of the time series, a linear model and a state-space model will be fitted and analyzed, and their performance will be evaluated.
By examining Figure 2, it can be observed that the seasonal component was extremely small. To assess the significance of the seasonal component and the trend, a model was fit with the ISM series as the response variable and a set of explanatory variables comprising the intercept term, the time variable (since a trend was already identified), and indicator (dummy) variables for 11 months. The multiple linear regression model includes the intercept term $\alpha_0$, the coefficient $\alpha_1$ associated with the time variable $t = 1, 2, \ldots, n$, the dummy variables $d_{i,t}$, $i = 1, \ldots, 11$, where $d_{i,t}$ equals 1 when month $t$ is January ($i = 1$), …, November ($i = 11$), and the random error $\epsilon_t$, that is,
$Y_t = \alpha_0 + \alpha_1 t + \sum_{i=1}^{11}\beta_i d_{i,t} + \epsilon_t.$
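A hedged sketch of this fit with statsmodels (ism is assumed to be a pandas Series of monthly training values with a DatetimeIndex; December is used as the reference month, matching the January–November dummies above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_trend_seasonal(ism):
    """OLS fit of Y_t = a0 + a1 t + sum_i b_i d_{i,t} + eps_t."""
    df = pd.DataFrame({
        "y": ism.values,
        "t": np.arange(1, len(ism) + 1),
        "month": ism.index.month,          # requires a DatetimeIndex
    })
    # C(...) expands month into 11 dummies with December as the baseline.
    return smf.ols("y ~ t + C(month, Treatment(reference=12))", data=df).fit()
```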
Table 5 reports the estimates and corresponding p-values for the parameters of the multiple linear regression model. At a significance level of 5%, only the intercept, $\alpha_0$, and the coefficient associated with the time variable, $\alpha_1$, had p-values below 0.05, which means that the seasonal coefficients were not statistically significant when considering annual seasonality. Thus, the linear trend will be the only component considered in the linear modeling.
Table 6 presents the summary of the simple linear regression model with the time variable only, since the coefficients associated with the seasonal variables were not statistically significant.
Figure 3 shows that the residuals did not behave like a white noise process. The Shapiro–Wilk and Kolmogorov–Smirnov normality tests rejected the null hypothesis that the residuals are normal, and the analysis of the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots showed that the residuals also have a temporal correlation structure.

3.3. SARIMA Modeling

In order to obtain a model that can incorporate a temporal correlation structure, several SARIMA models were fit to the ISM index time series. From this analysis, the best-performing model was SARIMA$(2,1,0)(2,0,0)_{12}$, with the following formulation:
$(1 - 0.0539B - 0.1108B^2)(1 + 0.1345B^{12} + 0.1478B^{24})(1 - B)Y_t = \epsilon_t,$
whose summary is shown in Table 7.
The analysis of the SARIMA model’s residuals showed that the normality of their distribution was rejected (Figure 4). However, the Ljung–Box test did not reject the hypothesis that there was no correlation in the series of residuals, considering lags up to 24.

3.4. State-Space Modeling

In order to integrate the temporal correlation structure already identified, both in the fit of the simple linear regression model and in the SARIMA model, we also considered a state-space model in which the known values $W_t$ are the predicted values from the fitted simple linear regression model. The observation equation is
$Y_t = W_t\beta_t + e_t,$
where $W_t = 50.8679 + 0.0045\,t$, with $t = 1, \ldots, n$, and the state process $\{\beta_t\}$ is an autoregressive process:
$\beta_t = \mu + \phi(\beta_{t-1} - \mu) + \varepsilon_t.$
This state-space model can be understood as a calibration model in which the linear trend is taken as the base structure and is calibrated, at each time $t$, by a stochastic factor $\beta_t$. The model thus incorporates a temporal correlation structure, in this case through the state process, and a dynamic adjustment over time. We would expect the mean $\mu$ of the calibration factor process to be close to 1, with each calibration factor acting as a correction that either increases the value expected by the trend ($\beta_t > 1$) or decreases it ($\beta_t < 1$).
This model was fitted and its unknown parameters, $\Theta = (\mu, \phi, \sigma_e^2, \sigma_\varepsilon^2)$, were estimated both by the maximum likelihood method, under the assumption of normality of the errors $e_t$ and $\varepsilon_t$, and by the distribution-free estimators proposed in [13]. The standard errors and confidence intervals of the latter were obtained via bootstrapping. The parameter estimates and the respective 95% confidence intervals are shown in Table 8. With both estimation methods, the estimate of the observation error variance was zero. This implies that, in practice, the response variable, in this case the ISM index, is explained by the calibration of the linear trend $W_t$ through the order-one autoregressive state process, without additional noise. However, it should be noted that the resulting model is heteroscedastic, as the variance of the response variable $Y_t$ is given by $var(Y_t) = W_t^2\,\sigma_\varepsilon^2\,(1-\phi^2)^{-1}$, given that the state process $\{\beta_t\}$ is stationary ($|\hat{\phi}| < 1$). From the analysis of the results, it can be concluded that both estimation methods provide very similar point estimates of the parameters. The most significant difference lies in $\hat{\sigma}_\varepsilon^2$: the maximum likelihood estimate of this variance is about 21 times the value of the non-parametric estimate.
With regard to the maximum likelihood estimation, the standardized innovations must be analyzed to check whether they behave like white noise. From the analysis of the standardized innovations (Table 9) and the tests for normality and correlation, both normality and the hypothesis of no correlation were rejected, indicating that the model's normality assumptions do not hold.
The standard errors and bootstrap confidence intervals were obtained from the empirical bootstrap distributions of the 1000 replicates obtained for each parameter; see Figure 5. The descriptive statistics of the bootstrap distributions are shown in Table 10. The maximum likelihood estimation, although dependent on the assumptions about the distribution of the errors, still provided interesting results from a modeling perspective, and the differences compared to the non-parametric estimation were not significant. This suggests that, despite the challenges observed with its assumptions, maximum likelihood estimation remains a robust tool for the analysis of state-space models, at least in the specific application of this study. On the other hand, the non-parametric approach made it possible to overcome the limitations associated with the normality assumption.

3.5. Forecasting

In this section, we present the forecasting procedure in a statistical modeling context, focusing on the three fitted models. The period for which one-step-ahead forecasts are desired corresponds to the test series; that is, it extends from January 2021 to May 2022, comprising a total of 17 observations. Forecasts were obtained from three different models: the simple linear regression model (SLRM), the SARIMA model, and the state-space model, whose parameters were estimated both by Gaussian maximum likelihood (SSM-ML) and by the distribution-free estimators associated with bootstrapping for the standard errors and confidence intervals (SSM-DFb).
Table 11, complemented by the analysis of the graphs in Figure 6, allows us to conclude that the state-space models performed best in terms of the accuracy of the one-step predictions in the test series. These test observations were not used to fit the two state-space models. The lowest RMSE was obtained with the state-space model under Gaussian maximum likelihood parameter estimation; however, the SSM-DFb model had a very similar RMSE, with no statistically significant difference (the p-value of the Diebold–Mariano test was 0.3981 [27]). The SARIMA model forecasts, in turn, had a higher root-mean-squared error.
Analyzing the confidence intervals of the 17 one-step-ahead forecasts based on the maximum likelihood estimation and on the distribution-free bootstrap estimators, the latter produced bootstrap prediction intervals with an average semi-amplitude of 4.56, while the Gaussian maximum likelihood estimation produced a value of 4.26. In addition, with both methods, only one of the observations fell outside the respective prediction interval, corresponding to 5.9%, a value close to the significance level considered. In this particular case, the SARIMA model produced prediction intervals with an average semi-amplitude of 4.17, slightly lower than with the other models, but, as already mentioned, it performed worse in terms of the root-mean-squared error of the point forecasts at these 17 forecast instants.
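For reference, the Diebold–Mariano comparison reported above can be sketched as follows for one-step forecasts under squared-error loss (a simplified version of ours without the HAC long-run variance correction, which is only needed for horizons h > 1):

```python
import numpy as np
from scipy.stats import norm

def diebold_mariano(e1, e2):
    """DM test for equal accuracy of two one-step forecast error series."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2   # loss differential
    n = len(d)
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)      # studentized mean of d
    p_value = 2.0 * (1.0 - norm.cdf(abs(dm)))       # two-sided normal p-value
    return dm, p_value
```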

4. Conclusions

The results showed that modeling time series using state-space models is a good approach for obtaining both point and interval forecasts. In addition to the widely used linear regression models and SARIMA models, which have their advantages (easy implementation and interpretation for the former; wide application, particularly to economic data series, for the latter), state-space models are an alternative that provides good results in terms of predictive quality. State-space models are also popular due to their flexibility and easy interpretation. The distribution-free estimators linked to the bootstrap methodology present an alternative to maximum likelihood estimation. One notable advantage is that they do not require an optimization process, and they demonstrate superior performance, particularly in achieving a lower mean-squared error for the estimates and narrower amplitudes for the corresponding confidence intervals. This advantage is particularly pronounced for shorter time series. These estimates can also serve as initial values in iterative processes for obtaining maximum likelihood estimates. A contribution of this work was the incorporation of the bootstrap methodology in obtaining estimates, standard errors, and confidence intervals for the parameters. This is relevant, since neither the exact distribution of the distribution-free estimators nor an asymptotic distribution is known.

Author Contributions

Conceptualization and methodology, M.C. and A.M.G.; software, J.F.L. and F.C.P.; validation, M.C. and A.M.G.; investigation, J.F.L. and F.C.P.; resources, J.F.L.; data curation, J.F.L.; writing, review and editing, all authors. All authors have read and agreed to the published version of the manuscript.

Funding

F. Catarina Pereira was funded by national funds through Fundação para a Ciência e a Tecnologia (FCT) through the individual Ph.D. research grant UI/BD/150967/2021 of CMAT-UM. A. Manuela Gonçalves was partially financed by Portuguese Funds through FCT within the Projects UIDB/00013/2020 and UIDP/00013/2020 of CMAT-UM. Marco Costa was partially supported by The Center for Research and Development in Mathematics and Applications (CIDMA-UA) through the Portuguese Foundation for Science and Technology—FCT, references UIDB/04106/2020 and UIDP/04106/2020.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Barbaglia, L.; Consoli, S.; Manzan, S. Forecasting with Economic News. J. Bus. Econ. Stat. 2023, 41, 708–719. [Google Scholar]
  2. Lima, S.; Gonçalves, A.M.; Costa, M. Time series forecasting using Holt-Winters exponential smoothing: An application to economic data. AIP Conf. Proc. 2019, 2186, 090003. [Google Scholar]
  3. Perone, G. Using the SARIMA Model to Forecast the Fourth Global Wave of Cumulative Deaths from COVID-19: Evidence from 12 Hard-Hit Big Countries. Econometrics 2022, 10, 18. [Google Scholar] [CrossRef]
  4. Alqatawna, A.; Abu-Salih, B.; Obeid, N.; Almiani, M. Incorporating Time-Series Forecasting Techniques to Predict Logistics Companies’ Staffing Needs and Order Volume. Computation 2023, 11, 141. [Google Scholar] [CrossRef]
  5. Aoki, M. Studies of Economic Interdependence by State Space Modeling of Time Series: US-Japan Example. Ann. d’Économie Stat. 1987, 1987, 225–252. [Google Scholar] [CrossRef]
  6. Borrero, J.D.; Mariscal, J. Predicting Time Series Using an Automatic New Algorithm of the Kalman Filter. Mathematics 2022, 10, 2915. [Google Scholar] [CrossRef]
  7. Hyndman, R.J.; Koehler, A.B.; Snyder, R.D.; Grose, S. A state-space framework for automatic forecasting using exponential smoothing methods. Int. J. Forecast. 2002, 18, 439–454. [Google Scholar] [CrossRef]
  8. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018. [Google Scholar]
  9. Albarakati, A.; Budišić, M.; Crocker, R.; Glass-Klaiber, J.; Iams, S.; Maclean, J.; Marshall, N.; Roberts, C.; Van Vleck, E.S. Model and data reduction for data assimilation: Particle filters employing projected forecasts and data with application to a shallow water model. Comput. Math. Appl. 2022, 116, 194–211. [Google Scholar]
  10. Chau, T.; Ailliot, P.; Monbet, V. An algorithm for non-parametric estimation in state-space models. Comput. Stat. Data Anal. 2021, 153, 107062. [Google Scholar] [CrossRef]
  11. Hoayek, A.S.; Ducharme, G.R.; Khraibani, Z. Distribution-free inference in record series. Extremes 2017, 20, 585–603. [Google Scholar] [CrossRef]
  12. Wooldridge, J.M. Applications of Generalized Method of Moments Estimation. J. Econ. Perspect. 2001, 15, 87–100. [Google Scholar] [CrossRef]
  13. Costa, M.; Alpuim, T. Parameter estimation of state-space models for univariate observations. J. Stat. Plan. Inference 2010, 140, 1889–1902. [Google Scholar] [CrossRef]
  14. Gonçalves, A.M.; Costa, M. Predicting seasonal and hydro-meteorological impact in environmental variables modelling via Kalman filtering. Stoch. Environ. Res. Risk Assess. 2013, 27, 1021–1038. [Google Scholar] [CrossRef]
  15. Berkowitz, J.; Kilian, L. Recent developments in bootstrapping time series. Econom. Rev. 2000, 19, 1–48. [Google Scholar] [CrossRef]
  16. Stoffer, D.S.; Wall, K.D. Bootstrapping State-Space Models: Gaussian Maximum Likelihood Estimation and the Kalman Filter. J. Am. Stat. Assoc. 1991, 86, 1024–1033. [Google Scholar] [CrossRef]
  17. Angelini, G.; Cavaliere, G.; Fanelli, L. Bootstrap inference and diagnostics in state-space models: With applications to dynamic macro models. J. Appl. Econom. 2022, 37, 3–22. [Google Scholar] [CrossRef]
  18. Tsuchiya, Y. Purchasing and supply managers provide early clues on the direction of the US economy: An application of a new market-timing test. Int. Rev. Econ. Financ. 2014, 29, 599–618. [Google Scholar]
  19. Moon, H.; Lee, H.; Song, B. Mixed pooling of seasonality for time series forecasting: An application to pallet transport data. Expert Syst. Appl. 2022, 201, 117195. [Google Scholar] [CrossRef]
  20. Costa, M.; Monteiro, M. Bias-correction of Kalman filter estimators associated to a linear state-space model with estimated parameters. J. Stat. Plan. Inference 2016, 176, 22–32. [Google Scholar] [CrossRef]
  21. Snyder, R.D.; Forbes, C.S. Reconstructing the Kalman Filter for Stationary and Non Stationary Time Series. Stud. Nonlinear Dyn. Econom. 2002, 7, 1. [Google Scholar]
  22. Rodriguez, A.; Ruiz, E. Bootstrap prediction intervals in state–space models. J. Time Ser. Anal. 2009, 30, 167–178. [Google Scholar] [CrossRef]
  23. Pfeffermann, D.; Tiller, R. Bootstrap approximation to prediction MSE for state-space models with estimated parameters. J. Time Ser. Anal. 2005, 26, 893–916. [Google Scholar] [CrossRef]
  24. Anderson, B.D.O.; Moore, J.B. Optimal Filtering; Prentice-Hall: Englewood Cliffs, NJ, USA, 1979; p. 44. [Google Scholar]
  25. Bognanni, M.; Young, T. An assessment of the ism manufacturing price index for inflation forecasting. Econ. Comment. 2018, 2018, 1–6. [Google Scholar]
  26. Goodfriend, M.; King, R.G. The incredible volcker disinflation. J. Monet. Econ. 2005, 52, 981–1015. [Google Scholar] [CrossRef]
  27. Diebold, F.X.; Mariano, R.S. Comparing predictive accuracy. J. Bus. Econ. Stat. 1995, 13, 253–263. [Google Scholar]
Figure 1. Representation of the ISM index time series from January 1974 to May 2022. The red line represents the average value 52.90 over the period.
Figure 2. Decomposition of the ISM time series.
Figure 3. Analysis of the residuals from the fit of the simple linear regression model; the blue dashed lines represent the limits of the 95% confidence intervals.
Figure 4. Histogram, QQ-plot, ACF, and PACF of the residuals from the fit of the SARIMA model; the blue dashed lines represent the limits of the 95% confidence intervals.
Figure 5. Empirical distributions of the 1000 bootstrap distribution-free estimates of $\mu$, $\phi$, and $\sigma_\varepsilon^2$ (top left: $\mu$; top right: $\phi$; bottom: $\sigma_\varepsilon^2$).
Figure 6. One-step-ahead forecasts: SARIMA model (top); state-space model with Gaussian maximum likelihood estimation (SSM-ML); state-space model with distribution-free estimators associated with bootstrapping (SSM-DFb).
Table 1. Simulation study results for time series of dimension n = 500: root-mean-squared error (RMSE) and convergence/success rate (CR) (the rate of the estimates within the parameter space); Gaussian maximum likelihood (GML), distribution-free estimation with bootstrapping (DFb).

φ    | σ_ε² | σ_e² | Method | RMSE φ | RMSE σ_ε | RMSE σ_e | CR
0.50 | 0.05 | 0.01 | GML    | 0.0720 | 0.0242   | 0.0517   | 95%
     |      |      | DFb    | 0.1007 | 0.0372   | 0.0439   | 98%
0.50 | 1.00 | 0.50 | GML    | 0.0921 | 0.1564   | 0.2633   | 99%
     |      |      | DFb    | 0.0940 | 0.1520   | 0.1465   | 99%
0.90 | 0.05 | 0.01 | GML    | 0.0226 | 0.0147   | 0.0207   | 100%
     |      |      | DFb    | 0.0241 | 0.0251   | 0.0279   | 99%
0.90 | 1.00 | 0.50 | GML    | 0.0251 | 0.0783   | 0.0758   | 100%
     |      |      | DFb    | 0.0260 | 0.1098   | 0.0901   | 100%
Table 2. Simulation study results for time series of dimension n = 500: coverage rate of confidence intervals at 95% (CVR) and average amplitude (AvgA); Gaussian maximum likelihood (GML), distribution-free estimation with bootstrapping (DFb).

φ    | σ_ε² | σ_e² | Method | CVR φ | CVR σ_ε | CVR σ_e | AvgA φ | AvgA σ_ε | AvgA σ_e
0.50 | 0.05 | 0.01 | GML    | 91%   | 83%     | 92%     | 0.3160 | 0.1065   | 0.3347
     |      |      | DFb    | 100%  | 99%     | 100%    | 0.2866 | 0.0920   | 0.1193
0.50 | 1.00 | 0.50 | GML    | 89%   | 88%     | 92%     | 0.3834 | 0.6322   | 1.2232
     |      |      | DFb    | 100%  | 100%    | 100%    | 0.3325 | 0.5021   | 0.5568
0.90 | 0.05 | 0.01 | GML    | 97%   | 94%     | 96%     | 0.0898 | 0.0587   | 0.0810
     |      |      | DFb    | 100%  | 100%    | 100%    | 0.1001 | 0.0851   | 0.1023
0.90 | 1.00 | 0.50 | GML    | 94%   | 94%     | 96%     | 0.0943 | 0.3009   | 0.2913
     |      |      | DFb    | 100%  | 100%    | 100%    | 0.1038 | 0.4102   | 0.3441
Table 3. Simulation study results for time series of dimension n = 50: root-mean-squared error (RMSE) and convergence/success rate (CR) (the rate of the estimates within the parameter space); Gaussian maximum likelihood (GML), distribution-free estimation with bootstrapping (DFb).

φ    | σ_ε² | σ_e² | Method | RMSE φ | RMSE σ_ε | RMSE σ_e | CR
0.50 | 0.05 | 0.01 | GML    | 0.1946 | 0.0637   | 0.0802   | 78%
     |      |      | DFb    | 0.1649 | 0.0655   | 0.0560   | 88%
0.50 | 1.00 | 0.50 | GML    | 0.2297 | 0.3066   | 0.4391   | 88%
     |      |      | DFb    | 0.1843 | 0.2790   | 0.2128   | 88%
0.90 | 0.05 | 0.01 | GML    | 0.1066 | 0.0420   | 0.0552   | 86%
     |      |      | DFb    | 0.1047 | 0.0505   | 0.0448   | 89%
0.90 | 1.00 | 0.50 | GML    | 0.1225 | 0.2285   | 0.2793   | 92%
     |      |      | DFb    | 0.1329 | 0.2076   | 0.1936   | 98%
Table 4. Simulation study results for time series of dimension n = 50: coverage rate of confidence intervals at 95% (CVR) and average amplitude (AvgA); Gaussian maximum likelihood (GML), distribution-free estimation with bootstrapping (DFb).

φ    | σ_ε² | σ_e² | Method | CVR φ | CVR σ_ε | CVR σ_e | AvgA φ | AvgA σ_ε | AvgA σ_e
0.50 | 0.05 | 0.01 | GML    | 85%   | 87%     | 86%     | 0.7715 | 0.2340   | 0.8432
     |      |      | DFb    | 100%  | 80%     | 100%    | 0.5982 | 0.1401   | 0.1524
0.50 | 1.00 | 0.50 | GML    | 82%   | 81%     | 93%     | 0.9043 | 1.7546   | 4.6511
     |      |      | DFb    | 100%  | 95%     | 100%    | 0.6788 | 0.7327   | 0.7945
0.90 | 0.05 | 0.01 | GML    | 94%   | 93%     | 97%     | 0.3499 | 0.1782   | 0.3716
     |      |      | DFb    | 100%  | 94%     | 100%    | 0.2975 | 0.1434   | 0.1426
0.90 | 1.00 | 0.50 | GML    | 93%   | 92%     | 97%     | 0.3790 | 0.9682   | 1.4738
     |      |      | DFb    | 100%  | 99%     | 100%    | 0.3576 | 0.7044   | 0.7054
Table 5. Estimates, standard errors, t-values, and p-values of the multiple linear regression model with seasonal coefficients and the time variable.

Parameter | Estimate | Standard Error | t-Value | p-Value
α0  | 50.6512 | 0.9782 | 51.78 | 2.00 × 10⁻¹⁶
α1  | 0.0064  | 0.0015 | 4.18  | 3.45 × 10⁻⁵
β1  | 0.5033  | 1.2290 | 0.41  | 0.6823
β2  | 0.0347  | 1.2290 | 0.03  | 0.9775
β3  | 0.3598  | 1.2290 | 0.29  | 0.7698
β4  | 0.5662  | 1.2290 | 0.46  | 0.6452
β5  | 0.5434  | 1.2290 | 0.44  | 0.6585
β6  | 0.0894  | 1.2354 | 0.07  | 0.9424
β7  | 0.0511  | 1.2354 | 0.04  | 0.9670
β8  | 0.2808  | 1.2354 | 0.23  | 0.8203
β9  | 0.1213  | 1.2354 | 0.10  | 0.9218
β10 | 0.1362  | 1.2354 | 0.11  | 0.9123
β11 | 0.0234  | 1.2354 | 0.02  | 0.9849
Table 6. Summary of the simple linear regression model with the time variable.

Parameter | Estimate | Standard Error | p-Value
α0 | 50.8679 | 0.5053 | 2.00 × 10⁻¹⁶
α1 | 0.0045  | 0.0016 | 0.0046
σ² | 5.98    |        |
Table 7. Summary of the SARIMA model.

Parameter | Estimate | Standard Error
φ1 | 0.0539  | 0.0437
φ2 | 0.1108  | 0.0429
Φ1 | −0.1345 | 0.0456
Φ2 | −0.1478 | 0.0450
σ² = 4.431; Log L = −1168.54; AIC = 2347
Table 8. Estimates, confidence intervals at 95%, and standard errors for both the Gaussian maximum likelihood and distribution-free estimation with bootstrapping.

          |        Gaussian Maximum Likelihood     |   Distribution-Free with Bootstrapping
Parameter | Estimate | S.E.   | Lower l. | Upper l. | Estimate | Q 2.5% | Q 97.5% | S.E.
μ         | 1.0017   | 0.0251 | 0.9525   | 1.0509   | 1.0000   | 0.9505 | 1.0572  | 0.0278
φ         | 0.9300   | 0.0159 | 0.8989   | 0.9612   | 0.8943   | 0.8105 | 0.9698  | 0.0418
σ_ε²      | 0.0407   | 0.0012 | 0.0383   | 0.0431   | 0.0019   | 0.0012 | 0.0027  | 0.0004
σ_e²      | 0        |        |          |          | 0        |        |         |
Table 9. Test values for normality and correlation tests on the series of standardized innovations in the Gaussian maximum likelihood estimation.

Test | p-Value
Shapiro–Wilk | 3.052 × 10⁻⁹
Kolmogorov–Smirnov | 0.02283
Ljung–Box | 5.303 × 10⁻⁶
Table 10. Descriptive statistics of the distributions of the bootstrap estimates of the parameters of the 1000 replicates.

      | Min.   | Q 25%  | Median | Mean   | Q 75%  | Max.   | Variance
μ̂    | 0.8850 | 0.9887 | 1.0008 | 1.0013 | 1.0137 | 1.1204 | 0.0004
φ̂    | 0.7255 | 0.8790 | 0.9038 | 0.8998 | 0.9293 | 0.9789 | 0.0016
σ̂_ε² | 0.0009 | 0.0016 | 0.0019 | 0.0019 | 0.0021 | 0.0036 | 1.414 × 10⁻⁷
Table 11. Root-mean-squared error (RMSE) of the forecasts from the 4 models.

     | SLRM | SARIMA | SSM-ML | SSM-DFb
RMSE | 6.56 | 2.12   | 1.76   | 1.77