1. Introduction
Time series modeling often relies on capturing temporal dependencies and residual structures to achieve robust predictions [1,2]. Traditional models such as the Autoregressive Integrated Moving Average (ARIMA) and the Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX) assume normally distributed errors [1,3,4]; however, real-world data frequently exhibit skewness, zero inflation, or other non-Gaussian characteristics [5,6,7]. SARIMAX models are widely used for time series analysis, especially when dealing with seasonal patterns and exogenous inputs. Ref. [8] demonstrated the use of grid-based optimization for SARIMAX hyperparameters in seasonal forecasting tasks, while Ref. [9] explored hybrid SARIMAX architectures that integrate neural networks to improve predictive performance in systems with complex external drivers. Recent developments in Bayesian time series analysis, including models with skew-t innovations [10], provide alternative frameworks for handling asymmetry and heavy tails. The SARIMAX model is also increasingly used in big data contexts for its forecasting capabilities across various domains; see, for example, [8,11]. Other flexible alternatives, such as COM-Poisson autoregressive models, address overdispersion and zero inflation simultaneously in time series data [12]. While these approaches are not the focus of the present study, they represent valuable directions for complementary or comparative analysis.
The skew-normal distribution, introduced by [13], and its zero-inflated variant [14] provide flexible frameworks for modeling such error terms. A skew-normal distribution can be integrated into the error term to better capture asymmetry in time series data, improving estimation and forecasting in autoregressive models [15]. Some studies have instead proposed combining SARIMA structures with semiparametric components to add flexibility in capturing nonlinear effects [16]. Our approach, by contrast, remains fully parametric: it extends SARIMAX by embedding skew-normal and zero-inflated skew-normal distributions into the error term, enabling better handling of non-Gaussian residual structures without altering the linear temporal framework.
One of the motivations for employing SARIMAX with skew-normal and zero-inflated skew-normal errors lies in their ability to handle asymmetries and overdispersion more effectively than traditional methods like generalized autoregressive conditional heteroskedasticity (GARCH) or beta regression [17,18]. While GARCH models excel at capturing volatility clustering, they assume symmetric error distributions, which may lead to biased parameter estimates when the underlying data exhibit skewness [17]. Similarly, beta regression, despite its utility in modeling proportions, lacks the ability to handle time-dependent structures and external predictors efficiently [18].
In this paper, we explore the incorporation of the skew-normal and zero-inflated skew-normal distributions into SARIMAX models to enhance both realism and accuracy in applications where error asymmetry or intermittency is evident, such as microbiological growth, where natural variation in the data arises from biological conditions. This integration improves performance by better capturing the underlying data distribution, particularly when skewness and intermittency are pronounced and normal error assumptions fail. This study addresses a gap in the existing literature by combining zero-inflated models with SARIMAX and skew-normal error structures to account for both zero inflation and asymmetry in time series data; prior works have largely focused on either zero inflation or skewed distributions in isolation. By integrating these approaches, the proposed methodology provides a single framework for modeling complex real-world phenomena. Although SARIMAX models traditionally assume Gaussian innovations, empirical time series frequently present non-Gaussian features such as skewness or zero inflation in the residuals. These deviations can lead to biased parameter estimates, undercoverage of confidence intervals, and suboptimal forecasts. Incorporating skew-normal and zero-inflated skew-normal distributions into the SARIMAX framework allows the model to account directly for asymmetry and excess zeros, enhancing estimation and predictive performance in complex applied contexts.
In what follows, Section 2 presents the mathematical background of the proposed SARIMAX models with skew-normal and zero-inflated skew-normal errors. Section 3 outlines the methodology: data preparation, model specification, parameter estimation, goodness-of-fit assessment, validation, and forecasting; it also introduces a computational implementation framework and the design of the simulation experiments. Section 4 presents the results of these experiments, and Section 5 discusses the findings, highlighting their implications and potential applications.
2. Mathematical Background
A SARIMAX model with skew-normal errors is given by:
where:
and
are polynomials of order
p and
q, representing the autoregressive (AR) and moving average (MA) components, respectively:
B is the backshift operator, defined as .
d is the order of differencing required to achieve stationarity in .
represents exogenous variables influencing .
accounts for seasonal components.
is a coefficient matrix that captures the influence of seasonal components in .
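$\varepsilon_t$ is the error term, assumed to follow a skew-normal distribution $\mathrm{SN}(\xi, \omega, \alpha)$, where:
- $\xi$ is the location parameter, controlling the central tendency.
- $\omega$ is the scale parameter, determining the dispersion.
- $\alpha$ is the skewness parameter, with $\alpha = 0$ corresponding to a symmetric normal distribution.
The skew-normal distribution, $\mathrm{SN}(\xi, \omega, \alpha)$, is a generalization of the normal distribution allowing for skewness. Its probability density function (PDF) is given by:
$$f(x; \xi, \omega, \alpha) = \frac{2}{\omega}\,\phi\!\left(\frac{x - \xi}{\omega}\right)\Phi\!\left(\alpha\,\frac{x - \xi}{\omega}\right),$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal PDF and cumulative distribution function, respectively.
Key moments of the skew-normal distribution include:
Mean: $\xi + \omega\delta\sqrt{2/\pi}$, where $\delta = \alpha/\sqrt{1 + \alpha^2}$.
Variance: $\omega^2\left(1 - 2\delta^2/\pi\right)$.
Skewness and higher moments depend on $\alpha$, controlling the asymmetry.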
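As a quick numerical check of the skew-normal mean and variance, the following sketch compares the closed-form expressions with the values reported by `scipy.stats.skewnorm`. The function name `skew_normal_moments` and the parameter values are illustrative assumptions, not taken from the study.

```python
import numpy as np
from scipy.stats import skewnorm


def skew_normal_moments(xi, omega, alpha):
    """Closed-form mean and variance of SN(xi, omega, alpha)."""
    delta = alpha / np.sqrt(1.0 + alpha**2)
    mean = xi + omega * delta * np.sqrt(2.0 / np.pi)
    var = omega**2 * (1.0 - 2.0 * delta**2 / np.pi)
    return mean, var


# Illustrative parameter values only (assumption, not from the paper)
xi, omega, alpha = 0.5, 2.0, 3.0
mean_cf, var_cf = skew_normal_moments(xi, omega, alpha)

# Reference values from SciPy's parameterization (shape a = alpha)
mean_sp, var_sp = skewnorm.stats(alpha, loc=xi, scale=omega, moments="mv")
print("mean:", mean_cf, float(mean_sp))
print("variance:", var_cf, float(var_sp))
```

Setting `alpha = 0` gives `delta = 0`, so the mean reduces to `xi` and the variance to `omega**2`, which is the symmetric normal case.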
For the SARIMAX model to be stationary and invertible, the following conditions must hold:
Stationarity: The roots of the polynomial must lie outside the unit circle, ensuring the AR process does not exhibit explosive behavior.
Invertibility: The roots of the polynomial must also lie outside the unit circle, ensuring the MA process is well-defined.
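The two root conditions can be checked numerically. The sketch below is a minimal illustration that assumes the common sign conventions $1 - \phi_1 z - \dots - \phi_p z^p$ for the AR polynomial and $1 + \theta_1 z + \dots + \theta_q z^q$ for the MA polynomial; the helper name `roots_outside_unit_circle` is hypothetical.

```python
import numpy as np


def roots_outside_unit_circle(coefs, kind="ar"):
    """Return True if all roots of the lag polynomial lie outside the unit circle.

    kind="ar": 1 - c_1 z - ... - c_p z^p   (assumed AR convention)
    kind="ma": 1 + c_1 z + ... + c_q z^q   (assumed MA convention)
    """
    coefs = np.asarray(coefs, dtype=float)
    sign = -1.0 if kind == "ar" else 1.0
    # np.roots expects coefficients ordered from highest to lowest degree
    poly = np.concatenate((sign * coefs[::-1], [1.0]))
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) > 1.0))


print(roots_outside_unit_circle([0.5], kind="ar"))  # AR(1) with phi = 0.5
print(roots_outside_unit_circle([1.2], kind="ar"))  # AR(1) with phi = 1.2
print(roots_outside_unit_circle([0.3], kind="ma"))  # MA(1) with theta = 0.3
```

For an AR(1) process the single root is $1/\phi_1$, so the check is equivalent to $|\phi_1| < 1$.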
To ensure stationarity, we applied the Augmented Dickey–Fuller (ADF) test to each series prior to estimation. If the null hypothesis of a unit root was not rejected, we applied first-order differencing. This transformation preserves the essential structure of the temporal dynamics while satisfying the stationarity condition required for SARIMAX model estimation.
In scenarios where the error term exhibits both skewness and intermittence, a zero-inflated skew-normal distribution can be employed. The error term
is defined as a mixture:
where
is the zero-inflation probability.
The likelihood function for
in this case combines the two components:
where
is the indicator function.
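A minimal sketch of such a mixture likelihood is given below. It assumes the usual form in which an error equals zero with probability `p` and otherwise follows a skew-normal distribution, so exact zeros are attributed to the point mass; the function name `zisn_loglik` and the argument names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import skewnorm


def zisn_loglik(eps, p, xi, omega, alpha):
    """Log-likelihood of a zero-inflated skew-normal sample (requires 0 < p < 1).

    Assumed mixture: eps = 0 with probability p,
    otherwise eps ~ SN(xi, omega, alpha).
    """
    eps = np.asarray(eps, dtype=float)
    is_zero = eps == 0.0
    n_zero = int(is_zero.sum())
    nonzero = eps[~is_zero]
    # Contribution of the point mass at zero
    ll_zero = n_zero * np.log(p)
    # Contribution of the continuous skew-normal component
    ll_cont = nonzero.size * np.log1p(-p) + skewnorm.logpdf(
        nonzero, alpha, loc=xi, scale=omega
    ).sum()
    return float(ll_zero + ll_cont)


# Tiny illustrative sample: two zeros and one non-zero residual
print(zisn_loglik([0.0, 0.0, 1.0], p=0.5, xi=0.0, omega=1.0, alpha=0.0))
```

Under this form the two parts of the likelihood separate, so the zero-inflation probability and the skew-normal parameters can be estimated from the zero and non-zero residuals, respectively.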
For SARIMAX models with zero-inflated skew-normal errors, the probability of observing zeros in the series is modeled using a logistic regression approach. The binary outcome variable is defined as:
The probability of zeros is estimated using the logistic function:
In this context,
represents the estimated probability that the observed value is zero at time
t, conditional on the covariates and time index. This probability is computed using the logistic regression model defined above. It plays a central role in the predictive Equation (
7), where it governs the contribution of the zero-inflation mechanism to the overall forecast.
The logistic function is applied to a linear combination of the predictors, where:
is the intercept term.
captures the linear effect of time t on the log-odds of observing a zero in .
are the coefficients for each component of the exogenous variables , reflecting how each variable influences the likelihood of being zero. Note that this formulation corresponds to a logistic regression model applied independently from the SARIMAX structure. Each exogenous variable contributes linearly to the log-odds of observing a structural zero, forming a standard multiple logistic regression to estimate . This model is conceptually separate from the SARIMAX predictor, which governs the continuous part of the time series dynamics.
This logistic regression framework allows for a nuanced understanding of how both temporal dynamics and external factors (captured by ) contribute to the occurrences of zeros in the time series data. It enables the model to adjust dynamically to changes over time and varying conditions represented by the exogenous variables.
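The logistic component can be sketched as a standard maximum likelihood fit of the zero indicator on an intercept, a time index, and the exogenous variables. The code below is an illustration under stated assumptions: the name `fit_zero_logit` is hypothetical, the time index is rescaled to (0, 1] for numerical conditioning, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def fit_zero_logit(y, X):
    """Fit P(y_t = 0) = expit(b0 + b1 * t + X_t @ g) by maximum likelihood."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    z = (y == 0.0).astype(float)            # binary zero indicator
    t = np.arange(1, len(y) + 1) / len(y)   # rescaled time index (assumption)
    design = np.column_stack((np.ones(len(y)), t, X))

    def nll(beta):
        eta = design @ beta
        # Negative Bernoulli log-likelihood, written stably with logaddexp
        return np.sum(np.logaddexp(0.0, eta) - z * eta)

    res = minimize(nll, np.zeros(design.shape[1]), method="BFGS")
    return res.x, expit(design @ res.x)


# Synthetic example: about 30% zeros, one exogenous variable
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y = rng.normal(size=n)
y[rng.random(n) < 0.3] = 0.0
beta_hat, pi_hat = fit_zero_logit(y, x)
print(beta_hat)
print(pi_hat[:5])
```

The returned `pi_hat` plays the role of the estimated zero probability at each time point; in practice a library routine such as a generalized linear model fit could replace the hand-written optimizer.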
This mathematical framework provides the basis for incorporating skew-normal and zero-inflated skew-normal distributions into SARIMAX models.
For time series with significant skewness, a transformation stabilizes variance and mitigates extreme values:
where
c is a small constant to handle zeros. After model estimation, predictions are back-transformed:
Equations (8) and (9) are proposed by the authors. They introduce a sign-dependent log transformation that stabilizes the variance in skewed series while preserving the scale and direction of the original data. This formulation is conceptually inspired by classical transformation techniques, including the Box–Cox transformation [
19] and the alternative family of transformations proposed by [
20], but it represents a novel methodological contribution tailored to asymmetric time series modeling.
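The exact form of Equations (8) and (9) is not reproduced here, so the following is only a sketch of one sign-preserving, exactly invertible log transformation with a small constant `c`. The specific form `sign(y) * log(1 + |y| / c)` and the function names are assumptions for illustration and may differ from the authors' definition.

```python
import numpy as np


def signed_log(y, c=0.01):
    """Sign-preserving log transform: sign(y) * log(1 + |y| / c)."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.log1p(np.abs(y) / c)


def inverse_signed_log(z, c=0.01):
    """Back-transform of signed_log: sign(z) * c * (exp(|z|) - 1)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * c * np.expm1(np.abs(z))


y = np.array([-3.0, -0.2, 0.0, 0.5, 10.0])
z = signed_log(y)
print(z)
print(inverse_signed_log(z))
```

In this variant the logarithm is never negative, so the sign of the original value is preserved, zeros map to zeros, and the back-transformation recovers the original scale.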
The SARIMAX residuals follow a skew-normal distribution:
where parameters $\xi$, $\omega$, and $\alpha$ are estimated via maximum likelihood.
For zero-inflated error terms, the residuals follow a zero-inflated skew-normal distribution:
where
p denotes the zero-inflation probability.
Final predictions integrate zero inflation and adjusted residuals:
For forecasting, the method extends recursively:
where
.
In the recursive forecasting Equation (
13),
represents the forecasted value from the previous step, used as input for predicting
. This structure is consistent with SARIMAX forecasting procedures where future predictions are conditioned on prior predictions and estimated residuals. To clarify notation, we use
for in-sample fitted values, and
for h-step-ahead forecasts. Likewise,
denotes estimated residuals from the skew-normal error structure.
For highly skewed data, a log-transformation improves stability:
with
c ensuring numerical stability. Back-transformation applies after forecasting:
3. Methodology
This section has four parts: a step-by-step procedure for modeling time series using SARIMAX with skew-normal or zero-inflated skew-normal errors, the computational framework, a simulation study, and an illustrative case study.
3.1. Step-by-Step Procedure for Modeling Time Series Using SARIMAX with Skew-Normal or Zero-Inflated Skew-Normal Errors
Algorithms 1–6 show step-by-step time series modeling using SARIMAX with skew-normal or zero-inflated skew-normal errors.
Algorithm 1 Step 1: Data Preparation |
- 1:
Input: Raw time series data with exogenous variables (if available). - 2:
Output: Preprocessed time series ready for modeling. - 3:
Examine the time series for trends, seasonality, and stationarity:
Use visual inspection and statistical tests (e.g., ADF test) to check stationarity [ 21].
- 4:
Apply transformations if necessary:
- 5:
Assess intermittency:
|
Algorithm 2 Step 2: Model Specification |
- 1:
Input: Preprocessed time series and exogenous variables (if available). - 2:
Output: SARIMAX model structure and error distribution. - 3:
Identify ARIMA orders (p, d, q) using:
- 4:
Include seasonal components if periodic patterns are evident. - 5:
Specify the error distribution:
For skew-normal errors: Define parameters $\xi$ (location), $\omega$ (scale), and $\alpha$ (skewness). For zero-inflated skew-normal errors: Define the zero-inflation probability p and the skew-normal parameters.
|
Algorithm 3 Step 3: Maximum Likelihood Estimation for SARIMAX |
- 1:
Input: Model structure and preprocessed data. - 2:
Output: Estimated parameters for the SARIMAX model. - 3:
Fit the SARIMAX model by maximizing the log-likelihood function:
where: is the parameter vector including ARIMA coefficients and exogenous variable effects. is the covariance matrix of residuals. Y is the observed time series, is its mean, and n is the number of observations.
- 4:
Optimize using numerical methods such as Newton–Raphson or quasi-Newton methods [ 23].
|
Algorithm 4 Step 4: Estimation for Skew-Normal and Zero-Inflated Skew-Normal Errors |
- 1:
Input: SARIMAX model with skew-normal or zero-inflated skew-normal error assumption. Note: This estimation procedure is not based on a Gaussian assumption. Instead, it uses the skew-normal and zero-inflated skew-normal densities to define the likelihood function, enabling the model to capture asymmetry and excess zeros in the residuals. Let be the observed response series, and the model residuals at each time t. The parameter n denotes the number of observations used in the likelihood computation. - 2:
Output: Estimated parameters , , , and (if applicable) p. - 3:
Case 1: Skew-Normal Errors - 4:
Case 2: Zero-Inflated Skew-Normal Errors - 5:
Return the estimated parameters , , , and (if applicable) p.
|
Note on implementation: As traditional SARIMAX software does not support skew-normal or zero-inflated skew-normal errors natively, both estimation procedures were implemented by the authors. For Case 1 (skew-normal errors), the log-likelihood was maximized using the L-BFGS-B optimization algorithm via
scipy.optimize.minimize, ensuring parameter constraints. For Case 2 (zero-inflated skew-normal), we implemented a custom expectation–maximization (EM) algorithm to estimate both the skew-normal parameters and the zero-inflation probability
p. In the E-step, the posterior probability that a residual originates from the zero-inflated component is computed. In the M-step,
p and the continuous distribution parameters are updated by maximizing the expected complete-data log-likelihood. This iterative scheme follows the structure of Moon [
24] and yields stable and interpretable estimates under complex error structures. Model identifiability is ensured by setting appropriate constraints on parameter spaces (e.g.,
,
,
) and initializing optimization routines based on empirical moments. Additionally, model complexity is evaluated via penalized likelihood criteria such as AIC and BIC to prevent overparameterization.
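Case 1 of this note can be sketched as follows. This is not the authors' code: the function name `fit_skew_normal`, the moment-based starting values, and the lower bound on the scale are assumptions, and the residuals are simulated for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import skewnorm


def fit_skew_normal(resid):
    """Maximum likelihood fit of SN(xi, omega, alpha) to residuals via L-BFGS-B."""
    resid = np.asarray(resid, dtype=float)

    def nll(theta):
        xi, omega, alpha = theta
        return -skewnorm.logpdf(resid, alpha, loc=xi, scale=omega).sum()

    # Start from empirical moments; the initial alpha = 1 is an arbitrary choice
    x0 = np.array([resid.mean(), resid.std(ddof=1), 1.0])
    # Constrain only the scale parameter to be positive
    bounds = [(None, None), (1e-8, None), (None, None)]
    res = minimize(nll, x0, method="L-BFGS-B", bounds=bounds)
    return res.x, -res.fun


# Simulated positively skewed residuals (illustrative parameters)
rng = np.random.default_rng(1)
resid = skewnorm.rvs(4.0, loc=0.0, scale=1.0, size=2000, random_state=rng)
(xi_hat, omega_hat, alpha_hat), loglik = fit_skew_normal(resid)
print(xi_hat, omega_hat, alpha_hat, loglik)
```

The returned log-likelihood can be passed directly to AIC and BIC calculations when comparing candidate models.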
Algorithm 5 Step 5: Goodness-of-Fit Evaluation |
- 1:
Input: Fitted model parameters and residuals. - 2:
Output: Model diagnostics and comparison. - 3:
Compute model selection criteria, namely the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [ 25, 26]:
$$\mathrm{AIC} = 2k - 2\ln L, \qquad \mathrm{BIC} = k\ln n - 2\ln L,$$ where L is the maximized likelihood, k is the number of parameters, and n is the number of observations. Note: The log-likelihood $\ln L$ used in these formulas is the one derived from the skew-normal or zero-inflated skew-normal models defined in Algorithm 4, depending on the case. - 4:
Perform residual diagnostics: Test for autocorrelation using the Ljung–Box test [ 27]. Assess residual skewness and normality with QQ plots [ 28] and the Jarque–Bera test [ 29]. Check for heteroskedasticity using the Breusch–Pagan test [ 30].
|
Both AIC and BIC were used to compare alternative SARIMAX model orders. In cases of discrepancy between criteria, BIC was prioritized due to its stronger penalization of model complexity. The final selected model reflects the order minimizing BIC.
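For a custom likelihood, both criteria follow directly from the maximized log-likelihood. The sketch below uses the standard definitions; the function name `aic_bic` and the numeric inputs are illustrative assumptions.

```python
import math


def aic_bic(loglik, k, n):
    """Return (AIC, BIC) from a maximized log-likelihood.

    loglik: maximized log-likelihood, k: number of parameters,
    n: number of observations.
    """
    aic = 2 * k - 2 * loglik
    bic = k * math.log(n) - 2 * loglik
    return aic, bic


# Illustrative values only
aic, bic = aic_bic(loglik=-120.5, k=5, n=160)
print(aic, bic)
```

Because $\ln n > 2$ once $n \geq 8$, BIC penalizes each additional parameter more heavily than AIC for any realistic series length, which is why it tends to select the more parsimonious order.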
Algorithm 6 Step 6: Validation and Forecasting |
- 1:
Input: Fitted SARIMAX model. - 2:
Output: Validated model and forecasts. - 3:
Validate the model:
White noise characteristics expected from the residuals include: (i) no significant autocorrelation at multiple lags (as indicated by the Ljung–Box test), (ii) homoskedasticity (constant variance), and (iii) a mean approximately equal to zero. These properties confirm that the fitted model has adequately captured the temporal dynamics of the series. - 4:
Generate forecasts: Use the recursive SARIMAX structure for forecasting:
For zero-inflated models, combine probabilities and predictions:
- 5:
Evaluate forecasting performance: Use root mean square error (RMSE) and mean absolute error (MAE) to quantify accuracy [ 31].
|
3.2. Implementation of Algorithms in Python
We describe how algorithms of
Section 3.1 can be implemented in Python using relevant libraries.
Step 1: Data Preparation To prepare the data, use Python libraries such as pandas for handling time series and statsmodels for statistical tests like the Augmented Dickey–Fuller (ADF) test.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Load the time series data
data = pd.read_csv("time_series.csv")
y = data["target_variable"]

# Check stationarity with the ADF test (null hypothesis: unit root)
adf_result = adfuller(y)
print(f"ADF Statistic: {adf_result[0]}")
print(f"p-value: {adf_result[1]}")

# Apply first-order differencing if the unit-root null is not rejected
y_diff = y.diff().dropna()
Step 2: Model Specification Use the statsmodels library to analyze ACF and PACF plots and specify the SARIMAX model structure.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot ACF and PACF of the differenced series
plot_acf(y_diff, lags=20)
plot_pacf(y_diff, lags=20)
plt.show()

# Specify SARIMAX orders (placeholder values; replace with appropriate ones)
p, d, q = 1, 1, 1        # non-seasonal orders, e.g., read off the ACF/PACF
P, D, Q, s = 0, 0, 0, 0  # seasonal orders and period; all zero means no seasonal term
sarimax_order = (p, d, q)
seasonal_order = (P, D, Q, s)
Step 3: Model Estimation Fit the SARIMAX model using maximum likelihood estimation. For skew-normal errors, you can use custom likelihood functions or extend libraries like statsmodels.
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit the SARIMAX model by (Gaussian) maximum likelihood
model = SARIMAX(
    y,
    order=sarimax_order,
    seasonal_order=seasonal_order,
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit()
print(results.summary())
Step 4: Incorporating Skew-Normal or Zero-Inflated Skew-Normal Errors For skew-normal and zero-inflated models, define custom likelihood functions using libraries such as scipy and integrate them into the SARIMAX framework.
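from scipy.stats import skewnorm

# Skew-normal density of the residuals.
# skewnorm.pdf already includes the factor 2 of the skew-normal PDF,
# so location and scale are passed directly via loc and scale.
def skew_normal_likelihood(epsilon, xi, omega, alpha):
    return skewnorm.pdf(epsilon, alpha, loc=xi, scale=omega)

# Implement zero-inflated logic if needed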
Step 5: Goodness-of-Fit Evaluation Evaluate the model using criteria like AIC, BIC, and diagnostic tests such as the Ljung–Box test.
# AIC and BIC
print(f"AIC: {results.aic}")
print(f"BIC: {results.bic}")

# Residual diagnostics
residuals = results.resid
plot_acf(residuals, lags=20)
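The Ljung–Box test mentioned in this step can be sketched with NumPy and SciPy alone, as below. The function name `ljung_box` and the simulated inputs are illustrative assumptions; statsmodels also provides `acorr_ljungbox` in `statsmodels.stats.diagnostic` for the same purpose.

```python
import numpy as np
from scipy.stats import chi2


def ljung_box(resid, lags=20, n_params=0):
    """Ljung-Box Q statistic and p-value for autocorrelation up to `lags`.

    n_params: number of fitted ARMA parameters subtracted from the
    degrees of freedom (0 gives the unadjusted test).
    """
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    n = x.size
    denom = np.dot(x, x)
    ks = np.arange(1, lags + 1)
    # Sample autocorrelations at lags 1..lags
    rho = np.array([np.dot(x[k:], x[:-k]) / denom for k in ks])
    q_stat = n * (n + 2) * np.sum(rho**2 / (n - ks))
    p_value = chi2.sf(q_stat, df=lags - n_params)
    return q_stat, p_value


# Illustration: white noise versus a strongly autocorrelated AR(1) series
rng = np.random.default_rng(2)
noise = rng.normal(size=500)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]
q_noise, p_noise = ljung_box(noise)
q_ar1, p_ar1 = ljung_box(ar1)
print(q_noise, p_noise)
print(q_ar1, p_ar1)
```

A large p-value is consistent with residuals that behave as white noise, while a small one signals remaining autocorrelation.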
Step 6: Validation and Forecasting Validate the model and generate forecasts. Use metrics like RMSE and MAE to evaluate forecast accuracy.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Forecasting
forecast = results.get_forecast(steps=10)
forecast_mean = forecast.predicted_mean

# Calculate RMSE and MAE.
# y_true must hold the 10 held-out observations for the forecast horizon
# (it is not defined above and has to be supplied by the user).
rmse = np.sqrt(mean_squared_error(y_true, forecast_mean))
mae = mean_absolute_error(y_true, forecast_mean)
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")
The computational implementation was carried out in Python 3.9, leveraging libraries such as pandas, statsmodels, and scipy. The base SARIMAX structure was initialized using the statsmodels.tsa.statespace.sarimax module. To accommodate skew-normal and zero-inflated skew-normal error structures, custom extensions were developed by the authors. These included manual implementation of the log-likelihood functions and optimization using the L-BFGS-B algorithm via scipy.optimize.minimize. For zero-inflated skew-normal models, an expectation–maximization (EM) routine was implemented following the methodology of Moon [24]. All simulation and visualization tasks were conducted using NumPy, SciPy, and Matplotlib.
3.3. Simulation Studies
The simulation studies were designed to evaluate the performance of SARIMAX models under various error structures, including zero-inflated, positively skewed, and negatively skewed time series.
First, we evaluate the impact of variations in two critical parameters of the SARIMAX model with skew-normal and zero-inflated errors:
The analysis examines the effect of incremental adjustments to these parameters on model fit (using Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)) and forecasting accuracy (using mean absolute error (MAE) and root mean square error (RMSE)).
Experimental Setup The experiment involves the following steps:
Simulation: Generate synthetic time series data with specified values of skewness ($\alpha$) and zero-inflation probability (p).
Model Fitting: Fit a SARIMAX model to the simulated data using maximum likelihood estimation.
Forecast Evaluation: Generate forecasts and compute metrics (MAE, RMSE) to assess forecasting accuracy.
Model Fit Assessment: Evaluate model goodness-of-fit using AIC and BIC.
Second, scenarios were constructed to mimic realistic data characteristics and test the model’s adaptability and precision. The values of the skewness parameter (
) and the zero-inflation probability (
p) were selected to reflect conditions commonly encountered in applied contexts. Moderate to high skewness values (
) allow the model to capture asymmetric distributions, which are frequently observed in biological and industrial data [
13]. Similarly, zero inflation levels (
) were chosen based on studies in which excess zeros are present in time series, particularly in count data models and intermittent processes [
7,
32]. These values ensure that the simulation scenarios are both theoretically meaningful and empirically relevant. While not exhaustive, these combinations provide a structured and interpretable grid for evaluating model behavior under different distributional challenges. Future work could explore finer or adaptive grids to expand the scope of this analysis.
Table 1 summarizes the parameters used for each simulated series.
The first scenario, zero-inflated skew-normal, incorporated a 30% probability of zeros in the error term, alongside a positively skewed distribution. The positive skew-normal scenario was constructed to exhibit significant positive asymmetry, while the negative skew-normal scenario reflected a strong left-skewed distribution. All scenarios used consistent SARIMAX configurations, including seasonal components, to ensure comparability.
Each simulated series was constructed by combining the specified AR, MA, and seasonal components, along with error terms reflecting the chosen distribution (e.g., skew-normal or zero-inflated skew-normal). The parameters for skewness ($\alpha$), location ($\xi$), scale ($\omega$), and zero-inflation probability (p) were carefully tuned to replicate practical scenarios encountered in time series analysis.
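A simplified version of such a simulation is sketched below for an ARMA(1,1) process with zero-inflated skew-normal errors. Seasonal and exogenous components are omitted for brevity, and the function name `simulate_zisn_arma`, the burn-in length, and all parameter values are assumptions rather than the settings of Table 1.

```python
import numpy as np
from scipy.stats import skewnorm


def simulate_zisn_arma(n, phi, theta, xi, omega, alpha, p_zero, seed=0, burn=100):
    """Simulate an ARMA(1,1) series driven by zero-inflated skew-normal errors."""
    rng = np.random.default_rng(seed)
    total = n + burn
    eps = skewnorm.rvs(alpha, loc=xi, scale=omega, size=total, random_state=rng)
    # Set a random share p_zero of the errors to exactly zero
    eps[rng.random(total) < p_zero] = 0.0
    y = np.zeros(total)
    for t in range(1, total):
        y[t] = phi * y[t - 1] + eps[t] + theta * eps[t - 1]
    # Discard the burn-in so the start-up value has little influence
    return y[burn:], eps[burn:]


y_sim, eps_sim = simulate_zisn_arma(
    n=2000, phi=0.6, theta=0.3, xi=0.0, omega=1.0, alpha=4.0, p_zero=0.3
)
print(y_sim[:5])
print("share of zero errors:", np.mean(eps_sim == 0.0))
```

Setting `p_zero=0` gives a pure skew-normal scenario, and the sign of `alpha` switches between positive and negative skewness.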
The fitted models were evaluated on their ability to accurately capture the underlying patterns and produce reliable forecasts.
Forecasts were generated for each simulation scenario, including 95% intervals to quantify prediction uncertainty. Forecasting performance was evaluated by computing RMSE and MAE between the model predictions and the true values of the simulated series, which were generated under known skew-normal or zero-inflated skew-normal processes. This allows for a direct and meaningful assessment of predictive accuracy in a controlled setting. We distinguish between two types of intervals constructed in this study. Confidence intervals for estimated parameters (e.g., the model coefficients, the skew-normal parameters, and p) are derived from the observed Fisher information matrix or the Hessian of the log-likelihood function. These intervals quantify the uncertainty around the fitted model components. Forecast intervals, on the other hand, reflect the uncertainty in future values of the series, and are constructed using simulation-based methods that incorporate both parameter estimates and error distribution assumptions (skew-normal or zero-inflated skew-normal).
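A simulation-based forecast interval can be sketched as follows for a simple AR(1) recursion with skew-normal errors. This illustration conditions on fixed parameter values and therefore ignores parameter uncertainty; the function name `simulated_forecast_interval` and all inputs are assumptions.

```python
import numpy as np
from scipy.stats import skewnorm


def simulated_forecast_interval(y_last, phi, xi, omega, alpha,
                                steps=10, n_paths=2000, level=0.95, seed=0):
    """Percentile forecast interval from simulated AR(1) paths with SN errors."""
    rng = np.random.default_rng(seed)
    eps = skewnorm.rvs(alpha, loc=xi, scale=omega,
                       size=(n_paths, steps), random_state=rng)
    paths = np.empty((n_paths, steps))
    prev = np.full(n_paths, float(y_last))
    for h in range(steps):
        # Propagate every simulated path one step ahead
        prev = phi * prev + eps[:, h]
        paths[:, h] = prev
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(paths, [tail, 1.0 - tail], axis=0)
    return lower, upper


lower, upper = simulated_forecast_interval(
    y_last=1.0, phi=0.8, xi=0.0, omega=1.0, alpha=3.0
)
print(lower)
print(upper)
```

Because the bounds are empirical quantiles of simulated paths, the interval inherits the asymmetry of the error distribution instead of being forced to be symmetric around the point forecast.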
4. Results
This section presents the results of the simulation study described in the methodology.
First, Figure 1 and Figure 2 illustrate the relationship between the skewness parameter ($\alpha$) and the metrics MAE and RMSE for different zero-inflation probabilities (p).
To examine these results further, a two-way ANOVA was conducted to assess the effects of the skewness parameter ($\alpha$) and the zero-inflation probability (p) on forecast accuracy, measured by RMSE and MAE. The variance explained corresponds to the variability in predictive performance attributable to changes in the distributional parameters. Reported p-values indicate whether the differences in performance across levels of $\alpha$ and p are statistically significant.
Table 2 includes the mean differences, F-statistic values, and p-values for the AIC, BIC, MAE, and RMSE metrics by skewness ($\alpha$), zero inflation (p), and their interaction ($\alpha \times p$).
The ANOVA results reveal the following:
Skewness ($\alpha$): The skewness parameter has a dominant impact on all metrics, particularly on MAE and RMSE, with highly significant p-values (<0.001). This indicates that the asymmetry in the error distribution significantly affects both the fit of the model and the accuracy of the forecast.
Zero Inflation (p): The zero-inflation probability has a moderate effect that is statistically significant in most cases (p-values below 0.05). The impact is more pronounced when combined with skewness ($\alpha$).
Interaction ($\alpha \times p$): The interaction between skewness and zero inflation is significant across all metrics, suggesting that the combined effects of these parameters are not independent and should be considered jointly for optimal model tuning.
To better illustrate the joint influence of the skewness parameter ($\alpha$) and the zero-inflation probability (p), we include heatmaps showing the simulated RMSE and AIC values across different combinations (Figure 3 and Figure 4). These visualizations provide an intuitive overview of how model performance is affected by both asymmetry and excess zeros.
To further examine when standard ARIMA models may fail to capture non-Gaussian dynamics, we conducted a residual normality analysis across combinations of skewness ($\alpha$) and zero-inflation probability (p) in simulated series. Figure 5 presents a heatmap of p-values from the Shapiro–Wilk test applied to the residuals of ARIMA(1,1,1) fits. The results reveal that as both $\alpha$ and p increase, the likelihood of rejecting the normality assumption rises, confirming the value of adopting skew-normal or zero-inflated error models in such settings.
These results emphasize the importance of accounting for both skewness and zero inflation when designing and calibrating SARIMAX models. Skewness is the primary driver of variation, particularly for the forecasting metrics MAE and RMSE, while zero inflation has a moderate effect that becomes important through its significant interaction with skewness. Understanding these effects is critical for tuning SARIMAX models to improve both fit and forecast accuracy under varying data conditions.
Second, the results include observed series, predicted values and performance metrics for each scenario. The shaded region represents the 95% confidence interval.
Figure 6 depicts the observed series and predicted values for the zero-inflated skew-normal scenario.
Figure 7 depicts the observed series and predicted values for the positive skew-normal scenario.
Figure 8 depicts the observed series and predicted values for the negative skew-normal scenario.
Table 3 reports performance metrics for the zero-inflated skew-normal, positive skew-normal, and negative skew-normal scenarios.
The simulation study highlights the ability of SARIMAX models to handle varying levels of skewness and intermittency. Forecast performance, as measured by RMSE and MAE against the true values of the simulated series, demonstrates robust adaptability across different error distributions.
4.1. Illustrative Case Study
Modeling Stationary-Phase Growth of E. coli with Skew-Normal Errors.
This case study illustrates the application of SARIMAX models with skew-normal error terms to real experimental data from a microbial growth study. The dataset corresponds to a controlled experiment examining the growth dynamics of Escherichia coli (E. coli) under varying pH conditions, with constant temperature regulation at 37.4 °C. The dependent variable is optical density measured at 600 nm (OD600), which serves as a proxy for microbial biomass.
4.1.1. Data Description
The data used corresponds to the stationary-phase and early decline phase of
E. coli growth, which is typically characterized by a slowing or reduction in biomass due to nutrient limitation or accumulation of waste products. This subset was extracted from a larger time series dataset recorded in six-minute intervals over several hours. The observed series consists of 200 points, all strictly positive, and exhibits a mild decreasing trend with small fluctuations, see
Table 4.
Table 4 shows a representative subset of the absorbance measurements used in the model, highlighting the gradual decline in optical density typical of the stationary phase.
4.1.2. Modeling Procedure
In our empirical evaluation, we used a forecast horizon of 40 time steps, corresponding to the last 20% of the series. These observations were held out for validation, while the initial 160 points (80%) were used for model training. Forecast accuracy was evaluated over this horizon using RMSE and MAE. No rolling or expanding windows were used; the goal was to assess accuracy over a realistic forecasting horizon using a consistent training set. Two models were fitted to the training portion of the dataset (80%):
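The fixed 80/20 holdout evaluation described here can be sketched with plain NumPy. The helper names and the synthetic 200-point series are illustrative assumptions; the experimental OD600 data are not reproduced.

```python
import numpy as np


def holdout_split(y, train_frac=0.8):
    """Split a series into an initial training block and a final test block."""
    y = np.asarray(y, dtype=float)
    n_train = int(round(train_frac * len(y)))
    return y[:n_train], y[n_train:]


def rmse(y_true, y_pred):
    """Root mean square error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err**2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


# Synthetic stand-in for a 200-point, mildly decreasing series
y_demo = np.linspace(1.0, 0.9, 200)
train, test = holdout_split(y_demo)

# Naive benchmark: carry the last training value forward over the horizon
naive_forecast = np.full(len(test), train[-1])
print(len(train), len(test))
print(rmse(test, naive_forecast), mae(test, naive_forecast))
```

Any fitted model's 40-step forecast can be substituted for `naive_forecast`; the naive benchmark is included only to make the example self-contained.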
Following Algorithm 3 and Algorithm 4 of our proposed methodology, the SARIMAX model was estimated. The residuals exhibited significant positive skewness, validating the incorporation of a skew-normal distribution to improve the error structure modeling.
4.1.3. Results
Table 5 summarizes the forecasting performance and error structure diagnostics of the two models. Notably, the skewness parameter
of the skew-normal distribution was estimated at
, indicating a moderate positive asymmetry in the residuals. This validates the hypothesis that normality may not hold in real-world microbial data, especially in non-exponential phases.
4.1.4. Estimated Parameters and Confidence Intervals
The ARIMA and SARIMAX models share the same autoregressive and moving average parameters, estimated as follows:
AR(1): with 95% CI [0.902, 0.993]
MA(1): with 95% CI [0.0017, 0.326]
Error variance : with 95% CI
The skew-normal distribution fitted to the SARIMAX residuals yielded:
Location parameter with 95% CI [−0.00164, −0.00134]
Scale parameter with 95% CI [0.00134, 0.00193]
Skewness parameter with 95% CI [12.07, 15.99]
4.1.5. Visualization
Figure 9 displays the histogram of residuals from the SARIMAX model, overlaid with the probability density function (PDF) of the fitted skew-normal distribution. The visual alignment supports the appropriateness of modeling the residuals with skew-normal errors.
Figure 10 shows the ACF and PACF of the residuals from the SARIMAX(1,1,1) model with skew-normal errors. The lack of significant autocorrelation indicates that the model adequately accounts for temporal dependencies, leaving residuals that behave approximately as white noise. This complements the distributional diagnostics and supports the suitability of the proposed error structure. The Ljung–Box test on the residuals yields a p-value of 0.278, indicating no significant autocorrelation up to lag 20 and supporting the adequacy of the SARIMAX model in capturing the temporal structure of the data. As expected, the Shapiro–Wilk and Jarque–Bera tests both reject normality, which is consistent with the skew-normal distributional assumption.
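The residual diagnostics described above can be sketched as follows, assuming NumPy and SciPy. The Ljung–Box statistic is computed directly from its definition in a hypothetical helper, `ljung_box`, without any degrees-of-freedom adjustment for fitted ARMA parameters; the residuals are simulated and are not the study's residuals.

```python
import numpy as np
from scipy import stats

def ljung_box(x, lags=20):
    """Ljung-Box Q statistic and p-value for autocorrelation up to `lags`."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    ks = np.arange(1, lags + 1)
    # Sample autocorrelations r_1, ..., r_lags
    acf = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in ks])
    q = n * (n + 2) * np.sum(acf ** 2 / (n - ks))
    # Under the null of no autocorrelation, Q is approximately chi-square
    p_value = stats.chi2.sf(q, df=lags)
    return float(q), float(p_value)

# Simulated i.i.d. skew-normal residuals (illustrative values only).
rng = np.random.default_rng(7)
residuals = stats.skewnorm.rvs(a=10.0, loc=0.0, scale=1.0, size=500,
                               random_state=rng)

lb_stat, lb_p = ljung_box(residuals, lags=20)
sw_stat, sw_p = stats.shapiro(residuals)       # Shapiro-Wilk normality test
jb_stat, jb_p = stats.jarque_bera(residuals)   # Jarque-Bera normality test

print("Ljung-Box p-value:", lb_p)
print("Shapiro-Wilk p-value:", sw_p)
print("Jarque-Bera p-value:", jb_p)
```

For skewed but serially uncorrelated residuals, the two normality tests are expected to reject, while the Ljung–Box test typically does not; this is the pattern reported for the case study.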
5. Discussion
The results of the modeling process using SARIMAX with skew-normal or zero-inflated skew-normal errors reveal distinct performance characteristics for each simulated dataset. In the following, we analyze the key findings and implications:
Zero-Inflated Skew-Normal
The zero-inflated SARIMAX model effectively handled the combination of zero inflation and asymmetric skewness. This was evident in the well-aligned fitted values and predictions across the observed data. The forecast confidence intervals (95%) demonstrated a robust model fit, encapsulating the majority of future values. Performance metrics such as RMSE (0.56) and MAE (0.45) indicate moderate error levels, which are acceptable given the complex nature of the series. The logistic regression component successfully captured the zero-inflation probability, allowing the SARIMAX model to focus on modeling the continuous non-zero values.
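The two-part logic described above (a zero-inflation probability plus a model for the continuous non-zero values) can be illustrated on simulated data. The sketch below is a simplification under stated assumptions: it uses an intercept-only estimate of the zero probability (the sample proportion of zeros) in place of the logistic regression component, ignores temporal dependence, and uses illustrative parameter values that are not taken from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p_zero = 1000, 0.3  # illustrative sample size and zero-inflation probability

# Part 1: structural zeros occur with probability p_zero.
is_zero = rng.random(n) < p_zero

# Part 2: remaining observations follow a skew-normal distribution.
continuous = stats.skewnorm.rvs(a=4.0, loc=1.0, scale=0.5, size=n,
                                random_state=rng)
y = np.where(is_zero, 0.0, continuous)

# Intercept-only estimate of the zero probability (assumption: no covariates).
p_hat = float(np.mean(y == 0.0))

# Fit the skew-normal component to the non-zero observations only.
nonzero = y[y != 0.0]
a_hat, loc_hat, scale_hat = stats.skewnorm.fit(nonzero)

print("estimated zero probability:", p_hat)
print("shape:", a_hat, "location:", loc_hat, "scale:", scale_hat)
```

Separating the zeros first means the skewness estimate is not distorted by the point mass at zero, which is the motivation for handling the two components separately.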
Positive Skew-Normal
For the positively skewed series, the log-transformed SARIMAX model provided significant improvements in stability and predictive accuracy. The back-transformed fitted values aligned closely with the observed data, although slight over- or under-estimations were observed near extreme skewed values. Metrics such as RMSE (0.78) and MAE (0.65) suggest that while the model performed adequately, further refinements may be required to enhance accuracy. For instance, incorporating additional seasonal components or higher-order terms could improve the fit for highly skewed series.
Negative Skew-Normal
The SARIMAX model performed exceptionally well for the negatively skewed dataset, yielding the lowest error metrics (RMSE: 0.49, MAE: 0.40). This suggests that the model’s assumptions and components were well-suited for handling negative asymmetry. The forecasts closely followed the trend and variability of the observed series, with confidence intervals reflecting low uncertainty. The success in this scenario highlights the model’s inherent flexibility in capturing asymmetric structures when the skewness direction aligns with its assumptions.
Comparative Insights
The comparative analysis between these cases shows that zero-inflated SARIMAX is robust for datasets with a mix of zeros and continuous skewed values. Log-transformation enhances stability for highly skewed datasets but may introduce slight inaccuracies during back-transformation. Standard SARIMAX excels in scenarios with moderate asymmetry and no inflation, particularly for negative skewness.
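The back-transformation inaccuracy mentioned above has a simple source: exponentiating a forecast made on the log scale recovers the median rather than the mean of the original series when the log-scale errors are symmetric. The following sketch illustrates this with simulated log-normal data. The parameter values are illustrative assumptions, and the correction factor exp(sigma^2 / 2) holds only under Gaussian errors on the log scale, which is not the error model proposed in this study.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 0.5  # illustrative log-scale mean and standard deviation

log_y = rng.normal(mu, sigma, size=100_000)
y = np.exp(log_y)

# Naive back-transformation: exponentiate the log-scale mean.
naive = float(np.exp(log_y.mean()))

# Bias-corrected back-transformation under Gaussian log-scale errors.
corrected = float(np.exp(log_y.mean() + 0.5 * log_y.var()))

actual_mean = float(y.mean())

print("naive back-transform:", naive)
print("corrected back-transform:", corrected)
print("sample mean on original scale:", actual_mean)
```

The naive value underestimates the mean on the original scale, and the gap grows with the log-scale variance.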
The results align with prior findings on the utility of skew-normal distributions for capturing asymmetry in residuals [13]. However, this study extends the application by incorporating zero-inflation dynamics, which have been less explored in SARIMAX models. Compared to conventional ARIMA models, which assume Gaussian residuals, the proposed approach demonstrates superior flexibility and predictive accuracy for datasets exhibiting non-Gaussian features. This contrasts with the limited adaptability of traditional methods highlighted in recent reviews of time series modeling.
The sensitivity analysis conducted in this study highlights the critical role of the skewness parameter and the zero-inflation probability (p) in determining the performance of SARIMAX models. The results indicate that skewness exerts a dominant influence on metrics like MAE and RMSE, particularly in datasets with pronounced asymmetry. Zero inflation affects forecasting accuracy to a lesser extent but interacts significantly with skewness, suggesting a non-linear relationship between these parameters. For example, scenarios with high skewness and moderate zero inflation showed amplified model sensitivity, indicating the need for careful parameter tuning. These findings emphasize the importance of considering both skewness and zero inflation when designing SARIMAX models for real-world applications.
Illustrative case study: Advantages of the SARIMAX Model with Skew-Normal Errors
Despite showing slightly higher RMSE and MAE values compared to the classical ARIMA model, the SARIMAX model with skew-normal errors provides a more realistic representation of the underlying error structure by accounting for asymmetry. This model does not assume that residuals are normally distributed and therefore offers improved interpretability and robustness under real-world conditions. The apparent lack of forecasting precision is not a drawback, but rather a reflection of the model’s sensitivity to skewed errors, which classical models tend to ignore. The application of SARIMAX models with skew-normal and zero-inflated skew-normal errors holds significant promise for advancing applied statistics within the management sciences, particularly in the biotech industry. These models offer a robust framework for analyzing complex, non-Gaussian data typical in biotechnological processes, where understanding and predicting biological behaviors are crucial for operational and strategic decisions. The ability of these models to accurately capture the inherent asymmetries and variabilities in biological data can lead to more precise forecasting and optimization of biotechnological processes, ultimately enhancing production efficiencies and innovation in product development [33].
Although the proposed SARIMAX model with skew-normal errors exhibits slightly higher RMSE compared to the classical ARIMA model, this trade-off is justified by a more realistic representation of the residual structure. The incorporation of skewness enhances the model’s interpretability, particularly in datasets with asymmetry, which classical models fail to capture. Furthermore, the improvement in AIC indicates a better balance between model fit and complexity, underscoring the practical relevance of using skewed error structures in applied forecasting scenarios.
Implications for Decision Makers
For practitioners and decision-makers, the findings offer several actionable insights:
Enhanced Forecasting Accuracy: The proposed models provide reliable forecasts even under challenging conditions such as zero inflation or extreme skewness, making them applicable to fields like finance, retail, and environmental monitoring.
Tailored Model Selection: Depending on the nature of the data (e.g., zero-inflated, positively skewed), decision-makers can select and customize the appropriate model variant to optimize predictive performance.
Risk Management: By accurately capturing uncertainty through confidence intervals, the models enable better risk assessment and decision-making in volatile or uncertain environments.
Limitations
While the models performed well across the different scenarios, some limitations remain:
The zero-inflated SARIMAX model relies heavily on the correct specification of the logistic regression component, which may not generalize well to more complex zero-inflated patterns.
The log-transformed SARIMAX struggles with extreme skewness, suggesting that alternative transformations or error distributions (e.g., skew-t) could be explored.
Model evaluation relied mainly on simulated data and a single illustrative case study; application to further real-world datasets may reveal additional challenges.
While the proposed models improve the representation of residual structures and model fitting, several limitations must be acknowledged. First, identifiability is ensured through constrained parameter spaces and appropriate initialization strategies, but complex models may still risk convergence to local optima. Second, the incorporation of skew-normal and zero-inflated components increases computational complexity, particularly during maximum likelihood or EM-based estimation. Finally, parameter sensitivity, especially for the skewness parameter and the zero-inflation probability (p), may arise in small samples or when exogenous variables exhibit multicollinearity. These considerations are important when applying the model to large-scale or noisy datasets.
Future work: Beyond the skew-normal and zero-inflated skew-normal formulations explored in this study, alternative error structures, such as the skew-t distribution, offer greater flexibility by accounting for heavy tails in addition to asymmetry. Likewise, COM-Poisson time series models provide a useful framework for count data exhibiting both overdispersion and zero inflation. Although a comparative analysis with these models is not included here, future work could investigate the relative performance of these approaches in similar settings. Further extensions could include incorporating dynamic skewness parameters, testing alternative residual distributions (e.g., skew-t, beta), and applying the methodology to real-world time series datasets, such as financial or environmental data, to assess robustness and scalability.
An interesting direction for future work involves extending the proposed framework to multivariate or multiple-series time series models. For example, incorporating skew-normal or zero-inflated error structures into vector autoregressive (VAR) models or dynamic factor models could allow the joint modeling of multiple correlated time series that exhibit asymmetry or excess zeros. Such approaches could be particularly relevant in applications involving panel data, environmental monitoring, or interconnected biological processes.
6. Conclusions
This study demonstrates the flexibility of SARIMAX models enhanced with skew-normal and zero-inflated error distributions to handle both asymmetry and intermittency. The simulation results highlight the capacity of these models to produce reliable forecasts under diverse error structures. Future work could explore applications in fields such as inventory management, where intermittency and skewness are common.