Short-Term Hourly Ozone Concentration Forecasting Using Functional Data Approach

Abstract: Air pollution, especially ground-level ozone, poses severe threats to human health and ecosystems. Accurate forecasting of ozone concentrations is essential for reducing its adverse effects. This study aims to use the functional time series approach to model ozone concentrations, a method less explored in the literature, and compare it with traditional time series and machine learning models. To this end, the ozone concentration hourly time series is first filtered for yearly seasonality using smoothing splines that lead us to the stochastic (residual) component. The stochastic component is modeled and forecast using a functional autoregressive model (FAR), where each daily ozone concentration profile is considered a single functional datum. For comparison purposes, different traditional and machine learning techniques, such as autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), neural network autoregressive (NNAR), random forest (RF), and support vector machine (SVM), are also used to model and forecast the stochastic component. Once the forecasts from the yearly seasonality component and stochastic component are obtained, both are added to obtain the final forecast. For empirical investigation, data consisting of hourly ozone measurements from Los Angeles from 2013 to 2017 are used, and one-day-ahead out-of-sample forecasts are obtained for a complete year. Based on the evaluation metrics, such as R², root mean squared error (RMSE), and mean absolute error (MAE), the forecasting results indicate that the FAR outperforms the competitors in most scenarios, with the SVM model performing the least favorably across all cases.


Introduction
Air pollution refers to the existence of detrimental materials such as gases, solids, or liquid particles in the atmosphere, leading to harmful consequences for human health, the environment, and ecosystems. These materials are termed pollutants and can have diverse origins, arising from sources like industrial operations, transportation, natural occurrences, and human activities. Typical pollutants include ozone (O₃), nitric oxide (NO), nitrogen dioxide (NO₂), sulfur dioxide (SO₂), carbon monoxide (CO), particulate matter 2.5 (PM2.5), and particulate matter 10 (PM10), among others. Air pollution, especially ground-level O₃, is a significant global concern affecting public health and the environment (Seinfeld and Pandis 2016).
In the United States of America (USA), the Environmental Protection Agency (EPA) estimates that mobile sources such as cars, buses, planes, trucks, and trains account for about half of the cancer risk and more than 70% of the non-cancer health effects associated with air toxics (Munshed et al. 2023). Stationary sources such as power plants, oil refineries, industrial facilities, and factories emit large amounts of pollutants from particular sites called point sources. Area sources are made up of several minor pollution contributors that combined have a significant impact on air quality. Area sources include agricultural areas, urban regions, and wood-burning fireplaces. Natural sources like wind-blown dust, wildfires, and volcanic activity can also impact air quality. Figure 1 displays the different sources described above that contribute significantly to air pollution. The collective impact of these sources on the environment poses a serious challenge for researchers and policymakers (Lotrecchiano et al. 2020; Mavroidis and Ilia 2012). Exposure to ground-level O₃ can have a range of adverse effects on human health. For example, O₃ exposure is associated with an elevated risk of chronic respiratory and cardiovascular ailments, as indicated by numerous studies (World Health Organization 2020). Even short-term exposure to O₃ can perturb lung mucociliary function, consequently weakening resistance to bacterial infections (Sujith et al. 2017). Alarming statistics reveal that approximately 16,400 premature deaths within the European Union (EU) are attributed to O₃ pollution (Nuvolone et al. 2018). Beyond human health, O₃ poses a substantial threat to vegetation and biodiversity. Its capacity to impair crops, forests, and other forms of vegetation manifests in the form of diminished photosynthesis, reduced carbon sequestration, and losses in biodiversity. Moreover, it induces visible leaf injuries (Paoletti and Manning 2007). This concern extends to various regions, including parts of southern Europe, where a multitude of grassland areas face the risk of high O₃ levels, impacting plant composition, seasonal flowering, and seed production for diverse natural species (Mills et al. 2016).
Ozone concentrations are important for air quality management and public health, and accurate prediction of these concentrations can aid in decision-making processes. Due to the importance of this topic, researchers in the past have proposed different methods and techniques to model and forecast O₃ concentrations. For example, Hong et al. (2023) presented a novel approach for predicting hourly O₃ levels using deep learning techniques, called the multi-order difference embedded long short-term memory (MDELSTM) method, using a Los Angeles air quality dataset. The performance of the proposed model is compared with partial least squares (PLS), gated recurrent unit (GRU), long short-term memory (LSTM), multilayer perceptron (MLP), and stationary difference embedded LSTM (SDELSTM). Based on the results, the MDELSTM model demonstrates superior prediction performance among the models considered. Machine learning techniques are popular for forecasting O₃ concentrations (Eslami et al. 2020; Hashim et al. 2022; Yafouz et al. 2022). For instance, Yafouz et al. (2022) conducted a study to predict the tropospheric O₃ concentration using different machine learning models. These models included linear regression (LR), support vector regression (SVR), tree regression (TR), Gaussian process regression (GPR), ensemble regression (ER), and artificial neural networks (ANNs). Using data from Peninsular Malaysia, the results indicated that the LR, SVR, GPR, and ANN models performed better in terms of having a high R².
Many researchers have investigated the performance of classical time series models for forecasting O₃ concentrations (Salazar et al. 2019). For example, Kumar et al. (2004) studied the autoregressive integrated moving average (ARIMA) model to forecast the daily surface O₃ concentration. Using a real dataset, the ARIMA(1,0,1) model demonstrated the best performance in predicting the maximum daily O₃ concentration, with a mean absolute percentage error (MAPE) of 13.14%. Aneiros-Pérez et al. (2004) forecast daily maximum O₃ concentrations in Toulouse, France, using nonlinear models based on kernel estimators. They also included exogenous variables in their analysis. The study found that the functional additive model (FAM), with its back-fitting kernel approach, produced the lowest quadratic errors and explained the highest percentage of variability in the data compared to other models. Due to the challenges posed by O₃ to human health, researchers are actively proposing different models to model and forecast O₃ concentration time series data (Arsić et al. 2020; Gao et al. 2018; Ghoneim et al. 2017; Rahman and Nasher 2024; Su et al. 2020).
This research contributes to the literature on O₃ forecasting by applying functional data analysis (FDA) methods, which can capture the dynamic and complex features of the O₃ concentration as a function of time. FDA methods have been used in various fields such as biostatistics, econometrics, and environmental science, but are less explored in the context of O₃ forecasting (Jan et al. 2022; Shah et al. 2022). This study proposes a novel time series model based on FDA, which treats each day as a single functional observation observed at 24 discrete points. The performance of the FDA model is compared with classical time series and machine learning models using standard accuracy metrics. The results of this study provide insights into the advantages and limitations of different forecasting methods and suggest ways to improve the accuracy and reliability of O₃ forecasts. This study also has practical implications for policymakers, environmental managers, and public health officials, who can use the O₃ forecasts to implement timely and effective measures to mitigate the adverse effects of O₃ pollution.
The rest of the article is organized in the following manner. Section 2 provides details about the general modeling framework and the details of the different models used in this research work. Analysis and results based on the empirical investigation are provided in Section 3. Finally, Section 4 concludes the study.

Methodology
This section explores the predictive models used in this study to model and forecast the O₃ concentration data. To this end, our proposed FAR approach is outlined in more detail. Several parametric and non-parametric competing models, like ARIMA, VAR, SVM, NNAR, and RF, are also briefly described in this section. Before going into detail, we revisit some of the preliminaries of functional data analysis.

Functional Data Analysis
James O. Ramsay coined the term "functional data analysis" to describe this field of study (Ramsay 1982). FDA is at the forefront of modern statistical computing, ushering in transformative changes across diverse fields. The rapid advancement of technology allows for quicker and more accurate measurement equipment, facilitating the collection of continuous data spanning time, space, or other continua. This paradigm shift challenges classical statistical assumptions, particularly the conventional belief that the number of data points should surpass the number of variables in a dataset. In response to this evolution, FDA emerged as a field treating data as functional, where each curve is regarded as a single observation rather than a collection of discrete data points. The FDA method leverages information within and across sample units to incorporate derivatives, smoothness, and other inherent features of the functional data structure. The process of converting data to functional objects involves collecting information at discretized points, typically equally spaced, and implementing suitable basis functions to eliminate common noise while accurately representing each data curve.
Basis functions are pivotal in FDA, serving as the fundamental components for modeling intricate functional observations. A functional observation is expressed as a linear combination of coefficients C and known basis functions ϕ. The representation of a functional observation y(j) is given by

y(j) = ∑_{k=1}^{K} C_k ϕ_k(j), (1)

where C_k denotes the parameters, ϕ_k(j) represents the known basis functions, and K is the number of basis functions used. A more simplified representation using matrix notation can be written as y = C′ϕ, where C is the vector of coefficients and ϕ is the vector of basis functions. Several common basis functions, including the Fourier basis, B-spline basis, polynomial basis, principal components, and exponential basis, are used to cater to different data characteristics. The selection depends on the nature of the data; e.g., the Fourier basis is generally suitable for periodic data, and the B-spline basis is preferred for non-periodic data. In our research, we chose the B-spline basis system for its applicability to non-periodic data.
The B-spline basis functions are polynomial segments that are joined at certain knots or breakpoints. The compact support property, where each basis function is positive over a limited number of adjacent intervals, increases the computational efficiency. The order of a B-spline determines the degree of the polynomial segments, while the knot sequence governs their placement. The strategic placement of knots, whether equally spaced or tailored to data characteristics, plays a vital role in achieving accurate and meaningful representations. Coincident knots are employed strategically, offering flexibility to induce specific characteristics like derivative discontinuities.
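Converting the 24 discrete hourly values of a day into a smooth curve is the first step of the functional approach. The sketch below fits one daily profile with a cubic B-spline basis by least squares; the knot layout, the basis size of nine, and the toy data are illustrative assumptions, and scipy stands in for whatever FDA software was actually used.

```python
# Sketch: represent one daily ozone profile (24 hourly points) as a smooth
# B-spline curve y(j) = sum_k C_k * phi_k(j).  Knot count, spline degree,
# and the simulated data are illustrative choices, not from the paper.
import numpy as np
from scipy.interpolate import make_lsq_spline

hours = np.arange(24, dtype=float)                     # discretization points j
rng = np.random.default_rng(0)
y = 30 + 15 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, 24)

k = 3                                                  # cubic B-splines
interior = np.linspace(0, 23, 7)[1:-1]                 # 5 interior knots
t = np.r_[[0.0] * (k + 1), interior, [23.0] * (k + 1)] # coincident boundary knots

spl = make_lsq_spline(hours, y, t, k=k)                # least-squares coefficients C
y_smooth = spl(hours)                                  # reconstructed smooth curve
```

The number of basis functions is len(t) − k − 1 = 9 here; in practice it would be chosen by a criterion such as generalized cross-validation.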

Functional Autoregressive Model
The FAR model used in this study is an extension of the classical AR model to functional data. The FAR model of order 1, FAR(1), operates within the framework of a separable Hilbert space denoted by H. It specifically considers the Hilbert space L²[0, 1], although the idea is applicable to other L²-spaces. An autoregressive Hilbertian process of order 1, ARH(1), which is also called FAR(1), is identified in this context as a sequence Y_t of H-random variables. The process is strictly stationary and satisfies the AR equation:

Y_t(j) − µ(j) = ψ(Y_{t−1}(j) − µ(j)) + ϵ_t(j),

where µ(j) is the mean function, ψ is a bounded linear operator, and ϵ_t(j) is the shock or innovation term. The AR operator ψ is assumed to be a compact Hilbert–Schmidt, symmetric, and positive bounded linear operator from L²[0, 1] to itself. The compactness property permits a decomposition using orthonormal bases and real numbers, and the operator is said to be nuclear if certain eigenvalue conditions are met. Within the functional framework, we will drop the argument (j) for simplicity.
2.2.1. Operators in the Hilbert Space L²[0, 1]

Bounded linear operators on the Hilbert space H = L²[0, 1] are equipped with a norm obtained from the inner product of the space. This norm is the supremum of the operator's norm over all unit vectors y in H; it is represented by the symbol ∥ψ∥_L and written as

∥ψ∥_L = sup_{∥y∥≤1} ∥ψ(y)∥.

An operator ψ is termed compact if, with respect to orthonormal bases ν_j and f_j in H and a sequence λ_j approaching zero, it can be expressed as

ψ(y) = ∑_{j=1}^{∞} λ_j ⟨y, ν_j⟩ f_j.

If the sum of the squares of the operator's sequence λ_j is finite, the operator is Hilbert–Schmidt. The space of Hilbert–Schmidt operators S is separable and is equipped with an inner product ⟨ψ_1, ψ_2⟩_S = ∑_j ⟨ψ_1(e_j), ψ_2(e_j)⟩, for an orthonormal basis e_j, and the associated norm ∥ψ∥_S = (∑_j λ_j²)^{1/2}. Operators are classified as symmetric if ⟨ψ(x), y⟩ = ⟨x, ψ(y)⟩, and positive if ⟨ψ(y), y⟩ ≥ 0. A symmetric positive Hilbert–Schmidt operator admits the decomposition

ψ(y) = ∑_{j=1}^{∞} λ_j ⟨y, ν_j⟩ ν_j.

A compact operator is nuclear if the sum of the absolute values of its sequence λ_j is finite: ∑_{j=1}^{∞} |λ_j| < ∞. The model is non-parametric, as ψ represents an infinite-dimensional parameter.

Estimation of the Operator ψ
Estimating the AR operator ψ in the FAR model involves addressing specific assumptions for obtaining a stationary solution. Two assumptions are considered to ensure the existence of a stationary solution. The first assumption requires the existence of an integer j_0 ≥ 1 such that ∥ψ^{j_0}∥_L < 1, while the second necessitates the existence of a > 0 and 0 < b < 1 such that ∥ψ^j∥_L ≤ ab^j for all j ≥ 0. These assumptions, under certain conditions, guarantee a unique strictly stationary solution, as proven in Bosq (2000).
It is important to note that estimating ψ cannot rely on the likelihood, due to the non-existence of the Lebesgue measure in non-locally compact spaces; the concept of a density is not available for functional data. Instead, the classical method of moments is employed, which gives the representation ψ = CΓ⁻¹, where Γ = E(Y_t ⊗ Y_t) and C = E(Y_t ⊗ Y_{t+1}) are the covariance and cross-covariance operators of the process, and ⊗ denotes the tensor (Kronecker) product. The sample versions of these operators are denoted as Γ̂_n and Ĉ_n, respectively.
To simplify notation, it is assumed that the mean of the process E(Y_t) = 0 is known. The sample versions of the covariance and cross-covariance operators, denoted as Γ̂_n and Ĉ_n, are given by

Γ̂_n(y) = (1/n) ∑_{t=1}^{n} ⟨Y_t, y⟩ Y_t,   Ĉ_n(y) = (1/(n−1)) ∑_{t=1}^{n−1} ⟨Y_t, y⟩ Y_{t+1}.

The covariance operator Γ is a symmetric, positive definite, and compact operator. It can be decomposed into eigenfunctions and eigenvalues, denoted as ν_j and λ_j, respectively. However, Γ⁻¹ is not a bounded operator. To address this, a practical solution is proposed involving the consideration of the first p most important empirical functional principal components ν̂_1, …, ν̂_p as substitutes for the unknown population principal components. From the ARH(1) equation, taking inner products with Y_t and expectations and using E(ϵ) = 0, we obtain the relation C = ψΓ and hence ψ = CΓ⁻¹. The estimate of ψ is then given by

ψ̂_n(y) = (1/(n−1)) ∑_{k=1}^{n−1} ∑_{j=1}^{p} ∑_{i=1}^{p} λ̂_j⁻¹ ⟨y, ν̂_j⟩ ⟨Y_k, ν̂_j⟩ ⟨Y_{k+1}, ν̂_i⟩ ν̂_i.

The last term is derived by performing an additional smoothing step on Y_{k+1} using the ν̂_j. The empirical eigenfunctions are known to converge asymptotically to the population eigenfunctions.
Once the estimator ψ̂ of the population parameter ψ is obtained, it is crucial to assess its optimality in estimating the true parameter. For the FAR parameter ψ, Didericksen et al. (2012) demonstrated that the estimator is optimal in terms of MSE and MAE, as its prediction error is comparable to that of the infeasible predictor ψ(y) for an appropriately chosen p.
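As a rough illustration of the principal-component estimator above, the following sketch works with discretized curves: the 24-point hourly grid stands in for the continuum, the data are simulated, and p = 3 is an assumed truncation level, so this is a stand-in for the paper's actual implementation rather than a reproduction of it.

```python
# Minimal sketch of the FPCA-based FAR(1) estimator: project each centered
# daily curve on the first p empirical eigenfunctions, form the regularized
# psi_hat = C * Gamma^{-1} on that subspace, and predict the next curve.
import numpy as np

rng = np.random.default_rng(1)
n, H, p = 200, 24, 3                         # days, grid points, PCs (assumed)
Y = np.zeros((n, H))                         # toy functional AR(1) series
for t in range(1, n):
    Y[t] = 0.6 * Y[t - 1] + rng.normal(0, 1, H)

mu = Y.mean(axis=0)
Yc = Y - mu                                  # centre the curves
Gamma = Yc[:-1].T @ Yc[:-1] / (n - 1)        # discretized covariance operator
C = Yc[1:].T @ Yc[:-1] / (n - 1)             # lag-1 cross-covariance operator

eigval, eigvec = np.linalg.eigh(Gamma)
idx = np.argsort(eigval)[::-1][:p]           # first p principal components
V, lam = eigvec[:, idx], eigval[idx]

Psi = C @ V @ np.diag(1.0 / lam) @ V.T       # psi_hat restricted to span(V)
y_next = mu + Psi @ (Y[-1] - mu)             # one-day-ahead functional forecast
```

The inversion of Γ is regularized exactly as in the text: only the p leading empirical eigenpairs are inverted, which keeps ψ̂ bounded.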

Autoregressive Integrated Moving Average (ARIMA) Models
Time series data analysis is a vital tool for comprehending and forecasting temporal patterns. The autoregressive moving average (ARMA) model combines elements of autoregressive (AR) and moving average (MA) models, providing a framework for modeling univariate time series data. For a univariate time series Y_t, an ARMA(p, q) model can be written as

Y_t = C + ∑_{r=1}^{p} ϕ_r Y_{t−r} + ∑_{l=1}^{q} Φ_l Z_{t−l} + Z_t. (3)

Equation (3) contains an intercept term C, AR parameters ϕ_r (r = 1, 2, …, p), MA parameters Φ_l (l = 1, 2, …, q), and a white noise term Z_t ∼ N(0, σ²_z). The ARMA models are well-suited for stationary time series data. However, differencing is required to achieve stationarity for non-stationary data, and this is where the ARIMA model comes into play. The ARIMA model is an extension of ARMA specifically tailored for non-stationary time series (Shumway et al. 2000). It involves differencing the data to attain stationarity, which can be expressed as

Y^d_t = C + ∑_{r=1}^{p} ϕ_r Y^d_{t−r} + ∑_{l=1}^{q} Φ_l Z_{t−l} + Z_t,

where Y^d_t represents the dth difference of the series, ϕ_r (r = 1, 2, …, p) and Φ_l (l = 1, 2, …, q) are the parameters of the AR and MA components, respectively, and Z_t is the white noise term. Identifying the appropriate ARIMA model is a critical step, and it involves determining the order of differencing d and the numbers of AR (p) and MA (q) terms. The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots are valuable tools in this process. The ACF indicates the correlation of the series with itself at different lags, whereas the PACF reveals the autocorrelation at lag k with the effect of the intervening observations removed. For parameter estimation, generally, maximum likelihood estimation (MLE) is used (Shumway et al. 2000). This study investigated different models and found that the ARIMA(5,0,0), which is a pure autoregressive model with five lagged observations, fits the data well and provides white noise errors, thus being suitable for forecasting.

Vector Autoregressive
A vector autoregressive (VAR) model is a powerful and commonly used time series analysis method that enables us to capture the dynamic interactions between variables over time. It extends the concept of univariate AR models to a multivariate situation. Because of their flexibility and forecasting accuracy, VAR models are extremely useful for understanding and forecasting complicated real-world behavior. They can effectively express variable interdependence.
A VAR model is built on a set of equations that describe the evolution of various time series variables. Each variable in the model is treated as a linear function of its own lagged values as well as the lagged values of all other variables in the model. Suppose Y_t is a vector of univariate time series; then, a VAR model can be written as

Y_t = C + Φ_1 Y_{t−1} + Φ_2 Y_{t−2} + ⋯ + Φ_p Y_{t−p} + ϵ_t,

where Y_t is an n × 1 vector representing the current values of n distinct response time series variables at time t. The n × 1 vector C consists of constant offsets that serve as intercepts, accounting for the baseline level of the variables. The Φ_j matrices, ranging from j = 1 to p, are n × n matrices of AR coefficients. These matrices capture the relationships between the variables at different time lags; specifically, the impact of past values up to lag p on the current values. The parameter p defines the order of the VAR model, determining the maximum number of past periods considered in each equation and influencing the model's ability to capture system dynamics. Finally, ϵ_t represents an n × 1 vector of white noise terms, introducing randomness into the model and accounting for unexplained variability not captured by the lagged variables, thereby enhancing the model's accuracy in describing the underlying time series data. The choice of the order is a critical step and is typically determined through cross-validation techniques and information criteria such as the AIC, BIC, and HQ. As we estimate models for 365 days, the value of the selected p can vary. However, in most cases, a VAR of order five, i.e., VAR(5), is suitable, as it generally provides whitened residuals and produces lower AIC values compared to other orders of the model. Estimating and inferring parameters in VAR models is typically achieved using the MLE or ordinary least squares (OLS) techniques. These methods rely on certain assumptions, such as the error terms having a conditional mean of zero, stationary variables, and the absence of perfect multicollinearity.

Artificial Neural Networks
Artificial neural networks (ANNs) are a foundational component of machine learning, particularly in the realm of deep learning. They mirror the human brain's structure and functionality, comprising interconnected nodes arranged in layers: input, hidden, and output. These nodes, akin to artificial neurons, process information through weighted connections, activating based on a threshold to transmit data across layers. ANNs are renowned for their proficiency in rapid data classification, clustering, and various applications such as speech and image recognition, with Google's search algorithm being a prominent example of their implementation.
ANNs constitute a multi-layered architecture comprising essential components: an input layer, hidden layers, and an output layer. The input layer receives diverse data formats, while the hidden layer, functioning as a "distillation layer", extracts pertinent patterns to enhance network efficiency by recognizing crucial information and discarding redundancy. The output layer processes the transformed information from the hidden layers to produce the final output. The most commonly employed ANNs for predictive tasks are MLPs with hidden layers, utilizing a three-layered network interconnected by acyclic connections. Nodes act as elementary processing units, and the architecture allows for more than one hidden layer.

Neural Network Autoregressive
The neural network autoregressive (NNAR) model is a potent deep learning structure widely employed in various fields like natural language processing, time series forecasting, and speech recognition. It utilizes past data to predict future values, employing a recursive approach where each time step's input is the prior prediction. This model handles nonlinear relationships within intricate datasets and can be constructed using feedforward or recurrent neural networks.
The NNAR(p, k) architecture involves p lagged inputs for forecasting and includes k nodes in the hidden layer. It shares similarities with ARIMA models but operates without constraints for stationarity. The NNAR model equation comprises weighted node connections, nonlinear activation functions, AR dependencies, exogenous impacts, and error terms, and can be written as

Y_{it} = f(∑_{j=1}^{N} w_{ij} Y_{j(t−1)} + λZ^T_{i(t−1)}) + ϵ_{it},

where Y_{it} represents the univariate response at node i and time t, w_{ij} indicates the connection strength between nodes i and j for i, j = 1, …, N, λZ^T_{i(t−1)} captures the exogenous impact, and f(·) is an unidentified smooth link function. Estimation methods for the NNAR include profile least squares estimation and local linear approximation to model the unknown link function and optimize the parameter estimates. This model iterates through historical inputs to generate multi-step forecasts in time series analysis, showing adaptability and robustness in handling intricate data patterns. In our case, a simple autoregressive model of order one with one node in the hidden layer, NNAR(1,1), is used.
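An NNAR(1,1) of the kind used here, one lagged input feeding one hidden node, can be sketched with a generic feedforward regressor; scikit-learn's MLPRegressor is an assumed stand-in for the NNAR fitting routine, and the series is simulated.

```python
# Sketch of NNAR(1,1): one lagged input, one hidden node, fitted as a
# small feedforward network and used for a one-step-ahead forecast.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
y = np.zeros(400)
for t in range(1, 400):                      # toy autoregressive series
    y[t] = 0.6 * y[t - 1] + rng.normal()

X, target = y[:-1].reshape(-1, 1), y[1:]     # p = 1 lagged input
nnar = MLPRegressor(hidden_layer_sizes=(1,), # k = 1 hidden node
                    max_iter=5000, random_state=0)
nnar.fit(X, target)
y_next = nnar.predict(np.array([[y[-1]]]))   # one-step-ahead forecast
```

For multi-step forecasts, the prediction would be fed back in as the next input, as the recursive description above indicates.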

Support Vector Machine
Support vector machines (SVMs) are powerful supervised learning models widely employed for data analysis in classification, regression, and outlier detection tasks. They excel in handling both linear and nonlinear data separations, making them versatile in various domains such as text classification, image recognition, gene expression analysis, and even time series prediction. Notably, these machines were meticulously designed not only for effective ranking but also for efficiently simplifying training set outcomes. This distinct quality has led to the widespread adoption of SVM techniques, especially in forecasting time series.
An SVM commences with the input data, composed of labeled feature vectors, where each vector corresponds to one of two classes. Extending beyond binary classification, SVMs can also tackle multi-class classification problems. These feature vectors are then transformed into a higher-dimensional space using various kernel functions, such as the linear, polynomial, or radial basis functions. This transformation is pivotal as it calculates the similarity between different feature vectors, allowing the SVM to address nonlinearly separable data.

•	Training data and class labels: the training data contain assigned class labels y_i ∈ {−1, 1} for binary classification, serving as an anchor for the model's learning process.

•	Optimal separating hyperplane: the SVM seeks the weight vector w and bias b that maximize the margin between the two classes by minimizing (1/2)∥w∥² under the constraints y_i(w^T x_i + b) ≥ 1 for all training vectors x_i.

In this work, the SVM model is an epsilon-regression type with a radial basis function kernel. It uses two lagged values of the ozone concentration as inputs and is parameterized with a cost of 1.0, a gamma of 0.5, and an epsilon of 0.1. The fitted model retains 1191 support vectors.
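The configuration stated above can be sketched directly; scikit-learn's SVR is an assumed stand-in for the study's implementation, and the ozone series is simulated.

```python
# Sketch matching the stated setup: epsilon-SVR with an RBF kernel,
# cost 1.0, gamma 0.5, epsilon 0.1, and two lagged values as inputs.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(7)
y = np.zeros(500)
for t in range(2, 500):                      # toy ozone-like residual series
    y[t] = 0.5 * y[t - 1] + 0.2 * y[t - 2] + rng.normal()

X = np.column_stack([y[1:-1], y[:-2]])       # lags 1 and 2 as features
target = y[2:]
svr = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1)
svr.fit(X, target)
y_next = svr.predict([[y[-1], y[-2]]])       # one-step-ahead forecast
```

The number of support vectors retained (`len(svr.support_)`) depends on the data and epsilon; the 1191 reported in the text comes from the study's actual dataset.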

Random Forest
The random forest (RF) algorithm, introduced by Leo Breiman (Breiman 2001), is an important machine learning method suitable for both classification and regression tasks. It employs an ensemble of decision trees, each trained on a distinct subset of data created through bagging. Bagging, or bootstrap aggregating, involves forming diverse training sets for each tree by random sampling with replacement. A distinctive feature of the RF is the additional randomness introduced during tree construction. Instead of opting for the optimal feature for node splitting, it considers a random subset of features, thereby enhancing model diversity. This unique characteristic contributes to RF's robustness against overfitting and its effectiveness in handling complex datasets.
Random forest constructs a "forest" of decision trees through bagging. Each decision tree is descriptive, with a root node containing the variable of interest and leaf nodes representing predicted outcomes. The trees grow without pruning, and predictions for new observations are made by aggregating the results using

f(x) = (1/K) ∑_{i=1}^{K} f_i(x),

where f(x) is the RF regression predictor, K is the number of trees, and f_i(x) represents the individual regression trees.
Working Principle of Random Forests

•	Tree construction:
1. Random subsampling (bootstrap aggregating): The algorithm selects a subset of the training data with replacement (a bootstrap sample) and trains a decision tree on each subset. Samples are obtained by randomly selecting observations with replacement from the original dataset.
2. Random feature selection: At each decision tree node, a random subset of features is chosen for splitting, reducing the correlation between trees and enhancing ensemble diversity. The number of features considered at each node is typically the square root of the total number of features for classification and one-third for regression.

•	Prediction and aggregation:
3. OOB error and feature importance: Post-training, the RF provides two indices: the out-of-bag (OOB) error and the importance value of each feature. The OOB error measures prediction accuracy on data not included in the bootstrap sample, while feature importance is determined using the OOB dataset, aiding in variable selection.
4. Voting: Once all decision trees are trained, predictions are made by aggregating the individual tree results. Majority voting is used for classification and averaging for regression. The algorithm's performance is assessed through the OOB error, which provides an estimate of the RF's generalization error. The aggregated prediction is ŷ = (1/n_tree) ∑_{j=1}^{n_tree} ŷ_j, where ŷ is the final predicted value, n_tree is the number of trees in the RF, and ŷ_j is the predicted value of the jth tree.

•	Introducing randomness: The RF introduces randomness by selecting a subset of features for node splitting, enhancing model performance and reducing overfitting. The equation for the RF regression predictor with randomness in feature selection is f(x) = (1/K) ∑_{i=1}^{K} f(x, θ_i), where the θ_i are independent random vectors with the same distribution, representing the randomly selected features.

•	Hyperparameters and feature importance: The RF has hyperparameters such as the number of trees (n_estimators), the maximum number of features considered for splitting a node (max_features), and the minimum number of samples required at a leaf node (min_samples_leaf). These can be tuned for optimal performance. The RF assesses feature importance by measuring the decrease in accuracy using OOB error estimation during feature selection.
Finally, our RF model uses four lagged values of ozone concentration as predictors in this work.It was fine-tuned through 5-fold cross-validation, achieving optimal performance with the number of variables tried at each split (mtry) set to 2.
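The setup described in this section can be sketched as follows; scikit-learn's RandomForestRegressor is an assumed stand-in (its `max_features=2` plays the role of mtry = 2), the number of trees is an illustrative choice, and the series is simulated.

```python
# Sketch of the described RF setup: four lagged values as predictors and
# two candidate features per split (the mtry = 2 of the text).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(11)
y = np.zeros(600)
for t in range(4, 600):                      # toy residual ozone series
    y[t] = 0.4 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

# rows: predict y[t] from its lags 1..4
X = np.column_stack([y[3:-1], y[2:-2], y[1:-3], y[:-4]])
target = y[4:]
rf = RandomForestRegressor(n_estimators=200, max_features=2, random_state=0)
rf.fit(X, target)
y_next = rf.predict([[y[-1], y[-2], y[-3], y[-4]]])  # one-step-ahead forecast
```

In the study, the tuning of mtry was done via 5-fold cross-validation rather than fixed a priori as here.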

General Modeling Framework
The main goal of this study is to forecast O₃ concentrations and compare the proposed model, FAR(1), with other traditional time series and machine learning models. To explain in detail, for the O₃ concentration series denoted by X_{t,h} for the tth day (t = 1, 2, …, n) and the hth hour within a day (h = 1, 2, …, 24), the dynamics of the O₃ concentration can be modeled as

X_{t,h} = D_{t,h} + Y_{t,h}. (7)

This means that the O₃ series X_{t,h} is decomposed into two components. The first one is D_{t,h}, the deterministic component that contains the long-term dynamics, and the second one is Y_{t,h}, the stochastic component that captures short-term variations. The deterministic component consists of yearly seasonality, which is modeled and forecast using the generalized additive modeling (GAM) technique with smoothing splines. More precisely, the yearly seasonality, represented by the day-of-year series (1, 2, …, 365, 1, 2, …, 365, 1, 2, …, 365, 1, 2, …, 366, 1, 2, …, 365), is modeled using a smoothing spline. As this component represents long-term dynamics over our forecast horizon, the one-day-ahead forecast for the deterministic component is obtained as follows:

D̂_{t+1,h} = D̂_{t,h}. (8)
To compute the stochastic component, we subtract the deterministic component D_{t,h} from X_{t,h}. Mathematically, this is represented as

Y_{t,h} = X_{t,h} − D_{t,h}. (9)

The modeling and forecasting of the stochastic component are performed using FAR(1) and five alternate competing models. To this end, the stochastic component Y_{t,h} is converted to a matrix of dimensions "days × hours", where each row represents a day and each column represents an hour. In the case of FAR(1), this matrix is transformed into a functional object using the B-spline basis functions given in Equation (1), which results in Y_t(j), and the FAR(1) model is applied thereafter. In the case of the VAR model, the matrix of hourly series is considered and modeled as given in Section 2.4. Finally, for all other models, each hourly series is modeled and forecast separately as a univariate time series.
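The seasonal filtering step can be sketched as follows; a scipy smoothing spline over the day-of-year index stands in for the GAM fit, and the two-year toy series, the smoothing level, and the per-day averaging are illustrative assumptions.

```python
# Sketch of the decomposition X = D + Y: estimate the yearly seasonality D
# with a smoothing spline over the day-of-year index, then subtract it to
# leave the stochastic component Y.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(5)
days = np.arange(730)                        # two toy years of daily values
doy = days % 365                             # day-of-year index 0..364
x = 40 + 10 * np.sin(2 * np.pi * doy / 365) + rng.normal(0, 2, 730)

# average repeated day-of-year observations, then smooth the yearly shape
seasonal_mean = np.array([x[doy == d].mean() for d in range(365)])
spline = UnivariateSpline(np.arange(365), seasonal_mean, s=365 * 2.0)

D = spline(doy)                              # deterministic (seasonal) part
Y = x - D                                    # stochastic component
```

The one-day-ahead deterministic forecast is then simply the spline's value at the next day-of-year, consistent with the persistence rule above.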
Once both components, deterministic and stochastic, are modeled and forecast separately, the final one-day-ahead out-of-sample forecasts are obtained by combining the forecasts of both components.In summary, the following procedure is used to obtain one-day-ahead out-of-sample forecasts from different models.

•	For an hourly time series, decompose the O₃ concentration time series into deterministic and stochastic components, as given in Equation (7).

•
Using Equation ( 8), obtain a one-step-ahead forecast for each hour that will result in a 24 h one-day-ahead forecast.

•
Obtain the stochastic component using Equation ( 9).Use the models described in Section 2 to obtain a one-day-ahead forecast.

•
Compute the final one-day-ahead out-of-sample forecasts by combining the forecasts from deterministic and stochastic components.

•
Repeat the above steps for 365 days using a rolling window to obtain the one-dayahead out-of-sample forecasts for the entire year.
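The steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions: the data are synthetic, the deterministic forecast is the persistence rule of Equation (8), and a simple damped-persistence rule stands in for the fitted stochastic model (FAR(1) or a competitor):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical hourly O3 series over 3 years, as a "days x hours" matrix
rng = np.random.default_rng(1)
n_days = 3 * 365
doy = np.tile(np.arange(1, 366), 3)                      # day-of-year index
X = (0.02 + 0.01 * np.sin(2 * np.pi * doy / 365)[:, None]
     + 0.005 * rng.normal(size=(n_days, 24)))

# Step 1: deterministic component via a smoothing spline, per hour
D = np.empty_like(X)
t_idx = np.arange(n_days)
for h in range(24):
    spline = UnivariateSpline(t_idx, X[:, h], k=3, s=n_days * 1e-4)
    D[:, h] = spline(t_idx)

# Step 2: stochastic component
Y = X - D

# Steps 3-4: persistence forecast for D (Equation (8)) plus a
# damped-persistence stand-in for the fitted stochastic model,
# combined into the final one-day-ahead forecast
D_fc = D[-1]            # D-hat_{t+1,h} = D_{t,h}
Y_fc = 0.5 * Y[-1]      # placeholder for the FAR(1)/competitor forecast
X_fc = D_fc + Y_fc      # 24-hour one-day-ahead profile
```

In the study, this whole block is repeated 365 times over a rolling window, refitting on the most recent data before each one-day-ahead forecast.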

Data Overview and Preprocessing
The data for this study originate from Los Angeles, an area highly prone to ground-level O 3 pollution due to its unique geography, surrounded by mountains and bordered by the sea, resulting in a Mediterranean climate with high temperatures and minimal rainfall. The dataset utilized for our research was sourced exclusively from O 3 concentration measurements taken at the North Main Street monitoring site (−118.22688 E, 34.06659 N) as part of the Environmental Protection Agency's (EPA's) air quality system in Los Angeles 1 . This dataset spans the years from 2013 to 2017, capturing a comprehensive record of O 3 levels in the region. While the EPA's air quality system typically collects a wide range of meteorological and air pollution data, we focus specifically on the O 3 concentration data, thereby refining our dataset to provide a precise and specialized foundation for our research. The dataset is plotted in Figure 3, where the red line separates the model estimation and out-of-sample forecasting periods. This figure shows considerable variation in O 3 concentration throughout the year. Missing values are common in air quality datasets, and our dataset contains a very small number of them. To address this, linear interpolation is applied, which estimates a missing value by assuming a linear relationship between the neighboring known data points: the value of a missing point is calculated from the nearest known points before and after it. This method is commonly used in fields such as time series analysis, geographic information systems, and general data analysis to fill in missing data points. The dataset was organized with 80% of the data, spanning from 2013 to 2016, allocated for model estimation, while the remaining 20%, corresponding to the data for 2017, was reserved for one-day-ahead out-of-sample forecasts.
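The linear-interpolation step can be sketched on a small hypothetical hourly series with a few gaps; pandas' `interpolate` implements exactly this neighbour-to-neighbour straight-line fill:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly O3 series (ppm) with NaN gaps, mimicking the dataset
idx = pd.date_range("2013-01-01", periods=8, freq="h")
s = pd.Series([0.020, 0.022, np.nan, np.nan, 0.030, 0.028, np.nan, 0.024],
              index=idx)

# Linear interpolation: each gap is filled on a straight line between the
# nearest known values before and after it
filled = s.interpolate(method="linear")
```

For the two-point gap between 0.022 and 0.030, the fill values lie one third and two thirds of the way along the straight line between the neighbours.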
The descriptive statistics in Table 1 reveal that the smallest observed value is 0.0010 parts per million (ppm), while the largest stands at 0.1160 ppm. The mean O 3 concentration is 0.0239 ppm. Notably, the distribution appears slightly positively skewed, evident from the mean being marginally higher than the median (0.0220 ppm). Moreover, the standard deviation of 0.0180 ppm signifies a reasonable level of variability around the mean O 3 concentration. Additionally, the first quartile, at 0.0070 ppm, and the third quartile, at 0.0360 ppm, indicate where the majority of values cluster within the dataset.

Forecasting Accuracy Metrics
The accuracy of our forecasting models is measured using three standard error measures, i.e., mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R 2 ). These error measures are common descriptive statistics that show how close the predictions are to the actual values. Mathematically, they are described as

MAE = \frac{1}{24N} \sum_{t=1}^{N} \sum_{h=1}^{24} \left| X_{t,h} - \hat{X}_{t,h} \right|,

RMSE = \sqrt{ \frac{1}{24N} \sum_{t=1}^{N} \sum_{h=1}^{24} \left( X_{t,h} - \hat{X}_{t,h} \right)^{2} },

R^{2} = 1 - \frac{ \sum_{t=1}^{N} \sum_{h=1}^{24} \left( X_{t,h} - \hat{X}_{t,h} \right)^{2} }{ \sum_{t=1}^{N} \sum_{h=1}^{24} \left( X_{t,h} - \bar{X}_{h} \right)^{2} },

where
• N is the number of observations (days) in the out-of-sample forecast period;
• X_{t,h} and \hat{X}_{t,h} represent the actual and forecast O 3 values of the tth (t = 1, 2, . . ., 365) day and hth (h = 1, 2, . . ., 24) hour;
• \bar{X}_{h} denotes the mean O 3 value of the hth hour.
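The three metrics follow directly from these definitions; in this minimal sketch, `actual` and `forecast` are assumed to be "days x hours" NumPy arrays:

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error over all forecast days and hours."""
    return np.mean(np.abs(actual - forecast))

def rmse(actual, forecast):
    """Root mean squared error over all forecast days and hours."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

def r_squared(actual, forecast):
    """R^2 computed against the hourly means X-bar_h (column means)."""
    hourly_mean = actual.mean(axis=0)            # X-bar_h, one per hour
    ss_res = np.sum((actual - forecast) ** 2)
    ss_tot = np.sum((actual - hourly_mean) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect forecast gives MAE = RMSE = 0 and R 2 = 1; larger errors push MAE and RMSE up and R 2 down.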

Results
A comparison of the forecasting accuracy and model performance using R 2 , MAE, and RMSE as the evaluation criteria is shown in Table 2 and depicted in Figure 4. The results indicate that the FAR(1) model achieved the best performance, as it had the highest R 2 value and the lowest error values. The second-best model was the VAR model, followed by the ARIMA, NNAR, and RF models. The SVM model had the worst performance among the six models. The O 3 concentration forecasting performance of each model for different days of the week is summarized in Table 3 and illustrated in Figure 5. These errors represent the average errors for the whole year, calculated for each day of the week. From these results, one can notice that the proposed model outperforms the other models by producing the lowest RMSE and MAE values for every day of the week. The forecast errors differ across the days of the week, with the lowest values on Wednesday, for which the O 3 concentration is more stable than on the other days, and the highest values on Saturday and Sunday. The SVM model is the worst among the six models, as it has the highest RMSE and MAE values for every day.
Table 4 compares the hourly O 3 forecasting errors of the six models using the RMSE and MAE metrics. The results show that the FAR(1) model is the most accurate for most hours, except for the first five hours, where the VAR model performs better. The O 3 levels and the errors change throughout the day due to various factors such as weather, traffic, and emissions. The errors are usually higher during the peak hours (14-18) and the late hours (21-24), and lower during the off-peak hours (1-4). The VAR model is slightly better than the ARIMA, NNAR, and SVM models, but still worse than the FAR(1) model. Again, the SVM model has the highest errors for most hours and is the least accurate model. These results are also depicted in Figure 6. Finally, Table 5 lists the month-specific RMSE and MAE values for the models used to forecast the one-day-ahead O 3 concentration, which are further depicted in Figure 7. These errors are calculated by averaging the errors month-wise for the year 2017. The results show that the forecasting errors are not uniform across the months. The errors tend to be higher in February, March, April, and October, which implies that O 3 is more challenging to predict in these months. Again, the FAR(1) model performs best in most months, except for April, where the VAR model is slightly better, and February, where both models have equal errors. The SVM model has the worst performance in all months, which suggests that it is not well suited to O 3 forecasting. The ARIMA, NNAR, and RF models have comparable performance but are inferior to the FAR(1) model. The FAR(1) model has the lowest RMSE and MAE values in December, when the O 3 level is the lowest.

• Margin and decision boundary: The SVM aims to establish a hyperplane maximizing the margin between classes, mathematically expressed as w^T x + b = 0, where w denotes the weight vector and b is the bias.
• Support vectors: The data points lying on or within the margin; they play a defining role in establishing the decision boundary.
• Optimization (mathematical model): The hyperplane is determined by solving

\min_{w, b, \xi} \; \frac{1}{2} \| w \|^{2} + C \sum_{i} \xi_{i} \quad \text{subject to} \quad y_{i} (w^{T} x_{i} + b) \geq 1 - \xi_{i}, \; \xi_{i} \geq 0,

where C denotes the regularization parameter and \xi_{i} are slack variables permitting misclassifications within the margin boundaries. The optimization fine-tunes the weight vector, bias, and regularization to craft the decision function.
• Decision function and prediction: The decision function predicts the class label for a new data point x, utilizing the transformed feature vector generated via the kernel function. This function is represented as f(x) = sign(w^T ϕ(x) + b), where ϕ(x) denotes the transformed input feature vector.
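The bullets above describe the classification form of the SVM. A minimal scikit-learn sketch on a small hypothetical two-class toy problem illustrates the hyperplane, the C parameter, and the decision function (the study itself applies SVM to hourly forecasting):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data: two points per class
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

# Linear SVM; C is the regularization parameter from the optimization above
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w and b define the separating hyperplane w^T x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]

# The decision function sign(w^T x + b) classifies new points
preds = clf.predict([[0.1, 0.0], [1.0, 0.9]])
```

With an RBF or polynomial `kernel`, the same call fits the kernelized form f(x) = sign(w^T ϕ(x) + b) without computing ϕ(x) explicitly.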

Figure 2. Flowchart of the proposed modeling framework.

Figure 3. O 3 concentration time series for Los Angeles. The red line divides the estimation and out-of-sample forecasting periods.

Figure 5. Day-specific RMSE and MAE values for O 3 concentration forecasting.

Figure 7. Out-of-sample month-specific RMSE and MAE for O 3 concentration.

Table 1. Descriptive statistics of the O 3 concentration time series.

Table 2. One-day-ahead out-of-sample forecast results for O 3 concentration.

Table 3. Day-specific forecasting errors of all models.

Table 4. Hour-specific forecasting errors for O 3 concentration.

Table 5. Month-specific forecasting errors of all models.