1. Introduction
Air pollution refers to the presence of detrimental materials such as gases, solids, or liquid particles in the atmosphere, leading to harmful consequences for human health, the environment, and ecosystems. These materials are termed pollutants and can have diverse origins, arising from sources such as industrial operations, transportation, natural occurrences, and human activities. Typical pollutants include ozone (O3), nitric oxide (NO), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), particulate matter 2.5 (PM2.5), and particulate matter 10 (PM10), among others. Air pollution, especially ground-level O3, is a significant global concern affecting public health and the environment (Seinfeld and Pandis 2016).
In the United States of America (USA), the Environmental Protection Agency (EPA) estimates that mobile sources such as cars, buses, planes, trucks, and trains account for about half of the cancer risk and more than 70% of the non-cancer health effects associated with air toxics (Munshed et al. 2023). Stationary sources such as power plants, oil refineries, industrial facilities, and factories emit large amounts of pollutants from particular sites called point sources. Area sources are made up of several minor pollution contributors that, combined, have a significant impact on air quality. Area sources include agricultural areas, urban regions, and wood-burning fireplaces. Natural sources such as wind-blown dust, wildfires, and volcanic activity can also impact air quality.
Figure 1 displays the different sources described above that contribute significantly to air pollution. The collective impact of these sources on the environment poses a serious challenge for researchers and policymakers (
Lotrecchiano et al. 2020;
Mavroidis and Ilia 2012).
Exposure to ground-level O3 can have a range of adverse effects on human health. For example, O3 exposure is associated with an elevated risk of chronic respiratory and cardiovascular ailments, as indicated by numerous studies (World Health Organization 2020). Even short-term exposure to O3 can perturb lung mucociliary function, consequently weakening resistance to bacterial infections (Sujith et al. 2017). Alarming statistics reveal that approximately 16,400 premature deaths within the European Union (EU) are attributed to O3 pollution (Nuvolone et al. 2018). Beyond human health, O3 poses a substantial threat to vegetation and biodiversity. Its capacity to impair crops, forests, and other forms of vegetation manifests as diminished photosynthesis, reduced carbon sequestration, and losses in biodiversity. Moreover, it induces visible leaf injuries (Paoletti and Manning 2007). This concern extends to various regions, including parts of southern Europe, where many grassland areas face the risk of high O3 levels, impacting plant composition, seasonal flowering, and seed production for diverse natural species (Mills et al. 2016).
Ozone concentrations are important for air quality management and public health, and accurate prediction of these concentrations can aid in decision-making processes. Due to the importance of this topic, researchers have proposed different methods and techniques to model and forecast O3 concentrations. For example, Hong et al. (2023) presented a novel approach for predicting hourly O3 levels using a deep learning technique called the multi-order difference embedded long short-term memory (MDELSTM) method, applied to a Los Angeles air quality dataset. The performance of the proposed model was compared with partial least squares (PLS), the gated recurrent unit (GRU), long short-term memory (LSTM), the multilayer perceptron (MLP), and the stationary difference embedded LSTM (SDELSTM). Based on the results, the MDELSTM model demonstrated superior prediction performance among the models considered. Machine learning techniques are popular for forecasting O3 concentrations (Eslami et al. 2020; Hashim et al. 2022; Yafouz et al. 2022). For instance, Yafouz et al. (2022) conducted a study to predict the tropospheric O3 concentration using different machine learning models, including linear regression (LR), support vector regression (SVR), tree regression (TR), Gaussian process regression (GPR), ensemble regression (ER), and artificial neural networks (ANNs). Using data from Peninsular Malaysia, the results indicated that the LR, SVR, GPR, and ANN models performed better in terms of having a high coefficient of determination ($R^2$).
Many researchers have investigated the performance of classical time series models for forecasting O3 concentrations (Salazar et al. 2019). For example, Kumar et al. (2004) used the autoregressive integrated moving average (ARIMA) model to forecast the daily surface O3 concentration. Using a real dataset, the ARIMA(1,0,1) model demonstrated the best performance in predicting the maximum daily O3 concentration, with a mean absolute percentage error (MAPE) of 13.14%. Aneiros-Pérez et al. (2004) forecast daily maximum O3 concentrations in Toulouse, France, using nonlinear models based on kernel estimators. They also included exogenous variables in their analysis. The study found that the functional additive model (FAM), with its back-fitting kernel approach, produced the lowest quadratic errors and explained the highest percentage of variability in the data compared to other models. Owing to the challenges that O3 poses to human health, researchers continue to propose different models to model and forecast O3 concentration time series data (Arsić et al. 2020; Gao et al. 2018; Ghoneim et al. 2017; Rahman and Nasher 2024; Su et al. 2020).
This research contributes to the literature on O3 forecasting by applying functional data analysis (FDA) methods, which can capture the dynamic and complex features of the O3 concentration as a function of time. FDA methods have been used in various fields such as biostatistics, econometrics, and environmental science, but are less explored in the context of O3 forecasting (Jan et al. 2022; Shah et al. 2022). This study proposes a novel time series model based on FDA, which treats each day as a single functional observation with 24 discrete points. The performance of the FDA model is compared with classical time series and machine learning models using standard accuracy metrics. The results of this study provide insights into the advantages and limitations of different forecasting methods and suggest ways to improve the accuracy and reliability of O3 forecasts. This study also has practical implications for policymakers, environmental managers, and public health officials, who can use the O3 forecasts to implement timely and effective measures to mitigate the adverse effects of O3 pollution.
The rest of the article is organized in the following manner.
Section 2 provides details about the general modeling framework and the details of different models used in this research work. Analysis and results based on empirical investigation are provided in
Section 3. Finally,
Section 4 concludes the study.
2. Methodology
This section explores the predictive models used in this study to model and forecast O3 concentration data. To this end, our proposed approach, the functional autoregressive (FAR) model, is outlined in more detail. Several parametric and non-parametric competing models, namely ARIMA, VAR, SVM, NNAR, and RF, are also briefly described in this section. Before going into detail, we revisit some preliminaries of functional data analysis.
2.1. Functional Data Analysis
James O. Ramsay coined the term “functional data analysis” to describe this field of study (Ramsay 1982). FDA is at the forefront of modern statistical computing, ushering in transformative changes across diverse fields. The rapid advancement of technology allows for quicker and more accurate measurement equipment, facilitating the collection of continuous data spanning time, space, or other continua. This paradigm shift challenges classical statistical assumptions, particularly the conventional belief that the number of data points should surpass the number of variables in a dataset. In response to this evolution, FDA emerged as a field treating data as functional, where each curve is regarded as a single observation rather than a collection of discrete data points. The FDA method leverages information within and across sample units to incorporate derivatives, smoothness, and other inherent features of the functional data structure. The process of converting data to functional objects involves collecting information at discretized points, typically equally spaced, and implementing suitable basis functions to eliminate common noise while accurately representing each data curve.
Basis functions are pivotal in FDA, serving as the fundamental components for modeling intricate functional observations. A functional observation $x(t)$ is expressed as a linear combination of coefficients $c_k$ and known basis functions $\phi_k(t)$:

$$x(t) = \sum_{k=1}^{K} c_k\, \phi_k(t), \qquad (1)$$

where the $c_k$ denote the parameters (coefficients), $\phi_k(t)$ represents the known basis functions, and $K$ is the number of basis functions used. A more compact representation using matrix notation can be written as

$$x(t) = \mathbf{C}^{\top} \boldsymbol{\Phi}(t),$$

where $\mathbf{C} = (c_1, \ldots, c_K)^{\top}$ is the coefficient vector and $\boldsymbol{\Phi}(t) = \big(\phi_1(t), \ldots, \phi_K(t)\big)^{\top}$.
Several common basis systems, including the Fourier, B-spline, polynomial, principal component, and exponential bases, are used to cater to different data characteristics. The selection depends on the nature of the data; e.g., the Fourier basis is generally suitable for periodic data, while the B-spline basis is preferred for non-periodic data. In our research, we chose the B-spline basis system for its applicability to non-periodic data.
The B-spline basis functions are polynomial segments that are joined at certain knots or breakpoints. The compact support property, where each basis function is positive over a limited number of adjacent intervals, increases the computational efficiency. The order of a B-spline determines the degree of polynomial segments while the knot sequence governs their placement. The strategic placement of knots, whether equally spaced or tailored to data characteristics, plays a vital role in achieving accurate and meaningful representations. Coincident knots are employed strategically, offering flexibility to induce specific characteristics like derivative discontinuities.
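To illustrate the basis representation in Equation (1), the following sketch (a hypothetical illustration, not code from this study) builds a cubic B-spline design matrix with SciPy and fits the coefficients for one day of hourly readings by least squares; the 24-point grid and the choice of 15 basis functions are assumptions made only for the example.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis_matrix(grid, n_basis, order=4):
    """Evaluate a B-spline basis (degree = order - 1) on a grid.

    Returns a (len(grid), n_basis) design matrix Phi with Phi[i, k] = phi_k(grid[i]).
    Boundary knots are repeated 'order' times (clamped spline); interior knots
    are equally spaced, as discussed above.
    """
    degree = order - 1
    n_interior = n_basis - order
    interior = np.linspace(grid[0], grid[-1], n_interior + 2)[1:-1]
    knots = np.concatenate([[grid[0]] * order, interior, [grid[-1]] * order])
    basis = []
    for k in range(n_basis):
        coef = np.zeros(n_basis)
        coef[k] = 1.0                      # pick out the k-th basis function
        basis.append(BSpline(knots, coef, degree)(grid))
    return np.column_stack(basis)

# Example: smooth one day of hourly readings (24 discrete points) into a curve.
hours = np.arange(24, dtype=float)
y = 30 + 15 * np.sin((hours - 6) / 24 * 2 * np.pi) + np.random.normal(0, 2, 24)  # toy data
Phi = bspline_basis_matrix(hours, n_basis=15)        # design matrix of basis functions
c_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least-squares coefficients c_k
y_smooth = Phi @ c_hat                               # fitted curve x(t) on the hourly grid
```

Because every day shares the same grid and basis, the same design matrix can be reused for all days, so the whole sample reduces to a days × K matrix of coefficients.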
2.2. Functional Autoregressive Model
The FAR model used in this study is an extension of the classical AR model to functional data. The FAR model of order 1, FAR(1), operates within the framework of a separable Hilbert space denoted by $H$. It specifically considers the Hilbert space $H = L^2[0,1]$, although the idea is applicable to other $L^2$-spaces. An autoregressive Hilbertian process of order 1, ARH(1), which is also called FAR(1), is identified in this context as a strictly stationary sequence $(X_n,\ n \in \mathbb{Z})$ of $H$-valued random variables satisfying the AR equation

$$X_n - \mu = \rho(X_{n-1} - \mu) + \varepsilon_n,$$

where $\mu$ is the mean function, $\rho$ is a bounded linear operator, and $\varepsilon_n$ is the shock or innovation term. The AR operator $\rho$ is assumed to be a compact Hilbert–Schmidt, symmetric, and positive bounded linear operator from $H$ to itself. The compactness property permits a decomposition using orthonormal bases and real numbers, and the operator is said to be nuclear if certain eigenvalue conditions are met. Within the functional framework, we drop the compact support $[0,1]$ for simplicity of notation.
2.2.1. Operators in the Hilbert Space ($H$)
The operators on the Hilbert space $H$ considered here are bounded linear operators, and their norm is obtained from the inner product of the space $H$: it is the supremum of $\lVert \rho(y) \rVert$ over all unit vectors $y$ in $H$. It is represented by the symbol $\lVert \cdot \rVert_{\mathcal{L}}$ and written as

$$\lVert \rho \rVert_{\mathcal{L}} = \sup_{\lVert y \rVert \le 1} \lVert \rho(y) \rVert.$$

An operator $\rho$ is termed compact with respect to orthonormal bases $(e_j)$ and $(f_j)$ in $H$ and a sequence $(\lambda_j)$ approaching zero if it can be expressed as

$$\rho(y) = \sum_{j=1}^{\infty} \lambda_j \langle y, e_j \rangle f_j.$$

If the sum of the squares of the operator's sequence $(\lambda_j)$ is finite, the operator is Hilbert–Schmidt, i.e., $\sum_{j=1}^{\infty} \lambda_j^2 < \infty$. The space of Hilbert–Schmidt operators $S$ is separable and is equipped with the inner product $\langle \rho_1, \rho_2 \rangle_{S} = \sum_{j=1}^{\infty} \langle \rho_1(e_j), \rho_2(e_j) \rangle$ and the associated norm $\lVert \rho \rVert_{S} = \big(\sum_{j=1}^{\infty} \lambda_j^2\big)^{1/2}$. Operators are classified as symmetric if $\langle \rho(x), y \rangle = \langle x, \rho(y) \rangle$ and positive if $\langle \rho(x), x \rangle \ge 0$ for all $x, y \in H$. A symmetric positive Hilbert–Schmidt operator admits the decomposition

$$\rho(y) = \sum_{j=1}^{\infty} \lambda_j \langle y, e_j \rangle e_j.$$

A compact operator is nuclear if the sum of the absolute values of its sequence $(\lambda_j)$ is finite, i.e., $\sum_{j=1}^{\infty} \lvert \lambda_j \rvert < \infty$. The norms of these operators follow the relationship $\lVert \rho \rVert_{\mathcal{L}} \le \lVert \rho \rVert_{S} \le \lVert \rho \rVert_{\mathcal{N}}$. If $\rho$ is an integral operator in $L^2$ defined by $\rho(y)(t) = \int \psi(t, s)\, y(s)\, ds$, where $\psi$ is a real kernel, it is a Hilbert–Schmidt operator if and only if $\int\!\!\int \psi^2(t, s)\, dt\, ds < \infty$. The model is non-parametric, as $\rho$ represents an infinite-dimensional parameter.
2.2.2. Estimation of the Operator
Estimating the AR operator $\rho$ in the FAR model involves addressing specific assumptions for obtaining a stationary solution. Two assumptions are considered to ensure the existence of a stationary solution. The first assumption requires the existence of an integer $j_0 \ge 1$ such that $\lVert \rho^{j_0} \rVert_{\mathcal{L}} < 1$, while the second requires the existence of constants $a > 0$ and $0 < b < 1$ such that $\lVert \rho^{j} \rVert_{\mathcal{L}} \le a\, b^{j}$ for all $j \ge 1$. These assumptions, under certain conditions, guarantee a unique strictly stationary solution, as proven in Bosq (2000).
It is important to note that the estimation of $\rho$ cannot rely on likelihood, because the Lebesgue measure does not exist in non-locally compact spaces and the concept of a density is not available for functional data. Instead, the classical method of moments is employed. The estimation of $\rho$ is based on the relation $\rho = D C^{-1}$, where $C = E[(X_n - \mu) \otimes (X_n - \mu)]$ and $D = E[(X_n - \mu) \otimes (X_{n+1} - \mu)]$ are the covariance and cross-covariance operators of the process, and $\otimes$ denotes the tensor (Kronecker) product. The sample versions of these operators are denoted as $\hat{C}_n$ and $\hat{D}_n$, respectively.

To simplify the notation, it is assumed that the mean of the process is known (and set to zero after centering). The sample versions of the covariance and cross-covariance operators are then given by

$$\hat{C}_n(y) = \frac{1}{n} \sum_{i=1}^{n} \langle X_i, y \rangle X_i \quad \text{and} \quad \hat{D}_n(y) = \frac{1}{n-1} \sum_{i=1}^{n-1} \langle X_i, y \rangle X_{i+1}.$$

The covariance operator $C$ is a symmetric, positive definite, and compact operator. It can be decomposed into eigenfunctions and eigenvalues, denoted as $v_j$ and $\lambda_j$, respectively. However, $C^{-1}$ is not a bounded operator. To address this, a practical solution is to consider only the first $p$ most important empirical functional principal components as substitutes for the unknown population principal components, which gives

$$\hat{C}_{n,p}^{-1}(y) = \sum_{j=1}^{p} \hat{\lambda}_j^{-1} \langle y, \hat{v}_j \rangle \hat{v}_j.$$

From the ARH(1) equation, multiplying by $X_n$ (in the tensor-product sense) and taking expectations, and using $E[\varepsilon_{n+1} \otimes X_n] = 0$, we obtain the relations $D = \rho C$ and $\rho = D C^{-1}$. The estimate of $\rho$ is then given by

$$\hat{\rho}(x) = \sum_{k=1}^{p} \sum_{j=1}^{p} \hat{\lambda}_j^{-1} \langle x, \hat{v}_j \rangle \langle \hat{D}_n(\hat{v}_j), \hat{v}_k \rangle\, \hat{v}_k,$$

where the projection onto the first $p$ empirical eigenfunctions amounts to an additional smoothing step applied to $\hat{D}_n$ and $\hat{C}_n^{-1}$. The empirical eigenfunctions are known to converge asymptotically to the population eigenfunctions.
Once the estimator $\hat{\rho}$ of the population parameter $\rho$ is obtained, it is crucial to assess its optimality in estimating the true parameter. For the FAR parameter $\rho$, Didericksen et al. (2012) demonstrated that the estimator is optimal in terms of MSE and MAE, as its prediction error is comparable to that of the infeasible predictor $\rho(X_n)$ for an appropriately chosen $p$.
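To connect the estimator above with practice, the sketch below (a simplified, hypothetical implementation on a common discrete grid, not the exact routine used in this study) computes the empirical principal components by a singular value decomposition, estimates the autoregressive operator through the component scores, and returns a one-day-ahead curve forecast.

```python
import numpy as np

def far1_forecast(X, p=3):
    """One-day-ahead FAR(1) forecast via functional principal components.

    X : (n_days, n_grid) array of daily curves evaluated on a common grid
        (e.g., the 24 hourly points of the centred stochastic component).
    p : number of empirical principal components retained.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                                   # centred curves
    # Empirical eigenfunctions of the covariance operator via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:p].T                                  # (n_grid, p) eigenfunctions
    scores = Xc @ V                               # (n_days, p) principal scores
    # Regress next-day scores on current-day scores: S_{n+1} ~ S_n B.
    B, *_ = np.linalg.lstsq(scores[:-1], scores[1:], rcond=None)
    s_hat = scores[-1] @ B                        # predicted scores for the next day
    return mu + V @ s_hat                         # forecast curve on the grid

# Usage sketch: X holds the smoothed daily curves of the training window.
# next_day_curve = far1_forecast(X, p=4)
```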
2.3. Autoregressive Integrated Moving Average (ARIMA) Models
Time series data analysis is a vital tool for comprehending and forecasting temporal patterns. The autoregressive moving average (ARMA) model combines elements of autoregressive (AR) and moving average (MA) models, providing a framework for modeling univariate time series data. For a univariate time series $y_t$, an ARMA($p$, $q$) model can be written as

$$y_t = C + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t. \qquad (3)$$

Equation (3) contains an intercept term $C$, AR parameters $\phi_i$ ($i = 1, \ldots, p$), MA parameters $\theta_j$ ($j = 1, \ldots, q$), and a white noise term $\varepsilon_t$. The ARMA models are well suited for stationary time series data. However, differencing is required to achieve stationarity for non-stationary data, and this is where the ARIMA model comes into play. The ARIMA model is an extension of ARMA specifically tailored for non-stationary time series (Shumway et al. 2000). It involves differencing the data to attain stationarity, which can be expressed as

$$y^{(d)}_t = C + \sum_{i=1}^{p} \phi_i\, y^{(d)}_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t,$$

where $y^{(d)}_t$ represents the $d$th difference of the series, $\phi_i$ ($i = 1, \ldots, p$) and $\theta_j$ ($j = 1, \ldots, q$) are the parameters of the AR and MA components, respectively, and $\varepsilon_t$ is again a white noise term.
Identifying the appropriate ARIMA model is a critical step; it involves determining the order of differencing $d$ and the numbers of AR ($p$) and MA ($q$) terms. The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots are valuable tools in this process. The ACF measures the correlation of the series with its own lagged values, whereas the PACF measures the autocorrelation at lag $k$ after removing the effects of the intervening lags. For parameter estimation, the maximum likelihood estimation (MLE) method is generally used (Shumway et al. 2000). This study investigated different models and found that the ARIMA(5,0,0) model, a pure autoregressive model with five lagged observations, fits the data well and yields white-noise errors, making it suitable for forecasting.
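As an illustration, a minimal sketch (hypothetical data; `y` stands in for one hourly sub-series of the stochastic component) of fitting the selected ARIMA(5,0,0) specification and producing a one-step-ahead forecast with statsmodels is given below.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.random.normal(size=500)        # placeholder for one hourly sub-series

model = ARIMA(y, order=(5, 0, 0))     # pure AR(5): no differencing, no MA terms
fitted = model.fit()                  # parameters estimated by maximum likelihood
one_step = fitted.forecast(steps=1)   # one-step-ahead out-of-sample forecast
```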
2.4. Vector Autoregressive
A vector autoregressive (VAR) model is a powerful and commonly used time series analysis method that enables us to capture the dynamic interactions between variables over time. It extends the concept of univariate AR models to a multivariate situation. Because of their flexibility and forecasting accuracy, VAR models are extremely useful for understanding and forecasting complicated real-world behavior. They can effectively express variable interdependence.
A VAR model is built on a set of equations that describe the evolution of several time series variables. Each variable in the model is treated as a linear function of its own lagged values as well as the lagged values of all other variables in the model. Suppose $\mathbf{y}_t$ is a vector of $n$ univariate time series; then a VAR($p$) model can be written as

$$\mathbf{y}_t = C + \Phi_1 \mathbf{y}_{t-1} + \Phi_2 \mathbf{y}_{t-2} + \cdots + \Phi_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t,$$

where $\mathbf{y}_t$ is an $n \times 1$ vector representing the current values of the $n$ distinct response time series variables at time $t$. The $n \times 1$ vector $C$ consists of constant offsets that serve as intercepts, accounting for the baseline level of the variables. The matrices $\Phi_j$, for $j = 1, \ldots, p$, are $n \times n$ matrices of AR coefficients. These matrices capture the relationships between the variables at different time lags, specifically the impact of past values up to lag $p$ on the current values. The parameter $p$ defines the order of the VAR model, determining the maximum number of past periods considered in each equation and influencing the model's ability to capture system dynamics. Finally, $\boldsymbol{\varepsilon}_t$ is an $n \times 1$ vector of white noise terms, introducing randomness into the model and accounting for unexplained variability not captured by the lagged variables, thereby enhancing the model's accuracy in describing the underlying time series data.
The choice of the order $p$ is a critical step and is typically determined through cross-validation techniques and information criteria such as the AIC, BIC, and HQ. As we estimate models for 365 days, the value of the selected $p$ can vary. However, in most cases, a VAR of order five, i.e., VAR(5), is suitable, as it generally provides whitened residuals and produces lower AIC values compared to other orders of the model. Estimating and inferring parameters in VAR models is typically achieved using MLE or ordinary least squares (OLS) techniques. These methods rely on certain assumptions, such as the error terms having a conditional mean of zero, stationary variables, and the absence of perfect multicollinearity.
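For completeness, the sketch below (hypothetical data handling; the 24-column matrix stands in for the hourly series of the stochastic component) fits a VAR(5) with statsmodels and produces a one-day-ahead forecast of all 24 hourly values.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Placeholder: 500 days x 24 hourly series of the stochastic component.
data = pd.DataFrame(np.random.normal(size=(500, 24)),
                    columns=[f"h{h:02d}" for h in range(24)])

model = VAR(data)
results = model.fit(5)                                   # VAR(5); or model.fit(maxlags=10, ic="aic")
one_day = results.forecast(data.values[-5:], steps=1)    # (1 x 24) one-day-ahead forecast
```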
2.5. Artificial Neural Networks
Artificial neural networks (ANNs) are a foundational component of machine learning, particularly in the realm of deep learning. They mirror the human brain’s structure and functionality, comprising interconnected nodes arranged in layers: input, hidden, and output. These nodes, akin to artificial neurons, process information through weighted connections, activating based on a threshold to transmit data across layers. The ANNs are renowned for their proficiency in rapid data classification, clustering, and various applications such as speech and image recognition, with Google’s search algorithm being a prominent example of their implementation.
ANNs constitute a multi-layered architecture comprising an input layer, hidden layers, and an output layer. The input layer receives diverse data formats, while the hidden layer, functioning as a “distillation layer”, extracts pertinent patterns to enhance network efficiency by recognizing crucial information and discarding redundancy. The output layer processes the transformed information from the hidden layers to produce the final output. The most commonly employed ANNs for predictive tasks are MLPs with hidden layers, which use a three-layered network interconnected by acyclic connections. The nodes act as processing units, and the architecture allows for more than one hidden layer.
Neural Network Autoregressive
The neural network autoregressive (NNAR) model is a potent deep learning structure widely employed in various fields like natural language processing, time series forecasting, and speech recognition. It utilizes past data to predict future values, employing a recursive approach where each time step’s input is the prior prediction. This model handles nonlinear relationships within intricate datasets and can be constructed using feedforward or recurrent neural networks.
The NNAR($p$, $k$) architecture involves $p$ lagged inputs for forecasting and includes $k$ nodes in the hidden layer. It shares similarities with ARIMA models but operates without constraints for stationarity. The NNAR model equation comprises weighted node connections, nonlinear activation functions, AR dependencies, exogenous impacts, and error terms: the univariate response $y_{i,t}$ at node $i$ and time $t$ is modeled through connection strengths $w_{ij}$ between nodes $i$ and $j$ ($i, j = 1, \ldots, N$), a term capturing the exogenous impact, and an unspecified smooth link function $g(\cdot)$. Estimation methods for the NNAR include profile least squares estimation and local linear approximation to model the unknown link function and optimize the parameter estimates. The model iterates through historical inputs to generate multi-step forecasts in time series analysis, showing adaptability and robustness in handling intricate data patterns. In our case, a simple autoregressive model of order one with one node in the hidden layer, NNAR(1,1), is used.
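The NNAR(1,1) specification can be approximated with a single-hidden-layer feed-forward network; the sketch below is a hypothetical scikit-learn analogue (not necessarily the implementation used in this study) that regresses each observation on one lagged value with a single hidden node.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

y = np.random.normal(size=500)                   # placeholder for one hourly sub-series

X_lag = y[:-1].reshape(-1, 1)                    # p = 1: one lagged input
target = y[1:]

nnar = MLPRegressor(hidden_layer_sizes=(1,),     # k = 1: one node in the hidden layer
                    activation="logistic",
                    solver="lbfgs",
                    max_iter=2000,
                    random_state=0)
nnar.fit(X_lag, target)

one_step = nnar.predict(np.array([[y[-1]]]))     # one-step-ahead forecast
```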
2.6. Support Vector Machine
Support vector machines (SVMs) are powerful supervised learning models widely employed for data analysis in classification, regression, and outlier detection tasks. They excel in handling both linear and nonlinear data separations, making them versatile in various domains such as text classification, image recognition, gene expression analysis, and even time series prediction. Notably, SVMs are designed not only to separate classes effectively but also to generalize well beyond the training set. This quality has led to the widespread adoption of SVM techniques, especially for forecasting time series.
An SVM commences with the input data, composed of labeled feature vectors, where each vector corresponds to one of two classes. Extending beyond binary classification, SVMs can also tackle multi-class classification problems. These feature vectors are then transformed into a higher-dimensional space using various kernel functions, such as the linear, polynomial, or radial basis functions. This transformation is pivotal as it calculates the similarity between different feature vectors, allowing the SVM to address nonlinear separable data.
Key Elements and Optimization in the SVM
Training data and class labels: The training data contains feature vectors with assigned class labels, denoted within $\{-1, 1\}$ for binary classification, serving as an anchor for the model's learning process.
Margin and decision boundary: The SVM aims to establish a hyperplane maximizing the margin between classes, mathematically expressed as $\mathbf{w}^{\top}\mathbf{x} + b = 0$, where $\mathbf{w}$ denotes the weight vector and $b$ is the bias.
Support vectors and decision boundary: Essential in shaping the decision boundary, support vectors are the data points lying within or on the margin. They play a defining role in establishing the decision boundary.
Optimization and decision function: The optimization process fine-tunes key parameters like the weight vector, bias, and regularization factor to craft the decision function.
Optimization (mathematical model): The hyperplane determination involves the optimization problem

$$\min_{\mathbf{w},\, b,\, \xi}\; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i} \xi_i$$

under the constraints $y_i\big(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\big) \ge 1 - \xi_i$ with $\xi_i \ge 0$, where $C$ denotes the regularization parameter and the $\xi_i$ are slack variables permitting misclassifications within the margin boundaries.
Decision function and prediction: The decision function predicts the class label for a new data point $\mathbf{x}$ using the transformed feature vector generated via the kernel function. This function is represented as $f(\mathbf{x}) = \operatorname{sign}\big(\mathbf{w}^{\top}\phi(\mathbf{x}) + b\big)$, where $\phi(\mathbf{x})$ denotes the transformed input feature vector.
In this work, the SVM model is an epsilon-regression type with a radial basis function kernel. It uses two lagged values of ozone concentration as inputs and is parameterized with a cost of 1.0, gamma of 0.5, and epsilon of 0.1. The model is supported by 1191 support vectors to ensure accurate predictions.
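The configuration described above translates almost directly into scikit-learn's epsilon-SVR; the sketch below (hypothetical data handling, with the cost, gamma, and epsilon values taken from the description) builds the two lagged inputs and produces a one-step-ahead forecast.

```python
import numpy as np
from sklearn.svm import SVR

y = np.random.normal(size=500)                           # placeholder ozone series

# Two lagged values of the series as inputs.
X = np.column_stack([y[1:-1], y[:-2]])                   # lags 1 and 2
target = y[2:]

svr = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1)   # epsilon-regression, RBF kernel
svr.fit(X, target)

x_new = np.array([[y[-1], y[-2]]])                       # most recent two observations
one_step = svr.predict(x_new)                            # one-step-ahead forecast
```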
2.7. Random Forest
The random forest (RF) algorithm, introduced by Leo Breiman (
Breiman 2001), is an important machine learning method suitable for both classification and regression tasks. It employs an ensemble of decision trees, each trained on a distinct subset of data created through bagging. Bagging, or bootstrap aggregating, involves forming diverse training sets for each tree by random sampling with replacement. A distinctive feature of the RF is the additional randomness introduced during tree construction. Instead of opting for the optimal feature for node splitting, it considers a random subset of features, thereby enhancing model diversity. This unique characteristic contributes to RF’s robustness against overfitting and its effectiveness in handling complex datasets.
Random forest constructs a “forest” of decision trees through bagging. Each decision tree is descriptive, with a root node containing the variable of interest and leaf nodes representing predicted outcomes. The trees grow without pruning, and predictions for new observations are made by aggregating the individual results using $\hat{f}_{\mathrm{RF}}(x) = \frac{1}{K}\sum_{k=1}^{K} T_k(x)$, where $\hat{f}_{\mathrm{RF}}$ is the RF regression predictor, $K$ is the number of trees, and $T_k$ represents the individual regression trees.
Working Principle of Random Forests
1. Random subsampling (bootstrap aggregating): The algorithm selects a subset of the training data with replacement (a bootstrap sample) and trains a decision tree on each subset. Samples are obtained by randomly selecting observations with replacement from the original dataset.
2. Random feature selection: At each decision tree node, a random subset of features is chosen for splitting, reducing the correlation between trees and enhancing ensemble diversity. The number of features considered at each node is typically the square root of the total features for classification and one-third for regression.
3. OOB error and feature importance: Post-training, RF provides two indices: the out-of-bag (OOB) error and the importance value of each feature. The OOB error measures prediction accuracy on data not included in the bootstrap sample, while feature importance is determined using the OOB dataset, aiding in variable selection.
4. Voting: Once all decision trees are trained, predictions are made by aggregating the individual tree results. Majority voting is used for classification and averaging for regression. The algorithm's performance is assessed through the OOB error, providing an estimate of the RF's generalization error via $\hat{y} = \frac{1}{n_{\mathrm{tree}}}\sum_{j=1}^{n_{\mathrm{tree}}} \hat{y}_j$, where $\hat{y}$ is the final predicted value, $n_{\mathrm{tree}}$ is the number of trees in the RF, and $\hat{y}_j$ is the prediction of the $j$th tree.
The RF introduces randomness by selecting a subset of features for node splitting, enhancing model performance and reducing overfitting. The RF regression predictor with randomness in feature selection can be written as $\hat{f}_{\mathrm{RF}}(x) = \frac{1}{K}\sum_{k=1}^{K} T(x; \Theta_k)$, where the $\Theta_k$ are independent, identically distributed random vectors representing the randomly selected features.
The RF has hyperparameters such as the number of trees, the maximum number of features considered when splitting a node (max_features), and the minimum number of samples required at a leaf node (min_samples_leaf). These can be tuned for optimal performance. The RF assesses feature importance by measuring the decrease in accuracy using the OOB error estimation, which can aid in feature selection.
Finally, our RF model uses four lagged values of ozone concentration as predictors in this work. It was fine-tuned through 5-fold cross-validation, achieving optimal performance with the number of variables tried at each split (mtry) set to 2.
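A comparable setup can be sketched with scikit-learn (a hypothetical analogue; the role of mtry is played by max_features here), using four lagged values as predictors and 5-fold cross-validation to select the number of features tried at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

y = np.random.normal(size=500)                           # placeholder ozone series

# Four lagged values of the series as predictors.
p = 4
X = np.column_stack([y[p - lag:-lag] for lag in range(1, p + 1)])   # lags 1..4
target = y[p:]

search = GridSearchCV(RandomForestRegressor(n_estimators=500, random_state=0),
                      param_grid={"max_features": [1, 2, 3, 4]},    # analogue of mtry
                      cv=5)
search.fit(X, target)

x_new = np.array([[y[-1], y[-2], y[-3], y[-4]]])         # most recent four observations
one_step = search.best_estimator_.predict(x_new)         # one-step-ahead forecast
```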
2.8. General Modeling Framework
The main goal of this study is to forecast O3 concentrations and compare the proposed model, FAR(1), with other traditional time series and machine learning models. To explain in detail, let the O3 concentration series be denoted by $y_t(h)$ for the $t$th day ($t = 1, \ldots, T$) and the $h$th hour of the day ($h = 1, \ldots, 24$). The dynamics of the O3 concentration can then be modeled as

$$y_t(h) = d_t(h) + r_t(h). \qquad (7)$$

This means that the O3 series $y_t(h)$ is decomposed into two components: $d_t(h)$, the deterministic component that contains the long-term dynamics, and $r_t(h)$, the stochastic component that captures short-term variations. The deterministic component consists of yearly seasonality, which is modeled and forecast using the generalized additive modeling (GAM) technique with smoothing splines. More precisely, the yearly seasonality, represented by the day-of-year series (1, 2,…, 365, 1, 2,…, 365, 1, 2,…, 365, 1, 2,…, 366, 1, 2,…, 365), is modeled using a smoothing spline, and the fitted spline is used directly for the one-day-ahead forecast of $d_{t+1}(h)$, as this component represents long-term dynamics that remain essentially constant over our forecast horizon. Hence, the one-day-ahead forecast of the deterministic component is obtained as

$$\hat{d}_{t+1}(h) = \hat{g}_h\big(\mathrm{doy}(t+1)\big), \qquad (8)$$

where $\hat{g}_h(\cdot)$ denotes the smoothing spline fitted to the hour-$h$ series and $\mathrm{doy}(t+1)$ is the day of the year corresponding to day $t+1$.
To compute the stochastic component, we subtract the deterministic component $\hat{d}_t(h)$ from $y_t(h)$. Mathematically, this is represented as

$$\hat{r}_t(h) = y_t(h) - \hat{d}_t(h). \qquad (9)$$
The modeling and forecasting of the stochastic component are performed using FAR(1) and the five alternate competing models. To this end, the stochastic component $\hat{r}_t(h)$ is converted to a matrix of dimensions “days × hours”, where each row represents a day and each column represents an hour. In the case of FAR(1), this matrix is then transformed into a functional object using the B-spline basis expansion given in Equation (1), which results in the functional observations $\hat{r}_t(\cdot)$, and the FAR(1) model is applied thereafter. In the case of the VAR model, the matrix of hourly series is considered and modeled jointly as described in Section 2.4. Finally, for all other models, each hourly series is modeled and forecast separately as a univariate time series.
Once both components, deterministic and stochastic, are modeled and forecast separately, the final one-day-ahead out-of-sample forecasts are obtained by combining the forecasts of both components. In summary, the following procedure is used to obtain one-day-ahead out-of-sample forecasts from different models.
For an hourly time series, decompose the O3 concentration time series into deterministic and stochastic components, as given in Equation (7).
Using Equation (8), obtain a one-step-ahead forecast for each hour, resulting in a 24 h one-day-ahead forecast of the deterministic component.
Obtain the stochastic component using Equation (9) and use the models described in Section 2 to obtain its one-day-ahead forecast.
Compute the final one-day-ahead out-of-sample forecasts by combining the forecasts of the deterministic and stochastic components.
Repeat the above steps for 365 days using a rolling window to obtain one-day-ahead out-of-sample forecasts for the entire year.
Figure 2 shows the flowchart of the proposed modeling framework.
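To make the procedure concrete, the sketch below (hypothetical helper names; a SciPy smoothing spline fitted to day-of-year averages stands in for the GAM seasonal fit, and `forecast_stochastic` is a placeholder for any model from Section 2) carries out one rolling-window step: it separates the deterministic and stochastic components as in Equations (7) and (9), forecasts the seasonal part as in Equation (8), and combines the two one-day-ahead forecasts.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def one_day_ahead(Y, doy, forecast_stochastic):
    """One rolling-window step of the framework in Section 2.8.

    Y   : (n_days, 24) matrix of hourly concentrations in the training window.
    doy : (n_days,) day-of-year index (1..365/366) for each row of Y.
    forecast_stochastic : callable mapping the (n_days, 24) stochastic matrix
                          to a 24-valued forecast for the next day.
    """
    n_days, n_hours = Y.shape
    det_fit = np.empty_like(Y, dtype=float)
    det_next = np.empty(n_hours)
    days = np.unique(doy)                              # sorted day-of-year values
    next_doy = doy[-1] % 365 + 1                       # forecast day (leap-year wrap ignored)
    for h in range(n_hours):
        # Deterministic part: yearly seasonality of hour h via a smoothing spline
        # fitted to the day-of-year averages (stand-in for the GAM fit).
        means = np.array([Y[doy == d, h].mean() for d in days])
        spline = UnivariateSpline(days, means, s=len(days))   # rough smoothing level
        det_fit[:, h] = spline(doy)
        det_next[h] = spline(next_doy)                 # one-day-ahead deterministic forecast
    R = Y - det_fit                                    # stochastic component, Equation (9)
    return det_next + forecast_stochastic(R)           # combined one-day-ahead forecast

# Usage sketch: plug in the FAR(1) forecaster sketched in Section 2.2.
# y_hat = one_day_ahead(Y_train, doy_train, lambda R: far1_forecast(R, p=4))
```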
4. Conclusions
Air pollution, particularly ground-level O3, is a global issue with severe implications for both human health and ecosystems. It reduces agricultural output, exacerbates global warming as a greenhouse gas, and affects respiratory health, causing symptoms such as coughing, chest tightness, and worsening asthma. Given the severity of these consequences, accurate forecasting of ground-level O3 concentrations is crucial. This study uses a functional approach to model ground-level O3 concentrations, a method less explored in the literature. The performance of this approach is compared with traditional time series and machine learning models, including ARIMA, VAR, NNAR, RF, and SVM. Hourly data from Los Angeles, collected from 2013 to 2017, are used, with 80% allocated for model estimation and the remaining 20% for one-day-ahead out-of-sample forecasting. The forecasting performance is evaluated using standard accuracy measures, including the MAE and RMSE.
The results revealed that the FAR(1) model outperforms the other models included in the study, producing lower forecasting errors. The second-best model was the VAR model, followed by the ARIMA, NNAR, and RF models. The SVM model performed the worst among all models. These findings highlight the importance of choosing a suitable model for accurate forecasting, which is crucial for mitigating the impacts of air pollution. In future research, incorporating exogenous variables such as temperature, humidity, wind speed, solar radiation, and other meteorological factors may improve forecasting accuracy. As the current study is based on the Los Angeles dataset, the performance of the proposed approach could also be assessed by applying it to datasets from other sites.