Article

Development of an Ozone (O3) Predictive Emissions Model Using the XGBoost Machine Learning Algorithm

by
Esteban Hernandez-Santiago
,
Edgar Tello-Leal
*,
Jailene Marlen Jaramillo-Perez
and
Bárbara A. Macías-Hernández
Faculty of Engineering and Science, Autonomous University of Tamaulipas, Victoria 87000, Mexico
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(1), 15; https://doi.org/10.3390/bdcc10010015
Submission received: 24 October 2025 / Revised: 16 December 2025 / Accepted: 25 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)

Abstract

High concentrations of tropospheric ozone (O3) in urban areas pose a significant risk to human health. This study proposes an evaluation framework based on the XGBoost algorithm to predict O3 concentration, assessing the model's capacity for seasonal extrapolation and spatial transferability. The experiment uses hourly air pollution data (O3, NO, NO2, and NOx) and meteorological factors (temperature, relative humidity, barometric pressure, wind speed, and wind direction) from six monitoring stations in the Monterrey Metropolitan Area, Mexico (22 September 2022 to 21 September 2023). In the preprocessing phase, the datasets were extended via feature engineering, including cyclic variables, rolling windows, and lag features, to capture temporal dynamics. The prediction models were optimized using a random search, with time-series cross-validation to prevent data leakage. The models were evaluated across a concentration range of 0.001 to 0.122 ppm, demonstrating high predictive accuracy, with a coefficient of determination (R2) of up to 0.96 and a root-mean-square error (RMSE) of 0.0034 ppm when predicting summer O3 concentrations without prior knowledge of that season. Spatial generalization was robust in residential areas (R2 > 0.90), but performance decreased in the industrial corridor (AQMS-NL03). Through quantification of domain shift (Kolmogorov–Smirnov test) and Shapley additive explanations (SHAP) diagnostics, we identified that this decrease is related to local complexity: the model effectively learns atmospheric inertia in stable areas but struggles with the stochastic effects of NOx titration driven by industrial emissions. These findings position the proposed approach as a reliable tool for "virtual sensing" while highlighting the crucial role of environmental topology in model implementation.

1. Introduction

In urban areas, especially in large metropolitan regions and megacities, high levels of tropospheric ozone (O3) are frequently recorded. This O3 is formed through complex photochemical reactions involving its precursors, primarily nitrogen oxides (NOx), carbon monoxide (CO), volatile organic compounds (VOCs), and methane (CH4) [1,2]. In the presence of sunlight and at ambient temperature (>20 °C), VOCs and NOx are reactive and volatilize, producing O3 via photochemical processes. Specifically, when VOC concentrations are low and NOx (NO and NO2) predominate, the latter can generate O3 through photolysis. Conversely, when VOCs predominate in the atmosphere (at high concentrations), photochemical reactions compete with hydroxyl (OH) free radicals to generate O3, reaching a stable concentration influenced by solar intensity, ambient temperature, and the ratio of precursor concentrations [1,3,4,5,6]. Furthermore, meteorological factors such as high temperatures, low relative humidity, and low wind speed, as well as specific geographical conditions, can contribute to O3 formation [7,8,9].
Accurate prediction of historical and future O3 concentrations is essential for effective air quality management, supporting the planning of emergency response actions and the implementation of mitigation measures to reduce health risks related to high air pollution levels. In this context, research on O3 prediction models remains limited in Mexico, especially in its major metropolitan areas. Addressing this gap requires methods to develop predictive models that accurately capture the complex, nonlinear relationships between O3 precursors and meteorological factors. This highlights the potential of advanced machine learning (ML) techniques.
Extreme Gradient Boosting (XGBoost) is a decision-tree-based ML technique that uses the boosting ensemble method; it has become a leading solution for complex regression and classification problems with nonlinear data [10]. XGBoost iteratively combines multiple simple decision tree models to create a strong predictor, with each new model trained to correct the errors of the previous one [11]. The algorithm optimizes a loss function using gradient descent, thereby minimizing the residual error at each step [12]. Key advantages include the incorporation of L1 regularization terms (using the α parameter to apply Lasso regression) and L2 regularization terms (using the λ parameter to apply Ridge regression) to prevent overfitting. Furthermore, it includes features for automated handling of missing values and parallel processing, which are essential for managing big data [13].
Recent research has shown XGBoost's effectiveness in predicting air pollutant levels. Li et al. [14] demonstrated that XGBoost outperforms other algorithms in capturing the spatial and temporal variation patterns of O3, consistently achieving better metrics at urban sites than at suburban ones. Other studies support this robustness: Gagliardi and Andenna [15] reported coefficient of determination (R2) values between 0.59 and 0.86 using XGBoost across various regions, while Liu et al. [16] achieved a cross-validation R2 (CV-R2) of up to 0.78 for seasonal O3 predictions by combining ground measurements with reanalysis and meteorological data.
In this regard, several studies have integrated feature engineering and model optimization techniques to improve the predictive performance of XGBoost. Fan et al. [17] generated a spatiotemporal O3 dataset, in which the optimized XGBoost model achieved CV-R2 values ranging from 0.96 to 0.97, with RMSE between 4.58 and 5.00 μg/m³, demonstrating strong generalizability. Methodologically, Liu et al. [18] used lag variables (derived from historical O3 and NO2 data over the last 3 h) and the Shapley additive explanations (SHAP) method to assess feature significance, thereby increasing computational efficiency by 30%. Similarly, Dai et al. [19] and Xiaomin et al. [20] proposed hybrid models that combine vector autoregression and kriging with XGBoost to achieve high-accuracy predictions (R2 of 0.95) using input data such as precursor pollutants, meteorological factors, and contextual socioeconomic data. Other feature selection methods, like BO-XGBoost-RFE [21] and principal component analysis (PCA) [22], have also proven effective in enhancing model efficiency for O3 prediction. XGBoost-based models are particularly robust in validation and short-term forecasting, as demonstrated by studies in different regions [23,24,25].
Therefore, to address the scarcity of high-performance O3 prediction models in Mexico, this study develops and validates a predictive framework for hourly O3 concentrations in the Monterrey Metropolitan Area (MMA). Using an XGBoost architecture optimized through rigorous time-series cross-validation and extensive feature engineering (including cyclic, rolling-window, and lag variables), this work makes a contribution beyond simply applying an established algorithm: we introduce a systematic evaluation approach to determine the model's operational limits. Specifically, this study aims to: (1) evaluate the model's seasonal extrapolation capacity by forecasting high summer O3 levels using models trained exclusively on data from the colder seasons; (2) evaluate its spatial generalization capabilities by explicitly analyzing how local industrial topologies affect the model's transferability; and (3) compare chronological stratified sampling with time-series cross-validation to assess theoretical learning capacity against the operational stability of the forecast. The optimized models showed high predictive accuracy, with an R2 of up to 0.96 for local targets. Notably, the spatial analysis revealed strong generalization ability alongside clear limitations in complex industrial areas.

2. Materials and Methods

2.1. Study Area

The MMA comprises 16 municipalities and covers a total surface area of 7658 km2 [26]. Located in the central region of the state of Nuevo León (NL), México (MX), the area lies between latitudes 25°37′ and 25°44′ N and longitudes 100°07′ and 100°16′ W, at an average elevation of 537 m above sea level (masl). The region has a semi-arid climate with three main subtypes: dry semi-warm with summer rain, semi-warm with limited rainfall, and temperate sub-humid. The average annual temperature ranges from 16 to 24 °C, with precipitation levels between 400 and 1000 mm [27]. Seasonal variations are significant, with summer temperatures often exceeding 40 °C and winter temperatures falling below 0 °C. Rainfall mainly occurs between August and December, while the rest of the year typically features light drizzle or dry conditions [28].
Topographically, the area is surrounded by a mountainous landscape. The prevailing winds flow from east to west, interacting with the terrain to create a trapping effect that promotes the accumulation of pollutants at ground level [29]. Demographically, the MMA has a population of 5,784,442 inhabitants [28]. Economically, it is an industrial hub with automotive manufacturing, electronic circuit integration, oil refining, and iron production, along with constantly growing construction, commerce, and service sectors [30]. Mobile emission sources are substantial, with a registered vehicle fleet comprising 1,539,035 private cars, 149,727 motorcycles, 346,434 cargo trucks, and 14,891 public transport units [31].

2.2. Dataset

The dataset used in this study was compiled from public records available in the National Air Quality Information System. These data were collected by the Ministry of the Environment of the State of Nuevo León, México, through its Integral Environmental Monitoring System. For this research, six air quality monitoring stations (AQMS) were selected from the network’s 16 available sites (see Figure 1). These stations were chosen for their strategic locations near densely industrialized zones and mixed residential areas, which have the highest concentrations of air pollutants. The AQMS has neighborhood-scale spatial representativeness (approximately a 4 km2 radius) and covers residential, commercial, and industrial land uses. Table 1 lists the geolocation of the selected stations. The AQMS-NL05 (Cadereyta) plays an essential role within the monitoring network due to its proximity to a large oil refinery bordering an urban area.
The monitoring network operates 24 h a day, 365 days a year, reporting hourly averages. O3 concentrations are measured using UV photometry (Thermo Environmental Instruments Inc., Model TEI-49C, Waltham, MA, USA), and nitrogen oxides (NO, NO2, and NOx) are measured using chemiluminescence (Thermo Environmental Instruments Inc., Model TEI-42C, Waltham, MA, USA), with accuracies of ±1 ppb (±0.001 ppm) and ±4 ppb (±0.004 ppm), respectively. Both instruments are US EPA-certified reference methods.
The dataset includes the variables O3, NOx, NO, NO2, temperature (T), relative humidity (RH), barometric pressure (BP), wind speed (WS), and wind direction (WD), with 8760 hourly data points per station, covering the period from 22 September 2022 to 21 September 2023. The measurement units for the variables in the dataset are listed in Table A2. The raw data were validated by the government institution using the NOM-156-SEMARNAT-2012 methodology to ensure that they met official quality standards.

2.3. Methodology

This section outlines the stages involved in data preprocessing, model training, hyperparameter optimization, and validation. Datasets were obtained from the Integral Environmental Monitoring System (SIMA), which operates in accordance with Mexican Official Standards NOM-156-SEMARNAT-2012 [32] and NOM-020-SSA1-2021 [33], ensuring strict sampling procedures and using UV photometry for O3 measurements. The proposed method is applied independently to each of the six selected AQMS.

2.3.1. Preprocessing Phase

The preprocessing phase (see Figure 2) begins with data cleaning, the second activity in Quality Control (QC). This step involves reviewing the dataset's attributes for inconsistencies, artifacts, or incorrect measurements; any values identified as problematic are removed from the attributes and replaced with nulls.
Subsequently, to ensure data quality and respect the temporal nature of the time series, a hybrid data imputation strategy based on gap size was implemented. Before imputation, a data completeness analysis revealed that the raw dataset was of high quality, with an average missing record rate of approximately 2.6% across all stations (see Table A1). For the target variable (O3), the percentage of missing records stayed below 5%, ranging from 3.61% to 4.98% across all stations. For short gaps (less than 12 consecutive hours), linear interpolation was used. This approach is appropriate for continuous physical phenomena, such as O3 concentration, where sudden changes are infrequent over brief periods, effectively maintaining short-term temporal continuity. For larger gaps (≥12 h), where linear interpolation would not capture the diurnal cycle, imputation with K-Nearest Neighbors (KNN, k = 5) was used. KNN preserves the data's statistical structure by filling gaps using multivariate similarities (such as meteorological conditions) between historical records, ensuring that imputed values align with expected environmental dynamics.
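The gap-size-dependent strategy above can be sketched as follows. This is an illustrative implementation, not the authors' exact code; the 12-hour threshold and k = 5 follow the description in the text, while the function and column names are ours.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_hybrid(df: pd.DataFrame, target: str = "O3",
                  max_linear_gap: int = 12, k: int = 5) -> pd.DataFrame:
    """Linear interpolation for gaps < 12 h; KNN (k = 5) for longer gaps."""
    out = df.copy()
    s = out[target]
    # Label each consecutive-NaN run and measure its length in hours.
    na = s.isna()
    run_id = (na != na.shift()).cumsum()
    run_len = na.groupby(run_id).transform("sum")
    short_gap = na & (run_len < max_linear_gap)
    # Fill only the short gaps by linear interpolation.
    out.loc[short_gap, target] = s.interpolate("linear", limit_area="inside")[short_gap]
    # Remaining (long) gaps: multivariate KNN imputation over all columns,
    # so imputed values follow similar meteorological conditions.
    filled = KNNImputer(n_neighbors=k).fit_transform(out)
    return pd.DataFrame(filled, columns=out.columns, index=out.index)
```

Separating the two regimes explicitly (rather than partially interpolating long gaps) keeps the diurnal cycle from being flattened by straight-line fills.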
During the feature engineering stage, a series of transformations were applied to expand the original dataset. An important step was to extract and encode cyclical variables, specifically temporal attributes (hour, day of the week, day of the month, and month) and wind direction. Standard ordinal coding is insufficient for these variables because it does not capture their periodicity (e.g., the transition from 23:00 to 00:00, or from 360° to 0°). To address this, we map these features onto a unit circle and compute their sine and cosine components, as defined in Equations (1) and (2). This transformation ensures that the prediction model correctly interprets the proximity between the end and beginning of a cycle, providing an explicit and continuous representation of time as well as wind direction.
x_sin = sin(2π · t / T)    (1)
x_cos = cos(2π · t / T)    (2)
where t is the current value of the cyclic attribute (e.g., the hour) and T is its period (e.g., 24 for the hour of day, 360 for wind direction).
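Equations (1) and (2) can be applied directly; the helper below is a minimal sketch (the function and column names are ours, not from the paper).

```python
import numpy as np
import pandas as pd

def encode_cyclic(series: pd.Series, period: float) -> pd.DataFrame:
    """Project a periodic variable onto the unit circle: Equations (1) and (2)."""
    angle = 2.0 * np.pi * series / period
    return pd.DataFrame({f"{series.name}_sin": np.sin(angle),
                         f"{series.name}_cos": np.cos(angle)})

hours = pd.Series(range(24), name="hour")
enc = encode_cyclic(hours, period=24)
```

With this encoding, 23:00 and 00:00 sit close together in feature space while 12:00 is maximally distant from 00:00, something ordinal coding cannot express.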
The next step is to generate the lag and rolling-window features. Unlike models such as ARIMA or recurrent neural networks (LSTM/GRU), which are mathematically designed to model sequences and their order, the XGBoost algorithm treats each data row as an independent and identically distributed observation. That is, if it receives unordered data, the model will still run, because it does not recognize that row t occurred immediately after row t−1. Therefore, the solution involves creating lag and rolling-window features that capture different temporal dynamics, converting the temporal data into static columns and giving XGBoost explicit memory in each row. This enables it to use past information without internal recursion. The rolling windows are built from the mean, standard deviation (sd), minimum (min), and maximum (max) values of the O3 attribute (target variable) over the last 6, 12, and 24 h. This allows the time series to be processed in small segments (windows) of past observations that capture trends and cycles. The mean value smooths out noise and indicates whether the overall trend for the day is rising or falling, regardless of a momentary peak. The minimum and maximum values capture volatility; for example, if the maximum over the past 24 h was very high, the model can detect that atmospheric conditions are prone to pollution buildup. The standard deviation indicates instability; a sudden change in it can alert the model to a regime shift, such as the beginning of a storm or wind gusts that disperse O3.
Next, a set of lag features for the O3 variable is constructed using the previous 7 h, allowing the model to capture how past observations may contain important information about the target variable's future value. A lag window of this length is chosen because of the photochemical dynamics and statistical autocorrelation of O3. As mentioned earlier, O3 is formed when sunlight reacts with precursors (NOx and VOCs), mainly from specific industries and vehicle traffic. In the MMA, morning rush-hour traffic usually occurs between 6:00 and 9:00 a.m., and the highest O3 levels are generally reached between 1:00 and 4:00 p.m., when solar radiation peaks. The time lag between precursor emission and the O3 peak is approximately 5–7 h. Hence, when the XGBoost model analyzes concentrations from 7 h earlier, it can link the cause (morning precursor emissions during rush-hour traffic) to the effect (the current O3 level). By including seven consecutive concentration readings (from t−1 to t−7), the XGBoost model can implicitly infer the curve's shape, enabling it to differentiate between 10:00 a.m. (rapid rise) and 6:00 p.m. (fast fall), even if the O3 value is the same. In summary, XGBoost uses these lags to anchor its initial prediction.
Strict measures were applied during feature engineering to prevent data leakage (look-ahead bias). All lag features and rolling-window statistics (mean, max, min, std) were designed as strictly backward-looking variables. This means that for any time step t, the features are based only on observations from times t−1, t−2, …, t−n. As a result, no future information influences the current state. Additionally, due to the chronological train-test split, calculating these features maintains the causal order of the time series, ensuring that the training process relies only on past information available at the time of prediction.
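A backward-looking construction of these features might look like the sketch below (illustrative; the column naming is ours). The shift(1) applied before each rolling window guarantees that the current hour never leaks into its own features.

```python
import pandas as pd

def add_temporal_features(df: pd.DataFrame, target: str = "O3",
                          n_lags: int = 7, windows=(6, 12, 24)) -> pd.DataFrame:
    """Lag (t-1 ... t-7) and rolling-window features, strictly backward-looking."""
    out = df.copy()
    # Lag features: the seven previous hourly readings of the target.
    for h in range(1, n_lags + 1):
        out[f"{target}_lag_{h}h"] = out[target].shift(h)
    # Rolling windows computed over past values only (current hour excluded).
    past = out[target].shift(1)
    for w in windows:
        roll = past.rolling(window=w, min_periods=w)
        out[f"{target}_mean_{w}h"] = roll.mean()
        out[f"{target}_sd_{w}h"] = roll.std()
        out[f"{target}_min_{w}h"] = roll.min()
        out[f"{target}_max_{w}h"] = roll.max()
    return out
```

Rows whose history is incomplete (the first 24 h) end up with NaN features and would be dropped before training.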
The feature engineering process expanded the initial dataset to 38 input variables (see Table 2). This final dataset contains the original pollutant and meteorological variables, cyclic transformations of time attributes and wind direction (sine and cosine components), a binary indicator for weekends, 12 rolling window statistics (mean, maximum, minimum, standard deviation), and 7 lag variables.
The final step in preprocessing is feature scaling. To avoid data leakage, the scaling parameters (mean and standard deviation) were calculated using only the training dataset. These parameters were then applied to the testing dataset, ensuring that the test data stayed completely unseen during the scaling process. In our experiment, the StandardScaler technique (standardization) was applied to all input features, including the cyclic variables (sine/cosine). Although cyclic features are naturally bounded within [−1, 1], standardizing them along with the rest of the dataset ensures a uniform feature space (zero mean, unit variance), facilitating consistent interpretation and algorithm convergence. This method effectively aligns the distributions of the meteorological and air pollution variables.
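This leakage-free scaling step corresponds to the standard fit-on-train / transform-on-test pattern, sketched here with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(size=(100, 5))   # synthetic feature matrix
split = int(len(X) * 0.8)                              # chronological 80/20 cut
X_train, X_test = X[:split], X[split:]

scaler = StandardScaler().fit(X_train)   # mean/std estimated on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test set reuses the training statistics
```

Fitting the scaler on the full dataset instead would let test-set statistics leak into training, inflating the apparent performance.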

2.3.2. Dataset Phase

The dataset phase manages the partitioning of records to support two different experimental strategies (as depicted in Figure 2).
  • Experiment 1. The first method assesses the model’s ability to extrapolate to unseen regimes. The dataset was divided into seasonal segments: records for autumn, winter, and spring were used for the training set (approximately 75% of the data), while summer records were reserved exclusively for the testing set.
  • Experiment 2. To evaluate robustness across all climatic conditions, the second method applies a chronological stratified sampling strategy. For each season, the first 80% of consecutive records are assigned to training, with the remaining 20% set aside for testing. This approach ensures that temporal continuity is preserved while capturing the dynamics of all seasons in both stages.
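The two partitioning strategies can be expressed as below. This is a sketch: the summer date boundaries come from the text, while the helper names and the assumed "season" label column are ours.

```python
import pandas as pd

def seasonal_split(df: pd.DataFrame):
    """Experiment 1: train on autumn/winter/spring, test exclusively on summer."""
    summer = (df.index >= "2023-06-22") & (df.index <= "2023-09-21")
    return df[~summer], df[summer]

def chronological_stratified_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Experiment 2: first 80% of each season to train, last 20% to test."""
    train, test = [], []
    for _, block in df.groupby("season", sort=False):
        cut = int(len(block) * train_frac)
        train.append(block.iloc[:cut])   # earliest records of the season
        test.append(block.iloc[cut:])    # latest records of the season
    return pd.concat(train), pd.concat(test)
```

In both helpers the test records always come after the training records within each stratum, so temporal continuity is preserved.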

2.3.3. Training Phase

The training phase focuses on identifying the optimal model setup. In this study, we implemented the Random Search method with 400 iterations to explore the hyperparameter space. This method was chosen for its ability to find near-optimal configurations at much less computational cost than an exhaustive grid search. During optimization, each candidate configuration was assessed using five-fold time-series cross-validation. Unlike traditional k-fold validation, this approach preserves the chronological order of observations, ensuring that hyperparameters are selected based on their ability to predict future data rather than interpolate random points. The algorithm selected the configuration achieving the highest average R2 across folds.
The XGBoost hyperparameters were tuned within the following search space: n_estimators ∈ {200, 400, 600, 800, 1000}; max_depth ∈ {3, 5, 7, 9}; and learning_rate ∈ {0.01, 0.05, 0.1}. Regarding stochasticity and regularization, the grid included: subsample and colsample_bytree ∈ {0.6, 0.8, 1.0}; min_child_weight ∈ {1, 3, 5}; gamma ∈ {0, 0.1, 0.3}; reg_alpha ∈ {0, 0.1, 1.0}; and reg_lambda ∈ {1, 5, 10}. Finally, the model was retrained on the entire training dataset using the optimal hyperparameters identified by the optimizer.
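The search space and validation scheme map naturally onto scikit-learn utilities. The sketch below samples the 400 candidate configurations and builds the five chronological folds; an XGBRegressor would then be fitted and scored on each candidate, which we omit here for brevity.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler, TimeSeriesSplit

# Hyperparameter space reported in the text
param_space = {
    "n_estimators": [200, 400, 600, 800, 1000],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 3, 5],
    "gamma": [0, 0.1, 0.3],
    "reg_alpha": [0, 0.1, 1.0],
    "reg_lambda": [1, 5, 10],
}
# Random search: 400 candidate configurations
candidates = list(ParameterSampler(param_space, n_iter=400, random_state=42))

# Five-fold time-series CV: every fold validates strictly on future data
tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(np.arange(240).reshape(-1, 1)))
```

Because TimeSeriesSplit never places validation indices before training indices, the selected hyperparameters reflect true forecasting ability rather than interpolation.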

2.3.4. Testing Phase

During testing, the optimized XGBoost model is deployed to predict O3 concentrations on unseen test datasets. This phase includes two evaluation scopes:
  • Self-Prediction: Evaluating the performance of the predictive model using test data from the source monitoring station.
  • Spatial Generalization: Applying the predictive model to test datasets from the other five monitoring stations to assess transferability.
Model accuracy and performance were evaluated using four standard metrics: R2, Mean Squared Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). These metrics were selected to align with the reporting framework of recent state-of-the-art O3 forecasting studies (e.g., Chang et al. [34], Dai et al. [19], Wei et al. [35]), thereby enabling direct and consistent benchmarking. Although alternative indices, such as the Kling–Gupta Efficiency (KGE) [36], provide valuable insights into bias decomposition in hydrological modeling, R2 and RMSE remain the main standards in atmospheric science for assessing explained variance and error magnitude in hourly time-series data.
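The four metrics reduce to a few lines with scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred) -> dict:
    """R2, MSE, RMSE (square root of MSE), and MAE for an hourly O3 test series."""
    mse = mean_squared_error(y_true, y_pred)
    return {"R2": r2_score(y_true, y_pred),
            "MSE": mse,
            "RMSE": float(np.sqrt(mse)),
            "MAE": mean_absolute_error(y_true, y_pred)}
```

Reporting both RMSE and MAE is useful here: their gap widens when errors are dominated by a few large peaks, which is exactly the "peak dampening" behavior discussed in the results.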
The experiments were implemented using Python 3.10 along with the libraries Scikit-learn, XGBoost, pandas, numpy, json, joblib, matplotlib, seaborn, and optuna. The random seed was set to 42 to ensure reproducibility. The optimization process (400 iterations per station) and model training were executed on a workstation equipped with an Intel Core i9-14900K CPU, a 2 TB M.2 PCIe SSD, 128 GB of DDR5 RAM, and an NVIDIA GeForce RTX 4090 24 GB GPU. The average runtime for the entire pipeline per station (using random search) was approximately 56.09 min in Experiment 1 and 46.56 min in Experiment 2.

3. Results

3.1. Experiment 1

Experiment 1 assesses the model's ability to extrapolate seasonally. The training data includes records from 22 September 2022 to 21 June 2023, covering fall, winter, and spring. To thoroughly evaluate the model's adaptability, the testing phase uses only data from the summer season (22 June to 21 September 2023), which typically exhibits the highest O3 variability. Table 3 summarizes the evaluation metrics for each monitoring station. The results show strong robustness despite seasonal changes, with five of the six models reaching R2 values above 0.90. The AQMS-NL06 station performs best, achieving an R2 of 0.96 and an RMSE of 0.0034 ppm; the predicted concentrations lie very close to the regression line, indicating an excellent fit even on unseen summer data. Notably, even the station with the lowest metrics, AQMS-NL03, maintained strong predictive ability with an R2 of 0.83. Regarding error magnitude, the MAE provides key insight into the model's typical accuracy, as it is less affected by sporadic outliers than the RMSE. The very low MAE values across all stations, ranging from 0.0023 ppm (AQMS-NL06) to 0.0044 ppm (AQMS-NL03), confirm that for most concentration levels the prediction error is very small. Specifically, stations AQMS-NL01, AQMS-NL02, AQMS-NL04, and AQMS-NL05 demonstrated very consistent precision levels (0.0027 to 0.0030 ppm), further confirming the stability of the proposed XGBoost approach.
In the second phase of Experiment 1, the prediction model's ability to generalize spatially is assessed. During this phase, models trained at a specific monitoring station are tested on datasets from the other five stations to evaluate their transferability across different urban areas. Table 4 shows the results for the two best-generalizing models: AQMS-NL02 and AQMS-NL05. The AQMS-NL02 model demonstrated strong robustness, successfully applying learned patterns to predict O3 levels at stations AQMS-NL01, AQMS-NL04, and AQMS-NL06, with R2 values of 0.94, 0.93, and 0.95, respectively. The low MSE values (consistently 0.000019 ppm²) confirm the high accuracy of these predictions, indicating very close matches to the observed data.
Similarly, the AQMS-NL05 model showed strong generalization ability. However, a noticeable drop in performance was seen when predicting for station AQMS-NL03 (R2 = 0.81), a trend that occurred with other models as well. This lower performance mainly stems from the station's siting: a mixed industrial and residential area at the neighborhood level. Unlike other stations, AQMS-NL03 experiences high-frequency variability because of its proximity to local industrial emissions and residential traffic. Additionally, the area is affected by pollution plumes carried by wind from the southeast, east, and northeast. Moreover, local conditions during the summer testing period were worsened by frequent fires and very low relative humidity, leading to irregular spikes in O3 levels that are hard for a model trained elsewhere to predict. These factors create a highly complex and volatile O3 profile, making it difficult for a model trained in a different urban setting to generalize effectively.
Figure 3 illustrates a time-series comparison between the observed O3 concentrations and those predicted by the AQMS-NL03 model. As presented in Table 3, this model exhibited the lowest performance metrics among the stations studied, but its analysis provides valuable insights into the algorithm's behavior during critical events. A visual review of the six plots shows a consistent tendency to underpredict extreme concentrations. When the series exhibits constant peaks or sudden fluctuations exceeding 0.06 ppm, the model often fails to reach the full amplitude observed in the data. This "peak dampening" effect is particularly evident in the AQMS-NL02 plot (top center) between 1 July and 15 July 2023. Although this limits the model's ability to capture absolute maxima, it still shows a strong correlation, with an R2 of 0.88.
On the other hand, the model often overestimates values in the lower concentration range (0.005 to 0.02 ppm). This behavior is evident in the AQMS-NL05 plot (bottom center of Figure 3), where the predicted baseline often lies slightly above the actual measurements, yet the model still attains an R2 of 0.88. Notably, the model performs best within the intermediate range (0.02 to 0.06 ppm), where the predicted trend lines closely match the observed temporal dynamics. This stability is evident in the plots for AQMS-NL04 (bottom left) and AQMS-NL06 (bottom right), which show R2 values of 0.92 and 0.93, respectively, confirming that the model effectively captures the main diurnal cycles despite challenges at the extremes.
Figure 4 presents the generalization results of the AQMS-NL06 prediction model, which initially demonstrated the highest performance on its own testing dataset. When applied to external stations, the model maintained high predictive capacity. Specifically, on the AQMS-NL01 dataset (top left plot), the model achieved a robust R2 of 0.93. A notable improvement here is the reduction of the overestimation bias observed in previous models. However, there is still a tendency to underestimate O3 concentrations when values exceed 0.06 ppm. Conversely, the lowest generalization performance (R2 = 0.84) was observed at station AQMS-NL03. As with previous experiments, the model struggles with the complex dynamics of this industrial area, consistently underestimating concentrations above 0.05 ppm. Regarding the AQMS-NL05 station (bottom-middle plot), the model shows dual behavior depending on concentration levels. In the mid-range (0.01 to 0.05 ppm), the model effectively reduces underestimation, resulting in a closer fit. However, at very low concentrations (0.005 to 0.01 ppm), a clear overestimation occurs, with the predicted baseline exceeding the actual measurements.
In this way, to test the hypothesis that environmental complexity hinders generalization at station AQMS-NL03, we measured the domain shift between this station and the higher-performing one (AQMS-NL06). Figure 5 compares the probability density functions of hourly ozone concentrations from both sites. While AQMS-NL06 (blue curve) shows a sharp, peaked distribution, typical of stable urban background sites, AQMS-NL03 (red curve) displays a flat, broad distribution, indicating greater variance and a higher frequency of diverse concentration regimes. To confirm this difference statistically, a two-sample Kolmogorov–Smirnov (KS) test was applied. The resulting KS statistic is 0.120 (p < 0.001), which indicates a statistically significant difference between the underlying data distributions, highlighting the challenge of spatial generalization. This significant domain shift explains the decrease in predictive performance when transferring a model trained at NL06 to NL03. The model learns the probability distribution of the reference station (AQMS-NL06) and does not account for the greater dispersion and shifted density of the target industrial area (AQMS-NL03), thereby validating the need for localized training in highly heterogeneous areas.
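The two-sample KS comparison is a one-liner with SciPy. The arrays below are synthetic stand-ins for the station data (a sharp "urban background" distribution versus a broader "industrial" one); only the test mechanics, not the 0.120 statistic, are reproduced here.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two stations' hourly O3 series (ppm)
o3_nl06 = rng.normal(0.030, 0.008, 5000).clip(min=0)   # sharp, peaked (NL06)
o3_nl03 = rng.normal(0.032, 0.015, 5000).clip(min=0)   # broad, dispersed (NL03)

# Two-sample Kolmogorov-Smirnov test: max distance between empirical CDFs
stat, p_value = ks_2samp(o3_nl06, o3_nl03)
```

A small p-value rejects the hypothesis that both samples come from the same distribution, which is exactly the domain-shift evidence used to explain the transfer penalty at AQMS-NL03.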
Additionally, the reference algorithms Ridge Regression, Lasso Regression, and Random Forest were implemented under similar experimental conditions (see Table 5) to compare their performance with the XGBoost approach. The linear models (Ridge and Lasso) showed limited ability to capture the complex behavior of O3, with R2 values between 0.70 and 0.80. The Random Forest model proved to be the most competitive baseline, with error rates closer to those of XGBoost. For example, at station AQMS-NL06, Random Forest achieved an R2 of 0.88 and an RMSE of 0.0054 ppm. In comparison, the proposed XGBoost model (see Table 3) outperformed this baseline with an R2 of 0.96 and an RMSE of 0.0034 ppm, representing about a 37% improvement in accuracy. It is noteworthy that at station AQMS-NL03, the performance difference was significant: the Random Forest model achieved an R2 of 0.67, whereas XGBoost remained at 0.83. Regarding error metrics, Random Forest achieved an RMSE of 0.0102 ppm, whereas XGBoost yielded an RMSE of 0.0065 ppm. These differences in RMSE indicate that, in addition to correlation, the proposed model yields significantly lower prediction uncertainty in the actual concentration units.

Model Interpretability and Physical Drivers

To address the model’s black-box nature and ensure that its predictions are based on atmospheric physics rather than statistical artifacts, an analysis was performed using the SHAP method. Figure 6 displays the analysis of station AQMS-NL03, chosen for its complex industrial dynamics. Figure 7 shows the global feature importance, indicating that the 6-h rolling mean of ozone (O3_mean_6h) is the most influential factor in the model. This confirms that the short-term historical trend is the strongest predictor of the immediate future, reflecting the physical property of atmospheric persistence (inertia), where accumulated concentrations change gradually rather than suddenly. After the O3_mean_6h variable, the autoregressive terms (O3_Lag_2h, O3_Lag_4h, O3_Lag_3h) act as fine-tuning correctors.
The SHAP summary plot (see Figure 6) provides key insights into how the model reacts to environmental variables. A clear inverse relationship is evident for NOx. High NOx values (indicated by red dots) correspond to negative SHAP values, which decrease the predicted O 3 concentration. This aligns with the chemical process known as ozone titration ( NO + O 3 NO 2 + O 2 ), common in industrial and traffic-heavy areas like AQMS-NL03, where fresh emissions tend to lower local ozone levels. Moreover, RH shows a negative correlation; high humidity (red) reduces the O 3 prediction. This is consistent with meteorological observations that high humidity often accompanies cloud cover or precipitation, thereby reducing the solar radiation required for photochemical production. Furthermore, the hour_cos feature clearly distinguishes between day and night periods, thereby adjusting the baseline prediction to account for the diurnal cycle. In summary, the SHAP analysis shows that the XGBoost model has effectively learned not only the statistical autocorrelation of the time series but also the underlying chemical and meteorological forcing mechanisms that influence O 3 dynamics.
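In practice this kind of attribution is produced with the `shap` package (typically `shap.TreeExplainer` for tree ensembles). The dependency-light sketch below uses permutation importance as a proxy to show how a driver ranking of this kind is obtained; the feature names and the synthetic relationships (a dominant persistence lag, negative NOx and RH effects) are assumptions chosen to mimic the behavior described above, not the study's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 3000
# Hypothetical drivers: a 2-h ozone lag (persistence), NOx (titration,
# negative effect), and relative humidity (negative effect).
o3_lag2 = rng.gamma(5.0, 0.005, n)
nox = rng.gamma(2.0, 0.01, n)
rh = rng.uniform(20, 95, n)
o3 = 0.8 * o3_lag2 - 0.3 * nox - 0.0001 * rh + 0.002 * rng.normal(size=n)

X = np.column_stack([o3_lag2, nox, rh])
names = ["O3_lag_2h", "NOx", "RH"]

# Fit a boosted-tree surrogate, then rank drivers by the drop in score
# observed when each feature is shuffled (permutation importance).
model = GradientBoostingRegressor(random_state=1).fit(X, o3)
result = permutation_importance(model, X, o3, n_repeats=5, random_state=1)
ranking = sorted(zip(names, result.importances_mean), key=lambda t: -t[1])
for name, imp in ranking:
    print(f"{name:10s} importance={imp:.4f}")
```

Unlike permutation importance, SHAP additionally resolves the sign of each effect per observation, which is what allows the titration and humidity relationships above to be read directly off the summary plot.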
To contextualize the physicochemical challenges observed at the industrial station, we performed the same SHAP analysis at station AQMS-NL06 under identical experimental conditions (Experiment 1). Figure 8 presents the SHAP summary plot. The comparison highlights distinct predictive drivers:
  • The AQMS-NL06 model identified historical ozone statistics as the main features. The 6-h rolling mean (O3_mean_6h) and maximum (O3_max_6h) are the most important, followed closely by the autoregressive lags. This indicates a highly stable atmospheric regime where short-term persistence is a reliable predictor across seasons.
  • In sharp contrast to the industrial station (AQMS-NL03), where precursor gases played a visible role, in AQMS-NL06, the chemical variables (NOx, NO 2 ) rank below the top 12. This confirms that Santa Catarina (AQMS-NL06) behaves as a stable urban background site, where ozone accumulation is driven by regional transport and atmospheric stability rather than immediate local emission spikes (titration).
This contrast confirms that the generalization gap observed in Experiment 1 is due to environmental complexity. The model effectively extrapolated the consistent inertia patterns of AQMS-NL06 for the summer season but struggled to predict the volatile, emission-driven chemistry of AQMS-NL03 without dedicated summer training data.

3.2. Experiment 2

For the second experiment, the dataset was partitioned using a season-based, chronological stratified sampling strategy. This partition ensures that the distinct ozone behaviors of each season are represented in both the training and test sets without violating the temporal order of the data. Table 6 presents the performance indicators for the prediction models generated for each monitoring station. The six models showed strong performance, with an overall average coefficient of determination ( R 2 ) of 0.913, a significant improvement over the results of Experiment 1, confirming the effectiveness of the chronological stratified 80/20 splitting strategy.
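One way to implement such a season-based chronological 80/20 split is sketched below; the month-to-meteorological-season mapping, the dates, and the synthetic series are assumptions about the implementation, not the authors' code.

```python
import numpy as np
import pandas as pd

# Hourly index covering one annual cycle (dates follow the study period).
idx = pd.date_range("2022-09-22", "2023-09-21 23:00", freq="h")
df = pd.DataFrame(
    {"O3": np.random.default_rng(2).gamma(5.0, 0.005, len(idx))}, index=idx
)

# Meteorological season label for each timestamp (Dec-Feb = winter, etc.).
season = ((df.index.month % 12) // 3).map(
    {0: "winter", 1: "spring", 2: "summer", 3: "autumn"}
)

# Chronological 80/20 split *within* each season: the final 20% of each
# season's hours (in chronological order) forms the test set, so every
# season is represented in both phases without shuffling the time series.
train_parts, test_parts = [], []
for _, block in df.groupby(season.values):
    cut = int(len(block) * 0.8)
    train_parts.append(block.iloc[:cut])
    test_parts.append(block.iloc[cut:])
train = pd.concat(train_parts).sort_index()
test = pd.concat(test_parts).sort_index()
print(len(train), len(test))
```

The key design choice is that the cut is made inside each seasonal block rather than once over the whole year, which is what distinguishes this stratified scheme from the seasonal-extrapolation setup of Experiment 1.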
The results show excellent performance across all stations, with R 2 values consistently between 0.88 and 0.95 (see Table 6). Specifically, station AQMS-NL06 exhibited the strongest correlation, with an R 2 of 0.95. Meanwhile, station AQMS-NL05 recorded the lowest prediction error, with an RMSE of 0.0037 ppm and an MAE of 0.0024 ppm, providing the most accurate estimates compared to observed data. Additionally, a significant improvement was observed at station AQMS-NL03, which previously performed poorly in seasonal extrapolation experiments, reaching an R 2 of 0.90 and an RMSE of 0.0048 ppm. This suggests that training the model on the full range of seasonal variability enables it to better capture the complex local dynamics that were previously difficult to model. Overall, the small gap between RMSE and MAE across all stations indicates that the model is stable and reliable, minimizing substantial errors even during peak concentration periods. Therefore, including a representative sample from each season (autumn, winter, spring, and summer) in the training set enabled the XGBoost model to reach its highest predictive accuracy in this study.
To assess the generalization ability of the proposed approach, Figure 9 shows a matrix of scatter plots for the best-performing model (AQMS-NL06) on the test datasets from all six monitoring stations. In each subplot, the horizontal axis (x) represents the observed O 3 concentrations (ground truth), and the vertical axis (y) displays the values predicted by the model. The red dashed diagonal line marks the 1:1 identity line ( y = x ). Points on this line indicate perfect agreement, while points above or below indicate overestimation or underestimation, respectively. Therefore, the subplot for the AQMS-NL06 station in Figure 9 displays the result reported in Table 6 ( R 2 = 0.95 ), corresponding to the unrounded value ( 0.948 ) shown in the plot. This confirms that the table metric reflects the model tested on its own data (local baseline). When examining the generalization results for the AQMS-NL06 plot, a strong linear trend is evident. However, a slight overestimation bias is observed in the lower-concentration range (observed values between 0.005–0.01 ppm) and extends into the mid-range (predicted values of 0.02–0.03 ppm), where data points cluster slightly above the identity line. Conversely, when the AQMS-NL06 model is applied to stations AQMS-NL03 and AQMS-NL04, it exhibits lower performance metrics. While maintaining a positive linear trend, these plots show greater dispersion and a systematic underestimation of high O 3 concentrations. Specifically, the AQMS-NL03 plot (top right) highlights underestimation for values above 0.07 on the X-axis (observed) and 0.05 on the Y-axis (predicted). Likewise, AQMS-NL04 (bottom left) indicates underestimation at the point where values exceed 0.06 on the X-axis and 0.04 on the Y-axis. Additionally, certain outliers, such as coordinates ( 0.014 , 0.048 ) for AQMS-NL03 and ( 0.028 , 0.055 ) for AQMS-NL04, negatively impact the fit by increasing the residual distance.
Table 7 shows the results of a comprehensive 10-fold cross-validation with a time series split. While the results in Table 6 show the model’s theoretical maximum performance when the training data fully captures all seasons, the metrics in Table 7 demonstrate the model’s practical robustness under strict temporal constraints. In this setup, the model is repeatedly tested on future time windows without prior exposure to the immediate trend, thereby simulating a continuous forecasting operation. Consequently, the metrics are slightly more conservative but highly consistent. The XGBoost model achieved average R 2 values ranging from 0.79 to 0.86 across the six stations. Station AQMS-NL01 demonstrated the highest stability with an R 2 of 0.86 and an RMSE of 0.0063 ppm. Even for the most complex station, AQMS-NL03, the model maintained an R 2 of 0.71. It is important to note that the RMSE values (e.g., 0.0063 ppm) are consistently higher than the MAE values (e.g., 0.0041 ppm). As discussed, this discrepancy is expected in cross-validation of environmental data, where the model effectively captures the general trend (low MAE) but struggles to predict the exact magnitudes of sporadic extreme peaks (higher RMSE). In conclusion, these results confirm that the proposed model is not only capable of learning complex patterns (as shown in Table 6) but is also mathematically robust against temporal shifts.
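The rolling-origin evaluation described above can be sketched with scikit-learn's `TimeSeriesSplit`, which always trains on the past and tests on the subsequent window. The synthetic diurnal series, the three features, and `GradientBoostingRegressor` (a stand-in for XGBoost) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(3)
n = 8760  # one year of hourly records
t = np.arange(n)
# Synthetic ozone with a diurnal cycle plus noise (illustrative only).
o3 = 0.03 + 0.02 * np.sin(2 * np.pi * t / 24) + 0.003 * rng.normal(size=n)

# Features: cyclic hour encoding plus a 1-h autoregressive lag.
lag1 = np.r_[o3[:1], o3[:-1]]
X = np.column_stack([np.sin(2 * np.pi * t / 24),
                     np.cos(2 * np.pi * t / 24),
                     lag1])
y = o3

# 10-fold time-series split: each fold tests on a future window never
# seen during training, preventing leakage of future information.
tscv = TimeSeriesSplit(n_splits=10)
r2s, rmses = [], []
for tr, te in tscv.split(X):
    model = GradientBoostingRegressor(random_state=0).fit(X[tr], y[tr])
    pred = model.predict(X[te])
    r2s.append(r2_score(y[te], pred))
    rmses.append(mean_squared_error(y[te], pred) ** 0.5)
print(f"mean R2: {np.mean(r2s):.3f}, "
      f"RMSE range: {min(rmses):.4f}-{max(rmses):.4f} ppm")
```

Because early folds train on only a fraction of the year, per-fold scores naturally vary, which is exactly the fold-to-fold volatility reported for Figure 10.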
To visualize the model’s stability over time, Figure 10 shows scatter plots for each of the 10 validation folds at station AQMS-NL01 using the time-series split strategy. In each subplot, the red dashed line represents the 1:1 line of perfect agreement. This breakdown demonstrates the model’s ability to adapt to changing meteorological conditions. While most folds exhibit strong linearity, with R 2 values exceeding 0.85 (peaking at 0.92 in Fold 4), the method effectively reveals periods of increased volatility. For example, Fold 2 exhibits a temporary decline in performance ( R 2 = 0.673 , RMSE = 0.0075 ppm), likely due to a seasonal transition characterized by irregular ozone spikes. Nevertheless, the model demonstrates adaptability, rapidly recovering performance in subsequent folds (e.g., Fold 3: R 2 = 0.829 ). This fluctuation confirms that the validation strategy is thorough and “time-aware,” providing a transparent view of operational risks rather than a masked average.
Table 8 summarizes the performance of the benchmark algorithms under the same cross-validation scheme. The linear models (Ridge and Lasso) performed poorly, with R 2 values stabilizing below 0.79 at most stations, failing to capture the non-linear variability of O 3 over a full year. XGBoost consistently outperformed Random Forest across all key metrics. For example, at station AQMS-NL06, XGBoost achieved an R 2 of 0.83 compared to 0.80 for Random Forest. Furthermore, in terms of prediction error, XGBoost achieved the lowest RMSE across all six stations. At station AQMS-NL04, XGBoost achieved an RMSE of 0.0060 ppm, marginally outperforming Random Forest’s 0.0061 ppm. These results confirm that, even under strict cross-validation spanning all seasonal variations, XGBoost offers the most accurate and dependable predictions for operational forecasting.

3.3. Residual Diagnostics and Calibration Analysis

To assess the reliability of the model’s point estimates and verify calibration, a residual diagnostic analysis was performed. Figure 11a illustrates the residual plot for station AQMS-NL04, corresponding to Experiment 1. The analysis reveals three key characteristics of the XGBoost model:
  • The mean residual is approximately zero (−0.0014 ppm), indicating that the model predictions are centered on the observed values without systematic over- or under-estimation bias (the zero-line runs through the middle of the distribution).
  • For most of the operational range (0.01 to 0.05 ppm), the residuals show a consistent dispersion density, appearing as a uniform band rather than a diverging funnel. This indicates that the model maintains stable precision across typical ozone concentrations.
  • When predicted values exceed 0.06 ppm, an increase in residual variance is observed, with scattered points reaching beyond the ± 0.02 ppm range. This pattern is consistent with the stochastic nature of extreme pollution events. Nevertheless, the vast majority of residuals remain bounded within the ± 2 σ interval (standard deviation = 0.0039 ppm), confirming that the model’s error distribution is statistically stable for operational forecasting.
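The residual statistics reported above (mean residual, dispersion, ±2σ coverage) can be computed as follows. The simulated observations are illustrative; only the error magnitude (σ = 0.0039 ppm) is taken from the figure discussion.

```python
import numpy as np

rng = np.random.default_rng(4)
observed = rng.gamma(5.0, 0.006, 2000)              # synthetic O3 (ppm)
predicted = observed + rng.normal(0, 0.0039, 2000)  # well-calibrated predictions

# Residual diagnostics: centering, spread, and 2-sigma coverage.
residuals = observed - predicted
mean_res = residuals.mean()
sigma = residuals.std()
within_2sigma = np.mean(np.abs(residuals - mean_res) < 2 * sigma)

print(f"mean residual = {mean_res:+.5f} ppm, sigma = {sigma:.4f} ppm")
print(f"fraction within ±2σ: {within_2sigma:.3f}")
```

For approximately Gaussian residuals the ±2σ band should cover about 95% of errors; a mean residual indistinguishable from zero is the calibration criterion used in the text.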
To verify that the model effectively captured the temporal dynamics of the ozone series without leaving residual autocorrelation, the prediction errors were analyzed in chronological order (see Figure 11b). The plot shows consistent, stationary behavior around the zero line for most of the testing period. The absence of discernible cyclic patterns or long-term trends in the residuals confirms that the feature engineering strategy (specifically, the cyclic and lag variables) successfully captured seasonality and diurnal cycles. A period of increased variance occurs around index 1750, likely associated with a weather event characterized by high volatility. Nevertheless, the model proved resilient, quickly restoring its error margins immediately after this event. Additionally, the random fluctuation of residuals around the mean indicates no significant serial autocorrelation, confirming the effectiveness of the autoregressive input features.
The analysis in Experiment 2 confirms that the XGBoost model acts as a highly calibrated, unbiased estimator. The mean residual is −0.0002 ppm, which is statistically indistinguishable from zero (see Figure 12a). This negligible bias indicates that the model does not systematically overestimate or underestimate ozone levels over the long term. Geometrically, the residuals are symmetrically distributed around the zero line, suggesting that the algorithm has successfully captured the deterministic component of the time series, leaving only random, irreducible noise. Furthermore, the residuals show a consistent dispersion density across the nominal ozone range (0.01–0.05 ppm), as shown in Figure 12a. Although characteristic heteroscedasticity appears at extreme concentrations (>0.06 ppm), most errors stay within the ± 2 σ confidence interval. Moreover, the chronological analysis (see Figure 12b) demonstrates the temporal stability of the model errors. The residuals fluctuate randomly around zero without exhibiting long-term trends or seasonality patterns. Notably, despite isolated high-volatility events (e.g., around observation index 1250), the model demonstrates rapid recovery, restabilizing its error margins immediately. This absence of serial autocorrelation confirms that the engineered lag features effectively captured the temporal dependencies of the time series.

4. Discussion

4.1. Robustness and Error Diagnostics

A primary concern in feature engineering is the risk of overfitting, in which the model memorizes training patterns rather than generalizing. However, the results of Experiment 1 strongly indicate otherwise. By training on autumn, winter, and spring data and testing exclusively on the summer season, the model was forced to extrapolate to a regime with different temperature and radiation profiles. The high performance achieved ( R 2 up to 0.96) confirms that the selected features (lags and rolling windows) capture intrinsic physical relationships of ozone formation that remain valid across seasonal shifts, rather than fitting statistical noise.
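A sketch of the lag, rolling-window, and cyclic features discussed throughout the paper, built with pandas. The feature names mirror those reported in the results (e.g., O3_mean_6h, hour_cos), while the exact windows and the one-step shift that keeps the current value out of its own rolling statistics are assumptions about the implementation.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-09-22", periods=168, freq="h")  # one illustrative week
df = pd.DataFrame(
    {"O3": np.random.default_rng(5).gamma(5.0, 0.005, len(idx))}, index=idx
)

# Lag features: past values as autoregressive predictors.
for h in (1, 2, 3, 4):
    df[f"O3_lag_{h}h"] = df["O3"].shift(h)

# Rolling-window statistics over the preceding 6 h, shifted by one step
# so the current value is never included (no target leakage).
df["O3_mean_6h"] = df["O3"].shift(1).rolling(6).mean()
df["O3_max_6h"] = df["O3"].shift(1).rolling(6).max()

# Cyclic encoding of the hour of day to capture the diurnal cycle.
df["hour_sin"] = np.sin(2 * np.pi * df.index.hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * df.index.hour / 24)

df = df.dropna()  # drop the warm-up rows introduced by lags/rolling windows
print(df.columns.tolist())
```

Because every engineered column is a deterministic function of past observations and the clock, features of this kind encode physical persistence rather than future information, which is consistent with the extrapolation result discussed above.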
Regarding the lower performance at station AQMS-NL03, error diagnostics from the time-series analysis (see Figure 3) reveal a clear pattern: the model correlates well at low to medium levels but systematically underestimates extreme peaks (>0.06 ppm). This specific error pattern supports the attribution of performance loss to local industrial complexity. Unlike meteorological trends, which are regional and predictable, local industrial emissions create sudden, high-intensity spikes (outliers). The model, optimized to capture the general trend, conservatively predicts these anomalies, resulting in the observed ‘peak dampening’ effect. This suggests that while the model generalizes well spatially for urban background pollution, it requires localized training to accurately capture point-source industrial events.

4.2. Study Limitations and Future Research Directions

The proposed XGBoost model shows strong predictive performance, but several limitations of this study need to be acknowledged to provide context for the findings and inform future research.
First, regarding input variables, the study relied exclusively on data available from the public monitoring network. As a result, Volatile Organic Compounds (VOCs), which are key precursors of ozone formation, were excluded because continuous, real-time measurements were unavailable in the study area. Although the model achieved high accuracy ( R 2 > 0.90 ) using nitrogen oxides and meteorological variables as proxies, including VOC data could improve predictions, especially in complex industrial zones such as AQMS-NL03.
Second, for benchmarking, this study compared the proposed approach against standard machine learning algorithms (Ridge, Lasso, and Random Forest). While sufficient for establishing a strong baseline, the study did not include comparisons with Deep Learning architectures, such as Long Short-Term Memory (LSTM) networks or Transformers, which currently represent the state-of-the-art in time-series forecasting. Future work will focus on benchmarking XGBoost against these deep learning methods to assess the trade-off between computational cost and marginal gains in accuracy.
Finally, the dataset used in this study covers a single complete annual cycle (September 2022 to September 2023). While this period is sufficient to validate the model’s ability to detect intra-annual seasonal patterns (e.g., the shift from spring to summer), it does not include inter-annual variability or long-term climate events such as El Niño/La Niña oscillations. Therefore, the current model parameters might not perform as well in years with significantly different weather patterns. For operational use, we recommend a continuous learning approach in which the model is periodically retrained on new data to adapt to long-term trends and mitigate concept drift.

4.3. Benchmarking with Global Ozone Forecasting Studies

To contextualize the predictive performance of our framework, we benchmarked our results against recent notable studies that utilize XGBoost-based architectures for hourly O 3 forecasting (see Table 9).
Our approach achieves accuracy levels similar to complex hybrid ensembles. For example, Chang et al. [34] reported an R 2 of 0.94 using a sophisticated W-BiLSTM(PSO)-GRU + XGBoost architecture trained on five years of data. Similarly, Dai et al. [19] achieved an R 2 of 0.95 with a VAR-XGBoost model using a six-year dataset. Notably, our Experiment 1 (Seasonal Extrapolation) and Experiment 2 (CV) achieved comparable peak R 2 values (0.96 and 0.95, respectively) using a standard XGBoost architecture trained on only one year of data.
Compared to standard XGBoost implementations, our feature-engineered model exhibits better performance. Liu et al. [18] and Wei et al. [35] reported R 2 values of 0.87, while Juarez et al. [23] achieved 0.61 using a 6-year dataset.
As summarized in Table 9, the proposed model often outperforms state-of-the-art benchmarks despite using fewer input variables (excluding parameters like cloud cover or dew point) and a much smaller historical dataset (8760 hourly records versus multi-year archives). This confirms that the feature engineering strategy applied, specifically the cyclic, rolling, and lag features, effectively overcomes data scarcity, enabling a computationally efficient model to compete with data-intensive deep learning architectures.

4.4. Physical Interpretation and Operational Applicability

The model’s predictive ability primarily depends on atmospheric physical inertia. As evidenced by the high importance of lag features and rolling means in the SHAP analysis, the algorithm effectively captures the persistence of air masses in stable urban environments. This explains its high performance in residential areas (e.g., AQMS-NL06), where O 3 formation follows a repetitive diurnal photochemical cycle driven by solar radiation and temperature.
The operational applicability of this framework depends on the assumption of quasi-stationarity in emission sources; that is, the model assumes that historical periodic patterns (daily/weekly) will persist in the near future. Consequently, the approach is:
  • Highly applicable in urban backgrounds and commercial zones, acting as a reliable “virtual sensor” for gap-filling and public health alerts.
  • Limited in industrial corridors (e.g., AQMS-NL03), where pollutant behavior is dominated by stochastic, high-intensity point-source emissions (e.g., sudden NOx plumes causing titration). Since these events are neither strictly periodic nor purely inertial, the model, without real-time precursor data, reaches its validity limit. Therefore, for industrial applications, we recommend combining this inertial model with real-time emissions monitoring to more effectively detect nonstationary spikes.

5. Conclusions

This study addressed the scarcity of high-performance air quality forecasting tools in the Monterrey Metropolitan Area (MMA) by developing and thoroughly validating an XGBoost-based framework for hourly ozone ( O 3 ) prediction.
The principal contribution of this work is the shift from standard model fitting to a rigorous operational stress-testing framework. Unlike studies that report global metrics on randomized splits, this research systematically measures the limits of machine learning through seasonal extrapolation and spatial transferability tests. Theoretically, combining SHAP interpretability with Domain Shift analysis (Kolmogorov–Smirnov test) provides a clear link between atmospheric inertia and model generalization. We show that the algorithm’s success is not universal but is limited by the physicochemical topology of the site: models effectively reconstruct inertial patterns in stable accumulation zones but face a “validity cliff” in industrial corridors driven by stochastic chemical kinetics (NOx titration).
The operational validation confirmed the robustness of the proposed feature engineering strategy. In Experiment 1, models achieved high accuracy even when extrapolating to the unseen summer season ( R 2 ≥ 0.90 for residential stations). In Experiment 2, the 10-fold time-series cross-validation verified stability (RMSE 0.0034–0.0075 ppm) without data leakage, confirming the method’s reliability for continuous forecasting.
The findings offer immediate utility for environmental management in the MMA: (1) The high fidelity of models in residential zones supports their use as “virtual sensors” to fill data gaps during instrument failures, ensuring continuous public health monitoring; (2) the spatial generalization analysis offers a strategic plan for infrastructure investment. Since machine learning models struggle in volatile industrial zones (e.g., AQMS-NL03), authorities should focus on installing physical reference-grade sensors in these high-complexity areas, while using cost-effective modeling for the wider urban background.
This research is limited by the absence of VOC precursor data and a one-year timeframe, which prevents analysis of year-to-year climate variability (e.g., El Niño effects). Future research will focus on benchmarking Deep Learning architectures (e.g., LSTM) to capture longer-term dependencies and implementing Continuous Learning pipelines to adapt models to urban growth and changing climate patterns.
The main conclusion of this work is that while XGBoost is a powerful tool for ozone forecasting, its operational reliability is limited by environmental topology. In stable accumulation zones, the model serves as a highly accurate inertial predictor. However, in complex industrial areas, local chemical kinetics cause a domain shift that requires localized training. Therefore, thorough topological stress-testing is crucial for the safe deployment of ML in environmental public health.

Author Contributions

Conceptualization, E.T.-L. and B.A.M.-H.; methodology, E.H.-S., E.T.-L., J.M.J.-P. and B.A.M.-H.; software, E.H.-S.; validation, E.H.-S., E.T.-L. and B.A.M.-H.; formal analysis, E.T.-L. and B.A.M.-H.; investigation, E.H.-S., E.T.-L., J.M.J.-P. and B.A.M.-H.; resources, E.T.-L.; data curation, E.H.-S. and E.T.-L.; writing—original draft preparation, E.H.-S., E.T.-L., J.M.J.-P. and B.A.M.-H.; writing—review and editing, E.T.-L., J.M.J.-P. and B.A.M.-H.; visualization, E.H.-S. and E.T.-L.; supervision, E.T.-L.; project administration, B.A.M.-H.; funding acquisition, E.T.-L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Autonomous University of Tamaulipas, Mexico, to cover the APC, under the internal identifier 211590 (Edgar Tello-Leal).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset analyzed in the study is openly available at https://sinaica.inecc.gob.mx/ (accessed on 17 March 2025).

Acknowledgments

The Autonomous University of Tamaulipas (Mexico) partially supported this research. Additionally, the study received partial funding from the Secretariat of Science, Humanities, Technology, and Innovation (SECIHTI) through grants 1244286 (Esteban Hernandez-Santiago) and 1239803 (Jailene Marlen Jaramillo-Perez).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
O3: Ozone
ML: Machine Learning
NO: Nitric Oxide
NO2: Nitrogen Dioxide
NOx: Nitrogen Oxides
R2: Coefficient of Determination
CO: Carbon Monoxide
VOC: Volatile Organic Compounds
CH4: Methane
WHO: World Health Organization
O2: Molecular Oxygen
SHAP: Shapley Additive Explanations
RFE: Recursive Feature Elimination
BO: Bayesian Optimization
AQMS: Air Quality Monitoring Station
T: Temperature
RH: Relative Humidity
BP: Barometric Pressure
WS: Wind Speed
WD: Wind Direction
KNN: K-Nearest Neighbors
RMSE: Root-Mean-Square Error
MAE: Mean Absolute Error
MSE: Mean Squared Error

Appendix A

Table A1. Percentage of missing values imputed per variable across the six monitoring stations.
Station ID    O3     NO     NO2    NOx    TMP    RH     BP     WS     WD     Total Imputed
NL01          4.52   2.59   2.72   2.68   1.42   1.74   1.46   1.43   1.46   2.22
NL02          4.57   3.58   3.60   3.62   1.29   2.04   1.39   1.31   1.44   2.54
NL03          3.61   4.28   3.48   4.45   2.09   1.42   1.21   1.68   1.91   2.68
NL04          4.98   3.88   3.88   3.89   1.59   1.96   1.67   1.27   2.16   2.81
NL05          4.75   4.61   4.71   4.60   1.16   1.31   1.45   1.16   1.22   2.78
NL06          4.38   4.42   3.96   4.43   1.45   1.60   2.03   1.52   1.51   2.81
Average       4.47   3.89   3.73   3.95   1.50   1.68   1.54   1.40   1.62   2.64
Table A2. Units of the variables composing the dataset.
Variable               Unit     Abbreviation
Ozone                  ppm      O3
Nitric Oxide           ppm      NO
Nitrogen Dioxide       ppm      NO2
Nitrogen Oxides        ppm      NOx
Temperature            °C       T
Relative humidity      %        RH
Barometric pressure    mmHg     BP
Wind speed             m/s      WS
Wind direction         °A       WD

References

  1. Zhang, J.J.; Wei, Y.; Fang, Z. Ozone Pollution: A Major Health Hazard Worldwide. Front. Immunol. 2019, 10, 2518. [Google Scholar] [CrossRef] [PubMed]
  2. Dantas, G.; Siciliano, B.; da Silva, C.M.; Arbilla, G. A reactivity analysis of volatile organic compounds in a Rio de Janeiro urban area impacted by vehicular and industrial emissions. Atmos. Pollut. Res. 2020, 11, 1018–1027. [Google Scholar] [CrossRef]
  3. Wang, L.; Wang, J.; Tan, X.; Fang, C. Analysis of NOx Pollution Characteristics in the Atmospheric Environment in Changchun City. Atmosphere 2020, 11, 30. [Google Scholar] [CrossRef]
  4. Becerra-Rondón, A.; Ducati, J.; Haag, R. Satellite-based estimation of NO2 concentrations using a machine-learning model: A case study on Rio Grande do Sul, Brazil. Atmósfera 2022, 37, 175–190. [Google Scholar] [CrossRef]
  5. Sosa Echeverría, R.; Alarcón Jiménez, A.L.; del Carmen Torres Barrera, M.; Sánchez Alvarez, P.; Granados Hernandez, E.; Vega, E.; Jaimes Palomera, M.; Retama, A.; Gay, D.A. Nitrogen and sulfur compounds in ambient air and in wet atmospheric deposition at Mexico city metropolitan area. Atmos. Environ. 2023, 292, 119411. [Google Scholar] [CrossRef]
  6. Paraschiv, S.; Barbuta-Misu, N.; Paraschiv, S.L. Influence of NO2, NO and meteorological conditions on the tropospheric O3 concentration at an industrial station. Energy Rep. 2020, 6, 231–236. [Google Scholar] [CrossRef]
  7. Lu, X.; Zhang, L.; Chen, Y.; Zhou, M.; Zheng, B.; Li, K.; Liu, Y.; Lin, J.; Fu, T.M.; Zhang, Q. Exploring 2016–2017 surface ozone pollution over China: Source contributions and meteorological influences. Atmos. Chem. Phys. 2019, 19, 8339–8361. [Google Scholar] [CrossRef]
  8. Wang, L.; Zhao, B.; Zhang, Y.; Hu, H. Correlation between surface PM2.5 and O3 in eastern China during 2015–2019: Spatiotemporal variations and meteorological impacts. Atmos. Environ. 2023, 294, 119520. [Google Scholar] [CrossRef]
  9. Nguyen, D.H.; Lin, C.; Vu, C.T.; Cheruiyot, N.K.; Nguyen, M.K.; Le, T.H.; Lukkhasorn, W.; Vo, T.D.H.; Bui, X.T. Tropospheric ozone and NOx: A review of worldwide variation and meteorological influences. Environ. Technol. Innov. 2022, 28, 102809. [Google Scholar] [CrossRef]
  10. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 13–17 August 2016; KDD ’16, pp. 785–794. [Google Scholar] [CrossRef]
  11. Niazkar, M.; Menapace, A.; Brentan, B.; Piraei, R.; Jimenez, D.; Dhawan, P.; Righetti, M. Applications of XGBoost in water resources engineering: A systematic literature review (Dec 2018–May 2023). Environ. Model. Softw. 2024, 174, 105971. [Google Scholar] [CrossRef]
  12. Li, Y.; Chen, W. A Comparative Performance Assessment of Ensemble Learning for Credit Scoring. Mathematics 2020, 8, 1756. [Google Scholar] [CrossRef]
  13. Mienye, I.D.; Sun, Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access 2022, 10, 99129–99149. [Google Scholar] [CrossRef]
  14. Li, J.; An, X.; Li, Q.; Wang, C.; Yu, H.; Zhou, X.; Geng, Y.A. Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 2022, 276, 106238. [Google Scholar] [CrossRef]
  15. Gagliardi, R.V.; Andenna, C. Exploring the Influencing Factors of Surface Ozone Variability by Explainable Machine Learning: A Case Study in the Basilicata Region (Southern Italy). Atmosphere 2025, 16, 491. [Google Scholar] [CrossRef]
  16. Liu, R.; Ma, Z.; Liu, Y.; Shao, Y.; Zhao, W.; Bi, J. Spatiotemporal distributions of surface ozone levels in China from 2005 to 2017: A machine learning approach. Environ. Int. 2020, 142, 105823.
  17. Fan, J.; Wang, T.; Wang, Q.; Li, M.; Xie, M.; Li, S.; Zhuang, B.; Kalsoom, U. Unveiling Spatiotemporal Differences and Responsive Mechanisms of Seamless Hourly Ozone in China Using Machine Learning. Remote Sens. 2025, 17, 2318.
  18. Liu, Z.; Lu, Z.; Zhu, W.; Yuan, J.; Cao, Z.; Cao, T.; Liu, S.; Xu, Y.; Zhang, X. Comparison of machine learning methods for predicting ground-level ozone pollution in Beijing. Front. Environ. Sci. 2025, 13, 1561794.
  19. Dai, H.; Huang, G.; Wang, J.; Zeng, H. VAR-tree model based spatio-temporal characterization and prediction of O3 concentration in China. Ecotoxicol. Environ. Saf. 2023, 257, 114960.
  20. Hu, X.; Zhang, J.; Xue, W.; Zhou, L.; Che, Y.; Han, T. Estimation of the Near-Surface Ozone Concentration with Full Spatiotemporal Coverage across the Beijing-Tianjin-Hebei Region Based on Extreme Gradient Boosting Combined with a WRF-Chem Model. Atmosphere 2022, 13, 632.
  21. Zhang, B.; Zhang, Y.; Jiang, X. Feature selection for global tropospheric ozone prediction based on the BO-XGBoost-RFE algorithm. Sci. Rep. 2022, 12, 9244.
  22. Xie, L.; He, J.; Lei, R.; Fan, M.; Huang, H. Accurate and efficient prediction of atmospheric PM1, PM2.5, PM10, and O3 concentrations using a customized software package based on a machine-learning algorithm. Chemosphere 2024, 368, 143752.
  23. Juarez, E.K.; Petersen, M.R. A Comparison of Machine Learning Methods to Forecast Tropospheric Ozone Levels in Delhi. Atmosphere 2022, 13, 46.
  24. Li, R.; Cui, L.; Hongbo, F.; Li, J.; Zhao, Y.; Chen, J. Satellite-based estimation of full-coverage ozone (O3) concentration and health effect assessment across Hainan Island. J. Clean. Prod. 2020, 244, 118773.
  25. Luo, Z.; Lu, P.; Chen, Z.; Liu, R. Ozone Concentration Estimation and Meteorological Impact Quantification in the Beijing-Tianjin-Hebei Region Based on Machine Learning Models. Earth Space Sci. 2024, 11, 003346.
  26. Gobierno de México. Metrópolis de México 2020, 2024. Secretaría de Desarrollo Agrario, Territorial y Urbano. Ciudad de México. Available online: https://www.gob.mx/cms/uploads/sedatu/MM2020_06022024.pdf (accessed on 17 October 2025).
  27. INEGI. Aspectos Geográficos de Nuevo León; Instituto Nacional de Estadística y Geografía: Aguascalientes, México, 2023; Available online: https://www.inegi.org.mx/contenidos/app/areasgeograficas/resumen/resumen_19.pdf (accessed on 17 March 2025).
  28. INEGI. Censo de Población y Vivienda 2020; Instituto Nacional de Estadística y Geografía: Aguascalientes, México, 2020; Available online: https://www.inegi.org.mx/programas/ccpv/2020/#datos_abiertos (accessed on 17 March 2025).
  29. Mancilla, Y.; Paniagua, I.H.; Mendoza, A. Spatial differences in ambient coarse and fine particles in the Monterrey metropolitan area, Mexico: Implications for source contribution. J. Air Waste Manag. Assoc. 2019, 69, 548–564.
  30. Ministry of Economy. Nuevo León: Economy, Employment, Equity, Quality of Life, Education, Health, and Public Safety; Ministry of Economy: Mexico City, Mexico, 2020. Available online: https://www.economia.gob.mx/datamexico/es/profile/geo/nuevo-leon-nl (accessed on 21 May 2025).
  31. INEGI. Vehículos de Motor Registrados en Circulación (VMRC); Instituto Nacional de Estadística y Geografía: Aguascalientes, México, 2023; Available online: https://www.inegi.org.mx/programas/vehiculosmotor/#datos_abierto (accessed on 21 May 2025).
  32. SEMARNAT. Norma Oficial Mexicana NOM-156-SEMARNAT-2012, Establecimiento y Operación de Sistemas de Monitoreo de la Calidad del Aire, 2016. Secretaría de Medio Ambiente y Recursos Naturales. Available online: https://www.gob.mx/profepa/documentos/norma-oficial-mexicana-nom-156-semarnat-2012 (accessed on 9 September 2025).
  33. SSA Salud ambiental. Norma Oficial Mexicana NOM-020-SSA1-2021, Criterio Para Evaluar la Calidad del Aire Ambiente, con Respecto al Ozono O3. 2021. Secretaría de Salud. Available online: https://dof.gob.mx/nota_detalle.php?codigo=5633956&fecha=28/10/2021#gsc.tab=0 (accessed on 9 September 2025).
  34. Chang, W.; Chen, X.; He, Z.; Zhou, S. A Prediction Hybrid Framework for Air Quality Integrated with W-BiLSTM(PSO)-GRU and XGBoost Methods. Sustainability 2023, 15, 16064.
  35. Wei, C.; Zhao, C.; Hu, Y.; Tian, Y. Predicting the Concentration Levels of PM2.5 and O3 for Highly Urbanized Areas Based on Machine Learning Models. Sustainability 2025, 17, 9211.
  36. Gupta, H.V.; Kling, H.; Yilmaz, K.K.; Martinez, G.F. Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. J. Hydrol. 2009, 377, 80–91.
Figure 1. Geographic distribution of the six AQMS and major industrial clusters in the MMA. The map highlights the complex topography and the proximity of stations to specific industrial sectors (e.g., San Nicolás and Apodaca manufacturing zones). Note the location of AQMS-NL03, situated in a high-density industrial corridor, and the petrochemical refinery located to the southeast, which influences regional pollution transport. The red line on the map represents the territorial boundaries of the municipalities that constitute the state of Nuevo León, México.
Figure 2. Visualization of the workflow of the proposed methodology for predicting O3 concentration.
Figure 3. Predicted versus observed O3 concentrations when the AQMS-NL03 prediction model is applied to the datasets of the other monitoring stations included in the study.
Figure 4. Observed versus predicted O3 concentrations from the best prediction model (AQMS-NL06) when generalized to the other monitoring stations.
Figure 5. Domain shift analysis comparing the ozone probability density functions between the reference station (AQMS-NL06) and the complex industrial station (AQMS-NL03).
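The domain-shift check underlying Figure 5 uses the two-sample Kolmogorov–Smirnov test. A minimal sketch with SciPy follows; the lognormal samples are illustrative stand-ins for the measured hourly O3 distributions at AQMS-NL06 and AQMS-NL03, not the study's data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Illustrative stand-ins for hourly O3 samples (ppm) at the two stations;
# the study compares the measured AQMS-NL06 and AQMS-NL03 distributions
o3_reference = rng.lognormal(mean=-3.6, sigma=0.5, size=2000)
o3_industrial = rng.lognormal(mean=-3.3, sigma=0.7, size=2000)

# The KS statistic is the maximum vertical distance between the two
# empirical CDFs; a small p-value flags a distributional (domain) shift
stat, p_value = ks_2samp(o3_reference, o3_industrial)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.2e}")
```

A large statistic with a near-zero p-value, as obtained for AQMS-NL03, indicates that a model trained on the reference station faces inputs drawn from a shifted distribution.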
Figure 6. SHAP summary plot for AQMS-NL03, where each dot represents a specific hourly data point.
Figure 7. Mean absolute SHAP values for station AQMS-NL03.
Figure 8. SHAP summary plot for the best-performing station (AQMS-NL06) in Experiment 1.
Figure 9. Spatial generalization assessment for the best-performing model (AQMS-NL06). The matrix displays predictions against observed values for all six stations (Experiment 2).
Figure 10. Scatter plots of the data predicted by the AQMS-NL06 model for each partition generated from the 10-fold cross-validation method.
Figure 11. (a) Residual analysis for station AQMS-NL04 in Experiment 1. (b) Chronological sequence of residuals for station AQMS-NL04 in Experiment 1.
Figure 12. Diagnostic plots for station AQMS-NL04 in Experiment 2. (a) Residuals versus Predicted O 3 , showing an unbiased distribution centered at zero. (b) Residuals versus Time Order, demonstrating stationary error behavior and model stability throughout the testing period.
Table 1. Geographic location and description of the selected AQMS.

Station ID    Location (City)    Coordinates (Lat, Lon)
AQMS-NL01     Guadalupe          25.6643, −100.2450
AQMS-NL02     San Nicolás        25.7450, −100.2530
AQMS-NL03     Apodaca            25.7773, −100.1882
AQMS-NL04     Juárez             25.6460, −100.0957
AQMS-NL05     Cadereyta          25.6008, −99.9958
AQMS-NL06     Santa Catarina     25.6756, −100.4654
Table 2. Summary of original and engineered input variables utilized in the proposed approach.

Quantity   Type                Variables
4          Air pollution       O3 (target), NOx, NO, NO2
4          Meteorological      T, RH, BP, WS
8          Cyclic time         hour_sin, hour_cos, day_sin, day_cos, dow_sin, dow_cos, month_sin, month_cos
2          Cyclic wind         WD_sin, WD_cos
1          Weekend indicator   is_weekend (binary)
12         Rolling window      mean_6h, mean_12h, mean_24h, sd_6h, sd_12h, sd_24h, min_6h, min_12h, min_24h, max_6h, max_12h, max_24h
7          Lagged variable     O3_lag_1h, O3_lag_2h, O3_lag_3h, O3_lag_4h, O3_lag_5h, O3_lag_6h, O3_lag_7h
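The engineered variables in Table 2 follow standard time-series transformations. A minimal pandas sketch is shown below; the helper name is illustrative (not the authors' code), and only the 6 h rolling window is included for brevity.

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Sketch of the Table 2 feature engineering on an hourly O3 series.

    Assumes a DataFrame with a DatetimeIndex and an 'O3' column.
    """
    out = df.copy()
    hour = out.index.hour
    # Cyclic encoding places hour 23 next to hour 0 on the unit circle
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    # Rolling-window statistics over the previous 6 hours
    out["mean_6h"] = out["O3"].rolling(6).mean()
    out["sd_6h"] = out["O3"].rolling(6).std()
    # Lagged O3 values from 1 to 7 hours back
    for k in range(1, 8):
        out[f"O3_lag_{k}h"] = out["O3"].shift(k)
    # Drop the warm-up rows where rolling/lag features are undefined
    return out.dropna()
```

The same pattern extends to the 12 h and 24 h windows and the day, day-of-week, and month cyclic pairs listed in Table 2.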
Table 3. Metric values achieved by the prediction model at each monitoring station in Experiment 1.

Station     R2     MSE (ppm²)   RMSE (ppm)   MAE (ppm)
AQMS-NL01   0.93   0.000017     0.0041       0.0029
AQMS-NL02   0.93   0.000015     0.0039       0.0027
AQMS-NL03   0.83   0.000042     0.0065       0.0044
AQMS-NL04   0.91   0.000016     0.0039       0.0030
AQMS-NL05   0.90   0.000017     0.0042       0.0027
AQMS-NL06   0.96   0.000011     0.0034       0.0023
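The scores in Table 3 (and the subsequent tables) are the standard regression metrics. A small NumPy helper makes the definitions explicit; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R2, MSE, RMSE, and MAE as reported in Tables 3-8."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))         # mean squared error (ppm^2)
    rmse = float(np.sqrt(mse))               # root-mean-square error (ppm)
    mae = float(np.mean(np.abs(resid)))      # mean absolute error (ppm)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
    return {"R2": r2, "MSE": mse, "RMSE": rmse, "MAE": mae}
```

Note that RMSE is simply the square root of MSE, which is why the ppm-scale RMSE values in Table 3 track the ppm²-scale MSE column.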
Table 4. Assessment of the spatial generalization capability of prediction models applied to unseen datasets from other monitoring stations.

Station Model   Station Test Dataset   R2     MSE (ppm²)   RMSE (ppm)   MAE (ppm)
AQMS-NL02       AQMS-NL01              0.94   0.000016     0.0040       0.0028
                AQMS-NL03              0.89   0.000028     0.0053       0.0035
                AQMS-NL04              0.93   0.000013     0.0037       0.0027
                AQMS-NL05              0.89   0.000019     0.0044       0.0030
                AQMS-NL06              0.95   0.000012     0.0035       0.0025
AQMS-NL05       AQMS-NL01              0.91   0.000023     0.0048       0.0030
                AQMS-NL02              0.88   0.000026     0.0051       0.0034
                AQMS-NL03              0.81   0.000046     0.0068       0.0049
                AQMS-NL04              0.92   0.000015     0.0039       0.0025
                AQMS-NL06              0.94   0.000015     0.0038       0.0026
Table 5. Performance comparison of benchmark machine learning algorithms (Ridge, Lasso, and Random Forest) using the Experiment 1 dataset.

Method          Station     R2     RMSE (ppm)   MAE (ppm)
LR Ridge        AQMS-NL01   0.77   0.0076       0.0061
                AQMS-NL02   0.77   0.0071       0.0054
                AQMS-NL03   0.77   0.0085       0.0059
                AQMS-NL04   0.74   0.0069       0.0055
                AQMS-NL05   0.70   0.0074       0.0058
                AQMS-NL06   0.79   0.0073       0.0059
LR Lasso        AQMS-NL01   0.81   0.0068       0.0049
                AQMS-NL02   0.80   0.0066       0.0046
                AQMS-NL03   0.70   0.0098       0.0068
                AQMS-NL04   0.79   0.0062       0.0044
                AQMS-NL05   0.76   0.0065       0.0044
                AQMS-NL06   0.82   0.0068       0.0053
Random Forest   AQMS-NL01   0.88   0.0054       0.0037
                AQMS-NL02   0.85   0.0058       0.0039
                AQMS-NL03   0.67   0.0102       0.0066
                AQMS-NL04   0.82   0.0058       0.0042
                AQMS-NL05   0.79   0.0061       0.0038
                AQMS-NL06   0.88   0.0054       0.0037
Table 6. Performance values obtained by each prediction model in the approach of Experiment 2.

Station     R2     MSE (ppm²)   RMSE (ppm)   MAE (ppm)
AQMS-NL01   0.89   0.000031     0.0056       0.0030
AQMS-NL02   0.92   0.000024     0.0049       0.0029
AQMS-NL03   0.90   0.000023     0.0048       0.0029
AQMS-NL04   0.88   0.000028     0.0053       0.0031
AQMS-NL05   0.94   0.000014     0.0037       0.0024
AQMS-NL06   0.95   0.000017     0.0041       0.0025
Table 7. Predictive performance of the proposed approach using 10-fold time-series cross-validation.

Station     R2     MSE (ppm²)   RMSE (ppm)   MAE (ppm)
AQMS-NL01   0.86   0.000039     0.0063       0.0041
AQMS-NL02   0.80   0.000043     0.0066       0.0041
AQMS-NL03   0.71   0.000053     0.0073       0.0045
AQMS-NL04   0.82   0.000036     0.0060       0.0039
AQMS-NL05   0.79   0.000038     0.0062       0.0039
AQMS-NL06   0.83   0.000039     0.0063       0.0041
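Time-series cross-validation keeps folds in chronological order, so validation data always postdates training data and no future information leaks into the model. scikit-learn's `TimeSeriesSplit` implements this expanding-window scheme; the year-long hourly length below is illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_hours = 8760  # roughly one year of hourly records
X = np.arange(n_hours).reshape(-1, 1)

# 10 folds: each trains on an expanding window of past hours and
# validates on the block that immediately follows it
tscv = TimeSeriesSplit(n_splits=10)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # Training indices always precede validation indices
    assert train_idx.max() < test_idx.min()
```

Because early folds train on only a small fraction of the year, fold-averaged scores (Table 7) are typically lower than the single-split results of Table 6.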
Table 8. Performance comparison of benchmark machine learning algorithms (Ridge, Lasso, and Random Forest) using the Experiment 2 dataset.

Method          Station     R2     RMSE (ppm)   MAE (ppm)
LR Ridge        AQMS-NL01   0.79   0.0072       0.0051
                AQMS-NL02   0.77   0.0071       0.0048
                AQMS-NL03   0.67   0.0076       0.0050
                AQMS-NL04   0.77   0.0066       0.0047
                AQMS-NL05   0.72   0.0070       0.0049
                AQMS-NL06   0.76   0.0074       0.0052
LR Lasso        AQMS-NL01   0.79   0.0073       0.0049
                AQMS-NL02   0.76   0.0072       0.0048
                AQMS-NL03   0.66   0.0078       0.0052
                AQMS-NL04   0.77   0.0067       0.0045
                AQMS-NL05   0.72   0.0070       0.0046
                AQMS-NL06   0.76   0.0074       0.0050
Random Forest   AQMS-NL01   0.84   0.0062       0.0040
                AQMS-NL02   0.79   0.0065       0.0040
                AQMS-NL03   0.70   0.0074       0.0045
                AQMS-NL04   0.80   0.0061       0.0039
                AQMS-NL05   0.76   0.0064       0.0039
                AQMS-NL06   0.80   0.0065       0.0040
Table 9. Benchmarking of predictive performance against related works utilizing XGBoost-based architectures.

Approach             Method                        R2          Air Pollutants                                                          Period
Chang et al. [34]    W-BiLSTM(PSO)-GRU + XGBoost   0.94        PM2.5, PM10, CO, SO2, O3                                                2013–2017
Liu et al. [18]      XGBoost                       0.87        PM2.5, PM10, CO, SO2, O3, NOx                                           2023
Juarez et al. [23]   XGBoost                       0.61        PM2.5, PM10, CO, SO2, O3, NOx, NO, NO2, NH3, Benzene, Toluene, Xylene   2015–2020
Dai et al. [19]      VAR-XGBoost                   0.95        PM2.5, PM10, CO, SO2, O3, NO2                                           2016–2021
Wei et al. [35]      XGBoost                       0.87        PM2.5, PM10, CO, SO2, O3, NO2                                           2019–2023
Experiment 1         XGBoost                       0.83–0.96   O3, NOx, NO, NO2                                                        2022–2023
Experiment 2         XGBoost                       0.88–0.95   O3, NOx, NO, NO2                                                        2022–2023

Share and Cite

MDPI and ACS Style

Hernandez-Santiago, E.; Tello-Leal, E.; Jaramillo-Perez, J.M.; Macías-Hernández, B.A. Development of an Ozone (O3) Predictive Emissions Model Using the XGBoost Machine Learning Algorithm. Big Data Cogn. Comput. 2026, 10, 15. https://doi.org/10.3390/bdcc10010015
