A Machine Learning Approach to Investigate the Surface Ozone Behavior

: The concentration of surface ozone (O 3 ) strongly depends on environmental and meteorological variables through a series of complex and non-linear functions. This study aims to explore the performances of an advanced machine learning (ML) method, the boosted regression trees (BRT) technique, in exploring the relationships between surface O 3 and its driving factors, and in predicting the levels of O 3 concentrations. To this end, a BRT model was trained on hourly data of air pollutants and meteorological parameters, acquired, over the 2016–2018 period, in a rural area affected by an anthropic source of air pollutants. The abilities of the BRT model in ranking, visualizing, and predicting the relationship between ground-level O 3 concentrations and its driving factors were analyzed and illustrated. A comparison with a multiple linear regression (MLR) model was performed based on several statistical indicators. The results obtained indicated that the BRT model was able to account for 81% of changes in O 3 concentrations; it slightly outperforms the MLR model in terms of the predictions accuracy and allows a better identification of the main factors influencing O 3 variability on a local scale. This knowledge is expected to be useful in defining effective measures to prevent and / or mitigate the health damages associated with O 3 exposure.


Introduction
The generalized interest in analyzing the behavior of ground level ozone (O 3 ), which is a secondary pollutant in the atmosphere, is due to its proved adverse human health effects [1][2][3][4] and its detrimental impact on vegetation [5][6][7][8] and materials [9][10][11]. The exposure to surface O 3 increases morbidity levels and premature mortality in a population. According to the World Health Organization (WHO) [12], there is strong evidence, from epidemiological and toxicological studies, that surface O 3 is causally associated with adverse respiratory effects. These effects range from changes in lung function and asthma to mortality, especially amongst sensitive risk groups in the population (children, elderly people, and individuals with respiratory illnesses). As established by the European Environmental Agency [13], in 2017, as much as approximately 96% of the total EU-28 urban population was exposed to O 3 levels exceeding the threshold for the protection of human health (8 h mean of 100 µg/m 3 ) set by the WHO Air Quality Guidelines [14]. No less important is the influence of tropospheric O 3 on climate change, to which it contributes by acting both as a greenhouse gas and as an indirect controller of others greenhouse gases lifetimes [15][16][17]. Moreover, growing interest has recently emerged on the role played by O 3 and by other pollutants (particulate matter and nitrogen oxides) as potentially related to coronavirus (Covid-19) diseases [18].
The level of surface O 3 in a given area is controlled by various chemical and dynamical processes acting as sources (chemical production, stratospheric intrusions, long-range transport) and sinks be useful for the effective management of public protection activities from O3 exposure. Similarly, this knowledge could support the development of win-win strategies that maximize co-benefits of O3 reduction for both air quality and climate changes [16,34]. Moreover, the urgent question concerning the role of air pollution and meteorological factors on the COVID-19 outbreak spreading [35] could profit by a deeper insight of the relationships among variables obtainable through a machine learning approach.

Study Area
The study area is the Agri Valley, located in the south-west part of the Basilicata Region (Southern Italy) ( Figure 1); the valley, which is located at approximately 600 m a.s.l., is bordered on both sides by the Apennine Mountains and hosts a population of approximately 50,000 inhabitants distributed in several small hilltop towns surrounding the valley. The area represents a predominantly rural environment, comprising woods, agricultural and breeding zones, partially included in a National Park (Appennino Lucano, Val d'Agri, Lagonegrese National Park). In general, the climate of the area is sub-continental, characterized by cold and rainy winter as well as cool summers with frequent rainfall [36]. Furthermore, the site is characterized by the presence of the largest on-shore western European reservoir of crude oil and gas and of an oil pre-treatment plant (identified as Centro Olio Val d'Agri-hereafter COVA) in a populated area. The industrial processes taking place at the COVA plant, operating since 2001, produce conveyed and diffuse emissions of gases and particulate, which can affect the air quality and potentially pose health risks for the population living in the area. Moreover, the site is located in the center of the Mediterranean area, which is considered a "hot spot" for climate change due to the intense photochemical activity, the crossing of air masses of different origin, and the strong anthropogenic pressure. These conditions make the area interesting for future insights on the interconnections between air quality and climate changes [37]. An air quality control network, consisting of five monitoring stations, is arranged in the Agri Valley, providing continuous concentration measurements of regulated pollutants (CO, CH4, NOx, An air quality control network, consisting of five monitoring stations, is arranged in the Agri Valley, providing continuous concentration measurements of regulated pollutants (CO, CH 4 , NO x , Atmosphere 2020, 11, 1173 4 of 16 NO 2 , NO, O 3 ) and of several pollutants specifically related to oil/gas extraction activities (NMHC). The following meteorological parameters are measured at the stations: temperature (T), atmospheric pressure (P), relative humidity (RH), solar radiation (SR), wind direction (wd), and wind speed (ws). The Environmental Protection Agency of the Basilicata Region (ARPAB), managing the network, validates and makes public the data. More details about the methods and the instrumentation used for the measurements can be found elsewhere [38,39]. For the purpose of this work, data were obtained from the station named Masseria De Blasis (MdB, 40 • 19 27" N, 15 • 52 02" E), an industrial station in a rural area, already subjected to a preliminary investigation [40]. It is located at the same altitude of the COVA plant, namely 603 m a.s.l., at approximately 2800 m from the industrial site and approximately 250 m from a high-speed motorway (SS598).

Data Preparedness
Based on the open data measured and validated by ARPAB, several available parameters were selected to prepare the database on which developing the BRT and MLR models.
Therefore, variables representing O 3 precursors were included among predictors together with the following meteorological variables: T, RH, SR, P, ws, and wd. Overall, a data set consisting of 13 variables and more than 25,000 observations covering the 2016-2018 period was set up. This timeframe was defined selecting the most complete time series and the most updated available data. The time series of all predictors considered respected the required 75% proportion of valid data. The final dataset was divided into two parts: the first one, concerning the data of the period from 2016 to 2017, was used to build the BRT and the MLR models; while the second one, the validation dataset based on the data from 2018, was used to investigate the predictive ability of both models. In the case of the BRT model development, the first dataset was again randomly divided into a training set (80%), used to train the model, and a testing set (20%) to test the performances of the model.

BRT Model Devolopment
The BRT technique, whose theoretical foundations are described in detail in [31,32], does not require pre-processing of data. It works successfully with both continuous and categorical variables, it handles missing data and it is not sensitive to outliers. However, to fit the algorithm to the dataset under examination, the BRT technique requires setting several parameters to control the learning process. The process of choosing the optimal values for these parameters is referred to as the tuning process of hyperparameters. Generally, the main hyperparameters to be taken into account are: the learning rate (lr, controlling the rate of model complexity increasing), the tree complexity (tc, the number of nodes in a tree), and the bag-fraction (bf, specifying the proportion of data randomly selected to fit each consequent tree). Finally, nt indicates the number of trees required for an optimal prediction. Therefore, to fit the algorithm to the dataset under examination, a two-step process was carried out: first, potential models were built using the training dataset for each combination of hyperparameters specified on a grid; second, the combination of the best performing hyperparameters on the testing dataset was selected. The model trained with this resulting combination of hyperparameters was chosen as the final model. In the present study, the optimal values of hyperparameters thus determined were, respectively, lr = 0.01, tc = 6, bf = 0.5, and nt = 8150.

MLR Model Development
A MLR model based on the backward elimination for variables selection was developed using the dataset covering the 2016-2017 period. Before implementing the MLR procedure, the explanatory variables were standardized by subtracting the mean and dividing by the standard deviation to avoid the scale effect.
The MLR technique requires several assumptions to be satisfied: (1) no multi-collinearity in the data, i.e., explanatory variables independent from each other, and (2) normal distribution of the residual errors with zero mean and constant variance [41,42]. To check the multi-collinearity between variables, Atmosphere 2020, 11, 1173 5 of 16 the variance inflation factor (VIF) was used as a diagnostic tool. The criterion of VIF greater than 10, representing serious levels of multi-collinearity, was adopted to exclude the critical variables from the regression [43]. In addition, a graphical analysis of the residuals was performed to detect violations of the normality of residuals and homogeneity of variance. Finally, F-test and t-test were conducted to verify the statistical significance of both the overall relationship represented by the regression model developed and the individual parameters. Among all variables, only those respecting the statistical significance levels (p-value < 0.05) were retained.

Models Evaluation
Once developed, the performances of the BRT and MLR models in predicting the O 3 concentration was verified using the validation data set. The accuracy and the errors of both models were evaluated through several statistical indicators, namely the coefficient of determination (R 2 ), the index of agreement (IoA), the mean bias error (MBE), the mean absolute error (MAE), and the root mean square error (RMSE), whose equations are provided in Appendix A [44,45]. High accuracy (R 2 and IoA close to 1) and minimal errors (MBE, MAE, and RMSE close to 0) are the desired performances for an optimal prediction model. Some graphical functions, as, for example, the scatter plots, were also utilized to support the analysis of the strengths and the criticalities of both the developed models.
All the analysis was accomplished in the R 3.6.1 software environment [46], mainly using the Openair package for air quality data analysis [47], the stats package, and the gbm package for the development of the MLR and the BRT model [48], respectively.

Statistical Analysis
The data statistical analysis for the whole period considered is presented in Table 1. The O 3 limit values for the protection of human health are set in the European Legislation [49]. O 3 hourly average concentrations were between 0.20 and 229.20 µg/m 3 , with a mean value of 63.23 µg/m 3 . Two exceedances of the information threshold, set to 180 µg/m 3 , were registered in July 2016, while conformity to the alert threshold, set to 240 µg/m 3 , was observed for the entire examined period. The target value for the protection of human health, i.e., 120 µg/m 3 , as maximum daily eight-hour mean, not to be exceeded on more than 25 days per year, as a mean of three years, was not reached.
However, the threshold for the human health protection set by the WHO [14], (8 h mean of 100 µg/m 3 ), was surpassed in all three years during the summer season. For the other regulated pollutants included in the present analysis, the concentrations levels resulted lower than the threshold values set by the national legislation currently in force [50]. It is also worth noting that the NMHC concentration levels, not regulated by law either at the European or at national level, exhibited a peculiar trend, which is characterized by several concentration spikes well above its average or background level. During the study period, the mean temperature was 13.0 • C; the minimum value was recorded in January 2017 in correspondence with of an exceptional cold wave involving the entire national territory. Relative humidity ranged from 58.2% of July to 81.3% of November with a mean value of 71.21%, while pressure was rather static. The mean value of ws was 2.76 m s −1 , with the higher values generally measured during daytime. As results from the wind rose in Figure 1, the prevailing wind direction was from the west sector and, in the second instance, from the north-west sector, determining an upwind position of the MdB station with respect to the COVA plant.
A Pearson correlation analysis was performed to identify the strength of the relationships between pairs of variables. The most significant results are summarized in Figure 2.
Atmosphere 2020, 11, x FOR PEER REVIEW 6 of 17 The target value for the protection of human health, i.e., 120 μg/m 3 , as maximum daily eighthour mean, not to be exceeded on more than 25 days per year, as a mean of three years, was not reached. However, the threshold for the human health protection set by the WHO [14], (8 h mean of 100 μg/m 3 ), was surpassed in all three years during the summer season. For the other regulated pollutants included in the present analysis, the concentrations levels resulted lower than the threshold values set by the national legislation currently in force [50]. It is also worth noting that the NMHC concentration levels, not regulated by law either at the European or at national level, exhibited a peculiar trend, which is characterized by several concentration spikes well above its average or background level. During the study period, the mean temperature was 13.0 °C; the minimum value was recorded in January 2017 in correspondence with of an exceptional cold wave involving the entire national territory. Relative humidity ranged from 58.2% of July to 81.3% of November with a mean value of 71.21%, while pressure was rather static. The mean value of ws was 2.76 m s −1 , with the higher values generally measured during daytime. As results from the wind rose in Figure 1, the prevailing wind direction was from the west sector and, in the second instance, from the north-west sector, determining an upwind position of the MdB station with respect to the COVA plant.
A Pearson correlation analysis was performed to identify the strength of the relationships between pairs of variables. The most significant results are summarized in Figure 2.  Among the meteorological parameters, the highest negative correlation was found between O 3 and the RH (R = −0.78), while a positive correlation was registered with T (R = 0.59) and ws (R = 0.49), respectively. O 3 was negatively correlated with all the precursor gases, in particular CH 4 was the most efficient one (R = −0.68). Finally, a strong positive correlation was found between NO x and NO 2 (R = 0.91), and a negative correlation between RH and T (R = −0.66). All the obtained results are generally congruent with the expected trends and they will be discussed in details in the paragraph dedicated to the BRT model results.

BRT Results
The outcomes of the BRT algorithm were obtained training the following model over the period 2016-2017: where gbm is the function implementing the boosted regression tree technique in the R software environment. Several tools allow interpreting the BRT model, enhancing its understanding and trustworthiness, i.e., the relative influence of predictors, the partial dependence plots, and the two-way predictor interactions. The relative influence, ranking in a descending order the most influential predictors, indicates to what extent each predictor influences the target variable. According to the obtained results (Figure 3), the overall contribution of meteorological variables explained over 70% of the variance in O 3 , indicating that meteorology, at a local scale, was a strong driver for O 3 concentrations in the area. The most important meteorological parameters were RH, ws, and T, the others making a negligible contribution. Remarkable was the role of RH, explaining alone more than 50% of the variance in the BRT model. Among the precursor gases, the most relevant was CH 4 , showing an influence of 13% on O 3 , while nitrogen gases accounted for percentage values of less than 6%.
Among the meteorological parameters, the highest negative correlation was found between O3 and the RH (R = −0.78), while a positive correlation was registered with T (R = 0.59) and ws (R = 0.49), respectively. O3 was negatively correlated with all the precursor gases, in particular CH4 was the most efficient one (R = −0.68). Finally, a strong positive correlation was found between NOx and NO2 (R = 0.91), and a negative correlation between RH and T (R = −0.66). All the obtained results are generally congruent with the expected trends and they will be discussed in details in the paragraph dedicated to the BRT model results.

BRT Results
The outcomes of the BRT algorithm were obtained training the following model over the period 2016-2017: O ~gbm(T, H, ws, wd, P, SR, NO, NO , NO , CH , NMHC, CO), where gbm is the function implementing the boosted regression tree technique in the R software environment. Several tools allow interpreting the BRT model, enhancing its understanding and trustworthiness, i.e., the relative influence of predictors, the partial dependence plots, and the twoway predictor interactions. The relative influence, ranking in a descending order the most influential predictors, indicates to what extent each predictor influences the target variable. According to the obtained results ( Figure  3), the overall contribution of meteorological variables explained over 70% of the variance in O3, indicating that meteorology, at a local scale, was a strong driver for O3 concentrations in the area. The most important meteorological parameters were RH, ws, and T, the others making a negligible contribution. Remarkable was the role of RH, explaining alone more than 50% of the variance in the BRT model. Among the precursor gases, the most relevant was CH4, showing an influence of 13% on O3, while nitrogen gases accounted for percentage values of less than 6%.  The partial dependence plots describe how the target variable is changing in terms of the chosen predictor, after accounting for the average effects of all other explanatory variables; it provides a useful basis to visualize the relationships between O 3 and its driving factors. With reference to the top four predictors identified by the BRT model (Figure 4), some considerations can be formulated.
The partial dependence plots describe how the target variable is changing in terms of the chosen predictor, after accounting for the average effects of all other explanatory variables; it provides a useful basis to visualize the relationships between O3 and its driving factors. With reference to the top four predictors identified by the BRT model (Figure 4), some considerations can be formulated. The strong negative association found between O3 concentrations and RH can be due to several factors. Relative humidity influences photochemistry through reactions between water vapor and atomic oxygen, increasing O3 chemical losses [51]. In addition, photochemistry reduction, due to the enhanced cloud cover associated with high humidity levels, is another plausible explanation. Furthermore, RH influences O3 concentration also by dry deposition, for example, enhancing plants stomatal opening and consequently O3 uptake [52]. Although not easily quantifiable, this is expected to be a non negligible phenomenon in the area under examination due to the relevant presence of vegetation near the monitoring site.
O3 was negatively correlated with CH4 as expected, since this pollutant is known to be a precursor of O3 [53]; CH4 oxidation, in fact, leads to enhanced formation of O3 in the troposphere and lower stratosphere through a sequence of reactions involving NOx compounds [51]. Several farming and ranching activities carried out around the monitoring station, have been suggested as potential local-sources of CH4 in the study area [39]; moreover, the oil/gas extractive and pre-treatment activities carried out in the Agri Valley should also be included among the potential sources of CH4 in the area.
An increase of the O3 concentrations with ws increasing was observed. Similar trends were attributed to transport from distant places [54], although the secondary nature of O3 makes its relationship with ws of not univocal interpretation. In the examined site, the diurnal pattern showed The strong negative association found between O 3 concentrations and RH can be due to several factors. Relative humidity influences photochemistry through reactions between water vapor and atomic oxygen, increasing O 3 chemical losses [51]. In addition, photochemistry reduction, due to the enhanced cloud cover associated with high humidity levels, is another plausible explanation. Furthermore, RH influences O 3 concentration also by dry deposition, for example, enhancing plants stomatal opening and consequently O 3 uptake [52]. Although not easily quantifiable, this is expected to be a non negligible phenomenon in the area under examination due to the relevant presence of vegetation near the monitoring site.
O 3 was negatively correlated with CH 4 as expected, since this pollutant is known to be a precursor of O 3 [53]; CH 4 oxidation, in fact, leads to enhanced formation of O 3 in the troposphere and lower stratosphere through a sequence of reactions involving NO x compounds [51]. Several farming and ranching activities carried out around the monitoring station, have been suggested as potential local-sources of CH 4 in the study area [39]; moreover, the oil/gas extractive and pre-treatment activities carried out in the Agri Valley should also be included among the potential sources of CH 4 in the area.
An increase of the O 3 concentrations with ws increasing was observed. Similar trends were attributed to transport from distant places [54], although the secondary nature of O 3 makes its relationship with ws of not univocal interpretation. In the examined site, the diurnal pattern showed that the higher values of ws, T, and O 3 were registered during daytime (Figure 5a), when, besides the photochemical production, the reduction of the stability of the boundary layer allows O 3 vertical mixing [55].
At the same time, the monthly pattern of ws, T, and O 3 (Figure 5b) showed that the higher O 3 monthly averages, registered in spring and summer, occurred concurrently with lower values of ws and higher values of T; this circumstance is usually associated with stagnant meteorological conditions conducing to tropospheric O 3 formation and accumulation [56]. Consequently, local photochemical processes and transport phenomena might be plausible reasons for O 3 level in the area, although studies based on regional models are required to estimate the contributions of these processes [57].
Atmosphere 2020, 11, x FOR PEER REVIEW 9 of 17 that the higher values of ws, T, and O3 were registered during daytime (Figure 5a), when, besides the photochemical production, the reduction of the stability of the boundary layer allows O3 vertical mixing [55].
(a) (b) At the same time, the monthly pattern of ws, T, and O3 (Figure 5b) showed that the higher O3 monthly averages, registered in spring and summer, occurred concurrently with lower values of ws and higher values of T; this circumstance is usually associated with stagnant meteorological conditions conducing to tropospheric O3 formation and accumulation [56]. Consequently, local photochemical processes and transport phenomena might be plausible reasons for O3 level in the area, although studies based on regional models are required to estimate the contributions of these processes [57].
The variation of O3 with T followed the expected trend: T, in fact, tends to accelerate the rate of O3-related photochemical reactions promoting O3 production [58]. In addition, an increase in temperature is often accompanied by an increase in solar radiation and emissions of VOCs from biogenic sources, as well as by a decrease in water vapor: all these factors together lead to an increase in the O3 concentrations.
Finally, a particular strength of the BRT model is the ability to identify the interactions between predictors and their relative strength: the explanatory variables are examined two at a time fixing the remaining predictors at their mean values. As far as the dataset under examination is concerned, the strongest interaction identified by the BRT model was between RH and CH4, confirming the relevant role of these variables in modulating the O3 concentrations [59]. However, as shown in Figure 6, the effect of CH4 becomes negligible at higher values of RH. The variation of O 3 with T followed the expected trend: T, in fact, tends to accelerate the rate of O 3 -related photochemical reactions promoting O 3 production [58]. In addition, an increase in temperature is often accompanied by an increase in solar radiation and emissions of VOCs from biogenic sources, as well as by a decrease in water vapor: all these factors together lead to an increase in the O 3 concentrations.
Finally, a particular strength of the BRT model is the ability to identify the interactions between predictors and their relative strength: the explanatory variables are examined two at a time fixing the remaining predictors at their mean values. As far as the dataset under examination is concerned, the strongest interaction identified by the BRT model was between RH and CH 4 , confirming the relevant role of these variables in modulating the O 3 concentrations [59]. However, as shown in Figure 6, the effect of CH 4 becomes negligible at higher values of RH.

MLR Results
To develop the MLR model over the period 2016-2017, the multi-collinearity among the standardized variables was preliminary checked. Consequently, NO2 was eliminated from the variables due to its high correlation with NOx and NO (VIF > 10). Furthermore, T, wd, and SR were

MLR Results
To develop the MLR model over the period 2016-2017, the multi-collinearity among the standardized variables was preliminary checked. Consequently, NO 2 was eliminated from the variables due to its high correlation with NO x and NO (VIF > 10). Furthermore, T, wd, and SR were also removed from the MLR model because their p-value was above the significance level. On this basis, the MLR model representing O 3 variability in terms of statistically valid variables was described by the following expression: The normality analysis, carried out through the normal Q-Q plot, confirmed that the residuals were normally distributed (Figure 7a), and the scale-location plot confirmed the homogeneity of the variance (Figure 7b). The sensitivity of O3 to the different predictors was evaluated by means of the coefficients obtained from the MLR model. The most relevant variables were RH, CH4, NMHC, and ws, substantially confirming the role of main predictors already observed in the BRT model. It is worth noting that the temperature was not included among the predictors by the MLR model.

Comparison between BRT and MLR Models
Once developed, the ability of the BRT and MLR models to generalize the knowledge acquired during the learning process on a new cohort of observations was tested applying the models on the validation dataset regarding the 2018 year. The resulting predictive performances and behavior of both models were compared through statistical indicators (Table 2) as well as through several graphical functions. The R 2 value indicated that 81% of the variability in the observed data was accounted for by the BRT model versus 79% accounted for by the MLR one. The slightly outperformance of the BRT model was registered also in the IoA parameter. The MBE parameter underlined a tendency to over predict the observed values by 3.58 μg/m 3 for BRT model and by 5.66 μg/m 3 for MLR model. Finally, the BRT model was also able to reduce the error at 12.29 μg/m 3 (RMSE) and 9.84 μg/m 3 (MAE). Overall, all metrics reported in Table 2 pointed out that the BRT model slightly overcomes the MLR approach in terms of both accuracy and minimal errors.  The sensitivity of O 3 to the different predictors was evaluated by means of the coefficients obtained from the MLR model. The most relevant variables were RH, CH 4 , NMHC, and ws, substantially confirming the role of main predictors already observed in the BRT model. It is worth noting that the temperature was not included among the predictors by the MLR model.

Comparison between BRT and MLR Models
Once developed, the ability of the BRT and MLR models to generalize the knowledge acquired during the learning process on a new cohort of observations was tested applying the models on the validation dataset regarding the 2018 year. The resulting predictive performances and behavior of both models were compared through statistical indicators ( Table 2) as well as through several graphical functions. The R 2 value indicated that 81% of the variability in the observed data was accounted for by the BRT model versus 79% accounted for by the MLR one. The slightly outperformance of the BRT model was registered also in the IoA parameter. The MBE parameter underlined a tendency to over predict the observed values by 3.58 µg/m 3 for BRT model and by 5.66 µg/m 3 for MLR model. Finally, the BRT model was also able to reduce the error at 12.29 µg/m 3 (RMSE) and 9.84 µg/m 3 (MAE). Overall, all metrics reported in Table 2 pointed out that the BRT model slightly overcomes the MLR approach in terms of both accuracy and minimal errors.  Figure 8 provides a visual reference for interpreting the results obtained from the BRT and MLR models, respectively. The scatterplot of predicted versus measured O 3 concentrations obtained by BRT and MLR models are depicted in Figure 8a,b, respectively. It is worth noting that both models tend to predict the mean better than the tails of the distribution [20]. Figure 8c illustrates the comparison between the predicted and observed O 3 concentrations on a monthly basis. The MBE index is positive for both models, i.e., on average, the predicted O 3 concentrations overestimate the measured data. However, a more detailed analysis showed that both models tend to underestimate the highest concentrations measured in the July months, and this underestimation is more prominent in the MLR model. In Figure 8d, the diurnal cycle of observed and predicted O 3 concentrations clearly shows the better performance of the BRT model with respect to the MLR model, net of small over predictions during the day.
Atmosphere 2020, 11, x FOR PEER REVIEW 12 of 17 index is positive for both models, i.e., on average, the predicted O3 concentrations overestimate the measured data. However, a more detailed analysis showed that both models tend to underestimate the highest concentrations measured in the July months, and this underestimation is more prominent in the MLR model. In Figure 8d, the diurnal cycle of observed and predicted O3 concentrations clearly shows the better performance of the BRT model with respect to the MLR model, net of small over predictions during the day. Overall, the better predictive performances of the BRT technique can be due to its inherent capability to take into account nonlinearities in O3 response to changes in predictors. The remaining discrepancies between the predicted and observed O3 values may be due to several causes, such as the lack of additional predictive factors, including the height of the boundary layer or the contribution of biogenic volatile organic compounds emissions. In terms of diagnostic capabilities, both models Overall, the better predictive performances of the BRT technique can be due to its inherent capability to take into account nonlinearities in O 3 response to changes in predictors. The remaining discrepancies between the predicted and observed O 3 values may be due to several causes, such as the lack of additional predictive factors, including the height of the boundary layer or the contribution of biogenic volatile organic compounds emissions. In terms of diagnostic capabilities, both models indicated that, among the precursors, the most significant contribution was due to CH 4 , the others having a negligible influence on O 3 variability. More significantly, the role of the local scale meteorology as a strong driver of the O 3 concentrations was better represented by the BRT model, which, unlike the MLR model, takes into account the role of temperature too.
Moreover, the BRT model was able to represent the interactions between variables, as the strong linkages between RH, CH 4 , and O 3 , and therefore it seems to better reflect the physical and chemical processes underlying the O 3 variability in the area. It is worth noting that the site-specific character of the analysis carried out does not allow, at this stage, the generalization of the obtained results over a wider area. However, the developed approach, if extended to the entire monitoring network, can represent a promising tool for interpreting and predicting the O 3 behavior in the Agri Valley as well as to better characterize the impact of the COVA plant on the local air quality.

Conclusions
In this study, the surface O 3 behavior was characterized by means of an advanced machine learning method, the boosted regression trees (BRT) technique. The developed BRT model turned out to be a powerful tool for interpreting O 3 variability since it automatically selects the relevant predictors, identifies and models their interactions, and visualizes the relationship between O 3 and each predictors, thus providing powerful insights into the structure of the data. Although the data-driven approach here adopted does not consider the physical and chemical processes underlying the O 3 formation, the results produced were plausible and comparable to other studies in which these processes have been analyzed. Moreover, it was found that the model predictions and the real observations were consistent despite the collinearity and nonlinearity problems existing among models' variables. A comparison, carried out via statistical performance indicators, showed the ability of BRT to perform better than the MLR model, especially in terms of the identification of the main factors influencing O 3 variability.
However, the BRT model tends to not properly estimate the extreme values. This fact negatively affects the predictive performances of the technique in determining the peak concentration values. Furthermore, the results obtained in the work are computed on a defined set of predictors based on data provided by a single measurement site. Extending the developed approach to the other stations of the Agri Valley monitoring network, using larger datasets and adding new predictors (such as the spatial data), could wider the inferential and predictive capabilities of the model and offer new insights on the O 3 behavior in the area.
In conclusion, the BRT technique proved to be promising in evaluating the role of different drivers of the surface O 3 variability. This knowledge is essential to the optimization of the O 3 control strategies aimed at human health protection, as well as to investigate the contribution of local scale phenomena in the complex interactions between air quality and climate change.