Classification and Prediction of Nitrogen Dioxide in a Portuguese Air Quality Critical Zone

Abstract: This study presents classification and prediction exercises to evaluate the future behavior of nitrogen dioxide in a critical air quality zone located in Portugal, using a dataset whose time span covers the period between 1 September 2021 and 23 July 2022. Three main results substantiate the importance of this research. First, the classification analysis corroborates the idea of a neutrality principle of road traffic on the target, since the respective coefficient is significant but quantitatively close to zero. This result, which may be the first sign of a paradigm shift regarding the adoption of electric vehicles in addition to reflecting the success of previously implemented measures in the city of Lisbon, is reinforced by evidence that the carbon monoxide emitted mostly by diesel vehicles exhibits a significant, negative and permanent effect on satisfying the hourly limit value associated with the target. Second, robustness checks confirm that the period between 8 h and 16 h is particularly remarkable for influencing the target. Finally, the predictive exercise demonstrates that the internationally patented Variable Split Convolutional Attention model has the best predictive performance among several deep learning neural network alternatives. Results indicate that the concentration of nitrogen dioxide is expected to be volatile and that no meaningful downward trend is likely to be observed. Therefore, in terms of policy recommendations, additional measures to avoid exceeding the legal nitrogen dioxide ceiling at the local level should focus on reducing carbon monoxide emissions, rather than just being concerned with halting the intensity of road traffic.


Introduction
Air pollution remains the most serious environmental threat in the European Union (EU). According to estimates by the European Environment Agency (EEA), around 400,000 annual premature deaths can be attributed to air pollution in EU Member States, which must guarantee good air quality standards for their citizens. The European Green Deal (EGD) and the Zero Pollution Action Plan (ZPAP) highlight the importance of reducing air pollution through the implementation of air quality standards enshrined in EU legislation to effectively protect human health and preserve the natural environment. The EU legislation on air quality, Directive 2008/50/EC, leaves to EU Member States the choice of measures to be adopted in order to comply with the limit values set.

Context
Regarding nitrogen dioxide (NO 2 ), which is the target selected for this study, estimates indicate almost 50,000 premature deaths per year in the EU. This air pollutant negatively affects human health by promoting severe conditions such as asthma and reduced lung function. Air pollution caused by NO 2 is mainly triggered by economic and human activities such as industry. This source of concern is particularly remarkable in the Iberian Peninsula, since this European region is characterized by a strong persistence of both types of activities, with the situation being more problematic in urban areas.
While H1-H3 ensure a comparison of our results with those observed in previous studies, H4-H6 allow us to improve the state of the art. In particular: • H4 guarantees that the classification analysis is categorized according to different peak traffic hours and/or time periods; • H5 tests whether the use of electric vehicles, which emit no CO, plays a role in influencing NO 2 concentration levels in the atmosphere; and • H6 ensures a credible prediction of NO 2 in one of the Portuguese cities with the highest level of air pollution caused by this specific pollutant.

Summary of Results
Main findings can be summarized as follows. First, the classification analysis corroborates the idea of a neutral effect of road traffic on the target, since the respective coefficient is significant but quantitatively close to zero. Since the time span of analysis covers the period between 1 September 2021 and 23 July 2022, this result may be the first sign of a paradigm shift observable in the city of Lisbon regarding the adoption of electric vehicles, in addition to reflecting the success of previously implemented measures (e.g., the ban on circulation of vehicles registered before the year 2000). This finding is reinforced by the fact that CO has a significant, negative and permanent effect on the target, thereby reflecting that a higher concentration of NO 2 in the atmosphere is more likely to hold with an increasing level of CO emissions. Second, robustness checks confirm that: • 8 h-16 h is the period where compliance with the NO 2 hourly ceiling is most compromised for a given increase in the level of CO emissions; • Several regularization procedures confirm that the set of input variables used in the benchmark analysis has explanatory power on the target; • The input-oriented model aimed at minimizing NO 2 emissions, which is solved through several variants of stochastic frontier analysis, confirms the influence and significance of input variables on the target; • The two-step continual learning approach also validates benchmark findings from a qualitative point of view; and • Considering multiple scenarios, a variable importance analysis shows that CO unambiguously has the highest relevance in explaining NO 2 concentration levels.
Third, the predictive exercise demonstrates that the internationally patented VSCA model has the best predictive performance among several DLNN alternatives. Results indicate that:
1. The level of NO 2 concentrations in the atmosphere is expected to be volatile; and
2. This air pollutant does not exhibit a downward trend in test and/or validation sets.
Therefore, in terms of policy recommendations, one can conclude that additional measures focused on reducing CO emissions in the period 8 h-16 h should be implemented at the local level to avoid surpassing NO 2 's limit values in the near future. This is not a contradiction, but rather the manifestation that desirable political actions to maximize social welfare should consider environmental conditions faced by local communities.
The remainder of the study is organized as follows. Section 2 provides materials and methods, while Section 3 clarifies main results and robustness checks. A discussion and recommendations are exposed in Section 4, while Section 5 concludes.

Description
The exposure to high concentrations of NO 2 can translate into serious damage to human health. When oxidized in the atmosphere, NO 2 can even produce nitric acid, a component that increases the acidity of rain and causes degradation of nature and materials due to its corrosive character. This is, however, an occurrence with little significance in Portugal. The legislation stipulates air quality objectives for the protection of human health. The indicator used is the annual average of NO 2 , which is an aggregated value of the hourly concentration of this air pollutant measured at each station, for comparison with the respective threshold value. The analysis of air quality in each zone is carried out by considering the worst value obtained in the set of stations belonging to each territorial unit. The typology of a measuring station with respect to the environment (i.e., urban, suburban or rural) and influence (i.e., traffic, industrial or background) of its location allows the identification of the nature of emission sources and the magnitude of measured levels. Lastly, it should be highlighted that trend analyses carried out by the Portuguese Environment Agency (APA) rely on the aggregation of average hourly values, depend on the type of station and allow a better elucidation of the measures to be adopted to satisfy limit values throughout the national territory. All stations with a measurement efficiency greater than 75% are used and, in the case of stations covered by some indicative evaluation strategy, measurements with efficiency greater than 14% are considered.

Targets
Main objectives established in Portugal are summarized as follows: • Ensure compliance with objectives established at the EU level in terms of air quality, which aim to avoid and prevent harmful effects of air pollutants on human health and the environment; • Assess the air quality throughout the national territory, with a special focus on urban centers; and • Portuguese legislation, materialized by Decree-Law No. 102/2010, establishes the obligation to preserve air quality where it is good and to improve it in other cases; moreover, it imposes two ceilings on NO 2 that cannot be exceeded:
1. The hourly limit value (HLV), i.e., a limit value for the average hourly concentration of 200 µg/m 3 of NO 2 , which must not be exceeded more than 18 times per calendar year; and
2. The annual limit value (ALV), i.e., a limit value for the average annual concentration of NO 2 of 40 µg/m 3 .
This study focuses on analyzing the first type of target.

Evolution
Europe. Regarding NO 2 concentration levels in the EU-28, 2000-2017 is a period marked by a reduction of the mean annual concentration of NO 2 in a magnitude of 25% in suburban stations, 28% at road traffic stations, and 34% at both industrial and rural stations. In 2018, the greatest contribution to NO x emissions was the transport sector (47%, with 39% of the road transport and 8% of the non-road transport) followed by the energy supply (15%), agriculture (15%) and industry (15%) sectors [15].
Portugal. Figure 1 provides the concentration level of NO 2 in the biennium 2019-2020 for different Portuguese regions. The year 2020 was atypical due to the pandemic effect on air quality, resulting from several periods of confinement that significantly influenced economic activities and mobility in the national territory. In particular, measures imposed during the state of alert and the state of emergency resulted in a strong reduction in emissions of atmospheric pollutants and led to a significant improvement of air quality, especially in large urban areas. This reduction was predominantly felt in NO 2 , including in Lisbon, Porto and Braga, where the ALV of 40 µg/m 3 had been surpassed in previous years. For the first time, there was no exceedance of this ceiling in large agglomerations, with a decrease in measured levels in the North MAL, from 55 µg/m 3 in 2019 to 40 µg/m 3 in 2020, in Coastal Porto, from 48 µg/m 3 in 2019 to 40.5 µg/m 3 in 2020, and in Between Douro and Minho, from 57 µg/m 3 in 2019 to 32 µg/m 3 in 2020. The significant decrease in the average annual concentration in Between Douro and Minho is explained by the fact that the station that determines the exceedance of the ALV was not used in the 2020 assessment due to malfunctions of the equipment. It can also be observed that, in the remaining local administrative units (LAUs), there is a generalized decrease in the magnitude of measured levels. An exception stands out in Coimbra, where there is a slight increase in NO 2 concentration levels compared to 2019, from 15 µg/m 3 to 22 µg/m 3 . This situation may result from the use, in the 2019 assessment, of the urban background station instead of the usual road traffic station, which was not operational that year due to equipment failure [16].
Nevertheless, Figure 2 allows us to identify a trend of decrease, more accentuated after 2017, in road traffic stations, where the most significant exposure of the population to this pollutant occurs, and a less expressive decrease in background (i.e., urban and suburban) stations from 2018 onwards. For the remaining typologies, the magnitude of NO 2 concentration levels has remained stable over time, with a series break observed in 2020 because of the lockdown, which influenced the normal functioning of the economy and resulted in the improvement of air quality. With respect to the HLV of 200 µg/m 3 not being exceeded more than 18 times a year, Ref. [16] confirms that this air quality objective was fully met in all areas of the national territory during the biennium 2019-2020. This fact seems, in some sense, inconsistent with the legal action taken by the European Commission (EC) against Portugal on 12 November 2021.

Data
This study uses public metadata related to the monitoring of: 1. Air quality; and 2. Meteorological conditions in the city of Lisbon, which can be freely accessed online at https://dados.cm-lisboa.pt/dataset/monitorizacao-de-parametros-ambientais-da-cidade-de-lisboa (accessed on 25 July 2022). Currently, this LAU maps the municipality in real or near-real time, at the level of these indicators, in a sensing-as-a-service mode. This innovation, provided by the MEO/Monitar/QART consortium, consists of an installed network of 80 stations, each equipped with different sensors, located mostly on public lighting poles. The selection of locations for the installation of sensors aims to monitor the diversity of spaces with different environmental characteristics and to represent the entire county with a homogeneous distribution. As a basis, criteria such as the road network and proximity to sources of pollution were considered.
In general, available values correspond to hourly averages obtained from an arithmetic mean of 4 segments of 15 min resolution each. Exceptions are road traffic values, which result from summing the number of vehicles per hour. This mapping complements the monitoring carried out by the official network of fixed stations under the management of the Lisbon and Tagus Valley Regional Coordination and Development Commission (CCDR-LVT) and the Portuguese Institute of the Sea and Atmosphere (IPMA), whose monitored values in real time are displayed in UTC time in autumn-winter periods and UTC + 1 in spring-summer periods, without decimal places, except for indicators T, RH, P, UVI, WI and WD. The present analysis deals with a multivariate time series (MTS) dataset, whose time span covers the period between 1 September 2021 and 23 July 2022. Descriptive statistics of the set of input variables and target are presented in Table 1.

Preliminary Statistical Tests and Hypotheses Testing
Prior to classification and prediction exercises, several statistical tests are reported in Tables 2-4 to obtain a greater sensitivity on data and justify technical options taken to perform subsequent tasks in the most appropriate way from a scientific point of view.
These include the inspection of: 1. Multicollinearity; 2. Normality, skewness and kurtosis; 3. Homoscedasticity; 4. Autocorrelation; 5. Stationarity; and 6. Specification.

Multicollinearity
Results show that the mean variance inflation factor (VIF) corresponds to 51.89, which is above the critical value of 5 usually considered for evaluation. Hence, data suffer from multicollinearity if nothing is done. However, all VIF values related to each input variable are below the critical threshold, with the exception of those representing PM 10 and PM 2.5 . These input variables exhibit a positive and strong correlation (ρ PM10,PM2.5 = 0.999) and, consequently, both must be excluded from classification and prediction exercises to avoid multicollinearity concerns.
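The VIF diagnostic described above can be sketched in a few lines. The series names and values below are synthetic stand-ins for the study's pollutant and meteorological data, with two nearly collinear columns mimicking the PM 10 /PM 2.5 pair:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical inputs: two nearly collinear series (stand-ins for
# PM10/PM2.5) plus an unrelated meteorological variable.
pm10 = rng.normal(30, 10, n)
pm25 = pm10 * 0.8 + rng.normal(0, 0.1, n)   # near-perfect correlation
temp = rng.normal(15, 5, n)
X = np.column_stack([pm10, pm25, temp])

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

v = vif(X)
```

With two nearly collinear columns, their VIF values explode above the usual critical threshold of 5, while the unrelated variable stays close to 1.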

Normality, Skewness and Kurtosis
Results indicate p < 0.05 for all normality tests. As such, the null hypothesis claiming the presence of normality in the errors' distribution is rejected at a critical p-value of 5%. A similar conclusion holds true for skewness tests. Nevertheless, the Cameron-Trivedi decomposition used to test the zero-kurtosis hypothesis has an observed p-value of 0.255, which suggests non-worrying differences compared to the kurtosis of a normal distribution at the 5% critical significance level. Overall, these results legitimize rejecting the null hypothesis that input variables and the target are homogeneously distributed at different time-steps, thereby justifying the execution of several feature engineering actions when moving towards the prediction exercise. In particular, Refs. [17,18] highlight the need to normalize data in cases where either inputs have different units of measure or error terms do not follow a normal distribution. Moreover, outliers can affect the learning ability of DLNN models, particularly if data are concentrated in a very small portion of the input range. Since explanatory variables subject to this type of bias may not have a significant impact on the learning and final predictions of DLNN models, Ref. [19] proposes a solution based on feature engineering to mitigate this technical concern. In a nutshell, it consists of applying a bilateral truncation to the MTS problem based on frequency analysis and histogram re-scaling. As such, in this study, DLNN models initially use normalized data for training and testing. After these tasks are executed, data are de-normalized using the inverse function to ensure a more realistic comparison between actual and predicted values of the target under scrutiny.
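The normalize/de-normalize round trip mentioned above can be illustrated with min-max scaling. This is one common choice of transform; the study does not specify which normalization it applies, so the functions and sample readings below are assumptions for illustration only:

```python
import numpy as np

# Synthetic hourly NO2 readings (µg/m³); stand-ins for the real target
no2 = np.array([12.0, 35.5, 80.0, 199.9, 210.0, 47.3])

lo, hi = no2.min(), no2.max()

def normalize(x, lo, hi):
    """Min-max scaling to [0, 1], fitted on the training range."""
    return (x - lo) / (hi - lo)

def denormalize(z, lo, hi):
    """Inverse transform applied after training/testing so that
    predictions are compared with actuals on the original scale."""
    return z * (hi - lo) + lo

z = normalize(no2, lo, hi)
back = denormalize(z, lo, hi)
```

The inverse function recovers the original series exactly, which is what makes the "more realistic comparison" on the original scale possible.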

Homoscedasticity
Both tests reject the null hypothesis of var(u_i) = σ^2, with σ^2 ∈ (0, ∞), ∀i = {1, . . . , 7824}. Although heteroscedasticity is more common in cross-sectional samples due to the heterogeneity between individuals, it is also observed in MTS problems due to the presence of temporal differences between observations. Given the detection of heteroscedasticity, its correction is mandatory when running a multiple linear regression model using the Ordinary Least Squares (OLS) method because coefficients remain unbiased and consistent, but no longer have minimal variance in the class of linear unbiased estimators. White's procedure, while not solving the problem of heteroscedasticity, restores the validity of statistical inference in large samples such as the one considered in this study [20].
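White's procedure can be sketched with the HC0 sandwich estimator on simulated heteroscedastic data. The data-generating process below is illustrative, not the study's:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
# Heteroscedastic errors: standard deviation grows with |x|
u = rng.normal(size=n) * (0.5 + np.abs(x))
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# White (HC0) sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * e[:, None] ** 2)
V_white = XtX_inv @ meat @ XtX_inv
se_white = np.sqrt(np.diag(V_white))

# Classical (homoscedastic) variance, for comparison
V_ols = (e @ e / (n - 2)) * XtX_inv
se_ols = np.sqrt(np.diag(V_ols))
```

The point estimates are unchanged; only the standard errors differ, which is exactly the sense in which White's procedure restores valid inference without "solving" heteroscedasticity.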

Autocorrelation
Autocorrelation (a.k.a. serial correlation) is more common in time series because observations of the same target in successive periods are often influenced by factors that exhibit some persistence over time. Moreover, autocorrelation can also be induced by data manipulation via certain procedures (e.g., deseasonalization). While the Durbin-Watson test checks whether u_i follows a first-order autoregressive process AR(1), the Breusch-Godfrey test generalizes this assessment to check for the presence of a p-order autoregressive process AR(p). Results in both tests reject the null hypothesis claiming the absence of autocorrelation and suggest the presence of positive autocorrelation (i.e., past NO 2 concentration levels positively influence current values). Given the presence of autocorrelation, regression coefficients are unbiased and consistent, but they no longer have minimum variance in the class of linear unbiased estimators. Furthermore, statistical inference conducted in the usual way is not valid because var(β_OLS) = σ^2(X'X)^(-1) is no longer the correct formula. In fact, this holds true regardless of whether the variance and covariance matrix of the disturbance terms, Ω, is known or not. The main drawback of using OLS under these conditions is the loss of validity of the statistical inference, which implies the need to obtain another estimator of var(β_OLS) that restores the validity of the statistical inference.
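The Durbin-Watson statistic can be computed directly on a residual series. On simulated AR(1) disturbances with ρ = 0.7 (an illustrative value), DW ≈ 2(1 − ρ) falls well below 2, signaling positive autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# AR(1) disturbances with positive autocorrelation (rho = 0.7)
rho = 0.7
u = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]

# Durbin-Watson statistic:
# DW = sum((u_t - u_{t-1})^2) / sum(u_t^2)  ~  2 * (1 - rho_hat)
dw = np.sum(np.diff(u) ** 2) / np.sum(u ** 2)
```

Values near 2 indicate no first-order autocorrelation; values well below 2, as here, indicate positive autocorrelation.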
Knowing that var(β_OLS) = σ^2(X'X)^(-1)X'ΩX(X'X)^(-1) in the presence of autocorrelation, several estimation methods are useful to achieve this end [21]: • OLS combined with the Newey-West estimator, after [22] proposed a var(β_OLS) estimator that is consistent in the presence of heteroscedasticity and/or autocorrelation generated by stationary processes and that asymptotically enables statistical inference to be conducted from the OLS results; this estimator is based on a function of the estimated residuals obtained by OLS and, unlike White's procedure, takes into account not only sample variances but also sample covariances; • First-order generalized least squares (GLS) method, either in GLS (AR(1)) or EGLS (AR(1)) form, respectively depending on whether Ω is known or not; hence, the GLS estimation is equivalent to the OLS estimation of the original model after it is transformed by the first-order generalized difference method; • Two-step Cochrane-Orcutt method [23] and subsequent extensions where, in the first step, the original model is estimated by OLS to obtain the residuals used to estimate the equation e_t = ρe_(t-1) + v_t, t = {2, 3, . . . , 7824}, yielding an estimate of ρ; in the second step, this estimate is used to construct a set of new input variables from which a model transformed by first differences results; and • Nonlinear least squares (NLS) method, where the minimization condition cannot be obtained analytically, making it necessary to resort to a search algorithm that is developed over several iterations, starting with predefined initial values. The search process ends when a certain convergence criterion defined in advance is satisfied.
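The two-step Cochrane-Orcutt method listed above can be sketched as follows, on simulated data with AR(1) disturbances (ρ = 0.6 and the slope of 2.0 are illustrative choices, not estimates from the study):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000
x = rng.normal(size=n)

# AR(1) disturbances with rho = 0.6
rho = 0.6
u = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols

# Step 1: estimate rho from the residual regression e_t = rho * e_{t-1} + v_t
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 2: quasi-difference the variables and re-estimate by OLS
y_s = y[1:] - rho_hat * y[:-1]
x_s = x[1:] - rho_hat * x[:-1]
Xs = np.column_stack([np.ones(n - 1) * (1 - rho_hat), x_s])
b_co = np.linalg.lstsq(Xs, y_s, rcond=None)[0]
```

The quasi-differenced model has (approximately) serially uncorrelated errors, so the second-step OLS inference is valid again.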

Stationarity

Several stationarity tests were executed in addition to the Augmented Dickey-Fuller (ADF) tests reported in Table 3, whose results confirm that the dependent variable does not exhibit unit roots; it is thus unnecessary to integrate the target by taking first- or higher-order differences.
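A bare-bones Dickey-Fuller regression (without the augmentation lags or the tabulated critical values of the full ADF test) illustrates the unit-root check on a simulated stationary series; the AR coefficient of 0.5 is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Stationary AR(1) series (phi = 0.5): no unit root expected
y = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + eps[t]

# Dickey-Fuller regression: dy_t = alpha + gamma * y_{t-1} + e_t.
# The test statistic is the t-ratio on gamma, compared against
# Dickey-Fuller (not standard normal) critical values.
dy = np.diff(y)
Z = np.column_stack([np.ones(n - 1), y[:-1]])
beta, *_ = np.linalg.lstsq(Z, dy, rcond=None)
resid = dy - Z @ beta
s2 = resid @ resid / (n - 1 - 2)
se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
t_gamma = beta[1] / se
```

A strongly negative t-ratio (here far below the ~−2.86 5% Dickey-Fuller critical value) rejects the unit-root null, matching the conclusion that differencing the target is unnecessary.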

Specification
In terms of specification of the regression model, the observed p-value of the Ramsey test is 0.000, thereby implying that the null hypothesis of a correct specification is strongly rejected at a critical significance level of 5%. This result suggests that the specification can be improved by including other input variables (e.g., dummy variables with additive and multiplicative effects related to weekly cyclicality) capable of providing additional explanatory power on the target. However, if this action is not performed, then one must introduce some technique that eliminates the need to include these extra covariates. In this study, roll padding is introduced to overcome this source of concern. This action is justified by the fact that the method is able to capture the cyclicality of data without the need to include dummy variables representing this statistical property. Therefore, taking into account the result highlighted by the Ramsey test, according to which additional input variables must be included, the comparison of a restricted model (absent of weekly cycles) with an unrestricted model (which includes weekly cyclicality) allows us to statistically confirm and formally validate the need for roll padding, thereby avoiding the inclusion of additional input variables representative of weekly cyclicality.
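The study does not detail its roll-padding implementation. One plausible reading, sketched below under that assumption, rolls the series by one weekly cycle (168 h for hourly data) so that each observation is paired with its value one week earlier, letting a model pick up weekly cyclicality without dummy variables:

```python
import numpy as np

period = 168  # hours in a week, for hourly data

def weekly_roll_feature(series, period=168):
    """Shift the series by one full weekly cycle. The first `period`
    positions, which have no prior week, are padded by rolling values
    in from the end (np.roll), keeping the original length."""
    return np.roll(series, period)

n = 24 * 7 * 4                                   # four weeks of hourly steps
hours = np.arange(n)
no2 = 30 + 10 * np.sin(2 * np.pi * hours / period)  # synthetic weekly cycle
lagged = weekly_roll_feature(no2, period)
```

For a purely weekly-periodic signal, the rolled feature reproduces the series exactly, which is the sense in which it encodes the weekly cycle.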

Hypotheses Tests
Based on Tables 1 and 2, the OLS regression model applied to the MTS problem under analysis is formally described as follows:

NO*_2i = β_1 + Σ_{j=2}^{12} β_j X_ji + u_i,

where β_1 corresponds to the coefficient or weight associated with the independent term, whereas β_j corresponds to the coefficient associated with each input variable j, with j = {2, . . . , 12}. The disturbance term u_i is assumed to be independently and identically distributed with zero mean and constant variance σ^2 if the hypothesis of homoscedasticity is satisfied, which is not the case, as confirmed by the White and Breusch-Pagan tests. This regression model must be subject to hypotheses testing for single and joint significance of the set of explanatory variables, whose results are displayed in Table 4. It can be concluded that ECSL, AP, RH, T, P, WI and UVI individually and jointly lack statistical significance at a critical p-value of 5% and are, thus, excluded. As such, the restricted model to be used in the classification exercise takes the same form over the retained covariates only, where, as explained in Section 2.4.1, the target NO_2i consists of a dummy variable whose development is based on observation values of the latent variable NO*_2i.

Benchmark Model
Generically, considering that a certain observation i falls into one of the m possible categories or alternatives that characterize a given target y_i, a multinomial density is formally defined as follows:

f(y_i) = Π_{j=1}^{m} p_ij^{y_ij}, with y_ij = 1 if y_i = j and y_ij = 0 otherwise,

where the probability that the observation i belongs to the alternative j corresponds to

p_ij = pr[y_i = j|X] = F_j(Xβ),

so that the usual multiple linear regression equation in convenient notation is given by y_i = Xβ + u_i, with X corresponding to a vector (1 × k) of components 1, X_2i, X_3i, . . . , X_ki and β a vector (k × 1) of coefficients or weights associated with the independent term and regressors. In any discrete choice model, the functional form of F_j(·) should be selected and each p_ij lies between 0 and 1 such that their sum over j corresponds to 1: Σ_{j=1}^{m} p_ij = 1.
The only exception occurs with the linear probability model (LPM) because the respective binomial distribution implies F_j(Xβ) = Xβ such that p_ij = pr[y_i = j|X] = Xβ. As such, a problem with the LPM is that predicted probabilities may not be limited to [0, 1] even though their sum over j corresponds to 1. Other weaknesses include constant marginal effects, the fact that the normality hypothesis is not acceptable and the fact that the homoscedasticity hypothesis is not sustainable. Consequently, the LPM is not used with binary outcome data. Instead, a more satisfactory approach for models containing a dichotomous target is to assume that the dependent variable y is the observable manifestation of an unobservable event (a.k.a. latent variable). Therefore, the benchmark analysis of this case study specifies a rule for determining the binary target NO_2i as a function of the latent variable NO*_2i, which is defined as follows:

NO_2i = 1 if NO*_2i < HLV; NO_2i = 0 otherwise, (1)

such that each measurement only belongs to one of the two possible categories: • 1 = HLV is satisfied, since the observation i stands below the HLV (alternative); or • 0 = HLV is surpassed and, thus, not satisfied (base category), which allows the identification of the probability of the HLV being satisfied and the quantification of the magnitude of the marginal effect that each regressor has on the target. After considering the battery of preliminary statistical tests, the binary regression model can be formally written as

NO*_2i = Xβ + u_i, (2)

where u_i is the disturbance term. In the class of models characterized by relations (1) and (2), it follows that pr[NO_2i = 1|X] = pr[NO*_2i < HLV|X] and, by complementarity, also pr[NO_2i = 0|X] = 1 − pr[NO_2i = 1|X]. Then, with u_i being a random variable with a distribution function F(·), it yields (absorbing the threshold constant into the independent term) that pr[NO_2i = 1|X] = F(Xβ). The two most common choices for the functional form of F(·) are the logistic distribution and the standard normal distribution, representative of the logit and probit models, respectively.
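The latent-variable rule for the binary target can be written directly, using the HLV of 200 µg/m³ cited in the legislation earlier; the sample readings are made up for illustration:

```python
import numpy as np

HLV = 200.0  # hourly limit value for NO2, in µg/m³

def hlv_indicator(no2_hourly):
    """1 = observation stands below the HLV (limit satisfied, alternative);
    0 = HLV surpassed (base category)."""
    return (no2_hourly < HLV).astype(int)

readings = np.array([35.0, 180.0, 200.0, 240.0])
labels = hlv_indicator(readings)
```

Note that a reading exactly at the ceiling is coded 0, consistent with "stands below the HLV" defining the satisfied category.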
In the probit model, it is postulated that F(·) is the normal distribution function, whose associated probability density function is given by:

f(z) = (1/√(2π)) exp(−z^2/2).

In the logit model, the choice of F(·) falls onto:

F(z) = e^z / (1 + e^z),

which is the distribution function of a logistic variable with zero mean and variance π^2/3. The probability density function of the logistic distribution is given by:

f(z) = e^z / (1 + e^z)^2.

It is easy to verify that the equality f(z) = F(z)[1 − F(z)] is unequivocally satisfied. In both cases, predicted probabilities are limited to [0, 1]. From a theoretical point of view, Ref. [24] confirms that the choice between logit and probit depends on the type of characteristics sustained by the dependent variable. Ref. [25] details that, if the dependent variable is directly observed and measured in reality, then a logit model is expected to be the optimal choice to estimate the probability that a given observation falls into a certain category. Despite these normative considerations, this study follows objective criteria to determine the optimal model, which is the one that exhibits the lowest Akaike information criterion (AIC) and Bayesian information criterion (BIC) and the highest log-pseudolikelihood.
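The logistic identity f(z) = F(z)[1 − F(z)] stated above, together with the two density choices, is easy to check numerically:

```python
import numpy as np

def logistic_cdf(z):
    """F(z) = e^z / (1 + e^z), written in the overflow-safe form."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_pdf(z):
    """f(z) = e^z / (1 + e^z)^2."""
    ez = np.exp(z)
    return ez / (1.0 + ez) ** 2

def normal_pdf(z):
    """Standard normal density used by the probit model."""
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

z = np.linspace(-5, 5, 101)

# Identity from the text: f(z) = F(z) * (1 - F(z)) for the logistic law
lhs = logistic_pdf(z)
rhs = logistic_cdf(z) * (1 - logistic_cdf(z))
```

Both densities are bell-shaped and symmetric around zero; the logistic has heavier tails, which is one source of the small practical differences between logit and probit fits.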
Estimated coefficients assume the base category as reference point, so that their interpretation is given as follows: compared to the HLV violation scenario, a unit increase in a given explanatory variable makes compliance with the HLV associated with the level of NO 2 concentration in the atmosphere more (less) likely if the respective coefficient takes a positive (negative) sign.
In turn, the marginal effect of a given regressor on the probability of belonging to the alternative j results from observing

∂p_ij/∂X_i = f(Xβ)β, (4)

where f(·) designates the density function corresponding to the distribution function F(·). In (4), the first member is a column vector of partial derivatives, of which the generic component is

∂E(NO_2i)/∂X_ij = f(Xβ)β_j. (5)

The sum of marginal effects associated with each category is equal to zero and their interpretation is straightforward: a unit increase in a given explanatory variable either increases or decreases the probability of belonging to alternative j by the magnitude ∂E(NO_2i)/∂X_ij expressed in percentage terms. Additionally, two facts deserve emphasis in (5):
1. Marginal effects vary across distinct observations given that different explanatory variables influence the probability density function f(·); and
2. Marginal effects can vary for a given observation since the magnitude of the effect may be different from one value of X_ij to another.
As such, in addition to estimated coefficients, this study reports both marginal effects at the mean and average marginal effects.
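The distinction between marginal effects at the mean and average marginal effects can be sketched for an illustrative logit specification; the coefficients and regressors below are arbitrary stand-ins, not the study's estimates:

```python
import numpy as np

def logit_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_pdf(z):
    p = logit_cdf(z)
    return p * (1 - p)

rng = np.random.default_rng(5)
n, k = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([0.2, 1.5, -0.8])   # illustrative coefficients

# Marginal effect of regressor j at observation i: f(x_i beta) * beta_j
z = X @ beta
me_all = logit_pdf(z)[:, None] * beta[None, :]

ame = me_all.mean(axis=0)                        # average marginal effects
mem = logit_pdf(X.mean(axis=0) @ beta) * beta    # marginal effects at the mean
```

Because f(·) is nonlinear, averaging the effects over observations and evaluating the effect at the average observation generally give different numbers, which is why the study reports both.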

Technical Extensions
Several formal extensions are proposed to test whether benchmark findings remain qualitatively robust, namely:
1. Application of a supervised machine learning (ML) model combined with several regularization methods to the set of inputs used in the classification exercise;
2. Use of a two-step continual learning (CL) approach;
3. Execution of a variable importance analysis (VIA);
4. Evaluation of the ordered multinomial discrete choice model (OM-DCM) by increasing the number of categories that define the target; and
5. Development of a stochastic frontier analysis (SFA).
Standard regularization. To dissuade concerns related to the omitted variable bias, this robustness check consists of a four-step procedure:
1. The objective function of this supervised ML model accommodates a penalty component that must be optimized. To achieve this purpose, a certain penalized regression technique (a.k.a. regularization method or tuning process) is applied to the initial set of covariates in order to find the general degree of penalization (a.k.a. penalty level) that optimizes the objective function, which is represented by λ. In this extension, three regularization techniques are combined with the logistic LASSO model: • k-fold cross-validation, whose purpose is to assess the out-of-sample classification performance; • AIC, BIC and the extended Bayesian information criterion (EBIC), by finding the penalty level λ that corresponds to the lowest value taken by each of these information criteria; and • Rigorous penalization, whose penalty level λ is a closed-form function of a slack parameter c with default value of 1.1, of the standard normal cumulative distribution function (CDF) Φ(·) and of the significance level γ, whose default value is 0.05/max{p × log(n), n}. According to [26], this approach requires standardized predictors and the penalty level is motivated by self-normalized moderate deviation theory to overrule the noise associated with the data-generating process;
2. After identifying the optimal value of this parameter, one can find which input variables have a statistically significant impact on the target and, for each of these, the weight or coefficient on the target which, similarly to what was verified in the initial exercise, can only be interpreted in terms of sign;
3. After identifying covariates with explanatory power on the target, coefficients are estimated; and
4. Although not mandatory because classification metrics (e.g., cross-entropy loss) can be reported, we convert the classification problem into a regression problem in order to run the Wilcoxon signed-rank test and provide the following metrics: RMSE, MAE, Mean Absolute Percentage Error (MAPE) and Theil's U statistic. The latter is a relative accuracy measure that compares the predicted outcomes with the results of forecasting with minimal historical data.
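The accuracy metrics of step 4 can be sketched as below. Theil's U is computed here in its common "relative to the naive no-change forecast" form, which is an assumption, since the text does not state which variant the study uses; the sample series is made up:

```python
import numpy as np

def metrics(actual, predicted):
    """RMSE, MAE, MAPE (%) and Theil's U relative to the naive
    no-change forecast (U = 1 means no better than naive)."""
    err = actual - predicted
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / actual)) * 100
    naive_err = actual[1:] - actual[:-1]
    u = np.sqrt(np.sum(err[1:] ** 2) / np.sum(naive_err ** 2))
    return rmse, mae, mape, u

actual = np.array([50.0, 52.0, 49.0, 55.0, 53.0])
naive = np.concatenate([[actual[0]], actual[:-1]])  # random-walk forecast
rmse, mae, mape, u = metrics(actual, naive)
```

Feeding the naive forecast itself yields U = 1 exactly, the benchmark value; a model worth reporting should achieve U < 1.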
Two-step continual learning approach. As an alternative to the previous regularization techniques, this study considers a two-step CL model proposed in [27]. In the context of ML, the idea behind CL is to refine the treatment of covariates based on a bias-variance trade-off argument, given that CL consists of a two-step procedure where:

1. In a first step, random forest (RF) is applied to the set of covariates holding a strong relative importance on explaining the dependent variable in order to mitigate the risk of overfitting; and
2. In a second step, a probit model estimated by the maximum likelihood method is applied to predicted target values obtained in the first step to mitigate the risk of underfitting.
Overall, this approach is justified by the fact that it alleviates concerns related to endogeneity, spurious correlation and reverse causation.
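A minimal sketch of the two-step idea on synthetic data (scikit-learn; a logit is used here as a stand-in for the probit second step, and all variable names are illustrative rather than the study's covariates):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # stand-ins for the selected covariates
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Step 1: random forest on the covariates to mitigate the risk of overfitting
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
p_hat = rf.predict_proba(X)[:, 1]                  # first-step predicted target values

# Step 2: a binary model on the first-step predictions to mitigate underfitting
# (logit used as a stand-in for the probit estimated in the paper)
second = LogisticRegression().fit(p_hat.reshape(-1, 1), y)
```

The design choice is visible in the pipeline itself: the flexible first step absorbs nonlinear structure, while the parametric second step delivers interpretable coefficients.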
Variable importance analysis. Following [28], VIA is applied to gain additional robustness over the regularization approaches previously considered for variable selection. This way, the analysis provides a tool to assess the importance of input variables when dealing with complex interactions, making the framework more interpretable and computationally more efficient. VIA is performed under four different circumstances:
• Applied to the entire set of input variables;
• Only applied to meteorological data;
• Only applied to air pollutants; and
• Only applied to statistically significant covariates.
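VIA under different input subsets can be sketched with impurity-based importances from a random forest (scikit-learn, synthetic data; the variable names are stand-ins for the paper's inputs and the subsets mirror the circumstances listed above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 600
# Hypothetical stand-ins: two pollutants and two meteorological variables
co, o3 = rng.normal(size=n), rng.normal(size=n)
wd, temp = rng.normal(size=n), rng.normal(size=n)
y = (0.9 * co + 0.3 * o3 + 0.1 * wd + rng.normal(scale=0.5, size=n) > 0).astype(int)

subsets = {
    "all inputs": np.column_stack([co, o3, wd, temp]),
    "weather only": np.column_stack([wd, temp]),
    "pollutants only": np.column_stack([co, o3]),
}
importances = {}
for name, X in subsets.items():
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances[name] = rf.feature_importances_  # impurity-based relative importance
```

Each importance vector is normalized to sum to one, so rankings are comparable within, but not across, subsets.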
Ordered multinomial discrete choice model. Thenceforth, we decide to increase the number of categories representing the dependent variable, which is now expressed as a three-level ordered outcome: a best category with NO 2 concentration levels below 150 µg/m 3 , an intermediate category with concentration levels between 150 and 249 µg/m 3 and a worst category with concentration levels above or equal to 250 µg/m 3 . The intermediate category can be interpreted as a class containing observations characterized by a strong sensitivity towards the HLV of NO 2 . From a theoretical point of view, multinomial logit models are used for discrete outcome modeling by operating under a logistic distribution, being thus preferred for large samples. Probit models operate differently for a dependent variable defined by more than two categories, particularly when these obey a ranking. In MTS problems, Ref. [29] finds evidence that the logit specification provides a better fit in the presence of independent variables that exhibit extreme levels. Conversely, model fit with random effects in moderate size datasets is improved by selecting a probit specification. Notwithstanding, in a similar vein to the benchmark exercise, this extension uses statistical criteria to decide between logit and probit multinomial models. However, only the ordered form is considered in this extension given that, ceteris paribus, any municipality unequivocally prefers to observe a low level of NO 2 concentrations in the atmosphere.
Stochastic frontier analysis. Lastly, the benchmark exercise is complemented by the inclusion of an input-oriented SFA on the grounds that a policymaker intends to minimize the latent output variable NO * 2 (i.e., the concentration level of NO 2 in the atmosphere) given the set of input variables.

Normative Improvements through Sample Segmentation
The initial assessment is also improved by developing three normative extensions based on a segmentation according to:
1. Peak traffic hours;
2. Seasons (summer, spring, autumn and winter); and
3. The weekend-weekdays dichotomy.

Prediction Analysis Framework
Feature engineering. DLNN models are trained, tested and validated by considering a 50-25-25 partition. This rule ensures a prediction of NO 2 concentration levels for 81 days ahead in the test and/or validation set. As required in MTS problems, shuffling is neglected [30]. Figure 3 clarifies that the set of DLNN models used in this study defines a time resolution of inputs (outputs) in hours (days), respectively. In particular, 168 time-step hours correspond to the resolution used for input variables, while the output to be predicted corresponds to the average of the next 24 time-step hours. As such, the predictive exercise considers historical data corresponding to one week in order to predict the average level of NO 2 concentrations in the atmosphere during 24 h, representative of the first day of the following week. In addition to the subset of covariates previously identified as being statistically significant through the application of Wald tests in Section 2.3 (i.e., O 3 , CO, THTV and WD), three additional regressors are included in the set of input variables adopted for the prediction exercise:
1. AP, because the respective Wald test indicates that this covariate is statistically significant for a critical p-value of 10% when NO 2 residuals follow a logistic distribution, which corresponds to the inclusion of a softmax layer in DLNN frameworks;
2. The target lagged in time by one period, NO 2i−1 , based on the presence of autocorrelation identified by Durbin-Watson, Breusch-Godfrey and autoregressive conditional heteroscedasticity (ARCH) tests; and
3. ECSL, which is not individually significant according to the Wald test, so that the DLNN models used in the prediction exercise are expected to assign a null weight to this input variable; hence, its inclusion is justified by the need to identify the consistency between the statistical robustness associated with the pre-processing phase and the set of different DLNN models used in the forecast analysis.
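The 168-hour-input / 24-hour-average-output scheme described above can be sketched as a simple sliding-window transformation (a minimal numpy sketch on synthetic hourly data; array names are illustrative):

```python
import numpy as np

def make_windows(hourly, n_in=168, n_out=24):
    """Slice a (time, variables) hourly matrix into samples pairing one week of
    inputs with the average target level over the following 24 hours."""
    X, y = [], []
    for t in range(len(hourly) - n_in - n_out + 1):
        X.append(hourly[t : t + n_in])                           # one week of inputs
        y.append(hourly[t + n_in : t + n_in + n_out, 0].mean())  # next-day mean of target (column 0)
    return np.array(X), np.array(y)

hourly = np.random.default_rng(0).normal(size=(1000, 7))  # 7 input variables, as in Figure 3
X, y = make_windows(hourly)
```

No shuffling is applied, so the chronological order of the samples is preserved, as required for MTS problems.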
Thus, as shown in Figure 3, a total of 7 input variables are used in the prediction exercise.
Benchmark models. To evaluate and compare the performance of the proposed DL innovative model, we consider the following benchmark models:
• WaveNet, defined by [31], where the application of a grid search to find its optimal depth shows that this is given by N = 4. A relevant aspect of this model is the use of causal convolutions, which ensures that the order of data modeling is not violated. Hence, the prediction P(x t+1 ) generated by the WaveNet at time-step t + 1 cannot depend on future time-steps. Another key characteristic is the use of dilated convolutions to gradually increase the receptive field. A dilated convolution is a convolution where the kernel K is applied over an area larger than its length k by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros. Hence, when using a dilation rate dr for any dr > 1, the causal padding has size given by dr × (K − 1). Finally, as recognized by [32], the residual block is the heart of a WaveNet, being constituted by two convolutional layers, one using the sigmoid activation function and another using the tanh activation function, which are multiplied by each other. Then, inside the block, the result is passed through another convolution with k = 1 and dr = 1. This technique is referred to as the projection operation or channel pooling layer. Both residual and parametrized skip connections are used throughout the network to accelerate convergence and enable training of much deeper models. This block is executed a given number of times in the depth of the network, with N = {1, . . . , depth}. The dilation dr increases exponentially according to the formula dr = k^N.
• Temporal Convolutional Network (TCN), defined by [33], where the application of a grid search allows us to conclude that one should optimally consider N = 2, with dr = {1, 2} and a global kernel size given by k = 2. The TCN is characterized by three main characteristics:
- Convolutions in the architecture are causal, which means that there is no information moving from future to past, since causal convolutions imply that an output at time t is convolved only with elements from time t and earlier (i.e., sequence modeling) from the previous layer;
- The architecture can take an input sequence of any length and map it to an output sequence of the same length, similarly to RNNs; hence, similar to the WaveNet model, causal padding of length (k − 1) × dr is added to keep subsequent layers with the same length as previous ones; and
- It is possible to build a long and effective history size using deep neural networks augmented with residual blocks and dilated convolutions.
• Standard Long Short-Term Memory (LSTM) of [34], where the application of a grid search reveals as the optimal strategy the inclusion of a first LSTM layer with 64 neurons, input shape of (168, 7), active return sequences and ReLU activation function, followed by a second LSTM layer with 32 neurons and without active return sequences; these layers are followed by three dense layers, where the last one, for the output, is defined with the tanh activation function; the remaining optimal technical options include 100 epochs with a batch size of 64 and applying the Adam optimizer. Conceptually, the LSTM has three basic requirements:
- The system should store information for arbitrary durations;
- The system should be resistant to noise (i.e., fluctuations of random or irrelevant inputs to predict the correct output); and
- Parameters should be trainable in a reasonable time.
The LSTM adds a cell state c to the standard RNN that runs straight down the entire recursive chain with only some minor linear interactions to control the information that needs to be remembered. These control structures are called gates. Activations are stored in the internal state of an LSTM unit (i.e., recursive neuron), which holds long-term temporal contextual information. A common architecture consists of a memory cell c and its candidate update c̃, an input gate i, an output gate o and a forget gate f . Gates have in and out connections. Weights of connections, which need to be learned during training, are used to orient the operation of each gate.
• LSTM of [34] complemented by the standard attention mechanism of [35]. A standard attention mechanism is a residual block that multiplies the output with its own input h i and then reconnects to the main pipeline with a weighted scaled sequence. Scaling parameters are called attention weights α i and the result is called context weights c i for each value i of the input sequence. As such, c is the context vector of sequence size n.
The computation of α i results from applying a softmax activation function to the input sequence x l on layer l, thereby meaning that input values of the sequence compete with each other to receive attention. Knowing that the sum of all values obtained from the softmax activation corresponds to 100%, attention weights in the attention vector α have values within [0, 1].
• Bi-Dimensional Convolutional Long Short-Term Memory (ConvLSTM2D), defined by [36], considering 7 segments, where each segment represents 1 day, each day consists of 24 time-step hours and each time-step corresponds to 1 h. The ConvLSTM2D is a recurrent layer similar to the LSTM, but internal matrix multiplications are exchanged with convolution operations. Since we are dealing with an MTS problem, the concept of sequence segment is introduced to make inputs compatible with ConvLSTM2D layers, which have to deal with segments of temporal sequences. Data flow through the ConvLSTM2D cells by keeping a 3D input dimension composed of Segments × TimeSteps × Variables rather than merely applying a 2D input dimension composed of TimeSteps × Variables as observed in the standard LSTM.
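The causal padding rule used by the convolutional models above can be illustrated with a minimal numpy implementation of a causal dilated convolution (a sketch, not the WaveNet or TCN implementation itself): left-padding with dr × (k − 1) zeros keeps the output the same length as the input and strictly causal.

```python
import numpy as np

def causal_dilated_conv(x, w, dr=1):
    """1D causal convolution with dilation rate dr: left-pad with
    dr * (k - 1) zeros so output[t] only sees inputs at times <= t."""
    k = len(w)
    pad = dr * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dr] for j in range(k))
                     for t in range(len(x))])

out = causal_dilated_conv(np.arange(5.0), np.array([1.0, 1.0]))  # -> [0, 1, 3, 5, 7]
```

With dr = 2 the same kernel reaches two steps back (output[t] = x[t] + x[t−2]), which is how stacking dilated layers grows the receptive field exponentially.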

Details on the standard attention mechanism.
Attention is a mechanism to be combined with LSTM layers to ensure the focus on certain parts of the input sequence, thereby enabling a higher quality of learning [35]. Attention was originally introduced for machine translation tasks, but it rapidly spread into many other application fields. As highlighted in Figure 4, the basic attention mechanism can be seen as a residual block that multiplies the output with its own input h i and then reconnects to the main pipeline with a weighted scaled sequence. Scaling parameters are called attention weights α i and the result is called context weights c i for each value i of the input sequence. As such, c is the context vector of sequence size n. Formally, c i = α i h i , where the computation of α i results from applying a softmax activation function to the input sequence x l on layer l, which means that input values of the sequence compete with each other to receive attention.
Knowing that the sum of all values obtained from the softmax activation is 1, attention weights in the attention vector α have values within [0, 1]. Since we are dealing with an MTS problem, Figure 4 shows the incorporation of the attention mechanism in a 2D dense layer (i.e., the standard attention mechanism). This layer is subject to a permutation to ensure that the attention mechanism is applied to the time-steps dimension of each sequence and not to the dimension representative of input variables. It is important to highlight that, when the attention mechanism is applied after the LSTM layer, it must return internally generated sequences, which are equal to the number of recursive cells (NRC). This parameter is used inside the attention block to know how many sequences must be processed.
Details on the bi-dimensional convolutional long short-term memory. Ref. [36] proposes the ConvLSTM2D layer to predict future rainfall intensity based on sequences of meteorological images. By using this layer in a DLNN model, the authors were able to outperform state-of-the-art algorithms. The ConvLSTM2D is a recurrent layer similar to the LSTM, but internal matrix multiplications are exchanged with convolution operations. As highlighted in Figure 5, the concept of sequence segment is introduced to make inputs compatible with ConvLSTM2D layers, which in turn deal with segments of temporal sequences in light of the MTS problem under scrutiny.
Two technical refinements are introduced over these building blocks: the introduction of convolutional layers inside the attention block; and the incorporation of roll padding, a new padding method that allows us to address the specification problem identified by the Ramsey test.
The first technical refinement consists of introducing convolutional layers inside the attention block. Attention mechanisms can be useful to obtain a pattern of attention in contiguous time-steps, particularly in the context of MTS problems. Two types of convolutional layers can be introduced inside any attention block:
• One-dimensional (1D) convolutional (i.e., Conv1D) layers; and
• Two-dimensional (2D) convolutional (i.e., Conv2D) layers.
The implementation of Conv1D layers inside the attention block requires that, after the multivariate sequence is split by variable, a path be created for each univariate sequence processed with 1D convolutional layers. It is also important to observe that, inside the attention block, each path must return a 1D vector with size TimeSteps, using only one convolution kernel. Additionally, this single-channel 1D convolution output must use the softmax activation. The result of each path is concatenated, leading to the formation of a feature map of size TimeSteps × Variables with attention weights α i for each input variable.
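This per-variable path structure can be sketched as follows (a numpy toy version; the smoothing kernels are illustrative stand-ins for learned convolution kernels):

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(x, w):
    """'Same'-length 1D convolution of a univariate sequence with kernel w."""
    return np.convolve(x, w, mode="same")

def conv1d_attention(x, kernels):
    """Toy Conv1D attention: one path per variable, each returning a softmax-
    normalized 1D vector of size TimeSteps; paths are concatenated into a
    TimeSteps x Variables map of attention weights used to scale the inputs."""
    paths = [softmax(conv1d_same(x[:, v], kernels[v])) for v in range(x.shape[1])]
    alpha = np.stack(paths, axis=1)
    return alpha, alpha * x  # attention map and scaled inputs

x = np.random.default_rng(0).normal(size=(168, 7))
kernels = [np.ones(3) / 3.0] * 7  # illustrative kernels, one per variable
alpha, scaled = conv1d_attention(x, kernels)
```

Because the softmax is applied per path, the weights of each variable sum to one over the time-steps, so variables compete for attention only along their own time dimension.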
Nevertheless, as observed in Figure 6, the novelty in the proposed architecture consists of introducing Conv2D layers inside the attention block for the ConvLSTM2D model, which deals with 3D maps holding a Segments × TimeSteps × Variables format. After the segmented multivariate sequence is split by variable, a 2D input dimension is formed since the new map is characterized by a Segments × TimeSteps format, which is then processed by 2D convolutional layers. This means that a 2D kernel is not only able to capture patterns in contiguous time-steps, but it also has the ability to identify patterns for the same time-step in previous and next segments (e.g., when a segment represents days and time-steps correspond to hours, a 2D kernel captures patterns that associate different hours of the day i as well as patterns that associate the same hour j in adjacent days to the day i). Note that the 2D output map for each variable is the result of a softmax activation function. Individual values for each explanatory variable in the resulting 2D map compete against each other and their sum is equal to 1, with a scaling factor ranging within [0, 1]. As described in Equation (7), the resulting α after all convolutions are concatenated is a compatible 3D map to scale inputs h of the attention block to obtain c.
Figure 6. Attention using 2D convolutional layer before and after the ConvLSTM2D.
Figure 6 also reveals that roll padding can be used to complement the Conv2D attention mechanism, particularly in MTS problems that exhibit cyclical properties. Since the analysis deals with segments of 7 days, roll padding can be used in the component representing segments to ensure that the first day of a given week is correlated with the last day of the previous week, assuming that data exhibit a weekly cyclicality.
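Roll padding along the segments dimension amounts to circular ("wrap") padding, which can be sketched in numpy (shapes follow the Segments × TimeSteps × Variables convention used above):

```python
import numpy as np

def roll_padding(x, pad=1):
    """Circularly pad the Segments axis of a Segments x TimeSteps x Variables
    map, so the first segment is preceded by the last one (and vice versa)."""
    return np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="wrap")

x = np.arange(7 * 24 * 7, dtype=float).reshape(7, 24, 7)  # 7 days x 24 h x 7 variables
xp = roll_padding(x)
```

After padding, a 2D kernel applied near the first day can also see the last day, which is exactly the weekly-cyclicality connection motivated in the text.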
To confirm that this fact holds true in the MTS problem under scrutiny, one needs to compare the unconstrained model given by:
NO 2i = β 2 O 3i + β 3 CO i + β 4 THTV i + β 5 WD i + β 6 AP i + β 7 NO 2i−1 + β 8 ECSL i + β 9 D1 i + β 10 D2 i + β 11 D3 i + β 12 D4 i + ε i
against the restricted model that corresponds to:
NO 2i = β 2 O 3i + β 3 CO i + β 4 THTV i + β 5 WD i + β 6 AP i + β 7 NO 2i−1 + β 8 ECSL i + ε i
by means of performing a likelihood ratio (LR) test to check for the joint significance of the null hypothesis H 0 : β 9 = β 10 = β 11 = β 12 = 0, where D1 i , D2 i , D3 i and D4 i are dummy variables representing, respectively, the first, second, third and fourth week of each month. Note that both models should not include the regression coefficient β 1 related to the independent term, to ensure that the statistical test confronting both models is focused on capturing significant differences strictly caused by weekly cyclicality. This option, which amounts to estimating a regression model with centered explanatory variables, is the correct way to empirically infer the quality of adjustment of the regression model because centering removes the unobserved heterogeneous characteristics affecting the data, so that all regression coefficients remain equal to those of the original model, with the exception of the independent term's coefficient, which disappears from the original model's specification. If the observed p-value is below the critical significance level of 5%, then the null hypothesis representative of the absence of weekly cyclicality can be rejected. If this statistical property is verified, then we are obliged to introduce dummy variables representing weekly cyclicality or, alternatively, to account for it through the inclusion of roll padding.
Estimated coefficients of the restricted model in (10) are given by: with R 2 = 0.9746, standard errors within parentheses and observed t-statistics within brackets.
The LR test result is χ 2 (4, n = 7823) = 12.64, p < 0.05, thereby implying that the null hypothesis contemplating the absence of weekly cyclicality is rejected. A similar conclusion emerges from the Wald test applied to assess the joint significance of all additive effects, given that F(4, n − k = 7812) = 3.16, p < 0.05 for the case with uncontrolled heteroscedasticity and F(4, n − k = 7812) = 2.87, p < 0.05 for the case with robust standard errors obtained via White's procedure to control for heteroscedasticity.
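The mechanics of this LR test can be reproduced on synthetic data (a numpy sketch; the covariate matrix, the injected weekly effect and all names are illustrative, and 9.488 is the well-known 5% critical value of χ²(4)):

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares from a no-intercept OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(0)
n = 7823
X = rng.normal(size=(n, 7))                      # stand-ins for the 7 inputs
week = (np.arange(n) // (24 * 7)) % 4            # week-of-month index for hourly data
D = np.eye(4)[week]                              # dummies D1..D4
y = X @ rng.normal(size=7) + 0.3 * D[:, 0] + rng.normal(size=n)  # injected cyclicality

rss_r = ols_rss(X, y)                            # restricted: without the dummies
rss_u = ols_rss(np.hstack([X, D]), y)            # unrestricted: with D1..D4
lr = n * np.log(rss_r / rss_u)                   # LR statistic, ~ chi2(4) under H0
reject = lr > 9.488                              # 5% critical value of chi2(4)
```

Neither model includes an intercept, mirroring the centered specification described in the text.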
Consequently, knowing that roll padding is a process capable of capturing the 7-day window cyclicality observed in the data, the incorporation of dummy variables D1, D2, D3 and D4 in the set of input variables becomes a redundant action.
For illustrative purposes, Figure 7 describes the ConvLSTM2D architecture with the Conv2D attention mechanism and roll padding, which defines the internationally patented VSCA model.

Conceptual Framework and Methodological Pipeline
In order to systematize main ideas related to the empirical analysis and robustness checks, Figure 8 presents the conceptual framework-including research hypotheses-and the methodological pipeline adopted in this study.

Performance, Model Selection, Interpretation of Coefficients and Marginal Effects
Probit and logit exhibit high sensitivity but also low specificity, which suggests that both discrete choice models predominantly fail to predict observations that do not comply with the legal hourly ceiling imposed onto NO 2 concentration levels.
However, as a large part of the sample is composed of observations that satisfy the HLV associated with the target, both models end up showing a good classification capacity, manifested by high accuracy values. Knowing that probit is considered the best binary model since it exhibits the highest log-likelihood and the lowest AIC and BIC, column (6) in Table 5 displays key results for clarifying the benefit of this classification analysis. In addition to verifying that all coefficients are significant, the following interpretation can be inferred:
• The negative sign of O 3 and CO indicates that satisfying the HLV of NO 2 is less likely with additional levels of O 3 and CO emissions;
• The positive sign of WD reveals that satisfying the HLV of NO 2 is more likely when the wind direction changes from north to east; and
• Even more interestingly, the quantitatively near-null value taken by the representative coefficient of THTV suggests that higher road traffic does not influence the fulfilment of NO 2 's legal hourly ceiling. Consequently, this result allows us to conclude that THTV has a neutral character on satisfying the HLV associated with the target.
Notes to Table 5: (6) provides results from the probit estimation. Statistical criteria (i.e., either the lowest AIC and BIC or the highest log-likelihood value) imply that probit is the best model to fit the data, being thus the one whose coefficients and marginal effects are subject to interpretation.
Moreover, both types of marginal effects are significant and quantitatively indicate that:
• At the level of wind direction, the exact value of the average marginal effect is 0.0000575 which suggests that, for each change in wind direction from north to east in the magnitude of 100 degrees, the probability of satisfying the HLV concentration of NO 2 increases by 0.58 p.p.; at the average observation level, the increase is only 0.07 p.p. given the value 7.02 × 10 −6 taken by the respective marginal effect; and • With regard to road traffic, the exact value of the average marginal effect is 4.12 × 10 −6 , which suggests that, for every 10,000 additional vehicles circulating per hour in the city of Lisbon, the probability of satisfying the HLV of NO 2 concentration decreases by 4.12 p.p.; at the mean observation level, the reduction is only 0.05 p.p. given the value 5.03 × 10 −7 taken by the respective marginal effect.
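The percentage-point conversions reported above follow from a simple rescaling of the per-unit marginal effects (values taken from the text; this is only an arithmetic check, not a re-estimation):

```python
# Average marginal effect of wind direction: 0.0000575 per degree.
ame_wd = 0.0000575
pp_per_100_degrees = ame_wd * 100 * 100   # x100 degrees, x100 to percentage points
# 0.575 p.p., reported in the text rounded to 0.58 p.p.

# Average marginal effect of road traffic: 4.12e-6 per vehicle/hour.
ame_thtv = 4.12e-6
pp_per_10k_vehicles = ame_thtv * 10_000 * 100  # 4.12 p.p. per 10,000 vehicles
```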
Considering the set of research hypotheses formulated in Section 1 and relying on the outcomes of this initial classification exercise, it can be stated that:
• H1 and H3 are verified and thus unequivocally validated; notwithstanding, despite the magnitude being low, results indicate that a change in wind direction can influence the probability of satisfying NO 2 's legal hourly ceiling;
• H2 is only partially verified because, although O 3 negatively influences the probability of compliance with the HLV associated with the target, the magnitude of the effect is redundant; and
• H5 is also ratified such that, during the period between 1 September 2021 and 23 July 2022, results suggest that the relationship between THTV and NO 2 is characterized by the approximation to what, in the absence of a better terminology, can be considered a principle of neutrality.
Table 6 shows results of the logistic LASSO model combined with three different regularization techniques: k-fold cross-validation, rigorous penalization and information criteria (i.e., AIC, BIC and EBIC). This extension allows us to formulate three main conclusions:

1.
1. Depending on the penalization technique adopted to reduce input dimensionality, it can be concluded that only AP and WI are included on top of the input variables used in the initial exercise;
2. Results show that the sign exhibited by the estimated coefficients is coherent across different regularization techniques; and
3. More importantly, the sign of each coefficient does not change relative to the benchmark scenario, which qualitatively reinforces the initial findings.

Continual Learning
While the first step of the CL approach allows us to obtain predicted probabilities of the dependent variable by means of applying RF to the set of input variables, the second step uses these values as the dependent variable of a binary probit/logit estimation. Estimated coefficients displayed in Table 7 exhibit the same sign as those sustained in the initial classification exercise, thereby confirming the qualitative robustness of benchmark results.

Variable Importance Analysis
VIA is also performed, whose outcomes in Figure 9a-d demonstrate that:
• Regardless of the scenario, CO has the highest relative importance on the target;
• WD has the strongest relative importance on NO 2 if considering only weather conditions; and
• VIA outcomes are clearly consistent with those yielded in the benchmark analysis.
Notes: *** (**) denotes statistical significance at the 1% (5%) level, respectively. OLS estimation with correction of heteroscedasticity serves to confirm that the linear predicted probability of the target estimated by RF in the first step, rather than its quadratic version, should be the one used as the explained variable in the second step of CL.

Ordered Multinomial Discrete Choice Model
Coefficients of the logit and probit ordered models differ by a scale factor. Therefore, similarly to the binary DCM, only their sign can be subject to interpretation. Results displayed in Table 8 indicate that the ordered logit model is the optimal choice for assessing coefficients and marginal effects since it exhibits the highest log-likelihood and the lowest information criteria, with an AIC of 4040.452 and a BIC of 4082.242. All estimated coefficients are statistically significant and their interpretation is straightforward: the reduction of NO 2 is improved-from the worst category with concentration levels above or equal to 250 µg/m 3 to the intermediate category with concentration levels between 150 and 249 µg/m 3 and the best category with concentration levels below 150 µg/m 3 -with decreasing (increasing) CO and O 3 (WD and THTV) values, respectively. Therefore, a first conclusion is that the results of binary and multinomial DCMs are consistent. Moreover, intercept parameters of the OM-DCM are significantly different from each other, which validates the more refined segmentation of the target. Results also show that marginal effects at the mean and average marginal effects from the logit estimation resemble those of the probit model. A key characteristic of marginal effects is that their sum across categories is necessarily zero.
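A minimal sketch of how an ordered logit maps a linear index xb into the three NO 2 categories (the cutpoints are illustrative, not the estimated ones), including a finite-difference check of the property that marginal effects sum to zero across categories:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordered_logit_probs(xb, cuts):
    """Category probabilities P(y = j) = F(cut_j - xb) - F(cut_{j-1} - xb)
    for an ordered logit with ascending cutpoints."""
    cdf = np.concatenate([[0.0], sigmoid(np.asarray(cuts) - xb), [1.0]])
    return np.diff(cdf)

# Three ordered categories: worst (>= 250), intermediate (150-249), best (< 150)
probs = ordered_logit_probs(xb=0.4, cuts=[-1.0, 1.5])

# Marginal effects reallocate probability mass across categories, so they sum to zero
me = (ordered_logit_probs(0.4 + 1e-6, [-1.0, 1.5]) - probs) / 1e-6
```

This zero-sum property is what makes it possible to read off which category is most volatile to a regressor, as done for CO below.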
Consequently, when a certain target is divided into more than two classes, one can identify which category is more volatile to changes in a given regressor. Focusing on the average marginal effect of CO on the target, the remaining covariates sustain marginal effects on NO 2 concentration levels that are close to zero, being thus characterized by a much more stable influence on the target compared to CO. This conclusion reinforces the stylized fact that CO is the most important determinant of volatility in NO 2 concentration levels and, based on the strong interchangeability of marginal effects between the low and intermediate categories of NO 2 , also the main driver of NO 2 legal ceiling violations.

Stochastic Frontier Analysis
Knowing that estimated coefficients of input-oriented SFA models correspond to direct effects, results in Table 9 reinforce the significant and positive impact of CO on the target.
Notes: *** denotes statistical significance at the 1% level. The regression includes robust standard errors, which adjust for within-cluster correlation. THTV is the explanatory variable specified for the inefficiency variance function. 'Tfe SFA' stands for the true fixed effects SFA model, whereas 'Tre SFA' stands for the true random effects SFA model [37]. 'MLrei SFA' stands for the maximum likelihood random effects time-varying inefficiency effects model [38]. 'ILSfe SFA' stands for the iterative least squares time-varying fixed effects model [39]. 'MLred SFA' stands for the maximum likelihood random effects time-varying efficiency decay model [40]. 'LSDVfe SFA' stands for the modified least squares dummy variable time-varying fixed effects model [41].
Overall, it can be concluded that SFA results are consistent with those obtained by DCMs and respective ML extensions in the classification analysis.

Peak traffic hours.
Considering the effect of CO emissions on the target keeping the remaining input variables fixed at their mean value, Figure 11 confirms that the lowest (highest) probability of satisfying the HLV of NO 2 for all admissible values of CO stands in the period between 8 h-16 h (24 h-8 h), respectively. Therefore, H4 is empirically validated.
Seasons. Considering the effect of CO emissions on the target keeping the remaining input variables fixed at their mean value and knowing that the dichotomous dependent variable did not change category during spring and summer, Figure 12 confirms that the probability of satisfying the HLV of NO 2 for all admissible values of CO is higher in autumn compared to winter.

Weekend versus weekdays.
Considering the effect of CO emissions on the target keeping the remaining input variables fixed at their mean value, Figure 13 confirms that the probability of satisfying the HLV of NO 2 for all admissible values of CO is lower on the weekend compared to weekdays. However, unlike the other types of segmentation, the difference is not very pronounced, which suggests the need to apply transversal policy measures instead of differentiated ones depending on whether we are dealing with weekdays or the weekend.
Figure 13. Impact of CO with segmentation between weekdays and the weekend.

Performance Evaluation
Forecasting error metrics. Prediction results are confirmed by relying on standard forecasting error metrics-mean squared error (MSE), root-mean-square error (RMSE) and mean absolute error (MAE)-whose formulas are respectively given by:
MSE = (1/n) Σ i=1..n (y i − ŷ i )², RMSE = √MSE and MAE = (1/n) Σ i=1..n |y i − ŷ i |,
with n = 7824. Table 10 presents main results, whose assessment and interpretation can be summarized as follows. First, there is evidence that both WaveNet and TCN have higher forecasting errors relative to LSTM models, which implies that these DLNN models exhibit a weaker performance. Second, LSTM models incorporating attention mechanisms have lower forecasting errors compared to LSTM models absent of attention mechanisms. This result holds true regardless of the type of attention mechanism introduced. Third, forecasting errors are lower for the ConvLSTM2D model with Conv2D attention mechanism and roll padding, which confirms that VSCA is the best model in terms of predictive performance for the case study under analysis. Fourth, the benefit associated with the Conv2D attention mechanism is particularly recognized in the ConvLSTM2D model because forecasting errors decrease substantially relative to the magnitude of reduction observed for LSTM models where standard attention and Conv1D attention mechanisms are included.
Graphical visualization of the target applying the best prediction model. To provide a greater sensitivity on the model holding the best predictive performance, Figure 14 shows a confrontation between predicted and actual output values in the validation set.
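The three metrics can be computed directly from their definitions (a short numpy sketch):

```python
import numpy as np

def forecast_errors(y, yhat):
    """MSE, RMSE and MAE as defined in the text."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    mse = np.mean((y - yhat) ** 2)
    return mse, np.sqrt(mse), np.mean(np.abs(y - yhat))

mse, rmse, mae = forecast_errors([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```

Since RMSE is the square root of MSE, the two metrics always rank models identically; MAE can disagree when large individual errors dominate.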
One can conclude that, in light of the case study under scrutiny, the VSCA model attains a superior predictive ability compared to alternative DLNN options currently available for the analysis of MTS problems. This technical competence is provided by the combination of a new padding method, which avoids the need to introduce additive effects representative of cyclicality with weekly periodicity in the regression model, with a special Conv2D attention mechanism, which ensures that the attention mechanism connecting the set of inputs to the target performs this operation individually and independently for each input variable. Moreover, the VSCA model offers two additional advantages that should be disclosed:

1. Adaptability and flexibility to other contexts (e.g., forecasting the evolution of the inflation rate in the context of supply-side shocks); and
2. It establishes the rationale for a credible and effective connection between the fields of Advanced Econometrics and Artificial Intelligence (AI) by providing a statistical ground for the inclusion of technical components such as roll padding, rather than falling into the fallacy associated with a pseudo-interpretation of weights observed in some AI studies.

Robustness Checks
In Section 3.2.1, different DLNN models are compared to confirm a new stylized fact: the superior generalization power sustained by the VSCA model. Differently, the research goal here is not to evaluate whether forecast errors vary according to different technical modifications, but rather to understand whether changes in the values taken by different parameters modify the qualitative nature of NO2 concentration levels in the atmosphere which, at the validation set level, exhibits two main characteristics:
• High volatility; and
• A negligible downward trend.
To fulfill this purpose, we use the standard LSTM model considered in the initial prediction exercise, whose parameters have been optimized through a grid search. These include a first LSTM layer containing 64 neurons with the ReLU activation function, combined with a second LSTM layer containing 32 neurons. These layers are followed by three dense layers, the last of which, for the output, is defined with the tanh activation function. Moreover, 100 epochs with a batch size of 64 are considered and the Adam optimizer is applied to the learning process. Then, in order to identify whether the qualitative nature of NO2 concentration levels in the atmosphere remains unchanged, test and validation sets are nested to generalize for 163 days and, without de-normalizing predicted and actual values taken by the dependent variable, the alternative optimizers shown in Figure 21a-g are applied, while bearing in mind that:
• Adam is a stochastic gradient descent method that is based on the adaptive estimation of first-order (i.e., mean) and second-order (i.e., variance and covariance) moments;
• RMSPROP maintains a discounted moving average of the square of the gradients, dividing the gradient by the root of this mean;
• Adadelta is a stochastic gradient descent method that relies on an adaptive learning rate per dimension to address the disadvantage of a continuous drop in the learning rate throughout training;
• Adagrad is an optimizer with a parameterized learning rate that adapts to the frequency with which a parameter is updated during training (i.e., the more updates a parameter receives, the smaller the updates become);
• SGD is a stochastic gradient descent method using the Nesterov momentum when the learning rate decreases;
• Nadam is an optimizer derived from Adam that includes Nesterov momentum and is related to the RMSPROP optimizer; and
• FTRL is a method that uses a unique global base learning rate and can behave like Adagrad with a learning rate power of −0.5, or like gradient descent with a learning rate power close to zero.
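To make the description of Adam concrete, a minimal sketch of its update rule for a single scalar parameter is shown below in pure Python. This is illustrative only: the default hyperparameter values and the gradient sequence are assumptions, not taken from the study.

```python
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are running estimates of the first moment (mean) and the
    second moment (uncentered variance) of the gradient, i.e., the
    'adaptive estimation of first and second order moments' described
    in the text.
    """
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# Hypothetical gradient sequence for a single weight.
w, m, v = 1.0, 0.0, 0.0
for t, g in enumerate([0.5, 0.4, 0.3], start=1):
    w, m, v = adam_step(w, g, m, v, t)
```

Nadam modifies this scheme by applying a Nesterov-style look-ahead to the first-moment term, while RMSPROP keeps only the second-moment average, which is why the three optimizers often behave similarly in practice.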
It should be emphasized that the best checkpoint model option is used in the callback, whose function is to save the best model whenever the validation error value decreases at a given iteration. The early stopping option was also implemented to prevent overfitting, with a patience of 25 based on the validation error value. As the model is trained, if there is no improvement within the patience window, training stops and the weights considered best for the model are restored.

As clarified in Figures 15a and 21g, the main characteristics exhibited by NO2 in the initial exercise remain qualitatively equivalent across the panoply of robustness checks, which reinforces the consistent character of NO2 concentration levels in the atmosphere with respect to volatility and stability over time. Indeed, a Wald test applied to each experiment confirms that the null hypothesis representing the absence of a linear trend cannot be rejected for a critical p-value of 5%. Then, considering the set of research hypotheses formulated in Section 1 and relying on the set of outcomes emerging from this forecasting exercise, it can be stated that H6 cannot be corroborated. Since NO2 is unlikely to exhibit a decreasing trend in the near future, a discussion contemplating recommendations and actuation measures should now be provided.

The classification analysis provides two main contributions:
1. It demonstrates that CO is a determinant of NO2's HLV disruption and identifies a negative relationship between CO and the satisfaction of NO2's HLV; and
2. It shows evidence of a neutrality principle between NO2 and mobility measured by the THTV.
Both conclusions provide a new window of opportunity for constituting and enhancing a valuable source of knowledge externalities. This is because, although the first result is in line with that reported in previous studies, the second result expands the evidence reported in the literature, which has been characterized by the homogeneous view claiming that mobility has a negative impact on air quality, ultimately leading to de-growth. However, agents circulate to satisfy their own needs, so mobility can be interpreted as a socially desirable action. Hence, the second result opens space for observing either a negative, neutral or positive influence of THTV on NO2.
Conceptually, the role of environmental funds, as a central instrument for financing climate action and environmental policy, is to promote support for projects:
• In mitigation fields, including projects to promote electric mobility;
• In the decarbonization of cities and industry; and
• In adaptation and cooperation on climate change, environmental education, water resources, circular economy, nature conservation and biodiversity.
Therefore, the potential market failure that this study seems to have identified is that, due to the lack of statistical significance in dimensions such as ambient noise and some meteorological input variables, there may exist an underutilization of environmental funds and their improper allocation towards drivers with a redundant influence on the target. Furthermore, this problem can be exacerbated by the integration of other funds (e.g., Energy Efficiency Fund, Permanent Forest Fund, Fund for Systemic Sustainability of the Energy Sector and Innovation Support Fund). With a merger of funds, rather than introducing a greater focus on supporting transition projects, allocative inefficiency could escalate and, with it, a detrimental impact on social welfare.
In this regard, it should be noted that one result of this study indicates that ECSL does not have explanatory power on the target. Nevertheless, this dimension can promote allocative inefficiency of environmental funds, given that it is unlikely to be marginalized, based on the argument that ambient noise requires a public policy intervention to minimize other associated impacts.

With the review of the Portuguese National Air Strategy (ENAR) completed in the first quarter of 2021, the effort to develop and implement the Portuguese Air Quality Improvement Plan (PMQAr) has continued. The approval of this diploma created conditions for stronger requirements with respect to licensing activities and their articulation with the surrounding territory, thus guaranteeing the well-being of local communities and contributing to the improvement of air quality. With regard to ambient noise, the biennium 2022-2023 will see the approval of a National Strategy for Environmental Noise (ENRA), which aims to define a model for integrating the noise control policy into sectoral, environmental, territorial, economic and social development actions. A good articulation of this strategy with intermunicipal noise reduction plans is essential for improving people's quality of life. Likewise, ENRA is intensifying the effort to provide all major transport infrastructures with strategic noise maps and action plans, complying with European legislation. The role of public entities in monitoring and inspecting the behavior and practices of stakeholders is relevant to ensure national environmental goals, with a view to guaranteeing resource management in accordance with the law in order to avoid market distortions.

Contemporaneous Governance Framework
Currently, national public programs tend to identify as the first strategic challenge the need to face climate change, while ensuring a just transition. Since environmental actions define a transversal domain, the concentration of competences for the mitigation of emissions, the energy transition, the adaptation of the territory and the dilution of the carbon footprint is fundamental for a renewed ambition in the urgent response that this challenge entails. Portugal was the first country in Europe to assume, in 2016, the objective of carbon neutrality in 2050 and to pursue this goal with the release of a Roadmap for Carbon Neutrality in 2050 (RNC 2050), innovating in the European landscape and abroad. The Portuguese Central Government follows an integrated approach that:
• Recognizes the fundamental role of forests, biodiversity and ecosystem services in building a territory more cohesive and resilient to the effects of climate change; and
• Recognizes the need to protect and enhance the coast, promoting a sustainable economy, combating desertification and helping to face demographic challenges.
In this context, and based on the compilation of results provided by this study, political action needs to focus on two main pillars:
1. Decarbonization: through the energy transition, sustainable mobility, circular economy and the enhancement of natural capital, territory and forests, promoting initiatives such as sustainable financing, green taxation and environmental education. Among the several dimensions of this challenge, the energy transition is the one that will contribute most to the reduction of CO emissions in upcoming years. This should rely on the decarbonization of energy systems, with special emphasis on the end of electricity production from coal, a focus on energy efficiency and the promotion of energy from renewable sources, placing the citizen at the center of energy policies while ensuring a level playing field and a cohesive transition; and
2. Circular economy, mobility and transport: it is essential to implement a circular economy model that contributes to an efficient management of resources, allows the exploration of new economic opportunities and promotes efficient waste management. A renewed and competitive public transport network is also necessary, as well as sustainable (electric and active) mobility.

Theoretical Recommendations: Main Priorities
Decarbonization must be carried out at two levels: civil society and the energy sector.

Decarbonize society. Results show that CO emissions are the main cause of noncompliance with the ceiling on NO2. The National Energy and Climate Plan for 2030 (PNEC 2030) constitutes the benchmark for an economic and social recovery aligned with the objective of carbon neutrality, while also reinforcing the need to generate employment and social welfare gains. In this sense, climate actions must respond to the double challenge of decarbonizing the economy and promoting a territory more and better adapted to structural change, creating more resilient forests, investing in public transport, fostering the energy transition and combating energy poverty. Also noteworthy are measures of environmental taxation, a crucial step for decarbonization to be achieved. Therefore, the elimination of tax exemptions on petroleum and energy products used in the production of electricity from more polluting fuels should be continued. Moreover, the scope of this measure may be extended to other niches, such as industries and services overly dependent on non-renewable energy sources.
Decarbonize energy. In the context of achieving carbon neutrality by 2050, Portugal commits to the EU to reach a target of 47% of energy from renewable sources in gross final energy consumption by 2030, with the first years of this decade being essential for the success of the strategy contained in PNEC 2030. The challenge of reducing greenhouse gas emissions is recognized, betting on an economy based on renewable endogenous resources and continuing with efficient models of circular economy, which value the territory and its endogenous resources and promote territorial cohesion, while inducing greater competitiveness, job creation and innovation. Decarbonization and the energy transition must be seen as mobilizing purposes for the entire Portuguese society. In this sense, both elements define an opportunity to:
• Increase investment and employment through the development of new industries and services;
• Reinforce research and development (R&D) and higher education systems;
• Foster economic growth through a substitution model of imports; and
• Benefit consumers, who will have lower costs when compared to the costs they would have in the case of maintaining fossil dependence.
The energy sector is expected to be the one that will make the greatest contribution, assuming a relevant role in the energy transition. Portugal's Horizon 2030 strategy, which aims to double the installed capacity based on renewable energy sources and reach an 80% level of renewables' inclusion in electricity production before 2030, is based on a combination of different policies, measures and technological options, with priority given to:
• Promoting low carbon processes, products and services; and
• Ensuring a more active and informed consumer participation.
Circular economy, mobility and transport. This study also finds evidence that road traffic has a statistically significant effect on the target, but quantitatively close to zero. It is then important to ensure that the likelihood of satisfying the HLV of NO 2 is improved with the implementation of measures and actions focused on road traffic. In general, the transport and mobility sector is a fundamental pillar for economic development and territorial cohesion. The public health situation caused by the COVID-19 pandemic had a strong economic impact on the country and highlighted the importance of a modern, efficient and safe public transport system that should be properly integrated with other modes of transport, namely with pedestrian and alternative active models such as cycling. The public transport system has been central to the economy, ensuring the mobility of goods, services and people. Hence, accelerating investments in the transport and mobility sector should be seen as key to promote economic recovery and long-term sustainable growth. These investments are direct job generators and their implementation makes it possible to improve the level of connectivity and accessibility to the main economic poles, thus promoting the ability to bring people closer to employment opportunities and bring companies closer to more qualified workers. It is therefore important to promote investments in the reinforcement of public transport networks in medium-sized cities and metropolitan areas (i.e., Lisbon and Porto), while keeping in mind the commitment to electric and shared mobility. 
This public investment should rely on robust projects with:
• Effective impact on the quality of transport services;
• Expectation of ensuring a strong contribution to the pursuit of public policies for a decarbonization of the transport sector; and
• The goal of boosting the sector's energy transition to renewable sources in order to promote the reduction of greenhouse gas emissions and the incorporation of renewable energy in the transport sector.
Alongside mitigation and adaptation, the system of production and consumption will necessarily have to change. According to the United Nations, 50% of greenhouse gas emissions are associated with the extraction and processing of basic materials. Thus, persisting in a linear economy (which extracts, transforms, sells and throws away) entails a heavy climate burden, in addition to intensifying the risks derived from the scarcity of water, arable land and materials. Almost three years after the approval of the Portuguese Action Plan for the Circular Economy (PAEC), the guidelines contained therein have been implemented through action at three levels: national, sectoral and regional. Its continuity is important, as well as the elaboration of a new PAEC in line with what is being carried out at the European level. For a national economy to be circular, it is not enough to act on waste, that is, only in downstream markets. It is also necessary, on the one hand, to modify people's behavior with environmental education initiatives that ensure less consumption of scarce resources and greater recycling to enable the reuse of transformed materials and, on the other hand, to develop initiatives that reduce dependence on raw materials through eco-design and impose a different conceptualization of production processes in the upstream markets of the value chain.

Practical Recommendations: Main Measures
Some measures under the two identified pillars of action are now detailed.

Decarbonization. Achieving carbon neutrality involves a concerted effort and an alignment of policies, incentives and means of financing. It is over the next decade that the greatest decarbonization effort must be achieved, involving the contribution of all economic activity sectors and quadrants of society. The allocation of a significant volume of funds to climate action should not only overcome the economic and social crisis, but also ensure that the objectives to which Portugal has committed are achieved. It is therefore important to instill the necessary dynamics for a full implementation of the National Energy and Climate Plan 2030 (PNEC 2030), in order to bring Portugal in line with the established emission reduction objective (a 55% decrease in greenhouse gas emissions by 2030 compared to 2005). As such, PNEC 2030 implements RNC 2050 in the period up to 2030 and constitutes the key plan for decarbonization through the establishment of:
• Sectoral emission reduction targets;
• Targets for the incorporation of energy from renewable sources; and
• Targets for the reduction of energy consumption.
This constitutes a transversal exercise that involves all areas of government action, while requiring the creation of a new dynamic focused on decarbonization, continuous monitoring of progress and evaluation of the contribution of each sectoral policy to climate action. In this context, it is instrumental to:
• Assess the legislative impacts on climate action, where the transformation should involve different levels of the public administration, from parishes to the national territory as a whole;
• Promote the quality of regional roadmaps for carbon neutrality that translate the ambition set at the national level, which is expected to have a repercussion at the local level with the promotion of carbon neutral city pacts (e.g., gastronomy routes like Pestico);
• Promote the creation of sustainable network communities in articulation with municipalities to boost sustainability efforts (e.g., eco-neighborhoods, a national network of circular cities, a network of municipalities for carbon neutrality);
• Promote initiatives to mobilize private sector actors for decarbonization and develop sectoral roadmaps for the decarbonization of different industries;
• Develop a territorial plan for a fair transition focusing on the territories potentially most affected by the change to a carbon neutral economy;
• Monitor progress by creating a platform that aggregates information and constitutes a decision support tool;
• Promote sustainable financing through the elaboration of a national strategy, which should include the identification of incentives and the creation of conditions to establish a Portuguese green bank;
• Define environmental criteria as a requirement for the allocation of, and articulation between, different public funds in order to orient public funding towards investments that foster a resilient, circular and carbon neutral society; and
• Adopt a fiscal policy in line with energy transition and decarbonization goals, introducing more fiscal incentives and promoting more sustainable behavior.
Circular economy, mobility and transport. In this domain, investments and actions must be primarily based on the following measures:
• Development of plans to expand a sustainable network of collective transport systems in the Metropolitan Area of Porto (MAP), the MAL and medium-sized cities;
• Continuous investment in the decarbonization of mobility, in both collective and individual transport modes (e.g., in terms of collective electric mobility, new programs should be launched to support the renewal of public transport fleets through the acquisition of clean vehicles; in terms of individual electric mobility, incentive programs should be reinforced for the acquisition of 100% electric light vehicles);
• Promotion of active mobility, while focusing on improving the quality of life in cities and the attractiveness of urban spaces;
• Increase in the autonomy and capacity building of sectoral transport authorities to ensure an increasingly efficient and more effective management of the country's transport networks;
• Promotion of innovative and intelligent solutions for the mobility of goods and people, which requires investing in new assets, improving interventions that promote intermodality with alternative transport modes (e.g., cycling) and increasing the accessibility and connectivity between different regions of the country;
• Regarding the promotion of urban public transport, maintenance of the Public Transport Fare Reduction Support Program (PART) and the Program to Support the Densification and Reinforcement of the Public Transport Offer (PROTransP);
• Extension of active mobility solutions for cities under the National Strategy for Active Cyclable Mobility (ENMAC), the National Strategy for Active Pedestrian Mobility (ENMAP) and the Portugal Cyclable 2030 Program (e.g., construct new cycling routes and fully support the private acquisition of eco-friendly bicycles; implement solutions to sharpen the complementarity between private transportation use and the public transport network);
• To promote greener cities, conceptualization of alternative solutions for urban spaces (e.g., new logistic applications to support decarbonization; increase the efficiency of mobility; optimize deliveries generated by e-commerce); and
• Increase in training actions for the staff of sectoral transport authorities and stakeholders in general.

Prediction
The need to establish a protocol to ensure the persistence of best research practices implies the formulation of recommendations at two distinct levels:

1. For the period before building DLNN models; and
2. For the period corresponding to the implementation of DLNN models, together with the period after the development of DLNN models.

Recommendations before Building Deep Learning Neural Networks
In this spectrum, recommendations are summarized as follows:
• Either collect data from a reliable source [42] or guarantee the intellectual property right over the data when they are collected by one's own means, in order to be fairly compensated for such work;
• Perform exploratory data analysis [43], which requires preliminary statistical tests considered valuable to explain the weaknesses and strengths of the data;
• Ensure that the dataset is robust by considering additional techniques such as data augmentation and bootstrapping (e.g., either to avoid class imbalance in classification problems or to boost small datasets in regression problems), while reporting them clearly to the audience [44][45][46];
• Differently from [47], sharing projects beforehand with other experts is not recommended because the market for ideas is scarce and classical industrial economics theory identified, 40 years ago, the persistence of a negative relationship between horizontal differentiation and competitiveness [48]; instead, either find a credible research partner or work with someone who has notability in the scientific field where spillover effects are sought;
• Survey the existing literature to confirm the reading of past studies and respect the effort developed by peers in previous research [49], which means not incurring the fallacy of citing only those who have an international reputation, but any researcher who produces valuable content regardless of provenance or professional position;
• Think of new methodologies to combine knowledge from the fields of Advanced Econometrics and AI; and
• In a market environment characterized by information asymmetry between editors, authors and readers, where some authors are also editors and, consequently, any attempt to criticize a work coming from this side of the market, even in a healthy way, tends to be fruitless, it is mandatory to own international patents to protect the innovative character of methods developed by credible authors. This not only deters bad reviews of scientific work, but also fosters a good reputation within the scientific community. It also defines a clear and objective way of distinguishing, in the microeconomics terminology used by [50], lemons (i.e., low-quality research) from peaches (i.e., high-quality research).
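The bootstrapping technique recommended above for class imbalance can be sketched in a framework-agnostic way. The following pure Python example is illustrative only: the toy dataset and the `bootstrap_balance` helper are hypothetical, not part of the study's pipeline.

```python
import random

def bootstrap_balance(samples, labels, seed=42):
    """Oversample minority classes by drawing with replacement until
    every class has as many observations as the largest class."""
    random.seed(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target_size = max(len(v) for v in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Keep the originals, then draw bootstrap replicates as needed.
        extra = [random.choice(xs) for _ in range(target_size - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical imbalanced toy data: 5 negatives, 2 positives.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 1.0]
y = [0, 0, 0, 0, 0, 1, 1]
Xb, yb = bootstrap_balance(X, y)
```

Whatever resampling scheme is chosen, the recommendation in the text applies: the procedure and its parameters should be reported explicitly so the audience can assess its influence on the results.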

Recommendations during and after Building Deep Learning Neural Networks
Based on the detailed contribution of [49], a set of recommendations is also given for the model development and post-implementation phases. It should be emphasized the need to:
• Avoid using inappropriate models (e.g., DLNNs are not a good choice either if there is data limitation or if the model needs to be interpretable) and do not allow test data to leak into the training process [49];
• Optimize hyperparameters with caution by considering a grid search with the support of statistical tests [51][52][53];
• If having a high number of input variables, execute feature selection [54,55];
• Combine DLNN models via ensemble learning and stacked generalization [56]; and
• Be transparent, report statistical significance with caution and do not generalize beyond the case study at hand [42,57].
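The grid search recommendation can be sketched independently of any deep learning framework. In the pure Python example below, the candidate values and the toy validation-error function are hypothetical placeholders for a real training/validation run:

```python
from itertools import product

def grid_search(grid, evaluate):
    """Exhaustively evaluate every hyperparameter combination and
    return the one with the lowest validation error."""
    keys = sorted(grid)
    best_params, best_err = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        err = evaluate(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# Hypothetical grid echoing the LSTM setup used in the robustness checks.
grid = {"units": [32, 64, 128], "batch_size": [32, 64]}

# Toy stand-in for a real training/validation run.
def evaluate(p):
    return abs(p["units"] - 64) / 100 + abs(p["batch_size"] - 64) / 100

best_params, best_err = grid_search(grid, evaluate)
```

In practice, `evaluate` would train the DLNN on the training set and return the validation error, and the winning configuration would then be confirmed with the statistical tests cited above rather than accepted on a single run.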

Limitations and Future Research
Despite the effort to provide a valid contribution, this study is not exempt from limitations. It should be noted that the interpretation of values taken by some environmental parameters (e.g., abnormal levels of PM2.5, PM10 and CO) may require correlation with other factors not monitored within the scope of this study, namely unfavorable weather episodes (e.g., floods, heat waves, frost, seismic activity), the migration of dust from the Sahara desert or the occurrence of big forest fires.
Although the methodology employed in this study is sophisticated, it ignores the underlying chemistry that characterizes NO2 in the atmosphere. Critically, NO2 concentrations are driven by ozone and photochemistry (i.e., through oxidation in photochemically active atmospheres) [58]. The changing balance between NO and NO2 is also a crucial aspect in Europe, where NO has typically declined more rapidly than NO2, driven especially by changes in the emission of primary NO2, which have shifted considerably over time with modifications in engine design and European emission standards [59]. Since the chemical relationship between NO and NO2 seems relevant for greater detail on the determinants of the HLV of NO2, but the Portuguese municipality under analysis only provides open metadata on NO2, the failure to consider these specificities should be regarded as a limitation of this study.
In addition to classification and prediction exercises, future work should consider a causality analysis of the treatments observed in 2022. Treatment evaluation is the estimation of the average effect of a given program on the target, which requires comparing outcomes between treated and control observations before and after the treatment occurs. This type of analysis is particularly relevant because, on 11 May 2022, the Lisbon City Council approved a reduction in maximum speed limits in the amount of 10 km/h. The legal measure, which corresponds to the treatment, foresees the elimination of car traffic on Avenida da Liberdade on Sundays and holidays, so it is expected to have a statistically significant effect on the target considered in this study.
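The before/after comparison between treated and control observations described above amounts, in its simplest form, to a difference-in-differences computation. The pure Python sketch below is illustrative only: all numbers are hypothetical placeholders, not measured NO2 values for the zones in question.

```python
def diff_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Average treatment effect as the change in the treated group's
    mean minus the change in the control group's mean."""
    def mean(xs):
        return sum(xs) / len(xs)
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical mean NO2 readings around the 11 May 2022 measure.
treated_pre = [40.0, 42.0, 41.0]   # affected zone before the speed limit cut
treated_post = [35.0, 36.0, 34.0]  # affected zone after
control_pre = [30.0, 31.0, 29.0]   # comparison zone before
control_post = [29.0, 30.0, 28.0]  # comparison zone after

att = diff_in_differences(treated_pre, treated_post, control_pre, control_post)
```

Subtracting the control group's change nets out trends common to both zones, which is precisely why a treatment evaluation requires control observations rather than a simple before/after comparison of the treated zone alone.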

Conclusions
This study presents classification and prediction exercises to assess the future evolution of NO2 in a critical air quality zone located in Portugal, using a dataset whose time span covers the period between 1 September 2021 and 23 July 2022. Among several results, at least three deserve to be emphasized.

First, the classification analysis corroborates the idea of a neutrality principle of road traffic on the target, since the respective coefficient is significant but quantitatively close to zero. This result, which may be the first sign of a paradigm shift regarding the adoption of electric vehicles, in addition to reflecting the success of previously implemented measures in the city of Lisbon, is reinforced by evidence that the CO emitted mostly by diesel vehicles exhibits a significant, negative and permanent effect on satisfying the HLV associated with the target.

Second, robustness checks confirm that the lowest probability of satisfying the HLV of NO2 lies in the period between 8 h and 16 h. Moreover, the probability of surpassing the HLV of NO2 for a marginal increase in CO, while keeping the remaining covariates at their mean values, is less pronounced when confronting autumn (weekdays) against winter (weekends), respectively.

Finally, the predictive exercise demonstrates that the internationally patented VSCA model has the best predictive performance among several DLNN alternatives. Results indicate that the concentration of NO2 is expected to be volatile, but without any positive or negative trend. In terms of policy recommendations, additional measures to avoid exceeding the legal ceiling of NO2 at the local level should focus on reducing CO emissions rather than overemphasizing the need to erode traditional forms of mobility.

Patents
International patents reported in this manuscript include WO2021255516: MULTI-CONVOLUTIONAL TWO-DIMENSIONAL ATTENTION UNIT FOR ANALYSIS OF A MULTIVARIABLE TIME SERIES THREE-DIMENSIONAL INPUT DATA. It is available online at https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2021255516 (accessed on 15 August 2022).