A Comparative Analysis for Air Quality Estimation from Traffic and Meteorological Data

: Air pollution in urban regions remains a crucial subject of study, given its implications on health and environment, where much effort is often put into monitoring pollutants and producing accurate trend estimates over time, employing expensive tools and sensors. In this work, we study the problem of air quality estimation in the urban area of Milan (IT), proposing different machine learning approaches that combine meteorological and transit-related features to produce affordable estimates without introducing sensor measurements into the computation. We investigated different conﬁgurations employing machine and deep learning models, namely a linear regressor, an Artiﬁcial Neural Network using Bayesian regularization, a Random Forest regressor and a Long Short Term Memory network. Our experiments show that affordable estimation results over the pollutants can be achieved even with simpler linear models, therefore suggesting that reasonably accurate Air Quality Index (AQI) measurements can be obtained without the need for expensive equipment. and the LSTM using two layers of 100 hidden cells, using a dropout between hidden and linear layer to improve the generalization, with a drop rate set to p = 0.5. For this last training conﬁguration for 20 epochs batch size


Introduction
Nowadays, air pollution represents a major environmental problem, increasingly worsening and affecting more people every year. This is especially true in urban environments, where the majority of the industries and traffic reside, releasing into the air alarmingly large quantities of pollutants and particulate matter that become a severe health risk from exposure [1]. Recent estimates claim that 4.6 million people die each year from causes directly attributable to air pollution [2], with a total of 300,000 cases in Europe only. These figures appear surprisingly high when compared to other common death causes such as car crashes, which is approximately three times lower [3]. The high incidence of deaths caused by air pollution can be explained by the fact that more than 90% of the world population lives in places where the air quality exceeds the guideline limits established by the World Health Organization (WHO) [4].
Several studies [1,5,6] analysed the relation between air pollution and health issues, both on global and local scale over long periods of time, demonstrating for instance augmented risks of respiratory diseases, such as bronchitis and asthma, or even reduced life expectancy [7]. Furthermore, air pollution is one of the major factors contributing to climate change, especially in terms of global warming. Even considering the relatively brief period, average temperatures have already risen by 1.9 • C since 1980 and data records from NASA report that 19 out of 20 warmest years have occurred between 2001 and the present day (https://climate.nasa.gov/). Besides the direct effects of greenhouse gases, several subsequent issues are also involved. For instance, warmer periods could increase the ozone (O 3 ) levels by 1 to 10 ppb during the next years, where drier regions can drastically increase the risk of wildfires, and consequently release significant source of carbon oxides and particulate matter [8].
For these and other reasons, a steadily growing concern from authorities, media and citizens has led to a broad consensus about the need of strong regulations in order to reduce the emissions of major pollutants in the shortest possible time. Together with important international treaties (e.g., Paris agreements (https://unfccc.int/process-and-meetings/the-paris-agreement/the-parisagreement)), the European Union established a remarkable set of legislation aimed at setting strict legal limits and target values for the concentrations of major air pollutants to be reached by the Member States within the next decades (https://eur-lex.europa.eu/legal-content/EN/ALL/?uri= CELEX:32008L0050). Because of the high population density in the old country, the primary purpose of this document is the safeguard of the human health by setting limits for Particulate Matter (PM10 and PM2.5), Sulphur Dioxide (SO 2 ), Benzene (C 6 H 6 ), Carbon Monoxide (CO), Ozone (O 3 ), Nitrogen Oxides (NOx), Lead and other toxic heavy metals such as Arsenic and Cadmium. In order to comply to these directives, each country requires a large amount of sensors to be positioned in key areas, mostly depending on the distribution and density of infrastructures and buildings. These monitoring stations can be divided into different groups according to the emission sources, namely: traffic, industrial or background stations, or the region category. In detail, monitoring stations are categorised as: (i) urban stations if positioned in central city districts, (ii) suburban stations when localised in the surroundings, or (iii) rural stations in any other case. In most cases, the urban stations group is the most significant indicator of the air quality for major cities and therefore requires a higher degree of attention, translated in both equipment and maintenance costs: without including heavy calibration, deployment and administration estimates, professional devices can reach prices over EUR 10.000 for a single unit. Multiplying this value for the number of pollutants or atmospheric conditions and the number of districts to be monitored in a large city, the yearly expenses can quickly become non-trivial.
At the same time, pollution levels in urban environments can typically be traced back to a relatively small subset of factors, with the vast majority of the contribution deriving from vehicle emissions and surrounding industries. Moreover, several studies [5,9] show a strong correlation between weather condition and trends of various pollutants: ozone levels for instance are directly influenced by warmer climates and solar radiation levels because of photo-chemical reactions induced on other pollutants, while temperature inversion layers can trap colder polluted air on the surface.
Given the serious nature of this subject, many studies have been conducted over the years in order to estimate trends of pollutants or provide air quality forecasts for future hours or days. These include statistical approaches such as [10], where spatial correlations and seasonal patterns were exploited to estimate long-term trends of particulate matter (PM 2.5 ) using a hierarchical model, based on Partial Least Squares regression. Other studies [11][12][13] investigated the applicability of recent machine learning approaches based on neural networks to this domain. In [11] different combination of Multi-Layer Perceptrons (MLP), periodic components and Self-Organising Maps are applied to hourly values of NO 2 and meteorological features, showing that the best results can actually be achieved by directly applying the MLP on the original data. Other solutions propose to merge features from different domains and leverage the extreme adaptability of deep learning to extract relevant knowledge and thus provide a finer estimate of air quality. As an example in a urban environment, U-Air [12,14] exploits different kinds of big data thanks to a composite framework where temporal and spatial information are processed by two different models to provide a robust 48-h forecast of the Air Quality Index (AQI). In [13], a similar approach is adopted and improved with the addition of an Attention Pooling layer that dynamically adapts to the most relevant monitoring station for a fine-grained prediction.
Despite the accuracy of current state-of-the-art solutions, a major drawback of the listed systems is the high reliance on real-time sensor data, which strongly limits their use to a small part of urban and densely populated areas covered by monitoring stations. With reference to the systems and the maintenance costs, expanding the encompassed areas can be a slow process, if feasible at all.
In this work we propose to combine meteorological and traffic-related features with recent machine learning and deep learning models to obtain a robust estimate of both pollutants and AQI.
As a case study, we combine three years of sensor and vehicle logs from the city of Milan, from 2013 to 2016, training different kind of models to predict the monitored air pollutants, namely Nitrogen Oxides (NO 2 , NO x ), Carbon Monoxide (CO), Benzene (C 6 H 6 ), Ozone (O 3 ), Black Carbon (BC) and Particulate Matter (PM 10 , PM 2.5 ). Expanding on previous work [15], we adopt a series of machine and deep learning models to predict the evolving trends of the pollutants in time. The predictions are first compared with a subset of the actual sensor logs recorded from high-end monitoring stations as ground truth, then used to predict an Air Quality Index following European standards, assessing the accuracy in identifying the correct category. Despite the undeniable loss in terms of performance with solutions including the pollutants in the feature set, we show that accurate estimates can still be obtained for most pollutants and therefore a fairly accurate AQI can be predicted even when relying only on environmental information.
In summary, the contributions of this paper are as follows: • We study the problem of air quality estimation in urban areas with a focus on environmental features, discussing the relations between meteorological measurements and vehicle transits with air pollutants; • We test and propose a series of experiments conducted with different popular machine and deep learning models, highlighting advantages and drawbacks of each solution; • We demonstrate the feasibility of this task, showing fairly robust estimations without introducing past measurements of pollutants in our experiments.
The paper is organised as follows. Section 2 introduces: (i) the data sources used for the experiments, describing the process of acquisition, preprocessing and analysis; (ii) the frameworks and tools used to manipulate the data; and (iii) the evaluation measures and the experiment setup adopted. Section 3 reports the results obtained from the adopted solutions discussing their performance. Lastly, in Section 4 concludes with a summary of the study and delineates possible future works.

Materials and Methods
This section describes the different data sources selected for our case study and the set of experiments conducted on the processed features. First, in Section 2 we discuss how we obtained the different sets of sensor measurements and vehicle transits and briefly describe the available information. In Section 2.2 we enumerate the different processing steps carried out to prepare the data for our tests. In Section 2.3, we introduce our testing methodology, listing the models employed for training and the motivation behind the choices. Lastly, in Section 2.4 we outline our testing environment, enumerating the tools and frameworks adopted, and in Section 2.5 we provide a thorough description of the different experiments and how they were implemented.

Data Sources and Acquisition
In this case study we used information collected in three years, from 2013 to 2016, limited to the central areas of Milan. In particular, we collected three different data sources: (i) meteorological data from different sensor types, such as temperature, humidity, pressure and wind speed, (ii) traffic data derived from the passage of vehicles recorded from fixed video cameras in a belt surrounding the city center, and (iii) the ground-truth pollutant trends, obtained from different monitoring stations mounted in fixed crucial points.
Regarding meteorological data, we obtained the sensor logs of seven different weather stations, mainly distributed around the borders of the city, as shown in Figure 1. The monitoring platforms and their respective data are provided by Agenzia Regionale per la Protezione dell'Ambiente (ARPA) Lombardia, the reference institution in the region for environmental protection. This and other kind of datasets have recently become freely available for download on the regional Open Data portal (https://www.dati.lombardia.it/). Logs are provided with yearly subdivisions in standard CSV files, where each one stores data corresponding to a single sensor, defined by three fields: the unique ID for the sensor, a timestamp of the measurement in local time and the corresponding value measured. Any other additional information related to the sensors is in fact explained by a descriptor file first indicating the unique ID, name and position of the specific platform, then enumerating the different sensors of which it is equipped, specifying their unique ID, type, their measurement interval, their unit of measure and the operator used to normalize the logs into a time series with regular intervals between samples. From these descriptors it is possible to trace the sensors back to the respective measured weather feature and thus group them in six different types, displayed in Table 1. Except for the pressure category that only presents a single sensor, every other class of meteorological feature is well represented by at least three time series, in the three-year period we analysed. In terms of data format, every sensor class utilises a standard measurement unit and contains records with a regular resolution of an hour, obtained by averaging the aggregation of the raw measurements. The only exception is represented by the precipitations category, which instead provides logs aggregated with a cumulative sum, indicating the total amount of millimeters of rain which fell every hour.  A similar procedure concerned the measurements of the pollutants. Data were once again provided by ARPA, freely available for download through the above-mentioned portal, and were subdivided into file descriptors, containing name, location and equipped sensors of the monitoring stations, together with individual CSV files containing yearly measurements of the individual sensors, identified by a unique ID. The positioning of these cells is visible in Figure 1. In total, exploring the three-year period considered for this work, we analysed 10 different pollutant categories, each one measured by at least one sensor in the central area of Milan.
The count of unique measurements varies greatly with the severity and the intrinsic relevance of the pollutant for the definition of an air quality index, which is typically identified by a combination of NO 2 , O 3 and particulate matter, as described in Section 2.3. Specifically, nitrogen oxides (NO 2 , NO x ) are represented, respectively, by 8 and 7 unique sensors distributed around the city center, C6 H 6 and CO by 4, O 3 and particulate matter (PM 10 , PM 2.5 ) by 3, BC by 2. Lastly, SO 2 and NH 3 are represented by a single measurement only.
As for the meteorological data, logs for the pollutants provided by ARPA correspond to an hourly aggregation by average, with the exception of particulate matter (PM 10 and PM 2.5 ), where measurements are stored with daily interval, representing a 24-h mean value. While this is sub-optimal given the consistency of the remaining data, it does not represent a major issue and it is somewhat expected since the vast majority of air quality indicators only make use of a daily average for particulate matter.
The last feature type employed in this work is represented by recorded hourly transits in the congestion charge surrounding the center of Milan. In this case, data is provided by Agenzia Mobilità Ambiente e Territorio (AMAT) (https://www.amat-mi.it/) and encompasses the C Area, corresponding to the urban region of Cerchia dei Bastioni. The area is accessible by a total of 42 gates, displayed in Figure 1, each one monitored by fixed cameras continuously monitoring and recording the plates of any vehicle entering the city center. The logs were cross-referenced with data from the regional Department of Motor Vehicles, in order to obtain, for each plate, a list of characteristics of the corresponding vehicle, including fuel type and potential category of European emission standard. The complete dataset is subdivided into three main CSV files: the first containing information about the monitoring gates, the second providing detailed information about vehicles and the last one providing the transit records in the city center, during the three-year period. Altogether, these files contain information about four main features: (i) the emission standard category, from Euro 0 to Euro 6; (ii) the fuel type, namely petrol, diesel, electricity, gas or hybrid; (iii) the category of vehicle, public transport, cargo or normal cars; and (iv) additional information such as whether the vehicle was authorized into the area, or whether it is a service vehicle or if it belongs to residents.

Data Preprocessing and Analysis
As introduced in the previous section, the available data appears scattered in many different physical files or presents different formats and configurations. In order to align the features into a single common representation for model training, we carried on an extensive multi-step preprocessing phase, combining (i) data aggregation (ii) data cleaning (iii) imputation and (iv) feature engineering. Considering the first point, aggregation was necessary for two main reasons: first and foremost, the collected data described above unfortunately presents a large number of randomly missing values, without any distinguishable pattern in terms of time periods with the exception of the transits, which include none of all the available features in four specific intervals, as described below. Second, the amount of sensors and their distribution on the territory was extremely limited in terms of covered surface, certainly not enough to conduct a proper spatial analysis. A bilinear interpolation of the values over time is typically applied to simulate a continuous distribution of the features, such as in U-Air [12], but this option was also discarded again because of the lack of data in many of the sensors equipped by the monitoring stations, for both weather and pollutants, as shown in Table 2. Therefore, in order to drastically reduce the amount of missing data while maintaining a coherent ground truth, we opted for a single time series for each sensor type, aggregating by timestamp and averaging when more than one value was present over the same timestamp. This operation is justified by the fact that, given the relative proximity of sensors, most records present extremely similar trends over time. The results, in terms of data coverage, are reported again in Table 2.  Table 3. Missing periods in the transits dataset with respective length. A similar process was carried out on traffic-related features: first, data from vehicle details were merged with transit records and subsequently aligned with the time resolution of the available sensors by summing the passages occurred in the same hour. Then, since spatial information was discarded in previous sets to reduce sparsity, transits from all the gates was summed up in order to obtain once again a single time series for each of the vehicle-related features. Therefore, the final aggregated dataset contains the sum of all the vehicle transits at any gate and during the same hour for every traffic-related feature, from the emission standards (euro-0 to euro-6) to vehicle types and fuel types.
Concerning the data cleaning phase, every available meteorological and aerial feature was checked for noticeable outliers. Because of the hourly aggregation of the original sensor logs and the aforementioned spatial reduction, little to no effort was required to check and limit the values into a reasonable range. Additionally, most of the feature sets already appear in predefined ranges, such as the humidity provided in percentage, while others inherently describe out of the ordinary yet crucial situations, such as the rainfall amount, which can only be checked for impossible values (e.g., negative records). During this phase we also discarded those features that still contained a large amount of missing data, namely wind direction which contained a single year of data out of three, and dropped transit entries representing unknown values (euro_na, fuel_na, vehicle_na). The latter were substituted with a single additional feature named total, representing the total amount of transits, regardless of the vehicle type. An analogous procedure was carried out on target variables. On the basis of the former analysis, we decided to exclude from the study both sulphur dioxide (SO 2 ) and ammonia (NH 3 ) measurements for three main reasons: first and foremost, both pollutants were represented by a single ground-truth sensor each which contained a large portion of missing data, as shown in Table 2. Second, the low correlation displayed by the two pollutants, visible in Figure A1, cannot justify a possible imputation phase over their respective time series, especially with an high percentage of data to be estimated. Third, our main objective is the estimation of an Air Quality Index where both categories are not among the required pollutants for its computation, thus not essential for our case study.
The aggregation procedure reduces the amount of missing information, nevertheless many features still remain incomplete or even unchanged in the particular case of transits, where the spanned missing intervals were the same across every record, as shown in Figure 2 and Table 3. While this does not represent a problem in most cases, it is preferable to maintain a continuous time series, especially for experiments with sequence-based models such as deep Recurrent Neural Networks. For consistency, we therefore opted for a domain-based imputation phase, during which existing features from the same domain (namely weather, pollutants and transits) are exploited to estimate missing values in the others. This decision was once again motivated by the extreme trend similarities among sensors of the same category, as reflected by the correlation matrices displayed in Appendix A. In our work, we adopted an hybrid approach: considering the time series representing the evolution of a single feature and an empirically determined threshold of six hours, if a continuous sequence of missing values remains below the threshold, we simply impute by polynomial interpolation (with grade 2). Otherwise, we apply an iterative statistical imputation strategy, where the features with the least amount of missing values are estimated first, using all the others as input with a round-robin process. This allows for theoretically better results since at every step we maximise the amount of ground-truth information provided to the model. In our case, we exploited a linear regression approach with Bayesian Ridge regularisation. In the particular case of transits, an iterative procedure could not be applied since every feature was missing in the same intervals. Nevertheless, the same interpolation approach is employed for periods below the threshold, while for all the remaining ones we adopted a statistical approach for time series forecasting. Specifically, given the extreme regularity and the multiple seasonalities of transits, we used a Trigonometric, Box-Cox Transformation, ARMA Errors, Trend and Seasonal Components (TBATS) method to generate the periodical components, then scaled the result by mean and variance of the surrounding context in order to avoid large discontinuities between imputed and ground-truth values. This technique was applied to every missing period except the last one, as the time range of 52 days was exceedingly long for any forecast or estimation to be considered reliable. Instead, we opted for truncating the three-year span into a shorter albeit continuous testing period for every feature, starting from 01/03/2013 until 24/10/2015. This in practice excludes the last part of the dataset, but guarantees a full and regular time series.
Subsequently, we carried out the feature engineering phase. This operation is not strictly required, especially for deep sequence-based models, but can be extremely beneficial for simpler linear models. Specifically, we augmented the merged data set with (i) temporal features, (ii) lagged features and (iii) aggregated information. In the first case, for each record we introduced specific information about the month, day of the week and time of day, encoded using trigonometric functions in order to maintain their inherent cyclical trend. In practice, we exploited the decomposition documented in Equation (1) to generate the described pair of values, using the previously mentioned time intervals.
In the formulas, f represents the selected feature (month, day of the week or hour of the day), P f represents the time period required for a complete cycle, in this case respectively 12, 7 and 24. Additionally, we also employed a standard Radial Basis Function φ rb f i for each month i, shown again in Equation (1). This allows the models to capture average trend information in the period highlighted by the Gaussian function. As last time-related feature, we added a simple dummy variable set to 1 when the given day represented a public holiday and 0 otherwise.
Limited to PM 10 and PM 2.5 , we conducted an additional processing phase generating a parallel dataset by further reducing every feature to a daily resolution by averaging over the 24-h window. This is only required for particulate matter since, as reported in Table 2, it is the only category of target variables with daily resolution. While upscaling the respective time series could be considered as a viable option, the behaviour of particulate matter did not present strong correlations when compared with other air pollutants. Therefore, an harmonization procedure with the remaining features would have required different heavy assumptions on the daily trends of particulate matter that were simply not possible given the available data.
In order to assess which features are potentially significant for an accurate estimate of pollutants, we investigated the Pearson correlation coefficient [16] by means of a correlation matrix, a standard measure to quantify linear correlation between pairs of variables. The coefficient can assume values in the range [−1, +1], with −1 indicating total negative correlation, +1 total positive correlation and 0 no correlation at all. In Figure 3, correlation coefficients among target pollutants and feature variables are reported. It can be observed that, despite the weak values, most pollutants present a positive trend with the increase of transits. This is especially noticeable with particulate matter (PM 10 and PM 2.5 ), where the correlation is stronger. With reference to the aforementioned aggregation, we must point out that the reported correlations with particulate matter refer to the dataset reduced to daily resolution, while all the other measures refer to the average hourly records.
Confirming the statements reported in Section 1, meteorological features appear to be the most correlated with the targets, in particular temperature, wind speed and radiation negatively influence the trend of pollutants, while pressure and humidity present a weak positive correlation. In contrast with the other dependent variables, the Ozone (O 3 ) shows an opposite behaviour in most cases: it appears for instance strongly correlated with temperature, radiation and wind speed, while also noticeably correlated with humidity in a negative way. Another unexpected result is represented by the complete absence of linear correlation between rain and pollutants. Despite the weak coefficients, we maintained the feature, as the influence of rainfall on air pollutants is seldom immediate and may typically occur in the following hours or days, thus the apparent lack of linear correlation.
Likewise, we investigated the autocorrelation for each of the hourly pollutants, with the aim of defining suitable time windows for the estimation procedure. This measure can be defined as the correlation between a signal and its copy delayed over time, thus having the same range [−1, +1]. As observable in Figure 4, every pollutant presents a high autocorrelation in a 72-h window, with strong spikes on the 24th and 48th mark. Therefore, limited to the target variables with hourly resolution, we augmented the set with lagged features selecting two time windows of 24 and 48 h. While a longer window could have been beneficial, these intervals should still allow for a thorough analysis over the importance of lagged variables in the estimation, and at the same time avoiding a feature explosion in the datasets. We note that this procedure is not required for recurrent models such as Long Short-Term Memory (LSTM) [17], as the inherently sequential architecture allows for time windows of arbitrary lengths, nevertheless we applied the same criteria by forcing the latter to assume the same values of the two intervals chosen. We also point out that this analysis did not take into consideration particulate matter because of the daily resolution of the signals. In this case, we opted for two simpler time windows with a lag of one and two days respectively.

Methodology
The main objective of this study is the estimation of an Air Quality Index (AQI) in an urban environment, exploiting data related to vehicle transits and meteorological features. While the process can be defined as a standard classification task using AQI levels as target categories, we opted instead for the definition of a regression problem over pollutants in order to assess the validity of the solution on each target variable. Moreover, the latter configuration can be reduced to a classification problem by computing the AQI over the regression result. The AQI estimate is computed in two steps: (i) independent estimation of each pollutant (NO 2 , NO x , C 6 H 6 , CO, BC, O 3 at hourly resolution, and PM 10 , PM 2.5 at daily resolution), and (ii) computation of the AQI according to the official European formula, by using the estimates of each pollutant. The approaches proposed for the aforementioned steps were evaluated with different dataset configurations, considering weather and traffic features at: (i) time t, (ii) from time t − 24 to time t, and (iii) from t − 48 to time t.
Formally, we can define this multi-step prediction pipeline as follows: given a set of aggregated features from two different domains, namely meteorological measurements φ m and vehicle transit counts φ v , temporal features φ t , and a machine learning model g with parameters θ, the objective is to first provide an estimateŷ t i , 1 ≤ i ≤ n for each pollutant i and time step t on the test set, such that where w indicates a specific time window. Obtained the estimateŝ y t i , ∀i ∧ t, the goal is to evaluate a single global measure of air quality over the same data for each time step, named Air Quality Index (AQI), defined as k t = AQI(y t 1 , y t 2 , ..., y t m ) where m ≤ n is a subset of the original pollutants and k ∈ [0, K] is a single integer value indicating the gravity of the air pollution. Specifically, we adopted a common European Air Quality Index (CAQI) which provides five different levels of severity based on three pollutants, namely NO 2 , O 3 and PM 10 , as detailed in Section 2.5.
Given the data described above, we trained the same set of regressors on every available feature for each pollutant. Specifically, we selected for this work four models with different characteristics: (i) a linear regressor with Bayesian Ridge Regularization, (ii) a Neural Network Using Bayesian Regularization (BRNN) (iii) a Random Forest Regressor (RF) [18], an ensemble model using decision trees as base estimators, and (iv) a Long-Short Term Memory (LSTM) model [19].
The first linear model can be seen as a standard Ordinary Least Squares (OLS) algorithm, with the addition on a regularization process that minimizes the weight estimates, therefore reducing the influence of strong outliers in most situations. In this particular case, the regularization uses a probabilistic approach similar to ridge regression, where the weights are assumed to follow a normal distribution around zero. Linear models represent the best compromise between performance and efficiency, for this reason they are typically employed as robust baseline [20].
Together with the linear model, we included a simple Neural Network with Bayesian Regularisation (BRNN) as second baseline. The motivation behind this model is twofold: first of all, previous work [15] had already shown encouraging results using this architecture. Second, it provided a good comparison between standard multi-layer perceptrons and recurrent neural networks.
The third approach selected, Random Forest, belongs to the category of ensemble models based on bagging. The latter technique involves the separate training of n weaker models, in this case standard decision trees, where the strong model is represented by the whole group and the regression output is obtained by averaging the results of each individual estimator. Formally, given a training set S, the bagging procedure generates n new sets S i such that |S i | < |S| by sampling the original data. Subsequently, n individual models M i are trained on each S i and their results are aggregated by averaging in regression tasks, or by majority voting in classification problems. Despite the high demand in terms of computational resources and training time, Random Forests (and bagging in general) typically provide robust solutions without suffering from overfitting like simpler regression trees, thanks to the sampling procedure and ensemble learning. Because of their versatility and resilience to outliers, RF models have already been successfully applied in many different regression tasks, from calibration of air pollutant sensors [21] to air quality estimation in urban environments [22,23].
Lastly, we employed a Deep Recurrent Neural Network (RNN), specifically a LSTM model. In general, RNNs inherently handle input sequences with varying length; however, the latter is particularly suited for the task given the internal architecture. LSTM units are in fact able to capture both long-term patterns and short-term variations thanks to a combined mechanism of input gates, where the information content from new examples is merged with the internal state of the network, and forget gates, where the decision of whether to keep or discard information from previous states is taken. Given their high adaptability to sequential inputs, LSTM models have been successfully applied to time series forecasting and estimation of air pollutants in different domains [17,24,25]. For our work, we made use of a LSTM network structured with three hidden layers with dimension 100, followed by a simple linear layer with a single output, corresponding to a specific estimated pollutant.

Frameworks and Tools
In this section we enumerate and briefly describe tools and hardware used in order to conduct the preprocessing and analysis steps addressed in Section 2.2 and the experiments described in the following paragraphs of this manuscript. As data were provided in CSV format, every preliminary phase from acquisition to aggregation and cleaning was carried out using a scientific Python 3.6 environment through a combination of pandas [26] and numpy [27] libraries, with the support of the statsmodels package [28] for time series analysis and matplotlib for data visualization. For the data imputation phase and the following experiments, we leveraged the popular scikit-learn library [29], which offers rich functionalities in many different machine learning domains, from data preprocessing to the metrics computation, and provides out-of-the-box robust implementations for many popular models and algorithms. Specifically, together with the previously mentioned packages, we used this library for data normalization, definition of the cross validation, other testing setups described in the next section and implementation of the Bayesian Ridge linear regressor. As deep learning counterpart, we employed the PyTorch framework [30], another extensive library providing a wide variety of features. In particular, the latter was leveraged for the implementation and training of the LSTM regressor. A detailed description of software libraries and respective versions is provided in Table 4. In order to carry out the experiments with the BRNN model, we also leveraged an R environment considering the lack of equivalent Python implementations. In this case, we maintained the same processing pipeline using the libraries listed in the right section of Table 4 to reproduce pandas and numpy functionalities, then we trained the model provided by the namesake package brnn. Once trained, the results were stored and evaluated in the Python pipeline, in an identical manner to the other solutions.
All the procedures and experiments described in this paper were performed on a Linux workstation equipped with an Intel Core i9-7940X processor with a base frequency of 3.10GHz and a total of 14 cores, 128GB of RAM and 4× Nvidia GTX 1080Ti video cards, with CUDA 10.1 capabilities.

Experiments
In summary, the main goal of this work is AQI classfication. We tackled this problem by first defining a regression task on each individual pollutant, then merging the results using the index thresholds and computing an overall performance on air quality estimation. Therefore, in the following paragraphs the discussion will first focus on regression results and then move to the classification problem, analysing the estimates of the different models employed. The first regression task was carried out with different evaluation runs performed on each pollutant independently, using the same setup for each target variable with the only exception of particulate matter. In the latter configuration, the training procedure remained the same described below, with the only change of dataset employed, as stated in Section 2.2. For simplicity, in the following paragraphs we refer to the original set of time series with hourly resolution as Hourly Set (HSet), while referring to the daily-averaged data as Daily Set (DSet).
Despite the definition of a standard regression task, special care must be taken in the event of time series data. First, features in the training and test subdivisions are inherently dependent on time, threfore a random selection of a portion of samples cannot be considered a viable option as the temporal dependency must be respected. This is also crucial for training the LSTM model or any recurrent network in general, given their sequential nature.
Additionally, both meteorological trends, transit counts and consequently pollutant measurements are bound to change in the longer period. Considering the available information spanning almost three years, we had to take into account that even the same models may present extremely different results, depending on the training intervals and the test windows selected.
In order to assess the performance of the aforementioned models,we identified 3 folds of equal size over the three year interval, each one subsequently divided into training, validation and test sets. We maintained a continuous timeline in the second subdivision as well, by keeping the first 70% as training, 10% as validation and the remaining 20% as test data. The presence of a validation set is useful to assess the number of iterations required to maximize the performance without overfitting, moreover provides a clear separation between train and test data, avoiding any possible time dependency. Every model was then fitted and evaluated independently and the final results were obtained by averaging the scores over the three splits.
We trained the linear model using the default tolerance threshold t = 1 × 10 −4 , the Random Forest model using 100 base estimators with maximum depth set to 8 to further reduce overfitting and the LSTM model using two layers of 100 hidden cells, using a dropout between hidden and linear layer to improve the generalization, with a drop rate set to p = 0.5. For this last training configuration we used an Adam optimizer with learning rate λ = 1 × 10 −3 , iterating for 20 epochs with a batch size of 32. Lastly, the BRNN was initialized using the parameters defined in previous work in order to maintain a comparable setup.
In every testing scenario we employed a Mean Squared Error loss, then we computed on every model two common regression metrics in order to better assess and compare the results. Specifically, we computed the Root Mean Squared Error (RMSE), which better represents the actual error on the test set and the Symmetric Mean Average Percentage Error (sMAPE), which is an error-based measure expressed in percentage and therefore unbound from any value range. Formally these measures can be expressed as in Equation (2).
We precise that our current formulation for the sMAPE does not correspond to a percentage, but we kept a simpler [0, 1] range, where 1 = 100% error rate, to improve readability. Completed and evaluated the regression problem, the AQI estimation performance still needs to be investigated. For this task, only a subset of the pollutants need to be taken into consideration, which typically consist of Nitrogen Dioxide (NO 2 ), Ozone (O 3 ) and particulate matter, in most cases referring to PM 10 . Regarding the AQI, the European Environment Agency (EEA) defined a common metric, named Common Air Quality Index (CAQI), which is described by five different thresholds and computed on the pollutants listed above [31]. A subset of the complete table, including global indices and thresholds for individual pollutants are reported in Table 5. The CAQI can be computed with both hourly and daily concentrations. In order to compute the global index, the maximum value among pollutant concentrations must be taken.
Since our data contained hourly concentrations for every pollutant except particulate matter, we computed the hourly version using the daily mean of PM 10 , using the ad hoc index provided by the CAQI. In order to evaluate the regression estimates through the AQI, we first computed the ground-truth indices using the real pollutant trends in the test set, then applied the same transformation to the estimated time series produced by the trained models. We defined the latter task as a standard multi-class categorization problem, employing the F1-score as metric for its robustness against unbalanced classes.

Results
In this section, the results obtained from the aforementioned tasks are presented. The discussion will first focus on regression results, analysing the advantages and drawbacks of the employed techniques over each pollutant and time window, for both Hourly Set and Daily Set, subsequently the AQI prediction estimates will be presented and assessed.
As summarized in Table 6, the results on the regression task over the pollutants with hourly resolution is in line with expectations. The performance on the Hourly Set without lagged variables, displayed on the columns marked with w = 0, highlights on average better results for simpler models: in every case, the best values in terms of RMSE were achieved by the linear model or the Random Forest regressor, with the only exception of NO x where the BRNN obtained a slightly smaller error. The second configuration, indicated by w = 24, displays results obtained with lagged variables over a 24-h period. In this case, the longer window allowed the models to reach better estimates in most cases, with a relevant edge of RF in half of the considered pollutants. Here the exceptions are represented by the benzene (C 6 H 6 ), where the linear model outperformed the other solutions, and the ozone (O 3 ), where the LSTM surpassed the linear regressor by a small margin. These values might be explained by the higher autocorrelation of these two pollutants on the 24-hour mark, visible in Table 4, together with the strong periodic patterns of the ozone influenced by solar radiation that could favour sequence-based solutions like LSTM. Further increasing the time interval, as summarized in the last column (w = 48), does not seem to introduce a relevant advantage over previous solutions, as the error rates computed on the new estimates appear roughly the same. Again, the only exception is represented by the LSTM model, that obtained better results for the majority of pollutants, but most importantly consistently reduced its error with respect to the previous 24-h configuration. Not surprisingly, a deep recurrent neural network can outperform standard machine learning techniques taking advantage of the sequential structure of the data; however, we note that, even in this case, the results are only marginally better than, for instance, the linear regressor. This could be very well caused by the strong limitation on the time interval considered, nevertheless a longer window did not represent a viable comparison because of the feature explosion caused by the introduction of lagged variable in the other configurations.
The results over the Daily Set are shown in Table 7. In this case, we carried out the same evaluation procedure, running the models over the test set and computing the error metrics for each pollutant and estimator, in three different time windows corresponding to lags of one and two days, which are comparable with the 24 and 48-h shifts in the Hourly Set. Unfortunately, due to the lack of data and lower resolution of the available signals, the results are far from optimal, when compared to the hourly counterpart. However, the RMSE values presented are consistent with the data distribution, displayed in Table 8, and both linear and Random Forest regressors display good performances in the setups with window size w = 0 and w = 1. In a similar manner to the Hourly Set, further increasing the interval explored did not translate into better estimates. In the specific case of LSTM, the small dataset with daily resolution was not enough to take advantage of the time series structure and the resulting performance are on par with simpler and faster models. Given the regression estimates and their performances described in previous paragraphs, we analysed the results on the AQI classification task. In Table 9 the F1-scores resulted from the AQI computation over ground-truth and predicted pollutants are summarised, where each subsection refers to the same configuration of time intervals already applied on the previous tasks. In every subsection, the first columns represent a pollutant p and each row represents a model m. Consequently, each cell contains the F1-score of the Air Quality sub-index (Table 5) obtained from the ground truth of p and the estimates produced my m, on the same testing period of p. The last two columns of every block summarise instead the global AQI, defined as CAQI = max(I NO 2 , I O 3 , I PM 10 ), where each I k represents a sub-index. Table 9. F1-scores obtained on the sub-indices of each pollutant, on the global indices CAQI and CAQI* (without PM 10 ), using the same configurations with time windows equal to zero, one and two days. The sub-indices are consistent with the performances described in the regression task: I NO 2 reached on average affordable F1-score, with the Random Forest obtaining the best score around 0.75, while O 3 was best predicted by the LSTM with F1-Score of 0.92. Reflecting the inferior regression performances on the previous analysis, worse performances were instead achieved on I PM 10 by every model in most cases. A few exceptions comprise the models trained on the 48-h window, where the F1-scores on average exceed the 0.5 threshold, and the linear model on the 24-h configuration. This is again expected, as the lack of data negatively influenced the regression estimates, thus undermining the classification results as well.
Observing the global AQI results, presented in the penultimate column, we can observe the same result degradation caused by the bad PM 10 estimates generated by some estimators. For this very reason, we introduced a last column identified with CAQI * that presents the same global CAQI calculations, excluding the last pollutant, specifically CAQI * = max(I NO 2 , I O 3 ). Despite its lack of practical use, this external measure allows for better assessments of the information loss (or gain) brought by the additional pollutant. From this last column it is possible to observe that, in the vast majority of the combinations, a decent AQI estimation is possible, within limits. Moreoever, the introduction of a lagged interval can be beneficial in some cases, especially for sequence-based solutions such as LSTM. Nevertheless, as already verified within the regression problem, a window size of 24 is sufficient to raise the estimates by a 0.2 margin in terms of F1-score. On the other hand, standard machine learning models do not benefit as much from the introduction of lagged features, since for instance the RF solution can perform equally well in all the three interval setups.
Both the linear model on w = 24 and the LSTM on w = 48 obtain very good CAQI estimates around 0.8, thanks to better performances over the single pollutants, but most importantly thanks to better estimations over PM 10 . Comparing the global score to the adjusted AQI, we can see that the absence of I PM 10 drastically reduces the results. The confusion matrices derived by these two solutions are reported in Figure 5 (linear and LSTM indicated by a and b respectively), together with an ensemble solution (c): this last configuration was achieved by selecting the best regressors over each pollutant. Specifically, we selected the RF regressor with 48-h configuration for I NO 2 , the LSTM with -hour configuration for I O 3 and lastly the linear model over 24-hour period as estimator for I PM 10 . This last ensemble allowed us to achieve a total F1-score = 0.8388, surpassing the single-model solutions achieved so far. In order to justify the extreme disparity between these higher scores and the extremely low estimates achieved in some cases, we note that the AQI results are strongly influenced by factors: (i) the air quality thresholds in Table 5 are not equally spaced, but the intervals between categories become wider with the increase of the pollutant measurements; (ii) the computed AQI presents a decreasing granularity and inherent class imbalance; in particular, the first levels are represented by a small fraction of samples, while higher levels contain the majority of the data points. This is clearly visible in the confusion matrices, where the AQI level 4 (Highrisk) contains more than 3000 instances.

Conclusions
This work presents a simple yet effective approach to air quality estimation in urban areas from local environmental an traffic information. We demonstrated that good estimates can be provided for most pollutants by only exploiting meteorological features, such as temperature and humidity, and urban-related features, such as vehicle transits, thanks to the strong correlations among them. Moreover, we thoroughly tested different classes of machine learning models, specifically a linear regressor and a Multi-Layer Perceptron with Bayesian regularization, a Random Forest regressor and a Long-Short Term Memory network. We demonstrated that decent results can still be achieved by simpler models, provided that a robust preprocessing and engineering phase is carried out on the available data. In fact, the linear model achieved scores comparable to more complex solutions such as RF and LSTM, especially when considering shorter window dimensions. For instance, the linear model obtained an F-measure of 0.64, 0.91 and 0.83 over the individual AQI sub-indices, that amounted to 0.82 F1-score for the global AQI in the 24-h window configuration. Notably, the linear model surpassed any other solution by a large margin when considering daily measurements of PM 10 and PM 2.5 , where training data was scarce. For these reasons, this solution can be considered a robust lightweight option in the case of limited resources. Nevertheless, when the latter are available, deep recurrent neural networks such as LSTM can provide a relevant edge in terms of estimation performance, as their inherently sequential structure allow for the examination of longer time periods. This can lead to equal or even better results than standard approaches, as demonstrated by the consistently lower RMSE in most hourly pollutants over the 48-h windows. Despite the heavier computational demands, LSTM does not require extensive preprocessing phases such as the introduction of lagged variables, as the sequence length can be trivially modified. Therefore, LSTM-based solution can be considered a valid and versatile option for many different use cases.
Compared to most part of the literature where time series of air pollutants are included in the estimation for future forecasts, our approach provides results that are almost on par with other works. However, it has the advantage of providing AQI estimates in a completely sensor-free scenario, thus suggesting a an alternative methodology that can first of all reduce the costs for professional equipment and its maintenance. Furthermore, this technique can be applied in any urban region, after a prior calibration based on a small ground truth of sensor data.
Future works will involve satellite monitoring to expand the predictions on larger areas. For this task, coarser but more comprehensive aerial information could be exploited, such as data related to air pollutants provided by the Copernicus Sentinel-5P mission by Copernicus, a European project in collaboration with ESA for atmosphere monitoring. The latter could allow a more fine-grained estimation on larger areas, without the need for detailed on-ground measurements.