2.1. Data Sources and Acquisition
In this case study, we used information collected over three years, from 2013 to 2016, limited to the central areas of Milan. In particular, we collected three different data sources: (i) meteorological data from different sensor types, such as temperature, humidity, pressure and wind speed; (ii) traffic data derived from the passage of vehicles recorded by fixed video cameras on a belt surrounding the city center; and (iii) the ground-truth pollutant trends, obtained from different monitoring stations installed at fixed key locations.
Regarding meteorological data, we obtained the sensor logs of seven different weather stations, mainly distributed around the borders of the city, as shown in
Figure 1. The monitoring platforms and their respective data are provided by Agenzia Regionale per la Protezione dell’Ambiente (ARPA) Lombardia, the regional reference institution for environmental protection. This and other kinds of datasets have recently become freely available for download on the regional Open Data portal (
https://www.dati.lombardia.it/). Logs are provided in yearly subdivisions as standard CSV files, each one storing the data of a single sensor, defined by three fields: the unique ID of the sensor, the timestamp of the measurement in local time and the corresponding measured value. Any additional information related to the sensors is provided by a descriptor file, which first indicates the unique ID, name and position of the specific platform, then enumerates the different sensors with which it is equipped, specifying their unique ID, type, measurement interval, unit of measure and the operator used to normalize the logs into a time series with regular intervals between samples. From these descriptors it is possible to trace each sensor back to the respective measured weather feature and thus group them into six different types, displayed in
Table 1. Except for the pressure category, which features a single sensor only, every other class of meteorological feature is represented by at least three time series over the three-year period we analysed. In terms of data format, every sensor class uses a standard measurement unit and contains records at a regular hourly resolution, obtained by averaging the raw measurements. The only exception is the precipitation category, whose logs are instead aggregated with a cumulative sum, indicating the total millimeters of rain fallen in each hour.
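As an illustration of how these logs and descriptors can be combined into per-sensor hourly series, a minimal pandas sketch is shown below; all file names and column headers (arpa_sensor_1234_2013.csv, arpa_descriptor.csv, sensor_id, type, unit) are hypothetical placeholders rather than the actual ARPA export schema.

```python
import pandas as pd

# Hypothetical file and column names: ARPA exports one CSV per sensor per year,
# plus a descriptor file mapping sensors to stations and feature types.
logs = pd.read_csv("arpa_sensor_1234_2013.csv",
                   names=["sensor_id", "timestamp", "value"],
                   parse_dates=["timestamp"])
descriptor = pd.read_csv("arpa_descriptor.csv")  # sensor_id, station_id, type, unit, ...

# Attach the weather feature type (temperature, humidity, ...) to every measurement
logs = logs.merge(descriptor[["sensor_id", "type", "unit"]], on="sensor_id", how="left")

# One column per sensor, one row per hourly timestamp
wide = logs.pivot_table(index="timestamp", columns="sensor_id", values="value")
```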
A similar procedure concerned the measurements of the pollutants. Data were once again provided by ARPA, freely available for download through the above-mentioned portal, and were subdivided into descriptor files, containing the name, location and equipped sensors of the monitoring stations, together with individual CSV files containing the yearly measurements of each sensor, identified by a unique ID. The positioning of these stations is visible in
Figure 1. In total, over the three-year period considered for this work, we analysed 10 different pollutant categories, each one measured by at least one sensor in the central area of Milan.
The count of unique measurements varies greatly with the severity and the intrinsic relevance of the pollutant for the definition of an air quality index, which is typically identified by a combination of NO2, O3 and particulate matter, as described in Section 2.3. Specifically, nitrogen oxides (NO2 and NO) are represented by 8 and 7 unique sensors, respectively, distributed around the city center, benzene (C6H6) and a further pollutant by 4, another together with particulate matter (PM10 and PM2.5) by 3, and one more by 2. Lastly, sulphur dioxide (SO2) and ammonia (NH3) are each represented by a single measurement only.
As for the meteorological data, the pollutant logs provided by ARPA correspond to an hourly aggregation by average, with the exception of particulate matter (PM10 and PM2.5), whose measurements are stored at daily intervals, representing a 24-h mean value. While this is sub-optimal given the consistency of the remaining data, it does not represent a major issue and is somewhat expected, since the vast majority of air quality indicators only make use of a daily average for particulate matter.
The last feature type employed in this work is the hourly count of recorded transits into the congestion charge area surrounding the center of Milan. In this case, data are provided by Agenzia Mobilità Ambiente e Territorio (AMAT) (
https://www.amat-mi.it/) and encompass Area C, corresponding to the urban region of Cerchia dei Bastioni. The area is accessible through a total of 42 gates, displayed in
Figure 1, each one equipped with fixed cameras that continuously record the plates of any vehicle entering the city center. The logs were cross-referenced with data from the regional Department of Motor Vehicles in order to obtain, for each plate, a list of characteristics of the corresponding vehicle, including fuel type and, where available, the European emission standard category. The complete dataset is subdivided into three main CSV files: the first contains information about the monitoring gates, the second provides detailed information about the vehicles and the last one records the transits into the city center during the three-year period. Altogether, these files describe four main features: (i) the emission standard category, from Euro 0 to Euro 6; (ii) the fuel type, namely petrol, diesel, electricity, gas or hybrid; (iii) the vehicle category, namely public transport, cargo or private cars; and (iv) additional information, such as whether the vehicle was authorized to enter the area, whether it is a service vehicle, or whether it belongs to a resident.
2.2. Data Preprocessing and Analysis
As introduced in the previous section, the available data appear scattered across many different physical files and present different formats and configurations. In order to align the features into a single common representation for model training, we carried out an extensive multi-step preprocessing phase, combining (i) data aggregation, (ii) data cleaning, (iii) imputation and (iv) feature engineering. Considering the first point, aggregation was necessary for two main reasons: first and foremost, the collected data described above unfortunately present a large number of randomly missing values, without any distinguishable pattern in terms of time periods, with the exception of the transits, for which all the available features are simultaneously missing in four specific intervals, as described below. Second, the number of sensors and their distribution on the territory was extremely limited in terms of covered surface, certainly not enough to conduct a proper spatial analysis. A bilinear interpolation of the values is typically applied to simulate a continuous distribution of the features over the area, such as in U-Air [
12], but this option was also discarded again because of the lack of data in many of the sensors equipped by the monitoring stations, for both weather and pollutants, as shown in
Table 2. Therefore, in order to drastically reduce the amount of missing data while maintaining a coherent ground truth, we opted for a single time series for each sensor type, aggregating by timestamp and averaging when more than one value was present over the same timestamp. This operation is justified by the fact that, given the relative proximity of sensors, most records present extremely similar trends over time. The results, in terms of data coverage, are reported again in
Table 2.
A similar process was carried out on the traffic-related features: first, the vehicle details were merged with the transit records and subsequently aligned with the time resolution of the available sensors by summing the passages that occurred within the same hour. Then, since spatial information had been discarded in the previous sets to reduce sparsity, the transits from all the gates were summed up in order to obtain, once again, a single time series for each of the vehicle-related features. Therefore, the final aggregated dataset contains, for every traffic-related feature, from the emission standards (Euro 0 to Euro 6) to vehicle and fuel types, the sum of all the vehicle transits at any gate during the same hour.
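A minimal sketch of this merge-and-sum step is reported below; the file names and columns (transits.csv, vehicles.csv, plate, euro_class) are illustrative assumptions rather than the actual AMAT schema.

```python
import pandas as pd

# Illustrative schema: transits.csv (plate, gate_id, timestamp),
# vehicles.csv (plate, euro_class, fuel_type, vehicle_type).
transits = pd.read_csv("transits.csv", parse_dates=["timestamp"])
vehicles = pd.read_csv("vehicles.csv")

# Attach the vehicle characteristics to every recorded passage
transits = transits.merge(vehicles, on="plate", how="left")

# Hourly counts per emission class, summed over all gates
hourly_counts = (transits
                 .groupby([pd.Grouper(key="timestamp", freq="1H"), "euro_class"])
                 .size()
                 .unstack("euro_class", fill_value=0))
```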
Concerning the data cleaning phase, every available meteorological and air quality feature was checked for noticeable outliers. Because of the hourly aggregation of the original sensor logs and the aforementioned spatial reduction, little to no effort was required to check and limit the values to a reasonable range. Additionally, most of the feature sets already come in predefined ranges, such as humidity, provided as a percentage, while others inherently describe out-of-the-ordinary yet crucial situations, such as the rainfall amount, which can only be checked for impossible values (e.g., negative records). During this phase we also discarded the features that still contained a large amount of missing data, namely wind direction, which covered only a single year of data out of three, and dropped the transit entries representing unknown values (euro_na, fuel_na, vehicle_na). The latter were replaced with a single additional feature named total, representing the total amount of transits regardless of the vehicle type. An analogous procedure was carried out on the target variables. On the basis of the former analysis, we decided to exclude from the study both sulphur dioxide (SO2) and ammonia (NH3) measurements, for three main reasons: first and foremost, both pollutants were represented by a single ground-truth sensor each, which contained a large portion of missing data, as shown in
Table 2. Second, the low correlation displayed by the two pollutants, visible in
Figure A1, cannot justify a possible imputation phase over their respective time series, especially with such a high percentage of data to be estimated. Third, our main objective is the estimation of an Air Quality Index, and neither category is among the pollutants required for its computation, thus they are not essential for our case study.
The aggregation procedure reduces the amount of missing information; nevertheless, many features still remain incomplete, or even unchanged in the particular case of transits, where the missing intervals span the same periods across every record, as shown in
Figure 2 and
Table 3. While this does not represent a problem in most cases, it is preferable to maintain a continuous time series, especially for experiments with sequence-based models such as deep Recurrent Neural Networks. For consistency, we therefore opted for a domain-based imputation phase, during which existing features from the same domain (namely weather, pollutants and transits) are exploited to estimate missing values in the others. This decision was once again motivated by the extreme trend similarities among sensors of the same category, as reflected by the correlation matrices displayed in
Appendix A.
In our work, we adopted a hybrid approach: considering the time series representing the evolution of a single feature and an empirically determined threshold of six hours, if a continuous sequence of missing values remains below the threshold, we simply impute it by polynomial interpolation (of degree 2). Otherwise, we apply an iterative statistical imputation strategy, in which the features with the least amount of missing values are estimated first, using all the others as input in a round-robin process. This allows for theoretically better results, since at every step we maximise the amount of ground-truth information provided to the model. In our case, we exploited a linear regression approach with Bayesian Ridge regularisation. In the particular case of transits, an iterative procedure could not be applied, since every feature was missing over the same intervals. Nevertheless, the same interpolation approach is employed for periods below the threshold, while for all the remaining ones we adopted a statistical approach for time series forecasting. Specifically, given the extreme regularity and the multiple seasonalities of the transits, we used a Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (TBATS) method to generate the periodical components, then scaled the result by the mean and variance of the surrounding context in order to avoid large discontinuities between imputed and ground-truth values. This technique was applied to every missing period except the last one, as its time range of 52 days was exceedingly long for any forecast or estimation to be considered reliable. Instead, we opted for truncating the three-year span into a shorter albeit continuous period for every feature, starting from 01/03/2013 until 24/10/2015. This in practice excludes the last part of the dataset, but guarantees a full and regular time series.
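The per-domain part of this hybrid strategy can be sketched with pandas and scikit-learn as follows; function names are our own, the gap-length handling is an approximation of the rule described above, and the TBATS step used for transits is omitted.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

GAP_THRESHOLD = 6  # hours

def fill_short_gaps(series: pd.Series, max_gap: int = GAP_THRESHOLD) -> pd.Series:
    # Degree-2 polynomial interpolation; 'limit' approximates the rule of only
    # filling sequences of missing values shorter than the threshold.
    return series.interpolate(method="polynomial", order=2,
                              limit=max_gap, limit_area="inside")

def impute_domain(df: pd.DataFrame) -> pd.DataFrame:
    # Iterative imputation within a single domain (weather, pollutants or transits),
    # estimating the least-incomplete features first and using the others as input.
    short_filled = df.apply(fill_short_gaps)
    imputer = IterativeImputer(estimator=BayesianRidge(),
                               imputation_order="ascending",
                               random_state=0)
    return pd.DataFrame(imputer.fit_transform(short_filled),
                        index=df.index, columns=df.columns)
```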
Subsequently, we carried out the feature engineering phase. This operation is not strictly required, especially for deep sequence-based models, but can be extremely beneficial for simpler linear models. Specifically, we augmented the merged data set with (i) temporal features, (ii) lagged features and (iii) aggregated information. In the first case, for each record we introduced specific information about the month, day of the week and time of day, encoded using trigonometric functions in order to maintain their inherent cyclical trend. In practice, we exploited the decomposition documented in Equation (
1) to generate the described pair of values, using the previously mentioned time intervals.
In the formulas, f represents the selected feature (month, day of the week or hour of the day), while T represents the time period required for a complete cycle, in this case 12, 7 and 24, respectively. Additionally, we also employed a standard Radial Basis Function (RBF) for each month i, shown again in Equation (1). This allows the models to capture average trend information in the period highlighted by the Gaussian function. As a last time-related feature, we added a simple dummy variable set to 1 when the given day is a public holiday and 0 otherwise.
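A possible implementation of these time-related features is sketched below; the RBF width, the mid-month centres and the empty holiday flag are illustrative placeholders, not the exact values used in this work.

```python
import numpy as np
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    # df is assumed to be indexed by an hourly DatetimeIndex.
    out = df.copy()
    idx = out.index
    for name, values, period in [("month", idx.month, 12),
                                 ("weekday", idx.dayofweek, 7),
                                 ("hour", idx.hour, 24)]:
        values = np.asarray(values)
        out[f"{name}_sin"] = np.sin(2 * np.pi * values / period)
        out[f"{name}_cos"] = np.cos(2 * np.pi * values / period)
    # One Gaussian bump roughly centred on each month (width chosen arbitrarily here)
    day_of_year = np.asarray(idx.dayofyear)
    for i in range(12):
        centre = 15 + 30 * i
        out[f"rbf_month_{i + 1}"] = np.exp(-((day_of_year - centre) ** 2) / (2 * 30.0 ** 2))
    # Placeholder holiday flag: a real public-holiday calendar would be used here
    out["is_holiday"] = 0
    return out
```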
Limited to PM10 and PM2.5, we conducted an additional processing phase, generating a parallel dataset by further reducing every feature to a daily resolution through averaging over the 24-h window. This is only required for particulate matter since, as reported in Table 2, it is the only category of target variables with daily resolution. While upscaling the respective time series could be considered a viable option, the behaviour of particulate matter did not present strong correlations with the other air pollutants. Therefore, a harmonization procedure with the remaining features would have required several strong assumptions on the daily trends of particulate matter that were simply not possible given the available data.
In order to assess which features are potentially significant for an accurate estimate of pollutants, we investigated the Pearson correlation coefficient [
16] by means of a correlation matrix, a standard measure to quantify linear correlation between pairs of variables. The coefficient can assume values in the range [−1, 1], with −1 indicating total negative correlation, +1 total positive correlation and 0 no correlation at all. In Figure 3, the correlation coefficients among target pollutants and feature variables are reported. It can be observed that, despite the weak values, most pollutants present a positive trend with the increase of transits. This is especially noticeable with particulate matter (PM10 and PM2.5), where the correlation is stronger. With reference to the aforementioned aggregation, we must point out that the reported correlations with particulate matter refer to the dataset reduced to daily resolution, while all the other measures refer to the average hourly records.
Confirming the statements reported in
Section 1, meteorological features appear to be the most correlated with the targets; in particular, temperature, wind speed and radiation negatively influence the trend of pollutants, while pressure and humidity present a weak positive correlation. In contrast with the other dependent variables, ozone (O3) shows an opposite behaviour in most cases: it appears, for instance, strongly positively correlated with temperature, radiation and wind speed, while noticeably negatively correlated with humidity. Another unexpected result is the complete absence of linear correlation between rain and pollutants. Despite the weak coefficients, we maintained the feature, as the influence of rainfall on air pollutants is seldom immediate and typically occurs over the following hours or days, hence the apparent lack of linear correlation.
Likewise, we investigated the autocorrelation of each of the hourly pollutants, with the aim of defining suitable time windows for the estimation procedure. This measure can be defined as the correlation between a signal and a copy of itself delayed in time, and thus shares the same [−1, 1] range. As observable in
Figure 4, every pollutant presents a high autocorrelation within a 72-h window, with strong spikes at the 24-h and 48-h marks. Therefore, limited to the target variables with hourly resolution, we augmented the set with lagged features, selecting two time windows of 24 and 48 h. While a longer window could have been beneficial, these intervals should still allow for a thorough analysis of the importance of lagged variables in the estimation, while at the same time avoiding a feature explosion in the datasets. We note that this procedure is not required for recurrent models such as Long Short-Term Memory (LSTM) [17], as the inherently sequential architecture allows for time windows of arbitrary length; nevertheless, we applied the same criterion by constraining the recurrent models to the same two chosen intervals. We also point out that this analysis did not take particulate matter into consideration, because of the daily resolution of the signals. In this case, we opted for two simpler time windows with a lag of one and two days, respectively.
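The lag-augmentation step can be sketched as follows; column and function names are our own.

```python
import pandas as pd

def add_lagged_features(df: pd.DataFrame, columns, lags=(24, 48)) -> pd.DataFrame:
    # Append copies of the selected hourly columns shifted by 24 and 48 hours.
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    # The first max(lags) rows now contain NaNs and are dropped before training.
    return out.dropna()
```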
2.3. Methodology
The main objective of this study is the estimation of an Air Quality Index (AQI) in an urban environment, exploiting data related to vehicle transits and meteorological features. While the process can be defined as a standard classification task using AQI levels as target categories, we opted instead for the definition of a regression problem over the pollutants, in order to assess the validity of the solution on each target variable. Moreover, the latter configuration can be reduced to a classification problem by computing the AQI over the regression results. The AQI estimate is computed in two steps: (i) independent estimation of each pollutant (nitrogen oxides, O3, CO and C6H6 at hourly resolution, and PM10 and PM2.5 at daily resolution), and (ii) computation of the AQI according to the official European formula, using the estimates of each pollutant. The approaches proposed for the aforementioned steps were evaluated with different dataset configurations, considering weather and traffic features: (i) at time t only, (ii) from time t−24 to time t, and (iii) from time t−48 to time t.
Formally, we can define this multi-step prediction pipeline as follows: given a set of aggregated features from two different domains, namely meteorological measurements M and vehicle transit counts V, temporal features T, and a machine learning model g with parameters $\theta$, the objective is to first provide an estimate $\hat{y}_{i,t}$ for each pollutant i and time step t on the test set, such that $\hat{y}_{i,t} = g(M_{t-w:t}, V_{t-w:t}, T_{t-w:t}; \theta)$, where w indicates a specific time window. Once the estimates $\hat{y}_{i,t}$ are obtained, the goal is to evaluate a single global measure of air quality over the same data for each time step, named Air Quality Index (AQI), defined as
$$\mathrm{AQI}_t = \max_{p \in P} I_p\left(\hat{y}_{p,t}\right),$$
where P is a subset of the original pollutants, $I_p$ maps the concentration of pollutant p to its severity level, and $\mathrm{AQI}_t$ is a single integer value indicating the gravity of the air pollution. Specifically, we adopted the common European Air Quality Index (CAQI), which provides five different levels of severity based on three pollutants, namely NO2, O3 and PM10, as detailed in Section 2.5.
Given the data described above, we trained the same set of regressors on every available feature for each pollutant. Specifically, we selected for this work four models with different characteristics: (i) a linear regressor with Bayesian Ridge regularization, (ii) a Neural Network using Bayesian Regularization (BRNN), (iii) a Random Forest Regressor (RF) [
18], an ensemble model using decision trees as base estimators, and (iv) a Long Short-Term Memory (LSTM) model [
19].
The first model can be seen as a standard Ordinary Least Squares (OLS) algorithm with the addition of a regularization process that shrinks the weight estimates, therefore reducing the influence of strong outliers in most situations. In this particular case, the regularization uses a probabilistic approach similar to ridge regression, where the weights are assumed to follow a normal distribution around zero. Linear models represent the best compromise between performance and efficiency; for this reason, they are typically employed as a robust baseline [
20].
Together with the linear model, we included a simple Neural Network with Bayesian Regularisation (BRNN) as second baseline. The motivation behind this model is twofold: first of all, previous work [
15] had already shown encouraging results using this architecture. Second, it provided a good comparison between standard multi-layer perceptrons and recurrent neural networks.
The third approach selected, Random Forest, belongs to the category of ensemble models based on bagging. The latter technique involves the separate training of n weaker models, in this case standard decision trees, where the strong model is represented by the whole group and the regression output is obtained by averaging the results of each individual estimator. Formally, given a training set S, the bagging procedure generates n new sets $S_1, \dots, S_n$ such that $|S_i| = |S|$, by sampling the original data with replacement. Subsequently, n individual models $g_1, \dots, g_n$ are trained, one on each $S_i$, and their results are aggregated by averaging in regression tasks, or by majority voting in classification problems. Despite the high demand in terms of computational resources and training time, Random Forests (and bagging in general) typically provide robust solutions without suffering from overfitting like simpler regression trees, thanks to the sampling procedure and ensemble learning. Because of their versatility and resilience to outliers, RF models have already been successfully applied in many different regression tasks, from the calibration of air pollutant sensors [
21] to air quality estimation in urban environments [
22,
23].
Lastly, we employed a Deep Recurrent Neural Network (RNN), specifically a LSTM model. In general, RNNs inherently handle input sequences with varying length; however, the latter is particularly suited for the task given the internal architecture. LSTM units are in fact able to capture both long-term patterns and short-term variations thanks to a combined mechanism of input gates, where the information content from new examples is merged with the internal state of the network, and forget gates, where the decision of whether to keep or discard information from previous states is taken. Given their high adaptability to sequential inputs, LSTM models have been successfully applied to time series forecasting and estimation of air pollutants in different domains [
17,
24,
25]. For our work, we made use of an LSTM network structured with three hidden layers of dimension 100, followed by a simple linear layer with a single output, corresponding to a specific estimated pollutant.
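A PyTorch sketch of this architecture is reported below; the class name and the dropout rate are placeholders of ours, while the layer structure follows the description above.

```python
import torch.nn as nn

class PollutantLSTM(nn.Module):
    # Stacked LSTM layers of size 100 followed by a linear layer with a single
    # output, corresponding to one estimated pollutant.
    def __init__(self, n_features: int, hidden_size: int = 100, num_layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True,
                            dropout=0.2)  # placeholder rate
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):            # x: (batch, window, n_features)
        output, _ = self.lstm(x)     # output: (batch, window, hidden_size)
        return self.head(output[:, -1, :]).squeeze(-1)  # estimate at the last step
```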
2.4. Frameworks and Tools
In this section we enumerate and briefly describe tools and hardware used in order to conduct the preprocessing and analysis steps addressed in
Section 2.2 and the experiments described in the following paragraphs of this manuscript. As data were provided in CSV format, every preliminary phase from acquisition to aggregation and cleaning was carried out using a scientific Python 3.6 environment through a combination of
pandas [
26] and
numpy [
27] libraries, with the support of the
statsmodels package [
28] for time series analysis and
matplotlib for data visualization. For the data imputation phase and the following experiments, we leveraged the popular
scikit-learn library [
29], which offers rich functionality in many different machine learning domains, from data preprocessing to metrics computation, and provides robust out-of-the-box implementations of many popular models and algorithms. Specifically, together with the previously mentioned packages, we used this library for data normalization, the definition of the cross-validation and of the other testing setups described in the next section, and the implementation of the Bayesian Ridge linear regressor. As the deep learning counterpart, we employed the PyTorch framework [
30], another extensive library providing a wide variety of features. In particular, the latter was leveraged for the implementation and training of the LSTM regressor. A detailed description of software libraries and respective versions is provided in
Table 4. In order to carry out the experiments with the BRNN model, we also leveraged an R environment, given the lack of an equivalent Python implementation. In this case, we maintained the same processing pipeline, using the libraries listed in the right section of Table 4 to reproduce the pandas and numpy functionalities, and then trained the model provided by the namesake brnn package. Once trained, the results were stored and evaluated in the Python pipeline, in an identical manner to the other solutions.
All the procedures and experiments described in this paper were performed on a Linux workstation equipped with an Intel Core i9-7940X processor with a base frequency of 3.10 GHz and a total of 14 cores, 128 GB of RAM and Nvidia GTX 1080Ti video cards with CUDA 10.1 capabilities.
2.5. Experiments
In summary, the main goal of this work is AQI classification. We tackled this problem by first defining a regression task on each individual pollutant, then merging the results using the index thresholds and computing an overall performance on air quality estimation. Therefore, in the following paragraphs the discussion will first focus on the regression results and then move to the classification problem, analysing the estimates of the different models employed. The first regression task was carried out with different evaluation runs performed on each pollutant independently, using the same setup for each target variable, with the only exception of particulate matter. In the latter configuration, the training procedure remained the same as described below, the only change being the dataset employed, as stated in
Section 2.2. For simplicity, in the following paragraphs we refer to the original set of time series with hourly resolution as Hourly Set (HSet), while referring to the daily-averaged data as Daily Set (DSet).
Despite the definition of a standard regression task, special care must be taken when dealing with time series data. First, features in the training and test subdivisions are inherently dependent on time; therefore, a random selection of a portion of the samples cannot be considered a viable option, as the temporal dependency must be respected. This is also crucial for training the LSTM model, or any recurrent network in general, given their sequential nature.
Additionally, meteorological trends, transit counts and, consequently, pollutant measurements are all bound to change over longer periods. Considering that the available information spans almost three years, we had to take into account that even the same model may yield extremely different results, depending on the training intervals and the test windows selected.
In order to assess the performance of the aforementioned models, we identified 3 folds of equal size over the three-year interval, each one subsequently divided into training, validation and test sets. We maintained a continuous timeline in the second subdivision as well, keeping the first portion as training data, the following one as validation and the remaining part as test data. The presence of a validation set is useful to assess the number of iterations required to maximize performance without overfitting; moreover, it provides a clear separation between train and test data, avoiding any possible time dependency. Every model was then fitted and evaluated independently, and the final results were obtained by averaging the scores over the three splits.
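A sketch of this chronological splitting scheme is shown below; the train/validation fractions used here are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

def temporal_folds(df: pd.DataFrame, n_folds: int = 3,
                   train_frac: float = 0.7, val_frac: float = 0.15):
    # Split the timeline into contiguous folds of equal size, each divided
    # chronologically into train / validation / test subsets.
    for fold in np.array_split(df.sort_index(), n_folds):
        n = len(fold)
        train_end = int(n * train_frac)
        val_end = int(n * (train_frac + val_frac))
        yield fold.iloc[:train_end], fold.iloc[train_end:val_end], fold.iloc[val_end:]
```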
We trained the linear model using the default tolerance threshold, the Random Forest model using 100 base estimators with a maximum depth of 8 to further reduce overfitting, and the LSTM model using two layers of 100 hidden cells, with dropout between the hidden and linear layers to improve generalization. For this last configuration, we used an Adam optimizer, iterating for 20 epochs with a batch size of 32. Lastly, the BRNN was initialized using the parameters defined in previous work, in order to maintain a comparable setup.
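For reference, the scikit-learn side of this configuration could be sketched as follows; any setting not reported above keeps the library default.

```python
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

linear_model = BayesianRidge()                                       # default tolerance threshold
forest_model = RandomForestRegressor(n_estimators=100, max_depth=8)  # capped depth against overfitting
EPOCHS, BATCH_SIZE = 20, 32                                          # LSTM training-loop constants
```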
In every testing scenario we employed a Mean Squared Error loss; we then computed, for every model, two common regression metrics in order to better assess and compare the results. Specifically, we computed the Root Mean Squared Error (RMSE), which better represents the actual error on the test set, and the Symmetric Mean Absolute Percentage Error (sMAPE), an error-based measure expressed as a percentage and therefore unbound from any value range. Formally, these measures can be expressed as in Equation (
2).
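Assuming the standard definitions, with the sMAPE kept in a [0, 1] range as noted below, these metrics can be written as:
$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^{2}},
\qquad
\mathrm{sMAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|y_t - \hat{y}_t\right|}{\left|y_t\right| + \left|\hat{y}_t\right|}.
$$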
We point out that our current formulation of the sMAPE does not correspond to a percentage: we kept the simpler [0, 1] range, where a value of 1 corresponds to a 100% error rate, to improve readability. Having completed and evaluated the regression problem, the AQI estimation performance still needs to be investigated. For this task, only a subset of the pollutants needs to be taken into consideration, typically consisting of nitrogen dioxide (NO2), ozone (O3) and particulate matter, in most cases referring to PM10. Regarding the AQI, the European Environment Agency (EEA) defined a common metric, named Common Air Quality Index (CAQI), which is described by five different thresholds and computed on the pollutants listed above [
31]. A subset of the complete table, including the global indices and the thresholds for the individual pollutants, is reported in
Table 5.
The CAQI can be computed with both hourly and daily concentrations. In order to compute the global index, the maximum among the values obtained for the individual pollutants must be taken.
Since our data contained hourly concentrations for every pollutant except particulate matter, we computed the hourly version of the index, using the daily mean of PM10 through the ad hoc index provided by the CAQI definition. In order to evaluate the regression estimates through the AQI, we first computed the ground-truth indices using the real pollutant trends in the test set, then applied the same transformation to the estimated time series produced by the trained models. We defined the latter task as a standard multi-class categorization problem, employing the F1-score as the metric for its robustness against unbalanced classes.
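As an illustration of this final step, the sketch below maps each estimated concentration to its severity level and takes the maximum; the breakpoint grids are placeholders and must be replaced with the actual thresholds reported in Table 5.

```python
import numpy as np
import pandas as pd

# Placeholder breakpoint grids (concentration thresholds per pollutant);
# the real values are those of Table 5.
BREAKPOINTS = {
    "no2":  [0, 50, 100, 200, 400],
    "o3":   [0, 60, 120, 180, 240],
    "pm10": [0, 25, 50, 90, 180],
}

def caqi_subindex(concentration: float, grid) -> int:
    # Map a concentration onto the 1-5 severity scale defined by its grid.
    return int(np.searchsorted(grid, concentration, side="right"))

def hourly_caqi(estimates: pd.DataFrame) -> pd.Series:
    # Global index = maximum sub-index over the considered pollutants.
    sub = pd.DataFrame({p: estimates[p].map(lambda c, g=g: caqi_subindex(c, g))
                        for p, g in BREAKPOINTS.items()})
    return sub.max(axis=1)
```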