A Traffic-Based Method to Predict and Map Urban Air Quality

As global urbanization, industrialization, and motorization continue to worsen air quality, a continuous rise in health problems is projected. The limited spatial resolution of air quality information inhibits full comprehension of urban population exposure. Therefore, we propose a method to predict urban air pollution from traffic by extracting data from Web-based applications (Google Traffic). We apply a machine learning approach by training a decision tree algorithm (C4.8) to predict the concentration of PM2.5 during the morning pollution peak from: (i) an interpolation (inverse distance weighting) of the values registered at the monitoring stations, (ii) traffic flow, and (iii) traffic flow + time of the day. The results show that the prediction from traffic outperforms the one provided by the monitoring network (average accuracy of 65.5% for the former vs. 57% for the latter). Adding the time of day increases the accuracy by an average of 6.5 percentage points. Considering the good accuracy on different days, the proposed method seems to be robust enough to create general models able to predict air pollution from traffic conditions. This affordable method, although beneficial for any city, is particularly relevant for low-income countries, because it offers an economically sustainable technique to address air quality issues faced by the developing world.


Introduction
As the world evolves towards global urbanization, 56% of its cities (population over 100,000) in developed countries and 98% in low- and middle-income countries violate the World Health Organization's (WHO) recommendations for air quality [1]. As a result, air pollution has become the number one environmental risk, accountable for seven million premature deaths worldwide every year [2]. Furthermore, future projections estimate that these numbers will double by 2050 [3].
Among the regulated atmospheric pollutants, such as criteria gases (carbon monoxide, CO; nitrogen oxides, NOx; sulfur dioxide, SO2; and ozone, O3) and particles, the most complex is fine particulate matter (PM), PM2.5 (aerodynamic diameter ≤2.5 µm). As it can originate directly and indirectly from anthropogenic activities, such as traffic, industries, and so forth, PM2.5 is a good indicator of the overall air quality and is useful to estimate the health impacts of air pollution exposure, due to its well-known respiratory and cardiovascular health effects [4,5]. While in high- to mid-income countries the concentrations of this pollutant are decreasing, due to strict environmental regulations, PM pollution is worsening in the developing countries, as the demand for power and transportation in urban areas grows [6-9]. In addition, old technologies and poor quality fuel in South America contribute to the fact that the major share of fine PM comes from traffic [10,11]. Due to the health

Pollution Measurement
For street-level PM2.5 pollution mapping, a central area (approximately 2.5 km × 2.8 km) containing busy traffic avenues, a main city highway, secondary residential streets, and two parks was chosen. PM2.5 concentration scans were performed over four days (30 May and 3, 12, and 18 June 2019) during the morning rush hours (8:00-10:00). These data were expected to yield the most conservative and representative model of the worst air quality conditions, which matters for health implications in urban areas. A portable real-time CEL-712 Microdust Pro monitor [29] was paired with a GPS device [30]. The Microdust Pro sensor, based on a near-forward-angle light scattering technique, was calibrated before the experiment using zero air and a known concentration filter (164 mg m−3). In addition, the validity of this portable particle sensor was confirmed by collocating it for 8 h (battery life) with an instrument using the Environmental Protection Agency approved method (EQPM-1102-150), showing a good correlation (R = 0.86) [31]. This automated Thermo Scientific 5014i Beta Continuous Ambient instrument forms part of the air quality and meteorological (Vaisala WXT536) monitoring station (Belisario) in the center of the experimental site (Figure 1c). The background concentrations of PM2.5 and the meteorological data were downloaded from this station, described elsewhere [10,28]. The portable PM2.5 monitor and the GPS device were synchronized to record at a 5-s time step. Both instruments were held at a height of 1.5 m, with the particle inlet facing forward, while walking on the sidewalk in the direction of the traffic flow. The approximate sampling speed was 4 km/h.

Traffic Measurement
An additional mobile phone application, developed by the group, was used to register the traffic conditions on the basis of the Google Maps Traffic tool. Traffic quality in this application is represented by four colors: Green (fast-flowing traffic), Yellow (slower traffic with more vehicles), Red (more congested traffic), and Dark red (the most congested or completely stopped traffic). These data were collected by registering the traffic color of each road segment along the covered path (5-s time step). In addition, in a separate experiment on 19 June 2019, traffic speed and category were registered while driving along the main city avenues. Subsequently, the obtained data were used to produce PM2.5 pollution maps in QGIS software, using the Inverse Distance Weighting (IDW) function. IDW estimates unknown pollutant concentrations by interpolation, assuming that closer values are more related than more distant ones.
To obtain a simplified, high-resolution urban pollution model based on traffic conditions, we started with a thorough analysis of the real traffic and the available traffic applications. First, the vehicle velocities measured while driving around the city of Quito were compared with the Google Traffic categories. Then, traffic speeds reported by the mobile application Waze were compared with the Google Traffic categories. Unfortunately, in the Ecuadorian capital, Waze coverage is limited and mostly available on the major avenues, possibly due to a small number of users. However, through random sampling in different parts of the city over a few months before and during the study (April-June 2019), we collected enough Waze-based data in parallel with Google Traffic to compare road travel velocity with the Google Traffic categories. Finally, a correlation analysis between PM2.5 concentrations and Google Traffic velocity categories was performed. These data were used for a hierarchical cluster analysis, which produces a dendrogram: a treelike diagram that summarizes the clustering process, where similar variables are joined by lines whose vertical length reflects the Euclidean distance between them. We used the function hclust in R to find the cutoff distance on the y-axis that best separates the four levels of traffic into two clusters.
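The clustering step can be illustrated with a small agglomerative routine. This is only a sketch: the study used hclust in R, whereas the Python code below re-implements complete linkage (the default method of hclust) on one-dimensional values, and the mean PM2.5 value attached to each traffic color is invented for illustration.

```python
from itertools import combinations

def complete_linkage(points, n_clusters):
    """Agglomerative clustering of named 1-D values using complete linkage."""
    clusters = [[name] for name in points]
    merges = []  # (merge distance, left members, right members)
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: distance between the two farthest members.
            d = max(abs(points[x] - points[y])
                    for x in clusters[a] for y in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical mean PM2.5 (µg m−3) per traffic color, NOT measured values.
mean_pm25 = {"green": 30.0, "yellow": 44.0, "red": 48.0, "dark_red": 33.0}

# Cutting the tree at two clusters mirrors the two traffic groups.
groups, merges = complete_linkage(mean_pm25, n_clusters=2)
print(groups)
```

The merge distances stored in `merges` correspond to the vertical line lengths of a dendrogram; stopping at two clusters plays the role of the cutoff distance on the y-axis.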

Decision Trees Algorithm
The predictive models were built using a family of machine learning methods called decision trees. The chosen algorithm was C4.5 [32]. Besides its high classification performance, this method offers an easy tree-based visualization that facilitates the interpretation of the model output. C4.5 follows a top-down, recursive divide-and-conquer strategy. An attribute to split on is selected at the root node, and a branch is created for each possible attribute value. This operation splits the instances into subsets, one for each branch that extends from the root node. The procedure is then repeated recursively for each branch, selecting an attribute at each node and using only the instances that reach that branch to make the selection. The best attribute for each node is the one that produces the purest split. To find it, the heuristic used in C4.5 is based on information theory and the quantification of entropy, which measures information in bits (logarithms are therefore taken to base 2) over the probabilities of the possible outcomes (p), as described in Equation (1).

$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n \qquad (1)$$

The idea is to know how much information is gained by knowing the value of an attribute. The information gain is the entropy of the distribution before the split minus the entropy of the distribution after the split. At each node, the attribute selected is the one that provides the highest information gain. The process is repeated until the end of the tree is reached (pure nodes) or until a maximum depth that preserves the readability of the model is attained. The program J48 from the machine learning workbench Weka was used as the implementation of the algorithm C4.8 (an upgraded version of C4.5) to train and test the models.

Appl. Sci. 2020, 10, 2035

In this study, three types of models were created from four days of measurements: 30 May 2019, and 3, 12, and 18 June 2019.
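The entropy of Equation (1) and the resulting information gain can be sketched in a few lines of Python. This is an illustrative re-implementation, not the Weka J48 code, and the toy traffic groups and pollution labels are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of the class distribution in `labels` (Equation (1))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    after = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - after

# Hypothetical instances: traffic group (1 or 2) and PM2.5 class.
traffic = [1, 1, 1, 2, 2, 2, 1, 2]
pm_class = ["low", "low", "low", "high", "high", "high", "high", "low"]

print(entropy(pm_class))                    # balanced classes give 1 bit
print(information_gain(traffic, pm_class))  # gain from splitting on traffic
```

With balanced classes the entropy before the split is exactly 1 bit, so any attribute that separates the classes better than chance yields a positive gain; the attribute with the largest gain would be chosen at the node.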
The first one is based on an IDW to predict the concentration of PM2.5 at street level. The values used for this calculation are provided by the four closest monitoring stations of the Secretariat of the Environment: Belisario (elev. 2835 m.a.s.l., coord. 78°29′24″ W, 0°10′48″ S), Centro (elev. 2820 m.a.s.l., coord. 78°30′36″ W, 0°13′12″ S), Cotocollao (elev. 2739 m.a.s.l., coord. 78°29′50″ W, 0°6′28″ S), and El Camal (elev. 2840 m.a.s.l., coord. 78°30′36″ W, 0°15′00″ S) (see Figure 1b). The stations are located about 4-6 km apart from each other (study radius of about 8 km). This method provides an estimate of the pollution concentration in which the influence of each measurement is inversely related to the distance from the monitoring station [33]. IDW assumes that each measured point has a local influence that diminishes with distance. It gives greater weights to the points closest to the prediction location, and the weights diminish as a function of distance, as described in Equation (2).

$$Z_p = \frac{\sum_{i=1}^{n} Z_i / d_i^{\,p}}{\sum_{i=1}^{n} 1 / d_i^{\,p}} \qquad (2)$$

Here Z_p stands for the interpolated value of pollution, Z_i for the actual values measured at the monitoring stations, n for the number of stations considered (here n = 4), and d_i for the distance between monitoring station i and a given geolocation point. Note that different powers (p) can be applied to the distance. The p value defines the smoothness of the interpolation. Increasing p strengthens the influence of the closest known values on the concentration gradient. For instance, p = 2 provides values that are more localized and less averaged out than p = 1.
The two other types of models are based on a closer but indirect measurement of the pollution: traffic intensity and time of the day. The collection and the cleansing of these data are described in the next section.
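The IDW interpolation used by the first type of model can be sketched as follows. The code is a minimal illustration of the standard inverse-distance-weighted mean; the planar coordinates (in km) and station values are invented, and the real maps were produced with the IDW function of QGIS.

```python
from math import hypot

def idw(target, stations, power=2):
    """Inverse-distance-weighted estimate at `target`.

    `stations` is a list of (x, y, value) tuples in the same planar units
    as `target`; `power` is the exponent p applied to the distance.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in stations:
        d = hypot(x - target[0], y - target[1])
        if d == 0:
            return value  # the target coincides with a station
        weight = 1.0 / d ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Two hypothetical stations 2 km apart with PM2.5 of 20 and 40 µg m−3.
stations = [(0.0, 0.0, 20.0), (2.0, 0.0, 40.0)]

print(idw((1.0, 0.0), stations))           # midpoint: plain average
print(idw((0.5, 0.0), stations, power=2))  # closer to the first station
print(idw((0.5, 0.0), stations, power=1))  # smoother, more averaged out
```

At the point 0.5 km from the first station, p = 2 gives an estimate nearer to that station's value than p = 1 does, which illustrates why a larger power produces a more localized map.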

Data Preparation and Assessment
To prepare the dataset, the raw data (5-s step) measured at street level with the Microdust Pro were smoothed with a two-minute running average to mitigate the noise created by artefactual events (e.g., the sudden passage of a bus). The concentration of PM2.5 was divided into two classes, low and high, according to the median of the values of each day of measurements. The median value of PM2.5 did not vary significantly from one day to another (mean = 40.7 µg m−3; standard deviation = 7.8 µg m−3) and can, consequently, be considered a standard concentration. Furthermore, this value lies between the national standard (50 µg m−3) and the WHO health recommendation (25 µg m−3) for 24-h PM2.5 concentrations.
Choosing the median as a threshold allowed us to obtain balanced classes (the same number of instances in each class). If the classes are unbalanced, machine learning algorithms tend to favor the majority class (i.e., the class with the highest number of instances), which yields a misleadingly high accuracy by raising the baseline (i.e., the benchmark obtained when the classification is simply based on the majority class). By using the median, in contrast, we ensure that the classification baseline is 50% (random choice between the two possible classes). Thus, the objective is to produce a model whose classification accuracy is significantly better than 50%, with the simplest possible tree (no more than three nodes).
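The smoothing and labeling steps can be sketched as below. Two assumptions are made that the text does not specify: the running average is trailing (it uses the current and preceding samples), and values equal to the median are assigned to the low class. The sample readings are invented.

```python
from statistics import median

STEP_S = 5                 # sampling step reported in the text
WINDOW = 120 // STEP_S     # two minutes of 5-s samples = 24 samples

def running_mean(values, window):
    """Trailing running average; early samples use the data available so far."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the sample leaving the window
        out.append(total / min(i + 1, window))
    return out

def label_by_median(values):
    """Label each value 'high' if above the daily median, otherwise 'low'."""
    m = median(values)
    return ["high" if v > m else "low" for v in values]

# Hypothetical raw readings (µg m−3) with one bus-like spike.
raw = [38.0, 40.0, 41.0, 150.0, 39.0, 42.0, 44.0, 43.0]
smoothed = running_mean(raw, window=3)  # short window only for the demo
labels = label_by_median(smoothed)
print(smoothed)
print(labels)
```

In the study setting the window would be `WINDOW` (24 samples); the short window above is used only so that the effect is visible on eight values.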
The models were tested through a 10-fold cross-validation, which is well suited to relatively small datasets. Cross-validation is a procedure that partitions the data into non-overlapping samples (or folds). Usually, k = 10 folds is chosen, which means that the data are randomly partitioned into 10 equal parts, each containing 10% of the instances. A model is then fit k times. Each time, one of the folds serves as the testing set and the remaining k-1 folds are used as the training set. Consequently, each fold is used once as the testing set, so that a prediction is made for every record in the dataset. The overall performance of the model is then obtained by combining the model's predictions on each of the k testing sets [34]. Equation (3) describes the formula used to calculate the accuracy of the prediction.

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

where TP stands for true positives (PM2.5 concentrations > median) and TN stands for true negatives (PM2.5 concentrations < median). These are the correctly classified instances. The wrongly classified observations, FP and FN, are false positives and false negatives, respectively. Instead of predicting a spectrum of concentrations through a regression technique, this study is interested in a binary discrimination between high levels of contamination, which present a risk for public health, and acceptable levels. This choice is supported by the recommendations of the WHO, which defines a standard threshold, and by related work proposing a machine learning approach to classify air pollution [26,35-38].
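The fold construction and the accuracy of Equation (3) can be sketched as follows. This is a generic k-fold partition written for illustration, not the Weka implementation (which, for example, stratifies folds by class); the confusion-matrix counts in the usage line are invented.

```python
import random

def accuracy(tp, tn, fp, fn):
    """Share of correctly classified instances (Equation (3))."""
    return (tp + tn) / (tp + tn + fp + fn)

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k non-overlapping folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Every record ends up in exactly one testing set.
splits = list(k_fold_indices(25, k=10))
tested = sorted(j for _, test in splits for j in test)
print(tested == list(range(25)))

# Hypothetical counts pooled over the 10 testing sets.
print(accuracy(tp=40, tn=31, fp=14, fn=15))
```

Pooling TP, TN, FP, and FN over the ten testing sets before applying Equation (3) is one way of "combining the model's predictions"; averaging the per-fold accuracies is a common alternative.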

Clustering Analysis
Once the different models were built with the methods described in Section 2.4.1., an unsupervised learning step (clustering) was performed to identify the model that generalizes best. The popular iterative, distance-based clustering algorithm k-Means was chosen, with the Euclidean distance as the metric. First, the desired number of clusters, which is the k value, is specified. Here, k = 2, because two classes of pollution are expected: low (below the median value) vs. high (above the median value) concentration of PM2.5. Second, the algorithm chooses k points at random as cluster centers. Third, all the instances of the dataset are assigned to their closest cluster center. Fourth, the centroid (or mean) of all the instances in each cluster is calculated, and these centroids become the new cluster centers. The algorithm then returns to the assignment step and iterates until the cluster centers no longer change. In other words, the algorithm seeks to minimize the total squared distance from the instances to their cluster centers. The best model is the one that obtains the minimum distance or error. The variables used to perform this clustering analysis were traffic and time of day for the whole-city measurements, together with the PM2.5 prediction made by each model.
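The four k-Means steps described above can be sketched in plain Python. This is an illustrative implementation rather than the tool used in the study, and the instances, given here as hypothetical (traffic value, hour of day) pairs, are invented; in practice the features would normally be scaled before computing Euclidean distances.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k=2, seed=0, max_iter=100):
    """Basic k-Means; returns (centers, clusters, total squared error)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step 3: assign to closest center
            j = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[j].append(p)
        new_centers = [                      # step 4: recompute centroids
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # stop when centers are stable
            break
        centers = new_centers
    sse = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, clusters, sse

# Hypothetical instances: (traffic value, hour of day).
data = [(1.0, 8.0), (1.1, 8.2), (1.0, 8.1),
        (2.0, 9.5), (1.9, 9.6), (2.0, 9.4)]
centers, clusters, sse = kmeans(data, k=2)
print(centers)
print(sse)
```

The returned `sse` is the total squared distance that the algorithm minimizes, which is the quantity the text uses to compare models.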

Urban PM2.5 Concentrations Based on the Air Quality Network
Long-term (2017-2018) average PM2.5 concentration IDW maps for different periods of the day (6:00-11:59, 12:00-16:59, 17:00-20:59, and 21:00-05:59) are presented in Figure 2. The resolution of the air quality information is relatively low, with little spatial variation. Only during the morning hours do the concentrations increase and show some variation, due to the elevated levels in the south of the city. This zone is known for industrial activities and usually shows the highest PM2.5 pollution in the city [10]. This method is useful for representing the general air quality conditions in the city, as the monitoring stations are positioned on elevated platforms 10-20 m above street level. At the same time, people are exposed to street-level pollution, which is highly variable and often higher, implies more serious consequences for health, and therefore must be understood. Besides being lower, the background PM2.5 concentrations measured by the monitoring network are also not very representative of the actual concentrations at an urban scale, owing to mobile-source pollution, which can reach up to six times the levels reported by the air quality network (Figure A1a

Traffic Data Validation and Relationship with PM2.5 Concentrations
A thorough analysis of the real traffic and the available traffic applications is displayed in Figure 3. The comparison of the vehicle velocities measured while driving around the city with the Google Traffic category shows a negative correlation (R = −0.56): velocity decreases as the reported congestion level increases (Figure 3a). A negative correlation (R = −0.74) was also found between the traffic velocity reported in the Waze application and Google Traffic congestion (Figure 3b). The correlations between data from Google Traffic/Waze and actual vehicle velocities suggest that these web-based applications are reliable estimators of the real-time traffic speed in the city of Quito. Throughout the study, based on tens of hours of sampling (Figure 4a), visual observations confirmed that the traffic reported by the Google Traffic application was highly accurate for the main avenues, but not always representative of the secondary residential streets. In several cases of sampling small streets, the application indicated congestion (Google Traffic: Red) although the street had no traffic. This could be due to parked or stopped cars idling at the side of the street. Based on this finding, we decided to focus our study on the main avenues and to avoid secondary residential streets. This is not a significant limitation for assessing health exposure to traffic-based pollution, because main avenues are the principal sources of pollution in a city: they carry the most polluting city transportation, such as diesel-powered bus lines in the case of Quito.
In our study, we also show that the main avenues are highly representative of traffic issues, and most of our sampling data compares well with typical traffic conditions (Figure 4b).
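The correlation coefficients reported in this section can be computed with a standard Pearson formula. The sketch below assumes that the Google Traffic colors are coded as integers (1 = Green to 4 = Dark red) and that R denotes Pearson's coefficient; neither is stated in the text, and the speed values are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical paired observations: congestion code and speed (km/h).
congestion = [1, 1, 2, 2, 3, 3, 4, 4]
speed_kmh = [45, 38, 30, 33, 20, 24, 8, 12]

r = pearson_r(congestion, speed_kmh)
print(r)  # negative: speed drops as the congestion code rises
```

A negative coefficient here has the same reading as in Figure 3: the higher the congestion category, the lower the observed speed.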
PM2.5 concentrations plotted against Google Traffic velocity categories showed an inverse correlation between vehicle velocities and PM2.5 concentrations until a certain critical level of congestion of the traffic (Dark red, Figure 5a).
It can be seen that Green (fast-flowing traffic) and Dark red (very slow or stopped traffic) tend to generate less PM2.5 pollution, due to limited acceleration and braking, than Yellow and Red (slower traffic with more braking and acceleration). The data represented in Figure 5a were used for a hierarchical cluster analysis in the form of a dendrogram. On the x-axis are the variables (traffic levels). Similar variables are joined by lines whose vertical length reflects the Euclidean distance between these variables. The dendrogram shows that Green and Dark red are grouped as a single class that results in lower concentrations of fine particulate matter, whereas Yellow and Red constitute another class producing higher levels of particulate pollution (Figure 5b). Based on these findings, the traffic feature was divided into two categories: group 1 (Green + Dark red) and group 2 (Yellow + Red).

PM2.5 Prediction from Monitoring Stations and IDW
To study the predictive power of the monitoring stations for urban pollution, we compared the IDW interpolation data with the real measurements. We performed a sensitivity analysis on different p values (range from 1 to 5) to identify the power that provides the best prediction. Overall, the difference in accuracy from one p to another is not significant (Table 1). Nevertheless, the remaining analyses focus on the models using p = 2, because they tend to give the best performance. Two out of four models do not provide a prediction that differs significantly from a random choice between low and high concentration of PM2.5: the accuracy of the models for 30 May and 18 June 2019 is 50% in both cases. However, 30 May 2019 was a day with high relative humidity, low solar radiation, and thus low temperature, which may increase the variability in local source mixing (Figure A1e, Appendix A). On the other hand, 18 June 2019 was warmer and, thus, windier (Figure A1h, Appendix A). For 12 June 2019, increased cloudiness caused larger temperature changes (Figure A1g, Appendix A); the model is therefore slightly better (accuracy = 57%) but still below expectations (Figure 6a). The only model that provides a good prediction is that of 3 June 2019 (Figure 6f): accuracy = 71%. This day had relatively constant meteorological conditions during the measurements, pointing to less temporal variation in air pollution. Figure A1b,f demonstrates the relationship between peak PM2.5 concentrations and a decrease in wind speed. Finally, Figure A1a-d (Appendix A) shows that the evolution of the concentrations of fine particulate matter over time differs considerably between the station and street-level measurements for 30 May 2019 and 12 and 18 June 2019. The levels registered by the monitoring stations on these days (several small peaks) are significantly noisier than on 3 June 2019 (clear pollution peak at 9:20).
These results tend to demonstrate that the air quality network provides limited spatial resolution and is suitable only for typical days. A more reliable prediction requires an approach that depends less on the meteorological conditions. The best way to reduce the effect of meteorology is to measure the pollution, directly or indirectly, closer to its source of emission. This is the rationale for monitoring traffic; the resulting models are presented in the next sections.

PM2.5 Prediction from Traffic Only
Four models were built from the four days of collected data. Figure 7 shows good consistency between these models. All of them tend to classify traffic type 1 (fast or completely stopped) as a low source of contamination, whereas traffic type 2 (significant reduction of the vehicle flow) is always identified as a high source of pollution. The value of the split differs slightly from one day to another, except for 3 and 18 June 2019 (b, broken line in Figure 7), where the threshold equals 1.2 on both days. The lower value obtained for 30 May 2019 (a, solid line in Figure 7) could be explained by the fact that secondary streets were included on that day. On 12 June 2019 (c, dashed line in Figure 7), highly variable concentrations of PM2.5 were registered because of variable meteorological conditions (e.g., cloudy weather and thus variations in temperature and humidity, Figure A1g, Appendix A). Considering these limitations, our results suggest that model b (broken line, see Figure 7) is the most representative of the city of Quito during the morning rush hours (worst air quality conditions). The assessment of each model supports this finding. The best performance is obtained for 3 June 2019 (71% accuracy), which outperforms 30 May 2019 (66%), 12 June 2019 (64%), and 18 June 2019 (61%). The relatively lower accuracy of the latter model can be explained by the fact that on 18 June 2019 the weather changed to conditions typical of the dry season (i.e., warm temperatures that cause changes in wind speed, Figure A1h, Appendix A). Increased ventilation tends to raise PM2.5 concentrations in the street canyons through dust resuspension, but it may also disperse the pollution (noise due to ventilation of the anthropogenic PM and suspension of the natural PM) [39,40].
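A traffic-only model of this kind reduces to a single split on the traffic feature. The sketch below applies the 1.2 threshold reported for 3 and 18 June 2019. It assumes that the traffic feature is a smoothed value between 1 (group 1) and 2 (group 2) and that values at or below the threshold are classified as low; both are assumptions, and the instances are invented.

```python
def predict(traffic_value, threshold=1.2):
    """Single-split rule: mostly group 1 traffic -> low, otherwise high."""
    return "low" if traffic_value <= threshold else "high"

def rule_accuracy(traffic_values, labels, threshold=1.2):
    """Share of instances for which the rule matches the observed class."""
    hits = sum(predict(t, threshold) == y
               for t, y in zip(traffic_values, labels))
    return hits / len(labels)

# Hypothetical smoothed traffic values and observed PM2.5 classes.
traffic = [1.0, 1.1, 1.3, 1.8, 2.0, 1.0, 1.6, 1.15]
observed = ["low", "low", "high", "high", "high", "high", "low", "low"]

print([predict(t) for t in traffic])
print(rule_accuracy(traffic, observed))
```

Because the rule has only one node, it stays well within the readability limit of three nodes set for the models.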

PM 2.5 Prediction from Traffic and Time of the Day
Including the time of day in the models improves the accuracy by an average of 6.5% (mean performance of traffic-only models = 65.5%; mean performance of traffic + time of the day = 72%). Here, all the accuracies are greater than or equal to 70%. Three out of four models split first on time of day (Figure 8a,c,d), which means that the temporal factor is a dominant feature for the estimation of the pollution levels. The four models show that the earlier the time, the higher the PM 2.5 concentration. Two main reasons can explain this outcome. First, rush hour occurs before 9:00; since the traffic is denser during this period, the emission of particulate matter increases. Second, the planetary boundary layer (PBL) is low in the early morning and keeps growing throughout the morning as solar radiation intensifies, which, in turn, increases the dilution of PM 2.5 in the atmosphere [36], especially after 10:00 (Figure 8b). These two phenomena account for the two thresholds (around 9:00 and 10:00) at which the time feature is split in the models presented in Figure 8. Regarding the split on the traffic feature, the models present thresholds similar to those in the previous section: traffic type 1 (fluid or completely congested) is a predictor of less contamination than traffic type 2 (slower flow). To sum up, adding the time of day enables the weaker traffic-only models to reach a performance similar to that of the best models (accuracy ≥ 70%), but it does not significantly improve the accuracy of the latter.
Figure 7 (caption, beginning not recovered): ... the cutoff values (in scale of traffic) that permit the best split on the predictor (i.e., traffic) to separate low and high concentrations of PM 2.5.

Model Generalization
To confirm the proposed approach, which consists of a traffic-based model that can be applied on any day during the morning rush hour, the accuracy of the presumed best model (3 June, 2019) was tested on the three other days. For the traffic-based model, the classification accuracy is 64% for 30 May, 2019, 64% for 12 June, 2019, and 61% for 18 June, 2019. Compared with the models trained on the respective day, the loss of accuracy is 2% for 30 May, 2019, 0% for 12 June, 2019, and 0% for 18 June, 2019. This result suggests that the model of 3 June, 2019 can be applied to other days with little or no loss of accuracy, even if they are characterized by different meteorological conditions. For the models based on traffic and time of the day, the accuracies are 68% for 30 May, 2019, 63% for 12 June, 2019, and 58% for 18 June, 2019, i.e., a loss of 4%, 10%, and 12%, respectively, in comparison to the model of the respective day. These outcomes tend to demonstrate that the traffic-based model of 3 June, 2019 is indeed generalizable to other days, which supports the reliability of predicting atmospheric pollution with a machine learning model based on the flow of vehicles in the city. Nevertheless, this generalization shows some limitations when the time factor is added to the model. The fact that a model based solely on traffic is more generalizable than one based on traffic + time of the day has both a mathematical and a physical explanation. First, the machine learning approach applies Occam's razor, which states that, for a similar performance, the simpler model is preferred over the more complex one [34]. The more complex the model, the higher the probability that it fits the training data by accident (overfitting). This is why a dimension reduction (or regularization) should be systematically performed before applying a machine learning algorithm, in order to counter the problems caused by increasing the number of variables in a predictive model (curse of dimensionality). The second explanation is environmental. The unstable meteorological conditions during the measurements caused an inconsistent dilution of the pollutants over time. Consequently, the resulting models based on the time of day are less robust when predicting new data. Other features, such as land use, could also be considered to improve the prediction of the background concentration of PM 2.5, as observed by [35]. Nevertheless, this factor was discarded for the purpose of this study, because it cannot account for the dynamics of human mobility and is therefore of limited use for forecasting pollution peaks.
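The generalization figures can be cross-checked with a few lines of arithmetic. The sketch below only recombines accuracies reported in this paper (own-day and transferred accuracies for the traffic-only models, transferred accuracies and reported losses for the traffic + time models, and the 73% reported for 3 June, 2019 in the mapping discussion); the variable names are hypothetical, and the own-day accuracies of the traffic + time models are inferred as transferred accuracy plus reported loss, which is an assumption.

```python
# Accuracies (%) of the traffic-only models evaluated on their own day.
own_day_traffic = {"30 May": 66, "3 June": 71, "12 June": 64, "18 June": 61}

# Accuracies (%) of the 3 June model when applied to the other days.
transferred_traffic = {"30 May": 64, "12 June": 64, "18 June": 61}
transferred_traffic_time = {"30 May": 68, "12 June": 63, "18 June": 58}

# Losses (%) reported for the traffic + time models relative to the own-day model.
reported_loss_traffic_time = {"30 May": 4, "12 June": 10, "18 June": 12}


def mean(values) -> float:
    """Arithmetic mean of a collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Loss of accuracy when the 3 June traffic-only model is transferred.
loss_traffic = {day: own_day_traffic[day] - acc
                for day, acc in transferred_traffic.items()}

# Own-day accuracies of the traffic + time models, inferred (assumption) as
# transferred accuracy + reported loss; 73% for 3 June is reported directly.
own_day_traffic_time = {day: transferred_traffic_time[day] + loss
                        for day, loss in reported_loss_traffic_time.items()}
own_day_traffic_time["3 June"] = 73

print("loss, traffic only:", loss_traffic)
print("mean own-day accuracy, traffic only:", mean(own_day_traffic.values()))
print("mean own-day accuracy, traffic + time:", mean(own_day_traffic_time.values()))
print("mean loss, traffic only:", round(mean(loss_traffic.values()), 1))
print("mean loss, traffic + time:",
      round(mean(reported_loss_traffic_time.values()), 1))
```

Under these assumptions the inferred own-day accuracies reproduce the reported means (65.5% and 72%), and the average transfer loss is well below one point for the traffic-only model against almost nine points once time is added.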
While we stress the good consistency between the models, a clear limitation of the study is the relatively small number of recorded days. However, the goal of this work goes beyond the identification of pollution "hotspots" in this specific city, for which it would be crucial to map the study area repeatedly to ensure the representativeness of the urban pollution areas. The principal purpose of this investigation is to understand the correlation between live, dynamic traffic and urban PM pollution. We focused on a representative area of the city, as it includes a rich variety of street types in the urban infrastructure. We are confident that each street section has its own dynamics of traffic and pollution, which, as expected, might not always be the same. Thus, each of those few-minute records on several sections represents a separate experiment. The proposed method is intended to demonstrate the potential of real-time traffic monitoring to provide automatic pollution mapping, as illustrated in the next section.

PM 2.5 Mapping
Real measurements and modeling results are presented for 3 June, 2019 at 8:00-10:20 in Figure 9. First, we show the IDW interpolation of PM 2.5 concentrations measured at the neighboring monitoring stations (see Figure 9a). Interpolating PM 2.5 from the sparsely distributed stations of the air quality network is clearly insufficient to estimate exposure to urban pollution at the local level. We then compare the traffic conditions during the experiment (colored overlapping diamond markers, Figure 9a) with an IDW interpolation of the PM 2.5 concentrations measured that same day during the same hours (Figure 9b). Increased traffic is frequently a cause of increased PM 2.5 concentrations, which can be observed in our study. Meanwhile, the lowest concentrations are registered in the city parks (green areas, Figure 9). Finally, we also compare the real PM 2.5 measurements with the PM 2.5 modeled from traffic only (Figure 9c) and from traffic + time of the day (Figure 9d). The heat maps are obtained by applying an IDW interpolation to the results provided by the decision tree model for each road segment. While both models performed well in predicting real PM 2.5 pollution, adding the time of the day to the traffic predictor improved the accuracy only marginally, from 71% to 73%. However, time of the day is a crucial parameter accounting for the effect of atmospheric dilution. This might help to better predict the dynamics of PM 2.5 concentrations in any city, as illustrated in Figure 9: Figure 9d represents the spatial distribution of the real pollution (Figure 9b) better than Figure 9c does. Finally, it is far better than relying solely on the information from the monitoring stations (Figure 9a).
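The IDW interpolation behind these maps can be sketched as follows. This is a minimal illustration, not the GIS routine used in the study: the function name `idw`, the exponent `power=2.0` (a common default that this section does not specify), and the station coordinates and concentrations are all assumptions.

```python
import math
from typing import List, Tuple

# (x, y, value): planar coordinates and the PM2.5 value observed at that point.
Sample = Tuple[float, float, float]


def idw(x: float, y: float, samples: List[Sample], power: float = 2.0) -> float:
    """Inverse distance weighting estimate at (x, y).

    Each sample contributes with weight 1 / distance**power, so nearby
    observations dominate the estimate. A query that coincides with a
    sample returns that sample's value directly.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical monitoring stations (km, km, ug/m3); not data from the study.
stations = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]

print(idw(1.0, 0.0, stations))  # halfway between the two stations
print(idw(0.5, 0.0, stations))  # closer to the first station
```

With only a few widely spaced stations, every street between them receives a smooth blend of the station values, which is why the network-based map cannot resolve street-level contrasts; applying the same function to per-segment model outputs gives a much denser set of samples.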
A possible way to further improve this spatial model would be to include the road-type information. On a highway, traffic behaves very differently: even if the flow is relatively fast (Green), the concentration of vehicles is high enough to cause a significant increase in PM 2.5 concentrations. Besides, our measurements are based on sampling at walking pace (4 km/h), which means that by the time we reached a congested area, the traffic might already have started moving; we would then register a fast flow (Green), although a minute earlier the street was highly congested (Red) and had produced a cloud of pollution in that area. If our model were applied to real-time traffic, the performance would therefore likely improve further. This suggests the potential of using this model with real-time Google Traffic information.
Finally, we applied our best traffic-based model (3 June, 2019) to a larger (6.5 km × 5.5 km) area of Quito (Figure 10). Google Traffic data were registered for the main avenues in the central area of Quito (Figure 10a) during 8:00-10:30 on 19 June, 2019. For practical reasons, the accuracy of this generalization was not verified through a classical supervised learning assessment. Instead, we used a clustering technique, which consisted of applying a k-Means algorithm and calculating the within-cluster squared distance for the seven possible models (see Section 2.4.3 for more details). Table 2 shows that the lowest error is obtained for 'Traffic_b', which confirms that the model built from the data collected on 3 June, 2019 is the best one to generalize the proposed approach to the whole city. This suggests the benefit of this method for citizen awareness of air quality in urban areas. Data extracted from the Google Maps Traffic application, or from other traffic monitoring application program interfaces (APIs), enable us to build a database on urban traffic, which is used to predict real-time urban air pollution.
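The error measure used for this comparison can be illustrated with a short sketch. The exact procedure of Section 2.4.3 is not reproduced here; the code only shows how an average within-cluster squared distance is computed once every observation has a cluster assignment. The function name, the toy one-dimensional observations, and the idea of comparing two alternative assignments are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def mean_within_cluster_sq_distance(points: Sequence[Sequence[float]],
                                    labels: Sequence[Hashable]) -> float:
    """Average squared Euclidean distance of each point to its cluster centroid."""
    clusters: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        for m in members:
            total += sum((m[i] - centroid[i]) ** 2 for i in range(dim))
    return total / len(points)


# Hypothetical one-dimensional observations and two candidate groupings.
observations = [(0.0,), (2.0,), (10.0,), (12.0,)]
tight_grouping = ["low", "low", "high", "high"]   # separates the two regimes
loose_grouping = ["low", "high", "low", "high"]   # mixes the two regimes

print(mean_within_cluster_sq_distance(observations, tight_grouping))
print(mean_within_cluster_sq_distance(observations, loose_grouping))
```

A grouping that separates the low and high regimes cleanly yields a smaller value, which is the sense in which the lowest within-cluster distance identifies the preferred model.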

Table 2. Comparing the different models in terms of overall average within-cluster distance. The first three models are the ones obtained from traffic (columns 2-4). The last four models are the ones built from traffic and the time of day (columns 5-8).

Conclusions
In this first-of-its-kind study, we investigated different ways to predict and map street-level urban air pollution. We used machine learning techniques and Inverse Distance Weighting (IDW) on real measurements of fine particulate matter (PM 2.5, aerodynamic diameter ≤ 2.5 µm). First, urban PM 2.5 concentration mapping based on the air quality network of Quito, Ecuador, showed that the spatial resolution of the air pollution information is relatively low. While this method is important for representing the general air quality conditions of the city, it is not adequate to estimate the exposure of the urban population to street-level air pollution. Therefore, in this study, we propose an innovative way to model urban PM 2.5 on the basis of traffic intensity. To confirm the suitability of available traffic applications, we compared traffic data provided by Google Traffic and Waze with actual traffic speed measurements. Then, we performed a correlation study between the measured traffic and real-time PM 2.5 concentrations, which helped us to split the data into two categories: high concentrations for slow traffic (increased acceleration and braking) and low concentrations for fluid or stopped traffic (reduced acceleration and braking).
To study the prediction power of each method, we compared the Inverse Distance Weighting interpolation with the real measurements. The interpolation of the monitoring network data exposed limitations and a low prediction accuracy (50%-71%), ranging from random-level results to improved results for the day with less variable meteorological conditions. The PM model based solely on traffic represented air quality conditions better (61%-71% prediction accuracy). Furthermore, our model for PM 2.5 prediction based on traffic and time of the day confirmed that including time in the model tends to improve the accuracy by an average of 6.5%. In the latter case, the temporal factor was a dominant feature for the estimation of pollution levels, confirming that the earlier the time, the higher the PM 2.5 concentration. As the rush hour occurs before 9:00, the traffic is denser during this period and the concentrations of particulate matter increase. In addition, the planetary boundary layer is low in the early morning, which inhibits the dilution of PM 2.5 and results in peak concentrations during the morning hours.
Finally, we tested the best model on the other days in order to verify the robustness of the proposed approach. Since the accuracy was maintained, we were able to confirm the generalization of the traffic-based model (accuracy of 61%-64%). We also noted that the models including the time factor do not generalize as well, which suggests that the simplest models are the most robust and that the most reliable feature to predict atmospheric pollution is the flow of vehicles in the city. Our finding is confirmed at a larger scale through an assessment based on an unsupervised learning technique. Since traffic information can be easily extracted from several application program interfaces (APIs) available on the web, this study provides a sustainable and affordable technique to predict air quality in any urban area without expensive equipment.