Combining Cluster Analysis of Air Pollution and Meteorological Data with Receptor Model Results for Ambient PM2.5 and PM10

Air pollution regulation requires knowing the major sources in any given zone, setting specific controls, and assessing how health risks evolve in response to those controls. Receptor models (RM) can identify major sources: transport, industry, residential, etc. However, RM results are typically available only for short-term periods, and there is a paucity of RM results for developing countries. We propose to combine a cluster analysis (CA) of air pollution and meteorological measurements with a short-term RM analysis to estimate a long-term, hourly source apportionment of ambient PM2.5 and PM10. We have developed a proof of concept for this methodology in three case studies: a large metropolitan zone, a city with dominant residential wood burning (RWB) emissions, and a city in the middle of a desert region. We have found it feasible to identify the major sources in the CA results and obtain hourly time series of their contributions, effectively extending short-term RM results to the whole ambient monitoring period. This methodology adds value to existing ambient data. The hourly time series results would allow researchers to apportion health benefits associated with specific air pollution regulations, estimate source-specific trends, improve emission inventories, and conduct environmental justice studies, among several potential applications.


Introduction
Ambient air pollution is a major environmental risk worldwide. Current estimates of premature mortality attributable to population exposure to ambient PM2.5 (particulate matter with aerodynamic diameter below 2.5 µm) range from five million [1] to ten million [2] annual deaths. This burden of disease has prompted a worldwide effort to regulate ambient PM2.5 concentrations. A customary set of public policies targeting that overarching goal includes: ambient air pollution monitoring, emission inventories, air pollution modeling, health effects studies, and economic valuation of regulation measures (emission standards, ambient air quality standards, market-based instruments, etc.). Ambient PM2.5 is directly emitted by traffic, industrial, commercial, and residential sources in urban zones, and by agriculture, mining, industrial, and natural sources elsewhere. However, the resulting ambient PM2.5 concentrations are far more complex to characterize. Some gaseous pollutants (sulfur and nitrogen oxides) undergo atmospheric oxidation, leading to sulfuric and nitric acids, which, in turn, react with ammonia to produce ammonium sulfate and nitrate particles, forming secondary PM2.5 [3]. Furthermore, volatile (VOC) and semi-volatile (SVOC) organic compounds are [...] filter-based RM applications, which typically correspond to daily or weekly samples. That kind of instrumentation is more expensive than traditional ambient monitors for regulated pollutants, so it is currently less widespread than RM applications.
The preceding paragraphs show that a quantitative apportionment of major sources of ambient PM 2.5 in urban areas worldwide is yet to be accomplished, particularly in developing countries. This is the current knowledge gap.
As regulatory authorities expand ambient air pollution monitoring worldwide, the possibility of extracting information from these databases has been on the rise, for instance to estimate temporal trends in air pollution [25][26][27][28][29]. Furthermore, more information is expected to be collected due to the rise of low-cost ambient monitoring [30][31][32][33][34][35][36][37]. More insight into the likely sources of ambient PM2.5 and PM10 can be achieved by combining meteorological and air pollution measurements. Bivariate plots of ambient concentration as a function of wind speed and direction provide information on sources contributing to ambient concentrations [38,39]. For instance, tall stack and area sources contribute most when wind speeds are higher and lower, respectively [3]. A further step consists of performing a cluster analysis (CA) of this type of bivariate representation, so that major clusters contributing to ambient concentrations may be visualized and explored [40]. One limitation of this CA approach is that there is no simple recipe for choosing the number of clusters in a given analysis [41].
In other words, more information is needed to decide how many clusters may be resolved in any given set of ambient air pollution data.
In a review of CA applied to air pollution analysis between 1980 and 2019 [41], it is shown that most CA applications to surface observations (i.e., air quality monitoring) consisted in exploring spatial associations among monitor sites within the same city, region, or country. The temporal evolution of those clusters was focused on their seasonal variability. In one publication [42], chemical speciation of ambient PM 2.5 was included in the cluster analysis to classify daily samples according to differences in PM 2.5 chemical composition. Then back trajectory analyses were developed to estimate air mass origins associated with each resolved cluster.
Our goal is to show that, by combining short-term RM analyses with results obtained from a long-term (multiyear) CA for the same location, it is possible to identify the major sources (clusters) in the ambient air pollution data for that location. With this approach, the RM results provide additional information to decide how many clusters should be considered in the CA, so both methods complement each other. We show a proof of concept for this combined approach in three case studies of urban zones with widely different dominant sources and meteorological conditions. As far as we are aware, this is the first time this joint analysis has been proposed for ambient air pollution data.
This combined analysis is facilitated by a set of simple rules to identify sources: PM2.5/PM10 ratios, dependence of ambient concentrations on temperature, wind speed, and relative humidity, and the location of the monitoring site with respect to obvious sources (highways, industrial zones, etc.). All these rules come from the literature of RM studies worldwide (see [16,43,44] for references), along with well-known results for the dispersion of stack emissions under different meteorological conditions [3] and the main features of motor vehicle emissions [45][46][47][48][49][50][51][52] as well as other combustion emissions [53][54][55][56][57][58].
The outcome of this proposed computational process is a long-term (multiyear), hourly time series of source contributions to ambient PM2.5 and PM10 (and other measured air pollutants as well) suitable for different research purposes: exploring the association of health effects with specific sources (for example, traffic), tracking source trends, evaluating the effectiveness of sector regulations, constraining sector emissions through dispersion modeling, conducting environmental justice analyses, further analyzing long-range sources using backward wind trajectories, etc. A major implication of our work is that ambient air quality databases are a rich source for further air pollution analysis worldwide, especially for source apportionment of PM2.5 and PM10.

Computational Methodology
We use the open access R software environment, including the openair package for air pollution analysis [59]. Openair has several dedicated functions for air pollution analysis, including the polarCluster function, which groups ambient pollutant concentrations using bivariate plots. Briefly, polarCluster performs a k-means cluster analysis on the vectors [u, v, c], where u and v are the zonal and meridional wind components (sometimes referred to as the east and north wind components) and c is the pollutant concentration under analysis (PM2.5, PM10, etc.). Since these variables have different scales, they are all standardized before the k-means algorithm proceeds [40]. This k-means clustering algorithm is well known and has been extensively used in air pollution analyses [41], as mentioned above.
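The clustering step described above can be illustrated with a minimal Python sketch. It is not the openair implementation: the function names are hypothetical, the wind components follow a polar-plot convention (u = ws·sin(wd), v = ws·cos(wd), an assumption; a sign flip would not change the clusters), and a deterministic farthest-point initialization replaces the random starts usually used with k-means so that the example is reproducible.

```python
import math
import statistics


def polar_to_uv(ws, wd_deg):
    """Wind speed and direction (degrees) to components (assumed convention)."""
    rad = math.radians(wd_deg)
    return ws * math.sin(rad), ws * math.cos(rad)


def standardize(column):
    """Rescale a variable to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd if sd > 0 else 0.0 for x in column]


def farthest_point_init(points, k):
    """Deterministic initial centers (a simplification of random starts)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def cluster_hourly_records(ws, wd_deg, conc, k):
    """Cluster hourly records on the standardized vectors [u, v, c]."""
    uv = [polar_to_uv(s, d) for s, d in zip(ws, wd_deg)]
    u = standardize([p[0] for p in uv])
    v = standardize([p[1] for p in uv])
    c = standardize(conc)
    return kmeans(list(zip(u, v, c)), k)
```

Each hourly record receives a cluster label, which is what allows the cluster results to be explored later as hourly time series.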
Regarding the receptor model methodology, we use already published RM results for two Chilean cities: Santiago and Temuco [60,61]. For Santiago, the USEPA Positive Matrix Factorization model (PMF, version 5.0) was the RM used [62], while for Temuco, the USEPA Chemical Mass Balance model (version 8.2) was the RM used [63]. Both models solve the following mass balance equation for p sources [13]:

$$X_{ij} = \sum_{k=1}^{p} g_{ik} f_{kj} + e_{ij} \qquad (1)$$

where X_ij, g_ik, and f_kj are matrices whose entries are the j-th species mass concentration measured in the i-th sample, the mass concentration from the k-th source contributing to the i-th sample, and the j-th species mass fraction in the k-th source, respectively. p is the total number of resolved sources.
The residuals e_ij are assumed random and normally distributed. The Chemical Mass Balance (CMB8.2) model solves Equation (1) for the case when the source profiles {f_kj} have been experimentally measured for the k-th source. Hence, Equation (1) is solved for {g_ik} using an effective variance least squares approach [64]. The Chemical Mass Balance model assumes that all major sources in the study area have been identified and included in its input. This model is usually run for different combinations of sources until a satisfactory solution is achieved in terms of the percentage of observed variance explained by the model.
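To make the effective variance idea concrete, the following Python sketch solves the mass balance for a single sample, x_j ≈ Σ_k g_k f_kj, by iteratively reweighted least squares. It is a simplified illustration and not the CMB8.2 code: the weights are assumed to take the commonly cited form 1 / (σ_xj² + Σ_k g_k² σ_fkj²), and all function and variable names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def cmb_effective_variance(x, sigma_x, F, sigma_F, n_iter=20):
    """Estimate source contributions g for one ambient sample.

    x[j], sigma_x[j]: measured concentration of species j and its uncertainty.
    F[k][j], sigma_F[k][j]: mass fraction of species j in source k and its
    uncertainty (measured source profiles).
    """
    p, m = len(F), len(x)
    g = [0.0] * p
    for _ in range(n_iter):
        # Effective variance: measurement plus profile uncertainty (assumed form).
        v_eff = [
            sigma_x[j] ** 2 + sum((g[k] * sigma_F[k][j]) ** 2 for k in range(p))
            for j in range(m)
        ]
        # Weighted normal equations (F W F^T) g = F W x.
        A = [
            [sum(F[a][j] * F[b][j] / v_eff[j] for j in range(m)) for b in range(p)]
            for a in range(p)
        ]
        rhs = [sum(F[a][j] * x[j] / v_eff[j] for j in range(m)) for a in range(p)]
        g = solve_linear(A, rhs)
    return g
```

In this sketch the number of species must be at least the number of sources and the profiles must be linearly independent; otherwise the normal equations are singular.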
The Positive Matrix Factorization (PMF5) model solves Equation (1) without making any assumption regarding source profiles {f_kj}. However, in such a case, there are more unknowns than equations in Equation (1). This means additional information must be supplied to the model [65]. Usually, source compositions {f_kj} and contributions {g_ik} are required to be non-negative. This is the case of the PMF5 software, which minimizes the following function:

$$Q = \sum_{i}\sum_{j}\left(\frac{e_{ij}}{\sigma_{ij}}\right)^{2} = \sum_{i}\sum_{j}\left(\frac{X_{ij} - \sum_{k=1}^{p} g_{ik} f_{kj}}{\sigma_{ij}}\right)^{2} \qquad (2)$$

where σ_ij is the estimated uncertainty in the j-th species of the i-th PM sample. In this approach, all observations are individually weighted by their respective uncertainties. The above minimization is carried out subject to the previously mentioned non-negativity constraints on compositions {f_kj} and contributions {g_ik}.
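As an illustration of the uncertainty-weighted objective, the short Python sketch below evaluates the sum of squared residuals, each divided by its uncertainty σ_ij, for a candidate pair of contributions G and profiles F. It only evaluates the objective and checks non-negativity; it does not reproduce the PMF5 solver, and the function names are hypothetical.

```python
def pmf_objective(X, G, F, sigma):
    """Sum over samples i and species j of ((X_ij - sum_k G_ik F_kj) / sigma_ij)^2."""
    q = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            modeled = sum(G[i][k] * F[k][j] for k in range(len(F)))
            q += ((x_ij - modeled) / sigma[i][j]) ** 2
    return q


def is_non_negative(matrix):
    """Check the non-negativity constraint imposed on G and F."""
    return all(value >= 0 for row in matrix for value in row)
```

Because every residual is divided by its own uncertainty, a poorly measured species contributes less to the objective than a well-measured one with the same absolute misfit.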
More details on the information used for each RM application (sampling period, chemical analyses, etc.) are provided in the respective references above [60,61].

Case Studies
We have selected, as case studies to test the proposed methodology, three urban zones in Chile with widely different meteorological conditions and dominant sources for which RM results are known. Calama (22°27'S, 68°55'W, population 2017: 158,600 [66], elevation: 2500 m) is a city located within the Atacama Desert (Köppen-Geiger classification BWk [67]), close to a large mining district. The major pollution problem is ambient PM10, given the desert conditions and rather strong winds that promote road dust resuspension by traffic and aeolian dust blown from the desert environment. Santiago (33°28'S, 70°40'W, population 2017: 6.85 million [66], elevation: 600 m) is the capital city, and extends along a valley surrounded by mountain ranges, with the Andes cordillera to the east. The climate is warm temperate with a dry summer (Köppen-Geiger classification Csb). Air quality regulations in Santiago were the first to be enacted in Chile and they have been successful in curbing ambient PM2.5 [68][69][70][71]. Nonetheless, ambient PM2.5 in Santiago currently exceeds World Health Organization (WHO) guidelines and Chilean ambient air quality standards (AAQS), and major PM2.5 sources are traffic, industry, commercial, and residential sources. The third case study is the city of Temuco (38°44'S, 72°35'W, population 2017: 308,600 [66], elevation: 350 m) with a warm temperate, fully humid climate (Köppen-Geiger classification Cfb). In this city, residential wood burning emissions rise in colder months, leading to severe ambient PM2.5 concentrations [60] that increase indoor concentrations as well [72].
Ambient air pollution (PM10, PM2.5, CO, SO2, NOx) and surface meteorological measurements were downloaded from Chile's Air Quality Information System (https://sinca.mma.gob.cl/). Table 1 shows the summary of information used to conduct the analyses, where one monitoring station per city was chosen. Figure 1 shows the locations of the three monitoring sites used in our analyses. Data were screened for obvious outliers and these were removed before the computational analyses.

Simple Comparison Rules
To map the outcome of the CA results with the RM results for the same site, we propose to use a set of rules. Some of these rules have been traditionally used in identifying RM results, such as looking at the weekly seasonality of source contributions to identify traffic or industrial sources [61,70,71,73]. The former is lower over weekends while the latter is not. Other rules use the hourly resolution of CA results to generate scatter plots and time-variability plots of ambient PM 2.5 , PM 10 , and gases (CO, NOx, SO 2 ), stratified by cluster, to check for specific patterns. For instance, large industrial stacks contribute more to ambient concentrations when wind speeds are higher [40] and ship emissions contribute more when ambient temperature rises [74]. A key result from scatter (X-Y) plots of air pollutants is the graphical interpretation of source compositions as 'limiting edge lines' [13,75], which define an upper (lower) edge line with the highest (lowest) Y/X ratio. Furthermore, the ratio PM 2.5 /PM 10 is different for combustion particles than for mechanically-generated particles [3]. The proposed rules and their rationale are detailed next.

1. Residential wood burning (RWB) sources exhibit a high PM2.5/PM10 ratio, typically 0.7 or higher, with features of a single source in a PM2.5-PM10 scatter plot, that is, most points lie along a straight line. RWB contributions show the highest seasonality of all resolved clusters (sources), peaking in colder months. This is a consequence of RWB emissions being driven by space heating demand, so they increase in colder months whereas traffic and industrial sources remain constant all year long. Likewise, hourly RWB contributions increase when ambient temperature decreases. With respect to relative humidity (RH), RWB contributions tend to increase at higher RH, while other area sources (like fugitive dust) decrease as RH increases. RWB contributions tend to peak near midnight in colder months, unlike traffic sources that peak earlier in the evening. On a weekly basis, RWB contributions decrease less over weekends than traffic contributions do.

2. Traffic sources display a PM2.5-PM10 scatter plot with a cloud of points bounded by 'limiting edge lines': an upper edge line close to a 1:1 line, characteristic of exhaust emissions, and a lower edge with a small PM2.5/PM10 ratio, typical of non-exhaust traffic emissions (i.e., road dust [76]). This is a typical signature whenever a pair of sources contribute to ambient concentrations [13,75]. Traffic contributions universally decrease on weekends, unlike industrial (or mining) sources, which are steady all year long. On a diurnal basis, traffic sources show distinctive morning and evening rush hour peaks. When plotted against wind speed, traffic sources display a negative correlation, explained by the better ventilation conditions brought by higher wind speeds. In contrast, industrial sources do not show a clear correlation, or sometimes display a positive correlation, because higher wind speeds bring tall-stack emissions down to the ground under unstable atmospheric conditions.

3. Industrial sources appear as hot spots in the CA outcome, associated with specific wind directions and high wind speeds, when tall stack contributions are relevant through fumigation processes [40]. An inspection of CA results for sulfur dioxide (SO2) helps to clarify the location and contributions of those industrial spots to ambient PM2.5. Under stable atmospheric conditions (and lower temperatures and wind speeds), stack plumes will rise, so their contributions to ambient SO2 will be negligible and traffic contributions will dominate. Conversely, under unstable atmospheric conditions, higher wind speeds and temperatures will promote contributions from stacks, which will be the dominant ones for SO2. In coastal areas, this mechanism will work the same way for SO2 from ship emissions [74].

4. Aeolian dust sources appear only at high wind speeds, and, to resolve them in the CA, the number of clusters needs to be increased until those sources emerge, provided we know they are at play in the study zone.

5. As the number of clusters increases in the CA, more intermittent sources are likely to show up. Most of these intermittent sources contribute little to ambient PM2.5 on a long-term basis. However, they are unlikely to show up in the RM results because RM have difficulties in resolving intermittent sources with low contributions to ambient PM2.5 [12]. Nonetheless, they might indicate long-range, regional sources arriving at the monitoring site. Thus, these may be analyzed on their own using backward wind trajectories [77,78] to confirm their identity (natural or anthropogenic).

6. Ubiquitous area sources (like traffic) would be split by wind direction sectors as the number of clusters increases. They will all have the features presented in rule 2 above.

7. Additional gaseous measurements will provide further insights into sources' identities, by comparing how those concentrations distribute across clusters. In the case of nitrogen oxides (NO, NO2, NOx = NO + NO2), a cluster consisting of cleaner air masses (coming from the ocean, for instance) will have low NO2 and NOx values, while aged, long-range anthropogenic regional sources in another cluster will display larger NO2/NOx ratios. In the case of carbon monoxide (CO), this pollutant should be apportioned mostly to traffic sources, and, hence, it would belong to traffic-related clusters. The exceptions are zones where RWB or some industrial sources are relevant. These will apportion CO as well, particularly in colder months (RWB) or under high wind speeds (industrial stack sources). SO2 is a good tracer for large industrial sources such as copper smelters or coal-fired thermal power generation units. It may also be used as a tracer of ship emissions.
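Several of the rules above reduce to simple per-cluster statistics of the hourly data. The Python sketch below computes three of them: the PM2.5/PM10 ratio, the weekend-to-weekday ratio of mean PM10, and the correlation of PM10 with wind speed. The record layout (dictionary keys) and the function names are hypothetical, and the sketch deliberately stops short of assigning source labels, since the interpretation still relies on the judgment described in the rules.

```python
from collections import defaultdict
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient; NaN if either variable is constant."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def cluster_diagnostics(records):
    """Per-cluster diagnostics from hourly records.

    Each record is a dict with keys 'cluster', 'pm25', 'pm10', 'ws' and
    'is_weekend' (a hypothetical layout chosen for this example).
    """
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec)

    out = {}
    for cluster, rows in by_cluster.items():
        pm10 = [r["pm10"] for r in rows]
        weekend = [r["pm10"] for r in rows if r["is_weekend"]]
        weekday = [r["pm10"] for r in rows if not r["is_weekend"]]
        out[cluster] = {
            # Rule 1: a high ratio (about 0.7 or more) points to combustion (RWB).
            "pm25_pm10_ratio": sum(r["pm25"] for r in rows) / sum(pm10),
            # Rule 2: traffic decreases on weekends, so this ratio falls below 1.
            "weekend_weekday_ratio": (
                fmean(weekend) / fmean(weekday)
                if weekend and weekday
                else float("nan")
            ),
            # Rules 2 to 4: sign of the dependence on wind speed.
            "r_pm10_ws": pearson_r([r["ws"] for r in rows], pm10),
        }
    return out
```

A cluster with a low weekend-to-weekday ratio and a negative wind speed correlation would be a traffic candidate under rule 2, whereas a clearly positive correlation points toward stack or aeolian sources.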

Results for Calama
In this city, located in an arid environment, annual ambient PM2.5 concentrations are below 10 µg/m³, that is, ambient PM2.5 satisfies current WHO guidelines [79]. Thus, we focus on ambient PM10 as the pollutant of concern, considering that fugitive sources, such as dust blown in from the desert surroundings and road dust, should be relevant. Both sources are also hard to estimate, given the intermittence of the physical processes that lead to high wind speeds (gusty conditions) for the former source, and the heterogeneous processes responsible for surface dust loading on the city's streets for the latter source [51].
For this city, no RM results have been published. Nonetheless, available urban emission inventories for Calama [80] estimate that road dust (traffic non-exhaust emissions) is the dominant PM10 emission source, accounting for 88% of urban emissions (Supplementary Table S1). Therefore, traffic emissions dominate PM10 emissions. However, no estimate of windblown dust is available for this study zone. We conduct the analysis for the monitoring station 'Centro' in Calama (see Table 1 for further details). Figure 2 shows the outcome of the polarCluster routine when applied to ambient PM10 data gathered from October 2012 through August 2020. The clusters associated with windblown dust emerge at high wind speeds and wind directions between 270° and 360°, and stay the same for CA solutions with 5-9 clusters. The other clusters include low wind speed conditions, and they split into several wind direction sectors as the number of clusters increases. For simplicity, we choose a solution with five clusters and analyze it using the simple rules presented in Section 2.3.
Figure 3 shows the source contributions associated with the five-cluster solution for PM10. Clusters 1 and 2 are the dominant ones, followed by clusters 4, 3, and 5. The first analysis looks at the time variability of this five-cluster solution. Figure S1 shows that clusters 3 and 5 present higher contributions between noon and 6 PM, when wind speeds are higher in the region. Figure S2 shows scatter plots of ambient PM10 versus wind speed. PM10 concentrations from clusters 3 and 5 clearly increase with wind speed, which is distinctive of aeolian emissions. By contrast, the other clusters' contributions show little increase or decrease with wind speed (see the R² values, for instance). Cluster 3 comes from W and WNW directions and cluster 5 from NNW winds. Cluster 3 is characteristic of upslope, anabatic winds that develop all year long given the strong solar irradiation and extreme soil dryness [81]. Cluster 5 is characteristic of mountain synoptic conditions that happen in the austral fall and winter seasons, when the Pacific subtropical high weakens, favoring a higher frequency of north winds. Nonetheless, both windborne dust sources are minor contributors to the long-term average PM10, as can be seen in Figure 3.
Figure 4 shows the PM10 time variability for clusters 1, 2, and 4. It is clear from this figure that these three clusters correspond to traffic sources: their ambient contributions have distinctive peaks during morning and evening traffic rush hours, and they significantly decrease over the weekends. The same behavior can be seen in Figure 5, where the NOx time variability is plotted for clusters 1, 2, and 4. Figure 6 shows scatter plots of PM2.5 and PM10 by cluster. In clusters 1, 2, and 4, the points lie between two limiting 'edge lines': an upper edge with a high PM2.5/PM10 ratio (traffic exhaust) and a lower edge with a low PM2.5/PM10 ratio (lower than 0.1), characteristic of non-exhaust traffic emissions [46,51]. Thus, any point in the plot for those three clusters can be regarded as an air mass that arrives at the monitoring site with a mixture of those two traffic emissions, as is customarily presented in the receptor modeling literature [13,75].
Int. J. Environ. Res. Public Health 2020, 17, x
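The two-source reading of the edge lines can be turned into a small calculation. If PM10 is the sum of an exhaust part with PM2.5/PM10 ratio r_up and a non-exhaust part with ratio r_low, then a point with observed ratio r has an exhaust share of PM10 equal to (r - r_low) / (r_up - r_low). The Python sketch below estimates the edge ratios as low and high quantiles of the observed ratios, which is an assumption made for this example (the text identifies the edges graphically), and then applies the mixing formula; all names are hypothetical.

```python
def edge_ratios(pm25, pm10, lower_q=0.05, upper_q=0.95):
    """Estimate lower and upper edge PM2.5/PM10 ratios as quantiles (assumed)."""
    ratios = sorted(a / b for a, b in zip(pm25, pm10) if b > 0)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(ratios) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ratios) - 1)
        return ratios[lo] + (pos - lo) * (ratios[hi] - ratios[lo])

    return quantile(lower_q), quantile(upper_q)


def exhaust_share_of_pm10(ratio, lower_edge, upper_edge):
    """Share of PM10 attributed to the upper-edge (exhaust) source."""
    share = (ratio - lower_edge) / (upper_edge - lower_edge)
    return min(1.0, max(0.0, share))  # clip points lying outside the edges
```

For instance, with edges at 0.1 and 0.9, an hour with a PM2.5/PM10 ratio of 0.5 would be read as roughly half exhaust and half non-exhaust PM10.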

Results for Temuco
In this city, located in a wet temperate climate, RWB is the major source of ambient PM2.5. RM results show that, in the winter season, RWB contributions to ambient PM2.5 vary between 70% and 100% [60]. These RM results were found for a monitoring site located 700 m NW of the air quality monitoring site that we analyze here (denoted as LE by its Spanish name 'Las Encinas'). The data cover the period from January 2009 through August 2020. In this case study, we use temperature instead of wind speed for conducting the CA. Had we chosen the traditional CA, we would have found difficulties in interpreting the resulting clusters because of an overlap of high PM2.5 contributions at low wind speeds, which mixes traffic and RWB contributions (results not shown). By using temperature as an input variable, we are able to extract the RWB contribution and separate it from the traffic contribution. This approach works well precisely because the highest RWB impacts occur near midnight in colder months, when traffic sources are minimal and RWB emissions are the highest. Furthermore, since wind direction is highly variable under those stable atmospheric conditions because of wind meandering effects, RWB contributions should come from all wind directions, defining a single cluster enclosing the origin of coordinates in the bivariate plot. This geometric feature helps in identifying RWB sources, even when they are not the dominant ones (see Section 3.3). Figure 7 shows the cluster analysis results for two to eight clusters at the Las Encinas (LE) site in Temuco. It can be clearly seen that a 'central cluster' develops for solutions with three or more clusters. For seven or more clusters, this central cluster splits into a pair of low-temperature and high-temperature clusters. They both correspond to RWB (results not shown here). For simplicity, we choose a three-cluster solution to identify the sources.
Figure 8 shows the monthly source contributions of the three-cluster solution found in Temuco. Source 3 is the dominant one, followed by sources 1 and 2. Figure 9 shows the time variability of those three clusters. Clusters 1 and 2 show lower seasonality than cluster 3, which has dominant contributions near midnight. Figure S3 shows the pollution rose by cluster. Cluster 3 has contributions at low wind speeds and from all wind directions, while clusters 1 and 2 are each associated with a narrow set of wind directions. Figure 10 displays PM2.5-PM10 scatter plots. Clusters 1 and 2 have two clear limiting edge lines, characteristic of traffic sources, while cluster 3 points lie along a straight line with a high slope near 1. Figure S4 shows CO time variability by cluster. The cluster patterns mimic the ones already shown in Figure 9 for PM2.5. Figure 11 shows scatter plots of PM2.5 against temperature. Cluster 3 behaves differently and has higher PM2.5 concentrations than clusters 1 and 2. Therefore, the CA results indicate that residential wood burning is the dominant source of ambient PM2.5, followed by traffic sources (clusters 1 and 2) with lower contributions.
We now turn to a quantitative comparison of the above CA results with RM results obtained for the period July-September 2014 at a monitoring site 700 m NW of the LE site. Table 2 shows a comparison of the average sources identified with CA and those resolved using the chemical mass balance RM (CMB8.2) with organic molecular markers [60]. The molecular markers measured in ambient PM2.5 samples were used to apportion organic carbon (OC) among different combustion sources (gasoline, diesel, wood, coal, natural gas) using Equation (1) with known source profiles {f_kj}. Afterward, organic source contributions to PM2.5 mass were calculated from those source contributions to OC and specific OC/PM2.5 mass ratios for each source [82][83][84][85].
Figure 11. PM2.5-temperature scatter plot by cluster for a three-cluster solution for Temuco.
For the CA traffic sources, we compare them with the sum of diesel exhaust emissions, vegetative detritus (both resolved by CMB8.2), and dust (sum of Al, Si, Fe, Ca, and Ti oxides) to account for the non-exhaust contributions included in the CA (i.e., road dust). For the residential wood burning source, the comparison is more elaborate. We have found that, in Temuco, coal combustion is also used for space heating (identified using picene as an organic tracer), so we have added RWB contributions (identified using levoglucosan as an organic tracer) to those from residential coal combustion, since both processes happen under the very same environmental conditions. Furthermore, there is a substantial contribution of secondary organic aerosol (SOA) in Temuco, which is ascribed to the inefficient burning of wood. This combustion process is known to release semi-volatile organic compounds [53], which quickly oxidize and, thus, generate secondary organic aerosols [86]. The CMB8.2 RM cannot resolve secondary sources, so the unresolved organic carbon (OC) is denoted as 'Other OC'. This 'Other OC' is identified as SOA by its high correlation with water soluble organic carbon (WSOC), indicating a high degree of molecular oxygenation. Therefore, we added the corresponding SOA contribution to the RWB and coal combustion contributions to get the 'RWB RM' entry in Table 2.
We show in Table 2 the previously mentioned comparisons for the 8-week ambient monitoring campaign reported in Reference [60]. For every weekly sample, we computed the average CA contributions for both sources (Traffic: clusters 1 and 2, RWB: cluster 3), considering the same five sampling days per week as in the ambient measurement campaign. Standard errors for all estimates (from CA and RM results) were computed from error propagation. For both major sources that can be resolved with CA, the agreement is good with very similar winter average results. The zero value for one week in the traffic CA contribution is a result of trying to identify a weak signal in a data set dominated by a single source (RWB in this case). This is the usual outcome both in CA and RM analyses.
Therefore, the comparison results in Table 2 show that the proposed CA analysis is able to capture the major sources contributing to ambient PM 2.5 in this case study of a zone dominated by residential wood burning emissions. For the (smaller) traffic contributions, the methodology has difficulties in capturing the temporal variation, even though, on average, the results are similar to the RM results.
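The averaging behind the comparison in Table 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the record layout, and the example values are hypothetical, and a source's contribution is assumed to be the sum of hourly PM2.5 in the hours assigned to its clusters divided by the total number of hours on the filter sampling days, so that the contributions add up to the mean PM2.5 over those days.

```python
from datetime import date


def mean_source_contributions(hourly, sampling_days, cluster_to_source):
    """Average source contributions (ug/m3) over the filter sampling days.

    hourly:            iterable of (day, cluster_id, pm25) tuples, one per hour
    sampling_days:     set of datetime.date objects covered by the filter samples
    cluster_to_source: dict cluster_id -> source name (unmapped clusters are ignored)
    """
    # Start every source at zero so a source absent from a week is reported as 0.
    totals = {source: 0.0 for source in set(cluster_to_source.values())}
    n_hours = 0
    for day, cluster, pm25 in hourly:
        if day not in sampling_days:
            continue  # keep only hours matching the filter-based campaign
        n_hours += 1
        source = cluster_to_source.get(cluster)
        if source is not None:
            totals[source] += pm25
    if n_hours == 0:
        raise ValueError("no hourly records fall on the sampling days")
    return {source: total / n_hours for source, total in totals.items()}


# Hypothetical hourly records: (day, cluster, PM2.5 in ug/m3).
hourly_example = [
    (date(2014, 7, 1), 3, 40.0),
    (date(2014, 7, 1), 3, 60.0),
    (date(2014, 7, 1), 1, 10.0),
    (date(2014, 7, 2), 2, 10.0),
    (date(2014, 7, 5), 3, 100.0),  # not a sampling day, excluded
]
days_example = {date(2014, 7, 1), date(2014, 7, 2)}
mapping_example = {1: "traffic", 2: "traffic", 3: "RWB"}
result_example = mean_source_contributions(hourly_example, days_example, mapping_example)
```

With this definition, a week in which no hour is assigned to the traffic clusters yields a traffic contribution of exactly zero, which is the kind of outcome noted above for one week of the campaign.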

Results for Santiago
In this case study of a large metropolitan area, we focus the analysis on a monitoring site located near the east edge of the city at a higher elevation in Santiago's basin (henceforth denoted as LAC, which is short for its Spanish name 'Las Condes'). For that site, a previous RM study [61] has shown that traffic sources and residential wood burning are the dominant ones, although regional sources also contribute to ambient PM2.5. That study was conducted with another RM, Positive Matrix Factorization (PMF), using elemental concentrations in ambient PM2.5 daily samples as input data. Figure 12 shows the results of CA for 2 to 10 candidate clusters, using temperature instead of wind speed as an input variable to resolve RWB contributions. A central cluster with the lowest ambient temperatures shows up for solutions with seven or more clusters, suggesting this is the RWB source. We choose an eight-cluster solution and present its results here. Figure 13 shows the source contributions for this eight-cluster solution. Clusters 3 and 4 have the largest source contributions, followed by cluster 7 (which groups the lowest ambient temperatures) and clusters 1 and 6. The rest of the clusters are of minor relevance. Figure S5 shows that cluster 3 has predominant WSW directions, while cluster 4 includes ENE and E directions. Figure 14 shows the PM2.5 time variability, but only for the five major contributing clusters. Cluster 4 has morning and evening peaks, coincident with rush hour traffic conditions. The contribution of cluster 3 increases from morning to early afternoon, indicating the arrival of traffic contributions from the city, brought by anabatic winds. This rise is followed by a decline later in the evening. This daytime behavior of anabatic winds and pollution transport toward the east side of the city has been recently measured and modeled for black carbon particles in Santiago [87].
Cluster 7 is the only one of the clusters shown in Figure 14 whose contribution does not decrease over weekends. This temporal pattern supports the identification of cluster 7 as the RWB source. Figure 15 shows PM2.5-PM10 scatter plots by cluster. The cluster 7 data have the highest slope of all, which again suggests that this is the RWB source. For clusters 1-5 and 8, the scatter plots show that the respective data points have upper and lower limiting edge lines, as expected for traffic sources.
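The slope comparison in the PM2.5-PM10 scatter plots can also be made numerically. The Python sketch below fits an ordinary least squares line per cluster and reports its slope. Note that this is an illustrative choice: the text does not state how the slopes were assessed, and the function names and data values are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the ordinary least squares line y = a + b*x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_by_cluster(records):
    """records: iterable of (cluster_id, pm10, pm25); returns dict cluster_id -> slope."""
    groups = {}
    for cluster, pm10, pm25 in records:
        xs, ys = groups.setdefault(cluster, ([], []))
        xs.append(pm10)
        ys.append(pm25)
    return {cluster: ols_slope(xs, ys) for cluster, (xs, ys) in groups.items()}


# Hypothetical hourly pairs (cluster, PM10, PM2.5) in ug/m3.
records_example = [
    (7, 20.0, 18.0), (7, 40.0, 36.0), (7, 60.0, 54.0),  # mostly fine particles
    (4, 20.0, 8.0), (4, 40.0, 14.0), (4, 60.0, 26.0),   # larger coarse fraction
]
slopes_example = slope_by_cluster(records_example)
steepest_cluster = max(slopes_example, key=slopes_example.get)
```

A slope close to 1 means that nearly all PM10 mass is in the PM2.5 fraction, which is the behavior expected for combustion sources such as RWB.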
To better identify this eight-cluster solution, we present in Table 3 the monthly average PM2.5 contribution by cluster for the year 2004, for which we have RM results at this same monitoring site.
From Table 3, we can see that clusters 4, 7, and 8 increase their contributions during the fall and winter seasons, whereas clusters 1, 2, and 3 show minimum contributions in those colder seasons. The reason for this different behavior has to do with meteorological conditions. In the fall and winter seasons, subsidence conditions promote low-level thermal inversion layers over Santiago's valley, leading to shallow planetary boundary layers (PBL) [88] and blocking transport of emissions from Santiago's lower valley (dominant wind direction for clusters 1, 2, and 3, as seen in Figure S5). This also explains why local sources (clusters 4, 7, and 8) increase their contributions during the fall and winter. This topography-induced effect has been reported before for total ambient PM concentrations [89] and, more recently, in the simulation of black carbon transport from Santiago toward the Andes mountains to the east [87].
Cluster 6 has a different time variability compared with the rest of the resolved clusters, with contributions peaking in the afternoon and increasing in the fall and winter seasons. Therefore, its contributions do not come from Santiago's lower valley. RM results [61] indicate that regional anthropogenic sources contribute to ambient PM2.5 at the monitoring site, as diagnosed by the presence of arsenic and sulfur in ambient PM2.5 samples. To check whether cluster 6 could represent those regional sources, Figure S6 shows the source contributions to ambient SO2. Cluster 6 SO2 contributions show up all year long, so they cannot come from Santiago's lower valley. Besides, Figure S5 shows that cluster 6 has W and WSW wind directions, suggesting that those regional sources are located west of Santiago. This result agrees with the location of regional sources of arsenic and sulfates identified in Reference [61] using backward trajectory analyses. Therefore, we conclude that cluster 6 can be identified as representative of regional sources of PM2.5.
To compare the CA results with the RM results, we need to consider the limitations of both analyses. The RM results were obtained using Positive Matrix Factorization software and elements as tracers. Organic tracers were not included, and this is a limitation. For instance, more recent results [90,91] have shown that secondary organic aerosols are relevant in the spring and summer seasons in Santiago. This secondary PM2.5 was not resolved by the PMF solution, which included only elements. Another limitation of the PMF solution is the use of potassium as a tracer of wood burning. Potassium may be a reasonable tracer in the fall and winter seasons, but it is less specific in the spring and summer, when soil dust becomes relevant in Santiago's semi-arid climate [70]. Because of these issues, we have decided to compare CA and RM results only for the months from May through August 2004. Table 4 shows such a comparison. We have grouped clusters 3 and 4 as traffic sources, cluster 7 as the RWB source, and cluster 6 as regional sources. The smaller contributions from other clusters have not been considered, given the limitations of both CA and RM in resolving smaller (or intermittent) sources. Since the filter-based RM results consider only a subset of days per month, these specific days have been extracted from the CA results to compute comparable averages. From the results in Table 4, it can be seen that CA tends to overestimate the RM result for traffic contributions (which includes traffic and soil dust contributions), whereas the CA results for RWB and regional source contributions are below those estimated by the RM at the very same monitoring site. In fact, the sum of CA estimates in Table 3 has an average of 33.3 µg/m³, while the corresponding average of RM results is 41.3 µg/m³.
This discrepancy is ascribed to the different PM2.5 measurement techniques: filter samples were taken with low-volume dichotomous samplers (Andersen Instrument, Inc., Smyrna, GA, USA, 15 L/min), while continuous measurements were made with a Tapered Element Oscillating Microbalance (TEOM, Rupprecht & Patashnick, MA, USA). The latter instrument is known to present negative artifacts (i.e., underestimation of PM2.5 concentrations) due to partial volatilization of sampled PM2.5 in the TEOM's heated inlet used to dry the samples [92]. We think this instrument artifact explains why filter-based RM results are higher than the CA results reported here. It also explains the lower contributions found for RWB in the CA, because RWB sources emit a larger fraction of volatile compounds than other combustion sources [53]. We do not have such an issue in the case of Temuco because, at that monitoring site, a beta-attenuation monitor (BAM, MET-ONE 1020, Met One Instruments Inc., Grants Pass, OR, USA) has been used to measure PM2.5 and PM10.
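To put the gap between the two totals quoted above in relative terms, the short Python sketch below computes the difference of the CA total with respect to the RM total. The two input values are the ones reported in the text; the function name is a hypothetical helper.

```python
def relative_difference(estimate, reference):
    """Signed relative difference of an estimate with respect to a reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (estimate - reference) / reference


ca_total = 33.3  # average of summed CA estimates, ug/m3 (from the text)
rm_total = 41.3  # corresponding average of RM results, ug/m3 (from the text)
gap = relative_difference(ca_total, rm_total)
print(f"CA total relative to RM total: {gap:.1%}")  # prints -19.4%
```

In other words, the CA total based on the continuous monitor is about one fifth lower than the filter-based RM total.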

Conclusions
We have proposed that a CA approach, applied to ambient air pollution and meteorological data, provides a source apportionment of ambient PM2.5 and PM10 on an hourly basis. To achieve this result, we need to compare the outcome of the CA with RM results for the same monitoring site (or with an emission inventory when no RM result is available) to identify the major sources (clusters) at play in a given zone. We have shown that our rule-based CA works for three different case studies in Chile: a city in a warm, desert region (Calama), another one dominated by RWB pollution in a cold, wet region (Temuco), and a large metropolitan area in a semi-arid region (Santiago).
In the case of Calama, the CA for ambient PM10 is able to resolve local sources (traffic) and windblown dust coming from the nearby desert environment. Both are fugitive sources, which are difficult to estimate because of the amount of information required in each case, such as particle size distributions. CA results indicate that traffic sources dominate ambient PM10 concentrations, as suggested by the emission inventory of PM10 sources for that city. CA results show that, over the long term, outbursts of windblown dust are not significant within the city. This is relevant for policy purposes. The evolution of traffic contributions in Calama (Figure 3) shows a clear decreasing trend, suggesting that the ongoing street sweeping program has been successful. This is an example of how a sector-specific regulation in the city may be evaluated. Furthermore, within the traffic source, ambient PM10 data can be regarded as a combination of exhaust and non-exhaust emissions, providing additional insights such as whether new exhaust emission standards for motor vehicles have curbed exhaust emissions.
In the case of Temuco, with dominant RWB contributions to ambient PM2.5, the use of ambient temperature instead of wind speed improves the apportionment of all major sources of ambient PM2.5. RWB contributions show up in the bivariate polar plots as a 'central' cluster because, under stable atmospheric conditions (with the lowest ambient temperatures and wind speeds), wind direction is highly variable. Hence, the monitoring site samples air masses from all wind directions. CA source contributions for RWB sources are statistically comparable with RM results obtained for RWB sources in a short-term campaign in 2014.
For the large metropolitan area of Santiago, the CA methodology resolved RWB, traffic, and regional sources of ambient PM2.5. The CA again required using ambient temperature to resolve the RWB contribution. The seasonality of the resolved clusters showed a distinctive effect brought by topography: Santiago's plume does not reach the eastern side of the city in the fall and winter seasons due to low PBL depths in colder months. This geographic condition provided an additional criterion to identify local and non-local traffic sources. However, the RM-resolved source contributions (short-term campaign in 2004) were higher than their CA-resolved counterparts. This results from an underestimation of ambient PM2.5 concentrations caused by volatilization losses in the continuous TEOM PM2.5 monitor.
The rule-based CA presented here generates added value to existing ambient air pollution databases. Regarding RM results to complement the CA analysis, methods that use specific source tracers (like organic molecular markers) are preferable.
The rule-based CA results may be used to:
1. Identify specific meteorological conditions leading to high PM2.5 concentrations, like windblown dust in arid regions.
2. Provide long-term time series of source contributions to constrain emissions through DM applications and thus improve emission inventories.
3. Track source trends and assess the efficiency of specific regulations.
4. Conduct environmental justice studies with the aid of low-cost air pollution monitoring (citizen science).
5. Identify contributions from intermittent sources, which may be further pinpointed using backward trajectory analysis.
6. Conduct epidemiological studies to find associations between health effects and exposure to a single PM2.5 source such as traffic.
7. Help analyze massive databases coming from state-of-the-science continuous monitors such as time-of-flight mass spectrometers measuring aerosols or VOC, multi-wavelength aethalometers, etc.
The proposed rule-based CA has limitations, though. The methodology works well for resolving the larger sources at play in a city, but smaller sources remain a challenge.
Figure S1: PM10 time variability results for a five-cluster solution for Calama. Figure S2: PM10-wind speed scatter plot by cluster for a five-cluster solution for Calama. Figure S3: Pollution rose results by cluster for a three-cluster solution for Temuco. Figure S4: Time variability of CO by cluster for a three-cluster solution for Temuco. Figure S5: Pollution rose by cluster for an eight-cluster solution for Santiago. Figure S6: Source apportionment for SO2 for an eight-cluster solution for Santiago.