Estimation of Particulate Matter Contributions from Desert Outbreaks in Mediterranean Countries (2015–2018) Using the Time Series Clustering Method

: North African dust intrusions can contribute to exceedances of the European PM 10 and PM 2.5 limit values and World Health Organisation standards, diminishing air quality, and increased mortality and morbidity at higher concentrations. In this study, the contribution of North African dust in Mediterranean countries was estimated using the time series clustering method. This method combines the non-parametric approach of Hidden Markov Models for studying time series, and the deﬁnition of different air pollution proﬁles (regimes of concentration). Using this approach, PM 10 and PM 2.5 time series obtained at background monitoring stations from seven countries were analysed from 2015 to 2018. The average characteristic contributions to PM 10 were estimated as 11.6 ± 10.3 µ g · m − 3 (Bosnia and Herzegovina), 8.8 ± 7.5 µ g · m − 3 (Spain), 7.0 ± 6.2 µ g · m − 3 (France), 8.1 ± 5.9 µ g · m − 3 (Croatia), 7.5 ± 5.5 µ g · m − 3 (Italy), 8.1 ± 7.0 µ g · m − 3 (Portugal), and 17.0 ± 9.8 µ g · m − 3 (Turkey). For PM 2.5 , estimated contributions were 4.1 ± 3.5 µ g · m − 3 (Spain), 6.0 ± 4.8 µ g · m − 3 (France), 9.1 ± 6.4 µ g · m − 3 (Croatia), 5.2 ± 3.8 µ g · m − 3 (Italy), 6.0 ± 4.4 µ g · m − 3 (Portugal), and 9.0 ± 5.6 µ g · m − 3 (Turkey). The observed PM 2.5 /PM 10 ratios were between 0.36 and 0.69, and their seasonal variation was characterised, presenting higher values in colder months. Principal component analysis enabled the association of background sites based on their estimated PM 10 and PM 2.5 pollution proﬁles. Á and J.C.M.P.; writing—review and editing, Á and J.C.M.P.; visualization, Á supervision, Á and J.C.M.P.; project administration, Á and J.C.M.P.; funding acquisition, Á and J.C.M.P.


Introduction
Air pollution is a mixture of particles and gases, which can reach unsafe concentrations for human health, the environment, and vegetation. It has become one of the main sustainability issues and a concerning topic in atmospheric science. According to the World Health Organisation (WHO), 90% of the world population lives in highly polluted environments, and about 7 million deaths are caused every year by outdoor and indoor air pollution [1].
Particulate matter (PM) is one of the most concerning air pollutants in Europe [2,3]. It comprises suspended particles in the air with variable composition and size, which results from several anthropogenic activities. The main sources are industrial facilities, power plants, traffic emissions, wood stove combustion, dust, and forest fires. PM can be divided into two main components [4,5]: (i) primary PM emitted directly by pollution sources; and (ii) secondary PM that is produced in the atmosphere by chemical reactions with primary gaseous pollutants (e.g., organic compounds, sulphur dioxide, and nitrogen oxides). PM is usually classified according to its size, with fine particles being designated as PM 2.5 (PM with an aerodynamic diameter <2.5 µm). The group of fine (with an aerodynamic diameter with time coverage higher than 80% (7008 hourly observations per annual TS) were considered in this study and independently analysed for each year. The number of analysed TS from background sites fulfilling this data quality criterion is indicated in Table 1, together with their distribution by countries and years. A relation of the studied PM 10 and PM 2.5 background monitoring sites is provided as Supplementary Material (SM) in two comma-separated value files. In these files, the typology, main pollution source, and the geographical coordinates of monitoring stations are given in EPSG 4979 projection. The length of the analysed TS from each station by year is also provided.  FR  22  16  21  15  22  12  25  19  Croatia  HR  3  3  3  2  4  3  3  3  Italy  IT  7  2  8  3  7  -11  4  Portugal  PT  9  6  10  4  10  4  10  3  Turkey  TR  5  -7  4  7  3  12  6 For the years 2015, 2016, 2017, and 2018, a total of 61, 70, 73, and 86 PM 10 TS were studied, respectively; with respect to PM 2.5 , 35, 33, 23, and 44 TS, respectively, were studied for those years. All of these PM 10 and PM 2.5 TS were obtained from background monitoring stations. Tables 2 and 3 show summary statistics for dust desert days based on the exceedance of PM 10 and PM 2.5 daily average concentrations of 150 and 43 µg·m −3 , respectively. These values were selected as threshold values due to the maximum levels of these pollutant concentrations in Southern Europe during dust outbreaks [18,19]. They correspond to a lower bound of ranges typically found in this region. Table 2. Average concentration by country for PM 10 time series during desert and non-desert days in background monitoring sites (in µg·m −3 ). n, number of days with exceedance of 150 µg·m −3 PM 10 daily average. BA-Bosnia and Herzegovina; ES-Spain; FR-France; HR-Croatia; IT-Italy; PT-Portugal; TR-Turkey.

Country
Year

Time Series Clustering
In this study, HMM were used to detect clusters in PM 10 and PM 2.5 TS data. It represents a model-based strategy for clustering by assuming that each cluster of data is described by a different probability distribution (components of the mixture). This methodology is explained in Gómez-Losada et al. [14,20] and illustrated graphically in Figure 1. A complete definition and elements of HMM are provided in SM1. HMM represents a flexible method of modelling TS that exhibits dependence over time as those obtained in air quality monitoring networks. The model estimation lies in finding the parameter of the HMM that is most likely to have generated a given TS. This is referred to as the maximum likelihood estimation problem, which in this study was solved using the Expectation-Maximisation algorithm [21,22]. The computational implementation of HMM in this study was accomplished using the depmixS4 package [23] from open-source software R [24]. The basic script of this implementation is provided in SM2. the HMM that is most likely to have generated a given TS. This is referred to as the maximum likelihood estimation problem, which in this study was solved using the Expectation-Maximisation algorithm [21,22]. The computational implementation of HMM in this study was accomplished using the depmixS4 package [23] from open-source software R [24]. The basic script of this implementation is provided in SM2. For many practical applications, there is often some physical significance attached to the hidden states of the HMM. For instance, in economics, states of the HMM can be related to expansion and recession periods and the interest is in studying the dynamics between them [25]; in developmental psychology, the states of the HMM are used to quantify knowledge that subjects express in an implicit learning task [26]. In this work, clusters of TS data are assigned to represented different PM concentration ranges (regimes of concentration or pollution profiles) in each modelled annual TS. HMM provides, inter alia, the mean and standard deviation values for each regime. These values represent the different parameterisation of the Gaussian distributions assigned to the regimes, providing For many practical applications, there is often some physical significance attached to the hidden states of the HMM. For instance, in economics, states of the HMM can be related to expansion and recession periods and the interest is in studying the dynamics between them [25]; in developmental psychology, the states of the HMM are used to quantify knowledge that subjects express in an implicit learning task [26]. In this work, clusters of TS data are assigned to represented different PM concentration ranges (regimes of concentration or pollution profiles) in each modelled annual TS. HMM provides, inter alia, the mean and standard deviation values for each regime. These values represent the different parameterisation of the Gaussian distributions assigned to the regimes, providing very useful information as they fully characterise every analysed TS for a given time period and monitoring site. Each cluster's average values were represented as µ i (i = 1, . . . , n) with n the number of detected clusters in the TS data ( Figure 1A).

Time Series Clustering
Once the pollution profiles in the TS data are determined, a meaning for each profile must be then assigned to estimate the PM 10 and PM 2.5 contributions from desert outbreaks. The following definitions are provided for the average values of each profile [14], applied to the remote stations studied in this work and considering as an example, the PM 10 TS from Figure 1: represents the underlying or threshold (background) concentration over which great changes in value are not expected over the years if atmospheric and pollution conditions remain relatively constant, being a characteristic of the studied area. Daily average PM 10 concentrations assigned to this regime are supposedly not caused by any direct influence of natural or anthropogenic sources, or if they are, they are negligible.
is the average PM 10 concentration on the days affected by moderate contributions of anthropogenic sources due to activities that take place in the region. The value of µ 2 is subject to slightly more variation than µ 1 between years. The referenced days may be affected by contributions from natural sources attributable to African dust transport episodes that have a minor impact on the observed PM 10 concentrations. Some useful quantities can be then roughly estimated using these definitions, namely: • µ 2 -µ 1 (6 µg·m −3 ): average concentration due to anthropogenic contributions from the region. • µ 3 -µ 2 (8 µg·m −3 ): average concentration associated with characteristic contributions from dry regions from North Africa when they occur. • µ 4 -µ 2 (18 µg·m −3 ): average concentration from severe contributions from dry regions from North Africa when they occur.
It must be noted that the above definitions coincide from a conceptual point of view with those of horizontal profiles given by Lenschow et al. [27] when studying the PM 10 regimes of concentrations in the Berlin area. These definitions have to be adapted then to the geographical location under study and adapted to each monitoring network. Given the transnational scope of this study and it being unfeasible to characterise every PM regime for each station and country, these definitions were adopted and used generically. Furthermore, it must also be considered that contributions from other sources of PM such as those from wildfires or other anthropogenic sources of pollution could also be represented in the aforementioned defined pollution profiles. In this study, the above definitions are extended to the PM 2.5 case.
As with other available modelling techniques deriving information about pollution sources and the amount they contribute to ambient air pollution levels (e.g., chemical mass balance and positive matrix factorisation models), the time series clustering method is also subject to intrinsic limitations. Among them are that it provides estimates together with its uncertainties only based on the observed concentrations of pollutants (in this study, PM 10 and PM 2.5 ). To overcome this setback, definitions given to each of the pollution regimes were incorporated into this method when created (Gómez-Losada et al., 2015) to provide a reasonable physical explanation to these regimes. However, even though these definitions work well in general, it can ignore some local or regional peculiarities from areas where emission sources originate. In this study, this limitation was partially overcome by selecting only background monitoring stations from the studied countries when analysing the contributions from North African arid regions. Another understandable disadvantage of the time series clustering method is that it does not take into consideration the origin or location of the emission sources, as air mass trajectory or wind-based models can do. For that reason, only countries bordering the Mediterranean Sea were selected for this study (Portugal was included due to being strongly influenced by Saharan dust) due to their proximity to the North African regions. Having exposed all of the above, it is also convenient to remember their advantages. As a modelling based on a statistical clustering technique, it does not require from expensive and time-consuming chemical assays, modelling outputs are comprehensive and easily interpreted, its computational implementation requires limited competence and low resources, and can be deployed using open-source software (R or Python programming languages). Moreover, scripts for the computational implementation have been provided by the authors on several occasions in other studies, and again, are available here upon request. Furthermore, the time clustering method can be used to complement other source apportionment methods, validate their modelling outputs, or simply used as an exploratory technique for deriving information from pollution sources.

Principal Component Analysis
To study the clustering of remote monitoring sites according to the annual average values of their pollution profiles (µ 1 , µ 2 , µ 3 , µ 4 ) for PM 10 and PM 2.5 , a biplot principal component analysis (PCA) was performed for each year. To facilitate the analysis, 95% confidence interval ellipses were calculated by studied country. The computational implementation of PCA was performed using the ggplot2 function from the ggbiplot library from R [28]. Datasets containing the estimated pollution profiles by year are available upon request.

Results and Discussion
3.1. Behaviour of PM 2.5 /PM 10 Ratio PM 2.5 and PM 10 have several sources, thus presenting different physical and chemical properties. The study of PM 2.5 /PM 10 ratio is a useful tool to identify PM sources and effects on human health [10,11]. Evolution of the PM 2.5 /PM 10 ratios from 2015 to 2018 is illustrated in Figure 2. Remarkably, the pattern of the ratio in countries is similar to alternate years (2015 and 2017, 2016 and 2018). Even this analysis is absent for Turkey and Italy in 2015 and 2017, respectively, due to the lack of data fulfilling the quality criteria applied in this study; the highest ratios are obtained regularly for Croatia (HR-50% of the ratios were higher than 0. As a reference, medians between countries are joined, and a dotted horizontal line is added at 0.5 ratio. Figure 3 shows, as a more detailed analysis, a monthly analysis of PM2.5/PM10 ratio. Higher values (with median higher than 0.65) are indicated with a red median as a reference in the box-whisker plots. As a general rule, higher ratios are observed in the coldest months (October to February), in particular for countries like Croatia (HR) and Italy (IT). Other countries like France (FR) may show the highest ratios in February. This seasonal pattern is justified by the increase in fuel consumption in this period of the year for domestic and industrial heating [29]. Additionally, the stable atmospheric conditions promote the dry deposition of coarse particles, maintaining the fine particles in the air [10,30]. It can be slightly appreciated that PM2.5/PM10 ratio values higher than the P75 and lower than the P25 are prone to extending to the maximum and minimum ratios, respectively, in the warmest month. As a reference, medians between countries are joined, and a dotted horizontal line is added at 0.5 ratio. Figure 3 shows, as a more detailed analysis, a monthly analysis of PM 2.5 /PM 10 ratio. Higher values (with median higher than 0.65) are indicated with a red median as a reference in the box-whisker plots. As a general rule, higher ratios are observed in the coldest months (October to February), in particular for countries like Croatia (HR) and Italy (IT). Other countries like France (FR) may show the highest ratios in February. This seasonal pattern is justified by the increase in fuel consumption in this period of the year for domestic and industrial heating [29]. Additionally, the stable atmospheric conditions promote the dry deposition of coarse particles, maintaining the fine particles in the air [10,30]. It can be slightly appreciated that PM 2.5 /PM 10 ratio values higher than the P75 and lower than the P25 are prone to extending to the maximum and minimum ratios, respectively, in the warmest month.

Study of Pollution Profiles over Time
Every PM 10 and PM 2.5 TS from 2015 to 2018 obtained from the studied countries (Table 1) was analysed independently with HMM to estimate their regimes of concentration. Regimes of monitoring stations by countries are illustrated as box-whisker plots in Figure 4, for PM 10 (A) and PM 2.5 (B). The first to fourth pollution profiles are shown in blue, green, orange, and red, respectively. To detect the most representative particulate matter concentrations (PM 10 and PM 2.5 ), outliers (1.5 times below and higher the 25th and 75th percentile in distributions) were not considered in this study since the aim is to estimate the general trend of PM contributions quantitatively and not isolated outbreaks characterised by the highest concentrations. In Figure 4A

Study of Pollution Profiles over Time
Every PM10 and PM2.5 TS from 2015 to 2018 obtained from the studied countries (Table 1) was analysed independently with HMM to estimate their regimes of concentration. Regimes of monitoring stations by countries are illustrated as box-whisker plots in Figure  4, for PM10 (A) and PM2.5 (B). The first to fourth pollution profiles are shown in blue, green, orange, and red, respectively. To detect the most representative particulate matter concentrations (PM10 and PM2.5), outliers (1.5 times below and higher the 25th and 75th percentile in distributions) were not considered in this study since the aim is to estimate the general trend of PM contributions quantitatively and not isolated outbreaks characterised by the highest concentrations. In Figure 4A, the highest pollution profile (red box-whisker plots) is estimated in Turkey (TR) and Spain (ES), while remaining with similar concentrations for the rest of countries (approximately around 50 µg·m −3 ), except for Italy (IT) in 2016 and Portugal (PT) in 2016 and 2017. Bosnia and Herzegovina (BA) lacks representativeness regarding its pollution profiles estimation since only one TS from a single monitoring station was available for this country for 2016 and 2017. The rest of the pollution profiles (one to three) remains similar to little variations for all countries.  Figure 4B shows the same analysis for PM2.5. The distribution of regimes is less homogenous than for PM10, although Turkey (TR) shows the highest concentration in the fourth pollution profile (red box-whisker plots) again. Croatia (HR) also exhibits the second highest concentrations in the fourth regime, followed by Portugal (PT)-2015 to  Figure 4B shows the same analysis for PM 2.5 . The distribution of regimes is less homogenous than for PM 10 , although Turkey (TR) shows the highest concentration in the fourth pollution profile (red box-whisker plots) again. Croatia (HR) also exhibits the second highest concentrations in the fourth regime, followed by Portugal (PT)-2015 to 2017-Italy (IT)-2015 to 2016, and 2018-and France (FR). Remarkably, Spain (ES) shows the lowest fourth pollution profile even though it is a country markedly affected by PM contributions from North African deserts [31,32]. The rest of the regimes show some more variations with respect to PM 2.5 than in the PM 10 case . Another interesting profile is represented by the first (background) profile. Turkey (TR) shows the highest levels of PM 2.5 background pollution (blue box-whisker plots) followed by Italy (IT), Croatia (HR), France (FR), and Spain (ES), similarly to the behaviour of the fourth pollution profile.

Estimation of African Dust Contributions to PM 10 and PM 2.5
Once pollution profiles were estimated with HMM, PM 10 and PM 2.5 contributions (African dust contributions to PM 10 and PM 2.5 ) according to the regimes' average values were calculated as defined in Section 2.3. These results are illustrated in Figure 5 for PM 10  2017-Italy (IT)-2015 to 2016, and 2018-and France (FR). Remarkably, Spain (ES) shows the lowest fourth pollution profile even though it is a country markedly affected by PM contributions from North African deserts [31,32]. The rest of the regimes show some more variations with respect to PM2.5 than in the PM10 case. Another interesting profile is represented by the first (background) profile. Turkey (TR) shows the highest levels of PM2.5 background pollution (blue box-whisker plots) followed by Italy (IT), Croatia (HR), France (FR), and Spain (ES), similarly to the behaviour of the fourth pollution profile.

Estimation of African Dust Contributions to PM10 and PM2.5
Once pollution profiles were estimated with HMM, PM10 and PM2.5 contributions (African dust contributions to PM10 and PM2.5) according to the regimes' average values were calculated as defined in Section 2.3. These results are illustrated in Figure 5 for PM10 (A) and PM2.5 (B). Severe dust contributions to PM10 were estimated for Bosnia and Herzegovina (BA), Spain (ES), and Turkey (TR), although this first country lacks representativeness since just TS data from two monitoring sites were studied according to the quality criteria for data coverage used in this study.   In the case of Spain, the average concentration associated with characteristic contributions from North African deserts (orange dots in Figure 5A) was estimated as 6.5 ± 7.2, 8.4 ± 10.5, 11.2 ± 6.8, and 9.1 ± 5.4 µg·m −3 from 2015 to 2018, respectively. The same estimated concentrations following the 40th percentile method [15,31,33] were 2.5, 4.0, 3.2, and 3.4 µg·m −3 , for the same years [34][35][36][37]. Higher estimations of the time series clustering method in comparison with the 40th percentile method are also in agreement with those of Wang et al. [38], who estimated a concentration range for Spain of 10-20 µg·m −3 from 2016 to 2017 using the GEOS-Chem model. Highest severe contributions from dust desert to PM 10 (red dots in Figure 5A) are estimated for Bosnia and Herzegovina (BA) and Turkey (TR).

Daily Level Evolution of PM 2.5 and PM 10 Regimes
To provide a different insight regarding the evolution of the PM 2.5 /PM 10 ratios, from those TS indicated in Table 1, the one with the highest PM 10 and PM 2.5 annual concentration average value for each country was selected (independently for each air pollutant), and their regimes of concentration were studied considering the hour of the day (analysed monitoring sites are provided in SM3). These results are graphically presented in Figure 6 for PM 10 and in Figure 7 for PM 2.5 . For the sake of clarity, observations lower and higher than the 25th and 75th percentile were not represented. Only the selected air quality monitoring stations in TR coincided for the PM 10 and PM 2.5 air pollutants. The cases of the TS from IT for PM 10 and PM 2.5 , and IT and BA for PM 2.5 are omitted from this analysis for not being informative concerning the analysis performed in this section.
Although the interpretation of these results would require an in-depth assessment of emission sources contributing to air quality at the monitoring stations from which TS were obtained, in Figure 6, it can be noticed that some of the relative fluctuations in concentrations detected in Regime 1 to Regime 3 could be showing some patterns of anthropogenic activity (grey arrows). However, this pattern could be reflected at a different time of the day depending on the location (distance) of the monitoring station with respect to the activities responsible for the emission. This could be the case of the TS from TR, BA, FR, and HR. In these TS, an increase in PM 10 concentration can be detected in Regimes 1 and 3 before 12 h and after 18 h. Additionally, this latter activity is detected in the TS from PT. In the case of the TS from ES, this raise in PM 10 concentration is detected just after 12 h. This could suggest that although Regime 3 was typically associated with moderate contributions from desert outbreaks (Section 2.3), this regime could also be influenced by contributions from these anthropogenic activities, although to a much lesser extent as evidenced by the magnitude of the fluctuations estimated in this regime. In this regard, another more plausible explanation could be related to the evaporation of semi-volatile organics or semi-volatile inorganic species. This could be similar to the diurnal cycle of the NO 3 [39], in which the simultaneous drop in humidity and the rise of temperature causes the evaporation of NH 4 NO 3 . The case of PM 2.5 in Figure 7 shows similarities with the analysis from Figure 6. Fluctuations around 12 and 18 h can be described for Regimes 1 to 3, being even more evident for the case of PT. In both, either PM 10 or PM 2.5 analysis, patterns of anthropogenic activity could be partially explained by the location of the stations at low altitude (eight out of ten stations are located below 200 m a.s.l-above sea level; see SM3), and therefore, more likely to be affected by its emission sources.

PCA
Results of PCA biplot analysis for the different pollution profiles obtained for PM 10 and PM 2.5 from analysed monitoring stations are illustrated in Figures 8 and 9, respectively. The importance of the principal components is represented as the proportion of variance (in percentage) explained (Table 4). In Figure 8, the confidence ellipses indicate a remarkable grouping of most of the countries during all studied years, except for Turkey (TR-yellow ellipse), due to the higher dust load that this country receives from North African deserts.
In particular, for the year 2015, this latter country exhibits a complete dissociation from the rest of the countries in the five studied stations. Monitoring stations from Spain (ES), a country typically suffering from desert particulate matter contributions, show similar behaviour with those from Turkey, since they appear within the TR confidence ellipse. Other monitoring stations showing similar behaviour to those from TR are from Portugal (PT) and Italy (IT), and to a lesser extent, from France (FR), indicated by the ellipse superposition during the studied years. All biplot vectors point to the location of the Turkey ellipse, indicating that these monitoring stations exhibit a higher magnitude than all the PM 10 regimes estimated, coinciding with the values from Figure 4A. With respect to PM 2.5 , the superposition of ellipses is less evident than for the PM 10 case, although the distinct role of Turkey remains regarding the magnitude of the estimated regimes (2016 and 2017). In general, PM 10 and PM 2.5 confidence ellipses show a similar pattern during the studied years, although the PM 2.5 analysis for TR in 2015 is not available due to the lack of data.

Conclusions
This study aims to estimate the PM10 and PM2.5 pollution profiles at background locations in Mediterranean countries and to quantify the contributions to PM10 and PM2.5 caused by dust outbreaks from North African desert regions. The time series clustering method based on Hidden Markov Models was used to estimate such pollution profiles, and contributions were obtained according to their given definitions. A discussion on the advantages and limitations of the time series clustering methods was provided in the methodological section. Countries like Turkey and Spain show the PM10 profiles with the highest concentration, and Turkey, Croatia, Portugal, Italy, and France with respect to

Conclusions
This study aims to estimate the PM 10 and PM 2.5 pollution profiles at background locations in Mediterranean countries and to quantify the contributions to PM 10 and PM 2.5 caused by dust outbreaks from North African desert regions. The time series clustering method based on Hidden Markov Models was used to estimate such pollution profiles, and contributions were obtained according to their given definitions. A discussion on the advantages and limitations of the time series clustering methods was provided in the methodological section. Countries like Turkey and Spain show the PM 10 profiles with the highest concentration, and Turkey, Croatia, Portugal, Italy, and France with respect to PM 2.5 .