A Methodological Approach to Use Contextual Factors for Epidemiological Studies on Chronic Exposure to Air Pollution and COVID-19 in Italy

The large availability of both air pollution and COVID-19 data, and the simplicity to make geographical correlations between them, led to a proliferation of ecological studies relating the levels of pollution in administrative areas to COVID-19 incidence, mortality or lethality rates. However, the major drawback of these studies is the ecological fallacy that can lead to spurious associations. In this frame, an increasing concern has been addressed to clarify the possible role of contextual variables such as municipalities’ characteristics (including urban, rural, semi-rural settings), those of the resident communities, the network of social relations, the mobility of people, and the responsiveness of the National Health Service (NHS), to better clarify the dynamics of the phenomenon. The objective of this paper is to identify and collect the municipalities’ and community contextual factors and to synthesize their information content to produce suitable indicators in national environmental epidemiological studies, with specific emphasis on assessing the possible role of air pollution on the incidence and severity of the COVID-19 disease. A first step was to synthesize the content of spatial information, available at the municipal level, in a smaller set of “summary indexes” that can be more easily viewed and analyzed. For the 7903 Italian municipalities (1 January 2020—ISTAT), 44 variables were identified, collected, and grouped into five information dimensions a priori defined: (i) geographic characteristics of the municipality, (ii) demographic and anthropogenic characteristics, (iii) mobility, (iv) socio-economic-health area, and (v) healthcare offer (source: ISTAT, EUROSTAT or Ministry of Health, and further ad hoc elaborations (e.g., OpenStreetMaps)). Principal component analysis (PCA) was carried out for the five identified dimensions, with the aim of reducing the large number of initial variables into a smaller number of components, limiting as much as possible the loss of information content (variability). We also included in the analysis PM2.5, PM10 and NO2 population weighted exposure (PWE) values obtained using a four-stage approach based on the machine learning method, “random forest”, which uses space–time predictors, satellite data, and air quality monitoring data estimated at the national level. Overall, the PCA made it possible to extract twelve components: three for the territorial characteristics dimension of the municipality (variance explained 72%), two for the demographic and anthropogenic characteristics dimension (variance explained 62%), three for the mobility dimension (variance explained 83%), two for the socio-economic-health sector (variance explained 58%) and two for the health offer dimension (variance explained 72%). All the components of the different dimensions are only marginally correlated with each other, demonstrating their potential ability to grasp different aspects of the spatial distribution of the COVID-19 pathology. This work provides a national repository of contextual variables at the municipality level collapsed into twelve informative factors suitable to be used in studies on the association between chronic exposure to air pollution and COVID-19 pathology, as well as for investigations on the role of air pollution on the health of the Italian population.


Introduction
Air pollution is a major global public health risk factor and puts an enormous health and economic burden on human societies. Based on the last available estimates, air pollution ranked 4th among major mortality risk factors globally, exceeding the impacts of obesity, high cholesterol, and malnutrition. Air pollution is estimated to have contributed to 6.67 million deaths worldwide in 2019, nearly 12% of the global total, and ambient PM 2.5 alone is responsible for 4.14 million deaths [1,2].
An overwhelming body of evidence has accumulated over the past two decades, demonstrating that health effects of air pollution can affect nearly all organ systems [3,4]. Recent systematic reviews of epidemiological evidence linking ambient air pollution (both long-and short-term exposures) to human health are collected in a Special Issue [5], adopted as a basis to inform the formulation of the new air quality guidelines (AQG) published by WHO in 2021 [6]. The new AQGs reflect the large impact of air pollution on global health, halving the recommended limits for average annual PM 2.5 levels from 10 micrograms per cubic meter to 5, and lowering those for PM 10 from 20 to 15 micrograms.
There is now broad scientific consensus that long-term exposures to air pollution contribute to increased risk of illness and death from ischemic heart disease, lung cancer, chronic obstructive pulmonary disease (COPD), lower-respiratory infections (e.g., pneumonia), stroke, type 2 diabetes, and adverse birth outcomes [1,2,7,8]. Interestingly, many chronic health conditions, such as diabetes, cardiovascular disease and COPD, have also been associated with increased vulnerability to COVID-19 [1,[9][10][11]. Long-term exposure to air pollution can therefore indirectly worsen the prognosis of COVID-19 by increasing the risk of chronic diseases associated with COVID-19, but can also act directly, as it can suppress or influence early immune responses to SARS-CoV-2 infection [12] and alter the host's immunity towards respiratory infections [13]. These associations, moreover, have been shown to be biologically plausible [14]. However, the exact contribution of long-term exposure to air pollutants in modulating the spread and severity of COVID-19 is still controversial.
The availability of both atmospheric pollution and COVID-19 data, and the easiness to make simple geographical correlations between them, has led to a proliferation of ecological studies which have related the levels of pollution in an area (county, municipality, zip code areas, region, etc.) to COVID-19 incidence, mortality, or lethality rates in that area [12,[15][16][17][18]. However, the potential risk of ecological fallacy can lead to non-existent risk associations or, even worse, in the opposite direction compared to true associations at an individual level. The strengths and limitations of different approaches, as well as challenges and recommendations for studying outdoor air pollution in relation to COVID-19, have been reviewed [11,[19][20][21]. Studies available on the incidence, spread and severity of COVID-19 have not taken into account, or have not done so adequately, individual risk factors such as gender, age, area of residence, comorbidities, or occupation, as well as the role of context variables, such as socio-economic deprivation, health supply, production activities that may involve a greater risk of contagion, social interactions in the community, mobility, time-activity patterns, type of environment (urban, rural, semi-rural) and demographic factors [20].
Due to the limitations of the data currently available and the type of predominantly adopted (ecological) study design, most epidemiological studies available are not able to give an exhaustive answer to the question whether and to what extent air pollution increases the COVID-19 disease severity.
In Italy, the National Health Institute of Health (ISS) and the National Network System for Environmental Protection (SNPA-ISPRA) have launched, in collaboration with the Ital-ian Environment and Health Network (https://rias.epiprev.it/) (accessed on 24 February 2022), the EpiCovAir epidemiological studies program, based on the data produced by the national integrated COVID-19 surveillance (https://www.epicentro.iss.it/coronavirus/) (accessed on 20 February 2022) and by the SNPA (www.snpambiente.it) (accessed on 20 February 2022). EpiCovAir aims to carry out epidemiological studies at the national level to verify the association of long-term exposure to air pollution and the onset of symptoms and the severity of the health effects among COVID-19 cases in Italy while adjusting for socio-demographic and economic confounding factors associated with the infection.
In this regard, there is a need to better understand the role of contextual variables such as municipalities characteristics, resident population features, the network of social relations, the mobility of people, and the responsiveness of the NHS. It is also necessary to synthesize the content of spatial information, available at the municipal level, in a smaller set of "summary indexes" variables that can be more easily viewed and analyzed for understanding the dynamics of the phenomenon.
The objective of this paper is therefore to identify and collect the contextual factors available at the municipality level in Italy (characteristics of administrative areas, demographic, mobility, and socio-economic-health information of communities), and to synthesize their information content to produce indicators to be used in national epidemiological studies aimed at assessing the role of air pollution on the incidence and severity of the COVID-19 disease.

Variables Selection
A large number of spatial variables related to the 7903 Italian municipalities (list as of 1 January 2020-ISTAT) and their resident communities, were initially identified, collected and processed, and grouped into five information dimensions a priori defined as: (1) geographic characteristics of the municipality, (2) demographic and anthropogenic characteristics, (3) mobility, (4) socio-economic-health area, and (5) availability of health care.
Most of the data used in the analyses carried out in this paper are freely downloadable, and Table S1 shows the complete list of the variables, their description, the temporal dimension, and the source of the data. The spatial typology of data collected varies from municipality to municipality and are assumed to be constant over the period of time to which they refer.
Due to the high correlation of several variables referring to the same phenomenon (for example, altitude and altimetric zone), in order to avoid redundancy of information, 44 variables by five macro-categories were selected; three further variables were added to describe pollution levels for each Italian municipality.

Statistical Analysis
In order to evaluate the relationship between the variables within each dimension, and to identify the presence of further redundancy, Spearman correlation coefficients (ρ) have been estimated to quantify the relationship between each variable pair (x and y). A principal component analysis (PCA) was performed for each dimension. The goal of the PCA is to reduce the large number of initial variables into a smaller number of components, limiting as much as possible the loss of information content (variability). This occurs through a linear transformation of the variables that projects the original ones into a new Cartesian system in which the variables are sorted in decreasing order of variance. Therefore, the variable with the greatest variance is projected to the first axis, the second to the second axis, and so on. The reduction of complexity occurs by limiting itself to analyzing the main ones (in terms of variance) among the new variable. The PCA is effective only when there is a good share of variance in common among the variables (with correlation coefficients that are not very low or very high); in this case, a few principal components will be sufficient to obtain a good approximation to the starting matrix. The advantage of the PCA is the ability to condense most of the variances and covariances present in the initial set of variables into the first components. Thus, considering only the first principal components, we obtain the best possible synthesis of the information provided by the initial variables.
Within each of the five dimensions a priori defined, each main component represents a linear combination of the starting variables and, consequently, the intra-group correlation between the components is equal to 0.
For the PCA purposes, the starting variables were therefore standardized (mean = 0 and variance = 1); then, for each dimension, only the main components with eigenvalues ≥ 1 were selected [22]. This guideline is based on the idea that, given a certain total variability of all standardized variables, a PCA should explain at least one variation equal to the mean value of a single standardized variable.
The analyses also included ordinal qualitative variables for which it made sense to hypothesize a unit linear increase in the transition from one category to another (for example, degree of urbanization, socio-economic position (SEP).
We also analyzed correlation of PCA components with PM 2.5 , PM 10 and NO 2 population weighted exposure (PWE) values, obtained using a four-stage approach based on the machine learning method, "random forest", which use space-time predictors, satellite data, and air quality monitoring data estimated at a national level [23].
All analyses were conducted using R statistical software (version 3.6.0) [24]. Table 1 describes the characteristics, across the 7903 Italian municipalities, of the 18 contextual continuous variables for each dimension under study and air pollution PWE concentrations, while Table 2 shows the distribution of the 7903 Italian municipalities with respect to the 26 contextual categorical variables for each dimension under study.  # mean, standard deviation (SD), percentiles (p5, p25, p50, p75, p95), minimum (min) and maximum (max) value; * number of non-resident individuals who carry out work or study activities in the municipality over the total mobility flows (active residents plus outgoing flows of residents); ** number of resident individual who carry out work or study activities in the municipality over the total mobility flows (active residents plus outgoing flows of residents); *** number of individuals who travel outside the municipality of residence for work or study reasons; **** number of individuals who move within the municipality of residence for work or study; ***** ratio between the total gross income of registered households and the total number of members of registered households. * municipalities with at least 0% of the surface at a maximum distance of 10 km from the sea; ** data from the 2011 Italian Census and calibrated on a regional basis; *** on 31 December 2019.

Results
The area of the Italian municipalities varies from 0.120 km 2 to about 1300 km 2 (mean 35.2; standard deviation (SD) 50.8). The maximum altitude is 2035 m above sea level. Coastal municipalities account for 8.1% of the total, while 0.4% are island municipalities. About 64% of the municipalities are located in rural or sparsely populated areas. As regards the level of anthropization of the Italian municipalities, we used the maximum value of impervious surfaces (ISA) in a 1 km × 1 km cell within the municipal area. ISA is an indicator of the spatial distribution of surfaces. Examples of ISAs include streets, parking lots, buildings, driveways, sidewalks. The ISA maximum value measured in a 1 km × 1 km cell was equal to 58.6 (SD = 42.7), while for the night luminosity index, the measured value was 23.1 (SD = 41.9). The percentage of urban coverage (maximum value in a 1 km × 1 km cell) is on average less than 50% (41.2% SD = 27.6).
The population as of 31 December 2019 ranged from a minimum of 30 to a maximum of about 3 million inhabitants, with a median of 2459 residents. The median population density is 105 inhabitants per km 2 . The percentage of residents aged 65 years and more in the Italian municipalities ranges from a minimum of 8.6% to a maximum of 62.3% (95th percentile equal to 34.8%).
As for mobility, the median value of the attraction index is equal to 21.1 (5th percentile 6.7, 95th percentile 45.7), while for the self-containment index we observed a value equal to 33.2 (5th percentile 15.5, 95th percentile 59.7). The maximum number of people who move outside the municipality for work or study reasons is more than 90,000 people, while about 1,300,000 move within the municipality. In 41.5% of the municipalities, there is an airport within 30 km of the municipal boundaries (4.8% 2 or more), while only 22.3% of the municipalities have at least one railway station on its territory.
For the variables describing the socio-economic-health dimension, the average family income in Italian municipalities is equal to 13,000 Euros (SD 3123; min 3796; max 29,985), while the entrepreneurship rate varies from a minimum value of 9 to a maximum of 407 companies per 100,000 inhabitants. Annual all causes mortality rates vary between 0.7 (5th percentile) and 2 (95th percentile) percentile) with a mean 1.2 (SD 0.4) per 100; cardiovascular disease mortality rates vary between 0.2 (5th percentile) and 0.9 (95th percentile) with a mean of 0.5 (SD 0.2) per 100; mortality rates from respiratory diseases vary between 0.02 (5th percentile) and 0.2 (95th percentile) with a mean of 0.1 (SD 0.06) per 100. Annual hospitalization rates of residents in Italian municipalities vary between 4 (5th percentile) percentile) and 6.1 (95th percentile) with a mean of 5 (SD 0.2) per 100; cardiovascular disease hospitalization rates vary between 0.8 (5th percentile) and 1.8 (95th percentile) per 100 residents with an average of 1.2 (SD 0.31) per 100 residents; and respiratory disease hospitalization rates vary between 0.5 (5th percentile) and 2 (95th percentile) with an average of 0.7 (SD 0.2) per 100.
Regarding health care availability, the minimum average distance of Italian municipalities from a health facility is just over 9.4 km (SD 6 km), while the distance from an emergency room is 10.7 km (SD 6.4 km). Less than 1% of municipalities (0.91%) have at least one hospital or university hospital on their territory, approximately 7% have at least one hospital, and 3.6% of municipalities have at least one nursing home. In 8.2% of the municipalities, at least one acute care bed is available. In 4% of the municipalities, at least one bed is available for long-term care; in 5.3%, at least one bed is available for rehabilitation; and in 4.8%, at least one place is available in intensive care (3.2% between 1 and 10 places). In 230 municipalities (2.91% of the total), there is at least one emergency room, and in 589 municipalities (7.5%), at least one private hospital. In the territory of 481 municipalities (6%), there is at least one residence for the elderly with medical assistance.
As for air pollution, in the period 2016-2019, the population weighted average exposure value to PM 2.5 was equal to 14.6 µg/m 3 (SD 5.0 µg/m 3 ), for PM 10 it was 21.1 µg/m 3 (SD 6.5 µg/m 3 ) while for NO 2 it was equal to 14.5 µg/m 3 (SD 6.5 µg/m 3 ) with a maximum of 46.3 µg/m 3 . Figure 1 shows the correlations between the variables within each of the five categories under study.
Among the geographic characteristics of the municipalities (Figure 1a), the greatest correlation is observed between the coastal zone and coastal municipality variables (ρ = 0.72). For the demographic and anthropogenic characteristics dimensions (Figure 1b), the percentage of population over-65 years shows a moderate negative correlation with all the other variables, which in general are all very correlated to each other. For the mobility dimension (Figure 1c), a correlation equal to 0.97 is observed between total movements and intra-municipal and extra-municipal movements, which are highly correlated (also between them) (p = 0.88). From the correlation matrix of the socio-economic-health dimension (Figure 1d), the all-cause mortality rate is highly correlated with that for diseases of the circulatory system (ρ = 0.66), and the total hospitalization rate with that for diseases of the circulatory system (ρ = 0.75).
As for the health supply dimension (Figure 1e), the variables that describe the supply of beds are all positively correlated with each other and also positively correlated with the presence of health facilities in the municipal area. Finally, the variables distance from an emergency room and distance from a health facility are highly correlated (ρ = 0.85).
The results of the PCA are shown in Figure 2. For each dimension, the most informative components are displayed (i.e., components with eigenvalue ≥ 1). Size and colors of dots indicate the contribution of the single variable in the explanation of the component; in this way, in fact, it is possible to understand which variable contributes most to the construction of the component itself. Overall, the PCA made it possible to extract twelve components: three for the territorial characteristics dimension of the municipality (variance explained 72%), two for the demographic and anthropogenic characteristics dimension (variance explained 62%), three for the mobility dimension (variance explained 83%), two for the socio-economic-health dimension (variance explained 58%), and two for the health supply dimension (variance explained 72%).   Table 1.  Table 1.  Table  1.  Table 1.
All the components of the different dimensions are only marginally correlated with each other (Figure 3), demonstrating their potential ability to grasp different aspects of the geographical distribution of the COVID-19 pathology.  Table  1. . Figure 3. Correlation matrix among the components resulted from the principal component analysis.

Discussion
A large body of evidence on the impact of air pollution on human health has accumulated over the past years. Among the various documented health effects, air pollution also increases the risk of chronic diseases (respiratory, cardio-metabolic), described as the comorbidities increasing the risk of being hospitalised or dying from COVID-19. This evidence contributed to a fast proliferation of epidemiological studies linking ambient air pollution to COVID-19 disease. However, the exact role of air pollutants in modulating the spread and severity of COVID-19 is still unclear.  Table 3 shows the correlation of the twelve components with the air pollution variables. We did not find high correlation values between components and the Italian population weighted exposure level to PM 2.5 , PM 10 , and NO 2 , except for the second component of the territorial characteristics dimension of the municipality for which there is a high negative correlation with all three pollutants.

Discussion
A large body of evidence on the impact of air pollution on human health has accumulated over the past years. Among the various documented health effects, air pollution also increases the risk of chronic diseases (respiratory, cardio-metabolic), described as the comorbidities increasing the risk of being hospitalised or dying from COVID-19. This evidence contributed to a fast proliferation of epidemiological studies linking ambient air pollution to COVID-19 disease. However, the exact role of air pollutants in modulating the spread and severity of COVID-19 is still unclear.
Among the aspects contributing to make the issue difficult to face, there is the need to integrate methods and approaches belonging to different disciplines, like epidemiology of infectious diseases, environmental epidemiology of non-communicable diseases, and assessment and modelling of exposure to air pollutants. To correctly address the question of how and how much air pollution does impact on COVID-19 disease, it entails both to understand the spatial and temporal dynamics of the epidemics, whose spread is primarily based on direct contagion, and to identify the most relevant factors linked to the probability of becoming a case, and/or to the risk of hospitalisation and disease prognosis.
Dealing with the above aspects strongly depends, among other aspects, on the ability to adopt appropriate study design/analytical models, and to have information on individual variables (age, gender, comorbidities, etc.) and on contextual covariates. Contextual factors include characteristics of the area of residence, socio-economic indicators, availability of healthcare services and social interactions within communities, mobility, time-activity patterns, type of environment (urban, rural, semi-rural), prevalent occupational activities, and demographic and background health profiles.
Epidemiological research has a long tradition of studies based on the systematic reports about potential time and space varying determinants of diseases (i.e., characteristics of the territory and of the population), following the pioneering work of William Farr with his studies on cholera [25].
The present work follows this approach by addressing the critical issue of how to identify, collect and synthetize relevant information from large national datasets of contextual variables, in order to better characterise, through epidemiological studies, the relationships between air pollution and COVID-19 contagion and severity at the national level.
By applying data reduction techniques to the overall dataset of collected variables, this work made it possible to identify few informative summary factors accounting for large amounts of the observed variance within each of the five dimensions of contextual factors: from 58% (socio-economic-health dimension) to 83% (mobility dimension). The explained variance might be greatly increased by including one or two additional components with eigenvalues approaching the value of 1. For instance, regarding the socio-economic-health dimension (58% explained variance), the PCA analyses showed that including a third component with eigenvalue = 0.96, and a fourth component with eigenvalue = 0.95, would increase the explained variance respectively to 68% and to 79%. Similarly, the inclusion of a third component with eigenvalue = 0.91 would augment the healthcare offer dimension's overall explained variance from 72% to 79%.
The ongoing EpiCovAir program, as well as other epidemiology studies, would benefit from the explorative and preparatory work herein presented. The methods adopted are of course still susceptible to improvement. To this regard, new analyses are focusing, for instance, on the use of a generalized propensity score (GPS) approach [26], representing the conditional probability of being exposed to air pollution given the observed values of area-level covariates, to account for the major determinants of the spatial distribution of COVID-19 cases and case-fatality rates.

Conclusions
In conclusion, the present work provides both a method and a dataset to be used in epidemiological studies on the association between chronic exposure to air pollution and health outcomes in Italian territory. All information has been collected and is available at the municipality level. The data repository is available upon request to the authors.