Measuring Urban Land Cover Influence on Air Temperature through Multiple Geo-Data—The Case of Milan, Italy

Climate issues are nowadays one of the most pressing societal challenges, with cities being identified among the landmarks for climate change. This study investigates the effect of urban land cover composition on a relevant climate-related variable, i.e., the air temperature. The analysis exploits different big geo-data sources, namely high-resolution satellite imagery and in-situ air temperature observations, using the city of Milan (Northern Italy) as a case study. Satellite imagery from the Landsat 8, Sentinel-2, and RapidEye missions are used to derive Local Climate Zone (LCZ) maps depicting land cover compositions across the study area. Correlation tests are run to investigate and measure the influence of land cover composition on air temperature. Results show an underlying connection between the two variables by detecting an average temperature offset of about 1.5 ◦C between heavily urbanized and vegetated urban areas. The approach looks promising in investigating urban climate at a local scale and explaining effects through maps and exploratory graphs, which are valuable tools for urban planners to implement climate change mitigation strategies. The availability of worldwide coverage datasets, as well as the exclusive use of Free and Open Source Software (FOSS), provide the analysis with a potential to be empowered, replicated, and improved.


Introduction
Assessing and mitigating human-driven impacts on climate is nowadays a well-established societal concern, as well as a highly pressing challenge for a number of actors operating across different scales and sectors. Interest from the scientific community in climate change dates back to more than a century ago [1]. However, it was only over the last few decades that the growing participation of public and non-public institutions to international summits on climate has provided a clear proof of the increasing importance attributed to this phenomenon from both the economic and the civil society [2,3]. This has raised attention on a number of climate-related issues, recognized as priorities by the international community.
Among them, the increasing trend in urbanization has been identified as a landmark for climate change. Today, cities host 55% of the world's population and this number is projected to rise to 68% by 2050 [4]. Cities generate more than 80% of global Gross Domestic Product (GDP), but they also consume about 2/3 of the world's energy and account for more than 70% of global greenhouse out so far. The proposed analysis aims to provide insights into the underlying effect of land surface features on the air temperature by leveraging geo-data exploratory procedures. This enables to answer questions such as how and how much each LCZ class is correlated to local air temperature observations as well as to highlight strengths and limits of the considered geo-data in achieving the analysis goal.
The remaining of the paper is structured as follows. Section 2 introduces the area chosen to analyze the relation between urban land cover and temperature, as well as the data sources and software technology used. In Section 3 the methodology undertaken to preprocess the datasets, compute the LCZ maps and correlate them with temperature observations are presented in detail. Section 4 discusses the results achieved. Section 5 concludes the paper by outlining the main findings of the study.

Study Area
The analysis framework developed within this study is tested on the city of Milan (Northern Italy). Milan is located in one of the most densely populated European regions, the Padana Plain (see Figure 1), affected by poor wind circulation and high urbanization which favour the stagnation of pollutants as well as the persistence of a historically well-known UHI [28]. This makes Milan a suitable test area to investigate the influence of land cover on local air temperature that is the primary goal of the proposed analysis. The study region is characterized by a flat terrain with an average altitude of 120 m above sea level. Minimum and maximum altitudes are 85 m and 175 m above sea level, respectively. Due to this small range, we did not account for any relevant variation of temperature produced by orographic phenomena. The region climate characteristics can be summarized according to the Köppen-Geiger Climate Classification [29] in warm temperate, fully humid with hot summer (Cfa zone). Such environmental peculiarities are also common to other cities in Italy and Southern Europe [30], thus the same work might be replicated or adapted in other sites.

Data Collection
Collected data consists mainly of optical satellite imagery and temperature series from ground sensors. Data sources have been selected by considering data availability as well as licenses. The whole data used for this work is available under fully open licenses or at most distributed by the providers under custom open licenses enabling exploitation for research purposes. A detailed description of the two data sources and the strategies for their collection is reported in the following. According to the satellite revisiting time and weather conditions, two different images were downloaded for each satellite: The first representing a non-vegetative period (from October to March) and the second representing a vegetative period (from April to September). The two periods were defined according to seasonal features. Non-vegetative includes autumn and winter seasons, whereas vegetative refers to spring and summer seasons. Table 1 summarizes the acquired images for Landsat 8, Sentinel-2, and RapidEye. Descriptions of the reference synoptic conditions at the dates of acquisition of the satellites are reported in Table 2.

Air Temperature Observations
Time series of environmental and weather observations are traditionally collected and distributed by public authorities with fewer restrictions. In Italy, official environmental monitoring is accomplished by the Regional Agency for Environmental Protection (ARPA, Agenzia Regionale per la Protezione dell'Ambiente) through its sensor networks located in all the 20 administrative regions. Time series can be requested by any user through the ARPA data download platform. Data is provided under a custom open license with quality standard certification as requested by the national law. For this study, additional weather stations are considered in order to increase the spatial coverage of temperature observations over the study area. This is due to the need of investigating the correlation between temperature observations and LCZ at a very local scale, i.e., taking into account the likely point-wise nature of the two phenomena and their possible cause-effect relationship (see Section 3). The additional observations are collected from the Weather Underground (WU) platform [36]. This is an Internet service providing weather observations collected from both private and public weather stations from all over the world. Being based on a crowdsourced data collection strategy, information is delivered with no quality standard certification. Weather observations are publicly shared through the Weather Underground API, which allows requesting data at no cost for non-commercial purposes and within the limits expressed by the API provider. Because of new company policies, starting from June 2018 WU is no longer providing free API keys. All the data used in this study was acquired before this policy change. WU temperature time series for the stations of interest in Milan are downloaded in comma-separated values (CSV) format using a custom JavaScript collector module developed by the authors. The collected series refer to the period 1 January to 31 December 2016. Temperature measures from both ARPA and WU are available with a sub-hourly frequency. Stations locations are also retrieved and organized in a separate table linked with the time series CSV table for both data sources. The distribution of ARPA and WU stations providing temperature observations in Milan is depicted in Figure 2. The satellite basemap clearly shows that the stations in the South and South-West sides of the study area are mainly located in rural or not heavily urbanized areas.

Software Tools
This work makes exclusive use of Free and Open Source Software (FOSS) as data analysis tools with the aim of favoring the reproducibility of the study. The premier FOSS QGIS [37] is exploited for general geodata management, layout design, and part of the raster and vector processing. The programming languages Python and R are also widely used for data analysis and processing. The most exploited Python library is the Python Data Analysis Library (pandas) [38], an open source library providing high-performance data structures and data analysis tools for Python. The geospatial extension of this library, i.e., geopandas [39], is also used. Statistical processing is performed by using the NumPy [40] and the SciPy [41] libraries that provide many built-in methods for numerical and statistical computations. The rasterstat library [42] is also adopted to enable raster layer processing. Regarding R, ad hoc scripts are used for processing and classification of satellite imagery, by exploiting the raster [43] and randomForest [44] packages. The former allows to read, write and manipulate gridded spatial data, while the latter performs data classification and regression based on a forest of trees using random inputs. Details about the use of software and libraries are included in the following sections.

Data Processing
LCZ maps can explain, in an intuitive way, the contribution of urban surface characteristics to heat fluxes. This is better achieved when LCZ maps show a good agreement with temperature observations, so that this coupled information can act as a valuable tool for urban climate management and local mitigation strategies. In addition to the classification accuracy, the resolution of LCZ maps is also fundamental for their adoption in urban planning and management. In the following, a FOSS-based methodology is described to exploit satellite imagery at a different spatial and spectral resolution, derive detailed LCZ maps, assess their classification accuracies, and perform a combined analysis with air temperature data.

Satellite Imagery Preprocessing
In order to perform LCZ classification of satellite imagery, besides standard procedures such as the clip and merge of tiles to reconstruct a complete scene of the study area at any time period, band pixels need first to be converted to their real reflectance values. This ensures that classifications are not negatively affected by atmospheric disturbances so that images acquired in different seasons can be compared. Specific processing for the different satellite imagery is described below.
• Landsat 8: Level-1 data available to users consists of radiometrically and geometrically corrected images. The Level-1 image is presented in units of DNs, which can be easily rescaled to spectral radiance or Top of Atmosphere (TOA) reflectance. In order to obtain real surface reflectance values, band pixels need to be further corrected. The Semi-Automatic Classification Plugin for QGIS [45] was used for such preprocessing. • Sentinel-2: The Copernicus Open Access Hub provides Sentinel-2 imagery with different level products. In this study Level 1C imagery is used, which is ortho-corrected and with pixel radiometric measurements provided in Top of Atmosphere (ToA) reflectance. Moreover, with the aim of exploiting the complete band set of Sentinel-2 imagery at 10 m spatial resolution, resampling is required. As for Landsat 8 the preprocessing, including atmospheric correction and resampling, was performed through the Semi-Automatic Classification Plugin in QGIS. • RapidEye: The Planet Explorer platform distributes only ortho-corrected imagery with bands at 5 m spatial resolution. Atmospheric disturbances can be adjusted by exploiting the band reflectance and geometric coefficients available in the metadata file of each tile, in either eXtensible Markup Language (XML) or Javascript Object Notation (JSON) formats, and by applying a Dark Object Subtraction (DOS) procedure [46]. To automate this process, an R script based on the open source Geospatial Data Abstraction Library (GDAL) [47] was created. The script makes also use of the raster R package.

Local Climate Zone (LCZ) mapping
LCZ classification system divides the urban area into different climate zones based on their physical and thermal characteristics. The classification consists of 17 classes, which are divided into "built types" and "land cover types" mainly based on the characteristics of the surface structure (height and density of buildings and trees) and surface cover (impervious and pervious) [17]. In the city of Milan, 8 different LCZ classes were found and considered out of the 17 original classes. A brief description of each class is reported in Table 3.

Random Forest (RF) Classification
To perform the LCZ classification, a number of steps were required which are summarized in the flowchart in Figure 3.  Compact low-rise Dense mix of low-rise building (1-3 stories). Few or no trees. Land cover mostly paved. Stone, brick, tile, and concrete construction material. 3 Open midrise Open arrangement of midrise buildings (3-9 stories). Abundance of pervious land cover (low plants, scattered trees).
Concrete, steel, stone, and glass construction materials.

Open low-rise
Open arrangement of low-rise buildings (1-3 stories). Abundance of pervious land cover (low plants, scattered trees).
Wood, brick, stone, tile, and concrete construction materials.

Large low-rise
Open arrangement of large low-rise buildings (1-3 stories). Few or no trees. Land cover mostly paved. Steel, concrete, metal, and stone construction materials. 6 Scattered trees Lightly wooded landscape of deciduous and/or evergreen trees. Land cover mostly pervious (low plants). Zone function is natural forest, tree cultivation, or urban park.

7
Low plants Featureless landscape of glass or herbaceous plants/crops. Few or no trees. Zone function is natural grassland, agriculture, or urban park. 8 Water Large, open water bodies such as seas or lake, or small bodies such as rivers, reservoirs, and lagoons.
A supervised pixel-based classification was carried out using training areas digitized in QGIS and selected according to the LCZ classification system. Training polygons were defined by means of open geo-databases provided by the Lombardy Region. The most considered was the DUSAF (Destinazione d'Uso dei Suoli Agricoli e Forestali) [48], which includes detailed land use and land cover vector layers at scale 1:10,000, coupled with additional information such as the building heights from the regional topographic database [49].
The classifier adopted for this work was the Random Forest (RF) algorithm [50]. RF was chosen due to a number of reasons as follows. It is non-parametric, it provides high accuracy and fast computational performance on large databases, and since the classification unbiased error is estimated internally in the run, there is no need for cross-validation or for using a separate testing dataset [51]. RF classification was performed through the use of an R script, based on the RF implementation randomforest in R.
Accuracy of the RF classification outputs was assessed using the so-called "out-of-bag" (OOB) samples, not included in the training data. Common statistical measures were calculated from the confusion matrix [52], including Overall Accuracy (OA), Producer's Accuracy (PA) and User's Accuracy (UA). Since the pixel-based classification is typically noisy, a post-classification filtering was applied. In particular the Majority filter [53], available within the raster package in R, was applied on the resulting maps with a 3 × 3 moving window.

Air Temperature Time Series Adjustment
Time series of temperature observations were manipulated through Python Pandas scripting. Raw CSV files containing temperature observations from each station were collapsed in a single pandas DataFrame object. This is a two-dimensional tabular data structure with labeled axes, which uses the timestamps of collected observations as row indexes and the unique ids of the stations as column indexes. Sub-hourly observations are aggregated into hourly averages. A quick quality check of observations was performed by removing the stations for which data completeness-computed as the ratio between the number of hours in the study period with observations actually available and the total number of hours in the study period-was lower than 0.8. Hence, each considered station series contained less than 20% missing hourly observations. An additional check was performed to remove stations having a minimum correlation coefficient-computed as the pairwise Pearson's correlation coefficient [54]-with all the other stations lower than 0.75. These two thresholds were arbitrarily selected as coarse indicators of the goodness of the time series. The values are aligned with those suggested by the Italian Institute for the Environmental Protection and Research (ISPRA) [55]. Finally, a rolling mean using a 5 h window was applied to smooth the series. This preliminary data cleaning step was useful to improve the reliability of the available data, especially the WU observations which-in contrast to the ones from ARPA-are provided without any prior quality check (Figure 4). The temperature time series resulting from these cleaning operations were considered as reference data for the correlation analysis described in the following sections.

LCZ and Air Temperature Correlation Analysis
To investigate the correlation between the land cover type, expressed by the LCZ, and the air temperature, a 500 m buffer around each temperature station was considered. This relatively small buffer radius was introduced to account for the large variability of land cover features that characterize the urban environment. Moreover, temperature measurements are representative only in the spatial proximity of the sensors, where temperature changes can be assumed as negligible. By selecting a small buffer radius, also the point-wise nature of the temperature observations was taken into account for the correlation tests. Nevertheless, thanks to the abundance of weather stations across the Milan area, a satisfactory data coverage for the study area was achieved, especially within the city centre and the suburbs (Figure 2).
The fraction of each LCZ class coverage within each buffer was then calculated as the ratio between the number of pixels of a class and the total number of pixels within a buffer. A second measure of the LCZ class composition inside buffers was computed. This was achieved by assigning to each buffer its corresponding majority LCZ class id, computed as the class having the highest coverage inside the buffer out of the total number of pixels. Both measures were computed for each satellite at the two considered time periods, i.e., the non-vegetative and the vegetative (see Table 1).
These measures were used to run two separate correlation tests. The first one consisted of computing scatter-plots and connected regression lines by comparing the LCZ fractions at each station buffer with the temperature registered by the station at the exact acquisition time of satellites. This was accomplished using a custom Python script that combines GeoPandas and Rasterstat functionalities to process the station dataset and the LCZ raster maps. Scatter-plots were generated by using the Matplotlip library, while the linear regression was computed by the SciPy library. A positive slope of the regression line is interpreted as a positive contribution of the LCZ class to the temperature, while a negative slope as a mitigative effect. A scatter-plot was computed for each satellite at each time period and for each LCZ class. Results are reported and discussed in Section 4.
In the second experiment, the buffers were grouped according to their majority LCZ class id. Descriptive statistics such as the weekly temperature median and standard deviation were computed for each group of buffers from the temperature series of the stations included in each group (see Table 5). The reference week is the one in which the satellite image used for LCZ classification was acquired. The majority LCZ class and the descriptive statistics were both computed using the QGIS Zonal Statistics Plugin as well as some Python scripts based on the GeoPandas library. This second experiment was designed with the goal of finding evidence of the persistence of higher temperatures within buffers characterized by a heavy urbanization and lower temperatures in buffers characterized by a more natural land cover composition. Results are reported and discussed in Section 4.

LCZ Maps
The LCZ maps obtained are shown in Figure 5, from which differences among different sensors as well as time periods can be outlined. Table 4 summarizes the OA for each classification output. OA is around 81% for the LCZ maps derived from Landsat 8, between 75% and 78% for the maps derived from Sentinel-2, and 71% for the maps derived from RapidEye. The slightly higher values registered for Landsat 8 can be mostly attributed to the pixel size and the choice of training datasets. In fact, in Landsat 8 imagery the pixel size allows smoothing reflectance information as well as noisy details such as buildings or tree shadows, which are instead more evident in Sentinel-2 and RapidEye imagery. Moreover, with the aim of comparing the classifications derived from the different satellite imagery, the same reference samples were used with a pixel size equal to the largest among the three, i.e., that of Landsat 8. A possible improvement of the OA values for the LCZ maps derived from Sentinel-2 and RapidEye might be achieved by selecting training samples more suited to the pixel size. In addition, the lower spectral resolution of RapidEye sensors negatively influences the classification results, as highlighted by the spectral signatures of the 8 LCZ classes reported in Figure 6. Except for the class water, which shows its peculiar spectral signature, the other classes have a similar spectral behavior. The main spectral differences are evident for infrared wavelengths, starting from 800 nm, represented by three bands in Landsat 8 imagery; four bands in Sentinel-2 dataset; and only one band in RapidEye collection. These considerably reduce OA values of the LCZ maps derived from RapidEye imagery. Comparing the non-vegetative and vegetative periods, OA values are almost the same for Landsat 8 and RapidEye, while for Sentinel-2 an improvement (more than 3%) is achieved for the map corresponding to the vegetative season. Major spectral differences among the classes are found for the summer period, with respect to the winter. In terms of PA and UA of each LCZ class, the LCZ maps derived from all the three satellites show very high values for some classes such as water, low plants or large low-rise (higher than 80% and up to 99% for the water class). Other classes, e.g., compact low-rise and open midrise, show instead very low values of UA and PA, e.g., 11% PA for compact low-rise in the RapidEye-derived classification.
These low values point out the difficulty of the classification algorithm in correctly detecting some LCZ classes. Among built-up classes, compact midrise and large low-rise score high values of UA (more than 65% in all the cases), while the remaining classes cannot be distinguished. This aspect has to be considered for further analysis. Finally, as expected, a significant improvement is registered for the class scattered trees in the vegetative season, since during the non-vegetative period the absence of leaves causes confusion with the low plants class [56].

Air Temperature Patterns in Milan
Temperature maps were derived from the air temperature observations by interpolating the station measures using Kriging [57]. Ordinary Kriging was performed through an R script based on the gstat package [58], available from the QGIS Processing Toolbox. Interpolated maps provided a better understanding of the air temperature patterns in the city of Milan. These allowed to explore evidence and features of the Milan UHI. A qualitatively assess of the actual goodness for the temperature series adopted was also achieved. In fact, the series provided a reasonable feedback in describing both spatial and temporal temperature patterns in the city. The presence of a significant UHI in the North-Eastern area of the city is clearly outlined in Figure 7, especially during the non-vegetative period. Conversely, during the vegetative period the UHI is less clustered while remaining more intense around the city center.

Correlation Analysis
Results from the the correlation analysis between LCZ classes and temperature are included in Figures 8-10, and Table 5. For the sake of clarity, some LCZ class compositions for buffers were collapsed into a single class. These are classes 2, 3, 4, and 5, which represent slightly different land cover features of medium compact built-up areas. As a matter of fact, it was observed that the contribution of these classes to the correlation tests was comparable, and generally not interesting in terms of influence on air temperature. On the other hand, this might be also due to an intrinsically lower interclass accuracy of the LCZ classification procedure, that might be biased by the similar surface characteristics of these classes compared to more extreme classes such as 1, 6 and 7. Class 8, i.e., water, was not considered in the correlation analysis because no significant fractions (i.e., always < 0.04% ) were detected in any of the station buffers.
By observing the scatter-plots in Figures 8-10, it can be argued that the contribution of highly urbanized areas, such as the one expressed by the Class 1 fractions, is always positive on the air temperature. This is true for any satellite imagery and time period considered. The contribution of the collapsed Classes 2, 3, 4, and 5 shows almost no correlation evidence with air temperature. Regarding classes 6 and 7, a sensible negative correlation is detected in almost all the cases. Besides a general low linear correlation between the two variables that is expressed by the coefficient of determination (R 2 ) computed for each graph, regression lines show comparable trends by considering the same LCZ class and time period for different satellites. The different spatial and spectral resolutions of the different satellite imagery seems not to have a significant influence on this exploratory experiment. The results show how temperature patterns are connected with the land cover type that is a well known effect, also reported in literature [59]. Nevertheless, the data used in this example, as well as the analysis strategy adopted, would benefit from a further improvement in order to conceptualize robust models that can better explain this correlation.   The most important outcome is, however, the demonstration of how heterogeneous geo-data can be combined and processed to explore urban climate patterns and get insight into these [9].
With the second experiment, we aim at providing some quantitative evidence of land cover influence on air temperature instead of qualitative results as we did in the previous experiment. Table 5 reports the median and standard deviation of weekly temperatures by LCZ majority buffers, for each time period and satellite considered. A marked difference of about 1.5 • C can be observed, which characterizes median temperatures of heavily urbanized areas, i.e., with Class 1 majority, and vegetated areas such as the ones with the majority in Class 7. The majority groups of the intermediate Classes 2, 3, 4, and 5 are always placed in between these two. Standard deviations are generally lower than these differences. A different information can be retrieved by the buffers with the majority in Class 6, i.e., scattered trees. These few areas are detected in the Sentinel-2 and the RapidEye imagery in the vegetative period only. In Table 5, the Class 6 entry for the RapidEye represents the measurement and not the median because only a single buffer with this majority was detected. The absence of buffers with the majority in Class 6 for the Landsat 8 imagery can be explained by considering its lower spatial resolution (30 m). Hence, the spatial resolution of the imagery influences the classification of some LCZ classes by providing an approximated indication of the limit resolution to be used in this kind of analysis. During non-vegetative periods, part of Class 6 areas is instead classified with a different label, thus explaining its absence in the results achieved.

Conclusions
The experiments performed in this work showed an underlying relationship between the land cover composition and the air temperature for the chosen study area. The analysis of the average temperature patterns provided also a metric to quantify this relationship in terms of observed temperature offsets between different LCZs.
Data availability was commensurate to the expectation of the analysis. The combined use of open data with a worldwide coverage, such as Landsat 8 and Sentinel-2 imagery, with traditionally available datasets such as air temperature observations, makes it possible to replicate the analysis in almost all major cities on the Earth. Nevertheless, it must be said that an enormous amount of high-accuracy, high-frequency and high-resolution environmental data which is potentially key for studying, understanding and solving climate-related issues is only managed by private companies, with no or limited access to both the research and public community. At least at the European level, some efforts are however underway towards the preparation of policies aiming to promote and facilitate the open release of private data [60]. The use of FOSS technology was an asset for both data processing and analysis. This proves the maturity, completeness, and reliability of such technology for geo-data management [61], since flexibility and customization (e.g., in data modeling) are combined with the sustainability in terms of technology costs.
Departing from the urban environment, the establishment of good practices in environmental policy-making by leveraging the use of available geo-data represents the ultimate goal of the presented work. The proposed analysis proved to be promising for all those applications requiring insights into the human impact on relevant climate-related variables such as the air temperature. These include urban planning and design, land policies impact assessment, and urban climate monitoring, among others.