A New Approach for Understanding Urban Microclimate by Integrating Complementary Predictors at Different Scales in Regression and Machine Learning Models

Climate change is a major contemporary phenomenon with multiple consequences. In urban areas, it exacerbates the urban heat island phenomenon. It impacts the health of the inhabitants and the sensation of thermal discomfort felt in urban areas. Thus, it is necessary to estimate as well as possible the air temperature at any point of a territory, in particular in view of the ongoing rationalization of the network of fixed meteorological stations of Météo-France. Understanding the air temperature is increasingly in demand to input quantitative models related to a wide range of fields, such as hydrology, ecology, or climate change studies. This study thus proposes to model air temperature, measured during four mobile campaigns carried out during the summer months, between 2016 and 2019, in Lyon (France), in clear sky weather, using regression models based on 33 explanatory variables from traditionally used data, data from remote sensing by LiDAR (Light Detection and Ranging), or Landsat 8 satellite acquisition. Three types of statistical regression were experimented: partial least square regression, multiple linear regression, and a machine learning method, the random forest regression. For example, for the day of 30 August 2016, multiple linear regression explained 89% of the variance for the study days, with a root mean square error (RMSE) of only 0.23 °C. Variables such as surface temperature, Normalized Difference Vegetation Index (NDVI), and Modified Normalized Difference Water Index (MNDWI) have a strong impact on the estimation model. This study contributes to the emergence of urban cooling systems. The solutions available vary. For example, they may include increasing the proportion of vegetation on the ground, facades, or roofs, increasing the number of basins and water bodies to promote urban cooling, choosing water-retaining materials, humidifying the pavement, increasing the number of public fountains and foggers, or creating shade with stretched canvas.


Introduction
Climate change is a major current phenomenon with multiple environmental, social and economic consequences [1]. In urban areas, it exacerbates the urban heat island (UHI) phenomenon [2,3] which is characterized by a difference in temperature between an urban area and the surrounding rural areas. In this case, the temperature in urban areas is higher than in rural areas, particularly at night [4,5]. The factors that contribute to heat intensification and UHI can be explicated mainly by the surface factors linked to the substitution of water surfaces, vegetation cover, and wetlands by artificial areas, causing low evaporation and evapotranspiration [6][7][8][9][10][11][12]. Buildings made of low-albedo materials with high thermal inertia capture, stock, and discharge the heat trapped with a thermal lag of several

Lyon: A Study Area Characterized by a Considerable Urban Morphological Diversity
The area of interest chosen for this study is the urban heart of the city of Lyon and part of the city of Villeurbanne, on the border with the 6 th district of Lyon (Figure 1). This area has the advantage of grouping together a significant diversity of land use in an urban environment. It is mainly occupied by continuous urban fabric (50%) of which 12.3% is discontinuous dense urban fabric, as well as by industrial, commercial, military, or public units (19.5%). Water, roads (main and secondary), and vegetation cover also occupy a significant surface of the territory, with respectively, 7.3%, 14.3%, and 8.9% (Table 1).   With just over 1.4 million inhabitants, this agglomeration of 59 municipalities is the second largest in France after Paris. The study area is composed of a very dense urban environment (Figure 1). Natural vegetation is therefore absent. There is, however, a very large park of 117 ha and urban green areas. The main park in Lyon (the Tête d'Or Park, to the north in Figure 1) is the largest urban park in France. It has vast expanses of lawn shaded by tall trees of various species, a lake, an island, and several botanical gardens, including an alpine garden and a flower garden.
This study area is located in the south-east of France (45 • 45 35"N, 4 • 50 32"E). According to the Köppen-Geiger classification [79,80], it has a continental temperate climate, fully humid with hot or warm summer, depending on the year (  With just over 1.4 million inhabitants, this agglomeration of 59 municipalities is the second largest in France after Paris. The study area is composed of a very dense urban environment ( Figure  1). Natural vegetation is therefore absent. There is, however, a very large park of 117 ha and urban green areas. The main park in Lyon (the Tête d'Or Park, to the north in Figure 1) is the largest urban park in France. It has vast expanses of lawn shaded by tall trees of various species, a lake, an island, and several botanical gardens, including an alpine garden and a flower garden.
This study area is located in the south-east of France (45° 45′35″N, 4° 50′32″E). According to the Köppen-Geiger classification [79,80], it has a continental temperate climate, fully humid with hot or warm summer, depending on the year (

Data Acquired by the Measuring Instruments and Selected Days
Air temperature is the variable to be estimated at any point in the territory from several indicators. The training sample of this variable is obtained from mobile measurement transects using high-precision measuring devices, according to manufacturer's data. The first equipment used is the EL-USB-1-RGC (EasyLog From Lascar Electronic). It measures the air temperature continuously, with an accuracy of +/-1 °C (manufacturer's data) and a minimum recording interval of 1 second. The second equipment, the LOG 32 (from Dostmann electronic GmbH), records relative humidity and air

Data Acquired by the Measuring Instruments and Selected Days
Air temperature is the variable to be estimated at any point in the territory from several indicators. The training sample of this variable is obtained from mobile measurement transects using high-precision measuring devices, according to manufacturer's data. The first equipment used is the EL-USB-1-RGC (EasyLog From Lascar Electronic). It measures the air temperature continuously, with an accuracy of +/− 1 • C (manufacturer's data) and a minimum recording interval of 1 second. The second equipment, the LOG 32 (from Dostmann electronic GmbH), records relative humidity and air temperature, with an Remote Sens. 2020, 12, 2434 6 of 35 accuracy of +/− 0.5 • C (manufacturer's data) and +/− 3% (40 to 60%) and a minimum recording interval of 2 seconds. The measurement campaigns were associated with a precision GPS (from Garmin, with a high sensitivity GPS/GLONASS receiver and Quad Helix antenna) to record the geographical position of the measurement.
The location of the points of all measurement campaigns was checked and corrected if necessary, using a geographic information system (GIS), for example, to ensure that it was not located on the roof of a building. Indeed, the dense urban environment can interfere with the geolocation of the position. Streets in dense urban centers may be boxed in, with little visible sky.
In addition, the Météo-France site of the Direction Centre-Est (DIRCE) of Lyon-Bron, (45 • 43 30"N, 4 • 56 12"E and 197 m altitude), was used as a study site for a quality control campaign of the air temperature and relative humidity measuring instruments. This station was chosen because it is Météo-France's professional weather station closest to our measurement campaigns. Hourly measurements synchronous to the measurements of the Météo-France station were carried out from 28 June 2018 at 09:00 to 24 September 2018 at 14:00 ( Figure 3). The comparison between site observation and mobile measurements have been done at the same time, on a single second during the exact precise hour.
Remote Sens. 2020, 12, x FOR PEER REVIEW 6 of 39 temperature, with an accuracy of +/-0.5 °C (manufacturer's data) and +/-3% (40 to 60%) and a minimum recording interval of 2 seconds. The measurement campaigns were associated with a precision GPS (from Garmin, with a high sensitivity GPS/GLONASS receiver and Quad Helix antenna) to record the geographical position of the measurement. The location of the points of all measurement campaigns was checked and corrected if necessary, using a geographic information system (GIS), for example, to ensure that it was not located on the roof of a building. Indeed, the dense urban environment can interfere with the geolocation of the position. Streets in dense urban centers may be boxed in, with little visible sky.
In addition, the Météo-France site of the Direction Centre-Est (DIRCE) of Lyon-Bron, (45°43'30 "N, 4°56'12 "E and 197 m altitude), was used as a study site for a quality control campaign of the air temperature and relative humidity measuring instruments. This station was chosen because it is Météo-France's professional weather station closest to our measurement campaigns. Hourly measurements synchronous to the measurements of the Météo-France station were carried out from 28 June 2018 at 09:00 to 24 September 2018 at 14:00 ( Figure 3). The comparison between site observation and mobile measurements have been done at the same time, on a single second during the exact precise hour. The measuring devices used in this study proved to be highly accurate since, after this comparison at the Météo-France site over the summer 2018 period, the air temperature correlation of these two different acquisition sources shows a lowest correlation coefficient of 0.981 for the air temperature and 0.977 for the relative humidity measured by LOG 32 (Table 2). The measuring devices used in this study proved to be highly accurate since, after this comparison at the Météo-France site over the summer 2018 period, the air temperature correlation of these two different acquisition sources shows a lowest correlation coefficient of 0.981 for the air temperature and 0.977 for the relative humidity measured by LOG 32 (Table 2). The mobile measurements were taken on days when the Landsat 8 satellite was over the city, on clear sky days only, i.e., with less than 10% cloud cover. These campaigns were spread out between 2016 and 2018, exclusively over the summer period: 30 August 2016, 1 August 2017, 19 July 2018, and 22 July 2019. Numerous measurement campaigns were carried out from 2016 to 2019 but only those four days with similar weather conditions were used in this study. However, not all of them had similar weather conditions. Moreover, in only one summer of a year, the set of days was too poor regarding the different cumulative criteria, i.e., similar weather conditions and no clouds. Indeed, the weather conditions for each day were similar: the standard deviation of the temperature was only 0.9 • C and 4.3%, for humidity, 2.3 m.s-1 for wind speed, 132 degrees for wind direction, and 3.7 hPa for pressure. The average weather conditions for these indicators are 29.3 • C, 45.3%, 8.8 m.s-1, 260.8 degrees, and 1016.6 hPa, respectively (Table 3). Respectively, these measurement campaigns yielded 573, 300, 393, and 397 measurement points for air temperature and relative humidity ( Figure 4). The routes travelled during the measurement campaigns vary slightly ( Figure 4). In fact, besides the technical reasons such as works or new developments that caused us to deviate from the route, we wanted to maximize the morphological diversities crossed, making deviations to places of particular interest due to their urbanistic characteristics (docks, the historic urban center, industrial sectors, etc.).
Additionally, air temperature measurement campaigns sometimes last several tens of minutes. It was therefore necessary to make a correction based on a polynomial equation elaborated according to the evolution of the day's temperatures recorded at a time step of 10 minutes. This phase before the data processing is essential and allows to bring all these air temperature measurements back to the hottest hour of the day [74].
In addition, in order to have a very complete sample of temperature measurements, all the data from the four field trips were pooled. This allows obtaining global results. Indeed, even if the weather conditions are similar for the four days studied, some results may differ due to the different routes carried out, which cross different urban morphologies. For example, in Lyon, the type of buildings and the urban morphology are relatively different depending on whether one is in the east, with modern buildings from the end of the 19th and 20th centuries, or in the west of the main river, with very old buildings from the medieval or Renaissance period (Figures 1 and 4). This could explain why the results between the days of 2019 and 2018, for example, were not strictly identical, although general trends may emerge.

Morphological Descriptors Relevant to Air Temperature Estimation
Changes in land-use patterns related to the urban factory contribute to the spatial structuring of the urban landscape, which also influence energy transmission and balance [81,82]. These changes are considered a direct cause of the formation of the UHIs [83,84]. Thus, the relationship of changes in air temperature to land use and land cover is apparent.
The routes travelled during the measurement campaigns vary slightly ( Figure 4). In fact, besides the technical reasons such as works or new developments that caused us to deviate from the route, we wanted to maximize the morphological diversities crossed, making deviations to places of particular interest due to their urbanistic characteristics (docks, the historic urban center, industrial sectors, etc.). In this study, thirty-eight explanatory variables contributed to the estimation of air temperature over the study area [85][86][87][88][89]. They belong to various categories such as climate data from remote sensing, topographic variables, vegetation indices, the presence of water, moisture, bare soil, buildings, radiation, urban morphology, and proximity to various land uses (Table 4 and Appendix A). The acquisition Remote Sens. 2020, 12, 2434 9 of 35 sources were multiple and came from the Landsat 8 satellite (https://earthexplorer.usgs.gov/), LiDAR (https://data.grandlyon.com/jeux-de-donnees/nuage-points-lidar-2018-metropole-lyon-formatlaz/donnees) points and other cartographic products downloaded from the open data platform of the Greater Lyon.  These morphological descriptors are acquired to a spatial precision that can go down to the centimeter scale. As a result, the information collected is dense and allows us to acquire a rigorous state of the urban environment for the purpose of modelling air temperature. These morphological descriptors are acquired to a spatial precision that can go down to the centimeter scale. As a result, the information collected is dense and allows us to acquire a rigorous state of the urban environment for the purpose of modelling air temperature.

An Explanatory Buffer Zone, Which Varies According to the Indicator
The aim of this study is to model air temperature using the linear regressions, multiple and partial least square regressions, and nonlinear regression by the random forest regression, from selected predictors. Initially, the scale with the best correlation between air temperature and explanatory variables was selected for each indicator based on a proximity buffer analysis (5 to 1000 m; Figure 5). Thus, the selected buffer zone varies for indicators of the presence of vegetation, water, humidity, bare soil and buildings, radiation indices, proximity to land use, urban morphology, and finally climate data ( Table 5). Each of the measuring points was compared with the average of the indicator concerned, according to the size of the buffer considered.
For example, the process followed for the 5-meter buffer is as follows: 1 • /creation of a 5-meter buffer around each point; 2 • /calculation of the area (for vegetation, water surfaces, etc.), length (railways), average (spectral indices), or standard deviation (STD Building Height) of the indicator in this buffer; 3 • /calculation of the Spearman correlation coefficient between the temperature measured at the point and the indicator; and 4 • /repeat the operation for all the indicators and for all the buffers. Buffer (meters) 500m 1000m 1500m Figure 5. Example of variation in the correlation (coefficient of determination) between predictor and air temperature as a function of study scale.

Three Complementary Regression Methods in Modelling Use
Three regression methods of air temperature modelling are compared in this study. These are two linear regressions: multiple [42,50,56] and partial least square [90], and one non-linear regression: random forest [91,92]. The aim is to select the best regression for this modelling. This evaluation is essentially carried out by comparing the coefficients of determination and the roots of the root mean square error (RMSE) obtained for the samples. The conditions of use for each of the regressions were also verified.
Multiple linear regression (MLR) is a data modelling method that requires several statistical steps before its application [93]. First, it is necessary to verify the normal distribution of the series in the dataset using the Shapiro-Wilk test (applies to samples of less than 5000 observations) [94]. This test has been invalidated, so the Spearman correlation matrix was used. It allows redundant variables not to be included in the regression model. One of the two indicators for which the pair has |r|>0.7 in the Spearman correlation matrix and a Variance Inflation Factor (VIF) > 5 was removed [95,96]. Finally, after removing the correlated variables, multiple linear regressions are carried out on about 20 variables, between 21 for the August 30 th , 2016, and 27 for the August 1 st , 2017 (Table 6).In addition, a holdout cross-validation was performed because of its ability to detect multiple regression overfitting (80% learning data and 20% validation data) [97].

Three Complementary Regression Methods in Modelling Use
Three regression methods of air temperature modelling are compared in this study. These are two linear regressions: multiple [42,50,56] and partial least square [90], and one non-linear regression: random forest [91,92]. The aim is to select the best regression for this modelling. This evaluation is essentially carried out by comparing the coefficients of determination and the roots of the root mean square error (RMSE) obtained for the samples. The conditions of use for each of the regressions were also verified.
Multiple linear regression (MLR) is a data modelling method that requires several statistical steps before its application [93]. First, it is necessary to verify the normal distribution of the series in the dataset using the Shapiro-Wilk test (applies to samples of less than 5000 observations) [94]. This test has been invalidated, so the Spearman correlation matrix was used. It allows redundant variables not to be included in the regression model. One of the two indicators for which the pair has |r|>0.7 in the Spearman correlation matrix and a Variance Inflation Factor (VIF) > 5 was removed [95,96]. Finally, after removing the correlated variables, multiple linear regressions are carried out on about 20 variables, between 21 for the 30 August 2016, and 27 for the 1 August 2017 (Table 6).In addition, a holdout cross-validation was performed because of its ability to detect multiple regression overfitting (80% learning data and 20% validation data) [97].
In a complementary way to the multiple linear regression (MLR), partial least square regression is a method that is applied when a large number of explanatory variables are present and when these variables are likely to show strong collinearities among themselves [98]. Thus, this method allows us to model and predict air temperature values as a function of a linear combination of several quantitative (or qualitative) explanatory variables, overcoming the constraints of linear regression with respect to the distribution and number of variables included. Therefore, there is no need to remove the collinear variables. The model gives a value for each Variable Importance for the Projection (VIP). An explanatory variable is considered important when the VIP is greater than 0.8 [99]. A standardized coefficient is then generated for each of them [100].  The third type of regression tested is the random forest regression. This is a predictive model using binary decision trees [101]. From an observation sample, the bagging method will generate several possibilities before selecting only one. This machine learning technique [102] is based on Classification and Regression Trees (CART). These are constructed from different bootstrap samples, randomly selected with random discounting, in order to obtain, after aggregation, a robust and efficient set of air temperature predictors [103]. The importance of each variable is calculated by the mean increase in error of a tree in the forest, i.e., when the values of each variable are randomly swapped in the out-of-bag (OOB) samples. The variables used in the nonlinear regression, random forest, to model air temperature are derived from the selection of multiple linear regressions for each day. The random t forest classification and regression has the advantage of reducing white noise, and thus potentially improving the correlation coefficients and RMSE already obtained by multiple linear regression. In addition, the number of variables in the bagging and the number of trees used are user-defined parameters. When the number of trees increases, the general error converges to the same value. Overfitting is then not a problem due to the large numbers law. Despite this, the number of analysed trees must be limited in order not to excessively increase the computation time (1): where c is a constant, T is the number of trees in the set, M is the number of variables, and N is the number of samples in the training data set [104]. In this work, the classifiers were optimized with 80 decision trees and were trained with the same number of pixels in each category. The general error of the models converged around 80 decision trees ( Figure 6). Therefore, a more complex model would have required more computation time without improving the classification.
Remote Sens. 2020, 12, x FOR PEER REVIEW 9 of 39 Figure 6. Convergence of the general error of the models for each study days.
In addition, Lasso regression was not applicable in this study. Lasso regression is only used when the number of predictors is greater than the number of observations [105,106]. Here, though, the number of observations was much higher than the number of predictors. In addition, many explanatory variables were included.

Quality Control on Modeling by Spatial Identification of Error Clusters
The spatial autocorrelation of the difference between the modelled air temperature and the air temperature measured by the mobile measurements was analyzed, on one hand, using the Anselin Local Moran I spatial association indicator (LISA) [107], and, on the other hand, using the degree of clustering of high and low intensity values by the Getis Ord General G (Gi*) [108,109].
The LISA makes it possible to group together, for statistically significant results (p < 0.05), the similarity of a spatial unit with its neighbours. It allows identifying spatial aggregates of entities with In addition, Lasso regression was not applicable in this study. Lasso regression is only used when the number of predictors is greater than the number of observations [105,106]. Here, though, the number of observations was much higher than the number of predictors. In addition, many explanatory variables were included.

Quality Control on Modeling by Spatial Identification of Error Clusters
The spatial autocorrelation of the difference between the modelled air temperature and the air temperature measured by the mobile measurements was analyzed, on one hand, using the Anselin Local Moran I spatial association indicator (LISA) [107], and, on the other hand, using the degree of clustering of high and low intensity values by the Getis Ord General G (Gi*) [108,109].
The LISA makes it possible to group together, for statistically significant results (p < 0.05), the similarity of a spatial unit with its neighbours. It allows identifying spatial aggregates of entities with high or low values as well as spatial outliers. A cartographic representation showing a cluster type for each statistically significant entity is thus obtained. With a geographic information system (GIS), a statistically significant group of high values (HH), a group of low values (LL), an outlier in which a high value is surrounded mainly by low values (HL), and an outlier in which a low value is surrounded mainly by high values (LH) is distinguished.
The local application of the general G statistic is the Getis Ord Gi* statistic. It is used to identify statistically significant (p < 0.05) spatial clusters of high and low intensity. Thus, for positive Z scores, the higher the Z score, the stronger the cluster of high intensity values (error overestimating air temperature). On the contrary, the lower the negative Z-score, the higher the group of low intensity values (error underestimating the air temperature).
In order to summarize the methodology, a general diagram of the study has been inserted ( Figure 7).
Remote Sens. 2020, 12, x FOR PEER REVIEW 10 of 39 temperature). On the contrary, the lower the negative Z-score, the higher the group of low intensity values (error underestimating the air temperature).
In order to summarize the methodology, a general diagram of the study has been inserted ( Figure 7).

Multiple Linear Regression Modeling
After removing the collinear variables for each day, the predictors involved in air temperature

Multiple Linear Regression Modeling
After removing the collinear variables for each day, the predictors involved in air temperature modelling provide significant coefficients of determination. These ranged from 0.60 for 22 July 2019, to 0.89 for 30 August 2016, with RMSEs of only 0.96 • C and 0.23 • C, respectively. Moreover, each variable retained in the model was characterized by a normalized coefficient that corresponded to the weight of this explanatory variable. This weight varies according to the study days (Figure 8). The probability associated with Fisher's F (Pr>F) was always less than 0.05 and very often less than 0.0001.
Remote Sens. 2020, 12, x FOR PEER REVIEW 11 of 39 The probability associated with Fisher's F (Pr>F) was always less than 0.05 and very often less than 0.0001. For example, for July 19 th , 2018, the variables contributing to a positive impact on the model were proximity to subways, Bare Soil Index (BI), longitude, Tasseled Cap Transformation (TCT) wetness, low vegetation density, and Normalized Difference Bareness Index (NDBaI). Variables negatively impacting the model are sky view factor, high vegetation density, proximity to tourist attractions, and soil density. From the equation obtained, it is therefore possible to model the air temperature continuously. The resolution can be adapted to the display and the purpose of the study. For example, a resolution of 10 meters was chosen for Figure 9 (Figure 9). It can be seen in Figure 9 that some areas are cooler or hotter than others on the map. This is directly related to the equation used in the modelling, including the explanatory variables included, as shown in Figure 8. Thus, for example, in Figure 9, cold spots at some locations are related to the density of tall vegetation or water surface. Hot spots, in contrast, would be related to low vegetation density, soil density, or BI. Thus, the greater the presence of these variables, the greater the chance of detecting a hot or cold spot. This confirms the results of previous studies showing in particular the cooling power of tall vegetation [43,110], water surfaces, or urban density [33,111,112]. For example, for 19 July 2018, the variables contributing to a positive impact on the model were proximity to subways, Bare Soil Index (BI), longitude, Tasseled Cap Transformation (TCT) wetness, low vegetation density, and Normalized Difference Bareness Index (NDBaI). Variables negatively impacting the model are sky view factor, high vegetation density, proximity to tourist attractions, and soil density. From the equation obtained, it is therefore possible to model the air temperature continuously. The resolution can be adapted to the display and the purpose of the study. For example, a resolution of 10 meters was chosen for Figure 9 (Figure 9). It can be seen in Figure 9 that some areas are cooler or hotter than others on the map. This is directly related to the equation used in the modelling, including the explanatory variables included, as shown in Figure 8. Thus, for example, in Figure 9, cold spots at some locations are related to the density of tall vegetation or water surface. Hot spots, in contrast, would be related to low vegetation density, soil density, or BI. Thus, the greater the presence of these variables, the greater the chance of detecting a hot or cold spot. This confirms the results of previous studies showing in particular the cooling power of tall vegetation [43,110], water surfaces, or urban density [33,111,112]. Remote Sens. 2020, 12, x FOR PEER REVIEW 12 of 39  The results for the sample with all measurements for the four outputs ( Figure 10) show an R 2 of 0.65 and an RMSE of 1.54. The results for the sample with all measurements for the four outputs ( Figure 10) show an R 2 of 0.65 and an RMSE of 1.54. The RMSE is logically slightly higher than for the single day models due to the larger sample of measurements and greater morphological diversity, even though the weather conditions remain similar. These results confirm the general trends observed at day scales. In particular, the cooling effect of variables such as water density (normalized coefficient of -0.35; Figure 10), densely vegetated areas (-0.11 for NDVI and -0.09 for the density of high vegetation), road embankment (-0.08 for SVF), and humidity (-0.05 for Modified Normalized Difference Water Index (MNDWI)) can be found. The presence of proximity to tourist areas can be explained by the fact that these areas are mostly made up of green spaces or historic buildings in old Lyon.
The results for the sample with all measurements for the four outputs ( Figure 10) show an R² of 0.65 and an RMSE of 1.54. The results for the sample with all measurements for the four outputs ( Figure 10) show an R² of 0.65 and an RMSE of 1.54. The RMSE is logically slightly higher than for the single day models due to the larger sample of measurements and greater morphological diversity, even though the weather conditions remain similar. These results confirm the general trends observed at day scales. In particular, the cooling effect of variables such as water density (normalized coefficient of -0.35; Figure 10), densely vegetated areas (-0.11 for NDVI and -0.09 for the density of high vegetation), road embankment (-0.08 for SVF), and humidity (-0.05 for Modified Normalized Difference Water Index (MNDWI)) can be found. The presence of proximity to tourist areas can be explained by the fact that these areas are mostly made up of green spaces or historic buildings in old Lyon. The variables contributing to urban warming were logically surface temperature and emissivity (normalized coefficients of 0.53 and 0.26, respectively; Figure 10) as well as indicators of the built environment and the absence of medium or high vegetation density (0.13 for the NDBaI, 0.12 for the Index-based Built-Up Index (IBI), and 0.23 for the density of low vegetation).

Partial Least Square Regression Modeling
Partial least square (PLS) regression modeling did not show much consistency in air temperature prediction since the mean coefficient of determination for all four study days is only 0.62, with a maximum of 0.79 for August 30 th , 2016, and a minimum of 0.53 for July 22 nd , 2019 (Table 7). In addition, a large number of explanatory variables were retained, with a maximum of 26 for the day of 22 nd July, 2019. Some variables influenced the model both positively and negatively as a function of the day. For example, the MNDWI had a significant positive impact for August 30, 2016, and a negative impact for August 1 st , 2017, July 19 th , 2018, and July 22 nd , 2019. As a result, the air temperature modeling results were much less relevant than by multiple linear regression.
Overall PLS modelling based on the measurements of the four outputs provided results relatively similar to multiple linear regression, with an R² equal to 0.699 and an RMSE of 1.503. They also confirmed the dominant role of surface temperature. This variable had a VIP of 2.2. This is followed by the density of water areas (VIP = 1.81), the density of low vegetation (1.43), the NDVI The variables contributing to urban warming were logically surface temperature and emissivity (normalized coefficients of 0.53 and 0.26, respectively; Figure 10) as well as indicators of the built environment and the absence of medium or high vegetation density (0.13 for the NDBaI, 0.12 for the Index-based Built-Up Index (IBI), and 0.23 for the density of low vegetation).

Partial Least Square Regression Modeling
Partial least square (PLS) regression modeling did not show much consistency in air temperature prediction since the mean coefficient of determination for all four study days is only 0.62, with a maximum of 0.79 for 30 August 2016, and a minimum of 0.53 for 22 July 2019 (Table 7). In addition, a large number of explanatory variables were retained, with a maximum of 26 for the day of 22 July 2019. Some variables influenced the model both positively and negatively as a function of the day. For example, the MNDWI had a significant positive impact for 30 August 2016, and a negative impact for 1 August 2017, 19 July 2018, and 22 July 2019. As a result, the air temperature modeling results were much less relevant than by multiple linear regression.
Overall PLS modelling based on the measurements of the four outputs provided results relatively similar to multiple linear regression, with an R 2 equal to 0.699 and an RMSE of 1.503. They also confirmed the dominant role of surface temperature. This variable had a VIP of 2.2. This is followed by the density of water areas (VIP = 1.81), the density of low vegetation (1.43), the NDVI (1.15), and the humidity indices (1.03 for the MNDWI and 1.01 for the TCT Wetness). These results are in agreement with those obtained through multiple linear regression (Section 3.1).

Random Forest Regression Modeling
For the four study days, the coefficients of determination obtained were strong: 0.98 for the 30 August 2016, 0.96 for the 1 August 2017, 0.95 for the 19 July 2018, and 0.92 for the 22 July 2019 (Table 8). Thus, on average, a coefficient of determination of 0.95 was obtained, with a RMSE of only 0.17 • C and an out-of-bag (OOB) error of 0.05. In addition, the measure of importance for each of the variables was measured by the mean increase in error of a tree in the forest when the observed values of that variable were randomly swapped in the out-of-bag samples (OOB; Figure 11). As a reminder, an increase in errors allowed us to know the importance of the variable in the modeling.
Global random forest modelling based on all days highlighted the dominant role of surface temperature, which had a mean error increase of 102.5 ( Figure 12). This was followed by emissivity, density of low vegetation, IBI, and density of high vegetation with mean error increases of 26.9, 23.5, 16.2, and 16.2, respectively ( Figure 12). These results are in agreement with those obtained using multiple linear regression and PLS regression. Remote Sens. 2020, 12, x FOR PEER REVIEW 15 of 39 Figure 11. Evolution of the importance of the variables selected in random forest classification and regression modelling for the four study dates.
Global random forest modelling based on all days highlighted the dominant role of surface temperature, which had a mean error increase of 102.5 ( Figure 12). This was followed by emissivity, density of low vegetation, IBI, and density of high vegetation with mean error increases of 26.9, 23.5, 16.2, and 16.2, respectively ( Figure 12). These results are in agreement with those obtained using multiple linear regression and PLS regression.

Implication of Important Predictors in Urban Air Temperature Modeling
The results of the simple regressions (Section 2.4.1 and Figure 4), the multiple linear regression

Implication of Important Predictors in Urban Air Temperature Modeling
The results of the simple regressions (Section 2.4.1 and Figure 4), the multiple linear regression (Section 3.1), the Random forest regression (Section 3.3), and to a lesser extent the PLS regression (Section 3.2) make it possible to identify the parameters that positively and negatively influence urban air temperature. Naturally, surface temperature is frequently used in air temperature modelling ( Figure 11, for example). This is confirmed by the overall results of multiple linear regression and random forest modelling. Its normalized coefficient is 0.53 and its mean error increase is 102.5. However, this is not an urban morphological descriptor on which designers, urban planners, or politicians can directly act. The urban parameters highlighted by the models that can be influenced in planning operations to combat extreme temperatures in cities mainly concern green and blue solutions, and grey solutions [113][114][115].
Modelling results indicate that the factors that contribute to increasing temperatures in urban areas are related to building density. Indeed, regarding the density of buildings, for example, the BI had a normalized coefficient of 0. respectively. This was confirmed by the overall results of multiple linear regression and random forest modelling. Its normalized coefficient was 0.23 and its mean error increase is 23.5 (Figures 10 and 12).
By contrast, the factors that favor the decrease in urban temperatures were related to the presence of vegetation, humidity, and surface water. Thus, high vegetation density had a high cooling power in the models with normalized coefficients of −0.6 for 22 July 2019 and −0.15 for 19 July 2018, and a mean error increase of 19 for 22 July 2019. These results are consistent with those of the global modelling using multiple linear regression and random forest. Its normalized coefficient is −0.09 and its mean error increase is 16.2 (Figures 10 and 12).
In addition, NDVI was found with normalized coefficients of −0.5, −0.18, and −0.11 and mean error increases of 20, 8, and 7.6 for the days of 2016, 2017, and globally, respectively, but also TCT greenness (−0.14 for 22 July 2019). Moisture also had a cooling effect through the TCT wetness (with normalized coefficients of −0. To our knowledge, this is one of the studies aimed at modelling air temperature in a morphologically contrasting urban environment, with areas of unequally dense habitat, two rivers, a historic center and the largest urban park in France (Figure 1), which uses the most extensive sample of explanatory variables, with classic data, LiDAR data, and data from remote sensing. This study has shown the interest of the complementary use of the latter two types of data, in particular LiDAR for a precise view of vegetation densities (high, medium, or low) but also remote sensing for surface temperature and water, humidity, and vegetation indices to a lesser extent. While we expected very satisfactory results with random forest modelling, confirming the results of previous studies [63,97,116], we were surprised to note also the very high performance of the classical multiple linear regression and PLS regression, with very low RMSE and often below 0.5 • C.
These results confirm the roles played by vegetated spaces [43,110], building density, and water surfaces in previous studies [33,111,112] and confirm mitigation practices based on green and blue space solutions [113]. This study also highlighted the relatively low cooling power of low vegetation during the sunny afternoons of the measurement campaigns. Even if we had observed this during the measurement campaigns, this had yet to be confirmed by the models. This weakness can be explained by the density of vegetation, which is not high enough to promote sufficient evapotranspiration for cooling, but above all by a lack of shade compared to the tall vegetation.
Finally, the influence of buildings on air temperature is generally considered within a radius of 500 m [117][118][119][120]. However, in this study, it was found that the buffers with the best correlation between air temperature and building density, based on LiDAR data, were 5 and 10 m. Furthermore, the vegetation density obtained from LiDAR, which explains air temperature in an optimum way, was within a radius of 50 to 200 m, regardless of its height (low, medium, or high; Figure 5). This also corresponds to a smaller buffer size than that used in previous studies [46]. On the other hand, the buffer size for the bare-soil surfaces was between 50 m and 1000 m depending on the indicator (respectively density of bare-soil and BI, and NDBaI, and Enhanced Built-Up and Bareness Index (EBBI)). In this latter case, this size is similar to that of previous studies [46].
Consequently, this methodology based on mobile cycling measures, buffer analysis and regressions using complementary explanatory variables are fully applicable to other cities. However, we recommend testing the choice of scales for these variables, all the more so if it is not an old European city with morphological urban similarities to Lyon, although the optimal radii found in this study coincide with previous similar experiences in other cities.
In addition, it can be noted that similar spectral indices significantly correlate with the air temperature at the same scale because of the similar physical meaning represented by the indices. For example, vegetation indices such as TCT greenness, NDVI, and Soil Adjusted Vegetation Index (SAVI) are most relevant between 500 and 1000 m, as are water indices (NDWI or MNDWI), building indices (NDBI and Urban Index UI) and bare soil indices (NDBaI and TCT Brightness).

Spatialization of Error
The modelling error found is minimal for multiple linear regression and random forest modelling. For all study days, the median for multiple linear regression modelling was 0.02 • C and for random forest classification and regression modelling was 0.002 • C (σ of 0.44 and 0.17, respectively; Table 9). In contrast to the closeness of the median and mean values by these two modelling methods, the agreement is stronger for multiple linear regression than for random classification and regression forest ( Figure 13). When looking at the location of errors in air temperature modelling between the multiple linear regression method and the random forest, similarities between the two are observed. The models overestimate the air temperatures towards the water areas on Confluence (south of the peninsula) and near the Perrache train station. They underestimate the air temperature in the streets in the embankment, near green areas, and south of the left bank of the Rhône (Figure 14).  When looking at the location of errors in air temperature modelling between the multiple linear regression method and the random forest, similarities between the two are observed. The models overestimate the air temperatures towards the water areas on Confluence (south of the peninsula) and near the Perrache train station. They underestimate the air temperature in the streets in the embankment, near green areas, and south of the left bank of the Rhône (Figure 14). In contrast, the model suggests that the streets in the embankments are cooler than in the mobile in situ measurements ( Figure 15). The same can be seen in the random forest modelling (Figure 16). In addition, an overestimation of air temperature near the green spaces of the In contrast, the model suggests that the streets in the embankments are cooler than in the mobile in situ measurements ( Figure 15). The same can be seen in the random forest modelling (Figure 16). In addition, an overestimation of air temperature near the green spaces of the Tête d'

Grouping of Similar Errors
Spatial groupings of statistically similar values of the differences between modelled and measured air temperatures are evaluated using LISA ( Figure 17) and Gi* (Figure 18). Between the two regression methods (linear multiple and random forest), similarities in the location of error clustering types by LISA and Gi* are observed. As a reminder, for LISA, the distinction is made between a statistically significant cluster consisting of high values only (HH), a cluster of low values only (LL), a cluster in which a high value is surrounded mainly by low values (HL), and a cluster in which a low value is surrounded mainly by high values (LH).
Firstly, using the LISA method, clusters of small errors (LH), underestimation of the model in relation to the measured values, can be identified on the left bank of the Rhône, in the steep streets of the peninsula and Vieux Lyon district, and on bridges. Areas with a high value (HL), i.e., an overestimation of the model, can be observed near the Perrache train station, in the Confluence area, and near the green spaces of the Tête d'Or park ( Figure 17).
Secondly, groupings of the errors of underestimation and overestimation of the air temperature modelling compared to that measured by the Gi* method are located in areas similar to the LISA. These recurring areas for statistically low negative z-score values are the steep streets of the peninsula and Old Lyon and the south of the left bank of the Rhône. The statistically high positive z-score values are the Perrache train station and Confluence district, the proximity to the green spaces of the Tête d'Or park and the Morand bridge ( Figure 18).

Limits and Future Research Outlooks
When looking at the positive or negative effects of variables on air temperature, it can be noticed that some can vary depending on the day being studied. For example, on July, 22nd 2019, the density of ground affects the air temperature positively, but on July, 19th 2018, it had a negative impact. This Firstly, using the LISA method, clusters of small errors (LH), underestimation of the model in relation to the measured values, can be identified on the left bank of the Rhône, in the steep streets of the peninsula and Vieux Lyon district, and on bridges. Areas with a high value (HL), i.e., an overestimation of the model, can be observed near the Perrache train station, in the Confluence area, and near the green spaces of the Tête d'Or park ( Figure 17).
Secondly, groupings of the errors of underestimation and overestimation of the air temperature modelling compared to that measured by the Gi* method are located in areas similar to the LISA. These recurring areas for statistically low negative z-score values are the steep streets of the peninsula and Old Lyon and the south of the left bank of the Rhône. The statistically high positive z-score values are the Perrache train station and Confluence district, the proximity to the green spaces of the Tête d'Or park and the Morand bridge ( Figure 18).

Limits and Future Research Outlooks
When looking at the positive or negative effects of variables on air temperature, it can be noticed that some can vary depending on the day being studied. For example, on 22 July 2019, the density of ground affects the air temperature positively, but on 19 July 2018, it had a negative impact. This is probably related to the route that differs between the two rides. The 2018 route is almost twice as long as the 2019 route ( Figure 4) and the 2019 route passes through different neighborhoods, especially with regard to soil characteristics. This would indicate, among other reasons, why the results may differ depending on the days studied. In addition, the data provided by LiDAR concerning the ground is of a different nature, such as impermeable concrete or sandy soil for example, and may fluctuate depending on the routes taken. The same observation can be made for the proximity to metro stations. The proximity of the subway entrances is a variable that can affect air temperature in opposite ways. In our own experience, some subway entrances seem to give off fresh air and other entrances seem to give off warm air. When looking at the overall results of multiple linear regression modelling, it should be noted that these two variables are not included in the explanation of air temperature for these reasons.
The number of days processed for this study is one of its limitations. Indeed, only four days were analyzed. This limited number was partly due to the availability of quality (cloud-free) data from the Landsat satellite, but also due to the reduced occurrence of similar days in terms of climatic conditions. Another point of constraint is that modelling only took place in dense urban centers.
Consequently, we can argue on two perspectives: the spatial and temporal scope of this study. In the first case, it would be interesting to extend the mobile measurements in the periphery, or even in the rural areas, to be able to model the temperature in any point of the territory and compare the urban and the outskirts results. Secondly, it would be necessary to extend this analysis not only in summer, but in all seasons and at different moments of the day and at night, and for different weathers. Therefore, a global model could be built on observations from all the experimental dates rather than separating models by date.
In addition, some other data satellites may be used. For example, the use of the Sentinel 2 satellite with a 10 m resolution may help to increase the model results using sharper spectral indices, like NDVI or NDBaI.

Conclusions
The objective of this study was to identify the most appropriate and efficient regression to model urban air temperature based on numerous explanatory variables of various natures. The integration of these predictors in multiple regressions and machine learning method showed very satisfactory results. In addition, this methodology can be applied in other study area. The proportion of the variance explained by multiple linear regressions in air temperature modeling for each study day is globally high, with coefficients of determination ranging from 0.60 to 0.89. The results are even better when the random forest method is used. Indeed, the average coefficient of determination is 0.95 for a RMSE of For all these models, there are recurring dominant variables such as NDVI or surface temperature. Consequently, the integration of satellite predictors is a definite advantage in urban microclimate modelling by linear regression model based on mobile air temperature measurements. In this study, Landsat 8 data were used, but one prospect for improvement would be to use higher resolution Sentinel data.
When we look at the overall results for all days combined, the same trends emerge. The multiple linear regression always gives very satisfactory results with an R 2 of 0.65 and an RMSE of 1.54 • C, on a par with the PLS regression which shows an R 2 of 0.70 and an RMSE of 1.50 • C. The global random forest modelling based on all days, however, proposes superior results with a high R 2 of 0.98 and an RMSE of 0.33 • C. This modelling method is therefore the most efficient of the three tested for this study area and this sample of measurements. However, it is less accessible than the other types of multiple regressions tested and requires a greater statistical investment.
One of the strengths of this study is also the fact that it is relatively easily applicable to other areas. The equipment used for mobile measurements is not very expensive. All that is needed is a radiation shelter, a GPS, and a temperature and relative humidity recorder. All the explanatory variables used in this study, such as land use area or satellite data, are freely available. GIS and statistical processing can also be freely available if one wishes to dispense with paying software. From a practical point of view, the most complicated part of the study remains the mobile field measurements, which are very time-consuming. Indeed, they have to be synchronized with the passage of Landsat and it is necessary to have similar and favorable weather conditions, with a completely clear sky and no wind.
The results of this study confirmed the cooling roles played by green areas and water surfaces and the problems linked to building density without vegetation in the urban overheating issue. In addition, low vegetation displayed low cooling power, mainly because of an absence of shade compared to the high vegetation and the low-density vegetation providing little evapotranspiration. This highlights the real need to use green and blue spaces solutions in order to limit the UHI and improve the thermal comfort.
Author Contributions: Conceptualization, L.A. and F.R.; methodology, L.A. and F.R.; validation, L.A. and F.R.; formal analysis, L.A. and F.R.; writing-original draft preparation, L.A. and F.R.; writing-review and editing, L.A. and F.R.; supervision, L.A. and F.R.; and project administration, L.A. and F.R. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Acknowledgments:
The authors gratefully acknowledge the EROS USGS, the Lyon Metropolis and the other data platform for the useful data, free of charge. This work would not have been possible without them.  [97,116,139]