Modelling Long-Term Urban Temperatures with Less Training Data: A Comparative Study Using Neural Networks in the City of Madrid

In the last decades, urban climate researchers have highlighted the need for a reliable provision of meteorological data in the local urban context. Several efforts have been made in this direction using Artificial Neural Networks (ANN), demonstrating that they are an accurate alternative to numerical approaches when modelling large time series. However, existing approaches are varied, and it is unclear how much data are needed to train them. This study explores whether the need for training data can be reduced without overly compromising model accuracy, and if model reliability can be increased by selecting the UHI intensity as the main model output instead of air temperature. These two approaches were compared using a common ANN configuration and under different data availability scenarios. Results show that reducing the training dataset from 12 to 9 or even 6 months would still produce reliable results, particularly if the UHI intensity is used. The latter proved to be more effective than the temperature approach under most training scenarios, with an average RMSE improvement of 16.4% when using only 3 months of data. These findings have important implications for urban climate research as they can potentially reduce the duration and cost of field measurement campaigns.


Introduction
In the context of growing awareness of climate change, a good understanding of urban climate phenomena is key to mitigating and adapting to thermal extremes within urban environments [1,2]. Cities are not only one of the main contributors to the greenhouse effect [3], but also places where many inequalities and therefore potential vulnerabilities accumulate [4][5][6]. Moreover, recent studies, such as those developed by Grimm et al. [7] and Youngsteadt [8], suggest that cities could provide important insights into the socio-ecological dynamics of our near future at a global scale, thus increasing the interest in reliable urban climatic data and expanding its applications to many other disciplines.
However, obtaining reliable climatic data within urban areas is still a challenging task due to the complexity of the urban climate. Nowadays, some of the most important advances are concentrated in the modelling field [9]. Examples include studies evaluating the relationship between the urban climate and parameters such as the presence of water bodies [10] or the emission of anthropogenic heat [11,12]. Regarding the accuracy of these numerical models, recent advances coupling urban canopy models with mesoclimatic ones have also proved their overall reliability [13,14]. However, there are still some barriers that limit their application in other fields. For example, Computational Fluid [...]
A widespread alternative technique for obtaining reliable and affordable long-term datasets of urban air temperatures is the development of empirical models. These models use pre-existing statistical correlations among available data to generate accurate projections without compromising their computational efficiency. Consequently, these data-driven approaches represent bespoke alternatives to more complex numerical models.
Several algorithms can be used for this purpose. A widely used technique for modelling urban temperatures is Multiple Linear Regression (MLR), which has been tested for both temporal [52][53][54][55] and spatial predictions [56][57][58][59][60]. However, the increasing availability of machine learning and big data solutions is boosting the use of other algorithms which, although potentially harder to interpret, are likely to be more accurate. Popular machine learning techniques include Support Vector Machines [61][62][63][64], Random Forest [58,60,62,65,66], or Artificial Neural Networks (ANN).
ANN seem to stand as the most popular approach for modelling the hourly evolution of outdoor urban temperatures. To the authors' knowledge, Mihalakakou et al. [67] presented the first attempt to model the outdoor temperature at an urban site using ANNs. They used the dry-bulb temperature data available from two existing meteorological stations in Athens: one located within the city (the target), and one at the outskirts (the reference site). In a follow-up study, the model was adapted for other urban sites in the same city, where they deployed a network of 23 temperature sensors across the city for 2 years [68,69].
Sustainability 2021, 13, 8143
In these early attempts to model urban temperatures using ANNs, the authors only used the air temperature from the reference site as the input. However, other researchers have explored the inclusion of additional predictors to increase model performance. The most common ones are meteorological parameters linked with the UHI formation. Kim and Baik [70], for example, used the maximum UHI intensity of the previous day in Seoul together with wind speed, cloud cover, and relative humidity. In London, Kolokotroni et al. [71][72][73] used hourly air temperature, relative humidity, wind speed, cloud cover and global solar radiation. More recently, in Ontario, Demirezen et al. [74,75] used the air temperature, humidity, solar radiation, wind speed and wind direction. Other researchers have also included a time reference as an input to better capture the hourly evolution of urban temperatures. For example, Gobakis et al. [24] and Papantoniou and Kolokotsa [76] used the date in conjunction with air temperature and global solar radiation. Similarly, Heijden et al. [35] and Erdemir and Ayata [77] used the hour of the day together with other meteorological parameters. Table 1 summarizes these and other ANN studies that focused on outdoor urban temperatures and their modelling characteristics, such as the length of their datasets.
Table 1 notes: [...] [87]. b Output of the ANN model, as declared or shown by the authors. 1 Extends further from the limits of the city, covering the surrounding regional areas. 2 Includes other cities of the same country. 3 Year not specified.
In most of these studies, the modelling of outdoor urban air temperature time series is addressed from a common perspective: using the temperatures collected during a monitoring campaign at the urban level to train a Feed-forward Neural Network (FNN, a relatively simple type of ANN). This modelling is usually performed using data from one or several reference points, in many cases well-established meteorological observatories providing detailed and robust information on a wide range of parameters. Although this procedure is widespread, it is debatable whether other ANN topologies might be more suitable for this purpose. Cascade Neural Networks (CNN) and Elman Neural Networks (ENN) have also been widely applied [24,72,76], the latter being simplified versions of Recurrent Neural Networks (RNN). RNNs have proved to be very effective for making forecasts, especially when Long Short-Term Memory (LSTM) is used [88]. In that sense, the work of Han et al. [86] has recently demonstrated the superiority of RNNs over FNNs for predicting outdoor urban temperatures.
However, it should be noted that the aim of most of these studies is not to make time predictions or forecasts, but to model an urban time series from a pre-existing one. In other words, the purpose is to obtain an adapted version of a reference time series that already exists, so that the new time series is representative of a certain urban area and covers the exact same period as the reference data. This simplifies the process by eliminating the time dependence of the outputs, which justifies working with simpler neural networks, such as FNNs. In fact, under this modelling scenario, Kolokotroni et al. [72] did not find any improvement when comparing ENNs and CNNs with FNNs.
Although empirical models are site-specific (predictions are always made for a particular urban location), they can be used to extend the temporal coverage of urban monitoring campaigns, thus potentially increasing their utility for other disciplines. Although FNN-based models are not suitable for future projections, they are certainly useful for adapting historical records obtained outside the city to the reality of urban areas. However, there is currently a knowledge gap with regard to the amount of input data needed to accurately model urban temperature time series using FNNs. Collecting experimental data is very time-consuming and resource-intensive and, while it seems to be common practice to rely on one whole year of data for the training, there is no evidence that this should be a minimum requirement. This study, therefore, aims to quantify the degree to which the amount of input data needed to train FNNs can be reduced without sacrificing their accuracy. We also explore the use of the UHI intensity as an alternative output of the FNN models, instead of directly targeting the air temperature, to test the hypothesis that its lower seasonality and direct association with the input variables might help reduce the amount of data required for the training phase.
The present research is structured in three phases: first, we compared the performance of more than 5000 different FNN configurations for modelling the outdoor urban temperature (TEMP approach) and the UHI intensity (UHII approach) when trained with 12 months of data in the city of Madrid. An optimal configuration was then selected and analysed further in-depth for both approaches, including their sensitivity to input parameters. Finally, the amount of data provided during the training phase was reduced from the initial 12 months to 9, 6 and 3 months to evaluate the capacity of these models to continue producing accurate results with fewer input data.

Study Area: The City of Madrid
The present study focuses on the city of Madrid. Due to its size, location and climatic conditions, Madrid is characterised by a strong UHI, with nighttime UHI intensities up to 10 °C during calm and clear nights. During the last decades, this phenomenon has been intensively studied in the city by means of on-site measurements [89][90][91][92], remote sensing [93,94] and numerical models [95,96].
Between 2016 and 2019, a continuous monitoring campaign was carried out at 20 fixed urban sites with the aim to study the temporal patterns of the UHI in Madrid [97]. In the present study, we use part of that experimental data to define the outputs of our ANN models. More specifically, we use the hourly, dry-bulb temperature gathered at the city centre (Embajadores, see Figure 1), classified as compact midrise (LCZ 2) according to the Local Climate Zones (LCZ) scheme [98], and which registered the highest mean and nighttime UHI intensity. The data available for this study cover the period from July 2016 to September 2018 on an hourly basis (800 days or 19,200 h, in total). All sensors used in this monitoring campaign were protected from the rain and solar radiation using a custom-made, mechanically ventilated radiation shield. They were installed in the Urban Canopy Layer (UCL) at 5-6 m above the ground, following the guidelines of the World Meteorological Organization (WMO) for urban sites [99,100]. The location of each sensor was also studied in terms of its thermal source area [101]. In that sense, the representativeness of each sensor was appraised in terms of its surroundings' homogeneity [102,103].
Quality Control (QC) procedures were also applied, consisting of a plausible value check, a time consistency check, and an internal-consistency check [104]. This analysis was complemented by a spatial consistency check [105], which analysed whether the difference between a measurement and its surroundings was too large compared to the average. For the City Centre sensor, 126 records were flagged as suspect and just three as erroneous. 72 missing values were identified due to a recording failure between the 17th and the 20th of October 2017. Both erroneous and missing values were left blank in the analysed dataset. Further details about the monitoring campaign and QC procedures can be found in [97].
In addition to the experimental data collected at the city centre, records from the nearby meteorological stations of Barajas Airport (LCZ D) and Ciudad Universitaria (LCZ 9) were used. Hourly values of dry bulb temperature, relative humidity, wind speed, wind direction and precipitation were extracted from the former, while global solar radiation was obtained from the latter. The data covered the same time period (July 2016-September 2018). Both stations are managed by the Spanish Meteorological Agency (AEMET), which complies with the requirements established by the WMO Integrated Global Observing System (WIGOS, [106,107]) regarding QC and sensor installation.
Three different types of datasets were created: training, validation and test datasets. The training and validation datasets were used to fit and evaluate different ANN model configurations. Several training and validation datasets, which varied in length (12, 9, 6 and 3 months) and in the months that they covered, were created based on almost 15 months of monitoring (July 2016-September 2017, 10,440 records/hourly measurements). All these datasets were continuous over time, and they were distributed as 80% training and 20% validation. These training and validation subsets were created by randomly sampling the data. This prevented the potential accumulation of specific events in any of these datasets (e.g., certain meteorological conditions), which could bias either the training or the validation of the models. Additionally, a test dataset was created based on the second year of recorded data (October 2017-September 2018, 8688 records/hourly measurements) to independently test the models and assess their accuracy over an entirely different year.
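The random 80/20 partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the fixed seed and the use of integer hour indices as stand-ins for the hourly records are hypothetical and do not come from the study.

```python
import random

def split_train_validation(records, train_fraction=0.8, seed=42):
    """Randomly assign records to a training and a validation subset."""
    indices = list(range(len(records)))
    rng = random.Random(seed)  # fixed seed (assumption) so the split is reproducible
    rng.shuffle(indices)
    n_train = round(train_fraction * len(records))
    # Sorting restores chronological order inside each subset
    train = [records[i] for i in sorted(indices[:n_train])]
    validation = [records[i] for i in sorted(indices[n_train:])]
    return train, validation

# Hypothetical stand-in for the 10,440 hourly records (July 2016-September 2017)
hourly_records = list(range(10440))
train_set, validation_set = split_train_validation(hourly_records)
print(len(train_set), len(validation_set))  # 8352 2088
```

Because the sampling is random rather than chronological, both subsets draw from the whole monitored period, which is the property the text relies on to avoid concentrating particular weather events in one subset.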

Designing the ANNs
Feed-forward Neural Networks (FNN) were used in this study. Although FNNs are at the baseline of supervised deep neural networks, their utility for modelling urban temperatures has been widely demonstrated in previous studies (see Section 1.1). Figure 2 outlines the two different approaches, based on two different outputs, that were adopted in this study to model urban temperatures. The first one consisted of directly targeting the air temperature at the urban site, validating its outputs with the measurements previously recorded at that location. This approach is aligned with the majority of similar studies found in the literature, and it is referred to in this study as the temperature approach (TEMP approach). The second option aims at modelling the urban air temperature indirectly. In this case, the model targets the UHI intensity instead, computed as the temperature difference between the urban site (Embajadores) and the reference location (Barajas Airport), $\Delta T_{\mathrm{LCZ\,2,\,LCZ\,D}}$. The urban temperature is then derived indirectly by adding the airport temperature to the output of the model. This will be referred to as the UHII approach from this point onwards.
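The difference between the two targets can be made concrete with a short sketch. The function names and the temperature values below are hypothetical; the only element taken from the text is the definition of the UHI intensity as urban minus reference temperature and the recovery of the urban temperature by adding the reference temperature back.

```python
def uhi_intensity(urban_temps, reference_temps):
    """UHII target: urban temperature minus reference (airport) temperature."""
    return [u - r for u, r in zip(urban_temps, reference_temps)]

def urban_from_uhi(modelled_uhi, reference_temps):
    """Recover the urban temperature from a modelled UHI intensity."""
    return [d + r for d, r in zip(modelled_uhi, reference_temps)]

# Hypothetical hourly values in degrees Celsius
urban = [25.0, 22.5, 20.0]
reference = [24.0, 19.5, 15.0]

target = uhi_intensity(urban, reference)
print(target)                             # [1.0, 3.0, 5.0]
print(urban_from_uhi(target, reference))  # [25.0, 22.5, 20.0]
```

In the UHII approach the network would be trained on `target` instead of `urban`, and its output would be passed through `urban_from_uhi` before being compared with the measured urban temperatures.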
The selection of the FNN model inputs in this study was informed by the previous studies in Table 1, which have identified the variables that are strongly correlated with the formation of heat islands [109,110]. They consist of six meteorological variables: dry bulb temperature (°C), relative humidity (%), precipitation (mm), wind direction (degrees), wind speed (m/s) and global solar radiation (J/m²). The time of the day was added to these six input parameters, which was expected to reflect the daily variability of the outputs, either the temperature or the UHI intensity. Cloud cover was not used as an input parameter because the available frequency (one record every eight hours) was incompatible with the hourly frequency of the outputs. The wind speed presented strong variations at an hourly level and introduced strong oscillations in the prediction. Thus, to help avoid abrupt changes in the output, a moving average (MA) filter was applied. The use of an MA filter is a common pre-processing technique when modelling time series from data with a high variability. Examples can be found in the field of urban traffic (applying an MA to the car's acceleration [111]), atmospheric pollution (MA applied to measured PM2.5 concentration [112]) or urban climate modelling [113], the latter using an MA of order 8 (i.e., 8 h) to reduce the presence of wind gust peaks in the dataset prior to feeding their model. In this study, an MA of order 4 (4 h) was found to be sufficient to reduce the noise of the wind speed while preserving the time series trend.
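A moving average of order 4 can be sketched as follows. The text does not state whether the window is trailing or centred, nor how the first hours of the series are handled, so this example assumes a trailing window (the current hour plus the three preceding ones) that is shortened at the start of the series. The function name and wind speed values are hypothetical.

```python
def moving_average(values, order=4):
    """Trailing moving average; the window shrinks at the start of the series."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - order + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical hourly wind speeds (m/s) with gust-like oscillations
wind_speed = [2.0, 6.0, 1.0, 7.0, 2.0, 6.0]
smoothed_wind = moving_average(wind_speed, order=4)
print(smoothed_wind)  # [2.0, 4.0, 3.0, 4.0, 4.0, 4.0]
```

The hour-to-hour jumps of up to 6 m/s in the raw series are reduced to at most 2 m/s after filtering, which illustrates why the filter dampens oscillations in the model output.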
All the inputs were standardized prior to being fed to the FNN, meaning that all variables were transformed to have a mean of 0 and a standard deviation of 1 [114,115]. A diagram of the FNN structure for both approaches can be seen in Figure 3.
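Standardization amounts to subtracting the mean and dividing by the standard deviation of each variable. The sketch below assumes that the statistics are estimated on the training data and then reused for any other dataset, and that the population standard deviation is used; neither detail is specified in the text, and the function names and values are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(training_values):
    """Estimate the mean and (population) standard deviation of one variable."""
    mu = mean(training_values)
    sigma = pstdev(training_values)
    if sigma == 0:
        raise ValueError("constant variable cannot be standardized")
    return mu, sigma

def standardize(values, mu, sigma):
    """Transform values to z-scores using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

# Hypothetical dry bulb temperatures (degrees Celsius) from the training period
training_temps = [10.0, 12.0, 14.0, 16.0, 18.0]
mu, sigma = fit_standardizer(training_temps)
z_scores = standardize(training_temps, mu, sigma)
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```

Reusing `mu` and `sigma` on the test year keeps the test inputs on the same scale the network saw during training.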

Comparing and Evaluating the FNNs
Several FNN structures with different configurations were trained during the first phase of this research. Hyperparameters, such as the number of neurons per hidden layer, the activation functions, or the number of epochs, were thoroughly iterated in order to find a common, optimal configuration for both the TEMP and the UHII approach. Although some of the tested activation functions are commonly applied for classification tasks and were not likely to give the best performance (i.e., sigmoid-like functions), they were included in the iterative process since similar previous works made use of them [24,67]. To streamline the process and reduce the complexity of the iteration, each subsequent hidden layer adopted half the neurons of the previous one. All models initialized their weights randomly and were initially trained using 12 months of data. Configurations were compared by iterating just one parameter (e.g., the activation functions) and leaving the others fixed, while increasing the number of neurons per hidden layer. Those parameters that reached the best overall accuracy with the lowest number of neurons were selected. In total, 5478 FNNs were trained during this iterative process. Table 2 summarizes the parameters used to test these configurations, as well as the ones that were finally selected. The task outlined above was performed using Python and Keras, a deep-learning library based on TensorFlow [116,117].
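The rule that each hidden layer takes half the neurons of the previous one can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and the use of integer division (with a floor of one neuron) for odd layer sizes is an assumption, since the text does not say how odd values were handled.

```python
def hidden_layer_sizes(first_layer_neurons, n_hidden_layers):
    """Return layer sizes where each hidden layer has half the neurons of the previous one."""
    sizes = []
    neurons = first_layer_neurons
    for _ in range(n_hidden_layers):
        sizes.append(neurons)
        neurons = max(1, neurons // 2)  # assumption: integer halving, at least one neuron
    return sizes

print(hidden_layer_sizes(18, 2))  # [18, 9]
print(hidden_layer_sizes(32, 3))  # [32, 16, 8]
```

With this rule, a whole architecture is determined by two numbers (the width of the first hidden layer and the depth), which keeps the search over structures to a single sweep per depth.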
Once a common structure and configuration were defined, a comparative analysis of these models was carried out. First, the contribution of each input to the model output was assessed using a sensitivity analysis [114,118,119]. The 5th, 25th, 50th, 75th and 95th percentiles were used to run the sensitivity analysis for each input, while fixing the rest at their means. The time of the day was excluded from the sensitivity analysis and fixed at two different moments: noon and midnight. Next, the overall accuracy of the TEMP and the UHII approach was compared using several error metrics, such as the root mean squared error (RMSE), the median absolute deviation (MAD) or the coefficient of determination (R²). Modelled results were then plotted for three different weeks to visually assess whether the modelling ability of either of these two approaches could be compromised under certain scenarios. These corresponded to a week of high atmospheric stability (and thus, strong UHI intensity), a week of high atmospheric instability (weak UHI intensity), and a week under both of these conditions.
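The error metrics named above can be written out explicitly. The definitions below are assumptions where the text leaves room for interpretation: MAD is taken as the median absolute deviation of the errors around their median, and R² as one minus the ratio of residual to total sum of squares. Function names and sample values are hypothetical.

```python
import math
from statistics import mean, median

def rmse(observed, modelled):
    """Root mean squared error."""
    return math.sqrt(mean((m - o) ** 2 for o, m in zip(observed, modelled)))

def mad(observed, modelled):
    """Median absolute deviation of the errors around their median (assumed definition)."""
    errors = [m - o for o, m in zip(observed, modelled)]
    centre = median(errors)
    return median(abs(e - centre) for e in errors)

def r_squared(observed, modelled):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    obs_mean = mean(observed)
    ss_res = sum((o - m) ** 2 for o, m in zip(observed, modelled))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and modelled urban temperatures (degrees Celsius)
observed = [10.0, 12.0, 14.0, 16.0]
modelled = [11.0, 12.0, 13.0, 17.0]
print(round(rmse(observed, modelled), 3))       # 0.866
print(mad(observed, modelled))                  # 0.5
print(round(r_squared(observed, modelled), 2))  # 0.85
```

Note that R² depends on the variance of the observed series (`ss_tot`), so the same absolute error yields a higher R² for a variable with a large seasonal range than for one with a small range.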
The last step of the evaluation process consisted of modifying the amount of data provided to the neural networks during the training phase. To this end, FNN models for both the TEMP and the UHII approach were trained using 12, 9, 6 and 3 months of data, and were used to model the outdoor air temperatures for one complete year using the test dataset. The accuracy was estimated, as in the previous cases, using common error metrics. The loss of accuracy of the models trained with shorter datasets was quantified by comparing their performance with the models trained on more data, obtaining a percentage indicating the increase of error for each metric. In the case of models trained with just 3 months of data, the Mean Absolute Error (MAE) was estimated on a monthly basis to further explore its distribution along one year of modelling.
Figure 4. Each graph represents the overall accuracy of a certain FNN when iterating just one of its parameters while increasing the number of neurons in the hidden layers.
From this iterative process, a common, optimal FNN configuration for both the TEMP and the UHII approach was established. The optimal structure was defined as a neural network with seven inputs, two hidden layers of 18 and 9 neurons respectively, and one output. In that sense, it was found that increasing from one to two hidden layers produced a significant improvement in the models' accuracy, while increasing the number of hidden layers further did not. Similarly, moving from 100 to 200 epochs during the training phase could reduce the error of the FNN, while the computational expense of using 500 epochs instead of 200 did not seem justified. This was particularly evident when having tens of neurons in the hidden layers.
Table 2. Results obtained with the models derived from the TEMP and the UHII approach for the site Embajadores. Three weeks were selected, each one representing a different atmospheric stability scenario. The timeframe used to train these models extends from July 2016 to September 2017.

A comparison between several FNN configurations is first shown in
In some cases, due to the performance differences between the TEMP and the UHII approach, a common ground had to be reached in terms of the optimal configuration. That was the case of the optimisers: Stochastic Gradient Descent (SGD) seemed to produce the best results for those FNNs modelling the UHI intensity, but it led to exploding gradient problems when modelling the temperature. Thus, the Adaptive Moment Estimation (Adam) optimiser was used instead, which performed optimally in both scenarios. For the activation functions, a combination of the Exponential Linear Unit (ELU) for the hidden layers and the linear function for the output layer was used.
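To make the selected structure tangible, the following sketch runs a single forward pass through a network with seven inputs, hidden layers of 18 and 9 neurons with ELU activation, and one linear output. It is written in plain Python rather than Keras so that it runs without dependencies; the weight initialisation range, the ELU parameter alpha = 1 and the input values are assumptions, and training with the Adam optimiser is deliberately left out.

```python
import math
import random

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def linear(x):
    """Identity activation used for the output layer."""
    return x

def init_layer(n_inputs, n_neurons, rng):
    """Random weights (assumed uniform range) and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases

def dense(inputs, weights, biases, activation):
    """Weighted sum plus bias for each neuron, followed by the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Propagate the inputs through all layers in order."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values

rng = random.Random(0)
layer_sizes = [7, 18, 9, 1]        # inputs, two hidden layers, output
activations = [elu, elu, linear]
layers = []
for n_in, n_out, activation in zip(layer_sizes[:-1], layer_sizes[1:], activations):
    weights, biases = init_layer(n_in, n_out, rng)
    layers.append((weights, biases, activation))

# Hypothetical standardized inputs (six meteorological variables plus time of day)
standardized_inputs = [0.3, -1.2, 0.0, 0.8, -0.5, 1.1, 0.0]
prediction = forward(standardized_inputs, layers)
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b, _ in layers)
print(len(prediction), n_parameters)  # 1 325
```

The network has only 325 trainable parameters, which gives a sense of how compact the selected configuration is relative to the thousands of hourly records available for training.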
Overall, UHII models presented fewer convergence problems than TEMP models, which seemed to have some difficulties with some activation functions and optimisers. Furthermore, the UHII approach usually outperformed the TEMP approach. The former not only produced models with relatively smaller errors than the latter, but also required fewer neurons per hidden layer to reach a similar accuracy. This behaviour might be indicative of a clearer and more direct relationship between inputs and output, which in the case of the UHII approach links parameters such as wind speed, precipitation, or solar radiation with the UHI formation.
Differences between both modelling approaches also arise when looking at the relevance of the inputs. In that sense, the sensitivity analyses presented in Figures 5 and 6 seem to reveal significant variations between them. The temperature from the reference site shifts from being the most relevant parameter of the entire FNN (TEMP approach) to being one of the least important (UHII approach). This is especially visible at night, when inter- and intra-urban temperature differences are most pronounced. The other parameters, albeit with different magnitudes, seem to condition the outcome of both models in a similar way. In that sense, wind speed and direction seem to be two highly influential parameters during the night, while solar radiation and relative humidity seem to be key during the day.
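The one-at-a-time sensitivity analysis behind these figures (one input swept over its 5th, 25th, 50th, 75th and 95th percentiles while the others are held at their means) can be sketched as follows. The toy model is a hypothetical stand-in for a trained FNN, the percentile uses linear interpolation (an assumption), and the fixing of the time of day at noon or midnight is omitted for brevity.

```python
from statistics import mean

def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks (assumed method)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

def sensitivity(model, inputs, name, percentiles=(5, 25, 50, 75, 95)):
    """Model response when one input is swept and the others stay at their means."""
    baseline = {key: mean(values) for key, values in inputs.items()}
    responses = {}
    for q in percentiles:
        sample = dict(baseline)
        sample[name] = percentile(inputs[name], q)
        responses[q] = model(sample)
    return responses

def toy_model(x):
    """Hypothetical stand-in for a trained FNN returning a UHI intensity."""
    return 5.0 - 0.5 * x["wind_speed"] + 0.001 * x["solar_radiation"]

# Hypothetical input samples
inputs = {
    "wind_speed": [0.0, 1.0, 2.0, 3.0, 4.0],
    "solar_radiation": [0.0, 1000.0, 2000.0, 3000.0, 4000.0],
}
wind_response = sensitivity(toy_model, inputs, "wind_speed")
print(wind_response)
```

The spread between the responses at the lowest and highest percentiles gives a simple measure of how strongly each input conditions the output, which is how the relevance of the reference temperature can be compared between the two approaches.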
Although the UHII approach appears to yield more balanced models, this apparent advantage does not seem to have a significant impact on their outputs when trained with 12 months of data. In this scenario, reasonably good results, with similar error patterns, are obtained for both approaches. As can be seen in Figure 7, modelled temperatures fit satisfactorily with the measured temperatures at the urban site under a wide variety of circumstances, including different UHI scenarios: a rainy and windy week with generalised low UHI intensities (<2 °C); a week with varying meteorological conditions, during which a sudden weather change from calm to rainy was observed, leading to a rapid change in the UHI intensities; or a calm week with strong UHI intensities (>5 °C), probably reinforced by temperature inversions. The greatest errors seem to accumulate on those nights when unusual conditions occur, such as when very high UHI intensities, close to 10 °C, are registered; or when the UHI intensity drops and rises abruptly, perhaps coinciding with occasional and localised weather events, such as rainfall. Overall, models produced relatively smooth time series, without spikes or large variations from one hour to the next, despite not having a built-in temporal dependence between consecutive outputs. Using a moving average for the wind speed seems to have contributed to reducing the noise in the models' output.
Models targeting the UHI intensity scored slightly better in the error metrics, with a reduction of the error between 6.4 and 11.7% (see Table 3). The RMSE was 1.09 °C and 1.02 °C for the TEMP and the UHII approach, respectively. These results are in line with previous studies, such as in Kim [...] [75], both modelling outdoor air temperature. The only exception is the coefficient of determination, which is extraordinarily high when targeting the temperature (R² = 0.99). This is also in line with previous studies (e.g., [75,77]) and is further addressed in the discussion section.
Table 3. Metrics of the selected models targeting both the air temperature and the UHI intensity. Both models were trained using 12 months of data (July 2016-September 2017). The two variables regressed are modelled and monitored air temperatures.


Shortening the Training Dataset
The results presented above correspond to FNN models trained with one year of hourly data. So far, the TEMP and the UHII approach have proved to yield similar results. When training models with less data, however, differences started to arise. Results show that using 9 months instead of 12 months of data slightly increased the RMSE, by 0.9% and 2.4% for the TEMP and the UHII approach, respectively. When using 6 months of data, the accuracy loss increased more markedly, especially in the case of the TEMP models (11.7% vs. 6.2%). The error kept growing exponentially when using 3 months of data, and the trend was more accentuated, leading to significant differences between both approaches (63.1% vs. 40.7%). A similar trend was observed for the MAE and MAD metrics, which can be found in Table 4.
Table 4. Relative accuracy loss when reducing the size of the training dataset for both the TEMP and the UHII approach. The number of months in each column establishes the baseline of accuracy. The accuracy was obtained using the evaluation dataset. These results are the average error yielded by several models trained with shortened datasets and are relative to the accuracy of the models trained with 12 months of data.
Figure 8 presents the absolute accuracy levels of the models, including the accuracy of all models trained with each shortened dataset. As already noted, differences arise not only when reducing the training datasets, but also when changing from one approach to another. The large variability of error between the models trained with 3 months of data is noticeable, and it is more accentuated in the case of the TEMP approach. It seems that, depending on the data used during training, the outcome ranges from models with an acceptable overall accuracy (RMSE < 1.5 °C, in line with previously developed models) to others whose suitability for modelling is doubtful (RMSE > 2 °C).
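The relative accuracy loss reported in Table 4 is a percentage increase of an error metric with respect to the 12-month baseline, averaged over several models. A minimal sketch of that arithmetic is given below; the function names and all RMSE values are hypothetical and are not taken from the study.

```python
from statistics import mean

def relative_error_increase(error_short, error_baseline):
    """Percentage increase of an error metric relative to the baseline model."""
    return 100.0 * (error_short - error_baseline) / error_baseline

def mean_relative_increase(errors_short, error_baseline):
    """Average percentage increase over several models trained on shorter datasets."""
    return mean(relative_error_increase(e, error_baseline) for e in errors_short)

# Hypothetical RMSE values in degrees Celsius
baseline_rmse = 1.0                  # model trained with 12 months
short_rmse = [1.25, 1.5, 1.75]       # models trained with different 3-month periods
print(relative_error_increase(1.25, baseline_rmse))       # 25.0
print(mean_relative_increase(short_rmse, baseline_rmse))  # 50.0
```

Averaging hides the spread between individual models, which is why the absolute errors of every model are also worth inspecting.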

Yet, these results represent the average cumulative error over a year. A more detailed analysis showed that the models' error is unevenly distributed over the months: they lose accuracy outside the months for which they were trained. Within the training months, however, their results do not suffer excessively and are comparable with those of models trained on more data. In that sense, Figure 9 shows the additional error yielded by models trained with only 3 months of data. For convenience, these months were made to coincide with the seasons of the year, and a model trained with all 12 months of data was used as a reference.
Figure 8. Comparison of the error obtained by several models, trained using datasets of different length and differentiating between those targeting the air temperature and the UHI intensity. On the left is presented the RMSE. On the right is presented the MAD.
Figure 9. Additional error yielded by models trained with just 3 months of data, using both the TEMP and the UHII approach. These models are named according to the season they were trained on. The reference error baseline was established by the same ANN configuration trained with 12 months of data.
The results show that the models systematically tend to minimise their error within their season, with the RMSE gradually increasing as they move away from it. This is accentuated for models trained in winter and summer. The reason behind this could lie in the annual cyclical behaviour of temperatures: between solstices and equinoxes, temperatures remain at one extreme of the annual cycle, either at the high end (summer) or at the low end (winter). Between the equinoxes and solstices (spring and autumn), though, the transition between the two extremes takes place. Training on these transitional seasons could favour the ANN, as it would extend pattern recognition to practically the entire annual temperature range, leaving only the extremes to depend on the neural network's ability to generalise and extrapolate beyond what it has seen during training.
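A month-by-month breakdown of this kind can be sketched as follows. This is a hedged illustration rather than the procedure used in the study: it assumes the additional error is the per-month RMSE of a seasonally trained model minus the per-month RMSE of the 12-month reference model. The function names and the toy values are hypothetical.

```python
import numpy as np


def monthly_rmse(months, observed, modelled):
    """RMSE per calendar month (index 0 = January); NaN where no samples exist."""
    months = np.asarray(months)
    sq_err = (np.asarray(modelled, dtype=float) - np.asarray(observed, dtype=float)) ** 2
    out = np.full(12, np.nan)
    for m in range(1, 13):
        mask = months == m
        if mask.any():
            out[m - 1] = np.sqrt(sq_err[mask].mean())
    return out


def additional_error(months, observed, seasonal_model, reference_model):
    """Per-month RMSE of the seasonal model minus that of the reference model."""
    return (monthly_rmse(months, observed, seasonal_model)
            - monthly_rmse(months, observed, reference_model))


# Toy data: two January and two July samples (hypothetical values, in °C).
months = [1, 1, 7, 7]
observed = [5.0, 6.0, 30.0, 31.0]
summer_model = [8.0, 9.0, 30.0, 31.0]      # accurate in July, biased in January
reference_model = [5.5, 6.5, 30.5, 31.5]   # uniform 0.5 °C error all year

extra = additional_error(months, observed, summer_model, reference_model)
print(extra[0], extra[6])
```

In this toy case the summer-trained model adds error in January and slightly improves on the reference in July, which is the pattern the paragraph above describes.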
This dynamic is noticeable in the case of the UHII approach as well, although it seems to be rather less pronounced. As pointed out in the introduction, Madrid's UHI does not seem to follow a seasonal pattern, which means it might reach its highest and lowest intensities at any time during the year (see Figure A1). However, these UHI peaks depend on the meteorological conditions, so the loss of accuracy registered by these UHII models seems likely to be related to the concentration of certain meteorological conditions during the training phase. In other words, these FNNs would have difficulties in refining the modelling if the three months of data used to train them do not contain a sufficiently large record of the different meteorological conditions that favour the occurrence of UHI.
The performance differences between the TEMP and the UHII approach become clearly noticeable when plotting the data. In this respect, Figure 10 shows how a TEMP model trained from May to August produces quite precise results for June of the following year, similar to those obtained by models trained with 12 months of data. However, when trying to reproduce the temperature profile in February, that same model barely captures the overall trend. In that scenario, the UHII model, trained with the same three months of data, was able to fit the observed values with higher accuracy. It accumulated error at the same moments as the models trained with 12 months of data, in many cases amplifying it. Despite the unusual distribution of temperatures and UHI intensities for that week, the UHII model was able to capture most of it, which was surprising given the relatively small amount of data used for its training.

Discussion
The results of this research point towards the potential to reduce the training datasets without a significant loss of accuracy. This could facilitate the work of urban climate researchers, thus promoting the development of shorter and simpler monitoring campaigns. This does not mean that it is preferable to use smaller amounts of data to train ANN models, but that their accuracy might not be compromised when they are trained in this manner. Although using large amounts of high-quality data is always desirable, in some cases it is not possible due to varying circumstances, such as budget constraints or human resources limitations. In this context, knowing where the accuracy limits of the models lie when trained with less data might help researchers explore their experimental data or design new measurement campaigns in an efficient manner.
In this study we propose the use of empirical, FNN-based models to extend the temporal coverage of urban monitoring campaigns. Although these models are limited for carrying out temporal predictions into the future, they can be used to adjust long-term records gathered outside the city to the urban context. This approach, the generation of long-term datasets by looking backwards, might be useful in many disciplines, including the generation of site-specific weather files for building energy modelling [120][121][122][123], the downscaling of heat-related epidemiological studies to evaluate the effect of urban temperatures on health [4,124][125][126][127], or the identification and characterization of energy-poor households in urban environments [128][129][130][131][132].
It is worth noting that the use of UHI intensity instead of outdoor temperature as the output of the FNN models yielded significantly better results mainly when reducing the size of the training dataset. The accuracy improvement was limited when using 9 or more months of data during the training phase. The benefits of targeting the UHI intensity with the FNN model are, therefore, linked to the potential of using smaller datasets to model outdoor urban temperatures. However, justifying the use of the UHI intensity instead of the temperature as the output on the basis of the lower seasonality of the former is debatable. ANN are universal function approximators [133] and, for that reason, using one parameter or the other should not produce significant differences. Although this was mathematically demonstrated, Curry [134] showed that modelling the seasonality of a time series with an FNN would require a very large structure. This structure would grow exponentially when increasing the length of the dataset, since more turning points are likely to appear. In fact, Zhang and Qi [135] recommended not only deseasonalizing the time series, but also removing its trend (if any). Nowadays, pre-processing the dataset to make it stationary before feeding the ANN is a widespread practice and has proved to be very effective with RNN as well [88,136]. This approach might be helpful in the future for other studies such as Han et al. [86], where the UHI intensity could be used instead of the outdoor air temperature to remove much of the seasonality from their temperature forecasts. However, it is unclear whether these findings could be extended to FNNs that use a reference site for modelling outdoor urban temperatures without any time dependence.
Other reasons, such as the range of temperatures or the concentration of meteorologically stable conditions in the training dataset, might explain the varying accuracy results between the TEMP and the UHII approach when training these types of models, especially when using just 3 months of training data.
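The change of target discussed above can be illustrated with a short sketch. It assumes that the UHI intensity is defined as the urban temperature minus the temperature at the reference site, which this section implies but does not state explicitly, and it uses purely synthetic data. The helper names `to_uhii` and `to_urban_temp` are hypothetical.

```python
import numpy as np


def to_uhii(urban_temp, reference_temp):
    """UHII target: urban minus reference temperature (assumed definition)."""
    return np.asarray(urban_temp, dtype=float) - np.asarray(reference_temp, dtype=float)


def to_urban_temp(predicted_uhii, reference_temp):
    """Recover the urban temperature from a modelled UHII and the reference record."""
    return np.asarray(reference_temp, dtype=float) + np.asarray(predicted_uhii, dtype=float)


# Synthetic one-year hourly record (illustrative only, not Madrid data).
hours = np.arange(365 * 24)
day = hours / 24.0
reference = (15.0
             + 12.0 * np.sin(2 * np.pi * (day - 110) / 365)   # annual cycle
             + 5.0 * np.sin(2 * np.pi * hours / 24))          # daily cycle
rng = np.random.default_rng(0)
uhii_true = np.clip(rng.normal(2.0, 1.5, hours.size), 0.0, 8.0)  # no seasonal pattern
urban = reference + uhii_true

target_temp = urban                       # TEMP approach: model the temperature itself
target_uhii = to_uhii(urban, reference)   # UHII approach: model the difference

print(f"std of TEMP target: {target_temp.std():.2f} °C")
print(f"std of UHII target: {target_uhii.std():.2f} °C")
```

Subtracting the reference record removes the annual cycle from the target in this synthetic case, which is the deseasonalizing effect the discussion attributes to the UHII approach. The urban temperature is then recovered by adding the modelled UHII back to the reference record.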
In line with the latter, it seems that the selection of days with different meteorological conditions and at different times of the year might be more relevant for the modelling than the continuity of the monitoring campaign. Thus, it may be more appropriate for future studies to work with shorter, discontinuous monitoring campaigns covering a wider range of meteorological situations rather than a single, continuous campaign that might concentrate on a specific time of the year. Results may also support the use of data from sources whose long-term continuity may be compromised (e.g., CWS). In these cases, it would be relevant not only to apply filtering techniques to reduce the risk of introducing outliers, but also to carry out frequency distribution analyses to ensure that all meteorological conditions are included in the modelling.
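One simple form that such a frequency distribution analysis could take is sketched below. It is an assumption-laden illustration, not a method described in the study: it bins a single meteorological variable and reports the bins that occur in the full record but are missing from a candidate training subset. The function name, the bin edges and the sample values are hypothetical.

```python
import numpy as np


def uncovered_bins(full_record, training_subset, bin_edges, min_count=1):
    """Return the (low, high) bins that occur in the full record but hold
    fewer than `min_count` samples in the training subset."""
    full_counts, _ = np.histogram(full_record, bins=bin_edges)
    train_counts, _ = np.histogram(training_subset, bins=bin_edges)
    missing = (full_counts > 0) & (train_counts < min_count)
    return [(float(bin_edges[i]), float(bin_edges[i + 1]))
            for i in np.flatnonzero(missing)]


# Hypothetical wind speed samples (m/s) and 2 m/s wide bins.
full_wind = [0.5, 1.5, 2.5, 3.5, 6.5]
train_wind = [0.7, 1.2, 2.9]
edges = [0.0, 2.0, 4.0, 6.0, 8.0]

gaps = uncovered_bins(full_wind, train_wind, edges)
print(gaps)
```

Here the training subset never sees wind speeds between 6 and 8 m/s, so that bin is flagged. The same check could be repeated for each input variable before deciding whether a short or discontinuous campaign covers enough conditions.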
Some attention should be drawn to the pertinence of using certain error metrics. Despite being widespread (e.g., [75][76][77]), the use of $R^2$ as a performance indicator could be misleading [137,138]. As can be seen in Equation (1), $R^2$ relies both on the size of the residuals ($SS_{res}$, the actual deviation of the prediction from the observed values) and the total variance of the dependent variable ($SS_{tot}$):
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \qquad (1)$$
Thus, obtaining a higher $R^2$ does not necessarily mean having less error (numerator), but might be the result of a higher variance of the output (denominator). This was observed in this study when comparing the two approaches. In the case of the TEMP approach, the variance of the output temperature (which ranges from −2 to 41 °C) is much higher than the variance of the UHI intensity (ranging from 0 to 8 °C). Furthermore, since the TEMP approach contains an input variable (airport temperature) that explains most of the variance of the output variable (urban temperature), the $R^2$ tends to be extremely high ($R^2 > 0.99$). This explains why significantly lower $R^2$ values were obtained when using the UHII approach in spite of yielding better results with the rest of the performance indicators (RMSE, MAE, MAD).
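The effect described above can be reproduced with synthetic data. The sketch below applies exactly the same residuals to a high-variance target (temperature) and a low-variance target (UHI intensity). It uses the standard definition of the coefficient of determination, and all series are randomly generated for illustration rather than taken from the study.

```python
import numpy as np


def r_squared(observed, modelled):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    modelled = np.asarray(modelled, dtype=float)
    ss_res = np.sum((observed - modelled) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(observed, modelled):
    """Root mean squared error."""
    diff = np.asarray(modelled, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(42)
n = 8760                                                  # one year of hourly values
hours = np.arange(n)
reference = 15.0 + 12.0 * np.sin(2 * np.pi * hours / n)   # synthetic seasonal cycle
uhii_obs = rng.uniform(0.0, 8.0, n)                       # synthetic UHI intensity
urban_obs = reference + uhii_obs                          # synthetic urban temperature

residual = rng.normal(0.0, 1.0, n)                        # identical model error
uhii_mod = uhii_obs + residual
urban_mod = urban_obs + residual

r2_temp, r2_uhii = r_squared(urban_obs, urban_mod), r_squared(uhii_obs, uhii_mod)
rmse_temp, rmse_uhii = rmse(urban_obs, urban_mod), rmse(uhii_obs, uhii_mod)
print(f"TEMP: RMSE = {rmse_temp:.2f} °C, R2 = {r2_temp:.3f}")
print(f"UHII: RMSE = {rmse_uhii:.2f} °C, R2 = {r2_uhii:.3f}")
```

Both targets end up with the same RMSE, yet the temperature target scores a markedly higher coefficient of determination simply because its total variance is larger. This is why the two approaches are better compared through RMSE, MAE and MAD.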
Taking the above into consideration, it would be worthwhile to investigate whether the behaviour of the TEMP and the UHII models presented in this paper is only attributable to the case of Madrid or if, on the contrary, it might be common in other cities at different latitudes and climatic conditions. Some existing studies have identified strong seasonal differences in UHI intensities in other cities [139][140][141][142], while others have not found such differences [143][144][145]. In this respect, a strong annual seasonality of the UHI intensity would probably limit the capacity of the models to produce accurate results when trained with small datasets. Conversely, in cities where air temperatures remain within a narrow range throughout the year, such as those in tropical regions, the TEMP approach might perform better.

Conclusions
Feed-forward neural networks were used in this study to model urban temperature time series from experimental data. The aim was to explore the reliability of these models in the context of low data availability, as well as the potential benefits of targeting the UHI intensity with these models. Results showed that, for the case study of Madrid, the training dataset could be reduced to 9 or even 6 months without overly compromising the accuracy of the FNN models, particularly when using the UHII approach (2.4% and 6.2% increase in RMSE, respectively).
Results showed that the UHII approach generally outperformed the TEMP approach. Overall, UHII models converged to lower error ratios with a smaller number of neurons, proving to be more effective at predicting the urban temperature from a reference site. When using the exact same configuration and structure, UHII models exhibited a significant increase in performance. TEMP models appeared to be quite seasonally dependent, thus facing more problems when modelling temperatures outside the training months. This was particularly relevant when trained on just 3 months of data, when the accuracy differences between UHII and TEMP models were at their highest. We argue that this could be related to the annual cyclical behaviour of temperatures. Targeting instead the UHI intensity, which in Madrid has been shown to be almost stationary, seems to reduce uncertainty when modelling temperatures from a relatively small dataset.
The potential to train FNNs with smaller datasets and still obtain reliable results might benefit urban climate researchers, since field measurements could be reduced in duration and cost. Researchers might also take advantage of the accurate preliminary results that can be generated with relatively small datasets to speed up their research, or to extend their measurements to other urban areas.