Definition of Residential Power Load Profiles Clusters Using Machine Learning and Spatial Analysis

This study presents a novel approach for discovering actionable knowledge and exploring data-based models from data recorded by household smart meters. The proposed framework is supported by a machine learning architecture based on the application of data mining methods and spatial analysis to extract temporally and spatially restricted clusters of characteristic monthly electricity load profiles. In addition, it uses these clusters to perform short-term load forecasting (1 week) using recurrent neural networks. The approach analyses a database with measurements from 1000 smart meters gathered over 4 years in Guayaquil, Ecuador. The proposed methodology yielded a precise and efficient stratification of typical consumption patterns and extracted neighbour information that improves the performance of residential energy consumption forecasting.


Introduction
Smart meters (SMs) provide a granularity and precision that make it possible to overcome classical methods such as the definition of a single typical consumption curve for residential or commercial clients. In addition, this rich source of data for energy consumption analysis shows that conventional methods are unable to cope with such volume or speed. This situation, together with the increasing availability of more powerful computers, has promoted the use of intelligent techniques to study patterns in time series data. The application of machine learning (ML) to smart grid data enables designers to address problems in a more tailored way and seek more accurate results.
One of the most topical research areas taking advantage of the aforementioned premise is the application of clustering techniques to determine energy consumption efficiency. Clustering can be used on a daily electricity demand time series to group similar profiles and reveal the most typical load profiles [1,2]. An example is the work by Lavin et al. [3], who applied the k-means heuristic partitioning method to compare the structural similarity of daily time series with a focus on finding clients with similar energy consumption profiles. Clustering can extract consumption patterns at different time periods (e.g., monthly [4], seasonal [5] or annual [6,7]). This reveals important information about household consumption habits and their relationship with time variables.
Energy consumption habits can be affected by elements such as geographical barriers (rivers, hills), political boundaries (district divisions), commercial areas (free trade zones, industrial parks and harbours), soil characteristics or the orientation of buildings, among others. Geographic information systems (GIS) collect these spatial elements and their relationships to perform a spatial-temporal analysis that may prove extremely useful for spatial clustering. Spatial-temporal analyses are also used to forecast energy demand; for instance, the authors in [8] provided a framework that facilitates the exploitation of low-dimensional structures to govern the interactions between the surrounding residential SM users. In [9], a k-nearest vector autoregressive framework with exogenous input was proposed for models with spatial-temporal variation of electricity consumption in individual household load forecasting. Another example is the work presented in [10], where the authors studied how an electrical load is distributed in a city using area-independent agents and the relationships among neighbouring areas. In [11], the authors applied a temporal and spatial analysis to study the evolution patterns of alternative energies and improve the planning and construction of energy systems by means of a cellular automata model. In [12], load forecasting is studied using spatial regression to determine the probability of rural regions becoming urban areas as part of urban sprawl by spatially relating installed load and socioeconomic variables distributed in the study area.
Several studies have shown that forecasting independent residential loads is more challenging than forecasting commercial or aggregated loads [13,14]. The main reason for this greater complexity is the higher variability of the load profile, which is normally produced by household appliances that generate significant fluctuations in the consumption patterns. These fluctuations are often unpredictable due to the dynamic nature of the behaviour of the household residents. Furthermore, some studies show that the hybridisation of clustering with forecasting techniques improves prediction performance [15,16]. Recent works on training artificial neural networks (ANNs) with residential SM data for energy consumption predictions can be found in the literature. ANN methodologies vary from classic implementations [17] to more complex approaches, such as the recurrent neural network (RNN) with long short-term memory (LSTM) architecture [18] or restricted Boltzmann machines [19,20].
This study analyses data from residential SMs by applying ML techniques to extract knowledge and build prediction models. First, we applied the soft dynamic time warping [21] clustering methodology to find sets of users with similar monthly load profiles. Then, we used the clusters in a second process where a spatial-temporal analysis was applied to find sets of users adjusted to the particular reality of the geographical zones. Finally, users in the same geographical zone with similar energy consumption patterns were used in conjunction with an ML method to forecast future energy consumption. Hence, this study yielded two important contributions. On the one hand, it applied a temporal and a constrained spatial analysis to a large dataset of residential SMs, showing that a spatial neighbourhood is a significant source of information that can improve decision support and predictability. On the other hand, it supported our experiments with a novel hybrid ML framework that applied a constrained spatial clustering technique together with an LSTM architecture. Although previous studies have applied methodologies of time series clustering and spatial analysis, to our knowledge, there is no single study that incorporates time series clustering using soft dynamic time warping with a spatial-temporal analysis to define load profiles in specific geographic areas and then uses those profiles to feed an LSTM-based neural network to forecast energy consumption.
The identification of a geographic area with characteristic behaviours can provide an accurate and updateable feedback for advanced technical analysis of utilities and regulators, including planning and decision making for activities such as a deeper understanding of electricity demand [22,23], analysing criteria for establishing dynamic tariffs [24], setting up tariff adjustment mechanisms for energy efficiency [25] and optimising energy demands [6,26]. Profiling and forecasting of energy consumption are also necessary for optimising local energy systems and energy communities [27,28]. In this regard, the E-LAND Horizon 2020 project is developing a set of software tools, including an energy forecaster, to support energy management in communities [29].

Materials and Methods
The general methodology proposed in this study is divided into four main steps:
• Data collection, pre-processing and time series generation,
• Time series clustering,
• Spatial-temporal analysis, and
• Applications for energy consumption forecasting.
First, we apply a series of pre-processing procedures to obtain a clean, complete and consistent dataset and generate a time series representing each user's monthly load profile. Secondly, we deploy a clustering methodology to group the load profile time series according to characteristic monthly behaviour. We then perform a spatial-temporal analysis to extract spatial clusters restricted to nearby areas. Finally, to demonstrate the benefits of obtaining information on consumer behaviour in spatial areas, we model the extracted knowledge using an RNN to forecast the week-ahead hourly energy consumption.
To improve the readability and follow up of this study, Figure 1 shows the phases at each step.

Data Pre-Processing and Time Series Generation
Data were gathered from a sample of 1000 residential SMs (one for each house) over 4 years in the city of Guayaquil, Ecuador. Measurements were taken every 15 min, which generated a database with approximately 130 million records. The collected measurements include the following variables:
• Geographic position of the SMs,
• Timestamp with the date and time of measurement,
• Customer code to identify the client (anonymised data), and
• Active power (kW).
Although our database initially had 1000 SMs, this is a very small percentage of the total number of meters in the city of Guayaquil. The city has a very high concentration of electricity consumers, with the number reaching more than 700,000 in an urbanised area of approximately 190 km². The sample is therefore quite dispersed, mainly due to the existence of old electromechanical meters that had not yet been replaced. For this reason, our initial dataset was strictly analysed through different processes to obtain a final dataset focused on validating the objectives of this study.
A series of data pre-processing procedures was applied to obtain a consistent dataset. First, an exploratory analysis revealed missing active power values and outliers in the gathered SM information. Missing data (failures in the SMs or in the data hub transmitters) were found either in short periods of less than or equal to one hour or in longer periods (weeks), usually because some properties were holiday rental houses and the SMs were turned off when they were uninhabited. Thus, a missing value between two valid measurements was imputed by linear interpolation and, in cases where up to 96 measurements (1 day) were missing, these values were imputed using the corresponding measurements of the previous week. After processing the missing data, an outlier detection methodology was applied to detect unusual values and replace them using the same imputation mechanism. Then, we extracted time series to comply with the purposes of this study. First, time series were simplified by resampling to hourly measurements, taking the average active power value in each hour. Secondly, since long periods of no consumption could represent a change of tenant or owner, to ensure that we were always analysing the same customer, only time series with consecutive and complete measurements for at least 10 months of the year were selected.
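The imputation and resampling rules above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, a "short gap" is assumed to be a single missing reading, and gaps longer than one day (or gaps without a previous week to copy from) are simply left as NaN.

```python
import numpy as np

STEPS_PER_DAY = 96         # 15-minute readings in one day
STEPS_PER_WEEK = 7 * 96    # 15-minute readings in one week


def impute_gaps(power):
    """Fill NaN gaps in a 15-minute active power series (kW)."""
    filled = np.asarray(power, dtype=float).copy()
    missing = np.isnan(filled)
    n = len(filled)
    i = 0
    while i < n:
        if not missing[i]:
            i += 1
            continue
        j = i
        while j < n and missing[j]:
            j += 1                      # the gap covers indices i .. j-1
        length = j - i
        if length == 1 and i > 0 and j < n:
            # One missing value between two valid readings: linear interpolation.
            filled[i] = 0.5 * (filled[i - 1] + filled[j])
        elif length <= STEPS_PER_DAY and i >= STEPS_PER_WEEK:
            # Up to one day missing: copy the same time slots of the previous week.
            filled[i:j] = filled[i - STEPS_PER_WEEK:j - STEPS_PER_WEEK]
        # Longer gaps stay NaN so the series can be rejected later.
        i = j
    return filled


def to_hourly(power_15min):
    """Average each block of four 15-minute readings into one hourly value."""
    values = np.asarray(power_15min, dtype=float)
    usable = len(values) - len(values) % 4
    return values[:usable].reshape(-1, 4).mean(axis=1)
```

Leaving long gaps as NaN makes it straightforward to apply the subsequent rule that only series with at least 10 complete months are retained.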
To obtain comparable energy consumption behaviour profiles, the time series were normalised to the range of 0 to 1. Furthermore, characteristic time series representing typical weekly behaviour for each month of the year (e.g., a time series representing a characteristic week of January) were generated by averaging the value of active energy in each hour of a day (24 measurements) for each particular day of the week (Monday-Sunday). After all the aforementioned pre-processing procedures, summarised in Figure 2, we had a total of 754 SMs, which represented a database of approximately 100 million records.
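As an illustration of the two steps described above, the following sketch applies min-max normalisation and builds a characteristic week of 7 × 24 = 168 values from one month of hourly data. The function names and the `first_weekday` argument are assumptions made for this example, and the input is assumed to contain whole days starting at 00:00 with every weekday present at least once.

```python
import numpy as np


def min_max_normalise(values):
    """Scale a series to the range 0 to 1."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    if high == low:
        return np.zeros_like(values)   # flat series: avoid division by zero
    return (values - low) / (high - low)


def characteristic_week(hourly, first_weekday=0):
    """Average hourly values per weekday and hour.

    hourly: hourly values for whole days of one month.
    first_weekday: weekday of the first day (0 = Monday, 6 = Sunday).
    Returns 168 values ordered from Monday 00:00 to Sunday 23:00.
    """
    days = np.asarray(hourly, dtype=float).reshape(-1, 24)
    weekdays = (first_weekday + np.arange(len(days))) % 7
    profile = np.vstack([days[weekdays == d].mean(axis=0) for d in range(7)])
    return profile.reshape(-1)
```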

Time Series Clustering
Analysing and extracting information from the SM time series data was complex. A set of cluster analysis techniques was used to understand the macroscopic structure of, and relations among, the analysed time series, and the performance of these techniques was compared. Four different clustering methodologies were evaluated. First, a classic k-means based on Euclidean distance was implemented [3]. Euclidean distance is the most commonly used metric for measuring similarities between profiles in clusters [30]. However, it can generate considerable errors when calculating the distance between time series [31]. To deal with this possible issue, specific time series metrics were also implemented. Dynamic time warping (DTW) finds the minimum distance between profiles by shifting them along the time axis, which makes it possible to group profiles with similar shapes regardless of their temporality. In addition, a gradient-based version of DTW was implemented. This metric, known as soft-DTW [21], is a differentiable loss function that depends on a hyperparameter that controls the smoothness of the resulting metric. Finally, the k-shape clustering methodology [32] was also implemented. K-shape is based on the computation of cluster centroids and cross-correlation measurements with the cluster time series to preserve the shapes of the time series sequences.
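To make the role of the smoothing hyperparameter concrete, the sketch below computes the soft-DTW value between two univariate series with plain NumPy, replacing the hard minimum of classic DTW by a soft minimum controlled by γ. It is a didactic illustration assuming a squared Euclidean ground cost, not the library code used in the study.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    scaled = -np.asarray(values, dtype=float) / gamma
    peak = scaled.max()
    return -gamma * (peak + np.log(np.exp(scaled - peak).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two 1-D series (gamma > 0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cost = (x[:, None] - y[None, :]) ** 2        # pairwise squared differences
    n, m = cost.shape
    r = np.full((n + 1, m + 1), np.inf)          # accumulated cost table
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]
```

As γ approaches 0 the value approaches the classic DTW cost, whereas larger values of γ give a smoother function. For clustering, the tslearn library offers `TimeSeriesKMeans` with `metric="softdtw"`; whether that exact class was used in the study is an assumption.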
This stage includes a clustering validation method, namely the Silhouette coefficient [33], which provides a representation of how well each piece of data lies within its cluster. The values of the index range from −1 to 1. Values close to 1 indicate that the instances are well classified, values close to −1 indicate a misclassification and values close to 0 indicate that the instances fall between two natural clusters.
Finally, the last step of this stage was to classify (type 1 to n) each SM according to its membership in a characteristic cluster.
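For reference, the Silhouette coefficient of an instance is s = (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes it from a precomputed distance matrix, so any of the metrics discussed above could be plugged in; the function name is hypothetical.

```python
import numpy as np


def silhouette_scores(distances, labels):
    """Silhouette coefficient of every instance from a pairwise distance matrix."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False                      # exclude the instance itself
        if not same.any():
            continue                         # singleton cluster: score 0 by convention
        a = distances[i, same].mean()
        b = min(distances[i, labels == other].mean()
                for other in clusters if other != own)
        denominator = max(a, b)
        scores[i] = 0.0 if denominator == 0 else (b - a) / denominator
    return scores
```

Averaging the returned scores gives the overall index used to rank clustering configurations.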

Spatial-Temporal Analysis
At this stage, SMs are georeferenced into a geographic information system (GIS) to perform a spatial-temporal analysis that aims to determine whether similarities in energy consumption patterns are related to their specific location or to the spatial-temporal proximity of other SMs. The hybrid analysis combining clustering and spatial-temporal analysis is one of the highlighted contributions of this study; it models our problem geographically and then allows us to explore, interpret and detect important patterns hidden in the dataset.
The temporal analysis is concerned with the variability of the SM membership to the generated clusters, and the spatial analysis defines areas that belong to the same type of behaviour using spatially restricted clustering based on the minimum spanning tree [34]. Beforehand, a proximity analysis was performed to discard isolated SMs.
Thus, this analysis evaluated the hypothesis that the energy consumption behaviour of users within the same geographic area is better represented by also taking into consideration their nearest neighbours. The definition of a geographical area in this study involves SMs located consecutively without any one belonging to another behaviour type or having a significant spatial element between them, such as a river or non-residential zone.
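The idea of spatially restricted clustering with a minimum spanning tree can be sketched as follows: build the tree over the meter coordinates, drop every edge that links meters of different behaviour types or that is longer than a chosen distance, and treat the remaining connected components as zones. This is a simplified illustration, not the GIS procedure used in the study; the function names and the `max_link` threshold are hypothetical, and barriers such as rivers or non-residential areas are not modelled.

```python
import numpy as np


def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances; returns (i, j, length) edges."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()              # cheapest known link from the tree to each node
    parent = np.zeros(n, dtype=int)    # tree node providing that cheapest link
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        closer = dist[j] < best
        parent[closer] = j
        best = np.where(closer, dist[j], best)
    return edges


def spatial_zones(points, behaviour, max_link):
    """Label each meter with a zone id after pruning the spanning tree."""
    zone = list(range(len(points)))    # union-find structure

    def find(i):
        while zone[i] != i:
            zone[i] = zone[zone[i]]
            i = zone[i]
        return i

    for i, j, length in minimum_spanning_tree(points):
        if behaviour[i] == behaviour[j] and length <= max_link:
            zone[find(i)] = find(j)    # keep the edge: merge both zones
    return [find(i) for i in range(len(points))]
```

Coordinates are assumed to be projected (in metres), so that `max_link` is expressed in the same unit.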
To validate the generated spatial clusters, we randomly selected and set aside 10% of the SMs. The validation process consisted of evaluating whether the SMs in the validation set were located in areas with the same type of behaviour as defined by spatial clustering.

Forecasting of Energy Consumption
Here, we validate the contribution of this study using the spatial clusters generated by the spatial and temporal analysis as a component of a recurrent neural network (RNN). Specifically, a long short-term memory (LSTM) [35] configuration was implemented. Although RNNs exhibit a superior ability to model sequences [36], they suffer from the so-called vanishing gradient problem during the backpropagation process explained in [37,38], so they are unable to learn long-term dependencies; that is, the relationship between entities that are separated by several steps. Hochreiter demonstrated in [37] that when neural networks have multiple steps, the error gradient decreases exponentially with each step in the backpropagation process, so the training of a basic RNN with a long-term dependency becomes very slow and does not fit properly. To solve this problem, Hochreiter and Schmidhuber designed a special type of recurrent neural network called the long short-term memory (LSTM) network [35]. LSTMs, like RNNs, have a chain-like structure, but instead of having a single activation function in their memory cell, LSTM networks have three structures called gates (forget, input and output), through which information can be removed from or added to a cell state. The cell state acts like a conveyor belt that runs directly through each memory cell, with interactions in each of them that do not affect it exponentially.
The LSTM network was applied using two different approaches to compare the resulting predictions. In the first approach, the LSTM network used only measurements from the same SM in a univariate analysis. In the second approach, a multivariate analysis was implemented in the LSTM network that included measurements from 5 SMs within the same spatial cluster, including the SM to be predicted. Considering that we had information for 4 years, the measurements for the years 2014-2015 (50% of the data) were used for training. To adjust the hyperparameters, the data from 2016 (25%) were used as a validation set, and the data for 2017 (25%) were reserved as a test set. As more LSTM hidden layers were added, the network was able to infer more complex behaviour in our time series and increase the accuracy of the prediction, so two hidden layers that assumed hourly and daily behaviour were used in our model. Finally, an output layer with 168 neurons was included. The outputs of this layer correspond to the predicted values for the 24 hours of each day of a week. In addition, a dropout layer was placed between the LSTM layers to speed up learning and avoid overfitting. This consisted of updating only a percentage of the neuron weights in each iteration, while the rest remained constant. In our case, a 20% dropout was applied to each layer. Additionally, the setup included an Adam optimiser and a root mean square error (RMSE) loss function. Neural network training was run for a maximum of 100 epochs with early stopping if the RMSE did not improve in 5 epochs (see Figure 3). The performance of the forecasting model was measured using the RMSE and the symmetric mean absolute percentage error (sMAPE) [39]. The metric sMAPE is defined as:
\[ \mathrm{sMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2} \qquad (1) \]
where $y_t$ is the actual value, $\hat{y}_t$ is the forecast value and $n$ is the number of forecast values.
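Both evaluation metrics are simple to compute. The sketch below assumes the usual percentage form of sMAPE, averaged over all forecast values, and treats a term whose actual and forecast values are both zero as contributing no error; these conventions are assumptions of the example rather than details stated in the text.

```python
import numpy as np


def rmse(actual, forecast):
    """Root mean square error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2
    ratio = np.divide(
        np.abs(actual - forecast),
        denominator,
        out=np.zeros_like(denominator),   # 0/0 terms count as zero error
        where=denominator != 0,
    )
    return float(100 * ratio.mean())
```

Because the denominator uses both the actual and the forecast value, sMAPE stays bounded (between 0% and 200%) even when the actual consumption is close to zero.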

Results
Results were analysed using Python, and the TSLearn library was used for time series cluster generation [40]. ESRI ® ArcGIS Pro, a Geographic Information System (GIS), was used for spatially restricted cluster generation.

Time Series Clustering
First, we present the results of the silhouette validation method for the combination of each of the clustering methods (k-means, k-shape, DTW and soft-DTW) with a different number of target clusters (k = 3, 4, 5) and methodology hyperparameters (soft-DTW with γ = 0.5, 1, 2). The results are in Table 1, which presents a ranked classification of the configurations. The configuration with k = 3 was the best performing among all the methodologies, achieving a better differentiation between the clusters. Since soft-DTW ranked in the first positions, it was the clustering method that provided the best results, so the results of soft-DTW with hyperparameters k = 3 and γ = 1 are used in the remaining sections. Figure 4 shows the different profiles obtained from the application of soft-DTW with k = 3 and γ = 1.
Subsequently, a temporal analysis of the extracted clusters was performed to explore the variability of cluster memberships during the 12 months of the year. A summary of the results is presented in Table 2. It can be seen that 594 SMs had 12 months of complete information and a constant cluster membership over the whole year. Similarly, there were another 82 SMs with at least 10 months of complete information and a constant cluster membership. The remaining SMs changed their cluster membership in up to 2 months.
To perform the spatial analysis, only SMs that contained at least 10 months of complete information and had a constant cluster membership were used, resulting in 676 SMs (89.66% of SMs analysed in this section); that is, the 78 SMs with variable behaviour during the analysis period were not considered in the analysis.
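The selection rule above (at least 10 months of complete information and a constant cluster membership) can be expressed compactly. In this sketch, the matrix layout with one row per SM and one column per month, and the use of −1 to mark months without complete information, are assumptions made for illustration.

```python
import numpy as np

MISSING = -1   # hypothetical marker for a month without complete information


def stable_meters(monthly_labels, min_months=10):
    """Boolean mask of meters with enough months and a single cluster label."""
    labels = np.asarray(monthly_labels)
    keep = []
    for row in labels:
        observed = row[row != MISSING]
        constant = len(set(observed.tolist())) == 1
        keep.append(len(observed) >= min_months and constant)
    return np.array(keep)


# Share of meters retained in the study: 676 of 754.
retained_share = round(100 * 676 / 754, 2)
```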

Spatial and Temporal Analysis
Following the insights provided by the previous temporal analysis, a proximity analysis was performed using a GIS over the resulting sub-sample. First, the average radius of the circle within which an SM had at least 4 neighbours was calculated. The number of 4 neighbours was chosen to cover at least the 4 sides of a two-dimensional space. The analysis yielded 980 m, which was rounded up to 1 km. Next, the spatial proximity analysis was performed, which identified 20 SMs that were located more than 1 km from any other SM or had fewer than 4 neighbours within 1 km. Table 3 presents the proximity analysis results for the clusters of this sub-sample. Results show that the majority of isolated SMs belonged to Cluster 2, mainly country houses at least 1 km away from other dwellings. The 20 SMs located in isolated houses were not considered in the subsequent spatial analysis.
Table 4 presents a series of metrics to evaluate the consistency of the elements in each of the clusters. The table presents the RMSE and sMAPE between the load profile of electrical consumers and the barycenter of each cluster. The mean RMSE of the 58 SMs in Cluster 2 is more than twice the RMSE of Cluster 1 and Cluster 3. Similarly, the sMAPE is 17.47 and 22.43% greater than the sMAPE of Cluster 1 and Cluster 3, respectively. These values indicate that the members of Cluster 2 had a high variability and represented SMs that were not identified in Cluster 1 or Cluster 3. For this reason, and since our objective was to focus on sets of SMs that were similar and closely located, the SMs in Cluster 2 were excluded from the analysis, resulting in a final dataset of 598 SMs. The final database was subsequently used to extract spatially constrained clusters and validate them with a sample. To this end, 59 SMs were randomly selected to later validate the spatial clusters.
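The proximity analysis described at the start of this subsection can be approximated with a pairwise distance matrix. The sketch below assumes projected coordinates in metres and interprets the "average radius" as the mean distance from each SM to its fourth nearest neighbour; both the function names and that interpretation are assumptions.

```python
import numpy as np


def pairwise_distances(coords_m):
    """Euclidean distance matrix for projected coordinates in metres."""
    coords = np.asarray(coords_m, dtype=float)
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)


def mean_radius_for_k_neighbours(coords_m, k=4):
    """Mean distance from each meter to its k-th nearest neighbour."""
    ordered = np.sort(pairwise_distances(coords_m), axis=1)
    return float(ordered[:, k].mean())       # column 0 is the meter itself


def isolated_meters(coords_m, radius_m=1000.0, min_neighbours=4):
    """True for meters with fewer than min_neighbours within radius_m."""
    within = pairwise_distances(coords_m) <= radius_m
    neighbours = within.sum(axis=1) - 1      # do not count the meter itself
    return neighbours < min_neighbours
```

For large samples a spatial index would be preferable to the full distance matrix, but for a few hundred SMs the direct computation is sufficient.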
The remaining 539 SMs, georeferenced in a GIS together with a layer of rivers and a layer of non-residential areas in the city, were geoprocessed to find the spatially constrained clusters using a minimum spanning tree [34]. In this way, the SMs that belonged to the same type of behaviour and were spatially contiguous were selected, i.e., they were not separated by a river, and there were no commercial SMs or SMs belonging to other clusters between them. The results of the analysis are presented in Table 5, which points to 21 different spatial clusters, i.e., 21 zones within which consumption patterns were more similar. In the case of the spatial sub-clusters belonging to Cluster 1, the results defined 14 zones with a total of 143 SMs. For the spatial sub-clusters belonging to Cluster 3, the results showed 7 zones with a total of 46 SMs. This indicated that the behaviour of Cluster 1 was more common, frequent and geographically stable than that of Cluster 3. In addition, Figure 5 presents a cluster map of the georeferenced SMs in the city of Guayaquil. The blue points represent the SMs belonging to Cluster 1 and the red points those belonging to Cluster 3. The polygons were generated to visualise the spatial cluster zones to which the SMs belong, likewise in blue for the zones comprising SMs belonging to Cluster 1 and red for those comprising SMs belonging to Cluster 3.
Table 6 shows the results of the validation process. Results were satisfactory since all the metric values were close to 90%.

Forecasting of Energy Consumption
In this subsection, the results of an RNN incorporating spatial clustering information are shown. To assess the effect of including this new information, the prediction scenario was applied to two different RNNs, one with and one without the added information of the neighbours. The training of both RNNs was performed with 100 epochs. In the first scenario, a univariate analysis was done, applying an RNN with an LSTM architecture to forecast the energy consumption at each location for the following week, using only the hourly active power measured by the SMs. In the second scenario, the same RNN architecture was applied for a multivariate analysis, adding the active power measurements of the SMs belonging to the same spatial cluster. The results of both forecasting models are shown in Table 7. It can be seen that using data from the closest neighbours improved the results by 2.46%.
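The difference between the two scenarios lies only in the shape of the network input. The sketch below builds supervised samples from hourly series: one feature column for the univariate case and five columns (the target SM plus four neighbours of the same spatial cluster) for the multivariate case, each paired with the next 168 hourly values of the target SM. The one-week lookback window and the function name are assumptions, since the input window length is not stated in the text.

```python
import numpy as np

HORIZON = 168   # one week of hourly values, matching the 168 output neurons


def make_windows(series, target_column=0, lookback=168, horizon=HORIZON):
    """Slice hourly data into (inputs, targets) pairs for sequence models.

    series: array of shape (hours,) or (hours, n_meters).
    Returns inputs of shape (samples, lookback, n_meters) and
    targets of shape (samples, horizon) for the target meter.
    """
    data = np.asarray(series, dtype=float)
    if data.ndim == 1:
        data = data[:, None]                 # univariate: single feature column
    inputs, targets = [], []
    for start in range(len(data) - lookback - horizon + 1):
        split = start + lookback
        inputs.append(data[start:split])
        targets.append(data[split:split + horizon, target_column])
    return np.array(inputs), np.array(targets)
```

Arrays of this shape match the three-dimensional input (samples, time steps, features) that recurrent layers in common deep learning libraries expect.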

Discussion
The temporal clustering of the initial set of measurements gathered by SMs in Guayaquil over 4 years, once preprocessed, resulted in 3 different clusters. Further analysis of the clusters revealed that only 2 were significant since the third contained a low percentage of meters that had a distinct behaviour. Next, for the meters classified in each of the clusters, a typical monthly consumption profile was generated. After performing an analysis of these profiles, it was observed that, on the one hand, consumer behaviour did not change significantly over time due to the slight climatic variability in the area, where the temperature is usually between 21 and 30 °C [41]. On the other hand, the months in which users had variable behaviour were those with long holiday periods, such as December (Christmas) or February (Carnival).
To illustrate the valuable information found in the spatial clusters, load profiles were graphed without normalising the measurements. Figure 6 shows the average hourly load profiles from Monday to Sunday for the types of behaviour found. Load profiles within each cluster were stratified according to monthly energy consumption to avoid smoothing them out when averaging users with higher consumption but with the same behaviour: (i) less than 130, (ii) 130-500, (iii) 500-1000 and (iv) more than 1000 kWh/month. The first two groups (i, ii) are those that received the largest economic subsidies from the government, whereas the other two had benefits; group (iv) did not receive any subsidy. Type 1 customers demanded a greater amount of energy on Mondays, while type 3 customers demanded it on Saturdays, regardless of their monthly consumption. Therefore, for type 1 clients, it would be better to plan maintenance activities on weekends, while for type 3 clients, they should be avoided on Saturdays. Furthermore, if maintenance were planned on working days, it would have a lesser impact from 5:00 p.m. for type 1 clients and after 7:00 p.m. for type 3 clients. For both types, mornings until 10:00 a.m. would be critical periods for carrying out maintenance activities due to the rapid increase in energy demand in that period. This precise, updated and geographically zoned consumption information will substantially improve maintenance planning and optimise resources.
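The stratification by monthly consumption can be reproduced with a simple binning step before averaging. In the sketch below, the band labels, the function names and the treatment of the boundary values (a consumption of exactly 130, 500 or 1000 kWh falls into the higher band) are assumptions, as the text does not specify them.

```python
import numpy as np

BAND_EDGES_KWH = [130, 500, 1000]
BAND_NAMES = ["(i) <130", "(ii) 130-500", "(iii) 500-1000", "(iv) >1000"]


def consumption_band(monthly_kwh):
    """Index (0 to 3) of the consumption band for each monthly total."""
    return np.digitize(monthly_kwh, BAND_EDGES_KWH)


def mean_profile_by_band(profiles, monthly_kwh):
    """Average the (non-normalised) load profiles separately for each band."""
    profiles = np.asarray(profiles, dtype=float)
    bands = consumption_band(monthly_kwh)
    return {
        BAND_NAMES[band]: profiles[bands == band].mean(axis=0)
        for band in np.unique(bands)
    }
```

Averaging within bands keeps a few high-consumption users from dominating the shape of the mean profile of a behaviour type.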
Furthermore, the electrical utility reported that there are users who do not respect the assigned use of electrical energy, since uses other than residential (e.g., by bars, minimarkets, cybercafes or micro-business workshops) were detected during the verification of the quality of the measurements. This fact highlights the valuable information provided by this investigation since the results can be an input for detecting such illicit usage and verifying it in the field. For example, the upper graph in Figure 6 shows a lower consumption during the weekend, which may not correspond to usual residential behaviour.
The spatial analysis aimed to define the geographical zones where all meters had the same behaviour, that is, where they were classified in the same temporal cluster; these zones were then exploited for forecasting purposes. To predict the energy consumption of one consumer, not only data gathered by its SM were used, but also data gathered by other SMs that belonged to the same spatial sub-cluster. The results showed that the accuracy of forecasting improved by 2.46% on average when information about neighbouring SMs was included.
A more precise knowledge of the energy consumption patterns of clients is valuable for both technical and commercial management. From a commercial perspective, the methodology allows an accurate estimate of the energy being supplied to different zones of the city during blackouts and makes it possible to prioritise zones in consumption awareness campaigns. From a technical point of view, it allows better planning of maintenance activities and a more accurate estimation of future demand factors, which are useful for network planning and for reducing investment in networks or power plants.
Figure 6. Average hourly load profiles from Monday to Sunday for type 1 and type 3 behaviours, divided according to the energy consumed: less than 130, 130-500, 500-1000 and more than 1000 kWh/month.

Conclusions
In this study we presented a methodology to demonstrate that energy consumption patterns in nearby areas are related and to extract models that use this information as an advantage. The use of ML tools helps define and discover new consumption behaviour profiles of residential users and determine geographic zones where behaviour is more marked and stable, thereby allowing us to improve the forecasting of energy consumption for the members of each sub-cluster.