Data Model for Residential and Commercial Buildings. Load Flexibility Assessment in Smart Cities

: Demand response (DR) programs were usually designed to provide load peak reduction and ﬂatten the load curve, but in the context of rapid adoption of emerging technologies, such as smart metering and sensors, load ﬂexibility will address current trends and challenges (such as grid modernization, demand, and renewables growth) encountered by the evolving power systems. The uncertainty of the renewable energy sources (RES) and electric vehicle (EV) ﬂeet operation has increased the importance of load ﬂexibility that can be managed to provide more support for the stable operation of power systems, including balancing. In this paper, we propose a data model to handle load ﬂexibility and take advantage of its beneﬁts. We also develop a methodology to collect and organize data, combining the consumption proﬁle with several auxiliary datasets such as climate characteristics of the location, independent system operator (ISO) to which the consumer is afﬁliated, geographical coordinates, assessed ﬂexibility coefﬁcients, tariff rates, weather forecast for day-ahead ﬂexibility forecast, DR-enabling technology costs, and DR programs. These multiple features are stored into a ﬂexibility relational database and NoSQL database for large consumption data collections. Then, we propose a data processing ﬂow to obtain valuable insights from numerous .csv ﬁles and an algorithm to assess the load ﬂexibility using large residential and commercial proﬁle datasets from the USA, estimating plausible values of the ﬂexibility provided by two categories of consumers.


Introduction and Literature Review
According to [1], it is estimated that by 2030, load flexibility will avoid new generation capacity (57%), lower energy costs by shifting the operation of controllable appliances from peak to off-peak hours (29%), allow new transmission and distribution capacity (12%), and provide frequency regulation of ancillary services regulation (2%). Various demand response (DR) programs have been studied [2,3] as well as methods to assess the load flexibility of buildings [4][5][6]. In [2], the flexibility was measured for a three-year period using 186 residential consumers' data with a large set of smart appliances from Belgium. The purpose of this study was to identify the impact and performance of demand response (DR) programs. Another residential consumption area was investigated in the Netherlands from the DR point of view. The responsiveness of residential demand to signal tariffs using home energy management systems that shift the flexible load from evening to midday hours and consume energy from local generation was proposed in [3].
A useful classification of demand response services into three main categories ("shed"reduce load, "shift"-reschedule, and "shimmy"-fast dispatch) is performed in [7], which also provides a comprehensive study on DR-enabling technology costs for control and communication. For residential consumers, flexible appliances, such as heating, ventilation, and air conditioning (HVAC); water heaters; pool pumps; and battery storage, including electric vehicles (EVs), were proposed to provide DR through direct load control (DLC), automated demand response (ADR-involving communication, measurement, control infrastructure, and a signal that triggers a specific response) programs and programmable communicating thermostats (PCTs). Most of these appliances can provide DR services except for HVAC, which was only investigated to shed and shift with PCTs.
The demand flexibility potential of seven northern European countries with high shares of renewable energy sources was reviewed and estimated in [8]. The study identified and compared the flexibility potential by analyzing the methods used for flexibility potential estimation. The flexibility was assessed to be 15-29% of the peak load. However, the study did not show the value of the flexibility potential or integrate other data sources to provide added value.
The demand potential of HVAC of 12-storey residential buildings in Nordic countries to sustain renewable energy sources (RES) integration was investigated in [9] by taking into account the indoor climate conditions and comfort of the residents. The flexibility potential of HVAC was analyzed to provide ancillary services. The results underline the necessity of aggregating the loads and offering ancillary services, and they lead to a load reduction of 1.57 MW for a Danish region.
Another interesting study investigated and calculated the DR potential in Germany to support RES integration using an electricity market model. It results showed that DR diminished the curtailments, emphasizing that the flexibility potential varied between DR programs [10]. This study strengthens the idea of implementing various DR programs or combinations of programs to obtain the best results. In addition, the flexibility load was assessed for buildings by considering the potential of heat pumps [11].
An estimation of DR potential using a fitted regression model was performed for residential and commercial buildings with EnergyPlus (a building simulation program to model energy consumption) datasets using two-state models for appliances, with thermostats obtaining an accuracy of 80-90% [12]. The main drawback of this study was that it used only a limited dataset of load. The big data feature and multiple data sources correlated with load data were not considered in the DR prediction. In addition, a quantification of load flexibility potential using cooling and heating systems of office buildings was analyzed in [13]. Several big data [14,15] and business intelligence techniques [16,17] for decision makers were engaged to find patterns in large sets of consumption data, and extensive reviews were performed into smart metering data analytics [18].
The experts in flexible demand from Nordic countries mentioned that the savings are as yet uncertain. They also underline the importance of taking further steps to properly design real-time prices, reconsider revenue regulation of the grid operators to incentivize DR implementation, and envision aggregation services [19].
In this paper, we corroborated the findings of previous studies [1,7] regarding flexibility coefficients estimated at the state level and DR-enabling technology costs by using large load profile datasets. We organized data to obtain useful insights regarding flexibility potential for certain geographical areas. We propose a data model to organize data in order to answer the following two questions: What is the potential size of the load flexibility resources? What are the savings that residential or commercial consumers can obtain? To achieve these goals, the paper is divided into five sections. In Section 1, the most relevant research studies are presented, in the Section 2, the data processing methodology is defined. The data model is built in Section 3. Simulations and results are provided in Section 4, and in Section 5 the conclusions of this study are drawn.

Data Processing Methodology
The load profile datasets were downloaded from the OpenEI website that provides a free data source to facilitate research in energy [20]. Load profile datasets for residential consumers and commercial buildings were provided by the Department of Energy in the USA, representing reference houses by location and targeting the energy efficiency for residential and commercial buildings [21].
From a large set of files, 273 load profiles for residential consumers located in various locations of the USA and storing hourly consumption data in kWh with the following attributes were extracted: They were classified into gas and electricity consumption. Some of the loads are not controllable whereas some of them can be included into DR programs. The consumption was split into electricity and gas in total and with details for heating, cooling, HVAC, interior and exterior lights, and interior and other appliances. Gas provided heating for the houses and water. Thus, the volume of data for residential consumers was over 2.07 million records. However, the volume of data could expand to hundreds of thousands if the individual consumers were considered. In addition, for commercial buildings, the record count was more than 131 million records.
The datasets were archived as zip files containing large .csv files [20]. For the residential data, 273 .csv files were inside one large archive. For the commercial data, 10 large archives each contained 94 folders (except the last archive that had 90 folders), and each folder contained 16 .csv files and 8760 records, with a total of 14,976 .csv files of commercial data. Considering the size and dimensionality of the data, Mongo DB collections with an aggregation pipeline and Python were used to import, process, and calculate the residential data. Whereas, for the commercial data, the overview looked totally different. Analyzing the folders, files, and data structure, some interesting facts were noticed:

•
Only .csv files were analyzed as input datasets, so the data presentation was standard; These facts lead to the conclusion that an import of data into another tool for data processing and calculation could be easily automated. For the automation, Python script is used a. Next step was to choose a tool for data processing and calculation. Our criteria for choosing the database were: Scalable with a big data load; Familiar and easy to use; Can run on local machines; Easy to setup; Open Source. Therefore, five databases, both SQL and NoSQL, are taken into consideration. In Table 1, the comparison is shown: Based on the table of comparison, we decided to use PostgreSQL, as we are mostly familiar with SQL technology and this kind of databases. However, for handling the two large datasets that come from residential and commercial consumers, the top open source NoSQL and relational databases at the moment were used [22]. Thus, each file from the datasets was scanned and each row from the .csv file was inserted into a row in PostgreSQL. Additionally, location, city and state were extracted from the file name and added to the consumption data. For the beginning, all the rows were inserted into just one table, after that the relations were normalized. To understand the correlation and data source, a synthesis is provided in Table 2.  When the data was inserted into the database, the insertion time was too long, only inserting 1,007,400 rows took 5270.9 seconds, having in mind that the total target of rows to be inserted is 131,189,412, the insertion time was not acceptable in this state. So, we decided to optimize the insertion by doing two things. Firstly, the insert statement from a simple insert to a prepared statement was changed, so the statement is already processed by the database; and secondly, instead of single row statement, batch statement was used. For the batch statement, some experiments were performed to see which volume per statement is more suitable and performant in our case. Table 3 shows the results of this experiment. The best insertion time for batch was 367.4466257095337 having the batch size of 400 rows. Keep in mind that this may be different depending on the resources available on the machine (CPU, RAM memory, storage type) and the insertion statement.
We are aware that the insertion procedure can be further optimized, implementing mechanisms such us "in file" import directly into database, parallelism, or resource boosting. In the end, the total insertion of 131,189,412 rows took 47,958.8661942482 s or 13 h and 18 min.
In Figure 1, the algorithm implemented in the Python script for import written in pseudocode is provided. The algorithm steps are the following: The script starts scanning the target path for unarchived data folders (prerequisite step was to download all the archived data and unarchived it to a specific path).

2.
The script finds an unarchived folder and scans it inside for a root data folder. 3.
The script finds a root data folder and starts to iterate on each file. At this point, the files are in .csv format.

4.
Each file is opened, and the columns of .csv format are validated. In case a column is missing, or an unexpected column appears the script will log an error and close the process. In this way, an inconsistency can easily be identified and investigated. 5.
If the columns are valid the script will iterate though the rows. The columns of file will be combined with the row values and each row data will be added to the "rows to be inserted" list. 6.
If the number of items in list is equals to the number of batch size (to be inserted), the list will be transformed into a big batch statement and ran against the PostgreSQL database. 7.
If the end of file is reached and the number of rows to be inserted is not equals to the number of batch size, the insert will be done, so it can be continued with another file. 8.
The next file will be processed. 9.
If the root folder has no more file, the root folder will be changed. 10. If the unarchived folder has no more root folders, the unarchived folder will be changed. 11. The program will end when there will be no more unarchived folders to be processed. After the data was fully inserted into PostgreSQL database, further investigation to reduce the data is performed as it was still big and the queries were not performant enough, even by adding indexes on some common columns and aggregation query could took over several hours.
Then, the table of consumption is combined with another tables which contains data about state independent system operator (ISO) code together with the ISO code of region, flexibility coefficients, DR enabling technology costs. After this join, the consumption for each hour, region ISO code and location type is analyzed.
The very large volume of consumption datasets especially for commercial consumers falls into big data paradigm. Therefore, the dimensionality of the datasets is reduced as in Figure 2 to process the data and obtain useful insights. First, the peak hours are identified and then built the data model considering other datasets that are correlated with the consumption data. The data model represents the interaction of reduced consumption dataset with the flexibility potential, ISO affiliation, tariff rates, flexibility services, weather forecast, and DR enabling technology costs. The data model is implemented into a database for flexibility assessment and savings computation.

Building the Data Model and Algorithm for Load Flexibility Assessment
Identification of flexibility potential for residential and commercial buildings starts with the main datasets: the load curve for the two categories of consumers. The load curve represents the mean hourly load. Its shape allows us to identify the peak (h PEAK_START , h PEAK_STOP ) and off-peak hours and perform further calculations to assess the flexibility and the savings.
The other components of the data model will be described in the following subsections.

DR Capability, Climatic Areas, and ISO
The regional differences could be significant in terms of challenges and load flexibility. Thus, the average estimation is not useful, and the consumers are grouped by ISO, states, and locations.
Five flexibility ranges are assessed in [1] and seven climatic areas in the USA are identified in [23] according to Figure 3. Additionally, the states and their affiliation to an ISO is shown in Figure 4.  The identification of the entities that form the data model contains information that is presented in Figures 3 and 4: Loads, State_flex that embeds the flexibility coefficients; Climate_areas; ISOs; and Locations.
Correlating the information from Figures 3 and 4, the average flexibility coefficients are obtained for residential (a) and commercial consumers (b) at the ISO level as in Figure 5.

DR Programs
Load flexibility potential is empowered by the emerging programs enabled primarily by smart meters, sensors technologies, smart thermostats, switching modules connecting/disconnecting and controlling the operation of the appliances and local generation and storage. Some of the programs such as DLC should be just revitalized by data analytics showing the benefits of their implementation at large scale. DR programs are briefly described in Table 4.

DR Enabling Technology Costs
The DR enabling technology costs were assessed for both residential and commercial consumers [7] as in Tables 5 and 6. Table 4. Demand response (DR) programs.

No. DR Program Description
1 ADR or DLC Control of customers' flexible appliances for DR purposes using an automated signal that triggers a specific response.
2 Smart-thermostats Temperature is remotely controlled to reduce heating/cooling appliances usage at peak.

3
Discounted rates Residential consumers reduce consumption to a specific level and get a discounted rate.

Bid load
Residential consumers bid the day-ahead curtailment program at 15 min-resolution. If their orders are executed, they must curtail and receive an additional payment. Otherwise, they will encounter a penalty that is usually mentioned in an agreement.

Signal tariffs
Static or dynamic signal rates that encourage the consumption at off-peak intervals. These tariffs can be designed as Time-of-Use (ToU) tariffs, critical peak, variable peak, real-time pricing, or dynamic tariff rates set by consumption optimization with game theory.

Consumers' awareness
Residential consumers are aware of the load reductions requirements and behave accordingly without a financial incentive. Such programs are tailored to show the advantages and benefits of a certain behavior towards a sustainable consumption.

7
Charging batteries Residential consumers are stimulated to charge batteries including EV during off-peak interval.

Thermal storage
Boilers and other similar appliances operate at off-peak hours. The boiler tanks preserve the water temperature. For commercial buildings, it can be extended to the ice processing. Thus, the water can be frozen at off-peak hours and provide cooling at peak hours.

9
Smart Adaptive Switching Module (SASM) It automates the control of various appliances, including generation and storage. It is based on a priority of the appliances and control of the load with fuzzy rules.

Dispatchable load
The program requires more engagement from consumers.

Weather Data and Flexibility Forecast
In order to perform the day-ahead flexibility forecast, the meteorological datasets could be extracted from the weather websites. The data is stored in .csv files and then imported in the database. Even if it is an important component of the load flexibility management, it is not an objective of this paper.
The steps required to extract weather data from websites, as in Figure 6, are presented in Table 7.   Steps to extract weather data from the websites written in PHP.

Data Model
The relational data model reflects the above-mentioned components that can be correlated with load data to calculate flexibility potential and savings. Moreover, the flexibility can be daily estimated using weather forecast that is based on weather readings.
The most linked entity is LOCATIONS connecting 9 entities: WEATHER_READINGS, TARIFF_RATES (through TARIFF_LOCATIONS), WEATHER_FORECAST, CLIMATE_AREAS, ISOS, STATE_FLEX, LOADS, LOAD_FORECAST. The load forecast is stored at the level of an appliance that belongs to a specific location. DR_PROGRAMS and DR_APP_COSTS store data about programs described in Table 4 and DR enabling technology costs presented in Tables 5 and 6. In WEATHER_READINGS data is collected as in Table 7. Based on the data model, a procedure is written to implement algorithm presented in Table 8 to calculate the flexibility potential and savings of the residential or commercial consumers. C EH -consumption of electricity heating; C WH -consumption of water heater; C HVAC -consumption of HVAC; S-savings from flexibility usage; t PEAK tariff rate at peak hours; t OFFPEAK tariff rate at off-peak hours.

IF h ≥ h PEAK_START AND h≤h PEAK_STOP THEN
: IF h≥h PEAK_START AND h≤h PEAK_STOP THEN :

END IF;
Compute savings Compute savings Starting from this model, useful insights are extracted using performant queries that release synthetic information related to load flexibility. The results achieved in this paper are based on the data model presented in Figure 7.

Algorithm for Load Flexibility Assessment
To assess the flexibility potential and compute the savings, we propose the algorithm described in Table 9. For calculation, two cases are simulated: CASE1 ALL SHIFT algorithm in which a small percent of the electric heating, water heater and HVAC consumptions are shifted from peak to off-peak hours; and CASE2 HVAC SHED, EH, WH SHIFT in which a small percentage of the HVAC consumption is shed, whereas also a small percentage of electric heating and water heater consumption is shifted. The percentages assessed as in [1] are converted to the flexibility coefficients. To estimate the savings, different rates for peak t PEAK and off-peak hours t OFFPEAK are considered.

Simulations and Results
Apart from consumption data, in the simulations the datasets presented in the data model above were partially considered. As it is complex, the consumption forecast, DR enabling technology costs and climate area are eluded in the simulations that could be the subject of another study. Therefore, the flexibility forecast is not in the scope of this paper.
The hourly average load is aggregated and grouped by ISO to obtain the daily load curve (as in Figure 8a for residential and Figure 8b for commercial consumers). The commercial consumers are located in the control area of only 7 ISO. In both cases, we can notice that consumers that belong to SOUTHEAST and ERCOT ISO have a higher consumption than the other consumers. However, the average profile consumption for the two types of consumers varies very much in shape and amplitude.
As the datasets contain gas and electricity consumption, the total hourly consumption is compared, noticing that for residential consumers the gas consumption is always higher than electricity, especially at the morning hours when electricity consumption is lower (as in Figure 9a). However, at evening the gas and electricity are more balanced, therefore, it is assumed that gas is replaced by electricity. In case of commercial consumers (as in Figure 9b), the electricity always exceeds the gas consumption. Further, gas consumption peak records in the morning, whereas electricity consumption peak starts around 9 and decreases at 19. Therefore, different time intervals are taken into calculating the flexibility value.  The electricity daily load curve breakdown is presented in Figure 10a for residential and Figure 10b commercial consumers. Both consumers have three flexible appliances that are measured separately and may bring savings. The sum of controllable and non-controllable appliances consumption is shown in Figure 11a for residential and Figure 11b for commercial consumers. The controllable appliances count for 26% of the total consumption of the residential consumers. The share of the controllable appliances is much bigger for commercial consumers showing an increased flexibility potential. The breakdown of the controllable appliances is presented in Figure 12a,b. Heating has the smallest share in both cases, whereas the HVAC and Cooling have the largest shares. The structure of the controllable appliances by ISO is shown in Figure 13. Most of the flexible consumers belong to SOUTHEAST, MISO, and ERCOT ISO for residential, respectively, MISO, SOUTHEST, and PJM for commercial consumers. Based on the proposed methodology and algorithm presented in Figure 2 and Table 5, the flexible capacity and savings are calculated. The results are shown in Table 8.
The annually savings are quite modest even in the second case considering the total number of residential consumers (273), the average value being almost 16 Euro. Therefore, the DR enabling technology costs were not taken into account from evident reasons. They could be supported by the aggregator (retailer) or grid operator. To be more efficient, the reward that result from the tariff rates difference (t PEAK − t OFFPEAK ) should be accompanied by additional incentives. As peak rate and off-peak rates for residential consumers, t PEAK = 0.38 Euro and t OFFPEAK = 0.09 Euro, respectively t PEAK = 0.27 Euro and t OFFPEAK = 0.09 Euro are considered in calculation.
The flexibility coefficients are taken from a previous study and used as input data. They vary from less than 0.1 to 0.15 of the peak load. In case the flexibility coefficients are increased up to 0.15 of the peak, the results of this alternative scenario are shown in Figure 14. For commercial consumers, the shifted power totalized 9.3 million MWh with savings of 88.1 million Euro. In this case, the savings are significant considering the total number of commercial consumers (14,976), the average value being almost 5900 Euro per consumer. However, the DR enabling costs are much higher for commercial consumers and their supportability can be shared or they can be on aggregator.

Conclusions
In this paper, we started from the findings of previous studies, regarding flexibility coefficients estimated at the state level and DR enabling technology costs and combined them with large load profile datasets to obtain useful insights regarding flexibility potential for certain geographical areas.
Usually, large datasets challenge researchers when it comes to achieving valuable insights from numerous data. Thus, 273 load profiles for residential consumers located in various locations of the U.S. storing hourly consumption data with the numerous attributes were extracted. The volume of data for residential consumers was over 2.07 million records. However, the volume of data could expand to hundreds of thousands if the individual consumers were considered. In addition, for commercial buildings, the records counted for more than 131 million records. To handle such volume of data, the files were automatically imported with a Python script, reduced and for data processing, storage, and calculation, two top databases (both SQL and NoSQL) are considered.
Hence, in this paper, we proposed a data model that handles large volumes of data and correlates load data with other datasets to assess or calculate the savings from flexibilities. The algorithm can be written as a stored procedure and applied separately for residential and commercial consumers considering different peak intervals, DR programs, costs, and tariff rates.
By implementing the proposed data flow, model, and algorithm, we obtained total savings of over 7500 Euro per year for residential consumers and 88.1 million Euro for commercial consumers. the annual savings are quite modest even in the second case considering the total number of residential consumers (273), the average value being almost 16 Euro. For commercial consumers, the shifted power totalized 9.3 million MWh with savings of 88.1 million Euro. In this case, the savings are significant considering the total number of commercial consumers (14,976), the average value being almost 5900 Euro per consumer. However, we noticed a significant difference in the population sizes and consumption levels of the two categories. As individual savings could be considered as reasonable for commercial, savings for residential consumers are very modest and may not incentive enough to implement DR programs.
As a further study, we will focus on emergent data management solutions in relation to the local flexibility markets and their implementation by means of direct load control and machine learning algorithms that are able to forecast the generation of local energy sources (Photovoltaic (PV) and wind generating systems) and balance the consumption requirements, storage and provide the surplus to the local markets. Furthermore, the model implementation can be further enhanced with consumption forecast, DR enabling technology costs and climate area characteristics.
Author Contributions: The authors contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI-UEFISCDI, project number 462PED/28.10.2020, project code PN-III-P2-2.1-PED-2019-1198, within PNCDI III. This paper also an extension and presents the scientific results of the project "Intelligent system for trading on wholesale electricity market" (SMARTRADE), co-financed by the European Regional Development Fund (ERDF), through the Competitiveness Operational Programme (COP) 2014-2020, priority axis 1-Research, technological development and innovation (RD&I) to support economic competitiveness and business development, Action 1.