RAE: The Rainforest Automation Energy Dataset for Smart Grid Meter Data Analysis

Datasets are important for researchers to build models and test how well their machine learning algorithms perform. This paper presents the Rainforest Automation Energy (RAE) dataset to help smart grid researchers test algorithms that make use of smart meter data. RAE contains 72 days of 1Hz data from a residential house's mains and 24 sub-meters, resulting in 6.2 million samples for each sub-meter. In addition to power data, environmental and sensor data from the house's thermostat are included. Sub-meter data includes heat pump and rental suite captures, which are of interest to power utilities. We also show (by example) how RAE can be used to test non-intrusive load monitoring (NILM) algorithms.


INTRODUCTION
Datasets are becoming increasingly relevant for measuring the accuracy of smart grid algorithms and for seeing how well they might perform in a real-world situation. Testing accuracy with real-world datasets is crucial in this field of research. Synthesized data does not realistically represent an actual dataset, as "a real-world dataset would normally have certain complexity that is harder to predict and in many cases can be very difficult to deal with" [1, p. 114]. For smart grid research it is valuable to have public datasets that show how smart meters report aggregate power readings, together with the accompanying sub-meter data for the different loads that make up that aggregate reading. This is especially true when testing non-intrusive load monitoring (NILM) algorithms [2,3]. NILM (sometimes referred to as load disaggregation) is a computational approach to determining which appliances are running in a given house (or building) just by examining the aggregate power signal from a smart meter.
We introduce the Rainforest Automation Energy (RAE) dataset, a free and publicly available dataset that can be downloaded from Harvard Dataverse via doi: 10.7910/DVN/ZJW4LC. This dataset contains 59 files and is roughly 7.8 GB in size. There are 24 sub-meters (one for each breaker on the house's main power panel) sampled at 1Hz, which capture eleven electrical data-points (voltage, current, frequency, power factor, real power/energy, reactive power/energy, and apparent power/energy). There are 72 days of captures, which resulted in 6.2 million one-second samples for each sub-meter. Mains are sampled at a typical "smart meter communication to in-home display" rate (every 15s), giving roughly 414 thousand samples over the 72 days of capture. We also include environmental and sensor data from the house's thermostat. Figure 1 depicts an arbitrary Sunday (a 24-hour period) to give the reader a visual idea of what the load consumption pattern can look like.
This work was funded in part by NSERC Engage Grant EGP-501582-16.
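
The sample counts quoted above follow directly from the capture length and the two sampling periods. The short check below is only an illustrative sketch: the helper name `expected_samples` is hypothetical, and it assumes gap-free capture at exactly one reading per period, which the real data only approximates.

```python
# Sanity check of the quoted sample counts (pure arithmetic, no dataset files needed).
SECONDS_PER_DAY = 24 * 60 * 60


def expected_samples(days: int, period_s: int) -> int:
    """Ideal number of samples in `days` days at one sample every `period_s` seconds."""
    return days * SECONDS_PER_DAY // period_s


sub_meter_samples = expected_samples(72, 1)   # 1 Hz sub-meters
mains_samples = expected_samples(72, 15)      # mains at roughly 15 s intervals

print(sub_meter_samples)  # 6220800 (about 6.2 million)
print(mains_samples)      # 414720 (about 414 thousand)
```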
In addition to smart grid and NILM research, this dataset can be used in research on: statistical signal processing and blind source separation, energy use behaviour, eco-feedback and eco-visualizations, application and verification of theoretical algorithms/models, appliance studies, demand forecasting, smart home frameworks, grid distribution analysis, time-series data analysis, energy efficiency studies, occupancy detection, energy policy and socio-economic frameworks, and advanced metering infrastructure (AMI) analytics.

RELATION TO PRIOR WORK
Previously, we created a widely used dataset, the Almanac of Minutely Power dataset (AMPds1 [4] and AMPds2 [5]), which contains data sampled at 1-minute intervals. This new dataset has all power panel circuits sampled at 1Hz. At the time of writing, besides AMPds and this dataset, there exists no other open public Canadian dataset.
One of the first and best-known datasets is the Reference Energy Disaggregation Data Set (REDD) [6], released in 2011 (USA homes). Its low-frequency sampling version has the mains sampled at a higher frequency (1Hz, or once per second) than the sub-metered loads (once every 3 seconds). It is worth noting that a more recent dataset, the UK Domestic Appliance-Level Electricity (UK-DALE) dataset [7], has used this methodology as well. The RAE dataset takes a different approach. We feel it is best to sample the sub-metered loads at a higher sampling frequency in order to capture interesting features of each appliance's power signature. Further, we wanted the mains data sampled at a frequency that would be common to most smart meter in-home displays (e.g., Rainforest Automation's EMU2).
The aforementioned datasets (in the area of NILM) are considered low-frequency sampling (≤1Hz) datasets. High-frequency sampling datasets also exist, and REDD has a high-frequency version of its data. Two further examples are: the Building-Level fUlly labeled Electricity Disaggregation dataset (BLUED) [8], sampled at 12kHz (USA data); and the Controlled On/Off Loads Library dataset (COOLL) [9], sampled at 100kHz (France data). While these datasets provide valuable data for high-resolution applications, we feel that it is a more realistic scenario to use low-frequency sampled data for most smart grid and NILM systems, especially where the processor is constrained in storage and speed.

DATA ANALYSIS
For the initial release of the RAE dataset we have a single house which is called House 1. We perform an analysis of the data that we collected and describe House 1 in detail below.

Capture Methodology
When designing the data capture system for RAE we prioritized accuracy and reliability. Hence, we chose commercial-grade metering equipment. We chose the Rainforest Automation EMU2 in-home display to capture smart meter data. See Table 1 (column IDs 4 and 5) for the data we capture from the EMU2. The EMU2 reads data from a ZigBee-enabled smart meter at roughly 15s intervals.
To capture sub-meter data we chose a Class 1 branch circuit power meter from DENT, the PowerScout 24. We had prior experience with the DENT PowerScout 18 meter. See Table 1 (column IDs 1, 2, and 3) and Table 2 for the data we captured from the PowerScout 24. The PowerScout 24 can monitor up to 24 circuits at a rate of 1Hz.
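
Because the mains arrive roughly every 15s while the sub-meters arrive every second, comparing the two streams requires bringing them to a common rate. The sketch below shows one simple way a reader might do this, by averaging 1Hz readings into fixed 15s windows. It is not the released conversion code: the function name `downsample_mean`, the `(timestamp, watts)` tuple layout, and the synthetic readings are all assumptions made for illustration, and fixed windows only approximate the EMU2's "roughly 15s" reporting interval.

```python
from statistics import mean


def downsample_mean(samples, period_s=15):
    """Average (timestamp, watts) pairs into fixed windows of `period_s` seconds.

    Returns a list of (window_start, mean_watts) sorted by window start.
    """
    windows = {}
    for ts, watts in samples:
        start = ts - ts % period_s          # align each sample to its window start
        windows.setdefault(start, []).append(watts)
    return [(start, mean(vals)) for start, vals in sorted(windows.items())]


# Synthetic 1 Hz readings (hypothetical values): 100 W for 15 s, then 400 W for 15 s.
one_hz = [(t, 100.0 if t < 15 else 400.0) for t in range(30)]

print(downsample_mean(one_hz))  # [(0, 100.0), (15, 400.0)]
```

Averaging hides short transients, which is one reason the sub-meters are kept at 1Hz in the dataset itself; any downsampling is best done only at comparison time.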
Thermostat data was collected from the EcoBee3 thermostat at 5-minute intervals (a product limitation). Data includes set-points, operation mode (heat/cool and stage), outdoor temperature and wind speed, and inside humidity. Inside temperature and motion are reported from the thermostat and three remote sensors (living room, basement rec room, and master bedroom).
The hardware setup used to capture data for RAE is depicted in Figure 2 and we have released (as open source) the code used to capture, store, and convert the raw data. This setup is minimal and will allow us to easily install this equipment in a different house to capture data and add it to RAE.

House 1 Overview
House 1 has a 200A split 1-phase power service consisting of two 120V lines (L1 and L2), which is typical for a Canadian house. The house is located in the city of Burnaby, BC, Canada and is 80m above sea level. The house was built in 1955 and had major renovations in 2005 and 2006. This house is a single detached home with a main floor and a basement, having roughly 99.5 m² of livable space. The HVAC system consists of a high-efficiency natural gas furnace and a 2-stage heat pump. More details about the house can be found in one of our previous publications [5]. Table 3 lists which appliances/loads were monitored by each sub-meter.

Missing Readings
Missing data is represented by a timestamp followed by one or more null data-points. For comma-separated value (CSV) files this means no data between commas. For example: "1457282030,,,,4.582,38193.4" would mean that in a mains file the voltage and frequency readings are missing. Table 4 summarizes how many sampling times had missing data due to the occasional USB/serial timeout or because of the change to daylight saving time. At this point we are not sure why these timeouts occurred. One possible reason is that the USB/serial capture programs ran in Linux user mode rather than as daemons. House 1 is located in the Pacific Time Zone, and on Sunday, March 13, at 3:00 AM time zones in Canada switched to daylight saving time.
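
A reader loading these files needs to turn the empty CSV fields into an explicit "missing" marker before doing any arithmetic. The sketch below parses the example row given above using only the Python standard library. The helper name `parse_row` is hypothetical, and the columns are deliberately left unnamed because the column layout is defined in Table 1 rather than assumed here.

```python
import csv
import io


def parse_row(fields):
    """Convert one CSV row into (timestamp, readings); empty fields become None."""
    timestamp = int(fields[0])
    readings = [float(f) if f != "" else None for f in fields[1:]]
    return timestamp, readings


# The example row from the text: a timestamp followed by three empty fields.
line = "1457282030,,,,4.582,38193.4"
row = next(csv.reader(io.StringIO(line)))
timestamp, readings = parse_row(row)

print(timestamp)  # 1457282030
print(readings)   # [None, None, None, 4.582, 38193.4]
```

Using `None` rather than zero keeps a missing reading from being mistaken for a genuine zero-power sample.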

Energy Consumption Analysis
The three highest consumers of energy over the capture period are HVAC & Heat Pump (570kWh), Plugs & Lights (531kWh), and Rental Suite (430kWh), as shown in Figure 3. Over the 72-day capture period the smart meter reported a total energy consumption of 1,982kWh. A total of 1,971kWh is reported when the real energy accumulators of the 24 sub-meters are summed. The 11kWh discrepancy is due to rounding errors in each sub-meter accumulator, as each sub-meter reports only whole-watt measurements.
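
To put these totals in proportion, the arithmetic can be worked through as below. This is only a sketch using the figures quoted in the text; the variable names are illustrative and nothing here reads the dataset files.

```python
# Totals quoted in the text, in kWh.
smart_meter_kwh = 1982      # reported by the smart meter over the 72 days
sub_meter_sum_kwh = 1971    # sum of the 24 sub-meter real energy accumulators
top_three_kwh = {"HVAC & Heat Pump": 570, "Plugs & Lights": 531, "Rental Suite": 430}

# Gap between the two totals, absolute and relative to the smart meter total.
discrepancy_kwh = smart_meter_kwh - sub_meter_sum_kwh
discrepancy_pct = 100 * discrepancy_kwh / smart_meter_kwh

# Share of the sub-metered energy used by the three largest consumers.
top_three_share_pct = 100 * sum(top_three_kwh.values()) / sub_meter_sum_kwh

print(discrepancy_kwh)                # 11
print(round(discrepancy_pct, 2))      # 0.55
print(round(top_three_share_pct, 1))  # 77.7
```

The gap is therefore a little over half a percent of the metered total, and the three largest loads account for roughly three quarters of the sub-metered energy.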

Occupancy Notes
House 1 has four occupants. Three of the occupants live in the main house and a single occupant resides in the basement rental suite. Of the main house occupants, two are adults in their early 40s who work from home. The third occupant is an elementary school child. The rental suite occupant is in his 20s and works away from home during the day. During the capture period for House 1 there were some occupant anomalies. During school Spring Break (March 11-28) the child was at home. From (afternoon) March 11 to (early morning) March 21 two additional occupants stayed in the main house's basement guest room, which is also known as the rec room. Additionally, one of the regular main house occupants went on an out-of-town trip (March 17-20).

Fig. 3. Percent of energy consumed (in kWh) over the 72-day period for a total of 1,971kWh. [Figure not reproduced; labels recoverable from it include HVAC & Heat Pump 570kWh, Clothes Dryer 87kWh, and Garage 1kWh.]

File Organization
The file all houses.txt contains summary information on all the houses in the dataset. Each house has a labels file that describes the loads each sub-meter monitored (for House 1, this contains the sub-meter data in Table 3) and a panel file that depicts the house's power breaker panel that was sub-metered. Each house can have one or more contiguous sampling blocks (blk). For House 1, see Table 5 for details on the two sampled blocks that were recorded. Each block has the following files associated with it: mains file(s), which hold smart meter measurements sampled at roughly 15s intervals (see Table 1); power and smeter file(s), which contain all real power measurements from the mains and sub-meters (good for testing NILM); a number of sub# files, one for each sub-meter, sampled at 1Hz (see Table 2); and tstat file(s), which contain data from the house's thermostat.

Next, we tested the accuracy of our HMM by having it disaggregate the data from the second block file (63 days). Testing took 46 minutes to disaggregate 5.4 million samples, with an average disaggregation time of 330µs per sample. We report overall accuracy results in Table 6. Figure 4 shows the accuracy results for each appliance/load that was disaggregated.

Fig. 4. Appliance/load specific accuracy results (in percent of total disaggregated, not of total house).

CONCLUSIONS
In this paper we have presented a dataset that smart grid researchers can use when creating models and algorithms that use smart meter and appliance/load data. Sub-meter data includes heat pump and rental suite captures, which are of interest to power utilities. Although the initial release of RAE contains only one house, we are actively assessing other houses that can be monitored and added to this dataset. The monitoring system that we present is an accurate and reliable data capture system that can be easily installed in a house to collect data in the same format and at the same frequency. Researchers interested in installing this system and adding data to RAE can contact the lead author.