# Big Data Analytics for Discovering Electricity Consumption Patterns in Smart Cities

## Abstract

## 1. Introduction

## 2. Related Work

## 3. Methodology

#### 3.1. First Phase: Data Preprocessing

#### 3.2. Second Phase: Obtaining the Optimal Number of Clusters

**BD-Silhouette:** This index [11] is defined as the difference between the inter-cluster and intra-cluster distances, divided by the maximum of the two. The inter-cluster distance is the average of the distances between each cluster centroid and the global centroid ${C}_{0}$.
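As an illustration only, the description above can be sketched in plain Python. The text does not define the intra-cluster distance or how ${C}_{0}$ is obtained, so two assumptions are made here and flagged in the comments: the intra-cluster distance is taken as the average, over clusters, of the mean point-to-centroid distance, and ${C}_{0}$ is taken as the centroid of all points. The function names are hypothetical and this is not the implementation from [11].

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return tuple(mean(axis) for axis in zip(*points))


def bd_silhouette(clusters):
    """BD-Silhouette sketch for a list of clusters (each a list of tuples)."""
    centroids = [centroid(points) for points in clusters]
    # ASSUMPTION: the global centroid C_0 is the centroid of all points.
    global_centroid = centroid([p for points in clusters for p in points])
    # Inter-cluster distance, as described in the text.
    inter = mean(dist(c, global_centroid) for c in centroids)
    # ASSUMPTION: intra-cluster distance = average over clusters of the
    # mean point-to-centroid distance (not defined in the text).
    intra = mean(
        mean(dist(p, c) for p in points)
        for points, c in zip(clusters, centroids)
    )
    # Undefined (division by zero) if every point coincides with C_0.
    return (inter - intra) / max(inter, intra)


clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(bd_silhouette(clusters))
```

For this toy input the inter-cluster distance is 5 and the assumed intra-cluster distance is 1, so the index is (5 - 1)/5 = 0.8. Values close to 1 would indicate compact, well-separated clusters under these assumptions.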

**BD-Dunn:** This index [11] relates the maximum distance between the points of a cluster and their corresponding centroid to the minimum distance between the cluster centroids and the global centroid.
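A minimal sketch of this index follows. The text names the two quantities but not how they are combined, so the ratio used here (minimum centroid-to-global-centroid distance divided by maximum point-to-centroid distance, by analogy with the classical Dunn index) is an assumption, as is taking the global centroid to be the centroid of all points. The function names are hypothetical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return tuple(mean(axis) for axis in zip(*points))


def bd_dunn(clusters):
    """BD-Dunn sketch for a list of clusters (each a list of tuples)."""
    centroids = [centroid(points) for points in clusters]
    # ASSUMPTION: the global centroid is the centroid of all points.
    global_centroid = centroid([p for points in clusters for p in points])
    # Minimum distance between a cluster centroid and the global centroid.
    min_inter = min(dist(c, global_centroid) for c in centroids)
    # Maximum distance between any point and the centroid of its cluster.
    max_intra = max(
        dist(p, c)
        for points, c in zip(clusters, centroids)
        for p in points
    )
    # ASSUMPTION: Dunn-style ratio, separation over spread.
    return min_inter / max_intra


clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(bd_dunn(clusters))
```

With this toy input the minimum centroid distance is 5 and the maximum point-to-centroid distance is 1, giving a ratio of 5. Under the assumed ratio, larger values correspond to better-separated clusters.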

**Davies-Bouldin:** This index [31] assesses how well separated the clusters are relative to how compact they are, so lower values indicate a better partition. Therefore, the first minimum of the Davies-Bouldin value chart is chosen to create a better model.
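For reference, the usual textbook formulation of this index is $DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j\ne i}\frac{S_i + S_j}{d(c_i, c_j)}$, where $S_i$ is the mean distance of the points of cluster $i$ to its centroid $c_i$ and $d$ is the Euclidean distance. The notation and the sketch below are assumptions based on that standard form, not necessarily the exact variant used in this work; the function names are hypothetical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return tuple(mean(axis) for axis in zip(*points))


def davies_bouldin(clusters):
    """Standard Davies-Bouldin index for a list of at least two clusters."""
    centroids = [centroid(points) for points in clusters]
    # S_i: mean distance from the points of cluster i to its centroid.
    scatter = [
        mean(dist(p, c) for p in points)
        for points, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        # For each cluster, keep the most similar (worst) other cluster.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        ))
    return mean(worst_ratios)


clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(davies_bouldin(clusters))
```

Here both clusters have scatter 1 and their centroids are 10 apart, so the index is (1 + 1)/10 = 0.2. The sketch assumes distinct centroids; coincident centroids would cause a division by zero.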

**Within Set Sum of Square Errors (WSSSE):** This index [32] is implemented in MLlib. It is a measure of cluster cohesiveness, calculated as the sum of the squared distances from each point to the centroid of its cluster.
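The quantity itself is simple to state outside Spark. The following plain-Python sketch computes the sum of squared Euclidean distances from each point to the centroid of its cluster; it is an illustration of the measure, not MLlib's distributed implementation, and the function names are hypothetical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return tuple(mean(axis) for axis in zip(*points))


def wssse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(p, c) ** 2 for p in points)
    return total


clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(wssse(clusters))
```

Each of the four points lies at distance 1 from its centroid, so the result is 4.0. Because WSSSE can only decrease as k grows, it is normally read for a bend in the curve rather than for a minimum.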

#### Majority Voting Methodology

#### 3.3. Third Phase: MLlib

#### 3.4. Fourth Phase: Evaluation

- Five instances of *m3.xlarge* with Intel Xeon E5-2670 v2 (Ivy Bridge) processors with 8 CPUs, 15 GB RAM, and 2 SSDs of 40 GB each.
- Five instances of *m3.2xlarge* with Intel Xeon E5-2670 v2 (Ivy Bridge) processors with 16 CPUs, 30 GB RAM, and 2 SSDs of 80 GB each.

## 4. Results

#### 4.1. Description of the Dataset

- Building 1—Backup data processing centre (DPC).
- Building 11—Office for professors and classrooms on the ground floor.
- Building 12—Administration services.
- Building 20—Research centre of developmental biology.
- Building 21—Experimental research services.
- Building 42—Old kindergarten (closed since 2010).
- Building 44—Administration services.
- Cafeteria—Cafeteria.

#### 4.2. Cluster Validity Indices Analysis

#### 4.3. Clustering Results

#### 4.3.1. Analysis of Results: Four Clusters

- Clusters 2 and 3 have the highest consumption but few instances (7% and 4%, respectively).
- Clusters 1 and 4 have the lowest consumption and the largest percentage of instances (72% and 18%, respectively).

- Cluster 1 has low consumption and a significant number of instances corresponding to non-working days.
- Cluster 4 has low consumption and consists of buildings 11 (offices), 20, and 21 (research centres), with instances falling mostly on non-working days.
- Clusters 2 and 3 have high consumption and both contain building 20, but they are opposites in terms of seasons and days of the week. On the one hand, cluster 2 may be considered a non-summer cluster with a larger number of instances corresponding to non-working days. Although cluster 2 has a large number of non-working days, its electricity consumption is high because building 20 is dedicated to experimental research. On the other hand, cluster 3 is considered a non-winter cluster, defined mainly by weekdays.

#### 4.3.2. Analysis of Results: Eight Clusters

- Cluster 1 contains the instances with the lowest consumption, which remains constant throughout the day. It is composed of all the buildings except buildings 20 and 21 (research centres). The instances are mostly non-working days and are distributed uniformly over all seasons of the year.
- Clusters 2, 3, 4, and 6 are composed of building 20. These clusters contain the highest consumption during daylight hours. Clusters 2 and 6 include instances from all the days of the week, while clusters 3 and 4 only have instances from working days and non-working days, respectively. Most of the instances of cluster 2 are non-summer instances, and cluster 3 is just the opposite because it mainly includes summer instances.
- Cluster 5 is composed of building 21. It is characterized by low consumption that is higher during daylight hours. In addition, it contains instances from all the days of the week, with slightly more from non-working days.
- Cluster 7 consists of all the buildings except 20, 21, and 42. It represents low consumption that is higher during daylight hours and on working days.
- Cluster 8 is formed by buildings 11 (offices) and 21. It represents low consumption that is higher during daylight hours and on non-working days.

## 5. Execution Times

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
MLlib | Machine Learning Library |
CVI | Cluster Validity Index |
BD-CVI | Big Data Cluster Validity Index |
RDD | Resilient Distributed Dataset |
DPC | Data Processing Centre |
AWS | Amazon Web Services |

## References

1. Nuaimi, E.A.; Neyadi, H.A.; Mohamed, N.; Al-Jaroodi, J. Applications of big data to smart cities. J. Internet Serv. Appl. **2015**, 6, 1–15.
2. Gungor, V.C.; Sahin, D.; Kocak, T.; Ergut, S.; Buccella, C.; Cecati, C.; Hancke, G.P. Smart Grid Technologies: Communication Technologies and Standards. IEEE Trans. Ind. Inf. **2011**, 7, 529–539.
3. Fernández, A.; del Río, S.; López, V.; Bawakid, A.; del Jesús, M.J.; Benítez, J.M.; Herrera, F. Big Data with Cloud Computing: An insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. **2014**, 4, 380–409.
4. Orgaz, G.B.; Jung, J.J.; Camacho, D. Social big data: Recent achievements and new challenges. Inf. Fusion **2016**, 28, 45–59.
5. Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM **2008**, 51, 107–113.
6. Zaharia, M.; Chowdhury, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing; HotCloud’10; USENIX Association: Berkeley, CA, USA, 2010; p. 10.
7. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation; NSDI’12; USENIX Association: Berkeley, CA, USA, 2012; p. 2.
8. Meng, X.; Bradley, J.; Yavuz, B.; Sparks, E.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.; Amde, M.; Owen, S.; et al. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. **2016**, 17, 1–7.
9. Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable K-means++. Proc. VLDB Endow. **2012**, 5, 622–633.
10. Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I.N. An Extensive Comparative Study of Cluster Validity Indices. Pattern Recogn. **2013**, 46, 243–256.
11. Luna-Romera, J.M.; García-Gutiérrez, J.; Martínez-Ballesteros, M.; Santos, J.C.R. An approach to validity indices for clustering techniques in Big Data. Prog. Artif. Intell. **2017**, 7, 1–14.
12. Martínez-Álvarez, F.; Troncoso, A.; Riquelme, J.C.; Ruiz, J.S.A. Energy Time Series Forecasting Based on Pattern Sequence Similarity. IEEE Trans. Knowl. Data Eng. **2011**, 23, 1230–1243.
13. Tuballa, M.L.; Abundo, M.L. A review of the development of Smart Grid technologies. Renew. Sustain. Energy Rev. **2016**, 59, 710–725.
14. Calvillo, C.; Sánchez-Miralles, A.; Villar, J. Energy management and planning in smart cities. Renew. Sustain. Energy Rev. **2016**, 55, 273–287.
15. Sun, Y.; Song, H.; Jara, A.J.; Bie, R. Internet of Things and Big Data Analytics for Smart and Connected Communities. IEEE Access **2016**, 4, 766–773.
16. Xu, J.; Zhang, R. CoMP Meets Smart Grid: A New Communication and Energy Cooperation Paradigm. IEEE Trans. Vehicular Technol. **2015**, 64, 2476–2488.
17. Wijk, J.J.V.; Selow, E.R.V. Cluster and calendar based visualization of time series data. In Proceedings of the IEEE Symposium on Information Visualization, San Francisco, CA, USA, 24–29 October 1999; pp. 4–9.
18. Martínez-Álvarez, F.; Troncoso, A.; Riquelme, J.C.; Riquelme, J.M. Partitioning-Clustering Techniques Applied to the Electricity Price Time Series. In Proceedings of the Intelligent Data Engineering and Automated Learning, Birmingham, UK, 16–19 December 2007; pp. 990–999.
19. Martínez-Álvarez, F.; Troncoso, A.; Riquelme, J.C.; Riquelme, J.M. Discovering patterns in electricity price using clustering techniques. In Proceedings of the International Conference on Renewable Energy and Power Quality, Sevilla, Spain, 28–30 March 2007; pp. 245–252.
20. Keyno, H.S.; Ghaderi, F.; Azade, A.; Razmi, J. Forecasting electricity consumption by clustering data in order to decline the periodic variable’s affects and simplification the pattern. Energy Convers. Manag. **2009**, 50, 829–836.
21. Hernández, L.; Baladrón, C.; Aguiar, J.M.; Carro, B.; Sánchez-Esguevillas, A. Classification and Clustering of Electricity Demand Patterns in Industrial Parks. Energies **2012**, 5, 5215–5228.
22. Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.Y.; Foufou, S.; Bouras, A. A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Trans. Emerg. Top. Comput. **2014**, 2, 267–279.
23. Ding, R.; Wang, Q.; Wang, Q.; Dang, Y.; Fu, Q.; Zhang, H.; Zhang, D.; Ding, J. YADING: Fast Clustering of Large-Scale Time Series Data. Proc. Very Large Data Bases **2015**, 8, 473–484.
24. Rakthanmanon, T.; Campana, B.; Mueen, A.; Batista, G.; Westover, B.; Zhu, Q.; Zakaria, J.; Keogh, E. Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping. ACM Trans. Knowl. Discov. Data **2013**, 7, 1–31.
25. Zhao, W.; Ma, H.; He, Q. Parallel K-Means Clustering Based on MapReduce. In Cloud Computing; Jaatun, M.G., Zhao, G., Rong, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 674–679.
26. Capó, M.; Pérez, A.; Lozano, J.A. An Efficient Approximation to the K-means Clustering for Massive Data. Knowl.-Based Syst. **2017**, 117, 56–69.
27. Melzi, F.N.; Same, A.; Zayani, M.H.; Oukhellou, L. A Dedicated Mixture Model for Clustering Smart Meter Data: Identification and Analysis of Electricity Consumption Behaviors. Energies **2017**, 10, 1–21.
28. Deb, C.; Zhang, F.; Yang, J.; Lee, S.E.; Shah, K.W. A review on time series forecasting techniques for building energy consumption. Renew. Sustain. Energy Rev. **2017**, 74, 902–924.
29. Singh, S.; Yassine, A. Big Data Mining of Energy Time Series for Behavioral Analytics and Energy Consumption Forecasting. Energies **2018**, 11, 452.
30. Li, C.; Ding, Z.; Zhao, D.; Yi, J.; Zhang, G. Building Energy Consumption Prediction: An Extreme Deep Learning Approach. Energies **2017**, 10, 1525.
31. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. **1979**, 1, 224–227.
32. Apache Spark. Clustering - RDD-Based API - Spark 2.2.0 Documentation. 2017. Available online: https://spark.apache.org/docs/2.2.0/mllib-clustering.html#k-means (accessed on 20 December 2017).
33. Ketchen, D.J.; Shook, C.L. The Application Of Cluster Analysis In Strategic Management Research: An Analysis And Critique. Strateg. Manag. J. **1996**, 17, 441–458.
34. Koprinska, I.; Rana, M.; Troncoso, A.; Martínez-Álvarez, F. Combining pattern sequence similarity with neural networks for forecasting electricity demand time series. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–8.

**Figure 1.** Proposed methodology. RDD: Resilient Distributed Dataset; MLlib: Machine Learning Library; WSSSE: Within Set Sum of Square Errors.

**Figure 3.** BD-Silhouette, BD-Dunn, Davies-Bouldin, and WSSSE clustering validity indices for k values from 2 to 15.

Values | BD-Silhouette | BD-Dunn | Davies-Bouldin | WSSSE |
---|---|---|---|---|
First | 4 | 6 | 6 | 7 |
Second | 6 | 8 | 9 | 15 |
Third | 9 | 13 | 15 | 21 |

ID | Cluster | Building | Season | Day |
---|---|---|---|---|
1 | 1 | Build_1 | Summer | Day off |
2 | 1 | Build_1 | Winter | Day off |
3 | 2 | Build_20 | Summer | Thursday |
4 | 1 | Build_42 | Summer | Friday |
5 | 3 | Build_1 | Autumn | Monday |

Values | BD-Silhouette | BD-Dunn | Davies-Bouldin | WSSSE |
---|---|---|---|---|
First | 4 | 4 | 10 | 4 |
Second | 8 | 8 | 12 | 8 |
Third | - | - | 14 | - |
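Read row by row, the table above gives the candidate numbers of clusters proposed by each index, and the values analysed in Sections 4.3.1 and 4.3.2 (four and eight) are the ones most indices agree on. A minimal sketch of such a vote is shown below. The exact voting rule is not stated in this text, so the plurality rule and the function name `majority_vote` are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(candidates):
    """Return (k, votes) for the most frequently proposed k.

    `candidates` holds one proposed k per validity index; None marks an
    index that proposed nothing for this rank. ASSUMPTION: plurality
    wins, and ties are resolved in favour of the value seen first.
    """
    votes = Counter(k for k in candidates if k is not None)
    return votes.most_common(1)[0]


# Rows of the table: BD-Silhouette, BD-Dunn, Davies-Bouldin, WSSSE.
first = [4, 4, 10, 4]
second = [8, 8, 12, 8]

print(majority_vote(first))   # (4, 3)
print(majority_vote(second))  # (8, 3)
```

Three of the four indices agree in each row, which is consistent with the choice of four and eight clusters for the analysis.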

Cluster | Total | Rate |
---|---|---|
1 | 6161 | 72% |
2 | 605 | 7% |
3 | 311 | 4% |
4 | 1504 | 18% |
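The rates in the table above are each cluster's share of all instances, rounded to the nearest percent. The short check below reproduces them from the totals; the function name `rates` is hypothetical.

```python
def rates(totals):
    """Map each cluster to its share of all instances, in whole percent."""
    n = sum(totals.values())
    return {cluster: round(100 * count / n) for cluster, count in totals.items()}


# Totals taken from the four-cluster table.
four_clusters = {1: 6161, 2: 605, 3: 311, 4: 1504}

print(sum(four_clusters.values()))  # 8581
print(rates(four_clusters))         # {1: 72, 2: 7, 3: 4, 4: 18}
```

Because each rate is rounded independently, the percentages add up to 101 rather than 100.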

Cluster | High Consumption | Low Consumption | Building 11 | Building 20 | Building 21 | Non-Working Days | Non-Summer | Non-Winter |
---|---|---|---|---|---|---|---|---|
1 | | ✓ | | | | ✓ | | |
2 | ✓ | | | ✓ | | ✓ | ✓ | |
3 | ✓ | | | ✓ | | | | ✓ |
4 | | ✓ | ✓ | ✓ | ✓ | ✓ | | |

Cluster | Total | Rate |
---|---|---|
1 | 3333 | 39% |
2 | 472 | 6% |
3 | 171 | 2% |
4 | 274 | 3% |
5 | 684 | 8% |
6 | 198 | 2% |
7 | 2715 | 32% |
8 | 734 | 9% |

Consumption | Days | Seasons | Buildings | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

Cluster | High | Low | Diurnal | Working Days | Non-Working Days | Non-Summer | Summer | Non-Winter | 11 | 20 | 21 |

1 | ✓ | ✓ | ✓ | ||||||||

2 | ✓ | ✓ | ✓ | ✓ | |||||||

3 | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||

4 | ✓ | ✓ | ✓ | ||||||||

5 | ✓ | ✓ | ✓ | ||||||||

6 | ✓ | ✓ | ✓ | ✓ | |||||||

7 | ✓ | ✓ | ✓ | ✓ | |||||||

8 | ✓ | ✓ | ✓ | ✓ | ✓ |

**Table 8.** Computing times (in hours) using synthetic big data for two different hardware configurations.

Buildings | Instances | File Size | ${\mathit{Time}}_{1}$ | ${\mathit{Time}}_{2}$ |
---|---|---|---|---|
16 | 17,162 | 10.3 MB | 0.0015 | 0.0015 |
32 | 34,324 | 20.5 MB | 0.0015 | 0.0016 |
64 | 68,648 | 41.2 MB | 0.0015 | 0.0014 |
128 | 137,296 | 82.4 MB | 0.0014 | 0.0015 |
256 | 274,592 | 190.1 MB | 0.0018 | 0.0017 |
512 | 549,184 | 380.9 MB | 0.0021 | 0.0015 |
1024 | 1,098,368 | 744.1 MB | 0.0023 | 0.0022 |
2048 | 2,196,736 | 1.45 GB | 0.0037 | 0.0020 |
4096 | 4,393,472 | 2.91 GB | 0.0067 | 0.0023 |
8192 | 8,786,944 | 5.81 GB | 0.0094 | 0.0054 |
16,384 | 17,573,888 | 11.63 GB | 0.0156 | 0.0091 |
32,768 | 35,147,776 | 23.26 GB | 0.7078 | 0.0162 |
65,536 | 70,295,552 | 46.52 GB | 3.8555 | 0.0995 |
131,072 | 140,591,104 | 93.03 GB | 5.2325 | 1.1985 |
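To make the gap between the two time columns easier to read, the ratio ${\mathit{Time}}_{1}/{\mathit{Time}}_{2}$ can be computed for the largest datasets. The sketch below does only that arithmetic on the values from Table 8; which hardware configuration each column corresponds to is not stated in this excerpt, and the names `largest_runs` and `speedup` are hypothetical.

```python
# (buildings, Time_1 in hours, Time_2 in hours), copied from Table 8.
largest_runs = [
    (32768, 0.7078, 0.0162),
    (65536, 3.8555, 0.0995),
    (131072, 5.2325, 1.1985),
]


def speedup(time_1, time_2):
    """Ratio of the first computing time to the second."""
    return time_1 / time_2


for buildings, time_1, time_2 in largest_runs:
    print(f"{buildings:>7} buildings: Time_1/Time_2 = {speedup(time_1, time_2):.1f}")
```

The ratio stays near 1 for the small files in the table and grows to roughly 40 at 32,768 and 65,536 buildings before dropping to about 4 at 131,072.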

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Pérez-Chacón, R.; Luna-Romera, J.M.; Troncoso, A.; Martínez-Álvarez, F.; Riquelme, J.C.
Big Data Analytics for Discovering Electricity Consumption Patterns in Smart Cities. *Energies* **2018**, *11*, 683.
https://doi.org/10.3390/en11030683
