Big Data for Energy Management and Energy-Efficient Buildings

Abstract: European buildings are producing a massive amount of data from a wide spectrum of energy-related sources, such as smart meters' data, sensors and other Internet of things devices, creating new research challenges. In this context, the aim of this paper is to present a high-level data-driven architecture for buildings data exchange, management and real-time processing. This multi-disciplinary big data environment enables the integration of cross-domain data, combined with emerging artificial intelligence algorithms and distributed ledgers technology. Semantically enhanced, interlinked and multilingual repositories of heterogeneous types of data are coupled with a set of visualization, querying and exploration tools, suitable application programming interfaces (APIs) for data exchange, as well as a suite of configurable and ready-to-use analytical components that implement a series of advanced machine learning and deep learning algorithms. The results from the pilot application of the proposed framework are presented and discussed. The data-driven architecture enables reliable and effective policymaking, as well as supports the creation and exploitation of innovative energy efficiency services through the utilization of a wide variety of data, for the effective operation of buildings.


Introduction
As the world's urban population increases by more than 2.5 billion by 2050, the construction of new, energy-efficient buildings and cities will be essential to the transformation of the economy [1]. Moreover, the building and construction sector must decarbonize by 2050 to meet the goals of the Paris Agreement [2,3].
The current energy performance of the building sector is poor [4]. In the European Union (EU), with buildings accounting for nearly 40% of its energy consumption, the building sector should play a key role in effective climate policy [5,6]. A number of European directives and initiatives are setting strict objectives on member states, such as the Energy Performance of Buildings Directive (EPBD) [7], the Energy Efficiency Directive [8] and the amending Directive on Energy Efficiency as part of the 'Clean energy for all Europeans package' [9], as well as the ecodesign directive [10] and the energy labelling regulation [11], providing consistent EU-wide rules for improving the environmental performance of products (e.g., household appliances) [12].
The needs of the various buildings value chain (BVC) stakeholders should be seamlessly integrated within a comprehensive energy transition framework, in order to guarantee its sustainability and resilience along the building's life cycle (from conceptualization to refurbishment or demolition). Energy is consumed at each of these steps, is sometimes generated locally at prosumer level, and can be analyzed from different perspectives or scales (building, district, city, regional or national), depending on the level of granularity required and the BVC stakeholders involved.
The constantly increasing momentum of big data and related technologies constitutes an unprecedented market opportunity for improving energy efficiency across the building sector and its lifecycle, and for better managing energy consumption and generation at building level. More and more data are being generated within buildings, due to the increasing adoption of leading-edge information and communication technologies (ICTs), such as the Internet of things (IoT), artificial intelligence (AI), distributed ledger technology (DLT)/blockchain and big data, thereby contributing to the move towards a smart building landscape [13].
Buildings data are heterogeneous, often dispersed in non-interoperable silos, of varying resolution, mostly asynchronous and stored in different formats (raw or processed) at various locations [14]. Individual devices and functional units generate thousands of terabytes of data annually, and building-related stakeholders must handle millions of terabytes of data [15], a volume that continues to increase over time. Accordingly, next generation building management systems will be processing overwhelming amounts of heterogeneous data. Appropriate management of buildings data clearly requires a big data approach, in which large volumes and varieties of both real-time and historical data are processed to extract meaningful information and support data-driven decisions [16].
These trends, combined with AI (the new 'engine' of the Fourth Industrial Revolution) and IoT infrastructure management enablers, constitute a catalyst for conceptualizing and generating innovative applications and services for energy management and energy-efficient buildings. It becomes of utmost importance to create an open, interoperable and scalable data-driven framework [17] able to support the implementation of policy objectives and hence to generate win-win situations, which may enable the adoption of novel business models by existing buildings-related stakeholders and/or open up new opportunities for BVC stakeholders.
The aim of this paper is to present a high-level data-driven architecture for buildings data exchange, management and real-time processing. This multi-disciplinary big data environment enables the integration of cross-domain data, combined with emerging AI algorithms and distributed ledger technology. Semantically enhanced, interlinked and multilingual repositories of heterogeneous types of data are coupled with a set of visualization, querying and exploration tools, suitable application programming interfaces (APIs) for data exchange, as well as a suite of configurable and ready-to-use analytical components that implement a series of advanced machine learning (ML) and deep learning (DL) algorithms. The results from the pilot application of the proposed framework are presented and discussed. The collected data can be processed to create different business cases considering the interaction of different stakeholders and scales. More specifically, the data-driven architecture enables reliable and effective policymaking, and supports the creation of innovative applications through the utilization of a wide variety of data, for the safe and effective operation of buildings.
The rest of the paper is organized as follows. In Section 2, a literature review is presented concerning the exploitation of big data technologies in the energy, smart grid and building domains. In Section 3, the high-level data-driven architecture for buildings is thoroughly presented. Section 4 is devoted to the pilot application. In Section 5, innovative energy efficiency applications related to the buildings sector and its lifecycle are described. In Section 6, the contributions and conclusions of the current study are discussed.

Scalable Big Data Management
Data processing through ML techniques is fundamental for data analytics that aims to improve the accuracy of the developed algorithms and systems based on intelligent adaptive systems [18]. It leverages a variety of ML, statistical and AI-based algorithms and models, including clustering, correlation, classification, categorization, regression and feature extraction, with a view to extracting valuable and actionable information from the dataset [19,20] and enabling informed decision-making.
Energies 2020, 13, 1555 3 of 18 ML techniques are further classified into supervised, unsupervised, semi-supervised and reinforcement learning techniques based on the nature of the learning "signal" or "feedback" available to a learning system. Most big data analytics and AI techniques for smart buildings are based on conventional (supervised or unsupervised) ML/DL algorithms, which operate in a context specific way and hence do not provide enough support for cross-stakeholder transfer learning, AI-based learning models reusability and fast cross-domain applications adaptation [21].
Supervised learning includes Bayesian classification methods, regression techniques, lazy learners, decision trees and support vector machines. These methods have been used in analyzing building operational data [22,23] and for short, medium and long-term energy prediction in distinct building environments [24]. To accurately capture the complicated relationships between input and output variables, the supervised learning techniques adopted are typically of high complexity, such as artificial neural networks [25], deep neural networks [26], support vector machines [27,28] and decision-tree based ensembles [29,30]. For buildings with complex and unstable occupancy schedules and energy use patterns, multiple linear regression and support vector machine methods can achieve a high accuracy with fast computation speed [31].
In the past, smart grids, smart buildings and big data have usually been reported separately, and the analysis of big data in smart grids/buildings has rarely been reported. Zhou et al. reviewed big data-driven smart energy management, mainly illustrating the architecture and industrially applied energy management tools [32]. Support vector machines have also been used for fault detection in power grids [33]. Moreover, research projects have investigated supervised ML methods for wind turbine power generation [34,35].
Unsupervised learning explores the intrinsic data structures, associations and correlations. Miller et al. [36] and Fan et al. [37] presented a review of unsupervised data analytics for building energy performance big data. The research of Lapalu et al. focused on the unsupervised mining of activities for smart home prediction [38]. Unsupervised ML algorithms for developing personalized behavior models using activity data have been presented [39]. Dynamic programming, Monte Carlo methods and temporal difference methods can be exploited for solving a reinforcement learning problem. In semi-supervised learning, learning accuracy is improved by combining a typically small amount of labelled data with a large amount of unlabelled data. Semi-supervised energy modelling has been used for smart buildings/homes [40,41].
Supervised learning and training of ML algorithms is a very challenging process, as building managers and stakeholders are reluctant to share the data that are necessary to train ML algorithms. If not properly handled, ML models may reveal inappropriate details of the sensitive data, since models are known to implicitly memorize details during training and inadvertently reveal them during inference [42,43]. A solution to the problem is to adopt differential privacy, which is considered the de facto standard in privacy-preserving ML modelling. The first learning algorithms adapted to provide differential privacy with respect to their training data were often linear and convex [44]. More recently, successful developments in deep learning called for differentially private stochastic gradient descent algorithms, some of which have been tailored to learn in federated settings [45]. Differentially private selection mechanisms like GNMax [46] are commonly used in hypothesis testing, frequent item set mining and as building blocks of more complicated private mechanisms.
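To make the idea of differential privacy more concrete, the following minimal sketch releases a noisy mean of bounded meter readings using the classical Laplace mechanism. This is a simpler technique than the differentially private stochastic gradient descent or GNMax mechanisms cited above, and it is not part of the proposed framework; the function names, the clipping bounds and the assumption that the number of readings is public are all illustrative.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(readings, lower, upper, epsilon, rng):
    """Release an epsilon-differentially-private mean of bounded readings.

    Assumption: neighbouring datasets differ in the value of one reading,
    and the number of readings is not considered sensitive.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    if epsilon <= 0 or upper <= lower:
        raise ValueError("epsilon must be positive and upper must exceed lower")
    # Clipping bounds the influence of any single reading on the mean.
    clipped = [min(max(r, lower), upper) for r in readings]
    true_mean = sum(clipped) / len(clipped)
    # Changing one clipped reading moves the mean by at most this amount.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    hourly_kwh = [1.2, 0.8, 1.5, 2.1, 0.9, 1.1]  # hypothetical readings
    print(private_mean(hourly_kwh, 0.0, 5.0, epsilon=0.5, rng=random.Random(0)))
```

A smaller epsilon gives stronger privacy at the cost of a noisier released value, which is the trade-off any privacy-preserving training scheme has to balance.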
In the domain of buildings, different tools exist for energy performance assessment and prediction as well as for buildings simulation [47]. On the one hand, tools like Leap [48] and AleaSoft [49] include a limited set of AI methods, which rely on expert knowledge to ensure appropriate use. On the other hand, simulation tools used to model different aspects of building management, such as building physics, thermal building models, heating, ventilation and air conditioning (HVAC) systems or building control systems, typically use static data and do not allow the integration of dynamic information [50]. Another important aspect is that buildings data analytics at the moment include mostly descriptive and diagnostic models, hence very few predictive and prescriptive applications have been made available. As the key objective of data analytics is to provide preventive solutions, predictive models become more and more necessary to forecast operating conditions and future decisions [51]. Prescriptive analyses, on the other hand, are designed to provide longer-term insights to utilities for strategic operational and investment planning.
A further challenge is that training data-driven models needs a considerable amount of data and can take time to converge to an optimal policy [22]. Cross-context transfer learning is explored as a solution, aiming to minimize the number of days of training data needed to achieve a policy with a certain accuracy. In that respect, there are some initial attempts to utilize transfer learning in the smart energy domain, in particular addressing the problem of building energy consumption [52,53]. On the other hand, forecasting algorithms are often based on static predictive modelling, which leverages AI-based learning but cannot detect when a given model no longer captures the modelled context, nor reconstruct missing and/or poor-quality data.

High-Level Architecture for Building Data
The proposed framework relies on a decentralized architecture, where data are stored locally at "building-edge" layer and are exposed in a privacy-aware manner at the AI layer run in a centralized cloud. This hybrid approach is expected to increase trust in data sharing among stakeholders and subsequently increase models' accuracy by raising the amounts of data available for AI-based learning.
The high-level architecture is presented in Figure 1.
This distributed scalable data governance module facilitates data sharing among building stakeholders to maximize the value of AI-based analytics at upper layers. More specifically, the proposed framework consists of three main pillars:
1. The Governance Layer, encompassing modules related to data collection, semantic annotation and distributed storage.
2. The Processing Layer, including ML and DL models.
3. The Analytics Layer, providing a set of analytics tools.

The aim is to increase the accuracy of AI-based services for buildings by training a large set of ML/DL models on a rich data set of heterogeneous, dispersed static and dynamic data, and to enable data analysis and visualization, as well as scenario analysis and simulation, at different temporal and geographical scales, addressing the needs of different stakeholders.

Infrastructure/Asset/Components
A large share of big data is related to energy consumption and production data in buildings. Moreover, off-domain data, such as EC databases (e.g., EU Building Stock Observatory, De-risking Energy Efficiency Platform, EU Energy Poverty Observatory, etc.), building stock auditing, energy performance certificates, energy performance contracts, weather and climate data, geometry data (e.g., from building information modelling), geographical imagery, multimedia unstructured data sources, financial data on energy efficiency investments, socioeconomic data, social media, and energy end-users' characteristics and comfort levels, when suitably integrated, may allow novel energy analytics to provide BVC stakeholders with more robust, actionable insights and to enable improved, analytics-driven decision-making.
A wide range of data analysis techniques (including among others optimization, forecasting, classification and clustering) can be applied on the aforementioned amounts of (big) data, supporting the design of new data-driven applications for BVC stakeholders, such as national and local governments, network operators and suppliers, energy service companies (ESCOs), building managers and facilitators, construction and renovation sector, investors and financiers, policy makers and researchers.
Data are provided through suitable data service providers. Connectors encapsulate the capability of federating interoperable data sets and/or interoperable data platforms. External off-grid data sets and resources may be federated and integrated, including weather data, etc.

Data Services and Semantic Enrichment
The data services and semantic enrichment (governance) layer provides the necessary middleware to act as a mediator between data users (applications and tools) and data providers who may want to decide case by case whether to disclose their data or not. State of the art solutions (i.e., blockchain/DLT/smart contracts with off-chain data) are used to guarantee traceability, provenance tracking and accountability. The components at the governance layer allow the integration, pre-processing, semantic annotation and querying of heterogeneous data. It integrates the following main components related to data and semantic interoperability:

• At the bottom, an interoperability service module is in charge of facilitating data sharing from different sources and/or platforms belonging to different actors in the energy and non-energy ecosystem, such as smart meters, sensors, IoT devices, building management systems (BMSs), technical building management systems (TBMs), building automation and control systems (BACs), energy performance contracts, energy performance certificates and legacy systems. It is based on open standards, open APIs (e.g., NGSI-LD CIM APIs [54]) and open data models (e.g., FIWARE Smart Energy Reference Architecture [55], Building Information Modeling (BIM) [56], Smart Appliances REFerence (SAREF) [57]). Interfaces to other third-party energy and non-energy datasets/data platforms willing to federate/integrate with the proposed framework are provided, with a view to allowing the incremental population of the platform's data hub.
• Data Cleansing, Curation and Formatting Module: an umbrella term for tasks that span from simple data pre-processing, such as restructuring, predefined value substitutions and reformatting of fields (e.g., dates), to more advanced processes, such as outlier detection and elimination, handling of data inconsistencies and noise reduction. To better organize the data collected and facilitate their future use, special ML pre-processing algorithms are developed for automatically cleansing and formatting them. These include algorithms for normalizing values, handling possible outliers, filling missing observations and dealing with different timestamp formats. The algorithms take into consideration the particular characteristics of the data examined, such as their frequency, trend, seasonality, cycle, randomness and empirical distribution, thereby enhancing the quality and content of the constructed dataset, decreasing the time required for training the algorithms of the toolbox and boosting their expected performance.
• Access Policy and Anonymization Module: The proposed framework incorporates policy enforcement mechanisms for data access brokerage, making it possible to encode programmatically (via DLT/smart contracts) specialized, context-based access policies for the data hubs. In order to handle datasets containing sensitive information, this module also performs anonymization during data ingestion to protect this information, by complete data removal (suppression), generalization or pseudonymity.
… [62], LonMark [63]). The Common Data Model serves data interoperability by ensuring that all data processed by the system adhere to the same standards of semantics based on a common set of terms, concepts and relations across different data sources.
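The cleansing steps listed for the Data Cleansing, Curation and Formatting Module (timestamp harmonization, filling missing observations, outlier handling and normalization) can be illustrated with the sketch below. It uses simple statistical rules rather than the ML-based pre-processing algorithms the framework refers to; the timestamp formats, the median-absolute-deviation threshold and all function names are assumptions made for this example.

```python
import statistics
from datetime import datetime

# Hypothetical set of timestamp layouts found across data sources.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%d/%m/%Y %H:%M")


def parse_timestamp(text):
    """Parse a timestamp written in any of the known layouts."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {text!r}")


def fill_missing(values):
    """Replace None entries by linear interpolation (nearest value at the edges)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observations to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def clip_outliers(values, k=3.0):
    """Clip values lying more than k median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)
    low, high = med - k * mad, med + k * mad
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

In practice the order of the steps matters: gaps are filled before outliers are clipped, and normalization comes last so that extreme values do not compress the rest of the range.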

On top of the data integration and semantic enrichment components, the platform enables easy access and querying of data to be exposed in the upper analytics layer:
• Reasoning Engine: A Graph Database technology (e.g., AllegroGraph) can be used as a triplestore in order to persist the dataset semantics and any Resource Description Framework (RDF) information produced by the Semantic Enrichment Module. On top of that, a Semantic Reasoning Engine, such as PoolParty Semantic Classifier, Jena or BaseVisor, enables the application of semantic queries on the triplestore to retrieve the semantic information and improve the performance of reasoning operations to extract new insights. This component exposes intelligent querying and search capabilities as an API to the Virtual Workbench or directly feeds UI and recommender engines supporting the analytics for designing and developing buildings and related infrastructure.
• Distributed Query Engine: The data retrieval from the distributed data warehouse is performed by utilizing a high-performance distributed query execution engine, like Presto, Tez or Apache Druid, while also utilizing column-oriented approaches like MonetDB for handling the analytics workload. Such engines provide the ability to perform complex queries on a distributed Data Lake in a very efficient and highly scalable way. The distributed query execution over a pure memory-based architecture allows the fast generation of the result-sets required by the analytical processes.
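To illustrate what a semantic query over a triplestore amounts to, the toy sketch below matches triple patterns against an in-memory list of (subject, predicate, object) tuples. It is not the API of AllegroGraph, Jena or any other product named above, and the `ex:` identifiers are invented for the example; a real deployment would issue SPARQL queries against the store.

```python
def match(triples, pattern):
    """Return one variable binding per triple that matches the pattern.

    Pattern terms starting with '?' are variables; all others must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated inside one pattern must bind consistently.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def query(triples, patterns):
    """Join several triple patterns on their shared variables."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables already bound by earlier patterns.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


if __name__ == "__main__":
    graph = [  # hypothetical annotations produced by semantic enrichment
        ("ex:meter17", "rdf:type", "ex:Meter"),
        ("ex:meter17", "ex:locatedIn", "ex:buildingA"),
        ("ex:meter42", "rdf:type", "ex:Meter"),
        ("ex:meter42", "ex:locatedIn", "ex:buildingB"),
        ("ex:buildingA", "ex:hasUse", "office"),
    ]
    # Which meters are located in office buildings?
    print(query(graph, [
        ("?m", "rdf:type", "ex:Meter"),
        ("?m", "ex:locatedIn", "?b"),
        ("?b", "ex:hasUse", "office"),
    ]))
```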

Big Data Management and AI Services
The Big data management and AI services (Processing) Layer provides a library of reusable AI-based ML/DL models, made available to promote quick adaptation and reuse of ML models across different contexts. The following functional modules are included:
• Classification of data sources: In order for the AI-based analytics to be meaningful, accurate and easy to construct, their input variables have to be highly correlated and refer to the same time, place and application. For instance, when constructing a model for predicting the hourly energy produced by a Photovoltaic (PV) system, the weather forecasts exploited, such as radiation and temperature forecasts, must all be easy to track and refer to the same geographic location and time.
Given the size and the diversity of the data present, retrieving the most relevant variables becomes a challenging problem, especially for cases of semi-structured or completely unstructured data.
To deal with this problem, special ML algorithms are exploited to effectively classify the data available in terms of domain, type, location, time and frequency. These algorithms consider Natural Language Processing and Sentiment Analysis techniques to effectively process the description and the labels provided for each variable and classify them in representative classes based on their content (domain, type and location). The available timestamp is also processed to extract additional valuable information (frequency and time) and introduce further filters (sub-classes) that can improve the categorization of the available data and facilitate modelling.

• Dimension reduction: Identifying the most appropriate variables for solving a regression, classification or clustering problem is a complicated task, especially when large and diverse data are present. To cope with this issue, dimension reduction ML algorithms are used to enable the identification and creation of principal variables, either through feature selection or feature extraction approaches. Such algorithms have proven particularly effective when constructing deep learning models that extract information from large unstructured datasets and provide solutions in a completely unsupervised way. For instance, Convolutional Neural Networks can be exploited to minimize the pre-processing required for training other ML algorithms, filter and clean the raw information provided and boost the final performance of the algorithms.

• Training and validation: In order to make sure that the developed algorithms will be accurate and robust, and to mitigate the uncertainty present in the whole modelling process, the adoption of proper training and validation procedures becomes a prerequisite. Depending on the problem examined and the algorithms tested, different procedures and measures for assessing the performance of the available alternatives might be required. In this respect, the proposed framework involves a variety of training and evaluation procedures, as well as advanced criteria for selecting the most appropriate one per case. Simple holdout tests, cross-validation and random sampling are just some examples of the validation procedures that are considered, while Classification Accuracy, Logarithmic Loss, Confusion Matrix, F1 Score and Mean Absolute/Squared Error are some of the indicative performance measures that will accompany them. Note that the type of the problem being solved (supervised or unsupervised learning; classification, regression or clustering), the size of the sample data and the objectives of the algorithm (accuracy vs. efficiency) are also taken into consideration for performing an incremental analysis and determining the selections made. Moreover, different hyperparameters are examined for each of the considered algorithms and the most successful ones are adopted per case to maximize their potential and ensure that they are properly optimized for the particular training dataset.

• Library of ML algorithms: The Processing Layer provides a variety of advanced ML algorithms, supported by diverse and multiple data, to support in a smart way complex decisions related to energy management and energy-efficient services. The aim of these services is to enhance energy systems' reliability and robustness, mitigate the effect of critical events and power unavailability, improve the profit-loss function of the power generation units, perform proactive analytics to track buildings' performance and decrease the risk of malfunctions and deterioration, interact and exchange data between different power generation units to provide smart energy solutions at local level, provide accurate power and capacity forecasting and planning, exploit smart meter data to enhance energy conservation and promote efficiency, improve energy storage options and, finally, provide powerful descriptive analytics and evaluations. Each algorithm has different data import requirements, pre-defined based on the type of decision support problem being addressed. However, these requirements are as abstract and generalized as possible, in order to enable their direct utilization by the majority of the users and parties interested in their exploitation.
• Model Serving Module: It includes the set of the developed and trained models and constitutes the building block of the upper layer. These models are fed with both batch and streaming data coming from the Query Engine and the Data Streaming Module respectively. The models will be evaluated and refined over several iterations until they are finally used (served).
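As an illustration of the training and validation procedures mentioned in the list above, the sketch below runs k-fold cross-validation of a one-variable linear model and reports the mean absolute and mean squared errors. The model, the contiguous fold layout and the function names are assumptions chosen to keep the example short; they are not the framework's actual procedures.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    folds, start = [], 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k):
    """Return (mean MAE, mean MSE) over k held-out folds."""
    maes, mses = [], []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        # Errors are measured only on samples the model has not seen.
        errors = [ys[i] - (slope * xs[i] + intercept) for i in test_idx]
        maes.append(sum(abs(e) for e in errors) / len(errors))
        mses.append(sum(e * e for e in errors) / len(errors))
    return sum(maes) / k, sum(mses) / k


if __name__ == "__main__":
    radiation = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical inputs
    pv_output = [0.1, 2.0, 4.2, 5.9, 8.1, 9.8, 12.1, 14.0]  # hypothetical targets
    print(cross_validate(radiation, pv_output, k=4))
```

Contiguous folds suit time-ordered measurements; for data without temporal structure, shuffling the indices before splitting would be the more common choice.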

Big Data Analytics Toolbox for Buildings
The Big Data Analytics Toolbox for Buildings (Analytics) Layer exposes AI-based analytics services to multiple BVC stakeholders, incorporating the following components:
• A Visualizations and Reports Engine, responsible for the visual representation of the stored data and the results produced from the analytical components. It offers a variety of visual representations including charts and map visualizations, based on specific Key Performance Indicators (KPIs).
• A range of innovative Analytics Building Services, such as: (1) Analytics for energy performance, indoor condition evaluation and intelligent energy management; (2) Analytics for building systems and infrastructure; (3) Analytics for policy making and policy impact assessment on building level; (4) Analytics for building efficiency investments.
• A 'virtual workbench', to incorporate a variety of assets, including data, third party services, ML models, computing resources and storage resources as tradable assets. It provides a set of tools targeting Small-Medium Enterprises (SMEs), developers, researchers and potential innovators, who design and develop new applications for the buildings sector. The tools at this level constitute a set of APIs exposing the ML/DL models and data to be tailored to specific circumstances and contexts provided by the users.

Cyber Security and Data Privacy
A security layer is necessary to establish user authentication and authorization, to secure the non-open data of the transactions, to comply with the European Commission (EC) regulations on Data Protection (GDPR) and to log user actions and system events. This layer is vertical in the proposed architecture, in the sense that it spans and interacts with several of its building blocks.
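The three anonymization options named earlier for the Access Policy and Anonymization Module (suppression, generalization and pseudonymity) can be sketched as follows. The record fields, the keyed-hash scheme and the postcode truncation are illustrative assumptions, not the framework's implementation; note also that pseudonymized data can generally still be linked back to a person by whoever holds the key.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier by a keyed hash so records stay linkable but unnamed."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def generalize_postcode(postcode: str, keep: int = 2) -> str:
    """Keep only the leading characters of a postcode and mask the rest."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def anonymize_record(record: dict, secret_key: bytes) -> dict:
    """Apply suppression, pseudonymization and generalization to one record."""
    cleaned = dict(record)
    cleaned.pop("occupant_name", None)  # suppression: drop the field entirely
    cleaned["meter_id"] = pseudonymize(record["meter_id"], secret_key)
    cleaned["postcode"] = generalize_postcode(record["postcode"])
    return cleaned


if __name__ == "__main__":
    raw = {  # hypothetical ingested record
        "occupant_name": "Jane Doe",
        "meter_id": "IT-000123",
        "postcode": "17100",
        "kwh": 3.4,
    }
    print(anonymize_record(raw, secret_key=b"example-key-do-not-reuse"))
```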

Methodological Approach
This section presents a case study on the scheduling of photovoltaic (PV) maintenance, introducing a decision support system (DSS) tool. PV maintenance tools are necessary for the optimization of the return on investment and the minimization of time to warranty claim in PV installations. Novel solutions for fault prediction based on data-driven approaches can contribute to the cost-effective energy management of buildings. To this end, a decision support tool was developed, aimed at monitoring PV performance and triggering maintenance actions.
The energy produced by the PV plant is intermittent and is highly dependent on a number of variables, such as solar irradiance, temperature and other atmospheric parameters (e.g., humidity and cloud coverage), as well as the age of the equipment and its operational condition [64]. According to the literature, there are numerous applications of multiple linear regression (MLR) models for energy production forecasting, such as hourly PV production estimation [65]. In this context, an MLR model was adopted to predict the PV production ($\hat{y}$), considering the relation among different variables ($x_i$). In the standard MLR form, this reads:

$$\hat{y} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i, \qquad y = \hat{y} + \varepsilon$$

where $\beta_0$ is the intercept and $\beta_i$ are the regression coefficients of the $n$ explanatory variables. This means that $\varepsilon$ will be the deviation between the predicted ($\hat{y}$) and actual ($y$) PV production. In order to further improve the forecasting performance of the model, we calculated 24 different models, one for each hour of the day [66]. In this way, the particularities of each hour and season were taken into consideration more effectively.
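The one-model-per-hour MLR approach described above can be sketched with ordinary least squares solved through the normal equations. This is a minimal illustration under assumptions: the paper does not state how the coefficients were estimated, and the function names and the `(hour, features, pv_output)` sample layout are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are collinear or too few samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(rows, targets):
    """Return [intercept, coef_1, ..., coef_n] minimizing the squared error."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 models the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(beta, row):
    """Evaluate the fitted linear model for one feature vector."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def fit_hourly_models(samples):
    """Fit one MLR model per hour from (hour, features, pv_output) samples."""
    by_hour = {}
    for hour, features, output in samples:
        rows, ys = by_hour.setdefault(hour, ([], []))
        rows.append(features)
        ys.append(output)
    return {hour: fit_mlr(rows, ys) for hour, (rows, ys) in by_hour.items()}
```

Each hourly model needs more samples than it has coefficients and features that are not collinear, otherwise the normal equations become singular and the solver raises an error.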

Pilot Appraisal
The presented DSS tool was applied to the campus of Savona, Italy. The campus closely resembles an urban district, since it hosts (a) educational buildings, such as offices, classrooms and laboratories; (b) research centers; (c) private companies; (d) student residences. The smart polygeneration microgrid contributes to the operation of its electrical and thermal systems [67]. Thanks to this infrastructure, the site may be considered an example of a "smart urban district", equipped with distributed generation and with local control and supervision infrastructures [68].

Infrastructure/Asset/Components
The microgrid supervisory control and data acquisition (SCADA) system is used to share information with the DSS tool via an FTP connection. More specifically, the following weather forecasting information is acquired on a daily basis: outdoor temperature, relative humidity, pressure, global radiation and rainfall. In addition, the following data streams are shared (Figure 2): actual solar radiation; electrical power produced by the PV field; electrical power produced by the grid-connected microturbine; electrical power produced by the dual-mode microturbine; thermal power produced by the grid-connected microturbine; thermal power produced by the dual-mode microturbine; electrical power exchanged by the storage; electrical power exchanged with the external network; thermal power from the boilers; chiller thermal power in input and in output.

Data Governance and Processing
Given that PV production and weather data were collected without any notable problems, we used a sample of 12 months to analyze energy production. Production was not uniform, as it depended strongly on the radiation levels in the field. The mean PV production of the campus for each hour of the day was also analyzed. Production began to increase at around 7:00 and reached a peak at 14:00. After this point, production decreased until sunset, when it reached zero. In this respect, 15 MLR models were calculated (from 7:00 to 20:59). From 21:00 to 6:59 no energy production was detected and the corresponding prediction values were set to zero.
A data-cleansing approach was implemented before the coefficients of the MLR models were calculated. It included the removal of outliers and high-leverage points, identified through the standardized residuals and the Cook's distance of the observations [69]. Finally, collinearity and the p-values/F-statistic were checked, and additional adjustments were performed as necessary.
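The diagnostics named above can be illustrated with a short Python sketch that computes leverage, standardized residuals and Cook's distance. For brevity it assumes a single predictor (simple linear regression), whereas the models in this paper use several. The 4/n cut-off is a common rule of thumb and is not taken from this paper, and all function names are hypothetical.

```python
import math


def regression_diagnostics(x, y):
    """Leverage, standardized residual and Cook's distance per observation.

    Simplification: one predictor plus intercept, so p = 2 parameters.
    Assumes at least three points, distinct x values and a non-perfect fit.
    """
    n = len(x)
    p = 2
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate

    diagnostics = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - x_mean) ** 2 / sxx  # diagonal of the hat matrix
        std_residual = e / math.sqrt(s2 * (1.0 - leverage))
        cooks = (e * e / (p * s2)) * leverage / (1.0 - leverage) ** 2
        diagnostics.append(
            {"leverage": leverage, "std_residual": std_residual, "cooks_distance": cooks}
        )
    return diagnostics


def flag_influential(x, y, cooks_threshold=None):
    """Indices whose Cook's distance exceeds the threshold (default: 4/n, an assumption)."""
    threshold = 4.0 / len(x) if cooks_threshold is None else cooks_threshold
    return [
        i
        for i, d in enumerate(regression_diagnostics(x, y))
        if d["cooks_distance"] > threshold
    ]
```

Flagged observations would then be inspected or removed before the coefficients are re-estimated.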

Data Analytics
Using the DSS tool, the user (energy manager) needs to act only when an alarm is triggered by a detected anomaly in the produced energy values. The proposed approach provides 95% accuracy. When the deviation (ε) between the predicted (ŷ) and actual (y) values exceeds the accepted error of 5%, the system sends an alarm to notify the user of the possible need for maintenance of the PV system.
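The alarm rule can be sketched as follows. This is an illustrative Python fragment rather than the tool's actual implementation: it assumes that the deviation is measured relative to the predicted value, which the text does not specify, and the function names are hypothetical.

```python
def relative_deviation(predicted_kwh, actual_kwh):
    """Absolute deviation between prediction and measurement, relative to the prediction.

    Assumption: night-time hours with a zero prediction never contribute a deviation.
    """
    if predicted_kwh <= 0:
        return 0.0
    return abs(predicted_kwh - actual_kwh) / predicted_kwh


def should_raise_alarm(predicted_kwh, actual_kwh, accepted_error=0.05):
    """True when the deviation exceeds the accepted error (5% by default)."""
    return relative_deviation(predicted_kwh, actual_kwh) > accepted_error
```

For example, `should_raise_alarm(100.0, 96.0)` stays silent because the deviation is 4%, while `should_raise_alarm(100.0, 50.0)` raises an alarm.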
The modelling and the whole forecasting procedure were performed using the R software platform. R is a software environment and programming language focusing on data manipulation, calculation and graphical display (e.g., linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering).

Results
The DSS tool has been available for the campus since April 2016. For instance, at the end of June 2016, an alarm was triggered for the campus PV. By inspecting the values of the power produced by the PV, which are logged in the microgrid SCADA, the user verified that an actual problem affected the system, in particular the inverter connecting it to the network. The behavior of the power over time showed that the system kept going off-line and restarting due to a fault in the inverter controller, thus causing a loss of generated energy (Figure 3). The user therefore contacted the inverter maintenance service to resolve the malfunction.
The DSS tool has a diagnostic nature, so its impact can be evaluated only in terms of avoided costs or avoided loss of production. For instance, had the problem affecting the PV inverter not been detected, and supposing a daily production of about 300 kWh for that period and a reduction of 50% due to the problem, a loss of about 150 kWh per day would have occurred. The DSS tool intervention thus prevented an energy loss of about half the PV production per day. A less measurable outcome is the change in how the managers run the installation. For example, since the PV maintenance tool was implemented, the DSS users have significantly reduced their visits to the PV installation to check the panels for possible errors.
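The avoided-loss estimate above is simple arithmetic, sketched here in Python so that it can be repeated for other fault durations. The function name and the `days` parameter are hypothetical; only the 300 kWh and 50% figures come from the text.

```python
def avoided_loss_kwh(daily_production_kwh, reduction_fraction, days=1):
    """Energy that would have been lost had the fault gone undetected."""
    if not 0.0 <= reduction_fraction <= 1.0:
        raise ValueError("reduction_fraction must lie between 0 and 1")
    return daily_production_kwh * reduction_fraction * days


# Figures quoted in the text: about 300 kWh/day with a 50% reduction.
loss_per_day = avoided_loss_kwh(300.0, 0.5)
```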

Enabling Data-Driven Applications and Services for Buildings
This multi-scale and multi-stakeholder approach, presented in Section 3, can enable the development of data-driven applications and services for any stakeholder involved in the BVC.

Data-Driven Management of Self-Production Systems in Energy Communities
Nowadays, some prosumers have access to data regarding the energy produced by their Renewable Energy System (RES) self-consumption system and know what they are consuming from the grid [70]. However, they do not have tools with which they can match or analyze those data, or receive indications on how to better manage their energy consumption in accordance with their production. Energy cooperatives also have members who are in energy poverty and do not know what they can do to tackle that problem; in general, all members want more information on energy savings [71].
In this context, new applications can be provided to cross-reference data coming from smart meters and energy bills with the data from energy performance certificates. With these cross-referencing tools, it will be possible to match the real consumption of electricity with the energy performance certificates, which are geo-referenced. More specifically, the aim is to support members of an energy cooperative in better managing their self-production system (for prosumers) and in improving the energy performance of their households (for members with a supplier contract).
The outcomes of these tools could also be used by municipalities to better identify people in energy poverty and to determine how to help them tackle that issue. Policy makers could also use these tools to understand where in the country energy poverty is concentrated, based on the analysis of real consumption data, and to adapt existing policies and create new ones to better help those people.
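The matching of metered consumption with energy performance certificates can be sketched as a simple join. The record layout and field names below (`household_id`, `floor_area_m2`, `certified_kwh_per_m2`) are hypothetical, and comparing metered electricity directly with the certified value is a deliberate simplification of what such a tool would need to do.

```python
def compare_metered_to_certified(meter_readings, certificates):
    """Join smart-meter totals with certificate data per household.

    meter_readings: list of {"household_id": str, "kwh": float}
    certificates:   {household_id: {"floor_area_m2": float, "certified_kwh_per_m2": float}}
    """
    totals = {}
    for reading in meter_readings:
        hid = reading["household_id"]
        totals[hid] = totals.get(hid, 0.0) + reading["kwh"]

    results = {}
    for hid, total_kwh in totals.items():
        cert = certificates.get(hid)
        if cert is None:
            continue  # no certificate available, so nothing to match against
        metered = total_kwh / cert["floor_area_m2"]
        results[hid] = {
            "metered_kwh_per_m2": metered,
            "certified_kwh_per_m2": cert["certified_kwh_per_m2"],
            # ratio > 1 means the household consumes more than its certificate suggests
            "ratio": metered / cert["certified_kwh_per_m2"],
        }
    return results
```

Households with a high ratio would be natural candidates for energy-saving advice, and aggregating the ratios by location would support the municipal analysis mentioned above.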

SECAPs Impact Assessment, Implementation and Monitoring
Over the past decade, local and regional authorities have been actively engaged in sustainable energy policy planning [72]. In the last couple of years, efforts have also been placed on integrating climate planning, through the initiative of the (Global) Covenant of Mayors for Climate and Energy (CoM). In this context, the authorities produce Sustainable Energy and Climate Action Plans (SECAPs) that focus on the climate resilience of the public infrastructure and services, as well as on the reduction of the local authorities' energy consumption and carbon footprint, through a wide range of actions that mainly target the municipal lighting and transport sectors and the buildings of the municipal, tertiary and residential sectors [73].
The actions included in the SECAPs are a significant data source for activities of business interest to a wide range of stakeholders, including, among others, building component manufacturers and installers, especially concerning renovation activities such as double glazing and building insulation (walls, roofs, floors), RES installers (photovoltaics, solar water heaters, biomass burners, etc.), architecture companies (bioclimatic design), urban planners, street lighting wholesale companies, ESCOs, as well as civic crowdfunding projects. It is thus clear that a market opportunity exists, but it requires data retrieval and classification in user-friendly databases that interested market actors and other municipalities can navigate easily and quickly [74].
The challenge is to retrieve these data from SECAPs, as well as monitoring reports, not only from the CoM, but other databases (e.g., carbonn Climate Registry [75]) or initiatives at the national level (Lighthouse cities, etc.), analyze and integrate them in apposite sectoral actions at the local/regional/national level. In this respect, different types of data can be collected and analyzed, using the data-driven architecture, as follows:
• Data related to the planned actions' characteristics and specific category: building, street lighting or transport action, as well as the type of the action (e.g., building insulation, etc.) and more specific characteristics, such as the envisaged energy savings, envelope design, construction techniques and materials, size, building type, appliances used, lighting technology etc.
• Data about the envisaged costs, discount rates used, as well as any calculated financial indicators, such as net present value (NPV), internal rate of return (IRR), etc.
• Data regarding the reduction of the carbon footprint and cross comparison with similar actions from other plans at the national and European level.
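The financial indicators mentioned in the list above can be computed from an action's cash flows and discount rate. The following Python sketch uses the standard definitions of NPV and IRR; the bisection bracket and tolerance are illustrative assumptions, and no figures from any actual SECAP are used.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today, cash_flows[t] after t years."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows, low=-0.99, high=1.0, tol=1e-9, max_iter=200):
    """Internal rate of return by bisection on npv(rate) = 0.

    Assumes exactly one sign change of the NPV inside [low, high].
    """
    f_low = npv(low, cash_flows)
    f_high = npv(high, cash_flows)
    if f_low * f_high > 0:
        raise ValueError("NPV does not change sign in the given bracket")
    mid = (low + high) / 2.0
    for _ in range(max_iter):
        mid = (low + high) / 2.0
        f_mid = npv(mid, cash_flows)
        if abs(f_mid) < tol or (high - low) / 2.0 < tol:
            break
        if f_low * f_mid < 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return mid
```

As a usage example, an action costing 1000 today and returning 1100 after one year has `irr([-1000.0, 1100.0])` close to 0.10, and its NPV at a 10% discount rate is approximately zero.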

Next Generation Energy Performance Assessment and Certification
Energy performance certificates (EP certificates) are among the most important drivers of energy performance of the European building stock [76]. They provide a picture of the current state of the building stock in terms of energy efficiency and include recommendations to improve the buildings' energy performance. In each EU country, EP certificates data have been collected using different energy assessment tools and procedures. They have been stored in dispersed databases and in multiple formats. Moreover, there is no common vocabulary to define the contents of EP certificates across EU countries.
In order to analyze EP certificates data across countries and domains, a unified representation of their content is required. Besides the structured attributes that an EP certificate record contains, there are non-structured data provided by energy technicians (usually in natural language), such as refurbishment measures, building regulations, etc. This information cannot be processed and structured and, therefore, cannot be used to enrich the EP certificates data [77].
The integration of EP certificates with other data sources (e.g., building regulations, socioeconomic data, building materials and systems, financial investments, etc.) is fundamental for third parties, in order to analyze the building stock and take actions to improve it. Examples include identifying a building typology that is amenable to renovation, or carrying out a deep refurbishment plan at the municipal, regional or even national and trans-national scales. This requires not only access to the data, but also their integration, so that a multidimensional analysis can be performed.
Based on the analyses of the above-mentioned multidimensional, cross-domain and multi-lingual data, new applications can give rise to new business opportunities for ESCOs, building component manufacturers, construction companies, etc. Dedicated data analysis and visualization tools will make it possible to meet the requirements of specific beneficiaries.

Improving the Financeability of Energy Efficiency Investments
Energy efficiency projects are often fragmented, carry high transaction costs and fall below the minimum value that many private financial institutions are willing to consider. The finance community lacks a tested, evidence-based platform that supports decision makers with regard to the impacts of various investment criteria, risk-aware assessment and performance applied to a pool of energy efficiency investments [78]. The capability offered by emerging near big data analytics to integrate cross-domain financial and energy consumption data is key to building the necessary market confidence in energy efficiency projects and making them an attractive investment asset class. The availability of comparable, anonymized historical data, pooled from the major market segments for buildings and corporates and structured along major project characteristics, can encourage a greater energy efficiency investment flow [79].
New data-driven applications, built on the proposed reference architecture, can attract and mobilize private funding on energy efficiency projects, providing investors/financiers (e.g., commercial/green investment banks, institutional and insurance funds) and project developers (public/local authorities, providers of energy companies, ESCOs, construction companies, architects, etc.) with data and tools, in order to identify sustainable investment pathways and decrease the risk of investing in energy efficiency.
In this context, different types of data can be exploited, as follows: energy data (energy consumption before and after the saving measures, avoided CO2 emissions, etc.); financial data (upfront capital expenditure, total volume of the investment, type and sources of financing, payback period, market value and many others), enabling a deep understanding of the financial aspects of the investment; secondary data (location, type of investment, market segment, type of project promoter and type of asset or user profile); large databases for monitoring and benchmarking the performance of energy efficiency investments, with extensive data on existing projects in the EU countries, such as the De-risking Energy Efficiency Platform (DEEP) (including more than 5000 projects in buildings).
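As an illustration of how such financial and secondary data could feed a benchmark, the sketch below computes a simple payback period per project and its median per market segment. The `EfficiencyProject` record and its fields are hypothetical and do not reflect the schema of DEEP or of any other database.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EfficiencyProject:
    market_segment: str        # secondary data, e.g. "buildings"
    capex_eur: float           # upfront capital expenditure
    annual_savings_eur: float  # monetary value of yearly energy savings


def simple_payback_years(project):
    """Upfront investment divided by yearly savings (undiscounted)."""
    if project.annual_savings_eur <= 0:
        return float("inf")  # the investment never pays back
    return project.capex_eur / project.annual_savings_eur


def median_payback_by_segment(projects):
    """Benchmark: median simple payback period per market segment."""
    grouped = {}
    for project in projects:
        grouped.setdefault(project.market_segment, []).append(
            simple_payback_years(project)
        )
    return {segment: median(values) for segment, values in grouped.items()}
```

A new project can then be compared against the median of its own segment, which is one basic form of the benchmarking described above.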
Extensive processing of these data can be applied, in order to elaborate and categorize financing instruments and risk mitigation strategies, as well as to identify best practices on private financing which can be considered as a basis for benchmarking.

Data-Driven Policy Making and Policy Impact Assessment for Energy-Efficient Buildings
The European Commission (EC) presented its long-awaited vision for a European Green Deal [80] which, among other objectives, aims at making the continent CO2-neutral by 2050. Among the various announced initiatives, the EC proposes to work with stakeholders on a new renovation initiative in 2020, whose aim would be to organize renovation efforts, lifting national regulatory barriers to renovation and focusing in particular on social housing [81,82].
The public, cooperative and social housing providers are in a position to play a leading role in this transition, as they are already key drivers of the renovation efforts across Europe. The role of the sector in the successful implementation of the future European Green Deal cannot be overestimated, but the sector could benefit from the right policy and financial framework. To this end, extensive processing of the relevant data can be applied using the data-driven architecture, in order to elaborate and categorize policy instruments and risk mitigation strategies, as well as to identify best practices for policy and financial frameworks, which can serve as a basis for benchmarking, for policy implementation and for linking policy to financial mechanisms.
In this respect, an accurate vision of the impact of the implemented policies is offered, enabling the identification and benchmarking of best practices, which can be replicated in order to achieve the maximum impact. In turn, the setting of objectives within policies can be facilitated. The reliable deployment of actions can be induced by linking policy objectives to the specific actions that have been proven most effective and to the financing mechanisms and business models capable of unlocking them.

Conclusions
This paper presented a high-level data-driven architecture that aims to combine modern technological breakthroughs in the areas of DLTs/blockchain, ML/DL and big data, in order to develop new decision-making and data analytics solutions for energy management and energy-efficient buildings. The proposed approach realizes a holistic, state-of-the-art AI-empowered framework for decision-support models, data analytics and visualizations for the building sector.
The proposed framework provides the necessary capabilities for advancing scalable big data management and processing, with a view to reducing some of the major barriers currently hampering the full-scale deployment of big data and analytics in the emerging smart energy-efficient buildings domain, i.e., the excessive communication costs and the need to keep generated data as close as possible to the generation point and the owner.
A library of trained models is introduced, aiming at solving problems that may constitute building blocks for more complex problems, such as energy prediction, geo-clustering, energy performance prediction, multi-criteria assessment of building interventions, etc. These advanced data processing and management methods are combined in view of providing advanced and precise statistical analysis, data visualization, business intelligence, predictive modelling and multi-criteria decision support systems.
The framework is expected to have a significant impact on the building sector and its lifecycle, as it can be utilized in a wide range of use cases under different perspectives, including, but not limited to: monitoring and improving the energy performance of buildings; facilitating the design and development of building infrastructure; supporting policy making and policy impact assessment; de-risking investments in energy efficiency.