Data Analysis of Heating Systems for Buildings: A Tool for Energy Planning, Policies and Systems Simulation

Abstract: Heating and cooling in buildings are central to adopting energy efficiency measures and implementing local energy planning policies. Knowledge of the features and performance of existing systems is fundamental to conceiving realistic energy savings strategies. Thanks to the development of Information and Communication Technologies (ICT) and the progress of energy regulations, the amount of data that can be collected and processed allows detailed analyses of entire regions or even countries. However, big data need to be handled through proper analyses, to identify and highlight the main trends by selecting the most significant information. To do so, careful attention must be paid to data collection and preprocessing, to ensure the coherence of the associated analyses and the accuracy of results and discussion. This work presents an analysis of building heating systems in Lombardy, the most populated Italian region. From a dataset of almost 2.9 million heating systems, selected reference values are presented, aiming at describing the features of current heating systems in households, offices and public buildings. Several aspects are considered, including the type of heating systems, their thermal power, fuels, age, and nominal and measured efficiency. The results of this work can support local energy planners and policy makers, and enable a more accurate simulation of existing energy systems in buildings.


Introduction
Buildings represent a major share of the total energy consumption around the world. Multiple drivers influence the energy demand of buildings, and the trends show that, on a world basis, the total energy demand has remained the same in the last few decades, but with a significant increase in the quality of services [1]. On the other hand, in several countries, multiple policies are promoting energy efficiency measures in buildings [2], through the refurbishment of existing buildings [3], high requirements for new constructions [4] and operational optimization of building management [5]. In this framework, the evolution of ICT significantly increases the amount of data that can be collected for monitoring energy systems and for assessing other building-related performance indicators. The availability of live data with high temporal resolution from smart meters allows the development of advanced models for the optimization of buildings' energy performance [6]. Smart meters are also fostering an enhanced awareness among users of their energy consumption and of the actual effect of some energy efficiency measures [7,8].
heating plants in operation. Insights from real data are useful to confirm standard reference values for a number of performance indicators, including the average efficiency, the share of each fuel in the heating sector, the average installed thermal power, the number of units per inhabitant, etc.

Data Analysis
This analysis is performed by using open data from a registry of the heating plants installed in Lombardy, a large region (around 10 million inhabitants) located in Northern Italy. This choice has been made both for the climate features of this location and for the availability of open data. Lombardy is characterized by a continental climate, with an average of around 2300 Degree Days measured by Eurostat in 2009 [34], which is comparable to other locations in Europe and around the world. On the other hand, the availability of open data is crucial to guarantee the replicability and future updates of the results of this study.
The data analysis presented in this paper has been completely performed in R, an open source language and environment for statistical computing [35,36].

Description of the Dataset
The dataset is based on the regional registry of heating plants in Lombardy [37]. This registry has been developed thanks to the regional energy legislation, which made a census of all heating plants mandatory by 15 October 2014, and thereafter applies to both new installations and maintenance operations on existing plants. Since the majority of heating plants require a maintenance check every year or every two years, the current update of this dataset should include the majority of the systems in the region.
Data are currently available for 2.89 million plants (as of November 2017), including:
• fossil fuel boilers (Natural Gas, diesel oil, Liquefied Petroleum Gas (LPG), other fuels);
• biomass boilers (wood and pellet);
• heat pumps with output thermal power higher than 12 kW, i.e., with an electrical load generally between 2.5 kW and 5 kW;
• solar collectors with output thermal power higher than 12 kW, i.e., with an equivalent surface of the array usually higher than 20 m², and 25 m² in the case of Combined Thermal and Photovoltaic (PV/T);
• chillers with output cooling power higher than 12 kW;
• heat exchangers for users of District Heating (DH) networks;
• Combined Heat and Power (CHP) or Combined Cooling Heat and Power (CCHP) systems.
As a matter of fact, the registry does not include:
• water heaters for single families (for domestic hot water only), such as electric boilers installed in buildings where the heating systems are often centralized and dedicated only to space heating (in this case, the database accounts only for the space heating provided by the centralized boilers);
• wood fireplaces or stoves;
• heating plants used for industrial processes;
• heat pumps and chillers with output power lower than 12 kW (the threshold set by the regional regulations).
Each record of the dataset includes 41 different features that can be grouped into five categories:
• Location: features related to the municipality, the address and the cadastral references (no information is available for longitude and latitude);
• Building features: the total heated volume and the cooled one, the building category, the availability of an energy certificate and its reference number;
The data distribution and some statistical summaries for the most relevant features provide useful insights for a preliminary description of the heating plants installed in the region. Moreover, some indicators and some potential relations are described in the following sections.

Focus on Heat Generator Performance
An important remark concerns the generator efficiency, which is available both as a nominal value and as a measured value from test reports. However, while the nominal value represents the total efficiency of the heat generator, the measured value from the test reports considers only the combustion efficiency. The test procedure is described by the Italian standard UNI 10389-1:2009, which can be applied to any boiler fueled with liquid or gaseous fuels. The test requires measuring the air temperature, the flue gas temperature and the $O_2$ concentration in the flue gases. The flue gas heat loss share $Q_S$ is calculated by Equation (1):

$$Q_S = \left( \frac{A_1}{21 - O_2} + B \right) (t_f - t_a) \qquad (1)$$

where $O_2$ is the oxygen concentration in the flue gases, in percent by volume (with a precision of ±0.3%), $t_f$ is the flue gas temperature (±2 °C) and $t_a$ is the ambient temperature (±1 °C). The values of $A_1$ and $B$ are fuel-specific constants provided by the standard UNI 10389-1:2009 (e.g., for Natural Gas $A_1 = 0.66$ and $B = 0.010$). An alternative formula allows the calculation of $Q_S$ from the concentration of $CO_2$ instead of $O_2$ through the following relation, where $CO_{2,th}$ is the theoretical carbon dioxide concentration referred to the dry exhaust gas:

$$CO_2 = CO_{2,th} \left( 1 - \frac{O_2}{21} \right) \qquad (2)$$

After the calculation of the heat losses, the combustion efficiency $\eta_{comb}$ can be calculated for noncondensing boilers by using Equation (3):

$$\eta_{comb} = 100 - Q_S \qquad (3)$$

The efficiency of condensing boilers is computed by accounting for the increase in thermal output obtained through the condensation of steam in the flue gases, using a more complex procedure defined in the same standard. The reference accuracy of the combustion efficiency in the tests defined by UNI 10389-1:2009 is in the range ±2.0%.
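As an illustration of the test calculation for noncondensing boilers, the following Python sketch assumes that the flue gas loss takes the Siegert-type form $Q_S = (A_1/(21 - O_2) + B)(t_f - t_a)$, with $O_2$ in percent by volume, and that $\eta_{comb} = 100 - Q_S$. The function names and the test reading are hypothetical; only the Natural Gas constants ($A_1 = 0.66$, $B = 0.010$) come from the text. Python is used for illustration only, as the analysis in this paper was performed in R.

```python
def flue_gas_loss(o2_pct, t_flue_c, t_air_c, a1=0.66, b=0.010):
    """Flue gas heat loss share Q_S in percent (assumed Siegert-type form).

    Defaults are the Natural Gas constants quoted in the text;
    o2_pct is the O2 concentration in percent by volume.
    """
    if not 0.0 <= o2_pct < 21.0:
        raise ValueError("O2 concentration must be in [0, 21) % by volume")
    return (a1 / (21.0 - o2_pct) + b) * (t_flue_c - t_air_c)


def combustion_efficiency(o2_pct, t_flue_c, t_air_c, a1=0.66, b=0.010):
    """Combustion efficiency in percent for a noncondensing boiler."""
    return 100.0 - flue_gas_loss(o2_pct, t_flue_c, t_air_c, a1, b)


# Hypothetical test reading: 5 % O2, flue gases at 140 °C, air at 20 °C.
eta = combustion_efficiency(o2_pct=5.0, t_flue_c=140.0, t_air_c=20.0)
print(f"{eta:.2f} %")  # (0.66/16 + 0.010) * 120 = 6.15 % loss -> 93.85 %
```

The sketch does not cover condensing boilers, for which the standard prescribes a different procedure.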

Data Quality and Preprocessing
The issue of data quality is particularly critical on large datasets, where data cannot be checked manually, and automated procedures or rules need to be defined. The data quality is a key aspect in energy systems, both for energy planning analyses [33] and for management, innovation and operation [38,39]. Multiple aspects affect the coherence of the data, especially when they are not recorded by the same observer, be it a human or a sensor. Even if a method is precisely defined and codified, often the handling of unexpected results leads to differences in the recorded values.
The main problems for data quality in the present study are the following:
• missing data: codified with NaN, empty space or codes such as "99999";
• data of the wrong type or without physical sense (e.g., negative values for energy or power);
• data with an incorrect order of magnitude (potentially caused by a wrong interpretation of the measurement unit, but difficult to correct);
• accuracy of the data, which are often approximated (e.g., rounded values, approximate estimations when information is not available).
While missing data can be easily ruled out, the other errors need a more careful evaluation. A first step in the analysis is the definition of a validity range to exclude the non-acceptable values. However, for some quantities, this range can be defined from literature values (e.g., conversion efficiency) while, in other cases where the expected range is not known a priori, a manual analysis of the dataset is needed. The exclusion of the outliers is generally performed by an analysis on the percentiles of the data distribution, as often wrong data that have larger orders of magnitude need to be filtered out before performing further actions.
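The filtering steps described above can be sketched in Python as follows. This is a minimal illustration and not the procedure actually used in this work (which was carried out in R): the function names, the sample values and the 1st to 99th percentile band are assumptions, while the missing-data codes are the ones listed above.

```python
import math

MISSING_CODES = {"", "nan", "99999"}  # codes listed in the text; extend as needed


def parse_value(raw):
    """Return a float, or None when the field is missing or not numeric."""
    text = str(raw).strip()
    if text.lower() in MISSING_CODES:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def in_range(values, low, high):
    """Keep values inside a validity range that is known a priori."""
    return [v for v in values if low <= v <= high]


def percentile(sorted_values, q):
    """Percentile (0 <= q <= 100) with linear interpolation."""
    if not sorted_values:
        raise ValueError("empty data")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def trim_by_percentile(values, q_low=1.0, q_high=99.0):
    """Drop values outside the [q_low, q_high] percentile band (assumed band)."""
    ordered = sorted(values)
    low, high = percentile(ordered, q_low), percentile(ordered, q_high)
    return [v for v in values if low <= v <= high]


# Hypothetical raw power entries (kW) with typical defects.
raw_power_kw = ["24", "24.0", "99999", "", "-5", "28", "24000", "NaN", "32"]
parsed = [v for v in map(parse_value, raw_power_kw) if v is not None]
positive = [v for v in parsed if v > 0]   # remove values without physical sense
cleaned = trim_by_percentile(positive)    # remove wrong orders of magnitude
print(cleaned)
```

On very small samples a percentile band always removes the extremes, so in practice the band would be chosen after inspecting the distribution of each feature.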
Finally, the aspect of data accuracy is non-trivial, as particular anomalies cannot be found by simple automatic algorithms. In some cases, anomalous distributions can be the result of compilation errors or data approximations. Figure 1 clearly shows an anomalous data record: the heating generators installed in January represent one third of all the installations. This is very unlikely, especially considering the fact that a new heat generator is seldom installed during the heating season, unless there is a major incident that cannot be repaired.
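A check of this kind can be sketched in Python as follows: the share of installations is computed for each month and, for the dominant month, the count for each day of the month reveals whether the records cluster on a single date. The function names and the sample dates are hypothetical and serve only as an illustration.

```python
from collections import Counter
from datetime import date


def month_shares(install_dates):
    """Share of installations falling in each calendar month (1-12)."""
    counts = Counter(d.month for d in install_dates)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no installation dates available")
    return {m: counts.get(m, 0) / total for m in range(1, 13)}


def day_counts(install_dates, month):
    """Number of installations per day of the month, for a given month."""
    return Counter(d.day for d in install_dates if d.month == month)


# Hypothetical records, for illustration only.
dates = [date(2005, 1, 1), date(2010, 1, 1), date(2012, 10, 15),
         date(2000, 1, 1), date(2014, 9, 30), date(2016, 11, 3)]

shares = month_shares(dates)
dominant = max(shares, key=shares.get)
print(dominant, shares[dominant], day_counts(dates, dominant))
```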
The anomaly arises because often only the installation year is available; since the system requires a full installation date, operators enter 1 January as a placeholder. As a result, any analysis based on the month of installation would be biased, and these anomalies cannot be easily described by a common pattern. For this reason, some aspects still require human interpretation, although artificial intelligence could provide useful support in the future. Therefore, each analysis will be carried out on the largest subset with available and acceptable data for the features of interest. As a result, the analyses will be performed on different subsets of the entire dataset, depending on the available data for each analysis.
An alternative solution could be a preliminary filter on the entire dataset considering all the desired aspects, but this would result in an excessive reduction of the final dataset, as only a limited number of records have all features available and correct. For this reason, each analysis will focus on distributions, medians, etc., in order to provide useful information that is affected by these potential errors as little as possible.
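The difference between the two approaches can be illustrated with a short Python sketch. The field names, the records and most validity ranges are hypothetical; only the 75% to 110% efficiency range is the one adopted later in this work.

```python
# Hypothetical validity ranges; only the efficiency range is taken from the text.
RANGES = {
    "power_kw": (1.0, 10000.0),
    "volume_m3": (20.0, 1000000.0),
    "nominal_eff": (75.0, 110.0),
}


def valid(record, field):
    """True when the field is present and inside its validity range."""
    value = record.get(field)
    low, high = RANGES[field]
    return value is not None and low <= value <= high


def subset(records, fields):
    """Records for which every field in `fields` is valid."""
    return [r for r in records if all(valid(r, f) for f in fields)]


records = [
    {"power_kw": 24.0, "volume_m3": 300.0, "nominal_eff": 90.0},
    {"power_kw": 24.0, "volume_m3": None, "nominal_eff": 92.0},
    {"power_kw": 28.0, "volume_m3": 250.0, "nominal_eff": 0.0},
    {"power_kw": None, "volume_m3": 400.0, "nominal_eff": 87.0},
]

# One subset per analysis, using only the fields that analysis needs.
per_analysis = {
    "specific_power": subset(records, ["power_kw", "volume_m3"]),
    "efficiency": subset(records, ["nominal_eff"]),
}
# Single preliminary filter on all fields at once.
complete_cases = subset(records, list(RANGES))

print({k: len(v) for k, v in per_analysis.items()}, len(complete_cases))
```

In this toy example the per-analysis subsets retain two and three records respectively, whereas the single preliminary filter retains only one.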

Calculations and Indicators
A large part of the calculations performed in this analysis are related to statistical evaluations on the features of the dataset entries. Data will be described by considering the distributions, medians, percentiles, etc. The analysis of the data distributions allows describing the characteristics of the heating systems, by highlighting the main trends.
The availability of several features can be an advantage for looking at some relations among them, in order to focus on specific aspects.
Among the available features, two indicators have been considered in greater detail: the ratio between installed thermal power and heated volume, and the boiler efficiency. These indicators provide useful information for the preliminary estimation of heating system parameters, both for energy simulation and for local energy planning.

Geographical Indicators
The first indicator of interest is related to the geographical distribution of the heating systems. The information available in the dataset includes province, municipality, zipcode, address and cadaster info. Lombardy hosts 10 million people, distributed into 1551 municipalities in 12 Provinces. The systems are well distributed among provinces, the three main provinces being Milan (21.8%), Brescia (14.5%) and Bergamo (13.8%).
The population of each municipality of the region has been compared to the number of heating systems in the same municipality, in order to calculate the number of systems per capita. Figure 2 shows the map of the region with the administrative municipality boundaries. The majority of the values lie in the interval 0.3-0.4 systems per capita, as a result of the mix between centralized and autonomous systems. It has to be noted that, although the large majority of systems are residential, other buildings are also included in the dataset, which can lead in some particular cases to (small) municipalities with more heating systems than inhabitants (mainly mountain municipalities in the northern part of the region). The analysis of installed power per inhabitant shows a similar pattern, with median values of all provinces' distributions around 10 kW per capita.
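The per capita indicator can be sketched in Python as follows. The municipality names and figures are hypothetical and chosen only to reproduce the two situations described above: values in the 0.3-0.4 interval, and a small municipality with more systems than inhabitants.

```python
from statistics import median


def systems_per_capita(system_counts, population):
    """Map municipality -> systems per inhabitant, skipping unknown populations."""
    return {name: system_counts[name] / population[name]
            for name in system_counts
            if population.get(name, 0) > 0}


# Hypothetical municipalities and figures, for illustration only.
system_counts = {"A": 3500, "B": 420, "C": 130, "D": 50}
population = {"A": 10000, "B": 1200, "C": 100}  # no population data for "D"

ratios = systems_per_capita(system_counts, population)
above_one = [name for name, r in ratios.items() if r > 1.0]
print(ratios, median(ratios.values()), above_one)
```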


Features of the Building
The largest part of the systems is installed in residential buildings, accounting for a total of 89% of the records. Residential buildings include normal houses and holiday houses, but this level of detail is available for only 30% of the residential buildings (of which only 2% are recorded as holiday houses). The other buildings with a share above 1% are industrial or similar activities (4.8%, for a total of 140,000 systems), offices (2.2%, i.e., 64,000 units) and commercial buildings (1.4%, i.e., 39,000 systems). The other types of buildings sum up to a marginal share and therefore have no statistical relevance, but they can be useful for specific analyses. The dataset includes heating systems installed in schools (11,000), hospitals and other healthcare buildings (8000), museums, hotels, bars and restaurants, sport centers, etc.
Another piece of useful information is the heated volume, which allows characterizing the buildings and evaluating possible indicators on the sizing of the systems (i.e., specific thermal power, see Section 3.3). The heated volume is not available for all units: it is recorded for around 85% of the systems. The majority of systems serve small buildings or apartments, with 54% smaller than 250 m³, 36% between 251 m³ and 500 m³, and only 10% over 500 m³. However, considering the installed power instead of the number of units, the above-mentioned categories account roughly for one third each (with the middle one slightly larger). For some buildings, a reference to the energy performance certificate (called APE in the Italian regulation) is available. This reference allows connecting the heating system to the database of the energy certifications (which is also available as open data). However, fewer than 1% of the systems are currently installed in a building with a codified energy certificate, although this number is likely to increase, making a deeper analysis possible by integrating the two datasets.

General Features of the Heating Systems
The first piece of information is the purpose of the heating system, the main distinction being between space heating and domestic hot water. Cooling systems are included in this dataset, but they are a marginal part (less than 1%). The large majority of the heating systems (85%) produce domestic hot water and provide space heating at the same time. The remaining part is mainly devoted to space heating only (10%), while other combinations, including cooling, make up the remaining 5%.
Other aspects not related to the heat generator itself (which will be described in Section 3.2.3) are the features of the heat distribution and emission systems, mainly quality aspects related to design solutions and types. In particular, the availability of heat metering, the type of heat emission system and the control logic of the heating system are included. Heat metering in Italian buildings is very rare: in the entire dataset, only 2% of the heating systems are coupled to heat metering (mainly systems installed after 2005). Considering the emission systems, the large majority is represented by radiators (around 85%), while the other systems include a variety of types, each with a share below 2% (radiant floors, air systems, fan coils, combined configurations).
Finally, analyzing the control logic of the heating systems, 19% of the generators are still installed without any control logic, which remains a significant issue for the energy efficiency of the systems. The majority of the cases include an ON/OFF control, based on a dwelling thermostat in 55% of the cases and on a zone thermostat in a further 16%. Few buildings have a proportional control. The correlation between control logic and year of installation is not significant, nor is the one with the building category.

Features of the Heat Generator
This group includes the most interesting features for this analysis, representing the nominal features of the heat generators installed in Lombardy. The aspects of interest are described below.
More than 97% of the systems in the dataset are heat generators or boilers, which are the main focus of this cadaster. However, other systems include chillers, heat pumps, district heating exchangers, solar collectors and a few CHP units. Since the cadaster has been developed focusing on the heating systems, additional information for other units is not available (e.g., surface of collectors, electric power of CHP units, etc.). For this reason, the following analyses will be focused on the heat generators.
Considering the fuels of the heat generators, Natural Gas has by far the largest share, supplying almost 95% of the units. LPG, pellet, diesel oil and wood are the other fuels that are used by at least 1000 systems in the dataset. Natural Gas is very well distributed in Italy, with a network that reaches the majority of the municipalities. Figure 3 shows that 86% of the municipalities in Lombardy have a natural gas penetration higher than 80%. The left part of Figure 3 shows that around a hundred municipalities have no or limited access to the Natural Gas network, while the range between 20% and 70% of natural gas penetration is almost empty. This distribution suggests that, where Natural Gas is available, it becomes the preferred fuel for the heating systems.
Considering the heat output of the units, 87% of the plants have a capacity between 20 kW and 35 kW, which is the range related to autonomous heating systems. However, these units account for only 61% of the total cumulated installed heat output (which is around 104 GW), as larger centralized plants have a considerable weight. The most common capacity appears to be 24 kW, which alone accounts for almost 50% of the units and roughly one third of the total cumulated thermal power. Those numbers depict a reality where the majority of the users are equipped with single-dwelling boilers for space heating and domestic hot water production.
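The difference between the share of units and the share of cumulated capacity in a given power range can be illustrated with the following Python sketch. The fleet is hypothetical and far smaller than the real dataset, so the resulting shares are not meant to reproduce the figures reported above.

```python
def shares_in_band(capacities_kw, low, high):
    """Return (share of units, share of installed capacity) within [low, high] kW."""
    if not capacities_kw:
        raise ValueError("empty fleet")
    in_band = [c for c in capacities_kw if low <= c <= high]
    unit_share = len(in_band) / len(capacities_kw)
    capacity_share = sum(in_band) / sum(capacities_kw)
    return unit_share, capacity_share


# Hypothetical fleet: eight autonomous boilers and two centralized plants.
fleet_kw = [24, 24, 24, 24, 28, 28, 32, 35, 150, 400]
units, capacity = shares_in_band(fleet_kw, 20, 35)
print(f"{units:.0%} of units, {capacity:.0%} of installed capacity")
```

Even in this small example, a few large centralized plants are enough to make the capacity share of the autonomous range much lower than its share of units.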
Figure 4 shows a violin plot of the efficiency distributions for the five main fuels used in the boilers. As mentioned above, considering the systems with acceptable efficiency values (i.e., values that are not outside a plausible range), almost 95% of the systems are supplied by natural gas. However, each of the minor fuels has a number of units between 7000 (wood) and 44,000 (LPG), thus ensuring a statistically significant population for the analysis. Further details are reported in Table 1. The plot shows that gaseous and liquid fuels generally have higher performance than solid fuels. Natural Gas and LPG have very similar patterns, with strong peaks at 87%, 90% and 92%, which are specific values associated with regulation limits that evolved over the years. Similar distributions can be observed for diesel oil and pellet, which show specific peaks at 90% for pellet, and at 90% and 87% for diesel oil. Wood shows generally lower performance with a large variability, representative of plants that are less standardized and automated than those using fossil fuels or pellet. Measured efficiency for solid fuels (wood and pellet) is not available, as there is currently no regulation defining a standard for the measurement.
The median age of the heating systems in Lombardy is around 11 years, as can be seen from the distribution of the installation years plotted in Figure 5. A large anomaly can be seen for the year 2000 (and a smaller one for 1990), probably due to an estimate for the systems with unknown installation year, made by the professionals who filled out the reports. However, this bias does not drastically influence the distribution. An additional aspect noticeable in Figure 5 is the slight increase in the last decade of small units (20 kW of output thermal power), probably due to the diffusion of high-efficiency buildings that have lower heat demand. The data for the year 2017 are obviously partial and cannot represent any significant trend. As mentioned above, a great part of the installations happens at the beginning of the winter season, so, for 2017, the period between September and November will be the crucial one. Finally, two marginal aspects are worth mentioning: the number of generators in each system and the manufacturer.
Regarding the first aspect, the heat generators in this dataset are mainly part of heating systems with a single heat generator (93%), while minor shares of units belong to two-unit plants (3%) or three-unit plants (1%). Larger groups exist but have a negligible share in the dataset. Regarding the second aspect, the top five manufacturers (Beretta: Lecco, Italy; Vaillant: Remscheid, Germany; Immergas: Reggio Emilia, Italy; Riello: Verona, Italy; Baxi: Warwick, UK) together account for 53% of the units and 43% of the total installed thermal power.

Specific Thermal Power
The ratio between the installed thermal power and the heated volume of the building or apartment is defined as specific thermal power.
This parameter is often useful for a preliminary estimation of the required power, and it depends on multiple aspects, including the geometrical features of the buildings (the surface area to volume ratio, the share of glazed area, the surface area contiguous to another building, etc.), the insulation and other heating design parameters. For this reason, the value of specific thermal power usually shows some variability. Figure 6 reports the frequency distribution of the specific thermal power, separating the heat generators classified as "Space heating and Domestic Hot Water (DHW)" (around 86% of the total) from the ones for "Space heating only" (around 10% of the total, the remainder being classified for other purposes, such as "Other", "Cooling" or a combination of the previous ones).
The distribution of "Space heating only" systems has a mode of 35 W/m³ and a median of 54 W/m³. On the other hand, the distribution of "Space heating and DHW" systems has higher values (a median of 96 W/m³ and a mode of 80 W/m³). The need for producing instant DHW usually leads to a large oversizing of boilers in small dwellings: around one quarter of the total boilers are rated at a single capacity (24 kW), whereas 30% of the buildings or dwellings have a volume of 300 m³, 270 m³ or 240 m³.

Natural Gas Boilers-Efficiency
Figure 7 shows a histogram of the nominal and measured efficiency distributions, limited to natural gas boilers. Both traditional and condensing boilers have been considered, and the latter are responsible for the higher efficiency values, especially when observing nominal values. There are 2.67 million natural gas boilers in Lombardy, of which only 64% have acceptable nominal and measured efficiency. The efficiency has been considered acceptable only in the range 75% to 110%, in order to filter out the values that may lead to results with low significance. The largest part of the unacceptable values is due to nominal efficiency (around 28.1%), while a smaller part has unacceptable measured efficiency (a share of 3.9%). Measured efficiency is related to a test required by the regional legislation, while nominal efficiency is non-compulsory information, which is therefore often ignored or reported as a wrong value (usually 0 or 1).
The bin width of the histogram in Figure 7 has been set to 1%, as nominal efficiencies are often reported rounded to whole percentage points, i.e., with no decimals. In particular, the three most frequent efficiency values are 92%, 87% and 90%, which represent, respectively, 12.9%, 11.3% and 10.5% of the total boilers. These values were associated in past years with limits required by the regulations, and the boilers entering the market were built in accordance with those limits.
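The filtering and binning described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the bounds of the 75% to 110% range are treated as inclusive, missing values are treated as unacceptable, and the records and function names are hypothetical.

```python
import math
from collections import Counter

# Acceptance range in percent, as quoted in the text (bounds assumed inclusive).
ACCEPTABLE_RANGE = (75.0, 110.0)


def is_acceptable(efficiency):
    """True if an efficiency value (in %) lies inside the acceptance range."""
    if efficiency is None:  # missing entry, assumed unacceptable
        return False
    low, high = ACCEPTABLE_RANGE
    return low <= efficiency <= high


def histogram_1pct(values):
    """Count acceptable values in 1%-wide bins, keyed by the bin's lower edge."""
    return Counter(math.floor(v) for v in values if is_acceptable(v))


# Hypothetical (nominal, measured) efficiency pairs in percent.
records = [
    (92.0, 93.4),
    (87.0, 91.8),
    (0.0, 92.5),    # nominal reported as 0: rejected
    (1.0, 94.1),    # nominal reported as 1: rejected
    (90.0, 60.2),   # measured value out of range: rejected
    (None, 98.7),   # nominal missing: rejected
    (92.0, 99.1),
]

# Keep only records where both values are acceptable.
both_ok = [(n, m) for n, m in records if is_acceptable(n) and is_acceptable(m)]
share_ok = len(both_ok) / len(records)
nominal_hist = histogram_1pct(n for n, _ in both_ok)

print(f"share with both values acceptable: {share_ok:.1%}")
print(dict(nominal_hist))
```

Because nominal values are typically whole numbers, a 1% bin collects each reported value in its own bar, which is why peaks at specific values stand out in such a histogram.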
The most noticeable aspect of Figure 7 is that the distribution of measured efficiency contains higher values than the distribution of nominal efficiency. However, the measured efficiency represents only the combustion efficiency, as only the flue gas losses are accounted for during the field performance checks. The case losses are therefore not considered in the measurement.
On the other hand, when the boiler is installed in a heated room, case losses contribute to space heating and therefore should not be counted as losses. Other aspects that cause differences between the two efficiency distributions are the accuracy of the instruments (the accuracy of field instruments is estimated at around ±2%) and the additional measurements available in the laboratory (e.g., fuel composition, heating value).
For all these reasons, any comparison of nominal and measured efficiency should take these aspects into account.
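The distinction between the two efficiency figures can be illustrated with a minimal sketch. It assumes the common indirect formulation, in which an efficiency equals 100% minus the percentage losses; the loss values and function names are hypothetical and are not taken from the dataset or from the regional test procedure.

```python
def combustion_efficiency(flue_loss_pct: float) -> float:
    """Efficiency when only flue gas losses are counted, as in a field check."""
    return 100.0 - flue_loss_pct


def useful_efficiency(flue_loss_pct: float, case_loss_pct: float,
                      boiler_in_heated_room: bool = False) -> float:
    """Efficiency when case losses are counted as well.

    If the boiler sits in a heated room, case losses are assumed to
    contribute to space heating and are not counted as losses.
    """
    effective_case_loss = 0.0 if boiler_in_heated_room else case_loss_pct
    return 100.0 - flue_loss_pct - effective_case_loss


# Hypothetical losses, in percent of the fuel energy input.
flue_loss, case_loss = 7.0, 1.5

print(combustion_efficiency(flue_loss))
print(useful_efficiency(flue_loss, case_loss))
print(useful_efficiency(flue_loss, case_loss, boiler_in_heated_room=True))
```

Under these assumptions the value that ignores case losses is always at least as high as the one that includes them, which is consistent with measured values lying above nominal ones.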
A further analysis can be performed on the measured efficiency, as reported in Figure 8. In this case, the dataset also includes the systems with unacceptable nominal efficiency but acceptable measured efficiency (around 2.55 million systems in total); for this reason, the frequencies in Figure 8 differ from those discussed for Figure 7, although the trend is comparable. The bin width of this chart has been set to 0.2%.
The bimodal shape of the distribution is related to the type of boiler: traditional boilers show lower efficiency, with a median of 92.7%, while condensing boilers have a median efficiency of 98.4%. Moreover, traditional boilers show a wider variability, which is probably caused by their higher average age and wider age range, and by the more stringent limits set by the regulations for new and condensing boilers.
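A comparison of this kind can be sketched by grouping measured efficiencies by boiler type and computing the median and the interquartile range for each group. The sample values below are hypothetical and were picked only so that the medians match those quoted in the text; the interquartile range is used here as one possible measure of variability, not necessarily the one adopted by the authors.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize(values):
    """Return the median and interquartile range of a list of values."""
    q1, _, q3 = quantiles(values, n=4)  # three quartile cut points
    return {"median": median(values), "iqr": q3 - q1}


# Hypothetical (boiler type, measured efficiency in %) records.
records = [
    ("traditional", 88.5), ("traditional", 90.1), ("traditional", 92.7),
    ("traditional", 93.8), ("traditional", 95.0),
    ("condensing", 97.6), ("condensing", 98.1), ("condensing", 98.4),
    ("condensing", 98.9), ("condensing", 99.3),
]

# Group the efficiency values by boiler type.
by_type = defaultdict(list)
for boiler_type, efficiency in records:
    by_type[boiler_type].append(efficiency)

summary = {boiler_type: summarize(values) for boiler_type, values in by_type.items()}
for boiler_type, stats in summary.items():
    print(f"{boiler_type}: median {stats['median']:.1f}%, IQR {stats['iqr']:.2f}")
```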
A final remark concerns the evolution of efficiency over recent years, which follows the limits set by the regulations: the minimum efficiency for a 30 kW traditional natural gas boiler was gradually raised from 85% before 1993 to 90% after 2005, and to 92% for a condensing boiler.

Conclusions
This work presents a data analysis on a large dataset of heating systems in a region in Northern Italy. The results provide useful insights for further works related to energy simulation and local energy planning. The main aspects are the following ones:

• The availability of large datasets is a valuable support for the analysis of the characteristics of existing systems. However, attention must be paid to data quality, as missing data points and errors could significantly affect the aggregated results. The dataset considered in this study shows that, while the availability of big data is a powerful resource, the data quality should be further improved. For this reason, distributions and medians provide more accurate insights than means and sums, since the former are less affected by missing data or erroneous outliers.
• The large majority of heating systems, both by number of units and by heated volume (where available), serve residential buildings or dwellings. Around 90% of the heating systems are installed in buildings with a heated volume lower than 500 m³.
• The ratio between installed thermal power and heated volume is a useful indicator for design parameterizations. Considering heating systems used only for space heating, the distribution of the specific thermal power shows a mode of 35 W/m³ and a median of 54 W/m³. The simultaneous production of domestic hot water shifts the distribution significantly upwards. The main driver appears to be the standardization of boilers for small dwellings, which usually have a thermal power output between 24 kW and 28 kW.
• Natural gas is the most widespread fuel for heat production in Northern Italy. The municipalities served by the network have a very high share of natural gas heating systems (usually above 85%-90%), while there are still some municipalities (mainly in mountain regions) in which natural gas is not available.
• The nominal efficiency of the heat generators shows a considerable dependence on the fuel. Another major driver appears to be the minimum acceptance limits set by the regulations in recent years, which correspond to the most common nominal efficiency values (87%, 90% and 92% for natural gas).
• The dataset also includes information about the measured combustion efficiency of the boilers. An analysis of the natural gas heat generators shows that two separate distributions can be highlighted for traditional and condensing boilers, the former with a median of 92.7%, and the latter with a median of 98.4% and a lower variability.
These insights describe the current situation of heating systems in Lombardy, which is representative of the situation of Northern Italy. The results of this work can be the basis for further analyses on different domains: energy policies, local planning and simulation of energy systems.
In more detail, a first step can be taken by overlaying this database on the one related to energy labels, to assess the quality of the heating demand and to provide a clearer picture of the Public Administration. Furthermore, heat metering and sensors, whether as part of Building Management Systems or of the monitoring phase required to access incentive schemes or energy efficiency credits, can increase the demand for data analyses similar to the one proposed here. They can also motivate dedicated guidelines for collecting and managing these data within codified energy strategies and the associated checking procedures.
Author Contributions: The authors equally contributed to the paper. M.N. conceived the idea and analyzed the data, B.N. and M.N. wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.