Since data collected from sensors are expected to be dirty, the collected measurements are analyzed with the DBSCAN algorithm, which automatically identifies outliers: when proper density parameters are set, outliers are labeled as noise points. To select the algorithm parameters (i.e., MinPts and Eps), the k-distance plot has been analyzed. We performed many runs with values of k (i.e., the MinPts parameter) between 2 and 20, and noticed that the resulting curves were very similar for k = 12 and k = 13. For both plots we looked for the knee: the Y-axis value at which the knee forms is a good Eps value for that particular MinPts value. If Eps is chosen too small, a large part of the data will not be clustered, whereas for too high a value of Eps, clusters will merge and the majority of objects will fall in the same cluster. METATECH adopts MinPts = 12 and Eps = 0.2 as a good configuration for the DBSCAN algorithm.
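The k-distance analysis described above can be illustrated with a minimal sketch. It is not the METATECH implementation: the toy points, the function names `k_distances` and `knee_value`, and the chord-based knee heuristic are all assumptions introduced here for illustration, and the k-th neighbor is counted excluding the point itself.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending.

    This is the curve drawn in a k-distance plot (the point itself is excluded).
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

def knee_value(curve):
    """Hypothetical knee heuristic: the curve value lying farthest below the
    straight line (chord) that joins the first and last points of the curve."""
    n = len(curve)
    first, last = curve[0], curve[-1]
    best_index, best_gap = 0, -1.0
    for i, y in enumerate(curve):
        chord = first + (last - first) * i / (n - 1)
        if chord - y > best_gap:
            best_index, best_gap = i, chord - y
    return curve[best_index]

# Made-up data: a dense 5x5 grid of measurements plus two isolated outliers.
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
points += [(5.0, 5.0), (-4.0, 3.0)]

curve = k_distances(points, k=4)
eps_candidate = knee_value(curve)
print("candidate Eps for k = 4:", eps_candidate)
```

In this toy setting the two outliers produce the steep head of the curve and the knee falls where the dense points begin, which is the behavior the knee search relies on.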
Association rule extraction also requires parameters to be defined. However, these are less critical, since they act as pruning factors on the result set, automatically removing the less important rules based on their support, confidence, and lift. Hence, the best values of these parameters depend on how much time and resources domain experts can devote to the manual inspection of the rules. Based on the experimental results, the following parameter settings have been used as the reference default configuration for METATECH:
5.2. Feature-Correlation Analysis
METATECH exploits the correlation matrix to analyze the dependence between multiple variables at the same time. The correlation matrix shown in Table 4 contains the correlation coefficients between each pair of attributes, computed as discussed in Section 4.2. This matrix is symmetric (i.e., the correlation of column i with column j is the same as the correlation of column j with column i), and its generic element models the correlation between the attribute in row i and the one in column j. Correlation coefficients always lie in the range $[-1, 1]$. A positive value ($> 0$) implies a positive correlation between attributes i and j: large (small) values of attribute i tend to be associated with large (small) values of attribute j. A negative value ($< 0$) means a negative or inverse association: large values of i tend to be associated with small values of j, and vice versa. A value near 0 indicates weakly correlated or uncorrelated data.
The matrix shown in Table 4 highlights two strong correlations, i.e., correlations whose value is above the experimental threshold of 0.9: (1) a strong positive correlation (0.97) between air temperature, i.e., the mean external temperature monitored through PWS, and outdoor temperature, monitored through a sensor deployed on the roof of the building; (2) a very high correlation (0.91) between UV index and Solar Radiation.
Since highly correlated attributes behave similarly, for each pair of attributes highlighted in the matrix, the attribute that is less correlated with the thermal energy consumption is removed from the analysis to reduce both the computational cost and the cardinality of the extracted knowledge. Based on the above results, we do not consider Outdoor Temperature and Solar Radiation in the subsequent analysis process.
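The pruning step can be sketched as follows. This is only an illustration under explicit assumptions: the Pearson coefficient is used (the measure defined in Section 4.2 is not restated here), the absolute value is compared against the 0.9 threshold, and the column values and the helper name `attributes_to_drop` are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def attributes_to_drop(columns, target, threshold=0.9):
    """For every attribute pair correlated above the threshold, select for
    removal the attribute that is less correlated with the target."""
    names = [name for name in columns if name != target]
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Assumption: strong negative correlations also count as redundant.
            if abs(pearson(columns[a], columns[b])) > threshold:
                corr_a = abs(pearson(columns[a], columns[target]))
                corr_b = abs(pearson(columns[b], columns[target]))
                dropped.add(a if corr_a < corr_b else b)
    return dropped

# Made-up measurements, for illustration only.
columns = {
    "air_temperature": [2.0, 4.0, 6.0, 8.0, 10.0],
    "outdoor_temperature": [2.5, 4.1, 6.4, 7.9, 10.3],
    "humidity": [80.0, 60.0, 75.0, 50.0, 65.0],
    "consumption": [20.0, 16.0, 12.0, 8.0, 4.0],
}

to_drop = attributes_to_drop(columns, target="consumption")
print("attributes removed:", to_drop)
```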
5.4. Cluster Characterization
METATECH exploits cluster analysis to identify energy consumption patterns that occurred in similar meteorological conditions. The K-Means clustering algorithm has been applied to meteorological data covering a complete winter period. METATECH supports domain experts in capturing the rationale of the clustering results through two representations: (i) the singular value decomposition (SVD) [4], to show the clustered points in a friendly two-dimensional graphical space; (ii) an attribute-based box-plot comparison, to better understand the distribution of the attribute values characterizing each cluster.
SVD is a matrix factorization method that factorizes the input data matrix into three matrices. It can be easily exploited to reduce the data dimensions by only considering the most representative attributes.
Figure 4 shows the SVD decomposition of the cluster set discovered by K-means with K = 4. All clusters are well-separated, and indeed K-means was able to identify a good partition of records that occurred with similar meteorological conditions.
Figure 5 compares the value distributions of the meteorological attributes to characterize the clustering results through boxplot analysis [42]. In more detail, Figure 5a (left) shows the humidity distribution in the four clusters. Cluster 1 and Cluster 2 have high median values and are characterized by positive skewness. Cluster 3 and Cluster 4 have low median values, with the former exhibiting negative skewness. In the case of positive skewness, more observations with lower values are present, while in the case of negative skewness, more observations fall among the highest values. For instance, considering Cluster 1 and Cluster 2 that have a negative skewness, $Q_3 - Q_2 < Q_2 - Q_1$, where $Q_2$ is the median, $Q_1$ the first quartile and $Q_3$ the third quartile.
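The quartile-based reading of skewness used in the boxplot discussion can be made concrete with a short sketch. Bowley's quartile skewness coefficient is used here as one possible measure; it is an assumption of this example, not necessarily the measure adopted by METATECH, and the sample values are invented.

```python
import statistics

def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1).

    Positive when the median is closer to the first quartile, negative when
    it is closer to the third quartile, zero for a symmetric box.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Invented samples, for illustration only.
mostly_low = [1, 2, 3, 4, 10, 20, 30]         # many low values, long upper tail
mostly_high = [-30, -20, -10, -4, -3, -2, -1]  # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, sample in [("mostly_low", mostly_low),
                     ("mostly_high", mostly_high),
                     ("symmetric", symmetric)]:
    print(name, quartile_skewness(sample))
```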
Figure 5a (right) shows the pressure distribution. All clusters exhibit a similar behavior in terms of both skewness and median values, hence the pressure is not a characterizing attribute for the clustering result.
Figure 5b (right) shows the wind direction distribution separately for each cluster. With respect to the humidity distribution, Cluster 1 and Cluster 3 are characterized by positive skewness. Instead, Cluster 2 is characterized by negative skewness, while Cluster 4 is almost symmetric. In more detail, Cluster 1 has a median value of 212.5, which corresponds to winds that blow from the South-West. Half of the records of Cluster 1 fall within the range [187.5, 250.0], corresponding to winds that blow from South-East, East and South-West.
Overall, on a per cluster basis, we can see that each cluster is characterized by specific ranges of values for different variables. For instance, Cluster3 shows low humidity and high temperature values, whereas pressure is not a characterizing feature of the cluster. On the contrary, Cluster1 exhibits high humidity, low temperature, and “high” wind direction values.
The current cluster characterization, as provided by the boxplots, is coarse. However, it serves as a support for the following association-rule extraction experiment (Section 5.5). Association rules and their corresponding quality metrics describe not only the information provided by boxplots but also deeper insights into the data, in a more human-readable fashion that is also accessible to non-expert end-users.
5.5. Analysis of Extracted Patterns at Different Abstraction Levels
In this subsection we discuss the most interesting correlation patterns found by METATECH, in the form of association rules. Since association rule mining requires a transactional dataset of categorical values, METATECH performs the discretization of continuously-valued measurements to obtain categorical bins.
In our case study, the knowledge discovery process is driven by a taxonomy. In the context of association rule mining, the taxonomy is called a generalization tree, since it allows rules to be generalized. Discretization bin values are provided by a domain expert, so that they reflect their meaning in the energy and meteorological contexts, as described in the following.
(1) Energy consumption per unit of volume (denoted as consumption level): two bins up to 5.5 kW/m³ (off up to 0.05 kW/m³, low up to 5.5 kW/m³), one bin every 10 kW/m³ for values up to 25.5 kW/m³ (medium consumption up to 15.5, high consumption up to 25.5), and an additional bin for values exceeding 25.5 kW/m³ (very high). Thus, the corresponding generalization tree includes 5 leaf values ([0.0, 0.05], (0.05, 5.5], (5.5, 15.5], (15.5, 25.5], (25.5, +∞)), each associated with a range of non-overlapping values. The tree also includes an intermediate level with three aggregate values (i.e., [0, 5.5], (5.5, 15.5], (15.5, +∞)) and the root including all values in the corresponding domain.
(2) Humidity: one bin every 20% from 0 to 100% (i.e., very low up to 20%, low up to 40%, medium up to 60%, high up to 80% and very high up to 100%). The corresponding generalization tree includes 5 leaf values ([0.0, 0.20], (0.20, 0.40], (0.40, 0.60], (0.60, 0.80], (0.80, 1.0]) and the root (representing all values).
(3) Temperature: values are discretized in five bins (very cold up to 5 °C, cold up to 10 °C, mild up to 18 °C, warm up to 25 °C, hot for higher values). The corresponding generalization tree includes 5 leaf values ((−∞, 5], (5, 10], (10, 18], (18, 25], (25, +∞)), an intermediate level with values (−∞, 10], (10, 18], and (18, +∞), and the root including all values in the corresponding domain.
(4) Temporal data: the timestamp is aggregated into the corresponding daily time slot (e.g., morning, midday, afternoon, evening). Each day is classified as holiday or working, and aggregated in week, fortnight, month, 2-month, 3-month, 4-month and 6-month periods.
(5) Meteorological measurements have been discretized based on the criteria available in [43,44,45,46]: precipitation level values and wind direction have been categorized in eight leaf values each, UV index in six leaf values, and atmospheric pressure in two leaf values.
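The discretization and generalization scheme listed above can be sketched for two of the attributes. The bin edges and labels are the ones given in items (1) and (3); the constant names, the helper functions, and the string used for each intermediate node are assumptions of this example.

```python
from bisect import bisect_left

# Upper bounds of the right-closed bins; the last bin is open-ended.
CONSUMPTION_EDGES = [0.05, 5.5, 15.5, 25.5]          # kW/m3
CONSUMPTION_LEAVES = ["off", "low", "medium", "high", "very high"]
CONSUMPTION_PARENT = {
    "off": "[0, 5.5]", "low": "[0, 5.5]",
    "medium": "(5.5, 15.5]",
    "high": "(15.5, +inf)", "very high": "(15.5, +inf)",
}

TEMPERATURE_EDGES = [5, 10, 18, 25]                  # degrees Celsius
TEMPERATURE_LEAVES = ["very cold", "cold", "mild", "warm", "hot"]
TEMPERATURE_PARENT = {
    "very cold": "(-inf, 10]", "cold": "(-inf, 10]",
    "mild": "(10, 18]",
    "warm": "(18, +inf)", "hot": "(18, +inf)",
}

def leaf(value, edges, labels):
    """Map a continuous value to its leaf bin (bins are closed on the right)."""
    return labels[bisect_left(edges, value)]

def generalize(value, edges, labels, parent):
    """Return the path leaf -> intermediate level -> root for a value."""
    label = leaf(value, edges, labels)
    return [label, parent[label], "root"]

print(generalize(7.0, TEMPERATURE_EDGES, TEMPERATURE_LEAVES, TEMPERATURE_PARENT))
print(generalize(30.0, CONSUMPTION_EDGES, CONSUMPTION_LEAVES, CONSUMPTION_PARENT))
```

Using `bisect_left` on the upper bounds keeps a boundary value such as 5.5 in the lower bin, matching the right-closed intervals of the generalization tree.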
From experimental experience, to avoid pruning interesting correlations with low confidence but high lift, recommended values of support and confidence thresholds for association rule mining in the current context are 0.1% and 1% respectively. Moreover, we also recommend a minimum lift threshold equal to 1.1 to prune both negatively correlated and uncorrelated item combinations.
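As a rough illustration of how these thresholds act on a candidate rule, the sketch below computes support, confidence, and lift with their standard definitions and applies the recommended minimums (0.1%, 1%, and 1.1). The transactions, item names, and function names are invented for the example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)
    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift

def passes(metrics, min_support=0.001, min_confidence=0.01, min_lift=1.1):
    """Apply the recommended thresholds: 0.1% support, 1% confidence, lift 1.1."""
    support, confidence, lift = metrics
    return (support >= min_support and confidence >= min_confidence
            and lift >= min_lift)

# Invented transactional dataset: each record is a set of discretized items.
transactions = (
    [{"temperature=very cold", "pressure=high", "consumption=very high"}] * 4
    + [{"temperature=very cold", "pressure=high", "consumption=high"}]
    + [{"temperature=mild", "pressure=high", "consumption=low"}] * 5
)

cold_rule = rule_metrics(transactions,
                         ["temperature=very cold"], ["consumption=very high"])
pressure_rule = rule_metrics(transactions,
                             ["pressure=high"], ["consumption=low"])
print(cold_rule, passes(cold_rule))
print(pressure_rule, passes(pressure_rule))
```

The second rule has a lift of 1 because its antecedent appears in every record, so the lift threshold discards it even though its support and confidence are high.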
5.5.1. Fine-Grained Association Rule Extraction
This section presents the most interesting correlations in the form of traditional (fine-grained) association rules. To this aim, the rule templates presented in Section 4.4 are exploited.
Table 5 shows the top-three rules, sorted by descending lift, characterizing each cluster according to the first template. Support, confidence, and lift are computed on the overall dataset, as the cluster is a feature of the dataset itself. Rules identify the most representative meteorological items in each cluster.
The rules describe Cluster1 as the group modeling “bad” weather data (drizzling, cloudy, low UV index), Cluster2 as cold humid measurements, Cluster3 as warm sunny days, and Cluster4 as mild dry ones. The characterization of the clusters by means of the rules provides insights that would be hard to spot from a boxplot. For instance, from the boxplot, Cluster1 seems to have zero UV index as its main value. However, the proportion of zero UV index records in Cluster1 is lower than their overall presence in the dataset. Hence, Cluster1 is actually characterized by the minimum UV index rather than the zero value, because minimum UV index values are more frequent in Cluster1 than in the other clusters. Such information is provided by the lift quality index of the corresponding rule, which is above 1.0.
These weather items are subsequently combined with other meteorological items to characterize each cluster in more detail through the second template.
Table 6 reports a subset of extracted rules according to the third template. Support, confidence, and lift are computed separately on the subset of the dataset of each cluster.
Rules , and describe the weather conditions correlated with a very high level of thermal energy consumption. For instance, the first rule of Cluster1 () applies to drizzling evenings in January, with very high humidity, and cold temperature, besides the presence of South wind, which is a very weak and moist wind, accentuating the body’s discomfort. All three rules correlate very high energy consumption with minimum UV index, very high humidity, and cold or very cold temperatures. Daily time slot changes from evening (for two rules) to morning (for the third one), as well as the fortnights, from December to January. Two rules have very high confidence values, from to , while the other rule has a relatively high confidence at .
Rules , and instead characterize periods with no thermal energy consumption (off value). Common conditions are absence of rain, high pressure, warm or mild temperatures, winds from the South or Southeast. The period is in March or April. All rules have very high confidence values, from to , meaning that when the meteorological conditions are met, then the thermal energy consumption is almost always off.
Identified correlations are confirmed by domain experts and for some aspects are obvious, e.g., the energy consumption is higher in December and January when it is colder. However, the interestingness of such results is twofold. First, the correlations are automatically inferred from data, showing that they correctly model a more or less known phenomenon, i.e., they actually make sense. Second, the results are human readable, and add meaningful details to trivial correlations, e.g., they specify the most correlated daily time slots and wind directions.
5.5.2. High-Level Generalized Association Rules
This section discusses the most relevant generalized association rules extracted by METATECH and classified according to the rule template presented in Section 4.4. This kind of rule allows us to extract interesting higher-level relationships among the data under analysis, capturing correlations that would be missed in the fine-grained extraction.
Table 7 shows the top-three interesting generalized association rules (with the highest lift value) characterizing each cluster. We concentrate directly on the second template, which yields the most interesting rules. Resulting rules can contain both original leaf values (e.g., morning, afternoon, cold, hot, etc.), and generalized values, such as “root” to indicate the full domain of the attribute, e.g., any value of temperature, or different levels of aggregation, i.e., 4-week period or 8-week period aggregating two or four adjacent fortnights.
Recall that the rules described Cluster1 as the group modeling “bad” weather data (drizzling, cloudy, low UV index), Cluster2 as cold humid measurements, Cluster3 as warm sunny days, and Cluster4 as mild dry ones.
Rules with the highest confidence typically involve low consumption levels. For instance, all Cluster3 and Cluster4 top correlations (R8 to R12) present low consumption levels (only R7 has a medium level). R11 stands out from this group of rules because it targets a very large 8-week period from October to December (late Autumn), and states that independently of the temperature (“root” level of generalization), during the midday time slot (from 9 a.m. to 1 p.m.), the consumption level is low, with confidence 74%. A similar behavior is presented by R8, which states that in afternoons from mid February to mid March (4-week aggregation of two fortnights), independently of the temperature (“root” value), the consumption level is low, with a very high confidence (90%).
Other rules stand out due to their high support. For instance, R9 presents a correlation verified for 16% of the Cluster3 observations. Typically, generalized association rules, since they cover more observations, present higher support than fine-grained rules, which are more specific and describe conditions that occur in a limited number of observations.
5.6. Summarizing and Comparing Energy Consumption
To present the rule results at a glance, METATECH summarizes energy consumption levels over time in similar meteorological conditions by exploiting a graphical representation, where self-explaining bubble symbols are used for different energy consumption levels.
Figure 6 shows the proposed graphical representation to simplify and synthesize the energy consumption patterns over time in a compact, human-readable, detailed and exhaustive representation.
The four graphs refer to the four clusters identified by the experimental session. Specifically, for each cluster, rules in the form of the second template are partitioned for each daily time slot and fortnight. The rule with the highest lift value is selected and the symbol associated with the corresponding energy consumption level is reported in the graph.
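The selection step behind each bubble can be sketched as a grouping of rules by fortnight and daily time slot, keeping the rule with the highest lift in each group. The rule records, their field names, and the function name are hypothetical and only illustrate the selection logic.

```python
def best_consumption_per_cell(rules):
    """For each (fortnight, time slot) cell, keep the consumption level of the
    rule with the highest lift."""
    best = {}
    for rule in rules:
        key = (rule["fortnight"], rule["time_slot"])
        if key not in best or rule["lift"] > best[key]["lift"]:
            best[key] = rule
    return {key: rule["consumption"] for key, rule in best.items()}

# Hypothetical rules of one cluster, already reduced to the fields needed here.
rules = [
    {"fortnight": "1-15 Jan", "time_slot": "morning",
     "consumption": "high", "lift": 1.8},
    {"fortnight": "1-15 Jan", "time_slot": "morning",
     "consumption": "very high", "lift": 2.4},
    {"fortnight": "1-15 Jan", "time_slot": "evening",
     "consumption": "medium", "lift": 1.3},
    {"fortnight": "16-31 Mar", "time_slot": "afternoon",
     "consumption": "off", "lift": 3.1},
]

symbols = best_consumption_per_cell(rules)
for cell, level in symbols.items():
    print(cell, "->", level)
```

Cells with no rule simply do not appear in the result, which corresponds to a missing bubble in the graph.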
Cluster 1 and Cluster 2 graphs (Figure 6 top) include a large number of symbols modeling very high and high consumption levels. In particular, in the mornings of the winter months consumption is high due to the bad weather conditions. In spring and autumn the consumption level decreases, while evenings are typically characterized by a medium consumption level (in Cluster 1). Instead, the Cluster 4 graph is characterized by lower consumption levels because this cluster represents mild weather conditions. Especially in spring and autumn, consumption levels are low or negligible during the day and afternoon time slots, while during the winter, low or medium consumption levels are frequent.
Hierarchical Graphical Representation
The hierarchical graphical model that METATECH uses to display the extracted knowledge can simultaneously compare the energy consumption levels at different granularity levels, as shown in Figure 7. From top to bottom, the three graphs are characterized by coarse to detailed time periods: the upper graph has an 8-week granularity, the middle one presents 4-week periods, and the lower one details fortnights. The first two graphs differentiate three energy consumption levels: “high” aggregates the values “very high” and “high” of the five-level scale, and “low” aggregates “low” and “off”. The third graph, besides the more detailed time granularity, presents energy consumption levels on a five-value scale (very high, high, medium, low, and off).
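The relationship between the three granularity levels can be illustrated with a small mapping. The label aggregation follows the text above; indexing fortnights from zero and aligning the 4-week and 8-week periods on the first fortnight are assumptions of this sketch. Note that METATECH builds the upper graphs from generalized association rules, so this code only shows how labels and periods correspond, not how the bubbles are computed.

```python
# Five-level scale mapped onto the three-level scale of the two upper graphs.
THREE_LEVEL = {
    "very high": "high", "high": "high",
    "medium": "medium",
    "low": "low", "off": "low",
}

def coarsen(fortnight_index, level, fortnights_per_period):
    """Map a (fortnight, five-level label) pair to the coarser period index and
    three-level label. Assumes fortnights are numbered from 0 and that periods
    start at fortnight 0."""
    return fortnight_index // fortnights_per_period, THREE_LEVEL[level]

# Fortnight 3 belongs to 4-week period 1 and to 8-week period 0.
print(coarsen(3, "off", fortnights_per_period=2))
print(coarsen(3, "very high", fortnights_per_period=4))
```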
As an example, Cluster1 results are reported. The graphs are built from rules extracted from the dataset and belonging to the second template. When no rules are present, no bubble is indicated. For instance, in the period 1–15 October, rules characterize the morning, while no correlations have been identified for the other periods of the day.
The graph can be analyzed in two ways: (i) a bottom-up approach and (ii) a top-down approach. The lower graph is obtained using the third template of the traditional association rules (i.e., correlations between weather conditions and energy consumption level at a different time granularity), while the two upper graphs are obtained using the generalized association rules, with fortnights aggregated to 4 weeks and 8 weeks respectively. The graphical model summarizes the consumption of each cluster in a simple and friendly way. Specifically, for each cluster, the rules in the form of the second template with the highest lift value are selected, and the symbol associated with the corresponding energy consumption level is reported in the graph.