#### 4.3. Detecting Patterns Using Hierarchical Clustering

Hierarchical cluster analysis is an algorithmic approach to finding discrete groups with varying degrees of similarity in a data set represented by a similarity matrix. These groups are organized hierarchically as the algorithm proceeds and may be presented as a dendrogram. Many of these algorithms are greedy (i.e., the locally optimal solution is always taken in the hope of finding a globally optimal solution) and heuristic, requiring the results of the cluster analysis to be evaluated for stability.

Hierarchical clustering methods can be divided into agglomerative and divisive approaches. Agglomerative clustering is a widespread approach to cluster analysis: agglomerative algorithms successively merge the individual entities and clusters that have the highest similarity, computed using, for instance, the Euclidean distance.

One of the most popular agglomerative clustering algorithms is Ward’s method [24]. It offers an alternative approach to performing cluster analysis: basically, it treats cluster analysis as an analysis-of-variance problem instead of using distance metrics or measures of association. It starts out at the leaves and works its way to the trunk, so to speak, looking for groups of leaves that it forms into branches, the branches into limbs and eventually the trunk. Ward’s method starts with $n$ clusters of size 1 and continues until all the observations are included in one cluster.

In general, Ward’s method can be defined and implemented recursively by a Lance–Williams algorithm. The Lance–Williams algorithms [25] are an infinite family of agglomerative hierarchical clustering algorithms represented by a recursive formula for updating cluster distances in terms of squared dissimilarities at each step (each time a pair of clusters is merged).

The recurrence formula allows, at each new level of the hierarchical clustering, the dissimilarity between the newly formed group and the rest of the groups to be computed from the dissimilarities of the current grouping. This approach can result in large computational savings compared with re-computing each step of the hierarchy from the observation-level data.
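As an illustration, the clustering itself can be sketched with standard tooling. The appliance profiles below are randomly generated stand-ins, not the study's data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical stand-in for the switch-ON probability profiles:
# 5 "appliances" x 24 hourly probabilities (not the study's data).
profiles = rng.random((5, 24))

# Ward's method on Euclidean distances; SciPy applies the Lance-Williams
# recurrence internally to update cluster distances after each merge.
Z = linkage(profiles, method="ward", metric="euclidean")

# n observations are merged in exactly n - 1 greedy steps, and for Ward
# the merge heights (column 2 of Z) are monotonically non-decreasing.
assert Z.shape == (4, 4)
assert np.all(np.diff(Z[:, 2]) >= 0)

# Cutting the dendrogram gives a flat division, e.g. into three groups.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself.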

The purpose of this analysis is to discover similar profiles or, in other words, appliances with a similar switch ON probability distribution through the whole day or the whole week. As a result of grouping using Ward’s method with the Euclidean distance measure, the dendrogram presented in Figure 2 was obtained.

**Figure 2.**
Dendrogram for grouping the electrical appliances throughout the whole day.


The height of each edge of the dendrogram is proportional to the distance between the joined groups. As shown in Figure 2, two groups are distinctly separated from each other, and one of them is further separated into two subgroups. Such information can be used to determine the final division of the data (in this case, three final groups).

From the visual analysis of the dendrogram, it can be observed that the switch ON probabilities of the kettle and the microwave at certain times are very similar (cluster marked in blue). In particular, this can be observed between 7 am and 9 am (as shown before in Table 2), a period usually associated with the users’ activity related to breakfast preparation.

A similar correlation in periods of joint operation can be seen for the washing machine and the tumble dryer. In the investigated households there is a logical relationship of doing the washing first and then drying the washed clothes (cluster marked in red).

On the right-hand side of the chart (marked in blue) one can find a group of similar usage patterns for the kettle and the microwave in the middle of the week. In the middle of the graph (marked in red) there is a group associated with the use of big household appliances consuming greater amounts of electricity. This group is also associated with the work period taking place in the middle of the week. The groups marked in yellow and purple are related to the weekend use of such appliances as the washing machine, tumble dryer, dishwasher and microwave. The group marked in green is the hardest to interpret, since it clusters different devices working throughout the whole week.

**Figure 3.**
Dendrogram for grouping the electrical appliances throughout the whole week.


#### 4.4. Detecting Patterns Using C-Means Clustering and Multidimensional Scaling

$C$-means [26] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume $C$ clusters) fixed a priori. The main idea is to define $C$ centroids, one for each cluster.

Clustering is the process of partitioning a group of data points into a small number of clusters. In general, we have $n$ data points ${x}_{i}, i=1,\dots,n$ that have to be partitioned into $C$ clusters. The goal is to assign a cluster to each data point.

$C$-means is a clustering method that aims to find the positions ${\mu}_{k}, k=1,\dots,C$ of the cluster centroids that minimize the distance from the data points to their cluster. $C$-means clustering solves:

$\underset{{\mu}_{1},\dots,{\mu}_{C}}{\mathrm{argmin}}{{\displaystyle \sum}}_{k=1}^{C}{{\displaystyle \sum}}_{x\in {S}_{k}}d\left(x,{\mu}_{k}\right)$

where ${S}_{k}$ is the set of points that belong to cluster $k$. The $C$-means clustering uses the square of the Euclidean distance $d\left(x,{\mu}_{k}\right)={\Vert x-{\mu}_{k}\Vert}_{2}^{2}$.

Unfortunately, there is no general theoretical solution for finding the optimal number of clusters for any given data set. Although it can be proved that the procedure always terminates, the $C$-means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. A simple approach is to compare the results of multiple runs with different numbers of classes $C$ and choose the best one according to a given criterion, but we need to be careful: increasing $C$ results in smaller error function values by definition, but also in an increasing risk of overfitting. The algorithm is also significantly sensitive to the initial randomly selected cluster centers.
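A minimal numpy sketch of the Lloyd-style $C$-means iteration described above; the synthetic data and parameters are illustrative assumptions:

```python
import numpy as np

def c_means(X, C, n_iter=100, seed=0):
    """Plain Lloyd-style C-means: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initial centroids drawn at random from the data -- the step the text
    # notes the algorithm is sensitive to.
    mu = X[rng.choice(len(X), size=C, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(C)])
        if np.allclose(new_mu, mu):
            break  # converged to a (possibly only local) minimum
        mu = new_mu
    sse = d2[np.arange(len(X)), labels].sum()  # objective function value
    return labels, mu, sse

# Two well-separated synthetic blobs; C = 2 should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, mu, sse = c_means(X, 2)
print(sse)
```

Comparing `sse` across several values of `C` (and several seeds) is the simple model-selection strategy mentioned above.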

Multidimensional scaling (MDS) [27] is a term applied to a class of techniques that analyze a matrix of distances or dissimilarities in order to produce a representation of the data points in a reduced-dimension space. Most data reduction methods analyze the $n\times p$ data matrix $X$ or the sample covariance or correlation matrix. Thus, MDS differs in the form of the data matrix on which it operates: it is an individual-directed method. Of course, given a data matrix, a dissimilarity matrix could be constructed and the analysis could then proceed with MDS techniques. However, data often arise already in the form of dissimilarities, and then there is no recourse to the other techniques. Also, in the other methods the data-reducing transformation is linear, whereas some forms of multidimensional scaling permit a nonlinear data-reducing transformation.

There are many types of MDS, but all address the same basic problem: given an $n\times n$ matrix of dissimilarities and a distance measure, find a configuration of $n$ points ${x}_{1},\dots,{x}_{n}$ in a reduced-dimension space ${\mathbb{R}}^{q}$ ($q<p$) so that the distance between a pair of points is close, in some sense, to the dissimilarity between the points. All methods must find the coordinates of the points and the dimension of the space, $q$. Two basic types of MDS are metric and nonmetric MDS. Metric MDS assumes that the data are quantitative, and metric MDS procedures assume a functional relationship between the interpoint distances and the given dissimilarities. Nonmetric MDS assumes that the data are qualitative, having perhaps ordinal significance, and nonmetric MDS procedures produce configurations that attempt to maintain the rank order of the dissimilarities. In our study we used one form of metric MDS, namely classical scaling.

In general, given a set of $n$ points in $p$-dimensional space, ${x}_{1},\dots,{x}_{n}$, it is straightforward to calculate the distance between each pair of points. Classical scaling (or principal coordinates analysis) is concerned with the converse problem: determining the coordinates of a set of points in a space of dimension $q$ from their pairwise distances [28].
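Classical scaling can be sketched directly with numpy by double-centering the squared distance matrix and taking the leading eigenvectors; the four sample points are, of course, illustrative:

```python
import numpy as np

def classical_mds(D, q=2):
    """Classical scaling (principal coordinates analysis): recover a
    q-dimensional configuration from a matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J            # double-centered squared distances
    w, V = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:q]          # keep the q largest
    coords = V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
    explained = w[idx].sum() / w[w > 0].sum()  # share of point variability
    return coords, explained

# Four points that genuinely lie in a plane: with q = 2 the configuration
# is recovered up to rotation/translation, explaining ~100% of variability.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
coords, explained = classical_mds(D, q=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(round(explained, 3))
```

The `explained` ratio is the same quantity reported below the CLUSPLOT-style chart as the percentage of point variability explained by the first two components.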

Classical scaling is one particular form of metric MDS in which an objective function measuring the discrepancy between the given dissimilarities, ${\delta}_{ij}$, and the distances, ${d}_{ij}$, derived in ${\mathbb{R}}^{q}$ is optimized. The derived distances depend on the coordinates of the samples that we wish to find. There are many forms that the objective function may take. To find the minimum of the stress function, most implementations of MDS algorithms use standard gradient methods [29].

The purpose of this computational experiment is to discover similar profiles, in the same way as in the previous case. As mentioned, the partitioning method divides the data into $C$ disjoint clusters, so that objects of the same cluster are close to each other and objects of different clusters are dissimilar. The output of a partitioning method is simply a list of clusters and their objects, which may be hard to interpret. Therefore, it is useful to have a graphical display which describes the objects with their interrelations while showing, at the same time, the clusters. Such a display was constructed using the so-called CLUSPLOT [30].

For this purpose we used the $C$-means algorithm, although other clustering methods can of course also be applied. For higher-dimensional data sets, a dimension reduction technique was applied before constructing the plot, as described in Section 4.2. The MDS method yields components such that the first component explains as much variability as possible and the second component explains as much of the remaining variability as possible. The percentage of point variability explained by these two components (relative to all components) is listed below the plot.

Then, CLUSPLOT uses the resulting partition, as well as the original data, to produce Figure 4. The ellipses are based on the average and the covariance matrix of each cluster, and their size is such that they contain all the points of their cluster. This explains why there is always an object on the boundary of each ellipse [31].

**Figure 4.**
MDS surface for grouping the electrical appliances throughout the whole week.


In our study we examined several dissimilarity measures, but in Figure 4 we show results based only on the Euclidean distance, which explains 42.53% of the point variability. This is because the other measures explain less of the point variability, namely: maximum, 26.34%; Manhattan, 30.15%; Canberra, 32.84%. The results refer to the larger input data matrix, as described in Section 5.1.

On the right-hand side of the picture (marked in red), a group of similar weekend operating periods of the washing machine, tumble dryer and microwave oven can clearly be seen. On the left-hand side of the graph (marked in blue) there is a group associated with the use of the kettle and the microwave in the middle of the week. The group marked in purple is the hardest to interpret, as it clusters different devices working throughout both the working days and the weekend days.

#### 4.5. Detecting Patterns Using Grade Data Analysis

Grade data analysis is an efficient technique that works on variables measured on any measurement scale (including categorical), since it is based on dissimilarity measures such as concentration curves and precisely defined measures of monotonic dependence. Its main framework is constituted by the grade transformation proposed in [32]. The idea is to transform any distribution of two variables into a convenient form, the so-called grade distribution. This transformation leaves unchanged the order of the variables, the ranks, and the values of monotone dependence measures such as Spearman’s ${\rho}^{*}$ and Kendall’s $\tau$. In the case of empirical data, this approach consists of analyzing the two-way objects/variables table, preceded by a proper recoding of the variable values.

The main tool of grade methods is Grade Correspondence Analysis (GCA), which refers to classical correspondence analysis but goes significantly beyond it by means of the grade transformation. In short, GCA orders the variables/objects table in such a way that neighboring objects are more similar than those further apart and, at the same time, neighboring variables are also more similar than those further apart. After the optimal ordering is found, it is possible to aggregate neighboring objects and neighboring variables, and therefore to build clusters with similar distributions. Spearman’s ${\rho}^{*}$ was originally defined for continuous distributions, but it may also be defined as Pearson’s correlation applied to the distribution after the grade transformation. The grade distribution may be defined for discrete distributions too, and it is possible to calculate Spearman’s ${\rho}^{*}$ for a probability table $P$ with $m$ rows and $k$ columns, where ${p}_{is}$ is the frequency (treated as a probability) of the $i$-th row in the $s$-th column:

where ${p}_{j+}$ and ${p}_{+t}$ are marginal sums defined as ${p}_{j+}={{\displaystyle \sum}}_{s=1}^{k}{p}_{js}$ and ${p}_{+t}={{\displaystyle \sum}}_{j=1}^{m}{p}_{jt}$.

GCA tends to maximize ${\rho}^{*}$ by ordering rows and columns according to their grade regression values, that is, the center of gravity of each row or each column. The grade regression for the rows is defined as:

and for the columns:

The algorithm calculates the grade regression for the columns and sorts the columns by its values, which increases the regression ordering for the columns but, at the same time, changes the regression for the rows. If the regression for the rows is then sorted, the regression for the columns changes. As proved in [33], each sorting by the grade regression increases the value of Spearman’s ${\rho}^{*}$. The number of possible states (combinations of permutations of rows and columns) is finite and equal to $k!m!$. Each sorting increases the value of Spearman’s ${\rho}^{*}$, and the last ordering produces the largest ${\rho}^{*}$, called a local maximum of Spearman’s ${\rho}^{*}$. The output of GCA depends on the initial permutation of rows and columns; if the table is ordered in the reversed way with respect to the initial permutation, a symmetrically reversed local maximum is obtained.

GCA first randomly permutes the rows and columns and then reorders them to achieve a local maximum. This process is iterated as many times as needed; typically 100 iterations are enough to obtain the result with the highest ${\rho}^{*}$. If all possible start permutations were checked, the result would be the global maximum of ${\rho}^{*}$, i.e., the largest value attainable for the analyzed table. It is important to mention that the calculation of the grade regression requires a non-zero sum in every row and column of the table, so this requirement also applies to GCA. A more detailed description of the grade transformation can be found in [34,35].
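The alternating-sort procedure can be sketched as follows. Note that the specific grade regression formula used here (the cumulative-marginal midpoint, weighted by each row's conditional distribution) is our reading of the usual GCA formulation, not a formula quoted from this text, and the probability table is random toy data:

```python
import numpy as np

def grade_regression_rows(P):
    """Center of gravity of each row over the column grades, taken here as
    the midpoints of the cumulative column marginals (an assumption based
    on the standard GCA formulation)."""
    col = P.sum(axis=0)
    grades = np.cumsum(col) - col / 2            # grade midpoint of each column
    return (P / P.sum(axis=1, keepdims=True)) @ grades

def gca_order(P, max_sweeps=100):
    """Alternately sort rows and columns by their grade regression until
    the ordering no longer changes (one local maximum of Spearman's rho*)."""
    rows, cols = np.arange(P.shape[0]), np.arange(P.shape[1])
    for _ in range(max_sweeps):
        r = np.argsort(grade_regression_rows(P[np.ix_(rows, cols)]))
        rows = rows[r]
        c = np.argsort(grade_regression_rows(P[np.ix_(rows, cols)].T))
        cols = cols[c]
        if np.array_equal(r, np.arange(len(r))) and np.array_equal(c, np.arange(len(c))):
            break                                 # fixed point reached
    return rows, cols

rng = np.random.default_rng(0)
P = rng.random((6, 8))
P /= P.sum()          # probability table; all row/column sums are non-zero
rows, cols = gca_order(P)
print(rows, cols)
```

Restarting `gca_order` from several random initial permutations and keeping the best result mirrors the iteration scheme described above.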

Finally, the grade analysis technique is aided by visualizations using the over-representation map, which is a chart of the probability density of the grade distribution, showing which cells are over- or under-represented in a particular dataset.

The data structure presented in Table 2 was analyzed with the GradeStat tool [36], developed at the Institute of Computer Science of the Polish Academy of Sciences.

The first step was to calculate the over-representation ratios for each field (cell) of the table. A given $m\times k$ data matrix with non-negative values can be visualized using an over-representation map in the same way as a contingency table [28]. Instead of the frequency ${n}_{ij}$, the value of the $j$-th variable for the $i$-th object is used. Next, it is compared with the corresponding neutral or fair representation ${n}_{i\bullet}\times {n}_{\bullet j}/{{\displaystyle \sum}}_{i}{{\displaystyle \sum}}_{j}{n}_{ij}$, where ${n}_{i\bullet}={{\displaystyle \sum}}_{j}{n}_{ij}$ and ${n}_{\bullet j}={{\displaystyle \sum}}_{i}{n}_{ij}$. The ratio of the first and the second expression is called the over-representation ratio. An over-representation surface over the unit square is divided into $m\times k$ rectangles situated in $m$ rows and $k$ columns, with the area of the rectangle placed in row $i$ and column $j$ equal to the fair representation of the normalized ${n}_{ij}$.

For instance, for the use of the kettle at 7 am on Monday the ratio equals 1.579 (for Table 2): since the probability of using the kettle in this hour is 0.12 and the row sum is 0.38 (for five appliances), we have 1.579 = 0.12/((1 × 0.38)/5). The calculations for the Supplementary Information were prepared in the same manner. Having the over-representation ratios, the over-representation map for the initial raw data can be constructed.
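The ratio computation itself is a short operation over the whole matrix; the small table below is a made-up example, not Table 2:

```python
import numpy as np

def overrepresentation(N):
    """Over-representation ratio: each cell divided by its 'fair'
    representation n_i. * n_.j / sum(N)."""
    fair = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    return N / fair

# Toy 2x2 table. Cell (0, 0) holds more mass than its row and column
# margins predict, so its ratio exceeds 1 (over-represented, dark on the map).
N = np.array([[0.12, 0.26],
              [0.05, 0.57]])
R = overrepresentation(N)
print(np.round(R, 3))
```

Thresholding `R` at the cut-points listed below (0.66, 0.99, 1.01, 1.5) would reproduce the gray-scale coloring of the map.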

The color of each field in the map depends on the comparison of two values: (1) the real value of the measure connected to the considered field and corresponding to the population element; (2) the expected value of the measure. The cells’ colors in the map are grouped into three classes:

- gray: the measure for the element is neutral (ranging between 0.99 and 1.01), which means that the real value of the measure is equal to its expected value;
- black or dark gray: the measure for the element is over-represented (between 1.01 and 1.5 for weak over-representation and more than 1.5 for strong), which means that the real value of the measure is greater than the expected one;
- light gray or white: the measure for the element is under-represented (between 0.66 and 0.99 for weak under-representation and less than 0.66 for strong), which means that the real value of the measure is less than the expected one.

The following step was to apply the grade analysis to measure the dissimilarity between two data distributions in order to reveal the structural trends in the data. The grade analysis was based on Spearman’s ${\rho}^{*}$, used as the total diversity index. The value of ${\rho}^{*}$ strongly depends on the mutual order of the map’s rows and columns. To calculate ${\rho}^{*}$, the concentration indexes of differentiation between the distributions are used. The basic procedure of GCA is executed by permuting the rows and columns of the table in order to maximize the value of ${\rho}^{*}$. After each sorting, the ${\rho}^{*}$ value increases and the map becomes more similar to the ideal one. This means that the darkest fields are placed in the upper-left and lower-right corners of the map, while the remaining fields follow this property: the farther from the diagonal towards the two other map corners (the lower-left and upper-right ones), the lighter the gray (or white) color of the fields.

The result of the GCA procedure for the Supplementary Information is presented in Figure 5. The initial value of Spearman’s ${\rho}^{*}$ was 0.1045; after sorting the over-representation map, the ${\rho}^{*}$ value increased to 0.5563 (which means that neighboring objects are more similar than those further apart). Additionally, cluster analysis was performed through the aggregation of some columns into one column (and likewise for the rows). The optimal number of clusters is obtained when the changes of the subsequent ${\rho}^{*}$ values become negligible, as referenced in [35]. Based on the results presented in Figure 6 (showing the increase in ${\rho}^{*}$ depending on the number of columns and rows), the over-representation map was divided into 25 clusters (five clusters for the rows and five for the columns).

**Figure 5.**
Overrepresentation map after transformations and grouping for the whole week.


**Figure 6.**
The values of Spearman’s ${\rho}^{*}$ depending on the number of clusters.


The resulting order presents the structure of the underlying trends in the data. The twenty-five clusters show typical usage patterns of the home appliances. The over-representation map in Figure 5 shows that the use of all devices on Tuesday morning happens very often (four clusters in the upper-left corner), as frequently as the joint usage of the tumble dryer and the washing machine on Friday and Saturday in the late afternoon or evening (four clusters in the bottom-right corner). In the opposite corners (upper-right and bottom-left) there are devices which were operated very rarely.

#### 4.6. Detecting Patterns Using Sequential Association Rules

The problem of discovering sequential patterns is based on a database containing information about events that occurred within a specified period of time. The aim of sequential association rules is to find relationships between the occurrences of certain events in the selected time period [37].

The problem of discovering frequent itemsets is to find all itemsets occurring in the database $D$ with a support higher than or equal to a minimum support threshold (minsup) supplied by the user. An itemset with support higher than minsup is called a frequent itemset.

The support of the rule
$\text{X}\to \text{Y}$
is the ratio of the number of transactions that support both the antecedent and the consequent of the rule to the total number of transactions. The support of a rule denotes its statistical significance. Rules with low support tend to describe relationships that are not common in the database. On the other hand, rules with high support are covered by many transactions in the database and they describe common patterns.

The confidence of the rule
$\text{X}\to \text{Y}$
is the ratio of the number of transactions that support both the antecedent and the consequent of the rule to the number of transactions that support only the antecedent of the rule. The confidence of a rule denotes its statistical strength. High confidence indicates strong correlation between elements contained in the antecedent and the consequent of the rule. Low confidence denotes weak correlation between elements and may indicate purely coincidental co-occurrence of elements.

The lift of the rule $X\to Y$ in the database $D$ is a measure of the rule’s correlation, indicating the impact of an element $X$ on the occurrence of an element $Y$. In other words, lift measures how many times more often $X$ and $Y$ occur together than expected if they were statistically independent. Lift is not downward closed and does not suffer from the rare item problem. However, lift is susceptible to noise in small databases: rare itemsets with low counts (low probability) which by chance occur a few times (or only once) together can produce enormous lift values.
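The three measures can be computed directly from a transaction list; the transactions below are invented for illustration, in the spirit of Table 3:

```python
# Each transaction is the set of appliances switched ON in one hour
# (invented values, not the study's data).
transactions = [
    {"kettle"}, {"kettle", "microwave"}, {"kettle", "dish washer"},
    {"washing machine"}, {"washing machine", "tumble dryer"},
    {"microwave", "washing machine", "tumble dryer"},
    {"kettle"}, {"microwave"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """support(X U Y) / support(X): statistical strength of X -> Y."""
    return support(X | Y) / support(X)

def lift(X, Y):
    """confidence / support(Y): > 1 indicates positive correlation."""
    return confidence(X, Y) / support(Y)

X, Y = {"washing machine"}, {"tumble dryer"}
# The dryer accompanies the washer more often than independence would
# predict, so the lift exceeds 1.
print(support(X | Y), confidence(X, Y), lift(X, Y))
```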

A sequence is an ordered list of elements $<{X}_{1},{X}_{2},\dots,{X}_{n}>$ where ${X}_{i}$ is a set of items, $\forall i\ {X}_{i}\subseteq L$. Each set ${X}_{i}$ is called a sequence element. The length of a sequence $X$ is the number of its sequence elements. Each sequence element has a timestamp denoted as $ts({X}_{i})$. A sequence $<{X}_{1},{X}_{2},\dots,{X}_{n}>$ is contained in another sequence $<{Y}_{1},{Y}_{2},\dots,{Y}_{m}>$ if there exist integers ${i}_{1}<{i}_{2}<\dots<{i}_{n}$ such that ${X}_{1}\subseteq {Y}_{{i}_{1}},{X}_{2}\subseteq {Y}_{{i}_{2}},\dots,{X}_{n}\subseteq {Y}_{{i}_{n}}$. The sequence $<{Y}_{{i}_{1}},{Y}_{{i}_{2}},\dots,{Y}_{{i}_{n}}>$ is called an occurrence of $X$ in $Y$.
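Containment of one sequence in another, as defined above, can be checked with a greedy scan; the device sequences below are illustrative:

```python
def contained_in(X, Y):
    """True if sequence X = <X1,...,Xn> is contained in Y = <Y1,...,Ym>,
    i.e. there exist i1 < ... < in with each Xj a subset of Y_ij.
    A greedy left-to-right scan suffices for plain containment."""
    i = 0
    for element in Y:
        if i < len(X) and X[i] <= element:   # X[i] is a subset of element
            i += 1
    return i == len(X)

Y = [{"kettle"}, {"microwave"}, {"kettle", "washing machine", "tumble dryer"}]
print(contained_in([{"kettle"}, {"washing machine", "tumble dryer"}], Y))  # True
print(contained_in([{"dish washer"}], Y))                                  # False
```

Enforcing the min-gap, max-gap and window-width constraints discussed next would additionally compare the timestamps of the matched elements.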

There are three main time constraints involved in sequential pattern discovery, namely the minimum and the maximum time gap between consecutive elements within a sequence (called min-gap and max-gap, respectively) and the size of the time window which allows for merging items into sequence elements, denoted as window-width [38].

The starting point for usage pattern detection based on the sequential association rules was to determine the transaction matrix. Each transaction has a time stamp indicating the occurrence of the elements in the specified sequence. In this case, we assume that a single sequence is the whole day; therefore, the sequence tag is the particular date. The time stamp is the hour at which specific devices were turned ON (column 3 of Table 3). The created transaction table takes into account only binary information (whether the appliance was turned ON or not) and does not include the number of switch ON states in a given hour. In the analyzed period there are theoretically 24 × 44 = 1056 transactions (the number of hours multiplied by the number of days), whereas the SPADE algorithm used here (Sequential Pattern Discovery using Equivalence classes [39]) does not include empty transactions (hours in which none of the tested devices was turned ON); therefore, the final transaction table contains only 319 transactions.
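Building such a transaction table from raw switch-ON events can be sketched as below; the events are invented, and grouping by (date, hour) automatically drops empty hours and collapses repeated switch-ONs to binary membership:

```python
from collections import defaultdict

# Hypothetical switch-ON events as (date, hour, appliance) triples.
events = [
    ("20120910", 8, "kettle"),
    ("20120910", 9, "kettle"), ("20120910", 9, "microwave"),
    ("20120910", 9, "kettle"),            # repeat within the same hour
    ("20120911", 10, "dish washer"),
]

# One transaction per (date, hour): sets give binary ON/OFF information,
# and hours with no events simply never appear (no empty transactions).
transactions = defaultdict(set)
for date, hour, appliance in events:
    transactions[(date, hour)].add(appliance)

print(sorted(transactions.items()))
```

Here the date plays the role of the sequence stamp and the hour the role of the time stamp, matching the layout of Table 3.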

Given the rules with a support of more than 0.1, a minimum time difference between successive elements in the sequence of 1 and a maximum time difference between successive elements in the sequence of 1, the following behavior patterns can be observed:

- with a support equal to 0.1 and a confidence of 100%, if in a certain hour the washing machine operated, in the next hour the tumble dryer and kettle operated;
- with a support equal to 0.1 and a confidence of 100%, if in a certain hour the washing machine operated, in the next hour the washing machine and kettle operated, and if in the next hour the washing machine also operated, so did the tumble dryer and kettle;
- rule No. 4, with a support equal to 0.15 and a confidence of 75%, shows that the occurrence in a sequence of such devices as the kettle, dishwasher and washing machine influences the occurrence in the sequence of such appliances as the tumble dryer and kettle;
- with a support equal to 0.1 and a confidence of 66%, if in a certain hour the kettle operated and in the next hour the washing machine was turned ON, then in the following hour the washing machine and microwave were in operation.

All these observed sequential rules have a lift greater than one, which means that the occurrence of the elements on the left-hand side of each rule influences the occurrence of the elements on its right-hand side (Table 4).

**Table 3.**
Part of the transaction table.

| Sequence Stamp | Time Stamp | Elements |
|---|---|---|
| 20120910 | 8 | kettle |
| 20120910 | 9 | kettle, microwave |
| 20120910 | 10 | kettle, dish washer |
| 20120910 | 11 | kettle, dish washer |
| 20120910 | 18 | microwave |
| 20120910 | 19 | kettle |
| 20120910 | 20 | washing machine |
| 20120910 | 21 | washing machine, tumble dryer |
| 20120910 | 22 | microwave, washing machine, tumble dryer |
| 20120911 | 10 | kettle, microwave, dish washer, tumble dryer |
| 20120911 | 11 | tumble dryer, dish washer |
| 20120911 | 12 | kettle |
| 20120911 | 13 | microwave |
| 20120911 | 19 | washing machine |
| 20120911 | 20 | microwave, washing machine |
| 20120911 | 21 | kettle, microwave, tumble dryer |

**Table 4.**
Selected sequential association rules.

| Sequence | Support | Confidence | Lift |
|---|---|---|---|
| {washing machine} => {kettle, tumble dryer} | 0.10 | 1.00 | 4.44 |
| {kettle} => {kettle, tumble dryer} | 0.10 | 1.00 | 4.44 |
| {washing machine},{kettle, washing machine},{washing machine} => {kettle, tumble dryer} | 0.10 | 1.00 | 4.44 |
| {kettle},{dish washer},{kettle},{washing machine},{washing machine} => {kettle, tumble dryer} | 0.15 | 0.75 | 3.33 |
| {washing machine},{kettle},{washing machine} => {washing machine, tumble dryer} | 0.10 | 0.66 | 2.96 |
| {kettle},{washing machine} => {microwave, washing machine} | 0.10 | 0.66 | 2.96 |