## 1. Introduction

The analysis of structural regularities in the connectivity of complex systems has been a major field of research since the beginning of the so-called “theory of complex networks” [1, 2]. The most prominent approach is community detection, which aims to analyze the mesoscale of complex networks [3, 4]. The definition of community in networks is rather qualitative: a community is a set of nodes with more connections among them than with the rest of the network nodes. The success of this approach rests on results that have been shown to reproduce known, explicit communities in networks. However, fewer works have concentrated on finding other kinds of regularities in networks. Particularly interesting is the detection of similarities between nodes [5] in a network. Similarities can arise from different measures that represent diverse target patterns; for example, we can search for patterns in local connectivity or in global connectivity, defining a local distance or global information, respectively. Here we concentrate on the hierarchical clustering of nodes in a network, based on their (local or global) connectivity similarities. In general, many pairs of nodes will share the same similarity value, and the problem of clustering according to a hierarchy then becomes non-unique.

Hierarchical clustering methods are widely used to classify data items into a hierarchy of clusters organized in a tree structure called a dendrogram. Agglomerative hierarchical clustering [6] starts from a distance matrix between items, each one forming a singleton cluster, and gathers clusters into groups of clusters, the process being repeated until a complete hierarchy of partitions into clusters is formed. There are different types of agglomerative methods, such as single linkage, complete linkage, unweighted average, weighted average, etc., which differ only in the definition of the distance measure between clusters. To name just a few, uses of hierarchical clustering include the classification of organisms from different populations or species [7], the determination of sets of genes with similar expression profiles [8, 9], and the classification of proteins according to sequence similarity [10, 11].

Except for the single linkage case, all the other agglomerative hierarchical clustering techniques suffer from a non-uniqueness problem, sometimes called the ties in proximity problem, when the standard pair-group algorithm is used. This problem arises when there are more than two clusters separated by the same minimum distance during the agglomerative process. The standard approach consists of choosing any one pair of clusters, thereby breaking the tie, and proceeding in the same way until a final hierarchical classification is obtained. However, different output clusterings are possible depending on the criterion used to break ties, and very frequently the results of a hierarchical cluster analysis depend on the order of the observations in the input data file.
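As a minimal illustration of when a tie occurs, the following Python sketch lists every pair of items separated by the minimum distance. The function name `minimum_distance_pairs` and the three-item matrix are hypothetical and chosen only for this example. Distances are compared exactly, which is adequate for integer data; real-valued data would presumably need a tolerance.

```python
from itertools import combinations


def minimum_distance_pairs(dist):
    """Return the minimum off-diagonal distance and all pairs attaining it.

    `dist` is a symmetric distance matrix given as a list of lists.
    """
    n = len(dist)
    index_pairs = list(combinations(range(n), 2))
    d_min = min(dist[i][j] for i, j in index_pairs)
    pairs = [(i, j) for i, j in index_pairs if dist[i][j] == d_min]
    return d_min, pairs


# Illustrative data: three items placed on a line at positions 0, 1 and 2.
dist = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]

d_min, pairs = minimum_distance_pairs(dist)
has_tie = len(pairs) > 1  # more than one candidate pair means a tie
print(d_min, pairs, has_tie)
```

Here the pairs (0, 1) and (1, 2) are both at distance 1, so a pair-group algorithm has to choose one of them arbitrarily.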

The ties in proximity problem is well known from several studies in different fields, for example in biology [12, 13], psychology [14], and chemistry [15]. Generally speaking, the problem will arise whenever discrete values are used to represent similarity between elements, and possibly also with continuous-valued functions. The existence of ties can make the number of possible binary dendrograms grow exponentially with the number of elements. This problem is usually ignored by software packages [16, 17], while some other packages just warn against the existence of ties in data sets.

Here we make use of multidendrograms, a variable-group algorithm [18] that solves the non-uniqueness problem found in the standard pair-group approach. In Section 2 we describe the variable-group algorithm, which groups more than two clusters at the same time when ties occur. In Section 3 we show several case studies where we use multidendrograms. Finally, in Section 4, we give some concluding remarks.

## 2. Multidendrograms Algorithm

Agglomerative hierarchical procedures build a hierarchical classification in a bottom-up way, starting from a matrix of distances (or weights) between n individuals. The standard pair-group algorithm has the following steps:

- (1) Initialize n singleton clusters with one individual in each of them. Initialize also the distances between clusters with the values of the distances between individuals.

- (2) Find the minimum distance separating two different clusters.

- (3) Select two clusters separated by such minimum distance and merge them into a new supercluster.

- (4) Compute the distances between the new supercluster and each of the other clusters. Depending on the criterion used to compute these distances, different agglomerative hierarchical clusterings are obtained: single linkage, complete linkage, unweighted average, weighted average, unweighted centroid, weighted centroid, and Ward’s method are the most commonly used.

- (5) If not all individuals are in the same cluster yet, then go back to Step 2.
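The steps above can be sketched in Python as follows. This is a simplified illustration, not the implementation of any particular package: it assumes complete linkage, recomputes cluster distances from the item distances instead of updating them incrementally (Step 4), and breaks ties by taking the first minimal pair found. The names `pair_group` and `complete_linkage` and the three-item data set are hypothetical.

```python
from itertools import combinations


def complete_linkage(a, b, dist):
    """Complete linkage: largest item distance between clusters a and b."""
    return max(dist[frozenset((x, y))] for x in a for y in b)


def pair_group(labels, dist):
    """Pair-group clustering; returns merges as (cluster, cluster, height)."""
    # Step 1: one singleton cluster per individual.
    clusters = [(x,) for x in labels]
    merges = []
    # Step 5: repeat until all individuals are in the same cluster.
    while len(clusters) > 1:
        # Steps 2 and 3: find the minimum distance and keep the FIRST pair
        # attaining it, so ties are broken by the current cluster order.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            h = complete_linkage(clusters[i], clusters[j], dist)
            if best is None or h < best[0]:
                best = (h, i, j)
        h, i, j = best
        merges.append((clusters[i], clusters[j], h))
        merged = clusters[i] + clusters[j]
        # Step 4 is implicit: distances to the new supercluster are
        # recomputed from item distances in the next iteration.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Illustrative data with a tie: d(a,b) = d(b,c) = 1 and d(a,c) = 2.
dist = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 2,
}

merges_abc = pair_group(["a", "b", "c"], dist)
merges_cba = pair_group(["c", "b", "a"], dist)
first_abc = set(merges_abc[0][0] + merges_abc[0][1])
first_cba = set(merges_cba[0][0] + merges_cba[0][1])
print(sorted(first_abc), sorted(first_cba))  # ['a', 'b'] ['b', 'c']
```

With the same distances, the input order a, b, c merges a with b first, whereas the order c, b, a merges b with c first, so the two runs return different hierarchies.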

The ties in proximity problem arises when there are more than two clusters separated by the same minimum distance in Step 3 of the algorithm. To ensure uniqueness in the agglomerative hierarchical clustering, multidendrograms implement a variable-group algorithm [18] that groups more than two clusters at the same time when ties occur. Its main properties are:

- When there are no ties, multidendrograms give the same result as the pair-group algorithm.

- It always gives a uniquely determined solution thanks to the implementation of the variable-group algorithm.

- In the multidendrogram representation of the results, the occurrence of ties during the agglomerative process can be explicitly observed, and a subsequent notion of the degree of heterogeneity inside the tied clusters is obtained.
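The idea of grouping all tied clusters at once can be sketched as follows. This is a simplified reading of the variable-group approach and not the algorithm of [18] nor the code of any application. It assumes complete linkage, treats tied pairs as edges of a graph whose connected components become the new superclusters, and records for each supercluster the minimum distance together with the largest distance among its tied members as a rough indication of heterogeneity. All names are hypothetical.

```python
from itertools import combinations


def complete_linkage(a, b, dist):
    """Complete linkage: largest item distance between clusters a and b."""
    return max(dist[frozenset((x, y))] for x in a for y in b)


def variable_group(labels, dist):
    """Merge every group of tied clusters at once.

    Returns a list of (member_clusters, d_min, d_max) for each supercluster.
    """
    clusters = [frozenset([x]) for x in labels]
    merges = []
    while len(clusters) > 1:
        index_pairs = list(combinations(range(len(clusters)), 2))
        pair_d = {(i, j): complete_linkage(clusters[i], clusters[j], dist)
                  for i, j in index_pairs}
        d_min = min(pair_d.values())
        tied = [p for p, h in pair_d.items() if h == d_min]

        # Connected components of the tie graph, via a small union-find.
        parent = list(range(len(clusters)))

        def find(k):
            while parent[k] != k:
                parent[k] = parent[parent[k]]
                k = parent[k]
            return k

        for i, j in tied:
            parent[find(i)] = find(j)
        groups = {}
        for k in range(len(clusters)):
            groups.setdefault(find(k), []).append(k)

        new_clusters = []
        for members in groups.values():
            if len(members) == 1:  # cluster not involved in any tie
                new_clusters.append(clusters[members[0]])
                continue
            # Largest distance among the tied members of this supercluster.
            d_max = max(pair_d[(i, j)] for i, j in combinations(members, 2))
            names = sorted(sorted(clusters[k]) for k in members)
            merges.append((names, d_min, d_max))
            new_clusters.append(frozenset().union(*(clusters[k] for k in members)))
        clusters = new_clusters
    return merges


# Illustrative data with a tie: d(a,b) = d(b,c) = 1 and d(a,c) = 2.
dist = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 2,
}

result_abc = variable_group(["a", "b", "c"], dist)
result_cba = variable_group(["c", "b", "a"], dist)
print(result_abc == result_cba, result_abc)
```

In this sketch the three items are merged in a single step spanning distances 1 to 2, and the result no longer depends on the input order.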

The algorithm has been encapsulated in a public application, MultiDendrograms [19], that allows the tuning of many graphical representation parameters and whose results can be easily exported to file. Its other characteristics include: a graphical user interface with data selection, hierarchical clustering options, layout parameters, navigation across the dendrogram, etc.; command-line direct calculation without the graphical interface; support for both distance and weight matrices; calculation of the ultrametric matrix and of deviation measures such as the cophenetic correlation coefficient, normalized mean squared error, and normalized mean absolute error; saving of dendrogram details in text and Newick tree format; and export of the dendrogram image as JPG, PNG, and EPS.
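To make the deviation measures concrete, the sketch below compares an original distance matrix with an ultrametric matrix. The normalizations used here (squared or absolute differences divided by the corresponding sum over the original distances) are an assumption and may differ from the formulas implemented in the application. The function names and both matrices are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Values above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def cophenetic_correlation(d, u):
    """Pearson correlation between original and ultrametric distances.

    Undefined (division by zero) if either set of values has zero variance.
    """
    x, y = upper_triangle(d), upper_triangle(u)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def normalized_mse(d, u):
    """Sum of squared differences over the sum of squared distances (assumed)."""
    x, y = upper_triangle(d), upper_triangle(u)
    return sum((a - b) ** 2 for a, b in zip(x, y)) / sum(a ** 2 for a in x)


def normalized_mae(d, u):
    """Sum of absolute differences over the sum of absolute distances (assumed)."""
    x, y = upper_triangle(d), upper_triangle(u)
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(abs(a) for a in x)


# Illustrative data: original distances and the ultrametric distances of a
# dendrogram that merges the first two items at height 1 and the third at 2.
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
u = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]

ccc = cophenetic_correlation(d, u)
nmse = normalized_mse(d, u)
nmae = normalized_mae(d, u)
print(ccc, nmse, nmae)
```

A cophenetic correlation close to 1 and errors close to 0 would indicate that the hierarchy distorts the original distances only slightly.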

## 4. Conclusions

The search for structural patterns in complex systems is approached herein through the hierarchical clustering of their elements, according to different similarity (or distance) measures. The correct visualization of the hierarchy is essential to discern these patterns. We have shown the feasibility of applying multidendrograms to scrutinize the hierarchical structure emerging in different complex networked systems. We have focussed on those representations that, because of inherent symmetries, give rise to ties when deciding which groups to merge. This information can be very useful in several scenarios, e.g., discerning groups according to vertex similarity, modular node similarity, and data similarity.

The non-uniqueness problem found in the standard pair-group algorithm for agglomerative hierarchical clustering is usually ignored by standard implementations. Software packages either ignore ties or fail to adopt a common standard for handling them, many of them simply breaking ties in an arbitrary way. However, different output clusterings are possible depending on the criterion used to break ties, and very frequently the results depend on the input order of the observations. The selection of just one of the possible classifications in such cases can be misleading, and the user is usually unaware of this problem, taking for granted the output given by the software.