Detection of Representative Variables in Complex Systems with Interpretable Rules Using Core-Clusters

Abstract: In this paper, we present a new framework dedicated to the robust detection of representative variables in high dimensional spaces with a potentially limited number of observations. Representative variables are selected by using an original regularization strategy: they are the centers of specific variable clusters, denoted CORE-clusters, which respect fully interpretable constraints. Each CORE-cluster indeed contains more than a predefined number of variables, and each pair of its variables has a coherent behavior in the observed data. The key advantage of our regularization strategy is therefore that it only requires tuning two intuitive parameters: the minimal dimension of the CORE-clusters and the minimum level of similarity which gathers their variables. Interpreting the role played by a selected representative variable is additionally straightforward, as its observed behavior is similar to that of a controlled number of other variables. After introducing and justifying this variable selection formalism, we propose two algorithmic strategies to detect the CORE-clusters, one of them scaling particularly well to high-dimensional data. Results obtained on synthetic as well as real data are finally presented.


Introduction
Discovering representative variables in high dimensional and complex systems with a limited number of observations is a recurrent problem in machine learning. Heterogeneity in the variables' behaviors and multiple similarities between variable subsets often make this process ambiguous. A convenient strategy to solve this task is to associate each representative variable of the complex system with a cluster of variables, and to model the relations between variables in a graph [1]. Complex systems are indeed typically modeled as undirected weighted graphs [2,3], where the nodes (vertices) represent the variables and the edge weights are a measure of the observed similarity between the variables of the dataset. The choice of a specific clustering algorithm among the wide variety of traditional methods often depends on the nature of the data (e.g., their structure or size). Determining the granularity level of the clusters is also a common issue in high dimensional data clustering. If the clustering granularity is high, some clusters have a high similarity rate between the nodes/variables they contain, but potentially many other clusters only contain noisy relations. A large number of selected representative variables can then be meaningless. Conversely, a low granularity leads to a few large clusters with highly heterogeneous internal behaviors, which makes it hard to identify the representative variables of the system in each cluster. Importantly, this issue is particularly critical when the number of observations n is lower than the observation dimension p, because of the instability related to high dimensionality and high complexity.
As in the k-means clustering algorithms [4], we will use, in this paper, the notion of cores to address the choice of the granularity level with an interpretable strategy. Based on a distance function, the k-means algorithms indeed form a controlled number of clusters. This notion of core is also used in [5,6], where the graph is partitioned into maximal groups of entities, each connected to at least k other entities in the group. The method of [7] is also related to the notion of coreness, as it hierarchically calculates the core number of each node with a complexity in the order of O(p). Strong connections also exist between the issues we address and the notion of coresets [8]. This notion has recently gained significant interest in the machine learning community, as it deals with finding representative observations, and not variables, in large datasets. As described in [9], it can be used in supervised learning to reduce the size of large training sets. In a similar vein, it can also be used to robustify the generalization of the trained decision rules [10], for neural network compression [11] or for unsupervised learning [12]. Note that [12] is also particularly close to core methods, as it specifically deals with k-clustering, i.e., finding at most k cluster centers. The proposed method however does not address explicit interpretability issues.
The challenge of high dimensionality is clearly raised in [13][14][15], where the authors proposed different feature selection techniques with an explicit regularization in order to speed up data mining algorithms and to improve mining performance. In this spirit, [16] recently developed an approach based on an iterative spectral optimization technique that improves the quality, computation time and scalability to high dimensions of an existing alternative clustering method (kernel dimension alternative clustering). In [17], the authors also used a multinomial regression model to automatically learn the number of clusters, and thereby limit the strong assumptions required by the model in high dimension. Note that [18] also defined a strategy for the detection of representative variables in high-dimensional data, but did not explore a regularization strategy for the case where the number of observations is much lower than the problem dimension. These strategies also require deciding on the number of clusters to determine.
Motivated by the above issues, we propose a new formalism to robustly and intuitively estimate the representative variables of complex systems. This is first made through a graph clustering strategy for which the clusters do not necessarily cover all nodes/variables. This clustering strategy specifically estimates CORE-clusters, which are connected subsets of variables having (1) a minimum number of nodes/variables, and (2) a minimal similarity level between all their variables. The representative variables are then those having the lowest average distance to all other variables in each CORE-cluster. The detection of representative variables is therefore regularized using a control on the minimal CORE-cluster size, and not on the number of representative variables to be detected or a LASSO-derived penalty term. This point of view was originally considered in [19,20]. We present here a totally reformulated version of this initial idea, which makes the selection of the representative variables fully interpretable by introducing the notion of CORE-clusters. Fine algorithmic improvements, described in this paper, also make the original clustering algorithm more efficient. A greedy version of this original algorithm, which turns out to scale particularly well to high dimensional data, is additionally proposed. New results on synthetic data, as well as comparisons with other methods, shed light on how robust and explainable the proposed strategy is. Finally, we demonstrate the validity of our formalism on two high dimensional datasets representing the expression of genes and a road network.
Our methodology is described in Sections 2 and 3 and is then tested both on simulated and real data in Section 4.

Graph-Based Representation of the Observations
Let us consider a complex system of p quantitative variables X = (X 1 , · · · , X p ) and n observations (X j 1 , · · · , X j n ), j ∈ {1, · · · , p}, of these p variables. The driving motivation of our work is to detect representative variables out of X when n ≪ p. As mentioned in Section 1, the detection of these representative variables is regularized using a graph-based approach. We then model the relations between the variables using an undirected weighted graph G(N, E), where N = (N 1 , · · · , N p ) is the node set corresponding to the p variables, and E is the edge set. We also denote by e i,j ∈ E the edge joining the nodes N i and N j , with weight w i,j .
In order to handle the properties of application-specific similarity measures that can be encoded in the edge weights w i,j , we will consider in the remainder of the paper that all w i,j ≥ 0 and that the higher w i,j the closer the observed behavior of the variables X i and X j . The weights therefore represent a notion of similarity between the variables X i and X j . For instance, if the empirical correlations cor(X i , X j ) are measured between the pairs of variables X i and X j , with (i, j) ∈ {1, . . . , p} 2 , then w i,j = |cor(X i , X j )| can reasonably be used.
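As an illustration, the correlation-based weighting scheme suggested above can be sketched in a few lines of Python (the function name and the toy data below are ours, not part of the paper):

```python
import numpy as np

def correlation_weights(X):
    """Edge weights w_ij = |cor(X_i, X_j)| from an (n x p) observation matrix.

    Hypothetical helper illustrating the weighting scheme described above;
    the formalism only requires w_ij >= 0, with larger values meaning more
    similar observed behaviors.
    """
    W = np.abs(np.corrcoef(X, rowvar=False))  # p x p matrix of |correlations|
    np.fill_diagonal(W, 0.0)                  # no self-loop e_ii in G(N, E)
    return W

# Two correlated variables and one independent variable (n = 200, p = 3).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=200), rng.normal(size=200)])
W = correlation_weights(X)
```

Any other non-negative similarity measure could be plugged in instead, as long as larger weights mean more similar behaviors.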

Coherence of a Variable Set
To define a notion of distance between two variables X i and X j , which are not directly connected in the graph, we use the notion of capacity (see [21,22]) of a path P between the corresponding nodes N i and N j in G(N, E). A path P of a graph G from X i to X j of length Λ is a list of indices {d 1 , . . . , d Λ } ⊂ {1, . . . , p} such that X i = X d 1 , X j = X d Λ , and w d l ,d l+1 is known and is not equal to 0, for all l = 1, . . . , Λ − 1. The capacity cap(P) of path P is then the minimal weight of its edges, i.e.,

cap(P) = min_{l = 1, . . . , Λ − 1} w_{d l , d l+1}.    (1)

We also denote by P i,j the set of all possible paths connecting X i to X j . The coherence c(X i , X j ) between X i and X j is then defined by considering the path P having the maximum capacity among the paths of P i,j [22], i.e.,

c(X i , X j ) = max_{P ∈ P i,j} cap(P).    (2)

If the weight w i,j is known, it is interesting to remark that the coherence c(X i , X j ) is not necessarily equal to its value: for instance, both X i and X j may be very similar to a third variable X k , but not directly similar to each other. From a computational point of view, the similarity w i,j may also be unknown if the edge e i,j is not stored in a sparsified version of the complete graph. Since n ≪ p, pertinent relations may finally be lost in w i,j but recovered in c(X i , X j ) thanks to other relations that are captured. We believe that these points are particularly important to define coherent variable sets in the complex data case.
We now denote by S a connected subset of the variable set X. The coherence c(S) of this variable subset is the minimal coherence between the variables it contains:

c(S) = min_{X i , X j ∈ S, i ≠ j} c(X i , X j ).    (3)

If all the variables of S have a coherent observed behavior, then c(S) is high. The use of this notion on synthetic data is illustrated in Appendix A. The coherence of a subset thus measures the strength of the relations between the variables it contains: the more coherent S, the more sense it makes to consider that its variables share common features measured by the chosen similarity. Decomposing the graph into maximal groups sharing a strong similarity, i.e., finding all the groups of variables with a large enough coherence, is the core of the following subsections on CORE-clusters selection.
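To make these notions concrete, the maximin path capacity and the resulting set coherence can be computed with a widest-path variant of Dijkstra's algorithm. The sketch below (function names and toy graph are ours) follows the definitions above:

```python
import heapq

def max_capacity_from(src, adj):
    """Maximin path capacities c(src, .): a widest-path analogue of
    Dijkstra's algorithm, where a path's capacity is its minimal edge weight."""
    cap = {u: 0.0 for u in adj}
    cap[src] = float("inf")
    heap = [(-cap[src], src)]
    while heap:
        neg, u = heapq.heappop(heap)
        if -neg < cap[u]:
            continue                     # stale heap entry
        for v, w in adj[u]:
            c = min(-neg, w)             # capacity of the path extended by (u, v)
            if c > cap[v]:
                cap[v] = c
                heapq.heappush(heap, (-c, v))
    return cap

def coherence(S, adj):
    """c(S): minimal pairwise maximin capacity among the variables of S."""
    return min(max_capacity_from(i, adj)[j] for i in S for j in S if i != j)

# Toy graph: variables 0, 1, 2 behave coherently; variable 3 is weakly attached.
edges = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.5), (2, 3, 0.2)]
adj = {u: [] for u in range(4)}
for i, j, w in edges:
    adj[i].append((j, w))
    adj[j].append((i, w))
```

On this toy graph, c(X 0 , X 2 ) = 0.8 exceeds the direct weight w 0,2 = 0.5 through the path via node 1, illustrating the remark above, and adding the weakly attached node 3 makes the set coherence drop from 0.8 to 0.2.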

CORE-Clusters
We recall that the goal of our formalism is to detect the representative variables of complex systems. In our formalism, each representative variable is extracted out of a CORE-cluster, defined as follows:

Definition 1. A CORE-cluster S ξ,τ ⊂ X is a connected variable subset with a size higher than τ and a coherence c(S ξ,τ ) higher than a threshold ξ.
The parameters τ and ξ ensure that each representative variable has a non-negligible coherence with at least τ − 1 other variables, which directly regularizes its selection: Large values of τ indeed lead to the detection of large sets of coherent variables. In that case, the representative variables are likely to be meaningful even if n < p. If τ is too large, each CORE-cluster may however contain several variables that would have been ideally representative. On the contrary, too small values of τ are likely to detect all meaningful representative variables, but also a non-negligible number of false positive representative variables, especially if the observations are noisy or if n < p. A good trade-off, which depends on n, p and the level of noise in the observations has then to be found when tuning τ.

CORE-Clustering
CORE-clustering consists of estimating an optimal set of CORE-clusters, so that the representative variables they contain explain as much information as possible in the observed complex system. We use the following definition:

Definition 2. CORE-clustering with parameters ξ and τ consists of finding U variable subsets S = {S u } u∈{1,...,U} , where U is not fixed, by optimizing:

max_S Σ_{u=1}^{U} card(S u),    (4)

under the two constraints: 1. All S u are connected variable sets having a size higher than τ and a coherence c(S u ξ,τ ) > ξ. They therefore correspond to CORE-clusters and can be denoted S u ξ,τ . 2. The subsets S u are pairwise disjoint, i.e., S u ∩ S v = ∅ for u ≠ v.
It may first seem that U should be as high as possible, so that the union of all S u ξ,τ contains all the variables of X. Each CORE-cluster S u ξ,τ must however have a coherence higher than the threshold ξ. As illustrated in Appendix A, the variables of X which are not coherent with at least τ other variables should ideally not be contained in any CORE-cluster, as they would make its coherence drop. The CORE-clusters in S should then only contain pertinent variables, so the optimal value for U is implicitly defined during the CORE-clustering procedure.
It is also important to remark that the number of potential subsets of X to explore to find good CORE-cluster candidates is huge, even for moderate values of p. Moreover, computing Equation (4) is particularly demanding. The two optimization algorithms of Section 3 are then aggregative and divisive algorithms designed to optimize Equation (4) without explicitly computing it.

Representative Variables Selection
We now present how each representative variable is extracted out of a CORE-cluster S ξ,τ . As mentioned in Section 2.2, the pertinent relations between two variables X i and X j may be lost in w i,j , since n ≪ p, and recovered by their coherence c(X i , X j ) using other relations. The similarity w i,j captures true positive and false negative relations, and the notion of coherence makes the detection of relations robust. For the same reasons, it may however also capture false positive relations, so the CORE-clusters may contain undesirable variables. The variables captured by CORE-clusters through false positive relations should however be located at the cluster boundaries if the data are not too noisy: false positive connections are indeed less common and on average weaker than true positive connections in this case.
In order to limit the impact of the false positive relations, we then define the representative variables as the CORE-cluster centers. More specifically, each representative variable minimizes an average distance to the other variables of a CORE-cluster S ξ,τ . Instead of using distances based on the maximum capacity of Equation (2), we use a more standard notion of distance, calculated as the sum of the weighted edges traversed by the optimal paths. This limits the phenomenon of having multiple variables with the same optimal distance due to the min-max strategy. The graph weights w i,j , which represent a similarity level, must however be converted into distances, which can simply be done with a decreasing transformation such as d i,j = 1 − w i,j . The representative variable of a CORE-cluster S ξ,τ is then the one that has the lowest average distance to the other variables of S ξ,τ . The impact of selecting the representative variables as CORE-cluster centers is discussed in Appendix A.

General Guidelines for the Choice of ξ and τ
The selection of the representative variables directly depends on the parameters ξ and τ. As explained in Section 2.3, a CORE-cluster S ξ,τ is indeed a connected variable subset with a size higher than τ and a coherence c(S ξ,τ ) higher than a threshold ξ. In order to estimate pertinent representative variables, we recommend using the following guidelines: (1) First compute how the edge weights w i,j of the graph G(N, E) are distributed. The coherence c(S ξ,τ ) represents the weakest connection between the variables of S ξ,τ , so the threshold ξ should be relatively high with respect to the observed values of w i,j . A value of ξ equal to the 80th percentile of the edge weights w i,j appears to be reasonable. (2) Choosing a suitable minimal number of variables τ in S ξ,τ is more subtle, as this choice depends both on the complexity of the relations expressed in G(N, E) and on how low the number of observations n is compared with the observation dimension p. In all generality, setting τ equal to p/10 is reasonable as a first guess. (3) If no CORE-cluster is found with the initial parameters, the user may run the CORE-clustering procedure again with lower values of ξ and τ.
From our experience, we recommend in all cases not using values of ξ lower than the 40th percentile of the edge weights or values of τ lower than 10: the CORE-clusters would be likely to contain variables with strongly heterogeneous behaviors or false positive connections in these cases. Note finally that when several connected CORE-clusters are identified with a given parametrization, it is interesting to test whether stronger CORE-clusters would be locally found by using higher values of ξ or τ. This is illustrated in Section 4.3.
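A first-guess parametrization following these guidelines can be computed directly from the empirical weight distribution; the helper below is a hypothetical illustration (the floor of 10 on τ follows the recommendation above):

```python
import numpy as np

def default_parameters(W, p):
    """Guideline defaults for (xi, tau): xi as the 80th percentile of the
    off-diagonal edge weights, tau as p / 10 (floored, at least 10).
    First-guess values from the guidelines above, not tuned constants."""
    iu = np.triu_indices(p, k=1)          # each undirected edge counted once
    xi = float(np.percentile(W[iu], 80))
    tau = max(10, p // 10)
    return xi, tau

# Toy symmetric weight matrix over p = 4 variables (6 undirected edges).
vals = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
W = np.zeros((4, 4))
W[np.triu_indices(4, k=1)] = vals
W = W + W.T
xi, tau = default_parameters(W, 4)
```

If this parametrization returns no CORE-cluster, guideline (3) suggests lowering ξ (down to the 40th percentile at most) and τ (down to 10 at most).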

Main Interactions Estimation
The very first step of our strategy is to quantify the similarity between the different observed variables. The similarity is first computed using the absolute value of Pearson's correlation and represented as a dense graph G(N, E), where N contains p nodes, each of them representing one of the observed variables, and E contains K E = p(p − 1)/2 undirected weighted edges that model a similarity level between all pairs of variables. In this approach, the variables with no connection are associated with a correlation coefficient of zero. The algorithmic cost of this estimation is O(np 2 ), but it can also easily be parallelized using divide and conquer algorithms for reasonably large datasets, as in [23]. For large to very large datasets, correlations should be computed on sparse matrices, using e.g., [24], to make this task computationally tractable.

CORE-Clustering Algorithms
Inspired by [22], who solved the maximum capacity problem of [21] using optimal spanning trees, we estimate the CORE-clusters on optimal spanning trees. A spanning tree G(N, T) is a subgraph of G(N, E) with no cycle and T ⊂ E. The maximum spanning tree of G is then the spanning tree of G having the maximal sum of edge weights. Using the maximum spanning tree to detect the CORE-clusters strongly limits the potential number of node combinations to test, while preserving the graph edges that are likely to be good candidates for the optimal paths of Equation (2). Note that it is straightforward to show that the coherence of a variable subset in G(N, T) is lower than or equal to the coherence of the same variable subset in G(N, E): the edges of T are a subset of E, so the maximum capacity of Equation (2) between two variables X i and X j is lower or equal on T than on E. The CORE-clusters computed in G(N, T) are therefore eligible CORE-clusters on G(N, E). When discussing the algorithmic cost of our algorithms, we will make it clear that this reasonable domain reduction makes our problem scalable to large datasets. The impact of using maximal spanning trees on the measure of a cluster coherence is also discussed in Appendix A on synthetic examples.

Maximum Spanning Tree
The maximum spanning tree of G is a simple and reliable model of the relationships between the graph nodes (only p − 1 links). One of the best-known algorithms developed to find such trees is Kruskal's algorithm [25]. The maximum spanning tree is built by adding edges step by step, so that no cycle appears in the partial graph.
We denote by G(N, T) the resulting graph, where T has a tree structure. Details of the algorithm are given in Algorithm 1. The algorithmic cost of the sort procedure (row 1) is O(K E log (K E )). Then, the for loop (rows 4 to 10) only scans the edges once. In this for loop, the most demanding procedure is the propagation of the label L(N i ) at row 8. Fortunately, the nodes on which the labels are propagated are those already related to N j in G(N, E). We can then use a depth-first search algorithm [26] for this task, making the average performance of the for loop O(K E log (p)).

Algorithm 1 Maximum spanning tree algorithm
1: Sort the edges by decreasing weights, so that W(E 1 ) ≥ · · · ≥ W(E K E ).
2: Assign label L(N i ) = i to each node N i .
3: Initiate an edge list T as void.
4: for k = 1 : K E do
5:   We denote N i and N j the nodes linked by edge E k .
6:   if L(N i ) ≠ L(N j ) then
7:     Add edge E k to the list T.
8:     Propagate the label L(N i ) to the nodes that have label L(N j ).
9:   end if
10: end for
11: return Graph with a tree structure G(N, T).
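A compact Python sketch of Algorithm 1, where a union-find structure plays the role of the label propagation of rows 6 to 8 (an equivalent and idiomatic bookkeeping; the function names are ours):

```python
def maximum_spanning_tree(p, edges):
    """Kruskal's algorithm on edges (i, j, w), following Algorithm 1.
    Two nodes share a label iff they share a union-find root, so the
    root comparison below is the same test as L(N_i) != L(N_j)."""
    parent = list(range(p))

    def find(x):                       # root of x, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    T = []
    for i, j, w in sorted(edges, key=lambda e: -e[2]):  # decreasing weights
        ri, rj = find(i), find(j)
        if ri != rj:                   # adding (i, j) creates no cycle
            parent[rj] = ri            # merge the two label groups
            T.append((i, j, w))
    return T

edges = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.5), (2, 3, 0.2)]
T = maximum_spanning_tree(4, edges)
```

On this toy graph the weakest edge of the triangle, (0, 2, 0.5), is the only one discarded, and the returned tree has the expected p − 1 = 3 edges.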

CORE-Clustering Algorithm
In contrast with other clustering techniques, the CORE-clustering approach detects clusters having an explicitly controlled granularity level, and only gathers nodes/variables linked by paths of maximal capacity. CORE-clusters are detected from the maximal spanning tree G(N, T) by iteratively gathering its nodes N in an order that depends on the edge weights in T (increasing weight order). A detected CORE-cluster has a size higher than τ, where τ is the parameter that controls the granularity level. Thus, Algorithm 2 first generates many small and meaningless clusters, and the parameter τ should not be too small, so that these clusters are not considered as CORE-clusters. Then, the first pertinent clusters are established using the edges of T with larger weights, leading to pertinent node groups. Finally, the largest weights of T are treated at the end of Algorithm 2, in order to split the nodes related to the most influential nodes/variables into several CORE-clusters.

Algorithm 2 CORE-clustering algorithm
Require: Graph G(N, T) with nodes N i , i ∈ {1, · · · , p}, and edges T k , k ∈ {1, · · · , K T }.
Require: Weight of edge T k is W(T k ).
Require: Granularity coefficient τ and threshold ξ.
1: Sort the edges of T by increasing weights.
2: Assign label L(N i ) = i to each node N i .
...
6: We denote N i and N j the nodes linked by edge T k .
7: if L(N i ) ≠ L(N j ) then
8:   Propagate the label L(N i ) to the nodes that have label L(N j ).
...
It is worth mentioning that rows 20 to 25 of Algorithm 2 are the only ones that require computing the coherence c of the estimated CORE-clusters. Computing the coherence of Equation (3) is indeed demanding, so it is treated here as a post-processing step limited to pre-computed CORE-clusters. In practice, it is also performed on the maximal spanning tree G(N, T) and not on the whole graph G(N, E).
Note also that the algorithmic structures of Algorithms 1 and 2 are similar. However, Algorithm 2 runs on G(N, T) and not on G(N, E). The number of edges K T in G(N, T) is much lower than K E , since T has a tree structure and not a complete graph structure: it is equal to p − 1 [27], which is much lower than K E = p(p − 1)/2. Moreover, the propagation algorithm (rows 9 and 11) will never propagate labels on more than 2τ − 1 nodes. The algorithmic cost of the sort procedure (row 1) is then O(K T log (K T )), and the average performance of the for loop is O(K T log (τ)).

A Greedy Alternative for CORE-Clusters Detection
We propose an alternative strategy for the CORE-clusters detection: the edge treatment queue may be ordered by decreasing edge weights instead of increasing edge weights. The strongest edges are then gathered first, producing coherent CORE-clusters as in Algorithm 2, although one CORE-cluster may now contain several representative variables. To avoid gathering noisy information, the for loop on the edges (row 6 of Algorithm 2) should also stop before meaningless edges are treated. This strategy has a key interest: it can strongly reduce the computational time dedicated to Algorithms 1 and 2. With these modifications, Algorithms 1 and 2 together become purely equivalent to Algorithm 3.
Algorithm 3 Greedy CORE-clustering algorithm
1: Sort the edges by decreasing weights.
2: Define the number of edges γ having a weight higher than ξ.
3: Assign label L(N i ) = i to each node N i .
4: Set CORElabel = −1.
5: for k = 1 : γ do
6:   We denote N i and N j the nodes linked by edge E k .
7:   if L(N i ) ≠ L(N j ) then
8:     Propagate the label L(N i ) to the nodes that have label L(N j ).
9:     if number of nodes with label L(N i ) ∈ {τ, · · · , 2τ − 1} then
10:      Label CORElabel is given to the nodes with label L(N i ).
11:      Set CORElabel = CORElabel − 1.
12:    end if
13:  end if
14: end for
15: return the groups of nodes sharing a negative label as CORE-clusters.
Again, the structure of this algorithm is very similar to that of the maximum spanning tree strategy of Section 3.1. The algorithmic cost of the sort procedure (row 1) is O(K E log (K E )). Then, the loop of rows 5 to 14 scans γ edges, where γ is the number of edges having a weight higher than ξ. In most cases, γ should be much lower than K E , which strongly limits the computational impact of this loop. Label propagations in this loop (rows 8 and 10) are also limited to 2τ − 1 nodes. The average performance of the for loop is therefore O(γ log (τ)).
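Putting these pieces together, a minimal Python sketch of the greedy strategy could look as follows. The dictionary-based label bookkeeping is our implementation choice, and frozen clusters are removed from the merge pool so that rows 9 and 10 are applied at most once per group:

```python
def greedy_core_clusters(p, edges, tau, xi):
    """Sketch of the greedy CORE-clustering (Algorithm 3): scan the edges
    by decreasing weight, stop below xi, merge label groups, and freeze a
    group as a CORE-cluster once its size falls in [tau, 2*tau - 1]."""
    label = list(range(p))                  # L(N_i) = i
    members = {i: [i] for i in range(p)}    # nodes currently carrying each label
    clusters = []
    for i, j, w in sorted(edges, key=lambda e: -e[2]):
        if w <= xi:                         # only the gamma strongest edges
            break
        li, lj = label[i], label[j]
        if li != lj and li in members and lj in members:
            for v in members[lj]:           # propagate label L(N_i)
                label[v] = li
            members[li] += members.pop(lj)
            if tau <= len(members[li]) <= 2 * tau - 1:
                clusters.append(members.pop(li))   # freeze as a CORE-cluster
    return clusters

# Two groups of strongly similar variables, joined by one weak edge.
edges = [(0, 1, 0.9), (1, 2, 0.85), (3, 4, 0.8), (4, 5, 0.75), (2, 3, 0.1)]
clusters = greedy_core_clusters(6, edges, tau=3, xi=0.3)
```

The weak edge (2, 3) falls below ξ and is never treated, so the two groups are frozen separately, as the early-stopping rule intends.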

Central Variables Selection in CORE-Clusters
Once a CORE-cluster is identified, we use a straightforward strategy to select its central variable: the distance between all pairs of variables in each CORE-cluster is computed using Dijkstra's algorithm [28,29] in G(N, E). The central variable is then the one that has the lowest average distance to all other connected variables in the CORE-cluster. As the computed CORE-clusters have less than 2τ nodes, the algorithmic cost of this procedure is O(τ 2 ) times the number of detected CORE-clusters, which should remain low, even for large datasets.
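The center selection step can be sketched as follows; the d = 1 − w conversion is one simple decreasing similarity-to-distance transform, chosen here for illustration, and the function names are ours:

```python
import heapq

def central_variable(cluster, adj):
    """Pick the CORE-cluster center: the variable with the lowest average
    shortest-path distance to the other cluster members, using Dijkstra's
    algorithm on distances d = 1 - w."""
    def dijkstra(src):
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                    # stale heap entry
            for v, w in adj.get(u, []):
                nd = d + (1.0 - w)          # similarity -> distance
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist

    def mean_dist(i):
        d = dijkstra(i)
        return sum(d[j] for j in cluster if j != i) / (len(cluster) - 1)

    return min(cluster, key=mean_dist)

# Star-shaped CORE-cluster: variable 1 sits between variables 0 and 2.
edges = [(0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.5)]
adj = {u: [] for u in range(3)}
for i, j, w in edges:
    adj[i].append((j, w))
    adj[j].append((i, w))
```

Here variable 1 is selected: its average distance to the others (0.1) is lower than that of the two boundary variables (0.15 each).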

Core Clustering of Simulated Networks
In this section, we compare our CORE-clustering algorithms with other standard methods on simulated scale-free networks. Such complex networks are indeed common in the data science literature. They contain a small number of highly connected nodes (hubs), which we will assimilate to representative variables, and many poorly connected nodes.

Experimental Protocol
To generate the synthetic networks, we first simulated the profile (observations) of the representative variables and then the profiles of the remaining variables around these hubs. We considered K different clusters of sizes p C 1 , p C 2 , . . ., p C K . A simulated expression data set X ∈ R n×p is then composed of p = p C 1 + . . . + p C K variables. In the cluster k, the observations are simulated as follows: (1) Generate the observations x (1,k) of a representative variable using a normal distribution N (0, 1). (2) Choose a minimum correlation r min and a maximum correlation r max between the representative variable and the other variables of the predefined cluster. In this paper, we always used r max = 1 and r min = 0.5. (3) Generate the profiles x (j,k) , with j ∈ {2, . . . , p C k }, such that the correlation of the j-th profile with the profile of the representative variable lies between r min and r max . Three different types of networks were simulated using this protocol with different parameterizations. (a) The first type of network consisted of simulating n = 100 observations of 40 variables with K = 2 clusters of 20 variables. The additional noise was simulated with a level α ranging from 0.25 to 1.5. (b) The same protocol was used for the second type of networks, but K = 5 clusters of 7 variables were simulated. (c) The third type of networks consisted in varying the number of observations from n = 5 to n = 30, with K = 5 clusters having 50 to 100 variables.
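The protocol for one cluster can be sketched as follows. The mixing formula x_j = r · hub + sqrt(1 − r²) · noise, which yields a target correlation r with the hub, is our assumption: the paper controls the correlation range [r min , r max ] but does not spell out this construction:

```python
import numpy as np

def simulate_cluster(n, p_c, r_min=0.5, r_max=1.0, rng=None):
    """Sketch of the simulation protocol for one cluster: a hub profile
    drawn from N(0, 1) and p_c - 1 profiles whose target correlations with
    the hub are spread over [r_min, r_max]."""
    rng = rng or np.random.default_rng()
    hub = rng.normal(size=n)               # step (1): representative variable
    X = [hub]
    targets = np.linspace(r_max, r_min, p_c - 1)
    for r in targets:                      # step (3): correlated profiles
        noise = rng.normal(size=n)
        X.append(r * hub + np.sqrt(1.0 - r ** 2) * noise)
    return np.column_stack(X)              # column 0 is the hub

X = simulate_cluster(n=500, p_c=10, rng=np.random.default_rng(1))
```

Concatenating K such blocks column-wise yields a full simulated dataset with p = p C 1 + . . . + p C K variables.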
Note finally that 30 networks of each type were generated to assess the stability of our methodology. This makes it possible to draw in Figure 1 the box-plots of the clustering quality for each type of network.

Measure of Clustering Quality
There exist various criteria to measure a clustering quality. External indices, such as the impurity and Gini indices, measure the extent to which the clusters match externally supplied class labels. Internal indices, like the modularity and the intra-cluster to inter-cluster distance ratio, are also used to measure the quality of a clustering structure without any external information. Such criteria are however not suitable here, as we do not clusterize all variables, but rather extract core structures that emphasize representative variables. Thus, we propose the following criterion for simulated data, for which the block structure of the similarity matrix is known: Let X i , i ∈ {1, · · · , p} be the variables and C j (j ∈ [1, K]) be the ground-truth CORE-clusters of variables, typically on synthetic data. As in Section 2, we also denote S u ξ,τ (u ∈ [1, U]) the CORE-clusters predicted by our algorithm. In order to evaluate the quality of the prediction, we compute a score R defined as:

R u = max_{1≤j≤K} card(C j ∩ S u ξ,τ) and R = (1/p) Σ_{u=1}^{U} R u .    (5)

To compute this score, each R u is equal to 0 if there is no overlap between any C j and S u ξ,τ , and is equal to the number of variables in C j if an S u ξ,τ contains all the variables of C j . A score R equal to 1 then means that a perfectly accurate estimation of the C j was reached, and the closer to 0 its value, the less accurate the CORE-clusters detection.
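As an illustration, the score R can be computed in a few lines. The credit given to partial overlaps (each predicted CORE-cluster earns its best overlap with any ground-truth cluster) is our interpretation of the description above, which only fixes the two extreme cases:

```python
def score_R(p, truth, predicted):
    """Score R: each predicted CORE-cluster earns the cardinality of its
    best overlap with a ground-truth cluster, normalized by the dimension p.
    Full credit card(C_j) when C_j is entirely recovered, zero when there
    is no overlap (the extremes stated in the text)."""
    total = 0
    for S in predicted:
        total += max((len(set(C) & set(S)) for C in truth), default=0)
    return total / p
```

For example, with ground truth {0, 1, 2} and {3, 4, 5} over p = 6 variables, recovering both clusters gives R = 1, while recovering only the first gives R = 0.5.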

Results
We compared the standard and greedy CORE-clustering algorithms (Algorithms 2 and 3) on these simulated datasets with two other graph-based clustering algorithms: the spectral clustering [30,31], available in the R-package anocva, and the Louvain method for community detection [32], available in the R-package igraph. Note that the Louvain method requires as input a graph modeling the dataset (the correlation matrix is transformed upstream into a graph), but not the final number of clusters. Boxplots of the computed scores R (see Equation (5)) are shown in Figure 1. Note that in each boxplot, the dots represent outlier scores, which are either lower than q 0.25 − 1.5(q 0.75 − q 0.25 ) or higher than q 0.75 + 1.5(q 0.75 − q 0.25 ), where q 0.25 and q 0.75 are the first and third quartiles of the scores, respectively. The subplots of Figure 1a,b show that Algorithm 2 is more robust than Algorithm 3 and gives slightly better results than the spectral clustering and Louvain methods when the level of noise is high. The same holds when the sample size decreases in the subplots of Figure 1c.

Application to Real Biological Data
We now present the results obtained on the classic Yeast dataset (https://archive.ics.uci.edu/ml/datasets/Yeast (accessed on 19 February 2021)) [33]. With a total of about 1.3 × 10 6 weighted edges considered when representing the variable correlations in the graph G(N, E), the CORE-clustering procedure required about 160 and 3 seconds with Algorithms 2 and 3, respectively.

Yeast Dataset
The well-known synchronized yeast cell cycle data set of [33] includes 77 samples taken at various times during the cell cycle and a total of 6179 genes, of which 1660 genes are retained for this analysis after pre-processing and filtering. The goal of this analysis is to detect CORE-clusters among the correlation patterns in the time series of yeast gene expressions measured along the cell cycle. Using this dataset, the similarity between all gene pairs was measured with the absolute value of Pearson's correlation. A total of about 1.3 × 10 6 weighted edges are then considered when representing the variable correlations in the graph G(N, E).

Comparison of the Two CORE-Clustering Algorithms
In order to compare the two proposed CORE-clustering algorithms (Algorithms 2 and 3), we tested them on the yeast dataset with τ = 30 and ξ = 0.75. We indeed empirically considered that CORE-clusters containing 30 to 59 variables are reasonable to regularize the problem, and that 0.75 is a threshold above which the absolute value of the correlation between two variables reasonably shows that their behavior is similar.
The clustering obtained using the standard algorithm (rows 1 to 19 of Algorithm 2) is shown in Figure 2. In this figure, the represented clusters 1 to 6 have a coherence c (see Equation (3)) higher than the threshold ξ. The computations were much faster using the greedy algorithm (Algorithm 3): it required about 3 seconds, and 11 CORE-clusters were found. To interpret this result, we computed the coherence c (see Equation (2)) between the 4 representative variables obtained using Algorithm 2 and the 11 obtained using Algorithm 3. Interestingly, Algorithm 3 selected YLL026W = RV3 and YDL003W = RV5. Other variables very close to RV1, RV5 and RV6 (with c > 0.83, i.e., higher than the highest c within the CORE-clusters) were also selected: YHR219W, YLR103C, YLR276C and YLR196W. Results equal or close to those obtained with Algorithm 2 were therefore obtained. The representative variables YDR418W, YML119W, YJL038C, YNL283C and YGR167W were additionally found. In this experiment, Algorithm 3 therefore selected more representative variables and then naturally has a larger score (see Equation (4)) than Algorithm 2. However, it also obviously captured different representative variables that would be gathered in the same CORE-cluster using Algorithm 2. The two algorithms have therefore slightly different properties but lead to coherent results.

Impact of the Number of Observations
In order to evaluate the stability of the results with respect to the number of observations, we ran Algorithm 2 again with the same parameters, but using only the first 30 of the 77 observations of the yeast dataset. Interestingly, YDL003W = RV5 was selected, and YML093W, which is very close to RV6 (c > 0.86), was also selected. The two other representative variables found, YER190W and YLL026W, were however not similar to RV1 or RV3. The information lost in the 47 removed observations therefore did not allow us to recover the influence of RV1 and RV3 on the complex system, but the 30 remaining observations contained a sufficient amount of information to detect RV5 and RV6 as influential variables. This suggests that the strategy detects stable representative variables, even when the number of observations is very low compared with the dimension of the observations.

Comparison with Spectral-Clustering
In order to compare our CORE-clustering strategy with a standard clustering approach, we finally estimated representative variables in the yeast dataset as the centers of clusters obtained with spectral clustering. The standard version of spectral clustering available in R was used. Its main parameter is the number of seeds η used in the k-means part of the spectral clustering. When using η seeds, between 1 and η clusters (with more than one variable) are estimated using k-means, and every variable is assigned to a cluster. In order to fairly compare the spectral clustering and CORE-clustering approaches, we then clustered the variables of the yeast dataset using η ∈ {5, 30, 50, 70, 110}. Note that we tested only one value of η higher than 77, due to the fact that n = 77 observations are known: η = 110 evaluates the spectral clustering behaviour when η is slightly higher than n.
A total of {3, 3, 6} large clusters was obtained for η = {50, 70, 110}, respectively. For η = {5, 30}, only a single cluster gathering almost all variables was obtained. The average coherence of the estimated clusters was 0.44, with a standard deviation of 0.06. The highest coherence was 0.63, which is clearly lower than the threshold ξ = 0.75 that we used with the CORE-clustering. This makes the interpretation of the role played by the representative variables ambiguous.
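The coherence figures above can be reproduced on any clustering, assuming, for illustration, that the coherence of a cluster is read as its minimal pairwise similarity (the quantity that the threshold ξ bounds from below in a CORE-cluster); the function names are ours.

```python
import statistics

def cluster_coherence(cluster, sim):
    """Coherence of one cluster, read here as its minimal pairwise similarity.
    `sim` maps frozenset pairs of variables to similarity values."""
    return min(sim[frozenset((a, b))]
               for i, a in enumerate(cluster) for b in cluster[i + 1:])

def coherence_summary(clusters, sim):
    """Mean, standard deviation and maximum of the per-cluster coherences."""
    cs = [cluster_coherence(c, sim) for c in clusters]
    return statistics.mean(cs), statistics.stdev(cs), max(cs)
```

A spectral cluster whose maximal coherence stays below ξ, as observed here, contains at least one pair of weakly related variables, which is precisely what the CORE-cluster constraints rule out.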
Finally, it is worth mentioning that all representative variables (genes) obtained using Algorithm 2 have a known physiological function, and only two of the eleven variables obtained using Algorithm 3 have an unknown function. For the spectral clustering with η = 110, four selected variables out of six have known functions. For η = 70, three variables with unknown functions were selected. For η = 50, two variables out of three with known functions were selected. For η = 5 and η = 30, the single selected variable was the center of all variables with respect to the center definition in Section 3.3, and turns out to have a known function. It therefore appears in this experiment that the representative variables obtained using CORE-clustering are more interpretable than those estimated using spectral clustering.

Application to the U.S. Road Network
We now assess the CORE-clustering algorithms on the U.S. road network from the 9th DIMACS Implementation Challenge (http://users.diag.uniroma1.it/challenge9/ (accessed on 19 February 2021)). Our goal here is to discuss the pertinence of the detected CORE-clusters, and not specifically their representative variables, on a large-scale and straightforwardly interpretable dataset. The graph contains 2.4 × 10^7 nodes, each representing a crossing of the U.S. road/street network, and 5.8 × 10^7 arcs, each representing the road/street segment between two crossings. Interestingly, the distance between the crossings is also associated with each edge of the graph.
To assess the CORE-clustering algorithms on the U.S. road network, we first transformed the distances between the crossings into weights that increase as the crossings get closer. A weight max{(4 × 10^3 − l)/(4 × 10^3), 10^−4} was specifically given to each edge, where l is its original length in feet.
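Under this transform, the similarity threshold ξ translates directly into a maximal street length between two crossings, which is what makes the parameter interpretable. A minimal sketch of the two mappings (function names are ours):

```python
def edge_weight(l_feet):
    """Similarity weight of a road segment of length l_feet:
    close crossings get weights near 1, and weights are floored at 1e-4."""
    return max((4e3 - l_feet) / 4e3, 1e-4)

def max_length_for(xi):
    """Maximal segment length (in feet) whose weight still reaches the threshold xi."""
    return 4e3 * (1.0 - xi)
```

For instance, ξ = 10^−3 keeps streets shorter than 3996 feet and ξ = 0.5 keeps streets shorter than 2000 feet, the two configurations used in this experiment.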
We then computed CORE-clusters with τ = 5 × 10^4 and ξ = 10^−3. This means that each CORE-cluster contains a road network of 5 × 10^4 to 10^5 crossings, in which one can travel from any crossing to another by only using streets of less than 3996 feet (about 1.2 km) between two crossings. As expected, the clusters represented in Figure 3 correspond to the 18 largest urban areas in the U.S. Note that two of them are made of three CORE-clusters (New York City and Los Angeles), and two other ones (Miami and Chicago) are made of two CORE-clusters. This is due to an algorithmic choice, discussed in Sections 3.2.2 and 3.2.3, which leads to the detection of CORE-clusters containing τ to 2τ − 1 nodes when using the proposed CORE-clustering algorithms. As discussed in Section 2.6, an interesting strategy when connected CORE-clusters are found consists in running the CORE-clustering algorithms again with higher values of τ or ξ, to potentially detect more robust CORE-clusters. By reproducing this test with τ = 10^5 instead of τ = 5 × 10^4, i.e., by searching for CORE-clusters containing 10^5 to 2 × 10^5 crossings, only 5 urban areas are found: New York City (2 CORE-clusters), Los Angeles, Chicago, Miami and San Francisco. By using τ = 5 × 10^4 with ξ = 0.5 instead, i.e., with distances between the crossings lower than about 2000 feet (about 0.6 km) instead of 3996 feet, only 3 urban areas are found: New York City (2 CORE-clusters), Los Angeles and Chicago. What is interesting here in terms of interpretability is that we have been able to select specific subparts of the U.S. road network by explicitly controlling the size of the clusters or a specific level of density in the network. This notion of interpretability can be further discussed by observing the CORE-clusters obtained in the New York City urban area, as shown in Figure 3b-d. By using τ = 5 × 10^4 and ξ = 10^−3, the three CORE-clusters split the densest parts of the New York City urban area into the New Jersey state, Long Island and the rest of New York state.
When having larger CORE-clusters, i.e., with τ = 10^5, the New Jersey cluster expands, and the densest parts of the two New York state clusters are merged. When instead enforcing a stronger coherence inside the clusters, i.e., with τ = 5 × 10^4 and ξ = 0.5, the New Jersey and Long Island CORE-clusters remain stable, but the Manhattan/mainland New York state cluster is not large and dense enough, so it is not captured as a CORE-cluster. Note that each CORE-cluster estimation required about 50 s and 2.4 GB here, without any parallelization.

Conclusions
Although complex systems in high dimensional spaces with a limited number of observations are quite common across many fields, designing efficient methods for the associated graph clustering problem remains an ambiguous task. Some existing techniques, based on assumptions made to control the variables' contribution to the global clustering, often do not allow the selection of the best graph partition. In response to this issue, we developed a formalism based on an original graph clustering strategy with specific properties. This formalism makes it possible to robustly identify groups of representative variables of the studied system by tuning two intuitive parameters: (1) the minimum number of variables in each CORE-cluster, and (2) a minimum level of similarity between all the variables of a CORE-cluster. Its effectiveness was satisfactorily assessed on simulated data and on real datasets.
From a methodological perspective, an interesting research direction would be to mix Algorithms 2 and 3 into a single hybrid top-down and bottom-up optimization scheme. The goal would be to scale well to very high dimensional datasets, as when using Algorithm 3, while being as robust as Algorithm 2 once potential CORE-clusters have been coarsely identified. When the CORE-clusters found by either Algorithm 2 or 3 are very large, a stochastic strategy could also speed up the detection of their representative variables. Although this secondary part of our methodology has been addressed by using a standard Dijkstra's algorithm (see Section 3.3), it could be addressed by using an extension of [34] in which the optimal paths between all variables would not be pre-computed prior to the central variable detection. Applying our formalism in various fields such as gene regulatory networks, social networks or recommender systems would also be of interest. Our formalism is indeed sufficiently flexible to incorporate different types of similarity measures between the observed variables.
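As a sketch of that secondary step, the center of a cluster can be obtained by running Dijkstra's algorithm from each variable and keeping the one with minimal eccentricity; this is a generic graph-center computation, not the exact Section 3.3 definition, and the adjacency-list layout of `adj` is an illustrative convention.

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances from src; adj maps a node to (neighbor, cost) pairs."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry, already settled with a shorter path
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def central_variable(adj):
    """Node minimizing its worst-case distance to the other nodes (eccentricity)."""
    return min(adj, key=lambda u: max(dijkstra(adj, u).values()))
```

Running Dijkstra from every node is what the stochastic strategy mentioned above could avoid on very large CORE-clusters.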
Note finally that our formalism was implemented in C++ and wrapped in an R package, which is freely available on SourceForge (https://es.sourceforge.net/projects/coreclustering/ (accessed on 19 February 2021)), so that these results can be easily reproduced.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Let us first interpret the results in Table A1. The first row of coherences on G1, in tests T1 and T2, shows that similar coherences were obtained with CORE-clusters of size 7 and 3, although the coherences are slightly higher for smaller CORE-clusters. These coherences are also clearly higher than those obtained in tests T3 and T4, where all variables are not contained in the same simulated block. Interestingly, the results obtained on graph G2 are similar to those obtained on G1. The only difference is that the coherences of T4 are slightly higher than on G1, but still relatively low. Note that the tested CORE-clusters may lead to disconnected subgraphs when tested on the maximum spanning trees; Equation (2) does not make sense in this case. When the simulated subgraphs were connected (12 cases out of 16), we obtained the same coherences on the maximum spanning trees as on the whole graph. As discussed in Section 2.4, the coherence of a given CORE-cluster on a maximum spanning tree is indeed lower than or equal to the coherence of the same CORE-cluster on the whole graph. We however computed the representative variables in all tested cases, as the centrality measure makes sense even on disconnected subgraphs (see Section 3.3). The results in Table A2 show that the estimated representative variables are always in the reference sets S_Ref^1 and S_Ref^2 in the tested configurations. They also correspond to the simulated representative variables, X3 and X11, in most cases, or are very close to these variables. Note that slightly inaccurate reference variables were detected in S_Ref^1, and we can clearly see (Figure A1, top) that the influence of its simulated representative variable is less obvious than in set S_Ref^2. We further study the influence of the undesirable relations between {X1, ..., X5} and {X9, ..., X13} in graph G2 by simulating these relations with different strengths.
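The maximum-spanning-tree computations and the connectivity checks behind these counts can be sketched as follows; this is a stdlib Kruskal/union-find sketch of ours, not the API exposed by the package.

```python
def maximum_spanning_tree(nodes, edges):
    """Kruskal's algorithm on edges (weight, u, v), scanned by decreasing weight."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def is_connected(cluster, edges):
    """Is the subgraph induced by `cluster` on `edges` connected?"""
    cluster = set(cluster)
    adj = {n: [] for n in cluster}
    for _, u, v in edges:
        if u in cluster and v in cluster:
            adj[u].append(v)
            adj[v].append(u)
    seen, stack = set(), [next(iter(cluster))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u])
    return seen == cluster
```

A cluster can thus be connected on the full graph yet disconnected on its maximum spanning tree, which is exactly the situation in which Equation (2) does not make sense.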
In the previous subsection, the undesirable relations were sampled following N(µ, 0.1), where µ = 0.37. Here, we sampled 100 graphs G2 for each strength µ ∈ {0.1, 0.40, 0.60, 0.80, 0.90}. For each graph, we then measured the proportion of representative variable estimates lying in the true reference block of variables, and the proportion of representative variable estimates exactly matching the ground truth representative variables.
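The sampling of the undesirable relations can be sketched as follows; clipping to [0, 1] is our assumption, made because the similarities stand for absolute correlations.

```python
import random

def sample_undesirable_relations(mu, n_edges, sigma=0.1, seed=0):
    """Draw n_edges similarity values from N(mu, sigma), clipped to [0, 1]."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, rng.gauss(mu, sigma))) for _ in range(n_edges)]
```

Increasing µ towards 0.9 makes the undesirable inter-block relations as strong as the intra-block ones, which is the hardest configuration tested below.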
Results are given in Table A3 and show that the representative variable detection is particularly stable in these tests, even for large values of µ. All estimated representative variables are indeed in the true block of variables, except in Test 4 (where two variables of the reference sets are swapped) with µ = 0.9, i.e., with the same level of similarity as between the blocks' representative variables and the other variables they contain. False estimations are however uncommon even in this case. Exact estimates of the representative variables are naturally less frequent, as this test is more strict. They are however always clearly higher than random estimations, which would have proportions equal to 0.14. The estimates of pertinent representative variables therefore appear robust, even with strong undesirable relations in the variable similarities and tested CORE-clusters which contain undesirable variables.

Table A3. Proportion of representative variable estimates contained in the proper reference block of variables (main value) and corresponding to the ground truth representative variables (between brackets). Each proportion was computed on 100 simulated graphs (G2) and their corresponding maximum spanning trees (ST(G2)). For each group of 100 graphs, a different strength µ of the undesirable relations is tested.