Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers

Performance analysis is an essential task in high-performance computing (HPC) systems, and it is applied for different purposes, such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring tasks generate a huge number of key performance indicators (KPIs) to supervise the status of the jobs running in these systems. KPIs provide data about CPU usage, memory usage, network (interface) traffic, and other sensors that monitor the hardware. By analyzing this data, it is possible to obtain insightful information about running jobs, such as their characteristics, performance, and failures. The main contribution of this paper is to identify which metrics (KPIs) are the most appropriate to identify and classify different types of jobs according to their behavior in the HPC system. With this aim, we applied different clustering techniques (partitioning and hierarchical clustering algorithms) to a real dataset from the Galician supercomputing center (CESGA). We conclude that (i) the metrics (KPIs) related to network (interface) traffic monitoring provide the best cohesion and separation for clustering HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task. Our approach was validated using a different real dataset from the same HPC center.


Introduction
HPC systems are known for their costly operation and expensive, complex infrastructure [1]. Companies and research centers are increasingly demanding this technology to solve complex computational problems. This has led to a growing need for constant monitoring of HPC systems to ensure stable performance. These monitoring systems periodically check the computational nodes of the HPC system to gather the values of different performance counters known as KPIs [2]. This information illustrates the operational status of the system. KPIs are usually organized into different categories according to the parameters being monitored: CPU usage, memory usage, network traffic, or other hardware sensors. Each KPI is often recorded as a time series: values of the same parameter (KPI) gathered periodically with a specific frequency. Thus, KPIs are usually recorded as a time series matrix that can be processed for different purposes: anomaly detection, optimal resource allocation, visualization, segmentation, pattern identification, trend analysis, forecasting, indexing, clustering, etc. For instance, abnormal behavior in KPIs may explain or predict problems such as application issues, work overload, or system faults in HPC systems.
Therefore, time series analysis techniques are relevant for the analysis of KPIs. In fact, there are different approaches in the literature [3,4] based on the analysis of a large number of time-varying performance metrics. These proposals apply different techniques, such as statistical analysis [5], machine learning [6,7], and time series analysis [8]. Among all these approaches, machine learning (ML) stands out for analyzing time series data. Current advanced ML techniques can quickly process a massive matrix with diverse data types, such as text, numerical data, or categorical data. These approaches face some common challenges when analyzing the gathered data:

• Large data volume. Each HPC node generates a large number of KPIs (usually more than a thousand). Thus, selecting the most appropriate set of KPIs for job analysis is a key aspect [9].

• Large data dimensionality. The KPI matrix that corresponds to one job may contain a huge number of vectors, depending on the number of parallel nodes required during its execution.

• Lack of annotated data. This entails problems in validating the models and methodologies. This problem has been highlighted in previous proposals [10], where only a reduced number of annotated KPIs were used. Consequently, the obtained results cannot be considered complete or representative [10,11].
Our research work focuses on identifying groups of similar jobs. Since similar jobs tend to have similar performance, we have opted to analyze the KPI data obtained from the monitoring system: each job runs on some parallel nodes, and the monitoring system gathers the KPI data per node. We decided to apply clustering techniques to the information given by the KPIs. Besides, the lack of annotated data has driven our research toward the application of unsupervised techniques, such as partitioning and hierarchical clustering algorithms.
As previously mentioned, the large data volume is an important challenge when analyzing the KPIs. So, one of our objectives is to identify which metrics (KPIs) are the most appropriate for clustering. To make this possible, we have done a two-step analysis. First, we performed clustering by combining the information of all KPIs. Second, we performed clustering using the information of each KPI individually. The evaluation was done using a real dataset obtained from the Centro de Supercomputación de Galicia (CESGA).
Consequently, our contributions are: (i) a clustering-based methodology that is able to identify groups of jobs executed in HPC systems; (ii) a simplification of the computational problem by analyzing the different KPIs in order to determine which ones are the most suitable for this type of clustering; and (iii) the identification of the best clustering algorithm to distinguish different types of HPC jobs according to their performance. This methodology can be applied in any HPC center to obtain clusters that identify the different types of running jobs. Finally, the resulting clusters constitute the basis for further analysis that will enable the identification of anomalies in jobs. To the best of our knowledge, this approach is novel because of the following aspects: the variety of the KPIs used in our analysis (CPU usage, memory usage, network traffic, and other hardware sensors) and the application of PCA reduction to face an otherwise overwhelming and challenging clustering of KPIs.
This paper is organized as follows. Section 2 presents some background about the techniques used in this research. Section 3 describes the latest work related to time series clustering and anomaly detection in HPC. Section 4 describes the methodology used in this study. Section 5 defines the experiments and their evaluation. Section 6 provides a discussion of the results, and Section 7 covers the conclusions and future work.

Background
There are three types of learning in ML: supervised, semi-supervised, and unsupervised learning. In supervised learning, the data used for analysis is labeled (annotated) before applying any supervised techniques. One example would be a data table with a sequence of behaviors that have labels. This data table is fed to the supervised algorithm to build a model from the labeled data. This model will be used afterward for future predictions. In semi-supervised learning, part of the data is labeled, and the rest is not. Finally, in unsupervised learning, the data is not labeled. For example, an unlabeled data table with a sequence of behaviors is fed to an unsupervised algorithm to group the data with similar behaviors, with the aim of labeling these groups later [9].
Since we are dealing with a huge number of KPIs that are not labeled, we have decided to consider unsupervised learning techniques and discard other approaches, like classification. In fact, we used clustering techniques, which were considered appropriate to discover hidden patterns or similar groups in our dataset without the need for labeled data. In the following subsections, we introduce the algorithms and the distances we have selected (Section 2.1), as well as the different options for clustering validation that helped us find the optimal number of clusters (Section 2.2). Finally, we also explain how to deal with a large amount of data by using dimensionality reduction techniques (Section 2.3).

Clustering Algorithms
Clustering algorithms can be classified into five types: partitioning, hierarchical, density-based, grid-based, and model-based methods. Since we are interested in applying clustering to lower-dimensional time series (described in Section 4.3), we have decided to select partitioning (K-means) and hierarchical (agglomerative clustering) techniques, as they are the most appropriate for this type of data and widely used for our purpose. K-means is the most widely used clustering technique thanks to its simplicity. It partitions the data into K clusters by iteratively refining the cluster centroids and assigning each object in the data to only one cluster. K-means uses the Euclidean distance between all the objects and the corresponding centroids to form the clusters [12]. The main advantages of K-means are that it is simple to implement, it is relatively fast in execution, it can be applied in numerous applications that involve a large amount of data, and it obtains very reliable results with large-scale datasets [13,14].
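The K-means step described above can be sketched with scikit-learn (an assumption: the paper does not name its clustering library). The input here is a toy stand-in for the flattened per-job feature matrix; the cluster count and data are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for a flattened per-job feature matrix (jobs x features);
# in the paper, each row would be the flattened PCA output of one job.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 4)),
               rng.normal(3, 0.3, (20, 4))])

# K-means partitions the data into K clusters around centroids,
# using Euclidean distance by default.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_              # one cluster id per job
centroids = km.cluster_centers_  # one centroid per cluster
```
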
Strategies of hierarchical clustering are divided into two types: divisive and agglomerative. Divisive clustering is a "top-down" approach where all objects are initially grouped into one cluster. Then, the objects are split gradually into different clusters until the number of clusters equals the number of objects. Conversely, agglomerative clustering is a "bottom-up" approach where each object is assigned to an individual cluster at the initial step. Then, the clusters are progressively merged until they become one cluster. Agglomerative clustering uses a combination of (i) a linkage method [15,16] and (ii) a distance metric to merge the clusters. In our analysis, we have used the Euclidean [17], Manhattan [18], and Cosine [19] metrics. Hierarchical clustering has important advantages: it has a logical structure, setting the number of clusters in advance is not required, it provides good result visualization, and it offers a dendrogram-based graphical representation [14,20].

Cluster Validation
Many clustering algorithms require the number of desired clusters as an input parameter. Therefore, the experience of the data analyst and/or the specific requirements of the application are key in determining that number. However, cluster validation methods are useful to measure the quality of the clustering results and, consequently, to identify the optimal number of clusters. Clustering validation techniques can be classified into two categories: (i) external clustering validation and (ii) internal clustering validation. The former requires predefined data labels to evaluate the goodness of the clusters, while the latter does not [21]. The KPIs of the HPC jobs are usually unlabeled.
Consequently, internal clustering validation methods are the best option to evaluate the clusters under these circumstances. In fact, our analysis uses three popular internal clustering validation methods to evaluate our clusters: the Silhouette coefficient [22], the Calinski-Harabasz index [21], and the Davies-Bouldin index [23]. These three methods base their assessment on the compactness of the clusters and the separation between them.
The Silhouette index measures the difference between the distance from an object of a cluster to the other objects of the same cluster and the distance from the same object to all the objects of the closest cluster. The Silhouette score ranges between -1 and 1. The closer the value is to one, the better the shape of the cluster [22]. In fact, a Silhouette score above 0.5 is considered a good result, and a score greater than 0.7 is evidence of very good clustering [24]. Thus, this technique focuses on assessing the shape, or silhouette, of the different identified clusters. Besides, the score obtained with this index only depends on the partition, not on the clustering algorithm [22].
The Calinski-Harabasz index is also known as the variance ratio criterion: a cluster validation function based on the ratio between the sum of squared distances among clusters and the sum of squared distances among objects within their cluster [21]. It focuses on assessing the dispersion of objects within their cluster and the distance to the other clusters:

CH = (B_K / (K - 1)) / (W_K / (N - K))

where N is the total number of samples, B_K and W_K are the between- and within-cluster variances, respectively, and K is the number of clusters.
Finally, the Davies-Bouldin index is used to calculate the separation between the clusters. It focuses on comparing the centroid diameters of the clusters. The closer the Davies-Bouldin value is to zero (its lowest value), the greater the separation between clusters [23]:

DB = (1 / K) * sum_{k=1..K} max_{l != k} (S(u_k) + S(u_l)) / d(u_k, u_l)

where S(u_k) + S(u_l) is the within-cluster distance and d(u_k, u_l) is the between-cluster distance.
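The three internal validation indexes above are available in scikit-learn (assumed here; the paper does not name its implementation). A minimal sketch on well-separated toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (25, 2)),
               rng.normal(4, 0.2, (25, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)          # in [-1, 1]; higher is better
ch  = calinski_harabasz_score(X, labels)   # unbounded; higher is better
db  = davies_bouldin_score(X, labels)      # >= 0; lower is better
```

Note the differing orientations: a good partition scores high on Silhouette and Calinski-Harabasz but low on Davies-Bouldin.
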

Dimensionality Reduction
HPC KPI data is usually organized into high-dimensional matrices, which affects the accuracy of machine-learning algorithms and slows down the model learning process. Hence, it is essential to apply a feature dimensionality reduction technique that combines the most relevant variables in order to obtain a more manageable dataset [25]. There are several techniques used for dimensionality reduction, such as Principal Component Analysis (PCA) [26], t-Distributed Stochastic Neighbor Embedding (t-SNE) [27], and Uniform Manifold Approximation and Projection (UMAP) [28].
Principal Component Analysis (PCA) [26] is one of the most widely used methods to reduce data dimensionality. Its goal is to reduce high-dimensional data to a small number of so-called principal components. These principal components highlight the essential features of the real data and are expected to retain the maximum information (variance) of the original data. There are two approaches to applying PCA: (i) fixed PCA and (ii) variable PCA. In the former, the number of principal components is fixed beforehand, whereas in the latter, the number of principal components is calculated during the process by analyzing the percentage of variance that is retained.
PCA has been successfully applied in different research areas [29,30,31,32,33]. However, some of these works revealed two downsides [25,27]. On the one hand, with a large-dimension covariance matrix, the estimation and evaluation tasks are challenging. On the other hand, PCA mainly captures the directions of large variance rather than the small ones, beyond the information that is explicitly given in the training data. However, our analysis did not face either of these problems. The maximum dimensionality of the analyzed jobs in our dataset (described in Section 4.2) is 43 parameters. This made the calculation of the principal components feasible, with a percentage of retained information greater than 85% for 80% of the jobs (see Section 4.3).

Related Work
The increasing demand for HPC technology means that maintaining the quality of service is key in data centers. Clustering is one of the techniques that is becoming more relevant for this purpose. Analyzing and comparing the differences and similarities of jobs that run in HPC systems opens the door to further and deeper studies, such as anomaly detection. In fact, security and performance go hand in hand. Zanoon [34] confirmed this direct relationship between security and performance by analyzing the quality of service of cloud computing services (jobs running in HPC systems), concluding that better security means better performance.
In the specialized literature, there are different approaches that focus on clustering the KPIs in order to support the comparison between jobs [6,12,35]. Yahyaoui et al. [12] obtained a good clustering result with a novel approach to cluster performance behaviors. They used different clustering algorithms (K-means, hierarchical clustering, PAM, FANNY, CLARA, and SOM) after reducing the dimensionality of time-oriented aggregations of data with the Haar transform.
Li et al. [36] achieved a higher accuracy score for clustering by proposing a robust time series clustering algorithm for KPIs called ROCKA. This algorithm extracts the baseline of the time series and uses it to overcome the high dimensionality problem. Besides, Tuncer et al. [35] proposed a new framework for detecting anomalies in HPC systems by clustering statistical features that retain application characteristics from the time series. Finally, Mariani et al. [37] proposed a new approach named LOUD that combines machine learning with graph centrality algorithms. LOUD analyzes KPI metrics collected from running systems using lightweight, positive-only machine learning training. The objective is twofold: to detect anomalies in KPIs and to reveal causal relationships among them. However, this approach does not achieve high precision.

Methodology
HPC systems execute a huge number of jobs every day, usually on hundreds of parallel nodes. These nodes are monitored through more than a thousand KPIs. The goal of this study is to identify clusters of HPC job performances based on the information given by their KPIs. We assume that this task will give relevant information about the usual behavior of the jobs, which will be used in the short term to identify anomalies in jobs. However, this goal brings challenges like data scaling and dimensionality, which we have faced by defining a six-step methodology, summarized in Figure 1.
The first step was the selection and definition of the KPIs used in clustering (Section 4.1). The second step was data preprocessing (Section 4.2), where we read the data and identified the operational jobs, which are those that have a systematic nature, like scheduled system updates, sensor checks, and backups. In contrast, non-operational jobs are those that have a non-systematic nature. In addition, a basic analysis of the non-operational jobs gave us a better view of the data to prepare it for the pre-clustering phase.

Figure 1. Framework for clustering HPC job KPIs using feature selection.

Some of the applied dimensionality reduction methods, like PCA, are affected by scale, and proper scaling is a requirement for the optimal performance of many machine-learning algorithms. For this reason, a third step to standardize the data was needed (Section 4.2). The fourth step was to overcome the dimensionality problem (Section 4.3), always present when analyzing large time series data, as in our case. The PCA dimensionality reduction method helped to reduce our KPI matrices and speed up the clustering process. The fifth step was clustering (Section 4.4). Two clustering experiments were performed using K-means and agglomerative hierarchical algorithms with different linkage methods and distance metrics (Section 5). The first experiment clustered the PCAs of the non-operational jobs for all the metrics (KPIs) combined. The second experiment clustered the PCAs of the non-operational jobs for each KPI individually. The study did not have a predetermined number of clusters (K). Therefore, in the sixth step, both algorithms clustered the data considering different values of K (from 2 to 200). Then, the clustering results for all K values were evaluated using the three previously mentioned internal cluster validation methods (Silhouette analysis, the Calinski-Harabasz index, and the Davies-Bouldin index) to determine the goodness of the clusters and to identify the optimal number of clusters. The clustering results from both experiments were compared to identify which KPIs show the best clustering results and, consequently, are the most representative for clustering the jobs. Lastly, a validation experiment was conducted with a new dataset to validate the obtained results.

Performance Data Selection
The execution of HPC jobs is deployed over a high number of nodes; thousands of parallel nodes are closely monitored by specific systems. As previously mentioned, these monitoring systems periodically gather the values of specific metrics or KPIs. Depending on the monitoring system, the information may be overwhelming, with thousands of metrics or KPIs. The collected data is stored as a time series matrix per node. These KPIs are usually classified into five different categories:

• Metrics about CPU usage, such as the time spent by a job in the system, owner of the job, nice (priority) time, or idle time.

• Metrics of network (interface) traffic, such as the number of octets sent and received, packets, and errors for each interface.

• IPMI (Intelligent Platform Management Interface) metrics, which collect the readings of hardware sensors from the servers in the data center.

• Metrics about the system load, such as the system load average over the last 1, 5, and 15 minutes.

• Metrics of memory usage, such as memory occupied by the running processes, page cache, buffer cache, and idle memory.
For our analysis, we have acquired a dataset from the CESGA Supercomputing Center (Centro de Supercomputación de Galicia). The CESGA Foundation is a non-profit organization whose mission is to contribute to the advancement of science and technical knowledge by means of research and application of high-performance computing and communications, as well as other information technology resources. The dataset stores information about a total of 1,783 jobs (operational and non-operational), which were running on the 74 available parallel nodes from 1 June 2018 to 31 July 2018.
The collected data gives information about 44,280 different KPIs. In order to filter this overwhelming amount of data, we applied a preliminary filter according to the needs of the CESGA experts. Therefore, we focus our attention on the 11 KPIs summarized in Table 1. The selected KPIs belong to the five previously mentioned categories (CPU usage, memory usage, system load, IPMI, and network interface) and were selected by the CESGA experts based on their relevance and clear representation of the performance of jobs from each category.
Each KPI gives a matrix with the following information: (i) the value of the KPI, (ii) the time at which the value was acquired, (iii) the job, and (iv) the node to which this value belongs.
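Records with this (value, time, job, node) layout can be pivoted into the per-job (time x nodes) matrices used later. A minimal sketch with pandas (the library the paper uses); the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical layout of one KPI's records: value, timestamp, job, node.
records = pd.DataFrame({
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "time":  ["t0", "t0", "t1", "t1", "t2", "t2"],
    "job":   [7, 7, 7, 7, 7, 7],
    "node":  ["n1", "n2", "n1", "n2", "n1", "n2"],
})

# Per-job (time x nodes) matrix: rows are timestamps, columns are the
# parallel nodes the job ran on.
job_matrix = (records[records["job"] == 7]
              .pivot(index="time", columns="node", values="value"))
```
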

Data Preprocessing and Standardization
The objective of this preprocessing phase was to read and organize the KPI matrices into data frames before applying any machine-learning steps. For this task, we used the functionality of the Python Pandas library [38]. Additionally, we performed analysis and data visualization that helped us understand the nature of our dataset before applying any further analysis; the results are summarized in Table 2.
From a total of 1,783 jobs, 200 were excluded from our clustering analysis for one of the following reasons:

• The jobs were not included in all 11 KPI matrices, i.e., we do not have complete information about the metrics of the job.

• The jobs were executed on only one node, which means they were not parallelized jobs; parallelization is mandatory for the dimensionality reduction phase of our proposed method. These one-node jobs (12% of the dataset) were mostly operational jobs, which are not the focus of our study. The analysis of one-node (operational) jobs deserves a specific study that is out of the scope of this paper.
Before proceeding to job clustering, we split the remaining 1,583 jobs into two types: operational (1,281 jobs) and non-operational (302 jobs). As previously mentioned, our analysis focused only on non-operational jobs. Consequently, we ran two clustering experiments considering the 302 non-operational jobs: in the first experiment, clustering the 11 KPI matrices combined and, in the second, clustering each KPI matrix individually. Of the excluded jobs, 56 were excluded because they were not included in all 11 KPI matrices. Table 3 shows the number of nodes per non-operational job in our dataset. The node count per job revealed the following: no jobs were executed on only one node, 195 jobs were executed on fewer than 5 nodes, and 49 jobs were executed on between 6 and 10 nodes. Overall, 80.7% of the jobs were executed on fewer than 10 nodes. The standardization process is usually a required step before applying any machine learning algorithm, in order to achieve reliable results [39]. In our case, we performed this standardization stage because PCA is affected by scale, and the values gathered in the 11 KPI matrices ranged from very low to very high. Thus, the data was standardized to unit scale: mean equal to zero and variance equal to one.
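The zero-mean, unit-variance standardization described above is what scikit-learn's `StandardScaler` computes (an assumed implementation; the two toy columns mimic KPIs of very different magnitudes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# KPI values may range from very small to very large across matrices,
# e.g. raw octet counters next to CPU-usage fractions.
X = np.column_stack([rng.normal(1e6, 1e5, 50),
                     rng.normal(0.5, 0.1, 50)])

# Standardize each column to zero mean and unit variance so that PCA
# (which is scale-sensitive) does not favor the high-magnitude KPI.
Xs = StandardScaler().fit_transform(X)
```
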

Jobs KPIs: Dimensionality Reduction
One of the major challenges in KPI analysis is the large volume of available data. After preprocessing our dataset, each column of a matrix represents the KPI values of one of the nodes used to run the job in parallel. The number of nodes is proportional to the parallelization and computational needs of each job, giving a (time x nodes) matrix. Analyzing our data, we can see that 19.3% of the jobs were executed on more than 10 nodes. We also have the time series storing the KPIs for each node, so the analysis of such a volume of data is overwhelming. Consequently, we have decided to apply a dimensionality reduction method to overcome this challenge. As previously mentioned, we decided to use PCA to reduce the dimensionality of the matrix that represents the gathered KPI data of each job. The objective is to reduce this dimensionality without losing information (variance) and, therefore, to reduce the computational load and execution time of the clustering algorithms.
We decided to apply a fixed PCA technique with two principal components. This decision is based on two aspects. On the one hand, our initial analysis (Section 4.2) showed that 195 jobs have from two to five nodes, and 80.7% of the jobs were executed on fewer than 10 nodes. Thus, applying more than two principal components does not seem appropriate in this context. On the other hand, we checked that two principal components were enough to retain the information (variance) of the original data (job KPI performance): the percentage of retained information is greater than 85% in 81% of the jobs, as Table 4 shows.
The PCA was applied to each KPI matrix individually, resulting in a (time x 2 principal components) matrix for each job. On the one hand, for experiment one (Section 5.1), we used the information of the 11 KPIs jointly. For this, we took advantage of the Python Pandas library [38] to combine and flatten the PCA results of each job for all 11 KPIs into one row of a data frame indexed by job number, resulting in a matrix of (jobs x (times x 2 principal components x KPIs)). Each row in this data frame represents the PCAs of all 11 metrics combined for one job. On the other hand, for experiment two (Section 5.2), we analyzed each KPI individually. Thus, the PCA results of each job for each KPI were combined and flattened into one row of a separate data frame indexed by job number, resulting in a matrix of (jobs x (times x 2 principal components)).
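The per-job reduce-and-flatten step for a single KPI can be sketched as follows (a sketch under assumptions: scikit-learn's `PCA`, hypothetical job ids and matrix sizes, and equal-length time axes so the flattened rows stack into one matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Three hypothetical jobs, each a (time x nodes) matrix for one KPI;
# node counts differ per job, the time axis (30 steps) is shared.
jobs = {j: rng.normal(size=(30, n_nodes))
        for j, n_nodes in [(101, 4), (102, 6), (103, 3)]}

rows = {}
for job_id, matrix in jobs.items():
    # Reduce the node dimension to 2 principal components, then
    # flatten the (time x 2) scores into a single feature row.
    scores = PCA(n_components=2).fit_transform(matrix)
    rows[job_id] = scores.flatten()

# Stack into a (jobs x (time * 2)) matrix ready for clustering.
X = np.vstack([rows[j] for j in sorted(rows)])
```

For experiment one, the analogous rows for all 11 KPIs would be concatenated per job before stacking.
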

Clustering
The study applied the K-means algorithm and the agglomerative hierarchical algorithm to cluster the jobs in both experiments. On the one hand, K-means used only the Euclidean distance for clustering. On the other hand, the agglomerative hierarchical algorithm used three distance metrics (Euclidean, Manhattan, and Cosine) with different linkage methods. Both algorithms were run for different numbers of clusters, from 2 to 200, because no predetermined number of clusters (K) was given. All clustering results were stored and evaluated with three internal cluster validation methods (the Silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin index) to determine the optimal K for K-means and for the agglomerative hierarchical algorithm with all distances, as Figure 2 illustrates.
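The K-sweep described above can be sketched as a small loop (scikit-learn assumed; a toy dataset with three planted clusters and a reduced K range stand in for the real jobs and the 2..200 sweep, and only the Silhouette score is shown of the three indexes):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)
# Toy feature matrix with three well-separated groups of jobs.
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0, 3, 6)])

# Sweep K over a range (2..200 in the paper; a small range here),
# score each partition, and keep the K with the best Silhouette.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```

In the study the same sweep is repeated per algorithm, distance metric, and linkage, and scored with all three validation indexes.
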

Experiment One: Results
In this experiment, we clustered all the non-operational jobs taking into account the information provided by the 11 KPIs. With this aim, we applied the K-means algorithm and the agglomerative hierarchical algorithm with different linkage rules, as shown in the experimental setup in Table 5. We did not have a predetermined number of clusters for either algorithm. The clustering was run for values of K from 2 to 200, and the results were fed to the three cluster validation methods to identify the optimal number of clusters.

Table 5 setup: non-operational jobs, clustered with K-means and the agglomerative hierarchical algorithm, validated with the Silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin index.

Table 6 compares the optimal numbers of clusters for both algorithms using each of the three validation methods. Regarding the combined values of the 11 selected KPIs, we found that the agglomerative hierarchical algorithm performs better than the K-means algorithm when using the Euclidean distance with average linkage, with a Calinski-Harabasz score of 24,545,720,615 and a Silhouette score of 0.523 for 3 clusters. The combined values of the 11 selected KPIs also performed well with hierarchical single-linkage clustering using the Euclidean distance, with a Davies-Bouldin score of 0.503 for 13 clusters.

Experiment Two: Results
In this experiment, we clustered all the non-operational jobs using only one KPI at a time. That is, the study performed 11 clustering procedures. Once a KPI is selected, the procedure is the same as in experiment one: using the K-means algorithm and the agglomerative hierarchical algorithm with different linkage rules (see the experimental setup in Table 7). Without a predetermined number of clusters for either algorithm, K ranged from 2 to 200, as in the previous experiment. Then the results were fed to the cluster validation methods to identify the optimal number of clusters. The results of clustering each of the 11 KPIs individually showed that K-means performed well using the Euclidean distance, with a Calinski-Harabasz score of 726.341 for 4 clusters on the KPI interface.bond0.if_octets.tx, as shown in Figure 3. Additionally, the results confirm that the agglomerative hierarchical algorithm performed well in clustering jobs. Figure 4 shows the results with the Cosine distance, single linkage, and the Davies-Bouldin index: a good score (0.340) using the KPI interface.bond0.if_octets.rx with 12 clusters. Figure 5 shows the results with the Manhattan distance, average linkage, and the Silhouette index: a good score (0.598) using the KPI interface.bond0.if_octets.rx with 4 clusters. All the results are summarized in Table A1 in Appendix A.

Validation Experiment
With the aim of validating the conclusions obtained (KPIs belonging to network interface traffic are the most adequate to obtain a good clustering of the non-operational jobs that run in the HPC system), we performed a new experiment with a different dataset, also acquired from CESGA. We used the same methodology as in experiments one and two (data preprocessing, data standardization, dimensionality reduction, and clustering), but using only the information of the two selected KPIs: interface.bond0.if_octets.rx and interface.bond0.if_octets.tx.
The dataset stores information about a total of 1,500 non-operational jobs, which were running on the 81 available parallel nodes from 1 August 2019 to 30 September 2019. Table 8 shows the number of nodes per (non-operational) job in the new dataset. The results of clustering based on these two KPIs are shown in Table 9. The highlighted scores in Table 10 show the best results of the comparison between the scores of the three clustering validation methods for all clustering algorithms. They indicate that the interface.bond0.if_octets.tx KPI showed better clustering results than the interface.bond0.if_octets.rx KPI in all measures (cluster shape, cohesion, and separation) for both algorithms (K-means and agglomerative hierarchical) with the different distance metrics and linkage methods. K-means performed well using the Euclidean distance, with a Calinski-Harabasz score of 4,608.5 for 3 clusters; the agglomerative hierarchical algorithm performed well with the Cosine distance and single linkage (Davies-Bouldin score of 0.119 for 3 clusters) and with the Manhattan distance and complete linkage (Silhouette score of 0.858 for 3 clusters) using the KPI interface.bond0.if_octets.rx.

Discussion
After obtaining the results from both experiments, shown in Table 6 and Table A1, we made two comparisons. The first one, within the results of experiment two, identifies which KPI provides the best clustering results in terms of cohesion and separation. With this aim, we analyzed the results of all the runs that used the information of each KPI individually (different clustering methods, different distance metrics, different linkage methods, and the assessment with the three quality indexes). The second comparison, between the results of experiment one and experiment two, identifies the best clustering approach according to the quality indexes. With this aim, we compared the clustering results obtained with the joint information of the 11 KPIs against those obtained with the KPI that offered the best result in the first comparison.
Consequently, we can conclude that the network (interface) traffic KPIs (interface.bond0.if_octets.rx and interface.bond0.if_octets.tx) present the best clustering results among all 11 KPIs, providing 4 and 13 clusters, respectively. In order to decide which is the most adequate number of clusters for our dataset, i.e., which is the most adequate KPI, we analyzed the time series decomposition of all the jobs per cluster. Figure 6 shows sample jobs from two different clusters, A and B, from the optimal result obtained with the KPI interface.bond0.if_octets.rx. Figure 7 also displays the behavior of the working nodes for each job. After this analysis, we concluded that this KPI (interface.bond0.if_octets.rx) is the one that shows the highest percentage of jobs with similar trends and behavior.
The second comparison concludes that, according to the Silhouette and Davies-Bouldin indexes, the best results are obtained by applying hierarchical algorithms. However, according to the Calinski-Harabasz index, k-means is the best option. Since two out of the three cluster validation methods lead to the same conclusion, we consider the hierarchical algorithm the most adequate for our purpose. Besides, the Calinski-Harabasz index has no upper bound, so it is usually applied to compare different clusterings obtained under the same conditions, which reinforces our choice.

Finally, our results were validated by conducting a clustering experiment with a new dataset, which confirmed that the network (interface) traffic KPIs (interface.bond0.if_octets.rx and interface.bond0.if_octets.tx) show the best clustering results.

Conclusions
This study aimed to provide a methodology to cluster non-operational HPC jobs in order to automatically detect different types of jobs according to their performance. Job performance can be studied using the KPI metrics provided by the HPC monitoring system. Our goal was also to select the most suitable or representative set of KPIs for clustering non-operational jobs according to their performance. Our analysis and validation were done using a dataset provided by the Supercomputing Center of Galicia (CESGA) that collected the KPI information of 1,783 jobs from 1st June 2018 to 31st July 2018.
Considering the large number of available KPIs (44,280), we made a previous selection based on the advice of experts who work at CESGA. They provided us with 11 KPIs from the following categories: CPU usage, memory usage, IPMI, system load, and network (interface) traffic.
We performed two different kinds of experiments in order to select the most suitable KPIs for clustering HPC jobs. The first experiment performed the clustering by combining the information gathered from the 11 KPIs, whereas the second one performed the clustering individually for each of the 11 KPIs. Both experiments used different clustering algorithms (k-means and the agglomerative hierarchical algorithm), different linkage methods (single linkage, complete linkage, average linkage, and Ward's method), and different distance metrics (Euclidean, Manhattan, and cosine). In order to assess the quality of the obtained clusters, we also used different indexes (Silhouette, Calinski-Harabasz, and Davies-Bouldin). Before performing the clustering, we applied PCA in order to reduce the dimensionality of the data without losing information, thereby reducing the computational load of the algorithms. Finally, a clustering experiment based only on the two selected KPIs (interface.bond0.if_octets.rx and interface.bond0.if_octets.tx) was performed with the aim of validating our approach. For this, we obtained a new dataset with 1,500 non-operational jobs from 1st August 2019 to 30th September 2019. The results confirmed our proposal.
Our analysis concluded that the clustering based on the joint information of the 11 KPIs performed worse than the clustering based on individual KPIs. What is more, the results showed that the KPIs belonging to the network (interface) traffic category are the most adequate (interface.bond0.if_octets.rx and interface.bond0.if_octets.tx). The clusters obtained with the information of these KPIs showed the best quality in terms of cohesion and separation of HPC jobs. More specifically, the visualization of the clusters for the KPI interface.bond0.if_octets.rx showed a high percentage of jobs with similar trends. Therefore, our methodology can be applied to any dataset with information about these two KPIs in order to obtain a good clustering and infer the number of types of non-operational jobs that run in the HPC system. The procedure is simple and offers a solution to some challenges faced in other experiments [9,10,11] when dealing with similar unlabeled data of large dimensionality.
In our opinion, this clustering phase should be considered the first stage of a broader procedure to detect anomalies in HPC systems. In fact, we are currently working on analyzing this categorization. We consider that the obtained clusters would help infer the shared characteristics of the jobs belonging to each cluster, which could ultimately serve to detect jobs whose performance is not the expected one and to identify potential anomalies in the system early. Finally, although we have checked that the mechanism applied for dimensionality reduction (fixed PCA) retains a good percentage of the information, we are working to improve this aspect. As mentioned in the literature [25,27], the problem with the cost function used in PCA is that it retains large pairwise distances instead of focusing on retaining small pairwise distances, which are usually much more important. The solution given in [25] is to define a specific cost function based on a non-convex objective function. We are currently defining this new cost function using a larger dataset obtained from the same high-performance computing center. We are also considering the use of KPI time series feature extraction in our clustering methodology. The statistical significance of the extracted features will be evaluated and analyzed with different state-of-the-art machine learning approaches to achieve our purpose.

Appendix
Figure 2(a) and Figure 2(b) illustrate the scores for each number of clusters under each cluster validation method, the Silhouette score (a) and the Davies-Bouldin index (b), to identify the optimal number of clusters visually. In Figure 2(a), a Silhouette score close to 1 implies a better cluster shape. On the contrary, in Figure 2(b), a Davies-Bouldin index close to zero implies greater separation between clusters, as described in Section 2.2.

Figure 3. The results of Calinski-Harabasz scores for k-means with Euclidean distance.

Figure 4. The results of Davies-Bouldin scores for agglomerative hierarchical clustering (cosine distance, single linkage).

Figure 5. The results of Silhouette scores for agglomerative hierarchical clustering (Manhattan distance, average linkage).

Figure 6. A time series decomposition of jobs from two different clusters of the KPI interface.bond0.if_octets.rx.

Table 1. Performance metrics selected.

Table 2. Basic data analysis.

Table 3. Number of nodes per job (non-operational).

Table 4. Information retained by the two principal components of PCA.

Table 6. Experiment one: results.

Table 7. Experiment two: set-up.

Table 8. Number of nodes per job (non-operational) in the validation experiment's new dataset.

Table A1. Experiment two: results.