Improving the Eco-Efficiency of High Performance Computing Clusters Using EECluster

As data and supercomputing centres increase their performance to improve service quality and target more ambitious challenges every day, their carbon footprint also continues to grow, and has already reached the magnitude of that of the aviation industry. In addition, high power consumption is becoming a significant bottleneck for the expansion of these infrastructures in economic terms, owing to the unavailability of sufficient energy sources. A substantial part of the problem is caused by the current energy consumption of High Performance Computing (HPC) clusters. To alleviate this situation, we present EECluster, a tool that integrates with multiple open-source Resource Management Systems to significantly reduce the carbon footprint of clusters by improving their energy efficiency. EECluster implements a dynamic power management mechanism based on Computational Intelligence techniques by learning a set of rules through multi-criteria evolutionary algorithms. This approach enables cluster operators to find the optimal balance between a reduction in the cluster energy consumption, service quality, and number of reconfigurations. Experimental studies using both synthetic and actual workloads from a real-world cluster support the adoption of this tool to reduce the carbon footprint of HPC clusters.


Introduction
Data and supercomputing centres are an essential element in modern society, as the vast majority of IT services are supported by them, profiting from the consolidation and centralization of high performance processors and networks. Targeting both academic and industrial communities, they provide the key infrastructure for web and application servers, e-commerce platforms, corporate databases, network storage, data mining, or the high performance computing resources required to address fundamental problems in science and engineering, to name a few examples.
The versatility of these computing facilities, coupled with the ever-increasing demand for IT services and their substantial power consumption, makes data and supercomputing centres one of the fastest-growing users of electricity in developed countries [1]. According to [1,2], electricity consumption in the U.S. alone escalated from 61 billion kilowatt-hours (kWh) in 2006 to 91 billion kWh in 2013, and is projected to increase to 140 billion kWh in 2020. However, it should be noted that these large energy demands not only produce a significant economic impact for IT service providers [3,4], but also a carbon footprint equivalent to that of the aviation industry [5], which is expected to reach 340 million metric tons of CO2 by 2020 worldwide [6].
Because of this, there is a pressing need to improve the energy efficiency of data and supercomputing centres in order to reduce their environmental impact and operating costs and to improve the reliability of their components.
Abundant research has been conducted over the last years on the improvement of cluster computing efficiency, following multiple approaches that can be classified into two categories: static and dynamic power management [7]. Static approaches focus on the development of low-power CPUs seeking maximum efficiency, such as the IBM PowerPC A2 processors [8,9], as well as on using GPUs or Intel Xeon Phi coprocessors as the main computing resources, given that this type of hardware is designed for an optimal FLOPS/watt ratio instead of just raw performance. Dynamic techniques focus on the reconfiguration of the compute resources to best suit current workloads, saving energy when the cluster is underused. Among these techniques are Dynamic Voltage and Frequency Scaling (DVFS) [10][11][12][13][14][15][16][17], which adjusts CPU frequency and voltage to match current demand, energy-efficient job schedulers that implement algorithms capable of reducing intercommunication-related power consumption [18,19], thermal-aware methods that take into account the cooling efficiency of each area of the cluster [20,21], and software frameworks that assist in the development of energy-efficient applications [22][23][24][25][26]. Lastly, the adaptive resource cluster technique consists of the automatic reconfiguration of the cluster resources to fit the workload at every moment by switching its compute nodes on or off, thus saving energy whenever these are idle. This technique has been applied to load-balancing clusters in [27][28][29][30][31][32] and in the VMware vSphere [33] and Citrix XenServer [34] hypervisors. Recently, various software tools implementing this technique in HPC clusters have also been developed [35][36][37].
However, previous adaptive resource solutions for High Performance Computing (HPC) clusters have limited practical applications for two fundamental reasons. Firstly, as shown in [38,39], closed sets of expert-defined rules are not optimal for every scenario, leading to a lack of flexibility when it comes to complying with the preferences and tolerances of real-world cluster administrators in terms of impact on service quality and node reliability. Secondly, these solutions require their expert systems to be tuned by hand, which is a complex task that is likely to lead to incorrectly configured systems that can substantially interfere with the cluster operation, for example through node thrashing or reduced productivity, as demonstrated in [40].
To overcome these limitations, we present the EECluster tool. EECluster can improve the energy efficiency of HPC clusters by dynamically adapting their resources to the changing workloads. Its decision-making mechanism is a Hybrid Genetic Fuzzy System, which is tuned by means of multi-objective evolutionary algorithms in a machine learning approach to achieve good compliance with the administrator preferences.
The remainder of the paper is organised as follows. Section 2 explains the concept of eco-efficiency and details the modelling assumptions for the carbon footprint of an HPC cluster. Section 3 explains the architecture of the EECluster tool. Section 4 explains the decision-making mechanism. Section 5 explains the learning algorithm used. Section 6 shows multiple experimental results in both synthetic and actual scenarios. Section 7 concludes the paper and discusses future work.

Eco-Efficiency
The concept of eco-efficiency brings together economic and environmental factors for a more efficient use of resources and lower emissions [41]. Eco-efficiency is represented by the quotient between the service value and its environmental influence. In the particular case of HPC clusters, the service value is related to the Quality of Service (QoS), and the environmental influence comprises both energy consumption and greenhouse gas emissions.
As mentioned in the introduction, the dependence between the energy consumption and the Quality of Service has been studied, and different strategies have been proposed to improve their balance [35][36][37][38][39][40]. In this work, these studies are updated by including other sources of carbon dioxide emissions that originate in the life cycle of a compute node. These additional sources are of secondary importance but nonetheless represent a significant part of the emissions. According to [42], manufacturing a computer requires more than 1700 kWh of primary energy, and more than 380 kg of CO2 are emitted in the process, accounting for a significant fraction of the greenhouse emissions during the whole life of the equipment. It must be noted that a standard factor of 220 kg CO2/MWh was assumed for manufacturing-related consumption, corresponding to an energy mix with a significant proportion of wind power. For operation-related consumption, the 370 kg CO2/MWh emission factor reported by the Ministry of Agriculture, Food and Environment of the Government of Spain ("Ministerio de Agricultura, Alimentación y Medio Ambiente") [43] was used. This factor must be altered accordingly for clusters operating under different energy mixes.
Consequently, this paper proposes that three different aspects be taken into account in the model of the emissions of an HPC cluster:
1. Dependence between the QoS and the primary consumption of energy. The primary savings are about 370 g of CO2 for each kWh of electrical energy that is saved during the operation of the HPC cluster.
2. Dependence between the QoS and the lifespan of the equipment. According to our own experience, the average life of a compute node in an HPC cluster is between 4 and 5 years. The number of failures of a given node during its whole lifetime is typically two or three. A rough estimation of the average number of failures of a single node is 0.5 failures/year (thus 0.5 failures/year * 5 years = 2.5 failures). Both the lifespan and the number of failures depend on the number of power-on and power-off cycles. Heavily loaded nodes might suffer from 0.75 failures/year and a shorter lifespan of 3 years. Assuming that the most common failures affect power supplies, motherboards, and disk drives, the typical cost of a repair can be estimated at 5% of the acquisition cost, i.e., about 20 kg of CO2 are saved for each failure that is prevented. Each additional year of use of a compute node saves more than 80 kg of CO2 (approx. 22% of the total manufacturing emissions if the life is between 4 and 5 years, as mentioned). This includes the primary energy used for manufacturing a new node and the recycling costs of the discarded equipment.
3. Dependence between the QoS and the lifespan of the support equipment. An additional 1% (2.2 g CO2) was added for each saved kWh, and 1 g CO2 for each saved power cycle. The first term models the improved failure rate of support equipment such as cooling fans and air conditioning. The second term models different failures in the datacenter that may be caused by current surges when a large number of compute nodes are powered on or off at the same time.
The emissions model described in this section will be applied in the experiments of Section 6 to estimate global energy savings and carbon footprint reductions as a result of adopting the proposed system.
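As an illustration, the three aspects of the emissions model can be written as a small function. This is only a sketch: the constants are the factors quoted in this section, but the function name, its parameters, and the choice of simply adding the three contributions are assumptions made for the example and are not taken from the EECluster software.

```python
# Factors quoted in the text (the way they are combined below is an assumption).
OPERATION_G_PER_KWH = 370.0   # operation-related emission factor, g CO2 per kWh
SUPPORT_G_PER_KWH = 2.2       # support-equipment term, g CO2 per saved kWh
SUPPORT_G_PER_CYCLE = 1.0     # support-equipment term, g CO2 per saved power cycle
FAILURE_KG = 20.0             # kg CO2 saved per prevented node failure
EXTRA_YEAR_KG = 80.0          # kg CO2 saved per additional year of node life


def co2_saved_kg(kwh_saved, cycles_saved=0, failures_prevented=0.0, extra_years=0.0):
    """Hypothetical aggregation: sum of the three aspects, in kg of CO2."""
    # Aspect 1: primary energy saved during operation.
    primary = OPERATION_G_PER_KWH * kwh_saved / 1000.0
    # Aspect 2: lifespan of the compute nodes.
    lifespan = FAILURE_KG * failures_prevented + EXTRA_YEAR_KG * extra_years
    # Aspect 3: lifespan of the support equipment.
    support = (SUPPORT_G_PER_KWH * kwh_saved + SUPPORT_G_PER_CYCLE * cycles_saved) / 1000.0
    return primary + lifespan + support
```

Considering only the first term, 13.38 MWh of saved energy corresponds to about 4.95 tonnes of CO2, which agrees with the lower figure reported in the experimental results.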

Architecture
Computing clusters are a type of computer system consisting of multiple interconnected computers that, together, work as a single computing resource [44]. High Performance Computing (HPC) clusters are a particular type of cluster whose main purpose is to address complex and computationally demanding problems, such as the design of new materials, semiconductors, or drugs, cardiovascular engineering, new combustion systems, cancer detection and therapies, CO2 sequestration, etc. [45].
HPC clusters typically combine a master node and several compute nodes. The master node is the only one accessible by the users and is tasked with the cluster management using various software components, including the Resource Management System (RMS) and monitoring tools (such as Ganglia, Nagios, Zabbix), among others. The RMS is a software layer which abstracts users from the underlying cluster hardware by providing a mechanism through which they can submit resource requests to run any supplied software program (hereafter denoted as jobs). It is worth noting that cluster resources are represented logically by a number of slots which, depending on the RMS configuration, can represent anything from a single CPU core to a whole compute node. The RMS working cycle consists of (1) gathering job submissions in an internal queue; (2) running a job scheduling algorithm to find the best possible matching between the resources available in the compute nodes and the slots requested by each job; and (3) assigning slots and dispatching the job to the compute nodes (see Figure 1). Data and results are passed between the master and the compute nodes through a shared network storage space by means of a network file system or a storage area network.
The EECluster tool is a solution which can reduce the carbon footprint of ordinary HPC clusters running open-source RMS, such as OGE (Oracle Grid Engine, Open Grid Engine)/SGE (Sun Grid Engine, Son of Grid Engine) and PBS (Portable Batch System)/TORQUE (Terascale Open-source Resource and QUEue Manager), by implementing an intelligent mechanism to adapt the cluster resources to the current workload, saving energy when the cluster is underused. Specifically, the prototype of EECluster features only two out-of-the-box connectors, for OGE/SGE and PBS/TORQUE, as these are two of the most used RMS worldwide in HPC infrastructures. As mentioned in reference [46], the OGE/SGE family (including its multiple branches of products and projects, such as Sun Grid Engine, Oracle Grid Engine, Open Grid Engine, and Son of Grid Engine) is a suitable choice for small and medium sites because it is easy to deploy and operate, which has led to a very substantial expansion over the last decade in HPC centres. On the other hand, TORQUE is the open-source RMS based on the original PBS project, and it is arguably the most widely used batch system nowadays in HPC grid infrastructures and also in small and medium sites [46]. It is noteworthy that EECluster can potentially be integrated with any RMS as long as it provides a suitable interface in the form of either a series of command-line utilities or an API (Application Programming Interface) that allows EECluster to obtain the information required for its operation (detailed below).
EECluster is composed of a service (EEClusterd) and a learning algorithm, coupled with a Database Management System (DBMS) as the persistence system, and a web-based administration dashboard. The EEClusterd service periodically updates an internal set of cluster status records by retrieving information from multiple command-line applications. This information, which is stored in the DBMS, is used by the EECluster decision-making mechanism to dynamically reconfigure the cluster resources by issuing power-on or shutdown commands to the compute nodes using the Power Management module. The mission of the learning algorithm is to find a set of optimal configurations for the decision-making mechanism, from which the administrator can choose one according to their preferences in terms of impact on the service quality, energy savings, and node reconfigurations.
A functional prototype of EECluster can be downloaded from the web [47,48], along with a brief description of the software, quick start guides, a contact address, and acknowledgements.
Figure 2 provides a high-level overview of the system components. A detailed description of this architecture is out of the scope of this paper and can be found in reference [49].

Decision-Making Mechanism
The essential component in an adaptive resource cluster solution is the decision-making mechanism, for it is the one that determines the number of nodes that will be available at every moment. As mentioned earlier, multiple approaches have been proposed previously based on sets of expert-defined rules, such as "if the node i has been idle for more than a time threshold t, it must be powered off". Closed expert systems like these have the advantage of coping better with unforeseen workload scenarios than systems learnt automatically in a machine learning approach. This is because machine-learnt systems are more likely to overtrain, especially in scenarios with great changes in the pattern of job arrivals. However, this advantage of simple expert systems comes at the price of low flexibility to adapt to both the sharp changes inherent to ordinary HPC clusters due to the large granularity of their workload (low number of concurrent jobs and multiple resources requested by each job), and the complex set of constraints implicit in the preferences of the cluster administrator. The first limitation leads to worse results than the ones obtainable with a pure machine learning approach. The second limitation leads to a potentially inadmissible solution for the cluster administrator if it does not comply with his or her tolerance of negative impacts on service quality and node reliability.
In order to avoid overtraining and achieve good generalization and flexibility capabilities, the EECluster decision-making mechanism is a Hybrid Genetic Fuzzy System (HGFS) composed of both a set of crisp expert-defined rules and a set of fuzzy rules elicited automatically from previous workload records in a machine learning approach. The first set of human-generated expert rules was adapted from reference [36], and can be defined as follows:

• If the current number of resources is insufficient to run every queued job in a sequential manner, then keep powered on at least the highest number of slots requested by any queued job, as long as that amount does not exceed the total number of slots in the cluster.

• If the average waiting time for the queued jobs is higher than a given threshold $t_{\max}$ or if the number of queued jobs is higher than a given threshold $n_{\max}$, then power on one slot.

• If the average waiting time for the queued jobs is lower than a given threshold $t_{\min}$ or if the number of queued jobs is lower than a given threshold $n_{\min}$, then power off one slot.
The mission of this rule set is to ensure that minimum working conditions for the cluster are met, avoiding undesired behaviours in unforeseen scenarios, which may lead to a dramatic impact on the service quality or to a node thrashing effect that reduces node reliability and causes early damage to the hardware equipment.
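The three crisp rules above can be sketched as a single function that returns the number of slots to keep powered on. This is a minimal sketch under stated assumptions: the function name and arguments are hypothetical, and the text does not specify how the rules interact, so here the power-on rule is assumed to take precedence over the power-off rule, and the first rule is applied as a lower bound on the result.

```python
def expert_rules(requested, avg_wait, powered, total, t_min, t_max, n_min, n_max):
    """Return the target number of powered-on slots (illustrative only).

    requested: slots requested by each queued job (one entry per job)
    avg_wait:  average waiting time of the queued jobs
    powered:   slots currently powered on
    total:     total slots in the cluster
    """
    n_queued = len(requested)
    target = powered
    # Second rule: long waits or a long queue, power on one slot.
    if avg_wait > t_max or n_queued > n_max:
        target += 1
    # Third rule: short waits or a short queue, power off one slot.
    # Assumption: it is only evaluated when the power-on rule did not fire.
    elif avg_wait < t_min or n_queued < n_min:
        target -= 1
    # First rule: keep at least the largest request powered on,
    # unless it exceeds the size of the cluster.
    largest = max(requested, default=0)
    if largest <= total:
        target = max(target, largest)
    return min(max(target, 0), total)
```

In this sketch the first rule acts as a floor, so the one-slot adjustments of the other two rules never leave a queued job without enough powered-on slots to run.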
The purpose of the second set of computer-generated rules is the progressive shutdown of idle nodes when the cluster load decreases. Each rule in this set defines the degree of truth of the assertion "the i-th node must be powered off". This degree of truth is computed using a zero-order Takagi-Sugeno-Kang fuzzy model [50,51] with N triangular fuzzy subsets on the domain of the node idle times and N weights between 0 and 1. For instance, if N = 3, then the computer-generated rules would consider three linguistic terms regarding the total amount of time that the i-th node has been idle, with $N_1$ being "SHORT", $N_2$ being "MEDIUM" and $N_3$ being "LARGE". In this case, the degree of truth of the aforementioned assertion would be computed as:

$$\mathrm{off}_i = \frac{\mathrm{SHORT}(idle_i)\, w_1 + \mathrm{MEDIUM}(idle_i)\, w_2 + \mathrm{LARGE}(idle_i)\, w_3}{\mathrm{SHORT}(idle_i) + \mathrm{MEDIUM}(idle_i) + \mathrm{LARGE}(idle_i)} \tag{1}$$

where $idle_i$ is the amount of time that the i-th node has been idle, SHORT, MEDIUM, and LARGE are fuzzy sets with triangular memberships [50], and $w_1$, $w_2$, $w_3$ are the N weights taking values between 0 and 1.
Once each rule has been computed, results are combined so that the number of nodes to be powered off is the sum of the off values of all nodes. As can be seen, this second set of fuzzy rules does not require nodes to reach a certain crisp value before they can be selected for shutting down, but rather is applied to the cluster as a whole, as opposed to the rules proposed previously to power off idle nodes, such as in reference [36]. This approach allows the system to respond to smaller changes in the cluster load more frequently, thus progressively adapting its resources to better match workload valleys produced when jobs release their resources upon completion. Further information on the Hybrid Genetic Fuzzy System can be found in references [38,39].
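The fuzzy rule set can be illustrated with a short sketch of a zero-order Takagi-Sugeno-Kang evaluation: each node's degree of truth is the membership-weighted average of the weights, and the per-node values are then added up. The function names, the placement of the triangular sets (given here by a hypothetical list of centres), and the saturation of the first and last sets outside the range of centres are assumptions made for the example.

```python
def triangular_partition(x, centers):
    """Membership of x in each triangular fuzzy set.

    centers: strictly ascending peaks of the sets (hypothetical values).
    Assumption: the first and last sets saturate at 1 outside the range.
    """
    mu = [0.0] * len(centers)
    if x <= centers[0]:
        mu[0] = 1.0
    elif x >= centers[-1]:
        mu[-1] = 1.0
    else:
        for k in range(len(centers) - 1):
            lo, hi = centers[k], centers[k + 1]
            if lo <= x <= hi:
                frac = (x - lo) / (hi - lo)
                mu[k], mu[k + 1] = 1.0 - frac, frac
                break
    return mu


def off_degree(idle, centers, weights):
    """Zero-order TSK output: weighted average of the rule weights."""
    mu = triangular_partition(idle, centers)
    return sum(m * w for m, w in zip(mu, weights)) / sum(mu)


def nodes_to_power_off(idle_times, centers, weights):
    """Sum of the per-node degrees of truth over the whole cluster."""
    return sum(off_degree(t, centers, weights) for t in idle_times)
```

Because the per-node values are summed, several nodes that have each been idle for only a moderate time can together justify powering off one node, which is the cluster-wide behaviour described above. How the sum is rounded to a whole number of nodes is not specified in the text and is left out of the sketch.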
Once a decision is made by the HGFS determining the number of slots that must be powered on or off, this decision is translated to a set of physical compute nodes that will be chosen for reconfiguration. This is done considering the state of each node (only idle nodes can be shut down and only powered-off nodes are candidates for power-on) plus two additional values: the node efficiency, measured as the ratio of performance to power consumption, and the timestamp of its last failure. The latter is used to determine how likely it is for a given node to fail upon a reconfiguration request, in such a way that if a node was recently issued a power on/off command and it failed to comply with it, then the same problem is expected to occur again in the near future. However, if the node failed a long time ago, it is more likely to have been repaired. The target nodes to be reconfigured are first split into two groups depending on whether or not they failed to comply with the last issued command. The first group, consisting of the nodes which worked correctly, is sorted according to efficiency so that the least efficient nodes are chosen to be powered off and, conversely, the most efficient ones are chosen to be powered on. If the nodes in the first group are not enough to match the slots in the reconfiguration decision, the remaining nodes are chosen from the second group, which is sorted according to the timestamps of the failures, choosing first the nodes with the earliest values. That is, when a node is chosen from the second group, it is because its last failure occurred before the failure of any other node. If this chosen node fails again, the timestamp of its last failure is updated, so that the next time the nodes are sorted it will be the last one. The idea behind this design is to prevent the system from systematically selecting the same malfunctioning node if others are available for reconfiguration.
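The node selection procedure can be sketched as follows. The `Node` record and the `pick_nodes` function are hypothetical names introduced for this example, and the candidates are assumed to have already been filtered by state (idle nodes for a shutdown, powered-off nodes for a power-on).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    slots: int
    efficiency: float                  # performance / power consumption
    failed_last_command: bool = False  # did not comply with the last command
    last_failure: float = 0.0          # timestamp of the last failure


def pick_nodes(candidates, slots_needed, power_on):
    """Choose nodes until the requested number of slots is covered."""
    healthy = [n for n in candidates if not n.failed_last_command]
    failed = [n for n in candidates if n.failed_last_command]
    # Healthy nodes: most efficient first when powering on,
    # least efficient first when powering off.
    healthy.sort(key=lambda n: n.efficiency, reverse=power_on)
    # Failed nodes: earliest failure first, as it is the most likely
    # to have been repaired in the meantime.
    failed.sort(key=lambda n: n.last_failure)
    chosen, covered = [], 0
    for node in healthy + failed:
        if covered >= slots_needed:
            break
        chosen.append(node)
        covered += node.slots
    return chosen
```

Placing the failed group after the healthy one means that a malfunctioning node is only retried when no healthy candidate is left, which is the behaviour the design aims for.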

Learning Algorithm
The Hybrid GFS described is flexible enough to behave as desired in order to suit the cluster administrator preferences. However, this requires every HGFS parameter to be properly tuned, which is a complex task due to the presence of multiple conflicting objectives and to the huge number of combinations, which renders an exhaustive search infeasible. In particular, every instance or configuration of the previous HGFS is the combination of the following parameters:

$$(t_{\min}, t_{\max}, n_{\min}, n_{\max}, w_1, \ldots, w_N) \tag{2}$$

To address this problem, the EECluster learning algorithm uses multi-objective evolutionary algorithms (MOEAs) to find the parameters defining the HGFS by optimizing a fitness function consisting of three conflicting criteria: the quality of service (QoS), the energy saved, and the number of node reconfigurations. Specifically, EECluster uses the MOEA Framework [52] implementation of the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [53] to obtain a Pareto Efficient Frontier from which the administrator can choose a suitable HGFS configuration. The Pareto Efficient Frontier is the set of "non-dominated" HGFS configurations obtained in the experiment, that is, those for which no other configuration is at least as good in every component of the fitness function and strictly better in at least one. As can be seen, the result of the learning algorithm is not a single optimal solution (configuration for the HGFS) but rather a set of optimal configurations from which a human expert can pick the one that best suits his or her preferences. The reason for this is that there is no single optimal solution, because the three objectives involved are in conflict, and attempts to apply any form of weighted sorting of the solutions obtained would lead to an inaccurate model of the preferences of the administrator.
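The notion of a non-dominated configuration can be made concrete with a brute-force filter over already evaluated configurations. This sketch is not NSGA-II and does not use the MOEA Framework; it only illustrates the dominance relation, assuming that lower QoS values (shorter relative waits) and fewer reconfigurations are better, and that more saved energy is better.

```python
def dominates(a, b):
    """True if fitness a dominates fitness b.

    Each fitness is a tuple (qos, energy_saved, reconfigurations).
    Assumption: qos and reconfigurations are minimised, energy is maximised.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] < b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(points):
    """Keep the fitness tuples that no other tuple dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, a configuration with the same QoS as another but with less saved energy and more reconfigurations is dominated and would never be offered to the administrator.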
For a given set of n jobs, where the j-th job (j = 1 . . . n) is scheduled to start at time $tsch_j$, but effectively starts at time $ton_j$ and stops at time $toff_j$, the quality of service in an HPC cluster reflects the amount of time that each job has to wait before it is assigned its requested resources. Once the job starts its execution, it will not be halted; thus, we focus only on its waiting time. Because jobs do not last the same amount of time, their waiting time in the queue is better expressed as a ratio with respect to their execution time. It is noteworthy that the execution times of the jobs can differ greatly, since they range from seconds to weeks or months. This can potentially lead to situations where very short jobs must wait over a hundred times their execution timespan, distorting the measurement of the quality of service and depicting the cluster performance inaccurately. Because of this, the 90th percentile is used instead of the average:

$$\mathrm{QoS} = \min\left\{ p : \left\| \left\{ j \in \{1, \ldots, n\} : \frac{ton_j - tsch_j}{toff_j - ton_j} \le p \right\} \right\| > 0.9\, n \right\} \tag{3}$$

where ||A|| is the cardinality of the set A.
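The quality of service measure can be sketched in a few lines. The function name and the representation of jobs as tuples are assumptions for the example, and the percentile is taken here as the smallest ratio p such that more than 90% of the jobs have a waiting-to-execution ratio not greater than p.

```python
def qos_90th_percentile(jobs):
    """90th percentile of (waiting time / execution time).

    jobs: iterable of (tsch, ton, toff) tuples with toff > ton.
    """
    ratios = sorted((ton - tsch) / (toff - ton) for tsch, ton, toff in jobs)
    if not ratios:
        raise ValueError("at least one job is required")
    # Smallest value with more than 90% of the jobs at or below it;
    # integer arithmetic avoids floating-point issues with 0.9 * n.
    return ratios[(9 * len(ratios)) // 10]
```

With ten jobs, for example, the measure equals the largest ratio, since nine jobs are not more than 90% of the set.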
The energy saved is measured as the number of watt-hours that were prevented from being wasted by shutting down idle nodes. Let c be the number of nodes, let state(i, t) be 1 if the i-th node (i = 1 . . . c) is powered on at time t, and 0 otherwise, and let the time scale be the lapse between $tini = \min_j \{tsch_j\}$ and $tend = \max_j \{toff_j\}$. Lastly, let $power_{idle}(i)$ be the power consumption, measured in watts, of the i-th node when it is idle. Then,

$$\text{Energy saved} = \sum_{i=1}^{c} \int_{tini}^{tend} power_{idle}(i)\, \bigl(1 - state(i, t)\bigr)\, dt \tag{4}$$

The number of node reconfigurations is the number of times that a node has been powered on or off. Let nd(i) be the number of discontinuities of the function state(i, t) in the time interval t ∈ (tini, tend):

$$\text{Reconfigured nodes} = \sum_{i=1}^{c} nd(i) \tag{5}$$

The mission of the NSGA-II algorithm is to obtain a set of non-dominated configurations for the HGFS, guided by the previous fitness function, whose values are calculated by running a cluster simulation with a given number of nodes, slots, and job records, as seen in Figure 3.
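The two remaining fitness components can be computed from a piecewise-constant record of each node's power state. In this sketch, which uses hypothetical names, each node is described by a time-sorted list of (time in seconds, state) events whose first entry is assumed to fall at tini; the energy saved is the idle power multiplied by the time spent powered off, and the number of reconfigurations is the number of state changes.

```python
def energy_saved_wh(transitions, power_idle_w, t_end):
    """Watt-hours not spent on idle nodes because they were powered off.

    transitions:  per node, sorted list of (time_s, state), 1 = on, 0 = off;
                  the first event is assumed to be at the start of the trace.
    power_idle_w: idle power of each node, in watts.
    t_end:        end of the trace, in seconds.
    """
    total = 0.0
    for events, watts in zip(transitions, power_idle_w):
        off_seconds = 0.0
        # Pair every event with the next one (or with the end of the trace).
        for (t, state), (t_next, _) in zip(events, events[1:] + [(t_end, None)]):
            if state == 0:
                off_seconds += t_next - t
        total += watts * off_seconds / 3600.0
    return total


def reconfigurations(transitions):
    """Total number of power state changes over all nodes."""
    return sum(
        sum(1 for (_, a), (_, b) in zip(events, events[1:]) if a != b)
        for events in transitions
    )
```

Both values depend only on the simulated power state traces, so they can be obtained from the same simulation run that produces the job start times used for the QoS.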

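Figure 3. Learning algorithm. The simulator receives an HGFS configuration (t min, t max, n min, n max, w 1, w 2, ..., w N) and a cluster description (c nodes, s slots, n jobs), and returns the fitness values (QoS, energy saved, reconfigured nodes).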

Experimental Results
In order to provide a sound answer on whether the decision-making mechanism has the required flexibility to perform correctly and suit any desired working mode, it must be tested in a range of cluster scenarios which together can build a significant representation of real-world clusters. To do so, a combination of synthetically generated and actual cluster workloads from the Scientific Modelling Cluster (CMS) of the University of Oviedo [54] was used. Synthetic workloads represent four different scenarios with an increasing degree of fluctuation in terms of job arrival rates, each one spanning 24 months. Job arrivals in each scenario follow a Poisson process with the λ values shown in Table 1, and job run times are distributed exponentially with rate $\lambda = 10^{-5}\ \mathrm{s}^{-1}$ in all scenarios. As can be seen in the table, scenario 1 exhibits a cluster with a stable and sustained workload where all hours of the year have the same job arrival pattern. Scenario 2 adds a distinction between working, non-working, and weekend hours. Scenario 3 adds a substantial variation in the arrival rates depending on the week of the month, and scenario 4 increases this variation even more. On the other hand, the workloads from the CMS cluster consist of 2907 jobs spanning 22 months. This real-world cluster, built from three independent computing clusters and five transversal queues using PBS as Resource Management System (RMS), can accurately show a very common activity pattern in most HPC clusters.
Results obtained for scenario 1 are displayed in Table 2 and in Figure 4, scenario 2 in Table 3 and in Figure 5, scenario 3 in Table 4 and in Figure 6, and scenario 4 in Table 5 and in Figure 7. Lastly, results obtained for the CMS cluster recorded workloads are displayed in Table 6 and in Figure 8. As can be seen from these results, in every cluster scenario used in the experiments, the learning algorithm found a configuration for the HGFS that achieves significant energy savings without any noticeable impact on service quality. Also, additional configurations were found that comply with the synthetic administrator preferences defined, increasing energy savings while strictly complying with the constraints set in the aforementioned preferences in terms of QoS and node reconfigurations. It should be noted that the five configurations displayed in the previous tables are only a small selection of the vast set obtained in the Pareto Efficient Frontier, and many other solutions are available that can save even more energy at the cost of a higher penalty in service quality.
The experiments also show that the results obtained differ significantly depending on the characteristics of the workload. In scenario 1, the regular job arrival rate produces a workload where the distances between peaks are very short and valleys tend to be shallow. This leads to HGFS configurations with a higher average number of reconfigurations and a relatively low amount of saved energy. Also, as can be seen in Table 2, raising the degree of tolerance for impact on QoS from 0.0 to 0.02 and node reconfigurations up to 3000 only allows an increase in the overall energy savings of 9.88%. Results improve progressively as job arrival patterns vary over time and the workload becomes more irregular, with deeper valleys and longer distances between peaks. For instance, scenario 2 allows an increase in energy savings of 13.37% between the HGFS configurations obtained for a QoS of 0.0 and a QoS of 0.02. In scenario 3, the difference grows to 31.45%, and in scenario 4 it reaches 38.51%.
These last two scenarios show important results, since they represent the workload patterns most likely to occur in real-world HPC clusters. This can be verified by the results obtained using actual records from the CMS cluster of the University of Oviedo, where the workload is even sharper than in scenario 4, with energy savings between 46.67% and 75.54%, depending on the administrator preferences. These values can be translated to actual energy savings ranging from 13.38 MWh to 21.66 MWh over the course of the test set, and to carbon reductions between 4.95 and 8.01 tonnes of CO2.
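For readers who wish to reproduce a workload similar to the synthetic scenarios, the following sketch generates job arrivals from a Poisson process and exponentially distributed run times. It covers only a constant arrival rate, as in scenario 1, and does not model the hour, weekday, or week-of-month variations of scenarios 2 to 4. The function name, the arrival rate used in the example, and the seed are hypothetical; only the run-time rate of 10^-5 per second comes from the text.

```python
import random


def synthetic_jobs(arrival_rate, horizon_s, runtime_rate=1e-5, seed=0):
    """Generate (arrival_time_s, run_time_s) pairs.

    arrival_rate: jobs per second of the Poisson arrival process
                  (hypothetical value, the rates of Table 1 are not used).
    horizon_s:    length of the simulated period, in seconds.
    runtime_rate: rate of the exponential run-time distribution, per second.
    """
    rng = random.Random(seed)
    jobs, t = [], 0.0
    while True:
        # Inter-arrival times of a Poisson process are exponential.
        t += rng.expovariate(arrival_rate)
        if t >= horizon_s:
            return jobs
        jobs.append((t, rng.expovariate(runtime_rate)))


# Example: roughly one job every 1000 s over about 11.6 days.
workload = synthetic_jobs(arrival_rate=1e-3, horizon_s=1_000_000, seed=1)
```

A run-time rate of 10^-5 per second corresponds to a mean job duration of 10^5 s, that is, a little under 28 hours.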

Concluding Remarks
The EECluster tool has been designed to reduce the carbon footprint of HPC clusters by improving their energy efficiency. This software package implements the adaptive resource cluster technique in clusters running OGE/SGE or PBS/TORQUE as RMS, allowing for practical application in real-world scenarios, owing to the flexibility of its machine-learnt decision-making mechanism to comply with cluster administrator preferences. This mechanism, based on Computational Intelligence techniques, is learnt by means of multi-objective evolutionary algorithms in order to find a suitable configuration that maximises energy savings within the tolerance region of the administrator in terms of service quality and node reliability.
Thorough experimental studies based on both synthetic and actual workloads from the Scientific Modelling Cluster of the University of Oviedo [54] provide empirical evidence of the ability of EECluster to deliver good results in multiple scenarios, supporting the adoption of EECluster to reduce the environmental impact of real-world clusters.

Figure 4. Cluster simulation trace for the test set of scenario 1. GFS: Genetic Fuzzy System.

Figure 5. Cluster simulation trace for the test set of scenario 2.

Figure 6. Cluster simulation trace for the test set of scenario 3.

Figure 7. Cluster simulation trace for the test set of scenario 4.

Figure 8. Cluster simulation trace for the test set of the CMS cluster workload records.

Table 2. Experiment results for the test set of scenario 1.

Table 3. Experiment results for the test set of scenario 2.

Table 4. Experiment results for the test set of scenario 3.

Table 5. Experiment results for the test set of scenario 4.

Table 6. Experiment results for the test set of the Scientific Modelling Cluster (CMS) workload records.