A Machine Learning Solution for Data Center Thermal Characteristics Analysis

Abstract: The energy efficiency of Data Center (DC) operations heavily relies on the DC ambient temperature as well as the performance of its IT and cooling systems. A reliable and efficient cooling system is necessary to produce a persistent flow of cold air to cool servers that are subjected to constantly increasing computational load due to the advent of smart cloud-based applications. Consequently, the increased demand for computing power will inadvertently increase server waste heat generation in data centers. To improve a DC thermal profile, which undeniably influences the energy efficiency and reliability of IT equipment, it is imperative to explore the thermal characteristics of the IT room. This work employs an unsupervised machine learning technique to uncover weaknesses of a DC cooling system based on real DC thermal monitoring data. The findings of the analysis identify areas for thermal management and cooling improvement that further feed into DC recommendations. With the aim of identifying overheated zones in a DC IT room and the corresponding servers, we analyzed the thermal characteristics of the IT room. The experimental dataset includes measurements of ambient air temperature in the hot aisle of the IT room in the ENEA Portici research center hosting the CRESCO6 computing cluster. We use machine learning clustering techniques to identify overheated locations and categorize computing nodes based on surrounding air temperature ranges abstracted from the data. The principles and approaches employed in this work are replicable for the analysis of the thermal characteristics of any DC, thereby fostering transferability. This paper demonstrates how best practices and guidelines can be applied for the thermal analysis and profiling of a commercial DC based on real thermal monitoring data.


Introduction
Over the past decade, Data Centers (DCs) have made considerable efforts to ensure energy efficiency and reliability, and the size and stability of their facilities have been upgraded because of the enormous increase in demand [1,2]. Currently, the amount of data to be processed is expanding exponentially due to the growth of the information technology (IT) industry and the advent of IoT and AI technologies. Consequently, new DC construction and smart DC management are on the rise to meet this demand. If a data center experiences a system failure or outage, it becomes challenging to ensure stable and continuous IT service provision (particularly for smart businesses, social media, etc.). If such a situation occurs on a large scale, it could lead to chaos, particularly in the business sector but also in other sectors (e.g., health, manufacturing, entertainment). In other words, the data center has emerged as an infrastructure that is mission-critical [30] to the survival of businesses and other sectors supported by smart technologies. Therefore, backup system management and uninterruptible power supply (UPS) systems are critically necessary so that compute system stability can be maintained even in emergency situations. DCs maintain their stability by having redundant power supply paths, including emergency generators, UPSs, etc. IT servers require uninterruptible supplies of not only power but also cooling [3,4]. For this purpose, in liquid cooling, central cooling systems are designed and managed to allow for chilled-water supply during cooling system outages by including cooling buffer tanks for stable cooling of IT equipment. If the chillers are interrupted, the emergency power and cooling systems are activated, and the chilled water is supplied again. Consequently, mission-critical facility management for the stable operation of a DC leads to huge cost increases, and careful reviews must be performed starting from the initial planning stage [5,6].
Considering that the number of times such emergency situations occur during a DC life cycle is very small, and that IT servers' tolerance of various operational thermal environments has vastly improved compared to the past due to the development of IT equipment, there is considerable room for reducing the operating times and capacities of chilled-water storage tanks. The specifications of every piece of IT equipment include (but are not limited to) admissible ranges for temperature and humidity, and the periods of overheating tolerated before automatic power-off. Additionally, maintaining healthy operational conditions is a complex task because IT devices might have different recommended specifications for operation. Undeniably, covert factors such as bypass, recirculation, hotspots, and partial rack overheating could negatively affect the health of the IT and power equipment that is critical for efficient DC operations. For example, in the case where an IT room is divided into cold and hot aisles, improper partitioning of the aisles may result in recirculation of hot air or cold air bypass [7]. Consequently, such emerging challenges call for optimized thermal conditions within a DC facility. Thermal management involves the reduction of excess energy consumption by cooling systems, servers' load processing, and their internal fans. It encompasses compliance of the IT facility environment with temperature requirements and standards, which will inevitably result in reliability, availability, and overall improved server performance. Poor thermal management in a DC can be a primary contributor to IT infrastructure inefficiency due to hardware degradation; for this reason, it is necessary to disperse dissipated waste heat so that it is evenly distributed within the premises, avoiding overheating [31].
This work explores the thermal characteristics analysis of an IT room (due to waste heat) using data mining techniques for the purpose of relevant knowledge discovery. The primary goal is to use an unsupervised machine learning modelling technique to uncover weaknesses in the DC cooling system based on real DC thermal monitoring data. The analysis in this research leads to the identification of areas for energy efficiency improvement that feed into DC recommendations. The proposed methodology includes statistical analysis of IT room thermal characteristics and identification of individual servers that frequently occur in the hotspot zones. The reliability of the analysis has been enhanced by the availability of a large dataset of ambient air temperatures in the hot aisle of the ENEA Portici CRESCO6 computing cluster. In brief, clustering techniques have been used for hotspot localization as well as node categorization based on surrounding air temperature ranges. The principles and approaches employed in this work are replicable for the energy efficiency evaluation of any DC and thus foster transferability. This work showcases the applicability of best practices and guidelines in the context of a real commercial DC, transcending the typical set of existing metrics for DC energy efficiency assessment. The remainder of the paper is organized as follows: Section 1 is dedicated to the introduction; Section 2 focuses on Background and Related Work; Section 3 presents the Methodology adopted for this work; Section 4 covers Results and Discussion; and Section 5 concludes the paper with future work.

Background and Related Work
In recent years, a small number of theoretical and practical studies have been conducted on DC thermal management to understand cooling systems under fault conditions, including system thermal and energy performance, system distribution optimization, and simulation studies. Thermal management involves the reduction of excess energy consumption by cooling systems, servers' load processing, and their internal fans. It encompasses compliance of the IT facility environment with temperature requirements and standards, which will inevitably result in reliability, availability, and overall improved server performance. Existing data center-related thermal management research: highlights the primary challenges of cooling high power density DCs [8]; recommends a list of thermal management strategies [9]; examines the effect of a cooling approach on PUE, using direct air with a spray system that evaporates water to cool as well as humidify incoming air [10]; investigates the thermal performance of air-cooled data centers with raised and non-raised floor configurations [11] and the quantification of thermo-fluid processes through performance metrics [12]; proposes a thermal model for joint cooling and workload management [13], while [14] explores thermal-aware job scheduling, dynamic resource provisioning, and cooling; and utilizes real thermal information about servers, inlet/outlet air temperature, and air mover speed to create thermal and power maps to monitor the real-time status of a data center [15]. The majority of the previously listed research focuses on simulations or numerical modelling [9-14] or empirical research involving R&D or small-scale data centers [10,15]; thus, there is a need for more empirical research involving real, relevant thermal data for large-scale data centers. Undeniably, it is tremendously beneficial to identify hotspots and air dynamics (particularly negative effects) within a DC IT room.
Such useful evidence-based information will help DC operators improve their DC thermal design and ensure uninterrupted, steady compute system operations. Additionally, it will be of added value if thermal management research adheres to the thermal management framework recommended in [16] at varying granularities of DCs. Thermal metrics have been created by the research and enterprise DC community to facilitate DC thermal management [7]. The employment of metrics aims to reveal the underlying causes of thermal-related challenges within a DC IT room and to assess the overall thermal conditions of the room. Finally, [28] proposes a holistic data center assessment method based on biomimicry by integrating data on energy consumption for powering and cooling ICT equipment. This research focuses on the analysis of DC IT room thermal characteristics with machine learning techniques to uncover ways to render the cooling system more effective, as well as ways to achieve an even distribution of server waste heat within a DC.
This work focuses on the identification of individual servers in an IT room of a DC cluster that frequently occur in the hotspot zones, applying a clustering algorithm to an available dataset with thermal characteristics of the ENEA Portici CRESCO6 computing cluster. This paper represents the completion of the authors' previous work [7,17,18,19,20,21,31] in terms of exploring the intricacies of deploying the theoretical framework in a real DC. Appropriate data analytics techniques have been applied to real server-level sensor data to identify potential risks caused by the possible presence of negative covert factors related to the cooling strategy. This work is based, first of all, on the statistical analysis of available real thermal data, complemented by a complete thermal characteristics analysis through machine learning techniques. However, ML has generally been employed for VM allocation, global infrastructure management, and prediction of electricity consumption and renewable energy availability [22]. Thus far, there is work on ML for thermal characteristics assessment and weather conditions prediction, but only limited work on thermal management. Typically, Computational Fluid Dynamics (CFD) techniques have been employed for the exploration of DC thermal management; their drawbacks are high computational power and memory requirements. Therefore, the added value of this research is the utilization of less power-demanding techniques for thermal characteristics analysis (namely, hotspot localization). Additionally, this paper aims to increase DC thermal awareness and provide recommendations for thermal management based on the study of the thermal characteristics of the DC IT room environment and the IT equipment energy consumption of the ENEA Portici CRESCO6 cluster, using real monitored thermal data. This work exploits machine learning analysis of IT room thermal characteristics. To achieve this aim, the following research objectives are addressed: RO.1. To identify the clustering (grouping) algorithm appropriate for the purpose of this research; RO.2. To determine the criteria for feature selection in cluster analysis of the thermal characteristics; RO.3. To determine the optimal number of clusters for thermal characteristics analysis; RO.6. To provide recommendations related to the thermal management of the IT room that appropriately address server overheating resulting in local hotspot-related issues.

Methodology
This section discusses the thermal characteristics analysis of the ENEA cluster CRESCO6. A Machine Learning clustering technique is chosen for a more in-depth analysis of hotspot localization based on the available dataset of CRESCO6 node temperature measurements. The drawback of the statistical analysis of temperature measurements is that it cannot pinpoint the specific nodes that cause rack hotspots. Hence, to address this gap, we have applied Machine Learning techniques for node clustering to localize hotspots. Locating hotspots in the CRESCO6 group of nodes (the term "group of nodes" stands for the DC "cluster"; the latter term is avoided here to prevent confusion with clusters of data) is achieved through the grouping of sequential sets of nodes into clusters with higher or lower hot aisle and internal server temperatures.

Cluster and Dataset Description
The analysis is based on collected data related to server power consumption and ambient air temperature of the CRESCO6 cluster in the ENEA-Portici Research Center premises (up and running since summer 2018). The cluster was created due to the growing demand for research center computational and analytic activities, as well as the general motivation to keep abreast of modern technologies. The High-Performance Computing cluster CRESCO6 has a nominal computing power of around 1.4 PFLOPS (1,000 TFLOPS being the result obtained on the High-Performance Linpack benchmark, a computational power test that performs parallel calculations on dense linear systems with 64-bit precision). It complements the CRESCO4 HPC system, already installed and still operating in the Portici Research Center, which has a nominal computing power of 100 TFLOPS. CRESCO6, on its own, provides a sevenfold increase of the entire computing capability currently available for computational activities in the ENEA research center. The cluster comprises 418 Lenovo nodes with a FatTwin™ 2U form factor, housed in a total of 5 racks. Each node houses two Intel® Xeon® Platinum 8160 CPUs, each with 24 cores operating at a clock frequency of 2.1 GHz, for a total of 20,064 cores. Each node also houses an overall RAM of 192 GB, equivalent to 4 GB/core. Finally, the nodes are interconnected by an Intel® Omni-Path network with 15 switches of 48 ports each, a bandwidth of 100 Gb/s, and a latency of 1 μs. CRESCO6 can satisfy the need for high scalability in the execution of parallel codes. This resource is aimed at supporting Research and Development activities in the ENEA Research Center.
In the last ten years, the CRESCO HPC systems have enabled and supported ENEA participation in national and international projects in various technological sectors, ranging from bio-informatics and structural biology, with effects in the medical and environmental fields, to the design of new materials and fluid dynamics, with impact on different energy sectors (e.g., photovoltaic, nuclear, energy from the sea, combustion). Furthermore, thanks to the availability of the CRESCO infrastructure, ENEA is a partner of the European Center of Excellence EoCoE (Energy oriented Center of Excellence) and the Focus CoE (Center of Excellence) projects; EoCoE is one of eight Centers of Excellence for HPC applications financed by the Horizon 2020 program. EoCoE intends to contribute to accelerating the transition to a carbon-free economy by exploiting the growing computational power of HPC infrastructures. Apart from enhanced hardware, improvements have also been made to the monitoring system of the new cluster. It comprises energy and power meters, temperature and airflow sensors, and fan speed registration. Measurements were taken from the period of cluster initialization and performance tuning in May-July 2018 to the period of cluster utilization by end-users in September 2018-February 2019, for approximately 9 months in total, with a break in August 2018, as represented in Figure 1. The measurement system covered all 216 nodes, of which 214-215 nodes were consistently monitored, while the other 1-2 nodes had missing values or were turned off. The monitoring system consisted of an energy meter; power meters for CPU, RAM, and entire IT system utilization of every node; CPU temperature sensors for both processing units of each node, installed inside the servers; and inlet and exhaust temperature sensors placed in the front and rear parts of every node, facing the cold and hot aisles respectively.

Data Analytics
Data analytics encompasses the investigation of temperature variation in different parts of the IT room and the evaluation of thermal metrics. However, the variability of thermal data and uncertainties in defining temperature thresholds for hotspots (identified via statistical analysis) have motivated the use of unsupervised learning. Therefore, a K-means clustering algorithm has been employed to address the limitations of typical statistical techniques. With Machine Learning techniques, the number of clusters is determined using two indices (the Silhouette metric and the Within-Cluster Sum of Squares), and the available thermal characteristics (i.e., exhaust temperature and CPU temperatures) are inputs to the clustering algorithm. Subsequently, a series of clustering results are intersected to unravel the nodes (identified by IDs) that frequently fall into high-temperature areas of the cluster racks. As depicted in Figure 2, an adapted data lifecycle methodology has been employed for this work. The methodology comprises stages of data preprocessing and data analysis, as well as results interpretation and exploitation in the form of recommendations for the DC, and is applied to the dataset of CRESCO6 node temperature measurements. All data analytics stages represented in Figure 2 are described in detail below. The data preprocessing step consists of data cleansing and dataset organization. The dataset is cleansed of zero and missing values and is organized as shown in Table 1, which summarizes the results of monitoring for the overall number of nodes in CRESCO6, N. In addition, data preprocessing involves formatting timestamps and user information for further exploitation. The system is configured so that, at intervals of around 15 minutes, the monitoring system records the thermal and other measured data for every node, with a slight latency between the readings of each node. The readings result in a set of N rows with information for every node ID.
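The preprocessing step described above can be sketched in pandas as follows. This is a minimal illustration assuming hypothetical column names (`node_id`, `timestamp`, `t_exhaust`) and synthetic readings; the actual schema of the CRESCO6 monitoring dataset may differ.

```python
import pandas as pd

# Synthetic readings standing in for the raw monitoring data
# (hypothetical schema: node_id, timestamp, t_exhaust).
raw = pd.DataFrame({
    "node_id":   [1, 2, 3, 1, 2, 3],
    "timestamp": pd.to_datetime(
        ["2018-09-01 10:00", "2018-09-01 10:00", "2018-09-01 10:01",
         "2018-09-01 10:15", "2018-09-01 10:16", "2018-09-01 10:15"]),
    "t_exhaust": [34.2, 0.0, 36.1, 35.0, 33.8, None],  # 0.0 / None = faulty reading
})

# Data cleansing: drop missing values and zero (sensor-fault) readings.
clean = raw.dropna()
clean = clean[clean["t_exhaust"] != 0]

# Organize by time label: readings taken within the same ~15-minute monitoring
# sweep (with slight per-node latency) share one label, yielding M snapshots
# of up to N rows each.
clean = clean.assign(time_label=clean["timestamp"].dt.floor("15min"))
snapshots = {t: g for t, g in clean.groupby("time_label")}
print(len(snapshots))  # M, the number of time labels
```

Each value in `snapshots` is then one per-node table that can be fed to a single clustering run.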
As shown in Table 1, data preprocessing includes extracting important thermal data features and removing severely incomplete or erroneous data. The Data Analysis stage includes several substages. Sequential clustering involves the following: determining the optimal number of clusters (done with the use of two indices); actual clustering of servers into groups with low, medium, and high surrounding air temperature ranges; and consolidation of results to ascribe the most frequently occurring cluster label to each server (i.e., low, medium, or high). The analysis is based on the aforementioned data preprocessing step (Table 1). Clustering is performed M times, where M is the overall number of time labels at which measurements are taken from all cluster nodes. Each new set of monitoring system readings is labelled with a time label, and the exact timestamp for the extracted information is recorded for every node. Depending on the available dataset, a number of relevant features describe the state of every node, and their different combinations can be used as a basis for clustering (RO.2 is considered in more detail in Section 3). In Table 1, in the last column, base is an indicator of one of the three combinations of measurements used as the basis for clustering and corresponds to the temperature of the cluster centroid. In this work, the K-Means algorithm is chosen for clustering the nodes for several reasons (RO.1):
- The number of features used for clustering is small; therefore, the formulated clustering problem is simple and does not require complex algorithms;
- K-Means has linear computational complexity, which renders it fast for the type of problem in question. While the formulation of the problem is simple, it requires several thousand repetitions of clustering for each set of N nodes.
From this point of view, the speed of the algorithm becomes an influential factor;
- K-Means has a weak point, namely the random choice of initial centroids, which can lead to different results when different random generators are used. This does not pose an issue in this use case, since the nodes are clustered several times based on sets of measurements taken at different timestamps, and minor differences brought about by randomness are mitigated by the repetition of the clustering procedure.
The optimal number of clusters is determined with two indices, the Silhouette coefficient and the Within-Cluster Sum of Squares (WCSS) (RO.3). The application of these two indices is shown in Appendix A. In brief, the Silhouette coefficient is computed for each clustered sample and indicates how well the clusters are separated from each other, i.e., the quality of clustering. A Silhouette value close to +1 for a specific number of clusters K indicates dense, well-separated clusters; -1 indicates incorrect clustering; and 0 indicates overlapping clusters. Therefore, we focus on local maxima of this coefficient. WCSS is used in the Elbow method of determining the number of clusters and is used here to support the decision obtained from the Silhouette coefficient estimation. It measures the compactness of clusters, and the optimal value of K is the one that results in the "turning point" or "elbow" of the WCSS(K) graph. In other words, increasing the number of clusters beyond the elbow point does not result in significant improvement of cluster compactness. Although it could be argued that other indices could additionally be used for determining the number of clusters, the two aforementioned methods have converged on the same values of K, which is assumed to be sufficient for this research. Once the optimal number of clusters is obtained, actual clustering is performed for the chosen bases. For every cluster base, we further examine how frequently every node is assigned to each cluster and deduce the final cluster label as one of C_base^range, with the corresponding sets of nodes N_base^range.
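The selection of K described above can be sketched with scikit-learn, combining the Silhouette coefficient with WCSS (exposed as `inertia_` by `KMeans`). The synthetic one-feature temperature sample below, with three loose low/medium/high bands, stands in for one snapshot of node measurements; it is not the CRESCO6 data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic exhaust temperatures (°C) with three bands: low, medium, high.
temps = np.concatenate([
    rng.normal(28, 0.5, 60),   # low range
    rng.normal(34, 0.5, 100),  # medium range
    rng.normal(40, 0.5, 20),   # high range
]).reshape(-1, 1)

wcss, silhouette = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(temps)
    wcss[k] = km.inertia_                        # Within-Cluster Sum of Squares
    silhouette[k] = silhouette_score(temps, km.labels_)

# Pick K at the Silhouette maximum; the elbow of wcss supports the choice.
best_k = max(silhouette, key=silhouette.get)
print(best_k)
```

On well-separated bands like these, both indices point to the same K, mirroring the convergence of the two methods reported in the text.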
Subsequently, the sets of nodes in the hot range for every cluster base are intersected to unravel the nodes that are clustered into the "danger" or hot zone with the highest frequency across all three cluster bases. The next section discusses the results of this clustering procedure and lists the nodes that fall in the hot zone.
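The intersection step can be illustrated with plain set operations. The node IDs below are hypothetical, not the actual CRESCO6 results.

```python
# Hypothetical per-base results: for each clustering base, the set of node IDs
# most frequently assigned to the hot-range cluster.
hot_exhaust     = {101, 102, 117, 130, 145}   # base: exhaust temperature
hot_cpu         = {101, 117, 122, 130, 160}   # base: CPU temperature
hot_exhaust_cpu = {101, 117, 130, 145, 160}   # base: exhaust + CPU temperature

# Nodes flagged hot under all three bases are reported as hotspot nodes.
hotspot_nodes = hot_exhaust & hot_cpu & hot_exhaust_cpu
print(sorted(hotspot_nodes))  # [101, 117, 130]
```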

Results and Discussions
The high-granularity analysis of this work has considered temperature ranges of the surrounding air for three cluster bases: (a) exhaust temperature, (b) exhaust and CPU temperature, and (c) CPU temperature. During sequential clustering, each node has been labelled with a certain temperature range cluster.
Since clustering is repeated for each set of measurements grouped by time label, every node is clustered several times and tagged with different labels while the algorithm is in progress (RO.4). Figure 2 (a-c) shows the frequency of occurrence of every node in a particular cluster, based on the available measurements and the clustering base. This information indirectly indicates the "duration" for which a particular node remains in a certain temperature range (see legend in Figure 2 (a-c)). Here, the majority of the nodes frequently occur in the medium temperature range for all cluster bases. However, some nodes remain in the hot range in more than 50% of clustering cases. When nodes remain in the hot range for a prolonged period or frequently fall into this range, it implies that they are overheated.
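Ascribing the most frequent label to each node across the M clustering runs can be sketched as follows; the label histories shown are hypothetical.

```python
from collections import Counter

# Hypothetical label history: for each node ID, the cluster label it received
# at every one of the M sequential clustering runs.
label_history = {
    101: ["hot", "hot", "medium", "hot", "hot"],
    102: ["medium", "medium", "low", "medium", "medium"],
    103: ["low", "medium", "low", "low", "medium"],
}

# The final tag for a node is its most frequent label across all runs; the
# fraction roughly indicates how long it stayed in that temperature range.
final_label = {}
for node, labels in label_history.items():
    label, count = Counter(labels).most_common(1)[0]
    final_label[node] = label
    print(node, label, f"{count / len(labels):.0%}")
```

Node 101, hot in 80% of runs, would be flagged as a candidate hotspot node.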
Consequently, this brings about hardware degradation, where the nodes have reduced reliability and availability as they automatically switch to a lower power mode when overheated. Therefore, we continue with the analysis to identify the actual node IDs that have most frequently been clustered within the hot ranges.
Table 2. Ratio of cluster sizes and intersection of node labels from three bases.
The present work has contributed to the thermal characteristics analysis of the DC cluster, addressing the issue of hotspots. It has two positive effects in terms of sustainability. Firstly, being a thermal design pitfall, hotspots impose a risk of local overheating and deterioration of servers exposed to high temperatures for prolonged periods. In this regard, localization of hotspots is crucial for a better overview and control of the IT room temperature distribution. It provides a direction for future thermal management improvements that would mitigate the mentioned risk. Secondly, the clustering technique used in this phase requires less computational resources than computational fluid dynamics modelling and/or simulations performed with existing simulation packages. Such models provide an overview of the entire IT room ambient temperature distribution, whereas the area of interest is limited to the racks and their immediate proximity. Therefore, with less computational power (and thus energy consumption), the analysis techniques of this phase have provided sufficient information to incentivize improvement of thermal conditions in data centers. Finally, the results infer that the majority of the servers operated in the medium and hot temperature ranges. Given that 8% of all cluster servers have been labelled as most frequently hot-range nodes, the following recommendations are suggested to address the issue of hotspots (RO.6) in an air-cooled DC cluster located in a region where free air cooling is unavailable:
- Locate nodes by the identified hot-range IDs and find possible underlying patterns in the overheated nodes (e.g., position in the rack, proximity to the PDUs);
- Tune load sharing so that these 'hot' nodes are not overloaded in the future;
- Add directional cooling, for example, spot cooling;
- Continue monitoring IT room thermal conditions in the immediate proximity of the nodes to evaluate the effectiveness of the recommended actions and their effects on IT room temperature.

Conclusion
Analysis of IT and cooling systems is necessary for the investigation of DC operations-related energy efficiency. A reliable cooling system is essential to produce a persistent flow of cold air to cool the servers under an ever-increasing computational load. Energy efficiency has been addressed in this work from the point of view of thermal characteristics analysis of an IT room. In particular, a machine learning technique applied to real DC monitoring data has resulted in the identification of areas for energy efficiency improvement that feed into appropriate DC recommendations. The research methodology discussed in this paper includes statistical analysis of IT room thermal characteristics, thermal metrics evaluation, and the identification of individual servers that frequently occur in the hotspot zones (using a machine learning algorithm). Clustering techniques are used for effective hotspot localization as well as categorization of nodes based on surrounding air temperature ranges. This methodology has been applied to the available large dataset with thermal characteristics of the ENEA Portici CRESCO6 computing cluster. The concepts covered in this work are useful for the energy efficiency evaluation of any DC and ensure a high degree of transferability. This work showcases the applicability of best practices and guidelines to a real DC and goes beyond the set of existing metrics for DC energy efficiency assessment.
Appendix A
The Silhouette coefficient for a clustered sample is defined as s = (b - a) / max(a, b), where a is the mean intra-cluster distance (i.e., the mean distance between a data point and all other points in the same cluster), and b is the mean nearest-cluster distance (i.e., the mean distance between a sample and all points in the next nearest cluster). An example of the utilization of these indices is shown in Figure 3 for one step of sequential clustering based on the exhaust air temperature. The optimal elbow point of WCSS is K = 3, which coincides with the local maximum of the Silhouette index.