Peer-Review Record

Intelligent Monitoring of Data Center Physical Infrastructure

Appl. Sci. 2019, 9(23), 4998; https://doi.org/10.3390/app9234998

by Vojko Matko¹,*, Barbara Brezovec² and Miro Milanovič¹

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 23 October 2019 / Revised: 16 November 2019 / Accepted: 18 November 2019 / Published: 20 November 2019

(This article belongs to the Special Issue Managing Innovation and Sustainable Development in the Era of Industry 4.0)

Round 1

Reviewer 1 Report

The topic of the article is interesting. The problem of proper event management is undoubtedly important, both for the safety of people and data stored in server rooms. In its essence, it is interesting to diagnose the main cause of the sequence of events based on a lot of information sent from the object.

After careful reading of the article, I must say that I did not find any specific information in it, no exact description of the algorithm. In the content of the article I found only a general description of how the event analysis system works. In my opinion, presented results do not have explicit references to real events.

However, if the article were qualified for publication, I suggest introducing the following corrections in its content. The order of the comments does not reflect their significance. It results only from the order of appearance in the text of the manuscript.

My remarks, comments and tips:

Lines 43-44, “Recently, a number of studies on event monitoring and management have been published, but only some of them are not directly related to the data in the data center infrastructure” – I suggest simplifying this sentence.

Line 72, “the technical cooling” - what this term means, suggests changing this term to a term appropriate for a scientific article.

Lines 137-138, “The temperature for the A1 class should be in the range from 15 ‒ 32 ˚C and the relative humidity between 20% and 80% [27].” - Where did these parameters come from, are they in some standards for the Data Center or are they just taken from literature [27].

Line 247, “a correlation process” - This is probably the crux of the idea, but what exactly is done as part of the correlation process.

Table 2, “Filtering” - Does “filtering” mean the number of events after filtering ? If so, this line should be marked "events after filtering", and if not, then it may be marked "events rejected by filtration".

Line 256, “connected users” - do the authors mean “users” or "child events” here?

Line 265, “Another experiment” - A sample detailed description of the event could be useful. Events with the main reason and other alarms and warnings, giving actual real messages and description of the failure.

Line 275, “proposed RCA” - But where is the exact description of this method. How else can be understand this statement, but only like, that the authors want to propose the RCA method.

Lines 289-291, “For example, if we look at servers, even when they are in the sleep mode, they consume a lot of energy. Great savings can be achieved by turning off these servers in the event that there is no need to operate in idle mode.” - This conclusion is very doubtful for me, because it is not known when the service request from this server will come.

Line 294, “can receive energy through UPSs to maintain consistent cooling even during” – For data centers, I think this idea is completely frivolous. The amount of energy needed to produce cold is very large, you can’t load such devices with backup power sources.

Lines 316-317, “For Class A1, the temperature should be between 15 °C and 32 °C, and relative humidity between 20% and 80% [26,27].” - Once again, the same remark about such temperature and humidity ranges.

Comments for author File: Comments.docx

Author Response

Responses to reviewer 1:

Corrected to: “Recently, a number of studies on event monitoring and management have been published.” (new line 49)

Line 72, “the technical cooling” - what this term means, suggests changing this term to a term appropriate for a scientific article.

Corrected to: “cooling device power supplies” (new line 77)

Lines 137-138, “The temperature for the A1 class should be in the range from 15 ‒ 32 ˚C and the relative humidity between 20% and 80% [27].” - Where did these parameters come from, are they in some standards for the Data Center or are they just taken from literature [27].

Corrected to: (new lines 142 to 159)

Added were Table 1 and references [29-31].

The temperature for the A1 class should be in the range from 15‒32 ˚C and the relative humidity between 20% and 80% as shown in Table 1 [28].

Table 1. ASHRAE thermal guidelines for data centers [29-31].

Class	Dry-bulb temperature	Humidity range	Maximum dew point
Recommended
Class A1 and A4	18 to 27 ˚C	5.5 ˚C DP to 60% RH and 15 ˚C DP	–
Allowable
Class A1	15 to 32 ˚C	20% to 80%	17 ˚C
Class A2	10 to 35 ˚C	20% to 80%	21 ˚C
Class A3	5 to 40 ˚C	8% to 85%	24 ˚C
Class A4	5 to 45 ˚C	8% to 90%	24 ˚C

Line 247, “a correlation process” - This is probably the crux of the idea, but what exactly is done as part of the correlation process.

Additional text was inserted (new lines 393 to 400):

A correlation step must be performed in order to search for all related events among alarms. Correlation does not imply causation [48,49], which means that one alarm event does not necessarily cause another. According to the definitions, it can only be assumed that if the events are correlated, a change of one event causes a change in another node. The correlation factor in this approach is related connectivity. If any alarm events have the same parents, they are grouped together, so the node path of these events is checked in the root cause analysis only once.
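The grouping described above can be illustrated with a short sketch. It is not the authors' implementation: the node names and the `PARENT` map are hypothetical, and the only rule taken from the text is that alarm events with the same parent are grouped so that their shared node path is checked once.

```python
from collections import defaultdict

# Hypothetical node tree: child node -> parent node (illustrative names only).
PARENT = {
    "rack_1": "pdu_A",
    "rack_2": "pdu_A",
    "rack_3": "pdu_B",
    "pdu_A": "ups_1",
    "pdu_B": "ups_1",
}

def correlate(alarm_nodes, parent=PARENT):
    """Group alarmed nodes that share the same parent node."""
    groups = defaultdict(list)
    for node in alarm_nodes:
        groups[parent.get(node)].append(node)
    return dict(groups)

groups = correlate(["rack_1", "rack_2", "rack_3"])
# Each group shares one node path, so that path needs to be checked only once.
for parent_node, members in groups.items():
    print(parent_node, members)
```

In this toy input, three alarm events collapse into two groups, which is the kind of reduction the "Events after correlation" rows are meant to report.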

References [48,49] were added.

Table 2, “Filtering” - Does “filtering” mean the number of events after filtering? If so, this line should be marked "events after filtering", and if not, then it may be marked "events rejected by filtration".

In Tables 3 and 4 (new numbering), “Filtering” and “Correlation” were corrected to “Events after filtering” and “Events after correlation”.

Line 256, “connected users” - do the authors mean “users” or "child events” here?

Corrected to: “connected” (new line 408).

Line 265, “Another experiment” - A sample detailed description of the event could be useful. Events with the main reason and other alarms and warnings, giving actual real messages and description of the failure.

Text and two tables were added (new lines 419 to 432). Tables 5 and 6 give a detailed description of specific events and of “Another experiment”:

The messages are gathered by blocks (system functional parts) and data type in the CODESYS software. Event declarations can be sorted by individual blocks. The connection to the node to which the event is related can be determined in the list and during event identification. The event origin (device, node, type of event) can be determined for event variables. An example of the table with event declarations and alarm statuses for the cooling cabinets is shown in Table 5 below:

Table 5. An example of event definition and features for the cooling cabinet (CC_1).

No.	Variable	Device/ Sensor	Type	Controller	Description	Status/ Unit
1	CC_1_error	CC-A1	Digital	Wago LMI	Cooling cabinet 1 error	0-Failure, 1-No error
2	CC_1_works	CC-A1	Digital	Wago LMI	Cooling cabinet 1 works	0-It does not work, 1-It works
3	CC_1_flood	CC-A1	Digital	Wago LMI	Water spill in the cooling cabinet 1	0-No, 1-Yes

Table 6 shows an example of alarm activation areas with variables for the border regions for the cooling cabinet temperature measurement.

Table 6. Activation alarm area for the increased temperature in the cooling devices CC1 and CC2.

No.	Variable	Device/ Sensor	Type	The name of the alarm	Description (alarm activated)	Description (alarm canceled)
1	CC1_AirTemp	CC-A1	Analog	CC1_AirTemp _alarm	Alarm DC CC-A1: High temp. > 25 ˚C	Cancellation of alarm DC CC-A1: High temp. > 25 ˚C
2	CC2_AirTemp	CC-B1	Analog	CC2_AirTemp _alarm	Alarm DC CC-B1: High temp. > 25 ˚C	Cancellation of alarm DC CC-B1: High temp. > 25 ˚C
3	RcCC1_AirTemp	CC-A1	Analog	CC1_critical_ tem _alarm	Critical alarm DC CC-A1: High temp. > 30 ˚C	Canceling a critical alarm DC CC-A1: High temp. > 30 ˚C
4	RcCC2_AirTemp	CC-B1	Analog	CC2_critical_ tem _alarm	Critical alarm DC CC-B1: High temp. > 30 ˚C	Canceling a critical alarm DC CC-B1: High temp. > 30 ˚C
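As a rough illustration of how the activation areas in Table 6 could be evaluated, the sketch below emits an activation or cancellation message whenever a temperature reading crosses the 25 ˚C or 30 ˚C limit. The function name and message wording are assumptions modeled on the table; the actual CODESYS logic is not shown in the response.

```python
HIGH_TEMP_C = 25.0      # alarm limit taken from Table 6
CRITICAL_TEMP_C = 30.0  # critical alarm limit taken from Table 6

def alarm_messages(device, prev_temp, temp):
    """Return messages for every limit crossed between two consecutive readings."""
    messages = []
    for label, limit in (("Alarm", HIGH_TEMP_C), ("Critical alarm", CRITICAL_TEMP_C)):
        was_active = prev_temp > limit
        is_active = temp > limit
        if is_active and not was_active:
            # Reading rose above the limit: the alarm is activated.
            messages.append(f"{label} DC {device}: High temp. > {limit:g} ˚C")
        elif was_active and not is_active:
            # Reading fell back to or below the limit: the alarm is cancelled.
            messages.append(
                f"Cancellation of {label.lower()} DC {device}: High temp. > {limit:g} ˚C"
            )
    return messages

print(alarm_messages("CC-A1", 24.0, 26.0))
print(alarm_messages("CC-A1", 31.0, 24.0))
```

Comparing each reading with the previous one means a message is produced only on a state change, not on every sample above the limit.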

Line 275, “proposed RCA” - But where is the exact description of this method. How else can be understand this statement, but only like, that the authors want to propose the RCA method.

A new chapter 3.3 (line 245) and a description of the RCA method, which was renamed RCM (Root Cause Method), were added (lines 252 to 301).

RCM requires first the definition of the root cause (RC). The following appropriate definitions can be used:

- RC is a specific cause.

- RCs are those that are logically identified.

- RCs are those that are monitored by system administrators.

The main RCM steps are as follows:

- Data acquisition

- Graphical display of all root factors (Figure 4)

- RC identification

- Change proposal generation and implementation

Once the data is acquired, graphical display of events and root factors helps us identify the sequence of events and their connection to the circumstances caused. These events and current circumstances are displayed on the timeline (Figure 4).

Figure 4. Event cause–consequence diagram.

The events and conditions proven by facts are displayed along the full line. Graphical display of the events and root factors:

- shows the root cause based explanation and causes for events,

- helps identify those key areas that represent weak points of the system functioning,

- helps guarantee objectivity in cause identification,

- helps prove events and root factors,

- shows multiple cause situations and their mutual dependencies,

- shows chronology of events,

- shows the basis for helpful changes aimed at preventing the error reappearance in the future,

- is efficient help in further system designs and planning.

The advantages of event and root factor graphical display are:

- it provides the structure for fact, proof and known fact recording,

- it reveals the gaps in the system knowledge through the logical connections,

- it enables integration of other software tool results.

Once the root factors are identified, the process of root cause identification starts. The novelty of the RCM method is the use of the root cause diagram (Figure 4), the failure root cause diagram (Figure 5) and the node entities model (Figure 2) in both cases. The first diagram structures the sampling procedure, answering the question of why a certain root factor exists or appears. It is followed by a failure root cause diagram used to identify possible system failures. This diagram is frequently used in the design stage and performs well in the search for possible cause relations. It requires specific data with component failure probability values. The cause dependencies are given by AND and OR logic decisions or a combination of them (Figure 5).
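To make the role of the AND and OR decisions and the component failure probabilities concrete, the following sketch evaluates a small failure tree. The tree, its probabilities and the assumption that basic events are statistically independent are all hypothetical and are not taken from the manuscript.

```python
from math import prod

def gate_probability(node):
    """Probability of a failure-tree node, assuming independent basic events."""
    if "p" in node:                        # basic event with a known failure probability
        return node["p"]
    child_p = [gate_probability(c) for c in node["children"]]
    if node["gate"] == "AND":              # all inputs must fail
        return prod(child_p)
    if node["gate"] == "OR":               # at least one input fails
        return 1.0 - prod(1.0 - p for p in child_p)
    raise ValueError(f"unknown gate: {node['gate']}")

# Hypothetical tree: cooling is lost if both chillers fail OR the pump fails.
tree = {
    "gate": "OR",
    "children": [
        {"gate": "AND", "children": [{"p": 0.1}, {"p": 0.1}]},
        {"p": 0.05},
    ],
}
top = gate_probability(tree)
print(round(top, 4))
```

With these illustrative numbers the AND branch contributes 0.01, and the top event probability is 1 - 0.99 × 0.95 = 0.0595.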

With RCA, a query procedure starts with a node that has no child. If an internode exists as the parent of a node, all other children of that internode can be ignored.

Figure 5. Failure root cause diagram.

More root causes were identified due to the special rules saved with node entities: if a node is a redundant power supply option, it always has to be considered as a root cause option. A backward chaining method was used to find the root cause from the group of related events.
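A simplified reading of this backward chaining step is sketched below. It assumes a child-to-parent map and a set of nodes flagged as redundant power supply options; both are hypothetical, and the real node entities model (Figure 2) may carry more rules than shown here.

```python
# Hypothetical node tree (child -> parent) and redundant-supply flags.
PARENT = {
    "ups_1": None,
    "pdu_A": "ups_1", "pdu_B": "ups_1",
    "rack_1": "pdu_A", "rack_2": "pdu_A", "rack_3": "pdu_B",
}
REDUNDANT_SUPPLY = {"pdu_B"}

def root_causes(alarmed, parent=PARENT, redundant=REDUNDANT_SUPPLY):
    """Walk from each alarmed node towards the root and collect root cause options."""
    alarmed = set(alarmed)
    causes = set()
    for node in alarmed:
        current = node
        # Backward chaining: climb for as long as the parent is also in alarm.
        while parent.get(current) in alarmed:
            current = parent[current]
        causes.add(current)
    # Special rule: an alarmed redundant power supply is always kept as an option.
    causes |= alarmed & redundant
    return causes

print(root_causes({"rack_1", "rack_2", "pdu_A"}))
print(sorted(root_causes({"rack_3", "pdu_B", "ups_1"})))
```

In the second call the chain leads up to `ups_1`, but `pdu_B` is still reported because of the redundancy rule, which is how more than one root cause can be identified.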

Lines 289-291, “For example, if we look at servers, even when they are in the sleep mode, they consume a lot of energy. Great savings can be achieved by turning off these servers in the event that there is no need to operate in idle mode.” - This conclusion is very doubtful for me, because it is not known when the service request from this server will come.

The text is changed

New lines 454 to 456.

For example, if we look at servers, even when they are in the sleep mode, they consume a lot of energy. Great savings can be achieved by switching these servers to standby mode when there is no need for their operation.

Line 294, “can receive energy through UPSs to maintain consistent cooling even during” – For data centers, I think this idea is completely frivolous. The amount of energy needed to produce cold is very large, you can’t load such devices with backup power sources.

The sentence was deleted

Lines 316-317, “For Class A1, the temperature should be between 15 °C and 32 °C, and relative humidity between 20% and 80% [26,27].” - Once again, the same remark about such temperature and humidity ranges.

Was

Author Response File: Author Response.pdf

Reviewer 2 Report

1. The authors propose a «new method of intelligent monitoring of data center physical infrastructure». However, it is not clear what the scientific novelty of this method is. It is proposed to use the well-known problem-solving method «Root Cause Analysis» (RCA) (the authors call it the «algorithm»). The authors did not present possible directions for improving this algorithm, as well as the features of using this method to solve the monitoring problem of Data Center Physical Infrastructure under consideration.

2. The motivation of the study is not clear, and its relevance is not proven. It is necessary to provide an analysis of data on unexpected events in data centers, energy costs, an estimate of losses due to unexpected events.

3. The article is a case study. Therefore, a specific (developed by the authors) causal graph should be presented.

4. The experimental conditions are not clearly described. It is not clear how many infrastructure elements were part of the experimental equipment. During what period were the experiments conducted, and how did the external conditions change?

5. A description of the Data Center (level, number of infrastructure elements) for which the results in Table 4 were obtained must be presented.

6. A literature review has not actually been submitted. A literature review of research into the application and improvement of the RCA method needs to be presented. In addition, alternative methods for solving problems need to be analyzed, for example, Big Data methods and algorithms (Markov chains, Kalman filter, etc.)

7. The conclusions are not substantiated. For example, the authors claim that «The concept of information design ... was used». However, this «concept» is only mentioned in the conclusion. It is not proven how «The RCA method has a positive economic, environmental and operational impacts».

Overall recommendation: Reconsider after major revision (control missing in some experiments)

Author Response

Responses to reviewer 2:

A new chapter 3.3 with an explanation of the RCA (new RCM) method was added. The novelty of the proposed method was explained (lines 252 to 301).

RCM requires first the definition of the root cause (RC). The following appropriate definitions can be used:

- RC is a specific cause.

- RCs are those that are logically identified.

- RCs are those that are monitored by system administrators.

The main RCM steps are as follows:

- Data acquisition

- Graphical display of all root factors (Figure 4)

- RC identification

- Change proposal generation and implementation

Figure 4. Event cause–consequence diagram.

The events and conditions proven by facts are displayed along the full line. Graphical display of the events and root factors:

- shows the root cause based explanation and causes for events,

- helps identify those key areas that represent weak points of the system functioning,

- helps guarantee objectivity in cause identification,

- helps prove events and root factors,

- shows multiple cause situations and their mutual dependencies,

- shows chronology of events,

- shows the basis for helpful changes aimed at preventing the error reappearance in the future,

- is efficient help in further system designs and planning.

The advantages of event and root factor graphical display are:

- it provides the structure for fact, proof and known fact recording,

- it reveals the gaps in the system knowledge through the logical connections,

- it enables integration of other software tool results.

With RCA, a query procedure starts with a node that has no child. If an internode exists as the parent of a node, all other children of that internode can be ignored.

Figure 5. Failure root cause diagram.

Corrected to:

In the introduction (paragraph 1, lines 42 to 48), text was added explaining the motivation of the study:

Reference [8] was added.

Energy cost and losses: This study did not focus on a detailed valuation of the energy costs, but rather on intelligent monitoring and fast event identification in data center physical infrastructure, which also results in a low energy cost.

3. The article is a case study. Therefore, a specific (developed by the authors) causal graph should be presented.

New graphs and description were added in the article. Please see below.

Figure 4. Event cause–consequence diagram.

Figure 5. Failure root cause diagram.

Added were the types and the quantities of infrastructure elements: (lines 171 to 192).

Experimental equipment (level Tier-IV):

- 6 EMERSON chillers (300 kW of cooling capacity each), periodic autorotation of active vs standby (N+2 redundancy), integrated free cooling

- 2 EuroDiesel rotary diesel UPSs (1340 kW) (3‒5 days autonomy)

- 1 standard diesel generator (1250 kVA), which can be used for emergency power or non-IT equipment (e.g. chillers)

- 3 transformers (1 reserve), 2.5 MW each (15.00 V → 400 V, 2 separated 400 A)

- 3 static transfer switches

- 5 power switches and circuit breakers

Room 1

- 44 racks (10 kW each)

- 12 APC InRow RPs (IRP), precision cooling with 2-way valves

- 3 fans and humidity controls of 50 kW cooling capacity each

- 4 redundant power distribution lines of 400 A

Room 2

- 76 racks (20 kW each)

- 32 APC InRow RPs (IRP), precision cooling with 2-way valves

- 3 fans and humidity controls of 50 kW cooling capacity each

- 16 redundant power distribution lines of 400 A

- 300 different IT electric devices

Altogether, there were 510 elements of infrastructure used.

Experiments were conducted in the period 2016–2019 (line 356).

The external conditions were changing depending on the season – from summer to winter and back.

5. A description of the Data Center (level, number of infrastructure elements) for which the results in Table 4 were obtained must be presented.

The elements are presented in the article and outlined here above under Point 4.

A new chapter 3.4. was added. Added were also new references [39-47].

3.4. An Overview of Alternative Literature on Big Data Reducing Methods

This Chapter provides an overview of original research and literature on the subject that has been gathered through the search of the Web of Science portals.

Big data methods bring new challenges regarding the computing requirements and strategies for conducting operations management analysis. The strength of data minimizing methods is that they combine statistical and machine learning models, which makes them versatile in dealing with different types of data, but these methods suffer from the weaknesses of the underlying models [39]. In [40], to better understand processor design decisions in the context of data analytics in data centers, comprehensive evaluations were conducted using representative data analytics workloads on representative conventional multi-core and many-core processors. After a comprehensive analysis of performance, power and energy efficiency, the authors made the following observation: in contrast to the conventional wisdom of using wimpy many-core processors to improve energy efficiency, brawny multi-core processors and dynamic overclocking technologies outperform their counterparts not only in execution time but also in energy efficiency for most of the data analytics workloads in the experiments performed. According to [41], the integration and coordination of big data are required to improve the application and network performance of big data applications. While empirical evaluation and analysis of big data can be one way of observing proposed solutions, it is often impractical or difficult to apply because it is expensive, time-consuming and complex. Thus, simulation tools are the preferable option for performing cost-effective, scalable experimentation in a controllable, repeatable and configurable manner.

Markov chains are stochastic models describing a sequence of events in which the probability of each event depends only on the state attained in the previous event. In data reduction and reconstruction using Markov models, higher batch sizes reduce accuracy, as the reconstruction does not retain the information present in the original dataset. Following a hybrid active/passive approach, the paper [42] introduces a family of change-detection mechanisms that differ in the required assumptions and performance. The paper [43] considers the scenario where the number of virtual CPUs requested by each customer job may be different, and proposes a hierarchical stochastic modeling approach applicable to performance analysis under such a heterogeneous workload. In [44], the authors address the problem by proposing and evaluating a method for large-scale data centers that efficiently reduces the bandwidth and storage needed for telemetry data through real-time modeling using Markov chain based methods. Large-scale data centers are composed of thousands of servers organized in interconnected racks to offer services to users. These data centers continuously generate large amounts of telemetry data streams (e.g., hardware utilization metrics) used for multiple purposes, including resource management and real-time analytics.
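The Markov property mentioned above can be illustrated by estimating first-order transition probabilities from an observed state sequence. This is only a generic sketch: the state names and the sequence are invented, and it does not reproduce the methods of references [42-44].

```python
from collections import Counter, defaultdict

def transition_probabilities(states):
    """Estimate first-order Markov transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1          # count each observed transition
    return {
        s: {t: n / sum(c.values()) for t, n in c.items()}
        for s, c in counts.items()
    }

# Hypothetical sequence of cooling-cabinet states.
sequence = ["ok", "ok", "warn", "ok", "warn", "alarm", "ok"]
P = transition_probabilities(sequence)
for state, row in P.items():
    print(state, row)
```

Storing such a transition table instead of the full sequence is one simple way a stream of states can be summarized, at the cost of losing the exact order of the original data.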

The article [45] proposes an irregular sampling estimation method to extract the main trend of the time series, in which the Kalman filter is used to remove the noise, missing data, and outliers; then the cubic spline interpolation and average method are used to reconstruct the main trend. According to [46], transport protocols such as optical packet switching and optical burst switching allow only a one-sided view of the traffic flow in the network, which causes disjointed and uncoordinated decision-making at each node. For effective resource planning, distributed management needs to be joined with centralized management, which anticipates the system's needs and regulates the entire network using Kalman filters. In [47], for multisensor data fusion, distributed state estimation techniques that enable local processing of sensor data are the means of choice for minimizing storage. In particular, a distributed implementation of the optimal Kalman filter has recently been developed. A disadvantage of this algorithm is that the fusion center needs access to each node so as to compute a consistent state estimate, which requires full communication each time an estimate is requested.
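For readers unfamiliar with the technique, a scalar Kalman filter can be sketched in a few lines. The random-walk signal model, the variance values and the temperature readings below are assumptions chosen for illustration; the sketch is not the irregular-sampling or distributed variant discussed in [45-47].

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=0.5):
    """Scalar Kalman filter for a slowly varying signal (random-walk model)."""
    estimate, error = measurements[0], 1.0
    smoothed = [estimate]
    for z in measurements[1:]:
        error += process_var                 # predict: uncertainty grows
        gain = error / (error + meas_var)    # Kalman gain
        estimate += gain * (z - estimate)    # update with the measurement residual
        error *= 1.0 - gain                  # uncertainty after the update
        smoothed.append(estimate)
    return smoothed

# Hypothetical temperature readings in ˚C with one outlier.
readings = [24.8, 25.1, 24.9, 31.0, 25.0, 25.2]
filtered = kalman_1d(readings)
print([round(x, 2) for x in filtered])
```

The filter damps the single 31.0 ˚C spike rather than removing it; a smaller `process_var` or larger `meas_var` would damp it further.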

Corrected to: “The advanced concept of RCM design …..” (line 494)

The RCA method has a positive economic, environmental and operational impacts (Deleted)

Text was added (lines 488 to 490):

Furthermore, RCM has clear positive economic effects for data center functioning. Its fast event identification enables swift reaction to events (turning off of individual units or changing their functioning to sleep mode to save the electricity). It also increases the availability class.

Author Response File: Author Response.pdf

Reviewer 3 Report

Article : Intelligent Monitoring of Data Center Physical Infrastructure
Authors : Vojko Matko, Barbara Brezovec and Miro Milanovič
——————————————————————————————

This is an interesting paper aimed at presenting a smart method for monitoring and event management in data centers.
The latter is based on a multilevel tree node approach based on the root cause algorithm (RCA). The advantage of this method is to reduce the number of alarms/alerts by approx. 50 %

To improve this paper, it would be wise to present this method (perhaps in a more detailed way) in a dedicated section and not in the results section where we expect a comparison between this method and traditional methods.

References :
The bibliography is very correct. All the references are less than 10 years old.
27 references (out of 36) are less than 5 years old, which reflects a recent state of the art.

Author Response

Response to reviewer 3:

The paper was improved in many parts, and a more detailed description was added about the proposed method. All corrections, new figures, tables and the texts added were highlighted in blue.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

In my opinion, the changes made to the text by the authors have significantly increased the value of the manuscript.

Including a lot of detailed information in the manuscript content allows for a better understanding of the concept proposed by the authors.

All my comments regarding the previous version of the manuscript have been taken into account.

Author Response

Author's Response:

I thank the reviewer for valuable comments and suggestions on how to improve the manuscript.

Author Response File: Author Response.docx

Reviewer 2 Report

The authors have done some work on the proposals and improved the content of the paper. However, two fundamental points were left without enough explanation.

The authors provided a general description of the standard «Root Cause Analysis» (RCA) method in Section 3.3. This description does not clarify the scientific novelty of this method for this study. It is necessary to present a specific RCA scheme for the Data Center Physical Infrastructure for this case study instead of general schemes (Figs. 4 and 5). Schemes in fig. 4 and 5 do not allow readers to understand the specifics and features of Root Cause Analysis for Data Center Physical Infrastructure in the new version of the paper.

The authors showed an increase in the cost of a minute of downtime for Data Center as evidence of the relevance of the study. However, the cost of a minute of downtime can increase not only because of equipment malfunctions. Evidence of the relevance of the study may be actual data on the dynamics (increase) of the total downtime of Data Center.

Overall recommendation: Accept after minor revision (corrections to minor methodological errors and text editing)

Author Response

Author's Response:

I thank the reviewer for valuable comments and suggestions on how to improve the manuscript.

The novelty of the RCA method was further explained in the Article, please see lines 286 - 307.

Figure 5 was replaced and a new Figure 6 was added, explaining the algorithm.

I have also corrected the sentence relating to the downtime cost (line 47)

Author Response File: Author Response.docx
