The REGALE Library: A DDS Interoperability Layer for the HPC PowerStack
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents a middleware, designed within the REGALE project, to enable interoperability among power management tools in HPC infrastructures. The framework is based on standardizing communication through the Data Distribution Service (DDS), and it has been demonstrated by integrating existing tools such as EAR, COUNTDOWN, and EXAMON.
The paper's contribution is visible. However, its presentation should be improved considerably. Here are some comments in this direction.
Abstract. The sentence “The proposed framework is based on the data distribution service (DDS) [...]” is too long and should be split.
The numbering of the sections is incorrect. The introduction starts at 0 which is odd.
Section 1. Background. Include at least one introductory sentence before beginning the subsection. Provide information on why you are describing the different parts and how they relate to the proposed paper.
Section 1.1. says nothing about power management, just that there is a need for power management in HPC. Perhaps this should be merged with section 1.3.
Subsections 1.3, 1.4 and 1.5 are too fragmented. Find a way to present the concepts more harmoniously without having too many sub-sub-sections with only a few sentences in them.
Section 1.4.x subsections. The first word is sometimes capitalized and sometimes not. This is a bulleted list. Do not use sub-sub-sections.
Section 2. I expected Section 2 to have not only a long list of frameworks, but also some descriptions of the limitations of some of them in terms of compatibility/interoperability. For example, GEOPM had one of its original claims in its interoperability feature.
Section 3. Before going into the details of each component, the whole stack should be presented in a broader way. Figure 2 is too simple and it would be interesting to have a broader description of how the components interact with each other and also with the target infrastructure. It would be important to have more of a requirements-implementation link.
The tests are limited to 32 nodes. This is quite a low number of nodes compared to the infrastructure-level promises of the paper. This is particularly important as scalability may be an issue. Given that the queue used is limited to 32 nodes, adding some theoretical derivations regarding this issue to the text could at least give the reader a sense of the expected scalability. By the way, with authors from two HPC centers involved in the paper, I expected short tests on queues offering access to a still limited but larger number of resources than 32.
Table 3 is rather useless and redundant. If it is merely meant to demonstrate that the framework works, reviewers generally take the authors at their word. Something graphical would be nicer than a table with 16 almost identical rows. Think about the feature you want to show and how best to show it. For example, a time-based plot of the frequency of each core might be nicer, showing that the frequency switch is working and, of course, that your framework is actually working.
Minor issues:
- "Power Management" is sometimes capitalized and sometimes not. Try to be consistent. The same goes for "Monitor".
- Abstract. "niece" -> "niche".
- For API and code, it would probably be better to use \texttt instead of \emph.
- Section 3, first sentence. Put a comma after "In this section".
- Line 177. Please use the full names rather than the acronyms JM and NM, since it is important to describe how the proposal has been constructed.
- Line 322 extends well beyond the text boundaries.
- Listings 1 and 2. Using a listing to show something that appeared on a monitor panel (or terminal) is not the best choice; a screenshot would better convey that the output comes from the framework rather than being typed into the .tex file.
Author Response
Abstract. The sentence “The proposed framework is based on the data distribution service (DDS) [...]” is too long and should be split.
Thank you for highlighting this point. We acknowledge that the sentence in the abstract is unnecessarily long. To address this, we have split it into two sentences to ensure better clarity and structure. (All changes are highlighted in blue in the paper.)
Section 1. Background. Include at least one introductory sentence before beginning the subsection. Provide information on why you are describing the different parts and how they relate to the proposed paper.
Thank you for your suggestion. We have added an introductory sentence to provide context, explaining the relevance of the described parts and their connection to the proposed framework.
"Section 1.1. says nothing about power management, just that there is a need for power management in HPC. Perhaps this should be merged with section 1.3." and "Subsections 1.3, 1.4 and 1.5 are too fragmented. Find a way to present the concepts more harmoniously without having too many sub-sub-sections with only a few sentences in them." and "Section 1.4.x subsections. The first word is sometimes capitalized and sometimes not. This is a bulleted list. Do not use sub-sub-sections."
Thank you for pointing this out. We agree that the fragmentation in these subsections affects the flow of the paper. To improve readability, we have reorganized some of these sections and eliminated unnecessary sub-sub-sections. Furthermore, we fixed the bulleted list and removed the sub-sub-sections from Section 2.3.x (formerly 1.4.x).
Section 2. I expected Section 2 to have not only a long list of frameworks but also some descriptions of the limitations of some of them in terms of compatibility/interoperability. For example, GEOPM had one of its original claims in its interoperability feature.
We thank you for your comments. We have updated the section to describe the interoperability/compatibility characteristics of each state-of-the-art work.
Section 3. Before going into the details of each component, the whole stack should be presented in a broader way. Figure 2 is too simple and it would be interesting to have a broader description of how the components interact with each other and also with the target infrastructure. It would be important to have more of a requirements-implementation link.
We appreciate your insight regarding Section 3. We have revised the section to include a broader overview of the entire stack before delving into individual components. Additionally, Figure 2 has been enhanced.
The tests are limited to 32 nodes. This is a quite low number of nodes compared to the infrastructure-level promises of the paper. This is particularly important as scalability may be an issue. Given that the queue used is limited to 32 nodes, describing some more theoretical derivations in the text regarding this issue can at least give the reader a sense. By the way, with authors from two HPC centers involved in the paper, I was expecting short tests with access to queues capable of using a still limited but larger number of resources than 32.
Thank you for raising this important concern. We have added further motivation and explained the measures we took to compensate for the limited node count.
Table 3 is rather useless and redundant. I think that if it was a way to demonstrate that it works, I have to admit that the reviewers always believe the authors. Maybe something graphical is nicer than a table with 16 almost identical rows. Think about the feature you want to show and how it can be done. For example, a time base plot with frequency for each core might be nicer, showing that the frequency switch is working and of course that your framework is actually working.
We appreciate your constructive suggestion and agree that Table 3 could be presented more effectively. We tried a graphical plot showing the frequency changes of each core, but it was not expressive enough, so we have replaced the table with a screenshot instead. If it is still not useful, we can also remove it.
Minor issues and the numbering of section:
We appreciate your observation regarding the minor issues and the section numbering. We have renumbered the sections accordingly and fixed all the minor issues.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have very correctly set out the underlying problems when working on HPC from various points of view.
The authors give a good presentation of the problem and, throughout the text, they unpack the architecture of the proposed solution and carry out preliminary experiments with very encouraging results.
The work is very successful in the field of HPC.
As a reader, no longer as a reviewer, I look forward to the publication of more advanced results as set out in future work.
The overall work is very interesting.
Just one typo in line 283: "the x-axisthe number of nodes, and on the y-axisthe average". You have to separate "axis" from "the".
Author Response
As a reader, no longer as a reviewer, I look forward to the publication of more advanced results as set out in future work.
The overall work is very interesting.
Just one typo in line 283: "the x-axisthe number of nodes, and on the y-axisthe average". You have to separate "axis" from "the".
Thank you very much for your kind words and for finding our work interesting. We greatly appreciate your thoughtful feedback. The typo you pointed out on line 283 has been corrected: "x-axis the" and "y-axis the" are now properly separated. We look forward to sharing our future results with you.
Reviewer 3 Report
Comments and Suggestions for Authors
The REGALE paper focuses on an undoubtedly important problem related to actual HPC systems, and it is thus indeed worth publishing. Nevertheless, I have some general comments and minor ones:
- Is there an estimation of the impact of the various monitoring tools in terms of power consumption (i.e., carbon footprint)? Can you easily judge the carbon footprint of the REGALE library itself? In this respect, can the values reported in Table 4 easily be used as a basic estimation of the overall impact of the various components?
- Is there any data comparing, again in terms of energy efficiency, the REGALE library with the other tools you cite in the paper?
Minor:
- In Section 1.4 there are several subsections; I suggest using an upper-case initial letter for the first two of them, or itemizing all of them.
- Page 8: "subscriberMaster#i" should perhaps be "subscriberMaster_{i}", i.e., with a subscript i.
- Table 2, to me, should be moved, row by row, to a set of references.
- I would suggest moving the short Section 5 to the final conclusion section; to me, that is where these conclusive considerations belong.
- I would encourage the authors to improve the general appearance of the various tables; it is not always easy to distinguish the headers from the data reported.
Author Response
Is there an estimation of the impact of the various monitoring tools in terms of power consumption (i.e., carbon footprint)? Can you easily judge the carbon footprint of the REGALE library itself? In this respect, can the values reported in Table 4 easily be used as a basic estimation of the overall impact of the various components?
We thank the reviewer for this comment. We have extended the description of the results in Section 5.3, showing that using the REGALE library in conjunction with the COUNTDOWN power management library adds an overhead of 0.04% to the time-to-solution (TtS) of the tested application, on top of the 0.41% overhead induced by the COUNTDOWN library itself. However, as an effect of COUNTDOWN's power management policy, the cores' frequency is reduced during the application run, yielding a significant reduction in energy consumption and carbon footprint. A further assessment of the energy savings achieved by the COUNTDOWN library can be found in the original COUNTDOWN paper [ref. 17].
Is there any data comparing, again in terms of energy efficiency, the REGALE library with the other tools you cite in the paper?
Thank you for raising these pertinent questions. We compared the REGALE library in terms of application run-time overhead when it is used to support power management communication between some of the power management tools cited in the paper. By itself, the REGALE library does not constitute a power management tool; rather, it provides a communication layer that can be leveraged to make power management tools interoperable or to ease their development and implementation. The tests conducted show a negligible overhead. It must be noted that this is based on a simple example, and future work will focus on assessing the performance of the REGALE library in a more systematic fashion.
Minor issues
Thank you for pointing out the minor issues. We have addressed all of them, and the changes have been highlighted in blue.