Article
Peer-Review Record

A Survey on Malleability Solutions for High-Performance Distributed Computing

Appl. Sci. 2022, 12(10), 5231; https://doi.org/10.3390/app12105231
by Jose I. Aliaga, Maribel Castillo, Sergio Iserte, Iker Martín-Álvarez * and Rafael Mayo
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 18 March 2022 / Revised: 2 May 2022 / Accepted: 13 May 2022 / Published: 22 May 2022
(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)

Round 1

Reviewer 1 Report

Summary:
The paper presents a survey on state-of-the-art process malleability solutions
targeting high-performance computing applications primarily based on the
message passing programming paradigm and the de facto MPI.
Various existing studies are covered both on the MPI side and the surrounding
infrastructure (e.g., the cluster resource manager), also bringing in solutions
that originally targeted fault tolerance and were later extended for
malleability.


Pros:
- Introducing better flexibility and adaptivity into high-performance
  computing codes is an increasingly important topic and a comprehensive
  overview of the field would be highly beneficial for the community.
- Presentation-wise, the paper is well written and easy to follow.

Cons:
- As a survey, I found the paper a little too short and incomplete. Most
  importantly, it lacks any conclusive insight into the current state of the field.
  One would expect that a survey paper would introduce some kind of taxonomy
  of the subject area and provide a summary of trade-offs between different
  approaches. While the paper makes a relatively good case motivating the need
  for such an overview, it then goes on more like a quick summary of existing
  approaches without any new observation.
- The title and the rest of the paper are a little inconsistent. The title
  itself suggests a much wider topic; I would suggest spelling out explicitly
  that the target here is HPC codes, or even going as far as mentioning MPI.
- The paper should really be more explicit about the target domain and
  preferably provide a more general context. With the current title one would
  expect to see at least some discussion of how the HPC-oriented techniques
  compare to the ones available in cloud environments, e.g., microservice
  and/or serverless solutions, which are inherently malleable and
  dynamic. Is there anything the HPC community can learn from those approaches
  and how far would they be applicable?

Comments:
- Considering how the paper's potential target audience may try to find such a
  study, the word "survey" should be added to the title.
- For the MPI domain the paper needs to discuss "MPI Sessions", a proposal that
  is half-way in the MPI standard already and which has the very same goals in
  sight in terms of providing a more flexible execution environment for MPI
  codes.
- Given the increasing interest in the convergence of HPC and cloud
  environments I think the paper should provide some discussion on Kubernetes
  and how that may solve some of the issues the traditional, rigid resource
  managers in HPC fail to address.
- On a similar note, the paper should probably discuss the Flux resource
  manager out of Lawrence Livermore as it also aims at being flexible and
  dynamic.
- The discussion on the various job types and the major components of the
  surrounding infrastructure for providing malleability shouldn't be part of
  the introduction, but maybe in a broader background/motivation section.
- Again, the paper needs a more complete discussion section on taxonomy and
  insights from the different approaches, e.g., their functional limitations,
  scalability properties, etc.

- I would also welcome a little more discussion on alternative programming
  models (e.g., task based ones) which are designed with dynamicity from the
  ground up. Examples may include Legion, PaRSEC or perhaps Chapel?

 

Author Response

Please see the attachment for the response.

The modified version of the manuscript is also available at the following URL:

https://drive.google.com/file/d/1LpxT66IIK588cR_hCsxtLchzS4rV0Epq/view?usp=sharing

Author Response File: Author Response.pdf

Reviewer 2 Report

A State-of-the-art on Process Malleability Solutions for Cluster Systems

The paper presents a survey of the process malleability for cluster systems.

 

The paper gives many examples of job execution in clusters with different technologies and approaches to process malleability (dynamic resource allocation at runtime, given some trade-offs and respecting some constraints).

 

The paper has two main drawbacks:

 

The first one is a condition for accepting the manuscript. Many of the images, if not all, are under copyright, and the "listings" are taken directly from the referenced papers.

https://library.georgetown.edu/copyright/images-publications

This is not plagiarism, but it is copyright infringement:

https://www.aur.org/uploadedFiles/Affinity_Groups/SIGs/ADVICER/Guidelines-for-the-scholarly-use-of-images.pdf

 

For the paper to be accepted, the authors should make sure they hold the copyright of the images they use. Otherwise, they can modify the pictures, reference them, and slightly modify the examples.

 

The second is the absence of an objective, systematic comparison of the different approaches. Such a section should be inserted before the conclusions. The reader would expect some tables comparing the different approaches with their pros and cons. A possible set of parameters could be, for example: scalability (exascale), programming model, automatic job allocation vs. human interaction, technology (internet, InfiniBand, and their impact on bandwidth and latency), overhead (with and without the malleability feature, as a percentage or as low, medium, high), the complexity of the kernels/applications (low, medium, high, where matrix multiplication is low and a complete application is high), etc.

Authors of surveys generally provide such comparisons, together with further explanations.

 

Minor comments:

Given that the authors write about exascale:

  • They can explain how the paper contributes to exascale computing. With the proposed tables, it would become evident which approaches are best for malleability in general or for a given scenario. For example, Slurm is currently used for job allocation on large systems and allows elastic computation (DMR); which features does Slurm lack for malleability?
  • Is checkpoint and restart scalable?

The authors are the experts here, and they can explain the different trade-offs and which approach is more suitable for a given cluster.

  • Are other dimensions of the design space worth mentioning, such as malleable job allocation systems that weigh power/time against time-to-solution (e.g., D.A.V.I.D.E, https://eventi.cineca.it/en/hpc/davide-openpower-gpu-cluster, with energy-aware job allocation)?

English (848): "For instance, in Application 2, we can see how it requests GPUs with appropriate constraints" --> For instance, in Fig. 7, we can observe how application two requests...

Author Response

Please see the attachment.

The modified version of the manuscript is also available at the following URL:

https://drive.google.com/file/d/1LpxT66IIK588cR_hCsxtLchzS4rV0Epq/view?usp=sharing

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Thank you for addressing my concerns; I appreciate the improvements to the paper. I would have liked to see MPI Sessions included, but I also understand your argument.

Reviewer 2 Report

Thanks to the authors for the effort. The authors followed the recommendations: the paper now has original images, and there is a final table, with an accompanying discussion, comparing the different approaches to malleable computation for HPC systems.
The paper can be a suitable starting point for researchers and readers who are interested in the topic and want to become familiar with the subject.
