Peer-Review Record

A Survey on Malleability Solutions for High-Performance Distributed Computing

Appl. Sci. 2022, 12(10), 5231; https://doi.org/10.3390/app12105231

by Jose I. Aliaga, Maribel Castillo, Sergio Iserte, Iker Martín-Álvarez* and Rafael Mayo

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 18 March 2022 / Revised: 2 May 2022 / Accepted: 13 May 2022 / Published: 22 May 2022

(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)

Round 1

Reviewer 1 Report

Summary:
The paper presents a survey on state-of-the-art process malleability solutions
targeting high-performance computing applications primarily based on the
message-passing programming paradigm and its de facto standard, MPI.
Various existing studies are covered, both on the MPI side and in the
surrounding infrastructure (e.g., the cluster resource manager), also bringing
in solutions that originally targeted fault tolerance and were later extended
for malleability.

Pros:
- Introducing better flexibility and adaptivity into high-performance
computing codes is an increasingly important topic and a comprehensive
overview of the field would be highly beneficial for the community.
- Presentation-wise, the paper is well written and easy to follow.

Cons:
- As a survey, I found the paper a little too short and incomplete. Most
importantly, it lacks any conclusive insight into the current state of the field.
One would expect that a survey paper would introduce some kind of taxonomy
of the subject area and provide a summary of trade-offs between different
approaches. While the paper makes a relatively good case motivating the need
for such an overview, it then goes on more like a quick summary of existing
approaches without any new observation.
- The title and the rest of the paper are a little inconsistent. The title
itself suggests a much wider topic; I would suggest spelling out explicitly
that the target here is HPC codes, or even going as far as mentioning MPI.
- The paper should really be more explicit about the target domain and
preferably provide a more general context. With the current title, one would
expect to see at least some discussion of how the HPC-oriented techniques
compare to the ones available in cloud environments, e.g., microservice
and/or serverless types of solutions, which are inherently malleable and
dynamic. Is there anything the HPC community can learn from those approaches
and how far would they be applicable?

Comments:
- Considering how the paper's potential target audience may try to find such a
study, the word "survey" should be added to the title.
- For the MPI domain the paper needs to discuss "MPI Sessions", a proposal that
is half-way in the MPI standard already and which has the very same goals in
sight in terms of providing a more flexible execution environment for MPI
codes.
- Given the increasing interest in the convergence of HPC and cloud
environments I think the paper should provide some discussion on Kubernetes
and how that may solve some of the issues the traditional, rigid resource
managers in HPC fail to address.
- On a similar note, the paper should probably discuss the Flux resource
manager out of Lawrence Livermore as it also aims at being flexible and
dynamic.
- The discussion on the various job types and the major components of the
surrounding infrastructure for providing malleability shouldn't be part of
the introduction, but maybe in a broader background/motivation section.
- Again, the paper needs a more complete discussion section on taxonomy and
insights from the different approaches, e.g., their functional limitations,
scalability properties, etc.
- I would also welcome a little more discussion on alternative programming
models (e.g., task-based ones) which are designed with dynamicity from the
ground up. Examples may include Legion, PaRSEC or perhaps Chapel?

Author Response

Please see the attachment for the response.

The modified version of the manuscript is also available at the following URL:

https://drive.google.com/file/d/1LpxT66IIK588cR_hCsxtLchzS4rV0Epq/view?usp=sharing

Author Response File: Author Response.pdf

Reviewer 2 Report

A State-of-the-art on Process Malleability Solutions for Cluster Systems

The paper presents a survey of process malleability for cluster systems.

The paper gives many examples of job execution on clusters with different technologies and approaches to process malleability (dynamic resource allocation at runtime, given some trade-offs and respecting some constraints).

The paper has two main drawbacks:

The first one is a constraint for accepting the manuscript. Many of the images, if not all, are under copyright, and the "listings" are extracted directly from the referenced papers.

https://library.georgetown.edu/copyright/images-publications

This is not plagiarism, but it is copyright infringement:

https://www.aur.org/uploadedFiles/Affinity_Groups/SIGs/ADVICER/Guidelines-for-the-scholarly-use-of-images.pdf

For the paper to be accepted, the authors should make sure they hold the copyright of the images they use. Otherwise, they can modify the pictures, reference them, and slightly modify the examples.

The second is the absence of an objective, systematic comparison of the different approaches. Such a section could be inserted before the conclusions. The reader would expect some tables comparing the different approaches with their pros and cons. For example, a set of parameters could be: scalability (exascale), programming model, automatic job allocation vs. human interaction, technology (internet, InfiniBand, and their impact on bandwidth and latency), overhead (with and without the malleability feature, as a percentage or as low, medium, high), the complexity of the kernels/applications (low, medium, high, where matrix multiplication is low and a complete application is high), etc.

Surveys generally include such comparisons and further explanations.

Minor comments:

Given that the authors write about exascale:

They can explain how the paper contributes to exascale. With the proposed tables, it would be evident which approaches are best for malleability, either overall or depending on the scenario. For example, Slurm currently handles job allocation on large systems and allows elastic computation; compared with DMR, which features does Slurm lack for malleability?
Are checkpoint and restart scalable?

The authors are the experts here, and they can explain the different trade-offs and which approach is more suitable for a given cluster.

Are other dimensions of the design space worth mentioning, such as malleable job allocation systems that depend on power/time vs. time-to-solution (e.g., D.A.V.I.D.E, https://eventi.cineca.it/en/hpc/davide-openpower-gpu-cluster, energy-aware job allocation)?

English (848): "For instance, in Application 2, we can see how it requests GPUs with appropriate constraints" --> For instance, in Fig. 7, we can observe how application two requests...

Author Response

Please see the attachment.

The modified version of the manuscript is also available at the following URL:

https://drive.google.com/file/d/1LpxT66IIK588cR_hCsxtLchzS4rV0Epq/view?usp=sharing

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Thank you for addressing my concerns; I appreciate the improvements to the paper. I would have liked to see MPI Sessions included, but I also understand your argument.

Reviewer 2 Report

Thanks to the authors for their effort. The authors followed the recommendations: the paper now has original images, and there is a final table, with accompanying discussion, comparing the different malleability approaches for HPC systems.
The paper can be a suitable starting point for researchers and readers who are interested in the topic and want to become familiar with it.
