Peer-Review Record

HPC Cloud Architecture to Reduce HPC Workflow Complexity in Containerized Environments

by Guohua Li 1, Joon Woo 1 and Sang Boem Lim 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Appl. Sci. 2021, 11(3), 923; https://doi.org/10.3390/app11030923
Submission received: 19 November 2020 / Revised: 13 January 2021 / Accepted: 15 January 2021 / Published: 20 January 2021
(This article belongs to the Special Issue High-Performance Computing and Supercomputing)

Round 1

Reviewer 1 Report

The study is very empirical, so, in my opinion, it is not appropriate for publication in a journal. The authors should instead present it at a conference on related subjects.

Author Response

Reviewer 1:

The study is very empirical, so, in my opinion, it is not appropriate for publication in a journal. The authors should instead present it at a conference on related subjects.

Thank you for your suggestion. The technical and experimental issues in this paper have already been discussed briefly in poster sessions and in meetings with national supercomputing centers at the International Supercomputing Conference (ISC) 2019. We wrote this paper with an emphasis on proposing an effective approach to an HPC architecture for HPC workflows. We hope that our proposed HPC cloud architecture will serve as a prototype and contribute to reducing the complexity of HPC workflows in containerized environments.

Reviewer 2 Report

The authors of this manuscript introduce a holistic architecture for running HPC workloads in the cloud by adopting containerized execution environments to be able to run long-running compute tasks without the complexity of designing complex workflows.

The paper, although not the most novel, does exhibit merit: it is easy to read, and the solution is adopted in a real-world supercomputing center. These reasons make this paper interesting, and it should be published in the journal.

However, my sole comment has to do with the aim of this paper as presented in the title. Specifically, I fail to understand how the current architecture improves containerized execution environment security and how it boosts autonomicity. In order to support such claims, one must both present novel algorithmic solutions and, in turn, validate these claims through comprehensive experiments. Towards this, I suggest changing the title and the claims in the introduction so that the scope is narrowed down to presenting a novel architecture for HPC in the cloud, alleviating the complexity in dealing with containerized environments for the novice HPC user, and dealing with metering in long-lasting HPC workloads.

Author Response

Reviewer 2: 

The authors of this manuscript introduce a holistic architecture for running HPC workloads in the cloud by adopting containerized execution environments to be able to run long-running compute tasks without the complexity of designing complex workflows.

The paper, although not the most novel, does exhibit merit: it is easy to read, and the solution is adopted in a real-world supercomputing center. These reasons make this paper interesting, and it should be published in the journal.

However, my sole comment has to do with the aim of this paper as presented in the title. Specifically, I fail to understand how the current architecture improves containerized execution environment security and how it boosts autonomicity. In order to support such claims, one must both present novel algorithmic solutions and, in turn, validate these claims through comprehensive experiments. Towards this, I suggest changing the title and the claims in the introduction so that the scope is narrowed down to presenting a novel architecture for HPC in the cloud, alleviating the complexity in dealing with containerized environments for the novice HPC user, and dealing with metering in long-lasting HPC workloads.

Author: 

Thank you for your great suggestion. We have changed the title accordingly to “HPC Cloud Architecture to Reduce HPC Workflow Complexity in Containerized Environments.” In line with the revision of the title, we have also revised the abstract (lines 9–22).

Reviewer 3 Report

Summary: The article is dedicated to high-performance computing (HPC), whose popularity has been steadily increasing in recent years. The demand for HPC service in the cloud environment is increasing to meet the resource scalability, management efficiency, and ease of use. While a number of container-based cloud solutions have been proposed in recent years, the authors point out that there are various problems, such as an isolated environment between HPC and the cloud, security issues and workload management issues. To resolve these issues, the authors define automatic and secure solutions by developing a container-based HPC cloud platform that has both image management and job management. Although the topic is unquestionably up-to-date and important, there is a list of suggestions and recommendations on how to improve the article and its overall quality.

The paper is presented in a qualitative way and is suitable for the journal; however, there is a list of potential improvements that are recommended to be implemented to improve the overall quality of the paper.

The abstract indicates what the authors are studying and for what purposes. However, more article-specific detail should be added. At the current stage, it is not quite clear exactly what was done during the study – just the topic to be addressed. This could be easily solved by extending sentences such as: “We have chosen two container solutions for our platform which meet almost same performance with the bare-metal environment” – but what are these container solutions? As for the results addressed here, the authors state “We expect our study become a successful case of research” – but what is the reason for such an assumption? Please indicate it clearly by adding an explanation of the potential success of your proposal. In addition, I would not recommend using constructions such as “Methods: ” and “Results: ”, since it would be better to incorporate them into the appropriate sentences to make the reading more fluent (unless this style is required by the journal). Otherwise, I would recommend replacing these constructions with something like “As for the method to be used, we gave the preference to ...”.

Introduction: the overall quality of the introduction is high in terms of delivery and background. It indicates the objectives of the study and addresses the results. The structure of the article is also covered.

Section II: although this section is well organised, at the very beginning, when the authors address Docker, it would be recommended to add more detail on every deficiency they mention, as is done for the last con. Otherwise, a reader without prior knowledge of Docker can neither agree nor disagree without at least a brief explanation of these flaws. Please try revisiting these points, adding some context and answering the “so what?” question. This would strengthen your conclusion.

Given that the authors state “Shifter’s workflow is shown in Figure 1 [14].”, it seems that Figure 1 is from an external source; therefore, the reference should be added to the caption of the figure.

Section 2.1.: more discussion on the strengths and weaknesses of Shifter is needed. In which cases would it be one of the most appropriate solutions and in which cases would it fail? This is also the case for the following subsections.

Section 2.2.: while the authors point out “HPC platforms that meet the following requirements [17]:” I am not really sure it is correct to assign the following points as “requirements”, i.e. “Admin experience: includes easy installation”, “Maintenance effort: can reuse third-party technology and leverage community efforts”. The words “requirements” and “can reuse” are quite controversial. Please, reword the sentence.

In addition, the literature review does not seem to cover solutions that could be considered alternatives or competitors to the solution presented. This affects the subsequent sections. The authors are invited to make clear what is already known about the topic – what solutions have been developed by other researchers, and which of their weaknesses do not apply to the proposed solution?

Figures: please, reword the headlines of the figures. At the moment every title begins with “this is a” – this part is not needed. For instance, instead of “This is a figure of NERSC Shifter workflow”, use “NERSC Shifter workflow”. It is recommended to improve the resolution of these Figures (especially the case for Figure 23, 24 and Appendix).

Section 3: the authors have summarized the requirements of both users and administrators, which I would recommend in a bulleted list rather than as plain text. It would be more reader-friendly. Also, the explanation of these requirements (at least very brief) would be beneficial. This also applies to statements such as “The Job Metadata Server is more suitable for building NoSQL databases rather than relational databases for storing metering data on resource utilization” (lines 309-310).

Otherwise, the Section 3 is strong enough and contains a list of Figures describing the proposed solution at a very easy-to-understand level.

Section 4 is rather a “to-do list” or “checklist” of the things that have already been done, but there is a lack of discussion about why the software in question was chosen and installed. There is no explanation for the authors’ choice. Perhaps a list of alternatives should be added.

Section 5, similar to Section 3, is strong enough. The authors have evaluated their solution and provided both visual proofs and textual descriptions. The evaluation method is explained and the process is also clear. However, there is a lack of evaluation in the context of other solutions provided by other authors (if any – unfortunately, this article does not allow finding that out, since the literature review does not cover alternative solutions at the needed level).

Section 6 provides both a summary and critical discussion and future works. There is no need for significant improvements.

Language: there are no major language-related issues in general, but it is strongly recommended that you re-read the text, which will allow for the detection and elimination of many minor defects, such as “can also runs containers” (Line 99), “To evaluated the performance”, “results ... is not enough” (Line 340), etc.

Overall: The article deals with an important topic that is presented in a qualitative way. However, although the practical (main) part is of high quality, the literature review lacks other studies dealing with the issue. Therefore, the authors cannot show whether and how the proposed solution is better or worse than those already existing. Otherwise, the proposed solution could be of interest to readers.

Author Response

Reviewer 3:

Summary: The article is dedicated to high-performance computing (HPC), whose popularity has been steadily increasing in recent years. The demand for HPC service in the cloud environment is increasing to meet the resource scalability, management efficiency, and ease of use. While a number of container-based cloud solutions have been proposed in recent years, the authors point out that there are various problems, such as an isolated environment between HPC and the cloud, security issues and workload management issues. To resolve these issues, the authors define automatic and secure solutions by developing a container-based HPC cloud platform that has both image management and job management. Although the topic is unquestionably up-to-date and important, there is a list of suggestions and recommendations on how to improve the article and its overall quality.

The paper is presented in a qualitative way and is suitable for the journal; however, there is a list of potential improvements that are recommended to be implemented to improve the overall quality of the paper.

The abstract indicates what the authors are studying and for what purposes. However, more article-specific detail should be added. At the current stage, it is not quite clear exactly what was done during the study – just the topic to be addressed. This could be easily solved by extending sentences such as: “We have chosen two container solutions for our platform which meet almost same performance with the bare-metal environment” – but what are these container solutions? As for the results addressed here, the authors state “We expect our study become a successful case of research” – but what is the reason for such an assumption? Please indicate it clearly by adding an explanation of the potential success of your proposal. In addition, I would not recommend using constructions such as “Methods: ” and “Results: ”, since it would be better to incorporate them into the appropriate sentences to make the reading more fluent (unless this style is required by the journal). Otherwise, I would recommend replacing these constructions with something like “As for the method to be used, we gave the preference to ...”.

Author: 

First of all, thank you for your insightful and detailed comments. In accordance with your suggestion about the abstract, we have reconstructed and revised it to exclude the “Methods…Results” construction, as highlighted in red font (lines 9–22).

In response to your questions, we now describe the problem, content, and strengths of our proposed architecture. In Section 2, we added several points that distinguish our architecture from other competing projects in more detail (lines 90–92, 99-102, 109-112, 136-141, 158, 176-179).

We have also applied your “As for the method to be used ….” expression to our abstract in the form “We propose an architecture that reduces this complexity by utilizing …”.

Introduction: the overall quality of the introduction is high in terms of delivery and background. It indicates the objectives of the study and addresses the results. The structure of the article is also covered.

Section II: although this section is well organised, at the very beginning, when the authors address Docker, it would be recommended to add more detail on every deficiency they mention, as is done for the last con. Otherwise, a reader without prior knowledge of Docker can neither agree nor disagree without at least a brief explanation of these flaws. Please try revisiting these points, adding some context and answering the “so what?” question. This would strengthen your conclusion.

Author: 

At the beginning of Section 2 (lines 78–81), we have added some context about Docker’s characteristics and advantages in red font. For the solutions mentioned in the related work in subsections 2.1 to 2.4 (lines 99–102, 109–112, 136–141, 158, 176–179), we analyze each architecture, describe its strengths and weaknesses, and reference them in designing our own architecture. This is also updated in Section 2 in red font (lines 90–92).

Given that the authors state “Shifter’s workflow is shown in Figure 1 [14].”, it seems that Figure 1 is from an external source; therefore, the reference should be added to the caption of the figure.

Author: 

Thank you very much. This has been rectified. (line 97)

Section 2.1.: more discussion on the strengths and weaknesses of Shifter is needed. In which cases would it be one of the most appropriate solutions and in which cases would it fail? This is also the case for the following subsections.

Author: 

In this subsection, we added more information about Shifter with explanations of its strengths and weaknesses (lines 99–102, 109–112). In particular, we point out the weaknesses of this architecture to emphasize how our design complements them (lines 90–92). The other related subsections in Section 2 have been updated as well (lines 99–102, 109–112, 136–141, 158, 176–179).

Section 2.2.: while the authors point out “HPC platforms that meet the following requirements [17]:” I am not really sure it is correct to assign the following points as “requirements”, i.e. “Admin experience: includes easy installation”, “Maintenance effort: can reuse third-party technology and leverage community efforts”. The words “requirements” and “can reuse” are quite controversial. Please, reword the sentence.

Author: 

We have revised “requirements” to “issues” and “can reuse” to “can apply” in red font. (lines 119, 129)

In addition, the literature review does not seem to cover solutions that could be considered alternatives or competitors to the solution presented. This affects the subsequent sections. The authors are invited to make clear what is already known about the topic – what solutions have been developed by other researchers, and which of their weaknesses do not apply to the proposed solution?

Author: 

As mentioned before, we have addressed your points at the beginning of Section 2 (lines 90–92). In the related work, we identified the shortcomings of each solution through architectural analysis and implemented solutions in our architecture to overcome them as much as possible. Furthermore, the pros and cons have been added to each subsection.

Figures: please, reword the headlines of the figures. At the moment every title begins with “this is a” – this part is not needed. For instance, instead of “This is a figure of NERSC Shifter workflow”, use “NERSC Shifter workflow”. It is recommended to improve the resolution of these Figures (especially the case for Figure 23, 24 and Appendix).

Author: 

Thank you very much. We have reworded the captions of all figures as you suggested here (lines 103, 114, 143, 160, 170, 181). We have also improved the resolution of all figures, including Figures 23, 24, and Appendix.

Section 3: the authors have summarized the requirements of both users and administrators, which I would recommend in a bulleted list rather than as plain text. It would be more reader-friendly. Also, the explanation of these requirements (at least very brief) would be beneficial. This also applies to statements such as “The Job Metadata Server is more suitable for building NoSQL databases rather than relational databases for storing metering data on resource utilization” (lines 309-310).

Author: 

In Section 3, we have changed the presentation of the requirements of both users and admins to a bulleted list (lines 185–196). We also added Table 1 (line 354) in Section 4 (which you mentioned here as lines 309–310) to describe the installed software (lines 336–338).

Otherwise, the Section 3 is strong enough and contains a list of Figures describing the proposed solution at a very easy-to-understand level.

Section 4 is rather a “to-do list” or “checklist” of the things that have already been done, but there is a lack of discussion about why the software in question was chosen and installed. There is no explanation for the authors’ choice. Perhaps a list of alternatives should be added.

Author:

As we mentioned before, we added Table 1 (line 354) to describe the installed software. We have also added the reasons why we chose this software and these versions (lines 336–338). We hold licenses for two of the packages, and other open-source software compatible with those two was used.

Section 5, similar to Section 3, is strong enough. The authors have evaluated their solution and provided both visual proofs and textual descriptions. The evaluation method is explained and the process is also clear. However, there is a lack of evaluation in the context of other solutions provided by other authors (if any – unfortunately, this article does not allow finding that out, since the literature review does not cover alternative solutions at the needed level).

Author: 

The reason we only briefly compared our evaluation with other solutions is that we studied networking performance extensively in a previous paper titled “Performance analysis of container-based networking solutions for high-performance computing cloud” [24] (line 364). We added only a reference with a short mention because that paper already provides many technical details on networking performance in a containerized environment. The present paper focuses on the architecture, and the performance evaluation is intended only to verify bare-metal-level performance in an HPC cloud environment; the technical details and prior research have therefore been omitted.

Section 6 provides both a summary and critical discussion and future works. There is no need for significant improvements.

 

Language: there are no major language-related issues in general, but it is strongly recommended that you re-read the text, which will allow for the detection and elimination of many minor defects, such as “can also runs containers” (Line 99), “To evaluated the performance”, “results ... is not enough” (Line 340), etc.

Author: 

Thank you for your kind comments. We have corrected those errors and proofread the entire revised document (lines 109, 372, 373).

Overall: The article deals with an important topic that is presented in a qualitative way. However, although the practical (main) part is of high quality, the literature review lacks other studies dealing with the issue. Therefore, the authors cannot show whether and how the proposed solution is better or worse than those already existing. Otherwise, the proposed solution could be of interest to readers.

Author: 

We would like to express our gratitude for your detailed comments. We tried our best to update our paper in accordance with your comments. They have been a great help in making our paper better. Thank you.

Reviewer 4 Report

The manuscript aims to present the cloud container-based HPC platform to “reduce HPC workflows complexity”. The current version seems to be the 3rd revision; however, there are still many open questions left.

The notion of an HPC workflow is not clearly explained. Having some background with scientific workflows and tools like Kepler, Taverna, Triana, Askalon, Moteur, etc., I’d expect to see some application patterns related to this type of workflow, represented by interconnected task flows. Apparently, the authors mean something else by the term “workflow”, probably a sequence of elementary actions to execute a single HPC application/job. This should be explained in detail. The meaning of “HPC workflow complexity” should also be described explicitly in this context.

In the Introduction section authors mention multitude of existing projects to create “HPC Cloud” solutions, however Section 2 gives a modest review of four container-based projects (Shifter, Sarus,  Charliecloud and Singularity) that are not even alternatives to the proposed solution, since Singularity is used as one of the backends in the described approach. The advantages of the proposed solution are not clearly explained, more detailed analysis and comparison to the existing solutions would be helpful. A table comparing different solutions by their specific capabilities would help. Minor typos here: NVIDA GPU (instead of NVIDIA)

Image captions: many captions mention “workflows”, but the presented diagrams look more like schematic or architectural views rather than “workflows”, as they do not depict clear sequences of steps and the intercommunication of components. However, this might be the authors’ interpretation of the term “workflow”, as I mentioned above.

Section 3 is important in the sense that the authors formulate the requirements of HPC users and administrators that are not fully satisfied in the existing systems and are claimed to be satisfied in the proposed approach. Again, a comparison table or some other systematic analysis of these requirements and their fulfillment in existing (at least mentioned in Section 2) and current system would be very helpful. I consider this list of requirements as the research question posed by this manuscript, such that all the items must be addressed by further sections. Leaping ahead, I’m missing justifications of many points in this list related to the presented system.

Section 3.1 gives an overview of the system architecture. In particular, it mentions Image management which “is a base for… workload distributing, task parallelizing, auto-scaling, and image provisioning functions”. We will search for detailed explanations of this functionality further. Not much will be said about “workload distributing, task parallelizing” in particular. “Integration packages according to the different types of jobs” are mentioned, but what are these types of jobs? Please try to be more specific and clear.

Further on, “when HPC users request the desired container image” – what are the ways to understand that a user desires a particular image? Apparently, there is an image repository with image metadata containing particular information, but this functionality is not introduced. “My Resource View .. designed to show the resource generated by each user and is used for implementing multitenancy.” – it is not really clear how multitenancy relates to the resource view.

Section 3.2 introduces distributed and parallel architecture of the Image manager. However, support of distributed and parallel applications (e.g. MPI) in the proposed system is not clear. How images correlate with parallel processes in such jobs? Does every process have a separate image? Auto-scaling scheduler is announced but not enough details are given. What are the suitable jobs to be autoscaled? Is it only master-worker type of jobs? How are the autoscale rules configured?

Section 3.3 discusses job management. The authors need to clearly distinguish and explain Image and Job management. How do they relate? This section says that ”after creating and uploading the image… user can submit job requests”. Are jobs dependent on images?

Metering data management is done with help of the real-time data collector sending requests every 10 seconds. What is the overhead and performance impact of such a polling solution?

Section 4 in general looks good enough giving detailed information on implementation details and versions of utilized software.

Section 5 presents evaluation of the platform performance. The presented test data seems to be detached from all the other previous parts of the paper as here MPI performance is presented, while MPI was only barely mentioned in Introduction and Related work. And this was not the main research question of the paper. Let’s come back to Section 3 and re-read the requirements of users and administrators. I believe, the following must be evaluated (as listed in lines 205-215): “a self-service interface that can be used easily, rapid expansion of new configuration environments, a pay-as-used model; performance similar to that of bare-metal servers; auto-provisioning that includes virtual instances and applications; workload management; on-demand billing management; resource auto-scaling; multitenancy capability; portability of applications”. MPI performance could only partially support the “performance similar to that of bare-metal servers” item.

I’m very positive and enthusiastic about the manuscript in general, but I believe it must be reconsidered with a major update.

Author Response

The manuscript aims to present the cloud container-based HPC platform to “reduce HPC workflows complexity”. The current version seems to be the 3rd revision; however, there are still many open questions left.

The notion of an HPC workflow is not clearly explained. Having some background with scientific workflows and tools like Kepler, Taverna, Triana, Askalon, Moteur, etc., I’d expect to see some application patterns related to this type of workflow, represented by interconnected task flows. Apparently, the authors mean something else by the term “workflow”, probably a sequence of elementary actions to execute a single HPC application/job. This should be explained in detail. The meaning of “HPC workflow complexity” should also be described explicitly in this context.

Thank you for your comments. We have defined the HPC workflow as the flow of tasks that need to be executed to compute on HPC resources. In a containerized environment, these tasks can be an image, a template, and a job (container), and the diversity of these tasks increases the complexity of the infrastructure to be implemented (revised and added in lines 61–69, 194–222). As far as we can tell, Kepler, Taverna, Triana, Askalon, and Moteur are tools used to design workflows for visualization. Our proposed architecture is not designed with a visual workflow-design tool; rather, it is designed to construct actual infrastructure. Therefore, we do not believe that the tools you have suggested fit our subject; the confusion arose because we had not initially clarified the definition of the HPC workflow, as you pointed out.

In the Introduction section authors mention multitude of existing projects to create “HPC Cloud” solutions, however Section 2 gives a modest review of four container-based projects (Shifter, Sarus,  Charliecloud and Singularity) that are not even alternatives to the proposed solution, since Singularity is used as one of the backends in the described approach. The advantages of the proposed solution are not clearly explained, more detailed analysis and comparison to the existing solutions would be helpful. A table comparing different solutions by their specific capabilities would help.

Minor typos here: NVIDA GPU (instead of NVIDIA)

Thank you for your comments. Accordingly, we have restructured and revised the ‘Introduction’ (lines 61–69) and ‘Related Work’ sections (lines 84–89, 194–222). In the ‘Introduction’ section, we have explained the background in the following progression: ‘Traditional HPC’ -> ‘HPC Cloud’ -> ‘Container-based HPC Cloud’ (lines 46–60). Our final goal is to reach a ‘Container-based HPC Cloud’; therefore, we have focused on the architectural perspective and reviewed four projects on this part (lines 61–69). In the ‘Related Work’ section, we first explain container solutions (subsection 2.1) (Docker, Singularity, and Charliecloud) and subsequently analyze the architecture of the HPC workflows of four projects (subsection 2.2) (Shifter, Sarus, EASEY, and JEDI). We have explained the advantages of our proposed solution in comparison with the requirements listed in Table 1 (subsection 2.3).

‘NVIDA’ was revised to ‘NVIDIA’. (line 152) 

 

Image captions: many captions mention “workflows”, but the presented diagrams look more like schematic or architectural views rather than “workflows”, as they do not depict clear sequences of steps and the intercommunication of components. However, this might be the authors’ interpretation of the term “workflow”, as I mentioned above.

Thank you for your comment. Please check the response to the first question as we believe we have provided an appropriate response to your comment therein.

Section 3 is important in the sense that the authors formulate the requirements of HPC users and administrators that are not fully satisfied in the existing systems and are claimed to be satisfied in the proposed approach. Again, a comparison table or some other systematic analysis of these requirements and their fulfillment in existing (at least mentioned in Section 2) and current system would be very helpful. I consider this list of requirements as the research question posed by this manuscript, such that all the items must be addressed by further sections. Leaping ahead, I’m missing justifications of many points in this list related to the presented system.

Thank you for your comments. As you suggested, we have listed 10 requirements in subsection 2.3 (line 194). We have stated that we used the requirements as a basis for designing our proposed architecture. We have emphasized this point in Section 2; thus, we did not mention it in Section 3.

Section 3.1 gives an overview of the system architecture. In particular, it mentions Image management which “is a base for… workload distributing, task parallelizing, auto-scaling, and image provisioning functions”. We will search for detailed explanations of this functionality further. Not much will be said about “workload distributing, task parallelizing” in particular. “Integration packages according to the different types of jobs” are mentioned, but what are these types of jobs? Please try to be more specific and clear.

Thank you for your comments. The image management functions you mentioned in this question are described in detail in subsection 3.2; therefore, we only added brief explanations in subsection 3.1 (lines 248–250). Regarding the types of jobs you mentioned, we added explanations of jobs in a containerized environment, which include information on both the container and the application (lines 239–241). The factor that determines these integration packages is the type of container workload manager (lines 248–252). The jobs in “integration packages according to the different types of jobs” refer to container and application tasks, because different container solutions need different container workload managers. This was revised accordingly (lines 248–252).

Further on, “when HPC users request the desired container image” – what are the ways to understand that a user desires a particular image? Apparently, there is an image repository with image metadata containing particular information, but this functionality is not introduced. “My Resource View .. designed to show the resource generated by each user and is used for implementing multitenancy.” – it is not really clear how multitenancy relates to the resource view.

Thank you for your comments. The relationship between template, image, and job has been added to this section. All verified templates are automatic installation and configuration scripts covering versions of the OS, libraries, compilers, and applications. All container images are constructed from verified templates, and all jobs (container plus application) are executed from these constructed images (lines 239–241). “When HPC users request the desired container image” means that users request the verified image (including information on OS, library, compiler, and application) needed to execute a job (container and application). For instance, if a user wants to execute an MPI job, they first check the image list. If a matching image exists (with the required OS, library, compiler, and application), the containers can be executed; if it does not exist, the user then checks the template list (added in lines 265–268).
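As an illustration of this template -> image -> job flow, the following minimal Python sketch shows the decision path a job request might take. All names, fields, and the example template are hypothetical and are not part of the platform's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Spec:
    """Hypothetical description of what a job needs: OS, library, compiler, application."""
    os: str
    library: str
    compiler: str
    application: str


def find_image(images: List[dict], spec: Spec) -> Optional[dict]:
    """Return a verified image matching the requested spec, if one exists."""
    return next((img for img in images if img["spec"] == spec), None)


def find_template(templates: List[dict], spec: Spec) -> Optional[dict]:
    """Return a verified template (installation/configuration script) for the spec."""
    return next((tpl for tpl in templates if tpl["spec"] == spec), None)


def submit_job(images: List[dict], templates: List[dict], spec: Spec) -> str:
    """Check the image list first; otherwise fall back to the template list."""
    image = find_image(images, spec)
    if image is None:
        template = find_template(templates, spec)
        if template is None:
            return "no verified template available: request a new one"
        image = {"spec": spec, "built_from": template["name"]}  # image built from template
        images.append(image)
    return f"container job submitted on image for {spec.application}"


# Example: an MPI job request that triggers an image build from a verified template.
templates = [{"name": "centos7-openmpi",
              "spec": Spec("CentOS 7", "OpenMPI 3.1", "GCC 8", "MPI benchmark")}]
images: List[dict] = []
print(submit_job(images, templates, Spec("CentOS 7", "OpenMPI 3.1", "GCC 8", "MPI benchmark")))
```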

For multitenancy, we have added a detailed explanation in subsection 5.5 (lines 511–520, with Figure 20). To share the resource pool among users while keeping each user’s services isolated, we designed the resource view to report usage statistics for each user’s resources (lines 275–276).

 

Section 3.2 introduces distributed and parallel architecture of the Image manager. However, support of distributed and parallel applications (e.g. MPI) in the proposed system is not clear. How images correlate with parallel processes in such jobs? Does every process have a separate image? Auto-scaling scheduler is announced but not enough details are given. What are the suitable jobs to be autoscaled? Is it only master-worker type of jobs? How are the autoscale rules configured?

Thank you for your questions. Subsection 3.2 introduces image management and does not pertain to MPI jobs. Here, the distributed and parallel architecture is designed to reduce the workload of the Image Manager, which constructs images; it does not concern running MPI inside containers (lines 296–298).

We have added further information about the auto-scaling scheduler of the Image Manager (lines 319–359). This auto-scaling scheduler is for image construction, not for jobs. We have added a flowchart (Figure 12), Equations (1) and (2), and Table 2 to describe the variables in the equations and to explain how the rules are configured.

 

Section 3.3 discusses job management. The authors need to clearly distinguish and explain Image and Job management. How do they relate? This section says that ”after creating and uploading the image… user can submit job requests”. Are jobs dependent on images?

Thank you for your questions. Jobs depend on images, and images in turn depend on templates (lines 239–241, 265–268).

 

Metering data management is done with help of the real-time data collector sending requests every 10 seconds. What is the overhead and performance impact of such a polling solution?

Thank you for your question. Using sshpass to send a request every 10 s introduces overhead on both the metering node and the compute nodes. To improve performance, we could collect logs using another communication protocol, or install a log collection agent on each compute node that transmits data to the metering node. This will be applied in the future (added in lines 387–391).
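To make the overhead concern concrete, the following is a minimal sketch of such a 10-second polling collector, assuming sshpass/ssh are available and that a command such as "docker stats --no-stream" reports per-container usage on each compute node. The node names, password file path, and remote command are illustrative placeholders, not the platform's actual implementation.

```python
import subprocess
import time

COMPUTE_NODES = ["node01", "node02"]          # hypothetical compute nodes
REMOTE_CMD = ("docker stats --no-stream "
              "--format '{{.Name}} {{.CPUPerc}} {{.MemUsage}}'")
POLL_INTERVAL = 10                            # seconds, as described in the paper


def poll_node(node: str) -> str:
    """Collect one metering sample from a compute node over ssh (via sshpass)."""
    result = subprocess.run(
        ["sshpass", "-f", "/etc/metering/passfile",   # placeholder password file
         "ssh", f"metering@{node}", REMOTE_CMD],
        capture_output=True, text=True, timeout=POLL_INTERVAL,
    )
    return result.stdout


if __name__ == "__main__":
    while True:
        for node in COMPUTE_NODES:
            sample = poll_node(node)
            # a real collector would store `sample` in the job metadata (NoSQL) store
            print(node, sample.strip())
        time.sleep(POLL_INTERVAL)
```

Each polling round opens one ssh session per compute node, which is where the overhead on both sides comes from; a push-based agent on the compute nodes would avoid the repeated session setup.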

 

Section 4 in general looks good enough giving detailed information on implementation details and versions of utilized software.

Section 5 presents evaluation of the platform performance. The presented test data seems to be detached from all the other previous parts of the paper as here MPI performance is presented, while MPI was only barely mentioned in Introduction and Related work. And this was not the main research question of the paper. Let’s come back to Section 3 and re-read the requirements of users and administrators. I believe, the following must be evaluated (as listed in lines 205-215): “a self-service interface that can be used easily, rapid expansion of new configuration environments, a pay-as-used model; performance similar to that of bare-metal servers; auto-provisioning that includes virtual instances and applications; workload management; on-demand billing management; resource auto-scaling; multitenancy capability; portability of applications”. MPI performance could only partially support the “performance similar to that of bare-metal servers” item.

Thank you for your comment. Based on your comments, we restructured and revised the evaluation chapter (lines 432–531) into subsections covering essential attributes such as on-demand self-service, rapid elasticity and scalability, auto-provisioning, workload management, multitenancy, portability of applications, and performance. We revised the “MPI performance” part to be one of these evaluation attributes.

 

I’m very positive and enthusiastic about the manuscript in general, but I believe it must be reconsidered with a major update.

We want to express our gratitude for your detailed review. We have revised the manuscript in accordance with your comments. Thank you.

 

Round 2

Reviewer 1 Report

My opinion is similar to that of the former report, since the basic article structure did not change and probably cannot change in future revisions, given its main objective and design.

Author Response

My opinion is similar to that of the former report, since the basic article structure did not change and probably cannot change in future revisions, given its main objective and design.

 

Author: 

Thank you for your response.  

As noted in your former report, this paper includes empirical work, but its contribution is expected to lie not only in implementing the proposed architecture but also in its research value as a solution that reduces the complexity of HPC workflows in the HPC cloud field.

For this reason, we did not modify the overall structure, but we have changed the title to “HPC Cloud Architecture to Reduce HPC Workflow Complexity in Containerized Environments”. We have also revised the abstract, introduction, and related work to emphasize the architecture-oriented perspective.

We hope that this paper will provide not only an evaluation of serviceability but also research value in the HPC and supercomputing fields; this is why we submitted our paper to this journal.

Reviewer 4 Report

I thank the authors for addressing my comments in a short time. Most of the questions were answered and clarified in the text. Although there is still room for improvement, I believe the manuscript has been significantly improved and is in better shape to be published.

Please re-check the newly added text for typos (I noticed at least one in section name "5.7 Perforamnce...") and consistency: Section 5.2 claims that "to solve this problem, we built Docker Private Registry... As a result, without Docker Private Registry, it took 122 seconds, 20 seconds faster than with Docker Private Registry case." Does this mean that your solution with the Docker Private Registry did not help, or the paragraph should be rephrased to claim the opposite? Please check and fix this.

 

Author Response

I thank the authors for addressing my comments in a short time. Most of the questions were answered and clarified in the text. Although there is still room for improvement, I believe the manuscript has been significantly improved and is in better shape to be published.

Please re-check the newly added text for typos (I noticed at least one in section name "5.7 Perforamnce...") and consistency: Section 5.2 claims that "to solve this problem, we built Docker Private Registry... As a result, without Docker Private Registry, it took 122 seconds, 20 seconds faster than with Docker Private Registry case." Does this mean that your solution with the Docker Private Registry did not help, or the paragraph should be rephrased to claim the opposite? Please check and fix this.

We want to express our gratitude for your good comments, which have made our paper better.

“5.7 Performance” was fixed.

“As a result, with Docker Private Registry, it took 122 seconds, 20 seconds faster than without Docker Private Registry case.” This is what we intended to say; there was a mistake in the original claim. We fixed it in lines 471–472.
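For completeness, the following is a hedged sketch of how such a pull-time comparison could be measured; the registry host and image names are placeholders, and this is not the measurement script used in the paper.

```python
import subprocess
import time


def timed_pull(image_ref: str) -> float:
    """Remove any local copy, pull the image, and return the elapsed seconds."""
    subprocess.run(["docker", "rmi", "-f", image_ref], capture_output=True)  # drop cached copy
    start = time.monotonic()
    subprocess.run(["docker", "pull", image_ref], check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    # For a fair comparison, the same image should be available in both registries.
    t_private = timed_pull("registry.local:5000/hpc/app:latest")  # private registry (placeholder)
    t_public = timed_pull("example/app:latest")                   # public registry (placeholder)
    print(f"private registry: {t_private:.1f}s, public registry: {t_public:.1f}s")
```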

Thank you again for your good comments.

Round 3

Reviewer 1 Report

How is the infinite summation in the definitions of Eqs. 1–2 addressed, taking into account that the processes are finite in practice? Some “ad hoc” comments would be useful.

Author Response

Reviewer 1:

How is the infinite summation in the definitions of Eqs. 1–2 addressed, taking into account that the processes are finite in practice? Some “ad hoc” comments would be useful.

Thank you for your good question. There are some limitations when we apply these equations to an actual cluster environment; because of these limitations, N and n take finite (maximum) values in the equations. We explained this by adding one paragraph in lines 359–367.
