Article
Peer-Review Record

GPU Computing with Python: Performance, Energy Efficiency and Usability

Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Received: 6 December 2019 / Revised: 30 December 2019 / Accepted: 6 January 2020 / Published: 9 January 2020
(This article belongs to the Special Issue Energy-Efficient Computing on Parallel Architectures)

Round 1

Reviewer 1 Report

It is a good overview paper related to performance, energy efficiency, and usability using Python. The presentation is clear, with fine simulations. I recommend the acceptance of this paper.

Author Response

Thank you for your positive feedback on our paper.

Kind regards,
Håvard Heitlo Holm
On behalf of the authors

Reviewer 2 Report

This paper examines the performance, usability, and energy efficiency when using Python to develop HPC codes on different Graphics Processing Units. For completeness, the authors should define all acronyms before use; e.g., GPU, HPC, etc., even though they are commonly used. The paper is a modest extension of a preliminary paper which was already published in a conference, so that is really my only concern. The new paper does present a more complete set of results and should be of interest to a wide audience. Overall, the paper is well written. 

Author Response

Thank you for the positive feedback on our paper.

We have revised our paper to define all acronyms and abbreviations. Our changes are as follows:

Line 2 (Abstract): HPC -> high performance computing
Line 2 (Abstract): GPU -> graphics processing unit (GPU)
Line 14: GPU computing -> General-purpose computing using the graphical processing unit (GPU), known as GPU computing,
Line 18: CPU -> central processing unit (CPU)
Line 29: CUDA -> CUDA (Compute Unified Device Architecture)
Line 30: OpenCL -> OpenCL (Open Compute Language)
Line 32: HPC -> high performance computing (HPC)
Line 48: OpenACC~\cite{openacc} -> OpenACC~\cite{openacc} (open accelerators)
Line 89: CUDA~\cite{cuda_programming_guide} (Compute Unified Device Architecture) -> CUDA~\cite{cuda_programming_guide}
Line 95: SDK -> software development toolkit (SDK)
Line 103: OpenCL~\cite{opencl} (Open Compute Language) -> OpenCL~\cite{opencl}
Line 106: FPGAs, and DSPs -> field-programmable gate arrays (FPGAs), and digital signal processors (DSPs)
Line 107: API -> application programming interface (API)
Line 108: "(ICD loader) " deleted
Line 152: SWIG wrappers -> SWIG (Simplified Wrapper and Interface Generator) wrappers
Line 154: DGEMM from the BLAS library -> the dense general matrix multiplication (DGEMM) subroutine from the Basic Linear Algebra Subprograms (BLAS) library
Line 164: OpenCV~\cite{opencv_library} -> OpenCV~\cite{opencv_library} (open source computer vision)
Lines 284, 334, 400: GTX 780 -> GTX780
Line 379: gigaFLOPS -> gigaFLOPS\footnote{FLOPS stands for floating point operations per second}
Line 625: Added the abbreviations FPGA, ICD, DSP, and FLOPS to the list of abbreviations

The changes are also marked in red in the attached pdf.

Kind regards,
Håvard Heitlo Holm.
On behalf of the authors

Author Response File: Author Response.pdf

Reviewer 3 Report

# Intro
The paper studies the impact of performing computation on Graphics Processing Units (GPUs) using a high-level and highly general-purpose language such as Python. The authors address the effects in terms of performance, power drain / energy consumption, and productivity. The study leverages four computational kernels that are used in some common High Performance Computing (HPC) workloads.


# Strengths
The paper addresses the most relevant points of porting applications to GPUs, i.e., performance gain, energy consumption reduction, and productivity/efficiency in the porting.


# Weaknesses
The authors should reflect on, and possibly react to, the following weaknesses of the manuscript:

* It is not completely clear what contributions are provided by the paper. The authors should make the effort to answer questions such as "what is the final goal of our study?" and "which scientific/technical community are we addressing?".
This kind of content is somewhat spread around the paper, but it could be beneficial to have it clearly stated somewhere, maybe in the introduction.

* Connected to the previous point, it is not clear whether the authors would like to contribute to the HPC community by proposing optimization methodologies for GPU computing, or whether they want to improve the productivity of domain scientists using codes similar to the ones studied by the authors. These two things are not mutually exclusive, but it should be clearer which conclusions fall in the "HPC optimization" class and which conclusions contribute to "productivity enhancement".

* The manuscript is heavily populated with qualitative statements that make it sound not completely rigorous.
There are several points in the manuscript at which a scientist would expect a number or a range of values instead of words such as "significant/significantly".
E.g.,
316 [...] When developing GPU kernels, however, it becomes
317 noticeably slower to work with the compilation times of PyCUDA.
What "noticeably" means here? Minutes? Hours?
Slower respect to what?
477 certain scheme and GPU combinations result in a significant speedup for CUDA over OpenCL, but we
478 cannot conclude whether this is caused by differences in driver versions or from other factors. We are
479 therefore not able to claim that CUDA performs better than OpenCL in general. When looking at the
What "significant" means? For a single GPU user it could means one thing, while in the context of a data center with thousand of GPUs it could have completely another meaning. Also, which are the possible source of differences in the driver versions?

* There is no mention of an environment in which the code runs on multiple GPUs, or even on multiple nodes with multiple GPUs. Please clarify whether this is relevant. If yes, a clarification of which parallelization scheme is used should be included (MPI, OpenMP, mixed, etc.).


# Remarks
1. The codes used for the study are probably not recognizable as "scientific applications". With scientific applications a reader would expect a study of complex codes of hundreds of thousands of lines, with library dependencies, etc. The codes used in this manuscript look more like "computational kernels" that are relevant to some HPC communities. Please avoid referring to the kernels as applications.

2. The authors study in the same paper 7 hardware platforms, 4 computational kernels, performance, and energy consumption. It could be very useful to include a summary table to make clear to the reader what has been run where, for studying what. In my opinion this would improve the readability and flow of the paper.

3. A methodology section/paragraph/inset can help clarify the number of executions (statistical relevance), the model of the devices used for measuring the power, the compiler flags, the compiler versions, etc. This could be useful for clearing up details that are spread around the text and could also improve reproducibility.

4. line 126: If OpenCL is deprecated, why bother? Is it really relevant to the scientific community? How many OpenCL codes are producing relevant scientific results?

5. Fig. 1: please put units on the axes and a color scale to improve readability.

6. line 239: "list comprehensions" is mentioned but not really explained. Does it really add meaning?

7. lines 247-248: another example of a speculative/qualitative sentence. Did you verify/study the compilation flags set by PyCUDA?

8. It looks like the authors did not have complete control of the hw/sw systems used for the tests. Several details seem not to be completely understood, e.g., footnote 5. Could this maybe be improved?

9. In general, the captions of figures and tables are way too verbose and often contain important details that should be included in the text.

10. Table 1: The authors are reporting measurements on the scale of microseconds. What is the precision of the tool used for the measurement? What is the error on such a small interval of time?

11. lines 326-328 Nsight / NSight, please unify.

12. line 374: gigaFLOPS --> GFlops

13. lines 446-448: These results should probably be compared with the amount and the speed of the different memory technologies used by the GPUs.

14. Figure 6: What does the metric "megacells/bandwidth" express? Are you dividing the performance by the number of bytes moved to/from the memory? Please clarify this metric.

15. Figure 7: What explanation could you have for the CUDA version of the normalized mean power (middle plot, second line)? CUDA seems to imply a lower power drain; how is this possible? How do you handle the frequency of the GPU? Do you fix it, or is some dynamic frequency scaling perhaps entering into play here?

16. lines 562-565: This is interesting. Do you have a guess for the asymmetry between performance and power drain that you capture?

17. Be careful with the heat-maps of Fig. 9. I read the paper on a tablet using Acrobat Reader and the rendering was extremely blurry. Not really a big deal, but this should be checked before the camera-ready is accepted.

18. Figure 9: The heat-maps are really beautiful. However, the scales differ within the same color map. This means that the same color, e.g., dark red, can have a very different meaning across different plots. Please consider fixing the scale within the same color map.

Author Response

Dear Editor and reviewer,

Thank you for your effort in reviewing our submission and for your thoughtful comments. We have considered all of your comments and concerns, and have responded to most of them in detail below.
We have made some changes to our paper in accordance with these suggestions, as detailed below. In the attached file, the changes related to your review are marked in red, whereas the changes marked in yellow are related to another review. We have also responded to the most important individual comments.

 


Reviewer:
The paper studies the impact of performing computation on Graphics Processing Units (GPUs) using a high-level and highly general-purpose language such as Python. The authors address the effects in terms of performance, power drain / energy consumption, and productivity. The study leverages four computational kernels that are used in some common High Performance Computing (HPC) workloads.

The paper addresses the most relevant points of porting applications to GPUs, i.e., performance gain, energy consumption reduction, and productivity/efficiency in the porting.


Reply:
Thank you for your detailed feedback on our paper with many good suggestions and points to consider. We have responded to the most important comments below.

 

Reviewer:
It is not completely clear which are the contributions provided by the paper.

Reply:
We have tried to emphasize this in the introduction (page 2, lines 32--46), in which we detail our contribution.

 

Reviewer:
[...] it is not clear whether the authors would like to contribute to the HPC community by proposing optimization methodologies for GPU computing, or whether they want to improve the productivity of domain scientists using codes similar to the ones studied by the authors. These two things are not mutually exclusive, but it should be clearer which conclusions fall in the "HPC optimization" class and which conclusions contribute to "productivity enhancement".

Reply:
The results in the paper should be interesting to both domain scientists (modeling, particularly stencil-based computations) and HPC experts/scientists. We therefore see no need to limit the audience any further than the choice of journal/publication channel already does, but we are happy to specify the main target audience in the introduction if you request this.



Reviewer:
The manuscript is heavily populated with qualitative statements that make it sound not completely rigorous. There are several points in the manuscript at which a scientist would expect a number or a range of values instead of words such as "significant/significantly". E.g.,

316 [...] When developing GPU kernels, however, it becomes
317 noticeably slower to work with the compilation times of PyCUDA.

What does "noticeably" mean here? Minutes? Hours? Slower with respect to what?

477 certain scheme and GPU combinations result in a significant speedup for CUDA over OpenCL, but we
478 cannot conclude whether this is caused by differences in driver versions or from other factors. We are
479 therefore not able to claim that CUDA performs better than OpenCL in general. When looking at the

What does "significant" mean? For a single GPU user it could mean one thing, while in the context of a data center with thousands of GPUs it could have a completely different meaning. Also, what are the possible sources of the differences in driver versions?


Reply:
This discussion is in a section titled "4.1 Porting from PyOpenCL to PyCUDA", and the statement should be seen in light of the interactivity of Python. "Noticeably" here therefore means 5-10 seconds. We have added "compared to PyOpenCL, with an added compilation overhead of about 5 seconds" on line 323.

By significant, we here mean a speedup that cannot be explained by noise. We believe that the results in Figure 6 illustrate what we mean by significant.
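As a rough illustration of "not explained by noise", a measured speedup can be compared against the spread of repeated timing runs. The sketch below uses invented sample timings (they are not measurements from the paper) and a simple two-standard-error criterion:

```python
import statistics

def significant_speedup(times_a, times_b, k=2.0):
    """Return (speedup, significant): the speedup of b over a is deemed
    significant if the two mean runtimes differ by more than k times the
    combined standard error of the means."""
    mean_a, mean_b = statistics.mean(times_a), statistics.mean(times_b)
    se_a = statistics.stdev(times_a) / len(times_a) ** 0.5
    se_b = statistics.stdev(times_b) / len(times_b) ** 0.5
    speedup = mean_a / mean_b
    significant = abs(mean_a - mean_b) > k * (se_a**2 + se_b**2) ** 0.5
    return speedup, significant

# Invented example timings (seconds) for an OpenCL and a CUDA run:
opencl = [1.02, 0.98, 1.01, 0.99, 1.00]
cuda = [0.81, 0.79, 0.80, 0.82, 0.78]
speedup, sig = significant_speedup(opencl, cuda)
print(f"speedup: {speedup:.2f}x, significant: {sig}")
# -> speedup: 1.25x, significant: True
```

A real analysis would use more repetitions and a proper statistical test; this only shows the idea of weighing a speedup against run-to-run noise.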



Reviewer:
There is no mention of an environment in which the code runs on multiple GPUs or even on multiple nodes with multiple GPUs. Please clarify if this is relevant. If yes, a clarification on which parallelization scheme is used should be included (MPI, OpenMP, mixed, etc.).

Reply:
We do not run on multiple GPUs in this work.

 

Reviewer:
The codes used for the study are probably not recognizable as "scientific applications". With scientific applications a reader would expect a study of complex codes of hundreds of thousands of lines, with library dependencies, etc. The codes used in this manuscript look more like "computational kernels" that are relevant to some HPC communities. Please avoid referring to the kernels as applications.

Reply:
We argue that scientific applications should not be limited to codes of hundreds of thousands of lines, but should also include applications such as ours (which collectively comprise more than 13,000 lines of Python code, and thousands of lines of CUDA and OpenCL code).



Reviewer:
4. line 126: If OpenCL is deprecated, why bother? Is it really relevant for the scientific community? How many OpenCL codes are producing relevant scientific results?

Reply:
Apple has deprecated OpenCL in its product lines, but OpenCL still thrives and has strong software support from other vendors.


Reviewer:
10. Table 1: The authors are reporting measurements on the scale of microseconds. What is the precision of the tool used for the measurement? What is the error on such a small interval of time?

Reply:
We draw no conclusions based on differences in microseconds. The table aims to reveal order-of-magnitude differences in overhead when accessing the GPU from Python compared to C++.
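This is not the measurement setup used in the paper, but as a generic sketch, the resolution of a Python timer can be sanity-checked before trusting microsecond-scale numbers:

```python
import time

# Report the resolution of the clock behind time.perf_counter().
info = time.get_clock_info("perf_counter")
print(f"reported resolution: {info.resolution} s")

# Empirically estimate the smallest measurable interval as the smallest
# nonzero difference between two consecutive clock reads.
ticks = []
for _ in range(10000):
    t0 = time.perf_counter()
    t1 = time.perf_counter()
    if t1 > t0:
        ticks.append(t1 - t0)
min_tick = min(ticks)
print(f"smallest observed interval: {min_tick * 1e6:.3f} microseconds")
```

Intervals of the same order as the timer's tick carry a large relative error, which is why only order-of-magnitude comparisons are safe at that scale.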

 

Reviewer:
16. lines 562-565: This is interesting. Do you have a guess for the asymmetry between performance and power drain that you capture?

Reply:
Better performance can be obtained by doing the same computations with fewer instructions (the same mean power usage in less time), or by decreasing the time the GPU is idle (more power usage in less time). These two factors are examples that show that we cannot expect symmetry between performance and power drain.
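The two cases can be made concrete with toy numbers (invented for this sketch), using energy = mean power x runtime:

```python
# Invented figures for one baseline run and two hypothetical optimizations:
# A removes instructions (same mean power, less time), B removes GPU idle
# time (higher mean power, less time). Both give the same 1.25x speedup.
runs = {
    "baseline":              (100.0, 10.0),  # (mean power [W], runtime [s])
    "A: fewer instructions": (100.0, 8.0),
    "B: less idle time":     (120.0, 8.0),
}

energy_j = {name: power * seconds for name, (power, seconds) in runs.items()}
for name, joules in energy_j.items():
    print(f"{name}: {joules:.0f} J")
```

Both optimizations are equally fast, yet A cuts energy by 20% and B by only 4%, so performance gains and power/energy savings need not track each other.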


Reviewer:
17. Be careful with the heat-maps of Fig. 9. I read the paper on a tablet using Acrobat Reader and the rendering was extremely blurry. Not really a big deal, but this should be checked before the camera-ready is accepted.

Reply:
This is strange (and a bit worrying). We have not been able to reproduce this issue on any of our devices/PDF readers, but we will try to re-render the graphics if the problem persists.



Reviewer:
18. Figure 9: The heat-maps are really beautiful. However, the scales differ within the same color map. This means that the same color, e.g., dark red, can have a very different meaning across different plots. Please consider fixing the scale within the same color map.

Reply:
We have chosen to preserve the current color ranges, because if we used the same ranges across different numerical schemes, we would invite the reader to compare between schemes. The interesting aspect of the figures is the patterns (or lack of patterns) within each scheme.

Author Response File: Author Response.pdf
