Peer-Review Record

Prediction-Based Error Correction for GPU Reliability with Low Overhead

Electronics 2020, 9(11), 1849; https://doi.org/10.3390/electronics9111849

by Hyunyul Lim, Tae Hyun Kim and Sungho Kang^*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 7 September 2020 / Revised: 27 October 2020 / Accepted: 30 October 2020 / Published: 5 November 2020

(This article belongs to the Special Issue Fault-Tolerant Digital Circuits: Protection Techniques, CAD Tools and Emerging Applications)

Round 1

Reviewer 1 Report

On reviewing the second version of this paper, I find it much, much improved. Explanations are significantly clearer, figures and tables are easier to read.

Author Response

We have uploaded answers as a file.

Author Response File: Author Response.pdf

Reviewer 2 Report

The layout of the figures etc. could be improved.

For example, Figure 1 should move to page 2 where it is cited.

Don't introduce references as "In [15], ..."; rather say: "Doe et al. did blah blah [15]."

Put related work in its own section. The current section is very thin. In a journal paper, it is best to have a more self-contained related work section.

Figure 2 should say DMR/TMR.

Figure 3 has its caption on a different page other than the Figure itself. Make the figure fit on one page.

Number equations.

I would suggest a figure of merit that combines performance, energy, and area.
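As a minimal sketch of what such a combined figure of merit could look like, one option is an energy-delay-area product over overheads normalized to an unprotected baseline. The class name, field names, scheme labels, and all numbers below are hypothetical placeholders, not values from the manuscript, and the choice of a plain product is only one assumption among many possible weightings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overheads:
    """Overheads of a protection scheme, normalized to an unprotected
    baseline (1.0 means no overhead). Field names are illustrative."""
    time: float
    energy: float
    area: float


def figure_of_merit(o: Overheads) -> float:
    """Energy-delay-area product; lower is better."""
    return o.time * o.energy * o.area


# Placeholder numbers for illustration only, not results from the paper.
schemes = {
    "scheme_a": Overheads(time=2.0, energy=2.0, area=1.0),
    "scheme_b": Overheads(time=1.5, energy=1.5, area=1.25),
}

# Rank schemes from best (lowest product) to worst.
ranked = sorted(schemes, key=lambda name: figure_of_merit(schemes[name]))
```

A product treats a 10% overhead in any one dimension as equally costly; a weighted form would be needed if, say, area matters less than energy for the target system.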

The following statement is confusing: "The GPGPU-sim was the first simulated application."
It is unclear why you need GPGPU-Sim if you are using Nyami only.
Which results are coming from which? I am confused.

Explicitly list which Rodinia benchmarks you ran, and provide the results for all of them. I don't see any reason not to run the entire Rodinia suite.

Is the architecture you are modifying Fermi or similar to Fermi?
Fermi is pretty old, or in other words, ancient.
Can you justify its use or use something more modern?

Author Response

We have uploaded answers as a file.

Author Response File: Author Response.pdf

Reviewer 3 Report

This paper proposes an error detection and correction method for GPGPU architectures. The experimental results show that the proposed approach can outperform similar traditional error correction approaches in terms of the accuracy and reliability of error correction, hardware overhead, and time overhead. The approach can be used in particular to minimize computational errors in GPGPUs for HPC applications such as scientific and simulation workloads. The paper is well written and well structured. Its main contribution is technically and scientifically sound. The paper clearly discusses the problem (the unreliability of GPGPUs, particularly when processing non-graphical workloads) and justifies the motivation for the research. Challenges and the state of the art are discussed properly. I suggest the authors discuss the drawbacks of the work and the types of errors that might not be detected or corrected by the proposed approach. Also, while the authors have conducted a good performance analysis and comparison of results, I suggest extending the performance evaluation to a deeper analysis, ensuring that the relevant behavior of the approach can be measured and compared with other approaches. A more comprehensive evaluation in terms of the selected metrics, together with a discussion of further work, would help to better justify the work. Figure 3 and its subfigures need to be elaborated further to improve the quality of presentation. The proposed methodology and Figure 2 need further clarification.

Author Response

We have uploaded answers as a file.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

The authors have significantly improved the content of the paper and addressed all my comments. I am happy to recommend accepting the revised paper in its present form.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

Round 1

Reviewer 1 Report

I have some issues with the way section 3.1 is presented. I think it would need some serious rewriting to clarify the explanations of the design before being considered for publication.

The primary problem is that figure 5 seems at first to be explained twice, redundantly: the fact that there are two strategies is never clearly explained (the word ‘strategy’ doesn’t appear until line 299). The two strategies first surface as redundancy in the text: lines 282-285 are repeated almost verbatim in lines 294-298. Both sets of lines refer to figure 5, despite describing different strategies. The first explanation mentions Ex3 (which is missing from the figure, where Ex1 appears twice) and compares O2 with O3 (whereas the figure seems to compare O1 with O3).

The names of the strategies in the text (‘AIO strategy’ and ‘ACO strategy’) are not used as column headers in table 1; instead we have ‘Anticipated Corrective Output’ and ‘Anticipated Wrong Output’. Table 1 is also referenced redundantly on lines 308 & 319. And the sub-header ‘Output’ appears twice; the second should be ‘Prediction’. This makes the table difficult to understand at first (‘Output’ and ‘Failure’ together are worrying…).

Figure 4a-e suggests that after a fault is detected in one of the cores (Xo, Xr), the faulted instruction is retried in core Xh, a new core. However, the text (line 231) says ‘on the right side’, which would be Xr in the figure? Figure 4f-j and the corresponding text on line 240 agree on Xh.

For that same figure and associated text, the next instruction is fetched in core Y; does that mean that this next instruction has no redundancy, as it is executed on only one core? The paragraph on lines 256-266 discusses the probability of multiple errors, but mostly for the retry; it also mentions stuck-at faults, which would be irrelevant if the retry is in a different core than the first try (Xh vs. Xr).

It's also never explained clearly whether the Xh core is an extra piece of hardware added to the design or the reuse of another core in the parallel architecture. Figure 5 awkwardly puts the second X1 (a mislabeled X3?) in a different part of the figure than the first X1 and X2, though that might be only to match the clock cycle. The lack of extra hardware as overhead in section 4.4 suggests it is only reuse.

Also - what of the threads that were previously running on Y and Xh (if reuse) when they are preempted to deal with the fault on the pair Xo & Xr?
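To make the reading of figure 4 discussed above concrete, here is a minimal sketch of duplicate execution with a retry on a third core. It assumes that a mismatch between Xo and Xr triggers a single re-execution on Xh and that the retried result is accepted only if it agrees with one of the first two. The function names, the modelling of cores as Python callables, and the bit-flip fault are all hypothetical and are not taken from the manuscript.

```python
from typing import Callable, Tuple

# A "core" is modelled as a function that executes one add instruction.
Core = Callable[[int, int], int]


def healthy_core(a: int, b: int) -> int:
    return a + b


def faulty_core(a: int, b: int) -> int:
    # Hypothetical fault model: bit 2 of the result is flipped.
    return (a + b) ^ 0x4


def execute_with_retry(a: int, b: int,
                       x_o: Core, x_r: Core, x_h: Core) -> Tuple[int, bool]:
    """Run the instruction on X_o and X_r; on mismatch, retry on X_h.

    Returns (result, retried). Raises if the retry agrees with neither
    of the first two outputs, i.e. the error cannot be resolved.
    """
    out_o = x_o(a, b)
    out_r = x_r(a, b)
    if out_o == out_r:
        return out_o, False          # no fault detected
    out_h = x_h(a, b)                # retry on a different core
    if out_h in (out_o, out_r):
        return out_h, True           # two of three outputs agree
    raise RuntimeError("retry did not match either original output")
```

For example, `execute_with_retry(3, 4, healthy_core, faulty_core, healthy_core)` resolves the mismatch in favour of the two agreeing cores. Under this reading a stuck-at fault in Xr cannot affect the retry, which is why the distinction between Xh and Xr matters.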

First sentence of 4.3 (about parsing) is unclear to me and should be revised/rewritten more clearly.
Line 478 refers to figure 12 & DMR vs. PRECOR, yet neither method appears in the figure itself. The figure and/or the explanations should be revised.
Section 4.4 uses some MIPS core to evaluate area/overheads, as Nvidia’s designs are not public. Why not use the Nyuzi design, which is public, in SystemVerilog, and was used for results in previous sections?
Some details:

Line 222: ‘Mo’ should be italicized like all others
Figure 8: ‘instrcution’ -> ‘instruction’
Line 396: spurious space in ‘read-after- write’
Line 422/423: you mention Nyuzi, citing a paper whose title starts with the name ‘Nyami’ -> perhaps a clarification that Nyami is the name used in the paper for the architecture, and Nyuzi the name of the publicly available implementation
Figure 11: the two halves sort MIBF in opposite directions; it would be better if the order were the same in both halves.

Reviewer 2 Report

I think the authors should change the title of the paper to include the name of their technique. PRECOR: A Low-Overhead Prediction-based Error Correction for GPU Reliability.

The authors should update the abstract and include statistics about their results including the accuracy of their results, area overhead, and time overheads.

At the end of the introduction, before you describe the structure of the remainder of the paper, state the specific contributions of the paper.

It is unclear from the text what is the source of Figure 1 and whether this is author-produced or not.

The authors claim that "Data on SDC errors in real GPGPU and HPC systems have not yet been studied".

The authors should articulate that and show if they are providing results as such.

In the related work section, I think the authors should show numbers in place of vague quantification such as "significant area overhead" at the bottom of page 3.

Figure 4 is extremely hard to read. Please increase the font size and the size of the figures. I cannot read it on a printed page.

The idea of using the Nyuzi open-source GPU and the NyuziToolChain is very good and very interesting.

Please start the axes of the figures from 0 or 1, depending on the graph. Figure 10 should start from 0 and can include a zoomed-in view to show the differences that the current figure shows. The same applies to Figures 11 and 12.

In general, the paper is good and is worthy of publication.

I think the work needs to be better motivated.

There are several papers on GPU reliability. Please provide a more comprehensive related work section.

For example, check this patent from AMD, which does the same work:

SYSTEM AND METHOD FOR PROTECTING GPU MEMORY INSTRUCTIONS AGAINST FAULTS

https://patentimages.storage.googleapis.com/f8/d9/e1/d7c5a291c3f65b/US10255132.pdf

Can the authors outline how their work differs from similar existing work?

Reviewer 3 Report

The paper is clearly structured and the proposed method is rationally described. Is it possible to have a more detailed experimental setup, possibly describing the environmental conditions, including high-energy radiation, that can affect the functioning of the experimental system?

Reviewer 4 Report

This is a very interesting paper, the first I've ever read concerning fault tolerance on GPUs. The technical part is good; however, the paper needs a few corrections.

1. When reading the paper, it was not clear to me what the authors were after. First, I thought it was a purely software solution and became quite skeptical. Somewhere in the middle, I realized that this cannot be done purely in software and that you'd need hardware modifications. Then I was wondering how you could possibly do it in Nvidia/AMD graphics cards. At the end, I finally learnt what I'd been looking for. You used a simulator! And since, of course, the hardware implementation of consumer GPUs is confidential, you created your own GPU.
This is something that should be changed. The reader needs to know what you're after from the beginning of the paper.

2. What I don't quite understand is how you ensure the warp shuffling before the second execution. As I understand your approach, you have a warp and you run it twice on the same SIMD (I guess in the next clock cycle). Before the second run, you have to shuffle the threads so that all operations are executed by different units. How do you do that? How do you shuffle the in/out data of such an instruction? I don't think that Fig 5 suggests every SIMD lane has two ALUs.

3. In my opinion, the proposed architecture section is quite long and repeats several statements said before.

4. Would it be possible to have the algorithms in the legends of Figures 10 and 11 in the same order?

5. Since the compute logic is not duplicated, how is it possible that the performance overhead in Fig 12 is somewhere between 1.3 and 1.5? Is this only the overhead caused by recovering from faults?

6. Table 2: what units is the total area in?

7. Abstract: "GPU architectures lack hardware support for error detection." It should be mentioned that you mean compute logic. It immediately caught my attention, since there's ECC on HPC GPUs.

8. Introduction: One of your motivations is AI applications that should be fault tolerant. However, this is not a good example, since e.g. ANNs are error resilient. Actually, there's a lot of research on approximate ANNs which deliberately employ faulty units to reduce cost/energy/time.

9. There are two subsections 3.1

10. Several abbreviations are not explained at the first occurrence (SDC - line 54; DMR, TMR - lines 105, 106).
- Line 198 - assume to BE correct
- Line 516 - s oft
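
On point 2 above, the following is a minimal sketch of one way a second execution could place every thread on a different SIMD lane: rotate the thread-to-lane mapping by one before re-running the warp, then compare the two result vectors per thread. This is only an assumption about how such shuffling might work, not the authors' mechanism; the function names and the stuck-at-1 fault model are hypothetical.

```python
from typing import List, Optional, Tuple


def run_warp(operands: List[Tuple[int, int]],
             lane_of_thread: List[int],
             faulty_lane: Optional[int] = None) -> List[int]:
    """Execute one add per thread; a faulty lane has its result LSB stuck at 1."""
    results = []
    for thread, (a, b) in enumerate(operands):
        value = a + b
        if lane_of_thread[thread] == faulty_lane:
            value |= 0x1  # hypothetical stuck-at-1 fault on the lowest bit
        results.append(value)
    return results


def detect_mismatches(operands: List[Tuple[int, int]],
                      faulty_lane: Optional[int] = None) -> List[int]:
    """Run the warp twice with rotated lane assignment and report the
    threads whose two results disagree."""
    width = len(operands)
    identity = list(range(width))                        # thread t on lane t
    rotated = [(t + 1) % width for t in range(width)]    # thread t on lane t+1
    first = run_warp(operands, identity, faulty_lane)
    second = run_warp(operands, rotated, faulty_lane)
    return [t for t in range(width) if first[t] != second[t]]
```

With a permanent fault in one lane, the two threads that pass through that lane in either run are flagged, whereas running the warp twice with the same mapping would reproduce the same wrong value and go undetected. The sketch leaves open exactly what the comment asks: how the operands and results are physically routed to the rotated lanes.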
