Article
Peer-Review Record

Reducing the Delay for Decoding Instructions by Predicting Their Source Register Operands

Electronics 2020, 9(5), 820; https://doi.org/10.3390/electronics9050820
by Sanghyun Park, Jaeyung Jun, Changhyun Kim, Gyeong Il Min, Hun Jae Lee and Seon Wook Kim *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 13 April 2020 / Revised: 13 May 2020 / Accepted: 14 May 2020 / Published: 16 May 2020
(This article belongs to the Section Computer Science & Engineering)

Round 1

Reviewer 1 Report

Hello and congratulations on an interesting, straightforward paper!

I'm not entirely sure that your usage of "dependence" is incorrect, but I have mostly seen the use of "dependency" as in "data dependency".

In figure 1, since the "memory" stage is after the "execution" stage, I would assume it refers to the write-back operation in which the results are written back to the memory. In this context, the description on the 32-bit connection should only be "Store", not "Load/Store", even though the Memory Access Unit can also read data. Also, shouldn't the operand fetch unit in the decode stage access the memory for reading purposes? If that connection is specifically depicted for the Instruction fetch unit and memory access unit, maybe it should be depicted for the operand fetch unit as well. Finally, for figure 1, is that the original CPU pipeline or does it contain your new block? If the data dependence processing unit is new, it should be highlighted as such. If it is present in the original design but is optimized by you, then it might be useful to also highlight it in order to show where the proposed changes are made.

Line 96: "Finally, the MEM stage contains logics for accessing data memory through load and store commands." - Does the MEM stage really load data? What is that data used for? Also, logic does not have a plural.

Line 154: "Figure 5(a) shows its example" but the previous sentence talks about the original decoder scheme, not about code, so this should be rephrased.

In figure 6, wouldn't it be more useful to output two bits, in order to detect dependencies in the EX and MEM stages separately, so you know how many stalls to introduce?

In figure 7 the valid bit should also be AND-ed with the tag hit and pattern index bit.

I find it difficult to understand figure 8 and the associated explanation. It says "normalized execution cycles". Is that the execution duration in cycles divided by the original cycle duration? If that is the case, I would expect to see values less than 1 for tests where your prediction improved execution time; however, I see nothing below 1.0. Is it because the performance improvement is given by the increase in clock frequency? If that is the case, it should be clearly stated, because at first glance the figure shows a degradation of cycle count. Later edit: section 4.3 partially clears this up, but I believe a short comment would be in order here as well.

Regarding dynamic power, your conclusion is that even though the total area is larger and your clock frequency is higher, the total dynamic power is decreased. Maybe a short discussion to explain this would be in order. 

You implemented the changes in a Virtex-7 FPGA. However, I assume that all the frequency, area, and power dissipation evaluations were done for ASIC designs; this should be made clear in the text.

Maybe the related work should be placed after the introduction instead of at the end?

 

Author Response

Thank you for the valuable comment! We clarified the paper, as the reviewer suggested.

 

Hello and congratulations on an interesting, straightforward paper!

 

Q1: I'm not entirely sure that your usage of "dependence" is incorrect, but I have mostly seen the use of "dependency" as in "data dependency".

 

A1: The two terms would be interchangeable. However, we changed "dependence" to "dependency" in some familiar cases like "data dependence."

 

Q2: In figure 1, since the "memory" stage is after the "execution" stage, I would assume it refers to the write-back operation in which the results are written back to the memory. In this context, the description on the 32-bit connection should only be "Store", not "Load/Store", even though the Memory Access Unit can also read data.

 

A2: The instruction can be retired at either the EX or the MEM stage. We marked WB in Figures 1, 5, and 6. Also, we modified the related sentence (Lines 110~111) and added a sentence (Line 123) in Section 3.1.

 

Q3: Also, shouldn't the operand fetch unit in the decode stage access the memory for reading purposes? If that connection is specifically depicted for the Instruction fetch unit and memory access unit, maybe it should be depicted for the operand fetch unit as well.

 

A3: The operand fetch unit does not access the data memory, but only the instruction memory.

 

Q4: Finally, for figure 1, is that the original CPU pipeline or does it contain your new block? If the data dependence processing unit is new, it should be highlighted as such. If it is present in the original design but is optimized by you, then it might be useful to also highlight it in order to show where the proposed changes are made.

 

A4: For this study, the instruction decoder was modified, and the history table was added. Figure 1 and its caption were modified (Page 3 of 14).

 

Q5: Line 96: "Finally, the MEM stage contains logics for accessing data memory through load and store commands." - Does the MEM stage really load data? What is that data used for? Also, logic does not have a plural.

 

A5: Yes, the MEM stage accesses the data memory for both loading and storing data. We changed "logics" to "logic" (Line 122).

 

Q6: Line 154: "Figure 5(a) shows its example" but the previous sentence talks about the original decoder scheme, not about code, so this should be rephrased.

 

A6: We rephrased the sentence for a clear understanding (Lines 181~182).

 

Q7: In figure 6, wouldn't it be more useful to output two bits, in order to detect dependencies in the EX and MEM stages separately, so you know how many stalls to introduce?

 

A7: The only case in which we might need to consider stalls in both the EX and MEM stages is the execution of "ld/st [reg0+offset], reg1." However, the dependency can occur only in the offset addition operation (reg0+offset), which incurs a stall in the EX stage, not in both. Therefore, we need only a 1-bit output.
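For illustration only, here is a minimal C sketch of the reasoning above (the types, field names, and function are hypothetical and model only the decision, not the actual EISC RTL): for "ld/st [reg0+offset], reg1", only the address computation (reg0+offset) in the EX stage can depend on an in-flight result, so a single stall bit suffices.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the instruction currently in the EX stage. */
typedef struct {
    bool    writes_reg;  /* does it produce a register result? */
    uint8_t dest_reg;    /* register it writes                 */
} ex_stage_t;

/* 1-bit stall decision for "ld/st [reg0+offset], reg1": stall one cycle
 * in EX only when the base register reg0 is produced by the instruction
 * currently in EX; no separate MEM-stage stall bit is needed. */
static bool needs_ex_stall(const ex_stage_t *ex, uint8_t base_reg)
{
    return ex->writes_reg && ex->dest_reg == base_reg;
}
```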

 

Q8: In figure 7 the valid bit should also be AND-ed with the tag hit and pattern index bit.

 

A8: We corrected Figure 7 (Page 9 of 14).

 

Q9: I find it difficult to understand figure 8 and the associated explanation. It says "normalized execution cycles". Is that the execution duration in cycles divided by the original cycle duration? If that is the case, I would expect to see values less than 1 for tests where your prediction improved execution time; however, I see nothing below 1.0. Is it because the performance improvement is given by the increase in clock frequency? If that is the case, it should be clearly stated, because at first glance the figure shows a degradation of cycle count. Later edit: section 4.3 partially clears this up, but I believe a short comment would be in order here as well.

 

A9: We changed the term "normalized execution cycles" to "cycle overhead" to avoid the reader's confusion. We clarified the related sentence, Figure 8, and its caption (Lines 263~264, 266~268).
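For reference, the relationship the reviewer asks about can be sketched as follows (assuming "cycle overhead" denotes the ratio of the proposed design's cycle count to the baseline's, and using the 12.5% frequency improvement reported in Section 4.3):

```latex
\text{cycle overhead} = \frac{C_{\text{proposed}}}{C_{\text{baseline}}} \ge 1,
\qquad
\text{speedup} = \frac{C_{\text{baseline}} / f_{\text{baseline}}}{C_{\text{proposed}} / f_{\text{proposed}}}
               = \frac{f_{\text{proposed}} / f_{\text{baseline}}}{\text{cycle overhead}}
               \approx \frac{1.125}{\text{cycle overhead}}
```

A net speedup is therefore obtained whenever the cycle overhead stays below the frequency gain.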

 

Q10: Regarding dynamic power, your conclusion is that even though the total area is larger and your clock frequency is higher, the total dynamic power is decreased. Maybe a short discussion to explain this would be in order.

 

A10: We added more explanation for clear understanding (Lines 324~329), as you suggested.

 

Q11: You implemented the changes in a Virtex-7 FPGA. However, I assume that all the frequency, area, and power dissipation evaluations were done for ASIC designs; this should be made clear in the text.

 

A11: We added the phrases at Lines 252 and 257.

 

Q12: Maybe the related work should be placed after the introduction instead of at the end?

 

A12: We moved the related work to Section 2 after the introduction, as you suggested (Page 3 of 14).

Reviewer 2 Report

The authors of this paper propose to adopt a table-based way to store the dependence history and later use this information for more precisely predicting the dependence. The proposed approach is applied to the commercial EISC embedded processor with the Samsung 65nm process and shows improvement in the critical path delay and operating frequency. In general, this paper is well written; however, there are a couple of points that the authors should address:

  • The main contributions are not very clear. I think adding a block-diagram for the table-based way that explains its main functionalities and the computation flow will help the readers to understand the design.
  • I have some major concerns about how the hardware implementation and testing results are done and presented.
  • The authors did not explain why they selected this processor with this 65nm process; is this because they want to compare their work with earlier work?
  • The authors mentioned that they improved the static, dynamic power consumption, and EDP by 7.2%, 8.5%, and 13.6%; I think more explanation is needed to show why this improvement in power consumption took place and why such improvements impact performance and memory usage.
  • The authors came to their conclusion using only one processor with this technology; does this also hold for other processors?
  • It's nice that the EEMBC applications achieve a speedup with respect to the baseline EISC processor by improving the maximum operating frequency by 12.5%; I think more results should be given in terms of execution time/power vs. performance, since these are what would actually be used in applications like IoT and embedded systems.
  • Also, how is the proposed design in Figure 1 better than the earlier design, e.g., in terms of area, cost, delay, etc.? Please compare your work with the similar work that you cite.
  • Some other mistakes need to be fixed, e.g., "Table 1. Category of the EISC instructions depending on their source register operand bit-fields. {} implies the concatenation operation." Proofreading needs to be done before it is submitted again.

  • The language needs improvement in general; the figures are generally OK.

 

 

Author Response

The authors of this paper propose to adopt a table-based way to store the dependence history and later use this information for more precisely predicting the dependence. The proposed approach is applied to the commercial EISC embedded processor with the Samsung 65nm process and shows improvement in the critical path delay and operating frequency. In general, this paper is well written; however, there are a couple of points that the authors should address:

 

Q1: The main contributions are not very clear. I think adding a block-diagram for the table-based way that explains its main functionalities and the computation flow will help the readers to understand the design.

 

A1: We added example code to Figure 7 and an explanation of how it works with the table-based scheme (Lines 239~246, Figure 7 on Page 9 of 14).
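Since Figure 7 cannot be reproduced here, the following is a minimal, hypothetical C sketch of how such a table-based lookup could be organized (the entry layout, sizes, and indexing are illustrative assumptions, not the actual design): each entry covers a block of 32 consecutive instructions, and a prediction is used only when the entry is valid, the tag matches, and the indexed pattern bit is set, as discussed for Figure 7.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES  32   /* assumed number of table entries              */
#define PATTERN_BITS 32   /* one history bit per instruction in a block   */

/* Hypothetical dependence-history table entry. */
typedef struct {
    bool     valid;
    uint32_t tag;      /* upper PC bits identifying the instruction block */
    uint32_t pattern;  /* per-instruction dependency bits for the block   */
} history_entry_t;

static history_entry_t table[NUM_ENTRIES];

/* Predict whether the instruction at 'pc' has a source-register dependency.
 * The prediction is taken only when valid AND tag hit AND the indexed
 * pattern bit is set. A 4-byte instruction word is assumed for simplicity. */
static bool predict_dependency(uint32_t pc)
{
    uint32_t word  = pc >> 2;              /* instruction index            */
    uint32_t slot  = word % PATTERN_BITS;  /* position within its block    */
    uint32_t block = word / PATTERN_BITS;  /* block number                 */
    uint32_t index = block % NUM_ENTRIES;  /* which table entry to check   */
    uint32_t tag   = block / NUM_ENTRIES;  /* remaining high-order bits    */

    const history_entry_t *e = &table[index];
    return e->valid && e->tag == tag && ((e->pattern >> slot) & 1u);
}
```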

 

Q2: I have some major concerns about how the hardware implementation and testing results are done and presented. The authors did not explain why they selected this processor with this 65nm process; is this because they want to compare their work with earlier work?

 

A2: No, comparison with earlier work was not the reason. The IDEC program supported the EDA tool for this study only with the Samsung 65nm library (please see the acknowledgment). The library is a good fit for low-cost embedded system development.

 

Q3: The authors mentioned that they improved the static, dynamic power consumption, and EDP by 7.2%, 8.5%, and 13.6%; I think more explanation is needed to show why this improvement in power consumption took place and why such improvements impact performance and memory usage.

 

A3: The synthesis tool used gates with weak driving strength (Section 4.4), thus reducing power consumption. Also, we improved the operating frequency by 12.5% (Section 4.3), which reduced the execution time. Therefore, we could improve both energy consumption and EDP. We added the explanation at Lines 331~333.
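For context, a small sketch of the relationship (with P the average power, t the execution time, and E the energy consumed):

```latex
E = P \cdot t, \qquad \mathrm{EDP} = E \cdot t = P \cdot t^{2}
```

The weaker-drive gates lower P and the 12.5% higher operating frequency lowers t, and both reductions compound in E and EDP.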

 

Q4: The authors came to their conclusion using only one processor with this technology; does this also hold for other processors?

 

A4: As we discussed in the motivation section, other processors' ISAs have structural features similar to the EISC ISA. Therefore, our technique could be applied to other processors as well. We added a sentence to the conclusion section at Lines 368~370.

 

Q5: It's nice that the EEMBC applications achieve a speedup with respect to the baseline EISC processor by improving the maximum operating frequency by 12.5%; I think more results should be given in terms of execution time/power vs. performance, since these are what would actually be used in applications like IoT and embedded systems.

 

A5: The reciprocal of EDP corresponds to such a metric, i.e., performance per power. Since our EDP is lower than that of the original design in all cases, our design is a good fit for low-power IoT and embedded systems. We added the sentences at Lines 344~346.

 

Q6: Also, how is the proposed design in Figure 1 better than the earlier design, e.g., in terms of area, cost, delay, etc.? Please compare your work with the similar work that you cite.

 

A6: We completely revised the related work in Section 2 and compared it with our work.

 

Q7: Some other mistakes need to be fixed, e.g., "Table 1. Category of the EISC instructions depending on their source register operand bit-fields. {} implies the concatenation operation." Proofreading needs to be done before it is submitted again. The language needs improvement in general; the figures are generally OK.

 

A7: We thoroughly examined the paper to correct grammar errors.

Reviewer 3 Report

This paper proposes to improve the prediction of instructions’ source register operands by using a dependence-history-based table.

The following questions/comments should be addressed by the authors:

  1. Why, in the analysis of the 32-bit MIPS ISA [11], have only 31 instructions been considered? The respective processor has more instructions involving source registers.

 

  2. “The proportion of instructions that do not have a source register operand (SF) is 6.5% and 13.6% for MIPS and RISC-V, respectively.” -> shouldn't it be (SL) instead of (SF)?

 

  3. The organization of the dependence history table is not described in full detail. Please give a small code example to illustrate how you manage the recorded source register dependency.

 

  4. “However, our method interprets more source register operands than what one instruction represents; thus, it can incur unnecessary stalls and consequently would increase the total execution cycle.” – Did you measure how many unnecessary stalls are inserted?

 

  5. This phrase is not clear: “We also improved the static, dynamic power consumption, and EDP by 7.2%, 8.5%, and 13.6%, respectively, despite the implementation cost.” What does it mean “despite the implementation cost”?

 

  6. The related work section gives a brief overview of publications 20-30 years old. Is there any newer work on the subject? If not, does it mean that the subject is not of interest to the scientific community? The authors should provide more motivation for their work and include some newer references.

Author Response

This paper proposes to improve the prediction of instructions' source register operands by using a dependence-history-based table. The following questions/comments should be addressed by the authors:

 

Q1: Why, in the analysis of the 32-bit MIPS ISA [11], have only 31 instructions been considered? The respective processor has more instructions involving source registers.

 

A1: We only considered the core instruction set of 32-bit integer ISAs. We clarified the sentence (Line 155).

 

Q2: "The proportion of instructions that do not have a source register operand (SF) is 6.5% and 13.6% for MIPS and RISC-V, respectively." -> shouldn't it be (SL) instead of (SF)?

 

A2: We are sorry about the mistake. We corrected it (Line 160).

 

Q3: The organization of the dependence history table is not described in full detail. Please give a small code example to illustrate how you manage the recorded source register dependency.

 

A3: Please refer to A1 for Reviewer 2.

 

Q4: "However, our method interprets more source register operands than what one instruction represents; thus, it can incur unnecessary stalls and consequently would increase the total execution cycle." – Did you measure how many unnecessary stalls are inserted?

 

A4: Please see Figure 9, which shows the cycle increment due to the stalls.

 

Q5: This phrase is not clear: "We also improved the static, dynamic power consumption, and EDP by 7.2%, 8.5%, and 13.6%, respectively, despite the implementation cost." What does it mean "despite the implementation cost"?

 

A5: The implementation cost implies the area overhead. We clarified the sentence at Line 16.

 

Q6: The related work section gives a brief overview of publications 20-30 years old. Is there any newer work on the subject? If not, does it mean that the subject is not of interest to the scientific community? The authors should provide more motivation for their work and include some newer references.

 

A6: We completely revised the related work section (Page 3 of 14). Most of the previous works, including the state-of-the-art x86 processors (ISSCC 2020 [13]), used additional memory storage and logic for pre-decoding in the fetch stage instead of the decode stage. In this work, we take a hint from the instruction structure and predict the dependency with negligible overhead in the decode stage.

Reviewer 4 Report

  1. Related work is briefly analyzed in Section 5; the authors do not adequately contrast their work with existing approaches, in the sense that they do not highlight what is missing from each of the other proposals. The authors should clearly describe related work in more detail, contrasting the limitations of the related works as mentioned in the table.
  2. The proposed processes should be revised in a more formal pseudocode template. Moreover, the authors should include more technical details and explanations in Section 3.
  3. Numerical results in this paper are not enough to support the conclusions. The comparison to other improved schemes (state-of-the-art methods within the last 3 years) is required in this paper.
  4. Please point out some insufficiencies and limitations that need further improvement in the conclusion. Moreover, the formats of the reference list lack consistency.

Author Response

Q1: Related work is briefly analyzed in Section 5; the authors do not adequately contrast their work with existing approaches, in the sense that they do not highlight what is missing from each of the other proposals. The authors should clearly describe related work in more detail, contrasting the limitations of the related works as mentioned in the table.

 

A1: Please refer to A6 of Reviewer 3.

 

Q2: The proposed processes should be revised in a more formal pseudocode template.

 

A2: We revised the template, especially in the title (Page 1 of 14) and bibliography sections (Pages 13, 14 of 14).

 

Q3: Moreover, the authors should include more technical details and explanations in Section 3.

 

A3: We improved Section 3.2 by providing a sample code and its operation. Please refer to A1 for Reviewer 2.

 

Q4: Numerical results in this paper are not enough to support the conclusions. The comparison to other improved schemes (state-of-the-art methods within the last 3 years) is required in this paper.

 

A4: We reduced the execution time, power and energy consumption, and EDP of the commercially available processor with negligible area overhead. The performance results imply that our work is in the right design direction. Also, our work is the first to place the pre-decoding unit in the decode stage; most other related works, including the state-of-the-art x86 processors (ISSCC 2020 [13]), did so in the fetch unit with additional area overhead.

 

Q5: Please point out some insufficiencies and limitations that need further improvement in the conclusion.

 

A5: The pattern history works well with 32 statically consecutive instructions. Otherwise, the miss ratio in the history table increases, thus incurring more stalls. To solve the problem, the hardware structure would have to be modified, which would increase the hardware cost; for example, we could increase the table size and the set associativity. Instead, we plan to use a software method, i.e., tracing the executed instructions at runtime and using the profiled result for a compiler's jump optimization. This is left as our future work. We added the sentences at Lines 370~376.

 

Q6: Moreover, the formats of the reference list lack consistency.

 

A6: We made the reference list consistent (Pages 13 and 14 of 14).

Round 2

Reviewer 3 Report

I am not satisfied with the answer to my question about the functionality of the dependence history table. The provided explanations are not sufficient, as it is not clear how predictions are made.

The references section still includes many outdated works.

Author Response

We added an algorithm to explain the table-based scheme in detail and rephrased the related sentences for a clearer explanation.

 

Also, we added two more references, [14] and [15]. We are confident that we have completely reviewed all the related work in this paper.

Reviewer 4 Report

This paper has been edited and revised according to the reviewer's suggestions.

Author Response

Thank you again for your great comment in the first revision.
