Embedded Streaming Hardware Accelerators Interconnect Architectures and Latency Evaluation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Manuscript Title: Embedded Streaming Hardware Accelerators Interconnect Architectures and Latency Evaluation
The purpose of this study was to explore and evaluate several architectural models for integrating an eSAC into a RISC-V-based SoC design. The newly released MicroBlaze-V was employed as the CPU. Different DMA data organizations were evaluated, and resource utilization and runtime latency metrics were discussed. The Tightly-coupled architecture model excels in resource utilization at the expense of transfer latency and silicon utilization efficiency. Although its per-frame latency was lower than that of the nPmBD scenarios, the total latency was greater, which highlights the sequential nature of the software execution. Given the 44-clock-cycle latency of the eSAC, this latency was masked for approximately 50% of the time in the Tightly-coupled architecture. Because of the overhead introduced by control transactions from the CPU, the Protocol Adapter model showed the worst latency results: the eSAC had the longest idle time, waiting for input for approximately 200% to a maximum of 300% of its own input-to-output latency. The most efficient usage of the eSAC was observed in the DMA 1P1BD scenario, where transfers occurred continuously with no latency in between, forcing the DMA to stall between the end of output on MM2S and the beginning of input from S2MM.
Please find some comments that might strengthen the manuscript as follows:
- It is recommended that some quantitative results be included in the abstract section to highlight the proposed contribution.
- Please discuss the synchronization mechanism in the proposed architecture.
- There are several recent publications on hardware accelerators that need to be added and compared, such as [Ref. 1] and [Ref. 2]:
[Ref. 1] "A Survey on Neural Network Hardware Accelerators," IEEE Transactions on Artificial Intelligence, vol. 5, no. 8, pp. 3801-3822, Aug. 2024, doi: 10.1109/TAI.2024.3377147.
[Ref. 2] "High-speed emerging memories for AI hardware accelerators," Nature Reviews Electrical Engineering, vol. 1, no. 1, pp. 24-34, 2024.
- The authors need to compare the proposed architecture with existing methods in terms of FPGA resource utilization.
- Please find some corrections as follows:
On page 4, “This requirements must be maintained during the design phase.” could be written as “These requirements must be maintained during the design phase.”
On page 16, “Different DMA data organization were evaluated. Resource utilization and runtime latency metrics were discussed.” could be written as “Different DMA data organizations were evaluated. Resource utilization and runtime latency metrics were discussed.”
Comments on the Quality of English Language
Author Response
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted in blue in the re-submitted version.
- It is recommended that some quantitative results be included in the abstract section to highlight the proposed contribution.
The following text was added to the abstract:
When comparing the tightly-coupled architecture with the one including the DMA, the experiments in this paper show an almost 3x decrease in frame latency when using the DMA. Nevertheless, this comes at the price of an increase in FPGA resource utilization as follows: LUT (2.5x), LUTRAM (3x), FF (3.4x) and BRAM (1.2x).
- Please discuss the synchronization mechanism in the proposed architecture.
The following text was added to Section 3, Overview of AXI-Stream:
AXI transactions use a standard synchronization method based on ready/valid handshakes, which manage each data transfer on any channel. In this study, FIFO buffers are also used as an additional synchronization mechanism to support ongoing transactions. While Clock Domain Crossing (CDC) techniques are typically required to synchronize signals between different clock domains, they are not used in the current study, as all components operate within a single clock domain.
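The ready/valid rule described above can be illustrated with a minimal Python sketch (our own illustration, not code from the manuscript): the sender keeps valid asserted and holds its data stable until the receiver asserts ready, and a beat transfers only on a cycle where both are high.

```python
from collections import deque

def stream_transfer(data, ready_trace):
    """Model an AXI-Stream sender holding valid and data until the
    receiver asserts ready; one beat transfers per handshake cycle."""
    pending = deque(data)        # sender keeps valid high while data remains
    received = []
    for ready in ready_trace:
        valid = bool(pending)
        if valid and ready:      # handshake completes only when both are high
            received.append(pending.popleft())
        # if ready is low, the sender must hold its data stable (stall cycle)
    return received

# Three beats delivered over five cycles; cycles 2 and 3 are back-pressure stalls.
out = stream_transfer([10, 20, 30], ready_trace=[1, 0, 0, 1, 1])
print(out)  # [10, 20, 30]
```

Note that no data is ever dropped: a deasserted ready simply stretches the transfer over more cycles, which is the same back-pressure mechanism the FIFO buffers in the study absorb.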
- There are several recent publications that need to be added/compared for hardware accelerators such as [Ref. 1] and [Ref. 2].
The following text was added to Section 2, Related Work, for [Ref. 1]:
The paper [14] presents a comprehensive review of current hardware solutions designed to accelerate neural network computations. The methodology involves analyzing and comparing different hardware platforms, such as FPGAs, ASICs, GPUs, and CPUs, used for accelerating various types of neural networks. The study highlights major design challenges such as power consumption, area, speed, throughput, and resource efficiency. Through an extensive survey of recent implementations and experimental results, the authors identify trade-offs in architecture choices, dataflows, memory hierarchy, and precision.
We are unable to provide an opinion on [Ref. 2], since we do not have access to the full paper.
- The authors need to compare the proposed architecture and existing method in terms of FPGA resource utilization.
In the introduction, we mention that our study focuses on communication infrastructure and data movement between accelerators, a topic not covered in recent literature.
- Please find some corrections as follows.
The corrections were performed, or the text was replaced accordingly.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper presents a comprehensive exploration of integrating an embedded Streaming Hardware Accelerator (eSAC) with a CPU core in a System-on-Chip (SoC) design, focusing on the AXI-Stream protocol. The study evaluates three architectures (Tightly-coupled Streaming, Protocol Adapter FIFO, and DMA Streaming) and provides detailed insights into their performance, resource utilization, and latency. The work is well-structured, methodologically sound, and offers valuable contributions to the field of hardware acceleration and SoC design. However, there are areas where the paper could be improved for clarity, depth, and broader impact.
- The abstract is clear and concise but could briefly mention the key findings (e.g., the 7x latency increase due to poor data organization) to better highlight the paper's contributions.
- The related work section is well-written but could be expanded to include more recent studies, particularly those focusing on RISC-V and open-source hardware accelerators. This would help position the paper within the current research landscape.
- The methodology is sound, but the paper could provide more details on the software-hardware co-design process. For example, how were the buffer descriptors (BDs) optimized for the DMA architecture? Were there any challenges in aligning the data flow chains with the BD chains?
- The results are well-presented, but the discussion could be expanded to include potential optimizations for the DMA architecture. For example, could the DMA be modified to leverage burst capabilities for continuous data, as suggested in the conclusion?
- The paper mentions that the DMA could "look ahead" to enable burst reading and updating of BDs. This is an interesting idea, but it is not explored in detail. A brief discussion or simulation of this concept would add value.
- The conclusion is well-written but could be more forward-looking. For example, what are the next steps for this research? How might the proposed architectures be adapted for more complex systems or different application domains?
- Consider evaluating the proposed architectures in a more complex system with multiple accelerators or heterogeneous components. This would provide insights into scalability and interoperability.
- Adding power consumption measurements would make the study more comprehensive and relevant to energy-constrained applications.
- Develop a theoretical model to explain the impact of data organization on latency. This would help readers understand the underlying principles and apply the findings to other systems.
- Add a glossary or expand definitions of key terms (e.g., "data flow chain," "PE chain") to improve readability for a broader audience.
Author Response
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted in blue in the re-submitted version.
- The abstract is clear and concise but could briefly mention the key findings (e.g., the 7x latency increase due to poor data organization) to better highlight the paper's contributions.
The following text was added to the abstract:
When comparing the tightly-coupled architecture with the one including the DMA, the experiments in this paper show an almost 3x decrease in frame latency when using the DMA. Nevertheless, this comes at the price of an increase in FPGA resource utilization as follows: LUT (2.5x), LUTRAM (3x), FF (3.4x) and BRAM (1.2x).
- The related work section is well-written but could be expanded to include more recent studies, particularly those focusing on RISC-V and open-source hardware accelerators. This would help position the paper within the current research landscape.
A new reference was added:
The paper "A Survey on Neural Network Hardware Accelerators" presents a comprehensive review of current hardware solutions designed to accelerate neural network computations.
- The methodology is sound, but the paper could provide more details on the software-hardware co-design process. For example, how were the buffer descriptors (BDs) optimized for the DMA architecture? Were there any challenges in aligning the data flow chains with the BD chains?
The following ideas were extracted from Section 5, Architecture Evaluation Methodology:
One key focus of the study was the design and organization of buffer descriptors (BDs), which served as the control structures for the DMA to manage data transfers efficiently. Each BD pointed to a data buffer stored in BRAM, and both BDs and their corresponding data were laid out sequentially to improve locality and minimize lookup overhead, although strict ordering was not mandatory. A significant software challenge involved aligning the BD chain structure with the pipelined dataflow of the eSAC, especially given its 44-cycle latency and single-clock-cycle throughput after initialization. Care had to be taken to ensure that BDs were configured to match the granularity and continuity of data streams, avoiding fragmentation that could stall the pipeline. Another challenge was managing the AXI-Stream protocol semantics, ensuring each BD correctly described the size and position of packets and that the data beat (transfer) boundaries were clearly respected. The software, developed and debugged using AMD Vitis 2024.1, had to prepare these chains while minimizing CPU intervention.
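The BD-chain layout described above can be sketched as follows. This is a simplified Python model of our own making: the field names (buf_addr, length, next_bd) and the builder/walker functions are illustrative and do not reproduce the actual AXI DMA register layout used in the manuscript.

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

@dataclass
class BufferDescriptor:
    """One link of a DMA buffer-descriptor chain (simplified model)."""
    buf_addr: int                                  # address of the data buffer
    length: int                                    # bytes described by this BD
    next_bd: Optional["BufferDescriptor"] = None   # next descriptor in the chain

def build_chain(base_addr: int, frame_sizes: List[int]) -> BufferDescriptor:
    """Lay BDs and their buffers out back to back, as done in the study
    to improve locality and minimize lookup overhead."""
    head = prev = None
    addr = base_addr
    for size in frame_sizes:
        bd = BufferDescriptor(buf_addr=addr, length=size)
        addr += size                 # next buffer starts where this one ends
        if prev is None:
            head = bd
        else:
            prev.next_bd = bd
        prev = bd
    return head

def walk(head: Optional[BufferDescriptor]) -> List[Tuple[int, int]]:
    """Follow the chain as the DMA engine would, one BD per transfer."""
    out = []
    while head is not None:
        out.append((head.buf_addr, head.length))
        head = head.next_bd
    return out

chain = build_chain(0x1000, [48, 48, 48])  # e.g. three 48-byte frames
print(walk(chain))  # [(4096, 48), (4144, 48), (4192, 48)]
```

The sequential layout means each BD's buffer starts exactly where the previous one ended, which is what keeps the eSAC pipeline fed without fragmentation-induced stalls.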
- The results are well-presented, but the discussion could be expanded to include potential optimizations for the DMA architecture. For example, could the DMA be modified to leverage burst capabilities for continuous data, as suggested in the conclusion?
Section 6, Results and Discussion, contains several references to leveraging burst capabilities:
"This test case, for example, is a scenario where the user just wants to feed the AXI-Stream interface with 12 32-bit words (i.e., 12 variables) that were carefully placed in memory one next to another in order to leverage the Burst transaction of the AXI specification."
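The benefit of placing the 12 words contiguously can be sketched numerically. The following Python model (our own illustration; the burst-length limit and coalescing rule are simplified assumptions, not the AXI specification's full burst semantics) groups word addresses into burst transactions: contiguous words collapse into one burst, scattered words each need their own transaction.

```python
from typing import List, Tuple

def coalesce_bursts(word_addrs: List[int],
                    word_bytes: int = 4,
                    max_burst_len: int = 16) -> List[Tuple[int, int]]:
    """Group contiguous 32-bit word addresses into (start, beat_count)
    bursts. Simplified sketch: real AXI bursts also have alignment and
    4 KiB boundary rules not modeled here."""
    bursts: List[Tuple[int, int]] = []
    for addr in sorted(word_addrs):
        if bursts:
            start, count = bursts[-1]
            # extend the current burst if this word follows the last beat
            if addr == start + count * word_bytes and count < max_burst_len:
                bursts[-1] = (start, count + 1)
                continue
        bursts.append((addr, 1))     # otherwise open a new transaction
    return bursts

# 12 words placed one next to another: a single burst transaction.
contiguous = [0x2000 + 4 * i for i in range(12)]
print(coalesce_bursts(contiguous))       # [(8192, 12)]

# The same 12 words scattered in memory: 12 single-beat transactions.
scattered = [0x2000 + 64 * i for i in range(12)]
print(len(coalesce_bursts(scattered)))   # 12
```

This is the mechanism behind the 1P1BD result: careful placement turns many small transfers into one long burst, so per-transaction overhead is paid once instead of twelve times.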
- The paper mentions that the DMA could "look ahead" to enable burst reading and updating of BDs. This is an interesting idea, but it is not explored in detail. A brief discussion or simulation of this concept would add value.
The following text was added to the conclusions:
This would add to the hardware logic and thus may only benefit designs that aim at more general-purpose use-case scenarios, for example, general-purpose microcontroller units (MCUs). While this improvement may, in theory, benefit a more general-purpose design, it could also be inappropriate for other, specific use cases, where the problem of multiple continuous buffer addresses can be tackled by software design.
- The conclusion is well-written but could be more forward-looking. For example, what are the next steps for this research? How might the proposed architectures be adapted for more complex systems or different application domains?
The following text was added to the conclusions:
Future work addresses the development of a generalized co-design framework that automatically aligns data flow chains with BD chains for different accelerator and DMA configurations. This could involve a template-based BD generation tool. Other topics to be pursued include the formalization of a design methodology for integrating accelerators with AXI/DMA systems using BD-aware streaming interfaces and adapting the architecture for SoCs that use multiple eSAC instances or heterogeneous accelerators.
- Consider evaluating the proposed architectures in a more complex system with multiple accelerators or heterogeneous components. This would provide insights into scalability and interoperability.
While evaluating the proposed architectures in a more complex system with multiple accelerators or heterogeneous components would indeed offer valuable insights into scalability and interoperability, this was intentionally deferred in the current paper to maintain a focused and controlled experimental environment. The primary objective was to isolate and analyze the impact of DMA data organization and architectural integration on latency and resource utilization. Introducing multiple accelerators or heterogeneous components would have added significant variables that could obscure the core findings.
- Adding power consumption measurements would make the study more comprehensive and relevant to energy-constrained applications.
Power analysis is recognized as an important future direction, especially once the architecture is stabilized and mapped to more application-specific or energy-constrained environments. Power consumption measurements were not included in the current study to maintain focus on architectural behavior, latency, and resource utilization as primary performance metrics. Accurately measuring power on FPGA platforms, especially at a fine-grained level, requires additional hardware instrumentation or specialized power models, which would have added complexity.
- Develop a theoretical model to explain the impact of data organization on latency. This would help readers understand the underlying principles and apply the findings to other systems.
The development of an analytical model is planned for future work, once a broader range of architectural scenarios is explored and validated. A theoretical model was not developed in this study to prioritize empirical evaluation and hands-on characterization of the proposed architectures on real hardware. Given the complexity of DMA behavior, and the dynamic interactions between buffer descriptors, FIFOs, and streaming interfaces, an accurate theoretical model would require numerous assumptions that could oversimplify or obscure key performance factors.
- Add a glossary or expand definitions of key terms (e.g., "data flow chain," "PE chain") to improve readability for a broader audience.
Added a list of abbreviations. These terms are defined in Table 4.
Reviewer 3 Report
Comments and Suggestions for Authors
Very good and well-written article. The first part presents the architecture of an embedded hardware accelerator for streaming (eSAC). The next part of this paper examines the integration of eSAC with a 6-core central processing unit (CPU) embedded in a System-on-Chip (SoC) design, using the AXI-Stream protocol specification. The research results are evaluated and possible improvement scenarios are discussed.
Remarks:
1. very good summary - congratulations!!! The first two introductory sentences could be removed because they are too general, but I leave this to the Authors' decision
2. I kindly ask you to include a list of abbreviations, it will make reading much easier, because the article is very specialized
3. the analysis of literature and chapter 2 "Related Work" - very good, such a good analysis is rare
4. Where is the goal? It should result from the analysis of literature. The goal of the research should be emphasized, and in the summary I propose to add whether it was achieved.
5. I propose to expand the description of Figure 10 - these are very interesting research results and are worth emphasizing. In my opinion, the table above the figure should have its own numbering - but this is a minor issue.
6. Please check whether all figures and tables have their references in the text of the work
Generally, the article is very good and suitable for printing in the form presented, and my comments are of a debatable nature.
The article fits very well into the profile of the journal.
Comments on the Quality of English Language
English does not require proofreading.
Author Response
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted in blue in the re-submitted version.
- Very good summary - congratulations!!! The first two introductory sentences could be removed because they are too general, but I leave this to the Authors' decision
Thank you for your appreciation.
- I kindly ask you to include a list of abbreviations, it will make reading much easier, because the article is very specialized
Added a list of abbreviations. These terms are defined in Table 4.
- The analysis of literature and chapter 2 "Related Work" - very good, such a good analysis is rare
Thank you for your appreciation.
- Where is the goal? It should result from the analysis of literature. The goal of the research should be emphasized, and in the summary I propose to add whether it was achieved.
The following text was added to the introduction:
The primary goal of this study is to explore and evaluate efficient architectural strategies for integrating a custom accelerator (eSAC) into a RISC-V-based SoC using DMA-driven data transfer. By focusing on the impact of DMA data organization and architectural coupling on latency and resource utilization, the study aimed to identify design trade-offs that directly influence performance metrics in streaming applications.
The following text was added to the conclusions:
The goal of the study, namely to evaluate architectural strategies for efficiently integrating a custom accelerator (eSAC) into a RISC-V-based SoC using DMA-driven data transfer, was successfully achieved through a systematic set of experiments and analyses. The study implemented and tested multiple architecture models, including Tightly-coupled, Protocol Adapter, and various DMA-based configurations, with a clear focus on how DMA data organization affects system latency and resource utilization.
- I propose to expand the description of Figure 10 - these are very interesting research results and are worth emphasizing. In my opinion, the table above the figure should have its own numbering - but this is a minor issue.
The table already has its own numbering (Table 3), although in the current format of the paper it can easily be missed, as it has a large description attached. Table 3 is closely related to Figure 10, and their descriptions can be seen as complementary, even supplementary at times.
- Please check whether all figures and tables have their references in the text of the work
A check was performed.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Accept in present form.