A Flexible Multi-Core Hardware Architecture for Stereo-Based Depth Estimation CNNs
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes a flexible multi-core hardware architecture tailored to stereo-based depth estimation CNNs. The authors aim to close the gap for the dual-path structure of classical stereo networks, since modern depth-first and layer-fused execution strategies are not easily compatible with such nonlinear network structures. Therefore, the authors propose an architecture that supports layer-fused execution. The paper reports approximately 24% latency reduction compared to a paper published in 2016.
Findings and comments:
1- The link “The code of this project can be found on www.github.com/...” does not work.
2- The current references are not sufficient, and the technological background is not surveyed in clear detail. Only a few papers are cited, grouped as [1-3] and [4-6] for half of the references, with the others cited as [7-11]. Large parts of the paper have no references at all.
3- The algorithmic complexity of Algorithm 1 and Algorithm 2 appears to be O(n^5). The running time and complexity of the methods must be considered: how do you reconcile the claimed 24% lower latency with these loops?
4- While repeating the same figure throughout the article can aid clarity, unnecessary repetition may confuse readers. Only the connections of the proposed method's figure (Fig. 5) change in Figures 7, 9, 10, 11, and 12. It is therefore recommended that a simpler representation be used and, if possible, that the process be explained in a single figure.
5- The results of the proposed method are evaluated against two methods from Ref. [7]. However, that method is out of date. It is strongly recommended to compare the results with recent papers; there are many papers on this topic in the literature. Simply check the works citing [7] on Google Scholar to find many more state-of-the-art papers.
6- The abstract and introduction of the paper open with the self-driving vehicle, but the vehicle is not evaluated further. The authors state that depth estimation enables more accurate driving, yet the remainder of the paper contains no supporting information or more detailed evaluation in terms of vehicles.
7- The FastZbontar and AccurateZbontar networks need more explanation. The authors must clarify why they selected only these approaches.
8- The authors claim to reduce latency by up to 24%, but the highest result was obtained only for FastZbontar at a 64x64 input size. The abstract and conclusion must be revised accordingly, and the results for the different networks must be reported separately.
Author Response
1- The link “The code of this project can be found on www.github.com/...” does not work.
Indeed, you are completely right. We will fill in the correct link in the camera-ready version once the paper is accepted; the code is not yet publicly available for confidentiality reasons. We hope you can understand this.
2- The current references are not sufficient, and the technological background is not surveyed in clear detail. Only a few papers are cited, grouped as [1-3] and [4-6] for half of the references, with the others cited as [7-11]. Large parts of the paper have no references at all.
We agree that this was indeed a point to improve, thanks for the remark! As indicated in red in the introduction, we have added more references to show both the gap in the state of the art and the different types of applications. To the best of our knowledge, there is no ASIC implementation of a stereo-based depth-estimation neural network; we therefore describe a hardware solution that could be implemented as an ASIC.
3- The algorithmic complexity of Algorithm 1 and Algorithm 2 appears to be O(n^5). The running time and complexity of the methods must be considered: how do you reconcile the claimed 24% lower latency with these loops?
Good question! The 24% lower latency refers to the evaluation of the complete networks. Each network is interpreted as a sequence of neural layers, and each layer is split into OY 'computation nodes' that each compute one output line. That single line is computed using the five nested for loops. To make this clearer for the reader and to avoid misunderstandings, we have added this explanation to the case study section, at the beginning of the paragraph discussing the results.
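To make the per-node computation concrete, here is a minimal sketch of what such a computation node does: it evaluates one output line (a fixed output row `oy`) of one convolution layer using five nested loops. The function and dimension names (`K` output channels, `C` input channels, `OX` output width, `FY`/`FX` kernel size) are illustrative assumptions, not the paper's notation.

```python
def compute_output_line(inp, weights, oy, K, C, OX, FY, FX):
    """Sketch of one 'computation node': one output line of one conv layer.

    inp:     input feature map as inp[c][y][x]
    weights: kernel as weights[k][c][fy][fx]
    oy:      index of the output line this node computes
    """
    line = [[0.0] * OX for _ in range(K)]
    for k in range(K):                # loop 1: output channels
        for c in range(C):            # loop 2: input channels
            for ox in range(OX):      # loop 3: output pixels in this line
                for fy in range(FY):          # loop 4: kernel rows
                    for fx in range(FX):      # loop 5: kernel columns
                        line[k][ox] += (weights[k][c][fy][fx]
                                        * inp[c][oy + fy][ox + fx])
    return line
```

The five loops account for the O(n^5) observation, while the claimed latency gain comes from scheduling these nodes across cores, not from changing the per-node loop count.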
4- While repeating the same figure throughout the article can aid clarity, unnecessary repetition may confuse readers. Only the connections of the proposed method's figure (Fig. 5) change in Figures 7, 9, 10, 11, and 12. It is therefore recommended that a simpler representation be used and, if possible, that the process be explained in a single figure.
Great suggestion; using one figure indeed eliminates a lot of redundancy! We extended the original architecture figure with an indication of the data buses through which data flows for each of the schedulings. You can find the new figure in Figure 5, where letters indicate which schedulings use which data bus.
5- The results of the proposed method are evaluated against two methods from Ref. [7]. However, that method is out of date. It is strongly recommended to compare the results with recent papers; there are many papers on this topic in the literature. Simply check the works citing [7] on Google Scholar to find many more state-of-the-art papers.
This is indeed a spot-on remark. We added two more networks and also discussed why FastZbontar shows the highest benefit among all of them.
All these networks are indeed not very recent (though all of them are less than 10 years old, so not prehistoric either). Modern networks typically contain ever more types of operations, such as softmax. We are currently working on a follow-up paper that can also handle these. However, we still consider our hardware architecture worth publishing: on-the-edge implementations do not even exist for simple networks, so we are already filling a gap in the state of the art and presenting a first step towards an on-the-edge solution for very recent networks.
We extended our conclusion with this message to address your valid concern.
6- The abstract and introduction of the paper open with the self-driving vehicle, but the vehicle is not evaluated further. The authors state that depth estimation enables more accurate driving, yet the remainder of the paper contains no supporting information or more detailed evaluation in terms of vehicles.
This is a correct observation, thanks for that. The self-driving vehicle is just one example of why depth-estimation algorithms can be useful; the text is meant for general-purpose depth-estimation algorithms that do not necessarily involve a car. We therefore added more examples in the abstract and introduction to show that our work goes beyond self-driving vehicles: our method can also be applied in earth observation, cartography, robotics, and so on.
7- The FastZbontar and AccurateZbontar networks need more explanation. The authors must clarify why they selected only these approaches.
This would indeed have been a great question had we limited ourselves to only two networks. However, we extended the case study section to four networks, which makes it less important to discuss every detail of each network. We still describe the elements of these networks that explain why not every network profits equally from our hardware architecture.
8- The authors claim to reduce latency by up to 24%, but the highest result was obtained only for FastZbontar at a 64x64 input size. The abstract and conclusion must be revised accordingly, and the results for the different networks must be reported separately.
We fully agree that showing only two networks is not enough to demonstrate the functionality of our hardware architecture. Therefore, we added two more networks to the case study and also discussed why FastZbontar shows the highest benefit among all of them.
Reviewer 2 Report
Comments and Suggestions for Authors
To address the challenge of adapting CNNs for stereo depth estimation in autonomous driving to traditional depth-first hardware architectures, the authors propose a flexible multi-core hardware architecture. Experimental results demonstrate a maximum latency reduction of 24% compared to state-of-the-art depth-first implementations without stereo optimization. While this approach is innovative, several issues warrant attention and further refinement.
Question 1: The abstract states that the architecture supports integrated execution and efficient management of dual-path computation, but it does not define the scope of dual-path computation support. It remains unclear whether the architecture is compatible with fusion methods beyond concatenation and element-wise addition, subtraction, and multiplication. It is recommended to clarify the flexibility boundaries of the architecture.
Question 2: The introduction mentions that hardware area consists of compute and memory, requiring memory reduction to shrink chip size. However, it fails to compare the area overhead of the proposed architecture with that of state-of-the-art depth-first architectures. Key hardware metrics are missing and should be supplemented.
Question 3: The study compared models and accuracy rates across different research approaches, but it is recommended to supplement the analysis with mainstream deep learning architectures from recent years, such as: [1] https://doi.org/10.1007/s10462-025-11193-y.
Question 4: Figure 2 illustrates the dual-path architecture of the 3D CNN, but fails to label key parameters for each layer—such as convolution kernel size, input/output channel counts, and ReLU placement—nor does it clarify whether this diagram corresponds to the common architecture of FastZbontar/AccurateZbontar. This obscures network details. We recommend supplementing the figure with annotations detailing these parameters and their structural relationships.
Question 5: Figures 7-12 highlight the activation pathway in red. However, the images are highly repetitive. It is recommended that the authors adopt a different illustration method and use a single figure to convey the information.
Question 6: Section 4.8 mentions that tiling reduces on-chip memory but requires recalculating edges. However, it does not test the performance changes when tiling is enabled nor explain the basis for selecting the tiling size. We recommend supplementing the experimental data for tiling scenarios.
Question 7: The vertical axes in Figures 14-15 lack unit labels, making it impossible to determine the absolute magnitude of the delay. Furthermore, none of the data points include error bars, leaving the stability of the results unverified. It is recommended to add units and error bars.
Question 8: The conclusion states that the architecture is network-agnostic, but experimental verification of network compatibility and stable latency reduction rates is lacking. The generalization claim lacks supporting evidence, and cross-network validation is recommended.
Author Response
Question 1: The abstract states that the architecture supports integrated execution and efficient management of dual-path computation, but it does not define the scope of dual-path computation support. It remains unclear whether the architecture is compatible with fusion methods beyond concatenation and element-wise addition, subtraction, and multiplication. It is recommended to clarify the flexibility boundaries of the architecture.
Thank you for pointing out this unclarity in the text. The mentioned combination types are indeed the ones that are implemented; no further combination types are currently supported. We clarified this in Background Section II.A, where we discuss the algorithms.
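For concreteness, the four supported combination types (concatenation and element-wise addition, subtraction, and multiplication of the two path outputs) can be sketched as follows. This is an illustrative sketch only; the function name, the NumPy formulation, and the channel-axis convention for concatenation are assumptions, not the architecture's actual interface.

```python
import numpy as np

def fuse(left, right, mode):
    """Combine the outputs of the two network paths.

    left, right: feature maps of equal shape, e.g. (channels, width).
    mode: one of the four supported combination types.
    """
    if mode == "concat":
        # Join the two feature maps along the channel axis.
        return np.concatenate([left, right], axis=0)
    if mode == "add":
        return left + right       # element-wise addition
    if mode == "sub":
        return left - right       # element-wise subtraction
    if mode == "mul":
        return left * right       # element-wise multiplication
    raise ValueError(f"unsupported combination type: {mode}")
```

Any fusion method outside these four (e.g. attention-based merging) would fall outside the architecture's current flexibility boundary.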
Question 2: The introduction mentions that hardware area consists of compute and memory, requiring memory reduction to shrink chip size. However, it fails to compare the area overhead of the proposed architecture with that of state-of-the-art depth-first architectures. Key hardware metrics are missing and should be supplemented.
Very valid point about this unclarity in the text! We made our design parameterizable in size, meaning that we can allocate, for example, as much memory area as we want. The bigger the memory, the more layers can be stored on-chip and the higher the chance that no feature map needs to be sent to off-chip memory. Therefore, absolute area comparisons do not make much sense, as the same remark applies to state-of-the-art depth-first implementations. We added this remark about parameterizability at the beginning of Section III.
Question 3: The study compared models and accuracy rates across different research approaches, but it is recommended to supplement the analysis with mainstream deep learning architectures from recent years, such as: [1] https://doi.org/10.1007/s10462-025-11193-y.
We fully understand your concern about the importance of accuracy rates. However, the goal of this project is to generate hardware to run neural networks on, not to design or compare neural networks. Therefore, we did not perform a complete comparison of the highest accuracies reported in the literature.
Question 4: Figure 2 illustrates the dual-path architecture of the 3D CNN, but fails to label key parameters for each layer—such as convolution kernel size, input/output channel counts, and ReLU placement—nor does it clarify whether this diagram corresponds to the common architecture of FastZbontar/AccurateZbontar. This obscures network details. We recommend supplementing the figure with annotations detailing these parameters and their structural relationships.
Great suggestion for clarifying the text! Figure 2 illustrates the general concept of what a dual-path network looks like; therefore, we did not include network-specific dimensions. We have also added two more networks in the case study section, so the specific numbers for the Zbontar networks alone are no longer the only relevant ones for this paper.
Question 5: Figures 7-12 highlight the activation pathway in red. However, the images are highly repetitive. It is recommended that the authors adopt a different illustration method and use a single figure to convey the information.
Great suggestion; using one figure indeed eliminates a lot of redundancy! We extended the original architecture figure with an indication of the data buses through which data flows for each of the schedulings. You can find the new figure in Figure 5, where letters indicate which schedulings use which data bus.
Question 6: Section 4.8 mentions that tiling reduces on-chip memory but requires recalculating edges. However, it does not test the performance changes when tiling is enabled nor explain the basis for selecting the tiling size. We recommend supplementing the experimental data for tiling scenarios.
Thank you for pointing out this element! We indeed did not experiment with tiling; our memory sizes were large enough to make tiling unnecessary. We had included this information only in case a user needs it for a particular use case. We removed Section 4.H, as we agree it is not a main contribution of the paper. We hope this adds clarity to the text.
Question 7: The vertical axis in Figure 14-15 lacks unit labels, making it impossible to determine the absolute magnitude of the delay. Furthermore, none of the data points include error bars, leaving the stability of the results unverified. It is recommended to add units and error bars.
Great remark on the readability and comprehensiveness of the figures! We did not add error bars because the runtime is fixed for a given network: due to the memory design, nondeterministic effects such as cache misses cannot occur. The absolute magnitude of the delay is given in the plot, expressed in clock cycles. The translation from clock cycles to seconds depends on the clock frequency and is therefore technology dependent; we wanted our conclusions to be as general, and thus technology independent, as possible.
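The cycle-to-seconds translation mentioned above is a single division by the clock frequency. The sketch below illustrates it; the cycle count and frequencies are made-up example values, not figures from the paper.

```python
def cycles_to_seconds(cycles, freq_hz):
    """Convert a technology-independent cycle count to wall-clock latency."""
    return cycles / freq_hz

# Hypothetical cycle count read off a plot; actual values depend on the network.
latency_cycles = 1_000_000
print(cycles_to_seconds(latency_cycles, 100e6))  # at 100 MHz -> 0.01 s
print(cycles_to_seconds(latency_cycles, 1e9))    # at 1 GHz   -> 0.001 s
```

The same cycle count thus maps to a tenfold different latency depending on the target technology, which is why the plots report cycles rather than seconds.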
Question 8: The conclusion states that the architecture is network-agnostic, but experimental verification of network compatibility and stable latency reduction rates is lacking. The generalization claim lacks supporting evidence, and cross-network validation is recommended.
We fully agree that showing only two networks is not enough to demonstrate the functionality of our hardware architecture. Therefore, we added two more networks to the case study and also discussed why FastZbontar shows the highest benefit among all of them.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
By incorporating my findings, the authors have greatly improved the article, and I consider it a good contribution to the literature. I still consider the lack of a sufficient discussion of the compared articles in the Introduction a significant weakness; however, I find the authors' statement about the paucity of literature reasonable.
Reviewer 2 Report
Comments and Suggestions for Authors
Fine.

