Do It Once: Concatenating the Image Pair for a Single Pass Feature Extraction in Stereo Depth Sensing
Highlights
- A single-pass feature extraction approach, based on concatenating stereo image pairs, improves inference speed by 10–39% across multiple state-of-the-art models.
- The proposed method preserves accuracy while introducing only a moderate increase in runtime memory usage, with no changes to the underlying network architecture.
- Existing and future deep-learning-based stereo depth models can be significantly accelerated through a simple plug-and-play modification of the feature extraction module, with demonstrated benefits on both a high-performance GPU and an edge device.
Abstract
1. Introduction
- We identify redundant left/right feature extraction (FE) as a source of inefficiency in stereo matching pipelines and introduce a method that removes this redundancy by processing the stereo pair in a single FE pass.
- We define and evaluate several stereo-pair concatenation strategies, including batch, spatial, and multi-cut variants, and analyze how these alternatives affect inference speed, memory usage, and accuracy.
- We validate the approach across three structurally different stereo networks, showing 10–39% average inference speed improvements while maintaining accuracy.
- We provide TensorRT layer-level profiling to explain where the speedup arises, showing that the gains are primarily driven by reduced FE execution time.
- We demonstrate the applicability of our approach on an edge device, achieving an FPS increase of up to 28.4%.
2. Materials and Methods
1. Baseline: The default configuration is the original, unmodified version of each tested model, in which FE is implemented as two separate modules that run successively (Figure 2(1)).
2. Batch-based concatenation: In our first proposed method, a, we concatenate the image pair along the batch dimension, similarly to batching during training (Figure 2(2)). The two images remain completely independent of each other. Since batch processing is the universal paradigm for training deep neural networks, the entire ecosystem (from PyTorch down to the GPU hardware) is inherently optimized for it. Thus, we anticipate strong performance improvements with minimal side effects.
3. Spatial concatenation: In the second and third implementations, b and c (Figure 2(3)), we concatenate the image pair in the same 2D plane, in the vertical or horizontal direction, i.e., we double the height or width, respectively.
4. Multi-cut batching: In the next two methods, d and e (Figure 2(4)), we first cut the two images into multiple smaller subimages and then concatenate them along the batch dimension. Configuration d slices each image once vertically at its middle, while e slices twice, once vertically and once horizontally. From the two starting images, i.e., left and right, we therefore obtain four or eight subimages, respectively. While heavy batching might accelerate inference, the required assembly and disassembly steps might, in turn, slow it down.
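The assembly and disassembly steps of the configurations above can be sketched on toy data. This is a minimal, dependency-free illustration and not the authors' implementation: plain nested lists stand in for single-channel image tensors, all function names (`assemble`, `disassemble`, `split_w`, `split_h`, `join_w`) are hypothetical, the labels b (height) and c (width) follow the configuration table, and the FE network is replaced by the identity, so disassembly acts on the images themselves rather than on downsampled feature maps.

```python
# An "image" is a list of H rows, each a list of W values.
# A "batch" is a list of images. Real code would use framework tensors.

def split_w(img):
    """Cut an image once vertically, at the middle of its width."""
    mid = len(img[0]) // 2
    return [row[:mid] for row in img], [row[mid:] for row in img]

def split_h(img):
    """Cut an image once horizontally, at the middle of its height."""
    mid = len(img) // 2
    return img[:mid], img[mid:]

def join_w(a, b):
    """Place two images side by side (inverse of split_w)."""
    return [x + y for x, y in zip(a, b)]

def assemble(config, left, right):
    """Build the single FE input batch for one configuration."""
    if config in ("a", "ea"):   # batch concatenation: 2 images
        return [left, right]
    if config == "b":           # vertical concatenation: height doubles
        return [left + right]
    if config == "c":           # horizontal concatenation: width doubles
        return [join_w(left, right)]
    if config == "d":           # one vertical cut per image: 4 subimages
        return [*split_w(left), *split_w(right)]
    if config == "e":           # vertical and horizontal cut: 8 subimages
        return [q for img in (left, right)
                for half in split_w(img)
                for q in split_h(half)]
    raise ValueError(f"unknown configuration: {config}")

def disassemble(config, batch):
    """Recover the (left, right) pair from the FE output batch."""
    if config in ("a", "ea"):
        return batch[0], batch[1]
    if config == "b":
        return split_h(batch[0])
    if config == "c":
        return split_w(batch[0])
    if config == "d":
        return join_w(batch[0], batch[1]), join_w(batch[2], batch[3])
    if config == "e":
        # Re-stack top/bottom quarters into halves, then join the halves.
        halves = [batch[i] + batch[i + 1] for i in range(0, 8, 2)]
        return join_w(halves[0], halves[1]), join_w(halves[2], halves[3])
    raise ValueError(f"unknown configuration: {config}")

def shape(batch):
    """(N, H, W) of a batch of equally sized images."""
    return len(batch), len(batch[0]), len(batch[0][0])

left = [[r * 4 + c for c in range(4)] for r in range(4)]
right = [[100 + r * 4 + c for c in range(4)] for r in range(4)]
for cfg in ("a", "b", "c", "d", "e"):
    print(cfg, shape(assemble(cfg, left, right)))
```

For the 4×4 toy inputs, the printed (N, H, W) shapes are (2, 4, 4), (1, 8, 4), (1, 4, 8), (4, 4, 2), and (8, 2, 2) for a through e, and `disassemble` returns the original pair in every case.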
2.1. Inference Benchmarking
2.2. Accuracy Benchmarking
3. Results
3.1. Inference Rate
Stage Profiling
- FE impl.: The core FE process executed by the network.
- FE assembly & FE disassembly: The computational overhead introduced by slicing, concatenating, and reorganizing the input images or resulting feature maps according to the selected configuration.
- Cost agg.: The cost aggregation process, which corresponds to the coarse depth estimation stage illustrated in Figure 1.
- Cost volume: The construction of the initial cost volume, part of the feature matching stage shown in Figure 1.
- Refinement: The post-processing stage where initial, noisy coarse depth estimates are smoothed and enhanced into a coherent depth map.
- Out stage: The final processing step that converts the network’s raw numerical outputs into actual pixel-level disparity values.
- Uncategorized: Miscellaneous execution nodes within the computation graph that cannot be strictly assigned to any of the defined categories above.
3.2. Memory Consumption
3.3. Accuracy
4. Edge Device Benchmarking
5. Discussion
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| FPGA | Field Programmable Gate Array |
| FE | Feature Extraction (Network) |
| SSD | Solid-state Drive |
| EPE | End-Point-Error |
| ONNX | Open Neural Network Exchange, an open format to represent neural models [21] |
| GPU | Graphics Processing Unit |
| VRAM | Video Random Access Memory on GPU |
| FPS | Frames Per Second |
| ARNES | Academic and Research Network of Slovenia |
| HPC | High-performance Computing |
| MPS | Nvidia Multi-Process Service |
| RVC4 | Robotic Vision Core 4 |
| OS | Operating System |
Appendix A. Single Pass FE Formulation
| Configuration (k) | Tensor Dimensions |
|---|---|
| default | (two separate tensors) |
| a, ea | |
| b | |
| c | |
| d | |
| e | |
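The input shapes implied by the textual configuration descriptions can be written down as a small helper. This is a hedged reconstruction, not the paper's formulation: it assumes an (N, C, H, W) tensor layout with even H and W, and the function name `fe_input_shape` and the example resolution are hypothetical.

```python
import math

def fe_input_shape(config, n, c, h, w):
    """Shape (N, C, H, W) of the single FE input built from two n x c x h x w images.

    Assumed layout and labels: b doubles the height, c doubles the width,
    d halves the width (4 subimages), e halves height and width (8 subimages).
    """
    shapes = {
        "a": (2 * n, c, h, w),
        "ea": (2 * n, c, h, w),
        "b": (n, c, 2 * h, w),
        "c": (n, c, h, 2 * w),
        "d": (4 * n, c, h, w // 2),
        "e": (8 * n, c, h // 2, w // 2),
    }
    return shapes[config]

# Hypothetical example: one RGB stereo pair at 480 x 640.
pair_elements = 2 * 1 * 3 * 480 * 640
for cfg in ("a", "b", "c", "d", "e", "ea"):
    s = fe_input_shape(cfg, 1, 3, 480, 640)
    # Every configuration only rearranges the pair; no element is added or dropped.
    assert math.prod(s) == pair_elements
    print(cfg, s)
```

All configurations carry the same number of input elements; they differ only in how those elements are distributed over the batch and spatial dimensions.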
Appendix B. Error Map

References
1. Hamad, L.; Khan, M.A.; Mohamed, A. Object Depth and Size Estimation using Stereo-vision and Integration with SLAM. arXiv 2024, arXiv:2409.07623.
2. Suthakorn, J.; Kishore, M.; Ongwattanakul, S.; Matsuno, F.; Svinin, M.; Madhavan Pillai, B. Stereo Vision-based Object Detection and Depth Estimation from 3D Reconstructed Scene for an Autonomous Multi Robotic Rescue Mission. In Proceedings of the Twenty-Seventh International Symposium on Artificial Life and Robotics, Online, 25–27 January 2022.
3. Rodríguez-Martínez, E.A.; Flores-Fuentes, W.; Achakir, F.; Sergiyenko, O.; Murrieta-Rico, F.N. Vision-Based Navigation and Perception for Autonomous Robots: Sensors, SLAM, Control Strategies, and Cross-Domain Applications—A Review. Eng 2025, 6, 153.
4. Zbontar, J.; LeCun, Y. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2015; pp. 1592–1599.
5. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-End Learning of Geometry and Context for Deep Stereo Regression. arXiv 2017, arXiv:1703.04309.
6. Chang, J.R.; Chen, Y.S. Pyramid Stereo Matching Network. arXiv 2018, arXiv:1803.08669.
7. Liang, B.; Yang, H.; Huang, J.; Liu, C.; Yang, R. Real-Time Stereo Matching Network Based on 3D Channel and Disparity Attention for Edge Devices Toward Autonomous Driving. IEEE Access 2023, 11, 76781–76792.
8. Zhang, J.; Tong, Q.; Yan, N.; Liu, X. FAMNet: A Lightweight Stereo Matching Network for Real-Time Depth Estimation in Autonomous Driving. Symmetry 2025, 17, 1214.
9. Lipson, L.; Teed, Z.; Deng, J. RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. In Proceedings of the International Conference on 3D Vision (3DV); IEEE: New York, NY, USA, 2021; Volume 3.
10. Xu, G.; Wang, X.; Ding, X.; Yang, X. Iterative Geometry Encoding Volume for Stereo Matching. arXiv 2023, arXiv:2303.06615.
11. Li, X.; Zhang, C.; Su, W.; Tao, W. IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching. Proc. AAAI Conf. Artif. Intell. 2024, 38, 3225–3233.
12. Guo, X.; Zhang, C.; Zhang, Y.; Zheng, W.; Nie, D.; Poggi, M.; Chen, L. LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation. arXiv 2025, arXiv:2406.19833.
13. Rahim, R.; Woerz, S.; Zell, A. LeanStereo: A Leaner Backbone based Stereo Network. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2023; pp. 1–8.
14. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv 2019, arXiv:1801.04381.
15. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946.
16. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. MobileOne: An Improved One millisecond Mobile Backbone. arXiv 2023, arXiv:2206.04040.
17. Regoršek, Ž.; Žemva, A. LightFusion: Design and Evaluation of a New Stereo Depth Neural Network with Fusion of Additional Depth Modality. In Proceedings of the 60th International Conference on Microelectronics, Devices and Materials & The Workshop on Energy Management and Renewable Energy Sources; Society for Microelectronics, Electronic Components and Materials (MIDEM): Ljubljana, Slovenia, 2025; pp. 112–120. Available online: https://plus.cobiss.net/cobiss/si/sl/data/cobib/274700547 (accessed on 5 June 2026).
18. Deng, Y.; Xiao, J.; Zhou, S.Z. ToF and Stereo Data Fusion Using Dynamic Search Range Stereo Matching. IEEE Trans. Multimed. 2022, 24, 2739–2751.
19. Goetschalckx, K.; Verhelst, M. Breaking High-Resolution CNN Bandwidth Barriers With Enhanced Depth-First Execution. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 323–331.
20. Colleman, S.; Nardi-Dei, A.; Geilen, M.C.W.; Stuijk, S.; Goedemé, T. A Flexible Multi-Core Hardware Architecture for Stereo-Based Depth Estimation CNNs. Electronics 2025, 14, 4425.
21. ONNX Runtime. Version: 1.19.2. 2026. Available online: https://onnxruntime.ai/ (accessed on 7 June 2026).
22. Huang, Y.; Zhang, Y.; Feng, B.; Guo, X.; Zhang, Y.; Ding, Y. A Close Look at Multi-tenant Parallel CNN Inference for Autonomous Driving. In Proceedings of the Network and Parallel Computing: 17th IFIP WG 10.3 International Conference, NPC 2020, Zhengzhou, China, 28–30 September 2020; pp. 92–104.
23. Guo, X.; Zhang, C.; Lu, J.; Wang, Y.; Duan, Y.; Yang, T.; Zhu, Z.; Chen, L. OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline. arXiv 2023, arXiv:2312.00343.
24. Amdahl, G.M. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the April 18–20, 1967, Spring Joint Computer Conference; Association for Computing Machinery: New York, NY, USA, 1967; pp. 483–485.
25. Mayer, N.; Ilg, E.; Häusser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR); Computer Vision Foundation: New York, NY, USA, 2016.
26. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2009; pp. 248–255.
27. Scharstein, D.; Szeliski, R. A Taxonomy and Evaluation of Dense Two-frame Stereo Correspondence Algorithms. Int. J. Comput. Vis. 2002, 47, 7–42.
28. Luxonis. OAK-4-D Camera. 2026. Available online: https://docs.luxonis.com/hardware/products/OAK%204%20D (accessed on 5 June 2026).






| Configuration Label | Description |
|---|---|
| default | The default model as published by its authors. FE is run twice, sequentially. Serves as the baseline in all our experiments. |
| a | The left and right images are concatenated along the batch dimension. FE is run once for both images. |
| b | The left and right images are concatenated in the vertical direction, i.e., along the height dimension. FE is run once. |
| c | The left and right images are concatenated in the horizontal direction, i.e., along the width dimension. FE is run once. |
| d | The left and right images are each cut in half along the width dimension. The 4 resulting subimages are then concatenated along the batch dimension. |
| e | The left and right images are each cut in half along both the height and width dimensions. The 8 resulting subimages are then concatenated along the batch dimension. |
| ea | The left and right images are not cut at all but are concatenated along the batch dimension. This is effectively the same as configuration a, but implemented in the same way as d and e. |
| Config | LightStereo FPS | LightStereo Speedup | IINet FPS | IINet Speedup | LeanStereo FPS | LeanStereo Speedup |
|---|---|---|---|---|---|---|
| default | 137.38 | 1.00 | 56.22 | 1.00 | 31.83 | 1.00 |
| a | 187.13 | 1.36 | 63.44 | 1.13 | 41.53 | 1.30 |
| b | 183.73 | 1.34 | 63.04 | 1.12 | 41.39 | 1.30 |
| c | 183.10 | 1.33 | 63.25 | 1.13 | 41.53 | 1.30 |
| d | 207.15 | 1.51 | 60.18 | 1.07 | 39.57 | 1.24 |
| e | 202.03 | 1.47 | 58.73 | 1.04 | 39.45 | 1.24 |
| ea | 184.98 | 1.35 | 62.73 | 1.12 | 39.64 | 1.25 |
| Avg. | 191.35 | 1.39 | 61.86 | 1.10 | 39.28 | 1.27 |
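The Speedup columns are consistent with the ratio of each configuration's FPS to the default FPS. The short check below reproduces the LightStereo column from the FPS values in the table; the variable names are illustrative, and treating the "Avg." row as the mean over the six modified configurations (excluding default) is an assumption that happens to match the LightStereo numbers.

```python
# LightStereo FPS values copied from the table above.
fps = {
    "default": 137.38,
    "a": 187.13,
    "b": 183.73,
    "c": 183.10,
    "d": 207.15,
    "e": 202.03,
    "ea": 184.98,
}

baseline = fps["default"]

# Speedup = FPS of the modified model / FPS of the unmodified model.
speedup = {k: round(v / baseline, 2) for k, v in fps.items() if k != "default"}

# Assumed definition of the "Avg." row: mean over the modified configurations.
modified = [v for k, v in fps.items() if k != "default"]
avg_fps = sum(modified) / len(modified)
avg_speedup = avg_fps / baseline

print(speedup)
print(round(avg_fps, 2), round(avg_speedup, 2))
```

The same ratio can be expressed as a reduction in per-frame latency, since the time per frame is the reciprocal of FPS.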
| Component | default | a | b | c | d | e | ea |
|---|---|---|---|---|---|---|---|
| FE impl. | 1822.04 | 936.36 | 941.70 | 962.23 | 902.97 | 906.50 | 982.12 |
| Cost agg. | 711.21 | 738.15 | 733.61 | 733.61 | 675.62 | 677.16 | 701.14 |
| Cost volume | 468.67 | 451.77 | 458.50 | 453.66 | 353.20 | 368.97 | 575.15 |
| Refinement | 281.79 | 279.70 | 275.23 | 288.23 | 241.46 | 240.62 | 245.60 |
| Out stage | 45.07 | 61.45 | 61.18 | 61.14 | 61.35 | 61.25 | 61.87 |
| FE assembly | - | - | - | - | 36.12 | 68.33 | - |
| FE disassembly | - | 7.59 | 23.24 | 25.45 | - | - | - |
| Uncategorized | 31.93 | 47.46 | 48.17 | 49.07 | 28.89 | 29.19 | 48.41 |
| Total | 3360.70 | 2522.48 | 2541.62 | 2573.38 | 2299.61 | 2352.01 | 2614.29 |
| FPS | 148.78 | 198.22 | 196.72 | 194.30 | 217.43 | 212.58 | 191.26 |
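To see how much of the gain comes from FE, the stage times of the default model and configuration a can be compared directly. The sketch below uses only the values from the profiling table above, in the table's own time units, which are not restated here; the dictionary and variable names are illustrative.

```python
# Stage times copied from the profiling table (default vs. configuration a).
default = {
    "FE impl.": 1822.04,
    "Cost agg.": 711.21,
    "Cost volume": 468.67,
    "Refinement": 281.79,
    "Out stage": 45.07,
    "Uncategorized": 31.93,
}
config_a = {
    "FE impl.": 936.36,
    "Cost agg.": 738.15,
    "Cost volume": 451.77,
    "Refinement": 279.70,
    "Out stage": 61.45,
    "FE disassembly": 7.59,
    "Uncategorized": 47.46,
}

total_default = sum(default.values())
total_a = sum(config_a.values())

# Share of the default runtime that is spent in FE.
fe_share = default["FE impl."] / total_default

# Absolute and relative reduction of the FE stage itself.
fe_saving = default["FE impl."] - config_a["FE impl."]
fe_reduction = fe_saving / default["FE impl."]

# Net effect over the whole pipeline.
net_saving = total_default - total_a
overall_speedup = total_default / total_a

print(f"FE share of default total: {fe_share:.1%}")
print(f"FE time reduction:         {fe_reduction:.1%}")
print(f"FE saving vs. net saving:  {fe_saving:.2f} vs. {net_saving:.2f}")
print(f"Overall speedup:           {overall_speedup:.2f}x")
```

With these numbers, FE accounts for roughly 54% of the default total and is nearly halved in configuration a. The FE saving (about 886) is slightly larger than the net saving (about 838) because the remaining stages together take somewhat longer.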
| Config | LightStereo Model Size (MB) | LightStereo Memory Usage (MB) | IINet Model Size (MB) | IINet Memory Usage (MB) | LeanStereo Model Size (MB) | LeanStereo Memory Usage (MB) |
|---|---|---|---|---|---|---|
| default | 16.34 | 761.88 | 48.19 | 1041.88 | 28.08 | 1533.88 |
| a | 10.25 | 1197.88 | 45.57 | 1375.88 | 18.90 | 1909.88 |
| b | 10.28 | 1197.88 | 44.14 | 1375.88 | 16.96 | 1907.88 |
| c | 10.26 | 1197.88 | 45.82 | 1377.88 | 18.97 | 1909.88 |
| d | 10.28 | 1141.88 | 43.30 | 1389.88 | 16.91 | 1641.88 |
| e | 10.39 | 1139.88 | 43.54 | 1397.88 | 17.00 | 1639.88 |
| ea | 10.31 | 1135.88 | 45.44 | 1409.88 | 16.91 | 1643.88 |
| Avg. | 10.30 | 1168.55 | 44.64 | 1387.88 | 17.61 | 1775.55 |
| Avg. vs. default | −37.0% | +53.4% | −7.4% | +33.2% | −37.3% | +15.8% |
| Config | LightStereo EPE | LightStereo Ratio | IINet EPE | IINet Ratio | LeanStereo EPE | LeanStereo Ratio |
|---|---|---|---|---|---|---|
| default | 0.8126 | 1.000 | 0.7629 | 1.000 | 0.9505 | 1.000 |
| a | 0.8171 | 1.006 | 0.7429 | 0.974 | 0.9202 | 0.968 |
| b | 0.8160 | 1.004 | 0.7477 | 0.980 | 0.9389 | 0.988 |
| c | 0.8135 | 1.001 | 0.7402 | 0.970 | 0.9297 | 0.978 |
| d | 0.8279 | 1.019 | 0.7560 | 0.991 | 0.9329 | 0.981 |
| e | 0.8290 | 1.020 | 0.7573 | 0.993 | 0.9449 | 0.994 |
| ea | 0.8089 | 0.995 | 0.7466 | 0.979 | 0.9376 | 0.986 |
| Avg. | 0.8187 | 1.008 | 0.7485 | 0.981 | 0.9340 | 0.983 |
| Avg. vs. default | | +0.8% | | −1.9% | | −1.7% |
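For reference, the end-point error (EPE) in stereo matching is commonly computed as the mean absolute difference between predicted and ground-truth disparity over valid pixels. The sketch below illustrates that common definition on flat lists; the function name `end_point_error`, the toy values, and the validity mask are assumptions, and the exact masking rule used for the table above is not stated here.

```python
def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error over valid pixels.

    pred, gt: flat sequences of disparities (in pixels).
    valid:    optional flat sequence of booleans; all pixels count if omitted.
    """
    if valid is None:
        valid = [True] * len(gt)
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    if not errors:
        raise ValueError("no valid pixels")
    return sum(errors) / len(errors)

# Toy example: the last pixel has no ground truth and is masked out.
pred = [10.0, 20.5, 31.0, 7.0]
gt = [10.5, 20.0, 30.0, 0.0]
valid = [True, True, True, False]
epe = end_point_error(pred, gt, valid)

# The Ratio column is consistent with EPE relative to the default model,
# e.g. LightStereo configuration a from the table above.
ratio_a = 0.8171 / 0.8126
print(round(epe, 4), round(ratio_a, 3))
```

A ratio below 1.000 therefore means the modified configuration has a lower error than the unmodified model.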
| Config | FPS (1) | FPS (2) | FPS (4) | Norm. FPS (1) | Norm. FPS (2) | Norm. FPS (4) | DSP Util. [%] (1) | DSP Util. [%] (2) | DSP Util. [%] (4) |
|---|---|---|---|---|---|---|---|---|---|
| default | 30.17 | 32.76 | 32.84 | 1.000 | 1.000 | 1.000 | 87.75 | 95.14 | 95.13 |
| a | 30.70 | 33.20 | 33.24 | 1.018 | 1.013 | 1.012 | 88.13 | 75.87 | 87.17 |
| b | 30.33 | 32.81 | 32.65 | 1.005 | 1.002 | 0.994 | 87.96 | 91.76 | 94.64 |
| c | 30.56 | 33.08 | 33.31 | 1.013 | 1.010 | 1.014 | 87.80 | 81.18 | 81.67 |
| d | 37.65 | 41.16 | 41.12 | 1.248 | 1.256 | 1.252 | 92.88 | 95.14 | 95.12 |
| e | 35.58 | 38.70 | 39.00 | 1.179 | 1.181 | 1.188 | 78.09 | 95.11 | 93.04 |
| ea | 38.24 | 41.89 | 42.16 | 1.267 | 1.279 | 1.284 | 85.78 | 95.14 | 86.38 |
| Avg. | 33.32 | 36.23 | 36.33 | 1.104 | 1.106 | 1.106 | 86.91 | 89.91 | 90.45 |
| Config | Processor Memory [MB] | Total Memory [MB] |
|---|---|---|
| default | 159.27 | 1350.16 |
| a | 158.64 | 1349.04 |
| b | 158.53 | 1345.88 |
| c | 157.54 | 1346.50 |
| d | 157.27 | 1342.24 |
| e | 157.42 | 1343.79 |
| ea | 156.53 | 1343.89 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Regoršek, Ž.; Žemva, A. Do It Once: Concatenating the Image Pair for a Single Pass Feature Extraction in Stereo Depth Sensing. Sensors 2026, 26, 3919. https://doi.org/10.3390/s26123919

