This section presents a systematic review of key perception tasks in vision-based autonomous driving, examining the evolution and real-world deployment of detection algorithms, from classical computer vision to state-of-the-art deep learning, with a dedicated focus on FPGA-accelerated implementations. The extensive variability in algorithms, datasets, platforms, and evaluation metrics observed in the existing literature is highlighted to guide the development of more reliable and efficient next-generation autonomous driving systems.
2.2. Classical Computer Vision Approaches
Early implementations predominantly relied on traditional computer vision techniques coupled with lightweight machine-learning models, primarily due to the computational constraints of embedded platforms. The Histogram of Oriented Gradients (HOG) feature descriptor combined with Support Vector Machines (SVMs) emerged as the predominant classical approach, offering a reduced memory footprint and lower computational requirements that make it well suited for embedded deployment.
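To make this classical pipeline concrete, the following minimal C++ sketch runs HOG + SVM pedestrian detection with OpenCV's stock people detector. It is an illustrative software baseline only, not a reproduction of any of the FPGA designs discussed below; the input path and the detection parameters (window stride, padding, pyramid scale step) are placeholder assumptions.

// Minimal HOG + SVM pedestrian-detection sketch using OpenCV's built-in
// people detector (illustrative software baseline only).
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Mat img = cv::imread("sample.jpg");  // placeholder input path
    if (img.empty()) {
        std::cerr << "Could not read image\n";
        return 1;
    }

    // HOG descriptor with the default 64x128 pedestrian window, paired with
    // OpenCV's pretrained linear SVM coefficients.
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    // Sliding-window, multi-scale detection; stride and scale step trade
    // accuracy against runtime (values here are placeholders).
    std::vector<cv::Rect> detections;
    hog.detectMultiScale(img, detections,
                         0.0,               // SVM decision threshold
                         cv::Size(8, 8),    // window stride
                         cv::Size(16, 16),  // padding
                         1.05,              // image pyramid scale step
                         2.0);              // detection grouping threshold

    for (const cv::Rect& r : detections)
        cv::rectangle(img, r, cv::Scalar(0, 255, 0), 2);
    std::cout << "Detections: " << detections.size() << "\n";
    return 0;
}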
Several pioneering works demonstrated the effectiveness of these classical approaches on FPGA platforms. Martelli et al. (2011) presented a fast FPGA-based architecture for pedestrian detection using covariance matrices, achieving 132 fps on a Xilinx Virtex-6 LX240T FPGA by exploiting the symmetry of second-order integrals and tensor computation parallelism to minimize latency [13,14]. Lin et al. (2008) developed a fuzzy-logic PID controller in VHDL for vehicle collision avoidance, demonstrating the application of traditional control methods on FPGA platforms for real-time applications [15]. The HOG + SVM combination proved particularly successful for FPGA implementations due to its computational characteristics. Suleiman and Sze (2016) presented an energy-efficient hardware implementation of HOG-based object detection, achieving 1080p processing at 60 fps with multiscale support on 45 nm SOI CMOS ASIC technology, consuming an average power of 69 mW while maintaining high detection accuracy [14,16]. Building on this foundation, Meus et al. (2017) implemented a HOG + SVM pipeline on a Xilinx Zynq SoC using a hardware-software codesign approach in which the ARM processor handled detection and tracking while the FPGA accelerated the HOG and SVM processing, achieving 60 fps with an energy efficiency of 3.95 GOPS/W [14,17]. In addition, Nazir et al. (2018) demonstrated the feasibility of traditional methods on low-cost embedded platforms by implementing HOG and SVM on the Raspberry Pi 3 and Odroid C2, achieving 5–7 fps [14,18]. Borrego-Carazo et al. (2020) emphasized that SVMs are particularly well suited for resource-constrained hardware owing to their lightweight inference following offline training [14].
A comprehensive analysis by Lin (2023) reviewed FPGA-based HOG-SVM pedestrian-detection methods, confirming that current implementations typically achieve detection accuracy exceeding 95% and detection speeds above 30 FPS on the INRIA dataset, reinforcing the continued relevance of classical approaches alongside modern deep-learning methods [19]. Advanced HOG implementations have achieved remarkable efficiency improvements: a novel low-resource hardware implementation processes approximately 0.933 pixels per clock cycle while maintaining 91.79% accuracy on the INRIA dataset and 98.49% on the MIT dataset, with only minimal accuracy degradation (1.2% and 0.11%, respectively, i.e., <2% absolute) compared to the original HOG algorithm [20].
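To put the 0.933 pixels-per-clock figure into perspective, assume, purely for illustration, a 100 MHz processing clock (the actual operating frequency of the design in [20] may differ): 0.933 pixels/cycle × 100 MHz ≈ 93.3 Mpixels/s, which corresponds to roughly 300 frames per second for 640 × 480 images.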
2.3. Deep Learning Revolution
The emergence of deep neural networks (DNNs), particularly Convolutional Neural Networks (CNNs), has substantially improved object-detection accuracy. However, their computational intensity presents significant challenges for real-time deployment on resource-constrained FPGAs, necessitating various optimization strategies. Two-stage detectors, including R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN, employ a sequential approach in which region proposals are generated and then classified and localized. While achieving superior accuracy, their two-stage architecture and larger model sizes result in higher latency, making deployment on embedded FPGA platforms challenging. Conversely, one-stage detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), directly predict bounding boxes and classes from feature maps, offering significantly higher detection speeds suitable for real-time applications.
Early deep-learning implementations on FPGAs focused on compressed and lightweight architectures to address resource constraints. Fan et al. (2018) developed a real-time object-detection accelerator using compressed SSDLite on FPGA, achieving a throughput of 65 FPS [21]. Takasaki et al. (2021) implemented a road-marking detector using Binarized Neural Networks (BNNs) on the Programmable Logic (PL) of an Ultra96-V2 board, achieving a per-frame processing time of 0.0054059 s (approximately 185 FPS) and demonstrating ultra-low-latency processing for autonomous driving applications [22]. Surapally et al. (2022) evaluated Quantized Neural Networks (QNNs) for object detection using a Tiny YOLO variant on an AMD-Xilinx PYNQ-Z2 board, achieving a 50× speedup compared to software implementations [23]. Additionally, Talib et al. (2022) conducted a comparative analysis of CNN, QNN, and BNN implementations on ZYNQ FPGAs, identifying the CNN's superior accuracy and memory efficiency for object-detection tasks [24]. Recent advances have introduced sophisticated real-time hazard-detection systems specifically designed for autonomous vehicles. Zhou et al. (2025) achieved 92.3% mAP with 8 ms latency for detecting pedestrians, vehicles, and obstacles, using an attention-based dynamic CNN with DVFS optimization that consumes only 35 W on an FPGA, compared to 250 W on a GPU, while processing at 125 FPS in the CNN inference stage [25]. This represents a significant advance in power efficiency for safety-critical applications.
Hybrid approaches combining traditional and deep-learning methods have also shown promise. Hamdaoui et al. (2022) proposed an optimized hardware vision system combining HOG, Particle Swarm Optimization (PSO), and SVM on a Virtex-7 FPGA for vehicle detection, achieving 97.84% accuracy on the KITTI dataset with 1.483 ms latency [26]. Kojima (2022) developed an autonomous robot car using a Xilinx SoC FPGA (Ultra96 board with XCZU3EG) for real-time image processing with YOLOv3-tiny via the Xilinx DPU IP, achieving approximately 3 FPS and highlighting the challenges of real-time processing for larger input sizes [27].
Advanced optimization techniques have enabled more sophisticated implementations. Zhai et al. (2023) implemented YOLOv3 and YOLOv3-tiny for vehicle detection and tracking on Zynq-7000 FPGAs, achieving significant model-size reduction (up to 98.2%) through dynamic-threshold structured pruning and 16-bit fixed-point quantization, alongside hardware optimizations including memory interlayer multiplexing and Winograd algorithms; the YOLOv3-tiny model reached 91.65 fps at 12.51 W power consumption, demonstrating high cost efficiency [28]. Baczmanski et al. (2023) implemented the MultiTaskV3 detection-segmentation network on an AMD Xilinx Kria KV260 SoC FPGA, achieving over 97% mAP for detection and above 90% mIoU for segmentation while consuming approximately 5 W at 4.85 FPS [29]. Anupreetham et al. (2023) presented an end-to-end fully pipelined FPGA-based object-detection system accelerating SSD-MobileNet-V1, achieving a very high throughput of 2167 FPS with 2.13 ms latency while maintaining 22.8 mAP on an Intel Stratix 10 FPGA through a novel pipelined Non-Maximum Suppression (NMS) algorithm that eliminates sequential dependencies [30]. Specialized applications have emerged for military surveillance, with Vasavi et al. (2024) achieving 93% accuracy for tank and APC detection (75.4% AP for tanks, 83.0% AP for APCs) using Mask R-CNN with a ResNet50+FPN backbone on ZYBO Z7-10 ZYNQ-7000 FPGA platforms [31]. Advanced optimization strategies demonstrate remarkable performance improvements, with Jeyalakshmi et al. (2025) achieving a 1600× speedup over a software implementation using VGG16 with quantization-aware training and the FINN framework on Pynq-Z2 boards for obstacle-avoidance systems [32].
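For context on the NMS bottleneck that the pipelined design of Anupreetham et al. (2023) addresses, the following minimal C++ sketch shows conventional greedy NMS: each candidate box is compared against every box already kept, and that loop-carried dependency is what limits straightforward hardware pipelining. The Box structure and the 0.5 IoU threshold are illustrative assumptions, not details of the cited accelerator.

// Conventional greedy NMS sketch (illustrative baseline only; not the
// pipelined algorithm of [30]).
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

static float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
    float inter = iw * ih;
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

// Each candidate is checked against all previously kept boxes, creating the
// sequential dependency that pipelined NMS variants seek to remove.
std::vector<Box> greedyNms(std::vector<Box> boxes, float iouThresh = 0.5f) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (iou(cand, k) > iouThresh) { suppressed = true; break; }
        }
        if (!suppressed) kept.push_back(cand);
    }
    return kept;
}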
Recent implementations have focused on achieving higher accuracy while maintaining real-time performance. Guerrouj et al. (2023) explored YOLOv4 acceleration on Intel Arria 10 FPGAs for autonomous driving, focusing on a General Matrix Multiplication (GEMM) implementation and achieving competitive mAP on the KITTI (up to 89.40%) and Self Driving Car datasets with a 38 ms runtime on KITTI [33]. Ali et al. (2023) integrated YOLOv5-based object detection into an ADAS framework on a DE10-Nano board, achieving 55 FPS for single-channel processing [34]. Al Amin et al. (2024) developed an FPGA-based real-time object detection and classification system using YOLOv3-tiny on a Xilinx Kria KV260, achieving 15 FPS for HD video streaming with 99% accuracy while consuming only 3.5 W [35]. Power optimization techniques for FPGAs in autonomous vehicles have also achieved remarkable efficiency gains, with Kalaiselvi et al. (2025) reducing power consumption by 65.9% while maintaining 91.7% lane-detection accuracy, demonstrating the potential of dynamic voltage and frequency scaling for energy-efficient autonomous systems [36].
Multi-task learning systems now integrate multiple perception tasks on single FPGA platforms, with Tatar et al. (2024) implementing real-time multi-task deep neural networks on an MPSoC-FPGA that process five ADAS functions at 22.45 FPS with only 6.920 W power consumption [37]. The field has also witnessed significant developments in specialized vehicle-detection applications. Vaithianathan (2024) presented a methodology for real-time object detection and recognition in FPGA-based autonomous driving systems, integrating deep-learning models with FPGA hardware acceleration to achieve the low latency and high precision required for safe navigation [38]. Mani et al. (2024) developed a high-accuracy FPGA-based system specifically designed for emergency vehicle classification, achieving 99.87% accuracy with a ResNet50-MOP-CB network architecture and demonstrating the versatility of FPGA platforms for specialized vehicle-classification tasks critical to autonomous driving [39]. Advanced driver assistance systems with real-time image processing on custom Xilinx DPUs achieve 22.15 FPS with 57.76% segmentation mIoU while consuming only 7.19 W, showcasing the potential of multi-task learning on embedded platforms [40].
Efficient FPGA-based embedded vision platforms achieve 361.8 GOPS/W energy efficiency for mobile-robot applications, with Yang et al. (2024) demonstrating superior performance for autonomous mobile robots through accumulation-as-convolution packing techniques [41]. Innovative approaches incorporating alternative sensing modalities have also emerged. Izquierdo et al. (2024) introduced an acoustic pedestrian-detection system using MEMS acoustic arrays, in which the FPGA handles sensor acquisition while the processing algorithms detect pedestrians in real time in urban environments, demonstrating the expanding scope of FPGA applications beyond traditional vision-based detection [42]. Cambuim et al. (2022) developed an FPGA-based pedestrian-detection system specifically designed for collision prediction, emphasizing the safety-critical aspects of real-time detection in autonomous vehicles [43]. Advanced sensor-fusion approaches achieve 97% accuracy with a 0.421 ms prediction time using mmWave radar data processed on FPGA platforms, demonstrating the potential of multimodal perception systems [44].
2.4. Optimization Strategies
Optimization techniques for FPGA-based implementations encompass both model compression and hardware-oriented design methodologies, which are crucial for fitting large DNNs onto limited FPGA resources while maintaining real-time performance. Model compression techniques aim to reduce the computational and memory requirements of neural networks. Quantization converts floating-point weights and activations to lower-bit fixed-point representations (e.g., 1-bit, 8-bit, or 16-bit), significantly reducing the memory footprint and computational requirements with minimal accuracy degradation. For instance, Sim-YOLOv2 with 1-bit weights and 3–6-bit activations achieved a 31× model-size reduction with a 10.15% accuracy loss [45]. Pruning eliminates redundant connections or neurons (unstructured pruning) or entire filters/channels (structured pruning), thereby reducing the number of parameters and computations; structured pruning is preferred for hardware efficiency because it maintains regular computational patterns [45]. Fully connected layers, which typically contain numerous parameters and require frequent off-chip memory access, can be replaced with pooling layers or removed entirely to reduce memory bandwidth and improve inference speed [45]. Recent work by Emmanuel et al. (2024) on optimizing resource utilization and power efficiency in FPGA-accelerated YOLOv8 object detection achieves a 9.2% resource reduction with 9.342 W power consumption using Vivado High-Level Synthesis tools on the Xilinx ZYNQ-7 ZC706 platform, demonstrating continued advances in optimization techniques [46].
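As a concrete illustration of the quantization step described above, the following C++ sketch performs symmetric 8-bit post-training quantization of a weight tensor and the corresponding dequantization. The per-tensor scale rule (mapping the largest-magnitude weight to ±127) is a common convention assumed here for illustration, not a procedure taken from any cited work.

// Symmetric 8-bit post-training quantization sketch (illustrative only).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Choose a per-tensor scale so the largest-magnitude weight maps to +/-127.
float computeScale(const std::vector<float>& w) {
    float maxAbs = 0.0f;
    for (float v : w) maxAbs = std::max(maxAbs, std::fabs(v));
    return maxAbs > 0.0f ? maxAbs / 127.0f : 1.0f;
}

// Quantize: q = round(w / scale), clamped to the int8 range [-127, 127].
std::vector<int8_t> quantize(const std::vector<float>& w, float scale) {
    std::vector<int8_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        int v = static_cast<int>(std::lround(w[i] / scale));
        q[i] = static_cast<int8_t>(std::clamp(v, -127, 127));
    }
    return q;
}

// Dequantize: the real value is approximated as q * scale.
std::vector<float> dequantize(const std::vector<int8_t>& q, float scale) {
    std::vector<float> w(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) w[i] = q[i] * scale;
    return w;
}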
Hardware-oriented design methods focus on maximizing computational parallelism and efficient data flow within the FPGA architecture. Pipelining overlaps execution stages so that multiple data elements are processed concurrently, improving throughput. Line buffering manages and reuses on-chip memory for feature-map operations, reducing off-chip memory accesses. Loop unrolling, tiling, and reordering increase parallelism by processing multiple data elements or iterations simultaneously while optimizing memory access patterns and data reuse [45]. Fused-layer architectures combine the operations of adjacent layers, such as convolution and batch normalization, or multiple convolution layers, to reduce off-chip data transfers and intermediate memory storage. Fast convolution algorithms such as Winograd and FFT reduce the number of multiplications, significantly improving computational efficiency for convolutional layers.
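To show how pipelining, line buffering, and loop unrolling map onto an FPGA design flow, the following HLS-style C++ sketch implements a 3 × 3 convolution that reads each input pixel once, buffers two image rows on chip, and fully unrolls the multiply-accumulate loop. The frame dimensions, data types, and pragma placement are illustrative assumptions rather than a reproduction of any cited accelerator.

// HLS-style 3x3 convolution sketch with line buffering, pipelining, and
// loop unrolling (illustrative; dimensions and pragmas are placeholders).
#include <cstdint>

constexpr int WIDTH  = 640;   // assumed frame width
constexpr int HEIGHT = 480;   // assumed frame height

void conv3x3(const uint8_t in[HEIGHT][WIDTH],
             int16_t out[HEIGHT][WIDTH],
             const int8_t kernel[3][3]) {
    // Two on-chip line buffers hold the previous rows so each input pixel
    // is fetched from external memory only once.
    static uint8_t lineBuf[2][WIDTH];
    uint8_t window[3][3];  // 3x3 sliding window over the image

    for (int y = 0; y < HEIGHT; ++y) {
        for (int x = 0; x < WIDTH; ++x) {
#pragma HLS PIPELINE II=1  // target one processed pixel per clock cycle
            // Shift the window left and insert the new column taken from
            // the line buffers and the incoming pixel.
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 2; ++j)
                    window[i][j] = window[i][j + 1];
            window[0][2] = lineBuf[0][x];
            window[1][2] = lineBuf[1][x];
            window[2][2] = in[y][x];

            // Update the line buffers for the next row.
            lineBuf[0][x] = lineBuf[1][x];
            lineBuf[1][x] = in[y][x];

            // Fully unrolled multiply-accumulate over the 3x3 window.
            int16_t acc = 0;
            for (int i = 0; i < 3; ++i) {
#pragma HLS UNROLL
                for (int j = 0; j < 3; ++j) {
#pragma HLS UNROLL
                    acc += static_cast<int16_t>(window[i][j]) * kernel[i][j];
                }
            }
            // Write the result centered on (y-1, x-1); border pixels are
            // not produced in this simplified sketch.
            if (y >= 2 && x >= 2)
                out[y - 1][x - 1] = acc;
        }
    }
}

In a typical Vitis HLS flow, the PIPELINE directive targets an initiation interval of one pixel per cycle, and an ARRAY_PARTITION directive on the window and kernel arrays would usually be added so that all nine multiplications can be issued in parallel.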
2.6. Comparative Performance Insights
Figure 2 and Figure 3 display a comprehensive visualization of FPGA-based vehicle and pedestrian-detection performance across the different methodologies. The analysis reveals a significant performance evolution from classical to modern approaches. Modern deep-learning implementations demonstrate remarkable performance improvements, with the SSD-MobileNet-V1 implementation by Anupreetham et al. (2023) achieving an exceptional 2167 FPS through highly efficient hardware acceleration and a fully pipelined architecture [30]. The recent hazard-detection system by Zhou et al. (2025) achieves a high throughput of 125 FPS with 92.3% mAP while consuming only 35 W, representing a roughly 7× power reduction compared to a GPU implementation [25]. Similarly, the BNN road-marking detector by Takasaki et al. (2021) achieves approximately 185 FPS for smaller image sizes [22].
By comparison, the traditional HOG + SVM-based methods of Suleiman and Sze (2016) and Meus et al. (2017) achieve 60 FPS [16,17], while the optimized HOG implementation by He et al. (2024) now reaches 0.933 pixels per clock cycle with minimal resource consumption [20]. The pruned and quantized YOLOv3-tiny implementation by Zhai et al. (2023) demonstrates a commendable 91.65 FPS, showcasing the benefits of model compression and hardware optimizations [28]. Lower performance is observed on general-purpose embedded boards, such as the Raspberry Pi 3/Odroid C2 (5–7 FPS) used by Nazir et al. (2018) [18] and the unoptimized YOLOv3-tiny on the Ultra96 (3 FPS) by Kojima (2022) [27], emphasizing the importance of dedicated FPGA acceleration.
The latest implementations demonstrate continued advancement in performance metrics. The emergency vehicle classification system proposed by Mani et al. (2024) achieves an exceptional 99.87% accuracy, one of the highest accuracy rates reported for specialized vehicle-detection tasks [39]. The military vehicle classification system by Vasavi et al. (2024) achieves 93% accuracy with a specialized Mask R-CNN implementation [31], while mmWave radar integration by Mohan et al. (2025) achieves 97% accuracy with an ultra-low 0.421 ms prediction time [44]. Vaithianathan's (2024) framework demonstrates superior power efficiency, inference latency, and detection precision compared to conventional CPU and GPU implementations [38]. Power consumption analysis reveals significant improvements in energy efficiency across recent FPGA implementations. While the dedicated 45 nm SOI HOG + SVM ASIC by Suleiman and Sze (2016) already operated within a 69 mW power budget for a comparatively simple classical pipeline [16], modern FPGA-based deep-learning solutions demonstrate substantial energy-efficiency gains while executing far more complex models. Power optimization techniques proposed by Kalaiselvi et al. (2025) achieve a 65.9% power reduction (from 313.74 mW to 106.98 mW) while maintaining detection accuracy [36].
Multi-task systems achieve very high efficiency, with the implementation by Tatar et al. (2024) consuming only 6.920 W while processing five ADAS functions at 22.45 FPS [37]. The YOLOv3-tiny implementation by Al Amin et al. (2024) operates at a remarkably low 3.5 W [35], while MultiTaskV3 by Baczmanski et al. (2023) consumes approximately 5 W [29]. Even the high-performance pruned and quantized YOLOv3-tiny by Zhai et al. (2023) maintains a reasonable power consumption of 12.51 W despite achieving higher processing speeds [28].
Advanced embedded vision platforms achieve an exceptional energy efficiency of 361.8 GOPS/W for mobile-robot applications [41], while the transfer-learning implementation by Jeyalakshmi et al. (2025) demonstrates a very high 1600× speedup over a software implementation [32]. In terms of overall performance, FPGAs generally offer superior energy efficiency (GOPS/W) compared to GPUs, especially for inference tasks, owing to their customizability and support for optimized data precision. While GPUs often provide higher peak throughput (GFLOPS) for floating-point operations and possess richer development ecosystems, FPGAs achieve competitive throughput for specific tasks with significantly lower power consumption and latency, particularly in streaming-data applications where direct peripheral connections are beneficial [45].
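As a simple worked comparison of these efficiency figures, consider the hazard-detection system of Zhou et al. (2025): at its reported 125 FPS, the 35 W FPGA implementation corresponds to roughly 0.28 J per processed frame, whereas a 250 W GPU sustaining the same frame rate (an assumption made here purely for illustration) would consume about 2 J per frame.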
This trend highlights the inherent customizability of FPGAs, which enables highly optimized designs that achieve real-time performance with substantially reduced power footprints, making them well suited to power-constrained autonomous vehicle applications. The evolution from classical HOG + SVM approaches to modern deep-learning implementations demonstrates significant improvements in both processing speed and energy efficiency. While traditional methods provided a solid foundation at around 60 FPS, contemporary implementations utilizing model compression, hardware-specific optimizations, and advanced architectures have achieved remarkable throughput improvements, with some designs exceeding 2000 FPS. The combination of model compression strategies, such as quantization and pruning, with hardware-specific optimizations, including pipelining and memory management, continues to advance the state of the art in FPGA-based vehicle and pedestrian-detection systems.
However, the analysis reveals significant inconsistencies in performance metrics across studies. Some implementations report accuracy, others precision or mean Average Precision (mAP), while certain entries lack critical data such as power consumption, rendering direct comparisons challenging. For instance, latency values range widely due to differences in input image resolutions and FPGA architectures, while throughput metrics vary based on optimization techniques like pipelining or quantization. Power consumption data are frequently absent or reported under varying conditions, complicating energy efficiency assessments. These disparities underscore the need for a standardized evaluation framework to ensure fair and reproducible comparisons.
2.7. Cross-Task Analysis and Trends
The diversity of FPGA-accelerated vision tasks in autonomous driving—ranging from vehicle and pedestrian detection to traffic sign recognition (TSR), traffic light detection (TLD), and lane detection—presents both opportunities and challenges. This subsection synthesizes cross-task patterns, addressing the variability in algorithms, datasets, and hardware to extract design insights without proposing a validation framework.
Cross-task trends show that classical methods such as HOG-SVM excel in low-resource scenarios, achieving >95% accuracy at 30+ FPS (e.g., on INRIA and GTSRB) [48], but falter under appearance variability (occlusion, lighting). Deep learning (e.g., YOLO) offers greater robustness (98% mAP in TLD under occlusion) [21] at 5–10 W and 70–80% LUT utilization, with an energy efficiency of 5–7 GOPS/W on the KV260 versus 3–4 GOPS/W for classical methods [17].
Shared challenges include dataset fragmentation (e.g., GTSRB for TSR, TuSimple for lanes) [4] and real-time constraints (<10 ms, <5 W) [3]. Classical methods suit low-latency edge cases, while DL generalizes across tasks, favoring one-stage detectors (YOLO, SSD) [21] over two-stage detectors (Faster R-CNN) [13] because of their lower complexity. A case study integrating TSR and lane detection on a Xilinx Zynq, using YOLOv5 (98% accuracy on TT100K) [49] and SCNN for lanes [50], achieves 50 FPS at 4 W. This fusion improves occlusion handling but drops to 92% accuracy in low-contrast scenarios while using 75% of the LUTs, highlighting the need for standardized metrics [3]. Future designs should blend classical and DL methods, exploring sensor fusion [16] and explainable AI (XAI) for transparency [17], to guide cohesive FPGA systems.