Article

Design of a Hardware-Optimized High-Performance CNN Accelerator for Real-Time Object Detection Using YOLOv3 with Darknet-19 Architecture

Department of Electrical and Computer Engineering, California State University, Fresno, CA 93740, USA
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(6), 1264; https://doi.org/10.3390/electronics15061264
Submission received: 25 February 2026 / Revised: 13 March 2026 / Accepted: 14 March 2026 / Published: 18 March 2026

Abstract

This research proposes a novel hardware-optimized design to accelerate Convolutional Neural Networks (CNNs) using Verilog HDL. The design is developed specifically for the Darknet-19 network model, which serves as the backbone of the YOLOv3-tiny algorithm, a widely used framework for real-time object detection in dynamic environments. The CNN architecture was implemented in Verilog HDL and synthesized using Synopsys Design Compiler, with a focus on improving both object detection accuracy and hardware resource efficiency. The proposed design efficiently performs key CNN operations, including convolution, pooling, and activation, enabling faster real-time object detection than many existing methods. To improve performance, the hardware design incorporates parallel processing techniques that allow multiple computations to execute simultaneously, significantly reducing system latency and power consumption. The convolutional layers of the Darknet-19 architecture are efficiently mapped onto the hardware platform, ensuring optimized data storage and fast memory access, which further enhances processing speed and detection accuracy. An innovative feature of the design is a two-dimensional image preprocessing module that prepares input images before they are fed into the CNN. This preprocessing stage includes image resizing, brightness normalization, and color adjustment, which helps the CNN process visual data more effectively. After preprocessing, the images pass through several CNN layers: the convolutional layers extract key features from the images, while the pooling and activation layers refine these features to improve detection performance. Finally, the processed data is analyzed by the YOLOv3-tiny algorithm, which identifies and locates objects in the images with high precision.
Experimental results demonstrate that the proposed high-speed, resource-efficient hardware architecture is well suited for real-time object detection applications, particularly in highly dynamic and unpredictable environments.

1. Introduction

1.1. Background

Real-time object detection is a challenging problem in navigation and task execution for robotics and autonomous systems. Computational learning methods, including Support Vector Machines (SVMs) and Convolutional Neural Networks (CNNs), have become essential tools for pattern recognition based on observational datasets [1]. CNNs used for visual data analysis rely on layered architectures that detect edges, patterns, shapes, and objects. By automatically extracting relevant features from raw pixel data, CNNs improve performance in tasks such as classification, segmentation, tracking, and object detection [2]. The preprocessing phase plays an important role in improving model performance. Techniques such as image resizing, grayscale conversion, normalization, and data augmentation—including flipping and rotation—can enhance input consistency, improve signal robustness, and increase training efficiency [3]. Object detection involves identifying and localizing multiple objects within an image using bounding boxes, unlike simple classification tasks. Early approaches, such as R-CNN and Faster R-CNN, relied on region proposal methods followed by classification. Although these methods provided high accuracy, they also required significant computational resources. The YOLO (You Only Look Once) framework simplified this process by treating object detection as a regression problem, predicting bounding boxes and class labels in a single pass. This approach enables real-time detection and is particularly suitable for applications such as autonomous drones and surveillance systems [2]. YOLOv3-Tiny, a lightweight version of YOLOv3 that uses the Darknet-19 backbone, is designed for computational platforms with limited power and processing capacity. Compared with Darknet-53, Darknet-19 maintains reasonable detection accuracy while significantly reducing the computational cost, making it suitable for embedded systems and FPGA-based platforms [3]. 
Despite existing algorithmic optimizations, CNN models remain computationally intensive because convolution operations require repeated matrix multiplications. Deploying such models on CPUs or GPUs in embedded environments introduces challenges related to power consumption, processing latency, and heat dissipation, which limit their widespread use in time-critical edge applications [2]. Field-Programmable Gate Arrays (FPGAs) offer several advantages, including high parallelism, reduced latency, re-programmability, and lower power consumption. Compared with general-purpose GPU solutions, FPGAs enable customized data flow and efficient memory reuse, making them well-suited for real-time and energy-constrained environments. This work presents a hardware accelerator implemented on an FPGA and designed for potential ASIC implementation to support the YOLOv3 framework with the Darknet-19 backbone. The architecture integrates optimized hardware modules for convolution, pooling, ReLU activation, and preprocessing operations such as resizing and normalization to support real-time object detection. The proposed hardware architecture improves system response speed and power efficiency, enabling practical deployment in autonomous systems and industrial robotics that operate under strict weight and power constraints.

1.2. Motivation

CNNs have been widely used in computer vision and deep learning applications. They have proven highly effective in recognizing and classifying objects in images, making them valuable for real-time applications. However, CNN models require significant computational resources and power, which can lead to high energy consumption and slower processing speeds when the CNN subsystems are executed on conventional general-purpose hardware such as CPUs and GPUs. These challenges limit the practical deployment of CNNs, especially in portable or small embedded devices that require strict energy efficiency and fast response times. For example, mobile robotic systems often have size and weight limitations, while autonomous space vehicles operate under strict power consumption constraints. This research addresses these limitations by developing a specialized hardware accelerator using FPGAs, designed specifically for the Darknet-19 model used in the YOLOv3-tiny algorithm. Although several hardware accelerators for CNNs already exist, this work proposes an optimized implementation based on the Darknet-19 architecture, improving the efficiency of CNN operations and enhancing real-time object detection performance.

1.3. Literature Review

Several CNN accelerators have been proposed in the literature to improve the performance of deep learning inference on hardware platforms. Zhao et al. proposed an FPGA-based CNN accelerator using Dynamic Network Surgery (DNS) for pruning and 5-bit quantization to enable shift-based operations, eliminating multipliers. Their LUT-based architecture, implemented in Verilog for LeNet-5, achieves 33.6 GMACS at 1.758 W while maintaining 98.9% accuracy on MNIST [4]. Veena et al. presented a resource-optimized CNN hardware accelerator based on VGG16 for image processing on FPGAs. They reduced feature map sizes and used hardware-aware image compression to minimize memory and computation. Preprocessing in MATLAB ensures binary compatibility with Verilog, resulting in a compact, real-time solution for edge devices [5]. Karapurkar et al. designed an energy-efficient CNN accelerator with a row-stationary data flow and pipelined MAC-based processing elements. Using local buffers and scratchpads, the design reduces memory access energy and improves utilization. Implemented in SystemVerilog, it balances computational efficiency and low power for 2D convolution [6]. Crumley et al. implemented YOLOv2 on a ZYNQ PYNQ-Z1 FPGA using Vivado HLS to convert C++/Python models into Verilog. Their accelerator processes quantized data and runs on a system-on-chip (SoC), demonstrating scalable convolution performance on the COCO dataset and enabling the efficient deployment of CNNs on edge platforms [7]. Lee et al. introduced a reconfigurable AI accelerator for real-time seizure detection using EEG. The system, built on a RISC-V controller and CNN coprocessor, was implemented in SystemVerilog on a PYNQ-Z2 FPGA. It achieves 99.06% accuracy while consuming 0.108 W at 1 MHz, making it suitable for wearable biomedical devices [8]. Song et al. developed a lightweight CNN accelerator implemented in Verilog for Internet of Things (IoT) applications.
By simplifying MAC units and applying 8-bit quantization to weights and biases, they reduced both memory usage and power consumption. The design, tested on an FPGA using the MNIST dataset, achieved 95% accuracy while maintaining efficient resource utilization [9]. Thomas et al. improved CNN processing elements by incorporating Modified Booth Encoding and Wallace Tree Adders in a Winograd-based UniWiG architecture. Their Verilog implementation for small-kernel convolutions demonstrated improved performance and power efficiency [10]. Shen implemented MobileNet on a Xilinx ZU104 FPGA for gesture recognition, leveraging depthwise separable convolutions to reduce computational complexity. The hardware architecture includes 64 parallel processing units for convolution, pooling, and ReLU operations. The design achieved a 28.4× speed-up over CPU implementations and a 6.5× speed-up over GPU implementations while consuming 4.07 W of power, processing 1250 fps with an efficiency of 304.9 fps/W [11]. Several recent works have also explored the hardware acceleration of YOLO-based object detection models on FPGA platforms. In [12], Xiong et al. presented a hybrid ARM–FPGA architecture for YOLOv3-tiny, where the ARM processor handles system control while the FPGA performs computational tasks. Their implemented system achieved a performance of 28.99 GOP/s. In [13], Kim et al. proposed a resource-constrained FPGA design for YOLOv3 that utilized both DSPs and LUTs on a small FPGA board. However, the resulting performance reached only 76.75 frames per second with a relatively high power consumption of 2.203 W. In [14], Tsai et al. introduced a reconfigurable hardware system for YOLOv3 designed to support variable input image sizes. Their system was implemented on the Xilinx ZCU104 platform, and the accelerator layout was fabricated using TSMC 90 nm technology, achieving an energy efficiency of 159.65 GOPs/W. In [15], Ahmad et al. proposed a hardware–software co-design approach to accelerate YOLOv3-tiny using an FPGA-based system. In their design, convolution operations were partially offloaded to programmable logic while other parts of the detection pipeline were executed on the processor subsystem. Their work demonstrated the feasibility of accelerating YOLO inference on heterogeneous computing platforms; however, the design primarily focuses on FPGA implementation through hardware–software partitioning.

1.4. Contribution

Although existing studies demonstrate various approaches to accelerating convolutional neural networks, many focus on general CNN models or rely on high-level synthesis frameworks for FPGA deployment. In contrast, this work focuses specifically on the hardware implementation of the Darknet-19 backbone used in the YOLO detection framework, with an emphasis on efficient RTL-level design and ASIC implementation. The proposed architecture maps convolution operations onto a systolic array composed of parallel processing elements, enabling efficient multi-channel feature extraction while reducing computational latency. In addition, Booth multiplier-based arithmetic units are incorporated within the convolution modules to improve multiplication efficiency in hardware. Unlike many previous studies that primarily demonstrate FPGA implementations, such as those in [5,7], the proposed design is further validated through a complete ASIC design flow. This includes synthesis using Synopsys Design Compiler and physical implementation using Cadence Innovus with the 45 nm Nangate standard cell library. These design choices allow the proposed accelerator to achieve improved area and power efficiency, making it suitable for real-time object detection in embedded and edge computing systems.
The original YOLOv3 framework employs the Darknet-53 network as its backbone for feature extraction. However, Darknet-53 is relatively deep and computationally intensive, which makes direct hardware implementation challenging for resource-constrained embedded systems. In this work, a lightweight alternative based on the Darknet-19 architecture is adopted to reduce computational complexity and hardware resource requirements while maintaining the essential feature extraction capability required for object detection.
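As a concrete illustration of the Booth-recoded arithmetic mentioned in the contribution above, the following Python sketch models radix-4 (modified) Booth multiplication at the bit level. It is a behavioral model for intuition only, not the paper's Verilog implementation; the 8-bit default width is an illustrative assumption.

```python
def booth_radix4_multiply(a: int, b: int, width: int = 8) -> int:
    """Signed multiply via radix-4 (modified) Booth recoding of b.

    Behavioral model of the partial-product selection a hardware Booth
    multiplier performs: overlapping 3-bit groups of the multiplier are
    recoded into digits in {-2, -1, 0, 1, 2}, roughly halving the number
    of partial products compared with a plain shift-and-add multiplier.
    b must fit in `width` bits (two's complement); width must be even.
    """
    assert width % 2 == 0 and -(1 << (width - 1)) <= b < (1 << (width - 1))
    b_u = b & ((1 << width) - 1)       # two's-complement bit pattern of b
    product = 0
    prev = 0                           # implicit bit b[-1] = 0
    for i in range(0, width, 2):
        b0 = (b_u >> i) & 1
        b1 = (b_u >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1     # Booth digit: b[2i] + b[2i-1] - 2*b[2i+1]
        product += digit * a * (4 ** (i // 2))   # shifted partial product
        prev = b1
    return product
```

For example, `booth_radix4_multiply(3, -5)` recodes `-5` into the digit sequence (0, -1, -1) and accumulates the corresponding shifted partial products to return `-15`.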

2. Problem Formulation

2.1. Problem Statement

The widespread adoption of deep learning technologies has created an increasing demand for fast and power-efficient processing capabilities. Convolutional Neural Networks (CNNs) are powerful models widely used in artificial intelligence tasks such as object detection and image classification. However, their complex computational processes make them resource intensive, resulting in longer processing times and high energy consumption when executed on conventional hardware such as CPUs and GPUs. These performance requirements create significant challenges for deploying CNNs in real-time applications, particularly in embedded systems where computational resources and power availability are limited. Therefore, there is a critical need to develop specialized hardware accelerators optimized for the Darknet-19 model used in the YOLOv3-tiny algorithm. Such an optimized accelerator can improve real-time CNN performance by reducing data processing latency, minimizing chip area, and enhancing overall power efficiency.

2.2. Objectives

This research aims to design and implement an optimized hardware accelerator for convolutional neural networks (CNNs) using Field-Programmable Gate Arrays (FPGAs), specifically targeting the Darknet-19 model, which serves as the core architecture of the YOLOv3-tiny algorithm. The objective is to significantly enhance the speed and efficiency of CNN processing in order to achieve real-time object detection performance. This work focuses on optimizing key CNN operations, including convolution, pooling, and activation functions, by applying parallel processing techniques that reduce computational latency and minimize power consumption. In addition, the research emphasizes effective image preprocessing methods, including input data resizing, normalization, and color attribute adjustment, to ensure accurate and reliable data analysis within the CNN framework. The Darknet-19 CNN architecture is implemented using Verilog, and the synthesized design is further developed for physical implementation at the ASIC hardware level. This approach enables improved performance, efficiency, and suitability for real-time embedded and edge computing applications. This hardware accelerator design problem can be formulated as follows:
Problem 1.
Design a multi-objective optimized hardware accelerator for the Darknet-19 model in the YOLOv3-tiny algorithm that minimizes detection-system processing latency, power consumption, and silicon area at the ASIC level through efficient RTL-level design, including systolic-array mapping and the integration of Booth multiplier-based arithmetic units.

3. Convolutional Neural Network Model and Darknet-19 Structure

3.1. Convolution Layer

The convolution layer is one of the core components in CNNs, responsible for learning and extracting spatial features from input two-dimensional array signals such as images. It operates by applying small window filters, known as kernels, that slide across the input feature map to generate output feature maps capturing specific patterns such as edges, corners, and textures. Each filter moves across the spatial dimensions of the input and computes a dot product between the filter weights and the corresponding region of the input data, followed by a summation operation. This process is repeated across all input channels, and the results are accumulated to produce a single output pixel. Repeating this operation across the entire input produces the output feature map [10]. Given a convolution layer with kernel weights $K[n,i,j]$, a $W_k \times H_k$ kernel window, and an input feature array $X[m,x,y]$ with $M$ input channels, the output feature array $Y[n,x,y]$ can be calculated as
$$Y[n,x,y]=\sum_{i=-\lfloor W_k/2 \rfloor}^{\lfloor W_k/2 \rfloor}\;\sum_{j=-\lfloor H_k/2 \rfloor}^{\lfloor H_k/2 \rfloor}\;\sum_{m=0}^{M-1} X[m,x+i,y+j]\times K[n,i,j]$$
In color image signals, each pixel consists of three channels: red, green, and blue (RGB). The convolution operation is applied to each channel separately using either the same or different kernels. The partial results from each channel are then summed element-wise to produce the final feature map [11]. This multi-channel processing allows CNNs to capture both color information and spatial variations across the image. Mathematically, the convolution operation can be expressed as nested summations over the kernel dimensions and input channels. The resulting feature maps contain rich information that serves as the input for subsequent layers in the neural network.
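The nested-summation view of multi-channel convolution described above can be sketched directly in Python. Here the kernel is indexed per input channel as `k[n][m][i][j]` (an assumption for generality, since, as noted above, the per-channel kernels may be identical or distinct), and a "valid" convolution without padding is shown for brevity; this is an illustrative model, not the hardware data path.

```python
def conv2d(x, k):
    """Direct multi-channel 2-D convolution (nested lists, no padding).

    x: input feature map of shape [M][H][W]
    k: kernels of shape [N][M][Hk][Wk]
    Returns y of shape [N][H-Hk+1][W-Wk+1]: each output pixel is the
    dot product of a kernel window with the input, accumulated over
    all M input channels, mirroring the triple summation above.
    """
    M, H, W = len(x), len(x[0]), len(x[0][0])
    N, Hk, Wk = len(k), len(k[0][0]), len(k[0][0][0])
    Ho, Wo = H - Hk + 1, W - Wk + 1
    y = [[[0] * Wo for _ in range(Ho)] for _ in range(N)]
    for n in range(N):                     # output channels
        for r in range(Ho):                # output rows
            for c in range(Wo):            # output columns
                acc = 0
                for m in range(M):         # accumulate over input channels
                    for i in range(Hk):
                        for j in range(Wk):
                            acc += x[m][r + i][c + j] * k[n][m][i][j]
                y[n][r][c] = acc
    return y
```

For a single 3 × 3 input channel and one 2 × 2 all-ones kernel, each output pixel is simply the sum of a 2 × 2 input window.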

3.2. Pooling Layer

The pooling layer in a CNN plays a crucial role in reducing the spatial resolution of feature maps, which decreases computational and memory requirements while preserving important information in the intermediate representations. The two main types of pooling are max pooling and average pooling. In max pooling, the input feature map is divided into small, non-overlapping sub-regions, (e.g., 2 × 2 ), and the maximum value within each sub-region is selected to represent that region. This down-sampling approach reduces the size of the feature map and improves the network’s robustness to minor perturbations, such as small translations or distortions [10].
Average pooling, on the other hand, computes the mean value of all entries within each sub-region, effectively smoothing the feature map. Let $y$ denote the output; the average pooling operation can be written as
$$y=\frac{1}{l\times w}\sum_{i=1}^{l}\sum_{j=1}^{w}x_{i,j},$$
where $x_{i,j}$ is an element within the pre-defined pooling window, and $l$ and $w$ denote the height and width of the pooling window, respectively.
Both max pooling and average pooling help prevent overfitting and improve the generalization of the network. Since pooling does not involve any learnable parameters, it is computationally efficient. For example, applying a 2 × 2 max pooling operation with a stride of 2 to a 4 × 4 input feature map produces a 2 × 2 output. The resulting pooled feature maps are then passed to subsequent convolutional or fully connected layers for the extraction of higher-level features [10].
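The 2 × 2 pooling example above can be sketched as follows; `pool2d` is an illustrative helper operating on a single feature map, not part of the hardware design.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """2-D pooling over one feature map given as nested lists.

    mode="max" selects the maximum of each window (max pooling);
    any other mode computes the window mean (average pooling).
    """
    H, W = len(x), len(x[0])
    Ho, Wo = (H - size) // stride + 1, (W - size) // stride + 1
    out = []
    for r in range(Ho):
        row = []
        for c in range(Wo):
            win = [x[r * stride + i][c * stride + j]
                   for i in range(size) for j in range(size)]
            row.append(max(win) if mode == "max" else sum(win) / len(win))
        out.append(row)
    return out
```

Applying 2 × 2 max pooling with stride 2 to a 4 × 4 map yields the 2 × 2 output described in the text, with each entry the maximum of one non-overlapping window.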

3.3. Activation Layer

Activation functions introduce non-linearity into a neural network, enabling CNNs to learn complex relationships between inputs and outputs. By incorporating activation functions, networks can approximate nonlinear input-output mappings effectively. Among the various activation functions, the Rectified Linear Unit (ReLU) is the most widely used due to its simplicity and efficiency. ReLU sets all negative inputs to zero while leaving positive inputs unchanged and can be expressed as $f(x)=\max(0,x)$ [10]. This function facilitates faster training and mitigates the vanishing gradient problem, which is particularly beneficial in deep networks. Other activation functions, such as Sigmoid and SELU, are applied in specific scenarios, but ReLU remains the dominant choice because of its performance and ease of implementation in hardware [11]. In CNN accelerators, ReLU is typically implemented using simple logic gates, making it well-suited for FPGA and ASIC designs.
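Because ReLU only passes or zeroes a value, a hardware implementation reduces to inspecting the sign bit of a two's-complement word and gating the output accordingly; the following Python sketch models that behavior (the 16-bit default width is an illustrative assumption, not the paper's word size).

```python
def relu_fixed_point(x: int, width: int = 16) -> int:
    """ReLU on a two's-complement word of the given width.

    In hardware this requires no arithmetic: the sign bit alone
    decides whether the word is forwarded or replaced with zero.
    """
    word = x & ((1 << width) - 1)        # raw bit pattern
    sign = (word >> (width - 1)) & 1     # MSB = sign bit
    return 0 if sign else word
```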

3.4. Fully Connected Layer

The fully connected (FC) layer serves as the final stage in a CNN, where each neuron is connected to every neuron in the preceding layer. This dense connectivity allows the network to integrate local features extracted in earlier layers and make global decisions, such as classifying an image. Before reaching the FC layer, the multi-dimensional outputs from convolutional and pooling layers are flattened into a one-dimensional vector [16]. Each element in this vector is multiplied by a corresponding weight, summed with a bias term, and passed through an activation function to produce the final output. This process enables the network to learn complex interactions among high-level features. Although powerful, fully connected layers are computationally intensive and account for a large portion of model parameters, which is why lightweight CNN architectures often reduce their use or replace them with global average pooling [16]. Mathematically, the operation of a fully connected layer can be expressed as follows:
$$y[n]=\sum_{m=0}^{M-1}W[m,n]\times x[m]+b[n],$$
where $x[m]$ is the flattened input vector, $W[m,n]$ is the weight connecting input neuron $m$ to output neuron $n$, $b[n]$ is the bias term added to neuron $n$, and $y[n]$ is the output of neuron $n$.
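The fully connected layer equation above maps directly to a nested multiply-accumulate; a minimal Python sketch (illustrative, using the same index convention as the equation):

```python
def fully_connected(x, w, b):
    """y[n] = sum over m of W[m][n] * x[m], plus bias b[n].

    x: flattened input vector of length M
    w: weight matrix of shape [M][N]
    b: bias vector of length N
    """
    M, N = len(w), len(w[0])
    return [sum(w[m][n] * x[m] for m in range(M)) + b[n]
            for n in range(N)]
```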

3.5. Padding Layer

Padding is used in CNNs to control the spatial dimensions of output feature maps after convolution [17]. Without padding, the size of the feature map decreases with each convolutional layer, which can lead to the loss of edge information. Common padding strategies include zero-padding (also called “same” padding), which preserves the input size, and valid padding, which reduces it. Zero-padding adds rows and columns of zeros around the input, ensuring consistent spatial dimensions [18]. In FPGA implementations, padding is particularly important for data alignment, efficient memory access, and uniform processing across layers. It also helps maintain receptive field coverage at the edges and supports pipe-lined convolution operations in hardware. Efficient padding logic is often integrated into the data path to prevent performance bottlenecks during inference [18].
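Zero-padding as described above can be sketched as follows; `zero_pad` is an illustrative helper operating on one 2-D feature map, with `pad = k // 2` giving "same"-style padding for an odd kernel size `k`.

```python
def zero_pad(x, pad):
    """Surround a 2-D feature map with `pad` rows and columns of zeros,
    preserving edge information for subsequent convolutions."""
    W = len(x[0])
    padded = [[0] * (W + 2 * pad) for _ in range(pad)]   # top zero rows
    for row in x:
        padded.append([0] * pad + list(row) + [0] * pad)  # side zeros
    padded += [[0] * (W + 2 * pad) for _ in range(pad)]  # bottom zero rows
    return padded
```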

3.6. Darknet-19 Structure

Darknet-19 is a lightweight CNN architecture originally introduced as the backbone of YOLOv2. It comprises 19 convolutional layers and 5 max-pooling layers, utilizing small 1 × 1 and 3 × 3 filters, batch normalization, and ReLU activations. Its compact design provides a balance between accuracy and speed, making it well-suited for real-time applications and hardware deployment [19].
For FPGA-based accelerators, Darknet-19 is particularly advantageous due to its regular structure and predictable layer patterns, which simplify mapping onto hardware blocks. Compared to deeper networks such as Darknet-53, Darknet-19 has a smaller memory footprint, lower power consumption, and simpler logic design. These characteristics make it ideal for integrating CNNs with object detection models like YOLOv3 in embedded systems [7,19].

3.7. Batch Normalization

Batch normalization standardizes the activations in a network by scaling and shifting inputs to have zero mean and unit variance across mini-batches. This process improves training stability, speeds up convergence, and allows for deeper network architectures [20]. It is typically applied after convolutional or fully connected layers and before activation functions.
During inference, batch normalization can be pre-computed and incorporated into the preceding layer’s weights, eliminating runtime overhead. This approach is particularly important for FPGA-based systems, where computational resources and memory are limited. The mean and variance values are calculated during training and embedded directly into the inference logic, ensuring both efficient and accurate performance [20]. The normalization processing can be written as follows:
$$\hat{x}=\frac{x-\mu}{\sqrt{\sigma^{2}+\epsilon}},$$
where $\mu$ and $\sigma^{2}$ are the mean and variance computed over the mini-batch, and $\epsilon$ is a small constant added for numerical stability.
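The inference-time folding of batch normalization into the preceding layer's weights, as described above, can be sketched per output channel; the scalar (per-channel) parameters here are an illustrative simplification.

```python
import math

def fold_batchnorm(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold inference-time batch norm into the preceding layer.

    With s = sqrt(var + eps), the pipeline
        y = gamma * ((w*x + b) - mu) / s + beta
    collapses to a single affine layer
        y = (gamma*w/s) * x + (gamma*(b - mu)/s + beta),
    so no normalization arithmetic remains at inference time.
    Returns the folded (weight, bias) pair for one channel.
    """
    s = math.sqrt(var + eps)
    return gamma * w / s, gamma * (b - mu) / s + beta
```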

3.8. YOLOv3 with Darknet-19

YOLOv3 is a single-stage object detection framework that simultaneously performs bounding box regression and object classification in a single forward pass. It makes predictions at three different spatial resolutions (13 × 13, 26 × 26, and 52 × 52), allowing it to detect objects of varying sizes effectively [21]. The use of anchor boxes and a multi-scale prediction strategy enhances YOLOv3’s robustness and efficiency for real-time detection. Although the original YOLOv3 architecture uses Darknet-53 as its backbone, it can be adapted to use Darknet-19, a more lightweight core. This substitution significantly reduces computational complexity, power consumption, and processing latency, which is particularly important for FPGA-based deployments [22,23,24]. Despite having fewer layers, Darknet-19 retains sufficient representational capacity, especially when combined with YOLOv3’s multi-scale detection mechanism and Non-Maximum Suppression (NMS) to filter overlapping bounding boxes.
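The Non-Maximum Suppression step mentioned above can be sketched as a greedy IoU-based filter; the overlap threshold of 0.5 is an illustrative assumption, not a value specified by the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, discard boxes that
    overlap it by IoU >= thresh, then repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```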

4. Proposed Hardware Architecture

Figure 1 illustrates the proposed high-performance CNN accelerator architecture, which integrates a hardware-based feature extractor using Darknet-19 with a YOLOv3-based software detection module. The processing pipeline begins with an input image, either captured in real time or pre-loaded, which undergoes preprocessing to meet the hardware’s input requirements. This preprocessing includes resizing to a standard resolution (e.g., 224 × 224), pixel normalization (e.g., scaling to [0, 1] or [−1, 1]), and potential format adjustments such as RGB-to-BGR conversion or fixed-point quantization. These operations reduce image complexity, ensure compatibility with the hardware modules, and optimize overall performance.
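The preprocessing steps above (resizing, normalization, and fixed-point quantization) can be sketched as follows. The nearest-neighbour interpolation and the unsigned Q-format with 8 fractional bits are illustrative assumptions, since the paper does not fix these choices.

```python
def preprocess(img, out_h=224, out_w=224, frac_bits=8):
    """Resize a grayscale image (nested lists of 0-255 pixels) via
    nearest-neighbour sampling, normalize each pixel to [0, 1], and
    quantize to unsigned fixed point with `frac_bits` fractional bits."""
    H, W = len(img), len(img[0])
    out = []
    for r in range(out_h):
        src_r = min(H - 1, r * H // out_h)      # nearest source row
        row = []
        for c in range(out_w):
            src_c = min(W - 1, c * W // out_w)  # nearest source column
            norm = img[src_r][src_c] / 255.0    # normalize to [0, 1]
            row.append(int(round(norm * ((1 << frac_bits) - 1))))
        out.append(row)
    return out
```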
The preprocessed image is stored in an input buffer, which feeds pixel data in manageable blocks to the core accelerator implemented on an FPGA or ASIC. A control and status unit coordinates all sub-modules, managing data flow, synchronization, and operation sequencing. Within the accelerator, the image passes through multiple convolution blocks based on the Darknet-19 architecture. Each convolution layer is followed by batch normalization, ReLU activation, and max pooling. These stacked operations extract hierarchical features, ranging from basic edges to complex object components, transforming the image into a set of high-level feature maps. The extracted features are then passed to the YOLOv3 detection module. YOLOv3 divides the feature map into grids and applies prediction logic to each cell, generating bounding boxes, objectness scores, and class probabilities using anchor boxes. Post-processing steps, including Non-Maximum Suppression (NMS), eliminate overlapping boxes and retain the most confident detections. The final output presents the input image annotated with bounding boxes and class labels, completing the real-time detection pipeline with high accuracy and efficiency.

5. Tools and Libraries Used

5.1. Xilinx Vivado

The Xilinx Vivado Design Suite is a comprehensive FPGA development environment used in this work for RTL simulation and schematic generation. It supports both Verilog and VHDL and provides a waveform viewer for verifying input–output behavior in critical modules, including the Booth multiplier [25,26] and CNN processing elements. Vivado’s RTL schematic feature allows visualization of the hardware structure. While Vivado also offers synthesis and IP integration capabilities, in this study, it is primarily used for functional verification prior to synthesis.

5.2. Synopsys Design Compiler

Synopsys Design Compiler was used for RTL synthesis, converting the Verilog code into a gate-level netlist using the Nangate 45 nm standard cell library. The tool performed optimizations such as logic minimization and pipelining while generating key performance metrics, including power consumption, hardware area, and timing. These metrics were essential for evaluating the hardware efficiency of the CNN accelerator. Design constraints, such as target clock frequency and area limits, were applied to guide the synthesis process, ensuring that the design was prepared for subsequent physical implementation and validation.

5.3. Cadence Innovus

Cadence Innovus was employed for the physical design stage, performing place-and-route (P&R) operations on the synthesized netlist. This process included floor planning, standard cell placement, clock tree synthesis, and routing. Post-layout verification, including timing analysis, Design Rule Checking (DRC), and Layout Versus Schematic (LVS) checks, was conducted to ensure design correctness. The tool also provided key metrics such as total wire length and congestion, finalizing the layout for fabrication or further backend processing.

5.4. MATLAB

MATLAB was used for algorithm validation, preprocessing input images, and verifying the outputs from the CNN accelerator. It facilitated image resizing, normalization, and simulation of convolution results, which were then compared with the hardware outputs. MATLAB also visualized feature maps and bounding boxes, ensuring the accuracy of the object detection logic prior to synthesis. This workflow verifies that the optimized hardware behaves consistently with the reference CNN model.

5.5. Nangate Open Cell Library

The 45 nm Nangate Open Cell Library was utilized during synthesis and physical design to map the RTL design onto standard cells. It contains logic gates, flip-flops, and buffers, along with detailed models for power, timing, and area. The library provides Liberty (.lib) files for synthesis and LEF/GDSII formats for layout, enabling accurate analysis of power, area, and timing. Its open source availability makes it well-suited for academic use and for validating complete ASIC design flows.

6. ASIC Implementation

The Application-Specific Integrated Circuit (ASIC) implementation is essential for evaluating the performance of the hardware design as if it were manufactured as a physical chip. In this stage, the design is first synthesized using Synopsys Design Compiler to generate gate-level netlists and estimate area, power, and timing. Cadence Innovus is then employed to carry out critical physical design steps, including standard cell placement, clock tree synthesis, and routing.

6.1. Functional Simulation

Functional simulation is a crucial step in the digital design process, used to verify that the RTL (Register Transfer Level) code behaves as intended before proceeding to synthesis or physical design. In this work, Xilinx Vivado was employed for functional simulation. Various test inputs were applied to CNN accelerator modules, including the Booth multiplier and convolution units, and the outputs were observed to ensure correctness.
Vivado’s integrated simulation environment supports waveform analysis, signal tracing, and logic debugging at the RTL level. This allows functional errors to be detected and corrected early in the design flow, ensuring that the Verilog code accurately reflects the intended hardware behavior. The waveform shown in Figure 2 represents the functional simulation output of the proposed CNN accelerator’s image padding module. The simulation confirms that the module operates correctly and that data flows through the padding unit as expected. In the waveform, the clock signal (denoted as clk) toggles at regular intervals, providing timing for all sequential operations in the design. The reset signal (denoted as reset) remains low (denoted as 0), indicating that the module is not in a reset state and is functioning normally. This enables internal processes such as data reading, pixel indexing, and padding to execute properly.

6.2. RTL Schematic

Figure 3 illustrates the internal architecture of the Convolution Layer 0 module, a key component of the proposed CNN accelerator, generated using Xilinx Vivado. This module performs the convolution operation by receiving input data (in_data[215:0]) and routing it to a systolic array block labeled sys_array_0. The systolic array processes the input in a structured and parallel manner, producing intermediate outputs across columns labeled c_0 through c_15. These outputs are then combined and stored in a 128-bit register block (r_c_reg[127:0]) for synchronization before being forwarded to the output port (c_o[127:0]). The design also incorporates control logic, including a resettable multiplexer (RTL_MUX) and a synchronous register (RTL_REG_SYNC), to manage data flow and ensure timing correctness. Clock (clk_i) and reset (rst_n) signals are distributed globally, providing proper sequencing and initialization. This schematic highlights the organized, parallel nature of the convolution layer, emphasizing both efficiency and high throughput in the CNN hardware accelerator.
The core of the Convolution Layer 0 module in the CNN accelerator is the systolic array block, which consists of 16 Processing Elements (PEs) labeled PE0 through PE15, arranged in a vertically cascading structure. Each PE performs the core convolution computations by receiving input data and weights, executing multiply-accumulate (MAC) operations, and forwarding results through interconnected data paths. Additional control logic, including a multiplexer (RTL_MUX), a synchronous register (RTL_REG_SYNC), and a validation unit (RTL_REDUCTION_OR), manages control signals and ensures that outputs are synchronized and flagged as valid once processing is complete. This highly structured design enables parallel processing, efficient data reuse, and high-speed computation, which are essential for accelerating convolution operations on hardware platforms such as FPGAs or ASICs.
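As a purely behavioural sketch of the cascaded MAC path described above (ignoring per-cycle pipelining and the valid-flag logic; the function names are illustrative, not taken from the RTL), the PE chain can be modelled as follows:

```python
def pe_mac(psum_in, pixel, weight):
    # one multiply-accumulate per PE per clock cycle
    return psum_in + pixel * weight

def systolic_chain(pixels, weights):
    # partial sums cascade vertically through the PE column;
    # each PE adds its product to the partial sum from the PE above
    psum = 0
    for px, w in zip(pixels, weights):
        psum = pe_mac(psum, px, w)
    return psum

# 16 inputs and 16 weights, matching the 16-PE column
print(systolic_chain(list(range(16)), [1] * 16))  # 120
```

In the actual hardware the 16 MACs operate concurrently on different data each cycle, which is what yields one result per clock once the pipeline is full.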
Figure 4 shows the RTL schematic of PE0, a single Processing Element within the systolic array (sys_array_0) of the conv_layer_0 module. PE0 contains three parallel convolution sub-modules, conv_r_channel, conv_g_channel, and conv_b_channel, responsible for processing the red, green, and blue input channels, respectively. Each convolution unit receives 8-bit pixel values (k_) and weights (w_), processes them, and generates intermediate results (r_o). These results are summed using two RTL_ADD blocks to combine the outputs of all three channels. The final result is temporarily stored in registers such as out_a_reg and out_c_reg, using RTL_REG_SYNC blocks to synchronize with the system clock (clk_i) and reset signal (rst_n). Multiplexers (RTL_MUX) are used to select between incoming and computed data paths. The combined output of PE0 is directed to the output ports out_a[215:0] and out_c[7:0], contributing to the generation of the overall feature map in the CNN accelerator. This modular architecture supports parallel, channel-wise processing, improving computational efficiency and making it suitable for high-speed image processing in hardware-based CNNs.
Figure 5 presents the internal RTL schematic of the convolution green channel block within PE0 of the systolic array. This module performs convolution specifically for the green channel of the input image. It receives kernel values (k[7:0]) and nine pixel weights (w[8:0]) as inputs and executes nine parallel multiplications using Booth multipliers, which are optimized for high-speed signed arithmetic. The 16-bit outputs from each multiplier are passed through a series of RTL_ADD blocks, forming a multi-stage adder tree that accumulates all partial products to produce the final convolution result. This result is stored in a 64-bit register (r_reg) via an RTL_REG_SYNC block, synchronizing it with the system clock and reset signals. Before being sent to the final output (r_o[7:0]), the data is processed through a ReLU activation function implemented in the relu_dut block, which clips all negative values to zero. This schematic highlights the modular, pipelined structure of the convolution operation, enabling fast and efficient hardware-level processing of image data.
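The multiply/adder-tree/ReLU datapath just described can be sketched functionally as follows. Booth multiplication is modelled here as plain signed multiplication, and all names are illustrative rather than taken from the RTL:

```python
def adder_tree(vals):
    # multi-stage adder tree: sum adjacent pairs until one value remains
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd element passes through to the next stage
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

def conv_channel(pixels, weights):
    # nine parallel multiplications for a 3x3 window (Booth multipliers in hardware)
    products = [p * w for p, w in zip(pixels, weights)]
    acc = adder_tree(products)
    return max(acc, 0)             # ReLU clips negative results to zero

print(conv_channel([1] * 9, list(range(1, 10))))  # 45
print(conv_channel([1] * 9, [-1] * 9))            # 0
```

The pairwise tree mirrors the hardware structure: nine products reduce through four adder stages, giving logarithmic depth instead of a serial accumulation chain.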

6.3. Gate-Level Synthesis

In the next stage of the design process, logic synthesis is carried out using Synopsys Design Compiler, which converts the RTL design into a gate-level representation called a netlist. This process maps the design onto standard logic gates from a specific technology library, in this case the 45 nm Nangate Open Cell Library, suitable for physical implementation. The primary objective of synthesis is to optimize the design in terms of area utilization, power efficiency, and timing performance.

6.3.1. Area Optimization

The area metrics were obtained from Synopsys Design Compiler after synthesizing the design using the 45 nm Nangate Open Cell Library. The design comprises a total of 14,745 cells, including 14,093 combinational and 632 sequential elements. A substantial number of buffer and inverter cells (3043) are included to support signal integrity and timing. The total cell area is approximately 20,375 μm², with the combinational area accounting for around 17,919 μm² and the sequential (non-combinational) area for approximately 2456 μm². The net interconnect and total physical area are reported as undefined, likely because zero wire-load modeling is used at this stage. These area metrics provide an estimate of the silicon footprint of the design prior to placement and routing.

6.3.2. Power Optimization

The power metrics were analyzed using Synopsys Design Compiler, providing a detailed breakdown of power consumption across the design components. The total dynamic power is reported as 8.6975 mW, comprising 4.7036 mW of internal power and 3.9939 mW of net switching power, indicating a near-even distribution. Additionally, the cell leakage power is 1.5332 mW, bringing the total power consumption to approximately 10.231 mW. Combinational logic dominates the power usage, consuming around 9.1 mW, which accounts for over 89% of the total power. Registers and other sequential elements contribute less, reflecting efficient pipelining and storage design. These power metrics are essential for assessing the energy efficiency of the CNN accelerator, particularly for low-power applications.

6.3.3. Timing Analysis

The timing analysis report from Synopsys Design Compiler confirms that the design meets the specified timing constraints. The report shows a data arrival time of 3.77 ns against a data required time of 3.78 ns, yielding a setup slack of 0.00 ns and a status of MET. The analysis accounts for a clock uncertainty of −0.20 ns and a library setup time of −0.02 ns. With zero slack, the circuit operates correctly at the target clock frequency, ensuring reliable performance of the CNN accelerator when synthesized for a 4.00 ns clock period.

6.4. Physical Implementation

The physical implementation was carried out through a series of steps, including floor planning, addition of global nets and power rings, special routing configurations, insertion of power stripes, placement of standard cells, clock tree synthesis, insertion of I/O filler cells, design rule check (DRC) verification, and early global and nano routing. Figure 6 illustrates the final chip layout, representing the complete physical implementation of the design after all backend stages, including logic synthesis, placement, clock tree synthesis, routing, and DRC checks. The layout integrates all critical design elements (standard cells, interconnects, power rails, and clock distribution), efficiently organized within the defined core area. Different metal layers and routing tracks are visually distinguished using color-coded patterns, highlighting the multi-layer routing used to manage signal, power, and ground connections effectively.
This layout was finalized using Cadence Innovus, ensuring compliance with 45 nm design rules while optimizing for minimal chip area, high-performance accuracy, and low power consumption. Additionally, power-saving techniques such as clock gating were implemented to reduce dynamic power, demonstrating the design’s suitability for energy-efficient VLSI applications.

7. Experimental Results and Performance Analysis

7.1. Experimental Setup

To evaluate the functionality of the proposed CNN accelerator, object detection experiments were conducted using representative real-world images containing multiple object types. Input images were first resized to a fixed resolution of 224 × 224 pixels to match the hardware input format of the CNN modules. The preprocessing stage also included pixel normalization and padding to ensure compatibility with the convolution hardware pipeline. These preprocessing steps were implemented and verified in MATLAB before being mapped to the hardware input stage. The accelerator executes convolution, activation, and pooling operations within the Darknet-19 feature extraction network. The YOLOv3 detection stage, including bounding box prediction and non-maximum suppression, was evaluated in MATLAB to validate the correctness of the feature maps extracted by the hardware. Detection results were visually verified by comparing the bounding boxes generated by the hardware-assisted pipeline with the expected output from the software reference model. Since the main focus of this work is the hardware design and implementation of the CNN accelerator, the experimental evaluation emphasizes hardware performance metrics such as silicon area, power consumption, and timing performance, obtained from synthesis and physical implementation using the 45 nm Nangate standard cell library. Metrics like mean Average Precision (mAP) and Intersection-over-Union (IoU) are determined by the YOLOv3 framework and were not the primary focus of this hardware evaluation.
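As an illustration of the MATLAB preprocessing stage described above, a minimal nearest-neighbour resize and pixel normalization for a single-channel image might look like the sketch below. The actual pipeline operates on RGB images, and all names and the interpolation choice here are assumptions for illustration only:

```python
def preprocess(img, out_h=224, out_w=224):
    # nearest-neighbour resize to the fixed 224 x 224 hardware input size
    in_h, in_w = len(img), len(img[0])
    resized = [[img[r * in_h // out_h][c * in_w // out_w]
                for c in range(out_w)]
               for r in range(out_h)]
    # normalize 8-bit pixel values to [0, 1]
    return [[p / 255.0 for p in row] for row in resized]

# synthetic 48 x 64 test frame of 8-bit pixels
frame = [[(3 * r + 5 * c) % 256 for c in range(64)] for r in range(48)]
out = preprocess(frame)
print(len(out), len(out[0]))  # 224 224
```

Padding, the remaining preprocessing step, is illustrated separately in Section 7.2.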

7.2. Generation of Padded Image

Figure 7 shows a visual representation of an image after padding, as processed by the CNN hardware accelerator. The padding adds extra boundary pixels to the original image, enabling proper convolution operations.
The CNN accelerator uses the padded image to generate feature maps, which are essential for tasks such as object detection and classification. Visible padding, color variations, and clearly defined boundaries highlight the regions of interest handled by the hardware.
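A behavioural model of the padding step, assuming simple zero padding of width 1 (the exact padding scheme and width are design parameters of the hardware module), can be written as:

```python
def zero_pad(img, pad=1):
    # add `pad` rows/columns of zero-valued boundary pixels so that
    # 3x3 convolution windows stay inside the image at the borders
    w = len(img[0])
    zrow = [0] * (w + 2 * pad)
    out = [list(zrow) for _ in range(pad)]
    out += [[0] * pad + list(row) + [0] * pad for row in img]
    out += [list(zrow) for _ in range(pad)]
    return out

print(zero_pad([[1, 2], [3, 4]]))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```

With a padding width of 1, a 3x3 convolution produces an output feature map with the same spatial dimensions as the input, which is the behavior the hardware padding unit enables.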

7.3. Verifying Functionality of CNN Accelerator

Figure 8 illustrates the verification of the CNN accelerator’s functionality for object detection, conducted using MATLAB. In this evaluation, the YOLOv3 algorithm with the Darknet-19 backbone was used to detect objects such as cars and persons in real-world images. The hardware output was validated by comparing the bounding boxes and class predictions generated by the accelerator with the expected results from MATLAB. As shown in the figure, the model accurately identifies and labels both the car and the person, with correctly positioned bounding boxes. This confirms that the CNN hardware accelerator produces reliable and accurate detection results, demonstrating its suitability for real-time object detection in embedded systems and advanced autonomous applications.

7.4. System Performance Metrics

In order to test the comprehensive performance of the proposed hardware design, the key performance metrics for the implemented detection system were evaluated as follows.
Since the clock period is T = 4 ns = 4 × 10⁻⁹ s, the clock frequency is f = 1/T = 1/(4 × 10⁻⁹) = 250 MHz. Thus, the implemented accelerator runs at f = 250 MHz. The input data is a 2-D array per frame, with pixels per frame PPF = 224 × 224 = 50,176 pixels/frame. Assuming one pixel is processed per clock cycle, the accelerator streams pixel data sequentially through the convolution pipeline, requiring n = 50,176 processing cycles per frame. Hence, the frame processing time T_p can be determined as
T_p = n / f = 50,176 / 250,000,000 = 2.007 × 10⁻⁴ s.
For the implemented real-time detection system, the detection throughput (FPS) can be determined as follows:
FPS = 1 / T_p = 1 / 0.0002007 ≈ 4982 frames/s.
Based on the post-synthesis timing analysis, the proposed accelerator runs with a clock period of 4 ns , corresponding to a frequency of 250 MHz . For an input resolution of 224 × 224 pixels, the convolution pipeline processes 50,176 pixels per frame. This yields an estimated processing throughput of approximately 4982 frames per second for the CNN feature extraction stage, demonstrating that the architecture can support real-time object detection applications.
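The throughput derivation above can be reproduced numerically, under the stated one-pixel-per-cycle assumption:

```python
clock_period_s = 4e-9                  # 4 ns target clock period from synthesis
f_hz = 1 / clock_period_s              # 250 MHz operating frequency
pixels_per_frame = 224 * 224           # 50,176 pixels per input frame
t_frame = pixels_per_frame / f_hz      # ~2.007e-4 s frame processing time
fps = 1 / t_frame                      # detection throughput
print(round(f_hz / 1e6), round(fps))   # 250 4982
```

Note that this figure covers the CNN feature extraction stage only; end-to-end detection rate also depends on the software-evaluated YOLOv3 stage.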
The performance of the hardware CNN accelerator can also be quantified in terms of GMACs (Giga Multiply-Accumulate operations). A single MAC operation consists of one multiplication followed by one accumulation. Each PE performs one MAC per clock cycle, and the system contains n_PE = 16 processing elements (PEs). The total MAC operation rate at a clock frequency of f = 250 MHz can therefore be determined as follows:
R_MAC = n_PE × f = 16 × 250 × 10⁶ = 4000 × 10⁶ MACs/s = 4 GMACs.
Thus, the peak compute throughput is approximately 4 GMACs .
Since the measured dynamic power consumption is p = 8.69 mW, GOPS/W (Giga Operations Per Second per Watt) can be calculated as follows:
GOPS/W = (Giga operations per second) / (power in W) = (R_MAC × 2) / p = 8 / 0.00869 ≈ 920 GOPS/W.
The proposed accelerator utilizes a systolic array architecture consisting of n P E = 16 PEs operating in parallel to execute convolution operations. Post-synthesis timing analysis confirms that the design meets the target clock period of 4 ns , corresponding to an operating frequency of approximately f = 250 MHz . Each PE performs one multiply–accumulate (MAC) operation per clock cycle, yielding a peak computational throughput of around 4 GMACs for the convolution process. Considering that each MAC comprises one multiplication and one addition operation, the equivalent computational performance reaches approximately 8 GOPS . With the measured dynamic power consumption of p = 8.69 mW , the accelerator achieves an estimated energy efficiency of approximately 920 GOPS / W , highlighting its suitability for low power, real-time embedded object detection applications.
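The compute-throughput and energy-efficiency figures follow directly from the PE count, clock frequency, and measured power, as the short calculation below shows:

```python
n_pe = 16                         # processing elements in the systolic array
f_hz = 250e6                      # operating frequency from timing analysis
gmacs = n_pe * f_hz / 1e9         # peak MAC throughput: 4 GMACs
gops = 2 * gmacs                  # each MAC = one multiply + one add: 8 GOPS
power_w = 8.69e-3                 # measured dynamic power in watts
gops_per_w = gops / power_w       # ~920 GOPS/W energy efficiency
print(gmacs, gops)                # 4.0 8.0
```

These are peak figures assuming full PE utilization every cycle; sustained throughput depends on how well the dataflow keeps all 16 PEs fed.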
The design also integrates small on-chip memory structures, including input buffers and intermediate registers within each PE, to temporarily store feature maps and partial computation results. This on-chip storage reduces reliance on external memory access, improving data locality and enhancing the overall efficiency of the convolution pipeline.

7.5. Power and Area of Processing Element

Based on Table 1, the comparison provides a comprehensive evaluation of area and power consumption among several CNN hardware accelerators, including the proposed design based on the Darknet-19 architecture. The table highlights three prior implementations: a cyclic array-based accelerator, a vector-wise CNN accelerator, and an energy-efficient design. The cyclic array-based accelerator occupies an area of 29,176 μm² and consumes 18.52 mW of power. The vector-wise accelerator requires a larger area of 46,172 μm² but has slightly lower power consumption at 15.98 mW. The energy-efficient design reports only its power usage, which is 9.43 mW, with no area specification provided.
As shown in Table 1, Figure 9, and Figure 10, the proposed CNN accelerator achieves the lowest silicon area of only 20,375 μm² and a power consumption of 8.69 mW, while delivering the highest computational throughput of 4 GMACs and approximately 4982 FPS. By leveraging 16 processing elements, a systolic array architecture, and ASIC-level design optimizations, the accelerator attains superior energy efficiency of 920 GOPS/W compared to existing FPGA-based YOLO/CNN accelerators. These experimental results clearly demonstrate the design’s high computational performance, energy efficiency, and scalability, making it well-suited for embedded and edge AI applications.

8. Conclusions

This work presents the design, implementation, and verification of a hardware-optimized CNN accelerator for real-time object detection, based on the DARKNET-19 architecture integrated within the YOLOv3 framework. The primary objective was to develop a compact and energy-efficient solution suitable for embedded systems and IoT devices. DARKNET-19, selected for its lightweight and high-performance characteristics, consists of 19 convolutional layers and five max-pooling layers, providing a robust foundation for extracting both low- and high-level image features. The full CNN pipeline—including convolution, batch normalization, ReLU activation, and pooling—was implemented in Verilog HDL and validated in MATLAB to ensure functional accuracy.
The hardware design underwent extensive simulation and synthesis using industry-standard tools. Module-level simulations for convolution layers, processing elements, and Booth multipliers were verified using Xilinx Vivado. RTL synthesis was performed with Synopsys Design Compiler using the 45 nm Nangate Open Cell Library to generate a gate-level netlist optimized for area, timing, and power. The final design achieved a compact silicon footprint of 20,375 μm² and consumed only 8.69 mW of dynamic power. Physical implementation was carried out using Cadence Innovus, including floor planning, power grid insertion, standard cell placement, clock tree synthesis, and routing. The final layout passed design rule checks with no violations, confirming a robust and manufacturable design.
End-to-end functionality was validated in MATLAB by running the complete YOLOv3 detection pipeline with the proposed accelerator. Results demonstrated accurate real-time object detection and precise bounding box localization, confirming the reliability and effectiveness of the hardware implementation. Compared to existing CNN accelerators, the proposed design achieved superior area and power efficiency while maintaining high computational performance. The use of Booth multipliers, parallelism, and a lightweight CNN architecture were key factors in this improvement. Future work will explore hardware optimization design for more complex YOLO models and the integration of dynamic power management to further enhance adaptability.

Author Contributions

Conceptualization, S.W., M.K. and N.W.; methodology, S.W., M.K. and N.W.; formal analysis, S.W., M.K. and N.W.; investigation, S.W., M.K. and N.W.; resources, S.W., M.K. and N.W.; data curation, S.W., M.K. and N.W.; writing—original draft preparation, S.W., M.K. and N.W.; writing—review and editing, S.W., M.K. and N.W.; visualization, S.W., M.K. and N.W.; supervision, S.W. and N.W.; project administration, S.W., M.K. and N.W.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by start-up funds from the lab of S.W. in the Lyles College of Engineering at California State University (CSU), Fresno.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Trinchero, R.; Manfredi, P.; Stievano, I.S.; Canavero, F.G. Machine learning for the performance assessment of high-speed links. IEEE Trans. Electromagn. Compat. 2018, 60, 1627–1634. [Google Scholar] [CrossRef]
  2. Gonzalez, R.C.; Woods, R.E. Digital Image Processing, 4th ed.; Pearson Education: Hong Kong, China, 2018; ISBN 978-1-292-22304-9. [Google Scholar]
  3. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  4. Zhao, M.; Li, X.; Zhu, S.; Zhou, L. A Method for Accelerating Convolutional Neural Networks Based on FPGA. In Proceedings of the 2019 4th International Conference on Communication and Information Systems (ICCIS), Wuhan, China, 19–21 December 2019; pp. 241–246. [Google Scholar] [CrossRef]
  5. Veena, M.B.; Deodurg, R.; Shrinidhi, V.; Soundarya, S. Design of Optimized CNN for Image Processing using Verilog. In Proceedings of the 2023 4th IEEE Global Conference for Advancement in Technology (GCAT), Bangalore, India, 6–8 October 2023; pp. 1–6. [Google Scholar] [CrossRef]
  6. Karapurkar, S.S.; Bramhane, L.K.; Rahulkar, A.D.; Veerakumar, T. Energy-Efficient Implementation of Processing Elements for CNN Hardware Accelerator. In Proceedings of the 2023 11th International Conference on Emerging Trends in Engineering & Technology-Signal and Information Processing (ICETET-SIP), Goa, India, 28–29 April 2023; pp. 1–6. [Google Scholar] [CrossRef]
  7. Crumley, D.; Hossain, M.; Martin, K.; Ivey, F.; Yarnell, R.; DeMara, R.F.; Bai, Y. Rehosting YOLOv2 Framework for Reconfigurable Fabric-Based Acceleration. In Proceedings of the 2022 IEEE SoutheastCon, Mobile, AL, USA, 26 March–3 April 2022; pp. 445–446. [Google Scholar]
  8. Lee, S.-Y.; Ku, M.-Y.; Pan, S.-Y.; Lin, C.-C. Reconfigurable and Scalable Artificial Intelligence Acceleration Hardware Architecture With RISC-V CNN Coprocessor for Real-Time Seizure Detection. IEEE Access 2025, 13, 31057–31068. [Google Scholar] [CrossRef]
  9. Song, Y.-S.; Lee, K.-Y. A Design of Lightweight Convolutional Neural Network Accelerator for IoT Devices. In Proceedings of the 2023 14th International Conference on Ubiquitous and Future Networks (ICUFN), Paris, France, 4–7 July 2023; pp. 474–477. [Google Scholar] [CrossRef]
  10. Shen, Y. Accelerating CNN on FPGA: An Implementation of MobileNet on FPGA. Master’s Thesis, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden, 2019. [Google Scholar]
  11. Kwon, H. Designing CNN Accelerators–Day 1; Lecture Notes; Synergy Lab, Georgia Institute of Technology: Atlanta, GA, USA, 2017; Available online: http://synergy.ece.gatech.edu (accessed on 8 January 2026).
  12. Xiong, Q.; Liao, C.; Yang, Z.; Gao, W. A method for accelerating YOLO by hybrid computing based on ARM and FPGA. In Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 22–24 December 2021; pp. 1–7. [Google Scholar]
  13. Kim, M.; Oh, K.; Cho, Y.; Seo, H.; Nguyen, X.T.; Lee, H.J. A low-latency FPGA accelerator for YOLOv3-tiny with flexible layerwise mapping and dataflow. IEEE Trans. Circuits Syst. I 2023, 71, 1158–1171. [Google Scholar] [CrossRef]
  14. Tsai, T.H.; Tung, N.C.; Chen, C.Y. An FPGA-based reconfigurable convolutional neural network accelerator for tiny YOLO-V3. Circuits Syst. Signal Process. 2025, 44, 3388–3409. [Google Scholar] [CrossRef]
  15. Ahmad, A.; Muhammad, A.P.; Ghulam, J.R. Accelerating tiny YOLOv3 using FPGA-based hardware/software co-design. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Virtual, 10–21 October 2020; IEEE: New York, NY, USA, 2020; pp. 1–5. [Google Scholar]
  16. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Amidie, M.A.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
  17. Wiranata, A.; Wibowo, S.A.; Patmasari, R.; Rahmania, R.; Mayasari, R. Investigation of Padding Schemes for Faster R-CNN on Vehicle Detection. In Proceedings of the 2018 International Conference on Control, Electronics, Renewable Energy and Communications (ICCEREC), Bandung, Indonesia, 5–7 December 2018; pp. 208–212. [Google Scholar]
  18. Xie, J.; Long, Z.; Song, Q.; Liu, Z.; Du, Y.; Wang, T. Visible-Light Insulator Defect Detection Based on Improved YOLOv3. In Proceedings of the 2023 3rd International Conference on Electrical Engineering and Mechatronics Technology (ICEEMT), Kunming, China, 21–23 July 2023; pp. 287–292. [Google Scholar]
  19. Varma, P. Hands-on Guide to Implement Batch Normalization in Deep Learning Models. Analytics India Magazine, 25 July 2020. [Google Scholar]
  20. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. Chakure, A. All You Need to Know About YOLO v3 (You Only Look Once). DEV, 1 March 2021. Available online: https://dev.to/afrozchakure/all-you-need-to-know-about-yolo-v3-you-only-look-once-e4m (accessed on 24 January 2026).
  23. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35. [Google Scholar]
  24. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar]
  25. Sulaiman, N.; Saad, N.; Yusof, R. High Speed Booth Encoder Multiplier Design for FPGA Implementation. In Proceedings of the 2010 International Conference on Computer and Communication Engineering (ICCCE), Kuala Lumpur, Malaysia, 11–12 May 2010; pp. 1–6. [Google Scholar]
  26. Guem, D.-H.; Kim, S. Variable Precision Multiplier for CNN Accelerators Based on Booth Algorithm. Int. J. Adv. Sci. Eng. Inf. Technol. (IJASEIT) 2023, 13, 1025–1030. [Google Scholar] [CrossRef]
  27. Kyriakos, A.; Papatheofanous, E.A.; Bezaitis, C.; Reisis, D. Resources and Power Efficient FPGA Accelerators for Real-Time Image Classification. J. Imaging 2022, 8, 114. [Google Scholar] [CrossRef] [PubMed]
  28. Lahari, P.L.; Poola, R.G.; Yellampalli, S.S. High Speed and Area Efficient FPGA Implementation of CNN Accelerator for Biomedical Applications. Res. Sq. 2023; preprint. [Google Scholar] [CrossRef]
Figure 1. Proposed hardware accelerator architecture.
Figure 2. Functional simulation output.
Figure 3. RTL schematic of internal architecture of convolution layer 0.
Figure 4. RTL schematic of processing element (PE).
Figure 5. RTL schematic of the convolution green channel block with PE.
Figure 6. Final chip layout.
Figure 7. Generated padded image.
Figure 8. YOLOv3 detection results.
Figure 9. Power comparison of different models.
Figure 10. Area comparison of different models.
Table 1. Complete performance comparison of CNN accelerators.
| Design | Hardware Level | Area (μm²) | Power (mW) | PEs | FPS | Clock Frequency (MHz) | GMACs | GOPS/W |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN accelerator, cyclic array [27] | FPGA | 29,176 | 18.52 | 8 | 3660 | 367 | 2.94 | 326 |
| Vector-wise CNN accelerator [28] | FPGA | 46,172 | 15.98 | 16 | 4300 | 215 | 3.45 | 431 |
| Energy-efficient CNN accelerator [6] | FPGA | - | 9.43 | 8 | 4510 | 452 | 3.62 | 767 |
| Proposed CNN accelerator | 45 nm ASIC | 20,375 | 8.69 | 16 | 4982 | 250 | 4 | 920 |

Share and Cite

MDPI and ACS Style

Wu, S.; Kunapareddy, M.; Wang, N. Design of a Hardware-Optimized High-Performance CNN Accelerator for Real-Time Object Detection Using YOLOv3 with Darknet-19 Architecture. Electronics 2026, 15, 1264. https://doi.org/10.3390/electronics15061264

