Optimizing Convolutional Operation and Dataflow in FPGA Acceleration of Bayesian Convolutional Neural Network

Wang, Shulei; Ling, Yun; Cai, Daolin; Zhang, Hao; Liu, Mingxin; Cheng, Cheng; Ding, Qihang; Fu, Zhu; Zhao, Jiale; Zhou, Haoyu; Zhang, Junxin

doi:10.3390/electronics15122603

by Shulei Wang ¹, Yun Ling ¹,*, Daolin Cai ²,*, Hao Zhang ³,*, Mingxin Liu ¹, Cheng Cheng ¹, Qihang Ding ¹, Zhu Fu ¹, Jiale Zhao ¹, Haoyu Zhou ³ and Junxin Zhang

¹ School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China

² Huada Semiconductor Co., Ltd., Shanghai 200131, China

³ Suzhou Laboratory, Suzhou 215131, China

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(12), 2603; https://doi.org/10.3390/electronics15122603 (registering DOI)

Submission received: 5 May 2026 / Revised: 3 June 2026 / Accepted: 8 June 2026 / Published: 12 June 2026

Abstract

A Bayesian convolutional neural network (BCNN) quantifies prediction uncertainty by introducing randomness into weights or activations, which is important for safety-critical applications such as medical diagnosis and autonomous driving. However, BCNN inference typically relies on Monte Carlo sampling with multiple forward passes, so its computation and energy consumption far exceed those of standard CNN hardware acceleration. FPGAs, with their parallel processing, reconfigurability, and high energy efficiency, are ideal platforms for dedicated BCNN accelerators. This paper designs and implements an FPGA acceleration method for BCNNs using high-level synthesis. First, convolution, pooling, and fully connected modules are individually optimized. Then, a mean/variance dual-path parallel expansion is adopted, combined with mixed-precision quantization and global scaling compensation, local reparameterization sampling, parameter reordering, and ping-pong buffering, achieving low resource usage and high energy efficiency while enabling uncertainty evaluation. Experimental results on Bayes VGG16 show resource utilization of 24,776 LUT, 23,378 FF, 115 BRAM, and 129 DSP, with total power of 2.049 W. Compared with an unoptimized Bayesian implementation, the proposed design reduces inference latency to about one-third, and its latency is only 17% higher than that of the classical VGG16. Compared with PC-based floating-point models, the accuracy loss on four BCNN models (tested on CIFAR-10) is within 1%. The predictive entropy effectively distinguishes normal, noisy, and out-of-distribution (OOD) samples, validating the uncertainty quantification capability of the BCNN FPGA accelerator.

Keywords:

Bayesian convolutional neural networks; FPGA acceleration; parallel expansion; uncertainty estimation

1. Introduction

In Bayesian neural networks (BNNs), probability distributions are introduced over the network weights, allowing for the quantification of predictive uncertainty. This capability is of particular importance in safety-critical domains, including medical diagnosis and autonomous driving, where models are required not only to deliver accurate predictions but also to evaluate the reliability of their outputs, thereby mitigating the risk of high-confidence misclassifications [1,2]. However, the repeated Monte Carlo sampling required by Bayesian inference leads to significantly higher computational complexity and energy consumption than conventional deterministic neural networks, posing a major challenge for deployment on resource- and power-constrained edge devices [3]. Field-programmable gate arrays (FPGAs), with their parallel-processing capability, reconfigurability, and high energy efficiency, provide a promising hardware platform for accelerating uncertainty-aware Bayesian neural networks at the edge. Existing studies have shown that on-chip data organization [4], parallel computation [5], and the use of high-level synthesis (HLS) tools [6] can substantially improve development efficiency, enabling researchers to rapidly implement and verify complex convolutional neural network accelerator designs.

In the hardware acceleration of Bayesian neural networks, early studies mainly focused on reducing the overhead of random sampling. Representative work, such as VIBNN [7], optimized Gaussian random number generation and sampling operations in variational inference-based Bayesian networks, demonstrating that random sampling is one of the major bottlenecks in FPGA deployment. B2N2 [8], from the perspective of resource efficiency, replaced Gaussian sampling with hardware-friendly Bernoulli sampling, effectively reducing logic resource consumption. These studies laid an important foundation for the hardware implementation of Bayesian inference. However, in Bayesian convolutional neural networks (BCNNs), the distribution parameters associated with convolutional weights are much larger in scale, and the corresponding probabilistic computations introduce more intensive sampling and memory access pressure. Therefore, more efficient acceleration strategies are still required.

To address the high computational overhead of BCNN, several innovative hardware acceleration strategies have been proposed. One major approach exploits algorithm-level sparsity or determinism to skip redundant computations. For example, some studies analyze the absolute values and differences in input features across multiple forward passes, approximate small values as zeros to introduce sparsity, and design dedicated dataflows and accelerator architectures for sparse computation, achieving up to an 81.1% reduction in computation [9]. Another work proposed a massive neuron-skipping strategy, which intelligently skips neurons disabled by Dropout masks and uses information from the first inference pass to predict zero-valued neurons in subsequent passes, thereby avoiding unnecessary hardware computation and achieving 2.1× to 8.2× acceleration [10]. Another line of research focuses on system-level co-optimization. Some studies have proposed automated frameworks to explore the trade-off between hardware performance and algorithmic accuracy under partial Bayesian inference, where uncertainty estimation is applied only to selected layers of the network, in order to identify optimal design points in the design space [11,12]. Although these methods improve the hardware efficiency of Bayesian inference from different perspectives, achieving a low-resource, high-energy-efficiency BCNN accelerator with complete uncertainty-aware inference capability on edge devices remains an open challenge.

This paper designs a Bayesian convolutional neural network (BCNN) FPGA accelerator for edge devices, which achieves low-power and efficient inference while possessing uncertainty quantification capabilities. The main contributions include:

First, a dual-path parallel architecture is adopted that separates the computation of the mean and the variance. Through local reparameterization, random sampling is moved to the output side, which reduces the number of Gaussian samples and eliminates the sampling overhead in the weight domain.

Second, parameter reordering, ping-pong buffering, and mixed-precision quantization are integrated with global scaling compensation (a combination not used in previous BCNN accelerators) to minimize off-chip memory access and fixed-point quantization error while maintaining the quality of uncertainty estimation.

Third, a complete FPGA implementation on the XC7Z020 device (manufacturer: AMD Xilinx; location: San Jose, CA, USA) achieves a low power consumption of about 2 W, reduces inference latency to about one-third of that of the unoptimized version, and realizes anomaly detection based on predictive entropy, providing an energy-efficient solution for uncertainty-aware inference on low-power edge platforms.

2. BCNN Accelerator Module Design

2.1. Design of Bayesian Convolution Module

Convolutional layers are the most crucial computational module in convolutional neural networks, accounting for the majority of computational and memory overhead in BCNN implementations. Unlike ordinary convolutions, the kernel weights in Bayesian convolutional layers are no longer fixed values, but rather a distribution composed of the mean and variance. Directly performing convolution calculations on each point of this distribution in hardware would require frequent calls to a Gaussian random number generator and repeated sampling during inference, significantly increasing resource consumption and latency. Therefore, reducing hardware resource consumption and latency while preserving Bayesian uncertainty has become a key challenge in convolutional kernel design.

In this work, the Bayesian module adopts dual-path computation for the mean and variance, followed by sampling at the layer output. This method uses the idea of local reparameterization to transfer the random sampling process from the weight domain to the activation domain. Specifically, the mean and variance parameters of the layer output are first obtained through two deterministic computation paths, and then Gaussian random noise is introduced at the output side to generate the output feature under the current sample. By avoiding random sampling of the weights, the number of random values required during inference is reduced, thereby lowering the overhead of random number generation and storage. Meanwhile, the main computation process remains in the form of regular convolutional multiply–accumulate operations, which facilitates pipelined and parallel implementation on FPGA.

Figure 1 shows the basic flow of the Bayesian convolution dual-path computation proposed in this paper. The input feature map has a size of $N_{if} \times N_{ix} \times N_{iy}$, where $N_{if}$ denotes the number of input channels, and $N_{ix}$ and $N_{iy}$ denote the width and height of the input feature map, respectively. The Bayesian convolution kernel consists of two sets of parameters: the mean kernel $μ$ and the variance kernel $σ^{2}$. Both have a size of $N_{if} \times N_{kx} \times N_{ky}$, where $N_{kx}$ and $N_{ky}$ denote the width and height of the convolution kernel, respectively. After the input feature map is convolved with the two sets of kernels, the output mean term $μ_{out}$ and variance term $σ_{out}^{2}$ are obtained. The output feature map has a size of $N_{of} \times N_{ox} \times N_{oy}$, where $N_{of}$ denotes the number of output channels, and $N_{ox}$ and $N_{oy}$ denote the width and height of the output feature map, respectively. Finally, Gaussian sampling noise is introduced at the end of the convolution output to obtain the final output of the Bayesian convolution, which can be expressed as follows:

$$ z = μ_{out} + \sqrt{σ_{out}^{2} + δ} \, ε \tag{1} $$

where $ε \sim N(0, I)$ denotes Gaussian noise and $δ$ is a correction term that prevents numerical instability caused by excessively small variance.
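As a minimal sketch of Equation (1) for a single output pixel, the following Python fragment computes the mean path and the variance path from one shared input window and then applies output-side sampling. The function names, the example window values, and the default `delta=1e-6` are illustrative assumptions, not values taken from the design; the variance path here multiplies squared inputs by the variance kernel, as described for the accelerator pipeline.

```python
import math
import random

def dual_path_window(pixels, mu, sigma2):
    """Return (mean_out, var_out) for one output pixel from a shared input window."""
    mean_out = 0.0
    var_out = 0.0
    for x, m, s2 in zip(pixels, mu, sigma2):
        mean_out += x * m          # mean path: ordinary multiply-accumulate
        var_out += (x * x) * s2    # variance path: squared input times weight variance
    return mean_out, var_out

def sample_output(mean_out, var_out, eps, delta=1e-6):
    """Eq. (1): z = mu_out + sqrt(sigma_out^2 + delta) * eps (delta value is assumed)."""
    return mean_out + math.sqrt(var_out + delta) * eps

# Hypothetical 3x3 window and kernels, flattened row by row.
pixels = [1.0, 2.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 2.0]
mu = [0.1] * 9
sigma2 = [0.04] * 9

mean_out, var_out = dual_path_window(pixels, mu, sigma2)
rng = random.Random(0)
z = sample_output(mean_out, var_out, rng.gauss(0.0, 1.0))  # one Gaussian draw per output
```

Only one Gaussian draw is needed per output value, whereas sampling in the weight domain would need one draw per kernel weight.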

The two convolutional paths share the same set of input windows and differ only in the storage areas for their weight parameters. Sharing the input path reduces on-chip cache redundancy and facilitates simultaneous computation and input/output of the mean and variance paths; on resource-constrained FPGAs, this approach is preferable to two completely independent convolutional arrays.

To improve the local reuse rate of feature maps and reduce external storage access overhead, this paper adopts a Line Buffer and Window Buffer design.

As shown in Figure 2, without an on-chip cache reuse mechanism, each convolution operation requires reading nine input pixels from outside the chip. With Line Buffer and Window Buffer, the convolution window only needs to continuously add a new pixel from outside the chip as it slides through space; the remaining pixels are directly reused from the on-chip cache. During the computation phase, off-chip memory access is reduced from a maximum of nine reads to only one new pixel read per slide, thus reducing DDR bandwidth and memory access pressure.
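The reuse pattern can be illustrated with a small software model, given here as a sketch rather than the HLS implementation. It streams a feature map in raster order, keeps the previous two rows in a line buffer and the current 3 × 3 neighborhood in a shift-register window, and counts one external read per pixel. The function name `stream_windows` and the test image are hypothetical.

```python
def stream_windows(image, k=3):
    """Yield-like helper: return all k x k windows while reading each pixel once."""
    height, width = len(image), len(image[0])
    line_buf = [[0] * width for _ in range(k - 1)]   # previous k-1 rows (on-chip)
    window = [[0] * k for _ in range(k)]             # k x k shift register (on-chip)
    reads = 0
    windows = []
    for r in range(height):
        for c in range(width):
            pixel = image[r][c]                      # the only external read this step
            reads += 1
            # New window column: older rows come from the line buffer, newest is the pixel.
            column = [line_buf[i][c] for i in range(k - 1)] + [pixel]
            for i in range(k):
                window[i] = window[i][1:] + [column[i]]   # shift left, append column
            # Move the line buffer up by one row at this column position.
            for i in range(k - 2):
                line_buf[i][c] = line_buf[i + 1][c]
            line_buf[k - 2][c] = pixel
            if r >= k - 1 and c >= k - 1:            # window fully inside the image
                windows.append([row[:] for row in window])
    return windows, reads

image = [[r * 10 + c for c in range(6)] for r in range(5)]
windows, reads = stream_windows(image)
naive_reads = 9 * len(windows)   # nine external reads per window without reuse
```

For this 5 × 6 example the model performs one read per pixel, while fetching every window independently would need nine reads per window.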

This design folds the multiple parameters and calculations of the batch normalization (BN) layer into one multiplication and one addition through a linear transformation at the algorithm level, thereby significantly reducing the amount of computation in the BN module during inference [13]. In the inference stage, the mean $μ$ and variance $σ^{2}$ in the BN layer are fixed constants after training, so the BN operation can be represented as a deterministic linear transformation of the form:

$$ y = γ \frac{x - μ}{\sqrt{σ^{2} + ϵ}} + β \tag{2} $$

During training, $γ$ and $β$ are learnable parameters, while $μ$ and $σ^{2}$ are estimated running statistics of the input features. $ϵ$ is a small constant used to prevent division by zero and improve numerical stability. By rearranging the above formula, it can be equivalently represented as a linear scaling and bias transformation:

$$ Scale_{fold} = \frac{γ}{\sqrt{σ^{2} + ϵ}} \tag{3} $$

$$ B_{fold} = β - μ \cdot \frac{γ}{\sqrt{σ^{2} + ϵ}} \tag{4} $$

During inference, $Scale_{fold}$ and $B_{fold}$ are fixed constants. After fusion, the convolution layer output only needs one multiplication and one bias addition, $y = Scale_{fold} \cdot x + B_{fold}$, to complete the normalization originally handled by the BN layer.
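The equivalence between Equation (2) and the folded form of Equations (3) and (4) can be checked numerically. The sketch below uses hypothetical parameter values and an assumed `eps=1e-5`; it is only meant to confirm the algebra, not to reproduce the fixed-point hardware path.

```python
import math

def bn_reference(x, gamma, beta, mean, var, eps=1e-5):
    """Eq. (2): full batch normalization at inference time."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def fold_bn(gamma, beta, mean, var, eps=1e-5):
    """Eqs. (3) and (4): precompute the folded scale and bias once, offline."""
    scale_fold = gamma / math.sqrt(var + eps)
    bias_fold = beta - mean * scale_fold
    return scale_fold, bias_fold

# Hypothetical trained BN parameters for one channel.
scale_fold, bias_fold = fold_bn(gamma=1.5, beta=0.2, mean=0.4, var=0.25)

conv_out = 0.9                                  # example convolution output
folded = scale_fold * conv_out + bias_fold      # one multiply and one add at run time
reference = bn_reference(conv_out, 1.5, 0.2, 0.4, 0.25)
```

The square root and division are paid once when the constants are prepared, so they never appear in the per-pixel datapath.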

A layer-fusion pipeline is adopted to implement the continuous computation of convolution, BN, activation, and pooling. From the perspective of memory access, a separated implementation would require multiple read and write operations between different modules. After layer fusion, the intermediate feature map results are processed through the on-chip pipeline, requiring only one read at the beginning and one write at the end.

To quantify the computational efficiency of the proposed design, a convolution baseline without dual-path parallel computation and the BN-fusion pipeline was also constructed for comparison. Both designs used the same convolution parameters, including a 3 × 3 kernel, a stride of 1, and an input feature map size of 32 × 32, and were synthesized on the same FPGA platform (XC7Z020) at 100 MHz using Xilinx Vivado HLS. The results are shown in Table 1. Compared with the baseline, the optimized convolution module reduces latency from 15.21 ms to 6.85 ms (54.9% reduction), while BRAM usage increases from 68 to 76 (+8 BRAMs, +11.8%). This BRAM increase is mainly due to the dual-path mean/variance computation and the folded BatchNorm parameters stored on-chip. Given that the XC7Z020 device provides 140 BRAMs in total, the additional consumption is acceptable, and the significant latency reduction justifies the trade-off for real-time edge inference. These results indicate that the dual-path parallel computation and BN-fusion pipeline can effectively reduce convolution latency without substantially compromising hardware resource efficiency.

All module-level experiments (convolution, pooling, fully connected) are synthesized and implemented at 100 MHz on the XC7Z020 device.

2.2. Design of Pooling Module

The pooling layer adopts a max-pooling module to downsample the feature map output by the Bayesian convolution layer. For the commonly used 2 × 2 max-pooling operation, only four input data values within a local window need to be compared. Unlike the multiply–accumulate operations in the convolution layer, max pooling does not involve multiplication or accumulation and can therefore be implemented without DSP resources. In this design, the pooling layer mainly uses a comparator-tree structure to select the maximum value within the local window. Specifically, the two input values in the same row are first compared, and then the comparison results of the two rows are further compared in the second stage to obtain the maximum value within the pooling window. This structure is simple in logic and has low latency, making it suitable for streaming processing after the convolution layer.
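A two-stage comparator tree for 2 × 2 max pooling with stride 2 can be sketched in a few lines. The helper names and the sample feature map below are hypothetical; the sketch mirrors the described structure (row-wise comparisons first, then a comparison of the two row results) and uses no multiplications.

```python
def max2(a, b):
    """One comparator: return the larger of two values."""
    return a if a >= b else b

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 using a two-stage comparator tree."""
    out = []
    for r in range(0, len(fmap) - 1, 2):
        row_out = []
        for c in range(0, len(fmap[0]) - 1, 2):
            top = max2(fmap[r][c], fmap[r][c + 1])             # stage 1, upper row
            bottom = max2(fmap[r + 1][c], fmap[r + 1][c + 1])  # stage 1, lower row
            row_out.append(max2(top, bottom))                  # stage 2
        out.append(row_out)
    return out

fmap = [
    [1, 5, 2, 0],
    [3, 4, 8, 7],
    [9, 1, 0, 2],
    [6, 2, 3, 4],
]
pooled = max_pool_2x2(fmap)
```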

In terms of input buffer design, the pooling module needs to read multiple data values within the local window at the same time. Therefore, an on-chip buffer partitioning method is used to organize the pooling input buffer, allowing the data within the 2 × 2 window to be read out in parallel and then sent to the comparator tree for maximum-value computation. This method distributes spatially adjacent feature-map rows across eight different basic BRAM blocks in a cyclic interleaving manner. As a result, the waiting time during the pooling stage can be reduced, ensuring that the data rate of the pooling module matches the output pipeline of the preceding convolution layer.

To quantify the advantages of the optimized pooling design in terms of hardware overhead and computational efficiency, a conventional pooling layer without buffer partitioning was constructed as the baseline for comparison. Both designs used exactly the same computational specifications, namely a 2 × 2 pooling window, a stride of 2, 16 channels, and a 32 × 32 input feature map. Both designs are synthesized and implemented at 100 MHz on the XC7Z020 device. Table 2 compares the resource utilization and latency of the two designs. With only a slight increase in logic resources, where LUTs and FFs increased by approximately 3–4%, and with zero DSP usage, the latency was reduced from 2.02 ms to 1.32 ms, corresponding to a reduction of approximately 35%. This result demonstrates the effectiveness of the buffer-partitioning strategy.

2.3. Design of the Bayesian Fully Connected Module

The fully connected layer (FC) is the layer used to output the classification result in a convolutional neural network. It is essentially a standard matrix–vector multiplication operation. Its main function is to flatten the high-dimensional spatial feature maps extracted by the preceding convolution and pooling modules into a one-dimensional vector and then map the high-dimensional features to the target output through a linear transformation using a weight matrix. The pseudocode is shown in Algorithm 1.

In BCNN, the weights of the fully connected layer are no longer deterministic parameters, but probability distributions described by their means and standard deviations. To reduce the hardware implementation complexity introduced by this characteristic, the randomness in the Bayesian fully connected layer is shifted from the weight side to the output side. As shown in Figure 3, the Bayesian fully connected layer is divided into a mean path, a variance path, and a local reparameterization sampling unit.

Algorithm 1: Fully connected layer pseudocode.

for (o = 0; o < Nof; o++)                         ⟶ Loop-2
    bias_buf(o) = bias(o) × Scale;
for (i = 0; i < Nif; i++)                         ⟶ Loop-1
    input_buf(i) = input(i) × Scale;
for (o = 0; o < Nof; o++) {                       ⟶ Loop-0
    out(o) = bias_buf(o);
    for (i = 0; i < Nif; i++) [PIPELINE II = 1]
        out(o) += weight(o × Nif + i) × input_buf(i);
    out(o) = out(o) / Scale;
}

In terms of computation organization, for each output neuron in the Bayesian fully connected layer, the elements of the input vector participate in multiply–accumulate operations sequentially along the input dimension. Unlike the standard fully connected layer, this process does not only produce a single output value, but simultaneously computes two types of statistics. The mean path generates the output mean according to the inner product between the input vector and the mean parameters, while the variance path generates the output variance according to the computation between the input vector and the variance-related parameters. Since the two paths share the same input vector, pipeline optimization can be applied to the inner accumulation loop in the same way as in a standard fully connected layer, allowing the multiply–accumulate operations to be executed sequentially in consecutive clock cycles. Based on the original optimization framework, the dual-path design extends the single-path accumulation process into a parallel computation process for the mean and variance statistics.

From the hardware implementation perspective, the mean path can reuse the conventional matrix–vector multiply–accumulate structure of a standard fully connected layer. On this basis, the variance path introduces additional input-square and variance-accumulation operations. Finally, the results of the two paths are combined in the sampling unit to generate the output. Since the output of the fully connected layer directly affects the final classification result and predictive entropy calculation, a higher-precision fixed-point format, ap_fixed<40,27>, is used in the variance accumulation and sampling output stages to ensure the computational accuracy on the FPGA.
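To illustrate why shifting randomness from the weight side to the output side preserves the output statistics, the following Monte Carlo sketch compares the two sampling schemes for one output neuron. It assumes independent Gaussian weights (the usual local reparameterization setting), and all names and numeric values are hypothetical.

```python
import math
import random

def weight_side_sample(x, mu, sigma2, rng):
    """Sample every weight, then take the inner product (one Gaussian per weight)."""
    return sum(xi * rng.gauss(m, math.sqrt(s2)) for xi, m, s2 in zip(x, mu, sigma2))

def output_side_sample(x, mu, sigma2, rng):
    """Two deterministic inner products, then a single Gaussian per output."""
    mean_out = sum(xi * m for xi, m in zip(x, mu))             # mean path
    var_out = sum(xi * xi * s2 for xi, s2 in zip(x, sigma2))   # variance path
    return mean_out + math.sqrt(var_out) * rng.gauss(0.0, 1.0)

def mean_and_var(samples):
    n = len(samples)
    mean = sum(samples) / n
    return mean, sum((s - mean) ** 2 for s in samples) / n

# Hypothetical input vector and weight distribution parameters.
x = [1.0, -2.0, 0.5, 3.0]
mu = [0.2, 0.1, -0.3, 0.05]
sigma2 = [0.01, 0.04, 0.09, 0.02]

rng = random.Random(1)
w_stats = mean_and_var([weight_side_sample(x, mu, sigma2, rng) for _ in range(20000)])
o_stats = mean_and_var([output_side_sample(x, mu, sigma2, rng) for _ in range(20000)])
```

For these values the analytic output mean is 0 and the analytic output variance is 0.3725; both empirical estimates should land close to them, while the output-side scheme draws one random number per output instead of four.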

For the Bayesian fully connected layer, a baseline without dual-path expansion is implemented at 100 MHz on XC7Z020. As shown in Table 3, the optimized design reduces latency from 5.12 ms to 2.17 ms (−57.6%), decreases BRAM from 125 to 104 (−16.8%), but increases DSP from 46 to 75 due to parallel mean/variance MAC operations.

3. Bayesian Neural Network Global Optimization Method

3.1. System Overall Architecture

This paper constructs a PS–PL collaborative design framework based on the Zynq-7000 platform. The overall architecture is shown in Figure 4. The PS side is responsible for input image preprocessing, Gaussian random noise generation, sampling scheduling, predictive entropy calculation, and threshold-based decision making, while the PL side mainly implements computing modules such as Bayesian Conv, MaxPool, and Bayesian FC. During system operation, the PS first writes the input image, weights, variance parameters, and noise samples into the external DDR. It then configures the parameters of the current network layer through the AXI control interface and starts the corresponding computation on the PL side. The PL reads the input feature maps, weight means, and variance parameters from DDR, performs the mean-path computation, variance-path computation, and output-side sampling, and writes the results back to the buffer or external DDR for subsequent network layers.

In the hardware implementation, this paper adopts a unified configurable module-reuse strategy to complete full-network inference. Instead of assigning an independent hardware module to each layer on the PL side, the same set of PL-computing modules is dynamically configured according to parameters such as the current layer type, the number of input and output channels, the feature map size, and the weight address. In this way, the modules sequentially execute convolution, pooling, fully connected, and residual-connection operations. This strategy avoids the resource consumption caused by layer-specific hardware customization and improves the reusability of the hardware modules. The scheduling flow is shown in Figure 5.

In Figure 5, the left part shows the inter-layer execution structure of the network, which consists of read, compute, and write stages. The middle part further abstracts the scheduling process within each layer. The right part presents the decision and scheduling logic for convolution, pooling, fully connected, and residual operations under the unified scheduling framework.

3.2. Parameter Reordering and Ping-Pong Buffering Optimization

During the hardware inference process of BCNN, the Bayesian convolution layer and the Bayesian fully connected layer need to process mean parameters and variance parameters simultaneously. Compared with conventional CNNs, BCNNs have a larger parameter scale and require more frequent memory accesses. If the parameters are still read according to the default storage order used during software training, problems such as frequent address jumps, low transfer efficiency, and computation units waiting for data may occur. Therefore, this paper optimizes the data transfer process from two aspects: parameter organization and data buffering. Specifically, parameter reordering [14] and a ping-pong buffering mechanism [15] are adopted to improve data transfer efficiency during dual-path computation.

In terms of parameter organization, the mean and variance parameters in the Bayesian layers are reordered. The original parameters are usually arranged according to the order exported from PC-based training, which is generally not favorable for continuous hardware access. In this paper, the mean and variance parameters required within the same output-channel block, input-channel block, and convolution-kernel window are reorganized into unified data blocks, so that the mean path and the variance path can continuously read the corresponding parameters during computation. After reordering, the size of each parameter block is $T_{of} \times T_{if} \times N_{kx} \times N_{ky}$, where $T_{of}$ and $T_{if}$ denote the sizes of the output-channel block and the input-channel block, respectively. Figure 6 shows the schematic diagram of dual-path parameter reordering.
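One possible block layout is sketched below. It assumes that the exported parameters are stored in output-channel, input-channel, kernel-tap order and that, within each tile, the mean block is placed directly before the matching variance block; the exact ordering used in the design is given by Figure 6, so this layout and the function name `reorder` are assumptions for illustration only.

```python
def reorder(mu, sigma2, n_of, n_if, n_k, t_of, t_if):
    """Reorder flat [of][if][k] parameter lists into one tiled stream.

    n_k is the number of kernel taps (Nkx * Nky); t_of and t_if are the tile sizes.
    """
    def idx(o, i, k):
        return (o * n_if + i) * n_k + k

    stream = []
    for ob in range(0, n_of, t_of):              # output-channel block
        for ib in range(0, n_if, t_if):          # input-channel block
            for params in (mu, sigma2):          # assumed: mean block, then variance block
                for o in range(ob, min(ob + t_of, n_of)):
                    for i in range(ib, min(ib + t_if, n_if)):
                        for k in range(n_k):
                            stream.append(params[idx(o, i, k)])
    return stream

# Hypothetical layer: 4 output channels, 4 input channels, 3x3 kernels, 2x2 channel tiles.
mu = list(range(4 * 4 * 9))
sigma2 = [1000 + v for v in mu]      # offset so variance entries are easy to recognize
stream = reorder(mu, sigma2, n_of=4, n_if=4, n_k=9, t_of=2, t_if=2)
block = 2 * 2 * 9                    # T_of x T_if x Nkx x Nky entries per block
```

With this layout, everything a tile needs sits in one contiguous address range, which is what allows long sequential reads from external memory.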

To reduce waiting overhead, this paper introduces a ping-pong buffering mechanism, in which two on-chip buffer groups operate alternately. When Buffer 1 provides input data for the current computation, Buffer 2 simultaneously loads the next batch of parameters or feature data from DDR. After the current computation is completed, the two buffers exchange their roles, thereby enabling parallel overlap among data loading, data processing, and result write-back. Figure 7 shows a comparison between the single-buffer scheme and the ping-pong buffering scheme. Compared with the single-buffer scheme, ping-pong buffering can effectively hide part of the data transfer latency, allowing loading, computation, and write-back to be interleaved in time, thereby reducing idle cycles in the computation array and improving overall execution efficiency.

Parameter reordering and ping-pong buffering work together to optimize the entire data transfer path. Parameter reordering ensures that the data required for the same computation can be read from DDR through continuous addresses. On this basis, ping-pong buffering further enables overlapping execution between off-chip memory access and on-chip computation, preventing the high-speed computation array from remaining idle while waiting for data loading.
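The benefit of overlapping loads with computation can be estimated with a simple timing model. The sketch below ignores write-back and assumes a fixed load time and compute time per tile, both hypothetical; it compares a single-buffer schedule with a two-buffer schedule in which a load may start only after the previous load has finished and the target buffer has been released.

```python
def single_buffer_cycles(n_tiles, load, compute):
    """Load and compute strictly alternate, so nothing overlaps."""
    return n_tiles * (load + compute)

def ping_pong_cycles(n_tiles, load, compute):
    """Closed form: first load, then the slower of load/compute dominates each step."""
    if n_tiles == 0:
        return 0
    return load + (n_tiles - 1) * max(load, compute) + compute

def ping_pong_simulated(n_tiles, load, compute):
    """Event-style model of two alternating buffers."""
    load_done, compute_done = [], []
    for t in range(n_tiles):
        prev_load = load_done[t - 1] if t >= 1 else 0
        buffer_free = compute_done[t - 2] if t >= 2 else 0   # same buffer as tile t-2
        load_done.append(max(prev_load, buffer_free) + load)
        prev_compute = compute_done[t - 1] if t >= 1 else 0
        compute_done.append(max(load_done[t], prev_compute) + compute)
    return compute_done[-1] if compute_done else 0

single = single_buffer_cycles(10, load=40, compute=100)
overlapped = ping_pong_cycles(10, load=40, compute=100)
```

For ten tiles with a 40-cycle load and a 100-cycle compute, the model gives 1400 cycles for the single buffer and 1040 cycles with ping-pong buffering: only the first load remains exposed.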

3.3. Fixed-Point Quantization and Global Scaling Design

In FPGA hardware deployment, directly using floating-point data for computation would significantly increase the consumption of on-chip resources such as DSP, which is unfavorable for low-power edge implementation. In this paper, the main computations in BCNN inference are represented using a fixed-point format of <16,7>. For the quantization design of BCNN, it is necessary to consider not only the multiply–accumulate operations in the mean path, but also the square, accumulation, and output-side sampling operations in the variance path. Therefore, if a unified low-bit-width fixed-point format is used for all computation stages, variance estimation errors may accumulate and further affect the final recognition results and predictive entropy calculation.

To preserve the input and output features as completely as possible without increasing the hardware data bit width, a global scaling mechanism is introduced into the hardware architecture and interfaces. The core idea is to introduce a constant scaling factor S, which amplifies the activation values at the input side, thereby reducing the proportion of the relative error caused by a fixed absolute error Δ.

To ensure that global scaling does not affect the correctness of the convolution output, the data paths in this design, including the weight path, strictly follow equivalent transformation logic. Specifically, the input feature is scaled as $X_{scaled}[i] = X[i] \times S$, while the convolution kernel weight remains unchanged, i.e., $W_{scaled}[i] = W[i]$. The scaled inner product is computed within the DSP array inside the FPGA:

$$ Y_{scaled} = \sum_{i} (X_{scaled}[i] \times W_{scaled}[i]) \tag{5} $$

Because the weights are unchanged, $Y_{scaled} = S \cdot Y$, and the unscaled output is recovered by dividing by $S$, as in the last step of Algorithm 1.
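The effect of the scaling factor on relative error can be illustrated with a small quantization experiment. The sketch assumes that <16,7> means 16 total bits with 7 integer bits, leaving 9 fractional bits, and that values are truncated to the grid; it quantizes only the inputs (weights and the accumulator are kept in floating point for simplicity), and the activation values, weights, and `scale=16.0` are hypothetical.

```python
import math

FRAC_BITS = 9  # assumed: <16,7> read as 16 total bits, 7 integer bits, 9 fractional bits

def quantize(value, frac_bits=FRAC_BITS):
    """Truncate to the fixed-point grid with step 2**-frac_bits (overflow not modeled)."""
    step = 2.0 ** -frac_bits
    return math.floor(value / step) * step

def dot_fixed(x, w, scale=1.0):
    """Scale inputs, quantize them, accumulate as in Eq. (5), then undo the scale."""
    acc = sum(quantize(xi * scale) * wi for xi, wi in zip(x, w))
    return acc / scale

x = [0.011, 0.023, 0.007, 0.015]   # small activations (hypothetical)
w = [0.5, -0.25, 0.75, 0.5]        # weights, left unscaled
exact = sum(xi * wi for xi, wi in zip(x, w))

err_fix = abs(dot_fixed(x, w) - exact) / abs(exact)
err_scaled = abs(dot_fixed(x, w, scale=16.0) - exact) / abs(exact)
```

With these values the relative error of the inner product drops from about 14% without scaling to below 1% with scaling, because the same absolute quantization step is applied to inputs that are sixteen times larger.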

To verify the effect of the global scaling compensation mechanism on fixed-point quantization error, this paper statistically analyzes the overall error propagation of different operator layers under two modes: fixed-point only (fix) and fixed-point with global scaling (scale). The results are shown in Table 4.

As shown in Table 4, compared with the fixed-point mode, the scale mode significantly reduces the average relative error across all operator layers. For example, the average relative error of the convolution layer is reduced from 28.43% to 3.21%, while that of the normalization layer is reduced from 29.05% to 3.38%. This indicates that the global scaling compensation mechanism can effectively suppress the propagation of fixed-point quantization errors throughout the complete inference process.

Considering that different computation paths in BCNN have different sensitivities to numerical precision, a mixed-precision quantization strategy [16] is adopted in this section. The mean path mainly performs conventional convolution and fully connected multiply–accumulate operations, and therefore uses a lower-bit-width fixed-point format. The variance path involves squaring, accumulation, and variance propagation before sampling; it has a wider numerical range and is more sensitive to quantization errors, so a higher-bit-width data format is adopted. The output of the fully connected layer and the sampling-related computations are close to the final classification result and predictive entropy calculation, and thus higher intermediate computation precision is also retained. The data type configurations of the key computation stages are listed in Table 5.

Global scaling compensation is primarily used to reduce the relative error of small values in low-bit-width fixed-point representations, while mixed-precision quantization is used to adapt to the different precision requirements of the mean path, variance path, and sampling calculations in BCNN.

3.4. Calculation of Prediction Entropy

During the inference stage, a BCNN approximates the Bayesian predictive distribution through multiple sampled forward passes. Let the total number of samples be T, and let the class probability obtained from the t-th sample be $p_{t}(y | x)$. The averaged predictive probability after multiple samples can then be expressed as:

$$ \bar{p}(y | x) = \frac{1}{T} \sum_{t = 1}^{T} p_{t}(y | x) \tag{6} $$

To measure the overall uncertainty of the model for an input sample, this paper adopts predictive entropy as the uncertainty metric, which is calculated as follows:

$$ H(\bar{p}) = - \sum_{c = 1}^{C} \bar{p}_{c} \log \bar{p}_{c} \tag{7} $$

where C denotes the number of classes, and $\bar{p}_{c}$ denotes the average predictive probability of the c-th class after multiple samples. When the predictive probability is concentrated on only a few classes, the predictive entropy is low, indicating that the model is relatively certain about the current input sample. In contrast, when the input sample is strongly corrupted by noise or belongs to an out-of-distribution sample, the results of multiple samples tend to become more dispersed, leading to a corresponding increase in predictive entropy [17].
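Equations (6) and (7) translate directly into a few lines of code. The sketch below assumes the natural logarithm (the base is not stated in the text) and treats terms with zero probability as contributing zero; the function name and the two example inputs are hypothetical.

```python
import math

def predictive_entropy(prob_samples):
    """prob_samples: T lists of C class probabilities, one per sampled forward pass."""
    t = len(prob_samples)
    n_classes = len(prob_samples[0])
    # Eq. (6): average the class probabilities over the T samples.
    p_bar = [sum(p[c] for p in prob_samples) / t for c in range(n_classes)]
    # Eq. (7): entropy of the averaged distribution (0 * log 0 treated as 0).
    entropy = -sum(p * math.log(p) for p in p_bar if p > 0.0)
    return p_bar, entropy

# All passes agree on class 0: low entropy.
consistent = [[1.0, 0.0, 0.0, 0.0]] * 3
# Each pass picks a different class: the averaged distribution is uniform.
dispersed = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
_, h_consistent = predictive_entropy(consistent)
p_bar_dispersed, h_dispersed = predictive_entropy(dispersed)
```

The dispersed case reaches the maximum possible entropy for four classes, log 4, which is the kind of separation a threshold-based decision on the PS side can exploit.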

3.5. Dataflow, Pipeline and Bottleneck Analysis

In order to present the hardware architecture details of the proposed accelerator more clearly, this section provides a detailed description from four aspects: data flow, storage hierarchy, pipeline, and bottleneck analysis. Figure 8 shows the overall architecture of the BCNN accelerator.

Data flow: The PS side writes the input image, mean/variance weight parameters, and pre-generated Gaussian noise into DDR. During inference, the PL side reads data through the AXI Master channels: the input feature map is sent to B-Conv through the Input Buffers; the mean/variance weights are continuously read in through the Weight Buffers (another set of ping-pong buffers); and the Gaussian noise is fed directly into the Reparam sampling unit. B-Conv contains a line buffer and a window buffer. The line buffer stores multiple rows of the feature map, while the window buffer is a 3 × 3 shift register that provides nine pixels to the computing array per clock cycle. After the array completes the mean/variance dual-path calculation, the results are sent to Reparam and combined with noise, and then pass through modules such as Uncertainty Quant, BN fusion, activation, pooling, and the fully connected layer. Finally, they are written back to DDR through the Output Buffers. Throughout the process, the intermediate feature maps are not written back to DDR but are passed directly along the on-chip pipeline.

Storage hierarchy and data layout: The design adopts a three-level storage structure: DDR (off-chip), ping-pong buffers (on-chip, hiding DDR latency), and line buffers (on-chip BRAM, supporting window sliding). This hierarchy ensures that the vast majority of data accesses are served on-chip, reducing off-chip memory access and power consumption. In addition, through parameter reordering, the mean and variance parameters required by the same computing block are stored contiguously in DDR at deployment time, which maximizes the AXI burst length and improves DDR read efficiency and AXI bus utilization.
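One way to picture the reordering is as a per-block interleaving of the mean and variance arrays. The sketch below assumes a flat parameter layout, a fixed `block_size`, and a parameter count that is a multiple of the block size; the names `reorder_parameters` and `burst_for_block` are hypothetical, and the exact layout used by the accelerator may differ.

```python
def reorder_parameters(mean_w, var_w, block_size):
    """Place each block's mean weights directly before its variance weights."""
    assert len(mean_w) == len(var_w)
    reordered = []
    for start in range(0, len(mean_w), block_size):
        reordered.extend(mean_w[start:start + block_size])
        reordered.extend(var_w[start:start + block_size])
    return reordered

def burst_for_block(block_index, block_size):
    """Start offset and length of the single burst that feeds one block.

    Assumes the parameter count is a multiple of block_size.
    """
    return 2 * block_size * block_index, 2 * block_size

mean_w = list(range(8))            # illustrative mean parameters
var_w = list(range(100, 108))      # illustrative variance parameters
layout = reorder_parameters(mean_w, var_w, block_size=4)
start, length = burst_for_block(1, block_size=4)
print(layout[start:start + length])
```

With separate mean and variance arrays, each block would need two shorter reads from two address ranges; after reordering, one read of twice the length covers both.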

Dual parallel pipeline: The computation uses a deep pipeline with parallel mean and variance paths, divided into three stages: loading, dual-path parallel computation, and output sampling. Double buffering is used between stages to achieve an initiation interval of II = 1, which means that once the pipeline is filled, a new convolution window calculation can start every clock cycle. During the loading stage, the ping-pong buffers simultaneously read the input feature map, the corresponding mean weight blocks, the variance weight blocks, and the Gaussian noise samples. The dual-path parallel computation stage evaluates the mean path and the variance path within the same clock cycle. The mean path multiplies the pixels within the window by their corresponding mean weights and accumulates the products to obtain the output mean. The variance path multiplies the squared pixels within the window by their corresponding variance weights and accumulates the products to obtain the output variance. These two paths run completely in parallel. The output sampling stage combines the calculated mean and variance with Gaussian noise, obtains the final output through local reparameterization, and passes it directly to the subsequent pipeline (BN fusion, activation) without writing back to DDR.
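The arithmetic of the computation and sampling stages can be written out for a single window. This is a floating-point sketch under the assumption that output sampling uses the standard form of local reparameterization (output = mean + sqrt(variance) × noise); the function name `bayes_conv_window` and the example values are hypothetical, and fixed-point effects are ignored.

```python
import math

def bayes_conv_window(pixels, mean_w, var_w, noise):
    """Return (mean, variance, sampled output) for one convolution window.

    pixels, mean_w, var_w: flattened window values and matching parameters.
    noise: one standard-normal sample, pre-generated as in the data flow above.
    """
    mean_acc = 0.0
    var_acc = 0.0
    # In hardware the two accumulations run side by side; here they share a loop.
    for x, mu, sigma2 in zip(pixels, mean_w, var_w):
        mean_acc += x * mu            # mean path
        var_acc += (x * x) * sigma2   # variance path uses squared pixels
    # Output-side sampling: one noise sample per output, not per weight.
    sample = mean_acc + math.sqrt(var_acc) * noise
    return mean_acc, var_acc, sample

mean, var, out = bayes_conv_window(
    pixels=[1.0, 2.0], mean_w=[0.5, 0.25], var_w=[0.04, 0.01], noise=1.0)
print(mean, var, out)
```

Sampling once per output rather than once per weight is what allows the noise to enter after the multiply-accumulate array instead of inside it.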

The theoretical peak bandwidth of the DDR3 interface is 2.1 GB/s, and the actual bandwidth requirement of this design is much lower than this peak, so memory bandwidth does not constitute a bottleneck. Meanwhile, thanks to parameter reordering and the ping-pong buffer mechanism, DDR access and on-chip computation overlap effectively, resulting in high bus utilization. Therefore, the overall performance of this design is limited by compute capacity rather than memory bandwidth. This is mainly because the design prioritizes low power consumption (a total of only 2.049 W) and resource conservation (only 53 DSP) over peak throughput. In resource-constrained edge deployment scenarios, this trade-off in favor of energy efficiency is reasonable.
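The effect of overlapping DDR access with computation can be illustrated with an idealized cycle-count model. The sketch assumes exactly two buffers, no bus contention, and per-tile load and compute times that are known in advance; the function names and the tile timings are made-up values for illustration, not measurements from the paper.

```python
def single_buffer_cycles(load, compute):
    """Load and compute strictly alternate, so the costs simply add up."""
    return sum(load) + sum(compute)

def ping_pong_cycles(load, compute):
    """load[i], compute[i]: cycles to fetch and to process tile i.

    While tile i is processed from one buffer, tile i+1 is fetched into the other.
    """
    total = load[0]  # the first fetch cannot be hidden
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total

load = [40, 40, 40, 40]         # hypothetical DDR fetch time per tile
compute = [100, 100, 100, 100]  # hypothetical compute time per tile
print(single_buffer_cycles(load, compute), ping_pong_cycles(load, compute))
```

When every fetch is shorter than the matching compute step, as in this example, only the first fetch remains visible, which corresponds to the compute-bound behaviour described above.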

4. Experimental Results and Analysis

4.1. Experimental Environment

The hardware experimental platform is based on the XC7Z020 device. The design was implemented using Vivado HLS, and synthesis and power evaluation were completed in the Vivado environment.

The software experimental platform was configured as follows: the CPU was an Intel Core i7-14700K with a base frequency of 3.4 GHz; the GPU was an NVIDIA GeForce RTX 2080 Ti; the memory capacity was 64 GB; the operating system was Ubuntu 20.04.6 LTS; and the software environment included Python 3.9.18, PyTorch 2.8.0, and CUDA 12.8.

The experiments selected four networks, namely Bayes LeNet, Bayes AlexNet, Bayes VGG16, and Bayes ResNet, as test cases. These networks cover shallow convolutional networks, deep convolutional networks, large-scale fully connected structures, and residual connection structures, and are used to verify the deployment effectiveness of the proposed accelerator under different network scales and architectures.

4.2. Results and Analysis

To further analyze the deployment performance of the proposed design on different computing platforms, the Bayes VGG16 network was selected for inference testing on CPU, GPU, and FPGA platforms using the CIFAR-10 dataset. The results are shown in Table 6.

As shown in Table 6, in terms of energy efficiency, the FPGA achieves 11.47 GOPS/W, about 46 times higher than the CPU (0.248 GOPS/W) and 65 times higher than the GPU (0.176 GOPS/W). The energy per image on the FPGA is only 115.8 mJ, far lower than the 5362.8 mJ of the CPU and the 7548.0 mJ of the GPU. Although the absolute GOPS of the FPGA is lower than that of the CPU and GPU, its greatly reduced power consumption makes it more suitable for power-limited edge deployment. In addition, in terms of total clock cycles required per image, the FPGA uses only 5.65 million cycles, which is 25.8× fewer than the CPU and 7.2× fewer than the GPU. This confirms that, despite its lower clock frequency, the proposed accelerator performs more work per cycle. These results show that the proposed BCNN FPGA accelerator has a significant energy advantage over general-purpose processors, which is key to realizing real-time uncertainty-aware inference at the edge.
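The derived figures quoted above follow directly from the frequency, throughput, latency, and power rows of Table 6. The short script below recomputes them as a consistency check; the dictionary layout and function name are choices made for this sketch, while the input numbers are taken from Table 6.

```python
TABLE6 = {
    "CPU":  {"freq_hz": 3.4e9,  "gops": 30.98, "latency_ms": 42.902, "power_w": 125.0},
    "GPU":  {"freq_hz": 1.35e9, "gops": 44.02, "latency_ms": 30.192, "power_w": 250.0},
    "FPGA": {"freq_hz": 100e6,  "gops": 23.51, "latency_ms": 56.526, "power_w": 2.049},
}

def derived_metrics(row):
    """Energy efficiency, energy per image, and clock cycles per image."""
    return {
        "gops_per_w": row["gops"] / row["power_w"],
        "energy_mj": row["power_w"] * row["latency_ms"],    # W x ms = mJ
        "mcycles": row["freq_hz"] * row["latency_ms"] * 1e-9,  # millions of cycles
    }

metrics = {name: derived_metrics(row) for name, row in TABLE6.items()}
for name, m in metrics.items():
    print(name, round(m["gops_per_w"], 3), round(m["energy_mj"], 1), round(m["mcycles"], 2))

fpga = metrics["FPGA"]
print("cycles ratio vs CPU:", round(metrics["CPU"]["mcycles"] / fpga["mcycles"], 1))
print("cycles ratio vs GPU:", round(metrics["GPU"]["mcycles"] / fpga["mcycles"], 1))
```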

To verify the consistency between FPGA and PC floating-point implementations, we repeated each experiment 10 times using different Gaussian noise seeds. The accuracy reported in Table 7 is the average ± standard deviation of these 10 runs. All reported accuracies are Top-1 on the CIFAR-10 test set for Bayes LeNet/AlexNet/VGG16/ResNet.

The inference accuracy loss of all four Bayesian networks on the FPGA is within 1%, indicating that the optimization strategies adopted in this paper, including global scaling compensation, mixed-precision quantization, parameter reordering, and ping-pong buffering, effectively preserve inference accuracy on the FPGA. The low standard deviation across the 10 runs (<0.3%) indicates stable inference performance.

We also compared the proposed design with existing FPGA accelerators for Bayesian neural networks. The comparison evaluates the different acceleration schemes on actual hardware platforms along four dimensions: resource utilization, throughput, power consumption, and energy efficiency. The results are shown in Table 8.

As shown in Table 8, our design operates at 100 MHz. Since different FPGA devices have different architectures and resource definitions, we focus on device-agnostic indicators for a fair comparison. In terms of throughput, our design (57.6 GOPS) outperforms BYNQNet [18] (24.2 GOPS) and HP-BNN [19] (1.92 GOPS) and is comparable to VIBNN [7] (59.6 GOPS). In terms of energy efficiency, our design achieves 55.38 GOPS/W, which is substantially higher than VIBNN (9.75 GOPS/W), BYNQNet (8.77 GOPS/W), and HP-BNN (3.15 GOPS/W). In terms of resource utilization, the FF and DSP usage of the proposed design is significantly lower than that of VIBNN and BYNQNet, and its LUT usage is lower than that of BYNQNet. Its power consumption (1.04 W) is lower than that of VIBNN (6.11 W) and BYNQNet (2.76 W) but moderately higher than that of HP-BNN (0.61 W). These results show that the proposed design maintains favorable throughput and energy efficiency even at a lower operating frequency.

In addition to resource utilization and power consumption, inference latency is also an important metric for evaluating accelerator performance. In this paper, two representative networks, Bayes VGG16 and Bayes ResNet, are selected to compare the inference latency of standard convolutional neural networks and Bayesian convolutional neural networks before and after optimization on the FPGA platform. The results are shown in Figure 9 and Figure 10.

As shown in Figure 9 and Figure 10, the optimized implementations of both networks achieve significantly lower inference latency than the unoptimized Bayesian implementations. This improvement mainly results from the combined effects of the proposed optimization strategies, including mean/variance dual-path parallel expansion, output-side sampling, parameter reordering, and ping-pong buffering. These strategies effectively reduce the latency introduced by weight computation, sampling, and data memory access, reducing the inference latency to approximately one-third of that of the unoptimized implementation.

To verify the ability of predictive entropy to distinguish abnormal inputs, this paper conducts experiments using the CIFAR-10 test set, the CIFAR-10 dataset with noise perturbation [20], and the out-of-distribution SVHN dataset. A total of 1500 samples are selected from each category, and the statistical results are shown in Table 9. As shown in the table, the mean predictive entropy of normal samples is only 0.142, while it increases to 1.359 for noisy samples and further increases to 2.214 for out-of-distribution (OOD) samples. This indicates that noise perturbations and out-of-distribution inputs significantly increase the uncertainty of model outputs. Based on this characteristic, predictive entropy can reflect the uncertainty differences among different input samples and can be used for edge-side anomaly detection.

4.3. Quantitative Evaluation of Uncertainty Quality

Predictive entropy, defined in Section 3.4, was used in Section 4.2 to demonstrate qualitatively that normal, noisy, and out-of-distribution (OOD) samples can be distinguished. To quantitatively evaluate the uncertainty estimation quality of the proposed BCNN accelerator, this paper uses two standard metrics: Expected Calibration Error (ECE) and Area Under the ROC Curve (AUROC).

ECE measures the consistency between the predicted probability and the actual accuracy of the model. For example, in a perfectly calibrated model, about 80% of the samples predicted with a probability of 0.8 should be correctly classified. ECE is calculated by dividing all test samples into 15 intervals, ordered from low to high, according to their prediction confidence (i.e., the maximum predicted probability). In each interval, the absolute difference between the average confidence of the samples and their actual classification accuracy is calculated. These absolute differences are then averaged, weighted by the fraction of all samples that falls into each interval, to obtain the ECE. The lower the ECE, the more reliable the uncertainty estimation of the model.
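The ECE procedure above can be sketched as follows. The sketch assumes equal-width confidence intervals over [0, 1], which is a common choice but is not stated explicitly in the text; the function name and the toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=15):
    """confidences: max predicted probability per sample, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    total = len(confidences)
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins (assumption); a confidence of 1.0 goes to the last bin.
        b = min(int(conf * num_bins), num_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += hit
        count[b] += 1
    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue  # empty intervals carry no weight
        gap = abs(conf_sum[b] / count[b] - hit_sum[b] / count[b])
        ece += (count[b] / total) * gap
    return ece

# Two occupied intervals: gaps of 0.4 and 0.3, each holding half the samples.
print(expected_calibration_error([0.9, 0.9, 0.3, 0.3], [1, 0, 0, 0]))
```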

AUROC measures the ability of the model to distinguish between in-distribution (ID) samples and out-of-distribution (OOD) samples. A model with good uncertainty estimation should produce high uncertainty (i.e., low prediction confidence) for OOD samples, so that abnormal inputs can be detected by setting a confidence threshold. AUROC is calculated by pooling the confidences of the ID and OOD samples, taking the negative confidence as the anomaly score, drawing the ROC curve, and computing the area under it. The higher the AUROC, the more sensitive the model's uncertainty is to abnormal inputs.
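The AUROC described above can be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen OOD sample receives a higher anomaly score than a randomly chosen ID sample, with ties counted as one half. The pairwise sketch below uses that equivalence; the function name and the toy confidences are hypothetical, and the quadratic loop is for clarity rather than speed.

```python
def auroc_ood(id_confidences, ood_confidences):
    """AUROC with OOD as the positive class and -confidence as the anomaly score."""
    id_scores = [-c for c in id_confidences]
    ood_scores = [-c for c in ood_confidences]
    wins = 0.0
    for s_ood in ood_scores:
        for s_id in id_scores:
            if s_ood > s_id:
                wins += 1.0   # OOD sample ranked as more anomalous
            elif s_ood == s_id:
                wins += 0.5   # ties count as half
    return wins / (len(id_scores) * len(ood_scores))

# One OOD sample (0.7) is more confident than one ID sample (0.6): 5 of 6 pairs are ordered correctly.
print(auroc_ood([0.9, 0.8, 0.6], [0.7, 0.5]))
```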

Experimental setup: We use the CIFAR-10 test set as the in-distribution dataset and the SVHN test set as the out-of-distribution dataset. For each input, t = 10 Monte Carlo forward passes are performed on both the PC floating-point BCNN and the FPGA fixed-point implementation, with different Gaussian noise for each pass. The class probabilities of the 10 outputs are averaged to obtain the final predictive distribution of each sample, and the two metrics above are computed from these distributions. The FPGA results come from actual runs on the hardware accelerator.

As shown in Table 10, the ECE of the FPGA implementation is 0.076, only 0.011 higher than that of the PC model, and its AUROC is 0.892, only 0.009 lower. These small differences show that the quantization and hardware optimizations in this paper cause almost no loss in the uncertainty estimation quality of the original BCNN. Therefore, the proposed FPGA accelerator can reliably output well-calibrated predictive probabilities while maintaining high energy efficiency, and can effectively detect out-of-distribution inputs.

4.4. Scalability Analysis

To evaluate the potential of the proposed BCNN accelerator for deeper Bayesian models, larger FPGA platforms, and multi-chip deployment, this subsection provides a scalability analysis from three perspectives: resource growth, memory bandwidth headroom, and extension strategies for extremely large models.

Based on the measured resource utilization of Bayes VGG16 on the XC7Z020 device (LUT: 24,776; DSP: 129; BRAM: 115; power: 2.049 W), we establish a resource scaling model. For convolutional layers, the number of DSP is approximately proportional to the number of input channels $C_{in}$ and the kernel size $K_{h} \times K_{w}$. Owing to the dual-path mean/variance design, a factor of 2 is introduced: $DSP \approx 2 \times K_{h} \times K_{w} \times C_{in}$. BRAM consumption mainly comes from line buffers, window buffers, and parameter storage, and scales roughly linearly with the number of output channels $C_{out}$ and feature map size. LUT/FF usage grows linearly with channel numbers. Taking Bayes ResNet152 as an example, its maximum convolutional layer channel number is 512 (same as the current VGG16), but it has many more layers; the total resource requirement is estimated to be 1.5–2× that of the current design. The XC7Z020 has only 220 DSP, of which 53 are used; therefore, it can still accommodate slightly larger models with up to 256 channels. For deep networks with 512 channels, a larger FPGA is required.

The current platform uses a 32-bit DDR3-533 interface with a theoretical peak bandwidth of 2.1 GB/s. Ping-pong buffering and parameter reordering significantly reduce off-chip memory access. For the current Bayes VGG16 model, the required bandwidth is much lower than the peak, leaving ample headroom. Even if the number of channels doubles, bandwidth demand is expected to increase by only 1.5–2 times, which is still within the peak capacity. Therefore, memory bandwidth is not a bottleneck. Only very large models would require techniques such as weight compression or wider memory interfaces.
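The quoted peak figure can be reproduced with a one-line calculation. The sketch assumes that "DDR3-533" denotes 533 million transfers per second on the 32-bit bus, a reading that is consistent with the stated 2.1 GB/s; the function name is hypothetical.

```python
def peak_bandwidth_gb_s(transfers_per_s, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_s * bytes_per_transfer / 1e9

# Assumption: 533 MT/s on a 32-bit interface.
peak = peak_bandwidth_gb_s(533e6, 32)
print(round(peak, 3))
```

The sustained bandwidth actually required by the design is not reported as a number in the text, so it is not included in this calculation.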

For ultra-large BCNNs that exceed the capacity of a single device (e.g., models targeting ImageNet), two potential extension strategies can be considered based on the proposed architecture. Off-chip parameter storage with on-demand loading is a natural extension: the current design already separates parameter loading from computation via ping-pong buffering. By increasing the external DDR capacity while keeping the same core data path, larger models could be accommodated, at the cost of increased latency due to more frequent off-chip accesses. This trade-off could be tuned by adjusting the ping-pong buffer depth. Model parallelism across multiple FPGAs is another possible direction: each FPGA could process a subset of layers (layer-wise parallelism) or a subset of output channels (channel-wise parallelism). Our modular design and parameter reordering scheme would likely require only minor modifications to support such partitioning. These extensions are beyond the scope of this paper but represent feasible directions for future work.

The proposed BCNN accelerator can efficiently run medium-sized and small-scale Bayesian models on the Zynq-7000 platform. For larger-scale models and devices, the paper discusses the resource expansion relationship and possible upgrade schemes, indicating that it has good scalability and practical application value.

5. Conclusions

This paper addresses the challenges faced by Bayesian convolutional neural networks during edge deployment, including the large computational cost of distribution-parameter estimation, high random-sampling overhead, and intensive memory access pressure. A BCNN-FPGA accelerator is designed to improve the efficiency of uncertainty-aware inference. By adopting optimization methods such as mean/variance dual-path parallel computation, local reparameterization-based output-side sampling, parameter reordering, ping-pong buffering, and scale-based compensation, the proposed design effectively reduces the sampling overhead, memory access cost, and quantization error during BCNN inference. The experimental results show that the total power consumption of the accelerator system is only 2.049 W. Compared with the unoptimized Bayesian inference implementation, the inference latency of the proposed design is reduced to approximately one-third. Compared with the PC-side floating-point model, the inference accuracy loss of the four BCNN models on the FPGA is controlled within 1%. In addition, the predictive entropy-based anomaly detection experiment demonstrates that the system can effectively distinguish normal samples, noisy samples, and out-of-distribution samples, verifying the feasibility of the proposed method for uncertainty-aware inference and abnormal sample detection at the edge.

Author Contributions

Software, M.L.; Formal analysis, Z.F.; Investigation, S.W.; Resources, J.Z. (Jiale Zhao); Data curation, Q.D. and J.Z. (Junxin Zhang); Writing—original draft, S.W.; Writing—review & editing, Y.L. and H.Z. (Haoyu Zhou); Supervision, Y.L., D.C. and C.C.; Project administration, Y.L., D.C. and H.Z. (Hao Zhang); Funding acquisition, H.Z. (Hao Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Frontier Technology Research Program of Jiangsu Province, grant number BF2024026. The APC was funded by Suzhou National Laboratory.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the author.

Conflicts of Interest

Author Daolin Cai was employed by the company Huada Semiconductor Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. Proc. Mach. Learn. Res. 2016, 48, 1050–1059.
2. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297.
3. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural network. Proc. Mach. Learn. Res. 2015, 37, 1613–1622.
4. Choi, J. Minimizing Global Buffer Access in a Deep Learning Accelerator Using a Local Register File with a Rearranged Computational Sequence. Sensors 2022, 22, 3095.
5. Doumet, M.; Stan, M.; Hall, M.; Betz, V. H2PIPE: High throughput CNN inference on FPGAs with high-bandwidth memory. In 2024 34th International Conference on Field-Programmable Logic and Applications (FPL); IEEE: New York, NY, USA, 2024; pp. 69–77.
6. Cong, J.; Lau, J.; Liu, G.; Neuendorffer, S.; Pan, P.; Vissers, K.; Zhang, Z. FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Trans. Reconfigurable Technol. Syst. 2022, 15, 51.1–51.42.
7. Cai, R.; Ren, A.; Liu, N.; Ding, C.; Wang, L.; Qian, X.; Pedram, M.; Wang, Y. VIBNN: Hardware acceleration of Bayesian neural networks. ACM SIGPLAN Not. 2018, 53, 476–488.
8. Awano, H.; Hashimoto, M. B2N2: Resource efficient Bayesian neural network accelerator using Bernoulli sampler on FPGA. Integration 2023, 89, 1–8.
9. Fujiwara, Y.; Takamaeda-Yamazaki, S. ASBNN: Acceleration of Bayesian convolutional neural networks by algorithm-hardware co-design. In 2021 IEEE 32nd International Conference on Application-Specific Systems, Architectures and Processors (ASAP); IEEE: New York, NY, USA, 2021; pp. 226–233.
10. Wan, Q.; Fu, X. Fast-BCNN: Massive neuron skipping in Bayesian convolutional neural networks. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); IEEE: New York, NY, USA, 2020; pp. 229–240.
11. Fan, H.; Ferianc, M.; Que, Z.; Liu, S.; Niu, X.; Rodrigues, M.R.D.; Luk, W. FPGA-based acceleration for Bayesian convolutional neural networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5343–5356.
12. Fan, H. Optimising Algorithm and Hardware for Deep Neural Networks on FPGAs; Imperial College London: London, UK, 2022.
13. Luo, Y.; Cai, X.; Qi, J.; Guo, D.; Che, W. FPGA–accelerated CNN for real-time plant disease identification. Comput. Electron. Agric. 2023, 207, 107715.
14. Park, J.; Choi, D.; Kim, H. RADAR: An Efficient FPGA-Based ResNet Accelerator with Data-Aware Reordering of Processing Sequences. J. Semicond. Technol. Sci. 2025, 25, 451–458.
15. Guan, X.; Wang, Z.; Fang, H. Design and Implementation of CNN Accelerator Based on FPGA. In 2024 43rd Chinese Control Conference (CCC); IEEE: New York, NY, USA, 2024; pp. 8969–8974.
16. Wang, J.; He, Z.; Zhao, H.; Liu, R. Low-bit mixed-precision quantization and acceleration of CNN for FPGA deployment. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 9, 2597–2617.
17. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 6402–6413.
18. Awano, H.; Hashimoto, M. BYNQNet: Bayesian Neural Network with Quadratic Activations for Sampling-Free Uncertainty Estimation on FPGA. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE); IEEE: New York, NY, USA, 2020.
19. Jia, X.; Zhang, B.; Wu, Z.; Yang, J.; Wang, X.; Zhao, W. High-Performance Bayesian Neural Network Inference Accelerator Based on FPGA. IEEE Trans. Circuits Syst. Artif. Intell. 2025, 1–11.
20. Hendrycks, D.; Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv 2019, arXiv:1903.12261.

Figure 1. Bayesian convolution dual-path computation.

Figure 2. Local data reuse based on line and window buffers.

Figure 3. Fully connected layer structure.

Figure 4. Overall architecture of the proposed BCNN FPGA accelerator.

Figure 5. Unified scheduling flow for layer-wise inference.

Figure 6. Comparison chart before and after parameter reordering.

Figure 7. Comparison between single-buffer and ping-pong buffering schemes.

Figure 8. Overview of the BCNN accelerator architecture.

Figure 9. Layer-wise latency comparison for Bayes VGG16.

Figure 10. Layer-wise latency comparison for Bayes ResNet.

Table 1. Resource consumption and latency of the convolutional layer: baseline vs. optimized design (operating frequency: 100 MHz).

Resource Utilization/Latency	Baseline	Our Work
BRAM	68	76
DSP	43	45
FF	16,753	16,842
LUT	22,147	22,693
Latency	15.21 ms	6.85 ms

Table 2. Resource consumption and latency of the pooling layer: baseline vs. optimized design (operating frequency: 100 MHz).

Resource Utilization/Latency	Baseline	Our Work
BRAM	124	132
DSP	17	0
FF	8271	9123
LUT	11,210	11,654
Latency	2.02 ms	1.32 ms

Table 3. Resource consumption and latency of the fully connected layer: baseline vs. optimized design (operating frequency: 100 MHz).

Resource Utilization/Latency	Baseline	Our Work
BRAM	125	104
DSP	46	75
FF	7508	8397
LUT	12,371	11,746
Latency	5.12 ms	2.17 ms

Table 4. Comparison of calculation errors between fixed-point quantization and global scaling optimization in different modules.

Module	Evaluation Criterion	Fixed-Point	Global Scaling
Conv	MRE	28.43%	3.21%
BatchNorm	MRE	29.05%	3.38%
LeakyReLU	MRE	22.01%	2.47%
MaxPool	MRE	16.52%	2.01%
FC	MRE	19.48%	2.18%

Table 5. Data type configuration for key computation stages.

Calculation Process	Data Type/Bit Width
Input feature map	Fixed point <16,7>
Mean path parameters	Fixed point <16,3>
Variance path parameters	Unsigned fixed point <40,27>
Bayesian conv layer	Fixed point <16,7>
Bayesian fully connected layer	Fixed point <40,27>
Output sampling	Floating-point 32

Table 6. Comparison of Bayes VGG16 network experimental results on different platforms.

Metric	CPU	GPU	Our Work
Hardware platform	Intel Core i7-14700K	GeForce RTX 2080 Ti	XC7Z020
Frequency	3.4 GHz	1.35 GHz	100 MHz
Software platform	PyTorch 2.8.0	PyTorch 2.8.0, CUDA 12.8	-
Model	Bayes VGG16	Bayes VGG16	Bayes VGG16
GOPS	30.98	44.02	23.51
Latency per image (ms)	42.902	30.192	56.526
Power (W)	125	250	2.049
GOPS/W	0.248	0.176	11.47
Energy per image (mJ)	5362.75	7548.00	115.822
Total clock cycles (M)	145.87	40.76	5.65

Table 7. Top-1 accuracy comparison between PC and FPGA inference.

Network	PC Accuracy	FPGA Accuracy	Accuracy Loss
Bayes LeNet	71.56% ± 0.18%	70.89% ± 0.22%	0.67%
Bayes AlexNet	74.48% ± 0.15%	73.71% ± 0.18%	0.77%
Bayes VGG16	86.90% ± 0.24%	85.99% ± 0.27%	0.91%
Bayes ResNet	88.02% ± 0.19%	87.05% ± 0.21%	0.97%

Table 8. Comparison with existing Bayesian neural network FPGA accelerators.

Metric	VIBNN	BYNQNet	HP-BNN	Our Work
FPGA platform	Cyclone V	XC7Z020	XC7Z020	XC7Z020
Frequency	200 MHz	200 MHz	100 MHz	100 MHz
LUT	-	37,102	16,863	16,724
ALM	98,006	-	-	-
FF	88,720	43,268	44,479	14,281
DSP	342	220	123	53
BRAM	558	220	50	117
GOPS	59.6	24.2	1.92	57.6
Power (W)	6.11	2.76	0.61	1.04
Energy efficiency (GOPS/W)	9.75	8.77	3.147	55.38

Table 9. Predictive entropy statistics for normal, noisy, and out-of-distribution samples.

Sample Type	Samples	Mean Entropy	Entropy Range
Normal	1500	0.142	0.012–0.568
Noisy	1500	1.359	0.684–2.119
Out-of-distribution	1500	2.214	1.512–2.936

Table 10. Uncertainty quality comparison.

Method	ECE	AUROC
PC BCNN	0.065	0.901
FPGA BCNN	0.076	0.892

