2.1. Design of Bayesian Convolution Module
Convolutional layers are the dominant computational module in convolutional neural networks and account for the majority of the computational and memory overhead in BCNN implementations. Unlike ordinary convolutions, the kernel weights in Bayesian convolutional layers are not fixed values but distributions, each parameterized by a mean and a variance. Sampling every weight from its distribution in hardware before convolving would require frequent calls to a Gaussian random number generator and repeated sampling during inference, significantly increasing resource consumption and latency. Reducing hardware resource consumption and latency while preserving Bayesian uncertainty is therefore a key challenge in the design of the convolution module.
In this work, the Bayesian module adopts dual-path computation for the mean and variance, followed by sampling at the layer output. This method uses the idea of local reparameterization to transfer the random sampling process from the weight domain to the activation domain. Specifically, the mean and variance parameters of the layer output are first obtained through two deterministic computation paths, and then Gaussian random noise is introduced at the output side to generate the output feature under the current sample. By avoiding random sampling of the weights, the number of random values required during inference is reduced, thereby lowering the overhead of random number generation and storage. Meanwhile, the main computation process remains in the form of regular convolutional multiply–accumulate operations, which facilitates pipelined and parallel implementation on FPGA.
Figure 1 shows the basic flow of the Bayesian convolution dual-path computation proposed in this paper. The input feature map has a size of $C_{in} \times W_{in} \times H_{in}$, where $C_{in}$ denotes the number of input channels, and $W_{in}$ and $H_{in}$ denote the width and height of the input feature map, respectively. The Bayesian convolution kernel consists of two sets of parameters: the mean kernel $\mu_w$ and the variance kernel $\sigma_w^2$. Both have a size of $C_{in} \times K_w \times K_h$, where $K_w$ and $K_h$ denote the width and height of the convolution kernel, respectively. After the input feature map is convolved with the two sets of kernels, the output mean term $\mu_{out}$ and variance term $\sigma_{out}^2$ are obtained. The output feature map has a size of $C_{out} \times W_{out} \times H_{out}$, where $C_{out}$ denotes the number of output channels, and $W_{out}$ and $H_{out}$ denote the width and height of the output feature map, respectively. Finally, Gaussian sampling noise is introduced at the end of the convolution output to obtain the final output of the Bayesian convolution, which can be expressed as follows:
$$Y = \mu_{out} + \sqrt{\sigma_{out}^2 + \epsilon} \odot \xi$$
where $\xi \sim \mathcal{N}(0, 1)$ denotes Gaussian noise and $\epsilon$ is a correction term that prevents numerical instability caused by excessively small variance.
The two convolutional paths share the same input windows and differ only in the storage areas for their weight parameters. Sharing the input path reduces on-chip cache redundancy and lets the mean and variance paths compute and stream data simultaneously; on a resource-constrained FPGA, this is more economical than two completely independent convolutional arrays.
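The dual-path computation with output-side sampling can be illustrated with a minimal software sketch for a single output channel (stride 1, no padding). The function name `bayes_conv2d` and its argument names are hypothetical. The sketch also assumes, following the usual local reparameterization formulation, that the variance path accumulates squared inputs times the variance kernel; it is not the hardware implementation.

```python
import math
import random

def bayes_conv2d(x, w_mu, w_var, eps=1e-8, rng=None):
    """Dual-path Bayesian convolution for one output channel (illustrative).

    x:     input feature map, indexed x[c][h][w]
    w_mu:  mean kernel, indexed w_mu[c][kh][kw]
    w_var: variance kernel (non-negative), same shape as w_mu
    Returns (mean, var, sample), each indexed [h][w].
    """
    rng = rng or random.Random(0)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    KH, KW = len(w_mu[0]), len(w_mu[0][0])
    OH, OW = H - KH + 1, W - KW + 1
    mean = [[0.0] * OW for _ in range(OH)]
    var = [[0.0] * OW for _ in range(OH)]
    sample = [[0.0] * OW for _ in range(OH)]
    for oh in range(OH):
        for ow in range(OW):
            m = v = 0.0
            for c in range(C):
                for kh in range(KH):
                    for kw in range(KW):
                        px = x[c][oh + kh][ow + kw]      # shared input window
                        m += px * w_mu[c][kh][kw]        # mean path
                        v += px * px * w_var[c][kh][kw]  # variance path (assumed form)
            mean[oh][ow] = m
            var[oh][ow] = v
            # One Gaussian draw per output element instead of one per weight.
            sample[oh][ow] = m + math.sqrt(v + eps) * rng.gauss(0.0, 1.0)
    return mean, var, sample
```

Both accumulators read the same pixel `px`, which mirrors the shared input window: only the weight storage differs between the two paths, and the random draw happens once per output element.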
To improve the local reuse rate of feature maps and reduce external storage access overhead, this paper adopts a Line Buffer and Window Buffer design.
As shown in Figure 2, without an on-chip cache reuse mechanism, each 3 × 3 convolution operation requires reading nine input pixels from off-chip memory. With the Line Buffer and Window Buffer, the convolution window only needs to fetch one new pixel from off-chip memory each time it slides; the remaining pixels are reused directly from the on-chip cache. During the computation phase, off-chip memory access is thus reduced from up to nine reads to one new pixel read per slide, lowering DDR bandwidth and memory access pressure.
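The reuse pattern can be modeled in software as follows. This is a behavioral sketch only: the function `stream_windows`, the two-row line buffer layout, and the read counter are illustrative assumptions, and the kernel size is assumed to be at least 2.

```python
from collections import deque

def stream_windows(image, k=3):
    """Slide a k x k window over `image`, fetching each pixel exactly once.

    `image[r][c]` stands in for off-chip memory.
    Returns (windows, offchip_reads).
    """
    H, W = len(image), len(image[0])
    line_buf = [[0] * W for _ in range(k - 1)]             # last k-1 rows, on chip
    window = [deque([0] * k, maxlen=k) for _ in range(k)]  # k x k window registers
    windows, reads = [], 0
    for r in range(H):
        for c in range(W):
            pixel = image[r][c]   # the only off-chip read in this step
            reads += 1
            # New right-hand column: k-1 buffered pixels plus the fresh one.
            column = [line_buf[i][c] for i in range(k - 1)] + [pixel]
            for i in range(k):
                window[i].append(column[i])   # shift the window by one column
            # Push the fresh pixel into the line buffer, dropping the oldest row.
            for i in range(k - 2):
                line_buf[i][c] = line_buf[i + 1][c]
            line_buf[k - 2][c] = pixel
            if r >= k - 1 and c >= k - 1:     # window is fully valid
                windows.append([list(row) for row in window])
    return windows, reads
```

For a 4 × 5 image this model performs 20 off-chip reads to produce six 3 × 3 windows, whereas fetching every window independently would take 6 × 9 = 54 reads.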
The BN-fusion design transforms the multiple parameters and complex calculations of the BN layer into one multiplication and one addition through a linear transformation at the algorithm level, thereby significantly reducing the amount of computation in the BN module during inference [13]. In the inference stage, the mean $\mu$ and variance $\sigma^2$ in the BN layer are fixed constants after training, so the BN operation can be represented as a deterministic linear transformation, and its calculation form is:
$$y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
During training, $\gamma$ and $\beta$ are learnable parameters, while $\mu$ and $\sigma^2$ are estimated running statistics of the input features. $\epsilon$ is a small constant used to prevent division by zero and improve numerical stability. By rearranging the above formula, it can be equivalently represented as a linear scaling and bias transformation:
$$y = a x + b, \qquad a = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}, \qquad b = \beta - a \mu$$
During inference, $a$ and $b$ are fixed constants. After fusion, the convolution layer output only needs to perform a multiplication and a bias addition to complete the normalization originally handled by the BN layer.
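The folding step can be checked numerically with a short sketch. The names `fold_bn` and `batch_norm`, and the parameter names `gamma`, `beta`, `mean`, and `var` for the BN scale, shift, running mean, and running variance, are assumptions introduced for illustration.

```python
import math

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference BN at inference time: normalize, then scale and shift."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def fold_bn(gamma, beta, mean, var, eps=1e-5):
    """Precompute the constants so that BN(x) == scale * x + bias."""
    scale = gamma / math.sqrt(var + eps)
    bias = beta - scale * mean
    return scale, bias

# Offline: fold once after training. Online: one multiply and one add per value.
scale, bias = fold_bn(gamma=1.5, beta=-0.2, mean=0.7, var=2.0)
fused = [scale * x + bias for x in (-1.0, 0.0, 3.25)]
```

Because the square root and division are evaluated once offline, the per-pixel hardware cost reduces to one multiplier and one adder.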
A layer-fusion pipeline is adopted to implement the continuous computation of convolution, BN, activation, and pooling. From the perspective of memory access, a separated implementation would require multiple read and write operations between different modules. After layer fusion, the intermediate feature map results are processed through the on-chip pipeline, requiring only one read at the beginning and one write at the end.
To quantify the computational efficiency of the proposed design, a convolution baseline without dual-path parallel computation and the BN-fusion pipeline was also constructed for comparison. Both designs used the same convolution parameters, including a 3 × 3 kernel, a stride of 1, and an input feature map size of 32 × 32, and were synthesized on the same FPGA platform (XC7Z020) at 100 MHz using Xilinx Vivado HLS. The results are shown in Table 1. Compared with the baseline, the optimized convolution module reduces latency from 15.21 ms to 6.85 ms (a 55.0% reduction), while BRAM usage increases from 68 to 76 (+8 BRAMs, +11.8%). This BRAM increase is mainly due to the dual-path mean/variance computation and the folded BN parameters stored on-chip. Given that the XC7Z020 device provides 140 BRAMs in total, the additional consumption is acceptable, and the latency reduction justifies the trade-off for real-time edge inference. These results indicate that dual-path parallel computation and the BN-fusion pipeline can effectively reduce convolution latency without substantially compromising hardware resource efficiency.
All module-level experiments (convolution, pooling, fully connected) are synthesized and implemented at 100 MHz on the XC7Z020 device.
2.2. Design of Pooling Module
The pooling layer adopts a max-pooling module to downsample the feature map output by the Bayesian convolution layer. For the commonly used 2 × 2 max-pooling operation, only four input data values within a local window need to be compared. Unlike the multiply–accumulate operations in the convolution layer, max pooling does not involve multiplication or accumulation and can therefore be implemented without DSP resources. In this design, the pooling layer mainly uses a comparator-tree structure to select the maximum value within the local window. Specifically, the two input values in the same row are first compared, and then the comparison results of the two rows are further compared in the second stage to obtain the maximum value within the pooling window. This structure is simple in logic and has low latency, making it suitable for streaming processing after the convolution layer.
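The two-stage comparison can be written out as a small behavioral model. The function name `max_pool_2x2` is hypothetical, and the sketch assumes a stride of 2 and even feature-map dimensions.

```python
def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2 using a two-stage comparator tree."""
    H, W = len(fmap), len(fmap[0])
    out = []
    for r in range(0, H - 1, 2):
        out_row = []
        for c in range(0, W - 1, 2):
            a, b = fmap[r][c], fmap[r][c + 1]
            d, e = fmap[r + 1][c], fmap[r + 1][c + 1]
            # Stage 1: one comparator per row of the window.
            top = a if a >= b else b
            bottom = d if d >= e else e
            # Stage 2: compare the two row winners.
            out_row.append(top if top >= bottom else bottom)
        out.append(out_row)
    return out
```

Only three comparisons are needed per window and no multiplication occurs, which is consistent with the module requiring no DSP resources.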
In terms of input buffer design, the pooling module needs to read multiple data values within the local window at the same time. Therefore, an on-chip buffer partitioning method is used to organize the pooling input buffer, allowing the data within the 2 × 2 window to be read out in parallel and then sent to the comparator tree for maximum-value computation. This method distributes spatially adjacent feature-map rows across eight different basic BRAM blocks in a cyclic interleaving manner. As a result, the waiting time during the pooling stage can be reduced, ensuring that the data rate of the pooling module matches the output pipeline of the preceding convolution layer.
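One way to picture the cyclic interleaving is a row-to-bank mapping of the form `row mod 8`. The sketch below assumes this mapping and a simple row-major layout inside each bank; the helper names `bank_of`, `addr_in_bank`, and `window_reads` are hypothetical, and the source does not specify the exact addressing scheme.

```python
NUM_BANKS = 8  # number of BRAM banks, as stated in the text

def bank_of(row):
    """Bank that stores feature-map row `row` under cyclic interleaving."""
    return row % NUM_BANKS

def addr_in_bank(row, col, width):
    """Address of pixel (row, col) inside its bank (assumed row-major layout)."""
    return (row // NUM_BANKS) * width + col

def window_reads(top_row, left_col, width):
    """(bank, address) pairs touched by the 2 x 2 window at (top_row, left_col)."""
    return [(bank_of(r), addr_in_bank(r, c, width))
            for r in (top_row, top_row + 1)
            for c in (left_col, left_col + 1)]
```

Under this mapping the two rows of any pooling window always fall in different banks, so they can be fetched in the same cycle. The two pixels within one row still share a bank; how they are read together (for example, through a second memory port) is not stated in the source.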
To quantify the advantages of the optimized pooling design in terms of hardware overhead and computational efficiency, a conventional pooling layer without buffer partitioning was constructed as the baseline for comparison. Both designs used exactly the same computational specifications, namely a 2 × 2 pooling window, a stride of 2, 16 channels, and a 32 × 32 input feature map. Both designs are synthesized and implemented at 100 MHz on the XC7Z020 device.
Table 2 compares the resource utilization and latency of the two designs. With only a slight increase in logic resources, where LUTs and FFs increased by approximately 3–4%, and with zero DSP usage, the latency was reduced from 2.02 ms to 1.32 ms, corresponding to a reduction of approximately 35%. This result demonstrates the effectiveness of the buffer-partitioning strategy.
2.3. Design of the Bayesian Fully Connected Module
The fully connected layer (FC) is the layer used to output the classification result in a convolutional neural network. It is essentially a standard matrix–vector multiplication operation. Its main function is to flatten the high-dimensional spatial feature maps extracted by the preceding convolution and pooling modules into a one-dimensional vector and then map the high-dimensional features to the target output through a linear transformation using a weight matrix. The pseudocode is shown in Algorithm 1.
In BCNN, the weights of the fully connected layer are no longer deterministic parameters, but probability distributions described by their means and standard deviations. To reduce the hardware implementation complexity introduced by this characteristic, the randomness in the Bayesian fully connected layer is shifted from the weight side to the output side. As shown in Figure 3, the Bayesian fully connected layer is divided into a mean path, a variance path, and a local reparameterization sampling unit.
Algorithm 1: Fully connected layer pseudocode.
In terms of computation organization, for each output neuron in the Bayesian fully connected layer, the elements of the input vector participate in multiply–accumulate operations sequentially along the input dimension. Unlike the standard fully connected layer, this process does not only produce a single output value, but simultaneously computes two types of statistics. The mean path generates the output mean according to the inner product between the input vector and the mean parameters, while the variance path generates the output variance according to the computation between the input vector and the variance-related parameters. Since the two paths share the same input vector, pipeline optimization can be applied to the inner accumulation loop in the same way as in a standard fully connected layer, allowing the multiply–accumulate operations to be executed sequentially in consecutive clock cycles. Based on the original optimization framework, the dual-path design extends the single-path accumulation process into a parallel computation process for the mean and variance statistics.
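The shared accumulation loop can be sketched in software as follows. The function `bayes_fc` and its argument names are hypothetical, the weights are indexed as `w[output][input]`, and the variance path is assumed to accumulate squared inputs times per-weight variances. It models the data flow only, not the pipelined hardware.

```python
import math
import random

def bayes_fc(x, w_mu, w_var, eps=1e-8, rng=None):
    """Dual-path Bayesian fully connected layer (illustrative).

    x:     input vector
    w_mu:  mean weights, w_mu[j][i] for output j and input i
    w_var: weight variances (non-negative), same shape as w_mu
    Returns a list of (mean, variance, sample) tuples, one per output neuron.
    """
    rng = rng or random.Random(0)
    outputs = []
    for j in range(len(w_mu)):
        m = v = 0.0
        for i, xi in enumerate(x):        # single loop feeds both accumulators
            m += xi * w_mu[j][i]          # mean path
            v += xi * xi * w_var[j][i]    # variance path (assumed form)
        sample = m + math.sqrt(v + eps) * rng.gauss(0.0, 1.0)
        outputs.append((m, v, sample))
    return outputs
```

Each loop iteration issues two multiply–accumulate operations on the same input element, which is why the dual-path design trades additional multipliers for lower latency.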
From the hardware implementation perspective, the mean path can reuse the conventional matrix–vector multiply–accumulate structure of a standard fully connected layer. On this basis, the variance path introduces additional input-square and variance-accumulation operations. Finally, the results of the two paths are combined in the sampling unit to generate the output. Since the output of the fully connected layer directly affects the final classification result and predictive entropy calculation, a higher-precision fixed-point format, ap_fixed<40,27>, is used in the variance accumulation and sampling output stages to ensure the computational accuracy on the FPGA.
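For reference, `ap_fixed<40,27>` denotes a 40-bit signed fixed-point value with 27 integer bits (including the sign bit) and therefore 13 fractional bits, giving a resolution of 2^-13. The sketch below emulates that grid in software, assuming the default truncation and wrap-around modes of the type; the helper `to_fixed` is hypothetical and is not part of any HLS library.

```python
import math

TOTAL_BITS = 40
INT_BITS = 27                       # includes the sign bit
FRAC_BITS = TOTAL_BITS - INT_BITS   # 13 fractional bits

def to_fixed(value):
    """Quantize `value` onto the 40-bit grid: truncate toward -inf, wrap on overflow."""
    raw = math.floor(value * (1 << FRAC_BITS))   # truncation (assumed default mode)
    raw &= (1 << TOTAL_BITS) - 1                 # keep the low 40 bits (wrap)
    if raw >= 1 << (TOTAL_BITS - 1):             # reinterpret as two's complement
        raw -= 1 << TOTAL_BITS
    return raw / (1 << FRAC_BITS)

step = to_fixed(2.0 ** -FRAC_BITS)   # smallest positive representable value
```

The wide integer part leaves headroom for long accumulations in the variance path, while the 13 fractional bits bound the quantization error of each stored value by 2^-13.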
For the Bayesian fully connected layer, a baseline without dual-path expansion is implemented at 100 MHz on the XC7Z020. As shown in Table 3, the optimized design reduces latency from 5.12 ms to 2.17 ms (−57.6%) and decreases BRAM usage from 125 to 104 (−16.8%), but increases DSP usage from 46 to 75 (+63.0%) due to the parallel mean/variance MAC operations.