4.1. Fixed-Point Conversion of Batch Normalization
Upon completion of model training, each batch normalization layer produces four parameters: the mini-batch mean (μ), the mini-batch variance (σ²), and the learnable scaling (γ) and shifting (β) factors. To reduce the computational complexity of the hardware implementation, these four parameters can be algebraically folded into two simplified terms, as shown in Equations (3) and (4), where μ denotes the mini-batch mean and σ² the mini-batch variance. As a result, the batch normalization computation can be implemented efficiently using only a multiplication and an addition, as illustrated in Equation (5).
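The folding can be summarized by a short Python sketch. The exact form of Equations (3)–(5) is assumed here to be the standard one, γ′ = γ/√(σ² + ε) and β′ = β − γ′μ, so that the normalized output is y = γ′x + β′:

```python
import numpy as np

def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Fold the four trained BN parameters into the two simplified terms
    (Equations (3) and (4)), assuming the standard formulation
    gamma' = gamma / sqrt(var + eps) and beta' = beta - gamma' * mean."""
    gamma_p = gamma / np.sqrt(var + eps)
    beta_p = beta - gamma_p * mean
    return gamma_p, beta_p

def batch_norm_folded(x, gamma_p, beta_p):
    """Equation (5): one multiplication and one addition per activation."""
    return gamma_p * x + beta_p
```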
Next, the simplified batch normalization parameters are quantized to fixed-point format. Taking γ′ as an example, since its values lie within the range [0, 1], 2 bits are allocated for the integer part. The fractional bit-width is then determined based on quantization sensitivity. As shown in Table 14, model accuracy remains largely unaffected when the fractional precision ranges from 9 to 11 bits, but declines noticeably when it is reduced to 8 bits or fewer. Therefore, each γ′ value is represented using 11 bits in total, comprising 2 bits for the integer part and 9 bits for the fractional part.
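A minimal sketch of the fixed-point conversion is given below; the rounding mode (round-to-nearest) and the saturation behavior are assumptions rather than details taken from the hardware design:

```python
import numpy as np

def to_fixed_point(x, int_bits, frac_bits, signed=False):
    """Quantize x onto a fixed-point grid with the given integer and
    fractional bit-widths, rounding to the nearest representable value
    and saturating to the representable range."""
    step = 2.0 ** -frac_bits
    if signed:
        lo = -(2.0 ** (int_bits - 1))
        hi = (2.0 ** (int_bits - 1)) - step
    else:
        lo = 0.0
        hi = (2.0 ** int_bits) - step
    q = np.round(np.asarray(x, dtype=float) / step) * step
    return np.clip(q, lo, hi)

# gamma' lies in [0, 1]: 2 integer bits + 9 fractional bits (11 bits total).
# gamma_p_q = to_fixed_point(gamma_p, int_bits=2, frac_bits=9)
```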
After completing the fixed-point quantization of γ′, a similar procedure is applied to β′. Since β′ values range from −3 to 3, 3 bits are allocated for the signed integer part. The next step is to determine the appropriate fractional precision. As shown in Table 15, the model maintains stable accuracy when the number of fractional bits ranges from 6 to 8, whereas accuracy degrades significantly when the fractional precision is reduced to 5 bits or fewer. Therefore, β′ is represented in fixed-point format using 3 integer bits and 6 fractional bits, for a total of 9 bits.
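Reusing the to_fixed_point helper from the previous sketch, β′ maps to a signed format with 3 integer bits and 6 fractional bits, and the bit-width sweeps behind Tables 14 and 15 take the shape of a simple loop. The β′ values and the evaluate_model stub below are illustrative placeholders only:

```python
import numpy as np

# Illustrative beta' values; in practice these come from the folded BN layers.
beta_p = np.array([-2.7, -0.41, 0.0, 1.35, 2.9])

def evaluate_model(beta_p_quantized):
    """Placeholder for the validation-accuracy measurement behind Table 15;
    here it simply reports the mean quantization error."""
    return float(np.mean(np.abs(beta_p_quantized - beta_p)))

# Signed format for beta': 3 integer bits and 6 fractional bits.
beta_p_q = to_fixed_point(beta_p, int_bits=3, frac_bits=6, signed=True)

# Fractional bit-width sweep mirroring the experiments in Tables 14 and 15.
for frac_bits in range(11, 3, -1):
    err = evaluate_model(to_fixed_point(beta_p, 3, frac_bits, signed=True))
    print(f"{frac_bits} fractional bits -> mean quantization error {err:.4f}")
```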
4.2. B-CNN Hardware Accelerator Architecture
This section presents a detailed discussion of the hardware implementation strategy and memory utilization. To achieve high efficiency and flexibility, the proposed hardware architecture utilizes fully on-chip memory blocks, eliminating the requirement for external memory access. This design choice maximizes memory bandwidth utilization and enables improved control over memory access scheduling and power management. At the software-level optimization stage, memory consumption is carefully minimized by compressing selected intermediate features (as described in detail in
Section 3). Furthermore, memory usage is partitioned into multiple smaller blocks tailored specifically to the demands of each network layer, thus enabling finer-grained activation control and reducing overall energy consumption. In addition, to decrease system complexity and power consumption, convolution computations are implemented using a single set of processing elements (PEs), which are reused across different layers through a time-multiplexed mechanism.
Figure 6 illustrates the hardware architecture of the proposed B-CNN model, which incorporates an early-exit mechanism.
The memory requirements for hardware computation are primarily divided into two types: Static Random-Access Memory (SRAM) and Read-Only Memory (ROM). SRAM is utilized to store the input image, intermediate partial sums generated during computation, and the feature maps produced after each convolutional layer. ROM is used to store the weights associated with the convolutional and fully connected layers.
The convolution block performs zero-padding on the input image and feature maps, and applies convolution operations using the corresponding weights. The batch normalization block processes the convolution outputs using the parameters γ′ and β′, as defined in Equation (5), through multiplication and addition operations. These parameters are stored in a look-up table, eliminating the need for an additional ROM. The ReLU block replaces negative values with zero. The max pooling block performs the pooling operation to reduce the spatial dimensions of the feature maps. Finally, the global average pooling block computes the average across each input channel and forwards the results to the fully connected block for final model prediction.
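The per-layer processing chain can be captured by a small behavioral reference model. The sketch below assumes unit-stride 3 × 3 convolutions with zero-padding and 2 × 2 max pooling; the actual kernel sizes, strides, and channel counts follow the network configuration in Section 3:

```python
import numpy as np

def conv_layer(x, weights, gamma_p, beta_p):
    """One layer of the datapath: zero-padded convolution, folded batch
    normalization (Equation (5)), ReLU, and 2x2 max pooling.
    x: (C_in, H, W); weights: (C_out, C_in, 3, 3)."""
    c_out, c_in, kh, kw = weights.shape
    h, w = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))               # zero-padding
    y = np.zeros((c_out, h, w))
    for co in range(c_out):                                 # convolution
        for i in range(h):
            for j in range(w):
                y[co, i, j] = np.sum(xp[:, i:i + kh, j:j + kw] * weights[co])
    y = gamma_p[:, None, None] * y + beta_p[:, None, None]  # batch normalization
    y = np.maximum(y, 0.0)                                  # ReLU
    y = y.reshape(c_out, h // 2, 2, w // 2, 2).max(axis=(2, 4))  # 2x2 max pooling
    return y

def classifier_head(fmap, fc_weights, fc_bias):
    """Global average pooling over each channel followed by the fully
    connected layer that produces the prediction."""
    gap = fmap.mean(axis=(1, 2))
    return fc_weights @ gap + fc_bias
```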
Table 16 summarizes the memory usage for each layer. The input is a 64 × 64 RGB image, with each pixel value represented using 8 bits. This image is stored across three SRAM modules, namely sram_img1, sram_img2, and sram_img3, amounting to 98,304 bits (64 × 64 pixels × 3 channels × 8 bits). This constitutes the largest memory allocation in the entire computation process.
After loading the input image into SRAM, convolution operations begin. The feature map generated by the first convolutional layer is stored in four SRAM modules: sram_fmap1, sram_fmap2, sram_fmap3, and sram_fmap4, with a total memory usage of 65,536 bits. This makes it the largest intermediate feature map in the model. Due to max pooling, the spatial resolution of feature maps is halved in each subsequent layer. Consequently, from the third layer onward, feature maps are alternately stored in sram_fmap1 and sram_fmap2, eliminating the need for additional memory allocation.
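The alternating use of sram_fmap1 and sram_fmap2 amounts to a ping-pong buffering scheme. The sketch below illustrates the idea; the parity of the assignment is hypothetical, and the actual allocation is the one listed in Table 16:

```python
def fmap_destination(layer_index):
    """Destination buffer(s) for the output feature map of a given layer.
    Layer 1 writes its full-resolution map across sram_fmap1..sram_fmap4;
    from the third layer onward the halved maps alternate between
    sram_fmap1 and sram_fmap2 (the exact parity shown here is illustrative)."""
    if layer_index == 1:
        return ["sram_fmap1", "sram_fmap2", "sram_fmap3", "sram_fmap4"]
    return ["sram_fmap1"] if layer_index % 2 == 1 else ["sram_fmap2"]
```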
The proposed B-CNN model comprises a coarse prediction stage followed by a fine prediction stage. The cycle count required to complete the full model inference is illustrated in
Figure 7. Inference with the full B-CNN model requires a total of 2,350,000 cycles, whereas the coarse prediction stage alone requires only 1,850,000 cycles. This enables early termination of inference for data belonging to the Normal class, significantly reducing computational demand and power consumption.
For other categories such as disasters or accidents, more extensive computation is required to ensure higher classification accuracy. Nevertheless, after the coarse classifier, only an additional 500,000 cycles are required to complete the fine prediction stage.
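The benefit of the early exit can be quantified with a simple expected-cycle calculation; the Normal-class fraction used below is a hypothetical example value, not a measured statistic:

```python
COARSE_CYCLES = 1_850_000                    # three-layer coarse prediction stage
FINE_CYCLES = 500_000                        # additional cycles for the fine stage
FULL_CYCLES = COARSE_CYCLES + FINE_CYCLES    # 2,350,000 cycles for full inference

def expected_cycles(normal_fraction):
    """Average cycles per inference when the given fraction of inputs
    belongs to the Normal class and therefore exits at the coarse stage."""
    return COARSE_CYCLES + (1.0 - normal_fraction) * FINE_CYCLES

# Example: if 70% of the inputs were Normal (hypothetical), the average cost
# would drop from 2,350,000 to 2,000,000 cycles per inference.
print(expected_cycles(0.7))
```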
Power gating is a key technique in memory management, aimed at improving energy efficiency in electronic systems. By selectively disabling power to inactive sections of a circuit, it significantly reduces both dynamic and static power consumption. In memory systems, energy savings are achieved by deactivating unused memory blocks, ensuring that only the memory blocks required for current operations remain powered.
Control circuitry plays a critical role in managing the power supply by determining when to enable or disable specific memory regions based on access patterns. This technique not only enhances energy efficiency but also reduces heat generation, which is essential for maintaining device reliability and prolonging operational lifespan. The proposed memory power gating strategy is illustrated in
Figure 8. Notably, in the proposed design, power gating is applied exclusively to the memory subsystem, such as ROM and RAM modules. Other components, including logic and control circuits, are not subjected to power gating in this architecture.
The power control module, referred to as the power management unit, generates control signals (pgen1 to pgen8) to operate the power switches associated with various memory blocks. Due to shared circuit activity, sram_img1, sram_fmap3, and sram_fmap4 are managed collectively by the control signal pgen1, while sram_img2 and sram_img3 are controlled by pgen2. Additional details regarding memory power management across the entire hardware architecture are provided in
Table 17.
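A behavioral sketch of the power management unit's role is given below: each pgen signal is asserted only when one of the memory blocks in its group is needed for the current computation step. The groupings for pgen1 and pgen2 follow the text; the remaining assignments and the example access set are placeholders, with the actual schedule given in Table 17:

```python
# Memory blocks gated by each control signal (pgen1 and pgen2 per the text;
# pgen3..pgen8 cover the remaining SRAM/ROM blocks as listed in Table 17).
PGEN_GROUPS = {
    "pgen1": ["sram_img1", "sram_fmap3", "sram_fmap4"],
    "pgen2": ["sram_img2", "sram_img3"],
    # "pgen3" ... "pgen8": remaining memory blocks (placeholders here)
}

def pgen_vector(active_blocks):
    """Power-gating control signals: a group stays powered only if at least
    one of its memory blocks is accessed in the current step."""
    return {sig: any(block in active_blocks for block in blocks)
            for sig, blocks in PGEN_GROUPS.items()}

# Example: while the input image banks are being read, pgen1 and pgen2 are asserted.
print(pgen_vector({"sram_img1", "sram_img2", "sram_img3"}))
```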
Figure 9 illustrates the computation flow of the proposed B-CNN hardware architecture. After the input image has been loaded, inference proceeds along either the three-layer coarse prediction path or the five-layer fine prediction path. Each layer performs a similar sequence of operations, including convolution, batch normalization, ReLU activation, max pooling, and memory read/write processes. The final prediction result, whether produced by the coarse or the fine prediction stage, is obtained by applying global average pooling to the feature map, followed by a fully connected layer.
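The overall control flow can be summarized with the conv_layer and classifier_head helpers from the earlier datapath sketch. The model below assumes that the five-layer fine path shares the three coarse layers and adds two more, and the Normal-class index is a placeholder:

```python
NORMAL_CLASS = 0   # placeholder index of the Normal class

def b_cnn_inference(image, layers, coarse_head, fine_head):
    """Behavioral model of the flow in Figure 9. layers holds per-layer
    (weights, gamma_p, beta_p) tuples; one set of processing elements is
    reused for every layer in a time-multiplexed fashion."""
    fmap = image
    # Three-layer coarse prediction stage.
    for weights, gamma_p, beta_p in layers[:3]:
        fmap = conv_layer(fmap, weights, gamma_p, beta_p)
    coarse_logits = classifier_head(fmap, *coarse_head)
    if coarse_logits.argmax() == NORMAL_CLASS:   # early exit for Normal inputs
        return coarse_logits
    # Two additional layers complete the five-layer fine prediction path.
    for weights, gamma_p, beta_p in layers[3:5]:
        fmap = conv_layer(fmap, weights, gamma_p, beta_p)
    return classifier_head(fmap, *fine_head)
```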