5.2. Implementation Details
In this work, we implemented our architecture for 8-bit integer operations where PEs are designed to perform convolution. The attention module on the RCU operates independently on each image region and calculates the data variation. For low-relevance regions, PEs are only activated to perform 4-bit convolutions.
In this implementation, we designed each RPU with PE units. Here, we designed eight instances of RPU that comprise the ICL. This indicates that the ICL layer will have 512 PEs corresponding to 8 RPUs. These RPU can be programmed to realize one or multiple consecutive CNN layers. The data transfer time between the off-chip memory to the on-chip acts as a performance bottleneck to the architecture. To minimize this, while loading input data from the off-chip memory to on-chip memory, we use a ping-pong buffering scheme to hide latency.
We simulate our architecture with different region sizes in the ICL.
Figure 9a illustrates our analysis. Here, the y-axis on the right indicates the number of clock cycles for latency in RPUs. The y-axis on the left denotes the number of units utilized for resources, such as flip flops and LUTs. The figure indicates an increase in area overhead and latency with the increase in region size. In contrast, the performance of ROI degrades in a broader region because the regions cannot preclude redundant information. In addition, it asserts high latency that breaks the benefit of parallelism. Based on this analysis, in
Figure 9a, we consider the
region size as the optimal region size in the ICL layer.
5.3. Performance Analysis
Each RPU in ICL has 64-PEs, and each RCU in RCL operates on 64 pixels. The resource utilization in the FPGA and layout extracted parameters in ASIC have been tabulated in
Table 2 and
Table 3. We report a maximum frequency of 350 MHz and 380 MHz in FPGA and ASIC, respectively. We provide the mentioned frequency to every module in our design to maintain synchronization. From
Table 3, we can see that on average, each PE has an area overhead of 145,00 µm
. The survey in Reference [
42] reports that the minimum silicon footprint of an ADC in 90 nm technology is 59,000 µm
. Therefore, to avoid performance degradation, we found that using an additional computational layer in parallel to the sensor is the most viable option for sensor level CNN integration. However, the total area will shrink considerably with improved silicon technology. In our implementation, we have a dedicated RCU for each input image region and eight RPUs in the ICL layer for inference computation. Therefore, for the VGG16 model, with the input frame size of
, we will require 784 RCUs in the RCL layer. However, RCUs are lightweight modules and consume limited resources. The 784 RCU will account for 42 k LUTs and 22 flip-flops, comparable to a single RPU. For this design, the 8 RPUs in the ICL layer will require 350 k LUTs and 390 k flip-flops in total.
We exhibit the benefit of integrating the relevance computation layer by analyzing the energy consumption of the RCU and RPU in ICL. The energy overhead in our architecture is the added RCUs in the RCL layer, and the RCUs are active at all times. However,
Table 2 and
Table 3 indicate that, compared to an RPU, an RCU only consumes 0.5% of energy, whereas an inactive RPU only draws leakage power, and the leakage power in our design is 2.5%∼3.0%. The RCL and ICL layers are designed to compute the convolution operations of the CNN model. Considering our design to a design without the RCL layer, we found that our architecture will save energy in convolution operations, if the ROI is less than the size of the image frame. We tested it on the input image stream of
frames to a VGG16 model and computed the energy savings on the first convolutional layer. We see that the energy consumption decreases in our proposed architecture as the size of the ROI decreases, whereas a baseline architecture (design without saliency computation layer) consumes constant energy for convolution layers regardless of the size of the ROI. This is shown in
Figure 10a. The sequential processor in the architecture is used to implement other computational layers in the high-level processing model. For an arm A53 core implemented as the sequential processor, the estimated power consumption is 500 mW.
The proposed architecture eliminates computational redundancies by reducing the number of multiplication operations in CNN layers. We analyzed the reduction in computational redundancies on the VGG16 model. This is shown in
Figure 10b. Here, we assume that the size of the ROI in the incoming image frame is 30% (estimated from the first three entries in
Table 4). The figure indicates that the proposed model performs fewer multiplications than a baseline design (without visual attention) in the first five layers. For the consecutive layers, the ROI size increases due to the relevance propagation model and the computation load is equal for both designs. The computational savings in the first five layers result in around 13% reduction in total energy consumption. Next, we investigate the off-chip memory access for the ICL layer. The off-chip memory is accessed to fetch weights of the convolution layers and store OFMs containing only the relevant regions. The OFMs may need to be read back to the on-chip global memory pool for the next layer operation if the global memory is not large enough to accommodate OFMs. It is the same if the CNN model has intermediate functions that need to be performed on a separate module.
Figure 9b shows the off-chip memory access for different convolution layers of the VGG16 model. Here, we compared our proposed design with the same baseline design as described above. We see a considerable reduction in memory access on the first few layers (up to 3.3× decrease on the first convolution layer). However, in our calculation, we do not consider any compression mechanism. Further reduction can be achieved by adopting a suitable compression mechanism in the architecture. For instance, in Reference [
22], the authors employed a Run-length compression (RLC) mechanism to reduce the off-chip memory accesses. The RLC coding exploits the zeros in the feature maps to save off-chip memory bandwidth. A similar encoding approach can be explored in our design to curtail memory read/write volume.
Next, we seek to estimate the typical size of the ROI in an image frame. We explore different object tracking datasets and developed a simulator to simulate the behavior of region-level event detection of our image sensor. We specifically tested on MOT17 dataset for a real-world scenario [
43]. The dataset contains different video clips of people walking around various public places (intersection, square, etc.). We use four different MOT17 datasets to evaluate our temporal relevance model.
Table 4 indicates the average number of non-relevant regions for each dataset. As we can see, for the first three datasets, on average up to 75% of the regions can be avoided on each frame for repetitive computation. However, for the last dataset, we have a comparatively less number of irrelevant regions. Because, in this case, the images are captured from a moving camera. Therefore, it is understood that our attention-based approach provides greater power and computational savings for static camera positions.
Figure 11 provides a pictorial view of our ROI simulator. Besides, we calculated the spatial redundancy on MNIST and FashionMNIST dataset. The average spatial redundant regions are 50% and 29%, respectively. The results indicate that our architecture will save power at all times.
Next, we evaluate the impact of region-aware processing on the performance of the different inference models implemented in our sensor architecture. We found that we do not see a drop in accuracy when we perform region-aware training (replacing non-relevant regions with zero-valued pixels during training). This is tested on two different models (Lenet-5, VGG16) for three different datasets (MNIST, FashionMNIST, and mosquito species [
44]). However, the threshold value used to identify non-relevant regions needs to be adjusted for different datasets. Nevertheless, we were able to exclude 30-to-50% regions for all three datasets.
In our architecture, we perform low-precision operations on regions that have lower relevance scores. The approach was inspired by Reference [
45], where the authors used different bit-width operations in different CNN layers. We evaluate whether the approach affects the detection accuracy of the implemented inference. We tested it on the YOLOv3 model and calculated the
precision metric for the MOT17-05 dataset.
Precision is defined as the ratio of correct detections to the sum of correct detections and false detections. A substantial value of the
precision metric indicates high detections. The formula is shown in Equation (5).
Here,
indicates True Positive, and
indicates False Positive. The calculated
precision data is shown in
Figure 12a. The figure suggests that the detector can identify objects on images with reduced bit-width on non-relevant regions.
5.4. Performance Comparison
Our architecture’s novelty relies on the implementation of the convolutional layers mapped on an array of pixel processors. We evaluated the performance of our architecture and compared them to a similar work found in the literature. The authors in Reference [
46] presented an inference architecture that fully embeds a CNN network on a pixel processor array (PPA) near the vision sensors and demonstrated results for the MNIST dataset. However, the approach does not perform attention-based processing like us. We evaluated our architecture with them to make a fair comparison. The result is shown in
Table 5. For the first two convolution layer, our proposed architecture can exhibits
speed-up compared to Reference [
46]. Therefore, we can argue that our architecture outperforms the work in Reference [
46]. Next, we compare our work to an FPGA-based SIMD CNN accelerator design [
47]. The results are shown in
Table 6 which indicated performance improvements in our design.
We performed an energy consumption analysis of our proposed model and compared the work with Reference [
22]. The results are shown in
Figure 12b. We compared the energy consumption for the first six convolution layers of the VGG16 model. As the figure suggests, our proposed model reports better energy management for the first few layers of the VGG16 model. However, the work in Reference [
22] starts to save more energy from the 5th convolution layer and onward. The result is aligned with our previous computations, as our earlier results suggest that the ROI regions spread across the image frame as data propagates through the convolutional layers. Nevertheless, it should be noted that the work reported in Reference [
22] exploits zero values in the IFMs to save energy. In our architecture, it is possible to attain greater energy savings by adopting a similar methodology.
The novelty of our work relies on the attention-based CNN processing near the sensor and relevance propagation. Therefore, we focused on highlighting the region-based computation model in this work. However, the modular nature of our design allows the integration of additional optimization methods to improve performance and reduce computational redundancy. For instance, in Reference [
48], the authors presented MobileNetV2, which can significantly reduce the memory footprint needed during inference computation. The PE array can be conveniently configured in our architecture to implement depthwise separable convolution layers to carry out CNN operations. There are other optimization schemes that can be integrated within our attention-based approach [
49]. In our future work, we plan to explore methods to combine optimization mechanisms to improve inference performance in the sensor.