Efficient Binarized Convolutional Layers for Visual Inspection Applications on Resource-Limited FPGAs and ASICs

Abstract: There has been a recent surge in publications related to binarized neural networks (BNNs), which use binary values to represent both the weights and activations in deep neural networks (DNNs). Due to the bitwise nature of BNNs, there have been many efforts to implement BNNs on ASICs and FPGAs. While BNNs are excellent candidates for these kinds of resource-limited systems, most implementations still require very large FPGAs or CPU-FPGA co-processing systems. Our work focuses on reducing the computational cost of BNNs even further, making them more efficient to implement on FPGAs. We target embedded visual inspection tasks, like quality inspection sorting on manufactured parts and agricultural produce sorting. We propose a new binarized convolutional layer, called the neural jet features layer, that learns well-known classic computer vision kernels that are efficient to calculate as a group. We show that on visual inspection tasks, neural jet features perform comparably to standard BNN convolutional layers while using fewer computational resources. We also show that neural jet features tend to be more stable than BNN convolution layers when training small models.


Introduction
Deep learning (DL) models currently dominate as the state-of-the-art method for image classification tasks. They are versatile and achieve unprecedented accuracy for most applications. Unlike traditional computer vision approaches, DL can construct complex models with little to no expert knowledge of the target domain.
DL models tend to be large and computationally intensive. They are well suited for complex image classification tasks, but many straightforward image classification applications do not require such complexity and could still benefit from the ease of use that DL provides. Many applications, like quality inspection in manufacturing and agricultural produce sorting, rely on automated image classification systems but do not require overly complex DL models. Instead, they benefit from systems that are high-speed, low-cost, low-power and small in form factor. FPGAs and ASICs make great candidates for these types of applications, but generally lack the computational resources to run mainstream DL models at reasonable speeds.
In this work, we propose a new binarized convolutional layer called the neural jet features layer. The neural jet features layer is a convolutional layer that is trained to form jet features within a BNN using DL methods. This creates a BNN model that is less costly to compute on resource-limited systems while maintaining comparable classification accuracy. We also show that neural jet features are more stable during training on the MNIST dataset.

Related Work
Various efforts have been made to make deep neural networks (DNNs) less resource intensive. Models like SqueezeNet [1] and techniques like network pruning [2][3][4] are effective ways to reduce the number of operations required by a NN. However, they do not reduce the complexity of those operations. GhostNet [5] reduces the number of convolutional operations needed in a convolutional layer by generating fewer feature maps. These intrinsic feature maps are then augmented by applying various linear operations to each of them, creating more features with fewer convolution operations. Our work shares a similar philosophy by reusing feature maps between different output channels, as explained in Section 3.2.
Quantized neural networks use low-precision fixed-point values instead of full-precision floating-point values, making the operations of the network simpler and more compatible with resource-limited systems. Binarized neural networks (BNNs) use the most extreme form of network quantization, using single bits to represent values throughout the model. The original BNN model, as proposed by Courbariaux et al. [6], was shown to perform well on simple image classification tasks like MNIST and CIFAR-10, but was not well suited for the ImageNet [7] dataset. Many subsequent works have augmented BNNs to perform better on ImageNet at the cost of greater network complexity and larger model size. The larger models usually require very large FPGAs and/or CPU co-processors [8]. In this work, instead of making BNNs more complex, we seek to reduce the size of BNNs while maintaining comparable accuracy for simpler image classification tasks, so that they can be implemented in resource-limited FPGAs and ASICs. FPGAs and ASICs tend to be much less power-hungry than GPUs and CPUs on quantized and binarized values. Qasaimeh et al. [9] have shown that on floating-point arithmetic image processing tasks, FPGAs, CPUs, and GPUs consume similar amounts of power. However, FPGAs are not as well equipped to process floating-point values as CPUs and GPUs are. On image processing pipelines that used quantized values, FPGAs consumed 10× to 20× less power.
In our previous work, we presented jet features [10], a set of convolution kernels that are efficient to compute on resource-limited systems, which utilize the key scaling and partial derivative properties of classic image processing kernels. We demonstrated their effectiveness by replacing the set of traditional image features used in the ECO features algorithm [11] with jet features, which allowed the algorithm to be effectively implemented on an FPGA for high-speed, low-power classification without sacrificing image classification accuracy.

Binarized Neural Networks
BNNs are deep neural networks that use binarized values (−1 and 1) for their weights and activations. BNNs were first introduced by Courbariaux et al. [6], and the methods they introduced are still the most widely used form of network binarization.
Courbariaux et al. first presented a method for binarizing the weights of DNNs known as BinaryConnect [12]. This was the first known technique to directly train binary values in a DNN, which is not possible with standard deep learning methods. Standard deep learning uses gradient descent, which makes incremental changes to a model's weights, and such incremental changes are not possible with binary-valued weights. To overcome this challenge, BinaryConnect trains a set of full-precision weights, which can be updated in small increments, but passes these weights through a binarization layer before they are used by the network, as shown in Figure 1.

Binarization of Weights
The binarization layer produces a binary output by taking the sign of its input (−1 or 1). Standard backpropagation with a discrete function, like the sign function, produces a gradient of zero, which is incompatible with gradient descent. Instead of using standard backpropagation, the binarization layer uses a Straight Through Estimator (STE) [13]. The STE ignores the sign function during backpropagation and passes the gradient at the output of the layer through to the input of the layer, as shown in Figure 1. This allows the real-valued background weights to be updated during the backward pass while the binary-valued weights are used during the forward pass. Once training is complete, the real-valued background weights are no longer needed.
With binary-valued weights, BinaryConnect reduces the model size of the network and lowers the computational cost of the operations. Instead of multiplying two floating-point values, the signs of the activations are merely flipped according to the binary-valued weights, which requires no multiplication operations at all. Courbariaux et al. were able to show that BinaryConnect could match the accuracy of contemporary full-precision methods on the MNIST, CIFAR-10 and SVHN datasets.
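The forward and backward behavior described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation from [6] or [12]: the function names, the learning rate, the single-neuron setup, and the choice to map an input of exactly zero to +1 are all assumptions made for the example.

```python
def sign(x):
    """Binarize a real value to -1.0 or +1.0 (zero maps to +1.0 by assumption)."""
    return 1.0 if x >= 0 else -1.0


def forward(real_weights, inputs):
    """Forward pass: only the binarized copies of the weights are used."""
    binary_weights = [sign(w) for w in real_weights]
    return sum(b * x for b, x in zip(binary_weights, inputs))


def backward(inputs, grad_output):
    """Straight-through estimator: the sign function is skipped, so the
    gradient w.r.t. each real-valued weight equals the gradient w.r.t. its
    binarized copy, which is grad_output * input."""
    return [grad_output * x for x in inputs]


def sgd_step(real_weights, grads, lr=0.1):
    """Small incremental update applied to the real-valued background weights."""
    return [w - lr * g for w, g in zip(real_weights, grads)]


real_w = [0.05, -0.30, 0.20]      # hypothetical background weights
x = [1.0, 1.0, -1.0]              # hypothetical inputs
out = forward(real_w, x)          # uses binary weights [+1, -1, +1]
grads = backward(x, grad_output=1.0)
real_w = sgd_step(real_w, grads)  # the first weight crosses zero here
binary_after = [sign(w) for w in real_w]
```

In this example the first background weight moves from 0.05 to a negative value, so its binarized copy flips from +1 to −1 even though each update to the real-valued weight is small.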

Binarization of Activations
After BinaryConnect showed that directly training binary valued weights was feasible, Courbariaux et al. developed the BNN model, which uses binary values for both weights and activations. This reduces computational cost even further. By encoding values of −1 as 0 and +1 as 1, weights can be applied to activations through bitwise XNOR logic operations. The single bit results can then be accumulated with simple bit counting operations.
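The XNOR and bit-counting scheme can be checked against an ordinary dot product. The sketch below assumes values are packed into a Python integer, one bit per value; the helper names `encode` and `xnor_popcount_dot` are hypothetical and chosen only for this illustration.

```python
def encode(values):
    """Pack a list of -1/+1 values into an int, mapping -1 -> 0 and +1 -> 1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two length-n vectors of -1/+1 values stored as bits."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # bit counting (popcount)
    return 2 * matches - n              # matches minus mismatches


activations = [1, -1, -1, 1, 1, -1, 1, 1, -1]
weights = [1, 1, -1, -1, 1, -1, -1, 1, 1]

reference = sum(a * w for a, w in zip(activations, weights))
fast = xnor_popcount_dot(encode(activations), encode(weights), len(weights))
```

Each matching pair of signs contributes +1 to the dot product and each mismatch contributes −1, which is why the result is twice the match count minus the vector length.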
In order to binarize activations, BNNs pass the activations through the same STE binarization layers that are used to binarize the weights. Courbariaux et al. noticed that as gradients are backpropagated through multiple binarized layers, they encounter many activations exceeding an absolute value of 1. This causes the gradients to grow rapidly and explode. To counteract this, BNNs artificially force the gradients to zero whenever they pass through layers whose activations have an absolute value greater than 1.
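The gradient cancellation rule can be written as a mask on the backward pass. This is a minimal sketch of the rule as stated in the paragraph above, with hypothetical function names and example values; it treats an absolute value of exactly 1 as still passing the gradient, which is an assumption.

```python
def binarize_forward(pre_activations):
    """Forward pass: sign of each pre-activation."""
    return [1.0 if a >= 0 else -1.0 for a in pre_activations]


def binarize_backward(pre_activations, grad_outputs):
    """Backward pass: pass the gradient straight through where |a| <= 1
    and force it to zero where |a| > 1."""
    return [
        g if abs(a) <= 1.0 else 0.0
        for a, g in zip(pre_activations, grad_outputs)
    ]


pre = [0.3, -2.5, 1.7, -0.9]                 # hypothetical pre-activations
binary = binarize_forward(pre)
grad_in = binarize_backward(pre, [0.5, 0.5, -0.2, 0.1])
```

Only the first and last gradients survive in this example, because the two middle pre-activations have an absolute value greater than 1.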

Performance and Improvements
Courbariaux et al. reported that on simple datasets like MNIST and CIFAR-10, BNNs could achieve comparable accuracy to their full-precision counterparts, but failed to produce similar success on the more complex ImageNet dataset. Various efforts have been made to improve performance on more complex image classification datasets. Rastegari et al. proposed XNOR-Net, which is similar to the original BNN, but augments the model with some full-precision calculations to more faithfully match full-precision models [14]. Zhou et al. generalized quantized networks with the DoReFa model, which broke down full-precision arithmetic into bitwise operations [15]. ABC-Net, developed by Lin et al. [16], used learned scaling factors and multiple parallel binarization layers to add capacity and complexity to BNNs. These efforts led to increased accuracy on the ImageNet dataset, but significantly increased the number and/or complexity of operations required within the model.
Additional research has focused on implementing BNNs on efficient platforms like ASICs, FPGAs and FPGA-CPU hybrid systems. Some works have focused on binarizing even more aspects of the BNN model. Several works have noticed that previous BNNs used zeros to pad the inputs to convolutional layers, even though zero is not part of the binary set of values −1 and 1 [17][18][19]. Multiple works have focused on designing BNN accelerators, which are especially useful in CPU-FPGA designs [17,18,20,21]. Other works have focused on standalone streaming models for very high-speed image classification. While many works have reported success in implementing BNNs in FPGAs for tasks like MNIST and CIFAR-10 image classification, they generally require a large FPGA, like the Xilinx Virtex or Kintex UltraScale, or they employ CPU-FPGA hybrid systems [8] like the Xilinx Zynq MPSoC. In this work, we are interested in making BNNs even more efficient by reducing the computational cost of convolution operations, allowing them to be implemented in small to midsize FPGAs for affordable high-speed classification on low-power and portable platforms.

Jet Features
In our previous work, we introduced the concept of jet features [10], which are a set of convolutional kernels that are especially efficient to compute as a group. They take advantage of the core elements of popular image processing kernels like the Gaussian and Sobel kernels. These features were originally developed to replace the traditional image processing transforms that were used in the ECO features algorithm [22].

Definition of Jet Features
Jet features are convolution kernels that can be broken down into a series of smaller convolutions with special 2 × 1 kernels chosen from the set {[1, 1], [1, −1], [1, 1]^T, [1, −1]^T}. Examples of a few jet features are shown in Figure 2. Convolutions with these small kernels take either the sum or the difference of adjacent pixels, in either the vertical or horizontal direction. Taking the sum of adjacent pixels blurs the image, which effectively scales the input image [22]. Taking the difference of adjacent pixels produces a partial derivative, useful for gradient calculation. For this reason, we refer to these small kernels as scaling factors and partial derivatives, as shown in Figure 3.
Traditional computer vision algorithms make heavy use of scaling operations and derivative calculation [22,23]. The most popular image processing kernels, like the Sobel, Gaussian and Laplacian filters, can be duplicated or approximated from a series of these scaling factors and partial derivatives. Figure 4 shows how the Sobel and Gaussian kernels can be duplicated from these building blocks.
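The construction of the Gaussian and Sobel kernels from the 2 × 1 building blocks can be reproduced numerically. The sketch below is an illustration only: the helper names `conv2d_full` and `compose` and the constant names are assumptions, kernels are stored as lists of rows, and the sign convention of the resulting Sobel kernel depends on the (arbitrary) choice of [1, −1] rather than [−1, 1].

```python
def conv2d_full(a, b):
    """Full 2-D convolution of two small integer kernels (lists of rows)."""
    rows = len(a) + len(b) - 1
    cols = len(a[0]) + len(b[0]) - 1
    out = [[0] * cols for _ in range(rows)]
    for i, a_row in enumerate(a):
        for j, a_val in enumerate(a_row):
            for k, b_row in enumerate(b):
                for m, b_val in enumerate(b_row):
                    out[i + k][j + m] += a_val * b_val
    return out


def compose(*kernels):
    """Convolve a sequence of small kernels into one equivalent kernel."""
    result = [[1]]
    for kernel in kernels:
        result = conv2d_full(result, kernel)
    return result


SCALE_X = [[1, 1]]        # horizontal scaling factor
DERIV_X = [[1, -1]]       # horizontal partial derivative
SCALE_Y = [[1], [1]]      # vertical scaling factor
DERIV_Y = [[1], [-1]]     # vertical partial derivative

gaussian = compose(SCALE_X, SCALE_X, SCALE_Y, SCALE_Y)
sobel_x = compose(SCALE_X, DERIV_X, SCALE_Y, SCALE_Y)

# Convolution is commutative, so the order of the building blocks is irrelevant.
sobel_x_reordered = compose(SCALE_Y, DERIV_X, SCALE_Y, SCALE_X)
```

Two scaling factors in each direction give the 3 × 3 Gaussian approximation, and swapping one horizontal scaling factor for a partial derivative gives a Sobel kernel.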

Efficient Calculation
Convolutions with these simple building block kernels require no multiplication, since the kernels only use values of 1 and −1. It is also important to note that the order in which they are applied does not matter. With these building blocks, whole sets of image features can be efficiently calculated.
These convolutions can be fully pipelined in an FPGA or ASIC hardware design with a buffer and an adder/subtractor unit. Operations along the axis of the image that is streamed (usually along the x axis) require only a single pixel to be buffered, and operations in the other direction require a single line to be buffered, as shown in Figure 5.
Since all jet features are made of common elements and the order of those operations does not matter, the results of lower order jet features can be used to calculate higher order ones. This makes it very efficient to calculate all jet features up to a certain degree. By forming a lattice of buffers and add/subtract units, scaling and partial derivative operations can be fed into one another, creating sets of jet features.
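The buffering requirements can be modeled in software with Python generators standing in for pipeline stages. This is a behavioral sketch, not a hardware description: the stage names are hypothetical, pixels are assumed to arrive in raster order, and missing neighbors at the left edge and the top row are assumed to be zero.

```python
from collections import deque


def horizontal_stage(stream, width, sign):
    """Emit pixel + sign * previous pixel; buffers a single pixel."""
    prev = 0
    for index, pixel in enumerate(stream):
        if index % width == 0:
            prev = 0  # assumed zero boundary at the start of each row
        yield pixel + sign * prev
        prev = pixel


def vertical_stage(stream, width, sign):
    """Emit pixel + sign * pixel above; buffers exactly one image line."""
    line = deque([0] * width, maxlen=width)  # assumed zero row above the image
    for pixel in stream:
        above = line[0]          # the value received `width` pixels earlier
        line.append(pixel)       # the oldest entry is dropped automatically
        yield pixel + sign * above


# A 2 x 3 image streamed row by row.
pixels = [1, 2, 3,
          4, 5, 6]

scaled_x = list(horizontal_stage(iter(pixels), width=3, sign=+1))
# Stages feed directly into one another, as in a lattice of add/subtract units.
chained = list(vertical_stage(horizontal_stage(iter(pixels), 3, +1), 3, -1))
```

Because each stage consumes and produces one value per pixel, stages can be chained without storing intermediate images, which mirrors how one stage's output feeds the next in a pipelined design.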

Multiscale Local Jets
Jet features borrow their name from the concept of multiscale local jets, introduced by Florack et al. [24]. A multiscale local jet is the set of partial derivatives of a given function taken to an nth degree in each dimension. This set is then scaled to various sizes producing a set of transforms that encapsulate useful features at various scales. This is a generalization of the scaling and gradient calculations used in most classic computer vision algorithms. It has been used as a general purpose set of features for image classification, image compression and feature matching [25][26][27].

The ECO Jet Features Algorithm
Jet features were originally developed to make the evolution constructed features (ECO features) algorithm [11] more hardware friendly. The ECO features algorithm uses a grab bag of traditional image transforms and a genetic algorithm (GA) to select, configure and order the transforms into a series, called an ECO feature. ECO features are used for image classification.
Many of the original transforms used by the algorithm were not hardware-friendly, making it impractical to implement them in an FPGA. It was observed that the convolutional filters, especially the Sobel and Gaussian filters, were selected most often by the GA. Instead of making hardware units to compute each type of convolutional filter, jet features were developed as a set of convolutional filters that could all be easily computed with limited resources. With jet features, the GA determines the number of scaling factors and partial derivatives used in each feature. This variation of the algorithm is called the ECO jet features algorithm. It is comparable to the original ECO features algorithm in accuracy while achieving faster execution times on CPUs and being easy to implement in an FPGA.

Neural Jet Features
In this work, we propose a new binarized convolutional layer called the neural jet features layer. Neural jet features are jet features used in place of standard binarized convolutional filters, trained through deep learning. Neural jet features require fewer parameters and fewer operations than the traditional convolutional layers used in BNNs. They learn essential classic computer vision kernels, combined through deep learning. It is not possible for standard BNN convolutional layers to learn these classic computer vision features. The results in Section 4 show that BNNs using neural jet features achieve comparable accuracy on certain image classification tasks compared to BNNs using binary convolutional layers. Convolutions typically account for the majority of operations in a BNN, and by using fewer operations to perform convolution, neural jet features allow BNNs to fit into resource-limited ASICs and FPGAs.
The small 2 × 1 kernels that make up a jet feature, as shown in Figure 3, differ only in orientation and whether their second element contains a −1 or 1. A 3 × 3 jet kernel is formed from four of these smaller kernels, thus four binary weights need to be selected to form a 3 × 3 jet feature.

Constrained Neural Jet Features
In our previous work on the ECO jet features algorithm [10], we observed that the genetic algorithm selected scaling factors ([1, 1]) more frequently than partial derivatives ([1, −1]) when forming jet features. Based on this observation, we experimented with a constrained version of neural jet features where one of the vertical parameters and one of the horizontal parameters were forced to be 1, forming scaling factors more often than partial derivatives. This reduces the computational cost of neural jet features (Section 3.2) while increasing the average accuracy in some of our testing compared to unconstrained neural jet features (Section 4). With only two binary parameters to be learned per kernel, there are only four possible kernels: the Gaussian kernel, the vertical and horizontal Sobel kernels, and a diagonally oriented edge detector, as shown in Figure 2. The constrained version of neural jet features is more efficient than the unconstrained version with comparable or better accuracy, and it is the form of neural jet features that we propose.

Computational Efficiency
Neural jet features learn how to combine input channels using these four classical computer vision kernels, shown in Figure 2. These kernels have been essential to traditional computer vision and are often used multiple times within a single algorithm [22,23]. Since there are only four possible kernels to be learned, it may make more sense to view a neural jet feature layer as a layer that learns to combine the four features of each input channel with the four features of all the other input channels. Even though there are only four possible features to learn for each input channel, there are 4^N unique ways in which to combine the features of the input channels to form a single output channel (one of four kernels chosen for each channel), where N is the number of input channels.
Like traditional convolutional layers, neural jet features reuse weights in ways that emulate traditional computer vision and image processing. Instead of learning separate weights for every combination of input and output pixel, as done in fully connected layers, convolutional layers form weights into kernels, that are applied locally and reused throughout the entirety of the input image, greatly reducing the number of weights to be learned. Similarly, neural jet features also reuse weights. Neural jet features do not learn unique kernels for each and every combination of input and output channels. Instead, all four 3 × 3 jet features are applied to each input channel and then reused as needed to form the output channels. This reduces the number of operations required, especially when there are more than just a few output channels.
All four jet features are made up of similar 2 × 1 convolutions, as shown in Figure 4. Since all four jet features are always calculated for every input channel (and reused as needed), these convolutions can be effectively calculated as a group. The smaller 2 × 1 convolutions that are common to multiple features can be calculated first and reused to calculate the larger 3 × 3 jet features. Figure 6 shows how these 2 × 1 kernels can be applied in an effective manner to reduce the number of operations that are needed. By contrast, four arbitrary 3 × 3 binary convolutions are not guaranteed to share any common operations, thus they must be calculated independently of each other.
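The sharing of 2 × 1 operations among the four constrained kernels can be illustrated on a small image. This sketch is one possible ordering of the shared operations and is not taken from Figure 6; the function names and dictionary keys are hypothetical, neighbors are combined as a sliding sum or difference over the valid region only, and kernel signs follow from that convention.

```python
def pair_x(img, sign):
    """Combine horizontally adjacent pixels: img[i][j] + sign * img[i][j + 1]."""
    return [[row[j] + sign * row[j + 1] for j in range(len(row) - 1)]
            for row in img]


def pair_y(img, sign):
    """Combine vertically adjacent pixels: img[i][j] + sign * img[i + 1][j]."""
    return [[img[i][j] + sign * img[i + 1][j] for j in range(len(img[0]))]
            for i in range(len(img) - 1)]


def constrained_jet_features(img):
    """Compute all four constrained 3 x 3 jet features with shared passes."""
    # The two scaling factors that the constraint forces to 1 are common to
    # every kernel, so they are applied once and shared (2 passes).
    base = pair_y(pair_x(img, +1), +1)
    # The learned horizontal parameter: scaling or derivative (2 passes).
    x_scale = pair_x(base, +1)
    x_deriv = pair_x(base, -1)
    # The learned vertical parameter applied to both branches (4 passes).
    return {
        "gaussian": pair_y(x_scale, +1),
        "sobel_dx": pair_y(x_deriv, +1),
        "sobel_dy": pair_y(x_scale, -1),
        "diagonal": pair_y(x_deriv, -1),
    }


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
features = constrained_jet_features(image)
```

This sketch uses eight 2 × 1 passes to obtain all four features, whereas applying the four building blocks of each kernel independently would take sixteen.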
Both of these aspects, kernel reuse and common 2 × 1 convolutions, make neural jet features computationally efficient compared to standard binary convolutions. These bitwise efficiencies are particularly well suited for FPGA and ASIC implementations where data paths are designed at a bitwise level of granularity.
We illustrate the computational efficiency of neural jet features with a potential FPGA hardware design, shown in Figure 6. The top diagram shows a typical multiply-accumulate operation for an input channel in a BNN. The multiplication operations are simply bitwise XNOR operations. The addition operations are organized into a tree structure to reduce the resources needed. One accumulation tree is required for every output channel. In contrast, the number of accumulation operations in a neural jet feature layer does not increase with added output channels. All features are calculated and reused as needed to form the output channels. The addition and subtraction units in the bottom two diagrams are the same units shown in Figure 5.
To form a rough comparison of the computational cost of each of the options shown in Figure 6, we assign a cost of 1, 2, 3 or 4 full adder (FA) units to each of the operations depending on the number of bits in their operands. We assign a cost of 2 to the final addition of the standard BNN convolutional layer, which has a 4-bit operand and a 1-bit operand. The standard BNN would cost 13 FAs per output channel. For the unconstrained neural jet features, the cost would be 61 FAs to produce all 9 features. The constrained version would only cost 27 FAs for all possible jet features. In layers with 3 or more output channels, calculating all of the constrained jet features is less expensive than standard BNN convolutions. For example, in a layer with 64 output channels, a standard BNN layer would cost 832 FAs (64 × 13), while a constrained neural jet feature layer would cost only 27 FAs, since the features would be reused as needed. The number of accumulation resources does not scale with the number of output channels as it does in standard BNN convolutional layers. We note that these comparisons are hypothetical, and as part of our future work we plan to implement standard BNN convolutional layers and neural jet feature layers in an FPGA to more accurately demonstrate their computational efficiency.
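The arithmetic behind this comparison is simple enough to reproduce directly. The sketch below only restates the full adder counts quoted in the text (13, 61 and 27); the constant and function names are hypothetical, and like the comparison itself it is a rough estimate rather than a measured resource count.

```python
BNN_FA_PER_OUTPUT_CHANNEL = 13   # standard BNN accumulation tree
UNCONSTRAINED_JET_FA = 61        # all 9 unconstrained jet features
CONSTRAINED_JET_FA = 27          # all 4 constrained jet features


def standard_bnn_cost(output_channels):
    """One accumulation tree is needed for every output channel."""
    return BNN_FA_PER_OUTPUT_CHANNEL * output_channels


def break_even_channels(jet_cost):
    """Smallest number of output channels at which the fixed jet feature
    cost drops below the standard BNN cost."""
    channels = 1
    while standard_bnn_cost(channels) <= jet_cost:
        channels += 1
    return channels


cost_64 = standard_bnn_cost(64)
constrained_break_even = break_even_channels(CONSTRAINED_JET_FA)
unconstrained_break_even = break_even_channels(UNCONSTRAINED_JET_FA)
```

Under these figures the constrained layer becomes cheaper at 3 output channels, matching the statement above, and the same calculation puts the unconstrained layer's break-even point at 5 output channels.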

Results
We tested neural jet features on datasets that are representative of visual inspection tasks where the images are of a somewhat consistent scale and/or orientation. We used the BYU Fish dataset [10], BYU Cookie dataset and the MNIST dataset [28]. The MNIST dataset has a bit more variety, which is not typical of visual inspection datasets, but it does lend insight into how jet features compare on a widely used dataset.

Model Architecture
We experimented with three different types of convolutional layers: standard binarized convolutional layers [6], unconstrained neural jet feature layers, and constrained neural jet feature layers (see Section 3.  Table 1. We note that these models are smaller than most BNNs used throughout the literature [8]. All convolutional layers used the same number of filters for a given experiment. The activation function was the binarization operation that takes place after the batch normalization layers, except after the final batch normalization layer where a softmax function was used. The inputs were not binarized, similar to other BNNs throughout the literature.

BYU Fish Dataset
The BYU Fish dataset consists of images of eight different fish species, all oriented the same way, 161 pixels wide and 46 pixels high. Examples are shown in Figure 8. The model used when testing the BYU Fish dataset used eight filters in the convolutional layers and 16 neurons in the fully connected layer.
From the results shown in Figure 9, we see that the standard BNN convolutional layers and the constrained neural jet features performed similarly on the BYU Fish dataset, both reaching an average accuracy of 99.8%. Unconstrained neural jet features performed significantly worse, hovering around 95% accuracy. A similar pattern was seen with the BYU Cookie dataset as well, which we hypothesize is due to the fact that the unconstrained neural jet features are allowed to learn features that are not as useful as the ones the constrained version learns.

BYU Cookie Dataset
The BYU Cookie dataset includes images of sandwich-style cookies that are either in good condition, offset, or broken, as seen in Figure 10. These images are 100 × 100 pixels in size. This dataset is fairly small, with 345 training images and 88 validation images. The validation accuracy on the BYU Cookie dataset, shown in Figure 11, shows that the normal BNN convolution and constrained neural jet features outperform unconstrained neural jet features. In addition, we see that validation accuracy is sporadic over the course of training, which is to be expected when dealing with a smaller dataset. The results seem to be more consistent when using constrained neural jet features than standard BNN convolutional layers, which can also be observed in the results from the MNIST dataset.

MNIST Dataset
The MNIST dataset consists of 70,000 images of handwritten digits [28], 28 × 28 pixels in size. We trained three models of different sizes on this dataset: one model consisting of 16 filters and 32 fully connected units, one with 32 filters and 128 fully connected units and another with 64 filters and 256 fully connected units, all of which are smaller than other models trained on this dataset [8]. Figures 12-14 show the validation accuracy of these models, respectively. The scale of the y-axis is kept consistent between these three figures in order to easily compare the results between them. In the largest model, the average accuracy of all three layer types approached 99%. On the smaller models, the normal BNN convolutions produce inconsistent results, as shown in Figures 12 and 13 and as seen on the BYU Cookie dataset. This demonstrates a known difficulty in working with small BNNs. Switching binarized weights between the values −1 and 1 can have dramatic effects in local regions of the network during the learning process. By adding more weights to a model, this effect is mitigated, as seen in Figure 14. Our experiments show that neural jet features are less susceptible to this effect, making them a good choice for smaller BNN models. We postulate that this is due to the fact that neural jet feature convolutional layers are limited to the classic computer vision kernels shown in Figure 2.
Table 2 summarizes the performance of these three methods on all three datasets after 100 epochs of training. Although the final accuracies for the normal binary convolution and constrained jet convolution are comparable for the BYU Fish dataset and the MNIST dataset using 16 and 64 filters, constrained jet convolution had much more stable performance, as shown in Figures 9 and 11-14.

Conclusions
We have presented neural jet features, a binarized convolutional layer that combines the power of deep learning with classic computer vision kernels. Not only do neural jet features require fewer operations than standard convolutional layers in BNNs, but they are also more stable in the smaller models that would be used for visual inspection tasks. Neural jet features have comparable accuracy on visual inspection datasets while requiring fewer operations and parameters. Neural jet features offer an efficient solution for resource-limited systems that can take advantage of their bitwise efficiency, like ASIC and FPGA designs. Future work includes implementing a BNN on an FPGA using neural jet features to measure logic resource savings. We also plan on exploring the use of neural jet features in various state-of-the-art topologies like attention-based models [29] and skip-connection-based models [30].