Unlike full-precision networks, binarized CNNs are restricted to discrete quantization values, which limits their ability to learn rich feature representations and makes it difficult for them to retain and transfer information efficiently. In addition, binary quantization introduces quantization errors, and during training there is a mismatch between the gradients used in forward and backward propagation; both effects lower the accuracy of binarized CNNs and degrade the final classification. We therefore address these problems and improve the accuracy of binarized CNNs for vehicle classification by learning more effective feature representations.
In this section, we describe our binarized CNN model in detail. We first describe how the building blocks of ResNet are modified to suit a binarized network, achieving higher accuracy with a more streamlined structure, and we propose a new pooling method within that structure. We then introduce a weight redistribution binarization method. Finally, we present how to train the binarized model more effectively.
3.1. Improved Binarized Residual Network
In network models such as VGG and GoogLeNet [21], accuracy saturates and even degrades as network depth increases, and vanishing and exploding gradients occur. The residual network proposed by He et al. in 2016 introduces residual units that realize identity mappings through shortcut connections, which effectively alleviates these problems. The residual units also increase the number of paths along which information is propagated through the network, allowing deeper models to be trained effectively while maintaining high accuracy.
An improved residual block is illustrated in Figure 1. Unlike ResNet, in a binarized CNN the binary quantization of activation values and weights causes a serious loss of network information; we therefore use denser residual connections to retain the information in the network more effectively and to improve its expressiveness. In a traditional residual network, the downsampling layer applies a full-precision 1 × 1 convolution and a normalization layer to expand the channels and downsample; its main purpose is to produce an output of the same size as the output of the convolutional path, and its contribution to network performance is not significant. In a binary quantized network, such a structure also adds extra floating-point operations. Accordingly, this study proposes a new downsampling method, called absolute value maximum pooling (Abs-MaxPooling) (Figure 2), which retains the value with the largest absolute magnitude in each pooling block of the input feature map.
Furthermore, in our downsampling layer the binarized convolution does not expand the number of channels; its output and the Abs-MaxPooling output are subsequently concatenated to form the input of the next layer. The downsampling layer thus avoids the floating-point operations of the 1 × 1 convolution and normalization layers, and the number of binary convolution kernels is halved.
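To make the pooling rule concrete, the following is a minimal PyTorch sketch of Abs-MaxPooling, assuming 2D feature maps and non-overlapping windows; the function name and interface are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def abs_max_pool2d(x, kernel_size=2, stride=2):
    """Keep, in each pooling window, the element with the largest absolute
    value (its sign is preserved), unlike standard max pooling, which keeps
    the largest signed value."""
    # Locate the position of the largest |x| in every window.
    _, idx = F.max_pool2d(x.abs(), kernel_size, stride, return_indices=True)
    # Gather the original signed values at those positions.
    pooled = x.flatten(2).gather(2, idx.flatten(2))
    return pooled.view_as(idx)

# Example: pooling a real-valued feature map before binarization; the retained
# values keep their signs, so the subsequent sign() output tends to stay
# balanced between +1 and -1.
fm = torch.randn(1, 8, 32, 32)
out = abs_max_pool2d(fm)  # shape (1, 8, 16, 16)
```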
For a feature map with a binary distribution, we assume that it obeys a Bernoulli distribution whose probability distribution function is P(x = +1) = p and P(x = −1) = 1 − p (Equation (1)), where p is the probability of taking the value +1, 1 − p is the probability of taking the value −1, and x is the binarized value. The information entropy of the distribution after binarization is then expressed using Equation (2): H(x) = −p log p − (1 − p) log(1 − p).
For the binarized distribution to retain the maximum amount of information, its information entropy should be maximized, i.e., p should be chosen to maximize H(x) (Equation (3)). The Bernoulli entropy is maximal when p = 1 − p, i.e., p = 0.5, which means that the binarized values should be uniformly distributed: the probabilities of +1 and −1 should be nearly equal. Experiments showed that with Abs-MaxPooling, the numbers of +1 and −1 in the binarized feature map are close to 1:1, so its information entropy is close to the maximum.
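For completeness, a short derivation of the maximizing probability from the entropy expression above (the natural logarithm is assumed; the base does not change the maximizer):

```latex
\frac{dH}{dp}
  = \frac{d}{dp}\Bigl[-p\ln p-(1-p)\ln(1-p)\Bigr]
  = \ln\frac{1-p}{p} = 0
  \;\Longrightarrow\; 1-p = p \;\Longrightarrow\; p = 0.5,
```

and since the second derivative, −1/(p(1 − p)), is negative, p = 0.5 is indeed the maximum.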
3.3. Dynamic Progressive Training
Similar to training a full-precision neural network model, a gradient descent-based backpropagation algorithm is used to update the parameters when training the binarized model. The binarized weights and activation values are used in forward propagation, while the full-precision parameters are updated in backpropagation so that the model can be fully trained. However, the derivative of the sign function is zero almost everywhere, which causes the gradient to vanish so that the parameters cannot be updated; a gradient approximation is therefore unavoidable in backpropagation. Three common approximation methods are considered in this study (Figure 3).
The first method uses the identity function (i.e., F(x) = x, so the derivative of sign is approximated as 1 everywhere) to pass the gradient of the output value directly to the input value, completely ignoring the effect of binarization. The obvious mismatch between the actual gradient of sign and this constant gradient leads to a large gradient error, and the errors accumulate during backpropagation, so training deviates from the true optimum, resulting in an under-optimized binary network and, thus, seriously degraded performance.
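Written out (with our notation: C is the loss, x the full-precision input, and b = sign(x) its binarized output), this estimator simply copies the upstream gradient even though the true derivative of sign is zero almost everywhere:

```latex
\frac{\partial C}{\partial x} \approx \frac{\partial C}{\partial b},
\qquad\text{whereas}\qquad
\frac{\partial\,\operatorname{sign}(x)}{\partial x} = 0 \quad \text{for all } x \neq 0 .
```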
The second method is the straight-through estimator (STE) proposed by Hinton et al., defined using Equation (5): the gradient is passed unchanged where |x| ≤ 1 and set to 0 elsewhere, which is equivalent to differentiating the hard clipping function clip(x, −1, 1). The STE thus takes the effect of binary quantization into account by clipping the part whose magnitude exceeds 1, reducing the gradient error. However, the STE passes gradient information only within the interval [−1, +1]; outside that range the gradient becomes 0. That is, once a value falls outside [−1, +1] it can no longer be updated, a problem similar to the "dying neuron" issue of the ReLU (Rectified Linear Unit) activation function.
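A minimal PyTorch sketch of binarization with the STE backward pass (an illustrative sketch, not the authors' code; note that torch.sign maps 0 to 0, which practical implementations usually special-case):

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign(x) in the forward pass; straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient only where |x| <= 1; outside that range it is 0,
        # so those parameters stop receiving updates.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.randn(4, requires_grad=True)
SignSTE.apply(x).sum().backward()  # x.grad is 1 where |x| <= 1, else 0
```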
The third method is the ApproxSign function, proposed by Liu et al. in Bi-Real Net. ApproxSign replaces the sign function when computing gradients in backpropagation and is expressed using Equation (6). Its derivative approximates the gradient of the sign function with a triangular wave, which is closer to the impulse function than the STE and therefore approximates the true gradient of sign more closely. However, the problem remains that parameters are no longer updated once their values fall outside the interval [−1, +1].
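For reference, the piecewise ApproxSign of Bi-Real Net and its triangular-wave derivative take the following standard form (we assume Equation (6) follows Liu et al.'s original definition):

```latex
\operatorname{ApproxSign}(x)=
\begin{cases}
-1, & x < -1\\
2x + x^{2}, & -1 \le x < 0\\
2x - x^{2}, & 0 \le x < 1\\
1, & x \ge 1
\end{cases}
\qquad
\frac{\partial \operatorname{ApproxSign}(x)}{\partial x}=
\begin{cases}
2 + 2x, & -1 \le x < 0\\
2 - 2x, & 0 \le x < 1\\
0, & \text{otherwise.}
\end{cases}
```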
However, it is crucial that all parameters are updated effectively during training, especially at the beginning. To address this problem, we propose a dynamic progressive training method: at the start of training we try to ensure that all parameters are updated, and in the subsequent training the backpropagation gradient is gradually brought closer to that of the sign function. Instead of sign, backpropagation uses a function expressed by Equation (7):
where λ changes with the training epoch as expressed by Equation (8), and k is given by Equation (9), in which i is the index of the current training epoch, N is the total number of training epochs, and the corresponding range parameters are set to −1 and 2, respectively. Our approximation function can effectively update parameters outside the interval [−1, +1] at the start of training, and as training proceeds it approaches the sign function more closely than the other approximation functions (Figure 4).
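Since Equations (7)–(9) are not reproduced in this excerpt, the following is only one plausible instantiation consistent with the description: a tanh-based surrogate whose steepness λ grows exponentially with the epoch index (from 10^−1 to 10^2, matching the stated parameters −1 and 2), together with a scale factor k. The function name, the tanh form, and the choice k = max(1/λ, 1) are our assumptions, not necessarily the authors' exact formulas.

```python
import torch

def progressive_sign(x, epoch, total_epochs, t_min=-1.0, t_max=2.0):
    """A possible progressive surrogate for sign, used only in backpropagation.

    Assumptions (not the paper's exact Equations (7)-(9)):
      lam grows exponentially from 10**t_min to 10**t_max over training, so the
      curve starts almost linear (every parameter, even with |x| > 1, receives
      a nonzero gradient) and gradually steepens toward sign(x).
    """
    lam = 10.0 ** (t_min + (t_max - t_min) * epoch / total_epochs)  # assumed Eq. (8)
    k = max(1.0 / lam, 1.0)                                         # assumed Eq. (9)
    return k * torch.tanh(lam * x)                                  # assumed Eq. (7)

# Early epochs: gradients are nonzero even for |x| > 1;
# final epochs: the surrogate is close to sign(x).
x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)
y_early = progressive_sign(x, epoch=0, total_epochs=100)
y_late = progressive_sign(x, epoch=100, total_epochs=100)
```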