Increasing Information Entropy of Both Weights and Activations for the Binary Neural Networks

Abstract: In terms of memory footprint and computing speed, binary neural networks (BNNs) have great advantages in power-aware deployments such as AIoT edge terminals and wearable and portable devices. However, binarization inevitably causes considerable information loss, which in turn degrades accuracy. To tackle these problems, we analyze BNNs from the perspective of information theory and improve the networks' information capacity. Based on this analysis, our work makes two primary contributions. The first is a newly proposed median loss (ML) regularization technique. It makes the distribution of the binary weights more even and consequently increases the information capacity of BNNs. The second is the batch median of activations (BMA) method. It raises the entropy of the activations by subtracting a median value, and simultaneously lowers the quantization error by computing separate scaling factors for the positive and negative activations. Experimental results show that the proposed methods, applied to ResNet-18 and ResNet-34, outperform the Bi-Real baseline by 1.3% and 0.9% Top-1 accuracy on ImageNet 2012, respectively. The increases in storage cost and computational complexity introduced by ML and BMA are minor. Additionally, comprehensive experiments show that our methods can be embedded into currently popular BNNs, improving accuracy with negligible overhead.


Introduction
In the past few decades, deep convolutional neural networks (CNNs) have evolved rapidly. This technology shows excellent performance on many tasks, such as image recognition [1,2], object detection [3,4], and segmentation [5]. A large part of the reason for this performance is that traditional CNNs usually have a large number of parameters and floating-point operations (FLOPs). This property gives CNNs strong representation ability, but at an intensive computational cost and memory footprint. With existing hardware, these complex models can be trained and run effectively on cloud servers equipped with powerful GPUs, but they remain difficult to deploy on resource-limited platforms such as smartphones, AR/VR devices, and drones. To solve this problem, a growing number of researchers are exploring how to reduce the size and FLOPs of network models with minimal accuracy loss.
Existing solutions fall into two categories. The first category targets the network structure and designs efficient architectures with less computation and a smaller memory footprint, such as MobileNet [6], SqueezeNet [7], ShuffleNet [8], and DenseNet [9]. The second category applies further lightweight optimization to existing complex networks, mainly through parameter pruning [10][11][12] and quantization [13][14][15][16][17]. The binary neural network (BNN) is an extreme case of quantization. It has attracted increasing attention because both activations and weights are quantized to {−1, +1}. As a result, the calculations inside BNNs reduce to simple XNOR and Bitcount operations. This yields significant improvements both in run time and in power-aware hardware implementations.
However, the accuracy of a BNN declines dramatically compared with the full-precision network. The reason is that when both activations and weights are quantized to {−1, +1}, the representation ability of the network drops sharply, resulting in severe information loss. To solve this problem, two approaches have been proposed in previous research: (1) To minimize the information loss of parameters, IR-Net [18] adopts Libra Parameter Binarization (Libra-PB) to balance and standardize weights in forward propagation. (2) To increase the information entropy of the entire system, a Shannon-entropy-based information loss penalty has been proposed for BNNs [19]. Both methods effectively mitigate the information loss during binarization, thereby improving the representation ability and the accuracy of the whole system. Even so, IR-Net ignores the fact that the binarized weights do not obey the normal distribution, so the normalization it adopts cannot attain the maximum entropy of the binarized weights. In addition, IR-Net deals only with the weights and leaves out the activations, as does reference [19]. Reference [20] summarizes most of the existing binarization methods, such as CI-BCNN [21] and BCGD [22], all of which aim to improve the performance of BNNs. In our work, we start from the new aspect of increasing information entropy, and consider both the weights and the activations inside BNNs thoroughly. Figure 1 illustrates the basic blocks of the binarization process in our work. Within it, a median loss (ML, red block in Figure 1) method is proposed to make the binarized network weights more evenly distributed. This yields a greater entropy of the weights in each layer and improves the representation ability of the whole network accordingly.
In addition, if the positive and negative activation amplitudes differ greatly in the binarized network, quantization with a single unified scaling factor leads to considerable error, and the network's performance deteriorates seriously. To solve this problem, the batch median of activations (BMA, orange block in Figure 1) scheme is also proposed. It also maximizes the information entropy of the activations in each layer.

Our work has the following two main contributions: (1) From the perspective of entropy maximization, we propose a new regularization technique for binary neural networks called the median loss. This technique helps keep the information entropy near its upper bound, leading to higher network accuracy.
(2) To minimize the quantization error and avoid deterioration of network performance, we propose the BMA method, which calculates the positive and negative scaling factors of the activations separately during forward propagation. At the same time, the method subtracts a median value from the activations, which also helps to maximize the information entropy of the activations in each layer.
Taking Bi-Real/ResNet-18 and ResNet-34 [23] on ImageNet 2012 as baselines, our methods increase accuracy by 1.3% and 0.9%, respectively. This shows that our methods are effective in improving the accuracy of binary neural networks. In addition, we also apply the proposed methods within the binarization process of state-of-the-art BNNs. The final experimental results indicate that the accuracy of the whole network improves accordingly, which validates the versatility of this work.

Binarized Neural Networks
Binarized Neural Networks (BNNs) were first proposed in 2016 [24]. They have since attracted much attention because their weights and activations are binarized, which speeds up inference and saves considerable computation and memory. The basic principle of BNNs is presented in Equation (1), where $a_r$, $w_r$, and $z_r$ represent the full-precision input activations, weights, and output activations, respectively, and $a_b$ and $w_b$ represent the binarized activations and weights. $\alpha \in \mathbb{R}^{c_o}$ and $\beta \in \mathbb{R}^{c_i}$ denote the scaling factors of the weights and input activations. In practice, the Sign function is usually used to obtain the binarized weights and activations in Equation (1). In recent years, some novel binarization methods [18,19] have also been proposed to obtain higher accuracy. Finally, the output activations are obtained from the binarized weights and input activations by the bitwise operations XNOR and Bitcount, as formulated in Equation (2). In backward propagation, the derivative of the Sign function is zero almost everywhere, which makes binarized models hard to optimize. To handle this problem, the straight-through estimator (STE) [25] was proposed and is used to train BNNs. It employs the Identity or Hardtanh function to propagate the gradients. Moreover, many previous works [26][27][28] further correct the backward gradient mismatch with improved static binarization functions that originate from the STE method.
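The forward and backward rules described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, sign(0) is mapped to +1 by assumption, and the scaling factors $\alpha$ and $\beta$ are left out. The snippet checks that a dot product of {−1, +1} vectors equals 2 · popcount(XNOR) − n when +1 is encoded as bit 1, and shows a Hardtanh-style straight-through gradient.

```python
import numpy as np

def binarize(x):
    """Sign binarization to {-1, +1}; zero is mapped to +1 (assumed convention)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_bitcount_dot(a_b, w_b):
    """Dot product of two {-1, +1} vectors via XNOR and bitcount.

    With +1 encoded as bit 1 and -1 as bit 0, XNOR marks the positions where
    the signs agree, so dot = agreements - disagreements = 2 * popcount - n.
    """
    a_bits = a_b > 0
    w_bits = w_b > 0
    agree = ~(a_bits ^ w_bits)          # XNOR on boolean arrays
    return 2 * int(np.count_nonzero(agree)) - a_b.size

def ste_hardtanh_grad(x, grad_out):
    """Straight-through estimator: pass the gradient where |x| <= 1, else zero."""
    return grad_out * (np.abs(x) <= 1.0)

rng = np.random.default_rng(0)
a_r = rng.standard_normal(64)           # full-precision activations (synthetic)
w_r = rng.standard_normal(64)           # full-precision weights (synthetic)
a_b, w_b = binarize(a_r), binarize(w_r)
reference = int(np.dot(a_b.astype(np.int32), w_b.astype(np.int32)))
```

Here `xnor_bitcount_dot(a_b, w_b)` and `reference` agree, which is the identity that lets hardware replace multiply-accumulate with bitwise operations.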

Information Entropy
Shannon information entropy is a milestone in the development of information theory. It was first proposed by C.E. Shannon in 1948 [29]. It is defined as the expected amount of information H(X) in a random variable, and can be formulated as Equation (3). With it, the information uncertainty of a system can be quantified mathematically with a precise value.
Electronics 2021, 10, 1943
where $x_i$ denotes a possible value of the random variable $X$, and $p(x_i)$ denotes the probability that $X$ takes the value $x_i$. After Shannon entropy was proposed, information theory developed rapidly, and various classical theories followed, including the principle of maximum entropy [30]. Given only partial knowledge, the most reasonable inference about the unknown distribution is the one with the greatest uncertainty that is consistent with the known part [31]. This is the essence of maximum entropy. For this reason, the more uniform and unbiased an input dataset is, the higher its entropy and the better the generalization it supports. This is the inspiration and theoretical basis for the approaches proposed in this article, which implement BNNs under the maximum entropy principle.
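As a small numerical illustration of the definition above, the following sketch evaluates the standard Shannon entropy $H(X) = -\sum_i p(x_i) \log p(x_i)$ for a uniform and a skewed distribution. The function name and the choice of base 2 (bits) are assumptions made for this example.

```python
import math

def shannon_entropy(probs, base=2.0):
    """H(X) = -sum_i p(x_i) * log p(x_i); terms with p = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)

uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])   # four equally likely values
skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])        # one dominant value
```

The uniform case gives the larger value, in line with the maximum entropy principle discussed above.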
In the past few years, many excellent works [18,19] have applied information theory to neural networks. These works are dedicated to maximizing the entropy in the network: the higher the network entropy, the stronger the generalization ability and the better the performance. Nevertheless, existing works deal only with the weights and leave out the activations. In this paper, on the one hand, we propose a new regularization approach to maximize the entropy of the network's weights. On the other hand, we obtain more evenly distributed activations by subtracting their median value, which increases and maximizes the information entropy of the binarized activations.

Median Loss (ML)
An important reason for the accuracy deterioration of binarized networks is that, in forward propagation, the binarization of weights and activations causes a large amount of information loss. From the perspective of information theory, binarization decreases the entropy and thereby reduces the representation ability of the network. To handle this problem, we propose the median loss as a new regularization approach to minimize the information loss.
In a traditional binary network, both the activations and weights are quantized to {−1, +1}. Thus, the quantized values can be modeled by the Bernoulli distribution, formulated as Equation (4), where $p_a, p_w \in [0, 1]$ denote the probability of taking the value +1. The entropy of this random variable can be calculated as Equation (5). In addition, setting the derivative of the entropy in Equation (6) to zero shows that the entropy is largest when $p$ equals 0.5, that is, when the two values are equally likely.
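For reference, the standard textbook forms of the Bernoulli distribution, its entropy, and the derivative argument are written out below. The symbol $p$ stands for either $p_a$ or $p_w$; the natural logarithm is an assumed choice, since the base only rescales $H$, and the layout may differ from the original Equations (4) to (6).

```latex
\begin{aligned}
P(x) &= \begin{cases} p, & x = +1 \\ 1 - p, & x = -1 \end{cases} \\
H(p) &= -p \log p - (1 - p)\log(1 - p) \\
\frac{dH}{dp} &= \log\frac{1 - p}{p} = 0 \;\Longrightarrow\; p = \tfrac{1}{2}, \qquad
\frac{d^2 H}{dp^2} = -\frac{1}{p(1 - p)} < 0
\end{aligned}
```

Because the second derivative is negative on $(0, 1)$, the stationary point $p = 1/2$ is the unique maximum, with $H(1/2) = \log 2$.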
To minimize the information loss in the forward propagation of BNNs, many methods have been proposed to increase the entropy in Equation (5). IR-Net [18] normalizes the weights through the operation of Equation (7), so that the quantized values are more evenly distributed and achieve a larger entropy. However, this method ignores the important problem that the binarized weights do not obey the normal distribution, so the normalization operation of Equation (7) cannot distribute the values as evenly as expected. Furthermore, IR-Net only considers maximizing the information capacity of the weights and ignores the activations. In related research, Dmitry Ignatov [19] also proposes a Shannon-entropy-based information loss penalty to increase the networks' entropy. Nevertheless, this method still does not involve the network activations, and it is too complicated to deploy in practice.
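The point about normalization can be illustrated on synthetic data. The sketch below standardizes two weight samples in the manner of sign((W − μ)/σ) and measures the fraction of +1 values afterwards. The distributions are illustrative assumptions (a Gaussian and a shifted exponential), not weights from a trained network, and the helper names are hypothetical.

```python
import numpy as np

def positive_fraction_after_standardize(w):
    """Fraction of +1 after sign((w - mean) / std); std > 0 does not affect the sign."""
    z = (w - w.mean()) / w.std()
    return float(np.mean(z > 0))

def bernoulli_entropy_bits(p):
    """Entropy of a {-1, +1} variable with P(+1) = p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

rng = np.random.default_rng(0)
symmetric = rng.standard_normal(10_000)          # mean and median nearly coincide
skewed = rng.exponential(1.0, 10_000) - 1.0      # mean 0, median about ln 2 - 1 < 0
p_sym = positive_fraction_after_standardize(symmetric)
p_skew = positive_fraction_after_standardize(skewed)
h_sym, h_skew = bernoulli_entropy_bits(p_sym), bernoulli_entropy_bits(p_skew)
```

For the symmetric sample the split is close to one half, whereas for the skewed sample mean-centering leaves noticeably fewer than half of the values positive, so the binarized entropy falls below one bit.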
In this work, a novel regularization technique for BNNs called the median loss is proposed. It is formulated as Equation (8), where $W^l$ represents the floating-point weights in layer $l$, and $W^l_+$ and $W^l_-$ represent the positive and negative values in $W^l$, respectively. $n^l$ represents the number of weights in layer $l$, and $n^l_+$ and $n^l_-$ represent the numbers of positive and negative weights. $|X|_s$ represents the sum of all elements in $X$, with $X \in \{W^l, W^l_+, W^l_-\}$. The information loss penalty is added to the overall loss function in the conventional way, as indicated in Equation (9), and is used in backward propagation.
In Equation (9), $L_{cls}$ represents the original loss function and $\lambda$ represents the regularization parameter. The goal of the proposed median loss is to make the binarized weights evenly distributed, which can be proved as in Equation (10). In this equation, $n^l > 0$, $|W^l_+|_s > 0$, and $|W^l_-|_s < 0$, from which it can be concluded that the median loss takes its unique minimum only when $n^l_+ = n^l_-$. Therefore, the median loss can increase the network's information entropy effectively. Moreover, compared with Dmitry Ignatov's method [19], the proposed median loss is simpler and more intuitive to use, without any loss in accuracy.
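The exact form of the median loss is not reproduced in this text, so the sketch below uses an assumed surrogate that only shares the stated property: it is built from the sums and counts of the positive and negative weights, and it reaches its minimum exactly when $n^l_+ = n^l_-$. The names `balance_penalty` and `total_loss`, the squared-gap form, and the summation over layers are hypothetical and should not be read as the authors' Equation (8).

```python
import numpy as np

def balance_penalty(w):
    """Assumed surrogate (not Equation (8)): squared gap between the layer mean
    and the midpoint of the positive-part and negative-part means.

    Algebraically the gap equals (n_pos - n_neg) / (2 n) * (mean_pos - mean_neg),
    so it vanishes exactly when n_pos == n_neg.
    """
    w = np.asarray(w, dtype=np.float64).ravel()
    pos, neg = w[w > 0], w[w < 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("layer needs both positive and negative weights")
    gap = w.sum() / (pos.size + neg.size) - 0.5 * (pos.mean() + neg.mean())
    return float(gap ** 2)

def total_loss(cls_loss, layers, lam=1e-4):
    """Overall loss in the spirit of Equation (9): L = L_cls + lambda * penalty."""
    return cls_loss + lam * sum(balance_penalty(w) for w in layers)

balanced = balance_penalty([1.0, 2.0, -1.0, -2.0])    # n_pos == n_neg
skewed = balance_penalty([1.0, 2.0, 3.0, -1.0])       # n_pos > n_neg
```

With equal counts the penalty is zero, and any imbalance makes it strictly positive, so gradient descent on `total_loss` would push each layer toward an even split of signs.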

Batch Median of Activations (BMA)
Existing works all process the network's weights to achieve maximum entropy. To obtain a greater entropy and reduce the quantization error in BNNs, we propose a second method in this work, the batch median of activations (BMA). To our knowledge, this is the first time the activations have been processed from an information perspective to improve the network entropy.
As shown in Equation (11), the median value $\omega^l_r$ is subtracted from the activations $A^l_r$. By the definition of the median, whatever the distribution of the data, subtracting the median makes the numbers of positive and negative values equal (apart from elements equal to the median itself), so the elements are split evenly between the two signs. As presented in Section 3.1, such evenly distributed binarized values have the maximum information entropy.
Meanwhile, the more even distribution inevitably affects the scaling factors and the accuracy of the whole network. In the classic XNOR-Net [14], a scaling factor is employed to reduce the quantization error and improve the model's accuracy. Almost all binarized networks follow this idea, including the well-known IR-Net [18]. However, IR-Net is committed to making the distribution of binarized weights even, which naturally brings a large difference between the positive and negative amplitudes. In this case, employing the same scaling factor for the positive and negative activations causes a considerable quantization error. To tackle this problem, the proposed BMA calculates the scaling factors for positive and negative values separately, as shown in Equation (12). This mitigates the information loss introduced by quantization error in forward propagation and further improves the network's accuracy.
Here, $A^l_r$ represents the floating-point activations in layer $l$, and $\omega^l_r$ represents the median of $A^l_r$. $A^l_{m,+}$ and $A^l_{m,-}$ represent the positive and negative values in $A^l_m$, and $n^l_+$ and $n^l_-$ represent the numbers of positive and negative values. $\|\cdot\|_2$ represents the L2 norm. $\gamma_a$ and $\beta_a$ scale and shift the normalized value, respectively [32].
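The two ingredients of BMA, median subtraction and sign-specific scaling, can be sketched as follows. This is a simplified illustration under stated assumptions: $A^l_m$ is taken to be the median-subtracted activations, each scaling factor is the mean absolute value of its sign group (an XNOR-Net-style choice), and the L2-norm term and the learnable $\gamma_a$, $\beta_a$ of Equation (12) are omitted. The function names and the synthetic exponential input are hypothetical.

```python
import numpy as np

def bma_binarize(a_r):
    """Subtract the batch median, then binarize with separate scales per sign.

    Assumption: each scale is the mean absolute value of its sign group, which
    is the least-squares optimal scalar for that group.
    """
    a_m = a_r - np.median(a_r)
    pos = a_m > 0
    neg = a_m < 0
    alpha_pos = np.abs(a_m[pos]).mean() if pos.any() else 0.0
    alpha_neg = np.abs(a_m[neg]).mean() if neg.any() else 0.0
    a_q = np.where(pos, alpha_pos, np.where(neg, -alpha_neg, 0.0))
    return a_m, a_q

def unified_binarize(a_m):
    """Baseline: one scale (mean absolute value) shared by both signs."""
    return np.abs(a_m).mean() * np.sign(a_m)

rng = np.random.default_rng(0)
# Skewed synthetic activations, so that the positive and negative amplitudes
# differ strongly once the median is removed.
a_r = rng.exponential(scale=1.0, size=1024)
a_m, a_q = bma_binarize(a_r)
err_separate = float(np.mean((a_m - a_q) ** 2))
err_unified = float(np.mean((a_m - unified_binarize(a_m)) ** 2))
n_pos, n_neg = int((a_m > 0).sum()), int((a_m < 0).sum())
```

After median subtraction the positive and negative counts match, and the separate scales can never give a larger mean squared error than the shared scale, because the shared scale is one feasible choice of the two separate ones.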

Experimental Details
We further investigate the network performance by applying the proposed ML and BMA separately to XNOR-Net/ResNet-20 on the CIFAR-10 dataset. These experiments help analyze how the ML and BMA methods work in practice. The model adopts Kaiming initialization and is trained from scratch. Training runs for 400 epochs with a batch size of 128. Optimization uses the Stochastic Gradient Descent (SGD) optimizer with momentum = 0.9, weight decay = $10^{-4}$, initial learning rate = 0.1, and cosine learning rate decay.
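For concreteness, the cosine learning-rate decay mentioned above is commonly implemented as in the sketch below. The per-epoch update, the final learning rate of zero, and the absence of warm-up are assumptions, since the text does not specify them; `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(epoch, total_epochs=400, base_lr=0.1, min_lr=0.0):
    """Cosine learning-rate decay from base_lr at epoch 0 to min_lr at the end."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at a few checkpoints of the 400-epoch schedule.
schedule = [cosine_lr(e) for e in (0, 100, 200, 300, 400)]
```

Under these assumptions the rate starts at 0.1, passes through 0.05 at the halfway point, and reaches zero at epoch 400.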
The training flow of BNNs using ML and BMA is illustrated in Algorithm 1. First, forward propagation binarizes the weights and input activations, and then computes the output activations and output losses. ML and BMA are applied in this stage. To remove the influence of the weight-update method on the final results, the algorithm's backward propagation follows the method used in XNOR-Net [14]. Finally, the gradients are computed and the weights are updated.

Ablation Studies
In this part, we conduct several experiments with binary networks on the CIFAR-10 [33] and ImageNet 2012 datasets. The experimental results and analyses characterize the behavior and effects of the proposed ML and BMA techniques on BNNs, and further verify the theoretical analysis in Section 3.

Algorithm 1. Training flow of BNNs using the proposed ML and BMA.
Input: A mini-batch of inputs and targets.
01: Compute the binarized weights and input activations: $A^l_b = \mathrm{sign}(\mathrm{BMA}(A^l_r))$, $W^l_b = \mathrm{sign}((W^l_r - \mu(W^l_r))/\sigma(W^l_r))$
02: Compute the output activations.
03: Compute the output losses.
04: Compute the gradients employing the method adopted in XNOR-Net [14].
05: Update the weights W, where η is the learning rate.

Median Loss (ML)
When training a binarized model, we add the median loss regularization term to the existing loss function. Specifically, the experimental protocol applies the median loss to XNOR-Net/ResNet-20.
To verify the effectiveness of ML, Figure 2 shows the distribution of the binarized weights of each layer in XNOR-Net/ResNet-20 without and with ML. The figure shows that ML makes the distribution more even. Additionally, the entropy value H of the binarized network increases from 5.40 to 5.41, which means that more information is retained inside the network when ML is employed. Furthermore, ML is more effective than other methods at increasing the network entropy. Table 1 summarizes the experimental results obtained on the CIFAR-10 dataset with XNOR-Net/ResNet-20. The networks are trained with the same configuration but with different methods to increase the entropy. Method 1 employs the weight standardization and balance operations proposed in IR-Net [18], while Method 2 applies the proposed ML. Method 2 achieves an accuracy of 84.30%, which is higher than Method 1's 83.85% and the referenced baseline's 80.33%. These comparisons show that ML outperforms both the IR-Net method and the baseline.
Table 1. Accuracy comparison of different methods adopted in XNOR-Net/ResNet-20 (referenced baseline) on the CIFAR-10 dataset. The best results are shown in bold.

Regularization Parameter λ
To determine an optimal value of the regularization parameter λ, a series of preliminary experiments is conducted, with λ taken from the set {$10^{-6}$, $10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$}. The accuracy of the pre-activation XNOR-Net/ResNet-20 binary network is assessed on the validation subset of the CIFAR-10 dataset. We repeat each experiment four times with random weight initialization and take the average of the results as the final accuracy. Table 2 summarizes how the accuracy varies with λ. The larger λ is, the more prior information is imposed on the network; this increases the information entropy of the weights, and the accuracy improves accordingly. Meanwhile, an oversized λ makes the original loss function almost ineffective and causes the training to deviate from the basic task. The ablation study on λ therefore concludes that $10^{-4}$ is a favorable value.

Batch Median of Activations (BMA)
To verify the effectiveness of BMA, we design two experimental settings. One is based on the conventional method, which uses a uniform scaling factor for all activations in the same layer. The other is our BMA, which calculates the scaling factors of each layer's positive and negative activations separately. Figure 3 shows that BMA effectively reduces the quantization error in forward propagation, from 0.59 to 0.02. The models are trained on the CIFAR-10 dataset.
In addition, Figure 4 illustrates the distribution of the binary activations of each layer in XNOR-Net/ResNet-20 without and with BMA. The figure shows that BMA has a strongly positive influence on the distribution of the binary activations. The entropy value of the activations increases from 5.25 to 5.41, which means that more information is retained in the activations of the network.

Comparison with State-of-the-Art Methods
First, we study the performance of ML and BMA on the ResNet-20 topology and compare them comprehensively with other state-of-the-art methods. To verify the improvement brought by combining ML and BMA, we also add both to XNOR-Net simultaneously. As illustrated in Figure 5 and Table 3, the version with ML (XNOR+ML) outperforms XNOR-Net without ML by 3.97% in accuracy, and the version with BMA (XNOR+BMA) outperforms XNOR-Net without BMA by 2.98%. Moreover, the accuracy increases further when XNOR-Net adopts ML and BMA together (XNOR+BMA+ML), with a gain of up to 4.67%.
Table 3. Accuracy results of the XNOR/ResNet-20 baseline and the improved versions with our proposed ML, BMA, and both ML+BMA (all three cases use λ = $10^{-4}$). The best results are marked in bold. Columns: Topology (ResNet-20), Method, Bit-Width (W/A), Accuracy (%).
Next, to verify that ML and BMA can be applied to other network structures and combined with different training methods, we extend the evaluation to a new structure, Bi-Real18 [23]. We also adopt the EDE training method proposed in IR-Net. As illustrated in Figure 6 and Table 4, our version with ML and BMA (IR-Net+BMA+ML) shows a 0.53% accuracy improvement over the IR-Net benchmark.
Table 4. Accuracy results of the IR-Net/BiReal-18 baseline and the improved versions with our proposed ML, BMA, and both ML+BMA (all three cases use λ = $10^{-4}$). The best results are shown in bold. Columns: Topology (BiReal-18), Method, Bit-Width (W/A), Accuracy (%).
Finally, we extend the evaluation to a larger-scale image classification dataset, ImageNet 2012, which contains 1.2 million training samples and 50,000 validation samples. The training configuration is the same as for Bi-Real18/CIFAR-10, except that ResNet18/Bi-Real and ResNet34/Bi-Real are selected as the baselines. The number of training epochs is set to 160, and the input images are all cropped to a resolution of 224 × 224, as in the references. The regularization parameter λ is set to $10^{-4}$. The model is trained on 4 Nvidia RTX2080Ti GPUs with a total batch size of 128. Table 5 compares the performance of the main BNNs currently in use. When Bi-Real is combined with ML and BMA (Bi-MB 3 , Bi-Real+ML+BMA), the accuracy improves by 1.3% and 0.9% for ResNet-18 and ResNet-34, respectively. Our proposed methods are also tested on IR-Net (Bi-IR-MB 4 , Bi-Real+IR-Net+ML+BMA), where the accuracy improves by 0.4% and 0.3% for ResNet-18 and ResNet-34, respectively. Table 5 also lists the performance of another state-of-the-art method, CI-BCNN [21], which mines channel-wise interactions with a reinforcement learning model. Its performance is quite close to that of Bi-MB 3 , which is further evidence of the effectiveness of our proposed BMA and ML methods. These results show that our proposed methods are versatile and applicable: they can be embedded into currently popular BNNs to further improve accuracy.

Storage Cost and Calculation Complexity Analyses
This section analyzes the storage cost and computational complexity of our methods on ResNet18/Bi-Real and ResNet34/Bi-Real. Four modes, Floating-Point, XNOR, Bi-Real, and the proposed Bi-Real+ML+BMA, are compared in Table 6. Compared with the floating-point networks, our Bi-Real+ML+BMA reduces storage by 11.13×/15.96× and speeds up computation by 11.10×/18.96× on ResNet18 and ResNet34, respectively. This is quite close to the performance of Bi-Real.
Additionally, the overhead of the proposed Bi-Real+ML+BMA relative to the standard Bi-Real has been calculated. In terms of storage cost, our method only adds the separate scaling factors (BMA) and the normalization of weights. The exact increments are 11.7 Kbit and 22.8 Kbit for ResNet18 and ResNet34 (Bi-Real+ML+BMA), respectively. These are minor compared with the original Bi-Real's 33.6 Mbit and 43.7 Mbit (ResNet18, ResNet34). From the computation cost point of view, only the normalization of weights increases the cost. The exact increments are 4.62 M FLOPs and 7.12 M FLOPs for ResNet18 and ResNet34 (Bi-Real+ML+BMA), respectively. These increments are also negligible compared with the standard Bi-Real's $1.63 \times 10^8$ and $1.93 \times 10^8$ FLOPs (ResNet18, ResNet34).
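The relative size of these increments follows from simple arithmetic on the figures quoted above. The short script below works it out, assuming decimal prefixes (1 Kbit = $10^3$ bit, 1 Mbit = $10^6$ bit), which the text does not state explicitly; `relative_overhead` is a hypothetical helper.

```python
def relative_overhead(extra, base):
    """Return extra / base as a percentage."""
    return 100.0 * extra / base

KBIT, MBIT = 1e3, 1e6   # decimal prefixes (assumed)

storage = {
    "ResNet18": relative_overhead(11.7 * KBIT, 33.6 * MBIT),
    "ResNet34": relative_overhead(22.8 * KBIT, 43.7 * MBIT),
}
compute = {
    "ResNet18": relative_overhead(4.62e6, 1.63e8),
    "ResNet34": relative_overhead(7.12e6, 1.93e8),
}
for name in storage:
    print(f"{name}: storage +{storage[name]:.3f}%, FLOPs +{compute[name]:.2f}%")
```

Under this assumption the storage overhead is roughly 0.035% and 0.052%, and the computation overhead is roughly 2.8% and 3.7%, for ResNet18 and ResNet34 respectively.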
With respect to the two major evaluation criteria, storage cost and computational complexity, the extra expenditure brought by the proposed ML and BMA methods is minor. This greatly benefits implementing and embedding the BNN on a resource-limited hardware platform.

Conclusions
In this article, we propose a novel regularization technique, the median loss (ML), to make the binarized weights more evenly distributed and consequently increase the information entropy retained inside the network. We also propose a new batch median of activations (BMA) method to retain more information in BNNs in two respects. First, the median is subtracted from the activations to attain the maximum information entropy of the binarized activations. Second, the scaling factor is calculated separately for the positive and negative values of weights and activations to reduce the quantization error. Owing to the information retained inside the network by these two methods, the XNOR-Net/ResNet-20 network gains 4.67% in accuracy on the CIFAR-10 dataset, and the Bi-Real18 network gains 1.3% in accuracy on the ImageNet 2012 dataset. Moreover, the increases in storage cost and computational complexity introduced by ML and BMA are shown to be minor.
In summary, the proposed methods increase the information entropy of the whole BNN and thereby improve its accuracy. Their applicability to currently popular binary networks with good overall performance is a further attractive property. These advantages are promising for wide future application in binary neural networks and for embedding such networks into power-aware products.