Between-Class Adversarial Training for Improving Adversarial Robustness of Image Classification

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. Adversarial training (AT) is, so far, the only method that can reliably improve the robustness of DNNs to adversarial attacks. However, the robust accuracy gained from AT is still far lower than the standard accuracy of an undefended model, and there is a known trade-off between the standard generalization accuracy and the robustness generalization accuracy of an adversarially trained model. In order to improve this trade-off, we propose a novel defense algorithm called Between-Class Adversarial Training (BCAT) that combines Between-Class learning (BC-learning) with standard AT. Specifically, during AT, BCAT mixes two adversarial examples from different classes and trains the model on the mixed between-class adversarial examples instead of the original adversarial examples. We further propose BCAT+, which adopts a more powerful mixing method. BCAT and BCAT+ impose effective regularization on the feature distribution of adversarial examples to enlarge the between-class distance, thus improving both the robustness generalization and the standard generalization performance of AT. The proposed algorithms introduce no hyperparameters beyond those of standard AT; therefore, hyperparameter search can be avoided. We evaluate the proposed algorithms under both white-box and black-box attacks using a spectrum of perturbation values on the CIFAR-10, CIFAR-100, and SVHN datasets. Experimental results show that our algorithms achieve better global robustness generalization performance than state-of-the-art adversarial defense methods.


Introduction
DNNs have achieved impressive success in many computer vision tasks such as image classification [1], object detection [2], and semantic segmentation [3]. However, recent studies on adversarial examples [4,5] reveal the weakness of DNNs in robustness, showing that carefully designed small perturbations can mislead a network into producing incorrect outputs with high confidence. In the context of image classification, the perturbations in adversarial examples are human-imperceptible and can change the prediction of a classification model to incorrect classes. Moreover, adversarial examples can also transfer across different model parameters and even architectures. As a result, adversarial examples become a significant threat to deep learning-based security-critical applications such as self-driving cars [6], person detection systems [7], and medical diagnosis systems [8]; hence, it is a crucial issue to develop methods that improve the robustness of DNNs against adversarial examples.
Adversarial attacks have attracted considerable research interest in developing adversarial defenses to improve the adversarial robustness of DNNs. For example, feature squeezing [17] reduces the power of the adversary by reducing the color bit depth of the pixel values of input images and by spatial smoothing. Stochastic Activation Pruning [18] and Deep Contractive Network [19] modify the network architecture to improve the adversarial robustness of DNNs. Defense-GAN [20], PixelDefend [21], and MagNet [22] add auxiliary networks to make DNNs robust to adversarial examples. Nevertheless, these proposed defense methods have been demonstrated to give a false sense of robustness due to obfuscated gradients, or were evaluated under weak threat models [23,24]. It is generally accepted that AT [4,5,9], which trains DNNs with adversarial examples, is, so far, the only method that can improve the robustness of DNNs against adversarial examples. However, AT is known to damage the accuracy on clean examples [4,5,9]; in addition, the adversarial generalization performance gained from AT is much lower than the standard generalization performance gained from standard training. Schmidt et al. [25] demonstrate that this is due to the higher sample complexity needed by robustness generalization than by standard generalization.
Improving the robustness of DNNs against adversarial examples can be viewed as the problem of reducing overfitting, namely improving the generalization performance of DNNs on testing adversarial examples. Regularization is a commonly used method to reduce overfitting and improve the standard generalization of DNNs. Well-known regularization methods include weight decay [26], dropout [27], and data augmentation [28]. Weight decay regularizes DNNs on the model side by introducing a regularization term into the loss function to penalize high weight values, which prevents the model from becoming too complex, thus reducing overfitting and improving generalization. Dropout also works on the model side by randomly dropping out nodes during training, which approximates ensembling a large number of models with different architectures, thus reducing overfitting and improving generalization. Different from weight decay and dropout, data augmentation regularizes DNNs on the data side. By applying geometric transformations such as flipping, cropping, rotation, and translation to already existing data [28] or generating synthetic data [29], data augmentation increases sample complexity, thus reducing overfitting and improving generalization. BC-learning [30,31] is a recently proposed data augmentation method that mixes two examples belonging to different classes with a random ratio, inputs the mixed between-class examples to a model, and trains the model to output the mixing ratio. BC-learning imposes regularization on the feature distribution of clean examples, which enlarges the between-class distance. BC-learning was originally designed for sound recognition [30,31] but was then found to also improve the standard generalization of image classification [30,31]. However, little work has been done to study the effectiveness of BC-learning on the robustness generalization of image classification.
This paper aims to answer the question of whether BC-learning can further improve the robustness generalization of adversarially trained DNNs on the image classification task by regularizing the feature distribution of adversarial examples. We first introduce an intriguing property of adversarial examples called Label-Feature Distribution Mismatch and point out that the Label-Feature Distribution Mismatch property is a reason that causes poor generalization performance of DNNs against adversarial examples. We then propose a novel adversarial training algorithm named BCAT that combines BC-learning with AT.

Adversarial Training
AT is widely recognized as the only method that can improve the adversarial robustness of DNNs. Standard AT [4,5,9] is formulated as a min-max optimization problem. The inner maximization generates the worst-case adversarial examples using a first-order adversary called Projected Gradient Descent (PGD), and the outer minimization trains the model on the generated adversarial examples to update the model parameters. Many recently proposed state-of-the-art methods are based on this AT formulation. For example, Adversarial Logit Pairing (ALP) [32] encourages the logits of a clean image x and its corresponding adversarial example x′ to be similar. ALP imposes regularization on the model, which encourages similar feature distributions of clean and adversarial examples. TRADES [33] trains a model by optimizing a loss function consisting of two terms: one for maximizing the natural accuracy of the model, and another for improving the adversarial robustness of the model. TRADES provides a better trade-off between robustness and accuracy. TLA [34] and AT2L [35] combine metric learning with AT to train a model on a triplet loss, which produces more robust classifiers. Zhang et al. [36] propose a feature scattering-based AT approach that considers inter-sample relationships for improving the adversarial robustness of DNNs. Yu et al. [37] demonstrate that latent features in an adversarially trained model are susceptible to adversarial attacks and propose the LAFEAT method to improve the robustness of the latent features against adversarial attacks. Chen et al. [38] propose self-supervised AT that maximizes the mutual information between the representations of clean examples and corresponding adversarial examples during training. Liu et al. [39] propose a defense algorithm named Adv-BNN that combines AT and the Bayesian neural network. Wang et al. [40] propose a dynamic training strategy to gradually increase the convergence quality of the generated adversarial examples, which improves the robustness of AT. Rice et al. [41] demonstrate that the improvement in adversarial robustness of AT can be achieved by simply adopting early stopping. Yu et al. [42] propose an AT-based method that can learn a representation capturing the shared information between clean examples and their corresponding adversarial examples while discarding these samples' view-specific information, which leads to an improved robust vs. natural accuracy trade-off. Apart from these works that aim to improve the robustness generalization performance of AT, there are also many studies trying to solve specific problems in AT. For example, AT based on the min-max formulation hurts the standard generalization of DNNs. Zhang et al. [43] propose a novel formulation of AT called friendly adversarial training (FAT) that trains a model on the least adversarial examples instead of the worst-case adversarial examples. FAT achieves adversarial robustness without compromising natural generalization. It is known that adversarial robustness requires a larger network capacity than standard generalization [4,5,9]. In order to achieve compactness of robust models, ADMM [44] and HYDRA [45] combine AT with weight pruning to balance adversarial robustness and model compactness at the same time. Adversarial robustness gained from AT comes with a high computational cost. In order to reduce the computational cost of AT, freeAT [46] generates adversarial examples and updates model parameters within one gradient computation, thus speeding up AT. YOPO [47] restricts most of the forward and backward propagation of AT within the first layer of a network during adversary updates, which reduces the computational cost.
The single-step adversary such as FGSM can reduce the computational cost but fails to defend against adversarial attacks. Wong et al. [48] propose to train a model using FGSM combined with random initialization. Vivek et al. [49] propose a single-step AT method with dropout scheduling. Most of the works on adversarial robustness design methods are on balanced datasets. Wu et al. [50] investigate the adversarial vulnerability and defense under long-tailed distribution and propose RoBal which tackles adversarial robustness under long-tailed distribution.

Regularization
Regularization is any method adopted to improve the generalization performance of a learning algorithm. There are many regularization methods. For example, L2 regularization [26], also known as weight decay, adds a regularization term measuring the overall size of the weight parameters by the L2 norm to the loss function to penalize high weight values, which prevents the model from becoming too complex, thus reducing overfitting and improving generalization. Similar to L2 regularization, L1 regularization [51] replaces the L2 norm with the L1 norm to penalize the size of the weight parameters, which results in a sparser weight distribution. Other than the weight parameters, the penalty can also be applied to the activations of the units in DNNs to induce representational sparsity [52], which improves the generalization performance of DNNs. The lack of labeled data is one reason for the poor generalization performance of DNNs. Data augmentation adds synthetic data to the training set to improve sample complexity. DNNs trained on the augmented training set benefit from the improved sample complexity, and thus their generalization performance is improved. The synthetic data can be acquired by applying geometric transformations such as flipping, cropping, rotation, and translation to already existing data [28] or be generated using Generative Adversarial Networks (GANs) [29]. When training models with a large capacity, the generalization error often reaches its minimum before training is finished. Early stopping adopts the model parameters with the best generalization performance rather than the parameters obtained when training finishes, which is a simple but effective regularization method. Caruana et al. [53] explain the regularization that early stopping imposes in terms of restricting model complexity: models with larger capacity first learn hypotheses that are similar to those learned by smaller models during the training process.
When early stopping is used, the training of the larger model can be halted when the large model's parameters are similar to parameters learned by smaller nets. Dropout [27] randomly masks out the hidden units of a network by multiplying their outputs by zero during training. This is similar to training an ensemble of different networks and then averaging the predictions of all networks, which improves the generalization performance of the single network. Other regularization methods such as semi-supervised learning [54], multitask learning [55], and noise injection [56] can also improve the generalization performance of DNNs.
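To make the weight-penalty regularizers above concrete, here is a minimal NumPy sketch of adding L2 (weight decay) and L1 penalties to a base loss; the coefficients and toy weights are illustrative, not values from the paper:

```python
import numpy as np

def regularized_loss(data_loss, weights, l2=0.0, l1=0.0):
    """Add L2 (weight decay) and/or L1 penalties to a base data loss.

    L2 penalizes large weights quadratically; L1 encourages sparsity.
    `l2` and `l1` are illustrative coefficients, not tuned values."""
    penalty = l2 * np.sum(weights ** 2) + l1 * np.sum(np.abs(weights))
    return data_loss + penalty

w = np.array([0.5, -2.0, 0.0])   # toy weight vector
base = 1.0                        # toy unregularized data loss
loss_l2 = regularized_loss(base, w, l2=0.01)   # 1.0 + 0.01 * (0.25 + 4.0)
loss_l1 = regularized_loss(base, w, l1=0.01)   # 1.0 + 0.01 * (0.5 + 2.0)
```

The large weight (-2.0) dominates the L2 penalty, which is why weight decay pushes training toward many moderate weights rather than a few extreme ones.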

Label-Feature Distribution Mismatch
An intriguing property of adversarial examples, which we call the Label-Feature Distribution Mismatch, is first introduced here and illustrated in Figure 1 for a standardly trained model. Here, f : x → ŷ is a classification model that outputs the predicted label ŷ for the input x, and x′ is the adversarial example corresponding to x. In this paper, the ground-truth labels Y of the adversarial examples X′ are called real labels, and the predicted labels Y′ = f(X′) of the adversarial examples X′ are called fake labels.



Motivation
AT [9] trains DNNs using online-generated worst-case adversarial examples. This training strategy imposes regularization on the distribution of adversarial examples to decrease the intra-class distance and increase the inter-class distance of the features of adversarial examples, as shown in Figure 2. As a result, AT mitigates the Label-Feature Distribution Mismatch problem of adversarial examples and improves the adversarial robustness generalization of DNNs. Therefore, the adversarial robustness generalization performance of DNNs can be improved from the point of view of regularizing the feature distribution. Nevertheless, the feature distribution of adversarial examples with the same real labels is still not ideal compared to clean examples (for example, the cyan points in Figure 2). BC-learning [31] is able to impose constraints on the feature distribution of clean examples to enlarge Fisher's criterion and regularize the positional relationship among feature distributions, thus improving the standard generalization. Therefore, it is reasonable to assume that if BC-learning is applied to AT, the feature distribution of adversarial examples can be regularized in a similar way, further improving the adversarial robustness generalization of AT.



BCAT: Between-Class Adversarial Training
In this section, we first introduce the standard AT formulation. Then, we propose the BCAT method and introduce how to apply BC-learning to AT.
Madry et al. [9] formulate AT as a min-max optimization problem:

min_θ E_(x,y)∼D [ max_(δ∈S) L(θ, x + δ, y) ] (1)

where S ⊆ R^d is the set of perturbations the threat model can find, such as the union of the L∞-balls around the clean examples X. L(·) is a loss function such as the cross-entropy loss for DNNs. In this min-max optimization, the inner maximization problem finds adversarial examples that maximize the loss function, which is solved by PGD [9]:

x^(t+1) = Π_(x+S)( x^t + α · sign(∇_x L(θ, x^t, y)) ) (2)

where α is the step size and Π_(x+S)(·) projects onto the set of allowed perturbations. The outer minimization problem finds model parameters so that the loss function is minimized on the adversarial examples found by the inner maximization, which can be solved by back-propagation for DNNs. The training procedure of AT is exhibited in Figure 3. This procedure is iterated until the model converges.
In order to regularize the feature distribution of adversarial examples and improve the adversarial robustness generalization performance of DNNs trained by AT, we propose the BCAT method, which applies BC-learning to AT. First, adversarial examples x′ are generated by the inner maximization of Equation (1):

x′ = arg max_(x̃ ∈ x+S) L(θ, x̃, y) (3)

Suppose x′1 and x′2 are two adversarial examples with different real labels generated from Equation (3), and y1 and y2 are their one-hot real labels. Then, a random mixing ratio r is generated from U(0, 1), and the two sets of adversarial examples and real labels are mixed with this mixing ratio:

x_mixed = r x′1 + (1 − r) x′2 (4)

y_mixed = r y1 + (1 − r) y2 (5)

According to [31], the Kullback-Leibler divergence is adopted as the loss function for BCAT:

L = D_KL( y_mixed ∥ ŷ ) (6)

where ŷ is the output of the DNN model given the mixed adversarial examples x_mixed. Algorithm 1 describes the training procedure of BCAT.
Algorithm 1 Pseudocode of BCAT
Input: Dataset D, initial weight parameters θ0, training steps K, batch size M, PGD perturbation value ε, PGD step size α, PGD number of steps T
Output: weight parameters θ
1 For k = 1, 2, . . . , K do
2   Sample a batch of M pairs of examples {(x1m, y1m), (x2m, y2m)} with different real labels from D
3   For m = 1, 2, . . . , M do
4     Generate adversarial examples x′1m and x′2m by running T steps of PGD with step size α within the ε-ball
5   End for
6   Generate a batch of random mixing ratios rm ∼ U(0, 1)
7   For m = 1, 2, . . . , M do
8     x_mixed,m ← rm x′1m + (1 − rm) x′2m;  y_mixed,m ← rm y1m + (1 − rm) y2m
9   End for
10  Update θ by one gradient step on the Kullback-Leibler loss over the mixed batch
11 End for
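The mixing and loss steps of BCAT can be sketched as follows; this is a toy NumPy illustration with made-up inputs, not the authors' implementation:

```python
import numpy as np

def bcat_mix(x1, x2, y1, y2, r):
    """Mix two adversarial examples and their one-hot labels by ratio r,
    a minimal sketch of the BCAT between-class mixing step."""
    return r * x1 + (1 - r) * x2, r * y1 + (1 - r) * y2

def kl_div(target, pred, eps=1e-12):
    """KL(target || pred), used as the BCAT training loss; `eps` avoids
    log(0) for this sketch."""
    return float(np.sum(target * np.log((target + eps) / (pred + eps))))

# Toy 4-pixel "images" from two different classes, with one-hot labels.
x1, x2 = np.full(4, 1.0), np.full(4, -1.0)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mixed, y_mixed = bcat_mix(x1, x2, y1, y2, r=0.7)
# If the model's output matches the mixing ratio exactly, the loss is ~0.
loss = kl_div(y_mixed, np.array([0.7, 0.3]))
```

Training the model to output the mixing ratio (rather than a hard label) is what imposes the between-class constraint on the feature space.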

BCAT+: A More Powerful Mixing Method
Inspired by the mixing method of BC+ [31], we adopt another mixing method that treats images as waveform data. This mixing method is a modified version of the mixing method of BC+, which aims to better adapt to AT. In AT-based methods, pixel values of input data are normalized to a fixed range such as [−1, 1] because of bounded adversarial perturbations. However, in the mixing method of BC+, per-image mean values are first subtracted from images, and then the zero-centered images are normalized for each channel using the mean and standard deviation calculated from the whole training data. In order to better adapt to AT, we do not adopt the normalization method of BC+ for BCAT+ and simply restrict the pixel values of images to [−1, 1]. Specifically, two normalized adversarial examples x′1 and x′2 are first mixed by Equation (8) instead of Equation (4):

x_mixed = (r x′1 + (1 − r) x′2) / √(r² + (1 − r)²) (8)

Equation (8) takes waveform energy into consideration, which is proportional to the square of the amplitude. This mixing method prevents the input variance from decreasing. Second, following [31], we consider the difference of energies of two adversarial examples and use a new coefficient p instead of r to mix two adversarial examples by

x_mixed = (p x′1 + (1 − p) x′2) / √(p² + (1 − p)²) (9)

where p is solved from

(p σ1) / ((1 − p) σ2) = r / (1 − r) (10)

where σ1 and σ2 are the standard deviations of x′1 and x′2, respectively. The advantage of BCAT+ over BCAT comes from two aspects. Firstly, CNNs have an aspect of treating input data as waveforms; therefore, the mixed adversarial examples from BCAT+ are more adaptive to CNNs than those from BCAT. Secondly, the mixing methods of both BCAT and BCAT+ are, by nature, data augmentation methods that increase the variance of the training data, which imposes constraints on the feature distribution of the adversarial examples and thus improves the adversarial robustness generalization performance. The key point is that the mixing method of BCAT+ takes the difference in the energies of the adversarial examples into consideration, which can generate mixed adversarial examples with higher variance.
This is equivalent to imposing stronger constraints on the feature distribution of the adversarial examples. Experiments in Section 4 will demonstrate the advantage of BCAT+ over BCAT in terms of adversarial robustness generalization performance. Algorithm 2 describes the training procedure of BCAT+.
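A minimal NumPy sketch of the BCAT+ mixing step, assuming the BC+ coefficient p = 1 / (1 + (σ1/σ2) · (1 − r)/r) from [31] (σ1, σ2 being the standard deviations of the two inputs); the random inputs are illustrative:

```python
import numpy as np

def bcat_plus_mix(x1, x2, r):
    """Sketch of BCAT+ mixing: the coefficient p accounts for the
    per-example standard deviations (energies), and the denominator
    keeps the variance of the mix from shrinking. Pixels are assumed
    normalized to [-1, 1], so the result is clipped back to that range."""
    s1, s2 = np.std(x1), np.std(x2)
    p = 1.0 / (1.0 + (s1 / s2) * (1 - r) / r)                     # solve for p
    mixed = (p * x1 + (1 - p) * x2) / np.sqrt(p**2 + (1 - p)**2)  # energy-preserving mix
    return np.clip(mixed, -1.0, 1.0)

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 1000)   # stand-ins for two flattened adversarial examples
x2 = rng.uniform(-1, 1, 1000)
mixed = bcat_plus_mix(x1, x2, r=0.5)
```

Unlike the plain convex mix of Equation (4), whose variance shrinks toward the center, the normalized mix keeps the standard deviation of the output close to that of the inputs.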
As two training algorithms for DNNs, BCAT and BCAT+ iteratively update the parameters θ of the DNN models to minimize the loss function L by stochastic gradient descent. In each training iteration, the gradients of the loss function L with respect to the parameters θ are first calculated; then, the parameters θ are updated by one small step along the opposite direction of the gradients, which is the direction in which the value of the loss function descends the fastest. The parameter updating rule is

θ_(i+1) = θ_i − η ∇_θ L(θ_i)

where θ_i are the parameters calculated in the ith iteration during training, and η is the learning rate that controls how much the model changes in each training iteration. η is a critical hyperparameter that affects the convergence of the training process: too large an η may cause convergence to a sub-optimal set of parameters or cause the training not to converge, whereas too small an η may result in a long training process that could become stuck. The chosen value of η is introduced in Section 4.3, and the convergence process of BCAT and BCAT+ is visualized and analyzed in Section 4.4.3. In each iteration of BCAT+, pairs of adversarial examples are first generated by PGD; the mixing coefficients p_m are then computed from the mixing ratios r_m ∼ U(0, 1); and finally the mixed adversarial examples are used to train the model to output the mixing ratio. This procedure is iterated until the model converges. The overall framework representing the working mechanism of the proposed method is shown in Figure 5. The classification model is first trained using BCAT(+) on the training data. The adversarial robustness generalization performance of the trained model is then evaluated on the unseen testing data.
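The updating rule and the effect of the learning rate can be illustrated with a toy one-parameter loss L(θ) = θ², whose gradient is 2θ; all values here are illustrative:

```python
def sgd_step(theta, grad, lr):
    """One parameter update: theta_{i+1} = theta_i - lr * grad."""
    return theta - lr * grad

# With a suitable learning rate, iterating the rule shrinks theta
# toward the minimum of L(theta) = theta^2 at 0.
theta = 1.0
for _ in range(100):
    theta = sgd_step(theta, 2 * theta, lr=0.1)   # theta *= (1 - 0.2) each step

# With too large a learning rate, the same rule overshoots and diverges.
theta_bad = 1.0
for _ in range(20):
    theta_bad = sgd_step(theta_bad, 2 * theta_bad, lr=1.1)  # theta *= -1.2 each step
```

The two runs make the convergence trade-off concrete: `theta` decays geometrically toward 0, while `theta_bad` grows in magnitude at every step.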

Real Label and Fake Label
Recall that when BC-learning is applied to clean examples, two clean examples from different classes are chosen to be mixed by a random ratio. This operation chooses examples from different spatial distributions and regularizes the feature distribution of clean examples, as illustrated in Figure 6. Therefore, we also consider a different realization of BCAT and BCAT+ that takes the fake label of adversarial examples into consideration. Specifically, the two adversarial examples chosen to be mixed have different real labels and different fake labels. We call the BCAT and BCAT+ methods that take fake labels into consideration BCATf and BCATf+, respectively. Due to the higher computational complexity of BCATf and BCATf+ compared with BCAT and BCAT+, and considering that BCAT and BCAT+ have already achieved good adversarial robustness generalization performance, we mainly focus on BCAT and BCAT+ in our experiments and show the performance of BCATf and BCATf+ in Section 4.6.1.


Datasets
We evaluated BCAT and BCAT+ on CIFAR-10, CIFAR-100, and SVHN datasets in this paper. The CIFAR-10 dataset contains 60,000 32 × 32 RGB images from 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The training set contains 50,000 samples and the testing set contains 10,000 samples. The CIFAR-100 dataset has 60,000 32 × 32 RGB images from 100 classes. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). The SVHN dataset contains 10 classes of street view house numbers RGB images of size 32 × 32. The training set contains 73,257 samples and the testing set contains 26,032 samples. Pixel values are normalized to [−1, 1] for these three datasets in this paper. During training, the standard data augmentation scheme [57] is applied to CIFAR-10 and CIFAR-100 datasets.

Threat Model
The threat model used in this paper for evaluating the adversarial robustness generalization of the proposed method is the L∞-PGD, which generates L∞ norm bounded adversarial examples against the defended networks. The number of steps and the step size of PGD are set to 20 and 2/255, respectively. A wide range of perturbation values is chosen from 1/255 to 8/255 with a step size of 1/255 to globally evaluate the adversarial robustness generalization. Both white-box attacks and black-box attacks are considered in our experiments.

Training Parameters
The same training schedule is adopted for CIFAR-10, CIFAR-100, and SVHN in this paper. The batch size is set to 128 and the number of epochs is set to 250. The Momentum optimizer with a momentum of 0.9 and Nesterov acceleration is used. The initial learning rate is set to 0.1 and is decayed by a factor of 10 at epochs {100, 150, 200}. Weight decay is set to 0.0005 according to [58], which is shown to achieve higher adversarial robustness generalization accuracy.
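The step learning-rate schedule described above can be sketched as follows (a sketch of the stated schedule, not the authors' code):

```python
def learning_rate(epoch, base_lr=0.1, decay_epochs=(100, 150, 200)):
    """Step schedule: start at base_lr and divide by 10 at each
    boundary epoch that has been reached."""
    lr = base_lr
    for boundary in decay_epochs:
        if epoch >= boundary:
            lr /= 10.0
    return lr
```

For example, the learning rate stays at 0.1 through epoch 99, then drops to 0.01, 0.001, and finally 0.0001 for the last 50 epochs.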

Evaluation under White-Box Attacks
In this section, we evaluate BCAT and BCAT+ under white-box attacks and compare BCAT and BCAT+ with several strong baselines (standard AT [9], freeAT [46], ALP [32], TRADES [33], and Yu et al.'s method [42]) on CIFAR-10, CIFAR-100, and SVHN to illustrate the superiority of BCAT and BCAT+ in terms of improving the adversarial robustness generalization of DNNs. For convenience, we denote Yu et al.'s method as ATLSI (Adversarial Training by Learning Shared Information) in this paper.

Feature Distribution
As introduced in Section 3.2, BC-learning can benefit the standard generalization of DNNs by regularizing the feature distribution of clean examples to yield larger inter-class distances and smaller intra-class distances. To study whether BC-learning can also regularize the feature distribution of adversarial examples after being combined with AT, we first train three ResNet34 networks on CIFAR-10 using standard AT, BCAT, and BCAT+, respectively, and then use t-SNE to visualize the testing-set adversarial examples of CIFAR-10 generated on these three ResNet34 networks in a two-dimensional feature space. The visualization results are displayed in Figure 7. As can be seen from the top row of Figure 7, for the network trained using standard AT, the feature distributions of the adversarial examples from different classes significantly overlap in feature space; there still exists a noticeable Label-Feature Distribution Mismatch problem in the network trained using standard AT. By contrast, as shown in the middle row and the bottom row of Figure 7, the networks trained using BCAT and BCAT+ exhibit better feature distributions than standard AT: the inter-class distance is substantially increased, and the intra-class distance is substantially decreased. The Label-Feature Distribution Mismatch problem is effectively mitigated by BCAT and BCAT+. This indicates that BC-learning imposes effective regularization on the feature distribution of the adversarial examples. In fact, a larger discrimination margin allows the networks to learn better classification boundaries during AT, thus improving the adversarial robustness generalization.


Robustness Generalization
In order to quantitatively analyze the robustness generalization and the standard generalization performance of BCAT and BCAT+ and compare them with the baselines, we first train several networks using BCAT, BCAT+, and the baselines on CIFAR-10, CIFAR-100, and SVHN. Specifically, ResNet34 and ResNet18 are adopted for CIFAR-10 and SVHN. WRN34-5 is adopted for CIFAR-100. The perturbation value of all evaluated defense methods during training is set to be 8/255 for CIFAR and 12/255 for SVHN. The number of steps and step size are set to be 7 and 2/255. Then, we test the robustness generalization accuracy and the standard generalization accuracy of these networks under PGD20 with a wide range of perturbation values described in Section 4.2. The experimental results are exhibited in Tables 1-3 and Figure 8.
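For reference, the PGD attack used in these evaluations (e.g., PGD20: 20 steps with step size 2/255 and perturbation bound 8/255) can be sketched in PyTorch as follows. This is a minimal L∞ implementation with a random start; the function name and interface are illustrative, not the exact code used in our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step_size=2/255, steps=20):
    """L-infinity PGD with a random start: repeatedly step in the sign
    of the loss gradient, then project back onto the eps-ball around
    the clean input and the valid pixel range [0, 1]."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()   # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```

During training, the same routine with 7 steps would generate the adversarial examples used for AT, per the settings above.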

CIFAR-10
Results on CIFAR-10 are exhibited in Table 1 and Figure 8a. For the sake of simplicity, only the results under four representative perturbation values of 2/255, 4/255, 6/255, and 8/255 are listed in Table 1. As shown in Table 1, for the ResNet34 network, BCAT+ achieves the highest robustness generalization accuracy among the evaluated defense methods at all chosen perturbation values except for the perturbation value of 2/255, where ATLSI slightly outperforms BCAT+ (0.818 vs. 0.817). The robustness generalization accuracy of BCAT ranks second at the perturbation values of 4/255 and 6/255; at the local perturbation values of 2/255 and 8/255, ATLSI, ALP, and TRADES slightly outperform BCAT, respectively. In short, for the ResNet34 network, BCAT and BCAT+ outperform the baselines in terms of global robustness generalization performance, which is considered a more convincing evaluation metric than the robustness generalization performance at a single perturbation value [59]. For the ResNet18 network, which has a smaller capacity, the robustness generalization accuracy of BCAT+ ranks first only at the perturbation values of 4/255, 6/255, and 8/255; at the small perturbation value of 2/255, ATLSI achieves the highest robustness generalization accuracy. The robustness generalization accuracy of BCAT ranks second only at the perturbation value of 8/255. Comparing the ResNet34 network and the ResNet18 network indicates that a larger network capacity can benefit the robustness generalization performance of BCAT and BCAT+. In terms of standard generalization, nevertheless, the performance of BCAT and BCAT+ is less than ideal on CIFAR-10; ATLSI achieves the highest standard generalization accuracy on both the ResNet34 network and the ResNet18 network. A reasonable explanation is that this is due to the specific characteristics of the dataset. Figure 8a shows the superiority of BCAT+ in robustness generalization.
CIFAR-100 Results on CIFAR-100 are shown in Table 2 and Figure 8b. As listed in Table 2, BCAT+ and BCAT achieve the highest and the second-highest standard generalization accuracy among the evaluated defense methods; recalling the dataset specificity mentioned above, the standard generalization performance gain of BCAT and BCAT+ on CIFAR-100 is more noticeable (3.1% and 3.4%) than that on CIFAR-10. BCAT and BCAT+ also achieve significantly higher robustness generalization accuracy than the baseline methods at all the chosen perturbation values. Figure 8b intuitively shows the superiority of BCAT+ and BCAT. Compared to CIFAR-10, the performance gap between BCAT (+) and the baseline methods is larger on CIFAR-100.
SVHN Results on SVHN are shown in Table 3 and Figure 8c. As can be seen from Table 3, for both the ResNet34 network and the ResNet18 network, BCAT+ and BCAT achieve the highest and the second-highest standard generalization accuracy and robustness generalization accuracy among the evaluated defense methods under the majority of chosen perturbation values, except for the perturbation value of 8/255 for the ResNet34 network, where ATLSI outperforms BCAT+. Nonetheless, the global robustness generalization performance of BCAT and BCAT+ is not affected by this exception. Comparing the robustness generalization accuracy for the ResNet34 network and the ResNet18 network, we find that the ResNet18 network outperforms the ResNet34 network for AT, freeAT, and ALP; for TRADES, BCAT, and BCAT+, the ResNet18 network outperforms the ResNet34 network at higher perturbation values. This observation is different from that of CIFAR-10, where the network with a larger capacity performs better than the network with a smaller capacity. This is because classification on SVHN is easier than on CIFAR-10, so a larger network overfits more easily on SVHN. Figure 8c exhibits the standard and robustness generalization accuracy for ResNet18, from which we can see the superiority of BCAT+.

Convergence Analysis
We also analyze the convergence process of the robustness and standard generalization accuracy of the networks trained using standard AT, BCAT, and BCAT+ on the CIFAR-10, CIFAR-100, and SVHN datasets. The networks used for CIFAR-10, CIFAR-100, and SVHN are ResNet34, WRN34-5, and ResNet18, respectively. During training, the number of PGD steps is set to be 7 for the validation set. The validation accuracy during training is plotted in Figure 9.
CIFAR-10 From Figure 9a it can be seen that, in the CIFAR-10 dataset, the robustness generalization accuracy of standard AT is higher than that of BCAT and BCAT+ before the first decay of the learning rate. After the first decay of the learning rate, the robustness generalization accuracy of BCAT+ gradually exceeds that of standard AT. After the second decay of the learning rate, the robustness generalization accuracy of BCAT exceeds that of standard AT. Additionally, the standard generalization accuracy of standard AT is exceeded by that of BCAT+ after the third decay of the learning rate.
CIFAR-100 From Figure 9b it can be seen that, in the CIFAR-100 dataset, the robustness generalization accuracy of BCAT and BCAT+ exceeds that of standard AT after the first decay of the learning rate. Although the robustness generalization accuracy of standard AT exceeds that of BCAT again after the third decay of the learning rate, BCAT+ ranks first until the end of the training process. Additionally, the standard generalization accuracy of BCAT and BCAT+ gradually exceeds that of standard AT after the first decay of the learning rate and maintains a large margin until the end of the training process.
SVHN As can be seen from Figure 9c, in the SVHN dataset, the robustness generalization accuracy of BCAT+ exceeds that of standard AT after the second decay of the learning rate. Additionally, the standard generalization accuracy of BCAT and BCAT+ exceeds that of standard AT after the second decay of the learning rate.
The observations above suggest that a sufficient number of training epochs and a proper learning rate schedule are vital to BCAT and BCAT+.

Evaluation under Black-Box Attacks
In order to demonstrate that the robustness of BCAT and BCAT+ is not a result of obfuscated gradients [23], we evaluate BCAT and BCAT+ under black-box attacks in this section. In black-box attacks, the adversary has access to nothing but the output of the target model; yet, due to the transferability of adversarial examples, the adversary can first construct a substitute model (or source model) and attack this substitute model to generate adversarial examples in the manner of white-box attacks. Then, the adversary attacks the black-box target model using the generated adversarial examples. According to [23], a robust model that does not rely on obfuscated gradients has better black-box robustness than white-box robustness. In our black-box attack experiments, we adopt a defense-agnostic adversary; namely, the substitute model constructed by the adversary is undefended. We first use PGD20 to attack the undefended substitute model and generate adversarial examples, and then use these adversarial examples to attack the defended target models.
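The transfer-based black-box evaluation described above can be sketched as a simple loop: generate adversarial examples on the substitute model with a white-box attack, then measure the target model's accuracy on them. Here `attack_fn` is assumed to be any white-box attack such as PGD20, and all names are illustrative.

```python
import torch

@torch.no_grad()
def transfer_attack_accuracy(source_model, target_model, attack_fn, loader):
    """Generate adversarial examples on the (white-box) substitute model,
    then report the defended target model's accuracy on them."""
    correct = total = 0
    for x, y in loader:
        with torch.enable_grad():                  # the attack needs gradients
            x_adv = attack_fn(source_model, x, y)  # e.g., PGD20 on the substitute
        pred = target_model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```

A model free of obfuscated gradients should score higher under this transfer attack than under the corresponding white-box attack.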

CIFAR-10
The data in Table 4 show that, on CIFAR-10, BCAT and BCAT+ achieve higher black-box robustness generalization accuracy than the white-box robustness generalization accuracy. Additionally, BCAT+ achieves the highest black-box robustness generalization accuracy among the evaluated defense methods under most of the perturbation values, with the exception of 4/255 where the standard AT slightly outperforms BCAT+.
CIFAR-100 The data in Table 5 show that, on CIFAR-100, BCAT and BCAT+ achieve higher black-box robustness generalization accuracy than the white-box robustness generalization accuracy. BCAT+ and BCAT achieve the highest and the second-highest black-box robustness generalization accuracy under all chosen perturbation values.
SVHN From Table 6 it can be seen that, on SVHN, BCAT and BCAT+ achieve higher black-box robustness generalization accuracy than white-box robustness generalization accuracy. Additionally, BCAT+ achieves the highest black-box robustness generalization accuracy under all chosen perturbation values, and BCAT matches this accuracy under perturbation values of 2/255, 4/255, and 6/255. From the observations above, we know that our proposed BCAT+ outperforms the baselines under black-box attacks; this suggests that BCAT and BCAT+ do not rely on obfuscated gradients.

Ablation Study
In this section, we conduct ablation studies to investigate the effect of the fake label, data augmentation, and attack steps on the performance of BCAT and BCAT+.

BCAT (+) and BCATf (+)
Previously, in Section 3.5, we introduced a different realization of BCAT and BCAT+ that we call BCATf and BCATf+. BCATf and BCATf+ take the fake label of adversarial examples into consideration. Specifically, the two adversarial examples chosen to be mixed have different real labels and different fake labels. Here, we conduct white-box attack experiments to evaluate the standard generalization and the robustness generalization performance of BCATf and BCATf+, and compare them with BCAT and BCAT+ to study the effect of the fake label. Because of the high computational cost of BCATf and BCATf+, we only conduct experiments on CIFAR-10 and SVHN. The perturbation value of BCAT (+) and BCATf (+) during training is set to be 8/255 for CIFAR-10 and 12/255 for SVHN. The number of steps and step size are set to be 7 and 2/255. The comparison results between BCAT (+) and BCATf (+) are given in Figure 10. From Figure 10, it can be seen that BCATf and BCATf+ achieve similar standard generalization and robustness generalization performance to BCAT and BCAT+ under different perturbation values. This observation implies that when applying BC-learning to AT, considering only the real label is enough for regularizing the feature distribution of adversarial examples; the fake label has no obvious effect on further improving the generalization performance.
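For concreteness, the basic between-class mixing step applied to pairs of adversarial examples with different real labels can be sketched as below. The simple linear mixing shown corresponds to BCAT; BCAT+ adopts a more powerful mixing method, not shown here. All names are illustrative, and the exact formulation may differ from the paper's implementation.

```python
import torch

def bc_mix(x1, y1, x2, y2, num_classes):
    """Mix two batches of (adversarial) examples from different classes
    with a random per-sample ratio and build the corresponding soft
    ratio label, in the spirit of BC-learning's simple mixing."""
    r = torch.rand(x1.size(0), 1, 1, 1, device=x1.device)  # mixing ratio per sample
    x_mix = r * x1 + (1 - r) * x2                          # simple linear mix (BCAT)
    oh1 = torch.nn.functional.one_hot(y1, num_classes).float()
    oh2 = torch.nn.functional.one_hot(y2, num_classes).float()
    r_flat = r.view(-1, 1)
    y_mix = r_flat * oh1 + (1 - r_flat) * oh2              # soft label encoding the ratio
    return x_mix, y_mix
```

During AT, the network would then be trained to output this mixing ratio, e.g., by minimizing the KL divergence between its softmax output and `y_mix`.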

Ablation on Data Augmentation
As mentioned in the Introduction, BC-learning is an effective data augmentation method for AT, which adds synthetic mixed adversarial examples to the original adversarial examples to improve the sample complexity. In order to study the relative importance of BC-learning and the standard data augmentation applied to CIFAR-10 and CIFAR-100, we compare the model using both methods with the models using only one of the two methods. The perturbation value during training is set to be 8/255. The number of steps and the step size are set to be 7 and 2/255. The experimental results are summarized in Table 7. In Table 7, 'standard with/without' stands for the adversarially trained models using only standard data augmentation or without any data augmentation. 'BCAT (+) with/without' stands for the adversarially trained models using both BC-learning and standard data augmentation or only BC-learning. We can obtain two insights from the results. First, both BC-learning and standard data augmentation alone can improve the robustness and standard generalization performance of adversarially trained models on CIFAR-10 and CIFAR-100. However, the accuracy improvement gained from BC-learning is lower than that gained from standard data augmentation. Especially for high perturbation values such as 6/255 and 8/255, BC-learning alone even degrades the robustness generalization accuracy. Second, when BC-learning is used in conjunction with standard data augmentation, the resulting robustness generalization accuracy and standard generalization accuracy are higher than when using BC-learning or standard data augmentation alone. This indicates that standard data augmentation is vital when applying BC-learning to AT.

Ablation on Attack Steps
We also attack BCAT (+) and the baselines using a range of attack steps to show the effectiveness of our method under different attack strengths. The attack steps are chosen from 5 to 90 with a step size of 5. The attack step size is set to be 2/255.
The perturbation value during training and testing is set to be 8/255. The results are shown in Figure 11. For CIFAR-10, except for attack steps 15, 35, 40, and 50, where TRADES slightly outperforms BCAT+, BCAT+ achieves better robustness generalization performance than the baselines under the other attack steps. The overall performance of BCAT+ over the full spectrum of tested attack steps is better than TRADES. For CIFAR-100 and SVHN, BCAT+ significantly outperforms the baselines over the full spectrum of tested attack steps. Moreover, as the number of attack steps gradually increases, the robustness generalization accuracy of all the evaluated defense methods first decreases and then becomes stable. This also demonstrates that the evaluated defense methods are free of obfuscated gradients.
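The attack-steps sweep described above can be sketched as a simple evaluation loop over a grid of step counts. Here `attack_fn` is assumed to accept a `steps` keyword argument (e.g., a PGD attack with fixed step size), and all names are illustrative.

```python
import torch

def robustness_vs_steps(model, attack_fn, loader, step_grid=range(5, 95, 5)):
    """Evaluate robust accuracy under an attack with a growing number of
    steps (step size fixed), as in the attack-steps ablation."""
    results = {}
    for steps in step_grid:
        correct = total = 0
        for x, y in loader:
            x_adv = attack_fn(model, x, y, steps=steps)  # stronger attack as steps grow
            with torch.no_grad():
                correct += (model(x_adv).argmax(dim=1) == y).sum().item()
            total += y.numel()
        results[steps] = correct / total
    return results
```

A curve of `results` that first decreases and then plateaus as steps grow is the expected signature of a defense free of obfuscated gradients.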

Conclusions
In this paper, we proposed two novel adversarial defense algorithms against both white-box and black-box attacks, called BCAT and BCAT+, that combine BC-learning with standard AT. BCAT and BCAT+ first mix two adversarial examples that have different real labels using different mixing methods, and then train the DNN model on the mixed adversarial examples, instead of on the original adversarial examples, to predict the mixing ratio during AT. This mitigates the label-feature distribution mismatch problem of adversarial examples and improves the robustness generalization performance of the DNN model trained by AT. We evaluated BCAT and BCAT+ under white-box and black-box attacks on the CIFAR-10, CIFAR-100, and SVHN datasets. The experimental results show that BCAT and BCAT+ can effectively regularize the feature distribution of adversarial examples, thus achieving better global robustness generalization performance than the state-of-the-art adversarial defense methods.
The proposed methods still have some limitations. First, the process of searching for adversarial examples with different real labels increases the computational cost of standard AT. Second, the mixing of two adversarial examples takes place between the generation of adversarial examples and the update of the weight parameters, which makes it difficult to reduce the computational cost of BCAT and BCAT+ by combining the generation of adversarial examples with the update of the weight parameters. Therefore, in the future, we will design a method to reduce the computational cost of BCAT and BCAT+ without significantly damaging the robustness generalization. Additionally, we will develop a more powerful mixing method to further improve the standard generalization and the robustness generalization performance of BCAT+.