#### 3.1. CIFAR10 Experiments

We trained a set of 25 $28\times 10$ Wide ResNet (WRN) CEB models on CIFAR10 at $\rho \in [-1, -0.75, \dots, 5]$, as well as a deterministic baseline. They trained for 1500 epochs, lowering the learning rate by a factor of 0.3 after 500, 1000, and 1250 epochs. This long training regime was due to our use of the original AutoAug policies, which require longer training. The only additional modification we made to the basic $28\times 10$ WRN architecture was the removal of all Batch Normalization [30] layers. Every small CIFAR10 model we have trained with Batch Normalization enabled has had substantially worse robustness to $L_\infty$ PGD adversaries, even though its accuracy is typically much higher. For example, $28\times 10$ WRN CEB models with Batch Normalization rarely exceeded 10% adversarial accuracy. However, it was still the case that lower values of $\rho$ gave higher robustness. As a baseline comparison, a deterministic $28\times 10$ WRN with BatchNorm, trained with AutoAug, reaches 97.3% accuracy on clean images but 0% accuracy against $L_\infty$ PGD attacks at $\epsilon = 8$ and $n = 20$. Interestingly, that model was noticeably more robust to $L_2$ PGD attacks than the deterministic baseline without BatchNorm, getting 73% accuracy compared to 66%. However, it was still much weaker than the CEB models, which get over 80% accuracy on the same attack (Figure 1). Additional training details are in Appendix A.1.
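The piecewise-constant schedule described above can be written as a small helper. This is only a sketch of the stated schedule; the base learning rate is an illustrative placeholder, as the text does not specify the initial rate:

```python
def learning_rate(epoch, base_lr=0.1):
    """Drop the learning rate by a factor of 0.3 after epochs 500, 1000, and 1250.

    base_lr is an assumed placeholder; the text does not give the initial rate.
    """
    drops = sum(epoch >= boundary for boundary in (500, 1000, 1250))
    return base_lr * 0.3 ** drops
```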

Figure 1 demonstrates the adversarial robustness of CEB models to both targeted $L_2$ and $L_\infty$ attacks. The CEB models show a marked improvement in robustness to $L_2$ attacks compared to an adversarially-trained baseline from Madry et al. [5] (denoted Madry). The attack parameters were selected to be about equally difficult for the adversarially-trained WRN $28\times 10$ model from Madry et al. [5] (grey dashed and dotted lines in Figure 1). The deterministic baseline (Det.) only gets 8% accuracy on the $L_\infty$ attacks, but gets 66% on the $L_2$ attack, substantially better than the 45.7% of the adversarially-trained model, which makes it clear that the adversarially-trained model failed to generalize in any reasonable way to the $L_2$ attack. The CEB models are always substantially more robust than Det., and many of them outperform Madry even on the $L_\infty$ attack the Madry model was trained on, but for both attacks there is a clear general trend toward more robustness as $\rho$ decreases. Finally, the CEB and Det. models all reach about the same accuracy, ranging from 93.9% to 95.1%, with Det. at 94.4%. In comparison, Madry only gets 87.3%.

Figure 2 shows the robustness of five of those models to PGD attacks as $\epsilon$ is varied. We selected the four CEB models to represent the most robust models across most of the range of $\rho$ we trained. All values in the figure are collected at 20 steps of PGD. The Madry model [5] was trained with 7 steps of $L_\infty$ PGD at $\epsilon = 8$ (grey dashed line in the figure). All of the CEB models with $\rho \le 4$ outperform Madry across most of the values of $\epsilon$, even though they were not adversarially-trained. It is interesting to note that the Det. model eventually outperforms the CEB$_5$ model on $L_2$ attacks at relatively high accuracies. This result indicates that the CEB$_5$ model may be under-compressed.

Of the 25 CEB models we trained, only the models with $\rho \ge 1$ successfully trained; the remainder collapsed to chance performance. This is something we observe on all datasets when training models that are too low-capacity: only by increasing model capacity does it become possible to train at low $\rho$. Note that this result is predicted by the theory of the onset of learning in IB and its relationship to model capacity from Wu et al. [31].

We additionally tested two models ($\rho = 0$ and $\rho = 5$) on the CIFAR10 Common Corruptions test sets. At the time of training, we were unaware that AutoAug's default policies for CIFAR10 contain brightness and contrast augmentations that amount to training on those two corruptions from Common Corruptions (as mentioned in Yin et al. [11]), so our results are not appropriate for direct comparison with other results in the literature. However, they still allow us to compare the effect of bottlenecking the information between the two models. The $\rho = 5$ model reached an mCE of 61.2; the $\rho = 0$ model reached an mCE of 52.0, a dramatic relative improvement. Note that the mCE is computed relative to a baseline model; we use the baseline model from Yin et al. [11].
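Concretely, each corruption's CE is the model's summed top-1 error over the five severities, normalized by the baseline model's summed error on the same corruption, and the mCE is the mean of those ratios. A minimal sketch of that computation (the corruption names and error values used for illustration are hypothetical):

```python
import numpy as np

def mean_corruption_error(model_err, baseline_err):
    """mCE: for each corruption type, sum the model's top-1 error over the five
    severities and divide by the baseline model's summed error on that same
    corruption; the mCE is the mean of these ratios, reported as a percentage.

    model_err, baseline_err: dicts mapping corruption name -> array of five
    per-severity top-1 error rates.
    """
    ces = [model_err[c].sum() / baseline_err[c].sum() for c in model_err]
    return 100.0 * float(np.mean(ces))
```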

#### 3.2. ImageNet Experiments

To demonstrate CEB’s ability to improve robustness, we trained four different ResNet architectures on ImageNet at $224\times 224$ resolution, with and without AutoAug, using three different objective functions, and then tested them on ImageNet-C, ImageNet-A, and targeted PGD attacks.

As a simple baseline, we trained ResNet50 with no data augmentation using the standard cross-entropy loss (XEnt). We then trained the same network with CEB at ten different values of $\rho = (1, 2, \dots, 10)$. AutoAug [9] has previously been demonstrated to markedly improve robustness on ImageNet-C, so next we trained ResNet50 with AutoAug using XEnt. We similarly trained these AutoAug ResNet50 networks using CEB at the same ten values of $\rho$. ImageNet-C numbers are also sensitive to model capacity. To assess whether CEB can benefit larger models, we repeated the experiments with a modified ResNet50 network where every layer was made twice as wide, training an XEnt model and ten CEB models, all with AutoAug. To see if there is any additional benefit or cost to using the consistent classifier (Section 2.3), we took the same wide architecture with AutoAug and trained ten consistent-classifier CEB (cCEB) models. Finally, we repeated all of the previous experiments using ResNet152: XEnt and CEB models without AutoAug; with AutoAug; with AutoAug and twice as wide; and cCEB with AutoAug and twice as wide. All other hyperparameters (learning rate schedule, $L_2$ weight decay scale, etc.) remained the same across all models; they were taken from the ResNet hyperparameters given in the AutoAug paper. In total we trained 86 ImageNet models: 6 deterministic XEnt models varying augmentation, width, and depth; 60 CEB models additionally varying $\rho$; and 20 cCEB models also varying $\rho$. The results for the ResNet50 models are summarized in Figure 3; for ResNet152, see Figure 4. See Table 1 for detailed results across the matrix of experiments. Additional experimental details are given in Appendix A.2.

The CEB models highlighted in Figure 3, Figure 4, and Table 1 were selected by cross-validation: we chose the values of $\rho$ that gave the best clean test set accuracy. Despite being selected for classical generalization, these models also demonstrate a high degree of robustness to both average- and worst-case perturbations. When more than one model reached the same test set accuracy, we chose the model with the lower $\rho$, since we know that lower $\rho$ correlates with higher robustness. The only model where we had to make this decision was ResNet152 with AutoAug, where five models were all within 0.1% of each other, so we chose the $\rho = 3$ model rather than $\rho \in \{5, \dots, 8\}$.
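The selection rule just described (best clean accuracy, ties broken toward lower $\rho$) can be sketched as a small helper. The function name, the accuracy dictionary, and the tolerance parameter are illustrative assumptions, not code from the experiments:

```python
def select_rho(clean_accuracy, tie_tol=0.0):
    """Pick the value of rho with the best clean test set accuracy.

    clean_accuracy: dict mapping rho -> clean test set accuracy (percent).
    tie_tol: accuracies within tie_tol of the best are treated as ties, and
    ties are broken toward the lowest rho, since lower rho (more compression)
    correlates with higher robustness.
    """
    best = max(clean_accuracy.values())
    return min(r for r, acc in clean_accuracy.items() if acc >= best - tie_tol)
```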

#### 3.2.1. Accuracy, ImageNet-C, and ImageNet-A

Increasing model capacity and using AutoAug have positive effects on classification accuracy, as well as on robustness to ImageNet-C and ImageNet-A, but for all three classes of models CEB gives substantial additional improvements. cCEB gives a small but noticeable additional gain in all three cases (except on ImageNet-A with the wide ResNet152 architecture, where it is indistinguishable from CEB), indicating that enforcing variational consistency is a reasonable modification to the CEB objective. In Table 1 we can see that CEB's relative accuracy gains increase as the architecture gets larger, from gains of 1.2% for ResNet50 and ResNet152 without AutoAug, to 1.6% and 1.8% for the consistent wide models with AutoAug. This indicates that even larger relative gains may be possible when using CEB to train larger architectures than those considered here. We can also see that for the XEnt 152x2 and 152 models, the smaller model (152) actually has a better mCE and equally good top-1 accuracy, indicating that the wider model may be overfitting, but the 152x2 CEB and cCEB models substantially outperform both of them across the board. cCEB gives a noticeable boost over CEB in clean accuracy and mCE for both wide architectures.

#### 3.2.2. Targeted PGD Attacks

We tested on the random-target version of the PGD $L_2$ and $L_\infty$ attacks [4]. The $L_\infty$ attack used $\epsilon = 16$, $n = 10$, and $\epsilon_i = 2$, which is still considered a strong attack [25]. The $L_2$ attack used $\epsilon = 200$, $n = 10$, and $\epsilon_i = 220$. Those parameters were chosen by attempting to match the baseline XEnt ResNet50 without AutoAug model's performance on the $L_\infty$ attack; the performance of the CEB models was not considered when selecting the $L_2$ attack strength. Interestingly, for the PGD attacks, AutoAug was detrimental: the ResNet50 models without AutoAug were substantially more robust than those with AutoAug, and the ResNet152 models without AutoAug were nearly as robust as the AutoAug and wide models, in spite of having much worse test set accuracy. The ResNet50 CEB models show a dramatic improvement over the XEnt model, with top-1 accuracy increasing from 0.3% to 19.8% between the XEnt baseline without AutoAug and the corresponding $\rho = 4$ CEB model, a relative increase of 66 times. Interestingly, the CEB ResNet50 models without AutoAug are much more robust to the adversarial attacks than the AutoAug and wide ResNet50 models. As with the accuracy results above, the robustness gains due to CEB increase as model capacity increases, indicating that further gains are possible.
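For reference, the core of a targeted $L_\infty$ PGD attack is an iterated signed-gradient step toward the target class, followed by projection back into the $\epsilon$-ball around the clean input. The sketch below is a minimal NumPy version; the toy linear-softmax model and the `grad_fn` interface are illustrative assumptions, not the networks used in these experiments, and pixel-range clipping is omitted:

```python
import numpy as np

def targeted_pgd_linf(x, target, grad_fn, eps, eps_i, n):
    """Targeted L-infinity PGD: n signed-gradient steps of size eps_i that
    descend the target-class loss, each projected back into the eps-ball."""
    x_adv = x.copy()
    for _ in range(n):
        g = grad_fn(x_adv, target)                # d(target-class loss)/dx
        x_adv = x_adv - eps_i * np.sign(g)        # step toward the target class
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
    return x_adv

# Toy linear-softmax "model" so the loop above is runnable end to end
# (a stand-in for a trained network's loss gradient).
def make_grad_fn(W):
    def grad_fn(x, target):
        logits = W @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[target] -= 1.0  # gradient of -log softmax_target w.r.t. the logits
        return W.T @ p    # chain rule back to the input
    return grad_fn
```

With the $L_\infty$ settings above ($\epsilon = 16$, $n = 10$, $\epsilon_i = 2$, in pixel units), the projection step guarantees the total perturbation never leaves the $\epsilon$-ball, while each step pushes the input toward the randomly chosen target class.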

#### 3.2.3. Calibration and ImageNet-C

Following the experimental setup of Reference [29], in Figure 5 we compare accuracy and expected calibration error (ECE) on ResNet models for both the clean ImageNet test set and the collection of 15 ImageNet-C corruptions at each of the five corruption intensities. It is easy to see in the figure that the CEB models always have superior mean accuracy and ECE across all six sets of test sets.

Because accuracy can have a strong impact on ECE, we use a different model selection procedure than in the previous experiments. Rather than selecting the CEB model with the highest accuracy, we instead select the CEB model with the closest accuracy to the corresponding XEnt model. This resulted in selecting models with lower $\rho$ than in the previous experiments for four out of the six CEB model classes. We note that by selecting models with lower $\rho$ (which are more compressed), we see more dramatic differences in ECE, but even if we select the CEB models with the highest accuracy, as in the previous experiments, all six CEB models outperform the corresponding XEnt baselines on all six sets of test sets.
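ECE here refers to the standard binned estimator: predictions are bucketed by confidence, and the bin-size-weighted gap between accuracy and mean confidence is summed over the bins. A minimal sketch follows; the 15-bin default mirrors common practice and is an assumption, not a value stated in the text:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: partition predictions by confidence, then take the
    bin-size-weighted sum of |accuracy - mean confidence| within each bin.

    confidences: array of top-1 predicted probabilities in [0, 1].
    correct: array of 0/1 indicators of whether each prediction was right.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated model (bin accuracy equal to bin confidence everywhere) gets an ECE of 0; overconfident models accumulate positive gaps.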