Sequential Normalization: Embracing Smaller Sample Sizes for Normalization

Normalization as a layer within neural networks has over the years demonstrated its effectiveness in neural network optimization across a wide range of different tasks, with one of the most successful approaches being that of batch normalization. The consensus is that better estimates of the BatchNorm normalization statistics (µ and σ²) in each mini-batch result in better optimization. In this work, we challenge this belief and experiment with a new variant of BatchNorm known as GhostNorm that, despite independently normalizing batches within the mini-batches, i.e., µ and σ² are independently computed and applied to groups of samples in each mini-batch, outperforms BatchNorm consistently. Next, we introduce sequential normalization (SeqNorm), the sequential application of the above type of normalization across two dimensions of the input, and find that models trained with SeqNorm consistently outperform models trained with BatchNorm or GhostNorm on multiple image classification data sets. Our contributions are as follows: (i) we uncover a source of regularization that is unique to GhostNorm, and not simply an extension of BatchNorm, and illustrate its effects on the loss landscape; (ii) we introduce sequential normalization (SeqNorm), a new normalization layer that improves upon the regularization effects of GhostNorm; (iii) we compare both GhostNorm and SeqNorm against BatchNorm alone as well as in combination with other regularization techniques; (iv) for both GhostNorm and SeqNorm, we train models whose performance is consistently better than our baselines, including those with BatchNorm, on the standard image classification data sets of CIFAR-10, CIFAR-100, and ImageNet ((+0.2%, +0.7%, +0.4%) and (+0.3%, +1.7%, +1.1%) for GhostNorm and SeqNorm, respectively).


Introduction
The effectiveness of batch normalization (BatchNorm), a technique first introduced by Ioffe and Szegedy [1], on neural network (NN) optimization has been demonstrated over the years on a variety of tasks, including computer vision [2][3][4], speech recognition [5], and others [6][7][8]. BatchNorm is typically embedded at each NN layer either before or after the activation function, normalizing and projecting the input features to match a Gaussian-like distribution. Consequently, the activation values of each layer maintain more stable distributions during NN training, which in turn is thought to enable faster convergence and better generalization performance [1,9,10]. Following the effectiveness of BatchNorm on NN optimization, other normalization techniques emerged [11][12][13][14][15], a number of which introduced normalization across a different input dimension (e.g., layer normalization [12]), while others focused on improving other aspects of BatchNorm, such as the accuracy of the batch statistics estimates [11,16,17] or the train-test discrepancy in BatchNorm use [18].
Despite the wide adoption and practical success of BatchNorm, its underlying mechanics within the context of NN optimization have yet to be fully understood. Initially, Ioffe and Szegedy suggested that its effectiveness came from reducing the so-called internal covariate shift [1].
At a high level, internal covariate shift refers to the change in the distribution of the inputs of each NN layer that is caused by updates to the previous layers. This continual change throughout training was conjectured to negatively affect optimization [1,9]. However, recent research disputes that, with compelling evidence demonstrating how BatchNorm may in fact increase the internal covariate shift [9]. Instead, the effectiveness of BatchNorm is argued to be a consequence of a smoother loss landscape [9]. In the present work, we began with a novel analysis of the effects of BatchNorm and ghost normalization (GhostNorm) on the loss landscape [9], using the MNIST and CIFAR-10 data sets. GhostNorm can be thought of as an extension of BatchNorm, in the same way that GroupNorm extends LayerNorm (illustrated in Figure 1). In particular, in GhostNorm, the initial batch is divided into a number of smaller batches (also called "ghost" batches), each normalized independently of the others [19]. GhostNorm goes against the popular belief that attributes the degradation in BatchNorm performance at smaller batch sizes to poorer estimates of the mean and variance caused by the smaller sample size [11,14,20]. We observed that although GhostNorm decreased the smoothness of the loss landscape when compared to BatchNorm, models trained with GhostNorm across a range of batch sizes (4 to 32 and, in later experiments, up to 512) and ghost batch sizes consistently outperformed their BatchNorm alternatives. Our experimental results corroborate our hypothesis that GhostNorm has a fundamentally different, yet better, effect on NN optimization when compared to BatchNorm. Finally, we used the insights revealed by our analysis to propose a new type of normalization, which we term sequential normalization (SeqNorm). The contributions of this paper are as follows: (i) we introduce different ways of employing GhostNorm as a normalization layer, (ii) we identify a source of regularization in GhostNorm that cannot be found in any of the existing alternatives, (iii) we visualize the loss landscape of GhostNorm under vastly different experimental setups, and observe that GhostNorm consistently decreases the smoothness of the loss landscape, especially in the later epochs of training, while outperforming BatchNorm alternatives, (iv) we introduce a new normalization layer called SeqNorm that applies the GhostNorm approach to normalization sequentially over more input dimensions, (v) we demonstrate consistently better generalization performance on CIFAR-10, CIFAR-100, and ImageNet when BatchNorm is replaced with either GhostNorm or SeqNorm, with the latter surpassing the SOTA on CIFAR-100 and ImageNet.
The rest of the paper is organized as follows. In Section 1.1, we discuss the related work for both GhostNorm and SeqNorm. In Section 2, we formulate the existing normalization layers as well as the key novelty of the present work, SeqNorm, highlight the differences, and provide implementation details for both GhostNorm and SeqNorm. In Section 3, we first conduct experiments to visualize the loss landscape of GhostNorm, the smoothness of which has been described as the primary reason behind the effectiveness of BatchNorm, and then train models for image classification on CIFAR-10, CIFAR-100, and ImageNet. This section, alongside Appendices B and C, provides reproducibility information for all conducted experiments. Finally, we conclude our work with a discussion of our experimental results in Section 4.

Related Work
Ghost normalization is a technique originally introduced by Hoffer et al. [19]. Over the years, the primary use of GhostNorm has been to optimize NNs with large batch sizes over multiple GPUs [21]. Unfortunately, when compared to other normalization techniques [11][12][13][14][15], the adoption of GhostNorm has been rather scarce and largely confined to large batch size training regimes [21][22][23][24]. More recently, GhostNorm has been used instead of BatchNorm as a means of regulating the amount of noise that arises from the estimation of the normalization statistics for increasingly larger batch sizes [22][23][24]. This was achieved by keeping the ghost batch size constant [19].
Closest in spirit to the present work is the recent research by Summers and Dinneen [21], who experimented with GhostNorm in both small and medium batch size training regimes. Summers and Dinneen [21] tuned the number of groups within GhostNorm (see Section 2.1) on CIFAR-100, Caltech-256, and SVHN, and reported positive results on the first two data sets. More results are reported on other data sets through transfer learning. However, the use of other new optimization methods confounds the attribution of the observed improvement.
The closest line of work to SeqNorm is, again, found in the work of Summers and Dinneen [21]. Therein, they employed a normalization technique which, although it may appear similar to SeqNorm at first glance, is fundamentally different. This stems from the vastly different goals of our works, i.e., Summers and Dinneen tried to improve normalization layers for small batch sizes [21], whereas we strive to improve normalization layers in a more general setting. At a high level, where SeqNorm performs GroupNorm and GhostNorm sequentially, their normalization method applies both simultaneously. At a fundamental level, the normalization layer that was used by Summers and Dinneen embeds the stochastic nature of GhostNorm into that of GroupNorm (see Section 2.2), thereby potentially disrupting the learning of channel grouping within NNs. Other works that apply simultaneous normalization strategies include that of Bronskill et al. [25], who blended the moments of BatchNorm with those of InstanceNorm or LayerNorm, as well as Luo et al. [26], who introduced switchable normalization, a layer that enables the NN to learn which normalization techniques to employ at different layers.

Formulation
Given a fully connected or convolutional neural network, the parameters of a typical layer l with normalization, Norm, are the weights W^l as well as the scale and shift parameters γ^l and β^l. For brevity, we omit the l superscript. Given an input tensor X, the activation values A of layer l are computed as

A = g(Norm(X ∗ W) ⊗ γ + β),  (1)

where g(·) is the activation function, ∗ corresponds to either matrix multiplication or convolution for fully connected and convolutional layers respectively, and ⊗ describes an element-wise multiplication.
Most normalization techniques differ in how they transform the product X W. Let the product be a tensor with (M, C, F) dimensions, where M is the so-called mini-batch size, or just batch size, C is the channels dimension, and F is the spatial dimension.
In BatchNorm [1], the given tensor is normalized independently for each channel. In particular, the mean and variance are computed for each of the C slices of (M, F) dimensions (see Figure 1), and are subsequently used to normalize each channel c ∈ C independently. In LayerNorm [12], statistics are computed over M slices, each of (C, F) dimensions, normalizing the values of each data sample m ∈ M independently. InstanceNorm [15] normalizes the values of the tensor over both M and C, i.e., it computes statistics across M × C slices of F dimensions.
GroupNorm [14] can be thought of as an extension of LayerNorm, wherein the C dimension is divided into G_C groups, i.e., (M, G_C, C/G_C, F). Statistics are calculated over M × G_C slices of (C/G_C, F) dimensions. Similarly, GhostNorm can be thought of as an extension of BatchNorm, wherein the M dimension is divided into G_M groups, normalizing over C × G_M slices of (M/G_M, F) dimensions. Both G_C and G_M are hyperparameters that can be tuned based on a validation set. All of the aforementioned normalization techniques are illustrated in Figure 1.
SeqNorm can be thought of as the sequential employment of GroupNorm and GhostNorm. Initially, the input tensor is divided into (M, G_C, C/G_C, F) dimensions, normalizing across M × G_C slices, i.e., the same as GroupNorm. Then, once the G_C and C/G_C dimensions are collapsed back together, the input tensor is divided into (G_M, M/G_M, C, F) dimensions, normalizing across C × G_M slices, i.e., the same as GhostNorm. Each of the slices described above is treated as a one-dimensional set of values S = {s_1, ..., s_n}. The mean and variance of S are computed in the traditional way:

µ = (1/n) ∑_{i=1}^{n} s_i ,   σ² = (1/n) ∑_{i=1}^{n} (s_i − µ)² ,  (2)

and the values of S are then normalized as shown below:

ŝ_i = (s_i − µ) / √(σ² + ε) ,  (3)

where ε is a small constant added for numerical stability.
Once all slices are normalized, the output of the Norm layer is simply the concatenation of all slices back into the initial tensor shape.
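As an illustration of Equations (2) and (3), the snippet below is a minimal sketch of the per-slice computation; the function name and the ε value are illustrative assumptions rather than details taken from the paper's implementation.

```python
import torch

def normalize_slice(S: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize one slice S: subtract its mean and divide by the square
    root of its (biased) variance plus a small eps (Equations (2)-(3))."""
    mu = S.mean()
    var = S.var(unbiased=False)
    return (S - mu) / torch.sqrt(var + eps)
```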

The Effects of Ghost Normalization
There is only one other published work which has investigated the effectiveness of ghost normalization for small and medium mini-batch sizes [21]. Therein, the authors hypothesized that GhostNorm offers stronger regularization than BatchNorm, as it computes the normalization statistics on smaller sample sizes [21]. In this section, we support that hypothesis by providing insights into a particular source of regularization, unique to GhostNorm, that stems from the normalization of activations in independent groups and with different statistics.
Consider as an example the tuple X with values (35, 39, 30, 4, 38, 26, 27, 19), which can be thought of as an input tensor with (8, 1, 1) dimensions. Given to a BatchNorm layer, the output is the normalized version X̂ with values (0.7, 1.1, 0.3, −2.2, 1.0, −0.1, −0.02, −0.8). Note how, although the values have changed, the ranking order of the activation values has remained the same, e.g., the 2nd value is larger than the 5th value in both X (39 > 38) and X̂ (1.1 > 1.0). More formally, for all indices i and j, the following holds true:

x_i ≥ x_j  ⟺  x̂_i ≥ x̂_j .  (4)

On the other hand, given X to a GhostNorm layer with G_M = 2, the output X̂ is (0.6, 0.9, 0.2, −1.7, 1.5, −0.2, −0.07, −1.2). Now, we observe that after normalization, the 2nd value has become much smaller than the 5th value in X̂ (0.9 < 1.5). Where BatchNorm preserves the ranking order of the received activations, GhostNorm can end up reordering them, and hence alter the course of optimization. Our experimental results demonstrate how GhostNorm improves upon BatchNorm, supporting the hypothesis that the above type of regularization can be beneficial to optimization. Note that for BatchNorm, the condition in Equation (4) only holds true across the M × F dimensions of the input tensor, whereas for GhostNorm it cannot be guaranteed for any dimension.
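The toy example above can be reproduced with a few lines of PyTorch; the sketch below uses illustrative variable names and omits the ε term for clarity.

```python
import torch

x = torch.tensor([35., 39., 30., 4., 38., 26., 27., 19.])

# BatchNorm-style: one mean/variance over all eight values.
bn = (x - x.mean()) / torch.sqrt(x.var(unbiased=False))
# -> approx. [ 0.7,  1.1,  0.3, -2.2,  1.0, -0.1, -0.02, -0.8]

# GhostNorm-style with G_M = 2: each group of four values is normalized
# with its own mean and variance.
groups = x.view(2, 4)
gn = (groups - groups.mean(dim=1, keepdim=True)) / torch.sqrt(
    groups.var(dim=1, unbiased=False, keepdim=True))
gn = gn.view(-1)
# -> approx. [ 0.6,  0.9,  0.2, -1.7,  1.5, -0.2, -0.07, -1.2]

# The ranking of the 2nd and 5th activations flips under GhostNorm.
print(bn[1] > bn[4], gn[1] > gn[4])  # tensor(True) tensor(False)
```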

GhostNorm to BatchNorm
One can argue that the same type of regularization can be found in BatchNorm over different mini-batches, e.g., given [35, 39, 30, 4] and [38, 26, 27, 19] as two different mini-batches. However, GhostNorm introduces the above during each forward pass rather than between forward passes. Hence, it is a regularization that is embedded during learning (GhostNorm), rather than across learning (BatchNorm).

GhostNorm to GroupNorm
Despite the visual symmetry between GhostNorm and GroupNorm, there is one major difference. Grouping has been employed extensively in classical feature engineering, such as SIFT, HOG, and GIST, wherein independent normalization is often performed over these groups [14]. At a high level, GroupNorm can be thought of as motivating the network to group similar features together [14]. However, for GhostNorm, this would not be possible due to the random sampling and random arrangement of the data within each mini-batch. Therefore, we hypothesize that the effects of these two normalization techniques could be combined for their benefits to accumulate. Specifically, we propose SeqNorm, a normalization technique that employs both GroupNorm and GhostNorm in a sequential manner. SeqNorm can also be thought of as a natural extension of GhostNorm that allows smaller sample size normalization on more input dimensions.

Ghost Normalization
An implementation of GhostNorm is shown in Appendix A, Figure A1. The exponential moving averages are omitted there for brevity; it is worth mentioning that they were accumulated in the same way as in BatchNorm. In addition to the above implementation, GhostNorm can be effectively employed while using BatchNorm as the underlying normalization technique.
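Figure A1 is not reproduced here; the following is a minimal sketch of a GhostNorm layer for 4D inputs, written under the same assumption that running statistics and the affine parameters γ and β are omitted. The class and argument names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class GhostNorm2d(nn.Module):
    """Minimal GhostNorm sketch for (M, C, H, W) inputs: the mini-batch is
    split into num_groups "ghost" batches, each normalized independently,
    per channel, over its batch and spatial (H x W, i.e., F) dimensions."""

    def __init__(self, num_groups: int, eps: float = 1e-5):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m, c, h, w = x.shape
        # Reshape to (G_M, M/G_M, C, H, W); statistics per ghost batch and channel.
        x = x.view(self.num_groups, m // self.num_groups, c, h, w)
        mu = x.mean(dim=(1, 3, 4), keepdim=True)
        var = x.var(dim=(1, 3, 4), unbiased=False, keepdim=True)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return x.view(m, c, h, w)
```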
When the desired batch size exceeds the memory capacity of the available GPUs, practitioners often resort to accumulating gradients. However, it turns out that when BatchNorm is employed in the NN, the gradients can be substantially different between the above two cases. This is a consequence of the mean and variance calculation (see Equation (2)), since each forwarded smaller batch of M/n_fp data samples will have a different mean and variance than if all M examples were present. Accumulating gradients with BatchNorm can thus be thought of as an alternative way of using GhostNorm, with the number of forward passes n_fp corresponding to the number of groups G_M. A PyTorch implementation of accumulating gradients is shown in Appendix A, Figure A2.
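Figure A2 is likewise not reproduced here; the snippet below is a self-contained sketch of the idea with a toy model and random data, where the model, data, and hyperparameter values are purely illustrative.

```python
import torch
import torch.nn as nn

# Gradient accumulation with a BatchNorm model: each of the n_fp forward
# passes normalizes its own micro-batch of M / n_fp samples, so the
# accumulated gradients behave like GhostNorm with G_M = n_fp rather than
# like a single forward pass over all M samples.
torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8),
                      nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(8, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 3, 32, 32)         # M = 16 samples
y = torch.randint(0, 10, (16,))
n_fp = 4                                # number of forward passes (= G_M)

optimizer.zero_grad()
for xb, yb in zip(x.chunk(n_fp), y.chunk(n_fp)):
    loss = criterion(model(xb), yb) / n_fp
    loss.backward()                     # gradients accumulate across passes
optimizer.step()
```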
Finally, the most popular implementation of GhostNorm via BatchNorm, albeit a typically unintentional one, comes as a consequence of using multiple GPUs. Given n_g GPUs and M training examples, M/n_g examples are forwarded to each GPU. If the BatchNorm statistics are not synchronized across the GPUs, which is often the case for image classification, then n_g corresponds to the number of groups G_M.
A practitioner who would like to use GhostNorm should employ the implementation shown in Appendix A. Nevertheless, under the discussed circumstances, one could explore GhostNorm through the use of other means.

Sequential Normalization
The implementation of SeqNorm is straightforward, since it applies GroupNorm, a widely implemented normalization technique, and GhostNorm, for which we have discussed three possible implementations, in a sequential manner. A CUDA-native approach is left as future work.
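As an illustration, the sketch below composes PyTorch's nn.GroupNorm with the GhostNorm2d sketch from the Ghost Normalization subsection above; running statistics and affine parameters are again omitted, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class SeqNorm2d(nn.Module):
    """Minimal SeqNorm sketch: GroupNorm over groups_c channel groups,
    followed by GhostNorm over groups_m ghost batches."""

    def __init__(self, num_channels: int, groups_c: int, groups_m: int):
        super().__init__()
        self.group_norm = nn.GroupNorm(groups_c, num_channels, affine=False)
        self.ghost_norm = GhostNorm2d(groups_m)  # sketch defined earlier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ghost_norm(self.group_norm(x))
```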

Experiments and Results
In this section, we first take a closer look at the effects of GhostNorm by visualizing the smoothness of the loss landscape during training, a property which has been described as the primary reason behind the effectiveness of BatchNorm. Then, we conduct a number of ablation experiments, comparing both GhostNorm and SeqNorm against other approaches (methods that failed to improve over our baselines are discussed in Appendix D). Finally, we evaluate the effectiveness of both GhostNorm and SeqNorm on the standard image classification data sets of CIFAR-10 (Canadian Institute For Advanced Research), CIFAR-100, and ImageNet. Note that in all of our experiments, the smallest M/G_M ratio we employ for both SeqNorm and GhostNorm is 4. A ratio of 1 would be undefined for normalization, whereas a ratio of 2 results in large information corruption, i.e., all activations are reduced to either 1 or −1 (see the short example below).
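As a quick illustration of the ratio-of-2 case, in the simplest setting of a single value per sample (F = 1), normalizing any group of two distinct values collapses them to ±1:

```python
import torch

pair = torch.tensor([26., 19.])
normed = (pair - pair.mean()) / pair.std(unbiased=False)
print(normed)  # tensor([ 1., -1.]) -- any two distinct values collapse to +/-1
```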

Loss Landscape Visualisation
We visualize the loss landscape during optimization on MNIST and CIFAR-10, using the approach described by Santurkar et al. [9]. Each time the network parameters are about to be updated, we walk along the gradient direction and compute the loss at multiple points. This enables us to visualize the smoothness of the loss landscape by observing how predictive the computed gradients are. In particular, at each step of updating the network parameters, we compute the loss at a range of learning rates, and store both the minimum and maximum loss. Implementation details are provided in Appendix B.
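A minimal sketch of this measurement is given below, assuming a model, a loss function, and a single batch are available; the learning rate grid and the function name are illustrative, not the exact values used in our experiments.

```python
import torch

def loss_range_along_gradient(model, loss_fn, batch, lrs=(0.05, 0.1, 0.2, 0.4)):
    """At the current parameters, compute the gradient once, then measure the
    loss after stepping along that gradient with several learning rates.
    Returns the minimum and maximum loss; parameters are restored afterwards."""
    x, y = batch
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    originals = [p.detach().clone() for p in model.parameters()]
    losses = []
    with torch.no_grad():
        for lr in lrs:
            for p, g, p0 in zip(model.parameters(), grads, originals):
                p.copy_(p0 - lr * g)            # step along the gradient
            losses.append(loss_fn(model(x), y).item())
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0)                         # restore original parameters
    return min(losses), max(losses)
```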
For both data sets and networks, we observe that the smoothness of the loss landscape deteriorates when GhostNorm is employed. In fact, for MNIST, as seen in Figure 2, the loss landscape of GhostNorm bears a closer resemblance to that of our baseline, which did not use any normalization technique. For CIFAR-10, as seen in Figure 3, this is only observable toward the last epochs of training. In spite of the above observation, we consistently witness better generalization performance with GhostNorm in almost all of our experiments, even at the extremes wherein G_M is set to 128, i.e., only 4 samples per group.
Our experimental results challenge the often established correlation between a smoother loss landscape and a better generalization performance [9,13]. Although beyond the scope of our work, a theoretical analysis of the implications of GhostNorm when compared to BatchNorm could potentially uncover further insights into the optimization mechanisms of both normalization techniques.

CIFAR-100
Initially, we turn to CIFAR-100, and tune the hyperparameters of both GhostNorm and SeqNorm in a grid-search fashion. The results are shown in Table 1. We also examine a noisy version of BatchNorm, wherein we inject Gaussian noise on the activations just before normalization. Finally, for all normalization layers, we also train models that employ dropout as well as RandAugment [27]. All of the aforementioned regularization techniques were tuned as described in Appendix C.
Both GhostNorm and SeqNorm improve upon the BatchNorm baseline by a large margin (+0.7% and +1.7%, respectively). Noisy BatchNorm does not improve the generalization performance, showing that GhostNorm and SeqNorm embed more than just unstructured noise. Models with dropout are omitted since they fail to provide any improvement on the validation set over the baselines. RandAugment substantially improves the BatchNorm (+0.9%) and GhostNorm models (+0.7%), but fails to benefit models with SeqNorm. Despite the lack of synergy with RandAugment, it is important to note that SeqNorm still manages to surpass the current SOTA performance on CIFAR-100 by 0.5% [27]. These results support our hypothesis that sequentially applying GhostNorm and GroupNorm layers can have an additive effect on improving NN optimization.
However, the grid-search approach to tuning G_C and G_M of SeqNorm can be rather time consuming (time complexity: Θ(G_C × G_M)). Hence, we attempt to identify a less demanding hyperparameter tuning approach. The most obvious one, and the one we actually adopt for the subsequent experiments, is to tune G_M and G_C sequentially. In particular, we find it effective to first tune G_M, then select the largest g_M ∈ G_M for which the network performs well (amongst similarly performing models, select the one with the lowest variance), and finally tune G_C with g_M kept fixed (time complexity: Θ(G_C + G_M)). In other words, for tuning the hyperparameters of SeqNorm, one first tunes the hyperparameter of GhostNorm, G_M, and then the hyperparameter of GroupNorm, G_C, while keeping G_M constant. Note that by following this approach on CIFAR-100, we still end up with the same best hyperparameter configuration, i.e., G_C = 4 and G_M = 8.

Table 1. Results on CIFAR-100. For SeqNorm, we only show the best results for each G_C. Both validation and testing performance results are averaged over two different runs.
CIFAR-10 and ImageNet

Based on the tuning strategy described in the previous section, for SeqNorm, we adopt G_M = 8 (lowest variance) and tune G_C for values between 1 and 16, inclusive. Although the network performs similarly at ≈96.8% accuracy for G_C ∈ {1, 8, 16}, we choose G_C = 16, as it achieves slightly higher accuracy than the rest. Using the above configuration, SeqNorm is able to match the current SOTA on the testing set [28], yet, as with CIFAR-100, without the employment of RandAugment.
In addition to the original ImageNet validation set, we also evaluate our models on three recently released test sets for ImageNet [29]. Without any further retraining (i.e., on the validation set), on average, SeqNorm is able to substantially surpass the reproduced top-1 accuracy of BatchNorm, namely by 1.5%, while GhostNorm also improves the accuracy by 0.8%. The results for both CIFAR-10 and ImageNet are shown in Table 2.

Table 2. Results on the CIFAR-10 and ImageNet data sets. Both validation and testing performance results of CIFAR-10 are averaged over two different runs. For ImageNet, each model is evaluated on the conventional validation set, as well as on three newly released test sets [29].

Discussion
In this work, we first demonstrate the effectiveness of GhostNorm on a number of different networks, learning policies, and data sets. For instance, when using super-convergence on CIFAR-10, GhostNorm performs better than BatchNorm, even though the former normalizes the input activations using only 4 samples (i.e., the mini-batch is divided into groups of 4 that are normalized based on their group µ and σ²), whereas the latter uses all 512 samples. This is antithetical to the common belief that attributes BatchNorm performance degradation to poorer estimates of the batch statistics caused by having a smaller sample size [11,14,20,21]. Instead, our experimental results suggest that grouping along the batch dimension is effective. Indeed, similar results were observed for GroupNorm, wherein any number of groups would give better results than LayerNorm (all channels in one group) [14]. By providing novel insight into the source of regularization in GhostNorm, and by introducing a number of possible implementations, we hope to inspire further research into GhostNorm and its more widespread adoption.
However, we argue that even though GhostNorm and GroupNorm both use grouping, they have vastly different effects on optimization. Based on the understanding developed while investigating GhostNorm, we introduce SeqNorm and follow up with an empirical analysis. Unlike methods such as switchable normalization, we argue that SeqNorm provides a better alternative, since the use of the different normalization techniques is independent of the training optimization [21].
Surprisingly, SeqNorm not only surpasses the performances of BatchNorm and GhostNorm, but even meets or surpasses current SOTA methodologies on CIFAR-10, CIFAR-100, and ImageNet [27][28][29]. The proposed normalization layer results in models that consistently outperform our baseline alternatives with minimal cost (two hyperparameters) yet notable generalization gains. SeqNorm provides performance gains that are comparable to, or better than, those of sophisticated data augmentation strategies [27,28]. Finally, we describe and validate a hyperparameter tuning strategy for SeqNorm that provides a faster alternative to the traditional grid-search approach.

Appendix B. Loss Landscape Visualization
Appendix B.1. Implementation Details

On MNIST, we train a fully connected neural network (SimpleNet) with two fully connected layers of 512 and 300 neurons. The input images are flattened to one-dimensional vectors of 784 values, and are normalized based on the mean and variance of the training set. The learning rate is set to 0.4 for a batch size of 512 on a single GPU.
In addition to training SimpleNet with BatchNorm and GhostNorm, we also train a SimpleNet baseline without any normalization technique.
A residual convolutional network with 56 layers (ResNet-56) [4] is employed for CIFAR-10. We achieve super-convergence by using the one-cycle learning policy described in the work of Smith and Topin [30]. Horizontal flipping and pad-and-crop transformations are used for data augmentation. Most of the hyperparameter values were adopted from the work of Smith and Topin [30]. In particular, we employ stochastic gradient descent with a weight decay of 0.0001, and a one-cycle learning rate schedule that increases linearly from 0.1 to 3.0 over 15 epochs, decreases linearly back to 0.1 over the next 15 epochs, and then decreases linearly to 0.003 over the last 10 epochs. The optimizer does not employ any momentum. In order to train ResNet-56 without a normalization technique (baseline), we have to adjust the cyclical learning rate range to (0.1, 1).
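For concreteness, the schedule above corresponds to the following piecewise-linear function of the epoch index; this is a sketch of the description in the text, and whether the rate is updated per epoch or per iteration is an assumption.

```python
def one_cycle_lr(epoch: int) -> float:
    """Piecewise-linear learning rate over 40 epochs: 0.1 -> 3.0 (epochs 0-14),
    3.0 -> 0.1 (epochs 15-29), 0.1 -> 0.003 (epochs 30-39)."""
    if epoch < 15:
        return 0.1 + (3.0 - 0.1) * epoch / 15
    if epoch < 30:
        return 3.0 - (3.0 - 0.1) * (epoch - 15) / 15
    return 0.1 - (0.1 - 0.003) * (epoch - 30) / 10
```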
We train the networks on 50,000 and 60,000 training images (CIFAR-10 and MNIST respectively), and evaluate on 10,000 testing images.