MiAMix: Enhancing Image Classification through a Multi-stage Augmented Mixed Sample Data Augmentation Method

Despite substantial progress in the field of deep learning, overfitting persists as a critical challenge, and data augmentation has emerged as a particularly promising approach due to its capacity to enhance model generalization in various computer vision tasks. While various strategies have been proposed, Mixed Sample Data Augmentation (MSDA) has shown great potential for enhancing model performance and generalization. We introduce a novel mixup method called MiAMix, which stands for Multi-stage Augmented Mixup. MiAMix integrates image augmentation into the mixup framework, utilizes multiple diversified mixing methods concurrently, and improves the mixing method by randomly selecting mixing mask augmentation methods. Whereas recent methods rely on saliency information, MiAMix is also designed for computational efficiency, reducing additional overhead and offering easy integration into existing training pipelines. We comprehensively evaluate MiAMix on four image benchmarks against current state-of-the-art mixed sample data augmentation techniques and demonstrate that MiAMix improves performance without heavy computational overhead.


Introduction
Deep learning has revolutionized a wide range of computer vision tasks such as image classification, image segmentation, and object detection [1,2]. However, despite these significant advancements, overfitting remains a challenge [3]. Distribution shifts between the training set and the test set may degrade model performance, and the problem is particularly exacerbated when working with limited labeled data or with corrupted data. Numerous mitigation strategies have been proposed, and among these, data augmentation has proven to be remarkably effective [4,5]. Data augmentation techniques increase the diversity of training data by applying various transformations to input images during model training. The model is thereby trained on a wider slice of the underlying data distribution, which improves generalization and robustness to unseen inputs. Of particular interest among these techniques are mixup-based methods, which create synthetic training examples by combining pairs of training examples and their labels [6].
Manuscript in submission 2023, do not distribute. arXiv:2308.02804v2 [cs.CV] 15 Aug 2023
Subsequent to mixup, an array of innovative strategies was developed that go beyond the simple linear weighted blending of mixup and instead apply more intricate ways to fuse image pairs. Notable among these are the CutMix and FMix methods [7,8]. The CutMix technique [7] cuts parts of one image and pastes them onto another, thereby merging the images in a region-based manner. FMix [8], on the other hand, applies a binary mask to the frequency spectrum of images for fusion, achieving an enhanced mixup process that can take on a wide range of mask shapes rather than only the square mask of CutMix. These methods have been successful in preserving local spatial information while introducing more extensive variations into the training data.
While mixup-based methods have shown promising results, there remains ample room for innovation and improvement. These mixup techniques utilize little to no prior knowledge, which simplifies their integration into training pipelines and incurs only a modest increase in training costs. To further enhance performance, some methodologies have leveraged intrinsic image features to boost the impact of mixup-based methods [9]. Following this approach, some recent methods employ model-generated features to guide the image mixing [10]. Furthermore, some researchers have incorporated image labels and model outputs into the training process as prior knowledge, introducing another dimension along which to improve performance [11]. These methods often introduce a considerable increase in training costs, because the prior knowledge must be extracted and the mixing mask constructed dynamically. This added complexity not only impacts the speed and efficiency of the training process but can also act as a barrier to deployment in resource-constrained environments. Despite their theoretical simplicity, these methods might pose integration challenges in practice. The necessity to adjust the existing pipeline to accommodate these techniques could complicate the training process and hinder their adoption in a broader range of applications. Given this, we are driven to ponder an important question about the evolution of mixed sample data augmentation methods: How can we fully unleash the potential of MSDA while avoiding extra computational cost and facilitating seamless integration into existing training pipelines?
With RandAugment [4] and other image augmentation policies, we are in fact applying multiple layers of data augmentation to the input images, and those works have shown that a multilayered and diversified data augmentation strategy can significantly improve the generalization and performance of deep learning models. RandomMix [12] begins to ensemble MSDA methods by randomly choosing one from a set of methods. However, by allowing only one mixing mask to be applied, RandomMix imposes some unnecessary limitations. Firstly, the variety of mixing methods can be greatly improved if multiple mask methods can be applied together. Secondly, the diversity of possible mixing shapes can be greater if we further augment the mixing masks. Thirdly, we draw insights from AugMix, an approach that applies different randomly sampled augmentations to the same input image and mixes those augmented images. With the help of a customized loss function design, it achieved substantial improvements in robustness. Inspired by this, we propose to remove a limitation of conventional MSDA methods and allow a sample to mix with itself with an assigned probability. It is essential to note that, during this mixing process, the input data must undergo two distinct random data augmentations.
In this paper, we propose MiAMix: Multi-stage Augmented Mixup. MiAMix alleviates the previously mentioned restrictions. Our contributions can be summarized as follows:
1. We first revisit the design of GMix [13], leading to an augmented form called AGMix. This novel form fully capitalizes on the flexibility of the Gaussian kernel to generate more diversified mixing outputs.
2. A novel sampling method for the mixing ratio is designed for multiple mixing masks.
3. We define a new MSDA method with multiple stages: random sample pairing, sampling of mixing methods and ratios, generation and augmentation of mixing masks, and finally, the mixed sample output stage. We consolidate these stages into a comprehensive framework named MiAMix and establish a search space with multiple hyper-parameters.
To assess the performance of our proposed AGMix and MiAMix methods, we conducted a series of rigorous evaluations across the CIFAR-10, CIFAR-100, and Tiny-ImageNet [14] datasets. The outcomes of these experiments substantiate that MiAMix consistently outperforms the leading mixed sample data augmentation methods, establishing a new benchmark in this realm. In addition to measuring the generalization performance, we also evaluated the robustness of our model in the presence of natural noises. The experiments demonstrated that the application of MiAMix during training considerably enhances the model's robustness against such perturbations. Moreover, to scrutinize the effectiveness of our multi-stage design, we implemented an extensive ablation study using the ResNet18 [1] model on the Tiny-ImageNet dataset.

Related Works
Mixup-based data augmentation methods have played an important role in deep neural network training [15]. Mixup generates mixed samples via linear interpolation between two images and their labels [6]. The mixed input $\tilde{x}$ and label $\tilde{y}$ are generated as:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j,$$
where $x_i$, $x_j$ are raw input vectors, and
$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$
where $y_i$, $y_j$ are one-hot label encodings.
$(x_i, y_i)$ and $(x_j, y_j)$ are two examples drawn at random from our training data, and $\lambda \in [0, 1]$ is sampled as $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, for $\alpha \in (0, \infty)$. Following the development of Mixup, an assortment of techniques has been proposed that focus on merging two images as part of the augmentation process. Among these, CutMix [7] has emerged as a particularly compelling method.
Instead of creating a linear combination of two images as Mixup does, CutMix generates a mixing mask with a square-shaped area, and the targeted area of the image is replaced by the corresponding part of a different image. This method is considered a cutting technique because of the way it fuses two images. The cutting and replacement idea has also been used in FMix [8] and GridMix [16].
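To make the two formulations above concrete, the following is a minimal NumPy sketch that treats both Mixup and CutMix as mask generators. The function names, the (H, W, C) image layout, and the choice to weight the labels by the mean of the mask are assumptions of this sketch, not details taken from the referenced implementations.

```python
import numpy as np

def mixup_mask(lam, height, width):
    """Mixup as a mask: every pixel takes weight lam from the first image."""
    return np.full((height, width), lam, dtype=np.float64)

def cutmix_mask(lam, height, width, rng):
    """CutMix-style mask: a rectangle of roughly (1 - lam) of the area is
    marked 0, meaning those pixels come from the second image."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.integers(0, height), rng.integers(0, width)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    mask = np.ones((height, width), dtype=np.float64)
    mask[top:bottom, left:right] = 0.0
    return mask

def mix_pair(x_i, y_i, x_j, y_j, mask):
    """Blend two images with a mask; labels are weighted by the mask mean
    (ASSUMPTION: the rectangle may be clipped at the border, so the realised
    ratio can differ from the sampled lam)."""
    lam_eff = float(mask.mean())
    x_mix = mask[..., None] * x_i + (1.0 - mask[..., None]) * x_j
    y_mix = lam_eff * y_i + (1.0 - lam_eff) * y_j
    return x_mix, y_mix, lam_eff

rng = np.random.default_rng(0)
alpha = 1.0
lam = rng.beta(alpha, alpha)                      # lambda ~ Beta(alpha, alpha)
x_i, x_j = rng.random((32, 32, 3)), rng.random((32, 32, 3))
y_i, y_j = np.eye(10)[3], np.eye(10)[7]           # one-hot labels
x_mu, y_mu, lam_mu = mix_pair(x_i, y_i, x_j, y_j, mixup_mask(lam, 32, 32))
x_cm, y_cm, lam_cm = mix_pair(x_i, y_i, x_j, y_j, cutmix_mask(lam, 32, 32, rng))
```

With a constant mask the blend reduces to the Mixup interpolation, while the binary mask reproduces the cut-and-paste behaviour of CutMix.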
The paper [13] unified the design of different MSDA masks and proposed GMix. Gaussian Mixup (GMix) generates mixed samples by combining two images using a Gaussian mixing mask. GMix first randomly selects a center point c in the input image. It then generates a Gaussian mask centered at c, whose width σ is set based on the mixing ratio λ and the image size N. This results in a smooth Gaussian mix of the two images, transitioning from one image to the other around the point c.

GMix and Our AGMix
To further enhance the mixing capabilities of our method, we extend the Gaussian kernel matrix used in GMix to a new kernel matrix with randomized covariance. The motivation behind this extension is to allow for more diversified output mixing shapes in the mix mask. Specifically, we replace the identity kernel matrix with a randomized kernel matrix as follows:
$$\Sigma = \begin{pmatrix} 1 & q \\ q & 1 \end{pmatrix}$$
Here, $\Sigma$ is the Gaussian kernel covariance matrix. We keep the diagonal values at 1, which means that we do not randomize the intensity of the mixing; this should be controlled solely by the mixing ratio coefficient λ. To preserve the assigned mixing ratio λ and to constrain the shape of the mask region, we sample the parameter q from a uniform distribution over the restricted range (−1, 1). By randomizing the off-diagonal covariance q, we allow the mixing mask to have a broader range of shapes and mixing patterns. To add further variation to the mixing shape, we apply rotations to the mixing mask by defining a rotation matrix R as follows:
$$R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
where θ is a random rotation angle. We then rotate the mixing mask M using the rotation matrix R to obtain a rotated mixing mask $M_{rot}$.
A comparative visualization between GMix and AGMix is depicted in Figure 1. This comparison shows how AGMix augments the original GMix approach, introducing a wealth of varied shapes and distortions. This innovation also inspires us to apply similar rotational and shear augmentations to other applicable mixing masks. In the experiment results section, a series of experiments provides an in-depth comparison of AGMix and GMix, further underscoring the improvements brought by the method.
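The construction described in this section can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the width rule `sigma = sqrt(1 - lam) * size`, the mask convention `1 - exp(...)`, and the decision to rotate the covariance matrix (which rotates the Gaussian about its center) instead of resampling the mask image are all choices made for this sketch, and the function name `agmix_mask` is hypothetical.

```python
import numpy as np

def agmix_mask(lam, size, rng, q=None, theta=None):
    """Gaussian mixing mask with randomized covariance and rotation (sketch)."""
    if q is None:
        q = rng.uniform(-1.0, 1.0)             # off-diagonal covariance
    q = float(np.clip(q, -0.999, 0.999))       # keep Sigma numerically invertible
    if theta is None:
        theta = rng.uniform(0.0, 2.0 * np.pi)  # random rotation angle
    sigma_mat = np.array([[1.0, q], [q, 1.0]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    # ASSUMPTION: placeholder width rule; the exact rule is not given here.
    sigma = max(np.sqrt(1.0 - lam) * size, 1e-3)
    cov = (sigma ** 2) * (rot @ sigma_mat @ rot.T)
    cov_inv = np.linalg.inv(cov)
    center = rng.uniform(0, size, size=2)      # random center point c
    ys, xs = np.mgrid[0:size, 0:size]
    d = np.stack([ys - center[0], xs - center[1]], axis=-1).astype(np.float64)
    # Squared Mahalanobis distance of every pixel to the center.
    maha = np.einsum("...i,ij,...j->...", d, cov_inv, d)
    return 1.0 - np.exp(-0.5 * maha)

mask = agmix_mask(0.7, 32, np.random.default_rng(0))
```

Setting `q = 0` gives an isotropic Gaussian, for which the rotation has no effect; nonzero `q` stretches the mask along a diagonal, and the rotation then changes its orientation.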

MiAMix
We introduce the MiAMix method and its detailed design in this section. The framework consists of four distinct stages: random sample pairing, sampling of mixing methods and ratios, the generation and augmentation of mixing masks, and finally, the mixed sample output stage. Each stage is discussed in the ensuing subsections. These stages are presented step by step in Algorithm 1, the parameters are listed in Table 1, and a practical illustration of the processes within each stage can be found in Figure 2.
To understand the effects of the various design choices of the proposed algorithm, we conduct a series of ablation studies in the experiment results section. We also compare our method with previous MSDA methods to show that MiAMix strikes a balance between performance and computational overhead.

Random Sample Paring
The conventional method of mix pair sampling is to directly shuffle the sample indices to establish mixing pairs. Our approach differs in two primary ways. The first difference is that, in our image augmentation module, we prepare two sets of random augmentation results, so that the two inputs of a mixing pair are augmented independently. This addresses a crucial oversight in prior work, revealed by [17]; in MiAMix, we addressed this issue and obtained a measurable improvement. The second, and arguably more critical, distinction is the introduction of a new probability parameter, denoted $p_{self}$, which enables images to mix with themselves and generate "corrupted" outputs. This strategy draws on the notable enhancement in robustness exhibited by AugMix [18]. Integrating the scenario of an image mixing with itself can significantly benefit the model, and we examine it in the experimental section of this paper.
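The pairing stage described above can be sketched as follows. The helper names and the toy augmentation (random horizontal flip plus small Gaussian noise) are hypothetical stand-ins; the only elements taken from the text are the two independently augmented views and the self-mixing probability $p_{self}$.

```python
import numpy as np

def sample_partners(batch_size, p_self, rng):
    """Return partner indices; partner[i] == i means sample i mixes with itself."""
    partner = rng.permutation(batch_size)          # conventional index shuffle
    self_mix = rng.random(batch_size) < p_self     # which samples self-mix
    partner[self_mix] = np.arange(batch_size)[self_mix]
    return partner, self_mix

def toy_augment(batch, rng):
    """Hypothetical stand-in for the image augmentation module."""
    out = batch.copy()
    flip = rng.random(len(out)) < 0.5
    out[flip] = out[flip][:, :, ::-1, :]           # flip along the width axis
    return out + rng.normal(0.0, 0.05, size=out.shape)

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3))                 # (B, H, W, C)
view_a = toy_augment(batch, rng)                   # first augmentation set
view_b = toy_augment(batch, rng)                   # second, independent set
partner, self_mix = sample_partners(len(batch), p_self=0.1, rng=rng)
mix_source = view_b[partner]                       # images mixed into view_a
```

Because the partner is always drawn from the second view, a self-mixed sample still blends two different augmentations of the same image.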

Sampling Number of Mixing Masks, Mixing Methods, and Ratios
Previous studies such as RandAugment and AutoAugment have shown that ensemble usage and multi-layer stacking in image data augmentation are essential for improving a computer vision model and mitigating overfitting [4]. However, the utilization of ensembles and stacking in mixup-based methods has been underappreciated. Therefore, to enhance input data diversity with mixing, we introduce two strategies.

Algorithm 1 (excerpt)
6: Sample a mixing data point $(x_t, y_t)$, either from the entire pool of data samples or, with ratio $p_{self}$, by selecting the sample itself as the mixing data point.
7: Sample the number of mixing layers $k$ from 1 to $k_{max}$.
8: Sample $\lambda_1, \lambda_2, \dots, \lambda_k$ from a Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$, where the parameter vector $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k, \alpha_{k+1})$, such that $\alpha_1 = \dots = \alpha_k = \alpha$ and $\alpha_{k+1} = k \cdot \alpha$.
15: Append the mixed $\tilde{x}_i$ and $\tilde{y}_i$ to the output list.
16: end for
17: return mixed samples $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n$ and mixed labels $\tilde{y}_1, \tilde{y}_2, \dots, \tilde{y}_n$.

Firstly, we perform random sampling over different methods. For each generation of a mask, a method is sampled from a set of mixing methods $M$, with a corresponding set of sampling weights $W$. The set $M$ contains not only our proposed AGMix but also Mixup, CutMix, GridMix, and FMix. These mixup techniques blend two images with varying masks, and the main difference between them is how they generate these randomized mixing masks. As such, an MSDA method can be conceptualized as a standardized mask generator, denoted by $m$. This generator takes as input a designated mixing ratio λ and outputs a mixing mask. The mask has the same dimensions as the original image, with pixel values ranging from 0 to 1. The final image can then be obtained directly using the formula:
$$\tilde{x} = \mathrm{mask} \otimes x_1 + (1 - \mathrm{mask}) \otimes x_2$$
In this context, $\otimes$ denotes element-wise multiplication, mask is the generated mixing mask, and $x_1$ and $x_2$ are the two original images.
Secondly, we pioneer the integration of multi-layer stacking in mixup-based methods. Therefore, we need to sample another parameter to set the mixing ratio for each mask generation step. For this, mixup's methodology is:
$$\lambda \sim \mathrm{Beta}(\alpha, \alpha)$$
While the Beta distribution is designed for the bivariate case, the Dirichlet distribution is its multivariate generalization: a multivariate probability distribution parameterized by a vector $\boldsymbol{\alpha}$ of positive reals. Our sampling approach is:
$$\lambda_1, \lambda_2, \dots, \lambda_k \sim \mathrm{Dir}(\boldsymbol{\alpha}), \quad \text{for } k \text{ masks},$$
where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k, \alpha_{k+1})$, and
$$\alpha_1 = \dots = \alpha_k = \alpha, \qquad \alpha_{k+1} = k \cdot \alpha.$$
We maintain α as the sole sampling parameter for simplicity. Thanks to the multidimensional nature of the Dirichlet distribution, the sampled mixing ratios can be used for multiple mask generators. In other words, MiAMix uses the parameter $\lambda_i$ to determine the mixing ratio for each mask $\mathrm{mask}_i$. This parameter selection method plays a pivotal role in defining the multi-layered mixing process.
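As an illustration of the sampling rule above, the following sketch draws the $k$ mixing ratios with NumPy. The function name is hypothetical, and returning the last Dirichlet component separately is a choice of this sketch; the text only specifies how $\lambda_1, \dots, \lambda_k$ are used.

```python
import numpy as np

def sample_mixing_ratios(k, alpha, rng):
    """Draw lambda_1..lambda_k from Dir(alpha, ..., alpha, k * alpha)."""
    concentration = np.concatenate([np.full(k, alpha), [k * alpha]])
    draw = rng.dirichlet(concentration)   # k + 1 values that sum to 1
    return draw[:k], draw[k]              # ratios for the k masks, remainder

rng = np.random.default_rng(0)
lams, remainder = sample_mixing_ratios(k=3, alpha=1.0, rng=rng)
```

For $k = 1$ the concentration vector is $(\alpha, \alpha)$, so the single ratio follows $\mathrm{Beta}(\alpha, \alpha)$ and the scheme reduces to the usual mixup sampling.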

Mixing Mask Augmentation
After the masks are generated, we further apply augmentation procedures to them. The selected augmentations should not substantially change the mixing ratio of a mask, so we mainly focus on morphological mask augmentations. Three primary methods are utilized: shear, rotation, and smoothing. The smoothing applies an average filter with varying window sizes to subtly smooth the mixing edge. Note that these augmentations apply specifically to CutMix, FMix, and GridMix. In contrast, Mixup and AGMix neither require nor undergo these augmentations.
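The smoothing augmentation can be illustrated with a small NumPy box filter. This is a sketch under assumptions: the edge padding and the restriction to odd window sizes are choices made here, and shear and rotation are omitted.

```python
import numpy as np

def smooth_mask(mask, window):
    """Average (box) filter with an odd window size and edge padding."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    pad = window // 2
    padded = np.pad(mask.astype(np.float64), pad, mode="edge")
    h, w = mask.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the window by accumulating shifted views of the padded mask.
    for dy in range(window):
        for dx in range(window):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (window * window)

binary_mask = np.zeros((32, 32))
binary_mask[8:24, 8:24] = 1.0             # a CutMix-like square region
smoothed = smooth_mask(binary_mask, window=5)
```

For a region that stays away from the image border, the filter softens the edge while leaving the mean of the mask, and hence the mixing ratio, unchanged.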

Mixing Output
During the mask generation step, we may obtain multiple mixing masks. MiAMix uses these masks to merge the two images and to obtain the mixed weights for the labels. The masks are merged by point-wise multiplication:
$$\mathrm{mask}_{merged} = \prod_{i=1}^{n} \mathrm{mask}_i$$
where $n$ denotes the number of masks and the multiplication is conducted in a pointwise manner. Another approach we also tried is summing the weighted masks:
$$\mathrm{mask}_{merged} = \mathrm{clip}\Big(\sum_{i=1}^{n} \mathrm{mask}_i,\ 0,\ 1\Big)$$
where clip confines the mixing ratio at each pixel to the [0, 1] interval. Note that the cumulative mask weights could exceed 1 at specific pixels; consequently, when the masks are summed, we apply a clipping operation after the summation.
In the output stage, our approach differs from the conventional mixup method. We average the weights of the merged mask, $\mathrm{mask}_{merged}$, to determine the final $\lambda_{merged}$, which defines the weights of the labels:
$$\lambda_{merged} = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{k=1}^{W} \mathrm{mask}_{merged}(j, k)$$
In this equation, $H$ and $W$ denote the height and width of the mask, respectively, and $j$ and $k$ are the indices of the pixels within the mask. Therefore, $\lambda_{merged}$ represents the overall mixing intensity, obtained by averaging the mixing ratios over all the pixels in $\mathrm{mask}_{merged}$. The rationale is that, if multiple masks overlap significantly, the final mixing ratio will deviate from the initially set $\lambda_{sum} = \sum_i \lambda_i$, regardless of whether the masks are merged via multiplication or summation. We compare these two ways of merging the mixing masks and the two ways of acquiring the label weights λ in the experimental results section.
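The two merging options and the label weight $\lambda_{merged}$ can be sketched as below. The function names are hypothetical; the operations follow the description in this section (point-wise product, or sum followed by clipping, then a mean over all pixels).

```python
import numpy as np

def merge_masks(masks, how="mul"):
    """Merge a list of (H, W) masks by point-wise product or clipped sum."""
    stack = np.stack(masks)
    if how == "mul":
        return np.prod(stack, axis=0)
    if how == "sum":
        return np.clip(np.sum(stack, axis=0), 0.0, 1.0)
    raise ValueError("how must be 'mul' or 'sum'")

def merged_lambda(mask_merged):
    """Label weight: mean of the merged mask over all H x W pixels."""
    return float(mask_merged.mean())

# Two overlapping binary masks on a 4 x 4 grid.
left_half = np.zeros((4, 4))
left_half[:, :2] = 1.0
top_half = np.zeros((4, 4))
top_half[:2, :] = 1.0

lam_mul = merged_lambda(merge_masks([left_half, top_half], "mul"))  # 0.25
lam_sum = merged_lambda(merge_masks([left_half, top_half], "sum"))  # 0.75
```

Each input mask has a ratio of 0.5, yet the merged ratios are 0.25 and 0.75 because the masks overlap, which is why the label weight is read from the merged mask rather than from the originally sampled ratios.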

Results
To examine the benefits of MiAMix, we conduct experiments on fundamental image classification tasks. Specifically, we chose the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets for comparison with prior work. We replicate the corresponding methods on all of these datasets to demonstrate the relative improvement of our method over previous mixup-based methods.

Tiny-ImageNet, CIFAR-10 and CIFAR-100 Classification
For our image classification experiments, we utilize the Tiny-ImageNet [14] dataset, which consists of 200 classes with 500 training images and 50 testing images per class. Each image in this dataset has been downscaled to a resolution of 64 × 64 pixels. We also evaluate our methods (AGMix and MiAMix) against those mixing methods on the CIFAR-10 and CIFAR-100 datasets. The CIFAR-10 dataset consists of 60,000 images of 32 × 32 pixels distributed across 10 distinct classes. The CIFAR-100 dataset mirrors the structure of CIFAR-10 but encompasses 100 distinct classes, each holding 600 images. Both datasets include 50,000 training images and 10,000 testing images.
Training is performed using the ResNet-18 and ResNeXt-50 network architectures over the course of 400 epochs, with a batch size of 128. Our optimization strategy employs Stochastic Gradient Descent (SGD) with a momentum of 0.9 and weight decay set to $5 \times 10^{-4}$. The initial learning rate is set to 0.1 and decays according to a cosine annealing schedule.
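For reference, the learning-rate schedule can be written as a short function. This sketch assumes the standard cosine annealing form without warm restarts or warm-up, a per-epoch update, and a minimum learning rate of 0; none of these details are stated in the text.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=400, base_lr=0.1, min_lr=0.0):
    """Learning rate at a given epoch under plain cosine annealing."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_annealing_lr(e) for e in range(401)]
```

Under these assumptions the rate starts at 0.1, reaches 0.05 at epoch 200, and decays to 0 at epoch 400.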
In our investigation of various mixup methods, we select the set of methods $M$ = [Mixup, CutMix, FMix, GridMix, AGMix]. Each of these methods is given a sampling weight, represented as the vector $W = [2, 1, 1, 1, 1]$. The mixing parameter α is set to 1 throughout the experiments.
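The weighted selection over $M$ and $W$ can be sketched as follows. Normalising the weights into probabilities, sampling with replacement, and drawing the number of layers uniformly from 1 to $k_{max}$ are assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np

METHODS = ["Mixup", "CutMix", "FMix", "GridMix", "AGMix"]
WEIGHTS = np.array([2, 1, 1, 1, 1], dtype=np.float64)
PROBS = WEIGHTS / WEIGHTS.sum()           # Mixup: 1/3, every other method: 1/6

def sample_num_layers(k_max, rng):
    """ASSUMPTION: k is drawn uniformly from {1, ..., k_max}."""
    return int(rng.integers(1, k_max + 1))

def sample_methods(k, rng):
    """Draw k method names (with replacement) according to PROBS."""
    return [str(m) for m in rng.choice(METHODS, size=k, replace=True, p=PROBS)]

rng = np.random.default_rng(0)
k = sample_num_layers(k_max=2, rng=rng)
chosen = sample_methods(k, rng)
```

With these weights, Mixup is selected twice as often as each of the other four methods.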

Robustness
To assess robustness, we set up an evaluation on the CIFAR-100-C dataset, which is explicitly designed for corruption robustness testing and provides 19 distinct corruptions such as noise, blur, and digital corruption. The model architecture and parameter settings used for this evaluation are consistent with those applied to the original CIFAR-100 dataset in the experiments above. According to Table 3, our proposed MiAMix method demonstrated exemplary performance, achieving the highest accuracy. This provides compelling evidence that our multi-stage and diversified mixing approach contributes significantly to the improvement of model robustness.

Table 3:
Mixup [6]      81.55   58.10
CutMix [7]     78.52   49.32
AutoMix [11]   83.32   58.36
MiAMix         83.50   58.99

Ablation Study
The MiAMix method involves multiple stages of randomization and augmentation, which introduce many parameters into the process. It is essential to establish whether each stage is necessary and how much it contributes to the final result. Furthermore, understanding the influence of each major parameter on the outcome is also crucial. To further demonstrate the effectiveness of our method, we conducted several ablation experiments on the CIFAR-10, CIFAR-100-C, and Tiny-ImageNet datasets.

GMix, AGMix, and Mixing Mask Augmentation
A comparison of particular interest is between GMix and our augmented version, AGMix, in Table 2 and Table 3. The primary difference between these two methods lies in the additional randomization of the Gaussian kernel. The experiment results reveal that this simple augmentation strategy brings a significant improvement in the performance of the mixup method across all three datasets and the corrupted dataset, while maintaining almost the same training cost as GMix.
As the results in Table 4 illustrate, the introduction of various forms of augmentation progressively improves model performance. These results underscore the importance and effectiveness of augmenting mixing masks during training and validate the approach taken in the design of MiAMix.
The data presented in Table 5 demonstrate the substantial impact of multiple mixing layers on the model's performance. As the table shows, a discernible improvement in Top-1 accuracy is observed when more layers of masks are added, emphasizing the effectiveness of this approach in enhancing the diversity and complexity of the training data. Most notably, the model's performance improves further when the number of layers is not constant but sampled randomly from a set of values, as indicated by the bracketed entries in the table. This observation suggests that introducing variability in the number of mixing layers could be an effective approach for extracting more comprehensive and robust features from the data.

The Effectiveness of MSDA Ensemble
In this study, the ensemble's efficacy was tested by systematically removing individual mixup-based data augmentation methods from the ensemble and observing the impact on Top-1 accuracy. The results, shown in Table 6, clearly exhibit the contribution each method makes to the overall performance. Eliminating any single method from the ensemble led to a decrease in accuracy.
As shown in Table 7, the combination of multiplication for mask merging and the "out" method for λ merging (λ taken from the merged output mask) yields the highest accuracy for both Top-1 (67.95%) and Top-5 (87.26%). On the other hand, when using the sum operation for mask merging or reusing the original λ (the "orig" method), the performance degrades. This suggests that reusing the original λ might not provide a sufficiently adaptive mixing ratio for the model's learning process. Moreover, compared with the multiplication operation, the lower flexibility of the sum operation impedes performance. These results reaffirm the superiority of the (mul, out) method in our multi-stage data augmentation framework.
In our experiments, we also explore the concept of self-mixing, which refers to the particular case where an image does not undergo the usual mixup operation with another randomly paired image but instead blends with an augmented version of itself. This process is controlled by the self-mixing ratio, which denotes the percentage of images subject to self-mixing.

Effectiveness of Mixing with an Augmented Version of the Image Itself
Table 8 shows the impact of the self-mixing ratio on classification accuracy on both the CIFAR-100 and CIFAR-100-C datasets when employing the ResNeXt-50 model. The results illustrate a notable trend: a 10% self-mixing ratio leads to improvements in classification performance, especially on the CIFAR-100-C dataset, which consists of corrupted versions of the original images. The improvement on CIFAR-100-C indicates that self-mixing contributes significantly to the model's robustness against various corruptions and perturbations. By incorporating self-mixing, the model is exposed to a form of noise that mimics potential real-world scenarios more effectively and enhances the model's ability to generalize. The noise introduced via self-mixing can be viewed as another variant of data augmentation, further justifying the importance of diverse augmentation strategies in improving the performance and robustness of the model.

Conclusion
In conclusion, this paper has contributed to the development and understanding of Multi-stage Augmented Mixup (MiAMix). By revisiting the design of GMix, we have introduced an augmented form, AGMix, that leverages the Gaussian kernel's flexibility to produce a diversified range of mixing outputs. Additionally, we have devised a method for sampling the mixing ratio when dealing with multiple mixing masks. Most crucially, we have proposed a novel approach for MSDA that incorporates several stages, namely: random sample pairing, sampling of mixing methods and ratios, the generation and augmentation of mixing masks, and the output of mixed samples. By unifying these stages into a cohesive framework, MiAMix, we have constructed a search space with diverse hyper-parameters. This multi-stage approach offers a more diversified and dynamic way to apply data augmentation, potentially leading to improved model performance and better generalization on unseen data. Importantly, our methods do not incur excessive computational cost and can be seamlessly integrated into established training pipelines, making them practically viable. Furthermore, the versatile nature of MiAMix allows for future adaptations and improvements, promising an exciting path for the continued evolution of data augmentation techniques. Given these advantages, we are optimistic about the potential of MiAMix to influence the field of machine learning and to enable more robust and efficient model training processes.

Figure 1: Examples generated by GMix and AGMix. The first row shows the generated sample and the second row shows the corresponding mixing mask. We set λ = 0.7 for both methods.

Figure 2: An illustrative example of the MiAMix process, involving: 1) Random sample pairing; 2) Sampling the number, methods, and ratios of mixing masks; 3) Augmentation of mixing masks; 4) Generation of the final mixed output.

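Algorithm 1 (excerpt)
9: Sample $k$ mixing methods $m_1, m_2, \dots, m_k$ from $M$ with the weighted distribution over $W$.
10: Generate all $\mathrm{mask}_j$ from $m_j(\lambda_j)$.
12: Merge the $k$ masks into $\mathrm{mask}_{merged}$; get $\lambda_{merged}$ from $\mathrm{mask}_{merged}$.
13: Apply $\mathrm{mask}_{merged}$ to the sampled input pair: $\tilde{x}_i = \mathrm{mask}_{merged} \otimes x_i + (1 - \mathrm{mask}_{merged}) \otimes x_t$.
14: Apply $\lambda_{merged}$ to the sampled label pair: $\tilde{y}_i = \lambda_{merged}\, y_i + (1 - \lambda_{merged})\, y_t$.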

In Table 2, we compare the performance and training cost of several MSDA methods. The training cost is measured as the ratio of the training time of the method to the training time of vanilla training. The results show that our proposed method, MiAMix, achieves state-of-the-art performance among the low-cost MSDA methods. Its test results even surpass those of AutoMix, which embeds the mixing mask generation into the training pipeline to take greater advantage of injecting dynamic prior knowledge into the sample mixing. Notably, MiAMix incurs only an 11% increase in training cost over the vanilla model, making it a cost-effective solution for data augmentation. In contrast, AutoMix incurs approximately 70% more training cost.

Table 2: Comparison of various MSDA methods on CIFAR-10 and CIFAR-100 using ResNet-18 and ResNeXt-50 backbones, and on Tiny-ImageNet using a ResNet-18 backbone. Note that AutoMix needs additional computations for learning and processing extra prior knowledge. Training cost = (training time of the method) / (training time of vanilla training).

Table 4: Ablation study on mixing mask augmentation with ResNet-18 on Tiny-ImageNet. The percentage after "Smoothing" and "rotation and shear" refers to the ratio of masks to which the respective type of augmentation is applied during training.

Table 5: Ablation study on multiple mixing layers with ResNet-18 on Tiny-ImageNet. The brackets indicate that the number of mixing layers is randomly selected from the enclosed numbers with equal probability during each training step.

Table 6: Effectiveness experiment of the MSDA ensemble, tested on the CIFAR-10 dataset. Each weight corresponds to a different MSDA candidate, and a weight of zero signifies the removal of the corresponding method from the ensemble.

Table 7: Comparison between different ways of merging multiple mixing masks and merging mixing ratios on Tiny-ImageNet with a ResNet-18 model. "sum" and "mul" refer to merging masks through sum and multiplication, respectively. "merged" and "orig" denote the methods of acquiring λ: either averaging the final merged mask or reusing the original λ.

Table 8: Impact of the self-mixing ratio on CIFAR-100 and CIFAR-100-C with ResNeXt-50. "Self-mixing ratio" denotes the percentage of images that are not mixed with other randomly paired images but are instead mixed with an augmented version of themselves.